Wednesday, October 28, 2015

Unix/Linux exercises

Protocol for using MEGA to generate a 16s rDNA tree for giant panda evolution

We will first align the 16s rDNA sequences and then generate a neighor-joining tree.

The steps are illustrated below:

Start MEGA6 software

Click "File" --> "Open A File/Session", Choose "panda_16srDNA.fas". 

 Choose "Align"

Click "Alignment"-->"Align by ClustalW"

Click "OK" to select all sequences, 

Click "OK" to accept the default setting for alignment

Click "No", because 16s rDNA is an RNA coding sequence

Your sequences are now aligned. 

Click "Data" -->"Export Alignment"-->"MEGA Format"

Input an appropriate file name. 

Input an appropriate title: 

Choose "No" because 16s rDNA is not protein-coding gene.

Now, we can start to generate a phylogeny. 
At the main MEGA window, click "Phylogeny" -->"Construct/Test Neighbor-Joining Tree": 

Choose the appropriate alignment file, and then "Open"

Click "Computing" to use the default setting

Generated tree (only partial picture is shown here)

Save your image to use in your lab report.

Tuesday, October 27, 2015

bio386 20151027Tue

Research topic update.

Gordon login, and Unix shell practice.
1. There is a delay for Gordon to update user profiles. Some students cannot log into Gordon after I add them to Gordon.
2. My login attempt to "" led to "access denied".

Monday, October 26, 2015

use nutrient plates for yeast

Oct 26-Nov3. 2015
Use nutrient plates for yeast. The growth is extremely slow. Small colonies can be seen after 1 week.

Tuesday, October 20, 2015

bio233, review for exam 2

Course evaluation
learning outcomes
key concepts
mentioned that cAMP and glucose are in a seesaw.

bio386, 20151020Tue, go over exam 1, key

Screen cast of exam 1 key codes.
Use some of student video.

Monday, October 19, 2015

bio233, 20151019Mon

_ submission of oral presentation topic.

2:20pm, _ distribute midterm grades,

2:300-3pm -> count colonies
 _ tally, marker, counting colonies
_ input data on googleSheet

3:20pm_ practical test on aseptic streak for single colonies.

_ microscrope exam

Sunday, October 18, 2015

bio233 midterm grades

byte:midterm,Fall15 hqin$ pwd


Friday, October 16, 2015

reorganize mactower network aging directories

The directory is messy due to the increasing complexity of the project. I decided to make separate folders for the grid-network simulation part and original network part.

byte:mactower-network-failure-simulation hqin$ ll
drwxr-xr-x   7 hqin  staff   238B Oct 16 09:45

drwxr-xr-x   7 hqin  staff   238B Oct 15 11:28 1.ori.ginppit.2015Oct

I copy-pasted some R code into, but was not certain how to work on it. 

Wednesday, October 14, 2015


Purpose: generate grid network to my simulation code, to see whether simulation can reproduce analytic results. 

 file: 20151014_generate_grid_networks.R
helen:mactower-network-failure-simulation hqin$ pwd


helen:mactower-network-failure-simulation hqin$ ll -t | head
total 201880
-rw-r--r--   1 hqin  staff    35K Oct 14 22:05 _Degree4N1000_network.csv
-rw-r--r--   1 hqin  staff   9.8K Oct 14 22:05 _Degree4N100_EssenLookupTb.csv

-rw-r--r--   1 hqin  staff   1.3K Oct 14 22:05 20151014_generate_grid_networks.R

# generate gride network for simulation and quality control

debug = 0; 

# R -f file --args Degree numOfEssenNode outNetworkFile outEssenLookupTbFile 
# R -f 20151014_generate_grid_networks.R --args 4 1000 _Degree4N1000_network.csv _Degree4N100_EssenLookupTb.csv

options(echo=TRUE) # if you want see commands in output file
args <- commandArgs(trailingOnly = TRUE)
degree = as.numeric(args[1]); degree;
numOfEssenNodes = as.numeric(args[2]); numOfEssenNodes;
outNetworkFile = args[3]
outEssenLookupTbFile = args[4]

#for debug 
# degree=4; numOfEssenNodes=1000; outNetworkFile = "__test.csv"; outEssenLookupTb="__EssenLookupTb.csv"

pairs = data.frame(integer(numOfEssenNodes*degree), integer(numOfEssenNodes*degree))
names(pairs) = c("No1","No2")

for( i in 1:numOfEssenNodes) {
  current_i_row = (i-1)*degree + 1
  for( j in 1:degree){
     pairs$No1[current_i_row +j-1] = i;
     pairs$No2[current_i_row +j-1] = current_i_row +j-1 + numOfEssenNodes
# check
table( pairs$No1)[1:10]
table( pairs$No2)[1:10]

write.csv(pairs, outNetworkFile, row.names = F)

outEssenLookupTb = c(rep(1, numOfEssenNodes), rep(0, numOfEssenNodes*degree))
write.csv(outEssenLookupTb, outEssenLookupTbFile, row.names = F)


X install R
X install Rstudio
X sign up install Github
X install Github
X clone mactower simulation project

I mentioned bigdata,  high performace computing, precision medicine.

Her summer mentior is Dr. Shawnis Patel at Emory.

  EdX intro to R course
  Read Qin13 proposal
  Watch qin research talk videos. 

Tuesday, October 13, 2015

simulate network aging with two lambda, results need to be verified.

File: "20151013-net-sim-ginppi.R". 
$ pwd

#2015Oct13, use numeric lookup table for essential genes.
#2014March3, redo ginppi simulation witht same parameter for ms02networks. 
# 2014 Feb 17, change name "20131221.DIPandGIN.sim.aging_v2.R" to "net-aging-sim-2014Feb17.R"
# 2013 Dec 20, merge DIP PPI and Genetic Inxt Net -> Multi-net approach

single_network_failure_v2 = function(lambda1, lambda2=lambda1/10, threshold=4, p, pairs, essenLookupTb ) {
  # single network failure simulation, 20151013Tue
  # lambda1: First exponential constant failure rate for edges with degree > threshold
  # lambda2: Second exponential constant failure rate for edges with degree <= threshold
  # threshold: degree threshold for lambda1 and lambda2
  # pairs: network in pairwide format, using numeric NOs 20151013
  # essenLookupTb: lookup table for essential and nonessential genes, numeric values 
  ## for debug:   lambda1 = 1/50; lambda2= lambda1/10; threshold=4; p=0.8
  inpairs = pairs[,3:4] #bookkeeping  
  names(inpairs) = c('No1','No2')
  #get connectivities per node
  degreeTb = data.frame( table(c(inpairs$No1, inpairs$No2)))
  names(degreeTb) = c("No", "degree")
  degreeTb$moduleAge = NA;
  for( i in 1:length(degreeTb[,1])){
    if ( essenLookupTb[ degreeTb$No[i] ]) { #essential node
      lambda = ifelse( degreeTb$degree[i] >= threshold, lambda1, lambda2)
      age = rexp( degreeTb$degree[i], rate=lambda ) #exponential age
      if(degreeTb$degree[i] >= threshold){
        active = runif(degreeTb$degree[i])  #uniform interaction stochasticity
        active = ifelse( active<=p, 1, NA  ) #pick active interactions
        if( sum(active, na.rm=T) > 0 ){ #there should be at least 1 active intxn
          age = age * active # only active interactions for modular age estimation
          degreeTb$moduleAge[i] = max(age, na.rm=T) #maximum intxn age is the module age
        } else {# when no active intxn is available 
          degreeTb$moduleAge[i] = 0; #this module is born dead.
      } else { # for degree < threshold, no stochasticity is applied. 
        degreeTb$moduleAge[i] = max(age, na.rm=T) #maximum intxn age is the module age
    } else {# non-essential node
      degreeTb$moduleAge[i] = NA 
  currentNetworkAge = min(degreeTb$moduleAge, na.rm=T)


# R -f file --args lambda1 lambda2 degreeThreshold p popSize
# R -f 20151013-net-sim-ginppi.R --args 0.002 0.0002 4 0.9 5
options(echo=TRUE) # if you want see commands in output file
args <- commandArgs(trailingOnly = TRUE)
lambda1 = as.numeric(args[1]); lambda1;
lambda2 = as.numeric(args[2]); lambda2;
degreeThreshold = as.integer((args[3])); degreeThreshold
p = as.numeric(args[4]); p;
popSize = as.numeric(args[5]); popSize;

list.files(path='data', )
debug = 0; 

#essential gene info
essenTb = read.csv("data/SummaryRegressionHetHomFactorized2015Oct13.csv", colClasses=rep('character', 9))
essenLookupTb = read.csv("data/essntialGeneLookupTable_20151013.csv", row.names=1)
essenLookupTb = essenLookupTb[,1]

infile = "data/merged_PPIGIN_Factorized2015Oct13.csv"
pairs = read.csv(infile)
  names(pairs) = c("id1",'id2', "No1", "No2")
  if(debug==9) {     pairs = pairs[1:1000,]  }
  pairs = pairs[ pairs$No1 != pairs$No2, ]  

  # label essential nodes, remove nonesse-nonessen pairs
  pairs$essen1 = essenLookupTb[pairs$No1]
  pairs$essen2 = essenLookupTb[pairs$No2]
  #remove nonessen <-> nonessen intxn because they do not affect aging. 
  pairs$remove = ifelse( pairs$essen1==F & pairs$essen2==F, T, F  )
  pairs= pairs[! pairs$remove, ]  
  # 31394 for one ms02 network
  #get connectivities per node
  degreeTb = data.frame( table(c(pairs$No1, pairs$No2)))
  #median degree =5, mean=12
  #for one ms02, media =6, mean=13.68, so orginal network is power-law like, skew at two ends. 

  full_age_dir = "ori.ginppit.2015Oct"  
      popAges = numeric(popSize)
      time1 = date()
      j=1; count = 0; 
      while ((j <= popSize) && ( count < popSize*30)) {
        count = count + 1;      
        currentNetworkAge = single_network_failure_v2(lambda1, lambda2, degreeCutoff, p, pairs, essenLookupTb)
        if (currentNetworkAge > 0) {
          popAges[j] = currentNetworkAge      
          j = j+1
      }# end of j while-loop, population loop
      timestamp = format(Sys.time(), "%Y%b%d_%H%M%S")"threshold", degreeCutoff, "p", p, "lambda1", lambda1, 
                          "lambda2", lambda2,'popsize',popSize, "time",timestamp, "txt", sep="." )
      full_age_file = paste( full_age_dir,'/',, sep='')
      write.csv( popAges, full_age_file, row.names=F)

Monday, October 12, 2015

convert string to numeric for essential gene lookup, 20151012-factorize_networks.R

To speedup network aging simulation, use numeric values to replace string orf values. 

file: "20151012-factorize_networks.R"
pwd: /Users/hqin/github/mactower-network-failure-simulation

# 20151012 Plan
# input a gene network and essential gene list.
# remove self-interactions
# assign smaller number to essential nodes, and large numbers to non-essential nodes
# identify essential and non-essential nodes in gene networks
# remove non-essential gene -noessential gene interactions.


list.files(path='data', )
debug = 0;

#essential gene info
 #essenTb = read.csv('SummaryRegressionHetHom2013Oct29.csv', colClasses=rep('character', 9))
 #20151012, fixed a bug due to comma in orf.
essenTb = read.csv("SummaryRegressionHetHom2015Oct12.csv", colClasses=rep('character', 9))
length(unique(essenTb$orf)) #5772,

  infile = "data/";
  pairs = read.table( infile,  header=F, sep="\t", colClass = c("character", "character") )
  names(pairs) = c("id1",'id2')
  if(debug==9) {     pairs = pairs[1:1000,]  }
  pairs = pairs[ pairs$id1 != pairs$id2, ]

  # How do the two data set overlap? DIP seems to contain some questionable orfs
  uniq.orf.from.pairs = unique(c(pairs$id1, pairs$id2)) #5496 ORF
  matches = uniq.orf.from.pairs %in% unique(essenTb$orf)
  #unmatchedORF = uniq.orf.from.pairs[! matches]

  # Generate geneNOs (numeric IDs) using the essential, nonessential gene file.
  essentialORFs = essenTb$orf[essenTb$essenflag == 'essential']
  essentialORFs = essentialORFs[order(essentialORFs)]
  nonessentialORFs = essenTb$orf[essenTb$essenflag != 'essential']
  mergedORFs = c(essentialORFs, nonessentialORFs)
  mergedNOs = 1: length(mergedORFs)
  names(mergedNOs) = mergedORFs   # This is the numbers assigned to ORFs/Nodes !!!
  mergedNOs[c('YAL016W', "YPR116W")] #check, passed 20151012Mon
  essenTb$geneNO = mergedNOs[essenTb$orf]
  essenTb$essenflagNumeric = ifelse( essenTb$essenflag=='essential', 1, 0)
  rownames(essenTb) = essenTb$orf

  essenTb2 = essenTb[, c("geneNO", "essenflagNumeric")]
  essenLookupTb = essenTb2$essenflagNumeric[order(essenTb2$geneNO)]
  #manual check

  ### convert pairs ORF into geneNOs
  essenTb[pairs$id1[1:10], 'geneNO']
  pairs$No1 = essenTb$geneNO[match(pairs$id1, essenTb$orf)]
  pairs$No2 = essenTb$geneNO[match(pairs$id2, essenTb$orf)]
  #manually check a few
  pairs[100, ]
  # id1     id2 No1  No2
  # 100 YDL043C YGL049C 131 2623
  essenTb["YDL043C", ]
  essenTb["YGL049C", ]

  write.csv(pairs, "data/merged_PPIGIN_Factorized2015Oct13.csv")

fix a bug in orf parsing for

new output file is  "SummaryRegressionHetHom2015Oct12.csv".
The bug is caused by comma in 1 or 2 orfs, such as "DUR1,2".

# This bug is caused by write.csv(, quote=F, row.names=F)
# The problem is caused by "DUR1,2" an ORF that contains a comma.

notes and speeding up network aging simulation.

The old simulation method is slow due to single_network_failure() 

inpairs$age = rexp( length(inpairs[,1]), rate=lambda )  #exponential ages for pairs
inpairs$age = ifelse(inpairs$active > (1-p), inpairs$age, NA ) #if not active, intxn is excluded. 
  ModuleTb = data.frame(runningORFs) #buffer for module ages    
  #loop every essential genes to identify the module age
  for (i in 1:length(runningORFs)) {
    myORF = runningORFs[i]
    pos1 = grep(myORF, inpairs$id1)
    pos2 = grep(myORF, inpairs$id2)  #id1,2 to ORF1,2 is a really bad choice. 
    if( length( c(pos1,pos2))>=1 ) {
      ModuleTb$age.m[i] = max( inpairs$age[c(pos1,pos2)], na.rm=T )   #maximal intxn age -> module age
    } else {
      ModuleTb$age.m[i] = NA; 

Basically, there were too much ifelse, matching, grep(). The grep() probably is the most expensive step.

I propose a new algorithm to calculate by Degree for each essential gene, thereby eliminating the grep() process.

minority in STEM references

Ref1: National Science Foundation, National Center for Science and Engineering Statistics. 2013. Science and Engineering Doctorate Awards: 2009–10. Detailed Statistical Tables NSF 13-322. Arlington, VA. Available at

Ref 2National Science Foundation | National Center for Science and Engineering Statistics (NCSES) Science and Engineering Indicators 2014  Arlington, VA (NSB 14-01) | February 2014

Ref 3 National Science Foundation, National Center for Science and Engineering Statistics. 2015. Women, Minorities, and Persons with Disabilities in Science and Engineering: 2015. Special Report NSF 15-311. Arlington, VA. Available at

Ref 4. Fiegener, M.K; Proudfoot, S.L. Baccalaureate Origins of U.S.-trained S and E Doctorate Recipients.  National Center for Science and Engineering Statics (NCSES); NSF-13-123; National Science Foundation: Arlington, VA, 2013.

Sunday, October 11, 2015

statistical power, GWAS, quantitative genetics

Genet Epidemiol. 2015 Oct 10. doi: 10.1002/gepi.21935. Evaluating the Calibration and Power of Three Gene-Based Association Tests of Rare Variants for the X Chromosome. Ma C1, Boehnke M1, Lee S1; GoT2D Investigators.

Calibrating the Performance of SNP Arrays for WholeGenome Association Studies Ke Hao, Eric E. Schadt, John D. Storey, 2008, PLOS Genetics

Family-based association tests for sequence data, and comparisons with population-based association tests, Iuliana Ionita-Laza1, Seunggeun Lee, Vladimir Makarov, Joseph D. Buxbaum, and Xihong Lin

awk usage

Friday, October 9, 2015

*** h2, heritability

See also 

My R code on h2

genotype_elements = c(-1,0,1)
#genotype_elements = rnorm(N)
error = rnorm(N)
tb = data.frame(error)
tb$g1 = sample(genotype_elements, N, replace=T)
h2 = 0.5
tb$phynotype = sqrt(h2)*tb$g1 + sqrt(1-h2)*tb$error
summary(lm(tb$phynotype ~ tb$g1))

tb$g2 = sample(genotype_elements, N, replace=T)
h2 = 1/10
tb$phynotype = sqrt(h2)*tb$g1 + sqrt(h2)*tb$g2+ sqrt(1-2*h2)*tb$error
summary(lm(tb$phynotype ~ tb$g1 + tb$g2))

h2 = 1/10
tb$g3 = sample(genotype_elements, N, replace=T)
tb$phynotype = sqrt(h2)*tb$g1 + sqrt(h2)*tb$g2 + sqrt(h2)*tb$g3 + sqrt(1-2*h2)*tb$error
summary(lm(tb$phynotype ~ tb$g1 + tb$g2 + tb$g3))
summary(lm(tb$phynotype ~ tb$g1 ))
summary(lm(tb$phynotype ~ tb$g2 ))
summary(lm(tb$phynotype ~ tb$g3 ))

NSF proposal guideline

Use one of the following typefaces identified below:
  • Arial10, Courier New, or Palatino Linotype at a font size of 10 points or larger;
  • Times New Roman at a font size of 11 points or larger; or
  • Computer Modern family of fonts at a font size of 11 points or larger.

Thursday, October 8, 2015

Precision medicine initative cohort

bio233, 20151008Thu Ecoli genome investigation in R

2:30-3:20pm. short class. Using R code to analyze E coli K12 genome.
Ask students to work in pairs. Several students said, "We like this exercise."


R Code:

# Change working direcotry to the current folder

seqs = read.fasta(file="EcoliK12.fna")
# seqs = seqs[[1:3]]

# look at the first sequence
seq1 = seqs[[789]]
table( seq1 ); #nucleotide composition
GC(seq1);  # GC content
# now, try to find name and GC content for 865th gene.

# a loop for all sequences 
num = 1:length(seqs); #these are storages for later use
gc  = 1:length(seqs); # gc = num; 
out = data.frame( cbind( num, gc ) );

for( i in 1:length(seqs) ) {
  out$gene_name = getName( seqs[[i]] )
  out$gc[i] = GC( seqs[[i]] );

write.csv(out, "gc.csv", row.names=F) # output the results

#### now, calculate the length of each gene
num = 1:length(seqs);
out2 = data.frame( cbind( num) );
for( i in 1:length(seqs) ) {
  out2$name[i] = getName( seqs[[i]] )
  out2$len[i] = length( seqs[[i]] );  
  out2$gc[i] = GC( seqs[[i]] );

write.csv(out2, "gc2.csv", row.names=F) # output the results

prot = translate( seqs[[999 ]])

# now, find the protein length  for 1005th gene

#### now, translate the ORF into proteins
num = 1:length(seqs);
out3 = data.frame( cbind( num) );
for( i in 1:length(seqs) ) {
  out3$name[i] = getName( seqs[[i]] )
  out3$ORFLen[i] = length( seqs[[i]] );  
  prot = translate( seqs[[i ]])
  out3$ProtLen[i] = length( prot)  
write.csv( out3, "gc-prot.csv")

Tuesday, October 6, 2015

bio233, Tue, serial dilution data analysis for hemocytometer

Led 20 students download and install R and Rstudio.
Then run sample R code to analyze the class data on hemocytometer.

=> Problems:
-> Many students did not ZIP file and how to extract them. On windows computer, the R code can be openned directly using Rstudio, but now xlsx or csv files.
-> R package 'xlsx' lead to runtime error on Yosemite Rstudio.  I had to convert xlsx to csv files for these students to run the codes.  Lessons: Simple methods are always safer than fancier ways in classrooms. 

=> Solutions
-> I went step-by-step on screen, and make sure most students caught up.
-> I ask students who finished the steps to help students that were stuck.

JC HHMI project,

30 mintues to save documents and turn off applications on student's frezon laptop

15 minutes on Github account

todo Github clone
R code and PPT of past materials.

Monday, October 5, 2015

*** (in progress) bio233, serial dilution lab.

20150928Monday, streak DBY1394, box 1, I1 to YPD plate. 30C O/N. DBY1394 grew relatively slower than wild isolates.

20150930Wed, Pick colonies for growth in 5ml YPD, 30C O/N in 10ml falcon tube. (Cell precipated at the bottom).

  Made fresh YPD plates, 20% glucose.

20151001Thu. 5pm. Added 1ml of overnight grown culture into 10 ml of YPD in large glass tube. 30C shaker.  A total of 4 tubes were prepared.

Friday, expand to 4 more large glass tubes

20151005 Monday
1pm:  mix all tubes and do serial dilutions. I prepared 9 groups, but only 6 group actually formed.

2pm, bio233 lab. I did not do a demo for the class, and many mistakes happened.
1) many student put pipetman deep into glass culture tubes, and bunsen burners were not turned on.
2) many students did not know which pipetman to use for what volumes.
3) some students forgot to put cover skips on hemocytomers.
4) some students put glassbeads onto their gloved hands and then to plates.
5) many students did vortex their samples before put them on YPD plates.

3:45pm. Students put their number into master googleDoc sheet.

10 sets of pipettman. P1000, P200, P20. tips.  YPD plates.  Nutrient plates. Glass beads. Hemocytometer. Tally counters.


references for 2-credits bioinformatics course

Genomics Medicine Gets Personal, EdX

Introduction to Bioinformatics 4th Edition, Arthur Lesk

Bioinformatics for biologists,

Thursday, October 1, 2015

HU experiment changes, bio125

* HU+ cell left in water after day 1.  HU- cells left in SD media to avoid water induced G1 arrest.

* Add antibiotics to the media to avoid bacteria growth.

bio233, chapter 5, growth (part 1)

Class started 10 mintues later.

Students worked on two problems.

1) Growth curve problem.
For the slope and intercept, one student volunteered to solve the problem on the board.
For generation calculation, I asked two students to work on the board. Most students can reach
 N = N0 * 2^n.  However, most did not know how to solve for "n".

2) Hemocytometer problem.
I let students read the manual of Hausser bright line hemocytomer counter.

bio386 20151001Thu midterm exam released

Released midterm exam. Found the keys were somehow released as well, luckily I correct the error before most students downloaded the files.

I then let students worked on practical problems.