Monday, December 31, 2012

*** Git/Github repository for Bacillus spore coat project

$ cd /BacillusSporeCoat-master
$ git init
Initialized empty Git repository in /Users/hongqin/github/BacillusSporeCoat-master/.git/

$ git add _coat.profile.for.submission.Dec31.2012.xlsx
$ git commit -m "coat profile"
[master (root-commit) fe3211c] coat profile
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 _coat.profile.for.submission.Dec31.2012.xlsx

$ git remote add origin

$ git add
$ git commit -m "coat profile 2"
[master 5aac99d] coat profile 2
1 file changed, 4 insertions(+)
create mode 100644
ace:BacillusSporeCoat-master hongqin$ git remote add origin
fatal: remote origin already exists. 

$ git pull -m "test"
$ git remote add origin
fatal: remote origin already exists.
$ git push origin master
Counting objects: 7, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 21.62 KiB, done.
Total 6 (delta 1), reused 0 (delta 0)
5d270ff..1b82f44 master -> master

This works. 

Saturday, December 29, 2012

Install Git and GitHub for snow leopard

I have a snow leopard laptop. 

From, it automatically downloaded  "git-". Double-click this file generated a "Git Snow Leopard Intel Universal" disk, which contains four files: 1), 2) README, 3) 'setup git PATH for non-terminal, 4) git- After running the fourth file, a successful install message occurred. 

I then found out that the free GUI for Git, GitHub, has stopped support for snow leopard. This leaves me the only option to follow the command-line version on

$ git config --global "hongqin"
$ git config --global ""
$ curl -s -O \
$ chmod u+x git-credential-osxkeychain 
$ cd /opt/local/bin   #I saw osxkeychain already here
$ git config --global credential.helper osxkeychain

By now, I have finished all the steps on  

I then proceed to created a 'Public' repository, named "sandbox". I chose a README option.  After hitting the green "create repository" button, a new page appeared in about one second.

When I clicked "Clone in Mac", it took me back to GitHub Mac version for 10.7+. When I clicked "ZIP", it started download "" into local Download. This zip file contains the file and the sandbox-master directory structure.  I then noticed that I was in the 'branch:master' mode. 

To test how GitHub works locally, I created a hello world file in local sandbox-master. 

$cd sandbox-master
$git init
Initialized empty Git repository in /Users/hongqin/github/sandbox-master/.git/

$ git add helloworld.txt
$ git commit -m "first local commit"
[master (root-commit) 6654d46] first local commit
 1 file changed, 1 insertion(+)
 create mode 100644 helloword.txt

$ git remote add origin
$ git push origin master
Username for '': hongqin
Password for '':
 ! [rejected]        master -> master (non-fast-forward)
error: failed to push some refs to ''
hint: Updates were rejected because the tip of your current branch is behind
hint: its remote counterpart. Merge the remote changes (e.g. 'git pull')
hint: before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

This non-fast-forward error seems to be a common error:

$ git push --force origin master
Counting objects: 3, done.
Writing objects: 100% (3/3), 230 bytes, done.
Total 3 (delta 0), reused 0 (delta 0)
 + c8d6cad...6654d46 master -> master (forced update)

I then refreshed GitHub sandbox web page, and found "helloworld.txt" there. But "" disappeared.This may explain my 'non-fast-forward error'.

$ git add README.txt
$ git commit -m "second local commit"

[master bdcefa8] second local commit
 1 file changed, 2 insertions(+)
 create mode 100644 README.txt

$ git remote add origin
fatal: remote origin already exists.
$ git push origin master
Counting objects: 4, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 288 bytes, done.
Total 3 (delta 0), reused 0 (delta 0)
   6654d46..bdcefa8  master -> master
I  then refreshed my GitHub webpage and see both files. 

Oddly, there is still "" file, but I do not see this file in the downloaed ZIP archive.  I leave it for now. 

NPR food-for-thought blog on lactose tolerance.

NPR has a blog on the evolution of lactose tolerance: An Evolutionary Whodunit: How Did Humans Develop Lactose Tolerance? by Helen Thompson.   This blog discuss a review on lactase persistence in Europe based on archaeological and genetic evidences, written by Leonardi, Gerbault, Thomas, and Burger. This blog highlighted the speculative combination of famine and diarrhea to explain the quick adaptive sweep of lactase persistence in European populations. I was surprised that lactase persistence  was argued by Cordain to provide advantage against malaria in Africa and Southern Europe and rickets in Northern Europe. 

 I found many interesting ideas in the comments posted on this blog, often with entertaining exchanges. One reader linked another blog on evolution of lactose tolerance at, that also gave some interesting speculations. I have to say that some of the discussions on lactose intolerance in Chinese population contain inaccurate facts.

Saturday, December 22, 2012

Read pairwise network data into igraph, then do permutation.

#continental US demo, permutation of pariwse interaction network, igraph usage

URL = ""
states = read.csv(textConnection(getURL(URL)), colClass = c("character", "character"))

g =, directed=F) = degree(g) [ == max(] #TN and MO have 8 bordering states

g.shortestpath.m = shortest.paths(g)
sorted.names = sort( rownames(g.shortestpath.m) )
gsm = g.shortestpath.m[, sorted.names]
gsm = gsm[sorted.names, ]

URL2 = ""
state.year.tb = read.csv(textConnection(getURL(URL2)), colClass = c("character", NA))

permute.pairs.wo.selfpairs = function( inpairs,  ncycles=10, debug=1 ) {
  if (ncycles >= 1 ) {
    if(debug) {
      print(paste('ncycles=', ncycles))
    longids = c(as.character(inpairs[,1]), as.character(inpairs[,2]) )
    longids = sample(longids)
    len = length(inpairs[,1])
    newpairs = data.frame( cbind( longids[1:len], longids[(len+1): (2*len)]) )
    names(newpairs) = c('id1', 'id2')
    newpairs$id1 = as.character( newpairs$id1)
    newpairs$id2 = as.character( newpairs$id2)   
    newpairs$selfpairs = ifelse( newpairs$id1 == newpairs$id2, 1, 0 )
    self.tb = newpairs[ newpairs$selfpairs==1, ]
    nonself.tb = newpairs[newpairs$selfpairs==0, ]
    if(debug) {
    if( length(self.tb[,1])>0 ) {
      if ( ncycles == 0) { return (c(NA,NA, NA) )
      } else {
        ncycles = ncycles - 1
        selectedpairs = rbind(self.tb,  nonself.tb[1: (length(self.tb[,1])*2), ] )
        restpairs = nonself.tb[ (length(self.tb[,1])*2+1): length(nonself.tb[,1]), ]
        return( rbind(restpairs, permute.pairs.wo.selfpairs(selectedpairs, ncycles)))
    } else { 
      return (newpairs)
  } else {
    return( c(NA,NA,NA ))

x = permute.pairs.wo.selfpairs(states)
x2 = permute.pairs.wo.selfpairs(states)

Thursday, December 20, 2012

The meaning of dN/dS (Ka/Ks) ratio in molecular evolution

Omega, Ka/Ks or dN/dS, is the ratio of nonsynonymous substitution per site to synonymous substitution per site. Ks (dS) can be viewed as a proxy of background mutation. The ratio of Ka/Ks, omega, indicate the extent of changes at the amino acid levels after normalized by silent mutational changes at DNA level, hence, is a proxy for positive selection pressure in coding genes. This definition assumes selection only at the protein level, not at the DNA or RNA level. It is clear that omega is only applicable in protein-coding genes.

Omega is designed to study divergence because its definition assumes fixed changes. When Omega is used in the population context, standing genetic variation would be treated as fixed ones, which can inflate the estimation of omega.

Kryazhimskiy and Plotkin (2008) showed that in population genetics context, dN/dS ratio is not a reliable index for selection pressure. When population scaled mutation rate is small enough, the population omega depends on both selection coefficient (s) and mutation rate (\mu).  Based its figure 1, it can be seen that population omega is generally higher on divergent omega for negative selections.  KP08 used a two-allele model to find an analytic solution, and then used simulation to study the behavior in a Markov chain model with 64 states. KP08 is mostly well written. I found Equation 3 was presented without explanations.

Wednesday, December 19, 2012

Check and balance between adaptation and mutation in E. coli

Wielgoss et al, 2012 PNAS, Mutation rate dynamics in a bacterial population reflect tension between adaptation and genetic load. The authors sequenced one E. coli strain that has been evolved for 20 years in the Lenski lab. The author estimated that 20 years of E coli is about 50,000 generations. Twenty-two clones sampled at various time point during evolution were resequenced for this study. Previously, it was found that a frame-shift in mutT became 'established' in the population around 25K generations. After the fixation of this hypermutator mutT, mutation rate in two branches show reducted mutation rates (confirmed by fluctuation tests). Because the number of substitutions can be revealed by sequencing and the number of generations can be estimated, the mutations rates at different time points (hence, mutation dynamics) can be monitored (This is very nice thanks to the 20 years of in-vitro evolution experiment).

For reading, a few key concepts here are mutation rate, genetic load, hypermutators. Mutation rate are measured using synonymous substitutions. Genetic load is the normalized difference between maximal and mean fitness. The authors seem to use relative growth rate as compared to the ancestral clone to calculate the genetic load.

Specific terms include mutT, mutY, which are the specific genes under discussion.

I noticed the use of "established" instead of 'fixed' in the population, which indicates that mutT did not reach 100% of the population.

A question: How does the author establish the causal interaction between mutT and mutY? It seems that they begin with a candidate gene approach, and used association arguments. I was hoping for a regression based causal inference like those in quantitative genetics, but this kind of analysis may not be available here.

Tuesday, December 18, 2012

Dietary restriction in chronologicaled aged cells leads to extened replicative life span, Delaney et al, 2012 Exp Gen

Delaney et al, 2012 Experimental Gerontology

A paper from the Kaeberlein group ( showed the dietary restriction during chronologically aging can lead to extended replicative life span. This is a technically tricky experiment, because when chronologically aged cells start to lose viability, much more cells will be required to conduct the subsequent replicative life span assays. They author stated that they pick as much as 1000 cells to start the replicative lifespan assays.

One of my questions is a technique one: How did they carry out DR during chronological aging? It seems that starting cultures are SC with various glucose concentration. Five microliters of chronologically aged cultures were spread to YPD (2% glucose) plate for RLS assay, which is a standard practice. DR was achieved in 0.05% glucose in this study.

The authors double-stained yeast cells with 5nM SYTOX red and 17.5nM DiOC6, and monitor them by flow cytometry. DiOC6 is a mitochrondrial membrane potential dependent dye, and SYTOX Red is a used distinguish live and dead cells. SYTOX Red-negative cells are considered live cells.

This paper argues that both DR and buffering during chronological aging can delay the subsequent replicating aging process by maintiaing a low mitochondria membrane potential. Overall, the paper suggests that mitochondria is a link between post-mitotic and mitotic aging.

One question that I have is the role of mitochondria during chronological aging. The author conducted chronological aging in rich media. Although many cells are probably metabolically active during in the depleted media, many are also metabolically inactive. Is it possible that DR have differential effect on different cell sub-populations in different metabolic states? 

For comparison, I often measure CLS in water, and many cells are metabolically much less active, if not domant in G0. DR cannot be studied in water-based chronological aging.

Monday, December 17, 2012

Generate pruned phylogeny using Biopython

Biopython Phylo

Its cookbook ( offers some useful examples. Biopython is still tied to python 2.7.

The following code will read a Newick tree, generate a lookup dictionry for leaf-nodes, and then remove individual leaf node to generate a pruned tree. 

from Bio import Phylo
#from Bio.Phylo import PhyloXML, NewickIO

def lookup_by_names(tree):
    names = {}
    for clade in tree.get_terminals():
            if in names:
                raise ValueError("Duplicate key: %s" %
            names[] = clade
    return names

EX_NEWICK = 'spec.nwk'
treeA =, 'newick')
names = lookup_by_names(treeA)
treeA.prune(names['Bcl'])  #prune() takes an object from the same tree
treeA.prune(names['Bha'])  #prune() takes an object from the same tree
treeA.prune(names['Bsu'])  #prune() takes an object from the same tree
treeA.prune(names['Bpu'])  #prune() takes an object from the same tree

Saturday, December 15, 2012

DendroPy tutorial

Followed DendroPy's tutorial and I was able to prune a phylogenetic tree in 50 minutes. DendroPy's example scripts can be found both on its website   and in the examples/ folder in the downloaded source code package.
The interactive mode of Python is also very useful in this learning experience. It should be noted that DendroPy does not support Python3.0, which is a problem shared by most other bioinformatics related python projects, such as biopython. 

import dendropy

tree_str = "[&R] ((A, (B, (C, (D, E)))),(F, (G, H)));"

tree = dendropy.Tree.get_from_string(


tree.prune_taxa_with_labels(["A", "C", "G"])

Later, I found that DendroPy cannot parse nodes with multiple labels, such as those output by codeml. 

Monday, November 12, 2012

# Run freebayes on bam files

include ${ROOT_DIR}/src/
include ${ROOT_DIR}/src/hong_analysis/



HONG_VCF_FILES=$(addprefix ${HONG_VCF_DIR}/, $(patsubst %.bam,%.vcf,$(notdir ${HONG_SORTEDBAM_FILES})))

HONG_VCF_GZ_FILES=$(addsuffix .gz, ${HONG_VCF_FILES})

hongvcall: ${HONG_VCF_INDEX_FILES}

        mkdir ${HONG_VCF_DIR}

        ${FREEBAYES} --fasta ${REFERENCE_INDEX} --pooled $< > $@

        ${BGZIP} $<

        ${TABIX} $<
        @echo ${HONG_VCF_INDEX_FILES}

Friday, November 9, 2012

Parse genetic variants from NGS data, sequenced qin-lab natural isolates

These shell script were written by Lance at the Princeton genome center.

batch code are in src/
mkdir -p /Volumes/BotlabSeq/hong_analysis/mapped_bam 
bwa aln -t 2 /Volumes/BotlabSeq/reference_genomes/saccharomyces_cerevisiae_s288c_saccer3/saccharomyces_cerevisiae_s288c_saccer3.fasta /Volumes/BotlabSeq/fastq/3-001_M1-2_lib1_read-1.fastq.gz > /Volumes/BotlabSeq/hong_analysis/mapped_bam/3-001_M1-2_lib1_read-1_aln_sa.sai
... ... 


DIRECTORY=$(cd `dirname $0` && pwd)

ROOT_DIR=$(cd "${DIRECTORY}/../../../" && pwd)

function process_batch {
    for strain in "${strain_batch[@]}"; do
        HONG_FASTQ_FILES+=`find "${FASTQ_DIR}" -name "$strain*.fastq.gz" | xargs echo`
        HONG_FASTQ_FILES+=" "
        HONG_ORIGINAL_BAM_FILES+=`find "${ORIGINAL_BAM_DIR}" -name "$strain*.bam" | xargs echo`
    echo "make -n -f ${DIRECTORY}/ -j 5"
    echo ""

while IFS=$'\t' read -r -a fields
    if [[ $number == 3-* ]]; then
        echo "$number $individual_name $genus_species - SELECTED"

for strain in "${strain_list[@]}"; do
    if [[ $i -eq 10 ]]; then
        echo ${strain_batch[@]}
echo ${strain_batch[@]}


#make -f ${DIRECTORY}/ -j 4

The mapped variant are in "vcf" files.

  File formats
  VCF perl tools

Monday, November 5, 2012

Wednesday, October 31, 2012

bio386, paper presentations

2012 Oct 31, Wed
Costanzo 2012 Science, genetic network -> Jasmine, 2012 Oct 31 PDF document
zotenko 2008 plos, essential hub genes -> Jessica, 2012 Oct 31 presentation

Monday, October 29, 2012

bio386, Dhami11, Gasch2000

2012 Oct 29. Monday
Dhami 11 paper, Ella and Teneisha, 2012 Oct 29.

Gasch 2000 paper, Khayla and Asha, 2012 Oct 29
 Clustering exercise
 CV calculation

Saturday, September 1, 2012

bio386 fall 2012 log

2012 Oct 29. Monday
Dhami 11 paper, Ella and Teneisha, 2012 Oct 29.

Gasch 2000 paper, Khayla and Asha, 2012 Oct 29
 Clustering exercise
 CV calculation

2012 Oct 24 Wed.
inclass self-paced moodle quiz on qin's research proposal.
2012 Oct 22 Monday
 review exam
Read DNA microarray wikipedia entry:

read Qin's proposal
read Gasch paper, presentation
what is robustness? how to measure them?

2012 Oct 17 Wed, midterm exam, collaborative part
em: permutaiton on gIN

3 days, intro to cellular aging and research
2012 Oct 10.
 yeast aging presentation to m0 and G formula.

 take-home exam, due Oct 17 Wed before class. 1) permutation on genetic network 2) compare yeast gNet and PIN: are there correlation between genetic interactions and protein interactions?

 review genetic network assignment
 student presentations (Dhami paper, Newman paper)

2012 Oct 8.
 intro to yeast aging.
    (aging, gompertz model, dS/dt = slope, non-aging)
    RLS, CLS
 figure 3B.

2012 Oct 3, Wed
  match slides

  Figure 2
  partial regression demo, age, shoesize, readingability

  homework on figure 3B.

quiz on Wed to see whether Fraser paper data can be loaded to R and first few lines can be ran.
set up dropbox accounts:

2012 Sep 26 Wed
summary key points
  Ka, Ks, poistive, negative, neutral selection
  pairwise interaction -> network
  yeast ORF names
  competive growth fitness measures (using bar codes)
  use table to counter protein interactions


For review and quiz, load genetic interaction data into R.
convert ORF to letters.

2012 Sep 24, Mon
 discussion of Fraser Science paper

Some key ponts:

Protein interaction network
For Fig 1:
connectivity = number of interactions per gene
evolution rate =K
linear regression
For Fig 2: causal relationships (daycare example), multiple regression
For Fig 3: null distribution and p-value; permutation

Key messages:
 science paper is not a big deal
 how to read a scientific paper
 how to interpret figures
 p-value and null distribution

2012 Sep 19, Wed
 ***collaborative exam, student demo and exercise on screen. This method works well!
2012 Sep 17, Mon
 exercises and reviews
 go over exam1
2012 Sep 12 Wed
  function on make solution
  give home work
  input out,
 summary, leave homework, quiz on Monday.

=>2012 Sep 10, Mon
  quiz on daycare score again
  basic programming concepts: loops, conditions, functions.
=>2012 Sep 5, Wed
  Quiz on salary.R
  Homework on fraser paper
  review new learnings. regression, t.test(), p-value.
  homework is requested.
=>2012 August 29, Wed
  quiz: a list of values (0.1, 0.01, 0.001), take log
  irb signature: ella
  salary example. socratice method, working on it, the ask questions.
  give review guide.

skills and functions to cover:
str, read.table(), pick columns, pick rows, pick rows and columns,
table() #how many female ‘Arts’ faculty
#how many female ‘Assistant’, ‘Associate’, and ‘Full’ professors

=>2012 August 27, Mon
  welcome new comers
  simple.R takes 2 hours

=> 2012 Aug 22, Wed
x irb signature
x go over syllabus, writing milestones, previous grades,
x pre-survey
* foucs on why computing for biology majors (four students just come to class for no reasons without computing backgrounds)
 googleDoc to work on list of reasons for biology majors to learn programming and computing concepts and skills
using cards to explain computational thinking and search space

Reasons why biology major should learn programming and computing
    Learning programming skills and computational thinking offers a key set of problem solving skills.
    genomics medicine
    highlights on resumes when applying for medical school or graduate schools
    more competitive for jobs
    Another useful language
    Computing holds the key to understand complicated biology phenomena.
    Further understanding of genomic dynamics of emerging pathogens and diseases
    To use computer storage and other programs to study the biological data and maps of the human genome sequence
    Updated on new technology
    Understanding programming and computing can aid in drug discovery
    Aids in conducting research projects

How to find out a missing card from a stack? (A key concept is search space)
Plan A: count card one by one. Search space is 52 possibilities.
Plan B: separate into sets and colors
Question: What are the key differences between these two search plans?
Answers: Random data versus organized (structured) data

* I should ask students to introduce themselves.

Tuesday, August 21, 2012

20120821 DHE DHR BY4743 H2O2

During Clibur measure, we measured blanks, DHE, DHR, and DHEDHR, and "on" for overnight DHR signals. The overnight DHR sample were basically left in the dark at room temperature (as I recall).

Wednesday, February 22, 2012

Tuesday, January 10, 2012

bio386, spring 2012 log

-> exam question: yeast gene in FASTA format, calculate its length, find out
 mean ORF length.

partion H2O2-LOH data to bio320 classes for assignment

Exam 2-> deg factor on salary

-> bayesian network modeling?
-> bio320 projects: story paper, morphology, expression; wynter gwa study; realibility simulation and threshold trait;

=> 2012 Feb 21
 Dropbox account creation for projects
 Levy's presentation and explanation of data;
 Go over research projects;
 Pubmed and Google scholar search to find the most recent protein interaction network data in S. cerevisiae.

=>2012 Jan 26, The exercises take the entire class,

=>2012 Jan 23Tue discuss my MRI proposal

=>2012 Jan 19Thu
 proofread an NSF proposal. 5 students, working 3 groups. 5 pages per group.

=> Jan 17, 2012 Tues
work through salary example. R2 and p-value take a long time.
One students ask for open-ended qustions. I asked them study
'deg' effect on 'salary', 'gender', and 'rank'.

=> Jan 12, 2012 Thursday
hp laptops usages
Moodle site
go over syllabus, writing milestones
previous student projects, and posters
install R, Rstudio, notepad++,
worked on simple.R
assignment: watch youtube videos, 1st ppt, salary.R