Monday, December 31, 2012

*** Git/Github repository for Bacillus spore coat project

$ cd /BacillusSporeCoat-master
$ git init
Initialized empty Git repository in /Users/hongqin/github/BacillusSporeCoat-master/.git/


$ git add _coat.profile.for.submission.Dec31.2012.xlsx
$ git commit -m "coat profile"
[master (root-commit) fe3211c] coat profile
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 _coat.profile.for.submission.Dec31.2012.xlsx



$ git remote add origin https://github.com/hongqin/BacillusSporeCoat.git

$ git add README.md
$ git commit -m "coat profile 2"
[master 5aac99d] coat profile 2
1 file changed, 4 insertions(+)
create mode 100644 README.md
ace:BacillusSporeCoat-master hongqin$ git remote add origin https://github.com/hongqin/BacillusSporeCoat.git
fatal: remote origin already exists. 

$ git pull https://github.com/hongqin/BacillusSporeCoat.git -m "test"
$ git remote add origin https://github.com/hongqin/BacillusSporeCoat.git
fatal: remote origin already exists.
$ git push origin master
Counting objects: 7, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 21.62 KiB, done.
Total 6 (delta 1), reused 0 (delta 0)
To https://github.com/hongqin/BacillusSporeCoat.git
5d270ff..1b82f44 master -> master
 

This works. 
 

Saturday, December 29, 2012

Install Git and GitHub for snow leopard

I have a snow leopard laptop. 

From http://git-scm.com/download/mac, it automatically downloaded  "git-1.8.0.2-3-intel-universal-snow-leopard.dmg". Double-click this file generated a "Git 1.8.0.2 Snow Leopard Intel Universal" disk, which contains four files: 1) uninstall.sh, 2) README, 3) 'setup git PATH for non-terminal programs.sh, 4) git-1.8.0.2-intel-universal-snow-leopard.pkg. After running the fourth file, a successful install message occurred. 

I then found out that the free GUI for Git, GitHub, has stopped support for snow leopard. This leaves me the only option to follow the command-line version on https://help.github.com/articles/set-up-git.

$ git config --global user.name "hongqin"
$ git config --global user.email "myworkemail@gmail.com"
$ curl -s -O \
  http://github-media-downloads.s3.amazonaws.com/osx/git-credential-osxkeychain
$ chmod u+x git-credential-osxkeychain 
$ cd /opt/local/bin   #I saw osxkeychain already here
$ git config --global credential.helper osxkeychain

By now, I have finished all the steps on https://help.github.com/articles/set-up-git.  

I then proceed to created a 'Public' repository, named "sandbox". I chose a README option.  After hitting the green "create repository" button, a new page appeared in about one second.

When I clicked "Clone in Mac", it took me back to GitHub Mac version for 10.7+. When I clicked "ZIP", it started download "sandbox-master.zip" into local Download. This zip file contains the README.md file and the sandbox-master directory structure.  I then noticed that I was in the 'branch:master' mode. 

To test how GitHub works locally, I created a hello world file in local sandbox-master. 



$cd sandbox-master
$git init
Initialized empty Git repository in /Users/hongqin/github/sandbox-master/.git/

$ git add helloworld.txt
$ git commit -m "first local commit"
[master (root-commit) 6654d46] first local commit
 1 file changed, 1 insertion(+)
 create mode 100644 helloword.txt


$ git remote add origin https://github.com/hongqin/sandbox.git
$ git push origin master
Username for 'https://github.com': hongqin
Password for 'https://hongqin@github.com':
To https://github.com/hongqin/sandbox.git
 ! [rejected]        master -> master (non-fast-forward)
error: failed to push some refs to 'https://github.com/hongqin/sandbox.git'
hint: Updates were rejected because the tip of your current branch is behind
hint: its remote counterpart. Merge the remote changes (e.g. 'git pull')
hint: before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.


This non-fast-forward error seems to be a common error:
https://help.github.com/articles/dealing-with-non-fast-forward-errors

$ git push --force origin master
Counting objects: 3, done.
Writing objects: 100% (3/3), 230 bytes, done.
Total 3 (delta 0), reused 0 (delta 0)
To https://github.com/hongqin/sandbox.git
 + c8d6cad...6654d46 master -> master (forced update)


I then refreshed GitHub sandbox web page, and found "helloworld.txt" there. But "README.md" disappeared.This may explain my 'non-fast-forward error'.

$ git add README.txt
$ git commit -m "second local commit"

[master bdcefa8] second local commit
 1 file changed, 2 insertions(+)
 create mode 100644 README.txt
 

$ git remote add origin https://github.com/hongqin/sandbox.git
fatal: remote origin already exists.
 
$ git push origin master
Counting objects: 4, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 288 bytes, done.
Total 3 (delta 0), reused 0 (delta 0)
To https://github.com/hongqin/sandbox.git
   6654d46..bdcefa8  master -> master
 
I  then refreshed my GitHub webpage and see both files. 

Oddly, there is still "README.md" file, but I do not see this file in the downloaed ZIP archive.  I leave it for now. 
 

NPR food-for-thought blog on lactose tolerance.

NPR has a blog on the evolution of lactose tolerance: An Evolutionary Whodunit: How Did Humans Develop Lactose Tolerance? by Helen Thompson.   This blog discuss a review on lactase persistence in Europe based on archaeological and genetic evidences, written by Leonardi, Gerbault, Thomas, and Burger. This blog highlighted the speculative combination of famine and diarrhea to explain the quick adaptive sweep of lactase persistence in European populations. I was surprised that lactase persistence  was argued by Cordain to provide advantage against malaria in Africa and Southern Europe and rickets in Northern Europe. 

 I found many interesting ideas in the comments posted on this blog, often with entertaining exchanges. One reader linked another blog on evolution of lactose tolerance at darwinstudents.blogspot.com, that also gave some interesting speculations. I have to say that some of the discussions on lactose intolerance in Chinese population contain inaccurate facts.




Saturday, December 22, 2012

Read pairwise network data into igraph, then do permutation.

#continental US demo, permutation of pariwse interaction network, igraph usage

require(RCurl)
URL = "https://docs.google.com/spreadsheet/pub?key=0ArLJZixvTlU7dDI5c3dnTzdRX1dndzBORUk1UC0wYlE&single=true&gid=0&output=csv"
states = read.csv(textConnection(getURL(URL)), colClass = c("character", "character"))

g = graph.data.frame(states, directed=F)
g.degree = degree(g)
g.degree [g.degree == max(g.degree)] #TN and MO have 8 bordering states

g.shortestpath.m = shortest.paths(g)
str(g.shortestpath.m)
sorted.names = sort( rownames(g.shortestpath.m) )
gsm = g.shortestpath.m[, sorted.names]
gsm = gsm[sorted.names, ]

URL2 = "https://docs.google.com/spreadsheet/pub?key=0ArLJZixvTlU7dC1rTlRhTkw5UlY0eWZGT1JOVUVCWGc&single=true&gid=0&output=csv"
state.year.tb = read.csv(textConnection(getURL(URL2)), colClass = c("character", NA))


permute.pairs.wo.selfpairs = function( inpairs,  ncycles=10, debug=1 ) {
  if (ncycles >= 1 ) {
    if(debug) {
      print(paste('ncycles=', ncycles))
    }
    longids = c(as.character(inpairs[,1]), as.character(inpairs[,2]) )
    longids = sample(longids)
    len = length(inpairs[,1])
    newpairs = data.frame( cbind( longids[1:len], longids[(len+1): (2*len)]) )
    names(newpairs) = c('id1', 'id2')
    newpairs$id1 = as.character( newpairs$id1)
    newpairs$id2 = as.character( newpairs$id2)   
    newpairs$selfpairs = ifelse( newpairs$id1 == newpairs$id2, 1, 0 )
    self.tb = newpairs[ newpairs$selfpairs==1, ]
    nonself.tb = newpairs[newpairs$selfpairs==0, ]
    if(debug) {
      print(self.tb)
    }
    if( length(self.tb[,1])>0 ) {
      if ( ncycles == 0) { return (c(NA,NA, NA) )
      } else {
        ncycles = ncycles - 1
        selectedpairs = rbind(self.tb,  nonself.tb[1: (length(self.tb[,1])*2), ] )
        restpairs = nonself.tb[ (length(self.tb[,1])*2+1): length(nonself.tb[,1]), ]
        return( rbind(restpairs, permute.pairs.wo.selfpairs(selectedpairs, ncycles)))
      }
    } else { 
      return (newpairs)
    }
  } else {
    return( c(NA,NA,NA ))
  }
}

#test
x = permute.pairs.wo.selfpairs(states)
x2 = permute.pairs.wo.selfpairs(states)

Thursday, December 20, 2012

The meaning of dN/dS (Ka/Ks) ratio in molecular evolution

Omega, Ka/Ks or dN/dS, is the ratio of nonsynonymous substitution per site to synonymous substitution per site. Ks (dS) can be viewed as a proxy of background mutation. The ratio of Ka/Ks, omega, indicate the extent of changes at the amino acid levels after normalized by silent mutational changes at DNA level, hence, is a proxy for positive selection pressure in coding genes. This definition assumes selection only at the protein level, not at the DNA or RNA level. It is clear that omega is only applicable in protein-coding genes.

Omega is designed to study divergence because its definition assumes fixed changes. When Omega is used in the population context, standing genetic variation would be treated as fixed ones, which can inflate the estimation of omega.

Kryazhimskiy and Plotkin (2008) showed that in population genetics context, dN/dS ratio is not a reliable index for selection pressure. When population scaled mutation rate is small enough, the population omega depends on both selection coefficient (s) and mutation rate (\mu).  Based its figure 1, it can be seen that population omega is generally higher on divergent omega for negative selections.  KP08 used a two-allele model to find an analytic solution, and then used simulation to study the behavior in a Markov chain model with 64 states. KP08 is mostly well written. I found Equation 3 was presented without explanations.














Wednesday, December 19, 2012

Check and balance between adaptation and mutation in E. coli

Wielgoss et al, 2012 PNAS, Mutation rate dynamics in a bacterial population reflect tension between adaptation and genetic load. The authors sequenced one E. coli strain that has been evolved for 20 years in the Lenski lab. The author estimated that 20 years of E coli is about 50,000 generations. Twenty-two clones sampled at various time point during evolution were resequenced for this study. Previously, it was found that a frame-shift in mutT became 'established' in the population around 25K generations. After the fixation of this hypermutator mutT, mutation rate in two branches show reducted mutation rates (confirmed by fluctuation tests). Because the number of substitutions can be revealed by sequencing and the number of generations can be estimated, the mutations rates at different time points (hence, mutation dynamics) can be monitored (This is very nice thanks to the 20 years of in-vitro evolution experiment).

For reading, a few key concepts here are mutation rate, genetic load, hypermutators. Mutation rate are measured using synonymous substitutions. Genetic load is the normalized difference between maximal and mean fitness. The authors seem to use relative growth rate as compared to the ancestral clone to calculate the genetic load.

Specific terms include mutT, mutY, which are the specific genes under discussion.

I noticed the use of "established" instead of 'fixed' in the population, which indicates that mutT did not reach 100% of the population.

A question: How does the author establish the causal interaction between mutT and mutY? It seems that they begin with a candidate gene approach, and used association arguments. I was hoping for a regression based causal inference like those in quantitative genetics, but this kind of analysis may not be available here.

Tuesday, December 18, 2012

Dietary restriction in chronologicaled aged cells leads to extened replicative life span, Delaney et al, 2012 Exp Gen

Delaney et al, 2012 Experimental Gerontology

A paper from the Kaeberlein group (  http://www.ncbi.nlm.nih.gov/pubmed/23235143) showed the dietary restriction during chronologically aging can lead to extended replicative life span. This is a technically tricky experiment, because when chronologically aged cells start to lose viability, much more cells will be required to conduct the subsequent replicative life span assays. They author stated that they pick as much as 1000 cells to start the replicative lifespan assays.

One of my questions is a technique one: How did they carry out DR during chronological aging? It seems that starting cultures are SC with various glucose concentration. Five microliters of chronologically aged cultures were spread to YPD (2% glucose) plate for RLS assay, which is a standard practice. DR was achieved in 0.05% glucose in this study.

The authors double-stained yeast cells with 5nM SYTOX red and 17.5nM DiOC6, and monitor them by flow cytometry. DiOC6 is a mitochrondrial membrane potential dependent dye, and SYTOX Red is a used distinguish live and dead cells. SYTOX Red-negative cells are considered live cells.

This paper argues that both DR and buffering during chronological aging can delay the subsequent replicating aging process by maintiaing a low mitochondria membrane potential. Overall, the paper suggests that mitochondria is a link between post-mitotic and mitotic aging.

One question that I have is the role of mitochondria during chronological aging. The author conducted chronological aging in rich media. Although many cells are probably metabolically active during in the depleted media, many are also metabolically inactive. Is it possible that DR have differential effect on different cell sub-populations in different metabolic states? 

For comparison, I often measure CLS in water, and many cells are metabolically much less active, if not domant in G0. DR cannot be studied in water-based chronological aging.




Monday, December 17, 2012

Generate pruned phylogeny using Biopython

Biopython Phylo

Its cookbook (http://biopython.org/wiki/Phylo_cookbook) offers some useful examples. Biopython is still tied to python 2.7.

The following code will read a Newick tree, generate a lookup dictionry for leaf-nodes, and then remove individual leaf node to generate a pruned tree. 

from Bio import Phylo
#from Bio.Phylo import PhyloXML, NewickIO

def lookup_by_names(tree):
    names = {}
    for clade in tree.get_terminals():
        if clade.name:
            if clade.name in names:
                raise ValueError("Duplicate key: %s" % clade.name)
            names[clade.name] = clade
    return names


EX_NEWICK = 'spec.nwk'
treeA = Phylo.read(EX_NEWICK, 'newick')
print(treeA)
names = lookup_by_names(treeA)
treeA.prune(names['Bcl'])  #prune() takes an object from the same tree
treeA.count_terminals()
treeA.prune(names['Bha'])  #prune() takes an object from the same tree
treeA.count_terminals()
treeA.prune(names['Bsu'])  #prune() takes an object from the same tree
treeA.count_terminals()
treeA.prune(names['Bpu'])  #prune() takes an object from the same tree
treeA.count_terminals()



Saturday, December 15, 2012

DendroPy tutorial

Followed DendroPy's tutorial and I was able to prune a phylogenetic tree in 50 minutes. DendroPy's example scripts can be found both on its website   and in the examples/ folder in the downloaded source code package.
 
The interactive mode of Python is also very useful in this learning experience. It should be noted that DendroPy does not support Python3.0, which is a problem shared by most other bioinformatics related python projects, such as biopython. 

import dendropy
dir(dendropy)
help(dendropy.Tree)

tree_str = "[&R] ((A, (B, (C, (D, E)))),(F, (G, H)));"

tree = dendropy.Tree.get_from_string(
        tree_str,
        "newick")

dir(tree)

print("Before:")
print(tree.as_string('newick'))
print(tree.as_ascii_plot())
tree.prune_taxa_with_labels(["A", "C", "G"])
print("After:")
print(tree.as_string('newick'))
print(tree.as_ascii_plot())

Later, I found that DendroPy cannot parse nodes with multiple labels, such as those output by codeml.