Open Notebook: 2012

Monday, December 31, 2012

*** Git/Github repository for Bacillus spore coat project

$ cd /BacillusSporeCoat-master
$ git init
Initialized empty Git repository in /Users/hongqin/github/BacillusSporeCoat-master/.git/

$ git add _coat.profile.for.submission.Dec31.2012.xlsx
$ git commit -m "coat profile"
[master (root-commit) fe3211c] coat profile
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 _coat.profile.for.submission.Dec31.2012.xlsx

Saturday, December 29, 2012

Install Git and GitHub for snow leopard

I have a snow leopard laptop.

From http://git-scm.com/download/mac, it automatically downloaded "git-1.8.0.2-3-intel-universal-snow-leopard.dmg". Double-click this file generated a "Git 1.8.0.2 Snow Leopard Intel Universal" disk, which contains four files: 1) uninstall.sh, 2) README, 3) 'setup git PATH for non-terminal programs.sh, 4) git-1.8.0.2-intel-universal-snow-leopard.pkg. After running the fourth file, a successful install message occurred.

I then found out that the free GUI for Git, GitHub, has stopped support for snow leopard. This leaves me the only option to follow the command-line version on https://help.github.com/articles/set-up-git.

$ git config --global user.name "hongqin"

$ curl -s -O \
  http://github-media-downloads.s3.amazonaws.com/osx/git-credential-osxkeychain

$ chmod u+x git-credential-osxkeychain

$ cd /opt/local/bin   #I saw osxkeychain already here

$ git config --global credential.helper osxkeychain

$ git remote add origin https://github.com/hongqin/sandbox.git

NPR food-for-thought blog on lactose tolerance.

NPR has a blog on the evolution of lactose tolerance: An Evolutionary Whodunit: How Did Humans Develop Lactose Tolerance? by Helen Thompson. This blog discuss a review on lactase persistence in Europe based on archaeological and genetic evidences, written by Leonardi, Gerbault, Thomas, and Burger. This blog highlighted the speculative combination of famine and diarrhea to explain the quick adaptive sweep of lactase persistence in European populations. I was surprised that lactase persistence was argued by Cordain to provide advantage against malaria in Africa and Southern Europe and rickets in Northern Europe.

I found many interesting ideas in the comments posted on this blog, often with entertaining exchanges. One reader linked another blog on evolution of lactose tolerance at darwinstudents.blogspot.com, that also gave some interesting speculations. I have to say that some of the discussions on lactose intolerance in Chinese population contain inaccurate facts.

Saturday, December 22, 2012

Read pairwise network data into igraph, then do permutation.

#continental US demo, permutation of pariwse interaction network, igraph usage

require(RCurl)
URL = "https://docs.google.com/spreadsheet/pub?key=0ArLJZixvTlU7dDI5c3dnTzdRX1dndzBORUk1UC0wYlE&single=true&gid=0&output=csv"
states = read.csv(textConnection(getURL(URL)), colClass = c("character", "character"))

g = graph.data.frame(states, directed=F)
g.degree = degree(g)
g.degree [g.degree == max(g.degree)] #TN and MO have 8 bordering states

g.shortestpath.m = shortest.paths(g)
str(g.shortestpath.m)
sorted.names = sort( rownames(g.shortestpath.m) )
gsm = g.shortestpath.m[, sorted.names]
gsm = gsm[sorted.names, ]

URL2 = "https://docs.google.com/spreadsheet/pub?key=0ArLJZixvTlU7dC1rTlRhTkw5UlY0eWZGT1JOVUVCWGc&single=true&gid=0&output=csv"
state.year.tb = read.csv(textConnection(getURL(URL2)), colClass = c("character", NA))

permute.pairs.wo.selfpairs = function( inpairs, ncycles=10, debug=1 ) {
if (ncycles >= 1 ) {
    if(debug) {
      print(paste('ncycles=', ncycles))
    }
    longids = c(as.character(inpairs[,1]), as.character(inpairs[,2]) )
    longids = sample(longids)
    len = length(inpairs[,1])
    newpairs = data.frame( cbind( longids[1:len], longids[(len+1): (2*len)]) )
    names(newpairs) = c('id1', 'id2')
    newpairs$id1 = as.character( newpairs$id1)
    newpairs$id2 = as.character( newpairs$id2)
    newpairs$selfpairs = ifelse( newpairs$id1 == newpairs$id2, 1, 0 )
    self.tb = newpairs[ newpairs$selfpairs==1, ]
    nonself.tb = newpairs[newpairs$selfpairs==0, ]
    if(debug) {
      print(self.tb)
    }
    if( length(self.tb[,1])>0 ) {
      if ( ncycles == 0) { return (c(NA,NA, NA) )
      } else {
        ncycles = ncycles - 1
        selectedpairs = rbind(self.tb, nonself.tb[1: (length(self.tb[,1])*2), ] )
        restpairs = nonself.tb[ (length(self.tb[,1])*2+1): length(nonself.tb[,1]), ]
        return( rbind(restpairs, permute.pairs.wo.selfpairs(selectedpairs, ncycles)))
      }
    } else {
      return (newpairs)
    }
} else {
    return( c(NA,NA,NA ))
}
}

#test
x = permute.pairs.wo.selfpairs(states)
x2 = permute.pairs.wo.selfpairs(states)

Thursday, December 20, 2012

The meaning of dN/dS (Ka/Ks) ratio in molecular evolution

Omega, Ka/Ks or dN/dS, is the ratio of nonsynonymous substitution per site to synonymous substitution per site. Ks (dS) can be viewed as a proxy of background mutation. The ratio of Ka/Ks, omega, indicate the extent of changes at the amino acid levels after normalized by silent mutational changes at DNA level, hence, is a proxy for positive selection pressure in coding genes. This definition assumes selection only at the protein level, not at the DNA or RNA level. It is clear that omega is only applicable in protein-coding genes.

Omega is designed to study divergence because its definition assumes fixed changes. When Omega is used in the population context, standing genetic variation would be treated as fixed ones, which can inflate the estimation of omega.

Kryazhimskiy and Plotkin (2008) showed that in population genetics context, dN/dS ratio is not a reliable index for selection pressure. When population scaled mutation rate is small enough, the population omega depends on both selection coefficient (s) and mutation rate (\mu). Based its figure 1, it can be seen that population omega is generally higher on divergent omega for negative selections. KP08 used a two-allele model to find an analytic solution, and then used simulation to study the behavior in a Markov chain model with 64 states. KP08 is mostly well written. I found Equation 3 was presented without explanations.

Wednesday, December 19, 2012

Check and balance between adaptation and mutation in E. coli

Wielgoss et al, 2012 PNAS, Mutation rate dynamics in a bacterial population reflect tension between adaptation and genetic load. The authors sequenced one E. coli strain that has been evolved for 20 years in the Lenski lab. The author estimated that 20 years of E coli is about 50,000 generations. Twenty-two clones sampled at various time point during evolution were resequenced for this study. Previously, it was found that a frame-shift in mutT became 'established' in the population around 25K generations. After the fixation of this hypermutator mutT, mutation rate in two branches show reducted mutation rates (confirmed by fluctuation tests). Because the number of substitutions can be revealed by sequencing and the number of generations can be estimated, the mutations rates at different time points (hence, mutation dynamics) can be monitored (This is very nice thanks to the 20 years of in-vitro evolution experiment).

For reading, a few key concepts here are mutation rate, genetic load, hypermutators. Mutation rate are measured using synonymous substitutions. Genetic load is the normalized difference between maximal and mean fitness. The authors seem to use relative growth rate as compared to the ancestral clone to calculate the genetic load.

Specific terms include mutT, mutY, which are the specific genes under discussion.

I noticed the use of "established" instead of 'fixed' in the population, which indicates that mutT did not reach 100% of the population.

A question: How does the author establish the causal interaction between mutT and mutY? It seems that they begin with a candidate gene approach, and used association arguments. I was hoping for a regression based causal inference like those in quantitative genetics, but this kind of analysis may not be available here.

Tuesday, December 18, 2012

Dietary restriction in chronologicaled aged cells leads to extened replicative life span, Delaney et al, 2012 Exp Gen

Delaney et al, 2012 Experimental Gerontology

A paper from the Kaeberlein group ( http://www.ncbi.nlm.nih.gov/pubmed/23235143) showed the dietary restriction during chronologically aging can lead to extended replicative life span. This is a technically tricky experiment, because when chronologically aged cells start to lose viability, much more cells will be required to conduct the subsequent replicative life span assays. They author stated that they pick as much as 1000 cells to start the replicative lifespan assays.

One of my questions is a technique one: How did they carry out DR during chronological aging? It seems that starting cultures are SC with various glucose concentration. Five microliters of chronologically aged cultures were spread to YPD (2% glucose) plate for RLS assay, which is a standard practice. DR was achieved in 0.05% glucose in this study.

The authors double-stained yeast cells with 5nM SYTOX red and 17.5nM DiOC6, and monitor them by flow cytometry. DiOC6 is a mitochrondrial membrane potential dependent dye, and SYTOX Red is a used distinguish live and dead cells. SYTOX Red-negative cells are considered live cells.

This paper argues that both DR and buffering during chronological aging can delay the subsequent replicating aging process by maintiaing a low mitochondria membrane potential. Overall, the paper suggests that mitochondria is a link between post-mitotic and mitotic aging.

One question that I have is the role of mitochondria during chronological aging. The author conducted chronological aging in rich media. Although many cells are probably metabolically active during in the depleted media, many are also metabolically inactive. Is it possible that DR have differential effect on different cell sub-populations in different metabolic states?

For comparison, I often measure CLS in water, and many cells are metabolically much less active, if not domant in G0. DR cannot be studied in water-based chronological aging.

Monday, December 17, 2012

Generate pruned phylogeny using Biopython

Biopython Phylo

Its cookbook (http://biopython.org/wiki/Phylo_cookbook) offers some useful examples. Biopython is still tied to python 2.7.

The following code will read a Newick tree, generate a lookup dictionry for leaf-nodes, and then remove individual leaf node to generate a pruned tree.

from Bio import Phylo
#from Bio.Phylo import PhyloXML, NewickIO

def lookup_by_names(tree):
    names = {}
    for clade in tree.get_terminals():
        if clade.name:
            if clade.name in names:
                raise ValueError("Duplicate key: %s" % clade.name)
            names[clade.name] = clade
    return names

EX_NEWICK = 'spec.nwk'
treeA = Phylo.read(EX_NEWICK, 'newick')
print(treeA)
names = lookup_by_names(treeA)
treeA.prune(names['Bcl']) #prune() takes an object from the same tree
treeA.count_terminals()
treeA.prune(names['Bha']) #prune() takes an object from the same tree
treeA.count_terminals()
treeA.prune(names['Bsu']) #prune() takes an object from the same tree
treeA.count_terminals()
treeA.prune(names['Bpu']) #prune() takes an object from the same tree
treeA.count_terminals()

Saturday, December 15, 2012

DendroPy tutorial

Followed DendroPy's tutorial and I was able to prune a phylogenetic tree in 50 minutes. DendroPy's example scripts can be found both on its website   and in the examples/ folder in the downloaded source code package.

The interactive mode of Python is also very useful in this learning experience. It should be noted that DendroPy does not support Python3.0, which is a problem shared by most other bioinformatics related python projects, such as biopython.

import dendropy
dir(dendropy)
help(dendropy.Tree)

tree_str = "[&R] ((A, (B, (C, (D, E)))),(F, (G, H)));"

tree = dendropy.Tree.get_from_string(
        tree_str,
        "newick")

dir(tree)

print("Before:")
print(tree.as_string('newick'))
print(tree.as_ascii_plot())
tree.prune_taxa_with_labels(["A", "C", "G"])
print("After:")
print(tree.as_string('newick'))
print(tree.as_ascii_plot())

Later, I found that DendroPy cannot parse nodes with multiple labels, such as those output by codeml.

Monday, November 12, 2012

botseq_hong_variant_calling.mk

# Run freebayes on bam files
#
.DELETE_ON_ERROR:

ROOT_DIR?=/ndata/botlabseq/seqshare
include ${ROOT_DIR}/src/config.mk
include ${ROOT_DIR}/src/hong_analysis/botseq_hong_mapping.mk

REFERENCE_INDEX?=${ROOT_DIR}/reference_genomes/saccharomyces_cerevisiae_s288c_saccer3/saccharomyces_cerevisiae_s288c_saccer3.fasta

HONG_VCF_DIR=${ROOT_DIR}/hong_analysis/vcf

HONG_VCF_FILES=$(addprefix ${HONG_VCF_DIR}/, $(patsubst %.bam,%.vcf,$(notdir ${HONG_SORTEDBAM_FILES})))

HONG_VCF_GZ_FILES=$(addsuffix .gz, ${HONG_VCF_FILES})
HONG_VCF_INDEX_FILES=$(addsuffix .tbi, ${HONG_VCF_GZ_FILES})

.DEFAULT_GOAL:=hongvcall
hongvcall: ${HONG_VCF_INDEX_FILES}

${HONG_VCF_DIR}:
mkdir ${HONG_VCF_DIR}

.SECONDARY: ${HONG_VCF_FILES}
${HONG_VCF_FILES}: ${HONG_VCF_DIR}/%.vcf: ${HONG_MAPPED_BAM_DIR}/%.bam | ${HONG_VCF_DIR}
${FREEBAYES} --fasta ${REFERENCE_INDEX} --pooled $< > $@

${HONG_VCF_GZ_FILES}: ${HONG_VCF_DIR}/%.gz: ${HONG_VCF_DIR}/%
${BGZIP} $<

${HONG_VCF_INDEX_FILES}: ${HONG_VCF_DIR}/%.tbi: ${HONG_VCF_DIR}/%
${TABIX} $<
hongvctest:
@echo ${HONG_VCF_INDEX_FILES}

Friday, November 9, 2012

Parse genetic variants from NGS data, sequenced qin-lab natural isolates

These shell script were written by Lance at the Princeton genome center.

batch code are in src/
mkdir -p /Volumes/BotlabSeq/hong_analysis/mapped_bam
bwa aln -t 2 /Volumes/BotlabSeq/reference_genomes/saccharomyces_cerevisiae_s288c_saccer3/saccharomyces_cerevisiae_s288c_saccer3.fasta /Volumes/BotlabSeq/fastq/3-001_M1-2_lib1_read-1.fastq.gz > /Volumes/BotlabSeq/hong_analysis/mapped_bam/3-001_M1-2_lib1_read-1_aln_sa.sai

... ...

#!/bin/bash

DIRECTORY=$(cd `dirname $0` && pwd)

ROOT_DIR=$(cd "${DIRECTORY}/../../../" && pwd)
STRAINS_FILE=${ROOT_DIR}/Sequencing_Consortium_Strains-MASTER.txt
FASTQ_DIR=${ROOT_DIR}/fastq
ORIGINAL_BAM_DIR=${ROOT_DIR}/original_bam

function process_batch {
HONG_FASTQ_FILES=""
ORIGINAL_BAM_FILES=""
for strain in "${strain_batch[@]}"; do
HONG_FASTQ_FILES+=`find "${FASTQ_DIR}" -name "$strain*.fastq.gz" | xargs echo`
HONG_FASTQ_FILES+=" "
HONG_ORIGINAL_BAM_FILES+=`find "${ORIGINAL_BAM_DIR}" -name "$strain*.bam" | xargs echo`
HONG_ORIGINAL_BAM_FILES+=" "
done
export HONG_FASTQ_FILES
echo $HONG_FASTQ_FILES
export HONG_ORIGINAL_BAM_FILES
echo $HONG_ORIGINAL_BAM_FILES
echo "make -n -f ${DIRECTORY}/botseq_hong_mapping.mk -j 5"
echo ""
}

strain_list=()

while IFS=$'\t' read -r -a fields

genus_species=${fields[1]}

individual_name=${fields[4]}

number=${fields[11]}

if [[ $number == 3-* ]]; then

echo "$number $individual_name $genus_species - SELECTED"

strain_list+=("$number")

done < $STRAINS_FILE

strain_batch=()

i=0

for strain in "${strain_list[@]}"; do

strain_batch+=("$strain")

i=$i+1

if [[ $i -eq 10 ]]; then

echo ${strain_batch[@]}

process_batch

strain_batch=()

i=0

done

echo ${strain_batch[@]}

process_batch

exit

#${DIRECTORY}/botseq_lte_merge_bam.py > ${DIRECTORY}/botseq_combine_bam.mk

#make -f ${DIRECTORY}/botseq_combine_bam.mk -j 4

The mapped variant are in "vcf" files.

References
http://www.broadinstitute.org/igv/book/export/html/16
File formats

http://vcftools.sourceforge.net/perl_module.html
VCF perl tools

http://sourceforge.net/projects/samtools/files/tabix/

Monday, November 5, 2012

bio386, GEO and writing milestone

2012 Nov 5
GEO database analysis
writing milestone

Wednesday, October 31, 2012

bio386, paper presentations

-----------
2012 Oct 31, Wed
Costanzo 2012 Science, genetic network -> Jasmine, 2012 Oct 31 PDF document
zotenko 2008 plos, essential hub genes -> Jessica, 2012 Oct 31 presentation

Monday, October 29, 2012

bio386, Dhami11, Gasch2000

------------
2012 Oct 29. Monday
Dhami 11 paper, Ella and Teneisha, 2012 Oct 29.

Gasch 2000 paper, Khayla and Asha, 2012 Oct 29
Clustering exercise
CV calculation

Saturday, September 1, 2012

bio386 fall 2012 log

------------
2012 Oct 29. Monday
Dhami 11 paper, Ella and Teneisha, 2012 Oct 29.

Gasch 2000 paper, Khayla and Asha, 2012 Oct 29
Clustering exercise
CV calculation

-------
2012 Oct 24 Wed.
inclass self-paced moodle quiz on qin's research proposal.
------------
2012 Oct 22 Monday
review exam
Read DNA microarray wikipedia entry:
http://en.wikipedia.org/wiki/DNA_microarray

read Qin's proposal
read Gasch paper, presentation
what is robustness? how to measure them?

--------------
2012 Oct 17 Wed, midterm exam, collaborative part
em: permutaiton on gIN

--------
3 days, intro to cellular aging and research
---------------------
2012 Oct 10.
yeast aging presentation to m0 and G formula.

take-home exam, due Oct 17 Wed before class. 1) permutation on genetic network 2) compare yeast gNet and PIN: are there correlation between genetic interactions and protein interactions?

review genetic network assignment
student presentations (Dhami paper, Newman paper)

--------
2012 Oct 8.
intro to yeast aging.
    (aging, gompertz model, dS/dt = slope, non-aging)
    RLS, CLS
figure 3B.

-----
2012 Oct 3, Wed
match slides

Figure 2
partial regression demo, age, shoesize, readingability

Permutation
homework on figure 3B.
-----

quiz on Wed to see whether Fraser paper data can be loaded to R and first few lines can be ran.
set up dropbox accounts:

-----
2012 Sep 26 Wed
summary key points
Ka, Ks, poistive, negative, neutral selection
pairwise interaction -> network
yeast ORF names
competive growth fitness measures (using bar codes)
use table to counter protein interactions
data.frames

match

For review and quiz, load genetic interaction data into R.
convert ORF to letters.

----------
2012 Sep 24, Mon
discussion of Fraser Science paper

Some key ponts:

Protein interaction network
For Fig 1:
connectivity = number of interactions per gene
evolution rate =K
linear regression
For Fig 2: causal relationships (daycare example), multiple regression
For Fig 3: null distribution and p-value; permutation

Key messages:
science paper is not a big deal
how to read a scientific paper
how to interpret figures
p-value and null distribution

----------
2012 Sep 19, Wed
***collaborative exam, student demo and exercise on screen. This method works well!
-----------
2012 Sep 17, Mon
exercises and reviews
go over exam1
-------------
2012 Sep 12 Wed
2.exericse
function on make solution
give home work
input out,
summary, leave homework, quiz on Monday.

---------------
=>2012 Sep 10, Mon
quiz on daycare score again
basic programming concepts: loops, conditions, functions.
---------------
=>2012 Sep 5, Wed
Quiz on salary.R
Homework on fraser paper
salary.R
review new learnings. regression, t.test(), p-value.
homework is requested.
//
-------------------
=>2012 August 29, Wed
quiz: a list of values (0.1, 0.01, 0.001), take log
irb signature: ella
salary example. socratice method, working on it, the ask questions.
give review guide.

skills and functions to cover:
str, read.table(), pick columns, pick rows, pick rows and columns,
hist
table() #how many female ‘Arts’ faculty
#how many female ‘Assistant’, ‘Associate’, and ‘Full’ professors
//

-------------------
=>2012 August 27, Mon
welcome new comers
simple.R takes 2 hours
//

-------------------
=> 2012 Aug 22, Wed
x irb signature
x go over syllabus, writing milestones, previous grades,
x pre-survey
* foucs on why computing for biology majors (four students just come to class for no reasons without computing backgrounds)
googleDoc to work on list of reasons for biology majors to learn programming and computing concepts and skills
using cards to explain computational thinking and search space

Reasons why biology major should learn programming and computing
    Learning programming skills and computational thinking offers a key set of problem solving skills.
    genomics medicine
    highlights on resumes when applying for medical school or graduate schools
    more competitive for jobs
    Another useful language
    Computing holds the key to understand complicated biology phenomena.
    Further understanding of genomic dynamics of emerging pathogens and diseases
    To use computer storage and other programs to study the biological data and maps of the human genome sequence
    Updated on new technology
    Understanding programming and computing can aid in drug discovery
    Aids in conducting research projects

How to find out a missing card from a stack? (A key concept is search space)
Plan A: count card one by one. Search space is 52 possibilities.
Plan B: separate into sets and colors
Question: What are the key differences between these two search plans?
Answers: Random data versus organized (structured) data

* I should ask students to introduce themselves.

Tuesday, August 21, 2012

20120821 DHE DHR BY4743 H2O2

During Clibur measure, we measured blanks, DHE, DHR, and DHEDHR, and "on" for overnight DHR signals. The overnight DHR sample were basically left in the dark at room temperature (as I recall).

Wednesday, June 6, 2012

DHR stain, YP2%D, YP0.5%D, BY4743, RM112N, M2-8

Monday, June 4, 2012

DHR time course, RM112N, M2-8, YPS163, SGU57

Friday, April 6, 2012

DHE stain of BY4743, tor1D, pcp1D

FacsComp 3 color report, Calibur

Wednesday, March 28, 2012

PI standard curve by Calibur, instrumental setting, for parameter settings, BY4743, M5star, YPD

Wednesday, March 21, 2012

DHE stain of HQ Met15+/- strains, YP2%D and YP0.5%D

Wednesday, February 22, 2012

DHE stain, BY4743, pcp1D, 2%glucose

DHE can be detected in both FL2 and FL3, but not in FL1.

Tuesday, January 10, 2012

bio386, spring 2012 log

-> exam question: yeast gene in FASTA format, calculate its length, find out
mean ORF length.

partion H2O2-LOH data to bio320 classes for assignment

Exam 2-> deg factor on salary

-> bayesian network modeling?
-> bio320 projects: story paper, morphology, expression; wynter gwa study; realibility simulation and threshold trait;

=> 2012 Feb 21
Dropbox account creation for projects
Levy's presentation and explanation of data;
Go over research projects;
Pubmed and Google scholar search to find the most recent protein interaction network data in S. cerevisiae.

=>2012 Jan 26, 3.zip. The exercises take the entire class,

=>2012 Jan 23Tue discuss my MRI proposal

=>2012 Jan 19Thu
proofread an NSF proposal. 5 students, working 3 groups. 5 pages per group.
//

=> Jan 17, 2012 Tues
work through salary example. R2 and p-value take a long time.
One students ask for open-ended qustions. I asked them study
'deg' effect on 'salary', 'gender', and 'rank'.

=> Jan 12, 2012 Thursday
survey
hp laptops usages
Moodle site
go over syllabus, writing milestones
previous student projects, and posters
install R, Rstudio, notepad++,
worked on simple.R
assignment: watch youtube videos, 1st ppt, salary.R