Wednesday, November 27, 2013

Error propagation

Monday, November 25, 2013

Cb/Cv H2O2-LOH choices

There are two ways to average Cv/Cb: First is to calculate the ratio in each experiment and then average by experiments. Second is to average Cv and Cb by experiment and then calculate the ratio. The second method seems to generate large variance and is therefore more prone to experimental errors.

Because some strains are extremely sensitive to H2O2 so their Cv is very small, which leads to large fluctuations. So, Cv/Cb is more stable, and this is my best choice from the practical perspective.

Friday, November 15, 2013

Multicollinearity, correlated predictors in regression

Multicollinearity does not influence the prediction of response (y), but it affect the interpretation of individual predictors.

*** How to generated correlated random numbers with specifed R-squared

R^2 = 1 - SSres / SStot
SSres = sum of (y_obs - fitted)^2
SStot = sum of (y_obs - y_mean)^2
For y = rho * x1 + sqrt(1-rho^2)*x2
rho^2 =  1-  (y - x1*rho)^2 / (x2^2)

For standardized random numbers
x1 = rnorm(100)
x2 = rnorm(100)
rho = 0.5
y = rho * x1 + sqrt(1-rho^2)*x2

> set.seed(2014)
> N=500
> x = rnorm(N)
> error = rnorm(N)
> rho = sqrt(0.5)
> y= rho*x + sqrt(1-rho^2)*error  #rho is the slope
> summary(lm(y ~ x ))

lm(formula = y ~ x)

     Min       1Q   Median       3Q      Max
-1.85255 -0.47818  0.04374  0.46947  2.17727

            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -0.00915    0.03101  -0.295    0.768   
x            0.69389    0.03165  21.921   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6915 on 498 degrees of freedom
Multiple R-squared: 0.4911,    Adjusted R-squared:  0.49
F-statistic: 480.5 on 1 and 498 DF,  p-value: < 2.2e-16 

Generate non-standardized random numbers
> set.seed(2014)
> x=rnorm(1000)*4+2
> error = rnorm(1000)
> rho=sqrt(0.5)
> y = rho*(x-2)/4 + sqrt(1-rho^2)*error
> summary(lm(y~x))

lm(formula = y ~ x)

    Min      1Q  Median      3Q     Max
-2.1879 -0.4414 -0.0138  0.4289  2.7802

            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -0.37723    0.02428  -15.53   <2e-16 ***
x            0.18475    0.00546   33.83   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6751 on 998 degrees of freedom
Multiple R-squared: 0.5342,    Adjusted R-squared: 0.5338
F-statistic:  1145 on 1 and 998 DF,  p-value: < 2.2e-16

> y2 = y*4+2*rho
> summary(lm(y2~x))

lm(formula = y2 ~ x)

    Min      1Q  Median      3Q     Max
-8.7515 -1.7656 -0.0552  1.7155 11.1207

            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -0.09471    0.09713  -0.975     0.33   
x            0.73899    0.02184  33.834   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.701 on 998 degrees of freedom
Multiple R-squared: 0.5342,    Adjusted R-squared: 0.5338
F-statistic:  1145 on 1 and 998 DF,  p-value: < 2.2e-16 

AIC ( preferred model is the one with the lowest AIC value)

Akaike information criterion

AIC = 2k - 2ln(L)
So, the preferred model is the one with the lowest AIC value.

See example at

From wikipedia,

Correlated regression, Langley example, Faraway

Correlated errors g <- gls( Employed ~ GNP + Population, correlation=corAR1(form=~Year), data=longley)

Thursday, November 14, 2013

nlme lme usage

Regression with correlated variance, from R blogger

Regression with correlated variance/error

# Loading packages
library(lattice)       # Fancy graphics
library(nlme)          # Generalized linear mixed models 

# Reading data
setwd('~/Dropbox/quantumforest')  # Sets default working directory

un = read.csv('nzunemployment.csv', header = TRUE)

# Plotting first youth data and then adding adults
# as time series
  plot(youth ~ q, type = 'l', ylim = c(0,30), col='red',
       xlab = 'Quarter', ylab = 'Percentage unemployment')
  lines(q, adult, lty=2, col='blue')
  legend('topleft', c('Youth', 'Adult'), lty=c(1, 2), col=c('red', 'blue'))
  abline(v = 90)

# Creating minimum wage policy factor
un$minwage = factor(ifelse(un$q < 90, 'Different', 'Equal'))

# And a scatterplot
xyplot(youth ~ adult, group=minwage, data = un, 
       type=c('p', 'r'), auto.key = TRUE)

# Linear regression accounting for change of policy
mod1 = lm(youth ~ adult*minwage, data = un)

# Centering continuous predictor
un$cadult = with(un, adult - mean(adult))
mod2 = lm(youth ~ cadult*minwage, data = un)

plot(mod2)    # Plots residuals for the model fit
acf(mod2$res) # Plots autocorrelation of the residuals

# Now we move to use nlme

# gls() is an nlme function when there are no random effects
mod3 = gls(youth ~ cadult*minwage, data = un)

# Adding autocorrelation
mod4 = gls(youth ~ cadult*minwage, correlation = corAR1(form=~1), data = un)

Student Jiang, 2013Nov 14

Last time we discussed problems in retrieving genes associated with SNPs that have OMIM IDs. One of the problem is that some OMIM entries in the OmimVarLocusIdSNP table do not have genes listed. I just tried searching the genome database at UCSC using the SNP_id, and got the corresponding gene names.
For example, "188890" is associated with 11 SNPs and the query results are:

rs686           chr5    DRD1
rs4532         chr5    DRD1
rs921451      chr7    DDC
rs1051730    chr15    CHRNA3
rs1451371    chr7    DDC
rs2060762    chr7    DDC
rs3733829    chr19    EGLN2
rs3733829    chr19    RAB4B-EGLN2
rs3735273    chr7    DDC
rs3757472    chr7    DDC
rs4105144    chr19  N/A
rs6474412    chr8    N/A
We got DRD1, DDC, CHRNA3, and EGLN2 (I assume EGLN2 and RAB4B-EGLN2 are the same), and there are no gene returned for SNP rs4105144 and rs6474412.
However, on the webpage of "188890" (
1. The table "Phenotype Gene Relationships" listed several genes/locus, which do not match with the above results.
2. In the "Mapping" section, rs4105144 and rs6474412 are mentioned, and are associated with CYP2A6, CYP2B6, CHRNB3 and CHRNA6, but I got no results when search on UCSC genome database.
"These 2 loci include genes involved in nicotine metabolism, such as CYP2A6 (122720) and CYP2B6 (123930) on 19q13, and nicotinic acetylcholine receptor subunits on 8p11 (CHRNB3; 118508) and CHRNA6 (606888)."

In this section, EGLN2 gene is also mentioned.

3. In the "Molecular Genetics" section, several genes are described,
Glutamate Transporters: SLC1A2, SLC17A6 and SLC17A7
Dopamine Transporter/Receptors: SLC6A3, DRD2 and DRD1
DOPA Decarboxylase: DDC
Cholinergic Receptors: CHRNA4, CHRNB2, CHRNA2, CHRNB4, NTRK2, CHRNA3, CHRNA5
G Protein-Coupled Receptor-51: GABABR2
Serotonin Transporter: SLC6A4
Taste Receptor Polymorphisms: TAS2R38
I think I am lost on which gene we should choose in our analysis.

Friday, November 8, 2013

Sagi, Wolf, Koonin 2012, PlosComBiol, Universal pacemakers of genome evolution

SWK12 focus on 6901 orthologous gene families in 41 archaeal and 59 bacterial genomes, provided in supporting text.

So, if the SWK12 molecular pacemakers occur in yeast, they are very conserved genes.  Maybe, I could simulate the aging of the evolutionary core network, and then study how the recent network component change the aging dynamics of the core network.  This is a way to see how evolution influence network reliability, i.e., network robustness.

Wednesday, November 6, 2013

Wilke & Drummond, 2010, protein biophysics and coding sequence evolution

WD10 list a few factors of protein biophysics that can influence dN dS (exluding fuction)
-> Protein structure, solvent exposure, disorder regions (alpha and beta secondary structure are not important)

-> translational selection.
  Synonymous codons can alter ribosomal speed and influence protein folding.

-> protein-protein interactions

Austad 1989, aging of bowl and doily spider


Steven N. Austad
1989 Exp Gerontology.

Survival curve of the bowl and doily spider is exponential in the wild, indicating that sth is kiling the spiders regardless of ages. This is commented as non-aging characteristic. 

SNPedia, openSNP

A simple table from SNPedia:

OmimVarLocusIdSNP table

Some do not have their gene names listed. For example, the first entry "188890" links to RS_686, but there is no gene name listed. I searched on OMIM. ( Looks like this OMIM ID has 5 phenotypes and links to 4 different genes. So, for those that gene names are not listed, what should we do?

Hmm, it seems OMIM record can be different from other database:
RS686 is clearly linked to DRD1 gene on SNPedia. 

Student's and my notes, dbSNP project

I think we are trying to retrieve from dbSNP XML files those SNPs that have clinical significance, i.e. disease-associated SNPs. But I found out that, there are no attributes for "clinical significance", or any OMIM info, in the dbSNP XML files.

Here is what I did:
1. In the following site:, choose "22" in "Chromosomes" and "OMIM" in "Annotation", and then search. There are 266 results returned. I take RS_4633 as the example.
2. I found a software "010 Editer" at, which can open the >3GB XML for Chromosome 22. And I extract the entry of RS_4633, which is in the attachment.
I could not found anything related to OMIM or clinical significance in the extracted info.

On, it says:

What dbSNP report format will provide both a SNP and its specific OMIM ID number?
The easiest way you can get all the SNPs with OMIM links is from Entrez SNP:
1.Go to the Entrez SNP site.
2.Click on the grey “Limits” tab near the top of the page (just beneath the text search boxes).
3.Select the organism you are interested in from the organism list located at the top left of the page.
4.Scroll down the page almost to the bottom, until you find the list of “Annotation” limits. Select “OMIM”
5.Press the “Go” button located at the top of the page next to the empty text search box, and you will receive a list of your organism’s SNPs with OMIM annotation.
You can also get those SNPs with an OMIM ID number by downloading from the dbSNP FTP site: the OmimVarLocusIdSNP table contains the information you need for your organisim of interest (human, in this case). This table is located in your organism’s organism_data directory on the dbSNP FTP site."

Column definitions for this table are as follows:
2The locus id the SNP is on
3 omim variation id.
4locus symbol
5Amino acid using the contig reference allele.
6Amino acid position in the protein.
7Amino acid of the snp variance.
8var class (used for internal dbSNP processing)
9snp_id (rs#)

Columns of OmimVarLocusIdSNP table are:
2The locus id the SNP is on
3 omim variation id.
4locus symbol
5Amino acid using the contig reference allele.
6Amino acid position in the protein.
7Amino acid of the snp variance.
8var class (used for internal dbSNP processing)
9snp_id (rs#)

The last column should be the SNP rs ID. The 4th column seems to be gene name which may be mapped directly into the human protein network.


How to automatically forward emails to other accounts in Lotus Notes

Here are my steps to set up automatic forwarding in LotusNotes. The steps can be slightly different in different versions of Lotus Notes. The basic idea is to set up a Mail Rule that can work with most emails.

In Lotus Notes, go to the Mail window and click "More".

Then click "Mail Rules" in the drop-down menu.

 Click "New Rule" in the new pop-up window,

Choose "does not contain" for this rule,

Type some weird letters to make sure they are unlikely to be in most emails, and then click "Add"
For the action part, choose "send copy to",

Enter the email address that you want to forward and then click "add action",
You should now see the rule in the window beneath. Click "OK" to complete this procedure.

 Now, you should this active rule in the mail rule window.

 You can now send yourself an email to test wether this mail rule works or not.

Note: Every now and then, when Lotus Note are updated, mail rules can stop working because the versions between your account and the server are not the same. You should ask your systems administrator to update your LotusNote configurations.

Monday, November 4, 2013

AUC library useful tips

AUC Robert Woodruff Library

Inter library loan site

Remote Access

Electronic JSTOR link: