Open Notebook: October 2013

Thursday, October 31, 2013

install wget to mountain lion server

following steps on http://osxdaily.com/2012/05/22/install-wget-mac-os-x/

curl -O http://ftp.gnu.org/gnu/wget/wget-1.13.4.tar.gz
tar -xzf wget-1.13.4.tar.gz
cd wget-1.13.4
./configure --with-ssl=openssl
make
sudo make install
wget --help #Good, it worked.

biomaRt, SNP, ensembl

http://uswest.ensembl.org/info/genome/variation/index.html

http://uswest.ensembl.org/info/docs/api/variation/variation_tutorial.html
http://uswest.ensembl.org/info/docs/api/variation/variation_tutorial.html#alleles

http://uswest.ensembl.org/info/website/tutorials/Ensembl_variation_quick_reference_card.pdf

library(biomaRt)
snp.db <- useMart("snp", dataset="hsapiens_snp")
nt.biomart <- getBM(c("refsnp_id","allele","chr_name","chrom_start",                   
                      "chrom_strand","associated_gene",
                      "ensembl_gene_stable_id"),
                    filters="refsnp",
                    values=the.snps,
                    mart=snp.db)

Reference:
http://www.biostars.org/p/1332/

Ensemble ported some dbSNP population infor about SNP
Example: rs1800944 gene CCR5
http://uswest.ensembl.org/Homo_sapiens/Phenotype/Locations?ph=763

I tried hemoglobin but did not see snps.

1000genomes data, note

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/

2010 July SNP
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/release/2010_07/exon/snps/

Populations are described at
http://www.1000genomes.org/about

697 total samples from 7 populations

              CEU: 90 samples, utah
              TSI: 66 samples, Toscani, Italy
              CHB: 109 samples, beijing
              CHD: 107 samples, ? 
              JPT: 105 samples, Japanese
              LWK: 108 samples, Luhya, Kenya
              YRI: 112 samples Yoruba, Nigeria

Initial target selection targeted ~1,000 genes (~10,000 exons or 2.3Mb), using CCDS annotations. The current data release only includes SNP calls that fell into regions 
that were successfully captured by all four data producing centers. The number of these regions is ~8496 , of total target length of ~1.43Mb.

Name	Size	Date Modified
[parent directory]
CEU.exon.2010_03.genotypes.vcf.gz	592 kB	11/11/10 12:00:00 AM
CEU.exon.2010_03.genotypes.vcf.gz.tbi	20.9 kB	11/11/10 12:00:00 AM
CEU.exon.2010_03.sites.vcf.gz	53.1 kB	7/7/10 12:00:00 AM
CEU.exon.2010_03.sites.vcf.gz.tbi	19.1 kB	7/7/10 12:00:00 AM
CHB.exon.2010_03.genotypes.vcf.gz	627 kB	11/11/10 12:00:00 AM
CHB.exon.2010_03.genotypes.vcf.gz.tbi	20.7 kB	11/11/10 12:00:00 AM
CHB.exon.2010_03.sites.vcf.gz	49.4 kB	7/7/10 12:00:00 AM
CHB.exon.2010_03.sites.vcf.gz.tbi	18.9 kB	7/7/10 12:00:00 AM
CHD.exon.2010_03.genotypes.vcf.gz	649 kB	11/11/10 12:00:00 AM
CHD.exon.2010_03.genotypes.vcf.gz.tbi	20.8 kB	11/11/10 12:00:00 AM
CHD.exon.2010_03.sites.vcf.gz	48.6 kB	7/7/10 12:00:00 AM
CHD.exon.2010_03.sites.vcf.gz.tbi	19.0 kB	7/7/10 12:00:00 AM
JPT.exon.2010_03.genotypes.vcf.gz	545 kB	11/11/10 12:00:00 AM
JPT.exon.2010_03.genotypes.vcf.gz.tbi	19.7 kB	11/11/10 12:00:00 AM
JPT.exon.2010_03.sites.vcf.gz	44.1 kB	7/7/10 12:00:00 AM
JPT.exon.2010_03.sites.vcf.gz.tbi	17.9 kB	7/7/10 12:00:00 AM
LWK.exon.2010_03.genotypes.vcf.gz	910 kB	11/11/10 12:00:00 AM
LWK.exon.2010_03.genotypes.vcf.gz.tbi	23.9 kB	11/11/10 12:00:00 AM
LWK.exon.2010_03.sites.vcf.gz	74.0 kB	7/7/10 12:00:00 AM
LWK.exon.2010_03.sites.vcf.gz.tbi	22.0 kB	7/7/10 12:00:00 AM
README.2010_03.exon_snps	4.9 kB	6/18/10 12:00:00 AM
TSI.exon.2010_03.genotypes.vcf.gz	424 kB	11/11/10 12:00:00 AM
TSI.exon.2010_03.genotypes.vcf.gz.tbi	20.4 kB	11/11/10 12:00:00 AM
TSI.exon.2010_03.genotypes.vcf.tbi	72 B	6/30/10 12:00:00 AM
TSI.exon.2010_03.sites.vcf.gz	49.3 kB	7/7/10 12:00:00 AM
TSI.exon.2010_03.sites.vcf.gz.tbi	18.7 kB	7/7/10 12:00:00 AM
YRI.exon.2010_03.genotypes.vcf.gz	976 kB	11/11/10 12:00:00 AM
YRI.exon.2010_03.genotypes.vcf.gz.tbi	23.9 kB	11/11/10 12:00:00 AM
YRI.exon.2010_03.sites.vcf.gz	75.5 kB	7/7/10 12:00:00 AM
YRI.exon.2010_03.sites.vcf.gz.tbi	21.9 kB	7/7/10 12:00:00 AM
exon.snps.LOF.txt	7.7 kB	7/19/10 12:00:00 AM
experimental_validation/		7/20/10 12:00:00 AM

Tuesday, October 29, 2013

Simulate aging of yeast PPI

Summary: When essential genes with <=2 links are used, survival curves have a very short initial shoulder (look like exponential). When essential genes with >=4 links are used, sigmoidal curves readily occur. These are expected based on my analytic results.

File: 20131029.ppi.sim.aging.R

rm(list=ls())

list.files(path='data', )

#yeast PPI
pairs = read.csv('data/pairs.csv', colClasses=c('character','character'))
#this yeast ppi dataset is consistent with Taiwan group's report.

#essential gene info
essenTb = read.csv('data/SummaryRegressionHetHom2013Oct29.csv', colClasses=rep('character', 9))

#######################
# How do the two data set overlap? DIP seems to contain some questionable orfs
uniq.orf.from.pairs = unique(c(pairs$ORF1, pairs$ORF2))
matches = uniq.orf.from.pairs %in% unique(essenTb$orf)
table(matches)
#FALSE TRUE
# 261 4217
# YDL026W dubious, YAR009C transposable element
dubiousORF = uniq.orf.from.pairs[! matches]

matches = uniq.orf.from.pairs %in% unique(essenTb$orf[essenTb$essenflag=='essential'])
table(matches)
#FALSE TRUE
#3506 972 #this is amazingly consistent with Taiwan group's report.

matches = uniq.orf.from.pairs %in% unique(essenTb$orf[essenTb$essenflag=='nonessential'])
table(matches)
# FALSE TRUE
# 1287 3191 #this is amazingly consistent with Taiwan group's report.

#remove dubious orfs from PPI
pairs$Removeflag = ifelse( pairs$ORF1 %in%dubiousORF | pairs$ORF2 %in%dubiousORF, T,F )
table(pairs$Removeflag)
#FALSE TRUE
#12180 1487

pairs = pairs[! pairs$Removeflag, ]
table(pairs$Removeflag)
pairs = pairs[,1:2] ##This set of pairs is read for analysis

###############################
# label essential nodes, remove nonesse-nonessen pairs
essentialORFs = essenTb$orf[essenTb$essenflag=='essential']
pairs$essen1 = pairs$ORF1 %in% essentialORFs
pairs$essen2 = pairs$ORF2 %in% essentialORFs

head(pairs)

#remove nonessen <-> nonessen intxn because they do not affect aging.
pairs$remove = ifelse( pairs$essen1==F & pairs$essen2==F, T, F )
pairs= pairs[! pairs$remove,1:4 ] #only 6279 intxn left

#remove self-intxn, just to make sure
pairs = pairs[ pairs$ORF1 != pairs$ORF2, ]

#how many essen <--> essen intxn?
pairs$inxnEE = pairs$essen1 & pairs$essen2
table(pairs$inxnEE)
# FALSE TRUE
# 4521 1758

#How many essen genes?
tmp = essentialORFs %in% unique(c(pairs$ORF1, pairs$ORF2))
table(tmp)
#FALSE TRUE
#194 958 #So, intxn among essential genes appear to be suppressed.

essentialORFsPPI = essentialORFs[tmp]

#get connectivities per node
degreeTb = data.frame( table(c(pairs[,1], pairs[,2])))
summary(degreeTb)
degreeTb$ORF = as.character( degreeTb[,1])

degreeCutoff = 4;
tmp = essentialORFsPPI %in% degreeTb$ORF[degreeTb$Freq>degreeCutoff]
GooddEssentialORFsPPI = essentialORFsPPI[tmp]

###########################
# simulate aging
# -> exponential age to all pairs
# -> maximal age for each essential gene
# -> minimal age for all essential genes

set.seed(2013)

lambda = 1/3500

# runningORFs = essentialORFsPPI
runningORFs = GooddEssentialORFsPPI

popSize = 100
popAges = numeric(popSize)
time1 = date()
for( j in 1:popSize) {
ModuleTb = data.frame(runningORFs)
pairs$age = rexp( length(pairs[,1]), rate=lambda ) #exponential ages for pairs

#I could add stochasticity into pairs here.

for (i in 1:length(runningORFs)) {
myORF = runningORFs[i]
pos1 = grep(myORF, pairs$ORF1)
pos2 = grep(myORF, pairs$ORF2)
ModuleTb$age.m[i] = max( pairs$age[c(pos1,pos2)] )
}
head(ModuleTb);
summary(ModuleTb)
currentNetworkAge = min(ModuleTb$age.m)
popAges[j] = currentNetworkAge
}
time2 = date()
hist(popAges)
summary(popAges)

time1; time2;
source("/Users/hongqin/lib/R/lifespan.r")
s.tb = calculate.s ( popAges )
plot( s.tb$s ~ s.tb$t )
#This is exponential, perhaps not surprisingly, because power-law elevate the role of the weakest link.
# So, if the weakest links die more slowly, maybe we can see sigmoidal shapes.

# Maybe I should remove the single-linked essential genes. In essence, assumming they die very slow.
# maybe I should also use house-keeping genes

plot( s.tb$s ~ s.tb$t, type='l', log='x' )

#todo,
# permutation effect on aging?
# lambda ~ 1/connectivity of nodes

#I should try to normalized the survival curves for a different prespective

Monday, October 28, 2013

Verify essential genes parsing results

Results parsed from YPD growth measures from the Deletion collection data.
File: /0.ppi.reliability.simulation/20131028.growth.fitness.R

tbReal[tbReal$Growth1=='nogrowth1' & tbReal$Growth2=='nogrowth2', ] [1:10,]
         orf    gene.x     Anno1   Growth1    gene.y     Anno2   Growth2 name.check
3    YAL003W      EFB1 verified1 nogrowth1      EFB1 verified2 nogrowth2       TRUE
15   YAL016W      tpd3 verified1 nogrowth1      tpd3 verified2 nogrowth2       TRUE
24   YAL025C     mak16 verified1 nogrowth1     mak16 verified2 nogrowth2       TRUE
31   YAL032C     PRP45 verified1 nogrowth1     PRP45 verified2 nogrowth2       TRUE
32   YAL033W      POP5 verified1 nogrowth1      POP5 verified2 nogrowth2       TRUE
34 YAL034W-A      MTW1 verified1 nogrowth1      MTW1 verified2 nogrowth2       TRUE
35 YAL035C-A YAL035C-A verified1 nogrowth1 YAL035C-A verified2 nogrowth2       TRUE
39   YAL038W     cdc19 verified1 nogrowth1     cdc19 verified2 nogrowth2       TRUE
42   YAL041W     cdc24 verified1 nogrowth1     cdc24 verified2 nogrowth2       TRUE
44   YAL043C      pta1 verified1 nogrowth1      pta1 verified2 nogrowth2       TRUE

YAL025C is inviable on SGD.
YAL035C-A is inviable on SGD, but is a dubious orf.
YAL041W cdc24 is inviable on SGD, but also have some other phenotypes.

> tbReal[tbReal$Growth1=='growth1' & tbReal$Growth2=='nogrowth2', ]
         orf gene.x     Anno1 Growth1 gene.y     Anno2   Growth2 name.check
956 YDR050C   TPI1 verified1 growth1   TPI1 verified2 nogrowth2       TRUE
1448 YEL002C   WBP1 verified1 growth1   WBP1 verified2 nogrowth2       TRUE
2685 YIL078W   THS1 verified1 growth1   THS1 verified2 nogrowth2       TRUE
4152 YML126C ERG13 verified1 growth1 ERG13 verified2 nogrowth2       TRUE
4383 YMR218C TRS130 verified1 growth1 TRS130 verified2 nogrowth2       TRUE

YDR050C TPI1, sgd inviable, but with mixed phenotypes
YEL002C, inviable
YMR218C, inviable, but with mixed phenotypes

--------------------------------------------

> rm(list=ls())
> setwd("~/projects/0.network.aging.prj/0.ppi.reliability.simulation")
> list.files(path="data/", pattern="Regression")
[1] "Regression_Tc1_het.txt" "Regression_Tc1_hom.txt" "Regression_Tc2_het.txt"
[4] "Regression_Tc2_hom.txt"
> my.files = c("Regression_Tc1_het.txt", "Regression_Tc1_hom.txt", "Regression_Tc2_het.txt", "Regression_Tc2_hom.txt")
> # I removed the single quotes in these files
>
> tb1 = read.table(paste('data/',my.files[1],sep=''), header=T, sep='\t', fill=T)
> tb2 = read.table(paste('data/',my.files[2],sep=''), header=T, sep='\t', fill=T)
> tb3 = read.table(paste('data/',my.files[3],sep=''), header=T, sep='\t', fill=T)
> tb4 = read.table(paste('data/',my.files[4],sep=''), header=T, sep='\t', fill=T)
>
> tb1$AnnotationFlag = ifelse( is.na(tb1$YPD), 'questionable1', 'verified1' )
> table(tb1$AnnotationFlag)

questionable1     verified1
          174          5744
> # questional verified
> # 174         5744
>
> tb3$AnnotationFlag = ifelse( is.na(tb3$YPD), 'questionable2', 'verified2' )
> table(tb3$AnnotationFlag)

questionable2     verified2
          152          5766
> # quetionable    verified
> # 152        5766
> hist(tb3$YPD)
>
> tb2$GrowthFlag = ifelse( is.na(tb2$YPD), 'nogrowth1', 'growth1' )
> table(tb2$GrowthFlag)

growth1 nogrowth1
     4659      1259
> # growth nogrowth
> # 4659     1259
>
> tb4$GrowthFlag = ifelse( is.na(tb4$YPD), 'nogrowth2', 'growth2' )
> table(tb4$GrowthFlag)

growth2 nogrowth2
     4718      1200
> # growth nogrowth
> # 4718     1200
>
> tb12 = merge(tb1,tb2, by.x='orf', by.y='orf')
> tb12 = tb12[, c(1,2,grep('Flag',names(tb12)))]
>
> tb34 = merge(tb3,tb4, by.x='orf', by.y='orf')
> tb34 = tb34[, c(1,2,grep('Flag',names(tb34)))]
>
> names(tb12) = c("orf",'gene','Anno1','Growth1')
> names(tb34) = c('orf','gene','Anno2','Growth2')
> str(tb12)
'data.frame':    5918 obs. of 4 variables:
$ orf    : Factor w/ 5918 levels "YAL001C","YAL002W",..: 1 2 3 4 5 6 7 8 9 10 ...
$ gene   : Factor w/ 5915 levels "37165","AAC1",..: 3133 3395 750 3424 2971 803 906 2912 1470 3425 ...
$ Anno1 : chr "verified1" "verified1" "verified1" "verified1" ...
$ Growth1: chr "growth1" "growth1" "nogrowth1" "growth1" ...
> str(tb34)
'data.frame':    5918 obs. of 4 variables:
$ orf    : Factor w/ 5918 levels "YAL001C","YAL002W",..: 1 2 3 4 5 6 7 8 9 10 ...
$ gene   : Factor w/ 5915 levels "37165","AAC1",..: 3133 3395 750 3424 2971 803 906 2912 1470 3425 ...
$ Anno2 : chr "verified2" "verified2" "verified2" "verified2" ...
$ Growth2: chr "growth2" "growth2" "nogrowth2" "growth2" ...
>
> tb = merge(tb12,tb34, by.x='orf', by.y='orf')
> head(tb)
      orf gene.x     Anno1   Growth1 gene.y     Anno2   Growth2
1 YAL001C    TFC3 verified1   growth1    TFC3 verified2   growth2
2 YAL002W    VPS8 verified1   growth1    VPS8 verified2   growth2
3 YAL003W    EFB1 verified1 nogrowth1    EFB1 verified2 nogrowth2
4 YAL004W YAL004W verified1   growth1 YAL004W verified2   growth2
5 YAL005C    ssa1 verified1   growth1    ssa1 verified2   growth2
6 YAL007C    ERP2 verified1   growth1    ERP2 verified2   growth2
> tb$name.check= ifelse(tb$gene.x == tb$gene.y, T, F)
> table(tb$name.check)

TRUE
5918
> # tb[ tb$name.check==F, ]
> length( grep('question', tb$Anno1) ) #174
[1] 174
> length( grep('question', tb$Anno2) ) #152
[1] 152
>
> #tbReal = tb[ tb$Anno1=='verified1' & tb$Anno2=='verified2', ]
> tbReal = tb[ tb$Anno1=='verified1' | tb$Anno2=='verified2', ] #At least one Het growth. The other one might be an error?!
> length( grep('question', tbReal$Anno1) ) #28
[1] 28
> length( grep('question', tbReal$Anno2) ) #6
[1] 6
>
> table(tbReal$Growth1, tbReal$Growth2)

            growth2 nogrowth2
growth1      4552         5
nogrowth1      63      1152
> #           growth2 nogrowth2
> # growth1      4552         5
> # nogrowth1      63      1152
>
> # Verify some manually on SGD.
> tbReal[tbReal$Growth1=='nogrowth1' & tbReal$Growth2=='nogrowth2', ] [1:10,]
         orf    gene.x     Anno1   Growth1    gene.y     Anno2   Growth2
3    YAL003W      EFB1 verified1 nogrowth1      EFB1 verified2 nogrowth2
15   YAL016W      tpd3 verified1 nogrowth1      tpd3 verified2 nogrowth2
24   YAL025C     mak16 verified1 nogrowth1     mak16 verified2 nogrowth2
31   YAL032C     PRP45 verified1 nogrowth1     PRP45 verified2 nogrowth2
32   YAL033W      POP5 verified1 nogrowth1      POP5 verified2 nogrowth2
34 YAL034W-A      MTW1 verified1 nogrowth1      MTW1 verified2 nogrowth2
35 YAL035C-A YAL035C-A verified1 nogrowth1 YAL035C-A verified2 nogrowth2
39   YAL038W     cdc19 verified1 nogrowth1     cdc19 verified2 nogrowth2
42   YAL041W     cdc24 verified1 nogrowth1     cdc24 verified2 nogrowth2
44   YAL043C      pta1 verified1 nogrowth1      pta1 verified2 nogrowth2
   name.check
3        TRUE
15       TRUE
24       TRUE
31       TRUE
32       TRUE
34       TRUE
35       TRUE
39       TRUE
42       TRUE
44       TRUE
> tbReal[tbReal$Growth1=='growth1' & tbReal$Growth2=='nogrowth2', ]
         orf gene.x     Anno1 Growth1 gene.y     Anno2   Growth2 name.check
956 YDR050C   TPI1 verified1 growth1   TPI1 verified2 nogrowth2       TRUE
1448 YEL002C   WBP1 verified1 growth1   WBP1 verified2 nogrowth2       TRUE
2685 YIL078W   THS1 verified1 growth1   THS1 verified2 nogrowth2       TRUE
4152 YML126C ERG13 verified1 growth1 ERG13 verified2 nogrowth2       TRUE
4383 YMR218C TRS130 verified1 growth1 TRS130 verified2 nogrowth2       TRUE
>
> tbReal$essenflag = ifelse( tbReal$Growth1=='growth1' & tbReal$Growth2=='growth2', 'nonessential',
+                        ifelse(tbReal$Growth1=='nogrowth1' & tbReal$Growth2=='nogrowth2', 'essential', 'abnormal') )
> head(tbReal)
      orf gene.x     Anno1   Growth1 gene.y     Anno2   Growth2 name.check
1 YAL001C    TFC3 verified1   growth1    TFC3 verified2   growth2       TRUE
2 YAL002W    VPS8 verified1   growth1    VPS8 verified2   growth2       TRUE
3 YAL003W    EFB1 verified1 nogrowth1    EFB1 verified2 nogrowth2       TRUE
4 YAL004W YAL004W verified1   growth1 YAL004W verified2   growth2       TRUE
5 YAL005C    ssa1 verified1   growth1    ssa1 verified2   growth2       TRUE
6 YAL007C    ERP2 verified1   growth1    ERP2 verified2   growth2       TRUE
     essenflag
1 nonessential
2 nonessential
3    essential
4 nonessential
5 nonessential
6 nonessential
> table(tbReal$essenflag)

    abnormal    essential nonessential
          68         1152         4552
> # abnormal    essential nonessential
> # 68         1152         4552
> #Good, consistent results. Use these for further analysis. 2013 Oct 29
>
> write.csv(tbReal, "SummaryRegressionHetHom2013Oct29.csv", quote=F, row.names=F )
>
>

Deletion growth fitness data contain single quotes that can cause loading problem in R

yeast deletion growth fitness data loading problems in R.

my.files = c("Regression_Tc1_het.txt", "Regression_Tc1_hom.txt", "Regression_Tc2_het.txt", "Regression_Tc2_hom.txt")

tb1 = read.table(paste('data/',my.files[1],sep=''), header=T, sep='\t|\s', fill=T)

# visual check show 4 genes contain single quotes that cause loading problems in R.
ace:0.ppi.reliability.simulation hongqin$ grep "'" data/Regression_Tc1_het.txt
MRF1'    YBR026C    3    27    0.17065    0.02195    0.01897    0.96921    0.97176    0.97954    0.97909    0.98818
AAP1'    YHR047C    4    44    3.36638    1.70583    0.04385    0.98022    0.99351    0.99193    0.97897    1.02282
IMP2'    YIL154C    2    22    1.00960    0.36309    0.02293    0.98047    0.97849    0.95754    0.96384    0.97215
MSF1'    YLR168C    0    0

markdown output of 20131028.growth.fitness.R

Conver 20131028.growth.fitness.R to Rmd. Output is here. Retro added on Nov 20, 2017.

Based on 20131028 code and its 20151012 corrections of comma in gene names. For deletion fitness data, download from http://www-deletion.stanford.edu/YDPM/YDPM_index.html

“In the Deletion Project, strains from the deletion collection were monitored under 9 different media conditions selected for the study of mitochondrial function. 5791 heterozygous diploid and 4706 homozygous diploid deletion strains were monitored in parallel using molecular barcodes on fermentable (YPD, YPDGE) and non-fermentable substrates (YPG, YPE, YPL). The YDPM database contains both the raw data and growth rates calculated for each strain in each media condition. Strains can be searched by ORF or Gene name to access growth measurements and data plots for each strain.”

What are the genotypes in the 4 yeast fitness files?

rm(list=ls())
datapath = "data"
my.files = c("Regression_Tc1_het.txt", 
             "Regression_Tc1_hom.txt", 
             "Regression_Tc2_het.txt", 
             "Regression_Tc2_hom.txt")

tb1 = read.table(paste('data/',my.files[1],sep=''), header=T, sep='\t', fill=T)
tb2 = read.table(paste('data/',my.files[2],sep=''), header=T, sep='\t', fill=T)
tb3 = read.table(paste('data/',my.files[3],sep=''), header=T, sep='\t', fill=T)
tb4 = read.table(paste('data/',my.files[4],sep=''), header=T, sep='\t', fill=T)

tb1$AnnotationFlag = ifelse( is.na(tb1$YPD), 'questionable1', 'verified1' )
table(tb1$AnnotationFlag)

## 
## questionable1     verified1 
##           174          5744

# questional verified 
# 174         5744 
x = table(tb1$orf); str(x)

##  'table' int [1:5918(1d)] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "dimnames")=List of 1
##   ..$ : chr [1:5918] "YAL001C" "YAL002W" "YAL003W" "YAL004W" ...

tb3$AnnotationFlag = ifelse( is.na(tb3$YPD), 'questionable2', 'verified2' )
table(tb3$AnnotationFlag)

## 
## questionable2     verified2 
##           152          5766

# quetionable    verified 
# 152        5766 
hist(tb3$YPD)

x = table(tb3$orf); str(x)

##  'table' int [1:5918(1d)] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "dimnames")=List of 1
##   ..$ : chr [1:5918] "YAL001C" "YAL002W" "YAL003W" "YAL004W" ...

tb2$GrowthFlag = ifelse( is.na(tb2$YPD), 'nogrowth1', 'growth1' )
table(tb2$GrowthFlag)

## 
##   growth1 nogrowth1 
##      4659      1259

# growth nogrowth 
# 4659     1259 
x = table(tb2$orf); str(x)

##  'table' int [1:5918(1d)] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "dimnames")=List of 1
##   ..$ : chr [1:5918] "YAL001C" "YAL002W" "YAL003W" "YAL004W" ...

tb4$GrowthFlag = ifelse( is.na(tb4$YPD), 'nogrowth2', 'growth2' )
table(tb4$GrowthFlag)

## 
##   growth2 nogrowth2 
##      4718      1200

# growth nogrowth 
# 4718     1200 
x = table(tb4$orf); str(x)

##  'table' int [1:5918(1d)] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "dimnames")=List of 1
##   ..$ : chr [1:5918] "YAL001C" "YAL002W" "YAL003W" "YAL004W" ...

tb12 = merge(tb1,tb2, by.x='orf', by.y='orf')
tb12 = tb12[, c(1,2,grep('Flag',names(tb12)))]

tb34 = merge(tb3,tb4, by.x='orf', by.y='orf')
tb34 = tb34[, c(1,2,grep('Flag',names(tb34)))]

names(tb12) = c("orf",'gene','Anno1','Growth1')
names(tb34) = c('orf','gene','Anno2','Growth2')
str(tb12)

## 'data.frame':    5918 obs. of  4 variables:
##  $ orf    : Factor w/ 5918 levels "YAL001C","YAL002W",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ gene   : Factor w/ 5915 levels "37165","AAC1",..: 3133 3395 750 3424 2971 803 906 2912 1470 3425 ...
##  $ Anno1  : chr  "verified1" "verified1" "verified1" "verified1" ...
##  $ Growth1: chr  "growth1" "growth1" "nogrowth1" "growth1" ...

str(tb34)

## 'data.frame':    5918 obs. of  4 variables:
##  $ orf    : Factor w/ 5918 levels "YAL001C","YAL002W",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ gene   : Factor w/ 5915 levels "37165","AAC1",..: 3133 3395 750 3424 2971 803 906 2912 1470 3425 ...
##  $ Anno2  : chr  "verified2" "verified2" "verified2" "verified2" ...
##  $ Growth2: chr  "growth2" "growth2" "nogrowth2" "growth2" ...

tb = merge(tb12,tb34, by.x='orf', by.y='orf')
head(tb)

##       orf  gene.x     Anno1   Growth1  gene.y     Anno2   Growth2
## 1 YAL001C    TFC3 verified1   growth1    TFC3 verified2   growth2
## 2 YAL002W    VPS8 verified1   growth1    VPS8 verified2   growth2
## 3 YAL003W    EFB1 verified1 nogrowth1    EFB1 verified2 nogrowth2
## 4 YAL004W YAL004W verified1   growth1 YAL004W verified2   growth2
## 5 YAL005C    ssa1 verified1   growth1    ssa1 verified2   growth2
## 6 YAL007C    ERP2 verified1   growth1    ERP2 verified2   growth2

tb$name.check= ifelse(tb$gene.x == tb$gene.y, T, F)
table(tb$name.check)

## 
## TRUE 
## 5918

# tb[ tb$name.check==F, ]
length( grep('question', tb$Anno1) ) #174

## [1] 174

length( grep('question', tb$Anno2) ) #152

## [1] 152

#tbReal = tb[ tb$Anno1=='verified1' & tb$Anno2=='verified2', ]
tbReal = tb[ tb$Anno1=='verified1' | tb$Anno2=='verified2', ]  #At least one Het growth. The other one might be an error?!
length( grep('question', tbReal$Anno1) ) #28

## [1] 28

length( grep('question', tbReal$Anno2) ) #6

## [1] 6

table(tbReal$Growth1, tbReal$Growth2)

##            
##             growth2 nogrowth2
##   growth1      4552         5
##   nogrowth1      63      1152

#           growth2 nogrowth2
# growth1      4552         5
# nogrowth1      63      1152

# Verify some manually on SGD.
tbReal[tbReal$Growth1=='nogrowth1' & tbReal$Growth2=='nogrowth2',  ] [1:10,]

##          orf    gene.x     Anno1   Growth1    gene.y     Anno2   Growth2
## 3    YAL003W      EFB1 verified1 nogrowth1      EFB1 verified2 nogrowth2
## 15   YAL016W      tpd3 verified1 nogrowth1      tpd3 verified2 nogrowth2
## 24   YAL025C     mak16 verified1 nogrowth1     mak16 verified2 nogrowth2
## 31   YAL032C     PRP45 verified1 nogrowth1     PRP45 verified2 nogrowth2
## 32   YAL033W      POP5 verified1 nogrowth1      POP5 verified2 nogrowth2
## 34 YAL034W-A      MTW1 verified1 nogrowth1      MTW1 verified2 nogrowth2
## 35 YAL035C-A YAL035C-A verified1 nogrowth1 YAL035C-A verified2 nogrowth2
## 39   YAL038W     cdc19 verified1 nogrowth1     cdc19 verified2 nogrowth2
## 42   YAL041W     cdc24 verified1 nogrowth1     cdc24 verified2 nogrowth2
## 44   YAL043C      pta1 verified1 nogrowth1      pta1 verified2 nogrowth2
##    name.check
## 3        TRUE
## 15       TRUE
## 24       TRUE
## 31       TRUE
## 32       TRUE
## 34       TRUE
## 35       TRUE
## 39       TRUE
## 42       TRUE
## 44       TRUE

tbReal[tbReal$Growth1=='growth1' & tbReal$Growth2=='nogrowth2',  ]

##          orf gene.x     Anno1 Growth1 gene.y     Anno2   Growth2
## 956  YDR050C   TPI1 verified1 growth1   TPI1 verified2 nogrowth2
## 1448 YEL002C   WBP1 verified1 growth1   WBP1 verified2 nogrowth2
## 2685 YIL078W   THS1 verified1 growth1   THS1 verified2 nogrowth2
## 4152 YML126C  ERG13 verified1 growth1  ERG13 verified2 nogrowth2
## 4383 YMR218C TRS130 verified1 growth1 TRS130 verified2 nogrowth2
##      name.check
## 956        TRUE
## 1448       TRUE
## 2685       TRUE
## 4152       TRUE
## 4383       TRUE

#what was I doing here? 
tbReal$essenflag = ifelse( tbReal$Growth1=='growth1' & tbReal$Growth2=='growth2', 'nonessential', 
                       ifelse(tbReal$Growth1=='nogrowth1' & tbReal$Growth2=='nogrowth2', 'essential', 'abnormal') )
head(tbReal)

##       orf  gene.x     Anno1   Growth1  gene.y     Anno2   Growth2
## 1 YAL001C    TFC3 verified1   growth1    TFC3 verified2   growth2
## 2 YAL002W    VPS8 verified1   growth1    VPS8 verified2   growth2
## 3 YAL003W    EFB1 verified1 nogrowth1    EFB1 verified2 nogrowth2
## 4 YAL004W YAL004W verified1   growth1 YAL004W verified2   growth2
## 5 YAL005C    ssa1 verified1   growth1    ssa1 verified2   growth2
## 6 YAL007C    ERP2 verified1   growth1    ERP2 verified2   growth2
##   name.check    essenflag
## 1       TRUE nonessential
## 2       TRUE nonessential
## 3       TRUE    essential
## 4       TRUE nonessential
## 5       TRUE nonessential
## 6       TRUE nonessential

table(tbReal$essenflag)

## 
##     abnormal    essential nonessential 
##           68         1152         4552

# abnormal    essential nonessential 
# 68         1152         4552 
#Good, consistent results. Use these for further analysis. 2013 Oct 29
# 20151012, found two "TRUE" in orfs
length(tbReal$orf) #5772

## [1] 5772

length(unique(tbReal$orf)) #5772

## [1] 5772

#write.csv(tbReal, "SummaryRegressionHetHom2015Oct12.csv", row.names = F ) #change 20151012
#write.csv(tbReal, "SummaryRegressionHetHom2015Oct12.csv",  quote=F, row.names=F )

Open Notebook