This site is to serve as my note-book and to effectively communicate with my students and collaborators. Every now and then, a blog may be of interest to other researchers or teachers. Views in this blog are my own. All rights of research results and findings on this blog are reserved. See also http://youtube.com/c/hongqin @hongqin
Monday, December 23, 2013
degree>=6, GIN+PPI realibility simulation, popSize =100
So, if degree is larger enough, $p$ has to be less than 1 to cover the natural variations. This is actually very reasonable, because not all GIN+PPI are activated.
Sunday, December 22, 2013
degree>=4, PPI+GIN, reliability simulation, popsize=100
Each lambda loop take about 6 hours, in mactower.
This simulation was run twice in mactower. Results are similar.
Summary:
For degree >=4, lambda = 1/81, p=0.8~1.0 will cover yeast natural variations
lambda = 1/243, p=0.6~0.75 will cover yeast natural variations.
So, lambda and $p$ trade-off with each other. I can plot the average lifespan as contour lines in lambda ~ p plots.
Saturday, December 21, 2013
rgompertz, Gompertz random number
##########################################
#inverse of gompertz CDF
# see http://hongqinlab.blogspot.com/2013/06/median-lifespan-of-2-parameter-gompertz.html
#inverse.gomp.CDF = function(R,G,y) { (1/G)*log(1 - (G/R)*log(1-y) ) }
#see generate random number with a given distribution
# http://hongqinlab.blogspot.com/2013/12/generate-gompertz-random-numbers.html
#generate Gompertz random numbers
rgompertz = function(R,G, n){
x.uniform = runif(n)
inverse.gomp.CDF = function(R,G,y) { (1/G)*log(1 - (G/R)*log(1-y) ) }
x.gompertz = inverse.gomp.CDF(0.001,0.2, x.uniform)
return(x.gompertz)
}
rgompertz(0.001,0.2,100)
Use simulated Gompertz random number to test flexsurv Gompertz and Weibull fitting results
Summary: I experimented the sample size. For 50 cells, the Gompertz model may be the better fit 4 out of 5 times. For 100 cells, Gompertz is better than Weibull maybe 9 out of 10 times. For 500 cells, Gompertz is always bettern than the Weibull model.
Conclusion: For most yeast life span assay, it is not always easy to tell whether Gompertz or Weibull is a better fit.
Comments: The actual results probably also depend on R and G, which determine the variance.
code: 20131221.gompertz.simulation.R
# generate gompertz random numbers
# fit with flexsurv Gompertz and weibull models
#inverse of gompertz CDF
# see http://hongqinlab.blogspot.com/2013/06/median-lifespan-of-2-parameter-gompertz.html
inverse.gomp.CDF = function(R,G,y) { (1/G)*log(1 - (G/R)*log(1-y) ) }
#see generate random number with a given distribution
# http://hongqinlab.blogspot.com/2013/12/generate-gompertz-random-numbers.html
x.uniform = runif(60)
hist(x.uniform)
x.gompertz = inverse.gomp.CDF(0.001,0.2, x.uniform)
hist(x.gompertz)
summary(x.gompertz)
source("lifespan.r")
tb = calculate.s(x.gompertz)
plot(tb$s ~ tb$t)
require(flexsurv)
require(flexsurv)
lifespan = x.gompertz
lifespanGomp = flexsurvreg(formula = Surv(lifespan) ~ 1, dist = 'gompertz')
lifespanWeib = flexsurvreg(formula = Surv(lifespan) ~ 1, dist = 'weibull')
c(lifespanWeib$AIC, lifespanGomp$AIC, lifespanWeib$AIC - lifespanGomp$AIC )
Conclusion: For most yeast life span assay, it is not always easy to tell whether Gompertz or Weibull is a better fit.
Comments: The actual results probably also depend on R and G, which determine the variance.
code: 20131221.gompertz.simulation.R
# generate gompertz random numbers
# fit with flexsurv Gompertz and weibull models
#inverse of gompertz CDF
# see http://hongqinlab.blogspot.com/2013/06/median-lifespan-of-2-parameter-gompertz.html
inverse.gomp.CDF = function(R,G,y) { (1/G)*log(1 - (G/R)*log(1-y) ) }
#see generate random number with a given distribution
# http://hongqinlab.blogspot.com/2013/12/generate-gompertz-random-numbers.html
x.uniform = runif(60)
hist(x.uniform)
x.gompertz = inverse.gomp.CDF(0.001,0.2, x.uniform)
hist(x.gompertz)
summary(x.gompertz)
source("lifespan.r")
tb = calculate.s(x.gompertz)
plot(tb$s ~ tb$t)
require(flexsurv)
require(flexsurv)
lifespan = x.gompertz
lifespanGomp = flexsurvreg(formula = Surv(lifespan) ~ 1, dist = 'gompertz')
lifespanWeib = flexsurvreg(formula = Surv(lifespan) ~ 1, dist = 'weibull')
c(lifespanWeib$AIC, lifespanGomp$AIC, lifespanWeib$AIC - lifespanGomp$AIC )
Generate random numbers with a given distribution
This is related to generating random numbers from arbitrary distributions. Basically, inverse of an uniform random number can be used.
From: http://www.ece.virginia.edu/mv/edu/prob/stat/random-number-generation.pdf
See also
http://epub.ub.uni-muenchen.de/1716/1/paper_338.pdf
What is the inverse function of Gompertz and Weibull distribution?
From: http://www.ece.virginia.edu/mv/edu/prob/stat/random-number-generation.pdf
See also
http://epub.ub.uni-muenchen.de/1716/1/paper_338.pdf
What is the inverse function of Gompertz and Weibull distribution?
Demo codes to calculate mortality rates, then fit with flexsurv
I generated some normally distributed random lifespan, fit with flexsurv for Gomertz and Weibull. Interestingly, it fits Weibull better than Gompertz.
source("lifespan.r")
lifespan = round(rnorm(1000)*10, 0) + 30
tb = calculate.s(lifespan)
head(tb)
tb$ds=NA; tb$dt=NA
tb$dt[1] = tb$s[1]
tb$ds[1] = 1 - tb$s[1]
tb$mortality.rate[1] = tb$ds[1] / tb$dt[1]
for( j in 2:length(tb[,1])) {
tb$ds[j] = tb$s[j-1] - tb$s[j]
tb$dt[j] = -tb$t[j-1] + tb$t[j]
tb$mortality.rate[j] = tb$ds[j] / ( tb$s[j] * tb$dt[j])
}
plot( tb$s ~ tb$t)
plot( log10(tb$mortality.rate) ~ tb$t ) #linear for Gompertz
plot( log10(tb$mortality.rate) ~ log10(tb$t ) ) #linear for Weibull
tb2 = tb2[-length(tb2[1,]), ]
summary(lm(log10(tb2$mortality.rate) ~ tb2$t, na.rm=T ))
summary(lm(log10(tb$mortality.rate) ~ log10(tb$t), na.rm=T ))
log(m)~t is the linear Gompertz plot. It can be seen that Gompertz is better suited for the early life stage.
log(m)~log(t) is the linear Weibull plot. Interesting, normally distributed lifespan is better fitted with Weibull model, especially for the late life stage.
Not surprsingly, Weibull model give better likelihood fitting.
source("lifespan.r")
lifespan = round(rnorm(1000)*10, 0) + 30
tb = calculate.s(lifespan)
head(tb)
tb$ds=NA; tb$dt=NA
tb$dt[1] = tb$s[1]
tb$ds[1] = 1 - tb$s[1]
tb$mortality.rate[1] = tb$ds[1] / tb$dt[1]
for( j in 2:length(tb[,1])) {
tb$ds[j] = tb$s[j-1] - tb$s[j]
tb$dt[j] = -tb$t[j-1] + tb$t[j]
tb$mortality.rate[j] = tb$ds[j] / ( tb$s[j] * tb$dt[j])
}
plot( tb$s ~ tb$t)
plot( log10(tb$mortality.rate) ~ tb$t ) #linear for Gompertz
plot( log10(tb$mortality.rate) ~ log10(tb$t ) ) #linear for Weibull
tb2 = tb2[-length(tb2[1,]), ]
summary(lm(log10(tb2$mortality.rate) ~ tb2$t, na.rm=T ))
summary(lm(log10(tb$mortality.rate) ~ log10(tb$t), na.rm=T ))
log(m)~t is the linear Gompertz plot. It can be seen that Gompertz is better suited for the early life stage.
log(m)~log(t) is the linear Weibull plot. Interesting, normally distributed lifespan is better fitted with Weibull model, especially for the late life stage.
Not surprsingly, Weibull model give better likelihood fitting.
Birth month as a predictor of social stress
Children with brith month close to the cutoff of school entry have compete with children that is nearly one year older. So, this is an example that birth month may have an effect on children's emotional development and academic performance.
We could generate a survey to see whether month of birth is associated with GPAs and other stress factors.
Ref:
Gavorilov's birth month study.
http://www.hindawi.com/journals/jar/2011/104616/#!
We could generate a survey to see whether month of birth is associated with GPAs and other stress factors.
Ref:
Gavorilov's birth month study.
http://www.hindawi.com/journals/jar/2011/104616/#!
Friday, December 20, 2013
Network aging simulation notes, week Dec 16-20, 2013
Product $n$ and $p$ will decide the average links for the essential node. So, if $np$<1, most of the networks are dead. When $np$ is small, simulation take forever to generate enough viable individuals. Even in this case, it is looking for the extreme cases, so it is not a 'true' representation of the original distribution. From the biological perspective, it probably does not matter, because only viable new cells will be observed.
It can be seen that small $p$ values favor Gompertz model and large $p$ values favor Weibull model. (Because p=1.00 mean perfect network and Weibull aging).
Because the lattice network actually predict binomial mortality model, and only early stage is exponential. So, fitting the entire curve with Gompertz and Weibull is not a good way to test the prediction. It can be seen that $t_0 = (1/p -1) / lambda$, so small $p$ lead to large $t_0$, and large $p$ leads to small $t_0$. It can also be shown that small lambda lead to large $t0$. So, small $p$ and small $lambda$ should should give to better Gompertz model.
Mixture model, network aging with degree distribution
Import Costanzo 09 genetic interaction data into R.
Running read.table ("sgadata_costanzo2009_lenientCutoff_101120.txt", sep="\t") will lead to some warnings. So, I import this file into Excel, visually check it and find it seems fine. I then saved it as
"sgadata_costanzo2009_lenientCutoff_101120.csv", which can be read.csv in R without warnings.
Thursday, December 19, 2013
Convert DIP IDs to SGD ORFs, code only
#convert DIP IDs to SGD ORFs
# setwd("~/databases/DIP/sandbox")
list.files()
dic = read.table("_matchedSGD2DIP_IDs.txt", sep="\t", colClass=c("character","character"))
names(dic) = c("ORF","DIP")
str(dic)
DIP.tb = read.csv("Scere20131031.csv", colClass=c(rep("character",18)))
str(DIP.tb)
names(DIP.tb)
head(DIP.tb)
summary(DIP.tb)
#There non-Sce entries in DIP data. I need to remove them from yeast PPI
# ... to do
#DIP id formats in interaction and sequence files are not the same
DIP.tb$ID.interactor.A[1:10]
#[1] "DIP-844N|refseq:NP_010949|uniprotkb:P40020" "DIP-777N|refseq:NP_010710|uniprotkb:P32578"
#[3] "DIP-814N|refseq:NP_009655|uniprotkb:P22219" "DIP-33N|refseq:NP_010513|uniprotkb:P11978"
DIP.tb$A.DIP = gsub(pattern="\\|.+", replacement="", DIP.tb$ID.interactor.A)
head(DIP.tb$A.DIP)
DIP.tb$B.DIP = gsub(pattern="\\|.+", replacement="", DIP.tb$ID.interactor.B)
head(DIP.tb$B.DIP)
dic$DIP[1:5]
#[1] "dip:DIP-6088N|uniprot:P98002" "dip:DIP-8242N|refseq:NP_009310|uniprot:P03875"
#[3] "dip:DIP-3040N|refseq:NP_009312|uniprot:P00856" "dip:DIP-3038N|refseq:NP_009313|uniprot:P00854"
#[5] "dip:DIP-350N|refseq:YP_209217|uniprot:P00157"
dic$DIP2 = gsub(pattern="dip:", replacement="", dic$DIP)
head(dic$DIP2)
dic$DIP3 = gsub(pattern="\\|.+", replacement="", dic$DIP2)
head(dic$DIP3)
intersect(dic$DIP3, DIP.tb$A.DIP)[1:10] #it worked.
intersect(dic$DIP3, DIP.tb$B.DIP)[1:10] #it worked.
# MAP DIP to SGD ORFs
DIP.tb$ORF1 = dic$ORF[match(DIP.tb$A.DIP, dic$DIP3)]
head(DIP.tb$ORF1)
summary(DIP.tb$ORF1) #it worked.
DIP.tb$ORF2 = dic$ORF[match(DIP.tb$B.DIP, dic$DIP3)]
head(DIP.tb$ORF2)
summary(DIP.tb$ORF2)
head(DIP.tb[,1:10])
DIP.tb= DIP.tb[, c(21,22,19,20,1:18)]
head(DIP.tb[,1:6])
write.csv(DIP.tb, "_SceDIP_withORFs.20131219.csv",row.names=F)
# setwd("~/databases/DIP/sandbox")
list.files()
dic = read.table("_matchedSGD2DIP_IDs.txt", sep="\t", colClass=c("character","character"))
names(dic) = c("ORF","DIP")
str(dic)
DIP.tb = read.csv("Scere20131031.csv", colClass=c(rep("character",18)))
str(DIP.tb)
names(DIP.tb)
head(DIP.tb)
summary(DIP.tb)
#There non-Sce entries in DIP data. I need to remove them from yeast PPI
# ... to do
#DIP id formats in interaction and sequence files are not the same
DIP.tb$ID.interactor.A[1:10]
#[1] "DIP-844N|refseq:NP_010949|uniprotkb:P40020" "DIP-777N|refseq:NP_010710|uniprotkb:P32578"
#[3] "DIP-814N|refseq:NP_009655|uniprotkb:P22219" "DIP-33N|refseq:NP_010513|uniprotkb:P11978"
DIP.tb$A.DIP = gsub(pattern="\\|.+", replacement="", DIP.tb$ID.interactor.A)
head(DIP.tb$A.DIP)
DIP.tb$B.DIP = gsub(pattern="\\|.+", replacement="", DIP.tb$ID.interactor.B)
head(DIP.tb$B.DIP)
dic$DIP[1:5]
#[1] "dip:DIP-6088N|uniprot:P98002" "dip:DIP-8242N|refseq:NP_009310|uniprot:P03875"
#[3] "dip:DIP-3040N|refseq:NP_009312|uniprot:P00856" "dip:DIP-3038N|refseq:NP_009313|uniprot:P00854"
#[5] "dip:DIP-350N|refseq:YP_209217|uniprot:P00157"
dic$DIP2 = gsub(pattern="dip:", replacement="", dic$DIP)
head(dic$DIP2)
dic$DIP3 = gsub(pattern="\\|.+", replacement="", dic$DIP2)
head(dic$DIP3)
intersect(dic$DIP3, DIP.tb$A.DIP)[1:10] #it worked.
intersect(dic$DIP3, DIP.tb$B.DIP)[1:10] #it worked.
# MAP DIP to SGD ORFs
DIP.tb$ORF1 = dic$ORF[match(DIP.tb$A.DIP, dic$DIP3)]
head(DIP.tb$ORF1)
summary(DIP.tb$ORF1) #it worked.
DIP.tb$ORF2 = dic$ORF[match(DIP.tb$B.DIP, dic$DIP3)]
head(DIP.tb$ORF2)
summary(DIP.tb$ORF2)
head(DIP.tb[,1:10])
DIP.tb= DIP.tb[, c(21,22,19,20,1:18)]
head(DIP.tb[,1:6])
write.csv(DIP.tb, "_SceDIP_withORFs.20131219.csv",row.names=F)
Convert DIP IDs to SGD ORFs, code with runningn results
$ R -f _20131219.convertDIPtoSGD.R
> #convert DIP IDs to SGD ORFs
>
> # setwd("~/databases/DIP/sandbox")
> list.files()
[1] "_20131219.convertDIPtoSGD.R" "_matchedSGD2DIP_IDs.txt"
[3] "_out_DIP2SGD.txt" "_out_SGD2DIP.txt"
[5] "_test.faa" "_tmp.txt"
[7] "fasta20131201.seq" "fasta20131201.seq.phr"
[9] "fasta20131201.seq.pin" "fasta20131201.seq.psq"
[11] "formatdb.log" "s288c-prot.faa"
[13] "s288c-prot.faa.phr" "s288c-prot.faa.pin"
[15] "s288c-prot.faa.psq" "Scere20131031.csv"
>
> dic = read.table("_matchedSGD2DIP_IDs.txt", sep="\t", colClass=c("character","character"))
> names(dic) = c("ORF","DIP")
> str(dic)
'data.frame': 5160 obs. of 2 variables:
$ ORF: chr "Q0045" "Q0050" "Q0080" "Q0085" ...
$ DIP: chr "dip:DIP-6088N|uniprot:P98002" "dip:DIP-8242N|refseq:NP_009310|uniprot:P03875" "dip:DIP-3040N|refseq:NP_009312|uniprot:P00856" "dip:DIP-3038N|refseq:NP_009313|uniprot:P00854" ...
>
> DIP.tb = read.csv("Scere20131031.csv", colClass=c(rep("character",18)))
> str(DIP.tb)
'data.frame': 22584 obs. of 18 variables:
$ ID.interactor.A : chr "DIP-844N|refseq:NP_010949|uniprotkb:P40020" "DIP-777N|refseq:NP_010710|uniprotkb:P32578" "DIP-814N|refseq:NP_009655|uniprotkb:P22219" "DIP-33N|refseq:NP_010513|uniprotkb:P11978" ...
$ ID.interactor.B : chr "DIP-871N|refseq:NP_010481|uniprotkb:P42073" "DIP-18N|refseq:NP_010765|uniprotkb:P06782" "DIP-97N|refseq:NP_013341|uniprotkb:P22543" "DIP-33N|refseq:NP_010513|uniprotkb:P11978" ...
$ Alt..ID.interactor.A : chr " -" " -" " -" "-" ...
$ Alt..ID.interactor.B : chr "-" "-" "-" "-" ...
$ Alias.es..interactor.A : chr "-" "-" "-" "-" ...
$ Alias.es..interactor.B : chr "-" "-" "-" "-" ...
$ Interaction.detection.method.s.: chr "MI:0018(two hybrid)" "MI:0018(two hybrid)|MI:0018(two hybrid)|MI:0019(coimmunoprecipitation)|MI:0018(two hybrid)|MI:0004(affinity chromatography tech"| __truncated__ "MI:0019(coimmunoprecipitation)|MI:0030(cross-linking study)" "MI:0018(two hybrid)" ...
$ Publication.1st.author.s. : chr "-" "-" "-" "-" ...
$ Publication.Identifier.s. : chr "pubmed:9196079|pubmed:DIP-356S" "pubmed:1496382|pubmed:DIP-448S|pubmed:9121458|pubmed:DIP-723S|pubmed:1496382|pubmed:DIP-448S|pubmed:7813428|pubmed:DIP-724S|pub"| __truncated__ "pubmed:8387919|pubmed:DIP-391S|pubmed:8387919|pubmed:DIP-391S" "pubmed:1946372|pubmed:DIP-50S" ...
$ Taxid.interactor.A : chr "taxid:4932(Saccharomyces cerevisiae)" "taxid:4932(Saccharomyces cerevisiae)" "taxid:4932(Saccharomyces cerevisiae)" "taxid:4932(Saccharomyces cerevisiae)" ...
$ Taxid.interactor.B : chr "taxid:4932(Saccharomyces cerevisiae)" "taxid:4932(Saccharomyces cerevisiae)" "taxid:4932(Saccharomyces cerevisiae)" "taxid:4932(Saccharomyces cerevisiae)" ...
$ Interaction.type.s. : chr "MI:0218(physical interaction)" "MI:0218(physical interaction)|MI:0218(physical interaction)|MI:0218(physical interaction)|MI:0218(physical interaction)|MI:0218"| __truncated__ "MI:0218(physical interaction)|MI:0407(direct interaction)" "MI:0218(physical interaction)" ...
$ Source.database.s. : chr "MI:0465(dip)" "MI:0465(dip)" "MI:0465(dip)" "MI:0465(dip)" ...
$ Interaction.identifier.s. : chr "DIP-536E" "DIP-539E" "DIP-540E" "DIP-541E" ...
$ Confidence.value.s. : chr "dip-quality-status:core" "dip-quality-status:core" "dip-quality-status:core" "dip-quality-status:core" ...
$ Processing.Status : chr "dip:0002(small scale)" "dip:0002(small scale)|dip:0002(small scale)|dip:0002(small scale)|dip:0002(small scale)|dip:0002(small scale)|dip:0002(small sc"| __truncated__ "dip:0002(small scale)|dip:0002(small scale)" "dip:0002(small scale)" ...
$ X : chr "" "" "" "" ...
$ X.1 : chr "-" "-" "-" "-" ...
> names(DIP.tb)
[1] "ID.interactor.A" "ID.interactor.B"
[3] "Alt..ID.interactor.A" "Alt..ID.interactor.B"
[5] "Alias.es..interactor.A" "Alias.es..interactor.B"
[7] "Interaction.detection.method.s." "Publication.1st.author.s."
[9] "Publication.Identifier.s." "Taxid.interactor.A"
[11] "Taxid.interactor.B" "Interaction.type.s."
[13] "Source.database.s." "Interaction.identifier.s."
[15] "Confidence.value.s." "Processing.Status"
[17] "X" "X.1"
> head(DIP.tb)
ID.interactor.A
1 DIP-844N|refseq:NP_010949|uniprotkb:P40020
2 DIP-777N|refseq:NP_010710|uniprotkb:P32578
3 DIP-814N|refseq:NP_009655|uniprotkb:P22219
4 DIP-33N|refseq:NP_010513|uniprotkb:P11978
5 DIP-698N|refseq:NP_010270|uniprotkb:P15646
6 DIP-982N|refseq:NP_011399|uniprotkb:P26309
ID.interactor.B Alt..ID.interactor.A
1 DIP-871N|refseq:NP_010481|uniprotkb:P42073 -
2 DIP-18N|refseq:NP_010765|uniprotkb:P06782 -
3 DIP-97N|refseq:NP_013341|uniprotkb:P22543 -
4 DIP-33N|refseq:NP_010513|uniprotkb:P11978 -
5 DIP-746N|refseq:NP_013090|uniprotkb:P33750 -
6 DIP-293N|refseq:NP_011429|uniprotkb:P40957 -
Alt..ID.interactor.B Alias.es..interactor.A Alias.es..interactor.B
1 - - -
2 - - -
3 - - -
4 - - -
5 - - -
6 - - -
Interaction.detection.method.s.
1 MI:0018(two hybrid)
2 MI:0018(two hybrid)|MI:0018(two hybrid)|MI:0019(coimmunoprecipitation)|MI:0018(two hybrid)|MI:0004(affinity chromatography technology)|MI:0019(coimmunoprecipitation)|MI:0676(tandem affinity purification)
3 MI:0019(coimmunoprecipitation)|MI:0030(cross-linking study)
4 MI:0018(two hybrid)
5 MI:0019(coimmunoprecipitation)
6 MI:0018(two hybrid)|MI:0018(two hybrid)
Publication.1st.author.s.
1 -
2 -
3 -
4 -
5 -
6 -
Publication.Identifier.s.
1 pubmed:9196079|pubmed:DIP-356S
2 pubmed:1496382|pubmed:DIP-448S|pubmed:9121458|pubmed:DIP-723S|pubmed:1496382|pubmed:DIP-448S|pubmed:7813428|pubmed:DIP-724S|pubmed:9121458|pubmed:DIP-723S|pubmed:7813428|pubmed:DIP-724S|pubmed:11805826|pubmed:DIP-1768S
3 pubmed:8387919|pubmed:DIP-391S|pubmed:8387919|pubmed:DIP-391S
4 pubmed:1946372|pubmed:DIP-50S
5 pubmed:8508778|pubmed:DIP-198S
6 pubmed:10848588|pubmed:DIP-1428S|pubmed:9461437|pubmed:DIP-189S
Taxid.interactor.A Taxid.interactor.B
1 taxid:4932(Saccharomyces cerevisiae) taxid:4932(Saccharomyces cerevisiae)
2 taxid:4932(Saccharomyces cerevisiae) taxid:4932(Saccharomyces cerevisiae)
3 taxid:4932(Saccharomyces cerevisiae) taxid:4932(Saccharomyces cerevisiae)
4 taxid:4932(Saccharomyces cerevisiae) taxid:4932(Saccharomyces cerevisiae)
5 taxid:4932(Saccharomyces cerevisiae) taxid:4932(Saccharomyces cerevisiae)
6 taxid:4932(Saccharomyces cerevisiae) taxid:4932(Saccharomyces cerevisiae)
Interaction.type.s.
1 MI:0218(physical interaction)
2 MI:0218(physical interaction)|MI:0218(physical interaction)|MI:0218(physical interaction)|MI:0218(physical interaction)|MI:0218(physical interaction)|MI:0218(physical interaction)|MI:0915(physical association)
3 MI:0218(physical interaction)|MI:0407(direct interaction)
4 MI:0218(physical interaction)
5 MI:0218(physical interaction)
6 MI:0218(physical interaction)|MI:0218(physical interaction)
Source.database.s. Interaction.identifier.s. Confidence.value.s.
1 MI:0465(dip) DIP-536E dip-quality-status:core
2 MI:0465(dip) DIP-539E dip-quality-status:core
3 MI:0465(dip) DIP-540E dip-quality-status:core
4 MI:0465(dip) DIP-541E dip-quality-status:core
5 MI:0465(dip) DIP-542E dip-quality-status:core
6 MI:0465(dip) DIP-590E dip-quality-status:core
Processing.Status
1 dip:0002(small scale)
2 dip:0002(small scale)|dip:0002(small scale)|dip:0002(small scale)|dip:0002(small scale)|dip:0002(small scale)|dip:0002(small scale)|dip:0005(high throughput)
3 dip:0002(small scale)|dip:0002(small scale)
4 dip:0002(small scale)
5 dip:0002(small scale)
6 dip:0002(small scale)|dip:0002(small scale)
X X.1
1 -
2 -
3 -
4 -
5 -
6 -
> summary(DIP.tb)
ID.interactor.A ID.interactor.B Alt..ID.interactor.A
Length:22584 Length:22584 Length:22584
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Alt..ID.interactor.B Alias.es..interactor.A Alias.es..interactor.B
Length:22584 Length:22584 Length:22584
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Interaction.detection.method.s. Publication.1st.author.s.
Length:22584 Length:22584
Class :character Class :character
Mode :character Mode :character
Publication.Identifier.s. Taxid.interactor.A Taxid.interactor.B
Length:22584 Length:22584 Length:22584
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Interaction.type.s. Source.database.s. Interaction.identifier.s.
Length:22584 Length:22584 Length:22584
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Confidence.value.s. Processing.Status X X.1
Length:22584 Length:22584 Length:22584 Length:22584
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
>
> #There non-Sce entries in DIP data. I need to remove them from yeast PPI
> # ... to do
>
> #DIP id formats in interaction and sequence files are not the same
> DIP.tb$ID.interactor.A[1:10]
[1] "DIP-844N|refseq:NP_010949|uniprotkb:P40020"
[2] "DIP-777N|refseq:NP_010710|uniprotkb:P32578"
[3] "DIP-814N|refseq:NP_009655|uniprotkb:P22219"
[4] "DIP-33N|refseq:NP_010513|uniprotkb:P11978"
[5] "DIP-698N|refseq:NP_010270|uniprotkb:P15646"
[6] "DIP-982N|refseq:NP_011399|uniprotkb:P26309"
[7] "DIP-982N|refseq:NP_011399|uniprotkb:P26309"
[8] "DIP-948N|uniprotkb:Q80US8"
[9] "DIP-314N|refseq:NP_012619|uniprotkb:P18852"
[10] "DIP-954N|refseq:NP_014855|uniprotkb:P18851"
> #[1] "DIP-844N|refseq:NP_010949|uniprotkb:P40020" "DIP-777N|refseq:NP_010710|uniprotkb:P32578"
> #[3] "DIP-814N|refseq:NP_009655|uniprotkb:P22219" "DIP-33N|refseq:NP_010513|uniprotkb:P11978"
> DIP.tb$A.DIP = gsub(pattern="\\|.+", replacement="", DIP.tb$ID.interactor.A)
> head(DIP.tb$A.DIP)
[1] "DIP-844N" "DIP-777N" "DIP-814N" "DIP-33N" "DIP-698N" "DIP-982N"
> DIP.tb$B.DIP = gsub(pattern="\\|.+", replacement="", DIP.tb$ID.interactor.B)
> head(DIP.tb$B.DIP)
[1] "DIP-871N" "DIP-18N" "DIP-97N" "DIP-33N" "DIP-746N" "DIP-293N"
>
> dic$DIP[1:5]
[1] "dip:DIP-6088N|uniprot:P98002"
[2] "dip:DIP-8242N|refseq:NP_009310|uniprot:P03875"
[3] "dip:DIP-3040N|refseq:NP_009312|uniprot:P00856"
[4] "dip:DIP-3038N|refseq:NP_009313|uniprot:P00854"
[5] "dip:DIP-350N|refseq:YP_209217|uniprot:P00157"
> #[1] "dip:DIP-6088N|uniprot:P98002" "dip:DIP-8242N|refseq:NP_009310|uniprot:P03875"
> #[3] "dip:DIP-3040N|refseq:NP_009312|uniprot:P00856" "dip:DIP-3038N|refseq:NP_009313|uniprot:P00854"
> #[5] "dip:DIP-350N|refseq:YP_209217|uniprot:P00157"
>
> dic$DIP2 = gsub(pattern="dip:", replacement="", dic$DIP)
> head(dic$DIP2)
[1] "DIP-6088N|uniprot:P98002"
[2] "DIP-8242N|refseq:NP_009310|uniprot:P03875"
[3] "DIP-3040N|refseq:NP_009312|uniprot:P00856"
[4] "DIP-3038N|refseq:NP_009313|uniprot:P00854"
[5] "DIP-350N|refseq:YP_209217|uniprot:P00157"
[6] "DIP-3041N|refseq:NP_009319|uniprot:P61829"
>
> dic$DIP3 = gsub(pattern="\\|.+", replacement="", dic$DIP2)
> head(dic$DIP3)
[1] "DIP-6088N" "DIP-8242N" "DIP-3040N" "DIP-3038N" "DIP-350N" "DIP-3041N"
>
> intersect(dic$DIP3, DIP.tb$A.DIP)[1:10] #it worked.
[1] "DIP-3040N" "DIP-3038N" "DIP-3041N" "DIP-7592N" "DIP-7698N" "DIP-518N"
[7] "DIP-6298N" "DIP-6445N" "DIP-2750N" "DIP-2253N"
> intersect(dic$DIP3, DIP.tb$B.DIP)[1:10] #it worked.
[1] "DIP-8242N" "DIP-3040N" "DIP-3038N" "DIP-3041N" "DIP-7592N" "DIP-8133N"
[7] "DIP-7698N" "DIP-518N" "DIP-6739N" "DIP-6298N"
>
> # MAP DIP to SGD ORFs
> DIP.tb$ORF1 = dic$ORF[match(DIP.tb$A.DIP, dic$DIP3)]
> head(DIP.tb$ORF1)
[1] "YER032W" "YDR422C" "YBR097W" "YDR227W" "YDL014W" "YGL116W"
> summary(DIP.tb$ORF1) #it worked.
Length Class Mode
22584 character character
>
> DIP.tb$ORF2 = dic$ORF[match(DIP.tb$B.DIP, dic$DIP3)]
> head(DIP.tb$ORF2)
[1] "YDR195W" "YDR477W" "YLR240W" "YDR227W" "YLL011W" "YGL086W"
> summary(DIP.tb$ORF2)
Length Class Mode
22584 character character
>
> head(DIP.tb[,1:10])
ID.interactor.A
1 DIP-844N|refseq:NP_010949|uniprotkb:P40020
2 DIP-777N|refseq:NP_010710|uniprotkb:P32578
3 DIP-814N|refseq:NP_009655|uniprotkb:P22219
4 DIP-33N|refseq:NP_010513|uniprotkb:P11978
5 DIP-698N|refseq:NP_010270|uniprotkb:P15646
6 DIP-982N|refseq:NP_011399|uniprotkb:P26309
ID.interactor.B Alt..ID.interactor.A
1 DIP-871N|refseq:NP_010481|uniprotkb:P42073 -
2 DIP-18N|refseq:NP_010765|uniprotkb:P06782 -
3 DIP-97N|refseq:NP_013341|uniprotkb:P22543 -
4 DIP-33N|refseq:NP_010513|uniprotkb:P11978 -
5 DIP-746N|refseq:NP_013090|uniprotkb:P33750 -
6 DIP-293N|refseq:NP_011429|uniprotkb:P40957 -
Alt..ID.interactor.B Alias.es..interactor.A Alias.es..interactor.B
1 - - -
2 - - -
3 - - -
4 - - -
5 - - -
6 - - -
Interaction.detection.method.s.
1 MI:0018(two hybrid)
2 MI:0018(two hybrid)|MI:0018(two hybrid)|MI:0019(coimmunoprecipitation)|MI:0018(two hybrid)|MI:0004(affinity chromatography technology)|MI:0019(coimmunoprecipitation)|MI:0676(tandem affinity purification)
3 MI:0019(coimmunoprecipitation)|MI:0030(cross-linking study)
4 MI:0018(two hybrid)
5 MI:0019(coimmunoprecipitation)
6 MI:0018(two hybrid)|MI:0018(two hybrid)
Publication.1st.author.s.
1 -
2 -
3 -
4 -
5 -
6 -
Publication.Identifier.s.
1 pubmed:9196079|pubmed:DIP-356S
2 pubmed:1496382|pubmed:DIP-448S|pubmed:9121458|pubmed:DIP-723S|pubmed:1496382|pubmed:DIP-448S|pubmed:7813428|pubmed:DIP-724S|pubmed:9121458|pubmed:DIP-723S|pubmed:7813428|pubmed:DIP-724S|pubmed:11805826|pubmed:DIP-1768S
3 pubmed:8387919|pubmed:DIP-391S|pubmed:8387919|pubmed:DIP-391S
4 pubmed:1946372|pubmed:DIP-50S
5 pubmed:8508778|pubmed:DIP-198S
6 pubmed:10848588|pubmed:DIP-1428S|pubmed:9461437|pubmed:DIP-189S
Taxid.interactor.A
1 taxid:4932(Saccharomyces cerevisiae)
2 taxid:4932(Saccharomyces cerevisiae)
3 taxid:4932(Saccharomyces cerevisiae)
4 taxid:4932(Saccharomyces cerevisiae)
5 taxid:4932(Saccharomyces cerevisiae)
6 taxid:4932(Saccharomyces cerevisiae)
> DIP.tb= DIP.tb[, c(21,22,19,20,1:18)]
> head(DIP.tb[,1:6])
ORF1 ORF2 A.DIP B.DIP ID.interactor.A
1 YER032W YDR195W DIP-844N DIP-871N DIP-844N|refseq:NP_010949|uniprotkb:P40020
2 YDR422C YDR477W DIP-777N DIP-18N DIP-777N|refseq:NP_010710|uniprotkb:P32578
3 YBR097W YLR240W DIP-814N DIP-97N DIP-814N|refseq:NP_009655|uniprotkb:P22219
4 YDR227W YDR227W DIP-33N DIP-33N DIP-33N|refseq:NP_010513|uniprotkb:P11978
5 YDL014W YLL011W DIP-698N DIP-746N DIP-698N|refseq:NP_010270|uniprotkb:P15646
6 YGL116W YGL086W DIP-982N DIP-293N DIP-982N|refseq:NP_011399|uniprotkb:P26309
ID.interactor.B
1 DIP-871N|refseq:NP_010481|uniprotkb:P42073
2 DIP-18N|refseq:NP_010765|uniprotkb:P06782
3 DIP-97N|refseq:NP_013341|uniprotkb:P22543
4 DIP-33N|refseq:NP_010513|uniprotkb:P11978
5 DIP-746N|refseq:NP_013090|uniprotkb:P33750
6 DIP-293N|refseq:NP_011429|uniprotkb:P40957
>
> write.csv(DIP.tb, "_SceDIP_withORFs.20131219.csv",row.names=F)
>
> #convert DIP IDs to SGD ORFs
>
> # setwd("~/databases/DIP/sandbox")
> list.files()
[1] "_20131219.convertDIPtoSGD.R" "_matchedSGD2DIP_IDs.txt"
[3] "_out_DIP2SGD.txt" "_out_SGD2DIP.txt"
[5] "_test.faa" "_tmp.txt"
[7] "fasta20131201.seq" "fasta20131201.seq.phr"
[9] "fasta20131201.seq.pin" "fasta20131201.seq.psq"
[11] "formatdb.log" "s288c-prot.faa"
[13] "s288c-prot.faa.phr" "s288c-prot.faa.pin"
[15] "s288c-prot.faa.psq" "Scere20131031.csv"
>
> dic = read.table("_matchedSGD2DIP_IDs.txt", sep="\t", colClass=c("character","character"))
> names(dic) = c("ORF","DIP")
> str(dic)
'data.frame': 5160 obs. of 2 variables:
$ ORF: chr "Q0045" "Q0050" "Q0080" "Q0085" ...
$ DIP: chr "dip:DIP-6088N|uniprot:P98002" "dip:DIP-8242N|refseq:NP_009310|uniprot:P03875" "dip:DIP-3040N|refseq:NP_009312|uniprot:P00856" "dip:DIP-3038N|refseq:NP_009313|uniprot:P00854" ...
>
> DIP.tb = read.csv("Scere20131031.csv", colClass=c(rep("character",18)))
> str(DIP.tb)
'data.frame': 22584 obs. of 18 variables:
$ ID.interactor.A : chr "DIP-844N|refseq:NP_010949|uniprotkb:P40020" "DIP-777N|refseq:NP_010710|uniprotkb:P32578" "DIP-814N|refseq:NP_009655|uniprotkb:P22219" "DIP-33N|refseq:NP_010513|uniprotkb:P11978" ...
$ ID.interactor.B : chr "DIP-871N|refseq:NP_010481|uniprotkb:P42073" "DIP-18N|refseq:NP_010765|uniprotkb:P06782" "DIP-97N|refseq:NP_013341|uniprotkb:P22543" "DIP-33N|refseq:NP_010513|uniprotkb:P11978" ...
$ Alt..ID.interactor.A : chr " -" " -" " -" "-" ...
$ Alt..ID.interactor.B : chr "-" "-" "-" "-" ...
$ Alias.es..interactor.A : chr "-" "-" "-" "-" ...
$ Alias.es..interactor.B : chr "-" "-" "-" "-" ...
$ Interaction.detection.method.s.: chr "MI:0018(two hybrid)" "MI:0018(two hybrid)|MI:0018(two hybrid)|MI:0019(coimmunoprecipitation)|MI:0018(two hybrid)|MI:0004(affinity chromatography tech"| __truncated__ "MI:0019(coimmunoprecipitation)|MI:0030(cross-linking study)" "MI:0018(two hybrid)" ...
$ Publication.1st.author.s. : chr "-" "-" "-" "-" ...
$ Publication.Identifier.s. : chr "pubmed:9196079|pubmed:DIP-356S" "pubmed:1496382|pubmed:DIP-448S|pubmed:9121458|pubmed:DIP-723S|pubmed:1496382|pubmed:DIP-448S|pubmed:7813428|pubmed:DIP-724S|pub"| __truncated__ "pubmed:8387919|pubmed:DIP-391S|pubmed:8387919|pubmed:DIP-391S" "pubmed:1946372|pubmed:DIP-50S" ...
$ Taxid.interactor.A : chr "taxid:4932(Saccharomyces cerevisiae)" "taxid:4932(Saccharomyces cerevisiae)" "taxid:4932(Saccharomyces cerevisiae)" "taxid:4932(Saccharomyces cerevisiae)" ...
$ Taxid.interactor.B : chr "taxid:4932(Saccharomyces cerevisiae)" "taxid:4932(Saccharomyces cerevisiae)" "taxid:4932(Saccharomyces cerevisiae)" "taxid:4932(Saccharomyces cerevisiae)" ...
$ Interaction.type.s. : chr "MI:0218(physical interaction)" "MI:0218(physical interaction)|MI:0218(physical interaction)|MI:0218(physical interaction)|MI:0218(physical interaction)|MI:0218"| __truncated__ "MI:0218(physical interaction)|MI:0407(direct interaction)" "MI:0218(physical interaction)" ...
$ Source.database.s. : chr "MI:0465(dip)" "MI:0465(dip)" "MI:0465(dip)" "MI:0465(dip)" ...
$ Interaction.identifier.s. : chr "DIP-536E" "DIP-539E" "DIP-540E" "DIP-541E" ...
$ Confidence.value.s. : chr "dip-quality-status:core" "dip-quality-status:core" "dip-quality-status:core" "dip-quality-status:core" ...
$ Processing.Status : chr "dip:0002(small scale)" "dip:0002(small scale)|dip:0002(small scale)|dip:0002(small scale)|dip:0002(small scale)|dip:0002(small scale)|dip:0002(small sc"| __truncated__ "dip:0002(small scale)|dip:0002(small scale)" "dip:0002(small scale)" ...
$ X : chr "" "" "" "" ...
$ X.1 : chr "-" "-" "-" "-" ...
> names(DIP.tb)
[1] "ID.interactor.A" "ID.interactor.B"
[3] "Alt..ID.interactor.A" "Alt..ID.interactor.B"
[5] "Alias.es..interactor.A" "Alias.es..interactor.B"
[7] "Interaction.detection.method.s." "Publication.1st.author.s."
[9] "Publication.Identifier.s." "Taxid.interactor.A"
[11] "Taxid.interactor.B" "Interaction.type.s."
[13] "Source.database.s." "Interaction.identifier.s."
[15] "Confidence.value.s." "Processing.Status"
[17] "X" "X.1"
> head(DIP.tb)
ID.interactor.A
1 DIP-844N|refseq:NP_010949|uniprotkb:P40020
2 DIP-777N|refseq:NP_010710|uniprotkb:P32578
3 DIP-814N|refseq:NP_009655|uniprotkb:P22219
4 DIP-33N|refseq:NP_010513|uniprotkb:P11978
5 DIP-698N|refseq:NP_010270|uniprotkb:P15646
6 DIP-982N|refseq:NP_011399|uniprotkb:P26309
ID.interactor.B Alt..ID.interactor.A
1 DIP-871N|refseq:NP_010481|uniprotkb:P42073 -
2 DIP-18N|refseq:NP_010765|uniprotkb:P06782 -
3 DIP-97N|refseq:NP_013341|uniprotkb:P22543 -
4 DIP-33N|refseq:NP_010513|uniprotkb:P11978 -
5 DIP-746N|refseq:NP_013090|uniprotkb:P33750 -
6 DIP-293N|refseq:NP_011429|uniprotkb:P40957 -
Alt..ID.interactor.B Alias.es..interactor.A Alias.es..interactor.B
1 - - -
2 - - -
3 - - -
4 - - -
5 - - -
6 - - -
Interaction.detection.method.s.
1 MI:0018(two hybrid)
2 MI:0018(two hybrid)|MI:0018(two hybrid)|MI:0019(coimmunoprecipitation)|MI:0018(two hybrid)|MI:0004(affinity chromatography technology)|MI:0019(coimmunoprecipitation)|MI:0676(tandem affinity purification)
3 MI:0019(coimmunoprecipitation)|MI:0030(cross-linking study)
4 MI:0018(two hybrid)
5 MI:0019(coimmunoprecipitation)
6 MI:0018(two hybrid)|MI:0018(two hybrid)
Publication.1st.author.s.
1 -
2 -
3 -
4 -
5 -
6 -
Publication.Identifier.s.
1 pubmed:9196079|pubmed:DIP-356S
2 pubmed:1496382|pubmed:DIP-448S|pubmed:9121458|pubmed:DIP-723S|pubmed:1496382|pubmed:DIP-448S|pubmed:7813428|pubmed:DIP-724S|pubmed:9121458|pubmed:DIP-723S|pubmed:7813428|pubmed:DIP-724S|pubmed:11805826|pubmed:DIP-1768S
3 pubmed:8387919|pubmed:DIP-391S|pubmed:8387919|pubmed:DIP-391S
4 pubmed:1946372|pubmed:DIP-50S
5 pubmed:8508778|pubmed:DIP-198S
6 pubmed:10848588|pubmed:DIP-1428S|pubmed:9461437|pubmed:DIP-189S
Taxid.interactor.A Taxid.interactor.B
1 taxid:4932(Saccharomyces cerevisiae) taxid:4932(Saccharomyces cerevisiae)
2 taxid:4932(Saccharomyces cerevisiae) taxid:4932(Saccharomyces cerevisiae)
3 taxid:4932(Saccharomyces cerevisiae) taxid:4932(Saccharomyces cerevisiae)
4 taxid:4932(Saccharomyces cerevisiae) taxid:4932(Saccharomyces cerevisiae)
5 taxid:4932(Saccharomyces cerevisiae) taxid:4932(Saccharomyces cerevisiae)
6 taxid:4932(Saccharomyces cerevisiae) taxid:4932(Saccharomyces cerevisiae)
Interaction.type.s.
1 MI:0218(physical interaction)
2 MI:0218(physical interaction)|MI:0218(physical interaction)|MI:0218(physical interaction)|MI:0218(physical interaction)|MI:0218(physical interaction)|MI:0218(physical interaction)|MI:0915(physical association)
3 MI:0218(physical interaction)|MI:0407(direct interaction)
4 MI:0218(physical interaction)
5 MI:0218(physical interaction)
6 MI:0218(physical interaction)|MI:0218(physical interaction)
Source.database.s. Interaction.identifier.s. Confidence.value.s.
1 MI:0465(dip) DIP-536E dip-quality-status:core
2 MI:0465(dip) DIP-539E dip-quality-status:core
3 MI:0465(dip) DIP-540E dip-quality-status:core
4 MI:0465(dip) DIP-541E dip-quality-status:core
5 MI:0465(dip) DIP-542E dip-quality-status:core
6 MI:0465(dip) DIP-590E dip-quality-status:core
Processing.Status
1 dip:0002(small scale)
2 dip:0002(small scale)|dip:0002(small scale)|dip:0002(small scale)|dip:0002(small scale)|dip:0002(small scale)|dip:0002(small scale)|dip:0005(high throughput)
3 dip:0002(small scale)|dip:0002(small scale)
4 dip:0002(small scale)
5 dip:0002(small scale)
6 dip:0002(small scale)|dip:0002(small scale)
X X.1
1 -
2 -
3 -
4 -
5 -
6 -
> summary(DIP.tb)
ID.interactor.A ID.interactor.B Alt..ID.interactor.A
Length:22584 Length:22584 Length:22584
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Alt..ID.interactor.B Alias.es..interactor.A Alias.es..interactor.B
Length:22584 Length:22584 Length:22584
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Interaction.detection.method.s. Publication.1st.author.s.
Length:22584 Length:22584
Class :character Class :character
Mode :character Mode :character
Publication.Identifier.s. Taxid.interactor.A Taxid.interactor.B
Length:22584 Length:22584 Length:22584
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Interaction.type.s. Source.database.s. Interaction.identifier.s.
Length:22584 Length:22584 Length:22584
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Confidence.value.s. Processing.Status X X.1
Length:22584 Length:22584 Length:22584 Length:22584
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
>
> #There non-Sce entries in DIP data. I need to remove them from yeast PPI
> # ... to do
>
> #DIP id formats in interaction and sequence files are not the same
> DIP.tb$ID.interactor.A[1:10]
[1] "DIP-844N|refseq:NP_010949|uniprotkb:P40020"
[2] "DIP-777N|refseq:NP_010710|uniprotkb:P32578"
[3] "DIP-814N|refseq:NP_009655|uniprotkb:P22219"
[4] "DIP-33N|refseq:NP_010513|uniprotkb:P11978"
[5] "DIP-698N|refseq:NP_010270|uniprotkb:P15646"
[6] "DIP-982N|refseq:NP_011399|uniprotkb:P26309"
[7] "DIP-982N|refseq:NP_011399|uniprotkb:P26309"
[8] "DIP-948N|uniprotkb:Q80US8"
[9] "DIP-314N|refseq:NP_012619|uniprotkb:P18852"
[10] "DIP-954N|refseq:NP_014855|uniprotkb:P18851"
> #[1] "DIP-844N|refseq:NP_010949|uniprotkb:P40020" "DIP-777N|refseq:NP_010710|uniprotkb:P32578"
> #[3] "DIP-814N|refseq:NP_009655|uniprotkb:P22219" "DIP-33N|refseq:NP_010513|uniprotkb:P11978"
> DIP.tb$A.DIP = gsub(pattern="\\|.+", replacement="", DIP.tb$ID.interactor.A)
> head(DIP.tb$A.DIP)
[1] "DIP-844N" "DIP-777N" "DIP-814N" "DIP-33N" "DIP-698N" "DIP-982N"
> DIP.tb$B.DIP = gsub(pattern="\\|.+", replacement="", DIP.tb$ID.interactor.B)
> head(DIP.tb$B.DIP)
[1] "DIP-871N" "DIP-18N" "DIP-97N" "DIP-33N" "DIP-746N" "DIP-293N"
>
> dic$DIP[1:5]
[1] "dip:DIP-6088N|uniprot:P98002"
[2] "dip:DIP-8242N|refseq:NP_009310|uniprot:P03875"
[3] "dip:DIP-3040N|refseq:NP_009312|uniprot:P00856"
[4] "dip:DIP-3038N|refseq:NP_009313|uniprot:P00854"
[5] "dip:DIP-350N|refseq:YP_209217|uniprot:P00157"
> #[1] "dip:DIP-6088N|uniprot:P98002" "dip:DIP-8242N|refseq:NP_009310|uniprot:P03875"
> #[3] "dip:DIP-3040N|refseq:NP_009312|uniprot:P00856" "dip:DIP-3038N|refseq:NP_009313|uniprot:P00854"
> #[5] "dip:DIP-350N|refseq:YP_209217|uniprot:P00157"
>
> dic$DIP2 = gsub(pattern="dip:", replacement="", dic$DIP)
> head(dic$DIP2)
[1] "DIP-6088N|uniprot:P98002"
[2] "DIP-8242N|refseq:NP_009310|uniprot:P03875"
[3] "DIP-3040N|refseq:NP_009312|uniprot:P00856"
[4] "DIP-3038N|refseq:NP_009313|uniprot:P00854"
[5] "DIP-350N|refseq:YP_209217|uniprot:P00157"
[6] "DIP-3041N|refseq:NP_009319|uniprot:P61829"
>
> dic$DIP3 = gsub(pattern="\\|.+", replacement="", dic$DIP2)
> head(dic$DIP3)
[1] "DIP-6088N" "DIP-8242N" "DIP-3040N" "DIP-3038N" "DIP-350N" "DIP-3041N"
>
> intersect(dic$DIP3, DIP.tb$A.DIP)[1:10] #it worked.
[1] "DIP-3040N" "DIP-3038N" "DIP-3041N" "DIP-7592N" "DIP-7698N" "DIP-518N"
[7] "DIP-6298N" "DIP-6445N" "DIP-2750N" "DIP-2253N"
> intersect(dic$DIP3, DIP.tb$B.DIP)[1:10] #it worked.
[1] "DIP-8242N" "DIP-3040N" "DIP-3038N" "DIP-3041N" "DIP-7592N" "DIP-8133N"
[7] "DIP-7698N" "DIP-518N" "DIP-6739N" "DIP-6298N"
>
> # MAP DIP to SGD ORFs
> DIP.tb$ORF1 = dic$ORF[match(DIP.tb$A.DIP, dic$DIP3)]
> head(DIP.tb$ORF1)
[1] "YER032W" "YDR422C" "YBR097W" "YDR227W" "YDL014W" "YGL116W"
> summary(DIP.tb$ORF1) #it worked.
Length Class Mode
22584 character character
>
> DIP.tb$ORF2 = dic$ORF[match(DIP.tb$B.DIP, dic$DIP3)]
> head(DIP.tb$ORF2)
[1] "YDR195W" "YDR477W" "YLR240W" "YDR227W" "YLL011W" "YGL086W"
> summary(DIP.tb$ORF2)
Length Class Mode
22584 character character
>
> head(DIP.tb[,1:10])
ID.interactor.A
1 DIP-844N|refseq:NP_010949|uniprotkb:P40020
2 DIP-777N|refseq:NP_010710|uniprotkb:P32578
3 DIP-814N|refseq:NP_009655|uniprotkb:P22219
4 DIP-33N|refseq:NP_010513|uniprotkb:P11978
5 DIP-698N|refseq:NP_010270|uniprotkb:P15646
6 DIP-982N|refseq:NP_011399|uniprotkb:P26309
ID.interactor.B Alt..ID.interactor.A
1 DIP-871N|refseq:NP_010481|uniprotkb:P42073 -
2 DIP-18N|refseq:NP_010765|uniprotkb:P06782 -
3 DIP-97N|refseq:NP_013341|uniprotkb:P22543 -
4 DIP-33N|refseq:NP_010513|uniprotkb:P11978 -
5 DIP-746N|refseq:NP_013090|uniprotkb:P33750 -
6 DIP-293N|refseq:NP_011429|uniprotkb:P40957 -
Alt..ID.interactor.B Alias.es..interactor.A Alias.es..interactor.B
1 - - -
2 - - -
3 - - -
4 - - -
5 - - -
6 - - -
Interaction.detection.method.s.
1 MI:0018(two hybrid)
2 MI:0018(two hybrid)|MI:0018(two hybrid)|MI:0019(coimmunoprecipitation)|MI:0018(two hybrid)|MI:0004(affinity chromatography technology)|MI:0019(coimmunoprecipitation)|MI:0676(tandem affinity purification)
3 MI:0019(coimmunoprecipitation)|MI:0030(cross-linking study)
4 MI:0018(two hybrid)
5 MI:0019(coimmunoprecipitation)
6 MI:0018(two hybrid)|MI:0018(two hybrid)
Publication.1st.author.s.
1 -
2 -
3 -
4 -
5 -
6 -
Publication.Identifier.s.
1 pubmed:9196079|pubmed:DIP-356S
2 pubmed:1496382|pubmed:DIP-448S|pubmed:9121458|pubmed:DIP-723S|pubmed:1496382|pubmed:DIP-448S|pubmed:7813428|pubmed:DIP-724S|pubmed:9121458|pubmed:DIP-723S|pubmed:7813428|pubmed:DIP-724S|pubmed:11805826|pubmed:DIP-1768S
3 pubmed:8387919|pubmed:DIP-391S|pubmed:8387919|pubmed:DIP-391S
4 pubmed:1946372|pubmed:DIP-50S
5 pubmed:8508778|pubmed:DIP-198S
6 pubmed:10848588|pubmed:DIP-1428S|pubmed:9461437|pubmed:DIP-189S
Taxid.interactor.A
1 taxid:4932(Saccharomyces cerevisiae)
2 taxid:4932(Saccharomyces cerevisiae)
3 taxid:4932(Saccharomyces cerevisiae)
4 taxid:4932(Saccharomyces cerevisiae)
5 taxid:4932(Saccharomyces cerevisiae)
6 taxid:4932(Saccharomyces cerevisiae)
> DIP.tb= DIP.tb[, c(21,22,19,20,1:18)]
> head(DIP.tb[,1:6])
ORF1 ORF2 A.DIP B.DIP ID.interactor.A
1 YER032W YDR195W DIP-844N DIP-871N DIP-844N|refseq:NP_010949|uniprotkb:P40020
2 YDR422C YDR477W DIP-777N DIP-18N DIP-777N|refseq:NP_010710|uniprotkb:P32578
3 YBR097W YLR240W DIP-814N DIP-97N DIP-814N|refseq:NP_009655|uniprotkb:P22219
4 YDR227W YDR227W DIP-33N DIP-33N DIP-33N|refseq:NP_010513|uniprotkb:P11978
5 YDL014W YLL011W DIP-698N DIP-746N DIP-698N|refseq:NP_010270|uniprotkb:P15646
6 YGL116W YGL086W DIP-982N DIP-293N DIP-982N|refseq:NP_011399|uniprotkb:P26309
ID.interactor.B
1 DIP-871N|refseq:NP_010481|uniprotkb:P42073
2 DIP-18N|refseq:NP_010765|uniprotkb:P06782
3 DIP-97N|refseq:NP_013341|uniprotkb:P22543
4 DIP-33N|refseq:NP_010513|uniprotkb:P11978
5 DIP-746N|refseq:NP_013090|uniprotkb:P33750
6 DIP-293N|refseq:NP_011429|uniprotkb:P40957
>
> write.csv(DIP.tb, "_SceDIP_withORFs.20131219.csv",row.names=F)
>
Reciprocal blast runs for DIP and Sce ID match
# Do reciprocal best hits to match DIP IDs to SGD ORFs.
ace:sandbox hongqin$ pwd
/Users/hongqin/DIP/sandbox
$ makeblastdb -in s288c-prot.faa -dbtype prot
Building a new DB, current time: 12/19/2013 20:30:31
New DB name: s288c-prot.faa
New DB title: s288c-prot.faa
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 6718 sequences in 0.428148 seconds.
$ makeblastdb -in fasta20131201.seq -dbtype prot
#This give out many warnings for empty entries.
#Runs from 20:50 to 21:48, for 1 hour, on a snowleopard laptop
$ blastp -db s288c-prot.faa -query fasta20131201.seq -outfmt 6 -out _out_DIP2SGD.txt -evalue 1e-10 &
$ blastp -db fasta20131201.seq -query s288c-prot.faa -outfmt 6 -out _out_SGD2DIP.txt -evalue 1e-10 &
$ps
PID TTY TIME CMD
879 ttys000 0:00.04 -bash
906 ttys000 3:50.79 blastp -db s288c-prot.faa -query fasta20131201.seq -outfmt 6 -out _out_DIP2SGD.txt -evalue 1e
919 ttys000 0:02.18 blastp -db fasta20131201.seq -query s288c-prot.faa -outfmt 6 -out _out_SGD2DIP.txt -evalue 1e
713 ttys001 0:00.18 -bash
$ simple_reciprocal_best_hits.01.pl -i1 _out_SGD2DIP.txt -i2 _out_DIP2SGD.txt -o _matchedSGD2DIP_IDs.txt
$ wc -l _matchedSGD2DIP_IDs.txt
5160 _matchedSGD2DIP_IDs.txt
*** BLASTP Formatting options
-outfmt <String>
alignment view options:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = XML Blast output,
6 = tabular,
7 = tabular with comment lines,
8 = Text ASN.1,
9 = Binary ASN.1,
10 = Comma-separated values,
11 = BLAST archive format (ASN.1)
#######################
ace:blast.fasta.demo hongqin$ cat commands.txt
blastp -subject db.faa -query query.faa
makeblastdb -in db2.faa
blastp -query my.seq.faa -db db2.faa | less
blastp -query my.seq.faa -db db2.faa -outfmt 6
blastp -query my.seq.faa -db db2.faa -outfmt 7 | less
#old version, still work on osX
blastall -p blastp -d db.faa -i query.faa -F F | less
ace:sandbox hongqin$ pwd
/Users/hongqin/DIP/sandbox
$ makeblastdb -in s288c-prot.faa -dbtype prot
Building a new DB, current time: 12/19/2013 20:30:31
New DB name: s288c-prot.faa
New DB title: s288c-prot.faa
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 6718 sequences in 0.428148 seconds.
$ makeblastdb -in fasta20131201.seq -dbtype prot
#This give out many warnings for empty entries.
#Runs from 20:50 to 21:48, for 1 hour, on a snowleopard laptop
$ blastp -db s288c-prot.faa -query fasta20131201.seq -outfmt 6 -out _out_DIP2SGD.txt -evalue 1e-10 &
$ blastp -db fasta20131201.seq -query s288c-prot.faa -outfmt 6 -out _out_SGD2DIP.txt -evalue 1e-10 &
$ps
PID TTY TIME CMD
879 ttys000 0:00.04 -bash
906 ttys000 3:50.79 blastp -db s288c-prot.faa -query fasta20131201.seq -outfmt 6 -out _out_DIP2SGD.txt -evalue 1e
919 ttys000 0:02.18 blastp -db fasta20131201.seq -query s288c-prot.faa -outfmt 6 -out _out_SGD2DIP.txt -evalue 1e
713 ttys001 0:00.18 -bash
$ simple_reciprocal_best_hits.01.pl -i1 _out_SGD2DIP.txt -i2 _out_DIP2SGD.txt -o _matchedSGD2DIP_IDs.txt
$ wc -l _matchedSGD2DIP_IDs.txt
5160 _matchedSGD2DIP_IDs.txt
*** BLASTP Formatting options
-outfmt <String>
alignment view options:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = XML Blast output,
6 = tabular,
7 = tabular with comment lines,
8 = Text ASN.1,
9 = Binary ASN.1,
10 = Comma-separated values,
11 = BLAST archive format (ASN.1)
#######################
ace:blast.fasta.demo hongqin$ cat commands.txt
blastp -subject db.faa -query query.faa
makeblastdb -in db2.faa
blastp -query my.seq.faa -db db2.faa | less
blastp -query my.seq.faa -db db2.faa -outfmt 6
blastp -query my.seq.faa -db db2.faa -outfmt 7 | less
#old version, still work on osX
blastall -p blastp -d db.faa -i query.faa -F F | less
DIP data import to R
I downloaded Sce interaction data from DIP. The text files seems to use the Windows format and cause an import error in R on my Snowleopard laptop. I then import this DIP text file into Excel and and it worked. I save the files in csv format using Excel. This generated csv file can be load into R correctly.
dip.sce.tb = read.csv( "Scere20131031.csv")
MI TAB 2.5 format
DIP now uses this format.
http://code.google.com/p/psicquic/wiki/MITAB25Format
http://www.bioconductor.org/packages/release/bioc/html/PSICQUIC.html
http://code.google.com/p/psicquic/wiki/PythonCodeSamples
http://code.google.com/p/psicquic/wiki/PerlCodeSamples
http://code.google.com/p/psicquic/wiki/PythonCodeSamples
http://code.google.com/p/psicquic/wiki/MITAB25Format
http://www.bioconductor.org/packages/release/bioc/html/PSICQUIC.html
http://code.google.com/p/psicquic/wiki/PythonCodeSamples
http://code.google.com/p/psicquic/wiki/PerlCodeSamples
http://code.google.com/p/psicquic/wiki/PythonCodeSamples
Mixture of Weibull models
Weibull model is known as the exponential distribution because the form of its pdf. see
http://en.wikipedia.org/wiki/Weibull_distribution
Weibull mortality rate = pdf / (1- CDF) ~ t^n
Mixture Weibull distributions is studied by Nicholas Jewell, 1982
http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=euclid.aos/1176345789
http://en.wikipedia.org/wiki/Weibull_distribution
Weibull mortality rate = pdf / (1- CDF) ~ t^n
Mixture Weibull distributions is studied by Nicholas Jewell, 1982
http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=euclid.aos/1176345789
Subscribe to:
Posts (Atom)