Monday, December 29, 2014

orthogonal projection, orthogonality

orthogonal projection of vector y onto u:
$\hat{y} = \frac{y \cdot u}{u \cdot u} u$

U has orthonormal columns if and only if $U^T U = I$

inner product of vectors, dot products, orthogonality

$u \cdot v = u^T v$

$u \cdot v = v \cdot u$

The length (or norm) of a vector $v$ is the square root of its inner product with itself. This can be seen from $v = [a, b]$, whose length (norm) is $\sqrt{a^2 + b^2}$.

$u \cdot v = \|u\| \|v\| \cos\theta$
$\|u - v\|^2 = \|u\|^2 + \|v\|^2 - 2\|u\| \|v\| \cos\theta$

Two vectors $u$ and $v$ are orthogonal if and only if $u \cdot v = 0$.


orthogonal projection of vector y onto u:
$\hat{y} = \frac{y \cdot u}{u \cdot u} u$

The orthogonal projection of a point $y$ onto a subspace $W$ with orthogonal basis $\{u_1, u_2, \ldots, u_p\}$ is the sum of the orthogonal projections onto each basis vector $u_1, u_2, \ldots, u_p$.

diagonal matrix of essential genes in network aging model

 $A^k = P D^k P^{-1}$
If the diagonal matrix contains only the number of links of essential genes, its decay might be easy to compute numerically.
The diagonalization of A can be found through its eigenvalues and eigenvectors.
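A small sketch of this idea with a toy 2x2 matrix (not real network data): diagonalize once, then raise only the eigenvalues to the k-th power.

```python
import numpy as np

# If A = P D P^{-1} with D diagonal (eigenvalues on the diagonal),
# then A^k = P D^k P^{-1}: only the diagonal entries are powered.
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])  # toy "decay" matrix for illustration only

eigvals, P = np.linalg.eig(A)
k = 10
A_k = P @ np.diag(eigvals**k) @ np.linalg.inv(P)

# agrees with direct repeated multiplication
assert np.allclose(A_k, np.linalg.matrix_power(A, k))
```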

PCA notes

The eigenvectors of the covariance matrix are the principal components.

Sunday, December 28, 2014

toread, interaction based discovery of cancer genes

Nucleic Acids Res. 2014 Feb;42(3):e18. doi: 10.1093/nar/gkt1305. Epub 2013 Dec 19.

Interaction-based discovery of functionally important genes in cancers.

Saturday, December 27, 2014

SQLite3 code on rls.db

file 'test_rls.sql'

.open rls.db
.separator ::
.headers on
.mode column
select distinct experiment from result_experiment limit 20;
.indices set
.width 5
select * from result  limit 1;

/* The following select can take rls and its reference rls */
select experiments,set_name,set_strain,set_background,set_genotype
 from result  limit 2;

/* The fields of set_name and set_genotype sometimes provide the ORF-name pair, but there are many exceptions. */

mysql tips

MYSQL tips "mysql.txt" file

show tables like "h%";

select * from someTable into outfile "/tmp/tmpfile";

create temporary table tmptab select distinct id1 from sampleTab1 UNION ALL
select distinct id2 from sampleTab2;

grant ALL on homo_sapiens_core_17_33.* to hqin@localhost;


#Returns a substring len characters long from string str, starting at position pos.
#The variant form that uses FROM is SQL-92 syntax:

mysql> SELECT SUBSTRING('Quadratically',5,6);
        -> 'ratica'

mysqldump test name --no-data --no-create-db > tmp.dump
mysqlimport -u root -h shanghai hong_database *.txt.table

 /* try left, inner, outer join to see what's missing */
mysql>  create temporary table bader2gu
    ->  select orf, Name1
    ->  from   curagenOrf2name left join Ks_Ka_Yeast_Ca
    ->         on curagenOrf2name.orf = Ks_Ka_Yeast_Ca.Name1;
Query OK, 6268 rows affected (54.27 sec)
Records: 6268  Duplicates: 0  Warnings: 0

mysql>  select * from bader2gu where Name1 is NULL;
/* return 4313 rows */

mysql>  select * from bader2gu where Name1 is not NULL;
/* return 1955 rows.  Note, one record is missing, probably
due to different annotations between curagen and the public release from SGD */

SQLite 3, osX, byte, rls.db


#I want to install SQLite to load 'rls.db'. 

$ sudo port install sqlite3

#how to load 'rls.db' ?
$ sqlite3 
SQLite version 2014-12-09 01:34:36
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .open rls.db
sqlite> .databases
seq  name             file                                                      
---  ---------------  ----------------------------------------------------------
0    main             /Users/hqin/projects/     
sqlite> .tables
build_log           genotype_pubmed_id  result_experiment   set               
cross_mating_type   meta                result_ref          yeast_strain      
cross_media         result              result_set 

sqlite> .indices

sqlite> select distinct experiment from result_experiment limit 20;

sqlite> .separator :::
sqlite> select * from result limit 2;
#Note: field 'experiments' in 'result' may be used to find the in-experiment wildtype controls. 
# Ken once suggested that "pooled by" column?? file, genotype, mixed
# set lifespan
# ref lifespan

select experiments,set_name,set_strain,set_background,set_genotype
 from result  limit 2;

Wednesday, December 24, 2014

toread, quality control of inner nuclear membrane proteins by the Asi complex

Science. 2014 Nov 7;346(6210):751-5. doi: 10.1126/science.1255638. Epub 2014 Sep 18.

Quality control of inner nuclear membrane proteins by the Asi complex.


Monday, December 22, 2014

CITI training

I spent 70 minutes (9:30-10:40) on CITI training, with a 93% final score (one answer wrong due to mis-clicking).

CITI refresher course reading materials, URLs

  • If you have not read the Belmont Report yet, please review this document and/or copy it for future reference. (Close the new browser window to return here.)
Links to Ethical Codes and Regulations of Human Subjects in Research.
  • Title 21, CFR Part 50 and CFR 56 of the Code of Federal Regulations.
  • CLIA - Clinical Laboratory Improvement Amendments
  • Title 21 Code of Federal Regulations (21 CFR Part 11) Electronic Records; Electronic Signatures.

CITI training, Belmont report

Belmont report

svm project, pca() replaced by princomp() 20141222

updated file ''

The 'pca' package used in the old code no longer exists in R 3.x. I switched to princomp() in the base stats package.

TODO: check the predicted long-lived strains in the Kaeberlein database.

study notes on PCA, principal components, with R testing codes, princomp()

ResearchGate: Principal components are linear combinations of the original variables x1, x2, etc. So when you do SVM on a PCA decomposition, you work with these combinations instead of the original variables.

At 37:50 in Ng's video, Ng shows how to use PCA (linear combinations of raw data) to reduce the dimension of the data.

PCA is basically orthogonal transformation
"Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components."
"PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular value decomposition of a data matrix, usually after mean centering (and normalizing or using Z-scores) the data matrix for each attribute. The results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score)."

R:  pc <- princomp(~ Murder + Assault + UrbanPop,
                   data = USArrests, na.action = na.exclude, cor = TRUE)
    pc$scores[1:5, ]  #scores probably are PCA results, based on wikipedia entry

I can use examples of linear combination to verify my guess. 

#######start of the R testing code and results #########
x1 = rnorm(100)
x2 = rnorm(100)
x3 = x1 + x2 + rnorm(100)/20
x4 = 2*x1 + rnorm(100)/20
X = data.frame(cbind(x1,x2,x3,x4))
pc <- princomp(X)
plot(pc)#only two major components, consistent


#######end of the R testing code and results #########

#######start of 2nd R testing code and results #########
x1 = rnorm(100)
x2 = x1 + rnorm(100)/20
X = data.frame(cbind(x1,x2))
pc <- princomp(X, cor = TRUE)
pc$scores[,1] - (0.707*x1 + 0.707*x2) #does this approach zero? 
summary( pc$scores[,1] - (0.707*x1 + 0.707*x2) + mean( 0.707*x1 + 0.707*x2 ) )

#good, it approaches zero

summary(lm( pc$scores[,1] ~ x1 ))

#######End of 2nd R testing code and results #########

Useful Unix / Linux shell commands

 cat tmp.txt | sed s/CREATE/DROP/

 who | cut -c1-8 | sort | uniq | nl
 cat /usr/local/apache2/logs/access_log | grep 128\.135 | cut -c1-16 | uniq     
 ps -ef | grep nohup | cut -c53-57 | sort | uniq | nl 

 /sbin/shutdown -r now ?

 /etc/rc.local  # system startup configuration

 grep CREATE ensembl_mart_16_1.sql | sed s/CREATE/DROP/ | sed s/\(/\;/ > $HOME/trim_mart.sql

ls enc.* | sed "s/^/\"/" | sed "s/$/\"\,/"

health disparity gene expression datasets, collection, data resources

Differential endothelial cell gene expression by African Americans versus Caucasian Americans: A possible contribution to health disparity in vascular disease and cancer

PLoS One. 2008 Aug 6;3(8):e2847. doi: 10.1371/journal.pone.0002847.
Gene expression and functional studies of the optic nerve head astrocyte transcriptome from normal African Americans and Caucasian Americans donors.
Miao H1, Chen L, Riordan SM, Li W, Juarez S, Crabb AM, Lukas TJ, Du P, Lin SM, Wise A, Agapova OA, Yang P, Gu CC, Hernandez MR.

Genome Biol. 2008;9(7):R111. doi: 10.1186/gb-2008-9-7-r111. Epub 2008 Jul 9.
Susceptibility to glaucoma: differential comparison of the astrocyte transcriptome from glaucomatous African American and Caucasian American donors.

Physiol Genomics. 2011 Jul 14;43(13):836-43. doi: 10.1152/physiolgenomics.00243.2010. Epub 2011 Apr 26.
Gene expression variation between African Americans and whites is associated with coronary artery calcification: the multiethnic study of atherosclerosis.

review, subclinical coronary atherosclerosis, racial profiling is necessary

General Cardiovascular Risk Profile identifies advanced coronary artery calcium and is improved by family history: the multiethnic study of atherosclerosis.

J Transl Med. 2013 Oct 1;11:239. doi: 10.1186/1479-5876-11-239.
Quantitative proteomic analysis in HCV-induced HCC reveals sets of proteins with potential significance for racial disparity.
Dillon ST, Bhasin MK, Feng X, Koh DW, Daoud SS.

Exerc Sport Sci Rev. 2013 Jan;41(1):44-54. doi: 10.1097/JES.0b013e318279cbbd.
Are there race-dependent endothelial cell responses to exercise?
Brown MD1, Feairheller DL.

Sunday, December 21, 2014

NIH grants searchable database

Braunewell Bornholdt, 2007, Superstability of the yeast cell-cycle dynamics

[PB07] J Theor Biol. 2007 Apr 21;245(4):638-43. Epub 2006 Nov 21. Superstability of the yeast cell-cycle dynamics: ensuring causality in the presence of biochemical stochasticity

In their 2009 JTB paper, the authors cited a measure of reliability from this 07JTB paper. I searched the entire paper for "reliability" but only found one hit, in the abstract. In the main text, the authors mention the "stability of the system under strong noise", termed a "stability criterion" (basically robustness or reliability). Based on its explanation below, this is a rather context-specific criterion.

It seems that PB07 and PB09 are based on the Li04PNAS paper, a boolean network model on yeast cell cycle. 

Braunewell and Bornholdt, 2009, reliability of network

investigate the interplay of topological structure and dynamical robustness.

reliability of attractors

boolean network dynamics

The reliability criterion was used to show the robustness of the yeast cell-cycle dynamics against timing perturbations (Braunewell and Bornholdt, 2007).

See also 


Help us get the word out about PRIDE by forwarding the below to your institution, colleagues, & organizations you may be a member of and posting on Linked In & Facebook.  We appreciate your assistance!

The PRIDE Summer Institute Programs to Increase Diversity Among Individuals Engaged in Health-Related Research are now accepting applications. Space is limited for the 2015 mentored summer training programs so Apply early!
Who: Eligible applicants are junior-level faculty or scientists from minority groups that are under-represented in the biomedical or health sciences, and are United States Citizens or Permanent Residents. Research interests should be compatible with those of the National Heart, Lung, and Blood Institute (NHLBI) in the prevention and treatment of heart, lung, blood, and sleep (HLBS) disorders.
What: Seven unique Summer Institute programs with intensive mentored training opportunities to enhance the research skills and to promote the scientific and career development of trainees. Trainees will learn effective strategies for preparing, submitting and obtaining external funding for research purposes, including extensive tips on best practices. Research emphasis varies by program.
Where/When (Dates subject to change.  Verify on website): 
  • Location:Arizona Health Sciences Center, University of Arizona, Tucson, Arizona
  • PI: Joe G.N. “Skip” Garcia, MD; Francisco Moreno, MD
Behavioral and Sleep Medicine (BSM) (July 19 – August 1, 2015)
  • Location: NYU Langone Medical Center, New York, New York
  • PI: Girardin Jean-Louis, PhD

  • Location: Washington University in St. Louis, St. Louis, Missouri
  • PI: D.C. Rao, PhD; Victor Davila-Roman, MD

Cardiovascular Health-Related Research (CVD) (July 19 – August 1, 2015)
  • Location: SUNY Downstate Medical College, New York, New York
  • PI: Mohamed Boutjdir, PhD

  • Location: Georgia Regents University, Augusta, Georgia
  • PI: Betty Pace, MD

HBCU-PRIDE (June 21 – July 1, 2015)
  • Location:University of Mississippi Medical Center, Jackson, Mississippi
  • PI: Bettina M. Beech, DrPH, MPH; Keith C. Norris, MD, PhD

  • Location: The UCSF Center for Vulnerable Populations at San Francisco General Hospital, San Francisco, California
  • PI: Kirsten Bibbins-Domingo, PhD, MD, MAS; Alicia Fernandez, MD; Margaret Handley, PhD, MPH

Programs typically are all expenses paid including travel, meals, housing, and tuition. Contact the program of interest for details. Mentees can apply to more than one program, but may accept only one.

If you know of colleagues or program alumni at the junior faculty level who would benefit from this innovative research training and mentorship opportunity, we urge you to encourage them to Apply.
We would appreciate your help in getting the word out …
·         Forward this message to appropriate faculty advisors and colleagues.
·         Print and post the program flyer in a common location.

·         Encourage eligible junior faculty to consider this 

Thursday, December 18, 2014

NGS method, RNA seq

Method in Lei 2013, Gene, Diminishing returns in next-generation sequencing (NGS)
transcriptome data. 

The sequencing files downloaded from the NCBI SRA database were initially converted from SRA format to FASTQ format using the SRA toolkit ( Then, the raw data were filtered using the following criteria: (1) the number of unknown bases (N) was no more than two for each read; and (2) the fraction of low-quality sites (Q < 5) was no more than 50% for each read. The data that passed this quality control were then used to map back to their respective genome sequences using bowtie2 (Langmead and Salzberg, 2012). Only uniquely mapped reads with no more than two mismatches were retained for further analysis. After mapping, the counts for each gene were summarized using HTSeq (http://wwwhuber. In the simulation, a predetermined-sized subset of reads was randomly selected from the original file. Using the same mapping procedure as mentioned above, the RPKM for each gene and the depth of coverage were calculated and compared with those from the original data. In-house Perl and R scripts were developed for data analysis and graphing (available upon request).

Monday, December 15, 2014

How children learn math

How children learn math

Liu & Chen, 2012, Protein Cell, Proteome-wide prediction of protein-protein interactions from high-throughput data.

2012 Jul;3(7):508-20. doi: 10.1007/s13238-012-2945-1. Epub 2012 Jun 22.

Proteome-wide prediction of protein-protein interactions from high-throughput data.

Good Review on protein/gene network study

Sunday, December 14, 2014

BIO233 final grade calculation

#This is file "gradebio233,20141214.R"


# tb = read.csv("201409-62376-01BIO233 Grades 20141209c.csv")
tb = read.csv("201409-62376-01BIO233 Grades 20141214-a.csv")
#The - signs have to be replaced with zeros in textwrangler

empty.columns= NULL
for (j in 8:length(tb[1,])){
  #for( i in 1:length(tb[,1])){
  #  if( tb[i,j]=='-') {tb[i,j]=NA } 
 tb[,j] = as.numeric( tb[,j])
 tb[[,j]),j] = 0
 if( max(tb[,j])==0 ) { empty.columns = c(empty.columns, j)}
}
tb2 = tb[, - empty.columns]
#tb2 = tb2[, -""]
#tb2 = tb2[, - grep('Spring', names(tb2))]
#tb2 = tb2[, - grep('spring', names(tb2))]
tb2 = tb2[, -grep("", names(tb2))]

examColumns = names(tb)[grep("xam", names(tb))]
exam1 = c( "Quiz.Exam1.Part1..Fall.2014",
           "Assignment.Exam1..part2..calculation.questions..Fall.2014" )
exam2 = c( "",
           "" )
final = c( "" )
#exam3 needs to be defined here as well; it is used below

names(tb)[grep("inal", names(tb))]

report= tb2[,1:2]
report$Exam1 = apply( tb2[,exam1], 1, sum)
report$Exam2 = apply( tb2[,exam2], 1, sum)
report$Exam3 = apply( tb2[,exam3], 1, sum)
report$Final = apply( tb2[,final], 1, sum)

practical = names(tb2)[grep("ractical", names(tb2))]
report$ToTpractical = (tb2[,"Assignment.Practical.Exam..microscope.and.morphology..Sep.29..2014"]/10
 + tb2[,"Assignment.Streak.plate..practical.exam"])/4

### do find out assignments and chapter quiz
#scale lab report was posted twice
scale = c( "Quiz.Lab.assignment..Scale.of.Microbes" )
report$scale = apply( tb2[, scale, drop=FALSE], 1, max )

# chapter homework can be found with "Quiz" or "Chapter". The names should be consistent!!!
names(tb2)[grep("Chapter", names(tb2))]

report$ch1 = tb2[, grep("Chapter.1",names(tb2))]
report$ch2 = tb2[, grep("Chapter\\.2",names(tb2))] ##.2 can match 32
report$ch3= apply( tb2[, grep("Chapter.3",names(tb2))],1, max)
report$ch4= apply( tb2[, grep("Chapter.4",names(tb2))],1, max)
report$ch5= apply( tb2[, grep("Chapter.5",names(tb2))],1, max)
report$ch6= apply( tb2[, grep("Chapter.6",names(tb2))],1, max)
report$ch7= apply( tb2[, grep("Chapter.7",names(tb2))],1, max)
report$ch8= apply( tb2[, grep("Chapter.8",names(tb2))],1, max)
report$ch9= apply( tb2[, grep("Chapter9",names(tb2))],1, max)
report$ch10= apply( tb2[, grep("Chapter10",names(tb2))],1, max)
report$ch16= apply( tb2[, grep("Chapter16",names(tb2))],1, max)
report$ch32= apply( tb2[, grep("Chapter32",names(tb2))],1, max)

#misc assignment and lab reports, which can be quiz or assignments
names(tb2)[grep("ment", names(tb2))]

misc= c( ""      ,                  
 "Quiz.DePaepeTaddei.Reading.Assignment"                 ,             
 ""     ,          
 "Quiz.Lab.assignment..Scale.of.Microbes"                 ,            
 "Quiz.Lab.assignment..E.coli.genome.studies"       ,                  
 ""  ,         
 "Assignment.homework.on.circulating.tumor.DNA"   )

tb2[, "Assignment.homework.on.circulating.tumor.DNA" ] =  tb2[,  "Assignment.homework.on.circulating.tumor.DNA" ]/10
tb2[, ""] =tb2[, ""]/10
tb2[1:5, misc]

report$misc= apply( tb2[, misc],1, sum)

assignAndLab =c("scale","misc","ch1","ch2","ch3","ch4","ch5","ch6","ch7","ch8", "ch9","ch10","ch16","ch32")
report$ToTassignAndLab = apply( report[,assignAndLab], 1, sum)
maxS = apply( report[, assignAndLab], 2, max)
report$ToTassignAndLab = 15*report$ToTassignAndLab / sum(maxS)
## end of assignment and lab reports

att.tb= read.csv( "201409-62376-01BIO233_Attendances_2014129-1734.csv")
att.tb$ToTAttendence = apply( att.tb[, 6:33], 1, sum)
hist(att.tb$ToTAttendence, br=20)
report$ToTAttendence = att.tb$ToTAttendence[match(report$, att.tb$]
report$ToTAttendence = report$ToTAttendence*5/ max(report$ToTAttendence)

# take best 2 regular exam and the final
report$badExam = apply(report[,c("Exam1","Exam2", "Exam3")], 1, min)
report$ExamTot = (report$Exam1 + report$Exam2 + report$Exam3 + report$Final - report$badExam) / 3


# bonus points, need to add R bonus points
names(tb2)[grep("onus", names(tb2))]
bonus = c("Assignment.Bonus.points.of.paper.presentations.and.volunteering" ,      
      ""     ,
      ""    )
report$bonus = apply( tb2[,bonus], 1, sum)

# oral 
report$oral = tb2[,"Assignment.Oral.presentation.grades..fall.2014"]

#written report
report$written = tb2$WrittenReport

FinalGrades= c("ExamTot","ToTpractical","ToTassignAndLab", "ToTAttendence", 'bonus', 'oral', "written")

report$FinalGrade= apply(report[,FinalGrades], 1, sum) 
hist(report$FinalGrade, br=20)

grade2letter = function(x){
  if(x>94){    ret='A'
  }else if (x >90) {    ret='A-'
  }else if (x >87 ){    ret = 'B+'
  }else if (x > 84){    ret = 'B'
  }else if (x >80){    ret = 'B-'
  }else if (x > 76){  ret = 'C+'  
  }else if (x > 70){ ret = 'C'  
  }else if (x > 67){ ret = 'C-'
  }else if (x > 64){ ret = 'D+'
  }else if (x > 60){ ret = 'D'
  }else {   ret = 'F' }
  return (ret)
}
grade2letter(70); grade2letter(88)
report$letter = sapply(report$FinalGrade,  grade2letter)

library(xlsx)  #write.xlsx() comes from the xlsx package
write.xlsx(report, "bio233FinalGradesFall20141214-a.xlsx")

#generate a sorted report
report.sorted = report[order(report$FinalGrade),]
write.xlsx(report.sorted, "bio233FinalGradesFall20141214-a-sorted.xlsx")