This site is to serve as my note-book and to effectively communicate with my students and collaborators. Every now and then, a blog may be of interest to other researchers or teachers. Views in this blog are my own. All rights of research results and findings on this blog are reserved. See also http://youtube.com/c/hongqin @hongqin
Saturday, November 29, 2014
Friday, November 28, 2014
LotusNotes Apple iPhone/iPad
1. Direct the Safari browser (on your Apple iPhone/iPad) to the Traveler User Home Page (http://spelbes.spelman.edu/
2. A User Status section at the top of the home page shows the status of the user and any of the user's devices. Make sure that there are no error messages, which would be highlighted in red, in this section. If errors exist, they probably need to be addressed before synchronization will be successful.
3. Select Configure your Apple iPhone/iPod Touch
4. Select Generate.
5. Select Install to begin the profile installation process.
6. When prompted about the authenticity of the profile, select Install Now to continue to install the profile.
7. When prompted, enter your Lotus Notes webmail password and select Next.
8. When the profile has been installed, select Done to return to the previous application (e.g., Safari). Your new Lotus Notes ActiveSync account will have been created under Mail, Contacts, and Calendars in the Settings Application. Registration with the server begins immediately and mail, calendar, and contacts should begin to show up soon.
Wednesday, November 26, 2014
Elements of Statistical Learning, video, pdf, Hastie, Tibshirani, 2014
Cloned from R-bloggers.
In January 2014, Stanford University professors Trevor Hastie and Rob Tibshirani (authors of the legendary Elements of Statistical Learning textbook) taught an online course based on their newest textbook, An Introduction to Statistical Learning with Applications in R (ISLR). I found it to be an excellent course in statistical learning (also known as “machine learning”), largely due to the high quality of both the textbook and the video lectures. And as an R user, it was extremely helpful that they included R code to demonstrate most of the techniques described in the book.
If you are new to machine learning (and even if you are not an R user), I highly recommend reading ISLR from cover-to-cover to gain both a theoretical and practical understanding of many important methods for regression and classification. It is available as a free PDF download from the authors’ website.
If you decide to attempt the exercises at the end of each chapter, there is a GitHub repository of solutions provided by students you can use to check your work.
As a supplement to the textbook, you may also want to watch the excellent course lecture videos (linked below), in which Dr. Hastie and Dr. Tibshirani discuss much of the material. In case you want to browse the lecture content, I’ve also linked to the PDF slides used in the videos.
Chapter 1: Introduction (slides, playlist)
- Opening Remarks and Examples (18:18)
- Supervised and Unsupervised Learning (12:12)
Chapter 2: Statistical Learning (slides, playlist)
- Statistical Learning and Regression (11:41)
- Curse of Dimensionality and Parametric Models (11:40)
- Assessing Model Accuracy and Bias-Variance Trade-off (10:04)
- Classification Problems and K-Nearest Neighbors (15:37)
- Lab: Introduction to R (14:12)
Chapter 3: Linear Regression (slides, playlist)
- Simple Linear Regression and Confidence Intervals (13:01)
- Hypothesis Testing (8:24)
- Multiple Linear Regression and Interpreting Regression Coefficients (15:38)
- Model Selection and Qualitative Predictors (14:51)
- Interactions and Nonlinearity (14:16)
- Lab: Linear Regression (22:10)
Chapter 4: Classification (slides, playlist)
- Introduction to Classification (10:25)
- Logistic Regression and Maximum Likelihood (9:07)
- Multivariate Logistic Regression and Confounding (9:53)
- Case-Control Sampling and Multiclass Logistic Regression (7:28)
- Linear Discriminant Analysis and Bayes Theorem (7:12)
- Univariate Linear Discriminant Analysis (7:37)
- Multivariate Linear Discriminant Analysis and ROC Curves (17:42)
- Quadratic Discriminant Analysis and Naive Bayes (10:07)
- Lab: Logistic Regression (10:14)
- Lab: Linear Discriminant Analysis (8:22)
- Lab: K-Nearest Neighbors (5:01)
Chapter 5: Resampling Methods (slides, playlist)
- Estimating Prediction Error and Validation Set Approach (14:01)
- K-fold Cross-Validation (13:33)
- Cross-Validation: The Right and Wrong Ways (10:07)
- The Bootstrap (11:29)
- More on the Bootstrap (14:35)
- Lab: Cross-Validation (11:21)
- Lab: The Bootstrap (7:40)
Chapter 6: Linear Model Selection and Regularization (slides, playlist)
- Linear Model Selection and Best Subset Selection (13:44)
- Forward Stepwise Selection (12:26)
- Backward Stepwise Selection (5:26)
- Estimating Test Error Using Mallows' Cp, AIC, BIC, Adjusted R-squared (14:06)
- Estimating Test Error Using Cross-Validation (8:43)
- Shrinkage Methods and Ridge Regression (12:37)
- The Lasso (15:21)
- Tuning Parameter Selection for Ridge Regression and Lasso (5:27)
- Dimension Reduction (4:45)
- Principal Components Regression and Partial Least Squares (15:48)
- Lab: Best Subset Selection (10:36)
- Lab: Forward Stepwise Selection and Model Selection Using Validation Set (10:32)
- Lab: Model Selection Using Cross-Validation (5:32)
- Lab: Ridge Regression and Lasso (16:34)
Chapter 7: Moving Beyond Linearity (slides, playlist)
- Polynomial Regression and Step Functions (14:59)
- Piecewise Polynomials and Splines (13:13)
- Smoothing Splines (10:10)
- Local Regression and Generalized Additive Models (10:45)
- Lab: Polynomials (21:11)
- Lab: Splines and Generalized Additive Models (12:15)
Chapter 8: Tree-Based Methods (slides, playlist)
- Decision Trees (14:37)
- Pruning a Decision Tree (11:45)
- Classification Trees and Comparison with Linear Models (11:00)
- Bootstrap Aggregation (Bagging) and Random Forests (13:45)
- Boosting and Variable Importance (12:03)
- Lab: Decision Trees (10:13)
- Lab: Random Forests and Boosting (15:35)
Chapter 9: Support Vector Machines (slides, playlist)
- Maximal Margin Classifier (11:35)
- Support Vector Classifier (8:04)
- Kernels and Support Vector Machines (15:04)
- Example and Comparison with Logistic Regression (14:47)
- Lab: Support Vector Machine for Classification (10:13)
- Lab: Nonlinear Support Vector Machine (7:54)
Chapter 10: Unsupervised Learning (slides, playlist)
- Unsupervised Learning and Principal Components Analysis (12:37)
- Exploring Principal Components Analysis and Proportion of Variance Explained (17:39)
- K-means Clustering (17:17)
- Hierarchical Clustering (14:45)
- Breast Cancer Example of Hierarchical Clustering (9:24)
- Lab: Principal Components Analysis (6:28)
- Lab: K-means Clustering (6:31)
- Lab: Hierarchical Clustering (6:33)
Interviews (playlist)
- Interview with John Chambers (10:20)
- Interview with Bradley Efron (12:08)
- Interview with Jerome Friedman (10:29)
- Interviews with statistics graduate students (7:44)
Tuesday, November 25, 2014
Algorithms and tools for protein–protein interaction networks clustering, with a special focus on population-based stochastic methods
Clara Pizzuti and Simona E. Rombo
http://bioinformatics.oxfordjournals.org/content/30/10/1343.short
PR14 (Pizzuti and Rombo 2014) used three yeast PPI data sets to compare MCL with other methods. The MCL parameter was taken from Brohée and van Helden 2006. PR14 used protein complexes as the 'gold standard'. When the overlapping score is > 20%, MCL is the best algorithm. Bader's MCODE is also a good method for certain parameter settings.
useful references on teaching
Active learning increases student performance in science, engineering, and mathematics
Scott Freeman, Sarah L. Eddy, Miles McDonough, Michelle K. Smith, Nnadozie Okoroafor, Hannah Jordt, and Mary Pat Wenderoth, April 15, 2014
http://www.pnas.org/content/111/23/8410.full.pdf+html
Research-Based Learning Principles
http://www.josephjaywilliams.com/education#TOC-Comparison:-Help-learners-grasp-or-construct-new-abstract-principles-by-comparison-of-specific-examples-of-the-generalization.
Monday, November 24, 2014
Wang, Li, Deng, Pan, BMC review on clustering methods for protein interaction networks.
Recent advances in clustering methods for protein interaction networks
Jianxin Wang, Min Li, Youping Deng, Yi Pan
From the ISIBM International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing (IJCBS), Shanghai, China. 3-8 August 2009
cited by
http://scholar.google.com/scholar?cites=16432683922097612422&as_sdt=5,43&sciodt=0,43&hl=en
The review covers 20 clustering methods, including MCL, which is described as highly successful.
10. Brohée S, van Helden J: Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics 2006, 7:48.
63. Vlasblom J, Wodak SJ: Markov clustering versus affinity propagation for the partitioning of protein interaction graphs. BMC Bioinformatics 2009, 10:99.
Lin C, Cho Y-R, Hwang W-C, Pei P, and Zhang A. 2007. Clustering Methods in a Protein–Protein Interaction Network. In: Hu X, and Pan Y, eds. Knowledge Discovery in Bioinformatics: John Wiley & Sons, Inc., 319-355.
CLUSTERING METHODS IN PROTEIN-PROTEIN INTERACTION NETWORK
Chuan Lin, Young-rae Cho, Woo-chang Hwang, Pengjun Pei, Aidong Zhang
Department of Computer Science and Engineering
State University of New York at Buffalo
Cite as:
Lin C, Cho Y-R, Hwang W-C, Pei P, and Zhang A. 2007. Clustering Methods in a Protein–Protein Interaction Network. In: Hu X, and Pan Y, eds. Knowledge Discovery in Bioinformatics: John Wiley & Sons, Inc., 319-355.
This review article did not provide enough details on validation and comparison of different algorithms.
Sunday, November 23, 2014
minimal version, R package
For my todo list, write a personalized R package
http://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/
Friday, November 21, 2014
MCL algorithm comparison
Brohée and van Helden 2006, BMC Bioinformatics [BH06]
BH06 compared Markov Clustering (MCL), Restricted Neighborhood Search Clustering (RNSC), Super Paramagnetic Clustering (SPC), and Molecular Complex Detection (MCODE), using annotated protein complexes as the benchmark. Random noise was introduced by randomly adding and deleting edges. MCL was the most reliable and robust method.
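For my own reference, a minimal sketch of the MCL idea: alternate expansion (matrix squaring) and inflation (element-wise power followed by column re-normalization) on a column-stochastic matrix until it stops changing. This is not the implementation benchmarked in BH06. The function names, the inflation value of 2.0, the unit self-loops, and the toy graph (two triangles joined by one edge) are all my own assumptions for illustration.

```python
def normalize_columns(matrix):
    """Rescale each column so it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain square-matrix product, kept dependency-free."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    """Return clusters (sorted lists of node indices) for an undirected graph."""
    n = len(adjacency)
    # Add self-loops (an assumed, commonly used choice), then normalize.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        expanded = matmul(m, m)                      # expansion step
        inflated = normalize_columns(                # inflation step
            [[x ** inflation for x in row] for row in expanded])
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break
    # Each row with non-zero entries marks one cluster (its non-zero columns).
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6)
                for i in range(n)}
    clusters.discard(frozenset())
    return sorted(sorted(c) for c in clusters)


# Hypothetical toy graph: two triangles joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
toy = [[0] * 6 for _ in range(6)]
for a, b in edges:
    toy[a][b] = toy[b][a] = 1
print(mcl(toy))
```

A larger inflation value generally gives finer clusters, which is why the parameter choice matters when comparing MCL with other methods.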
Thursday, November 20, 2014
Atlanta-qbio google group (AQBIO)
https://groups.google.com/forum/#!forum/atlanta-qbio
http://groups.google.com/d/groupsettings/atlanta-qbio/information
atlanta-qbio@googlegroups.com
http://groups.google.com/d/managemembers/atlanta-qbio/invite
conference call on comp and systems biology
Lessons:
I cannot see all the people. Sometimes it is hard to know who is talking.
When people are not talking, their microphones should be muted to reduce background noise.
Wednesday, November 19, 2014
bio386, proofread of CNV manuscript
The student proofread the CNV manuscript.
The final take-home exam was given.
Tuesday, November 18, 2014
bio233, epidemiology,
I used an R-based simulation in class. The pace was fast, and about two-thirds of the students were not following. However, a few students clearly paid attention.
Problems:
1) The plot did not show in RStudio due to screen resolution. I used a PDF to circumvent the problem.
2) I did not give students enough time to run the R code themselves, partly because I only required this for bonus points.
Monday, November 17, 2014
bio233, lab, analysis of flow cytometer data, Australian rabbit virus
Went over oral presentation order, final project report
Exam schedule
30 minutes, analysis of flow cytometer data,
http://youtu.be/BN5Ldu1AFgk
30 minutes, Australian rabbit virus
http://youtu.be/FqSDxGYu3K0
30 minutes, streak single colonies. I used my streak plate as an example of "A".
Thursday, November 13, 2014
bio233, phylogeny
I let the bio233 students work through PowerPoint slides.
I stumbled on the comparison between the canonical endosymbiosis hypothesis and the hydrogen hypothesis for mitochondria after a student asked about it.
Wednesday, November 12, 2014
bio386, R coursera
I led students through the R programming course offered by Coursera.
Tuesday, November 11, 2014
bio233, virus, problem sets
I let students work through two sets of MCAT-style problems. It was the first time that we finished two reading paragraphs in a 75-minute class.
Lotus Notes, initiate a new proposal for internal routing
http://spelmanosp.wordpress.com/2014/09/22/obtaining-approval-to-submit-a-grant-proposal/
Monday, November 10, 2014
bio233, streaking for single colonies
Six-streak procedure
1: short
2: twice
3: three times
4: four times
5: many times
6: many times to evenly spread out the cells.
Usually the 4th streak should be thinned out.
Many students went back to the cell cultures for every streak, and this even happened after 2-3 trials.
bio233, flow cytometer lab on DHE-labelled yeast cells.
The class spent 2 hours on DHE staining. I could save time by giving an assignment on the protocol itself.
When I let students re-streak their plates, some groups stopped working on their DHE staining procedure.
At 4pm, I started CellQuest, but no signal could be read from the Calibur. This is the third time that this machine has malfunctioned. Really bad timing.
Thursday, November 6, 2014
Flow cytometry, flow cytometer teaching resource
YouTube Learning material on flow cytometry
Hand-drawing introduction.
Animated introduction
UW's tutorial on flow cytometry
Hong Qin's tutorial on BD FACS Calibur usage. Useful for understanding experimental procedure.
Reading materials on flow cytometry
Flow cytometry from Wikipedia
DHE, superoxide indicator
http://www.lifetechnologies.com/order/catalog/product/D1168
NIH, Flow Cytometry Interest Group
http://sigs.nih.gov/FCIG/Pages/default.aspx
Wednesday, November 5, 2014
BIO125, spring 2015 strain and data request,
AGY 75, yeast strain with pSH44 reporter plasmid
AGY125, yeast strain with the wild type pMSH2 and pSH44 (This is the wildtype MSH2 control)
AGY124, yeast strain with pRS413 and pSH44 (This is the plasmid control)
E. coli strains with plasmids
AG372 pmsh2-H658R
AG421 pmsh2-A618V
Read Gammie's recent papers.
Small NGS data of wildtype MSH2 and mutant msh2 for students to analyze using Galaxy
Tuesday, November 4, 2014
bio233, guest lecture, circulating tumor DNA
bio233 guest lecture
CAPP-seq
CT and biopsy are common methods for tumor diagnosis.
How does ctDNA come from tumors?
There is a lot of cell-free DNA in human circulation, typically 5 ng/ml of plasma in healthy adults, primarily from hematopoietic cells. Cell-free DNA typically has a half-life of 0.5 to 2 hours.
Hybrid selection (NimbleGen), target enrichment.
Is 10,000X sequencing required?
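A quick back-of-the-envelope on what a half-life of 0.5 to 2 hours means for cell-free DNA. This is a sketch that assumes simple first-order (exponential) clearance, which was not stated in the lecture; the function name and the 2-hour time point are my own choices.

```python
def fraction_remaining(hours, half_life_hours):
    """Fraction of the initial cell-free DNA left after `hours`,
    assuming first-order clearance with the given half-life."""
    return 0.5 ** (hours / half_life_hours)


# Compare the two ends of the quoted half-life range after 2 hours.
for half_life in (0.5, 2.0):
    print(half_life, fraction_remaining(2.0, half_life))
```

Under this assumption, the amount present at any moment reflects DNA released within the last few hours.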
Monday, November 3, 2014
bio233 phylogeny and lab, practical exam on streaking single colonies
I spent 1 hour on an introduction to phylogeny using my own slides.
Many students did not bring laptops.
For the lab, MEGA6 on Mac runs very slowly.
For the practical exam, some students did not see the previous streaking example clearly.
Sunday, November 2, 2014
SVM, reading notes
See http://hongqinlab.blogspot.com/2014/11/elements-of-statistical-learning-video.html
SVM kernel trick
trial and error to separate data in high-dimensional space
cross validation
predict True Negative?
Matthews correlation coefficient (MCC) (for binary classification)
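A small sketch of the MCC computed from the four cells of a binary confusion matrix. Unlike precision or recall alone, it uses the true negatives as well. The function name and the example counts are made up for illustration, and returning 0.0 when the denominator is zero is a common convention that I am assuming here.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column is empty
    return (tp * tn - fp * fn) / denom


# Hypothetical counts from a binary classifier.
print(matthews_corrcoef(tp=6, tn=3, fp=1, fn=2))
```

The value ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right).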
SVM maximizes the soft margin.
Data should be standardized for SVM analysis, because SVM treats every column the same.
On ResearchGate, someone argues: Perform different normalization such as Z-Score or Min-Max before using PCA. Z-Score normalization before using PCA might be beneficial.
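A minimal sketch of column-wise Z-score standardization, so that every feature has mean 0 and standard deviation 1 before it goes into SVM or PCA. The function name and the toy data are my own. I use the sample standard deviation (n - 1 in the denominator), which is an assumption; some tools divide by n instead.

```python
import math


def zscore_columns(rows):
    """Standardize each column of a list-of-rows table to mean 0, sd 1."""
    n = len(rows)
    ncol = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(ncol)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(ncol)]
    # A constant column has sd 0; map it to zeros instead of dividing by 0.
    return [[(r[j] - means[j]) / sds[j] if sds[j] > 0 else 0.0
             for j in range(ncol)] for r in rows]


# Two hypothetical features on very different scales.
data = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
print(zscore_columns(data))
```

Without this step, the second column would dominate any distance or inner product simply because its numbers are larger.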
For principal component analysis (PCA) and SVM,
http://www.softcomputing.net/isda2010_2.pdf
On ResearchGate: Principal components are linear combinations of original variables x1, x2, etc. So when you do SVM on PCA decomposition you work with these combinations instead of original variables.
A support vector classifier in the enlarged space solves the separation problem in the lower-dimensional space.
Question: A kernel is used to compute inner products of vectors. Why are there different types of kernels for computing the same thing (inner products)?
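One way to look at this question: each kernel equals an ordinary inner product, but in a different enlarged feature space, so different kernels are not computing the same thing. The sketch below checks this for the degree-2 polynomial kernel on 2-D inputs, where the feature map can be written out explicitly. The function names and the test vectors are my own, and the radial kernel is included only for contrast (its feature space is infinite-dimensional, so no explicit map is shown).

```python
import math


def dot(x, y):
    """Ordinary inner product in the original space."""
    return sum(a * b for a, b in zip(x, y))


def poly2_kernel(x, y):
    """Degree-2 polynomial kernel without a constant term: (x . y)^2."""
    return dot(x, y) ** 2


def poly2_features(x):
    """Explicit feature map for poly2_kernel on 2-D inputs."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]


def rbf_kernel(x, y, gamma=1.0):
    """Radial kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


x, y = (1.0, 2.0), (3.0, 4.0)
# The kernel value and the inner product of the mapped vectors agree.
print(poly2_kernel(x, y), dot(poly2_features(x), poly2_features(y)))
print(rbf_kernel(x, y))
```

The kernel trick is that `poly2_kernel` never builds the 3-D vectors, yet it gives the same number as the inner product in that enlarged space.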
SVM for more than 2 classes:
In general, the equation for a hyperplane in p dimensions has the form β0 + β1X1 + ... + βpXp = 0.
MATLAB ODE solver
ode15s
fmincon
http://laser.cheng.cam.ac.uk/wiki/images/e/e5/NumMeth_Handout_7.pdf
polytopes and phylogeny
A (convex) polytope is the convex hull of a finite set of points.
http://en.wikipedia.org/wiki/Polytope
tree -> matrix as Markov process OR polytope
Saturday, November 1, 2014
funding 2014
PD 14-7513 , due Feb 4, 2014
http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=504976&org=EHR&from=home#.Uo-EwOFRdH8.facebook
NIH big data
http://grants.nih.gov/grants/guide/rfa-files/RFA-HG-14-009.html
http://bd2k.nih.gov/funding_opportunities.html#sthash.hFOQ3jpE.dpbs
NIH diversity RFP
http://grants.nih.gov/grants/guide/pa-files/PAR-12-016.html expires in Jan 2015
Cholera
http://cph.osu.edu/people/jtien
Cholera SIWR model
Seasonal variation will be used to further improve the model, built directly into the force of infection.
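For my own reference, a minimal sketch of an SIWR-type model (Susceptible, Infected, Water compartment, Recovered) with a fixed-step Euler integration. The exact equations and parameter values used in the talk are not in my notes, so the equations below are an assumed standard form with person-to-person (b_i) and water-borne (b_w) transmission, and every parameter name and number is hypothetical.

```python
def siwr_step(state, p, dt):
    """One Euler step of an assumed SIWR model.

    state = (S, I, W, R); p holds hypothetical rate parameters.
    """
    s, i, w, r = state
    n = s + i + r
    new_infections = (p["b_w"] * w + p["b_i"] * i) * s
    ds = p["mu"] * n - new_infections - p["mu"] * s
    di = new_infections - (p["gamma"] + p["mu"]) * i
    dw = p["alpha"] * i - p["xi"] * w      # shedding minus decay in water
    dr = p["gamma"] * i - p["mu"] * r
    return (s + dt * ds, i + dt * di, w + dt * dw, r + dt * dr)


def simulate(state, p, dt, steps):
    """Return the list of states visited, starting with the initial one."""
    states = [state]
    for _ in range(steps):
        state = siwr_step(state, p, dt)
        states.append(state)
    return states


def basic_reproduction_number(p, n):
    """R0 for this assumed form: direct plus water-borne transmission."""
    return n * (p["b_i"] + p["b_w"] * p["alpha"] / p["xi"]) / (p["gamma"] + p["mu"])


# Hypothetical parameter values, chosen only to make the example run.
params = {"b_i": 0.0001, "b_w": 0.0002, "alpha": 1.0,
          "xi": 0.5, "gamma": 0.2, "mu": 0.0}
trajectory = simulate((999.0, 1.0, 0.0, 0.0), params, dt=0.01, steps=1000)
print(basic_reproduction_number(params, 1000.0))
print(trajectory[-1])
```

Seasonality could be added to a sketch like this by making b_w or b_i a function of time, in the spirit of building it directly into the force of infection.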
19th-century sample
http://muttermuseum.org/
Haiti: no recorded history of cholera infection, no immune responses.
Cholera spatial spread, waterways, human movement, cell phone movement (Digicel, Flowminder)
Moran's I to compare cell phone movement and cholera spread. Local movement versus waterways.
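A minimal sketch of Moran's I, the spatial autocorrelation statistic mentioned above. How the weights were built in the talk is not in my notes. Presumably one weight matrix came from cell phone movement and another from waterways, but that is my assumption, and the function name, the chain-shaped weight matrix, and the values are all made up for illustration.

```python
def morans_i(values, weights):
    """Moran's I = (n / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2,
    where m is the mean of the values and W is the sum of all weights."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_total = sum(weights[i][j] for i in range(n) for j in range(n))
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / w_total) * cross / sum(d * d for d in dev)


# Hypothetical example: four locations along a line, neighbors connected.
chain = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]
print(morans_i([1.0, 2.0, 3.0, 4.0], chain))    # smooth gradient
print(morans_i([1.0, -1.0, 1.0, -1.0], chain))  # alternating pattern
```

Computing the statistic for the same case counts under two different weight matrices is one way to compare which notion of "neighbor" better matches the observed spread.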
Community networks with environmental pathogen movement
patch heterogeneity
weight directed edges in networks
When can disease invade the network? R0 of the network,
Coupled locations
Next generation matrix (second generation matrix; Diekmann, Heesterbeek, and Metz 1990; van den Driessche and Watmough 2002).
Transfer matrix V = Transfers out - Transfers in + Decay
The Laplacian matrix (from graph theory) can be used to model transfers out and in,
i.e. V = L + D
D = diag{\delta_i}
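A small sketch of how V = L + D can be assembled for a weighted directed network. The convention here is my own assumption: rates[i][j] is the movement rate from patch j to patch i, so each diagonal entry of L is the total outflow from that patch and the off-diagonal entries are minus the inflows. The function name and all numbers are hypothetical.

```python
def transfer_matrix(rates, decay):
    """Build the graph Laplacian L and the transfer matrix V = L + D.

    rates[i][j]: assumed movement rate from patch j to patch i (i != j).
    decay[i]:    decay rate delta_i in patch i, so D = diag(decay).
    """
    n = len(rates)
    # Total rate of leaving patch j = sum of column j (excluding the diagonal).
    outflow = [sum(rates[i][j] for i in range(n) if i != j) for j in range(n)]
    laplacian = [[outflow[j] if i == j else -rates[i][j] for j in range(n)]
                 for i in range(n)]
    v = [[laplacian[i][j] + (decay[i] if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return laplacian, v


# Hypothetical three-patch network.
rates = [[0.0, 0.2, 0.0],
         [0.5, 0.0, 0.1],
         [0.0, 0.3, 0.0]]
decay = [1.0, 2.0, 0.5]
L, V = transfer_matrix(rates, decay)
for row in V:
    print(row)
```

With this convention every column of L sums to zero (movement only relocates pathogen), so the column sums of V are just the decay rates.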
The time scales have to be right.
V^-1 as a perturbation problem
Langenhop 1971, Laurent series for perturbed singular matrices
According to Tien, the lifespan of cholera bacteria was fitted with an exponential model in the lab. Later, Tien explained that fresh cholera bacteria have a high infection rate, so the fitness of cholera bacteria shows a characteristic of aging.