This site is to serve as my note-book and to effectively communicate with my students and collaborators. Every now and then, a blog may be of interest to other researchers or teachers. Views in this blog are my own. All rights of research results and findings on this blog are reserved. See also http://youtube.com/c/hongqin @hongqin
Sunday, May 28, 2017
RNAseq coursesource materials
http://www.coursesource.org/courses/teaching-rnaseq-at-undergraduate-institutions-a-tutorial-and-r-package-from-the-genome-0#tabs-0-content=1
NIBLSE, bioinformatics core competencies
https://qubeshub.org/groups/niblse/resourcecollection/core_competencies#2
This set of bioinformatics core competencies for undergraduate life
scientists is informed by the survey results of more than 1,200 people,
analysis of 90 syllabi addressing bioinformatics across institutions and
diverse departments, and discussion among experts across academia and
industry. The bulleted lists contain examples illustrating the
competencies.
-
Explain the role of computation and
data mining in addressing hypothesis-driven and hypothesis-generating
questions within the life sciences: It is crucial for students
to have a clear understanding of the role computing and data mining play
in the modern life sciences. Given a traditional hypothesis-driven
research question, students should have ideas about what types of data
and software exist that could help them answer the question quickly and
efficiently. They should also appreciate that mining large datasets can
generate novel hypotheses to be tested in the lab or field.
- What hypotheses can one pose based on the biometric data being compiled (Fitbit, Google, etc.)?
- Understand the role of various databases in identifying potential gene targets for drug development
-
Summarize key computational concepts, such as algorithms and relational databases, and their applications in the life sciences: In
order to make use of sophisticated software and database tools,
students must have a basic understanding of the underlying principles
that these tools are based upon. Students are not expected to be experts
in multiple algorithms or sophisticated databases, but currently the
vast majority of life sciences majors never take a programming or
database course, and have essentially zero exposure to how these tools
work. This must change.
- Be exposed to how data is organized in relational databases
- Be able to modify the search parameters to achieve biologically meaningful results
- Understand underlying algorithm(s) employed in sequence alignment (e.g. BLAST)
-
Apply statistical concepts used in bioinformatics: Many
biology curricula contain statistics, either as a standalone
biostatistics course or as part of other courses such as capstone
research courses. The primary distinction with regard to bioinformatics
has to do with the statistics of large datasets and multiple
comparisons.
- Drug trials: Interpretation of well designed drug trial data
- Transcriptomics: Understand the statistical modelling used to identify differentially expressed genes; Understand how genes implicated in cancer are identified using panels of sequenced tumor and WT cell lines or biopsies
- Sequence similarity searching: Understand that there is a probability of finding a given sequence similarity score by chance (the p-value); The size of the database searched affects the probability that they would see that particular score in a particular search (the expectation, or e-value).
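The dependence of the E-value on database size described in the last bullet can be sketched numerically. This is a minimal illustration, assuming the Karlin-Altschul form E = m * n * 2^(-S') for a bit score S', query length m and database length n, and P = 1 - exp(-E). The function names and example numbers are hypothetical and not part of the NIBLSE text.

```python
import math

def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score (assumed form)."""
    return query_len * db_len * 2.0 ** (-bit_score)

def p_value(e_value):
    """Probability of seeing at least one such chance hit, P = 1 - exp(-E)."""
    return -math.expm1(-e_value)

# Same score and query, two hypothetical database sizes (in residues).
small_db = expect_value(40.0, 300, 1_000_000)
large_db = expect_value(40.0, 300, 100_000_000)
print(small_db, large_db, p_value(small_db))
```

A database 100 times larger gives an E-value 100 times larger for the same score, which is why the same alignment can look significant in one search and not in another.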
-
Use bioinformatics tools to examine
complex biological problems in evolution, information flow, and other
important areas of biology: This competency is written broadly
so as to encompass a variety of problems addressed using bioinformatics
tools, from understanding the evolutionary underpinnings of sequence
comparison and homology detection, to the distinctions between genomic
sequences, RNA sequences, and protein sequences, to the interpretation
of phylogenetic trees. We want to emphasize that bioinformatics tools
can be used to teach existing parts of the curriculum such as the
central dogma or phylogenetic relationships, thus integrating the
bioinformatics into the curriculum as opposed to adding it on as an
addition to an already overfull curriculum (and thus forcing decisions
about what topic to remove to make room). The point of saying “complex”
biological problems is that students should be able to work through a
problem with multiple steps, not just perform isolated tasks.
- Employ gene ontology tools (e.g., Mapman, GO, KEGG).
- Understand protein sequence, structure, and function, using a variety of tools
- Understand gene structure, genomic context, alternative splicing using genome browsers
- Understand concept of homology
-
Find, retrieve, and organize various types of biological data: Given
the numerous and varied datasets currently being generated from all of
the ‘omics fields, students should develop the facility to: identify
appropriate data repositories; navigate and retrieve data from these
databases; and organize data relevant to their area of study (in flat
files or small local stand-alone databases).
- Store and interrogate small datasets using spreadsheets or delimited text files.
- Navigate and retrieve data from genome browsers
- Retrieve data from protein and genome databases (PDB, UniProt, NCBI)
-
Explore and/or model biological interactions, networks and data integration using bioinformatics: Modeling
of biological systems at all levels, from cellular to ecological, is
being facilitated by technological (e.g., sequencing, biochemical,
genetics) and algorithmic advances. These models provide novel insights
into the perturbations in systems causative of disease, interactions of
microbes with various eukaryotic systems, and how metabolic networks
respond to environmental stresses. Students should be familiar with the
techniques used to generate these analyses, have the ability to
interpret the outputs, and use the data to generate novel hypotheses.
- Cell Biology: predict impact of gene knockout on cell-signaling pathway
- Transcriptome: Analysis of transcriptomic data (RNA-Seq) available from SRS using Galaxy
- Ecological: Analysis of microbial sequence data using QIIME on Galaxy
-
Use command-line bioinformatics tools and write simple computer scripts: The
majority of the datasets students should be familiar with and be able
to interact with (e.g., genomic and proteomic sequences, BLAST results,
RNASeq and resulting differential expression data) are text files. The
most powerful and dynamic way to interact with these datasets is through
the command line or shell scripting, both of which are readily acquired
skills. Students need to have the flexibility to manipulate their own
data, and to create and modify complex data processing and analysis
workflows.
- Write simple unix shell scripts to manipulate files
- Apply RNASeq analyses using R (STAR, Tophat, DESeq2) to open source data sets (SRS)
- Build and run statistical analyses using R or Python scripts
- Run BLAST using command line options
-
Describe and manage biological data types, structure, and reproducibility: This
competency addresses two distinct concerns: 1) each of the varied
‘omics fields produce data in formats particular to its needs, and these
formats evolve with changes in technologies and refinements in
downstream software; and 2) all experimental data is subject to error
and the user must be cognizant of the need to verify the reproducibility
of their data. The first concern highlights the requirement for
students to develop an awareness of and ability to manipulate different
data types given the versioning of formats. The second points to the
need for caution, to carry out appropriate statistical analyses on their
data as part of normal operating procedures and report the uncertainty
of their results, and to provide the relevant information to enable
reproduction of their results. Sometimes students have the tendency to
assume that anything they retrieve from an online database must be
correct; they need to be taught that this is not always the case.
- Reproducibility: Compare reproducibility of biological replicate data (e.g., transcriptomic data) using statistical tests (Spearman).
- Formats: Understand the various sequence formats used to store DNA and protein sequences (FASTA, FASTQ); Understand the representation of gene features using Gene Feature Format (GFF) files; Mass-Spec
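The replicate comparison in the Reproducibility bullet can be sketched with a Spearman rank correlation. This is a standard-library-only sketch (Pearson correlation computed on average ranks). The two replicate vectors are made-up expression values, and the function names are my own.

```python
import math

def average_ranks(values):
    """Rank values 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical expression values for six genes in two biological replicates.
rep1 = [5.0, 120.0, 33.0, 0.0, 870.0, 12.0]
rep2 = [7.0, 98.0, 41.0, 1.0, 910.0, 9.0]
rho = spearman(rep1, rep2)
print(rho)
```

Because only ranks are used, the result is insensitive to the large dynamic range typical of expression data.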
-
Interpret the ethical, legal, medical, and social implications of biological data: The
increasing scale and penetrance of human genetic and genomic data has
greatly enhanced our ability to identify disease-related loci, druggable
targets, and potential for gene replacements with developing
techniques. However, with this information also comes many ethical,
legal, and social questions which are often outpaced by the
technological advances. As part of their scientific training, students
should debate the medicinal, societal and ethical implications of these
information sets and techniques.
- How does the scientific community protect against the falsification or manipulation of large datasets?
- Who should have access to this data, and how should it be protected?
- What are the implications, good and bad, of being able to walk into a doctor’s office and have your genome sequenced and analyzed in minutes?
Friday, May 26, 2017
KR
DataCamp, $150 for a one-year subscription for advanced courses
Raspberry Pi can be hooked up to a monitor
Thursday, May 25, 2017
Friday, May 19, 2017
day5, jackson lab
==========
Mark Adams
microbial genomics service
mock microbial community => assess DNA extraction method, or other procedures
==========
Aditya Srikanth Kovuri
Sandeep Namburi
NIST, cloud characteristics,
http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf
on-demand self-service
broad network access
resource pooling
rapid elasticity, https://www.techopedia.com/definition/29526/rapid-elasticity
measured service
https://biologydirect.biomedcentral.com/articles/10.1186/1745-6150-7-43
Amazon S3
Galaxy CloudMan
Google cloud is cheaper than GoogleCloud.
GoogleGenomics API.
https://cloud.google.com/genomics/reference/rest/
Docker
Microsoft Azure Research awards
Google Research award
Thursday, May 18, 2017
day4, jackson lab,
=> Krish Karuturi
big data genomics, computational and informatics challenges
https://www.jax.org/research-and-faculty/tools/scientific-research-services/computational-sciences/staff/krishna-karuturi
https://github.com/TheJacksonLaboratory/civet
TORQUE resource manager
https://en.wikipedia.org/wiki/Comparison_of_cluster_software
benchmarking pipelines
https://en.wikipedia.org/wiki/SNV_calling_from_NGS_data#List_of_available_software
https://www.nature.com/articles/srep43169
GSEA
GSA, Efron & Tibshirani
XENOME
etherpad
https://public.etherpad-mozilla.org/p/2017-05-18-bigData-grad-prof
===============================
Peter Robinson, Ph.D., The Jackson Laboratory for Genomic Medicine
Phenotype driven genome analysis
https://scholar.google.com/citations?user=TPOD_XUAAAAJ&hl=en
Ontology, unambiguous terms.
human phenotype ontology
information content (IC) of concept.
semantically similar diseases scores
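The information content idea can be sketched on a toy ontology. This assumes a Resnik-style definition: IC(t) = -ln(fraction of diseases annotated to t or to one of its descendants), with the similarity of two terms taken as the IC of their most informative common ancestor. The terms, diseases and function names below are hypothetical, not real HPO identifiers, and this is not necessarily the exact scoring presented in the talk.

```python
import math

# Hypothetical toy ontology: term -> list of parent terms.
PARENTS = {
    "abnormal_organ": [],
    "abnormal_heart": ["abnormal_organ"],
    "abnormal_kidney": ["abnormal_organ"],
    "septal_defect": ["abnormal_heart"],
}

# Hypothetical annotations: disease -> directly annotated terms.
ANNOTATIONS = {
    "disease_A": ["septal_defect"],
    "disease_B": ["abnormal_heart"],
    "disease_C": ["abnormal_kidney"],
    "disease_D": ["abnormal_kidney"],
}

def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def information_content(term):
    """IC = -ln(fraction of diseases annotated to the term or a descendant)."""
    hits = sum(
        1
        for terms in ANNOTATIONS.values()
        if any(term in ancestors(t) for t in terms)
    )
    return -math.log(hits / len(ANNOTATIONS))

def term_similarity(term_a, term_b):
    """IC of the most informative common ancestor of the two terms."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(information_content(t) for t in common)

print(information_content("septal_defect"))
print(term_similarity("septal_defect", "abnormal_heart"))
print(term_similarity("septal_defect", "abnormal_kidney"))
```

Rare, specific terms carry more information than the root term, which every disease shares and which therefore has an IC of zero.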
PhenoBLAST
Washington NL 2009, PLoS Biology
======================================
Y Ada Zhan, ChIP-seq
https://en.wikipedia.org/wiki/ChIP-sequencing
encodeproject.org
https://academic.oup.com/bib/article/17/6/953/2453197/A-comprehensive-comparison-of-tools-for
bd2kuser@ip-172-31-73-47:~/ChIPseq$ cat readme.txt
###################
# ChIP-seq module #
###################
# ChIP-seq data
In the directory ChIPseq/
GM12878_control_chr1.fastq
GM12878_CTCF_chr1.fastq
# Genome
In the directory ChIPseq/hg38/
GRCh38.chr1.fa
GRCh38.chr1.size
# Tools
fastqc (quality check)
bowtie (sequence mapping or alignments)
samtools (manipulating alignments in SAM format. BAM format is a compressed version of SAM file)
macs2 (peak calling)
bedtools (to handle sequence coordinate files in BED format)
bd2kuser@ip-172-31-73-47:~/ChIPseq$
bd2kuser@ip-172-31-73-47:~/ChIPseq$ cat workflow.sh
# quality check
fastqc GM12878_control_chr1.fastq
fastqc GM12878_CTCF_chr1.fastq
# Prepare genome
bowtie-build hg38/GRCh38.chr1.fa hg38/GRCh38.chr1
# Mapping
bowtie -m 1 -S ./hg38/GRCh38.chr1 GM12878_control_chr1.fastq > GM12878_control_chr1.sam
bowtie -m 1 -S ./hg38/GRCh38.chr1 GM12878_CTCF_chr1.fastq > GM12878_CTCF_chr1.sam
# Further processing
## compress to BAM
samtools view -bSo GM12878_control_chr1.bam GM12878_control_chr1.sam
samtools view -bSo GM12878_CTCF_chr1.bam GM12878_CTCF_chr1.sam
## sort
samtools sort GM12878_control_chr1.bam GM12878_control_chr1.sorted
samtools sort GM12878_CTCF_chr1.bam GM12878_CTCF_chr1.sorted
## index
samtools index GM12878_control_chr1.sorted.bam
samtools index GM12878_CTCF_chr1.sorted.bam
# Peak calling
macs2 callpeak -t GM12878_CTCF_chr1.sorted.bam -c GM12878_control_chr1.sorted.bam -f BAM -g 175000000 -n GM12878_CTCF_chr1 -B -q 0.01
# Check the peak model
Rscript GM12878_CTCF_chr1_model.r
# Motif analysis
## extend summits 100bp on both directions
bedtools slop -i GM12878_CTCF_chr1_summits.bed -g hg38/GRCh38.chr1.size -b 100 > GM12878_CTCF_chr1_summits_ext.bed
## get sequence file (i.e. fasta)
bedtools getfasta -fi hg38/GRCh38.chr1.fa -bed GM12878_CTCF_chr1_summits_ext.bed -fo GM12878_CTCF_chr1_summits_ext.fa
## The .fa file will be uploaded to MEME online server for motif discovery (http://meme-suite.org/tools/meme)
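What the `bedtools slop -b 100` step in the workflow does to each summit can be sketched in a few lines. This is only an illustration of the interval arithmetic (extend both ends, clip to the chromosome), not a replacement for bedtools. The chromosome size and the BED line are toy values, and the function names are my own.

```python
def slop(start, end, chrom_size, flank):
    """Extend a 0-based, half-open BED interval by flank bp on both sides,
    clipping at position 0 and at the chromosome end."""
    return max(0, start - flank), min(chrom_size, end + flank)

def slop_bed_line(line, chrom_sizes, flank=100):
    """Apply slop() to one tab-delimited BED line, keeping the extra columns."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    new_start, new_end = slop(start, end, chrom_sizes[chrom], flank)
    return "\t".join([chrom, str(new_start), str(new_end)] + fields[3:])

# Hypothetical toy chromosome size (the workflow reads sizes from GRCh38.chr1.size).
toy_sizes = {"chr1": 1000}
extended = slop_bed_line("chr1\t50\t51\tpeak_1\t12.5", toy_sizes)
print(extended)
```

A 1 bp summit far from the chromosome ends becomes a 201 bp window, which is the sequence later extracted with `bedtools getfasta`.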
BED file format
MEME motif discovery
ChIPseek website for interactive data analysis
Wednesday, May 17, 2017
day3, 20170517Wed Jackson Lab, Galaxy, IGV,
=> Paola Vera-Licona
gene network
time series gene expression data -> network
structure-based control of signaling networks (optimization of interaction? )
HER2-positive breast cancer
BiNoM -> geneXplain --> OCSANA
gene expression -> list TFs ---> mapping pathways + master regulator --> identify optimal combination of intervention from network analysis
candidate genes with p-values
pick largest connected component
using random sampling permutation to evaluate the choice of p-value cutoff.
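The last two steps (pick the largest connected component, then use random sampling to judge the cutoff) can be sketched as follows. This is my own guess at the procedure, not the speaker's code: candidate genes passing a p-value cutoff are mapped onto a network, the largest connected component is extracted, and its size is compared with that obtained from random gene sets of the same size. The toy network and all names are hypothetical.

```python
import random

def largest_component(nodes, edges):
    """Return the largest connected component among `nodes` as a set."""
    nodes = set(nodes)
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        if a in nodes and b in nodes:
            neighbors[a].add(b)
            neighbors[b].add(a)
    best, seen = set(), set()
    for start in sorted(nodes):
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:
            for nxt in neighbors[stack.pop()]:
                if nxt not in component:
                    component.add(nxt)
                    stack.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

def permutation_p(candidates, all_genes, edges, n_perm=1000, seed=0):
    """Fraction of random gene sets whose largest component is at least
    as large as the one observed for the candidate genes."""
    rng = random.Random(seed)
    observed = len(largest_component(candidates, edges))
    pool = sorted(all_genes)
    hits = sum(
        len(largest_component(rng.sample(pool, len(candidates)), edges)) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)

# Hypothetical toy network and candidate list.
EDGES = [("A", "B"), ("B", "C"), ("D", "E"), ("F", "G")]
GENES = list("ABCDEFGH")
candidates = ["A", "B", "C", "D"]
component = largest_component(candidates, EDGES)
p = permutation_p(candidates, GENES, EDGES, n_perm=200)
print(sorted(component), p)
```

Repeating this for several p-value cutoffs would show which cutoff gives a component larger than expected by chance.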
https://binom.curie.fr/
http://compsysmed.org/Software/OCSANA/OCSANA.html
Using annotated pathways to build a directed network for intervention analysis and prediction.
How druggable? Drug repositioning?
Q: KEGG?
==================================
=> Reinhard Laubenbacher
https://www.ncbi.nlm.nih.gov/myncbi/browse/collection/46337356/?sort=date&direction=descending
http://www.sciencedirect.com/science/article/pii/S1040842813002308
================
Karl Broman, Reproducible research (should be added to my REU bootcamp training).
biostatistics and medical informatics
http://kbroman.org/
https://github.com/QinLab/Talk_ReproRes
http://kbroman.org/steps2rr/
IGV: need *bam file for alignment, *bai file for index.
vcf file can be visualized in IGV or Ensembl Variant Effect Predictor.
http://www.cbioportal.org/
Usually, large genes tend to have more mutations than small genes. Genes with repetitive elements tend to have more mutations.
genomespace.org
network software
=> RTN, bioconductor
http://bioconductor.org/packages/release/bioc/vignettes/KEGGREST/inst/doc/KEGGREST-vignette.R
=> Cytoscape
=> KENev
=> MARINa (MATLAB)
=> ingenuity
=> geneXplain
bioconductor KEGG.db
KEGG.db contains mappings based on older data because the original
resource was removed from the public domain before the most
recent update was produced. This package should now be considered
deprecated and future versions of Bioconductor may not have it
available. Users who want more current data are encouraged to
look at the KEGGREST or reactome.db packages
Tuesday, May 16, 2017
day 2, afternoon, 20170516 jackson lab
genome data sources
https://repositive.io/product/datasources/
Genome in a Bottle
https://discover.repositive.io/datasets/c1802ea2-a853-476d-91e6-c08290cc6fe4
Carl Zimmer
George Church
http://www.personalgenomes.org/
JAX HPC 256G RAM per node, 20 cores per node,
day2, morning, 20170516
=> Sheng Li, RNAseq
RNAseq library construction
Kukurba KR, Montgomery SB, Cold Spring Harb Protoc, 2015,
https://www.ncbi.nlm.nih.gov/pubmed/25870306
For microRNA, ~20nt, special protocol is required.
stranded and non-stranded library (to distinguish overlapping exons or genes on opposite DNA strands)
minimum reads: 20-25 million reads for a mammalian transcriptome
Illumina HiSeq 4000, ~4000 million reads per lane. 4-8 libraries per lane. Double indexing is often used to multiplex a large number of libraries.
2nd step, Gene annotation: GenCode gencode-help@sanger.ac.uk
Ensembl88,
GTF format
3rd step, gene expression quantification
RNAseq metric,
single-end RPKM, reads per kilobase per million reads
paired-end, FPKM, fragments per kilobase per million reads
normalize read counts for sequencing depth and gene length
TPM, transcripts per million
pro: the sum of normalized values is the same for all samples (not the case for R/FPKM)
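The difference between RPKM and TPM noted above can be checked on a toy count table. This is a minimal sketch using the usual definitions (RPKM: reads per kilobase of gene per million mapped reads; TPM: length-normalized rates rescaled to sum to one million). The counts and gene lengths are made up.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of gene per million mapped reads."""
    total = sum(counts)
    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: per-base read rates rescaled to sum to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [r * 1e6 / scale for r in rates]

# Hypothetical gene lengths (bp) and read counts for two samples.
lengths = [1000, 2000, 500]
sample_a = [100, 200, 50]
sample_b = [300, 100, 400]

print(sum(rpkm(sample_a, lengths)), sum(rpkm(sample_b, lengths)))
print(sum(tpm(sample_a, lengths)), sum(tpm(sample_b, lengths)))
```

The TPM totals agree across samples by construction, while the RPKM totals depend on how reads are distributed over genes of different lengths.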
before 1st step, Quality check step.
genebody coverage, (with genes)
insert sizes
GC content
reads distribution
adaptor enrichment (contamination or PCR amplification bias?)
read quality
RSeQC, Liguo Wang, Bioinformatics 2012
FastQC
polyA selected 3' UTR, so 5'UTR degradation can be a problem.
Public data:
GEO
RNA-seq blog
http://rpubs.com/shelly1436/274304
ComBat in R, correction of batch effects
biological degradation of mRNA during aging, using sva latent variable, to distinguish biological degradation from non-biological degradation.
https://gist.github.com/slowkow/6e34ccb4d1311b8fe62e#file-rpkm_versus_tpm-r
=======================
Single cell RNAseq, Ion Mandoiu
pseudotemporal order of cells
https://www.nature.com/nbt/journal/v32/n4/pdf/nbt.2859.pdf
single-cell mutational profiling and clonal phylogeny in cancer
Potter, Genome Res
http://genome.cshlp.org/content/23/12/2115.long
cell type identification in primary visual cortex
Fluidigm
https://www.ncbi.nlm.nih.gov/gds/?term=fluidigm
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE77477
challenges in single-cell RNAseq: low RT and sequencing depth, "zero inflated" data
DAVID
GeneMania
Matching clusters to cell types or organism parts
10X genomics . https://www.10xgenomics.com/ .
neuron cortex
https://support.10xgenomics.com/single-cell-gene-expression/datasets
https://support.10xgenomics.com/single-cell-gene-expression/datasets/1M_neurons
yeast GEO, aging and large scale
Dang lab
methylation and chip-seq
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE65764
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE65766
287 samples, Holstege
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE42536
Aging, single cell expression data set
physiologically aged hematopoietic stem cells
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE77477
To uphold appropriate homeostasis of short-lived blood cells, immature blood cells need to proliferate vigorously. Here, using a conditional H2B-mCherry labeling mouse-model, we characterize hematopoietic stem cell (HSC) and progenitor proliferation dynamics in steady state, upon physiological aging and following several types of induced stress. Following transplantation, HSCs shifted towards higher degrees of proliferation that was sustained long-term. HSCs were, by contrast, poorly recruited into proliferation following cytokine-induced mobilization and after acute depletions of selected blood cell lineages. Using indexed single cell sorting coupled to multiplex gene expression analyses, proliferation history separated candidate HSCs into units with distinct molecular and functional attributes. Our data thereby highlight that HSC proliferation following transplantation is fundamentally different not only from native hematopoiesis but also from other stress contexts, and demonstrate the power of divisional history as a functional criterion to resolve HSC heterogeneity.
About 1000 genes are measured in GSE77477.
Monday, May 15, 2017
Jackson lab genomics Day 1
Morning
install anaconda3 on ubuntu virtualbox
#
hqin@rainboxdash:~/anaconda3/bin/$ ./jupyter notebook &
https://thejacksonlaboratory.github.io/bd2k-workshop/
samtools view example.bam | less
Linux exercises: up and down arrows, tab for file-names autocomplete
bam file
vcf file (Variant Call Format)
less test.vcf
mkdir tmpdir
cd tmpdir
cd ..
ls
Essential probability and statistics for introduction to big data
- summary statistics vs empirical statistics
- Common data transformation, Z-scores
- Bayesian inference
- Multiple hypothesis testing
https://en.wikipedia.org/wiki/Median_absolute_deviation
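The median absolute deviation linked above pairs naturally with the Z-score item: it gives an outlier-resistant alternative to the standard deviation. A minimal sketch, assuming the usual consistency constant 1.4826 that makes the MAD comparable to the standard deviation for normally distributed data. The data values and function names are made up.

```python
import statistics

def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def robust_z(values):
    """Z-like scores using the median and the scaled MAD instead of mean and SD."""
    med = statistics.median(values)
    scale = 1.4826 * mad(values)
    return [(v - med) / scale for v in values]

# Hypothetical measurements with one obvious outlier.
data = [2, 3, 3, 4, 5, 6, 100]
print(mad(data), statistics.stdev(data))
print(robust_z(data)[-1])
```

The single outlier inflates the standard deviation but leaves the MAD at 1, so the outlier stands out clearly in the robust scores.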
Bayes's rule
Introductory Data Mining
=====================
Afternoon
curricula:
all biology is computational
computational biology is a parasite of biology
Approach:
research project approach using real datasets
targeted students: professional and academic oriented?
Data Carpentry's R for genomics
http://www.datacarpentry.org/R-genomics/
RStudio projects
https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects
args(barplot)
?lm
??lm
help.search("kruskal")
==========
Bartke, Southern Illinois University School of Medicine, longest-living mouse
===========
Brenton Graveley, RNA genomics
Dscam, over 100 exons, an extreme case of alternative splicing
Schmucker 2000 Cell. Mutually exclusive splicing.
Ig repeats. Dimerization is isoform-specific.
https://www.ncbi.nlm.nih.gov/pubmed/19934230
Dscam variation between species
Graveley 2005, Cell.
Competing RNA base-pairing is a common mechanism for mutually exclusive splicing in arthropods, Yang 2011 Nature Struct Mol Biol
http://www.nature.com/nsmb/journal/v18/n2/abs/nsmb.1959.html
single cell RNA sequencing
Drosophila S2 cells, each cell shows the same splicing isoform.
Drop-seq of Drosophila and human cells to control the number of cells in each droplet.
Oxford nanopore sequencing
1500 RNA binding proteins in human genome
Van Nostrand, Nature Methods, 2016, eCLIP-seq reveals RBP-specific binding profiles
http://www.nature.com/nmeth/journal/v13/n6/abs/nmeth.3810.html
Sunday, May 14, 2017
sample RCN
Emory, RCN-UBE The Case Study and PBL Network
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1344208&HistoricalAwards=false
Joey Shawn
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1410087&HistoricalAwards=false
Award Abstract #1410087
Digitization TCN: Collaborative Research: The Key to the Cabinets: Building and Sustaining a Research Database for a Global Biodiversity Hotspot
important concepts/skills in undergraduate QBIO education and training
Important concepts
- Modeling approach:
- ODE
- PDE
- Discrete
- Analytic versus simulation
- Bistability, bifurcation
- Visualization of quantitative models
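To illustrate "Analytic versus simulation" for an ODE model, here is a minimal Python sketch. The choice of the logistic growth equation dN/dt = r N (1 - N/K), the forward Euler method, and all parameter values are assumptions made for this example, not part of the original list.

```python
import math

def logistic_analytic(t, r, K, N0):
    """Closed-form solution of dN/dt = r*N*(1 - N/K) with N(0) = N0."""
    return K / (1.0 + ((K - N0) / N0) * math.exp(-r * t))

def logistic_euler(t_end, r, K, N0, dt=0.001):
    """Numerical simulation of the same ODE with forward Euler steps."""
    n = N0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        n += dt * r * n * (1.0 - n / K)
    return n

# Made-up parameters for illustration
r, K, N0, t_end = 0.5, 100.0, 10.0, 10.0
exact = logistic_analytic(t_end, r, K, N0)
approx = logistic_euler(t_end, r, K, N0)
print(f"analytic: {exact:.4f}  Euler: {approx:.4f}  difference: {abs(exact - approx):.2e}")
```

Students can vary dt to see how the simulation error shrinks, which is a simple way to contrast the two approaches.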
NSF RCN elements
https://www.nsf.gov/pubs/2015/nsf15527/nsf15527.htm
All RCN proposals (including RCN-UBE) must conform to the following 7 guidance items:
- Topic/focus of research coordination. For all tracks, research coordination network (RCN) proposals should identify a clear theme as the focus of its activities. RCN proposals should spell out the theoretical and/or methodological foundations of the network's proposed activities, and should specify what activities will be undertaken, what new groups of investigators will be brought together, what products will be generated by network activities, and how information about the network and opportunities to participate will be disseminated. The proposal should also outline the expected benefits of the network's activities in moving a field forward and the implications for the broader community of researchers, educators and engineers.
- Principal investigator (PI). Although research coordination networks are expected to involve investigators from multiple sites, a single organization must serve as the submitting organization for each proposal. Of the two types of collaborative proposal formats described in the Grant Proposal Guide, this solicitation allows only a single proposal submission with subawards administered by that lead organization. The PI is the designated contact person for the project and is expected to provide leadership in fully coordinating and integrating the activities of the network. Strong, central leadership and clear lines of responsibility are essential for successful networking.
- Steering committee. Members of the steering committee will be network participants that assume key roles in the leadership and/or management of the project. The steering committee should be representative of the communities of participants that will be brought together through the RCN. It must include all Co-PIs, if any are listed on the cover page of the proposal, and any other senior personnel, including any foreign collaborators involved as leaders or otherwise considered senior personnel. Therefore, the steering committee constitutes all the senior personnel for the RCN proposal. The name and home organization of each steering committee member should be listed in the project summary. As these individuals are all senior personnel, their Biographical Sketches and Current and Pending Support statements must be included in the appropriate sections of the proposal.
- Network participants. The size of a network is expected to vary depending on the theme and the needs of the proposed activity. The network may be regional, national, or international. It is expected that a proposed network will involve investigators at diverse organizations. The inclusion of new researchers, post-docs, graduate students, and undergraduates is encouraged. Specific efforts to increase participation of underrepresented groups (women, underrepresented minorities, and persons with disabilities) must be included. In the proposal, an initial network of likely participants should be identified. However, there should be clearly developed mechanisms to maintain openness, ensure access, and actively promote participation by interested parties outside of the initial participants in the proposed network.
- Coordination/management mechanism. The proposal should include a clearly defined management plan. The plan should include a description of the specific roles and responsibilities of the PI and the steering committee. Mechanisms for allocating funds, such as support for the work of a steering committee, should be clearly articulated. The plan should include provisions for flexibility to allow the structure of the participant group to change over time as membership and the network's foci evolve. Mechanisms for assessing progress and the effectiveness of the networking activities should be part of the management plan.
- Information and material sharing. The goals of this program are to promote effective communication and to enhance opportunities for collaboration. Proposers are expected to develop and present a clearly delineated understanding of individual member's rights to ideas, information, data and materials produced as a result of the award that is consistent with the goals of the program. Infrastructure plans to support the communication and collaboration should be described. When the proposed activity involves generation of community resources such as databases or unique materials, a plan for their timely release and the mechanism of sharing beyond the membership of the RCN must be described in the Data Management Plan, a required Supplementary Document. In addition, a plan for long-term maintenance of such resources must be described without assuming continued support from NSF.
- International participation. NSF encourages international collaboration, and we anticipate that many RCN projects will include participants, including steering committee members, from outside the US. International collaborations should clearly strengthen the proposed project activities. As NSF funding predominantly supports participation by US participants, network participants from institutions outside the US are encouraged to seek support from their respective funding organizations, notably participants from developed countries. NSF funds may not be used to support the expenses of the international scientists and students at their home organization. For RCN projects that involve international partners, NSF funds may be used for the following:
  - Travel expenses for US scientists and students participating in exchange visits integral to the RCN project.
  - RCN-related expenses for international partners to participate in networking activities while in the US.
  - RCN-related expenses for US participants to conduct networking activities in the international partner's home laboratory.
Friday, May 12, 2017
*** R learning materials and computational biology
Online courses
Github applied computational genomics
https://github.com/quinlan-lab/applied-computational-genomics
https://github.com/BenLangmead/comp-genomics-class
R programming at Coursera
https://www.coursera.org/learn/r-programming
DataCamp https://www.datacamp.com/
introduction, intermediate, and advanced R
Statistics and R
https://www.edx.org/course/statistics-r-harvardx-ph525-1x
http://genomicsclass.github.io/book/pages/classes.html
https://courses.edx.org/courses/HarvardX/PH525.1x/1T2015/info
Quantitative biology workshop
https://www.edx.org/course/quantitative-biology-workshop-mitx-7-qbwx-2
Introduction to Bioconductor: annotation and analysis of genomes and genomics assays
https://www.edx.org/course/introduction-bioconductor-annotation-harvardx-ph525-5x
https://cgondro2.une.edu.au/Rcourse.htm
http://www.springer.com/us/book/9783319144740 . R book, primer to analysis of genomics data using R
Books and articles
An introduction to statistical learning with applications in R
http://www-bcf.usc.edu/~gareth/ISL/index.html
- Stephen J. Eglen's PLOS article. A quick guide to teaching R programming to computational biology students.
References:
http://hongqinlab.blogspot.com/2013/10/useful-r-materials-for-teaching.html
http://hongqinlab.blogspot.com/2015/06/ngs-tutorials.html
student progress
KR: finished DataCamp R intro course. Next week: intermediate level.
BD: DataCamp R intro.
Find RNAseq YouTube tutorials and readings, write brief comments on these tutorials.
Install VirtualBox, Ubuntu
Wednesday, May 10, 2017
App to estimate volume, convert units.
Many apps exist for unit conversions
photo, video volume estimation
handwriting recognition apps
making student thinking and learning visible
Ken Shelton
Socrative
Student Portfolios
Assessment for, as, of Learning
For learning (Formative)
Of Learning (summative)
As Learning (Reflective)
http://www.tvdsb.ca/webpages/takahashid/techdia.cfm?subpage=128207
www.govote.at
https://www.menti.com/2ebdbb/3#
https://www.mentimeter.com/?utm_campaign=mentimeter%20logo&utm_medium=web-link&utm_source=govote
Digital Portfolio can let students
Monday, May 8, 2017
BD tasks
May:
Learning R
Datacamp, intro, intermediate, advanced
https://www.datacamp.com/courses/free-introduction-to-r
Find reading materials on RNAseq
Github
June
Coursera R training course
VirtualBox, Ubuntu
Learning Linux
20170508, BD borrowed Haddock and Dunn, Practical Computing for Biologists.
Sunday, May 7, 2017
*** computational genomics course plan
data visualization book
https://serialmentor.com/dataviz/
Key topics
Linux
RNAseq
R/Rstudio Rmd
Online:
Regular expression online exercise
RNAseq
biological network
Potential student projects:
time lapsed image analysis
rDNA reads in yeast genomes ~ lifespan
chemical compounds
network controllability
yeast RLS ~ genomics features
prediction of essential and non-essential genes.
ecology network analysis
aging data comparison
RNN reverse engineering of gene interactions
Reference:
http://hongqinlab.blogspot.com/2015/06/ngs-tutorials.html
http://www-personal.une.edu.au/~cgondro2/Rcourse.htm
Jackson lab workshop
Data Carpentry,
https://github.com/data-lessons/genomics-workshop
https://data-lessons.github.io/genomics-workshop/
EdX genomics course
Friday, May 5, 2017
UTC grading scale for letter grade
UTC grading policy, grading scale
90-100 A
80-89 B
70-79 C
60-69 D
Below 60 F
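For a gradebook script, the scale above can be written as a small Python function. This is a sketch; how fractional scores such as 89.5 are handled (here: no rounding, so 89.5 is a B) is my assumption and is not stated in the policy above.

```python
def letter_grade(score):
    """Map a numeric score (0-100) to a letter grade using the scale above.

    Assumption: scores are compared to the lower cutoffs without rounding.
    """
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"

for s in (95, 89.5, 70, 60, 59):
    print(s, letter_grade(s))
```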
Wednesday, May 3, 2017
student project, time lapsed image
MATLAB, code from image analysis
TODO: tracking objects around images
IPTG ordering
UBPBIO.com
Tuesday, May 2, 2017
rls 20160802.db missing entries
The rls 20160802.db contains many missing entries. I exported this db to csv using sqlite3, and the missing entries are still there. So the problem is not limited to the csv; it is in this release of rls.db itself.
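One way to quantify missing entries directly in the database, without going through csv, is sketched below in Python. The table and column names (demo, strain, lifespan, media) and the rows are hypothetical stand-ins built in memory; they are not the actual schema of rls.db, and "missing" is assumed here to mean NULL or an empty string.

```python
import sqlite3

def count_missing(conn, table):
    """Count NULL or empty-string entries in every column of `table`.

    Table and column names are interpolated into SQL, so they must be trusted.
    """
    cur = conn.cursor()
    # PRAGMA table_info returns one row per column; index 1 is the column name
    columns = [row[1] for row in cur.execute(f"PRAGMA table_info({table})")]
    missing = {}
    for col in columns:
        cur.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL OR TRIM({col}) = ''"
        )
        missing[col] = cur.fetchone()[0]
    return missing

# Hypothetical in-memory stand-in for the real database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (strain TEXT, lifespan REAL, media TEXT)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?, ?)",
    [("BY4741", 26.5, "YPD"), ("BY4742", None, ""), (None, 30.1, "YPD")],
)
missing = count_missing(conn, "demo")
print(missing)
```

For the real file, one would replace the in-memory connection with sqlite3.connect on the db path and pass the actual table name.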
Monday, May 1, 2017
JAX computational genomics tools
On the academic side:
We will be using a number of genomic analysis software packages/tools. Please try to download and install the tools/programs listed below (IGV, R/RStudio and Python). Ada Zhan (cc’ed here) can assist you with installation questions. We will also be able to provide support on the first day of the course. We will use a cloud computing environment (web-based) but you will get information on that platform just before the course.
If you do not have a laptop at your disposal please alert me ASAP so that we can prepare a machine for your use.
Please install the following:
Integrative Genomics Viewer: (IGV) (Broad Institute)
Please go to the Broad institute website here and download the IGV version for your Mac or PC.
R:
R is a programming language that is especially powerful for data exploration, visualization, and statistical analysis. To interact with R, we use RStudio. To install on:
Windows:
Mac OS X:
Install R by downloading and running this .pkg file from CRAN. Also, please install the RStudio IDE.
Linux:
Python: To set up Python:
Windows
- Download and install Anaconda.
- Download the default Python 3 installer. Use all of the defaults for installation except make sure to check Make Anaconda the default Python.
Mac OS X
- Download and install Anaconda.
- Download the default Python 3 installer. Use all of the defaults for installation.
Linux
- Download the installer that matches your operating system and save it in your home folder. Download the default Python 3 installer.
- Open a terminal window.
- Type
bash Anaconda-