Explain the role of computation and
data mining in addressing hypothesis-driven and hypothesis-generating
questions within the life sciences: It is crucial for students
to have a clear understanding of the role computing and data mining play
in the modern life sciences. Given a traditional hypothesis-driven
research question, students should have ideas about what types of data
and software exist that could help them answer the question quickly and
efficiently. They should also appreciate that mining large datasets can
generate novel hypotheses to be tested in the lab or field.
- What hypotheses can one generate from the biometric data being compiled (Fitbit, Google, etc.)?
- Understand the role of various databases in identifying potential gene targets for drug development
Summarize key computational concepts, such as algorithms and relational databases, and their applications in the life sciences: In
order to make use of sophisticated software and database tools,
students must have a basic understanding of the underlying principles
that these tools are based upon. Students are not expected to be experts
in multiple algorithms or sophisticated databases, but currently the
vast majority of life sciences majors never take a programming or
database course, and have essentially zero exposure to how these tools
work. This must change.
- Be exposed to how data is organized in relational databases
- Be able to modify search parameters (e.g., in a database query or a BLAST search) to achieve biologically meaningful results
- Understand the underlying algorithm(s) employed in sequence alignment (e.g., BLAST)
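As a minimal sketch of how data is organized in a relational database, the following uses Python's built-in `sqlite3` module with an in-memory database. The two-table schema (`gene`, `annotation`), the rows, and the helper `genes_with_term` are hypothetical and chosen only for illustration; they do not correspond to any real repository's schema.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical toy schema: genes in one table, their annotations in another.
# The shared gene_id column is what makes the two tables "relational".
conn.executescript("""
CREATE TABLE gene (
    gene_id  INTEGER PRIMARY KEY,
    symbol   TEXT NOT NULL,
    organism TEXT NOT NULL
);
CREATE TABLE annotation (
    gene_id INTEGER REFERENCES gene(gene_id),
    term    TEXT NOT NULL
);
""")

# Illustrative rows only.
conn.executemany("INSERT INTO gene VALUES (?, ?, ?)", [
    (1, "TP53", "Homo sapiens"),
    (2, "BRCA1", "Homo sapiens"),
    (3, "lacZ", "Escherichia coli"),
])
conn.executemany("INSERT INTO annotation VALUES (?, ?)", [
    (1, "DNA repair"),
    (1, "apoptosis"),
    (2, "DNA repair"),
    (3, "lactose catabolism"),
])


def genes_with_term(term):
    """Return gene symbols annotated with `term`, using a JOIN on gene_id."""
    rows = conn.execute(
        "SELECT g.symbol FROM gene AS g "
        "JOIN annotation AS a ON g.gene_id = a.gene_id "
        "WHERE a.term = ? ORDER BY g.symbol",
        (term,),
    )
    return [row[0] for row in rows]


print(genes_with_term("DNA repair"))
```

Changing the `WHERE` clause is the database analogue of modifying search parameters: the same tables answer a different biological question.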
Apply statistical concepts used in bioinformatics: Many
biology curricula contain statistics, either as a standalone
biostatistics course or as part of other courses such as capstone
research courses. The primary distinction with regard to bioinformatics
has to do with the statistics of large datasets and multiple comparisons.
- Drug trials: Interpret data from well-designed drug trials
- Transcriptomics: Understand the statistical modelling used to identify differentially expressed genes; understand how genes implicated in cancer are identified using panels of sequenced tumor and wild-type (WT) cell lines or biopsies
- Sequence similarity searching: Understand that there is a probability of finding a given sequence similarity score by chance (the p-value); understand that the size of the database searched affects the number of hits with that score expected by chance in a particular search (the expectation, or E-value).
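The statistical ideas above can be sketched in a few lines of Python. The first function applies the Benjamini-Hochberg procedure, one common correction for multiple comparisons (it is offered here as an example, not as the method any particular tool uses). The other two are hypothetical helpers illustrating E-values: they assume the Karlin-Altschul relationship used by BLAST, in which the E-value for a fixed alignment score grows in proportion to the size of the search space and the chance probability of at least one such hit is 1 - exp(-E).

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def scale_e_value(e_value, old_db_size, new_db_size):
    """Rescale an E-value for the same score when only database size changes."""
    return e_value * new_db_size / old_db_size


def p_from_e_value(e_value):
    """Probability of at least one chance hit, given the expected number E."""
    return 1.0 - math.exp(-e_value)


raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
print(scale_e_value(0.001, 1_000_000, 100_000_000))
print(p_from_e_value(0.001))
```

For small E the two quantities are nearly equal, which is why the distinction only becomes visible for weak hits or very large databases.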
Use bioinformatics tools to examine
complex biological problems in evolution, information flow, and other
important areas of biology: This competency is written broadly
so as to encompass a variety of problems addressed using bioinformatics
tools, from understanding the evolutionary underpinnings of sequence
comparison and homology detection, to the distinctions between genomic
sequences, RNA sequences, and protein sequences, to the interpretation
of phylogenetic trees. We want to emphasize that bioinformatics tools
can be used to teach existing parts of the curriculum such as the
central dogma or phylogenetic relationships, thus integrating
bioinformatics into the curriculum rather than tacking it onto an
already overfull curriculum (and thus forcing decisions
about what topic to remove to make room). The point of saying “complex”
biological problems is that students should be able to work through a
problem with multiple steps, not just perform isolated tasks.
- Employ gene ontology and pathway annotation tools (e.g., MapMan, GO, KEGG)
- Understand protein sequence, structure, and function, using a variety of tools
- Understand gene structure, genomic context, and alternative splicing using genome browsers
- Understand the concept of homology
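As one hedged illustration of using a small computational exercise to teach the central dogma, the sketch below transcribes and translates a short coding-strand DNA sequence with the standard genetic code. The function names and the example sequence are hypothetical; real genes involve introns, alternative start sites, and other features that this toy ignores.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Return the mRNA corresponding to a coding-strand DNA sequence."""
    return coding_dna.upper().replace("T", "U")


def translate(coding_dna):
    """Translate codon by codon from position 0, stopping at a stop codon."""
    dna = coding_dna.upper()
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


example = "ATGGCCTGA"  # made-up sequence: start codon, one codon, stop codon
print(transcribe(example))
print(translate(example))
```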
Find, retrieve, and organize various types of biological data: Given
the numerous and varied datasets currently being generated from all of
the ‘omics fields, students should develop the facility to: identify
appropriate data repositories; navigate and retrieve data from these
databases; and organize data relevant to their area of study (in flat
files or small local stand-alone databases).
- Store and interrogate small datasets using spreadsheets or delimited text files.
- Navigate and retrieve data from genome browsers
- Retrieve data from protein and genome databases (PDB, UniProt, NCBI)
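Organizing retrieved records in a delimited text file can be sketched with Python's standard `csv` module. The column names, accessions, and lengths below are made up for illustration, and `io.StringIO` stands in for a file on disk so the example is self-contained.

```python
import csv
import io

# Hypothetical tab-delimited records; accessions and lengths are invented.
TSV_TEXT = (
    "accession\torganism\tlength_bp\n"
    "ACC0001\tHomo sapiens\t1200\n"
    "ACC0002\tMus musculus\t950\n"
    "ACC0003\tHomo sapiens\t3100\n"
)


def load_records(handle):
    """Read tab-delimited rows into dicts, converting length_bp to int."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [{**row, "length_bp": int(row["length_bp"])} for row in reader]


def filter_records(records, organism, min_length=0):
    """Keep records for one organism at or above a minimum length."""
    return [
        r for r in records
        if r["organism"] == organism and r["length_bp"] >= min_length
    ]


records = load_records(io.StringIO(TSV_TEXT))
human_long = filter_records(records, "Homo sapiens", min_length=2000)
print([r["accession"] for r in human_long])
```

With a real file, `open("records.tsv", newline="")` would replace the `io.StringIO` object; the file name is a placeholder.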
Explore and/or model biological interactions, networks and data integration using bioinformatics: Modeling
of biological systems at all levels, from cellular to ecological, is
being facilitated by technological (e.g., sequencing, biochemical,
genetic) and algorithmic advances. These models provide novel insights
into the system perturbations that cause disease, the interactions of
microbes with various eukaryotic systems, and how metabolic networks
respond to environmental stresses. Students should be familiar with the
techniques used to generate these analyses, have the ability to
interpret the outputs, and use the data to generate novel hypotheses.
- Cell biology: Predict the impact of a gene knockout on a cell-signaling pathway
- Transcriptome: Analyze transcriptomic data (RNA-Seq) available from the Sequence Read Archive (SRA) using Galaxy
- Ecological: Analyze microbial sequence data using QIIME on Galaxy
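The knockout example above can be sketched as a reachability question on a directed graph: does a signal still get from the receptor to its target once some nodes are removed? The pathway below is entirely hypothetical (generic node names, no real proteins), and treating a pathway as a simple graph ignores kinetics, feedback, and inhibition.

```python
from collections import deque

# Hypothetical signaling pathway: each key activates the nodes in its list.
PATHWAY = {
    "receptor": ["kinase_A", "kinase_B"],
    "kinase_A": ["transcription_factor"],
    "kinase_B": ["adaptor"],
    "adaptor": ["transcription_factor"],
    "transcription_factor": [],
}


def signal_reaches(graph, source, target, knocked_out=()):
    """Breadth-first search from source to target, skipping knocked-out nodes."""
    removed = set(knocked_out)
    if source in removed or target in removed:
        return False
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbor in graph.get(node, []):
            if neighbor not in removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


print(signal_reaches(PATHWAY, "receptor", "transcription_factor"))
print(signal_reaches(PATHWAY, "receptor", "transcription_factor",
                     knocked_out=["kinase_A"]))
print(signal_reaches(PATHWAY, "receptor", "transcription_factor",
                     knocked_out=["kinase_A", "adaptor"]))
```

The single knockout leaves a redundant route intact, while the double knockout blocks the signal; predictions like these are hypotheses a student could then test.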
Use command-line bioinformatics tools and write simple computer scripts: The
majority of the datasets students should be familiar with and be able
to interact with (e.g., genomic and proteomic sequences, BLAST results,
RNA-Seq and the resulting differential expression data) are text files. The
most powerful and dynamic way to interact with these datasets is through
the command line or shell scripting, both of which are readily acquired
skills. Students need to have the flexibility to manipulate their own
data, and to create and modify complex data processing and analysis pipelines.
- Write simple Unix shell scripts to manipulate files
- Apply RNA-Seq analyses (e.g., read alignment with STAR or TopHat, differential expression with DESeq2 in R) to open-source data sets (SRA)
- Build and run statistical analyses using R or Python scripts
- Run BLAST using command-line options
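Because these datasets are plain text, a short script is often enough to work with them. The following is a minimal sketch, not a replacement for an established parser: it reads FASTA-formatted text and reports the GC content of each sequence. The sequences and the function names `read_fasta` and `gc_content` are invented for illustration.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            # Use the first word of the header as the record name.
            name, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def gc_content(sequence):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# Made-up sequences; io.StringIO stands in for an open file.
FASTA_TEXT = ">seq1 hypothetical example\nATGC\nGGCC\n>seq2\nATAT\n"

for record_name, seq in read_fasta(io.StringIO(FASTA_TEXT)):
    print(record_name, len(seq), round(gc_content(seq), 2))
```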
Describe and manage biological data types, structure, and reproducibility: This
competency addresses two distinct concerns: 1) each of the varied
‘omics fields produces data in formats particular to its needs, and these
formats evolve with changes in technologies and refinements in
downstream software; and 2) all experimental data is subject to error
and the user must be cognizant of the need to verify the reproducibility
of their data. The first concern highlights the requirement for
students to develop an awareness of and ability to manipulate different
data types given the versioning of formats. The second points to the
need for caution, to carry out appropriate statistical analyses on their
data as part of normal operating procedures and report the uncertainty
of their results, and to provide the relevant information to enable
reproduction of their results. Sometimes students have the tendency to
assume that anything they retrieve from an online database must be
correct; they need to be taught that this is not always the case.
- Reproducibility: Compare the reproducibility of biological replicate data (e.g., transcriptomic data) using statistical measures (e.g., Spearman correlation).
- Formats: Understand the various formats used to store DNA and protein sequences (FASTA, FASTQ); understand the representation of gene features using General Feature Format (GFF) files; understand mass-spectrometry data formats
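As a hedged sketch of the reproducibility example, the code below computes the Spearman rank correlation between two biological replicates using only the standard library. The expression values are invented, and in practice a library routine (for instance from SciPy or R) would usually be used instead; the point here is to show that Spearman correlation is simply Pearson correlation computed on ranks, with ties given their average rank.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented expression values for five genes in two replicates.
replicate_1 = [10, 200, 35, 0, 80]
replicate_2 = [12, 180, 40, 1, 95]
print(spearman(replicate_1, replicate_2))
```

A value near 1 indicates that the replicates rank genes in nearly the same order; it says nothing about whether the absolute values agree.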
Interpret the ethical, legal, medical, and social implications of biological data: The
increasing scale and pervasiveness of human genetic and genomic data have
greatly enhanced our ability to identify disease-related loci, druggable
targets, and the potential for gene replacement with developing
techniques. However, with this information come many ethical,
legal, and social questions, and answers to them often lag behind the
technological advances. As part of their scientific training, students
should debate the medical, societal, and ethical implications of these
information sets and techniques.
- How does the scientific community protect against the falsification or manipulation of large datasets?
- Who should have access to this data, and how should it be protected?
- What are the implications, good and bad, of being able to walk into a doctor’s office and have your genome sequenced and analyzed in minutes?