
Sunday, February 20, 2022

deep learning in bioinformatics

https://github.com/liyu95/Deep_learning_examples

Monday, July 12, 2021

quantum sequence alignment

Quantum pattern recognition for local sequence alignment

https://ieeexplore.ieee.org/document/8269076

https://github.com/spencerking/QiskitSummerJam-LocalSequenceAlignment

amsantanubanerjee/quantum_sequence_alignment

QuASeR -- Quantum Accelerated De Novo DNA Sequence Reconstruction

https://arxiv.org/abs/2004.05078

https://github.com/QE-Lab/QuASeR

https://github.com/prince-ph0en1x/QuASeR

natural language processing, bioinformatics

BERT-GT: cross-sentence n-ary relation extraction with BERT and Graph Transformer

https://academic.oup.com/bioinformatics/article-abstract/36/24/5678/6069538

LBERT: Lexically aware Transformer-based Bidirectional Encoder Representation model for learning universal bio-entity relations

https://academic.oup.com/bioinformatics/article-abstract/37/3/404/5893949

Attention-Based Transformers for Instance Segmentation of Cells in Microstructures

https://ieeexplore.ieee.org/abstract/document/9313305?casa_token=88kTMDn2LIUAAAAA:LxNIoFl-o24XoXV5PHpJ1T67-atx7ScUVKONPNGwKvPP6skEt4E99z1LmF6-8pGKSuuXtnCTSzo

Transformer protein language models are unsupervised structure learners

https://openreview.net/pdf?id=fylclEqgvgd

BERTology meets biology: interpreting attention in protein language models

https://arxiv.org/abs/2006.15222

Burrows-Wheeler transform

Tuesday, November 14, 2017

Which human reference genome to use? (Heng Li)

originally posted at http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use

13 November 2017

TL;DR: If you map reads to GRCh37 or hg19, use hs37-1kg:

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz

If you map to GRCh37 and believe decoy sequences help with better variant calling, use hs37d5:

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

If you map reads to GRCh38 or hg38, use the following:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

There are several other versions of GRCh37/GRCh38. What’s wrong with them? Here is a collection of potential issues:

1. Inclusion of ALT contigs. ALT contigs are large variations with very long flanking sequences nearly identical to the primary human assembly. Most read mappers will give mapping quality zero to reads mapped in the flanking sequences. This will reduce the sensitivity of variant calling and many other analyses. You can resolve this issue with an ALT-aware mapper, but no mainstream variant callers or other tools can take advantage of ALT-aware mapping.
2. Padding ALT contigs with long “N”s. This has the same problem as 1 and also increases the size of the genome unnecessarily. It is worse.
3. Inclusion of multi-placed sequences. In both GRCh37 and GRCh38, the pseudo-autosomal regions (PARs) of chrX are also placed onto chrY. If you use a reference genome that contains both copies, you will not be able to call any variants in PARs with a standard pipeline. In GRCh38, some alpha satellites are placed multiple times, too. The right solution is to hard mask PARs on chrY and those extra copies of alpha repeats.
4. Not using the rCRS mitochondrial sequence. rCRS is widely used in population genetics. However, the official GRCh37 comes with a mitochondrial sequence 2bp longer than rCRS. If you want to analyze mitochondrial phylogeny, this 2bp insertion will cause trouble. GRCh38 uses rCRS.
5. Converting semi-ambiguous IUB codes to “N”. This is a very minor issue, though. Human chromosomal sequences contain few semi-ambiguous bases.
6. Using accession numbers instead of chromosome names. Do you know CM000663.2 corresponds to chr1 in GRCh38?
7. Not including unplaced and unlocalized contigs. This will force reads originating from these contigs to be mapped to the chromosomal assembly and lead to false variant calls.
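
To make the "hard mask" remedy for multi-placed sequences concrete, here is a minimal Python sketch. It is not taken from the original post: the function name, the toy sequence, and the region coordinates are all hypothetical, and real PAR coordinates should be taken from the assembly's own documentation. Hard masking simply means overwriting the duplicated bases with "N" so that reads can only map to the remaining copy.

```python
def hard_mask(sequence, regions):
    """Return a copy of `sequence` with each region replaced by 'N'.

    `regions` is a list of (start, end) pairs, 1-based and inclusive,
    which is how genome coordinates are usually written.
    """
    masked = list(sequence)
    for start, end in regions:
        if start < 1 or end > len(sequence) or start > end:
            raise ValueError(f"region {start}-{end} is outside the sequence")
        # Convert the 1-based inclusive interval to a 0-based Python slice.
        masked[start - 1:end] = "N" * (end - start + 1)
    return "".join(masked)


# Toy stand-in for chrY; the two regions are made-up, NOT real PAR coordinates.
toy_chrY = "ACGTACGTACGTACGTACGT"
toy_par_regions = [(3, 6), (15, 18)]

masked_chrY = hard_mask(toy_chrY, toy_par_regions)
print(masked_chrY)
```

Masking keeps the chromosome length and coordinates unchanged, which is why it is preferable to deleting the duplicated regions.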

Now we can explain what is wrong with other versions of human reference genomes:

hg19/chromFa.tar.gz from UCSC: 1, 3, 4 and 5.
hg38/hg38.fa.gz from UCSC: 1, 3 and 5.
GCA_000001405.15_GRCh38_genomic.fna.gz from NCBI: 1, 3, 5 and 6.
Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz from EnsEMBL: 3.
Homo_sapiens.GRCh38.dna.toplevel.fa.gz from EnsEMBL: 1, 2 and 3.

Using an inappropriate human reference genome is usually not a big deal unless you study regions affected by the issues. However, 1) other researchers may be studying these biologically interesting regions and will need to redo alignment; 2) aggregating data mapped to different versions of the genome will amplify the problems. It is still preferable to choose the right genome version if possible.

Well, welcome to bioinformatics!

Friday, October 13, 2017

computer science books for biology majors

https://www.amazon.com/Developing-Bioinformatics-Computer-Skills-Introduction/dp/1565926641/ref=sr_1_sc_2?ie=UTF8&qid=1507904075&sr=8-2-spell&keywords=bioinformtics+skills

https://www.amazon.com/Bioinformatics-Data-Skills-Reproducible-Research/dp/1449367372/ref=sr_1_1?ie=UTF8&qid=1507904147&sr=8-1&keywords=bioinformatics+skills

Friday, July 28, 2017

PSORT WWW Server

PSORT is a computer program for the prediction of protein localization sites in cells. It receives the information of an amino acid sequence and its source origin, e.g., Gram-negative bacteria, as inputs. Then, it analyzes the input sequence by applying the stored rules for various sequence features of known protein sorting signals. Finally, it reports the possibility for the input protein to be localized at each candidate site with additional information.

https://psort.hgc.jp/

Sunday, May 28, 2017

NIBLSE, bioinformatics core competencies

https://qubeshub.org/groups/niblse/resourcecollection/core_competencies#2

This set of bioinformatics core competencies for undergraduate life scientists is informed by the survey results of more than 1,200 people, analysis of 90 syllabi addressing bioinformatics across institutions and diverse departments, and discussion among experts across academia and industry. The bulleted lists contain examples illustrating the competencies.

Explain the role of computation and data mining in addressing hypothesis-driven and hypothesis-generating questions within the life sciences: It is crucial for students to have a clear understanding of the role computing and data mining play in the modern life sciences. Given a traditional hypothesis-driven research question, students should have ideas about what types of data and software exist that could help them answer the question quickly and efficiently. They should also appreciate that mining large datasets can generate novel hypotheses to be tested in the lab or field.
- What hypotheses can one form based on the biometric data being compiled (Fitbit, Google, etc.)?
- Understand the role of various databases in identifying potential gene targets for drug development
Summarize key computational concepts, such as algorithms and relational databases, and their applications in the life sciences: In order to make use of sophisticated software and database tools, students must have a basic understanding of the underlying principles that these tools are based upon. Students are not expected to be experts in multiple algorithms or sophisticated databases, but currently the vast majority of life sciences majors never take a programming or database course, and have essentially zero exposure to how these tools work. This must change.
- Be exposed to how data is organized in relational databases
- Be able to modify the search parameters to achieve biologically meaningful results
- Understand underlying algorithm(s) employed in sequence alignment (e.g. BLAST)
Apply statistical concepts used in bioinformatics: Many biology curricula contain statistics, either as a standalone biostatistics course or as part of other courses such as capstone research courses. The primary distinction with regard to bioinformatics has to do with the statistics of large datasets and multiple comparisons.
- Drug trials: Interpretation of well designed drug trial data
- Transcriptomics: Understand the statistical modelling used to identify differentially expressed genes; Understand how genes implicated in cancer are identified using panels of sequenced tumor and WT cell lines or biopsies
- Sequence similarity searching: Understand that there is a probability of finding a given sequence similarity score by chance (the p-value); The size of the database searched affects the probability that they would see that particular score in a particular search (the expectation, or e-value).
Use bioinformatics tools to examine complex biological problems in evolution, information flow, and other important areas of biology: This competency is written broadly so as to encompass a variety of problems addressed using bioinformatics tools, from understanding the evolutionary underpinnings of sequence comparison and homology detection, to the distinctions between genomic sequences, RNA sequences, and protein sequences, to the interpretation of phylogenetic trees. We want to emphasize that bioinformatics tools can be used to teach existing parts of the curriculum such as the central dogma or phylogenetic relationships, thus integrating the bioinformatics into the curriculum as opposed to adding it on as an addition to an already overfull curriculum (and thus forcing decisions about what topic to remove to make room). The point of saying “complex” biological problems is that students should be able to work through a problem with multiple steps, not just perform isolated tasks.
- Employ gene ontology tools (e.g., Mapman, GO, KEGG).
- Understand protein sequence, structure, and function, using a variety of tools
- Understand gene structure, genomic context, alternative splicing using genome browsers
- Understand concept of homology
Find, retrieve, and organize various types of biological data: Given the numerous and varied datasets currently being generated from all of the ‘omics fields, students should develop the facility to: identify appropriate data repositories; navigate and retrieve data from these databases; and organize data relevant to their area of study (in flat files or small local stand-alone databases).
- Store and interrogate small datasets using spreadsheets or delimited text files.
- Navigate and retrieve data from genome browsers
- Retrieve data from protein and genome databases (PDB, UniProt, NCBI)
Explore and/or model biological interactions, networks and data integration using bioinformatics: Modeling of biological systems at all levels, from cellular to ecological, is being facilitated by technological (e.g., sequencing, biochemical, genetics) and algorithmic advances. These models provide novel insights into the perturbations in systems causative of disease, interactions of microbes with various eukaryotic systems, and how metabolic networks respond to environmental stresses. Students should be familiar with the techniques used to generate these analyses, have the ability to interpret the outputs, and use the data to generate novel hypotheses.
- Cell Biology: predict impact of gene knockout on cell-signaling pathway
- Transcriptome: Analysis of transcriptomic data (RNA-Seq) available from SRS using Galaxy
- Ecological: Analysis of microbial sequence data using QIIME on Galaxy
Use command-line bioinformatics tools and write simple computer scripts: The majority of the datasets students should be familiar with and be able to interact with (e.g., genomic and proteomic sequences, BLAST results, RNASeq and resulting differential expression data) are text files. The most powerful and dynamic way to interact with these datasets is through the command line or shell scripting, both of which are readily acquired skills. Students need to have the flexibility to manipulate their own data, and to create and modify complex data processing and analysis workflows.
- Write simple unix shell scripts to manipulate files
- Apply RNASeq analyses using R (STAR, Tophat, DESeq2) to open source data sets (SRS)
- Build and run statistical analyses using R or Python scripts
- Run BLAST using command line options
Describe and manage biological data types, structure, and reproducibility: This competency addresses two distinct concerns: 1) each of the varied ‘omics fields produce data in formats particular to its needs, and these formats evolve with changes in technologies and refinements in downstream software; and 2) all experimental data is subject to error and the user must be cognizant of the need to verify the reproducibility of their data. The first concern highlights the requirement for students to develop an awareness of and ability to manipulate different data types given the versioning of formats. The second points to the need for caution, to carry out appropriate statistical analyses on their data as part of normal operating procedures and report the uncertainty of their results, and to provide the relevant information to enable reproduction of their results. Sometimes students have the tendency to assume that anything they retrieve from an online database must be correct; they need to be taught that this is not always the case.
- Reproducibility: Compare reproducibility of biological replicate data (e.g., transcriptomic data) using statistical tests (Spearman).
- Formats: Understand the various sequence formats used to store DNA and protein sequences (FASTA, FASTQ); Understand the representation of gene features using General Feature Format (GFF) files; Mass-Spec
Interpret the ethical, legal, medical, and social implications of biological data: The increasing scale and penetrance of human genetic and genomic data has greatly enhanced our ability to identify disease-related loci, druggable targets, and potential for gene replacements with developing techniques. However, with this information also come many ethical, legal, and social questions which are often outpaced by the technological advances. As part of their scientific training, students should debate the medicinal, societal and ethical implications of these information sets and techniques.
- How does the scientific community protect against the falsification or manipulation of large datasets?
- Who should have access to this data, and how should it be protected?
- What are the implications, good and bad, of being able to walk into a doctor’s office and have your genome sequenced and analyzed in minutes?
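
The sequence similarity searching example under the statistics competency above distinguishes the p-value from the e-value. The sketch below illustrates that relationship using the standard Karlin-Altschul form, E = K * m * n * exp(-lambda * S), and P = 1 - exp(-E). It is only an illustration: the function names are hypothetical, and the K and lambda defaults are placeholder values chosen for the example, not parameters taken from any particular BLAST run.

```python
import math


def expected_hits(score, query_len, db_len, k=0.041, lam=0.267):
    """E-value: expected number of chance hits scoring at least `score`.

    k and lam are illustrative placeholders; real values depend on the
    scoring matrix and gap penalties used by the search.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(e_value):
    """Probability of seeing at least one chance hit, given the E-value."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-e_value)


# Same score and query, two hypothetical database sizes (in residues).
small_db = expected_hits(score=60, query_len=300, db_len=1_000_000)
large_db = expected_hits(score=60, query_len=300, db_len=100_000_000)

print(f"E (small database): {small_db:.3g}, P: {p_value(small_db):.3g}")
print(f"E (large database): {large_db:.3g}, P: {p_value(large_db):.3g}")
```

The same alignment score becomes less convincing as the database grows, because the e-value scales linearly with the number of residues searched.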

Sunday, March 26, 2017

Sample videos, Developing YouTube video to enhance education and training in bioinformatics and genomics

Selected Bioinformatics Videos

Annotate ORF using ApE

Design PCR primers to amplify a fragment that contains the mutation on msh2

Analyzing PCR primer using ApE

Protein structure visualization using SWISS PDB Viewer

Translation from DNA to protein, Python

Find Restriction Enzyme recognition sites, Python demo

Selected Computing with R Videos

Other tutorial videos

Serial dilution demo, paper and coloring exercises

Animation on herd immunity
https://imgur.com/gallery/8M7q8#J7LANQ4

Wednesday, March 1, 2017

raspberry pi

4273 pi

https://4273pi.org/loan-boxes/

Friday, October 14, 2016

active learning, bioinformatics teaching

https://flxlexblog.wordpress.com/2015/08/31/active-learning-strategies-for-bioinformatics-teaching-2

https://figshare.com/articles/Active_learning_strategies_for_bioinformatics_teaching/1541056

Saturday, December 19, 2015

Pevsner bioinformatics book

http://www.bioinfbook.org/php/

Monday, October 5, 2015

references for 2-credit bioinformatics course

Genomic Medicine Gets Personal, EdX
https://courses.edx.org/courses/course-v1:GeorgetownX+MEDX202-01+2015_3T/courseware/83df4b53f2d444a38a90558b5c639711/aa544225cd604b689f7e421c90a624fe/

https://www.coursera.org/courses/?query=bioinformatics
https://www.coursera.org/specializations/bioinformatics

Introduction to Bioinformatics 4th Edition, Arthur Lesk

Bioinformatics for Biologists

http://www.amazon.com/Bioinformatics-Biologists-Pavel-Pevzner/dp/1107648874/ref=sr_1_8?s=books&ie=UTF8&qid=1443631856&sr=1-8&keywords=bioinformatics&refinements=p_n_feature_nine_browse-bin%3A3291437011%2Cp_72%3A1250221011

Wednesday, March 26, 2014

Phylogeny lab, West Nile virus using envelope glycoprotein sequences

Download MEGA from www.megasoftware.net
This manual is written for a Windows computer. Toolbar features are slightly different on a Mac.

Open file "WNV.mas" with MEGA. (On a Mac, you need to start MEGA first, and then look for this file.)

In "Alignment Explorer", click "Translated Protein Sequences"

Click "Yes" for selection and genetics tables.

Translated proteins should look like this

Choose "Alignment" -> "Align by ClustalW"

"OK" to use the default alignment setting.

Click "Data" -> "Export Alignment" -> "MEGA format".

Enter a title for the alignment.

This alignment is for "protein coding sequences".

In the main window of MEGA, choose "Phylogeny" -> "Construct/Test Neighbor-Joining Trees"

Choose default parameters.

We should be able to see the generated tree.

To save the image, choose "Image" -> "Copy to Clipboard".

Then "Paste" it into a Word document.

Note: Please feel free to explore many other features of MEGA.
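
For readers curious about what happens behind the Neighbor-Joining menu item, the method starts from a matrix of pairwise distances between the aligned sequences. The Python sketch below computes one simple distance, the p-distance (the fraction of compared sites that differ). It is a hedged illustration only: the function names and the toy alignment are hypothetical, the sequences are not the WNV data, and this is not necessarily the distance model MEGA applies by default.

```python
def p_distance(seq_a, seq_b):
    """Fraction of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def distance_matrix(alignment):
    """Pairwise p-distances for a {name: aligned_sequence} dictionary."""
    names = list(alignment)
    return {(x, y): p_distance(alignment[x], alignment[y])
            for x in names for y in names}


# Made-up aligned protein fragments, for illustration only.
toy_alignment = {
    "strain_A": "MKTLLV-AG",
    "strain_B": "MKTLIV-AG",
    "strain_C": "MRTLIVQAG",
}

matrix = distance_matrix(toy_alignment)
for (x, y), d in matrix.items():
    if x < y:
        print(f"{x} vs {y}: {d:.3f}")
```

A tree-building method such as Neighbor-Joining then repeatedly joins the pair of sequences or clusters that its criterion selects from this matrix.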

Reference:
Online MEGA help file, http://www.megasoftware.net/webhelp/helpfile.htm

Sunday, March 16, 2014

R pairwise alignment

http://svitsrv25.epfl.ch/R-doc/library/Biostrings/html/pairwiseAlignment.html