Open Notebook: September 2011

There are old posts from 2011

标题: 简单介绍 Bioinformatics Tools for NGS 分析
发信站: BBS 未名空间站 (Sat Sep 10 19:30:42 2011, 美东)

本人在wet lab里面做纯数据分析，for NGS data analysis, 简单介绍一些自己接触过
，并且觉得挺有用的工具，说的有点杂，权作抛砖引玉，还请不吝赐教。

Next-Gen sequencing(NGS)和现在正在发展的3rd-gen sequencing将会在生物学研究中
被越来越广泛应用。不管你信不信，反正我信了。一是基于实验成本的降低（$1k
whole-genome sequencing is coming），越来越多的实验室可以操作；二是可以提供
相对low throughput experiment多的多的数据和信息，可以看到很多从前看不到的东
西；三是sequencer本身对测序的准确性正在逐渐提高，所以实验固有错误率降低；四
是各种算法的成熟应用，这使得很多由于实验产生的误差在出数据后通过对数据的分析
得以过滤。按照library preparation来分，NGS主要有DNA-seq和RNA-seq

DNA-seq is usually used as ChIP-seq to study transcription factor(TF)-DNA
binding, epigenetic histome modification (such as H3K4me3, H3K36me3 etc),
DNA methylation etc. and Exon-seq or whole-DNA sequencing to study genetic
variation (SNPs/Indels etc). 从bioinformatics的角度来说，DNA-seq相对比较容易
处理，简单的说，第一步先做alignment，就是把millions of short reads from the
sequencer map back to the genome。很多genome已经有相对完整的reference，这些
genome的whole sequence可以从UCSC genome browser(http://genome.ucsc.edu/)下载。先前已经有朋友贴过现有的alignment tools（ http://en.wikipedia.org/wiki/List_of_sequence_alignment_software#Short- Read_Sequence_Alignment），目前比较常用的有bowtie，bwa，maq，soap，从他们各自的网站，应该都可以下载到 executable binary或者source code。我们比较常用的是bowtie（ http://bowtie-bio.sourceforge.net/index.shtml ）和bwa（ http://bio-bwa.sourceforge.net/），简单方便速度快（paralyzed）；output files例如SAM，BAM format比较standardized，方便后续处理；他们的网站有非常详细的manual。

如果是做ChIP-seq，通常第二步是call peaks。TF的peak一般比较peaky，MACS(http://liulab.dfci.harvard.edu/MACS /00README.html)是个不错的peak caller. TF ChIP-seq，一般会在call出peak之后assign peak to gene, in order to find TF regulated genes，这个比较arbitary，overlap of the called peak with gene promoter, enhancer, or use GREAT(http://great.stanford.edu/) to assign genes. 通常TF-DNA binding都有motif，meme（http://meme.sdsc.edu/meme/intro.html）是个不错的motif caller，use meme你可以看看这些TF enriched sites是否有significant motif, you can also check whether a known motif is enriched in a certain genomic region you interested.也就是说，既可以从genomic region找motif，也可以反向从motif找potential binding sites。

Histome modification，例如H3K4me3, K3K36me3,H3K27me3,H3K27ac, etc通常不具有
motif（not expected to），usually people are interested in their position,
downstream genes,how the intensity of peak changes (from average diagram),
whether there is bivalent domain (H3K4me3, H3K27me3) etc. 整体来说，对于ChIP
-seq,可以利用open source tool自己整workflow；可以利用galaxy(http://main.g2.bx.psu.edu/)已有的workflow 和file tools（上面也有详细的tutorial）；也可以用Partek Genomic Suite （not free,licence required）.后两个适合bench worker使用，非常容易上手。

Exon-seq and whole-DNA-seq asks very different question, usually people are
interested to know genetic variations, such as SNPs, Indels, Copy number
variation and looking for reoccurrence in tumor samples. 第一步alignment几乎
没有什么区别，找genetic variation, SeqGene(http://sourceforge.net/apps/mediawiki/seqgene /index.php?title=SeqGene)是一个比较简单且实用的tool。SNPs/Indels的prediction，基本上是比较准确的，可以说那些high quality的prediction，95%是可以通过sangar sequencing validated。当然在找出SNPs/Indels之后，除了reoccurrence以外（这个需要大量sample），如何去ranking这些 SNPs/Indels的重要性，例如处在那些基因，有多少是missense mutation，在一个pathway中有多少基因被mutated，mutation所处的位点是否对some sexy gene的expression有影响，如何linkage with expression data等等，尚处在研究阶段。因此做为marker，reoccurrence mutation的detection有重要意义，但同时其后续的functional study比较难，因为在找到的mutation之中，很难区分哪个是driver，哪个是receiver。除此之外，由于cancer tumor的 heterogeneity，要想发现那些真正在tumor cell里面的mutation，asks for new bioinformatics algorithms。

RNA-seq通常又被称为transcriptome sequencing。RNA-seq can be used to study
RNA expression variation and gene isoforms structure variation between
samples, novel RNA(such as long-intergenic-noncoding RNA)detection, microRNA
targets etc. Sequencing depth对于RNA-seq来讲非常重要，很多，例如novel
transcripts没有deep到一定程度是很难detect到的。现在比较流行的有Illumina
highseq 2000，和Roche 454，前者可以sequence up to 200M reads（pair-end）后者
号称可以detect longer reads up to 800bp（我没用过，不做评论）。从
bioiformatics的角度讲，与DNA-seq相比，RNA-seq要稍微复杂一些，当然也更有意思
。主要需要处理的就是alternative splicing。Tophat（ http://tophat.cbcb.umd.edu/）是目前比较流行的RNA-seq aligner。其部分也是基于bowtie，只对bowtie unalignable reads处理的时候，才去考虑splicing。后续对transciptome的 prediction，scripture(http://www.broadinstitute.org/software/scripture/)和 cufflinks(http://cufflinks.cbcb.umd.edu/manual.html)都是比较常用的reference based transcriptome assembler. 他们的主要作用就是把tophat align好的reads（当然也包括tophat predicted splicing junctions）组装成transcriptome。通过对已知gene annotation（例如RefSeq genes，UCSC genes）等的筛选，可以找到那些处于intergenic 并且有信号的transcripts，而通常这些是novel transcripts。这两个assembler共同的缺点是，在同一个genomic loci都会predict非常多的isoforms，故而FP比较高，但好处是，可以帮助reseracher非常迅速的找到novel transcripts可能处于的位点。除此之外，cufflinks还可以用来检测gene/transcripts expression difference(there are better ways to do this)，比较两个assembled transcriptome的不同，merge multiple transcriptome etc.就annotate transcriptome来讲，RNA-seq is much better than previously used tilling-array。

总体来说，对RNA-seq的数据分析尚未被标准化，相对于对microarry的分析，即便是最
基本的call significant genes in RNA-seq依然处于发展阶段，在microarry，大家都
知道，有RMA normalization，数据被normalize 之后，可以用ttest，SAM等手段找
significant genes，但是RNA-seq，单就normalization仍尚无定论。不同之处在于
microarry更类似于模拟信号，sequence 更像数字信号,比microarry 用的probe要更准
确，精度更高。此前业内标准是用RPKM（read per kilo-base of exon per million
reads）来代表gene expression level，其实是比较粗糙的，因为在RNA-seq 的
library中，有些RNA expression非常高，而有些去非常低，variation非常大，如果仅
仅用number of mappable reads来normalize，并不是非常好。相对而言，DESeq（R
package）和next-gen SAM algorithm是比较好的tool，好像他们都是用了quantile
normalization(就是除去两头，用中间的50%做normalization)。

RNA-seq也有被用来做structure variation，我尚未用过一个比较reliable的tool。
RNA-seq也可以用来检测在RNA pool里面是否有bacterial和viral infection，PathSeq
(http://www.broadinstitute.org/software/pathseq/index.html)给出了一个比较好的pipeline，但是他们的software并不是work非常well，至少不是我想象中那么便于操作，我们自己写了一个。

俗话说，眼见为实，信号的visulazation对很多ongoing project有非常重要的意义，
有的甚至可以改变问题的问法。我们lab应该说，都是UCSC genome browser的忠实用户
，理由：1.UCSC 提供很多已经做好的track，例如各种gene annotation，Encode/
Gencode里面TF和histome marker的ChIP-seq/RNA-seq信号等等，只要active就可以看
到；2.可以无限量upload custom tracks from your own data, and share the link
to whoever you want to. 在做自己的track的时候，通常会遇到很多file format
conversion，UCSC自己有一套convert tools（http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.i386/）, BedTools(http://code.google.com/p/bedtools/)也是比较常用的。具体来讲，就是通过这些tool，把 align好的reads（usually in SAM or BAM format）变成UCSC tracks(UCSC has a number of accepted track file formats: http://genome.ucsc.edu/goldenPath/help/customTrack.html)。UCSC 还support HTTP and FTP for tracks, usually if a file is greater than 300M, we will put it on Amazon cloud and give the link to UCSC, and we will see the track，and it’s actually very fast.

综上，简单的介绍了一些据我所知的NGS前期数据分析方法，对于每一个project而言，
其问题的提法不同，所需要用到的后续工具也不同，很多情况下需要自己写script去解
决。随着实验技术和软件技术的发展，我相信越来越多的非常有趣的问题可以通过一些
整合的方法得以解决，或者可以通过整合的方法得到解决这些问题的hint。

I forgot to talk about de novo assembly, and thank you for making this point.
For de novo assembly, you can try SOAPdenovo" http://soap.genomics.org.cn/soapdenovo.html", BGI actually has a lot SOAP series, I've never tried and compared to the others, but will soon, I guess.
I did try trinity from Broad "http://trinityrnaseq.sourceforge.net/". very similar algorithms, but people from Broad definitely have more connections than BGI, and I guess that was the main reason why their story gets published on Nature. Biotech.(too far away from the topic)
Usually de novo assembly takes much more RAM than reference based assembly,
so it should work for small genomes, for large genome like human/mouse, what
people usually do is locate a loci, extract reads mapped to this loci, and
do local de novo assembly. This local de novo assembly was also used for SNP
detection in some algorithms.

Open Notebook

Saturday, September 10, 2011

2011 NGS discussion on mitbbs