I study computational and quantitative biology with a focus on network aging. This site serves as my notebook and as a way to communicate effectively with my students and collaborators. Every now and then, a post may be of interest to other researchers or teachers. Views in this blog are my own. All rights to research results and findings on this blog are reserved. See also http://youtube.com/c/hongqin
There are several other versions of GRCh37/GRCh38. What’s wrong with them? Here is a collection of potential issues:
1. Inclusion of ALT contigs. ALT contigs are large variations with very long flanking sequences nearly identical to the primary human assembly. Most read mappers will give mapping quality zero to reads mapped in the flanking sequences. This will reduce the sensitivity of variant calling and many other analyses. You can resolve this issue with an ALT-aware mapper, but no mainstream variant callers or other tools can take advantage of ALT-aware mapping.
2. Padding ALT contigs with long “N”s. This has the same problem as 1 and also increases the size of the genome unnecessarily. It is worse.
3. Inclusion of multi-placed sequences. In both GRCh37 and GRCh38, the pseudo-autosomal regions (PARs) of chrX are also placed onto chrY. If you use a reference genome that contains both copies, you will not be able to call any variants in PARs with a standard pipeline. In GRCh38, some alpha satellites are placed multiple times, too. The right solution is to hard mask PARs on chrY and those extra copies of alpha repeats.
4. Not using the rCRS mitochondrial sequence. rCRS is widely used in population genetics. However, the official GRCh37 comes with a mitochondrial sequence 2bp longer than rCRS. If you want to analyze mitochondrial phylogeny, this 2bp insertion will cause trouble. GRCh38 uses rCRS.
5. Converting semi-ambiguous IUB codes to “N”. This is a very minor issue, though. Human chromosomal sequences contain few semi-ambiguous bases.
6. Using accession numbers instead of chromosome names. Do you know that CM000663.2 corresponds to chr1 in GRCh38?
7. Not including unplaced and unlocalized contigs. This will force reads originating from these contigs to be mapped to the chromosomal assembly and lead to false variant calls.
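To make the hard-masking fix for multi-placed sequences concrete, here is a minimal Python sketch. The function name `hard_mask`, the toy sequence, and the toy coordinates are all assumptions made for illustration; real PAR and alpha-satellite coordinates should be taken from the assembly's own documentation, and a real pipeline would stream the FASTA rather than hold a chromosome in a Python string.

```python
def hard_mask(seq, regions):
    """Return seq with every 0-based, half-open (start, end) region replaced by 'N'."""
    chars = list(seq)
    for start, end in regions:
        if not 0 <= start <= end <= len(seq):
            raise ValueError(f"region ({start}, {end}) is outside the sequence")
        # Overwrite the duplicated copy so reads can only map to the other placement.
        chars[start:end] = "N" * (end - start)
    return "".join(chars)


# Toy stand-in for chrY with two hypothetical duplicated regions (not real PAR coordinates).
toy_chry = "ACGTACGTACGTACGTACGT"
toy_regions = [(2, 6), (16, 20)]

masked = hard_mask(toy_chry, toy_regions)
print(masked)  # ACNNNNGTACGTACGTNNNN
```

Masking keeps the coordinate system of the chromosome unchanged, which is why it is preferable to deleting the duplicated copy outright.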
Now we can explain what is wrong with other versions of human reference genomes:
GCA_000001405.15_GRCh38_genomic.fna.gz from NCBI: 1, 3, 5 and 6.
Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz from EnsEMBL: 3.
Homo_sapiens.GRCh38.dna.toplevel.fa.gz from EnsEMBL: 1, 2 and 3.
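A rough way to screen a reference for some of these issues is to look at its sequence names and lengths, for example the first two columns of a `.fai` index. The sketch below is only illustrative: `audit_contigs`, the accession regular expression, the set of mitochondrial names, and the toy lengths are assumptions, and the `_alt` suffix check only applies to UCSC-style naming. The one hard number used is the rCRS length of 16,569 bp.

```python
import re

# Rough pattern for GenBank-style accessions such as CM000663.2 (an assumption, not a full spec).
ACCESSION_RE = re.compile(r"^[A-Z]{1,2}\d{5,8}\.\d+$")
RCRS_LENGTH = 16569                      # length of the rCRS mitochondrial sequence in bp
MITO_NAMES = {"chrM", "chrMT", "MT", "M"}  # common mitochondrial names (assumed list)


def audit_contigs(contigs):
    """Flag suspicious entries in an iterable of (name, length) pairs."""
    findings = []
    for name, length in contigs:
        if ACCESSION_RE.match(name):
            findings.append((name, "accession used instead of a chromosome name"))
        if name.endswith("_alt"):
            findings.append((name, "ALT contig (UCSC-style '_alt' suffix)"))
        if name in MITO_NAMES and length != RCRS_LENGTH:
            findings.append((name, "mitochondrial length differs from rCRS"))
    return findings


# Toy input; lengths other than the mitochondrial one are made up.
toy_contigs = [
    ("chr1", 1000),
    ("CM000663.2", 1000),
    ("chr1_KI270762v1_alt", 500),
    ("chrM", 16571),
]

findings = audit_contigs(toy_contigs)
for name, note in findings:
    print(f"{name}\t{note}")
```

A name-based screen like this cannot detect multi-placed sequences or IUB codes converted to “N”; those require looking at the sequence content itself.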
Using an inappropriate human reference genome is usually not a big deal unless you study regions affected by the issues. However, 1) other researchers may be studying these biologically interesting regions and will need to redo the alignment; 2) aggregating data mapped to different versions of the genome will amplify the problems. It is still preferable to choose the right genome version if possible.