https://media.nature.com/original/nature-assets/ng/journal/v36/n6/extref/ng1355-S2.pdf
Identification of duplicate genes and singletons

After database cleaning, we conducted an all-against-all FASTA3 self-search
for the entire proteome of Drosophila melanogaster
(http://www.ensembl.org/Drosophila_melanogaster/) and that of Saccharomyces
cerevisiae (http://genome-www.stanford.edu/Saccharomyces/). A single-copy gene (i.e., a
singleton) was defined as a protein that did not hit any other protein in the FASTA
search at an E-value cutoff of E = 0.1; this loose similarity criterion was used to make sure that a
singleton is indeed a singleton. Two genes were regarded as duplicate genes if they met
the following three criteria in the FASTA all-against-all search (modified after Ref 4):
(1) $E = 10^{-10}$; (2) their similarity is ≥ I, where I = 30% if L ≥ 150 a.a. and
$I = 0.01n + 4.8L^{-0.32(1 + \exp(-L/1000))}$ if L < 150 a.a., with n = 6 and L the
length of the alignable region; and (3)
the length of the alignable region between the two sequences is >50% of the longer
protein. Since we wanted to detect the differences in expression change between real
duplicate genes and singletons, we