Alignment-free sequence comparison with biologically significant patterns

Ohjelma: 
Algoritmit ja koneoppiminen
Bioinformatiikka
Yhteyshenkilö: 

 

The interest in phylogeny reconstruction by alignment-free genome and proteome comparison is currently growing (see, e.g., [18, 19] for a review, and [1, 2] for two recent workshops). Alignment-free algorithms map two genomes x and y to high-dimensional vectors X and Y, indexed by all possible substrings of bounded length. X[w] and Y[w] -- the coordinate of X and Y associated with substring w -- contain the frequency of w in x and y, respectively, often transformed into a measure of statistical surprise. Standard distance functions for vectors are then applied to X and Y (e.g. Euclidean distance [14] or KL divergence [16]), and neighbor-joining, or some other form of distance matrix algorithm, is finally applied to build a taxonomy.
 
Composition vectors use, on purpose, little or no biological information. Almost all of them are indexed by string-theoretic components, like substrings of bounded length, rigid gapped motifs of bounded density, or subsequences of bounded span. The choice of such structures is dictated by their information content or by the computational efficiency of their extraction, regardless of biological function. This lack of biological information is consistent with the historical origin of alignment-free algorithms: a "reaction" to alignment-based methods that are prone to reconstruct different trees from different genes, and an attempt to recover as much "phylogenetic signal" as possible from a genome with minimal parameter setting.
 
The most natural way to sprinkle biology over alignment-free algorithms is probably building composition vectors whose coordinates correspond to biologically significant building blocks, rather than to string-theoretic constructs. Composition vectors with biologically significant building blocks have already been applied to protein classification (see, e.g. [3, 11]). In phylogeny reconstruction, trees  have been built by representing genomes as vectors indexed by COG clusters, and containing the frequency (or just the presence or absence) of genes belonging to each cluster (see, e.g., [9, 10, 12]). A similar approach has been applied to SCOP folds and domains (see e.g. [4, 5, 6, 8, 9] for a small sampler, and [7, 15] for a detailed study on the distribution of folds in proteomes), other natural "words" in the protein dictionary of an organism. PROSITE patterns have similarly been found to be selectively over- and under-represented in proteomes [13], but these preferences have not been applied to phylogeny construction yet.
 
COG-based and fold-based methods are limited to the coding parts of genomes, while current alignment-free algorithms allow to exploit information in noncoding parts as well [17]. PROSITE patterns have already been found to be selectively over- and under-represented in the translated intergenic regions of some genomes, and intergenic occurrences have been conjectured to be relics of ancient proteins that have been deactivated by mutation [20]: if this is indeed the case, taking into account the intergenic occurrences of such patterns in phylogeny reconstruction could move related taxa closer to their common ancestor, improving phylogenies. Additionally, while alignment-free algorithms are currently scaling to hundreds of taxa, COG-based and fold-based methods have been tested on just few dozen genomes.
 
In this work, we plan to systematically and exhaustively measure the quality of phylogenies built from the frequency of biologically significant patterns, both in the coding and in the noncoding parts of all genomes sequenced to date. We will experiment with the widest possible spectrum of syntactically different patterns and of biologically different databases -- for example, the gapped patterns in PROSITE, the ungapped strings in PRINTS and TRANSFAC, the profiles in PROSITE, HAMAP and JASPAR, the HMMs in PANTHER, PIRSF, Pfam, SMART, TIGRFAM, Gene3D, and SUPERFAMILY, and the repetitive elements in RepBase and in the RepeatMasker libraries. This will establish which type of pattern carries more "phylogenetic signal", whether combining different patterns improves classification, and whether adding information from the noncoding parts of the genome improves or degrades classification. To compute the statistical significance of seeing a given number of occurrences for specific classes of pattern, we will leverage a set of algorithms that have been developed over the years by the string analysis community.
 

[1] A. Apostolico and A. Dress. Alignment-free sequence comparison workshop. http://www.dei.unipd.it/~axa/AlignFree.htm, June 2011.

[2] A. Apostolico, A. Dress, and L. Parida. Structure discovery in biology: motifs, networks and phylogenies. Dagstuhl seminar. http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=12291, July 2012.

[3] A. Ben-Hur and D. Brutlag. Remote homology detection: a motif based approach. Bioinformatics, 19(suppl. 1):i26–i33, 2003.

[4] G. Caetano-Anolles and D. Caetano-Anolles. Universal sharing patterns in proteomes and evolution of protein fold architecture and life. Journal of Molecular Evolution, 60:484–498, 2005.

[5] E.J. Deeds, H. Hennessey, and E.I. Shakhnovich. Prokaryotic phylogenies inferred from protein structural domains. Genome Research, 15:393–402, 2005.

[6] M. Gerstein. Patterns of protein-fold usage in eight microbial genomes: A comprehensive structural census. Proteins: Structure, Function, and Bioinformatics, 33(4):518–534, 1998.

[7] H. Hegyi, J. Lin, D. Greenbaum, and M. Gerstein. Structural genomics analysis: characteristics of atypical, common, and horizontally transferred folds. Proteins: Structure, Function, and Bioinformatics, 47(2):126–141, 2002.

[8] J. Lin and M. Gerstein. Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing genomes on different levels. Genome Research, 10:808–818, 2000.

[9] J. Lin, J. Qian, D. Greenbaum, P. Bertone, R. Das, N. Echols, A. Senes, B. Stenger, and M. Gerstein. GeneCensus: genome comparisons in terms of metabolic pathway activity and protein family sharing. Nucleic Acids Research, 30:4574–4582, 2002.

[10] L. Ling, J. Wang, Y. Cui, W. Li, and R. Chen. Proteome-wide analysis of protein function composition reveals the clustering and phylogenetic prop- erties of organisms. Molecular Phylogenetics and Evolution, 25(1):101–111, 2002.

[11] B. Logan, P. Moreno, B. Suzek, Z. Weng, and S. Kasif. A study of remote homology detection. Compaq Cambridge Research Laboratory, 2001.

[12] M.G. Montague and C.A. Hutchison. Gene content phylogeny of her- pesviruses. Proceedings of the National Academy of Sciences, 97(10):5334– 5339, 2000.

[13] P. Nicod`eme, T. Doerks, and M. Vingron. Proteome analysis based on motif statistics. Bioinformatics, 18:S161–S171, 2002.

[14] J. Qi, B. Wang, and B. Hao. Whole proteome prokaryote phylogeny without sequence alignment: a k-string composition approach. Journal of Molecular Evolution, 58:1–11, 2004.

[15] J. Qian, N.M. Luscombe, and M. Gerstein. Protein family and fold occur- rence in genomes: power-law behaviour and evolutionary model. Journal of Molecular Biology, 313(4):673–681, 2001.

[16] G.E. Sims, S.-R. Jun, G.A. Wu, and S.-H. Kim. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proceedings of the National Academy of Sciences, 106(8):2677–2682, 2009.

[17] G.E. Sims, S.-R. Jun, G.A. Wu, and S.-H. Kim. Whole-genome phylogeny of mammals: evolutionary information in genic and nongenic regions. Pro- ceedings of the National Academy of Sciences, 106(40):17077–17082, 2009.

[18] S. Vinga. Biological sequence analysis by vector-valued functions: revisiting alignment-free methodologies for DNA and protein classification. In T.D. Pham, H. Yan, and D.I. Crane, editors, Advanced Computational Methods for Biocomputing and Bioimaging. Nova Science Publishers, New York, 2007.

[19] S. Vinga and J. Almeida. Alignment-free sequence comparison – a review. Bioinformatics, 19(4):513–523, 2003.

[20] Z.L. Zhang, P.M. Harrison, and M. Gerstein. Digging deep for ancient relics: a survey of protein motifs in the intergenic sequences of four eukaryotic genomes. Journal of Molecular Biology, 323:811–822, 2002.

09.01.2013 - 09:39 Fabio Cunial
09.01.2013 - 09:39 Fabio Cunial