Comparative genomics: methods and applications.
ABSTRACT Interpreting the functional content of a given genomic sequence is one of the central challenges of biology today. Perhaps the most promising approach to this problem is based on the comparative method of classic biology in the modern guise of sequence comparison. For instance, protein-coding regions tend to be conserved between species. Hence, a simple method for distinguishing a functional exon from the chance absence of stop codons is to investigate its homologue from closely related species. Predicting regulatory elements is even more difficult than exon prediction, but again, comparisons pinpointing conserved sequence motifs upstream of translation start sites are helping to unravel gene regulatory networks. In addition to interspecific studies, intraspecific sequence comparison yields insights into the evolutionary forces that have acted on a species in the past. Of particular interest here is the identification of selection events such as selective sweeps. Both intra- and interspecific sequence comparisons are based on a variety of computational methods, including alignment, phylogenetic reconstruction, and coalescent theory. This article surveys the biology and the central computational ideas applied in recent comparative genomics projects. We argue that the most fruitful method of understanding the functional content of genomes is to study them in the context of related genomic sequences. In particular, such a study may reveal selection, a fundamental pointer to biological relevance.
ABSTRACT: miRNAs (microRNAs) are a class of endogenous small RNAs that are thought to negatively regulate protein production. Aberrant expression of many miRNAs is linked to cancer and other diseases. Little is known about the factors that regulate the expression of miRNAs. We have identified numerous regulatory elements upstream of miRNA genes that are likely to be essential to the transcriptional and posttranscriptional regulation of miRNAs. Newly identified regulatory motifs occur frequently and in multiple copies upstream of miRNAs. The motifs are highly enriched in G and C nucleotides, in comparison with the nucleotide composition of miRNA upstream sequences. Although the motifs were predicted using sequences that are upstream of miRNAs, we find that 99% of the top-predicted motifs preferentially occur within the first 500 nucleotides upstream of the transcription start sites of protein-coding genes; the observed preference in location underscores the validity and importance of the motifs identified in this study. Our study also raises the possibility that a considerable number of well-characterized, disease-associated transcription factors (TFs) of protein-coding genes contribute to the abnormal miRNA expression in diseases such as cancer. Further analysis of predicted miRNA-protein interactions leads us to hypothesize that TFs that include c-Myb, NF-Y, Sp-1, MTF-1, and AP-2alpha are master-regulators of miRNA expression. Our predictions are a solid starting point for the systematic elucidation of the causative basis for aberrant expression patterns of disease-related (e.g., cancer) miRNAs. Thus, we point out that focused studies of the TFs that regulate miRNAs will be paramount in developing cures for miRNA-related diseases. The identification of the miRNA regulatory motifs was facilitated by a new computational method, K-Factor.
K-Factor predicts regulatory motifs in a set of functionally related sequences, without relying on evolutionary conservation.
PLoS Computational Biology 05/2007; 3(4):e67. · 5.22 Impact Factor
Article: transAlign: using amino acids to facilitate the multiple alignment of protein-coding DNA sequences.
ABSTRACT: Alignments of homologous DNA sequences are crucial for comparative genomics and phylogenetic analysis. However, multiple alignment represents a computationally difficult problem. For protein-coding DNA sequences, it is more advantageous in terms of both speed and accuracy to align the amino-acid sequences specified by the DNA sequences rather than the DNA sequences themselves. Many implementations making use of this concept of "translated alignments" are incomplete in the sense that they require the user to manually translate the DNA sequences and to perform the amino-acid alignment. As such, they are not well suited to large-scale automated alignments of large and/or numerous DNA data sets. transAlign is an open-source Perl script that aligns protein-coding DNA sequences via their amino-acid translations to take advantage of the superior multiple-alignment capabilities and speed of an amino-acid alignment. It operates by translating each DNA sequence into its corresponding amino-acid sequence, passing the entire matrix to ClustalW for alignment, and then back-translating the resulting amino-acid alignment to derive the aligned DNA sequences. In the translation step, transAlign determines the optimal orientation and reading frame for each DNA sequence according to the desired genetic code. It also checks for apparent frame shifts in the DNA sequences and can handle frame-shifted sequences in one of three ways (delete, align as amino acids regardless, or profile align as DNA). As a set of comparative benchmarks derived from six protein-coding genes for mammals shows, the strategy implemented in transAlign always improves the speed and usually the apparent accuracy of the alignment of protein-coding DNA sequences. transAlign represents one of few full and cross-platform implementations of the concept of translated alignments. 
Both the advantages accruing from performing a translated alignment and the suite of user-definable options available in the program mean that transAlign is ideally suited for large-scale automated alignments of very large and/or very numerous protein-coding DNA data sets. However, the good performance offered by the program also translates to the alignment of any set of protein-coding sequences. transAlign, including the source code, is freely available at http://www.tierzucht.tum.de/Bininda-Emonds/ (under "Programs").
BMC Bioinformatics 02/2005; 6:156. · 2.75 Impact Factor
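The translate, align, back-translate strategy described in the transAlign abstract can be illustrated with a minimal Python sketch. This is not transAlign's Perl code: the function names `translate` and `back_translate` are hypothetical, only the standard genetic code is used, sequences are assumed to be already in frame and in the forward orientation, and the amino-acid alignment is written by hand where a real pipeline would call an aligner such as ClustalW.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def split_codons(dna):
    """Split an in-frame DNA sequence into codons, dropping a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return [dna[i:i + 3] for i in range(0, usable, 3)]


def translate(dna):
    """Translate in-frame DNA; codons not in the table (e.g. containing N) become 'X'."""
    return "".join(CODON_TABLE.get(codon, "X") for codon in split_codons(dna.upper()))


def back_translate(aligned_protein, dna, gap="-"):
    """Thread the original codons onto an aligned amino-acid sequence.

    Each residue is replaced by the codon that encoded it, and each gap
    character by three gap characters, so the DNA alignment keeps codon
    boundaries intact.
    """
    codons = split_codons(dna)
    residues = sum(1 for ch in aligned_protein if ch != gap)
    if residues != len(codons):
        raise ValueError("aligned protein and DNA encode different lengths")
    out, pos = [], 0
    for ch in aligned_protein:
        if ch == gap:
            out.append(gap * 3)
        else:
            out.append(codons[pos])
            pos += 1
    return "".join(out)


if __name__ == "__main__":
    dna_seqs = ["ATGGCTAAA", "ATGAAA"]
    proteins = [translate(s) for s in dna_seqs]   # ['MAK', 'MK']
    # Hand-written stand-in for the amino-acid alignment step.
    aligned = ["MAK", "M-K"]
    for prot, dna in zip(aligned, dna_seqs):
        print(back_translate(prot, dna))
```

Because gaps are inserted only in multiples of three, the resulting DNA alignment cannot introduce frame shifts that were absent from the input, which is the main appeal of aligning at the amino-acid level.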
ABSTRACT: Sequence comparison by alignment is a fundamental tool of molecular biology. In this paper we show how a number of sequence comparison tasks, including the detection of unique genomic regions, can be accomplished efficiently without an alignment step. Our procedure for nucleotide sequence comparison is based on shortest unique substrings. These are substrings which occur only once within the sequence or set of sequences analysed and which cannot be further reduced in length without losing the property of uniqueness. Such substrings can be detected using generalized suffix trees. We find that the shortest unique substrings in Caenorhabditis elegans, human and mouse are no longer than 11 bp in the autosomes of these organisms. In mouse and human these unique substrings are significantly clustered in upstream regions of known genes. Moreover, the probability of finding such short unique substrings in the genomes of human or mouse by chance is extremely small. We derive an analytical expression for the null distribution of shortest unique substrings, given the GC-content of the query sequences. Furthermore, we apply our method to rapidly detect unique genomic regions in the genome of Staphylococcus aureus strain MSSA476 compared to four other staphylococcal genomes. We combine a method to rapidly search for shortest unique substrings in DNA sequences and a derivation of their null distribution. We show that unique regions in an arbitrary sample of genomes can be efficiently detected with this method. The corresponding programs shustring (SHortest Unique subSTRING) and shulen are written in C and available at http://adenine.biz.fh-weihenstephan.de/shustring/.
BMC Bioinformatics 02/2005; 6:123. · 2.75 Impact Factor
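The notion of a shortest unique substring can be made concrete with a small Python sketch. It is not the shustring program: it replaces the generalized suffix tree with naive k-mer counting, which is only practical for short sequences, it looks at the forward strand of a single sequence, and the function name `shortest_unique_substrings` is hypothetical.

```python
from collections import Counter


def shortest_unique_substrings(seq):
    """Return (k, hits) for the smallest k at which some k-mer occurs exactly once.

    hits is a list of (start, substring) pairs with 0-based start positions.
    Returns None for an empty sequence.
    """
    seq = seq.upper()
    for k in range(1, len(seq) + 1):
        windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        counts = Counter(windows)
        hits = [(i, w) for i, w in enumerate(windows) if counts[w] == 1]
        if hits:
            # No shorter length produced a unique substring, so these
            # cannot be reduced further without losing uniqueness.
            return k, hits
    return None


if __name__ == "__main__":
    # Every single base repeats, but the dinucleotide CA occurs only once.
    print(shortest_unique_substrings("AACCAACC"))
```

Counting every k-mer for each candidate length is wasteful for genome-scale input, which is why a suffix-tree approach such as the one named in the abstract is preferred there.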