Classification and regression tree (CART) analyses of genomic signatures reveal sets of tetramers that discriminate temperature optima of archaea and bacteria

Department of Biology, Wheaton College, Norton, MA 02766, USA.
Archaea (Vancouver, B.C.) (Impact Factor: 2.71). 01/2009; 2(3):159-67. DOI: 10.1155/2008/829730
Source: PubMed


Classification and regression tree (CART) analysis was applied to genome-wide tetranucleotide frequencies (genomic signatures) of 195 archaea and bacteria. Although genomic signatures have typically been used to classify evolutionary divergence, in this study, convergent evolution was the focus. Temperature optima for most of the organisms examined could be distinguished by CART analyses of tetranucleotide frequencies. This suggests that pervasive (nonlinear) qualities of genomes may reflect certain environmental conditions (such as temperature) in which those genomes evolved. The predominant use of GAGA and AGGA as the discriminating tetramers in CART models suggests that purine-loading and codon biases of thermophiles may explain some of the results.
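The two ingredients named in the abstract, a tetranucleotide-frequency signature and a tree-style split on one tetramer's frequency, can be sketched as follows. This is a minimal illustration, not the authors' implementation. The function names are hypothetical. Counting overlapping windows on a single strand and scoring splits by Gini impurity are assumptions, since the abstract does not specify either.

```python
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides over the DNA alphabet.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetramer_frequencies(sequence):
    """Relative frequency of each tetramer in one strand (assumed convention).

    Overlapping windows are counted; windows containing symbols other than
    A, C, G, T (for example N) are ignored.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[t] for t in TETRAMERS)
    return {t: (counts[t] / total if total else 0.0) for t in TETRAMERS}


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Best single threshold on one feature, as a CART node would choose it.

    Returns (threshold, weighted_gini). The threshold is None when no split
    improves on the impurity of the unsplit data.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # cannot place a threshold between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = ((low + high) / 2, score)
    return best
```

In use, `tetramer_frequencies` would be applied to each genome, and `best_split` to the resulting frequencies of a single tetramer (such as GAGA) paired with each organism's temperature class. A full CART model repeats that search over all 256 tetramers and recurses on each side of the chosen split.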



Available from: Mark D. LeBlanc, Jan 06, 2014
    • "Karlin et al. [22] reported that the tetranucleotide CTAG is extremely underrepresented and distributed in an anomalous fashion along the genome of the thermophilic microbe M. jannaschii. Applying classification and regression tree (CART) analysis to genome-wide tetranucleotide frequencies of 195 archaea and bacteria, Dyer et al. [35] reported the discriminating tetramers, the frequencies of which could differentiate between three temperature ranges: hyperthermophily, thermophily and mesophily."
    ABSTRACT: Microbes are known for their ability to adapt to varied lifestyles and environments, including extreme or adverse ones. The genomic architecture of a microbe may bear the signatures not only of its phylogenetic position but also of the lifestyle to which it is adapted. This review gives an account of the specific genome signatures observed in microbes acclimatized to distinct lifestyles or ecological niches. It discusses niche-specific signatures identified at different levels of microbial genome organization, such as base composition, GC-skew, purine-pyrimidine ratio, dinucleotide abundance, codon bias, and oligonucleotide composition. Among the specific cases highlighted are genome shrinkage in obligate host-restricted microbes, genome expansion in strictly intra-amoebal pathogens, strand-specific codon usage in intracellular species, acquisition of genome islands in pathogenic or symbiotic organisms, discriminatory genomic traits of marine microbes with distinct trophic strategies, and conspicuous sequence features of certain extremophiles, such as those adapted to high temperature or high salinity.
    Full-text · Article · Apr 2012 · Current Genomics

    ABSTRACT: The genome of every organism is composed of millions of small chemical units called nucleobases, whose arrangement along a strand of DNA provides the recipe for the development of the organism. Bioinformaticists analyze patterns of bases in DNA, and by treating bases as an alphabet and genomes as texts have reinvented a number of techniques originally developed by philologists. But the sheer size of genomic texts has also forced bioinformatics to move beyond traditional philological methods and to use information processing and statistical techniques to analyze patterns that are otherwise too large or too subtle to be noted by the unaided eye. We have adapted some of these computer-aided methods from bioinformatics for the purpose of analyzing medieval texts, and demonstrate in this paper that our "lexomic" methods can be used to detect relationships between, and structures within, poetic texts in the Old English corpus. Specifically, we show that our methods can recognize the relationship between Daniel and Azarias and the divisions between Genesis A and B as well as those between Guthlac A and B and between Christ I, II, and III. We also demonstrate how lexomics can be used to shed light on the possible Cynewulfian affinity of Guthlac B and conclude that certain peculiarities in branching diagrams may be diagnostic of the existence of outside sources for Old English poems. The term "lexomics" was originally coined to describe the computer-assisted detection of "words" (short sequences of bases) in genomes. When applied to literature, lexomics is the analysis of the frequency, distribution, and arrangement of words in large-scale patterns. The term "lexomics" is perhaps even more applicable to the specific use we make of it here than in its original context, if only because a "word" in a written language is an obvious, well defined, and relatively uncontroversial category. 
Our methods are implemented as a series of programs or scripts written in the language Perl, freely available at the project website. The open-source scripts enable a researcher to: (i) sort texts into directories (poetry vs. prose, then by manuscript), (ii) "cut" texts into equal-sized chunks, (iii) count the number of times every word occurs in each text or chunk, (iv) generate statistical summaries of texts or chunks, and (v) prepare counts of words for classification and/or cluster analyses.

We first identify and count the words in a text. Our project focuses on texts in Old English where, fortunately, a complete corpus of Old English poetry has been assembled by the Dictionary of Old English project, greatly simplifying an otherwise onerous task. The Dictionary of Old English identifies individual words as strings of characters bounded by white space. Our software tabulates all the words in any group of texts and calculates the number of unique words and the number of words which appear only once. The table of words from any group of texts is created as comma-delimited data and can be used in any spreadsheet application (e.g., Microsoft Excel). These data can then be analyzed with various statistical techniques.

[Caption: 10 most frequently occurring words in the poem Azarias.]

It is important to note some complexities that must be dealt with when analyzing any texts as well as some problems unique to the Old English corpus. First, there are the problems associated with using edited or normalized texts versus diplomatic editions. For example, the Dictionary of Old English corpus uses Arnold Schröer's edition of the Rule of St. Benedict for its electronic version of that text. Schröer collated five manuscripts dating from the end of the tenth to the beginning of the twelfth century, so his edition does not reflect any single extant manuscript.
Any conclusions drawn from analysis of the text in the DOE corpus are therefore in part dependent upon the editorial decisions and normalizations made by Schröer. Researchers can modify the DOE files to make them consistent with any one manuscript and then perform their research on the modified files, and there may be good reasons for using edited texts (for example, the removal of obvious errors), but scholars...
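The counting steps described in the excerpt above (cutting a text into chunks, counting words, summarizing unique and once-occurring words, and writing comma-delimited output) can be sketched in a few lines. The project's own scripts are written in Perl; the Python below is only an illustrative sketch with hypothetical function names, not a port of those scripts. Letting the final chunk be shorter than the rest is an assumption, since the excerpt does not say how a remainder is handled.

```python
import csv
import io
from collections import Counter


def tokenize(text):
    """Split a text into words, taken here as strings bounded by white space."""
    return text.split()


def cut_into_chunks(words, chunk_size):
    """Cut a word list into consecutive chunks of chunk_size words.

    Assumption: a shorter final chunk is kept rather than merged or dropped.
    """
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


def summarize(words):
    """Total words, unique words, and words that appear only once."""
    counts = Counter(words)
    return {
        "total_words": len(words),
        "unique_words": len(counts),
        "words_occurring_once": sum(1 for c in counts.values() if c == 1),
    }


def counts_to_csv(chunks):
    """Comma-delimited table: one row per chunk, one column per word."""
    vocabulary = sorted(set().union(*chunks))
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["chunk"] + vocabulary)
    for index, chunk in enumerate(chunks, start=1):
        counts = Counter(chunk)
        writer.writerow([index] + [counts[word] for word in vocabulary])
    return buffer.getvalue()
```

The string returned by `counts_to_csv` can be written to a file and opened in a spreadsheet application, or passed to a clustering or classification routine with one row per chunk.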
    No preview · Article · Jun 2011 · JEGP Journal of English and Germanic Philology