ABSTRACT: While molecular analyses have provided insight into the phylogeny of ciliates, the few studies assessing intraspecific variation have largely relied on just a single locus [e.g., nuclear small subunit rDNA (nSSU-rDNA) or mitochondrial cytochrome oxidase I]. In this study, we characterize the diversity of several nuclear protein-coding genes plus both nSSU-rDNA and mitochondrial small subunit rDNA (mtSSU-rDNA) of five isolates of the ciliate morphospecies Chilodonella uncinata. Although these isolates have nearly identical nSSU-rDNA sequences, they differ by up to 8.0% in mtSSU-rDNA. Comparative analyses of all loci, including β-tubulin paralogs, indicate a lack of recombination between strains, demonstrating that the morphospecies C. uncinata consists of multiple cryptic species. Further, there is considerable variation in substitution rates among loci as some protein-coding domains are nearly identical between isolates, while others differ by up to 13.2% at the amino acid level. Combining insights on macronuclear variation among isolates, the focus of this study, with published data from the micronucleus of two of these isolates, indicates that C. uncinata lineages are able to maintain both highly divergent and highly conserved genes within a rapidly evolving germline genome.
Journal of Molecular Evolution 12/2011; 73(5-6):266-72. · 2.15 Impact Factor
ABSTRACT: Codon models of evolution have facilitated the interpretation of selective forces operating on genomes. These models, however, assume a single rate of non-synonymous substitution irrespective of the nature of amino acids being exchanged. Recent developments have shown that models which allow for amino acid pairs to have independent rates of substitution offer improved fit over single rate models. However, these approaches have been limited by the necessity for large alignments in their estimation. An alternative approach is to assume that substitution rates between amino acid pairs can be subdivided into rate classes, dependent on the information content of the alignment. However, given the combinatorially large number of such models, an efficient model search strategy is needed. Here we develop a Genetic Algorithm (GA) method for the estimation of such models. A GA is used to assign amino acid substitution pairs to a series of rate classes, where the number of classes is estimated from the alignment. Other parameters of the phylogenetic Markov model, including substitution rates, character frequencies and branch lengths, are estimated using standard maximum likelihood optimization procedures. We apply the GA to empirical alignments and show improved model fit over existing models of codon evolution. Our results suggest that current models are poor approximations of protein evolution and thus gene- and organism-specific multi-rate models that incorporate amino acid substitution biases are preferred. We further anticipate that the clustering of amino acid substitution rates into classes will be biologically informative, such that genes with similar functions exhibit similar clustering, and hence this clustering will be useful for the evolutionary fingerprinting of genes.
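To give a feel for why the model space is "combinatorially large", the sketch below counts the ways of partitioning a set of substitution pairs into exactly k non-empty rate classes (a Stirling number of the second kind). It assumes, purely for illustration, that all 190 unordered amino acid pairs are eligible; the abstract does not say which pairs the authors consider, and the function name `stirling2` is hypothetical.

```python
def stirling2(n: int, k: int) -> int:
    """Ways to partition n labelled items into exactly k non-empty classes."""
    if k < 0 or k > n:
        return 0
    # row[j] holds S(i, j) for the current i; start from S(0, 0) = 1.
    row = [1] + [0] * k
    for i in range(1, n + 1):
        new = [0] * (k + 1)
        for j in range(1, min(i, k) + 1):
            # Item i either joins one of j existing classes or opens a new one.
            new[j] = j * row[j] + row[j - 1]
        row = new
    return row[k]


# Assumption for illustration: every unordered amino acid pair is eligible.
N_PAIRS = 20 * 19 // 2  # 190

if __name__ == "__main__":
    for k in (2, 3, 5):
        count = stirling2(N_PAIRS, k)
        print(f"{k} rate classes: a {len(str(count))}-digit number of assignments")
```

Even with only two classes there are 2^189 - 1 possible assignments under this assumption, which is why an exhaustive search is not feasible and a heuristic such as a GA is used.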
ABSTRACT: Markov models of codon substitution are powerful inferential tools for studying biological processes such as natural selection and preferences in amino acid substitution. The equilibrium character distributions of these models are almost always estimated using nucleotide frequencies observed in a sequence alignment, primarily as a matter of historical convention. In this note, we demonstrate that a popular class of such estimators is biased, and that this bias has an adverse effect on goodness of fit and estimates of substitution rates. We propose a "corrected" empirical estimator that begins with observed nucleotide counts, but accounts for the nucleotide composition of stop codons. We show via simulation that the corrected estimates outperform the de facto standard estimates not just by providing better estimates of the frequencies themselves, but also by leading to improved estimation of other parameters in the evolutionary models. On a curated collection of sequence alignments, our estimators show a significant improvement in goodness of fit compared to the de facto standard approach. Maximum likelihood estimation of the frequency parameters appears to be warranted in many cases, albeit at a greater computational cost. Our results demonstrate that there is little justification, either statistical or computational, for continued use of the de facto standard estimators.
PLoS ONE 01/2010; 5(7):e11230. · 3.73 Impact Factor
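The bias described in the abstract above can be illustrated with a small sketch. It assumes that the "popular class" of estimators sets each codon frequency proportional to the product of its three position-specific nucleotide frequencies and renormalises over sense codons; the abstract does not name the estimator, so this is an assumption, as are the universal genetic code and all function names. The sketch only exposes the mismatch and does not implement the authors' corrected estimator.

```python
from itertools import product

NUCS = "TCAG"
STOPS = {"TAA", "TAG", "TGA"}  # assumption: universal genetic code


def product_codon_freqs(pos_freqs):
    """Codon frequencies proportional to the product of position-specific
    nucleotide frequencies, renormalised over the 61 sense codons."""
    raw = {}
    for codon in map("".join, product(NUCS, repeat=3)):
        if codon in STOPS:
            continue  # stop codons are excluded from the state space
        raw[codon] = (pos_freqs[0][codon[0]]
                      * pos_freqs[1][codon[1]]
                      * pos_freqs[2][codon[2]])
    total = sum(raw.values())
    return {codon: weight / total for codon, weight in raw.items()}


def implied_position_freqs(codon_freqs):
    """Nucleotide frequencies at each codon position implied by codon_freqs."""
    marginals = [dict.fromkeys(NUCS, 0.0) for _ in range(3)]
    for codon, freq in codon_freqs.items():
        for pos, nuc in enumerate(codon):
            marginals[pos][nuc] += freq
    return marginals


# Plug in uniform "observed" nucleotide frequencies at every position.
uniform = [dict.fromkeys(NUCS, 0.25) for _ in range(3)]
implied = implied_position_freqs(product_codon_freqs(uniform))

if __name__ == "__main__":
    # T at position 1 comes out as 13/61 (about 0.213), not the 0.25 put in,
    # because three T-initial codons are stops.
    print(implied[0]["T"])
```

Because removing stop codons shifts the implied composition, plugging observed nucleotide frequencies straight into the product form yields a model whose own nucleotide composition no longer matches the data.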
ABSTRACT: The single rate codon model of non-synonymous substitution is ubiquitous in phylogenetic modeling. Indeed, the use of a non-synonymous to synonymous substitution rate ratio parameter has facilitated the interpretation of selection pressure on genomes. Although the single rate model has achieved wide acceptance, we argue that the assumption of a single rate of non-synonymous substitution is biologically unreasonable, given observed differences in substitution rates evident from empirical amino acid models. Some have attempted to incorporate amino acid substitution biases into models of codon evolution and have shown improved model performance versus the single rate model. Here, we show that the single rate model of non-synonymous substitution is easily outperformed by a model with multiple non-synonymous rate classes, yet in which amino acid substitution pairs are assigned randomly to these classes. We argue that, since the single rate model is so easy to improve upon, new codon models should not be validated entirely on the basis of improved model fit over this model. Rather, we should strive to both improve on the single rate model and to approximate the general time-reversible model of codon substitution, with as few parameters as possible, so as to reduce model over-fitting. We hint at how this can be achieved with a Genetic Algorithm approach in which rate classes are assigned on the basis of sequence information content.
PLoS ONE 01/2010; 5(7):e11587. · 3.73 Impact Factor
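The random baseline mentioned in the abstract above, in which amino acid substitution pairs are assigned to rate classes at random, can be sketched as follows. This is a minimal illustration under assumptions: it assigns all 190 unordered amino acid pairs uniformly at random, and the function name and `seed` parameter are hypothetical rather than taken from the paper.

```python
import random
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def random_rate_classes(n_classes, seed=None):
    """Assign every unordered amino acid pair to a rate class chosen
    uniformly at random; returns {(aa1, aa2): class_index}."""
    rng = random.Random(seed)  # local generator keeps runs reproducible
    pairs = combinations(AMINO_ACIDS, 2)
    return {pair: rng.randrange(n_classes) for pair in pairs}


if __name__ == "__main__":
    assignment = random_rate_classes(n_classes=3, seed=1)
    print(len(assignment), "pairs assigned to", len(set(assignment.values())), "classes")
```

Each class would then receive its own non-synonymous rate parameter, fitted by maximum likelihood; that fitting step is outside the scope of this sketch.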
ABSTRACT: To understand astrovirus biology, it is essential to understand factors associated with its evolution. The current study reports the genomic sequences of nine novel turkey astrovirus (TAstV) type 2-like clinical isolates. This represents, to our knowledge, the largest genomic-length data set available for any one astrovirus type. The comparison of these TAstV sequences suggests that the TAstV species contains multiple subtypes and that recombination events have occurred across the astrovirus genome. In addition, the analysis of the capsid gene demonstrated evidence for both site-specific positive selection and purifying selection.
Journal of Virology 06/2008; 82(10):5099-103. · 5.08 Impact Factor
ABSTRACT: The choice of a probabilistic model to describe sequence evolution can and should be justified. Underfitting the data through the use of overly simplistic models may miss out on interesting phenomena and lead to incorrect inferences. Overfitting the data with models that are too complex may ascribe biological meaning to statistical artifacts and result in falsely significant findings. We describe a likelihood-based approach for evolutionary model selection. The procedure employs a genetic algorithm (GA) to quickly explore a combinatorially large set of all possible time-reversible Markov models with a fixed number of substitution rates. When applied to stem RNA data subject to well-understood evolutionary forces, the models found by the GA 1) capture the expected overall rate patterns a priori; 2) fit the data better than the best available models based on a priori assumptions, suggesting subtle substitution patterns not previously recognized; 3) cannot be rejected in favor of the general reversible model, implying that the evolution of stem RNA sequences can be explained well with only a few substitution rate parameters; and 4) perform well on simulated data, both in terms of goodness of fit and the ability to estimate evolutionary rates. We also investigate the utility of several distance measures for comparing and contrasting inferred evolutionary models. Using widely available small computer clusters, our approach makes it possible, for the first time, to evaluate the performance of existing RNA evolutionary models by comparing them with a large pool of candidate models and to validate common modeling assumptions. In addition, the new method provides the foundation for rigorous selection and comparison of substitution models for other types of sequence data.
Molecular Biology and Evolution 02/2007; 24(1):159-70. · 10.35 Impact Factor
ABSTRACT: Studies of microbial eukaryotes have been pivotal in the discovery of biological phenomena, including RNA editing, self-splicing RNA, and telomere addition. Here we extend this list by demonstrating that genome architecture, namely the extensive processing of somatic (macronuclear) genomes in some ciliate lineages, is associated with elevated rates of protein evolution. Using newly developed likelihood-based procedures for studying molecular evolution, we investigate 6 genes to compare 1) ciliate protein evolution to that of 3 other clades of eukaryotes (plants, animals, and fungi) and 2) protein evolution in ciliates with extensively processed macronuclear genomes to that of other ciliate lineages. In 5 of the 6 genes, ciliates are estimated to have a higher ratio of nonsynonymous/synonymous substitution rates, consistent with an increase in the rate of protein diversification in ciliates relative to other eukaryotes. Even more striking, there is a significant effect of genome architecture within ciliates as the most divergent proteins are consistently found in those lineages with the most highly processed macronuclear genomes. We propose a model whereby genome architecture (specifically chromosomal processing, amitosis within macronuclei, and epigenetics) allows ciliates to explore protein space in a novel manner. Further, we predict that examination of diverse eukaryotes will reveal additional evidence of the impact of genome architecture on molecular evolution.
Molecular Biology and Evolution 10/2006; 23(9):1681-7. · 10.35 Impact Factor
ABSTRACT: We develop a new model for studying the molecular evolution of protein-coding DNA sequences. In contrast to existing models, we incorporate the potential for site-to-site heterogeneity of both synonymous and nonsynonymous substitution rates. We demonstrate that within-gene heterogeneity of synonymous substitution rates appears to be common. Using the new family of models, we investigate the utility of a variety of new statistical inference procedures, and we pay particular attention to issues surrounding the detection of sites undergoing positive selection. We discuss how failure to model synonymous rate variation can lead to misidentification of sites as positively selected.
Molecular Biology and Evolution 01/2006; 22(12):2375-85. · 10.35 Impact Factor
ABSTRACT: We analyze members of the receptor-like kinase (RLK) gene family in Arabidopsis thaliana for positive selection. Likelihood analyses find evidence for positive selection in 12 of the 52 RLK family sequence groups. These 12 groups represent 97 of the 403 sequences analyzed. The majority of genes in groups subject to positive selection have not been functionally characterized, but sites under selection are predominantly located in the extracellular region. The pattern of selection in the extracellular leucine-rich repeat (LRR) motif of groups 14 and 51 is similar to that reported in previous studies, where positively selected positions are located in a solvent-exposed beta-strand that may determine disease specificity, raising the possibility that some RLK genes function in a similar role.
Journal of Molecular Evolution 10/2005; 61(3):325-32. · 2.15 Impact Factor
ABSTRACT: PowerMarker delivers a data-driven, integrated analysis environment (IAE) for genetic data. The IAE integrates data management, analysis and visualization in a user-friendly graphical user interface. It accelerates the analysis lifecycle and enables users to maintain data integrity throughout the process. An ever-growing list of more than 50 different statistical analyses for genetic markers has been implemented in PowerMarker. AVAILABILITY: www.powermarker.net
ABSTRACT: The HyPhy package is designed to provide a flexible and unified platform for carrying out likelihood-based analyses on multiple alignments of molecular sequence data, with the emphasis on studies of rates and patterns of sequence evolution. AVAILABILITY: http://www.hyphy.org CONTACT: firstname.lastname@example.org SUPPLEMENTARY INFORMATION: HyPhy documentation and tutorials are available at http://www.hyphy.org.
ABSTRACT: Likelihood applications have become a central approach for molecular evolutionary analyses since the first computationally tractable treatment two decades ago. Although Felsenstein's original pruning algorithm makes likelihood calculations feasible, it is usually possible to take advantage of repetitive structure present in the data to arrive at even greater computational reductions. In particular, alignment columns with certain similarities have components of the likelihood calculation that are identical and need not be recomputed if columns are evaluated in an optimal order. We develop an algorithm for exploiting this speed improvement via an application of graph theory. The reductions provided by the method depend on both the tree and the data, but typical savings range between 15% and 50%. Real-data examples with time reductions of 80% have been identified. The overhead costs associated with implementing the algorithm are minimal, and they are recovered in all but the smallest data sets. The modifications will provide faster likelihood algorithms, which will allow likelihood methods to be applied to larger sets of taxa and to include more thorough searches of the tree topology space.
ABSTRACT: The accumulation of divergent histone H4 amino acid sequences within and between ciliate lineages challenges traditional views of the evolution of this essential eukaryotic protein. We analyzed histone H4 sequences from 13 species of ciliates and compared these data with sequences from well-sampled eukaryotic clades. Ciliate histone H4s differ from one another at as many as 46% of their amino acids, in contrast with the highly conserved character of this protein in most other eukaryotes. Equally striking, we find paralogs of histone H4 within ciliate genomes that differ by up to 25% of their amino acids, whereas paralogs in other eukaryotes share identical or nearly identical amino acid sequences. Moreover, the most divergent H4 proteins within ciliates are found in the lineages with highly processed macronuclear genomes. Our analyses demonstrate that the dual nature of ciliate genomes (the presence of a "germline" micronucleus and a "somatic" macronucleus within each cell) allowed the dramatic variation in ciliate histone genes by altering functional constraints or enabling adaptive evolution of the histone H4 protein, or both.
Molecular Biology and Evolution 04/2004; 21(3):555-62. · 10.35 Impact Factor
ABSTRACT: Two hundred and sixty maize inbred lines, representative of the genetic diversity among essentially all public lines of importance to temperate breeding and many important tropical and subtropical lines, were assayed for polymorphism at 94 microsatellite loci. The 2039 alleles identified served as raw data for estimating genetic structure and diversity. A model-based clustering analysis placed the inbred lines in five clusters that correspond to major breeding groups plus a set of lines showing evidence of mixed origins. A "phylogenetic" tree was constructed to further assess the genetic structure of maize inbreds, showing good agreement with the pedigree information and the cluster analysis. Tropical and subtropical inbreds possess a greater number of alleles and greater gene diversity than their temperate counterparts. The temperate Stiff Stalk lines are on average the most divergent from all other inbred groups. Comparison of diversity in equivalent samples of inbreds and open-pollinated landraces revealed that maize inbreds capture <80% of the alleles in the landraces, suggesting that landraces can provide additional genetic diversity for maize breeding. The contributions of four different segments of the landrace gene pool to each inbred group's gene pool were estimated using a novel likelihood-based model. The estimates are largely consistent with known histories of the inbreds and indicate that tropical highland germplasm is poorly represented in maize inbreds. Core sets of inbreds that capture maximal allelic richness were defined. These or similar core sets can be used for a variety of genetic applications in maize.
ABSTRACT: PANZEA is the first public database for studying maize genomic diversity. It was initiated as a repository of genomic diversity for an NSF Plant Genome project on 'Maize Evolutionary Genomics'. PANZEA is hosted at the Bioinformatics Research Center, North Carolina State University, and is open to the public (http://statgen.ncsu.edu/panzea). PANZEA is designed to capture the interrelationships between germplasm, molecular diversity, phenotypic diversity and genome structure. It has the ability to store, integrate and visualize DNA sequence, enzymatic, SSR (simple sequence repeat) marker, germplasm and phenotypic data. The relational data model is selected and implemented in Oracle. An automated DNA sequence data submission tool has been created that allows project researchers to remotely submit their DNA sequence data directly to PANZEA. On-line database search forms and reports have been created to allow users to search or download germplasm, DNA sequence, gene/locus data and much more, directly from the web.
Comparative and Functional Genomics 01/2003; 4(2):246-9. · 0.92 Impact Factor
ABSTRACT: Ciliates provide a powerful system to analyze the evolution of duplicated alpha-tubulin genes in the context of single-celled organisms. Genealogical analyses of ciliate alpha-tubulin sequences reveal five apparently recent gene duplications. Comparisons of paralogs in different ciliates implicate differing patterns of substitutions (e.g., ratios of replacement/synonymous nucleotides and radical/conservative amino acids) following duplication. Most substitutions between paralogs in Euplotes crassus, Halteria grandinella and Paramecium tetraurelia are synonymous. In contrast, alpha-tubulin paralogs within Stylonychia lemnae and Chilodonella uncinata are evolving at significantly different rates and have higher ratios of both replacement substitutions to synonymous substitutions and radical amino acid changes to conservative amino acid changes. Moreover, the amino acid substitutions in C. uncinata and S. lemnae paralogs are limited to short stretches that correspond to functionally important regions of the alpha-tubulin protein. The topologies of ciliate alpha-tubulin genealogies are inconsistent with taxonomy based on morphology and other molecular markers, which may be due to taxonomic sampling, gene conversion, unequal rates of evolution, or asymmetric patterns of gene duplication and loss.