-
ABSTRACT: We have searched the cloned 86-kilobase-pair plasmid pHV1 from Haloferax volcanii for repeated sequence elements, of which we expected it to be a rich source. It contains five copies of the previously characterized element ISH51 and a total of five copies of three uncharacterized elements. pHV1 is part of an AT-rich fraction of the DNA that is likely to be a preferred site for IS insertion. Key words: Haloferax volcanii, pHV1 repeated sequence elements, ISH51.
Canadian Journal of Microbiology 02/2011; 39(2):201-206. · 1.36 Impact Factor
-
ABSTRACT: Genome phylogenies are used to build tree-like representations of evolutionary relationships among genomes. However, in condensing the phylogenetic signals within a set of genomes down to a single tree, these methods generally do not explicitly take into account discordant signals arising due to lateral genetic transfer. Because conflicting vertical and horizontal signals can produce compromise trees that do not reflect either type of history, it is essential to understand the sensitivity of inferred genome phylogenies to these confounding effects. Using replicated simulations of genome evolution, we show that different scenarios of lateral genetic transfer have significant impacts on the ability to recover the "true" tree of genomes, even when corrections for phylogenetically discordant signals are used.
Systematic Biology 01/2009; 57(6):844-56. · 10.23 Impact Factor
-
ABSTRACT: The integron/gene cassette systems identified in bacteria comprise a class of genetic elements that allow adaptation by acquisition of gene cassettes. Integron gene cassettes have been shown to facilitate the spread of drug resistance in human pathogens but their role outside a clinical setting has not been explored extensively. We sequenced 2145 integron gene cassettes from four marine sediment samples taken from the vicinity of Halifax, Nova Scotia, Canada, increasing the number of gene cassettes obtained from environmental microbial communities by 10-fold. Sequence analyses reveal that the majority of these cassettes encode novel proteins, and our results are consistent with previous claims of high cassette diversity: we estimate a Chao1 diversity index of approximately 3000 cassettes from these samples. The functional distribution of environmental cassettes recovered in this study, when compared with that of cassettes from the only other source with significant sampling (Vibrio genomes), suggests that alternate selection regimes might be acting on these two gene pools. The majority of cassettes recovered in this study encode novel, unknown proteins. In instances where we obtained multiple alleles of a novel protein we demonstrate that the ratios of non-synonymous to synonymous substitution rates suggest relaxed selection. Cassette-encoded proteins with known homologues represent a variety of functions and prevalent among these are isochorismatases, proteins involved in iron scavenging. Phylogenetic analysis of these isochorismatases as well as of cassette-encoded acetyltransferases reveals a patchy distribution, suggesting multiple sources for the origin of these cassettes. Finally, the two most environmentally similar sample sites considered in this study display the greatest overlap of cassette types, consistent with the hypothesis that cassette genes encode adaptive proteins.
Environmental Microbiology 05/2008; 10(4):1024-38. · 5.84 Impact Factor
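The Chao1 richness estimate mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the bias-corrected form of the estimator, S_obs + f1(f1 - 1) / (2(f2 + 1)), where f1 and f2 are the numbers of cassette types observed exactly once and exactly twice. The abstract does not say which form the authors used, and the function name and toy cassette labels here are hypothetical.

```python
from collections import Counter


def chao1(abundances):
    """Bias-corrected Chao1 richness estimate from per-type observation counts.

    Assumption: the bias-corrected form is used; the abstract does not
    specify which variant of the estimator the authors applied.
    """
    counts = [n for n in abundances if n > 0]
    s_obs = len(counts)                      # number of distinct types observed
    f1 = sum(1 for n in counts if n == 1)    # types seen exactly once
    f2 = sum(1 for n in counts if n == 2)    # types seen exactly twice
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Toy data (hypothetical labels): each entry is one sequenced cassette.
observed = ["cassette_a", "cassette_a", "cassette_b",
            "cassette_c", "cassette_c", "cassette_d"]
estimate = chao1(Counter(observed).values())
print(f"Estimated richness: {estimate:.2f}")
```

Many singletons relative to doubletons push the estimate well above the observed count, which is how a sample of a few thousand cassettes can imply a larger underlying pool.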
-
ABSTRACT: Microbial genomes undergo evolutionary processes such as gene family expansion and contraction, variable rates and patterns of sequence substitution and lateral genetic transfer. Simulation tools are essential for both the generation of data under different evolutionary models and the validation of analytical methods on such data. However, meaningful investigation of phenomena such as lateral genetic transfer requires the simultaneous consideration of many underlying evolutionary processes.
We have developed EvolSimulator, a software package that combines non-stationary sequence and gene family evolution together with models of lateral genetic transfer, within a customizable birth-death model of speciation and extinction. Here, we examine simulated data sets generated with EvolSimulator using existing statistical techniques from the evolutionary literature, showing in detail each component of the simulation strategy.
Source code, manual and other information are freely available at www.bioinformatics.org.au/evolsim.
Supplementary data are available at Bioinformatics online.
Bioinformatics 05/2007; 23(7):825-31. · 5.47 Impact Factor
-
ABSTRACT: Using 1128 protein-coding gene families from 11 completely sequenced cyanobacterial genomes, we attempt to quantify horizontal gene transfer events within cyanobacteria, as well as between cyanobacteria and other phyla. A novel method of detecting and enumerating potential horizontal gene transfer events within a group of organisms based on analyses of "embedded quartets" allows us to identify phylogenetic signal consistent with a plurality of gene families, as well as to delineate cases of conflict to the plurality signal, which include horizontally transferred genes. To infer horizontal gene transfer events between cyanobacteria and other phyla, we added homologs from 168 available genomes. We screened phylogenetic trees reconstructed for each of these extended gene families for highly supported monophyly of cyanobacteria (or lack of it). Cyanobacterial genomes reveal a complex evolutionary history, which cannot be represented by a single strictly bifurcating tree for all genes or even most genes, although a single completely resolved phylogeny was recovered from the quartets' plurality signals. We find more conflicts within cyanobacteria than between cyanobacteria and other phyla. We also find that genes from all functional categories are subject to transfer. However, in interphylum as compared to intraphylum transfers, the proportion of metabolic (operational) gene transfers increases, while the proportion of informational gene transfers decreases.
Genome Research 10/2006; 16(9):1099-108. · 13.61 Impact Factor
-
ABSTRACT: The recently sequenced genome of the predatory delta-proteobacterium Bdellovibrio bacteriovorus provides many insights into its metabolism and evolution. Because its genes are reasonably uniform in G+C content, it was suggested that B. bacteriovorus actively resists recombination with foreign DNA and horizontal transfer of DNA from other bacteria. To investigate this further, we carried out a variety of phylogenetic and comparative genomics analyses using data from >200 microbial genomes, including several published delta-proteobacteria. Although there might be little evidence for the extensive recent transfer of genes, we demonstrate that ancient lateral gene acquisition has shaped the B. bacteriovorus genome to a great extent.
Trends in Microbiology 03/2006; 14(2):64-9. · 7.91 Impact Factor
-
ABSTRACT: There are many ways to group completed genome sequences in hierarchical patterns (trees) reflecting relationships between their genes. Such groupings help us organize biological information and bear crucially on underlying processes of genome and organismal evolution. Genome trees make use of all comparable genes but can variously weight the contributions of these genes according to similarity, congruent patterns of similarity, or prevalence among genomes. Here we explore such possible weighting strategies, in an analysis of 142 prokaryotic and 5 eukaryotic genomes. We demonstrate that alternate weighting strategies have different advantages, and we propose that each may have its specific uses in systematic or evolutionary biology. Comparisons of results obtained with different methods can provide further clues to major events and processes in genome evolution.
Journal of Bacteriology 03/2005; 187(4):1305-16. · 3.83 Impact Factor
-
ABSTRACT: When organismal phylogenies based on sequences of single marker genes are poorly resolved, a logical approach is to add more markers, on the assumption that weak but congruent phylogenetic signal will be reinforced in such multigene trees. Such approaches are valid only when the several markers indeed have identical phylogenies, an issue which many multigene methods (such as the use of concatenated gene sequences or the assembly of supertrees) do not directly address. Indeed, even when the true history is a mixture of vertical descent for some genes and lateral gene transfer (LGT) for others, such methods produce unique topologies.
We have developed software that aims to extract evidence for vertical and lateral inheritance from a set of gene trees compared against an arbitrary reference tree. This evidence is then displayed as a synthesis showing support over the tree for vertical inheritance, overlaid with explicit lateral gene transfer (LGT) events inferred to have occurred over the history of the tree. Like splits-tree methods, one can thus identify nodes at which conflict occurs. Additionally one can make reasonable inferences about vertical and lateral signal, assigning putative donors and recipients.
A tool such as ours can serve to explore the reticulated dimensionality of molecular evolution, by dissecting vertical and lateral inheritance at high resolution. By this, we mean that individual nodes can be examined not only for congruence, but also for coherence in light of LGT. We assert that our tools will facilitate the comparison of phylogenetic trees, and the interpretation of conflicting data.
BMC Evolutionary Biology 02/2005; 5:27. · 3.52 Impact Factor
-
ABSTRACT: The origin of the nuclear compartment has been extensively debated, leading to several alternative views on the evolution of the eukaryotic nucleus. Until recently, too little phylogenetic information was available to address this issue by using multiple characters for many lineages.
We analyzed 65 proteins integral to or associated with the nuclear pore complex (NPC), including all the identified nucleoporins, the components of their anchoring system and some of their main partners. We used reconstruction of ancestral sequences of these proteins to expand the detection of homologs, and showed that the majority of them, present all over the nuclear pore structure, share homologs in all extant eukaryotic lineages. The anchoring system, by contrast, is analogous between the different eukaryotic lineages and is thus a relatively recent innovation. We also showed the existence of high heterogeneity of evolutionary rates between these proteins, as well as between and within lineages. We show that the ubiquitous genes of the nuclear pore structure are not strongly conserved at the sequence level, and that only their domains are relatively well preserved.
We propose that an NPC very similar to the extant one was already present in at least the last common ancestor of all extant eukaryotes and it would not have undergone major changes since its early origin. Importantly, we observe that sequences and structures obey two very different tempos of evolution. We suggest that, despite strong constraints that froze the structural evolution of the nuclear pore, the NPC is still highly adaptive, modern, and flexible at the sequence level.
Genome biology 02/2005; 6(10):R85. · 6.63 Impact Factor
-
ABSTRACT: The multitude of motif detection algorithms developed to date have largely focused on the detection of patterns in primary sequence. Since sequence-dependent DNA structure and flexibility may also play a role in protein-DNA interactions, the simultaneous exploration of sequence- and structure-based hypotheses about the composition of binding sites and the ordering of features in a regulatory region should be considered as well. The consideration of structural features requires the development of new detection tools that can deal with data types other than primary sequence.
GANN (available at http://bioinformatics.org.au/gann) is a machine learning tool for the detection of conserved features in DNA. The software suite contains programs to extract different regions of genomic DNA from flat files and convert these sequences to indices that reflect sequence and structural composition or the presence of specific protein binding sites. The machine learning component allows the classification of different types of sequences based on subsamples of these indices, and can identify the best combinations of indices and machine learning architecture for sequence discrimination. Another key feature of GANN is the replicated splitting of data into training and test sets, and the implementation of negative controls. In validation experiments, GANN successfully merged important sequence and structural features to yield good predictive models for synthetic and real regulatory regions.
GANN is a flexible tool that can search through large sets of sequence and structural feature combinations to identify those that best characterize a set of sequences.
BMC Bioinformatics 02/2005; 6:36. · 2.75 Impact Factor
-
ABSTRACT: The genomic core concept has found several uses in comparative and evolutionary genomics. Defined as the set of all genes common to (ubiquitous among) all genomes in a phylogenetically coherent group, core size decreases as the number and phylogenetic diversity of the relevant group increases. Here, we focus on methods for defining the size and composition of the core of all genes shared by sequenced genomes of prokaryotes (Bacteria and Archaea). There are few (almost certainly fewer than 50) genes shared by all of the 147 genomes compared, surely insufficient to conduct all essential functions. Sequencing and annotation errors are responsible for the apparent absence of some genes, while very limited but genuine disappearances (from just one or a few genomes) can account for several others. Core size will continue to decrease as more genome sequences appear, unless the requirement for ubiquity is relaxed. Such relaxation seems consistent with any reasonable biological purpose for seeking a core, but it makes the problem of definition harder. We propose an alternative approach (the phylogenetically balanced core), which preserves some of the biological utility of the core concept. Cores, however delimited, preferentially contain informational rather than operational genes; we present a new hypothesis for why this might be so.
Genome Research 01/2005; 14(12):2469-77. · 13.61 Impact Factor
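The definition of the core in the abstract above (genes common to all genomes in a group) amounts to a set intersection, and relaxing the ubiquity requirement amounts to a presence threshold. The sketch below illustrates only those two ideas on toy data. The function names, gene labels, and the `min_fraction` parameter are hypothetical, and the relaxed version is a simple fraction cutoff, not the phylogenetically balanced core the authors propose.

```python
from collections import Counter


def strict_core(genomes):
    """Genes present in every genome (the ubiquity requirement)."""
    gene_sets = [set(genes) for genes in genomes.values()]
    return set.intersection(*gene_sets) if gene_sets else set()


def relaxed_core(genomes, min_fraction):
    """Genes present in at least min_fraction of genomes.

    Assumption: a plain fraction cutoff, used here for illustration only.
    """
    counts = Counter(gene for genes in genomes.values() for gene in set(genes))
    needed = min_fraction * len(genomes)
    return {gene for gene, n in counts.items() if n >= needed}


# Toy gene-content table (hypothetical genomes and gene assignments).
toy_genomes = {
    "genome_1": {"rpoB", "recA", "gyrA"},
    "genome_2": {"rpoB", "recA"},
    "genome_3": {"rpoB", "gyrA", "nifH"},
}
core_all = strict_core(toy_genomes)
core_relaxed = relaxed_core(toy_genomes, 0.6)
print(sorted(core_all), sorted(core_relaxed))
```

Adding a genome can only shrink or preserve the strict core, which is the behaviour the abstract describes as more sequences appear.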
-
Trends in Microbiology 06/2004; 12(5):213-9. · 7.91 Impact Factor
-
ABSTRACT: We describe a query-based web-accessible system (www.neurogadgets.com/bws.php) for facilitating comparative microbial genomics. A variety of query pages are available, each with numerous options, that allow a biologist to pose relevant questions of genomic data. We illustrate with a characterization of species-specific protein-coding genes (so-called "ORFans"), finding that they are on average smaller, faster evolving, and less G+C-rich, and that they encode proteins more basic in their predicted isoelectric point, compared with non-species-specific genes. Using a dual-threshold approach, we conclude that these are characteristics of true species-specific genes, rather than artifacts of mis-annotation.
FEMS Microbiology Letters 09/2003; 225(2):213-20. · 2.04 Impact Factor
-
Nature 02/2003; 421(6920):217. · 36.28 Impact Factor
-
ABSTRACT: If open reading frames (ORFs) have been transmitted primarily by vertical descent, the distributional profile of orthologues of each ORF should be congruent with the organismal tree or a subtree thereof. Distributional patterns not reconciled parsimoniously with tree-like descent and loss are prima facie evidence of lateral gene transfer. Herein, a rigorous criterion for recognizing ORF distributions is described and implemented; it does not require the inference of phylogenetic trees, nor does it assume any specific tree. Because lineage-specific differences in rates of sequence change can also generate unexpected distributional patterns, rate artefacts were controlled for by requiring pairwise matches between ORFs to exceed a rigorous inclusion threshold, but absence of a match was assessed against a more-permissive exclusion threshold. Applying this dual-threshold criterion to cross-domain and cross-phylum distributional patterns for ORFs in 23 bacterial genomes, a relative abundance of ORFs was observed that find a match in exactly seven other bacterial phyla; 94-99% of these ORFs also find matches among the Archaea and/or Eukarya. In the larger (and some smaller) bacterial genomes, ORFs that find matches in exactly one other bacterial phylum are also relatively abundant, but fewer of these have non-bacterial homologues; most of their matches within the Bacteria are to the Proteobacteria and/or Firmicutes, which cannot be sister lineages to all bacteria. ORFs that are neither distributed universally among the Bacteria, nor necessarily shared with topologically adjacent lineages, are preferentially enriched in large bacterial genomes.
International journal of systematic and evolutionary microbiology 06/2002; 52(Pt 3):777-87. · 2.27 Impact Factor
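The dual-threshold criterion described in the abstract above can be sketched as a three-way classification: a match counts as present only if it exceeds a rigorous inclusion threshold, and as absent only if it falls below a more permissive exclusion threshold. The sketch assumes a generic similarity score where higher means a stronger match. The abstract does not state the score type or cutoff values, so the numbers, lineage scores, and function names below are hypothetical.

```python
def classify_match(score, inclusion, exclusion):
    """Classify a best-match score using two thresholds.

    Assumption: higher scores mean stronger matches, and the strict
    inclusion threshold is at least as high as the exclusion threshold.
    """
    if inclusion < exclusion:
        raise ValueError("inclusion threshold must be >= exclusion threshold")
    if score >= inclusion:
        return "present"      # confident match
    if score < exclusion:
        return "absent"       # confident absence
    return "ambiguous"        # between thresholds: counted as neither


def distribution_profile(best_scores, inclusion, exclusion):
    """Map each lineage to present / absent / ambiguous for one ORF."""
    return {lineage: classify_match(score, inclusion, exclusion)
            for lineage, score in best_scores.items()}


# Toy best-match scores for a single ORF (hypothetical values).
toy_scores = {"Proteobacteria": 310.0, "Firmicutes": 75.0, "Archaea": 12.0}
profile = distribution_profile(toy_scores, inclusion=200.0, exclusion=50.0)
print(profile)
```

Leaving the middle band unclassified is what keeps lineage-specific rate differences from being read as genuine presence or absence.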
-
ABSTRACT: Darwin's paradigm holds that the diversity of present-day organisms has arisen via a process of genetic descent with modification, as on a bifurcating tree. Evidence is accumulating that genes are sometimes transferred not along lineages but rather across lineages. To the extent that this is so, Darwin's paradigm can apply only imperfectly to genomes, potentially complicating or perhaps undermining attempts to reconstruct historical relationships among genomes (i.e., a genome tree). Whether most genes in a genome have arisen via treelike (vertical) descent or by lateral transfer across lineages can be tested if enough complete genome sequences are used. We define a phylogenetically discordant sequence (PDS) as an open reading frame (ORF) that exhibits patterns of similarity relationships statistically distinguishable from those of most other ORFs in the same genome. PDSs represent between 6.0 and 16.8% (mean, 10.8%) of the analyzable ORFs in the genomes of 28 bacteria, eight archaea, and one eukaryote (Saccharomyces cerevisiae). In this study we developed and assessed a distance-based approach, based on mean pairwise sequence similarity, for generating genome trees. Exclusion of PDSs improved bootstrap support for basal nodes but altered few topological features, indicating that there is little systematic bias among PDSs. Many but not all features of the genome tree from which PDSs were excluded are consistent with the 16S rRNA tree.
Journal of Bacteriology 05/2002; 184(8):2072-80. · 3.83 Impact Factor
-
Robert L Charlebois,
Rama K. Singh,
Christina C.-Y. Chan-Weiher,
Ghislaine Allard,
Cynthia Chow,
Fabrice Confalonieri,
Bruce Curtis,
Michel Duguet,
Gael Erauso,
David Faguy, [......],
Xu Peng,
Susanne L. Penny,
Qunxin She,
Andrew St. Jean,
John van der Oost,
Felix Young,
Yvan Zivanovic,
W Ford Doolittle,
Mark A Ragan,
Christoph W Sensen
ABSTRACT: The sequence of a 281-kbp contig from the crenarchaeote Sulfolobus solfataricus P2 was determined and analysed. Notable features in this region include 29 ribosomal protein genes, 12 tRNA genes (four of which contain archaeal-type introns), operons encoding enzymes of histidine biosynthesis, pyrimidine biosynthesis, and arginine biosynthesis, an ATPase operon, numerous genes for enzymes of lipopolysaccharide biosynthesis, and six insertion sequences. The content and organization of this contig are compared with sequences from crenarchaeotes, euryarchaeotes, bacteria, and eukaryotes.
04/2000;
-
Christoph W Sensen, Robert L Charlebois,
Cynthia Chow,
Ib Groth Clausen,
Bruce Curtis,
W. Ford Doolittle,
Michel Duguet,
Gael Erauso,
Terry Gaasterland,
Roger A Garrett, [......],
C. Kozera,
M E Schenk,
Bh H,
I.G. Clausen,
N Tolstrup,
M Duguet,
N Medina,
R A Garrett,
H Phan,
Q She
ABSTRACT: The Sulfolobus solfataricus P2 genome collaborators are poised to sequence the entire 3-Mbp genome of this crenarchaeote archaeon. About 80% of the genome has been sequenced to date, with the rest of the sequence being assembled fast. In this publication we introduce the genomic sequencing and automated analysis strategy and present initial data derived from the sequence analysis. After an overview of the general sequence features, metabolic pathway studies are explained, using sugar metabolism as an example. The paper closes with an overview of repetitive elements in S. solfataricus.
12/1999;
-
C. W. Sensen, Robert L. Charlebois,
Cynthia Chow,
Ib Groth Clausen,
Bruce Curtis,
W. Ford Doolittle,
Michel Duguet,
Gael Erauso,
Terry Gaasterland,
Roger A. Garrett, [......],
Catherine Kozera,
Nadine Medina,
Anick De Moors,
John van der Oost,
Hien Phan,
Mark A. Ragan,
Margaret E. Schenk,
Qunxin She,
Rama K. Singh,
Niels Tolstrup
ABSTRACT: The Sulfolobus solfataricus P2 genome collaborators are poised to sequence the entire 3-Mbp genome of this crenarchaeote archaeon. About 80% of the genome has been sequenced to date, with the rest of the sequence being assembled fast. In this publication we introduce the genomic sequencing and automated analysis strategy and present initial data derived from the sequence analysis. After an overview of the general sequence features, metabolic pathway studies are explained, using sugar metabolism as an example. The paper closes with an overview of repetitive elements in S. solfataricus.
Extremophiles 01/1998; 2(3):305-312. · 2.94 Impact Factor
-
ABSTRACT: Over 800 kbp of the 3-Mbp genome of Sulfolobus solfataricus have been sequenced to date. Our approach is to sequence subclones of mapped cosmids, followed by sequencing directly on cosmid templates with custom primers. Using a prototype automated system for genome-scale analysis, known as MAGPIE, along with other tools, we have discovered one open reading frame of at least 100 amino acids per kbp of sequence, and have been able to associate 50% of these with known genes through database searches. An examination of completely sequenced cosmids suggests a clustering of genes by function in the S. solfataricus genome.
FEBS Letters 07/1996; · 3.54 Impact Factor