Supertree bootstrapping methods for assessing phylogenetic variation among genes in genome-scale data sets.
ABSTRACT Nonparametric bootstrapping methods may be useful for assessing confidence in a supertree inference. We examined the performance of two supertree bootstrapping methods on four published data sets that each include sequence data from more than 100 genes. In "input tree bootstrapping," input gene trees are sampled with replacement and then combined in replicate supertree analyses; in "stratified bootstrapping," trees from each gene's separate (conventional) bootstrap tree set are sampled randomly with replacement and then combined. Generally, support values from both supertree bootstrap methods were similar to or slightly lower than corresponding bootstrap values from a total evidence, or supermatrix, analysis. Yet supertree bootstrap support also exceeded supermatrix bootstrap support for a number of clades. There was little overall difference in support scores between the input tree and stratified bootstrapping methods. Results from supertree bootstrapping methods, when compared to results from corresponding supermatrix bootstrapping, may provide insights into patterns of variation among genes in genome-scale data sets.
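The two resampling schemes described in the abstract can be sketched roughly as follows. This is only an illustration of the sampling step, under the assumption that each gene tree is held as an opaque object (here a string) and that a stratified replicate draws one tree per gene from that gene's bootstrap set; the function names are hypothetical, and the supertree construction applied to each replicate is not shown.

```python
import random


def input_tree_bootstrap(gene_trees, rng):
    """Resample the input gene trees themselves, with replacement.

    A replicate has as many trees as there are genes, but some genes
    may appear several times and others not at all.
    """
    n = len(gene_trees)
    return [gene_trees[rng.randrange(n)] for _ in range(n)]


def stratified_bootstrap(gene_bootstrap_sets, rng):
    """Draw one tree at random from each gene's own bootstrap tree set.

    Assumption: every gene contributes exactly one tree per replicate,
    so the genes are strata and are not themselves resampled.
    """
    return [rng.choice(trees) for trees in gene_bootstrap_sets]


if __name__ == "__main__":
    rng = random.Random(1)
    print(input_tree_bootstrap(["geneA", "geneB", "geneC"], rng))
    print(stratified_bootstrap([["A1", "A2"], ["B1", "B2"], ["C1", "C2"]], rng))
```

Each replicate list would then be passed to a supertree method, and clade support read off as the fraction of replicate supertrees containing the clade.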
ABSTRACT: Reduced-representation genome sequencing represents a new source of data for systematics, and its potential utility in interspecific phylogeny reconstruction has not yet been explored. One approach that seems especially promising is the use of inexpensive short-read technologies (e.g., Illumina, SOLiD) to sequence restriction-site associated DNA (RAD)--the regions of the genome that flank the recognition sites of restriction enzymes. In this study, we simulated the collection of RAD sequences from sequenced genomes of different taxa (Drosophila, mammals, and yeasts) and developed a proof-of-concept workflow to test whether informative data could be extracted and used to accurately reconstruct "known" phylogenies of species within each group. The workflow consists of three basic steps: first, sequences are clustered by similarity to estimate orthology; second, clusters are filtered by taxonomic coverage; and third, they are aligned and concatenated for "total evidence" phylogenetic analysis. We evaluated the performance of clustering and filtering parameters by comparing the resulting topologies with well-supported reference trees and we were able to identify conditions under which the reference tree was inferred with high support. For Drosophila, whole genome alignments allowed us to directly evaluate which parameters most consistently recovered orthologous sequences. For the parameter ranges explored, we recovered the best results at the low ends of sequence similarity and taxonomic representation of loci; these generated the largest supermatrices with the highest proportion of missing data. Applications of the method to mammals and yeasts were less successful, which we suggest may be due partly to their much deeper evolutionary divergence times compared to Drosophila (crown ages of approximately 100 and 300 versus 60 Mya, respectively). 
RAD sequences thus appear to hold promise for reconstructing phylogenetic relationships in younger clades in which sufficient numbers of orthologous restriction sites are retained across species.
PLoS ONE 01/2012; 7(4):e33394. · 4.09 Impact Factor
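The second and third steps of the workflow in the abstract above (filtering clusters by taxonomic coverage, then concatenating them into a supermatrix) might look like the sketch below. It assumes each cluster is already aligned and stored as a dictionary from taxon name to sequence, and that absent taxa are padded with "?"; these data structures and function names are illustrative assumptions, not the authors' implementation, and the similarity clustering step is omitted.

```python
def filter_by_coverage(clusters, min_taxa):
    """Keep clusters that contain sequences from at least min_taxa taxa."""
    return [c for c in clusters if len(c) >= min_taxa]


def concatenate(clusters, taxa, missing="?"):
    """Concatenate aligned clusters into one supermatrix row per taxon.

    Taxa absent from a cluster are padded with the missing-data symbol,
    so all rows end up with the same length.
    """
    rows = {taxon: [] for taxon in taxa}
    for cluster in clusters:
        # All sequences in an aligned cluster share one length.
        width = len(next(iter(cluster.values())))
        for taxon in taxa:
            rows[taxon].append(cluster.get(taxon, missing * width))
    return {taxon: "".join(parts) for taxon, parts in rows.items()}


if __name__ == "__main__":
    loci = [
        {"sp1": "ACGT", "sp2": "ACGA", "sp3": "ACGG"},
        {"sp1": "TT", "sp3": "TA"},
        {"sp2": "GGG"},
    ]
    kept = filter_by_coverage(loci, min_taxa=2)
    print(concatenate(kept, ["sp1", "sp2", "sp3"]))
```

Lowering `min_taxa` keeps more loci but adds more padding, which matches the abstract's observation that the largest supermatrices also had the highest proportion of missing data.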
Article: Robinson-Foulds supertrees.
ABSTRACT: Supertree methods synthesize collections of small phylogenetic trees with incomplete taxon overlap into comprehensive trees, or supertrees, that include all taxa found in the input trees. Supertree methods based on the well-established Robinson-Foulds (RF) distance have the potential to build supertrees that retain much information from the input trees. Specifically, the RF supertree problem seeks a binary supertree that minimizes the sum of the RF distances from the supertree to the input trees. Thus, an RF supertree is a supertree that is consistent with the largest number of clusters (or clades) from the input trees. We introduce efficient, local search based, hill-climbing heuristics for the intrinsically hard RF supertree problem on rooted trees. These heuristics use novel non-trivial algorithms for the SPR and TBR local search problems which improve on the time complexity of the best known (naïve) solutions by a factor of Theta(n) and Theta(n^2) respectively (where n is the number of taxa, or leaves, in the supertree). We use an implementation of our new algorithms to examine the performance of the RF supertree method and compare it to matrix representation with parsimony (MRP) and the triplet supertree method using four supertree data sets. Not only did our RF heuristic provide fast estimates of RF supertrees in all data sets, but the RF supertrees also retained more of the information from the input trees (based on the RF distance) than the other supertree methods. Our heuristics for the RF supertree problem, based on our new local search algorithms, make it possible for the first time to estimate large supertrees by directly optimizing the RF distance from rooted input trees to the supertrees. This provides a new and fast method to build accurate supertrees. RF supertrees may also be useful for estimating majority-rule(-) supertrees, which are a generalization of majority-rule consensus trees.
Algorithms for Molecular Biology 02/2010; 5:18. · 1.35 Impact Factor
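The objective that the RF supertree problem minimizes can be illustrated with a small sketch. It assumes rooted trees are written as nested tuples of leaf names, and that the supertree is restricted to each input tree's taxa before clusters are compared (a common convention that the abstract does not spell out). The helper names are hypothetical, and this only scores a candidate supertree; it is not the SPR or TBR search heuristic the abstract describes.

```python
def leaf_set(tree):
    """Return the set of leaf names of a rooted tree given as nested tuples."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaf_set(child) for child in tree))


def clusters(tree, keep=None):
    """Return the non-trivial clusters of a tree, restricted to leaves in keep.

    A cluster is the leaf set below an internal node. Single-leaf clusters
    and the cluster of all remaining leaves are excluded as trivial.
    """
    found = set()

    def walk(node):
        if isinstance(node, str):
            if keep is None or node in keep:
                return frozenset([node])
            return frozenset()
        below = frozenset().union(*(walk(child) for child in node))
        if len(below) > 1:
            found.add(below)
        return below

    found.discard(walk(tree))
    return found


def rf_distance(supertree, input_tree):
    """Count clusters found in exactly one of the two trees."""
    taxa = leaf_set(input_tree)
    return len(clusters(supertree, keep=taxa) ^ clusters(input_tree))


def rf_score(supertree, input_trees):
    """Sum of RF distances from the supertree to every input tree."""
    return sum(rf_distance(supertree, t) for t in input_trees)


if __name__ == "__main__":
    candidate = ((("A", "B"), "C"), ("D", "E"))
    inputs = [(("A", "B"), "C"), (("A", "C"), "B"), (("A", "D"), "E")]
    print(rf_score(candidate, inputs))
```

A hill-climbing search would repeatedly rearrange the candidate supertree and keep any rearrangement that lowers `rf_score`.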
Article: Genome sequences of wild and domestic bactrian camels.
The Bactrian Camels Genome Sequencing and Analysis Consortium
ABSTRACT: Bactrian camels serve as an important means of transportation in the cold desert regions of China and Mongolia. Here we present a 2.01 Gb draft genome sequence from both a wild and a domestic bactrian camel. We estimate the camel genome to be 2.38 Gb, containing 20,821 protein-coding genes. Our phylogenomics analysis reveals that camels shared common ancestors with other even-toed ungulates about 55–60 million years ago. Rapidly evolving genes in the camel lineage are significantly enriched in metabolic pathways, and these changes may underlie the insulin resistance typically observed in these animals. We estimate the genome-wide heterozygosity rates in both wild and domestic camels to be 1.0 × 10⁻³. However, genomic regions with significantly lower heterozygosity are found in the domestic camel, and olfactory receptors are enriched in these regions. Our comparative genomics analyses may also shed light on the genetic basis of the camel's remarkable salt tolerance and unusual immune system.
Wild bactrian camels (Camelus bactrianus ferus) are the lone survivors of the old world camels 1 . At present, their total number is only 730–880, less than that of the giant pandas 2 . They live in northwestern China and southwestern Mongolia, especially the Outer Altai Gobi Desert. Considered critically endangered by the International Union for Conservation of Nature, wild bactrian camels are protected under both the Convention on International Trade in Endangered Species of Wild Fauna and Flora and domestic legislation in China and Mongolia. The archaeozoological record shows that fully domesticated bactrian camels were present in the third millennium BC and subsequently spread into much of Central Asia 3 . However, our knowledge about the origins and migration history of domestic camels remains inconclusive.
To adapt to the harsh conditions (cold, hot, arid, and poor grazing) of deserts or semi-deserts, camels have acquired many special abilities and attributes. They can store energy in their humps and abdomen in the form of fat, enabling them to survive long periods without any food or water 4 . The camel's body temperature may vary from 34 to 41 °C throughout the day 5 . Blood glucose levels in camels are twice those of other ruminants 6 . Camels tolerate a high dietary intake of salt, consuming eight times more than cattle and sheep 7 , yet they do not develop diabetes or hypertension. The Camelidae family are the only mammals that can produce heavy-chain antibodies (HCAbs), a special form of immunoglobulin that lacks the light chain, in contrast to conventional antibodies (Abs) 8 . HCAbs are smaller and more stable, offering particular advantages in various medical and biotechnological applications. In this study, we sequenced the genomes of both wild and domestic bactrian camel, to better understand the history of their evolution and domestication, and to provide a resource for research into the genetic mechanisms that enable camels to survive extreme environments.
Results
Genome sequence. We sequenced the genomes of an 8-year-old wild male bactrian camel named 'Naran' from the wild bactrian camel nature reserve of Altai province, Mongolia ('wild camel' hereafter, Supplementary Fig. S1) and a 6-year-old male Alashan bactrian camel from Inner Mongolia, China ('domestic camel' hereafter, Supplementary Fig. S2). For the wild camel genome, four paired-end/mate-pair sequencing libraries were constructed with insert sizes of 500 bp, 3 kb, 10 kb and 20 kb. For the domestic camel genome, only libraries with the shorter insert size of 500 bp were constructed (Supplementary Table S2). We assembled the short reads obtained from the wild camel genome sequencing using SOAPdenovo 9 . The reads with the insert size of 500 bp were first assembled into contigs.
Then the contigs were joined into scaffolds with reads from the shortest to the longest insert size. In total, we obtained 120,352 scaffolds, including 13,544 scaffolds longer than 1 kb and 3,453 longer than 10 kb. The N50 length of the scaffolds longer than 1 kb is 2.00 Mb (Table 1). We remapped the usable reads to the scaffolds and obtained an average effective depth of 76× and 24× for the wild and the domestic camel genomes, respectively (Supplementary Table S3). Using the frequency distribution of 17-mers in the reads (Supplementary Fig. S3), we estimated the camel genome size to be 2.38 Gb. This is close to the camel genome size (2.02–2.40 Gb) calculated based on haploid DNA contents (C values) (Supplementary Table S4). The genome sequences show that 34% of the bactrian camel genome is repetitive DNA (Supplementary Table S5). This percentage is lower than that in human (>50%) 10 , horse (46%) 11 or cattle (48%) 12 , but close to that in mouse (35%) 13 or dog (34%) 14 . Most of the repetitive DNAs are transposon-derived
Nature Communications 11/2012; · 7.40 Impact Factor
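For readers unfamiliar with the N50 statistic quoted above, a minimal sketch of the usual definition follows: the length L such that scaffolds of length L or longer together hold at least half of the assembled bases. The function names and the 1 kb cutoff parameter are illustrative assumptions; the consortium's exact procedure is not given in the text.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no scaffold lengths given")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


def n50_above(lengths, min_length):
    """N50 computed only over scaffolds longer than min_length."""
    return n50([x for x in lengths if x > min_length])


if __name__ == "__main__":
    toy_scaffolds = [500, 800, 2000, 3000, 4000, 5000, 6000]
    print(n50_above(toy_scaffolds, min_length=1000))
```

Restricting the calculation to scaffolds longer than 1 kb, as the text does, prevents the many very short scaffolds from affecting the statistic.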