ABSTRACT: Grouping large genomic fragments assembled from shotgun metagenomic sequences to deconvolute complex microbial communities, or metagenome binning, enables the study of individual organisms and their interactions. Because of the complex nature of these communities, existing metagenome binning methods often miss a large number of microbial species. In addition, most of the tools are not scalable to large datasets. Here we introduce automated software called MetaBAT that integrates empirical probabilistic distances of genome abundance and tetranucleotide frequency for accurate metagenome binning. MetaBAT outperforms alternative methods in accuracy and computational efficiency on both synthetic and real metagenome datasets. It automatically forms hundreds of high-quality genome bins on a very large assembly consisting of millions of contigs in a matter of hours on a single node. MetaBAT is open source software and available at https://bitbucket.org/berkeleylab/metabat.
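The tetranucleotide frequency signal mentioned in the abstract above can be illustrated with a minimal sketch. This is not MetaBAT's implementation: the function names are hypothetical, and plain Euclidean distance stands in for the empirical probabilistic distances the abstract describes, purely to show how two contigs might be compared by composition.

```python
from collections import Counter
from itertools import product
import math

BASES = "ACGT"
# All 256 possible tetranucleotides in a fixed order, so vectors are comparable.
KMERS = ["".join(p) for p in product(BASES, repeat=4)]

def tetranucleotide_frequencies(seq):
    """Return a 256-element vector of tetranucleotide frequencies for one contig.

    Windows containing anything other than A, C, G, or T (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    valid = [counts.get(k, 0) for k in KMERS]
    total = sum(valid)
    if total == 0:
        return [0.0] * len(KMERS)
    return [c / total for c in valid]

def euclidean_distance(a, b):
    """Illustrative stand-in distance; NOT the probabilistic distance used by MetaBAT."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

if __name__ == "__main__":
    contig_a = "ACGTACGTACGTTTGACCA"  # made-up sequences for illustration only
    contig_b = "ACGTACGTACGTTTGACCG"
    fa = tetranucleotide_frequencies(contig_a)
    fb = tetranucleotide_frequencies(contig_b)
    print(euclidean_distance(fa, fb))
```

In practice a binning tool would combine a composition signal like this with per-sample abundance (coverage) information, as the abstract notes.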
ABSTRACT: Genes in prokaryotic genomes are often arranged into clusters and co-transcribed into polycistronic RNAs. Isolated examples of polycistronic RNAs were also reported in some higher eukaryotes but their presence was generally considered rare. Here we developed a long-read sequencing strategy to identify polycistronic transcripts in several mushroom-forming fungal species including Plicaturopsis crispa, Phanerochaete chrysosporium, Trametes versicolor, and Gloeophyllum trabeum. We found genome-wide prevalence of polycistronic transcription in these Agaricomycetes, involving up to 8% of the transcribed genes. Unlike polycistronic mRNAs in prokaryotes, these co-transcribed genes are also independently transcribed. We show that polycistronic transcription may interfere with expression of the downstream tandem gene. Further comparative genomic analysis indicates that polycistronic transcription is conserved among a wide range of mushroom-forming fungi. In summary, our study revealed, for the first time, the genome-wide prevalence of polycistronic transcription in a phylogenetic range of higher fungi. Furthermore, we systematically show that our long-read sequencing approach and combined bioinformatics pipeline is a generic, powerful tool for precise characterization of complex transcriptomes that enables identification of mRNA isoforms not recovered via short-read assembly.
PLoS ONE 07/2015; 10(7). DOI:10.1371/journal.pone.0132628 · 3.23 Impact Factor
ABSTRACT: PIWI proteins play essential and conserved roles in germline development, including germline stem cell maintenance and meiosis. Because germline regulators such as OCT4, NANOG, and SOX2 are known to be potent factors that reprogram differentiated somatic cells into induced pluripotent stem cells (iPSCs), we investigated whether the PIWI protein family is involved in iPSC production. We find that all three mouse Piwi genes, Miwi, Mili, and Miwi2, are expressed in embryonic stem cells (ESCs) at higher levels than in fibroblasts, with Mili being the highest. However, mice lacking all three Piwi genes are viable and female fertile, and are only male sterile. Furthermore, embryonic fibroblasts derived from Miwi/Mili/Miwi2 triple knockout embryos can be efficiently reprogrammed into iPS cells. These iPS cells expressed pluripotency markers and were capable of differentiating into all three germ layers in teratoma assays. Genome-wide expression profiling reveals that the triple knockout iPS cells are very similar to littermate control iPS cells. These results indicate that PIWI proteins are dispensable for direct reprogramming of mouse fibroblasts.
PLoS ONE 09/2014; 9(9):e97821. DOI:10.1371/journal.pone.0097821 · 3.23 Impact Factor
ABSTRACT: Ruminant livestock represent the single largest anthropogenic source of the potent greenhouse gas methane, which is generated by methanogenic archaea residing in ruminant digestive tracts. While differences between individual animals of the same breed in the amount of methane produced have been observed, the basis for this variation remains to be elucidated. To explore the mechanistic basis of this methane production, we measured methane yields from 22 sheep, which revealed that methane yields are a reproducible quantitative trait. Deep metagenomic and metatranscriptomic sequencing demonstrated the presence of methanogens both in the highest and lowest methane-producing sheep, with a similar abundance of methanogens and methanogenesis pathway genes in high and low methane emitters. However, transcription of methanogenesis pathway genes was substantially increased in sheep with high methane yields. These results identify a discrete set of rumen methanogens whose methanogenesis pathway transcription profiles correlate with methane yields and provide new targets for CH4 mitigation at the levels of microbiota composition and transcriptional regulation.
Genome Research 06/2014; 24(9):1517-25. DOI:10.1101/gr.168245.113 · 14.63 Impact Factor
ABSTRACT: RNA-sequencing (RNA-seq) enables in-depth exploration of transcriptomes, but typical sequencing depth often limits its comprehensiveness. In this study, we generated nearly 3 billion RNA-seq reads, totaling 341 Gb of sequence, from a Zea mays seedling sample. At this depth, a near-complete snapshot of the transcriptome was observed consisting of over 90% of the annotated transcripts, including lowly expressed transcription factors. A novel hybrid strategy combining de novo and reference-based assemblies yielded a transcriptome consisting of 126,708 transcripts with 88% of expressed known genes assembled to full length. We improved current annotations by adding 4,842 previously unannotated transcript variants and many new features, including 212 maize transcripts, 201 genes, 10 genes with undocumented potential roles in seedlings as well as maize lineage specific gene fusion events. We demonstrated the power of deep sequencing for large transcriptome studies by generating a high-quality transcriptome, which provides a rich resource for the research community.
ABSTRACT: By directed evolution in the laboratory, we previously generated populations of Escherichia coli that exhibit a complex new phenotype, extreme resistance to ionizing radiation (IR). The molecular basis of this extremophile phenotype, involving strain isolates with a 3-4 order of magnitude increase in IR resistance at 3000 Gy, is now addressed. Of 69 mutations identified in one of our most highly adapted isolates, functional experiments demonstrate that the IR resistance phenotype is almost entirely accounted for by only three of these nucleotide changes, in the DNA metabolism genes recA, dnaB, and yfjK. Four additional genetic changes make small but measurable contributions. Whereas multiple contributions to IR resistance are evident in this study, our results highlight a particular adaptation mechanism not adequately considered in studies to date: Genetic innovations involving pre-existing DNA repair functions can play a predominant role in the acquisition of an IR resistance phenotype. DOI: http://dx.doi.org/10.7554/eLife.01322.001
ABSTRACT: Improved crop water-use efficiency (WUE) is critical for the long-term sustainability of agricultural production systems in the face of predicted future warmer and drier climates. Crassulacean acid metabolism (CAM) is a specialized mode of photosynthesis that enhances WUE through an inverse day/night pattern of stomatal closure/opening and improves photosynthetic efficiency by concentrating CO2 around RUBISCO. CAM has evolved multiple times from C3 photosynthesis and ~6.5% of higher plant species in more than 35 families have acquired CAM via parallel or convergent evolution. There are two fundamental questions to be answered to understand the molecular basis and evolutionary mechanism of CAM: 1) what are the genetic differences between CAM and non-CAM species and 2) what are the common molecular features shared among CAM plants from diverse origins? To address these questions, comparative genomics analysis was performed using multiple plant species including CAM (e.g., Agave, Kalanchoe, Mesembryanthemum), C3 (e.g., Arabidopsis, Oryza, Populus), C4 (e.g., Setaria, Sorghum, and Zea), and non-vascular plant species (e.g., Physcomitrella, Selaginella). Our analysis not only revealed orthologous gene groups shared between CAM and non-CAM species, but also identified genes specific to the CAM species. Also, expanded gene families were identified in CAM species compared with non-CAM species. Gene ontology and gene expression profiles were used to build hypotheses related to divergent gene functions that likely arose during CAM evolution. This research establishes a framework for CAM comparative genomics studies and provides new knowledge to inform genetic improvement in WUE and photosynthetic efficiency in crop plants under water-limiting conditions.
International Plant and Animal Genome Conference XXII 2014; 01/2014
ABSTRACT: The recent revolution in sequencing technologies has led to an exponential growth of sequence data. As a result, most of the current bioinformatics tools become obsolete as they fail to scale with data. To tackle this "data deluge", here we introduce the BioPig sequence analysis toolkit as one of the solutions that scale to data and computation.
We built BioPig upon Apache's Hadoop MapReduce system and the Pig data flow language. Compared to traditional serial and MPI-based algorithms, BioPig has three major advantages: first, BioPig's programmability greatly reduces development time for parallel bioinformatics applications; second, testing BioPig with up to 500 Gb of sequences demonstrates that it scales automatically with the size of the data; and finally, BioPig can be ported without modification to many Hadoop infrastructures, as tested with the Magellan system at NERSC and the Amazon Elastic Compute Cloud. In summary, BioPig represents a novel program framework with the potential to greatly accelerate data-intensive bioinformatics analysis.
BioPig is released as open source software under the BSD license at https://sites.google.com/a/lbl.gov/biopig/ CONTACT: ZhongWang@lbl.gov.
ABSTRACT: Agaves are succulent monocotyledonous plants native to xeric environments of North America. Because of their adaptations to their environment, including crassulacean acid metabolism (CAM, a water-efficient form of photosynthesis), and existing technologies for ethanol production, agaves have gained attention both as potential lignocellulosic bioenergy feedstocks and models for exploring plant responses to abiotic stress. However, the lack of comprehensive Agave sequence datasets limits the scope of investigations into the molecular-genetic basis of Agave traits.
Here, we present comprehensive, high quality de novo transcriptome assemblies of two Agave species, A. tequilana and A. deserti, built from short-read RNA-seq data. Our analyses support completeness and accuracy of the de novo transcriptome assemblies, with each species having a minimum of approximately 35,000 protein-coding genes. Comparison of agave proteomes to those of additional plant species identifies biological functions of gene families displaying sequence divergence in agave species. Additionally, a focus on the transcriptomics of the A. deserti juvenile leaf confirms evolutionary conservation of monocotyledonous leaf physiology and development along the proximal-distal axis.
Our work presents a comprehensive transcriptome resource for two Agave species and provides insight into their biology and physiology. These resources are a foundation for further investigation of agave biology and their improvement for bioenergy development.
ABSTRACT: The West Nile virus (WNV) is an emerging infection of biodefense concern and there are no available treatments or vaccines. Here we used a high-throughput method based on a novel gene expression analysis, RNA-Seq, to give a global picture of differential gene expression by primary human macrophages of 10 healthy donors infected in vitro with WNV. From a total of 28 million reads per sample, we identified 1,514 transcripts that were differentially expressed after infection. Both predicted and novel gene changes were detected, as were gene isoforms, and while many of the genes were expressed by all donors, some were unique. Knock-down of genes not previously known to be associated with WNV resistance identified their critical role in control of viral infection. Our study distinguishes both common gene pathways as well as novel cellular responses. Such analyses will be valuable for translational studies of susceptible and resistant individuals, and for targeting therapeutics, in multiple biological settings.
ABSTRACT: Motivation:
Researchers need general purpose methods for objectively evaluating the accuracy of single and metagenome assemblies and for automatically detecting any errors they may contain. Current methods do not fully meet this need because they require a reference, only consider one of the many aspects of assembly quality or lack statistical justification, and none are designed to evaluate metagenome assemblies.
In this article, we present an Assembly Likelihood Evaluation (ALE) framework that overcomes these limitations, systematically evaluating the accuracy of an assembly in a reference-independent manner using rigorous statistical methods. This framework is comprehensive, and integrates read quality, mate pair orientation and insert length (for paired-end reads), sequencing coverage, read alignment and k-mer frequency. ALE pinpoints synthetic errors in both single and metagenomic assemblies, including single-base errors, insertions/deletions, genome rearrangements and chimeric assemblies present in metagenomes. At the genome level with real-world data, ALE identifies three large misassemblies from the Spirochaeta smaragdinae finished genome, which were all independently validated by Pacific Biosciences sequencing. At the single-base level with Illumina data, ALE recovers 215 of 222 (97%) single nucleotide variants in a training set from a GC-rich Rhodobacter sphaeroides genome. Using real Pacific Biosciences data, ALE identifies 12 of 12 synthetic errors in a Lambda Phage genome, surpassing even Pacific Biosciences' own variant caller, EviCons. In summary, the ALE framework provides a comprehensive, reference-independent and statistically rigorous measure of single genome and metagenome assembly accuracy, which can be used to identify misassemblies or to optimize the assembly process.
ALE is released as open source software under the UoI/NCSA license at http://www.alescore.org. It is implemented in C and Python.
ABSTRACT: In this work we describe a method to automatically detect errors in de novo assembled genomes. The method extends a Bayesian assembly quality evaluation framework, ALE, which computes the likelihood of an assembly given a set of unassembled data. Starting from ALE output, this method applies outlier detection algorithms to identify the precise locations of assembly errors. We show results from a microbial genome with manually curated assembly errors. Our method detects all deletions, 82.3% of insertions, and 88.8% of single base substitutions. It was also able to detect an inversion error that spans more than 400 bases.
eScience (eScience), 2013 IEEE 9th International Conference on; 01/2013
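The idea of applying outlier detection to per-position assembly scores, as described in the abstract above, can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which outlier detection algorithms were used, so a median/MAD-based modified z-score is swapped in here, and the input is assumed to be a plain list of per-base log-likelihood-style scores rather than actual ALE output. The function name and threshold are hypothetical.

```python
import statistics

def find_low_score_positions(scores, threshold=3.5):
    """Return indices whose score is an unusually LOW outlier.

    Uses the modified z-score 0.6745 * (x - median) / MAD, a common robust
    outlier test. This is an assumed stand-in, not the method from the paper.
    """
    med = statistics.median(scores)
    mad = statistics.median(abs(s - med) for s in scores)
    if mad == 0:
        # All scores (or more than half) are identical; no robust scale available.
        return []
    return [
        i for i, s in enumerate(scores)
        if 0.6745 * (s - med) / mad < -threshold
    ]

if __name__ == "__main__":
    # Made-up per-position scores; position 4 is far below its neighbours.
    example = [-1.0, -1.1, -0.9, -1.0, -9.0, -1.05, -0.95]
    print(find_low_score_positions(example))
```

Only low-scoring positions are flagged, on the assumption that a low likelihood marks a position poorly supported by the reads.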
ABSTRACT: Background
Combinations of histone variants and modifications, conceptually representing a histone code, have been proposed to play a significant role in gene regulation and developmental processes in complex organisms. While various mechanisms have been implicated in establishing and maintaining epigenetic patterns at specific locations in the genome, they are generally believed to be independent of primary DNA sequence on a more global scale.
To address this systematically in the case of the human genome, we have analyzed primary DNA sequences underlying patterns of 19 different methylated histones in human primary T-cells and patterns of three methylated histones across additional human cell lines. We report strong sequence biases associated with most of these histone marks genome-wide in each cell type. Furthermore, the sequence characteristics for such association are distinct for different groups of histone marks.
These findings provide evidence of an influence of genomic sequence on patterns of histone modification associated with gene expression and chromatin programming, and they suggest that the mechanisms responsible for global histone modifications may interpret genomic sequence in various ways.
ABSTRACT: RNA sequencing (RNA-Seq) is rapidly replacing microarrays for profiling gene expression with much improved accuracy and sensitivity. One of the most common questions in a typical gene profiling experiment is how to identify a set of transcripts that are differentially expressed between different experimental conditions. Some of the statistical methods developed for microarray data analysis can be applied to RNA-Seq data with or without modifications. Recently several additional methods have been developed specifically for RNA-Seq data sets. This review attempts to give an in-depth account of these statistical methods, with the goal of providing a comprehensive guide when choosing appropriate metrics for RNA-Seq statistical analyses.
Cell and Bioscience 07/2012; 2(1):26. DOI:10.1186/2045-3701-2-26 · 3.63 Impact Factor
ABSTRACT: Single-molecule sequencing instruments can generate multikilobase sequences with the potential to greatly improve genome and transcriptome assembly. However, the error rates of single-molecule reads are high, which has limited their use thus far to resequencing bacteria. To address this limitation, we introduce a correction algorithm and assembly strategy that uses short, high-fidelity sequences to correct the error in single-molecule sequences. We demonstrate the utility of this approach on reads generated by a PacBio RS instrument from phage, prokaryotic and eukaryotic whole genomes, including the previously unsequenced genome of the parrot Melopsittacus undulatus, as well as for RNA-Seq reads of the corn (Zea mays) transcriptome. Our long-read correction achieves >99.9% base-call accuracy, leading to substantially better assemblies than current sequencing strategies: in the best example, the median contig size was quintupled relative to high-coverage, second-generation assemblies. Greater gains are predicted if read lengths continue to increase, including the prospect of single-contig bacterial chromosome assembly.
ABSTRACT: Gene synthesis is a key step to convert digitally predicted proteins to functional proteins. However, it is a relatively expensive and labor-intensive process. About 30-50% of the synthesized proteins are not soluble, which further reduces the efficacy of gene synthesis as a method for protein function characterization. Solubility prediction from primary protein sequences holds the promise to dramatically reduce the cost of gene synthesis. This work presents a framework that creates models of solubility from sequence information. From the primary protein sequences of the genes to be synthesized, sequence features can be used to build computational models for solubility. This way, biologists can focus the effort on synthesizing genes that are highly likely to generate soluble proteins. We have developed a framework that employs several machine learning algorithms to model protein solubility. The framework is used to predict protein solubility in the Escherichia coli expression system. The analysis is performed on over 1,600 quantified proteins. The approach successfully predicted the solubility with more than 80% accuracy, and enabled in-depth analysis of the most important features affecting solubility. The analysis pipeline is general and can be applied to any set of sequence features to predict any binary measure. The framework also provides the biologist with a comprehensive comparison between different learning algorithms, and insightful feature analysis.
E-Science (e-Science), 2012 IEEE 8th International Conference on; 01/2012
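As a rough illustration of the "sequence features" step described in the solubility abstract above, the sketch below turns a primary protein sequence into a fixed-length numeric vector. The abstract does not list the features actually used, so amino-acid composition is an assumption chosen for simplicity, and the function name is hypothetical.

```python
# The 20 standard amino acids in a fixed order, so every vector has the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_features(protein):
    """Return the fraction of each standard amino acid in the sequence.

    Non-standard symbols (e.g. X or *) are skipped. The result is a
    20-element list suitable as input to a binary classifier.
    """
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in protein.upper():
        if residue in counts:
            counts[residue] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]

if __name__ == "__main__":
    # Made-up peptide for illustration only.
    print(composition_features("MKTAYIAKQR"))
```

Vectors like this, paired with measured soluble/insoluble labels, are the kind of input a machine learning classifier would be trained on; which learning algorithms the framework compares is not specified here.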
ABSTRACT: Transcriptomics studies often rely on partial reference transcriptomes that fail to capture the full catalogue of transcripts and their variations. Recent advances in sequencing technologies and assembly algorithms have facilitated the reconstruction of the entire transcriptome by deep RNA sequencing (RNA-seq), even without a reference genome. However, transcriptome assembly from billions of RNA-seq reads, which are often very short, poses a significant informatics challenge. This Review summarizes the recent developments in transcriptome assembly approaches - reference-based, de novo and combined strategies - along with some perspectives on transcriptome assembly in the near future.