Zhong Wang

DOE Joint Genome Institute, Walnut Creek, California, United States

Publications (23), 255.15 Total impact

  • ABSTRACT: Ruminant livestock represent the single largest anthropogenic source of the potent greenhouse gas methane, which is generated by methanogenic archaea residing in ruminant digestive tracts. While differences between individual animals of the same breed in the amount of methane produced have been observed, the basis for this variation remains to be elucidated. To explore the mechanistic basis of this methane production, we measured methane yields from 22 sheep, which revealed that methane yields are a reproducible quantitative trait. Deep metagenomic and metatranscriptomic sequencing demonstrated the presence of methanogens both in the highest and lowest methane-producing sheep, with a similar abundance of methanogens and methanogenesis pathway genes in high and low methane emitters. However, transcription of methanogenesis pathway genes was substantially increased in sheep with high methane yields. These results identify a discrete set of rumen methanogens whose methanogenesis pathway transcription profiles correlate with methane yields and provide new targets for CH4 mitigation at the levels of microbiota composition and transcriptional regulation.
    Genome research. 06/2014;
  • ABSTRACT: Although recent nucleotide sequencing technologies have significantly enhanced our understanding of microbial genomes, the function of ∼35% of genes identified in a genome currently remains unknown. To improve the understanding of microbial genomes and consequently of microbial processes it will be crucial to assign a function to this “genomic dark matter.” Due to the urgent need for additional carbohydrate-active enzymes for improved production of transportation fuels from lignocellulosic biomass, we screened the genomes of more than 5,500 microorganisms for hypothetical proteins that are located in the proximity of already known cellulases. We identified, synthesized and expressed a total of 17 putative cellulase genes with insufficient sequence similarity to currently known cellulases to be identified as such using traditional sequence annotation techniques that rely on significant sequence similarity. The recombinant proteins of the newly identified putative cellulases were subjected to enzymatic activity assays to verify their hydrolytic activity towards cellulose and lignocellulosic biomass. Eleven (65%) of the tested enzymes had significant activity towards at least one of the substrates. This high success rate highlights that a gene-context based approach can be used to assign function to genes that are otherwise categorized as “genomic dark matter” and to identify biomass-degrading enzymes that have little sequence similarity to already known cellulases. The ability to assign function to genes that have no related sequence representatives with functional annotation will be important to enhance our understanding of microbial processes and to identify microbial proteins for a wide range of applications.
    Biotechnology and Bioengineering 04/2014; · 4.16 Impact Factor
  • Source
    ABSTRACT: RNA-sequencing (RNA-seq) enables in-depth exploration of transcriptomes, but typical sequencing depth often limits its comprehensiveness. In this study, we generated nearly 3 billion RNA-Seq reads, totaling 341 Gb of sequence, from a Zea mays seedling sample. At this depth, a near complete snapshot of the transcriptome was observed consisting of over 90% of the annotated transcripts, including lowly expressed transcription factors. A novel hybrid strategy combining de novo and reference-based assemblies yielded a transcriptome consisting of 126,708 transcripts with 88% of expressed known genes assembled to full-length. We improved current annotations by adding 4,842 previously unannotated transcript variants and many new features, including 212 maize transcripts, 201 genes, 10 genes with undocumented potential roles in seedlings as well as maize lineage specific gene fusion events. We demonstrated the power of deep sequencing for large transcriptome studies by generating a high quality transcriptome, which provides a rich resource for the research community.
    Scientific Reports 01/2014; 4:4519. · 5.08 Impact Factor
  • Source
    ABSTRACT: By directed evolution in the laboratory, we previously generated populations of Escherichia coli that exhibit a complex new phenotype, extreme resistance to ionizing radiation (IR). The molecular basis of this extremophile phenotype, involving strain isolates with a 3-4 order of magnitude increase in IR resistance at 3000 Gy, is now addressed. Of 69 mutations identified in one of our most highly adapted isolates, functional experiments demonstrate that the IR resistance phenotype is almost entirely accounted for by only three of these nucleotide changes, in the DNA metabolism genes recA, dnaB, and yfjK. Four additional genetic changes make small but measurable contributions. Whereas multiple contributions to IR resistance are evident in this study, our results highlight a particular adaptation mechanism not adequately considered in studies to date: Genetic innovations involving pre-existing DNA repair functions can play a predominant role in the acquisition of an IR resistance phenotype. DOI: http://dx.doi.org/10.7554/eLife.01322.001.
    eLife Sciences 01/2014; 3:e01322.
  • Source
    ABSTRACT: Agaves are succulent monocotyledonous plants native to xeric environments of North America. Because of their adaptations to their environment, including crassulacean acid metabolism (CAM, a water-efficient form of photosynthesis), and existing technologies for ethanol production, agaves have gained attention both as potential lignocellulosic bioenergy feedstocks and models for exploring plant responses to abiotic stress. However, the lack of comprehensive Agave sequence datasets limits the scope of investigations into the molecular-genetic basis of Agave traits. Here, we present comprehensive, high quality de novo transcriptome assemblies of two Agave species, A. tequilana and A. deserti, built from short-read RNA-seq data. Our analyses support completeness and accuracy of the de novo transcriptome assemblies, with each species having a minimum of approximately 35,000 protein-coding genes. Comparison of agave proteomes to those of additional plant species identifies biological functions of gene families displaying sequence divergence in agave species. Additionally, a focus on the transcriptomics of the A. deserti juvenile leaf confirms evolutionary conservation of monocotyledonous leaf physiology and development along the proximal-distal axis. Our work presents a comprehensive transcriptome resource for two Agave species and provides insight into their biology and physiology. These resources are a foundation for further investigation of agave biology and their improvement for bioenergy development.
    BMC Genomics 08/2013; 14(1):563. · 4.40 Impact Factor
  • ABSTRACT: MOTIVATION: Researchers need general-purpose methods for objectively evaluating the accuracy of single and metagenome assemblies, and for automatically detecting any errors they may contain. Current methods do not fully meet this need because they require a reference, only consider one of the many aspects of assembly quality, or lack statistical justification, and none are designed to evaluate metagenome assemblies. RESULTS: In this paper we present an Assembly Likelihood Evaluation (ALE) framework that overcomes these limitations, systematically evaluating the accuracy of an assembly in a reference-independent manner using rigorous statistical methods. This framework is comprehensive, and integrates read quality, mate pair orientation and insert length (for paired end reads), sequencing coverage, read alignment, and k-mer frequency. ALE pinpoints synthetic errors in both single and metagenomic assemblies, including single-base errors, insertions/deletions, genome rearrangements and chimeric assemblies presented in metagenomes. At the genome level with real-world data ALE identifies three large misassemblies from the Spirochaeta smaragdinae finished genome, which were all independently validated by PacBio sequencing. At the single-base level with Illumina data, ALE recovers 215 of 222 (97%) single nucleotide variants in a training set from a GC-rich Rhodobacter sphaeroides genome. Using real PacBio data, ALE identifies 12 of 12 synthetic errors in a Lambda Phage genome, surpassing even Pacific Biosciences' own variant caller, EviCons. In summary, the ALE framework provides a comprehensive, reference-independent and statistically rigorous measure of single genome and metagenome assembly accuracy, which can be used to identify misassemblies or to optimize the assembly process. AVAILABILITY: ALE is released as open source software under the UoI/NCSA license at http://www.alescore.org. It is implemented in C and Python. CONTACT: pf98@cornell.edu or ZhongWang@lbl.gov.
    Bioinformatics 01/2013; · 5.47 Impact Factor
  • Source
    ABSTRACT: The West Nile virus (WNV) is an emerging infection of biodefense concern and there are no available treatments or vaccines. Here we used a high-throughput method based on a novel gene expression analysis, RNA-Seq, to give a global picture of differential gene expression by primary human macrophages of 10 healthy donors infected in vitro with WNV. From a total of 28 million reads per sample, we identified 1,514 transcripts that were differentially expressed after infection. Both predicted and novel gene changes were detected, as were gene isoforms, and while many of the genes were expressed by all donors, some were unique. Knock-down of genes not previously known to be associated with WNV resistance identified their critical role in control of viral infection. Our study distinguishes both common gene pathways as well as novel cellular responses. Such analyses will be valuable for translational studies of susceptible and resistant individuals, and for targeting therapeutics, in multiple biological settings.
    Viruses 01/2013; 5(7):1664-81. · 2.51 Impact Factor
  • Source
    Zhong Wang, Huntington F Willard
    ABSTRACT: BACKGROUND: Combinations of histone variants and modifications, conceptually representing a histone code, have been proposed to play a significant role in gene regulation and developmental processes in complex organisms. While various mechanisms have been implicated in establishing and maintaining epigenetic patterns at specific locations in the genome, they are generally believed to be independent of primary DNA sequence on a more global scale. RESULTS: To address this systematically in the case of the human genome, we have analyzed primary DNA sequences underlying 19 different methylated histones in human primary T-cells. We report that sequence alone can accurately predict the location of most of these histone marks genome-wide in this cell type. Furthermore, the sequence features responsible for such predictions are distinct for different groups of histone marks. CONCLUSIONS: These findings support the existence of a genomic code for histone modification associated with gene expression and chromatin programming, and they suggest that the mechanisms responsible for global histone modifications may interpret genomic sequence in various ways.
    BMC Genomics 08/2012; 13(1):367. · 4.40 Impact Factor
  • Source
    ABSTRACT: RNA sequencing (RNA-Seq) is rapidly replacing microarrays for profiling gene expression with much improved accuracy and sensitivity. One of the most common questions in a typical gene profiling experiment is how to identify a set of transcripts that are differentially expressed between different experimental conditions. Some of the statistical methods developed for microarray data analysis can be applied to RNA-Seq data with or without modifications. Recently several additional methods have been developed specifically for RNA-Seq data sets. This review attempts to give an in-depth review of these statistical methods, with the goal of providing a comprehensive guide when choosing appropriate metrics for RNA-Seq statistical analyses.
    Cell & bioscience. 07/2012; 2(1):26.
  • Source
    ABSTRACT: Single-molecule sequencing instruments can generate multikilobase sequences with the potential to greatly improve genome and transcriptome assembly. However, the error rates of single-molecule reads are high, which has limited their use thus far to resequencing bacteria. To address this limitation, we introduce a correction algorithm and assembly strategy that uses short, high-fidelity sequences to correct the error in single-molecule sequences. We demonstrate the utility of this approach on reads generated by a PacBio RS instrument from phage, prokaryotic and eukaryotic whole genomes, including the previously unsequenced genome of the parrot Melopsittacus undulatus, as well as for RNA-Seq reads of the corn (Zea mays) transcriptome. Our long-read correction achieves >99.9% base-call accuracy, leading to substantially better assemblies than current sequencing strategies: in the best example, the median contig size was quintupled relative to high-coverage, second-generation assemblies. Greater gains are predicted if read lengths continue to increase, including the prospect of single-contig bacterial chromosome assembly.
    Nature Biotechnology 07/2012; 30(7):693-700. · 32.44 Impact Factor
  • Source
    Jeffrey A Martin, Zhong Wang
    ABSTRACT: Transcriptomics studies often rely on partial reference transcriptomes that fail to capture the full catalogue of transcripts and their variations. Recent advances in sequencing technologies and assembly algorithms have facilitated the reconstruction of the entire transcriptome by deep RNA sequencing (RNA-seq), even without a reference genome. However, transcriptome assembly from billions of RNA-seq reads, which are often very short, poses a significant informatics challenge. This Review summarizes the recent developments in transcriptome assembly approaches - reference-based, de novo and combined strategies - along with some perspectives on transcriptome assembly in the near future.
    Nature Reviews Genetics 09/2011; 12(10):671-82. · 41.06 Impact Factor
  • ABSTRACT: De novo assembly of the transcriptome is crucial for functional genomics studies in bioenergy research, since many of the organisms lack high quality reference genomes. In a previous study we successfully de novo assembled simple eukaryote transcriptomes exclusively from short Illumina RNA-Seq reads [1]. However, extensive alternative splicing, present in most of the higher eukaryotes, poses a significant challenge for current short read assembly processes. Furthermore, the size of next-generation datasets, often large for plant genomes, presents an informatics challenge. To tackle these challenges we present a combined experimental and informatics strategy for de novo assembly in higher eukaryotes. Using maize as a test case, preliminary results suggest our approach can resolve transcript variants and improve gene annotations.
    05/2011
  • ABSTRACT: The paucity of enzymes that efficiently deconstruct plant polysaccharides represents a major bottleneck for industrial-scale conversion of cellulosic biomass into biofuels. Cow rumen microbes specialize in degradation of cellulosic plant material, but most members of this complex community resist cultivation. To characterize biomass-degrading genes and genomes, we sequenced and analyzed 268 gigabases of metagenomic DNA from microbes adherent to plant fiber incubated in cow rumen. From these data, we identified 27,755 putative carbohydrate-active genes and expressed 90 candidate proteins, of which 57% were enzymatically active against cellulosic substrates. We also assembled 15 uncultured microbial genomes, which were validated by complementary methods including single-cell genome sequencing. These data sets provide a substantially expanded catalog of genes and genomes participating in the deconstruction of cellulosic biomass.
    Science 01/2011; 331(6016):463-7. · 31.20 Impact Factor
  • Source
    ABSTRACT: Single cell genomics, the amplification and sequencing of genomes from single cells, can provide a glimpse into the genetic make-up and thus life style of the vast majority of uncultured microbial cells, making it an immensely powerful and increasingly popular tool. This is accomplished by use of multiple displacement amplification (MDA), which can generate billions of copies of a single bacterial genome producing microgram-range DNA required for shotgun sequencing. Here, we address a key challenge inherent to this approach and propose a solution for the improved recovery of single cell genomes. While DNA-free reagents for the amplification of a single cell genome are a prerequisite for successful single cell sequencing and analysis, DNA contamination has been detected in various reagents, which poses a considerable challenge. Our study demonstrates the effect of UV irradiation in efficient elimination of exogenous contaminant DNA found in MDA reagents, while maintaining Phi29 activity. Consequently, we also find that increased UV exposure to Phi29 does not adversely affect genome coverage of MDA amplified single cells. While additional challenges in single cell genomics remain to be resolved, the proposed methodology is relatively quick and simple and we believe that its application will be of high value for future single cell sequencing projects.
    01/2011;
  • ABSTRACT: Advanced architectures can deliver dramatically increased throughput for genomics and proteomics applications, reducing time-to-completion in some cases from days to minutes. One such architecture, hybrid-core computing, marries a traditional x86 environment with a reconfigurable coprocessor, based on field programmable gate array (FPGA) technology. In addition to higher throughput, increased performance can fundamentally improve research quality by allowing more accurate, previously impractical approaches. We will discuss the approach used by Convey's de Bruijn graph constructor for short-read, de-novo assembly. Bioinformatics applications that have random access patterns to large memory spaces, such as graph-based algorithms, experience memory performance limitations on cache-based x86 servers. Convey's highly parallel memory subsystem allows application-specific logic to simultaneously access 8192 individual words in memory, significantly increasing effective memory bandwidth over cache-based memory systems. Many algorithms, such as Velvet and other de Bruijn graph based, short-read, de-novo assemblers, can greatly benefit from this type of memory architecture. Furthermore, small data type operations (four nucleotides can be represented in two bits) make more efficient use of logic gates than the data types dictated by conventional programming models. JGI is comparing the performance of Convey's graph constructor and Velvet on both synthetic and real data. We will present preliminary results on memory usage and run time metrics for various data sets with different sizes, from small microbial and fungal genomes to a very large cow rumen metagenome. For genomes with references we will also present assembly quality comparisons between the two assemblers.
    01/2011;
  • Source
    ABSTRACT: Candida albicans is the major invasive fungal pathogen of humans, causing diseases ranging from superficial mucosal infections to disseminated, systemic infections that are often life-threatening. We have used massively parallel high-throughput sequencing of cDNA (RNA-seq) to generate a high-resolution map of the C. albicans transcriptome under several different environmental conditions. We have quantitatively determined all of the regions that are transcribed under these different conditions, and have identified 602 novel transcriptionally active regions (TARs) and numerous novel introns that are not represented in the current genome annotation. Interestingly, the expression of many of these TARs is regulated in a condition-specific manner. This comprehensive transcriptome analysis significantly enhances the current genome annotation of C. albicans, a necessary framework for a complete understanding of the molecular mechanisms of pathogenesis for this important eukaryotic pathogen.
    Genome Research 10/2010; 20(10):1451-8. · 14.40 Impact Factor
  • Source
    ABSTRACT: The predominance of rRNAs in the transcriptome is a major technical challenge in sequence-based analysis of cDNAs from microbial isolates and communities. Several approaches have been applied to deplete rRNAs from (meta)transcriptomes, but no systematic investigation of potential biases introduced by any of these approaches has been reported. Here we validated the effectiveness and fidelity of the two most commonly used approaches, subtractive hybridization and exonuclease digestion, as well as combinations of these treatments, on two synthetic five-microorganism metatranscriptomes using massively parallel sequencing. We found that the effectiveness of rRNA removal was a function of community composition and RNA integrity for these treatments. Subtractive hybridization alone introduced the least bias in relative transcript abundance, whereas exonuclease and in particular combined treatments greatly compromised mRNA abundance fidelity. Illumina sequencing itself also can compromise quantitative data analysis by introducing a G+C bias between runs.
    Nature Methods 10/2010; 7(10):807-12. · 23.57 Impact Factor
  • Source
    ABSTRACT: Transcription of the eukaryotic genomes is carried out by three distinct RNA polymerases I, II, and III, whereby each polymerase is thought to independently transcribe a distinct set of genes. To investigate a possible relationship of RNA polymerases II and III, we mapped their in vivo binding sites throughout the human genome by using ChIP-Seq in two different cell lines, GM12878 and K562 cells. Pol III was found to bind near many known genes as well as several previously unidentified target genes. RNA-Seq studies indicate that a majority of the bound genes are expressed, although a subset are not, suggestive of stalling by RNA polymerase III. Pol II was found to bind near many known Pol III genes, including tRNA, U6, HVG, hY, 7SK and previously unidentified Pol III target genes. Similarly, in vivo binding studies also reveal that a number of transcription factors normally associated with Pol II transcription, including c-Fos, c-Jun and c-Myc, also tightly associate with most Pol III-transcribed genes. Inhibition of Pol II activity using alpha-amanitin reduced expression of a number of Pol III genes (e.g., U6, hY, HVG), suggesting that Pol II plays an important role in regulating their transcription. These results indicate that, contrary to previous expectations, polymerases can often work with one another to globally coordinate gene expression.
    Proceedings of the National Academy of Sciences 02/2010; 107(8):3639-44. · 9.81 Impact Factor

Publication Stats

3k Citations
255.15 Total Impact Points

Institutions

  • 2010–2014
    • DOE Joint Genome Institute
      Walnut Creek, California, United States
    • Lawrence Berkeley National Laboratory
      • Genomics Division
      Berkeley, California, United States
  • 2013
    • Cornell University
      • Center for Applied Mathematics
      Ithaca, NY, United States
  • 2012
    • Duke University
      Durham, North Carolina, United States
  • 2008–2010
    • Yale University
      • Department of Molecular, Cellular and Developmental Biology
      New Haven, CT, United States