Zhong Wang

University of California, Berkeley, Berkeley, California, United States

Publications (28) · 291.53 Total impact

  •
    ABSTRACT: PIWI proteins play essential and conserved roles in germline development, including germline stem cell maintenance and meiosis. Because germline regulators such as OCT4, NANOG, and SOX2 are known to be potent factors that reprogram differentiated somatic cells into induced pluripotent stem cells (iPSCs), we investigated whether the PIWI protein family is involved in iPSC production. We find that all three mouse Piwi genes, Miwi, Mili, and Miwi2, are expressed in embryonic stem cells (ESCs) at higher levels than in fibroblasts, with Mili being the highest. However, mice lacking all three Piwi genes are viable and female fertile, and are only male sterile. Furthermore, embryonic fibroblasts derived from Miwi/Mili/Miwi2 triple knockout embryos can be efficiently reprogrammed into iPS cells. These iPS cells expressed pluripotency markers and were capable of differentiating into all three germ layers in teratoma assays. Genome-wide expression profiling reveals that the triple knockout iPS cells are very similar to littermate control iPS cells. These results indicate that PIWI proteins are dispensable for direct reprogramming of mouse fibroblasts.
    PLoS ONE 09/2014; 9(9):e97821. DOI:10.1371/journal.pone.0097821 · 3.53 Impact Factor
  •
    ABSTRACT: Although recent nucleotide sequencing technologies have significantly enhanced our understanding of microbial genomes, the function of ∼35% of genes identified in a genome currently remains unknown. To improve the understanding of microbial genomes and consequently of microbial processes it will be crucial to assign a function to this “genomic dark matter.” Due to the urgent need for additional carbohydrate-active enzymes for improved production of transportation fuels from lignocellulosic biomass, we screened the genomes of more than 5,500 microorganisms for hypothetical proteins that are located in the proximity of already known cellulases. We identified, synthesized and expressed a total of 17 putative cellulase genes with insufficient sequence similarity to currently known cellulases to be identified as such using traditional sequence annotation techniques that rely on significant sequence similarity. The recombinant proteins of the newly identified putative cellulases were subjected to enzymatic activity assays to verify their hydrolytic activity towards cellulose and lignocellulosic biomass. Eleven (65%) of the tested enzymes had significant activity towards at least one of the substrates. This high success rate highlights that a gene-context based approach can be used to assign function to genes that are otherwise categorized as “genomic dark matter” and to identify biomass-degrading enzymes that have little sequence similarity to already known cellulases. The ability to assign function to genes that have no related sequence representatives with functional annotation will be important to enhance our understanding of microbial processes and to identify microbial proteins for a wide range of applications.
    Biotechnology and Bioengineering 08/2014; 111(8). DOI:10.1002/bit.25250 · 4.16 Impact Factor
  •
    ABSTRACT: Ruminant livestock represent the single largest anthropogenic source of the potent greenhouse gas methane, which is generated by methanogenic archaea residing in ruminant digestive tracts. While differences between individual animals of the same breed in the amount of methane produced have been observed, the basis for this variation remains to be elucidated. To explore the mechanistic basis of this methane production, we measured methane yields from 22 sheep, which revealed that methane yields are a reproducible quantitative trait. Deep metagenomic and metatranscriptomic sequencing demonstrated the presence of methanogens both in the highest and lowest methane-producing sheep, with a similar abundance of methanogens and methanogenesis pathway genes in high and low methane emitters. However, transcription of methanogenesis pathway genes was substantially increased in sheep with high methane yields. These results identify a discrete set of rumen methanogens whose methanogenesis pathway transcription profiles correlate with methane yields and provide new targets for CH4 mitigation at the levels of microbiota composition and transcriptional regulation.
    Genome Research 06/2014; 24(9):1517-25. DOI:10.1101/gr.168245.113 · 13.85 Impact Factor
  •
    ABSTRACT: RNA-sequencing (RNA-seq) enables in-depth exploration of transcriptomes, but typical sequencing depth often limits its comprehensiveness. In this study, we generated nearly 3 billion RNA-Seq reads, totaling 341 Gb of sequence, from a Zea mays seedling sample. At this depth, a near complete snapshot of the transcriptome was observed consisting of over 90% of the annotated transcripts, including lowly expressed transcription factors. A novel hybrid strategy combining de novo and reference-based assemblies yielded a transcriptome consisting of 126,708 transcripts with 88% of expressed known genes assembled to full-length. We improved current annotations by adding 4,842 previously unannotated transcript variants and many new features, including 212 maize transcripts, 201 genes, 10 genes with undocumented potential roles in seedlings as well as maize lineage specific gene fusion events. We demonstrated the power of deep sequencing for large transcriptome studies by generating a high quality transcriptome, which provides a rich resource for the research community.
    Scientific Reports 03/2014; 4:4519. DOI:10.1038/srep04519 · 5.58 Impact Factor
  •
    ABSTRACT: By directed evolution in the laboratory, we previously generated populations of Escherichia coli that exhibit a complex new phenotype, extreme resistance to ionizing radiation (IR). The molecular basis of this extremophile phenotype, involving strain isolates with a 3-4 orders of magnitude increase in IR resistance at 3000 Gy, is now addressed. Of 69 mutations identified in one of our most highly adapted isolates, functional experiments demonstrate that the IR resistance phenotype is almost entirely accounted for by only three of these nucleotide changes, in the DNA metabolism genes recA, dnaB, and yfjK. Four additional genetic changes make small but measurable contributions. Whereas multiple contributions to IR resistance are evident in this study, our results highlight a particular adaptation mechanism not adequately considered in studies to date: Genetic innovations involving pre-existing DNA repair functions can play a predominant role in the acquisition of an IR resistance phenotype.
    eLife Sciences 03/2014; 3:e01322. DOI:10.7554/eLife.01322 · 8.52 Impact Factor
  •
    ABSTRACT: Improved crop water-use efficiency (WUE) is critical for the long-term sustainability of agricultural production systems in the face of predicted future warmer and drier climates. Crassulacean acid metabolism (CAM) is a specialized mode of photosynthesis that enhances WUE through an inverse day/night pattern of stomatal closure/opening and improves photosynthetic efficiency by concentrating CO2 around RUBISCO. CAM has evolved multiple times from C3 photosynthesis and ~6.5% of higher plant species in more than 35 families have acquired CAM via parallel or convergent evolution. There are two fundamental questions to be answered to understand the molecular basis and evolutionary mechanism of CAM: 1) what are the genetic differences between CAM and non-CAM species and 2) what are the common molecular features shared among CAM plants from diverse origins? To address these questions, comparative genomics analysis was performed using multiple plant species including CAM (e.g., Agave, Kalanchoe, Mesembryanthemum), C3 (e.g., Arabidopsis, Oryza, Populus), C4 (e.g., Setaria, Sorghum, and Zea), and non-vascular plant species (e.g., Physcomitrella, Selaginella). Our analysis not only revealed orthologous gene groups shared between CAM and non-CAM species, but also identified genes specific to the CAM species. Also, expanded gene families were identified in CAM species compared with non-CAM species. Gene ontology and gene expression profiles were used to build hypotheses related to divergent gene functions that likely arose during CAM evolution. This research establishes a framework for CAM comparative genomics studies and provides new knowledge to inform genetic improvement in WUE and photosynthetic efficiency in crop plants under water-limiting conditions.
    International Plant and Animal Genome Conference XXII 2014; 01/2014
  •
    ABSTRACT: The recent revolution in sequencing technologies has led to an exponential growth of sequence data. As a result, most of the current bioinformatics tools become obsolete as they fail to scale with data. To tackle this "data deluge", here we introduce the BioPig sequence analysis toolkit as one of the solutions that scale to data and computation. We built BioPig upon Apache's Hadoop MapReduce system and the Pig data flow language. Compared to traditional serial and MPI based algorithms, BioPig has three major advantages: first, BioPig's programmability greatly reduces development time for parallel bioinformatics applications; second, testing BioPig with up to 500 Gb sequences demonstrates that it scales automatically with size of data; and finally, BioPig can be ported without modification to many Hadoop infrastructures, as tested with the Magellan system at NERSC and the Amazon Elastic Compute Cloud. In summary, BioPig represents a novel program framework with the potential to greatly accelerate data-intensive bioinformatics analysis. BioPig is released as open source software under the BSD license at https://sites.google.com/a/lbl.gov/biopig/. CONTACT: ZhongWang@lbl.gov.
    Bioinformatics 09/2013; DOI:10.1093/bioinformatics/btt528 · 4.62 Impact Factor
  •
    ABSTRACT: Agaves are succulent monocotyledonous plants native to xeric environments of North America. Because of their adaptations to their environment, including crassulacean acid metabolism (CAM, a water-efficient form of photosynthesis), and existing technologies for ethanol production, agaves have gained attention both as potential lignocellulosic bioenergy feedstocks and models for exploring plant responses to abiotic stress. However, the lack of comprehensive Agave sequence datasets limits the scope of investigations into the molecular-genetic basis of Agave traits. Here, we present comprehensive, high quality de novo transcriptome assemblies of two Agave species, A. tequilana and A. deserti, built from short-read RNA-seq data. Our analyses support completeness and accuracy of the de novo transcriptome assemblies, with each species having a minimum of approximately 35,000 protein-coding genes. Comparison of agave proteomes to those of additional plant species identifies biological functions of gene families displaying sequence divergence in agave species. Additionally, a focus on the transcriptomics of the A. deserti juvenile leaf confirms evolutionary conservation of monocotyledonous leaf physiology and development along the proximal-distal axis. Our work presents a comprehensive transcriptome resource for two Agave species and provides insight into their biology and physiology. These resources are a foundation for further investigation of agave biology and their improvement for bioenergy development.
    BMC Genomics 08/2013; 14(1):563. DOI:10.1186/1471-2164-14-563 · 4.04 Impact Factor
  •
    ABSTRACT: The West Nile virus (WNV) is an emerging infection of biodefense concern and there are no available treatments or vaccines. Here we used a high-throughput method based on a novel gene expression analysis, RNA-Seq, to give a global picture of differential gene expression by primary human macrophages of 10 healthy donors infected in vitro with WNV. From a total of 28 million reads per sample, we identified 1,514 transcripts that were differentially expressed after infection. Both predicted and novel gene changes were detected, as were gene isoforms, and while many of the genes were expressed by all donors, some were unique. Knock-down of genes not previously known to be associated with WNV resistance identified their critical role in control of viral infection. Our study distinguishes both common gene pathways as well as novel cellular responses. Such analyses will be valuable for translational studies of susceptible and resistant individuals, and for targeting therapeutics, in multiple biological settings.
    Viruses 07/2013; 5(7):1664-81. DOI:10.3390/v5071664 · 3.28 Impact Factor
  •
    ABSTRACT: MOTIVATION: Researchers need general-purpose methods for objectively evaluating the accuracy of single and metagenome assemblies, and for automatically detecting any errors they may contain. Current methods do not fully meet this need because they require a reference, only consider one of the many aspects of assembly quality, or lack statistical justification, and none are designed to evaluate metagenome assemblies. RESULTS: In this paper we present an Assembly Likelihood Evaluation (ALE) framework that overcomes these limitations, systematically evaluating the accuracy of an assembly in a reference-independent manner using rigorous statistical methods. This framework is comprehensive, and integrates read quality, mate pair orientation and insert length (for paired end reads), sequencing coverage, read alignment, and k-mer frequency. ALE pinpoints synthetic errors in both single and metagenomic assemblies, including single-base errors, insertions/deletions, genome rearrangements and chimeric assemblies presented in metagenomes. At the genome level with real-world data ALE identifies three large misassemblies from the Spirochaeta smaragdinae finished genome, which were all independently validated by PacBio sequencing. At the single-base level with Illumina data, ALE recovers 215 of 222 (97%) single nucleotide variants in a training set from a GC-rich Rhodobacter sphaeroides genome. Using real PacBio data, ALE identifies 12 of 12 synthetic errors in a Lambda Phage genome, surpassing even Pacific Biosciences' own variant caller, EviCons. In summary, the ALE framework provides a comprehensive, reference-independent and statistically rigorous measure of single genome and metagenome assembly accuracy, which can be used to identify misassemblies or to optimize the assembly process. AVAILABILITY: ALE is released as open source software under the UoI/NCSA license at http://www.alescore.org. It is implemented in C and Python. 
    CONTACT: pf98@cornell.edu or ZhongWang@lbl.gov.
    Bioinformatics 01/2013; 29(4). DOI:10.1093/bioinformatics/bts723 · 4.62 Impact Factor
  •
    ABSTRACT: In this work we describe a method to automatically detect errors in de novo assembled genomes. The method extends a Bayesian assembly quality evaluation framework, ALE, which computes the likelihood of an assembly given a set of unassembled data. Starting from ALE output, this method applies outlier detection algorithms to identify the precise locations of assembly errors. We show results from a microbial genome with manually curated assembly errors. Our method detects all deletions, 82.3% of insertions, and 88.8% of single base substitutions. It was also able to detect an inversion error that spans more than 400 bases.
    eScience (eScience), 2013 IEEE 9th International Conference on; 01/2013
  • Zhong Wang, Huntington F Willard
    ABSTRACT: BACKGROUND: Combinations of histone variants and modifications, conceptually representing a histone code, have been proposed to play a significant role in gene regulation and developmental processes in complex organisms. While various mechanisms have been implicated in establishing and maintaining epigenetic patterns at specific locations in the genome, they are generally believed to be independent of primary DNA sequence on a more global scale. RESULTS: To address this systematically in the case of the human genome, we have analyzed primary DNA sequences underlying patterns of 19 different methylated histones in human primary T-cells and patterns of three methylated histones across additional human cell lines. We report strong sequence biases associated with most of these histone marks genome-wide in each cell type. Furthermore, the sequence characteristics for such association are distinct for different groups of histone marks. CONCLUSIONS: These findings provide evidence of an influence of genomic sequence on patterns of histone modification associated with gene expression and chromatin programming, and they suggest that the mechanisms responsible for global histone modifications may interpret genomic sequence in various ways.
    BMC Genomics 08/2012; 13(1):367. DOI:10.1186/1471-2164-13-367 · 4.04 Impact Factor
  •
    ABSTRACT: RNA sequencing (RNA-Seq) is rapidly replacing microarrays for profiling gene expression with much improved accuracy and sensitivity. One of the most common questions in a typical gene profiling experiment is how to identify a set of transcripts that are differentially expressed between different experimental conditions. Some of the statistical methods developed for microarray data analysis can be applied to RNA-Seq data with or without modifications. Recently several additional methods have been developed specifically for RNA-Seq data sets. This review attempts to give an in-depth review of these statistical methods, with the goal of providing a comprehensive guide when choosing appropriate metrics for RNA-Seq statistical analyses.
    07/2012; 2(1):26. DOI:10.1186/2045-3701-2-26
  •
    ABSTRACT: Single-molecule sequencing instruments can generate multikilobase sequences with the potential to greatly improve genome and transcriptome assembly. However, the error rates of single-molecule reads are high, which has limited their use thus far to resequencing bacteria. To address this limitation, we introduce a correction algorithm and assembly strategy that uses short, high-fidelity sequences to correct the error in single-molecule sequences. We demonstrate the utility of this approach on reads generated by a PacBio RS instrument from phage, prokaryotic and eukaryotic whole genomes, including the previously unsequenced genome of the parrot Melopsittacus undulatus, as well as for RNA-Seq reads of the corn (Zea mays) transcriptome. Our long-read correction achieves >99.9% base-call accuracy, leading to substantially better assemblies than current sequencing strategies: in the best example, the median contig size was quintupled relative to high-coverage, second-generation assemblies. Greater gains are predicted if read lengths continue to increase, including the prospect of single-contig bacterial chromosome assembly.
    Nature Biotechnology 07/2012; 30(7):693-700. DOI:10.1038/nbt.2280 · 39.08 Impact Factor
  • T. Samak, D. Gunter, Zhong Wang
    ABSTRACT: Gene synthesis is a key step to convert digitally predicted proteins to functional proteins. However, it is a relatively expensive and labor-intensive process. About 30-50% of the synthesized proteins are not soluble, which further reduces the efficacy of gene synthesis as a method for protein function characterization. Solubility prediction from primary protein sequences holds the promise to dramatically reduce the cost of gene synthesis. This work presents a framework that creates models of solubility from sequence information. From the primary protein sequences of the genes to be synthesized, sequence features can be used to build computational models for solubility. This way, biologists can focus the effort on synthesizing genes that are highly likely to generate soluble proteins. We have developed a framework that employs several machine learning algorithms to model protein solubility. The framework is used to predict protein solubility in the Escherichia coli expression system. The analysis is performed on over 1,600 quantified proteins. The approach successfully predicted the solubility with more than 80% accuracy, and enabled in-depth analysis of the most important features affecting solubility. The analysis pipeline is general and can be applied to any set of sequence features to predict any binary measure. The framework also provides the biologist with a comprehensive comparison between different learning algorithms, and insightful feature analysis.
    E-Science (e-Science), 2012 IEEE 8th International Conference on; 01/2012
  • Jeffrey A Martin, Zhong Wang
    ABSTRACT: Transcriptomics studies often rely on partial reference transcriptomes that fail to capture the full catalogue of transcripts and their variations. Recent advances in sequencing technologies and assembly algorithms have facilitated the reconstruction of the entire transcriptome by deep RNA sequencing (RNA-seq), even without a reference genome. However, transcriptome assembly from billions of RNA-seq reads, which are often very short, poses a significant informatics challenge. This Review summarizes the recent developments in transcriptome assembly approaches - reference-based, de novo and combined strategies - along with some perspectives on transcriptome assembly in the near future.
    Nature Reviews Genetics 09/2011; 12(10):671-82. DOI:10.1038/nrg3068 · 39.79 Impact Factor
  •
    ABSTRACT: De novo assembly of the transcriptome is crucial for functional genomics studies in bioenergy research, since many of the organisms lack high quality reference genomes. In a previous study we successfully de novo assembled simple eukaryote transcriptomes exclusively from short Illumina RNA-Seq reads [1]. However, extensive alternative splicing, present in most of the higher eukaryotes, poses a significant challenge for current short read assembly processes. Furthermore, the size of next-generation datasets, often large for plant genomes, presents an informatics challenge. To tackle these challenges we present a combined experimental and informatics strategy for de novo assembly in higher eukaryotes. Using maize as a test case, preliminary results suggest our approach can resolve transcript variants and improve gene annotations.
    05/2011
  •
    ABSTRACT: The paucity of enzymes that efficiently deconstruct plant polysaccharides represents a major bottleneck for industrial-scale conversion of cellulosic biomass into biofuels. Cow rumen microbes specialize in degradation of cellulosic plant material, but most members of this complex community resist cultivation. To characterize biomass-degrading genes and genomes, we sequenced and analyzed 268 gigabases of metagenomic DNA from microbes adherent to plant fiber incubated in cow rumen. From these data, we identified 27,755 putative carbohydrate-active genes and expressed 90 candidate proteins, of which 57% were enzymatically active against cellulosic substrates. We also assembled 15 uncultured microbial genomes, which were validated by complementary methods including single-cell genome sequencing. These data sets provide a substantially expanded catalog of genes and genomes participating in the deconstruction of cellulosic biomass.
    Science 01/2011; 331(6016):463-7. DOI:10.1126/science.1200387 · 31.48 Impact Factor

Publication Stats

5k Citations
291.53 Total Impact Points

Institutions

  • 2013–2014
    • University of California, Berkeley
      Berkeley, California, United States
  • 2010–2014
    • Lawrence Berkeley National Laboratory
      • Genomics Division
      Berkeley, California, United States
    • DOE Joint Genome Institute
      Walnut Creek, California, United States
  • 2008–2013
    • Yale University
      • Department of Molecular Biophysics and Biochemistry
      • Department of Molecular, Cellular and Developmental Biology
      New Haven, Connecticut, United States
  • 2012
    • Duke University
      Durham, North Carolina, United States