Wing H Wong

Stanford University, Palo Alto, California, United States

Publications (68) · 654.59 Total impact

  • ABSTRACT: Structural variations (SVs) are large genomic rearrangements that vary significantly in size, making them challenging to detect with the relatively short reads from next-generation sequencing (NGS). Different SV detection methods have been developed; however, each is limited to specific kinds of SVs with varying accuracy and resolution. Previous works have attempted to combine different methods, but they still suffer from poor accuracy, particularly for insertions. We propose MetaSV, an integrated SV caller which leverages multiple orthogonal SV signals for high accuracy and resolution. MetaSV proceeds by merging SVs from multiple tools for all types of SVs. It also analyzes soft-clipped reads from alignment to detect insertions accurately, since existing tools underestimate insertion SVs. Local assembly in combination with dynamic programming is used to improve breakpoint resolution. Paired-end and coverage information is used to predict SV genotypes. Using simulation and experimental data, we demonstrate the effectiveness of MetaSV across various SV types and sizes. Code in Python is at http://bioinform.github.io/metasv/. CONTACT: rd@bina.com. SUPPLEMENTARY INFORMATION: Supplementary information attached. © The Author(s) 2015. Published by Oxford University Press.
    Bioinformatics 04/2015; DOI:10.1093/bioinformatics/btv204 · 4.62 Impact Factor
  • ABSTRACT: VarSim is a framework for assessing alignment and variant calling accuracy in high-throughput genome sequencing through simulation or real data. In contrast to simulating a random mutation spectrum, it synthesizes diploid genomes with germline and somatic mutations based on a realistic model. This model leverages information such as previously reported mutations to make the synthetic genomes biologically relevant. VarSim simulates and validates a wide range of variants, including single nucleotide variants, small indels and large structural variants. It is an automated, comprehensive compute framework supporting parallel computation and multiple read simulators. Furthermore, we developed a novel map data structure to validate read alignments, a strategy to compare variants binned in size ranges and a lightweight, interactive, graphical report to visualize validation results with detailed statistics. Thus far, it is the most comprehensive validation tool for secondary analysis in next-generation sequencing. Code in Java and Python is at http://github.com/bioinform/varsim. Reads and variants are at SRA PRJNA263417. CONTACT: rd@bina.com. SUPPLEMENTARY INFORMATION: Supplementary information attached. © The Author(s) 2014. Published by Oxford University Press.
    Bioinformatics 12/2014; DOI:10.1093/bioinformatics/btu828 · 4.62 Impact Factor
  • ABSTRACT: Background / Purpose: Structural variations (SVs) are large genomic rearrangements, including deletion, insertion, inversion, duplication and translocation. SV detection is a key challenge with next-generation sequencing reads since SVs are generally much larger than read length. Accuracy of SV detection varies significantly by type, region and size, and thus no single approach, whether read-pair (RP), split-read (SR), junction-mapping (JM) or read-depth (RD), fits all cases. Here we describe a novel approach for SV detection by integrating across multiple complementary methods and signals to improve accuracy across all SV types and sizes. Main conclusion: The novel SV detection method, MetaSV, performs excellently on simulated data by integrating multiple methods and signals into a single analysis. Future work will use local assembly integration and optimal alignment to further boost MetaSV accuracy and resolution. In addition, support will be added for intra- and inter-chromosomal translocations. Finally, a high-confidence SV gold set for NA12878 will be used for validation on experimental data.
    International Conference on Intelligent Systems for Molecular Biology (ISMB) 2014; 07/2014
  • ABSTRACT: Background / Purpose: Currently there is a lack of a comprehensive simulation validation framework for next-generation sequencing (NGS) analysis. Multiple agreed-upon validation datasets are critical for development of new secondary analysis methods, and read simulation is a bottleneck when simulating high coverage data. The Genome in a Bottle Consortium has generated a gold standard set of variants by combining multiple sequencing technologies. However, this is not scalable for generation of multiple datasets. Here we present an alternative method, VarSim. Main conclusion: Simulation is an important validation methodology as it allows for easy generation of multiple datasets. The VarSim method provides a comprehensive simulation validation framework. Future efforts will add support for translocations and for simulating copy number alterations in cancer.
    International Conference on Intelligent Systems for Molecular Biology (ISMB) 2014; 07/2014
  • Luo Lu, Hui Jiang, Wing H. Wong
    ABSTRACT: Consider a class of densities that are piecewise constant functions over partitions of the sample space defined by sequential coordinate partitioning. We introduce a prior distribution for a density in this function class and derive in closed form the marginal posterior distribution of the corresponding partition. A computationally efficient method, based on sequential importance sampling, is presented for the inference of the partition from this posterior distribution. Compared to traditional approaches such as the kernel method or the histogram, the Bayesian sequential partitioning (BSP) method proposed here is capable of providing much more accurate estimates when the sample space is of moderate to high dimension. We illustrate this by simulated as well as real data examples. The examples also demonstrate how BSP can be used to design new classification methods competitive with the state of the art.
    Journal of the American Statistical Association 12/2013; 108(504). DOI:10.1080/01621459.2013.813389 · 2.11 Impact Factor
  • ABSTRACT: Haplotype, or the sequence of alleles along a single chromosome, has important applications in phenotype-genotype association studies, as well as in population genetics analyses. Because haplotype cannot be experimentally assayed in diploid organisms in a high-throughput fashion, numerous statistical methods have been developed to reconstruct probable haplotype from genotype data. These methods focus primarily on accurate phasing of a short genomic region with a small number of markers, and the error rate increases rapidly for longer regions. Here we introduce a new phasing algorithm that aims to improve long-range phasing accuracy. Using datasets from multiple populations, we found that the algorithm reduces long-range phasing errors by up to 50% compared to the current state-of-the-art methods. In addition to inferring the most likely haplotypes, it produces confidence measures, allowing downstream analyses to account for the uncertainties associated with some haplotypes. We anticipate that it offers a powerful tool for analyzing large-scale data generated in genome-wide association studies (GWAS).
    Statistica Sinica 10/2013; 23(4). DOI:10.5705/ss.2012.141s · 1.23 Impact Factor
  • ABSTRACT: Molecular insights into somatic cell reprogramming to induced pluripotent stem cells (iPS) would aid regenerative medicine, but are difficult to elucidate in iPS because of their heterogeneity, as relatively few cells undergo reprogramming (0.1-1%). To identify early acting regulators, we capitalized on non-dividing heterokaryons (mouse embryonic stem cells fused to human fibroblasts), in which reprogramming towards pluripotency is efficient and rapid, enabling the identification of transient regulators required at the onset. We used bi-species transcriptome-wide RNA-seq to quantify transcriptional changes in the human somatic nucleus during reprogramming towards pluripotency in heterokaryons. During heterokaryon reprogramming, the cytokine interleukin 6 (IL6), which is not detectable at significant levels in embryonic stem cells, was induced 50-fold. A 4-day culture with IL6 at the onset of iPS reprogramming replaced stably transduced oncogenic c-Myc such that transduction of only Oct4, Klf4 and Sox2 was required. IL6 also activated another Jak/Stat target, the serine/threonine kinase gene Pim1, which accounted for the IL6-mediated twofold increase in iPS frequency. In contrast, LIF, another induced GP130 ligand, failed to increase iPS frequency or activate c-Myc or Pim1, thereby revealing a differential role for the two Jak/Stat inducers in iPS generation. These findings demonstrate the power of heterokaryon bi-species global RNA-seq to identify early acting regulators of reprogramming, for example, extrinsic replacements for stably transduced transcription factors such as the potent oncogene c-Myc.
    Nature Cell Biology 09/2013; DOI:10.1038/ncb2835 · 20.06 Impact Factor
  • Junhee Seok, Lu Tian, Wing H Wong
    ABSTRACT: Analyzing the failure times of multiple events is of interest in many fields. Estimating the joint distribution of the failure times in a non-parametric way is not straightforward because some failure times are often right-censored and only known to be greater than observed follow-up times. Although it has been studied, there is no universally optimal solution for this problem. It is still challenging and important to provide alternatives that may be more suitable than existing ones in specific settings. Related problems of the existing methods are not only limited to infeasible computations, but also include the lack of optimality and possible non-monotonicity of the estimated survival function. In this paper, we proposed a non-parametric Bayesian approach for directly estimating the density function of multivariate survival times, where the prior is constructed based on the optional Pólya tree. We investigated several theoretical aspects of the procedure and derived an efficient iterative algorithm for implementing the Bayesian procedure. The empirical performance of the method was examined via extensive simulation studies. Finally, we presented a detailed analysis using the proposed method on the relationship among organ recovery times in severely injured patients. From the analysis, we suggested interesting medical information that can be further pursued in clinics.
    Biostatistics 07/2013; 15(1). DOI:10.1093/biostatistics/kxt025 · 2.24 Impact Factor
  • ABSTRACT: OBJECTIVE: To test whether the probability of having a live birth (LB) with the first IVF cycle (C1) can be predicted and personalized for patients in diverse environments. DESIGN: Retrospective validation of multicenter prediction model. SETTING: Three university-affiliated outpatient IVF clinics located in different countries. PATIENT(S): Using primary models aggregated from >13,000 C1s, we applied the boosted tree method to train a preIVF-diversity model (PreIVF-D) with 1,061 C1s from 2008 to 2009, and validated predicted LB probabilities with an independent dataset comprising 1,058 C1s from 2008 to 2009. INTERVENTION(S): None. MAIN OUTCOME MEASURE(S): Predictive power, reclassification, receiver operator characteristic analysis, calibration, dynamic range. RESULT(S): Overall, with PreIVF-D, 86% of cases had significantly different LB probabilities compared with age control, and more than one-half had higher LB probabilities. Specifically, 42% of patients could have been identified by PreIVF-D to have a personalized predicted success rate >45%, whereas an age-control model could not differentiate them from others. Furthermore, PreIVF-D showed improved predictive power, with 36% improved log-likelihood (or 9.0-fold by log-scale; >1,000-fold linear scale), and prediction errors for subgroups ranged from 0.9% to 3.7%. CONCLUSION(S): Validated prediction of personalized LB probabilities from diverse multiple sources identifies excellent prognoses in more than one-half of patients.
    Fertility and sterility 03/2013; DOI:10.1016/j.fertnstert.2013.02.016 · 4.30 Impact Factor
  • ABSTRACT: Landmark events occur in a coordinated manner during pre-implantation development of the mammalian embryo, yet the regulatory network that orchestrates these events remains largely unknown. Here, we present the first systematic investigation of the network in pre-implantation mouse embryos using morpholino-mediated gene knockdowns of key embryonic stem cell (ESC) factors followed by detailed transcriptome analysis of pooled embryos, single embryos, and individual blastomeres. We delineated the regulons of Oct4, Sall4, and Nanog and identified a set of metabolism- and transport-related genes that were controlled by these transcription factors in embryos but not in ESCs. Strikingly, the knockdown embryos arrested at a range of developmental stages. We provided evidence that the DNA methyltransferase Dnmt3b has a role in determining the extent to which a knockdown embryo can develop. We further showed that the feed-forward loop comprising Dnmt3b, the pluripotency factors, and the miR-290-295 cluster exemplifies a network motif that buffers embryos against gene expression noise. Our findings indicate that Oct4, Sall4, and Nanog form a robust and integrated network to govern mammalian pre-implantation development.
    Molecular Systems Biology 01/2013; 9:632. DOI:10.1038/msb.2012.65 · 14.10 Impact Factor
  • ABSTRACT: In the vertebrate neural tube, regional Sonic hedgehog (Shh) signaling invokes a time- and concentration-dependent induction of six different cell populations mediated through Gli transcriptional regulators. Elsewhere in the embryo, Shh/Gli responses invoke different tissue-appropriate regulatory programs. A genome-scale analysis of DNA binding by Gli1 and Sox2, a pan-neural determinant, identified a set of shared regulatory regions associated with key factors central to cell fate determination and neural tube patterning. Functional analysis in transgenic mice validates core enhancers for each of these factors and demonstrates the dual requirement for Gli1 and Sox2 inputs for neural enhancer activity. Furthermore, through an unbiased determination of Gli-binding site preferences and analysis of binding site variants in the developing mammalian CNS, we demonstrate that differential Gli-binding affinity underlies threshold-level activator responses to Shh input. In summary, our results highlight Sox2 input as a context-specific determinant of the neural-specific Shh response and differential Gli-binding site affinity as an important cis-regulatory property critical for interpreting Shh morphogen action in the mammalian neural tube.
    Genes & development 12/2012; 26(24):2802-16. DOI:10.1101/gad.207142.112 · 12.64 Impact Factor
  • ABSTRACT: Retroviral overexpression of reprogramming factors (Oct4, Sox2, Klf4, c-Myc) generates induced pluripotent stem cells (iPSCs). However, the integration of foreign DNA could induce genomic dysregulation. Cell-permeant proteins (CPPs) could overcome this limitation. To date, this approach has proved exceedingly inefficient. We discovered a striking difference in the pattern of gene expression induced by viral versus CPP-based delivery of the reprogramming factors, suggesting that a signaling pathway required for efficient nuclear reprogramming was activated by the retroviral, but not CPP approach. In gain- and loss-of-function studies, we find that the toll-like receptor 3 (TLR3) pathway enables efficient induction of pluripotency by viral or mmRNA approaches. Stimulation of TLR3 causes rapid and global changes in the expression of epigenetic modifiers to enhance chromatin remodeling and nuclear reprogramming. Activation of inflammatory pathways is required for efficient nuclear reprogramming in the induction of pluripotency.
    Cell 10/2012; 151(3):547-58. DOI:10.1016/j.cell.2012.09.034 · 33.12 Impact Factor
  • ABSTRACT: Next-generation sequence analysis has become an important task both in laboratory and clinical settings. A key stage in the majority of sequence analysis workflows, such as resequencing, is the alignment of genomic reads to a reference genome. The accurate alignment of reads with large indels is a computationally challenging task for researchers. We introduce SeqAlto as a new algorithm for read alignment. For reads longer than or equal to 100 bp, SeqAlto is up to 10× faster than existing algorithms, while retaining high accuracy and the ability to align reads with large (up to 50 bp) indels. This improvement in efficiency is particularly important in the analysis of future sequencing data where the number of reads approaches many billions. Furthermore, SeqAlto uses less than 8 GB of memory to align against the human genome. SeqAlto is benchmarked against several existing tools with both real and simulated data. Linux and Mac OS X binaries free for academic use are available at http://www.stanford.edu/group/wonglab/seqalto. CONTACT: whwong@stanford.edu.
    Bioinformatics 07/2012; 28(18):2366-73. DOI:10.1093/bioinformatics/bts450 · 4.62 Impact Factor
  • ABSTRACT: OBJECTIVE: To report and evaluate the performance and utility of an approach to predicting IVF-double embryo transfer (DET) multiple birth risks that is evidence-based, clinic-specific, and considers each patient's clinical profile. DESIGN: Retrospective prediction modeling. SETTING: An outpatient university-affiliated IVF clinic. PATIENT(S): We used boosted tree methods to analyze 2,413 independent IVF-DET treatment cycles that resulted in live births. The IVF cycles were retrieved from a database that comprised more than 33,000 IVF cycles. INTERVENTION(S): None. MAIN OUTCOME MEASURE(S): The performance of this prediction model, MBP-BIVF, was validated by an independent data set, to evaluate predictive power, discrimination, dynamic range, and reclassification. RESULT(S): Multiple birth probabilities ranging from 11.8% to 54.8% were predicted by the model and were significantly different from control predictions in more than half of the patients. The prediction model showed an improvement of 146% in predictive power and 16.0% in discrimination over control. The population standard error was 1.8%. CONCLUSION(S): We showed that IVF patients have inherently different risks of multiple birth, even when DET is specified, and this risk can be predicted before ET. The use of clinic-specific prediction models provides an evidence-based and personalized method to counsel patients.
    Fertility and sterility 05/2012; 98(1):69-76. DOI:10.1016/j.fertnstert.2012.04.011 · 4.30 Impact Factor
  • Nature 01/2012; 489(7414):57-74. DOI:10.1038/nature11247 · 42.35 Impact Factor
  • Pengyu Hong, Sheng Zhong, Wing H. Wong
    ABSTRACT: The Ubiquitous Bio-Information Computing (UBIC2) project aims to disseminate protocols and software packages to facilitate the development of heterogeneous bio-information computing units that are interoperable and may run distributedly. UBIC2 specifies biological data in XML formats and queries data using XQuery. The UBIC2 programming library provides interfaces for integrating, retrieving, and manipulating heterogeneous biological data. Interoperability is achieved via Simple Object Access Protocol (SOAP) based web services. The documents and software packages of UBIC2 are available at .
    International Journal of Software Engineering and Knowledge Engineering 11/2011; 15(03). DOI:10.1142/S0218194005001951 · 0.26 Impact Factor
  • Martin A. Tanner, Wing H. Wong
    ABSTRACT: It was known from Metropolis et al. [J. Chem. Phys. 21 (1953) 1087–1092] that one can sample from a distribution by performing Monte Carlo simulation from a Markov chain whose equilibrium distribution is equal to the target distribution. However, it took several decades before the statistical community embraced Markov chain Monte Carlo (MCMC) as a general computational tool in Bayesian inference. The usual reasons that are advanced to explain why statisticians were slow to catch on to the method include lack of computing power and unfamiliarity with the early dynamic Monte Carlo papers in the statistical physics literature. We argue that there was a deeper reason, namely, that the structure of problems in the statistical mechanics and those in the standard statistical literature are different. To make the methods usable in standard Bayesian problems, one had to exploit the power that comes from the introduction of judiciously chosen auxiliary variables and collective moves. This paper examines the development in the critical period 1980–1990, when the ideas of Markov chain simulation from the statistical physics literature and the latent variable formulation in maximum likelihood computation (i.e., EM algorithm) came together to spark the widespread application of MCMC methods in Bayesian computation.
    Statistical Science 04/2011; 25(2010). DOI:10.1214/10-STS341 · 1.69 Impact Factor
  • ABSTRACT: Effective clinical management of prostate cancer (PCA) has been challenged by significant intratumoural heterogeneity on the genomic and pathological levels and limited understanding of the genetic elements governing disease progression. Here, we exploited the experimental merits of the mouse to test the hypothesis that pathways constraining progression might be activated in indolent Pten-null mouse prostate tumours and that inactivation of such progression barriers in mice would engender a metastasis-prone condition. Comparative transcriptomic and canonical pathway analyses, followed by biochemical confirmation, of normal prostate epithelium versus poorly progressive Pten-null prostate cancers revealed robust activation of the TGFβ/BMP-SMAD4 signalling axis. The functional relevance of SMAD4 was further supported by emergence of invasive, metastatic and lethal prostate cancers with 100% penetrance upon genetic deletion of Smad4 in the Pten-null mouse prostate. Pathological and molecular analysis as well as transcriptomic knowledge-based pathway profiling of emerging tumours identified cell proliferation and invasion as two cardinal tumour biological features in the metastatic Smad4/Pten-null PCA model. Follow-on pathological and functional assessment confirmed cyclin D1 and SPP1 as key mediators of these biological processes, which together with PTEN and SMAD4, form a four-gene signature that is prognostic of prostate-specific antigen (PSA) biochemical recurrence and lethal metastasis in human PCA. This model-informed progression analysis, together with genetic, functional and translational studies, establishes SMAD4 as a key regulator of PCA progression in mice and humans.
    Nature 02/2011; 470(7333):269-73. DOI:10.1038/nature09677 · 42.35 Impact Factor

Publication Stats

6k Citations
654.59 Total Impact Points

Institutions

  • 2005–2013
    • Stanford University
      • Department of Statistics
      • School of Humanities and Sciences
      Palo Alto, California, United States
    • University of Southern California
      Los Angeles, California, United States
  • 2001–2012
    • Harvard University
      • Department of Stem Cell and Regenerative Biology
      • Department of Molecular and Cell Biology
      • Department of Statistics
      Boston, MA, United States
  • 2004–2009
    • Harvard Medical School
      • Department of Obstetrics, Gynecology, and Reproductive Biology
      Boston, Massachusetts, United States
    • University of Massachusetts Boston
      Boston, Massachusetts, United States
  • 2008
    • University of California, Berkeley
      • Department of Statistics
      Berkeley, California, United States
  • 2002
    • Northeastern University
      • Department of Biology
      Boston, MA, United States