ABSTRACT: Long intergenic noncoding RNAs (lincRNAs) are derived from thousands of loci in mammalian genomes and are frequently enriched in transposable elements (TEs). Although families of TE-derived lincRNAs have recently been implicated in the regulation of pluripotency, little is known of the specific functions of individual family members. Here we characterize three new individual TE-derived human lincRNAs, human pluripotency-associated transcripts 2, 3 and 5 (HPAT2, HPAT3 and HPAT5). Loss-of-function experiments indicate that HPAT2, HPAT3 and HPAT5 function in preimplantation embryo development to modulate the acquisition of pluripotency and the formation of the inner cell mass. CRISPR-mediated disruption of the genes for these lincRNAs in pluripotent stem cells, followed by whole-transcriptome analysis, identifies HPAT5 as a key component of the pluripotency network. Protein binding and reporter-based assays further demonstrate that HPAT5 interacts with the let-7 microRNA family. Our results indicate that unique individual members of large primate-specific lincRNA families modulate gene expression during development and differentiation to reinforce cell fate.
ABSTRACT: A high-confidence, comprehensive human variant set is critical in assessing accuracy of sequencing algorithms, which are crucial in precision medicine based on high-throughput sequencing. Although recent works have attempted to provide such a resource, they still do not encompass all major types of variants including structural variants (SVs). Thus, we leveraged the massive high-quality Sanger sequences from the HuRef genome to construct by far the most comprehensive gold set of a single individual, which was cross validated with deep Illumina sequencing, population datasets, and well-established algorithms. It was a necessary effort to completely reanalyze the HuRef genome as its previously published variants were mostly reported five years ago, suffering from compatibility, organization, and accuracy issues that prevent their direct use in benchmarking. Our extensive analysis and validation resulted in a gold set with high specificity and sensitivity. In contrast to the current gold sets of the NA12878 or HS1011 genomes, our gold set is the first that includes small variants, deletion SVs and insertion SVs up to a hundred thousand base-pairs. We demonstrate the utility of our HuRef gold set to benchmark several published SV detection tools.
Article · Sep 2015 · Scientific Reports
ABSTRACT: SomaticSeq is an accurate somatic mutation detection pipeline implementing a stochastic boosting algorithm to produce highly accurate somatic mutation calls for both single nucleotide variants and small insertions and deletions. The workflow currently incorporates five state-of-the-art somatic mutation callers, and extracts over 70 individual genomic and sequencing features for each candidate site. A training set is provided to an adaptively boosted decision tree learner to create a classifier for predicting mutation statuses. We validate our results with both synthetic and real data. We report that SomaticSeq is able to achieve better overall accuracy than any individual tool incorporated.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-015-0758-2) contains supplementary material, which is available to authorized users.
ABSTRACT: Background / Purpose:
Structural variations (SVs) are large genomic rearrangements, including deletion, insertion, inversion, duplication and translocation. SV detection is a key challenge with next-generation sequencing reads since SVs are generally much larger than read length. Accuracy of SV detection varies significantly by type, region and size, and thus none of the existing approaches, read-pair (RP), split-read (SR), junction-mapping (JM) or read-depth (RD), fits all cases. Here we describe a novel approach for SV detection by integrating across multiple complementary methods and signals to improve accuracy across all SV types and sizes.
The novel SV detection method, MetaSV, performs excellently on simulated data by integrating multiple methods and signals into a single analysis. Future work will use local assembly integration and optimal alignment to further boost MetaSV accuracy and resolution. In addition, support will be added for intra- and inter-chromosomal translocations. Finally, a high-confidence SV gold set for NA12878 will be used for validation on experimental data.
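As a rough illustration of what integrating calls from several methods can look like, the sketch below clusters SV calls that share chromosome and type and overlap reciprocally by at least 50%, then reports which tools support each merged call. This is a generic merging heuristic and not MetaSV's actual algorithm; the `Call` record, the `merge_calls` function, the tool labels and the 0.5 threshold are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical call record: one SV reported by one detection method.
Call = namedtuple("Call", "chrom start end svtype tool")

def reciprocal_overlap(a, b):
    """Overlap length divided by the longer of the two intervals."""
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return overlap / max(a.end - a.start, b.end - b.start)

def merge_calls(calls, min_overlap=0.5):
    """Greedily cluster calls of the same chromosome and SV type."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.svtype, c.start, c.end)):
        for cluster in clusters:
            rep = cluster[0]  # first call in a cluster acts as its representative
            if (rep.chrom == call.chrom and rep.svtype == call.svtype
                    and reciprocal_overlap(rep, call) >= min_overlap):
                cluster.append(call)
                break
        else:
            clusters.append([call])
    # Summarize each cluster by its outer bounds and its supporting tools.
    return [
        {
            "chrom": cluster[0].chrom,
            "svtype": cluster[0].svtype,
            "start": min(c.start for c in cluster),
            "end": max(c.end for c in cluster),
            "tools": sorted({c.tool for c in cluster}),
        }
        for cluster in clusters
    ]

calls = [
    Call("chr1", 1000, 2000, "DEL", "rp"),
    Call("chr1", 1050, 1950, "DEL", "sr"),
    Call("chr1", 5000, 6000, "DEL", "rd"),
    Call("chr1", 1000, 2000, "DUP", "rd"),
]
merged = merge_calls(calls)
```

Calls supported by more than one signal type (here the first deletion, seen by both the read-pair and split-read labels) could then be ranked above singletons, which is one simple way to trade sensitivity against precision.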
ABSTRACT: Background / Purpose:
Currently there is a lack of a comprehensive simulation validation framework for next-generation sequencing (NGS) analysis. Multiple agreed-upon validation datasets are critical for development of new secondary analysis methods, and read simulation is a bottleneck when simulating high coverage data. The Genome in a Bottle Consortium has generated a gold standard set of variants by combining multiple sequencing technologies. However, this is not scalable for generation of multiple datasets. Here we present an alternative method, VarSim.
Simulation is an important validation methodology as it allows for easy generation of multiple datasets. The VarSim method provides a comprehensive simulation validation framework. Future efforts will add support for translocations and for simulating copy number alterations in cancer.
ABSTRACT: Consider a class of densities that are piecewise constant functions over partitions of the sample space defined by sequential coordinate partitioning. We introduce a prior distribution for a density in this function class and derive in closed form the marginal posterior distribution of the corresponding partition. A computationally efficient method, based on sequential importance sampling, is presented for the inference of the partition from this posterior distribution. Compared to traditional approaches such as the kernel method or the histogram, the Bayesian sequential partitioning (BSP) method proposed here is capable of providing much more accurate estimates when the sample space is of moderate to high dimension. We illustrate this by simulated as well as real data examples. The examples also demonstrate how BSP can be used to design new classification methods competitive with the state of the art.
Article · Dec 2013 · Journal of the American Statistical Association
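To make the function class in the abstract above concrete, the following sketch builds a partition of the unit cube by a sequence of coordinate-wise bisections and evaluates the piecewise constant density that assigns each region its sample fraction divided by its volume. It covers only the function class: the prior, the closed-form posterior and the sequential importance sampling step are not implemented. Midpoint cuts, the unit-cube sample space and all function names are assumptions made for the illustration.

```python
def split(region, dim):
    """Bisect a box [(lo, hi), ...] at the midpoint of coordinate `dim`."""
    lo, hi = region[dim]
    mid = (lo + hi) / 2.0
    left, right = list(region), list(region)
    left[dim] = (lo, mid)
    right[dim] = (mid, hi)
    return left, right

def build_partition(d, cuts):
    """Apply cuts in order; each cut is (index of region to split, coordinate)."""
    regions = [[(0.0, 1.0)] * d]
    for idx, dim in cuts:
        left, right = split(regions.pop(idx), dim)
        regions.extend([left, right])
    return regions

def volume(region):
    v = 1.0
    for lo, hi in region:
        v *= hi - lo
    return v

def contains(region, x):
    # Half-open boxes, so a coordinate equal to 1.0 falls outside every region.
    return all(lo <= xi < hi for (lo, hi), xi in zip(region, x))

def density_estimate(regions, data):
    """Piecewise constant density: n_i / (n * volume_i) for each region i."""
    n = len(data)
    return [sum(contains(r, x) for x in data) / (n * volume(r)) for r in regions]

# Two sequential cuts in 2-D: halve along x, then halve the left half along y.
regions = build_partition(2, [(0, 0), (0, 1)])
data = [(0.1, 0.1), (0.2, 0.3), (0.7, 0.5), (0.3, 0.9)]
dens = density_estimate(regions, data)
total_mass = sum(d * volume(r) for d, r in zip(dens, regions))
```

By construction the estimate integrates to one over the cube whenever every data point lies in some region, which is what makes each partition correspond to a valid density.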
ABSTRACT: Haplotype, or the sequence of alleles along a single chromosome, has important applications in phenotype-genotype association studies, as well as in population genetics analyses. Because haplotype cannot be experimentally assayed in diploid organisms in a high-throughput fashion, numerous statistical methods have been developed to reconstruct probable haplotype from genotype data. These methods focus primarily on accurate phasing of a short genomic region with a small number of markers, and the error rate increases rapidly for longer regions. Here we introduce a new phasing algorithm, , which aims to improve long-range phasing accuracy. Using datasets from multiple populations, we found that reduces long-range phasing errors by up to 50% compared to the current state-of-the-art methods. In addition to inferring the most likely haplotypes, produces confidence measures, allowing downstream analyses to account for the uncertainties associated with some haplotypes. We anticipate that offers a powerful tool for analyzing large-scale data generated in the genome-wide association studies (GWAS).
Article · Oct 2013 · Statistica Sinica
ABSTRACT: Molecular insights into somatic cell reprogramming to induced pluripotent stem cells (iPS) would aid regenerative medicine, but are difficult to elucidate in iPS because of their heterogeneity, as relatively few cells undergo reprogramming (0.1-1%). To identify early acting regulators, we capitalized on non-dividing heterokaryons (mouse embryonic stem cells fused to human fibroblasts), in which reprogramming towards pluripotency is efficient and rapid, enabling the identification of transient regulators required at the onset. We used bi-species transcriptome-wide RNA-seq to quantify transcriptional changes in the human somatic nucleus during reprogramming towards pluripotency in heterokaryons. During heterokaryon reprogramming, the cytokine interleukin 6 (IL6), which is not detectable at significant levels in embryonic stem cells, was induced 50-fold. A 4-day culture with IL6 at the onset of iPS reprogramming replaced stably transduced oncogenic c-Myc such that transduction of only Oct4, Klf4 and Sox2 was required. IL6 also activated another Jak/Stat target, the serine/threonine kinase gene Pim1, which accounted for the IL6-mediated twofold increase in iPS frequency. In contrast, LIF, another induced GP130 ligand, failed to increase iPS frequency or activate c-Myc or Pim1, thereby revealing a differential role for the two Jak/Stat inducers in iPS generation. These findings demonstrate the power of heterokaryon bi-species global RNA-seq to identify early acting regulators of reprogramming, for example, extrinsic replacements for stably transduced transcription factors such as the potent oncogene c-Myc.
ABSTRACT: Analyzing the failure times of multiple events is of interest in many fields. Estimating the joint distribution of the failure times in a non-parametric way is not straightforward because some failure times are often right-censored and only known to be greater than observed follow-up times. Although it has been studied, there is no universally optimal solution for this problem. It is still challenging and important to provide alternatives that may be more suitable than existing ones in specific settings. Related problems of the existing methods are not only limited to infeasible computations, but also include the lack of optimality and possible non-monotonicity of the estimated survival function. In this paper, we proposed a non-parametric Bayesian approach for directly estimating the density function of multivariate survival times, where the prior is constructed based on the optional Pólya tree. We investigated several theoretical aspects of the procedure and derived an efficient iterative algorithm for implementing the Bayesian procedure. The empirical performance of the method was examined via extensive simulation studies. Finally, we presented a detailed analysis using the proposed method on the relationship among organ recovery times in severely injured patients. From the analysis, we suggested interesting medical information that can be further pursued in clinics.
ABSTRACT: OBJECTIVE: To test whether the probability of having a live birth (LB) with the first IVF cycle (C1) can be predicted and personalized for patients in diverse environments. DESIGN: Retrospective validation of multicenter prediction model. SETTING: Three university-affiliated outpatient IVF clinics located in different countries. PATIENT(S): Using primary models aggregated from >13,000 C1s, we applied the boosted tree method to train a preIVF-diversity model (PreIVF-D) with 1,061 C1s from 2008 to 2009, and validated predicted LB probabilities with an independent dataset comprising 1,058 C1s from 2008 to 2009. INTERVENTION(S): None. MAIN OUTCOME MEASURE(S): Predictive power, reclassification, receiver operator characteristic analysis, calibration, dynamic range. RESULT(S): Overall, with PreIVF-D, 86% of cases had significantly different LB probabilities compared with age control, and more than one-half had higher LB probabilities. Specifically, 42% of patients could have been identified by PreIVF-D to have a personalized predicted success rate >45%, whereas an age-control model could not differentiate them from others. Furthermore, PreIVF-D showed improved predictive power, with 36% improved log-likelihood (or 9.0-fold by log-scale; >1,000-fold linear scale), and prediction errors for subgroups ranged from 0.9% to 3.7%. CONCLUSION(S): Validated prediction of personalized LB probabilities from diverse multiple sources identifies excellent prognoses in more than one-half of patients.
Article · Mar 2013 · Fertility and sterility
ABSTRACT: Landmark events occur in a coordinated manner during pre-implantation development of the mammalian embryo, yet the regulatory network that orchestrates these events remains largely unknown. Here, we present the first systematic investigation of the network in pre-implantation mouse embryos using morpholino-mediated gene knockdowns of key embryonic stem cell (ESC) factors followed by detailed transcriptome analysis of pooled embryos, single embryos, and individual blastomeres. We delineated the regulons of Oct4, Sall4, and Nanog and identified a set of metabolism- and transport-related genes that were controlled by these transcription factors in embryos but not in ESCs. Strikingly, the knockdown embryos arrested at a range of developmental stages. We provided evidence that the DNA methyltransferase Dnmt3b has a role in determining the extent to which a knockdown embryo can develop. We further showed that the feed-forward loop comprising Dnmt3b, the pluripotency factors, and the miR-290-295 cluster exemplifies a network motif that buffers embryos against gene expression noise. Our findings indicate that Oct4, Sall4, and Nanog form a robust and integrated network to govern mammalian pre-implantation development.
Article · Jan 2013 · Molecular Systems Biology
ABSTRACT: In the vertebrate neural tube, regional Sonic hedgehog (Shh) signaling invokes a time- and concentration-dependent induction of six different cell populations mediated through Gli transcriptional regulators. Elsewhere in the embryo, Shh/Gli responses invoke different tissue-appropriate regulatory programs. A genome-scale analysis of DNA binding by Gli1 and Sox2, a pan-neural determinant, identified a set of shared regulatory regions associated with key factors central to cell fate determination and neural tube patterning. Functional analysis in transgenic mice validates core enhancers for each of these factors and demonstrates the dual requirement for Gli1 and Sox2 inputs for neural enhancer activity. Furthermore, through an unbiased determination of Gli-binding site preferences and analysis of binding site variants in the developing mammalian CNS, we demonstrate that differential Gli-binding affinity underlies threshold-level activator responses to Shh input. In summary, our results highlight Sox2 input as a context-specific determinant of the neural-specific Shh response and differential Gli-binding site affinity as an important cis-regulatory property critical for interpreting Shh morphogen action in the mammalian neural tube.
Article · Dec 2012 · Genes & development
ABSTRACT: Retroviral overexpression of reprogramming factors (Oct4, Sox2, Klf4, c-Myc) generates induced pluripotent stem cells (iPSCs). However, the integration of foreign DNA could induce genomic dysregulation. Cell-permeant proteins (CPPs) could overcome this limitation. To date, this approach has proved exceedingly inefficient. We discovered a striking difference in the pattern of gene expression induced by viral versus CPP-based delivery of the reprogramming factors, suggesting that a signaling pathway required for efficient nuclear reprogramming was activated by the retroviral, but not CPP approach. In gain- and loss-of-function studies, we find that the toll-like receptor 3 (TLR3) pathway enables efficient induction of pluripotency by viral or mmRNA approaches. Stimulation of TLR3 causes rapid and global changes in the expression of epigenetic modifiers to enhance chromatin remodeling and nuclear reprogramming. Activation of inflammatory pathways is required for efficient nuclear reprogramming in the induction of pluripotency.
ABSTRACT: Current-generation DNA sequencing instruments are moving closer to seamlessly sequencing genomes of entire populations as a routine part of scientific investigation. However, while significant inroads have been made identifying small nucleotide variation and structural variations in DNA that impact phenotypes of interest, progress has not been as dramatic regarding epigenetic changes and base-level damage to DNA, largely due to technological limitations in assaying all known and unknown types of modifications at genome scale. Recently single molecule real time (SMRT) sequencing has been reported to identify kinetic variation (KV) events that have been demonstrated to reflect epigenetic changes of every known type, providing a path forward for detecting base modifications as a routine part of sequencing. However, to date, no statistical framework has been proposed to enhance the power to detect these events while also controlling for false positive events. By modeling enzyme kinetics in the neighborhood of an arbitrary location in a genomic region of interest as a conditional random field, we provide a statistical framework for incorporating kinetic information at test positions of interest as well as at neighboring sites that help enhance the power to detect KV events. The performance of this and related models is explored, with the best performing model applied to plasmid DNA isolated from Escherichia coli and mitochondrial DNA isolated from human brain tissue. We highlight widespread kinetic variation events, some of which strongly associate with known modification events while others represent putative chemically modified sites of unknown types.