# Wing H Wong's research while affiliated with Stanford University and other places

## Publications (226)

Article
Full-text available
Single-cell multiomics data continues to grow at an unprecedented pace. Although several methods have demonstrated promising results in integrating several data modalities from the same tissue, the complexity and scale of data compositions present in cell atlases still pose a challenge. Here, we present scJoint, a transfer learning method to integr...
Article
Motivation: Isoform deconvolution is an NP-hard problem. The accuracy of the proposed solutions is far from perfect. At present, it is not known if gene structure and isoform concentration can be uniquely inferred given paired-end reads, and there is no objective method to select the fragment length to improve the number of identifiable genes. Diff...
Article
Full-text available
Significance: T cell exhaustion is a major barrier to cancer immunotherapy. T cell exhaustion is the state of T cell dysfunction after chronic stimulation, and recent studies indicate that exhaustion is epigenetically controlled and associated with unique chromatin profiles. This work reports the genome-wide map of active DNA regulatory elements and...
Preprint
Full-text available
Dysfunction in T cells limits the efficacy of cancer immunotherapy. We profiled the epigenome, transcriptome, and enhancer connectome of exhaustion-prone GD2-targeting HA-28z chimeric antigen receptor (CAR) T cells and control CD19-targeting CAR T cells, which present less exhaustion-inducing tonic signaling, at multiple points during their ex vivo...
Preprint
Full-text available
Single-cell multi-omics data continues to grow at an unprecedented pace, and while integrating different modalities holds the promise for better characterisation of cell identities, it remains a significant computational challenge. In particular, extreme sparsity is a hallmark in many modalities such as scATAC-seq data and often limits their power...
Preprint
Full-text available
We have developed a generally applicable method based on CRISPR/Cas9-targeted ultra-long read sequencing (CTLR-Seq) to completely and haplotype-specifically resolve, at base-pair resolution, large, complex, and highly repetitive genomic regions that had been previously impenetrable to next-generation sequencing analysis such as large segmental dupl...
Article
Traditional MCMC algorithms are computationally intensive and do not scale well to large data. In particular, the Metropolis-Hastings (MH) algorithm requires passing over the entire dataset to evaluate the likelihood ratio in each iteration. We propose a general framework for performing MH-MCMC using mini-batches of the whole dataset and show that...
Preprint
Traditional MCMC algorithms are computationally intensive and do not scale well to large data. In particular, the Metropolis-Hastings (MH) algorithm requires passing over the entire dataset to evaluate the likelihood ratio in each iteration. We propose a general framework for performing MH-MCMC using mini-batches of the whole dataset and show that...
Article
Full-text available
HepG2 is one of the most widely used human cancer cell lines in biomedical research and one of the main cell lines of ENCODE. Although the functional genomic and epigenomic characteristics of HepG2 are extensively studied, its genome sequence has never been comprehensively analyzed and higher order genomic structural features are largely unknown. T...
Article
K562 is widely used in biomedical research. It is one of three tier-one cell lines of ENCODE and also most commonly used for large-scale CRISPR/Cas9 screens. Although its functional genomic and epigenomic characteristics have been extensively studied, its genome sequence and genomic structural features have never been comprehensively analyzed. Such...
Article
Full-text available
Purpose: Despite the successful progress next-generation sequencing technologies have achieved in diagnosing the genetic cause of rare Mendelian diseases, the current diagnostic rate is still far from satisfactory because of heterogeneity, imprecision, and noise in disease phenotype descriptions and insufficient utilization of expert knowledge in c...
Article
Full-text available
We produced an extensive collection of deep re-sequencing datasets for the Venter/HuRef genome using the Illumina massively parallel DNA sequencing platform. The original Venter genome sequence is a very high-quality phased assembly based on Sanger sequencing. Therefore, researchers developing novel computational tools for the analysis of human gen...
Preprint
Full-text available
The HepG2 cancer cell line is one of the most widely used cell lines in biomedical research and one of the main cell lines of ENCODE. Vast numbers of functional genomics and epigenomics datasets have been produced to characterize its biology. However, the correct interpretation of such data requires an understanding of the cell line’s genome sequence and genome str...
Preprint
Full-text available
Tissue development results from lineage-specific transcription factors (TF) programming a dynamic chromatin landscape through progressive cell fate transitions. Here, we interrogate the epigenomic landscape during epidermal differentiation and create an inference network that ranks the coordinate effects of TF-accessible regulatory element-target g...
Preprint
Full-text available
K562 is widely used in biomedical research. It is one of three tier-one cell lines of ENCODE and also most commonly used for large-scale CRISPR/Cas9 screens. Although its functional genomic and epigenomic characteristics have been extensively studied, its genome sequence and genomic structural features have never been comprehensively analyzed. Such...
Preprint
Gibbs sampling is the de facto Markov chain Monte Carlo method used for inference and learning on large scale graphical models. For complicated factor graphs with lots of factors, the performance of Gibbs sampling can be limited by the computational cost of executing a single update step of the Markov chain. This cost is proportional to the degree...
Conference Paper
Article
Intrinsic noise, the stochastic cell-to-cell fluctuations in mRNAs and proteins, has been observed and shown to play important roles in cellular systems. Owing to recent developments in single-cell-level measurement technology, studies of intrinsic noise are becoming increasingly popular among scholars. The chemical master equation (CME) has...
Article
In our recent paper, we showed that in exponential family, contrastive divergence (CD) with fixed learning rate will give asymptotically consistent estimates \cite{wu2016convergence}. In this paper, we establish consistency and convergence rate of CD with annealed learning rate $\eta_t$. Specifically, suppose CD-$m$ generates the sequence of parame...
Article
Full-text available
This paper studies the contrastive divergence algorithm for approximate Maximum Likelihood Estimate (MLE) in exponential families, by relating it to Markov chain theory and the stochastic stability literature. We prove that, with asymptotic probability 1, the algorithm generates a sequence of parameter guesses which converges to an invariant distribu...
Article
Full-text available
Long intergenic noncoding RNAs (lincRNAs) are derived from thousands of loci in mammalian genomes and are frequently enriched in transposable elements (TEs). Although families of TE-derived lincRNAs have recently been implicated in the regulation of pluripotency, little is known of the specific functions of individual family members. Here we charac...
Article
Full-text available
Approximate Bayesian Computation (ABC) methods are used to approximate posterior distributions in models with unknown or computationally intractable likelihoods. Both the accuracy and computational efficiency of ABC depend on the choice of summary statistic, but outside of special cases where the optimal summary statistics are known, it is unclear...
Article
Full-text available
A high-confidence, comprehensive human variant set is critical in assessing accuracy of sequencing algorithms, which are crucial in precision medicine based on high-throughput sequencing. Although recent works have attempted to provide such a resource, they still do not encompass all major types of variants including structural variants (SVs). Thus...
Article
Full-text available
SomaticSeq is an accurate somatic mutation detection pipeline implementing a stochastic boosting algorithm to produce highly accurate somatic mutation calls for both single nucleotide variants and small insertions and deletions. The workflow currently incorporates five state-of-the-art somatic mutation callers, and extracts over 70 individual genom...
Article
Regulation of gene expression changes during chondrogenic differentiation by DNA methylation and demethylation is little understood. Methylated cytosines (5mC) are oxidized by the ten-eleven-translocation (TET) proteins to 5-hydroxymethylcytosines (5hmC), 5-formylcytosines (5fC) and 5-carboxylcytosines (5caC) eventually leading to a replacement by...
Article
Objective: To examine genome-wide 5hmC distribution in osteoarthritic (OA) and normal chondrocytes to investigate the effect on OA-specific gene expression. Methods: Cartilage was obtained from OA patients undergoing total knee arthroplasty or control patients undergoing anterior cruciate ligament reconstruction. Genome-wide sequencing of 5hmC-enrich...
Article
Full-text available
Structural variations (SVs) are large genomic rearrangements that vary significantly in size, making them challenging to detect with the relatively short reads from next-generation sequencing (NGS). Different SV detection methods have been developed; however, each is limited to specific kinds of SVs with varying accuracy and resolution. Previous wo...
Article
Full-text available
VarSim is a framework for assessing alignment and variant calling accuracy in high-throughput genome sequencing through simulation or real data. In contrast to simulating a random mutation spectrum, it synthesizes diploid genomes with germline and somatic mutations based on a realistic model. This model leverages information such as previously repo...
Conference Paper
Background / Purpose: Currently there is a lack of a comprehensive simulation and validation framework for next-generation sequencing (NGS) analysis. Multiple agreed-upon validation datasets are critical for development of new secondary analysis methods, and read simulation is a bottleneck when simulating high-coverage data. The Genome in a Bottle cons...
Conference Paper
Background / Purpose: Structural variations (SVs) are large genomic rearrangements, including deletion, insertion, inversion, duplication and translocation. SV detection is a key challenge with next-generation sequencing reads since SVs are generally much larger than read length. Accuracy of SV detection varies significantly by type, region and s...
Article
Consider a class of densities that are piecewise constant functions over partitions of the sample space defined by sequential coordinate partitioning. We introduce a prior distribution for a density in this function class and derive in closed form the marginal posterior distribution of the corresponding partition. A computationally efficient method...
Data
Full-text available
Article
A haplotype, or the sequence of alleles along a single chromosome, has important applications in phenotype-genotype association studies, as well as in population genetics analyses. Because haplotypes cannot be experimentally assayed in diploid organisms in a high-throughput fashion, numerous statistical methods have been developed to reconstruct proba...
Article
Full-text available
Molecular insights into somatic cell reprogramming to induced pluripotent stem cells (iPS) would aid regenerative medicine, but are difficult to elucidate in iPS because of their heterogeneity, as relatively few cells undergo reprogramming (0.1-1%). To identify early acting regulators, we capitalized on non-dividing heterokaryons (mouse em...
Article
Analyzing the failure times of multiple events is of interest in many fields. Estimating the joint distribution of the failure times in a non-parametric way is not straightforward because some failure times are often right-censored and only known to be greater than observed follow-up times. Although it has been studied, there is no universally opti...
Article
Full-text available
Objective: To test whether the probability of having a live birth (LB) with the first IVF cycle (C1) can be predicted and personalized for patients in diverse environments. Design: Retrospective validation of a multicenter prediction model. Setting: Three university-affiliated outpatient IVF clinics located in different countries. Patient(s):...
Data
Performance of the hierarchical model in partially modified plasmid data. The red, green and blue curves are ROC curves for the hierarchical model with control data, the case-control method, and the hierarchical model without control data, respectively. These three methods were tested on two different datasets: 1) a 3,589 bases long plasmid with 19...
Data
Full-text available
EM algorithm for fitting the hierarchical model. Text S1 provides a detailed description of the EM (Expectation-Maximization) algorithm used for estimating hyperparameters of the proposed hierarchical model. (PDF)
Article
Full-text available
Author Summary: DNA modifications have been found in a wide range of living organisms, from bacteria to humans. Many existing studies have shown that they play important roles in development, disease, bacterial virulence, etc. However, for many types of DNA modification, for example N6-methyladenine and 8-oxoG, there is not an efficient and accurate d...
Data
Full-text available
Article
Full-text available
Landmark events occur in a coordinated manner during pre-implantation development of the mammalian embryo, yet the regulatory network that orchestrates these events remains largely unknown. Here, we present the first systematic investigation of the network in pre-implantation mouse embryos using morpholino-mediated gene knockdowns of key embryonic...
Article
Full-text available
In the vertebrate neural tube, regional Sonic hedgehog (Shh) signaling invokes a time- and concentration-dependent induction of six different cell populations mediated through Gli transcriptional regulators. Elsewhere in the embryo, Shh/Gli responses invoke different tissue-appropriate regulatory programs. A genome-scale analysis of DNA binding by...