About
73
Publications
19,052
Reads
2,714
Citations
Introduction
Bioinformatics, NGS, Systems Biology, Computer Science, Software Engineering
Additional affiliations
January 2017 - April 2017
August 2011 - December 2016
January 2014 - January 2015
Publications (73)
High-fidelity (HiFi) sequencing has facilitated the assembly and analysis of the most repetitive region of the genome, the centromere. Nevertheless, our current understanding of human centromeres is based on a relatively small number of telomere-to-telomere assemblies, which have not yet captured their full diversity. In this study, we investigated th...
Understanding the roles played by centromeres in chromosome evolution and speciation is complicated by the fact that centromeres comprise large arrays of tandemly repeated satellite DNA, which hinders high-quality assembly. Here, we used long-read sequencing to generate nearly complete genome assemblies for four karyotypically diverse Papaver speci...
Background
Centromeres play a crucial and conserved role in cell division, although their composition and evolutionary history in green algae, the evolutionary ancestors of land plants, remain largely unknown.
Results
We constructed near telomere-to-telomere (T2T) assemblies for two Trebouxiophyceae species, Chlorella sorokiniana NS4-2 and Chlore...
Background: Centromeres play a crucial and conserved role in cell division, although their composition and evolutionary history in green algae, the evolutionary ancestors of land plants, remain largely unknown. Results: We constructed near telomere-to-telomere (T2T) assemblies for two Trebouxiophyceae species, Chlorella sorokiniana NS4-2 and Chlor...
High-fidelity (HiFi) sequencing has facilitated the assembly and analysis of the most repetitive region of the genome, the centromere. Nevertheless, our current understanding of human centromeres draws from a relatively small number of telomere-to-telomere assemblies, and so has not yet captured their full diversity. In this study, we investigated th...
Tandem duplication (TD) is a major type of structural variation (SV) that plays an important role in novel gene formation and human diseases. However, TDs are often missed or incorrectly classified as insertions by most modern SV detection methods due to the lack of specialized operation on TD-related mutational signals. Herein, we developed a TD d...
Microsatellite instability (MSI) is an indispensable biomarker in cancer immunotherapy. Currently, MSI scoring methods based on high-throughput omics data have gained popularity and demonstrated better performance than the gold standard method for MSI detection. However, the MSI detection method on expression data, especially single-cell expression da...
Background
Recent state-of-the-art sequencing technologies enable the investigation of challenging regions in the human genome and expand the scope of variant benchmarking datasets. Herein, we sequence a Chinese Quartet, comprising two monozygotic twin daughters and their biological parents, using four short and long sequencing platforms (Illumina,...
Background: Microsatellite instability (MSI) is an indispensable biomarker in cancer immunotherapy. Currently, MSI scoring methods based on high-throughput omics data have gained popularity and demonstrated better performance than the gold standard method for MSI detection. However, the MSI detection method on expression data, especially single-cell expre...
Tandem duplication (TD) is a major type of structural variation (SV) and plays an important role in novel gene formation and human diseases. However, TDs are often missed or incorrectly classified as insertions by most modern SV detection methods due to the lack of specialized operations on TD-related mutational signals. Herein, we developed...
Human genomics is witnessing an ongoing paradigm shift from a single reference sequence to a pangenome form, but populations of Asian ancestry are underrepresented. Here we present data from the first phase of the Chinese Pangenome Consortium, including a collection of 116 high-quality and haplotype-phased de novo assemblies based on 58 core sample...
Significant improvements in long-read sequencing technologies have unlocked complex genomic regions, such as centromeres, and introduced the centromere annotation problem. Currently, centromeres are annotated in a semi-manual way. Here, we propose HiCAT, a generalizable automatic centromere annotation tool, based on hierarchical tandem...
Objective: The efficacy of immunotherapy for cholangiocarcinoma (CCA) is blocked by a high degree of tumor heterogeneity. Cell communication contributes to heterogeneity in the tumor microenvironment. This study aimed to explore critical cell signaling and biomarkers induced via cell communication during immune exhaustion in CCA.
Methods: We constr...
As state-of-the-art sequencing technologies and computational methods enable investigation of challenging regions in the human genome, an updated variant benchmark is needed. Herein, we sequenced a Chinese Quartet, consisting of two monozygotic twin daughters and their biological parents, with multiple advanced sequencing platforms, including...
Human genomics is witnessing an ongoing paradigm shift from a single reference sequence to a pangenome form but populations of Asian ancestry are underrepresented. Here, we present the first effort (Phase I) of the Chinese Pangenome Consortium (CPC) with a collection of 116 high-quality and haplotype-phased de novo assemblies based on 58 core sampl...
Complex structural variants (CSVs) encompass multiple breakpoints and are often missed or misinterpreted. We developed SVision, a deep-learning-based multi-object-recognition framework, to automatically detect and characterize CSVs from long-read sequencing data. SVision outperforms current callers at identifying the internal structure of complex ev...
The 1000 Genomes Project (1kGP) is the largest fully open resource of whole-genome sequencing (WGS) data consented for public distribution without access or use restrictions. The final, phase 3 release of the 1kGP included 2,504 unrelated samples from 26 populations and was based primarily on low-coverage WGS. Here, we present a high-coverage 3,202...
The execution of biological activities inside space‐limited cell nuclei requires sophisticated organization. Current studies on the 3D genome focus on chromatin interactions and local structures, e.g., topologically associating domains (TADs). In this study, two global physical properties, DNA density and distance to nuclear periphery (DisTP), are...
Background
The efficacy of immunotherapy in cholangiocarcinoma (CCA) is blocked by its high degree of tumor heterogeneity.
Methods
We constructed empirical Bayes and Markov random field models to calculate transcription factors, interaction genes, and associated signaling pathways involved in cell-cell communication using single cell RNAseq data....
Efficacy of immunotherapy in hepatocellular carcinoma (HCC) is blocked by its high degree of inter‐ and intra‐tumor heterogeneity and immunosuppressive tumor microenvironment. However, the correlation between tumor heterogeneity and immunosuppressive microenvironment in HCC has not been well addressed. Here, we endeavored to dissect inter‐ and intr...
The advantages of both the length and accuracy of high-fidelity (HiFi) reads enable chromosome-scale haplotype-resolved genome assembly. In this study, we sequenced a cell line named HJ, established from a Chinese Han male individual by using HiFi and Hi-C. We assembled two high-quality haplotypes of the HJ genome (haplotype 1 (H1): 3.1 Gb, haploty...
Background and aims:
Intrahepatic cholangiocarcinoma (ICC) is not fully investigated and how stromal cells contribute to ICC formation is poorly understood. We aimed to uncover ICC origin, cellular heterogeneity, and critical modulators during ICC initiation/progression, and decipher how fibroblast and endothelial cells in the stromal compartment...
Significant improvements in genome sequencing and assembly technology have led to increasing numbers of high-quality genomes, revealing complex evolutionary scenarios such as multiple whole-genome duplication (WGD) events, which hinder ancestral genome reconstruction via the currently available computational frameworks. Here, we present the Inferr...
Opium poppy (Papaver somniferum), which produces benzylisoquinoline alkaloids (BIA), is an important medicinal plant. However, due to the 70.9% repetitive content and the whole‐genome duplication event of the opium poppy genome, it is difficult to generate accurate and comprehensive gene annotations. To overcome this problem, we used the PacBio sin...
We aimed to develop a whole-genome sequencing (WGS)-based copy number variant (CNV) calling algorithm with the potential to replace chromosomal microarray assay (CMA) for clinical diagnosis. JAX-CNV was thus developed for CNV detection from WGS. The performance of this CNV calling algorithm was evaluated in a blinded manner on 31 samples and compa...
Complex structural variants (CSVs) encompass multiple breakpoints and are often missed or misinterpreted by state-of-the-art variant detection algorithms. We developed SVision, a deep-learning-based multi-object recognition framework, to automatically detect and characterize CSVs from long-read data. SVision outperforms current variant callers at i...
The microbiome is prevalent throughout human bodies, with profound health implications. However, it remains unclear whether it is present and active in human CSF, which has been long considered aseptic due to the blood-brain barrier.
Integration of hepatitis B virus (HBV) into the human genome disrupts genetic structures and cellular functions. Here, we conducted multiplatform long-read sequencing on two cell lines and five clinical samples of HBV-induced hepatocellular carcinomas (HCC). We resolved two types of complex genome rearrangements induced by viral integration and establishe...
Over millions of years, plants have evolved many structurally diverse secondary metabolites (SMs) to support their sessile lifestyles through continuous biochemical pathway innovation. While new genes commonly drive the evolution of plant SM pathways, how a full biosynthetic pathway evolves remains poorly understood. The evolution of a pathway involves r...
Arabidopsis thaliana is an important and long-established model species for plant molecular biology, genetics, epigenetics, and genomics. However, the latest version of the reference genome still contains a significant number of missing segments. Here, we report a high-quality and almost complete Col-0 genome assembly with two gaps (Col-XJTU) using combi...
Complex structural variants (CSVs) are genomic alterations that have more than two breakpoints and are considered as the simultaneous occurrence of simple structural variants. However, detecting the compounded mutational signals of CSVs is challenging through a commonly used model-match strategy. As a result, there has been limited progress for CSV...
Here, we report a high-quality (HQ) and almost complete genome assembly, with a single gap and a quality value (QV) greater than 60, of the model plant Arabidopsis thaliana ecotype Columbia (Col-0), generated using a combination of Oxford Nanopore Technology (ONT) ultra-long reads, high-fidelity (HiFi) reads and Hi-C data. The total genome assembly size i...
We aimed to develop a whole-genome sequencing (WGS)-based copy number variant (CNV) calling algorithm with the potential to replace chromosomal microarray assay (CMA) for clinical diagnosis. JAX-CNV was thus developed for CNV detection from WGS. The performance of this CNV calling algorithm was evaluated in a blinded manner on 31 samples and compa...
Complex structural variants (CSVs) are genomic alterations that have more than two breakpoints and are considered as the simultaneous occurrence of simple structural variants. However, detecting the compounded mutational signals of CSVs is challenging through a commonly used model-match strategy. As a result, there has been limited progress for CSV dis...
Resolving genomic structural variation
Many human genomes have been reported using short-read technology, but it is difficult to resolve structural variants (SVs) using these data. These genomes thus lack comprehensive comparisons among individuals and populations. Ebert et al. used long-read structural variation calling across 64 human genomes rep...
Long-read and strand-specific sequencing technologies together facilitate the de novo assembly of high-quality haplotype-resolved human genomes without parent–child trio data. We present 64 assembled haplotypes from 32 diverse human genomes. These highly contiguous haplotype assemblies (average contig N50: 26 Mbp) integrate all forms of genetic var...
Background
Current single-cell analysis methods annotate cell types at the cluster level rather than, ideally, at the single-cell level. Multiple exchangeable clustering methods and many tunable parameters have a substantial impact on the clustering outcome, often leading to incorrect cluster-level annotation or multiple runs of subsequent clustering steps....
Cerebrospinal fluid circulating in the human central nervous system has long been considered aseptic in healthy individuals, because the blood-brain barrier normally protects against microbial invasion. However, this dogma has been questioned by several reports that microbes were identified in human brains, raising the question of whether a microbial comm...
Background:
Various models have been applied to predict the trend of the epidemic since the outbreak of COVID-19.
Methods:
In this study, we designed a dynamic graph model, not to precisely predict the number of infected cases, but to give an overview of the dynamics under a public epidemic emergency and of the different contributing factors....
Background: SARS-CoV-2 has become a pandemic and researchers have built phylogenetic trees to trace the spread of the virus. However, the accumulation rate of variations and mutational hotspots remain largely unclear.
Results: We collected more than 3,100 SARS-CoV-2 genome sequences from GISAID and profiled the landscape of whole genome variations....
Microsatellite instability (MSI) is a key biomarker for cancer therapy and prognosis. Traditional experimental assays are laborious and time-consuming, and next-generation sequencing-based computational methods do not work on leukemia samples, paraffin-embedded samples, or patient-derived xenografts/organoids, due to the requirement of matched norm...
Since the outbreak of the 2019 novel coronavirus (2019-nCoV) in the hardest-hit city of Wuhan, the fast-moving spread has killed over three hundred people and infected more than ten thousand in China [1]. There are more than one hundred cases outside of China, affecting a dozen countries globally [2]. The genome sequence of 2019-nCoV has been reported an...
We developed MSIsensor-pro (https://github.com/xjtu-omics/msisensor-pro), an open-source single sample microsatellite instability (MSI) scoring method for research and clinical applications. MSIsensor-pro introduces a multinomial distribution model to quantify polymerase slippages for each tumor sample and a discriminative sites selection method to...
Motivation:
Tumor purity is a fundamental property of each cancer sample and affects downstream investigations. Current tumor purity estimation methods either require a matched normal sample or report moderately high tumor purity even on normal samples. It is critical to develop a novel computational approach to estimate tumor purity with sufficient...
A recent study on human structural variation indicates insufficiencies and errors in the human reference genome, GRCh38, and argues for the construction of a human pan-genome.
Phylogenetic trees are essential for understanding evolution and are usually constructed through multiple sequence alignment, which suffers from heavy computational burdens and requires sophisticated parameter tuning. Recently, alignment-free methods based on k-mer profiles or common substrings have provided alternative ways to construct phylogenetic trees....
Poppy genome reveals evolution of opiates
The opium poppy has been a source of painkillers since Neolithic times. Attendant risks of addiction threaten many today. Guo et al. now deliver a draft of the opium poppy genome, which encompasses 2.72 gigabases assembled into 11 chromosomes and predicts more than 50,000 protein-coding genes. A particularl...
Genetic variations are important evolutionary forces in all forms of life in nature. Accurate and efficient detection of various forms of genetic variants is crucial for understanding cell function, evolution and diseases in living organisms. In this chapter, we describe a detailed protocol that uses Pindel, a split-read algorithm, to discover inde...
Although growing evidence demonstrates that long non-coding ribonucleic acids (lncRNAs) are critical modulators of cancers, knowledge about the DNA methylation patterns of lncRNAs is quite limited. We develop a systematic analysis pipeline to discover DNA methylation patterns for lncRNAs across multiple cancer subtypes from probe, gene and...
Abnormal DNA methylation is an important epigenetic regulator involved in tumorigenesis. Deciphering common and cancer-specific DNA methylation patterns is essential for understanding the mechanisms of tumor development. The Cancer Genome Atlas (TCGA) project provides a large number of samples of different cancers that enable a pan-cancer study of...
DNA methylation has been shown to play important roles in cell development and complex diseases through comparative studies of DNA methylation profiles across different tissues and samples. Current studies indicate that the regulation of gene expression by DNA methylation depends on the genomic locations of CpGs. Common DNA methylation patterns sh...
Distinguishing driver pathways has been extensively studied, because they are critical for understanding the development and molecular mechanisms of cancer. Most existing methods for identifying driver pathways are based on high coverage as well as high mutual exclusivity, with the underlying assumption that the mutations are exclusive. But in man...
DNA methylation is a key functional regulatory mechanism in the human genome, which plays critical roles in development, differentiation and many diseases. With the rapid progress of large-scale projects (e.g. ENCODE), a large amount of DNA methylation data across diverse cell lines has been produced. However, common methylation patterns, cell lineage- and cell line-s...
Background: Large-scale cancer genomic projects are providing a wealth of data on genomic, epigenomic and gene expression aberrations in many cancer types. One key challenge is to detect functional driver pathways and to filter out nonfunctional passenger genes in cancer genomics. Vandin et al. [1] introduced the Maximum Weight Sub-matrix Problem to fi...
In this paper, we present a novel rough-fuzzy clustering (RFC) method to detect overlapping protein complexes in protein-protein interaction (PPI) networks. RFC focuses on a fuzzy relation model rather than a graph model by integrating fuzzy sets and rough sets, and employs the upper and lower approximations of rough sets to deal with overlapping complexes...
Increasing evidence has indicated that long non-coding RNAs (lncRNAs) are implicated in and associated with many complex human diseases. Despite the accumulation of lncRNA-disease associations, only a few studies have examined the roles of these associations in pathogenesis. In this paper, we investigated lncRNA-disease associations from a network...