
Xin Chen, PhD
Nanyang Technological University (NTU)
50
Publications
9,370
Reads
3,387
Citations
Publications (50)
Xist (inactivated X chromosome specific transcript) is a prototype long noncoding RNA in charge of epigenetic silencing of one X chromosome in each female cell in mammals. In a genetic screen, we identify Mageb3 and its homologs Mageb1 and Mageb2 as genes functionally required for Xist-mediated gene silencing. Mageb1-3 are previously uncharacterize...
We carried out padlock capture, a high-resolution RNA allelotyping method, to study X chromosome inactivation (XCI). We examined the gene reactivation pattern along the inactive X (Xi) after Xist (X-inactive specific transcript), a prototype long non-coding RNA essential for establishing XCI in early embryos, is conditi...
The study of genetic map linearization leads to a combinatorially hard problem, called the minimum breakpoint linearization (MBL) problem. It is aimed at finding a linearization of a partial order which attains the minimum breakpoint distance to a reference total order. The approximation algorithms previously developed for the MBL problem are o...
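To make the objective in the abstract above concrete, the following is a minimal Python sketch of a breakpoint distance between a candidate linearization and a reference total order. It assumes the simplest unsigned definition (count the adjacent pairs of the linearization that are not adjacent in the reference); the paper's exact definition may differ, and the function name `breakpoint_distance` is hypothetical.

```python
def breakpoint_distance(order, reference):
    """Count adjacent pairs in `order` that are not adjacent in `reference`.

    Assumption: markers are unsigned and both sequences contain the same
    distinct markers. This is an illustrative definition only.
    """
    if len(set(order)) != len(order) or set(order) != set(reference):
        raise ValueError("both orders must contain the same distinct markers")
    if len(order) != len(reference):
        raise ValueError("both orders must have the same length")
    # Position of each marker in the reference total order.
    position = {marker: i for i, marker in enumerate(reference)}
    return sum(
        1
        for a, b in zip(order, order[1:])
        if abs(position[a] - position[b]) != 1
    )
```

An MBL solver would search over the linearizations of the partial order for one that minimizes this count; that search is the hard part and is not shown here.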
Background: Identifying all possible mapping locations of next-generation sequencing (NGS) reads is essential in several applications such as prediction of genomic variants or protein binding motifs located in repeat regions, isoform expression quantification, metagenomics analysis, etc. However, this task is very time-consuming and majority...
With the advances in high-throughput DNA sequencing technologies, RNA-seq has rapidly emerged as a powerful tool for the quantitative analysis of gene expression and transcript variant discovery. In comparative experiments, differential expression analysis is commonly performed on RNA-seq data to identify genes/features that are differentially expr...
Cytosine methylation plays an important role in many biological regulation processes. The current gold-standard method for analyzing cytosine methylation is based on sodium bisulfite treatment and high-throughput sequencing technologies. In this paper we introduce a new tool called TAMeBS for cytosine methylation analysis using bisulfite sequencing...
Background
Enormous volumes of short read data from next-generation sequencing (NGS) technologies have posed new challenges to the area of genomic sequence comparison. The multiple sequence alignment approach is hardly applicable to NGS data due to the challenging problem of short read assembly. Thus alignment-free methods are needed for the compar...
Enormous volumes of short reads data from next-generation sequencing (NGS) technologies have posed new challenges to the area of genomic sequence comparison. The multiple sequence alignment approach is hardly applicable to NGS data due to the challenging problem of short read assembly. Thus alignment-free methods need to be developed for the compar...
Next-generation sequencing (NGS) technologies permit the rapid production of vast amounts of data at low cost. Economical data storage and transmission hence becomes an increasingly important challenge for NGS experiments. In this paper, we introduce a new non-reference based read sequence compression tool called SRComp. It works by first employing...
The problem of sorting unsigned permutations by double-cut-and-joins (SBD) arises when we perform the double-cut-and-join (DCJ) operations on pairs of unichromosomal genomes without the gene strandedness information. In this paper we show it is an NP-hard problem by reduction to an equivalent previously-known problem, called breakpoint graph decompo...
Background
The discovery of single-nucleotide polymorphisms (SNPs) has important implications in a variety of genetic studies on human diseases and biological functions. One valuable approach proposed for SNP discovery is based on base-specific cleavage and mass spectrometry. However, it is still very challenging to achieve the full potential of th...
Extensions to edit distance. The analysis results for the problems SNP-MSPe and SNP-MSQe are presented. See "Additional file 1.pdf".
Accurate detection of single-nucleotide polymorphisms (SNPs) is crucial for the success of many downstream analyses such as clinical diagnosis, virus identification, genetic mapping and association studies. Among many others, one valuable approach for SNP detection is based on the base-specific cleavage of single-stranded nucleic acids followed by...
In this paper we study the problem of sorting unsigned genomes by double-cut-and-join operations, where genomes allow a mix of linear and circular chromosomes to be present. First, we formulate an equivalent optimization problem, called maximum cycle/path decomposition, which is aimed at finding a largest collection of edge-disjoint cycles/AA-paths...
The additive model is a semiparametric class of models that has become extremely popular because it is more flexible than the linear model and can be fitted to high-dimensional data when fully nonparametric models become infeasible. We consider the problem of simultaneous variable selection and parametric component identification using spline appro...
Fold recognition from amino acid sequences plays an important role in identifying protein structures and functions. The taxonomy-based method, which classifies a query protein into one of the known folds, has been shown very promising for protein fold recognition. However, extracting a set of highly discriminative features from amino acid sequences...
Prediction of protein contact maps is of great importance since it can facilitate and improve the prediction of protein 3D structure. However, the prediction accuracy is notoriously low. In this paper, a consensus contact map prediction method called LRcon is developed, which combines the prediction results from several complement...
The construction of consensus genetic maps is a very challenging problem in computational biology. Many computational approaches have been proposed on the basis of only the marker order relations provided by a given set of individual genetic maps. In this article, we propose a comparative approach to constructing consensus genetic maps for a genome...
G-protein-coupled receptors (GPCRs) play a key role in diverse physiological processes and are the targets of almost two-thirds of the marketed drugs. The 3D structures of GPCRs are largely unavailable; however, a large number of GPCR primary sequences are known. To facilitate the identification and characterization of novel receptors, it is there...
The problem of sorting permutations by double-cut-and-joins (SBD) arises when we perform the double-cut-and-join (DCJ) operations on pairs of unichromosomal genomes without the gene strandedness information. In this paper we show it is an NP-hard problem by reduction to an equivalent previously-known problem, called breakpoint graph decomposition (B...
The detailed descriptions about Recurrence quantification analysis, Fisher's discriminant algorithm and Prediction assessment can be found in this file.
Prediction of protein structural classes (alpha, beta, alpha + beta and alpha/beta) from amino acid sequences is of great importance, as it is beneficial to study protein function, regulation and interactions. Many methods have been developed for high-homology protein sequences, and the prediction accuracies can achieve up to 90%. However, for low-...
There is an increasing interest in clustering time course gene expression data to investigate a wide range of biological processes. However, developing a clustering algorithm ideal for time course gene expression data is still challenging. As timing is an important factor in defining true clusters, a clustering algorithm shall explore expression corre...
In recent years, there has been a growing interest in inferring the total order of genes or markers on a chromosome, since current genetic mapping efforts might only suffice to produce a partial order. Many interesting optimization problems were thus formulated in the framework of genome rearrangement. As an important one among them, the minimu...
We introduce a new combinatorial optimization problem in this article, called the minimum common integer partition (MCIP) problem, which was inspired by computational biology applications including ortholog assignment and DNA fingerprint assembly. A partition of a positive integer n is a multiset of positive integers that add up to exactly n, and a...
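The definition of an integer partition quoted in the abstract above can be illustrated with a short Python sketch. It only enumerates the partitions of a single integer; it does not attempt the MCIP problem itself, whose definition is cut off in the text above. The function name `integer_partitions` is hypothetical.

```python
def integer_partitions(n, largest=None):
    """Yield every partition of n as a non-increasing tuple of positive integers.

    Each tuple is one multiset of positive integers that adds up to exactly n.
    """
    if largest is None:
        largest = n
    if n == 0:
        # The empty tuple completes a partition built by the callers.
        yield ()
        return
    # Choose the next part no larger than the previous one, so that each
    # multiset is produced exactly once.
    for part in range(min(n, largest), 0, -1):
        for rest in integer_partitions(n - part, part):
            yield (part,) + rest
```

For example, `list(integer_partitions(3))` yields `(3,)`, `(2, 1)` and `(1, 1, 1)`.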
Position weight matrices (PWMs) are widely used to depict the DNA binding preferences of transcription factors (TFs) in computational molecular biology and regulatory genomics. Thus, learning an accurate PWM to characterize the binding sites of a specific TF is a fundamental problem that plays an important role in modeling regulatory motifs and als...
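As background for the abstract above, here is a minimal Python sketch of how a log-odds PWM is commonly built from a set of aligned binding sites and then used to score a candidate sequence. The pseudocount, the uniform background frequency of 0.25, and the names `build_pwm` and `score` are assumptions for illustration; this is the textbook construction, not the learning method proposed in the paper.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Return a log-odds PWM as a list of {base: score} dicts, one per column.

    Assumptions: sites are equal-length strings over A/C/G/T, and the
    background distribution is uniform.
    """
    alphabet = "ACGT"
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for column in range(width):
        # Start from the pseudocount so that unseen bases get a finite score.
        counts = {base: pseudocount for base in alphabet}
        for site in sites:
            counts[site[column]] += 1
        total = sum(counts.values())
        pwm.append(
            {base: math.log2(counts[base] / total / background) for base in alphabet}
        )
    return pwm

def score(pwm, sequence):
    """Sum the per-column log-odds scores of `sequence` under `pwm`."""
    return sum(column[base] for column, base in zip(pwm, sequence))
```

A sequence matching the consensus of the input sites scores higher than an unrelated sequence of the same length.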
The assignment of orthologous genes between a pair of genomes is a fundamental and challenging problem in comparative genomics, since many computational methods for solving various biological problems critically rely on bona fide orthologs as input. While it is usually done using sequence similarity search, we recently proposed a new combinatorial...
Position weight matrices (PWMs) are widely used to depict the DNA binding preferences of transcription factors (TFs) in computational molecular biology and regulatory genomics. Thus, learning an accurate PWM to characterize the binding sites of a specific TF is a fundamental problem that plays an important role in modeling regulatory motifs and dis...
A large number of biclustering methods have been proposed to detect patterns in gene expression data. All these methods try to find some type of biclusters, but none can discover all the types of patterns in the data. Furthermore, researchers have to design new algorithms in order to find new types of biclusters/patterns that interest biologists....
We introduce a new combinatorial optimization problem in this paper, called the Minimum Common Integer Partition (MCIP) problem, which was inspired by computational biology applications including ortholog assignment and DNA fingerprint assembly. A partition of a positive integer n is a multiset of positive integers that add up to exactly n, and an...
The assignment of orthologous genes between a pair of genomes is a fundamental and challenging problem in comparative genomics, since many computational methods for solving various biological problems critically rely on bona fide orthologs as input. While it is usually done using sequence similarity search, we recently proposed a new combinator...
The discovery of motifs in DNA sequences remains a fundamental and challenging problem in computational molecular biology and regulatory genomics, although a large number of computational methods have been proposed in the past decade. Among these methods, the Gibbs sampling strategy has shown great promise and is routinely used for finding regulato...
The rapid increase in the number of sequenced microbial genomes provides unprecedented opportunities to computational biologists to decipher the genomic structures of these microbes through development and application of advanced comparative genome analysis tools. In this presentation, we describe a systematic study we have been carrying out on dec...
The assignment of orthologous genes between a pair of genomes is a fundamental and challenging problem in comparative genomics. Existing methods that assign orthologs based on the similarity between DNA or protein sequences may make erroneous assignments when sequence similarity does not clearly delineate the evolutionary relationship among genes o...
Approximate string matching is a fundamental and challenging problem in computer science, for which a fast algorithm is highly demanded in many applications including text processing and DNA sequence analysis. In this paper, we present a fast algorithm for approximate string matching, called FAAST. It aims at solving a popular variant of the...
The assignment of orthologous genes between a pair of genomes is a fundamental and challenging problem in comparative genomics. Existing methods that assign orthologs based on the similarity between DNA or protein sequences may make erroneous assignments when sequence similarity does not clearly delineate the evolutionary relationship among genes o...
A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new "normalized information distance", based on the noncomputable notion of Kolmogorov complexity, and show that it is in this class and it minorizes every computable distance in the class (tha...
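Because Kolmogorov complexity is noncomputable, a distance of this kind is usually approximated in practice with a real compressor. The Python sketch below uses the widely known normalized compression distance with `zlib` as a stand-in; both the compressor choice and the function names are assumptions for illustration, and the formula shown is the standard compression-based approximation rather than the definition given in the paper.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))

def normalized_compression_distance(x: bytes, y: bytes) -> float:
    """Approximate similarity distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).

    C is the compressed size under zlib, used here as a computable proxy
    for Kolmogorov complexity.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Values near 0 suggest that the two inputs share most of their information, and values near 1 suggest that they share little. The quality of the approximation depends on the compressor used.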
A fundamental question in information theory and in computer science is how to measure similarity or the amount of shared information between two sequences. We have proposed a metric, based on Kolmogorov complexity, to answer this question and have proven it to be universal. We apply this metric in measuring the amount of shared information between...
We present a computational protocol for inference of regulatory and signaling pathways in a microbial cell, through literature search, mining "high-throughput" biological data of various types, and computer-assisted human inference. This protocol consists of four key components: (a) construction of template pathways for microbial organisms related...
We computationally predict operons in the Synechococcus sp. WH8102 genome based on three types of genomic data: intergenic distances, COG gene functions and phylogenetic profiles. In the proposed method, we first estimate a log-likelihood distribution for each type of genomic data, and then fuse this distribution information using a perceptron to dis...
We present a computational protocol for inference of regulatory and signaling pathways in a microbial cell, through literature search, mining "high-throughput" biological data of various types, and computer-assisted human inference. This protocol consists of four key components: (a) construction of template pathways for microbial organisms related...
While achieving the best compression ratios for DNA sequences, our new DNACompress program significantly improves the running time of all previous DNA compression programs.
Availability: http://dna.cs.ucsb.edu/DNACompress
Contact: chxin@cs.ucsb.edu, mli@cs.ucsb.edu
This paper introduces a metric to measure the degree to which two computer programs are similar for plagiarism detection. This similarity metric is based on Kolmogorov complexity [8] and measures the amount of shared information between two programs. The measure is universal hence in theory not cheatable. Although the metric is not computable, we h...
We present a DNA compression algorithm, GenCompress, based on approximate matching that gives the best compression results on standard benchmark DNA sequences. We present the design rationale of GenCompress based on approximate matching, discuss details of the algorithm, provide experimental results, and compare the results with the two most effect...
An efficient compression scheme for color-quantized images based on progressive coding of color information has been developed. Rather than sorting color indexes into a linear list structure, a binary-tree structure of color indexes is proposed. With this structure the new algorithm can progressively recover an image from 2 colors up to all of...
Motivation:
Traditional sequence distances require an alignment and therefore are not directly applicable to the problem of whole genome phylogeny where events such as rearrangements make full length alignments impossible. We present a sequence distance that works on unaligned sequences using the information theoretical concept of Kolmogorov compl...
We present a lossless compression algorithm, GenCompress, for genetic sequences, based on searching for approximate repeats. Our algorithm achieves the best compression ratios for benchmark DNA sequences. Significantly better compression results show that the approximate repeats are one of the main hidden regularities in DNA sequences. We then desc...