Xin Chen

Nanyang Technological University | ntu

PhD


50
Publications
9,370
Reads
3,387
Citations
Citations since 2017
4 Research Items
829 Citations

Publications (50)
Article
Full-text available
Xist (X-inactive specific transcript) is a prototype long noncoding RNA in charge of epigenetic silencing of one X chromosome in each female cell in mammals. In a genetic screen, we identify Mageb3 and its homologs Mageb1 and Mageb2 as genes functionally required for Xist-mediated gene silencing. Mageb1-3 are previously uncharacterize...
Article
Full-text available
We carried out padlock capture, a high-resolution RNA allelotyping method, to study X chromosome inactivation (XCI). We examined the gene reactivation pattern along the inactive X (Xi), after Xist (X-inactive specific transcript), a prototype long non-coding RNA essential for establishing XCI in early embryos, is conditi...
Article
Full-text available
The study of genetic map linearization leads to a combinatorially hard problem, called the minimum breakpoint linearization (MBL) problem. It is aimed at finding a linearization of a partial order which attains the minimum breakpoint distance to a reference total order. The approximation algorithms previously developed for the MBL problem are o...
Article
Full-text available
Background: Identifying all possible mapping locations of next-generation sequencing (NGS) reads is essential in several applications such as prediction of genomic variants or protein binding motifs located in repeat regions, isoform expression quantification, metagenomics analysis, etc. However, this task is very time-consuming and majority...
Article
Full-text available
With the advances in high-throughput DNA sequencing technologies, RNA-seq has rapidly emerged as a powerful tool for the quantitative analysis of gene expression and transcript variant discovery. In comparative experiments, differential expression analysis is commonly performed on RNA-seq data to identify genes/features that are differentially expr...
Conference Paper
Full-text available
Cytosine methylation plays an important role in many biological regulation processes. The current gold-standard method for analyzing cytosine methylation is based on sodium bisulfite treatment and high-throughput sequencing technologies. In this paper we introduce a new tool called TAMeBS for cytosine methylation analysis using bisulfite sequencing...
Article
Full-text available
Background: Enormous volumes of short read data from next-generation sequencing (NGS) technologies have posed new challenges to the area of genomic sequence comparison. The multiple sequence alignment approach is hardly applicable to NGS data due to the challenging problem of short read assembly. Thus alignment-free methods are needed for the compar...
Article
Full-text available
Enormous volumes of short read data from next-generation sequencing (NGS) technologies have posed new challenges to the area of genomic sequence comparison. The multiple sequence alignment approach is hardly applicable to NGS data due to the challenging problem of short read assembly. Thus alignment-free methods need to be developed for the compar...
Article
Full-text available
Next-generation sequencing (NGS) technologies permit the rapid production of vast amounts of data at low cost. Economical data storage and transmission hence becomes an increasingly important challenge for NGS experiments. In this paper, we introduce a new non-reference based read sequence compression tool called SRComp. It works by first employing...
Article
The problem of sorting unsigned permutations by double-cut-and-joins (SBD) arises when we perform the double-cut-and-join (DCJ) operations on pairs of unichromosomal genomes without the gene strandedness information. In this paper we show it is an NP-hard problem by reduction to an equivalent previously-known problem, called breakpoint graph decompo...
Article
Full-text available
Background: The discovery of single-nucleotide polymorphisms (SNPs) has important implications in a variety of genetic studies on human diseases and biological functions. One valuable approach proposed for SNP discovery is based on base-specific cleavage and mass spectrometry. However, it is still very challenging to achieve the full potential of th...
Data
Full-text available
Extensions to edit distance. The analysis results for the problems SNP - MSPe and SNP - MSQe are presented. See "Additional file 1.pdf".
Conference Paper
Accurate detection of single-nucleotide polymorphisms (SNPs) is crucial for the success of many downstream analyses such as clinical diagnosis, virus identification, genetic mapping and association studies. Among many others, one valuable approach for SNP detection is based on the base-specific cleavage of single-stranded nucleic acids followed by...
Article
Full-text available
In this paper we study the problem of sorting unsigned genomes by double-cut-and-join operations, where genomes allow a mix of linear and circular chromosomes to be present. First, we formulate an equivalent optimization problem, called maximum cycle/path decomposition, which is aimed at finding a largest collection of edge-disjoint cycles/AA-paths...
Article
The additive model is a semiparametric class of models that has become extremely popular because it is more flexible than the linear model and can be fitted to high-dimensional data when fully nonparametric models become infeasible. We consider the problem of simultaneous variable selection and parametric component identification using spline appro...
Article
Fold recognition from amino acid sequences plays an important role in identifying protein structures and functions. The taxonomy-based method, which classifies a query protein into one of the known folds, has been shown very promising for protein fold recognition. However, extracting a set of highly discriminative features from amino acid sequences...
Conference Paper
Full-text available
Prediction of protein contact maps is of great importance since it can facilitate and improve the prediction of protein 3D structure. However, the prediction accuracy is known to be rather low. In this paper, a consensus contact map prediction method called LRcon is developed, which combines the prediction results from several complement...
Article
Full-text available
The construction of consensus genetic maps is a very challenging problem in computational biology. Many computational approaches have been proposed on the basis of only the marker order relations provided by a given set of individual genetic maps. In this article, we propose a comparative approach to constructing consensus genetic maps for a genome...
Article
Full-text available
G-protein-coupled receptors (GPCRs) play a key role in diverse physiological processes and are the targets of almost two-thirds of the marketed drugs. The 3D structures of GPCRs are largely unavailable; however, a large number of GPCR primary sequences are known. To facilitate the identification and characterization of novel receptors, it is there...
Conference Paper
Full-text available
The problem of sorting permutations by double-cut-and-joins (SBD) arises when we perform the double-cut-and-join (DCJ) operations on pairs of unichromosomal genomes without the gene strandedness information. In this paper we show it is an NP-hard problem by reduction to an equivalent previously-known problem, called breakpoint graph decomposition (B...
Data
Full-text available
Detailed descriptions of recurrence quantification analysis, Fisher's discriminant algorithm, and prediction assessment can be found in this file.
Article
Full-text available
Prediction of protein structural classes (alpha, beta, alpha + beta and alpha/beta) from amino acid sequences is of great importance, as it helps in studying protein function, regulation and interactions. Many methods have been developed for high-homology protein sequences, and the prediction accuracies can reach up to 90%. However, for low-...
Article
There is an increasing interest in clustering time course gene expression data to investigate a wide range of biological processes. However, developing a clustering algorithm ideal for time course gene expression data is still challenging. As timing is an important factor in defining true clusters, a clustering algorithm shall explore expression corre...
Article
Full-text available
In recent years, there has been a growing interest in inferring the total order of genes or markers on a chromosome, since current genetic mapping efforts might only suffice to produce a partial order. Many interesting optimization problems were thus formulated in the framework of genome rearrangement. As an important one among them, the minimu...
Article
Full-text available
We introduce a new combinatorial optimization problem in this article, called the minimum common integer partition (MCIP) problem, which was inspired by computational biology applications including ortholog assignment and DNA fingerprint assembly. A partition of a positive integer n is a multiset of positive integers that add up to exactly n, and a...
Article
Full-text available
Position weight matrices (PWMs) are widely used to depict the DNA binding preferences of transcription factors (TFs) in computational molecular biology and regulatory genomics. Thus, learning an accurate PWM to characterize the binding sites of a specific TF is a fundamental problem that plays an important role in modeling regulatory motifs and als...
Article
Full-text available
The assignment of orthologous genes between a pair of genomes is a fundamental and challenging problem in comparative genomics, since many computational methods for solving various biological problems critically rely on bona fide orthologs as input. While it is usually done using sequence similarity search, we recently proposed a new combinatorial...
Article
Position weight matrices (PWMs) are widely used to depict the DNA binding preferences of transcription factors (TFs) in computational molecular biology and regulatory genomics. Thus, learning an accurate PWM to characterize the binding sites of a specific TF is a fundamental problem that plays an important role in modeling regulatory motifs and dis...
Article
A large number of biclustering methods have been proposed to detect patterns in gene expression data. All these methods try to find some type of bicluster, but none can discover all the types of patterns in the data. Furthermore, researchers have to design new algorithms in order to find new types of biclusters/patterns that interest biologists....
Conference Paper
Full-text available
We introduce a new combinatorial optimization problem in this paper, called the Minimum Common Integer Partition (MCIP) problem, which was inspired by computational biology applications including ortholog assignment and DNA fingerprint assembly. A partition of a positive integer n is a multiset of positive integers that add up to exactly n, and an...
Conference Paper
Full-text available
The assignment of orthologous genes between a pair of genomes is a fundamental and challenging problem in comparative genomics, since many computational methods for solving various biological problems critically rely on bona fide orthologs as input. While it is usually done using sequence similarity search, we recently proposed a new combinator...
Article
The discovery of motifs in DNA sequences remains a fundamental and challenging problem in computational molecular biology and regulatory genomics, although a large number of computational methods have been proposed in the past decade. Among these methods, the Gibbs sampling strategy has shown great promise and is routinely used for finding regulato...
Conference Paper
The rapid increase in the number of sequenced microbial genomes provides unprecedented opportunities to computational biologists to decipher the genomic structures of these microbes through development and application of advanced comparative genome analysis tools. In this presentation, we describe a systematic study we have been carrying out on dec...
Article
Full-text available
The assignment of orthologous genes between a pair of genomes is a fundamental and challenging problem in comparative genomics. Existing methods that assign orthologs based on the similarity between DNA or protein sequences may make erroneous assignments when sequence similarity does not clearly delineate the evolutionary relationship among genes o...
Conference Paper
Full-text available
Approximate string matching is a fundamental and challenging problem in computer science, for which a fast algorithm is highly demanded in many applications including text processing and DNA sequence analysis. In this paper, we present a fast algorithm for approximate string matching, called FAAST. It aims at solving a popular variant of the...
Conference Paper
The assignment of orthologous genes between a pair of genomes is a fundamental and challenging problem in comparative genomics. Existing methods that assign orthologs based on the similarity between DNA or protein sequences may make erroneous assignments when sequence similarity does not clearly delineate the evolutionary relationship among genes o...
Article
Full-text available
A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new "normalized information distance", based on the noncomputable notion of Kolmogorov complexity, and show that it is in this class and it minorizes every computable distance in the class (tha...
Article
Full-text available
A fundamental question in information theory and in computer science is how to measure similarity or the amount of shared information between two sequences. We have proposed a metric, based on Kolmogorov complexity, to answer this question and have proven it to be universal. We apply this metric in measuring the amount of shared information between...
Article
We present a computational protocol for inference of regulatory and signaling pathways in a microbial cell, through literature search, mining "high-throughput" biological data of various types, and computer-assisted human inference. This protocol consists of four key components: (a) construction of template pathways for microbial organisms related...
Article
We computationally predict operons in the Synechococcus sp. WH8102 genome based on three types of genomic data: intergenic distances, COG gene functions and phylogenetic profiles. In the proposed method, we first estimate a log-likelihood distribution for each type of genomic data, and then fuse the information from these distributions using a perceptron to dis...
Article
We present a computational protocol for inference of regulatory and signaling pathways in a microbial cell, through literature search, mining "high-throughput" biological data of various types, and computer-assisted human inference. This protocol consists of four key components: (a) construction of template pathways for microbial organisms related...
Article
Full-text available
While achieving the best compression ratios for DNA sequences, our new DNACompress program significantly improves on the running time of all previous DNA compression programs. Availability: http://dna.cs.ucsb.edu/DNACompress Contact: chxin@cs.ucsb.edu, mli@cs.ucsb.edu
Article
Full-text available
This paper introduces a metric to measure the degree to which two computer programs are similar for plagiarism detection. This similarity metric is based on Kolmogorov complexity [8] and measures the amount of shared information between two programs. The measure is universal and hence, in theory, cannot be cheated. Although the metric is not computable, we h...
Article
We present a DNA compression algorithm, GenCompress, based on approximate matching that gives the best compression results on standard benchmark DNA sequences. We present the design rationale of GenCompress based on approximate matching, discuss details of the algorithm, provide experimental results, and compare the results with the two most effect...
Conference Paper
An efficient compression scheme for color-quantized images based on progressive coding of color information has been developed. Rather than sorting color indexes into a linear list structure, a binary-tree structure of color indexes is proposed. With this structure the new algorithm can progressively recover an image from 2 colors up to all of...
Article
Motivation: Traditional sequence distances require an alignment and therefore are not directly applicable to the problem of whole genome phylogeny where events such as rearrangements make full length alignments impossible. We present a sequence distance that works on unaligned sequences using the information theoretical concept of Kolmogorov compl...
Article
Full-text available
We present a lossless compression algorithm, GenCompress, for genetic sequences, based on searching for approximate repeats. Our algorithm achieves the best compression ratios for benchmark DNA sequences. Significantly better compression results show that the approximate repeats are one of the main hidden regularities in DNA sequences. We then desc...
