Publications (179)
579.65 Total impact
ABSTRACT: With the advent of high-throughput technologies making large-scale gene expression data readily available, developing appropriate computational tools to process these data and distill insights into systems biology has been an important part of the "big data" challenge. Gene coexpression is one of the earliest techniques developed that is still widely in use for functional annotation, pathway analysis, and, most importantly, the reconstruction of gene regulatory networks, based on gene expression data. However, most coexpression measures do not specifically account for local features in expression profiles. For example, it is very likely that the patterns of gene association may change or only exist in a subset of the samples, especially when the samples are pooled from a range of experiments. We propose two new gene coexpression statistics based on counting local patterns of gene expression ranks to take into account the potentially diverse nature of gene interactions. In particular, one of our statistics is designed for time-course data with local dependence structures, such as time series coupled over a subregion of the time domain. We provide asymptotic analysis of their distributions and power, and evaluate their performance against a wide range of existing coexpression measures on simulated and real data. Our new statistics are fast to compute, robust against outliers, and show comparable and often better general performance.
ABSTRACT: The DiseaseConnect (http://diseaseconnect.org) is a web server for analysis and visualization of a comprehensive knowledge on mechanism-based disease connectivity. The traditional disease classification system groups diseases with similar clinical symptoms and phenotypic traits. Thus, diseases with entirely different pathologies could be grouped together, leading to a similar treatment design. Such problems could be avoided if diseases were classified based on their molecular mechanisms. Connecting diseases with similar pathological mechanisms could inspire novel strategies on the effective repositioning of existing drugs and therapies. Although there have been several studies attempting to generate disease connectivity networks, they have not yet utilized the enormous and rapidly growing public repositories of disease-related omics data and literature, two primary resources capable of providing insights into disease connections at an unprecedented level of detail. Our DiseaseConnect, the first public web server, integrates comprehensive omics and literature data, including a large amount of gene expression data, Genome-Wide Association Studies catalog, and text-mined knowledge, to discover disease-disease connectivity via common molecular mechanisms. Moreover, the clinical comorbidity data and a comprehensive compilation of known drug-disease relationships are additionally utilized for advancing the understanding of the disease landscape and for facilitating the mechanism-based development of new drug treatments.
ABSTRACT: With the development of next-generation sequencing (NGS) technologies, a large amount of short read data has been generated. Assembly of these short reads can be challenging for genomes and metagenomes without template sequences, making alignment-based genome sequence comparison difficult. In addition, sequence reads from NGS can come from different regions of various genomes and they may not be alignable. Sequence signature-based methods for genome comparison based on the frequencies of word patterns in genomes and metagenomes can potentially be useful for the analysis of short reads data from NGS. Here we review the recent development of alignment-free genome and metagenome comparison based on the frequencies of word patterns with emphasis on the dissimilarity measures between sequences, the statistical power of these measures when two sequences are related and the applications of these measures to NGS data.
ABSTRACT: Local alignment-free sequence comparison arises in the context of identifying similar segments of sequences that may not be alignable in the traditional sense. We propose a randomized approximation algorithm that is both accurate and efficient. We show that under D2 and its important variant [Formula: see text] as the similarity measure, local alignment-free comparison between a pair of sequences can be formulated as the problem of finding the maximum bichromatic dot product between two sets of points in high dimensions. We introduce a geometric framework that reduces this problem to that of finding the bichromatic closest pair (BCP), allowing the properties of the underlying metric to be leveraged. Local alignment-free sequence comparison can be solved by making a quadratic number of alignment-free substring comparisons. We show both theoretically and through empirical results on simulated data that our approximation algorithm requires a subquadratic number of such comparisons and trades only a small amount of accuracy to achieve this efficiency. Therefore, our algorithm can extend the current usage of alignment-free-based methods and can also be regarded as a substitute for local alignment algorithms in many biological studies.
ABSTRACT: To an RNA pseudoknot structure is naturally associated a topological surface, which has its associated genus, and structures can thus be classified by the genus. Based on earlier work of Harer-Zagier, we compute the generating function [Formula: see text] for the number [Formula: see text] of those structures of fixed genus [Formula: see text] and minimum stack size [Formula: see text] with [Formula: see text] nucleotides so that no two consecutive nucleotides are base-paired and show that [Formula: see text] is algebraic. In particular, we prove that [Formula: see text], where [Formula: see text]. Thus, for stack size at least two, the genus only enters through the subexponential factor, and the slow growth rate compared to the number of RNA molecules implies the existence of neutral networks of distinct molecules with the same structure of any genus. Certain RNA structures called shapes are shown to be in natural one-to-one correspondence with the cells in the Penner-Strebel decomposition of Riemann's moduli space of a surface of genus [Formula: see text] with one boundary component, thus providing a link between RNA enumerative problems and the geometry of Riemann's moduli space.
ABSTRACT: Next generation sequencing (NGS) technologies are now widely used in many biological studies. In NGS, sequence reads are randomly sampled from the genome sequence of interest. Most computational approaches for NGS data first map the reads to the genome and then analyze the data based on the mapped reads. Since many organisms have unknown genome sequences and many reads cannot be uniquely mapped to the genomes even if the genome sequences are known, alternative analytical methods are needed for the study of NGS data. Here we suggest using word patterns to analyze NGS data. Word pattern counting (the study of the probabilistic distribution of the number of occurrences of word patterns in one or multiple long sequences) has played an important role in molecular sequence analysis. However, no studies are available on the distribution of the number of occurrences of word patterns in NGS reads. In this article, we build probabilistic models for the background sequence and the sampling process of the sequence reads from the genome. Based on the models, we provide normal and compound Poisson approximations for the number of occurrences of word patterns from the sequence reads, with bounds on the approximation error. The main challenge is to consider the randomness in generating the long background sequence, as well as in the sampling of the reads using NGS. We show the accuracy of these approximations under a variety of conditions for different patterns with various characteristics. Under realistic assumptions, the compound Poisson approximation seems to outperform the normal approximation in most situations. These approximate distributions can be used to evaluate the statistical significance of the occurrence of patterns from NGS data. The theory and the computational algorithm for calculating the approximate distributions are then used to analyze ChIP-Seq data using transcription factor GABP.
Software is available online (www-rcf.usc.edu/~fsun/Programs/NGS_motif_power/NGS_motif_power.html). In addition, Supplementary Material can be found online (www.liebertonline.com/cmb).
Article: New powerful statistics for alignment-free sequence comparison under a pattern transfer model
ABSTRACT: Alignment-free sequence comparison is widely used for comparing gene regulatory regions and for identifying horizontally transferred genes. Recent studies on the power of a widely used alignment-free comparison statistic D2 and its variants D*2 and D(s)2 showed that their power approximates a limit smaller than 1 as the sequence length tends to infinity under a pattern transfer model. We develop new alignment-free statistics based on D2, D*2 and D(s)2 by comparing local sequence pairs and then summing over all the local sequence pairs of certain length. We show that the new statistics are much more powerful than the corresponding statistics and the power tends to 1 as the sequence length tends to infinity under the pattern transfer model.
ABSTRACT: The rapid accumulation of biological networks poses new challenges and calls for powerful integrative analysis tools. Most existing methods capable of simultaneously analyzing a large number of networks were primarily designed for unweighted networks, and cannot easily be extended to weighted networks. However, it is known that transforming weighted into unweighted networks by dichotomizing the edges of weighted networks with a threshold generally leads to information loss. We have developed a novel, tensor-based computational framework for mining recurrent heavy subgraphs in a large set of massive weighted networks. Specifically, we formulate the recurrent heavy subgraph identification problem as a heavy 3D subtensor discovery problem with sparse constraints. We describe an effective approach to solving this problem by designing a multistage, convex relaxation protocol, and a nonuniform edge sampling technique. We applied our method to 130 coexpression networks, and identified 11,394 recurrent heavy subgraphs, grouped into 2,810 families. We demonstrated that the identified subgraphs represent meaningful biological modules by validating against a large set of compiled biological knowledge bases. We also showed that the likelihood for a heavy subgraph to be meaningful increases significantly with its recurrence in multiple networks, highlighting the importance of the integrative approach to biological network analysis. Moreover, our approach based on weighted graphs detects many patterns that would be overlooked using unweighted graphs. In addition, we identified a large number of modules that occur predominately under specific phenotypes. This analysis resulted in a genome-wide mapping of gene network modules onto the phenome. Finally, by comparing module activities across many datasets, we discovered high-order dynamic cooperativeness in protein complex networks and transcriptional regulatory networks.
ABSTRACT: Sequence alignment depends on the scoring function that defines similarity between pairs of letters. For local alignment, the computational algorithm searches for the most similar segments in the sequences according to the scoring function. The choice of this scoring function is important for correctly detecting segments of interest. We formulate sequence alignment as a hypothesis testing problem, and conduct extensive simulation experiments to study the relationship between the scoring function and the distribution of aligned pairs within the aligned segment under this framework. We cut through the many ways to construct scoring functions and showed that any scoring function with negative expectation used in local alignment corresponds to a hypothesis test between the background distribution of sequence letters and a statistical distribution of letter pairs determined by the scoring function. The results indicate that the log-likelihood ratio scoring function is statistically most powerful and has the highest accuracy for detecting the segments of interest that are defined by the statistical distribution of aligned letter pairs.
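As an illustration of the log-likelihood ratio scoring function mentioned in this abstract, here is a minimal sketch. It assumes the standard form s(a, b) = log(q_ab / (p_a p_b)), where q_ab is the target probability of an aligned letter pair and p_a, p_b are background letter probabilities; the abstract does not spell out this formula. The function names and the two-letter toy probabilities are hypothetical and are not taken from the paper.

```python
from math import log

def llr_scores(pair_probs, background):
    """Log-likelihood ratio score for each letter pair (assumed standard form)."""
    return {
        (a, b): log(q / (background[a] * background[b]))
        for (a, b), q in pair_probs.items()
    }

def expected_score(scores, background):
    """Expected score of a random pair drawn from the background distribution."""
    return sum(background[a] * background[b] * s for (a, b), s in scores.items())

# Toy two-letter alphabet; these numbers are invented for illustration only.
background = {"A": 0.5, "B": 0.5}
pair_probs = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}

scores = llr_scores(pair_probs, background)
mean_under_null = expected_score(scores, background)
```

In this toy case matching pairs receive positive scores and the expected score under the background is negative, which is the condition on local alignment scoring functions that the abstract refers to.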
Article: Enumeration of linear chord diagrams
ABSTRACT: A linear chord diagram canonically determines a fatgraph and hence has an associated genus $g$. We compute the natural generating function ${\bf C}_g(z)=\sum_{n\geq 0} {\bf c}_g(n)z^n$ for the number ${\bf c}_g(n)$ of linear chord diagrams of fixed genus $g\geq 1$ with a given number $n\geq 0$ of chords and find the remarkably simple formula ${\bf C}_g(z)=z^{2g}R_g(z) (1-4z)^{{1\over 2}-3g}$, where $R_g(z)$ is a polynomial of degree at most $g-1$ with integral coefficients satisfying $R_g({1\over 4})\neq 0$ and $R_g(0) = {\bf c}_g(2g)\neq 0.$ In particular, ${\bf C}_g(z)$ is algebraic over $\mathbb C(z)$, which generalizes the corresponding classical fact for the generating function ${\bf C}_0(z)$ of the Catalan numbers. As a corollary, we also calculate a related generating function germane to the enumeration of knotted RNA secondary structures, which is again found to be algebraic.
ABSTRACT: Rapid methods for alignment-free sequence comparison make large-scale comparisons between sequences increasingly feasible. Here we study the power of the statistic D2, which counts the number of matching k-tuples between two sequences, as well as D2*, which uses centralized counts, and D2S, which is a self-standardized version, both from a theoretical viewpoint and numerically, providing an easy-to-use program. The power is assessed under two alternative hidden Markov models; the first one assumes that the two sequences share a common motif, whereas the second model is a pattern transfer model; the null model is that the two sequences are composed of independent and identically distributed letters and they are independent. Under the first alternative model, the means of the tuple counts in the individual sequences change, whereas under the second alternative model, the marginal means are the same as under the null model. Using the limit distributions of the count statistics under the null and the alternative models, we find that generally, asymptotically D2S has the largest power, followed by D2*, whereas the power of D2 can even be zero in some cases. In contrast, even for sequences of length 140,000 bp, in simulations D2* generally has the largest power. Under the first alternative model of a shared motif, the power of D2* approaches 100% when sufficiently many motifs are shared, and we recommend the use of D2* for such practical applications. Under the second alternative model of pattern transfer, the power for all three count statistics does not increase with sequence length when the sequence is sufficiently long, and hence none of the three statistics under consideration can be recommended in such a situation. We illustrate the approach on 323 transcription factor binding motifs with length at most 10 from JASPAR CORE (October 12, 2009 version), verifying that D2* is generally more powerful than D2.
The program to calculate the power of D2, D2* and D2S can be downloaded from http://meta.cmb.usc.edu/d2. Supplementary Material is available at www.liebertonline.com/cmb.
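The abstract above describes D2 as the number of matching k-tuples between two sequences. A minimal sketch of that count follows, assuming overlapping k-tuples and exact matches; the function names are hypothetical, and this is not the downloadable program mentioned above.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count overlapping k-tuples (words of length k) in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2(seq_a, seq_b, k):
    """D2: number of (position in A, position in B) pairs with identical k-tuples.

    Equivalent to summing, over all words w, count_A(w) * count_B(w).
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    return sum(n * counts_b[w] for w, n in counts_a.items())

# Small usage example on toy sequences.
same = d2("ACGT", "ACGT", 2)      # words AC, CG, GT each match once
repeats = d2("AAAA", "AAA", 2)    # 3 occurrences of AA times 2 occurrences
```

Counting words first and multiplying the counts avoids comparing every pair of positions directly, which matters for long sequences.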
ABSTRACT: Variation in genome structure is an important source of human genetic polymorphism: It affects a large proportion of the genome and has a variety of phenotypic consequences relevant to health and disease. In spite of this, human genome structure variation is incompletely characterized due to a lack of approaches for discovering a broad range of structural variants in a global, comprehensive fashion. We addressed this gap with Optical Mapping, a high-throughput, high-resolution single-molecule system for studying genome structure. We used Optical Mapping to create genome-wide restriction maps of a complete hydatidiform mole and three lymphoblast-derived cell lines, and we validated the approach by demonstrating a strong concordance with existing methods. We also describe thousands of new variants with sizes ranging from kb to Mb.
ABSTRACT: The identification of binding sites of transcription factors (TF) and other regulatory regions, referred to as motifs, located in a set of molecular sequences is of fundamental importance in genomic research. Many computational and experimental approaches have been developed to locate motifs. The set of sequences of interest can be concatenated to form a long sequence of length n. One of the successful approaches for motif discovery is to identify statistically over- or underrepresented patterns in this long sequence. A pattern refers to a fixed word W over the alphabet. In the example of interest, W is a word in the set of patterns of the motif. Despite extensive studies on motif discovery, no studies have been carried out on the power of detecting statistically over- or underrepresented patterns. Here we address the issue of how the known presence of random instances of a known motif affects the power of detecting patterns, such as patterns within the motif. Let N(W)(n) be the number of possibly overlapping occurrences of a pattern W in the sequence that contains instances of a known motif; such a sequence is modeled here by a Hidden Markov Model (HMM). First, efficient computational methods for calculating the mean and variance of N(W)(n) are developed. Second, efficient computational methods for calculating parameters involved in the normal approximation of N(W)(n) for frequent patterns and compound Poisson approximation of N(W)(n) for rare patterns are developed. Third, an easy-to-use web program is developed to calculate the power of detecting patterns and the program is used to study the power of detection in several interesting biological examples.
ABSTRACT: Complex human diseases are often caused by multiple mutations, each of which contributes only a minor effect to the disease phenotype. To study the basis for these complex phenotypes, we developed a network-based approach to identify coexpression modules specifically activated in particular phenotypes. We integrated these modules, protein-protein interaction data, Gene Ontology annotations, and our database of gene-phenotype associations derived from literature to predict novel human gene-phenotype associations. Our systematic predictions provide us with the opportunity to perform a global analysis of human gene pleiotropy and its underlying regulatory mechanisms. We applied this method to 338 microarray datasets, covering 178 phenotype classes, and identified 193,145 phenotype-specific coexpression modules. We trained random forest classifiers for each phenotype and predicted a total of 6,558 gene-phenotype associations. We showed that 40.9% of genes are pleiotropic, highlighting that pleiotropy is more prevalent than previously expected. We collected 77 ChIP-chip datasets studying 69 transcription factors binding over 16,000 targets under various phenotypic conditions. Utilizing this unique data source, we confirmed that dynamic transcriptional regulation is an important force driving the formation of phenotype-specific gene modules. We created a genome-wide gene-to-phenotype mapping that has many potential implications, including providing potential new drug targets and uncovering the basis for human disease phenotypes. Our analysis of these phenotype-specific coexpression modules reveals a high prevalence of gene pleiotropy, and suggests that phenotype-specific transcription factor binding may contribute to phenotypic diversity. All resources from our study are made freely available on our online Phenotype Prediction Database.
ABSTRACT: New generation sequencing systems are changing how molecular biology is practiced. The widely promoted $1000 genome will be a reality with attendant changes for healthcare, including personalized medicine. More broadly the genomes of many new organisms with large samplings from populations will be commonplace. What is less appreciated is the explosive demands on computation, both for CPU cycles and storage as well as the need for new computational methods. In this article we will survey some of these developments and demands.
ABSTRACT: Large-scale comparison of the similarities between two biological sequences is a major issue in computational biology; a fast method, the D(2) statistic, relies on the comparison of the k-tuple content for both sequences. Although it has been known for some years that the D(2) statistic is not suitable for this task, as it tends to be dominated by single-sequence noise, to date no suitable adjustments have been proposed. In this article, we suggest two new variants of the D(2) word count statistic, which we call D(2)(S) and D(2)(*). For D(2)(S), which is a self-standardized statistic, we show that the statistic is asymptotically normally distributed, when sequence lengths tend to infinity, and not dominated by the noise in the individual sequences. The second statistic, D(2)(*), outperforms D(2)(S) in terms of power for detecting the relatedness between the two sequences in our examples; but although it is straightforward to simulate from the asymptotic distribution of D(2)(*), we cannot provide a closed form for power calculations.
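The two variants named in this abstract can be sketched as follows. The abstract does not give their formulas, so this sketch assumes the forms commonly quoted in the alignment-free literature: with centered counts X~_w = X_w - n p_w and Y~_w = Y_w - m p_w (n and m being the numbers of k-tuple positions and p_w the probability of word w under independent letters), D2* sums X~_w Y~_w / (sqrt(n m) p_w) and D2S sums X~_w Y~_w / sqrt(X~_w^2 + Y~_w^2). All function and parameter names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt

def centered_counts(seq, k, probs):
    """Return ({word: (centered count, word probability)}, number of positions)."""
    n_bar = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n_bar))
    result = {}
    for letters in product(sorted(probs), repeat=k):
        word = "".join(letters)
        p_w = prod(probs[c] for c in word)  # i.i.d. letter model (assumption)
        result[word] = (counts[word] - n_bar * p_w, p_w)
    return result, n_bar

def d2_star_and_d2s(seq_a, seq_b, k, probs):
    """Compute (D2*, D2S) under the assumed definitions described above."""
    xa, n_bar = centered_counts(seq_a, k, probs)
    xb, m_bar = centered_counts(seq_b, k, probs)
    d2_star = 0.0
    d2s = 0.0
    for word, (x, p_w) in xa.items():
        y = xb[word][0]
        d2_star += x * y / (sqrt(n_bar * m_bar) * p_w)
        denom = sqrt(x * x + y * y)
        if denom > 0:  # skip words whose centered counts are both zero
            d2s += x * y / denom
    return d2_star, d2s

# Toy example on a two-letter alphabet with uniform background probabilities.
star, self_std = d2_star_and_d2s("AAAA", "AAAA", 1, {"A": 0.5, "C": 0.5})
```

Enumerating every word over the alphabet is only practical for small k; a fuller implementation would restrict the sum or estimate the letter probabilities from the data.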
ABSTRACT: About 85% of the maize genome consists of highly repetitive sequences that are interspersed by low-copy, gene-coding sequences. The maize community has dealt with this genomic complexity by the construction of an integrated genetic and physical map (iMap), but this resource alone was not sufficient for ensuring the quality of the current sequence build. For this purpose, we constructed a genome-wide, high-resolution optical map of the maize inbred line B73 genome containing >91,000 restriction sites (averaging 1 site per approximately 23 kb) accrued from mapping genomic DNA molecules. Our optical map comprises 66 contigs, averaging 31.88 Mb in size and spanning 91.5% (2,103.93 Mb of approximately 2,300 Mb) of the maize genome. A new algorithm was created that considered both optical map and unfinished BAC sequence data for placing 60/66 (2,032.42 Mb) optical map contigs onto the maize iMap. The alignment of optical maps against numerous data sources yielded comprehensive results that proved revealing and productive. For example, gaps were uncovered and characterized within the iMap, the FPC (fingerprinted contigs) map, and the chromosome-wide pseudomolecules. Such alignments also suggested amended placements of FPC contigs on the maize genetic map and proactively guided the assembly of chromosome-wide pseudomolecules, especially within complex genomic regions. Lastly, we think that the full integration of B73 optical maps with the maize iMap would greatly facilitate maize sequence finishing efforts that would make it a valuable reference for comparative studies among cereals, or other maize inbred lines and cultivars.
ABSTRACT: We report an improved draft nucleotide sequence of the 2.3-gigabase genome of maize, an important crop plant and model for biological research. Over 32,000 genes were predicted, of which 99.8% were placed on reference chromosomes. Nearly 85% of the genome is composed of hundreds of families of transposable elements, dispersed nonuniformly across the genome. These were responsible for the capture and amplification of numerous gene fragments and affect the composition, sizes, and positions of centromeres. We also report on the correlation of methylation-poor regions with Mu transposon insertions and recombination, and copy number variants with insertions and/or deletions, as well as how uneven gene losses between duplicated regions were involved in returning an ancient allotetraploid to a genetically diploid state. These analyses inform and set the stage for further investigations to improve our understanding of the domestication and agricultural improvements of maize.
ABSTRACT: Although many studies have been successful in the discovery of cooperating groups of genes, mapping these groups to phenotypes has proved a much more challenging task. In this article, we present the first genome-wide mapping of gene coexpression modules onto the phenome. We annotated coexpression networks from 136 microarray datasets with phenotypes from the Unified Medical Language System (UMLS). We then designed an efficient graph-based simulated annealing approach to identify coexpression modules frequently and specifically occurring in datasets related to individual phenotypes. By requiring phenotype-specific recurrence, we ensure the robustness of our findings. We discovered 118,772 modules specific to 42 phenotypes, and developed validation tests combining Gene Ontology, GeneRIF and UMLS. Our method is generally applicable to any kind of abundant network data with defined phenotype association, and thus paves the way for genome-wide, gene network-phenotype maps.
ABSTRACT: Haplotype assembly is becoming a very important tool in genome sequencing of human and other organisms. Although haplotypes were previously inferred from genome assemblies, there has never been a comparative haplotype browser that depicts a global picture of whole-genome alignments among haplotypes of different organisms. We introduce a whole-genome HAPLotype brOWSER (HAPLOWSER), providing evolutionary perspectives from multiple aligned haplotypes and functional annotations. Haplowser enables the comparison of haplotypes from metagenomes, and associates conserved regions or the bases at the conserved regions with functional annotations and custom tracks. The associations are quantified for further analysis and presented as pie charts. Functional annotations and custom tracks that are projected onto haplotypes are saved as multiple files in FASTA format. Haplowser provides a user-friendly interface, and can display alignments of haplotypes with functional annotations at any resolution. AVAILABILITY: Haplowser, written in Java, supports multiple platforms including Windows and Linux. Haplowser is publicly available at http://embio.yonsei.ac.kr/haplowser .
Publication Stats
9k  Citations  
579.65  Total Impact Points  
Institutions

2014

University of California, Berkeley
 Department of Statistics
Berkeley, California, United States


1984-2014

University of Southern California
 • Division of Molecular and Computational Biology
 • Department of Computer Science
 • Department of Biological Sciences
 • Department of Mathematics
Los Angeles, California, United States


1983-2013

University of California, Los Angeles
 Department of Mathematics
Los Angeles, California, United States


2009-2012

Tsinghua University
 Department of Automation
Beijing, China
Cold Spring Harbor Laboratory
Cold Spring Harbor, New York, United States


1986-2004

Harvard University
 Department of Biostatistics
Cambridge, Massachusetts, United States


2001

Emory University
Atlanta, Georgia, United States


1998

Slovak Academy of Sciences
Bratislava, Slovakia


1995

Pennsylvania State University
 Department of Electrical Engineering
University Park, Pennsylvania, United States


1992

Michigan Technological University
 Department of Computer Science
Houghton, Michigan, United States


1985

Harvard Medical School
Boston, Massachusetts, United States 
University of Southern Mississippi
 Department of Mathematics
Hattiesburg, Mississippi, United States
