Ying Xu

Jilin University, Changchun, Jilin Province, China

Publications (297) · 1372.44 Total impact

  • Source
    ABSTRACT: Identification of transcription units (TUs) encoded in a bacterial genome is essential to elucidation of transcriptional regulation of the organism. To gain a detailed understanding of the dynamically composed TU structures, we have used four strand-specific RNA-seq (ssRNA-seq) datasets collected under two experimental conditions to derive the genomic TU organization of Clostridium thermocellum using a machine-learning approach. Our method accurately predicted the genomic boundaries of individual TUs based on two sets of parameters measuring the RNA-seq expression patterns across the genome: expression-level continuity and variance. A total of 2590 distinct TUs are predicted based on the four RNA-seq datasets. Among the predicted TUs, 44% have multiple genes. We assessed our prediction method on an independent set of RNA-seq data with longer reads. The evaluation confirmed the high quality of the predicted TUs. Functional enrichment analyses on a selected subset of the predicted TUs revealed interesting biology. To demonstrate the generality of the prediction method, we have also applied the method to RNA-seq data collected on Escherichia coli and achieved high prediction accuracies. The TU prediction program named SeqTU is publicly available at https://code.google.com/p/seqtu/. We expect that the predicted TUs can serve as the baseline information for studying transcriptional and post-transcriptional regulation in C. thermocellum and other bacteria. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
    Nucleic Acids Research 02/2015; DOI:10.1093/nar/gkv177 · 8.81 Impact Factor
  • Source
    ABSTRACT: Gastric cancer is one of the most prevalent and aggressive cancers worldwide, and its molecular mechanism remains largely elusive. Here we report the genomic landscape of human primary gastric adenocarcinoma, based on the complete genome sequences of five pairs of cancer and matching normal samples. In total, 103,464 somatic point mutations, including 407 non-synonymous ones, were identified, and the most recurrent mutations were harbored by mucins (MUC3A and MUC12) and transcription factors (ZNF717, ZNF595 and TP53). A total of 679 genomic rearrangements were detected, affecting 355 protein-coding genes, and 76 genes show copy number changes. By mapping the boundaries of the rearranged regions to the folded three-dimensional structure of human chromosomes, we determined that 79.6% of the chromosomal rearrangements happen among DNA fragments in close spatial proximity, especially when the two endpoints stay in a similar replication phase. We present evidence that microhomology-mediated break-induced replication was used as a mechanism in inducing ~40.9% of the identified genomic changes in gastric tumors. Our data analyses revealed potential integrations of Helicobacter pylori DNA into the gastric cancer genomes. Overall, a large set of novel genomic variations was detected in these gastric cancer genomes, which may be essential to the study of the genetic basis and molecular mechanism of gastric tumorigenesis.
    International Journal of Cancer 12/2014; DOI:10.1002/ijc.29352 · 6.20 Impact Factor
  • Chi Zhang, Sha Cao, Ying Xu
    10/2014; 2(3). DOI:10.1007/s40484-014-0032-8
  • Source
    ABSTRACT: Essential proteins are those that are indispensable to cellular survival and development. Existing methods for essential protein identification generally rely on knock-out experiments and/or the relative density of their interactions (edges) with other proteins in a Protein-Protein Interaction (PPI) network. Here, we present a computational method, called EW, to first rank protein-protein interactions in terms of their Edge Weights, and then identify sub-PPI-networks consisting of only the highly-ranked edges and predict their proteins as essential proteins. We have applied this method to publicly-available PPI data on Saccharomyces cerevisiae (Yeast) and Escherichia coli (E. coli) for essential protein identification, and demonstrated that EW achieves better performance than the state-of-the-art methods in terms of the precision-recall and Jackknife measures. The highly-ranked protein-protein interactions by our prediction tend to be biologically significant in both the Yeast and E. coli PPI networks. Further analyses on systematically perturbed Yeast and E. coli PPI networks through randomly deleting edges demonstrate that the proposed method is robust and the top-ranked edges tend to be more associated with known essential proteins than the lowly-ranked edges.
    PLoS ONE 09/2014; 9(9):e108716. DOI:10.1371/journal.pone.0108716 · 3.53 Impact Factor
  • Source
    ABSTRACT: The availability of a large number of sequenced bacterial genomes facilitates in-depth studies about why genes (operons) in a bacterial genome are globally organized the way they are. We have previously discovered that (the relative) transcription-activation frequencies among different biological pathways encoded in a genome have a dominating role in the global arrangement of operons. One complicating factor in such a study is that some operons may be involved in multiple pathways with different activation frequencies. A quantitative model has been developed that captures this information, which tends to be minimized by the current global arrangement of operons in a bacterial (and archaeal) genome compared to possible alternative arrangements. A study is carried out here using this model on a collection of 52 closely related E. coli genomes, which revealed interesting new insights about how bacterial genomes evolve to optimally adapt to their environments through adjusting the (relative) genomic locations of the encoding operons of biological pathways once their utilization and hence transcription activation frequencies change, to maintain the above energy-efficiency property. More specifically we observed that it is the frequencies of the transcription activation of pathways relative to those of the other encoded pathways in an organism as well as the variation in the activation frequencies of a specific pathway across the related genomes that play a key role in the observed commonalities and differences in the genomic organizations of genes (and operons) encoding specific pathways across different genomes.
    Science China. Life sciences 09/2014; DOI:10.1007/s11427-014-4734-y · 1.51 Impact Factor
  • Nature 09/2014; · 42.35 Impact Factor
  • Source
    ABSTRACT: The Clostridium genus of bacteria contains the most widely studied biofuel-producing organisms such as Clostridium thermocellum and also some human pathogens, plus a few less characterized strains. Here, we present a comparative genomic analysis of 40 fully sequenced clostridial genomes, paying particular attention to the biomass-degrading ones. Our analysis indicates that some of the Clostridium botulinum strains may have been incorrectly classified in the current taxonomy and hence should be renamed according to the 16S ribosomal RNA (rRNA) phylogeny. A core-genome analysis suggests that only 169 orthologous gene groups are shared by all the strains, and the strain-specific gene pool consists of 22,668 genes, which is consistent with the fact that these bacteria live in very diverse environments and have evolved a very large number of strain-specific genes to adapt to different environments. Across the 40 genomes, 1.4–5.8% of genes fall into the carbohydrate active enzyme (CAZyme) families, and 20 out of the 40 genomes may encode cellulosomes with each genome having 1 to 76 genes bearing the cellulosome-related modules such as dockerins and cohesins. A phylogenetic footprinting analysis identified cis-regulatory motifs that are enriched in the promoters of the CAZyme genes, giving rise to 32 statistically significant motif candidates.
    BioEnergy Research 06/2014; DOI:10.1007/s12155-014-9486-9 · 3.40 Impact Factor
  • Source
    ABSTRACT: A challenge in phylogenetic inference of gene trees is how to properly sample a large pool of homologous sequences to derive a good representative subset of sequences. Such a need arises in various applications, e.g. when (1) accuracy-oriented phylogenetic reconstruction methods may not be able to deal with a large pool of sequences due to their high demand in computing resources; (2) applications analyzing a collection of gene trees may prefer to use trees with fewer operational taxonomic units (OTUs), for instance for the detection of horizontal gene transfer events by identifying phylogenetic conflicts; and (3) the pool of available sequences is biased towards extensively studied species. In the past, the creation of subsamples often relied on manual selection. Here we present an Automated sequence-Sampling method for improving the Taxonomic diversity of gene phylogenetic trees, AST, to obtain representative sequences that maximize the taxonomic diversity of the sampled sequences. To demonstrate the effectiveness of AST, we have tested it to solve four problems, namely, inference of the evolutionary histories of the small ribosomal subunit protein S5 of E. coli, 16S ribosomal RNAs and glycosyl-transferase gene family 8, and a study of ancient horizontal gene transfers from bacteria to plants. Our results show that the resolution of our computational results is almost as good as that of manual inference by domain experts, hence making the tool generally useful to phylogenetic studies by non-phylogeny specialists. The program is available at http://csbl.bmb.uga.edu/~zhouchan/AST.php.
    PLoS ONE 06/2014; 9(6):e98844. DOI:10.1371/journal.pone.0098844 · 3.53 Impact Factor
  • ABSTRACT: A number of proposals have been made in the past century regarding what may drive sporadic cancers to initiate and develop. Yet, the problem remains largely unsolved as none of the proposals have been widely accepted as cancer-initiation drivers. We propose here a driver model for the initiation and early development of solid cancers associated with inflammation-induced chronic hypoxia and ROS accumulation. The model consists of five key elements: (i) human cells tend to have a substantial gap between ATP demand and supply during chronic hypoxia, which would inevitably lead to increased uptake of glucose and accumulation of its metabolites; (ii) the accumulation of these metabolites will cast mounting pressure on the cells and ultimately result in the production and export of hyaluronic acid; (iii) the exported hyaluronic acid will be degraded into fragments of various sizes, serving as tissue-repair signals, including signals for cell proliferation, cell survival and angiogenesis, which lead to the initial proliferation of the underlying cells; (iv) cell division provides an exit for the accumulated glucose metabolites by using them towards macromolecular synthesis for the new cell, and hence alleviate the pressure from the metabolite accumulation; and (v) this process continues as long as the hypoxic condition persists. In tandem, genetic mutations may be selected to make cell divisions and hence survival more sustainable and efficient, also increasingly more uncontrollable. This model also applies to some hereditary cancers as their key mutations, such as BRCA for breast cancer, generally lead to increased ROS and ultimately to repression of mitochondrial activities and up-regulation of glycolysis, as well as hypoxia; hence the energy gap, glucose-metabolite accumulation, hyaluronic acid production and continuous cell division for survival. © 2014 Wiley Periodicals, Inc.
    International Journal of Cancer 05/2014; 136(9). DOI:10.1002/ijc.28975 · 6.20 Impact Factor
  • Source
    ABSTRACT: DMINDA (DNA motif identification and analyses) is an integrated web server for DNA motif identification and analyses, which is accessible at http://csbl.bmb.uga.edu/DMINDA/. This web site is freely available to all users and there is no login requirement. This server provides a suite of cis-regulatory motif analysis functions on DNA sequences, which are important to elucidation of the mechanisms of transcriptional regulation: (i) de novo motif finding for a given set of promoter sequences along with statistical scores for the predicted motifs derived based on information extracted from a control set, (ii) scanning motif instances of a query motif in provided genomic sequences, (iii) motif comparison and clustering of identified motifs, and (iv) co-occurrence analyses of query motifs in given promoter sequences. The server is powered by a backend computer cluster with over 150 computing nodes, and is particularly useful for motif prediction and analyses in prokaryotic genomes. We believe that DMINDA, as a new and comprehensive web server for cis-regulatory motif finding and analyses, will benefit the genomic research community in general and prokaryotic genome researchers in particular.
    Nucleic Acids Research 04/2014; 42(W1). DOI:10.1093/nar/gku315 · 8.81 Impact Factor
  • Source
    ABSTRACT: The GPCR genes have a variety of exon-intron structures even though their proteins are all structurally homologous. We have examined all human GPCR genes with at least two functional protein isoforms, totaling 199, aiming to gain an understanding of what may have contributed to the large diversity of the exon-intron structures of the GPCR genes. The 199 genes have a total of 808 known protein splicing isoforms with experimentally verified functions. Our analysis reveals that 1,301 (80.6%) adjacent exon-exon pairs out of the total of 1,613 in the 199 genes have either exactly one exon skipped or the intron in-between retained in at least one of the 808 protein splicing isoforms. This observation has a statistical significance p-value of 2.051762 × 10^-9, assuming that the observed splicing isoforms are independent of the exon-intron structures. Our interpretation of this observation is that the exon boundaries of the GPCR genes are not randomly determined; instead they may be selected to facilitate specific alternative splicing for functional purposes.
    Journal of Bioinformatics and Computational Biology 02/2014; 12(1):1350019. DOI:10.1142/S0219720013500194 · 0.93 Impact Factor
  • ABSTRACT: Pancreatic cancer is the deadliest of all cancers, with the worst outcome and a poor survival rate. Chemotherapy with gemcitabine works well for early-stage cancer, but becomes ineffective for advanced-stage cancer. As such, there is a dire need for new approaches to treat this cancer. The metabolism of tumor cells is very different from that of normal cells. In particular, the differences in amino acid metabolism are gaining increasing attention in cancer biology. Selective amino acid transporters are upregulated in cancer in response to the increased demands for amino acids in tumor cells. Such tumor-selective amino acid transporters are logical druggable targets for cancer therapy. As such, pharmacologic blockade of such upregulated transporters would lead to cell death selectively in tumor cells by depriving the tumor cells of essential nutrients. With this in mind, we analyzed 8 different publicly available microarray datasets in Gene Expression Omnibus for the amino acid transporters that are upregulated
  • Source
    ABSTRACT: Increased flux through the hexosamine biosynthetic pathway and the corresponding increase in intracellular glycosylation of proteins via O-linked β-N-acetylglucosamine (O-GlcNAc) is sufficient to induce insulin resistance (IR) in multiple systems. Previously, our group used shotgun proteomics to identify multiple rodent adipocytokines and secreted proteins whose levels are modulated upon the induction of IR by indirectly and directly modulating O-GlcNAc levels. We have validated the relative levels of several of these factors using immunoblotting. Since adipocytokines levels are regulated primarily at the level of transcription and O-GlcNAc alters the function of many transcription factors, we hypothesized that elevated O-GlcNAc levels on key transcription factors are modulating secreted protein expression. Here, we show that upon the elevation of O-GlcNAc levels and the induction of IR in mature 3T3-F442a adipocytes, the transcript levels of multiple secreted proteins reflect the modulation observed at the protein level. We validate the transcript levels in male mouse models of diabetes. Using inguinal fat pads from the severely IR db/db mouse model and the mildly IR diet-induced mouse model, we have confirmed that the secreted proteins regulated by O-GlcNAc modulation in cell culture are likewise modulated in the whole animal upon a shift to IR. By comparing the promoters of similarly regulated genes, we determine that Sp1 is a common cis-acting element. Furthermore, we show that the LPL and SPARC promoters are enriched for Sp1 and O-GlcNAc modified proteins during insulin resistance in adipocytes. Thus, the O-GlcNAc modification of proteins bound to promoters, including Sp1, is linked to adipocytokine transcription during insulin resistance.
    Frontiers in Endocrinology 01/2014; 5:223. DOI:10.3389/fendo.2014.00223
  • Source
    ABSTRACT: As biotechnology advances rapidly, a tremendous amount of cancer genetic data has become available, providing an unprecedented opportunity for understanding the genetic mechanisms of cancer. To understand the effects of duplications and deletions on cancer progression, two genomes (normal and tumor) were sequenced from each of five stomach cancer patients in different stages (I, II, III and IV). We developed a phylogenetic model for analyzing stomach cancer data. The model assumes that duplication and deletion occur in accordance with a continuous time Markov Chain along the branches of a phylogenetic tree attached with five extended branches leading to the tumor genomes. Moreover, coalescence times of the phylogenetic tree follow a coalescence process. The simulation study suggests that the maximum likelihood approach can accurately estimate parameters in the phylogenetic model. The phylogenetic model was applied to the stomach cancer data. We found that the expected number of changes (duplication and deletion) per gene for the tumor genomes is significantly higher than that for the normal genomes. The goodness-of-fit test suggests that the phylogenetic model with constant duplication and deletion rates can adequately fit the duplication data for the normal genomes. The analysis found nine duplicated genes that are significantly associated with stomach cancer.
    Nucleic Acids Research 12/2013; DOI:10.1093/nar/gkt1320 · 8.81 Impact Factor
  • Source
    ABSTRACT: Proteins can move from blood circulation into salivary glands through active transportation, passive diffusion or ultrafiltration, some of which are then released into saliva and hence can potentially serve as biomarkers for diseases if accurately identified. We present a novel computational method for predicting salivary proteins that come from circulation. The basis for the prediction is a set of physiochemical and sequence features we found to be discerning between human proteins known to be movable from circulation to saliva and proteins deemed to be not in saliva. A classifier was trained based on these features using a support-vector machine to predict protein secretion into saliva. The classifier achieved 88.56% average recall and 90.76% average precision in 10-fold cross-validation on the training data, indicating that the selected features are informative. Considering the possibility that our negative training data may not be highly reliable (i.e., proteins predicted to be not in saliva), we have also trained a ranking method, aiming to rank the known salivary proteins from circulation as the highest among the proteins in the general background, based on the same features. This prediction capability can be used to predict potential biomarker proteins for specific human diseases when coupled with the information of differentially expressed proteins in diseased versus healthy control tissues and a prediction capability for blood-secretory proteins. Using such integrated information, we predicted 31 candidate biomarker proteins in saliva for breast cancer.
    PLoS ONE 11/2013; 8(11):e80211. DOI:10.1371/journal.pone.0080211 · 3.53 Impact Factor
  • Source
    ABSTRACT: We have recently developed a new version of the DOOR operon database, DOOR 2.0, which is available online at http://csbl.bmb.uga.edu/DOOR/ and will be updated on a regular basis. DOOR 2.0 contains genome-scale operons for 2072 prokaryotes with complete genomes, three times the number of genomes covered in the previous version published in 2009. DOOR 2.0 has a number of new features, compared with its previous version, including (i) more than 250 000 transcription units, experimentally validated or computationally predicted based on RNA-seq data, providing a dynamic functional view of the underlying operons; (ii) an integrated operon-centric data resource that provides not only operons for each covered genome but also their functional and regulatory information such as their cis-regulatory binding sites for transcription initiation and termination, gene expression levels estimated based on RNA-seq data and conservation information across multiple genomes; (iii) a high-performance web service for online operon prediction on user-provided genomic sequences; (iv) an intuitive genome browser to support visualization of user-selected data; and (v) a keyword-based Google-like search engine for finding the needed information intuitively and rapidly in this database.
    Nucleic Acids Research 11/2013; 42(D1). DOI:10.1093/nar/gkt1048 · 8.81 Impact Factor

Publication Stats

9k Citations
1,372.44 Total Impact Points


  • 2003–2014
    • Jilin University
      • College of Computer Science & Technology
      • State Key Laboratory of Inorganic Synthesis and Preparative
      Changchun, Jilin Province, China
    • University of Missouri
      Columbia, Missouri, United States
    • University of California, Santa Barbara
      • Department of Computer Science
      Santa Barbara, CA, United States
  • 2003–2013
    • University of Georgia
      • Department of Biochemistry and Molecular Biology
      • Department of Computer Science
      • Department of Psychology
      Athens, Georgia, United States
  • 2002–2013
    • Howard Hughes Medical Institute
      Ashburn, Virginia, United States
    • Sandia National Laboratories
      Albuquerque, New Mexico, United States
  • 1995–2013
    • Oak Ridge National Laboratory
      • Biosciences Division
      • Computer Science and Mathematics Division
      • Life Sciences Division
      Oak Ridge, Tennessee, United States
  • 2012
    • Sichuan University
      • Key Laboratory of Bio-resource and Eco-environment (MOE)
      Hua-yang, Sichuan, China
  • 2006–2011
    • University of California, San Francisco
      • Department of Microbiology and Immunology
      San Francisco, California, United States
    • The American Society for Biochemistry and Molecular Biology
      Athens, Georgia, United States
    • Shandong University
      • Department of Applied Mathematics
      Jinan, Shandong Province, China
  • 2007
    • Georgia Institute of Technology
      • School of Electrical & Computer Engineering
      Atlanta, GA, United States
  • 2005
    • Tokyo Denki University
      • Division of Mathematical Sciences
      Edo, Tōkyō, Japan
  • 2003–2005
    • The University of Tennessee Medical Center at Knoxville
      Knoxville, Tennessee, United States
  • 2004
    • University of California, Riverside
      • Department of Computer Science and Engineering
      Riverside, CA, United States
  • 2002–2003
    • University of Alberta
      • Department of Computing Science
      Edmonton, Alberta, Canada