Ying Xu

Jilin University, Changchun, Jilin, China

Publications (285) · 1,321.56 Total impact

  • ABSTRACT: Gastric cancer is one of the most prevalent and aggressive cancers worldwide, and its molecular mechanism remains largely elusive. Here we report the genomic landscape of human primary gastric adenocarcinoma, based on the complete genome sequences of five pairs of cancer and matching normal samples. In total, 103,464 somatic point mutations, including 407 non-synonymous ones, were identified, and the most recurrent mutations were harbored by mucins (MUC3A and MUC12) and transcription factors (ZNF717, ZNF595 and TP53). We detected 679 genomic rearrangements, which affect 355 protein-coding genes, and 76 genes show copy number changes. By mapping the boundaries of the rearranged regions to the folded three-dimensional structure of human chromosomes, we determined that 79.6% of the chromosomal rearrangements occur among DNA fragments in close spatial proximity, especially when the two endpoints are in a similar replication phase. We present evidence that microhomology-mediated break-induced replication was a mechanism underlying ~40.9% of the identified genomic changes in gastric tumors. Our data analyses revealed potential integrations of Helicobacter pylori DNA into the gastric cancer genomes. Overall, a large set of novel genomic variations was detected in these gastric cancer genomes, which may be essential to the study of the genetic basis and molecular mechanism of gastric tumorigenesis.
    International Journal of Cancer 11/2014; · 6.20 Impact Factor
  • ABSTRACT: The availability of a large number of sequenced bacterial genomes facilitates in-depth studies about why genes (operons) in a bacterial genome are globally organized the way they are. We have previously discovered that (the relative) transcription-activation frequencies among different biological pathways encoded in a genome have a dominating role in the global arrangement of operons. One complicating factor in such a study is that some operons may be involved in multiple pathways with different activation frequencies. A quantitative model has been developed that captures this information, which tends to be minimized by the current global arrangement of operons in a bacterial (and archaeal) genome compared to possible alternative arrangements. A study is carried out here using this model on a collection of 52 closely related E. coli genomes, which revealed interesting new insights about how bacterial genomes evolve to optimally adapt to their environments through adjusting the (relative) genomic locations of the encoding operons of biological pathways once their utilization and hence transcription-activation frequencies change, to maintain the above energy-efficiency property. More specifically, we observed that it is the frequencies of the transcription activation of pathways relative to those of the other encoded pathways in an organism, as well as the variation in the activation frequencies of a specific pathway across the related genomes, that play a key role in the observed commonalities and differences in the genomic organizations of genes (and operons) encoding specific pathways across different genomes.
    Science China Life Sciences 09/2014
  • ABSTRACT: The Clostridium genus of bacteria contains the most widely studied biofuel-producing organisms such as Clostridium thermocellum and also some human pathogens, plus a few less characterized strains. Here, we present a comparative genomic analysis of 40 fully sequenced clostridial genomes, paying particular attention to the biomass-degrading ones. Our analysis indicates that some of the Clostridium botulinum strains may have been incorrectly classified in the current taxonomy and hence should be renamed according to the 16S ribosomal RNA (rRNA) phylogeny. A core-genome analysis suggests that only 169 orthologous gene groups are shared by all the strains, and the strain-specific gene pool consists of 22,668 genes, which is consistent with the fact that these bacteria live in very diverse environments and have evolved a very large number of strain-specific genes to adapt to different environments. Across the 40 genomes, 1.4–5.8% of genes fall into the carbohydrate active enzyme (CAZyme) families, and 20 out of the 40 genomes may encode cellulosomes, with each genome having 1 to 76 genes bearing the cellulosome-related modules such as dockerins and cohesins. A phylogenetic footprinting analysis identified cis-regulatory motifs that are enriched in the promoters of the CAZyme genes, giving rise to 32 statistically significant motif candidates.
    BioEnergy Research 06/2014; · 4.25 Impact Factor
  • ABSTRACT: A number of proposals have been made in the past century regarding what may drive sporadic cancers to initiate and develop. Yet, the problem remains largely unsolved as none of the proposals have been widely accepted as cancer-initiation drivers. We propose here a driver model for the initiation and early development of solid cancers associated with inflammation-induced chronic hypoxia and ROS accumulation. The model consists of five key elements: (i) human cells tend to have a substantial gap between ATP demand and supply during chronic hypoxia, which would inevitably lead to increased uptake of glucose and accumulation of its metabolites; (ii) the accumulation of these metabolites will cast mounting pressure on the cells and ultimately result in the production and export of hyaluronic acid; (iii) the exported hyaluronic acid will be degraded into fragments of various sizes, serving as tissue-repair signals, including signals for cell proliferation, cell survival and angiogenesis, which lead to the initial proliferation of the underlying cells; (iv) cell division provides an exit for the accumulated glucose metabolites by using them towards macromolecular synthesis for the new cell, and hence alleviates the pressure from the metabolite accumulation; and (v) this process continues as long as the hypoxic condition persists. In tandem, genetic mutations may be selected to make cell divisions, and hence survival, more sustainable and efficient, and also increasingly uncontrollable. This model also applies to some hereditary cancers as their key mutations, such as BRCA for breast cancer, generally lead to increased ROS and ultimately to repression of mitochondrial activities and up-regulation of glycolysis, as well as hypoxia; hence the energy gap, glucose-metabolite accumulation, hyaluronic acid production and continuous cell division for survival.
    International Journal of Cancer 05/2014; · 6.20 Impact Factor
  • ABSTRACT: DMINDA (DNA motif identification and analyses) is an integrated web server for DNA motif identification and analyses, which is accessible at http://csbl.bmb.uga.edu/DMINDA/. This web site is freely available to all users and there is no login requirement. This server provides a suite of cis-regulatory motif analysis functions on DNA sequences, which are important to the elucidation of the mechanisms of transcriptional regulation: (i) de novo motif finding for a given set of promoter sequences along with statistical scores for the predicted motifs derived based on information extracted from a control set, (ii) scanning motif instances of a query motif in provided genomic sequences, (iii) motif comparison and clustering of identified motifs, and (iv) co-occurrence analyses of query motifs in given promoter sequences. The server is powered by a backend computer cluster with over 150 computing nodes, and is particularly useful for motif prediction and analyses in prokaryotic genomes. We believe that DMINDA, as a new and comprehensive web server for cis-regulatory motif finding and analyses, will benefit the genomic research community in general and prokaryotic genome researchers in particular.
    Nucleic Acids Research 04/2014; · 8.81 Impact Factor
  • ABSTRACT: The GPCR genes have a variety of exon-intron structures even though their proteins are all structurally homologous. We have examined all human GPCR genes with at least two functional protein isoforms, totaling 199, aiming to gain an understanding of what may have contributed to the large diversity of the exon-intron structures of the GPCR genes. The 199 genes have a total of 808 known protein splicing isoforms with experimentally verified functions. Our analysis reveals that 1,301 (80.6%) adjacent exon-exon pairs out of the total of 1,613 in the 199 genes have either exactly one exon skipped or the intron in-between retained in at least one of the 808 protein splicing isoforms. This observation has a statistical significance p-value of 2.051762e-09, assuming that the observed splicing isoforms are independent of the exon-intron structures. Our interpretation of this observation is that the exon boundaries of the GPCR genes are not randomly determined; instead they may be selected to facilitate specific alternative splicing for functional purposes.
    Journal of Bioinformatics and Computational Biology 02/2014; 12(1):1350019. · 0.93 Impact Factor
  • ABSTRACT: A challenge in phylogenetic inference of gene trees is how to properly sample a large pool of homologous sequences to derive a good representative subset of sequences. Such a need arises in various applications, e.g. when (1) accuracy-oriented phylogenetic reconstruction methods may not be able to deal with a large pool of sequences due to their high demand in computing resources; (2) applications analyzing a collection of gene trees may prefer to use trees with fewer operational taxonomic units (OTUs), for instance for the detection of horizontal gene transfer events by identifying phylogenetic conflicts; and (3) the pool of available sequences is biased towards extensively studied species. In the past, the creation of subsamples often relied on manual selection. Here we present an Automated sequence-Sampling method for improving the Taxonomic diversity of gene phylogenetic trees, AST, to obtain representative sequences that maximize the taxonomic diversity of the sampled sequences. To demonstrate the effectiveness of AST, we have applied it to four problems, namely, inference of the evolutionary histories of the small ribosomal subunit protein S5 of E. coli, 16S ribosomal RNAs and glycosyl-transferase gene family 8, and a study of ancient horizontal gene transfers from bacteria to plants. Our results show that the resolution of our computational results is almost as good as that of manual inference by domain experts, hence making the tool generally useful to phylogenetic studies by non-phylogeny specialists. The program is available at http://csbl.bmb.uga.edu/~zhouchan/AST.php.
    PLoS ONE 01/2014; 9(6):e98844. · 3.53 Impact Factor
  • ABSTRACT: Essential proteins are those that are indispensable to cellular survival and development. Existing methods for essential protein identification generally rely on knock-out experiments and/or the relative density of their interactions (edges) with other proteins in a Protein-Protein Interaction (PPI) network. Here, we present a computational method, called EW, to first rank protein-protein interactions in terms of their Edge Weights, and then identify sub-PPI-networks consisting of only the highly ranked edges and predict their proteins as essential proteins. We have applied this method to publicly available PPI data on Saccharomyces cerevisiae (Yeast) and Escherichia coli (E. coli) for essential protein identification, and demonstrated that EW achieves better performance than the state-of-the-art methods in terms of the precision-recall and Jackknife measures. The interactions ranked highly by our prediction tend to be biologically significant in both the Yeast and E. coli PPI networks. Further analyses on Yeast and E. coli PPI networks systematically perturbed by randomly deleting edges demonstrate that the proposed method is robust and that the top-ranked edges tend to be more associated with known essential proteins than the low-ranked edges.
    PLoS ONE 01/2014; 9(9):e108716. · 3.53 Impact Factor
  • ABSTRACT: As biotechnology advances rapidly, a tremendous amount of cancer genetic data has become available, providing an unprecedented opportunity for understanding the genetic mechanisms of cancer. To understand the effects of duplications and deletions on cancer progression, two genomes (normal and tumor) were sequenced from each of five stomach cancer patients in different stages (I, II, III and IV). We developed a phylogenetic model for analyzing stomach cancer data. The model assumes that duplication and deletion occur in accordance with a continuous-time Markov chain along the branches of a phylogenetic tree attached with five extended branches leading to the tumor genomes. Moreover, coalescence times of the phylogenetic tree follow a coalescence process. The simulation study suggests that the maximum likelihood approach can accurately estimate parameters in the phylogenetic model. The phylogenetic model was applied to the stomach cancer data. We found that the expected number of changes (duplication and deletion) per gene for the tumor genomes is significantly higher than that for the normal genomes. The goodness-of-fit test suggests that the phylogenetic model with constant duplication and deletion rates can adequately fit the duplication data for the normal genomes. The analysis found nine duplicated genes that are significantly associated with stomach cancer.
    Nucleic Acids Research 12/2013; · 8.81 Impact Factor
  • ABSTRACT: We have recently developed a new version of the DOOR operon database, DOOR 2.0, which is available online at http://csbl.bmb.uga.edu/DOOR/ and will be updated on a regular basis. DOOR 2.0 contains genome-scale operons for 2072 prokaryotes with complete genomes, three times the number of genomes covered in the previous version published in 2009. DOOR 2.0 has a number of new features, compared with its previous version, including (i) more than 250 000 transcription units, experimentally validated or computationally predicted based on RNA-seq data, providing a dynamic functional view of the underlying operons; (ii) an integrated operon-centric data resource that provides not only operons for each covered genome but also their functional and regulatory information such as their cis-regulatory binding sites for transcription initiation and termination, gene expression levels estimated based on RNA-seq data and conservation information across multiple genomes; (iii) a high-performance web service for online operon prediction on user-provided genomic sequences; (iv) an intuitive genome browser to support visualization of user-selected data; and (v) a keyword-based Google-like search engine for finding the needed information intuitively and rapidly in this database.
    Nucleic Acids Research 11/2013; · 8.81 Impact Factor
  • ABSTRACT: Studying lignin biosynthesis in Panicum virgatum (switchgrass) has provided a basis for generating plants with reduced lignin content and increased saccharification efficiency. Chlorogenic acid (CGA, caffeoyl quinate) is the major soluble phenolic compound in switchgrass, and the lignin and CGA biosynthetic pathways potentially share intermediates and enzymes. The enzyme hydroxycinnamoyl-CoA:quinate hydroxycinnamoyltransferase (HQT) is responsible for CGA biosynthesis in tobacco, tomato and globe artichoke, but there are no close orthologs of HQT in switchgrass or in other monocotyledonous plants with complete genome sequences. We examined available transcriptomic databases for genes encoding enzymes potentially involved in CGA biosynthesis in switchgrass. The protein products of two hydroxycinnamoyl-CoA shikimate/quinate hydroxycinnamoyltransferase (HCT) genes (PvHCT1a and PvHCT2a), closely related to lignin pathway HCTs from other species, were characterized biochemically and exhibited the expected HCT activity, preferring shikimic acid as acyl acceptor. We also characterized two switchgrass coumaroyl shikimate 3'-hydroxylase (C3'H) enzymes (PvC3'H1 and PvC3'H2); both of these cytochrome P450s had the capacity to hydroxylate 4-coumaroyl shikimate or 4-coumaroyl quinate to generate caffeoyl shikimate or CGA. Another switchgrass hydroxycinnamoyl transferase, PvHCT-Like1, is phylogenetically distant from HCTs or HQTs, but exhibits HQT activity, preferring quinic acid as acyl acceptor, and could therefore function in CGA biosynthesis. The biochemical features of the recombinant enzymes, the presence of the corresponding activities in plant protein extracts, and the expression patterns of the corresponding genes suggest preferred routes to CGA in switchgrass.
    Plant Molecular Biology 11/2013; · 3.52 Impact Factor
  • ABSTRACT: The thermophilic anaerobe Clostridium thermocellum is a candidate consolidated bioprocessing (CBP) biocatalyst for cellulosic ethanol production. It is capable of both cellulose solubilization and its fermentation to produce lignocellulosic ethanol. Intolerance to stresses routinely encountered during industrial fermentations may hinder the commercial development of this organism. A previous C. thermocellum ethanol stress study showed that the largest transcriptomic response was in genes and proteins related to nitrogen uptake and metabolism. In this study, C. thermocellum was grown to mid-exponential phase and treated with furfural or heat to a final concentration of 3 g/L or 68 °C, respectively, to investigate general and specific physiological and regulatory stress responses. Samples were taken at 10, 30, 60 and 120 min post-shock, and from untreated control fermentations, for transcriptomic analyses and fermentation product determinations, and compared to a published dataset from an ethanol stress study. Urea uptake genes were induced following furfural stress, but not to the same extent as under ethanol stress, and transcription from these genes was largely unaffected by heat stress. The largest transcriptomic response to furfural stress was in genes for sulfate transporter subunits and enzymes in the sulfate assimilatory pathway, although these genes were also affected late in the heat and ethanol stress responses. Lactate production was higher in furfural-treated culture, although the lactate dehydrogenase gene was not differentially expressed under this condition. Other redox-related genes such as a copy of the rex gene, a bifunctional acetaldehyde-CoA/alcohol dehydrogenase and adjacent genes did show lower expression after furfural stress compared to the control, heat and ethanol fermentation profiles. Heat stress induced expression from chaperone-related genes, and overlap was observed with the responses to the other stresses. This study suggests the involvement of C. thermocellum genes with functions in oxidative stress protection, electron transfer, detoxification, sulfur and nitrogen acquisition, and DNA repair mechanisms in its stress responses and the use of different regulatory networks to coordinate and control adaptation. This study has identified C. thermocellum gene regulatory motifs and aspects of physiology and gene regulation for further study. The nexus between future systems biology studies and recently developed genetic tools for C. thermocellum offers the potential for more rapid strain development and for broader insights into this organism's physiology and regulation.
    Biotechnology for Biofuels 09/2013; 6(1):131. · 5.55 Impact Factor
  • ABSTRACT: Background: Zymomonas mobilis ZM4 is a capable ethanologenic bacterium with high ethanol productivity and ethanol tolerance. Previous studies indicated that several stress-related proteins and changes in the ZM4 membrane lipid composition may contribute to ethanol tolerance. However, the molecular mechanisms of its ethanol stress response have not been elucidated fully. Methodology/Principal Findings: In this study, ethanol stress responses were investigated using systems biology approaches. Medium supplementation with an initial 47 g/L (6% v/v) ethanol reduced Z. mobilis ZM4 glucose consumption, growth rate and ethanol productivity compared to that of untreated controls. A proteomic analysis of early exponential growth identified about one thousand proteins, or approximately 55% of the predicted ZM4 proteome. Proteins related to metabolism and stress response such as chaperones and key regulators were more abundant in
    PLoS ONE 07/2013; 8(7):e68886. · 3.53 Impact Factor
  • ABSTRACT: We present an integrated toolkit, BoBro2.0, for prediction and analysis of cis-regulatory motifs. This toolkit can (i) reliably identify statistically significant cis-regulatory motifs at a genome scale; (ii) accurately scan for all motif instances of a query motif in specified genomic regions using a novel method for p-value estimation; (iii) provide highly reliable comparisons and clustering of identified motifs, which takes into consideration the weak signals from the flanking regions of the motifs; and (iv) analyze co-occurring motifs in the regulatory regions. We have carried out systematic comparisons between motif predictions by BoBro2.0 and by the MEME package. The comparison results on the E. coli K12 genome and the human genome show that BoBro2.0 can identify the statistically significant motifs at a genome scale more efficiently, identify motif instances more accurately and get more reliable motif clusters than MEME. In addition, BoBro2.0 provides correlational analyses among the identified motifs to facilitate the inference of joint regulation relationships of transcription factors. The source code of the program is freely available for noncommercial uses at http://code.google.com/p/bobro/. Contact: xyn@bmb.uga.edu.
    Bioinformatics 07/2013; · 5.47 Impact Factor
  • ABSTRACT: The three major components of plant biomass, cellulose, hemicellulose and lignin, are highly recalcitrant and deconstruction involves thermal and chemical pretreatment. Microbial conversion is a possible solution, but few anaerobic microbes utilize both cellulose and hemicellulose and none are known to solubilize lignin. Herein, we show that the majority (85%) of insoluble switchgrass biomass that had not been previously chemically treated was degraded at 78 °C by the anaerobic bacterium Caldicellulosiruptor bescii. Remarkably, the glucose/xylose/lignin ratio and physical and spectroscopic properties of the remaining insoluble switchgrass were not significantly different from those of the untreated plant material. C. bescii is therefore able to solubilize lignin as well as the carbohydrates and, accordingly, lignin-derived aromatics were detected in the culture supernatants. From mass balance analyses, the carbohydrate in the solubilized switchgrass quantitatively accounted for the growth of C. bescii and its fermentation products, indicating that the lignin was not assimilated by the microorganism. Immunoanalyses of biomass and transcriptional analyses of C. bescii showed that the microorganism, when grown on switchgrass, produces enzymes directed at key plant cell wall moieties such as pectin, xyloglucans and rhamnogalacturonans, and that these and as yet uncharacterized enzymes enable the degradation of cellulose, hemicellulose and lignin at comparable rates. This unexpected finding of simultaneous lignin and carbohydrate solubilization bodes well for industrial conversion by extremely thermophilic microbes of biomass that requires limited or, more importantly, no chemical pretreatment.
    Energy & Environmental Science 05/2013; · 11.65 Impact Factor

Publication Stats

7k Citations
1,321.56 Total Impact Points


  • 2009–2014
    • Jilin University
      • Department of Pathogenobiology
      • College of Computer Science & Technology
      • Department of Chemistry
      Changchun, Jilin, China
    • Zhejiang University
      Hangzhou, Zhejiang, China
  • 2003–2014
    • University of Georgia
      • Institute of Bioinformatics
      • Department of Biochemistry and Molecular Biology
      • Department of Computer Science
      Athens, Georgia, United States
    • University of Missouri
      Columbia, Missouri, United States
    • University of Waterloo
      • David R. Cheriton School of Computer Science
      Waterloo, Ontario, Canada
    • University of Tennessee
      Knoxville, Tennessee, United States
  • 2009–2013
    • The Samuel Roberts Noble Foundation
      • Division of Plant Biology
      Ardmore, Oklahoma, United States
  • 2002–2013
    • Howard Hughes Medical Institute
      Ashburn, Virginia, United States
    • Sandia National Laboratories
      Albuquerque, New Mexico, United States
  • 1995–2013
    • Oak Ridge National Laboratory
      • Biosciences Division
      • Joint Institute for Computational Sciences
      • Life Sciences Division
      • Computer Science and Mathematics Division
      Oak Ridge, Tennessee, United States
  • 2012
    • Northeast Institute of Geography and Agroecology
      • Institute of Biomedical Engineering and Health Technology
      Beijing, China
    • Tongji University
      • College of Life Science and Technology
      Shanghai, China
    • North Carolina State University
      • Department of Chemical and Biomolecular Engineering
      Raleigh, NC, United States
  • 2011
    • Chinese PLA General Hospital (301 Hospital)
      Beijing, China
  • 2006–2010
    • University of California, San Francisco
      • Department of Microbiology and Immunology
      San Francisco, CA, United States
    • Shandong University
      • Department of Applied Mathematics
      Jinan, Shandong, China
    • The American Society for Biochemistry and Molecular Biology
      Athens, Georgia, United States
  • 2007–2009
    • Georgia Institute of Technology
      • School of Electrical & Computer Engineering
      Atlanta, GA, United States
    • Drexel University
      Philadelphia, Pennsylvania, United States
  • 2007–2008
    • University of North Carolina at Charlotte
      Charlotte, North Carolina, United States
  • 2005
    • Tokyo Denki University
      • Division of Mathematical Sciences
      Tokyo, Japan
  • 2004
    • University of California, Riverside
      • Department of Computer Science and Engineering
      Riverside, CA, United States
  • 2002–2003
    • University of Alberta
      • Department of Computing Science
      Edmonton, Alberta, Canada