Consolidating metabolite identifiers to enable contextual and multi-platform metabolomics

Metabolomics Research Group, RIKEN Plant Science Center, 1-7-22 Tsurumi-ku, Suehiro-cho, Yokohama, Kanagawa, 230-0045, Japan.
BMC Bioinformatics (Impact Factor: 2.58). 04/2010; 11(1):214. DOI: 10.1186/1471-2105-11-214
Source: PubMed


Analysis of data from high-throughput experiments depends on the availability of well-structured data that describe the assayed biomolecules. Procedures for obtaining and organizing such meta-data on genes, transcripts and proteins have been streamlined in many data analysis packages, but are still lacking for metabolites. Chemical identifiers are notoriously incoherent, encompassing a wide range of different referencing schemes with varying scope and coverage. Online chemical databases use multiple types of identifiers in parallel but lack a common primary key for reliable database consolidation. Connecting identifiers of analytes found in experimental data with the identifiers of their parent metabolites in public databases can therefore be very laborious.
Here we present a strategy and a software tool for integrating metabolite identifiers from local reference libraries and public databases that do not depend on a single common primary identifier. The program constructs groups of interconnected identifiers of analytes and metabolites to obtain a local metabolite-centric SQLite database. The resulting database can be used to map in-house identifiers and synonyms to external resources such as the KEGG database. New identifiers can be imported and directly integrated with existing data. Queries can be run flexibly, both from the command line and from the statistical programming environment R, to obtain identifier mappings tailored to a given data set.
Efficient cross-referencing of metabolite identifiers is a key technology for metabolomics data analysis. We provide a practical and flexible solution to this task, together with an open-source program, the metabolite masking tool (MetMask), that implements our ideas.
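
The grouping idea described above can be illustrated with a minimal sketch. It is not MetMask's actual algorithm, schema or API: it assumes that any two identifiers appearing in the same record name the same compound, merges them with a simple union-find, and stores the resulting groups in an in-memory SQLite table. The records, the table name `identifier` and the helpers `group_identifiers`, `build_database` and `map_identifier` are all hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical input records: each record lists identifiers that are
# assumed to refer to the same compound (illustrative values only).
RECORDS = [
    {"inhouse": "GC-0042", "synonym": "L-alanine", "cas": "56-41-7"},
    {"cas": "56-41-7", "kegg": "C00041"},
    {"inhouse": "LC-0007", "synonym": "citrate", "kegg": "C00158"},
]


def group_identifiers(records):
    """Group (type, value) identifiers into connected components."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for record in records:
        items = list(record.items())
        find(items[0])  # register single-identifier records too
        for other in items[1:]:
            union(items[0], other)

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


def build_database(groups):
    """Store one row per identifier, keyed by the group it belongs to."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE identifier (group_id INTEGER, id_type TEXT, value TEXT)"
    )
    for group_id, group in enumerate(groups):
        con.executemany(
            "INSERT INTO identifier VALUES (?, ?, ?)",
            [(group_id, id_type, value) for id_type, value in sorted(group)],
        )
    return con


def map_identifier(con, from_type, value, to_type):
    """Return all identifiers of to_type in the same group as the query."""
    rows = con.execute(
        "SELECT b.value FROM identifier a "
        "JOIN identifier b ON a.group_id = b.group_id "
        "WHERE a.id_type = ? AND a.value = ? AND b.id_type = ? "
        "ORDER BY b.value",
        (from_type, value, to_type),
    )
    return [row[0] for row in rows]


groups = group_identifiers(RECORDS)
con = build_database(groups)
print(map_identifier(con, "inhouse", "GC-0042", "kegg"))
```

In this sketch the in-house identifier GC-0042 reaches its KEGG entry only through the shared CAS number, which shows why no single common primary key is needed. A real tool would additionally have to guard against ambiguous identifiers, such as a synonym shared by several compounds, which would otherwise merge unrelated groups.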



Available from: Miyako Kusano
    • "Each separate platform can individually be appropriate for a metabolomics study. For this investigation, the four separate data sets from each platform were collated using a documented data summarization strategy (Redestig et al. 2010; Kusano et al. 2011). Overall, the approach generated a total of 732 identified or annotated peaks (Supporting Information File S2)."
    ABSTRACT: Information on crop genotype- and phenotype-metabolite associations can be of value to trait development as well as to food security and safety. The unique study presented here assessed seed metabolomic and ionomic diversity in a soybean lineage representing ~35 years of breeding (launch years 1972–2008) and increasing yield potential. Selected varieties included six conventional and three genetically modified (GM) glyphosate-tolerant lines. A metabolomics approach utilizing capillary electrophoresis (CE)-time-of-flight-mass spectrometry (TOF-MS), gas chromatography (GC)-TOF-MS and liquid chromatography (LC)-quadrupole (q)-TOFMS resulted in measurement of a total of 732 annotated peaks. Ionomics through inductively-coupled plasma (ICP)-MS profiled twenty mineral elements. Orthogonal partial least squares-discriminant analysis (OPLS-DA) of the seed data successfully differentiated newer higher-yielding soybean from earlier lower-yielding accessions at both field sites. This result reflected genetic fingerprinting data that demonstrated a similar distinction between the newer and older soybean. Correlation analysis also revealed associations between yield data and specific metabolites. There were no clear metabolic differences between the conventional and GM lines. Overall, observations of metabolic and genetic differences between older and newer soybean varieties provided novel and significant information on the impact of varietal development on biochemical variability. Proposed applications of omics in food and feed safety assessments will need to consider that GM is not a major source of metabolite variability and that trait development in crops will, of necessity, be associated with biochemical variation.
    Full-text · Article · Apr 2014 · Metabolomics
    • "We only considered open-source applications as these can readily be adapted to the needs of the metabolic reconstruction community and integrated into metabolic reconstruction tools. Three applications that met these criteria were MetMask [29], the Chemical Translation System (CTS) [30] and UniChem [31]. These applications implement annotation strategies that go beyond name search. "
    ABSTRACT: An important step in the reconstruction of a metabolic network is annotation of metabolites. Metabolites are generally annotated with various database or structure based identifiers. Metabolite annotations in metabolic reconstructions may be incorrect or incomplete and thus need to be updated prior to their use. Genome-scale metabolic reconstructions generally include hundreds of metabolites. Manually updating annotations is therefore highly laborious. This prompted us to look for open-source software applications that could facilitate automatic updating of annotations by mapping between available metabolite identifiers. We identified three applications developed for the metabolomics and chemical informatics communities as potential solutions. The applications were MetMask, the Chemical Translation System, and UniChem. The first implements a "metabolite masking" strategy for mapping between identifiers whereas the latter two implement different versions of an InChI based strategy. Here we evaluated the suitability of these applications for the task of mapping between metabolite identifiers in genome-scale metabolic reconstructions. We applied the best suited application to updating identifiers in Recon 2, the latest reconstruction of human metabolism. All three applications enabled partially automatic updating of metabolite identifiers, but significant manual effort was still required to fully update identifiers. We were able to reduce this manual effort by searching for new identifiers using multiple types of information about metabolites. When multiple types of information were combined, the Chemical Translation System enabled us to update over 3,500 metabolite identifiers in Recon 2. All but approximately 200 identifiers were updated automatically. We found that an InChI based application such as the Chemical Translation System was better suited to the task of mapping between metabolite identifiers in genome-scale metabolic reconstructions. We identified several features, however, that could be added to such an application in order to tailor it to this task.
    Full-text · Article · Jan 2014 · Journal of Cheminformatics
    • "Okada et al. (2009)
      Exp: selection of metabolites; Brassicaceae, Gramineae, Fabaceae; Sawada et al. (2009)
      Exp: matrix-assisted laser desorption/ionization mass spectrometry; Shroff et al. (2009)
      Exp: determination of gene function; Arabidopsis thaliana; Stracke et al. (2009)
      Bioinfo: complexity of relationship between plants and metabolites; Takemoto et al. (2009)
      Bioinfo: metabolic pathway prediction; Tanaka et al. (2009b)
      Exp: quality assessment; Kampo medicine; Tanaka et al. (2009a)
      Exp: quality assessment; Angelica acutiloba; Tianniam et al. (2009)
      Review: web resources in MS-based metabolomics; Tohge and Fernie (2009)
      Bioinfo: metabolite annotation; Wishart et al. (2009)
      Exp: diarylheptanoid biosynthesis; Curcuma longa; Xie et al. (2009)
      Review: functional genomics; Yonekura-Sakakibara and Saito (2009)
      Exp: metabolite composition; Rhizoctonia solani; Aliferis and Jabaji (2010)
      Exp: QTLs of barley, against Fusarium head blight; Hordeum vulgare; Bollina et al. (2010)
      Exp: changing color of flower from dark purple to white; Brunfelsia calycina; Bar-Akiva et al. (2010)
      Bioinfo: chemical similarity search and substructure matching of compounds; Hattori et al. (2010)
      DB: MassBank, MS DB; Horai et al. (2010)
      Review: MS data processing; Kind and Fiehn (2010)
      Review: metabolomics in plant ecology and genetics; Macel et al. (2010)
      Exp: metabolic profiling of different tissues; Arabidopsis thaliana; Matsuda et al. (2010)
      Review: identification of metabolites; Neumann and Bocker (2010)
      DB: polyphenol contents in foods; Neveu et al. (2010)
      Review: FT-ICR-MS. Reaction representation based on van Krevelen diagram; Ohta et al. (2010)
      Review: relationship between individual omics data based on multivariate analysis and DB; Medicinal plants; Okada et al. (2010)
      Review: dietary intake; Penn et al. (2010)
      Bioinfo: multiple metabolomics platforms for different types of MS; Redestig et al. (2010)
      Review: functional genomics; Saito and Matsuda (2010)
      DB: benzylisoquinoline alkaloids; Singla et al. (2010)
      Exp: quality assessment; Glycyrrhiza uralensis; Tanaka et al. (2010)
      Review: annotation of gene function based on co-response gene and identification of metabolites; Tohge and Fernie (2010)
      Bioinfo: network analysis of species–metabolite relationships; Takemoto (2010)
      Bioinfo: MS data processing; Weber et al. (2010)
      Bioinfo: QTL informatics; Solanum tuberosum; Acharjee et al. (2011)
      Exp: subcellular distribution of metabolites; Arabidopsis thaliana; Krueger et al. (2011)
      Review: pesticide research; Aliferis and Chrysayi-Tokousbalides (2011)
      Bioinfo: metabolomics in medical purpose with systems chemical biology and chemoinformatics"
    ABSTRACT: Biology is increasingly becoming a data-intensive science with the recent progress of the omics fields, e.g. genomics, transcriptomics, proteomics and metabolomics. The species–metabolite relationship database, KNApSAcK Core, has been widely utilized and cited in metabolomics research, and chronological analysis of that research work has helped to reveal recent trends in metabolomics research. To meet the needs of these trends, the KNApSAcK database has been extended by incorporating a secondary metabolic pathway database called Motorcycle DB. We examined the enzyme sequence diversity related to secondary metabolism by means of batch-learning self-organizing maps (BL-SOMs). Initially, we constructed a map by using a big data matrix consisting of the frequencies of all possible dipeptides in the protein sequence segments of plants and bacteria. The enzyme sequence diversity of the secondary metabolic pathways was examined by identifying clusters of segments associated with certain enzyme groups in the resulting map. The extent of diversity of 15 secondary metabolic enzyme groups is discussed. Data-intensive approaches such as BL-SOM applied to big data matrices are needed for systematizing protein sequences. Handling big data has become an inevitable part of biology.
    Full-text · Article · Mar 2013 · Plant and Cell Physiology