Consolidating metabolite identifiers to enable contextual and multi-platform metabolomics data analysis.

Metabolomics Research Group, RIKEN Plant Science Center, 1-7-22 Tsurumi-ku, Suehiro-cho, Yokohama, Kanagawa, 230-0045, Japan.
BMC Bioinformatics (Impact Factor: 3.02). 01/2010; 11:214. DOI: 10.1186/1471-2105-11-214
Source: PubMed

ABSTRACT Analysis of data from high-throughput experiments depends on the availability of well-structured data that describe the assayed biomolecules. Procedures for obtaining and organizing such meta-data on genes, transcripts and proteins have been streamlined in many data analysis packages, but are still lacking for metabolites. Chemical identifiers are notoriously incoherent, encompassing a wide range of different referencing schemes with varying scope and coverage. Online chemical databases use multiple types of identifiers in parallel but lack a common primary key for reliable database consolidation. Connecting identifiers of analytes found in experimental data with the identifiers of their parent metabolites in public databases can therefore be very laborious.
Here we present a strategy and a software tool for integrating metabolite identifiers from local reference libraries and public databases, neither of which depends on a single common primary identifier. The program constructs groups of interconnected identifiers of analytes and metabolites to obtain a local metabolite-centric SQLite database. The created database can be used to map in-house identifiers and synonyms to external resources such as the KEGG database. New identifiers can be imported and directly integrated with existing data. Queries can be performed flexibly, both from the command line and from the statistical programming environment R, to obtain identifier mappings tailored to a given data set.
Efficient cross-referencing of metabolite identifiers is a key technology for metabolomics data analysis. We provide a practical and flexible solution to this task and an open-source program, the metabolite masking tool (MetMask), that implements our ideas.
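
As a rough illustration of the grouping idea described in the abstract, the following Python sketch merges identifiers that co-occur in a record into connected groups, stores the groups in an SQLite table, and maps an in-house identifier to a KEGG identifier. It is not MetMask's actual implementation or schema: the function names, the table layout, and the in-house identifiers A0001 and A0002 are hypothetical and chosen only for this example.

```python
import sqlite3
from collections import defaultdict


def group_identifiers(records):
    """Merge (id_type, value) pairs that co-occur in a record into connected groups."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for record in records:
        first = record[0]
        find(first)  # register records that hold a single identifier
        for other in record[1:]:
            union(first, other)

    groups = defaultdict(set)
    for ident in parent:
        groups[find(ident)].add(ident)
    return list(groups.values())


def build_database(groups):
    """Store each group in a hypothetical one-table SQLite schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE identifiers (group_id INTEGER, id_type TEXT, value TEXT)"
    )
    for group_id, members in enumerate(groups):
        conn.executemany(
            "INSERT INTO identifiers VALUES (?, ?, ?)",
            [(group_id, id_type, value) for id_type, value in sorted(members)],
        )
    return conn


def map_identifier(conn, from_type, value, to_type):
    """Return all identifiers of to_type that share a group with the query."""
    rows = conn.execute(
        """
        SELECT DISTINCT b.value
        FROM identifiers AS a
        JOIN identifiers AS b ON a.group_id = b.group_id
        WHERE a.id_type = ? AND a.value = ? AND b.id_type = ?
        ORDER BY b.value
        """,
        (from_type, value, to_type),
    ).fetchall()
    return [row[0] for row in rows]


# Toy records: each lists identifiers asserted to refer to the same compound.
# No record links A0001 to KEGG directly; the shared synonym connects them.
records = [
    [("inhouse", "A0001"), ("synonym", "L-alanine")],
    [("synonym", "L-alanine"), ("kegg", "C00041")],
    [("inhouse", "A0002"), ("kegg", "C00031")],
]
conn = build_database(group_identifiers(records))
print(map_identifier(conn, "inhouse", "A0001", "kegg"))
```

Because groups are formed transitively, no single identifier type has to be present in every record, which is the property the abstract emphasizes. A real tool would also need to handle conflicting or overly promiscuous identifiers, which this sketch ignores.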

  • Source
    ABSTRACT: An important step in the reconstruction of a metabolic network is annotation of metabolites. Metabolites are generally annotated with various database- or structure-based identifiers. Metabolite annotations in metabolic reconstructions may be incorrect or incomplete and thus need to be updated prior to their use. Genome-scale metabolic reconstructions generally include hundreds of metabolites. Manually updating annotations is therefore highly laborious. This prompted us to look for open-source software applications that could facilitate automatic updating of annotations by mapping between available metabolite identifiers. We identified three applications developed for the metabolomics and chemical informatics communities as potential solutions. The applications were MetMask, the Chemical Translation System, and UniChem. The first implements a "metabolite masking" strategy for mapping between identifiers whereas the latter two implement different versions of an InChI-based strategy. Here we evaluated the suitability of these applications for the task of mapping between metabolite identifiers in genome-scale metabolic reconstructions. We applied the best suited application to updating identifiers in Recon 2, the latest reconstruction of human metabolism. All three applications enabled partially automatic updating of metabolite identifiers, but significant manual effort was still required to fully update identifiers. We were able to reduce this manual effort by searching for new identifiers using multiple types of information about metabolites. When multiple types of information were combined, the Chemical Translation System enabled us to update over 3,500 metabolite identifiers in Recon 2. All but approximately 200 identifiers were updated automatically. We found that an InChI-based application such as the Chemical Translation System was better suited to the task of mapping between metabolite identifiers in genome-scale metabolic reconstructions. We identified several features, however, that could be added to such an application in order to tailor it to this task.
    Journal of Cheminformatics 01/2014; 6(1):2. · 3.59 Impact Factor
  • Source
    ABSTRACT: Phytochemical genomics is a recently emerging field, which investigates the genomic basis of the synthesis and function of phytochemicals (plant metabolites), particularly based on advanced metabolomics. The chemical diversity of the model plant Arabidopsis thaliana is larger than previously expected, and the gene-to-metabolite correlations have been elucidated mostly by an integrated analysis of transcriptomes and metabolomes. For example, most genes involved in the biosynthesis of flavonoids in Arabidopsis have been characterized by this method. A similar approach has been applied to the functional genomics for production of phytochemicals in crops and medicinal plants. Great promise is seen in metabolic quantitative loci analysis in major crops such as rice and tomato, and identification of novel genes involved in the biosynthesis of bioactive specialized metabolites in medicinal plants.
    Current Opinion in Plant Biology 04/2013; · 10.33 Impact Factor
  • Source
    ABSTRACT: Despite recent intensive research efforts in functional genomics, the functions of only a limited number of Arabidopsis (Arabidopsis thaliana) genes have been determined experimentally, and improving gene annotation remains a major challenge in plant science. As metabolite profiling can characterize the metabolomic phenotype of a genetic perturbation in the plant metabolism, it provides clues to the function(s) of genes of interest. We chose 50 Arabidopsis mutants, including a set of characterized and uncharacterized mutants, that resemble wild-type plants. We performed metabolite profiling of the plants using gas chromatography-mass spectrometry (GC-MS). To make the dataset available as an efficient public functional genomics tool for hypothesis generation, we developed our MeKO database. It allows evaluation of whether a mutation affects metabolism during normal plant growth and contains images of mutants, data on differences in metabolite accumulation, and interactive analysis tools. Non-processed data, including chromatograms, mass spectra, and experimental metadata, follow the guidelines set by the Metabolomics Standards Initiative (MSI) and are freely downloadable. Proof-of-concept analysis suggests that the MeKO database is highly useful for the generation of hypotheses for genes of interest and for improving gene annotation. MeKO is publicly available.
    Plant Physiology 05/2014; · 6.56 Impact Factor
