ABSTRACT: DINIES (drug-target interaction network inference engine based on supervised analysis) is a web server for predicting unknown drug-target interaction networks from various types of biological data (e.g. chemical structures, drug side effects, amino acid sequences and protein domains) in the framework of supervised network inference. The originality of DINIES lies in prediction with state-of-the-art machine learning methods, in the integration of heterogeneous biological data and in compatibility with the KEGG database. The DINIES server accepts any 'profiles' or precalculated similarity matrices (or 'kernels') of drugs and target proteins in tab-delimited file format. When a training data set is submitted to learn a predictive model, users can select either known interaction information in the KEGG DRUG database or their own interaction data. The user can also select an algorithm for supervised network inference, select various parameters in the method and specify weights for heterogeneous data integration. The server can provide integrative analyses with useful components in KEGG, such as biological pathways, functional hierarchy and human diseases. DINIES (http://www.genome.jp/tools/dinies/) is publicly available as one of the genome analysis tools in GenomeNet.
Nucleic Acids Research 05/2014; · 8.81 Impact Factor
ABSTRACT: The IUBMB's Enzyme List gives a valuable library of the individual experimental facts on enzyme activities, providing the standard classification and nomenclature of enzymes. Empirical knowledge about the relationships between enzyme protein sequences (or structures) and their functions (the capability of catalyzing chemical reactions) has been accumulating in the public literature and databases. This provides a complementary approach to standardizing and organizing enzyme data, i.e., predicting the possible enzymes, reactions and metabolites that remain to be identified experimentally. Thus, we suggest the necessity of classifying enzymes based on the evidence and different perspectives obtained from various experimental works. The KEGG (Kyoto Encyclopedia of Genes and Genomes) database describes enzymes from many different viewpoints, including: the IUBMB's enzyme nomenclature/classification (EC numbers), the similarity group of enzyme reactions (KEGG Reaction Class; RCLASS) based solely on the chemical structure transformation patterns, and the similarity groups of enzyme genes (KEGG Orthology; KO) based on the orthologous groups that can be mapped to the KEGG PATHWAY and BRITE functional hierarchy. In addition to the EC numbers established by IUBMB, some unique identifiers were introduced to the KEGG database. R, RP and RC numbers are given to distinguish reactions, reactant pairs and RCLASS, respectively. Genes, including enzyme genes, have their own ID numbers in specific organisms, and they are classified into ortholog groups that are identified by K numbers. In this review, we explain the concept and methodology of this formulation with some concrete example cases. We propose that it would be beneficial to create a standard classification scheme that deals with both experimentally identified and theoretically predicted enzymes.
ABSTRACT: In the hierarchy of data, information and knowledge, computational methods play a major role in the initial processing of data to extract information, but on their own they are less effective at compiling knowledge from information. The Kyoto Encyclopedia of Genes and Genomes (KEGG) resource (http://www.kegg.jp/ or http://www.genome.jp/kegg/) has been developed as a reference knowledge base to assist this latter process. In particular, the KEGG pathway maps are widely used for biological interpretation of genome sequences and other high-throughput data. The link from genomes to pathways is made through the KEGG Orthology system, a collection of manually defined ortholog groups identified by K numbers. To better automate this interpretation process, the KEGG modules defined by Boolean expressions of K numbers have been expanded and improved. Once genes in a genome are annotated with K numbers, the KEGG modules can be computationally evaluated, revealing metabolic capacities and other phenotypic features. The reaction modules, which represent chemical units of reactions, have been used to analyze design principles of metabolic networks and also to improve the definition of K numbers and associated annotations. For translational bioinformatics, the KEGG MEDICUS resource has been developed by integrating drug labels (package inserts) used in society.
Nucleic Acids Research 11/2013; · 8.81 Impact Factor
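The evaluation of a module defined as a Boolean expression of K numbers can be illustrated with a minimal sketch. It assumes a simplified form of the module definition notation in which a space separates consecutive steps (AND), a comma separates alternatives (OR), a plus sign joins components of a complex (AND), and parentheses group sub-expressions; optional components marked with a minus sign are not handled. The function name `evaluate_module` and the example definition are hypothetical and do not correspond to an actual KEGG module.

```python
import re


def evaluate_module(definition, present):
    """Return True if the set of K numbers `present` satisfies `definition`.

    Simplified grammar (an assumption of this sketch):
      steps        := alternatives (" " alternatives)*   -> AND
      alternatives := complex ("," complex)*             -> OR
      complex      := atom ("+" atom)*                   -> AND
      atom         := K number | "(" steps ")"
    """
    raw = re.findall(r"K\d{5}|[(),+]|\s+", definition.strip())
    tokens = [" " if t.isspace() else t for t in raw]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def parse_steps():
        nonlocal pos
        value = parse_alternatives()
        while peek() == " ":
            pos += 1
            rhs = parse_alternatives()  # always parse, then combine
            value = value and rhs
        return value

    def parse_alternatives():
        nonlocal pos
        value = parse_complex()
        while peek() == ",":
            pos += 1
            rhs = parse_complex()
            value = value or rhs
        return value

    def parse_complex():
        nonlocal pos
        value = parse_atom()
        while peek() == "+":
            pos += 1
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = parse_steps()
            pos += 1  # skip the closing ")"
            return value
        return tok in present

    return parse_steps()


# Hypothetical definition: step 1 has two alternatives, step 2 is a two-subunit complex.
DEFINITION = "(K00001,K00002) K00003+K00004"
complete = evaluate_module(DEFINITION, {"K00002", "K00003", "K00004"})
incomplete = evaluate_module(DEFINITION, {"K00001", "K00003"})
```

Here `complete` is true because one alternative of the first step and both subunits of the second step are present, while `incomplete` is false because one subunit of the complex is missing.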
ABSTRACT: Despite widespread consensus on the need to transform toxicology and risk assessment in order to keep pace with technological and computational changes that have revolutionized the life sciences, there remains much work to be done to achieve the vision of toxicology based on a mechanistic foundation. To this end, a workshop was organized to explore one key aspect of this transformation - the development of Pathways of Toxicity as a key tool for hazard identification based on systems biology. Several issues were discussed in depth in the workshop. The first was the challenge of formally defining the concept of a Pathway of Toxicity (PoT), as distinct from, but complementary to, other toxicological pathway concepts such as mode of action (MoA). The workshop came up with a preliminary definition of PoT as "A molecular definition of cellular processes shown to mediate adverse outcomes of toxicants". It is further recognized that normal physiological pathways exist that maintain homeostasis and these, sufficiently perturbed, can become PoT. Second, the workshop sought to define the adequate public and commercial resources for PoT information, including data, visualization, analyses, tools, and use-cases, as well as the kinds of efforts that will be necessary to enable the creation of such a resource. Third, the workshop explored ways in which systems biology approaches could inform pathway annotation, and which resources are needed and available that can provide relevant PoT information to the diverse user communities.
ABSTRACT: There is a tendency for a set of enzyme genes in an operon-like structure in a prokaryotic genome to encode enzymes that catalyze a series of consecutive reactions in a metabolic pathway. Our recent analysis shows that this and other genomic units correspond to chemical units reflecting the chemical logic of organic reactions. From all known metabolic pathways in the KEGG database, we identified chemical units, called reaction modules, as conserved sequences of chemical structure transformation patterns of small molecules. The extracted patterns suggest co-evolution of genomic units and chemical units. While the core of the metabolic network may have evolved with mechanisms involving individual enzymes and reactions, its extension may have been driven by modular units of enzymes and reactions.
ABSTRACT: The metabolic network is both a network of chemical reactions and a network of enzymes that catalyze reactions. Towards better understanding of this duality in the evolution of the metabolic network, we developed a method to extract conserved sequences of reactions called reaction modules from the analysis of chemical compound structure transformation patterns in all known metabolic pathways stored in the KEGG PATHWAY database. The extracted reaction modules are repeatedly used as if they are building blocks of the metabolic network and contain chemical logic of organic reactions. Furthermore, the reaction modules often correspond to traditional pathway modules defined as sets of enzymes in the KEGG MODULE database and sometimes to operon-like gene clusters in prokaryotic genomes. We identified well-conserved, possibly ancient, reaction modules involving 2-oxocarboxylic acids. The chain extension module that appears as the tricarboxylic acid reaction sequence in the TCA cycle is now shown to be used in other pathways together with different types of modification modules. We also identified reaction modules and their connection patterns for aromatic ring cleavages in microbial biodegradation pathways, which are most characteristic in terms of both distinct reaction sequences and distinct gene clusters. The modular architecture of biodegradation modules will have a potential for predicting degradation pathways of xenobiotic compounds. The collection of these and many other reaction modules is made available as part of the KEGG database.
Journal of Chemical Information and Modeling 02/2013; · 4.30 Impact Factor
ABSTRACT: KEGG (http://www.genome.jp/kegg/) is an integrated database resource for linking genomes or molecular datasets to molecular networks (pathways, etc.) representing higher-level systemic functions of the cell, the organism, and the ecosystem. Major efforts have been undertaken for capturing and representing experimental knowledge as manually drawn KEGG pathway maps and for genome-based generalization of experimental knowledge through the KEGG Orthology (KO) system. Current knowledge on diseases and drugs has also been integrated in the KEGG pathway maps, especially in terms of known disease genes and drug targets. Thus, KEGG can be used as a reference knowledge base for integration and interpretation of large-scale datasets generated by high-throughput experimental technologies, as well as for finding their practical values. Here we give an introduction to the KEGG Mapper tools, especially for understanding disease mechanisms and adverse drug interactions.
ABSTRACT: In order to develop hypotheses on unknown metabolic pathways, biochemists frequently rely on literature that uses a free-text format to describe functional groups or substructures. In computational chemistry or cheminformatics, molecules are typically represented by chemical descriptors, i.e., vectors that summarize information on their various properties. However, it is difficult to interpret these chemical descriptors since they are not directly linked to the terminology of functional groups or substructures that biochemists use.
In this study, we used the KEGG Chemical Function (KCF) format to computationally describe biochemical substructures in seven attributes that resemble biochemists' way of dealing with substructures.
We established the KCF-S (KCF-and-Substructures) format as additional structural information for KCF. Applying KCF-S to various datasets of molecules revealed substructures that appear specifically in each dataset and characterize it. Structure-based clustering of molecules using KCF-S resulted in clusters in which molecular weights and structures were less diverse than those obtained with conventional chemical fingerprints. We further applied KCF-S to find pairs of molecules that are possibly converted to each other in enzymatic reactions, and KCF-S clearly improved predictive performance over that reported previously.
KCF-S defines biochemical substructures while keeping interpretability, suggesting its potential for application in further studies in chemical bioinformatics. KCF and KCF-S can be generated automatically from the Molfile format, making it possible to deal with molecules from any data source.
ABSTRACT: BACKGROUND: One of the main goals of genomic analysis is to elucidate the comprehensive functions (functionome) in individual organisms or a whole community in various environments. However, a standard evaluation method for discerning the functional potentials harbored within the genome or metagenome has not yet been established. We have developed a new evaluation method for the potential functionome, based on the completion ratio of Kyoto Encyclopedia of Genes and Genomes (KEGG) functional modules. RESULTS: Distribution of the completion ratio of the KEGG functional modules in 768 prokaryotic species varied greatly with the kind of module, and all modules primarily fell into 4 patterns (universal, restricted, diversified and non-prokaryotic modules), indicating the universal and unique nature of each module, and also the versatility of the KEGG Orthology (KO) identifiers mapped to each one. The module completion ratio in 8 phenotypically different bacilli revealed that some modules were shared only in phenotypically similar species. Metagenomes of human gut microbiomes from 13 healthy individuals previously determined by the Sanger method were analyzed based on the module completion ratio. Results led to new discoveries in the nutritional preferences of gut microbes, believed to be one of the mutualistic representations of gut microbiomes to avoid nutritional competition with the host. CONCLUSIONS: The method developed in this study could characterize the functionome harbored in genomes and metagenomes. As this method also provided taxonomical information from KEGG modules as well as the gene hosts constructing the modules, interpretation of completion profiles was simplified and we could identify the complementarity between biochemical functions in human hosts and the nutritional preferences in human gut microbiomes. Thus, our method has the potential to be a powerful tool for comparative functional analysis in genomics and metagenomics, able to target unknown environments containing various uncultivable microbes within unidentified phyla.
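The abstract above does not give the formula for the module completion ratio, so the following is only a sketch under one plausible reading: the ratio is taken as the fraction of a module's steps for which at least one alternative K number is found in the genome or metagenome. The step-list representation, the function name `completion_ratio` and the example module are assumptions made for illustration.

```python
def completion_ratio(steps, present):
    """Fraction of module steps satisfied by the K numbers in `present`.

    `steps` is a list of steps; each step is a list of alternative K numbers,
    any one of which is assumed to be sufficient to fill that step.
    """
    if not steps:
        raise ValueError("a module needs at least one step")
    filled = sum(1 for alternatives in steps
                 if any(k in present for k in alternatives))
    return filled / len(steps)


# Hypothetical four-step module; only the first two steps can be filled.
module_steps = [["K00001", "K00002"], ["K00003"], ["K00004"], ["K00005"]]
ratio = completion_ratio(module_steps, {"K00002", "K00003"})
```

With this representation, a ratio of 1.0 corresponds to a module in which every step is covered, and profiles of such ratios across many modules can be compared between genomes.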
ABSTRACT: The identification of orthologous genes in an increasing number of fully sequenced genomes is a challenging issue in recent genome science. Here we present KEGG OC (http://www.genome.jp/tools/oc/), a novel database of ortholog clusters (OCs). The current version of KEGG OC contains 1 176 030 OCs, obtained by clustering 8 357 175 genes in 2112 complete genomes (153 eukaryotes, 1830 bacteria and 129 archaea). The OCs were constructed by applying the quasi-clique-based clustering method to all possible protein coding genes in all complete genomes, based on their amino acid sequence similarities. The OCs are computationally efficient to calculate, which makes it possible to update the contents regularly. KEGG OC has the following two features: (i) It consists of all complete genomes of a wide variety of organisms from three domains of life, and the number of organisms is the largest among the existing databases; and (ii) It is compatible with the KEGG database by sharing the same sets of genes and identifiers, which leads to seamless integration of OCs with useful components in KEGG such as biological pathways, pathway modules, functional hierarchy, diseases and drugs. The KEGG OC resources are accessible via OC Viewer, which provides an interactive visualization of OCs at different taxonomic levels.
Nucleic Acids Research 11/2012; · 8.81 Impact Factor
ABSTRACT: KEGG (Kyoto Encyclopedia of Genes and Genomes) is a bioinformatics resource for understanding the functions and utilities of cells and organisms from both high-level and genomic perspectives. It is a self-sufficient, integrated resource consisting of genomic, chemical, and network information, with cross-references to numerous outside databases. The genomic and chemical information is a complete set of building blocks (genes and molecules) and the network information includes molecular wiring diagrams (interaction/reaction networks) and hierarchical classifications (relation networks) to represent high-level functions. This unit describes protocols for using KEGG, focusing on molecular network information in KEGG PATHWAY, KEGG BRITE, and KEGG MODULE, perturbed molecular networks in KEGG DISEASE and KEGG DRUG, molecular building block information in KEGG GENES and KEGG LIGAND, and a mechanism for linking genomes to molecular networks in KEGG ORTHOLOGY (KO). All of these many protocols enable the user to take advantage of the full breadth of the functionality provided by KEGG.
Current protocols in bioinformatics / editorial board, Andreas D. Baxevanis ... [et al.] 06/2012; Chapter 1:Unit1.12.
ABSTRACT: Gene network inference engine based on supervised analysis (GENIES) is a web server for predicting unknown parts of gene networks from various types of genome-wide data in the framework of supervised network inference. The originality of GENIES lies in the construction of a predictive model using partially known network information and in the integration of heterogeneous data with kernel methods. The GENIES server accepts any 'profiles' of genes or proteins (e.g. gene expression profiles, protein subcellular localization profiles and phylogenetic profiles) or pre-calculated gene-gene similarity matrices (or 'kernels') in the tab-delimited file format. As a training data set to learn a predictive model, users can choose either known molecular network information in the KEGG PATHWAY database or their own gene network data. The user can also select an algorithm for supervised network inference, choose various parameters in the method, and control the weights for heterogeneous data integration. The server provides the list of newly predicted gene pairs, maps the predicted gene pairs onto the associated pathway diagrams in KEGG PATHWAY and indicates candidate genes for missing enzymes in organism-specific metabolic pathways. GENIES (http://www.genome.jp/tools/genies/) is publicly available as one of the genome analysis tools in GenomeNet.
Nucleic Acids Research 05/2012; 40(Web Server issue):W162-7. · 8.81 Impact Factor
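The abstract does not state how the user-specified weights enter the integration of heterogeneous data. A common approach with kernel methods is a weighted sum of the individual kernel (similarity) matrices, and the sketch below assumes that approach; it is not taken from the GENIES implementation, and the function name `combine_kernels` and the toy matrices are hypothetical.

```python
def combine_kernels(kernels, weights):
    """Weighted sum of square similarity matrices given as lists of lists.

    All matrices are assumed to be indexed by the same genes in the same order.
    """
    if not kernels or len(kernels) != len(weights):
        raise ValueError("need one weight per kernel")
    n = len(kernels[0])
    combined = [[0.0] * n for _ in range(n)]
    for kernel, weight in zip(kernels, weights):
        if len(kernel) != n or any(len(row) != n for row in kernel):
            raise ValueError("all kernels must be n x n")
        for i in range(n):
            for j in range(n):
                combined[i][j] += weight * kernel[i][j]
    return combined


# Two toy 2 x 2 kernels, e.g. one from expression and one from phylogenetic profiles.
expression_kernel = [[1.0, 0.5], [0.5, 1.0]]
phylogenetic_kernel = [[1.0, 0.0], [0.0, 1.0]]
integrated = combine_kernels([expression_kernel, phylogenetic_kernel], [0.5, 0.5])
```

A non-negative weighted sum of valid kernels is itself a valid kernel, which is why this form is a convenient default for combining data sources.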
ABSTRACT: In this chapter, we demonstrate the usability of the KEGG (Kyoto Encyclopedia of Genes and Genomes) databases and tools, focusing especially on the visualization of omics data. The desktop application KegArray and many Web-based tools are tightly integrated with the KEGG knowledgebase, which helps visualize and interpret large amounts of data derived from high-throughput measurement techniques including microarray, metagenome, and metabolome analyses. Recently developed resources for human disease, drug, and plant research are also mentioned.
ABSTRACT: We sequenced the genome of Theileria orientalis, a tick-borne apicomplexan protozoan parasite of cattle. The focus of this study was a comparative genome analysis of T. orientalis relative to other highly pathogenic Theileria species, T. parva and T. annulata. T. parva and T. annulata induce transformation of infected cells of lymphocyte or macrophage/monocyte lineages; in contrast, T. orientalis does not induce uncontrolled proliferation of infected leukocytes and multiplies predominantly within infected erythrocytes. While synteny across homologous chromosomes of the three Theileria species was found to be well conserved overall, subtelomeric structures were found to differ substantially, as T. orientalis lacks the large tandemly arrayed subtelomere-encoded variable secreted protein-encoding gene family. Moreover, expansion of particular gene families by gene duplication was found in the genomes of the two transforming Theileria species, most notably, the TashAT/TpHN and Tar/Tpr gene families. Gene families that are present only in T. parva and T. annulata and not in T. orientalis, Babesia bovis, or Plasmodium were also identified. Identification of differences between the genome sequences of Theileria species with different abilities to transform and immortalize bovine leukocytes will provide insight into proteins and mechanisms that have evolved to induce and regulate this process. The T. orientalis genome database is available at http://totdb.czc.hokudai.ac.jp/. IMPORTANCE Cancer-like growth of leukocytes infected with malignant Theileria parasites is a unique cellular event, as it involves the transformation and immortalization of one eukaryotic cell by another. In this study, we sequenced the whole genome of a nontransforming Theileria species, Theileria orientalis, and compared it to the published sequences representative of two malignant, transforming species, T. parva and T. annulata. The genome-wide comparison of these parasite species highlights significant genetic diversity that may be associated with evolution of the mechanism(s) deployed by an intracellular eukaryotic parasite to transform its host cell.
ABSTRACT: In contrast to the increasing number of successful genome projects, there still remain many orphan metabolites whose synthesis processes are unknown. Metabolites, including these orphan metabolites, can be classified into groups that share the same core substructures, originating from the same biosynthetic pathways. It is known that many metabolites are synthesized by adding building blocks to existing metabolites. Therefore, it is proposed that, for any given group of metabolites, finding the core substructure and the branched substructures can help predict their biosynthetic pathway. There have already been many reports on multiple graph alignment techniques for finding conserved chemical substructures in relatively small molecules. However, they are optimized for ligand binding and are not suitable for metabolomic studies.
We developed an efficient multiple graph alignment method named MUCHA (Multiple Chemical Alignment), specialized for finding metabolic building blocks. This method showed its strength in finding metabolic building blocks while preserving the relative positions among the substructures, which is not achieved by simply applying frequent graph mining techniques. Compared with combined pairwise alignments, the proposed MUCHA method generally reduced computational costs while improving the quality of the alignment.
MUCHA successfully finds building blocks of secondary metabolites, and has the potential to complement other existing methods for reconstructing metabolic networks using reaction patterns.
ABSTRACT: Classification of individuals' genotype data is important in various kinds of biomedical research. There are many sophisticated clustering algorithms, but most of them require an appropriate similarity measure between the objects to be clustered. Hence, accurate inter-diplotype similarity measures are always required for classification of diplotypes. In this article, we propose a new accurate inter-diplotype similarity measure that we call the population model-based distance (PMD), so that we can cluster individuals with diplotype SNP data (i.e., unphased diplotypes) with higher accuracy. For unphased diplotypes, the allele sharing distance (ASD) has been the standard measure of the genetic distance between the diplotypes of individuals. To achieve higher clustering accuracy, our new measure PMD makes good use of a given appropriate population model, which has never been utilized in the ASD. As the population model, we propose to use a hidden Markov model (HMM)-based model. We call the PMD based on this model the HHD (HIT HMM-based Distance). We demonstrate the impact of the HHD on diplotype classification through comprehensive large-scale experiments over 8930 genome-wide data sets derived from the HapMap SNPs database. The experiments revealed that the HHD enables significantly more accurate clustering than the ASD.
Journal of computational biology: a journal of computational molecular cell biology 12/2011; 19(1):55-67. · 1.69 Impact Factor
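For orientation, the baseline measure named in the abstract, the allele sharing distance, can be sketched as follows. The sketch assumes biallelic SNPs with each unphased genotype coded as the count (0, 1 or 2) of one chosen allele, so that the number of non-shared alleles at a locus is the absolute difference of the two counts; loci with a missing genotype (`None`) are skipped. Normalization conventions vary between studies (some divide by 2 to obtain a value between 0 and 1), and the function name `allele_sharing_distance` is hypothetical. The proposed PMD and HHD measures are not reproduced here.

```python
def allele_sharing_distance(genotypes_a, genotypes_b):
    """Mean number of non-shared alleles per locus between two individuals.

    Genotypes are allele counts (0, 1 or 2) at biallelic SNPs; None marks a
    missing call, and such loci are excluded from the average.
    """
    if len(genotypes_a) != len(genotypes_b):
        raise ValueError("both individuals must be typed at the same loci")
    differences = [abs(a - b)
                   for a, b in zip(genotypes_a, genotypes_b)
                   if a is not None and b is not None]
    if not differences:
        raise ValueError("no locus with both genotypes observed")
    return sum(differences) / len(differences)


# Four SNPs: identical, one allele apart, two alleles apart, identical.
asd = allele_sharing_distance([0, 1, 2, 2], [0, 2, 0, 2])
```

A matrix of such pairwise distances is what a clustering algorithm would take as input in the kind of comparison the abstract describes.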
ABSTRACT: Kyoto Encyclopedia of Genes and Genomes (KEGG, http://www.genome.jp/kegg/ or http://www.kegg.jp/) is a database resource that integrates genomic, chemical and systemic functional information. In particular, gene catalogs from completely sequenced genomes are linked to higher-level systemic functions of the cell, the organism and the ecosystem. Major efforts have been undertaken to manually create a knowledge base for such systemic functions by capturing and organizing experimental knowledge in computable forms; namely, in the forms of KEGG pathway maps, BRITE functional hierarchies and KEGG modules. Continuous efforts have also been made to develop and improve the cross-species annotation procedure for linking genomes to the molecular networks through the KEGG Orthology system. Here we report KEGG Mapper, a collection of tools for KEGG PATHWAY, BRITE and MODULE mapping, enabling integration and interpretation of large-scale data sets. We also report a variant of the KEGG mapping procedure to extend the knowledge base, where different types of data and knowledge, such as disease genes and drug targets, are integrated as part of the KEGG molecular networks. Finally, we describe recent enhancements to the KEGG content, especially the incorporation of disease and drug information used in practice and in society, to support translational bioinformatics.
Nucleic Acids Research 11/2011; 40(Database issue):D109-14. · 8.81 Impact Factor
ABSTRACT: Co-administration of multiple drugs may cause adverse effects, which are usually known but sometimes unknown. Package inserts of prescription drugs are supposed to contain contraindications and warnings on adverse interactions, but such information is not necessarily complete. Therefore, it is becoming more important to provide health professionals with a comprehensive view on drug-drug interactions among all the drugs in use as well as a computational method to identify potential interactions, which may also be of practical value in society. Here we extracted 1,306,565 known drug-drug interactions from all the package inserts of prescription drugs marketed in Japan. They were reduced to 45,180 interactions involving 1352 drugs (active ingredients) identified by the D numbers in the KEGG DRUG database, of which 14,441 interactions involving 735 drugs were linked to the same drug-metabolizing enzymes and/or overlapping drug targets. The interactions with overlapping targets were further classified into three types: acting on the same target, acting on different but similar targets in the same protein family, and acting on different targets belonging to the same pathway. For the rest of the extracted interaction data, we attempted to characterize interaction patterns in terms of the drug groups defined by the Anatomical Therapeutic Chemical (ATC) classification system, where the high-resolution network at the D number level is progressively reduced to a low-resolution global network. Based on this study we have developed a drug-drug interaction retrieval system in the KEGG DRUG database, which may be used for both searching against known drug-drug interactions and predicting potential interactions.
Journal of Chemical Information and Modeling 09/2011; 51(11):2977-85. · 4.30 Impact Factor
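The three-way classification of interactions with overlapping targets can be sketched as a cascade of set intersections. The abstract does not describe the actual procedure, so the order of the checks, the dictionary-based data model, the function name `classify_interaction` and every identifier in the example are assumptions made for illustration; links through shared drug-metabolizing enzymes are left out.

```python
def classify_interaction(drug_a, drug_b, targets, family_of, pathways_of):
    """Classify a drug pair by how its target sets overlap.

    targets:     drug -> set of target proteins
    family_of:   target -> protein family name
    pathways_of: target -> set of pathway names
    Returns one of "same target", "same family", "same pathway" or None.
    """
    targets_a = targets.get(drug_a, set())
    targets_b = targets.get(drug_b, set())
    if targets_a & targets_b:
        return "same target"

    families_a = {family_of[t] for t in targets_a if t in family_of}
    families_b = {family_of[t] for t in targets_b if t in family_of}
    if families_a & families_b:
        return "same family"

    pathways_a = set().union(*(pathways_of.get(t, set()) for t in targets_a))
    pathways_b = set().union(*(pathways_of.get(t, set()) for t in targets_b))
    if pathways_a & pathways_b:
        return "same pathway"
    return None


# Entirely hypothetical drugs, targets, families and pathways.
targets = {"drugA": {"T1"}, "drugB": {"T1"}, "drugC": {"T2"}, "drugD": {"T3"}}
family_of = {"T1": "family_x", "T2": "family_x", "T3": "family_y"}
pathways_of = {"T1": {"pathway_p"}, "T2": {"pathway_q"}, "T3": {"pathway_p"}}

same_target = classify_interaction("drugA", "drugB", targets, family_of, pathways_of)
same_family = classify_interaction("drugA", "drugC", targets, family_of, pathways_of)
same_pathway = classify_interaction("drugA", "drugD", targets, family_of, pathways_of)
```

Checking the most specific relation first means each pair receives a single label, which is one way to keep the three types mutually exclusive.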
ABSTRACT: iPath2.0 is a web-based tool (http://pathways.embl.de) for the visualization and analysis of cellular pathways. Its primary map summarizes the metabolism in biological systems as annotated to date. Nodes in the map correspond to various chemical compounds and edges represent series of enzymatic reactions. In two other maps, iPath2.0 provides an overview of secondary metabolite biosynthesis and a hand-picked selection of important regulatory pathways and other functional modules, allowing a more general overview of protein functions in a genome or metagenome. iPath2.0's main interface is an interactive Flash-based viewer, which allows users to easily navigate and explore the complex pathway maps. In addition to the default pre-computed overview maps, iPath offers several data mapping tools. Users can upload various types of data and completely customize all nodes and edges of iPath2.0's maps. These customized maps give users an intuitive overview of their own data, guiding the analysis of various genomics and metagenomics projects.
Nucleic Acids Research 05/2011; 39(Web Server issue):W412-5. · 8.81 Impact Factor
ABSTRACT: With the rise of experimental technologies for omics research in recent years, considerable quantitative data related to transcription, protein and metabolism are available for predicting protein functions. To predict protein functions from large omics data, reference knowledge databases and bioinformatics tools play considerable roles. The KEGG database (http://www.genome.jp/kegg/) that we have been developing is an integrated database of biological systems including genomic, chemical and systemic functional information. Our group has also been developing tools for genome or chemical analysis as the GenomeNet Bioinformatics Tools (http://www.genome.jp/en/gn_tools.html). In this chapter, we introduce the KEGG database resources and the GenomeNet Bioinformatics Tools for predicting protein functions from the viewpoint of omics research, as well as some recent topics (KEGG PLANT Resource and PathPred). KEGG PLANT Resource is part of the KEGG EDRUG database, and contains links to plant secondary metabolite biosynthesis pathways, plant genomes and EST sequences, chemical information on plant natural products and a prediction tool for plant secondary metabolism pathways. PathPred is a recently developed pathway prediction tool based on the chemical structure transformation patterns of enzyme reactions found in metabolic pathways.