-
ABSTRACT: The study of the scale-free topology in non-biological and biological networks and the dynamics that can explain this fascinating property of complex systems have captured the attention of the scientific community in recent years. Here, we analyze the biochemical pathways of three organisms (Methanococcus jannaschii, Escherichia coli, Saccharomyces cerevisiae) which are representatives of the main kingdoms Archaea, Bacteria and Eukaryotes during the course of biological evolution. We can consider two complementary representations of the biochemical pathways: the enzymes network and the chemical compounds network. In this article, we propose a stochastic model that explains that the scale-free topology with exponent in the vicinity of $\gamma \approx 3/2$ found across these three organisms is governed by the log-normal dynamics in the evolution of the enzymes network. Precisely, the fluctuations of the connectivity degree of enzymes in the biochemical pathways between evolutionarily distant organisms follow the same conserved dynamical principle, which in the end is the origin of the stationary scale-free distribution observed among species, from Archaea to Eukaryotes. In particular, the log-normal dynamics guarantees the conservation of the scale-free distribution in evolving networks. Furthermore, the log-normal dynamics also gives a possible explanation for the restricted range of observed exponents $\gamma$ in the scale-free networks (i.e., $\gamma \geq 3/2$). Finally, our model is also applied to the chemical compounds network of biochemical pathways and the Internet network.
Biosystems 01/2006; 83(1):26-37. · 1.78 Impact Factor
-
ABSTRACT: Extensive studies have been done to understand the principles behind architectures of real networks. Recently, evidence for hierarchical organization in many real networks has also been reported. Here, we present a hierarchical model that reproduces the main experimental properties observed in real networks: a scale-free degree distribution $P(k)$ [the frequency of the nodes that are connected to $k$ other nodes decays as a power law $P(k)\approx k^{-\gamma}$] and power-law scaling of the clustering coefficient $C(k)\approx k^{-1}$. The major points of our model can be summarized as follows. (a) The model generates networks with scale-free distribution for the degree of nodes with general exponent $\gamma>2$, and arbitrarily close to any specified value, being able to reproduce most of the observed hierarchical scale-free topologies. In contrast, previous models cannot obtain values of $\gamma>2.58$. (b) Our model has structural flexibility because (i) it can incorporate various types of basic building blocks (e.g., triangles, tetrahedrons, and, in general, fully connected clusters of $n$ nodes) and (ii) it allows a large variety of configurations (i.e., the model can use more than $n-1$ copies of basic blocks of $n$ nodes). The structural features of our proposed model might lead to a better understanding of architectures of biological and nonbiological networks.
Physical Review E 03/2005; 71(3 Pt 2A):036132. · 2.26 Impact Factor
-
ABSTRACT: An increasing number of observations support the hypothesis that most biological functions involve the interactions between many proteins, and that the complexity of living systems arises as a result of such interactions. In this context, the problem of inferring a global protein network for a given organism, using all available genomic data about the organism, is quickly becoming one of the main challenges in current computational biology.
This paper presents a new method to infer protein networks from multiple types of genomic data. Based on a variant of kernel canonical correlation analysis, its originality is in the formalization of the protein network inference problem as a supervised learning problem, and in the integration of heterogeneous genomic data within this framework. We present promising results on the prediction of the protein network for the yeast Saccharomyces cerevisiae from four types of widely available data: gene expressions, protein interactions measured by yeast two-hybrid systems, protein localizations in the cell and protein phylogenetic profiles. The method is shown to outperform other unsupervised protein network inference methods. We finally conduct a comprehensive prediction of the protein network for all proteins of the yeast, which enables us to propose protein candidates for missing enzymes in a biosynthesis pathway.
Software is available upon request.
Bioinformatics 09/2004; 20 Suppl 1:i363-70. · 5.47 Impact Factor
-
ABSTRACT: Several studies on real complex networks from different fields such as biology, economics, or sociology have shown that the degree of nodes (number of edges connected to each node) follows a scale-free power-law distribution like $P(k)\approx k^{-\gamma}$, where $P(k)$ denotes the frequency of the nodes that are connected to $k$ other nodes. Here we have carried out a study on scale-free networks, where a line graph transformation (i.e., edges in an initial network are transformed into nodes) is applied to a power-law distribution. Our results indicate that a power-law distribution as $P(k)\approx k^{-\gamma +1}$ is found for the transformed network together with a peak for low-degree nodes. In the present work we show a parametrization of this behaviour and discuss its application to real networks such as metabolic networks, protein-protein interaction networks and the World Wide Web.
03/2004;
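The line graph transformation mentioned in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `line_graph` and `degrees` and the toy star graph are assumptions chosen for illustration, and the input is assumed to be a simple undirected graph given as a list of edge tuples. The sketch shows why the degree distribution shifts under the transformation: an edge joining nodes of degree $k_u$ and $k_v$ becomes a node of degree $k_u + k_v - 2$.

```python
from collections import defaultdict
from itertools import combinations


def line_graph(edges):
    """Return the line graph of a simple undirected graph.

    Each original edge becomes a node; two such nodes are adjacent
    when the original edges share an endpoint.
    """
    incident = defaultdict(list)  # original node -> edges touching it
    for u, v in edges:
        e = (u, v)
        incident[u].append(e)
        incident[v].append(e)

    adjacency = {tuple(e): set() for e in edges}
    for touching in incident.values():
        for e1, e2 in combinations(touching, 2):
            adjacency[e1].add(e2)
            adjacency[e2].add(e1)
    return adjacency


def degrees(edges):
    """Degree of every node in the original graph."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


# Hypothetical toy input: a star with hub 0 and three leaves.
star = [(0, 1), (0, 2), (0, 3)]
lg = line_graph(star)
deg = degrees(star)

# Each transformed node has degree k_u + k_v - 2.
for (u, v), neighbours in lg.items():
    assert len(neighbours) == deg[u] + deg[v] - 2
```

For the star, every edge touches the hub (degree 3) and a leaf (degree 1), so each node of the line graph has degree 2 and the three transformed nodes form a triangle.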
-
ABSTRACT: MOTIVATION: A major issue in computational biology is the reconstruction of pathways from several genomic datasets, such as expression data, protein interaction data and phylogenetic profiles. As a first step toward this goal, it is important to investigate the amount of correlation which exists between these data. RESULTS: These methods are successfully tested on their ability to recognize operons in the Escherichia coli genome, from the comparison of three datasets corresponding to functional relationships between genes in metabolic pathways, geometrical relationships along the chromosome, and co-expression relationships as observed by gene expression data.
Bioinformatics 02/2003; 19 Suppl 1:i323-30. · 5.47 Impact Factor
-
M Kanehisa
ABSTRACT: Post-genomics may be defined in different ways depending on how one views the challenges after the discovery of the genome. A traditional view is to follow the concept of the central dogma in molecular biology, namely from genome to transcriptome to proteome. Projects are ongoing to analyse gene expression profiles both at the mRNA and protein levels, and to catalogue protein 3D structure families, which will no doubt help the understanding of the information in the genome. However, once complete, such experimentally determined catalogues of genes, RNAs and proteins only tell us about the building blocks of life. They do not tell us much about how life operates as a system, such as higher order functional behaviours of the cell or the organism. Thus, an alternative view of post-genomics is to go up from the molecular level to the cellular level and eventually to still higher levels, i.e., the biological systems. Bioinformatics provides basic concepts as well as practical methods to integrate this view with the traditional view and to analyse complex interactions among building blocks and with dynamic environments.
Pharmacogenomics 12/2001; 2(4):373-85. · 3.97 Impact Factor
-
K W Makabe, T Kawashima, S Kawashima, T Minokawa, A Adachi, H Kawamura, H Ishikawa, R Yasuda, H Yamamoto, K Kondoh, [......], Y Kondoh, S Kido, M Tsujinami, N Nishimura, M Takahashi, T Nakamura, M Kanehisa, M Ogasawara, T Nishikata, H Nishida
ABSTRACT: The ascidian egg is a well-known mosaic egg. In order to investigate the molecular nature of the maternal genetic information stored in the egg, we have prepared cDNAs from the mRNAs in the fertilized eggs of the ascidian, Halocynthia roretzi. The cDNAs of the ascidian embryo were sequenced, and the localization of individual mRNA was examined in staged embryos by whole-mount in situ hybridization. The data obtained were stored in the database MAGEST (http://www.genome.ad.jp/magest) and further analyzed. A total of 4240 cDNA clones were found to represent 2221 gene transcripts, including at least 934 different protein-coding sequences. The mRNA population of the egg consisted of a low prevalence, high complexity sequence set. The majority of the clones were of the rare sequence class, and of these, 42% of the clones showed significant matches with known peptides, mainly consisting of proteins with housekeeping functions such as metabolism and cell division. In addition, we found cDNAs encoding components involved in different signal transduction pathways and cDNAs encoding nucleotide-binding proteins. Large-scale analyses of the distribution of the RNA corresponding to each cDNA in the eight-cell, 110-cell and early tailbud embryos were simultaneously carried out. These analyses revealed that a small fraction of the maternal RNAs were localized in the eight-cell embryo, and that 7.9% of the clones were exclusively maternal, while 40.6% of the maternal clones showed expression in the later stages. This study provides global insights about the genes expressed during early development.
Development 08/2001; 128(13):2555-67. · 6.60 Impact Factor
-
M Kuroda, T Ohta, I Uchiyama, T Baba, H Yuzawa, I Kobayashi, L Cui, A Oguchi, K Aoki, Y Nagai, [......], M Kanehisa, A Yamashita, K Oshima, K Furuya, C Yoshino, T Shiba, M Hattori, N Ogasawara, H Hayashi, K Hiramatsu
ABSTRACT: Staphylococcus aureus is one of the major causes of community-acquired and hospital-acquired infections. It produces numerous toxins including superantigens that cause unique disease entities such as toxic-shock syndrome and staphylococcal scarlet fever, and has acquired resistance to practically all antibiotics. Whole genome analysis is a necessary step towards future development of countermeasures against this organism.
Whole genome sequences of two related S aureus strains (N315 and Mu50) were determined by shot-gun random sequencing. N315 is a meticillin-resistant S aureus (MRSA) strain isolated in 1982, and Mu50 is an MRSA strain with vancomycin resistance isolated in 1997. The open reading frames were identified by use of GAMBLER and GLIMMER programs, and annotation of each was done with a BLAST homology search, motif analysis, and protein localisation prediction.
The Staphylococcus genome was composed of a complex mixture of genes, many of which seem to have been acquired by lateral gene transfer. Most of the antibiotic resistance genes were carried either by plasmids or by mobile genetic elements including a unique resistance island. Three classes of new pathogenicity islands were identified in the genome: a toxic-shock-syndrome toxin island family, exotoxin islands, and enterotoxin islands. In the latter two pathogenicity islands, clusters of exotoxin and enterotoxin genes were found closely linked with other gene clusters encoding putative pathogenic factors. The analysis also identified 70 candidates for new virulence factors.
The remarkable ability of S aureus to acquire useful genes from various organisms was revealed through the observation of genome complexity and evidence of lateral gene transfer. Repeated duplication of genes encoding superantigens explains why S aureus is capable of infecting humans of diverse genetic backgrounds, eliciting severe immune reactions. Investigation of many newly identified gene products, including the 70 putative virulence factors, will greatly improve our understanding of the biology of staphylococci and the processes of infectious diseases caused by S aureus.
The Lancet 05/2001; 357(9264):1225-40. · 38.28 Impact Factor
-
ABSTRACT: This paper presents a new method to extract a set of correlated genes with respect to multiple biological features. Relationships among genes on a specific feature are encoded as a graph structure whose nodes correspond to genes. For example, the genome is a graph representing positional correlations of genes on the chromosome, the pathway is a graph representing functional correlations of gene products, and the expression profile is a graph representing gene expression similarities. When a set of genes are localized in a single graph, such as a gene cluster on the chromosome, an enzyme cluster in the metabolic pathway, or a set of coexpressed genes in the microarray gene expression profile, this may suggest a functional link among those genes. The functional link would become stronger when the clusters are correlated; namely, when a set of corresponding genes form clusters in multiple graphs. The newly introduced heuristic algorithm extracts such correlated gene clusters as isomorphic subgraphs in multiple graphs by using inter-graph links that are defined based on biological relevance. Using the method, we found E.coli correlated gene clusters in which genes are related with respect to the positions in the genome and the metabolic pathway, as well as the 3D structural similarity. We also analyzed protein-protein interaction data by two-hybrid experiments and gene coexpression data by microarrays in S.cerevisiae, and estimated the possibility of utilizing our method for screening the datasets that are likely to contain many false positive relations.
Genome informatics. International Conference on Genome Informatics 02/2001; 12:44-53.
-
ABSTRACT: We previously reported two graph algorithms for analysis of genomic information: a graph comparison algorithm to detect locally similar regions called correlated clusters and an algorithm to find a graph feature called P-quasi complete linkage. Based on these algorithms we have developed an automatic procedure to detect conserved gene clusters and align orthologous gene orders in multiple genomes. In the first step, the graph comparison is applied to pairwise genome comparisons, where the genome is considered as a one-dimensionally connected graph with genes as its nodes, and correlated clusters of genes that share sequence similarities are identified. In the next step, the P-quasi complete linkage analysis is applied to grouping of related clusters and conserved gene clusters in multiple genomes are identified. In the last step, orthologous relations of genes are established among each conserved cluster. We analyzed 17 completely sequenced microbial genomes and obtained 2313 clusters when the completeness parameter P was 40%. About one quarter contained at least two genes that appeared in the metabolic and regulatory pathways in the KEGG database. This collection of conserved gene clusters is used to refine and augment ortholog group tables in KEGG and also to define ortholog identifiers as an extension of EC numbers.
Nucleic Acids Research 11/2000; 28(20):4029-36. · 8.03 Impact Factor
-
ABSTRACT: The availability of computerized knowledge on biochemical pathways in the KEGG database opens new opportunities for developing computational methods to characterize and understand higher level functions of complete genomes. Our approach is based on the concept of graphs; for example, the genome is a graph with genes as nodes and the pathway is another graph with gene products as nodes. We have developed a simple method for graph comparison to identify local similarities, termed correlated clusters, between two graphs, which allows gaps and mismatches of nodes and edges and is especially suitable for detecting biological features. The method was applied to a comparison of the complete genomes of 10 microorganisms and the KEGG metabolic pathways, which revealed, not surprisingly, a tendency for formation of correlated clusters called FRECs (functionally related enzyme clusters). However, this tendency varied considerably depending on the organism. The relative number of enzymes in FRECs was close to 50% for Bacillus subtilis and Escherichia coli, but was <10% for Synechocystis and Saccharomyces cerevisiae. The FRECs collection is reorganized into a collection of ortholog group tables in KEGG, which represents conserved pathway motifs with the information about gene clusters in all the completely sequenced genomes.
Nucleic Acids Research 11/2000; 28(20):4021-8. · 8.03 Impact Factor
-
ABSTRACT: The distribution of genes coding for membrane proteins was investigated in 16 complete genomes: 4 archaea, 11 bacteria, and 1 eukaryote. Membrane proteins were identified by our new method of predicting transmembrane segments () after the removal of amino-terminal signal peptides. Interestingly, about half of the membrane protein genes in each genome were found to be located next to another, forming tandem clusters. Roughly 10%-30% of the tandem clusters were conserved among organisms, and most of the conserved tandem clusters belonged to one of the three functional groups, namely, transporters, the electron transport system, and cell motility. A tandem cluster sometimes contained paralogous membrane proteins, in which case the cluster size and the number of transmembrane segments could be related to a functional category, especially to transporters. In addition to the clustering of membrane proteins, the clustering of membrane proteins and ATP-binding proteins in the complete genomes was also analyzed. Although this clustering was not statistically significant, it was useful to identify candidate membrane protein partners of isolated ATP-binding protein components in the ABC transporters. Possible implications of tandem cluster organization of membrane protein genes are discussed including the complex formation and other functional coupling of protein products and the mechanism of protein translocation to the cell membrane.
Genome Research 07/2000; 10(6):731-43. · 13.61 Impact Factor
-
M Kanehisa
Advances in protein chemistry 02/2000; 54:381-408. · 3.20 Impact Factor
-
ABSTRACT: KEGG (Kyoto Encyclopedia of Genes and Genomes) is a knowledge base for systematic analysis of gene functions, linking genomic information with higher order functional information. The genomic information is stored in the GENES database, which is a collection of gene catalogs for all the completely sequenced genomes and some partial genomes with up-to-date annotation of gene functions. The higher order functional information is stored in the PATHWAY database, which contains graphical representations of cellular processes, such as metabolism, membrane transport, signal transduction and cell cycle. The PATHWAY database is supplemented by a set of ortholog group tables for the information about conserved subpathways (pathway motifs), which are often encoded by positionally coupled genes on the chromosome and which are especially useful in predicting gene functions. A third database in KEGG is LIGAND for the information about chemical compounds, enzyme molecules and enzymatic reactions. KEGG provides Java graphics tools for browsing genome maps, comparing two genome maps and manipulating expression maps, as well as computational tools for sequence comparison, graph comparison and path computation. The KEGG databases are daily updated and made freely available (http://www.genome.ad.jp/kegg/).
Nucleic Acids Research 02/2000; 28(1):27-30. · 8.03 Impact Factor
-
ABSTRACT: AAindex is a database of amino acid indices and amino acid mutation matrices. An amino acid index is a set of 20 numerical values representing various physicochemical and biochemical properties of amino acids. An amino acid mutation matrix is generally 20 x 20 numerical values representing similarity of amino acids. AAindex consists of two sections: AAindex1 for the collection of published amino acid indices and AAindex2 for the collection of published amino acid mutation matrices. Each entry of either AAindex1 or AAindex2 consists of the definition, the reference information, a list of related entries in terms of the correlation coefficient and the actual data. The database may be accessed through the DBGET/LinkDB system at GenomeNet (http://www.genome.ad.jp/aaindex/) or may be downloaded by anonymous FTP (ftp://ftp.genome.ad.jp/db/genomenet/aaindex/).
Nucleic Acids Research 02/2000; 28(1):374. · 8.03 Impact Factor
-
ABSTRACT: LIGAND is a composite database comprising three sections: ENZYME for the information of enzyme molecules and enzymatic reactions, COMPOUND for the information of metabolites and other chemical compounds, and REACTION for the collection of substrate-product relations. The current release includes 3390 enzymes, 5645 compounds and 5207 reactions. The database is indispensable for the reconstruction of metabolic pathways in the completely sequenced organisms. The LIGAND database can be accessed through the WWW (http://www.genome.ad.jp/dbget/ligand.html) or may be downloaded by anonymous FTP (ftp://kegg.genome.ad.jp/molecules/ligand/).
Nucleic Acids Research 02/2000; 28(1):380-2. · 8.03 Impact Factor
-
ABSTRACT: MAGEST is a database for newly identified maternal cDNAs of the ascidian, Halocynthia roretzi, which aims to examine the population of the mRNAs. We have collected 3' and 5' tag sequences of mRNAs and their expression data from whole-mount in situ hybridization in early embryos. To date, we have determined more than 2000 tag sequences of H. roretzi cDNAs and input them into public databases. The tag sequences and the expression data as well as additional information can be obtained through MAGEST via the WWW at http://www.genome.ad.jp/magest/
Nucleic Acids Research 02/2000; 28(1):133-5. · 8.03 Impact Factor
-
ABSTRACT: LIGAND is a composite database consisting of three sections and containing the information of chemical substances, chemical reactions and enzymes that catalyze reactions. The COMPOUND section is a collection of metabolic compounds, as well as macromolecules, chemical elements and other chemical substances in a living cell. The ENZYME section is a collection of all known enzymatic reactions, together with the information of enzyme molecules, classified according to the EC (Enzyme Commission) numbers. The REACTION section is a new addition to the database containing metabolic reactions that appear in the pathway diagrams of the KEGG/PATHWAY database and/or in the ENZYME section. The LIGAND database can be accessed through the WWW (http://www.genome.ad.jp/dbget/ligand.html) or may be downloaded by anonymous FTP (ftp://kegg.genome.ad.jp/molecules/ligand).
Nucleic Acids Research 02/1999; 27(1):377-9. · 8.03 Impact Factor
-
ABSTRACT: AAindex is a database of numerical indices representing various physicochemical and biochemical properties of amino acids and pairs of amino acids. It consists of two sections: AAindex1 for the amino acid index of 20 numerical values and AAindex2 for the amino acid mutation matrix of 210 numerical values. Each entry of either AAindex1 or AAindex2 consists of the definition, the reference information, a list of related entries in terms of the correlation coefficient, and the actual data. The database may be accessed through the DBGET/LinkDB system at GenomeNet (http://www.genome.ad.jp/dbget/) or may be downloaded by anonymous FTP (ftp://ftp.genome.ad.jp/db/genomenet/aaindex/).
Nucleic Acids Research 02/1999; 27(1):368-9. · 8.03 Impact Factor
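The figure of 210 values per mutation matrix in the abstract above is consistent with storing one value per unordered pair of the 20 amino acids, diagonal included: 20 · 21 / 2 = 210. The abstract does not spell this out, so the symmetric-matrix reading is an assumption. In the short Python sketch below, the helper `unordered_pairs` and the one-letter ordering are chosen for illustration only.

```python
# One-letter codes for the 20 standard amino acids; the ordering is illustrative.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def unordered_pairs(symbols):
    """List every unordered pair (a, b), including a == b.

    This corresponds to one triangle of a symmetric matrix plus its diagonal.
    """
    return [(a, b) for i, a in enumerate(symbols) for b in symbols[: i + 1]]


pairs = unordered_pairs(AMINO_ACIDS)
print(len(AMINO_ACIDS))  # 20 values in an amino acid index
print(len(pairs))        # 210 values in a symmetric mutation matrix
```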
-
ABSTRACT: Kyoto Encyclopedia of Genes and Genomes (KEGG) is a knowledge base for systematic analysis of gene functions in terms of the networks of genes and molecules. The major component of KEGG is the PATHWAY database that consists of graphical diagrams of biochemical pathways including most of the known metabolic pathways and some of the known regulatory pathways. The pathway information is also represented by the ortholog group tables summarizing orthologous and paralogous gene groups among different organisms. KEGG maintains the GENES database for the gene catalogs of all organisms with complete genomes and selected organisms with partial genomes, which are continuously re-annotated, as well as the LIGAND database for chemical compounds and enzymes. Each gene catalog is associated with a graphical genome map for chromosomal locations that is rendered by a Java applet. In addition to the data collection efforts, KEGG develops and provides various computational tools, such as for reconstructing biochemical pathways from the complete genome sequence and for predicting gene regulatory networks from the gene expression profiles. The KEGG databases are daily updated and made freely available (http://www.genome.ad.jp/kegg/).
Nucleic Acids Research 02/1999; 27(1):29-34. · 8.03 Impact Factor