Interactive InterPro-based comparisons of proteins in whole genomes.
ABSTRACT MOTIVATION: The SWISS-PROT group at the EBI has developed the Proteome Analysis Database utilizing existing resources and providing comprehensive and integrated comparative analysis of the predicted protein coding sequences of the complete genomes of bacteria, archaea and eukaryotes. The Proteome Analysis Database is accompanied by a program that has been designed to carry out interactive InterPro proteome comparisons for any one proteome against any other one or more of the proteomes in the database.
Full text available from: Alexander Kanapin, Aug 23, 2015
- Source available from: Steffen Möller
ABSTRACT: The dissertation is submitted for the degree of Doctor of Philosophy.
- ABSTRACT: The applications of InterPro span a range of biologically important areas that includes automatic annotation of protein sequences and genome analysis. In automatic annotation of protein sequences InterPro has been utilised to provide reliable characterisation of sequences, identifying them as candidates for functional annotation. Rules based on the InterPro characterisation are stored and operated through a database called RuleBase. RuleBase is used as the main tool in the sequence database group at the EBI to apply automatic annotation to unknown sequences. The annotated sequences are stored and distributed in the TrEMBL protein sequence database. InterPro also provides a means to carry out statistical and comparative analyses of whole genomes. In the Proteome Analysis Database, InterPro analyses have been combined with other analyses based on CluSTr, the Gene Ontology (GO) and structural information on the proteins.
  Briefings in Bioinformatics 10/2002; 3(3):285-95. · 9.62 Impact Factor
- ABSTRACT: Motivation: Obtaining accurate estimates of the numbers of protein-coding genes and protein domains in a proteome, and of the number of protein domains in nature, is a daunting challenge. Computational analysis of the protein domain sets in the proteomes of many species allows us to estimate these numbers and to find their evolutionary relationships. Results: We have analyzed the distributions of the number of occurrences of protein domains in sample proteomes of 70 organisms with fully sequenced genomes from the three major kingdoms of life: Archaea, Bacteria and Eukaryota. We found that a large fraction of the identified distinct protein domains (i.e., unique domains and homologous domain families) in these 70 proteomes (1051 (23%) out of 4493) are found in at least one organism in each of these kingdoms of life and that 43 (1%) of these domains are common to all 70 organisms. All the observed domain occurrence frequency distributions for these 70 proteomes are well fitted by a family of Pareto-like functions, associated with the steady-state distributions of a linear Markov random process. We present explicit formulas that accurately predict the number of distinct protein domains and the number of protein-coding genes for a given organism as functions of the number of non-redundant domain-to-protein links in the proteomes. These functions allow us to predict that there are 42,740, 27,900, and 21,200 protein-coding genes/open reading frames in the human, A. thaliana, and mouse genomes, respectively. We also estimate that there are 5271, 2955, and 4915 distinct protein domains in the human, A. thaliana, and mouse proteomes, respectively, and about 5500 distinct protein domains in the entire "proteome world".
  Journal of Biological Systems 12/2002; 10(4):381-407. DOI:10.1142/S0218339002000767 · 0.96 Impact Factor
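The two sharing counts quoted in the abstract above (domains present in at least one organism of every kingdom, and domains common to all organisms) can be illustrated with plain set operations. The following is a minimal sketch, not the authors' method: the proteome names, the domain identifiers, and the function `domain_sharing` are all hypothetical placeholders chosen for illustration, and the toy data are not drawn from any real proteome.

```python
from collections import defaultdict

# Hypothetical toy data: proteome name -> (kingdom, set of domain identifiers).
# The identifiers are placeholders, not real InterPro accessions.
PROTEOMES = {
    "archaeon_1": ("Archaea", {"D1", "D2", "D3"}),
    "bacterium_1": ("Bacteria", {"D1", "D2", "D4"}),
    "bacterium_2": ("Bacteria", {"D1", "D3", "D5"}),
    "eukaryote_1": ("Eukaryota", {"D1", "D2", "D3", "D6"}),
}


def domain_sharing(proteomes):
    """Return (all domains, domains seen in every kingdom, domains seen in every proteome)."""
    # Pool the domains of all proteomes belonging to the same kingdom.
    by_kingdom = defaultdict(set)
    for kingdom, domains in proteomes.values():
        by_kingdom[kingdom] |= domains

    all_domains = set().union(*by_kingdom.values())
    # Present in at least one organism of each kingdom.
    in_every_kingdom = set.intersection(*by_kingdom.values())
    # Present in every single proteome.
    in_every_proteome = set.intersection(*(d for _, d in proteomes.values()))
    return all_domains, in_every_kingdom, in_every_proteome


if __name__ == "__main__":
    all_d, kingdoms, everywhere = domain_sharing(PROTEOMES)
    total = len(all_d)
    print(f"distinct domains: {total}")
    print(f"in every kingdom: {len(kingdoms)} ({100 * len(kingdoms) / total:.0f}%)")
    print(f"in every proteome: {len(everywhere)} ({100 * len(everywhere) / total:.0f}%)")
```

Applied to real data, the same two intersections would yield the kind of figures the abstract reports (1051 of 4493 domains shared across kingdoms, 43 common to all 70 organisms), although how the authors actually computed them is not stated here.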