SIMAP--a comprehensive database of pre-calculated protein sequence similarities, domains, annotations and clusters.
ABSTRACT The prediction of protein function as well as the reconstruction of evolutionary genesis employing sequence comparison at large is still the most powerful tool in sequence analysis. Due to the exponential growth of the number of known protein sequences and the subsequent quadratic growth of the similarity matrix, the computation of the Similarity Matrix of Proteins (SIMAP) becomes a computationally intensive task. The SIMAP database provides a comprehensive and up-to-date pre-calculation of the protein sequence similarity matrix, sequence-based features and sequence clusters. As of September 2009, SIMAP covers 48 million proteins and more than 23 million non-redundant sequences. Novel features of SIMAP include the expansion of the sequence space by including databases such as ENSEMBL as well as the integration of metagenomes based on their consistent processing and annotation. Furthermore, protein function predictions by Blast2GO are pre-calculated for all sequences in SIMAP, and the data access and query functions have been improved. SIMAP assists biologists in querying the up-to-date sequence space systematically and facilitates large-scale downstream projects in computational biology. Access to SIMAP is freely provided through the web portal for individuals (http://mips.gsf.de/simap/) and for programmatic access through DAS (http://webclu.bio.wzw.tum.de/das/) and Web-Service (http://mips.gsf.de/webservices/services/SimapService2.0?wsdl).
Thomas Rattei1,*, Patrick Tischler1, Stefan Götz2, Marc-André Jehl1, Jonathan Hoser1,
Roland Arnold1, Ana Conesa2 and Hans-Werner Mewes1,3
1Technische Universität München, Department of Genome Oriented Bioinformatics, Wissenschaftszentrum
Weihenstephan, Freising, Germany, 2Bioinformatics Department, Centro de Investigación Príncipe Felipe,
Valencia, Spain and 3Institute for Bioinformatics and Systems Biology (MIPS), Helmholtz Zentrum München,
German Research Center for Environmental Health (GmbH), Neuherberg, Germany
Received September 15, 2009; Revised October 10, 2009; Accepted October 12, 2009
INTRODUCTION
Protein sequences are of utmost importance for studying
the function and evolution of genes and genomes.
Evolutionary processes of mutation and selection have
shaped the protein sequence space and became manifest
in the protein sequences as well as their pair-wise and
group-wise similarities. Therefore, a rich collection of
methods in computational biology relies on the analysis
and comparison of protein sequences. Many of these
intensively used methods perform sequence similarity
searches [e.g. BLAST (1)] or compare protein sequences
against secondary databases of protein families [e.g.
InterPro (2)].
The fast increasing volume of publicly available pro-
tein sequences forges a computational dilemma for
bioinformatics tasks that require repeated all-against-all
calculations of sequence similarities or sequence features.
Such rather straightforward but technically challenging
tasks among others are the annotation of genomes or the
clustering of the protein sequence space into protein
families. Due to the exponential growth of the number of
sequences and the quadratic complexity of the sequence
similarity matrix, the computational demand of calculating
an all-versus-all sequence matrix of all known proteins
easily outgrows available computational resources. Due
to the subsequent growth of the secondary databases,
a similar problem exists for the prediction of protein
domains. As a consequence, any repeated ab initio recalcu-
lation of the similarity matrix is highly ineffective due to
the recalculation of the vast majority of already known
sequence similarity relations. However, as the number of
recently added sequences is always small compared to
the bulk of known sequences, repeated recalculations (frequently
performed in many sequence-based projects)
*To whom correspondence should be addressed. Tel: +49 8161 712136; Fax: +49 8161 712186; Email: email@example.com
Published online 11 November 2009 Nucleic Acids Research, 2010, Vol. 38, Database issue D223–D226
© The Author(s) 2009. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
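The scaling argument made in the introduction (quadratic growth of the all-against-all matrix versus the small size of each update) can be made concrete with a few lines of arithmetic. The sketch below is only an illustration: the figure of 23 million non-redundant sequences is taken from the abstract, whereas the 1% update batch is a hypothetical value chosen for the example, and the function names are not part of SIMAP.

```python
def full_matrix_pairs(n: int) -> int:
    """Unordered pairs in an all-against-all comparison of n sequences."""
    return n * (n - 1) // 2


def incremental_pairs(n_known: int, n_new: int) -> int:
    """Pairs touched when n_new sequences join n_known already compared ones:
    every new-vs-known pair plus the pairs among the new sequences."""
    return n_new * n_known + n_new * (n_new - 1) // 2


if __name__ == "__main__":
    n_known = 23_000_000        # non-redundant sequences (from the abstract)
    n_new = n_known // 100      # hypothetical 1% update batch (assumption)
    full = full_matrix_pairs(n_known + n_new)
    update = incremental_pairs(n_known, n_new)
    print(f"from scratch: {full:.3e} pairs")
    print(f"incremental:  {update:.3e} pairs ({update / full:.1%} of the full matrix)")
```

Under these assumptions an incremental update touches only about 2% of the pairs that a recalculation from scratch would have to visit.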
The Similarity Matrix of Proteins (SIMAP) solves the
computational dilemma described above by incrementally
pre-calculating the sequence similarities forming the
known protein sequence space (3). The comparison of
new sequences versus known ones returns symmetric
scores that can be updated accordingly in the existing
records. Compared to other resources that pre-calculate
sequence similarities [e.g. NCBI Blink (4)], the FASTA
(5) and Smith–Waterman (6) based similarity calculation
in SIMAP is restricted only by a static and sensitive raw
score threshold, without limiting the maximal number of
hits per sequence. Hence the structure of the sequence
similarity matrix is not influenced by the taxonomy and
study biases that exist in the major protein sequence
databases. The SIMAP database stores raw scores from
the calculated alignments. When querying SIMAP,
e-values are calculated on-the-fly according to the
selected databases and taxa. To complement the pair-wise
sequence similarity matrix by position-specific
searches against known protein families, SIMAP in
addition pre-calculates sequence-based features such as
InterPro matches (2). For SIMAP to provide an efficient
alternative to BLAST or InterProScan calculations, a
comprehensive representation of the protein sequence
space is crucial. Recent
improvements in SIMAP have addressed this requirement
by further expanding the sequence space and including
metagenomic sequences. Further improvements have
extended the functional annotation of the protein
sequences and improved the data access and query
tools of SIMAP.
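The two design decisions described above, incremental updates of a symmetric score matrix and storing raw scores so that e-values can be derived at query time, can be sketched as follows. This is a minimal illustration, not the SIMAP implementation: the class `SimilarityStore`, the toy scorer `shared_kmers` and the function `evalue` are hypothetical names, the scorer merely stands in for the FASTA and Smith–Waterman calculation, and the e-value uses the general Karlin–Altschul form E = K·m·n·exp(−λS) with placeholder values for λ and K, which the text does not specify.

```python
import math
from typing import Callable, Dict, Optional, Tuple


def shared_kmers(a: str, b: str, k: int = 3) -> float:
    """Toy stand-in for an alignment score: number of shared k-mers."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return float(len(kmers_a & kmers_b))


class SimilarityStore:
    """Incrementally maintained store of symmetric raw similarity scores."""

    def __init__(self, score_fn: Callable[[str, str], float], min_raw_score: float):
        self.score_fn = score_fn            # hypothetical pairwise scorer
        self.min_raw_score = min_raw_score  # static raw-score threshold
        self.sequences: Dict[str, str] = {}
        self.scores: Dict[Tuple[str, ...], float] = {}
        self.comparisons = 0                # bookkeeping only

    def add_sequence(self, seq_id: str, sequence: str) -> None:
        """Compare the new sequence against the stored ones only."""
        for other_id, other_seq in self.sequences.items():
            self.comparisons += 1
            raw = self.score_fn(sequence, other_seq)
            if raw >= self.min_raw_score:
                # The score is symmetric, so one record per unordered pair
                # suffices; no cap on the number of hits per sequence.
                self.scores[tuple(sorted((seq_id, other_id)))] = raw
        self.sequences[seq_id] = sequence

    def raw_score(self, a: str, b: str) -> Optional[float]:
        return self.scores.get(tuple(sorted((a, b))))


def evalue(raw_score: float, query_len: int, db_residues: int,
           lam: float = 0.267, k: float = 0.041) -> float:
    """E-value computed at query time from a stored raw score.

    lam and k are illustrative placeholders (assumption), and db_residues
    is the size of whatever database subset the user selected.
    """
    return k * query_len * db_residues * math.exp(-lam * raw_score)
```

Because only raw scores are stored, the same record can yield different e-values depending on the size of the database or taxon subset selected at query time, which is passed here as `db_residues`.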
NEW FEATURES AND IMPROVEMENTS IN SIMAP
Comprehensive coverage of the protein sequence space
SIMAP represents the known protein sequence space
comprehensively and up-to-date. According to this goal,
the SIMAP database is synchronized once per month with
the major protein sequence databases (Table 1). The con-
sideration of each of these databases in SIMAP is justified
by providing either unique protein sequences that are not
found in other databases [e.g. ENSEMBL (7)], or unique
protocols for data processing [e.g. NCBI RefSeq (8)].
The continuous and rapid growth of the sequence space
demands a sophisticated high-performance computing
infrastructure to pre-calculate the sequence similarities
of all new sequences and their sequence-based features
immediately after the import of new sequences, even
in case of SIMAP’s incremental updates.
The BOINCSIMAP public resource computing project
(9) steadily provides compute power beyond the current
need and thus enables rapid updating of SIMAP.
Consistent processing and annotation of metagenomes
With the breakthrough of next generation sequencing
methods and their application to environmental samples
(10), metagenomic sequences have indelibly expanded the
protein sequence space to non-culturable organisms
and environmental communities. However, the pioneering
‘Global ocean sampling’ (GOS) project (11) so far remains
the only metagenomic dataset of which protein sequences
are represented in a major public sequence database
[NCBI GenBank (4)]. All other metagenomes are, if at
all, deposited in distributed resources such as the ‘Whole
Genome Shotgun’ (wgs) section of NCBI GenBank (4)
or the IMG/M database (12). No standardized protocol
for gene calling and the annotation of protein-coding
sequences has been established so far for these data
collections. As the consistent annotation of metagenomes
is indispensable for any downstream comparative analysis
such as comparisons of taxonomic or functional profiles
between different metagenomes, an extension of SIMAP
was implemented that extracts coding sequences from
metagenomic sequencing reads, assembled contigs and
scaffolds in a consistent way.
This part of SIMAP covering environmental sequence
fragments is synchronized monthly with three major
repositories of metagenomes (Table 2). Entirely redundant
metagenomes are considered only once, whereas redun-
dant representations of the same project differing in
their total number of nucleotides (e.g. the whale fall
samples in IMG/M and GenBank wgs) are retained.
Similar to the methodology used by the GOS project
(13), coding sequences are extracted from the nucleotide
sequences in a multi-step procedure:
(1) all open reading frames (ORFs) exceeding a length
of 90 nt are extracted from the nucleotide sequences
of a metagenome,
(2) all-against-all protein sequence similarities between
all ORFs in a metagenome are calculated using the
SIMAP software (3): first a FASTA (5) similarity
search against the low-complexity masked sequences
down to the BLOSUM50 (14) score of 80 is
performed without restricting the number of hits,
thereafter the alignments are re-calculated without
masking,
(3) ORFs are weighted by the number and score of their
sequence alignments; shadow ORFs are detected by
their overlap with higher weighted ORFs and
removed using the methodology and parameters as
in the GOS project (13),
(4) remaining ORFs having a length of at least 60 aa are
imported into the main SIMAP database,
(5) all-against-all protein sequence similarities between
all ORFs of a metagenome and all other protein
sequences in SIMAP are calculated as in step 2,
(6) again, shadow ORFs are removed as in step 3.
Table 1. Number of protein entries and non-redundant sequences of
the major protein sequence databases included in SIMAP as of
September 2009
Database    Protein entries    Non-redundant
Table 2. Number of metagenomic samples and extracted protein-
coding sequences in SIMAP as of September 2009
Compared to the supervised gene prediction methods
used in other metagenomic resources, the procedure
applied in SIMAP is not biased towards any taxonomic
group (e.g. prokaryotes) and is limited only by the minimal
length of open reading frames in steps 1 and 4. The
parameters applied in this procedure ensure optimal
sensitivity in detecting coding sequences both in single-exon
and multi-exon genes.
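Steps 1 and 3 of the procedure above can be sketched in a few lines. The sketch rests on several assumptions that the text does not settle: ORFs are taken as stop-to-stop stretches in all six reading frames, ORFs running into the end of a read are kept, the alignment-derived weight is supplied by the caller, and the overlap fraction `max_overlap` is a placeholder rather than the GOS parameter. All names (`Orf`, `extract_orfs`, `remove_shadow_orfs`) are hypothetical.

```python
from typing import List, NamedTuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


class Orf(NamedTuple):
    strand: str          # "+" or "-"
    start: int           # 0-based, forward-strand coordinates
    end: int             # exclusive
    weight: float = 0.0  # e.g. derived from number and score of alignments


def extract_orfs(seq: str, min_nt: int = 90) -> List[Orf]:
    """Step 1: stop-to-stop ORFs exceeding min_nt in all six frames."""
    seq = seq.upper()
    n = len(seq)
    orfs: List[Orf] = []

    def keep(strand: str, start: int, end: int) -> None:
        if end - start > min_nt:
            if strand == "-":  # map back to forward-strand coordinates
                start, end = n - end, n - start
            orfs.append(Orf(strand, start, end))

    for strand, s in (("+", seq), ("-", seq.translate(_COMPLEMENT)[::-1])):
        for frame in range(3):
            orf_start = frame
            for pos in range(frame, n - 2, 3):
                if s[pos:pos + 3] in STOP_CODONS:
                    keep(strand, orf_start, pos)
                    orf_start = pos + 3
            # ORF left open at the end of a read or contig
            keep(strand, orf_start, frame + (n - frame) // 3 * 3)
    return orfs


def remove_shadow_orfs(orfs: List[Orf], max_overlap: float = 0.5) -> List[Orf]:
    """Step 3: drop ORFs that overlap a higher weighted ORF too much.

    max_overlap is an illustrative fraction of the ORF length (assumption).
    """
    kept: List[Orf] = []
    for orf in sorted(orfs, key=lambda o: o.weight, reverse=True):
        length = orf.end - orf.start
        shadowed = any(
            min(orf.end, k.end) - max(orf.start, k.start) > max_overlap * length
            for k in kept
        )
        if not shadowed:
            kept.append(orf)
    return kept
```

Nothing in the extraction depends on start codons, codon usage or a trained gene model, which is the property the paragraph above describes as the absence of taxonomic bias.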
The derived metagenomic ORFs have almost doubled
the volume of the known protein sequence space and thus
significantly added valuable information (Figure 1).
However, metagenomic sequences exhibit lower accuracy
compared to completely sequenced genes and genomes,
show fragmentation in case of multi-exon genes and
lack knowledge of their taxonomic origin. Therefore,
metagenomic sequences can be excluded when retrieving
data from SIMAP according to the individual
requirements of the user.
Functional annotation of the protein sequence space
Many computational methods to support the prediction
of protein function are computationally expensive and
therefore benefit from comprehensive pre-calculation
and incremental updates, the basic design principles of
SIMAP. SIMAP thus pre-calculates InterPro domains and
features (2) for all sequences including metagenomic
ORFs (15). New releases of InterPro are incorporated
into SIMAP as soon as they become available; SIMAP
is regularly updated to the latest InterPro version.
SIMAP provides an ideal, complete resource for the
computation of secondary features such as the functional
annotation of protein sequences based on information
transfer from annotated proteins. Blast2GO may
serve as an example; it provides various annotation
tools for the functional classification of proteins (16,17).
Blast2GO achieves the automatic functional annotation of
DNA or protein sequences employing the Gene Ontology
vocabulary. We have adapted the Blast2GO suite to
enable the retrieval of sequence similarities from the
SIMAP database instead of performing BLAST (1)
searches. This step saves an enormous amount of
compute time compared to BLAST and allows annotating
the complete protein sequence space of SIMAP using a
few PCs within a week. We have integrated the adapted
Blast2GO program into the monthly update workflow
of SIMAP in order to keep the pre-calculated Blast2GO
annotations complete and up-to-date (Table 3).
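The general idea of annotation transfer from pre-calculated similarities can be illustrated with a deliberately simplified sketch. It is not the Blast2GO annotation rule, which is more elaborate; it only shows how GO terms of annotated hits can be collected for a query once the hit list is already available, so that no similarity search has to be run. The function name, the score threshold and the example identifiers and scores are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def transfer_go_terms(
    hits: Iterable[Tuple[str, float]],
    known_go: Dict[str, Set[str]],
    min_score: float = 100.0,   # illustrative threshold (assumption)
    min_support: int = 1,       # minimal number of supporting hits
) -> List[Tuple[str, float]]:
    """Collect GO terms from sufficiently similar annotated hits.

    hits     -- (hit_id, raw_score) pairs retrieved from a pre-calculated
                similarity matrix instead of a fresh search
    known_go -- GO terms already assigned to annotated proteins
    Returns (term, best supporting score), best supported terms first.
    """
    support: Dict[str, List[float]] = defaultdict(list)
    for hit_id, score in hits:
        if score < min_score:
            continue
        for term in known_go.get(hit_id, ()):
            support[term].append(score)
    ranked = [
        (term, max(scores))
        for term, scores in support.items()
        if len(scores) >= min_support
    ]
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


if __name__ == "__main__":
    # Made-up hit list and annotations, for illustration only.
    hits = [("P1", 250.0), ("P2", 180.0), ("P3", 60.0)]
    known_go = {
        "P1": {"GO:0005524"},
        "P2": {"GO:0005524", "GO:0016301"},
        "P3": {"GO:0003677"},
    }
    for term, score in transfer_go_terms(hits, known_go):
        print(term, score)
```

In this form the cost per query is a lookup of stored hits plus a pass over their annotations, which is consistent with the compute-time saving described above.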
High performance data access facilities
All data in SIMAP are freely available. The continuously
growing size of SIMAP demands a sophisticated imple-
mentation of the database to provide versatile and rapid
access to the data with respect to a broad spectrum of use
cases. Based on the established database and standard
middleware components of SIMAP, we have improved
the performance and stability of SIMAP through cluster-
ing of two independent database and application servers.
This clustering effectively uncouples production and main-
tenance processes. Each of the servers is ready to process
more than 2 million complex queries per day.
Figure 1. Composition of the non-redundant protein sequence space in
SIMAP as of September 2009.
Table 3. Pre-calculated functional annotations in SIMAP as of
September 2009
Method    Number of pre-calculated
Furthermore, we have improved the different data
access facilities connecting SIMAP to its users. The versatile
web portal allows searching for proteins by text or
sequence queries. The matches are starting points
for retrieving homologous proteins based on sequence
similarity or domain architecture. Protein report pages
integrate data from SIMAP including InterPro and GO
annotation as well as from external resources such as
PEDANT (18). To facilitate clustering
methods, all-against-all matrices of similarity scores can
be downloaded for user-supplied groups of proteins.
Programmatic access to SIMAP is provided by several
SOAP-based Web-Services. The SimpAT (Simap Access
Tools) allows easy access to the SIMAP database using
Web-Service functionality. Recently, we have implemented
Distributed Annotation System (DAS) services
for SIMAP. These can be accessed via the URL
http://webclu.bio.wzw.tum.de/das/ and provide easy and
rapid access to the proteins, sequence similarities, InterPro
matches and GO annotations from SIMAP. These data,
with the exception of the very large similarity matrix
itself, can also be downloaded as flat files from the
SIMAP web portal. For research projects interested in
parts of the similarity matrix, we provide project-specific
monthly dumps upon request.
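For readers unfamiliar with DAS, the following sketch shows how request URLs for the DAS services mentioned above would typically be assembled. It relies only on the generic DAS conventions (a `sources` listing and a `features` command taking a `segment` parameter). The data source name `simap` and the segment identifier are hypothetical placeholders, since the text gives only the base URL, and the sketch deliberately performs no network request.

```python
from urllib.parse import quote, urlencode

# Base URL as given in the text.
DAS_BASE = "http://webclu.bio.wzw.tum.de/das/"


def das_sources_url(base: str = DAS_BASE) -> str:
    """URL of the listing of data sources offered by a DAS server."""
    return f"{base.rstrip('/')}/sources"


def das_features_url(source: str, segment: str, base: str = DAS_BASE) -> str:
    """URL of a DAS 'features' request for one segment (e.g. a protein ID).

    'source' must be one of the names reported by the sources listing;
    the value used in the example below is a guess, not a documented name.
    """
    query = urlencode({"segment": segment})
    return f"{base.rstrip('/')}/{quote(source)}/features?{query}"


if __name__ == "__main__":
    print(das_sources_url())
    print(das_features_url("simap", "P12345"))  # hypothetical source and ID
```

A client would fetch the resulting URL with any HTTP library and parse the returned XML; the set of feature types available depends on the server.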
CONCLUSIONS
The SIMAP database is a unique fundamental resource
for computational biology that consistently puts the
principle of incremental pre-calculation of sequence
similarities and sequence-based features into practice.
As an exhaustive, up-to-date resource to inspect
the sequence similarity of any known sequence, SIMAP enables
any type of systematic post-processing with respect to the
functional or structural classification of proteins.
The recent integration of metagenomic sequences
into SIMAP based on a consistent extraction of
coding sequences has been beneficial to preserve the
comprehensiveness of the sequence space representation
in SIMAP. At the same time it makes use of the
sequence similarity matrix of SIMAP to resolve overlaps
and remove shadow ORFs. SIMAP represents to our
knowledge the largest and most homogeneous resource
for the annotation of coding sequences in metagenomes.
It provides an ideal data repository and speed-up for tools
such as MEGAN (19) that extract taxonomic and functional
information from similarities between metagenomic
ORFs and known proteins in major sequence databases.
The extended functional annotation of the sequence
space through the pre-calculation of GO annotations
and the improved data access facilities have enhanced
the potential of SIMAP in assisting biologists in answering
their individual research questions as well as facilitating
downstream projects in computational biology at any
scale.
ACKNOWLEDGEMENTS
The authors gratefully acknowledge the BOINCSIMAP
community for donating their CPU power for the calcu-
lation of protein similarities and features. They are
grateful to their colleagues at MIPS, in particular
Mathias Walter, Martin Muensterkoetter and Manuel
Spannagl, for many helpful discussions and suggestions.
FUNDING
SUN Microsystems Inc. (funding a fully equipped X4500
data center server that is hosting parts of the SIMAP
database, through a SUN Academic Excellence Grant),
European Science Foundation (financial support for
Stefan Götz through the activity entitled ‘Frontiers of
Functional Genomics’). Funding for open access charge:
Helmholtz Zentrum München, German Research Center
for Environmental Health, Neuherberg.
Conflict of interest statement. None declared.
REFERENCES
1. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z.,
Miller,W. and Lipman,D.J. (1997) Gapped BLAST and
PSI-BLAST: a new generation of protein database search
programs. Nucleic Acids Res., 25, 3389–3402.
2. Hunter,S., Apweiler,R., Attwood,T.K., Bairoch,A., Bateman,A.,
Binns,D., Bork,P., Das,U., Daugherty,L. and Duquenne,L. (2009)
InterPro: the integrative protein signature database. Nucleic Acids
Res., 37, D211–D215.
3. Arnold,R., Rattei,T., Tischler,P., Truong,M.D., Stumpflen,V. and
Mewes,W. (2005) SIMAP-The similarity matrix of proteins.
Bioinformatics, 21, ii42–ii46.
4. Sayers,E.W., Barrett,T., Benson,D.A., Bryant,S.H., Canese,K.,
Chetvernin,V., Church,D.M., DiCuccio,M., Edgar,R., Federhen,S.
et al. (2009) Database resources of the national center for
biotechnology information. Nucleic Acids Res., 37, D5–D15.
5. Pearson,W.R. (1990) Rapid and sensitive sequence comparison
with FASTP and FASTA. Methods Enzymol., 183, 63–98.
6. Smith,T.F. and Waterman,M.S. (1981) Identification of common
molecular subsequences. J. Mol. Biol., 147, 195–197.
7. Hubbard,T.J.P., Aken,B.L., Ayling,S., Ballester,B., Beal,K.,
Bragin,E., Brent,S., Chen,Y., Clapham,P. and Clarke,L. (2009)
Ensembl 2009. Nucleic Acids Res., 37, D690–D697.
8. Pruitt,K.D., Tatusova,T. and Maglott,D.R. (2007) NCBI reference
sequences (RefSeq): a curated non-redundant sequence database
of genomes, transcripts and proteins. Nucleic Acids Res., 35,
9. Rattei,T., Walter,M., Arnold,R., Anderson,D.P. and Mewes,W.
(2007) Using public resource computing and systematic
pre-calculation for large scale sequence analysis. Lecture Notes
Comp. Sci., 4360, 11–18.
10. Handelsman,J. (2004) Metagenomics: application of genomics
to uncultured microorganisms. Microbiol Mol. Biol. Rev., 68,
11. Rusch,D.B., Halpern,A.L., Sutton,G., Heidelberg,K.B.,
Williamson,S., Yooseph,S., Wu,D., Eisen,J.A., Hoffman,J.M. and
Remington,K. (2007) The Sorcerer II global ocean sampling
expedition: Northwest Atlantic through eastern tropical Pacific.
PLoS Biol., 5, e77.
12. Markowitz,V.M., Ivanova,N.N., Szeto,E., Palaniappan,K., Chu,K.,
Dalevi,D., Chen,I., Min,A., Grechkin,Y. and Dubchak,I. (2008)
IMG/M: a data management and analysis system for metagenomes.
Nucleic Acids Res., 36, D534–D538.
13. Yooseph,S., Sutton,G., Rusch,D.B., Halpern,A.L., Williamson,S.J.,
Remington,K., Eisen,J.A., Heidelberg,K.B., Manning,G. and Li,W.
(2007) The Sorcerer II Global Ocean Sampling expedition:
expanding the universe of protein families. PLoS Biol., 5, e16.
14. Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution
matrices from protein blocks. Proc. Natl Acad. Sci., 89,
15. Rattei,T., Arnold,R., Tischler,P., Lindner,D., Stumpflen,V. and
Mewes,H.W. (2006) SIMAP: the similarity matrix of proteins.
Nucleic Acids Res., 34, D252–D256.
16. Conesa,A., Götz,S., Garcia-Gomez,J.M., Terol,J., Talon,M. and
Robles,M. (2005) Blast2GO: a universal tool for annotation,
visualization and analysis in functional genomics research.
Bioinformatics, 21, 3674–3676.
17. Götz,S., Garcia-Gomez,J.M., Terol,J., Williams,T.D.,
Nagaraj,S.H., Nueda,M.J., Robles,M., Talon,M., Dopazo,J. and
Conesa,A. (2008) High-throughput functional annotation and
data mining with the Blast2GO suite. Nucleic Acids Res., 36,
18. Walter,M.C., Rattei,T., Arnold,R., Guldener,U.,
Munsterkotter,M., Nenova,K., Kastenmuller,G., Tischler,P.,
Wolling,A. and Volz,A. (2008) PEDANT covers all complete
RefSeq genomes. Nucleic Acids Res., 37, D408–D411.
19. Huson,D.H., Auch,A.F., Qi,J. and Schuster,S.C. (2007) MEGAN
analysis of metagenomic data. Genome Res., 17, 377–386.