From: Medical Biomethods Handbook
Edited by: J. M. Walker and R. Rapley © Humana Press, Inc., Totowa, NJ
Bioinformatic Tools for Gene and Protein Sequence Analysis
Bernd H. A. Rehm and Frank Reinecke
1. Introduction
The rapid development of efficient, automated DNA-sequencing methods has strongly
advanced the genome-sequencing era, culminating in the determination of the entire human
genome in 2001 (1,2). An enormous amount of DNA sequence data is now available, and databases
are still growing exponentially (see Fig. 1). Analysis of this overwhelming amount of data, including
hundreds of genomes from both prokaryotes and eukaryotes, has given rise to the field of
bioinformatics. Development of bioinformatic tools has evolved rapidly in order to identify
genes that encode functional proteins or RNA. This is an important task, considering that even
in the best studied bacterium Escherichia coli more than 30% of the identified open reading
frames (ORFs) represent hypothetical genes with no known function. Future challenges of
genome-sequence analysis will include the understanding of diseases, gene regulation, and
metabolic pathway reconstruction. In addition, a set of methods for protein analysis summa-
rized under the term proteomics holds tremendous potential for biomedicine and biotechnology
(141). The large number of bioinformatic tools that have been made available to scientists
during the last few years has presented the problem of which to use and how best to obtain
scientifically valid answers (3). In this chapter, we will provide a guide for the most efficient
way to analyze a given sequence or to collect information regarding a gene, protein, structure,
or interaction of interest by applying current publicly available software and databases that
mainly use the World Wide Web. All links to services or download sites are given in the text or
listed in Table 1; the succession of tools is briefly summarized in Fig. 2.
2. Software Tools for Bioinformatics
In the first part of this chapter, software tools will be described that mainly use algorithms
and are based either on very short-sequence comparisons, physical or chemical properties of
molecules, or statistics. A second group of software relies mainly on databases and will be
discussed below. As so-called integrated methods are evolving and becoming more and more
popular, it is difficult to divide programs into these two groups.
There are many programs routinely used to generate contiguous DNA sequences from raw
data obtained from high-throughput sequencers, to assign quality scores to each base, remove
contaminating sequences (such as vector DNA), and provide the means to link sequences con-
taining applications. First, base-callers like Phred (4,5) extract raw sequences from raw data.
There are also contig assemblers like Phrap (University of Washington,
http://bozeman.mbt.washington.edu/phrap.docs/phrap.html) or CAP3 (6) that assemble fragments to contigs
and packages like consed (7) or GAP4 (8), which are used to finish sequencing projects. These
programs are not explained in detail here.
Table 1. Compilation of Links to Bioinformatic Services and Software
Fig. 1. Development of stored DNA-sequence information in GenBank from 1982 to 2002
(base pairs and sequences). (From http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html.)
Table 1 (continued): Protein Properties and Structure; Protein Families and Motifs; SNPs and Expression Profiling
Any DNA region that can be assigned a function is of special interest. The sequence ele-
ments that can be found within them include promoters and various transcription factor-bind-
ing sites, ribosome-binding sites, start and stop codons, splice sites, and so forth. These are
referred to as signals. Methods to detect them are termed signal sensors. In contrast, extended
and variable-length regions, such as exons and introns, are termed contents. These are recog-
nized by methods that can be called content sensors (9). It is a major challenge to find the
functional sites responsible for gene structure, regulation and transcription. Computational
methodology for finding genes and other functional sites in genomic DNA has evolved signifi-
cantly over the past years (10–20).
2.1. Gene-Finding Methods
Fig. 2. Succession of programs during assignment of function to DNA or proteins. Arrows
indicate action of bioinformatic tools; detailed descriptions can be found in the chapter given in [...]
The most basic signal sensor is a simple consensus sequence or an expression that describes
a consensus sequence together with allowable variations. More sophisticated types of signal sensor,
such as neural nets, are extensively used (21,22). The most important content sensors are programs
that predict coding regions (23). In prokaryotes, genes are identified simply by looking
for long ORFs. However, these relatively simple methods cannot be transferred to higher
eukaryotes. In order to discriminate coding from noncoding regions in eukaryotes, exon
content sensors use statistical models of the nucleotide frequencies (24) and dependencies
present in codon structures. The most commonly used statistical models are known as Markov
models. Neural nets are used to combine several coding hints together with signal sensors for
the flanking splice sites in exon detectors (22). Other content sensors include those for CpG
islands (regions that often occur at the beginning of genes, where the dinucleotide CG appears
more frequently than in the rest of the genome) and sensors for repetitive DNA, such as human
Alu sequences (25). The latter sensors are often used as masks or filters that completely
remove the repetitive DNA, leaving the remaining DNA to be analyzed.
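The idea of identifying prokaryotic genes by looking for long ORFs can be illustrated with a minimal sketch. It is not taken from any of the programs cited here; the function name find_orfs, the restriction to ATG start codons and the forward strand, and the length threshold of 100 codons are all assumptions chosen for illustration.

```python
# Minimal ORF scan on the forward strand only (illustrative assumptions:
# ATG is the only start codon, min_codons is an arbitrary length cutoff).
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=100):
    """Return (start, end, frame) tuples for ATG...stop ORFs.

    Coordinates are 0-based; `end` is exclusive and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs
```

A real gene-finder would also scan the reverse strand, allow alternative start codons, and weigh the ORF against a statistical model of coding sequence rather than rely on length alone.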
The statistical signals that signal and content sensors try to localize are usually weak, and
there are usually dependencies between signals and contents, such as the possible correlation
between splice-site strength and exon size (26). During the past decade, several systems that
combine signal and content sensors have been developed in an attempt to identify complete
gene structures. The first program using linguistic rules and a formal grammar for the arrange-
ment of certain signals required for gene prediction was GenLang (27). As with most integrated
gene-finders to date, GenLang uses dynamic programming to combine potential exon regions
and other scored regions and sites into a gene prediction with a maximal total score (13,16,28).
Many integrated gene-finders are built on probabilistic models called hidden Markov models (HMMs). Gene-finding HMMs can be viewed
as stochastic versions of the gene structure grammars. Early gene-finding HMMs included
EcoParse for Escherichia coli (29), which was also used in the annotation of the Mycobacterium
tuberculosis genome (30), as well as Xpound (31), Veil (32), and HMMgene (16) for the human
genome (13). More recent programs include GeneMark HMM (33), GLIMMER (34), and
Critica (35). Other programs that exploit specific detectable differences between coding
and noncoding regions have been developed recently. EasyGene (20) and AMIGene (36) apply
statistical methods to predicted ORFs of prokaryotes, whereas ZCURVE (19) concentrates on nucleotide
distribution in bacterial or archaeal genomes. GeneMarkS is an improved version of
GeneMark, which has been applied to identify genes in the genomes of Bacillus subtilis and
E. coli with the highest accuracy described to date (37). Even physical properties of DNA are
used to predict genes: GeneFizz (38) compares the physics-based structural segmentation
between helix and coil domains. Measuring the spectral rotation based on a process termed
discrete Fourier transform (DFT) has proven to be capable of predicting genes in Saccharomy-
ces cerevisiae (39). A slightly more general class of probabilistic models, called generalized
HMMs or (hidden) semi-Markov models, has roots in GeneParser (40) and is more fully devel-
oped in Genie (41) and, subsequently, GenScan (42). The above gene-finders predict gene
structure based only on general features of genes, rather than using explicit comparisons to
known genes and their corresponding proteins, or auxiliary information such as expressed se-
quence tag (EST) matches. Some gene-finding systems combine multiple statistical measures
with protein database homology searches performed using the predicted gene or deduced pro-
tein (11,15,43). This homology approach was developed by Gelfand et al. (44) and is used by
EUGENE’HOM (45) or Procrustes (15), which uses a “spliced alignment” algorithm, similar
to a Smith–Waterman algorithm (46), to derive a putative gene structure by aligning the DNA
to a partial protein homolog of the gene to be predicted. Instead of inventing yet
another gene-finding program, several approaches combine different existing
algorithms, using the advantages of each program to offset the disadvantages inherent to each single
program.
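The use of the discrete Fourier transform as a content sensor rests on the observation that coding regions tend to show a three-base periodicity. The following sketch is a generic illustration of that principle and is not the method of ref. 39; the function name period3_power and the normalization by sequence length are assumptions.

```python
import cmath

def period3_power(dna):
    """Spectral power at frequency 1/3, summed over the four base indicators.

    For each base, the sequence is turned into a 0/1 indicator signal and a
    single DFT coefficient (period of three bases) is evaluated.
    """
    dna = dna.upper()
    n = len(dna)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        coeff = sum(cmath.exp(-2j * cmath.pi * pos / 3)
                    for pos, b in enumerate(dna) if b == base)
        total += abs(coeff) ** 2
    return total / n  # length normalization is an illustrative choice
```

In practice the value would be computed in a sliding window along the genome, and windows with a pronounced peak would be flagged as candidate coding regions.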
2.2. Signal-Finding Bioinformatics Methods
There is a tremendous variety of software exploiting various ways to identify structural
genes in a DNA sequence, and as already outlined, there are more elements present than just
structural genes. Some gene-finding approaches presented above already consider bases out-
side the start and stop codons, but there are specialized resources, namely MatInspector (48)
and SignalScan (49), to find transcription factor-binding sites using a relatively small database
called TRANSFAC (50). Another program designed to find putative eukaryotic polymerase II
promoter sequences in primary sequence data is PromotorScan (51).
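The simplest signal sensor, a consensus sequence with allowable variations, can be sketched by translating IUPAC nucleotide codes into a regular expression. This is an illustration only and does not reproduce MatInspector, SignalScan, or TRANSFAC; the function names and the example consensus TATAWAW are assumptions.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regular-expression classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Compile an IUPAC consensus string into a regular expression."""
    return re.compile("".join(IUPAC[ch] for ch in consensus.upper()))

def find_sites(consensus, dna):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # A lookahead group lets finditer report overlapping occurrences.
    pattern = re.compile("(?=(%s))" % consensus_to_regex(consensus).pattern)
    return [m.start() for m in pattern.finditer(dna.upper())]
```

For example, find_sites("TATAWAW", "GGTATAAATCCTATATATGG") reports the two positions at which the hypothetical consensus occurs. Weight matrices, as used by the specialized resources, replace this all-or-nothing match with a graded score.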
2.3. Sequence-Alignment Methods
An alignment refers to the procedure of comparing two or more sequences by looking for a
series of individual characters or character patterns that are found in the same order in the
sequences. To align sequences, identical or similar characters are grouped in the same column,
whereas nonidentical characters can either be placed in the same column, resulting in a mis-
match, or opposite a gap in one of the other sequences. In an optimal alignment, mismatches
and gaps are distributed in such a way that matches are maximized. Two types of sequence
alignment have been recognized. In a global alignment, an attempt is made to align the entire
sequences with as many characters as possible. In a local alignment, stretches of sequence with
the highest density of matches are given the highest priority, thus generating one or more
islands of matches in the aligned sequences while leaving the poorly matching regions between them unaligned.
Alignments are the working principle of programs (e.g., BLAST) that search for similar sequences
by successively aligning a query with an entire sequence database; they are also useful for
determining the evolutionary distance between homologous sequences of different origin.
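The difference between global and local alignment can be made concrete with a short dynamic-programming sketch that computes only the optimal score. It assumes a simple match/mismatch scheme and a linear gap penalty (all three values are arbitrary); the function name align_score is hypothetical, and the code is not the implementation used by BLAST or any other program mentioned in this chapter.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score by dynamic programming with a linear gap penalty.

    local=False gives a global (Needleman-Wunsch style) score;
    local=True gives a local (Smith-Waterman style) score.
    """
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)   # a local alignment may restart anywhere
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]
```

With local=True, two otherwise unrelated sequences that share a short common stretch still receive a high score, whereas the global score is lowered by the mismatching flanks.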
2.3.1. Multiple-Sequence Alignment
Comparison of multiple sequences can reveal gene functions that are not evident from simple
sequence homologies. As a result of genome-sequencing projects, new sequences are often
found to be similar to several uncharacterized sequences, defining whole families of novel
genes with no obvious function. However, such a family enables the application of efficient
alternative similarity search methods. Software packages are now available that derive profiles
from multiple-sequence alignments. Profiles incorporate position-specific scoring information
that is derived from the abundance of a given residue in an aligned column. Because sequence
families preferentially conserve certain critical residues and motifs, this information should
allow more sensitive database searches. Most new profile software is based on statistical
HMMs. Much more comprehensive reviews of the literature on profile HMM methods are
available elsewhere (12,28,52–55). ClustalW is a well-supported and frequently used free pro-
gram capable of dealing with large numbers of sequences at high processing speeds as com-
pared to other alignment algorithms, and it is available for Macintosh, Windows, and various
UNIX systems (56). There is also a graphical user interface—ClustalX (57). However, for
sequences with less than 30% identity, the program T-COFFEE might be used, which is more
accurate than the progressively aligning ClustalW, but slower. Once the family is defined,
obtaining an acceptable multiple-sequence alignment is usually straightforward. Multiple align-
ments can either be generated from FASTA format files (using a ClustalW supporting website)
or from DbClustal, which produces a BLAST output in which the family members can be
selected and the multiple alignment subsequently produced. It is important to inspect the align-
ment in the graphical display of ClustalX to make sure that it appears consistent. The alignment
can also be saved in multiple-sequence format (MSF), which can be read by other software for
further analysis (e.g., careful editing, trimming, coloring, shading, and printing). GeneDoc, available
for Windows (www.psc.edu/biomed/genedoc), offers many of these editing features (58).
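How a profile derives position-specific scores from the abundance of residues in each aligned column can be sketched as follows. The example assumes a uniform background distribution and a pseudocount of 1, both arbitrary choices, and the function names are hypothetical; profile HMM packages use considerably more elaborate models.

```python
from collections import Counter
import math

def column_profile(alignment, pseudocount=1.0, alphabet="ACGT"):
    """Per-column log-odds scores (log2, uniform background) from aligned sequences."""
    background = 1.0 / len(alphabet)
    profile = []
    for column in zip(*alignment):
        # Gap characters and other symbols outside the alphabet are ignored.
        counts = Counter(ch for ch in column if ch in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append({
            res: math.log2(((counts[res] + pseudocount) / total) / background)
            for res in alphabet
        })
    return profile

def score_against_profile(profile, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(col[res] for col, res in zip(profile, segment))
```

A segment that carries the conserved residue in every column scores well above a random segment, which is the basis of the increased sensitivity of profile searches.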
2.4. Phylogenetic Analysis
Phylogenetic trees can be constructed based on multiple alignments. In rooted trees, the
ancestral state of the organisms, or genes, being studied is shown at the bottom of the tree, and
the tree branches, or bifurcates, until it reaches the terminal branches, tips, or leaves at the top
of the tree. An unrooted tree is a less intuitive, more abstract concept. Unrooted trees represent
the branching order, but do not indicate the root, or location, of the last common ancestor.
Ideally, rooted trees are preferable, but almost every phylogenetic reconstruction algorithm
provides an unrooted tree. Molecular sequence analysis is a field in its infancy and an inexact
science in which there are few analytical tools that are truly based on general mathematical and
statistical principles. Consequently, many phylogenetic trees reconstructed from molecular se-
quences are incorrect. This is mainly caused by the following:
1. Incorrect sequence alignments.
2. The failure to properly account for site-to-site variation (all sites within sequences can evolve
at different rates).
3. Unequal rate effects (the inability of most tree-building algorithms to produce good phyloge-
netic trees when genes from different taxa in the tree have evolved at different rates).
Of the three pitfalls, alignment artifacts are potentially the most serious. A new algorithm,
paralinear (logdet) distances (59,60), provides a simple, but rigorous, mathematical solution
for the third pitfall. For a discussion of many other useful algorithms currently available,
including maximum parsimony, maximum likelihood, and distance methods, see refs. 61 and 62.
Sequence alignments should be carefully checked before calculating evolutionary trees. There
are several programs to help calculate trees from genome data. The best known software for
reconstructing trees is the program PAUP (phylogenetic analysis using parsimony), which is
part of the GCG sequence analysis package that supports logdet analysis. PAUP is user friendly
and comprehensive. PHYLIP/Phylodendron are further well-known packages that contain a
large variety of routines, including several that incorporate the latest theoretical developments.
A stand-alone software package available for many computer platforms for viewing, editing,
rearranging, and printing trees is TreeView (64).
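Distance methods start from pairwise distances calculated on the aligned sequences. As a minimal sketch, the following code computes the uncorrected p-distance and the Jukes-Cantor correction for nucleotide sequences. Jukes-Cantor is used here only because it is the simplest correction; it is not the paralinear (logdet) distance discussed above, and the function names are assumptions.

```python
import math

def p_distance(a, b):
    """Fraction of differing sites among aligned positions without gaps."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)

def jukes_cantor(a, b):
    """Jukes-Cantor distance d = -3/4 * ln(1 - 4p/3) for nucleotide sequences."""
    p = p_distance(a, b)
    if p >= 0.75:
        return float("inf")  # saturated: the correction is undefined
    return -0.75 * math.log(1 - 4 * p / 3)
```

The corrected distance is always at least as large as the observed fraction of differences, because it accounts for multiple substitutions at the same site. A matrix of such distances is the input for tree-building routines such as those in PHYLIP.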
2.5. Protein Properties and Structure Prediction
Protein sequences allow extensive calculations that help to assign function, to predict topol-
ogy (subcellular localization, spanning of membranes) and structure, and to find sites that are
likely to be cleaved or modified; interaction or catalytic mechanisms can be simulated.
Bioinformatic resources on the WWW range from the determination of the molecular weight to
complex threading and three-dimensional (3D) prediction algorithms. A huge list of tools can
be found on the ExPASy proteomic tools homepage (65). Because of the great variety of pro-
grams available, several of these single tools have been integrated into one interface. Examples
are PredictProtein (66) and META PP (67). These integrate resources such as SignalP (68),
which predicts the presence and location of signal peptide cleavage sites in amino acid sequences
from different organisms; NetOglyc (69–71), which predicts mucin-type O-glycosylation sites in
mammalian proteins; NetPhos (72), which predicts potential phosphorylation sites at serine, threonine,
or tyrosine residues in protein sequences; NetPico (73), which predicts cleavage sites of
picornaviral proteases; and ChloroP (74), which predicts chloroplast transit peptides and their
cleavage sites. Secondary structure prediction is performed by JPRED (75); transmembrane
helices are identified using TMHMM (76), TopPred (77,78), and DAS (79). Structural
databases are searched by FRSVR (80,81) and SAM-T02 (82) to detect similarities between remote
homologs that are too weak to be inferred from simple sequence-alignment techniques.
However, the detection of remote homologs remains an unsolved problem because most of the
results returned are presumably wrong, so results should be checked very carefully. Homology
modeling is also covered by META PP, applying both SWISS-MODEL (83–85) and
CPHmodels (86) described below.
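A simple property-based calculation of the kind offered by many web tools is a sliding-window hydropathy profile. The sketch below uses the Kyte-Doolittle hydropathy scale with a window of 19 residues and a threshold of 1.6, values commonly quoted for spotting candidate membrane-spanning segments; these settings and the function names are assumptions, and the approach is far simpler than the HMM used by TMHMM.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every window of the given length."""
    values = [KYTE_DOOLITTLE[res] for res in protein.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def candidate_tm_windows(protein, window=19, threshold=1.6):
    """Start positions (0-based) of windows whose mean hydropathy reaches the threshold."""
    return [i for i, h in enumerate(hydropathy_profile(protein, window))
            if h >= threshold]
```

Consecutive hits would normally be merged into a single predicted segment, and any such prediction should be compared with the output of the dedicated programs listed above.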
A database is any collection of data or information that is specially organized for rapid
search and retrieval by a computer. To cope with not only the vast amount of sequence informa-
tion but also other experimental data, biological databases have been set up and are updated
continuously. For biological data such as information about a protein’s sequence, structure,
modification, or interaction, software tools have been developed that enable searching, com-
paring, and retrieving these stored data.
3.1. Matching Algorithms
Basic queries like finding a key word in an article employ simple pattern-matching algo-
rithms that do not need a statistical evaluation of the result. Bioinformatic tools started like this
and most tools apply algorithms that match a query against all targets in a database, taking into
account the degree and type of mismatches or gaps (87). Deciding whether the observed degree
of structural likeness is significant and, therefore, a hint toward functional identity is a task for
statistical methods based on special biological matrices (88,89). These matrices are crucial for
taking the biological nature of the data into account. Depending on their structural relevance, substitutions at
certain amino acid residues, for example, are decisive, whereas other differences can be negligible.
Sequence database matching has proven to be a remarkably useful method for assigning a
function to an unknown sequence. If sequence similarity to one or more database sequences,
whose function is already known, is obtained, the unknown sequence can be inferred to have
the same function, biochemical activity, or structure. The strength of these inferences depends
on the strength of the similarity. As a rough rule, if more than 25–30% of the residues in a protein
alignment are identical, then the sequences are likely to be homologous (90). RNA genes are usually
much more conserved, which is the reason why they represent suitable markers for phyloge-
netic analyses. Functional DNA sequences like promoters or other regulatory regions are sig-
nificantly shorter than genes encoding enzyme proteins or RNA, making them hard to identify
by means of sequence comparison or alignment. Identification and assignment of genes, as well
as the functional and structural classification of proteins, is performed based on similarity of
sequence and/or properties such as motifs or structurally conserved regions. Because DNA
sequences are variable in the third base position of the codon, protein-sequence analysis is the
more valuable approach. In general, sequence analysis requires the comparison of sequences
from unknown genes or proteins with those of known function deposited in databases. How-
ever, the sequences of homologous proteins can diverge greatly over time, whereas the struc-
ture or function of the same proteins has diverged only slightly. Conversely, proteins with
similar folds can exhibit completely different functions. However, much can be deduced about
an unknown protein when significant sequence similarity is detected with a well-studied pro-
tein. Alignment provides a powerful tool to compare related sequences, and the alignment of
two residues could reflect a common evolutionary origin or represent common structural and/
or catalytic roles, not always reflecting an evolutionary process.
3.1.1. Substitution Matrices and Alignment Scores
In order to identify the most valuable alignments, the standard procedure is to assign scores
to them. For each pair of letters that can be aligned, a substitution score is chosen. The com-
plete set of these scores is called a substitution matrix [PAM (91) and BLOSUM (92)]. Addi-
tionally, scores are chosen for gaps, which consist of one or more adjacent nulls in one sequence
aligned with letters in the other. Because a single mutational event can insert or delete more
than one residue, a long gap should be penalized only slightly more than a short gap. Accordingly,
affine gap costs, which charge a relatively large penalty for the existence of a gap and a
smaller penalty for each residue it contains, have become the most widely used gap-scoring
system. The quality of sequence comparison depends very much on the choice of appropriate
substitution and gap scores. In brief, for ungapped alignments, the alignment score of a given
pair of residues i and j depends on the fraction qij of true alignment positions in which these
paired residues tend to appear (88). Accordingly, the design of a good substitution matrix is
based on estimating the target frequencies qij accurately. However, the target frequencies
depend on the degree of evolutionary divergence between the related sequences of interest.
Therefore, a series of matrices tailored to varying degrees of evolutionary divergence are
required (88,91,92). This was the intention in constructing the PAM and BLOSUM series of
amino-acid-substitution matrices. These matrices are generally used unmodified for gapped
local and global alignment. There is no widely accepted theory for selecting gap costs that
requires adjustment for individual similarity searches (93).
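The relationship between target frequencies and substitution scores, and the form of a gap cost with separate existence and per-residue charges, can be sketched as follows. The score is written in the usual log-odds form, in which qij is compared with the product of the background frequencies of the two residues. The scaling factor of 2 (half-bit units) and the gap parameters 11 and 1 are illustrative assumptions, and the function names are hypothetical.

```python
import math

def log_odds_score(q_ij, p_i, p_j, scale=2.0):
    """Substitution score scale * log2(q_ij / (p_i * p_j)), rounded to an integer.

    q_ij is the target frequency of the aligned pair; p_i and p_j are the
    background frequencies of the two residues.
    """
    return round(scale * math.log2(q_ij / (p_i * p_j)))

def affine_gap_cost(length, gap_open=11, gap_extend=1):
    """Cost of one gap: a charge for its existence plus a charge per residue."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length
```

With these example parameters a gap of 10 residues costs 21, less than twice the cost of a single-residue gap (12), which reflects the reasoning that one mutational event can insert or delete several residues at once.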
3.1.2. Alignment Scores and E Values
To test the biological relevance of a global or local alignment of two sequences, one needs to
know how large an alignment score can be expected to occur by chance. Current versions
of the FASTA and BLAST search programs report the raw scores of the alignments they
return, as well as assessments of their statistical significance based on the extreme value distri-
bution. Most simply, these assessments take the form of E values. The E value for a given
alignment depends on its score, as well as the lengths of both the query sequence and the
database sequence searched. It represents the number of distinct alignments with equivalent or
superior scores that might have been expected to occur only by chance. The smaller the E
value, the more likely that the alignment is significant and not occurring by chance (88,89,94).
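For local alignments, the dependence of the E value on score and search-space size is commonly written as E = K m n exp(-lambda S), where m and n are the query and database lengths and K and lambda are statistical parameters of the scoring system. The sketch below applies this formula; the function names are hypothetical, and the parameter values in the usage example are placeholders, because the actual values depend on the substitution matrix and gap costs in use.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalized (bit) score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

# Usage with placeholder parameters (lam and k are assumptions, not program output)
example = e_value(raw_score=100, query_len=250, db_len=1e8, lam=0.267, k=0.041)
```

Expressed with the bit score, the same quantity is E = m n 2^(-S'), which makes scores from different scoring systems comparable. Doubling the database size doubles the E value of a given score.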
3.1.3. Filtering Database Sequences
Many DNA and protein sequences contain regions of highly restricted nucleic acid and amino
acid compositions and regions of short elements repeated many times. The standard alignment
models and scoring systems were not designed to capture the evolutionary processes that led to
these low-complexity regions. As a result, two sequences containing compositionally biased
regions can receive a very high similarity score that reflects this bias alone. For many purposes,
these regions are not relevant and can obscure other important similarities. Therefore, pro-
grams that filter low-complexity regions from query or database sequences will often turn a
useless database search into a valuable one. For this reason, the NCBI BLAST server will
remove such sections in proteins using a program termed SEG (95). Although these programs
automatically remove the majority of problematic matches, some problems invariably occur.
Furthermore, masking might preclude interesting hits. Therefore, it is sometimes useful to adjust the masking
parameters or turn filtering off completely.
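The principle of low-complexity filtering can be sketched with a sliding window that masks segments of low compositional (Shannon) entropy. This is a simplified stand-in for illustration and not the SEG algorithm; the window of 12 residues, the threshold of 2.2 bits, and the function names are assumptions.

```python
import math
from collections import Counter

def shannon_entropy(segment):
    """Compositional entropy of a segment in bits."""
    n = len(segment)
    return -sum(c / n * math.log2(c / n) for c in Counter(segment).values())

def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Replace every window whose entropy falls below the threshold with mask_char."""
    seq = seq.upper()
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < threshold:
            masked[i:i + window] = mask_char * window
    return "".join(masked)
```

Because whole windows are masked, a few residues flanking a biased region are masked as well. Raising or lowering the threshold shows directly how masking parameters trade the removal of spurious matches against the loss of potentially interesting hits.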
3.1.4. Database Searching
Fundamentally, performing a database search is a very simple operation: A query sequence
is aligned with each of the sequences in a database and a score describing the degree of likeness
is calculated using a suitable matrix. Nevertheless, sequence comparison procedures should be
applied carefully. The design of a BLAST database search requires consideration of the kind of
information one hopes to obtain about the query sequence of interest (96). A major constraint
of database searching is that it only reveals similarity and might not indicate function. Therefore,
it is better to use data that describe the natural situation as accurately as possible (e.g.,
comparing 3D structures of proteins with each other). This is because 3D structures, rather than
primary sequences, are conserved during evolutionary processes. However, in most cases, the
information will consist of a primary sequence alone. One should, nonetheless, compare
deduced protein sequences rather than DNA if the query DNA is likely to encode a protein.
This also enables the detection of remote homologs (97). In DNA comparisons, there is noise
from the rapidly mutated third-base position in each codon and from comparisons of noncoding
frames. In addition, amino acids have chemical characteristics that allow degrees of similarity
to be assessed, rather than simple recognition of identity or nonidentity. DNA vs DNA com-
parison (BLASTN program) is typically used to find identical regions of sequence in a data-
base. One should apply this search in order to find RNA-encoding or regulatory regions and/or
to discover whether a protein-encoding gene has been previously sequenced or contains splice
junctions. Briefly, protein-level searches are valuable for detecting evolutionarily related genes,
whereas DNA searches are best for locating nearly identical regions of sequence. The follow-
ing should be considered when designing a database search:
1. Search a large current database (SWISS-PROT, EMBL, GenBank).
2. Compare relevant data.
3. Filter query for low-complexity regions.
4. Interpret scores with E values.
5. Recognize that most homologs are not found by pairwise sequence comparison.
6. Consider slower and more powerful methods, but use iterative programs with great caution
(iterative programs might indicate homology, which is not related to function).
3.2. Sequence Databases
Protein-homology searches are usually performed employing the nr (nonredundant) sequence
database at the NCBI (National Center for Biotechnology Information) website. The nr database
combines data from several sources, removes redundant identical sequences, and yields a col-
lection with nearly all known proteins. A frequent update of the NCBI nr database guarantees
that the most recent and complete database is used. Obviously, a search will not identify a
sequence that has not been included in the database and, as databases are growing so rapidly,
use of a current database is essential. Several specialized databases are also available, each of
which is a subset of the nr database. One might also wish to search DNA databases at the
protein level. Programs can do so automatically by first translating the DNA in all six reading
frames and then making comparisons with each of these translations. The nr database, which
contains the most publicly available DNA sequences, is useful to search when hunting for new
genes; identified genes in this database would already be in the protein nr database. Because of
the different combinations of queries and database types, there are several variants of BLAST
(87,89,90). The BLAST programs can be run via the Internet or they can alternatively be down-
loaded from an ftp site to run locally. Another option is to use the FASTA package (97). The
FASTA program is slower but can be more effective than BLAST. The package also contains
SSEARCH, an implementation of the rigorous Smith–Waterman algorithm, which is slow but
the most sensitive. Iterative programs such as PSI-BLAST require extreme care in their opera-
tion because they can provide misleading results; however, they have the potential to find more
homologs than purely pairwise methods. The effectiveness of any alignment program depends
on the scoring systems it employs (88,92,93).
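Searching DNA at the protein level requires translation in all six reading frames, three on each strand. A minimal sketch using the standard genetic code is given here; the function names are hypothetical, codons containing ambiguous bases are translated as X, and the code does not reproduce the translation routines of BLAST or FASTA.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames
```

Each of the six translations is then compared with the protein query or database, so that a coding region is detected regardless of strand and frame.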
3.3. Protein Family and Motif Databases
There are several collections of amino-acid-sequence motifs that indicate particular struc-
tural or functional elements. Web-based searches of these collections with a newly identified
sequence allow reasonably confident functional predictions to be made. A variety of genome-
and cDNA-sequencing projects is producing raw sequence data at a breathtaking speed, creat-
ing the need for a large-scale functional classification effort. On a smaller scale, the average
molecular biologist can also be faced with a new sequence without any a priori functional
knowledge. Any hint as to whether the newly identified gene encodes a transcription factor, a
cytoskeletal protein, or a metabolic enzyme would certainly help to interpret the experimental
results and would suggest a direction for subsequent investigations.
3.3.1. Protein vs Domain Classification
The first step is usually a database search with BLAST or a similar program. Optimally, the
BLAST output would show a clear similarity to a single, well-characterized protein spanning
the complete length of the query protein. However, in the worst case, the output list would fail
to show any significant hit. In reality, the most frequent result is a list of partial matches to
assorted proteins, most of them uncharacterized, with the remainder having dubious or even
contradictory functional assignments. Much of this confusion is caused by the modular archi-
tecture of the proteins involved. An analysis of known 3D protein structures reveals that, rather
than being monolithic, many of them contain multiple folding units. Each unit (domain) has its
own hydrophobic core and has most of its residue–residue contacts internally. In order to fulfill
these conditions, independent domains must have a minimum size of approx 50 residues unless
stabilized by metal ions or disulfide bridges. Analysis of protein sequences supports this
structural observation. Sequence pairs frequently exhibit localized regions of similarity,
whereas the rest of the proteins are totally dissimilar. Folding independence for all of these
so-called homology domains has not been demonstrated experimentally. For protein classifica-
tion, it is important to note that the homology domains also frequently harbor independent
functions. Some domains have enzymatic activity, others bind to small messenger molecules,
and others specifically bind to DNA, RNA, or proteins. A multidomain protein can, therefore,
have more than one function and belong to more than one protein family or class. For this
reason, most of the current approaches to protein functional classification focus on domains
rather than complete proteins.
3.3.2. The Protein Superfamily Assignment
Efforts have been undertaken to group protein sequences into families and superfamilies.
The various approaches differ in their degree of automation, their comprehensiveness, their
focus on complete proteins or protein domains, and the methodology applied. Some of these
efforts are aimed exclusively at the classification of existing sequence data. Others go one step
further and aim to extract the essential features from sequence families and to store them in the
form of domain or motif descriptors, which can then be used for searches with user-supplied
protein sequences. These searches are highly sensitive, which has proven to be most useful
for the functional assignment of unknown proteins. A parallel development, which will be
discussed later, is the classification of protein 3D structures. A comprehensive discussion of
this topic can be found in refs. 98 and 99. Some of the most popular collections, which consider
the modular nature of proteins, are briefly discussed. The PROSITE pattern library was one of
the pioneering efforts in collecting descriptors for important protein motifs with biological
relevance (100,101). A PROSITE pattern does not describe a complete domain or even protein,
but just tries to identify the functionally most important residue combinations, such as the
catalytic site of an enzyme. All motifs are accompanied by extensive documentation, including
references. However, the short patterns do not contain enough information to yield statistically
significant matches in the large and growing protein databases. Consequently, a certain number
of false-positive hits is to be expected when carrying out a database search, and any hit reported
after scanning the PROSITE database with a sequence has to be treated with appropriate caution.
To overcome these limitations, the PROSITE pattern library has been supplemented since
1995 by the PROSITE profile library (101). Generalized profiles are at an intermediate position
between a sequence-to-sequence comparison and the matching of a regular expression to a
sequence (102). The ProDom database was the first comprehensive collection of complete pro-
tein domains (103,104). It is derived from SWISS-PROT, and the domains are denoted only by
cluster numbers and do not contain any biological annotation. Moreover, the automatically
determined domain boundaries are unreliable and the associated search methods are not very
sensitive. Pfam, which is derived from ProDom, contains HMMs, which are conceptually re-
lated to the PROSITE profiles (53,105). The Pfam models typically span complete protein
domains and can be searched with the HMMER package or on a web-based server. The current
release of Pfam (10.0) contains 6190 families (106). Similar to the PROSITE profiles, the Pfam
models are refined iteratively, starting from clear homologs and incorporating increasingly
distant family members in the process. Because of their information-rich descriptors, both col-
lections are able to detect even very distant instances of a protein motif that are rarely found by
any other method. Because Pfam models and PROSITE profiles can be interconverted (102),
combination searches are available at InterPro (107,108). The current release of InterPro (7.0)
contains 8547 entries describing 6416 families, 1902 domains, 163 repeats, 26 active sites, 20
binding sites, and 20 posttranslational modifications. The use of this integrated service is there-
fore recommended, although there are still some specialized databases that are not covered.
MEROPS (109) is a catalog and classification system of enzymes with proteolytic activity
(peptidases or proteases).
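The comparison of PROSITE patterns with regular expressions can be illustrated by a minimal Python sketch. It translates the basic elements of the PROSITE pattern syntax into a regular expression and scans a sequence for matches. The pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern; the function names and the test sequence are hypothetical and serve only as illustration, and pattern anchors are not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression.

    Supported elements: x (any residue), [..] (allowed residues),
    {..} (forbidden residues), and repeat counts such as x(2) or x(2,4).
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate an optional repeat count, e.g. "x(2,4)" -> "x" and "2,4".
        m = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, count = m.groups()
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        # Single letters and [..] groups are already valid regex syntax.
        parts.append(core + ("{" + count + "}" if count else ""))
    return "".join(parts)


def scan(pattern: str, sequence: str) -> list[int]:
    """Return the 1-based start positions of all matches, overlaps included."""
    regex = re.compile("(?=" + prosite_to_regex(pattern) + ")")
    return [m.start() + 1 for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(scan("N-{P}-[ST]-{P}", "MNGTAANPSA"))
```

Because such a pattern constrains only four positions, matches of this kind are expected to arise by chance in large databases, which is the reason for the caution expressed above.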
NIFAS is a Java applet, which retrieves domain information from the Pfam database and
uses ClustalW to calculate a tree for a given domain and to enable visual analysis of domain
evolution in proteins. Consideration of the evolution of certain domains might be important for
functional annotation of modular proteins and for understanding the function of individual
domains (110). SMART (simple modular architecture research tool) allows the identification
and annotation of genetically mobile domains and the analysis of domain architectures. The
SMART database is an independent collection of HMMs (domain families)—660 in version
3.5—focusing on protein domains related to signaling, extracellular, and chromatin-associated
proteins (111–113). These domains are extensively annotated with respect to phyletic distribu-
tions, functional class, tertiary structure, and functionally important residues. Domains found
in the nr protein database as well as search parameters and taxonomic information are stored in
a relational database system. User interfaces to this database allow searches for proteins con-
taining specific combinations of domains in defined taxa. BLOCKS (114) and PRINTS (63,115)
are two motif databases that represent protein or domain families by several short, ungapped
multiple alignment fragments. The current release of BLOCKS (13.0) contains 8656 blocks
representing 2101 groups, which are derived from PROSITE patterns. The blocks for the
BLOCKS database are made automatically by looking for the most highly conserved regions in
groups of proteins documented in the PROSITE database. The Internet-based versions of the
PROSITE and SWISS-PROT databases that are used are located at the ExPASy molecular
biology web-server of the Geneva University Hospital and the University of Geneva. The blocks
produced by Block Maker are built in the same manner as the blocks in the BLOCKS database
but with sequences provided by the user. Results are reported in a multiple-sequence alignment
format and in the standard Block format for searching. PRINTS is a compendium of protein
fingerprints. A fingerprint is a group of conserved motifs used to characterize a protein family;
its diagnostic power is refined by iterative scanning of a composite of SWISS-PROT+SP-
TrEMBL. Usually the motifs do not overlap, but are separated along a sequence, although they
might be contiguous in 3D space. Fingerprints can encode protein folds and functionalities
more flexibly and powerfully than can single motifs, their full diagnostic potency deriving
from the mutual context afforded by motif neighbors. BLOCKS and PRINTS can be searched
with the same programs at the website InterPro (see above). Domains important in signal trans-
duction are likely to be found with the PROSITE profiles or SMART; Pfam emphasizes extra-
cellular domains, and the PROSITE patterns are good at identifying enzyme classes by their
active-site motif. Recently, a unified protein family resource, MetaFam, was generated to sup-
port the general classification efforts (116). MetaFam is a protein family classification built up
from 10 publicly accessible protein family databases (Blocks + DOMO, Pfam, PIR-ALN,
PRINTS, PROSITE, ProDom, PROTOMAP, SBASE, and SYSTERS). MetaFam’s family
“supersets” are created automatically by comparing families between the databases. However,
the number of available single and combined domain descriptors should not be overestimated
as a quality criterion, as the databases and the associated search methods differ in generality
and sensitivity. The most promising approach to predicting the exact function of a protein is to
find its characterized ortholog from a different species or a well-conserved paralog that fulfills
a related but different function (117). In addition, the databases contain large numbers of incorrectly
annotated sequences.
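The idea of representing a family by short, ungapped alignment fragments, as in BLOCKS and PRINTS, can be sketched in a few lines of Python. The code builds a position-specific log-odds profile from one ungapped block and slides it along a query sequence. The uniform background frequencies, the pseudocount, the function names, and the example block are assumptions made for illustration; the actual databases use more elaborate sequence weighting and scoring schemes.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def block_profile(block, pseudocount=1.0):
    """Build per-column log-odds scores from an ungapped block.

    Assumes a uniform background over the 20 standard amino acids.
    """
    width = len(block[0])
    if any(len(seq) != width for seq in block):
        raise ValueError("all block segments must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(block) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(width):
        counts = Counter(seq[col] for seq in block)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_window(profile, sequence):
    """Return (score, 1-based start) of the best-scoring ungapped window."""
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        score = sum(profile[i][sequence[start + i]] for i in range(width))
        if best is None or score > best[0]:
            best = (score, start + 1)
    return best


if __name__ == "__main__":
    profile = block_profile(["GKST", "GKSS", "GKTT"])
    print(best_window(profile, "MAAGKSTLLA"))
```

A fingerprint in the PRINTS sense would combine several such profiles and additionally require that their best windows occur in the expected order along the sequence.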
3.4. Protein Structure Databases
The complexity and sophistication of biological molecular interactions are astonishing. In
this context, it is essential to develop bioinformatic tools that reliably allow the prediction of
protein structure, as the structure determines interaction with all kinds of small molecules (sub-
strates, activators, repressors, drugs) and other proteins (either specific as in natural multiprotein
complexes or unspecific), consequently revealing the protein’s function. In the near future,
representative structures for most water-soluble protein domains will be available, which will
allow modeling and classification of related sequences to provide structures for all gene prod-
ucts. However, elucidating the function of all gene products in vivo will be a long-term chal-
lenge for biologists. The emphasis will shift to understanding of principles and control of
biological function and the interactions between molecules. A 3D model of a protein can help
one to understand the “docking” of ligands and proteins. This understanding is essential for the
rational design or modification of ligands and proteins, for the efficient discovery of drug targets, and
for the design of new drugs directed against proteins of pathogens as well as disease-related human proteins.
3.4.1. Structure Classification
The Protein Data Bank, a computer-based archival file for macromolecular structures, was
founded under the term “Brookhaven National Laboratory Protein Data Bank” (BNL PDB) in
1977 (118). Today, the PDB repository for the processing and distribution of 3D biological
macromolecular structure data contains over 22,053 entries in a standardized file format that
can be browsed and searched online (119). There are several projects taking data from PDB for
further analysis. The SCOP (120) database aims to provide a detailed and comprehensive de-
scription of the structural and evolutionary relationships among all proteins whose structure is
known, including all entries in PDB. It is available as a set of tightly linked hypertext documents,
which make the large database comprehensible and accessible. SCOP uses three different
major levels of hierarchy: family (clear evolutionary relationship), superfamily (probable
common evolutionary origin), and fold (major structural similarity). A similar approach is real-
ized in the CATH database (121,122), which is also a hierarchical domain classification of
protein structures in the PDB but only crystal structures solved to resolution better than 3.0 Å
are considered, together with nuclear magnetic resonance (NMR) structures. CATH employs
four major levels in this hierarchy: class, architecture, topology (fold family), and homologous
superfamily.
3.4.2. Structure Modeling
Three-dimensional structure prediction (modeling) of proteins produces reliable results only
using the “homology modeling” approach, which generally consists of four steps:
1. Data bank searching to identify a structural homolog.
2. Target–template alignment.
3. Model building and optimization.
4. Model evaluation.
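One quantity that is commonly inspected after the target–template alignment step is the percentage of identical residues, because it gives a first impression of how closely the template is related to the target. The following Python sketch computes it for a pair of pre-aligned sequences. The function name and the example alignment are hypothetical, and the choice of denominator (columns without gaps) is only one of several definitions in use.

```python
def percent_identity(target_aln: str, template_aln: str) -> float:
    """Percent identity over alignment columns that contain no gap ('-')."""
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns in which both sequences carry a residue.
    columns = [(a, b) for a, b in zip(target_aln, template_aln)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    identical = sum(1 for a, b in columns if a == b)
    return 100.0 * identical / len(columns)


if __name__ == "__main__":
    # Invented toy alignment: five gap-free columns, four of them identical.
    print(percent_identity("MKT-AYL", "MRTGA-L"))
```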
SWISS-MODEL is a server for automated comparative homology modeling of 3D protein
structures. It pioneered the field of automated modeling starting in 1993 (123) and is the most
widely used free web-based automated modeling facility today. In 2002, the server computed
120,000 user requests for 3D protein models. SWISS-MODEL provides several levels of user
interaction through its Internet interface (124). In the “first approach mode,” only an amino
acid sequence of a protein is submitted to build a 3D model. Template selection, alignment, and
model building are performed fully automatically by the server. In the “alignment mode,”
the modeling process is based on a user-defined target–template alignment. Complex modeling
tasks can be handled with the “project mode” using DeepView (125), the Swiss-PdbViewer
(available for PC, Macintosh, Linux and SGI, downloadable from http://www.expasy.org/
spdbv), an integrated sequence-to-structure workbench. All models are sent back via e-mail
with a detailed modeling report. WhatCheck analyses and ANOLEA evaluations are pro-
vided optionally. Similar homology modelers are CPHmodels (86), Geno3D (126), and
3.5. Protein Interaction Databases
Protein–protein interactions play important roles in nearly every event that takes place in a
cell. The Biomolecular Interaction Network Database (BIND) is a database designed to store
full descriptions of interactions, molecular complexes, and pathways (128). An Interaction
record is based on the interaction between two objects. An object can be a protein, DNA, RNA,
ligand, or molecular complex. The description of an interaction encompasses cellular location,
experimental conditions used to observe the interaction, conserved sequence, molecular location
of interaction, chemical action, kinetics, thermodynamics, and chemical state. The records can be
accessed through a BLAST search against the database to gather information on the interactions
of the query sequence stored in BIND (128). The DIP database (129) catalogs experimentally
determined interactions between proteins. It combines information from a variety of
sources to create a single, consistent set of protein–protein interactions.
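The structure of an interaction record as described for BIND, two objects plus descriptive fields, can be pictured with a small Python data model. This is a sketch only: the class names, field names, and example entries are hypothetical and do not reproduce the actual BIND data specification.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ObjectType(Enum):
    PROTEIN = "protein"
    DNA = "DNA"
    RNA = "RNA"
    LIGAND = "ligand"
    COMPLEX = "molecular complex"


@dataclass(frozen=True)
class BioObject:
    name: str
    kind: ObjectType


@dataclass
class Interaction:
    """An interaction between exactly two objects, with optional description."""
    first: BioObject
    second: BioObject
    cellular_location: Optional[str] = None
    experimental_conditions: Optional[str] = None
    chemical_action: Optional[str] = None

    def partner_of(self, name: str) -> Optional[BioObject]:
        """Return the other object if `name` takes part in this interaction."""
        if self.first.name == name:
            return self.second
        if self.second.name == name:
            return self.first
        return None


def partners(interactions, name):
    """Sorted names of all objects recorded as interacting with `name`."""
    found = (record.partner_of(name) for record in interactions)
    return sorted(obj.name for obj in found if obj is not None)


if __name__ == "__main__":
    prot_a = BioObject("protA", ObjectType.PROTEIN)
    prot_b = BioObject("protB", ObjectType.PROTEIN)
    site_d = BioObject("siteD", ObjectType.DNA)
    records = [Interaction(prot_a, prot_b, cellular_location="cytoplasm"),
               Interaction(site_d, prot_a)]
    print(partners(records, "protA"))
```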
3.5.1. Protein–Protein Interaction Prediction and Docking
The protein–protein or protein–ligand docking problem started to fascinate biophysical
chemists and computational biologists almost 30 yr ago (130,131). Given the 3D structures of
two interacting proteins, a docking algorithm aims to determine the 3D structure of the complex
by rotating and translating the proteins, generating a large number of candidate complexes
in the computer, and selecting favorable ones. Docking procedures are tested first on protein–protein
complexes taken from the Protein Data Bank, mostly protease–inhibitor and antigen–antibody
complexes. In the CAPRI experiment (132), which was inspired by CASP (Critical Assessment
of Structure Prediction), docking algorithms were given atomic coordinates for the protein components of
several target complexes. The predicted interactions were assessed by comparison to X-ray
structures and showed significant success on some of the targets. However, the predictions failed
with others, and progress is still needed before large-scale predictions of protein–protein
interactions can be made reliably. The docking of small molecules and proteins is reviewed
in ref. 133.
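The generate-and-select principle of rigid-body docking described above can be illustrated with a deliberately simplified Python sketch. It works in two dimensions on a handful of points, enumerates rotations and translations of the "ligand," and ranks the candidate poses with a crude score that rewards contacts and penalizes clashes. The distance thresholds, the penalty, the grid, and all names are assumptions chosen for illustration; real docking programs operate in 3D and use far more elaborate search strategies and scoring functions.

```python
import math
from itertools import product


def transform(points, angle, dx, dy):
    """Rotate points about the origin by `angle` (radians), then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + dx, s * x + c * y + dy) for x, y in points]


def score(receptor, ligand, clash=1.0, contact=2.0):
    """Crude score: +1 per contact pair, -10 per clashing pair."""
    total = 0
    for (rx, ry), (lx, ly) in product(receptor, ligand):
        d = math.hypot(rx - lx, ry - ly)
        if d < clash:
            total -= 10
        elif d <= contact:
            total += 1
    return total


def dock(receptor, ligand, n_angles=12, shifts=range(-5, 6)):
    """Enumerate candidate poses and return the best (score, angle, dx, dy)."""
    candidates = []
    for k in range(n_angles):
        angle = 2 * math.pi * k / n_angles
        for dx, dy in product(shifts, shifts):
            pose = transform(ligand, angle, dx, dy)
            candidates.append((score(receptor, pose), angle, dx, dy))
    # Tuples compare by score first, so max() selects the top-scoring pose.
    return max(candidates)


if __name__ == "__main__":
    receptor = [(0.0, 0.0), (1.5, 0.0), (3.0, 0.0)]
    ligand = [(0.0, 0.0), (1.5, 0.0)]
    print(dock(receptor, ligand))
```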
4. Bioinformatics Genomics and Medical Applications
4.1. Genome Analysis and Databases
Most software tools and databases presented above can be used with any kind of DNA or
protein sequence. The availability of complete genome sequences of hundreds of more or less
related organisms (mainly prokaryotes) allows additional approaches leading to the assignment of
function to thus far unknown genes and proteins that are not possible with just a subset of a
genome. There are also a number of medically related databases such as OMIM (Online
Mendelian Inheritance in Man), which contains information regarding inherited and other diseases.
Furthermore, resources that correlate data regarding particular diseases are also being developed,
such as the Cancer Genome Anatomy Project, which is a database of known mutations in
genes arising in particular tissues.
4.2. Comparative Genomics
Methods requiring complete genome sequences include (1) mapping and alignments of en-
tire genomes of closely related organisms to identify clusters and functional units, (2) meta-
bolic pathway reconstruction to identify missing links and assign function (which is of special
biotechnological interest), and (3) comparison of entire genomes of closely related organisms
with a different phenotype. The latter technique is suitable for detecting genes that might con-
tribute to virulence toward humans or plants. It is, therefore, of special interest for medical
science, the pharmaceutical industry, and agriculture. By means of subtracting the entire genome
of a harmless bacterium (e.g., a Bacillus strain) from the genome of a closely related virulent
bacterium (e.g., Bacillus anthracis), genes that are not related to virulence are eliminated. This
procedure yields candidates that are probably responsible for the pathogenic phenotype of
B. anthracis (134,135) or might be suitable for use as vaccines (136). As a consequence,
these genes represent promising targets for specific drugs against the virulence system of
B. anthracis without affecting apathogenic strains. The described comparison can be per-
formed using predicted genes only; more accurate results are obtained comparing expression
profiles or applying proteomics.
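The genome subtraction described above amounts to a set difference once the genes of both organisms have been assigned to common gene families. A minimal Python sketch follows; it assumes that this assignment (for example, by sequence similarity searches) has already been made, and the gene and family identifiers are invented for illustration.

```python
def subtract_genomes(virulent: dict[str, str], harmless: dict[str, str]) -> list[str]:
    """Return genes of the virulent strain without a counterpart in the harmless one.

    Both arguments map a gene identifier to the identifier of its gene family.
    """
    harmless_families = set(harmless.values())
    return sorted(gene for gene, family in virulent.items()
                  if family not in harmless_families)


if __name__ == "__main__":
    virulent = {"v1": "famA", "v2": "famB", "v3": "famC"}
    harmless = {"h1": "famA", "h2": "famD"}
    # Genes v2 and v3 have no counterpart and remain as candidates.
    print(subtract_genomes(virulent, harmless))
```

The remaining genes are only candidates; as noted above, expression profiles or proteomic data are needed to narrow the list further.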
Comparisons of entire genomes reveal clusters that are conserved. Clusters of orthologous
groups of proteins (COGs) were delineated by comparing protein sequences encoded in 43
complete genomes, representing 30 major phylogenetic lineages. Each COG consists of indi-
vidual proteins or groups of paralogs from at least three lineages and, thus, corresponds to an
ancient conserved domain. The COG database (137,138) provides a phyletic pattern search
web page on which the user can define a filter; when applied to the COGs, this filter
returns the set of COGs that comply with the condition specified in the query.
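A phyletic pattern query of this kind can be pictured as a presence/absence filter over the lineages represented in each COG. The Python sketch below is one possible reading of such a filter, not the implementation used by the COG web page; the COG identifiers and lineage names are invented.

```python
def matches_pattern(present: set[str], required: set[str], forbidden: set[str]) -> bool:
    """True if all required lineages are present and no forbidden one is."""
    return required <= present and not (forbidden & present)


def filter_cogs(cogs: dict[str, set[str]], required, forbidden=frozenset()) -> list[str]:
    """Return the identifiers of all COGs whose phyletic pattern fits the query."""
    required, forbidden = set(required), set(forbidden)
    return sorted(cog for cog, lineages in cogs.items()
                  if matches_pattern(lineages, required, forbidden))


if __name__ == "__main__":
    cogs = {
        "COG_A": {"bacteria1", "archaea1", "eukaryote1"},
        "COG_B": {"bacteria1", "bacteria2", "archaea1"},
        "COG_C": {"bacteria2", "archaea1", "eukaryote1"},
    }
    print(filter_cogs(cogs, required={"bacteria1", "archaea1"},
                      forbidden={"eukaryote1"}))
```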
4.3. Pharmacogenomics
Pharmacogenomics is the study of how an individual’s genetic inheritance affects the body’s
response to drugs. The term comes from the words pharmacology and genomics and is, thus,
the intersection of pharmaceuticals and genetics. Pharmacogenomics holds the promise that
drugs might one day be tailor-made for individuals and adapted to each person’s own genetic
makeup. Environment, diet, age, lifestyle, and state of health can influence a person’s response
to medicines, but understanding an individual’s genetic makeup is thought to be the key to
creating personalized drugs with greater efficacy and safety. Pharmacogenomics combines
traditional pharmaceutical sciences such as biochemistry with annotated knowledge of genes,
proteins, and single-nucleotide polymorphisms.
The anticipated benefits of pharmacogenomics are as follows:
1. More powerful medicines (by a therapy more targeted to specific diseases).
2. Better, safer drugs (doctors will be able to analyze a patient’s genetic profile and prescribe the
best available drug therapy from the beginning).
3. More accurate methods of determining appropriate drug dosages (dosages based on weight and age
will be replaced with dosages based on a person’s genetics).
4. Advanced screening for disease (knowledge of a particular disease susceptibility will allow
careful monitoring, and treatments can be introduced at the most appropriate stage to maximize
their therapeutic effect).
5. Better vaccines (vaccines made of genetic material, either DNA or RNA, promise all the ben-
efits of existing vaccines without all the risks).
6. Improvements in the drug discovery and approval process (pharmaceutical companies will be
able to discover potential therapies more easily using genome targets).
7. Decrease in the overall cost of health care.
The explosion in both single-nucleotide polymorphism (SNP) and microarray data generated
from the Human Genome Project has necessitated the development of a means of cataloging
and annotating (briefly describing) these data so that scientists can more easily access and
use them for their research. Database repositories for both SNP (dbSNP) and microarray (GEO)
data are available at the NCBI. These databases include either descriptive information about
the data within the site itself (GEO) or links to NCBI and external information resources
(dbSNP). Access to these data and information resources allows scientists to more easily inter-
pret data that will be used not only to help determine drug response but to study disease suscep-
tibility and conduct basic research in population genetics.
4.3.1. SNP Databases
A key aspect of human genome research and pharmacogenomics is associating sequence
variations with heritable phenotypes. The most common variations are SNPs, which occur
approximately once every 100–300 bases (see Chapter 19). Because SNPs are expected to fa-
cilitate large-scale association genetics studies, there has recently been great interest in SNP
discovery and detection. The cSNP database specializing in human chromosome 21 is a joint
project between the Division of Medical Genetics of the University of Geneva Medical School
and the Swiss Institute of Bioinformatics, which offers BLAST and text searches to explore
their data. In collaboration with the National Human Genome Research Institute, the National
Center for Biotechnology Information has established the dbSNP database (139) to serve as a
central repository for both single-base nucleotide substitutions and short deletion and insertion
polymorphisms. Once discovered, these polymorphisms could be used by additional laboratories,
using the sequence information around the polymorphism and the specific experimental
conditions. The data in dbSNP are integrated with other NCBI genomic data and are
accessible by the same tools as other NCBI databases.
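The scale implied by a frequency of one SNP every 100–300 bases can be made explicit with a short calculation. The Python sketch below assumes a genome length of roughly 3 × 10^9 bases for the human genome; this figure and the function name are assumptions introduced for the example.

```python
def expected_snp_range(genome_length: int, min_spacing: int = 100,
                       max_spacing: int = 300) -> tuple[int, int]:
    """Return (low, high) estimates of the number of SNP positions."""
    low = genome_length // max_spacing   # one SNP every 300 bases
    high = genome_length // min_spacing  # one SNP every 100 bases
    return low, high


if __name__ == "__main__":
    # Assumed genome length of about three billion bases.
    print(expected_snp_range(3_000_000_000))
```

Under this assumption, the estimate is on the order of 10 to 30 million SNP positions, which indicates why a central repository and uniform access tools are needed.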
4.3.2. Expression Profiling
The Gene Expression Omnibus (GEO) is a gene expression and hybridization array data
repository, as well as a curated, online resource for gene expression data browsing, query, and
retrieval. GEO was the first fully public high-throughput gene expression data repository and
became operational in July 2000 (140). There are several ways to deposit and retrieve GEO
data. The search facilities “Gene profiles,” “Dataset,” and “Sequence BLAST” are powerful
and link to the well-known Entrez-Interface, including accession links to relevant genes and
References
1. Venter, J. C., et al. (2001) The sequence of the human genome. Science 291, 1304–1351.
2. Lander, E. S., et al. (2001) Initial sequencing and analysis of the human genome. Nature 409, 860–921.
3. Rehm, B. H. (2001) Bioinformatic tools for DNA/protein sequence analysis, functional assignment
of genes and protein classification. Appl. Microbiol. Biotechnol. 57, 579–592.
4. Ewing, B., Hillier, L., Wendl, M. C., and Green, P. (1998) Base-calling of automated sequencer
traces using phred. I. Accuracy assessment. Genome Res. 8, 175–185.
5. Ewing, B. and Green, P. (1998) Base-calling of automated sequencer traces using phred. II. Error
probabilities. Genome Res. 8, 186–194.
6. Huang, X. and Madan, A. (1999) CAP3: A DNA sequence assembly program. Genome Res. 9,
7. Gordon, D., Abajian, C., and Green, P. (1998) Consed: a graphical tool for sequence finishing.
Genome Res. 8, 195–202.
8. Staden, R. (1996) The Staden Sequence Analysis Package. Mol. Biotech. 5, 233–241.
9. Staden, R. (1984) Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res.
10. Claverie, J.-M. (1997) Computational methods for the identification of genes in vertebrate genomic
sequences. Hum. Mol. Genet. 6, 1735–1744.
11. Guigo, R. (1997) Computational gene identification: an open problem. Comput. Chem. 21, 215–222.
12. Krogh, A. (1998) In Computational Methods in Molecular Biology (Salzberg, S. L., Searls, D., and
Kasif, S., eds.), Elsevier, Amsterdam.
13. Krogh, A. (1998) In Guide to Human Genome Computing (Bishop, M. J., ed.), 2nd ed. Academic,
New York, pp. 261–274.
14. Delcher, A. L., Harmon, D., Kasif, S., White, O., and Salzberg, S. L. (1999) Improved microbial
gene identification with GLIMMER. Nucleic Acids Res. 27, 4636–4641.
15. Guigo, R., Agarwal, P., Abril, J. F., Burset, M., and Fickett, J. W. (2000) An assessment of gene
prediction accuracy in large DNA sequences. Genome Res. 10, 1631–1642.
16. Krogh, A. (2000) Using database matches with HMMGene for automated gene detection in
Drosophila. Genome Res. 10, 523–528.
17. Shibuya, T. and Rigoutsos, I. (2002) Dictionary-driven prokaryotic gene finding. Nucleic Acids Res.
18. Pedersen, J. S. and Hein, J. (2003) Gene finding with a hidden Markov model of genome structure
and evolution. Bioinformatics 19, 219–227.
19. Guo, F. B., Ou, H. Y., and Zhang, C. T. (2003) ZCURVE: a new system for recognizing protein-
coding genes in bacterial and archaeal genomes. Nucleic Acids Res. 31, 1780–1789.
20. Larsen, T. S. and Krogh, A. (2003) EasyGene—a prokaryotic gene finder that ranks ORFs by statistical
significance. BMC Bioinformat. 4, 21.
21. Gelfand, M. S. (1995) Prediction of function in DNA sequence analysis. J. Comput. Biol. 2, 87–115.
22. Sherriff, A. and Ott, J. (2001) Applications of neural networks for gene finding. Adv. Genet. 42,
23. Fickett, J. W. (1996) Finding genes by computer: the state of the art. Trends Genet. 12, 316–320.
24. Zhang, C. T., Wang, J., and Zhang, R. (2002) Using a Euclid distance discriminant method to find
protein coding genes in the yeast genome. Comput. Chem. 26, 195–206.
25. Bajic, V. B. and Seah, S. H. (2003) Dragon gene start finder: an advanced system for finding
approximate locations of the start of gene transcriptional units. Genome Res. 13, 1923–1929.
26. Zhang, M. Q. (1998) Statistical features of human exons and their flanking regions. Hum. Mol.
Genet. 7, 919–932.
27. Searls, D. B. (1992) The linguistics of DNA. Am. Sci. 80, 579–591.
28. Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998) Biological Sequence Analysis: Probabi-
listic Models of Proteins and Nucleic acids. Cambridge University Press, Cambridge.
29. Krogh, A., Mian, I. S., and Haussler, D. (1994) A hidden Markov model that finds genes in E. coli
DNA. Nucleic Acids Res. 22, 4768–4778.
30. Cole, S.T., Brosch, R., Parkhill, J., et al. (1998) Deciphering the biology of Mycobacterium tubercu-
losis from the complete genome sequence. Nature 393, 537–544.
31. Thomas, A. and Skolnick, M. (1994) A probabilistic model for detecting coding regions in DNA
sequences. IMA J. Math. Appl. Med. Biol. 11, 149–160.
32. Henderson, J., Salzberg, S., and Fasman, K. (1997) Finding genes in DNA with a hidden Markov
model. J. Comput. Biol. 4, 127–141.
33. Lukashin, A. V. and Borodovsky, M. (1998) GeneMark hmm: new solutions for gene finding.
Nucleic Acids Res. 26, 1107–1115.
34. Salzberg, S. L., Pertea, M., Delcher, A. L., Gardner, M. J., and Tettelin, H. (1999) Interpolated
Markov models for eukaryotic gene finding. Genomics 59, 24–31.
35. Badger, J. H. and Olsen, G. J. (1999) CRITICA: coding region identification tool invoking com-
parative analysis. Mol. Biol. Evol. 16, 512–524.
36. Bocs, S., Cruveiller, S., Vallenet, D., Nuel, G., and Medigue, C. (2003) AMIGene: annotation of
microbial genes. Nucleic Acids Res. 31, 3723–3726.
37. Besemer, J., Lomsadze, A., and Borodovsky, M. (2001) GeneMarkS: a self-training method for
prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regula-
tory regions. Nucleic Acids Res. 29, 2607–2618.
38. Yeramian, E. and Jones, L. (2003) GeneFizz: a web tool to compare genetic (coding/non-coding)
and physical (helix/coil) segmentations of DNA sequences. Gene discovery and evolutionary per-
spectives. Nucleic Acids Res. 31, 3843–3849.
39. Kotlar, D. and Lavner, Y. (2003) Gene prediction by spectral rotation measure: a new method for
identifying protein-coding regions. Genome Res. 13, 1930–1937.
40. Snyder, E. and Stormo, G. (1995) Identification of protein coding regions in genomic DNA. J. Mol.
Biol. 248, 1–18.
41. Reese, M. G., Eeckman, F. H., Kulp, D., and Haussler, D. (1997) Improved splice site detection in
Genie. J. Comput. Biol. 4, 311–323.
42. Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J.
Mol. Biol. 268, 78–94.
43. Xu, Y. and Überbacher, E. C. (1997) Automated gene identification in large-scale genomic se-
quences. J. Comput. Biol. 4, 325–338.
44. Gelfand, M. S., Mironov, A. A., and Pevzner, P. A. (1996) Gene recognition via spliced sequence
alignment. Proc. Natl. Acad. Sci. USA 93, 9061–9066.
45. Foissac, S., Bardou, P., Moisan, A., Cros, M. J., and Schiex, T. (2003) EUGENE’HOM: a generic
similarity-based gene finder using multiple homologous sequences. Nucleic Acids Res. 31, 3742–3745.
46. Smith, T. F. and Waterman, M. S. (1981) Identification of common molecular subsequences. J. Mol.
Biol. 147, 195–197.
47. Yada, T., Takagi, T., Totoki, Y., Sakaki, Y., and Takaeda, Y. (2003) DIGIT: a novel gene finding
program by combining gene-finders. Pac. Symp. Biocomput. 2003, 375–387.
48. Quandt, K., Frech, K., Karas, H., Wingender, E., and Werner, T. (1995) MatInd and MatInspector—
new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic
Acids Res. 23, 4878–4884.
49. Prestridge, D. S. (1991) SIGNAL SCAN: a computer program that scans DNA sequences for eu-
karyotic transcriptional elements. CABIOS 7, 203–206.
50. Wingender, E., Chen, X., Hehl, R., et al. (2000) TRANSFAC: an integrated system for gene expres-
sion regulation. Nucleic Acids Res. 28, 316–319.
51. Prestridge, D. S. (1995) Predicting Pol II Promoter Sequences Using Transcription Factor Binding
Sites. J. Mol. Biol. 249, 923–932.
52. Eddy, S. R. (1996) Hidden Markov models. Curr. Opin. Struct. Biol. 6, 361–365.
53. Eddy, S. R. (1998) Profile hidden Markov models. Bioinformatics 14, 755–763.
54. Baldi, P. and Brunak, S. (1998) Bioinformatics: The Machine Learning Approach. MIT Press, Bos-
55. Korenberg, M. J., David, R., Hunter, I. W., and Solomon, J. E. (2000) Automatic classification of
protein sequences into structure/function groups via parallel cascade identification: a feasibility
study. Ann. Biomed. Eng. 28, 803–811.
56. Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994) CLUSTAL W: improving the sensitivity
of progressive multiple sequence alignment through sequence weighting, position-specific gap pen-
alties and weight matrix choice. Nucleic Acids Res. 22, 4673–4680.
57. Thompson, J. D., Gibson, T. J., Plewniak, F., Jeanmougin, F., and Higgins, D. G. (1997) The
CLUSTAL X windows interface: flexible strategies for multiple sequence alignment aided by qual-
ity analysis tools. Nucleic Acids Res. 25, 4876–4882.
58. Nicholas, K. B., Nicholas, H. B., Jr., and Deerfield, D. W., II. (1997) GeneDoc: analysis and visual-
ization of genetic variation. EMBNEW.NEWS 4, 14.
59. Lake, J. A. (1994) Reconstructing evolutionary trees from DNA and protein sequences: paralinear
distances. Proc. Natl. Acad. Sci. USA 91, 1451–1459.
60. Lockhart, P. J., Steel, M. A., Hendy, M. D., and Penny, D. (1994) Recovering evolutionary trees
under a more realistic model of sequence. Mol. Biol. Evol. 11, 605–612.
61. Brocchieri, L. (2001) Phylogenetic inferences from molecular sequences: review and critique. Theor.
Popul. Biol. 59, 27–40.
62. Stewart, C.-B. (1993) The powers and pitfalls of parsimony. Nature 361, 603–607.
63. Attwood, T. K., Beck, M. E., Flower, D. R., Scordis, P., and Selley, J. N. (1998) The PRINTS
protein fingerprint database in its fifth year. Nucleic Acids Res. 26, 304–308.
64. Page, R. D. (1996) TreeView: an application to display phylogenetic trees on personal computers.
Comput. Appl. Biosci. 12, 357–358.
65. Gasteiger, E., Gattiker, A., Hoogland, C., Ivanyi, I., Appel, R. D., and Bairoch, A. (2003) ExPASy: the
proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res. 31, 3784–3788.
66. Rost, B. (1996) PHD: predicting one-dimensional protein structure by profile based neural net-
works. Methods Enzymol. 266, 525–539.
67. Eyrich, V. A. and Rost, B. (2003) META-PP: single interface to crucial prediction servers. Nucleic
Acids Res. 31, 3308–3310.
68. Nielsen, H., Engelbrecht, J., Brunak, S., and von Heijne, G. (1997) Identification of prokaryotic and
eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng. 10, 1–6.
69. Hansen, J. E., Lund, O., Tolstrup, N., Gooley, A. A., Williams, K. L., and Brunak, S. (1998)
NetOglyc: Prediction of mucin type O-glycosylation sites based on sequence context and surface
accessibility. Glycoconjugate J. 15, 115–130.
70. Hansen, J. E., Lund, O., Rapacki, K., and Brunak, S. (1997) O-glycbase version 2.0 - A revised
database of O-glycosylated proteins. Nucleic Acids Res. 25, 278–282.
71. Hansen, J. E., Lund, O., Rapacki, K., et al. (1995) Prediction of O-glycosylation of mammalian
proteins: specificity patterns of UDP-GalNAc:-polypeptide N-acetylgalactosaminyltransferase.
Biochem. J. 308, 801–813.
72. Blom, N., Gammeltoft, S., and Brunak, S. (1999) Sequence- and structure-based prediction of eu-
karyotic protein phosphorylation sites. J. Mol. Biol. 294, 1351–1362.
73. Blom, N., Hansen, J., Blaas, D., and Brunak, S. (1996) Cleavage site analysis in picornaviral
polyproteins: Discovering cellular targets by neural networks. Protein Sci. 5, 2203–2216.
74. Emanuelsson, O., Nielsen, H., and von Heijne, G. (1999) ChloroP, a neural network-based method
for predicting chloroplast transit peptides and their cleavage sites. Protein Sci. 8, 978–984.
75. Cuff, J. A. and Barton, G. J. (1999) Evaluation and improvement of multiple sequence methods for
protein secondary structure prediction. Proteins 34, 508–519.
76. Sonnhammer, E. L. L., von Heijne, G., and Krogh, A. (1998) A hidden Markov model for predicting
transmembrane helices in protein sequences. In Proceedings of the Sixth International Conference on
Intelligent Systems for Molecular Biology (ISMB98), pp. 175–182.
77. von Heijne, G. (1992) Membrane protein structure prediction, hydrophobicity analysis and the posi-
tive-inside rule. J. Mol. Biol. 225, 487–494.
78. Karplus, K., Barrett, C., and Hughey, R. (1998) Hidden Markov models for detecting remote
protein homologies. Bioinformatics 14, 846–856.
79. Cserzo, M., Wallin, E., Simon, I., von Heijne, G., and Elofsson, A. (1997) Prediction of
transmembrane alpha-helices in prokaryotic membrane proteins: the dense alignment surface method.
Protein Eng. 10, 673–676.
80. Fischer, D. and Eisenberg, D. A. (1996) Fold recognition using sequence-derived properties. Pro-
tein Sci. 5, 947–955.
81. Elofsson, A., Fischer, D., Rice, D. W., LeGrand, S., and Eisenberg, D. A. (1996) Study of com-
bined structure-sequence profiles. Folding Design 1, 451–461.
82. Karplus, K., Karchin, R., Draper, J., et al. (2003) Combining local-structure, fold-recognition, and
new-fold methods for protein structure prediction. Proteins 53(Suppl 6), 491–496.
83. Peitsch, M. C. (1995) Protein modelling by E-mail. BioTechnology 13, 658–660.
84. Peitsch, M. C. (1996) ProMod and Swiss-Model: internet-based tools for automated comparative
protein modelling. Biochem. Soc. Trans. 24, 274–279.
85. Guex, N. and Peitsch, M. C. (1997) SWISS-MODEL and the Swiss-PdbViewer: an environment
for comparative protein modelling. Electrophoresis 18, 2714–2723.
86. Lund, O., Frimand, K., Gorodkin, J., et al. (1997) Protein distance constraints predicted by neural
networks and probability density functions. Protein Eng. 10, 1241–1248.
87. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic local alignment
search tool. J. Mol. Biol. 215, 403–410.
88. Altschul, S. F. (1991) Amino acid substitution matrices from an information theoretic perspective.
J. Mol. Biol. 219, 555–565.
89. Altschul, S. F. and Gish, W. (1996) Local alignment statistics. Methods Enzymol. 266, 460–480.
90. Rost, B., Schneider, R., and Sander, C. (1997) Protein fold recognition by prediction-based thread-
ing. J. Mol. Biol. 270, 471–480.
91. Dayhoff, M. O., Barker, W. C., and Hunt, L. T. (1983) Establishing homologies in protein se-
quences. Methods Enzymol. 91, 524–545.
92. Henikoff, S. and Henikoff, J. G. (1992) Amino acid substitution matrices from protein blocks.
Proc. Natl. Acad. Sci. USA 89, 10,915–10,919.
93. Pearson, W. R. (1995) Comparison of methods for searching protein sequence databases. Protein
Sci. 4, 1145–1160.
94. Karlin, S. and Altschul, S. F. (1990) Methods for assessing the statistical significance of molecular
sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87, 2264–2268.
95. Wootton, J. C. (1994) Non-globular domains in protein sequences: automated segmentation using
complexity measures. Comput. Chem. 18, 269–285.
96. Altschul, S. F., Madden, T. L., Schäffer, A. A., et al. (1997) Gapped BLAST and PSI-BLAST: a
new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402.
97. Pearson, W. R. and Lipman, D. J. (1988) Improved tools for biological sequence comparison. Proc.
Natl. Acad. Sci. USA 85, 2444–2448.
98. Martin, A. C., Orengo, C. A., Hutchinson, E. G., et al. (1998) Protein folds and functions. Structure
99. McGuffin, L. J., Bryson, K., and Jones, D. T. (2001) What are the baselines for protein fold recog-
nition? Bioinformatics 17, 63–72.
100. Bairoch, A. (1991) PROSITE: a dictionary of sites and patterns in proteins. Nucleic Acids Res. 19,
101. Bairoch, A., Bucher, P., and Hofmann, K. (1997) The PROSITE database, its status in 1997. Nucleic
Acids Res. 25, 217–221.
102. Bucher, P., Karplus, K., Moeri, N., and Hofmann, K. (1996) A flexible motif search technique
based on generalized profiles. Comput. Chem. 20, 3–23.
103. Sonnhammer, E. L. and Kahn, D. (1994) Modular arrangement of proteins as inferred from analy-
sis of homology. Protein Sci. 3, 482–492.
104. Corpet, F., Gouzy, J., and Kahn, D. (1998) The ProDom database of protein domain families.
Nucleic Acids Res. 26, 323–326.
105. Sonnhammer, E. L., Eddy, S. R., and Durbin, R. (1997) Pfam: a comprehensive database of protein
domain families based on seed alignments. Proteins 28, 405–420.
106. Bateman, A., Birney, E., Cerruti, L., et al. (2002) The Pfam protein families database. Nucleic
Acids Res. 30, 276–280.
107. Apweiler, R., Attwood, T. K., Bairoch, A., et al. (2001) The InterPro database, an integrated docu-
mentation resource for protein families, domains and functional sites. Nucleic Acids Res. 29, 37–40.
108. Mulder, N. J., Apweiler, R., Attwood, T. K., et al. (2003) The InterPro Database, 2003 brings
increased coverage and new features. Nucleic Acids Res. 31, 315–318.
109. Rawlings, N. D., O’Brien, E., and Barrett, A. J. (2002) MEROPS: the protease database. Nucleic
Acids Res. 30, 343–346.
110. Storm, C. E. and Sonnhammer, E. L. (2001) NIFAS: visual analysis of domain evolution in pro-
teins. Bioinformatics 17, 343–348.
111. Schultz, J., Milpetz, F., Bork, P., and Ponting, C. P. (1998) SMART, a simple modular architecture
research tool: identification of signaling domains. Proc. Natl. Acad. Sci. USA 95, 5857–5864.
112. Schultz, J., Copley, R. R., Doerks, T., Ponting, C. P., and Bork, P. (2000) SMART: a web-based
tool for the study of genetically mobile domains. Nucleic Acids Res. 28, 231–234.
113. Letunic, I., Goodstadt, L., Dickens, N. J., et al. (2002) Recent improvements to the SMART
domain-based sequence annotation resource. Nucleic Acids Res. 30, 242–244.
114. Pietrokovski, S., Henikoff, J. G., and Henikoff, S. (1996) The Blocks database—a system for
protein classification. Nucleic Acids Res. 24, 197–200.
115. Attwood, T. K., Flower, D. R., Lewis, A. P., et al. (1999) PRINTS prepares for the new millen-
nium. Nucleic Acids Res. 27, 220–225.
116. Silverstein, K. A., Shoop, E., Johnson, J. E., and Retzel, E. F. (2001) MetaFam: a unified classifi-
cation of protein families. I. Overview and statistics. Bioinformatics 17, 249–261.
117. Yuan, Y. P., Eulenstein, O., Vingron, M., and Bork, P. (1998) Towards detection of orthologues in
sequence databases. Bioinformatics 14, 285–289.
118. Bernstein, F. C., Koetzle, T. F., Williams, G. J., et al. (1977) The Protein Data Bank. A computer-
based archival file for macromolecular structures. Eur. J. Biochem. 80, 319–324.
119. Berman, H. M., Westbrook, J., Feng, Z., et al. (2000) The Protein Data Bank. Nucleic Acids Res.
120. Murzin, A. G., Brenner, S. E., Hubbard, T., and Chothia, C. (1995) SCOP: a structural classification
of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540.
121. Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B., and Thornton, J. M. (1997)
CATH—a hierarchic classification of protein domain structures. Structure 5, 1093–1108.
122. Pearl, F. M. G., Lee, D., Bray, J. E., Sillitoe, I., Todd, A. E., Harrison, A. P., Thornton, J. M., and
Orengo, C. A. (2000) Assigning genomic sequences to CATH. Nucleic Acids Res. 28, 277–282.
123. Peitsch, M. C. and Jongeneel, V. (1993) A 3-dimensional model for the CD40 ligand predicts that
it is a compact trimer similar to the tumor necrosis factors. Int. Immunol. 5, 233–238.
124. Schwede, T., Kopp, J., Guex, N., and Peitsch, M. C. (2003) SWISS-MODEL: an automated protein
homology-modeling server. Nucleic Acids Res. 31, 3381–3385.
125. Guex, N. and Peitsch, M. C. (1997) SWISS-MODEL and the Swiss-PdbViewer: an environment
for comparative protein modeling. Electrophoresis 18, 2714–2723.
126. Combet, C., Jambon, M., Deleage, G., and Geourjon, C. (2002) Geno3D: automatic comparative
molecular modelling of protein. Bioinformatics 18, 213–214.
127. Lambert, C., Leonard, N., De Bolle, X., and Depiereux, E. (2002) ESyPred3D: prediction of pro-
teins 3D structures. Bioinformatics 18, 1250–1256.
128. Bader, G. D., Betel, D., and Hogue, C. W. (2003) BIND: the Biomolecular Interaction Network
Database. Nucleic Acids Res. 31, 248–250.
129. Xenarios, I., Rice, D. W., Salwinski, L., Baron, M. K., Marcotte, E. M., and Eisenberg, D. (2000)
DIP: The Database of Interacting Proteins. Nucleic Acids Res. 28, 289–291.
130. Levinthal, C., Wodak, S. J., Kahn, P., and Dadivanian, A. K. (1975) Hemoglobin interaction in
sickle cell fibers. I. Theoretical approaches to the molecular contacts. Proc. Natl. Acad. Sci. USA
131. Wodak, S. J. and Janin, J. (1978) Computer analysis of protein-protein interaction. J. Mol. Biol.
132. Janin, J., Henrick, K., Moult, J., et al. (2003) CAPRI: a Critical Assessment of PRedicted Interac-
tions. Proteins 52, 2–9.
133. Taylor, R. D., Jewsbury, P. J., and Essex, J. W. (2002) A review of protein-small molecule docking
methods. J. Comput. Aided Mol. Des. 16, 151–166.
134. Read, T. D., Peterson, S. N., Tourasse, N., et al. (2003) The genome sequence of Bacillus anthracis
Ames and comparison to closely related bacteria. Nature 423, 81–86.
135. Ivanova, N., Sorokin, A., Anderson, I., et al. (2003) Genome sequence of Bacillus cereus and
comparative analysis with Bacillus anthracis. Nature 423, 87–91.
136. Smith, D. R. (1996) Microbial pathogen genomes - new strategies for identifying therapeutics and
vaccine targets. Trends Biotechnol. 14, 290–293.
137. Tatusov, R. L., Koonin, E. V., and Lipman, D. J. (1997) A genomic perspective on protein families.
Science 278, 631–637.
138. Tatusov, R. L., Natale, D. A., Garkavtsev, I. V., et al. (2001) The COG database: new develop-
ments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res. 29,
139. Wheeler, D. L., Church, D. M., Federhen, S., et al. (2003) Database resources of the National
Center for Biotechnology. Nucleic Acids Res. 31, 28–33.
140. Edgar, R., Domrachev, M., and Lash, A. E. (2002) Gene Expression Omnibus: NCBI gene
expression and hybridization array data repository. Nucleic Acids Res. 30, 207–210.
141. Rehm, B. H. A. and Reinecke, F. (2004) Evaluation of proteomic techniques: applications and
potential. Curr. Proteomics 1, 103–111.