METHODOLOGY ARTICLE Open Access
A systematic study of genome context methods:
calibration, normalization and combination
Luciana Ferrer, Joseph M Dale, Peter D Karp*
Background: Genome context methods have been introduced in the last decade as automatic methods to predict
functional relatedness between genes in a target genome using the patterns of existence and relative locations of
the homologs of those genes in a set of reference genomes. Much work has been done in the application of
these methods to different bioinformatics tasks, but few papers present a systematic study of the methods and
their combination necessary for their optimal use.
Results: We present a thorough study of the four main families of genome context methods found in the
literature: phylogenetic profile, gene fusion, gene cluster, and gene neighbor. We find that for most organisms the
gene neighbor method outperforms the phylogenetic profile method by as much as 40% in sensitivity, being
competitive with the gene cluster method at low sensitivities. Gene fusion is generally the worst performing of the
four methods. A thorough exploration of the parameter space for each method is performed and results across
different target organisms are presented.
We propose the use of normalization procedures such as those used on microarray data for the genome context scores.
We show that substantial gains can be achieved from the use of a simple normalization technique. In particular,
the sensitivity of the phylogenetic profile method is improved by around 25% after normalization, resulting, to our
knowledge, in the best-performing phylogenetic profile system in the literature.
Finally, we show results from combining the various genome context methods into a single score. When using a
cross-validation procedure to train the combiners, with both original and normalized scores as input, a decision
tree combiner results in gains of up to 20% with respect to the gene neighbor method. Overall, this represents a
gain of around 15% over what can be considered the state of the art in this area: the four original genome context
methods combined using a procedure like that used in the STRING database. Unfortunately, we find that these
gains disappear when the combiner is trained only with organisms that are phylogenetically distant from the target organism.
Conclusions: Our experiments indicate that gene neighbor is the best individual genome context method and
that gains from the combination of individual methods are very sensitive to the training data used to obtain the
combiner’s parameters. If adequate training data is not available, using the gene neighbor score by itself instead of
a combined score might be the best choice.
In recent years, large-scale genome sequencing has
resulted in a steep growth in the number of fully
sequenced genomes. Part of the sequencing effort is to
automatically annotate the genome with structural infor-
mation (for example, location of open reading frames
and coding regions) and functional information. Much
of this annotation process relies on finding homologs of
the target genes in other annotated genomes. The target
gene often inherits the function of its homologous
sequences, when available. Using this method, genes
that do not have an annotated homologous sequence in
any other genome cannot be assigned a function.
* Correspondence: email@example.com
Artificial Intelligence Center, SRI International, Menlo Park, California, USA
Ferrer et al. BMC Bioinformatics 2010, 11:493
© 2010 Ferrer et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
Genome Context Methods
Genome context analysis denotes a family of techniques
used to infer functional relationships between genes
using a comparative analysis approach that allows for
the inference of function across genes that may not
share sequence similarity. These techniques are based
on assumptions drawn from knowledge about evolution-
ary processes. For example, the phylogenetic profile
method  uses the patterns of occurrence of a gene
across a set of genomes. Two genes with similar occur-
rence patterns are likely to be functionally related. The
assumption is that organisms are under evolutionary
pressure to encode either both genes or neither gene if
the genes are related. Other genome context techniques
use evidence such as protein fusions [2-4], proximity of
genes within the genome [4,5], and proximity of homo-
logous sequences of the genes across a list of reference
genomes . These methods will be explained in detail
in Section 2.
Many other sources of information have been used for
functional prediction (see, for example, [6,7]), including
mRNA co-expression data, MIPS functional similarity, GO
functional similarity, co-essentiality, and co-regulation. All
these extra sources of information rely on data other than
that available in the annotated genome sequence. In this
paper, we limit our study to features that can be extracted
automatically from the annotated genome.
Normalization of Genome Context Scores
Genome context methods generate numeric values, or
scores, for pairs of genes. These scores are assumed to be
correlated with the probability that the two genes are
functionally related. Unfortunately, scores, in most cases,
indirectly capture other characteristics of the two genes
involved. For example, measures of similarity between
two phylogenetic profiles might be affected by how
frequent the genes are in the list of genomes used to
compute the profiles. This bias in the scores can degrade
the performance of genome context methods. Score bias
is, in fact, common across
many statistical processing problems. A well-known
example in bioinformatics of data that suffers from this problem
is microarray data. In this case, measurements
coming from different microarrays are affected not just
by the differences we intend to study, but also by differ-
ences in the scanner or in the production of the array.
Normalization procedures are designed to compensate
for this problem [8,9]. In this paper, we adapt two nor-
malization procedures that have been used for the micro-
array problem to the problem of estimating functional
relations from genome context scores. We demonstrate
varying degrees of relative gain in sensitivity of as much
as 40% depending on the type of score, at operating
points corresponding to high specificity.
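To make the idea of per-gene score normalization concrete, the following minimal sketch applies a z-score transform to a set of pairwise scores. It is an illustration only: the function name `znorm_scores`, the dictionary layout, and the choice to average the two per-gene z-scores of a pair are assumptions made for this example and are not taken from the procedures evaluated in this paper.

```python
from statistics import mean, pstdev


def znorm_scores(scores):
    """Z-normalize pairwise scores gene by gene (illustrative sketch).

    `scores` maps a tuple (gene_a, gene_b) of two distinct genes to a raw score.
    Returns a dict with the same keys and normalized scores.
    """
    # Collect, for every gene, the raw scores of all pairs it takes part in.
    per_gene = {}
    for pair, s in scores.items():
        for g in pair:
            per_gene.setdefault(g, []).append(s)

    # Per-gene mean and (population) standard deviation of those scores.
    stats = {g: (mean(v), pstdev(v)) for g, v in per_gene.items()}

    def z(s, g):
        mu, sigma = stats[g]
        # Assumption: a gene with constant scores contributes a neutral 0.
        return (s - mu) / sigma if sigma > 0 else 0.0

    # Assumption: the pair score is the average of the two per-gene z-scores.
    return {(a, b): 0.5 * (z(s, a) + z(s, b)) for (a, b), s in scores.items()}
```

With this transform, a raw score is judged relative to the other scores obtained by the same genes, so a gene whose scores are uniformly high no longer dominates the ranking of pairs.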
Combination of Genome Context Scores
The different genome context scores implemented in
this paper capture somewhat different information
about the samples. Hence, one would expect that com-
bining these scores into a single score should lead to a
score that is better than any of the individual ones.
Combination of genome context scores has been
explored in many papers [4,7,10-15]. Two related meth-
ods are used in the STRING database [14,16] and the
Prolinks database . In both cases, scores are first indi-
vidually transformed into confidence measures using a
labeled training set. The resulting confidence measures
are then combined into a single measure by picking the
maximum  or using a simple product expression .
The same product expression is used in , but in this
case the individual scores are first weighted by a factor
that depends on the performance of the method. In all
these cases, the authors present the results of the com-
bination on the same data used to train the transform
into confidence measures or find the weights for the dif-
ferent methods. That is, the parameters of the combina-
tion are the optimal parameters on the data where
results are being reported. This results in an optimistic
prediction of the gains that can be achieved from com-
bination. It is not clear from these experiments whether
such results would generalize to unseen data.
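The two-step scheme described above (calibrate each score into a confidence, then merge the confidences) can be sketched as follows. This is a hedged illustration: the binned calibration, the function names, and the specific product form 1 - prod(1 - c_i), which treats the confidences as independent evidence, are assumptions chosen for the example and do not reproduce the exact procedure of any of the cited databases.

```python
from bisect import bisect_right
from math import prod


def fit_binned_confidence(scores, labels, edges):
    """Learn a score -> confidence map from labeled training pairs.

    `edges` is a sorted list of bin boundaries (hypothetical choice); the
    confidence of a bin is the fraction of positive labels (1) that fall in it.
    """
    pos = [0] * (len(edges) + 1)
    tot = [0] * (len(edges) + 1)
    for s, y in zip(scores, labels):
        b = bisect_right(edges, s)
        tot[b] += 1
        pos[b] += y
    conf = [p / t if t else 0.0 for p, t in zip(pos, tot)]
    return lambda s: conf[bisect_right(edges, s)]


def combine_max(confidences):
    """Combine by keeping the most confident individual method."""
    return max(confidences)


def combine_product(confidences):
    """Combine as 1 - prod(1 - c_i), assuming independent evidence."""
    return 1.0 - prod(1.0 - c for c in confidences)
```

Note that if `fit_binned_confidence` is fitted on the same pairs on which the combined score is later evaluated, the reported gains correspond to the optimistic train-on-test setting discussed in the text.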
More complex combination procedures have also been
proposed. In  a support vector machine is trained to
combine three different genome context method scores,
showing that the combination outperforms the individual
methods (both sensitivity and specificity are improved)
when the combiner is trained with cross-validation
on Escherichia coli gene pairs. That is, gene pairs are
randomly split into sets and each set is classified using
the combiner trained on all remaining sets. Combination
results including many other information sources apart
from genome context methods (for example, mRNA
co-expression data and GO functional similarity) are pre-
sented in . In this case, the combiner is also trained
using cross-validation, although the dataset includes gene
pairs from the complete MIPS catalog, not from a single
organism. Results show a modest gain in performance
from combination, although none of the scores that are
used in the final combination correspond to genome context methods.
Some of the papers mentioned above (for example,
[10-12]) present the performance of the systems after a
hard decision on the label of each sample has been made.
Sometimes this is done because the system directly out-
puts a binary decision. At other times, the hard labels are
obtained by thresholding of a continuous score with
some predetermined threshold. In either case, comparing
systems that make hard decisions is not straightforward,
since, generally, systems end up at different operating
points. If neither the false positive nor false negative rate
of two systems is the same, there is no direct way to com-
pare them, unless one of the systems has a strictly smaller
value for both measures, in which case it can be declared
better than the other (this is the case for some results in
the papers cited above). When that is not the case, a per-
formance measure that combines the two types of errors
is generally used to compare the systems. Choosing such
a performance measure implies a rather arbitrary deci-
sion on the costs of the different types of errors.
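The difficulty can be illustrated with a small sketch. Two systems are represented by their (false positive rate, false negative rate) operating points; one is declared better only if it is strictly smaller on both. Otherwise the ranking depends on the error costs. The function names, the example operating points, and the cost values are hypothetical and serve only to illustrate the argument.

```python
def dominant_system(a, b):
    """Return "a" or "b" if one operating point is strictly better on both
    error rates, or None if the two systems are not directly comparable.

    Each argument is a tuple (false_positive_rate, false_negative_rate).
    """
    if a[0] < b[0] and a[1] < b[1]:
        return "a"
    if b[0] < a[0] and b[1] < a[1]:
        return "b"
    return None


def expected_cost(point, cost_fp, cost_fn, prior_pos):
    """Expected cost of an operating point for given error costs and prior."""
    fpr, fnr = point
    return cost_fp * fpr * (1.0 - prior_pos) + cost_fn * fnr * prior_pos


# Hypothetical operating points: neither system dominates the other.
system_a = (0.01, 0.40)
system_b = (0.05, 0.20)
```

With these points, `system_a` has the lower expected cost when false positives are ten times as costly as false negatives, while `system_b` wins when both errors cost the same, so the conclusion depends entirely on the chosen costs.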
Hence, in our judgment, the literature has not yet
presented strong enough evidence of the degree of
improvement that can be obtained from combination
procedures for genome context scores. Furthermore, to our
knowledge, no attempt has yet been made at training
combination procedures for genome context methods on
certain organisms to apply them on other organisms not
included in the training set. This is, arguably, the most
realistic scenario: the goal is to assign functional
relatedness labels to a new organism for which little or no
manual curation has yet been done and for which no
related strain has been curated either. In such a case,
cross-validation or train-on-test results would not be
applicable. In this paper, we present comparative results of
combination performance when training the combiner on
organisms that are phylogenetically distant from the test
organism and when using cross-validation on the test organism.
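The two training protocols compared in this paper differ only in where the training gene pairs come from. The sketch below contrasts them; the function names and the data layout (a dictionary from organism name to a list of gene pairs) are assumptions made for illustration, and the gene identifiers in the usage note are placeholders.

```python
import random


def kfold_pair_splits(pairs, k, seed=0):
    """Cross-validation on one organism: gene pairs are randomly split into
    k sets and each set is tested with a combiner trained on the others."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test


def cross_organism_split(pairs_by_organism, target, training_organisms):
    """Train on other organisms only; test on every pair of the target."""
    if target in training_organisms:
        raise ValueError("the target organism must not be used for training")
    train = [p for org in training_organisms for p in pairs_by_organism[org]]
    test = list(pairs_by_organism[target])
    return train, test
```

For example, `cross_organism_split(data, "ECK12", ["CAULO", "HPY"])` would return training pairs drawn only from CAULO and HPY and test pairs drawn only from ECK12, whereas `kfold_pair_splits(data["ECK12"], 5)` trains and tests on disjoint subsets of ECK12 pairs.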
Parameter Tuning for Genome Context Methods
In addition to introducing the normalization procedure
for genome context scores, and showing several combi-
nation results, we present a large set of experimental
results exploring the parameter space of the different
methods. Some experimental studies on the optimal set-
tings of the parameters of the genome context methods
have been presented in the literature. Sun et al. 
explore the performance of the phylogenetic profile
method when the distance metric is given by mutual
information for different values of the BLAST E-value
threshold and different composition of the reference list.
These same parameters are explored in , along with
the metric used to compute the distance between phylo-
genetic profiles. Cokus et al.  also explore different
metrics, some of them rather sophisticated, and propose
a new one that outperforms all other metrics tried in
their paper. To our knowledge, optimization studies of
this type have not been performed for genome context
methods other than phylogenetic profile.
In this paper, we thoroughly explore the parameter
space of each method, including the percent similarity
required in the gene fusion method and the E-value
threshold used to infer homology. Furthermore, six differ-
ent metrics of phylogenetic profile similarity are explored,
including the metric proposed by Cokus et al. . We
also include a study of the effect of the size and composi-
tion of the list of reference genomes on all methods and a
comparison of results of the different methods across
different organisms. We show most results in terms of
receiver operating characteristic (ROC) curves, which
allow us to compare systems without committing to a
certain set of costs for the different types of errors.
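As a reminder of how such curves are obtained, the following sketch computes the ROC points of a scored, labeled set of gene pairs by sweeping the decision threshold from the highest score to the lowest. The function names are illustrative; labels are assumed to be 1 for functionally related pairs and 0 otherwise, with at least one pair of each class.

```python
def roc_points(scores, labels):
    """Return the ROC curve as a list of (false positive rate, sensitivity)
    points, one per distinct threshold, starting at (0.0, 0.0)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Accept every pair whose score ties with the current threshold.
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def sensitivity_at_fpr(points, max_fpr):
    """Best sensitivity reachable without exceeding a false positive rate."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)
```

Comparing two systems with `sensitivity_at_fpr` at the same false positive rate places them at the same operating point, which is what makes relative gains in sensitivity meaningful.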
Summary of Contributions
The contributions of this paper are (1) a normalization
procedure for the genome context scores that improves
performance of the mutual information phylogenetic
profile method by around 25%, resulting, to our knowledge,
in the best-performing phylogenetic profile
method to date; (2) a thorough exploration of the effect
on performance of the parameters of the different genome
context methods; (3) a comparison of performance
of the different methods on a set of bacterial organisms,
from which we observe a variation of a factor of 2 in
the sensitivity achieved by the different methods across
organisms, and a rather consistent ranking of the differ-
ent methods with the gene neighbor methods giving
substantially better performance than any phylogenetic
profile method for most organisms; and (4) a study of
the effect that the training data has on the performance
of the combination methods for the genome context
scores, resulting in the very important conclusion that
cross-validation results commonly presented in the
literature can be overly optimistic about the benefits
that can be achieved from combination.
Direct comparison of results across papers in the area
of genome context methods is often impossible
due to changes in databases, testing protocols and definition
of performance measures, sometimes leading to
apparently contradictory conclusions. This was one motivation
for the thorough study presented in this paper. In
Section 4, whenever possible, conclusions found in the
literature will be contrasted with our own conclusions.
2 Genome Context Methods
Given two genes G1 and G2 in a certain target genome, we
wish to compute a measure of the likelihood that they are
functionally related. We study the four types of genome
context methods widely used in the literature: phyloge-
netic profiles, gene neighbor, gene cluster, and gene fusion
(or Rosetta Stone). Three of these methods rely on infor-
mation about the presence of homologous sequences for
the genes in the target genome on a list of reference
organisms. In genome context methods, homology is
generally inferred using the degree of sequence similarity
given by the E-value. Even though sequence similarity
does not directly imply homology, this is a practical,
computationally efficient, and commonly used way to infer it in
bioinformatics. The sequence similarity information used
here is obtained from the Comprehensive Microbial
Resource (CMR) database . The CMR database is
loaded into an Oracle relational database using the
BioWarehouse toolkit  for ease of access. We consider
a sequence to be homologous to the query sequence if the
E-value obtained from CMR is smaller than a certain threshold E.
In this paper, the value of the threshold E is tuned to
optimize the performance of the methods. CMR uses
blastp to generate the E-values. Matches with percent
similarity below 40% or percent identity below
10% are discarded. Two filters are used by CMR to
mask off segments of the query sequence that have low
compositional complexity: SEG  and XNU .
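The homology criterion described above can be summarized in a few lines. The sketch below is an assumption-laden illustration: the `Hit` record and its field names are hypothetical and do not correspond to the CMR or BioWarehouse schemas, and the default threshold of 1e-4 is only the approximate optimum reported later in this paper.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One sequence match for a target gene (hypothetical record layout)."""
    genome: str
    e_value: float
    pct_similarity: float
    pct_identity: float


def is_homolog(hit, e_threshold=1e-4):
    """Apply the similarity/identity filter and the E-value threshold E."""
    if hit.pct_similarity < 40.0 or hit.pct_identity < 10.0:
        return False  # such matches are discarded
    return hit.e_value < e_threshold


def phylogenetic_profile(hits, reference_genomes, e_threshold=1e-4):
    """Binary presence/absence vector of homologs over a reference list."""
    present = {h.genome for h in hits if is_homolog(h, e_threshold)}
    return [1 if g in present else 0 for g in reference_genomes]
```

Lowering `e_threshold` makes the homology call stricter and therefore produces sparser presence/absence vectors, which is why this parameter is tuned jointly with the methods that consume it.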
Genome context methods can be grouped into two cate-
gories: full coverage and restricted coverage. Full-coverage
methods are those that can generate scores for all possible
gene pairs from a genome, while restricted-coverage meth-
ods generate scores only for some pairs. The phylogenetic
profile and the gene neighbor methods are full coverage,
while the gene cluster and the gene fusion methods are
restricted coverage. The gene cluster method generates a
score only when two genes are adjacent in the genome
and encoded on the same strand, and the gene fusion method
generates a score only when the two genes from the target
genome are found fused into a single gene in some other
genome. Table 1 shows a summary of the genome context
methods implemented in this paper.
2.1 Phylogenetic Profiles Method
The phylogenetic profile (PP) of gene G is a binary vector
encoding the presence (indicated with a 1) or absence
(indicated with a 0) of a homologous sequence of G in a
list of reference genomes. The length of the PP vector is
given by the number of genomes in the list. Given the PPs
for two genes, the product PP is defined as the vector that
contains a 1 only in the positions in which both PPs have
a 1 and 0 otherwise. If we think of the PPs for each gene
in a target genome as forming a matrix where each row
corresponds to a gene, then the columns of this matrix are
the organism profiles, while the rows are the gene
profiles. It is assumed that two genes having similar PPs
are likely to be functionally related, since evolutionary
pressure favors the simultaneous preservation or elimina-
tion of two genes that function together. Several measures
have been proposed in the literature to quantify the dis-
tance between two PPs. Here, we compare five of the most
common measures: mutual information (which, in plots
and tables we will call pp-mutual-info), Pearson coeffi-
cient (pp-pearson), Jaccard coefficient (pp-jaccard),
hypergeometric p-value (pp-pval), and weighted hyper-
geometric p-value (pp-wpval). Furthermore, we imple-
ment a sixth PP method called weighted hypergeometric
p-value with runs (pp-wpval-with-runs) proposed in 
aimed at relaxing some of the assumptions made in the
computation of the other metrics. In the following we give
mathematical definitions of the first four measures and
conceptually describe the remaining two. Given two PPs,
p1 and p2, for genes G1 and G2, where pk(i) is 1 if organism
i in the reference list contains a homolog of gene Gk
and 0 otherwise, define
Table 1 Genome context methods implemented in this paper
Method | Short name | Description
Phylogenetic profile | pp-mutual-info | Similarity between the phylogenetic profiles (PPs) of two genes across a list of reference genomes using the mutual information between the two PPs as similarity measure.
Phylogenetic profile | pp-pearson | As above, using the Pearson correlation as measure.
Phylogenetic profile | pp-jaccard | As above, using the Jaccard coefficient as measure.
Phylogenetic profile | pp-pval | As above, using the p-value of the observed PPs given by the hypergeometric distribution that assumes that the probability of a homolog of gene G1 appearing in genome i is independent and identical to the probability of a homolog of gene G2 appearing in genome j, for all genes and all genomes.
Phylogenetic profile | pp-wpval | Same as pp-pval but relaxing the assumption that the probability of a homolog of a target gene appearing in genome i is the same for all i.
Phylogenetic profile | pp-wpval-with-runs | Similar to pp-wpval but using a heuristic to compensate for the assumption of independence of the genomes.
Gene neighbor | gn-lnX | Measure of the relative distance between homologs of two genes in a list of reference genomes. The measure is given by the negative logarithm of the product of the relative distances between genes across all genomes that contain homologs for both genes.
Gene neighbor | gn-pval | P-value measure for the observed value of gn-lnX.
Gene neighbor | gn-norm-lnX | Normalized version of gn-lnX where its value is divided by the total number of genomes found to contain homologs for both genes.
Gene fusion | gf | P-value for the observed number of times two genes are found fused into a single gene in the reference genomes.
Gene cluster | gc | Relative distance in bases between two genes that are adjacent and coded in the same strand in the target genome.
The short name used in the rest of the paper to refer to each method is listed in the second column. The methods are explained in detail in Section 2.
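For readers who want to experiment with the first four phylogenetic profile measures in Table 1, the sketch below computes them from two binary profiles using standard textbook definitions. The exact forms used in this paper (for example, the logarithm base or any sign convention applied to the p-value) are not reproduced in this section, so the function bodies should be read as assumptions rather than as the paper's definitions.

```python
from math import comb, log2, sqrt


def pp_counts(p1, p2):
    """N genomes, x and y ones in each profile, z ones in the product PP."""
    n = len(p1)
    x, y = sum(p1), sum(p2)
    z = sum(a & b for a, b in zip(p1, p2))
    return n, x, y, z


def pp_jaccard(p1, p2):
    _, x, y, z = pp_counts(p1, p2)
    union = x + y - z
    return z / union if union else 0.0


def pp_pearson(p1, p2):
    n, x, y, z = pp_counts(p1, p2)
    denom = sqrt(x * (n - x) * y * (n - y))
    return (n * z - x * y) / denom if denom else 0.0


def pp_mutual_info(p1, p2):
    """Mutual information (in bits) between the two binary profiles."""
    n, x, y, z = pp_counts(p1, p2)
    joint = {(1, 1): z, (1, 0): x - z, (0, 1): y - z, (0, 0): n - x - y + z}
    marg1 = {1: x, 0: n - x}
    marg2 = {1: y, 0: n - y}
    mi = 0.0
    for (a, b), c in joint.items():
        if c:  # empty cells contribute nothing
            mi += (c / n) * log2(c * n / (marg1[a] * marg2[b]))
    return mi


def pp_pval(p1, p2):
    """Hypergeometric probability of at least z co-occurrences by chance."""
    n, x, y, z = pp_counts(p1, p2)
    tail = sum(comb(x, k) * comb(n - x, y - k) for k in range(z, min(x, y) + 1))
    return tail / comb(n, y)
```

Two identical profiles such as `[1, 1, 0, 0]` give a Jaccard coefficient and Pearson correlation of 1 and a small p-value, while profiles that co-occur only as often as expected by chance give a Pearson correlation and mutual information of 0.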
the training data. The confidence-product combiner,
which has few parameters to learn, has a more stable
performance, much more independent of the data used
for training. On the other hand, as we have seen, this same
lack of complexity makes this combiner
worse than the decision tree combiner when good
training data is available.
The fact that the effect of the training data is much
more marked when the known-function sets are used for
training and testing instead of the sm-enzyme sets might
be due to several factors. It is possible that the genome
context scores on the known-function set of samples are
less consistent across organisms than in the sm-enzyme
set, making the learning of the combination function
inherently harder. More likely, this behavior might be
due to the fact that, as mentioned earlier, the gold stan-
dard is probably less reliable in the known-function set
than in the sm-enzyme set.
Figure 12 ROC curves for the sm-enzyme gold standard set (left) and the known-function set (right) of samples for ECK12 for
different combination strategies. Solid curves correspond to the confidence-product combiner, and dashed curves correspond to the decision
tree combiner. The color of the curve varies according to the scores being combined. Red curves (called comb.orig-scores) correspond to
combiners using only gn-pval, pp-mutual-info, gc, and gf. The blue curves (called comb.orig+new-scores) include those four plus gn-lnX,
gn-lnX after znorm and pp-mutual-info after znorm. Parameters E = 10^-4, Q = 50 and a reference list of size 343 obtained by clustering are used to
generate all scores. Two individual scores are also shown for comparison: gn-lnX after znorm and gn-pval. These are the best two individual
systems. A combined system can be considered successful if its curve is consistently better than that of the best individual system being
combined at every operating point.
Figure 13 Same as Figure 12 but training the combiners using samples from CAULO, MTBC, MTBR, FRANT and HPY instead of using
cross-validation on ECK12 samples.
While ECK12 is a highly
curated database, the rest of the organisms used in this
paper have undergone much less curation, making the
gold standard on these organisms of lower quality, since
pathways or complexes that exist in the organisms might
not appear in the database. For these organisms, it is rea-
sonable to expect the quality of the labels to be better for
genes that have been tagged as enzymes than for other
genes which might not have been studied as much.
Furthermore, it is possible that the kinds of functional
relationships we are currently not considering in our
gold standard (for example, interactions due to proteins
transiently binding to each other or other kinds of functional
relationships not captured by our labeling procedure,
as explained in Section 4.1) correspond to a larger
proportion of samples in the known-function set than in
the sm-enzyme set. If, for either reason, the gold standard
is indeed less accurate on the known-function set than on
the sm-enzyme set, then, since the combiner learns from
these labels, we would expect it to perform
worse on the known-function set.
The implication of the results presented here is that
combination results trained on a certain database might
not generalize to organisms that are not well repre-
sented in this database. Furthermore, the benefits from
using more complex combination procedures might be
overestimated when training and testing on the same
organism (or set of organisms). The only way to fairly
assess whether a combination procedure will be able to
generalize to unseen organisms that are not closely
related to those available for training is to devise the
training and testing databases in a way that is representative
of the actual testing conditions. How distant the
training organisms should be from the test organisms
depends on how the combiner will be used. If the goal
is to use the combiner on organisms for which no closely
related organism is available for training, this same
criterion should be used to select the training
data when trying to assess the performance of the combiner.
We present a systematic study of individual genome
context methods and their combination, which we
believe is needed to better understand how to optimize
these techniques. The families of methods studied in
this paper are the gene cluster, gene neighbor, gene
fusion, and phylogenetic profile. These are the methods
widely used in the literature and in publicly available
databases . We propose the use of normalization
techniques for the genome context methods and show
that they can produce large performance gains. We study
the optimal parameters for each method and the effect
that the reference list of genomes has on its perfor-
mance. We also show the performance of the different
methods for a set of bacterial organisms. Furthermore,
we present a careful study of the effect that the training
data has on a combination procedure used to merge all
genome context scores into a single score.
In comparative experiments across different individual
methods, we find that methods that compute summary
measures of the distance between homologs of the
genes in the target genome across a list of reference
organisms, commonly called gene neighbor methods,
lead, in most cases, to the best overall performance of
all genome context methods. Although this result was
observed earlier , we have demonstrated that it is
true for any choice of reference organism list and for
almost all organisms on which we tested. In absolute
terms, performance of the genome context methods var-
ies widely (up to a factor of two) across organisms, but
generally their ranking does not, the gene neighbor
method being invariably among the best. The gene cluster
method is, for some organisms, competitive with the gene
neighbor method at low sensitivities. Phylogenetic pro-
file methods are generally worse than gene neighbor
methods, in many cases by a large margin. The gene
fusion method is the worst across most organisms. Note
that the gene fusion and gene cluster methods are quali-
tatively different from the gene neighbor and phyloge-
netic profile methods in that they can generate scores
only for a small proportion of gene pairs. With this con-
sideration, we believe that the gene neighbor method
can be considered the overall best genome context
method in the literature since it leads to the best perfor-
mance on most organisms and is able to generate scores
for all gene pairs in the target genomes.
Three of the four genome context method families
rely on the extraction of homologs for the genes in the
target genome. As commonly done in bioinformatics,
sequence similarity is used as a way to detect homology.
For this, a threshold on the BLAST E-value must be
chosen. A thorough study of the performance of the different
methods varying the value of this threshold indicates
that the optimal value is in the vicinity of 10^-4.
While this same value was found to be optimal for a
phylogenetic profile method in , our results indicate
that this value is also approximately optimal for all
other genome context methods.
We also show that the size and composition of the
reference organism list has a significant influence on the
performance of the genome context methods. Organism
lists containing many organisms that are closely related
to each other negatively affect the performance of some
methods since they violate their independence assumption.
Reference lists can be pruned to exclude highly
related organisms using a clustering procedure, resulting
in relative gains in sensitivity of around 5%. Overall, the
optimization of the E-value threshold and the list of
reference organisms results in a gain of around 8% in
the sensitivity of the gene neighbor method with respect
to the system presented in , which uses an E-value
threshold of 10^-10 and no pruning of the list of reference organisms.
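The clustering procedure used to prune the reference list is not detailed in this section. Purely as an illustration of the idea, the sketch below uses a greedy, leader-style selection: an organism is kept only if its organism profile is not too similar to that of any organism already kept. The function names, the use of the Jaccard coefficient between organism profiles, and the 0.9 threshold are all hypothetical choices for this example and should not be read as the procedure used in this paper.

```python
def jaccard(set_a, set_b):
    """Jaccard coefficient between two sets of target genes."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def prune_reference_list(organism_profiles, max_similarity=0.9):
    """Greedy pruning of closely related reference organisms (sketch).

    `organism_profiles` maps an organism name to the set of target genes
    that have a homolog in that organism. Organisms are visited in the
    order given and kept only if no kept organism is too similar.
    """
    kept = []
    for name, profile in organism_profiles.items():
        if all(jaccard(profile, organism_profiles[k]) <= max_similarity
               for k in kept):
            kept.append(name)
    return kept
```

A greedy pass of this kind depends on the visiting order, so a proper clustering followed by the choice of one representative per cluster would be the more principled option when reproducibility matters.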
Genome context scores suffer, as do many other
scores from different statistical processing problems,
from a bias effect: scores are affected not only by the
characteristic we wish to detect (in our case, functional
relationship), but also by other characteristics that consistently
affect the values of the scores for a certain group
of samples. Score normalization methods aimed at compensating
for this bias are in standard use in signal processing
problems but have not yet, to our knowledge,
been applied to genome context methods.
We implement two score normalization procedures
borrowed from microarray analysis. We show that score
normalization methods aimed at equalizing the distribution
of scores across genes lead to large gains of
as much as 40% on some genome context methods.
A relative gain of around 25% is observed on the phylogenetic
profile method that uses mutual information as
the similarity metric between profiles. With this
improvement, this normalized phylogenetic profile score
is, we believe, the best performing in this family of
methods since it outperforms a method introduced in
, which has in turn been shown to outperform
other phylogenetic profile methods, some of them very
sophisticated and computationally complex. Finally, results
from combining the individual genome context scores
testing on E. coli K-12 gene pairs are presented. Two
combination approaches are compared: a simple
approach that converts each individual set of scores into
confidence measures and combines them with a simple
nontrainable function, and a more complex approach
that uses decision trees as combiners. We show that,
when a cross-validation procedure is used for training,
the decision tree combiner can greatly outperform the
simpler combiner. In this case, large gains are obtained
when three scores proposed in this paper, two of them
normalized as explained above, are added for combina-
tion to the four original scores previously proposed in
the literature. Specifically, when cross-validation on all
known-function E. coli K-12 gene pairs is used for train-
ing and seven scores are used for combination, a gain in
sensitivity of around 20% is obtained with respect to the
best individual score given by a gene neighbor method.
This gain can be compared to that obtained with the
simpler combiner method when combining only the
four original scores. This system is comparable to those
used for the Prolinks paper  and in the STRING data-
base [14,16,42]. For this system we find that the gains
from combination on E. coli K-12 gene pairs with
respect to the gene neighbor method are less than 4%.
Hence, our system results in a gain with respect to the
state of the art in genome context methods when well-
matched data is used to train the combiner.
To our knowledge, combination performance has
always been tested either by training the combiner para-
meters on the test samples (for example, [4,14]), or
using a cross-validation procedure (for example, [7,11]).
These procedures (particularly train-on-test) are
expected to lead to an optimistic assessment of the
gains that can be achieved from combination on organ-
isms that are not well represented on the training data.
In this paper, we explore the performance of the two
combiners mentioned above when training data from
organisms other than the target organism is used. We
find that when organisms that are phylogenetically distant
from the target organism are used to train the combiners,
both combination methods fail to give gains with
respect to the single best individual score. Nevertheless,
adding the normalized scores proposed in this paper
seems to add some robustness to the procedure, allow-
ing the simpler combiner to be at least as good as the
individual best score.
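The difference between the two evaluation protocols discussed above can be sketched in code. The Python below builds train/test index splits in two ways: ordinary k-fold cross-validation over the gene pairs of one organism, and a cross-organism scheme in which the combiner is trained only on pairs from organisms other than the target. The function names and the organism labels in the usage note are hypothetical, and the fold assignment is a simple deterministic choice made for illustration rather than the procedure used in the paper.

```python
from collections import defaultdict
from typing import Iterator, List, Sequence, Tuple


def kfold_indices(n: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k-fold cross-validation.

    Sample i is assigned to fold i % k, so train and test pairs come
    from the same organism and the same data distribution.
    """
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test


def cross_organism_splits(
    organisms: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (target, train, test) with the target organism held out.

    `organisms[i]` names the organism that gene pair i belongs to. The
    combiner never sees pairs from the target organism during training.
    """
    by_org = defaultdict(list)
    for i, org in enumerate(organisms):
        by_org[org].append(i)
    for target in sorted(by_org):
        test = by_org[target]
        train = [i for org, idx in by_org.items() if org != target
                 for i in idx]
        yield target, train, test
```

For example, with pairs labeled `["ecoli", "ecoli", "bsub", "bsub", "mtb"]`, the first cross-organism split holds out `"bsub"`, testing on pairs 2 and 3 and training on pairs 0, 1 and 4. Comparing a combiner's sensitivity under the two schemes gives a rough sense of how optimistic the cross-validation estimate is for an unseen organism.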
Our conclusion is that, if genome context scores are
to be used on organisms that are neither well represented
among nor phylogenetically similar to those available to
generate a gold standard, and a single score needs to be
generated for each gene pair, then either the single best
score (the gene neighbor p-value) should be used by itself,
or a simple combiner with few parameters should be
trained, preferably using the normalized scores proposed
in this work. Using a complex combination procedure
that leads to large gains on cross-validation experiments
is likely to lead to suboptimal results on these unseen organisms.
Additional file 1: Study of the effect of E and Q parameters. In this
file we show the effect of the parameter E (E-value threshold for
inferring homology) on the gene neighbor, phylogenetic profile and
gene fusion methods, and of the parameter Q (percent overlap for
finding Rosetta Stones) on the gene fusion method.
We thank Tomer Altman for his advice on several database-related issues.
We also thank Shawn Cokus and Matteo Pellegrini from University of
California, Los Angeles, for their help during the implementation of the
phylogenetic profile method we adopted from their paper. This work was
funded by SRI International. The contents of this article are solely the
responsibility of the authors and do not necessarily represent the official
views of SRI International.
LF, JD and PDK decided research directions and discussed and analyzed
results. LF carried out the experimental research and drafted the manuscript.
JD and PDK revised the manuscript for technical content. All authors read
and approved the final manuscript.
Ferrer et al. BMC Bioinformatics 2010, 11:493
Received: 13 February 2010 Accepted: 1 October 2010
Published: 1 October 2010
1. Pellegrini M, Marcotte E, Thompson M, Eisenberg D, Yeates T: Assigning
protein functions by comparative genome analysis: protein phylogenetic
profiles. PNAS 1999, 96:4285-8.
2. Marcotte E, Pellegrini M, Ng H, Rice D, Yeates T, Eisenberg D: Detecting
protein function and protein-protein interactions from genome
sequences. Science 1999, 285:751-3.
3. Enright A, Iliopoulos I, Kyrpides N, Ouzounis C: Protein interaction maps
for complete genomes based on gene fusion events. Nature 1999,
4. Bowers P, Pellegrini M, Thompson M, Fierro J, Yeates T, Eisenberg D:
Prolinks: a database of protein functional linkages derived from
coevolution. Genome Biology 2004, 5(5):R35.
5. Overbeek R, Fonstein M, D’Souza M, Pusch G, Maltsev N: Use of contiguity
on the chromosome to predict functional coupling. In Silico Biol 1999,
6. Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan N, Chung S, Emili A,
Snyder M, Greenblatt JF, Gerstein M: A Bayesian networks approach for
predicting protein-protein interactions from genomic data. Science 2003,
7. Lu L, Xia Y, Paccanaro A, Yu H, Gerstein M: Assessing the limits of
genomic data integration for predicting protein networks. Genome
Research 2005, 15(7):945-53.
8. Schadt E, Li C, Ellis B, Wong W: Feature extraction and normalization
algorithms for high-density oligonucleotide gene expression array data.
Journal of Cellular Biochemistry 2001, Suppl 37:120-5.
9. Bolstad B, Irizarry R, Astrand M, Speed T: A Comparison of normalization
methods for high density oligonucleotide array data based on bias and
variance. Bioinformatics 2003, 19(2):185-193.
10. Marcotte EM, Pellegrini M, Thompson MJ, Yeates T, Eisenberg D: A
combined algorithm for genome-wide prediction of protein function.
Nature 1999, 402:83-86.
11. Yellaboina S, Goyal K, Mande S: Inferring genome-wide functional
linkages in E. coli by combining improved genome context methods:
comparison with high-throughput experimental data. Genome Research
12. Sun J, Sun Y, Ding G, Liu Q, Wang C, He Y, Shi T, Li Y, Zhao Z: InPrePPI: an
integrated evaluation method based on genomic context for predicting
protein-protein interactions in prokaryotic genomes. BMC Bioinformatics
13. Strong M, Mallick P, Pellegrini M, Thompson MJ, Eisenberg D: Inference of
protein function and protein linkages in Mycobacterium tuberculosis
based on prokaryotic genome organization: a combined computational
approach. Genome Biology 2003, 4(9):R59.
14. von Mering C, Huynen M, Jaeggi D, Schmidt S, Bork P, Snel B: STRING: a
database of predicted functional associations between proteins. Nucleic
Acids Research 2003, 31(1):258-61.
15. Hu P, Janga SC, Babu M, Díaz-Mejía J, Butland G, Yang W, Pogoutse O,
Guo X, Phanse S, Wong P, Chandran S, Christopoulos C, Nazarians-
Armavil A, Nasseri NK, Musso G, Ali M, Nazemof N, Eroukova V, Golshani A,
Paccanaro A, Greenblatt J, Moreno-Hagelsieb G, Emili A: Global functional
atlas of Escherichia coli encompassing previously uncharacterized
proteins. PLoS Biol 2009, 7(4).
16. von Mering C, Jensen L, Snel B, Hooper SD, Krupp M, Foglierini M,
Jouffre N, Huynen MA, Bork P: STRING: known and predicted protein-
protein associations, integrated and transferred across organisms. Nucleic
Acids Research 2005, 33:433-7.
17. Sun J, Xu J, Liu Z, Liu Q, Zhao A, Shi T, Li Y: Refined phylogenetic profiles
method for predicting protein-protein interactions. Bioinformatics 2005,
18. Karimpour-Fard A, Hunter L, Gill R: Investigation of factors affecting
prediction of protein-protein interaction networks by phylogenetic
profiling. BMC Genomics 2007, 8:393.
19. Cokus S, Mizutani S, Pellegrini M: An improved method for identifying
functionally linked proteins using phylogenetic profiles. BMC
Bioinformatics 2007, 8.
20. Peterson J, Umayam L, Dickinson T, Hickey E, White O: The Comprehensive
Microbial Resource. Nucleic Acids Research 2001, 29:123-5.
21. Lee T, Pouliot Y, Wagner V, Gupta P, Stringer-Calvert D, Tenenbaum J,
Karp P: BioWarehouse: A bioinformatics database warehouse toolkit. BMC
Bioinformatics 2006, 7:170.
22. Wootton JC, Federhen S: Statistics of local complexity in amino acid
sequences and sequence databases. Computers and Chemistry 1993,
23. Claverie JM, States DJ: Information enhancement methods for large scale
sequence analysis. Computers and Chemistry 1993, 17:191-201.
24. Kharchenko P, Chen L, Freund Y, Vitkup D, Church G: Identifying metabolic
enzymes with multiple types of association evidence. BMC Bioinformatics
25. Barker D, Meade A, Pagel M: Constrained models of evolution lead to
improved prediction of functional linkage from correlated gain and loss
of genes. Bioinformatics 2007, 23(1):14-20.
26. Tamames J, Casari G, Ouzounis C, Valencia A: Conserved clusters of
functionally related genes in two bacterial genomes. J Mol Evol 1997,
27. Brouwer R, Kuipers O, van Hijum S: The relative value of operon
predictions. Briefings in Bioinformatics 2008, 9(5):367-75.
28. Pandey G, Ramakrishnan LN, Steinbach M, Kumar V: Systematic evaluation
of scaling methods for gene expression data. Bioinformatics and
Biomedicine, IEEE International Conference on 2008, 0:376-381.
29. Karp P, Ouzounis C, Moore-Kochlacs C, Goldovsky L, Kaipa P, Ahren D,
Tsoka S, Darzentas N, Kunin V, Lopez-Bigas N: Expansion of the BioCyc
collection of pathway/genome databases to 160 genomes. Nucleic Acids
Research 2005, 33(19):6083-89.
30. Caspi R, Foerster H, Fulcher C, Kaipa P, Krummenacker M, Latendresse M,
Paley S, Rhee SY, Shearer A, Tissier C, Walk T, Zhang P, Karp PD: The
MetaCyc database of metabolic pathways and enzymes and the BioCyc
collection of Pathway/Genome Databases. Nucleic Acids Research 2008, 36:
31. Keseler I, Bonavides-Martinez C, Collado-Vides J, Gama-Castro S, Gunsalus R,
Johnson DA, Krummenacker M, Nolan L, Paley S, Paulsen I, Peralta-Gil M,
Santos-Zavaleta A, Shearer A, Karp P: EcoCyc: A comprehensive view of E.
coli biology. Nucleic Acids Research 2009, 37:D464-70.
32. Caspi R, Altman T, Dale J, Dreher K, Fulcher C, Gilham F, Kaipa P,
Karthikeyan A, Kothari A, Krummenacker M, Latendresse M, Mueller L,
Paley S, Popescu L, Pujar A, Shearer A, Zhang P, Karp P: The MetaCyc
database of metabolic pathways and enzymes and the BioCyc collection
of Pathway/Genome Databases. Nucleic Acids Research 2010, 38:D473-9.
33. Green M, Karp P: The outcomes of pathway database computations
depend on pathway ontology. Nucleic Acids Research 2006, 34:3687-97.
34. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource
for deciphering the genome. Nucleic Acids Research 2004, 32:D277-D280.
35. Barker D, Pagel M: Predicting functional gene links from phylogenetic-
statistical analyses of whole genomes. PLoS Computational Biology 2005,
36. Chambers JM, Hastie TJ: Statistical Models in S. Wadsworth and BrooksCole
37. R Development Core Team: R: A language and environment for statistical
computing R Foundation for Statistical Computing, Vienna, Austria 2005.
38. Buntine W, Caruana R: Introduction to IND and recursive partitioning.
Tech Rep FIA-91-28, NASA Ames Research Center 1991.
39. Buntine W: IND software package. [http://opensource.arc.nasa.gov/project/
40. Breiman L: Bagging predictors. Machine Learning 1996, 24(2):123-140.
41. Koonin EV, Galperin MY: Sequence - Evolution - Function: Computational
Approaches in Comparative Genomics Kluwer Academic 2002.
42. Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J, Doerks T,
Julien P, Roth A, Simonovic M, Bork P, von Mering C: STRING 8-a global
view on proteins and their functional interactions in 630 organisms.
Nucleic Acids Research 2009, 37.
Cite this article as: Ferrer et al.: A systematic study of genome context
methods: calibration, normalization and combination. BMC Bioinformatics
2010, 11:493.