
MetaDisorder: A meta-server for the prediction of intrinsic disorder in proteins

RESEARCH ARTICLE Open Access
MetaDisorder: a meta-server for the prediction
of intrinsic disorder in proteins
Lukasz P Kozlowski1 and Janusz M Bujnicki1,2*
Abstract
Background: Intrinsically unstructured proteins (IUPs) lack a well-defined three-dimensional structure. Some of
them may assume a locally stable structure under specific conditions, e.g. upon interaction with another molecule,
while others function in a permanently unstructured state. The discovery of IUPs challenged the traditional protein
structure paradigm, which stated that a specific well-defined structure defines the function of the protein. As of
December 2011, approximately 60 methods for computational prediction of protein disorder from sequence have
been made publicly available. They are based on different approaches, such as utilizing evolutionary information,
energy functions, and various statistical and machine learning methods.
Results: Given the diversity of existing intrinsic disorder prediction methods, we decided to test whether it is
possible to combine them into a more accurate meta-prediction method. We developed a method based on
13 arbitrarily chosen disorder predictors, in which the final consensus was weighted by the accuracy of the
methods. We have also developed a disorder predictor, GSmetaDisorder3D, that used no third-party disorder
predictors, but instead used alignments to known protein structures, reported by protein fold-recognition methods,
to infer the potentially structured and unstructured regions. Following the success of our disorder predictors in the
CASP8 benchmark, we combined them into a meta-meta predictor called GSmetaDisorderMD, which was the top
scoring method in the subsequent CASP9 benchmark.
Conclusions: A series of disorder predictors described in this article is available as a MetaDisorder web server at
http://iimcb.genesilico.pl/metadisorder/. Results are presented both in an easily interpretable, interactive mode and
in a simple text format suitable for machine processing.
Background
Many proteins are functional despite lacking a stable three-dimensional structure under physiological conditions in vitro and/or in vivo [1,2]. Regions of protein-protein and protein-nucleic acid interactions, as well as sites of posttranslational modification, often fall into regions that are locally disordered or undergo a disorder-to-order transition in biologically relevant situations [3,4]. Intrinsic disorder is a common feature of "hub" proteins that interact with multiple other proteins and perform important regulatory roles in the cell [5]. Many intrinsically unstructured proteins (IUPs) or intrinsically unstructured regions (IURs) are critical for cell survival, proliferation, differentiation, and apoptosis, which makes them important from a biomedical point of view.
Intrinsically unfolded proteins, once purified, can be identified by various experimental methods [6-9]. However, experimental determination of the absence of a three-dimensional structure is difficult. Since the presence or absence of a single stable structure is encoded in the protein sequence, it is possible to use the sequence information to predict regions of disorder in a manner similar to the prediction of, e.g., secondary structure. Therefore, the emerging "unfoldomics" field [1,10] has prompted the development of numerous computational methods for the prediction of disordered regions from protein sequence (see e.g. the list of URLs in DisProt, the Database of Protein Disorder [11]).
IUPs and IURs are quite diverse. They can be classified in various ways according to length (short vs long disorder), method of experimental determination (e.g. lack of electron density in crystal structures), the presence or absence of certain structural features (e.g. disorder with secondary structure but no tertiary structure), and many other factors. Different types of disorder are often associated with different characteristics. For this reason, some computational methods for disorder prediction are available in several versions, trained on different datasets, e.g. on short and long IURs separately [1,2]. However, thus far no single clear-cut classification of all disorder types has emerged that would be accepted and used by all experts in the field, and most methods for disorder prediction from protein sequence aim for a binary classification of protein residues: ordered or disordered (i.e. with all types of disorder treated as a single class).
* Correspondence: iamb@genesilico.pl
1 Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology, ul. Trojdena 4, 02-109 Warsaw, Poland
2 Laboratory of Bioinformatics, Institute of Molecular Biology and Biotechnology, Faculty of Biology, ul. Umultowska 89, 61-614 Poznan, Poland
© 2012 Kozlowski and Bujnicki; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Kozlowski and Bujnicki BMC Bioinformatics 2012, 13:111
http://www.biomedcentral.com/1471-2105/13/111
The so-called "meta-method" approach relies on the fact that different algorithms have their individual advantages and disadvantages, so that a combination of methods can be used to improve the prediction accuracy. This approach has been used to develop many successful prediction methods, e.g. in protein fold recognition [12], protein function prediction [13], prediction of protein domains [14], prediction of protein model quality [15], and recently also in protein disorder prediction [16-18]. In this article, we describe a set of predictors that take a protein sequence as input, query other methods, and calculate a final "consensus" prediction of disorder (in the sense of "any disorder" as a single class, as opposed to different types of order treated jointly as another single class). They have been implemented as a single web server called MetaDisorder, available at http://iimcb.genesilico.pl/metadisorder/. One of our methods is essentially a primary predictor, as it does not use any other disorder prediction method; however, it is "meta" in the sense that it does utilize other predictions, namely alignments to proteins of known structure reported by protein fold-recognition methods. Our other disorder predictors are typical meta-methods, as they directly query a series of primary disorder predictors and utilize their output. Additionally, other types of one-dimensional features, such as predicted secondary structure and predicted solvent accessibility, are used. In the framework of the CASP8 and CASP9 benchmarks, these meta-predictors outperformed other methods for disorder prediction [19].
Methods
Definition of disorder
Protein disorder can be defined in many ways, depending on the research focus and the experimental method used. As a baseline, we adopted the definition used in the Critical Assessment of protein Structure Prediction (CASP) experiments: disordered residues are those marked by the REMARK465 tag in the experimentally determined protein structures deposited in the Protein Data Bank (PDB) [20], which indicates regions with missing coordinates in crystal structures determined by X-ray crystallography or residues with highly variable coordinates in ensembles of Nuclear Magnetic Resonance (NMR) structures. This definition was extended to also include proteins deposited in the DisProt database (disorder validated experimentally by a variety of methods such as circular dichroism (CD) spectroscopy, mass spectrometry, immunochemistry, SDS-PAGE, and small-angle X-ray scattering (SAXS); currently over 1300 regions) [11]. The advantage of the DisProt database is that it includes proteins without known three-dimensional structure, especially proteins that are entirely disordered, whose structure typically cannot be determined by high-resolution methods (X-ray crystallography and NMR). Thus, we treat all disorder types as a single class.
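The REMARK465-based definition can be illustrated with a short Python sketch that lists the residues reported as missing in a PDB-format file. This is not the authors' parser: the function name `missing_residues` is hypothetical, and the code assumes the usual whitespace-separated record layout (residue name, chain identifier, residue number, optionally preceded by a model number in NMR entries).

```python
import re

# Residue numbers may be negative and may carry an insertion code (e.g. "100A").
_RESNUM = re.compile(r"^-?\d+[A-Za-z]?$")

def missing_residues(pdb_text):
    """Return (chain, residue_number, residue_name) tuples from REMARK 465 records.

    Assumption: data records look like 'REMARK 465     MET A     1'; the
    explanatory header lines of the remark are skipped because they do not
    match this three-field pattern.
    """
    found = []
    for line in pdb_text.splitlines():
        if not line.startswith("REMARK 465"):
            continue
        tokens = line[10:].split()
        # NMR entries may carry a leading model number; drop it if present.
        if len(tokens) == 4 and tokens[0].isdigit():
            tokens = tokens[1:]
        if len(tokens) != 3:
            continue
        resname, chain, resnum = tokens
        if len(chain) == 1 and _RESNUM.match(resnum):
            found.append((chain, resnum, resname))
    return found
```

Residues returned by such a function would be labeled as disordered, and all residues with coordinates as ordered.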
Primary methods used in the meta-method
The MetaDisorder series of predictors combined, via a machine-learning approach, the predictions of 13 primary disorder predictors that performed well in CASP and are freely available as standalone applications or stable web servers that can process large numbers of queries: DisEMBL [21], DISOPRED2 [22], DISpro [23], GlobPlot [24], iPDA [25], IUPred [26], Pdisorder [27], Poodle-s [28], Poodle-l [29], PrDOS [30], Spritz [31], DisPSSMP [32], and RONN [33]. Additionally, the meta-predictors designed for CASP9 also used six subjectively selected methods for protein fold-recognition: HHSEARCH run over the PDB70 and CDD databases [34], FFAS [35], mGenThreader [36], PSI-BLAST run in two different modes (with and without masking regions with low sequence complexity) over the culled PDB database [37], PHYRE [38], and PCONS [39] (a consensus method that uses as input models generated by MODELLER [40] based on alignments from the previously mentioned fold-recognition methods). For a short description of each method, see Table 1 and Table 2. Additionally, two methods for secondary structure prediction, JNET [41] and PSIPRED [42], and one solvent accessibility predictor, JNET [41], were used.
Training and testing datasets
To train the meta-predictors, two independent datasets
were used. The first dataset was prepared based on the
combined DisProt database (version 3.6) and CASP7
targets. Sequences longer than 1000 residues were omitted,
because they exceed the length limit of some of the
primary methods used and could not be processed auto-
matically without arbitrary manipulations. Overall, this
procedure provided 566 proteins, which included 232,664
residues in total, of which 23.45% were disordered. The
second dataset, called pdbRemark465, was based on struc-
tures in the PDB database. Representative structures were
extracted using the PISCES server [43] and filtered accord-
ing to the following criteria: experimental technique: X-ray
crystallography, resolution <, R-factor < 0.2, length 50-1000 aa residues, and mutual sequence similarity < 20%. The resulting dataset contained 1147 proteins (289,008 residues, of which 6.28% were disordered according to the REMARK465 tag in the PDB files, see Additional file 1). In the final version of the meta-predictor, we combined these two datasets and used them for assessing the disorder prediction accuracy. During that procedure, standard 10-fold cross validation was used. All amino acid residues were randomly assigned to 10 bins of nearly equal size. Nine bins were used as a source of the training data and the remaining 10th bin was used as a source of the testing data. This procedure was then repeated 10 times, with each of the 10 bins used exactly once for validation. The results of the 10 analyses were then averaged to produce the final scores.
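The cross-validation scheme described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `ten_fold_splits` and the fixed random seed are assumptions, and items are represented simply by their indices.

```python
import random

def ten_fold_splits(n_items, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation.

    Items (here: residues) are shuffled once and dealt into n_folds bins of
    nearly equal size; each bin serves as the test set exactly once.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test
```

The evaluation score would be computed on each test bin and the 10 values averaged.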
Since we aimed to be as objective as possible in asses-
sing the predictive power of our methods in a fair com-
parison to other methods, to avoid any bias we tested all
predictors described in this article within truly blind tests
of CASP8 and CASP9, in which (as mentioned earlier),
the prediction of disorder is defined as the ability to
identify regions with missing coordinates in crystal
structures determined by X-ray crystallography or residues
with highly variable coordinates in ensembles of NMR
structures.
For the training of the GSmetaDisorder3D and GSmetaDisorderMD predictors, we used proteins from CASP8 (122 proteins, 27,614 residues, of which 11.11% were disordered; among them 19 were solved by NMR, 2,515 residues, of which 47.95% were disordered). Again, 10-fold cross validation was used. The detailed statistics for each dataset are provided in Table 3.
Measures used for training and evaluation
The results of predictions can be divided into four categories: true positives (TP), residues correctly predicted as disordered; true negatives (TN), residues correctly predicted as ordered; false positives (FP), ordered residues misclassified as disordered; and false negatives (FN), disordered residues misclassified as ordered.

Table 1 Description of disorder predictors analyzed in this work
Method | Short description | Availability | Ref.
DisEMBL | ANN trained to predict classic loops (DSSP), flexible loops with high B-factors, missing coordinates in X-ray structures, regions of low-complexity and prone to aggregation. | local installation | [21]
DISOPRED2 | SVM trained to predict residues with missing coordinates. | local installation | [22]
DISpro | Recursive neural networks (RNNs) trained to predict missing coordinates. | local installation | [23]
GlobPlot | A simple method based on several hydrophobicity scales to predict regions of missing coordinates and loops with high B-factors. | local installation | [24]
iPDA | Incorporates information about sequence conservation, predicted secondary structure, sequence complexity and hydrophobic clusters. | web service | [25]
IUPred | Estimates pairwise interaction energies using a statistical potential. Two versions for predicting long and short disorder. | web service | [26]
Pdisorder | Combination of neural network, linear discriminant function and acute smoothing procedure is used for recognition of disordered and ordered regions in proteins. | web service | [27]
Poodle-s | SVM trained for short disorder detection (uses PSSMs generated by PSI-BLAST). | web service | [28]
Poodle-l | Predicts long disorder using an SVM. | web service | [29]
PrDOS | Predicts missing coordinates in 3D structure using SVM and PSSMs from PSI-BLAST. | web service | [30]
Spritz | Predicts long and short disorder (missing coordinates) using two separate SVMs. Utilizes secondary structure. | web service | [31]
RONN | Predicts missing coordinates using an ANN. | local installation | [33]

Table 2 Description of fold recognition methods used by MetaDisorder
Method | Short description | Availability | Ref.
PSI-BLAST | Position-Specific Iterated BLAST uses position-specific scoring matrices derived during the search of the nr database | local installation | [37]
FFAS | Profile-profile alignment and fold-recognition algorithm for fold and function assignment | local installation | [35]
mGenThreader | The method combines profile-profile alignments with secondary-structure specific gap-penalties, classic pair- and solvation potentials using a linear combination optimized with a regression SVM model | local installation | [36]
HHsearch | Generalizes the alignment of protein sequences with a profile hidden Markov model (HMM) to the case of pairwise alignment of profile HMMs | local installation | [34]
PCONS | A neural-network-based consensus predictor | local installation | [39]
PHYRE | An algorithm that uses profile-profile and secondary structure matching | web service | [38]
The first assessment criterion we used was the receiver
operating characteristic (ROC). The ROC curve is a
graphical plot of the sensitivity vs. false positive rate for
a classifier, as its discrimination threshold is changed.
The resulting area under curve (AUC) defines the overall
robustness of an algorithm, where 1 means the perfect
predictor (all true positives are found by the method
without any false positives) and 0.5 corresponds to a
random one.
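As an illustration of the AUC measure, the following Python sketch computes it through the rank-based identity: the AUC equals the probability that a randomly chosen disordered residue receives a higher score than a randomly chosen ordered one, with ties counted as one half. The function name `roc_auc` is hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve for continuous scores.

    labels: 1 for disordered (positive), 0 for ordered (negative).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))
```

A predictor that ranks every disordered residue above every ordered one yields 1, and a predictor returning the same score for all residues yields 0.5.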
The second criterion is the weighted score, called S_w, which rewards a correct disorder prediction more highly than a correct order prediction [44]. This is done to avoid overprediction of the ordered state, which would otherwise be favored because ordered regions are more common in known proteins. The S_w score is defined as:

$$S_w = \frac{S}{S_{max}} = \frac{W_{disorder} \cdot TP - W_{order} \cdot FP + W_{order} \cdot TN - W_{disorder} \cdot FN}{W_{disorder} \cdot (TP + FN) + W_{order} \cdot (TN + FP)}$$

where W_disorder equals the fraction of ordered residues and W_order equals the fraction of disordered residues. S_w is in the range -1 to 1, where 0 means random prediction. Maximization of S_w was the main criterion of the optimization procedure and it was also used to assess the relative value of individual primary disorder predictors to be incorporated into our meta-servers. The S_w score was directly used as a weight of a prediction returned by each such method.
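The weighted score can be computed from the four confusion-matrix counts as in the sketch below. The function name `weighted_score` is hypothetical, and S_max is taken here as the score of a perfect prediction (all disordered residues counted as TP and all ordered residues as TN), which is what keeps the result within the range -1 to 1.

```python
def weighted_score(tp, tn, fp, fn):
    """S_w computed from confusion-matrix counts."""
    n_dis = tp + fn                  # truly disordered residues
    n_ord = tn + fp                  # truly ordered residues
    if n_dis == 0 or n_ord == 0:
        raise ValueError("both classes must be present")
    total = n_dis + n_ord
    w_dis = n_ord / total            # W_disorder: fraction of ordered residues
    w_ord = n_dis / total            # W_order: fraction of disordered residues
    s = w_dis * tp - w_ord * fp + w_ord * tn - w_dis * fn
    s_max = w_dis * n_dis + w_ord * n_ord
    return s / s_max
```

With these weights the expression simplifies algebraically to sensitivity + specificity - 1. For example, the FloatCons row for CASP8 in Table 6 reports a sensitivity of 0.758 and a specificity of 0.904, which gives 0.662, the S_w value listed in the same row.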
The third commonly used measure, which was not used during our procedure of developing the consensus methods, but which was used for their evaluation, is the Matthews correlation coefficient (MCC) [45]:

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

S_w and MCC were the measures used during CASP to assess disorder predictors.
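A minimal Python sketch of the MCC calculation follows. The function name `mcc` is hypothetical, and returning 0 when the denominator vanishes is a common convention assumed here, not something stated in the article.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0   # assumed convention for degenerate confusion matrices
    return (tp * tn - fp * fn) / denom
```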
Finally, we used our own measure, called S_ww, which combines AUC and the S_w score in the following way: it is calculated using the S_w formula, but the discrimination threshold is changed incrementally from 0 to 1, in steps of 0.01, giving sets of TP, TN, FP, FN values that are used to calculate a series of S_w scores. S_ww is the average value of these scores. This score was used only in the GSmetaDisorderMD2 method during CASP9.
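The threshold sweep can be sketched in Python as follows. The names `sw` and `sww` are hypothetical. Two details are assumptions not specified in the text: both endpoints of the 0 to 1 range are included (101 thresholds), and a residue is called disordered when its score is strictly above the threshold.

```python
def sw(tp, tn, fp, fn):
    """Weighted score S_w; assumes both classes are present."""
    n_dis, n_ord = tp + fn, tn + fp
    total = n_dis + n_ord
    w_dis, w_ord = n_ord / total, n_dis / total
    s = w_dis * tp - w_ord * fp + w_ord * tn - w_dis * fn
    return s / (w_dis * n_dis + w_ord * n_ord)

def sww(scores, labels, steps=100):
    """Average S_w over thresholds 0, 1/steps, ..., 1 (labels: 1 = disordered)."""
    values = []
    for i in range(steps + 1):
        threshold = i / steps
        tp = tn = fp = fn = 0
        for score, label in zip(scores, labels):
            predicted_disordered = score > threshold
            if predicted_disordered and label == 1:
                tp += 1
            elif predicted_disordered:
                fp += 1
            elif label == 1:
                fn += 1
            else:
                tn += 1
        values.append(sw(tp, tn, fp, fn))
    return sum(values) / len(values)
```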
The statistical significance of the evaluation scores was
determined by the bootstrap confidence interval method
[19,46]: 80% of the targets were randomly selected 1000
times, and the mean absolute error of scores was calcu-
lated. The ROC statistics were compared by using the
Wilcoxon signed rank test and by calculating standard
errors of ROC statistics.
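One possible reading of the resampling procedure is sketched below: 80% of the targets are drawn without replacement in each round, the score is recomputed on the subsample, and the mean absolute deviation from the full-set score is reported. This interpretation, the function name `subsample_error`, and the `score_fn` callback are assumptions made for illustration.

```python
import random

def subsample_error(targets, score_fn, fraction=0.8, rounds=1000, seed=0):
    """Return (full-set score, mean absolute deviation over subsamples).

    targets:  list of per-target data objects
    score_fn: callable mapping a list of targets to a single score
    """
    rng = random.Random(seed)
    k = max(1, round(fraction * len(targets)))
    full = score_fn(targets)
    deviations = [
        abs(score_fn(rng.sample(targets, k)) - full) for _ in range(rounds)
    ]
    return full, sum(deviations) / rounds
```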
Binary consensus and continuous consensus versions of
MetaDisorder predictors
In general, two categories of predictors exist. The simplest predictors are binary: they only classify the predicted feature into separate categories (here, disordered and ordered residues). More advanced methods return continuous scores, e.g. between 0 and 1, that indicate how certain the prediction is, and the final prediction is made according to an arbitrarily chosen threshold. The lower the threshold, the higher the number of both true and false positives. Accordingly, we initially constructed two versions of the MetaDisorder predictor, named BinCons and FloatCons. These two methods were tested within the framework of the CASP8 benchmark as groups 153 and 297, respectively [19]. BinCons uses only binary predictions from the primary methods: each disorder prediction for a residue is counted as 1 and each order prediction as 0.01 (0 was avoided to prevent possible division by zero). FloatCons uses all the information available: if a given method returns a continuous prediction, its score is used in the final consensus calculation. A consensus score for each residue is calculated by multiplying the score from each primary method by the accuracy of that method and summing the products. The result is normalized, i.e. the score is divided by the maximal possible score. For simplicity, the criterion of a method's accuracy used as the weight of the method was S_w calculated for our combined datasets. This was possible because S_w does not depend on the predictor output type.
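The weighted consensus for a single residue can be sketched as below. The function names `binarize` and `consensus_score` are hypothetical, and the maximal possible score is assumed to be the one obtained when every primary method returns 1.

```python
def binarize(is_disordered):
    """BinCons-style encoding: disorder counts as 1, order as 0.01."""
    return 1.0 if is_disordered else 0.01

def consensus_score(predictions, weights):
    """Accuracy-weighted, normalized consensus score for one residue.

    predictions: per-method scores in [0, 1] (binary or continuous)
    weights:     per-method accuracy, e.g. the S_w of each method
    """
    weighted_sum = sum(w * p for p, w in zip(predictions, weights))
    max_possible = sum(weights)      # reached if every method returns 1
    return weighted_sum / max_possible
```

For instance, two methods with weights 0.6 and 0.4, the first predicting disorder and the second predicting order, give a BinCons-style score of (0.6 * 1 + 0.4 * 0.01) / 1.0 = 0.604.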
In the next step, a special correcting function is used. It takes into account the fact that residues located in the protein termini are on average more disordered than residues in the middle of the protein chain. This function is based on the statistics of disorder presence in the 15 proximal residues calculated on both datasets and provides an appropriate corrective factor, by which the original predictive score is multiplied.

Table 3 Summary of the datasets employed in this study
 | DisProt + CASP7 | pdbRemark465 | CASP8
Number of proteins | 566 | 1147 | 122
Number of residues in disordered regions | 54,570 (23.45%) | 18,146 (6.28%) | 3,068 (11.11%)
Number of residues in ordered regions | 178,094 (76.55%) | 270,862 (93.72%) | 24,546 (88.89%)
Total number of residues | 232,664 | 289,008 | 27,614
Finally, the decision whether a residue is ordered or disordered is made. If a residue scores above the threshold, it is predicted as disordered; otherwise it is predicted as ordered. The threshold for classifying the residue as ordered or disordered was based on S_w scores obtained during 10-fold cross validation tests.
Additionally, at the end, a repairing procedure is employed to improve the prediction. A simple smoothing filter with a window of five residues is applied to the predicted string (e.g. DDD---D--..., with "D" indicating disorder and "-" indicating order). It eliminates short (up to 3 residues) stretches of predicted disorder within long regions of predicted order (converting the previous example to DDD------...).
GSmetaDisorder3D: a template-matching method
Apart from disorder predictors, many other bioinformatics tools yield implicit or explicit information about order and disorder. In the course of a variety of other protein sequence analysis projects, we realized that there is a clear correlation between disorder in the target protein sequence and the presence of gaps in alignments to structurally characterized templates calculated by protein fold-recognition methods. Although the implementation of a method utilizing this type of information may seem trivial, it was not straightforward to deal with different types of fold-recognition methods. In other words, it was not obvious which method should be used or, if many methods were used, how to rank them. Additionally, a template-matching method should be able to take into account the fact that matches to homologous proteins have different reliability and that in some cases homologous sequences cannot be found. To address all these questions, we compared the results from arbitrarily chosen fold-recognition methods that were relatively fast and performed well in the framework of CASP: HHSEARCH, FFAS, mGenThreader, PSI-BLAST, PHYRE, and PCONS5 (see Methods for details and references). To optimize the weights assigned to individual methods depending on the alignment quality, we used a genetic algorithm implemented in Pyevolve [47]. The genetic algorithm optimized a one-dimensional vector of length 24 (8 method variants, i.e. the six methods mentioned above with two of them run in two modes, multiplied by 3 thresholds for well-, moderately- and poorly-scored templates; see Table 4 for details of the thresholds used). In this way, the weights for all methods were obtained for further incorporation into a combined template-matching method. The resulting predictor was tested in CASP9 as group number 421 (GSmetaDisorder3D).
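The general idea of template matching can be illustrated with the Python sketch below. It is a hypothetical simplification, not the GSmetaDisorder3D implementation: the function names, the data layout, and the way weights are combined are all assumptions. Each alignment is assigned a quality class from its score using cutoffs of the kind listed in Table 4, and a residue receives a high disorder score when few highly weighted alignments cover it.

```python
def quality_class(score, good_cutoff, medium_cutoff, higher_is_better=True):
    """Classify an alignment as 'good', 'medium' or 'poor' from its score."""
    if not higher_is_better:
        # Flip signs so that the comparisons below work for E-value-like scores.
        score, good_cutoff, medium_cutoff = -score, -good_cutoff, -medium_cutoff
    if score > good_cutoff:
        return "good"
    if score > medium_cutoff:
        return "medium"
    return "poor"

def template_disorder_scores(length, alignments, weights):
    """Per-residue disorder score in [0, 1] from template coverage.

    alignments: list of (method, quality, covered_positions) tuples, where
                covered_positions is a set of 0-based target positions
                aligned to a template residue (i.e. not falling in a gap)
    weights:    dict mapping (method, quality) to a weight (hypothetical
                values; in the article such weights were optimized by a
                genetic algorithm)
    """
    total = sum(weights[(m, q)] for m, q, _ in alignments)
    scores = []
    for pos in range(length):
        covered = sum(weights[(m, q)] for m, q, cov in alignments if pos in cov)
        scores.append(1.0 - covered / total if total else 1.0)
    return scores
```

In this sketch a residue covered by no alignment at all receives the maximal score of 1, while a residue covered by every alignment receives 0.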
GSmetaDisorderMD and GSmetaDisorderMD2: combined disorder consensus and template-matching method
The next method in the MetaDisorder series, GSmetaDisorderMD, was developed by combining FloatCons (the consensus method with continuous scoring) with GSmetaDisorder3D (the method based on analysis of gaps in fold-recognition alignments). The same genetic algorithm was used as in the training of GSmetaDisorder3D, but a second dimension was added to the vector to optimize the relationship between these two components. This method was tested in CASP9 as group number 374.
GSmetaDisorderMD2 is a variant of GSmetaDisorderMD, in which the genetic algorithm used for training optimized the S_ww score instead of the S_w score. This predictor was tested in CASP9 as group number 147.
Implementation and availability
MetaDisorder is a web interface to our series of disorder meta-predictors and can be accessed at http://iimcb.genesilico.pl/metadisorder/. Wrappers and parsers for primary prediction methods were written in the Python programming language under the Unix system. Data are stored in a MySQL database. The web server was implemented using the mod_python Apache module. For the interactive presentation of results, the JavaScript chart library Highcharts [48] is used. Additionally, the results of analyses can also be obtained as simple text output (for details see Figure 1).
Results
Meta prediction of protein disorder from primary
disorder predictors
Motivated by the success of meta-prediction in various
fields of bioinformatics, we tested its applicability to the
prediction of disordered residues in protein sequences.
Table 4 Thresholds used in fold recognition programs for classification of potentially good, medium and poor alignments
Method | Predicted alignment quality: Good | Medium | Poor
PSI-BLAST* | <2e-06 | <0.023 | >0.023
FFAS | <-34.5 | <-8.5 | >-8.5
mGenThreader | >0.65 | >0.546 | <0.546
HHsearch* | >95 | >80 | <80
PCONS | >2.17 | >1.03 | <1.03
PHYRE | <0.085 | <0.27 | >0.27
* - the same score was used regardless of the database.
Initially, we developed the meta-predictors BinCons and FloatCons, which calculate a consensus score by taking into account the relative expected accuracies of the constituent primary methods (see Methods for details). BinCons and FloatCons were first benchmarked by ourselves on combined datasets consisting of CASP7 targets, the DisProt database, and the pdbRemark465 dataset obtained from a filtered PDB database (Table 5 and Figure 2, see Methods for details) and subsequently by independent assessors within the framework of the CASP8 experiment (Table 6) [19]. In both tests the BinCons and FloatCons meta-predictors performed considerably better than individual primary predictors (e.g. AUC of 0.868 for FloatCons and 0.843 for BinCons, compared to 0.829 and 0.830 for the top-performing primary predictors iPDA and VSL2, respectively, in our benchmark). The statistical significance of those results was assessed using the Wilcoxon signed rank test (for details see Additional file 2: Table S1). The overall difference in accuracy between these two meta-predictors was relatively small (2.9%), but statistically significant according to the Wilcoxon signed rank test. The difference between both meta-predictors and iPDA and VSL2 is also statistically significant. This exercise demonstrated that meta-prediction can significantly improve the inference of intrinsic disorder from protein sequence, but the use of continuous scores contributes little to that success over simple binary prediction.

Figure 1 MetaDisorder web-server interface. a) User-friendly web interface: the main plot can be easily zoomed in and out, and results reported by all primary methods can be downloaded in the CASP format. b) Simple text output format suitable for machine processing.

Table 5 Performance of disorder prediction on the combined pdbRemark465, CASP7 and DisProt dataset
Method | Sw | MCC | AUC
FloatCons | 0.608 ± 0.007 | 0.475 ± 0.008 | 0.868 ± 0.002
BinCons | 0.599 ± 0.007 | 0.487 ± 0.008 | 0.843 ± 0.003
iPDA | 0.555 ± 0.006 | 0.419 ± 0.006 | 0.829 ± 0.004
DISPROT (VSL2) | 0.539 ± 0.005 | 0.399 ± 0.005 | 0.830 ± 0.001
DISOPRED | 0.481 ± 0.006 | 0.436 ± 0.006 | 0.778 ± 0.003
POODLE-S | 0.474 ± 0.009 | 0.423 ± 0.010 | 0.828 ± 0.004
PrDOS | 0.469 ± 0.007 | 0.442 ± 0.008 | 0.810 ± 0.006
POODLE-L | 0.464 ± 0.010 | 0.397 ± 0.010 | 0.794 ± 0.004
RONN | 0.450 ± 0.006 | 0.350 ± 0.007 | 0.762 ± 0.006
IUPred (short) | 0.445 ± 0.006 | 0.412 ± 0.007 | 0.788 ± 0.002
DisPSSMP | 0.442 ± 0.012 | 0.377 ± 0.012 | 0.776 ± 0.004
IUPred (long) | 0.432 ± 0.008 | 0.392 ± 0.009 | 0.787 ± 0.004
Spritz (long) | 0.418 ± 0.009 | 0.377 ± 0.010 | -
Pdisorder | 0.383 ± 0.007 | 0.350 ± 0.007 | -
Dispro | 0.355 ± 0.006 | 0.411 ± 0.008 | -
Spritz (short) | 0.334 ± 0.007 | 0.306 ± 0.007 | -
DisEMBL | 0.289 ± 0.007 | 0.232 ± 0.006 | -
GlobPlot | 0.187 ± 0.004 | 0.172 ± 0.004 | -
The highest value for each score is shown in bold.

Figure 2 Receiver operating characteristics (ROC) plots and their area under curve (AUC) for disorder prediction methods used to construct the FloatCons meta-predictor for a combined dataset comprising DisProt, CASP7 targets and pdbRemark465. FPR values are presented on a logarithmic scale.
Gaps in fold recognition alignments provide useful
information for protein disorder prediction
Subsequently, we developed a primary disorder predictor, GSmetaDisorder3D, that uses information about the coverage of the target sequence by known protein structures, according to alignments reported by protein fold-recognition methods (hence, it is "primary" with respect to disorder prediction, but "meta" with respect to the utilization of other predictors). These methods aim at aligning target protein sequences to proteins with related structure. The lack of matches to known structures for a given sequence region may indicate the lack of detectable structured counterparts in the database, including cases of structural disorder. Figure 1b illustrates an example, where the paucity of matches to known structures reported by fold-recognition methods corresponds to a disordered region. GSmetaDisorder3D uses six different protein fold-recognition methods (with two of these run in two different modes). The selection of these tools was dictated by the methods' accuracy (according to CASP [49]), but also by speed, and by either availability for local installation or stability of a web service. One issue we had to address was the fact that each fold-recognition method typically generates up to ten alternative alignments that are scored differently and may exhibit different accuracy. There are many nonlinear aspects of these methods that should be taken into account when considering the prediction of disorder using information from homologous alignments. To address them, we employed a genetic algorithm. The fitness function was designed in such a way that it optimizes a vector of size 24, where triads of the vector elements represent the weights of the eight fold-recognition method variants for good, medium and poor quality alignments.
As can be seen in Table 6, GSmetaDisorder3D per-
forms better than many primary disorder prediction
methods, some of which use sophisticated machine
learning algorithms, although it does not outperform
them all. According to our benchmark, this method
achieved an area under the ROC curve (AUC) of 0.833 on
CASP8 targets (Table 7). This
indicates that the coverage of the target sequence by
known structures in fold-recognition alignments is a
good discriminator of protein order and disorder, but
Table 6 The results of our meta-predictors and top-scoring primary methods in CASP8 and CASP9

CASP8
Method              Sw               AUC              Sensitivity      Specificity
FloatCons           0.662 ± 0.048    0.908 ± 0.017    0.758 ± 0.048    0.904 ± 0.004
BinCons             0.661 ± 0.050    0.897 ± 0.021    0.741 ± 0.050    0.920 ± 0.003
DisoClust           0.644 ± 0.047    0.908 ± 0.018    0.727 ± 0.047    0.917 ± 0.004
MULTICOM            0.660 ± 0.039    0.896 ± 0.019    0.796 ± 0.039    0.864 ± 0.004
Mahmood-Torda       0.619 ± 0.061    0.918 ± 0.015    0.641 ± 0.061    0.978 ± 0.001
POODLE-L            0.588 ± 0.066    0.895 ± 0.021    0.646 ± 0.066    0.942 ± 0.004

CASP9
Method              Sw               AUC              Sensitivity      Specificity
FloatCons           0.427 ± 0.009    0.795 ± 0.011    0.574 ± 0.020    0.854 ± 0.009
GSmetaDisorder3D    0.391 ± 0.007    0.784 ± 0.012    0.411 ± 0.016    0.948 ± 0.008
GSmetaDisorderMD    0.476 ± 0.006    0.818 ± 0.008    0.654 ± 0.012    0.821 ± 0.010
GSmetaDisorderMD2   0.516 ± 0.010    0.841 ± 0.014    0.653 ± 0.013    0.860 ± 0.012
PrDOS2              0.509 ± 0.002    0.855 ± 0.010    0.609 ± 0.008    0.857 ± 0.003
MULTICOM-REFINE     0.500 ± 0.003    0.821 ± 0.008    0.651 ± 0.003    0.851 ± 0.004

The highest value for each score is shown in bold.
Table 7 The results of evaluation of GSmetaDisorder3D, GSmetaDisorderMD and GSmetaDisorderMD2 on CASP8 targets

                    Evaluation score
Method              MCC              Sw               AUC
FloatCons           0.654 ± 0.041    0.606 ± 0.023    0.904 ± 0.009
GSmetaDisorder3D    0.589 ± 0.047    0.519 ± 0.024    0.833 ± 0.014
GSmetaDisorderMD    0.558 ± 0.034    0.684 ± 0.023    0.927 ± 0.011
GSmetaDisorderMD2   0.607 ± 0.042    0.684 ± 0.022    0.929 ± 0.017

Kozlowski and Bujnicki BMC Bioinformatics 2012, 13:111 Page 7 of 11
http://www.biomedcentral.com/1471-2105/13/111
alone it is not sufficient to predict protein disorder as
well as the top disorder prediction methods.
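For readers unfamiliar with the AUC values quoted above and in Tables 6 and 7, the following sketch shows one standard way to compute the area under the ROC curve from per-residue labels and scores, using its equivalence to the Mann-Whitney U statistic. The labels and scores are made-up toy data, and this is not necessarily the implementation used in the benchmark.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the Mann-Whitney U statistic.

    labels: 1 for disordered residues, 0 for ordered residues.
    scores: predicted disorder score per residue (higher = more disordered).
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count disordered/ordered pairs that are ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 4 disordered and 4 ordered residues, one ordered residue scored too high.
labels = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.6, 0.4, 0.3, 0.7, 0.8, 0.9]
auc = roc_auc(labels, scores)   # 15 of 16 pairs are ranked correctly -> 0.9375
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of ordered and disordered residues. The pairwise loop is quadratic and is meant only to make the definition explicit.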
Fold-recognition analysis adds value to consensus
disorder prediction
GSmetaDisorder3D was not intended to serve as an
independent predictor, but as a complement to other
methods based on different principles. It has been com-
bined with the consensus meta-predictor FloatCons into
a meta-predictor named GSmetaDisorderMD. According
to an in-house benchmark and CASP9, GSmetaDisor-
derMD outperforms FloatCons by 2-4%, depending on
the dataset used for testing (see Table 6 and Table 7 for
numeric details). It must be emphasized that this
method was tested only on CASP targets (with ten-fold
cross-validation across residues), because predictions from
all primary methods were available only for them.
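The cross-validation "across residues" mentioned above can be sketched as follows, assuming it means that individual residues (rather than whole proteins) are dealt into ten folds. The function names and the fixed seed are illustrative choices, not taken from the article.

```python
import random
from typing import List, Tuple


def residue_folds(n_residues: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Shuffle residue indices and deal them into k nearly equal folds."""
    indices = list(range(n_residues))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_test_splits(folds: List[List[int]]) -> List[Tuple[List[int], List[int]]]:
    """Use each fold once as the test set and the remaining folds as training data."""
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


folds = residue_folds(1003)          # 1003 is an arbitrary toy number of residues
splits = train_test_splits(folds)
```

With a residue-level split, residues from the same protein can end up in both the training and the test set of a given round, which is worth keeping in mind when interpreting such estimates.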
We have also developed and tested a minor variant of this
method, dubbed GSmetaDisorderMD2, trained with the
use of the Sww score instead of the Sw score as the target
function. This modification brought about a small but sig-
nificant improvement in the prediction quality, especially if
we consider the results from CASP9 (AUC = 0.841 and
0.818 for GSmetaDisorderMD2 and GSmetaDisorderMD,
respectively).
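The relation between the Sw score and the sensitivity and specificity columns of Table 6 can be checked with a short calculation. The sketch below assumes that Sw has the class-balanced form used in CASP disorder assessments, in which each class is weighted inversely to its size; under that assumption Sw reduces to sensitivity + specificity - 1. The formula is not restated in this section, so treat it as an assumption. The Sww variant is not defined here and is therefore not reproduced. The confusion counts are hypothetical; the table rows are the CASP8 values from Table 6.

```python
def s_w(tp: int, fp: int, tn: int, fn: int) -> float:
    """Weighted score with class weights inversely proportional to class size (assumed form)."""
    n_dis, n_ord = tp + fn, tn + fp
    w_dis, w_ord = n_ord, n_dis          # the rarer class receives the larger weight
    num = w_dis * tp - w_ord * fp + w_ord * tn - w_dis * fn
    return num / (w_dis * n_dis + w_ord * n_ord)


# Hypothetical confusion counts: 100 disordered and 1000 ordered residues.
tp, fp, tn, fn = 75, 90, 910, 25
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
identity_gap = abs(s_w(tp, fp, tn, fn) - (sensitivity + specificity - 1.0))

# CASP8 rows of Table 6: method -> (Sw, sensitivity, specificity)
table6_casp8 = {
    "FloatCons": (0.662, 0.758, 0.904),
    "BinCons": (0.661, 0.741, 0.920),
    "DisoClust": (0.644, 0.727, 0.917),
    "MULTICOM": (0.660, 0.796, 0.864),
    "Mahmood-Torda": (0.619, 0.641, 0.978),
    "POODLE-L": (0.588, 0.646, 0.942),
}
max_gap = max(abs(sw - (se + sp - 1.0)) for sw, se, sp in table6_casp8.values())
```

For all six CASP8 rows the reported Sw agrees with sensitivity + specificity - 1 to the three decimals shown, which is consistent with the assumed form. Several CASP9 rows deviate from this relation by more than rounding error.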
Discussion
Consensus predictions are practically useful: they are
significantly better than primary predictors
The development of meta-predictors is often criticized as
a parasitic approach that discourages the development of
primary methods and does not improve our understand-
ing of the underlying biological processes. In this article
we have described not only a series of meta-methods that
use other developers' methods, but also a novel primary
method based on a different principle, which does not
"beat" other primary algorithms in a head-to-head com-
parison, but is sufficiently different that its inclusion
improves meta-prediction by a few percent. Thus, we
argue that the development of meta-servers can actually
positively influence the development of methods that are
based on novel principles and that it can highlight the util-
ity of new algorithms even if they do not "win" the compe-
tition on the basic level. On the other hand, our
benchmarks demonstrate that many "old" methods are
still useful in terms of contribution of important informa-
tion that can be used for meta-prediction, and that meta-
predictors can incorporate them as "building blocks" into
a practically useful bioinformatics service.
The key conclusion from our work is that even a very
simple weighted consensus (the BinCons and FloatCons pre-
dictors) is able to improve disorder prediction over pri-
mary methods, resulting in a more robust and accurate
prediction, as assessed according to both the Sw score and
AUC. As can be concluded from the data presented in Table 5
and Table 6, regardless of the type of score and dataset
used, consensus methods performed comparatively well
both in our in-house benchmark and in CASP [19]. The
most advanced and best-performing meta-predictors
described in this manuscript use machine learning to de-
rive the best features from the primary predictors avail-
able. They outperformed consensus predictors based on
simply averaging the outputs of the primary predictors.
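A weighted consensus of the kind discussed above can be sketched in a few lines of Python. The sketch assumes that the continuous variant averages per-residue scores weighted by each method's accuracy, and that the binary variant applies the same weighting to thresholded calls. The predictor names, accuracies, cutoffs and scores are hypothetical, and the exact formulas behind FloatCons and BinCons may differ.

```python
from typing import Dict, List


def float_consensus(scores: Dict[str, List[float]], accuracy: Dict[str, float]) -> List[float]:
    """Accuracy-weighted mean of continuous per-residue disorder scores."""
    total = sum(accuracy[m] for m in scores)
    length = len(next(iter(scores.values())))
    return [sum(accuracy[m] * scores[m][i] for m in scores) / total for i in range(length)]


def bin_consensus(scores: Dict[str, List[float]], accuracy: Dict[str, float],
                  cutoffs: Dict[str, float]) -> List[float]:
    """Accuracy-weighted vote over binary calls (score >= method-specific cutoff)."""
    binary = {m: [1.0 if s >= cutoffs[m] else 0.0 for s in v] for m, v in scores.items()}
    return float_consensus(binary, accuracy)


# Three hypothetical predictors scoring four residues.
scores = {"A": [0.9, 0.2, 0.6, 0.1], "B": [0.8, 0.4, 0.3, 0.2], "C": [0.7, 0.6, 0.7, 0.0]}
accuracy = {"A": 0.8, "B": 0.7, "C": 0.6}
cutoffs = {"A": 0.5, "B": 0.5, "C": 0.5}

fc = float_consensus(scores, accuracy)
bc = bin_consensus(scores, accuracy, cutoffs)
calls = [v >= 0.5 for v in bc]       # final binary order/disorder assignment
```

The weighting lets a more accurate method outvote a less accurate one: at the third residue, predictors A and C together outweigh B, so the residue is called disordered.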
Consensus predictions improve other methods'
predictions. Where does the improvement come from?
Consensus predictors are more robust than the primary pre-
dictors they are based on. They give fewer false positives,
and on average their predictions are more definite.
Primary predictors differ from each other, and in
a collective prediction their different strengths can be
combined and/or their different weaknesses can be
eliminated. First, different datasets are used for training,
biasing the prediction towards (or against) certain types
of proteins with particular features. For instance, the use
of proteins from the PDB eliminates all proteins that are
so disordered that their structure cannot be determined,
while the use of proteins from DisProt implies the reli-
ance on low-resolution experimental data that blurs the
boundary between order and disorder. Second, different
machine learning techniques are used that can be more
or less accurate under different circumstances. Typically,
the impact of the machine learning algorithm used or
the parameters chosen for the training of a given pre-
dictor is not clear, as comprehensive evaluation of vari-
ous machine-learning methods with respect to a
particular dataset is rarely performed and described.
Hence, each primary predictor can be viewed as an in-
stantiation of its developers' expertise and ideas with re-
spect to dataset preparation, invention of new
algorithms and/or use of machine learning, which is never
fully optimal with respect to all relevant parameters. A
successful meta-predictor based on a machine-learning
approach is able to synthesize the abilities of the
primary methods, and in our opinion the greatest im-
provement comes from eliminating their individual defi-
ciencies rather than from exploiting their individual
unusual strengths.
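The argument that a consensus removes individual deficiencies can be illustrated with a toy simulation. It assumes 13 predictors whose errors are independent Gaussian noise around a common true signal. This is an idealization, since real predictors share training data and features and their errors are correlated, so the gain in practice is smaller. All parameters are arbitrary illustrative choices.

```python
import random
from typing import List, Tuple


def simulate(n_residues: int = 2000, n_predictors: int = 13,
             noise: float = 0.5, seed: int = 1) -> Tuple[List[float], float]:
    """Compare the accuracy of each noisy predictor with that of their unweighted mean."""
    rng = random.Random(seed)
    truth = [rng.random() < 0.2 for _ in range(n_residues)]      # about 20% disordered
    # Each predictor reports the true signal (0.75 or 0.25) plus its own independent noise.
    preds = [[(0.75 if t else 0.25) + rng.gauss(0.0, noise) for t in truth]
             for _ in range(n_predictors)]

    def accuracy(scores: List[float]) -> float:
        return sum((s >= 0.5) == t for s, t in zip(scores, truth)) / n_residues

    individual = [accuracy(p) for p in preds]
    mean_scores = [sum(col) / n_predictors for col in zip(*preds)]
    return individual, accuracy(mean_scores)


individual, consensus = simulate()
```

Because independent errors partly cancel when averaged, the consensus is more accurate than every individual predictor in this setting, even though no single predictor has an unusual strength.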
Deficiencies of the meta-server approach for disorder
prediction
Disorder predictors developed in this work were care-
fully benchmarked against many other methods, using
several different datasets as a reference, including the
blind tests of CASP8 and CASP9, where they always
ranked among top contenders. It is unfortunately impos-
sible to compare these methods to all the published dis-
order predictors (as of December 2011, over 60 methods
can be found in the literature and on the web), as not all
of them are freely available as servers or standalone
tools, and not all of them participate in CASP.
Another problem in benchmarking bioinformatics
methods is that almost all of them use as an initial step
a similarity search over some protein sequence database
(usually with the PSI-BLAST [37] method). These data-
bases are constantly updated. For this reason it is not
entirely fair to compare our predictors with other meth-
ods, unless they are installed locally and use the same
databases. Hence, we could not directly compare our
method to many new methods. For example, the
MFDp meta-predictor [50] can be installed locally,
but it depends on more than ten third-party pro-
grams (e.g. HHsearch [34]), which use their own
databases. A fair comparison of MFDp and MetaDisorder
methods would require e.g. the availability of HHsearch
HMM-profile databases from 2008 and 2010 and others,
which are unfortunately not available.
The problem with local benchmarks mentioned above
emphasizes the importance of CASP experiments.
There, the contenders cannot control the dataset used
for testing the methods, and the problem with biological
database content is alleviated, as all methods are allowed
to use the most up-to-date sequence databases (whether
they actually use the full potential of the availability of
these databases is another question). Hence, it should be
stressed that the presented series of methods was
developed, tested, and improved through two editions of
CASP, and was found to be superior to other methods in
these fair competitions.
MetaDisorder is relatively slow, as it depends on more
than 20 programs, which are not very fast even if in-
stalled locally. Some of them search big databases and/
or are not parallelized. For instance, the generation of
alignments by fold-recognition methods can take more
than an hour for long sequences. In the case of web
services hosted on third-party servers, the response
may be delayed for reasons that are beyond the control
of the meta-predictor (e.g. server crash). A significant
speed-limiting factor in our GSmetaDisorder3D method
is the use of the PCONS5 algorithm, which is a fold-
recognition meta-predictor run only when all primary
fold-recognition methods have returned their alignments and
the corresponding 3D models have been generated by MODELLER.
Despite these performance drawbacks, the MetaDisorder
web server is typically able to calculate final predictions
within minutes to a few hours, depending on se-
quence length.
Probably the most serious problem in disorder prediction
is that the binary classification of residues into the ordered
or disordered state is very simplistic. "Disorder" is not a
single state, but in fact represents a whole range of bio-
physical characteristics that can be captured by different
experimental techniques. It has been shown that disorder
predictors trained on proteins with one type of dis-
order often achieve poor accuracy on proteins with disorder
of a different type, which has led to the definition of
"flavors" of disorder, characterized by differences in sequence
properties [51]. There are certain classes of disorder for
which specialized predictors have been developed, for in-
stance short vs. long disorder [28,29], and prediction of
protein-binding regions in disordered proteins [52]. The
use of a meta-server makes it possible not only to combine predic-
tions of different flavors of disorder into one consensus
prediction, but also to collect and display these different
predictions next to each other, allowing the human user to
make an informed functional interpretation. On the other
hand, the collection of results obtained by multiple meth-
ods can be overwhelming for a lay user. Clearly, there is a
need to develop a more clear-cut classification of disorder
that would capture functional features correlated with se-
quence features that can be used by machine learning
methods in the development of multi-state disorder predic-
tors. Current efforts towards the development of a disorder
ontology (http://www.disprot.org/idpo.obo) and new classi-
fication schemes (e.g. by the ch-cdf plot method [53]) are
expected to help in the development of multi-class
predictors.
Conclusions
The meta-approach allows the consolidation of pre-
existing knowledge to obtain more robust and accurate
predictions than with the use of primary predictors. We
developed one primary disorder meta-predictor and a
series of disorder meta-predictors that use different sets of
primary predictors, and tested their performance on dif-
ferent datasets. The most important evaluation of the pre-
dictors' accuracy was in the blind tests of CASP8 and CASP9.
In both cases, our meta-predictors were found to be super-
ior to all primary methods and other meta-
predictors. Currently, our MetaDisorder web service allows
users to run more than 20 bioinformatics tools (in-
cluding primary disorder predictors, secondary structure
predictors, and fold-recognition methods), and to analyze
the summary of results via a user-friendly interface.
Additional files
Additional file 1: 1147 sequences with their definitions of being
disordered/ordered extracted from PDB files according to
REMARK 465.
Additional file 2: Table S1. Results of the Wilcoxon Signed-Rank Two-
Sided Tests for the AUC scores on a dataset combining the CASP7, DISPROT and
pdbRemark465 datasets.
Competing interests
The authors declare that they have no competing interests.
Acknowledgements
Our consensus methods could not be developed without the availability of
third-party methods and servers. We would like to thank all developers for
kindly making their programs freely available. We also thank Peter Tompa,
Keith Dunker, and Monika Fuxreiter for stimulating discussions. LPK was
supported by the Polish Ministry of Science and Higher Education (grant
NN301 190139). JMB was supported by the European Union (project Health-
Prot, contract number 229676), and by the Polish Ministry of Science and
Higher Education (grant number POIG.02.03.00-00-003/09).
Authors' contributions
LPK collected all data, carried out calculations, developed programs and web
interface and drafted the manuscript. JMB conceived of the project and
edited the manuscript. Both authors read and approved the final manuscript.
Received: 29 December 2011 Accepted: 26 April 2012
Published: 24 May 2012
References
1. Dunker AK, Oldfield CJ, Meng J, Romero P, Yang JY, Chen JW, Vacic V,
Obradovic Z, Uversky VN: The unfoldomics decade: an update on
intrinsically disordered proteins. BMC Genomics 2008, 9(Suppl 2):S1.
2. Tompa P, Fuxreiter M: Fuzzy complexes: polymorphism and structural
disorder in protein-protein interactions. Trends Biochem Sci 2008,
33(1):2-8.
3. Zhang Y, Stec B, Godzik A: Between order and disorder in protein
structures: analysis of "dual personality" fragments in proteins. Structure
2007, 15(9):1141-1147.
4. Fuxreiter M, Tompa P, Simon I: Local structural disorder imparts plasticity
on linear motifs. Bioinformatics 2007, 23(8):950-956.
5. Haynes C, Oldfield CJ, Ji F, Klitgord N, Cusick ME, Radivojac P, Uversky VN,
Vidal M, Iakoucheva LM: Intrinsic disorder is a common feature of hub
proteins from four eukaryotic interactomes. PLoS Comput Biol 2006,
2(8):e100.
6. Bernado P, Mylonas E, Petoukhov MV, Blackledge M, Svergun DI: Structural
characterization of flexible proteins using small-angle X-ray scattering.
J Am Chem Soc 2007, 129(17):5656-5664.
7. Ferreon AC, Moran CR, Gambin Y, Deniz AA: Single-molecule fluorescence
studies of intrinsically disordered proteins. Methods Enzymol 2010,
472:179-204.
8. Meier S, Blackledge M, Grzesiek S: Conformational distributions of
unfolded polypeptides from novel NMR techniques. J Chem Phys 2008,
128(5):052204.
9. Receveur-Brechot V, Bourhis JM, Uversky VN, Canard B, Longhi S: Assessing
protein disorder and induced folding. Proteins 2006, 62(1):24-45.
10. Uversky VN: The mysterious unfoldome: structureless, underappreciated,
yet vital part of any given proteome. J Biomed Biotechnol 2010,
2010:568068.
11. Sickmeier M, Hamilton JA, LeGall T, Vacic V, Cortese MS, Tantos A, Szabo B,
Tompa P, Chen J, Uversky VN, et al: DisProt: the Database of Disordered
Proteins. Nucleic Acids Res 2007, 35(Database issue):D786-793.
12. Kurowski MA, Bujnicki JM: GeneSilico protein structure prediction meta-
server. Nucleic Acids Res 2003, 31(13):3305-3307.
13. Friedberg I, Harder T, Godzik A: JAFA: a protein function annotation meta-
server. Nucleic Acids Res 2006, 34(Web Server issue):W379-381.
14. Saini HK, Fischer D: Meta-DP: domain prediction meta-server.
Bioinformatics 2005, 21(12):2917-2920.
15. Pawlowski M, Gajda MJ, Matlak R, Bujnicki JM: MetaMQAP: a meta-server
for the quality assessment of protein models. BMC Bioinformatics 2008,
9(1):403.
16. Schlessinger A, Punta M, Yachdav G, Kajan L, Rost B: Improved disorder
prediction by combination of orthogonal approaches. PLoS One 2009,
4(2):e4433.
17. Ishida T, Kinoshita K: Prediction of disordered regions in proteins based
on the meta approach. Bioinformatics 2008, 24(11):1344-1348.
18. Xue B, Dunbrack RL, Williams RW, Dunker AK, Uversky VN: PONDR-FIT: a
meta-predictor of intrinsically disordered amino acids. Biochim Biophys
Acta 2010, 1804(4):996-1010.
19. Noivirt-Brik O, Prilusky J, Sussman JL: Assessment of disorder predictions in
CASP8. Proteins 2009, 77(Suppl 9):210-216.
20. Berman HM, Bhat TN, Bourne PE, Feng Z, Gilliland G, Weissig H, Westbrook
J: The Protein Data Bank and the challenge of structural genomics. Nat
Struct Biol 2000, 7(Suppl):957-959.
21. Linding R, Jensen LJ, Diella F, Bork P, Gibson TJ, Russell RB: Protein disorder
prediction: implications for structural proteomics. Structure 2003,
11(11):1453-1459.
22. Ward JJ, McGuffin LJ, Bryson K, Buxton BF, Jones DT: The DISOPRED server
for the prediction of protein disorder. Bioinformatics 2004,
20(13):2138-2139.
23. Medina MW, Gao F, Naidoo D, Rudel LL, Temel RE, McDaniel AL, Marshall
SM, Krauss RM: Coordinately regulated alternative splicing of genes
involved in cholesterol biosynthesis and uptake. PLoS ONE 2011,
6(4):e19420.
24. Linding R, Russell RB, Neduva V, Gibson TJ: GlobPlot: Exploring protein
sequences for globularity and disorder. Nucleic Acids Res 2003,
31(13):3701-3708.
25. Su CT, Chen CY, Hsu CM: iPDA: integrated protein disorder analyzer.
Nucleic Acids Res 2007, 35(Web Server issue):W465-472.
26. Dosztanyi Z, Csizmok V, Tompa P, Simon I: IUPred: web server for the
prediction of intrinsically unstructured regions of proteins based on
estimated energy content. Bioinformatics 2005, 21(16):3433-3434.
27. SoftBerry - PDISORDER: [http://linux1.softberry.com/berry.phtml?topic=pdisorder&group=programs&subgroup=propt]
28. Shimizu K, Hirose S, Noguchi T: POODLE-S: web application for predicting
protein disorder by using physicochemical features and reduced amino
acid set of a position-specific scoring matrix. Bioinformatics 2007,
23(17):2337-2338.
29. Hirose S, Shimizu K, Kanai S, Kuroda Y, Noguchi T: POODLE-L: a two-level
SVM prediction system for reliably predicting long disordered regions.
Bioinformatics 2007, 23(16):2046-2053.
30. Ishida T, Kinoshita K: PrDOS: prediction of disordered protein regions
from amino acid sequence. Nucleic Acids Res 2007, 35(Web Server issue):
W460-464.
31. Vullo A, Bortolami O, Pollastri G, Tosatto SC: Spritz: a server for the
prediction of intrinsically disordered regions in protein sequences using
kernel machines. Nucleic Acids Res 2006, 34(Web Server issue):W164-168.
32. Su CT, Chen CY, Ou YY: Protein disorder prediction by condensed PSSM
considering propensity for order or disorder. BMC Bioinformatics 2006,
7:319.
33. Yang ZR, Thomson R, McNeil P, Esnouf RM: RONN: the bio-basis function
neural network technique applied to the detection of natively
disordered regions in proteins. Bioinformatics 2005, 21(16):3369-3376.
34. Soding J: Protein homology detection by HMM-HMM comparison.
Bioinformatics 2005, 21(7):951-960.
35. Jaroszewski L, Rychlewski L, Li Z, Li W, Godzik A: FFAS03: a server for
profile-profile sequence alignments. Nucleic Acids Res 2005,
33(Web Server issue):W284-288.
36. Alber F, Dokudovskaya S, Veenhoff LM, Zhang W, Kipper J, Devos D,
Suprapto A, Karni-Schmidt O, Williams R, Chait BT, et al: The molecular
architecture of the nuclear pore complex. Nature 2007, 450(7170):695-701.
37. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ:
Gapped BLAST and PSI-BLAST: a new generation of protein database
search programs. Nucleic Acids Res 1997, 25(17):3389-3402.
38. Lareau LF, Inada M, Green RE, Wengrod JC, Brenner SE: Unproductive
splicing of SR genes associated with highly conserved and
ultraconserved DNA elements. Nature 2007, 446(7138):926-929.
39. Wallner B, Elofsson A: Pcons5: combining consensus, structural evaluation
and fold recognition scores. Bioinformatics 2005, 21(23):4248-4254.
40. Sali A, Potterton L, Yuan F, van Vlijmen H, Karplus M: Evaluation of
comparative protein modeling by MODELLER. Proteins 1995,
23(3):318-326.
41. Cuff JA, Barton GJ: Application of multiple sequence alignment profiles to
improve protein secondary structure prediction. Proteins 2000, 40(3):502-511.
42. McGuffin LJ, Bryson K, Jones DT: The PSIPRED protein structure prediction
server. Bioinformatics 2000, 16(4):404-405.
43. Wang G, Dunbrack RL Jr: PISCES: recent improvements to a PDB
sequence culling server. Nucleic Acids Res 2005,
33(Web Server issue):W94-98.
44. Jin Y, Dunbrack RL Jr: Assessment of disorder predictions in CASP6.
Proteins 2005, 61(Suppl 7):167-175.
45. Matthews BW: Comparison of the predicted and observed secondary
structure of T4 phage lysozyme. Biochim Biophys Acta 1975, 405(2):442-451.
46. Carpenter J, Bithell J: Bootstrap confidence intervals: when, which, what?
A practical guide for medical statisticians. Stat Med 2000, 19(9):1141-1164.
47. Butterfield A, Vedagiri V, Lang E, Lawrence C, Wakefield MJ, Isaev A, Huttley
GA: PyEvolve: a toolkit for statistical modelling of molecular evolution.
BMC Bioinformatics 2004, 5:1.
48. HighCharts JS: [http://www.highcharts.com/]
49. Cozzetto D, Kryshtafovych A, Fidelis K, Moult J, Rost B, Tramontano A:
Evaluation of template-based models in CASP8 with standard measures.
Proteins 2009, 77(Suppl 9):18-28.
50. Mizianty MJ, Stach W, Chen K, Kedarisetti KD, Disfani FM, Kurgan L:
Improved sequence-based prediction of disordered regions with
multilayer fusion of multiple information sources. Bioinformatics 2010,
26(18):i489-496.
51. Vucetic S, Brown CJ, Dunker AK, Obradovic Z: Flavors of protein disorder.
Proteins 2003, 52(4):573-584.
52. Dosztanyi Z, Meszaros B, Simon I: ANCHOR: web server for predicting
protein binding regions in disordered proteins. Bioinformatics 2009,
25(20):2745-2746.
53. Huang F, Oldfield C, Meng J, Hsu WL, Xue B, Uversky VN, Romero P, Dunker
AK: Subclassifying disordered proteins by the ch-cdf plot method.
Pac Symp Biocomput 2012, 17:128-139.
doi:10.1186/1471-2105-13-111
Cite this article as: Kozlowski and Bujnicki: MetaDisorder: a meta-server
for the prediction of intrinsic disorder in proteins. BMC Bioinformatics
2012 13:111.
Submit your next manuscript to BioMed Central
and take full advantage of:
Convenient online submission
Thorough peer review
No space constraints or color figure charges
Immediate publication on acceptance
Inclusion in PubMed, CAS, Scopus and Google Scholar
Research which is freely available for redistribution
Submit your manuscript at
www.biomedcentral.com/submit
Kozlowski and Bujnicki BMC Bioinformatics 2012, 13:111 Page 11 of 11
http://www.biomedcentral.com/1471-2105/13/111
... Cic, on the other hand, has two DNA-binding regions, a HMG-box complemented by a second DNA binding domain called C1 (the structures of human Cic DNA binding regions have recently been solved (46)). Since no experimental data is available for the other regions of these proteins, we tested the hypothesis that they might contain disordered regions using different IDR predictors, including Spot-dis (47), AUCpreD (48), MetaDisorder (49) and MobiDB-lite (50). In addition, the AI system AlphaFold (which predicts the 3D structure of proteins from their primary amino acid sequence) was used to infer disorder based on the pLDDT score, a per-residue confidence measure. ...
... The copyright holder for this preprint this version posted February 2, 2024. ; https://doi.org/10.1101/2024.01.30.578077 doi: bioRxiv preprint the meta model Metadisorder (49) which combines multiple predictive algorithms to report a final consensus (86). We also used MobiDB-lite (50), another meta model developed after the analysis by Nielsen et al. was published, and AlphaFold (51,87). ...
Preprint
Full-text available
Transcription factors play an essential role in pattern formation during early embryo development, generating a strikingly fast and precise transcriptional response that results in sharp gene expression boundaries. To characterize the steps leading up to transcription, we performed a side-by-side comparison of the nuclear dynamics of two morphogens, a transcriptional activator, Bicoid (Bcd), and a transcriptional repressor, Capicua (Cic), both involved in body patterning along the anterior-posterior axis of the early Drosophila embryo. We used a combination of fluorescence recovery after photobleaching, fluorescence correlation spectroscopy, and single particle tracking to access a wide range of dynamical timescales. Despite their opposite effects on gene transcription, we find that Bcd and Cic have very similar nuclear dynamics, characterized by the co-existence of a freely diffusing monomer population with a number of oligomeric clusters, which range from low stoichiometry and high mobility clusters to larger, DNA-bound hubs. Our observations are consistent with the inclusion of both Bcd and Cic into transcriptional hubs or condensates, while putting constraints on the mechanism by which these form. These results fit in with the recent proposal that many transcription factors might share a common search strategy for target genes regulatory regions that makes use of their large unstructured regions, and may eventually help explain how the transcriptional response they elicit can be at the same time so fast and so precise. SIGNIFICANCE By conducting a comparative study of the nuclear dynamics of Bicoid (a transcriptional activator) and Capicua (a transcriptional repressor) in the Drosophila embryo, we have uncovered a striking similarity in their behaviours. Despite their divergent roles in transcription, both proteins have a propensity to form oligomeric species ranging from highly mobile, low stoichiometry clusters to larger, DNA-bound hubs. 
Such findings impose new constraints on the existing models of gene regulation by transcription factors, particularly in aspects related to target search and oligomeric binding to gene regulatory regions needed to explain the rapid and precise transcriptional response observed in developmental processes.
... Over the past two decades, numerous servers and predictors have emerged to provide the propensity of amino acids to be disordered in proteins or entire proteomes (Liu et al., 2019). Some routinely used predictors include PONDR , PONDR VLXT , PONDR VSL2 , PONDR VL3 , PONDR FIT , IUPred (Dosztányi et al., 2005), FoldIndex (Prilusky et al., 2005), MobiDB (Piovesan et al., 2021), PrDOS (Ishida and Kinoshita, 2007), MetaDisorder (Kozlowski and Bujnicki, 2012), and DisEMBL (Linding et al., 2003). Additionally, specialized predictors like ANCHOR (Dosztányi et al., 2009), MoRFpred (Disfani et al., 2012), MoRFchibi_web (Malhis et al., 2016), DISOPRED3 (Jones and Cozzetto, 2015), DRNAPred (Yan and Kurgan, 2017), Dis-oRDPbind (Peng et al., 2017), and PPRInt (Kumar et al., 2008) have been developed to identify motifs within disordered regions that can bind to other proteins or nucleic acids. ...
... Over the past two decades, numerous servers and predictors have emerged to provide the propensity of amino acids to be disordered in proteins or entire proteomes (Liu et al., 2019). Some routinely used predictors include PONDR , PONDR VLXT , PONDR VSL2 , PONDR VL3 , PONDR FIT , IUPred (Dosztányi et al., 2005), FoldIndex (Prilusky et al., 2005), MobiDB , PrDOS (Ishida & Kinoshita, 2007), MetaDisorder (Kozlowski & Bujnicki, 2012), and DisEMBL (Linding et al., 2003). Additionally, specialized predictors like ANCHOR (Dosztányi et al., 2009), MoRFpred (Disfani et al., 2012), MoRFchibi_web (Malhis et al., 2016), DISOPRED3 (Jones & Cozzetto, 2015), DRNAPred (Yan & Kurgan, 2017), DisoRDPbind (Peng et al., 2017), and PPRInt (Kumar et al., 2008) have been developed to identify motifs within disordered regions that can bind to other proteins or nucleic acids. ...
Preprint
Full-text available
Eukaryotic proteins often feature long stretches of amino acids that lack a well-defined three-dimensional structure and are referred to as intrinsically disordered proteins (IDPs) or regions (IDRs). Although these proteins challenge conventional structure-function paradigms, they play vital roles in cellular processes. Recent progress in experimental techniques, such as NMR spectroscopy, single molecule FRET, high speed AFM and SAXS, have provided valuable insights into the biophysical basis of IDP function. This review discusses the advancements made in these techniques particularly for the study of disordered regions in proteins. In NMR spectroscopy new strategies such as ¹³C detection, non-uniform sampling, segmental isotope labeling, and rapid data acquisition methods address the challenges posed by spectral overcrowding and low stability of IDPs. The importance of various NMR parameters, including chemical shifts, hydrogen exchange rates, and relaxation measurements, to reveal transient secondary structures within IDRs and IDPs are presented. Given the high flexibility of IDPs, the review outlines NMR methods for assessing their dynamics at both fast (ps-ns) and slow (μs-ms) timescales. IDPs exert their functions through interactions with other molecules such as proteins, DNA, or RNA. NMR-based titration experiments yield insights into the thermodynamics and kinetics of these interactions. Detailed study of IDPs requires multiple experimental techniques, and thus, several methods are described for studying disordered proteins, highlighting their respective advantages and limitations. The potential for integrating these complementary techniques, each offering unique perspectives, is explored to achieve a comprehensive understanding of IDPs.
... We used MetaDisorder [33] to predict disordered regions, in accordance with the principles described in [34], and DeepCoil [35] to predict coiled-coil regions. ...
Article
Full-text available
Viruses frequently contain overlapping genes, which encode functionally unrelated proteins from the same DNA or RNA region but in different reading frames. Yet, overlapping genes are often overlooked during genome annotation, in particular in DNA viruses. Here we looked for the presence of overlapping genes likely to encode a functional protein in human parvovirus B19 (genus Erythroparvovirus), using an experimentally validated software, Synplot2. Synplot2 detected an open reading frame, X, conserved in all erythroparvoviruses, which overlaps the VP1 capsid gene and is under highly significant selection pressure. In a related virus, human parvovirus 4 (genus Tetraparvovirus), Synplot2 also detected an open reading frame under highly significant selection pressure, ARF1, which overlaps the VP1 gene and is conserved in all tetraparvoviruses. These findings provide compelling evidence that the X and ARF1 proteins must be expressed and functional. X and ARF1 have the exact same location (they overlap the region of the VP1 gene encoding the phospholipase A2 domain), are both in the same frame (+1) with respect to the VP1 frame, and encode proteins with similar predicted properties, including a central transmembrane region. Further studies will be needed to determine whether they have a common origin and similar function. X and ARF1 are probably translated either from a polycistronic mRNA by a non-canonical mechanism, or from an unmapped monocistronic mRNA. Finally, we also discovered proteins predicted to be expressed from a frame overlapping VP1 in other species related to parvovirus B19: porcine parvovirus 2 (Z protein) and bovine parvovirus 3 (X-like protein).
... A region was considered an IDR if it had at least 30 residues and all of these residues had an IUPred score above 0.3. The default value of the score (0.5) recommended by IUPred was changed to 0.3 because this threshold provides a better correlation with the results of MetaDisorderMD2, which gave the best results at the CASP9 contest (Figure S1) [23]. We did not use MetaDisorderMD2 directly due to the unavailability of its standalone command line version. ...
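The region-calling rule quoted in this excerpt (a run of at least 30 consecutive residues, each scoring above 0.3) can be illustrated with a minimal Python sketch. The function name `find_idrs` and its parameters are hypothetical and are not taken from IUPred, MetaDisorder, or the citing paper; the input is assumed to be a plain list of per-residue disorder scores obtained elsewhere.

```python
from typing import List, Tuple


def find_idrs(
    scores: List[float],
    threshold: float = 0.3,  # cutoff quoted in the excerpt (IUPred default is 0.5)
    min_len: int = 30,       # minimum run length quoted in the excerpt
) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) positions of candidate IDRs.

    A candidate IDR is a maximal run of consecutive residues whose score is
    strictly greater than `threshold` and whose length is at least `min_len`.
    This is an illustrative helper, not part of any published tool.
    """
    regions: List[Tuple[int, int]] = []
    run_start = None  # 0-based index where the current run began
    for i, score in enumerate(scores):
        if score > threshold:
            if run_start is None:
                run_start = i
        else:
            # The run (if any) ended at index i - 1.
            if run_start is not None and i - run_start >= min_len:
                regions.append((run_start + 1, i))
            run_start = None
    # Close a run that extends to the C-terminus.
    if run_start is not None and len(scores) - run_start >= min_len:
        regions.append((run_start + 1, len(scores)))
    return regions
```

Under these assumptions, a 30-residue stretch scoring 0.6 is reported as one region, while a 10-residue stretch scoring 0.9 is ignored because it is too short, regardless of how high its scores are.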
... regions that are predicted more accurately. This underlies the design of tools that implement a consensus of results from multiple disorder predictors, which are shown to be on average better when compared with the corresponding results generated by the corresponding individual predictors [122][123][124]. At the disordered transactivation regions at the N-terminus, p53 interacts with TFIID, TFIIH, Mdm2, RPA, CBP/p300 and CSN5/Jab1 among many other proteins 125, whereas its C-terminal domain acts as a binding hub for GSK3β, PARP-1, TAF1, TRRAP, hGcn5, TAF, 14-3-3, S100B(ββ) and many other proteins 125. ...
Article
Intrinsic disorder is instrumental for a wide range of protein functions, and its analysis, using computational predictions from primary structures, complements secondary and tertiary structure-based approaches. In this Tutorial, we provide an overview and comparison of 23 publicly available computational tools with complementary parameters useful for intrinsic disorder prediction, partly relying on results from the Critical Assessment of protein Intrinsic Disorder prediction experiment. We consider factors such as accuracy, runtime, availability and the need for functional insights. The selected tools are available as web servers and downloadable programs, offer state-of-the-art predictions and can be used in a high-throughput manner. We provide examples and instructions for the selected tools to illustrate practical aspects related to the submission, collection and interpretation of predictions, as well as the timing and their limitations. We highlight two predictors for intrinsically disordered proteins, flDPnn as accurate and fast and IUPred as very fast and moderately accurate, while suggesting ANCHOR2 and MoRFchibi as two of the best-performing predictors for intrinsically disordered region binding. We link these tools to additional resources, including databases of predictions and web servers that integrate multiple predictive methods. Altogether, this Tutorial provides a hands-on guide to comparatively evaluating multiple predictors, submitting and collecting their own predictions, and reading and interpreting results. It is suitable for experimentalists and computational biologists interested in accurately and conveniently identifying intrinsic disorder, facilitating the functional characterization of the rapidly growing collections of protein sequences.
... We used MetaDisorder [55] to predict disordered regions, in accordance with the principles described in [56], and DeepCoil [57] to predict coiled-coil regions. 4 We used two complementary methods to reliably predict transmembrane segments, as explained in [58]. ...
Preprint
Full-text available
Viruses frequently contain overlapping genes, which encode functionally unrelated proteins from the same DNA or RNA region but in different reading frames. Yet overlapping genes are often overlooked during genome annotation, in particular in DNA viruses. Here we looked for the presence of overlapping genes likely to encode a functional protein in human parvovirus B19 (genus erythroparvovirus), using an experimentally validated software, Synplot2. Synplot2 detected an open reading frame, X, conserved in all erythroparvoviruses, which overlaps the VP1 capsid gene, and is under highly significant selection pressure. In a related virus, human parvovirus (genus tetraparvovirus), Synplot2 also detected an open reading frame under highly significant selection pressure, ARF1, which overlaps the VP1 gene. X and ARF1 have exactly the same location (both overlap the region of VP1 encoding the phospholipase A2 domain), and encode proteins with similar predicted properties, such as a transmembrane region, strongly suggesting that they are homologous. These findings provide compelling evidence that the X protein must be expressed and functional. It is probably translated either from a polycistronic mRNA by a non-canonical mechanism, or from an unmapped monocistronic mRNA. Finally, we also discovered proteins predicted to be expressed from a frame overlapping VP1 in other species related to parvovirus B19: porcine parvovirus 2 (Z protein) and bovine parvovirus 3 (X-like protein).
... To determine the order-disorder pattern of the FYCO1 protein, the web-based tools Meta Disorder [23] and MobiDB [24] were utilized. Thirteen tools implemented in Meta Disorder predict the disorder regions of the protein by employing six tools (GlobPlot, DisEMBL, IUPred, ESpritz, VSL2b, and Jronn). ...
Article
Full-text available
Liquid–liquid phase separation (LLPS) is a novel principle for interpreting precise spatiotemporal coordination in living cells through biomolecular condensate (BMC) formation via dynamic aggregation. LLPS changes individual molecules into membrane-free, droplet-like BMCs with specific functions, which coordinate various cellular activities. The formation and regulation of LLPS are closely associated with oncogenesis, tumor progression and metastasis, but the specific roles and mechanisms of LLPS in tumors still need further investigation. In this review, we comprehensively summarize the conditions of LLPS and identify mechanisms involved in abnormal LLPS in cancer processes, including tumor growth, metastasis, and angiogenesis from the perspective of cancer hallmarks. We have also reviewed the clinical applications of LLPS in oncologic areas. This systematic summary of dysregulated LLPS from the different dimensions of cancer hallmarks will build a bridge for determining its specific functions to further guide basic research, finding strategies to intervene in LLPS, and developing relevant therapeutic approaches.
Article
Full-text available
Ore mineral and host lithologies have been sampled with 89 oriented samples from 14 sites in the Naica District, northern Mexico. Magnetic parameters permit characterisation of the samples: saturation magnetization, density, low- and high-temperature magnetic susceptibility, remanence intensity, Koenigsberger ratio, Curie temperature and hysteresis parameters. Rock magnetic properties are controlled by variations in titanomagnetite content and hydrothermal alteration. Post-mineralization hydrothermal alteration seems the major event that affected the minerals and magnetic properties. Curie temperatures are characteristic of titanomagnetites or titanomaghemites. Hysteresis parameters indicate that most samples have pseudo-single domain (PSD) magnetic grains. Alternating field (AF) demagnetization and isothermal remanence (IRM) acquisition both indicate that natural and laboratory remanences are carried by MD-PSD spinels in the host rocks. The trend of NRM intensity vs susceptibility suggests that the carrier of remanent and induced magnetization is the same in all cases (spinels). The Koenigsberger ratio ranges from 0.05 to 34.04, indicating the presence of MD and PSD magnetic grains. Constraints on the geometry of the intrusive source body developed in the model of the magnetic anomaly are obtained by quantifying the relative contributions of induced and remanent magnetization components.
Article
Full-text available
The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic, and statistical refinements permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is described for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities.
Article
Full-text available
A major challenge in the proteomics and structural genomics era is to predict protein structure and function, including identification of those proteins that are partially or wholly unstructured. Non-globular sequence segments often contain short linear peptide motifs (e.g. SH3-binding sites) which are important for protein function. We present here a new tool for discovery of such unstructured, or disordered regions within proteins. GlobPlot (http://globplot.embl.de) is a web service that allows the user to plot the tendency within the query protein for order/globularity and disorder. We show examples with known proteins where it successfully identifies inter-domain segments containing linear motifs, and also apparently ordered regions that do not contain any recognised domain. GlobPlot may be useful in domain hunting efforts. The plots indicate that instances of known domains may often contain additional N- or C-terminal segments that appear ordered. Thus GlobPlot may be of use in the design of constructs corresponding to globular proteins, as needed for many biochemical studies, particularly structural biology. GlobPlot has a pipeline interface—GlobPipe—for the advanced user to do whole proteome analysis. GlobPlot can also be used as a generic infrastructure package for graphical displaying of any possible propensity.
Article
Full-text available
The PSIPRED protein structure prediction server allows users to submit a protein sequence, perform a prediction of their choice and receive the results of the prediction both textually via e-mail and graphically via the web. The user may select one of three prediction methods to apply to their sequence: PSIPRED, a highly accurate secondary structure prediction method; MEMSAT 2, a new version of a widely used transmembrane topology prediction method; or GenTHREADER, a sequence profile based fold recognition method. Availability: Freely available to non-commercial users at http://globin.bio.warwick.ac.uk/psipred/
Article
Full-text available
Genes involved in cholesterol biosynthesis and uptake are transcriptionally regulated in response to cellular sterol content in a coordinated manner. A number of these genes, including 3-hydroxy-3-methylglutaryl coenzyme A reductase (HMGCR) and LDL receptor (LDLR), undergo alternative splicing, resulting in reductions of enzyme or protein activity. Here we demonstrate that cellular sterol depletion suppresses, and sterol loading induces, alternative splicing of multiple genes involved in the maintenance of cholesterol homeostasis including HMGCR and LDLR, the key regulators of cellular cholesterol biosynthesis and uptake, respectively. These changes were observed in both in vitro studies of the HepG2 human hepatoma derived cell line, as well as in vivo studies of St. Kitts vervets, also known as African green monkeys, a commonly used primate model for investigating cholesterol metabolism. These effects are mediated in part by sterol regulation of polypyrimidine tract binding protein 1 (PTBP1), since knock-down of PTBP1 eliminates sterol induced changes in alternative splicing of several of these genes. Single nucleotide polymorphisms (SNPs) that influence HMGCR and LDLR alternative splicing (rs3846662 and rs688, respectively), have been associated with variation in plasma LDL-cholesterol levels. Sterol-induced changes in alternative splicing are blunted in carriers of the minor alleles for each of these SNPs, indicating an interaction between genetic and non-genetic regulation of this process. Our results implicate alternative splicing as a novel mechanism of enhancing the robust transcriptional response to conditions of cellular cholesterol depletion or accumulation. Thus coordinated regulation of alternative splicing may contribute to cellular cholesterol homeostasis as well as plasma LDL levels.
Article
The effect of training a neural network secondary structure prediction algorithm with different types of multiple sequence alignment profiles derived from the same sequences, is shown to provide a range of accuracy from 70.5% to 76.4%. The best accuracy of 76.4% (standard deviation 8.4%), is 3.1% (Q3) and 4.4% (SOV2) better than the PHD algorithm run on the same set of 406 sequence non-redundant proteins that were not used to train either method. Residues predicted by the new method with a confidence value of 5 or greater, have an average Q3 accuracy of 84%, and cover 68% of the residues. Relative solvent accessibility based on a two state model, for 25, 5, and 0% accessibility are predicted at 76.2, 79.8, and 86.6% accuracy respectively. The source of the improvements obtained from training with different representations of the same alignment data are described in detail. The new Jnet prediction method resulting from this study is available in the Jpred secondary structure prediction server, and as a stand-alone computer program from: http://barton.ebi.ac.uk/. Proteins 2000;40:502–511. © 2000 Wiley-Liss, Inc.
Article
Since the early 1980s, a bewildering array of methods for constructing bootstrap confidence intervals have been proposed. In this article, we address the following questions. First, when should bootstrap confidence intervals be used. Secondly, which method should be chosen, and thirdly, how should it be implemented. In order to do this, we review the common algorithms for resampling and methods for constructing bootstrap confidence intervals, together with some less well known ones, highlighting their strengths and weaknesses. We then present a simulation study, a flow chart for choosing an appropriate method and a survival analysis example. Copyright © 2000 John Wiley & Sons, Ltd.
Article
Predictions of the secondary structure of T4 phage lysozyme, made by a number of investigators on the basis of the amino acid sequence, are compared with the structure of the protein determined experimentally by X-ray crystallography. Within the amino terminal half of the molecule the locations of helices predicted by a number of methods agree moderately well with the observed structure; however, within the carboxyl half of the molecule the overall agreement is poor. For eleven different helix predictions, the coefficients giving the correlation between prediction and observation range from 0.14 to 0.42. The accuracy of the predictions for both beta-sheet regions and for turns is generally lower than for the helices, and in a number of instances the agreement between prediction and observation is no better than would be expected for a random selection of residues. The structural predictions for T4 phage lysozyme are much less successful than was the case for adenylate kinase (Schulz et al. (1974) Nature 250, 140-142). No one method of prediction is clearly superior to all others, and although empirical predictions based on larger numbers of known protein structures tend to be more accurate than those based on a limited sample, the improvement in accuracy is not dramatic, suggesting that the accuracy of current empirical predictive methods will not be substantially increased simply by the inclusion of more data from additional protein structure determinations.
Article
Intrinsically disordered proteins (IDPs) are associated with a wide range of functions. We suggest that sequence-based subtypes, which we call flavors, may provide the basis for different biological functions. The problem is to find a method that separates IDPs into different flavor/function groups. Here we discuss one approach, the (Charge-Hydropathy) versus (Cumulative Distribution Function) plot or CH-CDF plot, which is based on the combined use of the CH and CDF disorder predictors. These two predictors are based on significantly different inputs and methods. This CH-CDF plot partitions all proteins into 4 groups: structured, mixed, disordered, and rare. Studies of the Protein Data Bank (PDB) entries and homologs show different structural biases for each group classified by the CH-CDF plot. The mixed class has more order-promoting residues and more ordered regions than the disordered class. To test whether this partition accomplishes any functional separation, we performed gene ontology (GO) term analysis on each class. Some functions are indeed found to be related to subtypes of disorder: the disordered class is highly active in mitosis-related processes among others. Meanwhile, the mixed class is highly associated with signaling pathways, where having both ordered and disordered regions could possibly be important.