
ProsperousPlus: a one-stop and comprehensive platform for accurate protease-specific substrate cleavage prediction and machine-learning model construction



Proteases contribute to a broad spectrum of cellular functions. Given a relatively limited amount of experimental data, developing accurate sequence-based predictors of substrate cleavage sites facilitates a better understanding of protease functions and substrate specificity. While many protease-specific predictors of substrate cleavage sites were developed, these efforts are outpaced by the growth of the protease substrate cleavage data. In particular, since data for 100+ protease types are available and this number continues to grow, it becomes impractical to publish predictors for new protease types, and instead it might be better to provide a computational platform that helps users to quickly and efficiently build predictors that address their specific needs. To this end, we conceptualised, developed, tested, and released a versatile bioinformatics platform, ProsperousPlus, that empowers users, even those with no programming or little bioinformatics background, to build fast and accurate predictors of substrate cleavage sites. ProsperousPlus facilitates the use of the rapidly accumulating substrate cleavage data to train, empirically assess and deploy predictive models for user-selected substrate types. Benchmarking tests on test datasets show that our platform produces predictors that on average exceed the predictive performance of current state-of-the-art approaches. ProsperousPlus is available as a webserver and a stand-alone software package at
Fuyi Li is a professor at the College of Information Engineering, Northwest A&F University, China. His research interests are bioinformatics, computational
biology, machine learning and data mining.
Cong Wang is a PhD student in the College of Information Engineering, Northwest A&F University. Her research interests are bioinformatics and machine learning.
Xudong Guo received his MEng degree from Ningxia University, China. He is currently a research assistant at the College of Information Engineering, Northwest
A&F University, China. His research interests are bioinformatics and data mining.
Tatsuya Akutsu received his DEng degree in Information Engineering in 1989 from the University of Tokyo, Japan. Since 2001, he has been a professor at the
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Japan. His research interests include bioinformatics and discrete algorithms.
Geoffrey I. Webb received his PhD degree in 1987 from La Trobe University. He is the director of the Monash Centre for Data Science and a Professor in the Faculty
of Information Technology at Monash University, Australia. He is a leading data scientist and has been the Program Committee Chair of two leading data mining
conferences, ACM SIGKDD and IEEE ICDM. His research interests include machine learning, data mining, computational biology and user modelling.
Lachlan J.M. Coin is a professor and group leader in the Department of Microbiology and Immunology at the University of Melbourne. He is also a member of the
Department of Clinical Pathology, University of Melbourne. His research interests are bioinformatics, machine learning, transcriptomics and genomics.
Lukasz Kurgan is a fellow of AIMBE and AAIA, Member of European Academy of Sciences and Arts and Robert J. Mattauch Endowed Professor of Computer
Science at Virginia Commonwealth University. His research encompasses the structural and functional characterization of proteins. He serves on the Editorial
Board of Bioinformatics and as associate editor-in-chief of Biomolecules. More details at
Jiangning Song is an associate professor and a group leader at the Monash Biomedicine Discovery Institute, Monash University. He is also affiliated with the
Monash Data Futures Institute, Monash University. His research interests include bioinformatics, computational biomedicine, machine learning, data mining and
pattern recognition.
Received: July 30, 2023. Revised: August 30, 2023. Accepted: September 29, 2023
© The Author(s) 2023. Published by Oxford University Press. All rights reserved. For Permissions, please email:
Briefings in Bioinformatics, 2023, 24(6), 1–14
Problem Solving Protocol
Fuyi Li, Cong Wang, Xudong Guo, Tatsuya Akutsu, Geoffrey I. Webb, Lachlan J.M. Coin, Lukasz Kurgan and Jiangning Song
Corresponding authors: Fuyi Li, 22 Xinong Road, College of Information Engineering, Yangling, Shaanxi 712100, China. E-mail:; Jiangning
Song, Monash Biomedicine Discovery Institute and Data Futures Institute, 19 Innovation Walk, Monash University, Clayton campus, Victoria 3800, Australia.
Tel.: +61 3 9902 9304; E-mail:; Lukasz Kurgan, Department of Computer Science, Virginia Commonwealth University, Department of
Computer Science, 401 West Main Street, Room E4225, P.O. Box 843019, Richmond, Virginia 23284-3019, USA. Tel.: (804) 827-3986; Fax: (804) 828-2771.
Keywords: protease; cleavage site prediction; scoring function; machine learning; ensemble learning; model construction; high-throughput prediction
Proteases are enzymes that cleave their target proteins’ peptide
backbone, contributing to various cellular processes, including
protein degradation, signal transduction and immune response
[1–3]. Many proteases are highly specific, cleaving only those
target substrates that present certain amino acid sequence
patterns. When dysregulated, proteases’ actions are closely
associated with numerous diseases [2–7], motivating their
roles in drug design and disease diagnosis efforts [4–6, 8].
Since current data on the cleavage sites and substrates are
incomplete, sequence-based substrate cleavage site predictors
can be used to support efforts to understand protease function
and substrate specificity. Some of these tools provide accurate
predictions [9] and can be applied to identify previously unknown
and physiologically relevant cleavage sites, thus providing
insights into biological processes and guiding hypothesis-
driven experiments to verify protease–substrate interaction and
Downloaded from by Biomedical Library user on 24 October 2023
2|Li et al.
We systematically reviewed and evaluated 19 state-of-the-art
computational approaches for the cleavage site prediction [9]. We
classified the existing tools into two main categories according to
the prediction algorithms that they utilize: (i) scoring function-
based predictors, such as PeptideCutter [10], PoPS [11], SitePredic-
tion [12] and GPS-CCD [13], and (ii) machine learning-based tools,
such as Cascleave [14], PROSPER [15], Cascleave2.0 [16], iProt-Sub
[17], PROSPERous [18], DeepCleave [19] and Procleave [20]. The
scoring function-based tools rely on a limited number of scoring
functions (i.e. typically just one or two), and thus, they cannot
comprehensively encode the amino acid sequence information
for a given substrate. While machine learning-based tools do not
share this drawback, they require considerable time and availabil-
ity of high-end computing hardware to compute and tune/opti-
mize their predictive models, especially for the intrinsically large
deep learning models. In addition, developing machine learning
models often requires formulating and calculating sequence-derived
features, which demands strong programming skills and may lead to
overfitting of the resulting models if the number of these features
is excessive. Another major problem is that users cannot effectively
train models that target specific protease types; instead, they have
to rely on the current tools, which may not cover the proteases that
the users want to target. This issue is compounded by the fast pace
at which protease substrate cleavage data are being generated, which
means that predictors for recently added proteases will inevitably
be missing.
In addition, considering data privacy and intellectual property
aspects, some labs may prefer to develop predictors using their
own in-house generated data instead of relying on the published
methods that are trained using public data. These efforts might
be rather challenging because of the machine learning and pro-
gramming skills that are needed to develop accurate predictive
tools. Correspondingly, we argue that it might be more practical to
provide an easy-to-use computational platform that helps users
quickly and efficiently build predictors that address the needs of
specific users rather than developing a new predictor for a specific
collection of protease types.
Drawing from our extensive experience with building predictors
of protease-specific substrates and cleavage sites [17–21],
we conceptualized, developed, tested and released ProsperousPlus,
a versatile bioinformatics platform that can be used to develop
fast and accurate predictors for a user-defined collection of pro-
tease types. ProsperousPlus leverages both the sequence scoring
function-based and the machine learning-based approaches to
generate accurate predictions. Our platform can be used by users
with little to no programming and bioinformatics expertise and is
conveniently available as a webserver and a standalone code. Its
current version supports the development of models for up to 110
protease types, and this can be easily extended by the inclusion
of additional datasets for the other proteases.
Overview of ProsperousPlus
Figure 1A illustrates the four major activities that were needed to
develop ProsperousPlus, including data collection, sequence scor-
ing, model construction and evaluation and webserver develop-
ment. As the first activity, we collected and pre-processed training
and independent test datasets from the MEROPS resource for 110
protease types [22]. These datasets are needed to train/compute
predictors and assess their predictive performance. Second, we
implemented several sequence scoring functions and sequence
encodings that can be used to generate a diverse and large collec-
tion of features from the input protein sequences, which in turn
are used as inputs to predictive models. Third, we formulated and
implemented the AutoML framework, which supports the con-
struction/training, evaluation and selection of accurate protease-
specific prediction models. The training allows for selecting a
specific collection of features, predictive model types and pro-
tease types. The selection relies on empirical comparison of the
predictive performance of predictors that are trained using dif-
ferent features and predictive models. Finally, we developed an
online webserver and local standalone software for ProsperousPlus
using empirically optimized models of each protease type that is
covered by our datasets.
Figure 1B illustrates the three functional modules, ‘Prediction’,
‘TrainYourModel’ and ‘UseYourOwnModel’, that are available in both
the webserver and local standalone software for ProsperousPlus.
The Prediction module provides access to pre-trained prediction
models for the 110 protease types, allowing users to conduct
protease-specific substrate cleavage site prediction easily. With
the TrainYourModel module, users can train their own prediction
models based on their in-house dataset, which can cover pro-
teases outside of the 110 types, using the AutoML framework
of ProsperousPlus. This module also facilitates comparative eval-
uation of predictive performance for the generated models. The
selected pre-trained and user-generated models can be used to
make predictions with the help of the UseYourOwnModel module.
Dataset collection and pre-processing
We curated experimentally validated protease substrate cleavage
annotation data for model training and validation from release
12.4 of the MEROPS database, which is a comprehensive database
for protease substrates and their cleavage events [22]. We removed
highly homologous sequences (>70% sequence identity) from the
initial substrate datasets. This aligns with previous studies [18,
20], and it facilitates the generation of more robust models that
will not be skewed towards the prediction of over-represented
homologues. Next, we selected proteases with over 30 cleavage
sites and split their datasets into training and independent test
subsets (the latter is not used for training) using the 7:3 ratio. We
excluded proteases with 30 or fewer sites since this amount of
data will not be sufficient to perform model training and testing
activities. We represent each cleavage site by a window of 20
amino acids, i.e. 10 amino acids in the upstream and 10 amino
acids in the downstream of the cleavage site. Using this protocol,
we obtained the training and independent test datasets for 110
protease types summarized in Tables S1 and S2. We trained and
optimized the ProsperousPlus models for each of the 110 protease
types using the 10-fold cross-validation on the training datasets,
and afterwards, we evaluated their predictive performance utiliz-
ing the independent test datasets.
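The windowing and splitting steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names, the 0-based `cut` index (the position of the first residue downstream of the cleaved bond), the `X` padding near the protein termini and the fixed random seed are all assumptions made for the example.

```python
import random


def cleavage_window(sequence, cut, flank=10, pad="X"):
    """Return the 2*flank residues around the bond between sequence[cut-1] and sequence[cut].

    Positions that fall outside the protein are padded (the padding is an assumption).
    """
    upstream = sequence[max(0, cut - flank):cut].rjust(flank, pad)
    downstream = sequence[cut:cut + flank].ljust(flank, pad)
    return upstream + downstream


def split_train_test(sites, train_fraction=0.7, seed=0):
    """Shuffle the sites and split them 7:3 into training and independent test subsets."""
    shuffled = list(sites)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical protein sequence and cleavage position, for illustration only.
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
window = cleavage_window(protein, cut=12)
train, test = split_train_test(range(40))
print(window, len(window), len(train), len(test))
```

Padding is only one way to handle sites close to the termini; such sites could equally be discarded, and the text does not state which option is used.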
Sequence scoring functions in ProsperousPlus
ProsperousPlus relies on the two-layer prediction architecture,
which uses multiple sequence scoring functions in the first
layer to produce inputs for the second layer that utilizes a
machine learning algorithm to make predictions. The first layer
employs eight different types of scoring functions, which were
found to be useful for substrate cleavage site and protein
post-translational modification site predictions [18, 23–25]. These
functions calculate scores that quantify the likelihood of a
cleavage site, and we describe them in the following subsections.
Model customization to predict cleavage site |3
Figure 1. The ProsperousPlus framework. (A) Graphical illustration of four major activities that were needed to develop ProsperousPlus, which include
data collection, sequence scoring, AutoML module that facilitates training and evaluation of ML models and webserver development; (B) graphical
illustration of three functional modules in ProsperousPlus that facilitate selection of pre-trained prediction models, training of new models and use of
the selected/trained models.
Amino acid frequency
Amino acid frequency (AAF) is a popular sequence scoring function [23, 25, 26], which is defined as

$$\mathrm{AAF} = \sum_{i \in P} NF_i \tag{1}$$

where $NF_i$ is the normalized relative AAF and $P$ denotes the amino acid positions surrounding the cleavage site (i.e. $P$ = [−10, −9, −8, −7, −6, −5, −4, −3, −2, −1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]). $NF_i$ is defined as follows:

$$NF_i = \frac{f_i}{f_{i,\max}} \tag{2}$$

where $f_i = n_i/N$ represents the frequency value of the amino acid at position $i$ ($N$ is the total number of residues of all known cleavage peptide sequences, and $n_i$ is the number of each amino acid type at position $i$) and $f_{i,\max}$ is the frequency value of the most common amino acid at the same position.
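As a rough illustration of the AAF idea, the sketch below builds the per-position normalized frequencies from a set of training windows and scores a query window. It assumes that the window score is the sum of the normalized frequencies over all positions; the function names and the toy two-residue windows are hypothetical.

```python
from collections import Counter


def normalized_frequencies(windows):
    """For each position, divide every residue count by the count of the most common residue."""
    table = []
    for i in range(len(windows[0])):
        counts = Counter(w[i] for w in windows)
        top = max(counts.values())
        table.append({aa: c / top for aa, c in counts.items()})
    return table


def aaf_score(window, table):
    """Sum the normalized frequencies of the window's residues (unseen residues score 0)."""
    return sum(column.get(aa, 0.0) for aa, column in zip(window, table))


# Toy training windows of length 2; real windows have 20 positions.
training = ["AC", "AD", "GC"]
nf = normalized_frequencies(training)
print(aaf_score("AC", nf), aaf_score("GD", nf))
```

Because both the numerator and the denominator of the normalized frequency share the same total count, the ratio of raw counts is enough and the total never needs to be computed.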
WebLogo-based sequence conservation
WebLogo [27] is a widely used approach based on the calculation of the sequence conservation score (W) [18, 23, 25]. We apply the conservation scores generated by WebLogo to rank the potential cleavage sites. The conservation score of a peptide is calculated as

$$\mathrm{WebLogoScore} = \sum_{i \in P} W_i \tag{3}$$

where $W_i$ denotes the conservation score of the amino acid at position $i$ ($P$ = [−10, −9, −8, −7, −6, −5, −4, −3, −2, −1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).
Nearest neighbour similarity
Nearest neighbour similarity (NNS) evaluates the similarity between two amino acid sequences $A(m, n)$ and $B(m, n)$, where $m$ and $n$ are the numbers of amino acids flanking the upstream and downstream of the cleavage site [18, 23]. The NNS between two cleavage site sequences $A(m, n)$ and $B(m, n)$ is defined as

$$\mathrm{NNS}(A, B) = \sum_{i=-m}^{n} \mathrm{Score}(A[i], B[i]) \tag{4}$$

where $i$ runs from $-m$ to $n$ and $\mathrm{Score}(a, b)$ is the corresponding element value in the BLOSUM62 substitution matrix. $A[i]$ and $B[i]$ denote the $i$th amino acids of sequences $A(m, n)$ and $B(m, n)$. We use $m = n = 10$, which corresponds to the cleavage site positions [P(−10), P(−9), P(−8), P(−7), P(−6), P(−5), P(−4), P(−3), P(−2), P(−1), P(1), P(2), P(3), P(4), P(5), P(6), P(7), P(8), P(9), P(10)].
For a given protease type, the putative substrate cleavage site
sequence is compared against all training cleavage sites to calcu-
late the sequence similarity scores and the final score is calcu-
lated as the average of these similarity scores.
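The NNS score of a candidate window can be sketched as follows. To keep the example self-contained, it uses a three-residue toy excerpt of a substitution matrix instead of the full BLOSUM62 table; the helper names and the two-residue windows are assumptions made purely for illustration.

```python
# Toy symmetric substitution scores for residues A, D and E only; a real run
# would use the complete BLOSUM62 matrix.
TOY_MATRIX = {
    ("A", "A"): 4, ("D", "D"): 6, ("E", "E"): 5,
    ("A", "D"): -2, ("A", "E"): -1, ("D", "E"): 2,
}


def substitution_score(a, b, matrix=TOY_MATRIX):
    """Look up a score in a symmetric matrix stored as unordered residue pairs."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]


def nns(a, b, matrix=TOY_MATRIX):
    """Sum of the position-wise substitution scores of two equal-length windows."""
    return sum(substitution_score(x, y, matrix) for x, y in zip(a, b))


def nns_score(query, training_windows, matrix=TOY_MATRIX):
    """Average similarity of the query against all known cleavage site windows."""
    similarities = [nns(query, t, matrix) for t in training_windows]
    return sum(similarities) / len(similarities)


print(nns("AD", "AD"), nns("AD", "EE"), nns_score("AD", ["AD", "EE"]))
```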
K-Nearest Neighbours
The K-Nearest Neighbour (KNN) scoring function identifies the $k$ most similar known cleavage site sequences of a potential cleavage site sequence by calculating the distance between them [23, 25]. The distance between two amino acid sequences $A(m, n)$ and $B(m, n)$ is calculated as follows:

$$\mathrm{Distance}(A, B) = 1 - \frac{\sum_{i=1}^{m+n} \mathrm{Sim}(A[i], B[i])}{m + n} \tag{5}$$

where $m$ and $n$ have the same meaning as described in the Nearest neighbour similarity section, and $m = n = 10$ (window size = 20, $i \neq 0$). The $\mathrm{Sim}()$ function calculates the similarity between the amino acids $A[i]$ and $B[i]$ as follows:

$$\mathrm{Sim}(a, b) = \frac{\mathrm{BLOSUM62}(a, b) - \min\{\mathrm{BLOSUM62}\}}{\max\{\mathrm{BLOSUM62}\} - \min\{\mathrm{BLOSUM62}\}} \tag{6}$$

where $a$ and $b$ denote the amino acids from the sequences $A$ and $B$, respectively. We use $k = 0.005 \times N$ ($N$ is the total number of sequences) to calculate the KNN score.
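The distance and neighbour selection can be sketched as below. The example again relies on a three-residue toy substitution table rather than the full BLOSUM62 matrix, so the minimum and maximum are taken over the toy values. The function names and the rule that at least one neighbour is always kept are assumptions. How the retrieved neighbours are turned into the final KNN score is not specified in the text, so the sketch stops at retrieving them.

```python
# Toy symmetric substitution scores (A, D, E only) standing in for BLOSUM62.
TOY_MATRIX = {
    ("A", "A"): 4, ("D", "D"): 6, ("E", "E"): 5,
    ("A", "D"): -2, ("A", "E"): -1, ("D", "E"): 2,
}


def sim(a, b, matrix=TOY_MATRIX):
    """Min-max normalized substitution score, in the range [0, 1]."""
    lo, hi = min(matrix.values()), max(matrix.values())
    raw = matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]
    return (raw - lo) / (hi - lo)


def distance(a, b, matrix=TOY_MATRIX):
    """One minus the mean normalized similarity over all window positions."""
    return 1 - sum(sim(x, y, matrix) for x, y in zip(a, b)) / len(a)


def nearest_neighbours(query, training_windows, fraction=0.005, matrix=TOY_MATRIX):
    """Return the k = fraction * N closest training windows (at least one)."""
    k = max(1, round(fraction * len(training_windows)))
    ranked = sorted(training_windows, key=lambda t: distance(query, t, matrix))
    return ranked[:k]


print(distance("AD", "AD"), distance("AD", "EE"))
print(nearest_neighbours("AD", ["EE", "AD", "DA"]))
```

Note that with a normalized similarity an identical pair of windows does not necessarily have distance 0, since a self-match only reaches the maximum of the matrix for the residue with the highest diagonal entry.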
Position probability matrix-based scoring function
The position probability matrix (PPM) is a 21 × 20 matrix, where ‘21’ denotes the 20 types of common amino acids plus X (undefined amino acids and the gap ‘-’), and ‘20’ is the length of the cleavage site sequences. The PPM captures the positional preference of amino acids in the cleavage site sequence by calculating the probability of each amino acid occurring at different positions within the sequence [28]. The PPM is calculated as follows:

$$F_{a,i} = \frac{C_{a,i}}{M}, \quad i \in [1, 20] \tag{7}$$

where $F_{a,i}$ is the probability of amino acid $a$ at position $i$ in the PPM, $C_{a,i}$ represents the number of residues of amino acid type $a$ at position $i$, and $M$ is the number of samples. Given a cleavage site $S = \{a_1 a_2 \ldots a_{20}\}$, the average of the PPM element values corresponding to all amino acids in the sequence is calculated as the feature of the sequence. The score of a sequence is computed as follows:

$$\mathrm{PPMscore} = \frac{\sum_{i=1}^{20} F_{S[i],i}}{20} \tag{8}$$

where $S[i]$ is the $i$th amino acid in the sequence $S$.
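A PPM and the corresponding score can be computed along the following lines. This is a minimal sketch: the function names are hypothetical, any symbol outside the 20 common amino acids (including the gap) is mapped to X, and the toy windows have only two positions instead of 20.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 20 common amino acids plus X


def build_ppm(windows):
    """Return one {residue: probability} column per window position."""
    n_samples = len(windows)
    ppm = []
    for i in range(len(windows[0])):
        counts = Counter(w[i] if w[i] in ALPHABET else "X" for w in windows)
        ppm.append({aa: counts.get(aa, 0) / n_samples for aa in ALPHABET})
    return ppm


def ppm_score(window, ppm):
    """Average PPM probability of the residues observed in the window."""
    total = sum(column.get(aa, column["X"]) for aa, column in zip(window, ppm))
    return total / len(window)


training = ["AC", "AD", "GC", "AC"]
ppm = build_ppm(training)
print(ppm_score("AC", ppm), ppm_score("GD", ppm))
```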
Position-specific scoring matrix-based IC50 scoring
The position-specific scoring matrix (PSSM) is also a 21 × 20 matrix, similar to the PPM. The PSSM quantifies evolutionary information of amino acids occurring at different positions in a given sequence [29], and it has been widely used in many related bioinformatics prediction tasks [30–32]. PSSMs are generated by first calculating the sequence-weighted frequency of the amino acids at each position of the cleavage site sequence; the frequency is then normalized by the background frequency of the corresponding amino acid and finally log-transformed [33]. The elements of a PSSM are calculated as

$$P_{a,i} = \log \frac{F_{a,i} + \omega}{BG_a} \tag{9}$$

where $P_{a,i}$ represents the element value of amino acid $a$ at position $i$ in the PSSM; $F_{a,i}$ is the frequency of amino acid $a$ at position $i$; $BG_a$ denotes the background frequency corresponding to the amino acid $a$, and $\omega$ takes a value from the 0 to 1 range. We obtain the background frequencies of amino acids from the UniProt database.

The PSSM cannot be directly used as the scoring function to calculate the score for the cleavage site sequence. We use the half-maximum inhibitory concentration (IC50) score, which is widely used for binding affinity prediction [32, 35], to produce the PSSM-based score for the cleavage site sequences. The IC50 score is computed as follows:

$$\mathrm{Seq\_score} = \frac{\sum_{i=1}^{20} P_{S[i],i}}{20} \tag{10}$$

$$\mathrm{IC50} = 5000^{\frac{Max - \mathrm{Seq\_score}}{Max - Min}} \tag{11}$$

where we set $Max = 0.8$ and $Min = -0.8$ based on Liu et al. [30].
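The conversion from PSSM entries to an IC50-style score can be sketched as follows. The sketch assumes a log-odds PSSM entry of the form log((F + ω) / BG) with the natural logarithm, and an exponential mapping with base 5000 between Max = 0.8 and Min = −0.8. The function names, the default ω and the toy PSSM values are hypothetical.

```python
import math


def pssm_element(freq, background, omega=0.5):
    """Log-odds entry for one residue at one position (natural log is an assumption)."""
    return math.log((freq + omega) / background)


def sequence_score(window, pssm):
    """Average PSSM entry over the window; pssm[i] maps residue -> entry at position i."""
    return sum(pssm[i][aa] for i, aa in enumerate(window)) / len(window)


def ic50_score(seq_score, max_score=0.8, min_score=-0.8, base=5000):
    """Map a sequence score to an IC50-like value: Max maps to 1, Min maps to `base`."""
    return base ** ((max_score - seq_score) / (max_score - min_score))


# Toy two-position PSSM with made-up entries.
toy_pssm = [{"A": 0.6, "D": -0.2}, {"A": -0.4, "D": 1.0}]
score = sequence_score("AD", toy_pssm)
print(score, ic50_score(score), ic50_score(0.0), ic50_score(-0.8))
```

Under this mapping a higher sequence score yields a lower IC50-like value, mirroring the convention that a smaller IC50 indicates a stronger interaction.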
Position weight matrix-based scoring function
The position weight matrix (PWM) is a 21 × 20 matrix, similar to
the PPM. The PWM is used to identify sequence patterns or motifs
with specific functions in the sequences by identifying regions
with higher position weights [25, 36, 37]. The PWM is calculated
as follows:

$$PW_{a,i} = \log \frac{F_{a,i}}{0.05} \tag{12}$$

where $PW_{a,i}$ denotes the position weight of amino acid $a$ at position
$i$ in the PWM, and $F_{a,i}$ is the element value in the PPM. For a given
cleavage site sequence $S = \{a_1 a_2 \ldots a_{20}\}$, the score is determined
by taking the average position weight of all amino acids within
the sequence. The formula is as follows:

$$\mathrm{Score}_{PWM} = \frac{\sum_{i=1}^{20} PW_{S[i],i}}{20} \tag{13}$$

where $S[i]$ denotes the $i$th amino acid in the sequence $S$.
Substitution matrix index-based scoring function
The substitution matrix index (SMI) is utilized to compute similarity amongst sequences and is extensively employed for extracting features from sequences [38–41]. We use five different versions of BLOSUM scoring matrices (BLOSUM100, BLOSUM75, BLOSUM62, BLOSUM45 and BLOSUM30) and five point accepted mutation (PAM) matrices (PAM500, PAM400, PAM300, PAM120 and PAM30) to evaluate the similarity between the query cleavage site and other cleavage site sequences from the training dataset. For a given query sequence $S = \{a_1 a_2 \ldots a_{20}\}$, the SMI score is calculated as

$$\mathrm{Score}_{SMI} = \max \left\{ \mathrm{Score}_{S_j} = \sum_{i=1}^{20} \mathrm{Matrix}\big(\mathrm{index}(S[i]), \mathrm{index}(St_j[i])\big) : j = 1, \ldots, m \right\} \tag{14}$$

where $\mathrm{Matrix}(x, y)$ is the element value with index $(x, y)$ of the BLOSUM or PAM scoring matrix, $S[i]$ is the $i$th amino acid of the query sequence $S$, and $St_j[i]$ is the $i$th amino acid of the sequence $St_j$, which is the $j$th sequence in the training dataset (the training dataset has $m$ sequences in total). The $\mathrm{index}(a)$ function returns the index of amino acid $a$ in the matrix. We use the maximum $\mathrm{Score}_{S_j}$ value as the SMI score, and we generate 10 SMI scores for the 10 different substitution matrices.
AutoML framework
The AutoML framework automates the entire process of designing
and selecting an accurate predictor, making it easy for users who
lack machine learning and coding expertise. In particular, AutoML
implements and considers multiple versions of the two-layered
predictors, performs advanced machine learning modelling and
comparative analysis and selects the most accurate solution. The
predictors generate the 17 scores in the first layer by combining
the outputs of the eight scoring functions that are produced from
the input sequence. The second layer applies the 17 scores to
train predictors by utilizing nine popular machine learning algo-
rithms, including Logistic Regression (LR), Naïve Bayes (NB), Sup-
port Vector Machine (SVM), Random Forest (RF), KNN, CatBoost
[42], Extreme Gradient Boosting (XGBoost) [43], Light Gradient
Boosting Machine (LightGBM) [44] and Averaged One-Dependent
Estimator (AODE) [45]. The framework considers using each of
the nine algorithms, performs a robust comparative analysis of
their predictive performance using multiple performance metrics,
attempts to improve the most promising models using popular
feature selection and ensemble algorithms and ultimately out-
puts the model that secures the most accurate predictions.
We detail this process in Algorithm 1: First, the inputs that
include the 17-dimensional feature sets for the training dataset
Sand the number of cross-validation folds kare collected. We
use the stratified sampling strategy (StratifiedKFold) to divide
the feature set Sinto the kcross-validation folds. Steps 2–7
involve applying the stratified k-fold cross-validation to train
and validate the nine machine learning models: LR, NB, SVM, RF,
KNN, CatBoost, XGBoost, LightGBM and AODE. Next, the accuracy
Algorithm 1 The AutoML pipeline of ProsperousPlus
Input:
    S: feature set of the training dataset;
    k: the cross-validation fold number;
Output:
    optimised_model: the optimal prediction model;
01: cv = StratifiedKFold(S, k);
02: for i, (S_i^train, S_i^valid) in enumerate(cv.split(S)):
03:     for j, m in enumerate([LR, NB, SVM, RF, KNN, CatBoost, XGBoost, LightGBM, AODE]):
04:         m_i,j = trainModel(m, S_i^train);
05:         ACCvalues[i][j], AUCvalues[i][j] = predictModel(m_i,j, S_i^valid);
06:     end for;
07: end for;
08: m1, m2, m3 = rank(ACCvalues[i][j].mean, AUCvalues[i][j].mean);
09: m1′, m2′, m3′ = IFS(m1, m2, m3, cv, S);
10: stacked_m = stack_models(m1′, m2′, m3′);
11: blended_m = blend_models(m1′, m2′, m3′);
12: bagging_m1, bagging_m2, bagging_m3 = bagging_model(m1′, m2′, m3′);
13: optimised_model = compare_models(m1′, m2′, m3′, stacked_m, blended_m, bagging_m1, bagging_m2, bagging_m3);
14: return optimised_model;
and area under the receiver-operating characteristic (ROC) curve (AUC) values of each model on S_i^valid are obtained to evaluate and select the top three models (denoted as m1, m2 and m3). The top three are selected by sorting the nine models by their average AUC values across the k validation folds. If selected models have the same average AUC value, then we use the average accuracy value to break the tie. In step 9, we attempt to optimize the top three models using the incremental feature selection (IFS) algorithm [46] (this method was shown to produce strong results in several related studies [47–53]), which results in models m1′, m2′ and m3′. Then, we generate five ensemble models of m1′, m2′ and m3′ using three popular ensemble learning strategies: stacking [54], blending [55] and three versions of bagging [56]. Subsequently, we obtain eight models, which include the three base models m1′, m2′ and m3′ and the five ensemble models stacked_m, blended_m, bagging_m1, bagging_m2 and bagging_m3. Finally, in line 13, we select the optimised_model, which is the model that secures the highest AUC and accuracy values amongst the eight alternatives.
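The ranking rule used to pick the top three models (mean AUC across folds, with mean accuracy as the tie-breaker) can be illustrated with the short sketch below. It only covers the ranking step; training, feature selection and ensembling are out of scope. The function name and the per-fold numbers are made up for the example.

```python
from statistics import mean


def rank_models(auc_by_model, acc_by_model, top=3):
    """Sort models by mean AUC over the folds; ties are broken by mean accuracy."""
    ranked = sorted(
        auc_by_model,
        key=lambda name: (mean(auc_by_model[name]), mean(acc_by_model[name])),
        reverse=True,
    )
    return ranked[:top]


# Hypothetical per-fold validation metrics for four of the nine algorithms.
auc = {"LR": [0.90, 0.92], "RF": [0.95, 0.93], "SVM": [0.92, 0.90], "NB": [0.80, 0.82]}
acc = {"LR": [0.85, 0.85], "RF": [0.90, 0.90], "SVM": [0.88, 0.88], "NB": [0.70, 0.70]}
print(rank_models(auc, acc))
```

Here LR and SVM share the same mean AUC, so SVM is ranked ahead of LR because of its higher mean accuracy. The same key could be reused with `top=1` for the final comparison amongst the eight candidate models, under the assumption that AUC is again considered before accuracy.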
Performance evaluation
AutoML calculates several widely used performance measures
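including accuracy, sensitivity, precision, F1 and Matthews correlation coefficient (MCC) [57]. These measures are calculated as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{15}$$

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{16}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{17}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}} \tag{18}$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{19}$$

where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively, and Recall is identical to Sensitivity. Moreover, the AutoML framework also plots the ROC curves and calculates the AUC values [58].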
Figure 2. Distribution and clustering of the cleavage and non-cleavage sites based on the sequence scores selected by the IFS algorithm for six
proteases: (A) caspase-1, (B) caspase-3, (C) caspase-6, (D) MMP-2, (E) MMP-3 and (F) Granzyme B. For each protease, the samples are clustered into two
groups using the K-means algorithm, and clusters are colour-coded. The cleavage and non-cleavage sites are presented by different markers, where
dots are for the cleavage sites and × for the non-cleavage sites. The inset bar charts show the fractions of cleavage versus non-cleavage sites in each
cluster.
Predictive quality of the sequence scoring
We use the training datasets to investigate the ability of the
17 sequence scores generated by eight scoring functions to
differentiate cleavage from non-cleavage sites in two comple-
mentary ways. First, we perform unsupervised clustering to
see whether these scores can be used for the natural grouping
of the sequences (without using the cleavage annotation) and
whether these groups align with the annotations. Second, we
use Algorithm 1 to quantify the performance of the predictive
models that rely on these scores. Both experiments focus on
six diverse proteases, including caspase-1, caspase-3, caspase-
6, matrix metallopeptidase-2, matrix metallopeptidase-3 and
Granzyme B. These proteases cover a wide range of training
dataset sizes, from 112 cleavage sites for caspase-1 to 1600 for
matrix metallopeptidase-3 (Table S2), and they are frequently
targeted by the existing predictors.
We apply the popular K-means clustering [59] with K = 2 to mimic the presence of the two populations (cleavage sites versus non-cleavage sites) in the training datasets. We process the datasets for each protease using the IFS algorithm before running the clustering, since this simulates the processing done in the first layer of the two-layered predictive model that we use (section Sequence scoring functions in ProsperousPlus and Algorithm 1). Figure 2 shows the resulting clusters for the six proteases: caspase-1 (Figure 2A), caspase-3 (Figure 2B), caspase-6 (Figure 2C), matrix metallopeptidase-2 (Figure 2D), matrix metallopeptidase-3 (Figure 2E) and Granzyme B (Figure 2F). The clustering results are provided in Table S3. To ease the visualization, this figure relies on the two-dimensional feature space defined by the two dominant dimensions extracted using principal component analysis. We represent the cleavage sites using the dot markers and the non-cleavage sites using the × markers, while the two clusters are colour-coded in red and green.
The colour-coded clusters reveal relatively clear natural groupings. We find particularly well-separated clusters for caspase-1, caspase-3, caspase-6 and Granzyme B (Figure 2A–C and F, respectively). Moreover, these natural clusters (i.e. clusters obtained without using the cleavage site annotations) are in good agreement with the cleavage site annotations: cleavage sites (positives) are primarily grouped in the red clusters while the non-cleavage sites (negatives) dominate the green clusters for caspase-1, caspase-6 and Granzyme B. In contrast, for caspase-3, cleavage sites are primarily grouped in the green cluster and non-cleavage sites dominate the red cluster. For instance, for caspase-6, 91.59% of the cleavage sites are in the red cluster, and 97.80% of the non-cleavage sites are in the green cluster. The scores for MMP-2 and MMP-3 achieved relatively poorer clustering performance compared with the other four proteases, but they still showed far better results than random classification. These results reveal that the sequence scoring functions that we employ produce relatively well-defined natural clusters that align with the native annotations of cleavage sites, suggesting that these features should be useful for predicting cleavage sites across different types of proteases.
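The agreement between unsupervised clusters and cleavage annotations can be quantified along the lines of the sketch below. It implements a plain two-cluster K-means in pure Python and then reports which fraction of the annotated sites falls into a given cluster. The deterministic initialisation, the function names and the toy two-dimensional points (standing in for the selected sequence scores) are assumptions made for illustration.

```python
def two_means(points, iterations=20):
    """K-means with K = 2; centroids start at the smallest and largest points (assumption)."""
    centroids = [min(points), max(points)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to the closest centroid (squared Euclidean distance).
        labels = [
            min((0, 1), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in (0, 1):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels


def fraction_in_cluster(labels, is_cleavage, cluster, positive=True):
    """Fraction of sites with the given annotation that were assigned to `cluster`."""
    selected = [label for label, flag in zip(labels, is_cleavage) if flag == positive]
    return sum(1 for label in selected if label == cluster) / len(selected)


scores = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
is_cleavage = [False, False, True, True, True, True]
labels = two_means(scores)
print(labels, fraction_in_cluster(labels, is_cleavage, cluster=1))
```

Because the clustering never sees the annotations, the cluster indices are arbitrary, which is why the cleavage-rich cluster can be the red one for some proteases and the green one for others.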
We also evaluate the predictive performance of the complete
two-layer models using the 10-fold cross-validation on the training
dataset for the same six proteases (Table 1). We quantify the
predictive quality with a comprehensive collection of metrics,
Table 1: Performance evaluation of ProsperousPlus based on the 10-fold cross-validation on the training dataset for six proteases:
C14.001 (caspase-1), C14.003 (caspase-3), C14.005 (caspase-6), M10.003 (MMP-2), M10.005 (MMP-3) and S01.010 (Granzyme B). The
results are presented as the averages over the 10 folds ± the corresponding standard deviations.

Protease  Accuracy       AUC            Sensitivity    Precision      F1             MCC
C14.001   0.875 ± 0.058  0.910 ± 0.053  0.846 ± 0.112  0.913 ± 0.067  0.872 ± 0.062  0.762 ± 0.111
C14.003   0.970 ± 0.015  0.989 ± 0.011  0.902 ± 0.047  0.924 ± 0.050  0.913 ± 0.044  0.895 ± 0.053
C14.005   0.970 ± 0.006  0.992 ± 0.006  0.974 ± 0.014  0.968 ± 0.015  0.971 ± 0.005  0.941 ± 0.011
M10.003   0.863 ± 0.023  0.922 ± 0.022  0.879 ± 0.026  0.847 ± 0.024  0.863 ± 0.023  0.728 ± 0.045
M10.005   0.869 ± 0.019  0.931 ± 0.015  0.881 ± 0.025  0.860 ± 0.026  0.870 ± 0.019  0.738 ± 0.038
S01.010   0.909 ± 0.030  0.953 ± 0.015  0.887 ± 0.041  0.925 ± 0.033  0.906 ± 0.033  0.819 ± 0.060
including accuracy, AUC, sensitivity, precision, F1 and MCC. We
find that combining the sequence scoring functions with the
machine learning classifiers produces relatively strong predic-
tions, with AUCs ranging between 0.86 (for metallopeptidase-2)
and 0.97 (for caspase-6 and caspase-3). The values of the other
metrics are also high, with the MCC values between 0.73 and
0.95 that correspond to strong correlations, sensitivities >0.87
and precision scores >0.84. Altogether, these results suggest that
the machine learning algorithms that are used produce accurate
predictions, which means they can take advantage of the high-
quality features produced by the sequence scoring functions.
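For readers who want to reproduce the threshold-dependent metrics listed above, the sketch below computes accuracy, sensitivity, precision, F1 and MCC from the four cells of a binary confusion matrix using the standard definitions. The function name and the example counts are hypothetical and do not correspond to any protease in this study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    def safe_div(num, den):
        # Return 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    sensitivity = safe_div(tp, tp + fn)   # also called recall
    precision = safe_div(tp, tp + fp)
    f1 = safe_div(2 * precision * sensitivity, precision + sensitivity)
    mcc = safe_div(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts for one cross-validation fold.
metrics = binary_metrics(tp=90, fp=10, tn=85, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Averaging such per-fold values and taking their standard deviation gives summaries in the "mean ± standard deviation" form used in Table 1.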
Comparison with other modern predictors
We used the independent test datasets to compare the predictive
performance of models generated by ProsperousPlus against sev-
eral state-of-the-art approaches that include SitePrediction [12],
PROSPERous [18], DeepCleave [19] and Procleave [20]. We collected
predictions of these four tools using their corresponding web-
servers. We compare results for 12 proteases (caspase-1, caspase-
3, caspase-6, caspase-7, MMP-2, MMP-3, MMP-7, MMP-8, MMP-9,
MMP-12, Granzyme B and thrombin) that are commonly predicted
by these tools. We summarize the results in Figures 3 and 4 and
Table S4.
Figure 3 and Table S4 show the performance comparison
results between ProsperousPlus and state-of-the-art approaches
regarding three major evaluation metrics (accuracy, MCC and
F1) for 12 proteases on the independent test datasets. In every
metric, ProsperousPlus ranked first or second on 11 of the 12
proteases. More
specifically, ProsperousPlus performed best on five out of 12
proteases in all three metrics (accuracy, MCC and F1), including
caspase-1 (C14.001), caspase-6 (C14.005), MMP-8 (M10.002), MMP-
2 (M10.003) and MMP-12 (M10.009). In addition, ProsperousPlus
achieved the best performance on MMP-3 (M10.005) and thrombin
(S01.217) in two of three metrics. To be precise, ProsperousPlus
achieved the best accuracy and F1 and the second-best MCC
(PROSPERous achieved the best MCC) on MMP-3. For thrombin,
ProsperousPlus secured the best accuracy and MCC and the second-
best F1 (Procleave achieved the best F1). Besides, ProsperousPlus
ranked second on four proteases regarding these three metrics,
including caspase-3 (C14.003), MMP-9 (M10.004), MMP-7 (M10.008)
and Granzyme B (S01.010), with slightly lower performance
compared with Procleave. However, ProsperousPlus ranked only fourth
on caspase-7 (C14.004) in terms of all three performance metrics.
Figure 4 shows the ROC curves of ProsperousPlus and state-
of-the-art methods on these 12 proteases. The AUC value is
a threshold-independent metric that provides an aggregate
measure of a model’s ability to distinguish between positive and
negative samples across all possible thresholds, with a value
ranging from 0 to 1. AUC values closer to 1 indicate better
performance. ProsperousPlus performed best regarding the AUC
with eight out of the 12 proteases, including caspase-1 (C14.001),
caspase-3 (C14.003), caspase-7 (C14.004), caspase-6 (C14.005),
MMP-2 (M10.003), MMP-9 (M10.004), MMP-3 (M10.005) and MMP-12
(M10.009), while on MMP-8 (M10.002), ProsperousPlus and
Procleave tied for the best AUC value. In addition, ProsperousPlus
achieved the second-best AUCs on MMP-7 (M10.008),
Granzyme B (S01.010) and thrombin (S01.217), where Procleave
performed best on all three proteases. The average AUC across
all considered proteases for ProsperousPlus was 0.966, surpassing
the second-best AUC of 0.950 achieved by Procleave and the
third-best AUC of 0.923 achieved by DeepCleave. Overall, all these
results demonstrated that ProsperousPlus achieved competitive
predictive performance that exceeded the predictive performance
of current state-of-the-art approaches on average.
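To make the threshold-independent nature of the AUC concrete, the sketch below computes it through its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample, with ties counted as one half. This is a generic illustration with hypothetical labels and scores, not the evaluation code used in this study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical predictions: 1 = cleavage site, 0 = non-cleavage site.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

Because only the ordering of the scores matters, no decision threshold has to be chosen, which is why the AUC complements accuracy, MCC and F1 in the comparison. The pairwise loop is quadratic in the number of samples; a sort-based implementation would be preferable for large test sets.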
Model interpretation
We employ the Shapley additive explanation (SHAP) algorithm
[60] to perform an interpretability analysis of the ProsperousPlus
models. SHAP assesses the importance of input features through
the calculation of Shapley values. This tool is frequently applied
to analyse related bioinformatics models [49, 50, 61–63]. While
ProsperousPlus utilizes feature selection to develop a
well-performing subset of features in Algorithm 1, here, we study
the contributions of individual features to the predictive quality
of the resulting predictive models.
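The idea behind Shapley values can be illustrated without the SHAP library. When only a handful of features are involved, the values can be computed exactly by enumerating all feature coalitions. The sketch below does this for a toy linear scorer; the feature names, weights, input and baseline are hypothetical and are not parameters of any ProsperousPlus model. SHAP itself relies on more efficient model-specific algorithms and approximations.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of coalitions of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    # Hypothetical scorer: KNN and WLS raise the score, IC50 lowers it.
    knn, wls, ic50 = z
    return 0.6 * knn + 0.3 * wls - 0.4 * ic50


x = [0.9, 0.7, 0.2]          # hypothetical feature values of one sample
baseline = [0.5, 0.5, 0.5]   # hypothetical reference values
phi = shapley_values(toy_model, x, baseline)
for name, contribution in zip(["KNN", "WLS", "IC50"], phi):
    print(f"{name}: {contribution:+.3f}")
```

The contributions sum to the difference between the score of the sample and the score of the baseline, which is the additivity property that makes Shapley values suitable for ranking features.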
Figure 5 shows the feature importance ranking generated by
SHAP for the 12 proteases that we use in the comparative analysis
in section Comparison with Other Modern Predictors. Each panel
in this figure ranks the features from the most useful (at the top)
to the least useful (at the bottom). Each feature is represented by a
plot that shows the distribution of SHAP values (positive and neg-
ative values that quantify associations with predictions) for test
sequences, while the colour gradient from blue to red visualizes
the distribution of feature values from low to high. These plots
show how the values of each feature push the model output towards
or away from the prediction of a cleavage site.
We found that several features provide substantial contributions
to the prediction of the cleavage sites across multiple proteases,
including KNN, NNS, WLS and IC50. Notably, the KNN scores are
identified by SHAP as the most important feature for six out of 12
proteases (caspase-1, caspase-3, MMP-3, MMP-9, MMP-12 and
Granzyme B), the second most important feature for three proteases
(caspase-6, MMP-7 and MMP-8) and the third most important feature
for two proteases (MMP-2 and thrombin). The distributions of the
SHAP values of the KNN scores for all 12 proteases indicate that
ProsperousPlus is more likely to predict samples with larger KNN
values as cleavage sites. The WLS scores are also found to be
useful, surpassing the KNN for three proteases (caspase-6, MMP-7
and thrombin). In addition, WLS ranked second for caspase-3 and
third for five proteases, namely caspase-1, MMP-3, MMP-8, MMP-12
and Granzyme B. Generally, a larger value of the WLS feature
promotes the prediction of cleavage sites, while a lower WLS value
favours the prediction of non-cleavage sites. Conversely, IC50
(which ranked as the second most important feature for MMP-2,
MMP-3 and Granzyme B and as the third most important feature for
caspase-3 and caspase-6) demonstrated the opposite effect compared
with the KNN and WLS features. SHAP plots revealed that, for most
proteases, larger values of the IC50 score led to a higher
likelihood of non-cleavage site prediction.

In summary, KNN, NNS, WLS and IC50 features are arguably the key
features that contribute to the cleavage site predictions by the
models produced by ProsperousPlus. The significance of these
scoring functions presumably arises from their adeptness at
capturing specific motifs, short conserved sequences or functional
domains that play pivotal roles in the biological activity of the
sequences. These functions might excel in pinpointing such crucial
regions that substantially contribute to the prediction.
Furthermore, scoring functions such as KNN and NNS take into
consideration the nearest neighbours of each sequence, which might
aid in capturing context-specific insights. This might signify
conserved functional contexts that underlie the observed
biological activity. Moreover, KNN, NNS and WLS possess the
capability to capture non-linear relationships within the sequence
data through their proximity-based methodology, enabling them to
grasp intricate interactions that collectively contribute to the
prediction.

Figure 3. Performance comparison between ProsperousPlus and
state-of-the-art approaches, including DeepCleave, Procleave,
PROSPERous and SitePrediction, in terms of Accuracy, MCC and F1 on
the independent test datasets.
Furthermore, we contrast distributions of values of the best
four features for the cleavage sites (positives in red) and the
non-cleavage sites (negatives in green) in Figure S1. We find that
these distributions are substantially different between the
sequences of the cleavage versus the non-cleavage sites.
Importantly, the differences are consistent across different
protease types, i.e. the median IC50 values are uniformly lower
for the cleavage sites, and the median WLS, KNN and NNS values are
higher for the cleavage sites. This consistency helps explain why
our models produce accurate results across the different types of
proteases.
Figure 4. ROC curves of ProsperousPlus and other state-of-the-art approaches (including SitePrediction, PROSPERous, DeepCleave and Procleave) for the
cleavage site prediction of caspase-1, caspase-3, caspase-6, caspase-7, MMP-2, MMP-3, MMP-7, MMP-8, MMP-9, MMP-12, Granzyme B and thrombin on
the independent test datasets.
Figure 5. Feature interpretations according to SHAP values for ProsperousPlus prediction of protease-specific cleavage sites on the independent test
datasets. The order from top to bottom represents the importance ranking of features. Colours indicate feature values (red: high; blue: low), and SHAP
(positive or negative) values indicate the directionality of the top features. Positive SHAP values indicate positive predictions (cleavage sites), while
negative SHAP values indicate negative predictions (non-cleavage sites).
Webserver and local standalone software for
We amplify the impact of the ProsperousPlus resource by imple-
menting it as both an online webserver and standalone software
that can be used locally. We rely on the PHP and Python program-
ming languages, which are relatively popular and should be easy
to use by the end users. The webserver is freely available at http://, while the local
standalone version is available at
ProsperousPlus. Both versions include the three modules that
we introduce in section Overview of ProsperousPlus, i.e.
‘Prediction’, ‘TrainYourModel’ and ‘UseYourOwnModel’ (Figure 6A).
Here, we provide more details.
Figure 6. The ProsperousPlus webserver. (A) Intuitive graphical
workflow of the webserver. (B) An example to showcase the usage of
the ProsperousPlus webserver that covers the input page for the
‘Prediction’, ‘TrainYourModel’ and ‘UseYourOwnModel’ modules
(subpanels 1, 2 and 3), the prediction results (subpanel 4), the
model training results (subpanel 5) and visualization of results
(subpanel 6).

The ‘Prediction’ module provides access to the pre-computed
predictive models for the 110 protease types, allowing users
to instantly make predictions for the corresponding
protease-specific cleavage sites. Users can input or upload their
substrate sequences in the FASTA format and then select the
proteases of interest to make the prediction (Figure 6B-1). We
limited each prediction job to <1000 sequences in the webserver to
ensure that long jobs do not block this resource for other users.
The standalone version does not have this limit and is recommended
in cases where users want to process large datasets.

Using the ‘TrainYourModel’ and ‘UseYourOwnModel’ modules, users
can train new models based on their in-house data (including
protease types outside of the list of the 110 for which we
pre-computed the models) and then apply these trained models to
make cleavage site predictions (Figure 6B-2). For the
‘TrainYourModel’ module, users need to provide a training dataset
file containing cleavage site sequences (with sequence lengths
ranging from 8 to 20) to train and build the model. In addition,
users can optionally provide a test dataset file to test the
performance of the trained model. Sequences of interest can also
be optionally provided to make predictions by using the trained
model. For the ‘UseYourOwnModel’ module, users need to provide the
trained model file and sequences of interest in the FASTA format
to make the prediction.
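Before submitting a job, it can be convenient to check locally that the input respects the constraints described above. The following minimal sketch parses FASTA text and checks the webserver limit of fewer than 1000 sequences and the 8 to 20 residue range for training sequences. The function names and the example records are hypothetical, and the sketch assumes that training sequences are supplied as plain sequence strings; the exact file layout expected by ProsperousPlus is described on its Help page and is not reproduced here.

```python
MAX_WEB_SEQUENCES = 1000             # webserver accepts fewer than 1000 sequences per job
MIN_SITE_LEN, MAX_SITE_LEN = 8, 20   # length range stated for training sequences


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first FASTA header")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fits_webserver(records):
    """True if the job is small enough for the webserver limit."""
    return len(records) < MAX_WEB_SEQUENCES


def valid_training_site(sequence):
    """True if a training sequence has an accepted length."""
    return MIN_SITE_LEN <= len(sequence) <= MAX_SITE_LEN


# Hypothetical input with two records; the first spans two lines.
example = ">sub1\nMKTAYIAKQR\nQISFVKSHFS\n>sub2\nDEVDGAAA\n"
records = parse_fasta(example)
print(records)
print("fits webserver:", fits_webserver(records))
print("valid lengths:", [valid_training_site(seq) for _, seq in records])
```

Jobs that exceed the webserver limit would be candidates for the standalone version, which the text recommends for large datasets.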
Upon submission, a unique Job ID will be generated to refer
to the job summary page during the job execution process. Users
can use this Job ID to enquire and track the execution progress of
their job and access and download their prediction results once
they are ready (Figure 6B-4, B-5 and B-6). The outputs of the
‘Prediction’ module consist of the prediction results and
visualization plots of predicted cleavage site positions in the
query FASTA sequences. The outputs of the ‘TrainYourModel’ module
include the trained model file, the performance evaluation results
file, the ROC curves of the trained model and a graphical plot of
the predicted cleavage sites with their corresponding positions
(if query sequences in the FASTA format are provided). The outputs
of the ‘UseYourOwnModel’ module are the same as those of the
‘Prediction’ module, including the prediction results and plots of
the predicted cleavage sites.
We introduce ProsperousPlus, a user-friendly and comprehensive
bioinformatic platform for accurate protease-specific cleavage
site prediction. ProsperousPlus generates well-designed two-layer
prediction models that combine the strengths of scoring function-
based features, machine learning algorithms and multiple model
ensemble strategies (stacking, blending and bagging). We empiri-
cally show that the scoring functions that ProsperousPlus uses pro-
vide high-quality inputs for the predictive models across different
types of proteases. Further benchmarking using the independent
test datasets reveals that models generated by ProsperousPlus are,
on average, more accurate than modern predictors of cleavage
sites. The key features of ProsperousPlus include:
1) Comprehensive coverage of the protease-specific substrate
and cleavage site predictions. While our platform is pre-
loaded with models for the 110 protease types, its innovative
AutoML pipeline facilitates an easy (even for users without
programming and bioinformatics background) development
of models for other protease types. This arguably makes
ProsperousPlus the most comprehensive tool that is available
to date.
2) State-of-the-art levels of predictive performance. This stems
from the well-informed design of our platform, which bene-
fits from our expertise as authors of PROSPERous, iProt-Sub
and DeepCleave predictors.
3) Availability of both the webserver and standalone code ver-
sions. This amplifies the impact of our tools by catering to
the needs of different types of users, e.g. occasional versus
frequent users, users who want to run large jobs versus
predictions for small datasets and bioinformaticians who
want to use these predictors in other bioinformatics tools.
The combination of these features arguably makes Prosperous-
Plus an invaluable tool, empowering users to perform accurate
protease-specific substrate cleavage site prediction and to train
in-house models to meet the ever-expanding pool of substrate
cleavage site data.
While ProsperousPlus represents a significant advancement in
protease-specific cleavage site prediction, it has certain
limitations. Firstly, although the platform encompasses an
extensive array of protease types, the predictive performance
might vary across different protease families due to variations in
cleavage site characteristics. Additionally, while incorporating
multiple scoring functions enhances the predictive power, the
selection of an optimal set of features and models remains a
challenge for some proteases, which might be addressed through
ongoing refinement. In terms of future research, cutting-edge deep
learning architectures, such as large language models (LLMs),
could potentially be leveraged to further enhance the predictive
accuracy and uncover complex interactions governing protease substrate
Key Points
Comprehensive coverage of the protease-specific sub-
strate and cleavage site predictions. While ProsperousPlus
is pre-loaded with models for the 110 protease types, its
innovative AutoML pipeline facilitates the easy develop-
ment of models for other protease types.
State-of-the-art levels of predictive performance. This
stems from the well-informed design of our platform,
which benefits from our expertise as authors of PROS-
PERous, iProt-Sub and DeepCleave predictors.
Availability of both webserver and standalone code ver-
sions at and
This amplifies the impact of ProsperousPlus by catering
to the needs of different types of users.
Supplementary data are available online at https://academic.oup.
The National Natural Science Foundation of China (No.
62202388); the National Key Research and Development Program
of China (No. 2022YFF1000100); the Qin Chuangyuan Innovation
and Entrepreneurship Talent Project (No. QCYRCXM-2022-230);
Talent Research Funding at Northwest A&F University (No.
Z1090222021); the Major and Seed Inter-Disciplinary Research
Projects awarded by Monash University.
The ProsperousPlus tool is freely available at http://prosperousplus. The source code and datasets of
ProsperousPlus are freely available on the Download page of the
webserver and GitHub repository (
ProsperousPlus). In addition, detailed user instructions are acces-
sible on the Help page at http://prosperousplus.unimelb-biotools.
1. Lopez-Otin C, Matrisian LM. Emerging roles of proteases in
tumour suppression. Nat Rev Cancer 2007;7:800–8.
2. Dixit VM. The road to death: caspases, cleavage, and pores. Sci
Adv 2023;9:eadi2011.
3. Han N, Jin K, He K, et al. Protease-activated receptors in cancer:
a systematic review. Oncol Lett 2011;2:599–608.
4. Chary A, Holodniy M. Recent advances in hepatitis C virus
treatment: review of HCV protease inhibitor clinical trials. Rev
Recent Clin Trials 2010;5:158–73.
5. Pang X, Xu W, Liu Y, et al. The research progress of SARS-CoV-
2 main protease inhibitors from 2020 to 2022. Eur J Med Chem
6. Peach CJ, Edgington-Mitchell LE, Bunnett NW, et al. Protease-
activated receptors in health and disease. Physiol Rev 2023;103:
7. Turk B. Targeting proteases: successes, failures and future
prospects. Nat Rev Drug Discov 2006;5:785–99.
8. Yau MK, Liu L, Fairlie DP. Toward drugs for protease-activated
receptor 2 (PAR2). J Med Chem 2013;56:7477–97.
9. Li F, Wang Y, Li C, et al. Twenty years of bioinformatics research
for protease-specific substrate and cleavage site prediction: a
comprehensive revisit and benchmarking of existing methods.
Brief Bioinform 2019;20:2150–66.
10. Wilkins MR, Gasteiger E, Bairoch A, et al. Protein identifica-
tion and analysis tools in the ExPASy server. Methods Mol Biol
11. Boyd SE, Pike RN, Rudy GB, et al. PoPS: a computational tool for
modeling and predicting protease specificity. J Bioinform Comput
Biol 2005;3:551–85.
12. Verspurten J, Gevaert K, Declercq W, et al. SitePredicting the
cleavage of proteinase substrates. Trends Biochem Sci 2009;34:
13. Liu Z, Cao J, Gao X, et al. GPS-CCD: a novel computational
program for the prediction of calpain cleavage sites. PloS One
14. Song J, Tan H, Shen H, et al. Cascleave: towards more accurate
prediction of caspase substrate cleavage sites. Bioinformatics
15. Song J, Tan H, Perry AJ, et al. PROSPER: an integrated feature-
based tool for predicting protease substrate cleavage sites. PloS
One 2012;7:e50300.
16. Wang M, Zhao XM, Tan H, et al. Cascleave 2.0, a new approach for
predicting caspase and granzyme cleavage targets. Bioinformatics
17. Song J, Wang Y, Li F, et al. iProt-sub: a comprehensive package for
accurately mapping and predicting protease-specific substrates
and cleavage sites. Brief Bioinform 2019;20:638–58.
18. Song J, Li F, Leier A, et al. PROSPERous: high-throughput
prediction of substrate cleavage sites for 90 proteases with improved
accuracy. Bioinformatics 2018;34:684–7.
19. Li F, Chen J, Leier A, et al. DeepCleave: a deep learning predictor
for caspase and matrix metalloprotease substrates and cleavage
sites. Bioinformatics 2020;36:1057–65.
20. Li F, Leier A, Liu Q, et al. Procleave: predicting protease-specific
substrate cleavage sites by combining sequence and structural
information. Genomics Proteomics Bioinformatics 2020;18:52–64.
21. Wang Y, Song J, Marquez-Lago TT, et al. Knowledge-transfer
learning for prediction of matrix metalloprotease substrate-
cleavage sites. Sci Rep 2017;7:5755.
22. Rawlings ND, Bateman A. How to use the MEROPS database
and website to help understand peptidase specificity. Protein Sci
23. Li F, Li C, Marquez-Lago TT, et al. Quokka: a comprehensive
tool for rapid and accurate prediction of kinase family-specific
phosphorylation sites in the human proteome. Bioinformatics
24. Gao J, Xu D. The Musite open-source framework for
phosphorylation-site prediction. BMC Bioinformatics 2010;
11(Suppl 12):S9.
25. Mei S, Li F, Xiang D, et al. Anthem: a user customised tool for
fast and accurate prediction of binding between peptides and
HLA class I molecules. Brief Bioinform 2021;22(5):bbaa415.
26. Chen Z, Zhao P, Li F, et al. iFeature: a python package and web
server for features extraction and selection from protein and
peptide sequences. Bioinformatics 2018;34:2499–502.
27. Crooks GE, Hon G, Chandonia JM, et al. WebLogo: a sequence logo
generator. Genome Res 2004;14:1188–90.
28. Nishida K, Frith MC, Nakai K. Pseudocounts for transcription
factor binding sites. Nucleic Acids Res 2009;37:939–44.
29. Andreatta M, Alvarez B, Nielsen M. GibbsCluster: unsupervised
clustering and alignment of peptide sequences. Nucleic Acids Res
30. Liu G, Li D, Li Z, et al. PSSMHCpan: a novel PSSM-based software
for predicting class I peptide-HLA binding affinity. GigaScience
31. Zhang H, Lund O, Nielsen M. The PickPocket method for predict-
ing binding specificities for receptors based on receptor pocket
similarities: application to MHC-peptide binding. Bioinformatics
32. Mei S, Li F, Leier A, et al. A comprehensive review and
performance evaluation of bioinformatics tools for HLA class I
peptide-binding prediction. Brief Bioinform 2020;21:1119–35.
33. Thompson JD, Higgins DG, Gibson TJ. Improved sensitivity of
profile searches through the use of sequence weights and gap
excision. Bioinformatics 1994;10:19–29.
34. Consortium U. UniProt: the universal protein knowledgebase.
Nucleic Acids Res 2018;46:2699.
35. Li M, Lu Z, Wu Y, et al. BACPI: a bi-directional attention neural
network for compound–protein interaction and binding affinity
prediction. Bioinformatics 2022;38:1995–2002.
36. Gfeller D, Guillaume P, Michaux J, et al. The length distribution
and multiple specificity of naturally presented HLA-I ligands. J
Immunol 2018;201:3705–16.
37. Bassani-Sternberg M, Chong C, Guillaume P, et al. Deciphering
HLA-I motifs across HLA peptidomes improves neo-antigen pre-
dictions and identifies allostery regulating HLA specificity. PLoS
Comput Biol 2017;13:e1005725.
38. Jurtz V, Paul S, Andreatta M, et al. NetMHCpan-4.0: improved
peptide–MHC class I interaction predictions integrating eluted
ligand and peptide binding affinity data. J Immunol 2017;199:
39. Rasmussen M, Fenoy E, Harndahl M, et al. Pan-specific prediction
of peptide–MHC class I complex stability, a correlate of T cell
immunogenicity. J Immunol 2016;197:1600582.
40. Hu Y, Wang Z, Hu H, et al. ACME: pan-specific peptide-MHC
class I binding prediction through attention-based deep neural
networks. Bioinformatics 2019;35:4946–54.
41. Reynisson B, Alvarez B, Paul S, et al. NetMHCpan-4.1 and
NetMHCIIpan-4.0: improved predictions of MHC antigen pre-
sentation by concurrent motif deconvolution and integration
of MS MHC eluted ligand data. Nucleic Acids Res 2020;48:
42. Dorogush AV, Ershov V, Gulin A. CatBoost: gradient
boosting with categorical features support. arXiv preprint
arXiv:1810.11363 2018.
43. Chen T, He T, Benesty M, et al. Xgboost: extreme gradient boost-
ing. R package version 0.4-2 2015:1-4.
44. Ke G, Meng Q, Finley T, et al. Lightgbm: a highly efficient
gradient boosting decision tree. Adv Neural Inf Process Syst 2017;30:
45. Webb GI, Boughton JR, Wang Z. Not so naive Bayes: aggregating
one-dependence estimators. Mach Learn 2005;58:5–24.
46. Liu H, Setiono R. Incremental feature selection. Appl Intell 1998;9:
47. Li F, Li C, Wang M, et al. GlycoMine: a machine learning-based
approach for predicting N-, C-and O-linked glycosylation in the
human proteome. Bioinformatics 2015;31:1411–9.
48. Li F, Li C, Revote J, et al. GlycoMine struct: a new bioinformatics
tool for highly accurate mapping of the human N-linked and O-
linked glycoproteomes by incorporating structural features. Sci
Rep 2016;6:34595.
49. Li F, Guo X, Jin P, et al. Porpoise: a new approach for accu-
rate prediction of RNA pseudouridine sites. Brief Bioinform
50. Chen R, Li F, Guo X, et al. ATTIC is an integrated approach
for predicting A-to-I RNA editing sites in three species. Brief
Bioinform 2023;24(3):bbad170.
51. Li F, Chen J, Ge Z, et al. Computational prediction and
interpretation of both general and specific types of
promoters in Escherichia coli by exploiting a stacked
ensemble-learning framework. Brief Bioinform 2021;22:
52. Jia C, Bi Y, Chen J, et al. PASSION: an ensemble neural network
approach for identifying the binding sites of RBPs on circRNAs.
Bioinformatics 2020;36:4276–82.
53. Wang M, Zhao XM, Takemoto K, et al. FunSAV: predict-
ing the functional effect of single amino acid variants
using a two-stage random forest model. PloS One 2012;7:
54. van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat
Appl Genet Mol Biol 2007;6.
55. Zhou Z-H. Ensemble Methods: Foundations and Algorithms. CRC
Press, 2012.
56. Altman N, Krzywinski M. Ensemble methods: bagging and ran-
dom forests. Nat Methods 2017;14:933–4.
57. Matthews BW. Comparison of the predicted and observed sec-
ondary structure of T4 phage lysozyme. Biochim Biophys Acta
Protein Struct 1975;405:442–51.
58. Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett
59. Hartigan JA, Wong MA. A K-means clustering algorithm. J R Stat
Soc Ser C Appl Stat 2018;28:100–8.
60. Lundberg SM, Lee S-I. A unified approach to interpreting model
predictions. Adv Neural Inf Process Syst 2017;30.
61. Bi Y, Li F, Guo X, et al. Clarion is a multi-label problem transfor-
mation method for identifying mRNA subcellular localizations.
Brief Bioinform 2022;23(6):bbac467.
62. Wang R, Jiang Y, Jin J, et al. DeepBIO: an automated and inter-
pretable deep-learning platform for high-throughput biological
sequence prediction, functional annotation and visualization
analysis. Nucleic Acids Res 2023;51:3017–29.
63. Wei L, He W, Malik A, et al. Computational prediction and
interpretation of cell-specific replication origin sites from multi-
ple eukaryotes by exploiting stacking framework. Brief Bioinform
Downloaded from by Biomedical Library user on 24 October 2023
... Hence, this identified novel N-terminal sequence is from proteolysis. We found 27 novel N-termini of SEPs might be produced by proteolysis, according to predictions made by an online tool ProsperousPlus (53). For example, the identified N-terminal peptide IFYNNPKLETAQMFMNR of IP_2307589 may come from the protease cleavage by ADAMTS4. ...
Full-text available
sORF-encoded peptides (SEPs) refer to proteins encoded by small open reading frames (sORFs) with a length of less than 100 amino acids, which play an important role in various life activities. Analysis of known SEPs showed that using non-canonical initiation codons of SEPs was more common. However, the current analysis of SEP sequences mainly relies on bioinformatics prediction, and most of them use AUG as the start site, which may not be completely correct for SEPs. Chemical labeling was used to systematically analyze the N-terminal sequences of SEPs to accurately define the start sites of SEPs. By comparison, we found that dimethylation and guanidinylation are more efficient than acetylation. The ACN precipitation and heating precipitation performed better in SEP enrichment. As an N-terminal peptide enrichment material, Hexadhexaldehyde was superior to CNBr-activated agarose and NHS-activated agarose. Combining these methods, we identified 128 SEPs with 131 N-terminal sequences. Among them, two-thirds are novel N-terminal sequences, and most of them start from the 11–31st amino acids of the original sequence. Partial novel N-termini were produced by proteolysis or signal peptide removal. Some SEPs’ transcription start sites were corrected to be non-AUG start codons. One novel start codon was validated using GFP-tag vectors. These results demonstrated that the chemical labeling approaches would be beneficial for identifying the start codons of sORFs and the real N-terminal of their encoded peptides, which helps better understand the characterization of SEPs.
... This becomes a limitation in scenarios where protein sequences lack extensive homologous families. However, deep learning can auto-extract features without labor-intensive effort and improve model generalization [22], [23], [24], [25]. ...
Protein-metal ion interactions play a central role in the onset of numerous diseases. When amino acid changes lead to missense mutations in metal-binding sites, the disrupted interaction with metal ions can compromise protein function, potentially causing severe human ailments. Identifying these disease-associated mutation sites within metal-binding regions is paramount for understanding protein function and fostering innovative drug development. While some computational methods aim to tackle this challenge, they often fall short in accuracy, commonly due to manual feature extraction and the absence of structural data. We introduce MetalPrognosis, an innovative, alignment-free solution that predicts disease-associated mutations within metal-binding sites of metalloproteins with heightened precision. Rather than relying on manual feature extraction, MetalPrognosis employs sliding window sequences as input, extracting deep semantic insights from pre-trained protein language models. These insights are then incorporated into a convolutional neural network, facilitating the derivation of intricate features. Comparative evaluations show MetalPrognosis outperforms leading methodologies like MCCNN and M-Ionic across various metalloprotein test sets. Furthermore, an ablation study reiterates the effectiveness of our model architecture. To facilitate public use, we have made the datasets, source codes, and trained models for MetalPrognosis online available at .
... Typically, they support a limited array of seed match types, including 8mer, 7mer-m8, 7mer-A1, 6mer and offset 6mer, all considered canonical sites. In contrast, deep learning approaches have shown a remarkable ability to automatically discern intricate data patterns compared to those reliant on feature engineering (23)(24)(25)(26)(27)(28). For instance, Lee et al . ...
MicroRNAs (miRNAs) are short non-coding RNAs involved in various cellular processes, playing a crucial role in gene regulation. Identifying miRNA targets remains a central challenge and is pivotal for elucidating the complex gene regulatory networks. Traditional computational approaches have predominantly focused on identifying miRNA targets through perfect Watson-Crick base pairings within the seed region, referred to as canonical sites. However, emerging evidence suggests that perfect seed matches are not a prerequisite for miRNA-mediated regulation, underscoring the importance of also recognizing imperfect, or non-canonical, sites. To address this challenge, we propose Mimosa, a new computational approach that employs the Transformer framework to enhance the prediction of miRNA targets. Mimosa distinguishes itself by integrating contextual, positional, and base-pairing information to capture in-depth attributes, thereby improving its predictive capabilities. Its unique ability to identify non-canonical base-pairing patterns makes Mimosa a standout model, reducing the reliance on pre-selecting candidate targets. Mimosa achieves superior performance in gene-level predictions and also shows impressive performance in site-level predictions across various non-human species through extensive benchmarking tests. To facilitate research efforts in miRNA targeting, we have developed an easy-to-use web server for comprehensive end-to-end predictions, which is publicly available at
... In general, a suitable learning rate or set of learning rates can speed up model training and yield a better or even optimal performance. If the learning rate is too small, it takes a long time to reach the desired state; on the other hand, if the learning rate is too large, the algorithm may not converge [43], [44], [45], [46]. In addition, we used categorical cross-entropy as a loss function for model training. ...
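The learning-rate trade-off described in this excerpt can be illustrated with a toy example. The sketch below is not taken from the cited work; it assumes plain gradient descent on the one-dimensional function f(x) = x², and the function name `gradient_descent` and the three learning rates are chosen purely for illustration.

```python
def gradient_descent(learning_rate, start=1.0, steps=50):
    """Minimise f(x) = x**2 with plain gradient descent and return the final x."""
    x = start
    for _ in range(steps):
        grad = 2.0 * x                 # derivative of x**2 at the current point
        x = x - learning_rate * grad   # standard gradient-descent update
    return x


# Illustrative learning rates (assumptions, not values from the cited paper).
for lr in (0.01, 0.4, 1.1):
    print(f"learning rate {lr}: final x = {gradient_descent(lr):.3e}")
```

On this toy problem each update multiplies x by (1 - 2 × learning rate), so a very small rate shrinks x only slowly, a moderate rate reaches the minimum at 0 quickly, and a rate above 1 makes the magnitude of x grow at every step.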
The Type III Secretion Systems (T3SSs) play a pivotal role in host-pathogen interactions by mediating the secretion of type III secretion system effectors (T3SEs) into host cells. These T3SEs mimic host cell protein functions, influencing interactions between Gram-negative bacterial pathogens and their hosts. Identifying T3SEs is essential in biomedical research for comprehending bacterial pathogenesis and its implications on human cells. This study presents EDIFIER, a novel multi-channel model designed for accurate T3SE prediction. It incorporates a graph structural channel, utilizing graph convolutional networks (GCN) to capture protein 3D structural features and a sequence channel based on the ProteinBERT pre-trained model to extract the sequence context features of T3SEs. Rigorous benchmarking tests, including ablation studies and comparative analysis, validate that EDIFIER outperforms current state-of-the-art tools in T3SE prediction. To enhance EDIFIER's accessibility to the broader scientific community, we developed a webserver that is publicly accessible at . We anticipate EDIFIER will contribute to the field by providing reliable T3SE predictions, thereby advancing our understanding of host-pathogen dynamics.
... Firstly, the feature-based approach requires manual design and a combination of features, leading to a labour-intensive trial-and-error process [31]. These manually developed features may also be irrelevant or redundant, hindering accurate model training [32]. Feature selection techniques are employed to mitigate this issue but add complexity to the process. ...
Origins of replication sites (ORIs) are crucial genomic regions where DNA replication initiation takes place, playing pivotal roles in fundamental biological processes like cell division, gene expression regulation, and DNA integrity. Accurate identification of ORIs is essential for comprehending cell replication, gene expression, and mutation-related diseases. However, experimental approaches for ORI identification are often expensive and time-consuming, leading to the growing popularity of computational methods. In this study, we present PLANNER (DeeP LeArNiNg prEdictor for ORI), a novel approach for species-specific and cell-specific prediction of eukaryotic ORIs. PLANNER uses multi-scale k-tuple sequences as input and employs the DNABERT pre-training model with transfer learning and ensemble learning strategies to train accurate predictive models. Extensive empirical test results demonstrate that PLANNER achieved superior predictive performance compared to state-of-the-art approaches, including iOri-Euk, Stack-ORI, and ORI-Deep, within specific cell types and across different cell types. Furthermore, by incorporating an interpretable analysis mechanism, we provide insights into the learned patterns, facilitating the mapping from discovering important sequential determinants to comprehensively analysing their biological functions. To facilitate the widespread utilisation of PLANNER, we developed an online webserver and local stand-alone software, available at and , respectively
... In scenarios where protein sequences lack extensive homologous families, this becomes a limitation. On the other hand, deep learning can auto-extract features, improving model generalization [22,23,24,25]. ...
Protein-metal ion interactions play a central role in the onset of numerous diseases. When amino acid changes lead to missense mutations in metal-binding sites, the disrupted interaction with metal ions can compromise protein function, potentially causing severe human ailments. Identifying these disease-associated mutation sites within metal-binding regions is paramount for understanding protein function and fostering innovative drug development. While some computational methods aim to tackle this challenge, they often fall short in accuracy, commonly due to manual feature extraction and the absence of structural data. We introduce MetalPrognosis, an innovative, alignment-free solution that predicts disease-associated mutations within metal-binding sites of metalloproteins with heightened precision. Rather than relying on manual feature extraction, MetalPrognosis employs sliding window sequences as input, extracting deep semantic insights from pre-trained protein language models. These insights are then incorporated into a convolutional neural network, facilitating the derivation of intricate features. Comparative evaluations show MetalPrognosis outperforms leading methodologies like MCCNN and PolyPhen-2 across various metalloprotein test sets. Furthermore, an ablation study reiterates the effectiveness of our model architecture. To facilitate public use, we have made the datasets, source codes, and trained models for MetalPrognosis online available at .
Multiple sclerosis (MS) is characterized by neuroinflammation and demyelination of the central nervous system (CNS), leading to disability. Genetic variants that confer MS risk implicate genes involved in immune function, while variants related to severity of the disease are associated with genes preferentially expressed within the CNS. Current MS therapies decrease relapse rates by preventing immune-mediated damage of myelin, but they ultimately fail to slow long-term disease progression, which apparently depends on CNS intrinsic processes. The molecular events that trigger progressive MS are still unknown. Here we report that the C-terminal region of TAF1 (the scaffolding subunit of the general transcription factor TFIID) is underrepresented in postmortem brain tissue from individuals with MS. Furthermore, we demonstrate in vivo, in genetically modified mice, that C-terminal alteration of TAF1 suffices to induce an RNA polymerase II (RNAPII)-elongation deficit that particularly affects oligodendroglial myelination-related genes and results in an MS-like brain transcriptomic signature, including increased expression of proinflammatory genes. This transcriptional profile is accompanied by CNS-resident inflammation, robust demyelination and MS-like motor phenotypes. We also identify numerous interactors of C-terminal TAF1 that participate in RNAPII-promoter escape, of which two show evidence for genetic association to MS. Our study reveals that TAF1 dysfunction converges with genetic susceptibility to cause transcriptional dysregulation in CNS cell types, such as oligodendrocytes, to ultimately trigger MS.
Conference Paper
Gradient Boosting Decision Tree (GBDT) is a popular machine learning algorithm, and has quite a few effective implementations such as XGBoost and pGBRT. Although many engineering optimizations have been adopted in these implementations, the efficiency and scalability are still unsatisfactory when the feature dimension is high and data size is large. A major reason is that for each feature, they need to scan all the data instances to estimate the information gain of all possible split points, which is very time consuming. To tackle this problem, we propose two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). With GOSS, we exclude a significant proportion of data instances with small gradients, and only use the rest to estimate the information gain. We prove that, since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain quite accurate estimation of the information gain with a much smaller data size. With EFB, we bundle mutually exclusive features (i.e., they rarely take nonzero values simultaneously), to reduce the number of features. We prove that finding the optimal bundling of exclusive features is NP-hard, but a greedy algorithm can achieve quite good approximation ratio (and thus can effectively reduce the number of features without hurting the accuracy of split point determination by much). We call our new GBDT implementation with GOSS and EFB LightGBM. Our experiments on multiple public datasets show that LightGBM speeds up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy.
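The GOSS idea summarised in this abstract can be sketched in a few lines of plain Python. This is a simplified illustration, not the LightGBM implementation: the function name `goss_sample` and the parameters `top_rate`, `other_rate` and `seed` are hypothetical, and the re-weighting factor (1 - top_rate) / other_rate for the sampled small-gradient instances follows the scheme described in the LightGBM paper rather than anything stated in the abstract itself.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Pick instances in a GOSS-like way and return (indices, weights).

    Keeps every instance whose absolute gradient is in the top `top_rate`
    fraction, randomly samples an `other_rate` fraction of all instances
    from the remainder, and up-weights the sampled ones so that the total
    contribution of small-gradient instances is roughly preserved.
    """
    n = len(gradients)
    # Rank instance indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = round(top_rate * n)
    n_other = round(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    # Uniform random sample from the small-gradient remainder.
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_other, len(rest)))
    amplify = (1.0 - top_rate) / other_rate
    indices = top + sampled
    weights = [1.0] * len(top) + [amplify] * len(sampled)
    return indices, weights


# Toy gradients whose absolute value equals the instance index.
toy_gradients = [(-1) ** i * i for i in range(100)]
indices, weights = goss_sample(toy_gradients)
print(len(indices), "of", len(toy_gradients), "instances kept")
```

With the default rates, 30 of the 100 toy instances would be used to estimate information gain: the 20 with the largest gradients at weight 1 and 10 randomly chosen others at an amplified weight.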
Vishva Dixit recounts his favorite discoveries after 30-plus years studying the proteins that allow infected, damaged, or obsolete cells to die.
A-to-I editing is the most prevalent RNA editing event, which refers to the change of adenosine (A) bases to inosine (I) bases in double-stranded RNAs. Several studies have revealed that A-to-I editing can regulate cellular processes and is associated with various human diseases. Therefore, accurate identification of A-to-I editing sites is crucial for understanding RNA-level (i.e., transcriptional) modifications and their potential roles in molecular functions. To date, various computational approaches for A-to-I editing site identification have been developed; however, their performance is still unsatisfactory and needs further improvement. In this study, we developed a novel stacked-ensemble learning model, ATTIC (A-To-I ediTing predICtor), to accurately identify A-to-I editing sites across three species including H. sapiens, M. musculus, and D. melanogaster. We first comprehensively evaluated 37 RNA sequence-derived features combined with 14 popular machine learning algorithms. Then we selected the optimal base models to build a series of stacked ensemble models. The final ATTIC framework was developed based on the optimal models improved by the feature selection strategy for specific species. Extensive cross-validation and independent tests illustrate that ATTIC outperforms state-of-the-art tools for predicting A-to-I editing sites. We also developed a webserver for ATTIC, which is publicly available at We anticipate that ATTIC can be utilised as a useful tool to accelerate the identification of A-to-I RNA editing events and help characterise their roles in post-transcriptional regulation.
Here, we present DeepBIO, the first-of-its-kind automated and interpretable deep-learning platform for high-throughput biological sequence functional analysis. DeepBIO is a one-stop-shop web service that enables researchers to develop new deep-learning architectures to answer any biological question. Specifically, given any biological sequence data, DeepBIO supports a total of 42 state-of-the-art deep-learning algorithms for model training, comparison, optimization and evaluation in a fully automated pipeline. DeepBIO provides a comprehensive result visualization analysis for predictive models covering several aspects, such as model interpretability, feature analysis and functional sequential region discovery. Additionally, DeepBIO supports nine base-level functional annotation tasks using deep-learning architectures, with comprehensive interpretations and graphical visualizations to validate the reliability of annotated sites. Empowered by high-performance computers, DeepBIO allows ultra-fast prediction with up to million-scale sequence data in a few hours, demonstrating its usability in real application scenarios. Case study results show that DeepBIO provides an accurate, robust and interpretable prediction, demonstrating the power of deep learning in biological sequence functional analysis. Overall, we expect DeepBIO to ensure the reproducibility of deep-learning biological sequence analysis, lessen the programming and hardware burden for biologists and provide meaningful functional insights at both the sequence level and base level from biological sequences alone. DeepBIO is publicly available at
The novel coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has spread worldwide. The main protease (Mpro) of SARS-CoV-2 plays a central role in viral replication and transcription and represents an attractive drug target for fighting COVID-19. Many SARS-CoV-2 Mpro inhibitors have been reported, including covalent and noncovalent inhibitors. The SARS-CoV-2 Mpro inhibitor PF-07321332 (Nirmatrelvir) designed by Pfizer has been put on the market. This paper briefly introduces the structural characteristics of SARS-CoV-2 Mpro and summarizes the research progress of SARS-CoV-2 Mpro inhibitors from the aspects of drug repurposing and drug design. This information will provide a basis for developing drugs to treat infection by SARS-CoV-2 and even other coronaviruses in the future.
Subcellular localization of messenger RNAs (mRNAs) plays a key role in the spatial regulation of gene activity. The functions of mRNAs have been shown to be closely linked with their localizations. As such, understanding the subcellular localizations of mRNAs can help elucidate gene regulatory networks. Although several computational methods have been developed to predict mRNA localizations within cells, there is still much room for improvement in predictive performance, especially for multiple-location prediction. In this study, we proposed a novel multi-label multi-class predictor, termed Clarion, for mRNA subcellular localization prediction. Clarion was developed based on a manually curated benchmark dataset and leveraged the weighted series method for multi-label transformation. Extensive benchmarking tests demonstrated that Clarion achieved competitive predictive performance and that the weighted series method plays a crucial role in securing the superior performance of Clarion. In addition, the independent test results indicate that Clarion outperformed the state-of-the-art methods and can secure accuracy of 81.47%, 91.29%, 79.77%, 92.10%, 89.15%, 83.74%, 80.74%, 79.23%, and 84.74% for chromatin, cytoplasm, cytosol, exosome, membrane, nucleolus, nucleoplasm, nucleus, and ribosome, respectively. The webserver and local stand-alone tool of Clarion are freely available at
Although generally regarded as degradatory enzymes, certain proteases are also signaling molecules that specifically control cellular functions by cleaving protease-activated receptors (PARs). The four known PARs are members of the large family of G protein-coupled receptors. These transmembrane receptors control most physiological and pathological processes and are the target of a large proportion of therapeutic drugs. Signaling proteases include enzymes from the circulation, from immune, inflammatory epithelial and cancer cells, as well as from commensal and pathogenic bacteria. Advances in our understanding of the structure and function of PARs provide insights into how diverse proteases activate these receptors to regulate physiological and pathological processes in most tissues and organ systems. The realization that proteases and PARs are key mediators of disease, coupled with advances in understanding the atomic level structure of PARs and their mechanisms of signaling in subcellular microdomains, has spurred the development of antagonists, some of which have advanced to the clinic. Herein we review the discovery, structure and function of this receptor system, highlight the contribution of PARs to homeostatic control, and discuss the potential of PAR antagonists for the treatment of major diseases.
Motivation The identification of compound-protein interactions (CPIs) is an essential step in the process of drug discovery. The experimental determination of CPIs is known to consume a large amount of funds and time. Computational models have therefore become a promising and efficient alternative for predicting novel interactions between compounds and proteins on a large scale. Most supervised machine learning prediction models treat the task as a binary classification problem, aiming to predict whether or not there is an interaction between the compound and the protein. However, a compound-protein interaction is not a simple binary on-off relationship, but a continuous value that reflects how tightly the compound binds to a particular target protein, also called binding affinity. Results In this study, we propose an end-to-end neural network model, called BACPI, to predict compound-protein interaction and binding affinity. We employ a graph attention network (GAT) and a convolutional neural network (CNN) to learn the representations of compounds and proteins, and develop a bi-directional attention neural network model to integrate the representations. To evaluate the performance of BACPI, we use three CPI datasets and four binding affinity datasets in our experiments. The results show that, when predicting CPIs, BACPI significantly outperforms other available machine learning methods on both balanced and unbalanced datasets. This suggests that the end-to-end neural network model that predicts CPIs directly from low level representations is more robust than traditional machine learning-based methods. When predicting binding affinities, BACPI achieves higher performance on large datasets compared to other state-of-the-art deep learning methods. This comparison result suggests that the proposed method with a bi-directional attention neural network can capture the important regions of compounds and proteins for binding affinity prediction.
Availability and implementation Data and source codes are available at Supplementary information Supplementary data are available at Bioinformatics online.
Pseudouridine is a ubiquitous RNA modification type present in eukaryotes and prokaryotes, which plays a vital role in various biological processes. Almost all kinds of RNAs are subject to this modification. However, it remains a great challenge to identify pseudouridine sites via experimental approaches, requiring expensive and time-consuming experimental research. Therefore, computational approaches that can be used to perform accurate in silico identification of pseudouridine sites from a large amount of RNA sequence data are highly desirable and can aid in the functional elucidation of this critical modification. Here, we propose a new computational approach, termed Porpoise, to accurately identify pseudouridine sites from RNA sequence data. Porpoise builds upon a comprehensive evaluation of 18 frequently used feature encoding schemes based on the selection of four types of features, including binary features, pseudo k-tuple composition (PseKNC), nucleotide chemical property (NCP), and position-specific trinucleotide propensity based on single-strand (PSTNPss). The selected features are fed into the stacked ensemble learning framework to enable the construction of an effective stacked model. Both cross-validation tests on the benchmark dataset and independent tests show that Porpoise achieves predictive performance superior to that of several state-of-the-art approaches. The application of model interpretation tools demonstrates the importance of PSTNPs for the performance of the trained models. This new method is anticipated to facilitate community-wide efforts to identify putative pseudouridine sites and formulate novel testable biological hypotheses.
Neopeptide-based immunotherapy has been recognised as a promising approach for the treatment of cancers. For neopeptides to be recognised by CD8+ T cells and induce an immune response, their binding to human leukocyte antigen class I (HLA-I) molecules is a necessary first step. Most epitope prediction tools thus rely on the prediction of such binding. With the use of mass spectrometry, the scale of naturally presented HLA ligands that could be used to develop such predictors has been expanded. However, few efforts have focused on integrating these experimental data with computational algorithms to efficiently develop up-to-date predictors. Here, we present Anthem for accurate HLA-I binding prediction. In particular, we have developed a user-friendly framework to support the development of customisable HLA-I binding prediction models to meet challenges associated with the rapidly increasing availability of large amounts of immunopeptidomic data. Our extensive evaluation, using both independent and experimental datasets, shows that Anthem achieves an overall similar or higher area under curve (AUC) value compared with other contemporary tools. It is anticipated that Anthem will provide a unique opportunity for non-expert users to analyse and interpret their own in-house or publicly deposited datasets.