Chicco BioData Mining (2017) 10:35
DOI 10.1186/s13040-017-0155-3
REVIEW Open Access
Ten quick tips for machine learning in
computational biology
Davide Chicco
Correspondence:
davide.chicco@davidechicco.it
Princess Margaret Cancer Centre,
PMCR Tower 11-401, 101 College
Street, M5G 1L7 Toronto, Ontario,
Canada
Abstract
Machine learning has become a pivotal tool for many projects in computational
biology, bioinformatics, and health informatics. Nevertheless, beginners and
biomedical researchers often do not have enough experience to run a data mining
project effectively, and therefore can follow incorrect practices that may lead to
common mistakes or over-optimistic results. With this review, we present ten quick tips
to take advantage of machine learning in any computational biology context, by
avoiding some common errors that we observed hundreds of times in multiple
bioinformatics projects. We believe our ten suggestions can strongly help any machine
learning practitioner to carry out a successful project in computational biology and
related sciences.
Keywords: Tips, Machine learning, Computational biology, Biomedical informatics,
Health informatics, Bioinformatics, Data mining, Computational intelligence
Introduction
Recent advances in high-throughput sequencing technologies have made large biolog-
ical datasets available to the scientific community. Together with the growth of these
datasets, internet web services have expanded, enabling biologists to put large amounts of data online for scientific audiences.
As a result, scientists have begun to search for novel ways to interrogate, analyze,
and process data, and therefore infer knowledge about molecular biology, physiology,
electronic health records, and biomedicine in general. Because of its particular ability
to handle large datasets, and to make predictions on them through accurate statistical
models, machine learning was able to spread rapidly and to be used commonly in the
computational biology community.
A machine learning algorithm is a computational method based upon statistics, imple-
mented in software, able to discover hidden non-obvious patterns in a dataset, and
moreover to make reliable statistical predictions about similar new data. As explained by
Kevin Yip and colleagues: “The ability [of machine learning] to automatically identify pat-
terns in data [...] is particularly important when the expert knowledge is incomplete or
inaccurate, when the amount of available data is too large to be handled manually, or when
there are exceptions to the general cases” [1]. This is clearly the case for computational
biology and bioinformatics.
Machine learning (often also termed data mining, computational intelligence, or pattern recognition) has thus been applied to multiple computational biology problems so far
[2–5], helping scientific researchers to discover knowledge about many aspects of biology.
Despite its importance, researchers with biology or healthcare backgrounds often do not have the specific skills to run a data mining project. This lack of skills often makes biologists delay, or decide against, including any machine learning analysis in their projects. In other
cases, biological and healthcare researchers who embark on a machine learning venture
sometimes follow incorrect practices, which lead to error-prone analyses, or give them
the illusion of success.
To avoid those situations, we present here ten quick tips to take advantage of machine
learning in any computational biology project. Ten best practices, or ten pieces of advice,
that we developed especially for machine learning beginners, and for biologists and
healthcare scientists who have limited experience with data mining.
We organize our ten tips as follows. The first five tips regard practices to consider before starting to program any machine learning software (the dataset check and arrangement in Tip 1, the dataset subset split in Tip 2, the problem category framing in Tip 3, the algorithm choice in Tip 4, and the handling of the imbalanced dataset problem in Tip 5). After them, the next two tips regard relevant practices to adopt during the machine learning program development (the hyper-parameter optimization in Tip 6, and the handling of the overfitting problem in Tip 7). Moreover, the following tip refers to
what to do at the end of a machine learning algorithm execution (the performance score
evaluation in Tip 8). Finally, the last two tips regard broad general best practices on how
to arrange a project, and are valid not only in machine learning and computational biol-
ogy, but in any scientific field (choosing open source programming platforms in Tip 9, and
asking feedback and help from experts in Tip 10).
For beginners, these ten quick tips should not replace the study of machine learning through a book. On the contrary, we wrote this manuscript to provide a complementary resource to classical training from a textbook [2], and we therefore suggest that all beginners start from there.
In this paper, we consider an input dataset for a binary classification task represented as a typical table (or matrix) having M data instances as rows, N features as columns, and a binary target-label column. Of course, switching the rows with the columns would not change the results of a machine learning algorithm application. We call negative data instance a row of the input table with negative, false, or 0 as target label, and positive data instance a row of the input table with positive, true, or 1 as target label.
Carrying a machine learning project to success might be troublesome, but these ten
quick tips can help the readers at least avoid common mistakes, and especially avoid the
dangerous illusion of inflated achievement.
Tip 1: Check and arrange your input dataset properly
Even though it might seem surprising, the most important point of a machine learning project does not regard machine learning itself: it regards your dataset properties and arrangement.
First of all, before starting any data mining activity, you have to ask yourself: do I
have enough data to solve this computational biology problem with machine learning?
Nowadays, in the Big Data era, with very large biological datasets publicly available
online, this question might appear irrelevant, but it really raises an important problem in
the statistical learning community and domain. While gathering more data can always be
beneficial for your machine learning models [6, 7], deciding what the minimum dataset size is to properly train a machine learning algorithm might be tricky. Even if sometimes this is not possible, the ideal situation would be to have at least ten times as many data instances as there are data features [8, 9].
After addressing the issue of the dataset size, the most important priority of your project
is the dataset arrangement. In fact, the way you engineer your input features, clean and pre-process your input dataset, scale the data features into a normalized range, randomly shuffle the dataset instances, and include newly constructed features (if needed) will determine whether your machine learning project will succeed or fail in its scientific task. As
Pedro Domingos clearly affirmed, in machine learning: “[Dataset] feature engineering is
the key” [6].
This advice might seem counter-intuitive for machine learning beginners. In fact, new-
comers might ask: how could the success of a data mining project rely primarily on
the dataset, and not on the algorithm itself? The explanation is straightforward: popu-
lar machine learning algorithms have become widespread, first of all, because they work
quite well. Similarly to what Isaac Newton once said, if we can progress further, we do it
by standing on the shoulders of giants, who developed the data mining methods we are
using nowadays. And since these algorithms work so well, and we have plenty of open
source software libraries which implement them (Tip 9), we usually do not need to invent
new machine learning techniques when starting a new project.
On the contrary, each dataset is unique. Indeed, each dataset has domain-specific fea-
tures, contains data strictly related to its scientific area, and might contain mistaken
values hardly noticeable by inexperienced researchers. The Gene Ontology annotation
(GOA) database [10], for example, despite its unquestionable usefulness, has several
issues. Since not all the annotations are supervised by human curators, some of them
might be erroneous; and since different laboratories and biological research groups might
have worked on the same genes, some annotations might contain inconsistent infor-
mation [11]. Problems like these can strongly influence the performance of a machine
learning method application.
Given the importance and the uniqueness of each dataset domain, machine learning
projects can succeed only if a researcher clearly understands the dataset details, and
he/she is able to arrange it properly before running any data mining algorithm on it. In
fact, successful projects happen only when machine learning practitioners work by the
side of domain experts [6]. This is particularly true in computational biology.
Arranging a biological dataset properly involves multiple facets, often grouped together into a step called data pre-processing.
First, a common useful practice is to always randomly shuffle the data instances. This operation removes any possible trend related to the order in which the data instances were collected, which might wrongly influence the learning process.
Fig. 1 (a) Example of a dataset feature which needs data pre-processing and cleaning before being employed in a machine learning program. All the feature data have values in the [0, 0.5] range, except an outlier having value 80 (Tip 1). (b) Representation of a typical dataset table having N features as columns and M data instances as rows. An effective ratio for the split of an input dataset table: 50% of the data instances for the training set; 30% of the data instances for the validation set; and the last 20% of the data instances for the test set (Tip 2). (c) Example of a typical biological imbalanced dataset, which can contain 90% negative data instances and only 10% positive instances. This aspect can be tackled with under-sampling and other techniques (Tip 5)

Moreover, another necessary practice is data cleaning, that is, discarding all the data which have corrupt, inaccurate, inconsistent, or outlier values [12]. This operation involves expertise and "folk wisdom", and has to be done carefully. Therefore, we recommend doing it only in evident cases. Suppose, for example, that in a dataset of 100 data instances you have a particular feature showing values in the [0, 0.5] range for 99 instances, and a value of 80 for one single instance (Fig. 1a). That value is clearly an outlier, and it might be caused by a malfunctioning of the machinery which generated the dataset. Its inclusion in the machine learning processing phase might cause the algorithm to incorrectly classify, or to fail to correctly learn from, the data instances. In this case, you had better remove that particular data element and apply your machine learning only to the remaining dataset, or round that data value to the upper limit value among the other data (0.5 in this case). When handling a large dataset, removing the outliers is the best plan, because you still have enough data to train your model properly. When the dataset size is small-scale and each data instance is precious, instead, it is better to round the outliers to the maximum (or minimum) limit.
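As a minimal sketch of these two options, assuming the single-feature toy example above (99 values in [0, 0.5] plus one spurious value of 80), removing or clipping the outlier could look like this; the 0.5 threshold comes from the example itself, not from any general rule:

```python
import numpy as np

# Hypothetical feature from the example: 99 values in [0, 0.5] and one outlier of 80
rng = np.random.default_rng(42)
feature = np.append(rng.uniform(0.0, 0.5, size=99), 80.0)

# Option 1: drop the outlier (preferable when the dataset is large)
cleaned = feature[feature <= 0.5]

# Option 2: round the outlier down to the upper limit of the remaining values
#           (preferable when the dataset is small and every instance is precious)
upper_limit = feature[feature <= 0.5].max()
clipped = np.minimum(feature, upper_limit)

print(cleaned.shape, clipped.max())
```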
For numerical datasets, in addition, the normalization (or scaling) by feature (by column) into the [0, 1] interval is often necessary to put the whole dataset into a common frame before the machine learning algorithm processes it. Latent semantic indexing (LSI), for example, is an information retrieval method which necessitates this pre-processing when employed for the prediction of gene functional annotations [13]. Data normalization into a [min, max] interval, or into an interval having a particular mean (for example, 0.0) and a particular standard deviation (for example, 1.0), is also a popular strategy [14].
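As an illustration, here is a small sketch of column-wise scaling with scikit-learn; the toy matrix is invented purely for the example:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy numerical dataset: M = 5 instances (rows), N = 3 features (columns)
X = np.array([[1.0, 200.0, 0.5],
              [2.0, 150.0, 0.1],
              [3.0, 300.0, 0.9],
              [4.0, 250.0, 0.3],
              [5.0, 100.0, 0.7]])

# Scale each feature (column) into the [0, 1] interval
X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Alternative: standardize each feature to mean 0.0 and standard deviation 1.0
X_standard = StandardScaler().fit_transform(X)

print(X_minmax.min(axis=0), X_minmax.max(axis=0))  # zeros and ones, per column
```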
A final effective piece of advice related to data pre-processing is to always start with a small-scale dataset. In biology, it is common to have large datasets made of millions or billions of instances. So, if you have a large dataset and your machine learning algorithm training lasts days, create a small-scale miniature dataset with the same positive/negative ratio as the original, in order to reduce the processing time to a few minutes. Then use that synthesized limited dataset to test and adjust your algorithm, and keep it separated from the original large dataset. Once the algorithm is generating satisfying results on the synthesized toy dataset, apply it to the original large dataset, and proceed.
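A hedged sketch of this idea with scikit-learn follows: the dataset below is synthetic, and the 1% subset size is only an example; the key point is stratify=y, which preserves the positive/negative ratio of the original data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a large biological dataset: 100,000 instances,
# 20 features, roughly 10% positive labels
X, y = make_classification(n_samples=100_000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)

# Draw a small "toy" subset (1% of the data) that keeps the same
# positive/negative ratio as the original, thanks to stratify=y
X_toy, _, y_toy, _ = train_test_split(X, y, train_size=0.01,
                                      stratify=y, random_state=42)

print(y.mean(), y_toy.mean())  # the two positive ratios are roughly equal
```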
Tip 2: Split your input dataset into three independent subsets (training set, validation set, test set), and use the test set only once you have completed the training and optimization phases
Many textbooks and online guides say machine learning is about splitting the dataset in
two: training set and test set. This approach is incomplete, since it does not take into
account that almost always your algorithm has a few key hyper-parameters to be selected
before applying the model (Tip 6).
In fact, a common mistake in machine learning is using, in the test set, data instances
already used during the training phase or the hyper-parameter optimization phase, and
then obtaining inflated performance scores [15]. But as Richard Feynman used to say, in
science and in life: “The first principle is that you must not fool yourself, and you are the
easiest person to fool”.
Therefore, to avoid fooling yourself this way, you should always split your input dataset into three independent subsets: training set, validation set, and test set. A commonly suggested ratio would be 50% for the training set, 30% for the validation set, and 20% for the test set (Fig. 1b). When the dataset is too small and this split ratio is not possible, machine learning practitioners should consider alternative techniques such as cross-validation [16] (Tip 7).
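For instance, a minimal sketch of the 50%/30%/20% split with scikit-learn (the dataset is synthetic; two chained calls to train_test_split produce the three subsets):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)

# First split: 50% training set, 50% remainder
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.5, shuffle=True, stratify=y, random_state=42)

# Second split: 60% of the remainder (30% overall) for the validation set,
# 40% of the remainder (20% overall) for the test set
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, train_size=0.6, stratify=y_rest, random_state=42)

# Use (X_train, y_train) and (X_val, y_val) for training and tuning;
# keep (X_test, y_test) locked away until the very end
print(len(X_train), len(X_val), len(X_test))  # 500, 300, 200
```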
After the subset split, use the training set and the validation set to train your model
and to optimize the hyper-parameter values, and withhold the test set. Do not touch it.
Finally, at the very end, once you have found the best hyper-parameters and trained your
algorithm, apply the trained model to the test set, and check the performance results.
This approach (also termed the “lock box approach” [17]) is pivotal in every machine
learning project, and often means the real difference between success and failure.
In fact, as Michael Skocik and colleagues [17] noticed, setting aside a subset and using
it only when the models are ready is an effective common practice in machine learning
competitions. The authors of that paper, moreover, suggest that all the machine learning
projects in neuroscience routinely incorporate a lock box approach. We agree, and extend this statement: the lock box approach should be employed by every machine learning
project in every field.
Tip 3: Frame your biological problem into the right algorithm category
You have your biological dataset, your scientific question, and a scientific goal for your
project. You have arranged and engineered your dataset, as explained in Tip 1. You decide
you want to solve your scientific project with machine learning, but you are undecided
about what algorithm to start with.
Before choosing the data mining method, you have to frame your biological problem
into the right algorithm category, which will then help you find the right tool to answer
your scientific question.
Some key questions can help you understand your scientific problem. Do you have
labeled targets for your dataset? That is, for each data instance, do you have a ground
truth label which can tell you if the information you are trying to identify is asso-
ciated to that data instance or not? If yes, your problem can be attributed to the
supervised learning category of tasks, and, if not, to the unsupervised learning
category [4].
For example, suppose you have a dataset where the rows contain the profiles of patients,
and the columns contain biological features related to them [18]. One of the features states the diagnosis of the patient, that is, whether he/she is healthy or unhealthy, which can be termed the target (or output variable) for this dataset. Since, in this case, the dataset con-
tains a target label for each data instance, the problem of predicting these targets can
be named supervised learning. Popular supervised learning algorithms in computational
biology are support vector machines (SVMs) [19], k-nearest neighbors (k-NN) [20], and
random forests [21].
If the target can have a finite number of possible values (for example, extracellular, cytoplasm, or nucleus for a specific cell location), we call the problem a classification task. And if the possible target values are only two (like true or false, 0 or 1, healthy patient or unhealthy patient), we name it binary classification.
If the targets are real values, instead, the problem is named a regression task.
Target labels are not always present in biological datasets. When data are unla-
beled, machine learning can still be employed to infer hidden associations between
data instances, or to discover the hidden structure of a dataset. These cases are called unsupervised learning, or cluster analysis, tasks. Common unsupervised learn-
ing methods in computational biology include k-means clustering [22], truncated
singular value decomposition (SVD) [23], and probabilistic latent semantic analysis
(pLSA) [24].
Once you have studied and understood your dataset, you have to decide to which of these problem categories your project belongs, and then you are ready to choose the proper machine learning algorithm with which to start your predictions.
Tip 4: Which algorithm should you choose to start? The simplest one!
Once you understand what kind of biological problem you are trying to solve, and which
method category can fit your situation, you then have to choose the machine learning
algorithm with which to start your project. Even if it is always advisable to use multiple techniques and compare their results, the decision on which one to start with can be tricky.
Many textbooks suggest selecting a machine learning method by just taking into account the problem representation, while Pedro Domingos [6] suggests also taking into account the cost evaluation and the performance optimization.
This algorithm-selection step, which usually occurs at the beginning of a machine learn-
ing journey, can be dangerous for beginners. In fact, an inexperienced practitioner might end up choosing a complicated, inappropriate data mining method which might lead him/her to bad results and make him/her lose precious time and energy. Therefore, this is our
tip for the algorithm selection: if undecided, start with the simplest algorithm [25].
By employing a simple algorithm, you will be able to keep everything under control, and
better understand what is happening during the application of the method. In addition, a
simple algorithm will provide better generalization skills, less chance of overfitting, easier
training and faster learning properties than complex methods.
Examples of simple algorithms are k-means clustering for unsupervised learning [22]
and k-nearest neighbors (k-NN) for supervised learning [26]. Even though stating the
level of simplicity of a machine learning method is not an easy task, we consider
k-means and k-NN simple algorithms because they are easier to understand and to
interpret than other models, such as artificial neural networks [27] or support vector
machines [19].
Regarding k-NN, suppose for example you have a complementary DNA (cDNA) microarray input dataset made of 1,000 real data instances, each having 80 features and 1 binary target label. This dataset can be represented with a table made of 1,000 rows and 81 columns. Following our suggestion, if you think that your biological dataset can be learnt with a supervised learning method (Tip 3), you might consider beginning to classify instances with a simple algorithm such as k-nearest neighbors (k-NN) [26]. This method assigns each new observation (an 80-dimensional point, in our case) to the class of the majority of its k nearest neighbors (the k nearest points, measured with Euclidean distance) [28].
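A minimal sketch of this starting point with scikit-learn, using a synthetic stand-in for the 1,000 × 80 cDNA microarray table described above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic dataset: 1,000 instances, 80 features, 1 binary target label
X, y = make_classification(n_samples=1_000, n_features=80, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# k-NN assigns each new 80-dimensional point to the majority class among its
# k nearest training points (Euclidean distance by default)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print(knn.score(X_test, y_test))  # a baseline to compare other methods against
```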
Consequently, given the simplicity of the algorithm, you will be able to oversee (and to
possibly debug) each step of it, especially if problems arise. In case you reach a satisfy-
ing performance with k-nearest neighbors, you will be able to stick with it, and proceed
in your project. Otherwise, you will always be able to switch to another algorithm, and
employ the k-nearest neighbors results for a baseline comparison.
As David J. Hand explained, complex models should be employed only if the dataset
features provide some reasonable justification for their usage [25].
Tip 5: Take care of the imbalanced data problem
In computational biology and in bioinformatics, it is common to have imbalanced datasets. An imbalanced (or unbalanced) dataset is a dataset in which one class is over-represented with respect to the other(s) (Fig. 1c). For example, a typical dataset of
Gene Ontology annotations, that can be analyzed with a non-negative matrix factoriza-
tion, usually has only around 0.1% of positive data instances, and 99.9% of negative data
instances [11, 23].
In these common situations, the dataset ratio can be a problem: how can you train a
classifier to be able to correctly predict both positive data instances, and negative data
instances, if you have such a huge difference in the proportions? Probably, your learning
model is going to learn fast how to recognize the over-represented negative data instances, but it is going to have difficulties recognizing the instances of the scarce subset, which are the positive items in this case.
Our heuristic suggestion on what ratio of elements to use in the training set is to pick the average value between 50% and the real proportion percentage. Therefore, in the 90%:10% example, insert in your training set (90% + 50%) / 2 = 70% negative data instances, and (10% + 50%) / 2 = 30% positive data instances. Obviously, this procedure is possible only if there are enough data for each class to create a 70%:30% training set. Alternatively, you can balance the dataset by incorporating the empirical label distribution of the data instances, following Bayes' rule [29]. Even if more precise, this strategy might be too complicated for beginners; this is why we suggest using the afore-mentioned heuristic ratio to start.
In addition, there are multiple effective techniques to handle the imbalanced data
problem [30]. The best way to tackle this problem is always to collect more data.
If this is not possible, a common and effective strategy to handle imbalanced datasets is data class weighting, in which different weights are assigned to data instances depending on whether they belong to the majority class or the minority class [31]. Data class weighting is a standard technique to fight the imbalanced data problem in machine learning.
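As one possible illustration of data class weighting, here is a sketch with scikit-learn's class_weight parameter; the dataset is synthetic and the explicit weights are only an example, not a recommended setting:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced dataset: roughly 90% negatives and 10% positives
X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)

# class_weight='balanced' weights instances inversely to class frequencies,
# so errors on the minority (positive) class cost more during training
clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X, y)

# Explicit per-class weights are also possible, e.g. 9x weight for positives
clf_manual = LogisticRegression(class_weight={0: 1, 1: 9}, max_iter=1000).fit(X, y)
```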
An alternative method to deal with this issue is under-sampling [32], where you just
remove data elements from the over-represented class. The disadvantage here is that you
do not let the classifier learn from the excluded data instances.
In addition, other techniques exist, even if trying the aforementioned ones first might already be enough for your machine learning project [30]. Moreover, to properly take
care of the imbalanced dataset problem, when measuring your prediction performances, you need to rely not on accuracy (Eq. 1), balanced accuracy [33], or the F1 score (Eq. 2), but rather on the Matthews correlation coefficient (MCC, Eq. 3). As we will better explain later (Tip 8), among the common performance evaluation scores, MCC is the only one which correctly takes into account the relative size of each class of the confusion matrix. Especially on imbalanced datasets, MCC is correctly able to inform you whether your prediction evaluation is going well or not, while accuracy or the F1 score would not.
Tip 6: Optimize each hyper-parameter
The hyper-parameters of a machine learning algorithm are higher-level properties of the
algorithm statistical model, which can strongly influence its complexity, its speed in learn-
ing, and its application results. Indeed, examples of hyper-parameters are the number k of neighbors in k-nearest neighbors (Fig. 2) [26], the number k of clusters in k-means clustering [22], the number of topics (classes) in topic modeling [24], and the dimensions of an artificial neural network (number of hidden layers and number of hidden units) [34]. The hyper-parameters cannot be learned by the algorithm directly from the training phase; rather, they must be set before the training step starts.
A useful practice to select the best suitable value for each hyper-parameter is a grid search. After having divided the input dataset into training set, validation set, and test set, withhold the test set (as explained in Tip 2), and employ the validation set to evaluate the algorithm when using a specific hyper-parameter value. For each possible value of the hyper-parameter, then, train your model on the training set and evaluate it on the validation set, through the Matthews correlation coefficient (MCC) or the Precision-Recall area under the curve (Tip 8), and record the score into an array of real values. Once you have tried all the possible values of the hyper-parameter, choose the one which led to the highest performance score (best_h in Algorithm 1). Finally, train the model having best_h as hyper-parameter on the training set, and apply it to the test set (Algorithm 1).

Fig. 2 Example of how an algorithm's behavior and results change when the hyper-parameter changes, for the k-nearest neighbors method [20] (image adapted from [72]). (a) In this example, there are six blue square points and five red triangle points in the Euclidean space. A new point (the green circle) enters the space, and k-NN has to decide to which category to assign it (red triangle or blue square). (b) If we set the hyper-parameter k = 3, the algorithm considers only the three points nearest to the new green circle, and assigns the green circle to the red triangle category (two red triangles versus one blue square). (c) Likewise, if we set the hyper-parameter k = 4, the algorithm considers only the four points nearest to the new green circle, and assigns the green circle again to the red triangle category (the two red triangles are nearer to the green circle than the two blue squares). (d) However, if we set the hyper-parameter k = 5, the algorithm considers only the five points nearest to the new green circle, and assigns the green circle to the blue square category (three blue squares versus two red triangles)
Algorithm 1 Hyper-parameter optimization. h: hyper-parameter. P: maximum value of h. model_h: the statistical model having h as hyper-parameter value
1: Randomly shuffle all the data instances of the input dataset
2: Split the input dataset into independent training set, validation set, and test set
3: for h = 1, ..., P do
4:     Train model_h on the training set
5:     Save model_h into a file or on a database
6:     Evaluate model_h on the validation set
7:     Save the resulting Matthews correlation coefficient (MCC) of the previous step
8: end for
9: Select the hyper-parameter best_h which led to the best MCC value in the previous loop
10: Load the previously saved model_best_h
11: Apply model_best_h to the test set
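A minimal Python sketch of Algorithm 1, assuming k-NN's number of neighbors k as the hyper-parameter, a synthetic dataset, and the Matthews correlation coefficient as the validation score; models are kept in memory here rather than saved to a file:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import matthews_corrcoef

# Synthetic dataset; train_test_split shuffles the instances by default
X, y = make_classification(n_samples=2_000, n_features=20, random_state=42)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.5, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, train_size=0.6, stratify=y_rest, random_state=42)

# Grid search over the hyper-parameter k, evaluated on the validation set
models, scores = {}, {}
for k in range(1, 21):
    models[k] = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = matthews_corrcoef(y_val, models[k].predict(X_val))

best_k = max(scores, key=scores.get)  # hyper-parameter with the highest MCC

# Only now touch the test set, with the selected model
print(best_k, matthews_corrcoef(y_test, models[best_k].predict(X_test)))
```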
Alternatively, you can consider taking advantage of some automatic machine learning
software methods, which automatically optimize the hyper-parameters of the algorithm
you selected. These packages include Auto-Sklearn [35], Auto-Weka [36], TPOT [37], and
PennAI [38].
Once again, we want to highlight the importance of splitting the dataset into three different independent subsets: training set, validation set, and test set. These three subsets must contain no common data instances, and the data instances must be selected randomly, so that the data collection order does not influence the algorithm. For these reasons, we strongly suggest applying a random shuffle to the whole input dataset, just after reading the dataset (first line of Algorithm 1).
Tip 7: Minimize overfitting
As Pedro Domingos correctly stated: “Overfitting is the bugbear of machine learning” [6]. In data mining, overfitting happens every time an algorithm excessively adapts to the training set, and therefore performs badly on the validation set (and test set).
Overfitting happens as a result of the statistical model having to solve two problems.
During training, it has to minimize its performance error (often measured through mean
square error for regression, or cross-entropy for classification). But during testing, it has
to maximize its skills to make correct predictions on unseen data. This “double goal”
might lead the model to memorize the training dataset, instead of learning its data trend,
which should be its main task.
Fortunately, there are a few powerful tools to battle overfitting: cross-validation and regularization. In 10-fold cross-validation, the statistical model considers 10 different portions of the input dataset as training set and validation set, in a series. After shuffling the input dataset instances and setting apart the test set, the algorithm takes the remaining dataset and divides it into ten folds. It then creates a loop for i going from 1 to 10. For each iteration, the cross-validation sets the data of the i-th fold as validation set, then trains the algorithm on the remaining dataset folds, and finally applies the algorithm to
the validation set. To measure the performance of the classifier in this phase, the user can estimate the median variance of the predictions made in the 10 folds. The algorithm designer can choose a number k of folds different from 10, even if 10 is a common heuristic choice that allows the training set to contain 90% of the data instances and the validation set to contain 10%.
With cross-validation, the trained model does not overfit to a specific training subset,
but rather is able to learn from each data fold, in turn.
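A brief sketch of this procedure with scikit-learn, assuming a synthetic dataset and k-NN as the classifier; make_scorer wraps the MCC so it can be used as the cross-validation score:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import matthews_corrcoef, make_scorer

X, y = make_classification(n_samples=2_000, n_features=20, random_state=42)

# Shuffle the instances and set apart the test set first
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 10-fold cross-validation: each fold is used once as validation set while
# the other nine folds form the training set
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_dev, y_dev,
                         cv=10, scoring=make_scorer(matthews_corrcoef))

print(scores.mean(), scores.std())  # summary of the 10 validation scores
```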
In addition, regularization is a mathematical technique which consists of penalizing
the evaluation function during training, often by adding penalization values that increase
with the weights of the learned parameters [39].
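As a small hedged illustration of this idea, the sketch below compares ordinary least squares with L2-regularized (ridge) regression on a synthetic dataset; the alpha value is arbitrary and only shows that the penalty shrinks the learned weights:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression dataset with many features but few informative ones
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=10.0, random_state=42)

# Ridge adds an L2 penalty (alpha * ||w||^2) to the training loss, which
# penalizes large weights and therefore reduces overfitting
plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print(np.abs(plain.coef_).mean(), np.abs(ridge.coef_).mean())  # smaller with Ridge
```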
In conclusion, as any machine learning expert will tell you, overfitting will always be
a problem for machine learning. But the awareness of this problem, together with the
aforementioned techniques, can effectively help you to reduce it.
Tip 8: Evaluate your algorithm performance with the Matthews correlation
coefficient (MCC) or the Precision-Recall curve
When you apply your trained model to the validation set or to the test set, you need
statistical scores to measure your performance.
In fact, in a typical supervised binary classification problem, for each element of the
validation set (or test set) you have a label stating if the element is positive or negative
(1 or 0, usually). Your machine learning algorithm makes a prediction for each element of
the validation set, expressing whether it is positive or negative, and, based upon these predictions and the gold-standard labels, it will assign each element to one of the following cate-
gories: true negatives (TN), true positives (TP), false positives (FP), false negatives (FN)
(Table 1).
If many elements of the set then fall into the first two classes (TP or TN), this means that
your algorithm was able to correctly predict as positive the elements that were positive in
the validation set (TP), or to correctly classify as negative the instances that were negative
in the validation set (TN). On the contrary, if you have many FP instances, this means
that your method wrongly classified as positive many elements which are negative in the
validation set. And, as well, many FN elements mean that the classifier wrongly predicted
as negative a lot of elements which are positive in the validation set.
In order to have an overall understanding of your prediction, you decide to take
advantage of common statistical scores, such as accuracy (Eq. 1), and F1 score (Eq. 2).
accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)
(accuracy: worst value = 0; best value = 1)

F1 score = (2 · TP) / (2 · TP + FP + FN)    (2)
(F1 score: worst value = 0; best value = 1)
Table 1 The confusion matrix: each pair (actual value; predicted value) falls into one of the four listed categories

                      predicted positive       predicted negative
actual positive       TP (true positives)      FN (false negatives)
actual negative       FP (false positives)     TN (true negatives)
However, even if accuracy and F1 score are widely employed in statistics, both can be
misleading, since they do not fully consider the size of the four classes of the confusion
matrix in their final score computation.
Suppose, for example, you have a very imbalanced validation set made of 100 elements,
95 of which are positive elements, and only 5 are negative elements (as explained in Tip 5).
And suppose also you made some mistakes in designing and training your machine learn-
ing classifier, and now you have an algorithm which always predicts positive. Imagine that
you are not aware of this issue.
By applying your only-positive predictor to your imbalanced validation set, therefore,
you obtain values for the confusion matrix categories:
TP = 95, FP = 5; TN = 0, FN = 0.
These values lead to the following performance scores: accuracy = 95%, and F1 score =
97.44%. By reading these over-optimistic scores, then you will be very happy and will think
that your machine learning algorithm is doing an excellent job. Obviously, you would be
on the wrong track.
On the contrary, to avoid these dangerous misleading illusions, there is another perfor-
mance score that you can exploit: the Matthews correlation coefficient [40] (MCC, Eq. 3).
MCC = (TP · TN − FP · FN) / √((TP + FP) · (TP + FN) · (TN + FP) · (TN + FN))    (3)
(MCC: worst value = −1; best value = +1).
By considering the proportion of each class of the confusion matrix in its formula, its
score is high only if your classifier is doing well on both the negative and the positive
elements.
In the example above, the MCC score would be undefined (since TN and FN would
be 0, therefore the denominator of Eq. 3 would be 0). By checking this value, instead of
accuracy and F1 score, you would then be able to notice that your classifier is going in the
wrong direction, and you would become aware that there are issues you ought to solve
before proceeding.
Let us consider this other example. You ran a classification on the same dataset which
led to the following values for the confusion matrix categories:
TP = 90, FP = 5; TN = 1, FN = 4.
In this example, the classifier has performed well in classifying positive instances, but
was not able to correctly recognize negative data elements. Again, the resulting F1 score
and accuracy scores would be extremely high: accuracy = 91%, and F1 score = 95.24%.
Similarly to the previous case, if a researcher analyzed only these two score indicators,
without considering the MCC, he/she would wrongly think the algorithm is performing
quite well in its task, and would have the illusion of being successful.
On the other hand, checking the Matthews correlation coefficient would be pivotal
once again. In this example, the value of the MCC would be 0.14 (Eq. 3), indicating
that the algorithm is performing similarly to random guessing. Acting as an alarm, the MCC would be able to inform the data mining practitioner that the statistical model is
performing poorly.
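The same comparison can be reproduced numerically; the sketch below rebuilds label vectors matching the confusion matrix of this second example (TP = 90, FN = 4, FP = 5, TN = 1) and computes the three scores with scikit-learn:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# 94 actual positives followed by 6 actual negatives
y_true = np.array([1] * 94 + [0] * 6)
# Predictions giving TP = 90, FN = 4, FP = 5, TN = 1
y_pred = np.array([1] * 90 + [0] * 4 + [1] * 5 + [0] * 1)

print(accuracy_score(y_true, y_pred))     # 0.91
print(f1_score(y_true, y_pred))           # about 0.95
print(matthews_corrcoef(y_true, y_pred))  # about 0.14, close to random guessing
```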
For these reasons, we strongly encourage you to evaluate each test performance through the Matthews correlation coefficient (MCC), instead of the accuracy and the F1 score, for any binary classification problem.
In addition to the Matthews correlation coefficient, another performance score that you will find helpful is the Precision-Recall curve. Often you will not have binary labels (for example, true and false) for the negative and the positive elements in your predictions, but rather a real value for each prediction made, in the [0, 1] interval. In this common case, you can decide to utilize each possible value of your prediction as a threshold for the confusion matrix.
Therefore, you will end up having a real-valued array for each of the FN, TN, FP, and TP classes.
To measure the quality of your performance, you will be able to choose between two
common curves, of which you will be able to compute the area under the curve (AUC):
receiver operating characteristic (ROC) curve (Fig. 3a), and Precision-Recall (PR) curve
(Fig. 3b) [41].
The ROC curve is computed with recall (true positive rate, sensitivity) on the y axis and fallout (false positive rate, or 1 − specificity) on the x axis:

ROC curve axes:
recall = TP / (TP + FN)        fallout = FP / (FP + TN)    (4)

In contrast, the Precision-Recall curve has precision (positive predictive value) on the y axis and recall (true positive rate, sensitivity) on the x axis:

Precision-Recall curve axes:
precision = TP / (TP + FP)     recall = TP / (TP + FN)    (5)
Usually, the evaluation of the performance is made by computing the area under the curve (AUC) of these two curve models: the greater the AUC is, the better the model is performing.
As one can notice, the optimization of the ROC curve tends to maximize the correctly
classified positive values (TP, which are present in the numerator of the recall formula),
and the correctly classified negative values (TN, which are present in the denominator of
the fallout formula).
Fig. 3 (a) Example of Precision-Recall curve, with the precision score on the y axis and the recall score on the x axis (Tip 8). The grey area is the PR curve area under the curve (AUPRC). (b) Example of receiver operating characteristic (ROC) curve, with the recall (true positive rate) score on the y axis and the fallout (false positive rate) score on the x axis (Tip 8). The grey area is the ROC area under the curve (AUROC)
Differently, the optimization of the PR curve tends to maximize the correctly classified positive values (TP, which are present both in the precision and in the recall formulas), and does not directly consider the correctly classified negative values (TN, which are absent from both the precision and the recall formulas).
In computational biology, we often have very sparse datasets with many negative
instances and few positive instances. Therefore, we prefer to avoid the involvement of
true negatives in our prediction score. In addition, ROC and AUROC present additional
disadvantages related to their interpretation in specific clinical domains [42].
For these reasons, the Precision-Recall curve is a more reliable and informative indicator
for your statistical performance than the receiver operating characteristic curve, especially
for imbalanced datasets [43].
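As a brief sketch of how to compute these two areas in practice with scikit-learn (the dataset and the logistic regression classifier are only placeholders for your own model's real-valued prediction scores):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, auc, roc_auc_score

# Synthetic imbalanced dataset: many negatives, few positives
X, y = make_classification(n_samples=5_000, n_features=20,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Real-valued prediction scores in [0, 1]: each value is a possible threshold
scores = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

precision, recall, _ = precision_recall_curve(y_test, scores)
auprc = auc(recall, precision)         # area under the Precision-Recall curve
auroc = roc_auc_score(y_test, scores)  # area under the ROC curve, for comparison

print(auprc, auroc)
```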
Other useful techniques to assess the statistical significance of machine learning predictions are permutation testing [44] and bootstrapping [45].
Tip 9: Program your software with open source code and platforms
When starting a machine learning project, one of the first decisions to take is which programming language or platform you should use. While different packages provide different methods, different execution speeds, and different features, we strongly suggest avoiding proprietary software, and instead working only with free open source machine learning software packages.
Using proprietary software, in fact, can cause you several problems. First of all, it limits your collaboration possibilities only to people who have a license to use that specific software. For example, suppose you are working in a hospital, and would like a collaborator from a university to work on your software code. If you are working with proprietary software, and his/her university does not have the same software license, the collaboration cannot happen. On the contrary, if you use an open source platform, you will not face these problems and will be able to start a partnership with anyone willing to work with you.
Another big problem with proprietary software is that you will not be able to re-use your own software in case you switch jobs, and/or in case your company or institute decides not to pay for the software license anymore. On the contrary, if you work with open source programs, you will always be able to re-use your own software in the future, even if switching jobs or workplaces.
For these and other reasons, we advise you to work only with free open source machine learning software packages and platforms, such as R [46], Python [47], Torch [48], and Weka [49].
R is an interpreted programming language for statistical computing and graphics,
extremely popular among the statisticians’ community. It provides several libraries for
machine learning algorithms (including, for example, k-nearest neighbors and k-means),
effective libraries for statistical visualization (such as ggplot2 [50]), and statistical analysis
packages (such as the extremely popular Bioconductor package [51]). On the other
hand, Python is a high-level interpreted programming language, which provides multiple
fast machine learning libraries (for example, Pylearn2 [52], Scikit-learn [53]), mathe-
matical libraries (such as Theano [54]), and data mining toolboxes (such as Orange
[55]). Torch, instead, is a programming language based upon Lua [56], a platform, and a set of very fast libraries for deep artificial neural networks. On the other hand, the Waikato Environment for Knowledge Analysis (Weka) is a platform for machine learning libraries [49]. Its software is written in Java, and it was developed at the University of Waikato (New Zealand).
For beginners, we strongly suggest starting with R, possibly on an open source operating
system (such as Linux Ubuntu). In fact, using open source programming languages and
platforms will also facilitate scientific collaborations with researchers in other laboratories
or institutions [57].
In addition, we also advise you to share your software code publicly on the internet, along with the publication of your project paper and datasets [58, 59]. In fact, as Nick Barnes explained: “Freely provided working code, whatever its quality, [...] enables others to engage with your research” [60]. Even more, releasing your code openly on the internet also allows the computational reproducibility of your paper results [61].
Together with the usage of open source software, we recommend two other optimal
practices for computational biology and science in general: write in-depth documentation
about your code [62, 63], and keep a lab notebook about your project [64].
Writing complete documentation for your software and keeping a scientific diary
updated about your progress will save a lot of time for your future self, and will be a
priceless resource for the success of your project.
Tip 10: Ask for feedback and help from computer science experts, or from collaborative Q&A online communities
During the progress of a scientific project, asking for a review by experts in the field is
always a useful idea. Therefore, if you are a biologist or a healthcare researcher working
near a university, you should surely consider contacting a machine learning professional in the computer science department, and asking him/her to meet to gain useful feedback
about your project.
Sometimes, when meeting a data mining expert in person is not possible, you should consider getting feedback about your project from data mining professionals through collaborative question-and-answer (Q&A) websites such as Cross Validated, Stack Overflow, Quora, BioStars, and Bioinformatics beta [65].
Cross Validated is a Q&A website of the Stack Exchange platform, mainly for ques-
tions related to statistics [66]. Similarly, Stack Overflow is part of the same platform,
and it is probably the best-known Q&A website among programmers and software developers [67]. It often includes questions and answers about machine learning software. On the other hand, while Cross Validated and Stack Overflow are more about using users' interactions and expertise to solve specific issues, you can post broader and more general questions on Quora, whose answers can probably help you better if you are a beginner [68].
Regarding bioinformatics and computational biology, two useful Q&A platforms are
BioStars [69, 70] and the recently released Bioinformatics beta [71].
Indeed, the feedback you receive will be priceless: the community users will be able to
notice aspects that you did not consider, and will provide you suggestions and help which
will make your approach unshakeable.
In addition, many questions and clarifications that the community users ask you will
anticipate the possible questions of reviewers of a journal after the submission of your
manuscript describing your machine learning algorithm. Finally, your question and its
community answers will be able to help other users having the same issues in the
future, too.
Conclusion
Running a machine learning project in computational biology, without making common
mistakes and without fooling yourself, can be a hard task, especially if you are a beginner.
We believe these ten tips can be a useful checklist of best practices, lessons learned, ways to avoid common mistakes and over-optimistic inflated results, and general pieces of advice for any data mining practitioner in computational biology: following them from the moment you start your project can significantly pave your way to success.
Even though we originally developed these tips for apprentices, we strongly believe they should be kept in mind by experts, too. Nowadays, multiple topics covered by our tips are broadly discussed and analyzed in the machine learning community (for example, overfitting, hyper-parameter optimization, and imbalanced datasets), while other tip topics are unfortunately still too uncommon (for example, the usage of the Matthews correlation coefficient, and of open source platforms). With this manuscript, we hope these concepts can spread and become common practices in every data mining project.
Abbreviations
AUC: Area under the curve; cDNA: Complementary DNA; FN: False negatives; FP: False positives; GO: Gene Ontology;
GOA: Gene Ontology annotations; k-NN: k-nearest neighbors; LSI: Latent semantic indexing; MCC: Matthews correlation
coefficient; MSE: Mean square error; PR: Precision-recall; pLSA: Probabilistic latent semantic analysis; Q&A: Questions and
answers; ROC: Receiver operating characteristic; SVD: Singular value decomposition; SVM: Support vector machine; TN:
True negatives; TP: True positives
Acknowledgments
The author thanks Michael M. Hoffman (Princess Margaret Cancer Centre) for his advice, David Duvenaud (University of
Toronto) for his preliminary revision of this manuscript, Chang Cao (University of Toronto) for her help with the images,
Francis Nguyen (Princess Margaret Cancer Centre) for his help in the English proof-reading, Pierre Baldi (University of
California Irvine) for his advice, and especially Christian Cumbaa (Princess Margaret Cancer Centre) for his multiple
revisions, suggestions, and comments. This paper is dedicated to the tumor patients of the Princess Margaret Cancer
Centre.
Funding
Not applicable.
Availability of data and materials
The R code of example images is available upon request.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The author declares that he has no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Received: 31 August 2017 Accepted: 8 November 2017
References
1. Yip KY, Cheng C, Gerstein M. Machine learning and genome annotation: a match meant to be? Genome Biol.
2013;14(5):205.
2. Baldi P, Brunak S. Bioinformatics: the machine learning approach. Cambridge: MIT press; 2001.
3. Larranaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, Lozano JA, Armañanzas R, Santafé G, Pérez A, et al.
Machine learning in bioinformatics. Brief Bioinform. 2006;7(1):86–112.
4. Tarca AL, Carey VJ, Chen X-W, Romero R, Drăghici S. Machine learning and its applications to biology. PLoS
Comput Biol. 2007;3(6):e116.
5. Schölkopf B, Tsuda K, Vert J-P. Kernel methods in computational biology. Cambridge: MIT Press; 2004.
6. Domingos P. A few useful things to know about machine learning. Commun ACM. 2012;55(10):78–87.
7. Ng A. Lecture 70 - Data For Machine Learning, Machine Learning Course on Coursera. https://coursera.org/learn/
machine-learning/lecture/XcNcz. Accessed 30 Aug 2017.
8. Abu-Mostafa YS, Magdon-Ismail M, Lin H-T. Learning from data. volume 4. NY, USA: AML Book New York; 2012.
9. Haldar M. How much training data do you need? https://medium.com/@malay.haldar/. Accessed 30 Aug 2017.
10. The Gene Ontology Consortium. Gene Ontology annotations and resources. Nucleic Acids Res. 2013;41(D1):
D530—D535.
11. Chicco D, Tagliasacchi M, Masseroli M. Genomic annotation prediction based on integrated information. In:
International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics. Berlin Heidelberg:
Springer. 2011. p. 238–52.
12. Apiletti D, Bruno G, Ficarra E, Baralis E. Data cleaning and semantic improvement in biological databases. J Integr
Bioinforma. 2006;3(2):219–29.
13. Chicco D, Masseroli M. Software suite for gene and protein annotation prediction and similarity search. IEEE/ACM
Trans Comput Biol Bioinforma. 2015;12(4):837–43.
14. Han J, Pei J, Kamber M. Data mining: concepts and techniques. Waltham: Elsevier; 2011.
15. Boulesteix A-L. Ten simple rules for reducing overoptimistic reporting in methodological computational research.
PLoS Comput Biol. 2015;11(4):e1004191.
16. Refaeilzadeh P, Tang L, Liu H. Cross-validation. In: Encyclopedia of Database Systems. Berlin Heidelberg: Springer.
2009. p. 532–8.
17. Skocik M, Collins J, Callahan-Flintoft C, Bowman H, Wyble B. I tried a bunch of things: the dangers of unexpected
overfitting in classification. bioRxiv. 2016;078816.
18. Er O, Tanrikulu AC, Abakay A, Temurtas F. An approach based on probabilistic neural network for diagnosis of
mesothelioma’s disease. Comput Electr Eng. 2012;38(1):75–81.
19. Noble WS. Support vector machine applications in computational biology. Kernel Methods Comput Biol. 2004:71–92.
20. Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB. Missing value
estimation methods for DNA microarrays. Bioinformatics. 2001;17(6):520–5.
21. Statnikov A, Wang L, Aliferis CF. A comprehensive comparison of random forests and support vector machines for
microarray-based cancer classification. BMC Bioinformatics. 2008;9(1):319.
22. Hussain HM, Benkrid K, Seker H, Erdogan AT. FPGA implementation of k-means algorithm for bioinformatics
application: An accelerated approach to clustering Microarray data. In: Adaptive Hardware and Systems (AHS), 2011
NASA/ESA Conference on. Piscataway: IEEE. 2011. p. 248–55.
23. Chicco D, Masseroli M. Ontology-based prediction and prioritization of gene functional annotations. IEEE/ACM
Trans Comput Biol Bioinforma. 2016;13(2):248–60.
24. Pinoli P, Chicco D, Masseroli M. Computational algorithms to predict Gene Ontology annotations. BMC
Bioinformatics. 2015;16(Suppl 6):S4.
25. Hand DJ. Classifier technology and the illusion of progress. Stat Sci. 2006;21(1):1–14.
26. Wu W, Xing EP, Myers C, Mian IS, Bissell MJ. Evaluation of normalization methods for cDNA microarray data by
k-NN classification. BMC Bioinformatics. 2005;6(1):191.
27. Cross SS, Harrison RF, Kennedy RL. Introduction to neural networks. Lancet. 1995;346(8982):1075–9.
28. Manning CD, Raghavan P, Schütze H, et al. Introduction to information retrieval, volume 1. Cambridge: Cambridge
University Press; 2008.
29. Hoens TR, Chawla NV. Imbalanced datasets: from sampling to classifiers, Imbalanced Learning: Foundations,
Algorithms, and Applications. Hoboken: John Wiley; 2013, pp. 43–59.
30. He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng. 2009;21(9):1263–84.
31. Chen C, Liaw A, Breiman L. Using random forest to learn imbalanced data. Berkeley: University of California Berkeley;
2004, p. 110.
32. Brownlee J. Eight tactics to combat imbalanced classes in your machine learning dataset. http://
machinelearningmastery.com/tactics. Accessed 30 Aug 2017.
33. Brodersen KH, Ong CS, Stephan KE, Buhmann JM. The balanced accuracy and its posterior distribution. In: 20th
International Conference on Pattern Recognition, ICPR 2010. Piscataway: IEEE. 2010. p. 3121–4.
34. Chicco D, Sadowski P, Baldi P. Deep autoencoder neural networks for Gene Ontology annotation predictions. In:
Proceedings of ACM BCB 2014 - the 5th ACM Conference on Bioinformatics, Computational Biology, and Health
Informatics. New York: ACM. 2014. p. 533–540.
35. Auto-sklearn. https://github.com/automl/auto-sklearn. Accessed 11 Sept 2017.
36. Kotthoff L, Thornton C, Hoos HH, Hutter F, Leyton-Brown K. Auto-weka 2.0: Automatic model selection and
hyperparameter optimization in weka. J Mach Learn Res. 2016;17:1–5.
37. Olson RS, Urbanowicz RJ, Andrews PC, Lavender NA, Moore JH, et al. Automating biomedical data science through
tree-based pipeline optimization. In: European Conference on the Applications of Evolutionary Computation. Berlin
Heidelberg: Springer. 2016. p. 123–137.
38. Olson RS, Sipper M, La Cava W, Tartarone S, Vitale S, Fu W, Holmes JH, Moore JH. A system for accessible artificial
intelligence. arXiv preprint arXiv:1705.00594. 2017;1705.00594:1–15.
39. Neumaier A. Solving ill-conditioned and singular linear systems: A tutorial on regularization. SIAM Rev. 1998;40(3):
636–66.
40. Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim
Biophys Acta Protein Struct. 1975;405(2):442–51.
41. Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves. In: Proceedings of the 23rd
International Conference on Machine Learning. New York: ACM. 2006. p. 233–240.
42. Halligan S, Altman DG, Mallett S. Disadvantages of using the area under the receiver operating characteristic curve
to assess imaging tests: a discussion and proposal for an alternative approach. Eur Radiol. 2015;25(4):932.
43. Saito T, Rehmsmeier M. The Precision-Recall plot is more informative than the ROC plot when evaluating binary
classifiers on imbalanced datasets. PLoS ONE. 2015;10(3):e0118432.
44. Ojala M, Garriga GC. Permutation tests for studying classifier performance. J Mach Learn Res. 2010;11(Jun):1833–63.
45. Efron B. Nonparametric estimates of standard error: the jackknife, the bootstrap and other methods. Biometrika.
1981;68(3):589–99.
46. Lantz B. Machine learning with R. Birmingham: Packt Publishing Ltd; 2013.
47. Van Rossum G. Python programming language. In: USENIX Annual Technical Conference, volume 41. Wilmington:
Python Software Foundation. 2007. p. 36.
48. Collobert R, Kavukcuoglu K, Farabet C. Torch7: a MATLAB-like environment for machine learning. In: BigLearn, NIPS
Workshop, number EPFL-CONF-192376. Granada: NIPS Conference. 2011.
49. Witten IH, Frank E, Hall MA, Pal CJ. Data mining: practical machine learning tools and techniques. Cambridge:
Morgan Kaufmann; 2016.
50. Wickham H. ggplot2: elegant graphics for data analysis. Berlin Heidelberg: Springer; 2016.
51. Gentleman R, Carey V, Huber W, Irizarry R, Dudoit S. Bioinformatics and computational biology solutions using R
and Bioconductor. Berlin Heidelberg: Springer Science & Business Media; 2006.
52. Goodfellow IJ, Warde-Farley D, Lamblin P, Dumoulin V, Mirza M, Pascanu R, Bergstra J, Bastien F, Bengio Y.
Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214. 2013;1308.4214:1–9.
53. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R,
Dubourg V, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12(Oct):2825–30.
54. Theano Development Team. Theano: a Python framework for fast computation of mathematical expressions. arXiv
e-prints, abs/1605.02688. 2016.
55. Demšar J, Curk T, Erjavec A, Gorup Č, Hočevar T, Milutinovič M, Možina M, Polajnar M, Toplak M, Starič A, et al.
Orange: data mining toolbox in Python. J Mach Learn Res. 2013;14(1):2349–53.
56. Ierusalimschy R, De Figueiredo LH, Celes Filho W. Lua – an extensible extension language. Softw Pract Experience.
1996;26(6):635–52.
57. Boland MR, Karczewski KJ, Tatonetti NP. Ten simple rules to enable multi-site collaborations through data sharing.
PLoS Comput Biol. 2017;13(1):e1005278.
58. Prlić A, Procter JB. Ten simple rules for the open development of scientific software. PLoS Comput Biol. 2012;8(12):
e1002802.
59. Osborne JM, Bernabeu MO, Bruna M, Calderhead B, Cooper J, Dalchau N, Dunn S-J, Fletcher AG, Freeman R,
Groen D, et al. Ten simple rules for effective computational research. PLoS Comput Biol. 2014;10(3):e1003506.
60. Barnes N. Publish your computer code: it is good enough. Nature. 2010;467(7317):753.
61. Sandve GK, Nekrutenko A, Taylor J, Hovig E. Ten simple rules for reproducible computational research. PLoS
Comput Biol. 2013;9(10):e1003285.
62. Karimzadeh M, Hoffman MM. Top considerations for creating bioinformatics software documentation. Brief
Bioinform. 2017;bbw134:1–7.
63. Noble WS. A quick guide to organizing computational biology projects. PLoS Comput Biol. 2009;5(7):e1000424.
64. Schnell S. Ten simple rules for a computational biologist’s laboratory notebook. PLoS Comput Biol. 2015;11(9):
e1004385.
65. Dall’Olio GM, Marino J, Schubert M, Keys KL, Stefan MI, Gillespie CS, Poulain P, Shameer K, Sugar R, Invergo BM,
et al. Ten simple rules for getting help from online scientific communities. PLoS Comput Biol. 2011;7(9):e1002202.
66. Stack Exchange. Cross Validated. http://stats.stackexchange.com. Accessed 30 Aug 2017.
67. Stack Exchange. Stack Overflow. http://www.stackoverflow.com. Accessed 30 Aug 2017.
68. Quora Inc. Quora Machine Learning. http://www.quora.com/machine-learning. Accessed 30 Aug 2017.
69. Parnell LD, Lindenbaum P, Shameer K, Dall’Olio GM, Swan DC, Jensen LJ, Cockell SJ, Pedersen BS, Mangan ME,
et al. BioStar: an online question & answer resource for the bioinformatics community. PLoS Comput Biol. 2011;7(10):
e1002216.
70. BioStars. Biostars, bioinformatics explained. https://www.biostars.org. Accessed 30 Aug 2017.
71. Stack Exchange - Bioinformatics beta. https://bioinformatics.stackexchange.com. Accessed 30 Aug 2017.
72. AnAj AA. KnnClassification.svg. https://commons.wikimedia.org/wiki/File:KnnClassification.svg. Accessed
14 Nov 2017.