Identification of Cancer Diagnosis Estimation
Models Using Evolutionary Algorithms -
A Case Study for Breast Cancer, Melanoma, and
Cancer in the Respiratory System
Stephan M. Winkler1, Michael Affenzeller1, Witold Jacak1,
Herbert Stekel2
1Upper Austria University of Applied Sciences
Department of Bioinformatics
Softwarepark 11
4232 Hagenberg, Austria
stephan.winkler@fh-hagenberg.at
michael.affenzeller@fh-hagenberg.at
witold.jacak@fh-hagenberg.at
2General Hospital Linz
Central Laboratory
Krankenhausstraße 9
4021 Linz, Austria
herbert.stekel@akh.linz.at
26 April 2011
Abstract
In this paper we present results of empirical research work done on
the data based identification of estimation models for cancer diagnoses:
Based on patients’ data records including standard blood parameters, tu-
mor markers, and information about the diagnosis of tumors we have
trained mathematical models for estimating cancer diagnoses.
Several data based modeling approaches implemented in HeuristicLab
have been applied for identifying estimators for selected cancer diagnoses:
Linear regression, k-nearest neighbor learning, artificial neural networks,
and support vector machines (all optimized using evolutionary algorithms)
as well as genetic programming. The investigated diagnoses of breast can-
cer, melanoma, and respiratory system cancer can be estimated correctly
in up to 81%, 74%, and 91% of the analyzed test cases, respectively;
without tumor markers up to 75%, 74%, and 87% of the test samples are
correctly estimated, respectively.
Keywords: Cancer Diagnosis Estimation, Tumor Marker Data, Data
Mining, Machine Learning, Statistical Analysis
1 Introduction
In this paper we present research results achieved within the research center
Heureka!1: Data of thousands of patients of the General Hospital (AKH) Linz,
Austria, have been analyzed in order to identify mathematical models for cancer
diagnoses. We have used a medical database compiled at the central laboratory
of AKH in the years 2005 – 2008: 28 routinely measured blood values of thou-
sands of patients are available as well as several tumor markers (substances
found in humans that can be used as indicators for certain types of cancer).
Not all values are measured for all patients; in particular, tumor marker values are
determined and documented only if there are indications for the presence of can-
cer. The results of empirical research work done on the data based identification
of estimation models for cancer diagnoses are presented in this paper: Based
on patients’ data records including standard blood parameters, tumor markers,
and information about the diagnosis of tumors we have trained mathematical
models for estimating cancer diagnoses.
The following data based modeling methods (implemented in HeuristicLab
[?]) have been used for producing classifiers: Linear regression, k-nearest neigh-
bor classification, neural networks, support vector machines, and genetic pro-
gramming.
In the following section (Section ??) we describe the database we have used
for our research work as well as the tumor markers for which we have developed
classifiers; we also describe the data preprocessing steps. For each tumor for
which we have developed classifiers we define the sets of input variables used in
this research project. In Section ?? we describe the modeling methods used in
this research project as well as the parameter settings applied, and in Section
?? we summarize and analyze the modeling results we have achieved. The
conclusion of this paper is given in Section ??, followed by references and an
appendix in which we summarize optimized modeling details.
2 Database
2.1 Available Patient Data
The blood data measured at the AKH in the years 2005–2008 have been com-
piled in a database storing each set of measurements (belonging to one patient):
1 Josef Ressel Center for Heuristic Optimization;
http://heureka.heuristiclab.com/
Each sample in this database contains a unique ID number of the respective
patient, the date of the measurement series, the ID number of the measurement,
and a set of parameters summarized in Table ??; standard blood parameters
are stored as well as tumor marker values and cancer diagnosis information.
Patients' personal data were at no time available to any of the authors except
the head of the laboratory.
In total, information about 20,819 patients is stored in 48,580 samples.
Please note that of course not all values are available in all samples; there are
many missing values simply because not all blood values are measured during
each examination. Further details about the data set can for example be found
in [?].
2.1.1 Standard Parameters
Information about the blood parameters stored in the AKH database (which
are listed in the upper part of Table ?? at the end of this paper) can be found,
e.g., in [?] and [?].
2.1.2 Tumor Markers
In general, tumor markers are substances found in humans (especially in the
blood or in body tissues) that can be used as indicators for certain types of
cancer. There are several different tumor markers which are used in oncology to
help detect the presence of cancer; elevated tumor marker values can indicate
the presence of cancer, but there can also be other causes. As a matter of
fact, elevated tumor marker values themselves are not diagnostic, but rather
suggestive; tumor markers can be used to monitor the result of a treatment (as
for example chemotherapy).
Literature discussing tumor markers, their identification, their use, and the
application of data mining methods for describing the relationship between
markers and the diagnosis of certain cancer types can be found for example
in [?] (where an overview of clinical laboratory tests is given and different
application scenarios of such tests as well as the reasons for their production are
described), [?], [?], [?], and [?].
Information about the tumor markers stored in the AKH database is listed
in the lower part of Table ??.
2.1.3 Cancer Diagnoses
Finally, information about cancer diagnoses is also available in the AKH
database: If a patient is diagnosed with any kind of cancer, then this is also
stored in the database.
Our goal in the research work described in this paper is to identify estimation
models for the presence of the following types of cancer: Malignant neoplasms
in the respiratory system (RSC, cancer classes C30–C39 according to the In-
ternational Statistical Classification of Diseases and Related Health Problems,
10th Revision (ICD-10)), melanoma and malignant neoplasms on the skin (Mel,
C43–C44), and breast cancer (BC, C50).

Table 1: Overview of the data sets compiled for selected cancer types

Input variables (identical for all three data sets): AGE, SEX, AFP, ALT, AST, BSG1, BUN, C125, C153, C199, C724,
CBAA, CEA, CEOA, CH37, CHOL, CLYA, CMOA, CNEA, CRP, CYFS, FE, FER, FPSA, GT37, HB, HDL, HKT, HS,
KREA, LD37, MCV, NSE, PLT, PSA, PSAQ, RBC, S100, SCC, TBIL, TF, TPS, WBC

Cancer Type                 Total Samples   Samples in Class 0   Samples in Class 1   Missing Values
Breast Cancer               706             324 (45.89%)         382 (54.11%)         46.67%
Melanoma                    905             485 (53.59%)         420 (46.41%)         47.79%
Respiratory System Cancer   2,363           1,367 (57.85%)       996 (42.15%)         44.76%
2.2 Data Preprocessing
Before analyzing the data and using them for training classifiers we have pre-
processed the available data:
- All variables have been linearly scaled to the interval [0; 1]: For each vari-
  able v_i, the minimum value min_i is subtracted from all contained values
  and the result is divided by the difference between min_i and the maximum
  plausible value maxplau_i; all values greater than the given maximum plau-
  sible value are replaced by 1.0 (a small sketch of this scaling step is given
  after this list).

- All samples belonging to the same patient with not more than one day
  difference with respect to the measurement date have been merged. This
  has been done in order to decrease the number of missing values in the
  data matrix. In rare cases, more than one value might thus be available
  for a certain variable; in such a case, the first value is used.

- Additionally, all measurements have been sample-wise re-arranged and
  clustered according to the patients' IDs. This has been done in order to
  prevent data of certain patients being included in the training as well as
  in the test data.
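As an illustration of the scaling step in the first item above, the following minimal Python sketch (not the authors' original code; the function name and the example plausible range are assumptions for illustration) scales a variable to [0, 1] using its observed minimum and a given maximum plausible value, and maps values above that maximum to 1.0.

```python
import numpy as np

def scale_to_unit_interval(values, max_plausible):
    """Scale a 1-D array of measurements to [0, 1]: subtract the observed
    minimum, divide by the difference between that minimum and the given
    maximum plausible value, and map everything above the maximum plausible
    value to 1.0. Missing values (NaN) are passed through unchanged."""
    values = np.asarray(values, dtype=float)
    v_min = np.nanmin(values)
    scaled = (values - v_min) / (max_plausible - v_min)
    return np.clip(scaled, 0.0, 1.0)

# Hypothetical example: CRP measurements with a plausible range of [0, 20] mg/dl.
crp = np.array([0.5, 3.2, 19.0, 25.0, np.nan])
print(scale_to_unit_interval(crp, max_plausible=20.0))
```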
Before starting the modeling algorithms for training classifiers we had to
compile separate data sets for each analyzed target tumor t_i: First, blood pa-
rameter measurements were joined with diagnosis results; only measurements
and diagnoses with a time delta of less than a month were considered. Second, all
samples containing measured values for t_i are extracted. Third, all samples are
removed that contain less than 15 valid values. Finally, variables with less than
10% valid values are removed from the database.
This procedure results in a specialized data set ds_{ti} for each analyzed tumor t_i.
In Table ?? we summarize statistical information about all resulting data sets
for the cancer types analyzed here; the numbers of samples belonging to each of the
defined classes are also given for each resulting data set.
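The compilation of these specialized data sets can be pictured as a sequence of simple join and filter steps. The following pandas sketch is only an illustration under assumed column names (patient_id, date, and a 0/1 diagnosis column per target tumor) and assumes datetime-typed date columns; it is not the code used for the experiments.

```python
import pandas as pd

def compile_dataset(measurements: pd.DataFrame, diagnoses: pd.DataFrame,
                    target: str) -> pd.DataFrame:
    """Join blood parameter measurements with diagnosis results for one
    target tumor and apply the sample / variable filters described above.
    Assumed columns: 'patient_id' and (datetime) 'date' in both frames,
    a 0/1 diagnosis in diagnoses[target], blood parameters elsewhere."""
    # Join measurements and diagnoses of the same patient and keep only
    # pairs whose time delta is less than (roughly) one month.
    joined = measurements.merge(diagnoses[["patient_id", "date", target]],
                                on="patient_id", suffixes=("", "_diag"))
    delta = (joined["date_diag"] - joined["date"]).abs()
    joined = joined[delta < pd.Timedelta(days=31)].drop(columns=["date_diag"])

    # Keep only samples with a value for the target and at least 15 valid values.
    joined = joined.dropna(subset=[target])
    feature_cols = [c for c in joined.columns
                    if c not in ("patient_id", "date", target)]
    joined = joined[joined[feature_cols].notna().sum(axis=1) >= 15]

    # Drop variables with less than 10% valid values.
    kept = [c for c in feature_cols if joined[c].notna().mean() >= 0.10]
    return joined[["patient_id", target] + kept]
```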
[Figure 1 (schematic): the full medical data set (blood parameters, tumor marker
target values) is reduced to data subsets of selected blood parameters; parents
selection, crossover, and mutation generate new subsets, which are evaluated by
modeling (lin. reg., kNN, ANN, SVM, ... with k-fold cross validation), followed
by offspring selection.]
Figure 1: A hybrid evolutionary algorithm for feature selection and parameter
optimization in data based modeling.
3 Modeling Methods
In this section we describe the modeling methods applied for identifying estima-
tion models for cancer diagnosis: On the one hand we apply hybrid modeling
using machine learning algorithms and evolutionary algorithms for parameter
optimization and feature selection (as described in Section ??), on the other
hand we use genetic programming (as described in Section ??).
3.1 Hybrid Modeling Using Machine Learning Algorithms
and Evolutionary Algorithms for Parameter Opti-
mization and Feature Selection
3.1.1 General Modeling Approach: Definition and Evaluation of So-
lution Candidates
Feature selection is often considered an essential step in data based modeling;
it is used to reduce the dimensionality of the datasets and often leads to
better analyses. Given a set of n features F = {f_1, f_2, . . . , f_n}, our goal here
is to find a subset F' ⊆ F that is on the one hand as small as possible and
on the other hand allows modeling methods to identify models that estimate
given target values as well as possible. Additionally, each data based modeling
method (except plain linear regression) has several parameters that have to be
set before starting the modeling process.
The fitness of a feature selection F' and training parameters with respect
to the chosen modeling method is calculated in the following way: We use
a machine learning algorithm m (with parameters p) for estimating predicted
target values est(F', m, p) and compare those to the original target values orig;
the coefficient of determination (R^2) function is used for calculating the quality
of the estimated values. Additionally, we also calculate the ratio of selected
features |F'|/|F|. Finally, using a weighting factor \alpha, we calculate the fitness of
the set of features F' using m and p as

fitness(F', m, p) = \alpha \cdot |F'|/|F| + (1 - \alpha) \cdot (1 - R^2(est(F', m, p), orig)).   (1)
As an alternative to the coefficient of determination function we can also use
a classification specific function that calculates the ratio of correctly classified
samples, either in total or as the average of all classification accuracies of the
given classes (as for example described in [?], Section 8.2): For all samples
that are to be considered we know the original classifications origCl, and using
(predefined or dynamically chosen) thresholds we get estimated classifications
estCl(F', m, p) for estimated target values est(F', m, p). The total classification
accuracy ca(F', m, p) is calculated as

ca(F', m, p) = |\{ j : estCl(F', m, p)[j] = origCl[j] \}| / |estCl|   (2)

Class-wise classification accuracies cwca are calculated as the average of all
classification accuracies for each given class c \in C separately:

ca(F', m, p)_c = |\{ j : estCl(F', m, p)[j] = origCl[j] = c \}| / |\{ j : origCl[j] = c \}|   (3)

cwca(F', m, p) = \left( \sum_{c \in C} ca(F', m, p)_c \right) / |C|   (4)

We can now define the classification specific fitness of a feature selection F' using
m and p as

fitness_{ca}(F', m, p) = \alpha \cdot |F'|/|F| + (1 - \alpha) \cdot (1 - ca(F', m, p))   (5)

or

fitness_{cwca}(F', m, p) = \alpha \cdot |F'|/|F| + (1 - \alpha) \cdot (1 - cwca(F', m, p)).   (6)
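To make these definitions concrete, the sketch below implements Equations (1)–(6) in Python; the estimator interface (fit/predict), the feature mask representation, and the 0.5 classification threshold are assumptions made for illustration. In the actual experiments the model quality was additionally evaluated with k-fold cross validation (cf. Figure 1), which is omitted here for brevity.

```python
import numpy as np

def r2(estimated, original):
    """Coefficient of determination R^2 of estimated vs. original target values."""
    ss_res = np.sum((original - estimated) ** 2)
    ss_tot = np.sum((original - np.mean(original)) ** 2)
    return 1.0 - ss_res / ss_tot

def fitness(selection, model, X, y, alpha=0.1, quality="r2", threshold=0.5):
    """Fitness of a feature selection F' (boolean mask over all features)
    evaluated with a given model m and its parameters p.
    quality = "r2"   -> Eq. (1)
    quality = "ca"   -> Eq. (5) with total accuracy, Eq. (2)
    quality = "cwca" -> Eq. (6) with class-wise accuracy, Eqs. (3)/(4)
    Lower fitness values are better."""
    ratio = np.sum(selection) / selection.size              # |F'| / |F|
    model.fit(X[:, selection], y)                            # train on selected features
    est = model.predict(X[:, selection])

    if quality == "r2":
        q = r2(est, y)
    else:
        est_cl = (est >= threshold).astype(int)              # estimated classes estCl
        if quality == "ca":
            q = np.mean(est_cl == y)                         # Eq. (2)
        else:
            q = np.mean([np.mean(est_cl[y == c] == c)        # Eq. (3) per class,
                         for c in np.unique(y)])             # averaged as in Eq. (4)
    return alpha * ratio + (1.0 - alpha) * (1.0 - q)
```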
In [?], for example, the use of evolutionary algorithms for feature selection
optimization is discussed in detail in the context of gene selection in cancer
classification; in [?] we have analyzed the sets of features identified as relevant
in the modeling of tumor markers AFP and CA15-3.
We have now used evolutionary algorithms for finding optimal feature sets
as well as optimal modeling parameters for models for tumor diagnosis; this
approach is schematically shown in Figure ??. A solution candidate is here
represented as [s_{1,...,n}, p_{1,...,q}] where s_i is a bit denoting whether feature F_i is
selected or not and p_j is the value for parameter j of the chosen modeling
method m. This rather simple definition of solution candidates enables the
use of standard concepts for genetic operators for crossover and mutation of
bit vectors and real valued vectors: We use uniform, single point, and 2-point
crossover operators for binary vectors and bit flip mutation that flips each of
the given bits with a given probability. Explanations of these operators can for
example be found in [?] and [?].
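The solution candidate representation and the genetic operators on it can be sketched as follows; the concrete operators (single point crossover on the bit part, averaging and Gaussian perturbation on the parameter part) are generic textbook choices used here for illustration only, not HeuristicLab's implementation.

```python
import random

def random_candidate(n_features, param_ranges, p_select=0.3):
    """A candidate [s_1..n, p_1..q]: feature selection bits plus model parameters."""
    bits = [1 if random.random() < p_select else 0 for _ in range(n_features)]
    params = [random.uniform(lo, hi) for lo, hi in param_ranges]
    return bits, params

def crossover(parent_a, parent_b):
    """Single point crossover on the bit part, arithmetic mix of the parameters."""
    bits_a, params_a = parent_a
    bits_b, params_b = parent_b
    point = random.randrange(1, len(bits_a))
    bits = bits_a[:point] + bits_b[point:]
    params = [(pa + pb) / 2.0 for pa, pb in zip(params_a, params_b)]
    return bits, params

def mutate(candidate, param_ranges, p_flip=0.05, p_perturb=0.2):
    """Bit flip mutation for the selection part, bounded Gaussian noise for parameters."""
    bits, params = candidate
    bits = [b ^ 1 if random.random() < p_flip else b for b in bits]
    params = [min(max(p + random.gauss(0, 0.1 * (hi - lo)), lo), hi)
              if random.random() < p_perturb else p
              for p, (lo, hi) in zip(params, param_ranges)]
    return bits, params

# Hypothetical usage for an SVM with parameters c in [0, 512] and gamma in [0, 1]:
ranges = [(0.0, 512.0), (0.0, 1.0)]
a, b = random_candidate(30, ranges), random_candidate(30, ranges)
child = mutate(crossover(a, b), ranges)
```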
We have used strict offspring selection [?] which means that individuals are
accepted to become members of the next generation if they are evaluated better
than both parents. Standard fitness evaluation as given in Equation ?? has
been used during the execution of the evolutionary processes, and classification
specific fitness evaluation as given in Equation ?? has been used for selecting
the solution candidate eventually returned as the algorithm’s result.
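The strict offspring selection rule (accept a child only if it is better than both parents) can be sketched as the following generation loop; the fitness function and the crossover/mutation callables are placeholders for the ones defined above, parent selection is simplified to random sampling, and fitness is assumed to be minimized as in Equation (1).

```python
import random

def strict_offspring_selection(population, fitness_fn, crossover, mutate,
                               max_selection_pressure=100.0):
    """One generation with strict offspring selection (success ratio 1.0):
    a child is accepted only if it is strictly better (lower fitness) than
    both of its parents; the generation ends when enough accepted children
    exist or the selection pressure limit is reached."""
    target_size = len(population)
    accepted, trials, selection_pressure = [], 0, 0.0
    while len(accepted) < target_size:
        trials += 1
        selection_pressure = trials / target_size
        if selection_pressure > max_selection_pressure:
            break  # termination criterion used in the test series of this paper
        pa, pb = random.sample(population, 2)      # simplified parent selection
        child = mutate(crossover(pa, pb))
        if fitness_fn(child) < min(fitness_fn(pa), fitness_fn(pb)):
            accepted.append(child)
    return accepted, selection_pressure
```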
3.1.2 Modeling Methods Used in this Research Project
The following techniques for training classifiers have been used in this research
project: Linear regression, neural networks, the k-nearest-neighbor method,
support vector machines, and genetic programming. All these machine learn-
ing methods have been implemented using the HeuristicLab framework2[?],
a framework for prototyping and analyzing optimization techniques for which
both generic concepts of evolutionary algorithms and many functions to eval-
uate and analyze them are available; we have used these implementations for
producing the results summarized in the following section. In this section we
give information about these training methods; details about the HeuristicLab
implementation of these methods can for example be found in [?].
Linear modeling
Given a data collection including m input features storing the informa-
tion about N samples, a linear model is defined by the vector of coefficients
\theta_{1...m}. For calculating the vector of modeled values e using the given input
values matrix u_{1...m}, these input values are multiplied with the corresponding
coefficients and added: e = u_{1...m} \cdot \theta. The coefficients vector can be com-
puted by simply applying matrix division. For conducting the test series doc-
umented here we have used an implementation of the matrix division function:
\theta = InputValues \ TargetValues. Additionally, a constant additive factor is
also included in the model; i.e., a constant offset is added to the coefficients
vector. Theoretical background of this approach can be found in [?].
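In NumPy, the matrix division step corresponds to an ordinary least squares solve; the sketch below is a stand-in for the HeuristicLab implementation and appends a constant column so that the constant offset becomes part of the coefficient vector.

```python
import numpy as np

def fit_linear_model(inputs, targets):
    """Least squares coefficients theta (including a constant offset) such that
    [inputs, 1] @ theta approximates the targets; this is the NumPy analogue
    of the matrix division theta = InputValues \\ TargetValues."""
    X = np.column_stack([inputs, np.ones(len(inputs))])   # add constant column
    theta, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return theta

def predict_linear_model(inputs, theta):
    X = np.column_stack([inputs, np.ones(len(inputs))])
    return X @ theta

# Hypothetical example with 5 samples and 2 (already scaled) input features:
u = np.array([[0.1, 0.9], [0.4, 0.2], [0.8, 0.7], [0.3, 0.3], [0.6, 0.5]])
y = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
print(predict_linear_model(u, fit_linear_model(u, y)))
```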
kNN Classification
Unlike other data based modeling methods, k-nearest-neighbor classification
2 http://dev.heuristiclab.com
[?] works without creating any explicit models. During the training phase, the
samples are simply collected; when it comes to classifying a new, unknown
sample x_new, the sample-wise distance between x_new and all other training
samples x_train is calculated and the classification is done on the basis of those
k training samples (x_NN) showing the smallest distances from x_new.
In the context of classification, the numbers of instances (of the k nearest
neighbors) are counted for each given class and the algorithm automatically pre-
dicts the class that is represented by the highest number of instances (included
in x_NN). In the test series documented in this paper we have applied weighting
to kNN classification: The distance between x_new and x_NN is relevant for the
classification statement; the weight of “nearer” samples is higher than that of
samples that are “further away” from x_new.
In this research work we have varied k between 1 and 10.
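A distance-weighted kNN classifier of the kind described above can be sketched as follows; inverse distance weighting is one common way of giving nearer samples a higher weight and is used here purely for illustration.

```python
import numpy as np

def knn_classify(x_new, X_train, y_train, k=5, eps=1e-9):
    """Predict the class of x_new from its k nearest training samples,
    weighting nearer samples more strongly (inverse distance weighting)."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nn = np.argsort(distances)[:k]              # indices of the k nearest neighbors
    weights = 1.0 / (distances[nn] + eps)       # nearer samples get higher weights
    classes = np.unique(y_train)
    # accumulate the weights per class and predict the class with the largest sum
    scores = [weights[y_train[nn] == c].sum() for c in classes]
    return classes[int(np.argmax(scores))]

# Hypothetical example with two classes in a two-dimensional feature space:
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y = np.array([0, 0, 1, 1])
print(knn_classify(np.array([0.15, 0.15]), X, y, k=3))
```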
Artificial Neural Networks
For training artificial neural network (ANN) models, three-layer feed-forward
neural networks with one linear output neuron were created using backprop-
agation; theoretical background and details can for example be found in [?]
(Chapter 11, “Neural Networks”). In the tests documented in this paper the
number of hidden (sigmoidal) nodes hn has been varied from 5 to 100; we have
applied ANN training algorithms that use internal validation sets, i.e., training
algorithms use 30% of the given training data as validation data and eventually
return those network structures that perform best on these internal validation
samples.
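The ANN setup described above (one hidden layer of sigmoidal nodes, a linear output, and an internal validation split of 30% of the training data) can be approximated with scikit-learn's MLPRegressor as sketched below; this is only a stand-in for the HeuristicLab implementation, and the concrete parameter values are illustrative.

```python
from sklearn.neural_network import MLPRegressor

def train_ann(X_train, y_train, hidden_nodes=50):
    """Three-layer feed-forward network trained with backpropagation; early
    stopping holds back 30% of the training data as an internal validation
    set and keeps the parameters that performed best on it."""
    ann = MLPRegressor(hidden_layer_sizes=(hidden_nodes,),  # hn varied from 5 to 100
                       activation="logistic",               # sigmoidal hidden nodes
                       early_stopping=True,                 # internal validation set
                       validation_fraction=0.3,             # 30% of the training data
                       max_iter=2000,
                       random_state=0)
    ann.fit(X_train, y_train)
    return ann
```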
Support Vector Machines
Support vector machines (SVMs) are a widely used approach in machine
learning based on statistical learning theory [?]. The most important aspect
of SVMs is that it is possible to give bounds on the generalization error of
the models produced, and to select the corresponding best model from a set of
models following the principle of structural risk minimization [?].
In this work we have used the LIBSVM implementation described in [?],
which is used in the respective SVM interface implemented for HeuristicLab;
here we have used Gaussian radial basis function kernels with varying values for
the cost parameter c (c \in [0, 512]) and the \gamma parameter of the SVM's kernel
function (\gamma \in [0, 1]).
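The SVM configuration corresponds to support vector regression with a Gaussian RBF kernel in LIBSVM; the scikit-learn wrapper around LIBSVM shown below illustrates the two parameters optimized by the evolutionary algorithm (cost c and kernel parameter γ), with example values chosen arbitrarily.

```python
from sklearn.svm import SVR  # scikit-learn's SVR wraps the LIBSVM implementation

def train_svm(X_train, y_train, c=100.0, gamma=0.1):
    """Support vector regression with a Gaussian radial basis function kernel;
    c (here taken from [0, 512]) and gamma (from [0, 1]) are the parameters
    tuned by the evolutionary algorithm in this work."""
    svm = SVR(kernel="rbf", C=c, gamma=gamma)
    svm.fit(X_train, y_train)
    return svm
```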
3.2 Genetic Programming
As an alternative to the approach described in the previous sections we have also
applied a classification algorithm based on genetic programming (GP) [?] using
a structure identification framework described in [?] and [?], in combination
with strict offspring selection; this GP approach has been implemented as a
part of HeuristicLab.
We have used the following parameter settings for our GP test series: The
mutation rate was set to 20%, gender specific parents selection [?] (combining
random and roulette selection) was applied as well as strict offspring selection
[?] (OS, with success ratio as well as comparison factor set to 1.0). The function
set described in [?] (including arithmetic as well as logical ones) was used for
building composite function expressions.
In addition to splitting the given data into training and test data, the GP
based training algorithm implemented in HeuristicLab has been designed in
such a way that a part of the given training data is not used for training models
and serves as validation set; in the end, when it comes to returning classifiers,
the algorithm returns those models that perform best on validation data. This
approach has been chosen because it is assumed to help to cope with over-fitting;
it is also applied in other GP based machine learning algorithms as for example
described in [?].
4 Modeling Results
Five-fold cross-validation [?] training / test series have been executed; this
means that the available data are separated in five (approximately) equally
sized, complementary subsets, and in each training / test cycle one data subset
is chosen is used as test and the rest of the data as training samples.
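Such a five-fold training/test series can be sketched as below; since the samples were clustered by patient ID during preprocessing, a grouped split (scikit-learn's GroupKFold is used here as an illustrative assumption) additionally guarantees that no patient contributes samples to both training and test data of the same fold.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def cross_validated_accuracies(model, X, y, patient_ids, threshold=0.5, folds=5):
    """Run a five-fold training/test series and return the mean and standard
    deviation of the test accuracies; folds are built so that all samples of
    a patient stay together."""
    accuracies = []
    splitter = GroupKFold(n_splits=folds)
    for train_idx, test_idx in splitter.split(X, y, groups=patient_ids):
        model.fit(X[train_idx], y[train_idx])
        est = model.predict(X[test_idx])
        est_cl = (est >= threshold).astype(int)      # classification threshold 0.5
        accuracies.append(np.mean(est_cl == y[test_idx]))
    return np.mean(accuracies), np.std(accuracies)   # mu and sigma as in Tables 2-4
```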
In this section we document test accuracies (µ, σ) for the investigated cancer
types; we here summarize test results for modeling cancer diagnoses using tumor
markers (TMs) as well as for modeling without using tumor markers. Linear
modeling, kNN modeling, ANNs, and SVMs have been applied for identifying es-
timation models for the selected tumor types; genetic algorithms with strict OS
have been applied for optimizing variable selections and modeling parameters.
Standard fitness calculation as given in ?? has been used by the evolutionary
process, and the classification specific one as given in ?? has been used for selecting
the eventually returned model. The probability of selecting a variable initially
was set to 30%. Additionally, we have also applied simple linear regression us-
ing all available variables. Finally, genetic programming with strict offspring
selection (OSGP) has also been applied.
In all test series the maximum selection pressure [?] was set to 100, i.e.,
the algorithms were terminated as soon as the selection pressure reached 100.
The population size for genetic algorithms optimizing variable selections and
modeling parameters was set to 10, for GP the population size was set to 700.
In all modeling cases except kNN modeling, regression models have been trained;
the threshold for classification decisions was in all cases set to 0.5 (since the
absence of the specific tumor is represented by 0.0 in the data and its presence
by 1.0).
Details about the size of the optimized variable sets as well as optimized
modeling parameters are summarized in the appendix of this paper.
Table 2: Modeling results for breast cancer diagnosis
Using TMs Not using TMs
Modeling Method Test accuracies Test accuracies
µ σ µ σ
LR, full feature set 79.32% 1.06 70.63% 1.28
OSGA + LR, α= 0.0 81.78% 0.21 73.13% 0.36
OSGA + LR, α= 0.1 81.49% 1.18 72.66% 0.14
OSGA + LR, α= 0.2 81.44% 0.37 71.40% 0.57
OSGA + kNN, α= 0.0 79.21% 0.78 74.22% 2.98
OSGA + kNN, α= 0.1 78.99% 0.57 75.55% 0.87
OSGA + kNN, α= 0.2 78.33% 1.04 74.50% 0.20
OSGA + ANN, α= 0.0 81.41% 1.14 75.60% 2.47
OSGA + ANN, α= 0.1 80.19% 1.68 72.38% 6.08
OSGA + ANN, α= 0.2 79.37% 1.17 70.54% 6.10
OSGA + SVM, α= 0.0 81.23% 1.10 73.90% 2.36
OSGA + SVM, α= 0.1 80.46% 1.80 72.19% 0.94
OSGA + SVM, α= 0.2 77.43% 3.55 71.89% 0.70
OSGP, ms = 50 79.72% 1.80 75.32% 0.45
OSGP, ms = 100 75.50% 4.95 71.63% 2.75
OSGP, ms = 150 79.20% 6.60 75.75% 2.16
Table 3: Modeling results for melanoma diagnosis
Using TMs Not using TMs
Modeling Method Test accuracies Test accuracies
µ σ µ σ
LR, full feature set 73.81% 3.39 71.09% 4.14
OSGA + LR, α= 0.0 72.45% 4.69 72.36% 2.30
OSGA + LR, α= 0.1 74.73% 2.35 72.09% 4.01
OSGA + LR, α= 0.2 73.85% 2.54 72.70% 2.02
OSGA + kNN, α= 0.0 68.77% 2.38 71.00% 1.97
OSGA + kNN, α= 0.1 71.33% 0.27 70.21% 3.41
OSGA + kNN, α= 0.2 67.33% 0.31 69.65% 3.14
OSGA + ANN, α= 0.0 74.78% 1.63 69.17% 2.97
OSGA + ANN, α= 0.1 73.81% 2.23 71.82% 0.61
OSGA + ANN, α= 0.2 74.12% 1.03 71.40% 0.49
OSGA + SVM, α= 0.0 69.72% 7.57 68.87% 4.78
OSGA + SVM, α= 0.1 71.75% 4.88 68.22% 1.88
OSGA + SVM, α= 0.2 61.48% 3.99 63.20% 2.09
OSGP, ms = 50 71.24% 9.54 74.89% 3.66
OSGP, ms = 100 69.91% 5.20 65.16% 13.06
OSGP, ms = 150 71.79% 4.31 70.13% 3.60
Table 4: Modeling results for respiratory system cancer diagnosis
Using TMs Not using TMs
Modeling Method Test accuracies Test accuracies
µ σ µ σ
LR, full feature set 91.32% 0.37 85.97% 0.27
OSGA + LR, α= 0.0 91.57% 0.46 86.41% 0.36
OSGA + LR, α= 0.1 91.16% 1.18 85.80% 0.45
OSGA + LR, α= 0.2 89.45% 0.37 85.02% 0.15
OSGA + kNN, α= 0.0 90.98% 0.84 87.09% 0.46
OSGA + kNN, α= 0.1 90.01% 2.63 87.01% 0.83
OSGA + kNN, α= 0.2 90.16% 0.74 86.92% 0.81
OSGA + ANN, α= 0.0 90.28% 1.63 85.97% 4.07
OSGA + ANN, α= 0.1 90.99% 1.97 85.82% 4.52
OSGA + ANN, α= 0.2 88.64% 1.87 87.24% 1.91
OSGA + SVM, α= 0.0 89.03% 1.38 83.12% 3.79
OSGA + SVM, α= 0.1 89.91% 1.58 86.25% 0.79
OSGA + SVM, α= 0.2 88.33% 1.94 84.66% 2.06
OSGP, ms = 50 89.58% 2.75 85.98% 5.74
OSGP, ms = 100 90.44% 3.02 86.54% 6.02
OSGP, ms = 150 89.58% 3.75 87.97% 5.57
5 Conclusion
As documented in the previous section, the investigated diagnoses of breast
cancer, melanoma, and respiratory system cancer can be estimated correctly
in up to 81%, 74%, and 91% of the analyzed test cases, respectively; without
tumor markers up to 75%, 74%, and 88% of the test samples are correctly
estimated, respectively. Linear modeling performs well in all modeling tasks;
feature selection using genetic algorithms and nonlinear modeling yield even
better results for all analyzed modeling tasks. No single modeling method performs
best for all diagnosis prediction tasks.
Further research shall focus on the practical application of the research results
presented here in the treatment of patients; the authors also plan to analyze
to what extent separately estimated tumor markers (as discussed in [?] and [?]) can
help improve cancer diagnosis predictions without having to use original tumor
marker values.
6 Acknowledgments
The work described in this paper was done within the Josef Ressel Centre for
Heuristic Optimization (Heureka!) sponsored by the Austrian Research Promo-
tion Agency (FFG).
References
[1] M. Affenzeller and S. Wagner. SASEGASA: A new generic parallel evolu-
tionary algorithm for achieving highest quality results. Journal of Heuris-
tics - Special Issue on New Advances on Parallel Meta-Heuristics for Com-
plex Problems, 10:239–263, 2004.
[2] M. Affenzeller, S. Winkler, S. Wagner, and A. Beham. Genetic Algorithms
and Genetic Programming - Modern Concepts and Practical Applications.
Chapman & Hall / CRC, 2009.
[3] E. Alba, J. García-Nieto, L. Jourdan, and E.-G. Talbi. Gene selection in cancer
classification using PSO/SVM and GA/SVM hybrid algorithms. IEEE
Congress on Evolutionary Computation 2007, pages 284–290, 2007.
[4] G. L. Andriole, E. D. Crawford, R. L. Grubb III, S. S. Buys, D. Chia, T. R.
Church, et al. Mortality results from a randomized prostate-cancer screen-
ing trial. New England Journal of Medicine, 360(13):1310–1319, 2009.
[5] W. Banzhaf and C. Lasarczyk. Genetic programming of an algorithmic
chemistry. In U. O’Reilly, T. Yu, R. Riolo, and B. Worzel, editors, Genetic
Programming Theory and Practice II, pages 175–190. Ann Arbor, 2004.
[6] N. Bitterlich and J. Schneider. Cut-off-independent tumour marker evalua-
tion using ROC approximation. Anticancer Research, 27:4305–4310, 2007.
[7] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector
machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[8] N. Clegg, C. Ferguson, L. True, H. Arnold, A. Moorman, J. Quinn, R. Ves-
sella, and P. Nelson. Molecular characterization of prostatic small-cell neu-
roendocrine carcinoma. Prostate, 55(1):55–64, 2003.
[9] G. Crombach, H. Würz, F. Herrmann, R. Kreienberg, V. Möbus,
P. Schmidt-Rhode, G. Sturm, H. Caffier, and H. Kaesemann. The impor-
tance of the SCC antigen in the diagnosis and follow-up of cervix carcinoma.
Deutsche Medizinische Wochenschrift, 114(18):700–705, 1989.
[10] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley
Interscience, 2nd edition, 2000.
[11] M. J. Duffy and J. Crown. A personalized approach to cancer treatment:
how biomarkers can help. Clinical Chemistry, 54(11):1770–1779, 2008.
[12] A. Eiben and J. Smith. Introduction to Evolutionary Computation. Natural
Computing Series. Springer-Verlag Berlin Heidelberg, 2003.
[13] B. Frey, R. Morant, H. Senn, and W. Riesen. Clinical assessment of the
new tumor marker tps. International Journal for Cancer Research and
Treatment, 17:270–276, 1994.
[14] P. Gold and S. O. Freedman. Demonstration of tumor-specific antigens
in human colonic carcinomata by immunological tolerance and absorption
techniques. The Journal of Experimental Medicine, 121:439–462, 1965.
[15] J. H. Holland. Adaptation in Natural and Artificial Systems. University of
Michigan Press, 1975.
[16] J. A. Koepke. Molecular marker test standardization. Cancer, 69:1578–
1581, 1992.
[17] R. Kohavi. A study of cross-validation and bootstrap for accuracy esti-
mation and model selection. In Proceedings of the 14th international joint
conference on Artificial intelligence, volume 2, pages 1137–1143. Morgan
Kaufmann, 1995.
[18] H. Koprowski, M. Herlyn, Z. Steplewski, and H. Sears. Specific antigen in
serum of patients with colon carcinoma. Science, 212(4490):53–55, 1981.
[19] J. R. Koza. Genetic Programming: On the Programming of Computers by
Means of Natural Selection. The MIT Press, 1992.
[20] M. LaFleur-Brooks. Exploring Medical Language: A Student-Directed Ap-
proach. St. Louis, Missouri, USA: Mosby Elsevier, 7th edition, 2008.
[21] R. S. Lai, C. C. Chen, P. C. Lee, and J. Y. Lu. Evaluation of cytokeratin
19 fragment (CYFRA 21-1) as a tumor marker in malignant pleural effusion.
Japanese Journal of Clinical Oncology, 29(9):421–424, 1999.
[22] L. Ljung. System Identification – Theory For the User, 2nd edition. PTR
Prentice Hall, Upper Saddle River, N.J., 1999.
[23] G. J. Mizejewski. Alpha-fetoprotein structure and function: relevance to
isoforms, epitopes, and conformational variants. Experimental biology and
medicine, 226(5):377–408, 2001.
[24] O. Nelles. Nonlinear System Identification. Springer Verlag, Berlin Heidel-
berg New York, 2001.
[25] Y. Niv. Muc1 and colorectal cancer pathophysiology considerations. World
Journal of Gastroenterology, 14(14):2139–2141, 2008.
[26] D. Nonaka, L. Chiriboga, and B. Rubin. Differential expression of s100 pro-
tein subtypes in malignant melanoma, and benign and malignant peripheral
nerve sheath tumors. Journal of Cutaneous Pathology, 35(11):1014–1019,
2008.
[27] A. J. Rai, Z. Zhang, J. Rosenzweig, I.-M. Shih, T. Pham, E. T. Fung, L. J.
Sokoll, and D. W. Chan. Proteomic approaches to tumor marker discovery.
Archives of Pathology & Laboratory Medicine, 126(12):1518–1526, 2002.
[28] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[29] S. Wagner. Heuristic Optimization Software Systems – Modeling of Heuris-
tic Optimization Algorithms in the HeuristicLab Software Environment.
PhD thesis, Johannes Kepler University Linz, 2009.
[30] S. Wagner and M. Affenzeller. SexualGA: Gender-specific selection for
genetic algorithms. In N. Callaos, W. Lesso, and E. Hansen, editors, Pro-
ceedings of the 9th World Multi-Conference on Systemics, Cybernetics and
Informatics (WMSCI) 2005, volume 4, pages 76–81. International Institute
of Informatics and Systemics, 2005.
[31] P. W. Williams and H. D. Gray. Gray’s anatomy. New York: C. Living-
stone, 37th edition, 1989.
[32] S. Winkler. Evolutionary System Identification - Modern Concepts and
Practical Applications. PhD thesis, Institute for Formal Models and Veri-
fication, Johannes Kepler University Linz, 2008.
[33] S. Winkler, M. Affenzeller, W. Jacak, and H. Stekel. Classification of
tumor marker values using heuristic data mining methods. In Proceedings
of the GECCO 2010 Workshop on Medical Applications of Genetic and
Evolutionary Computation (MedGEC 2010), 2010.
[34] S. Winkler, M. Affenzeller, G. Kronberger, M. Kommenda, S. Wagner,
W. Jacak, and H. Stekel. Feature selection in the analysis of tumor marker
data using evolutionary algorithms. In Proceedings of the 7th International
Mediterranean and Latin American Modelling Multiconference, pages 1 –
6, 2010.
[35] B. W. Yin, A. Dnistrian, and K. O. Lloyd. Ovarian cancer antigen CA125
is encoded by the MUC16 mucin gene. International Journal of Cancer,
98(5):737–40, 2002.
[36] K. Yonemori, M. Ando, T. S. Taro, N. Katsumata, K. Matsumoto, Y. Ya-
manaka, T. Kouno, C. Shimizu, and Y. Fujiwara. Tumor-marker analysis
and verification of prognostic models in patients with cancer of unknown
primary, receiving platinum-based combination chemotherapy. Journal of
Cancer Research and Clinical Oncology, 132(10):635–642, 2006.
[37] L. Zhong, X. Zhou, K. Wei, X. Yang, C. Ma, C. Zhang, and Z. Zhang.
Application of serum tumor markers and support vector machine in the
diagnosis of oral squamous cell carcinoma. Shanghai Kou Qiang Yi Xue
(Shanghai Journal of Stomatology), 17(5):457–460, 2008.
Appendix
In this appendix we summarize results of the executed modeling test series:
In Table ?? we summarize the effort of the modeling approaches applied in
this research work: For the combination of GAs and machine learning methods
we document the number of modeling executions, and for GP we give the number
of evaluated solutions (i.e., models).
For the combination of genetic algorithms with linear regression, kNN mod-
eling, ANNs, and SVMs (with varying variable ratio (vr) weighting factors) as
well as GP with varying maximum tree sizes ms we give the sizes of the selected
variable sets, and (where applicable) also k, hn, c, and γ. Obviously there are
variations in the parameters identified as optimal by the evolutionary
process: The numbers of variables used as well as the numbers of the neural networks' hid-
den nodes vary to a relatively small extent, whereas especially the SVMs'
parameters (in particular the c factors) vary very strongly.
Table 5: Effort in terms of executed modeling runs and evaluated model structures

Modeling   Modeling executions
Method     vrfw = 0.0          vrfw = 0.1          vrfw = 0.2
           µ        σ          µ        σ          µ        σ
LR 3260.4 717.8 2339.6 222.6 2465.2 459.2
kNN 2955.3 791.8 3046.0 362.4 3791.3 775.9
ANN 3734.0 855.9 3305.0 582.6 3297.0 475.9
SVM 2950.0 794.8 2846.0 391.4 3496.7 859.8
Modeling   Evaluated solutions (models)
Method     ms = 50                  ms = 100                 ms = 150
           µ          σ             µ          σ             µ          σ
OSGP       1483865.0  674026.2      1999913.3  198289.1      2238496.7  410123.6
Table 6: Optimized parameters for linear regression
Problem Instance, Variables used
vr weighting µ σ
BC, α= 0.0 16.6 2.10
TM α= 0.1 11.8 1.50
α= 0.2 6.4 0.60
BC, α= 0.0 9.6 1.15
no TM α= 0.1 8.8 0.58
α= 0.2 6.4 1.20
Mel, α= 0.0 16.6 0.55
TM α= 0.1 12.2 0.84
α= 0.2 9.2 4.09
Mel, α= 0.0 10.8 1.79
no TM α= 0.1 8.8 2.28
α= 0.2 8.2 1.92
RSC, α= 0.0 17.2 2.95
TM α= 0.1 13.4 2.51
α= 0.2 9.0 2.55
RSC, α= 0.0 16.0 4.64
no TM α= 0.1 9.6 0.89
α= 0.2 8.6 3.21
Table 7: Optimized parameters for kNN modeling
Problem Instance, Variables used k
vr weighting µ σ µ σ
BC, α= 0.0 18.2 2.20 9.8 2.10
TM α= 0.1 14.0 3.60 12.6 4.60
α= 0.2 11.0 1.80 11.2 3.00
BC, α= 0.0 14.4 1.67 11.2 1.64
no TM α= 0.1 14.0 2.45 13.8 3.11
α= 0.2 11.8 0.84 18.8 1.10
Mel, α= 0.0 15.6 1.82 17.8 2.86
TM α= 0.1 16.4 1.34 14.4 5.90
α= 0.2 13.6 1.67 19.4 1.34
Mel, α= 0.0 15.0 1.58 14.2 1.10
no TM α= 0.1 10.4 1.52 18.2 1.64
α= 0.2 9.6 1.14 16.8 2.05
RSC, α= 0.0 14.6 1.67 20.0 0.00
TM α= 0.1 13.6 1.67 16.8 6.06
α= 0.2 10.4 1.52 12.8 3.90
RSC, α= 0.0 15.6 2.79 15.2 1.64
no TM α= 0.1 12.2 1.10 10.6 1.82
α= 0.2 10.2 2.95 13.2 2.95
Table 8: Optimized parameters for ANNs
Problem Instance, Variables used hn
vr weighting µ σ µ σ
BC, α= 0.0 17.0 1.20 75.6 20.80
TM α= 0.1 14.8 3.40 51.0 5.80
α= 0.2 11.0 0.80 35.8 13.00
BC, α= 0.0 12.6 1.41 82.4 23.46
no TM α= 0.1 12.2 0.89 70.8 26.40
α= 0.2 11.2 1.10 68.2 14.58
Mel, α= 0.0 19.6 2.19 56.8 13.31
TM α= 0.1 12.8 2.28 61.0 2.55
α= 0.2 15.6 5.18 51.2 6.98
Mel, α= 0.0 15.4 2.51 68.6 14.24
no TM α= 0.1 8.2 1.64 59.8 3.83
α= 0.2 8.0 1.00 58.6 5.81
RSC, α= 0.0 13.4 3.44 64.6 10.97
TM α= 0.1 11.2 2.28 68.2 6.69
α= 0.2 8.2 1.64 60.2 13.92
RSC, α= 0.0 13.2 2.28 71.2 12.38
no TM α= 0.1 12.2 2.05 70.6 12.99
α= 0.2 11.6 2.19 64.4 14.24
Table 9: Optimized parameters for SVMs
Problem Instance, Variables used C γ
vr weighting µ σ µ σ µ σ
BC, α= 0.0 21.6 3.50 101.50 92.30 0.05 0.06
TM α= 0.1 18.8 3.50 12.44 13.85 0.09 0.01
α= 0.2 16.0 2.00 64.79 67.59 0.04 0.01
BC, α= 0.0 15.6 1.83 47.16 12.63 0.05 0.05
no TM α= 0.1 15.4 1.10 22.50 25.88 0.07 0.04
α= 0.2 13.0 2.65 8.14 10.09 0.07 0.04
Mel, α= 0.0 13.0 4.53 166.23 236.61 0.27 0.25
TM α= 0.1 10.8 3.42 204.74 210.43 0.18 0.19
α= 0.2 4.2 2.95 123.08 44.14 0.26 0.20
Mel, α= 0.0 21.4 6.95 116.21 196.73 0.41 0.30
no TM α= 0.1 19.8 1.64 492.73 8.10 0.48 0.41
α= 0.2 14.4 3.29 310.17 208.60 0.36 0.35
RSC, α= 0.0 21.2 8.50 183.54 95.38 0.27 0.26
TM α= 0.1 14.6 1.14 74.56 67.98 0.09 0.10
α= 0.2 11.2 3.83 37.55 68.31 0.45 0.35
RSC, α= 0.0 13.4 4.10 23.14 31.91 0.35 0.25
no TM α= 0.1 12.4 3.21 144.73 96.79 0.19 0.08
α= 0.2 12.4 3.21 376.66 206.84 0.09 0.10
Table 10: Number of variables used by models returned by OSGP
Problem Instance, Variables used by returned model
maximum tree size ms µ σ
BC, ms = 50 9.0 2.74
TM ms = 100 9.6 1.34
ms = 150 17.8 0.45
BC, ms = 50 10.5 0.71
no TM ms = 100 10.0 1.41
ms = 150 11.5 0.71
Mel, ms = 50 10.2 2.05
TM ms = 100 10.0 2.55
ms = 150 12.0 2.00
Mel, ms = 50 8.0 1.58
no TM ms = 100 8.8 0.84
ms = 150 11.4 3.36
RSC, ms = 50 7.8 2.05
TM ms = 100 12.0 2.35
ms = 150 12.0 1.22
RSC, ms = 50 9.4 3.91
no TM ms = 100 12.2 2.17
ms = 150 13.6 3.13
Table 11: List of patient data variables collected at AKH Linz in the years 2005
– 2008: Blood parameters, general patient information, and tumor markers
Para- Description Unit Plausible
meter Range
ALT Alanine transaminase, a transaminase enzyme;
also called glutamic pyruvic transaminase (GPT). U/l [1; 225]
AST Aspartate transaminase,
an enzyme also called glutamic oxaloacetic transaminase (GOT). U/l [1; 175]
BSG1 Erythrocyte sedimentation rate; mm [0; 50]
the rate at which red blood cells settle /
precipitate within one hour.
BUN Blood urea nitrogen; mg/dl [1; 150]
measures the amount of nitrogen in the blood (caused by urea).
CBAA Basophil granulocytes; type of leukocytes. G/l [0.0; 0.2]
CEOA Eosinophil granulocytes; type of leukocytes. G/l [0.0; 0.4]
CH37 Cholinesterase, an enzyme. kU/l [2; 23]
CHOL Cholesterol, a structural component of cell membranes. mg/dl [40; 550]
CLYA Lymphocytes; type of leukocytes. G/l [1; 4]
CMOA Monocytes; type of leukocytes. G/l [0.2; 0.8]
CNEA Neutrophils; most abundant type of leukocytes. G/l [1.8; 7.7]
CRP C-reactive protein, a protein; inflammations cause the rise of CRP. mg/dl [0; 20]
FE Iron. ug/dl [30; 210]
FER Ferritin, a protein that stores and transports iron in a safe form. ng/ml [10; 550]
GT37 γ-glutamyltransferase, an enzyme. U/l [1; 290]
HB Hemoglobin, a protein that contains iron and transports oxygen. g/dl [6; 18]
HDL High-density lipoprotein; this protein enables the transport of lipids with blood. mg/dl [25; 120]
HKT Hematocrit; the packed cell volume, i.e., the proportion of red blood cells within the blood. % [25; 65]
HS Uric acid, also called urate. mg/dl [1; 12]
KREA Creatinine, a chemical by-product produced in muscles. mg/dl [0.2; 5.0]
LD37 Lactate dehydrogenase (LDH), an enzyme that can be used as a marker of injuries to cells. U/l [5; 744]
MCV Mean corpuscular / cell volume; the average size (i.e., volume) of red blood cells. fl [69; 115]
PLT Thrombocytes, also called platelets, are irregularly-shaped cells that do not have a nucleus. G/l [25; 1,000]
RBC Erythrocytes, red blood cells that transport and deliver oxygen. T/l [2.2; 8.0]
TBIL Bilirubin, the yellow product of the heme catabolism. mg/dl [0; 5]
TF Transferrin, a protein, delivers iron. mg/dl [100; 500]
WBC Leukocytes, also called white blood cells (WBCs); cells that help the body fight infections or G/l [1.5; 50]
foreign materials.
AGE The patient’s age. years [0; 120]
SEX The patient’s sex. f/m {f, m}
AFP Alpha-fetoprotein ([?])
is a protein found in the blood plasma;
during fetal life it is produced IU/ml [0.0; 90.0]
by the yolk sac and the liver.
AFP is also often measured and used as a marker for a set of
tumors, especially endodermal sinus tumors (yolk sac carcinoma),
neuroblastoma, hepatocellular
carcinoma, and germ cell tumors [?].
CA 125 Cancer antigen 125 (CA 125) ([?]),
also called carbohydrate antigen 125 or mucin 16 (MUC16), U/ml [0.0; 150]
is a protein that is often used as a tumor marker that may
be elevated in the presence of
specific types of cancers.
CA 15-3 Mucin 1 (MUC1), also known as cancer antigen 15-3 (CA 15-3), is a protein used U/ml [0.0; 100.0]
as a tumor marker in the context of monitoring certain cancers [?],
especially breast cancer.
CA 19-9 CA 19-9 is a tumor marker often used to monitor
a person's response to cancer treatment U/ml [0.0; 120.0]
and/or cancer progression, for example in colon cancer
and pancreatic cancer [?].
CEA Carcinoembryonic antigen (CEA; [?]) is a protein that is in humans normally produced during ng/ml [0.0; 50.0]
fetal development. When used as a tumor marker, CEA is mainly used to identify recurrences
of cancer after surgical resections.
CYFRA Fragments of cytokeratin 19, a protein found in the cytoskeleton, are found in many places of the ng/ml [0.0; 10.0]
human body; especially in the lung and in malign lung tumors high concentrations of these
fragments, which are also called CYFRA 21-1, are found [?].
fPSA The free-to-total ratio of the prostate-specific antigen (PSA) is calculated and stored in fPSA. ratio [0.0; 1.0]
NSE The neuron-specific enolase (NSE) is an enzyme frequently used as tumor marker for lung cancer ng/ml [0.0; 100.0]
because it can help to identify neuronal cells and cells with neuroendocrine differentiation. [?]
PSA Prostate-specific antigen (PSA; [?]) is a protein produced in the prostate gland; PSA blood tests ng/ml [0.0; 20.0]
are widely considered the most effective test currently available for the early detection of
prostate cancer since PSA is often elevated in the presence of prostate disorders.
S-100 S-100 is a family of proteins found in vertebrates; members of the S-100 protein family ug/l [0.0; 1.2]
are useful as markers for certain tumors. Elevated S-100 values can be found in melanoma and are
used as cell markers for anatomic pathology and also as markers for inflammatory diseases. [?]
SCC The squamous cell carcinoma antigen (SCC) is used as tumor marker for the diagnosis and ng/ml [0.0; 20.0]
follow-up control of epithelial carcinoma [?].
TPS TPS (tissue polypeptide specific antigen) is used as tumor marker indicating cellular U/l [0.0; 300.0]
proliferation; details about this tumor marker can for example be found in [?].
... A recent comparative study suggests that GA outperforms PSO on high-dimensional benchmark datasets from the UCI repository [40]. As a wrapper approach, GA-based feature selection relies on machine learning algorithms, such as linear regression [41], logistic regression [42], Naive Bayes [43,44,45], support vector machines (SVM) [36,44,46], and artificial neural networks (ANN) [47,48,49], to evaluate the quality of a feature subset. ...
Preprint
Full-text available
Through genome-wide association studies (GWAS), disease susceptible genetic variables can be identified by comparing the genetic data of individuals with and without a specific disease. However, the discovery of these associations poses a significant challenge due to genetic heterogeneity and feature interactions. Genetic variables intertwined with these effects often exhibit lower effect-size, and thus can be difficult to be detected using machine learning feature selection methods. To address these challenges, this paper introduces a novel feature selection mechanism for GWAS, named Feature Co-selection Network (FCSNet). FCS-Net is designed to extract heterogeneous subsets of genetic variables from a network constructed from multiple independent feature selection runs based on a genetic algorithm (GA), an evolutionary learning algorithm. We employ a non-linear machine learning algorithm to detect feature interaction. We introduce the Community Risk Score (CRS), a synthetic feature designed to quantify the collective disease association of each variable subset. Our experiment showcases the effectiveness of the utilized GA-based feature selection method in identifying feature interactions through synthetic data analysis. Furthermore, we apply our novel approach to a case-control colorectal cancer GWAS dataset. The resulting synthetic features are then used to explain the genetic heterogeneity in an additional case-only GWAS dataset.
... Evolutionary algorithms, such as genetic algorithm (GA), are popular search strategies for wrapper approaches [27]- [31]. Various machine learning algorithms, including linear regression [32], logistic regression [33], Naive Bayes [34]- [36], support vector machines (SVM) [30], [35], [37], and artificial neural networks (ANN) [38]- [40], are used to evaluate the goodness of a feature subset. Different evaluation algorithms can lead feature search algorithms to discover features with different types of association. ...
Preprint
Full-text available
Data complexity analysis quantifies the hardness of constructing a predictive model on a given dataset. However, the effectiveness of existing data complexity measures can be challenged by the existence of irrelevant features and feature interactions in biological micro-array data. We propose a novel data complexity measure, depth, that leverages an evolutionary inspired feature selection algorithm to quantify the complexity of micro-array data. By examining feature subsets of varying sizes, the approach offers a novel perspective on data complexity analysis. Unlike traditional metrics, depth is robust to irrelevant features and effectively captures complexity stemming from feature interactions. On synthetic micro-array data, depth outperforms existing methods in robustness to irrelevant features and identifying complexity from feature interactions. Applied to case-control genotype and gene-expression micro-array datasets, the results reveal that a single feature of gene-expression data can account for over 90% of the performance of multi-feature model, confirming the adequacy of the commonly used differentially expressed gene (DEG) feature selection method for the gene expression data. Our study also demonstrates that constructing predictive models for genotype data is harder than gene expression data. The results in this paper provide evidence for the use of interpretable machine learning algorithms on microarray data.
... The main disadvantage of these methods is the 'nesting effects', so removed or selected features cannot be used for later testing. In addition, evolutionary techniques-based GA [40], GP [41], PSO [42], [43], and ACO [44] have been considered to determine which non-dominated solution provides the best tradeoff between the number of features and the classification accuracy. Most of the existing feature selection methods suffer from the issues of high computational time/cost and stagnation in local optimum. ...
Preprint
Full-text available
Fake account detection is a topical issue when many Online Social Networks encounter several issues caused by the growing number of unethical online activities. This study presents a new Quantum Beta-behaved Multi-Objective Particle Swarm Optimization (QB-MOPSO) algorithm for machine learning based Twitter fake accounts detection. The proposed approach aims to improve the learning process of deep neural networks, random forest, through minimizing simultaneously the feature dimensionality and the classification error rate. The main contribution consists in proposing a quantum beta MOPSO to handle the training phase of neural and deep architectures. The QB-MOPSO is used to perform a multi-objective training of the random forest algorithm. The QB-MOPSO has two optimization profiles: the first one uses a quantum-behaved equation for improving the exploratory behaviour of PSO, while the second one uses a beta function to enhance PSO’s exploitation. An extensive experimental study is carried out using two open Twitter datasets with 1982 and 928 accounts. The new proposal is a random forest QB-MOPSO. Results showed that random forest QB-MOPSO accuracy is about 99.19% and 97.52% accounts on datasets 1 and 2. Comparative analysis of the prosed architecture toward the original architecture showed that the use of QB-MOPSO for learning enhances the random forest algorithm which perform then the original ones.
... The main disadvantage of these methods is the 'nesting effects', so removed or selected features cannot be used for later testing. In addition, evolutionary techniques-based GA [40], GP [41], PSO [42], [43], and ACO [44] have been considered to determine which non-dominated solution provides the best tradeoff between the number of features and the classification accuracy. Most of the existing feature selection methods suffer from the issues of high computational time/cost and stagnation in local optimum. ...
Preprint
Full-text available
p>Fake account detection is a topical issue when many Online Social Networks encounter several issues caused by the growing number of unethical online activities. This study presents a new Quantum Beta-behaved Multi-Objective Particle Swarm Optimization (QB-MOPSO) algorithm for machine learning based Twitter fake accounts detection. The proposed approach aims to improve the learning process of deep neural networks, random forest, through minimizing simultaneously the feature dimensionality and the classification error rate. The main contribution consists in proposing a quantum beta MOPSO to handle the training phase of neural and deep architectures. The QB-MOPSO is used to perform a multi-objective training of the random forest algorithm. The QB-MOPSO has two optimization profiles: the first one uses a quantum-behaved equation for improving the exploratory behaviour of PSO, while the second one uses a beta function to enhance PSO’s exploitation. An extensive experimental study is carried out using two open Twitter datasets with 1982 and 928 accounts. The new proposal is a random forest QB-MOPSO. Results showed that random forest QB-MOPSO accuracy is about 99.19% and 97.52% accounts on datasets 1 and 2. Comparative analysis of the prosed architecture toward the original architecture showed that the use of QB-MOPSO for learning enhances the random forest algorithm which perform then the original ones.</p
... Yang and Honavar [88] suggested a fitness function with a sole purpose of maximising accuracy while reducing cost. To improve classification accuracy, Winkler et al. [83] used a range of fitness parameters. The optimum characteristics in a data classification problem ( [58]) may provide the highest accuracy. ...
Article
Full-text available
Globally, patients with diabetes, diabetic retinopathy, cancer, and heart disease are growing rapidly in developed and developing countries. As a result of these ailments, the rate of human mortality and vision loss has risen dramatically. The design and development of computer-based prediction systems may facilitate the appropriate treatment of these four illnesses by medical professionals. For the design of an efficient and fast prediction (or classification) system, it is necessary to use efficient feature selection techniques to reduce the complexity of the feature space. If there are n features, then there is a possibility that 2ⁿ subsets of features can be created, and testing all of these subsets of selected features would require a significant amount of time. The suggested technique is to investigate the application of ant-lion based optimization to choose a subset of features. The chosen characteristics are used to train and evaluate four classifiers (and their ensemble) based on machine learning. The study used over three public benchmark datasets and one privately composed dataset, each one was disease-specific. The performance of the recommended strategy was evaluated using five performance assessment measures. This adjustment significantly improves the outcome. The strategy may decrease the initial feature set by up to 50% without impacting performance (in terms of accuracy). We can get maximum accuracies of 84.44% for the heart disease dataset, 79.99% for the diabetes dataset, 98.52% for the diabetic retinopathy dataset, and 97.18% for the skin cancer dataset. This empirical research will help doctors and all people make better decisions by giving them a second opinion.
... Yang and Honavar (1998) introduced a single target fitness function in that maximizes accuracy and lowers costs. Winkler et al. (2011) have also used several fitness functions to improve classification accuracy. By using the best characteristics, the data categorization problem may be solved with the greatest amount of accuracy (Pandey and Kulhari 2018). ...
Article
Full-text available
Feature selection is an important component of the machine learning domain, which selects the ideal subset of characteristics relative to the target data by omitting irrelevant data. For a given number of features, there are 2ⁿ possible feature subsets, making it challenging to select the optimal set of features from a dataset via conventional feature selection approaches. We opted to investigate glaucoma infection since the number of individuals with this disease is rising quickly around the world. The goal of this study is to use the feature set (features derived from fundus images of benchmark datasets) to classify images into two classes (infected and normal) and to select the fewest features (feature selection) to achieve the best performance on various efficiency measuring metrics. In light of this, the paper implements and recommends a metaheuristics-based technique for feature selection based on emperor penguin optimization, bacterial foraging optimization, and proposes their hybrid algorithm. From the retinal fundus benchmark images, a total of 36 features were extracted. The proposed technique for selecting features minimizes the number of features while improving classification accuracy. Six machine learning classifiers classify on the basis of a smaller subset of features provided by these three optimization techniques. In addition to the execution time, eight statistically based performance metrics are calculated. The hybrid optimization technique combined with random forest achieves the highest accuracy, up to 0.95410. Because the proposed medical decision support system is effective and ensures trustworthy decision-making for glaucoma screening, it might be utilized by medical practitioners as a second opinion tool, as well as assist overworked expert ophthalmologists and prevent individuals from losing their eyesight.
... GA searches for feature groups by combining complementary subsets via crossover and adjustment via mutation. GA-based methods generally belong to wrapper approaches, and some common machine learning methods have been used to evaluate the goodness of a feature subset, such as linear regression [14], logistic regression [30], Naive Bayes [3,5,31], support vector machines (SVM) [3,27,29], and artificial neural networks (ANN) [22,32,37]. ...
... The main disadvantage of these methods is the 'nesting effects', so removed or selected features cannot be used for later testing. In addition, evolutionary techniques-based GA [39], GP [40], PSO [41], [42], and ACO [43] have been considered to determine which non-dominated solution provides the best tradeoff between the number of features and the classification accuracy. Most of the existing feature selection methods suffer from the issues of high computational time/cost and stagnation in local optimum. ...
Preprint
Full-text available
p>Fake account detection is a topical issue when many Online Social Networks (OSNs) encounter problems caused by a growing number of unethical online social activities. This study presents a new Quantum Beta-Distributed Multi-Objective Particle Swarm Optimization (QBD-MOPSO) system to detect fake accounts on Twitter. The proposed system aims to minimize two objective functions simultaneously: specifically features dimensionality and classification error rate. The QBD-MOPSO has two optimization profiles: the first uses a quantum behaved equation for improving the exploratory behaviour of PSO, while the second uses a beta function to enhance PSO’s exploitation. Six variants of the QBD-MOPSO approach are proposed to account for various data distribution types. The QBD-MOPSO system provides a feature selection technique based on the sigmoid function for position binary encoding. Each particle has a binary vector as a potential solution for feature subset selection, and a bit with the value of “1” indicating selection of a feature and “0” otherwise. Machine learning based classification models are trained and tested using a subset of selected features. An extensive experimental study is carried using two benchmark Twitter datasets with 1982 and 928 accounts. From 46 original features, QBD-MOPSO has selected 32 and 25 pertinent features and accurately classified 99.19% and 97.52% account on the datasets.</p
Article
Full-text available
This paper presents a new generic Evolutionary Algorithm (EA) for retarding the unwanted effects of premature convergence. This is accomplished by a combination of interacting generic methods. These generalizations of a Genetic Algorithm (GA) are inspired by population genetics and take advantage of the interactions between genetic drift and migration. In this regard a new selection scheme is introduced, which is designed to directly control genetic drift within the population by advantageous self-adaptive selection pressure steering. Additionally, this new selection model enables a quite intuitive heuristic for detecting premature convergence. Based upon this newly postulated basic principle, the new selection mechanism is combined with the already proposed Segregative Genetic Algorithm (SEGA), an advanced Genetic Algorithm (GA) that introduces parallelism mainly to improve global solution quality. As a whole, a new generic evolutionary algorithm (SASEGASA) is introduced. The performance of the algorithm is evaluated on a set of characteristic benchmark problems. Computational results show that the new method is capable of producing highest-quality solutions without any problem-specific additions.
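As a loose, simplified illustration of steering selection pressure self-adaptively, the sketch below accepts only offspring that outperform both parents and reports the number of trials per accepted offspring as the actual selection pressure; this is an assumption-based simplification for illustration, not the exact SASEGASA selection scheme.

```python
# Rough sketch: fill a generation with successful offspring only and measure
# the resulting selection pressure (trials per accepted offspring). The cap
# max_pressure acts as a simple premature-convergence signal.
import random

def next_generation(population, fitness, crossover, mutate, max_pressure=10.0):
    new_pop, trials = [], 0
    while len(new_pop) < len(population):
        trials += 1
        if trials > max_pressure * len(population):   # give up: pressure too high
            break
        p1, p2 = random.sample(population, 2)
        child = mutate(crossover(p1, p2))
        if fitness(child) > max(fitness(p1), fitness(p2)):
            new_pop.append(child)                      # accept improving offspring only
    pressure = trials / max(len(new_pop), 1)
    return new_pop, pressure

# toy usage: maximize the number of ones in a bit string
bits = 20
pop = [[random.randint(0, 1) for _ in range(bits)] for _ in range(30)]
fit = lambda ind: sum(ind)
cx = lambda a, b: a[:bits // 2] + b[bits // 2:]
mut = lambda ind: [1 - g if random.random() < 0.05 else g for g in ind]
pop, pressure = next_generation(pop, fit, cx, mut)
print(len(pop), "offspring accepted, selection pressure =", round(pressure, 2))
```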
Conference Paper
Full-text available
We review accuracy estimation methods and compare the two most common methods: cross-validation and bootstrap. Recent experimental results on artificial data and theoretical results in restricted settings have shown that for selecting a good classifier from a set of classifiers (model selection), ten-fold cross-validation may be better than the more expensive leave-one-out cross-validation. We report on a large-scale experiment, over half a million runs of C4.5 and a Naive-Bayes algorithm, to estimate the effects of different parameters on these algorithms on real-world datasets. For cross-validation, we vary the number of folds and whether the folds are stratified or not; for bootstrap, we vary the number of bootstrap samples. Our results indicate that for real-world datasets similar to ours, the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds.
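A minimal sketch of the ten-fold stratified cross-validation recommended above, using scikit-learn; the breast-cancer toy dataset and the decision-tree classifier are stand-ins chosen for illustration, not the setup of the cited experiment.

```python
# Ten-fold stratified cross-validation on a stand-in dataset/classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("mean accuracy over 10 stratified folds:", scores.mean().round(3))
```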
Article
Context.—Current tumor markers for ovarian cancer still lack adequate sensitivity and specificity to be applicable in large populations. High-throughput proteomic profiling and bioinformatics tools allow for the rapid screening of a large number of potential biomarkers in serum, plasma, or other body fluids. Objective.—To determine whether protein profiles of plasma can be used to identify potential biomarkers that improve the detection of ovarian cancer. Design.—We analyzed plasma samples that had been collected between 1998 and 2001 from patients with sporadic ovarian serous neoplasms before tumor resection at various International Federation of Gynecology and Obstetrics stages (stage I [n = 11], stage II [n = 3], and stage III [n = 29]) and from women without known neoplastic disease (n = 38) using proteomic profiling and bioinformatics. We compared results between the patients with and without cancer and evaluated their discriminatory performance against that of the cancer antigen 125 (CA125) tumor marker. Results.—We selected 7 biomarkers based on their collective contribution to the separation of the 2 patient groups. Among them, we further purified and subsequently identified 3 biomarkers. Individually, the biomarkers did not perform better than CA125. However, a combination of 4 of the biomarkers significantly improved performance (P ≤ .001). The new biomarkers were complementary to CA125. At a fixed specificity of 94%, an index combining 2 of the biomarkers and CA125 achieves a sensitivity of 94% (95% confidence interval, 85%–100.0%) in contrast to a sensitivity of 81% (95% confidence interval, 68%–95%) for CA125 alone. Conclusions.—The combined use of bioinformatics tools and proteomic profiling provides an effective approach to screen for potential tumor markers. Comparison of plasma profiles from patients with and without known ovarian cancer uncovered a panel of potential biomarkers for detection of ovarian cancer with discriminatory power complementary to that of CA125. Additional studies are required to further validate these biomarkers.
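How a sensitivity at a fixed specificity (94% in the abstract above) can be read off a ROC curve is sketched below; the simulated scores, group sizes, and class separation are placeholders, not data from the cited study.

```python
# Sketch: sensitivity at a fixed specificity from a ROC curve (synthetic data).
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
y_true = np.concatenate([np.zeros(38), np.ones(43)])            # controls, cases
scores = np.concatenate([rng.normal(0, 1, 38), rng.normal(1.5, 1, 43)])

fpr, tpr, _ = roc_curve(y_true, scores)
target_specificity = 0.94
ok = (1 - fpr) >= target_specificity                             # thresholds meeting the target
print("sensitivity at >=94% specificity:", tpr[ok].max().round(3))
```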
Chapter
This chapter discusses nonlinear system identification with neurofuzzy methods. In a general part, a summary and overview of the most important types of fuzzy models is given, and their properties, advantages, and drawbacks are illustrated. In a more specific part, a new algorithm for the construction of Takagi-Sugeno fuzzy systems is presented in detail and successfully applied to the identification of two nonlinear dynamic real-world processes.
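A minimal sketch of first-order Takagi-Sugeno inference for a scalar input is given below: Gaussian membership functions weight local linear models, and the output is their normalized weighted sum. The rule centres, widths, and local model coefficients are arbitrary illustrative values, not those produced by the chapter's construction algorithm.

```python
# First-order Takagi-Sugeno inference for one input (illustrative parameters).
import numpy as np

centres = np.array([-1.0, 0.0, 1.0])       # rule premise centres
sigmas  = np.array([0.5, 0.5, 0.5])        # membership widths
coeffs  = np.array([[0.0, -1.0],            # local linear models y_i = a_i*x + b_i
                    [1.0,  0.0],
                    [0.0,  1.0]])

def ts_output(x):
    """Takagi-Sugeno output: activation-weighted average of local models."""
    w = np.exp(-0.5 * ((x - centres) / sigmas) ** 2)   # rule activations
    y_local = coeffs[:, 0] * x + coeffs[:, 1]          # local model outputs
    return np.sum(w * y_local) / np.sum(w)             # normalized weighted sum

print(ts_output(0.3))
```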
Article
Evolution has provided a source of inspiration for algorithm designers since the birth of computers. The resulting field, evolutionary computation, has been successful in solving engineering tasks ranging in outlook from the molecular to the astronomical. Today, the field is entering a new phase as evolutionary algorithms that take place in hardware are developed, opening up new avenues towards autonomous machines that can adapt to their environment. We discuss how evolutionary computation compares with natural evolution and what its benefits are relative to other computing approaches, and we introduce the emerging area of artificial evolution in physical systems.
Book
The book covers the most common and important approaches for the identification of nonlinear static and dynamic systems. Additionally, it provides the reader with the necessary background on optimization techniques, making the book self-contained. The emphasis is put on modern methods based on neural networks and fuzzy systems without neglecting the classical approaches. The entire book is written from an engineering point of view, focusing on an intuitive understanding of the basic relationships; this is supported by many illustrative figures. Advanced mathematics is avoided. Thus, the book is suitable for final-year undergraduate and graduate courses as well as for research and development engineers in industry. The new edition includes exercises.
Article
Background: TPS (Tissue Polypeptide Specific Antigen) is defined by a monoclonal antibody against an epitope of the soluble Tissue Polypeptide Antigen (TPA). It is considered to be a tumor marker indicating cellular proliferation, not just tumor load. Serial values of this unique marker during tumor therapy might, therefore, be more sensitive and earlier signs of tumor response than conventional tumor markers. Patients and Methods: Serial values of TPS and TPA were determined in 50 consecutive tumor patients and compared with simultaneously measured conventional tumor markers. Evolution of TPS and TPA values over time was compared with the clinical response to tumor therapy. Results: Distribution and levels of both TPS and TPA were similar, and a significant correlation was found between them (R = 0.877, p = 0.0001). In 30 patients the value of TPS was higher than that of TPA, and in 20 cases TPS values were lower than TPA values. No significant correlation to conventional tumor markers was found. Values of both TPS and TPA were lower in patients with lymphomas than in patients with gastrointestinal cancers, breast or lung cancers. Although we have some evidence that in some cases a response may be detected somewhat earlier by TPS than by clinical criteria or by conventional tumor markers, the ability of TPS to reliably predict a clinical response was generally poor. Even frankly discordant evolutions of TPS, TPA and clinical response were seen. Conclusions: Based on our results, the serial analysis of TPS and TPA for use in the decision making process during follow-up of unselected cancer patients cannot be recommended. The actual clinical value of determining TPS remains to be demonstrated.
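The kind of correlation analysis reported above (Pearson's R between serial TPS and TPA values) can be reproduced in principle as follows; the synthetic values below are placeholders, not patient data.

```python
# Sketch: Pearson correlation between two tumor-marker series (synthetic data).
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
tpa = rng.lognormal(mean=4.0, sigma=0.8, size=50)     # simulated TPA values
tps = tpa * rng.normal(1.1, 0.3, size=50)             # correlated simulated TPS values
r, p = pearsonr(tps, tpa)
print(f"R = {r:.3f}, p = {p:.4g}")
```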