Conference Paper

Analysis of Selected Evolutionary Algorithms in Feature Selection and Parameter Optimization for Data Based Tumor Marker Modeling


Abstract

In this paper we report on the use of evolutionary algorithms for optimizing the identification of classification models for selected tumor markers. Our goal is to identify mathematical models that can be used for classifying tumor marker values as normal or as elevated; evolutionary algorithms are used for optimizing the parameters for learning classification models. The sets of variables used as well as the parameter settings for concrete modeling methods are optimized using evolution strategies and genetic algorithms. The performance of these algorithms is analyzed as well as the population diversity progress. In the empirical part of this paper we document modeling results achieved for tumor markers CA 125 and CYFRA using a medical data base provided by the Central Laboratory of the General Hospital Linz; empirical tests are executed using HeuristicLab.


... Castillo et al. [17,18] used ant colony optimization (ACO) to adjust different membership functions of complex fuzzy controllers. Winkler et al. [19] used different evolutionary strategies to perform FS and to optimize linear models, k-nearest neighbors (k-NN), ANNs and SVM with the final purpose of identifying tumor markers. Sanz-García et al. [20] proposed a GA-based optimization method to create better overall parsimonious ANNs for predicting set points in a steel annealing furnace. ...
Article
Most proposed metaheuristics for feature selection and model parameter optimization are based on a two-termed function. Their main drawback is the need to manually set the parameter that balances the loss and the penalty term. In this paper, a novel methodology, referred to as GA-PARSIMONY and specifically designed to overcome this issue, is evaluated in detail on thirteen public databases with five regression techniques. It is a GA-based metaheuristic that splits the classic two-termed minimization function by making two consecutive ranks of individuals. The first rank is based solely on the generalization error, while the second (named ReRank) is based on the complexity of the models, giving special weight to the complexity entailed by a large number of inputs.
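The two consecutive ranks described above could be sketched as follows; the tolerance `tol` and the `(error, complexity)` tuple representation are illustrative assumptions for demonstration, not the paper's exact procedure.

```python
# Hypothetical sketch: first rank individuals by generalization error,
# then re-rank groups of near-tied individuals by model complexity.
# `tol` (assumed) defines when two errors count as "similar".
def rerank(individuals, tol=0.01):
    """individuals: list of (error, complexity) tuples."""
    # First rank: sort solely by generalization error.
    ranked = sorted(individuals, key=lambda ind: ind[0])
    # Second rank (ReRank): within groups whose errors differ by less
    # than `tol`, prefer the less complex model.
    i = 0
    while i < len(ranked) - 1:
        j = i + 1
        while j < len(ranked) and ranked[j][0] - ranked[i][0] < tol:
            j += 1
        ranked[i:j] = sorted(ranked[i:j], key=lambda ind: ind[1])
        i = j
    return ranked
```

With this scheme the user-supplied penalty weight of a two-termed fitness function is no longer needed, since error and complexity are evaluated in separate passes.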
... Among the different existing methods to tackle this issue, soft computing (SC) seems to be an effective approach to reduce computational costs [22,33,4,7]. In this context, there is an increasing number of studies reporting SC strategies that combine FS and HO applied to multiple fields [15,9,31,14,8,32,5,24]. New libraries are emerging to perform HO with Bayesian Optimization (BO), like Hyperopt [2] in Python, or mlr [3] and rBayesianOptimization in R. Besides, there exist other tools that are focused on the optimization of further KDD stages such as algorithm selection (AS), data transformation (DT), dimensional reduction (DR), model selection (MS) or feature construction (FC). ...
Conference Paper
This paper presents a hybrid methodology that combines Bayesian Optimization (BO) with a constrained version of the GA-PARSIMONY method to obtain parsimonious models. The proposal is designed to reduce the computational effort associated with the use of GA-PARSIMONY alone. The method is initialized with BO to obtain favorable initial model parameters. With these parameters, a constrained GA-PARSIMONY is run to generate accurate parsimonious models using feature reduction, data transformation and parsimonious model selection. Finally, a second BO is run with the selected features. Experiments with eXtreme Gradient Boosting Machines (XGBoost) and six UCI databases demonstrate that the hybrid methodology obtains models analogous to those of GA-PARSIMONY alone, but with a significant reduction in execution time in five of the six datasets.
... Ding [9] uses particle swarm optimization for simultaneously selecting the best spectral band and optimizing SVM parameters in hyperspectral classification of remote sensing images. Winkler et al. [23] report different evolutionary strategies to select inputs in order to optimize linear models, k-nearest neighbors (k-NN), artificial neural networks (ANN) or SVM. Their objective is to select the best models capable of identifying tumor markers. ...
Chapter
This paper presents a performance comparison of the GA-PARSIMONY methodology with five well-known regression algorithms and with different genetic algorithm (GA) configurations. This approach is mainly based on combining GA and feature selection (FS) during the model tuning process to achieve better overall parsimonious models that assure good generalization capacity. For this purpose, individuals, already sorted by their fitness function, are rearranged in each iteration depending on the model complexity. The main objective is to analyze the overall model performance achieved with this methodology for each regression algorithm against different real databases while varying the GA setting parameters. Our preliminary results show that two algorithms, multilayer perceptron (MLP) with the Broyden-Fletcher-Goldfarb-Shanno training method and support vector machines for regression (SVR) with a radial basis function kernel, perform better with similar feature reduction when the database has a low number of input attributes (≲32) and low GA population sizes are used.
Preprint
Full-text available
Feature selection is the process of identifying the statistically most relevant features to improve the predictive capabilities of classifiers. To find the best feature subsets, population-based approaches like Particle Swarm Optimization (PSO) and genetic algorithms are widely employed. However, it is a general observation that not having the right set of particles in the swarm may result in sub-optimal solutions, affecting the accuracies of classifiers. To address this issue, we propose a novel tunable swarm size approach to reconfigure the particles in a standard PSO, based on the data sets, in real time. The proposed algorithm is named the Tunable Particle Swarm Size Optimization Algorithm (TPSO). It is a wrapper-based approach wherein an Alternating Decision Tree (ADT) classifier is used for identifying an influential feature subset, which is further evaluated by a new objective function that integrates Classification Accuracy (CA) with a modified F-score to ensure better classification accuracy over varying population sizes. Experimental studies on benchmark data sets and the Wilcoxon statistical test have shown that the proposed algorithm (TPSO) is efficient in identifying optimal feature subsets that improve the classification accuracies of base classifiers in comparison to their standalone form.
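A wrapper-style objective that blends classification accuracy with an F-score term, in the spirit of the abstract above, could look like the following sketch; the blending weight `alpha` and the function names are assumptions made for illustration, not TPSO's exact formulation.

```python
# Illustrative fitness for wrapper-based feature selection: blend
# overall accuracy with the F-score of the positive class so that
# imbalanced datasets are not scored by accuracy alone.
def f_score(tp, fp, fn):
    """Standard F1 computed from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def objective(accuracy, tp, fp, fn, alpha=0.7):
    """Higher is better; `alpha` (assumed) weights accuracy vs F-score."""
    return alpha * accuracy + (1 - alpha) * f_score(tp, fp, fn)
```

A feature subset would be scored by training the wrapped classifier on that subset, collecting the confusion counts, and maximizing `objective`.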
Article
This article presents a hybrid methodology in which a KDD scheme is optimized to build accurate parsimonious models. The methodology tries to find the best model by using genetic algorithms to optimize a KDD scheme formed by the following stages: feature selection, transformation of the skewed input and output data, parameter tuning, and parsimonious model selection. The results obtained demonstrated that the optimization of these steps significantly improved the model generalization capabilities on some UCI databases. Finally, this methodology was applied to create parsimonious room-demand models using booking databases from a hotel located in a region of Northern Spain. The results proved that the proposed method created models with higher generalization capacity and lower complexity compared to those obtained with the classical KDD process.
Conference Paper
eXtreme Gradient Boosting (XGBoost) has become one of the most successful techniques in machine learning competitions. It is computationally efficient and scalable, supports a wide variety of objective functions, and includes different mechanisms to avoid over-fitting and improve accuracy. With so many tuning parameters, soft computing (SC) is an alternative to classical hyper-parameter tuning methods for searching for precise and robust models. In this context, we present a preliminary study in which an SC methodology, named GA-PARSIMONY, is used to find accurate and parsimonious XGBoost solutions. The methodology was designed to optimize the search for parsimonious models by feature selection, parameter tuning and model selection. In this work, different experiments are conducted with four complexity metrics in six high-dimensional datasets. Although XGBoost performs well with high-dimensional databases, preliminary results indicated that GA-PARSIMONY with feature selection slightly improved the testing error. Therefore, the choice of solutions with fewer inputs, among those with similar cross-validation errors, can help to obtain more robust solutions with better generalization capabilities.
Chapter
A distinguishing feature of symbolic regression using genetic programming is its ability to identify complex nonlinear white-box models. This is especially relevant in practice, where models are extensively scrutinized in order to gain knowledge about underlying processes. This potential is often diluted by the ambiguity and complexity of the models produced by genetic programming. In this contribution we discuss several analysis methods with the common goal of enabling better insights into the symbolic regression process and producing models that are more understandable and show better generalization. In order to gain more information about the process, we monitor and analyze the progress of population diversity, building block information, and even more general genealogy information. Regarding the analysis of results, several aspects such as model simplification, relevance of variables, node impacts, and variable network analysis are presented and discussed.
Article
This article proposes a new genetic algorithm (GA) methodology to obtain parsimonious support vector regression (SVR) models capable of predicting highly precise setpoints in a continuous annealing furnace (GA-PARSIMONY). The proposal combines feature selection, model tuning, and parsimonious model selection in order to achieve robust SVR models. To this end, a novel GA selection procedure is introduced based on separate cost and complexity evaluations. The best individuals are initially sorted by an error fitness function, and afterwards, models with similar costs are rearranged according to a model complexity measurement so as to foster models of lesser complexity. Therefore, the user-supplied penalty parameter, utilized to balance cost and complexity in other fitness functions, is rendered unnecessary. GA-PARSIMONY performed similarly to the classical GA on twenty benchmark datasets from public repositories, but used a lower number of features in a striking 65% of models. Moreover, our proposal also proved useful in a real industrial process, predicting three temperature setpoints for a continuous annealing furnace. The results demonstrated that GA-PARSIMONY was able to generate more robust SVR models with fewer input features, as compared to the classical GA.
Conference Paper
Full-text available
We review accuracy estimation methods and compare the two most common methods: cross-validation and bootstrap. Recent experimental results on artificial data and theoretical results in restricted settings have shown that for selecting a good classifier from a set of classifiers (model selection), ten-fold cross-validation may be better than the more expensive leave-one-out cross-validation. We report on a large-scale experiment (over half a million runs of C4.5 and a Naive-Bayes algorithm) to estimate the effects of different parameters on these algorithms on real-world datasets. For cross-validation, we vary the number of folds and whether the folds are stratified or not; for bootstrap, we vary the number of bootstrap samples. Our results indicate that for real-world datasets similar to ours, the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds.
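The stratified fold assignment underlying ten-fold stratified cross-validation can be sketched in a few lines; this is a simplified illustration (round-robin assignment per class) under assumed conventions, not Kohavi's exact implementation.

```python
from collections import defaultdict

# Minimal sketch of stratified k-fold assignment: each class's samples
# are spread round-robin across the folds so that every fold roughly
# preserves the class proportions of the full dataset.
def stratified_folds(labels, k=10):
    """Return k folds, each a list of sample indices."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds
```

Each fold then serves once as the test set while the remaining k-1 folds form the training set, and the k error estimates are averaged.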
Conference Paper
Full-text available
In this paper we analyse a new evolutionary approach to the vehicle routing problem. We present Genetic Vehicle Representation (GVR), a two-level representational scheme designed to deal in an effective way with all the information that candidate solutions must encode. Experimental results show that this method is both effective and robust, allowing the discovery of new best solutions for some well-known benchmarks.
Conference Paper
Full-text available
Cellular Genetic Algorithms (cGAs) are a subclass of Genetic Algorithms (GAs) in which population diversity and exploration are enhanced thanks to the existence of small overlapped neighborhoods. This kind of structured algorithm is especially well suited for complex problems. In this paper we propose the utilization of some cGAs, with and without local search techniques, for solving the vehicle routing problem (VRP). A study of the behavior of these algorithms has been performed in terms of the quality of the solutions found, execution time, and number of function evaluations (effort). We have selected the benchmark of Christofides, Mingozzi and Toth for testing the proposed cGAs, and compare them with some other heuristics in the literature. Our conclusions are that cGAs with an added local search operator are able to always locate the optimum of the problem in low times and with reasonable effort for the tested instances.
Book
Full-text available
Genetic Algorithms and Genetic Programming: Modern Concepts and Practical Applications discusses algorithmic developments in the context of genetic algorithms (GAs) and genetic programming (GP). It applies the algorithms to significant combinatorial optimization problems and describes structure identification using HeuristicLab as a platform for algorithm development. The book focuses on both theoretical and empirical aspects. The theoretical sections explore the important and characteristic properties of the basic GA as well as main characteristics of the selected algorithmic extensions developed by the authors. In the empirical parts of the text, the authors apply GAs to two combinatorial optimization problems: the traveling salesman and capacitated vehicle routing problems. To highlight the properties of the algorithmic measures in the field of GP, they analyze GP-based nonlinear structure identification applied to time series and classification problems. Written by core members of the HeuristicLab team, this book provides a better understanding of the basic workflow of GAs and GP, encouraging readers to establish new bionic, problem-independent theoretical concepts. By comparing the results of standard GA and GP implementation with several algorithmic extensions, it also shows how to substantially increase achievable solution quality.
Article
Full-text available
The aim was to investigate the diagnostic utility of CYFRA 21-1 (cytokeratin 19 fragment) as a tumor marker in pleural effusion and to evaluate the value of combining CYFRA 21-1 and carcinoembryonic antigen (CEA) assays as a diagnostic aid in malignant pleural effusion. One hundred and twenty-six patients (72 malignant and 54 benign pleural effusions) were included in this retrospective study. The effusion levels of CYFRA 21-1 and CEA were measured using a radioimmunometric assay. The median values of CYFRA 21-1 in benign and malignant pleural effusion were 15 and 70 ng/ml, respectively. Using a cut-off value of 50 ng/ml, defined at 94% specificity, the diagnostic sensitivity of CYFRA 21-1 for non-small cell lung carcinoma (n = 61), squamous cell carcinoma (n = 21), adenocarcinoma (n = 40) and small cell lung cancer (n = 11) was 64, 71, 60 and 18%, respectively. Regardless of cell type, the diagnostic sensitivity of CYFRA 21-1 and CEA in malignant pleural effusion (n = 72) was 57 and 60%, respectively (cut-off value of 10 ng/ml in the CEA assay). Combining CEA with CYFRA 21-1, the diagnostic sensitivity may increase up to 72%, defined at 89% specificity. The CYFRA 21-1 assay may be a useful tumor marker for discriminating benign from malignant pleural effusion, especially in cases of non-small cell lung cancer. The combined use of the CEA and CYFRA 21-1 assays in malignant effusion may increase the diagnostic yield compared with CEA or CYFRA 21-1 alone.
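Sensitivity and specificity at a fixed cut-off, as reported above, follow directly from the confusion counts; the marker values used in the test below are made up for illustration and are not from the study.

```python
# Illustrative computation of sensitivity and specificity for a tumor
# marker at a given cut-off value; elevated values (>= cutoff) are
# treated as positive test results.
def sens_spec(malignant_values, benign_values, cutoff):
    tp = sum(v >= cutoff for v in malignant_values)  # true positives
    fn = len(malignant_values) - tp                  # missed malignancies
    tn = sum(v < cutoff for v in benign_values)      # true negatives
    fp = len(benign_values) - tn                     # false alarms
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity
```

Raising the cut-off trades sensitivity for specificity, which is why the abstract reports sensitivity at a fixed specificity level.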
Conference Paper
Full-text available
In this work we compare the use of particle swarm optimization (PSO) and a genetic algorithm (GA) (both augmented with support vector machines, SVM) for the classification of high-dimensional microarray data. Both algorithms are used for finding small samples of informative genes amongst thousands of them. An SVM classifier with 10-fold cross-validation is applied in order to validate and evaluate the provided solutions. A first contribution is to show that PSOsvm is able to find interesting genes and to provide competitive classification performance. Specifically, a new version of PSO, called Geometric PSO, is empirically evaluated for the first time in this work using a binary representation in Hamming space. In this sense, a comparison of this approach with a new GAsvm and also with other existing methods from the literature is provided. A second important contribution consists in the actual discovery of new and challenging results on six public datasets, identifying genes significant in the development of a variety of cancers (leukemia, breast, colon, ovarian, prostate, and lung).
Book
The book covers the most common and important approaches for the identification of nonlinear static and dynamic systems. Additionally, it provides the reader with the necessary background on optimization techniques, making the book self-contained. The emphasis is put on modern methods based on neural networks and fuzzy systems without neglecting the classical approaches. The entire book is written from an engineering point of view, focusing on the intuitive understanding of the basic relationships. This is supported by many illustrative figures. Advanced mathematics is avoided. Thus, the book is suitable for last-year undergraduate and graduate courses as well as research and development engineers in industry. The new edition includes exercises.
Conference Paper
In this paper we describe the use of evolutionary algorithms for the selection of relevant features in the context of tumor marker modeling. Our aim is to identify mathematical models for classifying tumor marker values AFP and CA 15-3 using available patient parameters; data provided by the General Hospital Linz are used. The use of evolutionary algorithms for finding optimal sets of variables is discussed; we also define fitness functions that can be used for evaluating feature sets taking into account the number of selected features as well as the resulting classification accuracies. In the empirical section of this paper we document results achieved using an evolution strategy in combination with several machine learning algorithms (linear regression, k-nearest-neighbor modeling, and artificial neural networks) which are applied using cross-validation for evaluating sets of selected features. The identified sets of relevant variables as well as achieved classification rates are compared.
Article
LIBSVM is a library for support vector machines (SVM). Its goal is to help users to easily use SVM as a tool. In this document, we present all its implementation details. For the use of LIBSVM, the README file included in the package and the LIBSVM FAQ provide the information.
Article
Serum assays based on the CA125 antigen are widely used in the monitoring of patients with ovarian cancer; however very little is known about the molecular nature of the CA125 antigen. We recently cloned a partial cDNA (designated MUC16) that codes for a new mucin that is a strong candidate for being the CA125 antigen. This assignment has now been confirmed by transfecting a partial MUC16 cDNA into 2 CA125-negative cell lines and demonstrating the synthesis of CA125 by 3 different assays. Of the 3 antibodies (OC125, M11 and VK-8) tested on the transfected cells, only the first 2 were strongly positive, indicating the differential expression of the CA125 epitopes in these cells. The cloning and expression of CA125 antigen opens the way to an understanding of its function in normal and malignant cells. © 2002 Wiley-Liss, Inc.
Chapter
The sections in this article are: (1) The Problem; (2) Background and Literature; (3) Outline; (4) Displaying the Basic Ideas: ARX Models and the Linear Least Squares Method; (5) Model Structures I: Linear Models; (6) Model Structures II: Nonlinear Black-Box Models; (7) General Parameter Estimation Techniques; (8) Special Estimation Techniques for Linear Black-Box Models; (9) Data Quality; (10) Model Validation and Model Selection; (11) Back to Data: The Practical Side of Identification.
Conference Paper
Tumor markers are substances that are found in blood, urine, or body tissues and that are used as indicators for tumors; elevated tumor marker values can indicate the presence of cancer, but there can also be other causes. We have used a medical database compiled at the blood laboratory of the General Hospital Linz, Austria: Several blood values of thousands of patients are available as well as several tumor markers. We have used several data based modeling approaches for identifying mathematical models for estimating selected tumor marker values on the basis of routinely available blood values; in detail, estimators for the tumor markers AFP, CA-125, CA15-3, CEA, CYFRA, and PSA have been identified and are analyzed in this paper. The documented tumor marker values are classified as "normal" or "elevated"; our goal is to design classifiers for the respective binary classification problems. As we show in the results section, for those medical modeling tasks described here, genetic programming performs best among those techniques that are able to identify nonlinearities; we also see that GP results show less overfitting than those produced using other methods.
Article
To evaluate the relationship between serum CA125 tumour marker levels before and after surgery for epithelial ovarian carcinoma and to assess their potential role as a prognostic factor. A retrospective review of 87 patients with epithelial ovarian carcinoma treated at a single centre between January 2001 and December 2005 was performed. Serum CA125 levels were assessed for their relationship to pathological stage, tumour grade, tumour volume and age, as well as overall survival. A total of 75 patients, with a mean age of 58.94 years and a median follow-up of 24 months, were included in the analysis. While the preoperative CA125 level did not correlate significantly with stage, tumour grade or survival, the postoperative CA125 correlated with FIGO stage (p<0.0001), tumour grade (p<0.0001) and overall survival (p=0.01). Reduced survival was noted with increasing age at the time of surgery (p=0.009) and bulk of the residual disease postoperatively (p=0.011).
Article
A review of the status of standardization of laboratory tests of particular interest to oncologists is presented. Currently, relatively few of these tests are standardized; as a result, interlaboratory and interinstitutional comparison of data is problematic. In 1992, additional interlaboratory studies of common tumor markers will be initiated by the College of American Pathologists. The National Committee for Clinical Laboratory Standards also has begun to develop standard methods and guidelines for these important tests.
Article
The analysis of tumour markers is based on the evaluation of data in relation to defined cut-off values. Changes in the method of determination or in the reference study group can lead to different results. Cut-off-independent diagnostic evaluation of laboratory parameters can avoid laboratory-based and method-derived systematic errors. The decision guarantee (DG) is an appropriate parameter that can be determined using a defined reference population and its respective receiver operating characteristic (ROC) curve. The influence of ROC differences on the determination of the DG is examined. A group of 281 consecutive patients with newly diagnosed, histologically confirmed lung cancer and a control group of 231 patients were examined. Histological classification of the tumour cases identified 59 small-cell carcinomas, 102 squamous cell carcinomas, 66 adenocarcinomas and 54 large-cell or mixed bronchial carcinomas without further classification. The control group without tumours consisted of 23 healthy subjects, 125 patients with silicosis or asbestosis, 27 with chronic obstructive pulmonary disease (COPD) and 56 suffering from inflammatory lung diseases. Cytokeratin-19 fragments (CYFRA 21-1) constituted the most sensitive marker, with a sensitivity of 57.3% and a specificity of 94.9%. Sensitivity and specificity influence each other. Based on the ROC curve, the method described here supported the diagnosis of lung cancer from the collected data in comparison with a reference population. Thus, it was possible to determine with statistical certainty whether the evaluation of the sample data would lead to a diagnosis of lung cancer. The DG provides the basis for laboratory- and method-independent support for a diagnosis, including fairer information about the reference population in the data analysis.
Heuristic Optimization Software Systems - Modeling of Heuristic Optimization Algorithms in the HeuristicLab Software Environment
  • S Wagner