Conference Paper

Feature Selection in the Analysis of Tumor Marker Data Using Evolutionary Algorithms

Authors:

Abstract

In this paper we describe the use of evolutionary algorithms for the selection of relevant features in the context of tumor marker modeling. Our aim is to identify mathematical models for classifying tumor marker values AFP and CA 15-3 using available patient parameters; data provided by the General Hospital Linz are used. The use of evolutionary algorithms for finding optimal sets of variables is discussed; we also define fitness functions that can be used for evaluating feature sets taking into account the number of selected features as well as the resulting classification accuracies. In the empirical section of this paper we document results achieved using an evolution strategy in combination with several machine learning algorithms (linear regression, k-nearest-neighbor modeling, and artificial neural networks) which are applied using cross-validation for evaluating sets of selected features. The identified sets of relevant variables as well as achieved classification rates are compared.
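The idea of a fitness function that weighs classification accuracy against the number of selected features can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weighting factor `alpha`, the (mu + lambda) selection scheme, the bit-flip mutation, the k-NN classifier, and the synthetic data are all assumptions made for the example.

```python
import random

def knn_predict(train_X, train_y, x, features, k=3):
    """Majority vote among the k nearest training samples, using only `features`."""
    dists = sorted(
        (sum((row[f] - x[f]) ** 2 for f in features), label)
        for row, label in zip(train_X, train_y)
    )
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)

def cv_accuracy(X, y, features, folds=5):
    """Plain k-fold cross-validated accuracy of k-NN on the selected features."""
    if not features:
        return 0.0
    correct = 0
    for fold in range(folds):
        test_idx = [i for i in range(len(X)) if i % folds == fold]
        train_idx = [i for i in range(len(X)) if i % folds != fold]
        tX = [X[i] for i in train_idx]
        ty = [y[i] for i in train_idx]
        correct += sum(knn_predict(tX, ty, X[i], features) == y[i] for i in test_idx)
    return correct / len(X)

def fitness(mask, X, y, alpha=0.1):
    """Lower is better: CV error plus a penalty on the share of selected features.
    The additive form and alpha are assumptions, not taken from the paper."""
    features = [i for i, bit in enumerate(mask) if bit]
    error = 1.0 - cv_accuracy(X, y, features)
    return error + alpha * len(features) / len(mask)

def evolve(X, y, n_features, mu=4, lam=12, generations=10, seed=0):
    """(mu + lambda) evolution strategy over bit masks with bit-flip mutation."""
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(n_features)] for _ in range(mu)]
    for _ in range(generations):
        offspring = []
        for _ in range(lam):
            parent = rng.choice(pop)
            # Flip each bit independently with probability 1 / n_features.
            child = [bit != (rng.random() < 1.0 / n_features) for bit in parent]
            offspring.append(child)
        # Elitist survivor selection: keep the mu best of parents and offspring.
        pop = sorted(pop + offspring, key=lambda m: fitness(m, X, y))[:mu]
    return pop[0]

# Synthetic stand-in data: only features 0 and 1 carry class information.
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(6)] for _ in range(40)]
y = [int(row[0] + row[1] > 1.0) for row in X]
best = evolve(X, y, n_features=6)
print("selected features:", [i for i, bit in enumerate(best) if bit])
```

Raising `alpha` in this sketch favors smaller feature sets at the cost of accuracy; any other classifier could replace the k-NN model inside `cv_accuracy`.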


... In [3], for example, the use of evolutionary algorithms for feature selection optimization is discussed in detail in the context of gene selection in cancer classification; in [27] we have analyzed the sets of features identified as relevant for modeling tumor markers AFP and CA15-3. ...
Conference Paper
In this paper we discuss the effects of using pre-clustered data on the identification of estimation models for cancer diagnoses. Based on patients' data records including standard blood parameters, tumor markers, and information about the diagnosis of tumors, the goal is to identify mathematical models for estimating cancer diagnoses. We have applied a hybrid clustering and classification approach that first identifies data clusters (using standard patient data and tumor markers) and then learns prediction models on the basis of these data clusters. In the empirical section we analyze the clusters of patient data samples formed using k-means clustering: The optimal number of clusters is identified, and we investigate the homogeneity of these clusters. Several evolutionary modeling approaches implemented in HeuristicLab have been applied for subsequently identifying estimators for selected cancer diagnoses: Linear regression, k-nearest neighbor learning, artificial neural networks, and support vector machines (all optimized using evolutionary algorithms) as well as genetic programming. As we show in the results section, the investigated diagnoses of breast cancer, melanoma, and respiratory system cancer can be estimated correctly in up to 84.2%, 80.3%, and 94.1% of the analyzed test cases, respectively; without tumor markers up to 78.2%, 78%, and 93.3% of the test samples are correctly estimated, respectively.
Chapter
In this chapter we present results of empirical research work done on the data based identification of estimation models for tumor markers and cancer diagnoses: Based on patients’ data records including standard blood parameters, tumor markers, and information about the diagnosis of tumors we have trained mathematical models that represent virtual tumor markers and predictors for cancer diagnoses, respectively. We have used a medical database compiled at the Central Laboratory of the General Hospital Linz, Austria, and applied several data based modeling approaches for identifying mathematical models for estimating selected tumor marker values on the basis of routinely available blood values; in detail, estimators for the tumor markers AFP, CA-125, CA15-3, CEA, CYFRA, and PSA have been identified and are discussed here. Furthermore, several data based modeling approaches implemented in HeuristicLab have been applied for identifying estimators for selected cancer diagnoses: Linear regression, k-nearest neighbor learning, artificial neural networks, and support vector machines (all optimized using evolutionary algorithms) as well as genetic programming. The investigated diagnoses of breast cancer, melanoma, and respiratory system cancer can be estimated correctly in up to 81%, 74%, and 91% of the analyzed test cases, respectively; without tumor markers up to 75%, 74%, and 87% of the test samples are correctly estimated, respectively.
Chapter
In this chapter we present a novel method for scoring function specification and feature selection by combining unsupervised learning with supervised cross validation. Various clustering algorithms such as one dimensional Kohonen SOM, k-means, fuzzy c-means and hierarchical clustering procedures are used to perform a clustering of object-data for a chosen subset of input features and a given number of clusters. The resulting object clusters are compared with the predefined target classes and a matching factor (score) is calculated. This score is used as criterion function for heuristic sequential and cross feature selection.
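A rough sketch of how a clustering-based score might drive sequential feature selection is given below. The chapter's actual matching factor is not specified here, so cluster purity is used as a stand-in; the tiny k-means routine, its deterministic initialization, and the function names are all assumptions for illustration, and only the sequential (forward) variant is shown.

```python
from collections import Counter

def kmeans(points, k, iters=20):
    """Tiny Lloyd's k-means; centers start at k evenly spaced samples (deterministic)."""
    step = max(1, len(points) // k)
    centers = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster runs empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def matching_score(cluster_labels, classes):
    """Purity: share of objects whose class equals the majority class of their cluster."""
    matched = 0
    for c in set(cluster_labels):
        members = [cls for l, cls in zip(cluster_labels, classes) if l == c]
        matched += Counter(members).most_common(1)[0][1]
    return matched / len(classes)

def forward_selection(X, classes, k):
    """Greedy sequential forward selection driven by the clustering score."""
    selected, best = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        scored = []
        for f in remaining:
            cols = selected + [f]
            pts = [[row[c] for c in cols] for row in X]
            scored.append((matching_score(kmeans(pts, k), classes), f))
        score, f = max(scored)
        if score <= best:  # stop when no candidate improves the score
            break
        selected.append(f)
        best = score
        remaining.remove(f)
    return selected, best

# Toy data: feature 0 separates the classes, feature 1 is constant.
X = [[0.0, 5.0], [0.1, 5.0], [0.2, 5.0], [10.0, 5.0], [10.1, 5.0], [10.2, 5.0]]
classes = ["a", "a", "a", "b", "b", "b"]
print(forward_selection(X, classes, k=2))
```

Because the score only compares cluster assignments with the target classes, any of the clustering procedures named above could be substituted for `kmeans` without changing the selection loop.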
Article
In this paper, we present an ensemble modeling approach for sentiment analysis using machine learning algorithms. The main goal of sentiment analysis is to develop estimators that are able to identify the sentiment orientation (positive, negative, or neutral) of sentences found in any arbitrary source. The novel approach presented here relies on the analysis of the words found in sentences and the formation of large sets of heterogeneous models, i.e., binary as well as multi-class classification models that are calculated by various machine learning methods; these models shall represent the relationship between the presence of given words (or combinations of words) and sentiments. All models trained during the learning phase are applied during the test phase, and the final sentiment assessment is annotated with a confidence value that specifies how reliable the models are regarding the presented decision. In the empirical part of this paper, we show results achieved using a German corpus of Amazon reviews and a set of machine learning methods (decision trees and adaptive boosting, Gaussian processes, random forests, k-nearest neighbor classification, support vector machines and artificial neural networks with evolutionary feature and parameter optimization, and genetic programming). Using a heterogeneous model ensemble learning approach that combines multi-class classifiers as well as binary classifiers, the classification accuracy can be increased significantly and the ratio of totally wrongly classified samples (i.e., those that are assigned to the completely opposite sentiment orientation) can be decreased significantly.
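One simple way to attach a confidence value to an ensemble decision is to report the share of models that agree with the winning label. The sketch below assumes plain majority voting; the article does not state how its confidence value is computed, so both the voting rule and the function name `ensemble_decision` are illustrative assumptions.

```python
from collections import Counter

def ensemble_decision(votes):
    """Return (label, confidence), where confidence is the winning label's vote share.
    Majority voting is an assumption, not necessarily the article's method."""
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label, n / len(votes)

# Hypothetical predictions of four models for one sentence.
votes = ["positive", "positive", "neutral", "negative"]
label, confidence = ensemble_decision(votes)
print(label, confidence)
```

A low vote share flags sentences on which the models disagree, which is where a completely opposite classification is most likely.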
Conference Paper
In this paper we report on the use of evolutionary algorithms for optimizing the identification of classification models for selected tumor markers. Our goal is to identify mathematical models that can be used for classifying tumor marker values as normal or as elevated; evolutionary algorithms are used for optimizing the parameters for learning classification models. The sets of variables used as well as the parameter settings for concrete modeling methods are optimized using evolution strategies and genetic algorithms. The performance of these algorithms and the progress of population diversity are analyzed. In the empirical part of this paper we document modeling results achieved for tumor markers CA 125 and CYFRA using a medical database provided by the Central Laboratory of the General Hospital Linz; empirical tests are executed using HeuristicLab.
Conference Paper
The paper presents the analysis of two different approaches for a system to support cancer diagnosis. The first one uses only tumor marker data containing missing values to predict cancer occurrence, and the second one also includes standard blood parameters. Both systems are based on several heterogeneous artificial neural networks for estimating missing values of tumor markers, and they finally calculate possibilities of different tumor diseases.
Conference Paper
In this paper we present results of empirical research work done on the data based identification of estimation models for cancer diagnoses: Based on patients' data records including standard blood parameters, tumor markers, and information about the diagnosis of tumors we have trained mathematical models for estimating cancer diagnoses. Several data based modeling approaches implemented in HeuristicLab have been applied for identifying estimators for selected cancer diagnoses: Linear regression, k-nearest neighbor learning, artificial neural networks, and support vector machines (all optimized using evolutionary algorithms) as well as genetic programming. The investigated diagnoses of breast cancer, melanoma, and respiratory system cancer can be estimated correctly in up to 81%, 74%, and 91% of the analyzed test cases, respectively; without tumor markers up to 75%, 74%, and 87% of the test samples are correctly estimated, respectively.
Article
Several lines of evidence point towards a biological role of mucin and particularly MUC1 in colorectal cancer. A positive correlation was described between mucin secretion, proliferation, invasiveness, metastasis and bad prognosis. However, the role of MUC1 in cancer progression is still controversial and somewhat confusing. While Mukherjee and colleagues developed MUC1-specific immune therapy in a CRC model, Lillehoj and co-investigators showed recently that MUC1 inhibits cell proliferation by a beta-catenin-dependent mechanism. In carcinoma cells the polarization of MUC1 is lost and the protein is overexpressed at high levels over the entire cell surface. A competitive interaction between MUC1 and E-cadherin, through beta-catenin binding, disrupts E-cadherin-mediated cell-cell interactions at sites of MUC1 expression. In addition, the complex of MUC1-beta-catenin enters the nucleus, activates T-cell factor/leukocyte enhancing factor 1 transcription factors, and activates gene expression. This mechanism may be similar to that just described for DCC and UNC5H, which induce apoptosis when not engaged with their ligand netrin, but mediate signals for proliferation, differentiation or migration when ligand bound.
Conference Paper
Tumor markers are substances that are found in blood, urine, or body tissues and that are used as indicators for tumors; elevated tumor marker values can indicate the presence of cancer, but there can also be other causes. We have used a medical database compiled at the blood laboratory of the General Hospital Linz, Austria: Several blood values of thousands of patients are available as well as several tumor markers. We have used several data based modeling approaches for identifying mathematical models for estimating selected tumor marker values on the basis of routinely available blood values; in detail, estimators for the tumor markers AFP, CA-125, CA15-3, CEA, CYFRA, and PSA have been identified and are analyzed in this paper. The documented tumor marker values are classified as "normal" or "elevated"; our goal is to design classifiers for the respective binary classification problems. As we show in the results section, for those medical modeling tasks described here, genetic programming performs best among those techniques that are able to identify nonlinearities; we also see that GP results show less overfitting than those produced using other methods.
Article
A review of the status of standardization of laboratory tests of particular interest to oncologists is presented. Currently, relatively few of these tests are standardized; as a result, interlaboratory and interinstitutional comparison of data is problematic. In 1992, additional interlaboratory studies of common tumor markers will be initiated by the College of American Pathologists. The National Committee for Clinical Laboratory Standards also has begun to develop standard methods and guidelines for these important tests.
Article
The aims were to evaluate the usefulness of tumor-marker measurements, to identify prognostic factors in patients with cancer of unknown primary (CUP) receiving platinum-based combination chemotherapy, and to verify the adjustment of previously reported prognostic models in this population. We conducted univariate and multivariate analyses in consecutive patients with CUP receiving platinum-based combination chemotherapy. Previously reported prognostic models were then validated in this population. A total of 93 patients were analyzed, and the response rate to platinum-based chemotherapeutic regimens among the 93 patients was 39.8%. The median time to progression and overall survival period were 4.1 and 12.4 months, respectively. The ST-439 level was significantly higher in patients with histologically confirmed adenocarcinoma than in patients with poorly differentiated adenocarcinoma or poorly differentiated carcinoma. A multivariate analysis indicated that performance status, the number of involved organs, and the serum lactate dehydrogenase level were the prognostic factors of the outcome. Both of the previously reported prognostic models for predicting the duration of survival in this population were shown to be valid. Tumor-marker measurements are not helpful in the management of patients with CUP. Previously reported prognostic models may be useful for selecting indications for chemotherapy or for stratifying patients in clinical trials.