Article

Variable Interaction Networks in Medical Data


Abstract

In this paper we describe the identification of variable interaction networks in a medical data set. The main goal is to generate mathematical models for standard blood parameters as well as tumor markers using other available parameters in this data set. For each variable we identify those variables that are most relevant for modeling it; relevance of a variable can in this context be defined via the frequency of its occurrence in models identified by evolutionary machine learning methods or via the decrease in modeling quality after removing it from the data set. Several data based modeling approaches implemented in HeuristicLab have been applied for identifying estimators for selected tumor markers and cancer diagnoses: Linear regression and support vector machines (optimized using evolutionary algorithms) as well as genetic programming.
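The second relevance measure mentioned above (the decrease in modeling quality after removing a variable) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain least-squares linear models, training R² as the quality measure, and a synthetic data set; the names `solve`, `r_squared`, and `interaction_network` are hypothetical. Each entry `network[target][input]` can be read as the weight of an edge from `input` to `target`.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def r_squared(rows, inputs, target):
    """Training R² of a least-squares linear model for column `target`."""
    X = [[row[i] for i in inputs] + [1.0] for row in rows]  # intercept column
    y = [row[target] for row in rows]
    k = len(inputs) + 1
    # Normal equations: (X^T X) w = X^T y
    XtX = [[sum(x[p] * x[q] for x in X) for q in range(k)] for p in range(k)]
    Xty = [sum(x[p] * yi for x, yi in zip(X, y)) for p in range(k)]
    w = solve(XtX, Xty)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - sum(wp * xp for wp, xp in zip(w, x))) ** 2
                 for x, yi in zip(X, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def interaction_network(rows, names):
    """For every target, record the drop in R² caused by removing each input."""
    n_vars = len(names)
    network = {}
    for t in range(n_vars):
        inputs = [i for i in range(n_vars) if i != t]
        full = r_squared(rows, inputs, t)
        network[names[t]] = {
            names[removed]: full - r_squared(
                rows, [i for i in inputs if i != removed], t)
            for removed in inputs
        }
    return network

# Synthetic data (an assumption for illustration): c depends on a and b, d is noise.
rng = random.Random(0)
rows = []
for _ in range(300):
    a, b, d = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
    rows.append([a, b, 2.0 * a - b + 0.1 * rng.gauss(0, 1), d])
names = ["a", "b", "c", "d"]
network = interaction_network(rows, names)

for source, impact in network["c"].items():
    print(f"{source} -> c: {impact:.3f}")
```

In this sketch the impact is measured on the training data only; a held-out test partition would give a more honest estimate, and any other model type could replace the linear fit.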


... Both works have in common that they use symbolic regression as the machine learning method for the VIN base models, which allows them to gain even deeper insights, since symbolic regression models are themselves interpretable, clear-box models (Affenzeller et al., 2014). Winkler et al. (2013) use the VIN approach to calculate and visualize the importance of medical variables for estimating breast cancer diagnoses. Concerning VIN creation details, the authors of this paper describe several methods to calculate the variable impacts, i.e. the VIN edge weights, e.g. using the machine learning model error, the re-training error, or the variable frequency during the training process. ...
Conference Paper
With the growing use of machine learning models in many critical domains, research on making these models, as well as their predictions, more explainable has intensified in the last few years. In this paper, we present extensions to the machine learning based data mining technique Variable Interaction Networks (VIN) that integrate existing domain knowledge and thus enable more meaningful analysis. Several tests on data from a case study concerned with long-term monitored photovoltaic systems verify the feasibility of our approach to provide valuable, human-interpretable insights. In particular, we show the successful application of root-cause detection in scenarios with changing system conditions.
... Simulation offers immediate feedback about proposed changes, allows analysis of scenarios, and promotes communication on building a shared system view and understanding of how a complex system works (Forsberg et al., 2011). As for mathematical models for medical parameters that can be easily embedded in simulation models, meaningful applications and case studies can be found in Winkler et al. (2013) and Winkler et al. (2011). ...
Conference Paper
This paper presents the results of a simulation study of a public healthcare facility, in particular its intensive care unit. This unit is the core of a healthcare facility: it admits patients with abnormal vital signs resulting from disease, medical or surgical treatment, or trauma. The approach proposed in this research work begins with the study and analysis of the processes that take place within the intensive care unit; lean management tools and methods are then applied using a Modeling & Simulation based approach. The work illustrates the improvement obtained by applying the Kanban technique and the 5S method. To this end, a simulation model of the intensive care unit covering the main processes and activities has been developed. The main goal of the simulation was to see to what extent the unit improves after the changes implemented through lean management tools and methods. The simulation model highlights the delays due to poor internal organization and the improvements achieved by redesigning the flows, minimizing the pathways, and reengineering the layout of the unit. The results obtained with the simulation model have been transferred to the real system, with a relevant increase in the performance of the intensive care unit (giving a quantitative measure of the value added to the care and wellness of patients).
... The relevance of a variable in this context can be defined via the frequency of its occurrence in models identified by evolutionary machine learning methods or via the decrease in modeling quality after removing it from the data set [46]. The following algorithms have been applied for the data set generated in this study including the results for TPC, TEAC, ORAC, Mn²⁺, Mg²⁺, Ca²⁺, Cu²⁺, K⁺, and PO₄³⁻: linear regression and random forests [47]. As shown in Figure 5A, linear regression confirms the relationship of the antioxidant capacity (TEAC and ORAC) and the TPC level. ...
Article
The compositional characteristics of untreated pure juice prepared from 88 apple varieties grown in the region of Eferding/Upper Austria were determined. Many of the analyzed varieties are non-commercial, old varieties not present on the market. The aim of the study was to quantify the mineral, phosphate, trace element, and polyphenolic content in order to identify varieties that are of particular interest for wider distribution. Great variations among the investigated varieties were found. This holds especially true for the total polyphenolic content (TPC), which ranged from 103.2 to 2,275.6 mg/L. A clear dependence of the antioxidant capacity on the TPC levels was detected. Bioinformatics was employed to find specific interrelationships, such as Mg²⁺/Mn²⁺ and PO₄³⁻/K⁺, between the analyzed bio- and phytochemical parameters. Furthermore, special attention was paid to putative effects of grafting on the phytochemical composition of apple varieties. By grafting 27 different apple varieties on two trees grown close to each other, it could be shown that the apple fruits retain their characteristic phytochemical composition. Finally, apple juice prepared from selected varieties was further characterized by additional biochemical analysis including cytotoxicity, epidermal growth factor receptor (EGFR) inhibition, and α-amylase activity tests. Cytotoxicity as well as inhibition of EGFR activation were found to be dependent on the TPC, while α-amylase activity was reduced by the apple juices independent of the presence of polyphenolic substances. Taken together, selected apple varieties investigated within this study might serve as preferable sources for the development of apple-based food with a strong focus on health-beneficial effects.
Article
The computation of assembly tolerance information is necessary to fulfill robust design requirements. This computation is costly, with current calculations taking several hours. We aim to identify surrogate models for predicting degrees of freedom within a tolerance chain based on point connections between assembly components, thus replacing part of the current computation workflow and consequently reducing computation time. We use manufacturing tolerances set by norms and industrial standards to identify these surrogate models, which define all relevant features and resulting output variables. We use black-box modeling methods (artificial neural networks and gradient boosted trees), as well as white-box modeling (symbolic regression by genetic programming). We see that these three models can reliably predict the degrees of freedom of a tolerance chain with high accuracy (R² > 0.99).
Conference Paper
In this paper we analyze the dynamics of the predictability and variable interactions in financial data of the years 2007–2014. Using a sliding window approach, we have generated mathematical prediction models for various financial parameters using other available parameters in this data set. For each variable we identify the relevance of other variables with respect to prediction modeling. By applying sliding window machine learning we observe that changes of the predictability of financial variables as well as of influence factors can be identified by comparing modeling results generated for different periods of the last 8 years. We see changes of relationships and the predictability of financial variables over the last years, which corresponds to the fact that relationships and dynamics in the financial sector have changed significantly over the last decade. Still, our results show that the predictability has not decreased for all financial variables, indeed in numerous cases the prediction quality has even improved.
Conference Paper
Selection for reproduction in the context of Genetic Algorithms uses only one selection scheme to select parent individuals. When considering the model of sexual selection in the area of population genetics it becomes obvious that the process of choosing mating partners in natural populations is different for male and female individuals. In this paper the authors introduce a new selection paradigm for Genetic Algorithms (SexualGA) based upon the concepts of male vigor and female choice of population genetics which provides the possibility to use two different selection schemes simultaneously within one algorithm. By using this new concept it is possible to simulate sexual selection in natural populations more precisely. Furthermore, SexualGA also offers far more flexibility concerning the adaptivity of selection pressure, enabling the GA user to tune the algorithm more accurately.
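The idea of using two selection schemes within one algorithm can be sketched as follows. This is only an illustration under assumptions, not the SexualGA implementation: the pairing of tournament selection for one parent with uniform random selection for the other is chosen arbitrarily here, and the names `tournament`, `uniform`, and `select_pairs` are hypothetical.

```python
import random
from statistics import mean

def tournament(population, fitness, rng, size=3):
    """Pick the fittest of `size` randomly drawn individuals (high pressure)."""
    return max(rng.sample(population, size), key=fitness)

def uniform(population, fitness, rng):
    """Pick any individual with equal probability (no selection pressure)."""
    return rng.choice(population)

def select_pairs(population, fitness, male_scheme, female_scheme, n_pairs, rng):
    """Select each parent pair with two independent selection schemes."""
    return [
        (male_scheme(population, fitness, rng),
         female_scheme(population, fitness, rng))
        for _ in range(n_pairs)
    ]

# Toy population (an assumption): integers whose fitness is their own value.
rng = random.Random(1)
population = list(range(1, 21))
pairs = select_pairs(population, lambda x: x, tournament, uniform, 2000, rng)

mean_male = mean(m for m, _ in pairs)
mean_female = mean(f for _, f in pairs)
print(f"mean fitness, tournament-selected parents: {mean_male:.2f}")
print(f"mean fitness, uniformly selected parents:  {mean_female:.2f}")
```

Swapping either scheme changes the selection pressure on that side of the pairing only, which is the flexibility the abstract refers to.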
Chapter
In terms of goal orientedness, selection is the driving force of Genetic Algorithms (GAs). In contrast to crossover and mutation, selection is completely generic, i.e. independent of the actually employed problem and its representation. GA-selection is usually implemented as selection for reproduction (parent selection). In this paper we propose a second selection step after reproduction which is also absolutely problem independent. This self-adaptive selection mechanism, which will be referred to as offspring selection, is closely related to the general selection model of population genetics. As the problem- and representation-specific implementation of reproduction in GAs (crossover) is often critical in terms of preservation of essential genetic information, offspring selection has proven to be well suited for improving the global solution quality and robustness concerning parameter settings and operators of GAs in various fields of application. The experimental part of the paper discusses the potential of the new selection model using standardized real-valued test functions in high dimensions as examples.
Conference Paper
In this paper we present results of empirical research work done on the data based identification of estimation models for cancer diagnoses: Based on patients' data records including standard blood parameters, tumor markers, and information about the diagnosis of tumors we have trained mathematical models for estimating cancer diagnoses. Several data based modeling approaches implemented in HeuristicLab have been applied for identifying estimators for selected cancer diagnoses: Linear regression, k-nearest neighbor learning, artificial neural networks, and support vector machines (all optimized using evolutionary algorithms) as well as genetic programming. The investigated diagnoses of breast cancer, melanoma, and respiratory system cancer can be estimated correctly in up to 81%, 74%, and 91% of the analyzed test cases, respectively; without tumor markers up to 75%, 74%, and 87% of the test samples are correctly estimated, respectively.
Book
Genetic Algorithms and Genetic Programming: Modern Concepts and Practical Applications discusses algorithmic developments in the context of genetic algorithms (GAs) and genetic programming (GP). It applies the algorithms to significant combinatorial optimization problems and describes structure identification using HeuristicLab as a platform for algorithm development. The book focuses on both theoretical and empirical aspects. The theoretical sections explore the important and characteristic properties of the basic GA as well as main characteristics of the selected algorithmic extensions developed by the authors. In the empirical parts of the text, the authors apply GAs to two combinatorial optimization problems: the traveling salesman and capacitated vehicle routing problems. To highlight the properties of the algorithmic measures in the field of GP, they analyze GP-based nonlinear structure identification applied to time series and classification problems. Written by core members of the HeuristicLab team, this book provides a better understanding of the basic workflow of GAs and GP, encouraging readers to establish new bionic, problem-independent theoretical concepts. By comparing the results of standard GA and GP implementation with several algorithmic extensions, it also shows how to substantially increase achievable solution quality.
Conference Paper
In this work we compare the use of particle swarm optimization (PSO) and a genetic algorithm (GA), both augmented with support vector machines (SVM), for the classification of high dimensional microarray data. Both algorithms are used for finding small samples of informative genes amongst thousands of them. An SVM classifier with 10-fold cross-validation is applied in order to validate and evaluate the provided solutions. A first contribution is to show that PSOsvm is able to find interesting genes and to provide competitive classification performance. Specifically, a new version of PSO, called Geometric PSO, is empirically evaluated for the first time in this work using a binary representation in Hamming space. In this sense, a comparison of this approach with a new GAsvm and also with other existing methods from the literature is provided. A second important contribution consists in the actual discovery of new and challenging results on six public datasets, identifying significant genes in the development of a variety of cancers (leukemia, breast, colon, ovarian, prostate, and lung).
Article
In this paper we present a comprehensive framework enabling data based quality pre-assessment for the quality of metallurgical products, and report on the main experiences gained in cooperation with an industrial partner. The proposed approach is based on a sequential structure including data pre-processing to obtain a well-conditioned problem as well as nonlinear modelling approaches to determine dependencies. As we document here, this approach works correctly for approximately 90% of the cases, but hints are given on which essential modifications (associated with the problem setup) should be taken to re-formulate the problem in order to increase its performance.
Article
Written by one of the preeminent researchers in the field, this book provides a comprehensive exposition of modern analysis of causation. It shows how causality has grown from a nebulous concept into a mathematical theory with significant applications in the fields of statistics, artificial intelligence, economics, philosophy, cognitive science, and the health and social sciences. Judea Pearl presents and unifies the probabilistic, manipulative, counterfactual, and structural approaches to causation and devises simple mathematical tools for studying the relationships between causal connections and statistical associations. The book will open the way for including causal analysis in the standard curricula of statistics, artificial intelligence, business, epidemiology, social sciences, and economics. Students in these fields will find natural models, simple inferential procedures, and precise mathematical definitions of causal concepts that traditional texts have evaded or made unduly complicated. The first edition of Causality has led to a paradigmatic change in the way that causality is treated in statistics, philosophy, computer science, social science, and economics. Cited in more than 5,000 scientific publications, it continues to liberate scientists from the traditional molds of statistical thinking. In this revised edition, Judea Pearl elucidates thorny issues, answers readers’ questions, and offers a panoramic view of recent advances in this field of research. Causality will be of interest to students and professionals in a wide variety of fields. Anyone who wishes to elucidate meaningful relationships from data, predict effects of actions and policies, assess explanations of reported events, or form theories of causal understanding and causal speech will find this book stimulating and invaluable.
Conference Paper
This contribution describes how symbolic regression can be used for knowledge discovery with the open-source software HeuristicLab. HeuristicLab includes a large set of algorithms and problems for combinatorial optimization and for regression and classification, including symbolic regression with genetic programming. It provides a rich GUI to analyze and compare algorithms and identified models. This contribution mainly focuses on specific aspects of symbolic regression that are unique to HeuristicLab, in particular, the identification of relevant variables and model simplification.
Article
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ***, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
Conference Paper
In this article, we describe the use of tumour marker estimation models in the prediction of tumour diagnoses. In previous works, we have identified classification models for tumour markers that can be used for estimating tumour marker values on the basis of standard blood parameters. These virtual tumour markers are now used in combination with standard blood parameters for learning classifiers that are used for predicting tumour diagnoses. Several data-based modelling approaches implemented in HeuristicLab have been applied for identifying estimators for selected tumour markers and cancer diagnoses: linear regression, k-nearest neighbour (k-NN) learning, artificial neural networks (ANNs) and support vector machines (SVMs) (all optimised using evolutionary algorithms), as well as genetic programming (GP). We have applied these modelling approaches for identifying models for breast cancer diagnoses; in the results section, we summarise classification accuracies for breast cancer and we compare classification results achieved by models that use measured marker values as well as models that use virtual tumour markers.
Article
Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and does a plurality vote when predicting a class. The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy.
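The procedure described in this abstract (bootstrap replicates of the learning set, one predictor per replicate, averaging for numerical outcomes) can be sketched in a few lines. The base learner used here, a one-split regression stump, and the synthetic step-function data are assumptions made for illustration; `fit_stump` and `bagged_predictor` are hypothetical names.

```python
import random
from statistics import mean

def fit_stump(points):
    """Fit a one-split regression stump to (x, y) pairs by least squares."""
    pts = sorted(points)
    best = None
    for i in range(1, len(pts)):
        if pts[i - 1][0] == pts[i][0]:
            continue  # no threshold fits between identical x values
        left = [y for _, y in pts[:i]]
        right = [y for _, y in pts[i:]]
        ml, mr = mean(left), mean(right)
        sse = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, (pts[i - 1][0] + pts[i][0]) / 2, ml, mr)
    if best is None:  # all x values equal: predict the mean everywhere
        m = mean(y for _, y in pts)
        return lambda x: m
    _, threshold, ml, mr = best
    return lambda x: ml if x < threshold else mr

def bagged_predictor(points, n_models, rng):
    """Train one stump per bootstrap replicate and average their predictions."""
    models = [
        fit_stump(rng.choices(points, k=len(points)))  # sample with replacement
        for _ in range(n_models)
    ]
    return lambda x: mean(model(x) for model in models)

# Synthetic learning set (an assumption): a noisy step from 0 to 1 at x = 0.5.
rng = random.Random(42)
points = []
for i in range(60):
    x = i / 59
    points.append((x, (1.0 if x >= 0.5 else 0.0) + rng.gauss(0, 0.1)))

predict = bagged_predictor(points, n_models=25, rng=rng)
print(f"prediction at x=0.1: {predict(0.1):.2f}")
print(f"prediction at x=0.9: {predict(0.9):.2f}")
```

For a class label instead of a numerical outcome, the averaging step would be replaced by a plurality vote over the individual predictions, as the abstract states.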
Conference Paper
Tumor markers are substances that are found in blood, urine, or body tissues and that are used as indicators for tumors; elevated tumor marker values can indicate the presence of cancer, but there can also be other causes. We have used a medical database compiled at the blood laboratory of the General Hospital Linz, Austria: Several blood values of thousands of patients are available as well as several tumor markers. We have used several data based modeling approaches for identifying mathematical models for estimating selected tumor marker values on the basis of routinely available blood values; in detail, estimators for the tumor markers AFP, CA-125, CA15-3, CEA, CYFRA, and PSA have been identified and are analyzed in this paper. The documented tumor marker values are classified as "normal" or "elevated"; our goal is to design classifiers for the respective binary classification problems. As we show in the results section, for those medical modeling tasks described here, genetic programming performs best among those techniques that are able to identify nonlinearities; we also see that GP results show less overfitting than those produced using other methods.
Article
A review of the status of standardization of laboratory tests of particular interest to oncologists is presented. Currently, relatively few of these tests are standardized; as a result, interlaboratory and interinstitutional comparison of data is problematic. In 1992, additional interlaboratory studies of common tumor markers will be initiated by the College of American Pathologists. The National Committee for Clinical Laboratory Standards also has begun to develop standard methods and guidelines for these important tests.
Article
Operational protocols are a valuable means for quality control. However, developing operational protocols is a highly complex and costly task. We present an integrated approach involving both intelligent data analysis and knowledge acquisition from experts that support the development of operational protocols. The aim is to ensure high quality standards for the protocol through empirical validation during the development, as well as lower development cost through the use of machine learning and statistical techniques. We demonstrate our approach of integrating expert knowledge with data driven techniques based on our effort to develop an operational protocol for the hemodynamic system.
Article
To evaluate the usefulness of tumor-marker measurements and to identify prognostic factors in patients with cancer of unknown primary (CUP), receiving platinum-based combination chemotherapy and to verify the adjustment of previously reported prognostic models in this population. We conducted univariate and multivariate analyses in consecutive patients with CUP receiving platinum-based combination chemotherapy. Previously reported prognostic models were then validated in this population. A total of 93 patients were analyzed and the response rate to platinum-based chemotherapeutic regimens among the 93 patients was 39.8%. The median time to progression and overall survival period were 4.1 and 12.4 months, respectively. The ST-439 level was significantly higher in patients with histologically confirmed adenocarcinoma than in patients with poorly differentiated adenocarcinoma or poorly differentiated carcinoma. A multivariate analysis indicated that performance status, the number of involved organs, and the serum lactate dehydrogenase level were the prognostic factors of the outcome. Both the previously reported prognostic models for predicting the duration of survival in this population were shown to be valid. Tumor-marker measurements are not helpful in the management of patients with CUP. Previously reported prognostic models may be useful for selecting indication for chemotherapy or for stratifying the patients in clinical trial.
Adaptation in Natural and Artificial Systems
  • J H Holland
Holland, J. H., 1975. Adaptation in Natural and Artificial Systems. University of Michigan Press.
Molecular marker test standardization. Cancer, 69
  • J A Koepke
Koepke, J. A., 1992. Molecular marker test standardization. Cancer, 69, pp. 1578-1581.
Heuristic Optimization Software Systems - Modeling of Heuristic Optimization Algorithms in the HeuristicLab Software Environment
  • S Wagner
Wagner, S., 2009. Heuristic Optimization Software Systems - Modeling of Heuristic Optimization Algorithms in the HeuristicLab Software Environment. PhD Thesis, Institute for Formal Models and Verification, Johannes Kepler University Linz, Austria.
Evolutionary System Identification - Modern Concepts and Practical Applications
  • S Winkler
Winkler, S., 2009. Evolutionary System Identification - Modern Concepts and Practical Applications. Schriften der Johannes Kepler Universität Linz, Reihe C: Technik und Naturwissenschaften.