Article (PDF available)

Extracting Logical Classification Rules With Gene Expression Programming: Microarray Case Study

Abstract and Figures

The benefits of finding trends in large volumes of data have driven the development of data mining technology for over a decade. This paper presents an evolutionary approach to data mining based on an enhanced version of gene expression programming (GEP). We enhance the original GEP technique by using logical operators instead of mathematical ones and by revising the chromosome validity evaluation, which permits unconstrained search of the genome space while still ensuring the validity of the program's output. It has been demonstrated that GEP greatly surpasses traditional tree-based GP in simplicity, efficiency, solution compactness and comprehensibility.

Keywords: Data mining, classification rules, gene expression programming, microarray
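The abstract's core idea, a GEP chromosome whose function set consists of logical rather than mathematical operators, can be sketched as follows. This is a minimal illustration of expressing a gene in Karva notation over Boolean attributes; the operator names, the example gene and the attribute names are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch: a GEP gene in Karva notation with a logical
# function set, expressed into a tree and evaluated against a record.
OPERATORS = {"AND": 2, "OR": 2, "NOT": 1}  # logical function set: name -> arity

def build_tree(gene):
    """Build the expression tree from a Karva-notation gene, breadth-first."""
    root = {"sym": gene[0], "kids": []}
    queue, i = [root], 1
    while queue:
        node = queue.pop(0)
        for _ in range(OPERATORS.get(node["sym"], 0)):
            child = {"sym": gene[i], "kids": []}
            i += 1
            node["kids"].append(child)
            queue.append(child)
    return root

def evaluate(node, record):
    """Functions are logical operators; terminals look up attribute values."""
    sym, kids = node["sym"], node["kids"]
    if sym == "AND":
        return evaluate(kids[0], record) and evaluate(kids[1], record)
    if sym == "OR":
        return evaluate(kids[0], record) or evaluate(kids[1], record)
    if sym == "NOT":
        return not evaluate(kids[0], record)
    return record[sym]  # terminal: attribute value from the data record

# The gene AND OR NOT a b c expresses the rule (a OR b) AND (NOT c):
rule = build_tree(["AND", "OR", "NOT", "a", "b", "c"])
print(evaluate(rule, {"a": True, "b": False, "c": False}))  # True
```

Because every node returns a Boolean, any gene drawn from this alphabet decodes to a valid classification rule, which is the property the abstract exploits for unconstrained search.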
... GEP has been applied to many problems, including combinatorial optimization [6], finite transducers [42], classification [7]–[10], [41], time series prediction [11]–[13], and symbolic regression [14]–[16]. GEP was also employed to automatically generate a hyper-heuristic framework for combinatorial optimization problems [43], [44]. ...
... We denote T_d as the execution time to process a data chunk, which is smaller than the original input data set. T_d can be computed as in (10), where N_d is the number of data points in the chunk. ...
Article
Full-text available
Gene expression programming (GEP) is a data-driven evolutionary technique that is well suited to correlation mining. Parallel GEPs have been proposed to speed up the evolution process using a cluster of computers or a computer with multiple CPU cores. However, the generation structure of chromosomes and the size of the input data are two issues that tend to be neglected when speeding up GEP evolution. To fill this research gap, this paper proposes three guiding principles that elaborate the computational nature of GEP evolution based on an analysis of GEP schema theory. As a result, a novel data-engineered GEP is developed which follows closely the generation structure of chromosomes in parallelization and considers the input data size in segmentation. Experimental results on two data sets with complementary features show that the data-engineered GEP speeds up the evolution process significantly without loss of accuracy in data correlation mining. Based on the experimental tests, a computation model of the data-engineered GEP is further developed to demonstrate its high scalability in dealing with potential big data using a large number of CPU cores.
... Neural networks and genetic programming algorithms are evolving technologies with an increasing number of real-world applications, including finance and economics (Lisboa and Vellido, 2000; Chen, 2002; Sermpinis et al., 2012). In addition, though it has been successfully applied to problems in biology, mining and computing (Dehuri and Cho, 2008; Weinert, 2004a, 2004b; Margny and El-Semman, 2005), gene expression programming (GEP) is a relatively new technique and its applications in finance and business are quite limited. The motivation for this paper is to forecast oil price performance using these promising classes of artificial intelligence models: neural networks and gene expression programming (GEP). ...
... The results show the effectiveness and efficiency of the GEP algorithm on credit evaluation problems. GEP has also been successfully applied in problems as diverse as mining and computing (Dehuri and Cho, 2008; Lopez and Weinert, 2004a, 2004b; Margny and Semman, 2005), particle physics data analysis (Teodorescu and Sherwood, 2008), food processing (Kahyaoglu, 2008), real parameter optimization (Xu et al., 2009), and chaotic maps analysis (Hardy and Steeb, 2002). ...
Article
Full-text available
This study aims to forecast oil prices using evolutionary techniques such as gene expression programming (GEP) and artificial neural network (NN) models to predict oil prices over the period from January 2, 1986 to June 12, 2012. Autoregressive integrated moving average (ARIMA) models are employed to benchmark evolutionary models. The results reveal that the GEP technique outperforms traditional statistical techniques in predicting oil prices. Further, the GEP model outperforms the NN and the ARIMA models in terms of the mean squared error, the root mean squared error and the mean absolute error. Finally, the GEP model also has the highest explanatory power as measured by the R-squared statistic. The results of this study have important implications for both theory and practice.
... Considering (11) and (13), we can derive the difference between EGEP and GEP on T_e using (14), which proves Theorem 2. ...
Article
Full-text available
Gene expression programming (GEP) is a data-driven evolutionary technique that is well suited to correlation mining of system components. With the rapid development of Industry 4.0, the number of components in a complex industrial system has increased significantly, along with a high complexity of correlations. As a result, a major challenge in employing GEP to solve system engineering problems lies in the computational efficiency of the evolution process. To address this challenge, this paper presents EGEP, an event-tracker-enhanced GEP, which filters out irrelevant system components so that the evolution process converges quickly. Furthermore, we introduce three theorems to mathematically validate the effectiveness of EGEP based on a GEP schema theory. Experimental results also confirm that EGEP outperforms GEP with a shorter computation time in evolution.
... However, it cannot automatically work properly with arbitrary data types. This limitation is in fact due to the closure property of GEP [28], which requires that any non-terminal (function) operate on terminals (attributes) or on the outputs of other functions. In other words, in order to use GEP to solve a problem on a given type of data, we have to define the functions and terminals, as well as other parameters such as the number of genes (consisting of terminals and functions) in each chromosome, the number of individuals in a population and the number of generations. ...
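The setup obligations described in that snippet (declaring a closed function set, the terminals, and the run parameters) can be made concrete with a small sketch. All names and numbers here are illustrative assumptions; the head/tail relation is Ferreira's standard GEP validity rule, not something specific to the cited paper.

```python
# Hypothetical GEP configuration for Boolean attribute data. Closure is
# satisfied by restricting the function set to logical operators, so every
# function can consume the output of any terminal or other function.
config = {
    "functions": {"AND": 2, "OR": 2, "NOT": 1},    # name -> arity
    "terminals": ["gene_1", "gene_2", "gene_3"],   # attribute names (illustrative)
    "genes_per_chromosome": 3,
    "population_size": 50,
    "generations": 100,
}

# Ferreira's head/tail rule keeps every chromosome syntactically valid:
# a tail of length t = h * (n_max - 1) + 1 guarantees enough terminals
# to saturate whatever functions appear in the head.
head_length = 7
max_arity = max(config["functions"].values())
tail_length = head_length * (max_arity - 1) + 1
print(tail_length)  # 8
```

Changing the data type means revisiting this whole declaration, which is exactly the manual effort the snippet identifies as a limitation.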
Article
Lung cancer is one of the deadliest diseases in the world. Non-small cell lung cancer (NSCLC) is the most common and dangerous type of lung cancer. Despite the fact that NSCLC is preventable and curable in some cases if diagnosed at an early stage, the vast majority of patients are diagnosed very late. Furthermore, NSCLC usually recurs some time after treatment. It is therefore of paramount importance to predict NSCLC recurrence, so that specific and suitable treatments can be sought. Nonetheless, conventional methods of predicting cancer recurrence rely solely on histopathology data, and their predictions are not reliable in many cases. The microarray gene expression (GE) technology provides a promising and reliable way to predict NSCLC recurrence by analysing the GE of sample cells. This study proposes a new model based on GE programming that uses microarray datasets for NSCLC recurrence prediction. To this end, the authors also propose a hybrid method to rank and select relevant prognostic genes related to NSCLC recurrence prediction. The proposed model was evaluated on real NSCLC microarray datasets and compared with other representational models. The results demonstrated the effectiveness of the proposed model.
... This direction includes methods other than classical analysis: methods based on simulation and on solving problems of generalization, association and pattern finding. The problem of cluster analysis, known as the problem of automatic grouping of objects, is perhaps one of the most widely studied in the data mining and machine learning communities [1]–[8]. ...
Article
Full-text available
The Fast Balanced K-means (FBK-means) clustering approach is one of the most important options when solving clustering problems on balanced data. Numerical experiments mostly show that FBK-means is faster and more accurate than the K-means algorithm, the Genetic Algorithm, and the Bee algorithm. The FBK-means algorithm needs fewer distance calculations and less computational time while keeping the same clustering results. However, the FBK-means algorithm does not give good results on imbalanced data. To resolve this shortcoming, a more efficient clustering algorithm, namely Fast K-means (FK-means), is developed in this paper. This algorithm not only gives results as good as those of the FBK-means approach but also needs less computational time in the case of imbalanced data.
Chapter
Classification is an important branch of data mining. The purpose of classification is to construct a classifier by training on a training data set, so as to obtain a classification function that can label unknown data. Classification can in general be realized by traditional algorithms such as Naive Bayes, K-Nearest Neighbors, and neural networks. This paper presents a special approach to constructing a classifier from the viewpoint of Gene Expression Programming, in which the genetic process is kept the same as when dealing with regression problems. The proposed approach starts by digitizing the categories as target values; the classification thresholds are then set correspondingly for the practical problem at hand. The experimental results show that the Gene Expression Programming algorithm performs very well in classification, not only binary classification but also problems with three or more categories. The high performance observed and its intrinsic features (such as strong search capability and evolutionary competence) make the new proposal a promising tool for solving classification problems.
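The thresholding step this abstract describes, digitizing categories as target values and mapping the evolved program's numeric output back to a class, can be sketched in a few lines. The threshold values and labels below are illustrative assumptions, not taken from the chapter.

```python
# Minimal sketch of regression-to-classification thresholding: categories
# are digitised as 0, 1, 2, ... and the chromosome's real-valued output is
# mapped to a label by its position among ordered thresholds.
import bisect

def classify(output, thresholds, labels):
    """Map a real-valued GEP output to a class label via ordered thresholds."""
    return labels[bisect.bisect_right(thresholds, output)]

# Three classes digitised as 0, 1, 2, with thresholds at 0.5 and 1.5:
print(classify(0.2, [0.5, 1.5], [0, 1, 2]))  # 0
print(classify(1.7, [0.5, 1.5], [0, 1, 2]))  # 2
```

With k ordered thresholds this handles k + 1 categories, which matches the abstract's claim that the scheme extends beyond binary classification.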
Chapter
The chapter focuses on Genetic-Fuzzy Rule Based Systems of soft computing in order to deal with uncertainty and imprecision, in an evolving manner, across different domains. It has been observed that major professional domains such as education and technology, human resources, and psychology still lack intelligent decision support systems of a self-evolving nature. The chapter proposes a novel framework implementing the Theory of Multiple Intelligences in education to identify students' technical and managerial skills. A detailed methodology of the proposed system architecture, which includes the design of rule bases for technical and managerial skills, the encoding strategy, the fitness function, and the crossover and mutation operations for evolving populations, is presented in this chapter. The outcome and supporting experimental results are also presented to justify the significance of the proposed framework. The chapter concludes by discussing advantages and future scope in different domains.
Article
Establishing security metrics in vehicular networking is still being debated. The dynamic characteristics of vehicular networks impose challenges to realizing an appropriate solution for organizing and ensuring reliable data transfer between vehicular nodes. In order to ensure road safety, avoid or reduce traffic congestion, and identify malicious vehicles, an efficient trust management system has to be implemented in real-time scenarios. Existing applications in this area have focused on reliable data exchange and on the authentication of vehicular nodes forwarding messages. This study proposes a new entity-centric trust framework using decision tree classification and artificial neural networks. The decision tree classification model is used to derive rules for trust calculation, and artificial neural networks are used to self-train the vehicular nodes when the expected trust value is not met. The model uses multifaceted role- and distance-based metrics, such as Euclidean distance, to estimate trust. The proposed entity-centric trust model uses a versatile new direct and recommended trust evaluation strategy to compute trust values. The suggested model is simple, reliable and efficient in comparison with other popular entity-centric trust models. Results and comparative analyses are presented to demonstrate the better performance of the proposed model over related approaches.
Conference Paper
Full-text available
In this paper we report the results of a comparative study on different variations of genetic programming applied to binary data classification problems. The first genetic programming variant weights data records when calculating the classification error and modifies the weights during the run. Hereby the algorithm defines its own fitness function in an on-line fashion, giving higher weights to 'hard' records. Another novel feature we study is the atomic representation, where 'Booleanization' of data is performed not at the root but at the leaves of the trees, and only Boolean functions are used in the trees' body. As a third aspect we look at generational and steady-state models in combination with both features.
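The weighted on-line fitness idea in that abstract can be sketched briefly. The update scheme and the factor below are illustrative assumptions, not the authors' exact rule: misclassified records gain weight, so later generations are scored more heavily on the 'hard' records.

```python
# Sketch of a record-weighted classification error with an on-line
# weight update that emphasises misclassified ('hard') records.
def weighted_error(predictions, targets, weights):
    """Classification error where each record counts with its weight."""
    wrong = sum(w for p, t, w in zip(predictions, targets, weights) if p != t)
    return wrong / sum(weights)

def update_weights(predictions, targets, weights, factor=1.5):
    """Multiply the weight of every misclassified record by `factor`."""
    return [w * factor if p != t else w
            for p, t, w in zip(predictions, targets, weights)]

preds, targets, w = [1, 0, 1], [1, 1, 1], [1.0, 1.0, 1.0]
print(weighted_error(preds, targets, w))  # one of three equal-weight records wrong
print(update_weights(preds, targets, w))  # the wrong record's weight grows
```

Because the weights change between evaluations, the same program can receive a different fitness in different generations, which is what makes the fitness function "on-line" in the sense of the abstract.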
Article
New laboratory technologies such as DNA microarrays have made it possible to measure the expression levels of thousands of genes simultaneously in a particular cell or tissue. The challenge for genetic epidemiologists will be to develop statistical and computational methods that are able to identify subsets of gene expression variables that classify and predict clinical endpoints. Linear discriminant analysis is a popular multivariate statistical approach for classification of observations into groups. This is because the theory is well described and the method is easy to implement and interpret. However, an important limitation is that linear discriminant functions need to be prespecified. To address this limitation and the limitation of linearity, we have developed symbolic discriminant analysis (SDA) for the automatic selection of gene expression variables and discriminant functions that can take any form. In the present study, we demonstrate that SDA is capable of identifying combinations of gene expression variables that are able to classify and predict autoimmune diseases. (C) 2002 Wiley-Liss, Inc.
Article
In machine learning terms DNA (gene) chip data is unusual in having thousands of attributes (the gene expression values) but few (
Article
Published in AI Magazine, 1992. (Abstract fragment; cites Silberschatz, Stonebraker, and Ullman 1990.)