A Comparative Study of Classification Methods for Microarray Data Analysis

Hong Hu (1), Jiuyong Li (1), Ashley Plank (1), Hua Wang (1), Grant Daggard (2)
(1) Department of Mathematics and Computing
(2) Department of Biological and Physical Sciences
University of Southern Queensland, Toowoomba, QLD 4350, Australia
Email: huhong@usq.edu.au
Abstract
In response to the rapid development of DNA Microarray technology, many classification methods have been applied to Microarray classification. SVMs, decision trees, Bagging, Boosting and Random Forest are commonly used methods. In this paper, we conduct an experimental comparison of LibSVMs, C4.5, BaggingC4.5, AdaBoostingC4.5, and Random Forest on seven Microarray cancer data sets. The experimental results show that all ensemble methods outperform C4.5. The results also show that all five methods benefit, in classification accuracy, from data preprocessing that includes gene selection and discretization. In addition to comparing the average accuracies of ten-fold cross-validation tests on the seven data sets, we use two statistical tests to validate the findings. We observe that the Wilcoxon signed rank test is better suited than the sign test for this purpose.
Keywords: Microarray data, classification.
1 Introduction
In recent years, the rapid development of DNA Microarray technology has made it possible for scientists to monitor the expression levels of thousands of genes in a single experiment (Schena, Shalon, Davis & Brown 1995, Lockhart, Dong, Byrne & et al. 1996). With DNA expression Microarray technology, researchers are able to classify diseases according to the expression levels in normal and tumor cells, to discover relationships between genes, and to identify the critical genes in the development of disease. There are many active research applications of Microarray technology, such as cancer classification (Golub, Slonim, Tamayo & et al. 1999, Veer, Dai, de Vijver & et al. 2002, Petricoin III, Ardekani, Hitt, Levine & et al. 2002), gene function identification (Lu, Patterson, Wang, Marquez & Atkinson 2004, Santin, Zhan, Bellone & Palmieri 2004), clinical diagnosis (Yeang, Ramaswamy, Tamayo & et al. 2001), and drug discovery studies (Maron & Lozano-Pérez 1998).
This project was partially supported by Australian Research Council Discovery Grant DP0559090.
Copyright © 2006, Australian Computer Society, Inc. This paper appeared at the Australasian Data Mining Conference (AusDM 2006), Sydney, December 2006. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 61. Peter Christen, Paul Kennedy, Jiuyong Li, Simeon Simoff and Graham Williams, Eds. Reproduction for academic, not-for-profit purposes permitted provided this text is included.
A main task of Microarray classification is to build a classifier from historical Microarray gene expression data, and then use the classifier to classify future incoming data. Many methods have been used in Microarray classification; typical methods are Support Vector Machines (SVMs) (Brown, Grundy, Lin, Cristianini, Sugnet, Furey, Jr & Haussler 2000, Guyon, Weston, Barnhill & Vapnik 2002), the k-nearest neighbor classifier (Yeang et al. 2001), the C4.5 decision tree (Li & Liu 2003, Li, Liu, Ng & Wong 2003), rule-based classification methods (Yeang et al. 2001) and ensemble methods, such as Bagging and Boosting (Tan & Gilbert 2003, Dietterich 2000).
SVMs, decision trees and ensemble methods are the most frequently used methods in Microarray classification. Reading through the literature on Microarray data classification, it is difficult to find consensus conclusions on their relative performance. We are particularly interested in classifying Microarray data using C4.5, since it provides more interpretable results than other methods do. Therefore, we design an experiment to assess the classification performance of C4.5, AdaBoostingC4.5, BaggingC4.5, Random Forests and LibSVMs on seven Microarray cancer data sets.
In the experimental analysis, we use the sign test and the Wilcoxon signed rank test to compare the classification performance of the different methods. We find that the Wilcoxon signed rank test is better than the sign test for this comparison. We also find inconsistencies between the accuracy comparison and the Wilcoxon signed rank test, and we interpret these results in a reasonable way.
The rest of this paper is organized as follows. In Section 2, we describe the relevant methods in this comparison study. In Section 3, we introduce our experimental design. In Section 4, we show our experimental results and present discussions. In Section 5, we conclude the paper.
2 Algorithms selected for comparison
Numerous Microarray data classification algorithms
have been proposed in recent years. Most of them
have been adapted from current data mining and ma-
chine learning algorithms.
C4.5 (Quinlan 1993, Quinlan 1996) was proposed by Quinlan in 1993 and is a typical decision tree algorithm. C4.5 partitions the training data into disjoint subsets based on the values of an attribute. At each step in the construction of the decision tree, C4.5 selects the attribute that separates the data with the highest information gain ratio (Quinlan 1993). The same process is repeated on all subsets until each subset contains only one class. To simplify the decision tree, the induced tree is pruned using pessimistic error estimation (Quinlan 1993).
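To make the splitting criterion concrete, the following is a minimal sketch of the gain-ratio computation for a categorical attribute. The function names and toy data are our own illustration; C4.5 itself additionally restricts the choice to attributes with above-average gain and thresholds continuous attributes.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H(Y) of a label vector, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(attribute, labels):
    """Information gain ratio of a categorical attribute, as used by C4.5."""
    n = len(labels)
    values, counts = np.unique(attribute, return_counts=True)
    # Expected entropy of the class after splitting on this attribute.
    cond_entropy = sum(
        (c / n) * entropy(labels[attribute == v]) for v, c in zip(values, counts)
    )
    gain = entropy(labels) - cond_entropy
    # Split information penalizes attributes with many distinct values.
    split_info = entropy(attribute)
    return gain / split_info if split_info > 0 else 0.0

# Toy example: a binary gene-expression attribute versus a class label.
gene = np.array(["high", "high", "low", "low", "high", "low"])
cls  = np.array(["tumor", "tumor", "normal", "normal", "tumor", "normal"])
print(gain_ratio(gene, cls))  # 1.0: the attribute separates the classes perfectly
```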
SVMs were proposed by Cortes and Vapnik (Cortes & Vapnik 1995) in 1995 and have been among the most influential classification algorithms in recent years. SVMs transform the input samples into a high-dimensional space by a kernel function and separate the two classes in that space with a linear hyperplane, which is defined by support vectors selected from the training samples. SVMs have been applied in many domains, for example, text categorization (Joachims 1998) and cancer classification (Furey, Christianini, Duffy, Bednarski, Schummer & Hauessler 2000, Brown et al. 2000, Brown, Grundy, Lin, Cristianini, Sugnet, Ares & Haussler 1999).
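As a rough illustration (not the paper's Weka setup), the sketch below trains an RBF-kernel SVM with scikit-learn, whose SVC class wraps the LIBSVM library; the synthetic data stands in for a gene expression matrix.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for a Microarray matrix: rows are samples, columns are genes.
X, y = make_classification(n_samples=100, n_features=50, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel maps samples into a high-dimensional space; the learned
# hyperplane is defined by the support vectors.
clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
print("support vectors per class:", clf.n_support_,
      "test accuracy:", clf.score(X_test, y_test))
```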
In the past decade, many researchers have devoted their efforts to the study of ensemble decision tree methods for Microarray classification. Ensemble decision tree methods combine decision trees generated from multiple training data sets obtained by re-sampling the original training data. Bagging, Boosting and Random Forests are some of the well-known ensemble methods in the machine learning field.
Bagging was proposed by Leo Breiman (Breiman 1996) in 1996. Bagging uses a bootstrap technique to re-sample the training data: some samples may appear more than once in a re-sampled data set whereas others do not appear at all. A set of alternative classifiers is generated from the set of re-sampled data sets. Each classifier in turn assigns a predicted class to an incoming test sample, and the final predicted class for the sample is determined by majority vote, with all classifiers weighted equally.
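A minimal sketch of this procedure, assuming scikit-learn decision trees as the base classifiers (the committee size of 25 and the synthetic data are arbitrary illustrations):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    # Bootstrap: draw n samples with replacement from the training set.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# Equal-weight majority vote over the committee's predictions.
votes = np.array([t.predict(X_test) for t in trees])
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print("bagged accuracy:", np.mean(majority == y_test))
```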
The Boosting method was first developed by Freund and Schapire (Freund & Schapire 1996) in 1996. Boosting uses a re-sampling technique different from Bagging: each new training data set is generated according to a sample distribution. The first classifier is constructed from the original data set, where every sample has an equal distribution ratio of 1. For the following training data sets, the distribution ratios differ among samples: a sample's ratio is reduced if the sample has been correctly classified; otherwise the ratio is kept unchanged. Samples which are misclassified therefore often appear several times in a re-sampled training data set, whereas samples which are correctly classified often may not appear at all. A weighted voting method is used in the committee decision: a higher-accuracy classifier has a larger weight than a lower-accuracy classifier, and the final verdict goes with the largest weighted vote.
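The sketch below shows the reweighting form of AdaBoost.M1 with decision stumps; the resampling variant described above instead draws each round's training set from these weights. The round count and stump depth are illustrative choices, not the paper's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=200, n_features=50, random_state=1)
y = 2 * y01 - 1                      # recode labels to {-1, +1}

n = len(y)
w = np.full(n, 1.0 / n)              # start with a uniform sample distribution
stumps, alphas = [], []
for _ in range(20):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w[pred != y])       # weighted training error (w sums to 1)
    if err == 0 or err >= 0.5:
        break
    alpha = 0.5 * np.log((1 - err) / err)   # this classifier's voting weight
    # Increase weights of misclassified samples, decrease correct ones.
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final decision: sign of the weighted committee vote.
committee = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("training accuracy:", np.mean(committee == y))
```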
Based on Bagging, Leo Breiman introduced another ensemble decision tree method called Random Forests (Breiman 1999) in 1999. This method combines Bagging with random feature selection to generate multiple diverse classifiers.
3 Experimental design methodology
3.1 Ten-fold cross-validation
Tenfold cross-validation is used in this experiment. In tenfold cross-validation, a data set is divided into 10 equally sized folds (partitions) with approximately the same class distribution. In each test, 9 folds are used for training and the remaining fold is used for testing (as unseen data). The procedure is repeated 10 times so that every fold serves as the test set once, and the final accuracy of an algorithm is the average over the 10 trials.
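A minimal sketch of this protocol, assuming scikit-learn; StratifiedKFold preserves the class distribution in each fold, matching the description above. The classifier and data here are illustrative stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=100, n_features=50, random_state=0)

# 10 folds with (approximately) the same class distribution in each fold.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=folds)
print("per-fold accuracy:", scores.round(2), "mean:", scores.mean().round(3))
```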
3.2 Test data sets
Seven data sets from the Kent Ridge Biological Data Set Repository (?) are selected. These data sets were collected from well-known journal papers, namely Breast Cancer (Veer et al. 2002), Lung Cancer (Gordon, Jensen, Hsiao, Gullans & et al. 2002), Lymphoma (Alizadeh, Eisen, Davis, Ma & et al. 2000), ALL-AML Leukemia (Golub et al. 1999), Colon (Alon & et al. 1999), Ovarian (Petricoin III et al. 2002) and Prostate (Singh & et al. 2002). Table 1 summarizes the characteristics of the seven data sets. We conduct our experiments using tenfold cross-validation on the merged original training and test data sets.
Data set        Genes   Classes  Records
Breast Cancer   24481   2        97
Lung Cancer     12533   2        181
Lymphoma        4026    2        47
Leukemia        7129    2        72
Colon           2000    2        62
Ovarian         15154   2        253
Prostate        12600   2        21

Table 1: Experimental data set details
3.3 Software used for comparison
We conducted our experiments with C4.5, AdaBoostC4.5, BaggingC4.5, Random Forests and LibSVMs using the Weka-3-5-2 package, which is available online (http://www.cs.waikato.ac.nz/ml/weka/). Default settings are used for all compared methods. We were aware that the accuracy of some methods on some data sets could be improved by changing parameters. However, it was difficult to find another uniform setting good for all data sets. Therefore, we did not change the default settings, since the defaults produced high accuracy on average.
3.4 Microarray data preprocessing
We used the information gain ratio for gene selection and used Fayyad and Irani's MDL discretization method, as provided by Weka, to discretize numerical attributes. Our previous results (Hu, Li, Wang & Daggard 2006) show that with preprocessing, the number of genes selected affects the classification accuracy, and that overall performance is better when data sets contain 50 to 100 genes. For our experiment, we set the number of genes to 50. After preprocessing, each data set therefore contains 50 genes with discretized values.
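As a loose approximation of this pipeline: the paper uses Weka's gain ratio and Fayyad-Irani MDL discretization, while the sketch below substitutes scikit-learn's mutual-information score and equal-frequency binning, so it is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic stand-in: 2000 "genes", 62 samples, as in the Colon data set.
X, y = make_classification(n_samples=62, n_features=2000, n_informative=20,
                           random_state=0)

# Keep the 50 genes that carry the most information about the class label.
X_sel = SelectKBest(mutual_info_classif, k=50).fit_transform(X, y)

# Discretize the selected genes (equal-frequency bins instead of MDL).
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
X_prep = disc.fit_transform(X_sel)
print(X_prep.shape)  # (62, 50)
```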
3.5 Sign test
The sign test (Conover 1980) is used to test whether one random variable in a pair tends to be larger than the other. Given n pairs of observations, each pair is assigned a plus, a minus or a tie: a plus when the first value is greater than the second, a minus when it is less, and a tie when the two are equal. The null hypothesis is that the numbers of pluses and minuses are equal. If the null hypothesis is rejected, then one random variable tends to be greater than the other.
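A minimal sketch of the test on paired accuracies, assuming SciPy (version 1.7 or later for binomtest). Taking the C4.5 and Random Forests columns of Table 2, ties are dropped and the plus count is compared against a fair coin; a one-sided test reproduces the 0.063 entry in Table 3.

```python
import numpy as np
from scipy.stats import binomtest

# Per-data-set accuracies of C4.5 and Random Forests from Table 2.
c45 = np.array([84.5, 98.3, 74.5, 88.9, 88.7, 96.8, 95.2])
rf  = np.array([88.7, 99.5, 93.6, 98.6, 83.9, 99.2, 100.0])

diff = rf - c45
plus, minus = int(np.sum(diff > 0)), int(np.sum(diff < 0))  # ties dropped
p = binomtest(plus, n=plus + minus, p=0.5, alternative="greater").pvalue
print(f"pluses={plus}, minuses={minus}, one-sided p-value={p:.3f}")  # ~0.063
```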
3.6 Wilcoxon signed rank test
The sign test only makes use of whether a value in a pair is greater than, less than or equal to the other. The Wilcoxon signed rank test (Conover 1980, Daniel 1978) instead uses the differences of the pairs. After discarding pairs with a difference of zero, the absolute differences are ranked in ascending order. When several pairs have equal absolute differences, each of them is assigned the average of the ranks that would otherwise have been assigned. The null hypothesis is that the differences have a mean of 0.
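A corresponding sketch with SciPy's wilcoxon, again on the C4.5 and Random Forests columns of Table 2 (our illustration, not the authors' tool). Zero differences are discarded automatically, and a one-sided alternative yields a p-value close to the 0.05 entry in Table 4.

```python
import numpy as np
from scipy.stats import wilcoxon

c45 = np.array([84.5, 98.3, 74.5, 88.9, 88.7, 96.8, 95.2])
rf  = np.array([88.7, 99.5, 93.6, 98.6, 83.9, 99.2, 100.0])

# Ranks the absolute paired differences; zero differences are discarded.
stat, p = wilcoxon(rf, c45, alternative="greater")
print(f"W={stat}, one-sided p-value={p:.3f}")  # ~0.05
```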
Data set C4.5 Random Forests AdaBoostC4.5 BaggingC4.5 LibSVMs
Breast Cancer 84.5 88.7 90.7 85.6 72.2
Lung Cancer 98.3 99.5 98.3 97.8 100.0
Lymphoma 74.5 93.6 89.4 89.4 55.3
Leukemia 88.9 98.6 95.8 95.8 100.0
Colon 88.7 83.9 90.3 90.3 90.3
Ovarian 96.8 99.2 98.8 98.0 100.0
Prostate 95.2 100 95.2 95.2 100.0
Average 89.6 94.8 94.1 93.2 88.3
Table 2: Average accuracy (%) of five classification algorithms on seven preprocessed data sets, based on tenfold cross-validation
                C4.5     Random Forests  AdaBoostC4.5  BaggingC4.5
Random Forests  0.063
AdaBoostC4.5    0.031*   0.63
BaggingC4.5     0.11     0.0088*         0*
LibSVMs         0.23     0.34            0.34          0.34

Table 3: Summary of the sign test between any two of the compared classification methods. P-values of the test are given; values significant at the 95% confidence level are marked with *.
4 Experimental results and discussions
Table 2 shows the individual and average accuracy results of all the compared methods on the seven preprocessed data sets with the tenfold cross-validation method. Table 5 shows the corresponding results of the compared methods on the seven original data sets.
Based on Table 2, we draw the following conclusion: with preprocessed data sets, all ensemble methods on average perform better than C4.5 and LibSVMs, while C4.5 and LibSVMs perform similarly to each other.
These results demonstrate that ensemble decision tree methods can improve the accuracy over a single decision tree method on Microarray data sets, which is consistent with most machine learning studies.
To determine whether the ensemble methods consistently outperform the single classification methods, we also conducted a sign test. The results are shown in Table 3. Based on the sign test, we draw the following conclusions.
1. AdaBoostC4.5 is the only one of the compared classification algorithms that significantly outperforms C4.5.
2. Among the ensemble methods, Random Forests and AdaBoostC4.5 significantly outperform BaggingC4.5.
3. There is insufficient evidence that any ensemble method, or C4.5, outperforms LibSVMs.
We make the following observations from the sign test. The average difference of 6.5% (between Random Forests and LibSVMs) may not be statistically significant, while the average difference of 0.9% (between AdaBoostC4.5 and BaggingC4.5) is statistically significant. This may sound strange, but it is understandable. The average accuracy indicates the average performance of a method over the data sets, whereas the sign test indicates whether a method is consistently better than another on each test data set, even when each accuracy difference is very small. For example, every accuracy value of AdaBoostC4.5 is at least as high as that of BaggingC4.5, and hence the sign test shows that AdaBoostC4.5 is significantly better than BaggingC4.5, although the accuracy improvement is marginal.
This also indicates a limitation of the sign test: differences of 0.01 and 10.0 are treated the same, since only the sign of the difference is used. We therefore conducted a Wilcoxon signed rank test based on Table 2. The results are shown in Table 4.
Table 4 shows that all ensemble methods, Random Forests, AdaBoostC4.5 and BaggingC4.5, are significantly more accurate than C4.5. This conclusion is consistent with most of the research literature. Though AdaBoostC4.5 performs marginally better than BaggingC4.5 on each data set, the Wilcoxon signed rank test does not support that the differences are significant. We tend to believe that the Wilcoxon signed rank test is better than the sign test for our purpose.
Based on Table 2 and Table 4, we can conclude that all ensemble methods significantly outperform C4.5. We do not have sufficient evidence to show whether LibSVMs or any other method is better. Though Table 2 gives a large average accuracy difference between the ensemble methods and LibSVMs, we do not know whether LibSVMs or an ensemble method will perform better on a given data set. This is because SVMs and decision trees are two different types of classification methods, and they are suitable for different data sets.
To show that all methods benefit from data preprocessing, we conducted experiments on the original data sets; their accuracy results are shown in Table 5. Table 2 and Table 5 clearly indicate that all classification methods achieve higher average accuracy on data preprocessed by discretization and gene selection than on data without preprocessing. After data preprocessing, the accuracy improves for all compared classification algorithms, with improvements in average accuracy ranging from 13.6% to 21.2%.
To show that this improvement is significant, we conducted the sign test and the Wilcoxon signed rank test on the differences between the accuracies on preprocessed and original data. The test results are shown in Table 6 and Table 7.

Method           Sign test p-value
C4.5             0.0625
Random Forests   0.0078*
AdaBoostC4.5     0.0078*
BaggingC4.5      0.0078*
LibSVMs          0.0156*

Table 6: Summary of the sign test between the accuracy of each compared classification method on the original and preprocessed data sets. Values significant at the 95% confidence level are marked with *.

Method           Wilcoxon p-value
C4.5             0.025*
Random Forests   0.005*
AdaBoostC4.5     0.005*
BaggingC4.5      0.005*
LibSVMs          0.01*

Table 7: Summary of the Wilcoxon signed rank test between the accuracy of each compared classification method on the original and preprocessed data sets. Values significant at the 95% confidence level are marked with *.
Based on the sign test at the 95% confidence level, all methods except C4.5 achieve significantly higher predictive accuracy on the preprocessed Microarray data sets than on the original data sets. There is not enough evidence to conclude that C4.5 performs significantly better on the preprocessed Microarray data sets than on the original data sets.
                C4.5     Random Forests  AdaBoostC4.5  BaggingC4.5
Random Forests  0.05*
AdaBoostC4.5    0.005*   0.2-0.3
BaggingC4.5     0.025*   0.1-0.2         0.091
LibSVMs         0.5      0.4-0.5         0.4-0.5       0.4-0.5

Table 4: Summary of the Wilcoxon signed rank test between any two of the compared classification methods. P-values are shown; values significant at the 95% confidence level are marked with *.
Data set C4.5 Random Forests AdaBoostC4.5 BaggingC4.5 LibSVMs
Breast Cancer 62.9 61.9 61.9 66.0 52.6
Lung Cancer 95.0 98.3 96.1 97.2 82.9
Lymphoma 78.7 80.9 85.1 85.1 55.3
Leukemia 79.2 86.1 87.5 86.1 65.3
Colon 82.3 75.8 77.4 82.3 64.5
Ovarian 95.7 94.1 95.7 97.6 87.0
Prostate 33.3 52.4 33.3 42.9 61.9
Average 75.3 78.5 76.7 79.6 67.1
Difference 14.3 16.3 17.4 13.6 21.2
Table 5: Average accuracy (%) of five classification methods on seven original data sets, based on tenfold cross-validation. The last row shows, for each method, the difference between its average accuracy on the preprocessed data and on the original data.
These results show that the data preprocessing method improves the predictive accuracy of classification. As we mentioned before, Microarray data contains irrelevant and noisy genes, which do not help classification but reduce the predictive accuracy. Microarray data preprocessing reduces the number of irrelevant genes in Microarray data classification and therefore generally helps to improve the classification accuracy.
Apart from predictive accuracy, the representation of the predictive results is another important factor in determining the quality of a classification algorithm. Among the compared algorithms, the classifier built by C4.5 is a tree, and the classifier built by an ensemble method is a group of trees. Trees are easier for users to evaluate and interpret. By contrast, the outputs of SVMs are numerical values and are less interpretable.
5 Conclusion
In this paper, we conducted a comparative study of classification methods for Microarray data analysis. We compared five classification methods, namely LibSVMs, C4.5, BaggingC4.5, AdaBoostingC4.5, and Random Forest, on seven Microarray data sets, both with and without gene selection and discretization. The experimental results show that all ensemble methods are significantly more accurate than C4.5, and that data preprocessing significantly improves the accuracies of all five methods. We conducted both the sign test and the Wilcoxon signed rank test to evaluate the performance differences of the compared methods, and we observed that the Wilcoxon signed rank test is better than the sign test. We also found that there is insufficient evidence to support a performance difference between the SVM and the ensemble methods, although the average accuracy of the SVM is much lower than that of the ensemble methods. A possible explanation is that they are two different classification schemes, and hence one may suit a data set whereas the other does not.
References
Alizadeh, A., Eisen, M., Davis, E., Ma, C. & et al. (2000), 'Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling', Nature 403, 503–511.
Alon, U. & et al. (1999), ‘Broad patterns of gene
expression revealed by clustering analysis of tu-
mor and normal colon tissues probed by oligonu-
cleotide arrays’, PNAS 96, 6745–6750.
Breiman, L. (1996), ‘Bagging predictors’, Machine
Learning 24(2), 123–140.
Breiman, L. (1999), Random forests–random features, Technical Report 567, University of California, Berkeley.
Brown, M., Grundy, W., Lin, D., Cristianini, N., Sugnet, C., Furey, T., Jr, M. & Haussler, D. (2000), Knowledge-based analysis of microarray gene expression data by using support vector machines, in 'Proc. Natl. Acad. Sci.', Vol. 97, pp. 262–267.
Brown, M., Grundy, W. N., Lin, D., Cristianini, N.,
Sugnet, C., Ares, M. & Haussler, D. (1999), Sup-
port vector machine classification of microarray
gene expression data, Technical Report UCSC-
CRL-99-09, University of California, Santa Cruz,
Santa Cruz, CA 95065.
Conover, W. J. (1980), Practical nonparametric sta-
tistics, Wiley, New York.
Cortes, C. & Vapnik, V. (1995), ‘Support-vector net-
works.’, Machine Learning 20(3), 273–297.
Daniel, W. W. (1978), Applied nonparametric statis-
tics, Houghton Mifflin, Boston.
Dietterich, T. G. (2000), ‘An experimental compari-
son of three methods for constructing ensembles
of decision trees: Bagging, boosting, and ran-
domization’, Machine learning 40, 139–157.
Freund, Y. & Schapire, R. E. (1996), Experiments
with a new boosting algorithm, in ‘International
Conference on Machine Learning’, pp. 148–156.
Furey, T. S., Christianini, N., Duffy, N., Bednarski, D. W., Schummer, M. & Hauessler, D. (2000), 'Support vector machine classification and validation of cancer tissue samples using microarray expression data.', Bioinformatics 16(10), 906–914.
Golub, T., Slonim, D., Tamayo, P. & et al. (1999),
‘Molecular classification of cancer: Class discov-
ery and class prediction by gene expression mon-
itoring’, Science 286, 531–537.
Gordon, G., Jensen, R., Hsiao, L.-L., Gullans, S. & et al. (2002), 'Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma', Cancer Research 62, 4963–4967.
Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. (2002), 'Gene selection for cancer classification using support vector machines', Machine Learning 46(1-3), 389–422.
Hu, H., Li, J., Wang, H. & Daggard, G. (2006), Combined gene selection methods for microarray data analysis, in '10th International Conference on Knowledge-Based & Intelligent Information & Engineering Systems. To appear'.
Joachims, T. (1998), Text categorization with sup-
port vector machines: learning with many rele-
vant features, in ‘Proceedings of 10th European
Conference on Machine Learning’, number 1398,
pp. 137–142.
Li, J. & Liu, H. (2003), Ensembles of cascading trees,
in ‘ICDM’, pp. 585–588.
Li, J., Liu, H., Ng, S.-K. & Wong, L. (2003), Dis-
covery of significant rules for classifying cancer
diagnosis data, in ‘ECCB’, pp. 93–102.
Lockhart, D., Dong, H., Byrne, M. & et al. (1996),
‘Expression monitoring by hybridization to high-
density oligonucleotide arrays’, Nature Biotech-
nology 14, 1675–1680.
Lu, K., Patterson, A. P., Wang, L., Marquez, R.
& Atkinson, E. (2004), ‘Selection of potential
markers for epithelial ovarian cancer with gene
expression arrays and recursive descent partition
analysis’, Clin Cancer Res 10, 291–300.
Maron, O. & Lozano-Pérez, T. (1998), A framework for multiple-instance learning, in M. I. Jordan, M. J. Kearns & S. A. Solla, eds, 'Advances in Neural Information Processing Systems', Vol. 10, The MIT Press, pp. 570–576.
Petricoin III, E., Ardekani, A., Hitt, B., Levine, P. & et al. (2002), 'Use of proteomic patterns in serum to identify ovarian cancer', The Lancet 359, 572–577.
Quinlan, J. (1996), 'Improved use of continuous attributes in C4.5', Journal of Artificial Intelligence Research 4, 77–90.
Quinlan, J. R. (1993), C4.5: Programs for Machine
Learning, Morgan Kaufmann, San Mateo, Cali-
fornia.
Santin, A., Zhan, F., Bellone, S. & Palmieri, M. (2004), 'Gene expression profiles in primary ovarian serous papillary tumors and normal ovarian epithelium: identification of candidate molecular markers for ovarian cancer diagnosis and therapy', International Journal of Cancer 112, 14–25.
Schena, M., Shalon, D., Davis, R. & Brown, P. (1995),
‘Quantitative monitoring of gene expression pat-
terns with a complementary DNA microarray’,
Science 270, 467–470.
Singh, D. & et al. (2002), ‘Gene expression correlates
of clinical prostate cancer behavior’, Cancer Cell
1, 203–209.
Tan, A. C. & Gilbert, D. (2003), 'Ensemble machine learning on gene expression data for cancer classification', Applied Bioinformatics 2(3), s75–s83.
Veer, L. V., Dai, H., de Vijver, M. V. & et al. (2002),
‘Gene expression profiling predicts clinical out-
come of breast cancer’, Nature 415, 530–536.
Yeang, C., Ramaswamy, S., Tamayo, P. & et al.
(2001), ‘Molecular classification of multiple tu-
mor types’, Bioinformatics 17(Suppl 1), 316–
322.
... In this section we start our analyses using different techniques that we mentioned before are(Decision Tree (DT), Support Vector Machine (SVM), Extreme Gradient Boost (XGBoost), AdaBoost, Multiple layer perceptron (MLP), Random Forest (RF), Gradient Boosting (GB), Logistic Regression (LR), and K Nearest Neighbour (KNN)), depending on the tool that we are going to use the data will be affected and significantly [36,37]. ...
... Gradient Boosting (GB) is an algorithm that builds a new base learner that correlates to the negative slope and maximum of the loss function through the ensemble, which means that if the error is the squared error loss, then it depends on the learning step (equation (8)), and it's all up to the user [37], and its equation is shown in equation (9). Logistic Regression (LR) is named after the core function of the method, the logistic function as it also called the sigmoid function and its equation is presented in equation (10), it also represents an S-shaped curve that map values between but not exactly 0 and 1 [44]. ...
... AdaBoostingC4.5 and Random Forest on seven Microarray cancer data sets. They used two statistical tests to validate their performance and they also observed that Wilcoxon signed rank test is better than sign test.The Experimental results shown that all ensemble methods perform C4.5 [4]. In their proposed method they tried to capture the various sounds produced by human activities by using Non-Markovian Ensemble voting technique. ...
Article
Full-text available
A random forest classifier (RFC) is a collection or ensemble of decision trees. Each tree is trained on a random subset of the attributes. We propose a classification technique using voting method with random forests. Random forests are extensions of decision trees and it is a kind of ensemble method. Our proposed method can achieve high accuracy by building several classifiers and running each classifier independently. Accuracy of our proposed method is high compared with other traditional classification algorithms. Voting technique takes outcome from each decision tree and based on the majority of votes it decides which is the actual outcome. Using Scikit-learn tool we evaluated the efficiency of our proposed method. Scikit-learn is a machine learning tool which is extremely used in various machine learning applications for predicting the behavior of data
... Several existing research about machine learning scheme in gene classification [7][8][9][10][11] has been surveyed deeply to examine the best machine learning approach with higher classification accuracy. In [12], author inspected towards the comparison of experiments with LibSVMs, BaggingC4.5, C4.5, AdaBoostingC4.5 and Random Forest over microarray datasets. ...
Article
Full-text available
Gene classification is an increasing concern in the field of medicine for identifying various diseases at earlier stages. This work aims to specifically predict the abnormalities in human chromosome-17 by means of effective random forest bootstrap classification. The homo-sapiens dataset is initially preprocessed to remove the unwanted data. The enhanced data undergoes training phase where the appropriate and relevant features are selected by wrapper and filter methods. Based on the feature priorities, decision trees are formulated using random forest technique. The statistical quantities are estimated from the samples and a bootstrap sampling is designated. The effective bootstrap technique classifies the gene abnormalities in chromosome-17. The performance metrics are evaluated and the classification accuracy value is compared with the values of existing algorithms. From the experimental results, it is proved that the proposed method is highly accurate than the conventional methods.
... Most of the related research involve some form of feature selection for selecting the most informative genes for cancer diagnosis, such as [7] that makes use of the Particle Swarm Optimization algorithm in a fuzzy multi-objective framework. Hu et al. [8] compared five classification approaches on seven different microarray cancer datasets with and without gene selection; they proved that data preprocessing improves the classification accuracy. Lee et al. [9] compared different feature selection methods for different microarray datasets. ...
Chapter
Full-text available
Microarray gene expression data is a small sample high-dimensional dataset in which each sample is attributed with thousands of genes. The gene expression dataset is therefore very hard to classify because we have to consider thousands of genes for each sample while training the dataset. In this paper, we propose to classify the lung cancer microarray gene expression data using the Fuzzy Min-Max (FMM) classifier that is seldom used for high-dimensional datasets due to the large computational overhead. To improve the accuracy and speed of the FMM classifier, we use Least Absolute Shrinkage and Selection Operator (LASSO) to select the optimal gene subset for classification of lung cancer. We compare the classification performance of FMM-LASSO with that of support vector machine (SVM), Random Forest, K-nearest Neighbor (KNN), Naïve Bayes and Logistic Regression classifiers, with and without LASSO. The results prove that FMM-LASSO performs better as compared to other approaches.
... This research suggests a comprehensive analysis, to test different classification algorithms and evaluate them by applying the diabetes dataset. We used different classification methods such as support vector machine (SVM), ensemble approach, and decision tree to analyze diabetes data [19]. ...
Article
In today’s world using data mining and classification is considered to be one of the most important techniques, as today’s world is full of data that is generated by various sources. However, extracting useful knowledge out of this data is the real challenge, and this paper conquers this challenge by using machine learning algorithms to use data for classifiers to draw meaningful results. The aim of this research paper is to design a model to detect diabetes in patients with high accuracy. Therefore, this research paper using five different algorithms for different machine learning classification includes, Decision Tree, Support Vector Machine (SVM), Random Forest, Naive Bayes, and K- Nearest Neighbor (K-NN), the purpose of this approach is to predict diabetes at an early stage. Finally, we have compared the performance of these algorithms, concluding that K-NN algorithm is a better accuracy (81.16%), followed by the Naive Bayes algorithm (76.06%).
... In the tenfold cross-validation, the dataset is divided into ten parts, where one part is removed to represent the validation set, and the remaining nine parts combined to represent the training set. Thus, this process is repeated ten times by removing one part each time to have a different part of the data for validation 35 . We left aside 30% of the entire dataset, which served as an independent testing set for the final evaluation. ...
Article
Full-text available
Abstract Cancer tumor classification based on morphological characteristics alone has been shown to have serious limitations. Breast, lung, colorectal, thyroid, and ovarian are the most commonly diagnosed cancers among women. Precise classification of cancers into their types is considered a vital problem for cancer diagnosis and therapy. In this paper, we proposed a stacking ensemble deep learning model based on one-dimensional convolutional neural network (1D-CNN) to perform a multi-class classification on the five common cancers among women based on RNASeq data. The RNASeq gene expression data was downloaded from Pan-Cancer Atlas using GDCquery function of the TCGAbiolinks package in the R software. We used least absolute shrinkage and selection operator (LASSO) as feature selection method. We compared the results of the new proposed model with and without LASSO with the results of the single 1D-CNN and machine learning methods which include support vector machines with radial basis function, linear, and polynomial kernels; artificial neural networks; k-nearest neighbors; bagging trees. The results show that the proposed model with and without LASSO has a better performance compared to other classifiers. Also, the results show that the machine learning methods (SVM-R, SVM-L, SVM-P, ANN, KNN, and bagging trees) with under-sampling have better performance than with over-sampling techniques. This is supported by the statistical significance test of accuracy where the p-values for differences between the SVM-R and SVM-P, SVM-R and ANN, SVM-R and KNN are found to be p = 0.003, p
Article
Cyadox, a potential antimicrobial growth promoter, has been widely studied and prospected to be used as an additive in livestock and poultry feed. Although high cyadox exposure has been reported to cause toxicity, the exact metabolic effects are not fully understood. Our study aim is to evaluate the metabolic effects of cyadox using comprehensive methods including serum clinical chemical test, histopathology analysis, metabolomics, and transcriptomics profile analysis. One single acute dosage over 7-day course and one subchronic 90-day dietary ingestion of cyadox intervention were conducted on the Wistar rats separately. Dose-dependent alterations were shown in the metabolism of the urine, kidney, plasma, and liver by metabolomics analysis. We further investigated gene expressions of the liver administered with high dose of cyadox for 12 weeks. Top sixty-six differentially expressed genes involved in the pathways, including xenobiotic (cyadox) metabolism, lipid metabolism, energy metabolism, nucleic acid metabolic process, inflammatory response, and response to the oxidative stress, which were in concordance with these metabolic alternations. Our study provided a comprehensive information on how cyadox modulates the metabolism and gene expressions, which is vital when considering the safe application of cyadox.
Article
Medical science essentially uses the system of information mining and AI. In various spaces of medical science, information mining methods are useful for exploration and arranging. A few applications are conceivable by including the assets of another registering area. An affiliation rule mining procedure-based prediction system is proposed in this specific situation. The affiliation rules are created in light of thing sets frequencies. The proposed technique takes care of accelerating the speed of affiliation rule age. Since the current Apriori calculation consumes a lot of time and memory for producing applicant sets. Subsequently, we carried out the partition and beating technique utilized with the ongoing Apriori calculation to further develop information handling speed. Since the age of most potential mixes of components or thing sets is required. The petite information input size decreases the calculation time in the proposed technique. The introduced work is an information model for foreseeing clinical infection as indicated by the different datasets accessible, UCI vault-based clinical datasets, for example, Heart and Diabetes datasets. In this introduced work, both datasets are utilized for trial and error. The acquired outcomes show that the proposed Apriori calculation builds their precision and reduces the total running time.
Article
-In the last decade, ownership and use of mobile phone has increased dramatically in India promoting the use of mobile wallet service, especially among the adults. Mobile wallet adoption and usage is poised for major growth in the next few years, and thereby displace traditional payments such as cash and cards. The COVID-19 pandemic has accelerated the trend to use mobile payment. During this study, the essential variables of technology accepted model (TAM), viz. perceived security, social influence, and perceived innovativeness, have been identified through the survey. These variables area unit expected to own associate influence on the mobile services adoption intention. The goal of this paper is to predict the adoption of mobile wallet using 5 different classifiers namely Logistic Regression (LR), Multilayer Perceptron (MLP), Random Forest (RF), Naïve Bayesian and Logistic Model Tree (LMT) classification algorithm. We assessed the classifiers of the samples collected from 100 respondents from Puducherry India.. For experimentation, WEKA is used as a simulation tool; the results reveal that the RF achieves better performance when compared to other classifiers. LR attains the classification accuracy of 78.11%, Naive Bayes 63.88%, LMT 81.5% and MLP 82.66% for the dataset respectively. Key Words:Mobile wallet service, traditional payments, TAM, social influence, perceived security, innovativeness.
Article
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost, but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. These ideas are al;so applicable to regression.
Article
A self-contained introduction to the theory and methods of non-parametric statistics. Presents a review of probability theory and statistical inference and covers tests based on binomial and multinomial distributions and methods based on ranks and empirical distributions. Includes a thorough collection of statistics tables, hundreds of problems and references, detailed numerical examples for each procedure, and an instant consultant chart to guide the student to the appropriate procedure.
Article
Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and does a plurality vote when predicting a class. The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy.