An Optimized Formulation of Decision Tree Classifier
Fahim Irfan Alam1, Fateha Khanam Bappee2,
Md. Reza Rabbani1, and Md. Mohaiminul Islam1
1Department of Computer Science & Engineering
University of Chittagong, Chittagong, Bangladesh
{fahim1678,md.reza.rabbani,mohaiminul2810}@gmail.com
2Department of Mathematics, Statistics and Computer Science
St. Francis Xavier University, Antigonish, Canada
bappeenstu@gmail.com
Abstract. An effective input dataset, valid pattern-spotting ability, and good evaluation of discovered patterns are required in order to analyze, predict and discover previously unknown knowledge from a large data set. The criteria of significance, novelty and usefulness need to be fulfilled in order to evaluate the performance of the prediction and classification of data. Data mining, an important step in this process of knowledge discovery, extracts hidden and non-trivial information from raw data through useful methods such as decision tree classification. However, due to the enormous size, high dimensionality and heterogeneous nature of the data sets, traditional decision tree classification algorithms sometimes do not perform well in terms of computation time. This paper proposes a framework that uses a parallel strategy to optimize the performance of decision tree induction and cross-validation in order to classify data. Moreover, an existing pruning method is incorporated into our framework to overcome the overfitting problem and enhance generalization ability, while reducing cost and structural complexity. Experiments on ten benchmark data sets suggest a significant improvement in computation time and better classification accuracy with the optimized classification framework.
Keywords: Data Mining, Decision Tree, Knowledge Discovery, Classification, Parallel Strategy.
1 Introduction
Recent times have seen a significant growth in the availability of computers, sensors and information distribution channels, which has resulted in an increasing flood of data. However, these huge volumes of data are of little use unless they are properly analyzed, exploited and useful information is extracted from them. Prediction is a fascinating feature that can be readily adopted in data-driven knowledge extraction tasks, because if we cannot predict a useful pattern from an
enormous data set, there is little point in gathering these massive sets of data unless the patterns they contain are recognized and acted upon in advance.
Data mining [23], a relatively recently developed methodology and technology, aims to identify valid, novel, non-trivial, potentially useful, and understandable correlations and patterns in data [6]. It does this by analyzing data sets in order to extract patterns that are too subtle or complex for humans to detect [11]. There are a number of applications involving prediction, e.g. predicting diseases [10], stock market crashes [9] and drug discovery [17], where we see an unprecedented opportunity to develop automated data mining techniques for extracting concealed knowledge. Extraction of hidden, non-trivial information from a large data set leads us to obtain useful knowledge to classify the data according to the pattern-spotting criteria [5].
The study of data mining builds upon ideas and methods from various fields such as statistics and machine learning, database systems, and data visualization. Decision tree classification [4] [19] is one of the most widely used techniques for inductive inference in data mining applications. A decision tree is a predictive model consisting of a set of conditions organized in a hierarchical structure [19]. In this structure, an instance of data is classified by following the path of satisfied conditions from the root node of the tree until reaching a leaf, which corresponds to a class value. Some of the most well-known decision tree algorithms are C4.5 [19] and CART [4].
Research in decision tree algorithms is greatly influenced by the necessity of developing algorithms for data sets arising in various business, information retrieval, medical and financial applications. For example, business organizations use decision trees to analyze the buying patterns of customers and their needs, medical experts use them to discover interesting patterns in data in order to facilitate the cure process, and the credit card industry uses them for fraud detection.
But due to the enormous size, high dimensionality, and heterogeneous nature of the data sets, traditional decision tree classification algorithms sometimes fail to perform well in applications that require computationally intensive tasks and fast computation of classification rules. Computation and analysis of massive data sets with decision trees is becoming difficult, and in some cases practically impossible. For example, in the medical domain, disease prediction tasks require learning the properties of at least thousands of cases for a safe and accurate prediction and classification, and it is hardly possible for a human analyst to analyze and discover useful information from such data sets. An optimized formulation of decision trees holds great promise for developing new sets of tools that can automatically analyze the massive data sets arising in such applications.
However, large data sets and their high dimensionality make data mining applications computationally very demanding, and in this regard high-performance parallel computing is fast becoming an essential part of the solution. We can also improve the performance of the discovered knowledge and classification rules by utilizing available computing resources that mostly
remain unused in an environment. This has motivated us to develop a parallel strategy for existing decision tree classification algorithms in order to save computation time and make better use of resources. As decision tree induction exhibits natural concurrency, a parallel formulation is a suitable option for optimized performance.
However, designing such a parallel strategy is challenging and requires several issues to be addressed. Computation and communication costs are two such issues that most parallel algorithms must consider. Multiple processors work together to optimize the performance, but at the same time the internal exchange of information (if any) between them increases the communication cost to some extent, which in turn affects the optimization performance.
SLIQ [16] is a fast, scalable decision tree algorithm that achieves good classification accuracy with small execution time. But the performance of SLIQ is limited by its use of a centralized, memory-resident data structure, the class list, which puts a limit on the size of the data sets SLIQ can handle. SPRINT [21] is a parallel successor of SLIQ that removes SLIQ's memory restriction by splitting the attribute lists evenly among processors and finding the split point for a node of the decision tree in parallel. However, in order to do that, the entire hash table is required on all the processors. To construct the hash table, an all-to-all broadcast [12] is performed, which makes the algorithm unscalable with respect to runtime and memory requirements, because each processor requires O(N) memory to store the hash table and O(N) communication cost for the all-to-all broadcast, where N is the size of the dataset.
ScalParC [8] is an improved version of SPRINT in the sense that it uses a distributed hash table to efficiently implement the splitting phase of SPRINT. Here the overall communication overhead of the phase does not exceed O(N), and the memory requirement does not exceed O(N/p) per processor. This ensures that the algorithm is scalable in terms of both runtime and memory requirements. Another optimized formulation of decision trees is the concatenated parallelism strategy for divide-and-conquer problems [7]. In this method, a combination of data parallelism and task parallelism is used to parallelize the divide-and-conquer algorithm. However, in the classification decision tree problem, the workload cannot be determined based on the size of the data at a particular node of the tree. Therefore, the one-time load balancing used in this method is not desirable.
In this paper, we propose a strategy that keeps the computation and communication costs minimal. Moreover, our algorithm particularly considers the issue of load balancing, so that every processor handles a roughly equal portion of the task and no resources in the cluster are left underutilized. Another feature of our algorithm is that it works with both discrete and continuous attributes.
Section 2 discusses the sequential decision tree algorithm that we want to optimize in this paper. The parallel formulation of the algorithm is explained in Section 3. Experimental results are shown in Section 4. Finally, concluding remarks are stated in Section 5.
2 Sequential Decision Tree Classification Algorithm
Most existing induction-based decision tree classification algorithms, e.g. C4.5 [19], ID3 [18] and CART [4], use Hunt's method [19] as the basic algorithm. These algorithms mostly fail to scale to applications that require analysis of large data sets in a short time. The recursive description of Hunt's method for constructing a decision tree is given in Algorithm 1.
Algorithm 1. Hunt's Method [19]
Inputs: Training set T of n examples {T1, T2, ..., Tn} with classes {c1, c2, ..., ck} and attributes {A1, A2, ..., Am} that have one or more mutually exclusive outcomes {O1, O2, ..., Op}
Output: A decision tree D with nodes N1, N2, ...
Case 1:
if {T1, T2, ..., Tn} belong to a single class cj then
    D ← leaf identifying class cj
Case 2:
if {T1, T2, ..., Tn} belong to a mixture of classes then
    Split T into attribute-class tables Si
    for each i = 1 to m do
        for each j = 1 to p do
            Separate Si for each value of Ai
            Compute degree of impurity using either entropy, Gini index or classification error
        Compute information gain for each Ai
    D ← node Ni with the largest information gain
Case 3:
if T is an empty set then
    D ← Ni labeled by the default class ck
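To make the sequential baseline concrete, the following is a minimal Python sketch of Hunt-style induction for discrete attributes, using entropy as the impurity measure and information gain as the splitting criterion. The function and variable names are our own illustrative choices, not the authors' implementation or that of C4.5/CART.

from collections import Counter
import math

def entropy(labels):
    """Degree of impurity of a set of class labels (Case 2 of Algorithm 1)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on one discrete attribute."""
    base = entropy(labels)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return base - remainder

def hunt(rows, labels, attributes, default=None):
    """Recursive Hunt-style induction; returns nested dicts or a class label."""
    if not rows:                       # Case 3: empty set -> default class
        return default
    if len(set(labels)) == 1:          # Case 1: single class -> leaf
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                 # no attribute left to split on
        return majority
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))  # Case 2
    node = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node[best][value] = hunt([rows[i] for i in idx], [labels[i] for i in idx],
                                 remaining, default=majority)
    return node

# Example on a small fragment resembling the artificial training set used later (Table 1)
rows = [{"Gender": "male", "TravelCost": "cheap"},
        {"Gender": "female", "TravelCost": "expensive"},
        {"Gender": "male", "TravelCost": "standard"}]
labels = ["bus", "car", "train"]
print(hunt(rows, labels, ["Gender", "TravelCost"]))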
3 Optimized Algorithm
A decision tree is an important method for classification problems. A data set called the training set is given as input; it consists of a number of examples, each having a number of attributes. The attributes are either continuous, when the attribute values are ordered, or categorical, when the attribute values are unordered. One of the categorical attributes is called the class value or the classifying attribute. The objective of inducing the decision tree is to use the training set to build a model of the class value based on the other attributes, such that the model can be used to classify new data not taken from the training data set.
In Section 3.1, we give a parallel formulation of classification decision tree construction using a partition and integration strategy, followed by measuring the predictive accuracy of the tree using a cross-validation approach [3]. We focus our presentation on discrete attributes only. The handling of continuous attributes is discussed in Section 3.2. Later, pruning is applied in order to optimize the size of the original decision tree and reduce its structural complexity, as explained in Section 3.3. Using an artificially created training set, we describe our parallel algorithm in the following steps.
3.1 Partition and Integration Strategy
Let us consider an artificial training set Tr with n examples, each having m attributes, as shown in Table 1.
Table 1. Artificial Training Set

Gender  Car Ownership  Travel Cost  Income Level  Class
male    zero           cheap        low           bus
male    one            cheap        medium        bus
female  zero           cheap        low           bus
male    one            cheap        medium        bus
female  one            expensive    high          car
male    two            expensive    medium        car
female  two            expensive    high          car
female  one            cheap        medium        car
male    zero           standard     medium        train
female  one            standard     medium        train
female  one            expensive    medium        bus
male    two            expensive    medium        car
male    one            standard     high          car
female  two            standard     low           bus
male    one            standard     low           train
1. A root processor M will calculate the degree of impurity, using either entropy, the Gini index or classification error, for Tr. Then it will divide Tr into a set of subtables Tsub = {Tr1, Tr2, ..., Trm}, one for each of the m attributes, according to attribute-class combinations, and will send the subtables to the set of child processors C = {Cp1, Cp2, ...} following the cases below.
Case 1: If |C| < |Tsub|, M will assign the subtables to C in such a way that the number of subtables handled by each child processor Cpi, where i = 1, 2, ..., |C|, is roughly equal.
Case 2: If |C| > |Tsub|, M will assign the subtables to C in such a way that each Cpi handles exactly one subtable.
2. Each Cpi will simultaneously calculate the information gain of its respective attributes and will return the calculated information gain to M.
3. M will compare the information gains received from each Cpi and will select as the root node the optimum attribute, i.e. the one that produces the maximum information gain. Our decision tree now consists of a single root node, as shown in Fig. 1 for the training data in Table 1, and will now be expanded.
Fig. 1. Root node of the decision tree
4. After obtaining the root node, M will again split Tr into subtables according to the values of the optimum attribute found in Step 3. Then it will send those subtables in which impurity remains to the child processors, following the cases below. A pure class is assigned to a leaf node of the decision tree.
Case 1: If |C| < |Tsub|, Case 1 explained in Step 1 applies, and each Cpi will iterate in the same way as in the above steps.
Case 2: If |C| > |Tsub|, Case 2 explained in Step 1 applies, and each Cpi will partition its subtables into a set of sub-subtables Tsubsub = {Tsub1, Tsub2, ..., Tsubm} according to attribute-class combinations. The active child processors will balance the load among the idle processors in such a way that the number of sub-subtables handled by each child processor is roughly equal. After computing the information gain, the child processors will synchronize to find the optimum attribute and send it to M.
5. Upon receiving it, M will add those optimum attributes as child nodes in the decision tree. For the training data in Table 1, the current form of the decision tree is shown in Fig. 2.
Fig. 2. Decision Tree after First Iteration
6. This process continues until no more nodes are available for expansion. The final decision tree for the training data is given in Fig. 3. The entire optimization process is depicted in Algorithm 2.
Fig. 3. Final Decision Tree
Next, we focus on determining the predictive accuracy of the generated hypothesis on our dataset. The hypothesis will produce the highest accuracy on the training data but will not necessarily work well on unseen, new data. This overfitting problem complicates the estimation of the predictive accuracy of a model. To address it, we carry out cross-validation [3], a generally applicable and very useful technique for tasks such as accuracy estimation. It consists of partitioning a dataset Tr into n subsets Tri and then running the tree generation algorithm n times, each time using a different training set Tr − Tri and validating the results on Tri. An obvious disadvantage of cross-validation is its computationally intensive nature; an n-fold cross-validation requires running the algorithm n times. To reduce this computation overhead, a parallel strategy is again used to generate the n trees of an n-fold cross-validation, as explained below:
1. A root processor will divide the original dataset into n folds, of which n − 1 folds are considered the training set and the remaining fold the validation or test set. The root will continue dividing the dataset until every example in the dataset has been used as a validation example exactly once.
2. The root processor will send the divided datasets (consisting of both training and test parts) to the child processors in such a way that the assignment of datasets to the child processors is roughly equal.
3. The respective child processors will act as roots and will form n temporary decision trees by following Algorithm 2.
4. Next, the child processors will calculate the predictive accuracy of the temporary decision trees by running the validation sets and send the results to the root processor.
5. Finally, the root processor will average the results and determine the actual predictive accuracy of the original decision tree.
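A minimal sketch of this parallel n-fold cross-validation step is shown below. The paper's implementation distributes folds over MPI processes; here a Python process pool stands in for the child processors, and the hypothetical hunt routine from the earlier sketch is reused, so all names are illustrative assumptions rather than the authors' code.

from concurrent.futures import ProcessPoolExecutor

def classify(tree, row):
    """Follow satisfied conditions from the root until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(row[attr], None)
    return tree

def evaluate_fold(args):
    """Child-processor role: build a temporary tree on n-1 folds, test on the held-out fold."""
    train_rows, train_labels, test_rows, test_labels, attributes = args
    tree = hunt(train_rows, train_labels, attributes)   # Algorithm 2 plays this role in the real system
    correct = sum(classify(tree, row) == lab for row, lab in zip(test_rows, test_labels))
    return correct / len(test_labels)

def parallel_cross_validation(rows, labels, attributes, n_folds=10, workers=4):
    """Root-processor role: split into folds, farm folds out, average the returned accuracies."""
    folds = [list(range(i, len(rows), n_folds)) for i in range(n_folds)]
    folds = [f for f in folds if f]                      # ignore empty folds on tiny datasets
    jobs = []
    for test_idx in folds:
        test_set = set(test_idx)
        train_idx = [i for i in range(len(rows)) if i not in test_set]
        jobs.append(([rows[i] for i in train_idx], [labels[i] for i in train_idx],
                     [rows[i] for i in test_idx], [labels[i] for i in test_idx], attributes))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        accuracies = list(pool.map(evaluate_fold, jobs))
    return sum(accuracies) / len(accuracies)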
Algorithm 2. Partition & Integration Algorithm
Inputs: Training set Tr of n examples {Tr1, Tr2, ..., Trn} with classes {c1, c2, ..., ck} and attributes {A1, A2, ..., Am} that have one or more mutually exclusive values {V1, V2, ..., Vp}, root processor M and child processors C = {Cp1, Cp2, ...}
Output: A decision tree D with nodes N1, N2, ...
Processor M: Compute degree of impurity for Tr
    Divide Tr into Tsub = {Tr1, Tr2, ..., Trm}, one subtable for each of the m attributes
    Send Tsub to C
    // Subtable assignment process to C
    if |C| < |Tsub| then
        j = 1
        while (j <= |Tsub|) do
            for each i = 1 to |C| do
                Cpi ← Trj
                j++
                if i == |C| then
                    i = 1
    if |C| >= |Tsub| then
        j = 1
        while (j <= |Tsub|) do
            for each i = 1 to |C| do
                Cpi ← Trj
                j++
Child Processors Cpi, i = 1, ..., |C|:
    Compute Information Gain for each Trj
    Send computed Information Gain to M
Processor M: Find optimum attribute Aopt
    NROOT ← Aopt
    D ← NROOT
    Divide Tr into Tsub according to Aopt
    Send Tsub to C
    Repeat actions for subtable assignment process to C
Child Processors Cpi:
    Partition Trj into Tsubsub = {Tsub1, Tsub2, ..., Tsubm}, one for each of the m attributes
    Repeat actions of the first iteration
    Compute Information Gain and send the Aopt's to M for each Vx of Ay; x = 1, 2, ..., p, y = 1, 2, ..., m
Processor M:
    if Entropy == 0 then
        D ← leaf labeled with the class for Vx
    else
        N ← Aopt
        D ← N
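As a rough illustration of the partition and integration strategy, the sketch below plays the role of processor M, with a Python process pool standing in for the MPI child processors used in the paper; the round-robin assignment mirrors the subtable assignment cases of Step 1. The helper names, and the reuse of information_gain from the earlier sketch, are illustrative assumptions rather than the authors' implementation.

from multiprocessing import Pool

def child_gains(share):
    """Child-processor role: compute the information gain of each assigned attribute subtable."""
    return [(attr, information_gain(rows, labels, attr)) for rows, labels, attr in share]

def round_robin(subtables, num_children):
    """Subtable assignment to children (Cases 1 and 2 of Step 1): roughly equal shares."""
    shares = [[] for _ in range(num_children)]
    for j, subtable in enumerate(subtables):
        shares[j % num_children].append(subtable)
    return [s for s in shares if s]          # unneeded children stay idle when |C| > |Tsub|

def pick_split_attribute(rows, labels, attributes, num_children=4):
    """Root-processor role: scatter one subtable per attribute, gather gains, keep the best."""
    subtables = [(rows, labels, attr) for attr in attributes]   # attribute-class subtables
    shares = round_robin(subtables, num_children)
    with Pool(processes=len(shares)) as pool:
        gains = [g for result in pool.map(child_gains, shares) for g in result]
    best_attr, _ = max(gains, key=lambda pair: pair[1])
    return best_attr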
3.2 Computing Continuous Attributes
Handling continuous attributes in decision trees is different from handling discrete attributes. They require special attention and cannot fit into the learning scheme if we try to treat them exactly the same way as discrete attributes. However, continuous attributes are very common in practical tasks and we cannot ignore them. One way of handling continuous attributes is to discretize them, i.e. to convert or partition continuous attributes into discretized attributes over some interval [2]. One possible way of finding the interval is to select a splitting criterion for dividing the set of continuous values into two sets [20]. This in turn involves the critical issue of unsorted values, which makes it difficult to find the splitting point. For large data sets, sorting the values and then selecting the interval requires significant time.
In this regard, we choose to do the sorting and the selection of the splitting point in parallel, in order to avoid the additional time needed for discretizing a large set of continuous values. The root processor divides the continuous attribute set into N/P cases, where N is the number of training cases for that attribute and P is the number of processors, and sends the divided cases to the child processors. Each processor thus receives N/P training cases and sorts them. After the individual sorting done by each child processor, another round of sorting is carried out among the child processors using merge sort [13]. The final sorted values are sent to the root, and the splitting criterion is decided accordingly. The overall process is depicted in Algorithm 3.
Algorithm 3. Discretize Continuous Attributes
Inputs: Training set Tr of n examples, continuous attributes {A1, A2, ..., Am} from the training set, root processor M and child processors C = {Cp1, Cp2, ...}
Output: Discretized continuous attribute values
Processor M: Send n/|C| cases of Ai, i = 1, ..., m, to each child processor in C
Child Processors Cpi: Sort the values of Ai using [13]
    Perform merge sort over the sorted individual groups from each Cpi
    Send the final sorted values to M, where a = lowest value and b = highest value
Processor M: Compute split point = (a + b) / 2
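The sketch below illustrates this discretization step under the same stand-in assumptions as before (a Python process pool instead of MPI children): the root chunks the continuous values, the children sort their chunks, the sorted runs are merged, and the split point is the midpoint of the lowest and highest values, as in Algorithm 3.

import heapq
from multiprocessing import Pool

def discretize_split_point(values, num_children=4):
    """Parallel sort of one continuous attribute followed by midpoint split selection."""
    chunk = max(1, len(values) // num_children)                 # roughly N/P cases per child
    chunks = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with Pool(processes=min(num_children, len(chunks))) as pool:
        sorted_runs = pool.map(sorted, chunks)                  # each child sorts its own cases
    merged = list(heapq.merge(*sorted_runs))                    # merge phase across children
    a, b = merged[0], merged[-1]                                # a = lowest value, b = highest value
    return (a + b) / 2                                          # split point = (a + b) / 2

# e.g. discretize_split_point([5.1, 2.3, 7.8, 0.4, 6.6]) returns 4.1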
3.3 Pruning Decision Tree
Pruning is usually carried out to avoid overfitting the training data; it eliminates those parts of the tree that are not descriptive enough to predict future data. A pruned decision tree has a simpler structure and good generalization ability, which may come at the expense of classification accuracy. The pruning method called Reduced Error Pruning (REP) is simple and produces a reduced
tree quickly, but classification accuracy is affected. For optimization purposes we want a simpler structure, but at the same time preserving classification accuracy is one of our main concerns. To minimize this trade-off, a novel method for pruning decision trees, proposed in [24], is used in this paper. This method, Cost and Structural Complexity (CSC), takes into account both classification ability and structural complexity, evaluating the structural complexity in terms of the number of all kinds of nodes, the depth of the tree, the number of conditional attributes and the possible class types.
The cost and structural complexity of a subtree of the decision tree T (to be pruned) is defined as

CSC(Subtree(T)) = 1 − r(v) + Sc(v)

where r(v) is the explicit degree of the conditional attribute v and Sc(v) is the structural complexity of v.
Here, the explicit degree of a conditional attribute is measured in order to evaluate explicitness before and after pruning. This measurement is desirable for maintaining explicitness as far as possible and thereby achieving high classification accuracy.
A useful feature of the pruning method is its post-pruning action, which deals with the 'horizon effect' that causes inconsistency in pre-pruning [24]. The pruning pays attention to overcoming the over-learning of details of the data that occurs when a subtree has many leaf nodes relative to the number of classes. Along with this, the method also considers the number of conditional attributes of a subtree against the set of all possible conditional attributes. The final pruned tree is simple in structure, which is a very desirable property for an optimization algorithm.
This pruned tree is then treated as the induced decision tree, for which we again estimate classification accuracy using cross-validation as described in Section 3.1. It should be noted that the time spent on pruning a large dataset is a small fraction, less than 1%, of the initial tree generation time [22]. Therefore, the pruning process does not add much overhead to the computational complexity. Experimental results on classification accuracy for both pruned and non-pruned decision trees are given in the following section.
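To show where such a criterion plugs in, the sketch below wires a generic bottom-up post-pruning pass around a CSC-style score, operating on the nested-dict trees of the earlier sketches (classify is reused from the cross-validation sketch). The explicit_degree and structural_complexity functions, and the alpha weighting, are placeholders for the r(v) and Sc(v) definitions of [24], so the scoring details are our assumptions rather than the published method.

from collections import Counter

def structural_complexity(tree):
    """Placeholder Sc(v): simply the number of nodes in the subtree (assumption)."""
    if not isinstance(tree, dict):
        return 1
    attr = next(iter(tree))
    return 1 + sum(structural_complexity(child) for child in tree[attr].values())

def explicit_degree(tree, rows, labels):
    """Placeholder r(v): fraction of the covered training rows classified correctly (assumption)."""
    correct = sum(classify(tree, row) == lab for row, lab in zip(rows, labels))
    return correct / len(labels)

def csc(tree, rows, labels, alpha=0.01):
    """CSC-style score: misclassification cost plus an alpha-weighted structural-complexity
    term; the alpha weighting is our simplification of the published CSC measure."""
    return (1.0 - explicit_degree(tree, rows, labels)) + alpha * structural_complexity(tree)

def prune(tree, rows, labels):
    """Bottom-up post-pruning: replace a subtree by a majority-class leaf when that lowers the score."""
    if not isinstance(tree, dict) or not labels:
        return tree
    attr = next(iter(tree))
    for value in list(tree[attr]):
        idx = [i for i, row in enumerate(rows) if row.get(attr) == value]
        tree[attr][value] = prune(tree[attr][value],
                                  [rows[i] for i in idx], [labels[i] for i in idx])
    leaf = Counter(labels).most_common(1)[0][0]
    return leaf if csc(leaf, rows, labels) <= csc(tree, rows, labels) else tree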
Next, we perform a theoretical comparison between our proposed framework and other existing parallel strategies for decision tree algorithms. The parallel strategies described in [14], namely dynamic data fragments and static (both horizontal and vertical) data fragments, are considered for a comparison in terms of load balancing and communication cost. Our framework proposes an optimization algorithm that pays particular attention to uniform load balancing among the child processors, as explained in Section 3.1. Moreover, the communication cost is also reduced in our proposed framework. The comparison is given in Table 2.
Table 2. Comparison of Parallel Strategies

Strategies                 Communication Cost   Load Balancing
Dynamic Data Fragment      High                 Low
Vertical Data Fragment     Low                  High
Horizontal Data Fragment   Low                  Medium
Our Framework              Low                  Excellent
4 Experimental Results
We used MPI (Message Passing Interface) [15] as the communication protocol for the implementation, in order to allow processes to communicate with one another by sending and receiving messages. The reason for choosing MPI is its support for point-to-point and collective communication, features that are very significant for high-performance computing and make MPI a dominant and powerful model for computationally intensive work.
We used the mpicc compiler for compilation. For testing purposes, we implemented and tested our proposed formulation on benchmark classification datasets from the UCI machine learning repository [1]. We tested on ten datasets from this benchmark collection. Table 3 summarizes the important parameters of these benchmark datasets.
Table 3. Benchmark Datasets

Dataset     # Examples   # Classes   # Conditional Attributes
Adult       32561        2           14
Australian  460          2           14
Breast      466          2           10
Cleve       202          2           13
Diabetes    512          2           8
Heart       180          2           13
Pima        512          2           8
Satimage    4485         6           36
Vehicle     564          4           18
Wine        118          3           13
The comparison with the serial implementation in terms of computation time is given in Table 4.
The effect of pruning with REP and CSC in our framework is shown in terms of the reduced tree sizes, which can be compared in Table 5.
Classification accuracy is one of the major measures of a decision tree algorithm's performance. In this proposed framework, prediction was evaluated using the cross-validation approach. The average accuracy over all ten datasets was calculated both with and without pruning.
Table 4. Comparison of Execution Times

Dataset     Serial   Parallel
Adult       115.2    17.6
Australian  1.09     0.0027
Breast      1.08     0.0032
Cleve       0.55     0.0015
Diabetes    1.09     0.004
Heart       0.31     0.0008
Pima        1.01     0.0022
Satimage    7.56     1.42
Vehicle     1.02     0.0018
Wine        0.31     0.00003
Table 5. Tree Size before (Hunt's Method) and after Pruning (REP and CSC)

Dataset     Hunt's Method   REP    CSC
Adult       7743            1855   246
Australian  121             50     30
Breast      33              29     15
Cleve       56              42     29
Diabetes    51              47     27
Heart       41              41     22
Pima        55              53     32
Satimage    507             466    370
Vehicle     132             128    112
Wine        9               9      9
Table 6. Classification Accuracy on Datasets (in %)

Dataset     Without Pruning   REP    CSC
Adult       84                85     87.2
Australian  85.5              87     87
Breast      94.4              93.1   95
Cleve       75.4              77     77.2
Diabetes    70.2              69.9   70.8
Heart       82.7              82.7   84.1
Pima        73.1              73.2   74.9
Satimage    85                84.2   86.1
Vehicle     68.4              67.7   69.2
Wine        86                86     86
Table 6 shows the classification accuracy on the test sets before and after pruning, which we believe is a standard way of measuring the predictive accuracy of our hypothesis. From the table we observe that the inclusion of pruning using the method in [24] achieves better accuracy.
5 Conclusion
In this paper, we proposed an optimized formulation of decision trees in terms of a parallel strategy for an inductive classification learning algorithm. We designed the algorithm in a partition and integration manner, which reduces the workload of a single processor by distributing it among a number of child processors. Instead of performing the computation for the entire table, each child processor computes a particular portion of the training set, and upon receiving the results from the respective processors, the root processor forms the decision tree. An existing pruning method that draws a balance between structural complexity and classification accuracy is incorporated into our framework, producing a simply structured tree that generalizes well to new, unseen data. Our experimental results on benchmark datasets indicate that the inclusion of the parallel algorithm along with pruning optimizes the performance of the classifier, as indicated by the classification accuracy.
In future work, we will concentrate on several issues regarding the performance of the decision tree. First, we will experiment with our algorithm on image datasets in order to facilitate different computer vision applications. We will also focus on selecting multiple splitting criteria for the discretization of continuous attributes.
References
1. UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html/
2. An, A., Cercone, N.J.: Discretization of Continuous Attributes for Learning Classification Rules. In: Zhong, N., Zhou, L. (eds.) PAKDD 1999. LNCS (LNAI), vol. 1574, pp. 509–514. Springer, Heidelberg (1999)
3. Blockeel, H., Struyf, J.: Efficient algorithms for decision tree cross-validation. In: Proceedings of ICML 2001 - Eighteenth International Conference on Machine Learning, pp. 11–18. Morgan Kaufmann (2001)
4. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Statistics/Probability Series. Wadsworth Publishing Company, Belmont (1984)
5. Chen, M.-S., Han, J., Yu, P.S.: Data mining: An overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering 8, 866–883 (1996)
6. Chung, H., Gray, P.: Data mining. Journal of Management Information Systems 16(1), 11–16 (1999)
7. Goil, S., Aluru, S., Ranka, S.: Concatenated parallelism: A technique for efficient parallel divide and conquer. In: Proceedings of the 8th IEEE Symposium on Parallel and Distributed Processing (SPDP 1996), pp. 488–495. IEEE Computer Society, Washington, DC (1996)
8. Joshi, M.V., Karypis, G., Kumar, V.: ScalParC: A new scalable and efficient parallel classification algorithm for mining large datasets. In: Proceedings of the 12th International Parallel Processing Symposium, IPPS 1998, pp. 573–579. IEEE Computer Society, Washington, DC (1998)
9. Senthamarai Kannan, K., Sailapathi Sekar, P., Mohamed Sathik, M., Arumugam, P.: Financial stock market forecast using data mining techniques. In: Proceedings of the International MultiConference of Engineers and Computer Scientists (IMECS 2010), Hong Kong, March 17-19, pp. 555–559 (2010)
10. Koh, H.C., Tan, G.: Data mining applications in healthcare. Journal of Healthcare Information Management 19(2), 64–72 (2005)
11. Kreuze, D.: Debugging hospitals. Technology Review 104(2), 32 (2001)
12. Kumar, V., Grama, A., Gupta, A., Karypis, G.: Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin-Cummings Publishing Co., Inc., Redwood City (1994)
13. Lipschutz, S.: Schaum's Outline of Theory and Problems of Data Structures. McGraw-Hill (1986)
14. Liu, X., Wang, G., Qiao, B., Han, D.: Parallel strategies for training decision tree. Computer Science J. 31, 129–135 (2004)
15. Madai, B., AlShaikh, R.: Performance modeling and MPI evaluation using Westmere-based InfiniBand HPC cluster. In: Proceedings of the 2010 Fourth UKSim European Symposium on Computer Modeling and Simulation, Washington, DC, USA, pp. 363–368 (2010)
16. Mehta, M., Agrawal, R., Rissanen, J.: SLIQ: A Fast Scalable Classifier for Data Mining. In: Apers, P.M.G., Bouzeghoub, M., Gardarin, G. (eds.) EDBT 1996. LNCS, vol. 1057, pp. 18–32. Springer, Heidelberg (1996)
17. Milley, A.: Healthcare and data mining. Health Management Technology 21(8), 44–47 (2000)
18. Quinlan, J.R.: Induction of decision trees. Machine Learning 1(1), 81–106 (1986)
19. Quinlan, J.R.: C4.5: Programs for Machine Learning, 1st edn. Morgan Kaufmann, San Mateo (1992)
20. Quinlan, J.R.: Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research 4, 77–90 (1996)
21. Shafer, J., Agrawal, R., Mehta, M.: SPRINT: A scalable parallel classifier for data mining. In: VLDB, pp. 544–555 (1996)
22. Srivastava, A., Han, E., Kumar, V., Singh, V.: Parallel formulations of decision-tree classification algorithms. Data Mining and Knowledge Discovery: An International Journal, 237–261 (1998)
23. Trybula, W.J.: Data mining and knowledge discovery. Annual Review of Information Science and Technology (ARIST) 32, 197–229 (1997)
24. Wei, J.M., Wang, S.Q., Yu, G., Gu, L., Wang, G.Y., Yuan, X.J.: A novel method for pruning decision trees. In: Proceedings of the 8th International Conference on Machine Learning and Cybernetics, July 12-15, pp. 339–343 (2009)