International Journal of Science and Research (IJSR) | ISSN (Online): 2319-7064 | Volume 5, Issue 4, April 2016 | www.ijsr.net | Licensed under Creative Commons Attribution CC BY
A Survey on Decision Tree Algorithms of
Classification in Data Mining
Himani Sharma1, Sunil Kumar2
1M.Tech Student, Department of Computer Science, SRM University, Chennai, India
2Assistant Professor, Department of Computer Science, SRM University, Chennai, India
Abstract: As computer and network technology develop, the volume of data in the information industry keeps growing, and it is necessary to analyze this large amount of data and extract useful knowledge from it. The process of extracting useful knowledge from huge sets of incomplete, noisy, fuzzy, and random data is called data mining. Decision tree classification is one of the most popular data mining techniques; it uses divide and conquer as its basic learning strategy. A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes an outcome of the test, and each leaf node holds a class label. The topmost node in the tree is the root node. This paper focuses on the main decision tree algorithms (ID3, C4.5, CART) and on their characteristics, challenges, advantages, and disadvantages.
Keywords: Decision Tree Learning, classification, C4.5, CART, ID3
1. Introduction
To discover the knowledge that a decision maker needs, the data miner applies data mining algorithms to the data obtained from the data collector. The privacy issues that come with data mining operations are twofold. If personal information can be observed directly in the data, the privacy of the original data owner (i.e., the data provider) is compromised. On the other hand, equipped with many powerful data mining techniques, the data miner is able to uncover various kinds of information underlying the data, and sometimes the mining results reveal sensitive information about the data owners. Because the data miner works on already modified data, the objective in such settings is to compare the performance of an established classification method against a newly introduced one. Previous studies show that ensemble techniques provide better results than a single decision tree, and this concern motivated the present survey.
1.1 Decision Tree
A decision tree is a flowchart-like tree structure in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. Given a tuple X, the attribute values of the tuple are tested against the decision tree: a path is traced from the root to a leaf node, which holds the class prediction for the tuple. Decision trees are easy to convert into classification rules. Decision tree learning uses a decision tree as a predictive model that maps observations about an item to conclusions about the item's target value; it is one of the predictive modelling approaches used in statistics, data mining, and machine learning. Tree models in which the target variable can take a finite set of values are called classification trees; in this tree structure, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees can be constructed relatively quickly compared with other classification methods, SQL statements for accessing databases efficiently can be constructed from the tree, and decision tree classifiers obtain accuracy similar to or better than other classification methods.
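As a concrete illustration of this traversal, the following minimal Python sketch classifies a tuple by tracing it from the root to a leaf. The tree shape, attribute names (outlook, humidity), and class labels are hypothetical examples, not taken from the paper.

```python
# Minimal sketch: tracing a tuple from the root of a decision tree to a leaf.
# The tree is a nested dict; attribute names and labels are hypothetical.

tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {
                "high": {"label": "no"},      # leaf nodes carry class labels
                "normal": {"label": "yes"},
            },
        },
        "overcast": {"label": "yes"},
        "rainy": {"label": "no"},
    },
}

def classify(node, x):
    """Trace a path from the root to a leaf and return its class label."""
    while "label" not in node:            # internal node: test an attribute
        value = x[node["attribute"]]      # the test outcome selects a branch
        node = node["branches"][value]
    return node["label"]

print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))  # -> "yes"
```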
A number of data mining techniques have already been applied to educational data mining to improve the performance of students, such as regression, genetic algorithms, Bayes classification, k-means clustering, association rules, and prediction. Data mining techniques can be used in the educational field to enhance our understanding of the learning process, focusing on identifying, extracting, and evaluating variables related to students' learning. Classification is one of the most frequently used techniques; the C4.5, ID3, and CART decision tree algorithms are applied to student data to predict performance. These algorithms are explained below.
2. ID3 Algorithm
Iterative Dichotomiser 3 (ID3) is a simple decision tree learning algorithm introduced by J. Ross Quinlan in 1986. It is implemented serially and is based on Hunt's algorithm. The basic idea of ID3 is to construct the decision tree by employing a top-down, greedy search through the given sets to test each attribute at every tree node. The information gain approach is generally used to determine a suitable attribute for each node of the generated tree: we select the attribute with the highest information gain (i.e., the greatest entropy reduction) as the test attribute of the current node. In this way, the information needed to classify the training sample subsets obtained from subsequent partitioning is smallest, so using this attribute to partition the sample set contained in the current node reduces the class mixture of the generated sample subsets to a minimum. Hence, this information-theoretic approach effectively reduces the number of splits required to classify an object.
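This selection rule can be sketched in a few lines of Python. The helpers below compute entropy and information gain for a toy data set; the rows and attribute names are hypothetical, and a full ID3 implementation would recurse on the chosen attribute.

```python
# Minimal sketch of ID3's attribute selection: pick the attribute with the
# highest information gain (largest entropy reduction). Data are hypothetical.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    labels = [r[target] for r in rows]
    n = len(rows)
    remainder = 0.0
    for v in {r[attr] for r in rows}:          # partition on each attribute value
        subset = [r[target] for r in rows if r[attr] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

rows = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
]
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, a, "play"))
print(best)   # -> 'outlook' is chosen as the test attribute here
```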
3. C4.5 Algorithm
C4.5 is an algorithm for generating a decision tree, developed by Ross Quinlan as an extension of his earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier. As its splitting criterion, C4.5 uses the gain ratio (information gain normalized by split information). It accepts data with categorical or numerical values; to handle continuous values, a threshold is generated and attributes are divided into values above the threshold and values equal to or below it. C4.5 can easily handle missing values, as missing attribute values are not used in its gain calculations.
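A rough Python sketch of this criterion follows: candidate thresholds for a continuous attribute are taken as midpoints between consecutive sorted values, and each resulting binary split is scored by gain ratio (information gain divided by split information). The attribute values and labels are hypothetical.

```python
# Minimal sketch of C4.5-style splitting on a continuous attribute, scored by
# gain ratio. Values and labels are hypothetical toy data.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    n = len(labels)
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    gain = entropy(labels) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)
    # Split information: entropy of the partition sizes themselves.
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info if split_info > 0 else 0.0

humidity = [65, 70, 75, 80, 90, 95]
play     = ["yes", "yes", "yes", "no", "no", "no"]
# Candidate thresholds: midpoints between consecutive sorted values.
s = sorted(humidity)
candidates = [(a + b) / 2 for a, b in zip(s, s[1:])]
best = max(candidates, key=lambda t: gain_ratio(humidity, play, t))
print(best)   # -> 77.5 in this toy example (a perfectly separating threshold)
```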
3.1 The C4.5 algorithm has the following advantages:
- Handling attributes with differing costs.
- Handling training data with missing attribute values: C4.5 allows attribute values to be marked as "?" for missing, and missing attribute values are not used in gain and entropy calculations.
- Handling both continuous and discrete attributes: to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it (as in the sketch above).
- Pruning trees after creation: C4.5 goes back through the tree once it has been created and attempts to remove branches that are not needed, replacing them with leaf nodes.
3.2 C4.5's tree-construction algorithm differs from CART in several respects, for instance:
- Tests in CART are always binary, whereas C4.5 allows two or more outcomes.
- CART uses the Gini index to rank tests, whereas C4.5 uses information-based criteria.
- CART prunes trees with a cost-complexity model whose parameters are estimated by cross-validation, whereas C4.5 uses a single-pass algorithm derived from binomial confidence limits.
- When some of a case's values are unknown, CART looks for surrogate tests that approximate the outcomes of the tested attribute, whereas C4.5 apportions the case probabilistically among the outcomes (a sketch follows this list).
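The probabilistic apportionment described above can be sketched as follows: a case whose tested attribute value is unknown is sent down every branch with a weight proportional to that branch's training frequency, and the weighted leaf class distributions are summed. The tree shape, branch weights, and leaf distributions below are hypothetical.

```python
# Minimal sketch of C4.5-style probabilistic apportionment for an unknown
# attribute value ("?"). All numbers here are hypothetical.
from collections import Counter

tree = {
    "attribute": "outlook",
    # Fraction of training cases that followed each branch.
    "branch_weight": {"sunny": 0.4, "overcast": 0.3, "rainy": 0.3},
    "branches": {
        "sunny": {"dist": {"yes": 0.2, "no": 0.8}},    # leaf class distributions
        "overcast": {"dist": {"yes": 1.0, "no": 0.0}},
        "rainy": {"dist": {"yes": 0.5, "no": 0.5}},
    },
}

def class_distribution(node, x, weight=1.0):
    if "dist" in node:                                  # leaf node
        return Counter({c: weight * p for c, p in node["dist"].items()})
    value = x.get(node["attribute"], "?")
    if value != "?":                                    # known value: one branch
        return class_distribution(node["branches"][value], x, weight)
    total = Counter()                                   # unknown: apportion
    for v, child in node["branches"].items():
        total += class_distribution(child, x, weight * node["branch_weight"][v])
    return total

print(class_distribution(tree, {"outlook": "?"}))
# roughly Counter({'yes': 0.53, 'no': 0.47}) -> predict "yes"
```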
4. CART Algorithm
CART stands for Classification And Regression Trees. It was introduced by Breiman in 1984 and builds both classification and regression trees. Classification tree construction in CART is based on binary splitting of the attributes. CART is also based on Hunt's algorithm and can be implemented serially. The Gini index is used as the splitting measure when selecting the splitting attribute. CART differs from other Hunt's-algorithm-based methods in that it can also be used for regression analysis, via regression trees; the regression feature is used to forecast a dependent variable from a set of predictor variables over a given period of time. CART supports continuous and nominal attribute data and has average processing speed.
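A minimal sketch of the Gini splitting measure follows: for a candidate binary split, CART-style selection prefers the split with the lowest weighted Gini impurity. The partitions and labels are hypothetical.

```python
# Minimal sketch of CART's splitting measure: the weighted Gini impurity of a
# candidate binary split (lower is better). Labels are hypothetical.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left_labels, right_labels):
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) \
         + (len(right_labels) / n) * gini(right_labels)

# A candidate binary split of ten training cases into two partitions:
left  = ["yes", "yes", "yes", "yes", "no"]
right = ["no", "no", "no", "no", "yes"]
print(gini_split(left, right))   # ~0.32; perfectly pure partitions give 0.0
```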
4.1 CART Advantages
1) Non-parametric (no probabilistic assumptions).
2) Automatically performs variable selection.
3) Uses any combination of continuous or discrete variables, with a very nice feature: the ability to automatically bin massively categorical variables (e.g., zip code, business class, make/model) into a few categories.
4) Establishes "interactions" among variables, which is good for rule search and for hybrid GLM-CART models.
Table 1: Comparison of different decision tree algorithms

Feature           | ID3                                      | C4.5                       | CART
Type of data      | Categorical                              | Continuous and categorical | Continuous and nominal
Speed             | Low                                      | Faster than ID3            | Average
Boosting          | Not supported                            | Not supported              | Supported
Pruning           | No                                       | Error-based post-pruning   | Cost-complexity post-pruning
Missing values    | Cannot deal with                         | Can deal with              | Can deal with
Splitting formula | Information entropy and information gain | Split info and gain ratio  | Gini diversity index
5. Decision Tree Learning Software
Some software tools used for data analysis and for decision tree learning on commonly used data sets are discussed below.

WEKA: The WEKA (Waikato Environment for Knowledge Analysis) workbench is a set of data mining tools developed by the machine learning group at the University of Waikato, New Zealand. It contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to this functionality. WEKA runs on the Windows, Linux, and Mac operating systems and provides various association, classification, and clustering algorithms. All of WEKA's techniques are predicated on the assumption that the data are available as a single flat file or relation, where each data point is described by a fixed number of attributes (normally numeric or nominal). It also provides pre-processors such as attribute selection algorithms and filters. WEKA provides J48, with which we can construct trees with error-based pruning (EBP), reduced-error pruning (REP), or no pruning.
GATree: GATree (Genetically Evolved Decision Trees) uses genetic algorithms to directly evolve classification decision trees. Instead of using binary strings, it adopts a natural representation of the problem by using a binary tree structure. An evaluation version of GATree is available on request to the authors. To generate decision trees, various parameters can be set, such as the number of generations, the population size, and the crossover and mutation probabilities.
Alice d'ISoft: The Alice d'ISoft software for data mining by decision tree is a powerful and inviting tool that allows the creation of segmentation models. For the business user, it makes it possible to explore data online, interactively and directly. Alice d'ISoft runs on the Windows operating system, and an evaluation version is available on request to the authors.
See5/C5.0: See5/C5.0 has been designed to analyze substantial databases containing thousands to millions of records and tens to hundreds of numeric, time, date, or nominal fields. It takes advantage of computers with up to eight cores in one or more CPUs (including Intel Hyper-Threading) to speed up the analysis. See5/C5.0 is easy to use and does not presume any special knowledge of statistics or machine learning. It is available for Windows XP/Vista/7/8 and Linux.
6. Applications of Decision Trees in Different Areas of Data Mining
Decision tree algorithms are widely used in many areas of real life. Some application areas are listed below.

Business: Decision trees are used in the visualization of probabilistic business models, in customer relationship management, and for credit scoring of credit card users.

Intrusion Detection: Decision trees have been used with genetic algorithms to automatically generate rules for intrusion detection expert systems. Abbes et al. proposed protocol analysis in intrusion detection using decision trees.

Energy Modeling: Decision trees are used for energy modeling; energy modeling for buildings is one of the important tasks in building design.

E-Commerce: Decision trees are widely used in e-commerce, for example to generate online catalogs, which are essential to the success of an e-commerce web site.

Image Processing: Decision tree classifiers have been used for perceptual grouping of 3-D features in aerial images.

Medicine: Medical research and practice are important application areas for decision tree techniques. Decision trees are most useful in the diagnosis of various diseases and are also used for heart sound diagnosis.

Industry: Decision tree algorithms are useful in production quality control (fault identification) and in non-destructive testing.

Intelligent Vehicles: Finding the lane boundaries of the road is an important task in the development of intelligent vehicles. Gonzalez and Ozguner proposed lane detection for intelligent vehicles using decision trees.

Remote Sensing: Remote sensing is a strong application area for pattern recognition work with decision trees. Researchers have proposed decision tree classification of land cover categories in remote sensing, and binary trees with genetic algorithms for land cover classification.

Web Applications: Chen et al. presented a decision tree learning approach to diagnosing failures in large Internet sites. Bonchi et al. proposed decision trees for intelligent web caching.
7. Conclusion
This paper surveyed various decision tree algorithms, each of which has its own pros and cons, as discussed above. The efficiency of the various decision tree algorithms can be analyzed in terms of their accuracy and the time taken to derive the tree. The paper gives students and researchers basic information about decision tree algorithms, tools, and applications.
References
[1] Anju Rathee and Robin Prakash Mathur, "Survey on decision tree classification algorithms for the evaluation of student performance," International Journal of Computers & Technology, Vol. 4, No. 2, March-April 2013, ISSN 2277-3061.
[2] S. Anupama Kumar and M.N. Vijayalakshmi, "Efficiency of decision trees in predicting student's academic performance," in D.C. Wyld et al. (Eds.): CCSEA 2011, CS & IT 02, pp. 335-343, 2011.
[3] Devi Prasad Bhukya and S. Ramachandram, "Decision tree induction: an approach for data classification using AVL-tree," International Journal of Computer and Electrical Engineering, Vol. 2, No. 4, August 2010.
[4] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd edition.
[5] S. Baik and J. Bala, "A decision tree algorithm for distributed data mining," 2004.
[6] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Francisco, CA, 1993.
[7] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining.
[8] Brijain R. Patel and Kushik K. Rana, "A survey on decision tree algorithm for classification," IJEDR, Vol. 2, Issue 1, 2014.
[9] Nilima Patil and Rekha Lathi, "Comparison of C5.0 and CART classification algorithms using pruning technique," 2012.
[10] Neha Midha and Vikram Singh, "A survey on classification techniques in data mining," IJCSMS (International Journal of Computer Science & Management Studies), Vol. 16, Issue 01, July 2015.
[11] Juan Pablo Gonzalez and U. Ozguner, "Lane detection using histogram-based segmentation and decision trees," Proc. of IEEE Intelligent Transportation Systems, 2000.
[12] M. Chen, A. Zheng, J. Lloyd, M. Jordan, and E. Brewer, "Failure diagnosis using decision trees," Proc. of the International Conference on Autonomic Computing, 2004.
[13] F. Bonchi, F. Giannotti, G. Manco, M. Nanni, D. Pedreschi, C. Renso, and S. Ruggieri, "Data mining for intelligent web caching," Proc. of the International Conference on Information Technology: Coding and Computing, 2001, pp. 599-603.
[14] Ian H. Witten, Eibe Frank, and Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd edition, 2011.
[15] A. Papagelis and D. Kalles, "GATree: genetically evolved decision trees," Proc. 12th International Conference on Tools with Artificial Intelligence, 2000, pp. 203-206.
[16] T. Elomaa, Tools and Techniques for Decision Tree Learning, 1996.
[17] R. Quinlan, Data Mining Tools See5 and C5.0, RuleQuest Research, 1997.
[18] S.K. Murthy, S. Salzberg, S. Kasif, and R. Beigel, "OC1: randomized induction of oblique decision trees," in Proc. Eleventh National Conference on Artificial Intelligence, Washington, DC, July 11-15, 1993, AAAI Press, pp. 322-327.
[19] Dipak V. Patil and R.S. Bichkar, "Issues in optimization of decision tree learning," 2012.