
Witten IH, Frank E: Data Mining: Practical Machine Learning Tools and Techniques

BioMed Central
BioMedical Engineering OnLine
Open Access
Book review
Review of "Data Mining: Practical Machine Learning Tools and
Techniques" by Witten and Frank
Francisco Azuaje*
Address: Computer Science Research Institute, University of Ulster, Jordanstown, Co. Antrim, BT37 0QB, Northern Ireland, UK
Email: Francisco Azuaje* - fj.azuaje@ulster.ac.uk
* Corresponding author
Book details
Witten IH, Frank E: Data Mining: Practical Machine Learning
Tools and Techniques 2nd edition. San Francisco: Morgan
Kaufmann Publishers; 2005:560. ISBN 0-12-088407-0,
£34.99
In the early 1990s some sectors of the computer science community were developing the idea of data understanding as a discovery-driven, systematic and iterative process. This "data mining" research and development area was expected to take advantage of the expansion and consolidation of machine learning methodologies together with the integration of traditional statistical analysis and database management strategies. The main goal was to identify relevant, interesting and potentially novel informational patterns and relationships in large data sets to support decision making and knowledge discovery. In the mid 1990s developers and users of decision-making support systems in areas such as finance (e.g. credit approval and fraud detection applications), marketing and sales analysis (e.g. shopping patterns and sales prediction) were showing a great deal of enthusiasm about the business value of data mining applications. During the next few years international conferences, journals and books were more frequently reporting advances, tools and applications in other areas such as biomedical informatics, engineering, physics, law enforcement and agriculture. Today data mining is seen as a discipline or paradigm that actively aids in the development of these and other scientific areas (e.g. Web-based computing and systems biology).
Data mining has become a fundamental research topic in the progression of computing applications in health care and biomedicine. Advances in data mining have applications and implications in areas ranging from information management in healthcare organisations, consumer health informatics, public health and epidemiology, patient care and monitoring systems and large-scale image analysis to information extraction and classification of scientific literature [1]. Approaches, techniques and applications associated with data mining have also significantly supported different data understanding and decision support tasks in bio-signal processing, such as the classification, visualisation and identification of complex relationships between diagnostic variables or groups of patients [2,3].
In "Data Mining: Practical Machine Learning Tools and Techniques" Witten and Frank offer users, students and researchers alike a balanced, clear introduction to concepts, techniques and tools for designing, implementing and evaluating data mining applications. Although it puts emphasis on machine learning techniques, it also introduces basic statistical and information representation methods. This book provides a variety of simple yet elegant explanations to guide the reader to understand essential concepts and approaches. The book can also be seen as a well-structured, intensive tutorial, which excels in explaining how to implement solutions to different problems.
Another reason why this book represents a significant contribution to this area is the ability of the authors to bridge gaps between conceptual and theoretical discussions, methods and practical implementations. Obviously it would not be possible (or necessary) to cover, in a single book, all the range of problems and machine learning techniques applied to different domains. However, this book also succeeds in organising and summarising significant amounts of material useful to assist the reader in justifying the selection of specific solutions. This is accomplished without making exaggerated claims or oversimplifying fundamental definitions.
Published: 29 September 2006
BioMedical Engineering OnLine 2006, 5:51 doi:10.1186/1475-925X-5-51
Received: 27 September 2006
Accepted: 29 September 2006
This article is available from: http://www.biomedical-engineering-online.com/content/5/1/51
© 2006 Azuaje; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
The authors are also known for having led the conception and implementation of the Weka system. Weka is an open-source machine learning workbench used in this book to illustrate techniques and applications. Over the past five years Weka has facilitated educational activities at undergraduate and postgraduate levels. It has also become a reference tool to support the assessment of machine learning technologies and their applications in biomedicine and biology [4,5].
The book is divided into two parts. The first part consists of eight chapters introducing machine learning methods, data pre-processing, model evaluation and practical implementations. An important feature is the presentation of different techniques to evaluate model predictive quality and to compare different models (e.g. cross-validation methods, probability estimations, receiver operating characteristic curves). Decision trees, different classification rule methods, instance-based learning models and Bayesian networks are some of the machine learning techniques introduced. The second part focuses on the Weka system, which offers three graphical user interfaces: the Explorer, the Knowledge Flow Interface and the Experimenter. Compared with the first edition, the improvements include more information on neural networks and kernel models, as well as new (or updated) sections on methods, technical challenges and additional reading.
"Data Mining: Practical Machine Learning Tools and Techniques" may become a key reference for any student, teacher or researcher interested in using, designing and deploying data mining techniques and applications. This book also deals with various aspects relevant to undergraduate or research programmes in machine learning, intelligent systems, bioinformatics and biomedical informatics.
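As a minimal sketch of the cross-validation idea mentioned in the overview of the book's first part, the following Python fragment estimates the accuracy of a one-nearest-neighbour (instance-based) classifier by k-fold cross-validation. It is not taken from the book or from Weka: the function names (k_fold_indices, nearest_neighbour_predict, cross_validate) and the toy two-class data are assumptions made purely for illustration.

```python
import random
from typing import List, Sequence


def k_fold_indices(n: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def nearest_neighbour_predict(train_x: Sequence[Sequence[float]],
                              train_y: Sequence[int],
                              query: Sequence[float]) -> int:
    """Instance-based learning: return the label of the closest training point."""
    def squared_distance(i: int) -> float:
        return sum((a - b) ** 2 for a, b in zip(train_x[i], query))

    return train_y[min(range(len(train_x)), key=squared_distance)]


def cross_validate(xs: Sequence[Sequence[float]],
                   ys: Sequence[int],
                   k: int = 5,
                   seed: int = 0) -> float:
    """Mean accuracy over k folds; each fold is held out once for testing."""
    accuracies = []
    for held_out in k_fold_indices(len(xs), k, seed):
        test = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in test]
        train_y = [y for i, y in enumerate(ys) if i not in test]
        correct = sum(
            nearest_neighbour_predict(train_x, train_y, xs[i]) == ys[i]
            for i in held_out
        )
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)


# Hypothetical toy data: two well-separated classes in two dimensions.
xs = [(0.1 * i, 0.0) for i in range(10)] + [(5.0 + 0.1 * i, 1.0) for i in range(10)]
ys = [0] * 10 + [1] * 10

print("estimated accuracy:", cross_validate(xs, ys, k=5))
```

The same held-out estimate could be computed for any other classifier by swapping the prediction function, which is the kind of like-for-like model comparison that cross-validation supports.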
References
1. Shortliffe EH, Cimino JJ, (editors): Biomedical Informatics: Computer Applications in Health Care and Biomedicine 3rd edition. New York: Springer; 2006.
2. Sornmo L, Laguna P: Bioelectrical Signal Processing in Cardiac and Neurological Applications. London: Academic Press Inc; 2006.
3. Clifford G, Azuaje F, McSharry P, (editors): Advanced Methods and Tools for ECG Data Analysis. London: Artech House; 2006.
4. Frank E, Hall M, Trigg L, Holmes G, Witten IH: Data mining in bioinformatics using Weka. Bioinformatics 2004, 20:2479-81.
5. Browne F, Wang H, Zheng H, Azuaje F: An assessment of machine and statistical learning approaches to inferring networks of protein-protein interactions. Journal of Integrative Bioinformatics 2006, 3(2): [http://journal.imbio.de].