An introduction to the WEKA data mining system.
ABSTRACT This is a proposal for a half-day tutorial on Weka, an open-source Data Mining software package written in Java and available from www.cs.waikato.ac.nz/~ml/weka/index.html. The goal of the tutorial is to introduce faculty to the package and to the pedagogical possibilities for its use in the undergraduate computer science and engineering curricula. The Weka system provides a rich set of powerful Machine Learning algorithms for Data Mining tasks, some of which are not found in commercial data mining systems. It includes basic statistics and visualization tools, as well as tools for pre-processing, classification, and clustering, all available through an easy-to-use graphical user interface.
Data Mining studies algorithms and computational paradigms that allow computers to discover structure in databases, perform prediction and forecasting, and generally improve their performance through interaction with data. Machine Learning is concerned with building computer systems that have the ability to improve their performance in a given domain through experience. Machine Learning and Data Mining are becoming increasingly important areas of engineering and computer science and have been successfully applied to a wide range of problems in science and engineering. In recognition of their importance, more work is being done to incorporate them into the undergraduate curriculum.
Weka is a widely used package that is particularly popular for educational purposes. It is the companion software package of the book "Data Mining: Practical Machine Learning Tools and Techniques" by Ian H. Witten and Eibe Frank. The Weka team was recently awarded the 2005 ACM SIGKDD Service Award for their development of the Weka system, including the accompanying book.
As Gregory Piatetsky-Shapiro writes in the news item about this event (KDnuggets news, June 28, 2005), "Weka is a landmark system in the history of the data mining and machine learning research communities, because it is the only toolkit that has gained such widespread adoption and survived for an extended period of time (the first version of Weka was released 11 years ago)".
The purpose of this tutorial is to present an introduction to the Weka system and outline the major approaches to using Weka for teaching Machine Learning, Data Mining, and Web Mining. We will also present our experiences using Weka as the main tool for implementing Machine Learning and Web Mining student projects that have been developed in the framework of a National Science Foundation grant. In this framework, two basic applications of Weka will be used to illustrate the various Weka topics presented:
• Web document classification. Some basic classification schemes provided by Weka (Nearest Neighbor, Naïve Bayes, and Decision Trees) are used to create models of web documents in topic directories and then to classify new documents according to their topic.
• Intelligent web browser. Web documents are labeled with the preferences of web users, and Machine Learning models are created from them. These models are then used to classify documents returned by web searches according to the user preferences.
In this framework the following topics will be covered:
An Introduction to the WEKA Data Mining System
Central Connecticut State University
University of Hartford
"Drowning in Data yet Starving for Knowledge"
"Computers have promised us a fountain of wisdom but delivered a flood of data"
William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus
Data Mining: "The nontrivial extraction of implicit, previously unknown, and potentially useful information from data"
William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus
Data mining finds valuable information hidden in large volumes of data.
Data mining is the analysis of data and the use of software techniques for finding
patterns and regularities in sets of data.
Data Mining is an interdisciplinary field involving:
–High Performance Computing
KDnuggets : Polls : Data Mining Tools You Used in 2005 (May 2005)
Poll: Data mining/Analytic tools you used in 2005 [376 voters, 860 votes total]
• Enterprise-level: (US $10,000 and more)
Fair Isaac, IBM, Insightful, KXEN, Oracle, SAS, and
• Department-level: (from $1,000 to $9,999)
Angoss, CART/MARS/TreeNet/Random Forests,
Equbits, GhostMiner, Gornik, Mineset, MATLAB,
Megaputer, Microsoft SQL Server, Statsoft Statistica,
• Personal-level: (from $1 to $999): Excel, See5
• Free: C4.5, R, Weka, Xelopes
Data Mining Software
KDnuggets : News : 2005 : n13 : item2
SIGKDD Service Award is the highest service award in the field of data mining and knowledge discovery. It is given
to one individual or one group who has performed significant service to the data mining and knowledge discovery
field, including professional volunteer services in disseminating technical information to the field, education, and
The 2005 ACM SIGKDD Service Award is presented to the Weka team for their development of the freely-available
Weka Data Mining Software, including the accompanying book Data Mining: Practical Machine Learning Tools and
Techniques (now in second edition) and much other documentation.
The Weka team includes Ian H. Witten and Eibe Frank, and the following major contributors (in alphabetical order of
last names): Remco R. Bouckaert, John G. Cleary, Sally Jo Cunningham, Andrew Donkin, Dale Fletcher, Steve
Garner, Mark A. Hall, Geoffrey Holmes, Matt Humphrey, Lyn Hunt, Stuart Inglis, Ashraf M. Kibriya, Richard
Kirkby, Brent Martin, Bob McQueen, Craig G. Nevill-Manning, Bernhard Pfahringer, Peter Reutemann, Gabi
Schmidberger, Lloyd A. Smith, Tony C. Smith, Kai Ming Ting, Leonard E. Trigg, Yong Wang, Malcolm Ware, and
The Weka team has put a tremendous amount of effort into continuously developing and maintaining the system since
1994. The development of Weka was funded by a grant from the New Zealand Government's Foundation for
Research, Science and Technology.
The key features responsible for Weka's success are:
– it provides many different algorithms for data mining and machine learning
– it is open source and freely available
– it is platform-independent
– it is easily usable by people who are not data mining specialists
– it provides flexible facilities for scripting experiments
– it has kept up-to-date, with new algorithms being added as they appear in the research literature.
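To make the web document classification application from the abstract more concrete, the following is a minimal sketch of a Nearest Neighbor text classifier in plain Python. It does not use Weka or its API. The tiny training set, the topic labels, and all function names are hypothetical and chosen only for illustration, and cosine similarity over word counts is an assumed similarity measure, not one taken from the tutorial.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a document into a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def classify_nearest_neighbor(training, text):
    """Return the topic label of the most similar training document (1-NN)."""
    query = vectorize(text)
    best_label, best_sim = None, -1.0
    for doc, label in training:
        sim = cosine(vectorize(doc), query)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label


# Hypothetical labeled documents standing in for pages in a topic directory.
TRAINING = [
    ("java compiler syntax classes objects", "programming"),
    ("python scripts functions modules", "programming"),
    ("football league goals match season", "sports"),
    ("tennis match serve tournament court", "sports"),
]

predicted = classify_nearest_neighbor(TRAINING, "objects and classes in java")
print(predicted)  # prints "programming"
```

In a real setting the documents would be far longer, and a toolkit such as Weka would handle the pre-processing and offer other classification schemes to compare against.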
Weka Data Mining Software