Conference Paper

An introduction to the WEKA data mining system

DOI: 10.1145/1140124.1140127 Conference: Proceedings of the 11th Annual SIGCSE Conference on Innovation and Technology in Computer Science Education, ITiCSE 2006, Bologna, Italy, June 26-28, 2006
Source: DBLP


This is a proposal for a half-day tutorial on Weka, an open source Data Mining software package written in Java and available online. The goal of the tutorial is to introduce faculty to the package and to the pedagogical possibilities for its use in the undergraduate computer science and engineering curricula. The Weka system provides a rich set of powerful Machine Learning algorithms for Data Mining tasks, some not found in commercial data mining systems. These include basic statistics and visualization tools, as well as tools for pre-processing, classification, and clustering, all available through an easy-to-use graphical user interface.

Data Mining studies algorithms and computational paradigms that allow computers to discover structure in databases, perform prediction and forecasting, and generally improve their performance through interaction with data. Machine Learning is concerned with building computer systems that can improve their performance in a given domain through experience. Both are becoming increasingly important areas of engineering and computer science and have been successfully applied to a wide range of problems in science and engineering. In recognition of this importance, more work is now being done to incorporate these areas into the undergraduate curriculum.

Weka is a widely used package that is particularly popular for educational purposes. It is the companion software package of the book "Data Mining: Practical Machine Learning Tools and Techniques" by Ian H. Witten and Eibe Frank. The Weka team was recently awarded the 2005 ACM SIGKDD Service Award for their development of the Weka system, including the accompanying book.
As Gregory Piatetsky-Shapiro writes in the news item about this event (KDnuggets news, June 28, 2005), "Weka is a landmark system in the history of the data mining and machine learning research communities, because it is the only toolkit that has gained such widespread adoption and survived for an extended period of time (the first version of Weka was released 11 years ago)".

The purpose of this tutorial is to present an introduction to the Weka system and outline the major approaches to using Weka for teaching Machine Learning, Data and Web Mining. We will also present our experiences using Weka as a main tool for implementing Machine Learning and Web Mining student projects that have been developed in the framework of a National Science Foundation grant. In this framework, two basic applications of Weka will be used to illustrate the various Weka topics presented:

  • Web document classification. Some basic classification schemes provided by Weka (Nearest Neighbor, Naïve Bayes and Decision trees) are used to create models of web documents in topic directories and then to classify new documents according to their topic.
  • Intelligent web browser. Web documents are labeled with the preferences of web users and ML models are created. These models are then used to classify documents returned by web searches according to the user preferences.

In this framework the following topics will be covered:
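
As a rough illustration of the web document classification application described above, the following is a minimal sketch of a Naïve Bayes text classifier in plain Python. It does not use Weka or reproduce its implementation; the function names (`train_naive_bayes`, `classify`), the whitespace tokenizer, the Laplace smoothing, and the four toy "web documents" with their topics are all assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Split a document into lowercase word tokens (deliberately simplistic)."""
    return text.lower().split()


def train_naive_bayes(labeled_docs):
    """Build per-topic document and word counts from (text, topic) pairs."""
    doc_counts = Counter()              # number of training documents per topic
    word_counts = defaultdict(Counter)  # word frequencies per topic
    vocabulary = set()
    for text, topic in labeled_docs:
        doc_counts[topic] += 1
        for word in tokenize(text):
            word_counts[topic][word] += 1
            vocabulary.add(word)
    return doc_counts, word_counts, vocabulary


def classify(model, text):
    """Return the topic with the highest log-posterior score for the text."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_topic, best_score = None, float("-inf")
    for topic in doc_counts:
        # Log prior: fraction of training documents carrying this topic.
        score = math.log(doc_counts[topic] / total_docs)
        total_words = sum(word_counts[topic].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Laplace (add-one) smoothing avoids zero probabilities.
            smoothed = (word_counts[topic][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


# Hypothetical toy "topic directory": invented documents, not real data.
TRAINING_DOCS = [
    ("football match goal team league", "sports"),
    ("tennis player wins match title", "sports"),
    ("compiler language code program software", "computing"),
    ("program memory processor code hardware", "computing"),
]

model = train_naive_bayes(TRAINING_DOCS)
print(classify(model, "team wins league title"))
print(classify(model, "software code compiler"))
```

In a classroom setting the same workflow (label documents by topic, build a model, classify unseen documents) would typically be carried out with Weka's own classifiers rather than hand-written code; the sketch only makes the underlying idea concrete.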

Available from:
  • Source
    • "Step1: in this step we structured all records on Attribute-Relation File Format (ARFF), which is an input file format used by the machine learning tool WEKA [33]."

    Preview · Article · Jan 2015 · International Journal of Advanced Computer Science and Applications
  • Source
    • "For this, it uses Sahara's feature selection and static analysis infrastructure [2]. Specifically, Mojave uses the GainRatio [24] decision tree algorithm with feature ranking [16] for selection. The algorithm builds a decision tree by recursively selecting a feature that splits up the dataset into subsets with homogeneous classes."
    ABSTRACT: Software upgrades are frequent. Unfortunately, many of the upgrades either fail or misbehave. We argue that many of these failures can be avoided for new users of the upgrade by exploiting the characteristics of the upgrade and feedback from the users that have already installed it. To demonstrate that this can be achieved, we build Mojave, the first recommendation system for software upgrades. Mojave leverages data from the existing and new users, machine learning, and static and dynamic source analyses. For each new user, Mojave computes the likelihood that the upgrade will fail for him/her. Based on this value, Mojave recommends for or against the upgrade. We evaluate Mojave for two real upgrade problems with the OpenSSH suite. Initial results show that it provides accurate recommendations to the new users.
    Full-text · Conference Paper · Oct 2012
  • Source
    • "Waikato Environment for Knowledge Analysis (Weka) is an evaluation Tool used for Data Mining for data analysis and predictive modeling purposes. Present version of Weka is built on Java programming language, so provides better flexibility and can be deployed on any platform [19]. "
    ABSTRACT: The behavior and nature of attacks and threats to computer network systems have been evolving rapidly with the advances in computer security technology. At the same time however, computer criminals and other malicious elements find ways and methods to thwart such protective measures and find techniques of penetrating such secure systems. Therefore adaptability, or the ability to learn and react to a consistently changing threat environment, is a key requirement for modern intrusion detection systems. In this paper we try to develop a novel metric to assess the performance of such intrusion detection systems under the influence of attacks. We propose a new metric called feedback reliability ratio for an intrusion detection system. We further try to modify and use the already available statistical Canberra distance metric and apply it to intrusion detection to quantify the dissimilarity between malicious elements and normal nodes in a network.
    Full-text · Article ·