Abstract and Figures

In the current scenario of Big Data, open source Data Mining tools are very popular in business data analytics. The paper presents a comprehensive study of the three most popular open source data mining tools: R, RapidMiner and KNIME. The tools are compared by applying them to two real datasets. Performance is evaluated by building a decision tree on each dataset. Our objective is to find the best tool for classification. The study can help researchers, developers and users in selecting a tool for accuracy in their data analysis and prediction. Experiments show that the accuracy of a tool changes with the quantity and quality of the dataset. The results show that RapidMiner is the best tool, followed by KNIME and R.
International Journal of Artificial Intelligence and Knowledge Discovery Vol.6, Issue 1, Jan, 2016
Print-ISSN: 2231-2021 e-ISSN: 2231-0312 © RG Education Society (INDIA) 15
Comprehensive Study of Open-Source Big Data Mining Tools
Hemlata
Research Scholar, M.C.A. Department,
M.D.University, Rohtak
Haryana, India
e-mail-hemltatachahal@gmail.com
Dr. Preeti Gulia
Assistant Professor M.C.A. Department, M.D.University,
Rohtak
Haryana India
e-mail-preetigulia91l@gmail.com
Abstract: Big data mining tools are tools for extracting
useful information from large datasets containing structured,
semi-structured and unstructured data. Many such tools are
available, but only three are considered in the present study:
KNIME, R and Rapid Miner. The paper presents the characteristics,
supported platforms, advantages and disadvantages of the three tools.
The goal is to help researchers, educators and analysts
choose the best tool for different types of datasets
and scenarios. The analysis shows that all three tools are easy to
use and easy to extend. R is the best tool for statistical analysis and
accuracy, followed by Rapid Miner and KNIME.
Keywords: Big Data, Data Mining, Big Data Mining Tools,
KNIME, Rapid Miner, R, Big Data Analytics.
I. INTRODUCTION
Data mining is the process of discovering
knowledge or useful information in large volumes of stored
data. The basic idea in data mining is to discover
new information in the form of rules or patterns. Data mining
helps users analyze large volumes of unstructured,
structured and semi-structured data, now called
Big Data, and helps them draw conclusions or make decisions
from that data. Nowadays data mining has expanded its scope
to almost all fields, such as healthcare, business, education, law
and science.
Big data analytics refers to the analysis of huge
volumes of data, or big data. Big data is collected from a
large assortment of sources, such as social networks, videos,
digital images and sensors. The major aim of Big Data
Analytics is to discover new patterns and relationships that
may otherwise be invisible, and it can provide new insights about the
users who created the data. A number of tools, both professional
and non-professional, are available for mining and analysing
Big Data.
II. BIG DATA MINING TOOLS
Recent rapid changes in the field of databases have
made the use of Data Mining very simple. This has led to the
emergence of a large number of open source and free Data
Mining tools [2]. These tools are often used in this era of Big
Data. Hence the data mining tools used for big data can be
called Big Data Mining Tools (BDM Tools).
There are many BDM Tools available to users;
some are commercial and others are open-source and
free. Which tool is best depends on the user’s needs and the type of
data to be analysed. In this paper we have
chosen the three most popular open-source tools.
According to the 16th annual KDnuggets Software Poll (2015), R and
Rapid Miner were at the top of the Data Mining list [1]. According
to the poll, approximately 47% of the users use R as a Big Data
Mining and analytical tool. Rapid Miner is used by 31%
of the users, and 20% use KNIME as data science
software for the analysis of Big Data. The results of the annual
poll are shown graphically in figure 1. These three BDM
Tools are therefore outlined and compared. Other, specialized tools such
as KEEL, Orange, Weka, Scikit-learn, Tanagra, ELKI (clustering)
or Anatella (big data), and tools with small DM community
support (e.g. F#, GNU Octave) are not considered here.
Figure 1. The 16th annual KDnuggets Software Poll, 2015 [1]
III. OVERVIEW OF TOOLS IN STUDY
The three tools taken for study in this paper are expert
tools in different fields and for different types of data. KNIME is
an open-source integration environment. It can be downloaded
freely from https://www.knime.org and was developed by the team at
KNIME.com AG, Switzerland [5]. Rapid Miner is also an
open-source integrated machine learning environment, which is
used for text mining, business analysis and predictive
analytics [6]. It was developed by Rapid-I Company, Germany, and
can be downloaded from http://rapidminer.com. R is a
programming environment designed specifically for statistical computing.
It can be obtained freely from
https://www.rstudio.com. It was developed by the R Foundation,
University of Auckland, New Zealand [5]. The three tools are
introduced in Table 1 below.
TABLE I. OPEN SOURCE BDM TOOLS [5][6]

| TOOLS | KNIME | RapidMiner | R |
|---|---|---|---|
| Company Name | KNIME.com AG, Switzerland | Rapid-I Company, Germany | R Foundation, University of Auckland, New Zealand |
| Source | https://www.knime.org/ | http://rapidminer.com | https://www.rstudio.com |
| Description | An open source data analytics, reporting and integration platform. | An open source integrated environment for machine learning, data mining, text mining, predictive analytics and business analytics. | A programming language and software environment for statistical computing and graphics. |
A. KNIME
Konstanz Information Miner (KNIME) is an open source general data
mining tool whose name is pronounced “naim”. It was first
released in 2004 at the University of Konstanz, Germany. It
runs inside IBM’s Eclipse development environment.
KNIME is a modular environment, which enables easy visual
assembly and interactive execution of a data pipeline. It is a
very powerful analytical tool for extracting new knowledge
from the available data [6]. Its features, advantages and
disadvantages are summarised in figure 2.
Figure 2: Features, Advantages, Disadvantages of KNIME [7][8][10]
KNIME is designed as a teaching and research platform, which
allows the integration of different algorithms and tools in the
form of new nodes [10]. Initially it was used in pharmaceutical
research, but it is now used in many diverse fields
such as business intelligence, business forecasting and financial
analysis [7][8].
KNIME runs on platforms such as Windows XP, Mac OS X and
Linux, and supports distributed computing and client-server operation. In KNIME the
file formats for import and export are ASCII, dat, binary, dbms,
jdbc, xml, weka and images. An Excel spreadsheet file cannot be used as
input but can be produced as output by KNIME [6]. Visualization
techniques such as bar chart, line chart, bubble chart, density, pie
chart, histogram, box chart, scatter diagram and Cleveland dot plot are
available in KNIME [6].
B. RapidMiner
YALE (Yet Another Learning Environment) and Rapid-I were
extended and renamed Rapid Miner by the Rapid Miner
company, Germany. Rapid Miner is a general integrated
environment for machine learning and predictive analysis. It is
written in Java. Versions up to v.5 were open
source, but the latest version, v.6, is proprietary, with different
licenses: Starter, Personal, Professional and Enterprise. Of
these, the Starter version is free with some limitations. It has a
very powerful graphical user interface and has become very
popular in recent years, with large community support.
Rapid Miner was initially released in 2006, almost 10 years
ago. Its current version (6.1) was released on 8 October 2014.
It is a cross-platform software tool, i.e. it can run on any
operating system [7].
Gartner, Inc. has announced Rapid Miner as a leader in its 2016
Magic Quadrant for Advanced Analytics Platforms. Rapid
Miner has also been named the industry’s #1 open source
predictive analytics platform for the third consecutive year.
Figure 3: Features, Advantages, Disadvantages of RapidMiner[7][8]
In Rapid Miner the main focus is on processes and
sub-processes. These processes contain operators as components.
The operators are the data mining algorithms and data sources.
Operators are placed by drag and drop and are
connected through their inputs and outputs to form a data-flow.
Various wizards are available for process construction, which
makes the tool user-friendly [7]. It is an apt tool for the development
of applications in Data Mining as well as Text Mining. It is also
useful in industrial and business applications. The
features, advantages and disadvantages are represented
graphically in figure 3.
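The process and operator model described above can be illustrated with a minimal Python sketch. It is not RapidMiner's API: the `Process` class and the operator functions `read_rows`, `drop_missing` and `count_by_label` are hypothetical names chosen for illustration, and the in-memory rows stand in for a real data source. Each operator receives the output of the previous one as its input, which is the data-flow idea the text describes.

```python
from collections import Counter
from typing import Any, Callable, List

# An operator is any callable that maps an input to an output.
Operator = Callable[[Any], Any]


class Process:
    """A chain of operators run in order (hypothetical, not RapidMiner's API)."""

    def __init__(self) -> None:
        self.operators: List[Operator] = []

    def add(self, operator: Operator) -> "Process":
        # Returning self allows operators to be chained fluently.
        self.operators.append(operator)
        return self

    def run(self, data: Any = None) -> Any:
        # The output of each operator becomes the input of the next one.
        for operator in self.operators:
            data = operator(data)
        return data


def read_rows(_: Any) -> list:
    # Source operator: toy in-memory rows standing in for a file or database.
    return [
        {"age": 25, "label": "yes"},
        {"age": None, "label": "no"},
        {"age": 40, "label": "yes"},
    ]


def drop_missing(rows: list) -> list:
    # Preprocessing operator: keep only rows without missing values.
    return [row for row in rows if all(v is not None for v in row.values())]


def count_by_label(rows: list) -> dict:
    # Analysis operator: count the remaining rows per class label.
    return dict(Counter(row["label"] for row in rows))


process = Process().add(read_rows).add(drop_missing).add(count_by_label)
result = process.run()
print(result)
```

Because `Process.run` is itself a callable, a whole process can be added to another process as a single operator, which loosely mirrors the idea of sub-processes.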
As such, Rapid Miner is a handy tool for Data Mining, and with
its extension capabilities it has also become a powerful tool for
the analysis of Big Data. It supports different machine
learning algorithms such as extremely randomized trees and
various logic programming algorithms. With Hadoop support
(Radoop) it has become one of the most powerful Big Data
analytics tools.
Rapid Miner is supported on almost all platforms, including
Windows, Unix/Linux, Mac, multi-core systems, distributed
computing and client-server [6]. It supports input/output
file formats such as ASCII, csv, dat, dbms, xml, sap, pdf,
weka and images. Rapid Miner offers visualization
techniques such as bar chart, line chart, bubble chart, deviation
chart, density chart, survey plots, pie chart, histogram and box
chart [6].
C. R
R is an open-source programming language and statistical
tool. It is also an excellent option for Data Mining. A
successor of S (a statistical language developed at Bell Labs in
the 1970s), it was first released in 1993, almost 23 years ago. Its
current version (3.2.3) was released on 10 December 2015.
Ross Ihaka and Robert Gentleman, members of the R Core Team,
designed the language for statistical analysis. It is freely
available under the GNU General Public License (GPL).
The language does not provide a user-friendly environment, as
it uses a command-line shell for input. It is somewhat difficult to
learn all the commands, and the full potential of the language
can be utilized only after mastering it [7].
R has many advantages over other statistical packages. Its
beauty lies in the quality of its representation of data in the form of
charts, plots and mathematical equations. The strength of the
language is its well-designed, publication-quality plots,
mathematical symbols, equations and formulae.
R supports platforms such as Windows (x86-x64),
Unix/Linux, distributed computing and client-server. Many data
file formats (text files, binary files, xls, xml, weka, images, pdf,
csv, dbf, audio etc.) are supported for import and export.
Visualisation techniques such as bar chart, line chart, pie chart,
histogram, box chart and scatter diagram are available in R,
but bubble chart, density, survey plot, Andrews curves and
quartile are not [6].
The features, advantages and disadvantages of R are
summarised in tabular form in figure 4.
Figure 4: Features, Advantages, Disadvantages of R [7][8]
IV. COMPARISON OF TOOLS
The three tools used in the study are compared on the basis of
criteria such as programming language, target group, user interface and efficiency. The comparison is shown in
table 2 below.
Table 2: Comparison table [4][7]

| NAME | KNIME | Rapid Miner | R |
|---|---|---|---|
| Licence | | Proprietary (version 5.3.013 is available under AGPL) | GNU General Public Licence |
| Current Version | | 6.1 | 3.2.3 |
| Stable Release | | October 8, 2014 | December 10, 2015 |
| Programming Language | | Java | R (interpreted language) |
| Platform | Java (platform independent), based on Eclipse, which only executes as a plug-in | | Platform-independent script |
| Target Group | | learners, advanced, professional, enterprise users | advanced, professional users |
| Efficiency | | Parallel processing extension; Rapid Miner Server provides large-scale computation server functionality | Optional concurrent processing |
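The import/export formats listed for each tool in Section III can also be compared mechanically. The short Python sketch below copies those lists into sets and computes which formats all three tools share and which are listed for only one tool. The format names come from the text; the lowercase spelling, the shortening of "text file" and "binary files" to `text` and `binary`, and the decision to treat `ascii` and `text` as distinct entries are assumptions made here. The sketch does not distinguish import from export, and R's list in the text ends with "etc.", so it may be incomplete.

```python
# Import/export formats as listed in Section III (lowercased for comparison).
formats = {
    "KNIME": {"ascii", "dat", "binary", "dbms", "jdbc", "xml", "weka", "images"},
    "RapidMiner": {"ascii", "csv", "dat", "dbms", "xml", "sap", "pdf", "weka", "images"},
    "R": {"text", "binary", "xls", "xml", "weka", "images", "pdf", "csv", "dbf", "audio"},
}

# Formats supported by every tool.
common = set.intersection(*formats.values())

# Formats listed for exactly one tool: subtract the union of the other tools' sets.
unique = {
    tool: supported - set().union(*(f for t, f in formats.items() if t != tool))
    for tool, supported in formats.items()
}

print("Common to all three:", sorted(common))
for tool, only in unique.items():
    print(f"Only {tool}:", sorted(only))
```

The same set operations could be applied to the visualization techniques listed for each tool.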
V. CONCLUSION AND FUTURE SCOPE
Considering the above differences between the tools
studied, every tool has its own importance and advantages. No
tool can be claimed to be the best for all time and for all
types of data. For analyzing statistical data, R is the best tool,
but it has the disadvantage of a command-line (CUI) environment. If the GUI is taken
into consideration, KNIME and Rapid Miner are much better.
In future work, the differences can be examined in practice
by using real or test data sets in all three tools.
Other tools can also be included in the comparison, and the comparison
criteria can be chosen so that the best analysis tool can be identified.
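One way to carry out the practical comparison proposed above is to run the same test set through each tool and compute classification accuracy from the exported predictions. The sketch below assumes accuracy as the comparison criterion, which the text does not prescribe. The functions `accuracy` and `rank_tools`, the labels `tool_a` and `tool_b`, and all label values are placeholders for illustration, not measurements of any tool.

```python
from typing import Dict, List, Sequence, Tuple


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def rank_tools(
    y_true: Sequence[str], predictions: Dict[str, Sequence[str]]
) -> List[Tuple[str, float]]:
    """Return (tool, accuracy) pairs sorted from most to least accurate."""
    scores = {tool: accuracy(y_true, y_pred) for tool, y_pred in predictions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder labels only; real predictions would be exported from each tool.
y_true = ["yes", "no", "yes", "yes"]
predictions = {
    "tool_a": ["yes", "no", "no", "yes"],
    "tool_b": ["yes", "no", "yes", "yes"],
}
ranking = rank_tools(y_true, predictions)
print(ranking)
```

Other criteria, such as run time or memory use, could be ranked in the same way by swapping out the scoring function.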
REFERENCES
[1] http://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-software-used.html.
[2] Hasim, Nurdatillah, and Norhaidah Abu Haris. "A study of open-source
data mining tools for forecasting." Proceedings of the 9th International
Conference on Ubiquitous Information Management and
Communication. ACM, 2015.
[3] Graczyk, Magdalena, Tadeusz Lasota, and Bogdan Trawiński.
"Comparative analysis of premises valuation models using KEEL,
RapidMiner, and WEKA." Computational Collective Intelligence.
Semantic Web, Social Networks and Multiagent Systems. Springer
Berlin Heidelberg, 2009. 800-812.
[4] Lausch, Angela, Andreas Schmidt, and Lutz Tischendorf. "Data mining
and linked open data - New perspectives for data analysis in
environmental research." Elsevier, Ecological Modelling 295 (2015): 5-17.
[5] Jovic, Alan, Karla Brkic, and Nikola Bogunovic. "An overview of free
software tools for general data mining." Information and Communication
Technology, Electronics and Microelectronics (MIPRO), 2014 37th
International Convention on. IEEE, 2014.
[6] Al-Khoder, Ahmad, and Hazar Harmouch. "Evaluating four of the most
popular Open Source and Free Data Mining Tools." IJASR International
Journal of Academic Scientific Research ISSN: 2272-6446 Volume 3,
Issue 1 (February - March).
[7] Chauhan, Neha, and Nisha Gautam. "PARAMETRIC COMPARISON
OF DATA MINING TOOLS."
[8] Rangra, Kalpana, and K. L. Bansal. "Comparative study of data mining
tools." International Journal of Advanced Research in Computer Science
and Software Engineering 4.6 (2014): 216-223.
[9] Murakami-Brundage, William. Data Mining Wikileaks. Thomas
Murakami-Brundage.
[10] Berthold, Michael R., et al. "KNIME-the Konstanz information miner:
version 2.0 and beyond." ACM SIGKDD Explorations Newsletter 11.1
(2009): 26-31.