All content in this area was uploaded by Dr Hemlata Chahal on Apr 21, 2019
International Journal of Artificial Intelligence and Knowledge Discovery Vol.6, Issue 1, Jan, 2016
Print-ISSN: 2231-2021 e-ISSN: 2231-0312 © RG Education Society (INDIA) 15
Comprehensive Study of Open-Source Big Data Mining Tools
Hemlata
Research Scholar, M.C.A. Department,
M.D.University, Rohtak
Haryana, India
e-mail-hemltatachahal@gmail.com
Dr. Preeti Gulia
Assistant Professor M.C.A. Department, M.D.University,
Rohtak
Haryana India
e-mail-preetigulia91l@gmail.com
Abstract- Big data mining tools are tools for extracting
useful information from large datasets containing structured,
semi-structured and unstructured data. Many such tools are
available, but only three are considered in the present study:
KNIME, R and RapidMiner. The paper presents the characteristics,
supported platforms, advantages and disadvantages of the three tools.
The goal is to help researchers, educators and analysts
choose the tool best suited to different types of datasets
and scenarios. The analysis finds that all three tools are easy to
use and easy to extend. R is the best tool for statistical analysis and
accuracy, followed by RapidMiner and KNIME.
Keywords-Big Data, Data Mining, Big Data Mining Tools,
KNIME, Rapid Miner, R, Big Data Analytics.
I. INTRODUCTION
Data mining is the process of extracting knowledge or useful
information from large volumes of stored data. Its basic aim is
to discover new information in the form of rules or patterns.
Data mining helps users analyze large volumes of unstructured,
structured and semi-structured data, now commonly called
Big Data, and draw conclusions or make decisions
from that data. Data mining has now expanded its scope
to almost all fields, such as healthcare, business, education, law
and science.
Big data analytics refers to the analysis of huge
volumes of data, or big data. Big data is collected from a
wide assortment of sources, such as social networks, videos,
digital images and sensors. The major aim of Big Data
Analytics is to discover new patterns and relationships that
would otherwise remain hidden, which can provide new insights
about the users who created the data. A number of tools, both
professional and non-professional, are available for mining and
analysing Big Data.
II. BIG DATA MINING TOOLS
Recent rapid advances in database technology have
made data mining much easier to apply. This has led to the
emergence of a large number of open-source and free data
mining tools [2]. These tools are often used in the current era of
Big Data. Hence the data mining tools used for big data can be
called Big Data Mining Tools (BDM Tools).
Many BDM Tools are available to users; some are commercial
and others are open-source and free. Which tool is best depends
on the user’s needs and the type of data to be analysed. In this
paper we have chosen three of the most popular open-source tools.
According to the 16th annual KDnuggets Software Poll (2015), R and
RapidMiner topped the list of data mining tools [1]. According
to the poll, approximately 47% of users use R as a Big Data
mining and analytics tool, 31% use RapidMiner and 20% use
KNIME as data science software for the analysis of Big Data.
The results of the poll are shown graphically in Figure 1. These
three BDM Tools are therefore outlined and compared. Other,
specialized tools such as KEEL, Orange, Weka, Scikit-learn,
Tanagra, ELKI (clustering) and Anatella (big data), and tools with
little DM community support (e.g. F#, GNU Octave), are not
considered here.
Figure 1. The 16th annual KDnuggets Software Poll, 2015 [1]
III. OVERVIEW OF TOOLS IN STUDY
The three tools studied in this paper are each strong in
different fields and on different types of data. KNIME is
an open-source integration environment. It can be downloaded
freely from https://www.knime.org and is provided by
KNIME.com AG, Switzerland [5]. RapidMiner is also an
open-source integrated environment, used for machine learning,
text mining, business analysis and predictive analytics [6].
It is provided by Rapid-I Company, Germany, and
can be downloaded from http://rapidminer.com. R is a
programming environment designed specifically for statistical
computing. It can be downloaded freely from
https://www.r-project.org (the RStudio development environment
for R is available separately from https://www.rstudio.com). It is
provided by the R Foundation, University of Auckland,
New Zealand [5]. The three tools are introduced in Table I below.
TABLE I. OPEN SOURCE BDM TOOLS [5][6]
TOOLS | KNIME | RapidMiner | R
Company Name | KNIME.com AG, Switzerland | Rapid-I Company, Germany | R Foundation, University of Auckland, New Zealand
Source | https://www.knime.org/ | http://rapidminer.com | https://www.r-project.org (RStudio IDE: https://www.rstudio.com)
Description | An open source data analytics, reporting and integration platform. | An open source integrated environment for machine learning, data mining, text mining, predictive analytics and business analytics. | A programming language and software environment for statistical computing and graphics.
A. KNIME
KNIME (Konstanz Information Miner) is an open-source,
general-purpose data mining tool; its name is pronounced “naim”.
It was first developed in 2004 at the University of Konstanz,
Germany. It is built on the Eclipse development environment.
KNIME is a modular environment that enables easy visual
assembly and interactive execution of a data pipeline. It is a
very powerful analytical tool for extracting new knowledge
from available data [6]. Its features, advantages and
disadvantages are summarised in Figure 2.
Figure 2: Features, Advantages, Disadvantages of KNIME [7][8][10]
KNIME is designed as a teaching and research platform that
allows different algorithms and tools to be integrated in the
form of new nodes [10]. It was initially used in pharmaceutical
research, but is now used in many diverse fields such as
business intelligence, business forecasting and financial
analysis [7][8].
KNIME runs on platforms such as Windows XP, Mac OS X and
Linux, and supports distributed computing and client-server
operation. The file formats it supports for import and export
include ASCII, dat, binary, dbms, jdbc, xml, weka and images.
According to [6], an Excel spreadsheet cannot be used as an
input file but can be written as output. KNIME offers
visualization techniques such as bar chart, line chart, bubble
chart, density plot, pie chart, histogram, box chart, scatter
diagram and Cleveland dot plot [6].
B. RapidMiner
RapidMiner was formerly known as YALE (Yet Another Learning
Environment) and Rapid-I; it was extended and renamed
RapidMiner by the RapidMiner company, Germany. RapidMiner is
a general integrated environment for machine learning and
predictive analysis. It is written in Java. Versions up to v.5
were open source, but the latest version, v.6, is proprietary and
offered under different licenses: Starter, Personal, Professional
and Enterprise. Of these, the Starter edition is free with some
limitations. It has a very powerful graphical user interface and
has become very popular in recent years, with large community
support. RapidMiner was first released in 2006, almost 10 years
ago. Its current version (6.1) was released on 8 October 2014.
It is a cross-platform tool that can run on any operating
system [7].
Gartner, Inc. named RapidMiner a leader in its 2016
Magic Quadrant for Advanced Analytics Platforms. RapidMiner
has been announced as the industry’s #1 open source
predictive analytics platform for the third consecutive year.
Figure 3: Features, Advantages, Disadvantages of RapidMiner [7][8]
In RapidMiner the main focus is on processes and
sub-processes, which contain operators as components. The
operators are the data mining algorithms and data sources. They
are placed by drag and drop and connected through their inputs
and outputs to form a data flow. Various wizards are available
for process construction, which makes the tool user-friendly [7].
It is an apt tool for developing applications in data mining as
well as text mining, and it is also useful in industrial and
business applications. Its features, advantages and disadvantages
are represented graphically in Figure 3.
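The process, sub-process and operator structure described above can be illustrated with a minimal Python sketch. It is not RapidMiner code and does not use any RapidMiner API: the class names `Operator` and `Process`, the sample rows and the income threshold are all hypothetical, chosen only to show operators being chained through their inputs and outputs into a data flow.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Rows = List[dict]


@dataclass
class Operator:
    """One step in the data flow: receives rows on its input, emits rows on its output."""
    name: str
    apply: Callable[[Rows], Rows]


@dataclass
class Process:
    """An ordered chain of operators (hypothetical stand-in for a visual process)."""
    name: str
    operators: List[Operator] = field(default_factory=list)

    def run(self, rows: Rows) -> Rows:
        # The output of each operator becomes the input of the next one.
        for op in self.operators:
            rows = op.apply(rows)
        return rows

    def as_operator(self) -> Operator:
        # Wrapping a process as an operator lets it be nested as a sub-process.
        return Operator(self.name, self.run)


def read_source(_: Rows) -> Rows:
    # Stand-in for a data source operator; the values are made up.
    return [
        {"age": 25, "income": 30},
        {"age": 40, "income": None},
        {"age": 58, "income": 90},
    ]


def drop_missing(rows: Rows) -> Rows:
    return [r for r in rows if all(v is not None for v in r.values())]


def label_by_income(rows: Rows) -> Rows:
    # Arbitrary threshold, used only to have an "algorithm" operator in the chain.
    return [{**r, "label": "high" if r["income"] >= 50 else "low"} for r in rows]


cleaning = Process("cleaning", [Operator("drop_missing", drop_missing)])
main = Process(
    "main",
    [
        Operator("read_source", read_source),
        cleaning.as_operator(),  # nested sub-process
        Operator("label_by_income", label_by_income),
    ],
)

result = main.run([])
print(result)
```

Treating a sub-process as just another operator is one simple way to model nesting; a graphical tool would additionally track named ports and allow branching flows, which this linear sketch omits.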
RapidMiner is a handy tool for data mining in its own right, and
its extension capabilities have also made it a powerful tool for
the analysis of Big Data. It supports a range of machine
learning algorithms, such as extremely randomized trees and
various logic programming algorithms. With Hadoop support
through the Radoop extension, it has become a very powerful
Big Data analytics tool.
RapidMiner runs on almost all platforms, including Windows,
Unix/Linux and Mac, and supports multi-core, distributed
computing and client-server operation [6]. It supports
input/output file formats such as ASCII, csv, dat, dbms, xml, sap,
pdf, weka and images. RapidMiner offers visualization
techniques such as bar chart, line chart, bubble chart, deviation
chart, density chart, survey plot, pie chart, histogram and box
chart [6].
C. R
R is an open-source programming language and statistical
tool. It is also an excellent option for data mining. A
successor of S (a statistical language developed at Bell Labs in
the 1970s), it was first released in 1993, almost 23 years ago. Its
current version (3.2.3) was released on 10 December 2015.
Ross Ihaka and Robert Gentleman, members of the R Core Team,
designed the language for statistical analysis. It is freely
available under the GNU General Public License (GPL).
The language does not provide a user-friendly environment, as
it takes input through a command-line shell. Learning all the
commands is somewhat difficult, and the full potential of the
language can be used only after mastering it [7].
It has many advantages over other statistical packages. Its
main strength lies in the high-quality representation of data in
the form of charts, plots and mathematical equations: it produces
well-designed, publication-quality plots that can include
mathematical symbols, equations and formulae.
R runs on platforms such as Windows (x86 and x64) and
Unix/Linux, and supports distributed computing and client-server
operation. Many data file formats (text files, binary files, xls,
xml, weka, images, pdf, csv, dbf, audio etc.) are supported for
import and export. Visualisation techniques such as bar chart,
line chart, pie chart, histogram, box chart and scatter diagram
are available in R. According to [6], however, bubble chart,
density plot, survey plot, Andrews curves and quartile plot are
not available in R.
The features, advantages and disadvantages of R are
summarised in Figure 4.
Figure 4: Features, Advantages, Disadvantages of R [7][8]
IV. COMPARISON OF TOOLS
The three tools in the study are compared on criteria such as
programming language, target group, user interface and
efficiency. The comparison is shown in Table 2 below:
Table 2: Comparison table [4][7]
NAME | KNIME | RapidMiner | R
Licence | GNU General Public Licence | Proprietary (Version 5.3.013 is available as AGPL) | GNU General Public Licence
Current Version | 3.1 | 6.1 | 3.2.3
Stable Release | December 6, 2015 | October 8, 2014 | December 10, 2015
Programming Language | Java | Java | R (interpreted language)
Platform | Java (platform independent), based on Eclipse, in which it executes only as a plug-in | Java (platform independent) | Platform-independent script
Target Group | learners, advanced users and researchers | learners, advanced, professional and enterprise users | advanced and professional users
Efficiency | The environment of concurrently running threads can be set up manually | Parallel processing extension; RapidMiner Server provides large-scale computation server functionality | Optional concurrent processing
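The efficiency criterion in the comparison refers to running work in concurrent threads, with the number of threads set manually. As a rough illustration of that idea only, the following Python sketch distributes independent data partitions over a manually sized thread pool. It does not reflect how KNIME, RapidMiner or R implement concurrency; the function `summarise`, the sample partitions and the value of `WORKER_THREADS` are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def summarise(partition):
    # Placeholder analysis step applied to one partition of the data.
    return {"n": len(partition), "mean": mean(partition)}


# Made-up data, already split into independent partitions.
partitions = [[1, 2, 3], [10, 20], [5, 5, 5, 5]]

# Thread count chosen by hand, mirroring a manually configured setting.
WORKER_THREADS = 2

with ThreadPoolExecutor(max_workers=WORKER_THREADS) as pool:
    # map() returns results in the same order as the input partitions.
    summaries = list(pool.map(summarise, partitions))

print(summaries)
```

In CPython, threads mainly help when the per-partition work waits on I/O or calls into native code; for CPU-bound pure-Python work a process pool would be the more usual choice.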
V. CONCLUSION AND FUTURE SCOPE
Considering the above differences between the tools studied,
each tool has its own importance and advantages. No single tool
can be claimed to be the best at all times and for all types of
data. For analyzing statistical data, R is the best tool, but it
has the disadvantage of a command-line user interface (CUI). If
the GUI is taken into consideration, KNIME and RapidMiner are
much better.
As future work, the differences can be examined in practice by
using a real or test dataset in all three tools. Other tools can
also be included in the comparison, and the comparison criteria
can be chosen so that the best analysis tool can be decided.
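One way such a practical comparison could be organised is sketched below in Python: the same small test dataset is passed to each candidate, and each is scored on chosen criteria. Everything here is hypothetical. `tool_a` and `tool_b` are stand-in functions rather than calls into KNIME, RapidMiner or R, the dataset is invented, and accuracy and elapsed time are only two assumed criteria among many possible ones.

```python
import time

# Tiny labelled test set of (feature value, true label) pairs; purely illustrative.
TEST_SET = [(1, "low"), (4, "low"), (6, "high"), (9, "high"), (5, "low")]


def tool_a(x):
    # Hypothetical stand-in for a model built in one tool.
    return "high" if x > 5 else "low"


def tool_b(x):
    # Hypothetical stand-in for a model built in another tool.
    return "high" if x > 3 else "low"


def evaluate(predict, data):
    """Score one candidate on the shared dataset using two assumed criteria."""
    start = time.perf_counter()
    predictions = [predict(x) for x, _ in data]
    elapsed = time.perf_counter() - start
    correct = sum(p == y for p, (_, y) in zip(predictions, data))
    return {"accuracy": correct / len(data), "seconds": elapsed}


candidates = {"tool_a": tool_a, "tool_b": tool_b}
scores = {name: evaluate(fn, TEST_SET) for name, fn in candidates.items()}

# Rank by accuracy only; other criteria could be weighted in as needed.
best = max(scores, key=lambda name: scores[name]["accuracy"])
print(best, scores)
```

Keeping the dataset and the scoring function fixed across candidates is what makes the results comparable; only the `predict` callable changes from one tool to the next.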
REFERENCES
[1] http://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-software-used.html.
[2] Hasim, Nurdatillah, and Norhaidah Abu Haris. "A study of open-source
data mining tools for forecasting." Proceedings of the 9th International
Conference on Ubiquitous Information Management and
Communication. ACM, 2015.
[3] Graczyk, Magdalena, Tadeusz Lasota, and Bogdan Trawiński.
"Comparative analysis of premises valuation models using KEEL,
RapidMiner, and WEKA." Computational Collective Intelligence.
Semantic Web, Social Networks and Multiagent Systems. Springer
Berlin Heidelberg, 2009. 800-812.
[4] Lausch, Angela, Andreas Schmidt, and Lutz Tischendorf. "Data mining
and linked open data - New perspectives for data analysis in
environmental research." Ecological Modelling 295 (2015): 5-17.
[5] Jovic, Alan, Karla Brkic, and Nikola Bogunovic. "An overview of free
software tools for general data mining." Information and Communication
Technology, Electronics and Microelectronics (MIPRO), 2014 37th
International Convention on. IEEE, 2014.
[6] Al-Khoder, Ahmad, and Hazar Harmouch. "Evaluating four of the most
popular Open Source and Free Data Mining Tools." IJASR International
Journal of Academic Scientific Research ISSN: 2272-6446 Volume 3,
Issue 1 (February - March).
[7] Chauhan, Neha, and Nisha Gautam. "PARAMETRIC COMPARISON
OF DATA MINING TOOLS."
[8] Rangra, Kalpana, and K. L. Bansal. "Comparative study of data mining
tools." International Journal of Advanced Research in Computer Science
and Software Engineering 4.6 (2014): 216-223.
[9] Murakami-Brundage, William. Data Mining Wikileaks. Thomas
Murakami-Brundage.
[10] Berthold, Michael R., et al. "KNIME - the Konstanz Information Miner:
version 2.0 and beyond." ACM SIGKDD Explorations Newsletter 11.1
(2009): 26-31.