Distributed Data Analytics using
RapidMiner and BOINC
Nico Schlitter
Karlsruhe Institute of Technology
Steinbuch Centre for Computing
Nico.Schlitter@kit.edu
Jörg Lässig
University of Applied Sciences Zittau/Görlitz
Enterprise Application Development Group
Joerg.Laessig@hszg.de
Abstract
RapidMiner is an open source environment for machine learning and
data analytics. It is intensively used for academic purposes at univer-
sities as well as for industrial or commercial applications. The BOINC
framework has also attracted attention as it provides the ability to easily
set up a distributed computing environment. This article addresses the
joint usage of RapidMiner and BOINC. We describe the integration of
both tools and present some of the research accomplishments of the
distributedDataMining.org project.
1 Introduction
The term big data refers to the challenge of capturing, storing, and processing
huge data sets. Due to the sheer amount of data, traditional data management
and data analysis approaches are no longer feasible. High performance
computing environments are traditionally used to overcome the problem. However,
applying state-of-the-art techniques like MapReduce [2] in high performance
grid computing infrastructures may exceed the financial means of most
researchers. Therefore, easy-to-use infrastructures are needed in order to supply
scientists with cost-efficient computing resources for data analytics.
In this article, we introduce a distributed computing project called dis-
tributedDataMining.org which supports scientists from different research areas
by providing computing power for data analysis purposes. We describe the
usage of the BOINC framework and the distribution of research-related computing
tasks to thousands of heterogeneous computing nodes located all over
the world. Each of these nodes uses the BOINC client to pull computing tasks
and data packages from a central server. Then, the BOINC client starts an
instance of the RapidMiner framework in order to process the data mining
tasks. The data mining results are sent back to a central project server, which
gathers the information and provides it to researchers for further analysis.
The remainder of this article is organized as follows. Section 2 sketches
the machine learning environment RapidMiner and the distributed comput-
ing framework BOINC. Both tools are used to apply distributed data analysis
tasks within the distributedDataMining.org project. This distributed comput-
ing project is introduced in Section 3. An overview of the research collaborations
in the field of Social Network Analysis, Time Series Analysis and Biological
Data Analysis is given in Section 4. Finally, we conclude in Section 5.
2 Tools
2.1 RapidMiner
RapidMiner [8] is an environment for machine learning, data mining, text min-
ing, predictive analytics, and business analytics. The RapidMiner project was
started in 2001 by Ralf Klinkenberg, Ingo Mierswa, and Simon Fischer at the
Artificial Intelligence Group of Katharina Morik at the Dortmund University
of Technology.
In 2007, the project formerly known as YALE was renamed and published
as RapidMiner version 4.0. Since then, the software has been hosted on SourceForge
and offered free of charge as a Community Edition released under the GNU
AGPL. There is also an Enterprise Edition offered under a commercial license
for integration into closed-source projects.
The software is written in Java and runs so-called processes. A process
is basically an XML file, generated by the user, that contains a sequence of
tasks, each represented by an operator. More than 500 operators are already
included in the software. Their functionality covers the main aspects
of data analysis such as data loading and transformation, data preprocessing
and visualization, modelling and model evaluation. By combining these operators,
basic machine learning tasks such as data mining, text mining, time
series analysis and forecasting, web mining as well as sentiment analysis and
opinion mining can be performed. The software also provides multiple methods
for visualizing high dimensional data sets. Since RapidMiner is written in
Java, it is platform independent and can easily be combined with other software
tools. For example, the well-known WEKA framework [7] has been completely
integrated into RapidMiner. In addition, RapidMiner provides a powerful
plug-in mechanism, which can be used to easily extend the functionality of
the core software.
Since 2007, RapidMiner has been heavily extended and has become one of
the most important data mining and data analytics tools. It is intensively used
in introductory courses and for academic purposes at universities all over the
world. RapidMiner is also used for industrial purposes by many companies
and consultants in different applications.
2.2 BOINC
The Berkeley Open Infrastructure for Network Computing (BOINC) is a soft-
ware framework for distributed and grid computing [1]. It was originally devel-
oped at the University of California at Berkeley for the SETI@home program,
which was founded to analyze radio signals, searching for signs of extraterrestrial
intelligence. The BOINC project started in 2002 by releasing its first
version under the terms of the GNU LGPL.
One main objective of the BOINC development team is the separation of
project management and research-related tasks. The framework supports researchers
by providing the necessary infrastructure to distribute computationally
intensive research tasks to several computers which are running the BOINC
client. The BOINC client downloads tasks for one or more research projects,
processes them and sends the results back to the project server. Thus,
scientists can focus on developing analysis algorithms rather than taking care
of distributing data and software to different locations by themselves.
Volunteers who want to provide their computing resources for research
purposes download the BOINC client and connect it to one of the many
existing research projects powered by BOINC. The client then autonomously
downloads all necessary data and analysis programs and starts contributing
to the project goals. Based on the amount of computing time spent, the
volunteers are rewarded with credit points. Even though these credit points
only possess an immaterial value and cannot be used to buy anything, they
are the subject of competitive behaviour among the volunteers. The earned credits
are recognized as a measure of their willingness to contribute to scientific
progress and therefore motivate volunteers to participate in BOINC projects.
In March 2013, the computing power of about 400,000 computers
was donated by volunteers to BOINC-powered research projects.
This amounts to an average of 9 PetaFLOPS and outperforms most of the high
performance compute clusters on the TOP500 list.
2.3 Drupal
Drupal is a free and open-source content management framework written in
PHP and distributed under the terms of the GNU GPL. The standard release
of Drupal contains basic features common to content management systems.
These include delivery of dynamic and static websites, user account
registration and maintenance, menu management, RSS feeds, and system administration.
The Drupal system can be easily customized in behaviour and
appearance. The core functionality can be extended by add-ons, which allow
new features to be added. The Drupal theming mechanism provides the possibility
to integrate different themes in order to change the layout and design of the
website.
2.4 Tool integration
We were looking for a system that allows us to perform independent data
analysis tasks using a cost-efficient but high performance computing infrastructure.
We decided to combine the data analysis functionality of RapidMiner
with BOINC's capability of job distribution. The advantages are obvious:
both frameworks are free and open source software under the AGPL and LGPL.
The actual computing resources are provided by enthusiastic volunteers who
generously support our research challenges by providing computing power
free of charge. Our investment was limited to the renting and housing of a
dedicated server with high-throughput network connectivity. In addition, we
spent approximately six person-months extending and customizing the available
software and developing an automatic data management workflow.
We implemented a module which allows Drupal to interact with the BOINC
framework. Thus, we were able to replace the standard web appearance of
BOINC and benefit from Drupal features. We extended the module by adding
charts, user profiles, a forum and all kinds of statistical information in order
to meet the needs of our project members. In doing so, we combined a state-of-the-art
content management system with the BOINC framework in order to
supply our members with up-to-date news.
In our setup, the RapidMiner environment is used to carry out independent
data analysis tasks, which are distributed to and performed on an armada of
volunteer computers. In this situation, we were facing a highly heterogeneous
infrastructure in terms of operating systems, amount of usable memory (RAM
and HDD) as well as types and number of available CPUs. Fortunately, the
BOINC framework provides powerful tools to cope with this situation.
In order to benefit from the immense computational power provided by
the BOINC framework and the committed volunteers, it is necessary to divide
the overall data analysis process into a large number of small independent
analysis tasks. Each of these tasks contains a portion of the data that needs
to be analysed and a description of the data analysis process which is to be
applied. In addition, some meta information is assigned to each task, such as
the reference to the application that performs the actual data analysis and its
estimated run time.
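As an illustration, the following minimal sketch shows the kind of information that might travel with a single task; the field names are purely illustrative and do not reflect BOINC's actual work-unit schema.

```cpp
// Hedged sketch of a task descriptor; field names are illustrative.
#include <string>

struct AnalysisTask {
    std::string data_file;     // the portion of the data to be analysed
    std::string process_file;  // description of the analysis process to apply
    std::string app_name;      // reference to the application that runs it
    double est_runtime_hours;  // estimated run time, used for scheduling
};
```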
BOINC provides a C++ library which enables a given scientific application
to communicate with the BOINC client. This communication is essential,
since the BOINC client needs to be able to start, suspend, resume or stop
the application. In addition, the application has to report the progress of its
computation during run time.
Unfortunately, there is currently no Java API available which allows a
developer to add this functionality to a given Java application. Therefore,
we had to develop a C++ application for Windows and Linux environments
that works as a wrapper for the Java-based RapidMiner. Our wrapper application
receives the communication signals from the BOINC client and controls
the RapidMiner software accordingly. The start signal leads to the execution
of a Java Runtime Environment (JRE) which then starts a RapidMiner
instance. Once started, suspending and resuming of execution can be easily
achieved by using the appropriate OS-specific system calls. During suspension
the RAM is preserved, and after resuming the application continues its execution
seamlessly. The stop signal immediately shuts down the JRE including
the RapidMiner instance. Furthermore, an application-dependent checkpoint
file is written which stores the computing progress reached so far. This checkpoint
information is used when the start signal is sent by the BOINC client.
The wrapper interprets the checkpoint information and manipulates the data
and process files in order to restart the data mining task at a certain point.
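The following is a minimal sketch of such a wrapper's control loop for the Linux case, assuming the BOINC C++ API from boinc_api.h; the jar and process file names are hypothetical, and checkpoint handling as well as the Windows variant are omitted.

```cpp
// Minimal POSIX sketch of a BOINC wrapper controlling a RapidMiner JVM.
// File names are hypothetical; checkpointing and Windows support omitted.
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include "boinc_api.h"  // boinc_init, boinc_get_status, boinc_finish

int main() {
    boinc_init();

    pid_t jre = fork();  // start signal: launch the JRE running RapidMiner
    if (jre == 0) {
        execlp("java", "java", "-jar", "rapidminer.jar",
               "process.xml", (char*)NULL);
        _exit(1);  // only reached if exec fails
    }

    bool suspended = false;
    for (;;) {
        BOINC_STATUS status;
        boinc_get_status(&status);

        if (status.quit_request || status.abort_request) {
            kill(jre, SIGKILL);   // stop signal: shut down the JRE
            break;
        }
        if (status.suspended && !suspended) {
            kill(jre, SIGSTOP);   // suspend: process frozen, RAM preserved
            suspended = true;
        } else if (!status.suspended && suspended) {
            kill(jre, SIGCONT);   // resume: execution continues seamlessly
            suspended = false;
        }

        int rc;
        if (waitpid(jre, &rc, WNOHANG) == jre)
            break;                // RapidMiner finished its task
        sleep(1);                 // poll the client status once per second
    }
    boinc_finish(0);  // report completion to the BOINC client
    return 0;
}
```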
These manipulations are highly application-dependent and cannot be generalized.
Each RapidMiner analysis process needs a specific wrapper application
which controls the checkpoint writing, reading and interpretation. The
implementation of these specific wrappers and checkpointing mechanisms is
time-consuming and a considerable conceptual effort for each analytical process.
The effort is minimized if the analysis process consists of steps that are
frequently repeated. An example would be a time series analysis task which
uses a sliding window approach to train a classification model and applies it to
the current time window in order to predict the next time series value. The
necessary steps are the same for each time window and are repeated as often
as there are time windows. Checkpointing could work as follows: the predictions
for each time window are appended to an overall result file. When the wrapper
restarts the analysis process, it first counts the predictions that are already in
the result file and then adapts the data file by removing the first time windows,
which have already been processed, as sketched below.
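A minimal sketch of this restart logic, under the assumption of one prediction per line in an illustrative results.csv and one time window per line of an illustrative data.csv; the actual file manipulation depends on the concrete process.

```cpp
// Hedged sketch of restart logic for a sliding-window task. Assumes one
// prediction per line in results.csv and one window per line in data.csv;
// both file names are illustrative.
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>

// The number of predictions already written acts as an implicit checkpoint.
static std::size_t predictions_done(const std::string& result_file) {
    std::ifstream in(result_file);
    std::string line;
    std::size_t n = 0;
    while (std::getline(in, line)) ++n;
    return n;
}

// Drop the already-processed windows so that RapidMiner resumes at the
// first unprocessed one when restarted on the trimmed data file.
static void trim_processed_windows(const std::string& data_file,
                                   std::size_t done) {
    std::ifstream in(data_file);
    std::ostringstream rest;
    std::string line;
    std::size_t skipped = 0;
    while (std::getline(in, line)) {
        if (skipped < done) { ++skipped; continue; }  // skip finished windows
        rest << line << '\n';
    }
    in.close();
    std::ofstream out(data_file, std::ios::trunc);
    out << rest.str();
}

int main() {
    trim_processed_windows("data.csv", predictions_done("results.csv"));
    // ...then launch the RapidMiner process as in the wrapper sketch above.
}
```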
The described approach can be used for any analysis or optimization process
that consists of multiple repeated sub-processes. Other examples which
might be dealt with in a similar manner are parameter optimization tasks
using random search, linear search or genetic algorithms.
The problem of checkpointing can be avoided if the run time of a single
analysis process is quite short. In such a case, checkpointing can be waived,
since restarting a not yet finished task would only lose a small
amount of CPU time.
3 The Distributed Data Mining Project
The distributedDataMining project is a scientific computing project that pro-
vides the computational power of internet-connected computers to its scien-
tific partners in order to perform research in the various fields of Simulation,
Data Analysis and Machine Learning. Since 2008, the project has used the BOINC
framework for the distribution of data analysis tasks, which are then performed
by the RapidMiner environment on computers of enthusiastic and committed
volunteers. The project became available to the public in March 2010.
The project's goal is to allow our research partners to make use of the
enormous processing power of personal computers around the world. Figure 1
shows the number of volunteers who donated the computational power of their
computers for research purposes over a period of three years. The number of
participating computers is shown in Figure 2.
Figure 1: Number of contributing volunteers (06/2010 - 06/2013)
Figure 2: Number of contributing hosts (06/2010 - 06/2013)
The computing power as visualized in the figures was used to process dif-
ferent data analysis tasks from various research areas. The research areas as
well as the related applications and case studies are briefly introduced in the
following section.
4 Applications
4.1 Social Network Analysis
In recent years more and more social network platforms have been established.
Some prominent examples are Facebook, Xing and LinkedIn. The platform
providers collect a huge amount of data for each of their users. Besides per-
sonal information such as age, hobbies or professional career, also relations
between users are specified. Consequently, each platform can be represented
as a network of users where each user can be characterized with specific prop-
erties. The links between users may have different meanings. A very common
meaning is that two users are connected if they know each other.
The data in social networks contains valuable information. In the follow-
ing, we present some results that were obtained using the DenGraph
algorithm. DenGraph was implemented for RapidMiner and executed in a
distributed fashion using the BOINC framework.
4.1.1 The DenGraph algorithm
Inspired by the algorithm DBSCAN [3] for spatial data, Falkowski et al. pro-
pose the density-based graph clustering algorithm DenGraph [5]. The inten-
tion of DenGraph is to cluster similar nodes in a graph into communities. The
density-based approach applies a local cluster criterion. Clusters are regarded
as regions in the graph in which the nodes are dense, and which are separated
by regions of low node density.
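As a rough illustration of this local criterion, here is a sketch in the spirit of DBSCAN; the parameter names eps (neighborhood radius) and eta (minimum number of close neighbors) are chosen for illustration rather than quoted from the DenGraph papers.

```cpp
// Hedged sketch of a DBSCAN-style local density criterion on a graph.
// Edge weights are taken to encode distances; eps and eta are illustrative
// parameter names, not quoted from the DenGraph papers.
#include <cstddef>
#include <vector>

struct Edge { int to; double dist; };
using Graph = std::vector<std::vector<Edge>>;  // adjacency lists

// A node is dense (a "core" node) if at least eta of its neighbors lie
// within distance eps. Clusters grow from such nodes, so dense regions
// separated by sparse regions end up in different communities.
bool is_core(const Graph& g, int node, double eps, std::size_t eta) {
    std::size_t close = 0;
    for (const Edge& e : g[node])
        if (e.dist <= eps) ++close;
    return close >= eta;
}
```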
To allow for tracking and analyzing the temporal dynamics of social network
communities, DenGraph-I [6] is designed as an incremental procedure:
the clustering is updated incrementally based on the changes that are observed
in the graph structure from one interval to another. These changes
may trigger one of the following clustering updates: creation of a new cluster,
removal of a cluster, absorption of a new cluster member, reduction of a cluster
member, merge of two or more clusters and split of a cluster into two or more
clusters.
In social networking sites it is often observable that members belong to
more than one community. So far, if a member is close to more than one
community, it is assigned to the cluster that is discovered first. In this case,
the clustering result is not deterministic but depends on the order in which the
nodes are visited. To overcome this problem, we propose DenGraph-IO, which
extends the existing algorithm to handle overlapping clusters. By this, we also
achieve a more realistic clustering, as individuals can now be members of
different communities.
In 2011, we proposed DenGraph-HO in order to fulfill the special needs
of social network analysts [13, 14]. In most cases, the visual inspection of a
network is the first step of the analytical process and helps to determine the
basic graph characteristics and further actions. DenGraph-HO supports this
early stage by providing a quick visual analysis of the network structure. It
provides the ability to zoom into network clusterings and has proven its
usefulness in our practical work.
The algorithm’s approach differs from traditional hierarchical clustering
methods in that DenGraph-HO is a non-partitional clustering algorithm. We
consider the fact that not all nodes are necessarily members of clusters. In
addition, the proposed hierarchy is not strictly built up by the classic divi-
sive or agglomerative approach that is known from literature. We generalize
these methods and propose a top-down approach and a bottom-up approach
by extending the hierarchy paradigms. The proposed hierarchy supports su-
perordinate clusters that contain subclusters.
Each level of the hierarchy represents a clustering that fulfills the original
DenGraph paradigms. The levels, i.e. the clusterings, differ in the
density that is required to form a cluster. While lower level clusterings aggregate
nodes with a lower similarity, higher level clusterings require a higher
similarity between nodes. The efficiency of our algorithm is based on this
iterative sequence of cluster adaptations instead of a completely new clustering
at each level.
4.1.2 Last.fm analysis
In 2008, we applied DenGraph-IO to a music data set to analyse the
music listening behaviour of users on the Last.fm platform [4, 12]. Last.fm is
a social networking platform established in 2002. The platform has over 20
million users on the site every month, based in more than 200
countries. After a user signs up, Last.fm records, among other things, all artists
a user listens to, aggregates this information over seven days and provides
lists of the most listened-to artists for each week over the lifetime of a user. We
use this information to build a user's profile by extracting the genres of the
most listened-to artists. The artist's genre is determined by the tags that the
community members use to characterize the artist. We represent each user
as a node in a graph and connect two users with an edge if their profile similarity
reaches a predefined threshold. The similarity is determined by comparing
pairs of genre vectors using the cosine similarity measure.
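A minimal sketch of this similarity test follows; the threshold value is illustrative, not the one used in the study.

```cpp
// Hedged sketch: cosine similarity between two users' genre vectors and
// the edge criterion derived from it. The threshold is illustrative.
#include <cmath>
#include <cstddef>
#include <vector>

double cosine_similarity(const std::vector<double>& a,
                         const std::vector<double>& b) {
    double dot = 0, na = 0, nb = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];  // assumes both vectors use the same genre order
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Two users become adjacent in the graph iff their genre profiles are
// similar enough.
bool connect(const std::vector<double>& a, const std::vector<double>& b) {
    const double threshold = 0.8;  // illustrative value
    return cosine_similarity(a, b) >= threshold;
}
```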
For our study, we randomly chose approximately 600,000 users and obtained
their weekly artist charts over a period of 167 weeks (September 2005
to November 2008). Since many users are not active on a regular basis, we
randomly chose 2,000 users from this set who were active in at least 80% of all
periods. We applied DenGraph-IO to the resulting graph to detect and observe
the evolution of clusters during the observation period of 115 weeks. The
aim was to see whether the proposed clustering technique detects meaningful
communities and evolutions.
Figure 3: DenGraph-IO: Evolution of the Last.fm clusters over the weeks
29/06 to 48/06. The figure lists the cluster labels (1: indie rock, alternative;
2: melodic death metal, metal, death metal; 3: progressive rock, progressive
metal, rock, classic rock; 4: hip-hop; 5: metal core, hard rock, rock, emo,
screamo, punk, metal; 6: power metal, heavy metal, metal, symphonic metal;
7: metal core, hard core, death metal, metal, melodic metal; 8: electronic,
ambient, idm, electronica, indie, chill out; 9: power metal, metal, symphonic
metal; 10: progressive metal, progressive rock, metal, progressive; 11: progressive
rock, progressive metal, rock, metal) and marks the cluster evolution events
(cluster unchanged, new cluster, cluster removal, cluster merge, cluster split).
Figure 3 shows the evolution of clusters found by the DenGraph-IO algorithm.
At first, only four clusters were found. These clusters represent the
music genres indie, metal, rock and hip-hop. The algorithm then tracks these
clusters and detects structural changes. Four weeks later, two additional clusters
were found; one of them disappeared in the next step. In week 40/2006,
the clusters representing the metal genre were merged into one bigger cluster.
In week 48/2006, the rock cluster was split into two subclusters.
In 2011, we applied the DenGraph-HO algorithm to the Last.fm graph
consisting of 1,209 nodes and 12,612 edges. The resulting clusters form groups
of users that have similar music listening preferences. By calculating labels,
the clusters are given a semantic meaning based on the music preferences of their
members. Figure 4 shows the resulting hierarchy of clusters, which represent
music genres. The underlying graph and the discovered clusters are shown in
Figure 5. For the sake of clarity, the graph edges are not drawn. Clusters
with similar labels are located close to each other in both the graph and the
hierarchy.
Figure 4: DenGraph-HO: Last.fm hierarchy
Figure 5: DenGraph-HO: Last.fm Graph and Clustering
4.2 Time Series Analysis
A Time Series is an ordered sequence of data points, which are typically mea-
sured at uniform time intervals. The research area called Time Series Analysis
comprises methods for analyzing time series data in order to extract meaning-
ful statistics, rules and patterns. Later on these rules and patterns might be
used to build forecasting models that are able to predict future developments.
In case one wants to predict future trend directions (e.g. up/down) a classifi-
cation problem has to be solved. If we try to forecast future time series data
points, the relevant data mining technique is called regression.
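As an illustration of the two settings, the following sketch turns a raw series into sliding-window training examples carrying both a classification target (the trend direction) and a regression target (the next value); the window size is illustrative.

```cpp
// Hedged sketch: building sliding-window examples from a time series.
// The window size is illustrative.
#include <cstddef>
#include <vector>

struct Example {
    std::vector<double> features;  // the w most recent values
    bool up;                       // classification target: next value rises
    double next;                   // regression target: the next value itself
};

std::vector<Example> make_examples(const std::vector<double>& series,
                                   std::size_t w = 10) {
    std::vector<Example> out;
    for (std::size_t t = w; t < series.size(); ++t) {
        Example e;
        e.features.assign(series.begin() + (t - w), series.begin() + t);
        e.next = series[t];
        e.up = series[t] > series[t - 1];
        out.push_back(e);
    }
    return out;
}
```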
Within the distributedDataMining.org project, we use different machine
learning algorithms to discover and extract valuable patterns which are em-
bedded in financial time series. These patterns are subsequently used to build
forecasting models which should be able to predict future developments. The
algorithms we use are integrated in the open source data mining framework
RapidMiner. For classification problems we applied Decision Trees, k-nearest
Neighbours, Support Vector Machines and Neural Networks. Furthermore,
Linear Regression, LeastMedSquare Regression and Logistic Base Regression
are used to build regression models.
In 2008, we published some studies focusing on stock price prediction
[9, 10, 11]. The proposed methodology led to prediction models which almost
always achieve a positive profit. We succeeded in generating models that
outperform the general market. Furthermore, our results revealed some interesting
findings: the amount of historical data had a much higher impact on the
prediction quality than expected. This fact had not been taken into account
in other studies before.
4.3 Biological Data Analysis
4.3.1 Laryngeal high-speed video classification
For the clinical diagnosis of pathological conditions of the human body, a variety
of sophisticated examination techniques is employed these days. Most
of these approaches, e.g. MRI, CT and ultrasound, yield vast amounts of images
and measurement data with high spatial and/or temporal resolutions. In
order to reliably evaluate these data for diagnostic purposes, a certain extent of
subjective experience is required on the part of the physician. For various
reasons, the amount of time available in usual clinical routines for analyzing
and interpreting the acquired data is limited. As a result, diagnostic failure
may occur, which can have serious consequences for the affected patient. By
means of combined image processing and data analysis approaches this crucial
diagnostic process can be objectified and automated. Thus, Computer-Aided
Diagnosis systems can be provided to the physician, facilitating her/his clinical
decision and yielding more reliable identification of pathological alterations.
One particular field of interest within this medical context is the automatic
identification of voice disorders, resulting in perceivable hoarseness. Com-
monly, for this purpose audio recordings of the acoustical voice signal are ana-
lyzed with specialized software quantifying the amount of perturbation (noise)
in the signal. However, this type of acoustical analysis does not allow for the
clear assignment of certain clinical pictures to a distinct set of perturbation
parameters. A more revealing approach to voice diagnosis consists of the endoscopic
examination of the sound-producing vocal folds in the larynx by means
of digital high-speed cameras. These cameras are capable of recording the laryngeal
movements at a frame rate of several thousand images per second and
thus allow for conclusive real-time analysis. However, the task of manually
analyzing the resulting high-speed videos is time-consuming and error-prone.
Through automated feature extraction from the recordings and subsequent
machine learning analysis, laryngeal movement patterns can be quantitatively
captured and automatically classified according to different diagnostic classes
(e.g. organic and functional dysphonia). By means of the distributedDataMin-
ing infrastructure, we evaluated a large number of machine learning paradigms
(e.g. Support Vector Machines, Artificial Neural Networks) and correspond-
ing parameter optimization strategies (e.g. Grid search, Evolution strategy,
Genetic algorithms). This preliminary evaluation step allowed us to identify
certain learning schemes and parameters which are particularly suited for the
considered clinical classification task. Details on the proposed methodology
and the obtained classification results can be found in [15, 16, 17, 18].
4.3.2 Multi-Agent Simulation of Evolution
In this case study, we investigate the biological phenomenon of aposematism
(also referred to as warning coloration). This term describes the evolutionary
strategy of certain animal species to indicate their unpalatability/toxicity to
potential predators by developing skin colors and patterns that can be easily
perceived by them. Prominent examples of toxic animals with distinct warning
coloration are poison dart frogs, coral snakes and fire salamanders.
The evolution of aposematism has intrigued many biologists, because at
first glance, an evolutionary paradox seems to be underlying: Why would
unpalatable prey animals acquire conspicuous warning coloration if this trait
makes them more likely to be spotted and eaten by predators? Given that
aposematism can be frequently observed in the animal world, the question
arises how these warning signals could have evolved so many times despite
their apparent evolutionary disadvantage. The paradox is even aggravated by
the fact that in its initial stage of evolution, the proposed benefit of aposematic
colors (i.e. making it easier for predators to learn the prey’s unpalatability)
cannot be present. Consequently, the evolution of aposematism has spurred
more than a century of scientific discussion and investigation; it has been
addressed both experimentally and theoretically.
For tackling this interesting research challenge, we developed a distributed
multi-agent model that simulates the dynamic interactions of predator and
prey populations over time. By systematically testing different adaptation
and learning strategies for the agents and exploring the parameter space of our
simulation model using the computational power of the distributedDataMin-
ing.org project, we might be able to deepen the understanding of the apose-
matism phenomenon and the evolutionary paths leading to it.
5 Conclusion
In this article, we briefly described the distributedDataMining.org project
which uses BOINC, Drupal and RapidMiner for distributed data analytics. We
gave an overview of the integration of these tools and how they inter-
act. In addition, we presented our data analysis results which were achieved
in the research fields of Social Network Analysis, Time Series Analysis and
Biological Data Analysis. The distributedDataMining.org project is up and
running and offers its computational resources to interested researchers and
their institutions.
References
[1] David P. Anderson. BOINC: A system for public-resource computing and
storage. In Proceedings of the 5th IEEE/ACM International Workshop
on Grid Computing, GRID ’04, pages 4–10, Washington, DC, USA, 2004.
IEEE Computer Society.
[2] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data process-
ing on large clusters. Commun. ACM, 51(1):107–113, January 2008.
[3] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-
based algorithm for discovering clusters in large spatial databases with
noise. In Proceedings of the Second International Conference on Knowledge
Discovery and Data Mining (KDD-96), pages 226–231. AAAI Press, 1996.
[4] Tanja Falkowski. Community Analysis in Dynamic Social Networks.
Sierke Verlag, Goettingen, December 2009.
[5] Tanja Falkowski, Anja Barth, and Myra Spiliopoulou. DenGraph: A
density-based community detection algorithm. In Proc. of the 2007
IEEE/WIC/ACM International Conference on Web Intelligence,
pages 112–115, 2007.
[6] Tanja Falkowski, Anja Barth, and Myra Spiliopoulou. Studying com-
munity dynamics with an incremental graph mining algorithm. In Izak
Benbasat and Ali R. Montazemi, editors, AMCIS, pages 29–29. Associa-
tion for Information Systems, 2008.
[7] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter
Reutemann, and Ian H. Witten. The WEKA data mining software: An
update. SIGKDD Explor. Newsl., 11(1):10–18, November 2009.
[8] Ingo Mierswa, Michael Wurst, Ralf Klinkenberg, Martin Scholz, and
Timm Euler. YALE: Rapid prototyping for complex data mining tasks. In
Lyle Ungar, Mark Craven, Dimitrios Gunopulos, and Tina Eliassi-Rad,
editors, KDD ’06: Proceedings of the 12th ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 935–940, New
York, NY, USA, August 2006. ACM.
[9] Manuel Möller and Nico Schlitter. Analyse und Prognose ökonomischer
Zeitreihen mit Support Vector Machines, pages 189–201. Chemnitzer
Informatik-Berichte. Techn. Univ. Chemnitz, Fak. für Informatik, Chemnitz,
April 2008.
[10] Nico Schlitter. Analyse und Prognose ökonomischer Zeitreihen: Neuronale
Netze zur Aktienkursprognose. VDM Verlag Dr. Müller,
Saarbrücken, July 2008.
[11] Nico Schlitter. A Case Study of Time Series Forecasting with Backpropagation
Networks, pages 203–217. Chemnitzer Informatik-Berichte. Techn.
Univ. Chemnitz, Fak. für Informatik, Chemnitz, April 2008.
[12] Nico Schlitter and Tanja Falkowski. Mining the dynamics of music preferences
from a social networking site. In Proceedings of the 2009 International
Conference on Advances in Social Network Analysis and Mining,
pages 243–248, Athens, 2009. IEEE Computer Society.
[13] Nico Schlitter, Tanja Falkowski, and Jörg Lässig. DenGraph-HO: Density-
based hierarchical community detection for explorative visual network
analysis. In Research and Development in Intelligent Systems XXVIII,
Incorporating Applications and Innovations in Intelligent Systems XIX:
Proceedings of AI-2011, the Thirty-first SGAI International Conference
on Innovative Techniques and Applications of Artificial Intelligence, pages
283–296, Cambridge, 2011. Springer.
[14] Nico Schlitter and Jörg Lässig. Distributed privacy preserving classification
based on local cluster identifiers. In 2012 IEEE 11th International
Conference on Trust, Security and Privacy in Computing and Communications
(TrustCom), Liverpool, 2012. IEEE Computer Society.
[15] D. Voigt and U. Eysholdt. Identifying relevant analysis parameters for
the classification of vocal fold dynamics. J Acoust Soc Am, 130(4):2550,
2011.
[16] Daniel Voigt. Objective Analysis and Classification of Vocal Fold Dynam-
ics from Laryngeal High-Speed Recordings. Shaker Verlag GmbH, Aachen,
2010.
[17] Daniel Voigt, Michael Döllinger, Thomas Braunschweig, Anxiong Yang,
Ulrich Eysholdt, and Jörg Lohscheller. Classification of functional voice
disorders based on phonovibrograms. Artificial Intelligence in Medicine,
49(1):51–59, 2010.
[18] Daniel Voigt, Jörg Lohscheller, Michael Döllinger, Anxiong Yang, and
Ulrich Eysholdt. Automatic diagnosis of vocal fold paresis by employing
phonovibrogram features and machine learning methods. Comput Methods
Programs Biomed, 99:275–288, 2010.