A RapidMiner extension for Open Machine Learning
Jan N. van Rijn (1), Venkatesh Umaashankar (2), Simon Fischer (2), Bernd Bischl (3), Luis Torgo (4), Bo Gao (5), Patrick Winter (6), Bernd Wiswedel (6), Michael R. Berthold (7) and Joaquin Vanschoren (1)

(1) Leiden University, {jvrijn,joaquin}@liacs.nl
(2) Rapid-I GmbH, {venkatesh,fischer}@rapid-i.com
(3) TU Dortmund, Dortmund, Germany, bischl@statistik.tu-dortmund.de
(4) University of Porto, Porto, Portugal, ltorgo@inescporto.pt
(5) KU Leuven, Leuven, Belgium, bo.gao@cs.kuleuven.be
(6) KNIME.com AG, {patrick.winter,Bernd.Wiswedel}@knime.com
(7) University of Konstanz, Konstanz, Germany, Michael.Berthold@uni-konstanz.de
Abstract
We present a RapidMiner extension for OpenML, an open science platform for sharing machine learning datasets, algorithms and experiments. In order to share machine learning experiments as easily as possible, OpenML is being integrated into various popular data mining and machine learning tools, including RapidMiner. Through this plugin, data can be downloaded, and workflows and results uploaded to the OpenML website, where they can be searched, aggregated and reused.
1 Introduction
In this paper we present a RapidMiner extension for OpenML (http://openml.org/, currently in beta), an open science platform for machine learning. It allows researchers to submit datasets, algorithms, workflows, experiments and their results to a single platform. OpenML automatically organizes all content in a database, where it is freely available to everyone and searchable through its website. Above all, OpenML aims to facilitate an open scientific culture, which in turn can tremendously speed up progress [5]. First, by publishing detailed results, other scientists can clearly interpret, and even verify and reproduce, certain findings, so that they can confidently build upon prior work [4]. Furthermore, by integrating results from many studies, researchers can conduct much larger studies. Algorithms and workflows can immediately be compared over many different datasets and a wide range of parameter settings. Finally, many machine learning questions will not require setting up new experiments; they can be answered on the fly by searching and combining results from earlier studies.

Figure 1: Components of OpenML.
The key components of OpenML are shown in Figure 1. Central are the OpenML API and database; the database contains all details and meta-data about all shared datasets, implementations and experiments. New content can be sent
to OpenML by means of a RESTful API. This API is also being integrated
into a number of popular data mining and machine learning tools, i.e., Weka,
R, KNIME and RapidMiner, so that content can be shared automatically.
OpenML also provides various search interfaces, so that these entities can
later be retrieved, e.g., through a web interface, textual search engine or SQL
interface. The latter enables users to directly query the database by means of
SQL statements.
In Section 2 an overview of related work is provided. In Section 3, we dis-
cuss how experiments are defined in OpenML. Section 4 provides an overview
of the most important concepts in the database. In Section 5 we describe
the web API, allowing integration into various tools. Section 6 describes how
we support the sharing of experiments, and how it is being integrated into
RapidMiner. Section 7 details the search interfaces of OpenML, and Section 8
concludes.
2 Related Work
OpenML builds upon previous work on experiment databases [7], which in-
troduced the idea of sharing machine learning experiments in databases for
in-depth analysis of learning algorithms. The most notable enhancement of
OpenML is the introduction of a web API to allow integration with various
machine learning tools, and a clearer definition of experiments through
tasks (see Section 3).
Kaggle (http://www.kaggle.com/) is a platform which hosts machine learning challenges. In some
sense, these challenges are similar to OpenML tasks: users are provided with
data and instructions, and challenged to build a predictive model on this. How-
ever, as Kaggle is a competition platform, it does not support collaboration:
people are not allowed to see each other’s algorithms or results. OpenML,
however, is an open science platform, allowing researchers complete insight
into each other’s work.
There also exist several platforms for sharing algorithms, workflows and datasets, such as myExperiment [1, 3] and MLData (http://www.mldata.org/). However, these platforms
datasets, such as myExperiment [1, 3] and MLData.3However, these platforms
were not designed to collect and organise large amounts of experimental results
over many algorithms and datasets, nor allow such detailed analysis of learning
algorithms and workflows afterwards. On the other hand, we do aim to fully
integrate these platforms with OpenML, so that datasets and algorithms can
be easily transferred between them.
Finally, MLComp (http://www.mlcomp.org/) is a service that offers to run your algorithms on a
range of datasets (or vice versa) on their servers. This has the great benefit
that runtimes can be compared more easily. This is not strictly possible in
OpenML, because experiments are typically run on the user’s machines. How-
ever, OpenML does allow you to rerun the exact same experiments on different
hardware, which is necessary anyway since hardware will change over time.
Moreover, researchers do not need to adapt the way they do their research:
they can run their algorithms in their own environments. OpenML also allows
users to define different types of experiments beyond the traditional bench-
marking runs, and allows more flexible search and query capabilities beyond
direct algorithm comparisons.
3 Tasks
In order to make experiments from different researchers comparable, OpenML
fully defines experiments in tasks. A task is a well-defined problem to be
solved by a machine learning algorithm or workflow. For each task, the inputs
are provided and the expected output is defined. An attempt to solve a task
is called a run. Currently, tasks of the type Supervised Classification and
Supervised Regression are supported, but OpenML is designed in such a way
that it can be extended with other task types. A typical task would be: Predict
target variable X on dataset Y with a maximized score for evaluation metric Z.
Usually, when a user wants to submit new results to the server, he searches for
appropriate tasks, and runs his algorithm on these. The results from these runs
will be uploaded to the server, along with information about the algorithm,
its version and the parameter settings. This process is explained in detail in
Section 5.
<oml:task xmlns:oml="http://openml.org/openml">
  <oml:task_id>2</oml:task_id>
  <oml:task_type>Supervised Classification</oml:task_type>
  <oml:input name="source_data">
    <oml:data_set>
      <oml:data_set_id>61</oml:data_set_id>
      <oml:target_feature>class</oml:target_feature>
    </oml:data_set>
  </oml:input>
  <oml:input name="estimation_procedure">
    <oml:estimation_procedure>
      <oml:type>cross_validation</oml:type>
      <oml:data_splits_url>
        http://www.openml.org/data/splits/iris_splits.arff
      </oml:data_splits_url>
      <oml:parameter name="number_folds">10</oml:parameter>
      <oml:parameter name="number_repeats">10</oml:parameter>
      <oml:parameter name="stratified_sampling">true</oml:parameter>
    </oml:estimation_procedure>
  </oml:input>
  <oml:input name="evaluation_measures">
    <oml:evaluation_measures>
      <oml:evaluation_measure>predictive_accuracy</oml:evaluation_measure>
    </oml:evaluation_measures>
  </oml:input>
  <oml:output name="predictions">
    <oml:predictions>
      <oml:format>ARFF</oml:format>
      <oml:feature name="confidence.classname" type="numeric" />
      <oml:feature name="fold" type="integer" />
      <oml:feature name="prediction" type="string" />
      <oml:feature name="repeat" type="integer" />
      <oml:feature name="row_id" type="integer" />
    </oml:predictions>
  </oml:output>
</oml:task>
Figure 2: XML representation of a task.
Figure 2 shows an example of a Supervised Classification task definition. It
provides all information necessary for executing it, such as a URL to download
the input dataset and an estimation procedure. The estimation procedure
describes how the algorithms that are run on this task are being evaluated,
e.g., using cross-validation, a holdout set or leave-one-out. Every run performed on a certain task uses the same data splits; an ARFF file containing these splits is provided. Also, a set of evaluation measures
to optimise on is provided. An ARFF file containing the predictions (and
confidences per class) is expected as the result.
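To make this concrete, the sketch below shows how a client might extract the relevant pieces from a task document such as the one in Figure 2. It is a minimal illustration using only Python's standard library; the helper name parse_task and the example file name are ours and are not part of OpenML or the plugin.

import xml.etree.ElementTree as ET

NS = {"oml": "http://openml.org/openml"}  # namespace declared in the task XML (Figure 2)

def parse_task(path):
    """Collect the main inputs of a Supervised Classification task from its XML description."""
    root = ET.parse(path).getroot()
    task = {
        "task_id": root.findtext("oml:task_id", namespaces=NS),
        "task_type": root.findtext("oml:task_type", namespaces=NS),
    }
    for inp in root.findall("oml:input", NS):
        name = inp.get("name")
        if name == "source_data":
            task["dataset_id"] = inp.findtext(".//oml:data_set_id", namespaces=NS)
            task["target"] = inp.findtext(".//oml:target_feature", namespaces=NS)
        elif name == "estimation_procedure":
            task["splits_url"] = (inp.findtext(".//oml:data_splits_url", namespaces=NS) or "").strip()
        elif name == "evaluation_measures":
            task["measures"] = [m.text for m in inp.findall(".//oml:evaluation_measure", NS)]
    return task

# Usage (hypothetical file name): parse_task("task_2.xml")["splits_url"]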
4 Database
One of the key aspects of OpenML is the central database, containing details
about all experiments. A partial schema of the database is provided in Fig-
ure 3. In the database schema, the concept of inheritance is used: some tables
shown do not exist, but describe what fields should be contained by tables in-
heriting from them, i.e., data and setup. We call these tables interface tables.
Also, all tables inheriting from the same interface table share a primary key.
OpenML considers algorithms as conceptual entities; an algorithm itself cannot be used to execute a task. Instead, an algorithm can be implemented,
resulting in an implementation. In this way we also support versioning. For ex-
ample, when an implementation containing a bug is used, this will potentially
yield suboptimal results. This changes whenever a new version is released.
Thus, we use the table implementation, where the primary key is fullName,
an aggregation of its name (field: implementation) and its version (field:
version). More specifically, an implementation can typically be run with dif-
ferent parameter settings. The setup table contains, for each implementation,
which parameter values were used in a specific run. The table input con-
tains for each implementation all parameters and their default values. The
table input_setting contains for every setup the values of the parameters.
The tables dataset and evaluation both contain data, which can serve
as input or output of a run. These are linked together by the linking tables
input_data and output_data. Entries in the dataset table can be either
user-submitted datasets or files containing the result of a run, such as pre-
dictions. For each evaluation measure performed, an entry is stored in the
evaluation table. Querying all experiments of a specific type of task is easiest if the inputs and outputs of that task type are combined in a single
table. For this reason, the views SVCRun and SVRRun have been introduced for
Supervised Classification tasks and Supervised Regression tasks, respectively.
These are materialized views containing all inputs, outputs and results of such
an experiment.
Figure 3: Database schema (partial), showing, among others, the Run, Setup, Implementation, Input, Dataset, Evaluation, Task, Study, Data_Quality and Algorithm_Quality tables, with inheritance, many-to-one and one-to-one relations.

For each dataset and implementation, a number of meta-features [6] are obtained and stored in the data_quality and algorithm_quality tables, respectively. These meta-features are called qualities. A list of all qualities can be found in their corresponding tables.
5 RESTful API
In order to enable the sharing of experiments, a web API has been developed (full documentation can be found at http://www.openml.org/api/). The API contains functions that facilitate downloading datasets, tasks and implementations. Furthermore, it enables the uploading of datasets, implementations and runs. The API also contains functions that list evaluation measures, licence data and evaluation methods. We will briefly explain the most important features.

Figure 4: Use case diagrams of the API: (a) Authenticate, (b) Download task, (c) Download implementation, (d) Upload run.
Functions that involve the uploading of content require the user to provide a session hash. A session hash is a unique string which is used to authenticate the user. It is valid for a limited amount of time. Users can obtain a session hash by invoking the function openml.authenticate (see also Figure 4a). Inputs for this function are the username and an MD5 hash of the password.
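As a rough illustration of this call, the sketch below hashes the password and posts the credentials to the documented API root. It is only a sketch: the way the function name and parameters are encoded in the request (here a query parameter f and form fields username and password) is an assumption, not taken from the API documentation.

import hashlib
import requests

API_ROOT = "http://www.openml.org/api/"  # documented API root

def authenticate(username, password):
    """Request a session hash via openml.authenticate (request encoding is assumed)."""
    payload = {
        "username": username,
        # the API expects an MD5 hash of the password rather than the plain text
        "password": hashlib.md5(password.encode("utf-8")).hexdigest(),
    }
    response = requests.post(API_ROOT, params={"f": "openml.authenticate"}, data=payload)
    response.raise_for_status()
    return response.text  # XML document containing the session hash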
Tasks can be obtained by invoking the function openml.task.search. An
XML file, similar to the XML file shown in Figure 2, is returned. The source
data described is an ID referring to a dataset. In order to obtain information
concerning this dataset, including a download URL, users should perform an
additional call to openml.data.description. The dataset can reside in any
data repository, including a user’s personal webpage. Figure 4b details how the content of a task is obtained.
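From the client's side, the flow of Figure 4b could be sketched as follows. Again this is an assumption-laden sketch: the function names come from the text above, but the request parameter names (f, task_id, data_id) and the tag holding the dataset download URL are illustrative.

import xml.etree.ElementTree as ET
import requests

API_ROOT = "http://www.openml.org/api/"   # documented API root
NS = {"oml": "http://openml.org/openml"}  # namespace used in the task XML (Figure 2)

def download_task_inputs(task_id):
    """Fetch a task, resolve its dataset description, then download the data and the splits."""
    # openml.task.search returns a task XML such as the one shown in Figure 2
    task_xml = requests.get(API_ROOT, params={"f": "openml.task.search", "task_id": task_id}).text
    task = ET.fromstring(task_xml)
    dataset_id = task.findtext(".//oml:data_set_id", namespaces=NS)
    splits_url = (task.findtext(".//oml:data_splits_url", namespaces=NS) or "").strip()

    # openml.data.description returns an XML description that includes a download URL
    desc_xml = requests.get(API_ROOT, params={"f": "openml.data.description", "data_id": dataset_id}).text
    dataset_url = ET.fromstring(desc_xml).findtext(".//oml:url", namespaces=NS)

    dataset_arff = requests.get(dataset_url).text  # the dataset may live in any repository
    splits_arff = requests.get(splits_url).text    # the data splits are served by OpenML
    return dataset_arff, splits_arff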
Datasets and implementations can be obtained using the API. Both are
referred to with an ID. By invoking the functions openml.data.description
and openml.implementation.get with this ID as parameter, users obtain an
XML file describing the dataset or implementation. Figure 4c shows how to
download an implementation. Datasets can be obtained in a similar way.
Runs can be submitted by invoking the function openml.run.upload. Fig-
ure 4d outlines how this works. The user provides an XML file describing
which implementation was used, and what the parameter settings were. The
implementation that was used should already be registered on OpenML. Fur-
thermore, all output of the run must be submitted. For the supervised classifi-
cation and regression tasks, this will include a file with predictions, which will
be evaluated on the server and stored in the database. The server will return an
ID referring to the record in the database. Uploading datasets and implemen-
tations happens in a similar way. For this the functions openml.data.upload
and openml.implementation.upload are used, respectively.
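The upload of Figure 4d could then look roughly like the sketch below, which sends the session hash together with the run description and the predictions as a multipart POST. The field names used here are illustrative assumptions; only the function name openml.run.upload and the three pieces of content come from the description above.

import requests

API_ROOT = "http://www.openml.org/api/"  # documented API root

def upload_run(session_hash, run_xml_path, predictions_arff_path):
    """Submit a run: its XML description plus the predictions ARFF (field names are assumed)."""
    with open(run_xml_path, "rb") as run_xml, open(predictions_arff_path, "rb") as predictions:
        response = requests.post(
            API_ROOT,
            params={"f": "openml.run.upload"},
            data={"session_hash": session_hash},
            files={
                "description": ("run.xml", run_xml),
                "predictions": ("results.arff", predictions),
            },
        )
    response.raise_for_status()
    return response.text  # XML response containing the ID of the stored run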
A list of all evaluation measures for usage in tasks can be obtained by
invoking openml.evaluation.measures.
6 Sharing Experiments
To facilitate the sharing of experiments, plugins are being developed for pop-
ular data mining and machine learning tools, including RapidMiner. The
RapidMiner plugin can be downloaded from the OpenML website. It intro-
duces three new operators.
The Read OpenML Task operator handles the downloading of tasks. When
presented with a task id, it automatically downloads this task and all associ-
ated content, i.e., the input dataset and the data splits. Every entity down-
loaded from OpenML is cached on the user’s local disk. The operator composes
the various training and test sets, and marks attributes with certain roles as
such, e.g. the target attribute or a row id.
The resulting training and test set will be sent to the OpenML Prediction
operator. For each training set submitted to this operator, a predictive model
is built, which generates predictions for all instances in the test set. These pre-
dictions will be sent to the Share on OpenML operator, which is responsible
for submitting the results to OpenML. First, it checks whether the implemen-
tation already exists, and if not, it will be registered. After that, all parameter
values are tracked. Finally, an XML file describing the run and an ARFF file
containing the predictions will be sent to OpenML.
Figure 5 contains a workflow which uses these operators. This workflow can also be downloaded from the website. Before it can be run, a local directory for caching the downloaded data must be set. This can be done in the Preferences menu, under the OpenML tab. When this is set, a workflow containing the OpenML operators can be created. A global outline is shown in Figure 5a. The operators are connected to each other in a straightforward way. We used a multiply operator to split the outcome of the Prediction operator to both the general output and the Share on OpenML operator. Note that if the user does not want to share his results online, he can simply omit the Share on OpenML operator.
Figure 5: Workflow which downloads an OpenML task and sends back the results: (a) the global workflow; (b) the subworkflow of the OpenML Prediction operator.
By clicking on the OpenML Prediction operator, a screen similar to Fig-
ure 5b is shown. This is where subworkflows can be created, to handle both
the training and the test data. As for the subworkflow that handles the train-
ing data, make sure that at least a model is created, e.g., by including a Naive
Bayes operator. For the Model application part it typically suffices to insert
an Apply Model operator. Finally, as parameter of the Read OpenML Task op-
erator, a task id should be provided. These can be found by searching the OpenML website.
7 Searching OpenML
All experiments in the database are openly available to everyone. Several ways of searching through these experiments are provided at http://www.openml.org/search/. The most notable ways of searching through OpenML are textual search, the “search runs” interface and the SQL interface.
All implementations, datasets and evaluation metrics submitted to OpenML
are required to include meta-data, such as a name, textual description, licence
data and in the case of implementations, installation notes and dependencies.
These textual descriptions are indexed by a search engine running on the web-
site, so that implementations and datasets can be searched through keywords.
The “search runs” interface is a wizard specialized in benchmark queries.
It can be found under the ‘Advanced’ tab of the search page. The user is pre-
sented with a form where he specifies which datasets (or collections of datasets)
and implementations he is interested in. Furthermore, he specifies on which
evaluation measure the benchmark should be performed. Typical questions
that can be answered with this interface are “what implementation performs
best on dataset X”, “compare several implementations on all datasets”, “show
the effect of data property DP on the optimal value of parameter P” and “how do parameter settings influence the performance of implementation X”.
The most flexible way of searching through OpenML is querying the database
directly by means of SQL statements. With some knowledge of the database
(see Section 4), complex queries can be executed in any way the user wants.
Under the ‘Advanced’ tab of the search page some queries are provided. The
user can also inspect the SQL code of these queries, so these can be adapted to
the user’s specific needs. In Figure 6 an example of such a query is provided.
It studies the effect of the gamma parameter of the Weka implementation of
a Support Vector Machine, on the UCI letter dataset [2].
The results of queries can be obtained in CSV and ARFF format. Further-
more, scatterplots and line plots (as shown in Figure 6b) are provided.
SELECT ps.value AS gamma, e.value AS accuracy
FROM cvrun r, algorithm_setup s, function_setup kernel, dataset d,
     input_setting ps, evaluation e
WHERE r.learner = s.sid AND s.algorithm = 'SVM' AND kernel.parent = s.sid
  AND kernel.function = 'RBF Kernel' AND ps.setup = s.sid
  AND ps.input = 'weka.SMO(1.53.2.2)_G' AND e.source = r.rid
  AND e.function = 'predictive_accuracy' AND r.inputdata = d.did
  AND d.name = 'letter'

(a) SQL statement
(b) Line plot of the result
Figure 6: Studying the effect of a parameter.
8 Summary
OpenML aims to stimulate an open approach to machine learning research,
by collecting results in a database. In order to provide an easy way of sharing
these, plugins for various machine learning tools will be provided, including
RapidMiner. Instead of running experiments over and over again, users can
easily query the database and obtain the results on relevant research questions.
Future work on OpenML includes the integration with other machine learn-
ing platforms, such as MLData and myExperiment. Also, support for a broader range of task types, such as time series analysis, feature selection and graph mining, will be provided.
cludes a better integration with the various services of OpenML. Currently, the
plugin is mainly focussed on downloading tasks and uploading results. Fea-
tures like downloading workflows, uploading datasets and inspecting results
could be valuable additions to the plugin.
Acknowledgments
This work is supported by grant 600.065.120.12N150 from the Dutch Fund
for Scientific Research (NWO), and by the IST Programme of the European
Community, under the Harvest Programme of the PASCAL2 Network of Ex-
cellence, IST-2007-216886.
References
[1] D. De Roure, C. Goble, and R. Stevens. The Design and Realisation of the myExperiment Virtual Research Environment for Social Sharing of Workflows. Future Generation Computer Systems, 25:561–567, 2009.

[2] P. W. Frey and D. J. Slate. Letter Recognition Using Holland-Style Adaptive Classifiers. Machine Learning, 6:161, 1991.

[3] C. A. Goble, J. Bhagat, S. Aleksejevs, D. Cruickshank, D. Michaelides, D. Newman, M. Borkum, S. Bechhofer, M. Roos, P. Li, and D. De Roure. myExperiment: a repository and social network for the sharing of bioinformatics workflows. Nucleic Acids Research, 38(suppl 2):W677–W682, 2010.

[4] H. Hirsh. Data mining research: Current status and future opportunities. Statistical Analysis and Data Mining, 1(2):104–107, 2008.

[5] M. A. Nielsen. The Future of Science: Building a Better Collective Memory. APS Physics, 17(10), 2008.

[6] Y. Peng, P. Flach, C. Soares, and P. Brazdil. Improved Dataset Characterisation for Meta-Learning. Lecture Notes in Computer Science, 2534:141–152, 2002.

[7] J. Vanschoren, H. Blockeel, B. Pfahringer, and G. Holmes. Experiment databases. A new way to share, organize and learn from experiments. Machine Learning, 87(2):127–158, 2012.