Fuzzy Logic in KNIME
– Modules for Approximate Reasoning –
Michael R. Berthold 1, Bernd Wiswedel 2, and Thomas R. Gabriel 2
1 Department of Computer and Information Science, University of Konstanz,
Universitätsstr. 10, 78484 Konstanz, Germany
E-mail: Michael.Berthold@Uni-Konstanz.DE
2 KNIME.com AG,
Technoparkstrasse 1,
8005 Zurich, Switzerland
In this paper we describe the open source data analytics platform KNIME, focusing particularly on ex-
tensions and modules supporting fuzzy sets and fuzzy learning algorithms such as fuzzy clustering al-
gorithms, rule induction methods, and interactive clustering tools. In addition we outline a number of
experimental extensions, which are not yet part of the open source release and present two illustrative
examples from real world applications to demonstrate the power of the KNIME extensions.
Keywords: KNIME, Fuzzy C-Means, Fuzzy Rules, Neighborgrams.
1. Introduction
KNIME is a modular, open [a] platform for data integration,
processing, analysis, and exploration [2]. The
visual representation of the analysis steps enables
the entire knowledge discovery process to be intu-
itively modeled and documented in a user-friendly
and comprehensive fashion.
KNIME is increasingly used by researchers in
various areas of data mining and machine learning
to give a larger audience access to their algorithms.
Due to the modular nature of KNIME, it is straight-
forward to add other data types such as sequences,
molecules, documents, or images. In addition, the
KNIME desktop release offers standard types for
fuzzy intervals and numbers, enabling the implementation
of fuzzy learning algorithms as well.
A previous paper [2] has already described KNIME’s
architecture and internals. A follow-up publication
focused on improvements in version 2.0 [3].
However, for readers not yet familiar with KNIME,
we provide a short overview of KNIME’s key con-
cepts in the following section before we describe
the integration of fuzzy concepts and learning algorithms
in the remainder of this paper. To the best
of our knowledge none of the other popular open
source data analysis or workflow environments [14, 9, 17]
includes fuzzy types and learning algorithms. Many
specialized open source fuzzy toolboxes exist but
most are either purely in academic use or cannot be
used standalone. Commercial tools, such as Matlab,
often also offer fuzzy extensions. In this paper
we focus on a complete, integrative platform which
is available open source.
[a] KNIME is downloadable from
International Journal of Computational Intelligence Systems, Vol. 6, Supplement 1 (2013), 34-45
Co-published by Atlantis Press and Taylor & Francis
Copyright: the authors
Received 7 December 2012
Accepted 30 March 2013
We will first describe KNIME itself and pro-
vide some details concerning the underlying work-
flow engine. Afterwards we discuss the fuzzy exten-
sions, in particular the underlying fuzzy types before
discussing the integrated algorithms. Before show-
ing two real world applications of those modules we
briefly describe ongoing work.
KNIME is used to build workflows. These work-
flows consist of nodes that process data; the data are
transported via connections between the nodes. A
workflow usually starts with nodes that read in data
from some data sources, which are usually text files
or databases that can be queried by special nodes.
Imported data is stored in an internal table-based for-
mat consisting of columns with a certain data type
(integer, string, image, molecule, etc.) and an ar-
bitrary number of rows conforming to the column
specifications. These data tables are sent along the
connections to other nodes, which first pre-process
the data, e.g. handle missing values, filter columns
or rows, partition the table into training and test
data, etc. and then for the most part build predic-
tive models with machine learning algorithms like
decision trees, Naive Bayes classifiers or support
vector machines. For inspecting the results of an
analysis workflow several view nodes are available,
which display the data or the trained models in var-
ious ways. Fig. 1 shows a small workflow and its
Fig. 1. A small KNIME workflow, which builds and evalu-
ates a fuzzy rule set on the Iris data.
In contrast to pipelining tools such as Taverna,
KNIME nodes first process the entire input table be-
fore the results are forwarded to successor nodes.
The advantages are that each node stores its results
permanently and thus workflow execution can eas-
ily be stopped at any node and resumed later on. In-
termediate results can be inspected at any time and
new nodes can be inserted and may use already cre-
ated data without preceding nodes having to be re-
executed. The data tables are stored together with
the workflow structure and the nodes’ settings. The
small disadvantage of this concept is that prelimi-
nary results are not available quite as soon as if real
pipelining were used (i.e. sending along and pro-
cessing single rows as soon as they are created).
One of KNIME’s key features is hiliting. In its
simple form, it allows the user to select and visually
mark ("hilite") several rows in a data table. These
same rows are also hilited in all the views that show
the same data (or at least the hilited rows). This type
of hiliting is simply accomplished by using the 1:1
correspondence among the tables’ unique row keys.
However, there are several nodes that completely
change the input table’s structure and yet there is
still some relation between input and output rows.
A nice example is the MoSS node that searches for
frequent fragments in a set of molecules. The node’s
input are the molecules, the output the discovered
frequent fragments. Each of the fragments occurs in
several molecules. Hiliting some of the fragments
in the output table causes all molecules that con-
tain these fragments to be hilited in the input table.
Fig. 2 demonstrates this situation in a small work-
flow where a confusion matrix is linked back to the
original data.
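The translation of hilite events between related tables can be sketched as follows. This is a minimal Python illustration under stated assumptions, not KNIME's Java API: the function name `propagate_hilite` and the dictionary-based mapping from output row keys to input row keys are hypothetical, and the fragment and molecule keys are made-up example data.

```python
def propagate_hilite(hilited_output_keys, output_to_inputs):
    """Return the input row keys related to the hilited output rows.

    output_to_inputs maps each output row key (e.g. a fragment) to the set of
    input row keys (e.g. molecules) it was derived from. This mapping is an
    assumption of the sketch; a node would have to record it while executing.
    """
    hilited_inputs = set()
    for out_key in hilited_output_keys:
        hilited_inputs |= set(output_to_inputs.get(out_key, ()))
    return hilited_inputs


# Hypothetical example: three fragments and the molecules that contain them.
fragment_to_molecules = {
    "frag1": {"mol1", "mol2"},
    "frag2": {"mol2", "mol3"},
    "frag3": {"mol4"},
}
selected = propagate_hilite({"frag1", "frag2"}, fragment_to_molecules)
```

For tables that share row keys, the mapping degenerates to the identity, which corresponds to the simple 1:1 form of hiliting described first.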
One of the important design decisions was to
ensure easy extensibility, so that other users can
add functionality, usually in the form of new nodes
(and sometimes also data types). This has already
been done by several commercial vendors (Tripos,
Schrödinger, Chemical Computing Group, ...) but
also by other university groups and open source pro-
grammers. The usage of Eclipse as the core plat-
form means that contributing nodes in the form of
plugins is a very simple procedure. The official KNIME
website offers several extension plugins, for example
for reporting via BIRT, statistical analysis with R, or
extended machine learning capabilities from Weka.
Figure 2: KNIME’s hiliting features demonstrated by the linkage between the confusion matrix and the evaluation
data.
Since the initial release in mid 2006 the grow-
ing user base has voiced a number of suggestions
and requests for improving KNIME’s usability and
functionality. From the beginning KNIME has supported
open standards for exchanging data and models.
Early on, support for the Predictive Model
Markup Language (PMML) [15] was added and most
of the KNIME mining modules natively support
PMML, including association analysis, clustering,
regressions, neural network, and tree models. With
the latest KNIME release, PMML support was enhanced
to cover PMML 4.1. See [18] for more details.
Before discussing how fuzzy types and learning
methods can be integrated into KNIME, let us first
discuss the KNIME architecture in more detail.
3. KNIME Architecture
The KNIME architecture was designed with three
main principles in mind.
• Visual, interactive framework: Data flows should
be combined by a simple drag and drop operation
from a variety of processing units. Customized
applications can be modeled through individual
data pipelines.
• Modularity: Processing units and data containers
should not depend on each other in order to enable
easy distribution of computation and allow for independent
development of different algorithms.
Data types are encapsulated, that is, no types are
predefined, new types can easily be added bringing
along type specific renderers and comparators.
New types can be declared compatible with existing
types.
• Easy expandability: It should be easy to add new
processing nodes or views and distribute them
through a simple plugin mechanism without the
need for complicated install/deinstall procedures.
In order to achieve this, a data analysis process con-
sists of a pipeline of nodes, connected by edges that
transport either data or models. Each node processes
the arriving data and/or model(s) and produces re-
sults on its outputs when requested. Fig. 3 schemat-
ically illustrates this process.
Fig. 3. A schematic for the flow of data and models in a
KNIME workflow.
The type of processing ranges from basic data
operations such as filtering or merging, through simple
statistical functions such as the computation of mean,
standard deviation, or linear regression coefficients,
to computationally intensive data modeling operators
(clustering, decision trees, neural networks, to
name just a few). In addition, most of the modeling
nodes allow for an interactive exploration of their re-
sults through accompanying views. In the following
we will briefly describe the underlying schemata of
data, node, workflow management and how the in-
teractive views communicate.
3.1. Data Structures
All data flowing between nodes is wrapped within
a class called DataTable, which holds meta-
information concerning the type of its columns in
addition to the actual data. The data can be accessed
by iterating over instances of DataRow. Each row
contains a unique identifier (or primary key) and a
specific number of DataCell objects, which hold
the actual data. The reason to avoid access by Row
ID or index is scalability, that is, the desire to be able
to process large amounts of data and therefore not be
forced to keep all of the rows in memory for fast ran-
dom access. KNIME employs a powerful caching
strategy, which moves parts of a data table to the
hard drive if it becomes too large. Fig. 4 shows a
UML diagram of the main underlying data structure.
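The iteration-only access pattern can be illustrated with a small sketch. The Python classes below merely borrow the names DataTable and DataRow from the text; they are simplified stand-ins, not the KNIME Java classes, and the helper `column_mean` is a hypothetical example of a computation that needs only a single sequential pass.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DataRow:
    key: str                    # unique row identifier
    cells: Tuple[object, ...]   # one cell value per column


class DataTable:
    """Holds column meta information and exposes rows only through iteration."""

    def __init__(self, column_names, column_types, rows):
        self.column_names = list(column_names)
        self.column_types = list(column_types)
        # Any re-iterable row source works here, e.g. a list or a reader
        # that streams rows from disk (assumption of this sketch).
        self._rows = rows

    def __iter__(self):
        return iter(self._rows)


def column_mean(table, column_name):
    """Compute a column mean in one sequential pass, without random access."""
    idx = table.column_names.index(column_name)
    total, count = 0.0, 0
    for row in table:
        total += row.cells[idx]
        count += 1
    return total / count if count else float("nan")
```

Because consumers never ask for a row by index, the row source behind the table is free to keep only part of the data in memory.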
3.2. Nodes
Nodes in KNIME are the most general processing
units and usually correspond to one node in the
visual workflow representation. The class Node
wraps all functionality and makes use of user-defined
implementations of a NodeModel, possibly
a NodeDialog, and one or more NodeView
instances if appropriate. Neither a dialog nor a view
needs to be implemented if no user settings or views
are needed. This schema follows the well-known
Model-View-Controller design pattern.
Fig. 4. A UML diagram of the data structure and the main
classes it relies on.
In addition, each node has a number of Inport
and Outport instances for the input and output con-
nections, which can either transport data or models.
Fig. 5 shows a UML diagram of this structure.
Fig. 5. A UML diagram of the Node and the main classes it
relies on.
3.3. Workflow Management
Workflows in KNIME are essentially graphs con-
necting nodes, or more formally, a directed acyclic
graph (DAG). The WorkflowManager allows new
nodes to be inserted and directed edges (connections)
between two nodes to be added. It also keeps
track of the status of nodes (configured, executed,
...) and returns, on demand, a pool of executable
nodes. This way the surrounding framework can
freely distribute the workload among a couple of
parallel threads or, optionally, even a distributed
cluster of servers. Thanks to the underlying graph
structure, the workflow manager is able to determine
all nodes required to be executed along the paths
leading to the node the user actually wants to execute.
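How a workflow manager can determine the nodes that still need to run is sketched below. This is a minimal Python illustration of a depth-first traversal over the DAG, not the actual WorkflowManager code; the function name `nodes_to_execute`, the predecessor dictionary, and the node names are assumptions made for the example.

```python
def nodes_to_execute(predecessors, executed, target):
    """Return the not-yet-executed nodes needed for `target`, in a valid order.

    predecessors maps each node to the nodes feeding into it. Nodes in
    `executed` are skipped together with everything upstream of them, since
    their results are assumed to be stored already.
    """
    order, visited = [], set()

    def visit(node):
        if node in visited or node in executed:
            return
        visited.add(node)
        for pred in predecessors.get(node, ()):
            visit(pred)
        order.append(node)  # appended only after all of its inputs

    visit(target)
    return order


# Hypothetical workflow: reader -> partition -> learner -> predictor -> scorer
workflow = {
    "reader": [],
    "partition": ["reader"],
    "learner": ["partition"],
    "predictor": ["learner", "partition"],
    "scorer": ["predictor"],
    "plot": ["reader"],
}
pending = nodes_to_execute(workflow, executed={"reader", "partition"}, target="scorer")
```

Nodes that are not on a path to the target, such as the hypothetical "plot" node, are never visited, which matches the on-demand execution described in the text.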
3.4. Views and Interactive Brushing
Each Node can have an arbitrary number of views
associated with it. By receiving events from a
HiLiteHandler (and sending events to it), it is possible
to mark selected points in such a view, which enables
the visual brushing described earlier. Views can range
from simple table views to more complex views on
the underlying data (e.g. scatterplots, parallel coordinates)
or the generated model (e.g. decision trees, ...).
3.5. (Fuzzy) Types in KNIME
KNIME features a modular and extensible type con-
cept. As described earlier, tables in KNIME contain
meta information about the types contained in each
column. Fig. 6 shows this setup in more detail.
Fig. 6. A schematic showing how data tables can be ac-
cessed in KNIME.
This meta information essentially enumerates all
possible types (subclasses of DataValue) that all
cells in that column implement. Particular cell
implementations (extending DataCell) can implement
one or more of these values. IntCell, for
instance, implements both IntValue and
DoubleValue, as an integer can be represented as
a double without losing any information. The reverse
is obviously not true, so DoubleCell only implements
DoubleValue. Fig. 7 shows this setup in
more detail.
Fig. 7. A schematic showing how data types are organized
Inspecting the KNIME source code reveals,
however, that DoubleCell does implement
additional extensions of DataValue, namely
FuzzyNumberValue and FuzzyIntervalValue
(there are also a few other interfaces implemented,
such as ComplexNumberValue, which we will not
focus on here). Any double can obviously also
represent a singleton fuzzy number [16] or an extreme
fuzzy interval with singleton core and support
(or a complex number with 0i), so these extensions
allow normal doubles to be treated as
fuzzy numbers or intervals as well. However,
of more interest are obviously the real implementations
FuzzyIntervalCell and FuzzyNumberCell,
which in this case represent trapezoidal and triangular
membership functions, respectively, over a real-valued
domain.
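The relationship between crisp numbers, triangular fuzzy numbers, and trapezoidal fuzzy intervals can be made concrete with a short sketch. The Python class below is an illustration only and does not mirror the KNIME cell classes; the parameter names a, b, c, d (support [a, d], core [b, c]) and the helper names `fuzzy_number` and `singleton` are notation assumed for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FuzzyInterval:
    """Trapezoidal membership function with support [a, d] and core [b, c]."""
    a: float
    b: float
    c: float
    d: float

    def __post_init__(self):
        if not (self.a <= self.b <= self.c <= self.d):
            raise ValueError("expected a <= b <= c <= d")

    def membership(self, x):
        if self.b <= x <= self.c:          # inside the core
            return 1.0
        if x <= self.a or x >= self.d:     # outside the support
            return 0.0
        if x < self.b:                     # rising flank
            return (x - self.a) / (self.b - self.a)
        return (self.d - x) / (self.d - self.c)  # falling flank


def fuzzy_number(a, b, d):
    """Triangular fuzzy number: a fuzzy interval whose core is the point b."""
    return FuzzyInterval(a, b, b, d)


def singleton(x):
    """A crisp double seen as a fuzzy number with singleton core and support."""
    return FuzzyInterval(x, x, x, x)
```

In this reading a plain double is simply the degenerate case in which all four parameters coincide, which is why a crisp value can be used wherever a fuzzy number or interval is expected.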
Fig. 8 shows how this is represented in KNIME
for a small fuzzy rule set learned on the Iris
data [11]. The meta information about the table on
the right is displayed at the top. When a table
contains fuzzy intervals/numbers the headers of
these columns represent the most common supertype
(FuzzyIntervalCell in this case) and also
some additional properties. An upper and lower
bound can be given for some types (as is the case
for the first four columns), while the nominal values
are listed for others (as can be seen in the fifth
column).
Fig. 8. An example of fuzzy intervals in KNIME. The un-
derlying meta data is at the top, while the data as it is shown
in a table can be seen underneath.
In the following we describe how a number of
prominent fuzzy learning methods can be easily em-
bedded into this general framework.
4. Fuzzy C-Means
The well-known fuzzy c-means algorithm [7] is contained
in one learning module/node of KNIME. The
configuration dialog of the node is shown in Fig. 9;
it exposes the usual parameters of the standard implementation
in addition to the setup of a noise cluster [10].
Fig. 9. The dialog of the fuzzy c-means clustering node,
displaying the available options.
Fig. 10. The output of the fuzzy c-means clustering node,
here using the bar renderer to display the degrees of membership.
Note the small button next to the "Number of clusters"
field. This indicates that this setting can be easily
controlled by a workflow variable [f]. It enables
workflows to be set up that loop over different numbers
of clusters, running e.g. a cross-validation run
and collecting the results over all cluster settings.
Fig. 10 shows the output of the clustering node
for the well known Iris data set, where the degree
of membership is displayed for each pattern. Various
rendering options are available for each column;
here a bar chart was chosen.
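For readers unfamiliar with the algorithm, the standard fuzzy c-means iteration can be sketched in a few lines. This is a plain textbook version written in pure Python for illustration, not the KNIME node's implementation: it omits the noise cluster, takes the initial centers as an argument, and runs a fixed number of iterations; the function names and the `eps` clamp for zero distances are assumptions of this sketch.

```python
import math


def memberships_for(points, centers, fuzzifier, eps=1e-12):
    """u_ik = 1 / sum_j (d_ik / d_jk)^(2 / (m - 1)) for every point k."""
    exponent = 2.0 / (fuzzifier - 1.0)  # requires fuzzifier > 1
    result = []
    for p in points:
        # Clamp distances so a point sitting on a center does not divide by 0.
        dists = [max(math.dist(p, c), eps) for c in centers]
        result.append(
            [1.0 / sum((d_i / d_j) ** exponent for d_j in dists) for d_i in dists]
        )
    return result


def fuzzy_c_means(points, centers, fuzzifier=2.0, iterations=50):
    """Alternate membership and center updates; return (centers, memberships)."""
    dims = len(points[0])
    for _ in range(iterations):
        u = memberships_for(points, centers, fuzzifier)
        new_centers = []
        for i in range(len(centers)):
            weights = [row[i] ** fuzzifier for row in u]  # u_ik^m
            total = sum(weights)
            new_centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dims)
            ))
        centers = new_centers
    return centers, memberships_for(points, centers, fuzzifier)
```

Each returned membership row sums to one, which is the quantity a bar renderer would display per pattern. A fuzzifier very close to 1 makes the exponent large and may overflow in this naive version.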
5. Fuzzy Rule Induction
For fuzzy rule induction the FRL algorithm [1, 12] was
used as a basis. The algorithm constructs fuzzy clas-
sification rules and can use nominal as well as nu-
merical attributes. For the latter, it automatically
extracts fuzzy intervals for selected attributes. One
of the convenient features of this algorithm is that
it only uses a subset of the available attributes for
each rule, resulting in so-called free fuzzy rules. The
KNIME implementation follows the published algo-
rithm closely, allowing various algorithmic options
to be set as well as different fuzzy norms. After ex-
ecution, the output is a model description in a KN-
IME internal format and a table holding the rules
as fuzzy interval constraints on each attribute plus
some additional statistics (number of covered patterns,
spread, volume, etc.). These KNIME representations
can be used to further process the rule set
but also for display purposes.
[f] Actually all parameters of a node can be controlled by workflow variables but this button makes it easier for typical variables, which
are often controlled from the outside.
Fig. 11 shows an MDS projection of the 4-dimensional
rules onto two dimensions. The color
indicates the class of each rule, the size the number
of covered patterns. More details can be found in [5].
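To illustrate what a free fuzzy rule expresses, the following sketch evaluates the degree to which a pattern fulfils a rule that constrains only some attributes. It is not the FRL learning algorithm and not KNIME code: the rule representation as a dictionary of trapezoid parameters, the attribute names, and all numeric values are hypothetical, and the minimum is used as one possible fuzzy norm.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership with support [a, d] and core [b, c]."""
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)


def rule_activation(rule, pattern, t_norm=min):
    """Degree to which `pattern` fulfils `rule`.

    Attributes that do not occur in the rule are unconstrained, which is what
    makes the rule "free". t_norm combines the per-attribute degrees.
    """
    degree = 1.0
    for attribute, (a, b, c, d) in rule.items():
        degree = t_norm(degree, trapezoid(pattern[attribute], a, b, c, d))
    return degree


# Hypothetical rule on two of four Iris-like attributes.
rule = {
    "petal_length": (1.0, 1.2, 1.8, 2.4),
    "petal_width": (0.0, 0.1, 0.5, 0.9),
}
pattern = {"sepal_length": 5.1, "sepal_width": 3.5,
           "petal_length": 2.1, "petal_width": 0.2}
activation = rule_activation(rule, pattern)
```

Passing a different `t_norm`, for example `lambda x, y: x * y`, corresponds to choosing another fuzzy norm.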
6. Visual Fuzzy Clustering
Another interesting aspect of KNIME is its visualization
capabilities. As mentioned above, views in
KNIME support cross-view selection mechanisms
(called hiliting) but views can also be more interactive.
One such example is the set of nodes for visual
fuzzy clustering. The nodes can actually perform
such clustering in multiple descriptor spaces (parallel
universes) in parallel [6].
Fig. 11. The fuzzy rules induced from the Iris data projected
onto a two-dimensional space. Size represents coverage,
color the class of the rule.
For the purpose of this paper, however, a side aspect
of this work is more interesting. The methods
described in [6] allow fuzzy clusters to be identified
and revised interactively in these parallel universes.
Fig. 12 shows a screenshot of the interactive view of
this KNIME node again for the Iris data. Each row
shows a so-called Neighborgram for the data points
of interest (usually a user specified class). A single
neighborgram represents an object’s neighborhood,
which is defined by a similarity measure. It contains
a fixed number of nearest neighbors to the centroid
object, whereby good, i.e. discriminative, neighbor-
grams will have objects of the centroid’s class in the
close vicinity.
Fig. 12. The view of the visual fuzzy clustering node. Clus-
ters are presented and fine tuned iteratively by the user.
KNIME also contains nodes for automatic clustering
using these data structures. The neighborgrams
are then constructed for all objects of interest, e.g.
belonging to the minority class, in all available uni-
verses. The learning algorithm derives cluster candi-
dates from each neighborgram and ranks these based
on their qualities (e.g. coverage or another quality
measure). The model construction is carried out in a
sequential covering-like manner, i.e. starting with all
neighborgrams and their cluster candidates, taking
the numerically best one, adding it as a cluster and
proceeding with the remaining neighborgrams while
ignoring the already covered objects. This simple
procedure already produces a set of clusters, which
potentially originate from diverse universes. Exten-
sions to this algorithm reward clusters, which group
in different universes simultaneously, and thus re-
spect overlaps. Another interesting usage scenario
of the neighborgram data structure is the possibil-
ity to display them and thus involve the user in the
learning process. Especially the ability to visually
compare the different neighborhoods of the same
object has proven to be useful in molecular appli-
cations and for the categorization of 3D objects.
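The neighborgram construction and the sequential covering procedure can be sketched as follows. This is a strongly simplified Python illustration under several assumptions, not the published algorithm: it works in a single universe, produces crisp instead of fuzzy clusters, defines a cluster candidate as the run of nearest neighbors up to the first object of another class, and uses coverage as the only quality measure. All function names are hypothetical.

```python
import math


def neighborgram(centroid, objects, k, distance=math.dist):
    """k nearest neighbors of `centroid` as sorted (distance, key, label) tuples.

    objects maps a key to a (vector, label) pair; the centroid itself is
    included at distance 0.
    """
    vec, _ = objects[centroid]
    neighbors = sorted(
        (distance(vec, v), key, label) for key, (v, label) in objects.items()
    )
    return neighbors[:k]


def cluster_candidate(gram, target_label):
    """Keys of the leading neighbors before the first object of another class."""
    members = []
    for _, key, label in gram:
        if label != target_label:
            break
        members.append(key)
    return members


def sequential_covering(objects, target_label, k, distance=math.dist):
    """Repeatedly pick the candidate covering the most uncovered objects."""
    uncovered = {key for key, (_, lab) in objects.items() if lab == target_label}
    grams = {key: neighborgram(key, objects, k, distance) for key in uncovered}
    clusters = []
    while uncovered:
        best_key, best_members = None, []
        for key in sorted(uncovered):
            members = [m for m in cluster_candidate(grams[key], target_label)
                       if m in uncovered]
            if len(members) > len(best_members):
                best_key, best_members = key, members
        if not best_members:
            break
        clusters.append((best_key, best_members))
        uncovered -= set(best_members)
    return clusters
```

In the interactive setting described above, the automatic choice of the best candidate would be replaced by a proposal that the user inspects and fine-tunes.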
7. Ongoing Work
Current development also includes a number of prototypes
for other fuzzy-based analysis and/or visualization
methods. It is worth mentioning two more
visualization-based methods.
Performing multi-dimensional scaling (or most
other projection methods from a higher dimensional
space on to two dimensions) usually loses informa-
tion pertaining to the uncertainty of the underlying
fuzzy points / fuzzy sets. This can be seen in Fig. 11
above. It is not possible to infer from the picture
whether the fuzzy sets overlap or how close their
core/support regions are. An approach presented
in [13] addresses this limitation by also showing estimates
for the spread towards neighboring points in
the projection. Fig. 13 shows an example for this
type of visualization.
Fig. 13. A prototypical view on projected fuzzy points
also displaying estimates for overlap/vicinity of neighbor-
ing points.
Another way of visualizing points in medium-high
dimensional spaces is parallel coordinates.
In [4] an extension for this type of visualization was
presented, which extends the mechanism to also
show fuzzy points or rules. Fig. 14 shows two of
the rules learned for the Iris data set.
Fig. 14. A visualization of fuzzy rules in parallel coordinates.
8. Other Extensions
In addition to native, built-in nodes, KNIME also allows
existing tools to be wrapped easily. An external
tool node allows command line tools to be launched,
whereas integrations for Matlab, R, and other data
analysis or visualization tools allow existing fuzzy
learning methods such as ANFIS to be integrated as
well.
However, a number of existing wrappers around
libraries such as LibSVM or Christian Borgelt’s Association
Rule and Itemset Mining library demonstrate
that it is also feasible to integrate existing tool
sets more tightly.
9. Applications
The fuzzy extensions for KNIME discussed here are
not only of academic interest but enable users to
use these tools easily in practice. In the following
we will show two examples. The first one demon-
strates the usefulness of the visual fuzzy cluster-
ing approach for the exploration of high through-
put screening data and the second one focuses on
a more complex molecular space modeling task
around fuzzy c-means and how the resulting fuzzy
partitioning of the space can be visually explored us-
ing the KNIME network processing modules.
9.1. Screening Data Analysis
An interesting example for the use of the visual
fuzzy clustering methods presented above was reported
in [19]. The Neighborgram-based clustering
method was applied to a well-known data set from
the National Cancer Institute, the DTP AIDS An-
tiviral Screen data set. The screen utilized a biologi-
cal assay to measure protection of human CEM cells
from HIV-1 infection. All compounds in the data
set were tested for their protection of the CEM cell;
those that did not provide at least 50% protection
were labeled as confirmed inactive (CI). All others
were tested in a second screening. Compounds that
provided protection in this screening, too, were la-
beled as confirmed active (CA), the remaining ones
as moderately active (CM). Those screening results
and chemical structural data on compounds that are
not protected by a confidentiality agreement can be
accessed online. 41,316 compounds are available,
of which we have used 36,045. A total of 325 belong
to class CA, 877 are of class CM and the remaining
34,843 are of class CI. Note that the class distribution
for this data set is very unbalanced. There are
about 100 times as many inactive compounds (CI)
as there are active ones (CA), which is very common
for this type of screening data analysis: although it
is a relatively large data set, it has an unbalanced
class distribution with the main focus on a minority
class, the active compounds. The focus of analysis
is on identifying internal structures in the set of ac-
tive compounds that appeared to protect CEM cells
from the HIV-1 infection.
In order to generate Neighborgrams for this
dataset, a distance measure needs to be defined.
We initially computed Fingerprint descriptors [8],
which represent each compound through a 990-
dimensional bit string. Each bit represents a
(hashed) specific chemical substructure of interest.
The distance metric used was a Tanimoto distance,
which computes the number of bits that are different
between two vectors normalized over the number of
bits that are turned on in the union of the two vec-
tors. This type of distance function is often used in
cases like this, where the used bit vectors are only
sparsely occupied with 1s.
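The distance described above can be written down directly. In the following Python sketch each fingerprint is represented by the set of indices of its bits that are turned on, which is a convenient representation for sparse bit vectors; the function name and the convention of returning 0 for two all-zero fingerprints are assumptions of this example.

```python
def tanimoto_distance(bits_a, bits_b):
    """Number of differing bits divided by the number of bits on in the union.

    bits_a and bits_b are sets of indices of the bits that are turned on.
    """
    union = bits_a | bits_b
    if not union:
        # Assumed convention: two all-zero fingerprints are treated as identical.
        return 0.0
    return len(bits_a ^ bits_b) / len(union)


# Two hypothetical fingerprints sharing two of four set bits.
d = tanimoto_distance({1, 2, 3}, {2, 3, 4})
```

Positions where both vectors are 0 never enter the computation, so the many shared zeros of sparse fingerprints do not make unrelated compounds look similar.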
Experiments with this (and other similar) data
sets demonstrate well how interactive clustering in
combination with Neighborgrams helps to inject do-
main knowledge in the clustering process and how
Neighborgrams help to inspect promising cluster
candidates quickly and visually. Fig. 15 shows one
example of a cluster discovered during the explo-
ration, grouping together parts of the chemical fam-
ily of Azido Pyrimidines, probably one of the best-
known classes of active compounds for HIV.
This application demonstrates perfectly how
fuzzy clustering techniques are critical for real world
applications. Attempting to partition this type of
screening data into crisp clusters would be abso-
lutely futile due to the underlying fairly noisy data.
Instead, suggesting clusters to the user, having
him/her fine-tune the (fuzzy) boundaries, and
then continuing the interactive clustering procedure
allows the user to inject background knowledge into
the clustering process on the fly.
Fig. 15. A fuzzy cluster of the NIH-Aids data centered
around compound #646436. This cluster nicely covers part
of one of the most well-known classes of active compounds:
Azido Pyrimidines.
9.2. Molecular Space Modeling
Trying to get a first impression of a large molecular
database is often a challenge because the contained
compounds often do not belong to one group
alone but share properties with more than one chemical
group. This naturally lends itself to
a modeling of the space using fuzzy techniques.
The workflow depicted in Fig. 17 generates such
an overview using a number of more complex sub
workflows:
• on the left, the information is read from two files,
one containing the structures, the other one containing
additional information about each compound;
• the next metanode contains a subworkflow creating
additional chemical descriptors;
• the internals of the next metanode are displayed
in Fig. 18; it determines optimal settings for the
parameters of the fuzzy C-means algorithm: the
number of clusters and the fuzzifier;
• the fuzzy C-means node then uses these settings
for the final clustering of the entire dataset;
• the last three metanodes create an overview of the
clusters, sample the data down so that structures
can be displayed meaningfully later, and create
the actual network which is then displayed by the
final node.
The sub workflow shown in Fig. 18 is particu-
larly interesting here because it illustrates the use of
the KNIME looping concept. The loop start node
on the left takes its input from a table with several
settings for the number of clusters and the fuzzifier
of the fuzzy-C-means applied to one portion of the
overall data. The cluster assigner on the other parti-
tion of our data is then evaluated for its quality (es-
sentially measuring clustering quality indices based
on within and across cluster distances). The loop
end node collects the summary information of each
run and the sorter then picks the best iteration and re-
turns it as variables to be fed into the fuzzy C-means
in the overall workflow. Setups similar to this can
also be used for model selection across different
model classes; the KNIME example server holds a
couple of examples for this as well.
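The structure of such a parameter loop can be sketched independently of the concrete learner. The Python function below is a generic illustration, not a KNIME workflow: `fit` and `score` are placeholder callables supplied by the caller, the settings are given as a list of dictionaries, and a higher score is assumed to indicate better clustering quality.

```python
def select_best_setting(settings, fit, score, train_part, eval_part):
    """Try every setting, score it on held-out data, and return the best one."""
    results = []
    for setting in settings:                      # loop start: one run per row
        model = fit(train_part, **setting)        # e.g. fuzzy c-means on one part
        quality = score(model, eval_part)         # quality on the other part
        results.append((quality, setting))        # loop end: collect summaries
    results.sort(key=lambda r: r[0], reverse=True)  # sorter: best run first
    return results[0][1], results


# Hypothetical settings table for the number of clusters and the fuzzifier.
settings = [{"num_clusters": c, "fuzzifier": m}
            for c in (2, 3, 4) for m in (1.5, 2.0)]
```

The returned best setting plays the role of the variables that are fed back into the final clustering node.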
The metanode on the bottom ("Create Cluster
Overview") extracts the most common substructure
of all molecules belonging to a cluster; the resulting
table is shown in Fig. 16. For a chemist, those representations
quickly reveal the main composition of
the underlying database. However, such crisp reductions
do not reveal further insights.
Fig. 16. The most common substructures of the five clusters.
We do not go into much detail about the internals
of the subsequent metanodes as they mainly
focus on building a network of cluster centers and
molecules (as nodes) and introduce edges between
those weighted by the corresponding degree of
membership. One resulting view is shown in Fig. 19.
One can quickly see the main constituents of the
clusters, illustrated by a couple of representative
molecular structures with a high degree of member-
ship only to that one cluster (we filtered out the ma-
jority of the compounds strongly belonging to one
class for the sake of readability). Compounds that are
more ambiguous are positioned in between two, or
in some cases three, clusters. These are also
molecular structures that can be assigned to a specific
chemical group less clearly. The true power
of such visualizations lies in their interactivity, of
course, just like in many of the other examples. The
KNIME network visualization extension also allows
points to be highlighted and the view to be zoomed
in to focus on details of the network.
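The construction of such a membership-weighted network can be sketched as follows. This Python function is an illustration only and does not use the KNIME network extension: the edge list representation, the cluster node names, and the `min_weight` threshold for dropping very weak memberships are assumptions of this example.

```python
def membership_edges(memberships, min_weight=0.2):
    """Create (molecule, cluster, weight) edges from fuzzy memberships.

    memberships maps a molecule key to its list of membership degrees, one per
    cluster. Edges below min_weight are omitted to keep the network readable.
    """
    edges = []
    for molecule, degrees in memberships.items():
        for cluster_index, degree in enumerate(degrees):
            if degree >= min_weight:
                edges.append((molecule, f"cluster_{cluster_index}", degree))
    return edges


# Hypothetical memberships: one clear-cut and one ambiguous molecule.
edges = membership_edges({"m1": [0.9, 0.1], "m2": [0.5, 0.5]})
```

A molecule with a single strong edge ends up next to its cluster node in a force-based layout, while an ambiguous molecule with several comparable edges is pulled between the corresponding cluster nodes.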
10. Conclusions
We have described fuzzy extensions in KNIME and
illustrated how classic fuzzy learning methods are
easily integrated. We also illustrated how some of
the techniques described here can be used in real
world applications such as the visual clustering of
high throughput screening data and the modeling of
molecular spaces.
Figure 17: A workflow for the modeling and visualization of a molecular space.
KNIME offers a solid basis for more fuzzy learn-
ing and visualization methods and we look forward
to collaborating with the fuzzy community to extend
this area of tool coverage in KNIME further.
We thank the other members of the KNIME Team
and the very active KNIME Community!
References
1. M.R. Berthold. “Mixed Fuzzy Rule Formation,” In In-
ternational Journal of Approximate Reasoning (IJAR),
32, 67–84, Elsevier, 2003.
2. M.R. Berthold, N. Cebron, F. Dill, T.R. Gabriel,
T. K¨
otter, T. Meinl, P. Ohl, C. Sieb, K. Thiel, and
B. Wiswedel. “KNIME: The Konstanz Information
Miner, In Studies in Classification, Data Analysis,
and Knowledge Organization (GfKL 2007). Springer,
3. M.R. Berthold, N. Cebron, F. Dill, T. R. Gabriel,
T. K¨
otter, T. Meinl, P. Ohl, K. Thiel, and B. Wiswedel.
“KNIME: The Konstanz Information Miner. Version
2.0 and Beyond, In SIGKDD Explorations. ACM
Press, 11(1), 2009.
4. M.R. Berthold and L.O. Hall. “Visualizing Fuzzy
Points in Parallel Coordinates, In IEEE Transactions
on Fuzzy Systems,11(3), 369–374, 2003.
5. M.R. Berthold and R. Holve. “Visualizing High Di-
mensional Fuzzy Rules,” In Proceedings of NAFIPS,
64–68, IEEE Press, 2000.
6. M.R. Berthold, B. Wiswedel, and D.E. Patterson.
“Interactive Exploration of Fuzzy Clusters Using
Neighborgrams,” In Fuzzy Sets and Systems, 149(1),
21–37, Elsevier, 2005.
7. J.C. Bezdek. “Pattern Recognition with Fuzzy Objective
Function Algorithms,” Plenum Press, New York.
Figure 18: The subworkflow iterating over several settings of the fuzzy c-means algorithm to identify the optimal
number of clusters and the value of the fuzzifier.
Figure 19: The final network as displayed by KNIME. One can nicely see how a couple of molecular structures
fall clearly within one cluster. Others, however, belong to more than one cluster and have therefore substantial
connections to more than one cluster node.
8. R.D. Clark. “Relative and absolute diversity analysis
of combinatorial libraries,” In Combinatorial Library
Design and Evaluation, Marcel Dekker, New York,
337–362, 2001.
9. T. Curk, J. Demsar, Q. Xu, G. Leban, U. Petrovic,
I. Bratko, G. Shaulsky, and B. Zupan. “Microarray
data mining with visual programming,” In
Bioinformatics, 21(3), 396–408, 2005.
10. R.N. Davé. “Characterization and detection of noise in
clustering,” In Pattern Recognition Letters, 12,
657–664, 1991.
11. R.A. Fisher. “The use of multiple measurements in
taxonomic problems,” In Annals of Eugenics, 7(2),
179–188, 1936.
12. Th.R. Gabriel and M.R. Berthold. “Influence of fuzzy
norms and other heuristics on ‘Mixed Fuzzy Rule
Formation’,” In International Journal of Approximate
Reasoning (IJAR), 35, 195–202, Elsevier, 2004.
13. Th.R. Gabriel, K. Thiel, and M.R. Berthold. “Rule
Visualization based on Multi-Dimensional Scaling,”
In IEEE International Conference on Fuzzy Systems.
14. M. Hall, E. Frank, G. Holmes, B. Pfahringer,
P. Reutemann, and I.H. Witten. “The WEKA Data Mining
Software: An Update,” In SIGKDD Explorations, 11(1).
15. A. Guazzelli, W. Lin, and T. Jena. “PMML in Action:
Unleashing the Power of Open Standards for Data
Mining and Predictive Analytics.”
16. M. Hanss. “Applied Fuzzy Arithmetic, An Introduction
with Engineering Applications,” Springer, 2005.
17. I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz,
and T. Euler. “YALE: Rapid Prototyping for Complex
Data Mining Tasks,” In Proceedings of the 12th
ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 2006.
18. D. Morent, K. Stathatos, W.-C. Lin, and
M.R. Berthold. “Comprehensive PMML Preprocessing
in KNIME,” In Proceedings of the PMML
Workshop, KDD, 2011.
19. B. Wiswedel, D.E. Patterson, and M.R. Berthold.
“Interactive Exploration of Fuzzy Clusters,” In: J.V. de
Oliveira and W. Pedrycz (eds), Advances in Fuzzy
Clustering and its Applications. John Wiley and Sons,
123–136, 2007.
20. L.A. Zadeh. “Fuzzy sets,” In Information and Control,
8(3), 338–353, 1965.