Power to the People!
Meta-Algorithmic Modelling in Applied Data Science
Marco Spruit and Raj Jagesar
Information and Computing Sciences, Utrecht University, Princetonplein 5, Utrecht, The Netherlands
Keywords: Applied Data Science, Meta-algorithmic Modelling, Machine Learning, Big Data.
Abstract: This position paper first defines the research field of applied data science at the intersection of domain
expertise, data mining, and engineering capabilities, with particular attention to analytical applications. We
then propose a meta-algorithmic approach for applied data science with societal impact based on activity
recipes. Our people-centred motto from an applied data science perspective translates to design science
research which focuses on empowering domain experts to sensibly apply data mining techniques through
prototypical software implementations supported by meta-algorithmic recipes.
1 APPLIED DATA SCIENCE
Pritzker and May (2015:7) define Data Science as
“the extraction of actionable knowledge directly
from data through a process of discovery, or
hypothesis formulation and hypothesis testing”. In
addition, they also relate the skills needed in Data
Science. Based on their observations we propose to
define Applied Data Science as follows:
Applied Data Science (ADS) is the knowledge
discovery process in which analytical
applications are designed and evaluated to
improve the daily practices of domain experts.
Note that this is in contrast to fundamental data
science which aims to develop novel statistical and
machine learning techniques for performing Data
Science. In Applied Data Science the objective is to
develop novel analytical applications to improve the
real world around us. From the perspective of the
Data Science Venn diagram (Pritzker and May,
2015:9), Applied Data Science focuses on the
Analytical applications intersection of
Domain expertise and Engineering capabilities.
Finally, we observe an analogy with the ubiquitous
people-process-technology model where technology
aligns with machine learning algorithms,
organisational processes are operationalised through
analytical software implementations, and domain
expertise is captured from, and enriched for, skilled
professionals. Hence the motto: power to the people!
Figure 1 contextualises the research field of Applied
Data Science and the skills it requires.
It is from this novel Applied Data Science research
perspective that we investigate the core data science
topic of machine learning in the remainder of this
paper, using a meta-algorithmic modelling approach.
Figure 1: Applied Data Science in context.
Spruit, M. and Jagesar, R.
Power to the People! - Meta-Algorithmic Modelling in Applied Data Science.
DOI: 10.5220/0006081604000406
In Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2016) - Volume 1: KDIR, pages 400-406
ISBN: 978-989-758-203-5
Copyright © 2016 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
2 MACHINE LEARNING
With the steadily growing availability of data
storage space and computing power, advanced data
mining efforts are coming within reach of
an increasing number of people. One common approach to
performing a data mining project, and central to this
ADS type of research, is to apply Machine Learning
(ML) techniques. The application of ML techniques
spans various disciplines like mathematics, statistics
and computer science. These disciplines combined
support the act of learning and result in models that
are fitted to data. The challenge is to derive models
that are accurate in the sense that they reflect the
underlying patterns in the data whilst ignoring
peculiarities that do not represent reality. A popular
and well-known purpose of these models is to make
predictions on new and unseen examples of data.
However, ML techniques are also well suited to
exploring the underlying patterns of a dataset. More
often than not, machine learning techniques are
employed to learn about the structure of a data set
(Hall et al., 2009). ML as a research field can be
considered to be positioned at the heart of
fundamental data science, as it requires both data
mining and engineering expertise. This is also
reflected in Figure 1 (Algorithms, in green colour).
3 PROBLEM STATEMENT
However, despite the growing usage and popularity
of machine learning techniques in data mining
projects, correctly applying these techniques remains
quite a challenge. We list the three main challenges
below:
1. Depth versus breadth: The ML field covers
many different use cases, each with its own
sizeable body of literature. This literature is
usually heavy on mathematical terminology and
aimed at the computer science community,
which hinders researchers from other fields in
learning and correctly applying machine
learning techniques in their own research
(Domingos, 2012).
2. Selection versus configuration: In line with
the previous point, applying machine
learning techniques confronts users with
many degrees of freedom in how to assemble
and configure a learning system. For
example, algorithm performance is largely
determined by parameter settings, and these
settings are specific to each class of
algorithm. In practice, however, end users
usually do not know enough to find optimal
parameter settings (Yoo et al., 2012). Many
users leave the parameters at their default
settings and base algorithm selection on
reputation and/or intuitive appeal (Thornton
et al., 2013). This may lead researchers to
use underperforming algorithms and
obtain suboptimal results.
3. Accuracy versus transparency: When
creating models, there is currently a
trade-off between accuracy and
transparency (Kamwa et al., 2012). In
practice this means that algorithms which
yield a high degree of insight into the data
tend not to perform as well as their
non-transparent (black box) counterparts,
and vice versa.
In order to get a better grip on these challenges, we
propose a meta-algorithmic modelling approach,
which we define as follows:
Meta-Algorithmic Modelling (MAM) is an
engineering discipline where sequences of
algorithm selection and configuration activities
are specified deterministically for performing
analytical tasks based on problem-specific data
input characteristics and process preferences.
MAM as a discipline is inspired by Method
Engineering, “the engineering discipline to design,
construct and adapt methods, techniques and tools
for the development of information systems”
(Brinkkemper, 1996). In related work, Simke (2013)
describes a reusable, broadly-applicable set of
design patterns to empower intelligent system
architects. Finally, MAM also conceptually
resembles the Theory of Inventive Problem Solving
(TRIZ), a method for creative design thinking and
real problem solving, partly due to its “Meta-
Algorithm of Invention” (Orloff, 2016).
The strategic goal of MAM is to provide highly
understandable and deterministic method fragments,
i.e. activity recipes, to guide application domain
experts without in-depth ML expertise step-by-step
through an optimized ML process, following Vleugel
et al. (2010) and Pachidi and Spruit (2015), among
others, based on the Design Science Research
approach (Hevner et al., 2004). We thereby promote
reuse of state-of-the-art ML knowledge and best
practices in the appropriate application of ML
techniques, whilst at the same time providing
information on how to cope with challenges like
parameter optimization and model transparency
(Pachidi et al., 2014).
We argue that this MAM approach aligns especially
well with the Applied Data Science perspective
which we pursue in this research.
4 RESEARCH APPROACH
Taking into account the problem statement above,
we formulate the overarching research question
as follows:
How can meta-algorithmic modelling as a
domain independent approach in an applied data
science context be operationalised to guide the
process of constructing transparent machine
learning models for possible use by application
domain experts?
We will initially proceed with a limited scope: the
creation of method fragments focused on supervised
machine learning for binary classification tasks on
structured data. This type of machine learning is
concerned with deriving models from (training) data
that are already available. It is also one
of the most widely applied and mature areas within
machine learning practice (Kotsiantis et al., 2007).
First a theoretical foundation is established on
the subjects of data mining, machine learning and
model transparency. The concepts derived from this
foundation are then grouped using the structure of a
data mining process model. For our purposes we
apply the base structure of the CRISP-DM process
model and group the concepts into the following
phases: data understanding, data preparation, and
modelling & evaluation. Our method fragments will
be composed using the same structure.
In this work we employ method engineering
fragments notation to specify the meta-algorithmic
models. More specifically, we apply the
meta-modelling approach, which yields a
process-deliverable diagram (PDD; van de Weerd and
Brinkkemper, 2008). A PDD consists of two diagrams:
the left-hand side shows a UML activity diagram
(processes) and the right-hand side shows a UML
class diagram
(concepts or deliverables). Both diagrams are
integrated and display how the activities are tied to
each deliverable. Lastly, the activities and the
concepts are each explained in separate tables.
However, due to page restrictions these explanatory
tables are excluded from this paper.
5 MODEL TRANSPARENCY
The concept of model transparency occasionally
surfaces in the literature, in particular in the
context of decision support systems, where it must
be clear how a system came to a certain
(classification) decision (Johansson et al., 2004;
Olson et al., 2012; Kamwa et al., 2012; Allahyari
and Lavesson, 2011).
There is consensus in the literature about the
types of algorithms that are known to yield
transparent and non-transparent (black box) models.
Both tree and rule models are considered
transparent and highly interpretable. On the other
hand, artificial neural networks, support vector
machines and ensembles like random forests are
considered black boxes (Johansson et al., 2004;
Olson et al., 2012; Kamwa et al., 2012).
Currently there is no common ground on the
subject of tree and rule model complexity. Although
these models are considered transparent, critics
note that the interpretative value of complex tree
and rule models should be questioned (Johansson et
al., 2004). On the other hand, a study on model
understandability found indications that the
assumption that simpler models are more
understandable does not always hold either
(Allahyari and Lavesson, 2011).
The choice between a transparent and a
non-transparent modelling technique is not
immediately obvious, since there is a trade-off to
be made between accuracy and transparency. Black box
modelling techniques generally have better
classification and prediction performance, but this
comes at the cost of interpretability. We found two
solutions in the literature that aim to bridge
this gap.
The first solution aims to extract
comprehensible information in the form of rules and
trees from black box modelling techniques like
artificial neural networks and support vector
machines (Johansson et al., 2004; Martens et al.,
2007; Setiono, 2003). This practice delivers
comprehensible information but is criticized for
being unrepresentative of the original model due to
oversimplification (Cortez and Embrechts, 2013).
KDIR 2016 - 8th International Conference on Knowledge Discovery and Information Retrieval
The second solution approaches the problem
from the opposite direction by improving the
performance of a transparent modelling technique to
a level where it competes with its black box
counterparts. This approach applies a variant of
linear modelling known as generalized additive
modelling (GAM), enriched with information on
pairwise interactions between features (Lou et al.,
2013). This makes it possible to retain the
explanatory value of linear models and at the same
time achieve high performance in terms of
classification accuracy. The technique exposes the
contribution of each feature in relation to the
outcome values.
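As a sketch of the model form described by Lou et al. (2013), a generalized additive model with pairwise interactions can be written as follows. The symbols g, f_j, f_{jk} and S are notation assumed here for illustration and do not appear in this paper:

```latex
g\bigl(\mathbb{E}[y]\bigr) \;=\; \beta_0 \;+\; \sum_{j} f_j(x_j) \;+\; \sum_{(j,k) \in \mathcal{S}} f_{jk}(x_j, x_k)
```

Here g is a link function (for instance the logit in binary classification), each f_j is a shape function of a single feature, and S is a small selected set of feature pairs. Because every term depends on at most two features, each term can be plotted and inspected on its own, which is what keeps the model interpretable.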
6 METHOD FRAGMENTS
In this section we present the method fragments as
derived from our literature study of the domains of
data mining and machine learning. All analytical
recipes are accompanied by a brief description.
6.1 Data Understanding
Before starting any data mining project it is
important to become familiar with the data that will
be analyzed. The goal is to improve one’s
understanding of the data by using (statistical) tools
to summarize, plot and review data points in the data
set. This practice is called exploratory data analysis
(EDA) (Tukey, 1977).
Figure 2: Data understanding method fragment.
The data understanding phase as depicted in
Figure 2 revolves around the application of
exploratory data analysis (EDA) techniques to
generate visualizations and tables to gain a first
insight into the relationships between the features of
a data set. A high number of features can make these
deliverables difficult to interpret. Therefore, the
activity flow shows that in cases of high dimensional
data sets it is recommended to pre-select a subset of
features using a feature selection technique.
We recommend creating histograms, pairwise
scatterplots and correlation matrices
to start exploring relationships between the
features of a dataset. Histograms and pairwise
scatterplots serve to visualize overlap
and separability between the classes of a
data set. Feature correlation matrices are used to
determine which features are redundant; these
should be removed when applying the naive Bayes
(probabilistic) model. Menger et al. (2016) notably
provide a more detailed recipe for performing
interactive visualisation-driven EDA.
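To make the redundancy check concrete, the following minimal sketch computes a feature correlation matrix with NumPy and lists strongly correlated feature pairs. The synthetic features, the helper name `redundant_pairs` and the 0.9 threshold are illustrative assumptions, not part of the method fragment.

```python
import numpy as np


def redundant_pairs(X, names, threshold=0.9):
    """Return the correlation matrix and the feature pairs whose absolute
    Pearson correlation exceeds the (assumed) threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns of X are features
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], round(float(corr[i, j]), 3)))
    return corr, pairs


# Synthetic data: two near-duplicate height features and an unrelated one.
rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=200)
height_in = height_cm / 2.54 + rng.normal(0, 0.1, size=200)
age = rng.normal(40, 12, size=200)

X = np.column_stack([height_cm, height_in, age])
names = ["height_cm", "height_in", "age"]

corr, pairs = redundant_pairs(X, names)
print(pairs)
```

In this toy data set the two height columns are flagged as a redundant pair, so one of them could be dropped before fitting a naive Bayes model.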
6.2 Data Preparation
The data preparation phase (Figure 3) consists of
three main activities: dataset construction, feature
extraction, and modelling technique preparation.
Figure 3: Data preparation method fragment.
Dataset construction: The dataset construction
activity entails loading the raw data and engineering
new features based on the raw data. Feature
engineering can be a substantial task but is difficult
to capture in a method since it is highly situational.
The last task within this activity is feature selection.
Not all features in a given data set have the same
informative importance or any importance at all.
This can be problematic as some classification
algorithms are designed to make the most of the data
that is presented to them. In these cases even
irrelevant features will eventually be included in the
model. In other words, the model will be overfitted to
the data, which means that the classification
algorithm has included the noise as an integral part
of the model (Tang, Alelyani, and Liu, 2014). The
solution is to select a subset of only the most
informative features, reducing the dimensionality
(number of features) of the data set in the process.
Feature selection is performed either manually,
using EDA techniques, or automatically, using a
feature selection algorithm.
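As an illustration of algorithmic feature selection, the sketch below ranks features by their absolute correlation with the binary class label and keeps the top k. This simple filter is only one possible choice among many selection algorithms; the function name `select_top_k`, the synthetic data and k=1 are assumptions made for the example.

```python
import numpy as np


def select_top_k(X, y, k):
    """Score each feature by |Pearson correlation| with the label and
    return the column indices of the k highest-scoring features."""
    scores = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    )
    keep = np.sort(np.argsort(scores)[::-1][:k])
    return keep, scores


# Synthetic binary classification data: one informative feature, two noise features.
rng = np.random.default_rng(1)
n = 300
y = rng.integers(0, 2, size=n)
informative = y + rng.normal(0, 0.5, size=n)  # carries class signal
noise_a = rng.normal(0, 1, size=n)            # irrelevant to the label
noise_b = rng.normal(0, 1, size=n)            # irrelevant to the label
X = np.column_stack([noise_a, informative, noise_b])

keep, scores = select_top_k(X, y, k=1)
X_reduced = X[:, keep]  # data set with reduced dimensionality
print(keep, scores.round(3))
```

A filter of this kind looks at each feature in isolation, so it cannot detect features that are only informative in combination; that limitation is one reason why manual, EDA-based selection remains a useful alternative.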
Feature extraction: The feature extraction
activity entails the application of projection
methods. Projection methods like principal component
analysis are automated feature engineering
techniques that aim to best describe the main
differentiators of a data set, creating a small number
of features in the process (dimensionality reduction).
Transparency between the outcome variable and the
original features may be lost when using a
projection technique.
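The following sketch shows principal component analysis as one example of a projection method, implemented directly with NumPy's singular value decomposition. The function name `pca` and the synthetic five-feature data set are assumptions for illustration only.

```python
import numpy as np


def pca(X, n_components):
    """Project X onto its first n_components principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)  # variance ratio per component
    return X_centered @ components.T, components, explained[:n_components]


# Synthetic data: five observed features driven by two hidden factors.
rng = np.random.default_rng(2)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + rng.normal(0, 0.01, size=(200, 5))

Z, components, explained = pca(X, n_components=2)
print(Z.shape, explained.round(4))
```

Each row of `components` mixes all five original features, which illustrates why the link between the outcome variable and the original features becomes harder to read after projection.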
Modelling technique preparation: Lastly, the
modelling technique preparation activity consists of
three paths that define preparation steps depending
on the model type chosen by the data scientist.
When tree and rule models are required due to
model transparency concerns, no additional
preparation steps are necessary, since modern
algorithm implementations take care of preparation
steps internally. Linear models and the probabilistic
naive Bayes model can be chosen due to performance
concerns. Both types require their own conversion
steps in order to be able to process the data in the
next phase of the DM process. The naive Bayes
model type, for example, requires redundant features
to be removed, since they will negatively influence
classifier results. Linear model types require input
data to be represented in numerical form, so
transformation steps should be performed as needed,
e.g. the binarization of categorical data. Note,
however, that some concrete algorithm
implementations of linear models may perform these
steps as part of their internal workings.
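The binarization of categorical data mentioned above can be sketched in plain Python as follows. The record layout, the column name `smoker` and the helper `one_hot` are hypothetical and serve only to show the transformation.

```python
def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}={category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


rows = [
    {"age": 34, "smoker": "yes"},
    {"age": 51, "smoker": "no"},
    {"age": 27, "smoker": "unknown"},
]

encoded = one_hot(rows, "smoker")
for row in encoded:
    print(row)
```

After this step every value is numeric, so the records can be passed to a linear model that does not perform the conversion internally.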
6.3 Modelling and Evaluation
The modelling and evaluation method fragment
(Figure 4) consists of three activities aimed at
deriving classification models from data sets: search
space definition, find optimal parameters, and
predict & classify.
Figure 4: Modelling & evaluation method fragment.
Search space definition: The search space
definition activity includes a route for exploring
fully automated model (and parameter) selection
when analyzing the data set. Currently one
experimental implementation exists in the form of
Auto-WEKA (Thornton et al., 2013). Auto-WEKA is an
experimental machine learning toolkit that almost
completely relies on Bayesian optimization
techniques to generate models. The toolkit is unique
in the sense that it considers the choice of
modelling technique as part of the problem space as
well. This relieves potential users from having to
manually select and test algorithms; instead,
Auto-WEKA uses all the algorithms that are part of
the WEKA toolkit and determines which algorithm
generates the best results for a given data set.
Currently, due to the novelty of this technique, the
approach should be used to gain initial insight into
model types that may perform best on the provided
data set.
Find optimal parameters: Next, the application
of automated search strategies is central to the
following activity named “Find optimal parameters”.
Recall from our problem statement that the
performance of algorithms is highly dependent on
how they are configured, a problem known as
(hyper)parameter optimization. Getting optimal
performance from a modelling technique means
finding the right (combination of) parameter
settings. The best settings will be different for each
data set, which necessitates an automated means of
determining these values. Search strategies like grid
search, random search and Bayesian optimization
support the task of (intelligently) iterating over
combinations of parameters, evaluating the
performance at each attempt.
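To illustrate how two of these search strategies traverse a parameter space, the sketch below enumerates a full grid and, alternatively, draws a fixed budget of random configurations. The parameter names and value ranges are hypothetical and not tied to any particular algorithm implementation; the evaluation of each configuration is left out.

```python
import itertools
import random

# Hypothetical search space for a tree learner (illustrative values only).
space = {
    "max_depth": [2, 4, 6, 8, 10],
    "min_samples_leaf": [1, 5, 10, 20],
    "criterion": ["gini", "entropy"],
}


def grid(space):
    """Yield every combination of parameter values (exhaustive search)."""
    keys = list(space)
    for values in itertools.product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


def random_configs(space, n_iter, seed=0):
    """Yield n_iter configurations sampled independently at random."""
    rng = random.Random(seed)
    for _ in range(n_iter):
        yield {key: rng.choice(values) for key, values in space.items()}


grid_configs = list(grid(space))                          # 5 * 4 * 2 = 40
sampled_configs = list(random_configs(space, n_iter=10))  # fixed budget of 10

print(len(grid_configs), len(sampled_configs))
```

The grid grows multiplicatively with every added parameter, whereas the random strategy keeps the number of evaluations fixed at the chosen budget.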
This task requires the data scientist to decide on
various factors that determine how the search for the
best configuration will be executed. We recommend
considering at least the following “Top 5” factors:
1. Model type: The model type itself. The data
scientist can choose to iterate over different
modelling techniques (tree, rule, ensemble,
linear and probabilistic) to find out which type
works best given a specific data set. This
approach is similar to Auto-WEKA since it
includes the model type as part of the problem
(search) space.
2. Parameter types: This key factor comprises the
parameters that belong to a specific model type.
Parameter types can range from procedural
configuration settings to the specific number of
times a procedure is performed.
3. Resampling method: The resampling method
used to support the evaluation process.
Resampling methods apply various procedures
to train and test models on the data provided to
them. For example, the holdout method splits
the data set into a training set and a test set,
usually in a 70%-30% ratio. The model is first
trained on the training set and afterwards tested
on the unseen instances of the test set. Other
resampling methods include (stratified) k-fold
cross-validation, leave-one-out and
bootstrapping.
4. Search strategy: The search strategy itself. Grid
search is exhaustive by nature, meaning that all
parameter combinations in the specified grid will
be tried. This can be costly both in time and
computing resources. Random search samples a
limited number of combinations, while Bayesian
optimization uses the results of earlier attempts
to choose the next one; both aim to find a good
set of parameters in significantly fewer tries.
5. Performance metrics: The performance
measure(s) used to evaluate each attempt.
Common measures are classification accuracy,
true positive rate (TPR), false positive rate
(FPR) and the area under the ROC curve (AUC).
Using a combination of measures is necessary,
since classification accuracy by itself is known
to misrepresent the performance of a model in
the case of class imbalances in the data set.
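The point about class imbalance can be shown with a small worked example. The sketch below assumes a hypothetical data set with 5 positive and 95 negative instances and a trivial classifier that always predicts the negative class; the helper `metrics` is written for this illustration.

```python
def metrics(y_true, y_pred):
    """Compute accuracy, true positive rate and false positive rate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "tpr": tp / (tp + fn) if (tp + fn) else 0.0,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Imbalanced labels: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
# A trivial "model" that always predicts the majority (negative) class.
always_negative = [0] * 100

result = metrics(y_true, always_negative)
print(result)  # {'accuracy': 0.95, 'tpr': 0.0, 'fpr': 0.0}
```

An accuracy of 95% looks strong, yet the true positive rate of 0 shows that not a single positive instance is detected, which is why accuracy should be read alongside other measures.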
The factors discussed above are common to the
search strategies outlined in this section, and
combined they form the template that makes up the
complete problem space through which the search
will be executed. The structure and accessibility of
this approach are in line with the design goal of this
research, where we aim to construct a method that
enables a user to create optimal models.
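The sketch below puts several of these factors together in plain Python: a fixed model type (a small k-nearest-neighbour classifier written for this example), one parameter (`k`), 5-fold cross-validation as the resampling method, grid search as the search strategy and accuracy as the metric. All names, the synthetic data and the parameter grid are assumptions; for brevity only a single metric is used, although a combination of measures is advisable.

```python
import itertools
import random


def make_data(n, seed=0):
    """Two overlapping 2-D Gaussian classes (synthetic, illustration only)."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        centre = 1.5 * label
        point = (rng.gauss(centre, 1.0), rng.gauss(centre, 1.0))
        data.append((point, label))
    rng.shuffle(data)
    return data


def dist2(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def knn_predict(train, point, k):
    """Majority vote among the k nearest training points (odd k avoids ties)."""
    nearest = sorted(train, key=lambda item: dist2(item[0], point))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def cv_accuracy(data, k, n_folds=5):
    """Mean accuracy over n_folds cross-validation folds."""
    scores = []
    for fold in range(n_folds):
        test = data[fold::n_folds]
        train = [item for i, item in enumerate(data) if i % n_folds != fold]
        hits = sum(knn_predict(train, x, k) == y for x, y in test)
        scores.append(hits / len(test))
    return sum(scores) / n_folds


data = make_data(120)
param_grid = {"k": [1, 3, 5, 7, 9]}

# Grid search: evaluate every parameter combination with cross-validation.
results = {}
for values in itertools.product(*param_grid.values()):
    params = dict(zip(param_grid, values))
    results[params["k"]] = cv_accuracy(data, **params)

best_k = max(results, key=results.get)
print(best_k, round(results[best_k], 3))
```

Adding a second model type would only require another entry in the search space and a matching prediction function, which mirrors the idea of treating the model type itself as part of the problem space.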
Predict & classify: Lastly, the activity “predict &
classify” is followed to conclude a DM project. The
model derived from the parameter search activity
can now be used to classify new and unseen data.
7 FUTURE RESEARCH
We are currently extending and refining the method
fragments outlined in Figures 2-4, with the goal of
ultimately evaluating the method on a broad array of
data sets, ranging from small to large and from low-
to high-dimensional. We are interested to see how
classification performance holds up over different
variants of data sets. We are also especially
interested in studying, by means of qualitative
research methods, to what extent the methods support
non-data scientists in their efforts to perform DM
projects.
Next, the problem space of our research could be
broadened to cover cases outside of the domain of
supervised binary classification, e.g. multiclass,
regression and image analysis problems. Method
fragments could be created to deal with (sub)cases in
the aforementioned domains.
Furthermore, the structures defined in these
methods could be used for the development and
enhancement of data mining tools. Auto-WEKA is
an example of such a tool but follows a rigid
method. For example, the tool uses a pre-set path of
actions and tasks and does not support embedding
domain knowledge during the DM process. From
our own experiences we identify a great need for
sophisticated tools that offer simplified access to
advanced ML techniques while retaining the ability
to embed domain knowledge in the data mining
process.
Finally, we aim to further refine and integrate
existing meta-algorithmic models, as well as to
incrementally yet continuously broaden our
modelling scope in creating ML method fragments
to also include unsupervised learning, non-binary
classification tasks, and unstructured data (e.g.
Spruit and Vlug, 2015), among others.
As our strategic objective we envision Meta-
Algorithmic Modelling (MAM) as a well-defined,
transparent, and methodological infrastructure for
Applied Data Science (ADS) research which has the
potential to uniformly interconnect the vast body of
knowledge as recipes for machine learning by
enabling application domain experts to reliably
perform data science tasks themselves in their daily
practices.
REFERENCES
Allahyari, H., and N. Lavesson. 2011. “User-Oriented
Assessment of Classification Model
Understandability,” in 11th Scandinavian Conference
on Artificial Intelligence, pp. 11-19.
Brinkkemper, S. 1996. “Method Engineering: Engineering
of Information Systems Development Methods and
Tools,” Information and Software Technology (38:4),
pp. 275-280.
Cortez, P., and M. J. Embrechts. 2013. “Using Sensitivity
Analysis and Visualization Techniques to Open Black
Box Data Mining Models,” Information Sciences
(225), pp. 1-17.
Domingos, P. 2012. “A Few Useful Things to Know about
Machine Learning,” Communications of the ACM
(55:10), pp. 78-87.
Hall, M., E. Frank, G. Holmes, B. Pfahringer, P.
Reutemann, and I. H. Witten. 2009. “The WEKA Data
Mining Software: An Update,” ACM SIGKDD
Explorations Newsletter (11:1), pp. 10-18.
Hevner, A., S. March, J. Park, and S. Ram. 2004.
“Design Science in Information Systems Research,”
MIS Quarterly (28:1), pp. 75-105.
Johansson, U., L. Niklasson, and R. König. 2004.
“Accuracy Vs. Comprehensibility in Data Mining
Models,” in Proceedings of the Seventh International
Conference on Information Fusion Vol. 1, pp. 295-300.
Kamwa, I., S. Samantaray, and G. Joós. 2012. “On the
Accuracy Versus Transparency Trade-Off of Data-
Mining Models for Fast-Response PMU-Based
Catastrophe Predictors,” IEEE Transactions on Smart
Grid (3:1), pp. 152-161.
Kotsiantis, S. B., I. Zaharakis, and P. Pintelas. 2007.
“Supervised Machine Learning: A Review of
Classification Techniques,” in Emerging Artificial
Intelligence Applications in Computer Engineering,
pp. 3-24.
Lou, Y., R. Caruana, J. Gehrke, and G. Hooker. 2013.
“Accurate Intelligible Models with Pairwise
Interactions,” in Proceedings of the 19th ACM
SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 623-631.
Menger, V., M. Spruit, K. Hagoort, and F. Scheepers.
2016. “Transitioning to a Data Driven Mental Health
Practice: Collaborative Expert Sessions for
Knowledge and Hypothesis Finding,” Computational and
Mathematical Methods in Medicine, Article ID
9089321.
Olson, D. L., D. Delen, and Y. Meng. 2012. “Comparative
Analysis of Data Mining Methods for Bankruptcy
Prediction,” Decision Support Systems (52:2), pp. 464-
473.
Orloff, M. 2016. “ABC-TRIZ: Introduction to Creative
Design Thinking with Modern TRIZ Modelling,”
Springer.
Pachidi, S., M. Spruit, and I. van de Weerd. 2014.
“Understanding Users' Behavior with Software
Operation Data Mining,” Computers in Human
Behavior (30), pp. 583-594.
Pachidi, S., and M. Spruit. 2015. “The Performance
Mining Method: Extracting Performance Knowledge
from Software Operation Data,” International Journal
of Business Intelligence Research (6:1), pp. 11-29.
Pritzker, P., and W. May. 2015. NIST Big Data
Interoperability Framework (NBDIF): Volume 1:
Definitions. NIST Special Publication 1500-1. Final
Version 1, September 2015.
Setiono, R. 2003. “Techniques for Extracting
Classification and Regression Rules from Artificial
Neural Networks,” in Computational Intelligence: The
Experts Speak, Piscataway, NJ, USA: IEEE, pp. 99-114.
Simke, S. 2013. “Meta-Algorithmics: Patterns for Robust,
Low Cost, High Quality Systems,” Wiley – IEEE.
Spruit, M., and B. Vlug. 2015. “Effective and Efficient
Classification of Topically-Enriched Domain-Specific
Text Snippets,” International Journal of Strategic
Decision Sciences (6:3), pp. 1-17.
Tang, J., S. Alelyani, and H. Liu. 2014. “Feature Selection
for Classification: A Review,” Data Classification:
Algorithms and Applications Vol. 37, pp. 2-29.
Thornton, C., F. Hutter, H. H. Hoos, and K. Leyton-
Brown. 2013. “Auto-WEKA: Combined Selection and
Hyperparameter Optimization of Classification
Algorithms,” in Proceedings of the 19th ACM
SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 847-855.
Tukey, J. W. 1977. “Exploratory Data Analysis,”
Addison-Wesley.
van de Weerd, I., and S. Brinkkemper. 2008. “Meta-
Modelling for Situational Analysis and Design
Methods,” Handbook of Research on Modern Systems
Analysis and Design Technologies and Applications,
pp. 35-54.
Vleugel, A., M. Spruit, and A. van Daal. 2010. “Historical
Data Analysis through Data Mining from an Outsourcing
Perspective: The Three-Phases Method,” International
Journal of Business Intelligence Research (1:3), pp.
42-65.
Yoo, I., P. Alafaireet, M. Marinov, K. Pena-Hernandez, R.
Gopidi, J. Chang, and L. Hua. 2012. “Data Mining in
Healthcare and Biomedicine: A Survey of the
Literature,” Journal of Medical Systems (36:4), pp.
2431-2448.
KDIR 2016 - 8th International Conference on Knowledge Discovery and Information Retrieval
406
... The concept of self-service data science was first described in [4] and has been defined as "the engineering discipline in which analytic systems are designed and evaluated to empower domain professionals to perform their own data analyses on their own data sources without coding in a reliable, usable and transparent manner within their own daily practices" [5]. Figure 1 visualises self-service data science research in the context of adjacent data science research disciplines. Bridging purely foundational and purely applied research processes, both applied data science [6] and self-service data science studies pursue a translational (e.g., application-oriented) research process, often including CRISP-DM as the knowledge discovery process of choice [7]. The Machine Learning (ML) community has noticed the need to enable access for non-expert users to ML techniques. ...
... 5. Users want to know the statistical power of the created model. 6. Users want to know the importance of each variable in the created model. ...
Article
Full-text available
(1) Background: This work investigates whether and how researcher-physicians can be supported in their knowledge discovery process by employing Automated Machine Learning (AutoML). (2) Methods: We take a design science research approach and select the Tree-based Pipeline Optimization Tool (TPOT) as the AutoML method based on a benchmark test and requirements from researcher-physicians. We then integrate TPOT into two artefacts: a web application and a notebook. We evaluate these artefacts with researcher-physicians to examine which approach suits researcher-physicians best. Both artefacts have a similar workflow, but different user interfaces because of a conflict in requirements. (3) Results: Artefact A, a web application, was perceived as better for uploading a dataset and comparing results. Artefact B, a Jupyter notebook, was perceived as better regarding the workflow and being in control of model construction. (4) Conclusions: Thus, a hybrid artefact would be best for researcher-physicians. However, both artefacts missed model explainability and an explanation of variable importance for their created models. Hence, deployment of AutoML technologies in healthcare remains currently limited to the exploratory data analysis phase.
... The application of machine learning techniques spans various disciplines such as computer science, statistics and mathematics, and by combining these disciplines, the act of learning is supported (Dellermann et al., 2019;Power, 2014). Machine learning techniques are also well suited to discover the structure of a data set, the underlying patterns of a dataset or a model tailored to the data (Spruit & Jagesar, 2016). ...
... Big data and advanced analytics have created a radical shift in how knowledge is defined, how information should be engaged with (Boyd & Crawford, 2012;Tien, 2013) and how knowledge is constituted (Kolbjørnsrud et al., 2018;Vinothini & Priya, 2017). Of significant practical and theoretical importance, is understanding the outcomes of emerging intelligent machine-knowledge worker reconfigurations (Rai, Constantinides, & Sarker, 2019;Spruit & Jagesar, 2016). ...
... The objective of this research was to develop the Meta-Algorithmic Model (MAM) shown as a recipe for using these textual resources to identify business goals explicitly stated in communications between the members of the organizations, to facilitate the business understanding phase of CRISP-DM [16,17]. Thus, the scientific contribution of this research is threefold. ...
Article
The Cross-Industry Standard Process for Data Mining (CRISP-DM), despite being the most popular data mining process for more than two decades, is known to leave those organizations lacking operational data mining experience puzzled and unable to start their data mining projects. This is especially apparent in the first phase of Business Understanding, at the conclusion of which, the data mining goals of the project at hand should be specified, which arguably requires at least a conceptual understanding of the knowledge discovery process. We propose to bridge this knowledge gap from a Data Science perspective by applying Natural Language Processing techniques (NLP) to the organizations’ e-mail exchange repositories to extract explicitly stated business goals from the conversations, thus bootstrapping the Business Understanding phase of CRISP-DM. Our NLP-Automated Method for Business Understanding (NAMBU) generates a list of business goals which can subsequently be used for further specification of data mining goals. The validation of the results on the basis of comparison to the results of manual business goal extraction from the Enron corpus demonstrates the usefulness of our NAMBU method when applied to large datasets.
... First, the DS Researcher, who can build advanced mathematical models and create new ML algorithms by following a scientific approach, including hypothesis testing (Saltz and Grady, 2017). Second, the Applied Data Scientist, who focuses on applying the vast number of already established algorithms in DS with wide-ranging DS knowledge (Spruit and Jagesar, 2017). Our analysis did not reveal which flavor we are dealing with, nor which one is currently more in demand on the market. ...
Conference Paper
The continuing proliferation of data science these days is causing organizations to reassess their workforce demands. Simultaneously, it is unclear what types of job roles, knowledge, skills, and abilities make up this field and how they differ. This ambiguity is generating a misleading myth around the Data Scientist’s role. Against this background, this paper attempts to provide clarity about the heterogeneous nature of job roles required in the field of data science by processing 25,104 job advertisements published on the online job platforms Indeed, Monster, and Glassdoor. We propose a text mining approach combining topic modeling, clustering, and expert assessment. We thereby identify and characterize six job roles in data science that are in demand among organizations, described by topics classified in three major knowledge domains. An understanding of job roles in data science can help organizations in acquiring and cultivating job roles to leverage data science effectively.
... Eskes et al., 2016; Heeringa et al., 2007). Preferably, recipes for applying such variables in analytical models are formulated and published as meta-algorithmic models to facilitate transparent communication with students and teachers regarding their usage (Spruit & Jagesar, 2016). ...
Chapter
This research assesses the education quality factors in secondary schools using a business intelligence approach. We operationalize each layer of the business intelligence framework to identify the stakeholders and components relevant to education quality. The resulting Education Quality Indicator (EQI) framework consists of seven Critical Success Factors (CSFs) and is measured through twenty-eight Key Performance Indicators (KPIs). The EQI framework was evaluated through expert interviews and a survey, and uncovers that the most important factor in assuring education quality is a teacher's ability to communicate with students. Furthermore, a feasibility analysis was conducted in a Dutch student monitoring information system. The results pave the way towards attainable and data-driven innovation in secondary education towards personalized student and teacher performance management using business intelligence technologies, which may ultimately integrate a wide variety of data sources from environmental sensors to wearables to optimally understand each individual student and teacher.
... The proposed scaling method is formally represented in Figure 6 as a Meta-Algorithmic Model (MAM) in Process Deliverable Diagram (PDD) notation [33]. In a PDD the processes are shown on the left and the product of the action on the right [34]. ...
Article
Mobile phone data are a novel data source to generate mobility information from Call Detail Records (CDRs). Although mobile phone data can provide us with valuable insights in human mobility, they often show a biased picture of the traveling population. This research, therefore, focuses on correcting for these biases and suggests a new method to scale mobile phone data to the true traveling population. Moreover, the scaled mobile phone data will be compared to roadside measurements at 100 different locations on Dutch highways. We infer vehicle trips from the mobile phone data and compare the scaled counts with roadside measurements. The results are evaluated for October 2015. The proposed scaling method shows very promising results with near identical vehicle counts from both data sources in terms of monthly, weekly, and hourly vehicle counts. This indicates the scaling method, in combination with mobile phone data, is able to correctly measure traffic intensities on highways, and thereby able to anticipate calibrated human mobility behaviour. Nevertheless, there are still some discrepancies—for one, during weekends—calling for more research. This paper serves researchers in the field of mobile phone data by providing a proven method to scale the sample to the population, a crucial step in creating unbiased mobility information.
... This research pursues an applied data science approach [8], with a particular focus on the information infrastructure dimension, by proposing a knowledge discovery method that helps data scientists set up big data processing platforms and workflows to make their analytical processes more effective and efficient. The following research question is answered: ...
Chapter
Big data analysis is increasingly becoming a crucial part of many organizations, popularizing the distributed computing paradigm. Within the emerging research field of Applied Data Science, multiple notable methods are available that help analysts and scientists to create their analytical processes. However, for distributed computing problems such methods are not available yet. Therefore, to support data analysts, scientists and software engineers in the creation of distributed computing processes, we present the CRoss-Industry Standard Process for Distributed Computing Workflows (CRISP-DCW) method. The CRISP-DCW method lets users create distributed computing workflows by following a predefined cycle and using reference manuals, where the critical elements of such a workflow are developed for the context at hand. Using our method’s reference manuals and predefined steps, data scientists can spend less time on developing big data processing workflows, thus increasing efficiency. Results were evaluated with experts and found to be satisfactory. Therefore, we argue that the CRISP-DCW method provides a good starting point for applied data scientists to develop and document their distributed computing workflow, making their processes both more efficient and effective.
Chapter
Healthcare is a data intensive industry in which data mining has a great potential for improving the wellbeing of patients. However, a multitude of barriers impedes the application of machine learning. This work focuses on medical adverse event prediction by domain experts. In this research we present AutoCrisp as a self-service data science prototype for multivariate sequential classification on electronic healthcare records to facilitate self-service data science by domain experts, without requiring any sophisticated data mining knowledge. We performed an empirical case study with the objective to predict bleedings with the use of AutoCrisp. Our results show that multivariate sequential classification for medical adverse event prediction can indeed be made accessible to healthcare professionals by providing appropriate tooling support.
Article
In a time when the employment of natural language processing techniques in domains such as biomedicine, national security, finance, and law is flourishing, this study takes a deep look at its application in policy documents. Besides providing an overview of the current state of the literature that treats these concepts, the authors implement a set of natural language processing techniques on internal bank policies. The implementation of these techniques, together with the results that derive from the experiments and expert evaluation, introduce a meta-algorithmic modelling framework for processing internal business policies. This framework relies on three natural language processing techniques, namely information extraction, automatic summarization, and automatic keyword extraction. For the reference extraction and keyword extraction tasks, the authors calculated precision, recall, and F-scores. For the former, the researchers obtained 0.99, 0.84, and 0.89; for the latter, this research obtained 0.79, 0.87, and 0.83, respectively. Finally, the summary extraction approach was positively evaluated using a qualitative assessment.
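The precision, recall, and F-score figures reported above can be related with a short sketch. It assumes the standard balanced F1 (the harmonic mean of precision and recall); the abstract does not state which F variant or averaging scheme was used, so the function name `f_score` and the choice of F1 are illustrative assumptions, and a value computed from aggregate precision and recall need not match a score averaged per document.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1.0 gives the balanced F1 (harmonic mean)."""
    if precision == 0.0 and recall == 0.0:
        # Avoid division by zero when nothing was retrieved correctly.
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Keyword extraction figures quoted in the abstract: precision 0.79, recall 0.87.
keyword_f1 = f_score(0.79, 0.87)
print(round(keyword_f1, 2))
```

Because the harmonic mean is pulled towards the smaller of its two inputs, a system cannot reach a high F1 by maximising precision or recall alone.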
Article
The surge in the amount of available data in health care enables a novel, exploratory research approach that revolves around finding new knowledge and unexpected hypotheses from data instead of carrying out well-defined data analysis tasks. We propose a specification of the Cross Industry Standard Process for Data Mining (CRISP-DM), suitable for conducting expert sessions that focus on finding new knowledge and hypotheses in collaboration with local workforce. Our proposed specification that we name CRISP-IDM is evaluated in a case study at the psychiatry department of the University Medical Center Utrecht. Expert interviews were conducted to identify seven research themes in the psychiatry department, which were researched in cooperation with local health care professionals using data visualization as a modeling tool. During 19 expert sessions, two results that were directly implemented and 29 hypotheses for further research were found, of which 24 were not imagined during the initial expert interviews. Our work demonstrates the viability and benefits of involving work floor people in the analyses and the possibility to effectively find new knowledge and hypotheses using our CRISP-IDM method.
Article
Due to the explosive growth in the amount of text snippets over the past few years and their sparsity of text, organizations are unable to effectively and efficiently classify them, missing out on business opportunities. This paper presents TETSC: the Topically-Enriched Text Snippet Classification method. TETSC aims to solve the classification problem for text snippets in any domain. TETSC recognizes that there are different types of text snippets and, therefore, allows for stop word removal, named-entity recognition, and topical enrichment for the different types of text snippets. TETSC has been implemented in the production systems of a personal finance organization, which resulted in a classification error reduction of over 21%.
Article
Software Performance is a critical aspect for all software products. In terms of Software Operation Knowledge, it concerns knowledge about the software product's performance when it is used by the end-users. In this paper the authors suggest data mining techniques that can be used to analyze software operation data in order to extract knowledge about the performance of a software product when it operates in the field. Focusing on Software-as-a-Service applications, the authors present the Performance Mining Method to guide the process of performance monitoring (in terms of device demands and responsiveness) and analysis (finding the causes of the identified performance anomalies). The method has been evaluated through a prototype which was implemented for an online financial management application in the Netherlands.
Chapter
This chapter introduces an assembly-based method engineering approach for constructing situational analysis and design methods. The approach is supported by a meta-modeling technique, based on UML activity and class diagrams. Both the method engineering approach and meta-modeling technique will be explained and illustrated by case studies. The first case study describes the use of the meta-modeling technique in the analysis of method evolution. The next case study describes the use of situational method engineering, supported by the proposed meta-modeling technique, in method construction. With this research, the authors hope to provide researchers in the information system development domain with a useful approach for analyzing, constructing, and adapting methods.
Article
The confluence of cloud computing, parallelism and advanced machine intelligence approaches has created a world in which the optimum knowledge system will usually be architected from the combination of two or more knowledge-generating systems. There is a need, then, to provide a reusable, broadly-applicable set of design patterns to empower the intelligent system architect to take advantage of this opportunity. This book explains how to design and build intelligent systems that are optimized for changing system requirements (adaptability), optimized for changing system input (robustness), and optimized for one or more other important system parameters (e.g., accuracy, efficiency, cost). It provides an overview of traditional parallel processing which is shown to consist primarily of task and component parallelism; before introducing meta-algorithmic parallelism which is based on combining two or more algorithms, classification engines or other systems. Key features: Explains the entire roadmap for the design, testing, development, refinement, deployment and statistics-driven optimization of building systems for intelligence. Offers an accessible yet thorough overview of machine intelligence, in addition to having a strong image processing focus. Contains design patterns for parallelism, especially meta-algorithmic parallelism - simply conveyed, reusable and proven effective that can be readily included in the toolbox of experts in analytics, system architecture, big data, security and many other science and engineering disciplines. Connects algorithms and analytics to parallelism, thereby illustrating a new way of designing intelligent systems compatible with the tremendous changes in the computing world over the past decade. Discusses application of the approaches to a wide number of fields; primarily, document understanding, image understanding, biometrics and security printing. Companion website contains sample code and data sets.
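As a rough illustration of what "combining two or more algorithms, classification engines or other systems" can look like, the sketch below uses simple majority voting. This is only one possible combination scheme and is not claimed to be the book's own design pattern; the three toy classifiers and all function names are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence

Classifier = Callable[[str], str]


def has_digit(text: str) -> str:
    # Hypothetical engine 1: any digit at all marks the text as numeric.
    return "numeric" if any(c.isdigit() for c in text) else "textual"


def mostly_digits(text: str) -> str:
    # Hypothetical engine 2: more than half of the characters are digits.
    digits = sum(c.isdigit() for c in text)
    return "numeric" if digits > len(text) / 2 else "textual"


def starts_with_digit(text: str) -> str:
    # Hypothetical engine 3: only the first character is inspected.
    return "numeric" if text[:1].isdigit() else "textual"


def majority_vote(classifiers: Sequence[Classifier], item: str) -> str:
    """Combine several classifiers by returning the most frequent label."""
    votes = Counter(clf(item) for clf in classifiers)
    # On a tie, Counter.most_common keeps the label that was seen first.
    return votes.most_common(1)[0][0]


engines = [has_digit, mostly_digits, starts_with_digit]
print(majority_vote(engines, "Invoice 42"))
print(majority_vote(engines, "2024-01"))
```

The individual engines disagree on "Invoice 42", and the combined decision follows the two that agree; this is the basic idea of letting a combination step compensate for the weaknesses of a single engine.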
Article
Software usage concerns knowledge about how end-users use the software in the field, and how the software itself responds to their actions. In this paper, we present the Usage Mining Method to guide the analysis of data collected during software operation, in order to extract knowledge about how a software product is used by the end-users. Our method suggests three analysis tasks which employ data mining techniques for extracting usage knowledge from software operation data: users profiling, clickstream analysis and classification analysis. The Usage Mining Method was evaluated through a prototype that was executed in the case of Exact Online, the main online financial management application in the Netherlands. The evaluation confirmed the supportive role of the Usage Mining Method in software product management and development processes, as well as the applicability of the suggested data mining algorithms to carry out the usage analysis tasks.
Conference Paper
Standard generalized additive models (GAMs) usually model the dependent variable as a sum of univariate models. Although previous studies have shown that standard GAMs can be interpreted by users, their accuracy is significantly lower than that of more complex models that permit interactions. In this paper, we suggest adding selected terms of interacting pairs of features to standard GAMs. The resulting models, which we call GA²M-models, for Generalized Additive Models plus Interactions, consist of univariate terms and a small number of pairwise interaction terms. Since these models only include one- and two-dimensional components, the components of GA²M-models can be visualized and interpreted by users. To explore the huge (quadratic) number of pairs of features, we develop a novel, computationally efficient method called FAST for ranking all possible pairs of features as candidates for inclusion into the model. In a large-scale empirical study, we show the effectiveness of FAST in ranking candidate pairs of features. In addition, we show the surprising result that GA²M-models have almost the same performance as the best full-complexity models on a number of real datasets. Thus this paper postulates that for many problems, GA²M-models can yield models that are both intelligible and accurate.
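The structure described in this abstract, univariate terms plus a small number of pairwise interaction terms, can be written compactly as follows. The notation is an assumption for illustration and does not appear in the abstract: $g$ is a link function, $\beta_0$ an intercept, $f_i$ and $f_{ij}$ the one- and two-dimensional shape functions, and $S$ the selected set of feature pairs.

```latex
g\bigl(\mathbb{E}[y]\bigr) = \beta_0 + \sum_{i} f_i(x_i) + \sum_{(i,j) \in S} f_{ij}(x_i, x_j)
```

With $S = \emptyset$ this reduces to a standard GAM; since every term depends on at most two features, each component can be plotted as a curve or a heat map.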