
Green Data Science - Using Big Data in an “Environmentally Friendly” Manner

Wil M.P. van der Aalst
Eindhoven University of Technology, Department of Mathematics and Computer Science,
PO Box 513, NL-5600 MB Eindhoven, The Netherlands.
Keywords: Data Science, Big Data, Fairness, Confidentiality, Accuracy, Transparency, Process Mining.
Abstract: The widespread use of “Big Data” is heavily impacting organizations and individuals for which these data are
collected. Sophisticated data science techniques aim to extract as much value from data as possible. Powerful
mixtures of Big Data and analytics are rapidly changing the way we do business, socialize, conduct research,
and govern society. Big Data is considered the “new oil” and data science aims to transform this into new
forms of “energy”: insights, diagnostics, predictions, and automated decisions. However, the process of trans-
forming “new oil” (data) into “new energy” (analytics) may negatively impact citizens, patients, customers,
and employees. Systematic discrimination based on data, invasions of privacy, non-transparent life-changing
decisions, and inaccurate conclusions illustrate that data science techniques may lead to new forms of “pollu-
tion”. We use the term “Green Data Science” for technological solutions that enable individuals, organizations
and society to reap the benefits from the widespread availability of data while ensuring fairness, confiden-
tiality, accuracy, and transparency. To illustrate the scientific challenges related to “Green Data Science”, we
focus on process mining as a concrete example. Recent breakthroughs in process mining resulted in powerful
techniques to discover the real processes, to detect deviations from normative process models, and to analyze
bottlenecks and waste. Therefore, this paper poses the question: How to benefit from process mining while
avoiding “pollutions” related to unfairness, undesired disclosures, inaccuracies, and non-transparency?
In recent years, data science emerged as a new and
important discipline. It can be viewed as an amal-
gamation of classical disciplines like statistics, data
mining, databases, and distributed systems. We use
the following definition: “Data science is an inter-
disciplinary field aiming to turn data into real value.
Data may be structured or unstructured, big or small,
static or streaming. Value may be provided in the form
of predictions, models learned from data, or any type
of data visualization delivering insights. Data science
includes data extraction, data preparation, data ex-
ploration, data transformation, storage and retrieval,
computing infrastructures, various types of mining
and learning, presentation of explanations and pre-
dictions, and the exploitation of results taking into
account ethical, social, legal, and business aspects.” (Aalst, 2016).
Related to data science is the overhyped term “Big
Data” that is used to refer to the massive amounts
of data collected. Organizations are heavily invest-
ing in Big Data technologies, but at the same time
citizens, patients, customers, and employees are con-
cerned about the use of their data. We live in an
era characterized by unprecedented opportunities to
sense, store, and analyze data related to human ac-
tivities in great detail and resolution. This introduces new risks: intended or unintended abuse enabled by powerful analysis techniques. Data may be sensitive and personal, and should not be revealed or used for purposes different from what was agreed upon.
Moreover, analysis techniques may discriminate mi-
norities even when attributes like gender and race are
removed. Using data science technology as a “black
box” making life-changing decisions (e.g., medical
prioritization or mortgage approvals) triggers a vari-
ety of ethical dilemmas.
(Published as: W.M.P. van der Aalst. Green Data Science: Using Big Data in an Environmentally Friendly Manner. Proceedings of the 18th International Conference on Enterprise Information Systems (ICEIS 2016), INSTICC, Rome, April 2016.)

Sustainable data science is only possible when citizens, patients, customers, and employees are protected against irresponsible uses of data (big or small). Therefore, we need to separate the “good” and “bad” of data science. Compare this with environmentally friendly forms of green energy (e.g., solar power) that overcome problems related to traditional forms of energy. Data science may result in unfair decision making, undesired disclosures, inac-
curacies, and non-transparency. These irresponsible
uses of data can be viewed as “pollution”. Abandon-
ing the systematic use of data may help to overcome
these problems. However, this would be comparable
to abandoning the use of energy altogether. Data sci-
ence is used to make products and services more reli-
able, convenient, efficient, and cost effective. More-
over, most new products and services depend on the
collection and use of data. Therefore, we argue that
the “prohibition of data (science)” is not a viable solution.
In this paper, we coin the term “Green Data Sci-
ence” (GDS) to refer to the collection of techniques
and approaches trying to reap the benefits of data sci-
ence and Big Data while ensuring fairness, confiden-
tiality, accuracy, and transparency. We believe that
technological solutions can be used to avoid pollution
and protect the environment in which data is collected
and used.
Section 2 elaborates on the following four challenges:
Fairness – Data Science without prejudice: How
to avoid unfair conclusions even if they are true?
Confidentiality – Data Science that ensures con-
fidentiality: How to answer questions without re-
vealing secrets?
Accuracy – Data Science without guesswork:
How to answer questions with a guaranteed level
of accuracy?
Transparency – Data Science that provides trans-
parency: How to clarify answers such that they
become indisputable?
Concerns related to privacy and personal data pro-
tection triggered legislation like the EU’s Data Protec-
tion Directive. Directive 95/46/EC (“on the protection
of individuals with regard to the processing of per-
sonal data and on the free movement of such data”) of
the European Parliament and the Council was adopted
on 24 October 1995 (European Commission, 1995).
The General Data Protection Regulation (GDPR) is
currently under development and aims to strengthen
and unify data protection for individuals within the
EU (European Commission, 2015). GDPR will replace Directive 95/46/EC; it is expected to be finalized in Spring 2016 and will be much more restrictive than earlier legislation. Sanctions include fines of up to 4% of the annual worldwide turnover. GDPR and other forms of legislation limiting the use of data may prevent the use of data science even in situations where data is used in a positive manner. Prohibiting the col-
lection and systematic use of data is like turning back
the clock. Next to legislation, positive technological
solutions are needed to ensure fairness, confidential-
ity, accuracy, and transparency. By just imposing re-
strictions, individuals, organizations and society can-
not exploit data (science) in a positive way.
The four challenges discussed in Section 2 are
quite general. Therefore, we focus on a concrete
subdiscipline in data science in Section 3: Process
Mining (Aalst, 2011). Process mining seeks the con-
frontation between event data (i.e., observed behav-
ior) and process models (hand-made or discovered au-
tomatically). Event data are related to explicit process
models, e.g., Petri nets or BPMN models. For exam-
ple, process models are discovered from event data or
event data are replayed on models to analyze com-
pliance and performance. Process mining provides
a bridge between data-driven approaches (data min-
ing, machine learning and business intelligence) and
process-centric approaches (business process model-
ing, model-based analysis, and business process man-
agement/reengineering). Process mining results may
drive redesigns, show the need for new controls, trig-
ger interventions, and enable automated decision sup-
port. Individuals inside (e.g., end-users and workers)
and outside (e.g., customers, citizens, or patients) the
organization may be impacted by process mining re-
sults. Therefore, Section 3 lists process mining chal-
lenges related to fairness, confidentiality, accuracy,
and transparency.
In the long run, data science is only sustainable
if we are willing to address the problems discussed
in this paper. Rather than abandoning the use of data
altogether, we should find positive technological ways
to protect individuals.
Figure 1 sketches the “data science pipeline”. Individuals interact with a range of hardware/software systems (information systems, smartphones, websites, wearables, etc.). Data related to machine and interaction events are collected and preprocessed for analysis. During preprocessing, data may be transformed, cleaned, anonymized, de-identified, etc. Models may be learned from data or made/modified by hand. For compliance checking, models are often normative and made by hand rather than discovered from data. Analysis results based on data (and possibly also models) are presented to analysts, managers, etc. or used to influence the behavior of information systems and devices. Based on the data, decisions are made or recommendations are provided. Analysis results may also be used to change systems, laws, procedures, guidelines, responsibilities, etc.
Figure 1: The “data science pipeline” facing four challenges.
Figure 1 also lists the four challenges discussed
in the remainder of this section. Each of the chal-
lenges requires an understanding of the whole data
pipeline. Flawed analysis results or bad decisions
may be caused by different factors such as a sampling
bias, careless preprocessing, inadequate analysis, or
an opinionated presentation.
2.1 Fairness - Data Science Without
Prejudice: How To Avoid Unfair
Conclusions Even If They Are True?
Data science techniques need to ensure fairness: Au-
tomated decisions and insights should not be used to
discriminate in ways that are unacceptable from a le-
gal or ethical point of view. Discrimination can be de-
fined as “the harmful treatment of an individual based
on their membership of a specific group or category
(race, gender, nationality, disability, marital status, or
age)”. However, most analysis techniques aim to dis-
criminate among groups. Banks handing out loans
and credit cards try to discriminate between groups
that will pay their debts and groups that will run into
financial problems. Insurance companies try to dis-
criminate between groups that are likely to claim and
groups that are less likely to claim insurance. Hos-
pitals try to discriminate between groups for which
a particular treatment is likely to be effective and
groups for which this is less likely. Hiring employ-
ees, providing scholarships, screening suspects, etc.
can all be seen as classification problems: The goal
is to explain a response variable (e.g., person will pay
back the loan) in terms of predictor variables (e.g.,
credit history, employment status, age, etc.). Ideally,
the learned model explains the response variable as well as possible without discriminating on the basis
of sensitive attributes (race, gender, etc.).
To explain discrimination discovery and discrimi-
nation prevention, let us consider the set of all (poten-
tial) customers of some insurance company specializ-
ing in car insurance. For each customer we have the
following variables:
gender (male or female),
car brand (Alfa, BMW, etc.),
years of driving experience,
number of claims in the last year,
number of claims in the last five years, and
status (insured, refused, or left).
The status field is used to distinguish current cus-
tomers (status=insured) from customers that were re-
fused (status=refused) or that left the insurance com-
pany during the last year (status=left). Customers that
were refused or that left more than a year ago are re-
moved from the data set.
Techniques for discrimination discovery aim to
identify groups that are discriminated based on sen-
sitive variables, i.e., variables that should not matter.
For example, we may find that “males have a higher
likelihood to be rejected than females” or that “for-
eigners driving a BMW have a higher likelihood to be
rejected than Dutch BMW drivers”. Discrimination
may be caused by human judgment or by automated
decision algorithms using a predictive model. The
decision algorithms may discriminate due to a sam-
pling bias, incomplete data, or incorrect labels. If ear-
lier rejections are used to learn new rejections, then
prejudices may be reinforced. Similar “self-fulfilling prophecies” can be caused by sampling bias or missing data.
Even when there is no intent to discriminate, dis-
crimination may still occur. Even when the auto-
mated decision algorithm does not use gender and
uses only non-sensitive variables, the actual decisions
may still be such that (fe)males or foreigners have a
much higher probability to be rejected. The decision
algorithm may also favor more frequent values for a
variable. As a result, minority groups may be treated unfairly.
Figure 2: Tradeoff between fairness and accuracy.
Discrimination prevention aims to create auto-
mated decision algorithms that do not discriminate us-
ing sensitive variables. It is not sufficient to remove
these sensitive variables: Due to correlations and the
handling of outliers, unintentional discrimination may
still take place. One can add constraints to the deci-
sion algorithm to ensure fairness using a predefined
criterion. For example, the constraint “males and fe-
males should have approximately the same probabil-
ity to be rejected” can be added to a decision-tree
learning algorithm. Next to adding algorithm-specific
constraints used during analysis one can also use pre-
processing (modify the input data by resampling or
relabeling) or postprocessing (modify models, e.g.,
relabel mixed leaf nodes in a decision tree). In gen-
eral there is often a trade-off between maximizing ac-
curacy and minimizing discrimination (see Figure 2).
By rejecting fewer males (better fairness), the insur-
ance company may need to pay more claims.
Discrimination prevention often needs to use sen-
sitive variables (gender, age, nationality, etc.) to en-
sure fairness. This creates a paradox, e.g., informa-
tion on gender needs to be used to avoid discrimina-
tion based on gender.
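The preprocessing/postprocessing ideas above can be made concrete. The sketch below is illustrative, not from the paper: the record fields, risk scores, and the 5% parity threshold are assumptions. It measures the rejection-rate gap between two gender groups and, as a simplified relabeling step, flips the borderline rejections of the disadvantaged group until the gap closes:

```python
# Sketch: demographic-parity measurement and relabeling (postprocessing).
# All field names and the 0.05 threshold are illustrative assumptions.

def rejection_rate(records, group):
    """Fraction of applicants in a gender group that were rejected."""
    sel = [r for r in records if r["gender"] == group]
    return sum(r["rejected"] for r in sel) / len(sel)

def parity_gap(records):
    """Absolute difference in rejection rates between the two groups."""
    return abs(rejection_rate(records, "male") - rejection_rate(records, "female"))

def relabel_towards_parity(records, max_gap=0.05):
    """Flip the lowest-risk rejections of the currently disadvantaged
    group until rejection rates are approximately equal."""
    while parity_gap(records) > max_gap:
        worse = max(("male", "female"), key=lambda g: rejection_rate(records, g))
        borderline = min(
            (r for r in records if r["gender"] == worse and r["rejected"]),
            key=lambda r: r["risk_score"],
        )
        borderline["rejected"] = False
    return records
```

As the trade-off in Figure 2 suggests, every flipped label trades some predictive accuracy for fairness; here that trade is made explicit in the relabeling loop.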
The first paper on discrimination-aware data min-
ing appeared in 2008 (Pedreshi et al., 2008). Since
then, several papers mostly focusing on fair classifica-
tion appeared: (Calders and Verwer, 2010; Kamiran
et al., 2010; Ruggieri et al., 2010). These examples
show that unfairness during analysis can be actively
prevented. However, unfairness is not limited to clas-
sification and more advanced forms of analytics also
need to ensure fairness.
2.2 Confidentiality - Data Science That
Ensures Confidentiality: How To
Answer Questions Without
Revealing Secrets?
The application of data science techniques should not
reveal certain types of personal or otherwise sensi-
tive information. Often personal data need to be kept
confidential. The General Data Protection Regula-
tion (GDPR) currently under development (European
Commission, 2015) focuses on personal information:
“The principles of data protection should apply to any in-
formation concerning an identified or identifiable natural
person. Data including pseudonymized data, which could
be attributed to a natural person by the use of additional in-
formation, should be considered as information on an iden-
tifiable natural person. To determine whether a person is
identifiable, account should be taken of all the means rea-
sonably likely to be used either by the controller or by any
other person to identify the individual directly or indirectly.
To ascertain whether means are reasonably likely to be used
to identify the individual, account should be taken of all ob-
jective factors, such as the costs of and the amount of time
required for identification, taking into consideration both
available technology at the time of the processing and tech-
nological development. The principles of data protection
should therefore not apply to anonymous information, that
is information which does not relate to an identified or iden-
tifiable natural person or to data rendered anonymous in
such a way that the data subject is not or no longer identifiable.”
Confidentiality is not limited to personal data. Compa-
nies may want to hide sales volumes or production times
when presenting results to certain stakeholders. One also
needs to bear in mind that few information systems hold
information that can be shared or analyzed without limits
(e.g., the existence of personal data cannot be avoided). The
“data science pipeline” depicted in Figure 1 shows that there
are different types of data having different audiences. Here
we focus on: (1) the “raw data” stored in the information system, (2) the data used as input for analysis, and (3) the analysis results interpreted by analysts and managers. Whereas the raw data may refer to individuals, the data
used for analysis is often (partly) de-identified, and analysis
results may refer to aggregate data only. It is important to
note that confidentiality may be endangered along the whole
pipeline and includes analysis results.
Consider a data set that contains sensitive information.
Records in such a data set may have three types of variables:
Direct identifiers: Variables that uniquely identify a
person, house, car, company, or other entity. For ex-
ample, a social security number identifies a person.
Key variables: Subsets of variables that together can be
used to identify some entity. For example, it may be
possible to identify a person based on gender, age, and
employer. A car may be uniquely identified based on
registration date, model, and color. Key variables are also referred to as implicit identifiers or quasi-identifiers.
Non-identifying variables: Variables that cannot be
used to identify some entity (direct or indirect).
Confidentiality is impaired by unintended or malicious
disclosures. We consider three types of such disclosures:
Identity disclosure: Information about an entity (per-
son, house, etc.) is revealed. This can be done through
direct or implicit identifiers. For example, the salaries
of employees are disclosed unintentionally or an in-
truder is able to retrieve patient data.
Attribute disclosure: Information about an entity can be
derived indirectly. If there is only one male surgeon in
the age group 40-45, then aggregate data for this cate-
gory reveals information about this person.
Partial disclosure: Information about a group of entities
can be inferred. Aggregate information on male sur-
geons in the age group 40-45 may disclose an unusual
number of medical errors. These cannot be linked to
a particular surgeon. Nevertheless, one may conclude that surgeons in this group are more likely to make errors.
De-identification of data refers to the process of remov-
ing or obscuring variables with the goal to minimize unin-
tended disclosures. In many cases re-identification is pos-
sible by linking different data sources. For example, the
combination of wedding date and birth date may allow for
the re-identification of a particular person. Anonymization
of data refers to de-identification that is irreversible: re-
identification is impossible. A range of de-identification
methods is available: removing variables, randomization,
hashing, shuffling, sub-sampling, aggregation, truncation,
generalization, adding noise, etc. Adding some noise to a
continuous variable or the coarsening of values may have a
limited impact on the quality of analysis results while en-
suring confidentiality.
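A few of the listed methods can be sketched in isolation. The field names (ssn, age, claims_per_year), the salt, and the noise level below are illustrative assumptions. Note that a salted hash is pseudonymization rather than irreversible anonymization: whoever holds the salt can re-link records, which is exactly why the GDPR treats pseudonymized data as personal data.

```python
# Sketch of three de-identification steps: hashing a direct identifier,
# generalizing (coarsening) a quasi-identifier, and adding noise to a
# continuous variable. Field names are illustrative assumptions.
import hashlib
import random

def deidentify(record, salt="example-salt", noise_sd=0.5):
    out = dict(record)
    # pseudonymize: replace the direct identifier by a salted hash
    out["ssn"] = hashlib.sha256((salt + record["ssn"]).encode()).hexdigest()[:12]
    # generalize: coarsen the exact age into a 10-year band
    band = record["age"] // 10 * 10
    out["age"] = f"{band}-{band + 9}"
    # perturb: add Gaussian noise to a sensitive continuous value
    out["claims_per_year"] = record["claims_per_year"] + random.gauss(0, noise_sd)
    return out
```

Each step trades some data utility for confidentiality, mirroring the trade-off sketched in Figure 3.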
There is a trade-off between minimizing the disclosure
of sensitive information and the usefulness of analysis re-
sults (see Figure 3). Removing variables, aggregation, and
adding noise can make it hard to produce any meaningful
analysis results. Emphasis on confidentiality (like security) may also reduce convenience. Note that personalization often conflicts with fairness and confidentiality. Disclosing all data supports analysis, but jeopardizes confidentiality.
Figure 3: Tradeoff between confidentiality and utility.
Access rights to the different types of data and analy-
sis results in the “data science pipeline” (Figure 1) vary per
group. For example, very few people will have access to
the “raw data” stored in the information system. More
people will have access to the data used for analysis and
the actual analysis results. Poor cybersecurity may endan-
ger confidentiality. Good policies ensuring proper authen-
tication (Are you who you say you are?) and authorization
(What are you allowed to do?) are needed to protect access
to the pipeline in Figure 1. Cybersecurity measures should
not complicate access, data preparation, and analysis; otherwise people may start using illegal copies and replicate data outside the protected pipeline.
See (Monreale et al., 2014; Nelson, 2015; President’s
Council, 2014) for approaches to ensure confidentiality.
2.3 Accuracy - Data Science Without
Guesswork: How To Answer
Questions With A Guaranteed Level
Of Accuracy?
Increasingly decisions are made using a combination of al-
gorithms and data rather than human judgement. Hence,
analysis results need to be accurate and should not deceive
end-users and decision makers. Yet, there are several fac-
tors endangering accuracy.
First of all, there is the problem of overfitting the data
leading to “bogus conclusions”. There are numerous exam-
ples of so-called spurious correlations illustrating the prob-
lem. Some examples (taken from (Vigen, 2015)):
The per capita cheese consumption strongly correlates
with the number of people who died by becoming tan-
gled in their bedsheets.
The number of Japanese passenger cars sold in the
US strongly correlates with the number of suicides by
crashing of motor vehicle.
US spending on science, space and technology strongly
correlates with suicides by hanging, strangulation and suffocation.
The total revenue generated by arcades strongly corre-
lates with the number of computer science doctorates
awarded in the US.
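Such spurious correlations are easy to manufacture: two independent random walks share no causal link, yet typically show a sizeable Pearson correlation simply because both trend over time. A minimal, self-contained sketch (the walk length and seed are arbitrary choices):

```python
# Sketch: spurious correlation between two unrelated random walks.
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def random_walk(n, rng):
    """Cumulative sum of n independent +1/-1 steps."""
    pos, path = 0, []
    for _ in range(n):
        pos += rng.choice((-1, 1))
        path.append(pos)
    return path

rng = random.Random(1)
# Two unrelated series: any correlation between them is spurious.
r = pearson(random_walk(500, rng), random_walk(500, rng))
```

With no relation at all between the two series, the magnitude of r still often exceeds 0.5 for walks of this length.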
According to Bonferroni’s principle we need to avoid treating random observations as if they are real and significant (Rajaraman and Ullman, 2011). The following example, inspired by a similar example in (Rajaraman and Ullman, 2011), illustrates the risk of treating completely random events as patterns.

A Dutch government agency is searching for terrorists by examining hotel visits of all of its 18 million citizens (18 × 10^6). The hypothesis is that terrorists meet multiple times at some hotel to plan an attack. Hence, the agency looks for suspicious “events” ({p1, p2}, {d1, d2}) where persons p1 and p2 meet on days d1 and d2. How many of such suspicious events will the agency find if the behavior of people is completely random? To estimate this number we need to make some additional assumptions. On average, Dutch people go to a hotel every 100 days and a hotel can accommodate 100 people at the same time. We further assume that there are (18 × 10^6)/(100 × 100) = 1800 Dutch hotels where potential terrorists can meet.

The probability that two persons (p1 and p2) each visit a hotel on a given day d is 1/100 × 1/100 = 10^-4. The probability that p1 and p2 visit the same hotel on day d is 10^-4 × 1/1800 = 5.55 × 10^-8. The probability that p1 and p2 visit the same hotel on two different days d1 and d2 is (5.55 × 10^-8)^2 = 3.086 × 10^-15. Note that different hotels may be used on both days. Hence, the probability of suspicious event ({p1, p2}, {d1, d2}) is 3.086 × 10^-15.

How many candidate events are there? Assume an observation period of 1000 days. Hence, there are 1000 × (1000 − 1)/2 = 499,500 combinations of days d1 and d2. Note that the order of days does not matter, but the days need to be different. There are (18 × 10^6) × (18 × 10^6 − 1)/2 = 1.62 × 10^14 combinations of persons p1 and p2. Again the ordering of p1 and p2 does not matter, but p1 ≠ p2. Hence, there are 499,500 × 1.62 × 10^14 = 8.09 × 10^19 candidate events ({p1, p2}, {d1, d2}).

The expected number of suspicious events is equal to the product of the number of candidate events ({p1, p2}, {d1, d2}) and the probability of such events (assuming independence): 8.09 × 10^19 × 3.086 × 10^-15 = 249,749. Hence, there will be around a quarter million observed suspicious events ({p1, p2}, {d1, d2}) in a 1000 day period.

Suppose that there are only a handful of terrorists and related meetings in hotels. The Dutch government agency will need to investigate around a quarter million suspicious events involving hundreds of thousands of innocent citizens. Using Bonferroni’s principle, we know beforehand that this is not wise: there will be too many false positives.

Example: Bonferroni’s principle explained using an example taken from (Aalst, 2016). To apply the principle, compute the number of observations of some phenomena one is interested in under the assumption that things occur at random. If this number is significantly larger than the real number of instances one expects, then most of the findings will be false positives.
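The arithmetic in the hotel example is easy to check mechanically; the constants (18 million citizens, a hotel visit every 100 days, 100-person hotels, a 1000-day window) are the example's own assumptions:

```python
# Expected number of "suspicious" events under purely random behavior,
# following the hotel example's assumptions.
n_people = 18_000_000                 # Dutch citizens
n_hotels = n_people // (100 * 100)    # = 1800 hotels
p_both_in_hotel = (1 / 100) ** 2      # two given persons each in some hotel on day d
p_meet = p_both_in_hotel / n_hotels   # same hotel on day d: ~5.55e-8
p_event = p_meet ** 2                 # meet on two given days (hotels may differ)

day_pairs = 1000 * 999 // 2                   # 499,500 unordered day pairs
person_pairs = n_people * (n_people - 1) // 2 # ~1.62e14 unordered person pairs
expected_events = day_pairs * person_pairs * p_event  # ~249,750 false positives
```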
When using many variables relative to the number of in-
stances, classification may result in complex rules overfit-
ting the data. This is often referred to as the curse of di-
mensionality: As dimensionality increases, the number of
combinations grows so fast that the available data become
sparse. With a fixed number of instances, the predictive
power reduces as the dimensionality increases. Using cross-
validation most findings (e.g., classification rules) will get
rejected. However, if there are many findings, some may
survive cross-validation by sheer luck.
In statistics, Bonferroni’s correction is a method (named
after the Italian mathematician Carlo Emilio Bonferroni) to
compensate for the problem of multiple comparisons. Nor-
mally, one rejects the null hypothesis if the likelihood of
the observed data under the null hypothesis is low (Casella
and Berger, 2002). If we test many hypotheses, we also in-
crease the likelihood of a rare event. Hence, the likelihood
of incorrectly rejecting a null hypothesis increases (Miller,
1981). If the desired significance level for the whole col-
lection of null hypotheses is α, then the Bonferroni correc-
tion suggests that one should test each individual hypoth-
esis at a significance level of α/k, where k is the number of null hypotheses. For example, if α = 0.05 and k = 20, then α/k = 0.0025 is the required significance level for testing the individual hypotheses.
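Given a plain list of p-values, the correction is a one-liner; this sketch simply applies the α/k rule described above:

```python
# Sketch: Bonferroni correction over a family of k null hypotheses.
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0_i only when p_i < alpha/k, so the probability of any
    false rejection across all k tests stays below alpha."""
    k = len(p_values)
    return [p < alpha / k for p in p_values]
```

With alpha = 0.05 and 20 tests, only p-values below 0.0025 lead to rejection, matching the example in the text.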
Next to overfitting the data and testing multiple hypotheses, there is the problem of uncertainty in the input data and the problem of not showing uncertainty in the results.
Uncertainty in the input data is related to the fourth “V”
in the four “V’s of Big Data” (Volume, Velocity, Variety,
and Veracity). Veracity refers to the trustworthiness of the
input data. Sensor data may be uncertain, multiple users
may use the same account, tweets may be generated by soft-
ware rather than people, etc. These uncertainties are often
not taken into account during analysis assuming that things
“even out” in larger data sets. This does not need to be the
case and the reliability of analysis results is affected by un-
reliable or probabilistic input data.
When we say, “we are 95% confident that the true value of parameter x is in our confidence interval [a, b]”, we mean that 95% of the hypothetically observed confidence intervals will hold the true value of parameter x. Averages, sums, standard deviations, etc. are often based on sample data. Therefore, it is important to provide a confidence interval. For example, given a mean of 35.4, the 95% confidence interval may be [35.3, 35.6], but it may also be [15.3, 55.6]. In the latter case, we will interpret the mean of 35.4 as a “wild guess” rather than a representative value for the true average. Although we are
used to confidence intervals for numerical values, decision
makers have problems interpreting the expected accuracy
of more complex analysis results like decision trees, asso-
ciation rules, process models, etc. Cross-validation tech-
niques like k-fold checking and confusion matrices give
some insights. However, models and decisions tend to be
too “crisp” (hiding uncertainties). Explicit vagueness or
more explicit confidence diagnostics may help to better in-
terpret analysis results. Parts of models should be kept de-
liberately “vague” if analysis is not conclusive.
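For a simple mean, the accompanying confidence interval can be obtained without distributional assumptions via the percentile bootstrap; the sketch below uses only the standard library, and the resample count and seed are arbitrary choices:

```python
# Sketch: percentile-bootstrap 95% confidence interval for a sample mean,
# so that a reported average comes with an explicit accuracy statement.
import random

def bootstrap_ci_mean(sample, n_boot=10_000, level=0.95, seed=0):
    rng = random.Random(seed)
    n = len(sample)
    # means of n_boot resamples (drawn with replacement) of the data
    means = sorted(sum(rng.choices(sample, k=n)) / n for _ in range(n_boot))
    lo = means[int((1 - level) / 2 * n_boot)]
    hi = means[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi
```

A narrow interval such as [35.3, 35.6] justifies reporting the mean of 35.4; a wide one such as [15.3, 55.6] flags it as little more than a guess.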
2.4 Transparency - Data Science That
Provides Transparency: How To
Clarify Answers Such That They
Become Indisputable?
Data science techniques are used to make a variety of de-
cisions. Some of these decisions are made automatically
based on rules learned from historic data. For example, a
mortgage application may be rejected automatically based
on a decision tree. Other decisions are based on analysis re-
sults (e.g., process models or frequent patterns). For exam-
ple, when analysis reveals previously unknown bottlenecks,
then this may have consequences for the organization of
work and changes in staffing (or even layoffs). Automated
decision rules (in Figure 1) need to be as accurate as possible (e.g., to reduce costs and delays). Analysis results (in Figure 1) also need to be accurate. However, accuracy
is not sufficient to ensure acceptance and proper use of data
science techniques. Both decisions and analysis results
also need to be transparent.
Figure 4: Different levels of transparency.
Figure 4 illustrates the notion of transparency. Consider
an application submitted by John evaluated using three data-
driven decision systems. The first system is a black box: It
is unclear why John’s application is rejected. The second
system reveals its decision logic in the form of a decision
tree. Applications from females and younger males are al-
ways accepted. Only applications from older males get re-
jected. The third system uses the same decision tree, but
also explains the rejection (“because male and above 50”).
Clearly, the third system is most transparent. When govern-
ments make decisions for citizens it is often mandatory to
explain the basis for such decisions.
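The third system's behavior amounts to pairing every decision with the rule that produced it. The rule below mirrors the Figure 4 example (reject males above 50); the function name, record fields, and explanation strings are illustrative assumptions:

```python
# Sketch: a transparent rule-based decision that explains itself.
def decide(applicant):
    """Return (decision, explanation) so every outcome is accountable."""
    if applicant["gender"] == "male" and applicant["age"] > 50:
        return "rejected", "because you are male and above 50"
    return "accepted", "no rejection rule matched"

# decide({"gender": "male", "age": 57})
# -> ("rejected", "because you are male and above 50")
```

Because the decision logic is an explicit, inspectable rule, both the outcome and its justification can be communicated to the applicant, unlike a black-box model that only emits a verdict.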
Deep learning techniques (like many-layered neural
networks) use multiple processing layers with complex
structures or multiple non-linear transformations. These
techniques have been successfully applied to automatic
speech recognition, image recognition, and various other
complex decision tasks. Deep learning methods are often
looked at as a “black box”, with performance measured
empirically and no formal guarantees or explanations. A
many-layered neural network is not as transparent as for ex-
ample a decision tree. Such a neural network may make
good decisions, but it cannot explain a rule or criterion.
Therefore, such black box approaches are non-transparent
and may be unacceptable in some domains.
Transparency is not restricted to automated decision
making and explaining individual decisions, it also involves
the intelligibility, clearness, and comprehensibility of anal-
ysis results (e.g., a process model, decision tree, regression
formula). For example, a model may reveal bottlenecks in a
process, possible fraudulent behavior, deviations by a small
group of individuals, etc. It needs to be clear for the user of
such models (e.g., a manager) how these findings were obtained.
The link to the data and the analysis technique used
should be clear. For example, filtering the input data (e.g.,
removing outliers) or adjusting parameters of the algorithm
may have a dramatic effect on the model returned.
Storytelling is sometimes referred to as “the last mile
in data science”. The key question is: How to communi-
cate analysis results with end-users? Storytelling is about
communicating actionable insights to the right person, at
the right time, in the right way. One needs to know the gist
of the story one wants to tell to successfully communicate
analysis results (rather than presenting the whole model and
all data). One can use natural language generation to trans-
form selected analysis results into concise, easy-to-read, in-
dividualized reports.
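As a minimal illustration (not from the paper), such report generation can be as simple as a template that selects and verbalizes a few findings; all names and numbers below are made up:

```python
# Hypothetical storytelling sketch: turn selected analysis results into a
# short, individualized report instead of presenting the whole model.
# The findings dictionary and its keys are illustrative assumptions.

def tell_story(findings: dict) -> str:
    parts = []
    if "bottleneck" in findings:
        activity, days = findings["bottleneck"]
        parts.append(f"activity '{activity}' delays cases by {days} days on average")
    if "deviations" in findings:
        parts.append(f"{findings['deviations']} cases deviated from the normative model")
    return "Key findings: " + "; ".join(parts) + "."

print(tell_story({"bottleneck": ("check claim", 18), "deviations": 16}))
```

A real natural language generation component would of course tailor wording, detail, and ordering to the recipient.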
To provide transparency there should be a clear link be-
tween data and analysis results/stories. One needs to be able
to drill-down and inspect the data from the model’s perspec-
tive. Given a bottleneck one needs to be able to drill down
to the instances that are delayed due to the bottleneck. This
relates to data provenance: it should always be possible to
reproduce analysis results from the original data.
The four challenges depicted in Figure 1 are clearly
interrelated. There may be trade-offs between fairness, confidentiality,
accuracy, and transparency. For example, to ensure
confidentiality we may add noise and de-identify data,
thus possibly compromising accuracy and transparency.
The goal of process mining is to turn event data into in-
sights and actions (Aalst, 2016). Process mining is an inte-
gral part of data science, fueled by the availability of data
and the desire to improve processes. Process mining can
be seen as a means to bridge the gap between data science
and process science. Data science approaches tend to be
Figure 5: The “process mining pipeline” relates observed and modeled behavior.
process-agnostic, whereas process science approaches tend
to be model-driven without considering the “evidence” hidden
in the data. This section discusses challenges related to
fairness, confidentiality, accuracy, and transparency in the
context of process mining. The goal is not to provide so-
lutions, but to illustrate that the more general challenges
discussed before trigger concrete research questions when
considering processes and event data.
3.1 What Is Process Mining?
Figure 5 shows the “process mining pipeline” and can be
viewed as a specialization of Figure 1. Process mining
focuses on the analysis of event data and analysis results
are often related to process models. Process mining is a
rapidly growing subdiscipline within both Business Process
Management (BPM) (Aalst, 2013a) and data science (Aalst,
2014). Mainstream Business Intelligence (BI), data min-
ing and machine learning tools are not tailored towards the
analysis of event data and the improvement of processes.
Fortunately, there are dedicated process mining tools able
to transform event data into actionable process-related in-
sights. For example, ProM is
an open-source process mining tool supporting process dis-
covery, conformance checking, social network analysis, or-
ganizational mining, clustering, decision mining, predic-
tion, and recommendation (see Figure 6). Moreover, in
recent years, several vendors released commercial process
mining tools. Examples include: Celonis Process Mining
by Celonis GmbH, Disco by Fluxicon, Interstage Business
Process Manager Analytics by Fujitsu Ltd, Minit by Gradient
ECM, myInvenio by Cognitive Technology, Perceptive
Process Mining by Lexmark, QPR ProcessAnalyzer by QPR,
Rialto Process by Exeura, SNP Business Process Analysis
by SNP Schneider-Neureither & Partner AG, and PPM
webMethods Process Performance Manager by Software AG.
3.1.1 Creating and Managing Event Data
Process mining is impossible without proper event logs
(Aalst, 2011). An event log contains event data related to
a particular process. Each event in an event log refers to
one process instance, called case. Events related to a case
are ordered. Events can have attributes. Examples of typ-
ical attribute names are activity, time, costs, and resource.
Not all events need to have the same set of attributes. How-
ever, typically, events referring to the same activity have the
same set of attributes. Figure 6(a) shows the conversion of
a CSV file with four columns (case, activity, resource, and
timestamp) into an event log.
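As an illustration of this conversion step, the following sketch groups rows of a made-up CSV file into one time-ordered trace per case; real tools such as ProM do considerably more (typing, XES metadata, etc.):

```python
import csv
import io
from collections import defaultdict

# Sketch of the conversion in Figure 6(a): rows with case, activity,
# resource, and timestamp columns are grouped into one ordered trace
# per case. The rows below are made up for illustration.
raw = """case,activity,resource,timestamp
1,register,Ann,2016-01-04 09:00
2,register,Ann,2016-01-04 10:30
1,decide,Bob,2016-01-06 14:00
"""

log = defaultdict(list)
for row in csv.DictReader(io.StringIO(raw)):
    log[row["case"]].append((row["timestamp"], row["activity"], row["resource"]))
for case in log:
    log[case].sort()  # events of a case are ordered by timestamp

print([activity for _, activity, _ in log["1"]])  # ['register', 'decide']
```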
Most process mining tools support XES (eXtensible
Event Stream) (IEEE Task Force on Process Mining, 2013).
In September 2010, the format was adopted by the IEEE
Task Force on Process Mining and became the de facto ex-
change format for process mining. The IEEE Standards Or-
ganization is currently evaluating XES with the aim to turn
XES into an official IEEE standard.
To create event logs we need to extract, load, trans-
form, anonymize, and de-identify data in a variety of sys-
tems (see Figure 5). Consider for example the hun-
dreds of tables in a typical HIS (Hospital Information Sys-
tem) like ChipSoft, McKesson and EPIC or in an ERP (En-
terprise Resource Planning) system like SAP, Oracle, and
Microsoft Dynamics. Non-trivial mappings are needed to
extract events and to relate events to cases. Event data needs
to be scoped to focus on a particular process. Moreover, the
data also needs to be scoped with respect to confidentiality.
3.1.2 Process Discovery
Process discovery is one of the most challenging process
mining tasks (Aalst, 2011). Based on an event log, a process
Figure 6: Six screenshots of ProM while analyzing an event log with 208 cases, 5987 events, and 74 different activities. First,
a CSV file is converted into an event log (a). Then, the event data can be explored using a dotted chart (b). A process model
is discovered for the 11 most frequent activities (c). The event log can be replayed on the discovered model. This is used to
show deviations (d), average waiting times (e), and queue lengths (f).
Table 1: Relating the four challenges to process-mining-specific tasks.

Fairness (Data Science without prejudice): How to avoid unfair conclusions even if they are true?
- Creating and managing event data: The input data may be biased, incomplete, or incorrect such that the analysis reconfirms prejudices. By resampling or relabeling the data, undesirable forms of discrimination can be avoided. Note that both cases and resources (used to execute activities) may refer to individuals having sensitive attributes such as race, gender, age, etc.
- Process discovery: The discovered model may abstract from paths followed by certain under-represented groups of cases. Discrimination-aware algorithms can be used to avoid this. For example, if cases are handled differently based on gender, we may want to ensure that both groups are equally represented in the model.
- Conformance checking: Conformance checking can be used to “blame” individuals, groups, or organizations for deviating from some normative model. Conformance checking (e.g., using alignments) needs to separate (1) likelihood, (2) severity, and (3) blame. Deviations may need to be interpreted differently for different groups of cases and resources.
- Performance analysis: Performance measurements may be unfair for certain classes of cases and resources (e.g., not taking into account the context). Ideally, performance analysis detects unfairness and supports process improvements taking into account trade-offs between internal fairness (worker’s perspective) and external fairness (citizen’s perspective).
- Operational support: Predictions, recommendations, and decisions may discriminate. This problem can be tackled using techniques from discrimination-aware data mining.

Confidentiality (Data Science that ensures confidentiality): How to answer questions without revealing secrets?
- Creating and managing event data: Event data (e.g., XES files) may reveal sensitive information. Anonymization and de-identification can be used to avoid disclosure. Note that timestamps and paths may be unique and a source for re-identification (e.g., when nearly all paths are unique).
- Process discovery: The discovered model may reveal sensitive information, especially with respect to infrequent paths or small event logs. Drilling down from the model may need to be blocked when numbers get too small (cf. k-anonymity).
- Conformance checking: Conformance checking shows diagnostics for deviating cases and resources. This is sensitive information: diagnostics need to be aggregated to avoid revealing compliance problems at the level of individuals.
- Performance analysis: Performance analysis shows bottlenecks and other problems. Linking these problems to cases and resources may disclose sensitive information.
- Operational support: Predictions, recommendations, and decisions may disclose sensitive information; e.g., based on a rejection, other properties can be derived.

Accuracy (Data Science without guesswork): How to answer questions with a guaranteed level of accuracy?
- Creating and managing event data: Event data (e.g., XES files) may have all kinds of quality problems. Attributes may be incorrect, imprecise, or uncertain. For example, timestamps may be too coarse (just the date) or reflect the time of recording rather than the time of the event’s occurrence.
- Process discovery: Process discovery depends on many parameters and characteristics of the event log. Process models should better show the confidence level of their different parts. Moreover, additional information needs to be used better (domain knowledge, uncertainty in event data, etc.).
- Conformance checking: Often multiple explanations are possible to interpret deviations. Just providing one alignment based on a particular cost function may be misleading. How robust are the findings?
- Performance analysis: In case of fitness problems (process model and event log disagree), performance analysis is based on assumptions and needs to deal with missing values (making results less accurate).
- Operational support: Inaccurate process models may lead to flawed predictions, recommendations, and decisions. Moreover, not communicating the (un)certainty of predictions, recommendations, and decisions may negatively impact trust.

Transparency (Data Science that provides transparency): How to clarify answers such that they become indisputable?
- Creating and managing event data: Provenance of event data is key. Ideally, process mining insights can be related to the event data they are based on. However, this may conflict with confidentiality concerns.
- Process discovery: Discovered process models depend on the event data used as input, the parameter settings, and the choice of discovery algorithm. How to ensure that the process model is interpreted correctly? End-users need to understand the relation between data and model to trust the analysis.
- Conformance checking: When modeled and observed behavior disagree there may be multiple explanations. How to ensure that conformance diagnostics are interpreted correctly?
- Performance analysis: When detecting performance problems, it should be clear how these were detected and what the possible causes are. Animating event logs on models helps to make problems more transparent.
- Operational support: Predictions, recommendations, and decisions are based on process models. If possible, these models should be transparent. Moreover, explanations should be added to predictions, recommendations, and decisions (“We predict that this case will be late, because ...”).
model is constructed, thus capturing the behavior seen in the
log. Dozens of process discovery algorithms are available.
Figure 6(c) shows a process model discovered using ProM’s
inductive visual miner (Leemans et al., 2015). Techniques
use Petri nets, WF-nets, C-nets, process trees, or transition
systems as a representational bias (Aalst, 2016). These re-
sults can always be converted to the desired notation, for
example BPMN (Business Process Model and Notation),
YAWL, or UML activity diagrams.
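A hedged sketch of the simplest ingredient underlying many discovery algorithms, counting the directly-follows relation between activities, may help; the event log below is made up, and real algorithms such as the inductive miner do much more than this:

```python
from collections import Counter

# Simplified illustration of process discovery: count how often activity
# a is directly followed by activity b across all traces. Discovery
# algorithms build models on top of such relations; the log is made up.
log = [
    ["register", "check", "decide"],
    ["register", "check", "check", "decide"],
    ["register", "decide"],
]

dfg = Counter()
for trace in log:
    for a, b in zip(trace, trace[1:]):
        dfg[(a, b)] += 1

print(dfg[("register", "check")])  # 2
```

From such counts, an algorithm would then derive splits, loops, and concurrency to produce a Petri net or process tree.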
3.1.3 Conformance Checking
Using conformance checking, discrepancies between the log
and the model can be detected and quantified by replaying
the log (Aalst et al., 2012). For example, Figure 6(d) shows an activity that was skipped 16 times. Some of the discrep-
ancies found may expose undesirable deviations, i.e., con-
formance checking signals the need for a better control of
the process. Other discrepancies may reveal desirable de-
viations and can be used for better process support. Input
for conformance checking is a process model having exe-
cutable semantics and an event log.
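To illustrate the idea in a strongly simplified form (not the alignment-based techniques of (Aalst et al., 2012)), one can replay each trace against the set of steps a normative model allows; the model and log below are made up:

```python
# Naive conformance sketch: the "model" is the set of allowed
# directly-follows steps of a hypothetical normative process.
# A trace deviates if it contains a step the model does not allow,
# e.g., skipping "check". Model and log are made up for illustration.
allowed = {("register", "check"), ("check", "check"), ("check", "decide")}

def fits(trace):
    return all((a, b) in allowed for a, b in zip(trace, trace[1:]))

log = [["register", "check", "decide"], ["register", "decide"]]
deviating = [t for t in log if not fits(t)]
print(len(deviating))  # 1: the trace that skipped "check"
```

Real conformance checking instead computes optimal alignments between traces and an executable model, quantifying and localizing each deviation.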
3.1.4 Performance Analysis
By replaying event logs on a process model, we can com-
pute frequencies and waiting/service times. Using align-
ments (Aalst et al., 2012) we can relate cases to paths in
the model. Since events have timestamps, we can associate
the times in-between events along such a path to delays in
the process model. If the event log records both start and
complete events for activities, we can also monitor activity
durations. Figure 6(e) shows an activity that has an average waiting time of 18 days and 16 hours. Note that such
bottlenecks are discovered without any modeling.
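The essence of this computation can be sketched as follows; the two-event traces and timestamps are made up for illustration:

```python
from datetime import datetime

# Sketch of performance analysis by replay: with timestamps per event,
# the time between consecutive events of a case approximates the
# waiting time before the second activity. The traces are made up.
log = {
    "1": [("register", "2016-01-04 09:00"), ("decide", "2016-01-22 09:00")],
    "2": [("register", "2016-01-05 09:00"), ("decide", "2016-01-24 09:00")],
}

fmt = "%Y-%m-%d %H:%M"
waits = []
for events in log.values():
    for (_, t1), (_, t2) in zip(events, events[1:]):
        waits.append((datetime.strptime(t2, fmt) - datetime.strptime(t1, fmt)).days)

print(sum(waits) / len(waits))  # 18.5 days on average before "decide"
```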
3.1.5 Operational Support
Figure 6(f) shows the queue length at a particular point in
time. This illustrates that process mining can be used in an
online setting to provide operational support. Process min-
ing techniques exist to predict the remaining flow time for
a case or the outcome of a process. This requires the com-
bination of a discovered process model, historic event data,
and information about running cases. There are also tech-
niques to recommend the next step in a process, to check
conformance at run-time, and to provide alerts when cer-
tain Service Level Agreements (SLAs) are violated.
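A deliberately naive prediction sketch may clarify the idea; real techniques combine a discovered model, historic event data, and the state of the running case, whereas all structure and numbers below are illustrative:

```python
from statistics import mean

# Hypothetical operational-support sketch: predict the remaining flow
# time of a running case from historic cases that reached the same
# activity. The historic remaining times (in days) are made up.
historic = {
    "check": [10, 14, 12],
    "decide": [1, 2, 3],
}

def predict_remaining(current_activity: str) -> float:
    return mean(historic[current_activity])

print(predict_remaining("check"))  # 12 days remaining, on average
```

An alert could then be raised whenever the predicted completion time violates an agreed SLA.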
3.2 Challenges in Process Mining
Table 1 maps the four generic challenges identified in Sec-
tion 2 onto the five key ingredients of process mining briefly
introduced in Section 3.1. Note that both cases (i.e., process
instances) and the resources used to execute activities may
refer to individuals (customers, citizens, patients, workers,
etc.). Event data are difficult to fully anonymize. In larger
processes, most cases follow a unique path. In the event log
used in Figure 6, 198 of the 208 cases follow a unique path
(focusing only on the order of activities). Hence, knowing
the order of a few selected activities may be used to de-
anonymize or re-identify cases. The same holds for (pre-
cise) timestamps. For the event log in Figure 6, several
cases can be uniquely identified based on the day the reg-
istration activity (first activity in process) was executed. If
one knows the timestamps of these initial activities with the
precision of an hour, then almost all cases can be uniquely
identified. This shows that the ordering and timestamp data
in event logs may reveal confidential information uninten-
tionally. Therefore, it is interesting to investigate what can
be done by adding noise (or other transformations) to event
data such that the analysis results do not change too much.
For example, we can shift all timestamps such that all cases
start in “week 0”. Most process discovery techniques will
still return the same process model. Moreover, the average
flow/waiting/service times are not affected by this.
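The timestamp-shifting transformation suggested above can be sketched as follows; the case and the choice of “week 0” are illustrative:

```python
from datetime import datetime

# Shift a case so that it starts in "week 0": orderings and durations
# (and hence most discovery and performance results) are preserved,
# while absolute dates that could re-identify the case disappear.
# The trace and the chosen epoch are made up for illustration.
fmt = "%Y-%m-%d %H:%M"
case = [("register", "2016-03-07 09:00"), ("decide", "2016-03-25 09:00")]

start = datetime.strptime(case[0][1], fmt)
epoch = datetime(2000, 1, 3)  # an arbitrary "week 0" Monday
shifted = [(a, epoch + (datetime.strptime(t, fmt) - start)) for a, t in case]

flow_time = shifted[-1][1] - shifted[0][1]
print(flow_time.days)  # 18: the flow time is unaffected by the shift
```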
Conformance checking (Aalst et al., 2012) can be
viewed as a classification problem. What kind of cases de-
viate at a particular point? Bottleneck analysis can also be
formulated as a classification problem. Which cases get de-
layed more than 5 days? We may find out that conformance
or performance problems are caused by characteristics of
the case itself or the people that worked on it. This allows
us to discover patterns such as:
Doctor Jones often performs an operation without mak-
ing a scan and this results in more incidents later in the
Insurance claims from older customers often get re-
jected because they are incomplete.
Citizens that submit their tax declaration too late often
get rejected by teams having a higher workload.
Techniques for discrimination discovery can be used to find
distinctions that are not desirable/acceptable. Subsequently,
techniques for discrimination prevention can be used to
avoid such situations. It is important to note that discrimi-
nation is not just related to static variables, but also relates
to the way cases are handled.
It is also interesting to use techniques from decomposed
process mining or streaming process mining (see Chap-
ter 12 in (Aalst, 2016)) to make process mining “greener”.
For streaming process mining one cannot keep track of
all events and all cases due to memory constraints and the
need to provide answers in real-time (Burattin et al., 2014;
Aalst, 2016; Zelst et al., 2015). Hence, event data need to
be stored in aggregated form. Aging data structures, queues,
time windows, sampling, hashing, etc. can be used to keep
only the information necessary to instantly provide answers
to selected questions. Such approaches can also be used
to ensure confidentiality, often without a significant loss of accuracy.
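A minimal, illustrative sketch of such bounded-memory aggregation (a strong simplification of streaming techniques like lossy counting) is shown below; the eviction policy and the bound are made up:

```python
from collections import Counter

# Streaming sketch: maintain approximate directly-follows counts with
# bounded memory by evicting the least frequent pair when the bound is
# exceeded. Only aggregates are stored, never complete cases.
MAX_PAIRS = 2
dfg, last = Counter(), {}

def observe(case: str, activity: str) -> None:
    if case in last:
        dfg[(last[case], activity)] += 1
    last[case] = activity
    if len(dfg) > MAX_PAIRS:  # made-up eviction policy to bound memory
        dfg.pop(min(dfg, key=dfg.get))

for case, act in [("1", "a"), ("1", "b"), ("2", "a"), ("2", "b"), ("1", "c")]:
    observe(case, act)

print(dict(dfg))
```

Because rare pairs may be evicted, the resulting counts are approximate, which is exactly the trade-off such streaming approaches accept.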
For decomposed/distributed process mining, event data
need to be split based on a grouping of activities in the process
(Aalst, 2013b; Aalst, 2016). After splitting the event
log, it is still possible to discover process models and to
check conformance. Interestingly, the sublogs can be ana-
lyzed separately. This may be used to break potentially harmful
correlations. Rather than storing complete cases, one
can also store shorter episodes of anonymized case frag-
ments. Sometimes it may even be sufficient to store only di-
rect successions, i.e., facts of the form “for some unknown
case activity a was followed by activity b with a delay of 8
hours”. Some discovery algorithms only use data on direct
successions and do not require additional, possibly sensi-
tive, information. Of course certain questions can no longer
be answered in a reliable manner (e.g., flow times of cases).
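The storage of anonymous direct-succession facts described above can be sketched as follows (delays in hours, all data made up):

```python
from collections import defaultdict

# Store only anonymous facts "activity a was followed by activity b
# with delay d", dropping case identifiers. Discovery based on direct
# successions still works, but case-level questions (e.g., flow times)
# can no longer be answered. The traces are made up for illustration.
traces = {
    "1": [("register", 0), ("check", 8), ("decide", 30)],  # (activity, hour)
    "2": [("register", 0), ("decide", 26)],
}

facts = defaultdict(list)
for events in traces.values():
    for (a, t1), (b, t2) in zip(events, events[1:]):
        facts[(a, b)].append(t2 - t1)  # no case identifier is stored

print(facts[("register", "check")])  # [8]
```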
The above examples illustrate that Table 1 identifies a
range of novel research challenges in process mining. In
today’s society, event data are collected about anything, at
any time, and at any place. Today’s process mining tools
are able to analyze such data and can handle event logs with
billions of events. These amazing capabilities also imply a
great responsibility. Fairness, confidentiality, accuracy and
transparency should be key concerns for any process miner.
4 Conclusion

This paper introduced the notion of “Green Data Science”
(GDS) from four angles: fairness, confidentiality, accuracy,
and transparency. The possible “pollution” caused by data
science should not be addressed (only) by legislation. We
should aim for positive, technological solutions to protect
individuals, organizations and society against the negative
side-effects of data. As an example, we discussed “green
challenges” in process mining. Table 1 can be viewed as a
research agenda listing interesting open problems.
References

Aalst, W. van der (2011). Process Mining: Discovery, Con-
formance and Enhancement of Business Processes.
Springer-Verlag, Berlin.
Aalst, W. van der (2013a). Business Process Management:
A Comprehensive Survey. ISRN Software Engineer-
ing, pages 1–37. doi:10.1155/2013/507984.
Aalst, W. van der (2013b). Decomposing Petri Nets for Pro-
cess Mining: A Generic Approach. Distributed and
Parallel Databases, 31(4):471–507.
Aalst, W. van der (2014). Data Scientist: The Engineer of
the Future. In Mertins, K., Benaben, F., Poler, R.,
and Bourrieres, J., editors, Proceedings of the I-ESA
Conference, volume 7 of Enterprise Interoperability,
pages 13–28. Springer-Verlag, Berlin.
Aalst, W. van der (2016). Process Mining: Data Science in
Action. Springer-Verlag, Berlin.
Aalst, W. van der, Adriansyah, A., and Dongen, B. van
(2012). Replaying History on Process Models
for Conformance Checking and Performance Analy-
sis. WIREs Data Mining and Knowledge Discovery.
Burattin, A., Sperduti, A., and Aalst, W. van der (2014).
Control-Flow Discovery from Event Streams. In IEEE
Congress on Evolutionary Computation (CEC 2014),
pages 2420–2427. IEEE Computer Society.
Calders, T. and Verwer, S. (2010). Three Naive Bayes
Approaches for Discrimination-Aware Classification.
Data Mining and Knowledge Discovery, 21(2):277–
Casella, G. and Berger, R. (2002). Statistical Inference, 2nd
Edition. Duxbury Press.
European Commission (1995). Directive 95/46/EC of the
European Parliament and of the Council on the Pro-
tection of Individuals with Regard to the Processing
of Personal Data and on the Free Movement of Such
Data. Official Journal of the European Communities,
No L 281/31.
European Commission (2015). Proposal for a Regulation
of the European Parliament and of the Council on
the Protection of Individuals with Regard to the Pro-
cessing of Personal Data and on the Free Movement
of Such Data (General Data Protection Regulation).
9565/15, 2012/0011 (COD).
IEEE Task Force on Process Mining (2013). XES Standard
Kamiran, F., Calders, T., and Pechenizkiy, M. (2010).
Discrimination-Aware Decision-Tree Learning. In
Proceedings of the IEEE International Conference on
Data Mining (ICDM 2010), pages 869–874.
Leemans, S., Fahland, D., and Aalst, W. van der (2015).
Exploring Processes and Deviations. In Fournier, F.
and Mendling, J., editors, Business Process Manage-
ment Workshops, International Workshop on Business
Process Intelligence (BPI 2014), volume 202 of Lec-
ture Notes in Business Information Processing, pages
304–316. Springer-Verlag, Berlin.
Miller, R. (1981). Simultaneous Statistical Inference.
Springer-Verlag, Berlin.
Monreale, A., Rinzivillo, S., Pratesi, F., Giannotti, F., and
Pedreschi, D. (2014). Privacy-By-Design in Big Data
Analytics and Social Mining. EPJ Data Science,
Nelson, G. (2015). Practical Implications of Sharing Data:
A Primer on Data Privacy, Anonymization, and De-
Identification. Paper 1884-2015, ThotWave Technolo-
gies, Chapel Hill, NC.
Pedreshi, D., Ruggieri, S., and Turini, F. (2008).
Discrimination-Aware Data Mining. In Proceedings
of the 14th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pages
560–568. ACM.
President’s Council of Advisors on Science and Technology
(2014). Big Data and Privacy: A Technological Per-
spective (Report to the President). Executive Office of
the President, US-PCAST.
Rajaraman, A. and Ullman, J. (2011). Mining of Massive
Datasets. Cambridge University Press.
Ruggieri, S., Pedreshi, D., and Turini, F. (2010). DCUBE:
Discrimination Discovery in Databases. In Proceed-
ings of the ACM SIGMOD International Conference
on Management of Data, pages 1127–1130. ACM.
Zelst, S. van, Dongen, B. van, and Aalst, W. van der (2015).
Know What You Stream: Generating Event Streams
from CPN Models in ProM 6. In Proceedings of
the BPM2015 Demo Session, volume 1418 of CEUR
Workshop Proceedings, pages 85–89.
Vigen, T. (2015). Spurious Correlations. Hachette Books.
... 3. Understanding authorization: There exists a significant power imbalance amongst medical professionals and patients in the Indian subcontinent, especially in the outlying regions. Making ensuring that patients receive adequate education about the implications of using artificial intelligence for medical purposes as well as receiving all the knowledge they require in order to make informed choices is essential [22]. 4. Transparency and accountability are notions with a confusing and ambiguous legal foundation in India in particular when it applies to circumstances concerning systems based on artificial intelligence. ...
... The Information and Communication Technology (IT) Act of 2000, sometimes referred to by the acronym the Indian IT Act, does not specifically address the application of computational intelligence (AI) in Indian healthcare. However, there are actually a number of provisions in the Act which could be relevant to the implementation of AI for healthcare purposes [21][22][23][24][25]. ...
Conference Paper
It is possible to improve patient care, assessment, and medicine with the integration of computational intelligence innovation into the health care system. However, there are significant social, legal, and ethical implications to the employment of AI in healthcare. They include worries about the impact on medical professionals and the healthcare system as a whole, as well as issues like privacy for patients, bias, and prejudice, as well as issues with transparency and responsibility. It is essential to carefully consider and manage these repercussions if artificial intelligence is to be used in medical treatment in a way that is morally right, legal, and socially responsible. The considerably expanded usage of machine cognitive ability (AI) systems for medical applications has brought about a number of significant benefits, including enhanced diagnosis, individualized treatment regimens, and significantly more efficient delivery of medical services. However, utilizing all these technologies also raises questions regarding sociological, legal, and ethical matters that must be considered. These implications include issues with fairness and discrimination, information privacy and security, comprehension and transparency, responsibility and accountability, and social inequity. Health practitioners, politicians, and researchers must concentrate on these issues to ensure the moral and completely accountable use of Ml in medicine. In order to promote the responsible implementation of these innovations, this paper highlights the need for ongoing discussions and collaboration and offers an overview of the ethical, legal, and sociological ramifications of using AI in the field of health care. Keywords-Artificially intelligent neural networks (ANN), machine learning (ML), electronic medical treatment, selecting features, and conversations powered by AI.
... Data science also tries to take into account the ethical, social and business aspects. For example, in [15], the term "Green data science" is coined to refer to data science that makes responsible use of information in any of its dimensions in terms of the amount of data collected, where information is protected from misuse or irresponsible use. ...
... However, they did not show the detail of how that activity would be performed. According to [15,[34][35][36][37], a prediction is part of the background of the topic or literature. Flath and Stein [38] described that predictive analysis is a tool that should be taken into account in the manufacturing industry. ...
Full-text available
The impact of the strategies that researchers follow to publish or produce scientific content can have a long-term impact. Identifying which strategies are most influential in the future has been attracting increasing attention in the literature. In this study, we present a systematic review of recommendations of long-term strategies in research analytics and their implementation methodologies. The objective is to present an overview from 2002 to 2018 on the development of this topic, including trends, and addressed contexts. The central objective is to identify data-oriented approaches to learn long-term research strategies, especially in process mining. We followed a protocol for systematic reviews for the engineering area in a structured and respectful manner. The results show the need for studies that generate more specific recommendations based on data mining. This outcome leaves open research opportunities from two particular perspectives—applying methodologies involving process mining for the context of research analytics and the feasibility study on long-term strategies using data science techniques.
... The International Data Corporation (IDC) predicts an annual growth of data beyond 20% from 2020 to 2025. 1 The existing literature on data mining in general (e.g., Aggarwal 2015) and data mining processes in particular (e.g., Kurgan and Musilek 2006) partially addresses computational and storage efficiencies, such as data reduction and approximate algorithms. Van der Aalst (2016) uses the term ''Green Data Science'' to describe technological solutions that allow to mitigate the effects of data science on the social environment caused by ''unfairness, undesired disclosures, inaccuracies, and non-transparency'' (p. 9). ...
Full-text available
This paper reports on a design science research (DSR) study that develops design principles for “green” – more environmentally sustainable – data mining processes. Grounded in the Cross Industry Standard Process for Data Mining (CRISP-DM) and on a review of relevant literature on data mining methods, Green IT, and Green IS, the study identifies eight design principles that fall into the three categories of reuse, reduce, and support. The paper develops an evaluation strategy and provides empirical evidence for the principles’ utility. It suggests that the results can inform the development of a more general approach towards Green Data Science and provide a suitable lens to study sustainable computing.
... Responsible data science centers around four challenging topics: fairness, i.e., data science without prejudice; accuracy, i.e., data science without guesswork; confidentiality, i.e., data science that ensures confidentiality and transparency, i.e., data science that provides transparency [28]. Training data to inform data science approaches carries concrete potential to contribute towards better outcomes. ...
Full-text available
Artificial intelligence (AI) is being increasingly applied in healthcare. The expansion of AI in healthcare necessitates AI-related ethical issues to be studied and addressed. This systematic scoping review was conducted to identify the ethical issues of AI application in healthcare, to highlight gaps, and to propose steps to move towards an evidence-informed approach for addressing them. A systematic search was conducted to retrieve all articles examining the ethical aspects of AI application in healthcare from Medline (PubMed) and Embase (OVID), published between 2010 and July 21, 2020. The search terms were “artificial intelligence” or “machine learning” or “deep learning” in combination with “ethics” or “bioethics”. The studies were selected utilizing a PRISMA flowchart and predefined inclusion criteria. Ethical principles of respect for human autonomy, prevention of harm, fairness, explicability, and privacy were charted. The search yielded 2166 articles, of which 18 articles were selected for data charting on the basis of the predefined inclusion criteria. The focus of many articles was a general discussion about ethics and AI. Nevertheless, there was limited examination of ethical principles in terms of consideration for design or deployment of AI in most retrieved studies. In the few instances where ethical principles were considered, fairness, preservation of human autonomy, explicability and privacy were equally discussed. The principle of prevention of harm was the least explored topic. Practical tools for testing and upholding ethical requirements across the lifecycle of AI-based technologies are largely absent from the body of reported evidence. In addition, the perspective of different stakeholders is largely missing.
Sustainability has captured the attention of the classical management of business processes. Organizations have become increasingly aware of the need to achieve information technology (IT)-enabled business processes that are successful in their economy and ecological and social impact. In this context, Green BPM concerns business processes’ modeling, deployment, optimization, and management with dedicated consideration for environmental consequences. Automated process discovery is a crucial process mining task to help organizations to get knowledge of the process they carry out in their daily operation, providing the basis for insights and evidence-based improvement decisions. Several process discovery algorithms have been developed and evaluated by the classical measures on resulting models, such as fitness, precision, f-score, soundness, complexity (size, structuredness, and control-flow complexity), generalization, and the execution time of the algorithm. Within the context of automated process discovery, sustainability adds a new indicator: energy efficiency. This paper extends a well-known benchmark for evaluating automated process discovery methods, measuring the energy efficiency of selected discovery methods with the same publicly available dataset. The expected contribution is to raise more awareness among the developers of process discovery methods about the energy impact of their solutions beyond the more traditional well-known measures.KeywordsSustainabilityGreen BPMprocess miningdiscovery algorithmsenergy efficiency
The exponential increase of published data and the diversity of systems require the adoption of good practices to achieve quality indexes that enable discovery, access, and reuse. To identify good practices, an integrative review was used, along with procedures from the ProKnow-C methodology. After applying the ProKnow-C procedures to the documents retrieved from the Web of Science, Scopus, and Library, Information Science & Technology Abstracts databases, an analysis of 31 items was performed. This analysis showed that over the last 20 years the guidelines for publishing open government data have had a great impact on the implementation of the Linked Data model in several domains, and that currently the FAIR principles and the Data on the Web Best Practices are the most highlighted in the literature. These guidelines provide orientation on various aspects of data publication, contributing to the optimization of quality independent of the context in which they are applied. The CARE and FACT principles, on the other hand, although not formulated with the same objective as FAIR and the Best Practices, represent great challenges for information and technology scientists regarding ethics, responsibility, confidentiality, impartiality, security, and transparency of data. Keywords: Best practices; Data publishing on the Web; Linked Open Data; Data quality
Introduction: in the Big Data context, the application of individual and corporate rights and of regulatory norms that safeguard privacy, fairness, accuracy, and transparency emerges as an urgent need. In this scenario, Responsible Data Science appears as an initiative based on the FACT guidelines, which correspond to the adoption of four principles: fairness, accuracy, confidentiality, and transparency. Objective: to discuss alternatives that can ensure the application of the FACT guidelines. Methodology: an exploratory and descriptive investigation with a qualitative approach was carried out. Searches were performed in the Web of Science and Scopus bibliographic databases and via the Google Scholar search engine, using the terms “Responsible Data Science”, “Fairness, Accuracy, Confidentiality, Transparency + Data Science”, FACT, and FAT in relation to Data Science. Results: Responsible Data Science emerges as an initiative based on the FACT guidelines, which correspond to the adoption of the principles of fairness, accuracy, confidentiality, and transparency. To implement these guidelines, the use of techniques and approaches being developed by Green Data Science should be considered. Conclusions: it was concluded that Green Data Science and the FACT guidelines contribute significantly to safeguarding individual rights, without the need to resort to measures that prevent the access to and reuse of data. The challenges of implementing the FACT guidelines require study, a sine qua non condition for tools for data analysis and dissemination to be developed already at the methodology design stage.
Due to the increasing volume and variety of data on the Internet as well as in organizations, the role of data has changed from a passive entity to an active asset. Data are considered a novel source of revenue, and the process of creating wealth from them is called “data monetization.” Data monetization is used for realizing a type of competitive capability for organizations. It provides organizations with flexibility for using information assets in response to customer expectations and environmental pressures. The present study therefore aims to clarify the configuration of data monetization by conducting a systematic review. A thematic analysis based on an inductive approach was used to construct the configuration. The global themes, namely “monetization layer,” “data refinement process layer,” “base layer,” and “accessing and processing restrictions layer,” with their related themes as subset components, were obtained. Each of these global themes represents a constructive layer that plays an important role in the data monetization mechanism. All extracted themes were synthesized into a configurational model called the “data monetization configuration” (DaMoC). This proposed configuration is validated by a real application, i.e., Cardlytics.
Green Information and Communication Technology (G-ICT) is an emerging area of research and development, and a major factor in this area is energy-efficient design of IT systems. Data mining deals with a special class of algorithms centered on predicting and modeling patterns and trends in huge volumes of digital data. Has anyone considered the situation in which data processing needs surpass the energy production of the world? This will happen if we do not start taking appropriate steps now. One of the most sought-after solutions is trading the accuracy of results for energy-consumption efficiency and good latency, generally known as inexact or approximate computing. In this regard, we have applied some approximation techniques that can be used to achieve energy-efficient data mining, or Green Data Mining, with results as good as possible for a given allowable deviation. We present experimental analysis of data mining algorithms to demonstrate the concept.
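The accuracy-for-energy trade the abstract describes can be illustrated with a small sketch (a generic sampling example, not code from the cited paper): estimating an aggregate from a random sample touches far less data, which reduces computation and memory traffic, at the cost of a bounded deviation from the exact result.

```python
import random

def approx_mean(data, fraction=0.1, seed=42):
    """Estimate the mean from a random sample. Processing only a
    fraction of the records approximates the exact answer while doing
    proportionally less work, the core idea of approximate computing."""
    rng = random.Random(seed)
    k = max(1, int(len(data) * fraction))
    sample = rng.sample(data, k)
    return sum(sample) / k

data = list(range(10_000))          # exact mean is 4999.5
estimate = approx_mean(data, 0.05)  # touches only 5% of the data
rel_error = abs(estimate - 4999.5) / 4999.5
```

The `fraction` parameter is the knob: a larger sample shrinks the allowable deviation but spends more energy, matching the trade-off the paper evaluates.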
Researchers, patients, clinicians, and other healthcare industry participants are forging new models for data sharing in hopes that the quantity, diversity, and analytic potential of health-related data for research and practice will yield new opportunities for innovation in basic and translational science. Whether we are talking about medical records (e.g., EHR, lab, notes), administrative information (claims and billing), social contacts (on-line activity), behavioral trackers (fitness or purchasing patterns), or about contextual (geographic, environmental) or demographic (genomics, proteomics) data, it is clear that as healthcare data proliferates, threats to security grow. Beginning with a review of the major healthcare data breaches in our recent history, this paper
The field of process mining is concerned with supporting the analysis, improvement, and understanding of business processes. A range of promising techniques have been proposed for process mining tasks such as process discovery and conformance checking. However, there are challenges, originally stemming from the area of data mining, that have not been investigated extensively in the context of process mining. In particular, the incorporation of data stream mining techniques into process mining has received little attention. In this paper, we present new developments that build on top of previous work related to the integration of data streams within the process mining framework ProM. We have developed means to use Coloured Petri Net (CPN) models as a basis for event-stream generation. The newly introduced functionality greatly enhances the use of event-streams in the context of process mining, as it allows us to be actively aware of the originating model of the event-stream under analysis.
In process mining, one of the main challenges is to discover a process model while balancing several quality criteria. This often requires repeatedly setting parameters, discovering a map, and evaluating it, which we refer to as process exploration. Commercial process mining tools like Disco, Perceptive, and Celonis are easy to use and have many features, such as log animation, immediate parameter feedback, and extensive filtering options, but the resulting maps usually have no executable semantics, and because of this, deviations cannot be analysed accurately. Most academically oriented approaches (e.g., the numerous process discovery approaches supported by ProM) use maps having executable semantics (models), but are often slow, make unrealistic assumptions about the underlying process, or do not provide features like animation and seamless zooming. In this paper, we identify four aspects that are crucial for process exploration: zoomability, evaluation, semantics, and speed. We compare existing commercial tools and academic workflows using these aspects, and introduce a new tool that aims to combine the best of both worlds. A feature comparison and a case study show that our tool bridges the gap between commercial and academic tools.
Privacy is an ever-growing concern in our society and is becoming a fundamental aspect to take into account when one wants to use, publish, and analyze data involving sensitive personal information. Unfortunately, it is increasingly hard to transform the data in a way that protects sensitive information: we live in the era of big data, characterized by unprecedented opportunities to sense, store, and analyze social data describing human activities in great detail and resolution. As a result, privacy preservation simply cannot be accomplished by de-identification alone. In this paper, we propose the privacy-by-design paradigm to develop technological frameworks for countering the threats of undesirable, unlawful effects of privacy violation, without obstructing the knowledge discovery opportunities of social mining and big data analytical technologies. Our main idea is to inscribe privacy protection into the knowledge discovery technology by design, so that the analysis incorporates the relevant privacy requirements from the start.
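The claim that de-identification alone is insufficient is often made concrete via k-anonymity: even with names removed, rare combinations of quasi-identifiers (ZIP code, age band, etc.) can single a person out. The snippet below is a minimal, generic k-anonymity check, an illustration of the concept rather than the cited paper's framework; the field names are hypothetical.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """A dataset is k-anonymous w.r.t. the given quasi-identifier
    columns if every combination of their values occurs at least k
    times, so no record is distinguishable from fewer than k-1 others."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Hypothetical generalized records: ZIP truncated, age bucketed.
records = [
    {"zip": "560**", "age": "20-30", "disease": "flu"},
    {"zip": "560**", "age": "20-30", "disease": "cold"},
    {"zip": "560**", "age": "30-40", "disease": "flu"},
    {"zip": "560**", "age": "30-40", "disease": "asthma"},
]
ok = is_k_anonymous(records, ["zip", "age"], k=2)
```

Privacy-by-design, as the abstract argues, would apply such checks (and stronger notions) inside the analysis pipeline itself rather than as a one-off scrubbing step.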
The practical relevance of process mining is increasing as more and more event data become available. Process mining techniques aim to discover, monitor and improve real processes by extracting knowledge from event logs. The two most prominent process mining tasks are: (i) process discovery: learning a process model from example behavior recorded in an event log, and (ii) conformance checking: diagnosing and quantifying discrepancies between observed behavior and modeled behavior. The increasing volume of event data provides both opportunities and challenges for process mining. Existing process mining techniques have problems dealing with large event logs referring to many different activities. Therefore, we propose a generic approach to decompose process mining problems. The decomposition approach is generic and can be combined with different existing process discovery and conformance checking techniques. It is possible to split computationally challenging process mining problems into many smaller problems that can be analyzed easily and whose results can be combined into solutions for the original problems.
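The decomposition idea in this abstract, splitting one large discovery problem into smaller ones, can be sketched with trace projection: each trace is projected onto a partition of the activity set, yielding one small sub-log per partition. This toy code illustrates the principle only; the cited approach covers how sub-results are soundly recombined.

```python
def project_trace(trace, activities):
    """Keep only the events of a trace that belong to one partition."""
    return tuple(a for a in trace if a in activities)

def decompose_log(log, partitions):
    """Project every trace onto each activity partition, producing one
    (deduplicated) sub-log per partition. Each sub-log defines a much
    smaller discovery or conformance problem."""
    return [{project_trace(t, p) for t in log} for p in partitions]

log = [("a", "b", "c", "d"), ("a", "c", "b", "d")]
sublogs = decompose_log(log, [{"a", "b"}, {"c", "d"}])
# Both traces collapse to ("a","b") in the first sub-log and ("c","d")
# in the second: the concurrency between b and c disappears locally.
```

Note how the two distinct traces become a single variant in each sub-log, which is exactly why the sub-problems are cheaper to analyze.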
Business Process Management (BPM) research resulted in a plethora of methods, techniques, and tools to support the design, enactment, management, and analysis of operational business processes. This survey aims to structure these results and provide an overview of the state-of-the-art in BPM. In BPM the concept of a process model is fundamental. Process models may be used to configure information systems, but may also be used to analyze, understand, and improve the processes they describe. Hence, the introduction of BPM technology has both managerial and technical ramifications and may enable significant productivity improvements, cost savings, and flow-time reductions. The practical relevance of BPM and rapid developments over the last decade justify a comprehensive survey.
Process mining techniques use event data to discover process models, to check the conformance of predefined process models, and to extend such models with information about bottlenecks, decisions, and resource usage. These techniques are driven by observed events rather than hand-made models. Event logs are used to learn and enrich process models. By replaying history using the model, it is possible to establish a precise relationship between events and model elements. This relationship can be used to check conformance and to analyze performance. For example, it is possible to diagnose deviations from the modeled behavior. The severity of each deviation can be quantified. Moreover, the relationship established during replay and the timestamps in the event log can be combined to show bottlenecks. These examples illustrate the importance of maintaining a proper alignment between event log and process model. Therefore, we elaborate on the realization of such alignments and their application to conformance checking and performance analysis. © 2012 Wiley Periodicals, Inc.
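The replay idea described here, stepping through a trace against a model and counting where observed behavior deviates, can be shown on a deliberately simple model. The sketch below replays traces against a directly-follows model; it is a toy stand-in for the alignment techniques the abstract actually covers, which work on richer models such as Petri nets.

```python
def replay_fitness(trace, allowed_df, start="start", end="end"):
    """Replay a trace against a set of allowed directly-follows pairs.
    Every observed step (including artificial start/end steps) that the
    model does not allow counts as one deviation; fitness is the
    fraction of conforming steps."""
    steps = list(zip([start] + list(trace), list(trace) + [end]))
    deviations = sum(1 for s in steps if s not in allowed_df)
    return 1 - deviations / len(steps), deviations

# Hypothetical model allowing only the sequence a -> b -> c.
model = {("start", "a"), ("a", "b"), ("b", "c"), ("c", "end")}

fit_ok, dev_ok = replay_fitness(["a", "b", "c"], model)   # conforms fully
fit_bad, dev_bad = replay_fitness(["a", "c", "b"], model) # swapped b and c
```

Quantifying `dev_bad` per trace is the simplest version of the "severity of each deviation" diagnostic; combining the conforming steps with event timestamps is what exposes bottlenecks.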
This is the second edition of Wil van der Aalst’s seminal book on process mining, which now discusses the field also in the broader context of data science and big data approaches. It includes several additions and updates, e.g. on inductive mining techniques, the notion of alignments, a considerably expanded section on software tools and a completely new chapter of process mining in the large. It is self-contained, while at the same time covering the entire process-mining spectrum from process discovery to predictive analytics. After a general introduction to data science and process mining in Part I, Part II provides the basics of business process modeling and data mining necessary to understand the remainder of the book. Next, Part III focuses on process discovery as the most important process mining task, while Part IV moves beyond discovering the control flow of processes, highlighting conformance checking, and organizational and time perspectives. Part V offers a guide to successfully applying process mining in practice, including an introduction to the widely used open-source tool ProM and several commercial products. Lastly, Part VI takes a step back, reflecting on the material presented and the key open challenges. Overall, this book provides a comprehensive overview of the state of the art in process mining. It is intended for business process analysts, business consultants, process managers, graduate students, and BPM researchers.
The popularity of the Web and Internet commerce provides many extremely large datasets from which information can be gleaned by data mining. This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets. It begins with a discussion of the map-reduce framework, an important tool for parallelizing algorithms automatically. The authors explain the tricks of locality-sensitive hashing and stream processing algorithms for mining data that arrives too fast for exhaustive processing. The PageRank idea and related tricks for organizing the Web are covered next. Other chapters cover the problems of finding frequent itemsets and clustering. The final chapters cover two applications: recommendation systems and Web advertising, each vital in e-commerce. Written by two authorities in database and Web technologies, this book is essential reading for students and practitioners alike.
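The map-reduce framework highlighted in this book description can be captured in a few lines: a mapper emits key-value pairs per input, the pairs are grouped by key, and a reducer aggregates each group. The in-process sketch below (word counting, the framework's canonical example) illustrates the programming model only, not a distributed implementation.

```python
from itertools import groupby

def map_reduce(inputs, mapper, reducer):
    """Minimal in-process MapReduce: map each input to (key, value)
    pairs, group the pairs by key, then reduce every group."""
    pairs = [kv for item in inputs for kv in mapper(item)]
    pairs.sort(key=lambda kv: kv[0])          # shuffle/sort phase
    return {key: reducer(key, [v for _, v in grp])
            for key, grp in groupby(pairs, key=lambda kv: kv[0])}

docs = ["big data mining", "data mining at scale"]
counts = map_reduce(
    docs,
    mapper=lambda doc: [(word, 1) for word in doc.split()],
    reducer=lambda key, values: sum(values),
)
# counts["data"] == 2, counts["mining"] == 2
```

Because mapper calls are independent and each key's reduction is independent, both phases parallelize automatically, which is the property the book builds on for very large datasets.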