© 2017 by the authors; licensee RonPub, Lübeck, Germany. This article is an open access article distributed under the terms and conditions of
the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).
Open Access
Open Journal of Big Data (OJBD)
Volume 3, Issue 1, 2017
http://www.ronpub.com/ojbd
ISSN 2365-029X
Combining Process Guidance and Industrial Feedback for Successfully Deploying Big Data Projects
Christophe Ponsard A, Mounir Touzani B, Annick Majchrowski A
A CETIC Research Centre, avenue Jean Mermoz 18, 6041 Gosselies, Belgium, {cp, am}@cetic.be
B Académie de Toulouse, rue Saint-Roch 75, 31400 Toulouse, France, mounir.touzani@ac-toulouse.fr
ABSTRACT
Companies are faced with the challenge of handling increasing amounts of digital data to run or improve their
business. Although a large set of technical solutions are available to manage such Big Data, many companies lack
the maturity to manage that kind of projects, which results in a high failure rate. This paper aims at providing better
process guidance for a successful deployment of Big Data projects. Our approach is based on the combination
of a set of methodological bricks documented in the literature, from early data mining projects to the present day. It is
complemented by lessons learned from pilots conducted in different areas (IT, health, space, food industry), with a
focus on two pilots giving a concrete vision of how to drive the implementation with emphasis on the identification
of values, the definition of a relevant strategy, the use of an Agile follow-up and a progressive rise in maturity.
TYPE OF PAPER AND KEYWORDS
Regular research paper: big data, process model, agile, method adoption, pilot case studies
1 INTRODUCTION
In today’s world, there is an ever-increasing number of
people and devices that are being connected together.
This results in the production of information at an
exponentially growing rate and opens the Big Data era.
To give a few numbers, it is estimated that 90% of
the current world’s data has been produced in just the
last two years, and that the amount of data created
by businesses doubles every 1.2 years [45]. The total
amount of data in the world reached one zettabyte (10²¹ bytes) around 2010, and by 2020 more than 40 zettabytes
will be available. An important shift is that most of
the data is now being generated by devices rather than
people, due to the emergence of the Internet of Things.
Companies are facing the many challenges of
processing such amounts of data. They typically view
Big Data technologies as holding a lot of potential
to improve their performance and create competitive
advantages. The main challenges companies have to
face with Big Data are often summarised by a series
of “V” words. In addition to the Volume (i.e. the risk
of information overload) already mentioned, other data
dimensions are the Variety (i.e. the diversity of structured
and non-structured formats), the required Velocity (i.e.
highly reactive, possibly real-time, data processing), the
Visualization need (in order to interpret them easily) and
the related Value (in order to derive an income) [37].
The ease of collecting and storing data, combined with the availability of analysis technologies (such as NoSQL databases, MapReduce, Hadoop), has encouraged many companies to launch Big Data projects. However, most
organisations are actually still failing to get business
value out of their data. A 2013 report surveying 300
companies about Big Data revealed that 55% of Big Data
projects fail and many others fall short of their objectives
[31]. An on-line survey conducted in July 2016 by
Gartner reported that many companies remain stuck at
the pilot stage and that only 15% actually deployed their
big data project to production [25].
Looking at the cause of such failures, it appears that
the main factor is actually not the technical dimension,
but rather the process and people dimensions, which are
thus equally important [24]. Of course, the technology selection for big data projects is important and needs to be kept up-to-date with the fast technological evolution in order to select proper technologies [33]. However, much less attention is devoted to methods and tools that can help teams achieve big data projects more effectively and efficiently [46]. There exists some recent work in that area, identifying key factors for a project's success [47], stressing management issues [12], insisting on the need for team process methodologies, and making a critical analysis of analytical methods [46].
Our paper is aligned with those works and aims at helping companies engaging in a Big Data adoption process, driven by questions such as:
- How can we be sure Big Data will help us?
- Which people with what skills should be involved?
- What steps should be done first?
- Is my project on the right track?
Our main contribution consists of practical guidelines and lessons learned from a set of pilot projects covering various domains (life sciences, health, space, IT). Those pilots are spread over three years and are conducted within a large project carried out in Belgium. They follow a similar process which is incrementally enhanced. The reported work is based on the first four pilots, while four others are in the analysis phase. It significantly extends our first report published in [43] by:
- giving a more detailed overview of existing methodologies that form the building bricks of our approach,
- providing detailed feedback on two industrial pilots, respectively in the data centre maintenance and medical care domains,
- putting our work in perspective with other work focusing on the successful adoption of Big Data techniques; we also discuss in more detail some important issues like ethics and cybersecurity.
Figure 1: Evolution of data processing methodologies
(source: [36])
This paper is structured as follows. Section 2 reviews
the main available methodologies for dealing with Big
Data deployment. Section 3 presents the process
followed to come up with a method and validate it on
our pilots. It stresses key requirements for successful
deployment. Section 4 presents more detailed feedback
and highlights specific guidelines. Section 5 discusses some related work. Finally, Section 6 draws some conclusions and outlines on-going extensions of our work.
2 EVOLUTION OF METHODS AND PROCESSES FOR DATA-ORIENTED PROJECTS
This section reviews existing methods and processes. It
highlights some known strengths and limitations. First,
methods inherited from the related data mining field are
presented before considering approaches more specific
to Big Data, with special attention to Agile methods.
2.1 Methods Inherited from Data Mining
Data mining was developed in the 1990s to extract
data patterns in structured information (databases) and to
discover business factors on a relatively small scale. In
contrast, Big Data is also considering unstructured data
and operates on a larger scale. A common point between
them, from a process point of view, is that both require
the close cooperation of data scientists and management
in order to be successful. Many methodologies and
process models have been developed for data mining and
knowledge discovery [36]. Figure 1 gives an overview of the evolution and lineage of the main methodologies.
The seminal approach is KDD (Knowledge Discovery
in Databases) [22]. It was refined into many other
approaches (like SEMMA [48], Two Crows [53]). It
was then standardised under CRISP-DM (Cross Industry Standard Process for Data Mining) [50], which is depicted in Figure 2.

Figure 2: CRISP-DM method (source: [30])
CRISP-DM is composed of six main phases, each decomposed into sub-steps. The process is not linear but rather organised as a global cycle, usually with a lot of back and forth within and between phases. CRISP-DM
has been widely used for the past 20 years, not only for
data mining but also for predictive analytics and big data
projects.
However, CRISP-DM and similar methods suffer from the following issues:
- they fail to provide a good management view on communication, knowledge and project aspects,
- they lack some form of maturity model enabling them to highlight the more important steps and milestones that can be progressively raised,
- despite the standardisation, they are not always known by the wider business community, and hence are difficult to adopt for managing the data value aspect,
- the proposed iterative model is limited: the planned iterations are little used in practice because they do not loop back to the business level but rather stay in the internal IT context. In addition to the lack of control on the added value, those iterations are very often postponed. This is the reason why more agile models were introduced.
2.2 Methods Adopting Agile Principles
Agile methods, initially developed for software
development, can also be applied to data analysis in order to provide better process guidance and value orientation. An agile evolution of KDD and CRISP-DM is AgileKDD [15], depicted in Figure 3. It is based on the OpenUP life cycle, which supports the statements of the Agile Manifesto [1]. Projects are divided into planned “sprints” with fixed deadlines, usually of a few weeks. Each sprint needs to deliver incremental value to stakeholders in a predictable and demonstrable manner.

Figure 3: Agile KDD method (source: [11])
For example, IBM has developed ASUM-DM, an extension and refinement of CRISP-DM combining traditional project management with agility principles [26]. Figure 4 illustrates its main building blocks and its iterative principle, driven by the specific activities in the last columns, which include governance and community alignment. However, it does not cover the infrastructure/operations side of implementing a data mining/predictive analytics project. It is more focused on activities and tasks in the deployment phase and provides no templates or guidelines.
Although it looks quite adequate, deploying an Agile approach for Big Data may still face resistance, just as is the case for software development, typically in more rigid kinds of organisations. A survey was conducted to validate this acceptance [23]. It revealed that, quite similarly to software development, companies tend to accept Agile methods for projects with smaller scope, lower complexity and fewer security issues, and inside organisations with more freedom. Otherwise, a more traditional plan-managed approach is preferred.
Figure 4: ASUM-DM method (source: [29])
Figure 5: AABA Method (source: [9])
2.3 Methods Developed for Big Data Projects
Architecture-centric Agile Big data Analytics (AABA)
addresses technical and organizational challenges of
Big Data [9]. As shown in Figure 5, it supports Agile delivery. It also integrates the Big Data system Design (BDD) method and the Architecture-centric Agile Analytics with architecture-supported DevOps (AAA) model for effective value discovery and continuous delivery of value.
The method was validated on 11 case studies
across various domains (marketing, telecommunications,
healthcare) with the following recommendations:
1. Data Analysts/Scientists should be involved early in
the process, i.e. already at business analysis phase.
2. Continuous architecture support is required for big
data analytics.
3. Agile bursts of effort help to cope with rapid
technology changes and new requirements.
4. The availability of a reference architecture and a technology catalog eases the definition and evolution of the data processing.
5. Feedback loops need to be open, e.g. about
non-functional requirements such as performance,
availability and security, but also for business
feedback about emerging requirements.
Stampede is another method proposed by IBM to its
customers. Expert resources are provided at cost to help
companies to get started with Big Data in the scope
of a well-defined pilot project [28]. Its main goal is
to educate companies and help them get started more
quickly, in order to drive value from Big Data. A key tool
of the method is a half day workshop to share definitions,
identify scope/big data/infrastructure, establish a plan
and most importantly establish the business value. The
pilot execution is typically spread over 12 weeks and
carried out in an Agile way with a major milestone at
about 9 weeks as depicted in Figure 6.
Figure 6: The IBM Stampede method (source: [28])
Table 1: Maturity Model from Nott and Betteridge (IBM) (source: [39])

Level | Ad hoc | Foundational | Competitive | Differentiating | Breakaway
Business strategy | Use of standard reporting; Big Data is just mentioned | Data-related ROI identified | Data processing encouraged | Competitive advantage achieved | Business innovation is driven by data processing
Analytics | Limited to the past | Event detection | Prediction of the likelihood of specific evolutions | Optimisation of decision support | Optimisation and process automation
IT Alignment | No coherent architecture of the information system | Defined architecture but not oriented towards analytics | Definition of Big Data architectural patterns | Defined and standardised Big Data oriented architecture | Architecture fully aligned with Big Data requirements
Culture and governance | Largely based on key people | Rambling artefact management, resistance to change | Policy and procedure well defined, partial adoption | Large adoption, daily use | Generalised adoption
2.4 Some Complementary Approaches
2.4.1 Introducing Maturity Models
Some attempts are being made to develop a “Capability Maturity Model” (CMM) for scientific data management processes in order to support the evaluation and improvement of these processes [13, 39]. Such a model describes the main types of processes and the practices required for effective management. A classic CMM characterizes organizations using different maturity levels that represent their ability to reliably perform processes of growing complexity and scope. A 5-level scale is typical and is proposed both by [13] and [39]. The former uses standard levels ranging from “defined” to “optimized” while the latter uses a more specific nomenclature ranging from “ad hoc” to “breakaway”. Table 1 details the main criteria relating to the place of the data in the business strategy, the type of data analysis used, the alignment of the IT infrastructure, as well as aspects of culture and governance.
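The criteria of Table 1 can be turned into a lightweight self-assessment. The sketch below is a minimal illustration in Python: the level names come from [39], but the aggregation rule (the weakest dimension caps the overall rating) and the example ratings are our own illustrative assumptions, not part of the original model.

```python
from enum import IntEnum

class MaturityLevel(IntEnum):
    """The five maturity levels of Table 1 (Nott and Betteridge)."""
    AD_HOC = 1
    FOUNDATIONAL = 2
    COMPETITIVE = 3
    DIFFERENTIATING = 4
    BREAKAWAY = 5

# Hypothetical self-assessment across the four dimensions of Table 1.
assessment = {
    "business_strategy": MaturityLevel.FOUNDATIONAL,
    "analytics": MaturityLevel.COMPETITIVE,
    "it_alignment": MaturityLevel.FOUNDATIONAL,
    "culture_and_governance": MaturityLevel.AD_HOC,
}

def overall_level(scores):
    # Conservative aggregation rule (our assumption): the weakest dimension caps the rating.
    return min(scores.values())

level = overall_level(assessment)
print(f"Overall maturity: {level.name}")
print("Weakest dimensions:", [d for d, s in assessment.items() if s == level])
```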
2.4.2 Cognitive “Sensemaking”
The Sensemaking approach also has an iterative nature. There are actually two internal cycles in the method: a more classical “Foraging” loop trying to dig into the data to find relations among them, and a second “Sensemaking” loop trying to build sense out of the data by reproducing the cognitive process followed by humans to build up a representation of an information space for achieving their goal. It focuses on challenges for modelling and analysis by bringing cognitive models into requirements engineering, in order to analyse the features of data and the details of user activities [32].
2.4.3 Critical Success Factors
As a complement to processes, many key success factors,
best practices and risk checklists have been published,
mostly in blogs for Chief Information Officers, e.g.
[4]. A systematic classification of Critical Success
Factors has been proposed by [24] using three key
dimensions: people, process and technology. It has been
further extended by [46] with tooling and governance
dimensions. A few key factors are the following:
- Data: quality, security, level of structure in data.
- Governance: management support, well-defined organisation, data-driven culture.
- Objectives: business value identified (KPI), business case-driven, realistic project size.
- Process: agility, change management, maturity, coping with data growth.
- Team: data science skills, multidisciplinarity.
- Tools: IT infrastructure, storage, data visualization capabilities, performance monitoring.
3 METHOD DEVELOPMENT AND VALIDATION PROCESS
The global aim of our project is to come up with a
systematic method to help companies facing big data
challenges to validate the potential benefits of a big data
solution. The global process is depicted in Figure 7.
The process is driven by eight successive pilots
which are used to tune the method and make more
technical bricks available through the proposed common
infrastructure. The final expected result is to provide a
commercial service to companies having such needs.
The selected method is strongly inspired by what we learned from the available methods and processes described in Section 2:
- the starting point was Stampede, because of some initial training and the underlying IBM platform. Key aspects kept from that method are the initial workshop with all stakeholders, the realistic focus and a constant business value driver,
- however, to cope with the lack of reference material, we defined a process model based on CRISP-DM, which is extensively documented,
- the pilots are executed in an Agile way; given the expert availabilities (university researchers), the pilots are planned over longer periods than in Stampede: 3-6 months instead of 12-16 weeks. The popular SCRUM approach was used as it emphasizes collaboration, functioning software, team self-management and flexibility to adapt to business realities [49].
The global methodology is composed of three
successive phases detailed hereafter:
1. Big Data Context and Awareness. In this
introductory phase, one or more meetings take place
with the target organisation. A general introduction
is given on Big Data concepts, the available
platform, a few representative applications in
different domains (possibly already with a focus on the organisation's domain), the main challenges and
main steps. The maturity of the client and a few risk
factors can be checked (e.g. management support,
internal expertise, business motivation).
2. Business and Use Case Understanding. This is also
the first phase of CRISP-DM. Its goals are to collect
the business needs/problems that must be addressed
using Big Data and also to identify one or more
business use cases generating the most value out of
the collected data. A checklist supporting this phase
is shown in Table 2.
Table 2: Workshop checklist (source: [29])

Business Understanding:
- Strategy & Positioning: Global Strategy; Product/Services Positioning; Scorecards - KPI's; Digital Strategy; Touchpoints for Customers/Prospects (Search, e-Commerce, Social Media, Websites,...)
- Direct Competitors
- Disruptive Elements: Disruptive Models; Disruptive Technologies; Disruptive Behaviour
- Select Potential Use Cases: Objectives; Priorities; Costs, ROI, Constraints; Value to the Client; Scope; New Data Source
- High Level Feasibility: Data Mining Goals & KPIs; Resources Availability; Time to Deliver

Use Case Understanding:
- Assess Business Maturity
- Determine Use Case Objectives: Value to the Client; Business Success Criteria
- Assess Situation: Resource Requirements; Assumptions/Constraints; Risks and Contingencies; Terminology; Costs and Benefits
- Refine Data Mining Goals: Data Mining Goals; Data Mining KPIs
- Produce Project Plan: Approach; Deliverables; Schedule; Risk Mitigation (Privacy,...); Stakeholders to involve; Initial Assessment of Tools and Techniques
Figure 7: Iterative development of the platform and method
This phase is organised based on one or a
few workshops, involving the Business Manager,
Data Analyst, IT architect and optionally selected
specialists, such as the IT security manager if there
are specific security/privacy issues that need to be
checked at this early stage. Both the as-is and to-be situations are considered. Specific tools to support the efficient organisation of those workshops are described in Section 4. At the end of this step, a project plan is also defined.
3. Pilot Implementation of Service or Product. In this phase, the following implementation activities are carried out in an Agile way (a minimal code sketch of the modelling and evaluation steps is given after this list):
- Data Understanding: analyse data sets to detect interesting subset(s) for reaching the business objective(s) and make sure about data quality.
- Data Preparation: select the right data and clean/extend/format them as required.
- Modelling: select specific modelling techniques (e.g. decision trees or neural networks). The model is then built and tested for its accuracy and generality. Possible modelling assumptions are also checked. The model parameters can be reviewed or other/complementary techniques can be selected.
- Evaluation: assess the degree to which the model meets the business objectives, using realistic or even real data.
- Deployment: transfer the validated solution to the production environment, make sure users can use it (e.g. appropriate visualization, dashboards) and start monitoring (performance, accuracy).
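As a concrete illustration of the Modelling and Evaluation activities, the sketch below trains a decision tree and measures its accuracy on held-out data. It is a minimal example, assuming scikit-learn and a hypothetical prepared data set stored in prepared_dataset.csv with a binary target column named incident; it is not the actual processing used in the pilots.

```python
# Minimal sketch of the Modelling and Evaluation activities (illustrative assumptions:
# scikit-learn is available and the Data Preparation step produced "prepared_dataset.csv"
# with numeric features and a binary target column named "incident").
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

data = pd.read_csv("prepared_dataset.csv")
X = data.drop(columns=["incident"])
y = data["incident"]

# Hold out part of the data to check accuracy and generality (Evaluation activity).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Select a specific modelling technique (Modelling activity); here a simple decision tree.
model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)

print("Held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```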
Our pilots are kept confidential. However, Table 3 presents the main features of the first four pilots based on the first three "V"s of Big Data [37].
4 LESSONS AND RECOMMENDATIONS LEARNED FROM OUR PILOT CASES
In this section, we present some lessons learned and
related guidelines that are useful to support the whole
process and increase the chances of success. We also
illustrate our feedback based on some highlights from
two pilot cases used as running examples: the IT
maintenance pilot case and the clinical pathway pilot
case.
4.1 Defining Progressive and Measurable
Objectives.
Through the deployment of a Big Data solution, a
company expects to gain value out of its data. The
business goals should be clearly expressed. There
exists different methods to capture goals. In our pilots
we integrated goal oriented requirements engineering
techniques to elicit and structure business goals and
connect them with data processing processes and
components [55, 56]. Such methods also include specific
techniques to verify that goals are not too idealised by
helping in the discovery of obstacles and their resolution
in order to define achievable goals.
Another way to connect the goals with the (business) reality is to define how to measure the resulting value, which should be defined right from the business understanding phase, typically by relying on KPIs (Key Performance Indicators). Companies should already have defined their KPIs and be able to measure them. If this is not the case, they should start by improving on this: in other words, Business Intelligence should already be present in companies.

Table 3: Main characteristics of first pilot wave

# | Domain | Volume | Velocity | Variety | Main challenge
1 | Life science | 20 GB/analysis, 2 TB/week | High (requires parallel processing) | Business data and traceability (food, pharmaceutical, cosmetic industry) | Product quality
2 | Space | Galileo ground segment maintenance (12 EU sites, 16 remote sites) | Medium | High: messages, logs | Predictive maintenance of costly equipment; high dependability (99.8%)
3 | Health | 900 beds on 3 sites | Real-time | Several sources and formats | Reduce morbidity and mortality, guarantee confidentiality
4 | IT Maintenance | About 3000 servers | Real-time | High (databases, events, logs...) | Predictive maintenance, cost optimisation
Based on this, different improvement strategies can be
identified and discussed to select a good business case.
In this process, the gap with the current situation should also be considered: it is safer to keep a first project with quite modest objectives than to risk failure by attempting a too complex project that could bring more value. Once a
pilot is successful, further improvements can be planned
in order to add more value.
Computer Maintenance Area Case Study. The
large IT provider considered here manages more
than 3000 servers that are hosting many web sites,
running applications and storing large amounts of related
customer data. No matter what efforts are taken, servers are still likely to go off-line, networks to become unavailable or disks to crash, generally at unexpected times that are less convenient and more costly to manage, such as nights or weekends. The considered company is currently
applying standard incident management and preventive
maintenance procedures based on a complete monitoring
infrastructure covering both the hardware (network
appliances, servers, disks) and the application level
(service monitoring).
In order to reduce the number of costly reactive events
and optimise preventive maintenance, the company is
willing to develop more predictive maintenance by trying to anticipate the unavailability of servers in such a way that it can react preventively and, ultimately, prevent
such unavailability. In the process, the client wants to
diagnose the root causes of incidents and resolve them in
order to avoid possible further incidents which can turn
into a nightmare when occurring in a reactive mode. The
ultimate goal is to increase the service availability, the
customer satisfaction and also reduce the operating costs.
The resulting KPI is the Total Cost of Ownership (TCO), and typical cost components to be considered are the following (a small illustrative calculation is sketched after this list):
- maintenance on hardware and software, which could be reduced through better prediction,
- personnel working on these incidents,
- any penalties related to customer Service Level Agreements (SLAs),
- indirect effects on the client's business and its brand image.
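To make the KPI concrete, the following small sketch aggregates these cost components into a monthly TCO figure; all numbers and category names are hypothetical and purely illustrative, not data from the pilot.

```python
# Illustrative TCO tracking for the IT maintenance case (all figures are hypothetical).
monthly_costs_eur = {
    "hardware_software_maintenance": 12_000.0,  # could be reduced through better prediction
    "incident_personnel": 8_500.0,              # staff working on reactive incidents
    "sla_penalties": 3_000.0,                   # penalties on customer SLAs
    "indirect_business_impact": 5_000.0,        # estimated impact on the client's business and brand
}

tco = sum(monthly_costs_eur.values())
print(f"Monthly TCO: {tco:,.0f} EUR")

# Hypothetical scenario: predictive maintenance avoids 30% of reactive incident costs.
saving = 0.30 * (monthly_costs_eur["incident_personnel"] + monthly_costs_eur["sla_penalties"])
print(f"Potential monthly saving: {saving:,.0f} EUR")
```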
Clinical Pathway Case Study. Hospitals are increasingly deploying clinical pathways, defined as a multidisciplinary vision of the treatment process required by a group of patients with the same pathology and a predictable clinical follow-up [6]. The reason is not only to reduce the variability of clinical processes but also to improve care quality and achieve better cost control [54]. It also enables richer analysis of the
data produced and thus the profiling of patients with
higher risks (for example due to multi-pathology or
intolerances).
A typical workflow (e.g. for chemotherapy) is shown in Figure 8.

Figure 8: A typical chemotherapy workflow

It is a sequence of drug deliveries or cures, generally administered in a day hospital. Each cure is followed by a resting period at home that lasts from a few days to a few weeks. A minimal interval between cures is required because chemotherapy drugs are toxic
and the body needs some time to recover between two
drug deliveries. When following the ideal treatment protocol, the number of cancerous cells is progressively reduced, hopefully reaching full healing or cancer
remission. If for some reason, chemotherapy cures do
not closely follow the intended periodicity or if doses
are significantly reduced, the treatment efficiency may
be suboptimal. In such conditions, cancerous cells may
multiply again, which can result in a cancer relapse.
Figure 9 shows the high level goals for the optimal
organisation of care pathways. Goals and obstacles are
respectively depicted using blue and red parallelograms.
Agents (either people or processing components) are
pictured using yellow hexagons. Some expectations
on human agents are also captured using yellow
parallelograms. The adequate workflow should be
enforced for all patients within the recommended
deadlines given the possible impact on patient relapse.
Ethical principles also require a fair allocation of
resources, i.e. every patient deserves optimal care
regardless of their medical condition or prognosis. The
workload should also be balanced to avoid the staff
having to manage unnecessary peak periods.
Reaching those goals together, of course, requires enough resources to be available, and a number of related obstacles (in red) have to be handled; monitoring the flow of patients joining and leaving the pathway is therefore important. The available workforce can also be influenced by staff availability and by public holidays reducing the available slots for delivering care. A number of mitigation actions are then identified to better ensure that the workforce is adequate. An agent with a key responsibility in the system is the scheduler, which must manage every appointment. Human agents are not very good at this task because the problem is very large and it is difficult to find a solution that simultaneously meets all patient and service constraints. Moreover, the planning must constantly be reconsidered to deal with unexpected events and the flow of incoming/outgoing patients. In contrast, a combined predictive and prescriptive solution is very interesting because it has the capability to ensure optimal care and service operation while also taking into account the risk that some patients could be delayed.
In order to measure the quality of chemotherapeutic care, a quantifiable indicator called the “Relative Dose Intensity” (RDI) was defined [35]. It captures both the fact that the required dose is administered and the timing of the delivery, on a scale from 0% (no treatment) to 100% (total conformance).
RDI = (delivered dose / planned dose) × (planned duration / real duration)
Medical literature has shown, for a number of cancers,
that the relapse-free survival is strongly correlated with
the RDI. For instance, for breast cancer, a key threshold
value is 85% [41]. Hence this indicator can be seen as
a gauge that should be carefully managed across the
whole clinical pathway.
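The RDI can be computed directly from the planned and delivered treatment data. The short sketch below implements the formula above and flags a pathway that falls below a given threshold (85% being the breast cancer value cited from [41]); the patient figures are hypothetical.

```python
def relative_dose_intensity(delivered_dose, planned_dose,
                            planned_duration_days, real_duration_days):
    """RDI = (delivered dose / planned dose) x (planned duration / real duration)."""
    return (delivered_dose / planned_dose) * (planned_duration_days / real_duration_days)

# Hypothetical patient: 90% of the planned dose, treatment stretched from 84 to 98 days.
rdi = relative_dose_intensity(delivered_dose=540, planned_dose=600,
                              planned_duration_days=84, real_duration_days=98)

THRESHOLD = 0.85  # key threshold reported for breast cancer [41]
print(f"RDI = {rdi:.0%}")
if rdi < THRESHOLD:
    print("Below threshold: the deviation from the clinical pathway should be investigated.")
```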
4.2 From Descriptive to Predictive and then
Prescriptive Data Analysis.
Analytics is a multidisciplinary concept that can be
defined as the means to acquire data from diverse
sources, process them to elicit meaningful patterns
and insights, and distribute the results to proper
stakeholders [10, 44]. Business Analytics is the application of such techniques by companies and organisations in order to get a better understanding of the performance of their business and drive improvements. Three complementary categories of analytics can be distinguished and combined in order to reach the goal of creating insights and helping to make better decisions. Those analytics consider different time focuses, questions and techniques, as illustrated in Table 4 [38, 51].
In a number of domains, it is interesting to consider an evolution scheme that starts from immediate reactions triggered by analysing data, moves towards more intelligence in anticipating undesirable situations, and even considers how to prevent them as much as possible.
Computer Maintenance Area Case Study. In terms of maintenance, the starting point is the identified KPI of total cost of ownership (TCO), which includes the costs of purchase, maintenance and repair in the event of a breakdown. Different strategies can be envisaged:
- react to problems only after the occurrence of a breakdown. This translates into a generally high cost because a quick reaction is required to minimize downtime. Moreover, any unavailability has a negative impact in terms of image, or even a penalty if a Service Level Agreement (SLA) has been violated. This should of course be minimised through the use of next-level strategies that can benefit from data analytics,
- anticipate the occurrence of breakdowns based on system monitoring. Simple strategies can be implemented. For example, an alert can be triggered when a storage unit approaches a threshold close to its maximum capacity. However, this does not enable the prediction of failures resulting from complex sequences of events. Mostly descriptive techniques are used at this level,
- try to predict problems based on known history and observation of the system. At this level, predictive data analysis techniques can discover cause-effect relationships between parts of the system which, in cascade, can cause unavailability. For example, applying a badly validated patch may affect a service that can itself paralyse a business process,
- improve the system, as the ultimate step. It is necessary to ensure that the system operates under optimum conditions by eliminating the root causes that could trigger some failure. Prescriptive techniques are used at this level.

Figure 9: Goal analysis for clinical pathways: strategic goals and main obstacles
The predictive solution was the best option, but it
should only be considered if the preventive step is carried
out. Similarly, the most common time patterns should be identified and processed first. For example, a storage unit is more likely to be saturated on days when backups are performed, which usually happens predictably (weekends or month ends). Anticipating this would avoid expensive interventions, especially during weekends.
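To make the storage example concrete, the sketch below fits a simple linear trend on recent disk-usage measurements to estimate when a volume will reach its capacity, so that the intervention can be planned outside costly periods such as weekends. It is a deliberately simple illustration assuming NumPy and daily usage samples; the pilot itself relies on richer monitoring data and models.

```python
# Minimal trend-based anticipation of storage saturation (illustrative only).
import numpy as np

capacity_gb = 500.0
# Hypothetical daily usage measurements for the last two weeks (GB).
usage_gb = np.array([310, 314, 317, 322, 330, 333, 335,
                     341, 346, 350, 355, 361, 365, 370], dtype=float)
days = np.arange(len(usage_gb))

# Fit a linear trend: usage = slope * day + intercept.
slope, intercept = np.polyfit(days, usage_gb, deg=1)

if slope <= 0:
    print("No growth trend detected.")
else:
    days_until_full = (capacity_gb - usage_gb[-1]) / slope
    print(f"Estimated growth: {slope:.1f} GB/day")
    print(f"Estimated time before saturation: {days_until_full:.0f} days")
    # A preventive ticket can then be raised well ahead, avoiding weekend interventions.
```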
Clinical Pathway Case Study. The operation of clinical pathways is characterised by the occurrence of many events, which may or may not be expected and thus impact the scheduled behaviour. An important concern is to detect such events and decide how to manage possible deviations to minimise their impact, especially on the quality of care KPI. Different possible strategies can be explored for this purpose:
- reactive strategies should be avoided as much as possible because the impact on the patient is irreversible. Some reactive cases relate to a patient no-show or a last-minute medical no-go; the action is then to reschedule a new appointment as soon as possible.
- preventive strategies can be used to reduce the risk of no-show, for example by sending a reminder (e.g. text message, phone call) one or two days before the appointment. Descriptive data analytics are enough at this level.
- predictive strategies relying on predictive analytics can be used to learn risk factors for specific patients, which could result in more careful planning or guide the drug selection. For example, a possible intolerance or an interaction with another pathology could be anticipated and addressed by selecting an alternative drug cocktail.
- prescriptive strategies will deploy a globally optimising scheduler able to solve the planning problem by taking into account the constraints resulting from the treatment plan of each patient and the service availabilities. Such a system was successfully prototyped and is reported in [42].

Table 4: Overview of analytics in terms of questions, techniques and outputs (source: [51])

Aspect | Descriptive Analytics | Predictive Analytics | Prescriptive Analytics
Questions | What has happened? Why did it happen? | What will happen? Why will it happen? | What should be done? Why should it be done?
Techniques | Statistical Analytics; Data Integration; Data Augmentation; Data Reduction | Data Mining; Machine Learning; ... | Optimisation; Simulation; Operation Research; Management Science
Outputs | Reports on historical data; Insight from raw data; ... | Future opportunities; Future risks; ... | Recommended business decisions; Optimal courses of actions; ...
4.3 Using Questionnaires for Workshops
Conducting a workshop requires paying attention to many issues while also focusing the discussion on the most relevant ones. A questionnaire can provide efficient support, both as possible preparation before the workshop and as a checklist during the workshop. Table 5 shows a few questions about the data to process.

Table 5: Some workshop questions about data
Q.UD.1 What are the data sources and data types used in your current business processes?
Q.UD.2 What tools/applications are used to deal with your current business processes?
Q.UD.3 Are your present business processes performing complex processing on data?
Q.UD.4 How available is your data? What happens if data is not available?
Q.UD.5 Do different users have different access rights on your data?
Q.UD.6 Does your data contain sensitive information (e.g. personal or company confidential data)?
Q.UD.7 What are the consequences of data alteration? Do you know the quality level of your data?
4.4 Using Modelling Notations
Modelling using standard modelling notations is useful
to support business and data understanding. During
workshops, a whiteboard can be used to sketch models
together with the audience. Note that this should not be confused with the data modelling step in CRISP-DM, which is related but actually occurs later in the process.
In our experience, data-flow and workflow models
help to understand which process is generating, updating
or retrieving data. UML class diagrams also help to
capture the domain structure [40].
On the other hand, use cases should be avoided because they only focus on a specific function and cannot provide a good global picture of the problem. They should rather be used later in the process, when considering the implementation.
4.5 Defining Activity Checkpoints
An Agile approach allows the process to be quite flexible and enables going back and forth across activities. To make sure an activity can be started with enough input, we defined readiness checklists such as the one shown in Table 6.

Table 6: Evaluation readiness checklist (partial)
R.EV.1 Are you able to understand/use the results of the models?
R.EV.2 Do the model results make sense to you from a purely logical perspective?
R.EV.3 Are there apparent inconsistencies that need further exploration?
R.EV.4 From your initial glance, do the results seem to address your organization's business questions?
5 RELATED WORK AND DISCUSSION
5.1 Methodologies Focusing on Adoption
Section 2 gives an exhaustive historical perspective of
the evolution of relevant methodologies. While the
first proposed approaches based on data mining were already iterative in nature [50], their evolution over time clearly shows growing attention to how to ease the adoption of such methodologies by companies. The Agile culture has been a key milestone to better include the customer in the process and to drive that process towards the production of business value [23]. Commercial methods like IBM Stampede are also strongly inspired by this trend [28]. In complement to those methods, the need to identify barriers and adoption factors has also been addressed by recent work discussed earlier, such as critical success factors [24, 46].
Consolidated Big Data methodologies are also being published in more practical and simplified presentation forms in order to be attractive to companies. The DISTINCT method is based on only
four steps (acquire, process, analyse, visualise) and
considers the use of feedback loops to enable repeated
refinements of the data processing [18]. Although the
analysis phase is not explicitly mentioned, this iterative
approach can be used to set up a feedback channel
between IT and business people. After each feedback
cycle, the system can then be refined by enhancing the
data preparation or data analysis steps.
The well-known “for Dummies” series also has a dedicated book on Big Data [27]. It contains a section about how to create an implementation roadmap based on factors like business urgency, budget, skills, and risk level; an agile management approach is also
recommended. The availability of Business Intelligence
is also identified as an easing factor.
Work carried out in related fields about how to address organisational challenges is also worth investigating. For example, the Cloud Computing Business Framework (CCBF) helps organisations achieve good Cloud design, deployment and services.
Similar to our approach, CCBF is a conceptual and
an architectural framework relying on modelling,
simulation, experiments and hybrid case studies [7, 8].
Given the variety and multidisciplinary nature of the complex systems being analysed (e.g. supply chains, IT systems, health systems), it is useful to consider a Multi-Disciplinary Engineering Environment (MDDE)
approach. A very good and comprehensive survey on
the approaches of data integration based on ontologies
is described in [17]. It also gives guidelines for the
selection of technologies of data integration for industrial
production systems.
5.2 Ethical Concerns about Data Privacy
The interaction with companies also raised some ethical
concerns and questions like: “Are we sufficiently
cautious about the Big Data phenomenon?” It is
certainly a great technological revolution of our time to collect large amounts of data and to derive value from them to improve the health and living conditions of humans.
Nevertheless, we are faced with a problem of ethics when
using predictive algorithms. Regulatory intensification is
therefore necessary to find a good compromise between
the use of personal data and the protection of privacy.
For example, in the field of health, we can wonder
about the way governments intend to exploit the data
collected. Should those data be made available (as open data) or should a solution be found to support the
exploitation of private data?
With the use of massive data in the medical community, the legal and economic aspects are changing at great speed. This challenges ethical principles and rules
in the relationship between a doctor and a patient.
This also disturbs the balance between confidentiality
and transparency and creates a feeling of declining
confidence in the health environment around the
management and exploitation of Big Data. The ethics
of this type of data requires well-supervised control of
the use of medical information [5, 16].
Studies have also demonstrated the segmentation
power of predictive modelling and resulting business
benefits for life insurance companies [3]. While
some customers with “lower risk” could enjoy better
conditions for their insurance, customers with higher
anticipated risks could be excluded due to unaffordable
rates, thus reducing the solidarity effect of life insurance.
Data is also increasingly carrying location information
due to the large development of mobile applications
and the emergence of the Internet of Things. A
specific branch of Big Data called location analytics
focuses on this area and can endanger privacy if applied without safeguards. Specific guidelines and techniques are being developed for this purpose; some are issued, e.g., by the European Commission for public administrations [2].
Specific data processing techniques and algorithms are
also being developed for privacy preserving location
based services [34, 52].
At a more general level, in order to better control the
huge quantities of data processed every day and to ensure
that every single person is respected, the European
Commission has issued the General Data Protection
Regulation in 2016, which will come into force in May 2018
[19]. An EU portal with extensive resources is available
to give some starting points to companies [20].
Our recommendation, based on our pilots, is to
investigate this issue early in the process. This
can already be envisioned at the business and data
understanding phases and involve relevant people like
Chief Information Security Officer or even a more
specific Data Protection Officer if this role is defined.
Actually this happened quite naturally in most of our
pilot cases because the data had to be processed outside
of the owning organisation. However, the focus was
more on confidentiality than on the purpose of the data
processing itself.
5.3 Cyber Security Issues
Among the challenges of Big Data, data security is paramount against piracy and requires the development of systems that secure data exchanges by ensuring strict control of access to the Big Data platform, thus guaranteeing the confidentiality of data. Securing a Big Data platform is nevertheless a domain in its own right, because the very principle of such a system is that it can be based on a heterogeneous architecture spread over several nodes. ENISA has produced a landscape of Big Data threats
and a guide of good practices [14]. This document
lists typical Big Data assets, identifies related threats,
vulnerabilities and risks. Based on these points, it
suggests emerging good practices and active areas for
research.
Storing sensitive data on the Cloud, for example,
is not without consequences, because the regulations
are not the same in all countries. A sensitive aspect
is the management of the data storage and processing
locations, e.g. the need to process data in a given
country. However, as this situation is also hindering
European competitiveness in a global market, the EU is
currently working on a framework for the free flow of
non-personal data in the EU [21].
6 CONCLUSIONS
In this paper, we described how we addressed the
challenges and risks of deploying a Big Data solution
within companies willing to adopt this technology in
order to support their business development. We first
looked at different methods reported over time in the
literature. Rather than building yet another method, we realised that the key, when considering the adoption of Big Data in an organisation, is the process followed to come up with a method that fits the context and needs and will maximize the chances of success. Based on this idea, we defined a generic guidance process relying on available methods as building bricks. To be meaningful, our approach also relies strongly on lessons learned from industrial cases, which on the one hand helped in validating our process guidance and on the other hand can be used as concrete supporting illustrations.
Moving forward, we plan to consolidate our work
based on what we will learn in the next series of
pilot projects. This includes investigating challenges
from other domains. We plan to address life sciences, which require sustained processing of high volumes of data, and the space domain, with its highly distributed
infrastructures. Considering the global development
process, until now we have mainly focused on the
discovery and data understanding phases. So our plan is
to provide more guidance on the project execution phase
using our most advanced pilots that are now reaching full
deployment. In our guidance process, we also had to
face a number of problems which sometimes blocked all
further progress. In some cases the reason was a lack of
business value or maturity, for which the recommended
action was to postpone the process. In other cases, some
blocking issues could not be overcome or were delaying
the project a lot longer than expected, e.g. to set up
a non-disclosure agreement about data access, to get
actual data access, to configure proprietary equipment,
etc. Guidance about how to detect and avoid such cases
is also part of our work as it helps to increase the chance
of successful deployment.
ACKNOWLEDGEMENTS
This research was partly funded by the Walloon Region
through the “PIT Big Data” project (grant nr. 7481). We
thank our industrial partners for sharing their cases and
providing rich feedback.
REFERENCES
[1] R. Balduino, “Introduction to OpenUP,” https://
www.eclipse.org/epf/general/OpenUP.pdf, 2007.
[2] L. Bargiotti, I. Gielis, B. Verdegem, P. Breyne,
F. Pignatelli, P. Smits, and R. Boguslawski,
“European Union Location Framework Guidelines
for public administrations on location privacy. JRC
Technical Reports,” 2016.
[3] M. Batty, “Predictive Modeling for Life Insurance: Ways Life Insurers Can Participate in the Business Analytics Revolution,” Deloitte Consulting LLP, April 2010.
[4] T. Bedos, “5 key things to make big data analytics
work in any business, http://www.cio.com.au,
2015.
[5] J. Béranger, Big Data and Ethics: The Medical Datasphere. Elsevier Science, 2016.
[6] H. Campbell, R. Hotchkiss, N. Bradshaw, and
M. Porteous, “Integrated care pathways,” British
Medical Journal, pp. 133–137, 1998.
[7] V. Chang, A Proposed Cloud Computing Business
Framework. Commack, NY, USA: Nova Science
Publishers, Inc., 2015.
[8] V. Chang, R. J. Walters, and G. Wills, “The
development that leads to the cloud computing
business framework, International Journal of
Information Management, vol. 33, no. 3, pp. 524
– 538, 2013.
[9] H.-M. Chen, R. Kazman, and S. Haziyev, “Agile
big data analytics development: An architecture-
centric approach,” in Proceedings HICSS’16,
Hawaii, USA. Washington, DC, USA: IEEE
Computer Society, 2016, pp. 5378–5387.
[10] H. Chen, R. H. L. Chiang, and V. C. Storey,
“Business intelligence and analytics: From big data
to big impact,” MIS Q., vol. 36, no. 4, pp. 1165–
1188, Dec. 2012.
[11] K. Collier, B. Carey, E. Grusy, C. Marjaniemi,
and D. Sautter, “A Perspective on Data Mining,
Northern Arizona University, 1998.
[12] F. Corea, Big Data Analytics: A Management
Perspective. Springer Publishing Company, Inc.,
2016.
[13] K. Crowston, “A capability maturity model for
scientific data management,” BibSonomy, 2010.
[14] E. Damiani et al., “Big data threat landscape and
good practice guide,” https://www.enisa.europa.eu/
publications/bigdata-threat-landscape, 2016.
[15] G. S. do Nascimento and A. A. de Oliveira, An
Agile Knowledge Discovery in Databases Software
Process. Springer Berlin Heidelberg, 2012, pp.
56–64.
[16] EESC, “The ethics of Big Data: Balancing
economic benefits and ethical questions of Big
Data in the EU policy contex,” European Economic
and Social Committee, Visits and Publications
Unit, 2017.
[17] F. J. Ekaputra, M. Sabou, E. Serral, E. Kiesling,
and S. Biffl, “Ontology-based data integration in
multi-disciplinary engineering environments: A
review,” Open Journal of Information Systems
(OJIS), vol. 4, no. 1, pp. 1–26, 2017. [Online].
Available: http://nbn-resolving.de/urn:nbn:de:101:
1-201711266863
[18] T. Erl, W. Khattak, and P. Buhler, Big Data
Fundamentals: Concepts, Drivers & Techniques.
Prentice Hall, 2016.
[19] European Commission, “General Data Protection
Regulation 2016/679, http://eur-lex.europa.
eu/legal-content/EN/TXT/?uri=CELEX:
32016R0679, 2016.
[20] European Commission, “The EU General
Data Protection Regulation (GDPR),
http://www.eugdpr.org, 2016.
[21] European Commission, “A framework for the free
flow of non-personal data in the EU,” http://europa.
eu/rapid/press-release MEMO-17-3191 en.htm,
2017.
[22] U. Fayyad, G. Piatetsky-shapiro, and P. Smyth,
“From data mining to knowledge discovery in
databases,” AI Magazine, vol. 17, pp. 37–54, 1996.
[23] P. Franková, M. Drahošová, and P. Balco, “Agile
project management approach and its use in big
data management,” Procedia Computer Science,
vol. 83, pp. 576 – 583, 2016.
[24] J. Gao, A. Koronios, and S. Selle, “Towards A
Process View on Critical Success Factors in Big
Data Analytics Projects,” in AMCIS, 2015.
[25] Gartner, “Gartner survey reveals investment in
big data is up but fewer organizations plan
to invest, http://www.gartner.com/newsroom/id/
3466117, 2016.
[26] J. Haffar, “Have you seen asum-dm?”
ftp://ftp.software.ibm.com/software/data/
sw-library/services/ASUM.pdf, 2015.
[27] J. Hurwitz, A. Nugent, F. Halper, and M. Kaufman,
Big Data For Dummies. John Wiley & Sons, 2013.
[28] IBM, “Stampede,” http://www.ibmbigdatahub.
com/tag/1252, 2013.
[29] IBM, “ASUM-DM, https://developer.
ibm.com/predictiveanalytics/2015/10/16/
have-you-seen- asum-dm, 2015.
[30] K. Jensen, “Crisp-dm process diagram,
https://commons.wikimedia.org/wiki/File:
CRISP-DM Process Diagram.png distributed
under CC-SHA2, 2012.
[31] J. Kelly and J. Kaskade, “CIOs & Big Data: What
Your IT Team Wants You to Know,” http://blog.
infochimps.com/2013/01/24/cios-big-data, 2013.
[32] L. Lau, F. Yang-Turner, and N. Karacapilidis,
“Requirements for big data analytics supporting
decision making: A sensemaking perspective,”
in Mastering data-intensive collaboration and
decision making, N. Karacapilidis, Ed. Springer
Science & Business Media, April 2014, vol. 5, pp.
49 – 70.
[33] D. Lehmann, D. Fekete, and G. Vossen,
“Technology selection for big data and analytical
applications,” Open Journal of Big Data (OJBD),
vol. 3, no. 1, pp. 1–25, 2017. [Online].
Available: http://nbn-resolving.de/urn:nbn:de:101:
1-201711266876
[34] L. Liu, “From data privacy to location privacy:
Models and algorithms,” in Proceedings of the
33rd International Conference on Very Large Data
Bases, ser. VLDB ’07. VLDB Endowment, 2007,
pp. 1429–1430.
[35] G. Lyman, “Impact of chemotherapy dose intensity
on cancer patient outcomes,” J Natl Compr Canc
Netw, pp. 99–108, Jul 2009.
[36] G. Mariscal, Ó. Marbán, and C. Fernández, “A
survey of data mining and knowledge discovery
process models and methodologies,” Knowledge
Eng. Review, vol. 25, no. 2, pp. 137–166, 2010.
[37] A. D. Mauro, M. Greco, and M. Grimaldi, “A
formal definition of big data based on its essential
features,” Library Review, vol. 65, no. 3, pp. 122–
135, 04 2016.
[38] M. Minelli, M. Chambers, and A. Dhiraj, Big Data,
Big Analytics: Emerging Business Intelligence and
Analytic Trends for Today’s Businesses, 1st ed.
Wiley Publishing, 2013.
[39] C. Nott, “Big Data & Analytics Maturity
Model,” http://www.ibmbigdatahub.com/blog/
big-data-analytics-maturity-model, 2014.
[40] OMG, “Unified Modeling Language (UML) -
Version 2.X,” http://www.omg.org/spec/UML,
2005.
[41] M. Piccart, L. Biganzoli, and A. Di Leo, “The
impact of chemotherapy dose density and dose
intensity on breast cancer outcome: what have we
learned?” Eur J Cancer., vol. 36, no. Suppl 1, April
2000.
[42] C. Ponsard, R. D. Landtsheer, Y. Guyot,
F. Roucoux, and B. Lambeau, “Decision making
support in the scheduling of chemotherapy
coping with quality of care, resources and ethical
constraints,” in ICEIS 2017 - Proc. of the 19th Int.
Conf. on Enterprise Information Systems, Porto,
Portugal, April 26-29, 2017.
[43] C. Ponsard, A. Majchrowski, S. Mouton, and
M. Touzani, “Process guidance for the successful
deployment of a big data project: Lessons learned
from industrial cases,” in Proc. of the 2nd Int.
Conf. on Internet of Things, Big Data and Security,
IoTBDS 2017, Porto, Portugal, April 24-26, 2017.
[44] D. J. Power, “Using ’Big Data’ for analytics and
decision support,” Journal of Decision Systems,
vol. 23, no. 2, Mar. 2014.
[45] E. Rot, “How Much Data Will You Have
in 3 Years?” http://www.sisense.com/blog/
much-data-will-3-years, 2015.
[46] J. Saltz and I. Shamshurin, “Big Data Team
Process Methodologies: A Literature Review and
the Identification of Key Factors for a Projects
Success,” in Proc. IEEE Int. Conf. on Big Data,
2016.
[47] J. S. Saltz, “The need for new processes,
methodologies and tools to support big data teams
and improve big data project effectiveness, in
IEEE International Conference on Big Data, Big
Data 2015, Santa Clara, CA, USA, October 29 -
November 1, 2015, pp. 2066–2071.
[48] SAS Institute, “SEMMA Data Mining
Methodology,” http://www.sas.com/technologies/
analytics/datamining/miner/semma.html, 2005.
[49] Scrum Alliance, “What is scrum? an agile
framework for completing complex projects,
https://www.scrumalliance.org/why- scrum, 2016.
[50] C. Shearer, “The CRISP-DM Model: The New
Blueprint for Data Mining,” Journal of Data
Warehousing, vol. 5, no. 4, 2000.
[51] R. Soltanpoor and T. Sellis, Prescriptive Analytics
for Big Data. Cham: Springer International
Publishing, 2016, pp. 245–256.
[52] G. Sun, D. Liao, H. Li, H. Yu, and V. Chang,
“L2P2: A location-label based approach for privacy
preserving in LBS,” Future Generation Computer
Systems, vol. 74, no. Supplement C, pp. 375–384,
2017.
[53] Two Crows Corporation, “Introduction to
Data Mining and Knowledge Discovery,”
http://www.twocrows.com/intro-dm.pdf, 2005.
[54] P. A. van Dam, “A dynamic clinical pathway for
the treatment of patients with early breast cancer is
a tool for better cancer care: implementation and
prospective analysis between 2002–2010,” World
Journal of Surgical Oncology, vol. 11, no. 1, 2013.
[55] A. van Lamsweerde, “Goal-oriented requirements
engineering: a guided tour,” in Requirements
Engineering, 2001. Proceedings. Fifth IEEE
International Symposium on, 2001, pp. 249–262.
[56] A. van Lamsweerde, Requirements Engineering -
From System Goals to UML Models to Software
Specifications. Wiley, 2009.
AUTHOR BIOGRAPHIES
Ir. Christophe Ponsard
holds a master’s degree in Electrical
Engineering and Computer
Science. He runs the Software
and System Engineering
department of CETIC focusing
on requirements engineering,
model-driven development
and software quality. He
is actively contributing to several applied research
programs at the European level and to transfer activities with
local companies to foster the adoption of emerging
technologies like Big Data, Machine Learning and IoT.
Mounir Touzani holds a
PhD from the University of
Montpellier. His areas of
expertise are requirements
engineering, business process
analysis, business rule systems,
database engineering and Data
Science. He is actively involved
in database operations for large-scale
administrative processes and is working on the
development of Big Data deployment methodologies,
Machine Learning and Cloud computing.
Annick Majchrowski has been a
member of the CETIC Software
and Systems Engineering
department since 2007. She
holds a BA in Mathematics
and a BA in Computer Science.
She leads the activities related
to software process audit and
deployment. She is actively
involved in software process improvement in several
organisations, both in the public and private sectors.
She also contributes to developing methodologies for the
optimal adoption of emerging technologies.