CHALLENGES TO DATA SCIENCE PROJECTS WITH SMEs: AN ANALYSIS AND DECISION SUPPORT TOOL
A PREPRINT
Daan A. Kolkman
Jheronimus Academy of Data Science
Eindhoven University of Technology
’s-Hertogenbosch, the Netherlands
d.kolkman@tue.nl
Ruud Sneep
Jheronimus Academy of Data Science
Eindhoven University of Technology
’s-Hertogenbosch, the Netherlands
r.sneep@tue.nl
February 27, 2019
ABSTRACT
Data science has quickly developed as an academic field and has sparked the imagination of public
and private sector alike. While considerable effort is devoted towards the technical development of
statistical approaches for analysing the wealth of available data, our understanding of data science in
practical contexts has lagged behind. In this paper we build on 40 applied data science projects that
were conducted with small to medium enterprises (SMEs) in the Netherlands to identify common
pitfalls and challenges. This analysis informs the development of a "data science project canvas": a
tool that helps people with a non-technical background to define a data science project.
Keywords Entrepreneurship · Data science · Decision making support
1 Data Science in practice
The ongoing increase of computational power allows for the development of ever more sophisticated data analysis techniques, models, and algorithms (Venturini, Jensen, & Latour, 2015). This broad collection of data-centric innovations is encompassed by the field of 'data science' (Hey & Tolle, 2009). Data science has quickly proliferated beyond the academic domain; industry and government have also taken an interest in the field (Cukier & Mayer-Schoenberger, 2013).
The perceived advantages of data science are numerous (Floridi & Taddeo, 2016) and the benefits of using quantified information in general have been the subject of inquiry for decades (van Daalen & Janssen, 2002). Yet, the prevalence of models and algorithms - which are the outcomes of data science - falls short of its potential (Inman, 2011). Explanations for this "application gap" are numerous (Kolkman, Campo, Balke-Visser, & Gilbert, 2016). Diez and Mcintosh (2011) suggest it is caused by a lack of understanding between those who develop algorithms and those who use them, Delden (2011) argues that the technical capabilities of algorithms and the software they are embedded in are not flexible enough, and Happe and Ballman (2008) point out that algorithms need to fit within the day-to-day routines of the intended user base. Despite decades of research, we know very little about why some quantifications are used and others are not (Syme, Bennet, Macpherson, & Thomas, 2011).
This lack of understanding is particularly surprising considering the large investments in data science (Bulger, Taylor, & Schroeder, 2014), the modest success rate of data science projects (Cavanillas, Curry, & Wahlster, 2016), and several examples of analytics use gone wrong (Sluijs, 2002; Barocas & Selbst, 2016). Authors agree on the potentially beneficial effects of data science, yet only some offer guidance on how these effects can be brought about in practice. Such guidance is long overdue, particularly because small to medium-sized businesses risk falling behind. Their adoption of data science has been slow and progress is hampered by a lack of understanding. Typical questions asked
by Small to Medium sized Enterprises (SMEs) include: "What data do I have?", "How can I use my data to create value?", and "Where do I start?" (van der Veen, van der Born, Smetsers, & Bosma, 2017).
This paper contributes to the debate on the application gap. It reports on our experience with running data science
projects in practice. We build on 40 cases in which data science was applied to assist an SME in solving some business
challenge or question. Through analysis of the project outcomes, our notes, and two focus groups we identified several
challenges and difficulties which can prevent implementation of data science in practice. We discuss these challenges
and present the Data project canvas; a decision making tool which helps SMEs (1) to identify a problem that can be
solved through data science and (2) design a data science project which outcome they will adopt and use.
2 The JADS SME Datalab and data science projects
The Jheronimus Academy of Data Science (JADS) SME Datalab was founded in 2018 with the purpose of helping Small to
Medium Enterprises to become more data-literate. The Datalab connects postgraduate students to local SMEs to deliver
data projects. SMEs pay a modest fee to cover a stipend for the student and operating expenses for the Datalab. Each
project runs for six to ten weeks, with the students investing about 80 hours per project. The students are supervised by
experienced data scientists to safeguard analytic rigour, oversee communication to the client, and ensure timely delivery
of the project. SMEs go through one or two intake meetings, during which an experienced data scientist helps them to
identify and delineate a project which adds value to the business. This project is defined in a project proposal, which
lists the required data, deliverables, and goals of the client. The proposal is signed by the client and is used to evaluate
whether a project has been completed. Students fill out an online planning tool with information about the activities
and steps necessary to complete the project. They log their progress, which permits their supervisors to keep track of
the projects. The students can raise and discuss progress online, during face-to-face supervision meetings, or during
informal sessions in the SME Datalab.
Over the course of a year we completed a total of 44 projects with 30 businesses. The variety in terms of the
industrial sector of the companies was considerable. We worked with over-the-counter businesses such as a bakery
and a hairdresser, but also with sewer maintenance engineers and a lettuce-grower. In terms of the technical scope,
the projects were similarly diverse. The projects ranged from a customer segmentation for marketing purposes to the
design of a dimension reduction algorithm to benchmark the traffic-safety status of municipalities. The heterogeneity of
this sample provides a strong foundation for theorization (Patton, 2002), in our case to identify data science project challenges that occur irrespective of industry or technical scope. However, the businesses were not selected on the basis of some sampling scheme. The entrepreneurs we engaged with were of the enthusiastic sort. They were not a representative sample of the population. Rather, they would probably classify as "innovators" or "front-runners".
3 Common issues
We collected data in the form of project outcomes, field notes, notes from two focus groups, and emails from students or clients. We went over our material per project and for each project listed: the challenges we encountered, the way we solved those challenges, whether or not the agreed-upon deliverables were completed, and circumstances that were conducive to the completion of the project. We then collaboratively open-coded (Strauss & Corbin, 1998) this data and selected those challenges which we could have mitigated before the start of the project had we known about them. We then grouped similar challenges; the list below is the result of this process:
3.1 Infrastructure
Although both the name of the SME Datalab and that of the field of data science contain the word "data", we found that many entrepreneurs did not share our notion of what data is. The challenges contained within the infrastructure group
pertain to the availability of data, the quality of this data, and the ways in which this data can be accessed. We found
that it is important to ask the entrepreneurs to share a sample of their data early on. This forces them to try and export it
from whatever system it is in and permits the data scientist to evaluate the quality of the data.
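As an illustration of such an early data check, the sketch below summarises a client's sample with a few lines of Python. The sample data, column names, and values are made up for the example; this is not an actual client dataset:

```python
import pandas as pd
import numpy as np

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise, per column: dtype, share of missing values, and
    number of distinct values -- a quick first pass over a sample."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_frac": df.isna().mean(),
        "n_unique": df.nunique(),
    })

# A made-up sample such as a client might export from a bookkeeping system.
sample = pd.DataFrame({
    "invoice_id": [1001, 1002, 1003, 1004],
    "amount": [250.0, np.nan, 120.5, 80.0],
    "customer": ["A", "B", "B", None],
})

report = quality_report(sample)
print(report)
```

A report like this surfaces missing values and suspicious column types before the project proposal is drawn up, which helps to scope the project realistically.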
3.1.1 Data
Some entrepreneurs came to us with elevated expectations. One business was hoping to implement a machine learning
algorithm to develop a predictive maintenance system. This system would help them identify those machines that were
most likely to break down next. With this information, they could use their resources more effectively by developing better maintenance schedules. When asked about the available data, the client sent over a couple of Excel spreadsheets which did not contain the raw inputs. Rather, the spreadsheets contained aggregated report data about the machines. When
asked about the underlying data for these reports, they were not sure. We adjusted the project scope accordingly, defining a project which would explore and identify the available data sources.
3.1.2 Software
Even if the business has a large collection of data, this is no guarantee that the data can be accessed. In several projects, we found that the data was stored within proprietary software and could not be exported without assistance from the software's developers. In most projects we were able to liaise with the software developer and get a data export. However, software developers can be wary of allowing third parties access to their databases, as they perceive a risk that their software will be replaced. Although the data portability principle of the General Data Protection Regulation (GDPR) states that data should be transferable from one system to the next, some software developers try to retain clients by policing third-party access to the data.
3.1.3 Expertise
The businesses we worked with had very different Information Technology (IT) competence levels. Some businesses already had some experience with Business Intelligence applications such as dashboards and visualisations, whereas others had just digitized their financial administration. It is important to get an idea of the data maturity of the client before the project starts, as this is instrumental to designing a realistic project plan. In addition, if an organisation's data maturity is higher, a project can be more technically advanced. Organisations with a higher data maturity are typically more proficient in integrating the outcomes of a data science project within their existing infrastructure.
3.1.4 Partners
As mentioned, the use of proprietary software can be a challenge to the timely completion of a data science project. A similar challenge is introduced by businesses that work with one or more partners for their data management or data collection. The more external stakeholders are involved, the higher the complexity of the data science project. In projects where many stakeholders were involved, we lost much time to project management tasks.
3.2 Preconditions
Data science is by nature a quite technical field; the algorithms and models that can be developed as part of a data science project can be hard to comprehend for experts, let alone for those not trained in their use. The challenges contained in the preconditions group mostly pertain to the human and regulatory side of data science. It is important to keep in mind that data science projects should contribute to making someone's job easier.
3.2.1 Commitment
For some of the projects we completed, the business owner was looking to demonstrate the potential of data science to others in his or her organisation. In these projects, it was clear to the client that the project would not immediately effectuate a change in the way the organisation works. In other projects, the business owners were looking to implement a project of which they themselves were not the intended users. In such cases, we found it was paramount to involve the users at an early stage to ensure the deliverables aligned with their routines.
3.2.2 Culture
It is not useful for a business to engage in a data science project if the results do not align well with the current organizational culture or routines. In one of the projects we proposed to develop a model that would predict product demand and turnover for a bakery. In our initial proposal we aimed to implement the model in Python. After further discussions with the business owner, we found that he was used to - and comfortable with - Microsoft Excel only. As such, we had to adapt our approach to make it fit within his weekly routine of forecasting demand and making personnel schedules.
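To illustrate how modest such an adaptation can be, the sketch below computes a deliberately simple moving-average forecast whose result can be written back into the owner's spreadsheet. The sales figures are invented and the model is a stand-in for illustration, not the one used in the actual project:

```python
import pandas as pd

# Hypothetical weekly sales history for one product (units of bread sold).
history = pd.Series(
    [120, 135, 128, 140, 150, 145, 155, 160],
    index=pd.RangeIndex(1, 9, name="week"),
)

# A deliberately simple model: forecast next week's demand as the mean
# of the last four weeks, so the owner can audit the number by hand.
forecast = history.tail(4).mean()
print(f"Forecast for week 9: {forecast:.1f} units")

# Rather than running as a Python service, the result is pushed back
# into the tool the owner already uses, e.g. (filename is made up):
# history.to_frame("sales").assign(forecast=forecast).to_excel("bakkerij.xlsx")
```

Keeping the deliverable inside Excel meant the forecast slotted directly into the owner's existing weekly routine of planning demand and personnel.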
3.2.3 Regulations
The recent entry into force of the GDPR has put privacy and data ethics higher on the agenda. The business owners we spoke with were typically aware of the new legislation, but did not feel confident about their level of understanding. More generally, business owners seem to have limited knowledge about what is or is not allowed in relation to data collection. One business owner who was in the human resources business asked us if we could automate data collection from LinkedIn. We had to point out that retrieving that information was not allowed. In one project, the data we
needed was accessible only through interaction with a partner of our client business. In this particular case, we lost a lot
of time in negotiating access and setting up a system for remote access to the data.
3.2.4 Budget
The SMEs we work with often have a limited budget, but great ambitions. We try to identify a project that adds value or
helps them to reduce costs.
3.3 Expectations
3.3.1 Challenges
Often, businesses that approach us have no notion of where to start. We ask them to describe their business and invite them to tell us about the challenges they are currently facing.
3.3.2 Results
Some entrepreneurs struggled to incorporate the findings of the projects into their business. In one case we identified postal codes with demographics that matched those of the current customer base of the business. The entrepreneur was pleased that we had mapped his current client base. However, he was unsure what to do with the information on where his potential clients live.
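A minimal sketch of this kind of analysis could rank postal codes by how closely their demographics resemble the existing customer base. The postal codes, demographic variables, and numbers below are all hypothetical, and plain cosine similarity stands in for whatever matching method a project would actually use:

```python
import math

# Hypothetical demographic profiles per postal code:
# (share under 35, share homeowners, mean income index).
postcode_profiles = {
    "5211": (0.40, 0.30, 1.1),
    "5212": (0.25, 0.60, 1.4),
    "5213": (0.55, 0.20, 0.9),
}

# Made-up average profile of the business's current customers.
customer_profile = (0.50, 0.25, 1.0)

def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Rank postal codes by similarity to the existing customer base.
ranking = sorted(
    postcode_profiles,
    key=lambda pc: cosine(postcode_profiles[pc], customer_profile),
    reverse=True,
)
print(ranking)
```

Even with a ranking like this in hand, the example in the text shows that the harder step is translating it into action, such as deciding where to advertise.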
4 Canvas
The challenges outlined in the previous section formed the basis of the Data science project canvas. The canvas presented here is still under development and is by no means intended as an exhaustive list of all factors which contribute to successful implementation of data science in SMEs. Nonetheless, we believe that asking the right questions before the project starts can contribute to data science projects that are used and ultimately add value to the business.
5 Discussion
This paper identified twelve challenges that can impair the progress of a data science project and can ultimately result in
its failure. We described the twelve challenges and presented a decision support tool which allows businesses to design
data science projects that overcome the challenges.
We conclude that by addressing these challenges before a project starts, its chance of success will be higher. Future research could consider whether these challenges occur in other data science contexts as well. Several of the challenges we identified align with previous research on the application gap.
Acknowledgements
We would like to thank the businesses that worked with SME Datalab and the partners that helped us to connect with
the SME community. We are grateful to Matthijs Bookelmann, Arjan van den Born and Bas Bosma for their comments.
Appendix
Figure 1: The original Dutch version of the Data project canvas.
References
Barocas, S., & Selbst, A. (2016). Big data's disparate impact. California Law Review.
Bulger, M., Taylor, G., & Schroeder, R. (2014). Data-driven business models: Challenges and opportunities of big data. Report by the Oxford Internet Institute.
Cavanillas, J. M., Curry, E., & Wahlster, W. (2016). New horizons for a data-driven economy. Springer, New York.
Cukier, K., & Mayer-Schoenberger, V. (2013). The rise of big data: How it's changing the way we think about the world. Foreign Affairs, 92(3), 28-40.
Delden, H. v. (2011). A methodology for the design and development of integrated models for policy support. Environmental Modelling & Software, 26(3).
Diez, E., & Mcintosh, B. (2011). Organisational drivers for, constraints on and impacts of decision and information support tool use in desertification policy and management. Environmental Modelling & Software, 26(3).
Floridi, L., & Taddeo, M. (2016). What is data ethics? Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2083).
Happe, K., & Ballman, A. (2008). Doing policy in the lab! Options for the future use of model-based policy analysis for complex decision-making. Proceedings of the 107th EAAE Seminar.
Hey, T., Tansley, S., & Tolle, K. (2009). The fourth paradigm: Data-intensive scientific discovery. Microsoft Research.
Inman, D. (2011). Perceived effectiveness of environmental decision support systems in participatory planning: Evidence from small groups of end-users. Environmental Modelling & Software, 26(1).
Kolkman, D. A., Campo, P., Balke-Visser, T., & Gilbert, N. (2016). How to build models for government: Criteria driving model acceptance in policymaking. Policy Sciences, 49(4).
Patton, M. (2002). Qualitative evaluation and research methods. Sage.
Sluijs, J. v. d. (2002). A way out of the credibility crisis of models used in integrated assessment. Futures, 34(1).
Strauss, A., & Corbin, J. (1998). Basics of qualitative research: Techniques and procedures for developing grounded theory. Sage.
Syme, G., Bennet, D., Macpherson, D., & Thomas, J. (2011). Guidelines for policy modellers – 30 years on: New tricks or old dogs? Proceedings of the International Congress on Modelling and Simulation.
van Daalen, C., Dresen, L., & Janssen, M. A. (2002). The roles of computer models in the environmental policy life cycle. Environmental Science & Policy, 5(3).
van der Veen, M., van der Born, J., Smetsers, D., & Bosma, B. (2017). Ondernemen met (big) data door het mkb [Doing business with (big) data in SMEs].
Venturini, T., Jensen, P., & Latour, B. (2015). Fill in the gap: A new alliance for social and natural sciences. Journal of Artificial Societies and Social Simulation, 18(2).
No systematic variation was evident in the drivers, constraints or impacts of use according to DIST type, although GIS was the most widely used DIST and consequently was associated with a larger number of each. Results are discussed in relation to existing theory and evidence on information system and DSS adoption and use, and found to be in agreement. The paper finishes with a set of recommended improvements to DIST design processes to enhance the uptake of and positive benefits associated with use.