This is the author's version of the work. It is posted here for your personal use. Not for redistribution.
The definitive version is published in the proceedings of ICSE-SEIP '22: https://doi.org/10.1109/ICSE-SEIP55303.2022.9793934
Challenges in Applying Continuous Experimentation:
A Practitioners’ Perspective
Kevin Anderson
Delft University of Technology / Vista
Utrecht, Netherlands
k.s.anderson@tudelft.nl
Denise Visser
Bol.com
Utrecht, Netherlands
dvisser@bol.com
Jan-Willem Mannen
ING
Amsterdam, Netherlands
jan-willem.mannen@ing.com
Yuxiang Jiang
Delft University of Technology
Delft, Netherlands
y.jiang-12@student.tudelft.nl
Arie van Deursen
Delft University of Technology
Delft, Netherlands
arie.vandeursen@tudelft.nl
ABSTRACT
Background: Applying Continuous Experimentation on a large scale is not easily achieved. Although the evolution within large tech organisations is well understood, we still lack a good understanding of how to transition a company towards applying more experiments.
Objective: This study investigates how practitioners define, value and apply experimentation, the blockers they experience and what to do to solve these.
Method: We interviewed and surveyed over one hundred practitioners with regards to experimentation perspectives, from a large financial services and e-commerce organization, based in the Netherlands.
Results: Many practitioners have different perspectives on experimentation. The value is well understood. We have learned that the practitioners are blocked by a lack of priority, experience and well-functioning tooling. Challenges also arise around dependencies between teams and evaluating experiments with the correct metrics.
Conclusions: Organisation leaders need to start asking for experiment results and investing in infrastructure and processes to actually enable teams to execute experiments and show the value of their work in terms of value for customers and business.
CCS CONCEPTS
• General and reference → Surveys and overviews.
KEYWORDS
Continuous experimentation, Online controlled experiments, A/B
testing, Empirical software engineering, ING, bol.com
Work completed while at ING.
Work completed during an internship at ING.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
ICSE-SEIP ’22, May 21–29, 2022, Pittsburgh, PA, USA
©2022 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-9226-6/22/05.
https://doi.org/10.1145/3510457.3513052
ACM Reference Format:
Kevin Anderson, Denise Visser, Jan-Willem Mannen, Yuxiang Jiang, and Arie van Deursen. 2022. Challenges in Applying Continuous Experimentation: A Practitioners' Perspective. In 44th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP '22), May 21–29, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3510457.3513052
1 INTRODUCTION
Organizations like Microsoft, Google, Facebook and Booking.com use online controlled experiments (A/B tests) to assess the impact of changes made to software products and services [6]. The evolution towards doing continuous experimentation is well understood and documented within the context of these large tech companies [2-4]. Many of these tech companies invest in their own experimentation platform, because it is perceived as critical to their business [7].
This study is performed in the context of a large financial services (ING) and an e-commerce (bol.com) organization. Although both companies are active in different industries, they do share some other characteristics. Both organizations have a strong presence in the Netherlands with well-known brands [9]. Both companies have also invested in the creation of in-house built experimentation infrastructure which has been tightly integrated into core systems.
In this study we look beyond the technical software engineering
challenges around continuous experimentation. Cultural factors
are equally important in successful adoption [6].
We expect to add to the increasing academic literature on this topic from the perspective of organisations where continuous experimentation is not yet the standard, and especially from the perspective of practitioners who are (or could be) executing experiments.
The remainder of this paper is organized as follows. In Section 2
related work is described. In Section 3 we outline the study design.
The results of the case study are described in Section 4. We discuss
the results in Section 5 and the threats to validity in Section 6, after
which we conclude our paper in Section 7.
2 BACKGROUND AND RELATED WORK
In a mapping study by Ros and Runeson it was concluded that four types of challenges around experimentation already received research attention: technical, statistical, organizational/management, and business challenges [16].
Gupta et al. further identified challenges from 13 organizations that already apply experimentation on a large scale. Even these organizations have challenges around creating and maintaining a culture of experimentation [6].
Lindgren and Münch researched the state of experimentation in 10 software organizations in Finland [15]. Their results show that the state of experimentation is not yet mature, and the researchers concluded that the challenges in moving towards continuous experimentation are not on the technical side, but on the organizational level: culture, slow development speed, and difficulties in measuring customer value correctly.
It is important for product and service developers to continuously learn what customers want [5]. Applying continuous experimentation is a way to do that. Yaman et al. defined continuous experimentation as an experiment-driven development approach where critical assumptions are tested iteratively [19].
The studies performed by Yaman within the context of four software companies from the Nordics show that the transition to continuous experimentation is a learning process where it is important to understand the perspective of practitioners [18]. The human factor plays an important part in adopting experimentation practices, and she concluded that individual people have different perspectives on what experiments are.
Fabijan et al. provided actionable steps to grow and keep a culture of experimentation and A/B testing [2]. They presented the concept of a flywheel where initial value from experimentation leads to more investments, which will lead to more value. They pointed out that the most difficult part is actually getting the first traction of this flywheel effect.
Kohavi provided various guidelines on how to deploy experimentation in industry [11, 12]. While his work provided important insights on how to conduct better experiments, it offered fewer organizational guidelines to support organizations that are on the path towards becoming more data and experiment driven.
3 RESEARCH METHOD
The goal of this study is to understand the specific challenges around experimentation in two organizations from different industries, both active in the Netherlands. Both organizations were early adopters of agile methodologies and have taken similar approaches to enabling experimentation with the support and development of in-house built experimentation tooling. At the time of this study the first, second and third authors were all directly involved in developments around the respective experimentation tools at ING and bol.com. Therefore they have a good understanding of the state of experimentation within the organizations.
We formulated four research questions to set more fine-grained directions for our study and to limit its scope.
RQ1: How do practitioners define and apply experimentation?
RQ2: What is the perceived value of experimentation?
RQ3: What are blockers in doing more experiments?
RQ4: What should an organization do to solve these blockers?
The data in this study is collected in two ways: via interviews and a survey. The detailed research design is described in the following subsections.
3.1 Recruitment and Participants
We recruited participants (n=34) based on personal knowledge of the organization and by asking early participants for references. We defined the following requirements for our participants to ensure a solid perspective of the challenges at hand:
- Mix of people in technical and business related roles
- Mix of people in individual contributor and leadership roles
- Significant professional experience (a minimum of five years)
Only two people declined the invitation to be interviewed, one due to time constraints and another because the person did not feel qualified to talk about the topic. We expect that the personal connection with the interviewers and the ease of doing a short interview over video call led to the high acceptance rate. See Table 1 for a list of all participants, their role and experience, and the duration of the interview in minutes.
3.2 Interview Procedure and Analysis
The semi-structured interviews were based on a set list of questions developed from experience and discussions between the first and third authors. See the appendix for an overview of the questions. During the interviews there was room for follow-up questions based on the responses. All interviews (n=34) were conducted via video conferencing and most of them were also recorded (when permission was given). During the interviews short notes were taken, and afterwards the recorded interviews were transcribed for further analysis. All interviews at ING were conducted in January and February 2021, alternately by the first and third authors. The interviews at bol.com were performed, by the second author, in July and August of the same year.
After performing all (26) interviews at ING, the first and third authors analysed the results and formulated a first set of general themes. After the (8) interviews at bol.com, the outcome was discussed by the first and second authors, and this led to a further refinement of the general themes. The combined results are presented in Section 4.
3.3 Survey Procedure and Analysis
The results from the qualitative interviews have been validated via a survey, which was conducted at ING. A total of 868 persons from 4 different departments (tribes) were selected for participation in this survey. The departments were selected based on the scope of their work: digital transformation and improving digital sales and service. The current adoption of experimentation, measured by the number of online controlled experiments, is also the highest in two of these departments. The other two are in the middle and bottom tiers of departments executing experiments.
The first and fourth authors jointly set up the questionnaire (see the appendix), which was pilot tested with the third author. After considering all feedback and implementing minor refinements, the questionnaire was distributed via email by the first author. The survey was held between September 28 and October 6. 73 people completed the survey, leading to a response rate of 8.4%. Although this response rate is lower than external benchmarks [1], it is comparable to response rates from other surveys within ING. More importantly, the survey participants are a good representation across ING's departments and job roles.
Table 1: Profile of interview participants

ID   Org.     Role/Function               Exp. (yrs)  Duration (min.)
P1   ING      Lead Customer Experience    12          40
P2   ING      Manager Innovation          12          35
P3   ING      Manager Optimisation        24          41
P4   ING      Manager Data Science        13          27
P5   ING      Product Owner               16          45
P6   ING      Online Marketer             15          31
P7   ING      Online Marketer             10          22
P8   ING      Online Marketer             21          21
P9   ING      Product Owner               10          46
P10  ING      Product Owner               19          42
P11  ING      Innovation Consultant       18          56
P12  ING      Lead Customer Experience    24          23
P13  ING      Online Marketer             8           24
P14  ING      Innovation Consultant       25          46
P15  ING      Director Retail             25          23
P16  ING      Product Owner               14          38
P17  ING      Manager Social Media        6           22
P18  ING      Manager UX                  17          36
P19  ING      Director Digital            23          29
P20  ING      Online Marketer             13          47
P21  ING      Sr. Manager Digital         21          23
P22  ING      Product Owner               5           50
P23  ING      Full Stack Engineer         12          35
P24  ING      Sr. iOS Developer           15          39
P25  ING      Full Stack Developer        17          50
P26  ING      Android Developer           15          31
P27  bol.com  Product Manager             15          37
P28  bol.com  Product Manager             10          47
P29  bol.com  Director IT                 30          45
P30  bol.com  Manager Customer Service    18          36
P31  bol.com  Test Engineer               22          24
P32  bol.com  Director Business Models    30          25
P33  bol.com  Team Lead Analytics         15          38
P34  bol.com  Software Engineer           5           45
4 RESULTS
This section presents our findings from the interviews and survey.

4.1 Definition and application of experimentation (RQ1)

4.1.1 Definition of experimentation. Within ING an experiment is defined as a test observing how customers react in order to validate (or invalidate) business assumptions [8]. The interviewees have a broad range of definitions of experimentation, ranging from "simply trying something" (P2) and "making a change and see what the effect is" (P5) to "running a randomized controlled trial, a pilot is not an experiment to me" (P16). The latter is a much stricter definition.
Table 2 shows the results from the survey participants on the multiple-choice question (see Q3 in the Appendix). The standard definition within ING is also the concept most mentioned by the survey participants (88%). To our surprise, 'asking colleagues what they think about an idea' is also seen as part of experimentation by 25 (34%) participants. From the survey we learned that over 60% of respondents already took some form of experimentation training (called 'PACE Academy'). This percentage is roughly the same (68%) in the group that finds 'asking colleagues' part of experimentation. This might be due to time passed since attending training or conflicting perceptions.
For now, we can conclude that experimentation means, next to the default definition, many other things to different people.

Table 2: Definition & application of experimentation at ING

Experimentation concept                                              Partic. (#, %)
Observing customers' reaction to (in)validate assumptions            64 (88%)
Quickly test something before committing to building it completely   58 (79%)
A/B testing                                                          58 (79%)
Trying out something new                                             56 (77%)
Learning what works                                                  56 (77%)
Developing a hypothesis                                              45 (62%)
Interviewing the target audience                                     38 (52%)
Incremental improvements                                             37 (51%)
Working with the PACE canvases                                       28 (38%)
Asking colleagues what they think about an idea                      25 (34%)
Changing a webpage on ING.be/nl                                      15 (21%)
De-risking a project                                                 13 (18%)
4.1.2 Application of experimentation. All participants from the interviews mention concrete examples of experiments that they or their team have executed. These range from A/B tests and pilot groups to deploying data science models with control groups in place. From the broader survey we learn that many respondents (36%) have not executed an experiment in the last 6 months, while 33% have executed an experiment in the last month or sprint. See Table 3.
4.2 Perceived value of experimentation (RQ2)

From the interviews we captured four distinct categories where experimentation brings value to the participant: focus on value, risk mitigation, team alignment and intrinsic benefits on a personal level.
4.2.1 Focus on value. Colleagues are taught that experiments help to "reduce the uncertainty in our backlog, ensuring our scarce resources are only used for the things that really matter for our customers" [8]. Knowing what kind of value you deliver to customers is something that is often mentioned by participants. Participant P4 said it like this: "... knowing that you are bringing value to the customer and knowing what that value is". And participant P33 described it like this: "Understand which levers you can pull to add value for the customer and the company". Other participants stressed the value of validating assumptions and moving away from opinions: "Nobody has a monopoly on wisdom, we need to validate the
assumptions we have. That means testing different versions and learn from that. It's a continuous process of validating assumptions", said participant P22. Similar remarks are made by participant P19: "I believe in continuous improvement. We need to move away from thinking in opinions and thinking more in facts. Everything is constantly changing, so we have to get this way of thinking in the genes". Participant P21 immediately shared a clear example of where the application of experimentation led to significant results: "Basically through continuously doing experiments and monitoring we have been able to improve process X from 45% in 2019 to 70% now". The focus on delivered value that experimentation brings is clearly a big benefit perceived by the participants.
4.2.2 Risk mitigation. Another reason to use experimentation is to de-risk a project. As participant P12 says: "I see experimentation as a way to make sure you validate the riskiest assumptions with customers in an early stage of any project". This can prevent unexpected behavior in later stages or create the chance to mitigate the identified risk. To our surprise, the term 'de-risking a project' resonated the least with the survey participants (see Table 2). Only 13 (18%) think about this when they are asked to define experimentation.
4.2.3 Team alignment. Experiments can give objective information about the impact and direction of certain effects. Some participants emphasize the value of having this objective data in the context of discussions within and between teams. Participant P5 said: "we simply have better discussions". And participant P34 stresses that experiments give "clear answers to specific questions and discussions we are having with other teams". This can enormously reduce the number of opinion-based discussions people have within and between teams.
4.2.4 Intrinsic benefits. The fourth benefit that came out of our round of interviews is more on a personal level. Some participants focus on the intrinsic value that doing experimentation brings to them. As participant P5 says: "I simply enjoy to see if something works as intended. And my life as Product Owner becomes easier, because everyone has an opinion, but with data from experiments we simply have better discussions". And according to participant P6: "Experiments give me certainty, certainty about what is happening. I am creative, open minded and I like facts. It gives me strength when I know what the effect is. I don't like to bluff, I like to see the facts". Curiosity is also a term that was mentioned often: "I'm curious to find out if people understand what we came up with", said participant P18. Finally, participant P32 didn't see any intrinsic benefits: "the benefits are for the company, not for me personally".
4.3 Experimentation blockers (RQ3)

With so many strong and clear benefits, the next question focused on the blockers of doing large scale experimentation.

4.3.1 No priority. The most common reason that was mentioned was simply not having enough priority to create experiments. Participant P1 summarizes it well: "Focus is often on delivery, experiments or other ways of measuring impact does not get sufficient priority". And participant P30 pinpoints that: "Often, the solution to a problem has already been devised. Then it's about prioritizing solutions, instead of problems to work on". Next to focus on delivery and prioritizing solutions, sometimes priorities are temporarily shifted. Participant P25 explains: "Focus for our teams is to maintain the feature until an important migration is done. There is no room for improvements now".

Table 3: Have you or your squad executed any experiment during the last 6 months?

Frequency                        Partic. (#, %)
No experiment in last 6 months   26 (36%)
Yes, once per month              17 (23%)
Yes, once per quarter            14 (19%)
Yes, once in the past 6 months   9 (12%)
Yes, at least once per sprint    7 (10%)
4.3.2 Dependencies. Almost all participants mentioned having too many dependencies as a blocker in executing more experiments. Participant P10 explained: "For many experiments I am dependent on a Data Analyst, and it takes them a lot of time. So it feels I need to bother someone else with it". And participant P6 says: "I am dependent on IT development resources in other teams to make changes and launch experiments". Next to the mentioned dependencies on Data Analysts and Software Engineers, other participants mentioned dependencies on Legal and User Experience Experts.
4.3.3 No experience. Of the survey participants, 64% (n=47) say they have received training around experimentation, and 77% indicated that they have discussed experimentation with their team. Yet almost half of the survey respondents indicate that they (or their squad) have executed no (36%) or just one (12%) experiment in the last 6 months. More teams are discussing than doing experimentation. The lack of experience is not helping in increasing the experiment velocity.
4.3.4 No or hard to use tooling (functionality). Participants point to missing or broken functionality in the current experimentation tooling. Some functionality had been present, but is currently not working properly. Participant P13 even points out that "At the moment we don't have access to an A/B testing tool, so we are doing one version at the time: first two weeks version A and then two weeks version B. That is of course not a pure way of testing". This participant clearly has the will to experiment, but is not supported with the correct tooling.
4.3.5 Issues with metrics. Another issue that participants mention is that they experience problems with evaluating their experiments in the correct way. Some types of metrics are not available, which forces people to use less relevant metrics for the evaluation of experiments. Some teams have trouble coming up with proper measurements at all: "it is often hard to formulate clear measurable goals. This makes it difficult to choose a meaningful metric for evaluation of experiments", says participant P16. Another participant (P20) says: "I have continuously doubts about the metrics I'm seeing and I need to do the troubleshooting myself".
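As background for readers unfamiliar with experiment evaluation, the sketch below shows one common way a conversion metric can be evaluated, a two-proportion z-test. This is a generic illustration with invented counts, not the evaluation procedure of either organization in this study.

```python
# Illustrative only: evaluating a conversion metric with a two-proportion
# z-test. The counts are invented; real platforms add many more checks.
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)               # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal tail
    return z, p_value

z, p = two_proportion_z(conv_a=480, n_a=10_000, conv_b=540, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.3f}")  # compare against a pre-registered alpha
```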
4.4 Improvements (RQ4)

In this section we answer the question: how can we improve the situation around the blockers identified in Section 4.3? This is based on the answers from the study participants, but also on our own understanding of the challenges.

4.4.1 Leadership priority. The remarks around a lack of priority are in stark contrast with the identified value of experimentation. It seems that everyone during our conversations was convinced that the organization should be doing large scale experimentation, but somehow they are not doing it (enough).
Participant P28 says it like this: "It is not expected of us to run experiments. We manage internal expectations based on the 'WHAT', and less on the contribution to KPIs".
This suggests that leaders should start challenging teams to evaluate developed features based on quantifiable results, instead of only on timely delivery. One Senior Manager (participant P21) acknowledged that "We need to accept the trade-offs. In the beginning we will probably go slower, but experimentation will eventually make us go faster". This suggests that currently the teams that have to deal with many dependencies are already satisfied when they have shipped a feature to production.
Setting priorities could also help in forming sufficiently self-supported teams. This challenge is about speed of development. The organization needs to have as many independent and empowered teams as possible that are sufficiently staffed to live up to their purpose. As participant P19 indicates: "Pressure on [feature] delivery is high. This limits us to experiment. Also some teams have challenges with getting the output we seek. We are looking into ways to optimise this. This will also make room for more validation".
This blocker is not only an issue for experimentation itself; it hinders the development of products and services in general. In this setting experimentation is perceived as adding an extra layer of complexity to the product development process. Therefore it is extra important to make running an experiment as easy as possible.
4.4.2 Integrated tool. The second challenge organizations should invest in is making the tooling to execute an experiment as easy as possible. As participant P19 notes: "People find it difficult to quickly validate hypotheses. Eventually people just start building. Our tooling is now not really plug-and-play, we lack easy to use tooling". A seamless integration of experimentation tooling in the software development lifecycle can lower the cost of setting up an experiment [2]. This continuous drive to lower the cost of an experiment needs a dedicated team to (in the first place) enable it and, after that, keep it up and running and continuously improve the system. Participant P1 said it like this: "All the friction you can take away, you should. People are a bit lazy by nature, you have to make it as easy as possible".
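To illustrate what such low-friction, integrated tooling could look like from a developer's point of view, here is a hypothetical sketch reusing the assign_variant function from the introduction. ExperimentClient and its interface are invented for exposition and do not reflect the actual ING or bol.com tooling.

```python
# Hypothetical sketch of "plug-and-play" experimentation in application code;
# ExperimentClient and its interface are invented, not the actual tooling.
class ExperimentClient:
    def __init__(self, assign, log_exposure):
        self._assign = assign              # e.g. the hash-based assigner above
        self._log_exposure = log_exposure  # feeds exposures to the metrics pipeline

    def variant(self, user_id: str, experiment_id: str) -> str:
        """Decide the code path and record the exposure in one call."""
        v = self._assign(user_id, experiment_id)
        self._log_exposure({"user": user_id, "exp": experiment_id, "variant": v})
        return v

# Usage: a single call keeps experiment plumbing out of the business logic.
client = ExperimentClient(assign=assign_variant, log_exposure=print)
if client.variant("user-42", "new-checkout-flow") == "B":
    pass  # render the new checkout flow
else:
    pass  # render the control
```

The point of the sketch is the shape of the API: one call both assigns the variant and logs the exposure, so teams cannot forget the instrumentation that the later analysis depends on.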
4.4.3 Trustworthy results. Make sure quality controls are in place and that issues around measuring are solved in a timely fashion. This also needs a dedicated team that continuously monitors the health of the experimentation tool, to prevent everyone from having to solve issues by themselves, as participant P20 indicated: "I have continuously doubts about the metrics I'm seeing and I need to do the troubleshooting myself".
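One quality control that is widely reported in the experimentation literature is a sample-ratio-mismatch (SRM) check, which flags experiments whose observed traffic split deviates from the configured split, a frequent symptom of broken assignment or telemetry. The sketch below is a generic illustration (it assumes scipy is available), not the monitoring used by the organizations in this study.

```python
# Generic SRM check (illustrative; assumes scipy): flag an experiment when the
# observed variant counts are unlikely under the configured traffic split.
from scipy.stats import chisquare

def srm_detected(observed, expected_ratio, alpha=0.001):
    total = sum(observed)
    expected = [r * total for r in expected_ratio]
    _, p = chisquare(observed, f_exp=expected)  # chi-square goodness of fit
    return p < alpha  # True: do not trust this experiment's results

# A 50,400 / 49,600 split over 100,000 users under a 50/50 design is plausible.
print(srm_detected(observed=(50_400, 49_600), expected_ratio=(0.5, 0.5)))
```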
4.4.4 Education. There needs to be training on how to do experimentation and on coming to a common understanding of what it exactly is. As the research shows, running an experiment means several things to people. We expect that a stricter definition will make it easier to gain an understanding of what it actually means to do experimentation. Participant P2 stresses the importance of accessible learning opportunities: "All PACE modules are now freely available. This training was very expensive, now anyone can join for free whenever they like via our online portal". Participant P2 also says that training people is a good start: "It's a combination of know-how, mindset and support from the top. We can start by investing in more training". Participant P11 does point out that training alone might not be enough: "Some people will still need assistance in setting up their first experiment".
4.4.5 Share learnings. By sharing the learnings from experiments across the organisation, we show that gut feeling is often not correct. This can act as a new trigger to start experimenting. Next to that, continuously showing the value from experiments will ensure future investments in the experimentation program, as also shown by Fabijan et al. [2]. As participant P13 says: "We should organise more sessions where we can share experiences around experiments, what works, what does not work. And what can other teams test as well". And participant P22 says: "We need a sharing platform to store the learnings from past experiments. This can act as inspiration for others".
5 DISCUSSION

The outcomes of our interviews and survey as presented in this paper highlight how practitioners define and apply experimentation, and what challenges they face in applying experimentation more often in their daily work. From this, we suggest several lines of action, divided into recommendations for practitioners (companies and vendors) and recommendations specific to researchers and educators.
5.1 Recommendation to Practitioners

First, practitioners should make sure that everyone within an organization is aware of what experimentation is, and what it is not. For example, asking colleagues what they think about an idea might provide valuable information, but it lacks the core concept of simply observing behavior, without explicitly asking about thoughts and feelings. The many different meanings that experimentation currently seems to have for different people make it harder to steer the desired behavior around experimentation. We also recommend including the concept of risk mitigation in applying experimentation. These topics should all be addressed in training.
Second, the transition towards continuous experimentation starts with company senior leaders asking for experiment results. As long as projects are mostly being steered based on delivery timelines [14], there is less incentive for teams to start or increase the rate of experimentation. We recommend that leaders start measuring the impact of teams in terms of outcome metrics, not purely on output. This includes challenging teams on how they have come to decisions by asking 'How do you know?' and 'How do we know we will be right?'.
Third, organizations cannot expect large and sustainable impact from experimentation when their efforts around training and infrastructure are not organised from one dedicated team or centre of excellence. This relates to the blocking issues around priority (Section 4.3.1). Organizations that take experimentation seriously should invest in this important area. Good examples here are Booking.com [10] and LinkedIn [17]: both companies have multiple dedicated teams in place to build infrastructure, train people and facilitate large scale experimentation.
Fourth, the barriers identified around dependencies (Section 4.3.2) pose impediments to the successful adoption of experimentation. These blocking issues go beyond experimentation. For example, if teams are too dependent on others, this may be an issue to address first, before asking teams to start to execute (more) experiments. Resource allocation and team structure go beyond the application of experimentation. For example, research by Kula et al. indicates that managing or reducing task dependencies is a key factor contributing to on-time delivery in large-scale agile development [13]. The blockers identified in this study stress the importance of this topic for leaders to successfully deal with.
5.2 Recommendation to Researchers and Educators

We advise researchers to team up with organizations on the path towards continuous experimentation. It can lead to a better understanding of the sometimes messy practice many teams operate in. More research is needed to better understand what interventions are most effective in getting teams to adopt continuous experimentation.
We also encourage cross-industry collaboration between researchers and companies from different industries. This way we can better learn which challenges are organization specific and which are common across industries.
Lastly, academic educators might want to consider making experimentation a first-class citizen in undergraduate software engineering courses. This calls for lab work in which feature delivery is not just based on time taken and code quality, but on actually conducted experiments demonstrating that feature A leads to a better end user experience than feature B. This tight connection between coding and end user satisfaction may be hard to achieve as it requires the involvement of end users in an academic course, but our findings suggest that such an experience would prepare students well for their next position in industry.
6 THREATS TO VALIDITY

In this section we discuss the possible threats to validity and the actions we took to mitigate them.

6.1 Threats to Internal Validity

Possible threats to internal validity mostly come from the interview process we used and the analysis of the semi-structured interviews performed. We mitigated this with sufficient test rounds within the author group, but also with the participants. For example, the first four interviews were jointly attended by the first and third authors. Both authors took turns in interviewing and observing. Afterwards they jointly reflected on the process. This way we ensured a similar approach in interviewing.
We tried to recruit the right people and not only people with a positive outlook on experimentation. To avoid selection bias, the term 'experimentation' was not mentioned in the survey invitation email. However, since it was framed as "delivering differentiating customer experiences", this could also have attracted specific colleagues. This might have influenced the number of respondents. As mentioned earlier, the participants are a good representation of departments and job roles.
6.2 Threats to External Validity

Naturally, our results come from a limited set of organizations in a specific region. When we expand to companies of different sizes, from different application domains, or from other regions, we will likely find additional challenges around adopting continuous experimentation, as well as new challenges that the people in the two organisations in our study have not yet been facing. Our results serve as a starting point to conduct such further studies.
7 CONCLUSIONS

Applying continuous experimentation on a large scale is not easily achieved. This paper describes the perspectives and challenges in a large financial services and an e-commerce organisation, after interviewing 34 practitioners from these organisations. Next to that, we surveyed 73 practitioners to corroborate the results from the interviews.
Our results point out that many practitioners have different perspectives on what experimentation exactly is. The potential value is understood: focus on customer value, aligning teams, next to some intrinsic benefits. Although risk mitigation was often mentioned during the interviews as a potential benefit, only 18% (13) of the survey participants agreed with this. Unfortunately, experimentation is not being applied often. Our findings indicate that practitioners are blocked by a lack of priority, experience and/or well-functioning and easy-to-use tooling, and by too many dependencies between teams as well as the choice of metrics in evaluating experiments.
The study emphasizes the need for a common understanding of what experimentation is. Leadership needs to start asking about experiment results. This requires investing in infrastructure and processes to actually enable teams to execute experiments that show the value of their work in terms of value for customers and businesses.
We sincerely hope that this paper can spark more research into organizations on their path to applying continuous experimentation on a large scale. We believe this study can be an important next step in helping organisations on this journey.
ACKNOWLEDGMENTS

The authors would like to thank all participants for their willing contributions to this project. We also wish to thank the three reviewers and Lukas Vermeer for their constructive feedback. This paper was partially funded by ICAI AI for Fintech Research.
REFERENCES

[1] Yehuda Baruch and Brooks C. Holtom. 2008. Survey response rate levels and trends in organizational research. Human Relations 61, 8 (2008), 1139–1160. Sage Publications, London, England.
[2] Aleksander Fabijan, Benjamin Arai, Pavel Dmitriev, and Lukas Vermeer. 2021. It takes a Flywheel to Fly: Kickstarting and Growing the A/B testing Momentum at Scale.
[3] Aleksander Fabijan, Pavel Dmitriev, Helena Holmström Olsson, and Jan Bosch. 2017. The evolution of continuous experimentation in software product development: from data to a data-driven organization at scale. In Proceedings of the 39th International Conference on Software Engineering. IEEE Press, 770–780.
[4] Fabian Fagerholm, Alejandro Sanchez Guinea, Hanna Mäenpää, and Jürgen Münch. 2014. Building blocks for continuous experimentation. In Proceedings of the 1st International Workshop on Rapid Continuous Software Engineering (RCoSE 2014). Association for Computing Machinery, Hyderabad, India, 26–35. https://doi.org/10.1145/2593812.2593816
[5] Fabian Fagerholm, Alejandro Sanchez Guinea, Hanna Mäenpää, and Jürgen Münch. 2017. The RIGHT model for continuous experimentation. Journal of Systems and Software 123 (2017), 292–305. Elsevier.
[6] Somit Gupta, Ronny Kohavi, Diane Tang, Ya Xu, Reid Andersen, Eytan Bakshy, Niall Cardin, Sumita Chandran, Nanyu Chen, Dominic Coey, and others. 2019. Top challenges from the first practical online controlled experiments summit. ACM SIGKDD Explorations Newsletter 21, 1 (2019), 20–35. ACM, New York, NY, USA.
[7] Somit Gupta, Lucy Ulanova, Sumit Bhardwaj, Pavel Dmitriev, Paul Raff, and Aleksander Fabijan. 2018. The Anatomy of a Large-Scale Experimentation Platform. In 2018 IEEE International Conference on Software Architecture (ICSA). IEEE, 1–109.
[8] ING. 2020. Experiment Design Playbook. Internal document.
[9] Kantar. 2020. BrandZ Top 30 Most Valuable Dutch Brands 2021. Technical Report. Kantar. https://www.rankingthebrands.com/PDF/Brandz%20Most%20Valuable%20Netherlands%20Brands%202021%20top%2030.pdf
[10] Raphael Lopez Kaufman, Jegar Pitchforth, and Lukas Vermeer. 2017. Democratizing online controlled experiments at Booking.com. arXiv preprint arXiv:1710.08217 (2017).
[11] Ron Kohavi, Alex Deng, Roger Longbotham, and Ya Xu. 2014. Seven rules of thumb for web site experimenters. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14). Association for Computing Machinery, New York, NY, USA, 1857–1866. https://doi.org/10.1145/2623330.2623341
[12] Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M. Henne. 2009. Controlled experiments on the web: survey and practical guide. Data Mining and Knowledge Discovery 18, 1 (Feb. 2009), 140–181. https://doi.org/10.1007/s10618-008-0114-1
[13] Elvan Kula, Eric Greuter, Arie van Deursen, and Georgios Gousios. 2021. Factors Affecting On-Time Delivery in Large-Scale Agile Software Development. IEEE Transactions on Software Engineering (2021), 1–1. https://doi.org/10.1109/TSE.2021.3101192
[14] Elvan Kula, Ayushi Rastogi, Hennie Huijgens, Arie van Deursen, and Georgios Gousios. 2019. Releasing fast and slow: an exploratory case study at ING. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, Tallinn, Estonia, 785–795. https://doi.org/10.1145/3338906.3338978
[15] Eveliina Lindgren and Jürgen Münch. 2016. Raising the odds of success: the current state of experimentation in product development. Information and Software Technology 77 (Sept. 2016), 80–91. https://doi.org/10.1016/j.infsof.2016.04.008
[16] Rasmus Ros and Per Runeson. 2018. Continuous experimentation and A/B testing: A mapping study. In 2018 IEEE/ACM 4th International Workshop on Rapid Continuous Software Engineering (RCoSE). IEEE, 35–41.
[17] Diane Tang, Ashish Agarwal, Deirdre O'Brien, and Mike Meyer. 2010. Overlapping experiment infrastructure: More, better, faster experimentation. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 17–26.
[18] Sezin Gizem Yaman. 2019. Initiating the Transition towards Continuous Experimentation: Empirical Studies with Software Development Teams and Practitioners. Doctoral dissertation, Helsingin yliopisto. ISBN: 9789515155436. https://helda.helsinki.fi/handle/10138/305855
[19] Sezin Gizem Yaman, Myriam Munezero, Jürgen Münch, Fabian Fagerholm, Ossi Syd, Mika Aaltola, Christina Palmu, and Tomi Männistö. 2017. Introducing continuous experimentation in large software-intensive product and service organisations. Journal of Systems and Software 133 (2017), 195–211. Elsevier.
Interview questions

Q1. Questions on personal level
1. Who are you? And what is your role at the company? Can you briefly explain what you do?
2. When we talk about experimentation, what is it according to you? How would you define it?
3. What is your own experience with doing experiments?
4. How and when do you apply it in your daily work?
5. What is the value/advantages of experimentation to you?
6. What is blocking you from doing more experiments?

Q2. Questions on organizational level
1. What is the experience with doing experiments in your department?
2. How and when do people apply it in their daily work?
3. What is the value/advantages of experimentation to the people in your department?
4. What is blocking people from doing more experiments?
5. How do we improve on experimentation?
6. Who else should we interview for this project?
Survey questions

Q3. What is experimentation according to you?
1. Quickly test something before committing to building it completely
2. Observing customers' reaction to (in)validate assumptions
3. Asking colleagues what they think about an idea
4. Changing a webpage on ING.be/nl
5. Interviewing the target audience
6. Working with the PACE canvases
7. Incremental improvements
8. Trying out something new
9. Developing a hypothesis
10. Learning what works
11. De-risking a project
12. A/B testing
13. Other
Q4. Have you or your squad discussed experimentation during the last 6 months? [Yes/No]
Q5. Have you taken any training for experimentation? [Yes/No]
Q6. Have you or your squad executed any experiment during the last 6 months?
1. Yes, at least once per sprint
2. Yes, once per month
3. Yes, once per quarter
4. Yes, once in the past 6 months
5. No
Q7. To what extent do you agree with the following statements? [Strongly disagree, Disagree, Neither agree nor disagree, Agree, Strongly agree, N/A]
1. The current state of experimentation at my squad/tribe needs to be improved.
2. Experimentation means more unnecessary workload for my job. Experimentation takes up too much time, while decisions need to be made quickly.
3. The agile way of working asks us to move fast, which leaves us no time for experimentation.
4. I'm confident enough to perform experiments.
5. Failure and invalidation of my ideas/assumptions are not encouraged by my KPIs or my performance appraisal.
6. My leadership does not like it when I fail or my ideas/assumptions get invalidated.
7. I feel very frustrated when my idea or assumption is proved to be wrong.
8. I feel the need to always be right at my job.
9. I have the skills and knowledge to execute an experiment.
10. I know where to find support or tooling to execute an experiment.
11. I (and/or my squad) have been given enough time, budget, and priority to conduct experiments.
12. We as a squad talked about experimentation with other squads/tribes we collaborated with.
13. Migration projects and requests from other squads/tribes acted as an obstacle for us to make changes to our way of working.
14. Experimentation is someone else's job in my squad.
15. My role is not involved in experimentation.
16. I'm not sure where or how to start with experimentation.
17. My squad is a DevOps squad that takes and executes external requests, there is no room for us to experiment.
18. I find the tooling available within ING for experimentation difficult to use. There are other more important objectives at ING like cost saving or fast delivery. Therefore experimentation has to take a back seat.
19. I only have internal customers, so experimentation is not for me.
20. The experiment (loop) is too big to fit into our daily work.
21. My scope of work does not allow me to conduct experiments. Experimentation is only for big "innovation" projects that happen.
22. I don't see how experimentation can help me do my job better.
23. I find it difficult to connect experimentation with processes at ING.
24. I tried to convince my squad/tribe leadership in order to execute more experiments.
25. My Tribe leadership frequently asked for evidence for decisions concerning products and marketing.
26. I know what customers want based on my working experience.
27. Launching a new feature/releasing a new campaign is more appreciated than optimizing existing ones within ING.
28. I know the vision of my squad and/or tribe for experimentation.
29. My initiative of increasing experimentation is supported by my squad and tribe.
Q8. What are the barriers for you (your squad) in conducting experimentation? [open question]
Q9. Do you have any other thoughts or remarks? [open question]
Q10. Which Tribe are you part of?
Q11. Are you based in Belgium or the Netherlands?
Q12. What is your role?