Citation: Ribeiro, J.; Roque, L. Playful Probing: Towards Understanding the Interaction with Machine Learning in the Design of Maintenance Planning Tools. Aerospace 2022, 9, 754.
Academic Editors: Bruno F. Santos, Theodoros H. Loutas and Dimitrios
Received: 31 October 2022
Accepted: 23 November 2022
Published: 26 November 2022
Playful Probing: Towards Understanding the Interaction with Machine Learning in the Design of Maintenance Planning Tools
Jorge Ribeiro * and Licínio Roque
CISUC—Centre for Informatics and Systems, Informatics Engineering Department, University of Coimbra, 3004-531 Coimbra, Portugal
Abstract
In the context of understanding interaction with artificial intelligence algorithms in a decision support system, this study addresses the use of a playful probe as a potential speculative design approach. We describe the process of researching a new machine learning (ML)-based planning tool for maintenance based on aircraft conditions and the challenge of investigating how playful probes can enable end-user participation during the design process. Using a design science research approach, we designed a playful probe protocol and materials and evaluated the results by running a participatory design workshop. With this approach, participants generated speculative design insights into understandable interactions, especially interaction with ML. The article contributes the design of a playful probe exercise to collaboratively study the adjustment of practices for condition-based maintenance (CBM) and a set of concrete insights on understandable interactions with CBM.
Keywords: playful probing; cultural probes; design requirements; games research; participatory design; decision support systems; condition-based maintenance; machine learning interaction
1. Introduction
There is increasing demand to incorporate powerful AI (artificial intelligence) algorithms in information systems, often in data-sensitive or critical decision support systems. Such critical and highly regulated operational contexts resist experimentation, and introducing these algorithms there raises new design challenges.
This speculative design scenario raises challenges for design approaches that make new interactions easier to understand while supporting human appropriation of, and control over, new AI tools. In this context, we cannot perform ethnography of as yet nonexistent interactions; neither can we apply conventional requirements elicitation approaches that presume such knowledge.
Moreover, the interdependence between new tool development and new practices in a strongly regulated context inhibits experimentation, creating a cultural deadlock. Such a scenario demands a generative approach informed by current practices as well as by new AI possibilities as they become available through research. A participatory approach is needed to empower practitioners [ ] to develop new ways of working with and designing AI-enhanced decision support systems. However, it remains unanswered which participatory approach is best suited to the design of a condition-based maintenance (CBM) planning tool.
Cultural probes, as proposed by Gaver et al., are “an approach of user-centred design for understanding human phenomena and exploring design opportunities” [ ] focused on new understandings of technology. Cultural probes do not give comprehensive information about people and their practices; rather, they provide fragmentary clues about their lives and thoughts [ ]. The researcher or designer then has the task of putting the pieces of that puzzle together and letting findings emerge. Cultural probes can therefore be a tool for designers to understand users [ ]. Lange-Nielsen presents studies in which probes are used as a scientific method or a design tool [ ], and Hutchinson et al. show how technology probes can be a promising new design tool in the design of new technologies [ ]. Vasconcelos and others report on a study inspired by the concept of cultural probes, describing the process of creating low-, medium-, and high-fidelity prototypes for a cognitive computer game [7].
J. Wallace et al. argue that design probes provide more than just inspiration for design and can be used to mediate both the relationship between participant and researcher and the participant’s own feelings [ ]. However, we need more than mediation: we need a way to gather insights on how to use AI to plan maintenance. The role of play in cultural development has been recognized at least since Huizinga [ ]. There has been extensive work on this topic in the scientific community, as discussed in Section 2.2.
Adopting the concept of playful probes could potentially enable the exploration of AI/ML methods by helping to develop the participants’ perspective and the appropriation of new tools [ ]. However, we still lack an answer to the research question: How can we explore playful probing to draw insights into understandable interactions with AI/ML tools?
The purpose of the design case we present is to obtain insights and better understand how condition-based maintenance (CBM) planning can be introduced in a critical operations sector: aircraft maintenance. More specifically, we aim to gain insights on understandable interactions for aircraft maintenance planning assisted by a machine learning agent. In this paper, we report on a design science research process that ran a participatory playful probing workshop to evaluate a proposed design supported by a virtual paper prototype. This was done while simulating an ML-enhanced CBM maintenance context with aircraft maintenance domain experts. Videography, dialogues, and interviews were open coded for content analysis and a summary of design insights towards the proposed research question.
This study contributes the design of a playful probe exercise to collaboratively study the adjustment of practices for CBM. As a result of this exercise, a set of eleven concrete insights about understandable interactions with CBM maintenance planning emerged. In this article, we present and discuss these understandable interactions.
The next section reviews background concepts related to aircraft maintenance and cultural probes. We then describe the initial exploration of the work, followed by the process of playful probes, which explains how the various methods and procedures are integrated. Subsequently, we present the collection and analysis of data obtained in the workshop, followed by the discussion and synthesis of the understandable interactions. The final part presents the conclusions of this study.
2. Background and Related Work
In this section, we present the literature on the current state of aircraft condition-based maintenance, followed by cultural probing and its evolution into playful probing.
2.1. Aircraft Condition-Based Maintenance
Current aircraft maintenance (AM) is based on the task-oriented MSG-3 model defined in 1979 by the Air Transport Association (ATA) [ ]. The MSG-3 method defines the obligation to carry out scheduled and routine maintenance on a given structure, but it also allows slack for unscheduled or non-routine maintenance, resulting in maintenance actions that correct divergences detected during scheduled maintenance tasks. The AM domain poses new challenges for the design of decision support systems, where a confluence of human and machine learning (ML) capabilities can open new opportunities such as condition-based maintenance (CBM) [ ]. This technique exploits ML-based forecasts of component and system failures to schedule maintenance at the most opportune moment instead of using a fixed-interval approach, increasing aircraft availability and safety while reducing costs [ ]. CBM is being increasingly adopted, including ML processes that produce remaining useful life (RUL) estimates for aircraft system components and generate updated plan proposals for user validation.
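To make the CBM scheduling idea concrete, here is a minimal sketch of how ML-produced RUL estimates could bound a maintenance horizon. The function, field names, and the fixed safety-margin policy are our illustrative assumptions, not the tool studied in this paper:

```python
from dataclasses import dataclass

@dataclass
class ComponentEstimate:
    name: str
    rul_hours: float   # ML-estimated remaining useful life, in flight hours
    confidence: float  # e.g., 0.9 for a 90% confidence estimate

def days_until_maintenance(estimates, planned_hours_per_day, safety_margin=0.2):
    """Days until the earliest component needs maintenance, discounting
    each RUL by a safety margin to stay on the safe side of the estimate."""
    usable_hours = [e.rul_hours * (1 - safety_margin) for e in estimates]
    return min(usable_hours) / planned_hours_per_day

fleet = [ComponentEstimate("hydraulic pump", 320.0, 0.9),
         ComponentEstimate("avionics fan", 510.0, 0.9)]
days = days_until_maintenance(fleet, planned_hours_per_day=8.0)  # 32.0 days
```

A real CBM planner would weigh many more constraints (hangar capacity, flight plans, regulatory limits); the point here is only that RUL turns a fixed interval into a condition-driven horizon.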
However, such a critical operational context is highly regulated and resists experimentation. This scenario raises new and relevant challenges for design approaches that can enable the evolution of current practices in the field by designing for human appropriation and control over new ML algorithms. In such a context, we cannot perform ethnography of as yet nonexistent practices; neither can we apply a priori approaches for eliciting requirements. Instead, a participatory approach [ ] is needed to empower practitioners to understand and develop new ways to work with and design ML-enhanced decision support tools.
2.2. Playful Probing
The role of play in cultural development has been recognized at least since Huizinga [9]. A playful probing approach uses games designed specifically for the study and tailored to the research area and purpose of the study [ ]. Research [ ] suggests that a game designed for playful probing “opens up for a playful and autonomous environment for data-gathering which involve learning about individual and shared social practices”. The playful probes technique follows principles similar to those of cultural probes while exploiting games as a research tool to enable learning and data collection. Through support artefacts, cultural probes allow participants to document their activities and experiences to be used as research material. This captures the participants’ perspectives in the process, allowing them to explore new things beyond the expected.
As defined by Huizinga [ ], play is an experience outside the real world, a magic circle where the player can explore, experiment, and provoke in a safe environment. With Huizinga, we learned to recognize the role of play in cultural development. Playful probes might enable a planner to enter this bubble and establish a dialectic with a new AI approach and well-defined processes, allowing us to identify, anticipate, and understand possible problems. The concept of play has been addressed before [36,8] in the context of studying novel interaction design proposals. Playful probing [ ] espouses similar principles to cultural probes while exploiting games as a research tool to enable learning and data collection [ ]. However, we still must identify how to research risk scenarios. A simulation game may serve as an enabler of a participatory context, but we do not yet know how to create such a participatory context using a simulation game.
Since the development of modern digital computers, computer simulations have been used for modelling and studying systems [ ]. Simulation games have been used to formalize scientific problems and have also been adopted by academia as ways to formalize and study, e.g., economic and social behavioural phenomena. Simulation games can make it possible to create and study scenarios without compromising actual maintenance operations. Turning simulations into games can also enable the exploration of behaviour in alternative settings.
Previous research [ ] showed that playful probing artefacts can be used to design new ML algorithms in a critical and highly regulated operational context, and another previous study [ ] showed how to elicit ideas for the integration of maintenance planning practices with ML estimation tools and the ML agent using playful probes. However, we still do not know how to use a simulation game in a participatory context to gain insights about understandable interactions with AI/ML decision support tools.
3. Initial Exploration of Work
In this section, we describe our initial exploration of the topic. Prior to the participatory process, we performed bibliographic research on the state of the art in human–ML interaction and aircraft maintenance. After this initial exploration, the literature review was deepened and directed towards the research question of this study, as presented in Section 2.
3.1. Human–ML Interaction for CBM
A better human–computer confluence can be achieved by enabling co-creation between the user and the artificial intelligence (AI) algorithm and putting explainability at the core of user autonomy and empowerment. Some studies [ ] provide human–AI interaction guidelines, but explainable, accountable, and intelligible systems remain key challenges [23–25]. While progress has been made in explainability and interpretability [26–28], the design of AI interfaces focusing on co-creativity remains a challenge.
Bødker et al. point to the problem of appropriation and control as people learn new technologies and update cooperative work practices. We need to identify how a tool can be created to support a dialogue between the planner and an ML algorithm while also preserving the autonomy and control of the human agent in a risky context. A process is needed to design a new tool enabling a new meaningful simulated practice.
Simulations and games have been used to model and study systems [ ], formalise scientific problems, and study economic and behavioural phenomena [ ]. Simulations allow us to test and explore new interaction approaches, while new practices can emerge in a meaningful but simulated context. Playful probes can be the variation of the cultural probes approach [ ] that allows learning and data collection [ ] in a simulated playful context.
3.2. Maintenance Planning
The literature does not offer technical details on how a company such as the one we are studying performs maintenance planning. To obtain valuable information about maintenance practices, teams, and tools, we applied the participatory design (PD) methods described in the next section.
Semi-structured interviews were conducted by the two authors of this paper with two maintenance planning workers: one with extensive experience and responsibilities in the domain, and the other with a few years of experience. The interviews were recorded, transcribed, and analysed afterwards. They served, to a large extent, to enlighten us on how maintenance is done in everyday practice, the volume and type of daily work, how the maintainers cooperate with other teams, and how they accomplish their work using their specific tools. Throughout the interviews, speculative questions were asked to understand whether indicators could be used for a CBM paradigm shift. At this stage, we did not give relevance to how ML algorithms (in this case, an ML agent) could help in the daily work of planners.
A guided visit to the facilities took place at the maintenance planning building, main-
tenance control centre, and hangars. It was important to perceive the skills, tools, and
particularities of each team in place, particularly among maintenance planners and mainte-
nance engineers.
The information was also complemented with some presentations on the project setting and the sharing of several technical documents with maintenance-related details.
Based on this information, we were able to synthesise the main concepts that allow us
to represent how maintenance is performed:
Block: predefined routine maintenance, usually heavy and with due dates (such as “A-checks”).
Cluster: usually a flexible, small group of tasks that can be routine or non-routine, such as reactive or preventive maintenance; can have due dates, RUL, both, or none.
Flight: aircraft movement between airports. It is not possible to perform any maintenance on the aircraft during this period.
Hangar: place where maintenance is performed. It has several restrictions, such as time, materials, and labour.
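The four concepts above can be summarized as simple data structures. This is a hypothetical sketch for illustration; field names and types are our assumptions, not the representation used by the planning software:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class Flight:
    aircraft: str
    depart: date
    arrive: date  # no maintenance is possible while the aircraft is flying

@dataclass
class Block:
    aircraft: str
    due_date: date  # predefined routine maintenance with a due date

@dataclass
class Cluster:
    aircraft: str
    due_date: Optional[date] = None    # may carry a due date, an RUL,
    rul_hours: Optional[float] = None  # both, or none

@dataclass
class Hangar:
    slots: int = 1                     # restricted by time, materials, and labour
    booked: list = field(default_factory=list)

c = Cluster("N100", rul_hours=120.0)   # a non-routine cluster with only an RUL
```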
The flight element is not currently shown to maintenance planners in the planning software. However, we consider it relevant to include in this study, as it has a relevant impact on hangar maintenance and resources. This information was important for designing the process of playful probes, described in the next section, especially for the creation of the playful probe paper prototype used in the workshop.
4. The Process of Playful Probes
In this section, we describe how the playful probing process was set up. Bearing in mind that we intended to speculatively study a CBM maintenance planning tool, we proposed a combination of methods for the discovery process: a cooperative future workshop, focus groups, and playful probes using a paper prototype. In a first phase, the exercise comprised a cooperative future workshop using playful probes, followed by a focus group serving as a reflection on and exploitation of what happened during the first phase. The focus group was a guided discussion followed by a structured interview sent by email after the workshop.
We designed playful scenarios and materials as well as a playful probing protocol. In
the process, we expected to open new relevant questions about ML-based RUL estimators,
maintenance planning practices, and how to design human–tool interaction in a future
computer prototype.
For relevance and simplicity, we determined that in this study we would play only with shorter maintenance work cycles, called “A-checks”, using one RUL indicator for each aircraft maintenance package instead of representing an RUL for each component and system inside that package.
This study was conducted as part of an ongoing design science research (DSR) approach [31] and took place in a workshop session. Given the schedule restrictions and the difficulty in obtaining the agendas of our project partners, we accepted that the institution/company would recruit the participants.
The workshop was carried out with two researchers and two domain experts in maintenance management (hereinafter designated P1 and P2), both male and between 20 and 40 years old, with strong backgrounds in aviation and practical knowledge of planning tools. They played the simulation scenario and cooperated to solve each maintenance problem presented, reflecting on the mediating role of the new ML tools. The workshop was facilitated by the researcher, who ensured the application of the protocol and the clarification of doubts on play scenarios and materials (e.g., playing the gamemaster role), and the designer researcher, who assisted in the discussion.
The next subsection describes the preparation of playful probe material and how it
was applied.
4.1. Playful Probe Preparation
Given the focus of developing a new CBM planning tool with the concept of short-term
maintenance, we chose to simplify the concepts that allow us to represent and explore how
short-term maintenance is performed.
We created a simplified and simulated version of a current standard planning activity
to speculate on how evolution can be done with CBM. However, we considered an ex-
ception: Instead of preprocessing the fleet information to find maintenance opportunities,
as is the current practice, we decided to include the flight plan and consider any time
when the plane is not being used as a potential period to perform maintenance. We also
decided not to include any interaction with the ML planning agent so as not to bias the
discussion and solutions presented by the participants. The materials were created with shapes and colours to be easily identifiable and distinguishable. We wanted objects that were easily playable and could be readily studied, focusing on functionality, manipulability, and the resulting discourse rather than aesthetics.
The steps carried out in this study are presented below.
1. Materials design
The materials presented in Figure 1 were based on the main maintenance concepts set out in Section 3:
Row: one row represents an aircraft.
Column: representation of time. One column represents one day.
Flight: blue ribbons represent aircraft flights in the respective aircraft row.
Registration time: the time limit to register the aircraft to some maintenance slot is 30 days.
Open time: the time limit to open a new workscope (create new maintenance) is 21 days.
Block: red rectangles represent predefined routine maintenance (with a due date from the maintenance planning document). When moved before the registration limit, it must be registered as a new block after this registration limit.
Cluster: group of tasks that represent other types of A-checks (small maintenance) with a due date, RUL, both, or none. If a cluster is moved, it must be moved to after the open workscope limit unless it is joined to a block.
Figure 1. Paper prototype materials printed for testing.
In this phase of the discovery process, we chose to create materials that tried to
faithfully represent current maintenance concepts as a starting point. We decided to
use only a small speculative detail of CBM maintenance in these materials: the RUL
indicator (in flight hours) that was included in some clusters.
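The registration and open-workscope windows above can be read as simple movement rules. The sketch below is our simplified interpretation for illustration; the day arithmetic and rule details are assumptions, not the exact game rules:

```python
# Illustrative check of the two scheduling windows used in the probe materials.
REGISTRATION_LIMIT_DAYS = 30    # blocks must be registered beyond this horizon
OPEN_WORKSCOPE_LIMIT_DAYS = 21  # new workscopes can only open beyond this horizon

def can_move(item_type, target_day, today=0, joined_to_block=False):
    """Return True if moving a block/cluster to target_day respects the windows."""
    horizon = target_day - today
    if item_type == "block":
        return horizon > REGISTRATION_LIMIT_DAYS
    if item_type == "cluster":
        # a cluster joined to a block inherits the block's slot
        return joined_to_block or horizon > OPEN_WORKSCOPE_LIMIT_DAYS
    raise ValueError(f"unknown item type: {item_type}")

can_move("cluster", 25)                        # True: beyond the 21-day limit
can_move("cluster", 10)                        # False: inside the open window
can_move("cluster", 10, joined_to_block=True)  # True: joined to a block
```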
2. Resolution path
To enable this task in a limited time, we created a structured resolution path. The beginning of the resolution was linear and could only progress one way. Participants faced the simplest concepts of flight planning and maintenance. Subsequently, the resolution would lead to a path where users would necessarily be faced with more complex issues such as conflicting conditions and 90% confidence RULs.
This probe was designed so that, during the rehearsal, the disposition of visual artefacts would confront participants with situations that could lead to debate and the generation of insights. The main ones were:
Introducing block and cluster grouping—Is it possible to group all the maintenance in these two typologies? How do we deal with the deadlines of each?
Introducing estimates with 90% confidence—Does it make sense to have a large degree of uncertainty? How do we represent it to enable decisions?
3. Material digitalization
To prepare the virtual workshop, all artefacts were designed digitally but printed and rehearsed manually, as is common in paper prototype exercises (Figure 1). After testing multiple approaches to instrument the playful probing with visual artefacts, we adjusted size and complexity, and the exercise was migrated to the digital collaboration tool (Figure 2).
Figure 2. Maintenance scheduling problem presented to the participants in the experimental session.
Next, we describe the steps used in the rehearsal of the playful probing workshop.
4.2. Playful Probe Workshop
The workshop included, in a first phase, a cooperative future workshop using playful probes (Steps 1 and 2), followed by a focus group to reflect on and exploit what happened during the playful probe exercise through a guided discussion and an interview (Step 3).
4. Briefing
In an initial part of the playful probing workshop, an introduction was made explain-
ing the basic maintenance elements of the game and demonstrating how to solve a
simple problem (Figure 3).
Figure 3. Minimal scheduling problem to introduce the rules and basic movements of the game.
The canvas represents a fleet of only two aircraft, with flights and maintenance
distributed over time and using a minimum block time of 4 h. To simplify the
maintenance problem for a first iteration of the game design, only three types of
artefacts were created with which the participants could interact (drag and drop),
representing the maintenance work of an “A-check”. For simplicity, we assumed that
there was only one hangar with a maintenance team available, so it was not possible
to do multiple maintenance procedures at the same time. This part lasted around 10 min, and the participants clarified some doubts about the game but did not interact with the artefacts.
5. Running the playful probing participatory design workshop
In this part of the experimental session, artefacts were presented to participants with a
non-trivial maintenance scheduling problem to be solved (Figure 2). The participants’
voices and the collaborative canvas were recorded while they presented their ideas and
played with the representations to solve each maintenance problem. The facilitator
acted as gamemaster, answered participants’ questions about whether they could take certain actions, alerted them when they were ignoring some important condition, and tried to get them to explore the problem boundaries in a dialogue with the material.
Exploration developed freely to solve each game problem, with no constraints regarding order, time, or the management of concurrency among open explorations; the
facilitator favoured out-loud dialogue and the explicit manipulation of the represen-
tations as a form of dialogical imagination among participants. Given the habitual
nature of play, we expected the emergence of self-directed and highly autonomous
activities driven by participants’ playful trajectories actively exploring the boundaries
of the gameplay scenario.
6. Debriefing debate
Shortly after the participants solved the planning problem, the focus group occurred,
in which a broader discussion space was opened to reflect on the current state of
maintenance and how CBM can be used in the future.
4.3. After the Playful Probe Workshop
7. Semi-structured email interview
After reviewing the recordings, specific interview questions were sent to the partic-
ipants with the intention of clarifying or deepening the reflections they expressed
during the play and debriefing phases. The first group of questions focused on the
experience and interpretation of the participants about the exercise. The leading
questions were:
How did you perceive the experience from the moment the problem appeared with a yellow star to the solution you reached? What did you find most challenging and why?
The second group were speculative questions about using an ML agent to help with
maintenance planning. The leading questions were:
How would you briefly narrate a planner using the ML planning/scheduling with this interface? At which moments should ML be called in to provide a new solution or a partial solution to the planner?
The third group of questions focused on the visualization, interpretation, and control
of RUL indicators. The leading questions were:
Did you experience difficulties visualizing/interpreting RUL indicators?
Can you anticipate some improvement in the way we present information
to give better control to the planner?
The last group of questions focused on the playful probing exercise itself. The leading
questions were:
What did you think about the session technique used? Should we make some changes? Were the materials limiting in any way that needs to be fixed? Did it help generate or make explicit some insights about the subject matter?
The email interviews were typically an extrinsic reflection on the experience, a post-reflection. They are discussed later in the discussion section.
8. Data collection and in-depth content analysis
The playful probing workshop generated audio and video recordings, dialogue text, and interview transcriptions. A video was made (with informed consent) of the conversation between the participants and their manipulation of game artefacts during the playful probe workshop. All video data (verbal content and actions with materials) were analysed to generate initial codes. The recording was then split into 30 s segments and coded into groups. These groupings were based on the intrinsic self-analysis of the experience that emerged in the conversation as participants played the scenario. Data collection and analysis of the workshop are described in the next section.
5. Data Collection and Analysis
The workshop recording was transcribed, and the content was analysed to draw out the main emergent categories. These classes were organized into a taxonomy that then served to guide the coding process. Subsequently, the recordings were divided into 30 s segments, and each segment was classified into one or more classes. The two major areas were Focus and Reflection (Level 1). Focus represents when participants were focused on something directly related to the artefacts of the game; Reflection marks when participants expressed some reflected thoughts.
5.1. Content Coding
The recording was transcribed, processed for emergent categories, and split into 30-second segments. These classes were organized in the taxonomy of themes according to Figure 4.
Figure 4. Taxonomy of coded conversation themes.
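The coding scheme in Figure 4 can be sketched as a small validation routine: each 30 s segment receives one or more taxonomy labels. The label strings and data structures below are our own illustration of the approach, not the authors’ actual coding tool:

```python
# Taxonomy labels loosely mirroring Figure 4 (flattened as "area.theme" strings);
# segment data are invented examples.
TAXONOMY = {
    "focus": {"tools", "game.interpretation", "game.manipulation",
              "game.instrument", "maintenance.solving", "maintenance.meta"},
    "reflection": {"instrument", "practices.current", "practices.future",
                   "rul.meaning", "rul.time", "rul.confidence", "ml"},
}
VALID = {f"{area}.{theme}" for area, themes in TAXONOMY.items() for theme in themes}

def code_segment(start_s, labels):
    """Validate labels against the taxonomy and return a coded 30 s segment."""
    unknown = set(labels) - VALID
    if unknown:
        raise ValueError(f"unknown labels: {unknown}")
    return {"start": start_s, "end": start_s + 30, "labels": set(labels)}

seg = code_segment(300, {"focus.maintenance.solving", "reflection.rul.confidence"})
```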
The speech about Focus fell into three immediate themes: the technological tools used for the workshop; the planning game exercise, further subdivided into the interpretation of how planning is represented and the manipulation of the artefacts; and conversation related to the game as an instrument. Focus on maintenance was divided into solving the scheduling problem and meta-speech related to the maintenance domain.
The speech about Reflection was divided into four topics: ideas for the improvement of the instrument (game or playful probe); the way the maintenance of the aircraft is being or can be done; the degree of confidence and the meaning/implications that the RUL indicator can have in planning; and finally how machine learning can interact with the human planner, or the operation of ML algorithms. Planning practices are separated into two categories: current practices, which are currently practised, and future practices, which are speculated to be implemented according to predictive indicators such as RUL. The RUL class is divided into three subcategories (meaning, time, confidence).
As this is a collaborative work, the discourse of each participant was not grouped by
category but by the conversation of the group as a whole. The resulting data are presented
in graphs to show how the conversation and the discovery of themes took place throughout
the experiment, at what times the themes of RUL and ML were approached, and how the
reflections arose, which are important for knowledge development by the participants.
5.2. Conversation Analysis
This experiment lasted 74 min: 23 for Phase 5 and 52 for Phase 6. Before the in-depth
analysis, there is an important moment during Phase 5 of the experiment to highlight:
The first three minutes were given to the participants to read the maintenance plan
and to clear up doubts before moving on to the problem.
The participants began by addressing the problem using meta-speech, suggesting that they were “reading” the problem first and making the connection between the artefacts and the maintenance language they were familiar with. For instance, they used the aircraft tail name to refer to specific rows: “there is an issue with the first one, the N100, because it can overlap with the N1003 flight at the beginning”. They took about 5 min from the moment the problem was placed until they started moving the elements, in a very intricate collaboration process, analysing and negotiating the movements as if they were learning to play a game of chess. Although they assumed different roles and did not interfere with each other’s work, they communicated and collaborated.
When we look at the focus of the conversation in Phase 5 (Figure 5), we can see that, at the beginning, the participants talked about the representation of planning artefacts and had some technical issues related to technology that was unfamiliar to them prior to the experiment.
Figure 5. Focus of conversation during Phase 5.
Immediately after the problem was placed, participants started talking about maintenance-
related aspects (meta-speech). Only at 5:30 did they change the focus to solve the problem,
and only after 8 min did they start to manipulate the artefacts. From this moment onward,
the participants did not lose focus on solving the problem until the end of the exercise. This
problem resolution was accompanied alternately by moments of artefact manipulation or
maintenance-related meta-speech.
The reflection mainly took place during Phase 6, starting at minute 23, immediately
after the problem was solved, as can be seen in Figure 6. During this phase, it is important
to note that there was a quite intense discussion about maintenance planning practices. The
discourse alternates between current practices and speculation on what future practices
will look like. The reflective discourse in this phase is divided into three major blocks:
Between 23 and 40 min, we found speech oscillating between current and future practices,
specifically regarding RUL (e.g., RUL confidence level representation, RUL as a box plot
or distribution, maintenance risk and criticality, due date management, state of the fleet,
maintenance opportunities, cluster and block management, task and RUL management,
and RUL and task management); between 42 and 58 min, the speech was about future
practices and mostly focused on ML (e.g., agent suggestions and planner assessment,
solution fine tuning, deviation implications, planning constraints and impacts, planner
knowledge and strategies, RUL distribution visualisation, impact cost curve visualisation,
and operation and maintenance plan integration); between 58 and 70 min, only current
practices from a more global management perspective were discussed (maintenance and
operational planner communication, operational plan management, maintenance and
operational planning times, problem solving, types of maintenance, and flight hour/flight
cycle ratios).
Figure 6. Reflections on the conversation during the experiment.
Concerning the introduction of the RUL concept, we could verify that whenever there was a dialogue about time or the confidence interval, it was accompanied by a discussion of the meanings and implications it may have. This took place mainly in the first block, in the mixed discourse between talk of current and future practices.
The discussion about the use of machine learning only started in the last part of the
debriefing phase. Due to the participants being involved in the domain of maintenance
planning, their initial interest was to discuss and clarify some maintenance concepts and
anticipate the possible changes in their daily work. Only after this clarification did they
begin to explore how ML can contribute to their work, combining both concepts. Specifically, participants discussed future maintenance practices and the use of ML algorithms to help planners in that task. As shown in Figure 6, the reflections on interacting with ML algorithms were closely connected to the conversation about future maintenance practices. These reflections alternate between the form of interaction and the functioning of the algorithms, often appearing at the same points in time. The third block is exclusively a reflection on current practices. Between the first and second blocks, a brief moment of reflection on the game (playful probe) itself takes place, lasting only 2 min.
In Figure 7, we can verify the relationship between focus and reflection in maintenance.
In Phase 5, the participants were completely focused on the solution of the problem and
led a meta-discourse on maintenance, especially in an early stage. During this phase,
the participants had only 3 moments of reflection on current practices. During the first
block of Phase 6, as mentioned above, it is possible to verify the alternation and balance
of the reflective discourse between current and future practices. The predominance of the
discourse on future practices takes place during the second quarter and on current practices
in the third.
Figure 7. Cumulative number of conversation segments between focus and planning reflection, suggesting 3 different phases.
6. Understandable Interactions
In this section, we describe the conversation about the main interaction themes that emerged during both runs, synthesizing insights on the evolution of ML interaction towards a CBM planning paradigm.
The discussion is conducted through analysis of the concrete conversation utterances during the experiment, compared with the feedback obtained from participants after the session.
Design Insights
In this section, we list and detail the various insights for design iteration that emerged during the conversation with the participants.
1. Understandable maintenance representation
Regarding the experience of interpreting the game elements (flights, blocks, clusters of tasks, and plans), the participants generally found it clear, with P1 adding “clear and similar to tools already in use” and P2 saying “this view is actually quite nice to be able to quickly scan the situation”. As can be visualized in Figure 5, the participants started by talking about planning representation and then started to move artefacts at minute 8.
During the exercise, P1 verbalized the possibility of also visualizing other kinds of maintenance that do not require a hangar, which is important in a short-term maintenance paradigm such as CBM, concluding, “If it’s a really small problem you can do it during a turnaround”.
2. Maintenance package management
Regarding future developments of maintenance artefacts, P1 was thinking about how to visualise the “benefit you get combining a cluster and a block. Let’s say, after this 30H (of maintenance), there is 1H of towing; that means if you combine them, you save an hour, so the box becomes a little bit smaller”. P2 was also concerned with this kind of plan optimization: “we should combine these two, because it’s a kind of waste bringing them back to the hangar twice in two days”. Both participants were interested in “opening” and splitting clusters into sub-clusters, especially if some task’s RUL restricts the entire cluster. This would be useful if there is an update to only one RUL among the possible dozens at any given time, and the best solution is to solve the problem related to that specific RUL and leave the rest of the cluster on the original schedule: “in fact, what we necessarily need to move is not all the work, but part of the work” (P2).
3. Maintenance flexibility and control
At some point, P1 considered scheduling two hours of maintenance over the limit and wondered, “what are the consequences of not making the exact Due Date? What are the consequences of having the component fail before the preventive removal?” and “How critical is it if we don’t respect a RUL?”, suggesting that the planner should have the flexibility to schedule tasks at another time if the return is large enough. Participants agreed that we should start from the assumption that the planner knows things that cannot be coded in the model: “the planner might have more data or might have some preferences, some strategies in his head, that make him decide to deviate from the output of the ML algorithm” (P1). Thus, we should assume that s/he can make some changes based on human (tacit) knowledge and turn these into constraints to generate a new solution. This can be done by fixing a particular block or cluster or locking an empty space after some maintenance because s/he “knows there is a risk that they are working in an area where they usually have other findings which they need to attend to as well” (P2).
4. Manual planning
Complementing the previous point, P2 said that maintainers need some room to schedule clusters because they do not know what kind of corrective tasks they will have in 30 days: “we don’t have the luxury always of having RUL of more than 100 h (. . .). The problems pop up, let’s say, in common flights, so we need to act on that right now (. . .) to find some spot to fix the next couple of days”.
5. Maintenance time restrictions
Participants confirmed that fixing blocks as part of A-checks respects the time limits presented in this exercise, “until, like you said, the 20 days to 30 days” (P1). However, “there is also other work, such as modifications, and those you can foresee months upfront; let’s say if you want to install wifi on the aircraft, this is not popping up in the short term, but you already know months in advance”, and these can be scheduled in some check.
6. Flight and maintenance plan merger
Although planners do not currently visualize flights in maintenance planning, partly because flight planning is done in the short term, they recognized the importance of visualizing flights on the same canvas as maintenance. After playing with flights, blocks, and clusters, participants suggested improvements to make them more complete, such as including the turnaround and towing time in the flight artefacts. They also suggested presenting the hours per flight. This information may also be important for cooperation with operational planners. A task with low probability and very high impact can trigger a discussion about whether it should be planned, and they should simply accept this schedule if they “have a spare aircraft standing by or have some buffer in the network”; otherwise, they will not take this risk, which may lead to cancellation.
7. The role of automatic planning
Participants assumed that there would be some form of automatic planning that would reschedule the entire plan. However, they felt the need to replan only part of it. P3 felt the need to have a button to “fix the rest” once s/he made a few choices. P2 raised the same question: “How will we be able to lock some parts not to be changed by AI plan recalculation?”. P2 recognized that it is difficult to manually optimize a solution, venting, “Wow, this is endless!”, and, concerning plan optimization, “we should combine these two, because it’s a kind of waste bringing them back to the hangar twice in two days”. Participants agreed that it might be a good idea for the ML agent to group tasks into clusters and propose a solution to the planner. Then s/he must assess it and decide what to accept, knowing that s/he will always be able to adjust the solution that the system has proposed.
Both participants highlighted a few occasions when the ML agent could be called to present a solution. P2 said that the ML agent should be called “when a new RUL is introduced, either a change or a new cluster”. P1 suggested that “an initial proposal to cope with a new ‘problem’ would be nice, indicating the differences the ML proposes to make”. P1 also presented an idea similar to some chess applications to improve the interaction between the user and the ML agent: “If you select a block, perhaps see the options of what you can do with that block, before moving”.
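The “lock some parts” interaction the participants asked for can be illustrated with a minimal sketch. The plan structure, the field names, and the `replan_candidates` helper are assumptions for illustration, not the tool’s actual API:

```python
def replan_candidates(plan, locked_ids):
    """Split a maintenance plan into blocks the planner has pinned and
    blocks an automatic planner is still allowed to reschedule."""
    fixed = [b for b in plan if b["id"] in locked_ids]
    movable = [b for b in plan if b["id"] not in locked_ids]
    return fixed, movable

# Hypothetical plan: two clusters and a block, with start hours
plan = [{"id": "C1", "start": 10}, {"id": "C2", "start": 40}, {"id": "B1", "start": 70}]
fixed, movable = replan_candidates(plan, {"C2"})
print([b["id"] for b in movable])  # ['C1', 'B1']
```

Only the `movable` blocks would be handed back to the ML agent for recalculation, while the `fixed` blocks become hard constraints of the new solution.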
8. Discretionary balance between control and autonomy
Participants also expected the tool to be useful for generating a solution that not only respects the restrictions but also allows limiting the search space to a certain period of time or to some selected aircraft. However, it should show the planner the impact of this limitation: “For example, I gave an 8 h (slack) after maintenance just because sometimes there is an issue, but s/he sees in the planning that it has quite a lot of impact” (P2). P1 agreed that the planner should be able to get some kind of score, or even better, the cost of making changes, “because maybe there are some biases in behavior or maybe (the planner) is used to doing it in a certain way”. Further, it must be feasible that this actually helps to achieve better solutions, “not just in time reduction, but also in optimality”.
9. Maintenance RUL confidence level
During the runs, participants found it easy and clear to understand what needed to be done. However, when confronted with the RUL confidence level, they did not find it easy to interpret, and they treated the RUL as a fixed due date. P2 said it “was quite tricky to estimate what risk you took when you interpreted the RUL”, while P1 said the representation of RUL required some mental effort to visualize: it “was a bit challenging to determine the due dates for the tasks; it required some mental effort”. P1 added during the exercise, “the difference between 95 and 99 in my head is not playing a role”. Despite the difficulty in seeing the impact of the confidence level during the exercise, they made an effort to understand it; e.g., P1 said, “I won’t risk it, because 90% is quite high”.
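One way to make the confidence level concrete is to convert an RUL estimate into a conservative planning horizon. The sketch below assumes a normally distributed RUL estimate; the function name, the 65 ± 6 h figures, and the normality assumption are illustrative only, not the study’s model:

```python
from statistics import NormalDist

def conservative_rul(mean_rul_h, std_h, confidence=0.90):
    """Hours of remaining useful life that will still be available with the
    given probability, under an (assumed) normal RUL estimate: the one-sided
    lower bound satisfying P(RUL >= bound) = confidence."""
    return mean_rul_h + NormalDist().inv_cdf(1 - confidence) * std_h

# e.g. a 65 h estimate with a 6 h spread, planned at 90% confidence
print(round(conservative_rul(65, 6, 0.90), 1))  # 57.3
```

Showing such a bound directly on the timeline would let a planner see at a glance what “90% confidence” costs in schedulable hours, instead of mentally discounting a fixed due date.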
10. Maintenance RUL visualization
Participants suggested automatically visualizing the RUL on the timeline, and P1 also suggested it would be good to “visualize operation impact”, such as costs, availability, and the maintenance components, asking P2, “But it could actually depend on what these 65 h are based on, right? What kind of components are we talking about?”. During the exercise, P2 suggested an RUL of 60 h with a confidence level of 90%: “it would be nice if we could see (. . .) 65 ± 6 h; then you kind of have an idea of how close to the edge you are”, and when asked if a boxplot could fit, P1 answered, “Yeah, I’m thinking out loud now, but perhaps instead of a square box, it could be a kind of distribution”. At the end of the exercise, P1 took a co-constructive move and started using the collaboration tool to make some design proposals. S/he started to draw what this kind of distribution could look like, as shown in Figure 8: a visual analogy based on how the arrival time is modelled, but in this case as a view of the risk.
Figure 8. Proposed remaining useful life distribution visualization by Participant 1.
The participant added another curve, shown in Figure 9, and said that it was something s/he was not used to; it was just his/her idea, based on aircraft management, with regard to a future CBM scenario.
Figure 9. Proposed impact curve by Participant 1.
This should be something related to the impact cost: “so if you do this task now, it will cost you something because it will be based on the RUL (. . .); if you do it too early, it’s got a cost because you are wasting the RUL, but if you do it too late, it’s gonna cost you because it’s incurring a delay, cancellation, or high repair times. But there is no optimal here, and there is something that you can play with”, referring to the possibility of adjusting the best time to schedule some cluster and seeing the respective impact of this move.
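P1’s asymmetric curve can be sketched as a simple cost function. The parameter names, the normal-RUL assumption, and the example costs below are illustrative assumptions, not figures from the study:

```python
from statistics import NormalDist

def impact_cost(t, mean_rul, std_rul, waste_cost_per_h, failure_cost):
    """Illustrative impact cost of performing a task at hour t: doing it
    early wastes remaining useful life, while doing it late incurs the
    failure cost weighted by the probability the RUL is already exhausted."""
    wasted = max(mean_rul - t, 0) * waste_cost_per_h       # too-early side
    p_exhausted = NormalDist(mean_rul, std_rul).cdf(t)     # too-late side
    return wasted + p_exhausted * failure_cost

# Sweeping t over the timeline traces a curve the planner can "play with"
curve = [(t, impact_cost(t, 65, 6, 10, 5000)) for t in range(40, 81, 5)]
```

The shape and minimum of such a curve shift with the component class and its consequence costs, which is exactly the trade-off the participants wanted to see visualized.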
11. CBM maintenance indicators
Participants asked whether they have the data needed to turn the impact into costs; the participant added, “we know the delay cost, we know the cancellation cost approximately, we know roughly how much the escalated repair cost is, and we know approximately how much the RUL costs; what is a bit more difficult is the cost of preventive repair”. P2 presented his/her vision: “we should have a kind of class of component or class of consequences, and depending on that class, it must not run the risk, or it can run the risk, of exhausting the RUL”. P1 agreed: “the decision on whether to schedule something should not be just dependent on the description of the task, but should also be dependent on the maintenance opportunities and the state of the fleet”, and should “take into consideration the probability that something might fail, with a large or small impact”.
7. Conclusions
This work showed how the playful probe exercise, materialized in a digital paper prototype, enabled an exploratory environment in which researchers and domain experts were able to explore diverse aspects of the adoption of ML in the coming practice of CBM in airline maintenance. By focusing on playing with playful artefacts to solve a concrete problem, participants could reflect on changes to their domain and open a speculative and productive dialogue on how CBM maintenance could be designed, as evidenced by the content analysis of action and speech during the exercise and presented as understandable interactions. Through the use of playful probes, we were able to raise questions about interacting with an ML planning agent and dealing with RUL estimates.
The way probes are built has a great impact on the workshop and the reflections obtained. Sections 3 and 4 describe the detailed process of how complementary participatory design methods were combined with playful probing, and how probes were carefully studied and designed to offer the correct “bubble” experience [ ] to participants, which allowed us to originate useful reflections for the speculative study of interactions.
This study showed that playful probes, even in a non-dynamic environment such as
a paper prototype exercise, can serve as a valuable tool to direct the dialogue to relevant
aspects of new interactions as yet to be developed and addressing resources, knowledge,
and meaning aspects for that interaction, as evidenced in the analysis of participants’
discourse. Playful probes can enable this exploration in a cooperative way, as suggested in
the workshops by the participants themselves. This exercise allowed the participants to
put themselves in a safe and relaxed environment to play and learn collaboratively how to
deal with a high-risk problem.
Author Contributions:
Conceptualisation, J.R. and L.R.; investigation, J.R.; methodology, J.R.; su-
pervision, L.R.; validation, L.R.; writing—original draft, J.R.; writing—review and editing, L.R. All
authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the European Union’s Horizon 2020 research and innovation
program under the REMAP project, grant number 769288 and funded by the FCT—Foundation for
Science and Technology, I.P./MCTES through national funds (PIDDAC), within the scope of CISUC
R&D Unit—project code UIDP/00326/2020. The first author is also funded by the FCT—Foundation
for Science and Technology, under the grant 2022.11131.BD.
Informed Consent Statement:
Informed consent was obtained from all subjects involved in the study.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
CBM Condition-Based Maintenance
ML Machine Learning
AI Artificial Intelligence
AMP Aircraft Maintenance Planning
RUL Remaining Useful Life
1. Bødker, S.; Kyng, M. Participatory design that matters—Facing the big issues. ACM Trans. Comput.-Hum. Interact. 2018, 25, 1–31.
2. Mattelmaki, T.; Korkeakoulu, T. Design Probes; University of Art and Design: Helsinki, Finland, 2008.
3. Gaver, W.W.; Boucher, A.; Pennington, S.; Walker, B. Cultural probes and the value of uncertainty. Interactions 2004, 11, 53.
4. Celikoglu, O.M.; Ogut, S.T.; Krippendorff, K. How Do User Stories Inspire Design? A Study of Cultural Probes. Des. Issues 2017, 33, 84–98.
5. Lange-Nielsen, F.; Lafont, X.V.; Cassar, B.; Khaled, R. Involving players earlier in the game design process using cultural probes. In Proceedings of the 4th International Conference on Fun and Games-FnG ’12, Toulouse, France, 4–6 September 2012; ACM Press: New York, NY, USA, 2012; pp. 45–54.
6. Hutchinson, H.; Hansen, H.; Roussel, N.; Eiderbäck, B.; Mackay, W.; Westerlund, B.; Bederson, B.B.; Druin, A.; Plaisant, C.; Beaudouin-Lafon, M.; et al. Technology probes. In Proceedings of the Human Factors in Computing Systems-CHI ’03, Ft. Lauderdale, FL, USA, 5–10 April 2003; ACM Press: Ft. Lauderdale, FL, USA, 2003; p. 17.
7. Vasconcelos, A.; Silva, P.A.; Caseiro, J.; Nunes, F.; Teixeira, L.F. Designing tablet-based games for seniors: The example of CogniPlay, a cognitive gaming platform. In Proceedings of the Fun and Games ’12: International Conference on Fun and Games, Toulouse, France, 4–6 September 2012; Volume 3, pp. 1–10.
8. Wallace, J.; McCarthy, J.; Wright, P.C.; Olivier, P. Making design probes work. In Proceedings of the Conference on Human Factors in Computing Systems, Paris, France, 27 April–2 May 2013; pp. 3441–3450.
9. Huizinga, J. Homo Ludens: A Study of the Play-Element in Culture; Angelico Press: Brooklyn, NY, USA 2016.
10. Gaver, B.; Dunne, T.; Pacenti, E. Design: Cultural probes. Interactions 1999,6, 21–29.
11. Sahay, A. An overview of aircraft maintenance. In Leveraging Information Technology for Optimal Aircraft Maintenance, Repair and Overhaul (MRO); Elsevier: Amsterdam, The Netherlands, 2012; pp. 1–230.
12. Knowles, M.; Baglee, D.; Wermter, S. Reinforcement learning for scheduling of maintenance. In Res. and Dev. in Intelligent Syst. XXVII: Incorporating Applications and Innovations in Intel. Sys. XVIII-AI 2010, 30th SGAI Int. Conf. on Innovative Techniques and Applications of Artificial Intel.; Springer: London, UK, 2011; pp. 409–422.
13. Andrade, P.; Silva, C.; Ribeiro, B.; Santos, B.F. Aircraft maintenance check scheduling using reinforcement learning. Aerospace 2021, 8, 113.
14. Bernhaupt, R.; Weiss, A.; Obrist, M.; Tscheligi, M. Playful probing: Making probing more fun. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 4662 LNCS; Springer: Berlin/Heidelberg, Germany, 2007; pp. 606–619.
15. Sjovoll, V.; Gulden, T. Play probes-As a productive space and source for information. In Proceedings of the 18th International Conference on Engineering and Product Design Education: Design Education: Collaboration and Cross-Disciplinarity, E and PDE 2016, Aalborg, Denmark, 8–9 September 2016; The Design Society: Copenhagen, Denmark; Institution of Engineering Designers: Glasgow, UK, 2016; pp. 342–347.
16. Kjeldskov, J.; Gibbs, M.; Vetere, F.; Howard, S.; Pedell, S.; Mecoles, K.; Bunyan, M. Using Cultural Probes to Explore Mediated Intimacy. Australas. J. Inf. Syst. 2004, 11, 102–115.
17. Moser, C.; Fuchsberger, V.; Tscheligi, M. Using probes to create child personas for games. In Proceedings of the 8th International Conference on Advances in Computer Entertainment Technology-ACE ’11, Lisbon, Portugal, 8–11 November 2011; ACM Press: New York, NY, USA, 2011; p. 1.
18. Klabbers, J.H.G. The Magic Circle: Principles of Gaming & Simulation; Modeling and Simulations for Learning and Instruction; Sense Publishers: Rotterdam, The Netherlands, 2006.
19. Ribeiro, J.; Roque, L. Playfully probing practice-automation dialectics in designing new ML-tools. In Proceedings of the VideoJogos 2020: 12th International Conference on Videogame Sciences and Arts, Mirandela, Portugal, 26–28 November 2020; pp. 1–9.
20. Ribeiro, J.; Andrade, P.; Carvalho, M.; Silva, C.; Ribeiro, B. Playful Probes for Design Interaction with Machine Learning: A Tool for Aircraft Condition-Based Maintenance Planning and Visualisation. Mathematics 2022, 10, 1604.
21. Amershi, S.; Weld, D.; Vorvoreanu, M.; Fourney, A.; Nushi, B.; Collisson, P.; Suh, J.; Iqbal, S.; Bennett, P.N.; Inkpen, K.; et al. Guidelines for Human-AI Interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems-CHI ’19, Glasgow, UK, 4–9 May 2019; pp. 1–13.
22. Holbrook, J. Human-Centered Machine Learning. 2017. Available online:
machine-learning-a770d10562cd (accessed on 16 April 2020).
23. Guzdial, M.; Liao, N.; Chen, J.; Chen, S.Y.; Shah, S.; Shah, V.; Reno, J.; Smith, G.; Riedl, M.O. Friend, collaborator, student, manager: How design of an AI-driven game level editor affects creators. In Proceedings of the Conference on Human Factors in Computing Systems, Glasgow, UK, 4–9 May 2019; pp. 1–13.
24. Abdul, A.; Vermeulen, J.; Wang, D.; Lim, B.Y.; Kankanhalli, M. Trends and Trajectories for Explainable, Accountable and Intelligible Systems. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, Montreal, QC, Canada, 21–26 April 2018; ACM: Montreal, QC, Canada, 2018; pp. 1–18.
25. Wang, D.; Yang, Q.; Abdul, A.; Lim, B.Y. Designing Theory-Driven User-Centric Explainable AI. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems-CHI ’19, Glasgow, UK, 4–9 May 2019; ACM Press: Glasgow, UK, 2019; pp. 1–15.
26. Zhou, J.; Gandomi, A.H.; Chen, F.; Holzinger, A. Evaluating the quality of machine learning explanations: A survey on methods and metrics. Electronics 2021, 10, 593.
27. Linardatos, P.; Papastefanopoulos, V.; Kotsiantis, S. Explainable AI: A review of machine learning interpretability methods. Entropy 2021, 23, 18.
28. Bhatt, U.; Xiang, A.; Sharma, S.; Weller, A.; Taly, A.; Jia, Y.; Ghosh, J.; Puri, R.; Moura, J.M.F.; Eckersley, P. Explainable machine learning in deployment. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, Barcelona, Spain, 27–30 January 2020; ACM: Barcelona, Spain, 2020; pp. 648–657.
29. Bødker, S.; Roque, L.; Larsen-Ledet, I.; Thomas, V. Taming a Run-Away Object: How to Maintain and Extend Human Control in Human-Computer Interaction? In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI 2018, Montreal, QC, Canada, 21–26 March 2018; pp. 1–6.
30. Lukosch, H.K.; Bekebrede, G.; Kurapati, S.; Lukosch, S.G. A Scientific Foundation of Simulation Games for the Analysis and Design of Complex Systems. Simul. Gaming 2018, 49, 279–314.
31. Vaishnavi, V.K.; Purao, S. (Eds.) Design Science Research in Information Systems. In Proceedings of the 4th International Conference on Design Science Research in Information Systems and Technology, DESRIST 2009, Philadelphia, PA, USA, 7–8 May 2009.
ResearchGate has not been able to resolve any citations for this publication.
Full-text available
Aircraft maintenance is a complex domain where designing new systems that include Machine Learning (ML) algorithms can become a challenge. In the context of designing a tool for Condition-Based Maintenance (CBM) in aircraft maintenance planning, this case study addresses (1) the use of Playful Probing approach to obtain insights that allow understanding of how to design for interaction with ML algorithms, (2) the integration of a Reinforcement Learning (RL) agent for Human–AI collaboration in maintenance planning and (3) the visualisation of CBM indicators. Using a design science research approach, we designed a Playful Probe protocol and materials, and evaluated results by running a participatory design workshop. Our main contribution is to show how to elicit ideas for integration of maintenance planning practices with ML estimation tools and the RL agent. Through a participatory design workshop with participants’ observation, in which they played with CBM artefacts, Playful Probes favour the elicitation of user interaction requirements with the RL planning agent to aid the planner to obtain a reliable maintenance plan and turn possible to understand how to represent CBM indicators and visualise them through a trajectory prediction.
Full-text available
This paper presents a Reinforcement Learning (RL) approach to optimize the long-term scheduling of maintenance for an aircraft fleet. The problem considers fleet status, maintenance capacity, and other maintenance constraints to schedule hangar checks for a specified time horizon. The checks are scheduled within an interval, and the goal is to, schedule them as close as possible to their due date. In doing so, the number of checks is reduced, and the fleet availability increases. A Deep Q-learning algorithm is used to optimize the scheduling policy. The model is validated in a real scenario using maintenance data from 45 aircraft. The maintenance plan that is generated with our approach is compared with a previous study, which presented a Dynamic Programming (DP) based approach and airline estimations for the same period. The results show a reduction in the number of checks scheduled, which indicates the potential of RL in solving this problem. The adaptability of RL is also tested by introducing small disturbances in the initial conditions. After training the model with these simulated scenarios, the results show the robustness of the RL approach and its ability to generate efficient maintenance plans in only a few seconds.
Full-text available
The most successful Machine Learning (ML) systems remain complex black boxes to end-users, and even experts are often unable to understand the rationale behind their decisions. The lack of transparency of such systems can have severe consequences or poor uses of limited valuable resources in medical diagnosis, financial decision-making, and in other high-stake domains. Therefore, the issue of ML explanation has experienced a surge in interest from the research community to application domains. While numerous explanation methods have been explored, there is a need for evaluations to quantify the quality of explanation methods to determine whether and to what extent the offered explainability achieves the defined objective, and compare available explanation methods and suggest the best explanation from the comparison for a specific task. This survey paper presents a comprehensive overview of methods proposed in the current literature for the evaluation of ML explanations. We identify properties of explainability from the review of definitions of explainability. The identified properties of explainability are used as objectives that evaluation metrics should achieve. The survey found that the quantitative metrics for both model-based and example-based explanations are primarily used to evaluate the parsimony/simplicity of interpretability, while the quantitative metrics for attribution-based explanations are primarily used to evaluate the soundness of fidelity of explainability. The survey also demonstrated that subjective measures, such as trust and confidence, have been embraced as the focal point for the human-centered evaluation of explainable systems. The paper concludes that the evaluation of ML explanations is a multidisciplinary research topic. It is also not possible to define an implementation of evaluation metrics, which can be applied to all explanation methods.
Full-text available
Recent advances in artificial intelligence (AI) have led to its widespread industrial adoption, with machine learning systems demonstrating superhuman performance in a significant number of tasks. However, this surge in performance, has often been achieved through increased model complexity, turning such systems into “black box” approaches and causing uncertainty regarding the way they operate and, ultimately, the way that they come to decisions. This ambiguity has made it problematic for machine learning systems to be adopted in sensitive yet critical domains, where their value could be immense, such as healthcare. As a result, scientific interest in the field of Explainable Artificial Intelligence (XAI), a field that is concerned with the development of new methods that explain and interpret machine learning models, has been tremendously reignited over recent years. This study focuses on machine learning interpretability methods; more specifically, a literature review and taxonomy of these methods are presented, as well as links to their programming implementations, in the hope that this survey would serve as a reference point for both theorists and practitioners.
Explainable machine learning seeks to provide various stakeholders with insights into model behavior via feature importance scores, counterfactual explanations, and influential samples, among other techniques. Recent advances in this line of work, however, have gone without surveys of how organizations are using these techniques in practice. This study explores how organizations view and use explainability for stakeholder consumption. We find that the majority of deployments are not for end users affected by the model but for machine learning engineers, who use explainability to debug the model itself. There is thus a gap between explainability in practice and the goal of public transparency, since explanations primarily serve internal stakeholders rather than external ones. Our study synthesizes the limitations of current explainability techniques that hamper their use by end users. To facilitate end-user interaction, we develop a framework for establishing clear goals for explainability, including a focus on normative desiderata.
Advances in artificial intelligence (AI) frame opportunities and challenges for user interface design. Principles for human-AI interaction have been discussed in the human-computer interaction community for over two decades, but more study and innovation are needed in light of advances in AI and the growing uses of AI technologies in human-facing applications. We propose 18 generally applicable design guidelines for human-AI interaction. These guidelines are validated through multiple rounds of evaluation including a user study with 49 design practitioners who tested the guidelines against 20 popular AI-infused products. The results verify the relevance of the guidelines over a spectrum of interaction scenarios and reveal gaps in our knowledge, highlighting opportunities for further research. Based on the evaluations, we believe the set of design guidelines can serve as a resource to practitioners working on the design of applications and features that harness AI technologies, and to researchers interested in the further development of human-AI interaction design principles.
From healthcare to criminal justice, artificial intelligence (AI) is increasingly supporting high-consequence human decisions. This has spurred the field of explainable AI (XAI). This paper seeks to strengthen empirical, application-specific investigations of XAI by exploring the theoretical underpinnings of human decision making, drawing from the fields of philosophy and psychology. Based on an extensive review across these fields, we propose a conceptual framework for building human-centered, decision-theory-driven XAI. Drawing on this framework, we identify pathways along which human cognitive patterns drive needs for building XAI, and ways in which XAI can mitigate common cognitive biases. We then put this framework into practice by designing and implementing an explainable clinical diagnostic tool for intensive care phenotyping and conducting a co-design exercise with clinicians. Thereafter, we draw insights into how this framework bridges algorithm-generated explanations and human decision-making theories. Finally, we discuss implications for XAI design and development.
Machine learning advances have afforded an increase in algorithms capable of creating art, music, stories, games, and more. However, it is not yet well understood how machine learning algorithms might best collaborate with people to support creative expression. To investigate how practicing designers perceive the role of AI in the creative process, we developed a game level design tool for Super Mario Bros.-style games with a built-in AI level designer. In this paper we discuss our design of the Morai Maker intelligent tool through two mixed-methods studies with a total of over one hundred participants. Our findings are as follows: (1) level designers vary in their desired interactions with, and role of, the AI; (2) the AI prompted the level designers to alter their design practices; and (3) the level designers perceived the AI as having potential value in their design practice, varying based on their desired role for the AI.
Background. The use of simulation games for complex systems analysis and design has been acknowledged for about 50 years. However, existing articles do not combine all the salient factors for successful simulation games, and often stem from the view of one particular field of science only. By combining multiple disciplines, and connecting analysis with design as well as research with practice, we provide deep insights into the design and use of simulation games. Aim. This article analyzes the design and evaluation process of a variety of game-based projects and activities, using existing scientific concepts and approaches, in order to establish games as a valid research tool. Our focus lies on the use of games as a design instrument: deploying them as an intervention in a larger, complex context in order to design that context. With our contribution, we aim to provide insights and recommendations on the design and use of games as valid research tools, the limitations of this use, possible pitfalls, and best practices. Method. We carried out a literature review of related work to identify the most important scientific concepts related to our approach to game design. Combined quantitative and qualitative case study analyses further highlight the design process and results of our own game studies. Results. The analyses yielded a consolidated conceptualization of simulation games as research instruments in complex systems analysis and design. The results also include methods for the evaluation of simulation games, additional evaluation methods, and limitations on the use of simulation games as research instruments. Conclusions. We propose guidelines for using simulation games as research instruments that may be of value to practitioners and scientists alike. Recommendation. We recommend that practitioners and scientists apply the guidelines presented here in their efforts to analyze and design complex systems.
Advances in artificial intelligence, sensors and big data management have far-reaching societal impacts. As these systems augment our everyday lives, it becomes increasingly important for people to understand them and remain in control. We investigate how HCI researchers can help to develop accountable systems by performing a literature analysis of 289 core papers on explanations and explainable systems, as well as 12,412 citing papers. Using topic modeling, co-occurrence and network analysis, we mapped the research space across diverse domains, such as algorithmic accountability, interpretable machine learning, context-awareness, cognitive psychology, and software learnability. We reveal fading and burgeoning trends in explainable systems, and identify domains that are closely connected or mostly isolated. The time is ripe for the HCI community to ensure that the powerful new autonomous systems have intelligible interfaces built in. From our results, we propose several implications and directions for future research towards this goal.