Measuring Automation Bias and Complacency in an X-Ray Screening Task

Jacob Davis, Dept. of Psychology, The University of Alabama in Huntsville, Huntsville, USA, jrd0024@uah.edu
Andrew Atchley, Dept. of Psychology, The University of Alabama in Huntsville, Huntsville, USA, jaa0035@uah.edu
Hannah Smitherman, Dept. of Psychology, The University of Alabama in Huntsville, Huntsville, USA, hms0017@uah.edu
Hailey Simon, Dept. of Industrial & Systems Engineering, The University of Alabama in Huntsville, Huntsville, USA, hms0018@uah.edu
Nathan Tenhundfeld, Dept. of Psychology, The University of Alabama in Huntsville, Huntsville, USA, nlt0006@uah.edu
Abstract—Automation is becoming ever more prevalent in
industrial system designs, and the aviation security industry is
no exception. Automated decision aids are regularly used in
airport security procedures (as with the TSA) to assist operators
scanning baggage for hazardous items. However, serious concerns exist regarding these human-machine interactions. In
order to safely design systems that rely on human oversight, it is
imperative that we understand the consequences of design on
overall task performance and system usability. To do this, we
combined an x-ray screening research paradigm with a 'wizard-
of-oz' automation verification feature to create a novel research
paradigm for exploring monitoring behavior (complacency) and
performance in a simulated x-ray screening task. The
automation in the x-ray task provided participants with a correct recommendation to search (hazardous items detected) or clear (no hazardous items detected) the baggage 80% of the time. Users' level of complacency was measured by registering the frequency with which they chose to verify the automation by clicking a “Request Info” button. Monitoring
behavior, or the percent of trials in which the user requested
additional information from the automation, was low overall.
However, it was significantly higher when the automation
provided an inaccurate recommendation. These results indicate
that users experienced automation bias, the tendency to agree
with an automated decision aid. Users also exhibited
complacency during the task such that they were no longer
actively monitoring the system. Users may have noticed the
system was unreliable, given an increase in monitoring behavior
in unreliable recommendation trials, but still chose to agree with
the automation rather than visually search the baggage for
evidence. This demonstrates a unique threat to safety in these
domains, wherein users may rely on imperfect automation,
rather than their own abilities, even when they believe
something is amiss.
Keywords—Automation; Bias; Complacency
I. INTRODUCTION
Humans are beginning to work alongside automated
systems in their everyday lives [1]. Automation can be
defined as a machine or system carrying out a task, either
partially or fully, that a human could perform, partially or
fully [2]. Automation is being used ever more frequently in a
variety of settings, including surgical procedures [3], military
operations [4]–[6], and driving [7]–[9]. In the aviation
security industry, automation is used in x-ray screening
procedures to detect explosives [10] and firearms [11]. Once
the automation detects a hazardous item, a signal is rendered
to the operator. The operator must make the final decision to
search or clear the baggage. However, if the operator fails to
remain vigilant when working with imperfect automated
systems, dangerous phenomena may occur.
A. Automation Bias
Automation bias occurs when an operator relies solely on
an automated decision aid’s recommendation without
searching for disconfirming evidence. Humans tend to favor
automated recommendations simply because they are
computerized [12]. This may be especially true for time-
sensitive tasks. Temporal pressures have been shown to
increase the likelihood of bias occurring because operators are
forced to quickly make a decision and continue the task [13].
The current X-ray task featured an automated aid to
recommend a decision, but the operator still made the final
decision. If participants’ performance is greater when the
automation provides a correct recommendation than an
incorrect recommendation, automation bias may have
occurred. This would imply that participants chose to agree
with the automation rather than visually search the baggage
for the presence of hazardous items.
B. Automation-Induced Complacency
When an operator supervises an automated system for a
prolonged period of time, there may be opportunity for
automation complacency to occur. Automation-induced
complacency occurs when an operator begins to passively
monitor an automated system [14]. Complacency results in a
lack of noticing errors in the system [15] and decreases in
performance [14]. The likelihood of complacency occurring may be increased when the system is perceived to be reliable,
as these systems lead to increased trust [16]–[18]. Automation
complacency is dangerous for supervisors of automated
systems, such as X-ray screeners, because errors of omission
may occur more frequently. The error of omission here is
allowing a hazardous item to pass through the screening
procedure. This error could then lead to a life-threatening
situation.
C. The Present Study
The objective of the current study was to implement
methods of measuring both automation bias and automation-
induced complacency in an X-ray screening paradigm. Using
this paradigm, these automation-related phenomena can be
studied in an applied scenario. Previous studies featuring
automated X-ray screening tasks have failed to utilize
automation verification methods [10], [19]–[21]. Automation
verification is a method used to measure vigilance, or the
opposite of complacency, by allowing participants to request
more information from the automation [22]. Higher vigilance
rates are associated with decreased complacency and fewer
errors of omission [23]. By implementing automation
verification into the X-ray screening paradigm, a more
comprehensive understanding of human-automation
interaction can be achieved.
II. METHODS
A. Participants
A convenience sample of undergraduate students (N = 70)
was recruited from introductory psychology courses at The
University of Alabama in Huntsville. Participants (n = 3) were
excluded from data analysis if their performance on the X-ray screening task was more than 2 SD below the mean. Of the
total sample, 32.86% identified as male, 65.71% identified as
female, and 1.43% of participants preferred not to answer.
Additionally, 63.32% were Caucasian, 10.14% were African
American, 10.14% reported multiple races, 2.90% were
Asian, and 14.49% reported a race/ethnicity not listed.
Participants were compensated with course credit but were
given alternative methods of earning credit if they chose not
to participate. All participants were at least 18 years of age.
This study was approved by The University of Alabama in
Huntsville Institutional Review Board.
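For illustration, the exclusion rule corresponds to the following sketch (Python; hypothetical accuracy values, not the analysis code used in the study):

import numpy as np

# Hypothetical overall accuracy scores for six participants.
accuracy = np.array([0.88, 0.91, 0.85, 0.40, 0.89, 0.90])

# Exclude anyone scoring more than 2 SD below the sample mean.
cutoff = accuracy.mean() - 2 * accuracy.std(ddof=1)
included = accuracy[accuracy >= cutoff]
print(f"Excluded {accuracy.size - included.size} participant(s) below {cutoff:.2f}")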
B. Measures
Performance was measured through participants’ correct
decisions when the automation’s recommendation was correct
versus incorrect. To calculate performance on correct decision
aid (DA) trials, participants’ correct decisions were divided by
the number of trials in which the automation provided a
correct recommendation. To calculate performance on
incorrect DA trials, participants’ correct decisions were
divided by the number of trials in which the automation
provided an incorrect recommendation. This resulted in a
percentage ranging from 0% to 100% to represent
performance. Because performance was a measure of correct
agreement or correct disagreement with the automation, it was
used as a measure of automation bias.
Vigilance was measured as the number of trials in which
the participant requested more information from the
automation divided by the total number of trials that
participant completed. This was split between trials in which
the automation provided correct recommendations versus
incorrect recommendations. Subsequently, vigilance on
correct DA trials and incorrect DA trials resulted in a
percentage from 0% to 100%. Vigilance can be
conceptualized as the opposite of automation-induced
complacency behaviors.
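For illustration, the two measures correspond to the following sketch (Python; the trial-log format is assumed, not the study's actual code). Percentages are computed within each decision-aid (DA) condition, so both measures range from 0% to 100%:

from dataclasses import dataclass

@dataclass
class Trial:
    da_correct: bool       # did the decision aid give the correct recommendation?
    user_correct: bool     # did the participant make the correct decision?
    requested_info: bool   # did the participant click "Request Info"?

def rates(trials, da_correct):
    """Performance % and vigilance % for trials where the DA was correct
    (True) or incorrect (False); denominators are per-condition trial counts."""
    subset = [t for t in trials if t.da_correct == da_correct]
    if not subset:
        return 0.0, 0.0
    performance = 100 * sum(t.user_correct for t in subset) / len(subset)
    vigilance = 100 * sum(t.requested_info for t in subset) / len(subset)
    return performance, vigilance

# Example: performance and vigilance on incorrect-DA trials.
trials = [Trial(True, True, False), Trial(False, False, True), Trial(False, True, True)]
print(rates(trials, da_correct=False))  # -> (50.0, 100.0)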
C. Materials and Procedure
Upon arrival at the session, the experimenter ensured the participant was at least 18 years old by verifying the participant's identification. The study began with participants
reading a voluntary consent form. Once informed consent was
given, participants completed a series of questionnaires before
completing the X-ray screening task. Those questionnaires were collected for a separate study, and thus their results are not reported here. The X-ray screening task, as depicted in Fig. 1,
consisted of 160 trials with images of X-rayed baggage. The
images were adapted from [24]. The X-ray stimulus featured
a “wizard-of-oz” automation, a simulated automated aid.
Instructions at the top of every trial read: “Search the baggage
for hazardous items (Knives or Guns). The diagnostic aid will
automatically scan the image and provide a recommendation
to search or clear the baggage. You must still make the final
decision to search or clear the baggage.”
Fig. 1. A screenshot of the simulated x-ray screening task, including the
baggage (center), instructions (top), and automated aid (left). In this trial, the
automation correctly recommends clearing the baggage.
Participants were unaware that the automation was
constructed to provide the correct recommendation 80% of the
time. This reliability percentage resulted in 32 trials in which
the automation provided an incorrect, or unreliable,
recommendation. Furthermore, participants could choose to
request more information from the automation by clicking a
“Request Info” button, which would display whether the
automation had detected a gun, knife, or no weapon. The
information button did not contradict the original
recommendation, even if it was incorrect. This was to aid
participants in their decision making, as they were required to
click either ‘search’ (weapon detected) or ‘clear’ (no weapon
detected) for each piece of baggage. The X-ray screening task
was presented to participants through MATLAB. Once
participants completed the task, they answered another
questionnaire and a demographics form. Finally, they were
debriefed and released.
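The 80% reliability manipulation can be illustrated with a short scheduling sketch (Python; an assumption about implementation, not the MATLAB code used in the study). It also reflects the constraint, noted in the Limitations section, that the first 15 trials were always correct:

import random

N_TRIALS = 160
N_INCORRECT = 32   # 20% of trials carry an incorrect recommendation
N_TRAINING = 15    # the first trials were always correct (see Limitations)

# Shuffle the incorrect recommendations into the remaining trials.
tail = [True] * (N_TRIALS - N_TRAINING - N_INCORRECT) + [False] * N_INCORRECT
random.shuffle(tail)
da_correct_schedule = [True] * N_TRAINING + tail
assert sum(da_correct_schedule) == N_TRIALS - N_INCORRECT  # 128 correct trials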
III. RESULTS
A. Performance Data
Participants made the correct decision more often when
the automation provided a correct recommendation (M =
88.81%; SD = 10.34%) than when the automation was
incorrect (M = 9.29%; SD = 9.86%). This difference was
analyzed with a one-way repeated-measures analysis of variance (ANOVA). As observed in Fig. 2, the ANOVA revealed a significant main effect of DA correctness on performance, F(1, 61) = 1087.07, p < 0.001, ηp² = 0.947.
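For readers reproducing this type of analysis, a minimal sketch of a one-way repeated-measures ANOVA is shown below (Python with statsmodels; hypothetical data, not the study's analysis script). With a single two-level within-subjects factor, the test is equivalent to a paired t-test (F = t²); the same model was applied to the vigilance scores reported next.

import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: one row per participant per DA condition.
df = pd.DataFrame({
    "participant":  [1, 1, 2, 2, 3, 3, 4, 4],
    "da_condition": ["correct", "incorrect"] * 4,
    "performance":  [0.90, 0.06, 0.85, 0.12, 0.92, 0.09, 0.88, 0.10],
})

result = AnovaRM(df, depvar="performance", subject="participant",
                 within=["da_condition"]).fit()
print(result.anova_table)  # F, numerator/denominator df, and p-value for the DA factor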
B. Monitoring Behavior Data
Monitoring behavior was measured as vigilance scores, or
the percent of trials in which the participant requested more
information from the automation out of the total number of
trials completed by that participant. Vigilance scores were
separated into correct DA trials and incorrect DA trials. A one-
way repeated measures ANOVA was conducted to determine
if vigilance was different between correct and incorrect DA
trials. As observed in Fig. 3, the difference between correct
DA trials (M = 25.29%; SD = 32.79%) and incorrect DA trials
(M = 29.86%; SD = 32.71%) was significant, F(1,61) =
18.783, p < 0.001, ηp² = 0.235. This indicates that participants
verified the automation more frequently, on average, when the
automation provided an incorrect recommendation.
C. Participant Bias Data
Participant bias was measured to better understand how
participants responded to individual trials. This was calculated
by examining how participants responded when they made an
incorrect decision and is represented by B”. B” ranges from
-1.0 to 1.0, with -1.0 representing bias towards searching and
1.0 representing bias towards clearing the baggage. Of note,
B” is not a measure of performance, but rather an indication
of what types of errors were made when errors were made.
Said another way, a B” score of 1.0 means all errors made
were towards clearing baggage which had a weapon in it. A
B” score of -1.0 means all errors made were towards searching
baggage which had no weapon. A B” score of 0 means that
errors were evenly distributed, and scores in between
represent different strengths of bias. On average, B” was M =
0.20 with SD = 0.24. To determine if B” was significantly
different from no bias (B” = 0), a one-sample t-test was conducted, resulting in a significant difference and indicating participants were slightly biased towards clearing baggage, t(61) = 6.56, p < 0.001, d = 0.83.
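The paper does not spell out the formula used for B”; a common choice is the nonparametric bias index computed from hit and false-alarm rates, sketched below (Python; hypothetical scores, not the study's computation). Positive values correspond to a conservative criterion, here a tendency to clear the baggage, and a one-sample t-test against zero mirrors the reported analysis.

from scipy import stats

def b_double_prime(hit_rate, fa_rate):
    """Nonparametric response-bias index in [-1, 1]; positive values
    indicate a conservative bias (a tendency to respond 'clear')."""
    h, f = hit_rate, fa_rate
    num = h * (1 - h) - f * (1 - f)
    den = h * (1 - h) + f * (1 - f)
    return num / den if den else 0.0

# Hypothetical per-participant B'' scores, tested against no bias (0).
b_scores = [0.25, 0.10, 0.35, 0.18, 0.05, 0.30]
res = stats.ttest_1samp(b_scores, popmean=0.0)
print(f"mean B'' = {sum(b_scores) / len(b_scores):.2f}, t = {res.statistic:.2f}, p = {res.pvalue:.4f}")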
Fig. 2. Mean performance rates for participants, split by trials with a correct
DA and an incorrect DA. Error bars represent 95% within-subjects
confidence interval.
Fig. 3. Mean vigilance rates for participants, split by trials with a correct
DA and an incorrect DA. Vigilance was measured by the number of trials in
which the participant requested more information from the automation
divided by the total number of trials that participant completed. Error bars
represent 95% within-subjects confidence interval.
IV. DISCUSSION
A. Participants Exhibited Automation Bias
The findings from this study demonstrate that
participants’ performance was significantly better on trials in
which the automation recommended a correct decision than
when the automation recommended an incorrect decision.
Participants seemed to agree with the automation whether it
was correct or incorrect. The overall trend to agree with the
automation can be viewed as automation bias because
participants seemed to rely on the automated
recommendation without visually searching the baggage
itself for the presence of hazardous items [12]. Participants
were unaware of the unreliable nature of the automation and
could have had miscalibrated levels of expectations and trust
[25]. Participants may have perceived the system to be
competent and functioning normally. This may have caused
them to over-rely on the system, which can be interpreted as
automation misuse [26].
The present finding of automation bias replicates previous studies, which have observed the bias across domains and in experts as well as novices [15], [22], [27]. Even training programs and reminding
operators to remain vigilant cannot prevent automation bias
completely [15]. Automation bias is not a novel finding, but
it remains dangerous in its consequences. If an automated
system being used in the aviation security industry were
imperfect, and an operator exhibited automation bias, the
consequences could include life-threatening situations.
B. Participants Exhibited Complacency – To a Degree
Vigilance rates were low overall, M = 24.29%, and
comparable to previous findings [22]. This indicates that
participants exhibited a degree of complacency. The
unexpected finding that participants’ vigilance rates were
higher when the automation provided an incorrect
recommendation than when it provided a correct
recommendation implies that participants were not
completely passive in their monitoring. Participants may have
begun to notice the errors in the system and chose to verify
the automation more frequently when the system
recommended an incorrect decision. However, this finding
combined with the finding of automation bias implies that
participants who noticed the errors in the automation still
agreed with the system. Participants may have been unwilling
to disagree with the automation simply because it was
“computerized,” and may have had more confidence in the
system’s ability to detect hazardous items than their own [12].
This finding demonstrates the complex dynamics of human-
machine teaming. Some individuals may be so willing to trust an automated system that they knowingly agree with a system they suspect may be unreliable. This is a serious risk for all operators of automated systems, and for any industry implementing an automated component into its systems.
C. Limitations and Future Directions
Sampling issues may be present within the current study,
such that the participants in this study interacted with the
system for a shorter period of time than real-world operators
would. Furthermore, real-world operators may be more
motivated to pay attention and complete the task more
accurately [1]. The sample of college-aged students may have
exhibited signs of automation bias and complacency because
they were not invested in the task. Future research should recruit samples that better reflect real-world operators to improve ecological validity.
Several participants (n = 10) in this study chose not to use
the automation verification feature at all. These participants
may not have fully understood the purpose of the “Request
Info” button or they may not have found the button to be
useful. The button may not have given enough additional
information to aid in decision making, and it can be improved
upon in future research. For example, the “Request Info”
button could display a level of confidence in the
recommendation or more specific information, such as the
general location or quadrant of the baggage where the
hazardous item was detected.
Participants may have been artificially biased to perceive
the automation as functioning perfectly. The automation was
designed to provide correct recommendations during the first
15 trials, in order to provide participants with a period to
become familiar with the system. While this was
implemented as a method to “train” the participants on the
system’s function, it may have inadvertently presented the
automated aid as completely reliable. While this approach is not inherently flawed, if it is used in future studies it should be accompanied by a statement that the automation is reliable but not perfect.
D. Potential Solutions for Designers
While these phenomena are found across individuals and
domains [15], there may be methods to prevent these adverse
effects of human-automation interaction. Engineered
solutions may be applied to these automated systems to
combat the effects of automation bias and complacency in
operators. Attention checks, such as catch trials, can be used
to detect operators who have become biased toward agreeing with
the automation or are exhibiting signs of complacency. These
could be implemented randomly or systematically, as long as
they are unexpected by the operator. Fail safes could be
implemented when the automation cannot provide a
confident recommendation. If the automation is not entirely
confident or cannot fully detect a hazardous item, the baggage
should still be searched. Multiple levels of detection could be
implemented, but with the added cost of time. Automated
systems in aviation security procedures could be designed to
apply more liberal detection criteria. False alarms may cost time
and money, but misses could cost lives. Operators should be
aware of the system’s capabilities and limitations. An
increased understanding of the automation can calibrate the
operator’s trust in the system to match its reliability [25].
Finally, adaptive automation, or automation which tracks
operators’ performance in real-time, may be a useful avenue
for designers of automated systems to explore [28]. Solutions
must be offered and tested in order to combat the adverse
effects of human-automation interaction. This is an area
where system design is of utmost importance, as the
consequences of these human-automation related phenomena
may result in life-threatening situations.
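As one concrete illustration of the fail-safe idea above, the following sketch (Python; hypothetical function and threshold, not a fielded system) routes any low-confidence "clear" recommendation to a manual search:

def route_baggage(recommendation, confidence, threshold=0.95):
    """Return 'search' or 'clear' for a screening decision."""
    if recommendation == "search":
        return "search"            # always act on a detection
    if confidence < threshold:
        return "search"            # a low-confidence 'clear' fails safe
    return "clear"

# Example: a 'clear' recommendation at 0.80 confidence is still searched.
print(route_baggage("clear", 0.80))  # -> search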
REFERENCES
[1] S. M. Merritt and D. R. Ilgen, “Not all trust is created equal:
Dispositional and history-based trust in human-automation
interactions,” Hum. Factors, vol. 50, no. 2, pp. 194–210, 2008.
[2] R. Parasuraman, T. B. Sheridan, and C. D. Wickens, “A Model for
Types and Levels of Human Interaction with Automation,” IEEE
Trans. Syst. Man. Cybern., vol. 30, no. 3, pp. 286–297, 2000.
[3] D. Manzey, M. Luz, S. Mueller, A. Dietz, J. Meixensberger, and
G. Strauss, “Automation in surgery: The impact of navigated-
control assistance on performance, workload, situation awareness,
and acquisition of surgical skills,” Hum. Factors, vol. 53, no. 6, pp.
584–599, 2011.
[4] E. J. de Visser and R. Parasuraman, “Adaptive Aiding of Human-
Robot Teaming: Effects of Imperfect Automation on Performance,
Trust, and Workload,” J. Cogn. Eng. Decis. Mak., vol. 5, no. 2, pp.
209–231, 2011.
[5] G. T. Lorenz et al., “Assessing control devices for the supervisory
control of autonomous wingmen,” in 2019 Systems and
Information Engineering Design Symposium, SIEDS 2019, 2019, pp. 1–6.
[6] S. A. Alexander, J. S. Rozo, B. T. Donadio, N. L. Tenhundfeld, E.
J. De Visser, and C. C. Tossell, “Transforming the air force
mission planning process with virtual and augmented reality,” in
2019 Systems and Information Engineering Design Symposium,
SIEDS 2019, 2019, pp. 1–4.
[7] N. L. Tenhundfeld, E. J. de Visser, A. J. Ries, V. S. Finomore, and
C. C. Tossell, “Trust and Distrust of Automated Parking in a Tesla
Model X,” Hum. Factors, pp. 1–18, 2019.
[8] K. Tomzcak et al., “Let Tesla Park Your Tesla: Driver Trust in a
Semi-Automated Car,” in Proceedings of the annual Systems and
Information Engineering Design Symposium (SIEDS)
Conference, 2019.
[9] N. L. Tenhundfeld, E. J. de Visser, K. S. Haring, A. J. Ries, V. S.
Finomore, and C. C. Tossell, “Calibrating trust in automation
through familiarity with the autoparking feature of a Tesla Model
X,” J. Cogn. Eng. Decis. Mak., vol. 13, no. 4, pp. 279–294, 2019.
[10] N. Hättenschwiler, Y. Sterchi, M. Mendes, and A. Schwaninger,
“Automation in airport security X-ray screening of cabin baggage:
Examining benefits and possible implementations of automated
explosives detection,” Appl. Ergon., vol. 72, pp. 58–68,
2018.
[11] D. Mery, G. Mondragon, V. Riffo, and I. Zuccar, “Detection of
regular objects in baggage using multiple X-ray views,” Insight
Non-Destructive Test. Cond. Monit., vol. 55, no. 1, pp. 16–20,
2013.
[12] K. L. Mosier et al., “Automation bias: Decision making and performance in high-tech cockpits,” Int. J. Aviat. Psychol., vol. 8, no. 1, pp. 47–63, 1998.
[13] M. L. Cummings, “Automation Bias in Intelligent Time Critical
Decision Support Systems,” in Decision Making in Aviation, 2018,
pp. 289–294.
[14] R. Parasuraman, R. Molloy, and I. L. Singh, “Performance
Consequences of Automation Induced Complacency,” Int. J.
Aviat. Psychol., vol. 3, no. 1, pp. 1–23, 1993.
[15] R. Parasuraman and D. H. Manzey, “Complacency and bias in
human use of automation: An attentional integration,” Hum.
Factors, vol. 52, no. 3, pp. 381–410, 2010.
[16] D. Ruscio, M. R. Ciceri, and F. Biassoni, “How does a collision
warning system shape driver’s brake response time? The influence
of expectancy and automation complacency on real-life emergency
braking,” Accid. Anal. Prev., vol. 77, pp. 72–81, 2015.
[17] V. A. Banks, A. Eriksson, J. O’Donoghue, and N. A. Stanton, “Is
partially automated driving a bad idea? Observations from an on-
road study.,” Appl. Ergon., vol. 68, pp. 138–145, 2018.
[18] A. Sebok and C. D. Wickens, “Implementing Lumberjacks and
Black Swans into Model-Based Tools to Support Human-
Automation Interaction,” Hum. Factors, vol. 59, no. 2, pp. 189–
203, 2017.
[19] A. Chavaillaz, A. Schwaninger, S. Michel, and J. Sauer,
“Expertise, automation and trust in X-ray screening of cabin
baggage,” Front. Psychol., vol. 10, no. 256, pp. 1–11, 2019.
[20] S. M. Merritt, H. Heimbaugh, J. Lachapell, and D. Lee, “I trust it,
but i don’t know why: Effects of implicit attitudes toward
automation on trust in an automated system,” Hum. Factors, vol.
55, no. 3, pp. 520–534, 2013.
[21] S. M. Merritt, D. Lee, J. L. Unnerstall, and K. Huber, “Are well-
calibrated users effective users? Associations between calibration
of trust and performance on an automation-aided task,” Hum.
Factors, vol. 57, no. 1, pp. 34–47, 2015.
[22] R. Parasuraman, E. de Visser, M. K. Lin, and P. M. Greenwood,
“Dopamine beta hydroxylase genotype identifies individuals less
susceptible to bias in computer-assisted decision making,” PLoS
One, vol. 7, no. 6, 2012.
[23] J. E. Bahner, A. D. Hüper, and D. H. Manzey, “Misuse of
automated decision aids: Complacency, automation bias and the
impact of training experience,” Int. J. Hum. Comput. Stud., vol. 66,
no. 9, pp. 688–699, 2008.
[24] S. M. Merritt, L. Shirase, and G. Foster, “Assessment of vigilance
performance and/or reliance on automated decision aids: an X-ray
screening task,” Unpubl. Manuscr., 2019.
[25] B. M. Muir, “Trust between humans and machines, and the design
of decision aids,” Int. J. Man. Mach. Stud., vol. 27, no. 5–6, pp.
527–539, 1987.
[26] R. Parasuraman and V. Riley, “Humans and Automation: Use,
Misuse, Disuse, Abuse,” Hum. Factors J. Hum. Factors Ergon.
Soc., vol. 39, no. 2, pp. 230–253, 1997.
[27] E. Rovira, K. McGarry, and R. Parasuraman, “Effects of imperfect
automation on decision making in a simulated command and
control task,” Hum. Factors, vol. 49, no. 1, pp. 76–87, 2007.
[28] R. Parasuraman, K. A. Cosenzo, and E. J. de Visser, “Adaptive
automation for human supervision of multiple uninhabited
vehicles: Effects on change detection, situation awareness, and
mental workload,” Mil. Psychol., vol. 21, no. 2, pp. 270–297, 2009.