Table 1. Coding results of human participants’ predictions of AI Image Classification errors.
What the AI Saw: Examining Human Predictions of Deep Image Classification Errors
Anne Linja, Lamia Alam & Shane T. Mueller
Dept. of Cognitive and Learning Sciences, Michigan Technological University, Houghton, MI
Objectives. To identify gaps between humans’ assumptions about Image Classifier output and actual Image Classifier results.
Deep image classifiers have made impressive advances on both basic and applied problems in recent years; however, they are still limited and easily foiled by image distortions. Importantly, the way they fail is often unexpected and sometimes even difficult to understand. To understand the types of expectations humans may have, we conducted a study in which students were asked to predict whether a generic AI system would correctly identify 10 classes of tools, each shown with a variety of image transforms. We also examined how five commercial deep image classifiers performed on the same imagery. Results revealed that humans tended to predict that distortions and distractions would impair the AI systems. Although AI failures did reflect these factors, they also involved many class-level errors (e.g., calling a wrench a 'tool' or a 'product') and feature errors (e.g., calling a hammer 'metal' or 'wood') that novice human users did not anticipate. Results will be discussed in the context of Explainable AI systems.
•Although only 13% of participants predicted that “Distraction” would account for Image Classifier errors, 23% of Image Classifier errors were coded as due to “Distraction.”
•Results indicate that humans are able to successfully predict Classifier
errors due to attention/distraction.
•However, humans were unable to predict the categorical errors that Image
Classifiers make when classifying images.
•Results suggest that humans anthropomorphize the way Image Classifiers err when classifying images: they tend to overestimate the importance of visual factors that impair human perception.
•To bridge the gap between human expectations and the actual performance of AI systems, further studies are needed to identify humans’ specific misconceptions.
1. Attention/distraction 29 -7200012 50
2. Lack of sufficiency: important features/outline/shape missing, blocked, or distorted 45 -15054 67
3. Similarity to other objects not necessarily in the image (category-type error) 6 - 0 0 2 0 11
4. Size/resolution: pixelation, blur, and resolution 29 - 0 1 2 37
5. Irregular angle/orientation 5 - 1 1 7
6. Other/not clear or diagnostic 1 - 3 13
7. Visual segmenting (figure-ground segmentation) 11 -33
•Only 5% of participants predicted that “Misclassification” would account for Image Classifier errors; however, 7% of Image Classifier errors were coded as “Misclassification.”
•22% of the Image Classifier errors were due to color and attribute errors.
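The percentages above compare how often a category appeared among participants’ predicted reasons with how often it appeared among the coded classifier errors. The sketch below shows that arithmetic under one assumption: each coded reason or error carries exactly one category label. The helper name `category_shares` and all labels are hypothetical and are not the study’s data.

```python
from collections import Counter

def category_shares(codes):
    """Return each coded category's share of all coded items, as a percentage."""
    counts = Counter(codes)
    total = sum(counts.values())
    return {cat: 100.0 * n / total for cat, n in counts.items()}

# Hypothetical coded labels (NOT the study's data), one label per reason or error.
participant_codes = (["distraction"] * 2 + ["sufficiency"] * 5
                     + ["misclassification"] * 1 + ["resolution"] * 2)
classifier_codes = (["distraction"] * 3 + ["sufficiency"] * 3
                    + ["misclassification"] * 1 + ["attribute"] * 3)

predicted = category_shares(participant_codes)
observed = category_shares(classifier_codes)

# Side-by-side comparison; categories missing from one side count as 0%.
for cat in sorted(set(predicted) | set(observed)):
    print(f"{cat:18s} predicted {predicted.get(cat, 0.0):5.1f}%  "
          f"observed {observed.get(cat, 0.0):5.1f}%")
```

A category such as "attribute" that appears only among classifier errors shows up as 0% predicted, which is the kind of gap the poster reports for color and attribute errors.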
Figure 3. AI Image Classification errors.
Figure 1. Original images of Tools: Axe, Flashlight, Hammer, Pliers, Saw,
Scissors, Screwdriver, Shovel, Tape Measure, Wrench
•For coding of participants’ error predictions, there was significant agreement between the raters (κ = .658).
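The poster reports an agreement coefficient without naming the statistic. Assuming it is Cohen’s kappa for two raters (an assumption, not stated in the source), it corrects observed agreement for agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). A minimal sketch follows; the category labels in the example are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who each assign one category per item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must code the same, non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items both raters coded identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over categories of the product of each rater's marginal proportions.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical category throughout; kappa is undefined.
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes from two raters for four participant comments.
rater_1 = ["distraction", "distraction", "sufficiency", "sufficiency"]
rater_2 = ["distraction", "sufficiency", "sufficiency", "sufficiency"]
print(cohens_kappa(rater_1, rater_2))
```

Here the raters agree on 3 of 4 items (p_o = .75), chance agreement is .5, so κ = .5. For real analyses, an established implementation such as scikit-learn’s `cohen_kappa_score` is preferable to a hand-rolled version.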
Methods. 50 undergraduate participants were
recruited from the Michigan Technological
University participant pool.
•Participants were shown images of tools. Images
were shown as “original” (Figure 1) or with a
transformation (Figure 2).
•Participants were asked to categorize tools by
•Participants correctly identified class 98.597% of
•Participants were then told that an AI Image Classifier would process the images of the tools, and were asked to explain why they thought the AI Image Classifier would succeed or fail to correctly identify each tool’s class.
•Reasons for failure were extracted, parsed by
(AI Image Classifiers).
•Image Classifiers (Amazon, Clarifai, Google,
Watson) processed the same images.
•The top response was recorded and coded
Figure 2. Image Transformations (shown with Axe)
Accuracy of Human and Deep Learning Image Classifiers.
Top classification for each image was coded for accuracy. Humans outperformed all classifiers. Inception A (trained on tool and flower images, but not on the images used in the study) and Inception B (trained on tool and flower images, including the images used in the study) outperformed Amazon, Clarifai, Google, and Watson.
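Scoring only the top classification amounts to top-1 accuracy. The sketch below illustrates it under a simplifying assumption: a response counts as correct only if the highest-ranked label exactly matches the tool class, case-insensitively. The study’s human coding may have accepted synonyms. The image IDs and ranked outputs are hypothetical and are not real API responses.

```python
def top1_accuracy(predictions, truth):
    """Share of images whose top-ranked label matches the true tool class."""
    if not predictions:
        raise ValueError("no predictions to score")
    correct = sum(
        1 for image_id, labels in predictions.items()
        if labels and labels[0].lower() == truth[image_id].lower()
    )
    return correct / len(predictions)

# Hypothetical ground truth and ranked classifier outputs (highest confidence first).
truth = {"axe_orig": "axe", "axe_blur": "axe", "wrench_orig": "wrench"}
predictions = {
    "axe_orig": ["axe", "tool"],
    "axe_blur": ["tool", "axe"],         # class-level error: "tool" ranked above "axe"
    "wrench_orig": ["metal", "wrench"],  # feature error: "metal" ranked above "wrench"
}
print(top1_accuracy(predictions, truth))
```

Because only the first label is scored, a classifier that lists the correct class second still gets no credit. That is how class-level and feature errors lower its accuracy.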