Poster

What the AI saw: Examining human predictions of deep image classification errors

Authors: Anne Linja, Lamia Alam & Shane T. Mueller

Abstract and Figures

Deep image classifiers have made amazing advances in both basic and applied problems in recent years. Nevertheless, they are still very limited and can be foiled by even simple image distortions. Importantly, the way they fail is often unexpected, and sometimes difficult even to understand. Thus, advances in image classifiers made to improve their transparency and the predictability of their errors may make more of a difference than algorithmic improvements that reduce error rates on benchmarks. To understand the types of expectations humans may have, we conducted a study in which students were asked to predict whether a generic AI system would correctly identify 10 classes of tools (axe, flashlight, hammer, pliers, saw, scissors, screwdriver, shovel, tape measure, wrench), each with a variety of image transforms (e.g., borders, outline filters, additional objects inserted into the image), and we also examined how five commercial deep image classifiers performed on the same imagery. Results revealed that humans tended to predict that distortions and distractions would lead to impairment of the AI systems, and although AI failures did incorporate these factors, they also involved many class-level errors (e.g., calling a wrench a tool or a product) and feature errors (calling a hammer 'metal' or 'wood') not identified by novice human users. Results will be discussed in the context of Explainable AI systems.
What the AI saw:
Examining human predictions of Deep Image Classification Errors

Anne Linja, Lamia Alam & Shane T. Mueller
(alinja@mtu.edu, lalam@mtu.edu, shanem@mtu.edu)
Dept. of Cognitive and Learning Sciences, Michigan Technological University, Houghton, MI

Objectives. To identify gaps between humans' assumptions about Image Classifier output and actual Image Classifier results.

Introduction
Deep image classifiers have made amazing advances in both basic and applied problems in recent years; however, they are still quite limited and can be easily foiled by simple image distortions. Importantly, the way they fail is often unexpected, and sometimes difficult even to understand. To understand the types of expectations humans may have, we conducted a study in which students were asked to predict whether a generic AI system would correctly identify 10 classes of tools, each with a variety of image transforms. We also examined how five commercial deep image classifiers performed on the same imagery. Results revealed that humans tended to predict that distortions and distractions would lead to impairment of the AI systems, and although AI failures did incorporate these factors, they also involved many class-level errors (e.g., calling a wrench a tool or a product) and feature errors (calling a hammer 'metal' or 'wood') not identified by novice human users. Results will be discussed in the context of Explainable AI systems.
Results
Table 1. Coding results of human participants' predictions of AI Image Classifier errors (number of predictions coded into each category by both raters and by either rater).

1. Attention/distraction: both raters 29, either rater 50
2. Lack of sufficiency (important features/outline/shape missing, blocked, or distorted): both raters 45, either rater 67
3. Similarity to other objects not necessarily in the image (category-type error): both raters 6, either rater 11
4. Size/resolution (pixelation, blur, and resolution): both raters 29, either rater 37
5. Irregular angle/orientation: both raters 5, either rater 7
6. Other/not clear or diagnostic: both raters 1, either rater 13
7. Visual segmenting (figure-ground segmenting): both raters 11, either rater 33
Total: both raters 126, either rater 218
Although only 13% of participants predicted that "Distraction" would account for Image Classifier errors, the Image Classifier results were coded as erring due to "Distraction" 23% of the time. Similarly, 5% of participants predicted that "Misclassification" would account for Image Classifier errors, whereas the Image Classifier results were coded as erring due to "Misclassification" 7% of the time. In addition, 22% of the Image Classifier errors were due to color and attribute errors.
Figure 3. AI Image Classification errors.
Figure 1. Original images of Tools: Axe, Flashlight, Hammer, Pliers, Saw,
Scissors, Screwdriver, Shovel, Tape Measure, Wrench
For participants' error predictions, there was significant agreement between the two raters (κ = .658).
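For reference, inter-rater agreement of this kind can be computed with Cohen's kappa; the following is a minimal sketch in Python, where the rater code lists are hypothetical placeholders rather than the actual study data:

    # Compute inter-rater agreement (Cohen's kappa) for two raters' category codes.
    # The code lists below are illustrative placeholders, not the study ratings.
    from sklearn.metrics import cohen_kappa_score

    rater1 = [1, 2, 2, 4, 1, 7, 2, 5]  # category codes assigned by rater 1
    rater2 = [1, 2, 4, 4, 1, 7, 2, 6]  # category codes assigned by rater 2

    kappa = cohen_kappa_score(rater1, rater2)
    print(f"kappa = {kappa:.3f}")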
Methods (Human Participants). 50 undergraduate participants were recruited from the Michigan Technological University participant pool. Participants were shown images of tools, presented either as "original" (Figure 1) or with a transformation (Figure 2), and were asked to categorize the tools by class; they correctly identified the class 98.597% of the time.

Participants were then instructed to consider that an AI Image Classifier would process the same images of the tools, and were asked to comment on the reasons they thought the AI Image Classifier would succeed or fail to correctly identify the class. The stated reasons for failure were extracted and coded by reason (Table 1).
Methods (AI Image Classifiers). Image Classifiers (Amazon, Clarifai, Google, Watson) processed the same images. The top response from each classifier was recorded and coded (Figure 3).
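As an illustration of how a top label can be obtained from one such service, here is a minimal sketch using the Google Cloud Vision client library (the image path is hypothetical; the other services expose broadly similar label-detection endpoints):

    # Query Google Cloud Vision for labels and keep the top response.
    # The image path is a placeholder; client credentials must be configured.
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()

    with open("tools/axe_border.jpg", "rb") as f:  # hypothetical transformed image
        image = vision.Image(content=f.read())

    response = client.label_detection(image=image)
    labels = response.label_annotations  # typically ordered by confidence

    if labels:
        top = labels[0]
        print(f"Top label: {top.description} (score={top.score:.2f})")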
Figure 2. Image Transformations (shown with Axe)
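The kinds of transformations illustrated in Figure 2 (e.g., adding a border or applying an outline filter) can be sketched with the Pillow imaging library; the file names and parameter values below are illustrative, not the study's exact settings:

    # Example image transformations of the kind shown in Figure 2.
    # File names and parameter values are hypothetical.
    from PIL import Image, ImageOps, ImageFilter

    original = Image.open("tools/axe.jpg")  # hypothetical original tool photograph

    bordered = ImageOps.expand(original, border=40, fill="black")   # add a solid border
    outlined = original.convert("L").filter(ImageFilter.CONTOUR)    # outline-style filter

    bordered.save("tools/axe_border.jpg")
    outlined.save("tools/axe_outline.jpg")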
Accuracy of Human and Deep Learning Image Classifiers. The top classification for each image was coded for accuracy. Humans outperformed all classifiers, and Inception A (trained on tools and flowers, but not on the images used in the study) and Inception B (trained on tools and flowers, including the images used in the study) outperformed Amazon, Clarifai, Google, and Watson.
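A minimal sketch of this accuracy coding, assuming each classifier's top label has already been hand-coded as correct or incorrect (the data below are hypothetical stand-ins, not the study results):

    # Top-1 accuracy per classifier from hand-coded correctness judgments.
    # The dictionary is a hypothetical stand-in for the coded study data.
    coded = {
        "Humans":   [1, 1, 1, 0, 1, 1],
        "Amazon":   [1, 0, 1, 0, 0, 1],
        "Clarifai": [1, 0, 0, 0, 1, 1],
    }

    for classifier, correct in coded.items():
        accuracy = sum(correct) / len(correct)
        print(f"{classifier}: {accuracy:.1%} top-1 accuracy")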
Discussion

Results indicate that humans are able to successfully predict Image Classifier errors due to attention/distraction. However, humans were unable to predict the categorical errors that Image Classifiers make when classifying images. The results suggest that humans anthropomorphize the way Image Classifiers will err: they tend to overestimate the visual factors that impair human perception. To bridge the gap between human expectations and the actual performance of AI systems, further studies should be performed to identify humans' specific misconceptions.