
Two graphs walk into a bar: Readout-based measurement reveals the Bar-Tip Limit error, a common, categorical misinterpretation of mean bar graphs



How do viewers interpret graphs that abstract away from individual-level data to present only summaries of data such as means, intervals, distribution shapes, or effect sizes? Here, focusing on the mean bar graph as a prototypical example of such an abstracted presentation, we contribute three advances to the study of graph interpretation. First, we distill principles for Measurement of Abstract Graph Interpretation (MAGI principles) to guide the collection of valid interpretation data from viewers who may vary in expertise. Second, using these principles, we create the Draw Datapoints on Graphs (DDoG) measure, which collects drawn readouts (concrete, detailed, visuospatial records of thought) as a revealing window into each person's interpretation of a given graph. Third, using this new measure, we discover a common, categorical error in the interpretation of mean bar graphs: the Bar-Tip Limit (BTL) error. The BTL error is an apparent conflation of mean bar graphs with count bar graphs. It occurs when the raw data are assumed to be limited by the bar-tip, as in a count bar graph, rather than distributed across the bar-tip, as in a mean bar graph. In a large, demographically diverse sample, we observe the BTL error in about one in five persons; across educational levels, ages, and genders; and despite thoughtful responding and relevant foundational knowledge. The BTL error provides a case-in-point that simplification via abstraction in graph design can risk severe, high-prevalence misinterpretation. The ease with which our readout-based DDoG measure reveals the nature and likely cognitive mechanisms of the BTL error speaks to the value of both its readout-based approach and the MAGI principles that guided its creation. We conclude that mean bar graphs may be misinterpreted by a large portion of the population, and that enhanced measurement tools and strategies, like those introduced here, can fuel progress in the scientific study of graph interpretation.
Journal of Vision (2021) 21(12):17, 1–36 1
Sarah H. Kerns Department of Psychology, Wellesley College,
Wellesley, MA, USA
Jeremy B. Wilmer Department of Psychology, Wellesley College,
Wellesley, MA, USA
Background on bar graphs
How can one maximize the chance that visually
conveyed quantitative information is accurately and
eciently received? This question is fundamental to
data-driven elds, both applied (policy, education,
medicine, business, engineering) and basic (physics,
psychological science, computer science). Long a crux
of debate on this question, “mean bar graphs”—bar
graphs depicting mean values—are both widely used,
for their familiarity and clean visual impact, and
widely criticized, for their abstract nature and paucity
of information (Tufte & Graves-Morris, 1983, p. 96;
Wainer, 1984; Drummond & Vowler, 2011; Weissgerber
et al., 2015; Larson-Hall, 2017; Rousselet, Pernet,
& Wilcox, 2017; Pastore, Lionetti, & Altoe, 2017;
Weissgerber et al., 2019; Vail & Wilkinson, 2020).
Conversely, bar graphs are considered a best-practice
when conveying counts—whether raw or scaled into
proportions or percentages. These “count bar graphs”
are more concrete and hide less information than mean
bar graphs. They are usefully extensible: bars may be
stacked on top of each other to convey parts of a whole
(a stacked bar graph) or arrayed next to each other
to convey a distribution (a histogram). And they take
advantage of a core bar graph strength: the alignment
of bars at a common baseline supports rapid, precise
height estimates and comparisons (Cleveland & McGill,
1984; Heer & Bostock, 2010).
When a bar represents a count, it can be thought
of as a stack, and the metaphor of bars-as-stacks is
relatively accessible across ages and expertise levels
(Zubiaga & MacNamee, 2016). In fact, the introduction
of bar graphs in elementary education is often
accomplished in just this way: with manipulative stacks
(Figure 1a), translated to more abstract drawn stacks
(Figure 1b), and further abstracted to undivided bars
(Figure 1c).
Figure 1. Elementary progression of count bar graph
instruction. (a) Children are taught first using manipulatives;
(b) then they transition to drawn stacks; (c) finally, they are
introduced to undivided bars.
Citation: Kerns, S. H., & Wilmer, J. B. (2021). Two graphs walk into a bar: Readout-based measurement reveals the
Bar-Tip Limit error, a common, categorical misinterpretation of mean bar graphs. Journal of Vision, 21(12):17, 1–36.
Received July 20, 2020; published November 30, 2021. ISSN 1534-7362. Copyright 2021 The Authors.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
While the mean bar graph potentially borrows
some virtues from the count bar graph, the use of
a single visual symbol (the bar, Figures 2a, 2c) to
represent two profoundly different quantities (means
and counts, Figures 2b, 2d) adds inherent ambiguity to
the interpretation of that visual symbol.
The information conveyed by these two visually
identical bar graph types differs categorically, as
Figure 2 shows. Because a count bar graph depicts
summation, its bar-tip is the limit of the individual-level
data, which are contained entirely within the bar
(Figures 2d, 2e). We call this a Bar-Tip Limit (BTL)
distribution. A mean bar graph, in contrast, depicts
a central tendency; it uses the bar-tip as the balanced
center point, or mean, and the individual-level data are
distributed across that bar-tip (Figure 2b). We call this
a Bar-Tip Mean distribution.
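To make the distinction concrete, the following minimal Python sketch contrasts the two distributions. The scores and the helper name share_above are hypothetical and purely illustrative; they are not data or code from the present study.

```python
from statistics import mean

def share_above(values, bar_tip):
    """Return the fraction of individual values that lie above the bar-tip."""
    return sum(v > bar_tip for v in values) / len(values)

# Hypothetical individual-level scores for one group (illustrative only).
scores = [2, 4, 5, 6, 8]

# Mean bar graph: the bar-tip sits at the mean, so values fall on both sides.
mean_tip = mean(scores)
mean_share_above = share_above(scores, mean_tip)

# Count bar graph: the bar-tip sits at the count. Each counted unit occupies
# one level of the stack (1, 2, ..., n), so nothing lies above the bar-tip.
count_tip = len(scores)
stack_levels = list(range(1, count_tip + 1))
count_share_above = share_above(stack_levels, count_tip)

print(f"mean bar:  tip={mean_tip}, share above tip={mean_share_above}")
print(f"count bar: tip={count_tip}, share above tip={count_share_above}")
```

In this toy case, part of the data lies above the tip of the mean bar, whereas none lies above the tip of the count bar, which is the categorical difference described above.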
The question of mean bar graph accessibility
While the rationale for using mean bar graphs—or
mean graphs more generally—varies by intended
audience, a common theme is visual simplification.
Potentially, the simplification of substituting a
single mean value for many raw data values could
ease comparison, aid pattern-seeking, enhance
discriminability, reduce clutter, or remove distraction
(Barton & Barton, 1987; Franzblau & Chung, 2012).
Communications intended for nonexpert consumers,
such as introductory textbooks, may favor mean bar
graphs because their visual simplicity is assumed to
yield accessibility (Angra & Gardner, 2017). In contrast,
communications intended for experts, such as scholarly
publications, may favor mean bar graphs because their
visual simplicity is assumed to yield eciency (Barton
& Barton, 1987). In either case, the aim is to enhance
communication. Yet simplication that abstracts away
from the concrete underlying data could also mislead if
the viewer fails to accurately intuit the nature of that
underlying data.
Visual simplication can, in some cases, enhance
comprehension. Yet is this the case for mean bar
graphs? Basic research in vision science and psychology
provides at least four theoretical causes to doubt
the accessibility of mean bar graphs. First, in many
domains, abstraction, per se, reduces understandability,
particularly in nonexperts (Fyfe et al., 2014;Nguyen
et al., 2020). Second, less-expert consumers may
lack sucient familiarity with a mean bar graph’s
dependent variable to accurately intuit the likely
range, variability, or shape of the distribution that it
represents. In this context, the natural tendency to
discount variation (Moore et al., 2015) and exaggerate
dichotomy (Fisher & Keil, 2018) could potentially
Figure 2. Data distribution differs categorically between mean
and count graphs. (a) Mean bar graphs and (c) count bar graphs
do not differ in basic appearance, but they do depict
categorically different data distributions. (b) In a mean bar
graph, the bar-tip is the balanced center point, or mean, with
the data distributed across it. We call this a Bar-Tip Mean
distribution. (d, e) In a count bar graph, the bar-tip acts as a
limit, containing the summed data within the bar. We call this a
Bar-Tip Limit (BTL) distribution.
Journal of Vision (2021) 21(12):17, 1–36 Kerns & Wilmer 3
distort interpretations. Third, the visual salience of a
bar could be a double-edged sword, initially attracting
attention, but then spreading it across the bar’s full
extent, in eect tugging one’s focus away from the
mean value represented by the bar-tip (Egly et al.,
1994). Finally, the formal visual simplicity (i.e., low
information content) of the bar does not guarantee
that it is processed more eectively by the visual system
than a more complex stimulus. Many high-information
stimuli like faces (Dobs et al., 2019) and scenes (Thorpe
et al., 1996) are processed more rapidly and accurately
than low-information stimuli such as colored circles or
single letters (Li et al., 2002). Relatedly, a set of dots is
accurately and eciently processed into a mean spatial
location (Alvarez & Oliva, 2008), raising questions
about what is gained by replacing datapoints on a graph
with their mean value. Together, these theory-driven
doubts provided a key source of broad motivation for
the present work.
Adding to the broad concerns raised by basic vision
science and psychology research are numerous specific,
direct critiques of mean bar graphs in particular. Such
critiques, however, are chiefly theoretical, as opposed to
empirical, and focus on expert audiences, as opposed
to more general audiences (Tufte & Graves-Morris,
1983, p. 96; Wainer, 1984; Drummond & Vowler, 2011;
Weissgerber et al., 2015; Larson-Hall, 2017; Rousselet,
Pernet, & Wilcox, 2017; Pastore, Lionetti, & Altoe,
2017; Weissgerber et al., 2019; Vail & Wilkinson,
2020). In contrast, our present work provides a direct,
empirical test of the claim that mean bar graphs are
accessible to a general audience (see Related works and
Results for a detailed discussion of prior work of this kind).
Aim and process for current work
To assess mean bar graph accessibility, we needed an
informative measure of abstract-graph interpretation;
that is, a measure of how graphs that abstract away from
individual-level data to present only summary-level, or
aggregate-level, information, are interpreted. Over more
than a decade, our lab has developed cognitive measures
in areas as diverse as sustained attention, aesthetic
preferences, stereoscopic vision, face recognition,
visual motion perception, number sense, novel object
recognition, visuomotor control, general cognitive
ability, emotion identication, and trust perception
(Wilmer & Nakayama, 2007;Wilmer, 2008;Wilmer
et al., 2012;Degutis et al., 2013;Halberda et al., 2012;
Germine et al., 2015;Fortenbaugh et al., 2015;Richler,
Wilmer, & Gauthier, 2017;Deveney et al., 2018;
Sutherland et al., 2020).
The decision to develop a new measure is never
easy; it takes a major investment of time and energy to
iteratively rene and properly validate a new measure.
Yet the accessibility of abstract-graphs presented a
special measurement challenge that we felt warranted
such an investment. The challenge: record the viewer’s
conception of an invisible entity—the individual-level
data underlying an abstract-graph—while neither
invoking potentially unfamiliar statistical concepts nor
explaining those concepts in a way that could distort
the response.
Our lab’s standard process for developing measures
has evolved to include three components: (1) Identify
guiding principles (Wilmer, 2008; Wilmer et al., 2012;
Degutis et al., 2013). (2) Design and refine a measure
(Wilmer & Nakayama, 2007; Wilmer et al., 2012;
Halberda et al., 2012; Degutis et al., 2013; Germine
et al., 2015; Fortenbaugh et al., 2015; Richler, Wilmer,
& Gauthier, 2017; Deveney et al., 2018; Kerns, 2019;
Sutherland et al., 2020). (3) Apply the measure to a
question of theoretical or practical importance (Wilmer
& Nakayama, 2007; Wilmer et al., 2012; Halberda
et al., 2012; Degutis et al., 2013; Germine et al., 2015;
Fortenbaugh et al., 2015; Richler, Wilmer, & Gauthier,
2017; Deveney et al., 2018; Kerns, 2019; Sutherland
et al., 2020). We devoted over a year of intensive
piloting to the refinement of these three components
for the present project. While the parallel nature of
the development process produced a high degree of
complementarity between the three, each represents its
own separate and unique contribution.
A turning point in the development process
was our rediscovery of a group of drawing-based
neuropsychological tasks that have been used to probe
for pathology of perception, attention, and cognition
in brain damaged patients (Landau et al., 2006; Smith,
2009). In one such task, the patient is asked to draw a
clock, and pathological inattention to the left side of
visual space is detected via the bunching of numbers
to the right side of the clock (Figure 3). This clock
drawing test exhibits a feature that became central to
the present work: readout-based measurement.
Figure 3. Readout of a clock drawn by a hemispatial neglect
patient. (AllochiriaClock.png) The Draw Datapoints on Graphs (DDoG)
measure was inspired by readout-based neuropsychological
tasks like the one that produced this distorted clock drawing.
Such readout-based tasks have long been used with brain
damaged patients to probe for pathology of perception,
attention, and cognition (Smith, 2009).
Table 1. Measurement of Abstract Graph Interpretation (MAGI) principles.
A readout is a concrete, detailed, visuospatial record
of thought. Measurement via readout harnesses the
viewer’s capacity to transcribe their own thoughts,
with relative fullness and accuracy, when provided
with a response format that is suciently direct,
accessible, rich, and expressive. As the clock drawing
task in Figure 3 seeks to read out remembered numeral
positions on a clock face, our measure seeks to read out
assumed datapoint positions on a graph.
Contribution 1: Identify guiding principles
Distillation of the Measurement of Abstract Graph
Interpretation (MAGI) principles
The rst of this paper’s three contributions is the
introduction of six succinct, actionable principles to
guide collection of graph interpretation data (Table 1).
These principles are aimed specically at common graph
types (e.g., bar/line/box/violin) and graph markings
(e.g., condence/prediction/interquartile interval) that
abstract away from individual-level data to present
aggregate-level information. These Measurement of
Abstract-Graph Interpretation (MAGI) principles
center around two core aims, general usage and valid measurement.
General usage—that is, testing of a general
population that may vary in statistical and/or content
expertise—is most directly facilitated if a measure: (1)
avoids constraining the response via limited options
(Expressive freedom); (2) avoids priming the response
via suggestive instructions (Limited instructions);
and (3) avoids obscuring or hindering thinking via
unnecessary interpretive or translational steps (Limited
mental transformations).
Figure 4. The Draw Datapoints on Graph (DDoG) measure
maintains the graph as a consistent reference frame across its
three stages. (a) Participants are presented with a graph
stimulus that (b) produces a mental representation of the data;
(c) this interpretation is recorded by sketching a version of the
graph along with hypothesized locations of individual data
values. Drawings are representative examples of the two
common responses seen in pilot data collection: the correct
Bar-Tip Mean response (top), and the incorrect Bar-Tip Limit
(BTL) response (bottom).
Figure 5. The DDoG measure implements the MAGI principles.
The DDoG measure collects readouts of abstract-graph
interpretation. Shown are the major pieces of the DDoG
measure’s procedure, with the relevant MAGI principle(s) in
parentheses. (a) Drawing page: showed the participant how to
set up their paper for drawing (Expressive freedom, Limited
instructions). (b) Instructions: explained what to draw (Limited
instructions). (c) Stimulus graph: was the graph to be
interpreted (Ecological validity, Ground-truth linkage, Limited
mental transformations). (d) Figure caption: was presented
with each stimulus graph to help clarify its content (Ecological
validity). (e–f) Representative readouts: four-graph readouts
from two separate participants, with readouts of graph
stimulus shown in c outlined in purple (we refer to this stimulus
below as AGE) (Expressive freedom, Information richness,
Limited mental transformations). Representative AGE stimulus
graph (c) sketches are outlined in purple. Representative
readouts demonstrate (e) the correct Bar-Tip Mean response
and (f) the incorrect Bar-Tip Limit (BTL) response.
Valid measurement is facilitated by all six principles.
The three principles just mentioned—Expressive
freedom, Limited instructions, and Limited mental
transformations—all help responses to more directly
reflect the viewer’s actual graph interpretation. This
is especially true of Limited mental transformations.
As we will see below, a set of mental transformations
embedded in a popular probability rating scale
measure delayed for over a decade the elucidation of a
phenomenon that our investigation here reveals.
Valid measurement is additionally facilitated by: (1)
objective scoring, via the existence of correct answers
(Ground-truth linkage); (2) real-world applicability of
results, via the sourcing of real graphs (Ecological
validity); and (3) a clear, high-resolution window into
viewers’ thinking, via a response that has high detail
and bandwidth (Information richness).
As shown in the next two sections, the MAGI
principles provided a foundational conceptual basis
for the creation of the Draw Datapoints on Graphs
(DDoG) measure and for understanding the ease with
which that DDoG measure identified the common,
categorical Bar-Tip Limit (BTL) error. These principles
additionally provided a structure for comparing our
new measure with existing measures (Related works).
In these ways, we demonstrate the utility of the MAGI
principles as a metric for the creation, evaluation, and
comparison of graph interpretation measures.
Contribution 2: Design and refine a measure
Creation of the Draw Datapoints on Graphs (DDoG) measure
Our second contribution is the use of the MAGI
principles to create a readout-based measure of graph
comprehension. We call this the Draw Datapoints
on Graphs (DDoG) measure. Figure 4 illustrates the
DDoG measure’s basic approach: it uses a graph as
the stimulus (Figure 4a); the graph is interpreted (4b);
and this interpretation is then recorded by sketching a
version of the graph (4c). The reference-frame of the
graph is retained throughout.
Figure 5 shows a more detailed schematic of the
DDoG procedure used in the present study, which
expresses the six MAGI principles (Table 1) as follows:
The instructions (Figure 5b) are succinct, concrete,
and task-directed (Limited instructions). The graph
stimulus (5c) and its caption (5d) are taken directly from
a textbook (Ecological validity). The drawing-based
response medium (5a, e, f) provides the participant
with flexibility to record their thoughts in a relatively
unconstrained manner (Expressive freedom). The
matched format between the stimulus (5c) and the
response (5e, 5f, purple outline) minimizes the mental
transformations required to record an interpretation
(4b) (Limited mental transformations). The extent
of the response—160 drawn datapoints, each on a
continuous scale (5e, 5f)—provides a high-resolution
window into the thought-process (Information richness).
Finally, the spatial and numerical concreteness of the
readout (5e, 5f) allows easy comparison to a variety
of logical and empirical benchmarks of ground-truth
(Ground-truth linkage). In these ways, the DDoG
measure demonstrates each of the MAGI principles.
Figure 6. The difference between Bar-Tip Mean and Bar-Tip Limit thinking is easily observable. (a) Cartoon and (b) readout examples
of the two common DDoG measure responses: the correct Bar-Tip Mean response (top) and the incorrect Bar-Tip Limit (BTL) response
(bottom). These readouts were all drawn for the same stimulus graph, which we refer to as AGE (see Figure 11).
While we use the DDoG measure here to record
interpretations of mean bar graphs, it can easily be
applied to any other abstract-graph; that is, any graph
that replaces individual-level data with summary-level information.
Contribution 3: Apply the measure
Using the DDoG measure to test the accessibility of mean
bar graphs
From our earliest DDoG measure pilots looking
at mean bar graph accessibility, a substantial subset
of participants drew a very dierent distribution
of data than the rest (Figure 6,Figures 5e&f,and
Figure 4c).
When instructed to draw underlying datapoints for
mean bar graphs (Figure 5b), most participants drew a
(correct) distribution with datapoints balanced across
the bar-tip (Figure 6ab, top). A substantial subset,
however, drew most, or all, datapoints within the bar
(Figure 6ab, bottom), treating the bar-tip as a limit.
This minority response would have been correct for a
count bar graph (as shown in Figure 2d), but it was
severely incorrect for a mean bar graph (Figure 2b).
We named this incorrect response pattern the Bar-Tip
Limit (BTL) error.
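The contrast between the two response patterns can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the example readouts, and the 0.9 cutoff are all hypothetical, and the sketch is not the scoring procedure used in the present study.

```python
def share_within_bar(drawn_values, bar_tip):
    """Return the fraction of drawn datapoints at or below the bar-tip."""
    if not drawn_values:
        raise ValueError("readout contains no datapoints")
    return sum(v <= bar_tip for v in drawn_values) / len(drawn_values)

def label_readout(drawn_values, bar_tip, limit_cutoff=0.9):
    """Label a readout by how much of it falls within the bar.

    limit_cutoff is an arbitrary, hypothetical threshold chosen for
    illustration; it is not a value taken from the paper.
    """
    if share_within_bar(drawn_values, bar_tip) >= limit_cutoff:
        return "Bar-Tip Limit"
    return "Bar-Tip Mean"

# Hypothetical readouts for a bar whose tip is at 10 (illustrative only).
bar_tip = 10.0
balanced_readout = [7, 8, 9, 9.5, 10, 10.5, 11, 11.5, 12, 13]
within_bar_readout = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print(label_readout(balanced_readout, bar_tip))
print(label_readout(within_bar_readout, bar_tip))
```

A readout balanced across the bar-tip places only about half of its datapoints within the bar, whereas a readout that treats the bar-tip as a limit places nearly all of them there.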
The dataset that we examine in Results contains
over three orders of magnitude more drawn datapoints
than the two early-pilot drawings shown in Figure 4c:
44,000 datapoints, drawn in 551 sketches, by 149
participants. This far larger dataset yields powerful
insights into the BTL error by establishing its
categorical nature, high prevalence, stability within
individuals, likely developmental influences, persistence
despite thoughtful responding, and independence from
foundational knowledge and graph content. Yet, in
a testament to the DDoG measure’s incisiveness, the
two early pilot drawings shown in Figure 4c already
substantially convey all three of the core contributions
of the present paper: a severe error in mean bar graph
interpretation (the BTL error), saliently revealed by a
new readout-based measure (the DDoG measure) that
collects high-quality graph interpretations from experts
and nonexperts alike (using the MAGI principles).
Moreover, it takes only a few readouts to move
beyond mere identication of the BTL error, toward
elucidation of an apparent mechanism. Considering just
the readouts we have seen so far, observe the stereotyped
nature of the BTL error response across participants
(Figure 6b), graph form, and content (Figures 5e & f).
Notice the similarity of this stereotyped response to the
Bar-Tip Limit data shown in Figure 2d. These clues
align perfectly with a mechanism of conflation, where
mean bar graphs are incorrectly interpreted as count
bar graphs (Figure 2).
In retrospect, the stage was clearly set for such a
conation. The use of one graph type (bar) to represent
two fundamentally dierent types of data (counts
and means) makes it dicult to visually dierentiate
the two (Figure 2) and sets up the inherent cognitive
conict that is the source of this paper’s title (“Two
graphs walk into a bar”). Additionally, the physical
stacking metaphor for count bar graphs (Figure 1)
makes a Bar-Tip Limit interpretation arguably more
straightforward and intuitive; and the common use of
count bar graphs as a curricular launching pad to early
education could easily solidify the Bar-Tip Limit idea
as a familiar, well-worn path for interpretation of bar
graphs (Figure 1).
Yet, remarkably, none of the many prior theoretical
and empirical critiques of mean bar graphs considered
that such a conation might occur (Tufte & Graves-
Morris, 1983,p.96;Wainer, 1984;Drummond &
Vowler, 2011;Weissgerber et al., 2015;Larson-Hall,
2017;Rousselet, Pernet, & Wilcox, 2017;Pastore,
Lionetti, & Altoe, 2017;Weissgerber et al., 2019;Vai l
& Wilkinson, 2020;Newman & Scholl, 2012;Correll
& Gleicher, 2014;Pentoney & Berger, 2016;Okan
et al., 2018). The concrete, granular window into graph
interpretation provided by DDoG measure readouts,
however, elucidates both phenomenon and apparent
mechanism with ease.
Related works
Here we embed our three main contributions—the
Measurement of Abstract Graph Interpretation
(MAGI) principles, the Draw Datapoints on Graphs
(DDoG) measure, and the Bar-Tip Limit (BTL) error
—into related psychological, vision science, and data
visualization literatures.
Literature related to the DDoG measure
Classification of the DDoG measure
A core design feature of the DDoG measure is
its “elicited graph” measurement approach whereby
a graph is produced as the response. This approach
can, in turn, be placed within two increasingly
broad categories: readout-based measurement and
graphical elicitation measurement. Like elicited graph
measurement, readout-based measurement produces a
detailed, visuospatial record of thought; yet its content
is broader, encompassing nongraph products such
as the clock discussed above (Figure 3). Graphical
elicitation, broader still, encompasses any measure with
a visuospatial response, regardless of detail or content
(Hullman et al., 2018; a similar term, graphic elicitation,
refers to visuospatial stimuli, not responses, Crilly et al.,
2006). Having embedded the DDoG measure within
these three nested measurement categories—elicited
graph, readout-based, and graphical elicitation—we
next use these categories to distinguish it from existing
measures of graph cognition.
Vision science measures
Vision science has long been an important source
of graph cognition measures (Cleveland & McGill,
1984; Heer & Bostock, 2010), and recent years have
seen an accelerated adoption of vision science measures
in studies of data visualization (hereafter datavis). Of
particular interest is a recent tutorial paper by Elliott
and colleagues (2020) that cataloged behavioral vision
science measures with relevance to datavis. Despite its
impressive breadth—laying out nine measures with six
response types and 11 direct applications (Elliott et al.,
2020)—not a single elicited graph, readout-based, or
graphical elicitation approach was included. This fits
with our own experience that such methods are rarely
used in vision science.
This rarity is even more surprising given that the
classic drawing tasks that helped to inspire our current
readout-focused approach are well known to vision
scientists (Landau et al., 2006; Smith, 2009; Figure 3);
indeed, they are commonly featured in Sensation and
Perception textbooks (Wolfe et al., 2020). Yet usage of
such methods within vision science has been narrow;
restricted primarily to studies of extreme deficits in
single individuals (Wolfe et al., 2020).
It is unclear why readout-based measurement is
uncommon in broader vision science research. Perhaps
the relative diculty of structuring a readout-based
task to provide a consistent quantitative measurement
across multiple participants has been prohibitive. Or
maybe reliable collection of drawn samples from a
distance was logistically untenable until the recent
advent of smartphones with cameras and accessible
image-sharing technologies. We hypothesize that factors
such as these may have limited the use of readout-based
measurement in vision science, and we believe the
proof-of-concept provided by the MAGI principles
(Table 1) and the DDoG measure (Figure 4) supports
their broader use in the future.
Although the DDoG measure shares its readout-
based approach with classic patient drawing tasks
(Figure 3), it diers from them in a subtle but
important way. Patient drawing tasks are typically
stimulus-indierent, using stimuli (e.g., clocks) as mere
tools to reveal the integrity of a broad mental function
(e.g., spatial attention). The DDoG measure, in
contrast, seeks to probe the interpretation of a specic
stimulus (this bar graph) or stimulus type (bar graphs in
general). The focus on how the stimulus is interpreted,
rather than on the integrity of a stimulus-independent
mental function, distinguishes the DDoG measure from
classic patient drawing tasks.
Graphical elicitation measures
Moving beyond vision science, we next examine
the domain of graphical elicitation for DDoG-related
measures. A bird’s-eye perspective is provided by a
recent review of evaluation methods in uncertainty
visualization (Hullman et al., 2018). Uncertainty
visualization’s study of distributions, variation, and
summary statistics makes it an informative proxy for
datavis research related to our work. Notably, that
review called the method for eliciting a response “a
critical design choice.” Of the 331 measures found in
86 qualifying papers, from 11 application domains
(from astrophysics to cartography to medicine), only
4% elicited any sort of visuospatial or drawn response,
thus qualifying as graphical elicitation (Hullman et al.,
2018). None elicited either a distribution or a full graph;
thus, none qualied as using an elicited graph approach.
Further, though some studies elicited markings on a
picture (Hansen et al., 2013)ormap(Seipel & Lim,
2017), or movement of an object across a graph
(Cumming, Williams, & Fidler, 2004), none sought to
elicit a detailed visuospatial record of thought; thus,
none qualied as using a readout-based approach.
Survey methods such as multiple choice, Likert scale,
slider, and text entry were the most common response
elicitation methods—used 66% of the time (Hullman
et al., 2018).
While readouts are rare in both vision science and
datavis, there exists a broader graphical elicitation
literature in which readouts are more common and
elicited graphs are not unheard of (Jenkinson, 2005;
O’Hagan et al., 2006; Choy, O’Leary & Mengersen,
2009). Yet still, these elicited responses tend to differ
from our work in two key respects. First, they are
typically used exclusively with experts, whereas the
DDoG measure prioritizes general usage. Second, their
aim is typically to document preexisting knowledge,
and they therefore use stimuli as a catalyst or prompt,
rather than as an object of study. The DDoG measure,
in contrast, holds the stimulus as the object of study:
we want to know how the graph was interpreted. The
DDoG measure is therefore distinctive even in the
broadly defined domain of graphical elicitation.
A small subset of graphical elicitation studies do
elicit a visuospatial response, drawn on a graph, from
a general population. There remains, however, a core
dierence in the way the DDoG measure utilizes the
elicited response. DDoG uses the elicited response as
a measure (dependent variable). The other studies, in
contrast, use it as a way to manipulate engagement with,
or processing of, a numerical task (as an independent
variable; Stern, Aprea, & Ebner, 2003; Natter &
Berry, 2005; Kim, Reinecke, & Hullman, 2017a; Kim,
Reinecke, & Hullman, 2017b). While it is possible for a
manipulation to be adapted into a measure, the creation
of a new measure requires time and effort for iterative,
evidence-based refinement and validation (e.g., Wilmer,
2008; Wilmer et al., 2012; Degutis et al., 2013). The
DDoG measure is therefore distinctive in having been
created and validated explicitly as a measure.
Frequency framed measures
Another way in which the DDoG measure may be
distinguished from previous methods is in its use of a
frequency framed response. Frequency framing is the
communication of data in terms of individuals (e.g.,
“20 of 100 total patients” or “one in five patients”)
rather than percentages or proportions (e.g., “20%
of patients” or “one fifth of patients”). The DDoG
measure collects frequency-framed responses in terms
of individual values.
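As a small illustration of the two framings, the sketch below converts a proportion into frequency-framed phrases. The helper name `frequency_frame` and the default reference group of 100 are illustrative assumptions; they are not part of the DDoG measure.

```python
from fractions import Fraction

def frequency_frame(proportion, group_size=100):
    """Return two frequency-framed phrasings of a proportion (hypothetical helper)."""
    # "k of N" form for a fixed reference group, e.g. 0.20 -> 20 of 100.
    count = round(proportion * group_size)
    # Simplest "k in n" form, e.g. 0.20 -> 1 in 5.
    ratio = Fraction(proportion).limit_denominator(group_size)
    return (f"{count} of {group_size}",
            f"{ratio.numerator} in {ratio.denominator}")

print(frequency_frame(0.20))
```

Both phrasings describe countable individuals, which is the property that distinguishes frequency framing from a percentage such as “20%.”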
Figure 7. The balls-and-bins approach. Three recent
adaptations of the balls-and-bins approach to eliciting
probability distributions. This approach was originally
developed by Goldstein and Rothschild (2014). The sources of
these adaptations of balls-and-bins are: (a) Kim, Walls, Kraft &
Hullman, 2019; (b) Andre, 2016; (c) Hullman et al., 2018.
Journal of Vision (2021) 21(12):17, 1–36 Kerns & Wilmer 9
Frequency framing has long been considered a
best practice for data communication to nonexperts
(Cosmides & Tooby, 1996), yet it remains uncommon in
the measurement of graph cognition. Illustratively, as
of 2016, only a small handful of papers in uncertainty
visualization had utilized frequency framing (Hullman,
2016); and in all cases, frequency framing was used in
the instructions or stimulus rather than in the measured
response (e.g., Hullman, Resnick, & Adar, 2015;Kay
et al., 2016).
Two other, more recent, datavis studies have elicited
frequency framed responses (Hullman et al., 2017;
Kim, Walls, Kraft & Hullman, 2019). These studies
were geared toward nonexperts and used the so-called
balls-and-bins paradigm, originally developed by
Goldstein and Rothschild (2014), where balls are placed
in virtual bins representing specified ranges (Figure 7).
The studies aimed to elicit beliefs about sampling error
around summary statistics. This usage contrasts with
our aim of eliciting direct interpretation of the raw,
individual-level, nonaggregated data that produced a
This dierence in usage joins more substantive
dierences, captured by two MAGI principles: Limited
mental transformations and Expressive freedom
(Table 1). While the current implementation of
balls-and-bins requires spatial, format, and scale
translations between stimulus and response, the DDoG
measure achieves Limited mental transformations
by applying a stimulus-matched response: imagined
datapoints are drawn directly into a sketched version
of the graph itself. Similarly, while balls-and-bins
predetermines key aspects of the response such as
bin widths, bin heights, ball sizes, numbers of bins,
and the range from lowest to highest bin, the DDoG
measure achieves Expressive freedom by allowing drawn
responses that are constrained only by the limits of
the page. Further, the lack of built-in tracks or ranges
that constrain or suggest data placement reduces the
risk of the “observer effect,” whereby constraints or
suggestions embedded in the measurement procedure
itself alter the result.
Defining thought inaccuracies
Textual definitions of errors, biases, and confusions
To discuss the literature regarding this project’s
third contribution, identification of the Bar-Tip Limit
(BTL) error, it is helpful to distinguish three different
types of inaccurate thought processes: errors, biases,
and confusions. We will do this first in words, and then
Errors are “mistakes, fallacies, misconceptions,
misinterpretations” (Oxford University Press,
n.d.). They are binary, categorical, qualitative
inaccuracies that represent wrong versus right
Biases are “leanings, tendencies, inclinations,
propensities” (Oxford University Press, n.d.). They
are quantitative inaccuracies that exist on a graded
scale or spectrum. They exhibit consistent direction
but varied magnitude.
Confusions are “bewilderments, indecisions,
perplexities, uncertainties” (Oxford University
Press, n.d.). They indicate the absence of systematic
thought, resulting, for example, from failures to
grasp, remember, or follow instructions.
Numerical/visual definitions of errors, biases, and confusions
Let us now consider how each type of inaccurate
thought process would be instantiated in data. Such
depictions can act as visual definitions to compare with
real datasets. In Figure 8, patterns of data distributions
representing correct thoughts (8a), systematic errors
(8b), biases (8c), and confusions (8d), are illustrated as
frequency curves. They are numerical formalizations
of the textual definitions above, providing visual cues
for differentiating these four thought processes in a
graphing context.
Figure 8. Three types of inaccurate graph interpretation.
Prototypical response distributions provide visual definitions
for three types of inaccuracy. Top row: (a) Baseline response
pattern (green) indicates no systematic inaccuracy. Responses
cluster symmetrically around the correct response value, with
imprecision in task input, processing, and output reflected in
the width of the spread around that correct response. Bottom
row: Light red backgrounds illustrate the presence of inaccurate
responses. Gray arrows indicate the major change from the
baseline response pattern for each type of inaccurate response.
(b) Systematic error responses form their own distinct mode.
(c) Systematic bias responses shift and/or flatten the baseline
response distribution. (d) Confused responses—expressed as
random, unsystematic responding—are uniformly distributed,
thus raising the tails of the baseline distribution to a constant
value without altering the mean, median, or mode.
Figure 8a represents a baseline case: a data
distribution (blue curve) that is characteristic of
generally correct interpretation, absent systematic
inaccuracy. This baseline case prototypically produces
a symmetric distribution, with the mean, median,
and mode all corresponding to a correct response.
Normal imprecision, from expected variation in
stimulus input, thought processes, or response
output, is reflected by the width of the distribution.
(If everyone answered with 100% precision, the
“distribution” would be a single stack on the correct
Figure 8b illustrates a subset of systematic errors
within an otherwise correct dataset. In distributions,
an erroneous, categorically inaccurate subset
prototypically presents as a separate mode, additional
to the baseline curve. Errors siphon responses from
the competing baseline distribution, which remains
clustered around the correct answer, retaining the
correct modal response. The hallmark of an error is,
therefore, a bimodal distribution of responses whose
prevalence and severity are demonstrated, respectively,
by the height and separation of the additional
Figure 8c illustrates a subset of systematic biases
within an otherwise correct dataset. Given their graded
nature, a biased subset tends to flatten and shift the
baseline distribution of responses. The prototypical bias
curve demonstrates a single, shifted mode. To the extent
that a bias reflects a pervasive, relatively inescapable
aspect of human perception, cognition, or culture, there
will be relatively less flattening, and more shifting, of
the response distribution.
Finally, Figure 8d illustrates a subset of confusions
within an otherwise correct dataset. A confused subset
tends to yield random responses, which lift the tails of
the response distribution to a constant, nonzero value
as responses are siphoned from the competing baseline
A rough but informative way to distinguish between
erroneous, biased, and confused thinking is to simply
compare an observed distribution of responses directly
to the blue curves in Figure 8. This visual approach may
be complemented with an array of analytic approaches
that provide quantitative estimates of the degree of
evidence for a particular type of inaccurate thinking
(e.g., Freeman & Dale, 2013; Pfister et al., 2013; Zhang
& Luck, 2008). In Results, we will use both visual and
analytic approaches.
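To make the four prototypes concrete, the following sketch simulates each response pattern and computes one widely used summary of bimodality, the bimodality coefficient. It is a minimal illustration under stated assumptions, not the analysis used in this study: the mixture proportions, means, spreads, and response range are hypothetical, and the coefficient is computed from raw moments without the small-sample correction that is often applied.

```python
import random

def simulate(kind, n=10000, seed=0):
    """Draw n hypothetical responses; the correct response is 0."""
    rng = random.Random(seed)
    responses = []
    for _ in range(n):
        if kind == "error" and rng.random() < 0.2:
            responses.append(rng.gauss(-5.0, 1.0))    # separate, distant mode
        elif kind == "bias":
            responses.append(rng.gauss(-1.0, 1.5))    # one shifted, flattened mode
        elif kind == "confusion" and rng.random() < 0.2:
            responses.append(rng.uniform(-8.0, 8.0))  # random responding lifts the tails
        else:
            responses.append(rng.gauss(0.0, 1.0))     # baseline, centered on the correct value
    return responses

def bimodality_coefficient(xs):
    """Moment-based coefficient (skewness^2 + 1) / kurtosis, no small-sample correction."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # not excess kurtosis; equals 3 for a normal curve
    return (skewness ** 2 + 1) / kurtosis

for kind in ("baseline", "error", "bias", "confusion"):
    print(kind, round(bimodality_coefficient(simulate(kind)), 2))
```

Values above roughly 0.555 (the coefficient of a uniform distribution) are conventionally read as suggesting bimodality. Under these assumed settings, only the simulated error pattern should exceed that benchmark; the bias pattern shifts the single mode, and the confusion pattern raises the tails, without creating a second mode.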
The success of any approach rests upon the precision
and accuracy of measurement: as measurement quality
is reduced, it becomes increasingly difficult to detect
clear thought patterns reflected in the data. Our results
will show unequivocal bimodality in the collected data,
supporting both the presence of an erroneous graph
interpretation (the BTL error) and the precision and
accuracy of the DDoG measure.
Literature related to the BTL error
Probability rating scale results
In examining prior work related to our discovery of
the BTL error, studies by Newman and Scholl (2012),
Correll and Gleicher (2014), Pentoney and Berger
(2016), and Okan and colleagues (2018) are seminal and
contain the most reproducible prior evidence against
mean bar graph accessibility. Their core result is an
asymmetry recorded via a probability rating scale: the
average participant, when shown a mean bar graph,
rated an inside-of-bar location for data somewhat more
likely than a comparable outside-of-bar location (1
point on a 9-point probability rating scale, Figure 9a).
The original paper, by Newman and Scholl (2012),
reported ve replications of this asymmetry, including
in-person and online testing, varied samples (ferry
commuters, Amazon Mechanical Turk recruits, Yale
undergraduates), and conditions that ruled out key
numerical and spatial confounds. Five independent
research groups have since replicated and extended
this finding. The first was Correll and Gleicher (2014),
who extended the result to viewers’ predictions of
future data, and who showed that graph types that
were symmetrical around the mean (e.g., violin plots)
did not produce the effect. Soon after, Pentoney and
Berger (2016) found this asymmetry present for a bar
graph with confidence intervals, but absent when the
bar was removed, leaving only the confidence intervals;
this replicated Correll and Gleicher’s (2014) finding
that the effect requires the presence of a traditional
bar graph. Okan and colleagues (2018) also replicated
both the core asymmetry (this time using health data)
and the finding that nonbar graphs did not produce
the effect; they additionally found that the asymmetry,
counterintuitively, increased with graph literacy. Finally,
Godau, Vogelgesang, and Gaschler (2016) and Kang
and colleagues (2021) showed that the asymmetry
remains in aggregate, carrying through to judgments
of the grand mean of multiple bars.
Figure 9. A side-by-side comparison of a probability rating
scale and a DDoG measure response. The DDoG measure’s more
concrete, detailed, visuospatial (aka readout-based) approach
may have contributed to the ease with which it identified the
BTL error and its apparent conflation mechanism, which were
missed by prior studies using the probability rating scale
approach. (a) The probability rating scale response sheet
provided to each participant in the original report of
asymmetry by Newman and Scholl (2012, Study 5). These
Likert-style scales, anchored by colloquial English words (from
“very unlikely” to “very likely”), were used to characterize the
likelihood that individual values occurred at two specific y-axis
values, one within the bar (−5) and one outside the bar (+5).
The red circles indicate the resulting mean ratings of 6.9 and 6.1
(from Study 5, Newman & Scholl, 2012). (b) The DDoG measure
response sheet from an individual participant given four
stimulus graphs and asked to sketch each graph along with 20
hypothesized datapoints for two specified bars on the graph.
Newman and Scholl’s (2012) hypothesized
mechanism was a well-known perceptual bias, whereby
in-object locations are processed slightly more
eectively than out-of-object locations. The studies that
Newman and Scholl (2012) cited for this perceptual
bias, for example, had on average 4% faster and 0.6%
more accurate responses for in-object stimuli relative
to out-of-object stimuli (Egly et al., 1994; Kimchi,
Yeshurun, & Cohen-Savransky, 2007; Marino & Scholl,
2005). While subtle, this bias was believed to reflect
a fundamental aspect of object processing (Newman
& Scholl, 2012), and it was therefore assumed to be
pervasive and inescapable. Newman and Scholl’s (2012)
hypothesis was that this “automatic” and “irresistible”
perceptual bias had produced a corresponding
“within-the-bar bias” in graph interpretation.
Probability rating scale comparison
At the end of Results, we will compare our detailed
findings, obtained via our DDoG measure, to those
of the probability rating scale studies reviewed just
above. Here, we compare the more general capabilities
of the two measurement approaches. Figure 9
shows side-by-side examples of a probability rating
scale response from Newman and Scholl’s (2012)
Study 5 (Figure 9a), and a DDoG measure readout
(Figure 9b). Using the MAGI principles (Table 1) as
a basis for comparison, we can see key differences in:
Limited mental transformations, Ground-truth linkage,
Information richness, and Expressive freedom.
The MAGI principle of Limited mental
transformations aims to facilitate an accurate readout
of graph interpretation. As discussed in Frequency
framed measures, even a mental transformation as
seemingly trivial as a translation from a ratio (one
in five) to a percentage (20%) can severely distort a
person’s thinking (Cosmides & Tooby, 1996). The
DDoG measure limits mental transformations via its
stimulus-matched response (Figures 4 and 9b). Absent
such a matched response, the mental transformations
required by a measure may limit its accuracy.
In the example from Newman and Scholl (2012),
Study 5 (Figure 9a), several mental transformations
are needed for probability rating scale responses;
among them, translation from numbers (−5 and
+5) to graph locations (in or out of the bar, and to
what degree in or out), from graph locations to
probabilities (represented in whatever intuitive or
numerical way a person’s mind may represent them),
from probabilities to the rating scale’s English adjectives
(e.g. “somewhat,” “very”), and, separately, from the
vertical scale on the graph’s y-axis to the horizontal
rating scale. While it is possible that some of these
mental transformations are accomplished effectively
and relatively uniformly for most participants, our
reanalysis below of the probability rating scale
studies (Results: A reexamination of prior results)
will suggest that their presence contributes a certain
amount of irreducible “noise” (inaccuracy) to the
With regard to Ground-truth linkage, it is
straightforward to evaluate a DDoG measure readout
(Figure 9b) relative to multiple logical and empirical
benchmarks of ground-truth. Logical benchmarks
include, for example, “Is the mean of the drawn
datapoints at the bar-tip?” Empirical benchmarks
include “How closely does the drawn data reflect the
[insert characteristic] in the actual data?” In contrast,
there is no logically or empirically correct (or incorrect)
response for the probability rating scale measure
(Figure 9a). Even discrepant rated values for in-bar
versus out-of-bar locations—though suggestive of
inaccurate thinking—could be accurate in a case of a
skewed raw data distribution, which could plausibly
lead to more raw data in-bar than out-of-bar (or to
the opposite). The probability rating scales therefore
lack a conclusive Ground-truth linkage. DDoG measure
responses, in contrast, specify real numerical and spatial
values with a concreteness that is easier to compare to a
variety of ground-truths.
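The logical benchmark named above lends itself to a simple numerical check. The sketch below is a hypothetical illustration, not the scoring procedure used in this study: the function names, the example readouts, and the idea of summarizing a readout by the share of datapoints drawn beyond the bar-tip are all assumptions made for the example.

```python
def mean_offset_from_tip(points, bar_tip):
    """Mean of the drawn datapoints minus the bar-tip value (0 meets the benchmark)."""
    return sum(points) / len(points) - bar_tip

def fraction_beyond_tip(points, bar_tip, baseline=0.0):
    """Share of drawn datapoints lying past the bar-tip, on the side away from the baseline."""
    # Bars may extend upward or downward from the baseline.
    direction = 1.0 if bar_tip >= baseline else -1.0
    beyond = sum(1 for y in points if (y - bar_tip) * direction > 0)
    return beyond / len(points)

# Hypothetical readouts for a bar whose tip is at 10 (baseline at 0).
across_tip = [6, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 14]
within_bar = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

for name, pts in (("across_tip", across_tip), ("within_bar", within_bar)):
    print(name, mean_offset_from_tip(pts, 10), fraction_beyond_tip(pts, 10))
```

A readout distributed across the bar-tip yields a mean offset near zero and a substantial share of points beyond the tip, whereas a readout confined to the bar yields a mean well short of the tip and no points beyond it. A rating on a verbal scale offers no comparable quantities to check.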
The DDoG measure’s precision is most directly
supported by its Information richness: in the present
study, each participant produces 160 drawn datapoints
on a continuous scale. In contrast, the probability
rating scale shown in Figure 9a yields only two pieces of
information (the two ratings) per participant. Granted,
the relative information-value of a single integer rating,
versus a single drawn datapoint, is not easily compared.
Further, as we will see below, multiple ratings at
multiple graph locations can increase precision of
the probability rating scale measure (Results: A
reexamination of prior results). Nevertheless, it would
be difficult to imagine a case where a probability rating
scale measure yielded more information richness, and,
in turn, higher precision, than the DDoG measure.
Table 2. Using MAGI principles to compare graph interpretation measures.
Finally, comparing Expressive freedom between the
two measures, the rating scale measure constrained
each response to a nine-point integer scale. On this
scale, responses are led and constrained to a degree
that disallows many types of unexpected responses and
limits the capacity of distinct graph interpretations to
stand out from each other. These constraints contrast
sharply with the flexibility of the drawn readout that
the DDoG measure produces.
The probability rating scale and the DDoG measure
therefore differ in terms of four separate MAGI
principles (Table 1): Limited mental transformations,
Ground-truth linkage, Information richness, and
Expressive freedom.
MAGI principles used as a metric
We have used the MAGI principles above to
highlight notable differences between the DDoG
measure and specific implementations of two other
measurement approaches: balls-and-bins (Figure 7)
and probability rating scale (Figure 9a). However, the
question of whether these differences are integral to the
measure-type, or restricted to specific implementations,
remains. Table 2 models the use of the MAGI principles
as a vehicle to compare, in a more structured way,
potentially integral features and limitations of these
same three graph interpretation methods.
We believe that all three measures share a capability
for Limited instructions and Ecological validity,
though choices specific to each implementation may
affect whether the principles are actually expressed
(see Hullman et al. (2017) and Kim, Walls, Kraft
and Hullman (2019) for implementations of the
balls-and-bins method, and see Newman and Scholl
(2012), Correll and Gleicher (2014), Pentoney and
Berger (2016), and Okan and colleagues (2018) for
implementations of probability rating scales). In
contrast, the three measures appear to differ more
or less intrinsically in their approaches to Expressive
freedom, Limited mental transformations, and (to a
lesser extent) Information richness. That said, measure
development is an inherently iterative, dynamic process,
and what seems intrinsic at one point in time can
sometimes shift as development progresses.
Because the DDoG measure and MAGI principles
were built around each other and refined in parallel,
it makes sense that they are closely aligned in their
optimization for assessment of abstract-graph
interpretation, with a focus on general usage and valid
measurement. Recognizing that measures with different
aims and motivations may be best suited for different
tasks, this use of MAGI principles, in table form
(Table 2), provides a model of its utility for comparison
and targeting of measures with a similar set of aims
and motivations.
Summary of related works
In the four subsections of Related works above,
we first documented and sought to better understand
the relative rarity of elicited-graph, readout-
based, graphical elicitation, and frequency-framed
measurement in studies of graph cognition, while
arguing for the value of greater usage of elicited-graph
and readout-based measurement in particular. We
next distinguished—in both written/verbal and
numerical/visual form—three distinct types of
inaccurate graph interpretation: errors, biases, and
confusions. Third, we examined key results and methods
from prior studies of mean bar graph inaccessibility.
And, finally, we provided an illustrative example of how
the Measurement of Abstract Graph Interpretation
(MAGI) principles can be used to compare and contrast
relevant measures.
The current investigation has two parts: (A) examine
the accessibility of a set of ecologically valid mean
bar graphs in an educationally diverse sample via our
new Draw Datapoints on Graphs (DDoG) measure
(sections Participant Data Collection through Define
the Average) and (B) assess the relative frequency of
mean versus count bar graphs across educational and
general sources (section Prevalence of Mean Versus
Count Bar Graphs). Included in the Methods sections
devoted to “A” are detailed specifications for the current
implementation of the DDoG measure, as well as the
thinking behind that implementation, as a guide to
future DDoG measure usage.
Participant data collection
The development of the DDoG measure was highly
iterative. Early piloting was performed at guest lectures,
lab meetings, and in college classrooms at various levels
of the curriculum by author JBW, and in several online
data collection pilot studies by both authors. Early
wording differed from the more refined wording used
in the present investigation. Yet even the earliest pilots
produced the core result of the current investigation
(Figure 4): a common, severely inaccurate interpretation
of mean bar graphs that we call the Bar-Tip Limit
(BTL) error. The DDoG measure thus appears—at
least in the context of the BTL error—robust to fairly
wide variations in its wording. A constant through
all DDoG measure iterations, however, was succinct,
concrete, task-directed instructions: what came to be
expressed as the Limited instructions MAGI principle
(Table 1).
Data collection was completed remotely using
Qualtrics online survey software. There were neither
live nor recorded participant-investigator interactions,
to minimize priming, coaching, leading, or other forms
of experimenter-induced bias (Limited instructions
principle). Participants were recruited via the Amazon
Mechanical Turk (MTurk), Prolific, and TestableMinds
platforms. Because no statistically robust differences
were observed between platforms, data from all three
were combined for the analyses reported below.
Participants were paid $5 for an expected 30 minutes of
work; the median time taken was 33 minutes.
Figure 10 provides a flowchart of the procedure.
Participants (a) read an overview of the tasks along
with time estimates; (b) read and recorded mean values
from stimulus graphs; (c) (grey zone) completed the
DDoG measure drawing task for stimulus graphs; (g)
provided a definition for the average/mean value; and
(h) reported age, gender, educational attainment, and
prior coursework in psychology and statistics.
Foundational knowledge tasks
Graph reading task: Find the average/mean
Participants were asked to “warm up” by reading
mean values from a set of mean bar graphs that
were later used in the DDoG measure drawing task
(wording: “What is the average (or mean) value for
[condition]?”) (Figure 10j). This warm-up served as a
control, verifying that the participant was able to locate
a mean value on the graph.
To allow for some variance in sight-reading,
responses within a tolerance of two tenths of the
distance between adjacent y-axis tick-marks were
counted as correct (see Figure 11). The results reported
in Results: Independence of the BTL error from
foundational knowledge remain essentially unchanged
for response-tolerances from zero (only exactly
correct responses) through arbitrarily high values (all
responses). No reported prevalence values dip below
Graph reading was correct for 93% of graphs. As a
control for possible carryover/learning effects, 114 of
the 551 total DDoG measure response drawings were
completed without prior exposure to the graph (i.e.,
without a warm-up for that graph). No evidence of
carryover/learning effects was observed.
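The tolerance rule for the graph reading task can be sketched as follows. This is a minimal illustration, assuming that each graph's true mean and y-axis tick spacing are known; the function name and example values are hypothetical.

```python
def reading_is_correct(response, true_mean, tick_spacing, tolerance=0.2):
    """Accept a reading within `tolerance` tick-spacings of the true mean."""
    allowed = tolerance * tick_spacing
    # A tiny epsilon guards against floating-point rounding at the boundary.
    return abs(response - true_mean) <= allowed + 1e-9

# Example: ticks every 10 units, so readings within 2 units count as correct.
print(reading_is_correct(52, 50, tick_spacing=10))
print(reading_is_correct(53, 50, tick_spacing=10))
```

Setting `tolerance=0` accepts only exactly correct readings, and a very large value accepts all readings, which corresponds to the range of response-tolerances described above.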
Definition task: Define the average/mean
After completing the DDoG measure drawing
task, as a control for comprehension and thoughtful
responding, participants were asked to explain the
concept of the average/mean. Most participants (112 of
190, or 64%) got the following version of the question:
“From your own memory, what is an average (or
mean)? Please just give the first definition that comes to
mind. If you have no idea, it is fine to say that. Your
definition does not need to be correct (so please don’t
look it up on the internet!)” For other versions of the
question, see open data spreadsheet. No systematic
differences in results based on question wording were
observed. Participants were provided a text box for
answers with no character limit. Eighty-three percent
of responses were correct, with credit being given for
responses that were correct either conceptually (e.g.,
“a calculated central value of a set of numbers”) or
mathematically (e.g., “sum divided by the total number
of respondents”).
Figure 10. Flowchart of study procedure. The present study consisted of five main sections (a, b, c, g, h). Sections and subsections
colored teal (b, e, g, h) produced data that were analyzed for this study. Subsections on the right (i, j, k, l, m) are an expansion of
section c of the flowchart, showing (i) the drawing page, (j) the drawing instructions, (k) one of the four stimulus graphs, (l) the graph
caption, and (m) the upload instructions. See Methods for further procedural details.
The Draw Datapoints on Graphs (DDoG)
measure: Method
Selection of DDoG measure stimulus graphs
For the present investigation, four mean bar graph
stimuli, shown in Figure 11, were taken from popular
Introductory Psychology textbooks. This source of
graphs was selected for three main reasons. First,
Introductory Psychology is among the most popular
undergraduate science courses, serving two million
plus students per year in the United States alone
(Peterson & Sesma, 2017); and this course tends to rely
heavily on textbooks, with the market for such texts
estimated at 1.2 to 1.6 million sales annually (Steuer
& Ham, 2008). The high exposure of these texts gives
greater weight to the data visualization choices made
within them. Second, Introductory Psychology attracts
students with widely varying interests, skills, and prior
data experiences, meaning that data visualization best
practices developed in the context of Introductory
Psychology may have broad applicability to other
contexts involving diverse, nonexpert populations.
Third, given the relevance of psychological research
to everyday life, inaccurate inferences fueled by
Introductory Psychology textbook data portrayal
could, in and of themselves, have important negative
real-world impacts.
The mean bar graph stimuli were chosen to exhibit
major differences in form, and to reflect diversity of
content (Figure 11). They convey concrete scientific
results, rather than abstract theories or models, so
that understanding can be measured relative to the
ground-truth of real data (Ground-truth linkage).
Introductory texts were chosen because of their direct
focus on conveying scientific results to nonexperts
(Ecological validity). Stimuli were selected to show
“independent groups” (aka between-participants)
comparisons because these comparisons are one step
more straightforward, conceptually and statistically,
than “repeated-measures” (aka within-participants)
We refer to the four stimulus graphs by their
independent variables: as AGE, CLINICAL, SOCIAL,
and GENDER. The stimuli were selected, respectively,
from texts by Kalat (2016), Gray and Bjorklund
(2017), Grison and Gazzaniga (2019), and Myers
and DeWall (2017), which ranked 7, 22, 3, and 1 in
median sales rankings across eight days
during March and April of 2019, among the 23 major
textbooks in the Introductory Psychology market.
Further details about these graphs are shown in Table 3.
Stimuli were chosen to represent both meaningful
content differences (four areas about which individual
participants might potentially have strong personal
opinions), and form differences (differing relationship
of bars to the baseline) to evaluate replication of results
across graphs despite differences that might reasonably
be expected to change graph interpretation.
Figure 11. The four stimulus graphs used in this study. These
stimulus graphs were taken from popular Introductory
Psychology textbooks to ensure the direct real-world relevance
of our results (see Ecological validity MAGI principle). Figure
legends were adapted as necessary for comprehensibility
outside the textbook. Stimulus graph textbook sources are AGE:
Kalat, 2016; CLINICAL: Gray and Bjorklund, 2017; SOCIAL:
Grison & Gazzaniga, 2019; and GENDER: Myers & DeWall, 2017.
The specific form difference we examined was
the distinction between unidirectional bars (all bars
emerge from the same side of the baseline) and
bidirectional bars (bars emerge in opposite directions
from the baseline). Two of the four graphs, AGE and
CLINICAL, were unidirectional (Figure 11, top),
and the other two, SOCIAL and GENDER, were
bidirectional (Figure 11, bottom). While unidirectional
graphs were more common than bidirectional graphs
in the surveyed textbooks, two examples of each were
selected to set up an internal replication mechanism.
Replication of results could potentially be
demonstrated across graphs in terms of mean values
(the mean result of different graphs could be similar),
or in terms of individual differences (an individual
participant’s result on one graph could predict their
result on another graph). Both types of replication were
Graph drawing task: The DDoG measure
The DDoG measure was designed using the MAGI
principles (Table 1), and it was modeled after the
patient drawing tasks mentioned above (Landau et al.,
2006; Agrell & Dehlin, 1998). Ease of administration
and incorporation into an online study were further key
design considerations.
The Expressive freedom inherent in the drawing
medium avoids placing artificial constraints on
participant interpretation, and it has multiple potential
benefits: it helps to combat the “observer effect,”
whereby a restricted or leading measurement procedure
impacts the observed phenomenon; it lends salience to
a consistent, stereotyped pattern (such as the Bar-Tip
Limit response shown in Figures 2d, 4c (bottom), 5f,
6b (bottom), and 13 (right)); and it allows unexpected
responses, which render the occasional truly inattentive
or confused response (Figure 8d) clearly identifiable.
Twenty drawn datapoints per bar was chosen as
a quantity small enough to avoid noticeable impacts
of fatigue or carelessness, while remaining sufficient
to gain a visually and statistically robust sense of
the imagined distribution of values. Recent research
suggests that graphs that show 20 datapoints enable
both precise readings (Kay et al., 2016) and eective
decisions (Fernandes et al., 2018).
Before beginning the drawing task, participants
were instructed to divide a page into quadrants, and
they were shown a photographic example of a divided
blank page (Figure 10i). They were then presented with
task instructions (Figure 10j) with one of the four bar
graph stimuli and its caption (Figure 10k, 10l). This
process—instructions, graph stimulus, caption—was
repeated four total times, with the order of the four
graph stimuli randomized to balance order effects such
as learning, priming, and fatigue. After completing
the four drawings on a single page, each participant
photographed and uploaded their readout (Figure 10m)
via Qualtrics’ file upload feature. Few technical
difficulties were reported, and only a single submitted
photograph was unusable for technical reasons (due to
insufficient focus). All original photographs are posted
to the Open Science Framework (OSF).
Table 3. Stimulus graph sources and variables.
Journal of Vision (2021) 21(12):17, 1–36 Kerns & Wilmer 16
DDoG measure data
Collection of DDoG measure readouts
One hundred ninety participants nominally
completed the study. Of these, 149 (78%) followed the
directions suciently to enable use of their drawings,
a usable data percentage typical of online studies
of this length (Litman & Robinson, 2020). Based
on predetermined exclusion criteria, participant
submissions were disqualied if most or all of their
drawings were unusable for any of the following:
(1) Zero datapoints were drawn (18 participants).
(2) Datapoints were drawn but in no way reected
either the length or direction of target bars, thereby
demonstrating basic misunderstanding of the task
(16 participants).
(3) Drawings lacked labels or bar placement information
necessary to disambiguate condition or bar-tip
location (ve participants).
(4) Photograph was insuciently focused to allow a
count (one participant).
Each of the 149 remaining readouts contained
four drawn graphs, for 596 graphs total. Of these,
30 individual drawings were excluded for one or
more of the above reasons, and 15 were excluded
for the additional predetermined exclusion criterion
of a grossly incorrect number of datapoints,
dened as >25% dierence from the requested
20 datapoints.
The remaining 551 drawn graphs (92.4% of the 596)
from 149 participants were included in all analyses
below. Ages of these 149 participants ranged from 18
to 71 years old (median 31). Reported genders were 96
male, 52 female, and one nonbinary. Locations included
26 countries and 6 continents. The most common
countries were United Kingdom (n=58), United States
(n=34), Portugal (n=8), and Greece/Poland/Turkey
(each n=4).
Coding of DDoG measure readouts via BTL index
As a systematic quantification of datapoint
placement in DDoG measure readouts, a Bar-Tip Limit
(BTL) index was computed. The BTL index estimated
the within-bar percentage of datapoints for each graph’s
two target bars by dividing within-bar datapoints (i.e.,
datapoints drawn on the baseline side of the bar-tip) by
the total number of drawn datapoints for the two target
bars and multiplying by 100. Written as a formula:
[(# datapoints on baseline-side of bar-tips) / (total #
datapoints)] ×100
The highest possible index (100) represents an image
in which all drawn datapoints are on the baseline
sides of their respective bar-tips (i.e., within the bar; a
Bar-Tip Limit response). An index of 50 represents a
drawing with equal numbers of datapoints on either
side of the bar-tip (i.e., balanced distribution across the
mean line, or a Bar-Tip Mean response). BTL index
values in this study ranged from 27.5 (72.5% of points
drawn outside of the bar) to 100 (100% of points drawn
within the bar).
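The formula above can be sketched in a few lines of Python. This is an illustration only, not the authors' scoring code; the function name `btl_index` and its argument names are hypothetical.

```python
def btl_index(within_bar, total):
    """Percentage of drawn datapoints on the baseline side of the bar-tips.

    within_bar: datapoints drawn within the two target bars.
    total: all datapoints drawn for the two target bars.
    (Both names are illustrative; they do not come from the paper.)
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= within_bar <= total:
        raise ValueError("within_bar must lie between 0 and total")
    return 100.0 * within_bar / total


# Two target bars with 20 requested datapoints each (40 in total).
print(btl_index(40, 40))  # 100.0: every point within the bars (Bar-Tip Limit)
print(btl_index(20, 40))  # 50.0: points balanced across the bar-tips (Bar-Tip Mean)
print(btl_index(11, 40))  # 27.5: numerically equal to the lowest index reported
```

The last call only shows one count that would yield 27.5; the text does not state how many datapoints that particular drawing contained.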
The straightforward coding procedure was designed
to yield a reproducible, quantitative measure of
datapoint distribution relative to the bar-tip; however,
one ambiguity existed. Datapoints drawn directly on
the bar-tip could reasonably be considered to either
use the bar-tip as a limit (if the border is considered
part of the object) or not (if a datapoint on the edge
is considered outside the bar). This ambiguity was
handled as follows: (1) if, at natural magnification, a
drawn datapoint was clearly leaning toward one or the
other side of the bar-tip, it was assigned accordingly,
and (2) datapoints on the bar-tip for which no clear
leaning was apparent were alternately assigned first
inside the bar-tip, then outside, and continuing to
alternate thereafter.
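A minimal sketch of rule (2), assuming the counts of clearly-inside, clearly-outside, and on-tip datapoints have already been tallied by the scorer. The function names are hypothetical, and whether the alternation restarts for each bar is not specified in the text; with an even number of on-tip points per bar the result is the same either way.

```python
def assign_ambiguous(n_on_tip):
    """Split on-tip datapoints with no clear leaning.

    Assignment alternates, starting inside the bar: inside, outside, inside, ...
    Returns (n_inside, n_outside).
    """
    n_inside = (n_on_tip + 1) // 2  # first, third, fifth, ... go inside
    n_outside = n_on_tip // 2       # second, fourth, ... go outside
    return n_inside, n_outside


def btl_index_with_ties(clearly_inside, clearly_outside, on_tip):
    """BTL index (0 to 100) after resolving on-tip datapoints by alternation."""
    tip_inside, tip_outside = assign_ambiguous(on_tip)
    inside = clearly_inside + tip_inside
    total = inside + clearly_outside + tip_outside
    if total == 0:
        raise ValueError("no datapoints to score")
    return 100.0 * inside / total


# Two bars of 20 datapoints each: 14 clearly inside and 6 on the tip per bar,
# none beyond the tip.
print(btl_index_with_ties(clearly_inside=28, clearly_outside=0, on_tip=12))  # 85.0
```

Half of the on-tip points are counted as outside the bar, which is why a drawing with no datapoints beyond the bar-tip can still score below 100.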
This procedure maximizes reproducibility by
minimizing the need for the scorer to interpret
the drawer’s intent—and it was successful (see
Reproducibility of BTL index coding procedure). Yet
reproducibility may have come at the cost of some
minor conservatism in quantifying drawings where
the participant placed no datapoints beyond the
bar-tip—arguably displaying a complete Bar-Tip Limit
interpretation—yet placed some of the datapoints
directly on the bar-tip. For example, if six of the 20
datapoints for each bar were placed on the bar-tip, the
drawing would get a BTL index of only 85, when it
arguably represented a pure BTL conception of the
Reproducibility of BTL index coding procedure
Reproducibility of the BTL index coding was
evaluated by comparing independently scored BTL
indices of the two coauthors for a substantial subset of
201 drawn graphs. Figure 12 shows the data from this
comparison where the x coordinate is JBW’s scoring
and the y coordinate is SHK’s scoring of each drawing.
The correlation coecient between the two sets of
ratings is extremely high (r(199) =0.991, 95% CI [0.988,
0.993]). Additionally, the blue best-t line is nearly
indistinguishable from the black line representing
hypothetically equivalent indices between the two raters.
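For readers who wish to check the reported interval, the sketch below computes a Pearson correlation and an approximate 95% CI using the Fisher z-transformation. The choice of the Fisher method is an assumption made here: the text does not say how the CI was computed, although this method reproduces the reported bounds from the rounded r. The function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for a correlation via the Fisher z-transformation."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# r and n are taken from the text: 201 drawn graphs, so df = n - 2 = 199.
low, high = fisher_ci(0.991, 201)
print(round(low, 3), round(high, 3))  # 0.988 0.993
```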
The fact that the line of best fit is so closely aligned
with the line of equivalence is important. In theory,
even with a correlation coefficient close to 1.0, one
rater might give ratings that are shifted higher/lower,
or that are more expanded/compressed, compared
to the other. This would show up as a best-fit line
that was either vertically shifted relative to the line of
equivalence, or of a different slope compared to the line
Figure 12. Interrater reliability of Bar-Tip Limit (BTL) index
coding demonstrates high repeatability of coding method.
Coauthors SHK and JBW independently computed the BTL index
for a subset of 201 drawn graphs (of 551 total). Each dot
represents both BTL indices for a given drawn graph (y-value
from SHK, x-value from JBW). Semitransparent dots code
overlap as darkness. The black line is a line of equivalence,
which shows x = y for reference. The line of best fit is blue. The
close correspondence of these two lines, and the close
clustering of dots around them, demonstrate high repeatability
of the BTL index coding procedure.
of equivalence. The absence of such a vertical shift or
slope dierence is therefore particularly strong evidence
for repeatability. This strong evidence is echoed by a
high interrater reliability statistic (Krippendorf s alpha)
of 0.93 (Krippendorf, 2011); Krippendorfs alpha varies
from 0 to 1, and a score of 0.80 or higher is considered
strong evidence of repeatability (Krippendorf, 2011).
These analyses therefore demonstrate that the BTL
index coding procedure is highly repeatable, and that
variations in coding—a potential source of noise that
could theoretically constrain the capacity of a measure
to precisely capture Bar-Tip Limit interpretation—can
be minimized.
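The point made above, that a near-perfect correlation does not by itself guarantee agreement between raters, can be illustrated with a small least-squares sketch. The scores are invented for illustration and are not data from the study; `best_fit_line` is a hypothetical helper.

```python
def best_fit_line(x, y):
    """Ordinary least-squares slope and intercept for y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


# Invented scores for illustration only.
rater_a = [50.0, 55.0, 60.0, 90.0, 100.0]
shifted = [s + 5.0 for s in rater_a]                     # scores 5 points higher
compressed = [50.0 + 0.5 * (s - 50.0) for s in rater_a]  # uses half the range

print(best_fit_line(rater_a, rater_a))     # slope 1, intercept 0: line of equivalence
print(best_fit_line(rater_a, shifted))     # slope 1, intercept 5: vertical shift
print(best_fit_line(rater_a, compressed))  # slope 0.5, intercept 25: compression
```

All three comparisons have a correlation of exactly 1, yet only the first lies on the line of equivalence, which is why the slope and intercept of the best-fit line carry information beyond r.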
Establishing cutoffs and prevalence
For purposes of estimating the prevalence of the
BTL error in the current sample, a BTL index cutoff
of 80 was selected. This was the average result when
five clustering methods (k-means, median clustering,
average linkage between groups, average linkage within
groups, and centroid clustering) were applied via SPSS
to the 551 individual graph BTL indices (cutoffs were
75, 75, 80, 85, and 85, respectively). A cutoff of 80 was additionally
verified as reasonable via visual inspection of the data
shown in Figure 14.
Due to the strongly bimodal distribution of BTL
index scores (see Results), only 14 of the 551 total
drawings (2.5%) fell within the entire range of computed
cutos (75 to 85). Therefore, though the computed
condence intervals around prevalence estimates,
reported in Results, do not include uncertainty in
selecting the cuto, this additional source of uncertainty
is small (at most, perhaps ±1.25%).
For Pentoney and Berger’s (2016) data, analyzed
below in Prior work reflects the same phenomenon:
Prevalence, the respective cutoffs from the same five
clustering methods produced BTL error percentages
of 19.5, 19.5, 19.5, 19.5, and 24.6, for an average
percentage of 20.5.
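The arithmetic in this subsection can be checked with the short sketch below. All numbers are copied from the text; reading the ±1.25% bound as half of the share of drawings between the cutoffs is an interpretation made here, since the text does not spell out the derivation.

```python
# Cutoffs reported for the five clustering methods, in the order listed.
cutoffs = [75, 75, 80, 85, 85]
mean_cutoff = sum(cutoffs) / len(cutoffs)

# Drawings whose BTL index fell within the range of computed cutoffs (75 to 85).
n_in_range, n_total = 14, 551
share_in_range = 100.0 * n_in_range / n_total

# BTL error percentages reported for the Pentoney and Berger (2016) data.
prior_percentages = [19.5, 19.5, 19.5, 19.5, 24.6]
mean_prior = sum(prior_percentages) / len(prior_percentages)

print(mean_cutoff)               # 80.0
print(round(share_in_range, 1))  # 2.5
print(round(mean_prior, 1))      # 20.5
# Assumed reading of the bound: half of the share between the cutoffs.
print(round(share_in_range / 2, 2))
```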
Prevalence of bar graphs of means versus bar
graphs of counts
The methods discussed in this subsection pertain
to Results section Ecological exposure to mean versus
count bar graph. In that investigation, we assessed
the likelihood of encountering mean versus count bar
graphs across a set of relevant contexts by tallying the
frequency of each bar graph type from three separate
sources: elementary educational materials accessible
via Google Image searches, college-level Introductory
Psychology textbooks, and general Google Images
searches. It was hypothesized that these sources would
provide rough, but potentially informative, insights into
the relative likelihood of exposure to each bar graph
type in these areas. The methods used for each source
The elementary education Google image search used
the phrase: “bar graph [X] grade,” with “first” to “sixth”
in place of [X]. The first 50 grade-appropriate graph
results, from independent internet sources, for each
grade-level, were categorized as either count bar graph,
mean bar graph, or other (histogram or line graph).
The college-level Introductory Psychology textbook
count tallied all of the bar graphs of real data across
the following eight widely-used Introductory textbooks:
Ciccarelli & Berstein, 2018; Coon, Mitterer & Martini,
2018; Gazzaniga 2018; Griggs 2017; Hockenbury &
Nolan, 2018; Kalat, 2016; Lilienfeld, Lynn & Namy,
2017; and Myers & DeWall, 2017. The bar graphs were
then categorized as dealing with either means or counts
(histograms excluded).
The general Google Image search used the phrase
“bar graph.” The first 188 graphs were categorized
as count bar graphs or mean bar graphs, or were excluded
for not being bar graphs (tables, histograms) or for
containing insufficient information to determine what
type of bar graph they were.
Plotting and evaluation of data from previous
In our reexamination of prior results, we compare
previous study results to the present results. Data
is replotted from Newman and Scholl (2012) and
Pentoney and Berger (2016) as Figure 20. For Newman
and Scholl (2012), data from Study 5 is plotted
categorically as “0” (no difference between in-bar and
out-of-bar rating) versus “not 0” (positive or negative
difference between in-bar and out-of-bar rating). The
directionality of the latter differences, while desired,
is not obtainable from the information reported in
the original paper. For Pentoney and Berger (2016),
the data plotted in their Figure 3 are pooled across the
three separate conditions whose stimuli included bar
graphs. A similar pattern of bimodality is evident in all
three of those conditions. We used WebPlotDigitizer to
extract the raw data from Pentoney and Berger’s (2016)
Figure 3.
Statistical tools and packages
The graphs shown in Figures 12, 16, and 18 were
produced by, a suite of web-based
data visualization apps coded in R by the second
author using the R Shiny package. The statistics
and distribution graphs shown in Figures 14, 17, 19,
and 20 were produced via the ESCI R package. The
bimodality simulations and graphs shown in Figure 15
were produced in base R, and the numerical analyses of
bimodality using Hartigan’s Dip Statistic (HDS) and
Bimodality Criterion (BC) (Freeman & Dale, 2013;
Pster et al., 2013) were computed via the mousetrap
R package. CIs on HDS, BC, and Cohen’s d were
computed with resampling (10,000 draws) via the boot
R package. The clustering analyses that estimated BTL
error cutos and prevalence were conducted via SPSS.
This paper contains three separate sets of results:
first, the core quantitative study of the BTL error
(subsections Overview of core quantitative study
through Educational and demographic correlates);
second, the study of the prevalence of mean versus
count bar graphs (subsection Ecological exposure to
mean versus count bar graphs); and third, a reanalysis
of prior studies of mean bar graph accessibility, the
results of which suggest that the previously reported
“within-the-bar bias” was actually not a bias at all, but
instead was caused by the BTL error we report here
(subsection A reexamination of prior results).
Overview of core quantitative study
From the Draw Datapoints on Graphs (DDoG)
measure readouts shown in Figures 4, 5, and 6, one
can already see the severe, stereotyped nature of the
Bar-Tip Limit (BTL) error. It is additionally evident
that this error is well-explained as a conflation of mean
bar graphs with count bar graphs; and the effectiveness
of the DDoG measure in revealing the BTL error is
Our main study sought to add quantitative
precision to these qualitative observations by testing
a large, education-diverse sample. Examination
of this larger sample confirms the BTL error’s
categorical nature, high prevalence, stability within
individuals, persistence despite thoughtful, intentional
responding, and independence from foundational
knowledge and graph content. Together, these results
demonstrate that mean bar graphs are subject to a
common, severe error of interpretation that makes
them a potentially unwise choice for accurately
communicating empirical results, particularly where
general accessibility is a priority, such as in education,
medicine, or popular media. These results also
establish the DDoG measure and MAGI principles as
eective tools for gaining powerful insights into graph
The core study utilizes a set of 551 DDoG measure
response drawings, or readouts, produced by an
educationally and demographically diverse sample
of 149 participants. These readouts, of four mean
bar graph stimuli that vary in form and content, are
accompanied by an array of control and demographic
measures (see Methods).
Initial observations of the Bar-Tip Limit error
Figure 13 shows the four stimulus graphs and, for
each, an example of the two common response types
submitted. The correct Bar-Tip Mean response has
datapoints distributed across the bar-tip (Figure 13,
center column), and the incorrect Bar-Tip Limit
response has datapoints on only the baseline side of the
bar-tip (Figure 13, right column).
The severe inaccuracy shown in the Bar-Tip Limit
readouts, relative to the Bar-Tip Mean readouts,
raises the possibility that the Bar-Tip Limit readouts
represent a categorical dierence in thinking: an error.
Yet the few isolated examples shown in Figures 4, 5,
6, and 13 are insufficient to conclusively distinguish
common errors of thinking from outliers. To clearly
demonstrate a common error requires a nontrivial
degree of bimodality in the response distribution
(Related works: Defining thought inaccuracies).
The analysis that follows evaluates and plots all 551
readouts on a continuous scale of datapoint placement
(Figure 14). The Bar-Tip Limit response is revealed to
be a separate mode, eliminating the possibility that
early observations were outliers and identifying the
incorrect Bar-Tip Limit response as a common error
rather than a bias or confusion.
Figure 13. DDoG measure readouts illustrating the two
common, categorically different response types for each of
the four stimulus graphs. The left column shows the four
stimulus graphs (AGE, CLINICAL, GENDER, SOCIAL). For each
stimulus graph, the center column shows a representative
correct response (Bar-Tip Mean), and the right column shows
an illustrative incorrect response (Bar-Tip Limit).
A minimum requirement to differentiate between
errors, biases, and confusions is a response scale with
sufficient granularity to produce a detailed, finely
contoured response distribution. The DDoG measure
readouts provide such granularity when analyzed via
a relatively continuous scale such as the BTL index
(Methods: Coding of DDoG measure readouts via
BTL index).
The BTL index is the percentage of drawn datapoints
that did not exceed the bar-tip (i.e., those that remain
within the bar). On this scale, a response that treats the
bar-tip as mean has a value of 50 (or 50% of datapoints
within the bar), while a response that treats the bar-tip
as limit has a value of 100 (or 100% of datapoints
within the bar). Values between 50 and 100 represent
gradations between the pure “mean” versus “limit” uses
of the bar-tip.
Figure 14e plots the BTL indices for each of the
551 drawn graphs produced by 149 participants. Keep
in mind that any degree of bimodality—even barely
detectable rises amidst a sea of responses—would
provide substantial evidence that measured inaccuracy
in the responses resulted from common errors of
interpretation, as opposed to biases or confusions. Yet
the modes here in Figure 14e are not merely detectable;
they are tall, sharp peaks with a deep valley of rare
response values in between. As remarkable as the
height and sharpness of the peaks is the trough between
them, which bottoms out at a frequency approaching
zero. In the context of the DDoG measure’s Expressive
freedom and Information richness, which enable a very
wide array of possible responses, it is striking to find
such a clear dichotomy.
The sharp peak on the right side of Figure 14e
includes 86 drawn graphs (15.6%) with BTL index
values of exactly 100. The left-hand mode is similarly
sharp, with a peak at 50 that contains 169 (30.7%)
drawn graphs. Taken together, the peaks of the two
modes alone—BTL index values of exactly 50 and
100—account for nearly half of all drawn graphs
Notably, the incorrect right peak—which treats
the bar-tip as limit rather than mean—is evident
despite two factors that, even given starkly categorical
thinking, could easily have blunted or dispersed it:
first, our conservative scoring procedures (Methods:
Coding of DDoG measure readouts via BTL
index), and second, potential response variation
based on differences in graph stimuli form and
In addition to the distinct bimodality in the data, the
particular locations of the modes are revealing: one
mode is at the correct Bar-Tip-Mean location, and the
other is at the precise incorrect Bar-Tip-Limit location
that is expected when mean bar graphs are conflated
with count bar graphs.
The sharpness and location of these peaks not only
provide clarity with regard to mechanism—an error, a
conflation—but they also underscore the precision of
the measure that is able to produce such clear results.
Notably, the DDoG measure, using the BTL index, is
fully capable of capturing patterns of bias (quantitative
leaning, Figures 14c, 8c), or confusion (nonsystematic
perplexity, Figures 14d, 8d), or values that do not
correspond to any known cause. Yet here we see it
clearly reveal an error (Figures 14b, 8b) with conflation
as its apparent cause.
Panels f through i, on the right side of Figure 14,
show BTL indices plotted separately for each stimulus
graph. These plots are nearly indistinguishable from
each other, and from the main plot (14e), despite
substantial differences between the four stimuli
(Figure 11). These plots thereby demonstrate four
internal replications of the aggregate results shown
in the main plot (14e). Such replications speak to the
generality of the BTL error phenomenon.
Figure 14. Bimodal distribution of Bar-Tip Limit (BTL) index values reveals that the BTL error represents a categorical difference. The
main graph (e) plots the distribution of BTL index values for all 551 DDoG measure drawings. The left inset graphs show visual
definitions of: (a) baseline (no systematic inaccuracy), (b) errors, (c) biases, and (d) confusions (Figure 8). Note how closely e (our
data) matches b (data pattern for errors). The right inset graphs (f, g, h, i) plot BTL indices by graph stimulus (AGE, CLINICAL, GENDER,
SOCIAL), which provide four internal replications of the aggregate result (e). The colors on the x-axes indicate the location of the
computed cutoff of 80, taken from our cluster analyses, between the two categories of responses: Bar-Tip Mean responses (green)
and Bar-Tip Limit (BTL) responses (yellow).
Critically, these analyses demonstrate that Bar-Tip
Limit drawings like those shown in Figure 13 are in
no way rare or atypical. They are not outliers. Nor are
they the tail of a distribution. Rather, they represent
their own mode; a categorically different, categorically
incorrect interpretation, or error. We next turn to a
more formal assessment of BTL error prevalence.
Prevalence of the BTL error: One in five
Having rst characterized the BTL error as a severe
error in graph interpretation (Figures 13,14b, 14e),
we then assessed its prevalence in our sample using
a classication-based approach. A multi-method
cluster analysis yielded the BTL index of 80 as a
consensus cuto between Bar-Tip Limit and Bar-Tip
Mean responses (Methods:Establishing cutos and
prevalence). We marked Figure 14 (panels e to i)
with green and yellow x-axis colors to capture this
distinction. At this cuto, 122 of 551 graphs, or 22.1%,
95% CI [18.9, 25.8], demonstrated the BTL error.
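The prevalence estimate and its interval can be reproduced with a Wilson score interval, sketched below. The choice of the Wilson method is an assumption made here: the text does not name the interval method (the figures were produced via the ESCI R package), although the Wilson interval matches the reported bounds. The function name is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (default: 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Counts taken from the text: 122 of 551 drawn graphs showed the BTL error.
low, high = wilson_interval(122, 551)
print(round(100 * 122 / 551, 1))                  # 22.1
print(round(100 * low, 1), round(100 * high, 1))  # 18.9 25.8
```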
Given the strength of bimodality in these data, and
the sharpness of the BTL peak, BTL error prevalence
values are remarkably insensitive to the specific choice
of cutoff. For example, across the full range of cutoffs
(75 to 85) produced by the multiple cluster analyses we
carried out, these estimates remain around one in five
(see Methods: Establishing cutoffs and prevalence). As
we will see below, this one in five prevalence also turns
out to be highly consistent with a careful reanalysis of
past results (A reexamination of prior results).
Analytic quantification of bimodality
A next step, complementary to Figure 14’s graphical
elucidation of the BTL error’s categorical nature,
was to analytically quantify the degree of bimodality
in the data. We do this via two standard analytical
bimodality indices: Hartigan’s Dip Statistic (HDS)
and the Bimodality Coefficient (BC). HDS and BC are
complementary measures that are often used in tandem
(Freeman & Dale, 2013; Pfister et al., 2013). HDS and
BC utilize idiosyncratic scales that range from 0.00 to
0.25, and from 0.33 to 1.00, respectively. To illustrate
how these scales quantify bimodality, Figure 15 plots
simulated distributions of 10,000 datapoints (middle),
with both their HDS (top, blue) and BC (bottom,
purple) scores. The HDS and BC scores for this dataset
(the 551 drawn graphs) are indicated with arrows,
accompanied by 95% confidence intervals (CIs).
Clearly, the uncertainty in the data is small relative to
the exceptionally high magnitude of bimodality. Again,
Figure 15. Two analytic approaches confirm that the BTL index
data are strongly bimodal. Distribution graphs show reference
data sets where varying degrees of separation were imposed on
independently generated normal distributions. Separation is
quantified in units of standard deviation (SD). Two standard
bimodality statistics for each reference distribution were
computed: Hartigan’s Dip Statistic (HDS) is shown above in
blue; the Bimodality Coefficient (BC) is shown below in purple
(Freeman & Dale, 2013; Pfister et al., 2013). HDS and BC for our
full data set, marked with arrows at top and bottom, along with
95% CIs, confirm strong BTL index bimodality (Table 4
bimodality statistics provide four internal replications of this
overall result).
the clarity of this result necessarily reflects not only
the bimodal nature of the BTL error, but also the high
resolution of the DDoG measure and the efficacy of
the MAGI principles used in its design.
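To make the BC scale concrete, the sketch below implements the commonly used sample formula for the Bimodality Coefficient, based on bias-corrected skewness and excess kurtosis. It is a plain-Python illustration, not the mousetrap R implementation used for the reported analyses, and the simulated values are not data from the study.

```python
import random


def bimodality_coefficient(values):
    """Sample Bimodality Coefficient (BC).

    BC = (skew**2 + 1) / (kurt + 3 * (n - 1)**2 / ((n - 2) * (n - 3))),
    with bias-corrected sample skewness (skew) and excess kurtosis (kurt).
    """
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    g1 = m3 / m2 ** 1.5        # uncorrected skewness
    g2 = m4 / m2 ** 2 - 3.0    # uncorrected excess kurtosis
    skew = (n * (n - 1)) ** 0.5 / (n - 2) * g1
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)
    return (skew ** 2 + 1.0) / (kurt + 3.0 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


# Simulated values for illustration only.
rng = random.Random(1)
unimodal = [rng.gauss(50, 5) for _ in range(2000)]
two_peaks = [50.0] * 100 + [100.0] * 100

print(round(bimodality_coefficient(unimodal), 2))   # near 1/3 for normal data
print(round(bimodality_coefficient(two_peaks), 2))  # near 1 for two sharp peaks
```

The two simulated cases sit near the bottom and top of the 0.33 to 1.00 range described in the text.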
Consistent with the replications we see in the
plotted data (Figures 14e–i), the analytic results
per stimulus graph, collected in Table 4, show clear
internal replications of the aggregate results, verifying
the consistency of the finding of bimodality across
substantial differences in form and content between the
four stimuli.
Cognitive mechanisms of the BTL error
Even before our systematic investigation, Figures 1
through 6 suggested a conflation of mean and count
bar graphs as the apparent cognitive mechanism for the
BTL error. The case for conflation is heightened by the
clear second mode shown in Figures 14e–i. A second
mode indicates a common, erroneous thought process
(Figure 8b, Figure 14b), and that mode’s position at
the Bar-Tip Limit response, with a BTL index of 100,
signals a conflation with count bar graphs in particular.
Here we evaluate potential sources of this conflation.
BTL error stability within individuals
A rst question is whether the BTL error represents
a relatively stable understanding, or whether it occurs
ad hoc, reecting a more transient thought process. To
answer this question, we rst investigate consistency of
the BTL error in individuals.
We nd substantial consistency. The scatterplot
matrix shown inFigure 16 plots BTL indices using
pairwise comparison. Each scatterplot represents the
comparison of two graph stimuli, and each pale gray
dot plots a single participant’s BTL indices for those
two graphs: the x-coordinate is an individual’s BTL
index for one graph, and the y-coordinate is their BTL
index for a second graph.
If the individual’s BTL index is exactly the same
for both graphs, their dot will fall on the (black)
line of equivalence (the diagonal line where x = y).
As a result, the distance of a dot from the line of
equivalence demonstrates the difference in BTL index
values between the two compared stimulus graphs. The
dots cluster around the line of equivalence in every
scatterplot in Figure 16, indicating that individuals
show substantial consistency of interpretation across
The dots are, additionally, semitransparent, so that
multiple dots in the same location show up darker. For
example, the very dark dot at the top-right corner of
the top-right scatterplot represents the 18 participants
whose BTL index values on both the GENDER
graph (x-axis) and AGE graph (y-axis) are 100. Thus,
the presence of dark black dots at (x = 100, y =
100) and (x = 50, y = 50) in all of the scatterplots
is a focused visual indicator of consistency within
Together, these scatterplots confirm what the two
DDoG measure readouts in Figure 5 suggested: that
an individual’s interpretation of one graph is highly
predictive of their interpretation of the other graphs,
and that those who exhibit the BTL error for one graph
are very likely to exhibit it in most or all other graphs,
regardless of graph form or content.
Further supportive of this conclusion of consistency,
we found that a Cronbach’s alpha internal consistency
statistic (Cronbach, 1951), computed on the entire
dataset shown in Figure 16, was near ceiling, at
0.94. This high internal consistency indicates that
an individual’s BTL indices across even just these
four graphs can capture their relative tendency
toward the BTL error with near-perfect precision.
In sum, the BTL error appears to represent not
just an ad hoc, one-off interpretation, but rather, a
more stable thought process that generalizes across
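The internal consistency statistic mentioned above can be sketched as follows. This is a plain-Python version of the standard Cronbach's alpha formula, applied to invented scores; it assumes every participant has a score for every graph, and the text does not say how participants with excluded drawings were handled. The function names are hypothetical.

```python
def variance(values):
    """Sample variance (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def cronbach_alpha(rows):
    """Cronbach's alpha for a participants-by-items table of scores."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)


# Invented BTL indices for five participants on four graphs
# (illustration only, not data from the study).
scores = [
    [50, 50, 55, 50],
    [100, 100, 100, 95],
    [50, 45, 50, 50],
    [100, 100, 90, 100],
    [55, 50, 50, 60],
]
print(round(cronbach_alpha(scores), 3))
```

Alpha approaches 1 when each participant's scores are nearly identical across graphs while participants differ from one another, which is the pattern described for Figure 16.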
Table 4. Data bimodality analysis, per stimulus graph.
Independence of the BTL error from graph content or
Conceivably, despite the substantial stability of
erroneous BTL interpretations within individuals, there
might still exist global differences, from one stimulus to
another, in levels of erroneous BTL thinking.
Remember that we chose our four graphs to vary
substantially in both form (unidirectional versus
bidirectional bars) and content (independent variables
of age, therapeutic intervention, social situation,
and gender, and dependent variables of memory,
depressive symptoms, enjoyment, and visuospatial
performance). If BTL interpretations were to differ
substantially between pairs of these stimuli, it
could suggest that such form or content differences
played a role in generating or moderating the BTL
Figure 16 plots global mean BTL index values as red
dots. The further from the black line of equivalence
(where x = y) a red mean dot lies, the more divergent
the average interpretation is between the two graphs.
All red mean dots are close to the line of equivalence,
indicating that participants, as a group, exhibit similar
levels of BTL interpretation for all graph stimuli. The
small distances between the red dot and black line on
the graph are echoed numerically in the small Cohen’s d
eect sizes (Figure 16,“d=”). Despite a slight tendency
for bidirectional graph stimuli (SOCIAL, GENDER) to
produce greater BTL interpretation than unidirectional
graph stimuli (AGE, CLINICAL), the Cohen’s d values
do not exceed the 0.20 threshold which denes a small
Figure 16. The Bar-Tip Limit (BTL) error persists across differences in graph form and content. Each of the six subplots in this figure
compares individual BTL index values for two graph stimuli, one plotted as the x-value and the other as the y-value. Each gray dot
represents one participant’s data. Mean values are shown as red dots. Gray dots are semitransparent so that darkness indicates
overlap. High overlap is observed near BTL index values of 50 (Bar-Tip Mean) and 100 (Bar-Tip Limit) for both stimulus graphs,
indicating persistence of interpretation between compared stimuli, despite differences in graph form and content. This persistence is
reflected numerically in the high correlations among (Pearson’s r), and the small mean differences between (Cohen’s d), graph stimuli.
The high correlations are echoed visually by steep lines of best fit (blue), and the small differences are echoed visually by the
proximity of mean values (red dots) to lines of equivalence (black lines, which show where indices are equal for the two stimuli). In
brackets are the 95% CIs for r and d.
eect (Cohen, 1988). The BTL error thus appears to
occur relatively independently of graph content or
Independence of the BTL error from foundational
A next step in isolating the thought processes
that underlie the apparent conflation of mean bar
graphs with count bar graphs is to ask what relevant
foundational knowledge the participants have.
We assessed foundational knowledge in two ways:
(1) a denition task: asking participants to dene
the concept of average/mean in their own words
immediately after the drawing task (Denition task:
Dene the average/mean), and (2) a graph reading task:
asking participants to read a particular average/mean
value from stimulus graphs that they later sketched for
the DDoG measure drawing task (Graph reading task:
Find the average/mean). In the denition task, 83% of
denitions were mathematically or conceptually correct,
and in the graph reading task, 93% of readings were
As Figure 17 shows, filtering responses for a
correctly defined average/mean, a correctly identified
average/mean, or both, has little effect on the percentage
of BTL interpretations. The rate of the BTL error
remains near one in five, at 96/486 = 19.8% [16.5,
23.5], 87/413 = 20.7% [17.1, 24.9], and 71/367 =
19.3% [15.6, 23.7], respectively. Together, these analyses
suggest that the BTL error persists despite foundational
knowledge about the average/mean and, moreover,
despite sufficient care to the foundational knowledge
questions to answer them correctly.
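As an illustration of how a percentage and its 95% CI can be derived from counts such as 96/486, the following is a minimal sketch. This section does not state which interval method was used; the Wilson score interval is an assumption here, although it appears to reproduce the bounds reported for 96/486 and 71/367 to one decimal place. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Assumption: z = 1.96 approximates a two-sided 95% interval.
    Returns (lower, upper) as proportions in [0, 1].
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Two of the filtered subsets reported in the text (BTL readouts / usable readouts).
for k, n in [(96, 486), (71, 367)]:
    lo, hi = wilson_interval(k, n)
    print(f"{k}/{n}: {100 * k / n:.1f}% [{100 * lo:.1f}, {100 * hi:.1f}]")
```

The Wilson interval is a reasonable choice for proportions far from 50% or for small counts, because its bounds stay inside 0 to 100%.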
The BTL error occurs despite thoughtful, intentional responding
It was also relevant to establish that erroneous
BTL interpretations were thoughtful and intentional
by exploring whether carelessness, rushing, or task
misunderstanding were major factors in BTL error. In
addition to the results shown in Figure 17, two further
factors suggest that the BTL error persists despite
thoughtful attention specifically to the DDoG drawing
task itself.
First, drawn graphs for which the instructions were
not followed sufficiently to provide usable data were
excluded from the start (Methods: Collection of DDoG
measure readouts). Here, the Expressive freedom
of the DDoG measure worked to our advantage.
More constrained measures—for example, button
and task misunderstanding difficult to detect and
segregate from authentic task responses. The DDoG
measure, in contrast, makes it much harder to produce
even a single usable response without a reasonable
degree of intentional, thoughtful responding, thus
guaranteeing a certain intentionality in the resulting
data set.
Figure 17. The Bar-Tip Limit (BTL) error occurs despite correctly
defining “mean” and correctly locating it on the graph.
Percentage of readouts that showed the BTL error (defined as a
BTL index over 80). “All” is the full dataset (n = 551 readouts).
“Correctly defined average/mean” is restricted to participants
who produced a correct definition for the mean (n = 486
readouts). “Correctly identified average/mean” is restricted to
participants who correctly identified a mean value on the same
graph that produced the readout (n = 413 readouts). “Correctly
defined & identified average/mean” is restricted to participants
who both correctly defined the mean and correctly identified a
mean value on the graph (n = 367 readouts). In all cases, the
proportion of BTL errors hovers around one in five. Vertical
lines show 95% CIs, and gray regions show full probability
distributions for the uncertainty around the percentage values.
Second, rushing was evaluated as a possible
contributing factor to erroneous BTL interpretations
by computing the correlation of drawing time of usable
graphs with BTL indices, on a graph-by-graph basis.
This correlation was very close to zero: r(371) = –0.01,
95% CI [–0.11, 0.09]. Due to skew in the drawing time
measure, we additionally computed a nonparametric
(Spearman) correlation, which was also close to zero:
rho(371) = 0.004, 95% CI [–0.10, 0.11]. The lack of
correlation between drawing speed and BTL index
provides further evidence that lack of intentional,
thoughtful responding was not a major contributor to
erroneous BTL interpretations.
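The correlation analysis described above can be sketched as follows. This is an illustrative outline, not the authors' analysis code: the Fisher z-transformation for the confidence interval is an assumption (it appears consistent with the reported r(371) = –0.01, 95% CI [–0.11, 0.09], taking n = df + 2 = 373), and all function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def fisher_ci(r, n, z=1.96):
    """Approximate 95% CI for a correlation via Fisher's z-transformation."""
    se = 1 / math.sqrt(n - 3)
    zr = math.atanh(r)
    return math.tanh(zr - z * se), math.tanh(zr + z * se)
```

In use, `pearson_r(drawing_times, btl_indices)` and `spearman_rho(drawing_times, btl_indices)` would be computed over per-graph values (both list names are placeholders), and `fisher_ci(r, n)` would give the interval. Applying the Fisher interval to a Spearman coefficient is only an approximation.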
Figure 18. The Bar-Tip Limit (BTL) error is substantially
independent of education. Mean BTL indices for each
participant plotted against general education level, number of
psychology courses taken, and number of statistics courses
Educational and demographic correlates
BTL error correlates of individual differences: Education
The existence of individual differences in the
likelihood of making the BTL error (stability within
individuals) raises questions about their correlates
and potential origins. With this in mind, we evaluated
the dataset for demographic predictors. Of particular
interest is whether prior educational experiences,
general or specific, correlated with BTL readouts. The
stability that the BTL index exhibited within individuals
across graphs made it possible to use each individual’s
mean BTL index to evaluate educational correlates.
Figure 18 shows that while general education and
psychological coursework did not robustly predict mean
BTL indices, statistics coursework did predict a slightly
more correct (lower) mean BTL index.
This pattern of correlations raises several questions
about education’s relationship to the BTL error. First,
what causes the nonzero relationship to statistics
coursework, and does this relationship indicate that
the BTL error is malleable and amenable to training?
Second, why are all of these correlations not higher?
Does this mean that the BTL error is such a tenacious
belief that it is resistant to change? And, alternatively,
are there aspects of standard statistical, psychological,
or general education that could be improved to more
eectively ameliorate the BTL error?
BTL error correlates of individual differences: Age,
gender, country
In addition to education, we examined potential
correlates: age, gender, and location (i.e., the country
from which one completed the study). Age showed a
modest correlation with mean BTL indices, with older
individuals demonstrating somewhat more BTL error
than younger individuals (r(146) = 0.21, 95% CI [0.05,
0.36]). Mean BTL indices were slightly lower among
males, but not statistically significantly so (r(146) =
0.10, 95% CI [–0.06, 0.26]). While modest numerical
differences in mean BTL indices were found between
countries (United States M = 66.1, SD = 20.3, n = 36;
United Kingdom M = 61.2, SD = 15.6, n = 61; Other
countries M = 64.0, SD = 19.6, n = 52), none of these
differences reached statistical significance (all p values
taken (each reported via the four-point scale shown on the
respective graph). The black line is the least-squares regression
line, computed with rated responses treated as interval-scale
data. Axis ranges and graph aspect ratios were chosen such that the physical slope of each
regression line equals its respective correlation coefficient (r).
Ecological exposure to mean versus count bar graphs
As we have seen, the apparent conflation of mean
and count bar graphs demonstrated by the BTL
error does not appear to be a thoughtless act or
careless misattribution. The data show, rather, that
one in five participants capable of understanding
a mean, identifying it on a graph, and producing
attentive, intentional, consistent drawings make this
error.
Broadly speaking, such thoughtful conflation could
arise from either or both of “nature” and “nurture.”
Influences of nature might include fundamental,
inborn aspects of perception, such as compelling visual
interpretations of bars as solid objects (containers,
stacks). Influences of nurture might include real-world
experiences as varied as educational curricula and mass
media.
While a full investigation of such developmental
inuences was beyond the scope of the current
investigation, we were able to probe a key avenue made
salient by our understanding of early graph pedagogy
(Figure 1): putative exposure to mean and count bar
graphs in early education and beyond.
As an investigation of the relative prevalence of
these two bar-graph types, we counted bar graphs
from three age-correlated sources. First, we evaluated
a set of Google Image searches using the phrase “bar
graph [X] grade” where X was “1st” through “6th”
(Methods: Prevalence of bar graphs of means versus
bar graphs of counts). This search yielded an array of
graphs embedded in elementary pedagogical materials.
Such graphs were deemed interesting due to early
education’s role in forming assumptions that may
carry forward into adulthood. Second, we performed a
general Google Image search for “bar graph,” included
as a window into the prevalence of count versus mean
bar graphs on the internet as a whole. Our third and
final source was a set of eight widely used college-level
Introductory Psychology textbooks, included as a
common higher-level educational experience (Ciccarelli
& Berstein, 2018; Coon et al., 2018; Gazzaniga, 2018;
Griggs, 2017; Hockenbury & Nolan, 2018; Kalat, 2016;
Lilienfeld et al., 2017; and Myers & DeWall, 2017).
As Figure 19 shows, in all three sources, mean
graphs constituted only a minority of bar graphs. The
effect was most pronounced among the elementary
education materials, where mean bar graphs were
almost nonexistent (1/81 = 1.2%, 95% CI [0.2, 6.7]).
Mean bar graphs were slightly more prevalent in the
general Google Image search, at 17% (17/100 =
17.0%, 95% CI [10.9, 25.5]). Compared to the other
sources, Introductory Psychology textbooks had
a higher proportion of mean bar graphs, at 36%
(53/149 = 35.6%, 95% CI [28.3, 43.5]). This higher
proportion was predictable, given that in psychology
research, analyses of mean values across groups or
experimental conditions are often a central focus.
Yet even in the psychology textbook sample, mean
bar graphs still constituted only about a third of bar
graphs.
Figure 19. Count bar graphs are more common than mean bar
graphs across three major domains. Mean bar graphs plotted
as a percentage of total bar graphs (i.e., mean bar graphs plus
count bar graphs). “Elementary Educational Materials”: Google
Image search for grades 1–9 (n = 81). “General Google Search”:
Google search for “bar graph” (n = 100). “College Psychology
Textbooks”: bar graphs appearing in eight widely used college
Introductory Psychology textbooks (n = 149). Vertical lines
show 95% CIs, and gray regions show full probability
distributions for uncertainty around the percentage values. In
all three cases, the percentage of mean bar graphs remains
substantially, and statistically robustly,
below 50%.
This analysis identies elementary education both as
a plausible original source of erroneous BTL thinking
and as a potential lever for remediating it. Further,
the preponderance of count bar graphs across all
three of these sources suggests a pattern of iterative
reinforcement throughout life. One’s understanding of
bar graphs, limited to count bar graphs in elementary
school (Figures 1,Figure 19, and reinforced through
casual exposure (Figure 19, Google), may represent a
well-worn, experience-hardened channel of thinking
that is modied only slightly by higher education
(Figure 18; BTL error correlates of individual
differences: Education).
In this context, count bar graph interpretation
plausibly becomes relatively effortless: a count graph
heuristic, or mental shortcut (Shah & Oppenheimer,
2008; Gigerenzer & Gaissmaier, 2011). In the context
of real-world exposure, where mean bar graphs are
a clear minority, such a count graph heuristic would
generally produce the correct answer. Only occasionally,
when misapplied to a mean bar graph, would it lead
one astray.
Moreover, the greater abstraction, ambiguity, and
complexity of mean bar graphs relative to count
bar graphs, added to the visually identical forms of
these two bar graph types (Figure 2), provides ideal
fodder for such a heuristic-based conflation. Therefore,
the apparent conflation that we observe may result
from an experience-fueled heuristic, or cognitive
shortcut.
A reexamination of prior results
As discussed in Related Works (Literature Related to
the BTL Error), the probability rating scale approach
used in prior studies of mean bar graph accessibility
had successfully identied a reproducible asymmetry:
higher rated likelihood of data in-bar than out-of-bar
for the average person (Newman & Scholl, 2012;Correll
& Gleicher, 2014;Pentoney & Berger, 2016;Okan et al.,
2018). For nearly a decade, that asymmetry—dubbed
the “within-the-bar bias”—arguably provided the
strongest existing evidence against the accessibility of
mean bar graphs.
With our current results in hand, we now reexamine
those prior reports of asymmetry, offering four key
insights: First, the prior work observed a prevalence
remarkably similar to ours. Second, the bimodality and
accurate left peak in prior results indicate that the prior
work misidentified an error as a bias. Third, a flatter,
wider right data cluster in Pentoney and Berger’s study
(2016), compared to ours, highlights a key limitation
of the probability rating scale approach used in prior
work. Fourth, recent changes in methodological and
statistical practices shed light on choices that may have
affected result characterization in prior work.
The rst two observations—similar prevalence
and strong bimodality—conrm that the previously
reported “within-the-bar bias” was likely not a bias
at all, but rather the same BTL error, and the same
apparent count bar conation, that is revealed here
by our data. The latter two observations—right-mode
dierences and changing statistical practices—help
to explain how both error and conation could have
remained undiscovered in prior work despite over
a decade’s worth of relevant, robust, self-consistent
evidence from numerous independent labs.
Prior work reflects the same phenomenon: Prevalence
Our data demonstrate the Bar-Tip Limit (BTL) error
in about one in five persons, across educational levels,
ages, and genders, and despite thoughtful responding
and relevant foundational knowledge. Prior studies
differed from ours in that they were conducted by
different research groups, at different times, with
different participant samples, using a different measure
(Newman & Scholl, 2012; Correll & Gleicher, 2014;
Pentoney & Berger, 2016; Okan et al., 2018). If, despite
all these differences, a prevalence reasonably comparable
to our one in five was detectible in one or more of the
prior studies, it would provide striking evidence that: (1)
those past results were likely caused by the same BTL
error phenomenon and (2) the underlying BTL error is
robust to whatever differs between our study and the
past studies.
Pentoney and Berger (2016, hereafter “PB study”)
provide a particularly rich opportunity to reexamine
prevalence, because, unique among the past studies
cited above, the PB study produced graphs that showed
raw data for each participant. A careful look (see PB
study data replotted as our Figure 20b) demonstrates
a familiar bimodality. Additionally, the same analytic
tests used to quantify the bimodality in our own data
show a high degree of bimodality in the PB study data
(HDS =0.068 [0.042, 0.098]; BC =0.66 [0.56, 0.75]),
which strongly supports the principled separation of
those data into two groups.
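For readers unfamiliar with the BC statistic, the sketch below shows one common way to compute a bimodality coefficient. It assumes that BC refers to Sarle's bimodality coefficient, which this section does not confirm (the definition presumably appears in Methods). The bias-corrected skewness and kurtosis formulas and the 5/9 benchmark (the value for a uniform distribution) are the conventional choices, not details taken from this paper, and the function name is hypothetical. The dip statistic (HDS) is not sketched because it needs a dedicated implementation.

```python
import math


def bimodality_coefficient(values):
    """Sarle's bimodality coefficient (assumed meaning of "BC").

    Uses bias-corrected sample skewness and excess kurtosis.
    Values above 5/9 (about 0.555) are conventionally read as
    suggesting bimodality.
    """
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 observations")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("values have zero variance")
    g1 = m3 / m2 ** 1.5          # population skewness
    g2 = m4 / m2 ** 2 - 3        # population excess kurtosis
    skew = math.sqrt(n * (n - 1)) / (n - 2) * g1
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return (skew ** 2 + 1) / (kurt + correction)


# Hypothetical toy data, not values from either study.
two_clusters = [0] * 50 + [10] * 50
single_peak = [1] + [2] * 4 + [3] * 6 + [4] * 4 + [5]
print(bimodality_coefficient(two_clusters) > 5 / 9)
print(bimodality_coefficient(single_peak) > 5 / 9)
```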
We applied to the PB study the same five cluster
analysis techniques used to define the correct/error
cutoff in our own data (Methods: Establishing
cutoffs and prevalence). These analyses yielded an
average prevalence estimate of 20.5%. Based on visual
inspection of the PB study data in Figure 20b, this
cutoff does a good job of separating the two apparent
clusters of responses. Additionally, the resulting 20.5%
prevalence estimate closely replicates the approximate
one in five estimate from our own sample and suggests
the BTL error as the source of the asymmetry reported
in the PB study.
While none of the other three prior probability
rating scale studies (Newman & Scholl, 2012; Correll
& Gleicher, 2014; Okan et al., 2018) provided raw
underlying data of the sort that would enable a
formal analysis of bimodality and prevalence, a single
parenthetical statement in Newman and Scholl’s (2012)
Study 5 (hereafter NS5) is suggestive. It notes:
“In fact, 73% of the participants did rate the [in-bar vs out-
of-bar] points as equally likely in this study. But ... the other
27% of participants — i.e., those in the minority who rated
the two points dierently — still reliably favored the point
within the bar.”
In other words, 73% of participants showed no
asymmetry, which is the expected response for an
interpretation of the bar-tip as a mean. The asymmetry
shown by the other 27% of participants, while not
numerically broken down by direction and magnitude,
“reliably favored the point within the bar”; and the
average result, including the 73% of participants with
zero asymmetry, was a robust in-bar asymmetry. So,
the notion that around 20% may have shown a rather
large in-bar asymmetry, as in the PB study, seems highly
plausible. We plotted the zero asymmetry and nonzero
asymmetry results for the NS5 study as Figure 20c.
Figure 20. A reexamination of prior results reveals both consistency with our current results and overlooked evidence for the
Bar-Tip Limit (BTL) error. Shown side by side for direct comparison are plotted data from: (a) The present study using the DDoG
measure; (b) Pentoney and Berger (2016) (PB); (c) Newman and Scholl (2012) Study 5 (NS5). All graphs label correct Bar-Tip Mean
response values in green and computed (or inferred, in the case of NS5) Bar-Tip Limit (BTL) response values in yellow. Notably, the PB
study used the exact same 9-point rating scale and graph stimulus (of hypothetical chemical freezing temperature data) as the NS5
study (Figure 9 shows the scale and stimulus) but added ratings of four additional temperatures (–15, –10, 10, 15) to NS5’s original
two (–5, 5). Comparing PB’s results (b) to ours (a) suggests that while PB’s version of the probability rating scale measure achieved a
fairly high degree of accuracy and precision at values near zero, it still suffered from apparently irreducible inaccuracy and/or
imprecision as values diverged from zero (see text for further discussion).
In sum, prevalence estimates gleaned from past
studies are highly consistent with our own prevalence
estimates, suggesting both that the asymmetry found
in past studies was caused by the same BTL error, and
that this error of interpretation is highly robust to
dierences between studies.
Prior work reflects the same phenomenon: Error, rather
than bias
In addition to providing prevalence, the distinct
clusters of responses and strong bimodality in the PB
study’s data demonstrate a key feature of erroneous
graph interpretation: two discernible peaks, or modes
(Figures 8b, 21a). A second key feature of erroneous
interpretation is that one of the two modes is on the
correct interpretation.
Both the DDoG and PB data plots demonstrate
two peaks, and in each plot, one of those peaks
corresponds exactly to the correct value for a Bar-Tip
Mean interpretation (left mode in Figures 8b, 21a and
20b, 21b).
In the PB study, this correct value is 0. The PB
study’s analyses used a difference score: the mean in-bar
rating minus the mean out-of-bar rating. This score
ranged from –8 to 8, in increments of 1/3. The task was
taken directly from the NS5 study (same stimulus, same
response scale, see Figure 9a). However, the PB study
had participants rate three in-bar locations (rather than
NS5’s one) and three out-of-bar locations (rather than
NS5’s one). The study therefore had more datapoints,
and greater continuity of measurement, than that of
NS5 (Figure 20b).
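The difference score just described can be illustrated with a short sketch. It assumes ratings are coded as integers 1 to 9 on the 9-point scale (the coding is not stated in this section) and uses exact fractions so that the 1/3 increments are visible. The function name `difference_score` and the example ratings are hypothetical.

```python
from fractions import Fraction


def difference_score(in_bar, out_bar):
    """Mean in-bar rating minus mean out-of-bar rating, as an exact fraction.

    Assumption: each rating is an integer from 1 to 9.
    """
    in_mean = Fraction(sum(in_bar), len(in_bar))
    out_mean = Fraction(sum(out_bar), len(out_bar))
    return in_mean - out_mean


# Hypothetical participants, each rating three in-bar and three out-of-bar locations.
print(difference_score([5, 5, 5], [5, 5, 5]))  # equal ratings on both sides
print(difference_score([9, 9, 9], [1, 1, 1]))  # maximal in-bar asymmetry
print(difference_score([6, 5, 5], [5, 5, 5]))  # smallest nonzero step
```

With three ratings per side, every achievable score is a multiple of 1/3 between –8 and 8, which matches the range and increment given in the text.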
A dierence score near 0 indicates similar in-bar
and out-of-bar ratings, which is the expected response
for a Bar-Tip Mean interpretation. Approximately
25% of the PB study’s participants (30 of 118) gave a
response of exactly 0, and more than half (64 of 118,
or 54%) gave a response in the range –0.67 to 0.67,
creating a tall, sharp mode at the correct Bar-Tip Mean
Results from the NS5 study tell a similar story.
Despite the NS5 study’s relatively low-resolution
measure (the single rated in-bar location and out-of-bar
location), and limited reporting of individual values
(only the proportion of response scores that were
zero versus nonzero was reported), we can still see
that the NS5 data contained a mode on the correct
value (Figure 20c), replicating a feature of erroneous
interpretations. Indeed, a full 73% of participants gave
0 as their response, particularly strong evidence against
bias (which tends to shift a peak away from the correct
value) and for error (which tends to preserve a peak at
the correct value; see Figure 8).
Therefore, a careful reexamination of the available
information from the PB and NS5 studies shows strong
evidence for a causative, systematic error in participant
interpretation. The one in ve prevalence of asymmetry,
strong bimodality, and a response peak at the correct
value are three further indications that the previously
observed eects were likely caused by the same BTL
error that we observed via the DDoG measure.
How the DDoG measure supported the identification of
the BTL error
If the BTL error truly is at the root of these prior
reports of asymmetry, why did its existence and
apparent mechanism (conflation of mean bar graphs
with count bar graphs) elude discovery for over a
decade?
Though our reexamination of past results (Newman
& Scholl, 2012; Correll & Gleicher, 2014; Pentoney &
Berger, 2016; Okan et al., 2018) has so far focused on
ways in which the probability rating scale data align
closely with our DDoG measure findings (Figure 20), the
answer to our question may lie in the areas of difference.
Figure 20 shows a salient way in which the DDoG
measure response distribution (Figure 20a) differs from
that of the PB study (Figure 20b): the shape of the
rightmost (erroneous) peak. The DDoG measure’s
BTL error peak is sharp, and its location corresponds
exactly to the expectation for a count bar graph
interpretation (i.e., a BTL index of 100, representing
a perfect Bar-Tip Limit response; Figure 20a). The PB
study’s probability rating scale data, in contrast, show
no such sharp right peak. The right data cluster is flatter
and more dissipated (Figure 20b).
To understand why these response distributions
would differ so much if they measure the same effect,
consider a core difference between measures. The
probability rating scale requires participants to translate
their interpretation of a graph into common English
words (“likely,” “unlikely,” “somewhat,” and “very”)
(see Figure 9a). This location-to-words adjustment is a
prime example of the sort of mental transformation
that we sought to minimize via the MAGI principle
of limited mental transformations. Such translations
infuse the response with the ambiguity of language
interpretation (e.g., does every participant quantify
“somewhat” in the same way?). A relevant analogy here
is the classic American game of “telephone,” where
translations of a message, from one person to the next,
may yield an output that bears little resemblance to the
input. Similarly, the greater the number of
mental transformations required between viewing
and interpreting a graph stimulus and generating
a response, the less precise the translation, and the
more difficult it is to reconstruct the interpretation from
the response. In such cases, it becomes difficult to use
a response to objectively judge the interpretation as
correct or incorrect, violating the MAGI principle of
Ground-truth linkage.
Further, though rating scale method responses
can indicate the presence and directionality of an
asymmetric response, they lack the necessary specificity
(e.g., “somewhat,” “very”) to delineate, in concrete
terms, the magnitude of asymmetry in the underlying
interpretation. As we have seen, it is the magnitude of
the error—the position of the second mode revealed
by the DDoG measure—that so clearly indicates its
But why does this translation issue only pertain to
the right peak of the PB study’s distribution? This is
because the (left) Bar-Tip Mean peak represents a mean
value. In a DDoG measure readout, this mean value is
concretely expressed by equal distribution of datapoints
on either side of the bar-tip. In the probability rating
scale measure, equal distribution is expressed by using
the same word to describe the likelihood of both
the in-bar and out-of-bar dot position. In this single
case of equality, the language translation issue is
obviated: so long as the person uses the same words
(e.g., “somewhat likely,” or “very unlikely”) to describe
both in-bar and out-of-bar locations, it matters little
what those words are. In this specific case only, when
the difference score is zero, the probability rating scale
has Ground-truth linkage. This, in turn, results in a tall,
sharp, interpretable left (Bar-Tip Mean) peak in the
asymmetry measure.
How changing statistical and methodological practices
supported the identification of the BTL error
We have seen that the mental transformations
required by previous rating scale methods impaired
identication of the BTL error despite compelling