
Journal of Vision (2021) 21(12):17, 1–36 1

Two graphs walk into a bar: Readout-based measurement

reveals the Bar-Tip Limit error, a common, categorical

misinterpretation of mean bar graphs

Sarah H. Kerns Department of Psychology, Wellesley College,

Wellesley, MA, USA

Jeremy B. Wilmer Department of Psychology, Wellesley College,

Wellesley, MA, USA

How do viewers interpret graphs that abstract away

from individual-level data to present only summaries of

data such as means, intervals, distribution shapes, or

eﬀect sizes? Here, focusing on the mean bar graph as a

prototypical example of such an abstracted

presentation, we contribute three advances to the study

of graph interpretation. First, we distill principles for

Measurement of Abstract Graph Interpretation (MAGI

principles) to guide the collection of valid interpretation

data from viewers who may vary in expertise. Second,

using these principles, we create the Draw Datapoints

on Graphs (DDoG) measure, which collects drawn

readouts (concrete, detailed, visuospatial records of

thought) as a revealing window into each person’s

interpretation of a given graph. Third, using this new

measure, we discover a common, categorical error in the

interpretation of mean bar graphs: the Bar-Tip Limit

(BTL) error. The BTL error is an apparent conﬂation of

mean bar graphs with count bar graphs. It occurs when

the raw data are assumed to be limited by the bar-tip, as

in a count bar graph, rather than distributed across the

bar-tip, as in a mean bar graph. In a large,

demographically diverse sample, we observe the BTL

error in about one in ﬁve persons; across educational

levels, ages, and genders; and despite thoughtful

responding and relevant foundational knowledge. The

BTL error provides a case-in-point that simpliﬁcation via

abstraction in graph design can risk severe,

high-prevalence misinterpretation. The ease with which

our readout-based DDoG measure reveals the nature

and likely cognitive mechanisms of the BTL error speaks

to the value of both its readout-based approach and the

MAGI principles that guided its creation. We conclude

that mean bar graphs may be misinterpreted by a large

portion of the population, and that enhanced

measurement tools and strategies, like those introduced

here, can fuel progress in the scientiﬁc study of graph

interpretation.

Introduction

Background on bar graphs

How can one maximize the chance that visually

conveyed quantitative information is accurately and

efficiently received? This question is fundamental to

data-driven fields, both applied (policy, education,

medicine, business, engineering) and basic (physics,

psychological science, computer science). Long a crux

of debate on this question, “mean bar graphs”—bar

graphs depicting mean values—are both widely used,

for their familiarity and clean visual impact, and

widely criticized, for their abstract nature and paucity

of information (Tufte & Graves-Morris, 1983, p. 96; Wainer, 1984; Drummond & Vowler, 2011; Weissgerber et al., 2015; Larson-Hall, 2017; Rousselet, Pernet, & Wilcox, 2017; Pastore, Lionetti, & Altoe, 2017; Weissgerber et al., 2019; Vail & Wilkinson, 2020).

Conversely, bar graphs are considered a best-practice

when conveying counts—whether raw or scaled into

proportions or percentages. These “count bar graphs”

are more concrete and hide less information than mean

bar graphs. They are usefully extensible: bars may be

stacked on top of each other to convey parts of a whole

(a stacked bar graph) or arrayed next to each other

to convey a distribution (a histogram). And they take

advantage of a core bar graph strength: the alignment

of bars at a common baseline supports rapid, precise

height estimates and comparisons (Cleveland & McGill,

1984; Heer & Bostock, 2010).

When a bar represents a count, it can be thought

of as a stack, and the metaphor of bars-as-stacks is

relatively accessible across ages and expertise levels

(Zubiaga & MacNamee, 2016). In fact, the introduction

of bar graphs in elementary education is often

Citation: Kerns, S. H., & Wilmer, J. B. (2021). Two graphs walk into a bar: Readout-based measurement reveals the

Bar-Tip Limit error, a common, categorical misinterpretation of mean bar graphs. Journal of Vision, 21(12):17, 1–36,

https://doi.org/10.1167/jov.21.12.17.

https://doi.org/10.1167/jov.21.12.17 Received July 20, 2020; published November 30, 2021 ISSN 1534-7362 Copyright 2021 The Authors

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.


Figure 1. Elementary progression of count bar graph

instruction. (a) Children are taught ﬁrst using manipulatives;

(b) then they transition to drawn stacks; (c) ﬁnally, they are

introduced to undivided bars.

accomplished in just this way: with manipulative stacks

(Figure 1a), translated to more abstract drawn stacks

(Figure 1b), and further abstracted to undivided bars

(Figure 1c).

While the mean bar graph potentially borrows

some virtues from the count bar graph, the use of

a single visual symbol (the bar, Figures 2a, 2c) to

represent two profoundly different quantities (means

and counts, Figures 2b, 2d) adds inherent ambiguity to

the interpretation of that visual symbol.

The information conveyed by these two visually

identical bar graph types differs categorically, as

Figure 2 shows. Because a count bar graph depicts

summation, its bar-tip is the limit of the individual-level

data, which are contained entirely within the bar

(Figures 2d, 2e). We call this a Bar-Tip Limit (BTL)

distribution. A mean bar graph, in contrast, depicts

a central tendency; it uses the bar-tip as the balanced

center point, or mean, and the individual-level data are

distributed across that bar-tip (Figure 2b). We call this

a Bar-Tip Mean distribution.
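The categorical difference between the two bar-tip roles can be stated numerically. The sketch below is a minimal illustration using hypothetical scores (not data from this study): for the same set of values, a mean bar's tip sits at their average, which the data straddle, while a count bar's tip sits at their number, which no datum can exceed.

```python
import statistics

# Illustrative sample of 20 individual scores (hypothetical values).
scores = [52, 48, 61, 39, 55, 45, 70, 30, 58, 42,
          64, 36, 50, 50, 67, 33, 47, 53, 60, 40]

# Mean bar graph: the bar-tip is the mean, and the raw data are
# distributed across it (a Bar-Tip Mean distribution).
mean_tip = statistics.mean(scores)
above = sum(s > mean_tip for s in scores)
below = sum(s < mean_tip for s in scores)
print(f"mean bar-tip = {mean_tip}: {above} values above, {below} below")

# Count bar graph: the bar-tip is the count, and every datum adds one
# unit to the stack beneath it (a Bar-Tip Limit distribution).
count_tip = len(scores)
print(f"count bar-tip = {count_tip}: all data contained within the bar")
```

Reading a mean bar as if it were a count bar therefore forces every datum inside the bar, which is exactly the conflation at issue here.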

The question of mean bar graph accessibility

While the rationale for using mean bar graphs—or

mean graphs more generally—varies by intended

audience, a common theme is visual simplification. Potentially, the simplification of substituting a

single mean value for many raw data values could

ease comparison, aid pattern-seeking, enhance

discriminability, reduce clutter, or remove distraction

(Barton & Barton, 1987;Franzblau & Chung, 2012).

Communications intended for nonexpert consumers,

such as introductory textbooks, may favor mean bar

graphs because their visual simplicity is assumed to

yield accessibility (Angra & Gardner, 2017). In contrast,

communications intended for experts, such as scholarly

publications, may favor mean bar graphs because their

visual simplicity is assumed to yield efficiency (Barton

& Barton, 1987). In either case, the aim is to enhance

communication. Yet simplification that abstracts away

from the concrete underlying data could also mislead if

the viewer fails to accurately intuit the nature of that

underlying data.

Visual simplification can, in some cases, enhance

comprehension. Yet is this the case for mean bar

graphs? Basic research in vision science and psychology

provides at least four theoretical causes to doubt

the accessibility of mean bar graphs. First, in many

domains, abstraction, per se, reduces understandability,

particularly in nonexperts (Fyfe et al., 2014; Nguyen

et al., 2020). Second, less-expert consumers may

lack sufficient familiarity with a mean bar graph’s

dependent variable to accurately intuit the likely

range, variability, or shape of the distribution that it

represents. In this context, the natural tendency to

discount variation (Moore et al., 2015) and exaggerate

dichotomy (Fisher & Keil, 2018) could potentially

Figure 2. Data distribution diﬀers categorically between mean

and count graphs. (a) Mean bar graphs and (c) count bar graphs

do not diﬀer in basic appearance, but they do depict

categorically diﬀerent data distributions. (b) In a mean bar

graph, the bar-tip is the balanced center point, or mean, with

the data distributed across it. We call this a Bar-Tip Mean

distribution. (d, e) In a count bar graph, the bar-tip acts as a

limit, containing the summed data within the bar. We call this a

Bar-Tip Limit (BTL) distribution.


distort interpretations. Third, the visual salience of a

bar could be a double-edged sword, initially attracting

attention, but then spreading it across the bar’s full

extent, in effect tugging one’s focus away from the

mean value represented by the bar-tip (Egly et al.,

1994). Finally, the formal visual simplicity (i.e., low

information content) of the bar does not guarantee

that it is processed more effectively by the visual system

than a more complex stimulus. Many high-information

stimuli like faces (Dobs et al., 2019) and scenes (Thorpe

et al., 1996) are processed more rapidly and accurately

than low-information stimuli such as colored circles or

single letters (Li et al., 2002). Relatedly, a set of dots is

accurately and efficiently processed into a mean spatial

location (Alvarez & Oliva, 2008), raising questions

about what is gained by replacing datapoints on a graph

with their mean value. Together, these theory-driven

doubts provided a key source of broad motivation for

the present work.

Adding to the broad concerns raised by basic vision

science and psychology research are numerous specific,

direct critiques of mean bar graphs in particular. Such

critiques, however, are chiefly theoretical, as opposed to

empirical, and focus on expert audiences, as opposed

to more general audiences (Tufte & Graves-Morris,

1983, p. 96; Wainer, 1984; Drummond & Vowler, 2011; Weissgerber et al., 2015; Larson-Hall, 2017; Rousselet, Pernet, & Wilcox, 2017; Pastore, Lionetti, & Altoe, 2017; Weissgerber et al., 2019; Vail & Wilkinson,

2020). In contrast, our present work provides a direct,

empirical test of the claim that mean bar graphs are

accessible to a general audience (see Related works and

Results for a detailed discussion of prior work of this

sort).

Aim and process for current work

To assess mean bar graph accessibility, we needed an

informative measure of abstract-graph interpretation;

that is, a measure of how graphs that abstract away from

individual-level data to present only summary-level, or

aggregate-level, information, are interpreted. Over more

than a decade, our lab has developed cognitive measures

in areas as diverse as sustained attention, aesthetic

preferences, stereoscopic vision, face recognition,

visual motion perception, number sense, novel object

recognition, visuomotor control, general cognitive

ability, emotion identification, and trust perception

(Wilmer & Nakayama, 2007; Wilmer, 2008; Wilmer et al., 2012; Degutis et al., 2013; Halberda et al., 2012; Germine et al., 2015; Fortenbaugh et al., 2015; Richler, Wilmer, & Gauthier, 2017; Deveney et al., 2018;

Sutherland et al., 2020).

The decision to develop a new measure is never

easy; it takes a major investment of time and energy to

iteratively refine and properly validate a new measure.

Yet the accessibility of abstract-graphs presented a

special measurement challenge that we felt warranted

such an investment. The challenge: record the viewer’s

conception of an invisible entity—the individual-level

data underlying an abstract-graph—while neither

invoking potentially unfamiliar statistical concepts nor

explaining those concepts in a way that could distort

the response.

Our lab’s standard process for developing measures

has evolved to include three components: (1) Identify

guiding principles (Wilmer, 2008; Wilmer et al., 2012; Degutis et al., 2013). (2) Design and refine a measure (Wilmer & Nakayama, 2007; Wilmer et al., 2012; Halberda et al., 2012; Degutis et al., 2013; Germine et al., 2015; Fortenbaugh et al., 2015; Richler, Wilmer, & Gauthier, 2017; Deveney et al., 2018; Kerns, 2019; Sutherland et al., 2020). (3) Apply the measure to a question of theoretical or practical importance (Wilmer & Nakayama, 2007; Wilmer et al., 2012; Halberda et al., 2012; Degutis et al., 2013; Germine et al., 2015; Fortenbaugh et al., 2015; Richler, Wilmer, & Gauthier, 2017; Deveney et al., 2018; Kerns, 2019; Sutherland et al., 2020). We devoted over a year of intensive piloting to the refinement of these three components

for the present project. While the parallel nature of

the development process produced a high degree of

complementarity between the three, each represents its

own separate and unique contribution.

A turning point in the development process

was our rediscovery of a group of drawing-based

neuropsychological tasks that have been used to probe

for pathology of perception, attention, and cognition

in brain damaged patients (Landau et al., 2006; Smith,

2009). In one such task, the patient is asked to draw a

Figure 3. Readout of a clock drawn by a hemispatial neglect

patient. (https://commons.wikimedia.org/wiki/File:AllochiriaClock.png) The Draw Datapoints on Graphs (DDoG)

measure was inspired by readout-based neuropsychological

tasks like the one that produced this distorted clock drawing.

Such readout-based tasks have long been used with brain

damaged patients to probe for pathology of perception,

attention, and cognition (Smith, 2009).


Table 1. Measurement of Abstract Graph Interpretation (MAGI) principles.

clock, and pathological inattention to the left side of

visual space is detected via the bunching of numbers

to the right side of the clock (Figure 3). This clock

drawing test exhibits a feature that became central to

the present work: readout-based measurement.

A readout is a concrete, detailed, visuospatial record

of thought. Measurement via readout harnesses the

viewer’s capacity to transcribe their own thoughts,

with relative fullness and accuracy, when provided

with a response format that is sufficiently direct,

accessible, rich, and expressive. As the clock drawing

task in Figure 3 seeks to read out remembered numeral

positions on a clock face, our measure seeks to read out

assumed datapoint positions on a graph.

Contribution 1: Identify guiding principles

Distillation of the Measurement of Abstract Graph

Interpretation (MAGI) principles

The first of this paper’s three contributions is the

introduction of six succinct, actionable principles to

guide collection of graph interpretation data (Table 1).

These principles are aimed specifically at common graph

types (e.g., bar/line/box/violin) and graph markings

(e.g., confidence/prediction/interquartile interval) that

abstract away from individual-level data to present

aggregate-level information. These Measurement of

Abstract-Graph Interpretation (MAGI) principles

center around two core aims, general usage and valid

measurement.

General usage—that is, testing of a general

population that may vary in statistical and/or content

expertise—is most directly facilitated if a measure: (1)

avoids constraining the response via limited options

(Expressive freedom); (2) avoids priming the response

via suggestive instructions (Limited instructions);

and (3) avoids obscuring or hindering thinking via

unnecessary interpretive or translational steps (Limited

mental transformations).

Figure 4. The Draw Datapoints on Graphs (DDoG) measure

maintains the graph as a consistent reference frame across its

three stages. (a) Participants are presented with a graph

stimulus that (b) produces a mental representation of the data;

(c) this interpretation is recorded by sketching a version of the

graph along with hypothesized locations of individual data

values. Drawings are representative examples of the two

common responses seen in pilot data collection: the correct

Bar-Tip Mean response (top), and the incorrect Bar-Tip Limit

(BTL) response (bottom).


Figure 5. The DDoG measure implements the MAGI principles.

The DDoG measure collects readouts of abstract-graph

interpretation. Shown are the major pieces of the DDoG

measure’s procedure, with the relevant MAGI principle(s) in

parentheses. (a) Drawing page: showed the participant how to

set up their paper for drawing (Expressive freedom, Limited

instructions). (b) Instructions: explained what to draw (Limited

instructions). (c) Stimulus graph: was the graph to be

interpreted (Ecological validity, Ground-truth linkage, Limited

mental transformations). (d) Figure caption: was presented

with each stimulus graph to help clarify its content (Ecological

validity). (e–f) Representative readouts: four-graph readouts

from two separate participants, with readouts of graph


Valid measurement is facilitated by all six principles.

The three principles just mentioned—Expressive

freedom, Limited instructions, and Limited mental

transformations—all help responses to more directly

reflect the viewer’s actual graph interpretation. This

is especially true of Limited mental transformations.

As we will see below, a set of mental transformations

embedded in a popular probability rating scale

measure delayed for over a decade the elucidation of a

phenomenon that our investigation here reveals.

Valid measurement is additionally facilitated by: (1)

objective scoring, via the existence of correct answers

(Ground-truth linkage); (2) real-world applicability of

results, via the sourcing of real graphs (Ecological

validity); and (3) a clear, high-resolution window into

viewers’ thinking, via a response that has high detail

and bandwidth (Information richness).

As shown in the next two sections, the MAGI

principles provided a foundational conceptual basis

for the creation of the Draw Datapoints on Graphs

(DDoG) measure and for understanding the ease with

which that DDoG measure identified the common,

categorical Bar-Tip Limit (BTL) error. These principles

additionally provided a structure for comparing our

new measure with existing measures (Related works).

In these ways, we demonstrate the utility of the MAGI

principles as a metric for the creation, evaluation, and

comparison of graph interpretation measures.

Contribution 2: Design and reﬁne a measure

Creation of the Draw Datapoints on Graphs (DDoG)

measure

Our second contribution is the use of the MAGI

principles to create a readout-based measure of graph

comprehension. We call this the Draw Datapoints

on Graphs (DDoG) measure. Figure 4 illustrates the

DDoG measure’s basic approach: it uses a graph as

the stimulus (Figure 4a); the graph is interpreted (4b);

and this interpretation is then recorded by sketching a

version of the graph (4c). The reference-frame of the

graph is retained throughout.

Figure 5 shows a more detailed schematic of the

DDoG procedure used in the present study, which

expresses the six MAGI principles (Table 1) as follows:

The instructions (Figure 5b) are succinct, concrete,

and task-directed (Limited instructions). The graph

stimulus (5c) and its caption (5d) are taken directly from

Figure 5 (continued).

stimulus shown in c outlined in purple (we refer to this stimulus

below as AGE) (Expressive freedom, Information richness,

Limited mental transformations). Representative AGE stimulus

graph (c) sketches are outlined in purple. Representative

readouts demonstrate (e) the correct Bar-Tip Mean response

and (f) the incorrect Bar-Tip Limit (BTL) response.


Figure 6. The diﬀerence between Bar-Tip Mean and Bar-Tip Limit thinking is easily observable. (a) Cartoon and (b) readout examples

of the two common DDoG measure responses: the correct Bar-Tip Mean response (top) and the incorrect Bar-Tip Limit (BTL) response

(bottom). These readouts were all drawn for the same stimulus graph, which we refer to as AGE (see Figure 11).

a textbook (Ecological validity). The drawing-based

response medium (5a, e, f) provides the participant

with flexibility to record their thoughts in a relatively

unconstrained manner (Expressive freedom). The

matched format between the stimulus (5c) and the

response (5e, 5f, purple outline) minimizes the mental

transformations required to record an interpretation

(4b) (Limited mental transformations). The extent

of the response—160 drawn datapoints, each on a

continuous scale (5e, 5f)—provides a high-resolution

window into the thought-process (Information richness).

Finally, the spatial and numerical concreteness of the

readout (5e, 5f), allows easy comparison to a variety

of logical and empirical benchmarks of ground-truth

(Ground-truth linkage). In these ways, the DDoG

measure demonstrates each of the MAGI principles.

While we use the DDoG measure here to record

interpretations of mean bar graphs, it can easily be

applied to any other abstract-graph; that is, any graph

that replaces individual-level data with summary-level

information.

Contribution 3: Apply the measure

Using the DDoG measure to test the accessibility of mean

bar graphs

From our earliest DDoG measure pilots looking

at mean bar graph accessibility, a substantial subset

of participants drew a very dierent distribution

of data than the rest (Figure 6, Figures 5e & f, and

Figure 4c).

When instructed to draw underlying datapoints for

mean bar graphs (Figure 5b), most participants drew a

(correct) distribution with datapoints balanced across

the bar-tip (Figure 6ab, top). A substantial subset,

however, drew most, or all, datapoints within the bar

(Figure 6ab, bottom), treating the bar-tip as a limit.

This minority response would have been correct for a

count bar graph (as shown in Figure 2d), but it was

severely incorrect for a mean bar graph (Figure 2b).

We named this incorrect response pattern the Bar-Tip

Limit (BTL) error.
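The two response patterns are separable by a simple quantitative criterion on the drawn datapoints. The sketch below is a hypothetical illustration of that logic, not the study's actual scoring rubric: it flags a readout as a BTL error when nearly all drawn datapoints fall at or below the bar-tip of a mean bar graph.

```python
def classify_readout(drawn_values, bar_tip, threshold=0.9):
    """Hypothetical classifier sketch (not the study's scoring rubric).

    Flags a Bar-Tip Limit (BTL) response when at least `threshold` of
    the drawn datapoints lie at or below the bar-tip of a mean bar graph.
    """
    frac_within = sum(v <= bar_tip for v in drawn_values) / len(drawn_values)
    return "Bar-Tip Limit (BTL) error" if frac_within >= threshold else "Bar-Tip Mean"

# A correct reading balances datapoints across a bar-tip of 5.0 ...
print(classify_readout([3.1, 4.2, 4.8, 5.3, 5.9, 6.7], bar_tip=5.0))
# ... while a BTL reading packs them all inside the bar.
print(classify_readout([1.2, 2.5, 3.3, 4.0, 4.6, 4.9], bar_tip=5.0))
```

The threshold value here is arbitrary; the point is only that the categorical difference between the two patterns makes them easy to tell apart in a readout.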

The dataset that we examine in Results contains

over three orders of magnitude more drawn datapoints

than the two early-pilot drawings shown in Figure 4c:

44,000 datapoints, drawn in 551 sketches, by 149

participants. This far larger dataset yields powerful

insights into the BTL error by establishing its

categorical nature, high prevalence, stability within

individuals, likely developmental influences, persistence

despite thoughtful responding, and independence from

foundational knowledge and graph content. Yet, in

a testament to the DDoG measure’s incisiveness, the

two early pilot drawings shown in Figure 4c already

substantially convey all three of the core contributions

of the present paper: a severe error in mean bar graph

interpretation (the BTL error), saliently revealed by a

new readout-based measure (the DDoG measure) that


collects high-quality graph interpretations from experts

and nonexperts alike (using the MAGI principles).

Moreover, it takes only a few readouts to move

beyond mere identification of the BTL error, toward

elucidation of an apparent mechanism. Considering just

the readouts we have seen so far, observe the stereotyped

nature of the BTL error response across participants

(Figure 6b), graph form, and content (Figures 5e & f).

Notice the similarity of this stereotyped response to the

Bar-Tip Limit data shown in Figure 2d. These clues

align perfectly with a mechanism of conflation, where

mean bar graphs are incorrectly interpreted as count

bar graphs (Figure 2).

In retrospect, the stage was clearly set for such a

conflation. The use of one graph type (bar) to represent two fundamentally different types of data (counts and means) makes it difficult to visually differentiate the two (Figure 2) and sets up the inherent cognitive conflict that is the source of this paper’s title (“Two

graphs walk into a bar”). Additionally, the physical

stacking metaphor for count bar graphs (Figure 1)

makes a Bar-Tip Limit interpretation arguably more

straightforward and intuitive; and the common use of

count bar graphs as a curricular launching pad to early

education could easily solidify the Bar-Tip Limit idea

as a familiar, well-worn path for interpretation of bar

graphs (Figure 1).

Yet, remarkably, none of the many prior theoretical

and empirical critiques of mean bar graphs considered

that such a conflation might occur (Tufte & Graves-Morris, 1983, p. 96; Wainer, 1984; Drummond & Vowler, 2011; Weissgerber et al., 2015; Larson-Hall, 2017; Rousselet, Pernet, & Wilcox, 2017; Pastore, Lionetti, & Altoe, 2017; Weissgerber et al., 2019; Vail & Wilkinson, 2020; Newman & Scholl, 2012; Correll & Gleicher, 2014; Pentoney & Berger, 2016; Okan

et al., 2018). The concrete, granular window into graph

interpretation provided by DDoG measure readouts,

however, elucidates both phenomenon and apparent

mechanism with ease.

Related works

Here we embed our three main contributions—the

Measurement of Abstract Graph Interpretation

(MAGI) principles, the Draw Datapoints on Graphs

(DDoG) measure, and the Bar-Tip Limit (BTL) error

—into related psychological, vision science, and data

visualization literatures.

Literature related to the DDoG measure

Classiﬁcation of the DDoG measure

A core design feature of the DDoG measure is

its “elicited graph” measurement approach whereby

a graph is produced as the response. This approach

can, in turn, be placed within two increasingly

broad categories: readout-based measurement and

graphical elicitation measurement. Like elicited graph

measurement, readout-based measurement produces a

detailed, visuospatial record of thought; yet its content

is broader, encompassing nongraph products such

as the clock discussed above (Figure 3). Graphical

elicitation, broader still, encompasses any measure with

a visuospatial response, regardless of detail or content

(Hullman et al., 2018; a similar term, graphic elicitation,

refers to visuospatial stimuli, not responses; Crilly et al.,

2006). Having embedded the DDoG measure within

these three nested measurement categories—elicited

graph, readout-based, and graphical elicitation—we

next use these categories to distinguish it from existing

measures of graph cognition.

Vision science measures

Vision science has long been an important source

of graph cognition measures (Cleveland & McGill,

1984; Heer & Bostock, 2010), and recent years have

seen an accelerated adoption of vision science measures

in studies of data visualization (hereafter datavis). Of

particular interest is a recent tutorial paper by Elliott

and colleagues (2020) that cataloged behavioral vision

science measures with relevance to datavis. Despite its

impressive breadth—laying out nine measures with six

response types and 11 direct applications (Elliott et al.,

2020)—not a single elicited graph, readout-based, or

graphical elicitation approach was included. This fits

with our own experience that such methods are rarely

used in vision science.

This rarity is even more surprising given that the

classic drawing tasks that helped to inspire our current

readout-focused approach are well known to vision

scientists (Landau et al., 2006; Smith, 2009; Figure 3);

indeed, they are commonly featured in Sensation and

Perception textbooks (Wolfe et al., 2020). Yet usage of

such methods within vision science has been narrow, restricted primarily to studies of extreme deficits in

single individuals (Wolfe et al., 2020).

It is unclear why readout-based measurement is

uncommon in broader vision science research. Perhaps

the relative difficulty of structuring a readout-based

task to provide a consistent quantitative measurement

across multiple participants has been prohibitive. Or

maybe reliable collection of drawn samples from a

distance was logistically untenable until the recent

advent of smartphones with cameras and accessible

image-sharing technologies. We hypothesize that factors

such as these may have limited the use of readout-based

measurement in vision science, and we believe the

proof-of-concept provided by the MAGI principles

(Table 1) and the DDoG measure (Figure 4) supports

their broader use in the future.


Although the DDoG measure shares its readout-

based approach with classic patient drawing tasks

(Figure 3), it differs from them in a subtle but

important way. Patient drawing tasks are typically

stimulus-indifferent, using stimuli (e.g., clocks) as mere

tools to reveal the integrity of a broad mental function

(e.g., spatial attention). The DDoG measure, in

contrast, seeks to probe the interpretation of a specific

stimulus (this bar graph) or stimulus type (bar graphs in

general). The focus on how the stimulus is interpreted,

rather than on the integrity of a stimulus-independent

mental function, distinguishes the DDoG measure from

classic patient drawing tasks.

Graphical elicitation measures

Moving beyond vision science, we next examine

the domain of graphical elicitation for DDoG-related

measures. A birds-eye perspective is provided by a

recent review of evaluation methods in uncertainty

visualization (Hullman et al., 2018). Uncertainty

visualization’s study of distributions, variation, and

summary statistics makes it an informative proxy for

datavis research related to our work. Notably, that

review called the method for eliciting a response “a

critical design choice.” Of the 331 measures found in

86 qualifying papers, from 11 application domains

(from astrophysics to cartography to medicine), only

4% elicited any sort of visuospatial or drawn response,

thus qualifying as graphical elicitation (Hullman et al.,

2018). None elicited either a distribution or a full graph;

thus, none qualified as using an elicited graph approach.

Further, though some studies elicited markings on a

picture (Hansen et al., 2013) or map (Seipel & Lim,

2017), or movement of an object across a graph

(Cumming, Williams, & Fidler, 2004), none sought to

elicit a detailed visuospatial record of thought; thus,

none qualified as using a readout-based approach.

Survey methods such as multiple choice, Likert scale,

slider, and text entry were the most common response

elicitation methods—used 66% of the time (Hullman

et al., 2018).

While readouts are rare in both vision science and

datavis, there exists a broader graphical elicitation

literature in which readouts are more common and

elicited graphs are not unheard of (Jenkinson, 2005;

O’Hagan et al., 2006; Choy, O’Leary & Mengersen,

2009). Yet still, these elicited responses tend to differ

from our work in two key respects. First, they are

typically used exclusively with experts, whereas the

DDoG measure prioritizes general usage. Second, their

aim is typically to document preexisting knowledge,

and they therefore use stimuli as a catalyst or prompt,

rather than as an object of study. The DDoG measure,

in contrast, holds the stimulus as the object of study:

we want to know how the graph was interpreted. The

DDoG measure is therefore distinctive even in the

broadly dened domain of graphical elicitation.

A small subset of graphical elicitation studies do

elicit a visuospatial response, drawn on a graph, from

a general population. There remains, however, a core

difference in the way the DDoG measure utilizes the

elicited response. DDoG uses the elicited response as

a measure (dependent variable). The other studies, in

contrast, use it as a way to manipulate engagement with,

or processing of, a numerical task (as an independent

variable; Stern, Aprea, & Ebner, 2003; Natter & Berry, 2005; Kim, Reinecke, & Hullman, 2017a; Kim,

Reinecke, & Hullman, 2017b). While it is possible for a

manipulation to be adapted into a measure, the creation

of a new measure requires time and effort for iterative, evidence-based refinement and validation (e.g., Wilmer,

2008; Wilmer et al., 2012; Degutis et al., 2013). The

DDoG measure is therefore distinctive in having been

created and validated explicitly as a measure.

Frequency framed measures

Another way in which the DDoG measure may be

distinguished from previous methods is in its use of a

frequency framed response. Frequency framing is the

communication of data in terms of individuals (e.g.,

“20 of 100 total patients” or “one in five patients”)

rather than percentages or proportions (e.g., “20%

of patients” or “one fifth of patients”). The DDoG

measure collects frequency-framed responses in terms

of individual values.

Figure 7. The balls-and-bins approach. Three recent

adaptations of the balls-and-bins approach to eliciting

probability distributions. This approach was originally

developed by Goldstein and Rothschild (2014). The sources of

these adaptations of balls-and-bins are: (a) Kim, Walls, Kraft, &

Hullman, 2019; (b) Andre, 2016; (c) Hullman et al., 2018.


Frequency framing has long been considered a

best practice for data communication to nonexperts

(Cosmides & Tooby, 1996), yet it remains uncommon in

the measurement of graph cognition. Illustratively, as

of 2016, only a small handful of papers in uncertainty

visualization had utilized frequency framing (Hullman,

2016); and in all cases, frequency framing was used in

the instructions or stimulus rather than in the measured

response (e.g., Hullman, Resnick, & Adar, 2015;Kay

et al., 2016).

Two other, more recent, datavis studies have elicited

frequency framed responses (Hullman et al., 2017;

Kim, Walls, Kraft, & Hullman, 2019). These studies

were geared toward nonexperts and used the so-called

balls-and-bins paradigm, originally developed by

Goldstein and Rothschild (2014), where balls are placed

in virtual bins representing specified ranges (Figure 7).

The studies aimed to elicit beliefs about sampling error

around summary statistics. This usage contrasts with

our aim of eliciting direct interpretation of the raw,

individual-level, nonaggregated data that produced a

mean.

This difference in usage joins more substantive

differences, captured by two MAGI principles: Limited

mental transformations and Expressive freedom

(Table 1). While the current implementation of

balls-and-bins requires spatial, format, and scale

translations between stimulus and response, the DDoG

measure achieves Limited mental transformations

by applying a stimulus-matched response: imagined

datapoints are drawn directly into a sketched version

of the graph itself. Similarly, while balls-and-bins

predetermines key aspects of the response such as

bin widths, bin heights, ball sizes, numbers of bins,

and the range from lowest to highest bin, the DDoG

measure achieves Expressive freedom by allowing drawn

responses that are constrained only by the limits of

the page. Further, the lack of built-in tracks or ranges

that constrain or suggest data placement reduces the

risk of the “observer effect,” whereby constraints or

suggestions embedded in the measurement procedure

itself alter the result.

Deﬁning thought inaccuracies

Textual deﬁnitions of errors, biases, and confusions

To discuss the literature regarding this project’s

third contribution—identification of the Bar-Tip Limit

(BTL) error—it is helpful to distinguish three different

types of inaccurate thought processes: errors, biases,

and confusions. We will do this first in words, and then

numerically.

• Errors are “mistakes, fallacies, misconceptions,

misinterpretations” (Oxford University Press,

n.d.). They are binary, categorical, qualitative

inaccuracies that represent wrong versus right

thinking.

• Biases are “leanings, tendencies, inclinations,

propensities” (Oxford University Press, n.d.). They

are quantitative inaccuracies that exist on a graded

scale or spectrum. They exhibit consistent direction

but varied magnitude.

• Confusions are “bewilderments, indecisions,

perplexities, uncertainties” (Oxford University

Press, n.d.). They indicate the absence of systematic

thought, resulting, for example, from failures to

grasp, remember, or follow instructions.

Numerical/visual deﬁnitions of errors, biases, and

confusions

Let us now consider how each type of inaccurate

thought process would be instantiated in data. Such

depictions can act as visual definitions to compare with

real datasets. In Figure 8, patterns of data distributions

representing correct thoughts (8a), systematic errors

(8b), biases (8c), and confusions (8d), are illustrated as

frequency curves. They are numerical formalizations

of the textual definitions above, providing visual cues

for differentiating these four thought processes in a

graphing context.

Figure 8. Three types of inaccurate graph interpretation.

Prototypical response distributions provide visual deﬁnitions

for three types of inaccuracy. Top row: (a) Baseline response

pattern (green) indicates no systematic inaccuracy. Responses

cluster symmetrically around the correct response value, with

imprecision in task input, processing, and output reﬂected in

the width of the spread around that correct response. Bottom

row: Light red backgrounds illustrate the presence of inaccurate

responses. Gray arrows indicate the major change from the

baseline response pattern for each type of inaccurate response.

(b) Systematic error responses form their own distinct mode.

(c) Systematic bias responses shift and/or ﬂatten the baseline

response distribution. (d) Confused responses—expressed as

random, unsystematic responding—are uniformly distributed,

thus raising the tails of the baseline distribution to a constant

value without altering the mean, median, or mode.


Figure 8a represents a baseline case: a data

distribution (blue curve) that is characteristic of

generally correct interpretation, absent systematic

inaccuracy. This baseline case prototypically produces

a symmetric distribution, with the mean, median,

and mode all corresponding to a correct response.

Normal imprecision, from expected variation in

stimulus input, thought processes, or response

output, is reflected by the width of the distribution.

(If everyone answered with 100% precision, the

“distribution” would be a single stack on the correct

answer.)

Figure 8b illustrates a subset of systematic errors

within an otherwise correct dataset. In distributions,

an erroneous, categorically inaccurate subset

prototypically presents as a separate mode, additional

to the baseline curve. Errors siphon responses from

the competing baseline distribution, which remains

clustered around the correct answer, retaining the

correct modal response. The hallmark of an error is,

therefore, a bimodal distribution of responses whose

prevalence and severity are demonstrated, respectively,

by the height and separation of the additional

mode.

Figure 8c illustrates a subset of systematic biases

within an otherwise correct dataset. Given the graded

nature of bias, a biased subset tends to flatten and shift the

baseline distribution of responses. The prototypical bias

curve demonstrates a single, shifted mode. To the extent

that a bias reflects a pervasive, relatively inescapable

aspect of human perception, cognition, or culture, there

will be relatively less flattening, and more shifting, of

the response distribution.

Finally, Figure 8d illustrates a subset of confusions

within an otherwise correct dataset. A confused subset

tends to yield random responses, which lift the tails of

the response distribution to a constant, nonzero value

as responses are siphoned from the competing baseline

distribution.

A rough but informative way to distinguish between

erroneous, biased, and confused thinking is to simply

compare an observed distribution of responses directly

to the blue curves in Figure 8. This visual approach may

be complemented with an array of analytic approaches

that provide quantitative estimates of the degree of

evidence for a particular type of inaccurate thinking

(e.g., Freeman & Dale, 2013; Pfister et al., 2013; Zhang

& Luck, 2008). In Results, we will use both visual and

analytic approaches.
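These prototypical patterns are straightforward to simulate. The sketch below is ours, not the paper's analysis code; the distribution parameters are invented for illustration. It generates synthetic response distributions matching each textual definition, so the signature of each inaccuracy type can be checked directly:

```python
import numpy as np

rng = np.random.default_rng(0)
correct = 10.0  # hypothetical correct response value

# (a) Baseline: responses cluster symmetrically around the correct value.
baseline = rng.normal(correct, 1.0, 800)

# (b) Error: a categorically wrong subset forms its own distinct mode,
#     siphoned from the baseline distribution (bimodality).
error = np.concatenate([rng.normal(correct, 1.0, 600),
                        rng.normal(correct - 5.0, 1.0, 200)])

# (c) Bias: graded inaccuracy shifts (and flattens) the single mode.
bias = rng.normal(correct - 1.5, 1.5, 800)

# (d) Confusion: random responding adds a uniform floor, raising the
#     tails without moving the central tendency.
confusion = np.concatenate([rng.normal(correct, 1.0, 600),
                            rng.uniform(correct - 10.0, correct + 10.0, 200)])

def share_near(x, center, halfwidth=1.0):
    """Fraction of responses within +/- halfwidth of a given value."""
    return np.mean(np.abs(x - center) <= halfwidth)
```

On these synthetic data, the error set shows a second concentration far from the correct value while the baseline mode survives; the bias set shifts its median downward; and the confusion set keeps its median at the correct value while its tails inflate.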

The success of any approach rests upon the precision

and accuracy of measurement: as measurement quality

is reduced, it becomes increasingly difficult to detect

clear thought patterns reflected in the data. Our results

will show unequivocal bimodality in the collected data,

supporting both the presence of an erroneous graph

interpretation (the BTL error) and the precision and

accuracy of the DDoG measure.

Literature related to the BTL error

Probability rating scale results

In examining prior work related to our discovery of

the BTL error, studies by Newman and Scholl (2012),

Correll and Gleicher (2014), Pentoney and Berger

(2016), and Okan and colleagues (2018) are seminal and

contain the most reproducible prior evidence against

mean bar graph accessibility. Their core result is an

asymmetry recorded via a probability rating scale: the

average participant, when shown a mean bar graph,

rated an inside-of-bar location for data somewhat more

likely than a comparable outside-of-bar location (∼1

point on a 9-point probability rating scale, Figure 9a).

The original paper, by Newman and Scholl (2012),

reported five replications of this asymmetry, including

in-person and online testing, varied samples (ferry

commuters, Amazon Mechanical Turk recruits, Yale

undergraduates), and conditions that ruled out key

numerical and spatial confounds. Five independent

research groups have since replicated and extended

this finding. The first was Correll and Gleicher (2014),

Figure 9. A side-by-side comparison of a probability rating

scale and a DDoG measure response. DDoG measure’s more

concrete, detailed, visuospatial (aka readout-based) approach

may have contributed to the ease with which it identiﬁed the

BTL error and its apparent conﬂation mechanism, which were

missed by prior studies using the probability rating scale

approach. (a) The probability rating scale response sheet

provided to each participant in the original report of

asymmetry by Newman and Scholl (2012, Study 5). These

Likert-style scales, anchored by colloquial English words (from

“very unlikely” to “very likely”), were used to characterize the

likelihood that individual values occurred at two speciﬁc y-axis

values, one within the bar (−5) and one outside the bar (+5).

The red circles indicate the resulting mean ratings of 6.9 and 6.1

(from Study 5, Newman & Scholl, 2012). (b) The DDoG measure

response sheet from an individual participant given four

stimulus graphs and asked to sketch each graph along with 20

hypothesized datapoints for two speciﬁed bars on the graph.


who extended the result to viewers’ predictions of

future data, and who showed that graph types that

were symmetrical around the mean (e.g., violin plots)

did not produce the effect. Soon after, Pentoney and

Berger (2016) found this asymmetry present for a bar

graph with confidence intervals, but absent when the

bar was removed, leaving only the confidence intervals;

this replicated Correll and Gleicher’s (2014) finding

that the effect requires the presence of a traditional

bar graph. Okan and colleagues (2018) also replicated

both the core asymmetry (this time using health data)

and the nding that nonbar graphs did not produce

the effect; they additionally found that the asymmetry,

counterintuitively, increased with graph literacy. Finally,

Godau, Vogelgesang, and Gaschler (2016) and Kang

and colleagues (2021) showed that the asymmetry

remains in aggregate—carrying through to judgments

of the grand mean of multiple bars.

Newman and Scholl’s (2012) hypothesized

mechanism was a well-known perceptual bias, whereby

in-object locations are processed slightly more

effectively than out-of-object locations. The studies that

Newman and Scholl (2012) cited for this perceptual

bias, for example, had on average 4% faster and 0.6%

more accurate responses for in-object stimuli relative

to out-of-object stimuli (Egly et al., 1994; Kimchi,

Yeshurun, & Cohen-Savransky, 2007; Marino & Scholl,

2005). While subtle, this bias was believed to reflect

a fundamental aspect of object processing (Newman

& Scholl, 2012), and it was therefore assumed to be

pervasive and inescapable. Newman and Scholl’s (2012)

hypothesis was that this “automatic” and “irresistible”

perceptual bias had produced a corresponding

“within-the-bar bias” in graph interpretation.

Probability rating scale comparison

At the end of Results, we will compare our detailed

findings, obtained via our DDoG measure, to those

of the probability rating scale studies reviewed just

above. Here, we compare the more general capabilities

of the two measurement approaches. Figure 9

shows side-by-side examples of a probability rating

scale response from Newman and Scholl’s (2012)

Study 5 (Figure 9a), and a DDoG measure readout

(Figure 9b). Using the MAGI principles (Table 1) as

a basis for comparison, we can see key differences in:

Limited mental transformations, Ground-truth linkage,

Information richness, and Expressive freedom.

The MAGI principle of Limited mental

transformations aims to facilitate an accurate readout

of graph interpretation. As discussed in Frequency

framed measures, even a mental transformation as

seemingly trivial as a translation from a ratio (one

in five) to a percentage (20%) can severely distort a

person’s thinking (Cosmides & Tooby, 1996). The

DDoG measure limits mental transformations via its

stimulus-matched response (Figures 4 and 9b). Absent

such a matched response, the mental transformations

required by a measure may limit its accuracy.

In the example from Newman and Scholl (2012),

Study 5 (Figure 9a), several mental transformations

are needed for probability rating scale responses;

among them, translation from numbers (−5 and

5) to graph locations (in or out of the bar, and to

what degree in or out), from graph locations to

probabilities (represented in whatever intuitive or

numerical way a person’s mind may represent them),

from probabilities to the rating scale’s English adjectives

(e.g. “somewhat,” “very”), and, separately, from the

vertical scale on the graph’s y-axis to the horizontal

rating scale. While it is possible that some of these

mental transformations are accomplished effectively

and relatively uniformly for most participants, our

reanalysis below of the probability rating scale

studies (Results: A reexamination of prior results)

will suggest that their presence contributes a certain

amount of irreducible “noise” (inaccuracy) to the

response.

With regard to Ground-truth linkage, it is

straightforward to evaluate a DDoG measure readout

(Figure 9b) relative to multiple logical and empirical

benchmarks of ground-truth. Logical benchmarks

include, for example, “Is the mean of the drawn

datapoints at the bar-tip?” Empirical benchmarks

include “How closely does the drawn data reflect the

[insert characteristic] in the actual data?” In contrast,

there is no logically or empirically correct (or incorrect)

response for the probability rating scale measure

(Figure 9a). Even discrepant rated values for in-bar

versus out-of-bar locations—though suggestive of

inaccurate thinking—could be accurate in a case of a

skewed raw data distribution, which could plausibly

lead to more raw data in-bar than out-of-bar (or to

the opposite). The probability rating scales therefore

lack a conclusive Ground-truth linkage. DDoG measure

responses, in contrast, specify real numerical and spatial

values with a concreteness that is easier to compare to a

variety of ground-truths.
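As a sketch of what such a logical benchmark can look like in practice (the function name, tolerance, and example values below are our own illustrative choices, not the paper's scoring code), one can check whether the mean of a participant's drawn datapoints lands at the bar-tip:

```python
def mean_at_bar_tip(drawn_values, bar_tip, tolerance):
    """Logical benchmark: in a correctly interpreted mean bar graph,
    the mean of the imagined raw datapoints should sit at the bar-tip."""
    mean = sum(drawn_values) / len(drawn_values)
    return abs(mean - bar_tip) <= tolerance

# A correct reading distributes datapoints across the bar-tip
# (here, mean ~4.92, near a hypothetical bar-tip of 5.0)...
correct_reading = [3.1, 3.9, 4.6, 5.2, 5.9, 6.8]
# ...while a Bar-Tip Limit (BTL) reading treats the tip as a ceiling,
# pushing the mean of the drawn data well below it (here, mean = 2.85).
btl_reading = [0.8, 1.6, 2.4, 3.3, 4.1, 4.9]
```

A BTL-style response thus fails the benchmark immediately, which is what makes the drawn readout so easy to evaluate against ground-truth.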

The DDoG measure’s precision is most directly

supported by its Information richness: in the present

study, each participant produces 160 drawn datapoints (four

graphs × two bars × 20 points each) on a continuous scale. In contrast, the probability

rating scale shown in Figure 9a yields only two pieces of

information (the two ratings) per participant. Granted,

the relative information-value of a single integer rating,

versus a single drawn datapoint, is not easily compared.

Further, as we will see below, multiple ratings at

multiple graph locations can increase precision of

the probability rating scale measure (Results: A

reexamination of prior results). Nevertheless, it would

be difficult to imagine a case where a probability rating

scale measure yielded more information richness, and,

in turn, higher precision, than the DDoG measure.


Table 2. Using MAGI principles to compare graph interpretation measures.

Finally, comparing Expressive freedom between the

two measures, the rating scale measure constrained

each response to a nine-point integer scale. On this

scale, responses are led and constrained to a degree

that disallows many types of unexpected responses and

limits the capacity of distinct graph interpretations to

stand out from each other. These constraints contrast

sharply with the flexibility of the drawn readout that

the DDoG measure produces.

The probability rating scale and the DDoG measure

therefore differ in terms of four separate MAGI

principles (Table 1): Limited mental transformations,

Ground-truth linkage, Information richness, and

Expressive freedom.

MAGI principles used as a metric

We have used the MAGI principles above to

highlight notable differences between the DDoG

measure and specific implementations of two other

measurement approaches: balls-and-bins (Figure 7)

and probability rating scale (Figure 9a). However, the

question of whether these differences are integral to the

measure-type, or restricted to specific implementations,

remains. Table 2 models the use of the MAGI principles

as a vehicle to compare, in a more structured way,

potentially integral features and limitations of these

same three graph interpretation methods.

We believe that all three measures share a capability

for Limited instructions and Ecological validity,

though choices specific to each implementation may

affect whether the principles are actually expressed

(see Hullman et al. (2017) and Kim, Walls, Kraft

and Hullman (2019) for implementations of the

balls-and-bins method, and see Newman and Scholl

(2012), Correll and Gleicher (2014), Pentoney and

Berger (2016), and Okan and colleagues (2018) for

implementations of probability rating scales). In

contrast, the three measures appear to differ more

or less intrinsically in their approaches to Expressive

freedom, Limited mental transformations, and (to a

lesser extent) Information richness. That said, measure

development is an inherently iterative, dynamic process,

and what seems intrinsic at one point in time can

sometimes shift as development progresses.

Because the DDoG measure and MAGI principles

were built around each other and refined in parallel,

it makes sense that they are closely aligned in their

optimization for assessment of abstract-graph

interpretation, with a focus on general usage and valid

measurement. Recognizing that measures with different

aims and motivations may be best suited for different

tasks, this use of the MAGI principles in table form

(Table 2) models their utility for the comparison

and targeting of measures with a similar set of aims

and motivations.

Summary of related works

In the four subsections of Related works above,

we first documented and sought to better understand

the relative rarity of elicited-graph, readout-

based, graphical elicitation, and frequency-framed

measurement in studies of graph cognition, while

arguing for the value of greater usage of elicited-graph

and readout-based measurement in particular. We

next distinguished—in both written/verbal and


numerical/visual form—three distinct types of

inaccurate graph interpretation: errors, biases, and

confusions. Third, we examined key results and methods

from prior studies of mean bar graph inaccessibility.

And, nally, we provided an illustrative example of how

the Measurement of Abstract Graph Interpretation

(MAGI) principles can be used to compare and contrast

relevant measures.

Methods

The current investigation has two parts: (A) examine

the accessibility of a set of ecologically valid mean

bar graphs in an educationally diverse sample via our

new Draw Datapoints on Graphs (DDoG) measure

(sections Participant Data Collection through Define

the Average) and (B) assess the relative frequency of

mean versus count bar graphs across educational and

general sources (section Prevalence of Mean Versus

Count Bar Graphs). Included in the Methods sections

devoted to “A” are detailed specifications for the current

implementation of the DDoG measure, as well as the

thinking behind that implementation, as a guide to

future DDoG measure usage.

Participant data collection

Piloting

The development of the DDoG measure was highly

iterative. Early piloting was performed at guest lectures,

lab meetings, and in college classrooms at various levels

of the curriculum by author JBW, and in several online

data collection pilot studies by both authors. Early

wording differed from the more refined wording used

in the present investigation. Yet even the earliest pilots

produced the core result of the current investigation

(Figure 4): a common, severely inaccurate interpretation

of mean bar graphs that we call the Bar-Tip Limit

(BTL) error. The DDoG measure thus appears—at

least in the context of the BTL error—robust to fairly

wide variations in its wording. A constant through

all DDoG measure iterations, however, was succinct,

concrete, task-directed instructions: what came to be

expressed as the Limited instructions MAGI principle

(Table 1).

Recruitment

Data collection was completed remotely using

Qualtrics online survey software. There were neither

live nor recorded participant-investigator interactions,

to minimize priming, coaching, leading, or other forms

of experimenter-induced bias (Limited instructions

principle). Participants were recruited via the Amazon

Mechanical Turk (MTurk), Prolic, and TestableMinds

platforms. Because no statistically robust dierences

were observed between platforms, data from all three

were combined for the analyses reported below.

Participants were paid $5 for an expected 30 minutes of

work; the median time taken was 33 minutes.

Procedure

Figure 10 provides a flowchart of the procedure.

Participants (a) read an overview of the tasks along

with time estimates; (b) read and recorded mean values

from stimulus graphs; (c) (grey zone) completed the

DDoG measure drawing task for stimulus graphs; (g)

provided a definition for the average/mean value; and

(h) reported age, gender, educational attainment, and

prior coursework in psychology and statistics.

Foundational knowledge tasks

Graph reading task: Find the average/mean

Participants were asked to “warm up” by reading

mean values from a set of mean bar graphs that

were later used in the DDoG measure drawing task

(wording: “What is the average (or mean) value for

[condition]?”) (Figure 10j). This warm-up served as a

control, verifying that the participant was able to locate

a mean value on the graph.

To allow for some variance in sight-reading,

responses within a tolerance of two tenths of the

distance between adjacent y-axis tick-marks were

counted as correct (see Figure 11). The results reported

in Results: Independence of the BTL error from

foundational knowledge remain essentially unchanged

for response-tolerances from zero (only exactly

correct responses) through arbitrarily high values (all

responses). No reported prevalence values dip below

19%.
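The scoring rule above can be sketched as a small function (our illustrative reconstruction; the function name and example values are assumptions, though the two-tenths tolerance is the paper's):

```python
def reading_correct(response, true_mean, tick_spacing, tol_fraction=0.2):
    """Score a 'find the mean' warm-up response as correct if it falls
    within two tenths of the spacing between adjacent y-axis tick-marks
    of the true mean value."""
    return abs(response - true_mean) <= tol_fraction * tick_spacing

# With tick-marks every 10 units, readings within +/- 2 units count:
reading_correct(52.0, 50.0, tick_spacing=10.0)  # True
reading_correct(53.0, 50.0, tick_spacing=10.0)  # False
```

Raising `tol_fraction` toward arbitrarily high values, or lowering it to zero, reproduces the robustness check described above.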

Graph reading was correct for 93% of graphs. As a

control for possible carryover/learning effects, 114 of

the 551 total DDoG measure response drawings were

completed without prior exposure to the graph (i.e.,

without a warm-up for that graph). No evidence of

carryover/learning eects was observed.

Deﬁnition task: Deﬁne the average/mean

After completing the DDoG measure drawing

task, as a control for comprehension and thoughtful

responding, participants were asked to explain the

concept of the average/mean. Most participants (112 of

190, or 64%) got the following version of the question:

“From your own memory, what is an average (or

mean)? Please just give the first definition that comes to

mind. If you have no idea, it is fine to say that. Your

definition does not need to be correct (so please don’t

look it up on the internet!)” For other versions of the

question, see open data spreadsheet. No systematic


Figure 10. Flowchart of study procedure. The present study consisted of ﬁve main sections (a, b, c, g, h). Sections and subsections

colored teal (b, e, g, h) produced data that were analyzed for this study. Subsections on the right (i, j, k, l, m) are an expansion of

section c of the ﬂowchart, showing (i) the drawing page, (j) the drawing instructions, (k) one of the four stimulus graphs, (l) the graph

caption, and (m) the upload instructions. See Methods for further procedural details.

differences in results based on question wording were

observed. Participants were provided a text box for

answers with no character limit. Eighty-three percent

of responses were correct, with credit being given for

responses that were correct either conceptually (e.g.,

“a calculated central value of a set of numbers”) or

mathematically (e.g., “sum divided by the total number

of respondents”).

The Draw Datapoints on Graphs (DDoG)

measure: Method

Selection of DDoG measure stimulus graphs

For the present investigation, four mean bar graph

stimuli, shown in Figure 11, were taken from popular

Introductory Psychology textbooks. This source of

graphs was selected for three main reasons. First,

Introductory Psychology is among the most popular

undergraduate science courses, serving two million

plus students per year in the United States alone

(Peterson & Sesma, 2017); and this course tends to rely

heavily on textbooks, with the market for such texts

estimated at 1.2 to 1.6 million sales annually (Steuer

& Ham, 2008). The high exposure of these texts gives

greater weight to the data visualization choices made

within them. Second, Introductory Psychology attracts

students with widely varying interests, skills, and prior

data experiences, meaning that data visualization best

practices developed in the context of Introductory

Psychology may have broad applicability to other

contexts involving diverse, nonexpert populations.

Third, given the relevance of psychological research

to everyday life, inaccurate inferences fueled by

Introductory Psychology textbook data portrayal

could, in and of themselves, have important negative

real-world impacts.

The mean bar graph stimuli were chosen to exhibit

major dierences in form, and to reect diversity of

content (Figure 11). They convey concrete scientific

results, rather than abstract theories or models, so

that understanding can be measured relative to the

ground-truth of real data (Ground-truth linkage).

Introductory texts were chosen because of their direct

focus on conveying scientific results to nonexperts

(Ecological Validity). Stimuli were selected to show

“independent groups” (aka between-participants)

comparisons because these comparisons are one step

more straightforward, conceptually and statistically,

than “repeated-measures” (aka within-participants)

comparisons.

We refer to the four stimulus graphs by their

independent variables: as AGE, CLINICAL, SOCIAL,

and GENDER. The stimuli were selected, respectively,

from texts by Kalat (2016), Gray and Bjorklund

(2017), Grison and Gazzaniga (2019), and Myers

and DeWall (2017), which ranked 7, 22, 3, and 1 in

median Amazon.com sales rankings across eight days

during March and April of 2019, among the 23 major

textbooks in the Introductory Psychology market.

Further details about these graphs are shown in Table 3.

Stimuli were chosen to represent both meaningful

content differences (four areas about which individual

participants might potentially have strong personal

opinions), and form differences (differing relationship

of bars to the baseline) to evaluate replication of results


Figure 11. The four stimulus graphs used in this study. These

stimulus graphs were taken from popular Introductory

Psychology textbooks to ensure the direct real-world relevance

of our results (see ecological validity MAGI principle). Figure

legends were adapted as necessary for comprehensibility

outside the textbook. Stimulus graph textbook sources are AGE:

Kalat, 2016; CLINICAL: Gray and Bjorklund, 2017; SOCIAL:

Grison & Gazzaniga, 2019; and GENDER: Myers & DeWall, 2017.

across graphs despite differences that might reasonably

be expected to change graph interpretation.

The specific form difference we examined was

the distinction between unidirectional bars (all bars

emerge from the same side of the baseline) and

bidirectional bars (bars emerge in opposite directions

from the baseline). Two of the four graphs, AGE and

CLINICAL, were unidirectional (Figure 11, top),

and the other two, SOCIAL and GENDER, were

bidirectional (Figure 11, bottom). While unidirectional

graphs were more common than bidirectional graphs

in the surveyed textbooks, two examples of each were

selected to set up an internal replication mechanism.

Replication of results could potentially be

demonstrated across graphs in terms of mean values

(the mean result of different graphs could be similar),

or in terms of individual dierences (an individual

participant’s result on one graph could predict their

result on another graph). Both types of replication were

observed.
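For concreteness, the two replication criteria can be sketched with synthetic per-participant scores (all names, parameters, and values below are our illustrative assumptions, not the study's data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 149  # participants, matching the study's usable sample size

# Hypothetical per-participant result on two graphs (e.g., mean offset
# of drawn data from the bar-tip); a shared individual tendency links
# a participant's results across graphs.
tendency = rng.normal(0.0, 1.0, n)
graph_a = tendency + rng.normal(0.0, 0.5, n)
graph_b = tendency + rng.normal(0.0, 0.5, n)

# Replication in terms of mean values: similar group means across graphs.
means_similar = abs(graph_a.mean() - graph_b.mean()) < 0.2

# Replication in terms of individual differences: a participant's result
# on one graph predicts their result on another (positive correlation).
r = np.corrcoef(graph_a, graph_b)[0, 1]
```

Both checks pass on this synthetic construction because the graphs share a group-level mean and a participant-level tendency, which is the structure the observed replications imply.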

Graph drawing task: The DDoG measure

The DDoG measure was designed using the MAGI

principles (Table 1), and it was modeled after the

patient drawing tasks mentioned above (Landau et al.,

2006;Agrell & Dehlin, 1998). Ease of administration

and incorporation into an online study were further key

design considerations.

The Expressive freedom inherent in the drawing

medium avoids placing artificial constraints on

participant interpretation, and it has multiple potential

benefits: it helps to combat the “observer effect,”

whereby a restricted or leading measurement procedure

impacts the observed phenomenon; it lends salience to

a consistent, stereotyped pattern (such as the Bar-Tip

Limit response shown in Figures 2d, 4c (bottom), 5f,

6b (bottom), and 13 (right)); and it allows unexpected

responses, which render the occasional truly inattentive

or confused response (Figure 8d) clearly identiable.

Twenty drawn datapoints per bar was chosen as

a quantity small enough to avoid noticeable impacts

of fatigue or carelessness, while remaining sufficient

to gain a visually and statistically robust sense of

the imagined distribution of values. Recent research

suggests that graphs that show 20 datapoints enable

both precise readings (Kay et al., 2016) and effective

decisions (Fernandes et al., 2018).

Before beginning the drawing task, participants

were instructed to divide a page into quadrants, and

they were shown a photographic example of a divided

blank page (Figure 10i). They were then presented with

task instructions (Figure 10j) with one of the four bar

graph stimuli and its caption (Figure 10k, 10l). This

process—instructions, graph stimulus, caption—was

repeated four total times, with the order of the four

graph stimuli randomized to balance order effects such

as learning, priming, and fatigue. After completing

the four drawings on a single page, each participant

photographed and uploaded their readout (Figure 10m)

via Qualtrics’ le upload feature. Few technical

diculties were reported, and only a single submitted

photograph was unusable for technical reasons (due to

insucient focus). All original photographs are posted

to the Open Science Framework (OSF).

Table 3. Stimulus graph sources and variables.

Journal of Vision (2021) 21(12):17, 1–36 Kerns & Wilmer 16

DDoG measure data

Collection of DDoG measure readouts

One hundred ninety participants nominally completed the study. Of these, 149 (78%) followed the directions sufficiently to enable use of their drawings, a usable data percentage typical of online studies of this length (Litman & Robinson, 2020). Based on predetermined exclusion criteria, participant submissions were disqualified if most or all of their drawings were unusable for any of the following:

(1) Zero datapoints were drawn (18 participants).

(2) Datapoints were drawn but in no way reflected either the length or direction of target bars, thereby demonstrating basic misunderstanding of the task (16 participants).

(3) Drawings lacked labels or bar placement information necessary to disambiguate condition or bar-tip location (five participants).

(4) Photograph was insufficiently focused to allow a count (one participant).

Each of the 149 remaining readouts contained four drawn graphs, for 596 graphs total. Of these, 30 individual drawings were excluded for one or more of the above reasons, and 15 were excluded for the additional predetermined exclusion criterion of a grossly incorrect number of datapoints, defined as a >25% difference from the requested 20 datapoints.
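That last criterion can be restated precisely in a few lines of code. This is our own sketch, not part of the study's pipeline; the function name and the tolerance parameterization are illustrative:

```python
def count_is_usable(n_drawn, requested=20, tolerance=0.25):
    """Flag a drawn graph as usable only if its datapoint count is
    within 25% of the requested count (for 20 requested, 15 to 25)."""
    return abs(n_drawn - requested) <= tolerance * requested
```

Under this rule, drawings with 15 to 25 datapoints pass, and any count outside that range is excluded as grossly incorrect.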

The remaining 551 drawn graphs (92.4% of the 596) from 149 participants were included in all analyses below. Ages of these 149 participants ranged from 18 to 71 years old (median 31). Reported genders were 96 male, 52 female, and one nonbinary. Locations included 26 countries and 6 continents. The most common countries were the United Kingdom (n = 58), United States (n = 34), Portugal (n = 8), and Greece/Poland/Turkey (each n = 4).

Coding of DDoG measure readouts via BTL index

As a systematic quantification of datapoint placement in DDoG measure readouts, a Bar-Tip Limit (BTL) index was computed. The BTL index estimated the within-bar percentage of datapoints for each graph's two target bars by dividing within-bar datapoints (i.e., datapoints drawn on the baseline side of the bar-tip) by the total number of drawn datapoints for the two target bars and multiplying by 100. Written as a formula:

BTL index = [(# datapoints on baseline side of bar-tips) / (total # datapoints)] × 100
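In code form, the computation might look as follows. This is our own minimal sketch (the study's scoring was applied by hand to the photographed drawings); representing drawn points as signed y-values with the baseline at 0 is our assumption for illustration:

```python
def btl_index(bars):
    """BTL index for one drawn graph.

    `bars` is a list of (tip, points) pairs: each target bar's tip value
    (signed, baseline at 0) and the drawn datapoint y-values for it.
    """
    within = total = 0
    for tip, points in bars:
        for y in points:
            total += 1
            # "Within the bar" = on the baseline side of the bar-tip:
            # same direction as the bar, and not beyond the tip.
            if y * tip >= 0 and abs(y) <= abs(tip):
                within += 1
    return 100 * within / total
```

A drawing whose points straddle each bar-tip evenly scores 50 (Bar-Tip Mean); one whose points all stay within the bars scores 100 (Bar-Tip Limit).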

The highest possible index (100) represents an image in which all drawn datapoints are on the baseline sides of their respective bar-tips (i.e., within the bar; a Bar-Tip Limit response). An index of 50 represents a drawing with equal numbers of datapoints on either side of the bar-tip (i.e., a balanced distribution across the mean line, or a Bar-Tip Mean response). BTL index values in this study ranged from 27.5 (72.5% of points drawn outside of the bar) to 100 (100% of points drawn within the bar).

The straightforward coding procedure was designed to yield a reproducible, quantitative measure of datapoint distribution relative to the bar-tip; however, one ambiguity existed. Datapoints drawn directly on the bar-tip could reasonably be considered either to use the bar-tip as a limit (if the border is considered part of the object) or not (if a datapoint on the edge is considered outside the bar). This ambiguity was handled as follows: (1) if, at natural magnification, a drawn datapoint was clearly leaning toward one or the other side of the bar-tip, it was assigned accordingly, and (2) datapoints on the bar-tip for which no clear leaning was apparent were alternately assigned first inside the bar-tip, then outside, and continuing to alternate thereafter.
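The alternation rule in step (2) is easy to state exactly in code. A sketch (ours; the placement labels "in", "on", and "out" are an illustrative encoding of the hand-scored judgments):

```python
def count_within(placements):
    """Count datapoints scored as within the bar, resolving points drawn
    directly on the bar-tip ('on') by alternation: the first ambiguous
    point goes inside, the next outside, and so on."""
    within = 0
    next_on_goes_inside = True
    for p in placements:
        if p == "in":
            within += 1
        elif p == "on":
            if next_on_goes_inside:
                within += 1
            next_on_goes_inside = not next_on_goes_inside
    return within
```

With this rule, a bar with 14 points inside and 6 directly on the tip counts 17 within, so on-tip points pull the index below 100 even when no point exceeds the tip.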

This procedure maximizes reproducibility by minimizing the need for the scorer to interpret the drawer's intent, and it was successful (see Reproducibility of BTL index coding procedure). Yet reproducibility may have come at the cost of some minor conservatism in quantifying drawings where the participant placed no datapoints beyond the bar-tip (arguably displaying a complete Bar-Tip Limit interpretation) yet placed some of the datapoints directly on the bar-tip. For example, if six of the 20 datapoints for each bar were placed on the bar-tip, the drawing would get a BTL index of only 85, when it arguably represented a pure BTL conception of the graph.

Reproducibility of BTL index coding procedure

Reproducibility of the BTL index coding was evaluated by comparing independently scored BTL indices of the two coauthors for a substantial subset of 201 drawn graphs. Figure 12 shows the data from this comparison, where the x coordinate is JBW's scoring and the y coordinate is SHK's scoring of each drawing. The correlation coefficient between the two sets of ratings is extremely high (r(199) = 0.991, 95% CI [0.988, 0.993]). Additionally, the blue best-fit line is nearly indistinguishable from the black line representing hypothetically equivalent indices between the two raters.

The fact that the line of best fit is so closely aligned with the line of equivalence is important. In theory, even with a correlation coefficient close to 1.0, one rater might give ratings that are shifted higher/lower, or that are more expanded/compressed, compared to the other. This would show up as a best-fit line that was either vertically shifted relative to the line of equivalence, or of a different slope compared to the line of equivalence. The absence of such a vertical shift or slope difference is therefore particularly strong evidence for repeatability. This strong evidence is echoed by a high interrater reliability statistic (Krippendorff's alpha) of 0.93 (Krippendorff, 2011); Krippendorff's alpha has a maximum of 1, and a score of 0.80 or higher is considered strong evidence of repeatability (Krippendorff, 2011). These analyses therefore demonstrate that the BTL index coding procedure is highly repeatable, and that variations in coding (a potential source of noise that could theoretically constrain the capacity of a measure to precisely capture Bar-Tip Limit interpretation) can be minimized.

Figure 12. Interrater reliability of Bar-Tip Limit (BTL) index coding demonstrates high repeatability of coding method. Coauthors SHK and JBW independently computed the BTL index for a subset of 201 drawn graphs (of 551 total). Each dot represents both BTL indices for a given drawn graph (y-value from SHK, x-value from JBW). Semitransparent dots code overlap as darkness. The black line is a line of equivalence, which shows x = y for reference. The line of best fit is blue. The close correspondence of these two lines, and the close clustering of dots around them, demonstrates high repeatability of the BTL index coding procedure.
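The two checks described here (a near-1.0 correlation, plus a best-fit line matching the line of equivalence) can be computed directly. A minimal sketch of the least-squares quantities involved, ours for illustration (the paper's figure was produced with ShowMyData.org):

```python
def fit_and_correlate(x, y):
    """Pearson's r plus the least-squares best-fit line (slope, intercept).
    Rater agreement requires r near 1 AND slope near 1 with intercept
    near 0, i.e., a best-fit line matching the x = y line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r = sxy / (sxx * syy) ** 0.5
    slope = sxy / sxx
    intercept = my - slope * mx
    return r, slope, intercept
```

A constant offset between raters would leave r at 1 but move the intercept away from 0, and a compression would change the slope; hence both checks are needed alongside r.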

Establishing cutoffs and prevalence

For purposes of estimating the prevalence of the BTL error in the current sample, a BTL index cutoff of 80 was selected. This was the average result when five clustering methods (k-means, median clustering, average linkage between groups, average linkage within groups, and centroid clustering) were applied via SPSS to the 551 individual graph BTL indices (the cutoffs were 75, 75, 80, 85, and 85, respectively). A cutoff of 80 was additionally verified as reasonable via visual inspection of the data shown in Figure 14.
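To illustrate the kind of computation involved, here is a sketch of two-cluster k-means in one dimension, returning the midpoint between the cluster centers as a cutoff. This is our simplified stand-in for the SPSS analyses, not a reproduction of them:

```python
def two_cluster_cutoff(values, iters=100):
    """1-D k-means with k = 2: alternate between assigning each value to
    the nearer center and recomputing centers as cluster means, then
    return the midpoint between the two final centers as the cutoff."""
    lo, hi = min(values), max(values)
    for _ in range(iters):
        boundary = (lo + hi) / 2
        low = [v for v in values if v <= boundary]
        high = [v for v in values if v > boundary]
        if not low or not high:
            break
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # converged
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2
```

Applied to data with tight modes near 50 and 100, such a procedure lands near the midpoint, consistent with the 75 to 85 range the five SPSS methods produced.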

Due to the strongly bimodal distribution of BTL index scores (see Results), only 14 of the 551 total drawings (2.5%) fell within the entire range of computed cutoffs (75 to 85). Therefore, though the computed confidence intervals around prevalence estimates, reported in Results, do not include uncertainty in selecting the cutoff, this additional source of uncertainty is small (at most, perhaps ±1.25%).

For Pentoney and Berger's (2016) data, analyzed below in Prior work reflects the same phenomenon: Prevalence, the respective cutoffs from the same five clustering methods produced BTL error percentages of 19.5, 19.5, 19.5, 19.5, and 24.6, for an average percentage of 20.5.

Prevalence of bar graphs of means versus bar graphs of counts

The methods discussed in this subsection pertain to the Results section Ecological exposure to mean versus count bar graphs. In that investigation, we assessed the likelihood of encountering mean versus count bar graphs across a set of relevant contexts by tallying the frequency of each bar graph type from three separate sources: elementary educational materials accessible via Google Image searches, college-level Introductory Psychology textbooks, and general Google Image searches. It was hypothesized that these sources would provide rough, but potentially informative, insights into the relative likelihood of exposure to each bar graph type in these areas. The methods used for each source were:

The elementary education Google Image search used the phrase "bar graph [X] grade," with "first" through "sixth" in place of [X]. The first 50 grade-appropriate graph results for each grade level, drawn from independent internet sources, were categorized as either count bar graph, mean bar graph, or other (histogram or line graph).

The college-level Introductory Psychology textbook count tallied all of the bar graphs of real data across the following eight widely used Introductory textbooks: Ciccarelli & Berstein, 2018; Coon, Mitterer & Martini, 2018; Gazzaniga, 2018; Griggs, 2017; Hockenbury & Nolan, 2018; Kalat, 2016; Lilienfeld, Lynn & Namy, 2017; and Myers & DeWall, 2017. The bar graphs were then categorized as dealing with either means or counts (histograms excluded).

The general Google Image search used the phrase "bar graph." The first 188 graphs were categorized as count bar graphs or mean bar graphs, or were excluded for being not bar graphs (tables, histograms) or for containing insufficient information to determine what type of bar graph they were.

Plotting and evaluation of data from previous studies

In our reexamination of prior results, we compare previous study results to the present results. Data are replotted from Newman and Scholl (2012) and Pentoney and Berger (2016) as Figure 20. For Newman and Scholl (2012), data from Study 5 are plotted categorically as "0" (no difference between in-bar and out-of-bar rating) versus "not 0" (positive or negative difference between in-bar and out-of-bar rating). The directionality of the latter differences, while desired, is not obtainable from the information reported in the original paper. For Pentoney and Berger (2016), the data plotted in their Figure 3 are pooled across the three separate conditions whose stimuli included bar graphs. A similar pattern of bimodality is evident in all three of those conditions. We used WebPlotDigitizer to extract the raw data from Pentoney and Berger's (2016) Figure 3.

Statistical tools and packages

The graphs shown in Figures 12, 16, and 18 were produced by ShowMyData.org, a suite of web-based data visualization apps coded in R by the second author using the R Shiny package. The statistics and distribution graphs shown in Figures 14, 17, 19, and 20 were produced via the ESCI R package. The bimodality simulations and graphs shown in Figure 15 were produced in base R, and the numerical analyses of bimodality using Hartigan's Dip Statistic (HDS) and the Bimodality Coefficient (BC) (Freeman & Dale, 2013; Pfister et al., 2013) were computed via the mousetrap R package. CIs on HDS, BC, and Cohen's d were computed with resampling (10,000 draws) via the boot R package. The clustering analyses that estimated BTL error cutoffs and prevalence were conducted via SPSS.

Results

This paper contains three separate sets of results: first, the core quantitative study of the BTL error (subsections Overview of core quantitative study through Educational and demographic correlates); second, the study of the prevalence of mean versus count bar graphs (subsection Ecological exposure to mean versus count bar graphs); and third, a reanalysis of prior studies of mean bar graph accessibility, the results of which suggest that the previously reported "within-the-bar bias" was actually not a bias at all, but instead was caused by the BTL error we report here (subsection A reexamination of prior results).

Overview of core quantitative study

From the Draw Datapoints on Graphs (DDoG) measure readouts shown in Figures 4, 5, and 6, one can already see the severe, stereotyped nature of the Bar-Tip Limit (BTL) error. It is additionally evident that this error is well explained as a conflation of mean bar graphs with count bar graphs, and the effectiveness of the DDoG measure in revealing the BTL error is clear.

Our main study sought to add quantitative precision to these qualitative observations by testing a large, education-diverse sample. Examination of this larger sample confirms the BTL error's categorical nature, high prevalence, stability within individuals, persistence despite thoughtful, intentional responding, and independence from foundational knowledge and graph content. Together, these results demonstrate that mean bar graphs are subject to a common, severe error of interpretation that makes them a potentially unwise choice for accurately communicating empirical results, particularly where general accessibility is a priority, such as in education, medicine, or popular media. These results also establish the DDoG measure and MAGI principles as effective tools for gaining powerful insights into graph interpretation.

The core study utilizes a set of 551 DDoG measure response drawings, or readouts, produced by an educationally and demographically diverse sample of 149 participants. These readouts, of four mean bar graph stimuli that vary in form and content, are accompanied by an array of control and demographic measures (see Methods).

Initial observations of the Bar-Tip Limit error

Figure 13 shows the four stimulus graphs and, for each, an example of the two common response types submitted. The correct Bar-Tip Mean response has datapoints distributed across the bar-tip (Figure 13, center column), and the incorrect Bar-Tip Limit response has datapoints on only the baseline side of the bar-tip (Figure 13, right column).

The severe inaccuracy shown in the Bar-Tip Limit readouts, relative to the Bar-Tip Mean readouts, raises the possibility that the Bar-Tip Limit readouts represent a categorical difference in thinking: an error. Yet the few isolated examples shown in Figures 4, 5, 6, and 13 are insufficient to conclusively distinguish common errors of thinking from outliers. To clearly demonstrate a common error requires a nontrivial degree of bimodality in the response distribution (Related works: Defining thought inaccuracies). The analysis that follows evaluates and plots all 551 readouts on a continuous scale of datapoint placement (Figure 14). The Bar-Tip Limit response is revealed to be a separate mode, eliminating the possibility that early observations were outliers and identifying the incorrect Bar-Tip Limit response as a common error rather than a bias or confusion.


Figure 13. DDoG measure readouts illustrating the two common, categorically different response types for each of the four stimulus graphs. The left column shows the four stimulus graphs (AGE, CLINICAL, GENDER, SOCIAL). For each stimulus graph, the center column shows a representative correct response (Bar-Tip Mean), and the right column shows an illustrative incorrect response (Bar-Tip Limit).

The nature and severity of the BTL error

A minimum requirement to differentiate between errors, biases, and confusions is a response scale with sufficient granularity to produce a detailed, finely contoured response distribution. The DDoG measure readouts provide such granularity when analyzed via a relatively continuous scale such as the BTL index (Methods: Coding of DDoG measure readouts via BTL index).

The BTL index is the percentage of drawn datapoints that did not exceed the bar-tip (i.e., those that remain within the bar). On this scale, a response that treats the bar-tip as mean has a value of 50 (or 50% of datapoints within the bar), while a response that treats the bar-tip as limit has a value of 100 (or 100% of datapoints within the bar). Values between 50 and 100 represent gradations between the pure "mean" versus "limit" uses of the bar-tip.

Figure 14e plots the BTL indices for each of the 551 drawn graphs produced by 149 participants. Keep in mind that any degree of bimodality (even barely detectable rises amidst a sea of responses) would provide substantial evidence that measured inaccuracy in the responses resulted from common errors of interpretation, as opposed to biases or confusions. Yet the modes here in Figure 14e are not merely detectable; they are tall, sharp peaks with a deep valley of rare response values in between. Equally remarkable to the height and sharpness of the peaks is the trough between them, which bottoms out at a frequency approaching zero. In the context of the DDoG measure's Expressive freedom and Information richness, which enable a very wide array of possible responses, it is striking to find such a clear dichotomy.

The sharp peak on the right side of Figure 14e includes 86 drawn graphs (15.6%) with BTL index values of exactly 100. The left-hand mode is similarly sharp, with a peak at 50 that contains 169 (30.7%) drawn graphs. Taken together, the peaks of the two modes alone (BTL index values of exactly 50 and 100) account for nearly half of all drawn graphs (46.3%).

Notably, the incorrect right peak, which treats the bar-tip as limit rather than mean, is evident despite two factors that, even given starkly categorical thinking, could easily have blunted or dispersed it: first, our conservative scoring procedures (Methods: Coding of DDoG measure readouts via BTL index), and second, potential response variation based on differences in graph stimuli form and content.

In addition to the distinct bimodality in the data, the particular locations of the modes are revealing: one mode is at the correct Bar-Tip Mean location, and the other is at the precise incorrect Bar-Tip Limit location that is expected when mean bar graphs are conflated with count bar graphs.

The sharpness and location of these peaks not only provide clarity with regard to mechanism (an error, a conflation) but also underscore the precision of the measure that is able to produce such clear results. Notably, the DDoG measure, using the BTL index, is fully capable of capturing patterns of bias (quantitative leaning, Figures 14c, 8c), or confusion (nonsystematic perplexity, Figures 14d, 8d), or values that do not correspond to any known cause. Yet here we see it clearly reveal an error (Figures 14b, 8b) with conflation as its apparent cause.

Panels f through i, on the right side of Figure 14, show BTL indices plotted separately for each stimulus graph. These plots are nearly indistinguishable from each other, and from the main plot (14e), despite substantial differences between the four stimuli (Figure 11). These plots thereby demonstrate four internal replications of the aggregate results shown in the main plot (14e). Such replications speak to the generality of the BTL error phenomenon.


Figure 14. Bimodal distribution of Bar-Tip Limit (BTL) index values reveals that the BTL error represents a categorical difference. The main graph (e) plots the distribution of BTL index values for all 551 DDoG measure drawings. The left inset graphs show visual definitions of: (a) baseline (no systematic inaccuracy), (b) errors, (c) biases, and (d) confusions (Figure 8). Note how closely e (our data) matches b (the data pattern for errors). The right inset graphs (f, g, h, i) plot BTL indices by graph stimulus (AGE, CLINICAL, GENDER, SOCIAL), which provide four internal replications of the aggregate result (e). The colors on the x-axes indicate the location of the computed cutoff of 80, taken from our cluster analyses, between the two categories of responses: Bar-Tip Mean responses (green) and Bar-Tip Limit (BTL) responses (yellow).

Critically, these analyses demonstrate that Bar-Tip Limit drawings like those shown in Figure 13 are in no way rare or atypical. They are not outliers. Nor are they the tail of a distribution. Rather, they represent their own mode: a categorically different, categorically incorrect interpretation, or error. We next turn to a more formal assessment of BTL error prevalence.

Prevalence of the BTL error: One in five

Having first characterized the BTL error as a severe error in graph interpretation (Figures 13, 14b, 14e), we then assessed its prevalence in our sample using a classification-based approach. A multi-method cluster analysis yielded the BTL index of 80 as a consensus cutoff between Bar-Tip Limit and Bar-Tip Mean responses (Methods: Establishing cutoffs and prevalence). We marked Figure 14 (panels e to i) with green and yellow x-axis colors to capture this distinction. At this cutoff, 122 of 551 graphs, or 22.1%, 95% CI [18.9, 25.8], demonstrated the BTL error.
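The reported interval is consistent with a standard score-based interval for a binomial proportion. As a check, a Wilson score interval (our sketch; the paper does not name its CI method) reproduces [18.9, 25.8] for 122 of 551 to one decimal place:

```python
def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion, in percent."""
    p = successes / n
    center = p + z * z / (2 * n)
    half = z * (p * (1 - p) / n + z * z / (4 * n * n)) ** 0.5
    denom = 1 + z * z / n
    return 100 * (center - half) / denom, 100 * (center + half) / denom
```

Here `wilson_ci(122, 551)` gives approximately (18.9, 25.8), matching the reported interval.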

Given the strength of bimodality in these data, and the sharpness of the BTL peak, BTL error prevalence values are remarkably insensitive to the specific choice of cutoff. For example, across the full range of cutoffs (75 to 85) produced by the multiple cluster analyses we carried out, these estimates remain around one in five (see Methods: Establishing cutoffs and prevalence). As we will see below, this one-in-five prevalence also turns out to be highly consistent with a careful reanalysis of past results (A reexamination of prior results).

Analytic quantification of bimodality

A next step, complementary to Figure 14's graphical elucidation of the BTL error's categorical nature, was to analytically quantify the degree of bimodality in the data. We do this via two standard analytic bimodality indices: Hartigan's Dip Statistic (HDS) and the Bimodality Coefficient (BC). HDS and BC are complementary measures that are often used in tandem (Freeman & Dale, 2013; Pfister et al., 2013). HDS and BC utilize idiosyncratic scales that range from 0.00 to 0.25, and from 0.33 to 1.00, respectively. To illustrate how these scales quantify bimodality, Figure 15 plots simulated distributions of 10,000 datapoints (middle), with both their HDS (top, blue) and BC (bottom, purple) scores. The HDS and BC scores for this dataset (the 551 drawn graphs) are indicated with arrows, accompanied by 95% confidence intervals (CIs). Clearly, the uncertainty in the data is small relative to the exceptionally high magnitude of bimodality. Again, the clarity of this result necessarily reflects not only the bimodal nature of the BTL error, but also the high resolution of the DDoG measure and the efficacy of the MAGI principles used in its design.

Figure 15. Two analytic approaches confirm that the BTL index data are strongly bimodal. Distribution graphs show reference data sets where varying degrees of separation were imposed on independently generated normal distributions. Separation is quantified in units of standard deviation (SD). Two standard bimodality statistics for each reference distribution were computed: Hartigan's Dip Statistic (HDS) is shown above in blue; the Bimodality Coefficient (BC) is shown below in purple (Freeman & Dale, 2013; Pfister et al., 2013). HDS and BC for our full data set, marked with arrows at top and bottom, along with 95% CIs, confirm strong BTL index bimodality (Table 4 bimodality statistics provide four internal replications of this overall result).
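Of the two indices, the BC has a closed form that is easy to show. The sketch below follows the SAS-style formula used by the mousetrap R package (Pfister et al., 2013), with bias-corrected sample skewness and excess kurtosis; it is our illustration, not the study's code, which was in R:

```python
def bimodality_coefficient(xs):
    """BC = (g1^2 + 1) / (g2 + 3(n-1)^2 / ((n-2)(n-3))), where g1 is
    bias-corrected sample skewness and g2 bias-corrected excess
    kurtosis. Values above ~0.555 (the uniform benchmark) suggest
    bimodality; a two-spike distribution approaches the maximum of 1."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    g1 = (m3 / m2 ** 1.5) * (n * (n - 1)) ** 0.5 / (n - 2)
    g2 = ((n + 1) * (m4 / m2 ** 2 - 3) + 6) * (n - 1) / ((n - 2) * (n - 3))
    return (g1 ** 2 + 1) / (g2 + 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
```

A distribution with two equal spikes, like BTL indices massed at 50 and 100, scores close to 1 on this scale, far above the 0.555 benchmark.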

Consistent with the replications we see in the plotted data (Figures 14e–i), the analytic results per stimulus graph, collected in Table 4, show clear internal replications of the aggregate results, verifying the consistency of the finding of bimodality across substantial differences in form and content between the four stimuli.

Cognitive mechanisms of the BTL error

Even before our systematic investigation, Figures 1 through 6 suggested a conflation of mean and count bar graphs as the apparent cognitive mechanism for the BTL error. The case for conflation is heightened by the clear second mode shown in Figures 14e–i. A second mode indicates a common, erroneous thought process (Figure 8b, Figure 14b), and that mode's position at the Bar-Tip Limit response, with a BTL index of 100, signals a conflation with count bar graphs in particular. Here we evaluate potential sources of this conflation.

BTL error stability within individuals

A first question is whether the BTL error represents a relatively stable understanding, or whether it occurs ad hoc, reflecting a more transient thought process. To answer this question, we first investigate consistency of the BTL error within individuals.

We find substantial consistency. The scatterplot matrix shown in Figure 16 plots BTL indices using pairwise comparison. Each scatterplot represents the comparison of two graph stimuli, and each pale gray dot plots a single participant's BTL indices for those two graphs: the x-coordinate is an individual's BTL index for one graph, and the y-coordinate is their BTL index for a second graph.

If the individual's BTL index is exactly the same for both graphs, their dot will fall on the (black) line of equivalence (the diagonal line where x = y). As a result, the distance of a dot from the line of equivalence shows the difference in BTL index values between the two compared stimulus graphs. The dots cluster around the line of equivalence in every scatterplot in Figure 16, indicating that individuals show substantial consistency of interpretation across graphs.

The dots are, additionally, semitransparent, so that multiple dots in the same location show up darker. For example, the very dark dot at the top-right corner of the top-right scatterplot represents the 18 participants whose BTL index values on both the GENDER graph (x-axis) and AGE graph (y-axis) are 100. Thus, the presence of dark black dots at (x = 100, y = 100) and (x = 50, y = 50) in all of the scatterplots is a focused visual indicator of consistency within participants.

Together, these scatterplots confirm what the two DDoG measure readouts in Figure 5 suggested: that an individual's interpretation of one graph is highly predictive of their interpretation of the other graphs, and that those who exhibit the BTL error for one graph are very likely to exhibit it on most or all other graphs, regardless of graph form or content.

Further supporting this conclusion of consistency, we found that a Cronbach's alpha internal consistency statistic (Cronbach, 1951), computed on the entire dataset shown in Figure 16, was near ceiling, at 0.94. This high internal consistency indicates that an individual's BTL indices across even just these four graphs can capture their relative tendency toward the BTL error with near-perfect precision. In sum, the BTL error appears to represent not just an ad hoc, one-off interpretation, but rather a more stable thought process that generalizes across graphs.
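For reference, Cronbach's alpha has a simple closed form over the k = 4 per-stimulus BTL indices. A sketch (ours) of the standard items-by-participants computation:

```python
def cronbach_alpha(items):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance
    of participants' total scores). `items` is a list of k lists, each
    holding one item's scores across the same participants."""
    k = len(items)
    n = len(items[0])

    def var(xs):  # unbiased sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    totals = [sum(item[i] for item in items) for i in range(n)]
    return k / (k - 1) * (1 - sum(var(it) for it in items) / var(totals))
```

Perfectly consistent items yield an alpha of exactly 1; the observed 0.94 sits just below that ceiling.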

Table 4. Data bimodality analysis, per stimulus graph.


Independence of the BTL error from graph content or form

Conceivably, despite the substantial stability of erroneous BTL interpretations within individuals, there might still exist global differences, from one stimulus to another, in levels of erroneous BTL thinking.

Remember that we chose our four graphs to vary substantially in both form (unidirectional versus bidirectional bars) and content (independent variables of age, therapeutic intervention, social situation, and gender, and dependent variables of memory, depressive symptoms, enjoyment, and visuospatial performance). If BTL interpretations were to differ substantially between pairs of these stimuli, it could suggest that such form or content differences played a role in generating or moderating the BTL error.

Figure 16 plots global mean BTL index values as red dots. The further a red mean dot lies from the black line of equivalence (where x = y), the more divergent the average interpretation is between the two graphs. All red mean dots are close to the line of equivalence, indicating that participants, as a group, exhibit similar levels of BTL interpretation for all graph stimuli. The small distances between the red dots and black lines on the graph are echoed numerically in the small Cohen's d effect sizes (Figure 16, "d ="). Despite a slight tendency for bidirectional graph stimuli (SOCIAL, GENDER) to produce greater BTL interpretation than unidirectional graph stimuli (AGE, CLINICAL), the Cohen's d values do not exceed the 0.20 threshold that defines a small effect (Cohen, 1988). The BTL error thus appears to occur relatively independently of graph content or form.

Figure 16. The Bar-Tip Limit (BTL) error persists across differences in graph form and content. Each of the six subplots in this figure compares individual BTL index values for two graph stimuli, one plotted as the x-value and the other as the y-value. Each gray dot represents one participant's data. Mean values are shown as red dots. Gray dots are semitransparent so that darkness indicates overlap. High overlap is observed near BTL index values of 50 (Bar-Tip Mean) and 100 (Bar-Tip Limit) for both stimulus graphs, indicating persistence of interpretation between compared stimuli, despite differences in graph form and content. This persistence is reflected numerically in the high correlations among (Pearson's r), and the small mean differences between (Cohen's d), graph stimuli. The high correlations are echoed visually by steep lines of best fit (blue), and the small differences are echoed visually by the proximity of mean values (red dots) to lines of equivalence (black lines, which show where indices are equal for the two stimuli). In brackets are the 95% CIs for r and d.
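For completeness, the effect size used here is the standard Cohen's d for two groups. A minimal sketch (ours; the paper computed d and its resampled CIs in R via the boot package):

```python
def cohens_d(a, b):
    """Cohen's d: difference between group means divided by the pooled
    standard deviation. By convention, |d| < 0.20 falls below even a
    'small' effect (Cohen, 1988)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled_var = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (ma - mb) / pooled_var ** 0.5
```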

Independence of the BTL error from foundational knowledge

A next step in isolating the thought processes that underlie the apparent conflation of mean bar graphs with count bar graphs is to ask what relevant foundational knowledge the participants have. We assessed foundational knowledge in two ways: (1) a definition task, asking participants to define the concept of average/mean in their own words immediately after the drawing task (Definition task: Define the average/mean), and (2) a graph reading task, asking participants to read a particular average/mean value from stimulus graphs that they later sketched for the DDoG measure drawing task (Graph reading task: Find the average/mean). In the definition task, 83% of definitions were mathematically or conceptually correct, and in the graph reading task, 93% of readings were correct.

As Figure 17 shows, filtering responses for a correctly defined average/mean, a correctly identified average/mean, or both has little effect on the percentage of BTL interpretations. The rate of the BTL error remains near one in five, at 96/486 = 19.8% [16.5, 23.5], 87/413 = 20.7% [17.1, 24.9], and 71/367 = 19.3% [15.6, 23.7], respectively. Together, these analyses suggest that the BTL error persists despite foundational knowledge about the average/mean and, moreover, despite sufficient attention to the foundational knowledge questions to answer them correctly.
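The bracketed 95% CIs reported for these proportions are consistent with Wilson score intervals for a binomial proportion (e.g., 96/486 reproduces 19.8% [16.5, 23.5]). The following is an illustrative sketch only; the CI method is not explicitly stated in this excerpt:

```python
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = p + z**2 / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return ((center - spread) / denom, (center + spread) / denom)

lo, hi = wilson_ci(96, 486)
print(f"{96/486:.1%} [{lo:.1%}, {hi:.1%}]")  # → 19.8% [16.5%, 23.5%]
```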

The BTL error occurs despite thoughtful, intentional responding

It was also relevant to establish that erroneous BTL interpretations were thoughtful and intentional by exploring whether carelessness, rushing, or task misunderstanding were major factors in the BTL error. In addition to the results shown in Figure 17, two further factors suggest that the BTL error persists despite thoughtful attention specifically to the DDoG drawing task itself.

First, drawn graphs for which the instructions were not followed sufficiently to provide usable data were excluded from the start (Methods: Collection of DDoG measure readouts). Here, the Expressive freedom of the DDoG measure worked to our advantage. More constrained measures—for example, button presses or rating scales—often render carelessness and task misunderstanding difficult to detect and segregate from authentic task responses. The DDoG measure, in contrast, makes it much harder to produce even a single usable response without a reasonable degree of intentional, thoughtful responding, thus guaranteeing a certain intentionality in the resulting data set.

Figure 17. The Bar-Tip Limit (BTL) error occurs despite correctly defining "mean" and correctly locating it on the graph. Percentage of readouts that showed the BTL error (defined as a BTL index over 80). "All" is the full dataset (n = 551 readouts). "Correctly defined average/mean" is restricted to participants who produced a correct definition for the mean (n = 486 readouts). "Correctly identified average/mean" is restricted to participants who correctly identified a mean value on the same graph that produced the readout (n = 413 readouts). "Correctly defined & identified average/mean" is restricted to participants who both correctly defined the mean and correctly identified a mean value on the graph (n = 367 readouts). In all cases, the proportion of BTL errors hovers around one in five. Vertical lines show 95% CIs, and gray regions show full probability distributions for the uncertainty around the percentage values.

Second, rushing was evaluated as a possible contributing factor to erroneous BTL interpretations by computing the correlation of the drawing time of usable graphs with BTL indices, on a graph-by-graph basis. This correlation was very close to zero: r(371) = −0.01, 95% CI [−0.11, 0.09]. Due to skew in the drawing time measure, we additionally computed a nonparametric (Spearman) correlation, which was also close to zero: rho(371) = 0.004, 95% CI [−0.10, 0.11]. The lack of correlation between drawing speed and BTL index provides further evidence that lack of intentional, thoughtful responding was not a major contributor to erroneous BTL interpretations.
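The two correlation checks described above can be sketched as follows. The data here are hypothetical stand-ins, since the per-graph drawing times and BTL indices are not reproduced in this excerpt; the rank-based Spearman version below assumes no tied values:

```python
def pearson(xs, ys):
    """Pearson's r: covariance scaled by the two standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def ranks(xs):
    """Ranks 1..n of the values (no tie handling, for illustration only)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(xs, ys):
    """Spearman's rho is Pearson's r computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical drawing times (s) and BTL indices, not the study's data:
drawing_time = [30, 45, 60, 75, 90]
btl_index = [20, 10, 40, 30, 50]
print(pearson(drawing_time, btl_index), spearman(drawing_time, btl_index))
```

A Spearman rho far from its Pearson counterpart would flag that skewed drawing times were distorting the parametric estimate; in the study both were near zero.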


Figure 18. The Bar-Tip Limit (BTL) error is substantially independent of education. Mean BTL indices for each participant plotted against general education level, number of psychology courses taken, and number of statistics courses taken (each reported via the four-point scale shown on the respective graph). The black line is the least-squares regression line, computed with rated responses treated as interval-scale data. Axis ranges and graph aspect ratios were chosen by ShowMyData.org such that the physical slope of each regression line equals its respective correlation coefficient (r).

Educational and demographic correlates

BTL error correlates of individual differences: Education

The existence of individual differences in the likelihood of making the BTL error (stability within individuals) raises questions about their correlates and potential origins. With this in mind, we evaluated the dataset for demographic predictors. Of particular interest was whether prior educational experiences, general or specific, correlated with BTL readouts. The stability that the BTL index exhibited within individuals across graphs made it possible to use each individual's mean BTL index to evaluate educational correlates.

Figure 18 shows that while general education and psychological coursework did not robustly predict mean BTL indices, statistics coursework did predict a slightly more correct (lower) mean BTL index.

This pattern of correlations raises several questions about education's relationship to the BTL error. First, what causes the nonzero relationship to statistics coursework, and does this relationship indicate that the BTL error is malleable and amenable to training? Second, why are all of these correlations not higher? Does this mean that the BTL error is such a tenacious belief that it is resistant to change? And, alternatively, are there aspects of standard statistical, psychological, or general education that could be improved to more effectively ameliorate the BTL error?

BTL error correlates of individual differences: Age, gender, country

In addition to education, we examined three further potential correlates: age, gender, and location (i.e., the country from which one completed the study). Age showed a modest correlation with mean BTL indices, with older individuals demonstrating somewhat more BTL error than younger individuals (r(146) = 0.21, 95% CI [0.05, 0.36]). Mean BTL indices were slightly lower among males, but not statistically significantly so (r(146) = 0.10, 95% CI [−0.06, 0.26]). While modest numerical differences in mean BTL indices were found between countries (United States M = 66.1, SD = 20.3, n = 36; United Kingdom M = 61.2, SD = 15.6, n = 61; other countries M = 64.0, SD = 19.6, n = 52), none of these differences reached statistical significance (all p values > 0.05).


Ecological exposure to mean versus count bar graphs

As we have seen, the apparent conflation of mean and count bar graphs demonstrated by the BTL error does not appear to be a thoughtless act or careless misattribution. The data show, rather, that one in five participants capable of understanding a mean, identifying it on a graph, and producing attentive, intentional, consistent drawings makes this error.

Broadly speaking, such thoughtful conflation could arise from either or both of "nature" and "nurture." Influences of nature might include fundamental, inborn aspects of perception, such as compelling visual interpretations of bars as solid objects (containers, stacks). Influences of nurture might include real-world experiences as varied as educational curricula and mass media.

While a full investigation of such developmental influences was beyond the scope of the current study, we were able to probe a key avenue made salient by our understanding of early graph pedagogy (Figure 1): putative exposure to mean and count bar graphs in early education and beyond.

To investigate the relative prevalence of these two bar-graph types, we counted bar graphs from three age-correlated sources. First, we evaluated a set of Google Image searches using the phrase "bar graph [X] grade," where X was "1st" through "6th" (Methods: Prevalence of bar graphs of means versus bar graphs of counts). This search yielded an array of graphs embedded in elementary pedagogical materials. Such graphs were deemed interesting due to early education's role in forming assumptions that may carry forward into adulthood. Second, we performed a general Google Image search for "bar graph," included as a window into the prevalence of count versus mean bar graphs on the internet as a whole. Our third and final source was a set of eight widely used college-level Introductory Psychology textbooks, included as a common higher-level educational experience (Ciccarelli & Berstein, 2018; Coon et al., 2018; Gazzaniga, 2018; Griggs, 2017; Hockenbury & Nolan, 2018; Kalat, 2016; Lilienfeld et al., 2017; and Myers & DeWall, 2017).

As Figure 19 shows, in all three sources, mean graphs constituted only a minority of bar graphs. The effect was nearly unanimous among the elementary education materials, where mean bar graphs were almost nonexistent (1/81 = 1.2%, 95% CI [0.2, 6.7]). Mean bar graphs were slightly more prevalent in the general Google Image search, at 17% (17/100 = 17.0%, 95% CI [10.9, 25.5]). Compared to the other sources, Introductory Psychology textbooks had a higher proportion of mean bar graphs, at 36% (53/149 = 35.6%, 95% CI [28.3, 43.5]). This higher proportion was predictable, given that in psychology research, analyses of mean values across groups or experimental conditions are often a central focus. Yet even in the psychology textbook sample, mean bar graphs still constituted only about a third of bar graphs.

Figure 19. Count bar graphs are more common than mean bar graphs across three major domains. Mean bar graphs plotted as a percentage of total bar graphs (i.e., mean bar graphs plus count bar graphs). "Elementary Educational Materials": Google Image search for grades 1–9 (n = 81). "General Google Search": Google search for "bar graph" (n = 100). "College Psychology Textbooks": bar graphs appearing in eight widely used college Introductory Psychology textbooks (n = 149). Vertical lines show 95% CIs, and gray regions show full probability distributions for uncertainty around the percentage values. In all three cases, the percentage of mean bar graphs remains substantially, and statistically robustly, below 50%.

This analysis identifies elementary education both as a plausible original source of erroneous BTL thinking and as a potential lever for remediating it. Further, the preponderance of count bar graphs across all three of these sources suggests a pattern of iterative reinforcement throughout life. One's understanding of bar graphs, limited to count bar graphs in elementary school (Figures 1 and 19) and reinforced through casual exposure (Figure 19, Google), may represent a well-worn, experience-hardened channel of thinking that is modified only slightly by higher education (Figure 18; BTL error correlates of individual differences: Education).

In this context, count bar graph interpretation plausibly becomes relatively effortless: a count graph heuristic, or mental shortcut (Shah & Oppenheimer, 2008; Gigerenzer & Gaissmaier, 2011). In the context of real-world exposure, where mean bar graphs are in the vast minority, such a count graph heuristic would generally produce the correct answer. Only occasionally, when misapplied to a mean bar graph, would it lead one astray.

Moreover, the greater abstraction, ambiguity, and complexity of mean bar graphs relative to count bar graphs, added to the visually identical forms of these two bar graph types (Figure 2), provides ideal fodder for such a heuristic-based conflation. The apparent conflation that we observe may therefore result from an experience-fueled heuristic, or cognitive shortcut.

A reexamination of prior results

As discussed in Related Works (Literature Related to the BTL Error), the probability rating scale approach used in prior studies of mean bar graph accessibility had successfully identified a reproducible asymmetry: higher rated likelihood of data in-bar than out-of-bar for the average person (Newman & Scholl, 2012; Correll & Gleicher, 2014; Pentoney & Berger, 2016; Okan et al., 2018). For nearly a decade, that asymmetry—dubbed the "within-the-bar bias"—arguably provided the strongest existing evidence against the accessibility of mean bar graphs.

With our current results in hand, we now reexamine those prior reports of asymmetry, offering four key insights: First, the prior work observed a prevalence remarkably similar to ours. Second, the bimodality and accurate left peak in prior results indicate that the prior work misidentified an error as a bias. Third, a flatter, wider right data cluster in Pentoney and Berger's (2016) study, compared to ours, highlights a key limitation of the probability rating scale approach used in prior work. Fourth, recent changes in methodological and statistical practices shed light on choices that may have affected result characterization in prior work.

The first two observations—similar prevalence and strong bimodality—confirm that the previously reported "within-the-bar bias" was likely not a bias at all, but rather the same BTL error, and the same apparent count bar conflation, that is revealed here by our data. The latter two observations—right-mode differences and changing statistical practices—help to explain how both error and conflation could have remained undiscovered in prior work despite over a decade's worth of relevant, robust, self-consistent evidence from numerous independent labs.

Prior work reflects the same phenomenon: Prevalence

Our data demonstrate the Bar-Tip Limit (BTL) error in about one in five persons; across educational levels, ages, and genders; and despite thoughtful responding and relevant foundational knowledge. Prior studies differed from ours in that they were conducted by different research groups, at different times, with different participant samples, using a different measure (Newman & Scholl, 2012; Correll & Gleicher, 2014; Pentoney & Berger, 2016; Okan et al., 2018). If, despite all these differences, a prevalence reasonably comparable to our one in five was detectable in one or more of the prior studies, it would provide striking evidence that (1) those past results were likely caused by the same BTL error phenomenon and (2) the underlying BTL error is robust to whatever differs between our study and the past studies.

Pentoney and Berger (2016, hereafter "PB study") provide a particularly rich opportunity to reexamine prevalence because, unique among the past studies cited above, the PB study produced graphs that showed raw data for each participant. A careful look (see PB study data replotted as our Figure 20b) demonstrates a familiar bimodality. Additionally, the same analytic tests used to quantify the bimodality in our own data show a high degree of bimodality in the PB study data (HDS = 0.068 [0.042, 0.098]; BC = 0.66 [0.56, 0.75]), which strongly supports the principled separation of those data into two groups.
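The bimodality coefficient (BC) quoted here has a standard closed form based on bias-corrected sample skewness and excess kurtosis, with values above 5/9 (about 0.56) conventionally read as suggesting bimodality. A sketch under that standard definition follows; the paper's exact implementation is not specified in this excerpt:

```python
def bimodality_coefficient(xs):
    """SAS-style bimodality coefficient from bias-corrected sample
    skewness (g1) and excess kurtosis (g2). Values above ~5/9 are
    conventionally taken to suggest bimodality."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n  # central moments
    m3 = sum((x - m) ** 3 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    g1 = (m3 / m2 ** 1.5) * (n * (n - 1)) ** 0.5 / (n - 2)
    g2 = ((n + 1) * (m4 / m2 ** 2 - 3) + 6) * (n - 1) / ((n - 2) * (n - 3))
    return (g1 ** 2 + 1) / (g2 + 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

# Two tight clusters, as in strongly bimodal BTL-index data:
bimodal = [0] * 50 + [100] * 50
print(bimodality_coefficient(bimodal) > 5 / 9)  # → True (BC ≈ 0.95 here)
```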

We applied to the PB study the same five cluster analysis techniques used to define the correct/error cutoff in our own data (Methods: Establishing cutoffs and prevalence). These analyses yielded an average prevalence estimate of 20.5%. Based on visual inspection of the PB study data in Figure 20b, this cutoff does a good job of separating the two apparent clusters of responses. Additionally, the resulting 20.5% prevalence estimate closely replicates the approximate one in five estimate from our own sample and suggests the BTL error as the source of the asymmetry reported in the PB study.

While none of the other three prior probability rating scale studies (Newman & Scholl, 2012; Correll & Gleicher, 2014; Okan et al., 2018) provided raw underlying data of the sort that would enable a formal analysis of bimodality and prevalence, a single parenthetical statement in Newman and Scholl's (2012) Study 5 (hereafter NS5) is suggestive. It notes:

"In fact, 73% of the participants did rate the [in-bar vs out-of-bar] points as equally likely in this study. But ... the other 27% of participants — i.e., those in the minority who rated the two points differently — still reliably favored the point within the bar."

In other words, 73% of participants showed no asymmetry, which is the expected response for an interpretation of the bar-tip as a mean. The asymmetry shown by the other 27% of participants, while not numerically broken down by direction and magnitude, "reliably favored the point within the bar"; and the average result, including the 73% of participants with zero asymmetry, was a robust in-bar asymmetry. So the notion that around 20% may have shown a rather large in-bar asymmetry, as in the PB study, seems highly plausible. We plotted the zero-asymmetry and nonzero-asymmetry results for the NS5 study as Figure 20c.

Figure 20. A reexamination of prior results reveals both consistency with our current results and overlooked evidence for the Bar-Tip Limit (BTL) error. Shown side by side for direct comparison are plotted data from (a) the present study using the DDoG measure; (b) Pentoney and Berger (2016) (PB); and (c) Newman and Scholl (2012) Study 5 (NS5). All graphs label correct Bar-Tip Mean response values in green and computed (or inferred, in the case of NS5) Bar-Tip Limit (BTL) response values in yellow. Notably, the PB study used the exact same 9-point rating scale and graph stimulus (of hypothetical chemical freezing temperature data) as the NS5 study (Figure 9 shows the scale and stimulus) but added ratings of four additional temperatures (−15, −10, 10, 15) to NS5's original two (−5, 5). Comparing PB's results (b) to ours (a) suggests that while PB's version of the probability rating scale measure achieved a fairly high degree of accuracy and precision at values near zero, it still suffered from apparently irreducible inaccuracy and/or imprecision as values diverged from zero (see text for further discussion).

In sum, prevalence estimates gleaned from past studies are highly consistent with our own, suggesting both that the asymmetry found in past studies was caused by the same BTL error and that this error of interpretation is highly robust to differences between studies.

Prior work reflects the same phenomenon: Error, rather than bias

In addition to providing prevalence, the distinct clusters of responses and strong bimodality in the PB study's data demonstrate a key feature of erroneous graph interpretation: two discernible peaks, or modes (Figures 8b, 21a). A second key feature of erroneous interpretation is that one of the two modes falls on the correct interpretation.

Both the DDoG and PB data plots demonstrate two peaks, and in each plot, one of those peaks corresponds exactly to the correct value for a Bar-Tip Mean interpretation (left mode in Figures 8b, 21a and 20b, 21b).

In the PB study, this correct value is 0. The PB study's analyses used a difference score: the mean in-bar rating minus the mean out-of-bar rating. This score ranged from −8 to 8, in increments of 1/3. The task was taken directly from the NS5 study (same stimulus, same response scale; see Figure 9a). However, the PB study had participants rate three in-bar locations (rather than NS5's one) and three out-of-bar locations (rather than NS5's one). The study therefore had more datapoints, and greater continuity of measurement, than that of NS5 (Figure 20b).

A difference score near 0 indicates similar in-bar and out-of-bar ratings, which is the expected response for a Bar-Tip Mean interpretation. Approximately 25% of the PB study's participants (30 of 118) gave a response of exactly 0, and more than half (64 of 118, or 54%) gave a response in the range −0.67 to 0.67, creating a tall, sharp mode at the correct Bar-Tip Mean value.
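As described above, each PB-study difference score is the mean of three in-bar ratings minus the mean of three out-of-bar ratings on the 9-point scale, which is why scores span −8 to 8 on a 1/3 grid. A minimal sketch with hypothetical ratings (not the study's data):

```python
def pb_difference_score(in_bar, out_bar):
    """Mean in-bar rating minus mean out-of-bar rating (9-point scale).
    With three integer ratings per side, scores fall on a 1/3 grid
    from -8 to 8; a score of 0 matches a Bar-Tip Mean interpretation."""
    return sum(in_bar) / len(in_bar) - sum(out_bar) / len(out_bar)

# Hypothetical response patterns:
print(pb_difference_score([9, 9, 9], [1, 1, 1]))  # maximal in-bar asymmetry
print(pb_difference_score([5, 5, 5], [5, 5, 5]))  # zero asymmetry (Bar-Tip Mean)
```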

Results from the NS5 study tell a similar story. Despite the NS5 study's relatively low-resolution measure (the single rated in-bar location and out-of-bar location) and limited reporting of individual values (only the proportion of response scores that were zero versus nonzero was reported), we can still see that the NS5 data contained a mode on the correct value (Figure 20c), replicating a feature of erroneous interpretations. Indeed, a full 73% of participants gave 0 as their response, particularly strong evidence against bias (which tends to shift a peak away from the correct value) and for error (which tends to preserve a peak at the correct value; see Figure 8).


Therefore, a careful reexamination of the available information from the PB and NS5 studies shows strong evidence for a causative, systematic error in participant interpretation. The one in five prevalence of asymmetry, strong bimodality, and a response peak at the correct value are three further indications that the previously observed effects were likely caused by the same BTL error that we observed via the DDoG measure.

How the DDoG measure supported the identification of the BTL error

If the BTL error truly is at the root of these prior reports of asymmetry, why did its existence and apparent mechanism—conflation of mean bar graphs with count bar graphs—elude discovery for over a decade?

Though our reexamination of past results (Newman & Scholl, 2012; Correll & Gleicher, 2014; Pentoney & Berger, 2016; Okan et al., 2018) has so far focused on ways in which the probability rating scale data align closely with our DDoG measure findings (Figure 20), the answer to our question may lie in the areas of difference.

Figure 20 shows a salient way in which the DDoG measure response distribution (Figure 20a) differs from that of the PB study (Figure 20b): the shape of the rightmost (erroneous) peak. The DDoG measure's BTL error peak is sharp, and its location corresponds exactly to the expectation for a count bar graph interpretation (i.e., a BTL index of 100, representing a perfect Bar-Tip Limit response; Figure 20a). The PB study's probability rating scale data, in contrast, show no such sharp right peak; the right data cluster is flatter and more dissipated (Figure 20b).

To understand why these response distributions would differ so much if they measure the same effect, consider a core difference between the measures. The probability rating scale requires participants to translate their interpretation of a graph into common English words ("likely," "unlikely," "somewhat," and "very") (see Figure 9a). This location-to-words adjustment is a prime example of the sort of mental transformation that we sought to minimize via the MAGI principle of limited mental transformations. Such translations infuse the response with the ambiguity of language interpretation (i.e., does every participant quantify "somewhat" in the same way?). A relevant analogy here is the classic American game of "telephone," where translations of a message, from one person to the next, may yield an output that bears little resemblance to the input. Similarly, the greater the number of necessary mental transformations between viewing and interpreting a graph stimulus and generating a response, the less precise the translation, and the more difficult it is to reconstruct the interpretation from the response. In such cases, it becomes difficult to use a response to objectively judge the interpretation as correct or incorrect, violating the MAGI principle of Ground-truth linkage.

Further, though probability rating scale responses can indicate the presence and directionality of an asymmetric response, they lack the necessary specificity (e.g., "somewhat," "very") to delineate, in concrete terms, the magnitude of asymmetry in the underlying interpretation. As we have seen, it is the magnitude of the error—the position of the second mode revealed by the DDoG measure—that so clearly indicates its mechanism.

But why does this translation issue pertain only to the right peak of the PB study's distribution? Because the (left) Bar-Tip Mean peak represents a mean value. In a DDoG measure readout, this mean value is concretely expressed by an equal distribution of datapoints on either side of the bar-tip. In the probability rating scale measure, equal distribution is expressed by using the same word to describe the likelihood of both the in-bar and out-of-bar dot positions. In this single case of equality, the language translation issue is obviated: so long as the person uses the same words (e.g., "somewhat likely" or "very unlikely") to describe both in-bar and out-of-bar locations, it matters little what those words are. In this specific case only, when the difference score is zero, the probability rating scale has Ground-truth linkage. This, in turn, results in a tall, sharp, interpretable left (Bar-Tip Mean) peak in the asymmetry measure.

How changing statistical and methodological practices supported the identification of the BTL error

We have seen that the mental transformations required by previous rating scale methods impaired identification of the BTL error despite compelling