IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 20, NO. 12, DECEMBER 2014 1963
1077-2626 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
A Principled Way of Assessing Visualization Literacy
Jeremy Boy, Ronald A. Rensink, Enrico Bertini, and Jean-Daniel Fekete, Senior Member, IEEE
Abstract—We describe a method for assessing the visualization literacy (VL) of a user. Assessing how well people understand
visualizations has great value for research (e.g., to avoid confounds), for design (e.g., to best determine the capabilities of an
audience), for teaching (e.g., to assess the level of new students), and for recruiting (e.g., to assess the level of interviewees). This
paper proposes a method for assessing VL based on Item Response Theory. It describes the design and evaluation of two VL tests
for line graphs, and presents the extension of the method to bar charts and scatterplots. Finally, it discusses the reimplementation of
these tests for fast, effective, and scalable web-based use.
Index Terms—Literacy, Visualization literacy, Rasch Model, Item Response Theory
In April 2012, Jason Oberholtzer posted an article describing two
charts that portray Portuguese historical, political, and economic
data . While acknowledging that he is not an expert on those top-
ics, Oberholtzer claims that thanks to the charts, he feels like he has “a
well-founded opinion on the country.” He attributes this to the simplic-
ity and efﬁcacy of the charts. He then concludes by stating: “Here’s
the beauty of charts. We all get it, right?”
But do we all really get it? Although the number of people famil-
iar with visualization continues to grow, it is still difﬁcult to estimate
anyone’s ability to read graphs and charts. When designing a visual-
ization for non-specialists or when conducting an evaluation of a new
visualization system, it is important to be able to pull apart the potential efficiency of the visualization and the actual ability of users to understand it.
In this paper, we address this issue by creating a set of visualiza-
tion literacy (VL) tests for line graphs, bar charts, and scatterplots.
At this point, we loosely deﬁne visualization literacy as the ability to
use well-established data visualizations (e. g., line graphs) to handle
information in an effective, efﬁcient, and conﬁdent manner.
To generate these tests, we develop here a method based on Item
Response Theory (IRT). Traditionally, IRT has been used to assess ex-
aminees’ abilities via predeﬁned tests and surveys in areas such as ed-
ucation , social sciences , and medicine . Our method uses
IRT in two ways: ﬁrst, in a design phase, we evaluate the relevance of
potential test items; and second, in an assessment phase, we measure
users’ abilities to extract information from graphical representations.
Based on these measures, we then develop a series of tests for fast, ef-
fective, and scalable web-based use. The great beneﬁt of this method
is that it inherits IRT's property of making ability assessments that are based not only on raw scores, but on a model that captures the standing of users on a latent trait (e.g., the ability to use various graphical representations).
As such, our main contributions are as follows:
•a useful deﬁnition of visualization literacy,
•a method for: 1) assessing the relevance of visualization literacy test items, 2) assessing an examinee's level of VL, and 3) creating fast and effective assessments of VL for well-established visualization techniques and tasks; and
•an implementation of four online tests, based on our method.

•Jeremy Boy is with Inria, Telecom ParisTech, and EnsadLab. E-mail:
•Ronald A. Rensink is with the University of British Columbia. E-mail:
•Enrico Bertini is with NYU Polytechnic School of Engineering. E-mail:
•Jean-Daniel Fekete is with Inria. E-mail: email@example.com.
Our immediate motivation for this work is to design a series of tests
that can help Information Visualization (InfoVis) researchers detect
low-ability participants when conducting online studies, in order to
avoid possible confounds in their data. This requires the tests to be
short, reliable, and easy to administer. However, such tests can also be
applied to many other situations, such as:
•designers who want to know how capable of understanding visu-
alizations their targeted audience is;
•teachers who want to make an assessment of the acquired knowl-
edge of freshmen;
•practitioners who need to hire capable analysts; and
•education policy-makers who may want to set a standard for visualization literacy.
This paper is organized in the following way. It begins with a back-
ground section that deﬁnes the concept of literacy and discusses some
of its best-known forms. Also introduced are the theoretical constructs
of information comprehension and graph comprehension, along with
the concepts behind Item Response Theory. Next, Section 3 presents
the basic elements of our approach. Section 4 shows how these can be
used to create and administer two VL tests using line graphs. In Sec-
tion 5, our method is extended to bar charts and scatterplots. Section 6
describes how our method can be used to redesign fast, effective, and
scalable web-based tests. Finally, Section 7 provides a set of “take-
away” guidelines for the development of future tests.
Very few studies investigate the ability of a user to extract information
from a graphical representation such as a line graph or a bar chart. And
of those that do, most make only higher-level assessments: they use
such representations as a way to test mathematical skills, or the ability
to handle uncertainty [13, 31, 32, 34, 49]. A few attempts do focus
more on the interpretation of graphically-represented quantities [18,
20], but they base their assessments only on raw scores and limited
test items. This makes it difﬁcult to create a true measure of VL.
The online Oxford dictionary deﬁnes literacy as “the ability to read
and write”. While historically this term has been closely tied to its
textual dimension, it has grown to become a broader concept. Taylor
proposes the following: “Literacy is a gateway skill that opens to the
potential for new learning and understanding” .
Given this broader understanding, other forms of literacy can be dis-
tinguished. For example, numeracy was coined to describe the skills
needed for reasoning and applying simple numerical concepts. It was
Manuscript received 31 Mar. 2014; accepted 1 Aug. 2014; date of publication 2014; date of current version 9 Nov. 2014.
For information on obtaining reprints of this article, please send
e-mail to: firstname.lastname@example.org.
Digital Object Identiﬁer 10.1109/TVCG.2014.2346984
intended to “represent the mirror image of [textual] literacy” [43, p.
269]. Like [textual] literacy, numeracy is a gateway skill.
With the advent of the Information Age, several new forms of liter-
acy have emerged. Computer literacy “refers to basic keyboard skills,
plus a working knowledge of how computer systems operate and of the
general ways in which computers can be used” . Information liter-
acy is deﬁned as the ability to “recognize when information is needed”,
and “the ability to locate, evaluate, and use effectively the needed in-
formation” . Media literacy commonly relates to the “ability to
access, analyze, evaluate and create media in a variety of forms” .
2.1.2 Higher-level Comprehension
In order to develop a meaningful measure of any form of literacy, it is
necessary to understand the various components involved, starting at
the higher levels. Friel et al.  suggest that comprehension of in-
formation in written form involves three kinds of tasks: locating, inte-
grating, and generating information. Locating tasks require the reader
to ﬁnd a piece of information based on given cues. Integrating tasks
require the reader to aggregate several pieces of information. Generat-
ing tasks not only require the reader to process given information but
also require the reader to make document-based inferences or to draw
on personal knowledge.
Another important aspect of information comprehension is question
asking, or question posing. Graesser et al.  posit that question
posing is a major factor in text comprehension. Indeed, the ability to
pose low-level questions, i. e., to identify a series of low-level tasks,
is essential for information retrieval and for achieving higher-level comprehension.
Several literacy tests are currently in common use. The two most im-
portant are the UNESCO’s Literacy Assessment and Monitoring Pro-
gramme (LAMP) , and the OECD’s Programme for International
Student Assessment (PISA) . Other international assessments in-
clude the Adult Literacy and Lifeskills Survey (ALL) , the In-
ternational Adult Literacy Survey (IALS) , and the Miller Word
Identiﬁcation Assessment (MWIA) .
Assessments are also made using more local scales like the US Na-
tional Assessment of Adult Literacy (NAAL) , the UK’s Depart-
ment for Education Numeracy Skills Tests , or the University of
Kent’s Numerical Reasoning Test .
Most of these tests, however, take basic literacy skills for granted,
and focus on higher-level assessments. For example the PISA test
is designed for 15 year-olds who are ﬁnishing compulsory education.
This implies that examinees should have already learned—and still
remember—the basic skills required for reading and counting. It is
only when examinees clearly fail these tests that certain measures are
deployed to evaluate the lower-level skills.
NAAL provides a set of 2 complementary tests for examinees who
fail the main textual literacy test : the Fluency Addition to NAAL
(FAN) and The Adult Literacy Supplemental Assessment (ALSA).
These focus on adults’ ability to read single words and small passages.
Meanwhile, MWIA tests whole-word dyslexia. It has 2 levels, each
of which contains 2 lists of words, one Holistic and one Phonetic, that
examinees are asked to read aloud. Evaluation is based on time spent
reading and number of words missed. Proficient readers should find such tests extremely easy, while low-ability readers should find them difficult.
2.2 Visualization Literacy
The view of literacy as a gateway skill can also be applied to the ex-
traction and manipulation of information from graphical representa-
tions. In particular, it can be the basis for what we will refer to as
visualization literacy (VL): the ability to conﬁdently use a given data
visualization to translate questions speciﬁed in the data domain into
visual queries in the visual domain, as well as interpreting visual pat-
terns in the visual domain as properties in the data domain.
This deﬁnition is related to several others that have been proposed
concerning visual messages. For example, a long-standing and of-
ten neglected concept is visual literacy. This has been deﬁned as the
“ability to understand, interpret and evaluate visual messages” . Vi-
sual literacy is rooted in semiotics, i. e., the study of signs and sign
processes, which distinguishes it from visualization literacy. While it
has probably been the most important form of literacy to date, it is
nowadays frowned upon, and general literacy tests do not take it into account.
Taylor  has advocated for the study of visual information lit-
eracy, while Wainer has advocated for graphicacy . Depending
on the context, these terms refer to the ability to read charts and di-
agrams, or to qualify the merging of visual and information literacy
teaching . Because of this ambiguity, we prefer the more general
term “visualization literacy.”
2.2.2 Higher-level Comprehension
Bertin  proposed three levels on which a graph may be interpreted:
elementary, intermediate, and comprehensive. The elementary level
concerns the simple extraction of information from the data. The inter-
mediate level concerns the detection of trends and relationships. The
comprehensive level concerns the comparison of whole structures, and
inferences based on both data and background knowledge. Similarly,
Curcio  distinguishes three ways of reading from a graph: from
the data, between the data, and beyond the data.1
The higher-level cognitive processes behind the reading of graphs
have been the concern of the area of graph comprehension. This
area studies the speciﬁc expectations viewers have for different graph
types , and has highlighted many differences in the understanding
of novices and expert viewers [15, 25, 26, 45].
Several inﬂuential models of graph comprehension have been pro-
posed. For example, Pinker  describes a three-way interaction be-
tween the visual features of a display, processes of perceptual organi-
zation, and what he calls the graph schema, which directs the search
for information in the particular graph. Several other models are simi-
lar (see Trickett and Trafton ). All involve the following steps:
1. the user has a pre-specified goal to extract a specific piece of information
2. the user looks at the graph, and the graph schema and gestalt processes are activated
3. the salient features of the graph are encoded, based on these processes
4. the user now knows which cognitive/interpretative strategies to use, because the graph is familiar
5. the user extracts the necessary goal-directed visual chunks
6. the user may compare 2 or more visual chunks
7. the user extracts the relevant information to satisfy the goal
The “visual chunking” mentioned above consists in segmenting a vi-
sual display into smaller parts, or chunks . Each chunk represents
a set of entities that have been grouped according to gestalt principles.
Chunks can in turn be subdivided into smaller chunks.
Shah  identiﬁes two cognitive processes that occur during
stages 2 through 6 of this model:
1. a top-down process where the viewer’s prior knowledge of se-
mantic content inﬂuences data interpretation, and
2. a bottom-up process where the viewer shifts from perceptual pro-
cesses to interpretation.
These processes are then interactively applied to various chunks,
which suggests that data interpretation is a serial and incremental pro-
cess. However, Carpenter & Shah  have shown that graph compre-
hension, and more speciﬁcally visual feature encoding, is more of an
iterative process than a straight-forward serial one.
1For further reference, refer to Friel et al.’s Taxonomy of Skills Required for
Answering Questions at Each Level .
Freedman & Shah  relate the top-down and bottom-up pro-
cesses to a construction and an integration phase, respectively. Dur-
ing the construction phase, the viewer activates prior graphical knowl-
edge, i. e., the graph schema, and domain knowledge to construct a co-
herent conceptual representation of the available information. During
the integration phase, disparate knowledge is activated by “reading”
the graph and is combined to form a coherent representation. These
two phases take place in alternating cycles. This suggests that do-
main knowledge can inﬂuence the interpretation of graphs. However,
highly visualization-literate people should be less influenced by both the top-down and bottom-up processes .
Relatively little has been done on the assessment of literacy involving
graphical representations. However, interesting work has been done on
measuring the perceptual abilities of a user to extract information from
these. For example, various studies have demonstrated that users can
perceive slope, curvature, dimensionality, and continuity in line graphs
(see ). Correll et al.  have also shown that users can make
judgements about aggregate properties of data using these graphs.
Scatterplots have also received some attention. For example, studies
have examined the ability of a user to determine Pearson correlation
r [5, 9, 27, 37, 39]. Several interesting results have been obtained, such as a general tendency to underestimate correlation, especially in the range 0.2 < |r| < 0.6, and an almost complete failure to perceive correlation when |r| < 0.2.
Concerning the outright assessment of literacy, the only relevant re-
search work we know of is Wainer’s study on the difference in graph-
icacy levels between third-, fourth-, and ﬁfth-grade children . He
presents the design of an 8-item test using several visualizations, in-
cluding line graphs and bar charts. He then describes his use of Item
Response Theory  to score the test results, and shows the effec-
tiveness of this method for assessing abilities. His conclusion is that
children reach “adult levels of graphicacy” as early as the fourth grade,
leaving “little room for further improvement.” However, it is unclear
what these “adult levels” are. If we look at textual literacy, some chil-
dren are more literate than certain adults. People may also forget these
skills if they do not regularly practice. Thus, while very useful, we
consider Wainer’s work to be limited. What is needed is a way to
assess adult levels of visualization literacy.
2.3 Item Response Theory and the Rasch Model
Consider what we would like in an effective VL test. To begin with,
it should cover a certain range of abilities, each of which could be
measured by speciﬁc scores. Imagine such a test has 10 items, which
are marked 1 when answered correctly, and 0 otherwise. Rob takes the
test and gets a score of 2. Jenny also takes the test, and gets a score
of 7. We would hope that this means that Jenny is better than Rob at
reading graphs. In addition, we would expect that if Rob and Jenny
were to take the test again, both would get approximately the same
scores, or at least that Jenny would still get a higher score than Rob.
We would also expect that whatever VL test Rob and Jenny both take,
Jenny will always be better than Rob.
Now imagine that Chris takes the test and also gets a score of 2. If
we based our judgement on this raw score, we would assume that Chris
is as bad as Rob at reading graphs. However, taking a closer look at the
items that Chris and Rob got right, we realize that they are different:
Rob gave correct answers to the two easiest items, while Chris gave
correct answers to two relatively complex items. This would of course
require us to know the level of difﬁculty of each item, and would mean
that while Chris gave incorrect answers to the easy items, he might still
show some ability to read graphs. Thus, we would want the different
scores to have “meanings” to help us determine whether Chris was
simply lucky (he guessed the answers), or whether he is in fact able to
get the simpler items right, even though he didn’t this time.
Imagine now that Rob, Jenny, and Chris take a second VL test. Rob
gets a score of 3, Chris gets 4, and Jenny gets 10. We would infer that
this test is easier, since the scores are higher. However, looking at the
score intervals, we see that Jenny is 7 points ahead of Rob, whereas she
was only 5 points ahead in the ﬁrst test. If we were to truly measure
abilities, we would want these intervals to be invariant. In addition,
seeing that Chris’ score is once again similar to Rob’s (knowing that
they both got the same items right this time) would lead us to think
that they do in fact have similar abilities. We could then conclude that
this test provides more information on lower abilities than the first one,
since it is able to separate Rob and Chris’ scores.
Finally, imagine that all three examinees take a third test, and all
get a score of 10. While we might be tempted to conclude that this test
is VL-agnostic, it may simply be that its items are too easy and not discriminating enough.
One way of fulﬁlling all of these requirements is by using Item
Response Theory (IRT) . This is a model-based approach that does
not use response data directly, but transforms them into estimates of a
latent trait (e. g., ability), which then serves as the basis of assessment.
IRT models have been applied to tests in a variety of ﬁelds such as
health studies, education, psychology, marketing, economics, social
sciences (see ), and even graphicacy .
The core idea of IRT is that the performance of an examinee de-
pends on both the examinee’s ability and the item’s difﬁculty; the goal
is then to separate out these two factors. An important aspect of the
approach is to project them onto the same scale—that of the latent
trait. Ability, or standing on the latent trait, is derived from a pattern of
responses to a series of test items; item difﬁculty is then deﬁned by the
0.5 probability of success of an examinee with the appropriate ability.
For example, an examinee with an ability value of 0 (0 corresponding
to an average achiever) will have a 50% probability of giving a cor-
rect answer to an item with a difﬁculty value of 0, corresponding to an
average level of difﬁculty.
IRT offers models for data that are dichotomous (e.g., true/false responses) and polytomous (e.g., responses on Likert-like scales). In
this paper, we focus on models for dichotomous data. These deﬁne the
probability of success on an item i by the function:

P_i(θ) = c_i + (1 − c_i) / (1 + e^(−a_i(θ − b_i)))

where θ is an examinee's standing on a latent trait (i.e., his or her ability), and a_i, b_i, and c_i are the characteristics of the item. The central characteristic is b, the difficulty characteristic; if θ = b, the examinee has a 0.5 probability of giving a correct answer to the item. Meanwhile, a is the discrimination characteristic. An item with a very high discrimination value basically sets a sharp threshold at θ = b: examinees with θ < b have a probability of success close to 0, and examinees with θ > b have a probability of success close to 1. Conversely, an item with a low discrimination value cannot clearly separate examinees. Finally, c is the guessing characteristic. It sets a lower bound for the extent to which an examinee will guess an answer. We have found c to be unhelpful, so we have set it to zero (no guessing) for our development.
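The probability of success described above can be written out directly as the three-parameter logistic function (a minimal Python sketch; the function and variable names are ours):

```python
import math

def p_correct(theta, a=1.0, b=0.0, c=0.0):
    """Probability that an examinee with ability theta answers an item
    correctly under the 3PL model:
    P(theta) = c + (1 - c) / (1 + exp(-a * (theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# An average examinee (theta = 0) facing an average item (b = 0, c = 0)
# has a 0.5 probability of success.
print(p_correct(0.0))                          # 0.5

# A highly discriminating item (large a) behaves like a sharp threshold at b.
print(p_correct(-0.5, a=20.0, b=0.0))          # close to 0
print(p_correct(+0.5, a=20.0, b=0.0))          # close to 1

# The guessing characteristic c sets a lower bound on success probability.
print(p_correct(-10.0, a=1.0, b=0.0, c=0.25))  # close to 0.25
```

Setting c to zero, as is done in this paper, reduces this to the two-parameter form.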
Note that the value of each characteristic is not absolute for a given
item: it is relative to the latent trait that the test is attempting to un-
cover. Therefore, it cannot be expected that the characteristics of iden-
tical items be exactly the same in different tests. For example, consider
a simple numeracy test with two items: 10 + 20 (item 1) and 17 + 86 (item 2). It should be assumed that item 1 is easier than item 2. In other words, the difficulty characteristic of item 2 should be higher than that of item 1. Now if we add another item to the test, say 51 × 93 (item 3),
the most difﬁcult item in the previous version of the test (item 2) will
no longer seem so difﬁcult. However, it should still be more difﬁcult
than item 1. Thus, while individual characteristics may vary, the gen-
eral order of difﬁculty should be preserved. The same goes for ability
values (or ability scores). If they are to be compared between different
tests, the measured latent trait must be the same.
Various IRT models for dichotomous data have been proposed. One is the one-parameter logistic model (1PL), which sets a to a specific value for all items, sets c to zero, and only considers the variation of b. Another is the two-parameter logistic model (2PL), which considers the variations of a and b, and sets c to zero. A third is the three-parameter logistic model (3PL), which considers variations of a, b, and c. As such, 1PL and 2PL can be regarded as special cases of
3PL, where different item characteristics are assigned speciﬁc values.
A last variant is the Rasch model (RM), a special case of 1PL in which the discrimination a is fixed at 1.
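To make these variants concrete, here is a small sketch (in Python, with invented item parameters) showing that under a 2PL model, two examinees with the same raw score but different response patterns, like Rob and Chris in the earlier example, can receive different ability estimates:

```python
import math

def p_correct(theta, a, b):
    # 2PL: the guessing characteristic c is fixed at 0
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def mle_ability(responses, a_vals, b_vals):
    """Grid-search the ability value that maximizes the likelihood of
    an observed response pattern (1 = correct, 0 = incorrect)."""
    best_theta, best_ll = None, -float("inf")
    for step in range(-400, 401):
        theta = step / 100.0
        ll = 0.0
        for x, a, b in zip(responses, a_vals, b_vals):
            p = p_correct(theta, a, b)
            ll += math.log(p) if x == 1 else math.log(1.0 - p)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta

# Four items ordered from easiest to hardest (difficulty b), with the two
# easy items more discriminating (higher a) than the two hard ones.
a_vals = [2.0, 2.0, 0.5, 0.5]
b_vals = [-1.5, -0.5, 0.5, 1.5]

rob = mle_ability([1, 1, 0, 0], a_vals, b_vals)    # correct on the easy items
chris = mle_ability([0, 0, 1, 1], a_vals, b_vals)  # correct on the hard items

# Same raw score (2 out of 4), yet the two response patterns lead to
# different ability estimates under 2PL.
print(rob, chris)
```

Under 1PL or the Rasch model (all a equal), the raw score is a sufficient statistic, so both patterns would receive the same estimate; it is the varying discrimination of 2PL that lets the pattern itself carry information.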
Thus, IRT offers a way to evaluate the relevance of test items during
a design phase (e.g., how difficult items are, or how discriminant they
are), and a way to measure examinees’ abilities during an assessment
phase. These two phases constitute the backbone of the method we
present in this paper, which is why we stress that our approach will
be successful only if an IRT model ﬁts a set of empirically collected
data. Furthermore, its accuracy will depend on how closely an IRT
model describes the interaction between examinees’ abilities and their
responses, i. e., how well the model describes the latent trait. Thus,
different variants of IRT models should be tested initially to ﬁnd the
best ﬁt. Finally, it should be mentioned that IRT models cannot be
relied upon to “fix” problematic issues in a test. Proper test design is essential.
In the approach we develop here, test items generally involve a 3-part
structure: 1) a stimulus, 2) a task, and 3) a question. The stimuli are the
particular graphical representations used. Tasks are deﬁned in terms of
the visual operations and mental projections that an examinee should
perform to answer a given question. While tasks and questions are
usually linked, we emphasize this distinction because early piloting
revealed that different “orientations” of a question (e. g., emphasis on
particular visual aspects, or data aspects) could affect performance.
To identify possible factors that may inﬂuence the difﬁculty of a
test item, we reviewed all the literacy tests that we could ﬁnd which
use graphs and charts as stimuli [13, 18, 20, 31, 32, 34, 49, 50]. Note
that our goal is not to investigate the effect of these factors on item
difﬁculty; we present them here merely as elements to be considered
in the design phase.
We identiﬁed 4 potential stimulus parameters: number of samples,
intrinsic complexity (or variability) of the data, layout, and level of
distraction. We also found 6 recurring task types: extrema (maximum
and minimum), trend, intersection, average, and comparison. Finally,
we distinguished 3 different question types: “perception” questions,
“high-congruency” questions, and “low-congruency” questions. Each
of these is described in the following subsections.
3.1 Stimulus parameters
In our survey, we ﬁrst focused on identifying parameters to describe
the graphical properties of a stimulus. We found four:
Number of samples This refers to the number of graphically en-
coded elements in the stimulus. Among other things, the value of this
parameter can impact tasks that require visual chunking .
Complexity This refers to the local and global variability of the
data. For example, a dataset of the yearly life expectancy in different
countries over a 50 year time period has a low local variation (no dra-
matic “bounces” between two consecutive years), and low global vari-
ation (a relatively stable, linear, increasing trend). In contrast, a dataset
of the daily temperature in different countries over a year shows high
local variation (temperatures can vary dramatically from one day to the next) and medium global variation (temperature generally rises and decreases only once during the year).
Layout This refers to the structure of the graphical framework
and its scales. Layouts can be single (e. g., a 2-dimensional Euclidian
space), superimposed (e. g., three axes for a 2-dimensional encoding),
or multiple (e. g., several frameworks for a same visualization). Mul-
tiple layouts include cutout charts and broken charts . Scales can
be single (linear or logarithmic), bifocal, or lens-like.
2For a complete set of references on the Rasch model, refer to http://
Distraction This refers to the graphical elements present in the
stimulus that are not necessary for the task at hand. These are consid-
ered to be distractors. Correll et al.  have shown that even small
variations in attributes of distractors can impact perception. However,
here we simply use distraction in a Boolean way, i. e., present or not.
Next, we focused on identifying tasks that require only visual intelli-
gence, i. e., purely visual operations or mental projections on a graph-
ical representation. We found six: Maximum (T1), Minimum (T2),
Variation (T3), Intersection (T4), Average (T5), and Comparison (T6).
All are standard benchmark tasks in InfoVis. T1 and T2 consist in
ﬁnding the maximum and minimum data points in the graph, respec-
tively. T3 consists in detecting a trend, similarities, or discrepancies
in the data. T4 consists in finding the point at which the graph intersects a given value. T5 consists in estimating an average value.
Finally, T6 consists in comparing different values or trends.
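Purely as an illustration of what each task asks of the viewer, the six task types can be restated as operations on the underlying data (the sample data and variable names below are invented):

```python
# The six task types T1-T6, expressed as operations on a small series of
# (label, value) samples.
samples = [("Jan", 3.0), ("Feb", 5.0), ("Mar", 4.0), ("Apr", 8.0)]
values = [v for _, v in samples]

# T1 / T2: find the maximum and minimum data points
t1 = max(samples, key=lambda s: s[1])              # ("Apr", 8.0)
t2 = min(samples, key=lambda s: s[1])              # ("Jan", 3.0)

# T3: detect the overall trend (sign of the first-to-last variation)
t3 = "increasing" if values[-1] > values[0] else "decreasing or flat"

# T4: find where the series crosses a given value (here 4.5), i.e. the
# index of the first pair of consecutive samples that bracket it
threshold = 4.5
t4 = next(i for i in range(len(values) - 1)
          if (values[i] - threshold) * (values[i + 1] - threshold) <= 0)

# T5: estimate the average value
t5 = sum(values) / len(values)                     # 5.0

# T6: compare two values (is Feb higher than Jan?)
t6 = values[1] > values[0]

print(t1, t2, t3, t4, t5, t6)
```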
Finally, we focused on identifying different types of questions. We
found three: perception questions, and high- and low-congruency
questions. Perception questions refer only to visual aspects of the dis-
play (e. g., “what color are the dots?”). Conversely, congruent ques-
tions refer to semantic aspects of the data. The level of congruence
is then deﬁned by the “replaceability” of the data-related terms in the
question by perceptual terms. A high-congruency question translates
into a perceptual query simply by replacing data terms by perceptual
terms (e. g., “what is the highest value”/“what is the highest bar?”).
A low-congruency question, in contrast, has no such correspondence
(e. g., “is A connected to B—in a matrix diagram”/“is the intersection
between column A and row B highlighted?”).
4 Application to Line Graphs
To illustrate our method, we ﬁrst created two line graph tests—Line
Graphs 1 (LG1) and Line Graphs 2 (LG2)—of slightly different de-
signs, based on the principles described above. We then calibrated
them using Amazon’s Mechanical Turk (MTurk).
4.1 Design Phase
4.1.1 Line Graphs 1: General Design
For our ﬁrst test (LG1), we created a set of twelve items using different
stimulus parameters and tasks. We hand-tailored each item based on
an expected range of difﬁculty. Piloting had revealed that high vari-
ation in item dimensions led to incoherent tests (i. e., IRT models did
not ﬁt the response data), implying that when factors vary too much
within a test, additional abilities beyond those involved in basic vi-
sualization literacy are likely at play. Thus, we kept the number of
varying factors low: only distraction and tasks varied. The test used
four samples for the stimuli, and a single layout with single scales. A
summary is given in Table 1.
Each item was repeated five times.3 The test was blocked by item,
and all items and their repetitions were randomized to prevent carry-
over effects. We added an extra condition using a standard table at the
beginning of each block to give examinees the opportunity to consol-
idate their understanding of the new question, and to separate out the
comprehension stage of the question-response process believed to oc-
cur in cognitive testing . The test was thus composed of 72 trials.
In the following paragraphs, we describe other important design
parameters we used in this test.
Scenario The PISA 2012 Mathematics Framework  empha-
sizes the importance of an understandable context for problem solving.
The current test focuses on one’s community, with problems set in a
3Early piloting had revealed that examinees would stabilize their search
time and conﬁdence after a few repetitions. In addition, repeated trials usu-
ally provide more robust measures as medians can be extracted (or means in
the case of Boolean values).
Item ID Task Distraction
LG1.1 max 0
LG1.2 min 0
LG1.3 variation 0
LG1.4 intersection 0
LG1.5 average 0
LG1.6 comparison 0
LG1.7 max 1
LG1.8 min 1
LG1.9 variation 1
LG1.10 intersection 1
LG1.11 average 1
LG1.12 comparison 1
Item ID Task Congruency
LG2.1 max high
LG2.2 min high
LG2.3 variation high
LG2.4 intersection high
LG2.5 average high
LG2.6 comparison high
LG2.7 max low
LG2.8 min low
LG2.9 variation low
LG2.10 intersection low
LG2.11 average low
LG2.12 comparison low
Item ID Task Samples
BC.1 max 10
BC.2 min 10
BC.3 variation 10
BC.4 intersection 10
BC.5 average 10
BC.6 comparison 10
BC.7 max 20
BC.8 min 20
BC.9 variation 20
BC.10 intersection 20
BC.11 average 20
BC.12 comparison 20
Item ID Task Distraction
SP.1 max 0
SP.2 min 0
SP.3 variation 0
SP.4 intersection 0
SP.5 average 0
SP.6 comparison 0
SP.7 max 1
SP.8 min 1
SP.9 variation 1
SP.10 intersection 1
SP.11 average 1
SP.12 comparison 1
Table 1: Designs of Line Graphs 1 (LG1), Line Graphs 2 (LG2), Bar Chart (BC), and Scatterplot (SP). Only varying dimensions are shown. Each item is repeated 6 times, beginning with a table condition (repetitions are not shown). Pink cells in the Item ID column indicate duplicate items in LG1 and LG2. Tasks with the same color coding are the same. Gray cells in the Distraction, Congruency, and Samples columns indicate difference with white cells. The Distraction column uses a binary encoding: 0 = no distractors, 1 = presence of distractors.
To avoid the potential bias of a priori domain knowledge, the test was set within the following science-fiction scenario: The year is 2813. The Earth is a desolate place. Most of mankind has migrated throughout the universe. The last handful of humans remaining on Earth are now actively seeking another planet to settle on. Please help these people determine what the most hospitable planet is by answering the following series of questions as quickly and accurately as possible.
Data The dataset we used had a low-local and medium-global level of variability. It presented the monthly evolution of unemployment in different countries between the years 2000 and 2008. Country names were changed to fictitious planet names listed in Wikipedia, and years were modified to fit the scenario.
Priming and Pacing Before each new block of repetitions, examinees were primed with the upcoming graph type, so that the concepts and operations necessary for information extraction could be set up. To separate out the time required to read questions, a specific pacing was given to each block. First, the question was displayed, along with a button labeled “Proceed to graph framework”; this led participants to the graphical framework with the appropriate title and labels. At the bottom of this was another button labeled “Display data,” which displayed the full stimulus.
As mentioned, to give examinees the opportunity to fully comprehend each question, every block began with a “question comprehension” condition in which the data were shown in table form. This was intended to remove potential effects caused by the setup of high-level operations for solving a particular kind of problem.
Finally, to make sure ability (and not capacity) was being tested, an 11 s timeout was set for each repetition. This was based on the mean time required to answer the items in our pilot studies.
Response format To respond, examinees were required to click on one of several possible answers, displayed in the form of buttons below the stimulus. In some cases, correct answers were not directly displayed. For example, certain values were not explicitly shown with labeled ticks on the graph's axes. This was done to test examinees' ability to make confident estimations (i.e., to handle uncertainty). In addition, although the stimuli used color coding to show different planets, the response buttons did not. This forced examinees to translate the answer found in the visual domain back into the data domain.
To calibrate our test, we administered it on MTurk. While the validity of using this platform may be debated, due to the lack of control over particular experimental conditions, we considered it best to perform our calibration using the results of a wide variety of people.
Participants To our knowledge, no particular number of samples is recommended for IRT modeling. We recruited 40 participants, who were required to have a 98% acceptance rate and a total of 1000 or more HITs approved.
Coding Six Turkers spent less than 1.5 s on average reading and answering questions; they were considered random clickers, and their results were removed from further analysis. All retained Turkers were native English speakers.
The remaining data were sorted according to item and repetition ID
(assigned before randomization). Responses for the table conditions
were removed. A score dataset (LG1s) was then created in accord with
the requirements of IRT modeling: correct answers were scored 1 and
incorrect answers 0. Scores for each set of item repetitions were then
compressed by computing the rounded mean values. This resulted in
a set of twelve dichotomous item scores for each examinee.
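As a concrete illustration of this coding step, here is a minimal Python sketch (the paper's analysis was done in R; the item IDs and response values below are hypothetical) that collapses the repeated 0/1 outcomes of each item into a single dichotomous score via the rounded mean:

```python
from statistics import mean

def dichotomous_score(repetitions):
    """Collapse the 0/1 outcomes of an item's repetitions into a
    single dichotomous score by rounding their mean (>= 0.5 -> 1)."""
    return 1 if mean(repetitions) >= 0.5 else 0

# One examinee's hypothetical outcomes for the 5 scored repetitions
# of two items (the table condition has already been removed):
responses = {"LG1.1": [1, 1, 0, 1, 1], "LG1.11": [0, 1, 0, 0, 1]}
scores = {item: dichotomous_score(r) for item, r in responses.items()}
```

With an odd number of repetitions, the mean can never be exactly 0.5, so the rounding is unambiguous.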
The purpose of this calibration is to remove items that are unhelpful for distinguishing between low and high levels of VL. To do so, we need to: 1) check that the simplest variant of IRT models (i.e., the Rasch model) fits the data, 2) find the best variant of the model to get the most accurate characteristic values for each item, and 3) assess the usefulness of each item.
Checking the Rasch model The Rasch model (RM) was first fitted to the score dataset. A 200-sample parametric bootstrap goodness-of-fit test using Pearson's χ² statistic revealed a non-significant p-value for LG1s (p > 0.54), suggesting an acceptable fit⁴. The Test Information Curve (TIC) is shown in Fig. 1a. It reveals a near-normal distribution of test information across different ability levels, with a slight bump around −2, and a peak around −1. This means that the test provides more information about examinees with relatively low abilities (0 being the ability level of an average achiever) than about examinees with high abilities.
Finding the right model variant Different IRT models, implemented in the ltm R package, were then fitted to LG1s. A series of pairwise likelihood ratio tests showed that the two-parameter logistic model (2PL) was most suitable. The new TIC is shown in Fig. 1b.
Assessing the usefulness of test items The big spike in the TIC (Fig. 1b) suggests that several items with difficulty characteristics just above −2 have high discrimination values. This is confirmed by the very steep Item Characteristic Curves (ICCs) (Fig. 3a) for items LG1.1, LG1.4, and LG1.9 (a > 51), and can explain the slight distortion in Fig. 1a.
The probability estimates revealed that examinees with average abilities have a 100% probability of giving a correct answer to the easiest items (LG1.1, LG1.4, and LG1.9), and a 41% probability of giving a correct answer to the hardest item (LG1.11). However, the fact that LG1.11 has a relatively low discrimination value (a < 0.7) suggests that it is not very effective for separating ability levels.
⁴ For more information about this statistic, refer to .
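These probability estimates follow from the two-parameter logistic item response function, P(θ) = 1 / (1 + e^(−a(θ − b))). A minimal Python sketch (with illustrative a and b values, not the fitted parameters from LG1):

```python
from math import exp

def p_correct(theta, a, b):
    """2PL probability that an examinee with ability theta correctly
    answers an item with discrimination a and difficulty b. Setting
    a = 1 for every item recovers the Rasch model."""
    return 1.0 / (1.0 + exp(-a * (theta - b)))

# An average examinee (theta = 0) on a very easy, highly discriminating
# item versus a harder, weakly discriminating one (illustrative values):
easy = p_correct(0.0, a=5.0, b=-2.0)  # close to 1
hard = p_correct(0.0, a=0.7, b=0.5)   # well below 1
```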
(a) TIC of LG1s under RM (b) TIC of LG1s under 2PL
Fig. 1: Test Information Curves (TICs) of the score dataset of the first line graph test under the original constrained Rasch model (RM) (a) and the two-parameter logistic model (2PL) (b). The ability scale shows the θ-values. The slight bump in the normal distribution of (a) can be explained by the presence of several highly discriminating items, as shown by the big spike in (b).
Fig. 2: Test Information Curve of the score dataset of the second line graph test under the original constrained Rasch model. The test information is normally distributed.
(a) ICCs of LG1s under 2PL (b) ICCs of LG2s under RM
Fig. 3: Item Characteristic Curves (ICCs) of the score datasets of the first line graph test (LG1s) under the two-parameter logistic model (a), and of the second line graph test (LG2s) under the constrained Rasch model (b). The different curve steepnesses in (a) are due to the fact that 2PL computes individual discrimination values for each item, while RM sets all discrimination values to 1.
IRT modeling appears to be a solid approach for calibrating our test design. Our results (Fig. 1) show that LG1 is useful for differentiating between examinees with relatively low abilities, but not so much for ones with high abilities.
The slight bump in the distribution of the TIC (Fig. 1a) suggests that several test items are quite effective for separating ability levels around −2. This is confirmed by the spike in Fig. 1b, which indicates the presence of highly discriminating items. Overall, both Test Information Curves reveal that the test is best suited for examinees with relatively low abilities, since most of the information it provides concerns ability levels below zero.
In addition, Fig. 3a reveals that several items in the test have identical difficulty and discrimination characteristics. Some of these could be considered for removal, as they provide only redundant information. Similarly, item LG1.11, which has a low discrimination characteristic, could be dropped, as it is less effective than others.
4.1.5 Line Graphs 2: General Design
For our second line graph test (LG2), we also created twelve items, with varying factors restricted to question congruency and tasks. The test used four samples for the stimuli, and a single layout with single scales. The same scenario, dataset, pacing, and response format as for LG1 were kept, as well as the five repetitions, the question comprehension condition, and the 11 s timeout. As such, six items in this test were identical to items in LG1 (see pink cells in Table 1). This was done to ensure that the order of item difficulty would remain consistent across the different tests.
The calibration was again conducted on MTurk. 40 participants
were recruited; the work of three Turkers was rejected, for the same
reason as before.
4.1.6 Results and Discussion
Our analysis was driven by the requirements listed above. Data were
sorted and encoded in the same way as before, and a score dataset for
LG2 was obtained (LG2s).
RM was fitted to the score dataset, and the goodness-of-fit test revealed an acceptable fit (p > 0.3). The pairwise likelihood ratio test showed that RM was the best of all possible IRT models. The Test Information Curve (Fig. 2) is normally distributed, with a peak around −1. This indicates that, like our first line graph test, LG2 is best suited for examinees with relatively low abilities.
The Item Characteristic Curves of both tests were then compared. While it cannot be expected that identical items have the exact same characteristics, their difficulty order should remain consistent (see Sect. 2.3). Fig. 3 shows some slight discrepancies for items 1, 3, and 6 between the two tests. However, the fact that item LG1.3 is further to the left in Fig. 3a is misleading. It is due to the extremely high a-values of items LG1.1 and LG1.4. Thus, while their b-values are slightly higher than that of LG1.3, the probability of success of an average achiever is higher for these items than it is for LG1.3 (1 > 0.94). Furthermore, the difficulty characteristics of LG1.3 and LG2.3 are very similar (0.94 and 0.92, respectively). Therefore, the only exception in the ordering of item difficulties is item 6, which is estimated to be more difficult than item 2 in LG1, and not in LG2.
This suggests that LG1 and LG2 cover the same latent trait, i.e., the ability to read line graphs. To examine this, we equated the test scores using a common item equation approach. RM was fitted to the resulting dataset, the goodness-of-fit test showed an acceptable fit, and 2PL provided the best fit. Individual item characteristics were generally preserved, with the exception of item 6, which, interestingly, ended up with characteristics very similar to those of item 2. This confirms that the two tests cover the same latent trait. Thus, although individual characteristics are slightly altered by the equation (e.g., item 6), items in LG1 can safely be transposed to LG2, without hindering the overall coherence of the test, and vice-versa.
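For readers who want a feel for what equating does, the simplest common-item linking method (mean/mean linking) shifts one test's difficulty scale so that the shared items agree on average. This is a deliberate simplification of the concurrent-calibration approach used above, shown with hypothetical b-values:

```python
from statistics import mean

def mean_mean_link(b_common_old, b_common_new):
    """Mean/mean linking constant: the shift that aligns the new form's
    difficulty scale with the old form's, using the shared items."""
    return mean(b_common_old) - mean(b_common_new)

def rescale(b_values, shift):
    """Place a set of item difficulties on the old form's scale."""
    return [b + shift for b in b_values]

# Hypothetical difficulties of the six items shared by LG1 and LG2:
b_lg1 = [-2.1, -1.8, -0.9, -0.3, 0.4, 0.9]
b_lg2 = [-1.9, -1.6, -0.7, -0.1, 0.6, 1.1]
shift = mean_mean_link(b_lg1, b_lg2)  # ≈ -0.2 on this data
b_lg2_on_lg1_scale = rescale(b_lg2, shift)
```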
4.2 Assessment Phase
Having shown that our test items have a sound basis in theory, we now turn to the assessment of visualization literacy. While a standard method would simply sum up the correct responses, our method considers each response individually, with regard to the difficulty of the item it was given for. To make this assessment, we inspected the ability scores derived from the fitted IRT models. These scores represent examinees' standings (θ) on the latent trait, and correspond to a unique response pattern. They have great predictive power, as they can determine an examinee's probability of success on items that s/he has not completed, provided that these items follow the same latent variable scale as other test items. As such, ability scores are perfect indicators for assessing VL.
LG1 revealed 27 different ability scores, ranging from −1.85 to 1.
The distribution of these scores was near-normal, with a slight bump
around −1.75. 40.7% of participants were above average (i. e., θ>0),
and the mean was −0.27.
LG2 revealed 33 different ability scores, ranging from −1.83 to
1.19. The distribution was also near-normal, with a bump around −1.
39.4% of participants were above average, and the mean was −0.17.
These results show that the means are close to zero, and the distributions near-normal. This suggests that most Turkers, while somewhat below average in visualization literacy for line graphs, have fairly standard levels of VL.
While it would be interesting to develop broader ranges of item complexities for the line graph stimulus (by using the common item equation approach), thus extending the psychometric quality of the tests, we consider LG1 and LG2 to be sufficient for our current line of research. Furthermore, we believe that these low levels of difficulty reflect the general simplicity of, and massive exposure to, line graphs.
To see whether our method also applies to other types of visualizations, we created two additional tests: one for bar charts (BC) and one for scatterplots (SP).
5.1 Design Phase
5.1.1 Bar Charts: General Design
Like LG1 and LG2, the design of our bar chart test (BC) was based on the principles described in Section 3. We created twelve items, with varying factors restricted to number of samples and tasks (see Table 1). The same scenario, pacing, response format, repetitions, question comprehension condition, and 11 s timeout were kept. The dataset presented life expectancy in various countries, with country names again changed to fictitious planet names.
The only difference from the factors used earlier (apart from the stimulus) involved the variation task, which is essentially a trend detection task. Bar charts are sub-optimal for determining trends, so this task was replaced by a “global similarity detection” task, as done in  (e.g., “Do all the bars have the same value?”).
The calibration was again conducted on MTurk. 40 participants
were recruited; the work of six Turkers was rejected, for the same
reason as before.
5.1.2 Results and Discussion
Our analysis was driven by the same requirements as for the line graph
tests. Data were sorted and encoded in the same way, resulting in a
score dataset for BC (BCs).
RM was first fitted to BCs; the goodness-of-fit test revealed an acceptable fit (p > 0.37), and the likelihood ratio test showed that it fitted best. However, the Test Information Curve (Fig. 4a) is not normally distributed. This is due to the presence of several extremely low difficulty (i.e., easy) items (BC.3, BC.7, BC.8, and BC.9; b = −25.6), as shown in Fig. 4b. Inspecting the raw scores for these items revealed a 100% success rate. Thus, they were considered too easy, and were removed. Similarly, items BC.1 and BC.2 (for both, b < −4) were also removed.
(a) TIC of BCs under RM (b) ICCs of BCs under RM
Fig. 4: Test Information Curve (a) and Item Characteristic Curves (b) of the score dataset of the bar chart test under the constrained Rasch model. The TIC in (a) is not normally distributed because of several very low difficulty items, as shown by the curves to the far left of (b).
(a) TIC of the subset of BCs under RM (b) ICCs of the subset of BCs under RM
Fig. 5: Test Information Curve (a) and Item Characteristic Curves (b) of the subset of the score dataset of the bar chart test under the constrained Rasch model. The subset was obtained by removing the very low difficulty items shown in Fig. 4b.
To check the coherence of the resulting subset of items, RM was fitted again to the remaining set of scores. Goodness-of-fit was maintained (p > 0.33), and RM still fitted best. The new TIC (Fig. 5a) is normally distributed, with a peak around −1. This indicates that, like our line graph tests, this subset of BC is best suited for examinees with relatively low abilities.
5.1.3 Scatterplots: General Design
For our scatterplot test (SP), we once again created twelve items, with varying factors restricted to distraction and tasks (see Table 1). The same scenario, pacing, response format, repetitions, and question comprehension condition were kept. The dataset presented levels of adult literacy by expenditure per student in primary school in different countries, with country names again changed to fictitious planet names.
Slight changes were required for some of the tasks, since scatterplots use two spatial dimensions (as opposed to bar charts and line graphs). For example, stimuli with distractors in LG1 only required examinees to focus on one of several samples; here, stimuli with distractors could either require examinees to focus on a single datapoint or on a single dimension.
We had initially expected that SP would be more difficult, and that items would require more time to complete. However, a pilot study showed that the average response time per item was again roughly 11 s. Therefore, the 11 s timeout condition was kept.
The calibration was again conducted on MTurk. 40 participants were recruited; the work of one Turker was not kept because of technical (logging) issues.
5.1.4 Results and Discussion
Our analysis was once again driven by the same requirements as before. The same sorting and coding was applied to the data, resulting in the score dataset SPs. The fitting procedure was then applied, revealing a good fit for RM (p = 0.6), and a best fit for 2PL.
The Test Information Curve (Fig. 6a) shows the presence of several highly discriminating items around b = −1 and b = 0. The Item Characteristic Curves (Fig. 6b) confirm that there are three (SP.6, SP.8, and SP.10; a > 31). However, they also show that two items (SP.3, and
(a) TIC of SPs under 2PL (b) ICCs of SPs under 2PL
Fig. 6: Test Information Curve (a) and Item Characteristic Curves (b) of the score dataset of the scatterplot test under the two-parameter logistic model. The TIC (a) shows that there are several highly discriminating items, which is confirmed by the very steep curves in (b). In addition, (b) shows that there are also two poorly discriminating items, represented by the very gradual slopes of items SP.3 and SP.11.
(a) TIC of the subset of SPs under 2PL (b) ICCs of the subset of SPs under 2PL
Fig. 7: Test Information Curve (a) and Item Characteristic Curves (b) of the subset of the score dataset of the scatterplot test under the two-parameter logistic model. The subset was obtained by removing the poorly discriminating items shown in Fig. 6b.
SP.11) have quite low discrimination values (a < 0.6). Here, we set a threshold of a > 0.8. Thus, items SP.3 and SP.11 were removed. The resulting subset of 10 items' scores was fitted once again. RM fitted well (p = 0.69), and 2PL fitted best. The different curves of the subset are plotted in Fig. 7. They show a good amount of information for abilities that are slightly below average (Fig. 7a), which indicates that the subset of SP is once again best suited for examinees with relatively low abilities.
5.2 Assessment Phase
Here again, we inspected the Turkers' ability scores. Only the items retained at the end of the design phase were used.
BC revealed 21 different ability scores, ranging from −1.75 to 0.99. The distribution of these scores was near-normal, with a slight bump around −1.5, and the mean was −0.39. However, only 14.3% of participants were above average.
SP revealed 23 different ability scores, ranging from −1.72 to 0.72. However, the distribution here was not normal. 43.5% of participants were above average, and the median was −0.14.
These results show that the majority of recruited Turkers had somewhat below average levels of visualization literacy for bar charts and scatterplots. The very low percentage of Turkers above average in BC led us to reconsider the removal of items BC.1 and BC.2, as they were not truly problematic. After reintegrating them into the test scores, 21 ability scores were observed, ranging from −1.67 to 0.99, and 42.8% of participants were above average. This seemed more convincing. However, this important difference illustrates the relativity of these values, and shows how important it is to properly calibrate the tests during the design phase.
Finally, we did not attempt to equate these tests, since—unlike LG1
and LG2—we ran them independently without any overlapping items.
To have a fully comprehensive test, i. e., a generic test for visualization
literacy, intermediate tests are required where the stimulus itself is a
varying factor. If such tests prove to be coherent (i. e., if IRT models
ﬁt the results), then it should be possible to assert that VL is a general
trait that allows one to understand any kind of graphical representation.
Although we believe that this ability varies with exposure and habit of
use, a study to conﬁrm it is outside of the scope of this paper.
If these tests are to be used as practical ways of assessing VL, the
process must be sped up, both in the administration of the tests and in
the analysis of the results. While IRT provides useful information on
the quality of tests and on the ability of those who take them, it is quite
costly, both in time and in computation. This must be changed.
In this section, we present a way in which the tests we have developed in the previous sections can be optimized to be faster, while still maintaining their effectiveness.
6.1 Test Administration Time
As we have seen, several items can be removed from the tests, while
keeping good psychometric quality. However, this should be done
carefully, as some of these items may provide useful information (like
in the case of BC.1 and BC.2, Sect. 5.2).
We first removed LG1.11 from LG1, as its discrimination value was < 0.8 (see Sect. 5.1.4). We then inspected items with identical difficulty and discrimination characteristics, represented by overlapping ICCs (see Fig. 3). These were prime candidates for removal, since they provide only redundant information. There was one group of overlapping items in LG1 ([LG1.1, LG1.4, LG1.9]), and two in LG2 ([LG2.1, LG2.3], [LG2.2, LG2.7]). For each group, we kept only one item. Thus LG1.1, LG1.4, LG2.3, and LG2.7 were dropped.
We reintegrated items BC.1 and BC.2 to BC, as they proved to have
a big impact on ability scores (Sect. 5.2). The subset of SP created at
the end of the design phase was kept, and no extra items were removed.
RM was ﬁtted to the newly created subsets of LG1, LG2, and BC;
the goodness-of-ﬁt test showed acceptable ﬁts for all (p>0.69 for
LG1, and p>0.3 for both LG2 and BC). 2PL ﬁtted best for LG1, and
RM ﬁtted best for both LG2 and BC.
We conducted a post-hoc analysis to see whether the number of item repetitions could be reduced (first to three, then to one). Results showed that RM fitted all score datasets using three repetitions. However, several examinees had lower scores. In addition, while BC and SP showed similar amounts of information for the same ability levels, the four very easy items in BC (i.e., BC.3, BC.7, BC.8, and BC.9) were no longer problematic. This suggests that several participants did not get a score of 1 for these items, and confirms that, for some examinees, more repetitions are needed. Results for one-repetition tests showed that RM no longer fitted the scores of BC, suggesting that unique repetitions are noisy. Therefore, we decided to keep the five repetitions.
In the end, the redesign of LG1 contained 9 items (with a 10 min completion time), the redesigns of LG2 and SP contained 10 items (11 min), and the redesign of BC contained 8 items (9 min).
6.2 Analysis Time and Computational Power
To speed up the analysis, we ﬁrst considered setting up the procedure
we had used in R on a server. However, this solution would have
required a lot of computational power, so we dropped it.
Instead, we chose to tabulate all possible ability scores for each test. An interesting feature of IRT modeling is that it can derive ability scores from unobserved response patterns (i.e., patterns that do not exist in the empirical data), as well as from partial response patterns (i.e., patterns with missing values). Consequently, we generated all the 2^(n_i) − 1 possible patterns for each test, where n_i is the number of items in a test. This resulted in 511 patterns for LG1, 1023 for both LG2 and SP, and 255 for BC. We then derived the different ability scores that could be obtained in each test.
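Enumerating the dichotomous response patterns is straightforward; a Python sketch (the paper's counts of 2^(n_i) − 1 imply that one of the 2^(n_i) enumerated patterns is excluded):

```python
from itertools import product

def response_patterns(n_items):
    """Enumerate every dichotomous response pattern for a test with
    n_items items: 2**n_items tuples of 0s and 1s in total (the
    tabulation keeps 2**n_items - 1 of them)."""
    return list(product((0, 1), repeat=n_items))

patterns_bc = response_patterns(8)  # the redesigned BC test has 8 items
```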
To ensure that removing certain test items did not greatly affect the ability scores, we computed all the scores for the full LG1 and LG2 tests, and compared them to the ones previously obtained. We found some small differences in the upper and lower bounds of ability, but these were considered negligible, since our tests were not designed for fine distinction between very low or very high abilities. We
also tested the impact of refitting the IRT models after item removal. For this, we repeated the procedure using partial response patterns for LG1 and LG2, i.e., we replaced the dichotomous response values for the items considered for removal by not available (NA) values. The scores were exactly the same as the ones obtained with our already shortened and refitted tests, which shows they can be trusted.
Finally, we integrated all ability scores and their corresponding response patterns into the web-based, shortened versions of the tests, to make them readily available. This way, by administering our online tests, researchers can have direct access to participants' levels of visualization literacy. Informed decisions can then be made as to whether to keep these people for further studies or not. All four tests are accessible at http://peopleviz.gforge.inria.fr/trunk/
As the preceding sections have shown, we have developed and validated a fast and effective method for assessing visualization literacy. This section summarizes the major steps, written in the form of easy-to-follow guidelines.
7.1 Initial Design
1. Pay careful attention to the design of all 3 components of a test item, i.e., stimulus, task, and question. Each can influence item difficulty, and too much variation may fail to produce a coherent test, as was seen in our pilot studies.
2. Repeat each item several times. We did 5 repetitions + 1 “question comprehension” condition for each item. This is important, as repeated trials provide more robust measures. Ultimately, it may be feasible to reduce the number of repetitions to 3⁵, although our results show that this can be problematic (Sect. 6.1).
3. Use a different (and ideally, non-graphical) representation for question comprehension. We chose a table condition. While our present study did not focus on proving its essentialness, we believe that this attribute is important.
4. Randomize the order of items and of repetitions. This is common practice in experiment design, having the benefit of preventing carryover effects.
5. Once the results are in, sort the data according to item and repetition ID, remove the data for the question comprehension condition, and encode examinees' scores in a dichotomous way, i.e., 1 for correct answers and 0 for incorrect answers.
6. Calculate the mean score for all repetitions of an item and round the result. This will give a finer estimate of the examinee's ability, since it erases one-time errors which may be due to lack of attention or to clicking on the wrong answer by mistake.
7. Begin model fitting with the Rasch model. RM is the simplest variant of IRT models. If it does not fit the data, other variants will not either. Then check the fit of the model. Here we used a 200-sample parametric bootstrap goodness-of-fit test using Pearson's χ² statistic. To reveal an acceptable fit, the returned p-value should not be statistically significant (p > 0.05). In some cases (like in our pilot studies), the model may not fit. Options here are to inspect the χ² p-values for pairwise associations, or the two- and three-way χ² residuals, to find problematic items⁶.
8. Determine which IRT model variant best fits the data. A series of pairwise likelihood ratio tests can be used for this. If several models fit, it is usually good to go with the model that fits best. Our experience showed that such models were most often RM and 2PL.
9. Identify potentially useless items. In our examples of LG1 and SP, certain items had low discrimination characteristics. These are not very effective for separating ability levels, and can be removed. In cases like the one for BC, items may also simply be too easy. Before removing them permanently, however, it is advised to check their impact on ability scores. Finally, it is important that the model be refitted at this stage (reproducing steps 7 and 8), as removing these items may affect examinee and item characteristics.
⁵ The number of repetitions should be odd, so as to not end up with a mean score of 0.5 for an item.
⁶ For more information, refer to .
7.2 Final Design
10. Identify overlapping items and remove them. If the goal is
to design a short test, such items can safely be removed, as they
provide only redundant information (see Sect. 6.1).
11. Generate all 2^(n_i) − 1 possible score patterns, where n_i is the number of retained items in the test. These patterns represent series of dichotomous response values for each test item.
12. Derive the ability scores from the model, using the patterns of
responses generated in step 11. These scores represent the range
of visualization literacy levels that the test can assess.
13. Integrate the ability scores into the test to make fast, effective,
and scalable estimates of people’s visualization literacy.
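Step 12 can be sketched without the full R machinery: given the fitted item parameters, the ability score for a response pattern is the θ that maximizes the likelihood of that pattern. A grid-search approximation under 2PL (illustrative parameters; the ltm package uses more refined estimators):

```python
from math import exp, log

def log_likelihood(theta, pattern, items):
    """Log-likelihood of a dichotomous response pattern under 2PL,
    with items given as (discrimination a, difficulty b) pairs."""
    ll = 0.0
    for x, (a, b) in zip(pattern, items):
        p = 1.0 / (1.0 + exp(-a * (theta - b)))
        ll += log(p) if x == 1 else log(1.0 - p)
    return ll

def ability_score(pattern, items, grid=None):
    """Maximum-likelihood theta over a coarse grid from -4 to 4."""
    if grid is None:
        grid = [t / 100.0 for t in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, pattern, items))

# Four hypothetical Rasch-like items of increasing difficulty:
items = [(1.0, -2.0), (1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
theta_hat = ability_score((1, 1, 1, 0), items)
```

Tabulating `ability_score` over every generated pattern yields the lookup table that the shortened web-based tests use for immediate scoring.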
8 Conclusions and Future Work
In this paper, we have developed a method for assessing visualization literacy, based on a principled set of considerations. In particular, we used Item Response Theory to allow a separation of the effects of item difficulty and examinee ability. Our motivation was to make a series of fast, effective, and reliable tests which researchers could use to detect participants with low VL abilities before conducting online studies. We have shown how these tests can be tailored to get immediate estimates of examinees' levels of VL.
We intend to continue developing this approach, as well as to examine the suitability of this method for other kinds of representation (e.g., parallel coordinates, node-link diagrams, star plots, etc.), and possibly for other purposes. For example, in contexts like a classroom evaluation, the tests could be longer, and broader assessments of visualization literacy could be made. This would imply further exploration of the design parameters proposed in Section 3. Evaluating the impact of these parameters on item difficulty would also be interesting.
Finally, we acknowledge that this work is but a small step into the realm of visualization literacy. As such, we have made our tests available on GitHub for versioning. Ultimately, we hope that this will serve as a foundation for further research into VL.
This work was funded by a Google Research Award, granted for a
project called “Data Visualization for the People”.
 https://github.com/INRIA/Visualization-Literacy- 101.
 D. Abilock. Visual information literacy: Reading a documentary photo-
graph. Knowledge Quest, 36(3), January–February 2008.
J. Baer, M. Kutner, and J. Sabatini. Basic reading skills and the literacy
of America's least literate adults: Results from the 2003 National
Assessment of Adult Literacy (NAAL) Supplemental Studies. Technical report,
National Center for Education Statistics (NCES), February 2009.
J. Bertin and M. Barbut. Sémiologie Graphique. Mouton, 1973.
P. Bobko and R. Karren. The perception of Pearson product-moment
correlations from bivariate scatterplots. Personnel Psychology,
 M. T. Brannick. Item response theory. http://luna.cas.usf.
 V. J. Bristor and S. V. Drake. Linking the language arts and content areas
through visual technology. T.H.E. Journal, 22(2):74–77, 1994.
 P. Carpenter and P. Shah. A model of the perceptual and conceptual pro-
cesses in graph comprehension. Journal of Experimental Psychology:
Applied, 4(2):75–100, 1998.
 W. S. Cleveland, P. Diaconis, and R. McGill. Variables on scatterplots
look more highly correlated when the scales are increased. Science,
 Cognitive testing interview guide. http://www.cdc.gov/nchs/
 M. Correll, D. Albers, S. Franconeri, and M. Gleicher. Comparing aver-
ages in time series data. In Proceedings of the 2012 ACM annual confer-
ence on Human Factors in Computing Systems, pages 1095–1104. ACM,
F. R. Curcio. Comprehension of mathematical relationships expressed in
graphs. Journal for Research in Mathematics Education, pages 382–393,
 Department for Education. Numeracy skills tests: Bar charts.
 R. C. Fraley, N. G. Waller, and K. A. Brennan. An item response theory
analysis of self-report measures of adult attachment. Journal of person-
ality and social psychology, 78(2):350, 2000.
 E. Freedman and P. Shah. Toward a model of knowledge-based graph
comprehension. In M. Hegarty, B. Meyer, and N. Narayanan, editors, Di-
agrammatic Representation and Inference, volume 2317 of Lecture Notes
in Computer Science, pages 18–30. Springer Berlin Heidelberg, 2002.
S. N. Friel, F. R. Curcio, and G. W. Bright. Making sense of graphs:
Critical factors influencing comprehension and instructional implications.
Journal for Research in Mathematics Education, 32(2):124–158, 2001.
 A. C. Graesser, S. S. Swamer, W. B. Baggett, and M. A. Sell. New models
of deep comprehension. Models of understanding text, pages 1–32, 1996.
 Graph design i.q. test. http://perceptualedge.com/files/
 J. Heer and M. Bostock. Crowdsourcing graphical perception: Using
mechanical turk to assess visualization design. In Proceedings of the
SIGCHI Conference on Human Factors in Computing Systems, CHI ’10,
pages 203–212, New York, NY, USA, 2010. ACM.
 How to read a bar chart. http://www.quizrevolution.com/
P. Isenberg, A. Bezerianos, P. Dragicevic, and J. Fekete. A study on dual-
scale data charts. IEEE Transactions on Visualization and Computer
Graphics, 17(12):2469–2478, Dec. 2011.
ACRL. Presidential committee on information literacy: Final report.
Online publication, 1989.
 I. Kirsch. The international adult literacy survey (ials): Understanding
what was measured. Technical report, Educational Testing Service, De-
N. Knutson, K. S. Akers, and K. D. Bradley. Applying the Rasch model
to measure first-year students' perceptions of college academic readiness.
Paper presented at the MWERA Annual Meeting,
 R. Lowe. “Reading” scientiﬁc diagrams: Characterising components
of skilled performance. Research in Science Education, 18(1):112–122,
R. Lowe. Scientific Diagrams: How Well Can Students Read Them? What
Research Says to the Science and Mathematics Teacher, No. 3. Distributed
by ERIC Clearinghouse, Washington, D.C., 1989.
 J. Meyer, M. Taieb, and I. Flascher. Correlation estimates as perceptual
judgments. Journal of Experimental Psychology: Applied, 3(1):3, 1997.
E. Miller. The Miller word-identification assessment. http://www.
 National Cancer Institute. Item response theory model-
 National Center for Education Statistics. Adult literacy and lifeskills sur-
 Numerical reasoning - table/graph. http://www.jobtestprep.
 Numerical reasoning online. http://
 J. Oberholtzer. Why two charts make me feel like an expert on Portugal
 OECD. Pisa 2012 assessment and analytical framework. Technical re-
port, OECD, 2012.
OTA. Computerized Manufacturing Automation: Employment, Education,
and the Workplace. United States Office of Technology Assessment,
 S. Pinker. A theory of graph comprehension, pages 73–126. Lawrence
Erlbaum Associates, Hillsdale, NJ, 1990.
I. Pollack. Identification of visual correlational scatterplots. Journal of
Experimental Psychology, 59(6):351, 1960.
R. Ratwani and J. G. Trafton. Shedding light on the graph schema:
Perceptual features versus invariant structure. Psychonomic Bulletin and
Review, 15(4):757–762, 2008.
 R. A. Rensink and G. Baldridge. The perception of correlation in scatter-
plots. Comput. Graph. Forum, 29(3):1203–1210, 2010.
 D. Rizopoulos. ltm: An R package for latent variable modeling and Item
Response Analysis. Journal of Statistical Software, 17(5):1–25, 11 2006.
 RUMMLaboratory. Rasch analysis. http://www.
 P. Shah. A model of the cognitive and perceptual processes in graphical
display comprehension. Reasoning with diagrammatic representations,
pages 94–101, 1997.
G. Crowther. The Crowther Report, volume 1. Her Majesty's Stationery
Office, 1959.
 C. Taylor. New kinds of literacy, and the world of visual information.
 J. G. Trafton, S. P. Marshall, F. Mintz, and S. B. Trickett. Extracting
explicit and implicit information from complex visualizations. Diagrams,
pages 206–220, 2002.
 S. Trickett and J. Trafton. Toward a comprehensive model of graph com-
prehension: Making the case for spatial cognition. In D. Barker-Plummer,
R. Cox, and N. Swoboda, editors, Diagrammatic Representation and In-
ference, volume 4045 of Lecture Notes in Computer Science, pages 286–
300. Springer Berlin Heidelberg, 2006.
B. Tversky. Semantics, syntax, and pragmatics of graphics, pages 141–
158. Lund University Press, Lund, 2004.
 UNESCO. Literacy assessment and monitoring programme.
 University of Kent. Numerical reasoning test. http://www.kent.
 H. Wainer. A test of graphicacy in children. Applied Psychological Mea-
surement, 4(3):331–340, 1980.
What is media literacy? A definition... and more.
M. Wu and R. Adams. Applying the Rasch Model to Psycho-social
Measurement: A Practical Approach. Educational Measurement Solutions,
Melbourne, Vic., 2007.