1077-2626 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
A Principled Way of Assessing Visualization Literacy
Jeremy Boy, Ronald A. Rensink, Enrico Bertini, and Jean-Daniel Fekete Senior Member, IEEE
Abstract— We describe a method for assessing the visualization literacy (VL) of a user. Assessing how well people understand
visualizations has great value for research (e.g., to avoid confounds), for design (e. g., to best determine the capabilities of an
audience), for teaching (e. g., to assess the level of new students), and for recruiting (e. g., to assess the level of interviewees). This
paper proposes a method for assessing VL based on Item Response Theory. It describes the design and evaluation of two VL tests
for line graphs, and presents the extension of the method to bar charts and scatterplots. Finally, it discusses the reimplementation of
these tests for fast, effective, and scalable web-based use.
Index Terms—Literacy, Visualization literacy, Rasch Model, Item Response Theory
1 Introduction
In April 2012, Jason Oberholtzer posted an article describing two
charts that portray Portuguese historical, political, and economic
data [33]. While acknowledging that he is not an expert on those top-
ics, Oberholtzer claims that thanks to the charts, he feels like he has “a
well-founded opinion on the country.” He attributes this to the simplic-
ity and efficacy of the charts. He then concludes by stating: “Here’s
the beauty of charts. We all get it, right?”
But do we all really get it? Although the number of people famil-
iar with visualization continues to grow, it is still difficult to estimate
anyone’s ability to read graphs and charts. When designing a visual-
ization for non-specialists or when conducting an evaluation of a new
visualization system, it is important to be able to pull apart the po-
tential efficiency of the visualization and the actual ability of users to
understand it.
In this paper, we address this issue by creating a set of visualiza-
tion literacy (VL) tests for line graphs, bar charts, and scatterplots.
At this point, we loosely define visualization literacy as the ability to
use well-established data visualizations (e. g., line graphs) to handle
information in an effective, efficient, and confident manner.
To generate these tests, we develop here a method based on Item
Response Theory (IRT). Traditionally, IRT has been used to assess ex-
aminees’ abilities via predefined tests and surveys in areas such as ed-
ucation [24], social sciences [14], and medicine [29]. Our method uses
IRT in two ways: first, in a design phase, we evaluate the relevance of
potential test items; and second, in an assessment phase, we measure
users’ abilities to extract information from graphical representations.
Based on these measures, we then develop a series of tests for fast, ef-
fective, and scalable web-based use. The great benefit of this method
is that it inherits IRT's property of making ability assessments that are
based not only on raw scores, but on a model that captures the stand-
ing of users on a latent trait (e. g., the ability to use various graphical
representations).
As such, our main contributions are as follows:
- a useful definition of visualization literacy;
- a method for: 1) assessing the relevance of visualization literacy test items, 2) assessing an examinee's level of VL, and 3) creating fast and effective assessments of VL for well-established visualization techniques and tasks; and
- an implementation of four online tests, based on our method.

Jeremy Boy is with Inria, Telecom ParisTech, and EnsadLab.
Ronald A. Rensink is with the University of British Columbia.
Enrico Bertini is with NYU Polytechnic School of Engineering.
Jean-Daniel Fekete is with Inria.
Our immediate motivation for this work is to design a series of tests
that can help Information Visualization (InfoVis) researchers detect
low-ability participants when conducting online studies, in order to
avoid possible confounds in their data. This requires the tests to be
short, reliable, and easy to administer. However, such tests can also be
applied to many other situations, such as:
- designers who want to know how capable their targeted audience is of understanding visualizations;
- teachers who want to assess the acquired knowledge of freshmen;
- practitioners who need to hire capable analysts; and
- education policy-makers who may want to set a standard for visualization literacy.
This paper is organized in the following way. It begins with a back-
ground section that defines the concept of literacy and discusses some
of its best-known forms. Also introduced are the theoretical constructs
of information comprehension and graph comprehension, along with
the concepts behind Item Response Theory. Next, Section 3 presents
the basic elements of our approach. Section 4 shows how these can be
used to create and administer two VL tests using line graphs. In Sec-
tion 5, our method is extended to bar charts and scatterplots. Section 6
describes how our method can be used to redesign fast, effective, and
scalable web-based tests. Finally, Section 7 provides a set of “take-
away” guidelines for the development of future tests.
2 Background
Very few studies investigate the ability of a user to extract information
from a graphical representation such as a line graph or a bar chart. And
of those that do, most make only higher-level assessments: they use
such representations as a way to test mathematical skills, or the ability
to handle uncertainty [13, 31, 32, 34, 49]. A few attempts do focus
more on the interpretation of graphically-represented quantities [18,
20], but they base their assessments only on raw scores and limited
test items. This makes it difficult to create a true measure of VL.
2.1 Literacy
2.1.1 Definition
The online Oxford dictionary defines literacy as “the ability to read
and write”. While historically this term has been closely tied to its
textual dimension, it has grown to become a broader concept. Taylor
proposes the following: “Literacy is a gateway skill that opens to the
potential for new learning and understanding” [44].
Given this broader understanding, other forms of literacy can be dis-
tinguished. For example, numeracy was coined to describe the skills
needed for reasoning and applying simple numerical concepts. It was
intended to “represent the mirror image of [textual] literacy” [43, p.
269]. Like [textual] literacy, numeracy is a gateway skill.

Manuscript received 31 Mar. 2014; accepted 1 Aug. 2014; date of publication 11 Aug. 2014; date of current version 9 Nov. 2014.
Digital Object Identifier 10.1109/TVCG.2014.2346984
With the advent of the Information Age, several new forms of liter-
acy have emerged. Computer literacy “refers to basic keyboard skills,
plus a working knowledge of how computer systems operate and of the
general ways in which computers can be used” [35]. Information liter-
acy is defined as the ability to “recognize when information is needed”,
and “the ability to locate, evaluate, and use effectively the needed in-
formation” [22]. Media literacy commonly relates to the “ability to
access, analyze, evaluate and create media in a variety of forms” [51].
2.1.2 Higher-level Comprehension
In order to develop a meaningful measure of any form of literacy, it is
necessary to understand the various components involved, starting at
the higher levels. Friel et al. [16] suggest that comprehension of in-
formation in written form involves three kinds of tasks: locating, inte-
grating, and generating information. Locating tasks require the reader
to find a piece of information based on given cues. Integrating tasks
require the reader to aggregate several pieces of information. Generat-
ing tasks not only require the reader to process given information but
also require the reader to make document-based inferences or to draw
on personal knowledge.
Another important aspect of information comprehension is question
asking, or question posing. Graesser et al. [17] posit that question
posing is a major factor in text comprehension. Indeed, the ability to
pose low-level questions, i. e., to identify a series of low-level tasks,
is essential for information retrieval and for achieving higher-level, or
deeper, goals.
2.1.3 Assessment
Several literacy tests are currently in common use. The two most im-
portant are the UNESCO’s Literacy Assessment and Monitoring Pro-
gramme (LAMP) [48], and the OECD’s Programme for International
Student Assessment (PISA) [34]. Other international assessments in-
clude the Adult Literacy and Lifeskills Survey (ALL) [30], the In-
ternational Adult Literacy Survey (IALS) [23], and the Miller Word
Identification Assessment (MWIA) [28].
Assessments are also made using more local scales like the US Na-
tional Assessment of Adult Literacy (NAAL) [3], the UK’s Depart-
ment for Education Numeracy Skills Tests [13], or the University of
Kent’s Numerical Reasoning Test [49].
Most of these tests, however, take basic literacy skills for granted,
and focus on higher-level assessments. For example, the PISA test
is designed for 15-year-olds who are finishing compulsory education.
This implies that examinees should have already learned—and still
remember—the basic skills required for reading and counting. It is
only when examinees clearly fail these tests that certain measures are
deployed to evaluate the lower-level skills.
NAAL provides a set of 2 complementary tests for examinees who
fail the main textual literacy test [3]: the Fluency Addition to NAAL
(FAN) and The Adult Literacy Supplemental Assessment (ALSA).
These focus on adults’ ability to read single words and small passages.
Meanwhile, MWIA tests whole-word dyslexia. It has 2 levels, each
of which contains 2 lists of words, one Holistic and one Phonetic, that
examinees are asked to read aloud. Evaluation is based on time spent
reading and number of words missed. Proficient readers should find
such tests extremely easy, while low ability readers should find them
more challenging.
2.2 Visualization Literacy
2.2.1 Definition
The view of literacy as a gateway skill can also be applied to the ex-
traction and manipulation of information from graphical representa-
tions. In particular, it can be the basis for what we will refer to as
visualization literacy (VL): the ability to confidently use a given data
visualization to translate questions specified in the data domain into
visual queries in the visual domain, as well as interpreting visual pat-
terns in the visual domain as properties in the data domain.
This definition is related to several others that have been proposed
concerning visual messages. For example, a long-standing and of-
ten neglected concept is visual literacy. This has been defined as the
“ability to understand, interpret and evaluate visual messages” [7]. Vi-
sual literacy is rooted in semiotics, i. e., the study of signs and sign
processes, which distinguishes it from visualization literacy. While it
has probably been the most important form of literacy to date, it is
nowadays frowned upon, and general literacy tests do not take it into account.
Taylor [44] has advocated for the study of visual information lit-
eracy, while Wainer has advocated for graphicacy [50]. Depending
on the context, these terms refer to the ability to read charts and di-
agrams, or to qualify the merging of visual and information literacy
teaching [2]. Because of this ambiguity, we prefer the more general
term “visualization literacy”.
2.2.2 Higher-level Comprehension
Bertin [4] proposed three levels on which a graph may be interpreted:
elementary, intermediate, and comprehensive. The elementary level
concerns the simple extraction of information from the data. The inter-
mediate level concerns the detection of trends and relationships. The
comprehensive level concerns the comparison of whole structures, and
inferences based on both data and background knowledge. Similarly,
Curcio [12] distinguishes three ways of reading from a graph: from
the data, between the data, and beyond the data¹.
The higher-level cognitive processes behind the reading of graphs
have been the concern of the area of graph comprehension. This
area studies the specific expectations viewers have for different graph
types [47], and has highlighted many differences in the understanding
of novices and expert viewers [15, 25, 26, 45].
Several influential models of graph comprehension have been pro-
posed. For example, Pinker [36] describes a three-way interaction be-
tween the visual features of a display, processes of perceptual organi-
zation, and what he calls the graph schema, which directs the search
for information in the particular graph. Several other models are simi-
lar (see Trickett and Trafton [46]). All involve the following steps:
1. the user has a pre-specified goal to extract a specific piece of information;
2. the user looks at the graph, and the graph schema and gestalt processes are activated;
3. the salient features of the graph are encoded, based on these gestalt principles;
4. the user now knows which cognitive/interpretative strategies to use, because the graph is familiar;
5. the user extracts the necessary goal-directed visual chunks;
6. the user may compare 2 or more visual chunks;
7. the user extracts the relevant information to satisfy the goal.
The “visual chunking” mentioned above consists in segmenting a vi-
sual display into smaller parts, or chunks [25]. Each chunk represents
a set of entities that have been grouped according to gestalt principles.
Chunks can in turn be subdivided into smaller chunks.
Shah [42] identifies two cognitive processes that occur during
stages 2 through 6 of this model:
1. a top-down process where the viewer’s prior knowledge of se-
mantic content influences data interpretation, and
2. a bottom-up process where the viewer shifts from perceptual pro-
cesses to interpretation.
These processes are then interactively applied to various chunks,
which suggests that data interpretation is a serial and incremental pro-
cess. However, Carpenter & Shah [8] have shown that graph compre-
hension, and more specifically visual feature encoding, is more of an
iterative process than a straight-forward serial one.
¹For further reference, refer to Friel et al.'s Taxonomy of Skills Required for Answering Questions at Each Level [16].
Freedman & Shah [15] relate the top-down and bottom-up pro-
cesses to a construction and an integration phase, respectively. Dur-
ing the construction phase, the viewer activates prior graphical knowl-
edge, i. e., the graph schema, and domain knowledge to construct a co-
herent conceptual representation of the available information. During
the integration phase, disparate knowledge is activated by “reading”
the graph and is combined to form a coherent representation. These
two phases take place in alternating cycles. This suggests that do-
main knowledge can influence the interpretation of graphs. However,
highly visualization-literate people should be less influenced by both
the top-down and bottom-up processes [42].
2.2.3 Assessment
Relatively little has been done on the assessment of literacy involving
graphical representations. However, interesting work has been done on
measuring the perceptual abilities of a user to extract information from
these. For example, various studies have demonstrated that users can
perceive slope, curvature, dimensionality, and continuity in line graphs
(see [11]). Correll et al. [11] have also shown that users can make
judgements about aggregate properties of data using these graphs.
Scatterplots have also received some attention. For example, studies
have examined the ability of a user to determine the Pearson correlation
r [5, 9, 27, 37, 39]. Several interesting results have been obtained,
such as a general tendency to underestimate correlation, especially in
the range 0.2 < |r| < 0.6, and an almost complete failure to perceive
correlation when |r| < 0.2.
Concerning the outright assessment of literacy, the only relevant re-
search work we know of is Wainer’s study on the difference in graph-
icacy levels between third-, fourth-, and fifth-grade children [50]. He
presents the design of an 8-item test using several visualizations, in-
cluding line graphs and bar charts. He then describes his use of Item
Response Theory [52] to score the test results, and shows the effec-
tiveness of this method for assessing abilities. His conclusion is that
children reach “adult levels of graphicacy” as early as the fourth grade,
leaving “little room for further improvement.” However, it is unclear
what these “adult levels” are. If we look at textual literacy, some chil-
dren are more literate than certain adults. People may also forget these
skills if they do not regularly practice. Thus, while very useful, we
consider Wainer’s work to be limited. What is needed is a way to
assess adult levels of visualization literacy.
2.3 Item Response Theory and the Rasch Model
Consider what we would like in an effective VL test. To begin with,
it should cover a certain range of abilities, each of which could be
measured by specific scores. Imagine such a test has 10 items, which
are marked 1 when answered correctly, and 0 otherwise. Rob takes the
test and gets a score of 2. Jenny also takes the test, and gets a score
of 7. We would hope that this means that Jenny is better than Rob at
reading graphs. In addition, we would expect that if Rob and Jenny
were to take the test again, both would get approximately the same
scores, or at least that Jenny would still get a higher score than Rob.
We would also expect that whatever VL test Rob and Jenny both take,
Jenny will always be better than Rob.
Now imagine that Chris takes the test and also gets a score of 2. If
we based our judgement on this raw score, we would assume that Chris
is as bad as Rob at reading graphs. However, taking a closer look at the
items that Chris and Rob got right, we realize that they are different:
Rob gave correct answers to the two easiest items, while Chris gave
correct answers to two relatively complex items. This would of course
require us to know the level of difficulty of each item, and would mean
that while Chris gave incorrect answers to the easy items, he might still
show some ability to read graphs. Thus, we would want the different
scores to have “meanings” to help us determine whether Chris was
simply lucky (he guessed the answers), or whether he is in fact able to
get the simpler items right, even though he didn’t this time.
Imagine now that Rob, Jenny, and Chris take a second VL test. Rob
gets a score of 3, Chris gets 4, and Jenny gets 10. We would infer that
this test is easier, since the scores are higher. However, looking at the
score intervals, we see that Jenny is 7 points ahead of Rob, whereas she
was only 5 points ahead in the first test. If we were to truly measure
abilities, we would want these intervals to be invariant. In addition,
seeing that Chris’ score is once again similar to Rob’s (knowing that
they both got the same items right this time) would lead us to think
that they do in fact have similar abilities. We could then conclude that
this test provides more information on lower abilities than the first one,
since it is able to separate Rob's and Chris' scores.
Finally, imagine that all three examinees take a third test, and all
get a score of 10. While we might be tempted to conclude that this test
is VL-agnostic, it may simply be that its items are too easy, and not
sufficiently discriminant.
One way of fulfilling all of these requirements is by using Item
Response Theory (IRT) [52]. This is a model-based approach that does
not use response data directly, but transforms them into estimates of a
latent trait (e. g., ability), which then serves as the basis of assessment.
IRT models have been applied to tests in a variety of fields such as
health studies, education, psychology, marketing, economics, social
sciences (see [41]), and even graphicacy [50].
The core idea of IRT is that the performance of an examinee de-
pends on both the examinee’s ability and the item’s difficulty; the goal
is then to separate out these two factors. An important aspect of the
approach is to project them onto the same scale—that of the latent
trait. Ability, or standing on the latent trait, is derived from a pattern of
responses to a series of test items; item difficulty is then defined by the
0.5 probability of success of an examinee with the appropriate ability.
For example, an examinee with an ability value of 0 (0 corresponding
to an average achiever) will have a 50% probability of giving a cor-
rect answer to an item with a difficulty value of 0, corresponding to an
average level of difficulty.
IRT offers models for data that are dichotomous (e. g., true/false
responses) and polytomous (e. g., responses on likert-like scales). In
this paper, we focus on models for dichotomous data. These define the
probability of success on an item i by the function:

P_i(θ) = c_i + (1 − c_i) / (1 + e^(−a_i(θ − b_i)))

where θ is an examinee's standing on a latent trait (i. e., his or her ability),
and a_i, b_i, and c_i are the characteristics of the item. The central
characteristic is b, the difficulty characteristic; if θ = b, the examinee
has a 0.5 probability of giving a correct answer to the item. Mean-
while, a is the discrimination characteristic. An item with a very high
discrimination value basically sets a sharp threshold at θ = b: exami-
nees with θ < b have a probability of success close to 0, and examinees
with θ > b have a probability of success close to 1. Conversely, an item
with a low discrimination value cannot clearly separate examinees. Finally,
c is the guessing characteristic. It sets a lower bound on the probability
of success, accounting for the extent to which an examinee can guess an
answer. We have found c to be unhelpful, so we have set it to zero (no
guessing) for our development.
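The item response function for dichotomous items can be sketched in a few lines. This is the standard three-parameter logistic (3PL) form; the θ and parameter values in the examples are illustrative only, not values from our tests.

```python
import math

def p_correct(theta, a, b, c=0.0):
    """Probability that an examinee with ability `theta` succeeds on an item
    with discrimination `a`, difficulty `b`, and guessing characteristic `c`."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# If theta == b, the examinee has a 0.5 probability of success.
print(p_correct(theta=0.0, a=1.0, b=0.0))   # 0.5

# A very high discrimination value turns the item into a near-threshold at
# theta = b: just below b, success is almost impossible; just above, almost sure.
print(p_correct(theta=-0.5, a=20.0, b=0.0))  # ~0.0000454
print(p_correct(theta=0.5, a=20.0, b=0.0))   # ~0.9999546

# The guessing characteristic c sets a lower bound on success probability,
# even for examinees of very low ability.
print(p_correct(theta=-5.0, a=1.0, b=0.0, c=0.25))  # ~0.255
```

Note how the latent trait θ and the difficulty b live on the same scale, which is what lets abilities and items be compared directly.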
Note that the value of each characteristic is not absolute for a given
item: it is relative to the latent trait that the test is attempting to un-
cover. Therefore, it cannot be expected that the characteristics of iden-
tical items be exactly the same in different tests. For example, consider
a simple numeracy test with two items: 10 +20 (item 1) and 17+86
(item 2). It should be assumed that item 1 is easier than item 2. In other
words, the difficulty characteristic of item 2 should be higher than that
of item 1. Now if we add another item to the test, say 51 ×93 (item 3),
the most difficult item in the previous version of the test (item 2) will
no longer seem so difficult. However, it should still be more difficult
than item 1. Thus, while individual characteristics may vary, the gen-
eral order of difficulty should be preserved. The same goes for ability
values (or ability scores). If they are to be compared between different
tests, the measured latent trait must be the same.
Various IRT models for dichotomous data have been proposed. One
is the one-parameter logistic model (1PL), which sets a to a specific
value for all items, sets c to zero, and only considers the variation of b.
Another is the two-parameter logistic model (2PL), which considers
the variations of a and b, and sets c to zero. A third is the three-
parameter logistic model (3PL), which considers variations of a, b,
and c [6]. As such, 1PL and 2PL can be regarded as special cases of
3PL, where different item characteristics are assigned specific values.
A last variant is the Rasch model (RM), which is a special case of 1PL,
where a = 1².
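Since the variants differ only in which characteristics are left free, they can all be expressed as special cases of one function. A minimal sketch (the shared discrimination value used for 1PL below is an arbitrary illustrative choice, not a value from this paper):

```python
import math

def p_3pl(theta, a, b, c):
    # Full three-parameter logistic model: a, b, and c all vary per item.
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def p_2pl(theta, a, b):
    # 2PL: guessing fixed at zero; a and b vary per item.
    return p_3pl(theta, a, b, 0.0)

def p_1pl(theta, b, a_shared=1.7):
    # 1PL: one discrimination value shared by all items (1.7 is an
    # arbitrary illustrative choice); c fixed at zero.
    return p_3pl(theta, a_shared, b, 0.0)

def p_rasch(theta, b):
    # Rasch model: the 1PL special case with a = 1.
    return p_3pl(theta, 1.0, b, 0.0)

# All variants agree that an examinee with theta = b has a 0.5
# probability of success on that item.
assert all(abs(p - 0.5) < 1e-12 for p in
           (p_2pl(1.0, 2.0, 1.0), p_1pl(1.0, 1.0), p_rasch(1.0, 1.0)))
```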
Thus, IRT offers a way to evaluate the relevance of test items during
a design phase (e. g., how difficult items are, or how discriminant they
are), and a way to measure examinees’ abilities during an assessment
phase. These two phases constitute the backbone of the method we
present in this paper, which is why we stress that our approach will
be successful only if an IRT model fits a set of empirically collected
data. Furthermore, its accuracy will depend on how closely an IRT
model describes the interaction between examinees’ abilities and their
responses, i. e., how well the model describes the latent trait. Thus,
different variants of IRT models should be tested initially to find the
best fit. Finally, it should be mentioned that IRT models cannot be
relied upon to “fix” problematic issues in a test. Proper test design is
still required.
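To make the assessment phase concrete, here is a toy sketch of ability estimation under a 2PL model: given already-calibrated item characteristics, an examinee's θ is the value that maximizes the likelihood of the observed response pattern. The item characteristics and response patterns below are invented for illustration (this is not our calibration procedure); they echo the Rob/Chris scenario above, where two examinees with the same raw score receive different ability estimates.

```python
import math

def p_correct(theta, a, b):
    # 2PL item response function (guessing characteristic c set to zero).
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    # Log-likelihood of a dichotomous response pattern (1 = correct, 0 = incorrect).
    ll = 0.0
    for (a, b), x in zip(items, responses):
        p = p_correct(theta, a, b)
        ll += math.log(p if x else 1.0 - p)
    return ll

def estimate_theta(items, responses):
    # Crude maximum-likelihood ability estimate by grid search over theta.
    grid = [g / 100.0 for g in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Invented (a, b) characteristics: two easy, weakly discriminating items
# and two hard, highly discriminating ones.
items = [(0.6, -2.0), (0.6, -1.5), (1.5, 1.5), (1.5, 2.0)]

rob   = [1, 1, 0, 0]  # correct on the two easiest items only
chris = [0, 0, 1, 1]  # correct on the two hardest items only

# Same raw score (2), yet the patterns carry different evidence:
# Chris's successes on hard, discriminating items pull his estimate up.
print(estimate_theta(items, rob))    # lower theta estimate
print(estimate_theta(items, chris))  # noticeably higher theta estimate
```

Note that under the Rasch model (all a equal) the raw score alone determines θ; it is the varying discrimination in this 2PL sketch that lets identical raw scores map to different abilities.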
3 Basic Elements of Our Approach

In the approach we develop here, test items generally involve a 3-part
structure: 1) a stimulus, 2) a task, and 3) a question. The stimuli are the
particular graphical representations used. Tasks are defined in terms of
the visual operations and mental projections that an examinee should
perform to answer a given question. While tasks and questions are
usually linked, we emphasize this distinction because early piloting
revealed that different “orientations” of a question (e. g., emphasis on
particular visual aspects, or data aspects) could affect performance.
To identify possible factors that may influence the difficulty of a
test item, we reviewed all the literacy tests that we could find which
use graphs and charts as stimuli [13, 18, 20, 31, 32, 34, 49, 50]. Note
that our goal is not to investigate the effect of these factors on item
difficulty; we present them here merely as elements to be considered
in the design phase.
We identified 4 potential stimulus parameters: number of samples,
intrinsic complexity (or variability) of the data, layout, and level of
distraction. We also found 6 recurring task types: maximum and
minimum (the extrema), variation (trend), intersection, average, and comparison. Finally,
we distinguished 3 different question types: “perception” questions,
“high-congruency” questions, and “low-congruency” questions. Each
of these are described in the following subsections.
3.1 Stimulus parameters
In our survey, we first focused on identifying parameters to describe
the graphical properties of a stimulus. We found four:
Number of samples This refers to the number of graphically en-
coded elements in the stimulus. Among other things, the value of this
parameter can impact tasks that require visual chunking [25].
Complexity This refers to the local and global variability of the
data. For example, a dataset of the yearly life expectancy in different
countries over a 50 year time period has a low local variation (no dra-
matic “bounces” between two consecutive years), and low global vari-
ation (a relatively stable, linear, increasing trend). In contrast, a dataset
of the daily temperature in different countries over a year shows high
local variation (temperatures can vary dramatically from one day to the
next) and medium global variation (temperature generally rises and
decreases only once during the year).
Layout This refers to the structure of the graphical framework
and its scales. Layouts can be single (e. g., a 2-dimensional Euclidean
space), superimposed (e. g., three axes for a 2-dimensional encoding),
or multiple (e. g., several frameworks for the same visualization). Mul-
tiple layouts include cutout charts and broken charts [21]. Scales can
be single (linear or logarithmic), bifocal, or lens-like.
²For a complete set of references on the Rasch model, refer to http://
Distraction This refers to the graphical elements present in the
stimulus that are not necessary for the task at hand. These are consid-
ered to be distractors. Correll et al. [11] have shown that even small
variations in attributes of distractors can impact perception. However,
here we simply use distraction in a Boolean way, i. e., present or not.
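The complexity parameter above can be made concrete with a toy measure of our own (a sketch for illustration, not a metric from the literature we surveyed): local variability as the average absolute change between consecutive samples, and global variability as the number of direction changes in the overall trend.

```python
def local_variation(series):
    # Average absolute change between consecutive samples, normalized by
    # the series range; a crude, illustrative measure.
    diffs = [abs(b - a) for a, b in zip(series, series[1:])]
    rng = max(series) - min(series) or 1.0
    return sum(diffs) / len(diffs) / rng

def global_variation(series):
    # Number of direction changes in the trend; 0 means a monotone
    # (e.g., steadily increasing) series.
    signs = [1 if b > a else -1 for a, b in zip(series, series[1:]) if b != a]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)

# A stable, linearly increasing trend: low local and low global variation.
life_expectancy = [70 + 0.2 * year for year in range(50)]

# Rises and falls once over the year: medium global variation.
temperature = [10, 14, 19, 25, 28, 27, 22, 16, 11, 8]

print(global_variation(life_expectancy))  # 0
print(global_variation(temperature))      # 1
```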
3.2 Tasks
Next, we focused on identifying tasks that require only visual intelli-
gence, i. e., purely visual operations or mental projections on a graph-
ical representation. We found six: Maximum (T1), Minimum (T2),
Variation (T3), Intersection (T4), Average (T5), and Comparison (T6).
All are standard benchmark tasks in InfoVis. T1 and T2 consist in
finding the maximum and minimum data points in the graph, respec-
tively. T3 consists in detecting a trend, similarities, or discrepancies
in the data. T4 consists in finding the point at which the graph inter-
sects a given value. T5 consists in estimating an average value.
Finally, T6 consists in comparing different values or trends.
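As a minimal sketch, the six tasks can be phrased as operations over a series of (x, y) points. The trend measure for T3 and the comparison criterion for T6 below are naive simplifications of our own, not the paper's operationalizations.

```python
def t1_maximum(points):
    # T1: find the sample with the maximum value.
    return max(points, key=lambda p: p[1])

def t2_minimum(points):
    # T2: find the sample with the minimum value.
    return min(points, key=lambda p: p[1])

def t3_variation(points):
    # T3 (naive): overall trend as the sign of the end-to-start change.
    change = points[-1][1] - points[0][1]
    return "increasing" if change > 0 else "decreasing" if change < 0 else "flat"

def t4_intersection(points, value):
    # T4: first x at which the series crosses a given value (linear interpolation).
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if min(y0, y1) <= value <= max(y0, y1):
            return x0 if y1 == y0 else x0 + (x1 - x0) * (value - y0) / (y1 - y0)
    return None

def t5_average(points):
    # T5: average value of the series.
    return sum(y for _, y in points) / len(points)

def t6_comparison(a, b):
    # T6 (naive): compare two series by their average values.
    return t5_average(a) - t5_average(b)

series = [(0, 2.0), (1, 5.0), (2, 3.0), (3, 8.0)]
print(t1_maximum(series))            # (3, 8.0)
print(t4_intersection(series, 4.0))  # crossing point between x=0 and x=1
```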
3.3 Congruency
Finally, we focused on identifying different types of questions. We
found three: perception questions, and high- and low-congruency
questions. Perception questions refer only to visual aspects of the dis-
play (e. g., “what color are the dots?”). Conversely, congruent ques-
tions refer to semantic aspects of the data. The level of congruence
is then defined by the “replaceability” of the data-related terms in the
question by perceptual terms. A high-congruency question translates
into a perceptual query simply by replacing data terms by perceptual
terms (e. g., “what is the highest value”/“what is the highest bar?”).
A low-congruency question, in contrast, has no such correspondence
(e. g., “is A connected to B—in a matrix diagram”/“is the intersection
between column A and row B highlighted?”).
4 Line Graph Tests

To illustrate our method, we first created two line graph tests—Line
Graphs 1 (LG1) and Line Graphs 2 (LG2)—of slightly different de-
signs, based on the principles described above. We then calibrated
them using Amazon’s Mechanical Turk (MTurk).
4.1 Design Phase
4.1.1 Line Graphs 1: General Design
For our first test (LG1), we created a set of twelve items using different
stimulus parameters and tasks. We hand-tailored each item based on
an expected range of difficulty. Piloting had revealed that high vari-
ation in item dimensions led to incoherent tests (i. e., IRT models did
not fit the response data), implying that when factors vary too much
within a test, additional abilities beyond those involved in basic vi-
sualization literacy are likely at play. Thus, we kept the number of
varying factors low: only distraction and tasks varied. The test used
four samples for the stimuli, and a single layout with single scales. A
summary is given in Table 1.
Each item was repeated five times³. The test was blocked by item,
and all items and their repetitions were randomized to prevent carry-
over effects. We added an extra condition using a standard table at the
beginning of each block to give examinees the opportunity to consol-
idate their understanding of the new question, and to separate out the
comprehension stage of the question-response process believed to oc-
cur in cognitive testing [10]. The test was thus composed of 72 trials.
In the following paragraphs, we describe other important design
parameters we used in this test.
Scenario The PISA 2012 Mathematics Framework [34] empha-
sizes the importance of an understandable context for problem solving.
The current test focuses on one’s community, with problems set in a
community perspective.
³Early piloting had revealed that examinees would stabilize their search
time and confidence after a few repetitions. In addition, repeated trials usu-
ally provide more robust measures as medians can be extracted (or means in
the case of Boolean values).
Line Graphs 1 (LG1)
Item ID    Task           Distraction
LG1.1      max            0
LG1.2      min            0
LG1.3      variation      0
LG1.4      intersection   0
LG1.5      average        0
LG1.6      comparison     0
LG1.7      max            1
LG1.8      min            1
LG1.9      variation      1
LG1.10     intersection   1
LG1.11     average        1
LG1.12     comparison     1

Line Graphs 2 (LG2)
Item ID    Task           Congruency
LG2.1      max            high
LG2.2      min            high
LG2.3      variation      high
LG2.4      intersection   high
LG2.5      average        high
LG2.6      comparison     high
LG2.7      max            low
LG2.8      min            low
LG2.9      variation      low
LG2.10     intersection   low
LG2.11     average        low
LG2.12     comparison     low

Bar Chart (BC)
Item ID    Task           Samples
BC.1       max            10
BC.2       min            10
BC.3       variation      10
BC.4       intersection   10
BC.5       average        10
BC.6       comparison     10
BC.7       max            20
BC.8       min            20
BC.9       variation      20
BC.10      intersection   20
BC.11      average        20
BC.12      comparison     20

Scatterplot (SP)
Item ID    Task           Distraction
SP.1       max            0
SP.2       min            0
SP.3       variation      0
SP.4       intersection   0
SP.5       average        0
SP.6       comparison     0
SP.7       max            1
SP.8       min            1
SP.9       variation      1
SP.10      intersection   1
SP.11      average        1
SP.12      comparison     1

Table 1: Designs of Line Graphs 1 (LG1), Line Graphs 2 (LG2), Bar Chart (BC), and Scatterplot (SP). Only varying dimensions are shown. Each item is repeated 6 times, beginning with a table condition (repetitions are not shown). Pink cells in the Item ID column indicate duplicate items in LG1 and LG2. Tasks with the same color coding are the same. Gray cells in the Distraction, Congruency, and Samples columns indicate a difference with white cells. The Distraction column uses a binary encoding: 0 = no distractors, 1 = presence of distractors.
To avoid the potential bias of a priori domain knowledge, the test was set within the following science-fiction scenario: The year is 2813. The Earth is a desolate place. Most of mankind has migrated throughout the universe. The last handful of humans remaining on Earth are now actively seeking another planet to settle on. Please help these people determine what the most hospitable planet is by answering the following series of questions as quickly and accurately as possible.
Data The dataset we used had low local and medium global variability. It presented the monthly evolution of unemployment in different countries between the years 2000 and 2008. Country names were changed to fictitious planet names listed in Wikipedia, and years were modified to fit the scenario.
Priming and Pacing Before each new block of repetitions, examinees were primed with the upcoming graph type, so that the concepts and operations necessary for information extraction could be set up [38]. To separate out the time required to read questions, a specific pacing was given to each block. First, the question was displayed, along with a button labeled “Proceed to graph framework”; this led participants to the graphical framework with the appropriate title and labels. At the bottom of this was another button labeled “Display data,” which displayed the full stimulus.
As mentioned, to give examinees the opportunity to fully comprehend each question, every block began with a “question comprehension” condition in which the data were shown in table form. This was intended to remove potential effects caused by the setup of high-level operations for solving a particular kind of problem.
Finally, to make sure ability (and not capacity) was being tested, an 11 s timeout was set for each repetition. This was based on the mean time required to answer the items in our pilot studies.
Response format To respond, examinees were required to click on one of several possible answers, displayed in the form of buttons below the stimulus. In some cases, correct answers were not directly displayed. For example, certain values were not explicitly shown with labeled ticks on the graph's axes. This was done to test examinees' ability to make confident estimations (i.e., to handle uncertainty [34]). In addition, although the stimuli used color coding to show different planets, the response buttons did not. This forced examinees to translate the answer found in the visual domain back into the data domain.
4.1.2 Setup
To calibrate our test, we administered it on MTurk. While the validity of using this platform may be debated, due to a lack of control over particular experimental conditions [19], we considered it best to perform our calibration using the results of a wide variety of people.
Participants To our knowledge, no particular sample size is recommended for IRT modeling. We recruited 40 participants, who were required to have a 98% acceptance rate and a total of 1000 or more HITs approved.
Coding Six Turkers spent less than 1.5 s on average reading and answering questions; they were considered random clickers, and their results were removed from further analysis. All retained Turkers were native English speakers.
The remaining data were sorted according to item and repetition ID (assigned before randomization). Responses for the table conditions were removed. A score dataset (LG1s) was then created in accordance with the requirements of IRT modeling: correct answers were scored 1 and incorrect answers 0. Scores for each set of item repetitions were then compressed by computing the rounded mean values. This resulted in a set of twelve dichotomous item scores for each examinee.
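The coding step above can be sketched as follows (a minimal Python illustration with hypothetical examinee and item identifiers, not the authors' actual pipeline):

```python
# Sketch of the score-coding step: each examinee gives five repeated
# Boolean responses per item, which are collapsed into a single
# dichotomous score by rounding the mean, as required for IRT fitting.

def dichotomize(repetitions):
    """Collapse repeated 0/1 responses into one 0/1 item score."""
    mean = sum(repetitions) / len(repetitions)
    return 1 if mean >= 0.5 else 0  # an odd repetition count avoids ties

def score_dataset(raw):
    """raw: {examinee: {item: [0/1, ...]}} -> dichotomous item scores."""
    return {
        examinee: {item: dichotomize(reps) for item, reps in items.items()}
        for examinee, items in raw.items()
    }

raw = {"turker_01": {"LG1.1": [1, 1, 0, 1, 1], "LG1.11": [0, 1, 0, 0, 1]}}
scores = score_dataset(raw)  # {"turker_01": {"LG1.1": 1, "LG1.11": 0}}
```

A one-time slip on an otherwise mastered item is erased by the rounded mean, which is the point of collapsing repetitions this way.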
4.1.3 Results
The purpose of this calibration is to remove items that are unhelpful for distinguishing between low and high levels of VL. To do so, we need to: 1) check that the simplest variant of IRT models (i.e., the Rasch model) fits the data, 2) find the best variant of the model to get the most accurate characteristic values for each item, and 3) assess the usefulness of each item.
Checking the Rasch model The Rasch model (RM) was first fitted to the score dataset. A 200-sample parametric bootstrap goodness-of-fit test using Pearson's χ² statistic revealed a non-significant p-value for LG1s (p > 0.54), suggesting an acceptable fit⁴. The Test Information Curve (TIC) is shown in Fig. 1a. It reveals a near-normal distribution of test information across different ability levels, with a slight bump around −2, and a peak around −1. This means that the test provides more information about examinees with relatively low abilities (0 being the ability level of an average achiever) than about examinees with high abilities.
Finding the right model variant Different IRT models, implemented in the ltm R package [40], were then fitted to LG1s. A series of pairwise likelihood ratio tests showed that the two-parameter logistic model (2PL) was most suitable. The new TIC is shown in Fig. 1b.
Assessing the usefulness of test items The big spike in the TIC (Fig. 1b) suggests that several items with difficulty characteristics just above −2 have high discrimination values. This is confirmed by the very steep Item Characteristic Curves (ICCs) (Fig. 3a) for items LG1.1, LG1.4, and LG1.9 (a > 51), and can explain the slight distortion in Fig. 1a.
The probability estimates revealed that examinees with average abilities have a 100% probability of giving a correct answer to the easiest items (LG1.1, LG1.4, and LG1.9), and a 41% probability of giving a correct answer to the hardest item (LG1.11). However, the fact that LG1.11 has a relatively low discrimination value (a < 0.7) suggests that it is not very effective for separating ability levels.
⁴For more information about this statistic, refer to [40].
(a) TIC of LG1s under RM (b) TIC of LG1s under 2PL
Fig. 1: Test Information Curves (TICs) of the score dataset of the first line graph test under the original constrained Rasch model (RM) (a) and the two-parameter logistic model (2PL) (b). The ability scale shows the θ-values. The slight bump in the normal distribution of (a) can be explained by the presence of several highly discriminating items, as shown by the big spike in (b).
Fig. 2: Test Information Curve of the score dataset of the second line graph test under the original constrained Rasch model. The test information is normally distributed.
(a) ICCs of LG1s under 2PL (b) ICCs of LG2s under RM
Fig. 3: Item Characteristic Curves (ICCs) of the score datasets of the first line graph test (LG1s) under the two-parameter logistic model (a), and of the second line graph test (LG2s) under the constrained Rasch model (b). The different curve steepnesses in (a) are due to the fact that 2PL computes individual discrimination values for each item, while RM sets all discrimination values to 1.
4.1.4 Discussion
IRT modeling appears to be a solid approach for calibrating our test design. Our results (Fig. 1) show that LG1 is useful for differentiating between examinees with relatively low abilities, but not so much for ones with high abilities.
The slight bump in the distribution of the TIC (Fig. 1a) suggests that several test items are quite effective for separating ability levels around −2. This is confirmed by the spike in Fig. 1b, which indicates the presence of highly discriminating items. Overall, both Test Information Curves reveal that the test is best suited for examinees with relatively low abilities, since most of the information it provides concerns ability levels below zero.
In addition, Fig. 3a reveals that several items in the test have identical difficulty and discrimination characteristics. Some of these could be considered for removal, as they provide only redundant information. Similarly, item LG1.11, which has a low discrimination characteristic, could be dropped, as it is less effective than others.
4.1.5 Line Graphs 2: General Design
For our second line graph test (LG2), we also created twelve items, with varying factors restricted to question congruency and tasks. The test used four samples for the stimuli, and a single layout with single scales. The same scenario, dataset, pacing, and response format as for LG1 were kept, as well as the five repetitions, the question comprehension condition, and the 11 s timeout. As such, six items in this test were identical to items in LG1 (see pink cells in Table 1). This was done to ensure that the order of item difficulty would remain consistent across the different tests.
The calibration was again conducted on MTurk. 40 participants
were recruited; the work of three Turkers was rejected, for the same
reason as before.
4.1.6 Results and Discussion
Our analysis was driven by the requirements listed above. Data were
sorted and encoded in the same way as before, and a score dataset for
LG2 was obtained (LG2s).
RM was fitted to the score dataset, and the goodness-of-fit test revealed an acceptable fit (p > 0.3). The pairwise likelihood ratio test showed that RM was the best of all possible IRT models. The Test Information Curve (Fig. 2) is normally distributed, with a peak around −1. This indicates that, like our first line graph test, LG2 is best suited for examinees with relatively low abilities.
The Item Characteristic Curves of both tests were then compared. While it cannot be expected that identical items have the exact same characteristics, their difficulty order should remain consistent (see Sect. 2.3). Fig. 3 shows some slight discrepancies for items 1, 3, and 6 between the two tests. However, the fact that item LG1.3 is further to the left in Fig. 3a is misleading: it is due to the extremely high a-values of items LG1.1 and LG1.4. Thus, while their b-values are slightly higher than that of LG1.3, the probability of success of an average achiever is higher for these items than it is for LG1.3 (1 > 0.94). Furthermore, the difficulty characteristics of LG1.3 and LG2.3 are very similar (0.94 ≈ 0.92). Therefore, the only exception in the ordering of item difficulties is item 6, which is estimated to be more difficult than item 2 in LG1, but not in LG2.
This suggests that LG1 and LG2 cover the same latent trait, i.e., the ability to read line graphs. To examine this, we equated the test scores using a common-item equating approach. RM was fitted to the resulting dataset, the goodness-of-fit test showed an acceptable fit, and 2PL provided the best fit. Individual item characteristics were generally preserved, with the exception of item 6, which, interestingly, ended up with characteristics very similar to those of item 2. This confirms that the two tests cover the same latent trait. Thus, although individual characteristics are slightly altered by the equating (e.g., item 6), items in LG1 can safely be transposed to LG2 without hindering the overall coherence of the test, and vice versa.
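The common-item equating layout can be sketched as follows (hypothetical item names, not the actual LG1/LG2 items): the shared anchor items link the two tests, and items an examinee was never administered are coded as missing, so a single IRT model can be fitted to the stacked matrix.

```python
# Sketch of a common-item equating data layout. "A*" items are the
# anchors shared by both tests; unadministered items become None (NA).

lg1_items = ["I1", "I2", "I3", "A1", "A2", "A3"]
lg2_items = ["J1", "J2", "J3", "A1", "A2", "A3"]
all_items = ["I1", "I2", "I3", "J1", "J2", "J3", "A1", "A2", "A3"]

def pad(scores, administered):
    """Expand one examinee's scores to the full item set, with NAs."""
    return [scores[administered.index(i)] if i in administered else None
            for i in all_items]

row_lg1 = pad([1, 0, 1, 1, 1, 0], lg1_items)
row_lg2 = pad([0, 1, 1, 1, 0, 1], lg2_items)
# row_lg1 -> [1, 0, 1, None, None, None, 1, 1, 0]
```

Fitting one model to such a stacked matrix places items and examinees from both tests on a single latent scale, anchored by the shared items.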
4.2 Assessment Phase
Having shown that our test items have a sound basis in theory, we now turn to the assessment of visualization literacy. While a standard method would simply sum up the correct responses, our method considers each response individually, with regard to the difficulty of the item it was given for. To make this assessment, we inspected the ability scores derived from the fitted IRT models. These scores represent examinees' standings (θ) on the latent trait, and correspond to a unique response pattern. They have great predictive power, as they can determine an examinee's probability of success on items that s/he has not completed, provided that these items follow the same latent variable scale as other test items. As such, ability scores are perfect indicators for assessing VL.
LG1 revealed 27 different ability scores, ranging from −1.85 to 1. The distribution of these scores was near-normal, with a slight bump around −1.75. 40.7% of participants were above average (i.e., θ > 0), and the mean was −0.27.
LG2 revealed 33 different ability scores, ranging from −1.83 to 1.19. The distribution was also near-normal, with a bump around −1. 39.4% of participants were above average, and the mean was −0.17.
These results show that the means are close to zero, and the distributions near-normal. This suggests that most Turkers, while somewhat below average in visualization literacy for line graphs, have fairly standard abilities.
While it would be interesting to develop broader ranges of item complexities for the line graph stimulus (by using the common-item equating approach), thus extending the psychometric quality of the tests, we consider LG1 and LG2 to be sufficient for our current line of research. Furthermore, we believe that these low levels of difficulty reflect the general simplicity of, and massive exposure to, line graphs.
To see whether our method also applies to other types of visualizations, we created two additional tests: one for bar charts (BC) and one for scatterplots (SP).
5.1 Design Phase
5.1.1 Bar Charts: General Design
Like LG1 and LG2, the design of our bar chart test (BC) was based on the principles described in Section 3. We created twelve items, with varying factors restricted to number of samples and tasks (see Table 1). The same scenario, pacing, response format, repetitions, question comprehension condition, and 11 s timeout were kept. The dataset presented life expectancy in various countries, with country names again changed to fictitious planet names.
The only difference with the factors used earlier (apart from the stimulus) involved the variation task, which is essentially a trend detection task. Bar charts are sub-optimal for determining trends, so this task was replaced by a “global similarity detection” task, as done in [20] (e.g., “Do all the bars have the same value?”).
The calibration was again conducted on MTurk. 40 participants
were recruited; the work of six Turkers was rejected, for the same
reason as before.
5.1.2 Results and Discussion
Our analysis was driven by the same requirements as for the line graph
tests. Data were sorted and encoded in the same way, resulting in a
score dataset for BC (BCs).
RM was first fitted to BCs; the goodness-of-fit test revealed an acceptable fit (p > 0.37), and the likelihood ratio test showed that it fit best. However, the Test Information Curve (Fig. 4a) is not normally distributed. This is due to the presence of several extremely low difficulty (i.e., easy) items (BC.3, BC.7, BC.8, and BC.9; b = −25.6), as shown in Fig. 4b. Inspecting the raw scores for these items revealed a 100% success rate. Thus, they were considered too easy, and were removed. Similarly, items BC.1 and BC.2 (for both, b < −4) were also removed.
(a) TIC of BCs under RM (b) ICCs of BCs under RM
Fig. 4: Test Information Curve (a) and Item Characteristic Curves (b) of the score dataset of the bar chart test under the constrained Rasch model. The TIC in (a) is not normally distributed because of several very low difficulty items, as shown by the curves to the far left of (b).
(a) TIC of the subset of BCs under RM (b) ICCs of the subset of BCs under RM
Fig. 5: Test Information Curve (a) and Item Characteristic Curves (b) of the subset of the score dataset of the bar chart test under the constrained Rasch model. The subset was obtained by removing the very low difficulty items shown in Fig. 4b.
To check the coherence of the resulting subset of items, RM was fitted again to the remaining set of scores. Goodness-of-fit was maintained (p > 0.33), and RM still fitted best. The new TIC (Fig. 5a) is normally distributed, with a peak around −1. This indicates that, like our line graph tests, this subset of BC is best suited for examinees with relatively low abilities.
5.1.3 Scatterplots: General Design
For our scatterplot test (SP), we once again created twelve items, with varying factors restricted to distraction and tasks (see Table 1). The same scenario, pacing, response format, repetitions, and question comprehension condition were kept. The dataset presented levels of adult literacy by expenditure per student in primary school in different countries, with country names again changed to fictitious planet names.
Slight changes were required for some of the tasks, since scatterplots use two spatial dimensions (as opposed to bar charts and line graphs). For example, stimuli with distractors in LG1 only required examinees to focus on one of several samples; here, stimuli with distractors could either require examinees to focus on a single datapoint or on a single dimension.
We had initially expected that SP would be more difficult, and that items would require more time to complete. However, a pilot study showed that the average response time per item was again roughly 11 s. Therefore, the 11 s timeout condition was kept.
The calibration was again conducted on MTurk. 40 participants were recruited; the work of one Turker was not kept because of technical (logging) issues.
5.1.4 Results and Discussion
Our analysis was once again driven by the same requirements as before. The same sorting and coding was applied to the data, resulting in the score dataset SPs. The fitting procedure was then applied, revealing a good fit for RM (p = 0.6), and a best fit for 2PL.
The Test Information Curve (Fig. 6a) shows the presence of several highly discriminating items around b ≈ −1 and b ≈ 0. The Item Characteristic Curves (Fig. 6b) confirm that there are three (SP.6, SP.8, and SP.10; a > 31). However, they also show that two items (SP.3 and
(a) TIC of SPs under 2PL (b) ICCs of SPs under 2PL
Fig. 6: Test Information Curve (a) and Item Characteristic Curves (b) of the score dataset of the scatterplot test under the two-parameter logistic model. The TIC (a) shows that there are several highly discriminating items, which is confirmed by the very steep curves in (b). In addition, (b) shows that there are also two poorly discriminating items, represented by the very gradual slopes of items SP.3 and SP.11.
(a) TIC of the subset of SPs under 2PL (b) ICCs of the subset of SPs under 2PL
Fig. 7: Test Information Curve (a) and Item Characteristic Curves (b) of the subset of the score dataset of the scatterplot test under the two-parameter logistic model. The subset was obtained by removing the poorly discriminating items shown in Fig. 6b.
SP.11) have quite low discrimination values (a < 0.6). Here, we set a threshold of a > 0.8; thus, items SP.3 and SP.11 were removed. The resulting subset of 10 items' scores was fitted once again. RM fitted well (p = 0.69), and 2PL fitted best. The different curves of the subset are plotted in Fig. 7. They show a good amount of information for abilities that are slightly below average (Fig. 7a), which indicates that the subset of SP is once again best suited for examinees with relatively low abilities.
5.2 Assessment phase
Here again, we inspected the Turkers’ ability scores. Only the items
retained at the end of the design phase were used.
BC revealed 21 different ability scores, ranging from −1.75 to 0.99. The distribution of these scores was near-normal, with a slight bump around −1.5, and the mean was −0.39. However, only 14.3% of participants were above average.
SP revealed 23 different ability scores, ranging from −1.72 to 0.72. However, the distribution here was not normal. 43.5% of participants were above average, and the median was −0.14.
These results show that the majority of recruited Turkers had somewhat below average levels of visualization literacy for bar charts and scatterplots. The very low percentage of Turkers above average in BC led us to reconsider the removal of items BC.1 and BC.2, as they were not truly problematic. After reintegrating them in the test scores, 21 ability scores were observed, ranging from −1.67 to 0.99, and 42.8% of participants were above average. This seemed more convincing. However, this important difference illustrates the relativity of these values, and shows how important it is to properly calibrate the tests during the design phase.
Finally, we did not attempt to equate these tests, since—unlike LG1 and LG2—we ran them independently, without any overlapping items. To have a fully comprehensive test, i.e., a generic test for visualization literacy, intermediate tests are required where the stimulus itself is a varying factor. If such tests prove to be coherent (i.e., if IRT models fit the results), then it should be possible to assert that VL is a general trait that allows one to understand any kind of graphical representation. Although we believe that this ability varies with exposure and habit of use, a study to confirm this is outside the scope of this paper.
If these tests are to be used as practical ways of assessing VL, the process must be sped up, both in the administration of the tests and in the analysis of the results. While IRT provides useful information on the quality of tests and on the ability of those who take them, it is quite costly, both in time and in computation. This must be changed.
In this section, we present a way in which the tests we have developed in the previous sections can be optimized to be faster, while still maintaining their effectiveness.
6.1 Test Administration Time
As we have seen, several items can be removed from the tests while keeping good psychometric quality. However, this should be done carefully, as some of these items may provide useful information (as in the case of BC.1 and BC.2, Sect. 5.2).
We first removed LG1.11 from LG1, as its discrimination value was < 0.8 (see Sect. 5.1.4). We then inspected items with identical difficulty and discrimination characteristics, represented by overlapping ICCs (see Fig. 3). These were prime candidates for removal, since they provide only redundant information. There was one group of overlapping items in LG1 ([LG1.1, LG1.4, LG1.9]), and two in LG2 ([LG2.1, LG2.3], [LG2.2, LG2.7]). For each group, we kept only one item; thus LG1.1, LG1.4, LG2.3, and LG2.7 were dropped.
We reintegrated items BC.1 and BC.2 into BC, as they proved to have a big impact on ability scores (Sect. 5.2). The subset of SP created at the end of the design phase was kept, and no extra items were removed.
RM was fitted to the newly created subsets of LG1, LG2, and BC; the goodness-of-fit test showed acceptable fits for all (p > 0.69 for LG1, and p > 0.3 for both LG2 and BC). 2PL fitted best for LG1, and RM fitted best for both LG2 and BC.
We conducted a post-hoc analysis to see whether the number of item repetitions could be reduced (first to three, then to one). Results showed that RM fitted all score datasets using three repetitions. However, several examinees had lower scores. In addition, while BC and SP showed similar amounts of information for the same ability levels, the very easy items in BC (i.e., BC.3, BC.7, BC.8, and BC.9) were no longer problematic. This suggests that several participants did not get a score of 1 for these items, and confirms that, for some examinees, more repetitions are needed. Results for one-repetition tests showed that RM no longer fitted the scores of BC, suggesting that unique repetitions are noisy. Therefore, we decided to keep the five repetitions.
In the end, the redesign of LG1 contained 9 items (with a 10 min
completion time), the redesigns of LG2 and SP contained 10 items (11
min), and the redesign of BC contained 8 items (9 min).
6.2 Analysis Time and Computational Power
To speed up the analysis, we first considered setting up the procedure we had used in R on a server. However, this solution would have required a lot of computational power, so we dropped it.
Instead, we chose to tabulate all possible ability scores for each test. An interesting feature of IRT modeling is that it can derive ability scores from unobserved response patterns (i.e., patterns that do not exist in the empirical data), as well as from partial response patterns (i.e., patterns with missing values). Consequently, we generated all 2^(n_i) − 1 possible patterns for each test, where n_i is the number of items in a test. This resulted in 511 patterns for LG1, 1023 for both LG2 and SP, and 255 for BC. We then derived the different ability scores that could be obtained in each test.
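This enumeration can be sketched in a few lines. Which single pattern is excluded to reach 2^(n_i) − 1 is our assumption: the all-incorrect pattern is dropped here, since extreme patterns have no finite maximum-likelihood ability estimate, and this matches the counts reported above. The θ-values in the table are placeholders, not model-derived scores.

```python
from itertools import product

# Sketch of the score-tabulation idea: enumerate every dichotomous
# response pattern for an n-item test, dropping the all-incorrect one.
def all_patterns(n_items):
    return [p for p in product((0, 1), repeat=n_items) if any(p)]

patterns_lg1 = all_patterns(9)  # 511 patterns, as reported for LG1
patterns_bc = all_patterns(8)   # 255 patterns, as reported for BC

# Pairing each pattern with a precomputed theta turns web-based scoring
# into a constant-time dictionary lookup (placeholder theta here):
theta_table = dict.fromkeys(patterns_bc, 0.0)
```

Once the table exists, no IRT fitting is needed at test time: an examinee's response pattern is simply looked up.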
To ensure that removing certain test items did not greatly affect the ability scores, we computed all the scores for the full LG1 and LG2 tests, and compared them to the ones previously obtained. We found some small differences in the upper and lower bounds of ability, but these were considered negligible, since our tests were not designed for fine distinction between very low or very high abilities. We also tested the impact of refitting the IRT models after item removal. For this, we repeated the procedure using partial response patterns for LG1 and LG2, i.e., we replaced the dichotomous response values for the items considered for removal by not-available (NA) values. The scores were exactly the same as the ones obtained with our already shortened and refitted tests, which shows they can be trusted.
Finally, we integrated all ability scores and their corresponding response patterns into the web-based, shortened versions of the tests, to make them readily available. This way, by administering our online tests, researchers can have direct access to participants' levels of visualization literacy. Informed decisions can then be made as to whether to keep these people for further studies or not. All four tests are accessible at
As the preceding sections have shown, we have developed and validated a fast and effective method for assessing visualization literacy. This section summarizes the major steps, written in the form of easy “take-away” guidelines.
7.1 Initial Design
1. Pay careful attention to the design of all 3 components of a test item, i.e., stimulus, task, and question. Each can influence item difficulty, and too much variation may fail to produce a coherent test—as was seen in our pilot studies.
2. Repeat each item several times. We did 5 repetitions + 1 “question comprehension” condition for each item. This is important, as repeated trials provide more robust measures. Ultimately, it may be feasible to reduce the number of repetitions to 3⁵, although our results show that this can be problematic (Sect. 6.1).
3. Use a different—and ideally, non-graphical—representation for question comprehension. We chose a table condition. While our present study did not set out to prove that this condition is essential, we believe it is important.
4. Randomize the order of items and of repetitions. This is common practice in experiment design, having the benefit of preventing carryover effects.
5. Once the results are in, sort the data according to item and repetition ID, remove the data for the question comprehension condition, and encode examinees' scores in a dichotomous way, i.e., 1 for correct answers and 0 for incorrect answers.
6. Calculate the mean score for all repetitions of an item and round the result. This will give a finer estimate of the examinee's ability, since it erases one-time errors which may be due to lack of attention or to clicking on the wrong answer by mistake.
7. Begin model fitting with the Rasch model. RM is the simplest variant of IRT models; if it does not fit the data, other variants will not either. Then check the fit of the model. Here we used a 200-sample parametric bootstrap goodness-of-fit test using Pearson's χ² statistic. To reveal an acceptable fit, the returned p-value should not be statistically significant (p > 0.05). In some cases (as in our pilot studies), the model may not fit. Options here are to inspect the χ² p-values for pairwise associations, or the two- and three-way χ² residuals, to find problematic items⁶.
8. Determine which IRT model variant best fits the data. A series of pairwise likelihood ratio tests can be used for this. If several models fit, it is usually good to go with the model that fits best. Our experience showed that such models were most often RM and 2PL.
9. Identify potentially useless items. In our examples of LG1 and SP, certain items had low discrimination characteristics. These are not very effective for separating ability levels, and can be removed. In cases like the one for BC, items may also simply be too easy. Before removing them permanently, however, it is advised to check their impact on ability scores. Finally, it is important that the model be refitted at this stage (reproducing steps 7 and 8), as removing these items may affect examinee and item characteristics.
⁵The number of repetitions should be odd, so as to not end up with a mean score of 0.5 for an item.
⁶For more information, refer to [40].
7.2 Final Design
10. Identify overlapping items and remove them. If the goal is
to design a short test, such items can safely be removed, as they
provide only redundant information (see Sect. 6.1).
11. Generate all 2^(n_i) − 1 possible score patterns, where n_i is the number of retained items in the test. These patterns represent series of dichotomous response values for each test item.
12. Derive the ability scores from the model, using the patterns of
responses generated in step 11. These scores represent the range
of visualization literacy levels that the test can assess.
13. Integrate the ability scores into the test to make fast, effective,
and scalable estimates of people’s visualization literacy.
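Step 12 can be sketched as a simple grid-search maximum-likelihood estimate under 2PL (made-up item parameters; production tools, such as the ltm R package used in this paper, rely on proper numerical optimization rather than a grid):

```python
import math

# Hedged sketch of ability-score derivation: given fitted item
# parameters (a, b) and one response pattern, pick the theta that
# maximizes the pattern's log-likelihood over a grid.

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, responses, params):
    ll = 0.0
    for r, (a, b) in zip(responses, params):
        p = p_2pl(theta, a, b)
        ll += math.log(p) if r == 1 else math.log(1.0 - p)
    return ll

def estimate_theta(responses, params, grid=None):
    grid = grid or [i / 100.0 for i in range(-300, 301)]  # theta in [-3, 3]
    return max(grid, key=lambda t: log_likelihood(t, responses, params))

# Four hypothetical items, ordered from easy to hard:
params = [(1.2, -1.5), (1.0, -0.5), (0.9, 0.3), (1.1, 1.0)]
theta_hat = estimate_theta([1, 1, 0, 0], params)  # correct on the easy items
```

Precomputing `estimate_theta` for every possible pattern yields exactly the kind of lookup table used by the web-based tests.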
In this paper, we have developed a method for assessing visualization literacy, based on a principled set of considerations. In particular, we used Item Response Theory to allow a separation of the effects of item difficulty and examinee ability. Our motivation was to make a series of fast, effective, and reliable tests which researchers could use to detect participants with low VL abilities before conducting online studies. We have shown how these tests can be tailored to get immediate estimates of examinees' levels of VL.
We intend to continue developing this approach, as well as examining the suitability of this method for other kinds of representations (e.g., parallel coordinates, node-link diagrams, star plots, etc.), and possibly for other purposes. For example, in contexts like a classroom evaluation, the tests could be longer, and broader assessments of visualization literacy could be made. This would imply further exploration of the design parameters proposed in Section 3. Evaluating the impact of these parameters on item difficulty should also be interesting.
Finally, we acknowledge that this work is but a small step into the realm of visualization literacy. As such, we have made our tests available on GitHub for versioning [1]. Ultimately, we hope that this will serve as a foundation for further research into VL.
Acknowledgments
This work was funded by a Google Research Award, granted for a
project called “Data Visualization for the People”.
References
[1] 101.
[2] D. Abilock. Visual information literacy: Reading a documentary photo-
graph. Knowledge Quest, 36(3), January–February 2008.
[3] J. Baer, M. Kutner, and J. Sabatini. Basic reading skills and the literacy
of America's least literate adults: Results from the 2003 National Assessment
of Adult Literacy (NAAL) supplemental studies. Technical report,
National Center for Education Statistics (NCES), February 2009.
[4] J. Bertin and M. Barbut. Semiologie Graphique. Mouton, 1973.
[5] P. Bobko and R. Karren. The perception of Pearson product-moment
correlations from bivariate scatterplots. Personnel Psychology,
32(2):313–325, 1979.
[6] M. T. Brannick. Item response theory. http://luna.cas.usf.
[7] V. J. Bristor and S. V. Drake. Linking the language arts and content areas
through visual technology. T.H.E. Journal, 22(2):74–77, 1994.
[8] P. Carpenter and P. Shah. A model of the perceptual and conceptual pro-
cesses in graph comprehension. Journal of Experimental Psychology:
Applied, 4(2):75–100, 1998.
[9] W. S. Cleveland, P. Diaconis, and R. McGill. Variables on scatterplots
look more highly correlated when the scales are increased. Science,
216(4550):1138–1141, 1982.
[10] Cognitive testing interview guide.
[11] M. Correll, D. Albers, S. Franconeri, and M. Gleicher. Comparing averages
in time series data. In Proceedings of the 2012 ACM Annual Conference
on Human Factors in Computing Systems, pages 1095–1104. ACM, 2012.
[12] F. R. Curcio. Comprehension of mathematical relationships expressed in
graphs. Journal for Research in Mathematics Education, pages 382–393, 1987.
[13] Department for Education. Numeracy skills tests: Bar charts.
numeracy/areas/barcharts, 2012.
[14] R. C. Fraley, N. G. Waller, and K. A. Brennan. An item response theory
analysis of self-report measures of adult attachment. Journal of person-
ality and social psychology, 78(2):350, 2000.
[15] E. Freedman and P. Shah. Toward a model of knowledge-based graph
comprehension. In M. Hegarty, B. Meyer, and N. Narayanan, editors, Di-
agrammatic Representation and Inference, volume 2317 of Lecture Notes
in Computer Science, pages 18–30. Springer Berlin Heidelberg, 2002.
[16] S. N. Friel, F. R. Curcio, and G. W. Bright. Making sense of graphs:
Critical factors influencing comprehension and instructional implications.
Journal for Research in mathematics Education, 32(2):124–158, 2001.
[17] A. C. Graesser, S. S. Swamer, W. B. Baggett, and M. A. Sell. New models
of deep comprehension. Models of understanding text, pages 1–32, 1996.
[18] Graph design I.Q. test.
[19] J. Heer and M. Bostock. Crowdsourcing graphical perception: Using
mechanical turk to assess visualization design. In Proceedings of the
SIGCHI Conference on Human Factors in Computing Systems, CHI ’10,
pages 203–212, New York, NY, USA, 2010. ACM.
[20] How to read a bar chart.
[21] P. Isenberg, A. Bezerianos, P. Dragicevic, and J. Fekete. A study on dual-
scale data charts. Visualization and Computer Graphics, IEEE Transac-
tions on, 17(12):2469–2478, Dec 2011.
[22] ACRL. Presidential committee on information literacy: Final report.
Online publication, 1989.
[23] I. Kirsch. The international adult literacy survey (ials): Understanding
what was measured. Technical report, Educational Testing Service, De-
cember 2001.
[24] N. Knutson, K. S. Akers, and K. D. Bradley. Applying the Rasch model
to measure first-year students' perceptions of college academic readiness.
Paper presented at the MWERA Annual Meeting,
[25] R. Lowe. “Reading” scientific diagrams: Characterising components
of skilled performance. Research in Science Education, 18(1):112–122, 1988.
[26] R. Lowe. Scientific Diagrams: How Well Can Students Read Them?
What Research Says to the Science and Mathematics Teacher, Number 3.
ERIC Clearinghouse, Washington, D.C., 1989.
[27] J. Meyer, M. Taieb, and I. Flascher. Correlation estimates as perceptual
judgments. Journal of Experimental Psychology: Applied, 3(1):3, 1997.
[28] E. Miller. The Miller word-identification assessment. http://www., 1991.
[29] National Cancer Institute. Item response theory modeling.
[30] National Center for Education Statistics. Adult literacy and lifeskills survey.
[31] Numerical reasoning - table/graph. http://www.jobtestprep.
[32] Numerical reasoning online. http://
[33] J. Oberholtzer. Why two charts make me feel like an expert on
Portugal, 2012.
[34] OECD. Pisa 2012 assessment and analytical framework. Technical re-
port, OECD, 2012.
[35] OTA. Computerized Manufacturing Automation: Employment, Education,
and the Workplace. United States Office of Technology Assessment, 1984.
[36] S. Pinker. A theory of graph comprehension, pages 73–126. Lawrence
Erlbaum Associates, Hillsdale, NJ, 1990.
[37] I. Pollack. Identification of visual correlational scatterplots. Journal of
experimental psychology, 59(6):351, 1960.
[38] R. Ratwani and J. Gregory Trafton. Shedding light on the graph schema:
Perceptual features versus invariant structure. Psychonomic Bulletin and
Review, 15(4):757–762, 2008.
[39] R. A. Rensink and G. Baldridge. The perception of correlation in scatter-
plots. Comput. Graph. Forum, 29(3):1203–1210, 2010.
[40] D. Rizopoulos. ltm: An R package for latent variable modeling and Item
Response Analysis. Journal of Statistical Software, 17(5):1–25, 11 2006.
[41] RUMMLaboratory. Rasch analysis. http://www. models.htm.
[42] P. Shah. A model of the cognitive and perceptual processes in graphical
display comprehension. Reasoning with diagrammatic representations,
pages 94–101, 1997.
[43] Sir G. Crowther. The Crowther Report, volume 1. Her Majesty's
Stationery Office, 1959.
[44] C. Taylor. New kinds of literacy, and the world of visual information.
Literacy, 2003.
[45] J. G. Trafton, S. P. Marshall, F. Mintz, and S. B. Trickett. Extracting
explicit and implicit information from complex visualizations. Diagrams,
pages 206–220, 2002.
[46] S. Trickett and J. Trafton. Toward a comprehensive model of graph com-
prehension: Making the case for spatial cognition. In D. Barker-Plummer,
R. Cox, and N. Swoboda, editors, Diagrammatic Representation and In-
ference, volume 4045 of Lecture Notes in Computer Science, pages 286–
300. Springer Berlin Heidelberg, 2006.
[47] B. Tversky. Semantics, Syntax, and Pragmatics of graphics., pages 141–
158. Lund University Press, Lund, 2004.
[48] UNESCO. Literacy assessment and monitoring programme.
lamp-literacy-assessment.aspx.
[49] University of Kent. Numerical reasoning test. http://www.kent.
[50] H. Wainer. A test of graphicacy in children. Applied Psychological Mea-
surement, 4(3):331–340, 1980.
[51] What is media literacy? A definition... and more.
what-media-literacy-definition-and-more.
[52] M. Wu and R. Adams. Applying the Rasch model to psycho-social
measurement: A practical approach. Educational Measurement Solutions,
Melbourne, Vic, 2007.