EUROVIS 2020/ C. Garth, A. Kerren, and G. E. Marai Short Paper
Examining Design-Centric Test Participants in Graphical
Perception Experiments
G. Guo1, B. Dy2, N. Ibrahim2, S.C. Joyce2 and A. Poorthuis2
1Georgia Institute of Technology, USA
2Singapore University of Technology and Design, Singapore
In this paper, we replicate a foundational study in graphical perception, and compare our findings from using design-centric
participants with those of previous studies. We also assess the visual accuracy of two groups, students and professionals, both
with design backgrounds, to identify the potential effects of participants’ backgrounds on their ability to accurately read charts.
Our findings demonstrate that the reading-accuracy results for different chart types reported in previous empirical studies [CM84,HB10]
also apply to participants with design backgrounds. We also demonstrate that aside from significant differences in response time,
there are no significant differences in reading accuracy between the student and professional groups in our study. This indicates
that, despite bias in research participants for visualization research, previous conclusions about graphical perception are likely
applicable across different populations and possibly work fields.
CCS Concepts
• Human-centered computing → Visualization; Empirical studies in visualization;
1. Introduction
Visualizations play a key role in many decision making con-
texts. However, there is relatively little empirical research on ex-
actly how visualizations are read and used to generate action-
able insights in decision-making processes. Despite overall atten-
tion to this interaction between visualization and decision mak-
ing [Bre94,CHGF09,Den75,Fla71,RWJ75,Tuf97], especially in
cartography, empirical and experimental visualization research has
for various reasons focused on the assessment of relatively narrow,
specific tasks that are tested with specific, convenient respondent populations.
For example, previous research that looks at human graphical
perception for use in data visualization, starting with the classic ex-
periment by Cleveland & McGill (1984) [CM84] and studies that
follow the former closely [HB10,TSA14] as well as other research
[KH18,SK16] are generally limited in their choice of research par-
ticipants. Test subjects are often drawn from the researchers’ uni-
versity students or other ‘convenient’ participant populations like
Amazon Mechanical Turk (‘MTurk’) [Ama]. This issue has been
well-documented and critiqued, especially in the field of psychol-
ogy, mainly under the label ‘WEIRD’ [HHN10]. In short: peo-
ple from Western, Educated, Industrialized, Rich, and Democratic
(WEIRD) backgrounds may not be representative of all people.
More specifically, in the context of this research, undergraduate students or Mechanical Turk workers may not be representative of design professionals (i.e. those working in fields such as architecture, urban design, product design) and thus, design decision-
makers. This is a potential issue if we design visualization systems
for decision making for specific domains based on recommenda-
tions derived from empirical research on such groups.
From this, we make the following primary contributions:
• We successfully replicate earlier empirical work on the graphical perception of different visual encodings, consistent with theoretical predictions on which visualization techniques are most effective [Mac86].
• We compare the performance of specific participant populations to better understand whether and how frequent participant populations in empirical research (i.e. students and MTurk workers) may introduce specific sampling bias. We show that students and professionals perform similarly in terms of reading accuracy but that students perform significantly faster in graphical perception tasks.
2. Research Goals
Our primary goal is to assess the potential effect of different partici-
pant backgrounds on graphical perception (specifically student vis-
a-vis design professionals). We hypothesize that the background of
a research participant affects the reading accuracy of a visualiza-
tion, and that in particular, spatially trained design professionals
may be better at reading certain charts, i.e. those that rely on area
and proportion to communicate data, than others.
© 2020 The Author(s). Eurographics Proceedings © 2020 The Eurographics Association.
Figure 1: Chart types used in the experiment.
Figure 2: A sample question from our web platform.
We do this through a replication of Cleveland & McGill’s and
Heer & Bostock’s studies with the two aforementioned populations.
We replicate both studies’ tests on proportion estimation across dif-
ferent spatial encodings (position, length, angle, area). We also as-
sess whether there is any significant difference between student and professional participants’ performance in reading visualizations, and to which chart types any such differences apply.
3. Experiment Design
Seven chart types consisting of stacked and grouped bar charts, pie
charts and treemaps were tested in this experiment. Types 1-6 cor-
respond to the first six types used in Heer & Bostock’s proportional
judgement experiment and Type 7 corresponds to the treemap used
in their area judgement experiment (see Figure 1). All charts were
generated using the same set of values as well, mimicking Cleve-
land & McGill’s original position-length experiment.
The student group was made up of students currently undertak-
ing Architecture and Engineering courses at a Design and Technol-
ogy school in Singapore. The design professionals were recruited
through design networks, such as the National Design Council of
Singapore (NDC), and included professionals from different design
professions who are active in industry and academia, for example
in architecture, urban design, product design and the like.
All participants submitted their responses using a web platform
developed for this purpose through specific invitation links gener-
ated for each group (students and professionals). This allowed us
to track the status of different groups individually. Each participant
went through an explanatory introduction and two practice ques-
tions before attempting the actual experiment. Participants were ad-
vised not to spend too much time on each question and instead to
try to make a quick, intuitive visual judgment without using any
precise measuring techniques.
Each participant was asked to respond to 42 questions/stimuli
in random order. Each question consisted of a unique chart be-
longing to one of the seven chart types tested. Each chart had two
coloured segments, one blue and the other yellow. Participants were
first asked to indicate which segment was smaller by choosing the
colour of the segment and were then asked to judge the percent-
age the smaller segment was of the larger segment (see Figure 2).
Participants had to answer both parts before continuing to the next
question. The time spent on each question and the device details
were recorded along with the participants’ responses.
We adopt the same measure of accuracy (log absolute error) as
Cleveland & McGill [CM84] and Heer & Bostock [HB10]:

log2(|judged percent − true percent| + 1/8)    (1)

Our measure for response time is the natural logarithm of the difference in seconds between a participant’s entry and exit of a question:

loge(|exit time − enter time|)    (2)
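As a concrete sketch, both measures can be computed as follows. This is a minimal illustration with our own function names; we assume the standard Cleveland & McGill form of the log absolute error, log2(|judged − true| + 1/8), which handles zero-error responses:

```python
import math

def log_abs_error(judged_percent: float, true_percent: float) -> float:
    # Log absolute error, assuming the CM84/HB10 form with a 1/8 offset
    # so that a perfect answer does not produce log2(0).
    return math.log2(abs(judged_percent - true_percent) + 1 / 8)

def log_response_time(enter_time: float, exit_time: float) -> float:
    # Natural log of the seconds between entering and leaving a question.
    return math.log(abs(exit_time - enter_time))
```

Under this form, a perfect estimate yields log2(1/8) = −3, the floor of the error scale.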
To test between-group differences, we used ANOVA and followed
up with Tukey post-hoc tests. We used a Q–Q plot to check the normality assumption underlying the F-test, and found that our time and accuracy measures closely follow a normal distribution. The data collected in this study can
be made available upon request.
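For illustration only (the paper does not specify its analysis software), the one-way ANOVA F statistic behind these between-group tests can be computed from scratch:

```python
from statistics import mean

def one_way_anova_f(groups: list[list[float]]) -> float:
    # F = (between-group mean square) / (within-group mean square),
    # e.g. with one group of log errors per chart type or occupation.
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A Tukey post-hoc test would then compare each pair of groups; in practice both steps are single calls in a statistics package rather than hand-rolled code.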
4. Results and Discussion
There were a total of 123 respondents, of which 87 were students
and 36 were professionals. The age of students ranged from 19 to
32, with a mean age of 23. The age of professionals ranged from
21 to 63, with a mean age of 35.
Only the fully completed stimuli/questions (4602) were included
in further analysis. From these completed questions, 298 (6.48%)
were excluded from accuracy analysis as these participants chose
not to provide their occupation or listed their profession as ‘Other’.
We also looked at the time taken for each response. The online
and unpaid nature of our questionnaire meant that, compared to the MTurk study [HB10] and the controlled environment of the original study [CM84], there was a significantly higher chance of participants taking longer than was reasonable to answer questions. Hence, we
omitted another 404 outlier responses that did not fall within the interquartile range. This led to a total of 702 responses (15.3%) being excluded from the analysis of time taken.
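The time-based exclusion can be sketched as follows. This is one interpretation; the exact quartile convention used is not stated in the text:

```python
def within_iqr(times: list[float]) -> list[float]:
    # Keep only responses whose time falls inside the interquartile
    # range, mirroring the outlier rule described above.  The quartiles
    # use a simple sorted-index convention (an assumption on our part).
    s = sorted(times)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]
    return [t for t in times if q1 <= t <= q3]
```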
4.1. Replication of proportional judgement experiments
The log absolute errors measured in this experiment are, on aver-
age, slightly higher than in Heer & Bostock’s paper (see Figure 3).
This is likely due to their exclusion of responses that differed from the true difference by more than 40%. No such exclusion was performed here.
Figure 3: Midmeans of log absolute errors against true percentages for each proportional judgment type. Superimposed curves were computed with lowess smoothing. The log absolute errors are, on average, slightly higher than in [HB10].

We found a significant effect of chart type on response accuracy (F(6, 4297) = 68.868, p < 0.05). A further Tukey post-hoc analysis found that with the exception of Type 2 (stacked bar) and Type 3
(dodged bar) charts, which had similar performance, all chart types
were significantly different from each other in terms of participant
accuracy. This result is in line with previous results from Cleveland
and McGill [CM84], and Heer & Bostock [HB10]. They found a
similar order of accuracy between the chart types, with grouped
bar charts as the best performing chart types and tree maps as the
worst. In our experiment, Type 1 has the lowest error (1.95), while
Type 7 has the highest error (3.21).
In addition to differences in accuracy between chart types, we
also found a significant difference in response time (F(6,3949) =
28.144, p < 0.05), as seen in Figure 4. Because the response-time data were strongly skewed, a natural log transform was performed on the measurements before further analysis. A Tukey post-hoc test
found that response time was not significantly different for chart
Types 1, 2, 3, and 5, while chart Types 4, 5, 6 were also similar.
Type 2 has the fastest average response time (9.2 seconds), while
Type 7 is 30% slower (12.0 seconds).
Additionally, a correlation analysis between log error and time
taken for each chart type reveals no or very weak relationships be-
tween log error and time taken for all charts. Most of these correla-
tions are not statistically significant (p>0.05). This suggests that
taking more time to give a response does not significantly alter the
accuracy of the judgment, regardless of chart type.
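This per-chart-type check amounts to computing a Pearson correlation between response time and log error within each chart type; a minimal sketch (variable names are ours):

```python
from statistics import mean

def pearson_r(times: list[float], errors: list[float]) -> float:
    # Pearson correlation between response times and log errors for
    # the responses belonging to a single chart type.
    mt, me = mean(times), mean(errors)
    cov = sum((t - mt) * (e - me) for t, e in zip(times, errors))
    var_t = sum((t - mt) ** 2 for t in times)
    var_e = sum((e - me) ** 2 for e in errors)
    return cov / (var_t * var_e) ** 0.5
```

Values near zero, as reported above, indicate no meaningful linear relationship between time spent and accuracy.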
4.2. Differences between students and professionals
Now that we have established that our results generally replicate
earlier studies, we turn towards analyzing the difference between
the student and professional participant groups. If all stimuli (regardless of chart type) are taken together, we find no significant effect of occupation on response accuracy (F(1, 4302) = 0.592, p = 0.4415) (see Figure 6). Even when we analyze the difference in accuracy between students and professionals for each of the seven chart types, none of the differences are statistically significant. This is a potential indication that the relationship between visual encoding and reading accuracy discussed in previous work is not sensitive to the sampling bias present in many empirical, experimental designs.

Figure 4: 95% confidence intervals of log errors (top) and time taken (bottom) by chart type. Relative performance of each chart (log errors) is comparable to results by [CM84] and [HB10].

Figure 5: Top: Violin plot of log error by chart type. Bottom: Violin plot of log time taken by chart type. These results show a similar order of accuracy between the chart types for both log error and log response time as [CM84] and [HB10].
While this result runs counter to our original hypothesis (i.e. that the background of a research participant affects the reading accuracy of a visualization), other sampling biases not captured by our current study design may affect the relationship between visual encoding and reading accuracy.

Figure 6: Left: Violin plot of log error by profession. Right: Violin plot of log time taken by profession. Students and professionals have similar accuracy, but students are significantly faster.
However, we do find a significant effect of occupation on response time (F(1, 3954) = 202.675, p < 0.01). On average, a
student takes 8.7 seconds to respond, while professionals take 11.0
seconds. We are not entirely sure what may be the underlying cause
behind this difference. It could be that students are indeed faster in
reading and assessing visualizations but it could also be that stu-
dents felt more rushed to complete the experiment; or that they are
more comfortable with digital surveys; or generally more trained
in test taking. Part of this could also be confounded by the student
population being younger, as we did find a significant positive effect of age on response time (r = 0.447; p < 0.01).
4.3. Additional findings
Apart from these main findings, we also find that the observed er-
rors have a relationship to the actual, ‘true’ difference between the
two segments that the respondents are asked to compare (see Fig-
ure 7). In general, the mean error gets lower if the difference be-
tween the two segments is smaller, implying that as the difference
between the areas of two segments increases, participants find it
more difficult to accurately estimate said difference. Interestingly,
we also find that the error variance is noticeably higher for both
very small and very large differences.
Finally, no correlation is found between the time taken by par-
ticipants and their accuracy on a given chart type. Although the
choice of chart affects both the response time and accuracy, within
each chart type there is no or very weak correlation between re-
sponse time and accuracy. In other words, ‘slower’ chart types are
generally less accurate but on an individual level taking more time
to read a chart does not lead to higher accuracy.
5. Conclusion
As the previous section illustrates, earlier empirical results on the graphical perception of different visual encodings were successfully replicated in the current study. In this replication, we
tested on two different groups of participants to evaluate the effect
of a participant’s background on graphical perception. Our impe-
tus for this research design was the idea that student populations
Figure 7: 95% confidence intervals of log errors against true dif-
ferences between chart segments. As the difference between the two
segments to be compared increases, participants find it harder to
accurately estimate the difference.
(commonly used as research participants in visualization research)
may not be representative of the larger population, or in our case
for decision-makers in design fields specifically. Furthermore, there was interest in whether spatially trained design professionals may be better at reading certain charts (i.e. treemaps) than others.
Our findings, however, did not show a significant difference be-
tween students and design professionals in terms of reading accu-
racy, nor did we see any difference between design-trained participants in this test and the non-spatially trained participants in
the prior Cleveland & McGill (1984) and Heer & Bostock (2010)
experiments. This implies parity of capability between
students and design professionals, and holds for both overall ac-
curacy, and for the differences in accuracy between different chart
types. Beyond accuracy, we do observe a significant difference in
how much time students and design professionals take to complete
chart reading tasks, with students being significantly faster.
In summary, the theoretical principles from visualization theory
and recommendations based on previous empirical studies on the
accuracy of different visual encodings could potentially apply be-
yond the often-used populations of students and MTurk workers.
However, as indicated by the differences in response time, spe-
cific sub-groups, such as design professionals and other decision-
makers, may indeed read visualizations in slightly different, nu-
anced ways. This has implications on the development of more
complex visualizations, such as those for use in design visualiza-
tion systems oriented towards decision making, and requires addi-
tional follow-up research.
6. Acknowledgements
The material reported in this document is supported by the SUTD-
MIT International Design Centre (IDC). Any findings, conclusions,
recommendations, or opinions expressed in this document are those of the author(s) and do not necessarily reflect the views of the IDC.
[Ama] AMAZON: Amazon Mechanical Turk. https://www.mturk.com/. Accessed: 2019-06-13.
[Bre94] BREWER C. A.: Chapter 7 – Color use guidelines for mapping and visualization. In Visualization in Modern Cartography, MacEachren A. M., Taylor D. R. F., (Eds.), vol. 2 of Modern Cartography Series. Academic Press, 1994, pp. 123–147. doi:10.1016/B978-0-08-042415-6.50014-4.
[CHGF09] ÇÖLTEKIN A., HEIL B., GARLANDINI S., FABRIKANT S. I.: Evaluating the effectiveness of interactive map interface designs: a case study integrating usability metrics with eye-movement analysis. Cartography and Geographic Information Science 36, 1 (2009), 5–17.
[CM84] CLEVELAND W. S., MCGILL R.: Graphical perception: Theory, experimentation, and application to the development of graphical methods. Journal of the American Statistical Association 79, 387 (Sep 1984), 531–554. doi:10.2307/2288400.
[Den75] DENT B. D.: Communication aspects of value-by-area cartograms. The American Cartographer 2, 2 (1975), 154–168.
[Fla71] FLANNERY J. J.: The relative effectiveness of some common graduated point symbols in the presentation of quantitative data. Cartographica: The International Journal for Geographic Information and Geovisualization 8 (Dec 1971), 96–109. doi:10.3138/J647-1776-745H-3667.
[HB10] HEER J., BOSTOCK M.: Crowdsourcing graphical perception: Using Mechanical Turk to assess visualization design. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (2010), CHI ’10, pp. 203–212. URL: http://doi.acm.
[HHN10] HENRICH J., HEINE S. J., NORENZAYAN A.: Most people are not WEIRD. Nature 466, 7302 (2010), 29. doi:10.1038/466029a.
[KH18] KIM Y., HEER J.: Assessing effects of task and data distribution on the effectiveness of visual encodings. Computer Graphics Forum 37, 3 (Jul 2018), 157–167. doi:10.1111/cgf.13409.
[Mac86] MACKINLAY J.: Automating the design of graphical presentations of relational information. ACM Transactions on Graphics 5, 2 (Apr. 1986), 110–141. doi:10.1145/22949.22950.
[RWJ75] ROTH R., WOODRUFF A., JOHNSON Z.: Value-by-alpha maps: An alternative technique to the cartogram. The Cartographic Journal 47, 2 (1975), 130–140. doi:10.1179/
[SK16] SKAU D., KOSARA R.: Arcs, angles, or areas: Individual data encodings in pie and donut charts. Computer Graphics Forum 35, 3 (Jul 2016), 121–130. doi:10.1111/cgf.12888.
[TSA14] TALBOT J., SETLUR V., ANAND A.: Four experiments on the perception of bar charts. IEEE Transactions on Visualization and Computer Graphics 20, 12 (Dec 2014), 2152–2160. doi:10.1109/
[Tuf97] TUFTE E. R.: Visual and statistical thinking: Displays of evidence for decision making, 1st ed. Graphics Press, Cheshire, Connecticut, 1997.