Copyright © 2012 by the Association for Computing Machinery, Inc.
Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for commercial advantage and that copies bear this notice and the full citation on the
first page. Copyrights for components of this work owned by others than ACM must be
honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on
servers, or to redistribute to lists, requires prior specific permission and/or a fee.
Request permissions from Permissions Dept, ACM Inc., fax +1 (212) 869-0481 or e-mail
ETRA 2012, Santa Barbara, CA, March 28–30, 2012.
© 2012 ACM 978-1-4503-1225-7/12/0003 $10.00
Eye tracker data quality: What it is and how to measure it
Kenneth Holmqvist
Marcus Nyström
Fiona Mulvey
Lund University, Sweden
Abstract

Data quality is essential to the validity of research results and to
the quality of gaze interaction. We argue that the lack of standard
measures for eye data quality makes several aspects of manufactur-
ing and using eye trackers, as well as researching eye movements
and vision, more difficult than necessary. Uncertainty regarding the
comparability of research results is a considerable impediment to
progress in the field. In this paper, we illustrate why data qual-
ity matters and review previous work on how eye data quality has
been measured and reported. The goal is to achieve a common
understanding of what data quality is and how it can be defined,
measured, evaluated, and reported.
CR Categories: I.3.7 [Eye tracking]: Data quality—
Keywords: data quality, eye tracker, eye movements, precision,
accuracy, latency
1 Does data quality matter?
The validity of research results based on eye movement analysis is
clearly dependent on the quality of eye movement data. The same is
true of the performance of gaze based communication devices. Eye
data contain noise and error which must be accounted for. There are
currently no norms or standards for what researchers report about
data quality in publications, or for what manufacturers report about
their eye tracker’s typical performance. What may be a serious im-
pediment for one purpose may not be significant for others. For
example, a cheap eye tracker composed of off-the-shelf compo-
nents may be sufficient for clicking large buttons in gaze interaction
or for looking at larger AOIs with sufficient margin sizes, and may
work as an assistive device mounted on a wheelchair, whereas a
more expensive, high performance eye tracker may have better data
quality and a greater number of valid eye movement measures nec-
essary in much psychological, neurological and reading research. It
is a case of matching the system to the purposes and also to the user
or participant group, and this is a very difficult task without some
standardized measures of data quality. If data quality is measured
and characterised for the eye tracker, participant group and in terms
of the specific experimental measures of interest, there are meth-
ods of dealing with low quality to maximise the validity of results:
correcting or abandoning data [Holmqvist et al. 2011, p. 140 and
224]. However, these methods cannot be considered without first
analysing the data and identifying what is and is not noise or error.
We thank the members of the COGAIN Technical Committee for the standardisation of eye data
quality for their ongoing participation and comments on this text.
Figure 1: Good and poor precision in two remote 50 Hz eye
trackers as seen in an x-/y-visualisation (scanpath view). From
[Holmqvist et al. 2011], page 149.
Figure 2: Very inaccurate data in one corner. From [Holmqvist
et al. 2011], page 132.
Since fixation analysis obscures the original data quality, most re-
searchers estimate the quality of their own recordings from various
plots of raw data samples. For instance, Figure 1 shows good versus
poor precision, and Figure 2 a case of poor accuracy in the upper
left corner. It may be obvious that eye tracker data quality affects
the validity of results, but how large is the effect? Is it reasonable to
assume valid results from a commercial eye tracker without mea-
suring quality in a particular data set, or should all eye movements
researchers check their data quality and report it as part of their
results? To illustrate these issues, we begin with four examples.
1.1 Example 1: Effect of accuracy on dwell time measures
Accuracy (sometimes called offset) is one of the most highlighted
aspects of data quality. Loosely speaking, it refers to the difference
between the true and the measured gaze direction.
Figure 3(a) shows high quality data recorded from one participant
looking at the stimulus image for 30 seconds, with the task of esti-
mating the age of people in the scene. Binocular data were recorded
with a tower-mounted eye tracker sampling at 500 Hz, but only data
from the left eye are shown and analysed. The eye tracker reports
an average accuracy of 0.30° horizontally and 0.14° vertically after
calibration and a four-point validation procedure.
Figure 3(b) displays areas of interest (AOIs) for faces in the stimu-
lus image. Because this is a real image, there is no whitespace—i.e.
an area not covered by any AOIs—between the faces that could be
used as AOI margins. AOIs with small margins are common in
reading research, web studies, and studies that use videos or real
world stimuli. They are also common in gaze interaction scenar-
ios, e.g. when typing on an onscreen keyboard. When there is no
room for margins, data with poor accuracy will sometimes move
to another AOI than the one intended. We can simulate degrees
of poor quality by adding 0.5° offset to the recorded data, moving
them a bit in space. Even with this additional offset, the accuracy
Figure 3: (a) Original data in high quality. (b) AOI positions.
(c) Total dwell time (ms) in each AOI; original data. (d) Total
dwell time (ms) after 0.5° inaccuracy (offset) has been added to the
data. Figures (c) and (d) compare total dwell times with accurate
vs slightly inaccurate data. The inaccuracy was added to the
original data. Note that 0.5° is considered a very small error.
is still considered rather high in comparison to what is commonly
reported in the literature; in fact, several manufacturers report 0.5°
offset as their standard or even best possible accuracy. If a system's
inaccuracy is not taken into account when designing test stimuli
and analysing data from a study, what kind of effect may it have on
the results?
Dwell time (‘gaze duration’, ‘glance time’, ...) is the time spent gazing within
an AOI, from entry to exit, whereas total dwell time is the sum of
all dwell times to a specific AOI over a trial [Holmqvist et al. 2011,
pp. 190 and 389]. It is a very common measure in eye-movement
research. Figure 3(c) shows dwell times for seven AOIs based on
the original, and Figure 3(d) shows dwell time for the same AOIs
after a 0.5° offset has been added. Note that for some AOIs, total
dwell time is reduced; for others it is significantly reduced or even
removed entirely; and for some AOIs, dwell time is
hardly affected at all. The effect is not uniform across AOIs and so
can’t be corrected or controlled for. The purpose of this example
study was to analyse AOIs for dwell time and number of fixations,
which is typical of many studies. Adding 0.5° inaccuracy to the
data simulates many common recording scenarios. The point
to note is that even when accuracy is relatively good, the small
amount of inaccuracy present can lead to significant differences in
the results.
Often noise in data can be counteracted by increasing the amount
of data; as, for instance, with the effect of low sampling frequency
on fixation duration [Andersson et al. 2010]. In contrast, more data
does not remedy the effect of poor accuracy on AOI measures such
as dwell time, because the erroneous samples are likely to be dis-
placed in the same direction, out of the AOI.
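The offset effect on total dwell time can be reproduced with a few lines of code. The sketch below is illustrative only: the AOI geometry, sample rate, and gaze distribution are hypothetical stand-ins for the recording described above.

```python
import numpy as np

def total_dwell_time_ms(gaze_deg, aoi, sample_interval_ms):
    """Sum the time (ms) that raw gaze samples spend inside a rectangular AOI.

    gaze_deg: (n, 2) array of gaze positions in degrees of visual angle.
    aoi: (x_min, x_max, y_min, y_max) in the same coordinates.
    """
    x, y = gaze_deg[:, 0], gaze_deg[:, 1]
    inside = (aoi[0] <= x) & (x <= aoi[1]) & (aoi[2] <= y) & (y <= aoi[3])
    return inside.sum() * sample_interval_ms

rng = np.random.default_rng(42)
# 2 s of fixation jitter at 500 Hz around a hypothetical face at (5.0, 3.0) deg.
gaze = rng.normal(loc=(5.0, 3.0), scale=0.1, size=(1000, 2))
aoi = (4.5, 5.5, 2.5, 3.5)                  # 1 x 1 deg AOI with no margin

original = total_dwell_time_ms(gaze, aoi, sample_interval_ms=2)
shifted = total_dwell_time_ms(gaze + np.array([0.5, 0.0]), aoi, 2)

print(f"original data:       {original:.0f} ms")
print(f"with 0.5 deg offset: {shifted:.0f} ms")
```

With the added offset, a large fraction of the samples crosses the AOI border, so the measured dwell time drops sharply even though the underlying behaviour is identical.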
Apart from the effect of accuracy on research results, accuracy also
affects gaze based communication technologies. In gaze based in-
teraction, interactive on screen targets are in fact AOIs with clear
margins. Dwell time select is a common method of ‘clicking’ a
button with gaze. Having several buttons side by side in an array,
for example in an on-screen keyboard, will produce error in selec-
tion if data moves to the neighbouring button. When the selection
method involves dwell time, this may cause an almost complete
Figure 4: How a decrease in precision affects the number and du-
ration of detected fixations. (a) Illustration of data (pixel coordinate
vs sample number) with high (left; RMS 0.03 degrees) and low
(right; RMS 0.37 degrees) precision. (b) Influence of precision
(RMS of intersample distances) on the number of fixations and the
average fixation duration. The precision was decreased by adding
Gaussian noise with an increasing variance. Fixations were de-
tected with the algorithm by [Nyström and Holmqvist 2010], using
default settings.
dwell-time based selection to restart, or if very inaccurate with no
space between targets, may mean selection is very difficult or can
only be made on very large (or magnified) targets.
1.2 Example 2: Effects of precision on the number and
duration of fixations
Inaccuracy is not the only data quality issue affecting the viability
of research results. While accuracy refers to the difference between
true and recorded gaze direction, precision refers to how consistent
calculated gaze points are, when the true gaze direction is constant.
It is often tested with an artificial eye, which does not move at all.
Precision measures are commonly conducted to test a particular eye
tracker, and when using an artificial eye, this measure gives an idea
of system noise or error, which varies with the quality of the eye
tracking system. In essence, this enables us to investigate the effect
of collecting data with or without a bite-bar or chin rest, or with a
tower-mounted eye tracker compared to a remote one. It is also one
aspect of testing eye tracker quality. By adding Gaussian noise with
an increasing standard deviation to the eye movement data in Fig-
ure 3(a), we can simulate poor precision in an eye tracker. Figure
4(a) shows an example of the original data (left part of figure) and
the data after noise has been added. The range of added noise has
been chosen to conform to recorded precision values for current eye
trackers, which according to [Holmqvist et al. 2011] are 0.01°–0.05°
for tower-mounted systems and 0.03°–1.03° for remote ones. The
larger values in the latter range, however, are likely to reflect eye
trackers with exceptionally poor precision, and are therefore not in-
cluded in the data presented. Precision values are calculated as the
root mean square (RMS) of intersample distances in the data.
Figure 4(b) illustrates how precision influences the number and du-
ration of fixations, as detected by the adaptive velocity algorithm
developed by [Nyström and Holmqvist 2010]. According to this al-
gorithm, fixations become fewer and longer as precision decreases.
This is most likely due to the saccade detection threshold increas-
ing as a direct consequence of the higher noise level, which prevents
small saccades from being detected. These small saccades then be-
come part of adjacent fixations, merged into one longer fixation.
The effect in Figure 4(b) is dramatic; even though the data should
represent exactly the same eye movement behaviour, the number of
fixations decreases by more than 30%, whereas the average fixation
duration increases by about 10%. The size of this effect will change
with the method of event detection used. For example, it is likely
that systems using dispersion based fixation detection algorithms
produce a different result.
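The mechanism behind this effect can be illustrated with a toy simulation. The adaptive algorithm of [Nyström and Holmqvist 2010] is not reproduced here; instead, the sketch uses a plain velocity-threshold detector whose saccade threshold grows with the noise level (the property driving the effect), applied to a synthetic staircase of small saccades. All parameters are illustrative.

```python
import numpy as np

def count_fixations(x, fs, noise_sd, rng):
    """Count fixations with a simple velocity-threshold detector whose
    saccade threshold grows with the noise level (a stand-in for the
    adaptive thresholding of the published algorithm)."""
    noisy = x + rng.normal(0.0, noise_sd, size=x.shape)
    velocity = np.abs(np.diff(noisy)) * fs       # deg/s between samples
    threshold = 30.0 + 6.0 * noise_sd * fs       # illustrative noise-scaled threshold
    is_fixation = velocity < threshold
    # A fixation is a run of consecutive below-threshold samples.
    rises = np.sum(np.diff(is_fixation.astype(int)) == 1)
    return int(is_fixation[0]) + int(rises)

fs = 500.0                                  # Hz
t = np.arange(0, 5, 1 / fs)                 # 5 s of data
x = np.floor(t / 0.25) * 1.0                # one 1-deg saccade every 250 ms

rng = np.random.default_rng(0)
for noise_sd in (0.01, 0.1, 0.3):
    print(noise_sd, count_fixations(x, fs, noise_sd, rng))
```

As the noise level rises, the threshold rises with it, the small saccades go undetected, and adjacent fixations merge into fewer, longer ones.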
Figure 5: How data loss affects the number and duration of de-
tected fixations. (a) Data (pixel coordinate vs sample number) with
missing samples, indicated with red dots; 18% of the samples were
lost in this example. (b) Influence of the proportion of lost data on
the number of fixations and the average fixation duration. Data
loss was simulated by randomly inserting burst losses with a length
uniformly drawn from the interval [10, 100] samples. Fixations
were detected with the algorithm by [Nyström and Holmqvist 2010],
using default settings.
1.3 Example 3: Effect of data loss on the number and
duration of fixations
Lost data refers to samples that are reported as invalid by the eye
tracker. Typically, these correspond to (0, 0)-coordinates or sam-
ples that are flagged with a certain validity code in the data file.
Data losses derive from periods when critical features in the eye
image—often the pupil and the corneal reflection(s)—cannot be re-
liably detected and tracked. This can occur when, for example,
glasses, contact lenses, eyelashes, or blinks prevent the video cam-
era from capturing a clear image of the eye.
Sometimes, it may be desirable to differentiate blinks from other
sources of data loss. This may be because blinks are used as a be-
havioural measure (e.g. [Holland and Tarlow 1972], [Tecce 1992])
or because they are used for gaze based interaction, for example
as a ‘click’ select input. In such cases, simply removing raw data
samples with (0, 0) coordinates is not possible, and blinks need to
be modeled and differentiated from other causes of loss of signal.
Many eye trackers do not output blinks as an event.
Figure 5(a) shows how losses have been introduced into the eye
movement signal, where red dots represent lost or invalid samples.
To simulate short, local losses of data, invalid data are inserted as
burst losses, which occur with probability P and last for N samples,
where N is drawn uniformly from A = {10, 11, . . . , 100}.
Figure 5(b) reveals the same trend for data loss as Figure 4(b) did
for decreased precision: a reduction in the number of fixations and
an increase in fixation duration.
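The burst-loss procedure described above can be sketched as follows; the sampling rate, signal, and burst probability are illustrative assumptions.

```python
import numpy as np

def insert_burst_losses(samples, p_burst, rng):
    """Mark samples invalid (NaN) in bursts: at each valid sample a burst
    starts with probability p_burst and lasts N samples, with N drawn
    uniformly from {10, 11, ..., 100}."""
    out = samples.astype(float).copy()
    n = len(out)
    i = 0
    while i < n:
        if rng.random() < p_burst:
            burst_len = int(rng.integers(10, 101))   # 10..100 inclusive
            out[i:i + burst_len] = np.nan
            i += burst_len
        else:
            i += 1
    return out

rng = np.random.default_rng(1)
gaze_x = np.linspace(0, 10, 15000)          # 30 s of data at 500 Hz
lossy = insert_burst_losses(gaze_x, p_burst=0.002, rng=rng)
lost = np.isnan(lossy).mean()
print(f"proportion of lost samples: {lost:.2f}")
```

The lossy signal can then be fed to any fixation detector to measure how the number and duration of detected fixations change with the proportion of lost data.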
1.4 Example 4: Effect of screen position on pupil size
Pupil size reacts primarily to changes in illumination, but it is of-
ten used as a measure of mental workload, emotional valence, or
as an indication of drug use [Holmqvist et al. 2011, pp. 393–394].
A prerequisite for such investigations (apart from controlled light
conditions) is that the recorded change in pupil size reflects the true
change in pupil size, and therefore that the eye tracker does not add
any systematic or variable error to the data. Pupil size measures
will include systematic error if the apparent change in pupil size
with viewing angle is not controlled for by the eye tracking system,
or corrected in the recorded data subsequently. The effect of view-
ing angle is that pupil size is larger when the eye is on-axis with
the eye camera. This typically means that the pupil is largest when
looking in the centre of the screen compared to the edges. Without
knowing this relationship between pupil size and screen position for
the particular system being used, the difference in pupil size may
be attributed to differences in cognitive processing or emotional re-
sponses to the objects. [Gagl et al. 2011] reported a similar effect
and also proposed a method to correct the errors. Such problems can
be corrected, but only if the error is first measured for the particular
set up used.
2 Factors influencing data quality
Many factors influence data quality, including:
1. Participants have different eye physiologies, varying neurol-
ogy and psychology, and differing ability to follow instruc-
tions. Some participants may wear glasses, contact lenses, or
mascara, or may have long eyelashes or droopy eyelids which
all interfere with the eye image and may or may not be ac-
counted for in a system’s eye model [Nyström et al. submitted].
2. Operators have differing levels of skill, and more experienced
operators should be able to record data with higher quality
[Nyström et al. submitted]. Operator skills include adjusting
eye to camera angles and mirrors, monitoring the data quality
in order to decide whether to recalibrate, as well as providing
clear instructions to the participants.
3. A task that requires participants to move around a lot, for ex-
ample, could affect data quality. A task that causes partici-
pants to blink more often leads to more data loss, unless blinks
are modeled as eye events.
4. The recording environment has a strong influence on data
quality. Was the data collected outdoors in sunlight or indoors
in a controlled laboratory environment, for instance? Were
there any vibrations in the room that reduced the stability of
the eye movement signal? These factors should be considered
and reported.
5. The geometry, that is the relative positions of eye camera, par-
ticipant, and stimulus affects data quality, as does the position
of the head in what is known as the head box [Holmqvist et al.
2011, p. 58]. This may be of particular importance when using
eye trackers as a communication aid for the disabled, who
may be constrained in their movement or sitting/lying position.
6. The eye tracker design does of course have a large impact on
the quality of the recorded data. Simply put, an eye tracker
consists of a camera, illumination, and a collection of soft-
ware that detects relevant features in the eye, and maps these
to positions on the screen. The resolution of the video cam-
era and the sharpness of the eye image are important fac-
tors that are directly related to some aspects of data qual-
ity. Equally important are the image analysis algorithms, the
eye model, the eye illumination and the calibration procedure.
Eye tracker system specifications will also have an influence
on data quality. The most quoted system specification is sam-
ple rate, or sampling frequency. Sample frequency will dic-
tate the system’s ability to record brief events and to produce
accurate velocity profiles. Other system specifications which
influence data quality are whether the system is bright or dark
pupil based (i.e. whether the eye illumination is on or off
axis, producing a bright or dark image of the pupil, for a re-
view of the various set-ups currently in use see [Hansen et al.
2011]). This may interact with eye colour or other factors to
affect data quality. Finally, whether the eye tracker records
monocularly or binocularly is of interest. Accuracy and pre-
cision of fixation data may improve if data from two eyes is
combined, particularly if using a dispersion based fixation de-
tection method, but, if data from two eyes are not separable,
saccade velocity profiles, microsaccades, drift, and saccade
amplitude measures will lose validity.
3 Terminology for data quality
First, let us make clear that we cannot know where a human is look-
ing. Even when a participant says she looks at a point, the centre of
the fovea can be slightly misaligned. When we talk about ‘actual
gaze’ we refer to this subjective but reportable impression, which is
what the vast majority of eye trackers are designed to measure.
Thus, in general terms, data quality can be defined as the spatial
and temporal deviation between the actual and the measured gaze
direction and the nature of this deviation, on a sample to sample
basis. In the very simplest case, we consider these deviations in the
presence of only one data sample x̂. This sample can either be re-
ported as valid or invalid by the eye tracker, where an invalid sample
usually means that relevant eye features could not be detected from
the video feed of the eye, for instance due to loss of the eye image.
Clearly, with the exception of blinks, it does not make much sense
to characterize the quality of missing data other than to classify it
as invalid. When the eye tracker reports a valid sample, data qual-
ity can be defined as the distance θ (in visual degrees) between the
actual gaze position x and the measured gaze position x̂, known as
the spatial accuracy, or just accuracy, as well as the difference
between the time of the actual movement of the eye t and the time t̂
reported by the eye tracker, known as latency or temporal accuracy.
If both of these deviations are zero, the data quality for this
single sample is optimal.
The example with only one sample is however mainly of academic
interest. Typically, one needs to consider several samples recorded
from a whole experiment, a trial, or a single event such as a fixation.
Given n recorded samples, accuracy can be calculated as the mean
angular deviation

    θ_accuracy = (1/n) (θ_1 + θ_2 + · · · + θ_n),    (1)

where θ_i is the angular distance between measured sample x̂_i and
the actual gaze position x_i.
The variance in accuracy is often referred to as spatial precision and
the variance in latency is typically called temporal precision. Two
common ways to estimate the spatial precision in the eye move-
ment signal are the standard deviation of the samples and the root
mean square (RMS) of inter-sample angular distances, but a whole
range of other dispersion measures exist that could be alternatives
[Holmqvist et al. 2011, p. 359-369]. The standard deviation for a
set of n data samples x̂_1, . . . , x̂_n is calculated as

    σ = sqrt( (1/n) Σ_i (x̂_i − x̄)² ),

where x̄ denotes the sample average. Letting θ_i denote the angular
distance between successive samples x̂_i and x̂_{i+1}, precision can
be expressed as

    θ_RMS = sqrt( (θ_1² + θ_2² + · · · + θ_{n−1}²) / (n − 1) ).
These two precision calculations reflect different factors. Precision
in particular reacts to vibrations in the environment when calculated
as standard deviation, but not so much when calculated as root mean
square (RMS). Figure 6 illustrates this important difference. It is
likely that a full standard needs several precision calculations that
each measure a different aspect of the data.
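Both estimates are easy to compute from raw samples, and a synthetic example makes the difference concrete: high-frequency jitter yields a high RMS but a low standard deviation, whereas a slow vibration yields the opposite. A minimal sketch (the two signals and their parameters are illustrative):

```python
import numpy as np

def precision_sd(samples):
    """Spatial precision as the standard deviation of gaze samples (deg)."""
    return np.std(samples)

def precision_rms(samples):
    """Spatial precision as the RMS of inter-sample distances (deg)."""
    d = np.diff(samples)
    return np.sqrt(np.mean(d ** 2))

n = 1000
# Jitter: alternates +/-0.05 deg every sample -> large sample-to-sample steps.
jitter = 0.05 * np.where(np.arange(n) % 2 == 0, 1.0, -1.0)
# Slow vibration: 1 Hz sinusoid sampled at 500 Hz -> tiny steps, wide spread.
vibration = 0.2 * np.sin(2 * np.pi * 1.0 * np.arange(n) / 500.0)

print(f"jitter:    RMS={precision_rms(jitter):.3f}  SD={precision_sd(jitter):.3f}")
print(f"vibration: RMS={precision_rms(vibration):.3f}  SD={precision_sd(vibration):.3f}")
```

The jitter signal has RMS twice its SD, while the vibration has an RMS nearly two orders of magnitude below its SD, mirroring the two panels of Figure 6.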
Both accuracy and precision can be computed separately for hori-
zontal and vertical dimensions. This may be of particular signifi-
cance for persons with physical disability. [Cotmore and Donegan
2011] outlines the development of a gaze controlled interface for a
user who only has good control of movements in one dimension, for
example. Moreover, the proportion of valid data samples recorded
Figure 6: The set of raw data samples on the left have large sample-
to-sample distances, and therefore RMS will be high. They are not
so dispersed, so standard deviation will be low. The data set on the
right, typical of a vibration in the eye tracker, has short sample-to-
sample distances, which gives a low RMS, but it is fairly dispersed,
so standard deviation will be higher.
is often a good indication of whether the system has problems track-
ing a particular individual or in a particular environment.
The spatial accuracy and precision of pupil size can be defined in a
similar manner. The unit of measurement is either pixels in the eye
camera, or the perhaps more intuitive unit millimeters. Since pupil
size values are recorded at the same rate as gaze samples, temporal
quality values for pupil size are shared with those calculated for
gaze samples.
Closely related to spatial precision is a measure termed spatial
resolution, which refers to the smallest eye movement that can be
detected in the data. If such small eye movements are oscillat-
ing quickly, they can only be represented in data with high tem-
poral resolution or sampling frequency, according to the Nyquist–
Shannon sampling theorem [Shannon 1948].
4 Measuring data quality using an artificial eye
The artificial eye is an important and versatile tool in the assess-
ment of data quality. However, eye trackers vary in terms of their
eye models, therefore, finding an artificial eye which will ‘trick’ all
eye trackers is difficult. When deciding which eye tracker to buy or
use for a particular study, artificial eyes provide a way of comparing
inherent system noise and error, and can be used to check system
latency. Artificial eyes are usually available from the manufacturer,
at least for systems intended for research purposes. While it is rel-
atively simple to produce artificial eyes for systems which are dark
pupil (i.e. the eye illumination is off-axis with the eye) based, it
is trickier for bright pupil based systems (i.e. where the eye illu-
mination is on-axis). Battery-equipped eyes with actively luminous
pupils would be one solution.
4.1 Precision measurements with an artificial eye
Optimal precision for an eye tracker should be calculated with sam-
ples originating from a period when the eye is fixating. The only
way to completely eliminate biological eye movement from the eye
movement signal is to use a completely stationary eye [Holmqvist
et al. 2011, p. 35-40]. Since this is not possible with actual par-
ticipants, an artificial eye, which produces the corneal reflections
required by the eye tracker, is usually employed. This is also
how many manufacturers measure precision [SR Research 2007;
Sadeghnia 2011; Johnsson and Matos 2011]. When assessing pre-
cision in real data, it is useful to know what the maximum possible
precision of the system is. If system noise means that baseline pre-
cision is low, many eye movement measures may not be validly
recorded. For example, the measurement of velocity profiles will
be far more affected by low precision than by low accuracy, if the
offset in accuracy is uniform across the screen. Likewise, low pre-
cision may affect which kind of event detection is preferable for the
data. The experimental procedure is simple: first of all, calibrate
with a human eye in the normal way so that you can start recording
coordinate data. Calibration of a human eye may introduce some
small noise, so if you have a system where you can get data without
first calibrating, you may do that, but be aware that the precision
value will not be comparable to systems that require calibration
before data recording. Then, put one or a pair of artificial eyes
where the human eye(s) would have been, and make sure the arti-
ficial eyes are securely attached. Beware of vibration movements
from the environment, which should not be part of your precision
measurement. See to it that the gaze position of the artificial eye(s)
is somewhere in the middle of the calibration area, and then start
the recording. Export the raw data samples, use trigonometry and
the eye-monitor distance with the physical size and resolution of the
monitor to calculate sample-to-sample movement in visual degrees.
Then select a few hundred samples or more where the gaze position
appears to be still, and calculate the RMS or standard deviation of
these samples.
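The trigonometric conversion mentioned above can be sketched as follows; monitor size, resolution, viewing distance, and the simulated samples are hypothetical values, and the small-angle conversion is valid near the screen centre where the artificial eye is placed.

```python
import numpy as np

def pixels_to_degrees(dist_px, screen_width_mm, screen_width_px, distance_mm):
    """Convert an on-screen distance in pixels to visual degrees, assuming
    gaze near the screen centre (small-angle region)."""
    dist_mm = dist_px * screen_width_mm / screen_width_px
    return np.degrees(np.arctan2(dist_mm, distance_mm))

def rms_precision_deg(x_px, y_px, screen_width_mm, screen_width_px, distance_mm):
    """RMS of sample-to-sample gaze distances, converted to visual degrees."""
    dist_px = np.hypot(np.diff(x_px), np.diff(y_px))
    dist_deg = pixels_to_degrees(dist_px, screen_width_mm,
                                 screen_width_px, distance_mm)
    return np.sqrt(np.mean(dist_deg ** 2))

# A few hundred near-stationary samples from an artificial eye (simulated here).
rng = np.random.default_rng(7)
x = 960 + rng.normal(0, 0.5, 500)           # pixel coordinates near screen centre
y = 540 + rng.normal(0, 0.5, 500)

rms = rms_precision_deg(x, y, screen_width_mm=510,
                        screen_width_px=1920, distance_mm=670)
print(f"RMS precision: {rms:.3f} deg")
```

In practice the x and y arrays would come from the exported raw data file rather than a random generator.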
Different artificial eyes tend to give slightly different RMS values
for the same eye tracker. [Holmqvist et al. 2011] found RMS val-
ues of 0.021° and 0.032° on the same eye tracker when using two
different artificial eyes from two manufacturers. The variance for
real eyes will be even greater. Part of standardization work might
be to build the specifications for a single or a set of artificial eyes
that can be used on all eye trackers. This may include a variation
in the colour of the artificial iris, as well as the possibility of hav-
ing a reflective artificial retina (to test bright-pupil detection based
eye trackers, i.e. those systems where the infra red light source is
placed on-axis).
Testing only with an artificial eye may be misleading, however. The
artificial eyes do not have the same iris, pupil, and corneal reflection
features as human eyes, and may be easier or more difficult for the
image analysis algorithms in the eye tracker to process. Also, in ac-
tual eye-tracking research, real eyes tend to vary greatly in terms of
image features that cannot be simulated with artificial eyes. There-
fore, some manufacturers complement the artificial eye test with a
precision test on a human population with a large variation in eye
colour, glasses, and contact lenses, as well as ethnic background,
having them fixate several measurement points across the stimulus
monitor. The full distribution of precision values from such a test
across many measurement points and participants is an important
indicator of what precision you can expect in actual recordings, and
its average defines the typical precision. The drawback is that this
data includes oculomotor noise, and therefore both human and arti-
ficial eyes are needed.
4.2 Pupil diameter quality measurements with artifi-
cial eyes
The quality of pupil diameter data is also typically measured using
artificial eyes. There are three such data quality measures. First,
pupil precision is calculated as the RMS on a sequence of pupil
diameter samples recorded from an artificial eye.
Pupil accuracy can be measured by presenting the eye tracker with
artificial eyes that have known pupil diameters (such as 2, 3, and 4
mm). If all eyes are presented at the same distance, the eye tracker
should output a line of diameters proportional to the input. For
instance, the pupil dilation value for the 2 mm artificial pupil should
be 50 % of the diameter recorded for the 4 mm pupil. Figure 7
shows data from such a measurement.
Pupil resolution is the smallest detectable change in pupil dilation.
It can be measured by showing artificial eyes with small differences
in diameter to the eye tracker. In Figure 7, the four values around
4 mm differ by 0.1 mm. The clear proportional output shows that
the eye tracker is capable of distinguishing between these dilations.
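The proportionality check behind pupil accuracy can be expressed in a few lines; the reported values below are made up for illustration, since the units an eye tracker reports are arbitrary until normalized at the reference diameter.

```python
import numpy as np

def pupil_accuracy_errors(true_mm, reported, ref_mm=4.0):
    """Normalize reported pupil values at the reference diameter and return
    the deviation from the proportion expected from the true diameters.
    Zero everywhere means perfectly proportional output."""
    true_mm = np.asarray(true_mm, dtype=float)
    reported = np.asarray(reported, dtype=float)
    ref = reported[np.argmin(np.abs(true_mm - ref_mm))]
    measured_ratio = reported / ref
    expected_ratio = true_mm / ref_mm
    return measured_ratio - expected_ratio

# Hypothetical readings (arbitrary device units) for artificial pupils.
true_mm = [2.0, 3.0, 3.9, 4.0, 4.1, 4.2]
reported = [101, 150, 196, 200, 205, 210]   # made-up eye tracker output

errors = pupil_accuracy_errors(true_mm, reported)
print(np.round(errors, 3))
```

Small, unstructured errors indicate good pupil accuracy; if the 0.1 mm steps around 4 mm each produce a distinct reported value, the pupil resolution is at least at the 0.1 mm level.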
[Plot: diameter reported by the eye tracker (normalized at 4 mm =
100%) against diameter of the artificial pupil (mm).]
Figure 7: Pupil accuracy means that the diameter recorded by
the eye tracker should be directly proportional to the diameter of
the pupil in the artificial eye. This data—from a tower-mounted
eye tracker—shows a good accuracy with only minor inaccuracies.
Pupil resolution refers to the smallest change in diameter of the ar-
tificial pupil that can be distinguished by the eye tracker. The four
values around 4 mm show that this eye tracker has a pupil resolu-
tion at least on the 0.1 mm level. Data from personal communica-
tion with one manufacturer.
Figure 8: Measurement of maximum head movement speed in a
remote eye tracker. Reproduced from manufacturer document.
4.3 Controlled motion of artificial eyes
If the artificial eye could be made to move as a real eye during sac-
cades, fixation and smooth pursuit, it would be possible to measure
optimal data quality during movements. At least one such prototype
has been built, which mimics human eye movements well, except
at a slightly slower speed.
However, motion of an artificial eye can be used in other ways as
well. For instance, maximum head movement speed in a remote eye
tracker can be measured by setting artificial eyes in front of the eye
tracker in an increasingly rapid sinusoidal movement across the so-
called head box. At some speed, tracking will be lost, which can be seen in the gaze coordinates or, as in Figure 8, in the x-coordinate of the corneal reflection. The maximum speed at the last oscillation before tracking is lost is the maximum head movement speed.
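This procedure can be sketched in code. Here the tracker's behaviour is simulated: the sampling rate, head-box half-width, and the 40 cm/s tracking limit are all made-up values, and tracking loss is marked by NaN samples:

```python
import math

fs = 60.0                      # tracker sampling rate (Hz), assumed
amplitude_cm = 15.0            # half the head-box width, assumed
true_limit_cm_s = 40.0         # simulated tracking limit of the device

samples = []                   # (nominal peak speed of oscillation, x or NaN)
for i in range(int(fs) * 30):
    t = i / fs
    freq = 0.02 * t            # slowly increasing oscillation frequency (Hz)
    peak_speed = 2 * math.pi * freq * amplitude_cm
    x = amplitude_cm * math.sin(2 * math.pi * freq * t)
    if peak_speed > true_limit_cm_s:
        x = float("nan")       # tracking lost above the limit
    samples.append((peak_speed, x))

# Maximum head movement speed: the fastest oscillation still tracked.
tracked = [s for s, x in samples if not math.isnan(x)]
print(f"max head movement speed: about {max(tracked):.1f} cm/s")
```

In a real measurement the NaN test would be replaced by inspecting the recorded gaze or corneal-reflection coordinates for dropout, as in Figure 8.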
4.4 Switching corneal reflection
There are a number of ways in which latency can be measured. To control the exact onset of an ‘actual movement’, one possibility is
to turn off the infrared diode and at the same time turn on another,
identical infrared diode at a different position (Figure 9). Since the
time it takes to turn on (and off) the diode can be made arbitrarily
small in comparison to the sampling rate of the eye tracker, the
latency can be reliably measured as the time between the off- and
onset of the illumination from the diodes and the corresponding
change in coordinate data from the eye tracker.
The same setup for measuring latency can be used for simulating
loss of tracking. The infrared illuminators are turned off, so that the
eye tracker cannot detect any corneal reflections. An illuminator is
Figure 9: Example measure of eye-tracker latency: an artificial eye is positioned so that gaze coordinates can be measured. A single infrared light on one side of the eye tracker is used to create a corneal reflection. This light is turned off and another one, on the other side, is immediately turned on. This causes an immediate change in the position of the corneal reflection at a time known to the software; the time until a change in gaze coordinates has been registered is the latency. Reproduced from manufacturer document.
then turned on again after a fixed period. The recovery time can
then be defined as the time it takes from turning on the infrared
illuminator, until the change is recorded in gaze coordinates.
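A sketch of how latency and recovery time could be read off such a recording. All timestamps and coordinates below are hypothetical, and the 10 px threshold is an assumed bound on sample-to-sample noise:

```python
# Gaze log around the diode switch: (timestamp_ms, x_px).
switch_time_ms = 1000.0                 # diode A off, diode B on
gaze = [(995.0, 400.2), (1000.0, 400.1), (1005.0, 400.3),
        (1010.0, 400.2), (1015.0, 523.9), (1020.0, 524.1)]

baseline_x = gaze[0][1]
threshold_px = 10.0                     # larger than sample-to-sample noise

# Latency: time from the switch to the first clearly shifted sample.
latency_ms = next(t - switch_time_ms for t, x in gaze
                  if t >= switch_time_ms and abs(x - baseline_x) > threshold_px)
print(f"latency: {latency_ms:.0f} ms")

# Recovery time: illumination is restored and we wait for valid samples
# (None marks samples where no corneal reflection was detected).
illuminator_on_ms = 2000.0
recovery_log = [(2000.0, None), (2005.0, None), (2010.0, None), (2015.0, 398.7)]
recovery_ms = next(t - illuminator_on_ms for t, x in recovery_log if x is not None)
print(f"recovery time: {recovery_ms:.0f} ms")
```

Since the diode switch time can be logged with much finer resolution than the tracker's sampling interval, the measured latency is accurate to within one sample.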
5 Measuring data quality using real participants looking at targets
The factors in Section 2 must be an integral part of any design to measure data quality from humans. For instance, if the purpose is to
compare data quality across different set-ups or different eye track-
ers, characteristics of the sample group are important, and those fac-
tors listed above should be measured to compare robustness of the
system to, for example, changes in eye colour or eye shape. Within an experiment, such factors may be important in terms of the relative data quality across participants: for example, is data quality significantly lower for participants wearing glasses, and if so, were such participants more prevalent in one comparison group?
5.1 Calibration validation procedures
When assessing data quality for the data collected in an experi-
ment, the issue is not to test the system performance but to assess
the quality of data for each individual, for exclusion criteria, or for
a particular experimental group. If standardized test reports were
available for the eye-tracking system in question, the data from an
experimental group could be compared to normative data for the
same system. Such comparisons require independent testing across
a large sample group. This work is underway but not yet complete.
In the absence of such standardized measures, data quality could be
assessed across experimental and control groups, to check if quality
may be a confounding factor for results. Calibration procedures are
proprietary to the system in question, hence testing the accuracy of
calibration will mean running a subsequent calibration validation
procedure. For this, targets (points) should be included between
trials, at known positions, so that the data can later be assessed for
accuracy and precision over the duration of recording.
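The two quantities usually extracted from such validation data can be sketched as follows. The screen geometry and gaze samples are assumed illustrative values, and precision is operationalized here as the RMS of inter-sample distances (one of several possible definitions); pixel offsets are converted to visual angle with a small-angle approximation:

```python
import math

SCREEN_W_PX, SCREEN_W_CM, VIEW_DIST_CM = 1280, 34.0, 67.0  # assumed setup

def px_to_deg(px):
    """Convert a pixel distance near screen centre to degrees of visual angle."""
    cm = px * SCREEN_W_CM / SCREEN_W_PX
    return math.degrees(math.atan2(cm, VIEW_DIST_CM))

target = (640.0, 512.0)
# Fixation samples recorded while the participant looked at the target.
samples = [(648.0, 515.0), (650.0, 517.0), (646.0, 514.0), (649.0, 516.0)]

# Accuracy: mean Euclidean offset from the target, in degrees.
offsets = [math.hypot(x - target[0], y - target[1]) for x, y in samples]
accuracy_deg = px_to_deg(sum(offsets) / len(offsets))

# Precision: RMS of successive sample-to-sample distances, in degrees.
d = [math.hypot(x2 - x1, y2 - y1)
     for (x1, y1), (x2, y2) in zip(samples, samples[1:])]
precision_deg = px_to_deg(math.sqrt(sum(di ** 2 for di in d) / len(d)))

print(f"accuracy: {accuracy_deg:.2f} deg, precision (RMS): {precision_deg:.3f} deg")
```

Note that the choice of which samples enter these sums is exactly the selection problem discussed in Section 5.2.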
Having participants look at points presented at known locations on
screen is by far the most common data quality evaluation method
and serves to validate the system calibration. It is essentially a re-
peat of the system calibration to check if inferred gaze direction
matches the actual gaze direction using targets at known screen co-
ordinates. The size, colour and shape of these targets affect the resulting measures; for example, very large calibration targets will be ‘hit’ even when the (x, y) point at the centre of the target is quite far from the recorded gaze coordinate. The colour of the background is also important; bright backgrounds will cause the pupil to contract, which may affect accuracy [Drewes et al. 2011].
These measurement points should be placed across the stimulus presentation screen or area to which the data quality values refer.
This area is typically the whole monitor, and for standardisation
purposes or when testing an eye tracker (as opposed to the data
recorded for a particular study), it seems reasonable to assume the
monitor provided with the system is the relevant presentation area.
In many eye trackers, accuracy tends to be best in the middle of
the monitor/recording area, and worst in the corners [Hornof and
Halverson 2002]. If the purpose is to give a realistic account of data
quality across varying stimulus presentations in future experiments
or for future interfaces, then we should select measurement points
at positions between calibration points, across the whole area of the
monitor, varying gaze angle and position across the whole range
possible when looking at the screen. Hence, the target points presented should cover the entire area used to display the experimental stimuli.
5.2 Selecting data samples to include in the calculation of data quality
As artificial eyes do not move, any samples from the recording can
be used. With humans, accuracy and precision values are calculated
from samples recorded when the participant is assumed to fixate a
stationary target. The decision of when the eye is still is typically
made by an algorithm under the assumption that a fixating eye re-
mains relatively stable over a minimum period of time. As a conse-
quence, the data quality values calculated from the fixation samples
are directly related to the performance of the fixation detection al-
gorithm. To date, it has been well documented that given the same
set of eye movement data, fixation detection algorithms can output
very different results [Karsh and Breitenbach 1983; Salvucci and Goldberg 2000; Shic et al. 2008; Nyström and Holmqvist 2010].
Even when fixations are correctly detected, one target can be associ-
ated with several fixations. This can happen due to saccadic under-
shoot, overshoot, or small corrective saccades and microsaccades
required to align the gaze direction with the target. The researcher
must then decide which fixation(s) should be included in the cal-
culations for a given target. Figure 10 illustrates a situation where the eye first undershoots the target (bottom left), then continues towards the target, and finally shifts its position a little to the right.
Three fixations are detected in this case. Including the fixation clos-
est to the target would give the highest accuracy, but what motivates
this choice of fixation over a different one? Should perhaps all be
included? Could we even omit the fixation detection stage and in-
clude all samples recorded during the period when the target was
shown, even though saccade samples are present? How might we account for the detrimental effects of latency in calculating precision, if latency values are not reported by the manufacturer? There is
a strong argument that if the researcher or interface designer will
not have access to system latency values for their recorded data,
that a standard measure should also assume no latency at all, and
calculate precision values in the same way as the consumer will be
forced to. The means of selecting which data points (raw samples)
are included for calibration validation purposes should be stated as
part of the research report, or manufacturer specification sheet, and
the exclusion criteria should eventually be standardized for compa-
rability across studies.
A related problem concerns how deviating fixation samples or ‘out-
liers’ due to various recording imperfections should be handled. A
single outlier can significantly affect the calculated data quality val-
ues, particularly if the sampling frequency of the eye tracker is low.
Perhaps the samples included for the measurement of data quality
should be the same ones chosen by the system for the calculation of
fixation position, since this reflects the end user situation, but fixa-
tion accuracy and sample accuracy are two different things; fixation
accuracy is affected by the raw data plus event detection methods.
Figure 10: Three fixations are detected (labelled ‘valid samples’) during the period when the participant is asked to look at the target (reproduced from [Nyström et al. submitted]). Which samples should be included in precision and accuracy calculations?
Furthermore, removing samples raises the question of how the gaps
should be treated. Whatever method is chosen, it should be fully
described as part of the report on data quality.
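The impact of a single outlier on the precision estimate can be illustrated with a short sketch; the coordinates below are made-up values, and precision is again taken as the RMS of sample-to-sample distances:

```python
import math

def rms_s2s(samples):
    """RMS of sample-to-sample distances (precision), in the input units."""
    d = [math.hypot(x2 - x1, y2 - y1)
         for (x1, y1), (x2, y2) in zip(samples, samples[1:])]
    return math.sqrt(sum(di ** 2 for di in d) / len(d))

# A short, low-sampling-rate fixation segment (pixel coordinates) ...
clean = [(100.0, 100.0), (101.0, 100.5), (100.5, 101.0),
         (100.0, 100.5), (101.0, 101.0)]
# ... and the same segment with one spurious sample inserted.
with_outlier = clean[:2] + [(160.0, 100.0)] + clean[2:]

print(f"clean: {rms_s2s(clean):.2f} px, "
      f"with one outlier: {rms_s2s(with_outlier):.2f} px")
```

With only a handful of samples per target, one deviating sample inflates the RMS value by more than an order of magnitude, which is why the outlier-handling rule must be reported.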
6 How is data quality reported?
To date, researchers have rarely reported measured data quality val-
ues for their own data sets. The most common way to report ac-
curacy is to refer to the manufacturer’s specification sheets, for in-
stance “This system is accurate to within 0.5
(citation to manu-
facturer)”. A search on Google Scholar using “with an accuracy
of 0.5” AND “eye tracker” returns 135 papers in all varieties of
journals that have used this particular phrase for handing over re-
sponsibility for data quality to their particular manufacturer. The
vast majority of researchers appear to treat the values on the speci-
fication sheet as a correct and objective characterization of the data
they have collected, as if all data, from all participants, wherever they look on the monitor, would have an accuracy better than 0.5 degrees of visual angle. This assumption of optimal accuracy across all recording conditions and participants is unlikely to be correct and may lead to invalid results even when data loss is accounted for and the data look reasonable.
Criteria used to exclude data cited in the literature include, for instance, the percentage of zero values in the raw data samples, a high offset (poor accuracy) value during validation, a high number of events with a velocity above 800°/s, and an average fixation velocity above 15°/s (indicative of low precision). For
an example of accuracy and data loss criteria, see [Komogortsev
et al. 2010]. [Holmqvist et al. 2011] conclude that around 2–5%
of the data from a population of average non-pre-screened Euro-
peans needs to be excluded due to participant-specific tracking dif-
ficulties. However, this number varies significantly: [Schnipke and
Todd 2000], [Mullin et al. 2001] and [Pernice and Nielsen 2009]
report data losses of 20–60% of participants/trials, and [Burmester
and Mast 2010] excluded 12 out of 32 participants due to calibra-
tion (7) and tracking issues (5).
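Exclusion criteria of this kind are straightforward to apply mechanically; the sketch below uses thresholds of the sort cited above, but both the threshold values and the trial records are illustrative, not prescriptive:

```python
# Illustrative exclusion thresholds (report whichever are actually used).
MAX_LOSS = 0.20            # max proportion of zero/invalid samples
MAX_OFFSET_DEG = 1.5       # max validation offset (accuracy)
MAX_FIX_VEL_DEG_S = 15.0   # max average fixation velocity (precision proxy)

trials = [  # per-trial quality summaries (hypothetical values)
    {"id": 1, "loss": 0.02, "offset": 0.4, "fix_vel": 6.0},
    {"id": 2, "loss": 0.35, "offset": 0.5, "fix_vel": 7.0},   # too much loss
    {"id": 3, "loss": 0.05, "offset": 2.1, "fix_vel": 5.0},   # poor accuracy
    {"id": 4, "loss": 0.03, "offset": 0.6, "fix_vel": 22.0},  # low precision
]

kept = [t["id"] for t in trials
        if t["loss"] <= MAX_LOSS
        and t["offset"] <= MAX_OFFSET_DEG
        and t["fix_vel"] <= MAX_FIX_VEL_DEG_S]
print("retained trials:", kept)
```

Whatever thresholds are chosen, the proportion of excluded participants and trials should be stated, since, as the figures above show, it varies widely between studies.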
Manufacturer technical development groups need correct data qual-
ity values for internal benchmarking: to judge whether changes in
hard- or software result in improved data quality. Therefore, several
manufacturers develop data quality assessment methods for their
own use. In fact, many of the methods developed by manufacturers
can be expected to be essential parts in a standardization of data
quality measures. This includes the artificial eye, the point-to-point
measurement of latency and several suggestions for calculation that
we will see below. Although there is as yet no consensus on the
exact measures for data quality, the measures suggested below are
nonetheless useful and informative, and could be included in a re-
search report alongside a description of the measures chosen.
7 Reporting data quality from experiments
A standardized set of eye data quality measures could be automated
for use in experimental research, as part of the software package and
compared to an independent report for that eye tracking system or
for a similar participant group tested on other systems. Automated
data quality measures which are standardized across systems would
mean that researchers can easily access them as part of running an
ordinary study. They could also be made publicly available by an
independent body, in a similar fashion to specifications for other
computer based technologies. Table 1 shows what we propose such
a report could look like in a publication.
Table 1: Data quality report from a collection of data in an experiment. Precision values reflect the RMS of inter-sample distances.

Data quality                                          Average    SD
Calibration accuracy                                  0.32°
Accuracy just before end of recording                 0.61°
Calibration precision                                 0.14°
Precision just before end of recording                0.21°
Accuracy after post-recording processing              -          -
Precision after post-recording processing             -          -
Proportion of dismissed participants                  9 %        -
Proportion of lost data samples in retained data      0.3 %      0.041 %
In order to interpret these values in relation to the other parts of
the scientific publication, it is important to specify the analysis:
what event detection algorithm was used, with what settings, to de-
tect fixations, saccades etc? What were the data exclusion criteria?
What are the sizes of AOIs and their margins? Also, the type of eye
tracker should be reported, alongside the recording, stimulus and
analysis software with version numbers. If any values are unavail-
able or unknown, they should be stated as such. This is the basic
level of information required in order to assess eye movement re-
search for possible confounding variables in the data and compare
research results.
7.1 Who benefits from eye data quality reports?
There are two major uses for a standardized method of testing and
reporting eye data quality. We propose, first, that a test house test eye trackers on the market using a battery of tests derived from a standardized set of measurements. It would use both standardized artificial eyes and a large sample of real participants, standardized and
selected according to the criteria known to influence data quality
such as eye shape and colour. Because operator experience is a sig-
nificant factor for data quality, experienced eye tracker operators
should be used, or level of experience measured and controlled for.
This activity results in a protocol that manufacturers can base their
product documentation on. To make these results useful for gaze interaction as well as for replication of research results, the minimum target size and margin between targets that can be reliably selected by gaze can be calculated from these values and reported alongside accuracy and precision values.
Second, authors will be able to calculate data quality values from their experimental data and compare them to the known quality values for their eye tracker. In order to assist in the comparability
of research results, journal reviewers could require these values in
their papers. Such measures would greatly assist progress in the
field by removing the large uncertainty in assessing the results of
eye movement research using highly variant equipment and cal-
culations. This approach would also benefit manufacturers, who
need standardized measures to assess their systems and to compare
performance to competitors or for internal benchmarking. Finally,
it would make the task of deciding which eye tracker to buy for
particular purposes more transparent and straightforward for both
researchers and users of gaze control systems.
8 Conclusion and future work
Clearly, standardization work for eye data quality would benefit eye
movements technology and research in general. This work has al-
ready begun as a collaborative effort of the COGAIN Association,
in the form of a technical committee for the standardisation of eye
data quality. In the absence of agreed standard measures while this
work is underway, there is an immediate benefit in promoting the
testing and reporting of data quality as standard in eye movement
research using the measures outlined above.
Not all aspects of data quality would benefit from standardization, however; a number of issues might better be left to evolve freely, including: (a) how accuracy and precision are actually achieved, which is proprietary information and the core business of manufacturers; (b) what the eye tracker can be used for and what conclusions can be drawn from the tests, which should be left to the informed researcher or developer; (c) how low accuracy or precision can be accommodated or overcome, since standardizing this would likely hold back research in the area (magnifying windows in gaze interaction software or extra post-processing of the data in research must be stated, but should not be standardized); and (d) event detection algorithms and the filters used in them, where research is not yet mature enough.
Many researchers may be unaware of the magnitude of the effect
of data quality on their research results or interface functionality
and there are no guidelines on how to go about assessing their data.
Likewise, manufacturers may be unsure how their in-house test methods compare to those of other manufacturers or to end users’ quality tests. We hope this paper sets a clear target which will have a positive impact on all aspects of eye movement research, eye tracker development and gaze-based interaction.
ANDERSSON, R., NYSTRÖM, M., AND HOLMQVIST, K. 2010. Sampling frequency and eye-tracking measures: How speed affects durations, latencies, and more. Journal of Eye Movement Research 3, 6, 1–12.
BURMESTER, M., AND MAST, M. 2010. Repeated web page visits
and the scanpath theory: A recurrent pattern detection approach.
Journal of Eye Movement Research 3, 4, 1–20.
COTMORE, S., AND DONEGAN, M. 2011. Participatory Design - The Story of Jayne and Other Complex Cases. In Gaze Interaction and Applications of Eye Tracking: Advances in Assistive Technologies. IGI Global: Medical Information Science Reference, Hershey PA.
GAGL, B., HAWELKA, S., AND HUTZLER, F. 2011. Systematic
influence of gaze position on pupil size measurement: analysis
and correction. Behavior Research Methods, 1–11.
DANBEGI, D. 2011. Introduction to Eye and Gaze Trackers. In
Gaze Interaction and Applications of Eye Tracking: Advances
in Assistive Technologies, P. Majaranta, H. Aoki, M. Donegan,
D. W. Hansen, J. P. Hansen, A. Hyrskykari, and K.-J. Räihä, Eds. IGI Global: Medical Information Science Reference, Hershey PA, ch. 19, 288–295.
HOLLAND, M., AND TARLOW, G. 1972. Blinking and mental
load. Psychological Reports 31, 119–127.
HOLMQVIST, K., NYSTRÖM, M., ANDERSSON, R., DEWHURST, R., JARODZKA, H., AND VAN DE WEIJER, J. 2011. Eye tracking: A comprehensive guide to methods and measures. Oxford: Oxford University Press.
HORNOF, A., AND HALVERSON, T. 2002. Cleaning up systematic
error in eye-tracking data by using required fixation locations.
Behavior Research Methods, Instruments, & Computers 34, 4, 592–604.
DREWES, J., MONTAGNINI, A., AND MASSON, G. S. 2011. Effects of pupil size on recorded gaze position: a live comparison of two eyetracking systems. Talk presented at the 2011 Annual Meeting of the Vision Science Society.
JOHNSSON, J., AND MATOS, R. 2011. Accuracy and precision
test method for remote eye trackers. Tobii Technology.
KARSH, R., AND BREITENBACH, F. W. 1983. Looking at looking:
The amorphous fixation measure. In Eye Movements and Psy-
chological Functions: International Views, R. Groner, C. Menz,
D. F. Fisher, and R. A. Monty, Eds. Mahwah NJ: Lawrence Erl-
baum Associates, 53–64.
KOMOGORTSEV, O. V., GOBERT, D. V., JAYARATHNA, S., KOH, D. H., AND GOWDA, S. M. 2010. Standardization of automated analyses of oculomotor fixation and saccadic behaviors. IEEE Transactions on Biomedical Engineering 57, 11, 2635–2645.
MULLIN, J., ANDERSON, A. H., SMALLWOOD, L., JACKSON, M., AND KATSAVRAS, E. 2001. Eye-tracking explorations in multimedia communications. In Proceedings of IHM/HCI 2001: People and Computers XV – Interaction without Frontiers, Cambridge: Cambridge University Press, A. Blandford, J. Vanderdonckt, and P. Gray, Eds., 367–382.
NYSTRÖM, M., AND HOLMQVIST, K. 2010. An adaptive algorithm for fixation, saccade, and glissade detection in eye-tracking data. Behavior Research Methods 42, 1, 188–204.
NYSTRÖM, M., ANDERSSON, R., HOLMQVIST, K., AND VAN DE WEIJER, J. Submitted. Participants know best: influence of calibration method and eye physiology on eye-tracking data quality. Journal of Neuroscience Methods.
PERNICE, K., AND NIELSEN, J. 2009. Eyetracking Methodology -
How to Conduct and Evaluate Usability Studies Using Eyetrack-
ing. Berkeley, CA: New Riders Press.
SADEGHNIA, G. R. 2011. SMI Technical Report on Data Quality
Measurement. SensoMotoric Instruments.
SALVUCCI, D., AND GOLDBERG, J. H. 2000. Identifying fixations and saccades in eye-tracking protocols. In Proceedings of the 2000 Symposium on Eye Tracking Research & Applications, New York: ACM, 71–78.
SCHNIPKE, S. K., AND TODD, M. W. 2000. Trials and tribulations
of using an eye-tracking system. In CHI’00 Extended Abstracts
on Human Factors in Computing Systems, ACM, 273–274.
SHANNON, C. E. 1948. A mathematical theory of communication.
Bell System Technical Journal 27, 379–423, 623–656.
SHIC, F., CHAWARSKA, K., AND SCASSELLATI, B. 2008. The incomplete fixation measure. In Proceedings of the 2008 Symposium on Eye-Tracking Research & Applications, New York: ACM, 111–114.
SR RESEARCH. 2007. EyeLink User Manual 1.3.0. Mississauga,
Ontario, Canada.
TECCE, J. 1992. McGraw-Hill Yearbook of Science & Technol-
ogy. New York: McGraw-Hill, ch. Psychology, physiological
and experimental., 375–377.
... The quality of eye-tracking data is often assessed in terms of the following three crucial properties of a segment of eye-movement data: accuracy, precision, and the amount of data loss (e.g., Holmqvist et al., 2012;Niehorster et al., 2020a, b;Holmqvist, 2015;Holmqvist & Andersson 2017). First, data may be inaccurate, i.e., there is an offset between the real gaze location and the gaze position estimated by the eye-tracking system. ...
... While this is an extreme example demonstrating the impact of data loss, even if classification using an algorithm is possible, high noise levels and frequent gaps in data may significantly affect the detection of eye-movement events. When algorithms that are not robust to these two issues are used to process such lower quality data, the algorithm outputs contain more artefacts, i.e., physiologically impossible eye movements, such as too short fixations and saccades (Holmqvist et al., 2012). Moreover, the lower the quality of the data, the more differences there are between different algorithms and thresholds settings in their outputs. ...
Full-text available
Pupil–corneal reflection (P–CR) eye tracking has gained a prominent role in studying dog visual cognition, despite methodological challenges that often lead to lower-quality data than when recording from humans. In the current study, we investigated if and how the morphology of dogs might interfere with tracking of P–CR systems, and to what extent such interference, possibly in combination with dog-unique eye-movement characteristics, may undermine data quality and affect eye-movement classification when processed through algorithms. For this aim, we have conducted an eye-tracking experiment with dogs and humans, and investigated incidences of tracking interference, compared how they blinked, and examined how differential quality of dog and human data affected the detection and classification of eye-movement events. Our results show that the morphology of dogs’ face and eye can interfere with tracking methods of the systems, and dogs blink less often but their blinks are longer. Importantly, the lower quality of dog data lead to larger differences in how two different event detection algorithms classified fixations, indicating that the results of key dependent variables are more susceptible to choice of algorithm in dog than human data. Further, two measures of the Nyström & Holmqvist ( Behavior Research Methods, 42 (4), 188–204, 2010) algorithm showed that dog fixations are less stable and dog data have more trials with extreme levels of noise. Our findings call for analyses better adjusted to the characteristics of dog eye-tracking data, and our recommendations help future dog eye-tracking studies acquire quality data to enable robust comparisons of visual cognition between dogs and humans.
... Eye-tracking data quality was operationalized using common measures of accuracy and data loss (see e.g. Holmqvist, Nyström, & Mulvey, 2012;Holmqvist et al., 2022), as well as the root mean square (RMS) sampleto-sample deviation of the gaze position signal (i.e. point of regard in the scene camera video) using a 300 ms moving window technique (see Hessels et al., 2020c, for details). The RMS deviation is often used as an operationalization of the precision of a recording for a non-moving observer fixating a static target in the world. ...
Full-text available
Eye contact is essential for human interactions. We investigated whether humans are able to avoid eye contact while navigating crowds. At a science festival, we fitted 62 participants with a wearable eye tracker and instructed them to walk a route. Half of the participants were further instructed to avoid eye contact. We report that humans can flexibly allocate their gaze while navigating crowds and avoid eye contact primarily by orienting their head and eyes towards the floor. We discuss implications for crowd navigation and gaze behavior. In addition, we address a number of issues encountered in such field studies with regard to data quality, control of the environment, and participant adherence to instructions. We stress that methodological innovation and scientific progress are strongly interrelated.
... Quality of ET data. According to the previous study 42 , three methods were chosen to examine the quality of ET data: proportion of valid data, inter-sample distance, and the distance of fixation to screen center. They measured different aspects of eye tracking data quality. ...
Full-text available
The dataset of simultaneous 64-channel electroencephalography (EEG) and high-speed eye-tracking (ET) recordings was collected from 31 professional athletes and 43 college students during alertness behavior task (ABT) and concentration cognitive task (CCT). The CCT experiment lasting 1–2 hours included five sessions for groups of the Shooting, Archery and Modern Pentathlon elite athletes and the controls. Concentration targets included shooting target and combination target with or without 24 different directions of visual distractors and 2 types of music distractors. Meditation and Schulte Grid trainings were done as interventions. Analysis of the dataset aimed to extract effective biological markers of eye movement and EEG that can assess the concentration level of talented athletes compared with same-aged controls. Moreover, this dataset is useful for the research of related visual brain-computer interfaces.
... In line with Dalrymple, Manner, Harmelink, Teska, and Elison (2018) and Holmqvist, Nyström, and Mulvey (2012), an outlier threshold for accuracy and precision measures was applied at 1.50 • . Only participants for whom at least 8 of the 12 trials in the respective condition (≥ 66%) contained sufficient gaze data, were considered, resulting in a total of 1440/2190 trials from ten experts and ten novices included in the analysis of gaze behaviour. ...
Anticipation of teammates and opponents is a critical factor in many sports played in interactive environments. Deceptive actions are used in sports such as basketball to counteract anticipation of an opponent. In this study, we investigated the effects of shot deception on the players' anticipation behaviour in basketball. Thirty one basketball players (15 expert, 16 novice) watched life-sized videos of basketball players performing real shots or shot fakes aimed at the basket. Four different shot outcomes were presented in the video stimuli: a head fake, a ball fake, a high shot fake, and a genuine shot. The videos were temporally occluded at three different time points (−160 ms, −80 ms, 0 ms to ball release) during a shooting motion. The participants had to perform a basketball-related response action to either shots or shot fakes. Response accuracy, response time, and decision confidence were recorded along with gaze behaviour. Anticipation accuracy was reduced at later occlusion points for fake shooting actions. For expert athletes, this effect occurred at later occlusion points compared to novices. The gaze analysis of successful and unsuccessful shot anticipations revealed more gaze fixations towards the hip and legs in successful anticipations, whereas more fixations towards the ball and the head were found in shots unsuccessfully anticipated. It is proposed that hip and leg regions may contain causal information concerning the vertical trajectory of the shooter and identifying this information may be important for perceiving genuine and deceptive shots in basketball.
... However, data from some participants had to be excluded for analysis, due to reasons varying from their body movements, to failures in calibration, and to data failing to meet quality criteria (Wehrmeyer, 2014;. The data attrition rate in the studies ranged from 2.78 to 32%, which is to be expected, as the data loss rate reported by studies can lie between 2 to 60% (Holmqvist et al., 2012). Therefore, it is advisable for researchers to enlist more participants than is necessary, as data loss is common in eye-tracking studies. ...
Full-text available
It has been four decades since eye-tracking was first used in interpreting studies, and recent years has witnessed a growing interest in the application of this method, which holds great potential for offering a look into the “black box” of interpreting processing. However, little attention has been paid to comprehensively illustrating what has been done, what can be done, and what needs to be done with this method in this discipline. With this in view, this paper sets out to understand contributions of previous studies—key themes discussed, eye-tracking measures used, their limitations and implications, and future directions. To this end, we conduct a review of a total of 26 empirical papers from peer-reviewed journals within a time span of 4 decades ranging from 1981 to 2021. This study, as the first attempt of its kind at a comprehensive review on using eye-tracking in interpreting studies, should have implications for researchers, educators, and practitioners.
... In this case, its evaluation with the help of a dataset recorded with that particular tracker can serve as an adequate representation of future-use conditions. Even in this case, however, it has to be noted that data quality (even obtained with the same device) can vary 800 widely because of the subject-specific characteristics, subject population, the operator, and recording protocol Holmqvist et al., 2012;Nyström et al., 2013;Blignaut and Wium, 2014;Hessels et al., 2015). 805 Therefore, if the test set of recordings is not diverse enough (either in terms of the utilized devices or recording conditions), additional robustness evaluation should be considered (e.g. ...
Full-text available
Detecting eye movements in raw eye tracking data is a well-established research area by itself, as well as a common pre-processing step before any subsequent analysis. As in any field, however, progress and successful collaboration can only be achieved provided a shared understanding of the pursued goal. This is often formalised via defining metrics that express the quality of an approach to solving the posed problem. Both the big-picture intuition behind the evaluation strategies and seemingly small implementation details influence the resulting measures, making even studies with outwardly similar procedures essentially incomparable, impeding a common understanding. In this review, we systematically describe and analyse evaluation methods and measures employed in the eye movement event detection field to date. While recently developed evaluation strategies tend to quantify the detector's mistakes at the level of whole eye movement events rather than individual gaze samples, they typically do not separate establishing correspondences between true and predicted events from the quantification of the discovered errors. In our analysis we separate these two steps where possible, enabling their almost arbitrary combinations in an evaluation pipeline. We also present the first large-scale empirical analysis of event matching strategies in the literature, examining these various combinations both in practice and theoretically. We examine the particular benefits and downsides of the evaluation methods, providing recommendations towards more intuitive and informative assessment. We implemented the evaluation strategies on which this work focuses in a single publicly available library:
... Implementation of new technologies is often characterized by a lack of a critical methodological approach [51]. The quality of data depends on multiple factors, one of which is the system in which the research is conducted [52]. Researchers who compare the characteristics of the data obtained with different types of ET have indicated the pros and cons of various environments used for gathering behavioral data [53][54][55][56]. ...
The idea of combining an eye tracker and VR goggles has opened up new research perspectives as far as studying cultural heritage is concerned, but has also made it necessary to reinvestigate the validity of more basic eye-tracking research done using flat stimuli. Our intention was to investigate the extent to which the flattening of stimuli in the 2D experiment affects the obtained results. Therefore an experiment was conducted using an eye tracker connected to virtual reality glasses and 3D stimuli, which were a spherical extension of the 2D stimuli used in the 2018 research done using a stationary eye tracker accompanied by a computer screen. The subject of the research was the so-called tunnel church effect, which stems from the belief that medieval builders deliberately lengthened the naves of their cathedrals to enhance the role of the altar. The study compares eye-tracking data obtained from viewing three 3D and three 2D models of the same interior with changed proportions: the number of observers, the number of fixations and their average duration, and the time spent looking at individual zones. Although the participants were allowed to look around freely in the VR, most of them still performed about 70–75% of fixations in the area that had been presented in the flat stimuli in the previous study. We deemed it particularly important to compare the perception of the areas that had been presented in 2D and that had evoked very much or very little interest: the presbytery, vaults, and floors. The results indicate that, although using VR allows for a more realistic and credible research situation, architects, art historians, archaeologists and conservators can, under certain conditions, continue to apply under-screen eye trackers in their research. The paper points out the consequences of simplifying the research scenario, e.g. a significant change in fixation duration. The analysis of the results shows that the data obtained by means of VR are more regular and homogeneous.
Recent advances in eye-tracking technology afford the possibility to collect rich data on attentional focus in a wide variety of settings outside the lab. However, apart from anecdotal reports, it is not clear how to maximize the validity of these data and prevent data loss from tracking failures. Particularly helpful in developing these techniques would be to describe the prevalence and causes of tracking failures in authentic environments. To meet this goal, we analyzed video records aligned with eye-tracking data collected from screen-mounted eye trackers employed in a middle-school classroom. Our sample includes records from 35 students recorded during multiple eye-tracking sessions. We compared student head position, body posture, attentiveness, and social interactions for time periods associated with successful and unsuccessful eye tracking. Overall, we observed substantial data loss and found that student inattentiveness, movements toward the eye tracker, and head rotations were the most prevalent factors inducing data loss. In addition, we observed a substantial relationship between data loss and apparent low involvement in the learning task. These data suggest that eye-tracking data loss is an important problem and that it can present a threat to validity because it can bias datasets to overrepresent high-involvement behaviors. Based on these findings, we present several recommendations for increasing the proportion of usable data and for countering possible biases that data loss may introduce.
/Introduction/ Eye tracking (ET) is a popular tool to study what factors affect the visual behaviour of surgical team members. To our knowledge, there have been no reviews to date that evaluate the broad use of ET in surgical research. This review aims to identify and assess the quality of this evidence, to synthesize how ET can be used to inform surgical practice, and to provide recommendations to improve future ET surgical studies. /Methods/ In line with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines, a systematic literature review was conducted. An electronic search was performed in MEDLINE, Cochrane Central, Embase, and Web of Science databases up to September 2020. Included studies used ET to measure the visual behaviour of members of the surgical team during surgery or surgical tasks. The included studies were assessed by two independent reviewers. /Results/ A total of 7614 studies were identified, and 111 were included for data extraction. Eleven applications were identified; the four most common were skill assessment (41%), visual attention assessment (22%), workload measurement (17%), and skills training (10%). A summary was provided of the various ways ET could be used to inform surgical practice, and three areas were identified for the improvement of future ET studies in surgery. /Conclusions/ This review provided a comprehensive summary of the various applications of ET in surgery and how ET could be used to inform surgical practice, including how to use ET to improve surgical education. The information provided in this review can also aid in the design and conduct of future ET surgical studies.
We use simulations to investigate the effect of sampling frequency on common dependent variables in eye-tracking. We identify two large groups of measures that behave differently, but consistently. The effect of sampling frequency on these two groups of measures is explored, and simulations are performed to estimate how much data are required to overcome the uncertainty of a limited sampling frequency. Both simulated and real data are used to estimate the temporal uncertainty of data produced by low sampling frequencies. The aim is to provide easy-to-use heuristics for researchers using eye-tracking. For example, we show how to compensate for the uncertainty of a low sampling frequency with more data and post-experiment adjustments of measures. These findings have implications primarily for researchers using naturalistic setups where sampling frequencies typically are low.
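The temporal uncertainty that the abstract describes can be reproduced with a small Monte-Carlo sketch: a fixed-duration event is sampled with a random phase relative to the tracker's sample clock, and the spread of the measured duration is estimated. The 316.7 ms event and the function names are illustrative assumptions, not the cited study's simulation:

```python
# Sketch: how sampling frequency limits the temporal resolution of a
# duration measure. An event of known duration is observed through a
# sample clock with random phase; the standard deviation of the
# measured duration quantifies the uncertainty. Averaging over n
# independent trials shrinks the uncertainty of the mean by sqrt(n),
# which is the "compensate with more data" heuristic.
import math
import random

def measured_duration(true_ms, freq_hz, rng):
    """Duration as seen by a tracker sampling at freq_hz, random phase."""
    dt = 1000.0 / freq_hz                  # sampling interval in ms
    phase = rng.uniform(0, dt)             # event onset vs. sample clock
    # indices k with phase <= k*dt < phase + true_ms
    k_min = math.ceil(phase / dt)
    k_max = math.ceil((phase + true_ms) / dt) - 1
    return max(0, k_max - k_min + 1) * dt

def uncertainty(true_ms, freq_hz, trials=20000, seed=1):
    """Standard deviation (ms) of the duration-measurement error."""
    rng = random.Random(seed)
    errs = [measured_duration(true_ms, freq_hz, rng) - true_ms
            for _ in range(trials)]
    mean = sum(errs) / trials
    return (sum((e - mean) ** 2 for e in errs) / trials) ** 0.5

print(uncertainty(316.7, 50))    # coarse clock: error spread of several ms
print(uncertainty(316.7, 1000))  # fine clock: spread well under 1 ms
```

Note the deliberately non-round 316.7 ms duration: a duration that is an exact multiple of the sampling interval would hide the quantization error.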
Holmqvist, K., Nyström, M., Andersson, R., Dewhurst, R., Jarodzka, H., & Van de Weijer, J. (Eds.) (2011). Eye tracking: a comprehensive guide to methods and measures, Oxford, UK: Oxford University Press.
Recording eye movement data with high quality is often a prerequisite for producing valid and replicable results and for drawing well-founded conclusions about the oculomotor system. Today, many aspects of data quality are often informally discussed among researchers but are very seldom measured, quantified, and reported. Here we systematically investigated how the calibration method, aspects of participants' eye physiologies, the influences of recording time and gaze direction, and the experience of operators affect the quality of data recorded with a common tower-mounted, video-based eyetracker. We quantified accuracy, precision, and the amount of valid data, and found an increase in data quality when the participant indicated that he or she was looking at a calibration target, as compared to leaving this decision to the operator or the eyetracker software. Moreover, our results provide statistical evidence of how factors such as glasses, contact lenses, eye color, eyelashes, and mascara influence data quality. This method and the results provide eye movement researchers with an understanding of what is required to record high-quality data, as well as providing manufacturers with the knowledge to build better eyetrackers.
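The three data-quality dimensions the study quantifies, accuracy, precision, and the amount of valid data, have common operational definitions that can be sketched directly. The synthetic gaze samples, units (degrees of visual angle), and function names below are illustrative assumptions; real implementations differ in details such as how lost samples are handled around precision windows:

```python
# Sketch of three common data-quality measures: accuracy (mean offset
# of gaze from a known target), precision (RMS of sample-to-sample
# distances), and the proportion of valid samples. Lost samples are
# represented as None; coordinates are degrees of visual angle.
import math

def accuracy(gaze, target):
    """Mean Euclidean offset (deg) of valid samples from the target."""
    valid = [g for g in gaze if g is not None]
    return sum(math.dist(g, target) for g in valid) / len(valid)

def precision_rms(gaze):
    """RMS of distances between successive valid samples (deg).
    Note: this simple version also pairs samples across a data gap."""
    valid = [g for g in gaze if g is not None]
    d2 = [math.dist(a, b) ** 2 for a, b in zip(valid, valid[1:])]
    return (sum(d2) / len(d2)) ** 0.5

def valid_proportion(gaze):
    """Fraction of samples for which the tracker reported gaze at all."""
    return sum(g is not None for g in gaze) / len(gaze)

# A fixation on a calibration target at (10, 5) deg, with one lost sample.
gaze = [(10.4, 5.1), (10.5, 5.0), None, (10.5, 5.2), (10.4, 5.1)]
print(accuracy(gaze, (10.0, 5.0)))  # systematic offset, ~0.47 deg
print(precision_rms(gaze))          # sample-to-sample noise, ~0.16 deg
print(valid_proportion(gaze))       # 0.8
```

The example makes the distinction concrete: the data are precise (small sample-to-sample spread) but not accurate (a consistent ~0.5 deg offset from the target), and the two can vary independently.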
In an effort toward standardization, this paper evaluates the performance of five eye-movement classification algorithms in terms of their assessment of oculomotor fixation and saccadic behavior. The results indicate that the performance of these five commonly used algorithms varies dramatically, even in the case of a simple stimulus-evoked task using a single, common threshold value. The important contributions of this paper are: evaluation and comparison of performance of five algorithms to classify specific oculomotor behavior; introduction and comparison of new standardized scores to provide more reliable classification performance; logic for a reasonable threshold-value selection for any eye-movement classification algorithm based on the standardized scores; and logic for establishing a criterion-based baseline for performance comparison between any eye-movement classification algorithms. Proposed techniques enable efficient and objective clinical applications providing means to assure meaningful automated eye-movement classification.
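One of the simplest members of the algorithm family such comparisons cover is the velocity-threshold (I-VT) classifier, which labels each sample by its point-to-point velocity. The sketch below is a minimal illustration under assumed parameters (a 30 deg/s threshold, 500 Hz data); it is not the paper's implementation, and real detectors add velocity smoothing and duration constraints:

```python
# Sketch of a velocity-threshold (I-VT) classifier: samples whose
# point-to-point velocity exceeds the threshold are labelled saccade,
# the rest fixation. Threshold choice is exactly what evaluations of
# classification algorithms must vary and score.

def ivt(x, y, freq_hz, threshold_deg_s=30.0):
    """Label each sample 'fix' or 'sac' by point-to-point velocity."""
    dt = 1.0 / freq_hz
    labels = ['fix']  # first sample has no velocity; default to fixation
    for i in range(1, len(x)):
        v = ((x[i] - x[i-1]) ** 2 + (y[i] - y[i-1]) ** 2) ** 0.5 / dt
        labels.append('sac' if v > threshold_deg_s else 'fix')
    return labels

# A fixation, a fast 5-deg horizontal step, then another fixation
# (noise-free 500 Hz data in degrees).
x = [0.0] * 5 + [2.5, 5.0] + [5.0] * 5
y = [0.0] * 12
print(ivt(x, y, 500))  # two samples cross the threshold: the saccade
```

Even this toy case shows why results diverge across algorithms: with noisy data, the same threshold produces different labels depending on filtering and how velocity is estimated.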
This paper investigates the eye movement sequences of users visiting web pages repeatedly. We are interested in potential habituation due to repeated exposure. The scanpath theory posits that every person learns an idiosyncratic gaze sequence on first exposure to a stimulus and re-applies it on subsequent exposures. Josephson and Holmes (2002) tested the applicability of this hypothesis to web page revisitation but results were inconclusive. With a recurrent temporal pattern detection technique, we examine additional aspects and expose scanpaths. Results do not suggest direct applicability of the scanpath theory. While repetitive scan patterns occurred and were individually distinctive, their occurrence was variable, there were often several different patterns per person, and patterns were not primarily formed on the first exposure. However, extensive patterning occurred for some participants yet not for others which deserves further study into its determinants.
In the previous chapters of the book, you will have seen multiple applications for using (and the benefits of using) a gaze tracker. In this chapter, you will be given more insight into how an eye tracker operates. Not only can this aid in understanding the eye tracker better, it also gives important information about how future applications might improve on current ones, by using more of the information available from the eye tracker: as we shall see, an eye tracker can often provide you with more information than just coordinates on a screen. This chapter gives an overview of the components of an eye tracker and introduces basics of gaze modelling. It helps in understanding the following chapters, which each provide some details of how to build an eye tracker. This section has technical content, but it is our hope that even readers not particularly interested in the details of eye and gaze trackers will gain some useful insights.
The ACE Centre user trials have involved over a hundred people who have severe and complex physical, cognitive and visual difficulties. Participatory Design methodology underpinned the approach that was adopted, which involved being led by the requirements of the most complex users. Jayne was one of many users through whom we not only developed more effective ways of using the technology, but also more innovative strategies to support its implementation. In this chapter, we describe the process, and outcome of our participatory design approach, through the cases of Jayne and other users.
Countless aspects of visual processing are reflected in eye movements and analyzing eye movements during visual stimulation has become the methodology of choice for many researchers in vision science and beyond. For decades, the scleral searchcoil technique has been considered the “gold standard” in terms of precision and signal to noise ratio, at the cost of pronounced setup overhead and a certain degree of invasiveness. On the other hand, camera-based eyetrackers are easy to use and non-invasive, yet, despite the dramatic improvement of the last generation systems, they have been known to be more noisy and less precise. Recently, a significant impact of changes in pupil size on the accuracy of camera-based eyetrackers during fixation has been reported (Wyatt, 2010). We compared the accuracy and the pupil-size effect between a scleral searchcoil-based eyetracker (DNI) and an up-to-date infrared camera-based eyetracker (SR Research Eyelink 1000) by simultaneously recording human eye movements with both techniques. Between pupil-constricted (PC) vs. pupil-relaxed (PR) conditions we find a subject-specific shift in reported gaze position of up to >2 degrees with the camera-based eyetracker, while the scleral searchcoil system simultaneously reported steady fixation, confirming that the actual point of fixation did not change during pupil constriction/relaxation. Individual repetitions of 25-point calibration grids show the positional accuracy of the searchcoil system to be unaffected by pupil size (PC 0.52 ± 0.1 deg, PR 0.54 ± 0.08 deg), whereas the camera-based system is much less accurate in the PR condition (PC 0.38 ± 0.12 deg, PR 0.98 ± 0.22 deg) due to increased pupil size variability. We show how these pupil-dependent shifts in recorded gaze position can affect the recorded dynamics of fixations (drift), saccades (reduced accuracy), pursuit (altered trajectory) and ocular following (directional bias), and we evaluate a dual-calibration-based method to compensate the pupil-based shift utilizing recorded pupil size.
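The dual-calibration compensation the abstract evaluates can be sketched, under assumptions, as a linear interpolation between gaze offsets measured at two reference pupil sizes. The linear model, the pupil sizes, and all offsets below are illustrative, not the study's actual calibration procedure:

```python
# Sketch of a dual-calibration pupil-size correction: calibrate once
# with a constricted pupil and once with a relaxed pupil, record the
# gaze offset at each, then interpolate a correction from the pupil
# size measured at runtime. A linear pupil-to-offset relation is an
# assumption of this sketch.

def make_corrector(pupil_a, offset_a, pupil_b, offset_b):
    """Return f(gaze, pupil) -> corrected gaze, interpolating between
    the (x, y) offsets measured at two calibration pupil sizes (mm)."""
    def correct(gaze, pupil):
        w = (pupil - pupil_a) / (pupil_b - pupil_a)
        off = tuple(oa + w * (ob - oa) for oa, ob in zip(offset_a, offset_b))
        return tuple(g - o for g, o in zip(gaze, off))
    return correct

# Offsets (deg) measured with a constricted (3 mm) and relaxed (6 mm)
# pupil while the subject fixated a target at (10, 5) deg.
correct = make_corrector(3.0, (0.1, 0.0), 6.0, (0.9, 0.3))
print(correct((10.9, 5.3), 6.0))   # relaxed pupil: full offset removed
print(correct((10.5, 5.15), 4.5))  # mid-size pupil: half-way correction
```

Both calls recover the true fixation point (10, 5): the correction scales with the current pupil size rather than applying a single fixed offset.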
This paper's focus is on the challenges associated with collecting eye-tracking data. Despite operator training conducted by the manufacturer, one year of experience with eye-tracking and extensive calibration, the data collection success rate in the current investigation was very low; only six out of sixteen participants (37.5%) were successfully eye-tracked. We discuss possible explanations for this low success rate, and why we do not currently believe that eye-tracking is ready to be employed in usability laboratories.