Implicit Calibration Using Predicted Gaze Targets
Pawel Kasprowski∗
Silesian University of Technology
Katarzyna Harezlak†
Silesian University of Technology
Abstract
The paper presents an algorithm supporting the implicit calibration
of eye movement recordings. The algorithm does not require any
explicit cooperation from users; it uses only information about
the stimulus and the uncalibrated eye tracker output. On the basis of
these data, probable fixation locations are calculated first. This
fixation set is used as input to a genetic algorithm whose task
is to choose the most probable targets. Both pieces of information can then serve
to calibrate the eye tracker. The main advantage of the algorithm is
that it is general enough to be used with almost any stimulation. This
was confirmed by results obtained for a very dynamic stimulation,
a shooting game. Using the calibration function built by
the algorithm it was possible to predict where a user would click with
a mouse. The accuracy of the prediction was about 75%.
Keywords: eye tracker, calibration, genetic algorithm
Concepts: • Human-centered computing → Interaction techniques;
1 Introduction
With the increasing availability of low-cost eye trackers, using
eye tracking in the wild by inexperienced and unsupervised
users has become possible. However, to take full advantage
of the data recorded by such a device, a prior calibration is
required in most cases. It is a cumbersome and unnatural process and may be
considered one of the main obstacles to the spread of eye
tracking as an enhancement for human-computer interfaces. Therefore,
some effort has been made to omit or simplify this process
and some methods have been developed (see [Brolly and Mulligan
2004], [Villanueva and Cabeza 2008] or [Hansen et al. 2010]), but
they usually require complicated hardware setups with more than
one light source and camera.
The research presented in this paper aims at calibrating a device
without any explicit cooperation from users. In such a setup, a system
builds a calibration function using information obtained from
an eye tracker and from elements of an interface. Contrary to the
previously mentioned studies, the tests presented in the paper used a
popular, cheap, off-the-shelf eye tracker without any additional
hardware.

Of course, such a calibration is possible only when the system has
some knowledge about the area where a person is supposed to look at
a specified moment. Having this information, it is possible to pair
∗e-mail: pawel.kasprowski@polsl.pl
†e-mail: katarzyna.harezlak@polsl.pl
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear
this notice and the full citation on the first page. Copyrights for components
of this work owned by others than ACM must be honored. Abstracting with
credit is permitted. To copy otherwise, or republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee. Request
permissions from permissions@acm.org. © 2016 ACM.
ETRA’16, March 14-17, 2016, Charleston, SC, USA
ISBN: 978-1-4503-4125-7/16/03
DOI: http://dx.doi.org/10.1145/2857491.2857511
data obtained from an eye tracker with a predicted fixation location
and use the pair in a way identical to a classic calibration. Such an
on-the-fly calibration is called implicit calibration, because
it is performed during normal interface usage without any cooperation
from the user.
This paper presents some theoretical background for the implicit
calibration and describes the algorithm designed to perform such a
calibration. The algorithm was checked using a specially designed
experiment, during which participants were playing a simple game.
2 Model of implicit calibration
The output of an eye tracker may be represented by a sequence of
points (e_1 ... e_N). There are eye trackers that work without
any calibration; however, some must be initialized by a
prior calibration procedure before they start registering any signal. In this
research it was assumed that such a device was calibrated by one
person and then used by other users. Thus, in both of the aforementioned
cases we will call the points e_i the uncalibrated output.
Additionally, there is information about the stimulation that was
observed by a person while the uncalibrated eye positions were
registered. The task is to build a function that correctly transforms
uncalibrated eye tracker data e_i into gaze points g_i that reference objects
of the stimulation. To be able to build such a function, some knowledge
about where a person was looking at a specified time is
required. In a traditional explicit calibration a person is forced to
look at defined points, whereas for implicit calibration it is necessary
to find such points by analyzing both the stimulation and the eye
tracker data.
Let's assume that there are N timestamps during the presentation
of a stimulation (later called screens) for which k possible fixation
locations (targets) can be estimated. The number of targets k may
be different for every screen, as presented in Figure 1.
Figure 1: Screens with targets. The figure shows subsequent screen
frames with several targets on each of them. The horizontal axis
represents time.
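As a rough illustration of this model, the per-screen data can be represented by a simple structure pairing the candidate targets with the uncalibrated eye sample; this is a minimal sketch of our own (the `Screen` type and the coordinate values are illustrative, not from the paper):

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]

@dataclass
class Screen:
    """One stimulus frame: k candidate targets plus the uncalibrated sample."""
    targets: List[Point]  # k possible fixation locations (k may differ per screen)
    eye: Point            # uncalibrated eye tracker output e_i for this screen

# A trial is simply a sequence of N screens.
trial = [
    Screen(targets=[(100, 120), (400, 300)], eye=(0.31, 0.42)),
    Screen(targets=[(150, 180), (420, 310), (600, 90)], eye=(0.52, 0.40)),
]
```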
Additionally, there is also an eye tracker output (e_i) available for every
screen. The task for the algorithm is to choose a target t*_i for every
screen s_i and pair it with e_i. The sequence of such pairs [t*_i, e_i]
may then be used to build a calibration function in a way identical
to the explicit calibration. The calibration function CalFun(e)
maps every eye tracker output (e_i) to a correct gaze point (g_i):

g_i(x, y) = CalFun(e_i(x, y))    (1)
This is a pre-print. The final version of the paper is available in the ACM
Digital Library via http://dx.doi.org/10.1145/2857491.2857511.
The method used in this study is based on the idea of RFLs (Required
Fixation Locations), introduced in [Hornof and
Halverson 2002]. RFLs are objects on a screen at which a participant
must look in order to accomplish a task. In the aforementioned
work they were determined based on the locations of mouse
clicks (with the assumption that people look where they click) and
were used to check the calibration quality and invoke recalibration
if necessary.
The case when there is more than one possible object (target) on
a screen has been studied in [Zhang and Hornof 2011] and [Vadillo
et al. 2014]. Instead of RFLs, so-called PFLs (Probable Fixation
Locations) were used to improve calibration. For every screen they
automatically chose the closest target as the probable gaze location.
Both RFLs and PFLs have been used in [Zhang and Hornof 2014].
Similarly to [Vadillo et al. 2014], the PFL was chosen as the target closest
to the eye tracker output. RFLs, as more reliable, were weighted
ten times higher than the PFLs when creating a recalibration
function.
All these approaches aimed at improving a calibration model
calculated for the same person at the beginning of an
experiment. During our studies we found that, when an eye
tracker is calibrated for one person, a similar technique may be used
to recalibrate it for another one. This idea was checked during the
experimental part of our research. The basic task is to choose a
sequence of targets t_i^j (where i = 1, 2, ..., N is a screen number and
j is the index of the target selected for the ith screen) representing
the genuine targets of the user's gazes and to use it to build a calibration
model. The main problem is how to find an appropriate sequence
among all the possibilities. Intuitively, this evaluation may be based
on the quality of the model obtained for some genuine data for
which correct gaze values are known. However, such a solution
is not feasible when data of that type is not available. This motivated
us to undertake studies on a method for determining a correct
sequence of targets without any knowledge of the true gaze locations.
3 Sequence of targets evaluation
Every sequence of targets and corresponding eye tracker points
[t*_i, e_i] may be used to build a calibration model. Then, it is possible
to evaluate the quality (fitness) of this model. One of the most
popular measures of model fitness is the coefficient of determination.
Having a set of M reference genuine gaze points
g_r(1), g_r(2), ..., g_r(M) and a corresponding eye tracker output
e_r(1), e_r(2), ..., e_r(M), the model quality may be calculated
by comparing g_r with the model output g_m = CalFun(e_r)
(see equation (1)):

R^2_g(g_r, g_m) = 1 - \frac{\sum_{i}^{M} (g_m(i) - g_r(i))^2}{\sum_{i}^{M} (g_r(i) - \bar{g}_r)^2}    (2)

where \bar{g}_r is the average of all reference values g_r(i). R^2_g is equal to
1 when the model fits perfectly, i.e. every g_m(i) point is at exactly
the same location as the corresponding g_r(i) point.
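For a single axis, equation (2) can be transcribed directly; this is a sketch of our own (the function name `r_squared` is illustrative):

```python
def r_squared(reference, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (equation (2))."""
    mean_ref = sum(reference) / len(reference)
    # Residual sum of squares between model output and reference points.
    ss_res = sum((p - r) ** 2 for r, p in zip(reference, predicted))
    # Total sum of squares around the reference mean.
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot

print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect fit -> 1.0
```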
When reference gaze points g_r are not available, the only possible
way of evaluation is to calculate the model fitness for the
data that was used to build it (in our case the chosen targets' positions
t*(i)). In such a case the coefficient of determination may be
calculated as:

R^2_m(t^*, g) = 1 - \frac{\sum_{i}^{N} (t^*(i) - g(i))^2}{\sum_{i}^{N} (t^*(i) - \bar{t}^*)^2}    (3)

where \bar{t}^* is the mean value of all chosen targets and g(i) =
CalFun(e(i)) is the gaze value for a screen i with a target t*(i),
calculated using the chosen model.

R^2_m is equal to 1 when the model fits perfectly (i.e. every g(i) point
is at exactly the same location as the corresponding target t*(i)).
It is important to emphasize that R^2_m as calculated in equation (3)
measures only the model fitness and gives no clue whether the targets
taken to calculate this model are the genuine ones.
4 Finding the best sequence
The search for the best sequence of targets may be treated as an
optimization task in which the best solution is sought among many
possibilities. A cost function is used to evaluate every solution (a
sequence of targets in our case). The number of possible sequences
is j_1 * j_2 * j_3 * ... * j_N, where j_x is the number of targets on the xth
screen. So, even for only 10 screens and 6 targets per screen
there are 6^10, i.e. more than 60 million, possibilities. Therefore, it is not
feasible to simply check all sequences; it is necessary to use some
heuristic that tries to find a "good" sequence. Among the plethora of
applicable optimization algorithms, including ant colony optimization
and simulated annealing, a genetic algorithm was chosen
for this research. Such an algorithm takes an
initial population of candidate solutions and then tries to find better
solutions by modifying the current ones with different operations (this
is called evolution). Our implementation started with a crossover
operator (applied in 35% of operations per generation) followed by a
mutation with probability 1/12. Every solution, called a chromosome,
consists of genes. In our case every ith gene was the index of the
target chosen for the ith screen, so a chromosome represented a sequence of
targets, one from each screen. Therefore, the chromosome's length
was equal to the number of screens.
The criterion for chromosome optimization is the value of a cost
function calculated for the chromosome. During the evolution, chromosomes
with higher function values are preferred. As a result,
after some number of iterations (evolutions), the chromosome with
the highest cost value is obtained. Naturally, it may not be the
best possible chromosome; it is just the best chromosome found.
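The evolutionary loop described above can be sketched as follows. The operator rates (35% crossover, mutation probability 1/12) follow the text, while the elitist selection scheme and all names are our own illustrative choices, not the authors' exact implementation:

```python
import random

def evolve(target_counts, fitness, generations=100, pop_size=40,
           crossover_rate=0.35, mutation_rate=1 / 12):
    """Search for a high-fitness target sequence with a simple genetic loop.

    target_counts[i] is the number of candidate targets on screen i; a
    chromosome holds one target index per screen; `fitness` is maximized.
    """
    def random_chromosome():
        return [random.randrange(k) for k in target_counts]

    population = [random_chromosome() for _ in range(pop_size)]
    for _ in range(generations):
        # Elitist selection: keep the better half as parents/survivors.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            if random.random() < crossover_rate:   # one-point crossover
                cut = random.randrange(1, len(a))
                child = a[:cut] + b[cut:]
            else:
                child = a[:]
            # Per-gene mutation: replace with a random target index.
            child = [random.randrange(target_counts[i])
                     if random.random() < mutation_rate else gene
                     for i, gene in enumerate(child)]
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)

# Demo with a toy fitness that prefers every gene equal to 1.
random.seed(0)
best = evolve([3] * 8, lambda c: -sum(abs(g - 1) for g in c))
```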
To evaluate the cost function value for a sequence (chromosome),
the sequence was first used to prepare a calibration model from
the [t*_i, e_i] pairs. Linear regression with the Levenberg-Marquardt
algorithm was used to prepare the model separately for the vertical
and horizontal directions:

CalFun_x(e) = A_x * e_x + B_x * e_y + C_x    (4)
CalFun_y(e) = A_y * e_x + B_y * e_y + C_y    (5)

The created model may subsequently be used to calculate a gaze
point g_i based on eye tracker data e_i (equation 1).
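A model of this linear form can be fitted with ordinary least squares, which for a linear model yields the same coefficients that an iterative solver such as Levenberg-Marquardt converges to; this is a sketch under that assumption, with illustrative names:

```python
import numpy as np

def fit_calibration(eye_points, target_points):
    """Fit CalFun_x and CalFun_y (equations (4)-(5)) by least squares."""
    E = np.asarray(eye_points, dtype=float)         # shape (n, 2): e_x, e_y
    T = np.asarray(target_points, dtype=float)      # shape (n, 2): t_x, t_y
    design = np.column_stack([E, np.ones(len(E))])  # rows [e_x, e_y, 1]
    coeffs, *_ = np.linalg.lstsq(design, T, rcond=None)  # shape (3, 2)

    def cal_fun(e):
        e = np.asarray(e, dtype=float)
        return np.append(e, 1.0) @ coeffs           # (g_x, g_y)

    return cal_fun

# Recover a known affine mapping g = 2*e + (10, 20) from 4 samples.
eyes = [(0, 0), (1, 0), (0, 1), (1, 1)]
gaze = [(10, 20), (12, 20), (10, 22), (12, 22)]
cal = fit_calibration(eyes, gaze)
print(cal((0.5, 0.5)))  # approximately [11. 21.]
```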
Two different cost functions were used. The first one evaluated
the model generated for the sequence (chromosome) using
genuine reference gaze points r(1), r(2), ..., r(M) and the corresponding
eye tracker output e_r(1), e_r(2), ..., e_r(M). The model
quality was assessed by comparing r with the model output
g_r = CalFun(e_r). Therefore, we had

GenFunction(r, g_r) = (R^2_{gx}(r, g_r) * R^2_{gy}(r, g_r))^2    (6)

where R^2_{gx}(r, g_r) and R^2_{gy}(r, g_r) are coefficients of determination
calculated using equation (2), separately for the horizontal and vertical
axes. The best sequence found using this function may be
considered the correct one, as the optimization algorithm uses the
correct gaze points for the evaluation. Therefore this sequence is
called the genuine sequence in the subsequent text.
The second cost function did not use any reference points and took
into account only pairs of t* and the corresponding e. In other
words, we had

FitFunction(t^*, g) = (R^2_{mx}(t^*, g) * R^2_{my}(t^*, g))^2    (7)

where R^2_{mx}(t^*, g) and R^2_{my}(t^*, g) are coefficients of determination
calculated using equation (3), separately for the horizontal and vertical
axes. The best chromosome (sequence) found for this function is
called the fittest sequence in the subsequent text.
Our hypothesis was that a model built using the correct targets
should be "easier" to calculate, so we expected that the R^2_m value
for this model should be higher than for models built using improper
targets. This means that a sequence determined using
FitFunction as the cost function should be close to the appropriate
one. This technique is similar to the regression-based approach
presented in [Kasprowski and Harezlak 2015]. During the experimental
part of the research this assumption was checked against
real data.
5 Experiment
To check whether the algorithms described above may be used for
implicit calibration, an experiment was conducted with the
EyeTribe eye tracker, registering data at 60 Hz. Because
it is not possible to work with the EyeTribe without a calibration,
the device was first calibrated by a person who did not
take part in the subsequent steps of the experiment. Then 43 different
participants played a simple game during which their eye movements
were registered. One game run was called a trial; in this way 43
trials were collected.
The game scenario was as follows. There were two kinds of objects
moving on the screen: the "good guys" and the "bad guys". The
task for a participant was to use the mouse pointer to shoot down
as many bad guys as possible. A hit object disappeared and a
new one was created in a random place. There were always about
10 objects visible on the screen. The whole recording lasted 60
seconds and, when the game finished, a score was calculated taking
into account the number of killed bad guys and good guys as well as
the number of bad guys that escaped off the screen. A 26-inch
display was used and the approximate distance from the screen was
about 50 cm for all participants.
Our fundamental assumption was that participants follow the objects
on the screen with their eyes. Thus, for every moment in a trial when
the participant's eyes are in a fixation or smooth pursuit, a list of
possible targets may be calculated. This list may subsequently
be used as input to the sequence-finding algorithm.
Another assumption was that people look where they click (as
assumed in [Hornof and Halverson 2002]), especially when a
target is small and moving, which was the case in the experiment.
So, information about gaze points during mouse clicks
may be used as a reference to estimate the quality of the calibration
model (which did not use this information).
It was expected that about 3600 eye positions would be available
for every trial (at 60 Hz over a 60-second recording).
However, the experiment was conducted "in the wild": participants
Table 1: Average errors for the fittest and genuine sequences,
calculated for each trial

Direction    Fittest         Genuine
horizontal   88.5 (33.87)    82.3 (34.7)
vertical     106.24 (59.0)   82.56 (41.39)
just came and played the game, and the only initial setup was a
preliminary check of whether the participant's eyes were visible to
the eye tracker's camera. Therefore, it happened that the eye tracker
could not locate the eyes during a trial and was unable to provide
data. We decided to exclude from subsequent experiments trials for
which fewer than 2300 eye positions were recorded. This resulted in
the exclusion of 8 trials; only the 35 remaining were used in further
analysis.
6 Data processing
Before the sequence search algorithm was run, some data preprocessing
steps were performed for each trial independently. The first
step was the extraction of screens. A screen was defined for every
timestamp at 100 ms intervals. For every screen the locations
of the objects (good and bad guys) were calculated and added as a set
of targets. Then eye tracker data before and after the timestamp was
used to calculate the value of e_i for the screen. A classic velocity
threshold was used to choose only those recordings that belong to
fixations or smooth pursuits. Screens for which it was impossible
to find at least 10 recordings were removed. Finally, some
number k of targets (t_i^1 ... t_i^k) and an eye tracker output e_i were defined
for each screen i. Then the genetic algorithm, using the cost function
defined in equation (7), was used to find the fittest sequence of
targets.
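A velocity-threshold filter of the kind mentioned above can be sketched as follows; the threshold value and function name are our own illustrative choices, not the paper's parameters:

```python
def stable_samples(samples, max_step=0.05):
    """Keep samples whose displacement from the previous sample is small,
    i.e. likely fixation or smooth-pursuit samples (a minimal
    velocity-threshold filter; with a fixed sampling rate, thresholding
    the per-sample displacement is equivalent to thresholding velocity)."""
    kept = []
    for (x0, y0), (x1, y1) in zip(samples, samples[1:]):
        if ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 < max_step:
            kept.append((x1, y1))
    return kept

# A saccade-like jump between two stable stretches is filtered out.
raw = [(0.10, 0.10), (0.11, 0.10), (0.12, 0.11), (0.60, 0.60), (0.61, 0.60)]
print(stable_samples(raw))  # the (0.60, 0.60) landing sample is dropped
```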
The next step was the search for the genuine sequence. Mouse click
locations with timestamps were extracted as reference points (r_i).
For each mouse click location, eye tracker measurements close in
time and belonging to a fixation or smooth pursuit were used
to estimate e_i, the eye tracker output for this click. This resulted in
a list of genuine pairs [r_i, e_i], which were subsequently used by
the cost function to evaluate each target sequence according to
equation (6). The sequence for which the calibration model
gave the best results was returned by the genetic algorithm and
treated as the genuine one.
7 Results
The first task of the experimental part was to compare the fittest
sequence found for each trial with its genuine counterpart. If the
fittest sequence turned out to be similar to the genuine one, it would
indicate that the fitness function (equation (7)) may be used for
sequence optimization and that no additional data (like mouse clicks)
is necessary. The absolute error formula was used to calculate the error
for the model built using the sequence, taking into account M click points
(r_i) and the corresponding model output (g_i) in both the horizontal and
vertical directions (equation (8)):
Error(t^*) = \sqrt{\frac{\sum_{i}^{M} (r_i - g_i)^2}{M - 1}}    (8)
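Equation (8) transcribes directly; this is a sketch of our own (the function name `model_error` is illustrative):

```python
def model_error(reference, gaze):
    """Per-axis error of equation (8): sqrt(sum((r_i - g_i)^2) / (M - 1))."""
    m = len(reference)
    return (sum((r - g) ** 2 for r, g in zip(reference, gaze)) / (m - 1)) ** 0.5

print(model_error([0.0, 1.0, 2.0], [0.0, 1.0, 2.0]))  # perfect model -> 0.0
```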
The averaged results for all 35 trials are presented in Table 1. Analyzing
these results, it may be noticed that the fittest sequence gives
higher errors than the genuine one. Such findings were predictable,
because the fittest sequence optimization did not use any information
about the genuine points. On the other hand, the results for both
sequences are comparable. Although for the vertical direction
the genuine sequence is significantly better than the
fittest one (p=0.023), the difference between these sequences in the
horizontal direction is not even significant (p=0.22). This leads to the
conclusion that the fittest sequence may be used for calibration
when obtaining the genuine one is not possible due to a lack of
reference points.
7.1 Clicked target prediction
A test checking whether a model created using the fittest sequence
may be applied to predict the next target to be clicked was performed
as well. First, the gaze points g were calculated using the
calibration model obtained for the fittest sequence. Then, for every
click c(t) the corresponding gaze point g(t) was found and the
target closest to this gaze point was chosen. If the selected target
was the same as the one actually clicked, it was treated as a success. This
calculation was repeated for each trial separately; there were on
average 82.5 (±18.6) clicks per trial. The summarized
results showed that it was possible to predict targets correctly for
2181 out of 2886 clicks, which gives an accuracy of 75.6% (±
15.3 for a single trial).
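The nearest-target rule used in this test can be sketched as follows; the function and the identity model in the demo are our own illustrations:

```python
def predict_clicked_target(cal_fun, eye_point, targets):
    """Map an uncalibrated sample through the calibration model and pick
    the on-screen target nearest to the resulting gaze point."""
    gx, gy = cal_fun(eye_point)
    return min(targets, key=lambda t: (t[0] - gx) ** 2 + (t[1] - gy) ** 2)

# With an identity model, the sample (0.9, 1.1) is nearest to target (1, 1).
identity = lambda e: e
print(predict_clicked_target(identity, (0.9, 1.1), [(0, 0), (1, 1), (5, 5)]))
```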
These satisfactory results were obtained when all screens from a
trial were taken into account. Because the target prediction requires
some computation, we decided to check the possibility of
improving performance by reducing the number of screens while
maintaining a good prediction rate. In this experiment,
only screens from the first X seconds were taken into
account while building a calibration model to predict all clicked
targets. The results are presented in Figure 2.
Figure 2: Accuracy of clicked target prediction depending on the
recording duration (vertical axis: accuracy of prediction; horizontal
axis: number of seconds, from 1 to 60).
The results show that after just 20 seconds of recording it is possible
to correctly predict, with accuracy over 70%, which target is about
to be clicked, and only 6 seconds are required to obtain a prediction
accuracy over 50%.
8 Discussion
The results described above are satisfactory; however, the exact
numbers achieved are strictly correlated with the experiment scenario.
In our case the game was very dynamic,
with a lot of short fixations and a lot of clicks (about 1.4 clicks per
second on average). Because most targets were moving, there were
in fact more smooth pursuits than fixations in the recorded data.
This gives a potential possibility to use information about the direction
and velocity of the movements to better predict targets, similarly
to [Pfeuffer et al. 2013]. Additionally, there were always about 10
targets on every screen. It may be expected that in a less dynamic
scenario, with fewer targets to deal with, the results should be
better.
On the other hand, the main drawback of the method is that it
requires considerably complex computations. For our purpose we
need only several seconds of recording, but to choose a correct
sequence the genetic algorithm must calculate a lot of models. In our
case it was 1000 chromosomes per generation multiplied by
1000 evolution steps. It took about 1 minute for recordings lasting
10 seconds and almost 6 minutes for recordings lasting 60 seconds
on our laboratory computer (Intel Xeon 3.1 GHz with
8 GB RAM). Therefore, some more sophisticated heuristics will be
examined in our further research to find the best solution faster.
9 Conclusion
The paper describes an algorithm supporting the implicit calibration
of eye movement recordings. The algorithm does not require
any cooperation from users; it uses only information about the
stimulation and an uncalibrated eye tracker output. The correctness
of the algorithm was confirmed during experiments involving
35 people. The results obtained showed that it is useful in improving
the calibration process. Moreover, the experiments presented in the paper
showed that it is possible to obtain meaningful data after only 10
seconds of recording. The main advantage of the algorithm is that
it does not require any explicit feedback from users. Only the locations
of possible targets are needed, so it is quite general and may be
used in many eye tracking scenarios.
References
BROLLY, X. L., AND MULLIGAN, J. B. 2004. Implicit calibration
of a remote gaze tracker. In Computer Vision and Pattern
Recognition Workshop, 2004. CVPRW'04. Conference on, IEEE,
134–134.

HANSEN, D. W., AGUSTIN, J. S., AND VILLANUEVA, A. 2010.
Homography normalization for robust gaze estimation in uncalibrated
setups. In Proceedings of the 2010 Symposium on Eye-Tracking
Research & Applications, ACM, 13–20.

HORNOF, A. J., AND HALVERSON, T. 2002. Cleaning up systematic
error in eye-tracking data by using required fixation locations.
Behavior Research Methods, Instruments, & Computers
34, 4, 592–604.

KASPROWSKI, P., AND HAREZLAK, K. 2015. Using non-calibrated
eye movement data to enhance human computer interfaces.
In Intelligent Decision Technologies. Springer, 347–356.

PFEUFFER, K., VIDAL, M., TURNER, J., BULLING, A., AND
GELLERSEN, H. 2013. Pursuit calibration: Making gaze calibration
less tedious and more flexible. In Proceedings of the
26th annual ACM symposium on User interface software and
technology, ACM, 261–270.

VADILLO, M. A., STREET, C. N., BEESLEY, T., AND SHANKS,
D. R. 2014. A simple algorithm for the offline recalibration
of eye-tracking data through best-fitting linear transformation.
Behavior Research Methods, 1–12.

VILLANUEVA, A., AND CABEZA, R. 2008. A novel gaze estimation
system with one calibration point. Systems, Man, and
Cybernetics, Part B: Cybernetics, IEEE Transactions on 38, 4,
1123–1138.

ZHANG, Y., AND HORNOF, A. J. 2011. Mode-of-disparities error
correction of eye-tracking data. Behavior Research Methods 43,
3, 834–842.

ZHANG, Y., AND HORNOF, A. J. 2014. Easy post-hoc spatial
recalibration of eye tracking data. In ETRA, 95–98.