A fully-automatic, temporal approach to single camera, glint-free
3D eye model fitting

Lech Świrski and Neil Dodgson
University of Cambridge
We present a 3D eye model fitting algorithm for use in gaze estimation that operates
on pupil ellipse geometry alone. It works with no user calibration and does not
require calibrated lighting features such as glints. Our algorithm is based on fitting
a consistent pupil motion model to a set of eye images. We describe a non-iterative
method of initialising this model from detected pupil ellipses, and two methods of
iteratively optimising the parameters of the model to best fit the original eye
images. We also present a novel eye image dataset, based on a rendered simulation,
which gives a perfect ground truth for gaze and pupil shape. We evaluate our approach
using this dataset, measuring both the angular gaze error (in degrees) and the pupil
reprojection error (in pixels), and discuss the limitations of a
user-calibration-free approach.

Keywords: gaze estimation, eye model, pupil detection, glint-free
Introduction
Camera-based eye tracking approaches are normally divided into two stages: eye/pupil
detection, and gaze estimation based on the eye or pupil information. Eye/pupil
detection algorithms normally work in 2D image space; a common approach is to detect
the pupil as an ellipse in 2D. Gaze estimation algorithms then attempt to convert
this eye/pupil information into a gaze vector or point of regard.
Most gaze estimation algorithms can be classified into two groups: regression-based
and model-based. Regression-based algorithms assume that there is some unknown
relationship between the detected eye parameters and the gaze. They then approximate
this relationship using some form of regression — often this is polynomial
regression, although there are also approaches using neural networks. Model-based
approaches instead attempt to model the eye and thus the gaze. Given the detected
eye/pupil parameters, these adjust the model (e.g. rotate the eye) to fit the data,
and output the gaze information. We refer the reader to the in-depth survey by
Hansen and Ji (2010).
Both approaches require some form of personal calibration, either to find the
parameters of the regression function or to fit the eye model to the current user.
Normally this consists of an interactive calibration, where the user is asked to
look at several points at known locations; for example, many studies use a 9-point
grid calibration. Model-based approaches often use anthropomorphic averages to
decrease the number of variable parameters; however, they normally still require
some amount of calibration.
Additionally, many approaches use glints or Purkinje images as additional data in
either the calibration (e.g. to obtain corneal curvature) or inference (e.g. using
the glint–pupil vector rather than the pupil position alone). These are reflections
of a light source from various parts of the cornea or lens. However, these require
one or more calibrated light sources, and can be fairly difficult to detect under
uncontrolled lighting conditions. Some approaches also use multiple cameras to build
a 3D model of the eye from stereo information.

However, there are many use cases where such high quality, controlled and calibrated
lighting is not available. There is an increasing amount of research into
"home-made" cheap eye-trackers, where a webcam is mounted on glasses frames, and
illuminated either by visible light or IR LEDs (Agustin, Skovsgaard, Hansen, &
Hansen, 2009; Chau & Betke, 2005; Tsukada, Shino, Devyver, & Kanade, 2011). Figure 1
is an example of such an eye-tracker, built in our lab by the first author. In such
systems, it can be difficult or even impossible to calibrate the positions of the
lights, so glint-free approaches must be used.
We present a model-based gaze estimation approach that does not require any
interactive calibration from the user, only requires a single camera, and does not
require calibrated lights for glint information. Instead, we only require multiple
images of the pupil from a head-mounted camera.
We evaluate our approach using a new dataset, created by rendering a
highly-realistic 3D model of the eye and surrounding areas. Using a rendering rather
than real video allows us to calculate the ground truth with perfect accuracy,
rather than relying on another form of measurement. We also discuss the limitations
of a fully-automatic system.

Świrski, L. and Dodgson, N. (2013). A fully-automatic, temporal approach to single
camera, glint-free 3D eye model fitting.

Figure 1. An example of a cheap, head-mounted eye tracker, built in our lab, which
uses webcams and roughly positioned IR LED lighting.
Our approach
Our approach is based on the projective geometry of the pupil. After detecting the
pupil as an ellipse in the image, we approximate the orientation and position of the
original circular pupil contour in 3D. By combining information from multiple
frames, we build an eye model based on assumptions on how the motion of the pupil is
constrained.

In our model, we do not consider the offset between the optical axis and the visual
axis — that is, we calculate the gaze vector rather than the sight vector. While
this means that the model cannot be directly used to calculate the point of regard,
the offset between the gaze and sight vectors is constant per individual, and can
either be approximated by anthropomorphic averages or trivially be found using a
single-point calibration.

We also do not consider head motion. Since we assume that the camera is
head-mounted, we can operate entirely in camera space — this means that the gaze
vector returned is relative to the camera. For a gaze-tracking system that requires
gaze in world space, we expect the camera's position and orientation to be
externally tracked. The transformation from camera space would simply amount to a
single rotation and translation.

Our approach proceeds as follows. We first detect the pupil ellipse in each image
independently. We use our pupil detection algorithm (Świrski, Bulling, & Dodgson,
2012), which gives us a pupil ellipse (fig. 2) and a set of edge points from which
the pupil was calculated.
Once we have the pupil ellipses in all the images, we independently unproject each
one as a circle in 3D. We then combine the information from these 3D circles to
create a rough model of the pupil motion. Finally, we optimise the model parameters
to best fit the original image data.

Figure 2. The pupil ellipse, in magenta, detected by our pupil tracking algorithm
(Świrski et al., 2012).

Our approach of using a 3D eye model combined with an optimisation step is similar
to Tsukada et al. (2011). However, this work manually sets the 3D eye model
parameters and only refines the per-frame pupil parameters, whereas we automatically
estimate and refine all of the parameters.

We describe the stages of our approach in detail in the following sections.
Two-circle unprojection
The first stage of our algorithm 'unprojects' each pupil ellipse into a 3D pupil
circle — that is, we find a circle whose projection is the given ellipse. Many
approaches simplify this unprojection by assuming a scaled orthogonal or weak
perspective projection model, where the unprojection can be calculated using simple
trigonometry (Schnieders, Fu, & Wong, 2010; Tsukada et al., 2011). However, weak
perspective is only an approximation to full perspective, and it is valid only for
distant objects that lie close to the optical axis of the camera. When the pupil is
close to the camera, or far away from the optical axis of the camera, the weak
perspective approximation begins to fail. Instead, we assume a full perspective
projection with a pinhole camera model.

Under a full perspective projection model, the space of possible projections of the
pupil circle can be seen as a cone with the pupil circle as the base and the camera
focal point as the vertex. The pupil ellipse is then the intersection of this cone
with the image plane.

This means that the circular unprojection of the ellipse can be found by
reconstructing this cone, using the ellipse as the base. The circular intersection
of this cone will then be the pupil circle (fig. 3). We find the circular
intersection using the method of Safaee-Rad, Tchoukanov, Smith, and Benhabib (1992).
This unprojection operation gives us a pupil position, a gaze vector, and a pupil
radius

    $\text{pupil circle} = (p, n, r)$                                        (1)

If necessary, the gaze is flipped so that it points 'towards' the camera.

Figure 3. We unproject the pupil ellipse by constructing a cone through the camera
focal point and pupil ellipse on the image plane. We then find circular
intersections of this cone — at any given distance, there are two solutions for the
intersection (green and red).

There are two ambiguities when unprojecting an ellipse. The first is a distance–size
ambiguity; any perspective unprojection from 2D into 3D is ill-posed, and there is
no way to tell if the pupil is small and close, or large and far away. We resolve
this by setting $r$ to an arbitrary value in this stage, and finding the true radius
in later stages of the algorithm.

The second is that there are two solutions when finding the fixed-size circular
intersection of the constructed cone, symmetric about the major axis of the ellipse,
arising from solving a quadratic (shown in green and red in figure 3). At this stage
of the algorithm, we do not disambiguate between these two cases, and return both
solutions, denoted as

    $(p^+, n^+, r),\ (p^-, n^-, r)$                                          (2)
Model initialisation
To estimate the gaze, we consider only the pupil orientation and location, and so we
do not require a full model of the eye. In fact, we wish to model only the pupil and
its range of motion, and not the iris or the eyeball itself.

Therefore, instead of modelling the pupil as a hole in the iris of a spherical
eyeball, we model it as a disk lying tangent to a rotating sphere. This sphere has
the same centre of rotation as the eyeball. The gaze is then the normal of the disk,
or equivalently a radial vector from the sphere centre to the pupil centre.
Sphere centre estimate
For each eye image, we consider the pupil circle $(p_i, n_i, r_i)$, where $i$ is the
index of the image. Given our pupil model, we wish to find a sphere which is tangent
to every pupil circle. Since each pupil circle is tangent to the sphere, the normals
of the circles, $n_i$, will be radial vectors of the sphere, and thus their
intersection will be the sphere's centre.

However, there are two problems with this approach, corresponding to the two
ambiguities in the ellipse unprojection. Firstly, we do not know the true 3D
position of each pupil circle, only the position under the assumption that the pupil
is of a certain size. If the pupil radius $r_i$ did not change between frames, the
relative unprojected positions would be correct up to scale. This is the case for
approaches which use the iris contour rather than the pupil contour. However, due to
pupil dilation, we cannot make this assumption.

Secondly, we have two circles for each ellipse, rather than one, and we do not know
which one of the two circles — $(p^+_i, n^+_i, r)$ or $(p^-_i, n^-_i, r)$ — is
correct.
We resolve both of these problems by considering the intersection of the projected
normal vectors $\tilde{n}^\pm_i$ in 2D image space rather than 3D world space. In
this case, the distance–size ambiguity disappears by construction, as we are using
the same projection which originally introduced it. The two-circle ambiguity also
disappears: both projected normal vectors are parallel:

    $\tilde{n}^+_i \propto \tilde{n}^-_i$                                    (3)

Similarly, the line between the two projected circle centres, $\tilde{p}^+_i$ and
$\tilde{p}^-_i$, is parallel to $\tilde{n}^\pm_i$. This means that

    $\exists s, t \in \mathbb{R}.\ \tilde{p}^+_i = \tilde{p}^-_i + s\tilde{n}^+_i = \tilde{p}^-_i + t\tilde{n}^-_i$    (4)

which means that we can arbitrarily choose either one of the two solutions for this
stage.
We thus find the projected sphere centre $\tilde{c}$ by calculating an intersection
of lines. These lines correspond to the projected gaze: each line passes through the
projected pupil centre and lies parallel to the projected pupil normal (figure 4).
Formally, we find the intersection of the set of lines $L_i$, where

    $L_i = \{\, (x, y) = \tilde{p}_i + s\tilde{n}_i \mid s \in \mathbb{R} \,\}$    (5)
Since there may be numerical, discretisation or measurement error in these vectors,
the lines will almost certainly not intersect at a single point. We instead find the
point closest to each line in a least-squares sense, by calculating

    $\tilde{c} = \Big( \sum_i (I - \tilde{n}_i \tilde{n}_i^T) \Big)^{-1} \Big( \sum_i (I - \tilde{n}_i \tilde{n}_i^T)\, \tilde{p}_i \Big)$    (6)
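Equation (6) is the standard least-squares nearest point to a set of lines. As an
illustrative sketch (not the paper's own code; the function name is ours), it can be
written in Python with NumPy:

```python
import numpy as np

def nearest_point_to_lines(points, directions):
    """Least-squares nearest point to a set of lines, as in eq. (6).

    Line i passes through points[i] with direction directions[i].
    Minimises the sum of squared perpendicular distances to all lines.
    Works in any dimension (2D here, for the projected gaze lines).
    """
    dim = points.shape[1]
    A = np.zeros((dim, dim))
    b = np.zeros(dim)
    for p, n in zip(points, directions):
        n = n / np.linalg.norm(n)          # ensure unit direction
        M = np.eye(dim) - np.outer(n, n)   # projector onto the line's normal space
        A += M
        b += M @ p
    return np.linalg.solve(A, b)
```

For two lines that do intersect exactly (e.g. the x-axis and the vertical line
x = 1), the result is their true intersection; for noisy gaze lines it is the point
minimising the summed squared distances.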
Figure 4. To find the centre of the sphere, we intersect the projected gaze vectors
of each pupil. The projected sphere centre is estimated to be near the intersection
of these lines.

We then unproject the projected sphere centre $\tilde{c}$ to find the 3D sphere
centre $c$. Again, there is a size–distance ambiguity in the unprojection, which we
resolve by fixing the $z$ coordinate of $c$.
Sphere radius estimate
Once we have the projected sphere centre $\tilde{c}$, we note that each pupil's
normal $n_i$ has to point away from the sphere centre $c$:

    $n_i \cdot (p_i - c) > 0$                                                (7)

and therefore the projected normal $\tilde{n}_i$ has to point away from the
projected centre $\tilde{c}$:

    $\tilde{n}_i \cdot (\tilde{p}_i - \tilde{c}) > 0$                        (8)

Furthermore, since they are symmetric about the major axis of the ellipse,
$\tilde{n}^+_i$ and $\tilde{n}^-_i$ point in opposite directions, which means that
one will point towards $\tilde{c}$, and one will point away from it. This allows us
to resolve the two-circle ambiguity, by choosing the circle $(p_i, n_i, r_i)$ whose
projected normal $\tilde{n}_i$ points away from $\tilde{c}$.
We can now use the unprojected pupils to estimate the sphere radius $R$. Since each
pupil lies on the sphere, this should be the distance between $p_i$ and $c$ —
however, once again, due to the distance–size ambiguity in the pupil unprojection,
and the potentially changing actual pupil size, we cannot use $p_i$ directly.

Instead, we consider a different candidate pupil centre $\hat{p}_i$, which is
another possible unprojection of $\tilde{p}_i$, but potentially at a different
distance. This means that this point has to lie somewhere along the line of possible
unprojections of $\tilde{p}_i$, which is the line passing through $p_i$ and the
camera centre.
Figure 5. For each candidate pupil circle (red), we consider the possible
unprojections of its centre $p_i$ (orange line). We intersect this line with the
gaze line from the sphere centre (blue line) to find $\hat{p}_i$ (orange cross),
which is the centre of a circle that lies tangent to the sphere (dashed red). The
distance from the sphere centre (blue cross) to $\hat{p}_i$ is the sphere radius.
We want the circle $(\hat{p}_i, n_i, \hat{r}_i)$ to be consistent with our
assumptions: that the circle lies tangent to the sphere whose centre is $c$. This
means that we want $n_i$ to be parallel to the line from $c$ to $\hat{p}_i$.

Given these two constraints, $\hat{p}_i$ can be found by intersecting the line from
$c$ to $\hat{p}_i$ — the gaze line from the centre of the sphere — with the line
passing through $p_i$ and the camera centre — the projection line of $p_i$ (figure
5). Since lines in 3D almost certainly do not cross, we once again find the
least-squares intersection point.

The sphere radius $R$ is then calculated by finding the mean distance from the
sphere centre to each pupil centre:

    $R = \mathrm{mean}\big(\{\, R_i = \| \hat{p}_i - c \| \mid \forall i \,\}\big)$    (9)
Consistent pupil estimate
Given the sphere, calculated above, we want the pupil circles to lie tangent to its
surface. For each pupil, we wish for its centre to lie on the sphere, and for its
projection to be $\tilde{p}_i$. Due to the distance–size ambiguity, it is almost
certain that the circle $(p_i, n_i, r_i)$ will not lie on the sphere. We therefore
want to calculate a new circle, $(p'_i, n'_i, r'_i)$, where

    $p'_i = s p_i$                                                           (10)
    $p'_i = c + R n'_i$                                                      (11)
    $r'_i / z'_i = r_i / z_i$                                                (12)
The last of these defines $r'_i$ as $r_i$ scaled by perspective.

To find $p'_i$, we wish to find a value of $s$ such that $s p_i$ lies on the surface
of the sphere $(c, R)$, which can be calculated as a line–sphere intersection. This
gives two solutions, of which we take the nearest. $n'_i$ and $r'_i$ are then
trivially calculated. Note that as part of this process, we discard the original
gaze vector $n_i$.
Once these steps are taken, we have a rough model of the pupil motion, where every
pupil circle lies tangent to the surface of a certain sphere.
Model optimisation
We parametrise our model using $3 + 3N$ parameters, where $N$ is the number of pupil
images. These parameters represent the 3D position of the sphere, $c$, and, for each
pupil, its position on the sphere (as two angles, $\theta$ and $\psi$) and its
radius $r$.
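The per-pupil parameters can be decoded into a circle pose as follows. The specific
angle convention here (polar $\theta$, azimuth $\psi$) is our own assumption for
illustration; the paper only states that two angles place the pupil on the sphere:

```python
import numpy as np

def pupil_on_sphere(c, R, theta, psi):
    """Decode one pupil's share of the 3 + 3N parametrisation.

    Returns the gaze normal n (unit radial vector of the sphere) and the
    pupil centre p = c + R*n, so the circle is tangent to the sphere by
    construction. Angle convention (theta = polar, psi = azimuth) is an
    illustrative assumption.
    """
    n = np.array([np.sin(theta) * np.cos(psi),
                  np.sin(theta) * np.sin(psi),
                  np.cos(theta)])
    p = c + R * n
    return p, n
```

Because the pupil centre is constructed as $c + Rn$, tangency to the sphere is
enforced by the parametrisation itself rather than by an explicit constraint in the
optimiser.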
From the previous initialisation steps, we obtain a set of parameters approximating
the original detected pupil ellipses. We then wish to optimise these parameters so
that the projected pupil ellipses best fit the original eye image data. The result
of the optimisation is a set of circles in 3D, all of which are constrained to lie
tangent to a sphere. The normals of these circles correspond to the gaze vector of
the user, and the projected ellipses are a good fit for the pupils in the images.
We investigate two metrics for defining the "best fit". The first is a region
comparison, where we attempt to maximise the contrast between the inside and outside
of each pupil. The second is a point distance, where we attempt to minimise the
distance of each pupil ellipse from the pupil edge pixels found by the pupil
tracker.
Region contrast maximisation
The first metric we use is a region contrast metric. Simply put, it requires that
the pupil ellipse be dark on the inside and light on the outside, and for the
difference in brightness of the inside and outside to be maximal.

For each pupil image, we consider a thin band around the interior and exterior of
the pupil ellipse, $R^+$ and $R^-$ (figure 6). These are defined as

    $R^+ = \{\, x \mid x \in \mathbb{R}^2,\ 0 < d(x) \le w \,\}$             (13)
    $R^- = \{\, x \mid x \in \mathbb{R}^2,\ -w < d(x) \le 0 \,\}$            (14)

where $w$ is the width of the band (we use $w = 5$ px), and $d(x)$ is a signed
distance function of a point to the ellipse edge, positive inside the ellipse and
negative outside. As calculating the distance of a point to the ellipse edge is
non-trivial, we use an approximate distance function, described in the later
"Ellipse distance" section.

We then find the mean pixel value of each region — that is, we find

    $\mu^\pm = \mathrm{mean}\{\, I(x) \mid x \in R^\pm \,\}$                 (15)

where $I(x)$ is the image value at $x$.
where I(x) is the image value at x.
Figure 6. To optimise the model, we consider a thin band around the interior ($R^+$)
and exterior ($R^-$) of each pupil ellipse. We maximise contrast by maximising the
difference in the average pixel value inside these two regions.
This can be rewritten as a weighted average

    $\mu^\pm = \dfrac{\int B^\pm(d(x)) \cdot I(x)\, dx}{\int B^\pm(d(x))\, dx}$    (16)

where

    $B^+(t) = \begin{cases} 1 & \text{if } 0 < t \le w \\ 0 & \text{otherwise} \end{cases}$    (18)

    $B^-(t) = \begin{cases} 1 & \text{if } -w < t \le 0 \\ 0 & \text{otherwise} \end{cases}$   (19)

so that $B^\pm(d(x))$ is 1 when $x \in R^\pm$, and 0 otherwise.

$B^\pm(t)$ can then be defined using the Heaviside function

    $B^+(t) = H(t) - H(t - w)$                                               (20)
    $B^-(t) = H(t + w) - H(t)$                                               (21)

This gives a closed definition of "mean pixel value in the region". Our contrast
metric per ellipse is then simply

    $E = \mu^- - \mu^+$                                                      (22)

which increases when the interior region becomes darker, or the exterior region
becomes lighter. We then maximise the contrast over all ellipses; that is, for the
parameter vector $p$ we find

    $\underset{p}{\mathrm{argmax}} \sum_i E_i$                               (23)

We maximise this using gradient ascent — specifically, using the
Broyden–Fletcher–Goldfarb–Shanno (BFGS) method (Nocedal & Wright, 1999).
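The band contrast of eqs. (13)–(22) can be sketched for the simpler case of a
circular boundary, where the signed distance is exact (the paper uses the
approximate ellipse distance of the later "Ellipse distance" section instead; the
function name is ours):

```python
import numpy as np

def region_contrast(image, centre, radius, w=5.0):
    """Band contrast E = mu_minus - mu_plus (eqs. 13-22), circular case.

    The signed distance d is positive inside the boundary and negative
    outside; R+ is the thin interior band (0 < d <= w) and R- the thin
    exterior band (-w < d <= 0). Uses hard band membership rather than
    the smoothed Heaviside weighting used for differentiation.
    """
    ys, xs = np.indices(image.shape)
    d = radius - np.hypot(xs - centre[0], ys - centre[1])  # signed: + inside
    inside = (d > 0) & (d <= w)     # R+, just inside the boundary
    outside = (d > -w) & (d <= 0)   # R-, just outside the boundary
    return image[outside].mean() - image[inside].mean()
```

On a synthetic image with a dark disk (value 0) on a light background (value 1), a
correctly placed boundary gives the maximal contrast E = 1; shifting or resizing the
boundary mixes the bands and lowers E, which is what the BFGS ascent exploits.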
Figure 7. The soft step function, $H_\varepsilon$.
Since we are dealing with discrete pixels rather than a continuous image space, the
above weighted integral becomes a weighted sum over pixels. This introduces
discretisation issues in the differentiation, as the differential of $H(t)$ is 0
everywhere except at $t = 0$.

To ameliorate these issues, we take the approach of Zhao, Chan, Merriman, and Osher
(1996) of using a regularised version of the Heaviside function,
$H_\varepsilon(t)$, which provides a smooth, continuous unit step rather than an
immediate step. We use smootherstep (Ebert, Musgrave, Peachey, Perlin, & Worley,
2002):

    $H_\varepsilon(t) = \begin{cases} 1 & \text{if } t \ge \varepsilon \\ 0 & \text{if } t \le -\varepsilon \\ 6\left(\frac{t+\varepsilon}{2\varepsilon}\right)^5 - 15\left(\frac{t+\varepsilon}{2\varepsilon}\right)^4 + 10\left(\frac{t+\varepsilon}{2\varepsilon}\right)^3 & \text{otherwise} \end{cases}$    (24)

which defines a sigmoid for $t$ between $-\varepsilon$ and $\varepsilon$, and
behaves like the Heaviside function otherwise (figure 7). We use $\varepsilon = 0.5$
px. Note that $\lim_{\varepsilon \to 0} H_\varepsilon(t) = H(t)$.
It is interesting to note what happens as $w$ tends to 0. The integrals over
$B^+(t)$ and $B^-(t)$ tend towards path integrals over the ellipse contour, and so:

    $\lim_{w \to 0} \int B^\pm(d(x))\, dx = \oint 1\, dt$                    (25)
    $\lim_{w \to 0} \int B^\pm(d(x)) \cdot I(x)\, dx = \oint I(x(t))\, dt$   (26)

This means that $\mu^+$ and $\mu^-$ tend towards being the average value of the
pixels along the circumference of the ellipse, $C$:

    $\lim_{w \to 0} \mu^\pm = \frac{1}{C} \oint I(x(t))\, dt$                (27)

As $\mu^+$ and $\mu^-$ approach the ellipse boundary from opposite "sides", their
difference (scaled by $w$) can be interpreted as a differential of this average
value across the ellipse boundary:

    $\lim_{w \to 0} \frac{E}{w} = \lim_{w \to 0} \frac{\mu^- - \mu^+}{w} = \frac{\partial}{\partial r} \left( \frac{1}{C} \oint I(x(t))\, dt \right)$    (28)

This bears a strong resemblance to the integro-differential operator used by Daugman
for pupil and iris detection (Daugman, 2004).

Figure 8. The edge pixels for a detected pupil ellipse. We wish to minimise the
distance of all of these edge points to their respective ellipses.
Edge distance minimisation
The second metric we use is an edge pixel distance minimisation. From the pupil
detection, we obtain a list of edge pixel locations which were used to define the
given pupil ellipse (fig. 8). Then, for each reprojected pupil ellipse, we wish to
minimise the distances to the corresponding edge pixels.

We minimise these distances using a least-squares approach. For each ellipse, we
consider the set of edge pixels $E = \{e\}$. We wish to minimise the squared
distance of each edge pixel to the ellipse edge; that is, we wish to minimise

    $\sum_{e \in E} d(e)^2$                                                  (29)

where $d(x)$ is a signed distance function of a point $x$ to the ellipse edge,
defined in the next section.

We then wish to optimise the parameters to minimise this distance over all ellipses:

    $\underset{p}{\mathrm{argmin}} \sum_i \sum_{e \in E_i} d(e)^2$           (30)

Note that this is a minimisation of a sum of squares. Thus, we can use a least
squares minimisation algorithm — we use the Levenberg–Marquardt implementation in
Ceres Solver, a C++ least squares minimisation library (Agarwal & Mierle, n.d.).
Ellipse distance
Both metrics above require a function $d(x)$ which is a signed distance of the point
$x$ to the ellipse edge, positive inside the ellipse and negative outside.

Calculating the true Euclidean distance of a point to an ellipse is computationally
expensive, requiring the solution of a quartic. Instead, we transform the image
space so that the ellipse becomes the unit circle, find the signed distance to this
circle, and scale by the major radius of the ellipse to get an approximate pixel
distance.

While this does not give a true Euclidean distance, it is a good approximation for
ellipses with low eccentricity, and we have found it to be sufficient for our needs.
It is also very efficient to calculate this distance over a grid of pixels, as the
transformation can be represented by a matrix multiplication, and this matrix can be
precomputed for each ellipse rather than being computed for every pixel.
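A minimal sketch of this approximate signed distance (our own function name and
parameter layout; the paper precomputes the transform as a matrix per ellipse rather
than per point):

```python
import numpy as np

def ellipse_signed_distance(x, centre, major, minor, angle):
    """Approximate signed distance of point x to an ellipse edge.

    Transforms image space so the ellipse becomes the unit circle, takes
    the signed distance to that circle, and scales by the major radius:
    positive inside the ellipse, negative outside. Not a true Euclidean
    distance, but a good approximation at low eccentricity.
    """
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, s], [-s, c]])            # rotate into the ellipse frame
    u = R @ (np.asarray(x, float) - np.asarray(centre, float))
    u = u / np.array([major, minor])           # ellipse -> unit circle
    return major * (1.0 - np.linalg.norm(u))
```

For a circle (major = minor) the result is the exact signed Euclidean distance; for
eccentric ellipses it over- or under-estimates the distance near the minor axis,
which the paper reports as acceptable for its purposes.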
Automatic differentiation
For both of the above optimisations, we need to calculate the gradient of the metric
with respect to the original parameters. To avoid both having to calculate the
gradient function by hand, and the inaccuracy of numeric differentiation, we employ
automatic differentiation (Rall, 1981).

Notice that any implementation of a function will, ultimately, be some composition
of basic operations, such as addition, multiplication or exponentiation, or
elementary functions such as sin, cos or log. For any of these, we can already
calculate the derivative of any value trivially; by repeatedly applying the chain
rule, we can then calculate the value of the derivative of any arbitrary composition
of these functions.

Automatic differentiation works based on this observation. Every basic operation or
elementary function is redefined to take a pair of inputs: a value and its
differentials. These functions are then rewritten to simultaneously compute both the
output value and its differentials, applying the chain rule to the input
differentials where appropriate. For example, the sin function

    $f(x) = \sin(x)$                                                         (31)

is rewritten as the simultaneous calculation of value and gradient:

    $f\left(x, \frac{dx}{dp}\right) = \left( \sin(x),\ \cos(x)\frac{dx}{dp} \right)$    (32)

Thus, for every function, the values of the differentials of that function can be
calculated at the same time as the value of the function itself, by repeating the
above reasoning. In particular, this can be done for our metric functions. Each
parameter is initialised with a differential of 1 with respect to itself, and 0 with
respect to all other differentials; passing these annotated parameters through the
metric function gives a vector of differentials of the function with respect to each
parameter. This can be implemented simply in C++ using operator overloading and
function overloading.

Note that, while the individual basic function differentiations are performed
manually (that is, it is the programmer who specifies that
$\frac{d}{dp}\sin(x) = \cos(x)\frac{dx}{dp}$), this is not symbolic
differentiation. This is because symbolic differentiation operates on a symbolic
expression of the entire function, and returns a symbolic expression describing the
differential, whereas automatic differentiation only operates on the values of the
differential at any given point.
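The paper implements this in C++ with operator and function overloading; the same
idea can be sketched in Python as a minimal dual-number class (our own illustration,
supporting only addition, multiplication and sin):

```python
import math

class Dual:
    """Minimal forward-mode automatic differentiation: each value carries
    its derivative with respect to one chosen parameter p."""

    def __init__(self, val, dot=0.0):
        self.val = val   # the value x
        self.dot = dot   # the differential dx/dp

    def _lift(self, other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        o = self._lift(other)
        return Dual(self.val + o.val, self.dot + o.dot)

    def __mul__(self, other):
        o = self._lift(other)
        # product rule: (xy)' = x'y + xy'
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)

def sin(x):
    # chain rule, as in eq. (32): d/dp sin(x) = cos(x) * dx/dp
    if isinstance(x, Dual):
        return Dual(math.sin(x.val), math.cos(x.val) * x.dot)
    return math.sin(x)
```

Seeding a parameter with `Dual(x, 1.0)` (differential 1 with respect to itself) and
evaluating, say, `sin(x) * x` yields both the function value and its exact
derivative $\cos(x)\,x + \sin(x)$ in one pass.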
Performing the differentiation as described above would increase the runtime by a
factor of $O(3 + 3N)$, as we would have to calculate the differential with respect
to all $3 + 3N$ parameters. However, the bulk of each metric calculation is
performed on each pupil independently. Since pupils are independent of each other —
that is, each pupil only depends on the 3 sphere parameters and its own 3 parameters
— we can pass only the 6 relevant parameters to the part of the calculation dealing
with only that pupil, hence only imparting a constant runtime increase for the bulk
of the calculation.
Evaluation
We evaluate our algorithm by calculating the gaze error and pupil reprojection error
on a ground truth dataset. We evaluate both optimisation metrics, and compare them
to naïve ellipse unprojection.
Ground truth dataset
To evaluate our algorithm, we chose to work entirely under simulation, so that we
would have perfect ground truth data. We took a model of a head from the internet
(Holmberg, 2012), modified it slightly in Blender to add eyelashes and control of
pupil dilation, and changed the texture to be more consistent with infra-red eye
images. We then rendered an animation of the eye looking in various directions, with
the pupil dilating and constricting several times (fig. 9). This dataset, as well as
the 3D model used to generate it, are publicly available.¹

It is often argued that simulation can never accurately emulate all the factors of a
system working in real life; however, we argue that simulated data is perfectly
appropriate for evaluating this sort of system.

Firstly, using simulated images gives a fully controlled environment; in particular,
measurements of gaze direction, pupil size or camera position can be specified
precisely. In real images, on the other hand, this information is obtained by
additional measurement, by using another system, assumptions on where

¹ http://www.cl.cam.ac.uk/research/rainbow/projects/eyemodelfit/
Figure 9. Examples of rendered images from our ground truth dataset.
Figure 10. A comparison of the angular gaze error, in degrees, of naïve unprojection
and our approach.

Figure 11. A comparison of the pupil ellipse reprojection error, in pixels, of the
pupil tracker input and our approach.

Table 1
Mean and standard deviation of the angular gaze error.

                                    Mean      Std. dev.
  Ellipse Unprojection              3.5558    4.2385
  Unoptimised Model                 2.6890    1.5338
  Region Contrast Maximisation      2.2056    0.1629
  Edge Distance Minimisation        1.6831    0.3372

Table 2
Mean and standard deviation of the ellipse error.

                                    Mean      Std. dev.
  Pupil Tracker                     3.9301    5.8056
  Unoptimised Model                 4.4420    4.9157
  Region Contrast Maximisation      2.1194    1.1354
  Edge Distance Minimisation        2.7474    2.7615
the user is looking, or manual labelling. All of these are prone to error, while the
ground truth from simulation data is perfect by definition.

Secondly, we argue that modern image rendering techniques are closing the gap
between simulation and real life. Our simulation includes eyelashes, skin
deformation, iris deformation, reflections, shadows and depth-of-field blur.
Furthermore, future work on rendered datasets could include further simulation of
real-world variables, such as varying eye shapes, eye colours, different lighting
conditions, environmental reflections, or motion blur. We believe that this is an
area with vast potential for future work.

In short, we argue that, while simulation is not perfect, the quality of simulated
images is approaching that of real images, and the benefits of perfect ground truth
data far outweigh the issues of simulation.
Results
We measured the angular gaze error of our approach, using both optimisation metrics,
and compared it against the error from simple ellipse unprojection, and the error of
the model before optimisation. Figure 10 shows a graph of these results, and Table 1
shows a summary.

The simple ellipse unprojection had, as expected, the highest gaze error. We found
that this approach particularly suffers when the pupil fit is poor, as there is no
regularisation of the resulting gaze vector (fig. 12). This results in significant
gaze errors, as high as 29°. This gaze data would not be suitable for eye tracking
without some form of smoothing or outlier rejection.

We were surprised to find that an unoptimised version of the model gave good gaze
data, both in terms of mean and variance. This is the model after only the
initialisation, before applying any form of optimisation.
Figure 12. Example of a poor initial pupil ellipse fit, with a gaze error of 29.1°.

We believe that this is in large part due to a good initialisation of the sphere
centre and radius, and the last stage of the initialisation which discards gaze
information and only uses pupil centre information. While the ellipse shape may be
wrong, and thus give bad gaze data, the ellipse does tend to be in vaguely the right
area. By discarding the gaze in the latter stage of the initialisation, we obtain a
gaze estimate based on the sphere radial vector rather than the ellipse shape, and
this tends to be more reliable.

However, the optimisation stage reduces the error further. We found that on this
dataset, the edge distance minimisation gave a lower mean gaze error than the region
contrast maximisation (Table 1). We are not certain why this is the case; however,
we suspect that it is a case of over-optimisation. The lower variance of the region
contrast maximisation gaze error supports this hypothesis.
We also investigated the reprojection error of the pupils, in pixels. We used the Hausdorff distance, as described by Świrski et al. (2012), to compare projections of the pupils against the ground truth. We also calculated the ellipse fit error of the output of the initial pupil tracking stage. Figure 11 shows a graph of these results, and Table 2 shows a summary.
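For concreteness, the symmetric Hausdorff distance between two ellipse contours can be approximated by sampling points along each contour and taking the worst-case nearest-neighbour distance in both directions. The sketch below is our own illustration of the metric, not the implementation used by Świrski et al. (2012):

```python
import numpy as np

def ellipse_points(cx, cy, a, b, theta, n=100):
    """Sample n points along an ellipse with centre (cx, cy),
    semi-axes (a, b) and rotation theta (radians)."""
    t = np.linspace(0, 2 * np.pi, n, endpoint=False)
    x, y = a * np.cos(t), b * np.sin(t)
    c, s = np.cos(theta), np.sin(theta)
    return np.stack([cx + c * x - s * y, cy + s * x + c * y], axis=1)

def hausdorff(A, B):
    """Symmetric Hausdorff distance between point sets A (n,2) and B (m,2)."""
    # Pairwise distances between every point of A and every point of B.
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    # Worst-case nearest-neighbour distance, in each direction.
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```

Unlike a centre-distance or area-overlap measure, this penalises the single worst point of the fitted contour, which makes it sensitive to locally poor fits of the kind shown in fig. 12.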
We were not surprised that the unoptimised model had higher reprojection error than the pupil tracker. Interestingly, however, the optimised model had lower reprojection error than the pupil tracker, for both optimisation metrics. It is not immediately obvious that this should be the case; the pupil tracker is designed only to find and fit the pupil ellipse, whatever size or shape it may be, while our model imposes additional constraints on the possible shapes and locations of the pupils. Thus, we would expect the pupil tracker to overfit the pupil ellipse. However, we have found that, in the case of bad pupil information such as occlusions or weak edges, the constraints on the optimised model provide additional information where the pupil tracker has none, and thus limit the magnitude of the error (figs. 13 and 14).
Figure 13. The ellipse fit from Figure 12 after region contrast maximisation. The ellipse reprojection error is 1.77 px, and the gaze error is 2.25°. The dark green circle is the outline of the optimised model's sphere.
Figure 14. The ellipse fit from Figure 12 after edge distance minimisation. Note that in the original pupil fit, the left side of the ellipse was a good fit, while the right side was a poor fit. This is also the case in the optimised ellipse, as the quality of the edge distance minimisation is limited by the output of the pupil tracker. However, the optimised ellipse is still a better fit due to the constraints imposed by the model and the other images in the dataset.
Of the two optimisation metrics, we found that the region contrast maximisation had a smaller mean reprojection error than the edge distance minimisation, as well as a far smaller variance. This is because the edge distance minimisation metric uses the output of the pupil tracker; namely, the list of edge points used to fit the original ellipse. When there is a bad pupil fit, this information is poor, and so the edge distance minimisation will also give poor results (fig. 14). The region contrast maximisation, on the other hand, works directly on the image pixels, and so is independent of the pupil tracker (fig. 13).
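The structural difference between the two metrics can be sketched as follows. The function names, sampling scheme, and step size here are our own illustrative choices, not the paper's notation: the edge distance cost is driven entirely by the tracker's edge list, while the region contrast score reads the image directly on either side of the candidate contour:

```python
import numpy as np

def edge_distance_cost(edge_points, ellipse_samples):
    """Sum of distances from the tracker's detected edge points (n,2) to
    the nearest sample on the model-predicted ellipse contour (m,2).
    Depends entirely on the quality of the tracker's edge list."""
    d = np.linalg.norm(edge_points[:, None, :] - ellipse_samples[None, :, :],
                       axis=2)
    return d.min(axis=1).sum()

def region_contrast(image, ellipse_samples, normals, delta=2.0):
    """Mean intensity difference across the contour: samples the image a
    small step inside and outside each contour point (normals are outward
    unit vectors). Works on pixels directly, independent of the tracker."""
    inside = ellipse_samples - delta * normals
    outside = ellipse_samples + delta * normals

    def sample(pts):
        ix = np.clip(pts[:, 0].round().astype(int), 0, image.shape[1] - 1)
        iy = np.clip(pts[:, 1].round().astype(int), 0, image.shape[0] - 1)
        return image[iy, ix].astype(float)

    # For a dark pupil on a lighter iris, the score is high when the
    # interior is darker than the exterior.
    return (sample(outside) - sample(inside)).mean()
```

On this view, a bad edge list poisons `edge_distance_cost` directly, whereas `region_contrast` degrades only where the image itself is ambiguous, which is consistent with the variance difference reported above.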
These pupil reprojection results present an interesting insight when combined with the gaze error results. Although the region contrast maximisation has a lower reprojection error than the edge distance minimisation, it has a higher gaze angle error. Similarly, the pupil tracking has a lower ellipse error than the reprojected unoptimised model, yet the unprojected ellipses have a higher gaze error. It appears therefore that an improvement in pupil ellipse fitting does not necessarily correlate with an improvement in gaze accuracy, and that the dominant factor in establishing a good gaze vector is finding the correct sphere centre rather than a good fit of the pupil contour.
Conclusion
We have presented a novel algorithm for fitting a 3D pupil motion model to a sequence of eye images, with a choice of two different metrics for fitting the model. Our approach does not require user calibration, calibrated lighting, or any prior on relative eye position or size, and can estimate gaze to an accuracy of approximately 2° (see Table 1).
We have also introduced a novel eye image dataset, rendered from simulation, which provides a perfect ground truth for pupil detection and gaze estimation. We have evaluated our algorithm on this dataset, and have found it to be superior in terms of gaze angle error to previous approaches which did not constrain the pupil motion. We have also found that the 2D pupil ellipses calculated as a side-effect of the model fitting are superior to the ellipses found by current pupil tracking algorithms, but that a better ellipse fit does not necessarily correlate with better gaze estimation, and that even poor ellipse fits result in a surprisingly good gaze vector.
We believe that our algorithm, in particular the region contrast maximisation optimisation, is close to the limit of how good an ellipse fitting approach can be, especially given issues such as depth-of-field blur or the fact that real pupils are not perfectly circular. Despite this, the gaze error is relatively high compared to calibration-based approaches. Therefore, we believe that our algorithm approaches the limit of accuracy one can obtain purely from a geometric analysis of the pupil, and any significant improvements in gaze accuracy can only be obtained through collecting additional data (such as user calibration) rather than by improving the pupil ellipse fit.
Although our approach does not achieve the gaze accuracy of systems that include user calibration, our approach is accurate enough for approximate gaze estimation. Furthermore, our pupil motion model could be used as part of a user-calibrated gaze estimation: as an initialisation, a first-pass approximation, or for regularisation.
References
Agarwal, S., & Mierle, K. (n.d.). Ceres Solver: Tutorial & Reference [Computer software manual].
Agustin, J. S., Skovsgaard, H., Hansen, J. P., & Hansen, D. W. (2009). Low-cost gaze interaction: ready to deliver the promises. In Proc. CHI EA (pp. 4453–4458). ACM.
Chau, M., & Betke, M. (2005). Real Time Eye Tracking and Blink Detection with USB Cameras.
Daugman, J. (2004, January). How Iris Recognition Works. IEEE Transactions on Circuits and Systems for Video Technology, 14(1), 21–30. doi: 10.1109/TCSVT.2003.818350
Ebert, D. S., Musgrave, F. K., Peachey, D., Perlin, K., & Worley, S. (2002). Texturing and Modeling: A Procedural Approach (3rd ed.). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Hansen, D. W., & Ji, Q. (2010, March). In the eye of the beholder: a survey of models for eyes and gaze. IEEE Trans. PAMI, 32(3), 478–500. doi: 10.1109/TPAMI.2009.30
Holmberg, N. (2012). Advance head rig. Retrieved from http://www.blendswap.com/blends/view/48717
Nocedal, J., & Wright, S. J. (1999). Numerical Optimization. Springer.
Rall, L. B. (1981). Automatic Differentiation: Techniques and Applications (Vol. 120). Berlin: Springer.
Safaee-Rad, R., Tchoukanov, I., Smith, K., & Benhabib, B. (1992). Three-dimensional location estimation of circular features for machine vision. IEEE Transactions on Robotics and Automation, 8(5), 624–640. doi: 10.1109/70.163786
Schnieders, D., Fu, X., & Wong, K.-Y. K. (2010). Reconstruction of Display and Eyes from a Single Image. In Proc. CVPR (pp. 1442–1449).
Świrski, L., Bulling, A., & Dodgson, N. A. (2012, March). Robust real-time pupil tracking in highly off-axis images. In Proceedings of ETRA.
Tsukada, A., Shino, M., Devyver, M. S., & Kanade, T. (2011). Illumination-free gaze estimation method for first-person vision wearable device. Computer Vision in Vehicle Technology.
Zhao, H., Chan, T., Merriman, B., & Osher, S. (1996). A variational level set approach to multiphase motion. Journal of Computational Physics, 127(1), 179–195.