EyeSyn: Psychology-inspired Eye Movement Synthesis for
Gaze-based Activity Recognition
Guohao Lan
TU Delft
g.lan@tudelft.nl
Tim Scargill
Duke University
timothyjames.scargill@duke.edu
Maria Gorlatova
Duke University
maria.gorlatova@duke.edu
ABSTRACT
Recent advances in eye tracking have given birth to a new genre
of gaze-based context sensing applications, ranging from cogni-
tive load estimation to emotion recognition. To achieve state-of-
the-art recognition accuracy, a large-scale, labeled eye movement
dataset is needed to train deep learning-based classifiers. However,
due to the heterogeneity in human visual behavior, as well as the
labor-intensive and privacy-compromising data collection process,
datasets for gaze-based activity recognition are scarce and hard
to collect. To alleviate the sparse gaze data problem, we present
EyeSyn, a novel suite of psychology-inspired generative models that
leverages only publicly available images and videos to synthesize a
realistic and arbitrarily large eye movement dataset. Taking gaze-
based museum activity recognition as a case study, our evaluation
demonstrates that EyeSyn can not only replicate the distinct pat-
terns in the actual gaze signals that are captured by an eye tracking
device, but also simulate the signal diversity that results from dif-
ferent measurement setups and subject heterogeneity. Moreover,
in the few-shot learning scenario, EyeSyn can be readily incorpo-
rated with either transfer learning or meta-learning to achieve 90%
accuracy, without the need for a large-scale dataset for training.
CCS CONCEPTS
• Human-centered computing → Ubiquitous and mobile computing theory, concepts and paradigms; • Computing methodologies → Simulation types and techniques.
KEYWORDS
Eye tracking, eye movement synthesis, activity recognition.
1 INTRODUCTION
Eye tracking is on the verge of becoming pervasive due to recent
advances in mobile and embedded systems. A broad selection of
commercial products, such as Microsoft HoloLens 2 [1], Magic Leap One [2], and VIVE Pro Eye [3], is already incorporating eye tracking to enable novel gaze-based interaction and human context sensing. Moreover, general-purpose RGB cameras, such as those embedded in smartphones [4], tablets [5], and webcams [6], can also be used to capture users’ eye movements. The accessibility of eye tracking-enabled devices has given birth to a new genre of gaze-based sensing applications, including cognitive load estimation [7], sedentary activity recognition [8], reading comprehension analysis [9], and emotion recognition [10].
Recent gaze-based sensing systems leverage learning-based tech-
niques, in particular deep neural networks (DNNs) [10–12], to achieve state-of-the-art recognition performance. However, the success of DNN-based methods depends on how well the training dataset covers the inference data in deployment scenarios. Ideally, one would like to collect a large-scale labeled eye movement dataset, e.g., hundreds of instances for each subject and visual stimulus [10], to derive robust DNN models that are generalized across different
deployment conditions. However, this is impractical for three rea-
sons. First, human visual behavior is highly heterogeneous across
subjects, visual stimuli, hardware interfaces, and environments. For
instance, eye movements involved in reading are diverse among
subjects [13], layouts of the reading materials [9], and text presentation formats [14]. Thus, the countless possible combinations of the dependencies make the collection of a large-scale, labeled dataset impractical. Second, since eye movement patterns can reveal users’ psychological and physiological contexts [15], a gaze dataset that is collected from dozens or hundreds of users over multiple activity sessions is vulnerable to potential privacy threats [16].
Lastly, the collection of eye movement data is a labor-intensive and
time-consuming process, which typically involves the recruitment
of human subjects to perform a set of pre-designed activities. It
is even more challenging and problematic to perform large-scale
data collection when human interactions are restricted, such as
throughout the COVID-19 shelter-in-place orders.
These challenges make the collection of large-scale, labeled eye
movement datasets impractical, which further limits the perfor-
mance of existing gaze-based activity recognition systems. In fact,
previous work has shown that the lack of sufficient training data can lead to a 60% accuracy deficiency [12]. While recent transfer learning [17] and meta-learning-based methods [18] can be adopted to mitigate the dependency of the DNN models on large-scale training datasets in the deployment stage, they still require a highly diverse base dataset to pre-train the models.
To move beyond the current limitations, we present EyeSyn, a
comprehensive set of psychology-inspired generative models that can
synthesize realistic eye movement data for four common categories
of cognitive activity, including text reading, verbal communication,
and static and dynamic scene perception. Specifically, EyeSyn lever-
ages publicly available images and videos as the inputs, and consid-
ers them as the visual stimuli to generate the corresponding gaze
signals that would be captured by an eye tracking device when the
subject is performing a certain activity.
EyeSyn embraces three important features. First, distinct from
the Generative Adversarial Network (GAN)-based data augmentation methods [19, 20], which require hundreds of data samples for training [21], EyeSyn is training-free and does not require any eye movement data for synthesis. Second, EyeSyn can readily use a wide range of image and video datasets to generate an arbitrarily large and highly diverse eye movement dataset. For instance, it can leverage a public painting image dataset [22], which contains 7,937 images of famous paintings, to synthesize the potential eye move-
ments when subjects are viewing these paintings. It can also exploit
a text image dataset [23], which consists of 600 images of scanned documents, to generate the corresponding eye movements when subjects are reading these texts. Third, in contrast to a conventional data collection process that is usually confined to specific setups, visual stimuli, or subjects, EyeSyn can simulate different eye tracking setups, including visual distance, rendering size of the visual stimuli, sampling frequency, and subject diversity. These features
make EyeSyn an important first step towards the greater vision of
automatic eye movement synthesis that can alleviate the sparse
data problem in gaze-based activity recognition.
EyeSyn is made possible by a comprehensive suite of novel mod-
els devised in this work. First, we introduce the ReadGaze model (Section 4.2) to simulate visual attention in text reading. Specifically, we design a text recognition-based optimal viewing position detection module to identify the potential viewing points in a given text stimulus. We also develop a skipping effect simulator to model the visual behavior of skip reading [24]. Second, we develop the VerbalGaze model (Section 4.3) which consists of a facial region tracking module and a Markov chain-based attention model to simulate the visual behaviors of fixating on and switching attention between different facial regions [25] in verbal communication. Lastly, we design the StaticScene and DynamicScene models (Section 4.4) to synthesize eye movements in static and dynamic scene perception. Specifically, we propose a saliency-based fixation estimation model to identify potential fixation locations in the visual scene, and propose a centrality-focused saliency selection module to model the effects of the central fixation bias [26] on fixation selection. Our major contributions are summarized as follows:
• We propose EyeSyn, a novel set of psychology-inspired generative models that synthesize eye movement signals in reading, verbal communication, and scene perception. Taking the actual gaze signals captured by an eye tracker as the ground truth, we demonstrate that EyeSyn can not only replicate the distinct trends and geometric patterns in the gaze signal for each of the four activities, but can also simulate the heterogeneity among different subjects.
• We demonstrate that EyeSyn can leverage a wide range of publicly available images and videos to generate an arbitrarily large and diverse eye movement dataset. As shown in Section 5.1, using a small set of image and video stimuli we have prepared, EyeSyn synthesizes over 180 hours of gaze signals, which is 18 to 45 times larger than the existing gaze-based activity datasets [8, 12].
• Using gaze-based museum activity recognition as a case study, we demonstrate that a convolutional neural network (CNN)-based classifier, trained on the synthetic gaze signals generated by EyeSyn, can achieve 90% accuracy, which is as high as state-of-the-art solutions, without the need for labor-intensive and privacy-compromising data collection.
The rest of the paper is organized as follows. We review related
work in Section 2. We introduce the overall design, underlying
cognitive mechanisms, and the case study in Section 3. We present
the design details of the psychology-inspired generative models
in Section 4. Section 5 introduces the system design and dataset.
We evaluate our work in Section 6, and discuss the current limita-
tions and future directions in Section 7. We conclude the paper in
Section 8.
The research artifacts, including the implementation of the generative models and our own collected gaze dataset, are publicly available at https://github.com/EyeSyn/EyeSynResource.
2 RELATED WORK
Gaze-based context sensing.
Our work is related to recent eorts
in gaze-based context sensing, including sedentary activity recog-
nition [
8
,
12
], reading behavior analysis [
11
], and emotion recogni-
tion [
10
,
27
]. All these works require a large-scale gaze [
8
,
11
,
12
]
or eye image dataset [
10
,
27
] to train DNN-based classiers for
context recognition. Although recent transfer learning [
17
] and
meta-learning-based methods [
12
,
18
] can be adopted to mitigate
the dependency of the DNN models on a large-scale training dataset
in the deployment stage, they still require a highly diverse base
dataset to pre-train the DNN models.
Gaze simulation.
The problem of synthesizing realistic gaze signals has been studied in computer graphics and eye tracking literature [28]. For instance, Eyecatch [29] introduces a generative model that simulates the gaze of animated human characters performing fast visually guided tasks, e.g., tracking a thrown ball. Similarly, building on the statistics obtained from eye tracking data, EyeAlive [30] simulates the gaze of avatars in face-to-face conversational interactions. More recently, Duchowski et al. [31, 32] introduce a physiologically plausible model to synthesize realistic saccades and fixation perturbations on a grid of nine calibration points. Different from the existing efforts that rely solely on statistical models for gaze simulation, EyeSyn can leverage a wide range of images and videos to synthesize realistic gaze signals that would be captured by eye tracking devices.
Fixation estimation.
Our work is also related to existing works on visual attention estimation [33], which predict a subject’s fixation locations on images [34–37] and videos [38, 39]. Early works in this field either leverage low-level image features extracted from the image [34, 35], or combine image features with task-related contexts [36–38] to estimate a subject’s visual attention. Recently, data-driven approaches have achieved more advanced performance in fixation estimation by taking advantage of deep learning models that are trained on large amounts of gaze data [40–42]. In this work, we build the scene perception model of EyeSyn (Section 4.4) on the image feature-based saliency detection model proposed by Itti et al. [34] to ensure training-free attention estimation, and advance it with a centrality-focused fixation selection algorithm to generate more realistic gaze signals. In addition, as shown in Sections 4.2 and 4.3, inspired by the research findings in cognitive science [24, 25], EyeSyn also introduces two novel models to estimate fixations in text reading and verbal communication.
3 OVERVIEW
3.1 Overall Design
An overview of EyeSyn is shown in Figure 1. It takes publicly avail-
able images and videos as the inputs to synthesize realistic eye
movements for four common categories of cognitive activity, in-
cluding: text reading, verbal communication, and static and dynamic
scene perception. As shown, EyeSyn incorporates three psychology-
inspired generative models to synthesize the corresponding visual
behaviors that would be captured by an eye tracker when a subject is performing the activity.
Figure 1: Overview of EyeSyn.
Moreover, to generate realistic gaze signals, the fixation model is introduced to simulate gaze pertur-
bations that result from both microsaccades and the measurement
noise in eye tracking. EyeSyn opens up opportunities to generate
realistic, large-scale eye movement datasets that can facilitate the
training of gaze-based activity recognition applications [8, 9, 12],
and eliminate the need for expensive and privacy-compromising
data collection. Below, we introduce the underlying cognitive mech-
anism in eye movement control that motivates our design. For each
of the four activities, we describe how the human visual system
makes decisions about the xation location and xation duration
by answering the questions of: where and when will the eyes move?
and why do the eyes move in such a way?
3.2 Cognitive Mechanism and Motivation
3.2.1 Text reading. During reading, the human visual system makes
decisions about the xation location and xation duration in two
independent processes [
24
]. The xation locations are largely de-
termined by the low-level visual information, such as the length
of the word and its distance to the prior xation location [
24
]. It
is generally argued that readers attempt to land their xations on
the center of the word, which is known as the optimal viewing
position (OVP) [
43
]. The OVP is the location in a word at which the
visual system needs the minimum amount of time to recognize the
word. The xation durations are determined by the characteristics
of the word, in particular, the word length [
24
]. Moreover, words
are sometimes skipped in reading, which is known as the skipping
eect. In general, the probability of skipping a word decreases with
the word length [24, 44, 45].
Following this cognitive mechanism, we propose the ReadGaze model (Section 4.2) to simulate visual attention in text reading. As shown in Figure 1, ReadGaze consists of the text recognition-based OVP detection module to identify the potential fixation points in a given text stimulus, as well as the skipping effect simulator to simulate the visual behavior of skip reading.
3.2.2 Verbal communication. Research in cognitive neuroscience
has shown that participants in verbal communication direct most of their visual attention at their communication partner. Specifically, they tend to fixate on and scan different regions of the partner’s face [25], even if the face occupies only a small portion of the visual field. Among different facial regions, the eyes, nose, and mouth are the three most salient fixation regions, as they provide many useful cues for both speech and cognitive perception [46]. The underlying motivation of this cognitive behavior is that listeners care about where the speaker is focusing, and thus eye gaze is used as the cue to track and follow the attention of the speaker [47]. Similarly, the movements of the mouth provide additional linguistic information and audiovisual speech cues for the listener [46]. Lastly, facial expressions in the nose region help in the recognition of emotions of the speaker [48].
We propose the VerbalGaze model (Section 4.3) to simulate eye movement in verbal communication. As shown in Figure 1, it leverages monologue videos that are widely available online as the inputs to simulate the interactions in verbal communication. Specifically, it models the eye movements of the people who are listening to the speaker in the video. In fact, monologue videos are widely used in cognitive science to study attention and eye movement patterns in social interactions [25, 46], and have been proven to have the same underlying cognitive mechanism as in-person verbal communication [49]. In our design, we propose the facial region tracking module and the Markov chain-based attention model to simulate the visual behaviors of fixating on and switching attention between different facial regions [25] in verbal communication.
3.2.3 Static and dynamic scene perception. When inspecting com-
plex visual scenes, the human visual system does not process every part of the scene. Instead, it selects portions of the scene and directs attention to each one of them in a serial fashion [50]. Such selective visual attention can be explained by the feature integration theory [51], which suggests that the visual system integrates low-level features of the scene, such as color, orientation, and spatial frequency, into a topographic saliency map in the early stages of the process. Then, visual attention is directed serially to each of the salient regions that locally stand out from their surroundings [50]. The selection of fixation locations is also affected by the central fixation bias [26], which refers to the strong tendency in visual perception for subjects to look at the center of the viewing scene. Studies have shown that the center of the scene is an optimal location for extracting global visual information, and is a convenient starting point for the oculomotor system to explore the scene [52].
In this work, we design two generative models, StaticScene and DynamicScene (Section 4.4), to simulate eye movements in static and dynamic scene perception (subjects are viewing paintings or watching videos), respectively. As shown in Figure 1, we propose the saliency-based fixation estimation module to identify the potential fixation locations in the image, and propose a centrality-focused fixation selection module to model the effects of the central fixation bias [26] on fixation selection.
3.2.4 Fixation model. Lastly, EyeSyn also incorporates a set of
statistical models that simulate the gaze perturbations in fixations (Section 4.1). Specifically, we model both the microsaccades, i.e., the subconscious microscopic eye movements produced by the human oculomotor system during fixations, and the measurement noise in eye tracking, to generate realistic fixation patterns.
3.3 Case Study
In this paper, we consider gaze-based museum activity recognition for mobile augmented reality (AR) as a case study. We
show how the synthesized eye movement data from EyeSyn can
improve the recognition accuracy of a DNN-based classifier without
the need for a large-scale gaze dataset for training.
Dierent from traditional museum exhibitions, mobile AR allows
augmenting physical exhibits with more vivid and informative con-
tent, which enhances visitors’ engagement and experience. There
are many practical deployments of AR-based museum exhibitions.
For instance, the Skin and Bones [53] application, deployed at the
Smithsonian National Museum of Natural History, provides visitors
with a new way to see what extinct species looked like and how
they moved.
To ensure accurate and timely virtual content delivery, it is es-
sential to have a context-aware system that can continuously track
and recognize the physical object the user is interacting with. Al-
though one can leverage the camera on the AR device to recognize
the object in the user’s view directly [54], one practical aspect that
has been largely overlooked is that having the object in view does
not always mean the user is interacting with it. This is especially
true in scenarios where head-mounted AR devices are used, for
which one cannot simply rely on the location and orientation of the
device as the indicators of potential user-object interaction. In fact,
state-of-the-art head-mounted AR solutions have incorporated eye
trackers to estimate the visual attention of the user [1, 2].
In this case study, we leverage the gaze signals captured by head-
mounted AR devices to recognize four interactive activities that
are performed by a visitor to a virtual museum:
• Read: reading text descriptions of an exhibit.
• Communicate: talking with someone in the museum or watching monologue videos of an artist.
• Browse: browsing paintings that are exhibited in the museum.
• Watch: watching a descriptive video about an exhibit.
To showcase how gaze-based activity recognition can be used to
benet an AR user’s experience in this application, we develop a
demo on the Magic Leap One AR headset [
2
]. A short video of the
demo can be found at https://github.com/EyeSyn/EyeSynResource.
Specically, leveraging the gaze signals that are captured by the
Magic Leap One, the context-aware system can recognize the inter-
active activity the user is performing. Then, based on the context,
the system adjusts the digital content displayed in the user’s view
to enhance her engagement and learning experience.
4 PSYCHOLOGY-INSPIRED GENERATIVE
MODELS
Below, we present the detailed design of EyeSyn. We first introduce the fixation model, followed by three psychology-inspired models that synthesize eye movements in text reading, verbal communication, and scene perception. While these models are designed based on findings in psychology and cognitive science, to the best of our knowledge, we are the first to develop generative models to synthesize realistic eye movement signals for activity recognition.
4.1 Fixations Modeling
Gaze and xation are the two most common eye movement be-
haviors. Gaze point refers to the instantaneous spatial location on
the stimulus where the subject’s visual attention lands, while xa-
tion point refers to the spatial location where the subject tries to
Figure 2: Example of gaze perturbations in a xation.
Figure 3: (a) Example of the gaze perturbation in terms of
gaze angle (𝜃) and the gaze oset (𝑙). (b) Decomposition of
the overall gaze perturbation.
maintain her gaze. When the eyes are xating on the xation point,
the gaze points captured by the eye tracker contain perturbations.
To illustrate, we use the Pupil Labs eye tracker [
55
] to record the
gaze points while a subject is xating on a red calibration point dis-
played on a computer monitor. As shown in Figure 2, the recorded
gaze points contain many perturbations and uctuate around the
calibration point. The two major sources of the perturbations are
the microsaccades and the noise in eye tracking. Below, we introduce
models that simulate the perturbations and generate realistic gaze
signals in xations.
Metrics for modeling. To quantify the gaze perturbations, we introduce gaze angle and gaze offset as the metrics. As shown in Figure 3(a), the red dashed line is the direction from the eyes to the fixation point, while the green dashed line is the link between the eyes and the gaze measured by the eye tracker. The gaze angle 𝜃 measures the deviation in degrees between the two lines, while the gaze offset 𝑙 captures the Euclidean distance between the measured gaze and the fixation point on the visual scene. The height and width (in the unit of meters) of the visual scene are denoted by ℎ and 𝑤, respectively. The line-of-sight distance between the eyes and the fixation point is denoted by 𝑑. Below, we model the gaze perturbations in terms of the two metrics.
Modeling microsaccades. During fixations, the eyes make microscopic movements known as microsaccades, which are subconscious movements that are produced by the human oculomotor system to maximize visual acuity and visual perception during the fixation. Recent studies in neurophysiology have shown that the erratic fluctuations in fixations can be modeled by 1/𝑓^𝛼 noise [56], where 𝑓 is the cyclic frequency of the signal and 𝛼 is the inverse frequency power that ranges from 0 to 2. In this work, we simulate the microsaccade-induced gaze perturbation by applying a 1/𝑓^𝛼 filter to a stream of Gaussian white noise [57]. Specifically, we model the perturbation in gaze angle, 𝜃_micro, by 𝜃_micro = F(𝑠, 𝛼), where 𝑠 is the input white noise that follows the Gaussian distribution N(0, 1/300) (in degrees) [31], and F(𝑠, 𝛼) is a 1/𝑓^𝛼 filter with the inverse frequency power 𝛼. We set 𝛼 to 0.7 for generating more realistic microsaccade patterns [31, 32].
Figure 4: Example of simulated gaze points given different visual scene sizes (𝑤 and ℎ) and distances 𝑑.
Modeling noise in eye tracking. The noise in eye tracking also contributes to the gaze perturbations. In practice, many factors can influence eye tracking quality [58], including: the environment (e.g., different lighting conditions), the eye physiology of the subjects, and the design of the eye tracker (e.g., resolution of the camera and the eye tracking algorithm). Following the literature [32], we model the gaze perturbation (in degrees) that results from eye tracking noise, 𝜃_track, by a Gaussian distribution 𝜃_track ∼ N(0, 1.07).
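To make the perturbation model concrete, the sketch below shows one way to realize it in Python with NumPy. It is our illustration, not the authors' implementation: the FFT-based design of the 1/𝑓^𝛼 filter and the reading of N(·, ·) as (mean, variance) are assumptions.

```python
import numpy as np

def one_over_f_filter(white_noise, alpha):
    """Shape Gaussian white noise so that its power spectrum follows 1/f^alpha."""
    n = len(white_noise)
    if n < 2:
        return np.asarray(white_noise, dtype=float)
    spectrum = np.fft.rfft(white_noise)
    freqs = np.fft.rfftfreq(n, d=1.0)
    freqs[0] = freqs[1]                  # avoid division by zero at the DC bin
    spectrum *= freqs ** (-alpha / 2.0)  # amplitude ~ f^(-alpha/2)  =>  power ~ 1/f^alpha
    shaped = np.fft.irfft(spectrum, n)
    return shaped * np.std(white_noise) / (np.std(shaped) + 1e-12)

def gaze_angle_perturbation(n_samples, alpha=0.7, rng=None):
    """Per-sample gaze-angle perturbation (degrees): microsaccades plus tracking noise."""
    rng = np.random.default_rng(rng)
    s = rng.normal(0.0, np.sqrt(1.0 / 300.0), n_samples)    # white noise, N(0, 1/300) read as variance
    theta_micro = one_over_f_filter(s, alpha)                # microsaccade component, 1/f^alpha shaped
    theta_track = rng.normal(0.0, np.sqrt(1.07), n_samples)  # tracking noise, N(0, 1.07) read as variance
    return theta_micro + theta_track
```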
Gaze simulator. To sum up, as shown in Figure 3(b), taking the fixation point as the origin of the coordinate system, we can further decompose the overall perturbation in the X and Y directions, in which we use notations 𝑙_𝑥, 𝑙_𝑦, 𝜃_𝑥, and 𝜃_𝑦 to denote the decomposed gaze offsets and gaze angles for the two directions, respectively. Then, we can obtain 𝑙_𝑥 and 𝑙_𝑦 by 𝑙_𝑥 = 2𝑑·sin(𝜃_𝑥/2) and 𝑙_𝑦 = 2𝑑·sin(𝜃_𝑦/2), respectively, where 𝑑 is the line-of-sight distance between the eyes and the calibration point; 𝜃_𝑥 and 𝜃_𝑦 are the decomposed gaze angles in X and Y, respectively, which are modeled by 𝜃_micro + 𝜃_track.
After modeling the gaze offset, we use a sequence of 𝑚 fixation points, P = {p_1, . . . , p_𝑚}, as the input to simulate gaze points in fixations. Each point p_𝑖 = (𝑥_𝑖, 𝑦_𝑖) represents a potential fixation point on a normalized 2D plane, where 𝑥_𝑖 and 𝑦_𝑖 are the X and Y coordinates, respectively. Moreover, given a 𝑤 × ℎ visual scene, we can transfer p_𝑖 from the normalized plane to the coordinates of the visual scene by: p′_𝑖 = (𝑥_𝑖 × 𝑤, 𝑦_𝑖 × ℎ), ∀p_𝑖 = (𝑥_𝑖, 𝑦_𝑖) ∈ P. This transformation allows us to take the size of the visual scene into account when simulating the gaze points.
Then, we use G_𝑖 = {𝑔_{𝑖,1}, . . . , 𝑔_{𝑖,𝑛}} to denote a sequence of 𝑛 gaze points that will be captured by the eye tracker when the subject is fixating on p′_𝑖. The length of the sequence, 𝑛, is equal to 𝑡_𝑖 × 𝑓_𝑠, in which 𝑡_𝑖 is the fixation duration on p′_𝑖, and 𝑓_𝑠 is the sampling frequency of the eye tracking device. The 𝑘-th gaze point, G_𝑖(𝑘) = 𝑔_{𝑖,𝑘}, is obtained by adding a gaze offset to p′_𝑖:
G_𝑖(𝑘) = p′_𝑖 + L_𝑖(𝑘), (1)
where L_𝑖 is a sequence of gaze offsets generated based on the gaze perturbation models introduced above, and L_𝑖(𝑘) = (𝑙_𝑥(𝑘), 𝑙_𝑦(𝑘)) is the 𝑘-th gaze offset in the sequence. As an example, Figure 4 shows the simulated gaze points when taking a grid of nine fixation points as the inputs. Different visual scene sizes 𝑤 × ℎ and distances 𝑑 are used in the simulation. We observe that a longer visual distance 𝑑 or a smaller visual scene leads to higher perturbations in the simulated gaze signal, which matches the observations with practical eye trackers [59].
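Building on the previous sketch, the gaze simulator of Equation (1) can be approximated as follows. Treating the X and Y angle components as independent draws is our simplification; the function gaze_angle_perturbation() is the one defined above.

```python
import numpy as np

def simulate_fixation_gaze(p_norm, duration_s, w, h, d, fs=30, alpha=0.7, rng=None):
    """Gaze points recorded while fixating on one normalized fixation point.

    p_norm     : (x, y) fixation point on the normalized [0, 1] x [0, 1] plane
    duration_s : fixation duration t_i in seconds
    w, h, d    : scene width, scene height, and line-of-sight distance (meters)
    fs         : eye-tracker sampling frequency (Hz)
    """
    rng = np.random.default_rng(rng)
    n = max(1, int(round(duration_s * fs)))
    p_scene = np.array([p_norm[0] * w, p_norm[1] * h])   # p'_i in scene coordinates
    theta_x = gaze_angle_perturbation(n, alpha, rng)     # decomposed gaze angles (degrees)
    theta_y = gaze_angle_perturbation(n, alpha, rng)
    # Convert gaze angles to metric offsets: l = 2 d sin(theta / 2).
    l_x = 2.0 * d * np.sin(np.deg2rad(theta_x) / 2.0)
    l_y = 2.0 * d * np.sin(np.deg2rad(theta_y) / 2.0)
    return p_scene + np.stack([l_x, l_y], axis=1)        # Equation (1): G_i(k) = p'_i + L_i(k)
```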
4.2 Eye Movement in Reading
Below, we introduce the details of the ReadingGaze model, which incorporates both OVP theory and the skipping effect to simulate the eye movements in text reading.
Figure 5: Example of detecting the optimal viewing posi-
tions on the input text image.
Table 1: Probability of xation and mean xation duration
(in ms) on the target word as a function of the word length
(in number of letters) [44, 45].
Word length Fixation probability Fixation duration
1 0.077 209
2 0.205 215
3 0.318 210
4 0.480 205
5 0.800 229
6 0.825 244
7 0.875 258
8 0.915 260
9 0.940 276
4.2.1 Text recognition-based OVP detection. We introduce a text recognition-based OVP detection module to identify the potential fixation points in a given text stimulus. Specifically, we leverage the Google Tesseract optical character recognition engine [60] to detect the locations and lengths of the words in an input text image. We use Tesseract because of its high efficiency and its support of more than 100 languages [61]. As shown in Figure 5, the words in the input text image are detected and highlighted by blue bounding boxes. The centers of the detected words are regarded as the OVPs. The associated word lengths are shown above the bounding boxes. Note that we are not interested in recognizing the exact text. Rather, we leverage the coordinates of the detected bounding boxes to calculate the OVPs. Moreover, we obtain the length of each word (in number of letters) and use it to simulate the skipping effect.
4.2.2 Skipping eect and fixation simulation. We leverage the eye
movement statistics reported in Rayner et al. [
44
,
45
] as the inputs
to simulate the skipping eect and the xation decision in text
reading. Specically, Table 1 shows the probability of xation and
the mean xation duration on the target word as a function of the
word length (in number of letters). Note that the xation durations
in Table 1 do not consider rexation (i.e., the behavior of xating
on a given word more than once), because given the OVP as the
landing position for xation, the probability of rexating is lower
than 6%, regardless of the word length [24].
4.2.3 The ReadingGaze model. Putting everything together, our model takes the text image as the input and detects a sequence of OVPs with the associated word lengths (as shown in Figure 5). Then, leveraging the statistics given in Table 1, it simulates the skipping effect on each of the detected OVPs based on its word length, and assigns fixation durations to the selected OVPs (i.e., the OVPs that will be fixated on). The outputs of the ReadingGaze model are a sequence of 𝑚 fixation points P = {p_1, . . . , p_𝑚} and the associated fixation durations T = {𝑡_1, . . . , 𝑡_𝑚}, where each point p_𝑖 = (𝑥_𝑖, 𝑦_𝑖) is an OVP at which the subject will fixate while reading, and 𝑡_𝑖 is the associated fixation duration. Lastly, we take P and T as the input of the gaze simulator in Equation 1.
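A minimal sketch of this pipeline is shown below. The OVP-as-bounding-box-center rule and the Table 1 lookups follow the text; the use of the pytesseract wrapper around Tesseract and the clamping of word lengths to the range covered by Table 1 are our choices.

```python
import numpy as np
import pytesseract
from PIL import Image

# Fixation probability and mean duration (ms) by word length, from Table 1 [44, 45].
FIX_PROB = {1: 0.077, 2: 0.205, 3: 0.318, 4: 0.480, 5: 0.800,
            6: 0.825, 7: 0.875, 8: 0.915, 9: 0.940}
FIX_DUR_MS = {1: 209, 2: 215, 3: 210, 4: 205, 5: 229,
              6: 244, 7: 258, 8: 260, 9: 276}

def reading_fixations(text_image_path, rng=None):
    """Return normalized OVP fixation points and durations (seconds) for a text image."""
    rng = np.random.default_rng(rng)
    img = Image.open(text_image_path)
    W, H = img.size
    boxes = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    points, durations = [], []
    for text, x, y, w, h in zip(boxes["text"], boxes["left"], boxes["top"],
                                boxes["width"], boxes["height"]):
        word = text.strip()
        if not word:
            continue
        length = min(len(word), 9)           # clamp to the word lengths covered by Table 1
        if rng.random() > FIX_PROB[length]:  # skipping effect: short words are often skipped
            continue
        # OVP approximated by the center of the word's bounding box, normalized to [0, 1].
        points.append(((x + w / 2) / W, (y + h / 2) / H))
        durations.append(FIX_DUR_MS[length] / 1000.0)
    return points, durations
```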
4.3 Eye Movement in Verbal Communication
Below, we introduce the detailed design of VerbalGaze, which consists of a facial region tracking module and a Markov chain-based attention model.
Figure 6: The pipeline of facial region tracking.
Figure 7: The tracked coordinates of the three facial regions in a 20-second video. The outliers in the time series are due to the detection errors of the Viola-Jones algorithm.
4.3.1 Facial region tracking. Taking the monologue video as input, we leverage the resource-efficient Viola-Jones algorithm [62] to detect the eyes, nose, and mouth of the speaker in the video frames. The centers of the detected facial regions are considered as the potential fixation locations. The processing pipeline of the facial region tracking is shown in Figure 6. The detected eyes, nose, and mouth are bounded by red, yellow, and blue boxes, respectively, with their centers marked by circles. We denote the time series of the tracked coordinates of the eyes, nose, and mouth by C_eyes, C_nose, and C_mouth, respectively.
Figure 7 is an example of tracking the facial regions for a 20-
second video with a 30fps frame rate (thus, 600 frames). The time
series of the tracked positions are normalized. We can see outliers
in the tracked positions, which result from the detection errors of
the Viola-Jones algorithm, and appear mostly when the eyes or the
mouth of the speaker are closed. We apply a scaled median absolute
deviation-based outlier detector on a sliding window of 60 points
to detect and remove these errors.
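The outlier removal step can be sketched as follows. The 60-point window follows the text; the threshold of three scaled MADs mirrors a common default and is our assumption, and for simplicity the window is applied in consecutive blocks rather than sliding sample by sample.

```python
import numpy as np

def remove_outliers_mad(series, window=60, n_mads=3.0):
    """Scaled-MAD outlier removal on a tracked-coordinate time series.

    Samples that deviate from the window median by more than n_mads scaled MADs
    are treated as detection errors and replaced by the window median.
    """
    x = np.asarray(series, dtype=float).copy()
    scale = 1.4826  # makes the MAD consistent with the std of Gaussian data
    for start in range(0, len(x), window):  # consecutive blocks (simplification of a sliding window)
        seg = x[start:start + window]
        med = np.median(seg)
        mad = scale * np.median(np.abs(seg - med))
        if mad == 0:
            continue
        seg[np.abs(seg - med) > n_mads * mad] = med
        x[start:start + window] = seg
    return x
```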
4.3.2 Markov chain-based aention model. We design a three-state
Markov chain to simulate the visual behaviors of xating on and
switching attention between dierent facial regions in verbal commu-
nication. As shown in Figure 8(a), we model the behaviors of xating
on the eyes, nose, and mouth regions as three states of a discrete-
time Markov chain with state space X = {EYES, NOSE, MOUTH}. We model the attention shift from one facial region to another by a Markovian transition. For instance, the attention shift from the eyes to the mouth is modeled by the transition from EYES to MOUTH. Lastly, each transition is assigned a transition probability. In this work, the transition probabilities are calculated based on the eye movement statistics reported by Jiang et al. [25]. Note that we can easily adjust the transition probabilities to fit the eye movement behaviors in different scenarios. For instance, we can increase the probability of fixating on the eyes to simulate verbal communication in a face-to-face scenario, in which listeners tend to look more at the speaker’s eyes due to more frequent eye contact [49]. Then, to simulate the attention shifts among the three facial regions, we perform a random walk on the Markov chain to generate a sequence of states 𝑥_{1:𝑛} = (𝑥_1, . . . , 𝑥_𝑛), where 𝑥_𝑡 : Ω → X and 𝑥_𝑡 ∈ 𝑥_{1:𝑛} represents the state at step 𝑡. An example of the simulated state sequence is shown in Figure 9(a), where the initial state 𝑥_1 is EYES.
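The random walk over the three-state chain can be sketched as follows; the transition matrix values are placeholders for illustration only, not the probabilities derived from the statistics in [25].

```python
import numpy as np

STATES = ["EYES", "NOSE", "MOUTH"]

# Hypothetical transition probabilities for illustration; the paper derives its
# values from the eye movement statistics reported by Jiang et al. [25].
TRANSITION = np.array([
    [0.6, 0.2, 0.2],   # from EYES  to (EYES, NOSE, MOUTH)
    [0.3, 0.4, 0.3],   # from NOSE  to (EYES, NOSE, MOUTH)
    [0.3, 0.2, 0.5],   # from MOUTH to (EYES, NOSE, MOUTH)
])

def random_walk(n_steps, start_state="EYES", rng=None):
    """Generate a state sequence x_{1:n} by a random walk on the three-state Markov chain."""
    rng = np.random.default_rng(rng)
    idx = STATES.index(start_state)
    states = [STATES[idx]]
    for _ in range(n_steps - 1):
        idx = rng.choice(len(STATES), p=TRANSITION[idx])
        states.append(STATES[idx])
    return states
```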
4.3.3 Adding the ‘sense of time’. We use the inter-state interval (ISI) to represent the duration of time (in seconds) that the attention will stay in each state 𝑥_𝑡 ∈ 𝑥_{1:𝑛}. Moreover, since the three facial regions function differently in the cognitive process of verbal communication, they lead to different fixation durations [25, 46]. Thus, as shown in Figure 8(b), we use three Gaussian distributions to model the ISI of the three states. The mean, 𝜇, and standard deviation, 𝜎, of the distributions are adopted from the statistics reported in [25].
Figure 8: (a) Diagram of the three-state Markov chain; the three states EYES, NOSE, and MOUTH represent the eye movement behavior of fixating on the eyes, nose, and mouth regions, respectively; the transitions model the attention shift between facial regions. (b) The Gaussian distributions of the ISI on the three states.
For a video with 𝑚 frames, we generate the attention sequence 𝑎_{1:𝑚} = (𝑎_1, . . . , 𝑎_𝑚) to simulate the subject’s attention on each of the video frames. Formally, an attention shift is simulated to occur at frame index 𝜏 ∈ 𝜏_{1:𝑛} = (𝜏_1, . . . , 𝜏_𝑛) of the video, where 𝜏_1 = 𝑓_𝑣 × 𝐼𝑆𝐼_1, and 𝜏_𝑖 = 𝜏_{𝑖−1} + 𝑓_𝑣 × 𝐼𝑆𝐼_𝑖, ∀𝑥_𝑖 ∈ 𝑥_{1:𝑛}. Notation 𝐼𝑆𝐼_𝑖 denotes the inter-state interval for attention state 𝑥_𝑖, and is sampled from the corresponding Gaussian distribution defined in Figure 8(b); 𝑓_𝑣 is the frame rate (in fps) of the video. Then, 𝑎_{1:𝑚} is generated by assigning to each of the image frames the corresponding attention state value |𝑥_𝑖|:
𝑎_{(𝜏_{𝑖−1}+1):𝜏_𝑖} = |𝑥_𝑖|, ∀𝜏_𝑖 ∈ 𝜏_{1:𝑛}, (2)
where |𝑥_𝑖| ∈ {EYES, NOSE, MOUTH}. As an example, Figure 9(b) shows the attention sequence 𝑎_{1:𝑚} for a 100-second video.
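The expansion of a state sequence and sampled ISIs into the frame-level attention sequence of Equation (2) can be sketched as follows; the ISI means and standard deviations below are placeholders, not the values fitted from [25].

```python
import numpy as np

# Hypothetical ISI statistics (mean, std) in seconds per state, for illustration only;
# the paper fits these Gaussians to the statistics reported in [25] (Figure 8(b)).
ISI_PARAMS = {"EYES": (1.2, 0.4), "NOSE": (0.8, 0.3), "MOUTH": (1.0, 0.3)}

def attention_sequence(state_names, n_frames, fv=30, rng=None):
    """Expand a sequence of state names into a per-frame attention sequence a_{1:m} (Eq. 2)."""
    rng = np.random.default_rng(rng)
    attention = []
    for name in state_names:
        mu, sigma = ISI_PARAMS[name]
        isi = max(rng.normal(mu, sigma), 1.0 / fv)       # dwell time in seconds, at least one frame
        attention.extend([name] * int(round(isi * fv)))  # tau_i - tau_{i-1} = f_v * ISI_i frames
        if len(attention) >= n_frames:
            break
    return attention[:n_frames]
```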
4.3.4 The VerbalGaze model. We combine the simulated attention sequence 𝑎_{1:𝑚} with the location time series C_eyes, C_nose, and C_mouth obtained from the facial region tracking module to generate a sequence of 𝑚 fixation points P = {p_1, . . . , p_𝑚}. Each fixation point p_𝑖 = (𝑥_𝑖, 𝑦_𝑖) represents the location of the corresponding facial region the subject will fixate on:
p_𝑖 = C_eyes(𝑖) if 𝑎_𝑖 = EYES; C_nose(𝑖) if 𝑎_𝑖 = NOSE; C_mouth(𝑖) if 𝑎_𝑖 = MOUTH; ∀p_𝑖 ∈ P. (3)
As 𝑎_{1:𝑚} simulates the visual attention for all the video frames, the associated set of fixation durations T = {𝑡_1, . . . , 𝑡_𝑚} is obtained by 𝑡_𝑖 = 1/𝑓_𝑠, ∀𝑡_𝑖 ∈ T. An example of P is shown in Figure 9(c), which is generated by taking the tracked facial region locations (shown in Figure 7) and the attention sequence (shown in Figure 9(b)) as the inputs. Finally, P and T are fed into the gaze simulator (in Equation 1) to synthesize the gaze signal shown in Figure 9(d).
Figure 9: (a) Simulated discrete state sequence 𝑥_{1:𝑛} with 𝑛 = 62 and 𝑥_1 = EYES; (b) the corresponding attention sequence 𝑎_{1:𝑚} on 3000 video frames (with 𝑓_𝑣 = 30); (c) simulated fixation sequence P; (d) simulated gaze time series.
4.4 Eye Movement in Scene Perception
Below, we introduce two generative models, StaticScene and DynamicScene, to synthesize eye movements in static and dynamic scene perception, respectively. Specifically, we design the image feature-based saliency detection model to identify the potential fixation locations in the scene, and develop a centrality-focused saliency selection algorithm to simulate the effects of the central fixation bias on the selection of fixation location.
4.4.1 Saliency-based fixation estimation. We leverage the widely used bottom-up saliency model proposed by Itti et al. [34, 35] to identify the saliency of an input image. In brief, the saliency estimation model first extracts low-level vision features to construct the intensity, color, and orientation feature maps, respectively. Then, the three feature maps are normalized and combined into the final saliency map [35]. Taking the saliency map S as the input, we simulate the serial and selective visual attention behavior in scene perception. Specifically, for each of the salient regions in S, we first identify the location of its local maximum, which indicates the point to which attention will most likely be directed. Then, we generate a set of 𝑚 fixation points P = {p_1, . . . , p_𝑚}, in which 𝑚 is the number of salient regions in S, and each fixation point p_𝑖 = (𝑥_𝑖, 𝑦_𝑖) ∈ P corresponds to the location of one local maximum. As shown in Figure 10(a), six salient regions and their local maxima are identified in S, which correspond to six potential fixation locations. Finally, we simulate the serial attention behavior by connecting the identified fixation locations in order of their local maxima. As shown, a fixation sequence P = {p_1, p_2, p_3, p_4, p_5, p_6} is generated, in which p_1 and p_6 correspond to the fixation points that have the highest and the lowest local maxima in S, respectively.
Figure 10: (a) Original saliency map S with the associated fixation sequence P = {p_1, p_2, p_3, p_4, p_5, p_6} overlaid on it. (b) The weighted saliency map S̄ with the new fixation sequence P̄ = {p_2, p_6, p_1, p_3, p_4, p_5} overlaid on it. (c) The simulated gaze points overlaid on the input image.
4.4.2 Centrality-focused fixation selection. To simulate the central fixation bias effect, we further weight each of the fixation points in P by its distance to the image center. Specifically, we use notation S(p_𝑖) to denote the saliency value of p_𝑖 in S. The weighted saliency value S̄(p_𝑖) is obtained by:
S̄(p_𝑖) = S(p_𝑖) · e^(−‖p_𝑖 − A‖), ∀p_𝑖 ∈ P, (4)
where A denotes the center of the saliency map, and ‖p_𝑖 − A‖ is the Euclidean distance between p_𝑖 and A. This distance metric gives more weight to fixation points that are closer to the image center. Then, by sorting the weighted fixation points, we generate a new fixation sequence P̄. An example is shown in Figure 10, in which the original saliency map S is compared with the weighted saliency map S̄. The fixation point p_6, which is closer to the image center, has a higher saliency value after the weighting, and is selected as the second attention location in the weighted fixation sequence P̄.
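The fixation estimation and centrality-focused re-ranking of Sections 4.4.1 and 4.4.2 can be sketched as follows. Any saliency map can be plugged in (the paper uses the Itti et al. model [34, 35]); the peak-detection parameters and the normalization of the center distance are our choices.

```python
import numpy as np
from skimage.feature import peak_local_max

def ordered_fixations(saliency, min_distance=20):
    """Turn a 2D saliency map (values in [0, 1]) into an ordered fixation sequence."""
    h, w = saliency.shape
    # Local maxima of the salient regions are the candidate fixation points.
    peaks = peak_local_max(saliency, min_distance=min_distance)  # (row, col) pairs
    center = np.array([h / 2.0, w / 2.0])
    # Centrality-focused weighting (Equation 4): S_bar(p) = S(p) * exp(-||p - A||),
    # with the distance expressed in normalized image units (our choice).
    dists = np.linalg.norm((peaks - center) / np.array([h, w]), axis=1)
    weighted = saliency[peaks[:, 0], peaks[:, 1]] * np.exp(-dists)
    order = np.argsort(-weighted)  # highest weighted saliency visited first
    # Return the fixation points as normalized (x, y) coordinates.
    return [(peaks[i, 1] / w, peaks[i, 0] / h) for i in order]
```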
Below, we introduce two generative models we developed to syn-
thesize gazes in static and dynamic scene perception.
4.4.3 Static scene perception. A static scene refers to scenarios in which the salient regions of the scene do not change over time (e.g., paintings). In this case, the input for the eye movement simulation is simply the image of the static visual scene. We introduce the StaticScene model, which leverages the aforementioned image saliency-based and centrality-focused fixation estimation algorithm to generate a sequence of fixation points, P̄ = {p_1, . . . , p_𝑛}, to simulate visual attention when a subject is viewing the static scene. We further model the fixation durations, T = {𝑡_1, . . . , 𝑡_𝑛}, in static scene perception by a Gamma distribution T ∼ Γ(𝛼 = 2.55, 𝛽 = 71.25). The values of the shape parameter 𝛼 and the rate parameter 𝛽 are estimated based on 16,300 fixation duration instances extracted from the DesktopActivity [12] and the SedentaryActivity [8] eye tracking datasets. Specifically, we leverage the dispersion-based fixation detection algorithm [63] to detect fixations from the raw gaze signal, and fit a Gamma distribution to the calculated fixation durations. Finally, for gaze signal simulation, we use P̄ and T as the inputs of the gaze simulator in Equation 1. As an example, Figure 10(c) shows the gaze points synthesized by the StaticScene model when a subject is viewing the painting.
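For illustration, static-scene fixation durations can be sampled as below. We interpret 𝛽 = 71.25 as a scale parameter in milliseconds, which yields a mean duration of roughly 182 ms, consistent with the 180-330 ms range cited for scene perception; treat this interpretation as our assumption.

```python
import numpy as np

def sample_fixation_durations(n_fixations, shape=2.55, scale_ms=71.25, rng=None):
    """Sample static-scene fixation durations (seconds) from a Gamma distribution.

    Assumes the reported beta = 71.25 acts as the Gamma scale parameter in
    milliseconds; mean duration = shape * scale_ms ~= 182 ms.
    """
    rng = np.random.default_rng(rng)
    return rng.gamma(shape, scale_ms, size=n_fixations) / 1000.0
```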
Figure 11: Example of generating the fixation sequence in a dynamic scene: figures in the first row are three continuous video frames; figures in the second row are the corresponding weighted saliency maps.
4.4.4 Dynamic scene perception. In dynamic scene perception (e.g., a subject watching videos or performing visual search in free space), the salient objects of the visual scene change over time. We introduce the DynamicScene model, which takes a stream of video frames as the input for gaze simulation. According to the literature, the mean fixation duration in scene perception and visual search is around 180-330 ms [64]. Thus, when the frame rate of the input video is higher than 5.4 fps, i.e., with a frame duration shorter than 180 ms, there will be only one fixation point in each video frame. In the current design, we assume the frame rate of the input video is higher than 5.4 fps, and thus, instead of considering the local maxima of all the salient regions as fixation points, for each of the video frames we only select the location with the highest saliency value as the fixation point. As shown in Figure 11, a fixation sequence P = {p_1, p_2, p_3} is generated by selecting the salient region with the highest saliency in each of the three continuous frames. The fixation durations T = {𝑡_1, . . . , 𝑡_𝑛} in dynamic scene perception are determined by the frame rate 𝑓_𝑣 of the video: 𝑡_𝑖 = 1/𝑓_𝑣, ∀𝑡_𝑖 ∈ T. P and T are used as the inputs of Equation 1 to synthesize eye movement signals.
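A per-frame fixation selection consistent with this design can be sketched as follows, given one (weighted) saliency map per video frame.

```python
import numpy as np

def dynamic_scene_fixations(saliency_frames, fv=30.0):
    """One fixation per video frame: the location of the global saliency maximum.

    saliency_frames: iterable of 2D saliency maps, one per frame.
    Returns normalized (x, y) fixation points and per-frame durations t_i = 1/f_v.
    """
    points, durations = [], []
    for s in saliency_frames:
        h, w = s.shape
        r, c = np.unravel_index(np.argmax(s), s.shape)  # location of the highest saliency value
        points.append((c / w, r / h))
        durations.append(1.0 / fv)
    return points, durations
```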
5 SYSTEM DESIGN AND DATASET
5.1 Synthetic Eye Movement Dataset
We implement EyeSyn in MATLAB, and use it to construct a massive synthetic eye movement dataset, denoted as SynGaze. The details of SynGaze are summarized in Table 2. Specifically, we use the following image and video data as the inputs to simulate gaze signals for the four activities:
• Read: we extract 100 text images from each of the three digital books, “Rich Dad Poor Dad”, “Discrete Calculus”, and “Adler’s Physiology of the Eye”. The three books differ in both text layout and font size. The extracted text images are used as the inputs to the ReadingGaze model.
• Communicate: we extract 100 monologue video clips from the online interview series of the “ACM Turing Award Laureate Interview” as the inputs to the VerbalGaze model. Each video clip lasts 5 to 7 minutes with a frame rate of 30 fps.
• Browse: we leverage a public dataset with 7,937 images of famous paintings [22] as the input to the StaticScene model.
• Watch: we extract 50 short videos from the “National Geographic Animals 101” online documentary video series as the input to the DynamicScene model. Each video lasts 2 to 6 minutes.
Table 2: Summary of the synthetic eye movement dataset.
Activity Simulation inputs Simulated data length
Read 300 text images from three books 9.9 hours
Communicate 100 video clips of monologue interview 30.9 hours
Browse 7,937 images of paintings 132.3 hours
Watch 50 video clips of documentary videos 11.7 hours
Figure 12: (a) The scatter plots of the aggregated gaze signals; and (b) the gaze heatmap generated from the gaze signal.
When modeling the microsaccades and the eye tracking noise in fixations (Section 4.1), we consider different settings of the scale parameters to simulate various rendering sizes of the visual stimuli (𝑤 = ℎ = 0.5 m and 𝑤 = ℎ = 1 m), and viewing distances (𝑑 = 0.5 m, 𝑑 = 1 m, and 𝑑 = 2 m). The sampling frequency for the simulation is set to 30 Hz.
Extension feasibility. Note that SynGaze can easily be extended by using a variety of simulation settings, and by taking different datasets as the inputs. For instance, EyeSyn can be readily applied to the visual saliency dataset [39], which contains 431 video clips of six different genres, the iMet Collection dataset [65], which contains over 200K images of artwork, and the text image dataset [23], which consists of 600 images of scanned documents, to synthesize realistic gaze signals on new sets of visual stimuli.
5.2 Gaze-based Activity Recognition
Gaze heatmap. We propose the gaze heatmap as the data representation for gaze-based activity recognition. A gaze heatmap is a spatial representation of an aggregation of gaze points over a certain window of time. It provides an overview of the eye movements and indicates the regions in the visual scene at which the subject’s attention is located. As an example, Figure 12 shows the gaze heatmaps that are generated from the aggregated gaze points captured by the eye tracker. The color of the heatmap indicates the density of the subject’s visual attention on the normalized 2D scene. To generate a gaze heatmap, we take the gaze points aggregated in each sensing window as the inputs, and create the 2D histogram of the gaze points based on their normalized coordinates. Then, we perform a 2D convolution operation with a Gaussian kernel on the histogram to generate the gaze heatmap. In our implementation, the resolution of the histogram is 128, and the width of the Gaussian kernel is 1. The final gaze heatmap has a size of 128×128.
CNN-based classier.
We design a convolutional neural net-
work (CNN)-based classier for gaze-based activity recognition.
Table 3 shows the network architecture of the classier. We choose
this shallow design over deeper models (e.g., ResNet and VGGNet)
to prevent overtting when a small-scale dataset is used for model
training [
18
]. The input to the classier is a 128
×
128 gaze heatmap.
Table 3: The network design of the CNN-based classifier.
Layer Size In Size Out Filter
conv1 128 ×128 ×1 128 ×128 ×32 3 ×3, 1
pool1 128 ×128 ×32 64 ×64 ×32 2 ×2, 2
conv2 64 ×64 ×32 64 ×64 ×32 3 ×3, 1
pool2 64 ×64 ×32 32 ×32 ×32 2 ×2, 2
conv3 32 ×32 ×32 32 ×32 ×32 3 ×3, 1
pool3 32 ×32 ×32 16 ×16 ×32 2 ×2, 2
flatten 16 ×16 ×32 8192
fc 8192 128
fc 128 4
Note that while conventional hand-crafted feature-based classifiers [8, 9] may also benefit from the synthesized data generated by EyeSyn, we choose the CNN-based design due to its superior ability in extracting spatial features from the gaze signal [12].
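A Keras sketch that matches the layer sizes in Table 3 is shown below; the ReLU and softmax activations are our assumptions, as Table 3 does not list them.

```python
import tensorflow as tf

def build_classifier(num_classes=4):
    """Shallow CNN following the layer shapes in Table 3."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, (3, 3), strides=1, padding="same", activation="relu",
                               input_shape=(128, 128, 1)),                 # conv1
        tf.keras.layers.MaxPooling2D((2, 2), strides=2),                    # pool1
        tf.keras.layers.Conv2D(32, (3, 3), strides=1, padding="same", activation="relu"),  # conv2
        tf.keras.layers.MaxPooling2D((2, 2), strides=2),                    # pool2
        tf.keras.layers.Conv2D(32, (3, 3), strides=1, padding="same", activation="relu"),  # conv3
        tf.keras.layers.MaxPooling2D((2, 2), strides=2),                    # pool3
        tf.keras.layers.Flatten(),                                          # 16 x 16 x 32 = 8192
        tf.keras.layers.Dense(128, activation="relu"),                      # fc 8192 -> 128
        tf.keras.layers.Dense(num_classes, activation="softmax"),           # fc 128 -> 4
    ])
```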
6 EVALUATION
In this section, we rst perform a signal level evaluation to assess
the similarity between the actual and the synthesized gaze signals.
Then, we investigate how the synthesized signals can be used to
improve the performance of gaze-based activity recognition.
6.1 Data Collection
We collect a gaze dataset, denoted as VisualProcessingActivity, for the evaluation. The study is approved by our institution’s In-
stitutional Review Board. Two different eye tracking devices, the Pupil Labs [55] and the Magic Leap One [2], are used in the data collection, which allows us to evaluate our work with real gaze signals captured by heterogeneous devices. Eight subjects participate in the study: four subjects leverage the onboard eye tracker in the Magic Leap One, while the others use the Pupil Labs for eye movement collection. Both devices capture eye movements with a sampling frequency of 30 Hz. The subjects can move freely during the experiment. Specifically, the subjects who are wearing the Pupil Labs are sitting in front of a 34-inch computer monitor at a distance of 50 cm. The visual stimulus for each of the activities is displayed on the monitor. The resolution of the display is 800×600. We conduct the manufacturer’s default on-screen five-point calibration for each of the subjects. For the Magic Leap One, the stimuli are rendered as virtual objects placed on blank white walls around a room at head height. The virtual objects are 50 cm × 50 cm in size, and their distances to the subjects are 1 to 1.5 m. We perform the built-in visual calibration on the Magic Leap One for each subject.
For both devices, we ask the subjects to perform each of the four activities, i.e., Read, Communicate, Browse, and Watch, for five minutes. They can freely choose the stimuli that we have prepared:
• Read: we create three sets of text images from three digital reading materials that differ in both text layout and font size: a transcription of Richard Hamming’s talk on “You and Your Research”; a chapter from the book “Rich Dad Poor Dad”; and a chapter from the book “Discrete Calculus”.
• Communicate: seven monologue videos are prepared, including: three video clips extracted from an online interview with Anthony Fauci; two video clips extracted from the ACM Turing Award Laureate interview with Raj Reddy; and two online YouTube videos in which the speaker is giving advice on career development. All videos have only one speaker.
• Browse: we randomly select a subset of 200 images from a public painting image dataset [22] that contains 7,937 images of famous paintings. During the data collection, for each of the subjects, we randomly select 30 images from the subset and show each of the selected images to the subject for 10 seconds.
• Watch: we randomly pick six short documentary videos from the online video series “National Geographic Animals 101”. Each video lasts 5 to 6 minutes.
The details of the stimuli used in the data collection can be found at https://github.com/EyeSyn/EyeSynResource.
Figure 13: Comparison between the actual (left) and the simulated (right) gaze signals for the four activities. The four rows from top to bottom correspond to the four different activities: Read, Communicate, Browse, and Watch.
6.2 Signal Level Evaluation
6.2.1 Setup. In this evaluation we leverage the Pupil Labs eye tracker to collect gaze signals from two subjects when they are performing the four activities. For each of the activities, we give the same visual stimuli to the two subjects, and ask them to perform each of the activities for 30 seconds. The stimuli used in this experiment are: (1) a page of text in the book “Rich Dad Poor Dad” for Read; (2) an interview video with Anthony Fauci for Communicate; (3) an image of a Paul Cezanne painting for Browse; and (4) a documentary video from the National Geographic series for Watch. For gaze simulation, the scale parameters 𝑑, 𝑤, and ℎ (defined in Figure 3) are set to 50 cm, 40 cm, and 30 cm, respectively. Identical visual stimuli are also used as the inputs for gaze synthesis.
6.2.2 Signal Comparison. The scatter plots in Figure 13 compare
the real gaze signals with the synthetic signals. The dots in each
of the images are the 900 gaze points displayed in a normalized
2D plane (with X and Y coordinates ranging from 0 to 1). The four
rows from top to bottom correspond to the gaze signals for Read,
Communicate, Browse, and Watch, respectively. The two columns
on the left correspond to the actual gaze signals of the two subjects;
the two columns on the right are the synthesized signals generated
in two simulation sessions.
First, the dierence between the gaze signals of the two subjects
demonstrates the heterogeneity in human visual behavior, even in
the case where the same visual stimuli and the same eye tracker were
used in the data collection. For instance, the gaze points shown in
Figure 13(a) cover a wider range in the Y direction than the gaze
points shown in Figure 13(b). This indicates that Subject 1 reads
faster than Subject 2 (i.e., Subject 1 reads more lines in 30 seconds).
Similarly, the gaze points in Figure 13(e) are clustered in a single
area, which indicates that Subject 1 fixates his visual attention on
a single facial region of the speaker in the monologue video. By
contrast, the three clusters in Figure 13(f) indicate that Subject 2
switches her attention among the three facial regions of the speaker.
Second, by comparing the synthesized signal with the real gaze signal, we make the following observations for each of the activities:
• Read: Figures 13(a-d) show that the distinct “left-to-right” reading pattern [13, 14] in the actual gaze signals is well reproduced in the simulated signals. Figures 13(c,d) show that the diversity in reading speed is also well captured in the simulated signals.
• Communicate: As shown in Figures 13(g,h), similar to the real gaze signal, the synthesized gaze points are clustered in three areas that correspond to the three facial regions of the speaker. The results show that the VerbalGaze model introduced in Section 4.3 can effectively replicate the actual visual behaviors of “fixating on and switching attention between different facial regions” [25].
• Browse: Figures 13(i-l) indicate that the geometric patterns of the gaze signals when subjects are browsing the painting are well reproduced by the StaticScene model introduced in Section 4.4. Specifically, in both real and synthesized signals, the gaze points are clustered at different saliency regions of the painting.
• Watch: In the stimuli used for the Watch activity, the most salient object appears frequently at locations that are close to the center of the scene. Thus, for both real and synthesized eye movement signals, the gaze points are densely located around the center of the 2D plane. As shown in Figures 13(m-p), this geometrical pattern is well simulated by the DynamicScene model.
Overall, our results demonstrate the feasibility of using EyeSyn to synthesize realistic eye movement signals that closely resemble the real ones. More specifically, our models can not only replicate the distinct trends and geometric patterns in the eye movement signal for each of the four activities, but can also simulate the heterogeneity among subjects. The latter is important, as a synthesized training dataset that captures the heterogeneity in eye movements can potentially overcome the domain shift problem in gaze-based activity recognition and ensure better classification accuracy [12].
6.3 Performance in Activity Recognition
Below, we leverage the synthetic and real gaze datasets, SynGaze and VisualProcessingActivity, to investigate how EyeSyn can be used to improve the performance of gaze-based activity recognition. Specifically, we consider the few-shot learning scenario, where we aim to train the CNN-based classifier (Section 5.2) such that it can quickly adapt to new subjects with only 𝐾 training instances (𝐾 ∈ {1, 2, 3, 5, 10} is a small number) for each of the four activities. We perform the evaluation on the VisualProcessingActivity dataset in the leave-one-subject-out manner, which has been used in previous studies [9, 12]. Specifically, we regard the data collected from one subject as the target set and the data collected from the remaining subjects as the source set. The single subject in the target set simulates the scenario where the system is deployed to a new subject with limited real gaze samples available for training (𝐾 samples per class). We denote the simulated gaze dataset SynGaze as the synthetic training set in our evaluation. The sensing window size is 30 s with 50% overlap between consecutive windows.
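For concreteness, the sketch below shows one plausible way to realize this leave-one-subject-out, 𝐾-shot protocol with 30 s windows and 50% overlap. It is an illustrative reconstruction rather than the released evaluation code; the data container `recordings`, the sampling rate `SAMPLE_RATE_HZ`, and all function names are assumptions.

```python
# A minimal sketch (not the released evaluation code) of the leave-one-subject-out,
# K-shot protocol with 30 s windows and 50% overlap. `recordings` maps each
# subject to a list of (gaze_xy, activity_label) pairs, where gaze_xy is an
# (N, 2) array of normalized gaze points; SAMPLE_RATE_HZ is an assumed rate.
import numpy as np

SAMPLE_RATE_HZ = 30
WINDOW_SEC, OVERLAP = 30, 0.5

def sliding_windows(gaze_xy):
    """Segment one recording into fixed-length, 50%-overlapping windows."""
    win = int(WINDOW_SEC * SAMPLE_RATE_HZ)
    step = int(win * (1.0 - OVERLAP))
    return [gaze_xy[s:s + win] for s in range(0, len(gaze_xy) - win + 1, step)]

def loso_split(recordings, target_subject, k_shot, seed=0):
    """Return (source, adapt, test) lists of (window, label) pairs."""
    rng = np.random.default_rng(seed)
    source, target = [], []
    for subject, samples in recordings.items():
        for gaze_xy, label in samples:
            rows = [(w, label) for w in sliding_windows(gaze_xy)]
            (target if subject == target_subject else source).extend(rows)
    # Draw K windows per activity from the target subject for adaptation;
    # the rest of that subject's windows form the test set.
    adapt, test = [], []
    for lbl in sorted({l for _, l in target}):
        idx = [i for i, (_, l) in enumerate(target) if l == lbl]
        chosen = set(rng.choice(idx, size=k_shot, replace=False).tolist())
        adapt += [target[i] for i in chosen]
        test += [target[i] for i in idx if i not in chosen]
    return source, adapt, test
```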
6.3.1 Methods. We consider five strategies to train the CNN-based classifier for the few-shot learning scenario:
(S1) Real data + Image-based data augmentation: we use the few-shot samples, i.e., 4×𝐾 samples, from the target set to train the classifier and test it using the remaining data in the target set. This represents the scenario where we only have the data collected from the target subject. Moreover, we apply the ImageDataGenerator [66] in Keras to perform standard image-based data augmentation during training. Specifically, we apply horizontal and vertical shifts with a range of (-0.3, 0.3) to the input gaze heatmaps to simulate shifts of the gaze signal in both the X and Y directions; we apply rotation augmentation with a range of (-10, 10) degrees to simulate variance in the gaze signal due to different head orientations; finally, we leverage zoom augmentation with a range of (0.5, 1.5) to simulate the effects of different viewing distances.
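These augmentation ranges map directly onto the Keras ImageDataGenerator API [66]; the snippet below is a minimal sketch of that configuration, where `heatmaps` and `labels` are placeholder arrays standing in for the gaze heatmaps and activity labels.

```python
# A minimal sketch of the standard image-based augmentation used in strategy S1,
# configured with the ranges quoted above via the Keras ImageDataGenerator API [66].
# `heatmaps` and `labels` are illustrative placeholders, e.g., arrays of shape
# (num_samples, H, W, 1) and (num_samples,).
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    width_shift_range=0.3,    # horizontal shift in (-0.3, 0.3) of the heatmap width
    height_shift_range=0.3,   # vertical shift in (-0.3, 0.3) of the heatmap height
    rotation_range=10,        # rotation in (-10, 10) degrees for head orientation
    zoom_range=[0.5, 1.5],    # zoom to mimic different viewing distances
)

# Example usage during training of the classifier:
# model.fit(augmenter.flow(heatmaps, labels, batch_size=32), epochs=50)
```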
(S2) Real data + Transfer learning: we first train the CNN-based classifier on the source set. Then, we employ transfer learning [67] to transfer the trained model to the target set. In brief, we freeze the pre-trained weights of all the convolutional layers in the DNN architecture (shown in Table 3), and fine-tune the fully connected layers using the few-shot samples from the target set. This strategy represents the scenario where we have access to the gaze samples collected from the other subjects during training. This method has been widely used for domain adaptation with few-shot instances [10, 17].
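As a concrete illustration of this freeze-and-fine-tune step (which strategy S4 below reuses with a classifier pre-trained on the synthetic set), the sketch below shows one plausible Keras implementation. The function name, hyperparameters, and the assumption that the fully connected layers are `Dense` layers are ours, not taken from the released code.

```python
# A minimal sketch of the freeze-and-fine-tune step in strategies S2 and S4,
# assuming `model` is a built Keras CNN (convolutional blocks followed by
# Dense layers, as in Table 3) that was pre-trained on the source set (S2)
# or on the synthetic set (S4). Activity labels are assumed integer-encoded.
import tensorflow as tf

def fine_tune(model, adapt_x, adapt_y, lr=1e-4, epochs=20):
    # Freeze every layer except the fully connected (Dense) ones, so only
    # the classifier head adapts to the few-shot samples of the new subject.
    for layer in model.layers:
        layer.trainable = isinstance(layer, tf.keras.layers.Dense)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(adapt_x, adapt_y, epochs=epochs, batch_size=4, verbose=0)
    return model
```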
(S3) Real data + MAML: we apply model-agnostic meta-learning (MAML) [68] to train the classifier on the VisualProcessingActivity dataset. Specifically, we use the source set to train the classifier in the meta-training phase, and fine-tune it with the few-shot instances from the target set in the adaptation phase [68]. The MAML-based strategy is the state-of-the-art solution for few-shot gaze-based activity recognition [12]. Similar to strategy S2, this strategy also assumes the availability of the source set during training.
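For readers unfamiliar with MAML, the sketch below outlines a simplified, first-order variant of the meta-training loop; the full method in [68] also backpropagates through the inner-loop updates. Everything here is illustrative: `sample_task`, the inner learning rate, and the number of inner steps are assumptions, and a built functional or Sequential Keras classifier is presumed.

```python
# A simplified, first-order sketch of MAML-style meta-training; the full
# method [68] also differentiates through the inner-loop updates. Assumes a
# built functional/Sequential Keras classifier `meta_model` with a softmax
# output and a hypothetical `sample_task()` helper that returns one
# (support_x, support_y, query_x, query_y) episode with integer labels.
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
meta_opt = tf.keras.optimizers.Adam(1e-3)
INNER_LR, INNER_STEPS = 0.01, 5

def meta_train_step(meta_model, sample_task, tasks_per_batch=4):
    meta_grads = [tf.zeros_like(w) for w in meta_model.trainable_variables]
    for _ in range(tasks_per_batch):
        sx, sy, qx, qy = sample_task()
        # Inner loop: adapt a copy of the meta-model on the support set.
        learner = tf.keras.models.clone_model(meta_model)
        learner.set_weights(meta_model.get_weights())
        for _ in range(INNER_STEPS):
            with tf.GradientTape() as tape:
                loss = loss_fn(sy, learner(sx, training=True))
            grads = tape.gradient(loss, learner.trainable_variables)
            for w, g in zip(learner.trainable_variables, grads):
                w.assign_sub(INNER_LR * g)
        # Outer loop: evaluate the adapted learner on the query set and use
        # its gradients as a first-order estimate of the meta-gradient.
        with tf.GradientTape() as tape:
            q_loss = loss_fn(qy, learner(qx, training=True))
        q_grads = tape.gradient(q_loss, learner.trainable_variables)
        meta_grads = [m + q / tasks_per_batch for m, q in zip(meta_grads, q_grads)]
    meta_opt.apply_gradients(zip(meta_grads, meta_model.trainable_variables))
```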
(S4) Synthetic data + Transfer learning: we train the classifier on the synthetic training set, and leverage transfer learning to fine-tune the fully connected layers of the classifier using the few-shot samples from the target set. In contrast to strategies S2 and S3, it requires only the synthesized gaze data for training, and only the few-shot real gaze samples are needed during the fine-tuning stage.
(S5) Synthetic data + MAML: we apply MAML on the synthetic training set during the meta-training phase. Then, in the adaptation phase, we fine-tune all layers of the classifier using the few-shot samples from the target set. Similar to strategy S4, we do not need any real gaze samples in the pre-training stage.
Figure 14: Accuracy of different training strategies in the few-shot learning scenario with gaze data collected from (a) Magic Leap One and (b) Pupil Labs.
6.3.2 Overall result. The performance of the five learning strategies with different numbers of shots (𝐾) is shown in Figure 14. Figures 14(a) and (b) report the accuracy averaged over all the subjects who use the Magic Leap One and the Pupil Labs in the data collection, respectively. The error bar is the standard deviation across the subjects. We make the following observations.
First, strategy S1 achieves the worst accuracy in all examined cases, as the limited training samples lead to overfitting; this indicates that standard image-based data augmentation cannot simulate the diversity in gaze signals even for the same subject. By contrast, using the synthetic gaze signals for training, the transfer learning and MAML-based strategies, i.e., S4 and S5, improve upon the accuracy of S1 by 17.9% and 19.5% on average, respectively.
Second, leveraging the synthetic gaze dataset for training, S4 and S5 achieve good accuracy on the datasets collected from both the Magic Leap One and the Pupil Labs. Moreover, since the two datasets are collected from different subjects in different environments, the results demonstrate the capability of the proposed models in capturing such diversity and improving the robustness of the classifier under heterogeneous sensing conditions.
Lastly, we compare the accuracy of the strategies that use real (S2 and S3) and synthetic (S4 and S5) gaze signals for training. The accuracy differences are further summarized in Table 4. As shown, for all examined cases, we see a negligible accuracy drop when using synthetic data for training. Specifically, for the data collected from the Magic Leap One and the Pupil Labs, we see only a 0.8% to 4.2% and a 0.3% to 4.0% accuracy drop, respectively. Moreover, when the number of shots 𝐾 ≥ 5, the accuracy deficiencies for transfer learning and MAML are less than 2% and 3%, respectively.
Note that the small accuracy gains achieved by S2 and S3 rely on a labor-intensive process to collect eye movement data from the other subjects. Based on our own experience, due to the calibration, experiment setup, instruction, and device failure, it takes more than 40 minutes to collect 20 minutes of gaze data with satisfactory quality from a single subject. Indeed, the labor-intensive and privacy-compromising [16] process has prohibited the collection of large-scale eye movement datasets, which is evidenced by the fact that the sizes of current public gaze-based activity datasets are on the order of a couple of hours [8, 9, 12]. By contrast, leveraging the massive gaze data simulated from the already-available images and videos for training, S4 and S5 eliminate the labor-intensive data collection and require only few-shot instances from the target subject for fine-tuning the classifier.
Table 4: The accuracy difference (in %) between the use of real and synthetic gaze signals for classifier training.
Eye tracker      Method                       K=10   K=5   K=3   K=2   K=1
Magic Leap One   Transfer learning (S2-S4)    1.5    1.6   4.2   3.8   0.8
Magic Leap One   MAML (S3-S5)                 2.0    3.0   2.4   3.5   1.2
Pupil Labs       Transfer learning (S2-S4)    0.3    1.2   3.7   3.3   4.0
Pupil Labs       MAML (S3-S5)                 2.2    2.3   2.1   2.4   2.4
Figure 15: Accuracy with different sizes of synthetic data used in the training. The classifier is tested on the data collected from: (a) Magic Leap One and (b) Pupil Labs.
6.3.3 Impact of synthetic data size and sensing window size. Below, we examine how the amount of synthetic data used in training and the sensing window size affect the recognition accuracy. We use strategy S4 as the training method in this evaluation.
First, we evaluate the recognition accuracy given different sizes of synthetic data used in training. Specifically, we use one-fifth, one-third, and all of the synthetic signals in SynGaze (Section 5.1) to train the CNN-based classifier. Then, for each of the subjects, we apply transfer learning to fine-tune the classifier using few-shot (𝐾) gaze samples from the corresponding target set. The results are shown in Figure 15. We observe that the accuracy increases with the size of the synthetic data used in training. Note that, since we use diverse image and video stimuli as the inputs for gaze simulation, a larger synthetic dataset indicates a higher diversity of input stimuli. Thus, the results indicate that the scalability of EyeSyn to diverse visual stimuli is crucial for the final recognition accuracy: taking the ready-to-use public image and video datasets as the inputs, EyeSyn can readily simulate a massive amount of diverse gaze signals, i.e., the 185 hours of data generated in the current work, to ensure good recognition accuracy.
Finally, we examine the impact of sensing window size on the recognition performance. As shown in Figure 16, for all the examined few-shot scenarios, the accuracy increases with the window size, as a larger sensing window contains more information about eye movements. Moreover, with a window size of five seconds, the accuracy drops significantly. This is because the five-second window is too short to contain enough distinct eye movement patterns. In fact, based on the statistics shown previously in Table 1 and Figure 8(b), a five-second window may contain only two fixation points, which is insufficient for activity recognition.
Figure 16: Accuracy with different sensing window sizes. The classifier is tested on the data collected from: (a) Magic Leap One and (b) Pupil Labs.
Overview: Our results demonstrate that the synthetic data can be incorporated with either transfer learning or MAML to achieve good recognition accuracy with only few-shot gaze instances required from the target sensing scenario (i.e., a new subject). More importantly, without sacrificing the recognition accuracy, the proposed work eliminates the need for the expensive and privacy-compromising large-scale eye movement dataset that is required by current state-of-the-art solutions [8, 12] for classifier training.
7 DISCUSSION
7.1 Limitations
Although EyeSyn embodies several psychology findings in the literature, its current design cannot fully replicate the complex mechanisms of human visual processing to synthesize eye movements for all subject groups. For instance, people with neurodevelopmental or mental disorders, such as autism spectrum disorder [69], schizophrenia [70], or social anxiety disorder [71], may exhibit atypical eye movement patterns in social interactions, e.g., avoiding direct eye contact with the communication partner. Moreover, decision making in visual attention is affected by many cognitive factors, such as the mental workload of the subject [7], the reward of different visual saliency [72], and the current cognitive task [73]. These cognitive factors have highly diverse impacts on eye movements [33]. The current design of the proposed generative models does not take these factors into account. In fact, the implementation of a generalized model is still an open challenge in visual behavior modeling research [28, 74], as it is difficult to have a one-size-fits-all model that can synthesize visual attention for all subject groups and possible cognitive cases. Solving this problem requires future endeavors to integrate knowledge from various disciplines, such as psychology, neuroscience, and the social sciences.
7.2 Future Directions
EyeSyn can be readily extended to cover more complex scenarios by embodying the atypical eye movement characteristics of different subject groups in its design. For instance, current works in the neuropsychology literature [75, 76] have shown that individuals with autism spectrum disorder exhibit reduced visual attention to social and semantic stimuli, e.g., faces, but focus more on non-social and low-level stimuli, e.g., vehicles. To model this behavior, we can extend the current saliency-based fixation estimation method by taking the social and semantic properties of the underlying stimuli into account, e.g., we can assign a higher weight to fixation points that are associated with non-social and low-level stimuli, and vice versa. Similarly, subjects with schizophrenia are known to have strikingly different eye movement patterns during smooth pursuit (a type of eye movement in which the eyes remain fixated on a moving object) and visual search [77, 78]. For instance, when conducting smooth pursuit to track a moving stimulus with their eyes, the gaze positions of subjects with schizophrenia often lag behind the moving stimulus, as the speed of their eye movements cannot keep up with that of the moving visual target [78] due to lesions in the superior temporal sulcus [79]. Thus, to model this atypical eye movement pattern in scene perception, we can introduce a lag when associating the coordinates of the selected salient location with the simulated gaze points. Overall, we believe the current design of EyeSyn can serve as an important first step towards a more comprehensive suite of models for eye movement synthesis.
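As a concrete, hypothetical illustration of these two extensions (not part of the current EyeSyn implementation), the sketch below down-weights socially salient fixation candidates and delays a synthesized pursuit trace behind its moving target; the input formats, the `social_bias` weight, and the lag length are all assumed.

```python
# A minimal sketch, under the assumptions stated above, of the two extensions
# discussed in this subsection: (1) re-weighting candidate fixation locations
# by a social-vs-non-social score before sampling, and (2) delaying synthesized
# gaze behind a moving target to mimic the pursuit lag reported for schizophrenia.
# `candidates` (list of (x, y, saliency, social_score)) and `target_traj`
# (a (T, 2) array of target positions over time) are hypothetical inputs.
import numpy as np

def pick_fixation(candidates, social_bias=0.3, rng=np.random.default_rng(0)):
    """Sample a fixation; social_bias < 1 down-weights social/semantic regions."""
    sal = np.array([c[2] for c in candidates], dtype=float)
    soc = np.array([c[3] for c in candidates], dtype=float)  # 1 = social, 0 = non-social
    weights = sal * (social_bias * soc + (1.0 - soc))
    weights /= weights.sum()
    idx = rng.choice(len(candidates), p=weights)
    return candidates[idx][:2]

def lagged_pursuit(target_traj, lag_samples=6):
    """Shift the gaze trace `lag_samples` behind the moving target."""
    gaze = np.empty_like(target_traj)
    gaze[:lag_samples] = target_traj[0]       # gaze starts on the initial position
    gaze[lag_samples:] = target_traj[:-lag_samples]
    return gaze
```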
7.3 Potential Applications
EyeSyn can also benefit applications that feature animated characters or avatars [80], such as video games [29, 30], social conversational agents [28], and photo-realistic facial animation for virtual reality [81–83]. In these applications, the virtual avatars should have realistic eye movements that are consistent with the ongoing activity and the visual stimuli. The gaze signals synthesized by EyeSyn can be used as the inputs of the avatar model to produce realistic eye movements for the facial animation. EyeSyn can also be used to estimate spatial-temporal attention when a user is viewing different visual stimuli [84, 85]. The estimated fixation locations and saccade trajectories can further serve as inputs for attention-adaptive systems to improve user-perceived quality in services such as webpage loading [86], gaze-contingent rendering [87], and foveated rendering in virtual and augmented reality [88, 89].
8 CONCLUSION
In this work we present EyeSyn, a novel suite of psychology-inspired generative models that leverage only publicly available images and videos to synthesize a realistic and arbitrarily large eye movement dataset for DNN training. Our evaluation demonstrates the efficacy of EyeSyn in replicating the distinct patterns in actual gaze signals, as well as in simulating the gaze diversity that results from different measurement setups and subject heterogeneity. Using gaze-based museum activity recognition as a case study, we show that a CNN-based classifier trained on the synthetic gaze signals can achieve 90% accuracy, without the need for labor-intensive and privacy-compromising data collection.
ACKNOWLEDGMENTS
We would like to thank the anonymous reviewers and the shep-
herd for their insightful comments and guidance. This work was
supported in part by NSF grants CSR-1903136 and CNS-1908051,
NSF CAREER Award IIS-2046072, and an IBM Faculty Award.
REFERENCES
[1] "Eye tracking on HoloLens 2," https://docs.microsoft.com/en-us/windows/mixed-reality/design/eye-tracking.
[2] "Magic Leap One," https://www.magicleap.com/en-us/magic-leap-1.
[3] "VIVE Pro Eye," https://www.vive.com/eu/product/vive-pro-eye/.
[4] N. Valliappan, N. Dai, E. Steinberg, J. He, K. Rogers, V. Ramachandran, P. Xu, M. Shojaeizadeh, L. Guo, K. Kohlhoff et al., "Accelerating eye movement research via accurate and affordable smartphone eye tracking," Nature Communications, vol. 11, no. 1, pp. 1–12, 2020.
[5] E. Wood and A. Bulling, "EyeTab: Model-based gaze estimation on unmodified tablet computers," in Proceedings of the ACM Symposium on Eye Tracking Research and Applications, 2014, pp. 207–210.
[6] Y. Sugano, X. Zhang, and A. Bulling, "AggreGaze: Collective estimation of audience attention on public displays," in Proceedings of the ACM Annual Symposium on User Interface Software and Technology, 2016, pp. 821–831.
[7] L. Fridman, B. Reimer, B. Mehler, and W. T. Freeman, "Cognitive load estimation in the wild," in Proceedings of the ACM CHI Conference on Human Factors in Computing Systems, 2018, pp. 1–9.
[8] N. Srivastava, J. Newn, and E. Velloso, "Combining low and mid-level gaze features for desktop activity recognition," Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 2, no. 4, p. 189, 2018.
[9] K. Kunze, Y. Utsumi, Y. Shiga, K. Kise, and A. Bulling, "I know what you are reading: Recognition of document types using mobile eye tracking," in Proceedings of the ACM International Symposium on Wearable Computers, 2013, pp. 113–116.
[10] H. Wu, J. Feng, X. Tian, E. Sun, Y. Liu, B. Dong, F. Xu, and S. Zhong, "EMO: Real-time emotion recognition from single-eye images for resource-constrained eyewear devices," in Proceedings of the ACM International Conference on Mobile Systems, Applications, and Services, 2020, pp. 448–461.
[11] S. Ahn, C. Kelton, A. Balasubramanian, and G. Zelinsky, "Towards predicting reading comprehension from gaze behavior," in Proceedings of the ACM Symposium on Eye Tracking Research and Applications, 2020, pp. 1–5.
[12] G. Lan, B. Heit, T. Scargill, and M. Gorlatova, "GazeGraph: Graph-based few-shot cognitive context sensing from human visual behavior," in Proceedings of the ACM Conference on Embedded Networked Sensor Systems, 2020, pp. 422–435.
[13] K. Rayner, "The 35th Sir Frederick Bartlett lecture: Eye movements and attention in reading, scene perception, and visual search," Quarterly Journal of Experimental Psychology, vol. 62, no. 8, pp. 1457–1506, 2009.
[14] G. Öquist and K. Lundin, "Eye movement study of reading text on a mobile phone using paging, scrolling, leading, and RSVP," in Proceedings of the ACM International Conference on Mobile and Ubiquitous Multimedia, 2007, pp. 176–183.
[15] M. K. Eckstein, B. Guerra-Carrillo, A. T. M. Singley, and S. A. Bunge, "Beyond eye gaze: What else can eyetracking reveal about cognition and cognitive development?" Developmental Cognitive Neuroscience, vol. 25, pp. 69–91, 2017.
[16] J. Li, A. R. Chowdhury, K. Fawaz, and Y. Kim, "Kalεido: Real-time privacy control for eye-tracking systems," in Proceedings of the USENIX Security Symposium, 2021, pp. 1793–1810.
[17] S. A. Rokni, M. Nourollahi, and H. Ghasemzadeh, "Personalized human activity recognition using convolutional neural networks," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[18] T. Gong, Y. Kim, J. Shin, and S.-J. Lee, "MetaSense: Few-shot adaptation to untrained conditions in deep mobile sensing," in Proceedings of the ACM Conference on Embedded Networked Sensor Systems, 2019, pp. 110–123.
[19] C. Esteban, S. L. Hyland, and G. Rätsch, "Real-valued (medical) time series generation with recurrent conditional GANs," arXiv preprint arXiv:1706.02633, 2017.
[20] J. Yoon, D. Jarrett, and M. van der Schaar, "Time-series generative adversarial networks," in Advances in Neural Information Processing Systems, vol. 32, 2019.
[21] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila, "Training generative adversarial networks with limited data," in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 12104–12114.
[22] "Best artworks of all time dataset," https://www.kaggle.com/ikarus777/best-artworks-of-all-time.
[23] "Noisy and rotated scanned documents dataset," https://www.kaggle.com/sthabile/noisy-and-rotated-scanned-documents.
[24] K. Rayner, "Eye movements in reading and information processing: 20 years of research," Psychological Bulletin, vol. 124, no. 3, p. 372, 1998.
[25] J. Jiang, K. Borowiak, L. Tudge, C. Otto, and K. von Kriegstein, "Neural mechanisms of eye contact when listening to another person talking," Social Cognitive and Affective Neuroscience, vol. 12, no. 2, pp. 319–328, 2017.
[26] B. W. Tatler, "The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions," Journal of Vision, vol. 7, no. 14, pp. 4–4, 2007.
[27] J. Nie, Y. Hu, Y. Wang, S. Xia, and X. Jiang, "SPIDERS: Low-cost wireless glasses for continuous in-situ bio-signal acquisition and emotion recognition," in Proceedings of the IEEE/ACM International Conference on Internet-of-Things Design and Implementation, 2020, pp. 27–39.
[28] K. Ruhland, S. Andrist, J. Badler, C. Peters, N. Badler, M. Gleicher, B. Mutlu, and R. McDonnell, "Look me in the eyes: A survey of eye and gaze animation for virtual agents and artificial systems," in Proceedings of Eurographics State of the Art Reports, 2014, pp. 69–91.
[29] S. H. Yeo, M. Lesmana, D. R. Neog, and D. K. Pai, "Eyecatch: Simulating visuomotor coordination for object interception," ACM Transactions on Graphics, vol. 31, no. 4, pp. 1–10, 2012.
[30] S. P. Lee, J. B. Badler, and N. I. Badler, "Eyes alive," in Proceedings of the Annual Conference on Computer Graphics and Interactive Techniques, 2002, pp. 637–644.
[31] A. Duchowski, S. Jörg, A. Lawson, T. Bolte, L. Świrski, and K. Krejtz, "Eye movement synthesis with 1/f pink noise," in Proceedings of the ACM SIGGRAPH Conference on Motion in Games, 2015, pp. 47–56.
[32] A. Duchowski, S. Jörg, T. N. Allen, I. Giannopoulos, and K. Krejtz, "Eye movement synthesis," in Proceedings of the ACM Symposium on Eye Tracking Research & Applications, 2016, pp. 147–154.
[33] A. Borji and L. Itti, "State-of-the-art in visual attention modeling," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 185–207, 2012.
[34] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.
[35] L. Itti and C. Koch, "Computational modelling of visual attention," Nature Reviews Neuroscience, vol. 2, no. 3, pp. 194–203, 2001.
[36] R. J. Peters and L. Itti, "Beyond bottom-up: Incorporating task-dependent influences into a computational model of spatial attention," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[37] T. Judd, K. Ehinger, F. Durand, and A. Torralba, "Learning to predict where humans look," in Proceedings of the IEEE International Conference on Computer Vision, 2009, pp. 2106–2113.
[38] Y. Zhai and M. Shah, "Visual attention detection in video sequences using spatiotemporal cues," in Proceedings of the ACM International Conference on Multimedia, 2006, pp. 815–824.
[39] J. Li, Y. Tian, T. Huang, and W. Gao, "Probabilistic multi-task learning for visual saliency estimation in video," International Journal of Computer Vision, vol. 90, no. 2, pp. 150–165, 2010.
[40] Z. Hu, C. Zhang, S. Li, G. Wang, and D. Manocha, "SGaze: A data-driven eye-head coordination model for realtime gaze prediction," IEEE Transactions on Visualization and Computer Graphics, vol. 25, no. 5, pp. 2002–2010, 2019.
[41] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara, "Predicting human eye fixations via an LSTM-based saliency attentive model," IEEE Transactions on Image Processing, vol. 27, no. 10, pp. 5142–5154, 2018.
[42] Y. Zhu, G. Zhai, X. Min, and J. Zhou, "The prediction of saliency map for head and eye movements in 360 degree images," IEEE Transactions on Multimedia, vol. 22, no. 9, pp. 2331–2344, 2020.
[43] F. Vitu, J. K. O'Regan, and M. Mittau, "Optimal landing position in reading isolated words and continuous text," Perception & Psychophysics, vol. 47, no. 6, pp. 583–600, 1990.
[44] K. Rayner and G. W. McConkie, "What guides a reader's eye movements?" Vision Research, vol. 16, no. 8, pp. 829–837, 1976.
[45] K. Rayner, S. C. Sereno, and G. E. Raney, "Eye movement control in reading: A comparison of two types of models," Journal of Experimental Psychology: Human Perception and Performance, vol. 22, no. 5, p. 1188, 1996.
[46] L. G. Lusk and A. D. Mitchel, "Differential gaze patterns on eyes and mouth during audiovisual speech segmentation," Frontiers in Psychology, vol. 7, p. 52, 2016.
[47] T. Foulsham, "Eye movements and their functions in everyday tasks," Eye, vol. 29, no. 2, pp. 196–199, 2015.
[48] S. Vassallo, S. L. Cooper, and J. M. Douglas, "Visual scanning in the recognition of facial affect: Is there an observer sex difference?" Journal of Vision, vol. 9, no. 3, pp. 11–11, 2009.
[49] M. Freeth, T. Foulsham, and A. Kingstone, "What affects social attention? Social presence, eye contact and autistic traits," PLoS One, vol. 8, no. 1, p. e53286, 2013.
[50] L. Itti and C. Koch, "A saliency-based search mechanism for overt and covert shifts of visual attention," Vision Research, vol. 40, no. 10–12, pp. 1489–1506, 2000.
[51] A. M. Treisman and G. Gelade, "A feature-integration theory of attention," Cognitive Psychology, vol. 12, no. 1, pp. 97–136, 1980.
[52] J. Najemnik and W. S. Geisler, "Optimal eye movement strategies in visual search," Nature, vol. 434, no. 7031, pp. 387–391, 2005.
[53] "Skin and Bones application in the Smithsonian National Museum of Natural History," https://naturalhistory.si.edu/exhibits/bone-hall.
[54] Z. Liu, G. Lan, J. Stojkovic, Y. Zhang, C. Joe-Wong, and M. Gorlatova, "CollabAR: Edge-assisted collaborative image recognition for mobile augmented reality," in Proceedings of the ACM/IEEE International Conference on Information Processing in Sensor Networks, 2020, pp. 301–312.
[55] "Pupil Labs eye tracker," https://pupil-labs.com/.
[56] D. Aks, G. Zelinsky, and J. Sprott, "Memory across eye-movements: 1/f dynamic in visual search," Journal of Vision, vol. 1, no. 3, pp. 230–230, 2001.
[57] N. J. Kasdin, "Discrete simulation of colored noise and stochastic processes and 1/f power law noise generation," Proceedings of the IEEE, vol. 83, no. 5, pp. 802–827, 1995.
[58] K. Holmqvist, M. Nyström, and F. Mulvey, "Eye tracker data quality: What it is and how to measure it," in Proceedings of the ACM Symposium on Eye Tracking Research & Applications, 2012, pp. 45–52.
[59] J. Johnsson and R. Matos, "Accuracy and precision test method for remote eye trackers," Test Specification of Tobii Technology, 2011.
[60] "Tesseract Open-Source OCR," https://opensource.google/projects/tesseract.
[61] R. Smith, D. Antonova, and D.-S. Lee, "Adapting the Tesseract open source OCR engine for multilingual OCR," in Proceedings of the International Workshop on Multilingual OCR, 2009, pp. 1–8.
[62] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, 2001, pp. I–I.
[63] D. D. Salvucci and J. H. Goldberg, "Identifying fixations and saccades in eye-tracking protocols," in Proceedings of the ACM Symposium on Eye Tracking Research & Applications, 2000, pp. 71–78.
[64] K. Rayner and M. Castelhano, "Eye movements," Scholarpedia, vol. 2, no. 10, p. 3649, 2007.
[65] "iMet Collection Artwork Dataset," https://github.com/visipedia/imet-fgvcx.
[66] "Keras ImageDataGenerator," https://keras.io/api/preprocessing/image/#imagedatagenerator-class.
[67] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Proceedings of the International Conference on Neural Information Processing Systems, 2014, pp. 3320–3328.
[68] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in Proceedings of the International Conference on Machine Learning, 2017, pp. 1126–1135.
[69] K. M. Dalton, B. M. Nacewicz, T. Johnstone, H. S. Schaefer, M. A. Gernsbacher, H. H. Goldsmith, A. L. Alexander, and R. J. Davidson, "Gaze fixation and the neural circuitry of face processing in autism," Nature Neuroscience, vol. 8, no. 4, pp. 519–526, 2005.
[70] S.-H. Choi, J. Ku, K. Han, E. Kim, S. I. Kim, J. Park, and J.-J. Kim, "Deficits in eye gaze during negative social interactions in patients with schizophrenia," The Journal of Nervous and Mental Disease, vol. 198, no. 11, pp. 829–835, 2010.
[71] F. R. Schneier, T. L. Rodebaugh, C. Blanco, H. Lewin, and M. R. Liebowitz, "Fear and avoidance of eye contact in social anxiety disorder," Comprehensive Psychiatry, vol. 52, no. 1, pp. 81–87, 2011.
[72] V. Navalpakkam, C. Koch, A. Rangel, and P. Perona, "Optimal reward harvesting in complex perceptual environments," Proceedings of the National Academy of Sciences, vol. 107, no. 11, pp. 5232–5237, 2010.
[73] V. Navalpakkam and L. Itti, "Modeling the influence of task on attention," Vision Research, vol. 45, no. 2, pp. 205–231, 2005.
[74] J. Gutiérrez, Z. Che, G. Zhai, and P. Le Callet, "Saliency4ASD: Challenge, dataset and tools for visual attention modeling for autism spectrum disorder," Signal Processing: Image Communication, vol. 92, p. 116092, 2021.
[75] G. Dawson, S. J. Webb, and J. McPartland, "Understanding the nature of face processing impairment in autism: Insights from behavioral and electrophysiological studies," Developmental Neuropsychology, vol. 27, no. 3, pp. 403–424, 2005.
[76] N. J. Sasson, J. T. Elison, L. M. Turner-Brown, G. S. Dichter, and J. W. Bodfish, "Brief report: Circumscribed attention in young children with autism," Journal of Autism and Developmental Disorders, vol. 41, no. 2, pp. 242–247, 2011.
[77] P. S. Holzman, L. R. Proctor, and D. W. Hughes, "Eye-tracking patterns in schizophrenia," Science, vol. 181, no. 4095, pp. 179–181, 1973.
[78] K. Morita, K. Miura, K. Kasai, and R. Hashimoto, "Eye movement characteristics in schizophrenia: A recent update with clinical implications," Neuropsychopharmacology Reports, vol. 40, no. 1, pp. 2–9, 2020.
[79] M. Dursteler and R. H. Wurtz, "Pursuit and optokinetic deficits following chemical lesions of cortical areas MT and MST," Journal of Neurophysiology, vol. 60, no. 3, pp. 940–965, 1988.
[80] K. Ruhland, C. E. Peters, S. Andrist, J. B. Badler, N. I. Badler, M. Gleicher, B. Mutlu, and R. McDonnell, "A review of eye gaze in virtual agents, social robotics and HCI: Behaviour generation, user interaction and perception," Computer Graphics Forum, vol. 34, no. 6, pp. 299–326, 2015.
[81] S.-E. Wei, J. Saragih, T. Simon, A. W. Harley, S. Lombardi, M. Perdoch, A. Hypes, D. Wang, H. Badino, and Y. Sheikh, "VR facial animation via multiview image translation," ACM Transactions on Graphics, vol. 38, no. 4, pp. 1–16, 2019.
[82] G. Schwartz, S.-E. Wei, T.-L. Wang, S. Lombardi, T. Simon, J. Saragih, and Y. Sheikh, "The eyes have it: An integrated eye and face model for photorealistic facial animation," ACM Transactions on Graphics, vol. 39, no. 4, pp. 91–1, 2020.
[83] A. Richard, C. Lea, S. Ma, J. Gall, F. De la Torre, and Y. Sheikh, "Audio- and gaze-driven facial animation of codec avatars," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 41–50.
[84] Y. Li, P. Xu, D. Lagun, and V. Navalpakkam, "Towards measuring and inferring user interest from gaze," in Proceedings of the ACM International Conference on World Wide Web Companion, 2017, pp. 525–533.
[85] P. Xu, Y. Sugano, and A. Bulling, "Spatio-temporal modeling and prediction of visual attention in graphical user interfaces," in Proceedings of the ACM CHI Conference on Human Factors in Computing Systems, 2016, pp. 3299–3310.
[86] C. Kelton, J. Ryoo, A. Balasubramanian, and S. R. Das, "Improving user perceived page load times using gaze," in Proceedings of the USENIX Symposium on Networked Systems Design and Implementation, 2017, pp. 545–559.
[87] E. Arabadzhiyska, O. T. Tursun, K. Myszkowski, H.-P. Seidel, and P. Didyk, "Saccade landing position prediction for gaze-contingent rendering," ACM Transactions on Graphics, vol. 36, no. 4, pp. 1–12, 2017.
[88] A. Patney, M. Salvi, J. Kim, A. Kaplanyan, C. Wyman, N. Benty, D. Luebke, and A. Lefohn, "Towards foveated rendering for gaze-tracked virtual reality," ACM Transactions on Graphics, vol. 35, no. 6, pp. 1–12, 2016.
[89] J. Kim, Y. Jeong, M. Stengel, K. Akşit, R. Albert, B. Boudaoud, T. Greer, J. Kim, W. Lopes, Z. Majercik et al., "Foveated AR: Dynamically-foveated augmented reality display," ACM Transactions on Graphics, vol. 38, no. 4, pp. 1–15, 2019.