
Predicting human gaze using low-level saliency combined with face detection


Abstract

Under natural viewing conditions, human observers shift their gaze to allocate processing resources to subsets of the visual input. Many computational models try to predict such voluntary eye and attentional shifts. Although the important role of high level stimulus properties (e.g., semantic information) in search stands undisputed, most models are based on low-level image properties. We here demonstrate that a combined model of face detection and low-level saliency significantly outperforms a low-level model in predicting locations humans fixate on, based on eye-movement recordings of humans observing photographs of natural scenes, most of which contained at least one person. Observers, even when not instructed to look for anything particular, fixate on a face with a probability of over 80% within their first two fixations; furthermore, they exhibit more similar scanpaths when faces are present. Remarkably, our model's predictive performance in images that do not contain faces is not impaired, and is even improved in some cases by spurious face detector responses.
Moran Cerf
Computation and Neural Systems
California Institute of Technology
Pasadena, CA 91125
Jonathan Harel
Electrical Engineering
California Institute of Technology
Pasadena, CA 91125
Wolfgang Einhäuser
Institute of Computational Science
Swiss Federal Institute of Technology (ETH)
Zurich, Switzerland
Christof Koch
Computation and Neural Systems
California Institute of Technology
Pasadena, CA 91125
1 Introduction
Although understanding attention is interesting purely from a scientific perspective, there are nu-
merous applications in engineering, marketing and even art that can benefit from the understanding
of both attention per se, and the allocation of resources for attention and eye movements. One ac-
cessible correlate of human attention is the fixation pattern in scanpaths [1], which has long been of
interest to the vision community [2]. Commonalities between different individuals’ fixation patterns
allow computational models to predict where people look, and in which order [3]. There are several
models for predicting observers’ fixations [4], some of which are inspired by putative neural mech-
anisms. A frequently referenced model for fixation prediction is the Itti et al. saliency map model
(SM) [5]. This “bottom-up” approach is based on contrasts of intrinsic image features such as
color, orientation, intensity, flicker, motion and so on, without any explicit information about higher
order scene structure, semantics, context or task-related (“top-down”) factors, which may be cru-
cial for attentional allocation [6]. Such a bottom-up saliency model works well when higher order
semantics are reflected in low-level features (as is often the case for isolated objects, and even for
reasonably cluttered scenes), but tends to fail if other factors dominate: e.g., in search tasks [7, 8],
strong contextual effects [9], or in free-viewing of images without clearly isolated objects, such as
forest scenes or foliage [10]. Here, we test how images containing faces - ecologically highly rel-
evant objects - influence variability of scanpaths across subjects. In a second step, we improve the
standard saliency model by adding a “face channel” based on an established face detector algorithm.
Although there is an ongoing debate regarding the exact mechanisms which underlie face detection,
there is no dispute that a normal subject (in contrast to autistic patients) does not interpret
a face purely as a reddish blob with four lines, but as a much more significant entity ([11, 12]). In
fact, there is mounting evidence of infants’ preference for face-like patterns before they can even
consciously perceive the category of faces [13], which is crucial for emotion and social processing
([13, 14, 15, 16]).
Face detection is a well investigated area of machine vision. There are numerous computer-vision
models for face detection with good results ([17, 18, 19, 20]). One widely used model for face
detection is the Viola & Jones [21] feature-based template matching algorithm (VJ). There have
been previous attempts to incorporate face detection into a saliency model. However, they have
either relied on biasing a color channel toward skin hue [22], which is ineffective in many
cases and not face-selective per se, or they have suffered from a lack of generality [23]. We here
propose a system which combines the bottom-up saliency map model of Itti et al. [5] with the Viola
& Jones face detector.
The contributions of this study are: (1) Experimental data showing that subjects exhibit significantly
less variable scanpaths when viewing natural images containing faces, marked by a strong tendency
to fixate on faces early. (2) A novel saliency model which combines a face detector with intensity,
color, and orientation information. (3) Quantitative results on two versions of this saliency model,
including one extended from a recent graph-based approach, which show that, compared to previous
approaches, it better predicts subjects’ fixations on images with faces, and predicts as well otherwise.
2 Methods
2.1 Experimental procedures
Seven subjects viewed a set of 250 images (1024 ×768 pixels) in a three phase experiment. 200 of
the images included frontal faces of various people; 50 images contained no faces but were otherwise
identical, allowing a comparison of viewing a particular scene with and without a face. In the first
(“free-viewing”) phase of the experiment, 200 of these images (the same subset for each subject)
were presented to subjects for 2 s, after which they were instructed to answer “How interesting was
the image?” using a scale of 1-9 (9 being the most interesting). Subjects were not instructed to
look at anything in particular; their only task was to rate the entire image. In the second (“search”)
phase, subjects viewed another 200 image subset in the same setup, only this time they were initially
presented with a probe image (either a face, or an object in the scene: banana, cell phone, toy car,
etc.) for 600 ms after which one of the 200 images appeared for 2 s. They were then asked to
indicate whether that image contained the probe. Half of the trials had the target probe present. In
half of those the probe was a face. Early studies suggest that there should be a difference between
free-viewing of a scene, and task-dependent viewing of it [2, 4, 6, 7, 24]. We used the second
task to test if there are any differences in the fixation orders and viewing patterns between free-
viewing and task-dependent viewing of images with faces. In the third phase, subjects performed a
100-image recognition memory task where they had to answer with y/n whether they had seen the
image before. 50 of the images were taken from the experimental set and 50 were new. Subjects’
mean performance was 97.5% correct, verifying that they were indeed alert during the experiment.
The images were introduced as “regular images that one can expect to find in an everyday personal
photo album”. Scenes were indoors and outdoors still images (see examples in Fig. 1). Images
included faces in various skin colors, age groups, and positions (no image had the face at the center
as this was the starting fixation location in all trials). A few images had face-like objects (see balloon
in Fig. 1, panel 3), animal faces, and objects that had irregular faces in them (masks, the Egyptian
sphinx face, etc.). Faces also vary in size (percentage of the entire image). The average face was
5% ± 1% (mean ± s.d.) of the entire image - between 1° and 5° of the visual field; we also varied the
number of faces in the image between 1 and 6, with a mean of 1.1 ± 0.48. Image order was randomized
throughout, and subjects were naïve to the purpose of the experiment. Subjects fixated on a cross in
the center before each image onset. Eye-position data were acquired at 1000 Hz using an Eyelink
1000 (SR Research, Osgoode, Canada) eye-tracking device. The images were presented on a CRT
screen (120 Hz), using Matlab’s Psychophysics and eyelink toolbox extensions ([25, 26]). Stimulus
luminance was linear in pixel values. The distance between the screen and the subject was 80 cm,
giving a total visual angle for each image of 28° × 21°. Subjects used a chin-rest to stabilize their
head. Data were acquired from the right eye alone. All subjects had uncorrected normal eyesight.
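The viewing geometry can be sanity-checked with a short calculation. The following is a minimal sketch, assuming a flat screen centered on the line of sight and assuming that the reported 28 × 21 visual angle is in degrees; the function names are illustrative and not part of the experimental software.

```python
import math

def extent_from_angle(distance_cm, angle_deg):
    """Physical extent (cm) that subtends angle_deg at distance_cm,
    for a flat surface centered on the line of sight."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)

def pixels_per_degree(n_pixels, angle_deg):
    """Average number of pixels per degree of visual angle (an approximation)."""
    return n_pixels / angle_deg

width_cm = extent_from_angle(80.0, 28.0)    # horizontal extent of the image
height_cm = extent_from_angle(80.0, 21.0)   # vertical extent of the image
ppd = pixels_per_degree(1024, 28.0)         # horizontal resolution in pixels per degree
print(round(width_cm, 1), round(height_cm, 1), round(ppd, 1))
```

Under these assumptions the image spans roughly 40 cm by 30 cm, which is close to the 4:3 aspect ratio of the 1024 × 768 stimuli, and one degree corresponds to roughly 37 pixels.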
Figure 1: Examples of stimuli during the “free-viewing” phase. Notice that faces have neutral
expressions. Upper 3 panels include scanpaths of one individual. The red triangle marks
the first and the red square the last fixation, the yellow line the scanpath, and the red circles
the subsequent fixations. Lower panels show scanpaths of all 7 subjects. The trend of visiting
the faces first - typically within the 1st or 2nd fixation - is evident. All images are available at˜moran/db/faces/.
2.2 Combining face detection with various saliency algorithms
We tried to predict the attentional allocation via fixation patterns of the subjects using various
saliency maps. In particular, we computed four different saliency maps for each of the images
in our data set: (1) a saliency map based on the model of [5] (SM), (2) a graph-based saliency map
according to [27] (GBSM), (3) a map which combines SM with face-detection via VJ (SM+VJ), and
(4) a saliency map combining the outputs of GBSM and VJ (GBSM+VJ). Each saliency map was
represented as a positive valued heat map over the image plane.
SM is based on computing feature maps, followed by center-surround operations which highlight lo-
cal gradients, followed by a normalization step prior to combining the feature channels. We used the
“Maxnorm” normalization scheme which is a spatial competition mechanism based on the squared
ratio of global maximum over average local maximum. This promotes feature maps with one con-
spicuous location to the detriment of maps presenting numerous conspicuous locations. The graph-
based saliency map model (GBSM) employs spectral techniques in lieu of center surround subtrac-
tion and “Maxnorm” normalization, using only local computations. GBSM has shown more robust
correlation with human fixation data compared with standard SM [27].
For face detection, we used the Intel Open Source Computer Vision Library (“OpenCV”) [28] im-
plementation of [21]. This implementation rapidly processes images while achieving high detection
rates. An efficient classifier built using the Ada-Boost learning algorithm is used to select a small
number of critical visual features from a large set of potential candidates. Combining classifiers
in a cascade allows background regions of the image to be quickly discarded, so that more cycles
process promising face-like regions using a template matching scheme. The detection is done by
applying a classifier to a sliding search window of 24x24 pixels. The detectors are made of three
joined black and white rectangles, either upright or rotated by 45°. The values at each point are
calculated as a weighted sum of two components: the pixel sum over the black rectangles and the
sum over the whole detector area. The classifiers are combined to make a boosted cascade with
classifiers going from simple to more complex, each possibly rejecting the candidate window as
“not a face” [28]. This implementation of the facedetect module was used with the standard default
training set of the original model. We used it to form a “Faces conspicuity map”, or “Face channel”
by convolving delta functions at the (x,y) detected facial centers with 2D Gaussians having standard
deviation equal to estimated facial radius. The values of this map were normalized to a fixed range.
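The construction of the face channel can be sketched as follows. This is an illustration only: the detections are hypothetical (cx, cy, radius) tuples rather than actual VJ output, each Gaussian is given unit peak height, and the "fixed range" is assumed to be [0, 1].

```python
import math

def face_conspicuity_map(width, height, faces):
    """Sum one 2D Gaussian per detected face, then rescale so the peak is 1.

    faces: list of (cx, cy, radius) tuples; the radius is used as the
    Gaussian standard deviation, as described in the text.
    """
    fmap = [[0.0] * width for _ in range(height)]
    for cx, cy, radius in faces:
        for y in range(height):
            for x in range(width):
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                fmap[y][x] += math.exp(-d2 / (2.0 * radius ** 2))
    peak = max(max(row) for row in fmap)
    if peak > 0:
        fmap = [[v / peak for v in row] for row in fmap]
    return fmap

# Hypothetical detections on a small 64x48 grid (not data from the study).
F = face_conspicuity_map(64, 48, [(16, 12, 4.0), (48, 30, 6.0)])
```

Placing a Gaussian at each center is equivalent to convolving a delta function at that location with the Gaussian kernel. Larger faces produce broader peaks, so nearby fixations still receive credit.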
For both SM and GBSM, we computed the combined saliency map as the mean of the normalized
color (C), orientation (O), and intensity (I) maps [5]:
$S = \frac{1}{3}\,\bigl(N(I) + N(C) + N(O)\bigr)$
And for SM+VJ and GBSM+VJ, we incorporated the normalized face conspicuity map (F) into this
mean (see Fig 2):
$S = \frac{1}{4}\,\bigl(N(I) + N(C) + N(O) + N(F)\bigr)$
This is our combined face detector/saliency model. Although we could have explored the space of
combinations which would optimize predictive performance, we chose to use this simplest possible
combination, since it is the least complicated to analyze, and also provides us with first intuition for
further studies.
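The equal-weight combination can be illustrated with a few lines of code. This is a minimal sketch: the toy 2 × 2 maps are invented for the example, and the normalization operator N(·) is replaced here by a plain rescaling to [0, 1], which is an assumption and not the “Maxnorm” scheme used for SM.

```python
def normalize(m):
    """Rescale a 2D map to [0, 1]; a constant map becomes all zeros.
    Stand-in for N(.), assumed here to be simple range normalization."""
    lo = min(min(row) for row in m)
    hi = max(max(row) for row in m)
    if hi == lo:
        return [[0.0 for _ in row] for row in m]
    return [[(v - lo) / (hi - lo) for v in row] for row in m]

def combine(maps):
    """Pixelwise mean of the normalized input maps (equal weights)."""
    normed = [normalize(m) for m in maps]
    rows, cols = len(normed[0]), len(normed[0][0])
    return [[sum(n[r][c] for n in normed) / len(normed) for c in range(cols)]
            for r in range(rows)]

# Toy feature maps (illustrative values only).
I = [[0, 2], [4, 8]]
C = [[1, 1], [1, 3]]
O = [[0, 1], [0, 0]]
F = [[5, 0], [0, 0]]   # a "face" in the top-left pixel

S3 = combine([I, C, O])      # bottom-up channels only
S4 = combine([I, C, O, F])   # with the face channel added
```

In this toy case the top-left pixel has zero saliency under the three bottom-up channels and a value of 0.25 once the face channel is included, while locations without a face are scaled down by the larger denominator.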
[Figure 2 diagram. Panel labels: Color, Intensity, Orientation; Saliency Map; Saliency Map with face detection]
Figure 2: Modified saliency model. An image is processed through standard [5] color, orientation
and intensity multi-scale channels, as well as through a trained template-matching face detection
mechanism. Face coordinates and radius from the face detector are used to form a face conspicuity
map (F), with peaks at facial centers. All four maps are normalized to the same dynamic range, and
added with equal weights to a final saliency map (SM+VJ, or GBSM+VJ). This is compared to a
saliency map which only uses the three bottom-up features maps (SM or GBSM).
3 Results
3.1 Psychophysical results
To evaluate the results of the 7 subjects’ viewing of the images, we manually defined minimally
sized rectangular regions-of-interest (ROIs) around each face in the entire image collection. We first
assessed, in the “free-viewing” phase, how many of the first fixations went to a face, how many of the
second, third fixations and so forth. In 972 out of the 1050 (7 subjects x 150 images with faces) trials
(92.6%), the subject fixated on a face at least once. In 645/1050 (61.4%) trials, a face was fixated
on within the first fixation, and of the remaining 405 trials, a face was fixated on in the second
fixation in 71.1% (288/405), i.e. after two fixations a face was fixated on in 88.9% (933/1050) of
trials (Fig. 3). Given that the face ROIs were chosen very conservatively (i.e. fixations just next to
a face do not count as fixations on the face), this shows that faces, if present, are typically fixated
on within the first two fixations (327 ms ±95 ms on average). Furthermore, in addition to finding
early fixations on faces, we found that inter-subject scanpath consistency on images with faces
was higher. For the free-viewing task, the mean minimum distance to another subject’s fixation
(averaged over fixations and subjects) was 29.47 pixels on images with faces, and 34.24 pixels on
images without faces (with p < 10^{-6}). We found similar results using a variety of different metrics
(ROC, Earth Mover’s Distance, Normalized Scanpath Saliency, etc.). To verify that the double
spatial bias of photographer and observer (see [29] for a discussion of this issue) did not artificially result
in high fractions of early fixations on faces, we compared our results to an unbiased baseline: for
each subject, the fraction of fixations from all images which fell in the ROIs of one particular image.
The null hypothesis that we would see the same fraction of first fixations on a face at random is
rejected at p < 10^{-20} (t-test).
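The inter-subject consistency measure used above (mean minimum distance to another subject's fixation) can be sketched as follows. The exact pooling is an assumption: here, for every fixation of every subject on one image, the distance to the nearest fixation of any other subject is taken, and all of these distances are averaged. The subject labels and coordinates are made up for the example.

```python
import math

def mean_min_distance(scanpaths):
    """scanpaths: dict mapping subject id -> list of (x, y) fixations on one image.

    Returns the mean, over all fixations of all subjects, of the distance to
    the closest fixation made by a different subject.
    """
    dists = []
    for subj, fixations in scanpaths.items():
        others = [p for s, fx in scanpaths.items() if s != subj for p in fx]
        if not others:
            continue
        for f in fixations:
            dists.append(min(math.dist(f, p) for p in others))
    if not dists:
        raise ValueError("need fixations from at least two subjects")
    return sum(dists) / len(dists)

# Toy data (pixel coordinates), not taken from the recordings.
toy = {"s1": [(0, 0), (10, 0)], "s2": [(0, 3), (10, 4)], "s3": [(3, 4)]}
consistency = mean_min_distance(toy)
```

Smaller values indicate that subjects looked at more similar locations, which is the direction of the effect reported for images with faces.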
To test for the hypothesis that face saliency is not due to top-down preference for faces in the absence
of other interesting things, we examined the results of the “search” task, in which subjects were
presented with a non-face target probe in 50% of the trials. Given the short amount of time for
the search (2 s), subjects should have attempted to tune their internal saliency weights to adjust color,
intensity, and orientation optimally for the searched target [30]. Nevertheless, subjects still tended
to fixate on the faces early. A face was fixated on within the first fixation in 24% of trials, within
the first two fixations in 52% of trials, and within the first three fixations in 77% of the trials. While this
is substantially weaker than in free-viewing, where 88.9% was achieved after just two fixations, the
difference from what would be expected for random fixation selection (unbiased baseline as above)
is still highly significant (p < 10^{-8}).
Overall, we found that in both experimental conditions (“free-viewing” and “search”), faces were
powerful attractors of attention, accounting for a strong majority of early fixations when present.
This trend allowed us to easily improve standard saliency models, as discussed below.
Figure 3: Extent of fixation on face regions-of-interest (ROIs) during the “free-viewing” phase.
Left: image with all fixations (7 subjects) superimposed. First fixation marked in blue, second in
cyan, remaining fixations in red. Right: Bars depict the percentage of trials in which a face is first
reached on the first, second, third, ... fixation. The solid curve depicts the integral, i.e. the fraction of
trials in which faces were fixated on at least once up to and including the nth fixation.
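The percentages reported in Section 3.1 follow directly from the trial counts given there. The short check below uses only those counts; the variable names are illustrative, and counts beyond the second fixation are not given in the text and are therefore not included.

```python
n_trials = 7 * 150                # 7 subjects x 150 images with faces
first_hit_counts = [645, 288]     # trials first reaching a face on fixation 1 and 2

# Cumulative percentage of trials that reached a face by the nth fixation.
cumulative = []
running = 0
for count in first_hit_counts:
    running += count
    cumulative.append(round(100.0 * running / n_trials, 1))

# Of the trials that missed a face on the first fixation, the share hitting on the second.
conditional_second = round(100.0 * 288 / (n_trials - 645), 1)

# Trials in which a face was fixated at least once during the 2 s presentation.
ever = round(100.0 * 972 / n_trials, 1)

print(cumulative, conditional_second, ever)
```

The cumulative values correspond to the first two points of the solid curve in Fig. 3.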
3.2 Assessing the saliency map models
We ran VJ on each of the 200 images used in the free viewing task, and found at least one face
detection on 176 of these images, 148 of which actually contained faces (only two images with
faces were missed). For each of these 176 images, we computed four saliency maps (SM, GBSM,
SM+VJ, GBSM+VJ) as discussed above, and quantified the compatibility of each with our scan-
path recordings, in particular fixations, using the area under an ROC curve. The ROC curves were
generated by sweeping over saliency value thresholds, and treating the fraction of non-fixated pixels
on a map above threshold as false alarms, and the fraction of fixated pixels above threshold as hits
[29, 31]. According to this ROC fixation “prediction” metric, for the example image in Fig. 4, all
models predict above chance (50%): SM performs worst, and GBSM+VJ best, since including the
face detector substantially improves performance in both cases.
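The ROC procedure described above can be sketched in a few lines. This is a simplified illustration, assuming that every distinct saliency value is used as a threshold and that the area is computed with the trapezoidal rule; the tiny map and the fixated pixels are invented for the example.

```python
def roc_auc(saliency, fixated):
    """Area under the ROC curve for one saliency map.

    saliency: 2D list of floats.
    fixated:  set of (row, col) pixels that received a fixation.
    Hits are fixated pixels at or above threshold; false alarms are
    non-fixated pixels at or above threshold.
    """
    fix_vals, non_vals = [], []
    for r, row in enumerate(saliency):
        for c, v in enumerate(row):
            (fix_vals if (r, c) in fixated else non_vals).append(v)
    thresholds = sorted(set(fix_vals + non_vals), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        hit = sum(v >= t for v in fix_vals) / len(fix_vals)
        fa = sum(v >= t for v in non_vals) / len(non_vals)
        points.append((fa, hit))
    # Trapezoidal integration of hit rate over false-alarm rate.
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2.0
    return auc

toy_map = [[0.9, 0.1, 0.2],
           [0.3, 0.8, 0.4]]
auc_good = roc_auc(toy_map, {(0, 0), (1, 1)})   # fixations on the two most salient pixels
auc_bad = roc_auc(toy_map, {(0, 1), (0, 2)})    # fixations on the two least salient pixels
```

An AUC of 0.5 corresponds to chance, values near 1 mean that fixated pixels consistently carry higher saliency than non-fixated ones, and values below 0.5 mean the map is anti-predictive.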
Figure 4: Comparison of the area-under-the-curve (AUC) for image 1 (chosen arbitrarily; subjects’
scanpaths are shown in the left panels of Fig. 1). Top panel: image with the 49 fixations of the 7
subjects (red). Initial central fixations for each subject were excluded. From left to right, saliency
map model of Itti et al. (SM) with fixations superimposed, saliency map with the VJ face detection
map (SM+VJ), the graph-based saliency map (GBSM), and the graph-based saliency map with face
detection channel (GBSM+VJ). Lower panels depict ROC curves corresponding to each map. Here,
GBSM+VJ predicts fixations best, as quantified by the highest AUC.
Across all 176 images, this trend prevails (Fig. 5): first, all models perform better than chance,
even over the 28 images without faces. The SM+VJ model performed better than the SM model for
154/176 images. The null hypothesis that this result arises by chance can be rejected at p < 10^{-22} (using
a coin-toss sign-test for which model does better, with uniform null-hypothesis, neglecting the size
of effects). Similarly, the GBSM+VJ model performed better than the GBSM model for 142/176
images, a comparably vast majority (p < 10^{-15}) (see Fig. 5, right). For the 148/176 images with
faces, SM+VJ was better than SM alone for 144/148 images (p < 10^{-29}), whereas VJ alone (equal
to the face conspicuity map) was better than SM alone for 83/148 images, a fraction that fails to
reach significance. Thus, although the face conspicuity map was surprisingly predictive on its own,
fixation predictions were much better when it was combined with the full saliency model. For the 28
images without faces, SM (better than SM+VJ for 18) and SM+VJ (better than SM for 10) did not
show a significant difference, nor did GBSM vs. GBSM+VJ (better on 15/28 compared to 13/28,
respectively, p < 0.85). However, in a recent follow-up study with more non-face images, we found
preliminary results indicating that the mean ROC score of VJ-enhanced saliency maps is higher on
such non-face images, although the median is slightly lower, i.e. performance is much improved
when improved at all, indicating that VJ false positives can sometimes enhance saliency maps.
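The coin-toss sign test used for the model comparisons can be reproduced with an exact binomial tail. This is a sketch under the assumption of a two-sided test with no ties; the text does not state whether the reported values are one- or two-sided, so the exact numbers may differ slightly from those above.

```python
from math import comb

def sign_test_p(wins, n):
    """Two-sided exact sign test: probability, under a fair coin, of a split
    at least as lopsided as wins vs. n - wins."""
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1))
    return min(1.0, 2 * tail / 2 ** n)

p_sm = sign_test_p(154, 176)       # SM+VJ better than SM, all images
p_gbsm = sign_test_p(142, 176)     # GBSM+VJ better than GBSM, all images
p_vj_alone = sign_test_p(83, 148)  # VJ alone better than SM, images with faces
p_noface = sign_test_p(15, 28)     # GBSM vs. GBSM+VJ, images without faces
```

Because the test only counts which model wins on each image, it ignores the size of the AUC difference, as noted in the text.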
In summary, we found that adding a face detector channel improves fixation prediction in images
with faces dramatically, while it does not impair prediction in images without faces, even though the
face detector has false alarms in those cases.
4 Discussion
[Figure 5: two scatterplots with axis ticks from 0.5 to 0.9; legend: image with face, image without face]
Figure 5: Performance of SM compared to SM+VJ and GBSM compared to GBSM+VJ. Scatterplots
depict the area under ROC curves (AUC) for the 176 images in which VJ found a face. Each
point represents a single image. Points above the diagonal indicate better prediction of the model
including face detection compared to the models without face channel. Blue markers denote images
with faces; red markers images without faces (i.e. false positives of the VJ face detector). Histograms
of the SM and SM+VJ (GBSM and GBSM+VJ) are depicted to the top and left (binning:
0.05); color code as in scatterplots.
First, we demonstrated that in natural scenes containing frontal shots of people, faces were fixated on
within the first few fixations, whether subjects had to grade an image on interest value or search it for
a specific possibly non-face target. This powerful trend motivated the introduction of a new saliency
model, which combined the “bottom-up” feature channels of color, orientation, and intensity, with a
special face-detection channel, based on the Viola & Jones algorithm. The combination was linear
in nature with uniform weight distribution for maximum simplicity. In attempting to predict the
fixations of human subjects, we found that this additional face channel improved the performance of
both a standard and a more recent graph-based saliency model (almost all blue points in Fig. 5 are
above the diagonal) in images with faces. In the few images without faces, we found that the false
positives represented in the face-detection channel did not significantly alter the performance of the
saliency maps – although in a preliminary follow-up on a larger image pool we found that they boost
mean performance. Together, these findings point towards a specialized “face channel” in our vision
system, which is subject to current debate in the attention literature [11, 12, 32].
In conclusion, inspired by biological understanding of human attentional allocation to meaningful
objects - faces - we presented a new model for computing an improved saliency map which is more
consistent with gaze deployment in natural images containing faces than previously studied models,
even though the face detector was trained on standard sets. This suggests that faces always attract
attention and gaze, relatively independent of the task. They should therefore be considered as part
of the bottom-up saliency pathway.
[1] G. Rizzolatti, L. Riggio, I. Dascola, and C. Umilta. Reorienting attention across the horizontal and vertical
meridians: evidence in favor of a premotor theory of attention. Neuropsychologia, 25(1A):31–40, 1987.
[2] G.T. Buswell. How People Look at Pictures: A Study of the Psychology of Perception in Art. The
University of Chicago press, 1935.
[3] M. Cerf, D. R. Cleary, R. J. Peters, and C. Koch. Observers are consistent when rating image conspicuity.
Vis Res, 47(25):3017–3027, 2007.
[4] S.J. Dickinson, H.I. Christensen, J. Tsotsos, and G. Olofsson. Active object recognition integrating atten-
tion and viewpoint control. Computer Vision and Image Understanding, 67(3):239–260, 1997.
[5] L. Itti, C. Koch, E. Niebur, et al. A model of saliency-based visual attention for rapid scene analysis.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, 1998.
[6] A.L. Yarbus. Eye Movements and Vision. Plenum Press New York, 1967.
[7] J.M. Henderson, J.R. Brockmole, M.S. Castelhano, and M. Mack. Visual Saliency Does Not Account
for Eye Movements during Visual Search in Real-World Scenes. Eye Movement Research: Insights into
Mind and Brain, R. van Gompel, M. Fischer, W. Murray, and R. Hill, Eds., 1997.
[8] Gregory Zelinsky, Wei Zhang, Bing Yu, Xin Chen, and Dimitris Samaras. The role of top-down and
bottom-up processes in guiding eye movements during visual search. In Y. Weiss, B. Schölkopf, and
J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1569–1576. MIT Press,
Cambridge, MA, 2006.
[9] A. Torralba, A. Oliva, M.S. Castelhano, and J.M. Henderson. Contextual guidance of eye movements and
attention in real-world scenes: the role of global features in object search. Psych Rev, 113(4):766–786,
[10] W. Einhäuser and P. König. Does luminance-contrast contribute to a saliency map for overt visual
attention? Eur. J Neurosci, 17(5):1089–1097, 2003.
[11] O. Hershler and S. Hochstein. At first sight: a high-level pop out effect for faces. Vision Res, 45(13):1707–
24, 2005.
[12] R. Vanrullen. On second glance: Still no high-level pop-out effect for faces. Vision Res, 46(18):3017–
3027, 2006.
[13] C. Simion and S. Shimojo. Early interactions between orienting, visual sampling and decision making in
facial preference. Vision Res, 46(20):3331–3335, 2006.
[14] R. Adolphs. Neural systems for recognizing emotion. Curr. Op. Neurobiol., 12(2):169–177, 2002.
[15] A. Klin, W. Jones, R. Schultz, F. Volkmar, and D. Cohen. Visual Fixation Patterns During Viewing of
Naturalistic Social Situations as Predictors of Social Competence in Individuals With Autism, 2002.
[16] JJ Barton. Disorders of face perception and recognition. Neurol Clin, 21(2):521–48, 2003.
[17] K.K. Sung and T. Poggio. Example-based learning for view-based human face detection. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 20(1):39–51, 1998.
[18] H.A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 20(1):23–38, 1998.
[19] H. Schneiderman and T. Kanade. Statistical method for 3 D object detection applied to faces and cars.
Computer Vision and Pattern Recognition, 1:746–751, 2000.
[20] D. Roth, M. Yang, and N. Ahuja. A snow-based face detection. In S. A. Solla, T. K. Leen, and K. R.
Muller, editors, Advances in Neural Information Processing Systems 13, pages 855–861. MIT Press,
Cambridge, MA, 2000.
[21] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. Computer
Vision and Pattern Recognition, 1:511–518, 2001.
[22] D. Walther. Interactions of visual attention and object recognition: computational modeling, algorithms,
and psychophysics. PhD thesis, California Institute of Technology, 2006.
[23] C. Breazeal and B. Scassellati. A context-dependent attention system for a social robot. 1999 International
Joint Conference on Artificial Intelligence, pages 1254–1259, 1999.
[24] V. Navalpakkam and L. Itti. Search Goal Tunes Visual Features Optimally. Neuron, 53(4):605–617, 2007.
[25] D.H. Brainard. The psychophysics toolbox. Spat Vis, 10(4):433–436, 1997.
[26] F.W. Cornelissen, E.M. Peters, and J. Palmer. The Eyelink Toolbox: Eye tracking with MATLAB and the
Psychophysics Toolbox. Behav Res Meth Instr Comput, 34(4):613–617, 2002.
[27] J. Harel, C. Koch, and P. Perona. Graph-based visual saliency. In B. Schölkopf, J. Platt, and T. Hoffman,
editors, Advances in Neural Information Processing Systems 19, pages 545–552. MIT Press, Cambridge,
MA, 2007.
[28] G. Bradski, A. Kaehler, and V. Pisarevsky. Learning-based computer vision with Intel’s open source
computer vision library. Intel Technology Journal, 9(1), 2005.
[29] B.W. Tatler, R.J. Baddeley, and I.D. Gilchrist. Visual correlates of fixation selection: effects of scale and
time. Vision Res, 45(5):643–59, 2005.
[30] V. Navalpakkam and L. Itti. Search goal tunes visual features optimally. Neuron, 53(4):605–617, 2007.
[31] R.J. Peters, A. Iyer, L. Itti, and C. Koch. Components of bottom-up gaze allocation in natural images.
Vision Res, 45(18):2397–2416, 2005.
[32] O. Hershler and S. Hochstein. With a careful look: Still no low-level confound to face pop-out Authors’
reply. Vis Res, 46(18):3028–3035, 2006.
... The onset of gaze unfolding is approximately located at the center of the image for all viewers. Image and eye-tracking data are publicly available from the FIFA dataset [15]. ...
... A proof of concept is thus put forward in Section 6, by using a publicly available eye-tracking dataset [15]. A typical example of the kind of data we are dealing with has been presented in Figure 1. ...
... Right: the observer's scan path, namely, the continuous raw data trace parsed into a discrete sequence of fixations (yellow disks) and saccades (segments between subsequent fixations); disk radius is proportional to fixation time. Image and eye-tracking data are publicly available from the FIFA dataset [15]. All in all, the exploration/exploitation pattern springing from the gaze sampling endeavour provides a signature of the individual's plans, goals, interests, likely sources of rewards and expectations about future events, [63,64], social traits and personality [61,65]. ...
Full-text available
A principled approach to the analysis of eye movements for behavioural biometrics is laid down. The approach grounds in foraging theory, which provides a sound basis to capture the uniqueness of individual eye movement behaviour. We propose a composite Ornstein-Uhlenbeck process for quantifying the exploration/exploitation signature characterising the foraging eye behaviour. The relevant parameters of the composite model, inferred from eye-tracking data via Bayesian analysis, are shown to yield a suitable feature set for biometric identification; the latter is eventually accomplished via a classical classification technique. A proof of concept of the method is provided by measuring its identification performance on a publicly available dataset. Data and code for reproducing the analyses are made available. Overall, we argue that the approach offers a fresh view on either the analyses of eye-tracking data and prospective applications in this field.
... Faces and especially eyes elicit faster saccades than other stimuli (Broda, Haddad, & de Haas, 2022;Crouzet, Kirchner, & Thorpe, 2010) and most observers fixate faces within the first two fixations when present in a scene. Consequently, the addition of face detection to low-level saliency models significantly improves gaze prediction (Cerf, Harel, J., Einhäuser, & Koch, 2008). On top of the general tendency to fixate faces, there are large and reliable individual differences (Guy et al., 2019). ...
... Our analyses revealed that, even though body features such as torsi and legs had the largest share of person pixels in the images, face regions attracted more fixations, underscoring their saliency (Cerf et al., 2008(Cerf et al., , 2009de Haas et al., 2019). Relative to pixel size, eyes and mouths attracted almost 10 times as many fixations as any other person feature, underscoring the salience of inner face features (Barton, Radcliffe, Cherkasova, Edelman, & Intriligator, 2006;Birmingham, Bischof, & Kingstone, 2008; Kauffmann, Khazaz, Peyrin, & Guyader, 2021; van Results indicate that participants who looked more often at persons overall also tended to direct more of these fixations onto the face and fewer onto the body. ...
Individuals freely viewing complex scenes vary in their fixation behavior. The most prominent and reliable dimension of such individual differences is the tendency to fixate faces. However, much less is known about how observers distribute fixations across other body parts of persons in scenes and how individuals may vary in this regard. Here, we aimed to close this gap. We expanded a popular annotated stimulus set (Xu, Jiang, Wang, Kankanhalli, & Zhao, 2014) with 6,365 hand-delineated pixel masks for the body parts of 1,136 persons embedded in 700 complex scenes, which we publish with this article. This resource allowed us to analyze the person-directed fixations of 103 participants freely viewing these scenes. We found large and reliable individual differences in the distribution of fixations across person features. Individual fixation tendencies formed two anticorrelated clusters, one for the eyes, head, and the inner face and one for body features (torsi, arms, legs, and hands). Interestingly, the tendency to fixate mouths was independent of the face cluster. Finally, our results show that observers who tend to avoid person fixations in general do so particularly for the face region. These findings underscore the role of individual differences in fixation behavior and reveal underlying dimensions. They are further in line with a recently proposed push-pull relationship between cortical tuning for faces and bodies. They may also aid the comparison of special populations to general variation.
... Other key elements of the game viewing experience, such as facecam and chat, elicited a higher amount of visual attention than IGAs, with the chat drawing the highest amount of visual attention (10.68%). Unexpectedly, the results of the facecam (4.60%), although positive (+1.11% concerning IGAs), are not consistent with the widely supported concept of the human face as a more powerful attentional driver than other visual stimuli [81][82][83][84][85][86], thus opening new research avenues in contexts such as esports. In addition, this paper contributed to identifying the advertising spaces that generated higher visual attention during the game viewing experience, represented by the "Digital Billboards" (1.59%) and the "Left Banner" (1.30%). ...
... For example, concerning the chat, a component that has been studied for decades [47,48] and that drew the highest amount of visual attention in the current study, it could be interesting to investigate to what extent the presence of chat emojis or different "chat update rates" may influence the attention toward the chat itself and/or toward other key elements (e.g., IGAs). The results obtained for the chat also showed that facecam, an element characterized by social connectivity and interactivity [28,50,51], cannot be acknowledged as the most powerful attentional driver in this context, in contrast to other studies portraying the human face as the attention grabber par excellence [81][82][83][84][85][86]. To further explore such evidence, future research may examine whether the facecam's ability to capture the users' attention changes over time (e.g., decreases) and to what extent such potential changes may affect the visual attention toward other key elements (e.g., IGAs). ...
In recent years, technological advances and the introduction of social streaming platforms (e.g., Twitch) have contributed to an increase in the popularity of esports, a highly profitable industry with millions of active users. In this context, there is little evidence, if any, on how users perceive in-game advertising (IGA) and other key elements of the game viewing experience (e.g., facecam and chat) in terms of visual attention. The present eye-tracking study aimed at investigating those aspects, and introducing an eye-tracking research protocol specifically designed to accurately measure the visual attention associated with key elements of the game viewing experience. Results showed that (1) the ads available in the game view (IGAs) together attract 3.49% of the users’ visual attention; (2) the chat section draws 10.68% of the users’ visual attention, more than the streamer’s face, which is known as a powerful attentional driver; (3) the animated ad format elicits higher visual attention (1.46%) than the static format (1.12%); and (4) in some circumstances, the visual attention elicited by the ads is higher in the “Goal” scenes (0.69%) in comparison to “No-Goal” scenes (0.51%). Relevant managerial implications and future directions for the esports industry are reported and discussed.
... Inspired by this, simultaneous semantic segmentation is introduced in our method to simulate the human eyes. (Cerf et al., 2007; Schauerte & Stiefelhagen, 2012) propose to utilize a face detector to determine the visual saliency. (Ren et al., 2015) process, the recent object detection networks mainly focus on the features inside the proposed bounding boxes in the second stage reasoning on an object level while semantic segmentation can more explicitly model the object and the background information on a pixel level simultaneously. ...
Saliency Prediction aims to predict the attention distribution of human eyes given an RGB image. Most of the recent state-of-the-art methods are based on deep image feature representations from traditional CNNs. However, the traditional convolution could not capture the global features of the image well due to its small kernel size. Besides, the high-level factors which closely correlate to human visual perception, e.g., objects, color, light, etc., are not considered. Inspired by these, we propose a Transformer-based method with semantic segmentation as another learning objective. More global cues of the image could be captured by Transformer. In addition, simultaneously learning the object segmentation simulates the human visual perception, which we would verify in our investigation of human gaze control in cognitive science. We build an extra decoder for the subtask and the multiple tasks share the same Transformer encoder, forcing it to learn from multiple feature spaces. We find in practice simply adding the subtask might confuse the main task learning, hence Multi-task Attention Module is proposed to deal with the feature interaction between the multiple learning targets. Our method achieves competitive performance compared to other state-of-the-art methods.
... As shown by Cerf et al. [Cer+08], the presence of faces in images is very important high-level information to take into account when studying visual attention. We then provide bounding boxes delimiting each face on each frame. ...
When watching movies, we do not grasp the full image that is displayed at all times. Instead, we focus on several parts of the frame, depending on what we deem relevant, be it for the visual properties of this area or its semantic importance in the narration. With more than a century of cinematographic experience, filmmakers have developed a whole array of tools and techniques to direct the attention of their audience, using cuts, camera motion, staging, and so on. In this work, we propose to explore the links between film editing and the visual perception an audience has of it, using a data-driven approach. While there exist many efficient models predicting where people will look in a video, we found that these models could often be wrong on cinematographic stimuli. We then propose a visual saliency model designed to include the high-level information created by the director's editing choices, and we show a significant improvement on cinematic stimuli compared to the state-of-the-art. Finally, we propose two models dedicated to predicting the inter-observer visual congruency on both static and dynamic stimuli, with particular care to the case of cinematographic stimuli.
... Moreover, it is the first detection framework to operate in real time. The Viola-Jones detector has been widely used as a foundation for face identification algorithms [26,27] prior to the development of deep learning technology. ...
Repair and maintenance of underwater structures as well as marine science rely heavily on the results of underwater object detection, which is a crucial part of the image processing workflow. Although many computer vision-based approaches have been presented, no one has yet developed a system that reliably and accurately detects and categorizes objects and animals found in the deep sea. This is largely due to obstacles that scatter and absorb light in an underwater setting. With the introduction of deep learning, scientists have been able to address a wide range of issues, including safeguarding the marine ecosystem, saving lives in an emergency, preventing underwater disasters, and detecting, spooring, and identifying underwater targets. However, the benefits and drawbacks of these deep learning systems remain unknown. Therefore, the purpose of this article is to provide an overview of the dataset that has been utilized in underwater object detection and to present a discussion of the advantages and disadvantages of the algorithms employed for this purpose.
In general, humans preferentially look at conspecifics in naturalistic images. However, such group-based effects might conceal systematic individual differences concerning the preference for social information. Here, we investigated to what degree fixations on social features occur consistently within observers and whether this preference generalizes to other measures of social prioritization in the laboratory as well as the real world. Participants carried out a free viewing task, a relevance taps task that required them to actively select image regions that are crucial for understanding a given scene, and they were asked to freely take photographs outside the laboratory that were later classified regarding their social content. We observed stable individual differences in the fixation and active selection of human heads and faces that were correlated across tasks and partly predicted the social content of self-taken photographs. Such a relationship was not observed for human bodies, indicating that different social elements need to be dissociated. These findings suggest that idiosyncrasies in the visual exploration and interpretation of social features exist and predict real-world behavior. Future studies should further characterize these preferences and elucidate how they shape perception and interpretation of social contexts in healthy participants and patients with mental disorders that affect social functioning.
Visual attention is one of the most important mechanisms deployed in the human visual system (HVS) to reduce the amount of information that our brain needs to process. An increasing amount of effort has been dedicated to the study of visual attention, and this chapter proposes to clarify the advances achieved in computational modeling of visual attention. First the concepts of visual attention, including the links between visual salience and visual importance, are detailed. The main characteristics of the HVS involved in the process of visual perception are also explained. Next we focus on eye-tracking, because of its role in the evaluation of the performance of the models. A complete state of the art in computational modeling of visual attention is then presented. The research works that extend some visual attention models to 3D by taking into account the impact of depth perception are finally explained and compared.
Conference Paper
A new bottom-up visual saliency model, Graph-Based Visual Saliency (GBVS), is proposed. It consists of two steps: first forming activation maps on certain feature channels, and then normalizing them in a way which highlights conspicuity and admits combination with other maps. The model is simple, and biologically plausible insofar as it is naturally parallelized. This model powerfully predicts human fixations on 749 variations of 108 natural images, achieving 98% of the ROC area of a human-based control, whereas the classical algorithms of Itti & Koch ([2], [3], [4]) achieve only 84%.
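The ROC-area score mentioned above can be illustrated with a small sketch. It assumes one common convention, where fixated pixels are positives and all other pixels are negatives; the abstract does not spell out the exact evaluation procedure, so the function name `saliency_roc_area` and this convention are assumptions made for illustration.

```python
import numpy as np

def saliency_roc_area(saliency, fixations):
    """Area under the ROC curve of a saliency map as a fixation predictor.

    saliency:  2-D array of saliency values.
    fixations: iterable of (row, col) fixated pixels, treated as positives;
               every other pixel is treated as a negative (an assumption).
    """
    saliency = np.asarray(saliency, dtype=float)
    fixated = np.zeros(saliency.shape, dtype=bool)
    for r, c in fixations:
        fixated[r, c] = True
    pos = saliency[fixated]
    neg = saliency[~fixated]
    # The ROC area equals the probability that a randomly chosen fixated
    # pixel has higher saliency than a randomly chosen non-fixated pixel,
    # with ties counted as one half.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)
```

A map that ranks every fixated pixel above every non-fixated one scores 1.0, and a constant map scores 0.5 (chance). A figure such as "98% of the ROC area of a human-based control" is then a ratio of two such scores.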
Conference Paper
A novel learning approach for human face detection using a network of linear units is presented. The SNoW learning architecture is a sparse network of linear functions over a pre-defined or incrementally learned feature space and is specifically tailored for learning in the presence of a very large number of features. A wide range of face images in different poses, with different expressions and under different lighting conditions are used as a training set to capture the variations of human faces. Experimental results on commonly used benchmark data sets of a wide range of face images show that the SNoW-based approach outperforms methods that use neural networks, Bayesian methods, support vector machines and others. Furthermore, learning and evaluation using the SNoW-based method are significantly more efficient than with other methods.
In this paper, we describe a statistical method for 3D object detection. We represent the statistics of both object appearance and 'non-object' appearance using a product of histograms. Each histogram represents the joint statistics of a subset of wavelet coefficients and their position on the object. Our approach is to use many such histograms representing a wide variety of visual attributes. Using this method, we have developed the first algorithm that can reliably detect human faces with out-of-plane rotation and the first algorithm that can reliably detect passenger cars over a wide range of viewpoints.
Photographic records were made of the eye movements of 200 subjects while they were looking at pictures of paintings (colored and uncolored), of vases and dishes, of furniture and design, of statuary and museum pieces, of tapestries, buildings, posters, outlines, and geometric figures. 67 plates and 10 tables illustrate and summarize the results. Records were made both of direction and duration of movement. Color has little effect on eye movement, which, however, is influenced by the instructions given the subject, by training in art, and by the length of time that the picture is inspected. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
We tested the hypothesis that fixation locations during scene viewing are primarily determined by visual salience. Eye movements were collected from participants who viewed photographs of real-world scenes during an active search task. Visual salience as determined by a popular computational model did not predict region-to-region saccades or saccade sequences any better than did a random model. Consistent with other reports in the literature, intensity, contrast, and edge density differed at fixated scene regions compared to regions that were not fixated, but these fixated regions also differ in rated semantic informativeness. Therefore, any observed correlations between fixation locations and image statistics cannot be unambiguously attributed to these image statistics. We conclude that visual saliency does not account for eye movements during active search. The existing evidence is consistent with the hypothesis that cognitive factors play the dominant role in active gaze control.
We present an active object recognition strategy which combines the use of an attention mechanism for focusing the search for a 3-D object in a 2-D image, with a viewpoint control strategy for disambiguating recovered object features. The attention mechanism consists of a probabilistic search through a hierarchy of predicted feature observations, taking objects into a set of regions classified according to the shapes of their bounding contours. If the features recovered during the attention phase do not provide a unique mapping to the 3-D object being searched, the probabilistic feature hierarchy can be used to guide the camera to a new viewpoint from where the object can be disambiguated.