Emotion Recognition and Eye Gaze Estimation System: EREGE

Suzan Anwar (1,2), Mariofanna Milanova (1), Shereen Abdulla (3), and Zvetomira Svetleff (4)

1 University of Arkansas at Little Rock, Little Rock, AR 72204, USA
  {sxanwar,mgmilanova}@ualr.edu
2 Salahaddin University, Erbil, Iraq
3 Polytechnic University, Erbil, Iraq
  shereenabdulla@epu.edu.krd
4 University of Nevada, Las Vegas, USA
  svetleff@unlv.nevada.edu
Abstract. In this paper we propose EREGE, a face analysis package that covers face detection, eye detection, eye tracking, emotion recognition, and gaze estimation. The system consists of two parts. The first part is facial emotion recognition, which recognizes seven emotions: neutral, happiness, sadness, anger, disgust, fear, and surprise. Here we implement an Active Shape Model (ASM) tracker that follows 116 facial landmarks from webcam input; the tracked landmark points are used to extract facial expression features, and a Support Vector Machine (SVM) classifier makes the system robust in recognizing the seven emotions. The second part is eye gaze estimation, which starts by creating a head model and then applies both the Active Shape Model (ASM) and the Pose from Orthography and Scaling with Iterations (POSIT) algorithms for head tracking and pose estimation.
Keywords: Face emotion · ASM · Gaze estimation · SVM · POSIT · RANSAC
1 Introduction
EREGE is a real-time face analysis package that includes face detection, eye detection, eye tracking, emotion recognition, and gaze estimation. The system can be utilized for health monitoring by recording gaze direction and combining it with other health-related measurements such as skin color. It provides a tool for monitoring the emotional behavior of autistic children by tracking their eye movements and recognizing their facial expressions [1]. EREGE can also be used to analyze how a customer views advertisements and which products catch the eye in online or physical stores. Because EREGE considers the affective states of its users and maintains communication and interaction information, it is a good example of successful human-computer interaction (HCI). A Deep Neural Network (DNN) followed by a Conditional Random Field (CRF) has been used for facial expression
recognition in videos [2]. A method that aggregates features along fiducial trajectories in a deeply learned feature space has been proposed to distinguish deceptive expressions from genuine ones [3]. Several kinds of eye and head gestures, such as smooth pursuit, saccades, and nod-roll, have been presented as interaction methods for a head-mounted virtual reality (VR) device; a head-mounted display coupled with an eye tracker was used to find which gestures give a better user experience in a VR environment [4]. To the best of our knowledge, the only available gaze estimation system that works with a standard camera is the open-source Opengazer. Its main disadvantage is that the setup is fixed: the user must remain completely still while the program is running. There are many commercial systems for eye gaze estimation and emotion recognition, but they require special equipment, starting with dedicated head-mounted infrared cameras. We make the following contributions:
• Design, build, and test a complete system for both emotion detection and eye gaze estimation that can serve many useful applications and fields.
• EREGE does not require any expensive hardware for eye gaze estimation or emotion recognition.
2 Algorithm
2.1 Emotion Recognition Algorithm
The emotion recognition algorithm applies the method proposed in [1], which includes the following steps.
Pre-processing and Using ASM for Face Detection and Triangulation Point Tracking. One of the most prevalent methods for detecting and tracking triangulation points is the Active Shape Model (ASM) introduced by Cootes et al. in 1995 [5]. To identify the triangulation points in an image, the location of the face is first detected with the general-purpose Viola-Jones face detector [6] (a minimal detection sketch follows the steps below). The average face shape, aligned to the position of the detected face, constitutes the starting point of the search. The steps described below are then repeated until the shape converges.
1. For each point, the best matching position with the template is identified using the gradient of the image texture in the proximity of that point.
2. The identified points are projected from their locations in the training set onto the shape eigenspace obtained by Principal Component Analysis (PCA).
ASM tracking is implemented with the asmlibrary developed by Wei [7].
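As a rough illustration of the detection step, the sketch below uses OpenCV's stock Haar cascade (a Viola-Jones detector) to find the face box that would seed the ASM search. It is a minimal sketch assuming the opencv-python distribution; the asmlibrary fit itself is not shown.

```python
# Sketch of the face-detection step that seeds the ASM search, assuming
# OpenCV's bundled Haar cascade (a stand-in; the paper uses asmlibrary for the fit).
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture(0)          # webcam input, as in EREGE
ret, frame = cap.read()
if ret:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)  # normalize illumination before detection
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                     minSize=(80, 80))
    for (x, y, w, h) in faces:
        # The mean ASM shape would be aligned inside this box to start the fit.
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
cap.release()
```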
Region Finding. The ASM-based tracker tracks 116 triangulation points, as shown in Fig. 1(c). The tracker works robustly on eyebrow, eye, chin, and nose points; however, since it cannot track the flexible lip points correctly, it is not reasonable to use the locations of these points directly as attributes. There are two reasons for this: first, ASM models the locations of all triangulation points holistically; second, small changes in the location of lip points are lost to the shape constraints imposed by PCA. Furthermore, the intensity difference at the lip edges is not as significant as at other face components. Therefore, instead of directly using the locations of the tracked triangulation points, the attributes shown in Fig. 1(b) are used. The first three attributes are the Mahalanobis distances between the corresponding triangulation points. For the other attributes, the image is first smoothed with a Gaussian kernel and then filtered with vertical and horizontal Sobel kernels separately to compute edge maps; next, the sum of absolute edge values is calculated for each region (a sketch of this computation is given after Table 1). As motivation for selecting these attributes, the movement descriptions corresponding to each emotional expression given in Table 1 can be examined. These cues are based on the pioneering research of Ekman and Friesen [8].

Fig. 1. Facial landmarks (a), attributes used to extract the regions of interest (b), and facial triangulation points (c).

Table 1. Emotional expressions and their descriptions
  Surprise:  rise of eyebrows, slight opening of mouth, slight fall of chin
  Anger:     frowning of eyebrows, tightening of lips, and standing out of the eyes
  Happiness: rise and fall of mouth edges
  Sadness:   fall of mouth edges and frowning of inner eyebrows
  Fear:      rise of eyebrows, standing out of the eyes, and slight opening of mouth
  Disgust:   rise of upper lip, wrinkle of nose, fall of cheeks
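The following is a minimal sketch of the attribute computation described above, assuming a grayscale frame `gray` and a hypothetical `region` box around a tracked facial area; the kernel sizes, the function names, and the choice of regions are illustrative assumptions rather than EREGE's exact settings.

```python
# Sketch of the per-region edge-energy attribute and the Mahalanobis distance
# attribute. `gray` (grayscale frame) and `region` ((x, y, w, h) box) are
# hypothetical inputs, not the paper's exact code.
import cv2
import numpy as np

def edge_energy(gray, region):
    """Sum of absolute Sobel responses inside one region of interest."""
    x, y, w, h = region
    roi = gray[y:y + h, x:x + w]
    smooth = cv2.GaussianBlur(roi, (5, 5), 0)         # Gaussian-kernel smoothing
    gx = cv2.Sobel(smooth, cv2.CV_32F, 1, 0)          # horizontal-derivative edges
    gy = cv2.Sobel(smooth, cv2.CV_32F, 0, 1)          # vertical-derivative edges
    return float(np.abs(gx).sum() + np.abs(gy).sum())

def mahalanobis(p, q, cov):
    """Mahalanobis distance between two landmark points, given a 2x2 covariance."""
    d = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))
```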
Emotion Classification. In this step, an average of the derived attributes is calculated over the initial frames; in the frames that follow, the attributes are normalized by dividing them by this average so that the system acts independently of environmental variables [1]. The normalized attribute vectors are then classified into one of the seven emotions with an SVM classifier (a minimal sketch follows).
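A minimal sketch of this classification stage is given below, assuming per-frame attribute vectors are already available. The training data, labels, and SVM parameters shown are placeholders, not the values used in EREGE.

```python
# Sketch of baseline normalization followed by SVM emotion classification.
# X_train/y_train are random placeholders standing in for extracted attributes.
import numpy as np
from sklearn.svm import SVC

EMOTIONS = ["neutral", "happiness", "sadness", "anger",
            "disgust", "fear", "surprise"]

def normalize(frames, baseline):
    """Divide each frame's attributes by the per-attribute baseline average."""
    return np.asarray(frames) / np.asarray(baseline)

X_train = np.random.rand(70, 10)           # placeholder attribute vectors
y_train = np.repeat(np.arange(7), 10)      # placeholder emotion labels

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print(EMOTIONS[clf.predict(X_train[:1])[0]])   # predicted emotion for one frame
```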
2.2 Eye Gaze Estimation Algorithm
The algorithm presented in [9] is used to estimate the eye gaze. The proposed algorithm
consists of the following steps:
Head Tracking. The process starts by detecting an eye using a classifier. Open
Computer Vision Library (OpenCV) [11] is used for algorithm implementation. The
Active Shape model from the face expression algorithm explained in pervious section
Fig. 1. Facial landmarks (a), attributes that are used to extract the region of interest (b) and facial
triangulation points (c)
Table 1. Emotional expressions and their descriptions
Emotion Description
Surprise Rise of eyebrows, sight opening of mouth, slight fall of chin
Anger Frowning of eyebrows, tightening of lips and standing out of eyes
Happiness Rise and fall of mouth edges
Sadness Fall of mouth edges and frowning of inner eyebrows
Fear Rise of eye brows, standing out of eyes and slight opening of mouth
Disgust Rise of upper lip, wrinkle of nose, fall of cheeks
366 S. Anwar et al.
is used to detect and track the head. Creating a model starts with building an object
pattern using a set of labeled points representing a given shape. The grid is determined
using the Delaunay triangulation method [12] on the set of characteristic points.
The texture g of each input image is defined as the vector of pixel intensities in the image:

$$ g = [g_1, g_2, \ldots, g_m]^T \qquad (1) $$
The columns of the matrix G are the normalized texture vectors g of the training images. The covariance matrix, where N denotes the number of images in the training set, is calculated as follows:

$$ \Sigma_g = \frac{1}{N-1} G^T G \qquad (2) $$
A new texture is generated as a linear combination of the eigenvectors of the covariance matrix, where $U_g$ is the matrix containing the eigenvectors and $b_g$ is a parameter vector:

$$ g = g_0 + U_g b_g \qquad (3) $$

The 2-D feature points obtained here are used in the next section to construct the 3-D head model.
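The sketch below illustrates Eqs. (1)-(3) with NumPy, assuming the texture vectors have already been sampled and normalized. The data is random, and the explicit mean-centering is our assumption, added for clarity.

```python
# NumPy sketch of the texture model: stack texture vectors as columns of G,
# build the covariance, and reconstruct a texture from a parameter vector b_g.
import numpy as np

N, m = 20, 500                        # training images, pixels per texture
G = np.random.rand(m, N)              # columns are normalized textures g_i
g0 = G.mean(axis=1)                   # mean texture
Gc = G - g0[:, None]                  # centered textures (our assumption)
cov = (Gc.T @ Gc) / (N - 1)           # Eq. (2): N x N covariance
vals, vecs = np.linalg.eigh(cov)
U = Gc @ vecs[:, ::-1]                # eigen-textures, largest variance first
U /= np.linalg.norm(U, axis=0)

b = np.zeros(N)
b[0] = 0.1                            # parameter vector b_g (illustrative)
g_new = g0 + U @ b                    # Eq. (3): new texture instance
```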
3D Head Model Initialization. The Lucas-Kanade algorithm is a gradient-based method used to relate an image to the next image in the same sequence. Its performance relies on three basic assumptions: the brightness of the image does not change much between successive frames, the motion of objects in the image is small, and points that lie within a short distance of each other move similarly. The image brightness is described as a function of time:

$$ f(x, y, t) \equiv I(x(t), y(t), t) \qquad (4) $$

The following equation expresses that the image brightness does not change significantly over time:

$$ I(x(t), y(t), t) = I(x(t+dt), y(t+dt), t+dt) \qquad (5) $$

This means that the intensity of the tracked pixel does not change over time:

$$ \frac{\partial f(x, y)}{\partial t} = 0 \qquad (6) $$

Using this assumption, the optical flow condition can be written in terms of u, the velocity in the x direction, and v, the velocity in the y direction:

$$ \frac{\partial I}{\partial x}\, u + \frac{\partial I}{\partial y}\, v + \frac{\partial I}{\partial t} = 0 \qquad (7) $$
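A minimal sketch of this optical-flow step using OpenCV's pyramidal Lucas-Kanade implementation is shown below. Here `prev_gray`, `gray`, and `prev_pts` (float32 points of shape (N, 1, 2)) are assumed inputs, and the window size and termination criteria are illustrative defaults rather than the paper's settings.

```python
# Sketch of sparse Lucas-Kanade tracking between two consecutive frames.
import cv2
import numpy as np

lk_params = dict(winSize=(15, 15), maxLevel=2,
                 criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT,
                           30, 0.01))

def track(prev_gray, gray, prev_pts):
    """Propagate feature points to the next frame under the three LK assumptions."""
    next_pts, status, err = cv2.calcOpticalFlowPyrLK(
        prev_gray, gray, prev_pts.astype(np.float32), None, **lk_params)
    good = status.ravel() == 1          # keep only points found again
    return next_pts[good], prev_pts[good]
```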
The program is assumed to run in real time, so we cannot choose a complicated method that would be too heavy for the processor. One of the most commonly used edge and corner detectors was introduced by Harris and Stephens [13]. The second-order Hessian matrix is defined from the image intensity at the point p(x, y):

$$ H(p) = \begin{bmatrix} \dfrac{\partial^2 I}{\partial x^2} & \dfrac{\partial^2 I}{\partial x \partial y} \\ \dfrac{\partial^2 I}{\partial x \partial y} & \dfrac{\partial^2 I}{\partial y^2} \end{bmatrix} \qquad (8) $$

The autocorrelation matrix M(x, y) is determined by the following equation, where $I_x$ and $I_y$ are the image derivatives and the sums run over a window with $-K \le i, j \le K$:

$$ M(x, y) = \begin{bmatrix} \sum\limits_{-K \le i,j \le K} I_x^2(x+i, y+j) & \sum\limits_{-K \le i,j \le K} I_x(x+i, y+j)\, I_y(x+i, y+j) \\ \sum\limits_{-K \le i,j \le K} I_x(x+i, y+j)\, I_y(x+i, y+j) & \sum\limits_{-K \le i,j \le K} I_y^2(x+i, y+j) \end{bmatrix} \qquad (9) $$
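The sketch below shows one way to select such corners with OpenCV, using the Harris response derived from the matrix M in Eq. (9); the quality and distance thresholds are illustrative assumptions.

```python
# Sketch of Harris-based corner selection used to seed the feature tracking.
import cv2

def detect_corners(gray, max_corners=100):
    """Return up to max_corners Harris corners as an (N, 1, 2) float32 array."""
    return cv2.goodFeaturesToTrack(gray, maxCorners=max_corners,
                                   qualityLevel=0.01, minDistance=7,
                                   blockSize=7, useHarrisDetector=True, k=0.04)
```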
The feature vector obtained during head model creation is used for tracking and for determining the 3-D head position. To improve the accuracy of the initial parameters, the RANSAC (Random Sample Consensus) algorithm [14] is used, which makes the results reliable and resistant to noise. The head position is described by six degrees of freedom: three rotation angles and three translation values (x, y, z). Head rotation can be characterized by three Euler angles: around the z-axis (roll, θ), then around the y-axis (yaw, β), and finally around the x-axis (pitch, α). The rotation matrix R is determined from these three Euler angles:

$$ R_z(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (15) $$

$$ R_y(\beta) = \begin{bmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{bmatrix} \qquad (16) $$

$$ R_x(\alpha) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{bmatrix} \qquad (17) $$
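A small NumPy sketch of Eqs. (15)-(17) and their composition into the full rotation R is given below; the angle values are arbitrary examples.

```python
# NumPy sketch of the Euler rotation matrices and their composition.
import numpy as np

def Rz(theta):   # roll, about the z-axis (Eq. 15)
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def Ry(beta):    # yaw, about the y-axis (Eq. 16)
    c, s = np.cos(beta), np.sin(beta)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def Rx(alpha):   # pitch, about the x-axis (Eq. 17)
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

R = Rz(0.1) @ Ry(0.2) @ Rx(0.05)   # full head rotation matrix
```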
The algorithm that determines the 3-D position of the object is based on a simplified camera model, the pinhole ("camera obscura") model. The idea is to approximate the model parameters by estimating the projection that best fits the object features to the locations of those features in the image. Using the simplified camera model, and assuming no distortion caused by lens imperfections, the projection of a point $\tilde{a}$ of the 3-D model onto an image-plane point $\tilde{b}$ can be described as follows:

$$ \tilde{b} = T\tilde{a}, \qquad u_x = f\,\frac{b_x}{b_z}, \qquad u_y = f\,\frac{b_y}{b_z} \qquad (18) $$

where T is a transformation matrix in a homogeneous coordinate system. The matrix T is a compound of the following geometric operations: rotation around the coordinate axes by the angles θ, β, and finally α, followed by translation by a vector M:

$$ T = M(x, y, z)\, R_z(\theta)\, R_y(\beta)\, R_x(\alpha) \qquad (19) $$

The factor f represents the focal length of the lens. After applying these simplifications, the current head position is described by six variables:

$$ \tilde{P} = \{x, y, z, \alpha, \beta, \theta\} \qquad (20) $$

In general, the projection of the 3-D object points onto a 2-D plane is a non-linear operation; the algorithm assumes only small changes between the known position fixed in the previous frame and the current one.
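A minimal sketch of the projection in Eq. (18) is shown below, assuming a model point `a`, rotation `R`, translation `M`, and focal length `f` are already known; lens distortion is ignored, as in the text.

```python
# Sketch of the pinhole projection of one 3-D model point onto the image plane.
import numpy as np

def project(a, R, M, f):
    """Move model point a into camera coordinates, then project with focal length f."""
    b = R @ np.asarray(a, dtype=float) + np.asarray(M, dtype=float)
    return f * b[0] / b[2], f * b[1] / b[2]
```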
POSIT Algorithm. POSIT (Pose from Orthography and Scaling with Iteration) [10] serves to estimate the three-dimensional position of a known object. It was presented in 1992 as a method for determining the pose (translation vector T and orientation matrix R) of a 3-D object of known dimensions. The initial pose estimate, and its iterative refinement, assume that the designated image points lie at approximately the same distance from the camera and that the change of apparent object size caused by a change of distance to the camera is negligibly small. The assumption that the points lie at the same distance means the object is far enough from the camera that depth differences can be omitted. Using the pose calculated in the previous iteration, the 3-D object points are re-projected and the estimate is updated [9].
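Classic POSIT lived in OpenCV's legacy C API, so the hedged sketch below uses cv2.solvePnPRansac as a modern stand-in that also covers the RANSAC filtering mentioned earlier; `model_pts`, `image_pts`, and the approximate pinhole intrinsics are assumptions, not the paper's implementation.

```python
# Sketch of 3-D head pose estimation from 2-D/3-D correspondences.
import cv2
import numpy as np

def estimate_head_pose(model_pts, image_pts, frame_size, f):
    """Estimate rotation and translation of the head model from tracked points."""
    w, h = frame_size
    K = np.array([[f, 0, w / 2.0],
                  [0, f, h / 2.0],
                  [0, 0, 1.0]])                      # simple pinhole intrinsics
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(model_pts, dtype=np.float32),     # 3-D head model points
        np.asarray(image_pts, dtype=np.float32),     # tracked 2-D features
        K, None)                                     # None: ignore distortion
    R, _ = cv2.Rodrigues(rvec)                       # rotation matrix from rvec
    return ok, R, tvec, inliers
```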
Eye Center Detection. The direction in which a person is looking can be established by examining the movement of the eye's pupil relative to the corners of the eye; the current movement and rotation of the head must also be taken into account. The best approach for tracking changes of the eyeball angle is to examine the relative movement between the pupil and the eye corners. The algorithms used to follow changes of eye position can be split into two main groups: those based on features and those based on the current position of the eye. Depending on the method used, appropriate criteria must be established that specify when the sought feature occurs; the values of these criteria are most often provided as system parameters that have to be set manually [9]. To establish the direction in which the eyes are pointed, it is necessary to precisely locate the center of the pupil. The image moment of order (p, q) used to find the contour's center is defined as [9]:

$$ m_{p,q} = \sum_{i=0}^{n} I(x, y)\, x^p y^q \qquad (21) $$
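A minimal sketch of locating the pupil center from image moments (Eq. 21) is given below, assuming a grayscale eye region `eye_gray`. The fixed threshold and the contour-based segmentation are illustrative assumptions; in practice such values are the manually set system parameters mentioned above.

```python
# Sketch: segment the dark pupil, take the largest contour, and compute its
# centroid from the moments m00, m10, m01 (OpenCV 4.x return signature).
import cv2

def pupil_center(eye_gray):
    _, mask = cv2.threshold(eye_gray, 40, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    pupil = max(contours, key=cv2.contourArea)
    m = cv2.moments(pupil)
    if m["m00"] == 0:
        return None
    return int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"])
```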
Calibration. To determine the direction of the user's gaze, a linear homographic mapping of the gaze vector is used. The gaze vector is the difference between the position of the pupil's center and the projection of the eyeball center onto the camera plane. The homography is established from a set of correspondences between the gaze vector and the target displayed on the monitor. The homography matrix has 8 degrees of freedom, so at least 4 such pairs must be known; to increase precision, 9 points are registered during the calibration process. The homography is estimated with a least-squares method that minimizes the difference between the mapped gaze vector and the corresponding point on the monitor [9].
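The sketch below illustrates this calibration step, assuming the nine gaze vectors and their corresponding screen points have already been recorded; cv2.findHomography with its default least-squares method stands in for the fitting procedure described above.

```python
# Sketch: fit a homography from recorded gaze vectors to known screen points,
# then map a new gaze vector to screen coordinates.
import cv2
import numpy as np

def fit_gaze_mapping(gaze_vectors, screen_points):
    """gaze_vectors, screen_points: (9, 2) arrays collected during calibration."""
    H, _ = cv2.findHomography(np.asarray(gaze_vectors, dtype=np.float32),
                              np.asarray(screen_points, dtype=np.float32), 0)
    return H

def gaze_to_screen(H, gaze_vector):
    v = np.array([gaze_vector[0], gaze_vector[1], 1.0])   # homogeneous gaze vector
    p = H @ v
    return p[0] / p[2], p[1] / p[2]                        # screen coordinates
```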
3 Results
The Child Affective Facial Expression (CAFE) dataset is used to evaluate and measure the performance of the proposed facial expression recognition algorithm; the proposed system achieved a correctness score of 93% [1]. The tests for eye gaze estimation were conducted on a group of 5 people. The gaze direction accuracy test consists of measuring the difference between a line direction on the screen and the gaze direction of the examined person as determined by the program, as shown in Fig. 2. To this end, the examined person was asked to visually follow a point moving between predetermined locations [9].

Fig. 2. Samples of the results.
4 Conclusion
This paper presented a real-time facial expression recognition and eye gaze estimation system. EREGE aims to efficiently track the facial triangulation points in order to detect seven emotions: neutral, surprise, anger, happiness, sadness, disgust, and fear. Simultaneously, EREGE tracks the eye movements using only a PC webcam. The system's initialization is fully automated, as is the process of choosing the needed parameters. Tests were carried out on a 14-inch screen; a built-in webcam was used to record the image at a resolution of 1920 × 1080 pixels with a frame rate of 15 frames per second.
References
1. Anwar, S., Milanova, M.: Real time face expression recognition of children with autism. Int. Acad. Eng. Med. Res. 1(1), 1–7 (2016)
2. Hasani, B., Mahoor, M.: Spatio-temporal facial expression recognition using convolutional neural networks. arXiv preprint arXiv:1703.06995 (2017)
3. Ofodile, I., Kulkarni, K., Corneanu, C.A., Escalera, S., Baró, X., Hyniewska, S., Allik, J., Anbarjafari, G.: Automatic recognition of deceptive facial expressions of emotion. IEEE Trans. Affect. Comput. (2017)
4. Piumsomboon, T., Lee, G., Lindeman, R.W., Billinghurst, M.: Exploring natural eye-gaze-based interaction for immersive virtual reality. In: IEEE Symposium on 3D User Interfaces, pp. 36–39 (2017)
5. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active shape models - their training and application. Comput. Vis. Image Underst. 61, 38–59 (1995)
6. Viola, P., Jones, M.: Robust real-time face detection. Int. J. Comput. Vis. 52(2), 137–154
(2004)
7. Wei, Y.: Research on facial expression recognition and synthesis. Department of Computer
Science and Technology, Nanjing University (2009). http://code.google.com/p/asmlibrary/
8. Ekman, P., Friesen, W.: Facial Action Coding System: A Technique for the Measurement of
Facial Movement. Consulting Psychologists Press, Palo Alto (1978)
9. Anwar, S., Milanova, M., Svetleff, Z., Abdulla, S.: Real time eye gaze estimation. In:
International Conference on Computational Science and Computational Intelligence, Las
Vegas, USA (2017)
10. DeMenthon, D.F., Davis, L.S.: Model-based object pose in 25 lines of code. In: Sandini, G.
(ed.) ECCV 1992. LNCS, vol. 588, pp. 335–343. Springer, Heidelberg (1992). https://doi.
org/10.1007/3-540-55426-2_38
11. OpenCV: Open Source Computer Vision Library. http://sourceforge.net/projects/opencvlibrary/
12. Bradski, G., Kaehler, A.: Learning OpenCV. O’Reilly, Sebastopol (2008)
13. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proceedings of the 4th
Alvey Vision Conference (1988)
14. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with
applications to image analysis and automated cartography. Comm. ACM 24(6), 381–395
(1981)