A Machine Learning Emotion Detection Platform to
Support Affective Well Being
Michael Healy, Department of Computer Science, Cork Institute of Technology, Cork City, Ireland, Michael.healy2@mycit.ie
Ryan Donovan, Department of Computer Science, Cork Institute of Technology, Cork City, Ireland, Brendan.donovan@mycit.ie
Paul Walsh, Department of Computer Science, Cork Institute of Technology, Cork City, Ireland, paul.walsh@cit.ie
Huiru Zheng, Department of Computer Science, Ulster University, Antrim, Northern Ireland, h.zheng@ulster.ac.uk
Abstract— This paper describes a new emotion detection system that operates on a real-time video feed. It demonstrates how a bespoke machine learning support vector machine (SVM) can be utilized to provide quick and reliable classification. The features used in the study are 68-point facial landmarks. In a lab setting, the application has been trained to detect six different emotions by monitoring changes in facial expressions. Its utility as a basis for evaluating the emotional state of people from video using machine learning is discussed.
Keywords: Affective Computing; Machine Learning; Emotion
Detection;
I. INTRODUCTION
Emotions are an integral part of experiencing the world.
Functioning emotions help us to perceive, think, and act
correctly. The crucial role of emotions in general well-being
becomes self-evident when they become dysfunctional.
Consider the fact that one of the main aims of psychotherapy is
to help people deal with difficult emotions [1]; that the
likelihood of experiencing psychopathology has been linked to
the tendency to experience extreme levels of emotions [2]; that
our ability to make seemingly innocuous and everyday choices,
such as what clothes to wear, becomes impaired if the areas
related to emotions in the brain are damaged [3]. This latter
example of dysfunction is of particular concern as people
increasingly live longer and as a result become more susceptible
to neurodegenerative diseases, such as dementia.
The world's population is aging. In 2017, 13% of the global population was aged 60 or over, and some estimates expect this percentage to double by 2050 [4]. As people get older, their likelihood of developing dementia sharply increases [5]. As
dementia becomes more prevalent, the necessity to deal with its
negative consequences becomes more pertinent. People with
dementia (PwD) tend to suffer from a variety of affective
problems, which can damage their cognition, relationships, and
general well-being [6]. Examples of these affective problems
are: difficulty in managing emotions, difficulty in articulating
and expressing emotions, and increased levels of agitation and
frustration. Furthermore, PwD are also at an increased risk to
suffer from debilitating affective conditions such as depression,
which could further damage their quality of life. In order to
understand how to effectively manage these problems,
researchers need to be able to accurately measure emotions [7].
To measure a phenomenon we first need to describe it [8].
Yet despite the fact that scholars from multiple perspectives as
far back as Plato have sought to explain emotions, nobody as yet
has provided an agreed-upon definition. Even folk conceptions
cannot be relied on, as people differentiate emotions based on
their raw conscious experience of those emotions [9]. This
method of subjective introspection is unsuited for objective
scientific categorization. As one prominent affective
neuroscientist wrote: “Unfortunately, one of the most significant
things ever said about emotions may be that everyone knows
what it is until they are asked to define it” [10]. Hence, the
essence of emotions still remains unclear.
This inability to define emotions has encouraged more
systematic research. The component viewpoint, for example, aims to identify the physical patterns that coincide with or underlie the experience of an emotion and what causes such responses [11]. These patterns range from neuronal activity in the central nervous system and facial expressions that accompany emotions to general changes in behavior (e.g., fist clenching and a rising tone of voice while angry). Under this viewpoint,
emotions are a set of sub-cortical goal-monitoring systems.
This fits neatly with the basic emotion approach, which
separates emotions based on their ability to produce consistent
yet particular forms of appraisals, action preparations,
physiological patterns, and subjective experiences [12]. These
emotions are considered basic in the sense they are deeply rooted
adaptations that helped our ancestors navigate their social
environments. Currently, there are six proposed universal basic
emotions: Joy, Fear, Disgust, Anger, Surprise, Sadness. There
also exists a ‘higher’ level of emotions that are more mediated
by socio-cognitive factors (e.g. shame). One of the main
characteristics that distinguish basic emotions from these latter
forms of emotions is the presence of universal signals, such as
facial expressions. Based on this view, facial expressions offer
researchers a way to measure at least a subset of key emotions.
For this view to be correct, at least three strands of evidence
are required. First, facial expressions signaling emotions are
universal. Second, facial expressions are a valid or an ‘honest’
signal to underlying emotions. Third, we can reliably decipher
emotional expressions. On the first point, although there is still
some debate on this issue, there is independent research
indicating that facial emotional expressions are consistent cross-
culturally [13]. On the second point, in their review of the facial
expression literature, Schmidt & Cohn [14] came to the
conclusion that facial expressions can be honest expressions of
emotions. The main piece of evidence for this is that faking
emotional expressions is too cognitively demanding to
repeatedly maintain. Similarly, those who do repeatedly fake
emotional expressions are more likely to be considered as
having duplicitous motives. Therefore, cognitive limitations and
social pressures encourage honest emotional signaling via facial
expressions. On the third point, which will be the remaining
focus of this article, advances in affective computing have given
researchers means to quickly and accurately decipher emotional
expressions.
Affective computing is an emerging field that attempts to
model technology to detect, predict, and display emotions with the goal of improving human-computer interactions [15], [16]. One
example of affective computing in action is the SenseCare
project, which aims to integrate multiple methods of emotion
detection, in order to provide objective insight into people’s
well-being [17], [18]. Another example is the SliceNet project
(https://5g-ppp.eu/slicenet/), which aims to detect patient
demeanour over 5G networks in order to monitor patients in
ambulances. In this paper, we present the development of the
SliceNet Emotion Viewer (SEV), a real-time video-based
emotion detection application. A Support Vector Machine (SVM) classifier is used to detect emotional expressions. The proposed and developed SEV can potentially be used to detect emotions in a variety of clinical settings, including ambulances.
II. STATE OF THE ART
There have been a number of different approaches to detecting emotion from facial expressions. A team at Jadavpur
University [19] proposed a way of detecting emotion by
monitoring regions of the face such as the mouth and eyes and
applying fuzzy logic. Using this technique the authors achieved
an accuracy of 90%. Research conducted by Philipp Michel and Rana El Kaliouby adopted an SVM approach to classifying emotions [20]. The
authors measured displacement of particular facial regions
between a neutral and peak video frame. This displacement was
used as an input parameter to a machine learning algorithm and
an accuracy of 86% for still images and 71.8% for video streams
was achieved. Other researchers focused on using the Particle
Swarm Optimization (PSO) algorithm for detecting action
muscle movements, known as action units, on the human face
[21] which obtained an average success rate of 85%.
Much of the existing literature focuses on different feature
extractions and methods of machine learning to achieve a high
accuracy. There has been little or no development of software
tools that utilise this research to assist people in the areas of
health, well-being, and emotion functioning. Our aim is to
develop a system that is capable of analysing emotional data in
real-time from either a live stream from a camera or a pre-
recorded source such as YouTube. Given the evolving and
dynamic nature of machine learning, the system shall be
designed in such a way that the models used could easily be
replaced with new or updated models, without needing a code
change to the system. Finally, we aimed to overcome the
problem of the lack of data available to produce machine
learning models, by leveraging existing libraries that can take
data from multiple different sources and combine them to
produce the initial model used in the system. Thus, the SEV was developed to analyse a live or pre-recorded video stream while outputting emotion detection information in real-time.
III. METHODOLOGY
The SEV was developed as a web service. Video data is streamed to the SEV using a standard PC or laptop webcam. The results of the emotion detection are fed back in real-time in an HTML webpage. Figure 1 illustrates a high-level flow of events for the system.
Figure 1 High-Level Flow of Events in SEV
The various methods and implementations that were used to create the system are detailed below.
1. Machine learning: Machine learning is a branch of
artificial intelligence that gives computers the ability to learn
without being explicitly programmed. It was summarised by
Tom M. Mitchell in his quote “A computer program is said to
learn from experience E with respect to some class of tasks T
and performance measure P if its performance at tasks in T, as
measured by P, improves with experience E” [22]. Machine
learning algorithms are used in a wide range of applications and
problems. Some modern uses of machine learning include spam
filtering, speech recognition, facial recognition, document
classification, and natural language processing. There are many categories of machine learning applications; the most commonly used are clustering, regression, and classification. A
classification problem is a form of supervised learning which
aims to assign predetermined labels to unseen data based on
previous training examples. In this research, we apply a
classification model to detect an emotion from facial
expressions.
To assist with the complex mathematics of creating a
classification model we used a Support Vector Machine (SVM).
In supervised learning, an SVM generates a model by finding the optimal hyperplane, that is, the hyperplane that separates the classes in the training examples with the largest margin. New, unseen examples are mapped into the same space and their class is predicted based on which side of the margin they fall. The SVM implementation used is LIBSVM [23], an integrated software library for support vector classification, regression, and distribution estimation. It also supports multi-class classification, which enables the algorithm to compare the given data to multiple classes and is useful when classifying multiple emotional states. LIBSVM was originally written in C but now has support for a wide range of programming languages such as Java and Python. Details of the parameters used can be found in [24].
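As a concrete illustration of this step, the following is a minimal sketch (not the authors' code) of training and querying a multi-class emotion classifier with the LIBSVM Python bindings; the import path assumes the PyPI libsvm package, and the feature vectors, labels, and parameter values shown are placeholders rather than the values actually used, which are reported in [24].

```python
# Minimal sketch, not the authors' code: training and querying a
# multi-class emotion classifier with the LIBSVM Python bindings
# (assumes the PyPI "libsvm" package providing libsvm.svmutil).
from libsvm.svmutil import svm_train, svm_predict

# Toy placeholder data: each row is a feature vector for one training
# image, and each label is an integer emotion class (e.g. 0=joy,
# 1=fear, ..., 5=sadness).
X = [[0.12, 0.34, 0.56],
     [0.11, 0.30, 0.60],
     [0.50, 0.20, 0.10]]
y = [0, 0, 1]

# RBF kernel (-t 2) with illustrative cost (-c) and gamma (-g) values;
# LIBSVM handles multi-class problems internally via one-vs-one voting.
model = svm_train(y, X, '-t 2 -c 64 -g 0.5 -q')

# Predict the emotion class of an unseen feature vector.
labels, accuracy, values = svm_predict([0], [[0.13, 0.33, 0.55]], model, '-q')
print(labels)
```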
2. Datasets Used: In order to create machine learning
models for the Emotion Viewer application, training data was
taken from the Cohn-Kanade database (CK+) [25] and
Multimedia Understanding Group (MUG) [26] database. Both
databases contain images of people in lab environments
displaying Ekman’s six basic emotions [12].
Table 1 Datasets Under Study

Name | Subjects | Images | Emotions
CK+  | 123      | 593    | 6
MUG  | 46       | 1496   | 6
3. Feature Extraction and Selection: There are many
different features which can be extracted from a human face to
be used as inputs in machine learning algorithms. In order to
analyse in real-time, it is necessary to leverage existing methods
for extracting features of a human face. One of these methods is
to use facial landmarks. Facial landmarks are defined as the
detection and localisation of certain key points on a human face.
They are also known as vertices or anchor points. There are
various different annotation styles available for detecting
landmark points. The SEV uses the 68-point style facial
landmarks created by Multi-PIE [27], as illustrated in Figure 2.
Figure 2 Multi-PIE 68-point mark-up
The Multi-PIE mark-up is applied to both the training data
and unseen data. It also allows the use of multiple training
databases which do not use compatible mark-ups in their own
annotations of the images. Using these facial landmarks, the classification features were chosen to be the Euclidean distances from each point to every other point, forming a mesh-like structure in which each straight-line distance represents a parameter. The Euclidean distance is calculated using the straight-line formula (eq. 1).
$D_{12} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$  (1)

where $(x_1, y_1)$ are the coordinates of the first landmark, $(x_2, y_2)$ are the coordinates of the second landmark, and $D_{12}$ is the straight-line distance between them.
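A short sketch of this feature construction follows (illustrative only, not the authors' code, and assuming Python 3.8+ for math.dist); it converts the 68 landmark coordinates into the full set of pairwise distances, which for 68 points yields 2278 features.

```python
# Minimal sketch of the feature construction described above:
# converting 68 (x, y) facial landmark coordinates into the
# 68 * 67 / 2 = 2278 pairwise Euclidean distances fed to the SVM.
from itertools import combinations
import math

def landmark_distances(points):
    """points: sequence of 68 (x, y) tuples -> list of 2278 distances."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]

# Example with dummy landmark coordinates.
dummy_landmarks = [(float(i), float((i * 7) % 68)) for i in range(68)]
features = landmark_distances(dummy_landmarks)
assert len(features) == 2278
```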
Figure 3 Sample of parameter inputs to the SVM classification model
IV. RESULTS & EXPERIMENTATION
This section covers the training of the machine learning model, the graphical interface of the application, its run-time operations, and examples of some ongoing experiments currently using this software.
Using the distances between all 68 landmark points results in 2278 features to be input into the SVM. This is intensive for a CPU to compute and could result in slow performance when classifying multiple images in a real-time scenario. To
maximise the accuracy and efficiency of the model, we utilised
techniques outlined by Yi-Wei Chen and Chih-Jen Lin [28].
Following the steps outlined in the referenced paper, we performed an F-score analysis to calculate the discriminatory power of each feature and ranked all features by their discriminatory power values. Next, the 17 highest-scoring features were used in a grid search to find the hyperparameters C and γ (gamma) used in the SVM algorithm. The output of the grid search is the optimal accuracy that could be achieved using these features as training data. This step was repeated 8 times, each time doubling the number of features used (taken in descending order of score).
Figure 4 Accuracy versus Number of features
Figure 4 above shows that testing accuracy increases with the number of features, rising from 64.08% with 17 features to 89.84% with 569 features, before reaching a plateau and even decreasing slightly (89.34% with 1139 features and 89.54% with all 2278).
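The following is a hedged sketch of this feature-selection loop (not the authors' code); it substitutes scikit-learn's univariate ANOVA F-value for the F-score criterion of Chen & Lin [28] and assumes X is a NumPy array of the 2278 distance features and y the corresponding emotion labels.

```python
# Illustrative sketch only: ranking features by a univariate F-score and
# re-evaluating an RBF SVM on progressively larger subsets of the
# top-ranked features (17, 35, 71, ...). scikit-learn's ANOVA F-value is
# used here as a stand-in for the F-score criterion of Chen & Lin [28].
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def subset_accuracies(X, y, sizes=(17, 35, 71, 142, 284, 569, 1139, 2278)):
    scores, _ = f_classif(X, y)              # discriminatory power per feature
    ranked = np.argsort(scores)[::-1]        # highest-scoring features first
    results = {}
    for k in sizes:
        X_k = X[:, ranked[:k]]               # keep the k best features
        acc = cross_val_score(SVC(kernel='rbf'), X_k, y, cv=10).mean()
        results[k] = acc
    return results
```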
Figure 5 Grid Search using 569 Features
Figure 5 above shows the output from running the grid search using the 569 highest-scoring features. As seen in the graph, the optimal hyperparameters are C = 64 and γ = 0.5, which gives 88.76% accuracy under 10-fold cross-validation. Figure 6 illustrates the confusion matrix of the trained model. All the classes perform well apart from the "Anger" class, largely due to the small number of "Anger" samples used in testing compared to the other classes.
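A minimal sketch of such a grid search, using scikit-learn rather than the authors' own tooling, is shown below; X_selected and y are placeholders for the reduced feature matrix and the emotion labels, and the grid values are illustrative.

```python
# Sketch of a coarse grid search over C and gamma, scored by 10-fold
# cross-validation on the selected features. The grid values are
# illustrative; the reported optimum was C=64, gamma=0.5.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    'C':     2.0 ** np.arange(-5, 15, 2),    # 2^-5, 2^-3, ..., 2^13
    'gamma': 2.0 ** np.arange(-15, 3, 2),    # 2^-15, 2^-13, ..., 2^1
}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=10)
# search.fit(X_selected, y)   # X_selected: the 569 top-ranked features
# print(search.best_params_, search.best_score_)
```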
Figure 6 Confusion Matrix
1. Graphical User Interface: As the project is a web service, the graphical user interface is rendered in HTML. When the application is running, the user can enter the application's IP address and port number into a browser and will be brought to the home screen, which can be seen in Figure 7.
Figure 7 Emotion Viewer Homepage
The user is prompted, via browser notification, to allow the use of their webcam. If approved, video from the user's webcam begins to stream to the server. There are two switch options available on the GUI: "Track Face" and "Track Emotion". The "Track Face" switch enables the face detection feature, which draws a rectangle around each detected face in the picture. The "Track Emotion" switch enables the feature to receive emotional analysis for each frame. The
“Track Face” option must be enabled before this feature can be
used.
The next item on the GUI is the "Voting Count". This was implemented as a way to control the transitions between emotions. For example, while a person is talking, the emotion detected will change repeatedly as their facial expression changes. To overcome this, a particular emotion must reach a user-defined number of consecutive votes before the analysis text changes to that emotion.
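A minimal sketch of how such consecutive-vote smoothing could be implemented is shown below; it is an assumption about the mechanism described above, not the application's actual code.

```python
# Minimal sketch (assumption, not the authors' implementation) of the
# "Voting Count" smoothing: the displayed emotion only changes after the
# same label has been predicted for a user-defined number of
# consecutive frames.
class EmotionVoter:
    def __init__(self, votes_required=5):
        self.votes_required = votes_required
        self.current = None       # emotion currently displayed
        self.candidate = None     # emotion accumulating consecutive votes
        self.count = 0

    def update(self, predicted):
        """Feed one per-frame prediction; return the emotion to display."""
        if predicted == self.candidate:
            self.count += 1
        else:
            self.candidate, self.count = predicted, 1
        if self.count >= self.votes_required:
            self.current = self.candidate
        return self.current

# Example: the display switches to "anger" only on the third consecutive vote.
voter = EmotionVoter(votes_required=3)
for label in ["joy", "anger", "anger", "anger"]:
    shown = voter.update(label)
print(shown)  # -> "anger"
```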
2. Run-Time Operations: For reference purposes, the operating system used is Windows 10 (64-bit) and the hardware used is as follows:
• Intel Core i7-8550U (Laptop)
• 8GB DDR4 RAM
• Nvidia GeForce MX150 (Mobile)
Once the webpage is loaded, a JavaScript file is executed which makes a WebSocket connection to the SEV server. Immediately after the connection is established, the client begins streaming the image data to the server. The rate at which the images are streamed can be adjusted.
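The paper does not specify the server-side framework, so the following is only an illustrative sketch of this exchange using Python's asyncio with the third-party websockets package; analyse_frame is a placeholder stub for the real detection and classification pipeline described below.

```python
# Illustrative only: the server framework is not specified in the paper,
# so this sketch just shows the shape of the exchange with the
# third-party "websockets" package: the browser streams encoded frames,
# and the server replies with an analysis result for each one.
import asyncio
import json
import websockets

def analyse_frame(frame_bytes):
    # Placeholder for the real pipeline (face detection, landmarks, SVM).
    return json.dumps({"emotion": "unknown", "faces": []})

async def handle_client(websocket, path=None):
    async for frame_bytes in websocket:          # one message per encoded frame
        await websocket.send(analyse_frame(frame_bytes))

async def main():
    async with websockets.serve(handle_client, "0.0.0.0", 8000):
        await asyncio.Future()                   # serve until cancelled

# asyncio.run(main())
```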
When the server receives the image data, it begins an analysis. Before the analysis can be conducted, all faces in the image must be found. This is achieved using the Dlib library [29], which contains a convolutional neural network trained for face detection. Dlib is highly portable and contains very few dependencies, making it an ideal choice for the project. It also has support for Nvidia's CUDA library. CUDA is a parallel computing platform that is used for general computing on GPUs. Using CUDA requires a compatible GPU; however, it drastically speeds up the performance of the Emotion Viewer application. Both a CUDA-enabled and a standard version of the application have been developed. The second function of Dlib is to extract the 68 landmark points from the detected face.
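As an illustrative sketch of this step (not the authors' exact code), the snippet below detects faces and extracts the 68 landmark points with Dlib; the HOG-based frontal face detector and the standard 68-point shape predictor model file are assumptions, as noted in the comments.

```python
# Sketch of the server-side landmark extraction step with Dlib [29].
# It assumes the publicly available "shape_predictor_68_face_landmarks.dat"
# model file; the paper's face detector is a CNN, for which
# dlib.cnn_face_detection_model_v1 (with its own model file) could be
# substituted for the HOG-based detector used here.
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_landmarks(image):
    """image: RGB numpy array -> list of (face_rect, 68 (x, y) points)."""
    faces = detector(image, 1)                   # upsample once for small faces
    results = []
    for face in faces:
        shape = predictor(image, face)
        points = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
        results.append((face, points))
    return results
```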
This information is then sent to the internal algorithms, where it is preprocessed. Then, with the use of LIBSVM and the pretrained model, a classification is made. The classification result is returned to the client along with the image containing the rectangle drawn around the detected face, and the client's browser displays this feed on the HTML page.
3. Experiments: The review of current literature outlined some issues with using training data that was generated in a lab setting. Given that the emotional expressions were posed and not spontaneous, this casts doubt on their external validity. Therefore, we decided to test the application by analysing speeches taken from YouTube. During a speech, a speaker may exaggerate expressions or emotions to engage and captivate the attention of the attendees or listeners; it can also help to drive home the point they are trying to make. Testing on YouTube videos also allowed us to test the speed at which the system can classify frames from the video.
The first examples, in Figures 8a and 8b, are taken from a speech by Donald Trump during a period when certain news outlets had made accusations of wrongdoing during his campaign. Throughout the speech Trump is visibly distressed, which the Emotion Viewer detects. During the clip the detector outputs an anger/disgust classification, which aligned closely with the overall narrative of the speech. The full analysis is available on YouTube at the following link:
https://www.youtube.com/watch?v=NaCe8bchs9I&index=2&list=PLwagddoyFHYZOCeOVoTnM2UFYKhyMuwEJ&t=0s
Figure 8a Screenshot from Trump analysis
Figure 8b Screenshot from Trump analysis
The next analysis was of a speech in a more sombre setting. The video used was taken from former US president Barack Obama discussing gun control and referencing school shootings, an issue he felt passionately about. The SEV detected two dominant emotions during this video, namely sadness and anger. This is consistent with Obama visibly shedding tears and stating in the speech that "every time I think of those kids it gets me mad" [0:54-0:58]. Some extracts from this analysis can be seen in Figures 9a and 9b. The full analysis video is available on YouTube:
https://www.youtube.com/watch?v=q1VyU02wgzs&list=PLwagddoyFHYZOCeOVoTnM2UFYKhyMuwEJ&index=2
Figure 9a Screenshot from Obama analysis
Figure 9b Screenshot from Obama analysis
The last analysis featured a montage of people smiling and laughing in a variety of different settings. Although most of the expressions seem exaggerated, there is not much of a visual difference between a genuine expression of happiness and an exaggerated one. A screenshot can be seen in Figure 10. The full analysis video is also available on YouTube at the following link:
https://www.youtube.com/watch?v=pvjK5LVvz2A&index=4&list=PLwagddoyFHYZOCeOVoTnM2UFYKhyMuwEJ
Figure 10 Screenshot from "Happy" video analysis
V. CONCLUSION
Facial expressions are a gateway to detecting emotions. The
ability to accurately make face-to-state classifications opens the
potential for researchers to investigate emotions in new settings.
In particular, this paper discussed the SEV platform, which uses support vector machine classification to analyse emotions in real-time video. The results suggest that the prototype has external validity, as the emotions detected were consistent with the emotions presented by the speakers. Using the laptop described in the Run-Time Operations subsection of Section IV, the application could classify frames at a speed of 8 frames per second. This could be improved by deploying the application to more powerful hardware, and we hope to achieve classification on 30 fps video in the future through the use of mobile edge computing (MEC). The next step of this project will be to test and evaluate the system in real-time applications in a mobile ambulance use case in the SliceNet project. However, given the
accuracy found in the results, the initial signs suggest that
affective computing research is close to providing a powerful
new tool to quickly and objectively determine fundamental
aspects of human well-being.
VI. ACKNOWLEDGEMENT
The authors MH and PW are supported by the SliceNet project (Grant Number: 761913); HZ and RD are supported by the SenseCare project (Grant Number: 690862), funded by the European Commission Horizon 2020 Programme.
VII. REFERENCES
[1] L. F. Campbell, J. C. Norcross, M. J. Vasquez, and N. J. Kaslow, "Recognition of psychotherapy effectiveness: The APA resolution," Psychotherapy, vol. 50, no. 1, p. 98, 2013.
[2] K. L. Davis and J. Panksepp, The Emotional Foundations of Personality: A Neurobiological and Evolutionary Approach, WW Norton & Company, 2018.
[3] J. S. Lerner, Y. Li, P. Valdesolo, and K. S. Kassam, "Emotion and decision making," Annual Review of Psychology, vol. 66, 2015.
[4] United Nations, “Ageing,” 2017. [Online]. Available:
http://www.un.org/en/sections/issues-depth/ageing/. [Accessed 29
August 2018].
[5] A. F. Jorm and D. Jolley, “The incidence of dementia: A meta-
analysis,” Neurology, vol. 51, no. 1, pp. 728-733, 1998.
[6] A. Burns and S. Iliffe , “Dementia,” BMJ (Clinical Research), vol. 338,
p. B75, 2009.
[7] M. Mulvenna, H. Zheng, R. Bond, P. McAlliser, H. Wang and R.
Riestra, “Participatory design-based requirements elicitation involving
people living with dementia towards a home-based platform to monitor
emotional wellbeing,” in 2017 IEEE International Conference on
Bioinformatics and Biomedicine (BIBM), Kansas City, 2017.
[8] R. B. Cattell, “The description of personality: Principles and findings in
a factor analysis,” The American Journal of Psychology, vol. 58, no. 1,
p. 69–90, 1945.
[9] J. LeDoux, “Rethinking the emotional brain,” Neuron, vol. 73, no. 4, p.
653–676, 2012.
[10] J. LeDoux, The emotional brain: The mysterious underpinnings of
emotional life, Simon and Schuster, 1998.
[11] K. R. Scherer, “What are emotions? And how can they be measured?,”
Social science information, vol. 44, no. 4, p. 695–729, 2005.
[12] P. Ekman, “Basic emotions,” Handbook of cognition and emotion, p.
45–60, 1999.
[13] D. Matsumoto, M. G. Frank, and H. S. Hwang, Nonverbal Communication: Science and Applications, Sage, 2013.
[14] R. W. Picard, Affective computing, 1995.
[15] R. W. Picard, “Affective Computing for HCI,” presented at the HCI
(1), p. 829–833, 1999.
[16] M. Healy, A. Keary, and P. Walsh, "Prototype proof of concept for a mobile agitation tracking system for use in elderly and dementia care use cases," in CERC, Cork, 2016.
[17] R. R. Bond, H. Zheng, H. Wang, M. D. Mulvenna, P. McAllister, K.
Delaney , P. Walsh, A. Keary, R. Riestra and S. Guaylupo, “SenseCare:
using affective computing to manage and care for the emotional
wellbeing of older people,” in eHealth 360°, vol. 181, K. Giokas, B.
Laszlo and F. Hopfgartner, Eds., Springer, 2017, pp. 352-356.
[18] M. Healy and P. Walsh, "Detecting demeanor for healthcare with machine learning," in 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Kansas City, 2017.
[19] A. Chakraborty, A. Konar, U. K. Chakraborty, and A. Chatterjee, "Emotion Recognition From Facial Expressions and Its Control Using Fuzzy Logic," IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, vol. 39, no. 4, pp. 726-743, 2009.
[20] P. Michel and R. El Kaliouby, "Real time facial expression recognition in video using support vector machines," in Proceedings of the 5th International Conference on Multimodal Interfaces, 2003.
[21] B. M. Ghandi, R. Nagarajan, and H. Desa, "Real-Time System for Facial Emotion Detection," in 2010 IEEE Symposium on Industrial Electronics and Applications, Penang, 2010.
[22] T. M. Mitchell, Machine Learning, McGraw-Hill, 1997.
[23] C.-C. Chang and C.-J. Lin, "LIBSVM -- A Library for Support Vector Machines," 23 July 2018. [Online]. Available: https://www.csie.ntu.edu.tw/~cjlin/libsvm/.
[24] M. Healy, A. Keary, and P. Walsh, "Performing real-time emotion classification using an Intel RealSense camera, multiple facial expression databases and a Support Vector Machine," in CERC, Karlsruhe, 2017.
[25] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, "The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression," in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, San Francisco, CA, 2010.
[26] N. Aifanti, C. Papachristou, and A. Delopoulos, "The MUG Facial Expression Database," in Proc. 11th Int. Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), Desenzano, 2010.
[27] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, "Multi-PIE," Image and Vision Computing, vol. 28, no. 5, pp. 807-813, 2010.
[28] Y.-W. Chen and C.-J. Lin, “Combining SVMs with Various Feature
Selection Strategies,” in Feature Extraction. Studies in Fuzziness and
Soft Computing, Heidelberg, Springer, 2006, pp. 315-324.
[29] D. E. King, “Dlib-ml: A Machine Learning Toolkit,” Journal of
Machine Learning Research, vol. 10, pp. 1755-1758, 2009.