Explainable Emotion Recognition for Trustworthy
Human-Robot Interaction
Hongbo Zhu
Department of Computer Science
University of Manchester
Manchester, UK
hongbo.zhu@manchester.ac.uk
Chuang Yu
Department of Computer Science
University of Manchester
Manchester, UK
chuang.yu@manchester.ac.uk
Angelo Cangelosi
Department of Computer Science
University of Manchester
Manchester, UK
angelo.cangelosi@manchester.ac.uk
Abstract—With the improvement of computing power and the availability of large datasets, deep learning models can achieve excellent performance in facial expression recognition tasks. However, because these deep neural networks have highly complex nonlinear structures, it is difficult to understand the basis of a model's prediction; specifically, we do not know which facial features contribute to the classification. Developing affective computing models that give more explainable and transparent feedback to human interactors is essential for trustworthy human-robot interaction (HRI). In this paper, we explore explainable facial emotion recognition with Layer-Wise Relevance Propagation (LRP) in an HRI task. An experiment with the Pepper robot shows promising results: the heatmap generated by LRP explains the facial expression prediction of the CNN model at the pixel level, which facilitates more natural and reliable HRI.
Index Terms—XAI, Trustworthy, Emotion Recognition, HRI
I. INTRODUCTION
Expressing our emotions is an innate human ability, and emotions also influence our behavior and decision-making. Accurately understanding and correctly expressing emotions play an important role in communication and are crucial for successful interactions among humans. In the HRI context, we would like robots to have a similar ability to understand, express and explain emotions as humans do, so that HRI can be more natural and coordinated. To achieve this goal, a key requirement is that robots should be able to infer and explain human emotions.
Recently, a variety of machine learning models have been used for facial expression recognition, the most popular of which are deep neural networks. Compared with other machine learning models, deep learning achieves higher recognition accuracy. For example, Recurrent Neural Networks (RNNs) [1] and Convolutional Neural Networks (CNNs) [2] have achieved good performance on facial emotion prediction tasks. CNN models such as InceptionV3 [3] and VGG16 [4] are widely used to extract important facial features around regions of interest for emotion recognition. In this work, we chose VGG16 as our classifier for facial emotion recognition.
Although CNN models offer significant advantages in recognition accuracy, their lack of explainability and transparency hinders their use in interaction tasks. Recently, various explainable algorithms have been proposed for deep networks. The three best-known methods are Layer-Wise Relevance Propagation (LRP) [5], Local Interpretable Model-agnostic Explanations (LIME) [6] and SHapley Additive exPlanations (SHAP) [7]. What these algorithms have in common is that they attempt to identify the parts of the input that most influence the classification decision. In image classification tasks, these parts are usually referred to as key pixels.
In this paper, we propose an explainable emotion recognition solution for trustworthy HRI. The pipeline of our model is shown in Figure 1. First, the Pepper robot predicts the facial emotional state of the interactor during HRI. Then, the robot extracts an explainable representation with the help of the LRP model. Pepper verbalises the predicted emotion as linguistic feedback and shows the heatmap extracted from the LRP model as explainable visual feedback. In addition, the robot can give more detailed emotion recognition feedback to increase the transparency of the interaction. This explainable visual feedback can help the human interactor understand the internal process of the robot's emotion recognition model, which in turn facilitates human-robot trust.
Fig. 1. The pipeline of our proposed framework.
The rest of the paper is organized as follows: Section II introduces the related literature. Section III describes the methodology of explainable emotion recognition. The experimental results are presented in Section IV, and Section V discusses conclusions and future work.
II. RELATED WORK
Explainable AI is a popular research direction in machine learning. It attempts to explain the working mechanisms of machine learning models and to provide explanations of their decisions, thus providing confidence in and trustworthiness of learning models. This is the case, for example, for emotion recognition models. In general, facial emotion recognition consists of three steps: image pre-processing, feature extraction, and classification. As a pioneer, Paul Ekman [8] categorised facial emotions into seven classes. This is a common approach to discrete emotion recognition, which is also used in our work.
As for explainable methods for deep learning, there are in general three approaches in eXplainable Machine Learning (XML): attribution-based interpretability, perturbation-based interpretability and backpropagation-based interpretability. These three approaches correspond respectively to SHAP, LIME and LRP. In this paper, we use the last method, LRP, for the explanations.
LRP is one of the most prominent backpropagation-based methods. Its purpose is to explain a neural network's output in the domain of its input. The method does not interact with the training process of the network and can easily be applied to an already trained DNN model. LRP provides an intuitive, human-readable heatmap of the input image at the pixel level. It uses the network weights and the neural activations created by the forward pass to propagate the output back through the network, from the predicted output to the input layer. The heatmap [9] is used to visualize the contribution of each pixel to the prediction. The contribution of an intermediate neuron or of each pixel is quantified as a relevance value R, which represents how important the given pixel is for a particular prediction task.
III. METHODOLOGY
Our architecture consists of two main parts. The first is the facial emotion recognition module based on a pre-trained VGG16 model, and the second is an explainability module based on the LRP model.
A. Face Emotion Recognition Model
Face emotion recognition with VGG16 is used in this paper, following the concept of transfer learning, which is about "transferring" the weights learned in one network to another, similar task. This is particularly suitable for training problems lacking large volumes of labelled samples. VGG16 is a simple and widely used convolutional neural network model for image classification tasks, which achieves 92.7% test accuracy on ImageNet [10].
In this work, we reused the pre-trained image classification VGG16 model and fine-tuned it for our facial emotion recognition task by loading the pre-trained VGG16 model trained on the ImageNet database. The image is passed through a series of convolutional layers, followed by three fully-connected layers of different depths. As our task involves seven emotion classes, the final fully-connected layer was changed to seven dimensions, as shown in Figure 2. The whole pre-trained model is then further trained on the facial emotion dataset for facial emotion recognition. During this new training, the earlier layers of the pre-trained VGG16 are kept fixed.
Fig. 2. The architecture of VGG16.
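As an illustration of this transfer-learning setup, the following is a minimal sketch assuming a Keras/TensorFlow implementation; the sizes of the new fully-connected layers are assumptions rather than details reported in the paper.

import tensorflow as tf
from tensorflow.keras import layers, models

# Load VGG16 pre-trained on ImageNet without its original 1000-class classifier.
base = tf.keras.applications.VGG16(weights="imagenet",
                                   include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # keep the pre-trained convolutional layers fixed

# New classifier head: fully-connected layers ending in seven emotion classes.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),
    layers.Dense(4096, activation="relu"),
    layers.Dense(7, activation="softmax"),
])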
B. Explainable method for CNN
Layer-Wise Relevance Propagation (LRP) can explain the
relevance of inputs for a certain prediction, typically for image
processing. So we can see which part of the input images,
or more precisely which pixels, most contribute to a certain
prediction. LRP is a model-specific method as it is mainly
designed for neural networks. The method generally assumes
that the classifier can be decomposed into several layers of
computation.
Fig. 3. Visualisation of the pixel-wise decomposition process [5]
The workflow of LRP is shown in Figure 3. In the forward pass, the image goes through a convolutional neural network for feature extraction. The extracted features are then input to a classifier consisting of a fully connected network and a softmax layer, which gives the final prediction. Suppose we have two classes, e.g. one for a cat and one for no cat, and we are interested in why the model predicts the image as a cat. LRP goes in reverse order over the layers visited in the forward pass and calculates relevance scores for the neurons in each layer until it arrives back at the input, where the relevance of each pixel of the input image can be calculated. Positive relevance scores (red) indicate that the pixels were relevant for the prediction, while negative values (green) mean the pixels speak against it; this yields the heatmap result.
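The final heatmap rendering step can be illustrated with a small sketch, assuming the per-pixel relevance scores have already been computed by the LRP backward pass; the choice of colormap is an assumption, picked so that positive relevance appears red and negative relevance green.

import numpy as np
import matplotlib.pyplot as plt

def show_heatmap(relevance):
    # relevance: H x W array of per-pixel relevance scores from an LRP backward pass.
    bound = np.abs(relevance).max()
    # Symmetric limits so zero relevance sits at the colormap midpoint;
    # "RdYlGn_r" maps negative values to green and positive values to red.
    plt.imshow(relevance, cmap="RdYlGn_r", vmin=-bound, vmax=bound)
    plt.axis("off")
    plt.show()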
Fig. 4. The LRP propagation procedure; red arrows indicate the direction of the backpropagation flow. The image is from reference [11].
When the LRP model is applied to the trained neural network, it propagates the classification function f(x) backward through the network, using pre-defined propagation rules, from the output layer to the input layer. The working mechanism of LRP is shown in Figure 4, which summarizes the main procedure.
Let j and k be neurons in two consecutive layers. Propagating the relevance scores R_k^{(l+1)} at a given layer l+1 onto the neurons of the lower layer l is achieved by applying the following rule:

R_j^{(l)} = \sum_k \frac{Z_{jk}}{\sum_{j'} Z_{j'k}} R_k^{(l+1)}    (1)

The quantity Z_{jk} models how much importance neuron j has contributed to making neuron k relevant. The propagation procedure does not stop until it reaches the input layer.

Z_{jk} = x_j^{(l)} w_{jk}^{(l,l+1)}    (2)
The relevance of a neuron is calculated according to Equation 1, which gives the relevance R for a neuron j in layer l; our current layer is l and the layer above it is l+1. The calculation for neuron j works as follows. For each neuron j in layer l, we compute its contribution according to Z_{jk}, which simply multiplies the input of neuron j in the current layer by the weight connecting it to neuron k in the next layer. This input x comes from passing the pixel values through the previous layers and shows how strong the activation between these neurons is. Intuitively, a high value means that the neuron was very important for the output. We therefore interpret the fraction in Equation 1 as the relative contribution of a specific neuron, compared to all contributions in that layer. Finally, we multiply the relevance score of the neuron in the next layer by this relative value to propagate the relevance of the next layer backward.
For the different layers of VGG16, different LRP rules should be applied, as shown in Figure 4, ranging from LRP-0 for the upper layers to LRP-ε for the middle layers and LRP-γ for the lower layers.
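To make the propagation rule concrete, the following NumPy sketch applies Equations 1 and 2 to a single fully-connected layer; the small epsilon term is the stabiliser used by the LRP-ε variant mentioned above, and all names are illustrative rather than taken from the paper's code.

import numpy as np

def lrp_dense(x, W, R_upper, eps=1e-9):
    # x: activations of layer l, shape (J,)
    # W: weights connecting layer l to layer l+1, shape (J, K)
    # R_upper: relevance scores of the neurons in layer l+1, shape (K,)
    Z = x[:, None] * W             # Z_jk = x_j * w_jk                      (Equation 2)
    denom = Z.sum(axis=0) + eps    # sum over j' of Z_j'k, stabilised (LRP-epsilon)
    return (Z / denom) @ R_upper   # R_j = sum_k (Z_jk / sum_j' Z_j'k) R_k  (Equation 1)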
Fig. 5. Sample from the KDEF dataset, displaying seven emotions.
IV. EXPERIMENT AND RESULTS
A. Facial Emotional Dataset and Data Preprocessing
The training and testing are performed on the Karolinska Directed Emotional Faces (KDEF) [12] dataset, which contains 4,900 facial expression photos of 70 individuals (half male and half female, aged 20 to 30). Each person imitates seven different facial emotions, and each facial expression is recorded from five camera views. Some examples are shown in Figure 5.
In this paper, we only make use of the front-face photos in our experiment, since our robot mostly interacts with a human user in a frontal view. This means we used one-fifth of the dataset, 980 pictures in total, with 140 front-view images for each expression. The face images are rescaled to a standard 224x224 pixels with three color channels, so as to fit the input format of the network. We randomly split the front-face dataset into a training set (700 samples), a validation set (140 samples) and a test set (140 samples) for the modified VGG16 model.
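A minimal sketch of this preprocessing and split, assuming the front-view KDEF image paths and labels have already been collected; the file handling and helper names are illustrative, not taken from the paper.

import numpy as np
from PIL import Image

def preprocess(paths):
    # Rescale each front-view face image to 224x224 pixels with three colour channels.
    return np.stack([np.asarray(Image.open(p).convert("RGB").resize((224, 224)))
                     for p in paths])

def random_split(images, labels, seed=0):
    # Randomly split the 980 front-view samples into 700 / 140 / 140.
    idx = np.random.default_rng(seed).permutation(len(images))
    train, val, test = idx[:700], idx[700:840], idx[840:]
    return ((images[train], labels[train]),
            (images[val], labels[val]),
            (images[test], labels[test]))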
B. Training and Testing Results
On the basis of the pre-trained VGG16 model, our face emotion recognition model is further trained on an Nvidia RTX 2080Ti graphics card. We set the batch size to 32 and use Adam as the optimization algorithm, with a learning rate of 0.00001. After 250 epochs of training, the model achieves a testing accuracy of 91.4% on the KDEF dataset.
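Assuming the Keras sketch of the model given in Section III-A, the reported training configuration (Adam, learning rate 0.00001, batch size 32, 250 epochs) could be expressed roughly as follows; the dataset variables are placeholders for the preprocessed KDEF splits.

import tensorflow as tf

# model, x_train, y_train, x_val, y_val are assumed from the earlier sketches.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train,                # placeholders for the 700 training samples
          validation_data=(x_val, y_val),  # placeholders for the 140 validation samples
          batch_size=32,
          epochs=250)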
Fig. 6. The testing result of our explainable facial emotion recognition model.
C. Experiment on Pepper robot
The trained emotion recognition model was integrated with
the Pepper robot software for an interactive HRI task. The
Pepper is a social humanoid robot with two RGB cameras
and one 3D camera [13] and can recognize and interact with
the person talking to it. It is also able to engage with the
natural HRI through conversation skills and the touch screen
on its chest. It also has 20 degrees of freedom for natural and
expressive movements.
In this paper, we implement our methods on the Pepper robot via its Choregraphe software, an open and fully programmable platform. During the experiment, a person interacts with Pepper, which simultaneously recognises their facial expression and verbalises the predicted emotion as verbal feedback in the HRI. The speech is generated directly from the emotion recognition result; for example, if the recognised emotion is happy, the spoken sentence will be "You look happy". The speech is synthesized through the Text-To-Speech (TTS) tool of the NAOqi SDK of the Pepper robot in the Choregraphe environment. Based on the LRP model, the robot also extracts the heatmap image as a pixel-level explanation of the interactor's face image. The original face and the heatmap face are shown on the Pepper chest screen as interpretable visual feedback, as can be seen in Figure 7. Through the verbal prediction feedback and the explainable visual feedback, it becomes possible to build up trustworthy human-robot interaction.
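As a rough sketch of this feedback step, the following assumes the NAOqi Python SDK's ALTextToSpeech and ALTabletService proxies; the robot IP address, the heatmap URL and the emotion-to-sentence mapping are illustrative assumptions, not the exact code used in the experiment.

from naoqi import ALProxy

ROBOT_IP, PORT = "192.168.1.100", 9559   # placeholder address of the Pepper robot
tts = ALProxy("ALTextToSpeech", ROBOT_IP, PORT)
tablet = ALProxy("ALTabletService", ROBOT_IP, PORT)

def give_feedback(emotion, heatmap_url):
    # Verbal feedback generated from the emotion recognition result.
    tts.say("You look " + emotion)
    # Explainable visual feedback: display the LRP heatmap on the chest tablet.
    tablet.showImage(heatmap_url)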
Fig. 7. The HRI experiment scene and results
V. DISCUSSION AND FUTURE WORK
In conclusion, this paper proposes the integration of explainable methods for deep learning in the context of emotion recognition for trustworthy HRI. By using the explainable method LRP, the robot can extract a facial heatmap that highlights the facial pixels most responsible for the emotion prediction.
The heatmap visualisation, together with the verbal feedback, can help the user understand the perceptual mechanism the robot uses to recognise emotions. For example, the comparison of the two heatmap images in Figure 6 shows that the robot uses similar feature pixels in two different faces. This means that when VGG16 classifies a face as showing the 'surprise' emotion, it relies mainly on features near the eyes and lips to make its prediction, which is in line with theories of human emotion perception and cognition. Thus the explainable method provides essential insights into the salient features used by the prediction model.
In this paper, we only completed the basic human facial emotion recognition and the related explanation based on the LRP model, and we have not yet conducted extensive work in human-robot interaction scenarios. In future work, more studies with human participants will be carried out for trust evaluation, to explore the effectiveness of our explainable model. We will also explore how the explainable visual feedback can be used for human-in-the-loop robot learning, to improve the robot's emotion perception ability in dynamic human-robot interaction scenarios. Moreover, multimodal explanations, including verbal and non-verbal behaviours, will be explored for trustworthy human-robot interaction.
ACKNOWLEDGMENT
This work was supported by the UKRI Trustworthy Autonomous Systems Node in Trust (EP/V026682/1).
REFERENCES
[1] T. Zhang, W. Zheng, Z. Cui, Y. Zong, and Y. Li, “Spatial–temporal
recurrent neural network for emotion recognition,” IEEE transactions
on cybernetics, vol. 49, no. 3, pp. 839–847, 2018.
[2] C. Huang, “Combining convolutional neural networks for emotion
recognition,” in 2017 IEEE MIT Undergraduate Research Technology
Conference (URTC). IEEE, 2017, pp. 1–4.
[3] M. K. Chowdary, T. N. Nguyen, and D. J. Hemanth, “Deep learning-
based facial emotion recognition for human–computer interaction appli-
cations,” Neural Computing and Applications, pp. 1–18, 2021.
[4] A. Sepas-Moghaddam, A. Etemad, F. Pereira, and P. L. Correia, “Facial
emotion recognition using light field images with deep attention-based
bidirectional lstm,” in ICASSP 2020-2020 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing (ICASSP). IEEE,
2020, pp. 3367–3371.
[5] G. Montavon, A. Binder, S. Lapuschkin, W. Samek, and K.-R. Müller, "Layer-wise relevance propagation: an overview," Explainable AI: interpreting, explaining and visualizing deep learning, pp. 193–209, 2019.
[6] S. Mishra, B. L. Sturm, and S. Dixon, “Local interpretable model-
agnostic explanations for music content analysis.” in ISMIR, 2017, pp.
537–543.
[7] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model
predictions,” in Proceedings of the 31st international conference on
neural information processing systems, 2017, pp. 4768–4777.
[8] P. Ekman, “Emotions revealed,” Bmj, vol. 328, no. Suppl S5, 2004.
[9] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, "On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation," PloS one, vol. 10, no. 7, p. e0130140, 2015.
[10] E. Rezende, G. Ruppert, T. Carvalho, A. Theophilo, F. Ramos, and
P. de Geus, “Malicious software classification using vgg16 deep neural
network’s bottleneck features,” in Information Technology-New Gener-
ations. Springer, 2018, pp. 51–59.
[11] Layer-wise relevance propagation. [Accessed: 25-Jan-2022]. [Online]. Available: https://www.hhi.fraunhofer.de/en/departments/ai/research-groups/explainable-artificial-intelligence/research-topics/layer-wise-relevance-propagation.html
[12] D. Lundqvist, A. Flykt, and A. Öhman, "Karolinska directed emotional faces," Cognition and Emotion, 1998.
[13] A. K. Pandey and R. Gelin, “A mass-produced sociable humanoid robot:
Pepper: The first machine of its kind,” IEEE Robotics & Automation
Magazine, vol. 25, no. 3, pp. 40–48, 2018.