Explainable Emotion Recognition for Trustworthy
Human-Robot Interaction
Hongbo Zhu
Department of Computer Science
University of Manchester
Manchester, UK
hongbo.zhu@manchester.ac.uk
Chuang Yu
Department of Computer Science
University of Manchester
Manchester, UK
chuang.yu@manchester.ac.uk
Angelo Cangelosi
Department of Computer Science
University of Manchester
Manchester, UK
angelo.cangelosi@manchester.ac.uk
Abstract—With the improvement of computing power and the availability of large datasets, deep learning models can achieve excellent performance in facial expression recognition tasks. However, because these deep neural networks have very complex nonlinear structures, it is difficult to understand the basis for a model's prediction; specifically, we do not know which facial features contribute to the classification. Developing affective computing models that give more explainable and transparent feedback to human interactors is essential for trustworthy human-robot interaction (HRI). In this paper, we explore explainable facial emotion recognition with Layer-Wise Relevance Propagation (LRP) in an HRI task. An experiment with the Pepper robot shows promising results in terms of the explainability of the heatmaps generated by LRP, which explain the facial expression predictions made by the CNN model at the pixel level and thus facilitate more natural and reliable HRI.
Index Terms—XAI, Trustworthy, Emotion Recognition, HRI
I. INTRODUCTION
Expressing our emotions is an innate ability of humans,
and emotions can also influence our behavior and decision-
making. The accurate understanding and correct expression of
our emotions play an important role in our communication,
which is crucial for successful interactions among humans. In
the HRI context, we wish robots to have a similar ability to
understand, express and explain emotions as humans do, so
that HRI can be more natural and coordinated. To achieve this
goal, a key point is that robots should be able to infer and
explain human emotions.
Recently, a variety of machine learning models have been used for facial expression recognition, the most popular of which are deep neural networks. Compared to other machine learning models, deep learning obtains better recognition accuracy. For example, Recurrent Neural Networks (RNNs) [1] and Convolutional Neural Networks (CNNs) [2] have achieved good performance in facial emotion prediction tasks. CNN models such as Inception-v3 [3] and VGG16 [4] are widely used to extract important facial features around regions of interest on the face for emotion recognition. In this work, we chose VGG16 as our classifier for facial emotion recognition.
Although CNN models offer significant advantages in recognition accuracy, their lack of explainability and transparency hinders their use in interaction tasks. Recently, various explainability algorithms have been proposed for deep networks. The three best-known methods are Layer-Wise Relevance Propagation (LRP) [5], Local Interpretable Model-agnostic Explanations (LIME) [6] and SHapley Additive exPlanations (SHAP) [7]. What these algorithms have in common is that they attempt to identify the parts of the input that most influence the classifier's decision. In image classification tasks, these parts are usually referred to as key pixels.
In this paper, we propose an explainable emotion recognition solution for trustworthy HRI. The pipeline of our model is shown in Figure 1. First, the Pepper robot predicts the facial emotional state of the interactor during HRI. Then, the robot extracts an explainable representation with the help of the LRP model. Pepper verbalises the predicted emotion as linguistic feedback and shows the heatmap extracted by the LRP model as explainable visual feedback. In addition, the robot can give more detailed emotion recognition feedback to increase the transparency of the interaction. This explainable visual feedback can help the human interactor understand the internal process of the robot's emotion recognition model, which in turn facilitates human-robot trust.
Fig. 1. The pipeline of our proposed framework.
The rest of the paper is organized as follows: Section II
introduces the related literature. Section III shows the method-
ology of explainable emotion recognition. The experiment
results are shown in Section IV. The discussion and future
work are part of Section V.
II. RELATED WORK
Explainable AI is a popular research direction in machine learning. It attempts to explain the working mechanisms of machine learning models and to provide explanations of their decisions, thus providing confidence in and trustworthiness of these models. Emotion recognition models are one such case. In general, facial emotion recognition consists of three steps: image pre-processing, feature extraction, and classification. As a pioneer, Paul Ekman [8] categorized facial emotions into seven classes. This is a common approach to discrete emotion recognition, which is also used in our work.
As for explainability methods for deep learning, there are in general three approaches in eXplainable Machine Learning (XML): attribution-based, perturbation-based and backpropagation-based interpretability, corresponding respectively to SHAP, LIME and LRP. In this paper, we use the last of these, LRP, for the explanations.
LRP is one of the most prominent backpropagation-based methods. Its purpose is to explain any neural network's output in the domain of its input. The method does not interact with the training process of the network and can easily be applied to an already trained DNN model. LRP provides an intuitive, human-readable heatmap of the input image at the pixel level. It uses the network weights and the neural activations computed in the forward pass to propagate the output back through the network, from the predicted output to the input layer. The heatmap [9] visualises the contribution of each pixel to the prediction. The contribution of each intermediate neuron or input pixel is quantified as a relevance value R, which represents how important that pixel is to a particular prediction.
III. METHODOLOGY
Our architecture consists of two main parts. The first is the facial emotion recognition module based on a pre-trained VGG16 model; the second is an explainability module based on the LRP model.
A. Face Emotion Recognition Model
This paper uses VGG16 for facial emotion recognition through transfer learning, i.e. "transferring" the weights learned by a network on one task to another, similar task. This is particularly suitable for training problems that lack large volumes of labelled samples. VGG16 is a simple and widely used convolutional neural network for image classification, which achieves 92.7% test accuracy on ImageNet [10].
In this work, we reuse the pre-trained VGG16 image classification model, trained on the ImageNet database, and fine-tune it for our facial emotion recognition task. The image is passed through a series of convolutional layers, followed by three fully-connected layers of different depths. Since our task is a seven-class emotion classification problem, the final fully-connected layer is changed to seven dimensions, as shown in Figure 2. The whole pre-trained model is then further trained on the facial emotion dataset for facial emotion recognition. During this new training, the earlier layers of the pre-trained VGG16 are kept fixed.
Fig. 2. The architecture of the modified VGG16.
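A minimal Keras sketch of this fine-tuning setup is shown below; the head layer sizes and other details are illustrative assumptions rather than the authors' released code.

```python
# Sketch: fine-tuning a pre-trained VGG16 for 7-class facial emotion recognition.
import tensorflow as tf

# Load the VGG16 convolutional base pre-trained on ImageNet, without its
# original 1000-way classifier head.
base = tf.keras.applications.VGG16(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # keep the earlier pre-trained layers fixed

# New fully-connected head ending in a seven-dimensional softmax,
# one output per emotion class (layer widths are assumptions).
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dense(7, activation="softmax"),
])
```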
B. Explainable method for CNN
Layer-Wise Relevance Propagation (LRP) can explain the relevance of the inputs to a certain prediction, typically in image processing, so we can see which parts of the input image, or more precisely which pixels, contribute most to a given prediction. LRP is a model-specific method, as it is mainly designed for neural networks. The method generally assumes that the classifier can be decomposed into several layers of computation.
Fig. 3. Visualisation of the pixel-wise decomposition process [5]
The workflow of LRP is shown in Figure 3. In the forward pass, the image goes through a convolutional neural network that extracts features from the image. These extracted features are then fed to a classifier consisting of a fully connected network and a softmax layer, which gives the final prediction. Suppose we have two classes, e.g. one for a cat and one for no cat, and we are interested in why the model predicts the image as a cat. LRP goes in reverse order over the layers visited in the forward pass and calculates relevance scores for each neuron in each layer until it arrives back at the input. We can then compute the relevance of each pixel of the input image. Positive relevance scores (red) indicate that the pixels were relevant for the prediction, while negative values (green) mean that these pixels speak against it; this produces the heatmap.
Fig. 4. The LRP propagation procedure; red arrows indicate the direction of the backpropagation flow. Image from reference [11].
When LRP is applied to the trained neural network, it propagates the classification function f(x) backward through the network, from the output layer to the input layer, using pre-defined propagation rules. The working mechanism of LRP is summarized in Figure 4.
Let $j$ and $k$ be neurons in two consecutive layers. Propagating the relevance scores $R_k^{(l+1)}$ of layer $l+1$ onto the neurons of the lower layer $l$ is achieved by applying the following rule:

$$ R_j^{(l)} = \sum_{k} \frac{Z_{jk}}{\sum_{j} Z_{jk}} \, R_k^{(l+1)} \qquad (1) $$

The quantity $Z_{jk}$ models how much neuron $j$ has contributed to making neuron $k$ relevant. The propagation procedure does not stop until it reaches the input layer.

$$ Z_{jk} = x_j^{(l)} \, w_{jk}^{(l,l+1)} \qquad (2) $$
The relevance of a neuron is calculated according to Equation 1, which gives the relevance $R$ of a neuron $j$ in layer $l$; the current layer is $l$ and the next layer towards the output is $l+1$. For each neuron $j$ in layer $l$, we compute its contribution $Z_{jk}$ according to Equation 2: it simply multiplies the input of neuron $j$ in the current layer by the weight connecting it to neuron $k$ in the next layer. This input $x$ comes from passing the pixel values through the previous layers, and it indicates how strong the activation between these neurons is. Intuitively, a high value means that the neuron was very important for the output. We therefore interpret the fraction in Equation 1 as the relative contribution of a specific neuron compared to all contributions in that layer. Finally, we multiply the relevance score of the neuron in the next layer by this relative value to propagate the relevance backward.
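The following NumPy sketch illustrates Equations (1) and (2) for a single fully-connected layer; variable names and the stabilising epsilon are illustrative assumptions, not the authors' implementation.

```python
# Sketch: the basic LRP rule (Eqs. (1)-(2)) for one dense layer.
import numpy as np

def lrp_dense(x, W, R_next, eps=1e-9):
    """Redistribute relevance R_next (layer l+1) onto the inputs x (layer l).

    x:      activations of layer l,            shape (J,)
    W:      weights from layer l to layer l+1, shape (J, K)
    R_next: relevance of layer l+1 neurons,    shape (K,)
    """
    Z = x[:, None] * W              # Eq. (2): z_jk = x_j * w_jk
    denom = Z.sum(axis=0) + eps     # sum_j z_jk (eps avoids division by zero)
    return (Z / denom) @ R_next     # Eq. (1): R_j = sum_k (z_jk / sum_j z_jk) R_k

# Backward pass over a stack of dense layers: start from the relevance of the
# predicted output and propagate it down to the input pixels, e.g.
#   R = output_relevance
#   for x_l, W_l in reversed(list(zip(activations, weights))):
#       R = lrp_dense(x_l, W_l, R)
```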
Different LRP rules should be applied to the different layers of VGG16, as shown in Figure 4: LRP-0 for the upper layers, LRP-ε for the middle layers, and LRP-γ for the lower layers.
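As an illustration of how these variants modify Equation (1), the sketches below follow the formulations given in the LRP overview [5]; the parameter values are illustrative assumptions.

```python
# Sketch: the epsilon and gamma variants of the LRP rule for one dense layer.
import numpy as np

def lrp_epsilon(x, W, R_next, eps=0.25):
    # LRP-eps (middle layers): a small eps added to the denominator absorbs
    # weak or contradictory contributions, giving sparser explanations.
    Z = x[:, None] * W
    denom = Z.sum(axis=0)
    denom = denom + eps * np.where(denom >= 0, 1.0, -1.0)  # eps follows the sign
    return (Z / denom) @ R_next

def lrp_gamma(x, W, R_next, gamma=0.25):
    # LRP-gamma (lower layers): favour positive contributions by adding
    # gamma * w^+ to the weights before computing z_jk.
    Wg = W + gamma * np.clip(W, 0, None)
    Z = x[:, None] * Wg
    return (Z / (Z.sum(axis=0) + 1e-9)) @ R_next
```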
Fig. 5. Samples from the KDEF dataset, displaying the seven emotions.
IV. EXPERIMENT AND RESULTS
A. Facial Emotional Dataset and Data Preprocessing
The training and testing are performed on the Karolinska Directed Emotional Faces (KDEF) [12] dataset, which contains 4,900 facial expression photos of 70 individuals (half male and half female, aged 20 to 30). Each person poses seven different facial emotions, and each expression is recorded from five camera views. Some examples are shown in Figure 5.
In this paper, we only use the front-view photos in our experiment, as our robot mostly interacts with a human user facing it. This means we use one-fifth of the dataset, 980 pictures in total, with 140 front-view images per expression. The face images are rescaled to a standard 224x224 pixels with three colour channels to fit the input format of the network. We then randomly split this front-view dataset into a training set (700 samples), a validation set (140 samples), and a testing set (140 samples) for the modified VGG16 model.
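A minimal preprocessing sketch consistent with this setup is given below; the file handling and label parsing are illustrative assumptions.

```python
# Sketch: resizing front-view KDEF images to the VGG16 input format and
# splitting them 700/140/140, as described above.
import numpy as np
from PIL import Image
from sklearn.model_selection import train_test_split

def load_image(path):
    # Resize to 224x224 pixels with three colour channels.
    img = Image.open(path).convert("RGB").resize((224, 224))
    return np.asarray(img, dtype=np.float32) / 255.0

# image_paths: front-view file paths; labels: integer emotion ids in [0, 6]
# X = np.stack([load_image(p) for p in image_paths])
# y = np.array(labels)
# X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, train_size=700, random_state=0)
# X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=140, random_state=0)
```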
B. Training and Testing Results
Starting from the pre-trained VGG16 model, our facial emotion recognition model is further trained on an Nvidia RTX 2080Ti graphics card. We set the batch size to 32 and use Adam as the optimization algorithm, with a learning rate of 0.00001. After 250 epochs of training, the model achieves a testing accuracy of 91.4% on the KDEF dataset.
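The following sketch reflects this training configuration in Keras, assuming the model and data arrays from the earlier sketches; it is not the authors' exact training script.

```python
# Sketch: training configuration (Adam, lr=1e-5, batch size 32, 250 epochs).
import tensorflow as tf

# `model`, `X_train`, `y_train`, etc. come from the sketches above (assumption).
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss="sparse_categorical_crossentropy",  # integer emotion labels
    metrics=["accuracy"])

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          batch_size=32, epochs=250)

test_loss, test_acc = model.evaluate(X_test, y_test)
```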
Fig. 6. The testing result of our explainable facial emotion recognition model.
C. Experiment on Pepper robot
The trained emotion recognition model was integrated with the Pepper robot software for an interactive HRI task. Pepper is a social humanoid robot with two RGB cameras and one 3D camera [13] that can recognize and interact with the person talking to it. It can also engage in natural HRI through its conversation skills and the touch screen on its chest, and it has 20 degrees of freedom for natural and expressive movements.
In this paper, we implement our methods on the Pepper robot via its Choregraphe software, an open and fully programmable platform. During the experiment, a person interacts with Pepper, which simultaneously recognises their facial expressions and verbalises the predicted facial emotion as verbal feedback in HRI. The speech is generated from the emotion recognition result with a simple template; for example, if the predicted emotion is happy, the spoken sentence is "You look happy." The voice is synthesized through the Text-To-Speech (TTS) tool of the NAOqi SDK of the Pepper robot in the Choregraphe environment. Based on the LRP model, the robot also extracts a heatmap image as a pixel-level explanation of the interactor's face image. The original face and the heatmap face are shown on Pepper's chest screen as interpretable visual feedback, as can be seen in Figure 7. Through the verbal prediction feedback and the explainable visual feedback, this setup is a promising way to build up trustworthy human-robot interaction.
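A hedged sketch of this feedback step with the NAOqi Python SDK is shown below; the network address, heatmap hosting, and sentence template are illustrative assumptions, not the authors' code.

```python
# Sketch: verbal and visual feedback on Pepper via the NAOqi SDK.
from naoqi import ALProxy

PEPPER_IP, PORT = "192.168.1.10", 9559   # assumed network address

tts = ALProxy("ALTextToSpeech", PEPPER_IP, PORT)
tablet = ALProxy("ALTabletService", PEPPER_IP, PORT)

def give_feedback(predicted_emotion, heatmap_url):
    # Verbal feedback: a simple template sentence from the predicted label.
    tts.say("You look {}.".format(predicted_emotion))
    # Visual feedback: show the LRP heatmap on Pepper's chest tablet;
    # ALTabletService expects an image URL reachable from the tablet.
    tablet.showImage(heatmap_url)

# give_feedback("happy", "http://<host>/heatmap.png")
```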
Fig. 7. The HRI experiment scene and results
V. DISCUSSION AND FUTURE WORK
In conclusion, this paper proposes the integration of explainable methods for deep learning in the context of emotion recognition for trustworthy HRI. Using the explainability method LRP, the robot can extract a facial heatmap that highlights the facial pixels most responsible for the emotion prediction.
The heatmap visualisation, together with the verbal feedback, can help the user understand the perceptual mechanism the robot uses to recognise emotions. For example, the comparison of the two heatmap images in Figure 6 shows that the robot uses similar feature pixels in two different faces. This means that when VGG16 classifies a face as the 'surprise' emotion, it relies mostly on features near the eyes and lips to make its prediction. This is in line with theories of human emotion perception and cognition. The explainable method thus provides essential insights into the features the prediction model relies on.
In this paper, we have only completed basic human facial emotion recognition and the related LRP-based explanation, and have not yet conducted extensive work in human-robot interaction scenarios. In future work, more human-subject tests will be conducted for trust evaluation to explore the effectiveness of our explainable model. We will also explore how the explainable visual feedback can be used for human-in-the-loop robot learning to improve the robot's emotion perception in dynamic human-robot interaction scenarios. Moreover, multimodal explanations, including verbal and non-verbal behaviours, will be explored for trustworthy human-robot interaction.
ACKNOWLEDGMENT
This work was supported by the UKRI Trustworthy Autonomous Systems Node in Trust (EP/V026682/1).
REFERENCES
[1] T. Zhang, W. Zheng, Z. Cui, Y. Zong, and Y. Li, “Spatial–temporal
recurrent neural network for emotion recognition,” IEEE transactions
on cybernetics, vol. 49, no. 3, pp. 839–847, 2018.
[2] C. Huang, “Combining convolutional neural networks for emotion
recognition,” in 2017 IEEE MIT Undergraduate Research Technology
Conference (URTC). IEEE, 2017, pp. 1–4.
[3] M. K. Chowdary, T. N. Nguyen, and D. J. Hemanth, “Deep learning-
based facial emotion recognition for human–computer interaction appli-
cations,” Neural Computing and Applications, pp. 1–18, 2021.
[4] A. Sepas-Moghaddam, A. Etemad, F. Pereira, and P. L. Correia, “Facial
emotion recognition using light field images with deep attention-based
bidirectional lstm,” in ICASSP 2020-2020 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing (ICASSP). IEEE,
2020, pp. 3367–3371.
[5] G. Montavon, A. Binder, S. Lapuschkin, W. Samek, and K.-R. Müller, "Layer-wise relevance propagation: an overview," Explainable AI: interpreting, explaining and visualizing deep learning, pp. 193–209, 2019.
[6] S. Mishra, B. L. Sturm, and S. Dixon, “Local interpretable model-
agnostic explanations for music content analysis.” in ISMIR, 2017, pp.
537–543.
[7] S. M. Lundberg and S.-I. Lee, "A unified approach to interpreting model predictions," in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 4768–4777.
[8] P. Ekman, “Emotions revealed,” Bmj, vol. 328, no. Suppl S5, 2004.
[9] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, "On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation," PLoS ONE, vol. 10, no. 7, p. e0130140, 2015.
[10] E. Rezende, G. Ruppert, T. Carvalho, A. Theophilo, F. Ramos, and P. de Geus, "Malicious software classification using VGG16 deep neural network's bottleneck features," in Information Technology-New Generations. Springer, 2018, pp. 51–59.
[11] Layer-wise relevance propagation. [Accessed: 25-Jan-2022]. [Online].
Available: https://www.hhi.fraunhofer.de/en/departments/ai/research-
groups/explainable-artificial-intelligence/research-topics/layer-wise-
relevance-propagation.html
[12] D. Lundqvist, A. Flykt, and A. Öhman, "Karolinska directed emotional faces," Cognition and Emotion, 1998.
[13] A. K. Pandey and R. Gelin, “A mass-produced sociable humanoid robot:
Pepper: The first machine of its kind,” IEEE Robotics & Automation
Magazine, vol. 25, no. 3, pp. 40–48, 2018.