Contribution of temporal and multi-level body cues to emotion classification

2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII)
Nesrine Fourati
Research & Innovation
Paris, France
Catherine Pelachaud
Sorbonne University
Paris, France
Patrice Darmon
Research & Innovation
Paris, France
Abstract—The representation of expressive body movement
is a critical step to automatically classify bodily expression of
emotions. Using motion capture data, we study the contribution
of different types of body features to the classification of emotions.
In particular, we focus on the role played by temporal profiles
of motion cues with regard to a set of multi-level body cues,
including postural and dynamic ones.
Index Terms—Emotion expression, Classification, Temporal
Profile features, Multi-Level body features
I. INTRODUCTION
Along with the growing trend to develop intelligent, interactive, and affective machines, the automatic analysis and recognition of affect from body language, and from body movement in particular, has lately received increasing interest (e.g. in video game applications [16]). On the one hand, different studies from psychology [23, 6, 24] as well as from affective computing [15, 13] demonstrate and highlight the ability of the human body to convey affect. On the other
hand, recent advances in computer science, including image
processing and Machine-Learning techniques, make it possible
to build robust automatic systems of affect recognition from
body movement [18, 5]. Such automatic systems aim to capture the relationship that links body movement cues to the expression of emotional states. Three basic components have to be considered to build such a system [5]: 1) an
adequate representation of body movement, 2) an appropriate
Machine Learning model able to capture the complexity of the
relationship that links body movement cues to affective states,
and 3) a specific representation of affective states.
In this paper, we compare the contribution of two com-
plementary representations of body movement to the classi-
fication of emotions. We also compare the performance of
two Machine Learning models for this purpose. A categorical
approach is adopted to represent emotional states. Body ex-
pressions are modeled in terms of discrete emotions. In the
next section, we discuss related works on the representation
of expressive body movement used to explore the ability of
body posture and body movement to convey affective states.
We also go through the motivation of the present work.
II. RELATED WORK
Along with the need to build automatic systems for emotion recognition from body movement, an increasing number of studies have emerged to explore the ability of body posture, movement dynamics, and their combination to automatically
classify emotions expressed in body movement. This requires
a body notation system able to encode bodily expression of
emotions through the representation of the form and/or the dynamics of the movement. Bodily expression of emotions can be encoded in an explicit or implicit manner. Explicit expression focuses on the use of specific body action/posture units to convey emotions [7]. Implicit expression concerns
the way the expression of emotions affects the form of our
body posture and the quality of our body movement [20].
The best-known body notation systems used so far to characterize explicit and implicit emotional body expression, respectively, are BAP (Body Action and Posture) [7] and LMA
(Laban Movement Analysis) [17]. In this paper, we focus on
the way the expression of emotions implicitly affects the form
of body posture and the quality of body movement.
a) LMA: The LMA approach [17] has been widely adopted in particular to build recognition systems for emotional body expressions [20, 3]. Based on the LMA approach, Camurri et al. [3] proposed a collection
of computational models called EyesWeb expressive gesture
processing library for real-time expressive body movement
analysis. Additional modules have been recently proposed to
enrich this library with powerful tools enabling robust and
multi-level recognition of affects expressed in body movement
based on the LMA approach [20]. In addition to works that relied on the LMA approach, other studies have been conducted
relying on different body notation systems to build automatic
recognition systems of emotions expressed in body movement.
They focused on postural body cues [14, 16, 21], properties
of movement dynamics [4, 11, 2] or both [22, 20]. We discuss a few of these studies in the following.
b) Postural cues: The authors in [14] defined a set of
24 low-level features based on the distances between body
joints to describe the multi-directional form of the whole
body posture (lateral, frontal and vertical extension of body
parts). Their set of features was used to classify four basic
emotions using a non-linear Mixture Discriminant Analysis
model. Classification rates ranged between 78% and 90%
across different cultures. Another low-level description of
body posture is used in [16] based on normalized 3-D joint Euler rotations of head, neck, collar, shoulders, elbows, wrists,
torso, hips and knees. A multilayer perceptron model was
used to evaluate automatic recognition models of affective
states (concentrating, defeated, frustrated and triumphant) and
affective dimensions (Valence and Arousal) from non-acted
affective postures. The model achieved a recognition rate of
59.22% across the four affective states. Riemer et al. [21]
focused on the upper body posture to explore how emotions
change during learning with a serious game.
c) Dynamics cues: Castellano et al. [4] extracted a large
set of submotion characteristics that capture the temporal pro-
file of upper body motion and the speed of head motion to dis-
criminate between five moods in one professional musician’s
performance: personal, sad, allegro, serene and overexpressive.
A wrapper approach was adopted to select a subset of features
and a decision tree model was used to assess the ability of
temporal features to classify different emotionally expressive
performances. Only three expressions were discriminated with
57.8% classification rate. The automatic classification of the
five expressions was not successful. Glowinski et al. [11]
focused on head and hands motion to recognize emotions
expressed through explicit gestures. Similarly to [4], a number of features were then computed on the time series of head and
hands trajectory to describe their temporal profiles based on
the slopes and main peaks characteristics of the trajectories.
Both [11] and [4] rely on EyesWeb modules to detect body parts through video analysis.
d) Postural and Dynamics cues: Assuming that “only the
right arm exhibits significant movement” in knocking motion,
a limited set of four arm motion features was used in [2] to
automatically classify four emotions (neutral, happy, angry and
sad) expressed in body movement during knocking action. The
features were composed of one postural feature (the maximum distance of the hand from the body) and three discrete representations of movement dynamics (the average hand speed, acceleration and jerk). A support vector machine (SVM) model with a polynomial kernel was used for the classification task, achieving a classification rate of 81%. Similarly to [2], Saha et al. [22] focused on
the representation of upper body movement to discriminate
between five emotional states expressed through gestures. They
used time-series of both postural body cues (e.g. the distance
between the hands and the spine) and movement dynamics
cues (e.g. acceleration of hand and elbow motion with respect
to spine). Different Machine Learning methods were used in
their study and it turned out that the ensemble tree model
provided the best results (90.83%) [22]. A comparative study
presented in [9] showed that postural body cues used for the
classification of emotions expressed in daily actions provide
better results than a discrete representation of movement
dynamics. Considering both postural and a discrete representa-
tion of movement dynamics features always provided the best
results [9]. It was shown in another study conducted by the
same authors [10] that both postural and discrete movement
dynamics features figure among the most relevant features as
assessed by the Random Forest model.
A. Discussion and Motivation of the present work
Overall, most of the previous works have focused on a
limited representation of body movement to classify emotional
body expressions. Only a limited number of studies were based
on a multi-level description of body movement that considers together the representation of the whole body movement (i.e. upper and lower body parts), different motion directions
(e.g. vertical, lateral, frontal directions), and both posture and
movement dynamics cues [3, 20, 10]. Considering a multi-level representation of expressive body movement has been shown to yield better emotion classification results across different actions than a limited set of body cues [9].
Movement dynamics cues used in previous studies refer
either to a discrete representation of movement dynamics
(e.g. the average of motion speed) [2] or to a set of sub-
motion characteristics that capture the temporal profile of
body segments motion (e.g. main peaks characteristics) [4].
Compared to temporal features, a discrete representation of movement dynamics cues can discard a large amount of information about the dynamic properties of expressive body movement. Different studies from psychology [23] and
affective computing fields [15, 11] highlight the discriminative
power of movement dynamics properties. However, only a few
efforts were made to compare the contribution of postural and
movement dynamics features to the classification of emotions
expressed in body movement [9, 15]. In particular, the role played by temporal profiles of motion cues in emotion classification with regard to a set of multi-level body cues remains unclear. This constitutes the aim of this paper. As body end
effectors (i.e. the head, the feet and the hands) mostly retain
the largest amount of variability regarding body movement
dynamics, we focus on their temporal dynamics.
Our main research questions can be summarized as follows:
To what extent can dynamic features, describing the temporal
profile of body end effectors motion, discriminate emotional
expressions? And to what extent can they contribute to the im-
provement of emotional body expression classification based
on multi-level features defined according to different sublevels
of anatomical, directional and posture/ movement dimensions?
We use two different Machine Learning approaches to address these research questions. Sections III and IV summarize respectively the body notation system used to define multi-level features (described in detail in [10]) and the set of temporal features (described in detail in [4]). Our framework and our results of automatic classification using these sets of features are presented in section V. A study of the features that contribute most to the classification task is discussed in section VI. Finally, a general conclusion is provided in section VII.
III. MULTI-LEVEL BODY NOTATION SYSTEM
Our Multi-Level body notation system (MLBNS) is inspired by body notation systems proposed in previous works, mainly LMA (Laban Movement Analysis) and the BAP coding system [7], as well as postural body notation systems such as the one used in [14].

Fig. 1: Multi-Level body notation system: # stands for the number of features related to each body cue.

MLBNS provides a structured multi-level body notation system that can be used to describe
implicit expression of emotions in different movement tasks. It
encompasses three main description levels, each in turn decomposed into different sub-levels: 1) Anatomical (Global, Semi-Global, Local), 2) Directional (Sagittal, Lateral, Vertical:Length, Vertical:Rotation and Three-dimensional) and 3) Posture/Movement (Posture, Postural changes and Movement dynamics). We proposed a set of 114 motion
capture features based on our MLBNS. Figure 1 summarizes
the body cues that regroup these features. We rely on the Anatomical description level to categorize these body cues into three groups: Body part (i.e. bounding boxes surrounding the lower body parts, trunk or arms), Semi-Global (i.e. relationships between body segments, such as the distance between the hands or the symmetry of arm posture/movement) and Local (i.e. local descriptions of specific body joints, such as downward/upward head rotation) body cues. For each body cue, we give the corresponding number of features (see Figure 1).
The definition of each body cue is based either on 3D rotation or on 3D position data. These features capture postural information, postural changes and a discrete representation of movement dynamics. Postural information (ML.Post) refers to the average peak value of rotation/position data (e.g. maximal downward/upward head flexion) or to the occurrence of a specific postural configuration such as crossed limbs or the symmetry of arms/legs. Postural changes (ML.PostChg) refer to the standard deviation of motion. The discrete representation of movement dynamics (ML.Dyn) refers to the average peak value of the speed and acceleration of a motion cue, or to the correlation between arm motions.
Body part features combine these movement description levels with different directional levels to describe the motion of the bounding boxes that surround the arms, the trunk and the lower body parts. More details about this set of 114 multi-level features (henceforth called ML features) can be found in [10].
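As a rough illustration (our simplified sketch, not the authors' implementation; the function name, inputs and simplifications are ours), a few ML-style cues of the three types above can be computed from hand trajectories as follows:

```python
import numpy as np

def ml_feature_sketch(left_hand, right_hand):
    """Toy multi-level cues from two (T, 3) arrays of 3D hand positions.

    Simplified stand-ins for the three feature types: ML.Post uses a plain
    mean instead of the paper's average peak value, ML.PostChg is a standard
    deviation, and ML.Dyn is a peak speed (unit frame time assumed).
    """
    hand_dist = np.linalg.norm(left_hand - right_hand, axis=1)  # Semi-Global cue
    posture = hand_dist.mean()            # ML.Post-like: average hand distance
    postural_change = hand_dist.std()     # ML.PostChg: standard deviation of motion
    speed = np.linalg.norm(np.diff(left_hand, axis=0), axis=1)  # per-frame speed
    dynamics = speed.max()                # ML.Dyn-like: peak hand speed
    return {"hand_distance": posture,
            "postural_change": postural_change,
            "peak_speed": dynamics}
```

The actual feature set covers many more cues (bounding boxes, symmetries, local joint rotations) across the directional sub-levels.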
IV. TEMPORAL PROFILE FEATURES
In this section, we present the set of temporal profile features (henceforth called TP features) that we use in addition to the set of multi-level features described in the previous section. Castellano et al. [4] proposed an
extensive description of submotion characteristics intended
to define the temporal profiles of motion cues. We rely on
their work as it covers a detailed description of temporal
dynamics properties. Sixteen features are used to capture the
temporal profile of motion cues [4]. Overall, this set of features
refers to a rich variety of temporal motion characteristics
such as the temporal regularity of a motion cue’s profile, the
overall impulsiveness of a motion cue and the impulsiveness
of the release and of the attack of a motion cue’s main peak
[4]. These features are categorized into four groups, each
composed of four features:
- Slopes characteristics: initial and final slope of the first, the last and the main peaks
- Main peak characteristics: main peak value and its relationship with its duration and with the following peak, and the relationship between the main peak duration and the overall motion duration (see Figure 2)
- Overall characteristics of motion: number of peaks, mean value and its relationship with the absolute maximum, and the number of peaks preceding the main one
- Temporal regularity of motion structure: centroid of energy, its relationship with the absolute maximum, symmetry of the motion cue's temporal profile, and temporal position of the main peak
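As a minimal sketch (with simplified definitions of our own, not the exact formulations of [4]), a few of these characteristics can be computed from a one-dimensional motion cue:

```python
import numpy as np

def tp_feature_sketch(cue):
    """A few temporal-profile characteristics of a 1-D motion cue.

    cue: 1-D array (e.g. a trajectory norm or an energy measure over time).
    Names and formulas are simplified guesses, for illustration only.
    """
    cue = np.asarray(cue, dtype=float)
    # Strict local maxima as peaks (interior samples larger than both neighbours).
    peaks = np.where((cue[1:-1] > cue[:-2]) & (cue[1:-1] > cue[2:]))[0] + 1
    main = peaks[np.argmax(cue[peaks])] if len(peaks) else int(np.argmax(cue))
    return {
        "n_peaks": len(peaks),                           # overall characteristic
        "main_peak_value": cue[main],                    # main peak characteristic
        "mean_to_max": cue.mean() / cue.max(),           # mean vs absolute maximum
        "main_peak_pos": main / (len(cue) - 1),          # temporal position of main peak
        "peaks_before_main": int((peaks < main).sum()),  # peaks preceding the main one
    }
```

The slope and centroid-of-energy features would be derived from the same peak structure in a similar manner.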
In [4], the authors proposed to measure these features to
describe the temporal profile of two motion cues defined as
the quantity of upper body motion and the velocity of the
head movements.

Fig. 2: Main peak characteristics

In our work, we compute the set of sixteen features to characterize the temporal profile of two motion cues: 1) the 3D trajectories of the hands, head and feet (i.e. position time-series) and 2) the kinetic energy of the 3D motion of the hands, head and feet. Different studies in psychology [23] as well as in the affective and social computing fields [20, 19] highlight the relevance of the energy cue for characterizing expressive body movement. The kinetic energy measure is computed separately for the 3D motion of each of the five body end effectors. A Savitzky-Golay filter is applied to the energy measure to smooth the data. We refer henceforth to these two sets of features as TP.Position and TP.Energy, respectively. Each of these two sets contains 80 features (16 × 5): sixteen temporal profile features describing one motion cue (position or energy) for each of the five body end effectors (head, hands and feet).
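The energy cue can be sketched as follows (our illustration: mass is taken as 1, and the frame rate and Savitzky-Golay window/order are assumptions, not values reported in the paper):

```python
import numpy as np
from scipy.signal import savgol_filter

def smoothed_kinetic_energy(positions, fps=120.0, window=15, polyorder=3):
    """Smoothed kinetic-energy cue for one body end effector.

    positions: (T, 3) array of 3D positions. Unit mass is assumed, and the
    frame rate and filter parameters are illustrative guesses.
    """
    velocity = np.gradient(positions, 1.0 / fps, axis=0)   # finite differences
    energy = 0.5 * np.sum(velocity ** 2, axis=1)           # 0.5 * m * |v|^2, m = 1
    return savgol_filter(energy, window, polyorder)        # Savitzky-Golay smoothing
```

In the paper this measure is computed separately for each of the five end effectors before the sixteen temporal profile features are extracted from it.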
V. EMOTION CLASSIFICATION
We present in this section our framework and our results on the contribution of temporal features to the classification of emotions expressed in body movement, with respect to our initial set of multi-level features.
A. Database
Our analyses are based on the Emilya database, a large
3D motion capture database of emotional body expression
in daily actions [8]. Eleven actors as well as a professional director were asked to express 8 discrete emotions, based on a scenario approach, in seven daily actions. A 3D motion capture system was used to record body movement. The emotions are: Anxiety, Pride, Joy, Sadness, Panic Fear, Shame, Anger and Neutral. The actions are: Simple Walking (SW), Walking with an object in hands (WH), Moving books on a table (MB), Knocking (KD), the Sit Down action, which is split into Sitting Down (SD) and Being Seated (BS), Lifting (Lf) and Throwing (Th). Considering the number of trials per action and the number of scenarios used for each emotion, the recording of this database led to 8206 motion capture sequences, with around 1000 motion sequences available for each of the 8 action datasets. Overall, the distribution of the 8 emotions in the database is well balanced: around 120 observations per emotion are available for each action.
B. Classification models
We use different representations of body movement based on the multi-level and temporal profile features presented in sections III and IV. In total, eleven sets of features are considered, as presented in Table I: multi-level and temporal profile features are first considered separately and then combined. A single decision tree model was used in [4] for the task of emotional
body expression classification based on the set of temporal features.

TABLE I: Quantification of the sets of features: ML, ML.Post, ML.PostChg, ML.Dyn and TP stand respectively for multi-level, postural, postural-change and dynamic features of ML, and for the temporal profile of position- or energy-based motion cues.

#                  ML   TP.Position  TP.Energy  Total
ML                114   -            -          114
ML.Post            36   -            -           36
ML.PostChg         26   -            -           26
ML.Dyn             52   -            -           52
TP.Position        -    80           -           80
TP.Energy          -    -            80          80
TP                 -    80           80         160
ML.Post + TP       36   80           80         196
ML + TP.Position  114   80           -          194
ML + TP.Energy    114   -            80         194
ML + TP           114   80           80         274

In our work, we compare the performance of two
Machine Learning approaches that have been widely used
in previous works for automatic analysis of human motion
[12] [1] [2]: 1) the Random Forest (RF) approach, an ensemble of decision trees, and 2) the Support Vector Machine (SVM). The RF model is mainly adopted for its high performance compared to a single decision tree, its robustness in handling a large set of features and its ability to evaluate the relevance of features to the classification task. The SVM model is mainly selected for its powerful generalization performance.
Our analyses are based on the Scikit-learn python library.
We build one Random Forest model and one SVM model
per action (8 totally) and per set of features (11 totally).
Preliminary analyses showed that both the RF and SVM models provide better results when standardization rather than normalization is applied as a pre-processing step. As such, standardization is applied to the data before using the SVM and RF models, even though it is only required for the SVM model, as the Random Forest model is not sensitive to the scaling of the data. The hyperparameters of each SVM model and each
RF model are tuned based on two grids of parameters. The grid of parameters used for the SVM model covers: 1) the kernel (radial basis function -rbf- or linear), 2) the penalty parameter C of the error term (8 values ranging from 0.001 to 10000), and 3) the gamma kernel coefficient for rbf (7 values ranging from 0.001 to 1000, plus the options 'auto' and 'scale'). The best set of parameters is the one leading to the best performance of the SVM model across a three-fold cross validation scheme. The grid of parameters considered for the RF model covers the number of trees (ranging from 50 to 1000 in steps of 50). The number of trees is set to the smallest value that provides the best classification accuracy with no significant improvement according to the Tukey HSD statistical test (alpha = 0.05) across 20 trials.
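The tuning procedure can be sketched with Scikit-learn as follows (our sketch on toy data, with a reduced grid; the grids described above are larger):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy stand-in for one action dataset (feature matrix X, emotion labels y).
X, y = make_classification(n_samples=200, n_features=20, n_classes=4,
                           n_informative=10, random_state=0)

# Standardize, then tune the SVM over a reduced version of the grid:
# kernel in {rbf, linear}, penalty C, and the rbf gamma coefficient.
pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
grid = {"svm__kernel": ["rbf", "linear"],
        "svm__C": [0.1, 1, 10],
        "svm__gamma": ["scale", 0.01]}
search = GridSearchCV(pipe, grid, cv=3)  # three-fold cross validation
search.fit(X, y)
print(search.best_params_)
```

Wrapping the scaler in the pipeline ensures standardization is fit only on the training folds of each cross-validation split.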
C. Emotion classification results
Based on F1-score, we study the ability of the different sets
of features presented in Table I to classify emotional body
expressions. We study the classification of 8 emotions in each
of the actions presented in section V-A. Two classification
models (RF and SVM) are used for each action dataset
as explained in section V-B. A three-fold cross-validation
approach is applied for each model. The process is repeated 30
times. Figure 3 depicts the overall F1-score of emotion classification, averaged across the 30 trials (with the standard deviation shown as error bars), for each action dataset. Figure 4 shows the average F1-score (and related standard deviation in error bars) of each emotion (class) across the different action datasets. As shown in Figures 3 and 4, the classification of emotions obtained with both the TP and ML sets of features is above chance level (12.5% for 8 classified emotions).
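This evaluation protocol can be sketched as follows (our toy reproduction on synthetic data, with fewer repetitions than the 30 trials used in the study):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy stand-in for one action dataset with 8 emotion classes.
X, y = make_classification(n_samples=400, n_features=30, n_classes=8,
                           n_informative=12, random_state=0)

# Three-fold CV scored with macro-averaged F1, repeated with reshuffled folds;
# the mean and standard deviation correspond to the bars in Figures 3 and 4.
scores = []
for trial in range(5):  # the paper uses 30 trials; 5 keeps the sketch fast
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=trial)
    f1 = cross_val_score(RandomForestClassifier(n_estimators=100,
                                                random_state=trial),
                         X, y, cv=cv, scoring="f1_macro")
    scores.extend(f1)
print(round(float(np.mean(scores)), 3), round(float(np.std(scores)), 3))
```

Any model beating the 12.5% chance level on 8 balanced classes is extracting some emotional signal from the features.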
a) Separate ML and TP sets of features: Firstly, we note that the TP.Position features mostly outperform the TP.Energy features for each action across all emotions (see Figures 3 a), b)) and for each emotion across all actions (see Figures 4 a), b)). The average F1-scores across all actions obtained with the RF model are 53% and 45% based respectively on TP.Position and TP.Energy features. Combining all temporal features does not lead to a significant improvement. This result suggests
that temporal profile of the 3D trajectories of end-effectors
(head, hands and feet) is more likely to cover the expressive
content of the motion than the temporal profiles of the energy
cue that is based on the velocity of head, hands and feet
motion. Secondly, we observe that the discrete representation of movement dynamics in the ML features mostly outperforms the temporal features. This may be explained by the fact that the former retains additional properties related to the different anatomical and directional levels of our multi-level coding scheme.
Thirdly, when comparing the outcomes of the separate sets of features, the postural features give better results than the other sets. This is in line with previous findings in psychology [6] and affective computing [16] [21] reporting that emotional states can be recognized from body posture. Moreover, it has
been reported in [4] that temporal features alone do not allow
classifying five emotional expressions using a decision tree
model. Although our work is based on more robust Machine
Learning models, using their set of temporal features to
describe more complex bodily expressions allows classifying
emotions above chance level but with lower classification rates
than those obtained with postural features or with the total set
of multi-level features.
b) Combined ML and TP features: Firstly, we note that,
across the two classification models, the best classification
results are obtained using ML features. They outperform the
results obtained by postural features combined with temporal
features (ML.Post + TP) (see Figures 3 and 4). This finding
highlights the ability of our set of multi-level features to
capture the complexity of emotional body behavior modeling,
although only a discrete representation of movement dynamics
is considered in conjunction with an extensive description of
postural cues. Secondly, we find that combining the TP features with the ML features does not enhance the performance of the RF model: its F1-score drops from 74.47% with ML features to 71.41% with ML+TP features on average
across all actions (see Figure 3 c)).

Fig. 3: F1-scores obtained per action using separate sets of features based on: a) RF and b) SVM models, and combined sets of features based on: c) RF and d) SVM models. Walking1 and Walking2 refer respectively to SW and WH.

The same finding applies to the SVM model, with lower results, as the performance decreases from 73.93% to 64.87% when considering temporal features in addition to the multi-level features (see Figure 3 d)). This can be due to the curse of dimensionality; as the
number of body features grows, the amount of emotional
body expression data required to generalise accurately grows
exponentially. Furthermore, this result can be explained by the fact that the TP features perform significantly worse than the ML features, which may drag down the performance when the two are combined. Besides, the difference in the size of the feature sets can affect the results, as a larger set may hinder the benefit that can be obtained from a smaller one. To get better insight into the contribution of temporal features to emotion classification, we discuss in the next section the selection of the most relevant features based on the RF model.
The difference in the performance drop between the RF and SVM models points to the higher sensitivity of the SVM model to high dimensionality. More action-dependent variability in classification rates is also observed with the SVM model than with the RF model (see Figure 3 b), d)). This variability may be explained by the sensitivity of the SVM model to the complexity of bodily expression modeling. Gunes et al. also showed that an RF model outperforms an SVM model for the classification of emotional gestures (76.87% vs 64.51%) [12].
VI. MOST RELEVANT FEATURES
One of the advantages of the Random Forest model is its ability to track the relevance of the features considered during the classification task. We examine the most relevant features as assessed by the RF model built using the whole set of features for each of the 8 actions. The whole set of features
is composed of 274 features combining the 114 multi-level
features and 160 temporal features (see Table I).

Fig. 4: F1-scores obtained per emotion using separate sets of features based on: a) RF and b) SVM models, and combined sets of features based on: c) RF and d) SVM models.

TABLE II: Relative quantification of the subsets of selected features (SSF) per action. Columns stand for actions.

#SSF/274           22%  58%  36%  83%  36%  39%  41%  39%
#ML/#SSF           75%  66%  74%  48%  69%  63%  71%  70%
#TP.Position/#SSF   3%  15%   9%  26%  15%  20%  13%  12%
#TP.Energy/#SSF    21%  18%  17%  26%  16%  17%  15%  18%

A wrapper forward feature selection approach is proposed to obtain a reduced set of features based on the relevance measure returned by
each RF model: the oob (out-of-bag) score of the Random Forest model is estimated iteratively on the first k most relevant features, with k ranging from 1 to 274. The results are averaged across 50 runs. The set of features that leads to the best averaged oob score is selected. Table II summarizes
the relative quantification of the subset of selected features
(SSF) per action according to the total number of features
(i.e. 274 features); the results range from 22% of features in
simple walking action to 83% of features in knocking action.
Table II also depicts the relative quantification of the multi-
level features, position and energy based temporal features
according to the number of selected features. Except for the knocking action, for which a large subset of features is selected, we can observe that the selected features consist mainly of multi-level features (e.g. 74% and 75% of the selected features in the walking actions, see Table II). Along with the F1-score results discussed in section V-C, this finding underscores the relevance of the multi-level features compared to the temporal features. Except for the walking actions, temporal characteristics
of the trajectory and the energy of end effectors seem to be
equally selected in each subset of selected features in each
action (see Table II).
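The oob-based forward selection described above can be sketched as follows (our sketch on toy data; the paper ranks 274 features and averages over 50 runs):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for one action dataset (the paper uses 274 features).
X, y = make_classification(n_samples=300, n_features=30, n_informative=8,
                           random_state=0)

# Rank the features once by RF importance, then grow the subset of top-ranked
# features and keep the size that maximizes the out-of-bag (oob) score.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
order = np.argsort(rf.feature_importances_)[::-1]  # most relevant first

best_k, best_oob = 1, 0.0
for k in range(1, X.shape[1] + 1, 5):  # coarse steps to keep the sketch fast
    sub = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    sub.fit(X[:, order[:k]], y)
    if sub.oob_score_ > best_oob:
        best_k, best_oob = k, sub.oob_score_
print(best_k, round(best_oob, 3))
```

The oob score reuses the samples left out of each tree's bootstrap, so no separate validation split is needed during the selection.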
Looking carefully at the content of each subset of selected
features, we find that the selected temporal features tend to be
related to the body end-effectors that are mainly involved in
the movement task. For instance, the temporal features selected
for arms based actions (i.e. Knocking, Moving Books, Lifting
and Throwing actions) are mainly composed of Hands features
while the temporal features selected for walking actions are
mainly composed of Feet features. Temporal features selected
for Sitting Down and Being Seated actions are mainly com-
posed of both Hands and Feet features. In fact, among the selected TP features, those of action-related body segments seem to be the most relevant for capturing the properties of
emotional expression. Regarding the multi-level features, we find that both upper body and lower body features are selected across all the actions. Indeed, it seems that both upper and lower body parts can capture the properties of emotional body expression regardless of the main body segments involved in the movement task. The performance of the Random Forest model is
evaluated with the new subset of selected features (described
in Table II) for each action following the same approach as
section V-B (same number of trees, 3-fold cross validation,
F1-score averaged across 30 runs). We find that the RF model performs slightly better using the subset of selected features than using the whole set of features (multi-level and temporal features). Except for the Knocking action, for which only a slight improvement of 1% is observed, the improvement in F1-score ranges from 4% for the Walking with an object in hands, Sitting Down and Throwing actions to 6% for the Lifting action.
VII. CONCLUSION
The present work is an attempt to give more insight into the contribution of different representations of body movement to the classification of emotions expressed in different movement tasks. Two complementary sets of features are considered: 1) a
set of multi-level features that includes multi-directional whole
body postural cues coupled with discrete representation of
movement dynamics and 2) a set of temporal features that
capture the temporal profiles of body end effectors motion.
Multi-level postural features, alone or coupled with a dis-
crete representation of the whole body movement dynamics,
seem to better capture the properties of emotional body
expression in different actions than temporal features of end
effectors motion. This result is in line with previous works
that highlight the role of postural cues to discriminate between
emotions. Based on our analysis of the most important features
considered during the classification task, we find that postural cues of the upper and lower body parts are both relevant for classifying emotions, regardless of the main body segments involved in the action. Submotion characteristics of action-related joints
(e.g. temporal features of feet motion for Walking) seem
to retain additional properties of emotional body expression
in conjunction with multi-level features that include multi-
directional description of the whole body posture and a dis-
crete representation of movement dynamics. In future work,
it would be interesting to conduct a deeper analysis of the correlations that may occur within postural features and within movement dynamics features. Besides, additional movement
dynamics features can be considered in future analyses such
as the temporal evolution of emotion expression during a
movement sequence and the temporal profile of other body
joints motion (e.g. shoulders, hips, elbows and knees).
ACKNOWLEDGMENT
The authors thank the anonymous reviewers for their insightful comments and valuable suggestions, which improved the quality and readability of the paper.
[1] Insaf Ajili, Malik Mallem, and Jean-Yves Didier. “Rel-
evant LMA Features for Human Motion Recognition”.
In: International Journal of Computer and Information
Engineering 12.9 (2018), pp. 792–796.
[2] Daniel Bernhardt and Peter Robinson. “Detecting affect
from non-stylised body motions”. In: Affective Comput-
ing and Intelligent Interaction 4738 (2007), pp. 59–70.
[3] Antonio Camurri, Barbara Mazzarino, and Gualtiero
Volpe. “Analysis of Expressive Gesture: The EyesWeb
Expressive Gesture Processing Library”. In: Gesture-Based Communication in Human-Computer Interaction. Lecture Notes in Computer Science (LNAI 2915). Springer, 2004, pp. 460–467.
[4] Ginevra Castellano, Marcello Mortillaro, and Antonio Camurri. “Automated analysis of body movement in emotionally expressive piano performances”. In: Music Perception 26.2 (2008), pp. 103–120.
[5] Ciprian Corneanu et al. “Survey on Emotional Body
Gesture Recognition”. In: IEEE Transactions on Affec-
tive Computing (2018), pp. 1–19.
[6] Nele Dael, Marcello Mortillaro, and Klaus R Scherer.
“Emotion expression in body action and posture.” In:
Emotion 12.5 (2011), pp. 1085–1101.
[7] Nele Dael, Marcello Mortillaro, and Klaus R. Scherer.
“The Body Action and Posture Coding System (BAP):
Development and Reliability”. In: Journal of Nonverbal
Behavior (Jan. 2012), pp. 97–121.
[8] Nesrine Fourati and Catherine Pelachaud. “Emilya:
Emotional body expression in daily actions database”.
In: 9th International Conference on Language Re-
sources and Evaluation (LREC 2014). Reykjavik, Ice-
land, 2014, pp. 3486–3493.
[9] Nesrine Fourati and Catherine Pelachaud. “Multi-level
classification of emotional body expression”. In: 11th
IEEE International Conference on Automatic Face and
Gesture Recognition (FG2015). Ljubljana, Slovenia, 2015.
[10] Nesrine Fourati and Catherine Pelachaud. “Relevant
body cues for the classification of emotional body
expression in daily actions”. In: 2015 International
Conference on Affective Computing and Intelligent In-
teraction (ACII). IEEE, Sept. 2015, pp. 267–273.
[11] Donald Glowinski et al. “Technique for automatic emo-
tion recognition by body gesture analysis”. In: 2008
IEEE Computer Society Conference on Computer Vision
and Pattern Recognition Workshops. IEEE, June 2008,
pp. 1–6.
[12] Hatice Gunes and Massimo Piccardi. “Automatic tem-
poral segment detection and affect recognition from face
and body display”. In: IEEE Transactions on Systems,
Man, and Cybernetics, Part B: Cybernetics 39.1 (2009),
pp. 64–84.
[13] Michelle Karg and AA Samadani. “Body movements
for affective expression: a survey of automatic recogni-
tion and generation”. In: IEEE Transactions on Affective
Computing 4.4 (2013), pp. 341–359.
[14] Andrea Kleinsmith, P. Ravindra De Silva, and Nadia Bianchi-Berthouze.
“Cross-cultural differences in recognizing affect from
body posture”. In: Interacting with Computers 18.6
(2006), pp. 1371–1389.
[15] Andrea Kleinsmith and Nadia Bianchi-Berthouze. “Af-
fective Body Expression Perception and Recognition: A
Survey”. In: IEEE Transactions on Affective Computing
4.1 (Jan. 2013), pp. 15–33.
[16] Andrea Kleinsmith, Nadia Bianchi-Berthouze, and An-
thony Steed. “Automatic Recognition of Non-Acted
Affective Postures”. In: IEEE Transactions on Systems
Man and Cybernetics Part B Cybernetics 41.4 (2011),
pp. 1027–1038.
[17] Rudolf Laban. The Mastery of Movement. Plymouth, UK: Northcote House, 1988, p. 196.
[18] Son Thai Ly et al. “Emotion Recognition via Body
Gesture: Deep Learning Model Coupled with Keyframe
Selection”. In: Proceedings of the 2018 International
Conference on Machine Learning and Machine Intelli-
gence - MLMI2018. New York, New York, USA: ACM
Press, Sept. 2018, pp. 27–31.
[19] Maurizio Mancini et al. “Computing and evaluating the
body laughter index”. In: Lecture Notes in Computer Science 7559 (2012), pp. 90–98.
[20] Radoslaw Niewiadomski et al. “Analysis of Movement
Quality in Full-Body Physical Activities”. In: ACM
Transactions on Interactive Intelligent Systems 9.1 (Feb.
2019), pp. 1–20.
[21] Valentin Riemer et al. “Identifying features of bodily
expression as indicators of emotional experience during
multimedia learning”. In: Frontiers in Psychology 8 (July 2017), pp. 1–13.
[22] Sriparna Saha et al. “A study on emotion recognition
from body gestures using Kinect sensor”. In: 2014
International Conference on Communication and Signal
Processing. IEEE, Apr. 2014, pp. 56–60.
[23] Harald G Wallbott. “Bodily expression of emotion”. In:
European Journal of Social Psychology 28.6 (1998),
pp. 879–896.
[24] Zachary Witkower and Jessica L. Tracy. “Bodily Com-
munication of Emotion: Evidence for Extrafacial Be-
havioral Expressions and Available Coding Systems”.
In: Emotion Review 11.2 (May 2018), pp. 184–193.