Automatic Pain Recognition
from Video and Biomedical Signals
Philipp Werner,
Ayoub Al-Hamadi,
Robert Niese
Institute for Information Technology and Communications,
University of Magdeburg, Germany
E-Mail: {Philipp.Werner, Ayoub.Al-Hamadi}@ovgu.de
Steffen Walter,
Sascha Gruss,
Harald C. Traue
Department for Psychosomatic Medicine and Psychotherapy,
University of Ulm, Germany
Abstract—How much does it hurt? Accurate assessment of pain is very important for selecting the right treatment; however, current methods are not sufficiently valid and reliable in many cases. Automatic pain monitoring may help by providing an objective and continuous assessment. In this paper we propose an automatic pain recognition system combining information from video and biomedical signals, namely facial expression, head movement, galvanic skin response, electromyography and electrocardiogram. Using the BioVid Heat Pain Database, the system is evaluated in the task of pain detection, showing significant improvement over the current state of the art. Further, we discuss the relevance of the modalities and compare person-specific and generic classification models.
I. INTRODUCTION
The key to successful pain management is accurate as-
sessment [1]. If valid assessment of the pain is not possible,
a treatment may lead to problems and risks for the patient.
For example, over-usage of opioids can slow down or even stop the patient's breathing, or it can lead to addiction [2].
Further, medication may have various adverse effects like
nausea, vomiting or constipation [2]. In contrast, the lack of
sufficient pain relief not only causes mental suffering, but it
is also associated with several pathophysiological effects, e.g.
increased blood pressure and heart rate [1]. Generally, valid
and reliable assessment is necessary to facilitate pain relief
without complications and support a faster recovery of the
patient [2], [3].
Currently, self-report is the gold-standard in pain measure-
ment, as pain is a very individual sensation. However, self-
report is not always reliable and valid, e. g. for demented
patients [4]. Furthermore, it cannot be applied at all for uncon-
scious or newborn patients. Observational and physiological
measures [5] can help in these cases. They may also help to overcome the weaknesses of the simple rating scales [6] that are common practice in clinics. For adequate pain management, the assessment must be repeated regularly, especially if the patient cannot call for help himself. To provide a continuous assessment, we work towards an automatic system for measurement and monitoring of pain [7], [8], which can alert hospital staff in a timely manner and provide additional information
for the patient’s medical record. In this paper, we present first
results in multi-modal pain recognition from video (i. e. facial
expression, head movement) and biomedical signals (i. e. ECG,
GSR, EMG).
A. Related Work
Previous work in pain recognition can be classified into video-based and biosignal-based approaches. The combination of both in a multi-modal recognition system has so far only been explored in related application domains.
1) Video: Commonly, pain recognition from video is based
on facial expressions. Lucey et al. [9] employ Active Appear-
ance Models to track and align the face based on manually
labeled key-frames. They extract shape and appearance fea-
tures and apply a Support Vector Machine (SVM) to classify
at frame level whether there is a facial expression of pain, i.e.
whether any of the pain related action units previously found
by Prkachin [10] is present. At sequence level they classify
three intensities by fusing frame level results via majority vot-
ing. As they try to mimic an expert observer, the ground truth
pain intensity labels were assigned by extensively trained
observers. All experiments are conducted on the UNBC-
McMaster Shoulder Pain Expression Archive Database, which
is publicly available. Chen et al. [11] compare several learning
algorithms for training person-specific classifiers on the same
database. Based on the landmarks provided by Lucey et al.
and Local Binary Pattern features, their Inductive Transfer
Learning outperforms Lucey et al. [9] on the frame level. Also
on the UNBC database, Hammal and Cohn [12] applied Log-
Normal filter based features and SVMs to classify four pain
expression intensities (no pain and three pain levels). Niese
et al. [13] distinguish pain from five other expression classes
(neutral and four basic emotions) based on facial distances
and angles taken as input of an SVM. Werner et al. [14]
extend their work by measuring the pain expression intensity on a continuous scale and integrating gradient-based features
for measuring facial wrinkles. Hammal and Kunz [15] utilize
distances and measure nasal wrinkles, which are automatically extracted from the frontal face view. Based on the Transferable Belief
Model which is designed for handling noisy and partial input,
they provide a prediction for each frame, but incorporate
dynamics using a temporal evolution model to refine the
prediction until the end of the sequence. They propose the
inclusion of context variables to bias the classifier towards
the most relevant expression. All of the above-mentioned works aim at predicting an observational measure, i. e. they employ
computer vision and pattern recognition methods for analyzing
facial expressions. Werner et al. [7] go a step further and try
to predict the pain stimulus that the person was subjected to.
They reveal challenges that pain recognition research has to face to gain clinical relevance, most notably the diversity of individuals, e. g. in expressiveness. Whereas some persons show strong facial reactions even for low stimulation, others show little or no facial expression, even during high stimulation. More details on [7] are given below, as we advance
this work in the present paper. The issue of pain posing is
addressed by Littlewort et al. [16]. They use Gabor features
and SVMs to classify real versus posed pain at sequence level.
Their system outperforms untrained observers in this task.
2) Biomedical Signals: Pain recognition from biomedical
signals is a rather new field of research with only a few contributions. In a study by Treister et al. [17], tonic heat stimuli
(1 minute) were individually adjusted to induce no pain, low,
medium, and high pain in 45 healthy volunteers. Electrocardio-
gram (ECG), photoplethysmogram (PPG), and galvanic skin
response (GSR) were recorded. The linear combination of
parameters significantly differentiated not only between pain
and no pain, but also between all levels of pain (P <.001 to
.02). Similarly Ben-Israel et al. [18] developed and validated
an index for nociception levels of patients under general anes-
thesia. The combination of the physiological parameters heart
rate (HR), heart rate variability (HRV), PPG wave amplitude,
GSR level, and GSR fluctuations outperformed any individual
parameter in the evaluation of the nociceptive response. Fur-
ther, fMRI was successfully applied to differentiate between
heat pain stimuli and no pain by Wager et al. [19]. Beyond this, there is a lot of basic research with fMRI focusing on the understanding of pain-related processes in the brain. Aside from pain, there are further works on automatic recognition of
mental states in the field of affective computing, which use
similar methods, e. g. [20].
3) Multi-modal: To the best of the authors' knowledge, there is currently no work combining video and biomedical signals for recognizing pain. However, there are several ap-
proaches for classifying mental states in the field of affective
computing, showing that combining these modalities can sig-
nificantly improve the classification results, e. g. [21].
B. Contributions
Our main contributions are the following (also see Fig. 1).
• This is the first work combining video and physiological data for automatic pain recognition.
• We improve the video-based system proposed by Werner et al. [7] by reducing the level of noise in the extracted features (Sect. II). Further, we propose a set of features to extract from the biomedical signals GSR, trapezius muscle EMG and ECG (Sect. III).
• We explore the effectiveness of ensemble-based classification, namely random forests (Sect. IV), for the first time in video-based pain recognition.
• We conduct experiments in pain detection with the BioVid Heat Pain Database, i. e. we try to predict whether the person was subjected to a painful stimulus. The results (Sect. V) show several significant improvements compared to the state of the art. First, the random forest outperforms the previously used support vector machine in nearly all cases. Second, the noise reduction in the video features leads to an additional improvement. Third, the results improve most significantly when data fusion with the biosignals is applied.
We further discuss the relevance of the combined modalities and the benefit of generic and person-specific models. Finally, we give a conclusion and an outlook on future work (Sect. VI).
Fig. 1. Overview of the recognition concept, the experiments and the used database: heat pain stimuli in 4 intensity levels elicit behavioral and physiological feedback, recorded in the BioVid Heat Pain Database as video (color & depth) and biomedical signals (GSR, EMG, ECG); features are extracted for facial expression, head movement, galvanic skin response, electromyography and electrocardiogram, followed by early fusion and classification; the experiments in pain detection address the role of the modalities, person-specific versus generic models, and the comparison with [7].
C. BioVid Heat Pain Database
Our experiments are conducted with the BioVid Heat Pain
Database [7], [8] which was collected in a study with 90
participants. Heat pain was induced experimentally at the right arm (see Fig. 1) with stimuli of four different intensities, each lasting about five seconds. The four temperatures for stimulation
were equally distributed between the person’s pain threshold
and pain tolerance. The sub-experiment considered in this
work consists of 80 stimuli per person, i. e. 20 per pain level.
Between the stimuli there was a randomized rest of 8 to 12
seconds. The rest periods following the lowest intensity stimuli
were selected as baseline (no pain). Among other data, the
database contains high resolution frontal color video of the
face, depth map video (from Kinect camera), GSR, trapezius
muscle EMG and ECG, which we all use in our recognition
approach.
II. FEATURE EXTRACTION FROM VIDEO
This section describes the extraction of pain-relevant fea-
tures from video. The first steps are facial feature point
detection in the color video and the estimation of the head pose
from the depth map video. Based on this information, facial distances and gradient-based features are extracted at the frame level. Next, the dynamics of the frame-level features within a time window are condensed into a descriptor. This is the feature
vector later used for classification (see Sect. IV). The described
method is an advancement of the work by Werner et al. [7].
Fig. 2. Video feature extraction at frame level. (a) Measured 3D point cloud, model fitting residuals (cyan) and nose-tip coordinate system illustrating the determined pose. (b) Calculation of facial distances. (c) Regions for mean gradient magnitude features (green: nasal wrinkles, blue/yellow: nasolabial folds) based on anchors (cyan) and facial axes (white).
A. Facial Feature Point Detection and Head Pose Estimation
The facial expression analysis is based on a set of land-
marks, which we extract automatically using IntraFace by
Xiong et al. [22]. This state-of-the-art facial feature point
detector is more robust and accurate than the previously used approach, which is reflected in the recognition rates in Section V.
To estimate the head pose, we utilize depth information.
Within a volume of interest, the depth map is converted into a 3D point cloud using a pinhole camera model. Afterwards, a
generic face model is registered with the measured point cloud
using a variant of the Iterative Closest Point (ICP) algorithm
as presented by Niese et al. [23]. It provides a 6D head
pose vector including the 3D position and 3D orientation (see
Fig. 2a).
B. Frame-level Facial Expression Features
For each image frame we extract a set of distance and
gradient features. They are selected to capture pain related
facial actions which have been identified by several previous
studies, e. g. by Prkachin [10]. These actions include lowering
of the brows, tightening of the lid, closing of the eyes, raising
of the cheeks and the upper lip, wrinkling of the nose, and
stretching and opening of the mouth. To uncouple facial
expression from head pose, distances are calculated in 3D,
as proposed by Niese et al. [24]. Using a pinhole camera
model, the previously detected landmarks are projected onto
the surface of the generic face model placed according to the
current head pose. From the obtained 3D points (depicted in
Fig. 2b), we calculate the distances between brows and eyes,
eyes and mouth, brows and mouth, as well as the width and
height of the mouth. In contrast to [7], we measure the closing
of the eye by the distance between an upper and a lower eyelid landmark.
In addition to these distances, some facial changes are measured from the texture. Based on the landmarks, we define rectangular
regions of interest (see Fig. 2c) and calculate the mean gradient
magnitude for each of these regions. This way, we measure
nasal wrinkles and the nasolabial furrows as done in [14] and
[7]. All regions are anchored by the eye center and mouth
corner points which are also utilized to define the eye axis
(upper white line in Fig. 2c) and the vertical face axis (line
between the center of the eyes and the mouth). Based on the
anchor points and the axes, the regions are placed according to
assumptions derived from empirical investigations of our data.
The width and horizontal position of the nasal wrinkle area are determined from the innermost eyebrow landmarks.
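To illustrate the texture measure, the following sketch computes the mean gradient magnitude within one rectangular region; the image representation and the ROI layout are assumptions chosen for illustration, not the exact implementation used here.

import numpy as np

def mean_gradient_magnitude(gray, roi):
    # gray: 2D grayscale face image as a float array
    # roi:  (top, bottom, left, right) pixel bounds derived from the
    #       landmark-based anchor points (hypothetical layout)
    top, bottom, left, right = roi
    patch = gray[top:bottom, left:right].astype(float)
    gy, gx = np.gradient(patch)              # finite-difference image gradients
    return float(np.mean(np.hypot(gx, gy)))  # mean gradient magnitude in the region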
C. Time-Window Descriptors
Whereas a single image may contain enough information
to estimate the intensity of a facial expression, it misses the dynamics, which we consider to contain valuable information about the underlying feeling of pain. Therefore, we classify time
windows rather than single images. The previously described
processing steps provide 1) a 6D vector of pose parameters per
depth frame and 2) a 13D vector of facial expression features
per color frame. For our experiments we use time windows of 5.5 seconds length, yielding 6 discrete-time signals of pose parameters and 13 discrete-time signals of facial expression features. To reduce the number of dimensions, we calculate a descriptor of each signal. We first apply a Butterworth low-pass filter (cutoff 1 Hz) for noise reduction and estimate the first and second temporal derivatives of the signal. Then, we calculate 7 statistical measures (mean, median, range, standard and median absolute deviation, interquartile and interdecile range) for each of the smoothed signal and its first and second derivatives, resulting in a 21D descriptor per signal. The 6
descriptors of the head pose signals are combined into the
head movement feature vector, the 13 descriptors of the facial
expression signals into the facial expression feature vector.
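A minimal sketch of the descriptor computation for a single frame-level signal is given below; the frame rate, the filter order and the use of NumPy/SciPy are assumptions, not the original implementation.

import numpy as np
from scipy.signal import butter, filtfilt

def window_descriptor(signal, fs=25.0, cutoff=1.0):
    # Butterworth low-pass filter (cutoff 1 Hz) for noise reduction
    b, a = butter(3, cutoff / (fs / 2.0), btype="low")
    smooth = filtfilt(b, a, signal)
    d1 = np.gradient(smooth) * fs   # first temporal derivative
    d2 = np.gradient(d1) * fs       # second temporal derivative

    def stats(x):
        q = np.percentile(x, [10, 25, 50, 75, 90])
        return [np.mean(x),                   # mean
                q[2],                         # median
                np.ptp(x),                    # range
                np.std(x),                    # standard deviation
                np.median(np.abs(x - q[2])),  # median absolute deviation
                q[3] - q[1],                  # interquartile range
                q[4] - q[0]]                  # interdecile range

    # 7 statistics for the smoothed signal and its two derivatives -> 21D descriptor
    return np.array(stats(smooth) + stats(d1) + stats(d2))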
III. FEATURE EXTRACTION FROM BIOMEDICAL SIGNALS
This section describes feature extraction from GSR, EMG
and ECG. All signals $x_i$ with $i \in [1, N]$ were recorded and are processed at a sampling rate of 512 Hz. As for the extraction
of video-based features, only a time-window of 5.5 seconds
during the stimulus is considered.
A. Galvanic Skin Response (GSR)
From the GSR signal we extract amplitude and variability
features, namely the peak (maximum), the range, the stan-
dard deviation, the inter-quartile range, the root mean square,
the mean value of local maxima and the mean value of
local minima. Further, we extract the mean absolute value $\mathrm{mav}(x) = \frac{1}{N}\sum_{i=1}^{N} |x_i|$, the mean of the absolute values of the first differences $\mathrm{mavfd}(x) = \frac{1}{N-1}\sum_{i=1}^{N-1} |x_{i+1} - x_i|$ and the mean of the absolute values of the second differences $\mathrm{mavsd}(x) = \frac{1}{N-2}\sum_{i=1}^{N-2} |x_{i+2} - x_i|$. The features $\mathrm{mavfd}$ and $\mathrm{mavsd}$ are calculated for both the raw signal $x$ and the standardized signal $x^*$, i. e. the signal converted to a z-score. Further, we extract stationarity features, i. e. the degree
of stationarity in the spectrum domain [25], and the variation
of the first and second moment of the signal over time [25].
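The difference-based GSR features can be sketched as follows; the function and variable names are chosen for illustration only.

import numpy as np

def gsr_difference_features(x):
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()   # standardized signal (z-score)

    mav = np.mean(np.abs(x))                              # mean absolute value
    mavfd = lambda s: np.mean(np.abs(s[1:] - s[:-1]))     # mean abs. first differences
    mavsd = lambda s: np.mean(np.abs(s[2:] - s[:-2]))     # mean abs. second differences

    return {"mav": mav,
            "mavfd": mavfd(x), "mavsd": mavsd(x),          # raw signal
            "mavfd_std": mavfd(z), "mavsd_std": mavsd(z)}  # standardized signal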
B. Electromyography (EMG) at trapezius muscle
First, we filter the EMG signal with a Butterworth band-
pass filter (20-250 Hz). The resulting signal is further denoised
by the method of Andrade [26], which is based on Empir-
ical Mode Decomposition. Next, time windows of activity
are selected by thresholding the envelope of the denoised
signal [26]. Only these activity windows are considered for
feature extraction. The above listed amplitude, variability and
stationarity features are calculated here as well. Additionally,
we extract the following frequency features: mode and mean
frequency of the signal’s power spectrum, the width and central
frequency of the band, whose cut-off frequencies are given
by the upper and lower -3 dB points of the power spectrum.
Further, the median frequency of the band (which divides the
spectrum in two areas with equal power) and the count of
zero-crossings of the time-domain signal are extracted.
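A sketch of some of the listed frequency features for one activity window is shown below; the simple periodogram used as spectral estimate is an assumption, as the estimator is not specified here, and the -3 dB band features are omitted for brevity.

import numpy as np

def emg_frequency_features(x, fs=512.0):
    x = np.asarray(x, dtype=float)
    x = x - x.mean()

    spectrum = np.abs(np.fft.rfft(x)) ** 2            # power spectrum (periodogram)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)

    total = spectrum.sum()
    cum = np.cumsum(spectrum)
    median_freq = freqs[np.searchsorted(cum, total / 2.0)]  # divides the power into two equal areas
    mean_freq = (freqs * spectrum).sum() / total
    mode_freq = freqs[np.argmax(spectrum)]

    zero_crossings = int(np.sum(x[:-1] * x[1:] < 0))   # sign changes of the time-domain signal

    return {"mode_freq": mode_freq, "mean_freq": mean_freq,
            "median_freq": median_freq, "zero_crossings": zero_crossings}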
C. Electrocardiogram (ECG)
The ECG signal is first filtered with a Butterworth band-
pass filter (0.1-250 Hz). Afterwards we extract three heart rate
variability features from the signal: 1) the mean of the RR
intervals, i. e. the arithmetic mean of the time in between
consecutive heart beats, 2) the root mean square of the suc-
cessive differences (RMSSD), i.e. the quadratic mean of the
differences of consecutive RR intervals [9], and 3) the slope
of the linear regression of RR intervals in its time series, i. e.
a measure of acceleration of the heart rate.
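Given the detected R-peak times (the R-peak detection on the band-pass filtered ECG is omitted here), the three heart rate variability features can be computed as in this sketch.

import numpy as np

def hrv_features(r_peak_times):
    t = np.asarray(r_peak_times, dtype=float)  # R-peak times in seconds
    rr = np.diff(t)                            # RR intervals

    mean_rr = rr.mean()                         # 1) mean of the RR intervals
    rmssd = np.sqrt(np.mean(np.diff(rr) ** 2))  # 2) RMSSD of consecutive RR intervals
    slope = np.polyfit(t[1:], rr, 1)[0]         # 3) slope of the linear regression of RR over time

    return {"mean_rr": mean_rr, "rmssd": rmssd, "rr_slope": slope}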
IV. FUSION AND CLASSIFICATION
The next steps after feature extraction are data fusion
and classification of the stimulus. In this work, we apply an
early fusion model, i. e. the feature vectors of the different
modalities are concatenated. The resulting higher-dimensional
feature vectors are standardized per person, i. e. each variable
is converted to a z-score based on the corresponding person-
specific mean and standard deviation. The transformed feature
vectors are passed to the classifier, which is a random forest
in this work.
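Early fusion and the per-person standardization can be sketched as follows; the array layout and names are illustrative assumptions.

import numpy as np

def fuse_and_standardize(feature_blocks, person_ids):
    # feature_blocks: list of (n_samples, d_k) arrays, one per modality
    # person_ids:     length n_samples array of subject identifiers
    fused = np.hstack(feature_blocks)            # early fusion: concatenate modalities
    out = np.empty_like(fused, dtype=float)
    for pid in np.unique(person_ids):
        rows = person_ids == pid
        mu = fused[rows].mean(axis=0)
        sigma = fused[rows].std(axis=0) + 1e-12  # guard against zero variance
        out[rows] = (fused[rows] - mu) / sigma   # person-specific z-score
    return out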
A. Random Forest Classifier Ensemble
The random forest classifier [27] is an ensemble of several
decision trees. Given a test pattern, each tree is evaluated
and the final prediction is made by majority voting. Each
of the trees is trained on a randomly selected subset of the
training data. For each node of the tree, only a randomly
selected subspace of the feature space is considered to find
the optimal split. Further, the trees are fully grown and not
pruned. Random forests have been successfully applied in many application domains and combine many advantages, e. g. high predictive performance and fast training and prediction.
B. Training Strategy
To maintain comparability, we choose the same training
strategy as Werner et al. [7]. We first apply a grid search with
5-fold stratified cross validation on the training set to select
optimal values for the random forest training parameters. As
we observed that it is the only critical parameter, we only
searched for the optimal variable count to use for splitting at
the nodes. We fixed the number of trees to 100, the maximum
depth to 6 and the minimum sample count for node splitting
to 5. After parameter search, the random forest is trained with
the whole training set and the selected parameters.
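The training strategy could be reproduced roughly as in the following sketch; scikit-learn and the candidate grid for the split variable count are assumptions, not the original implementation.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def train_random_forest(X_train, y_train, max_features_grid=(2, 4, 8, 16, 32)):
    base = RandomForestClassifier(n_estimators=100,     # fixed number of trees
                                  max_depth=6,          # fixed maximum depth
                                  min_samples_split=5)  # fixed minimum sample count for splitting
    # grid values must not exceed the feature dimension
    search = GridSearchCV(base,
                          {"max_features": list(max_features_grid)},  # variable count used per split
                          cv=StratifiedKFold(n_splits=5, shuffle=True))
    search.fit(X_train, y_train)   # embedded 5-fold stratified cross validation
    return search.best_estimator_  # refit on the whole training set with the selected parameters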
TABLE I. GENERALIZATION CAPABILITY FOR VIDEO-BASED PAIN DETECTION. MEAN ACCURACY ACROSS SUBJECTS FOR PERSON-SPECIFIC AND GENERIC MODELS AND THE FOUR PAIN INTENSITIES.

                      BLN vs. PA4    BLN vs. PA3    BLN vs. PA2    BLN vs. PA1
Approach              p.-s.  gen.    p.-s.  gen.    p.-s.  gen.    p.-s.  gen.
Werner et al. [7]     71.7   66.3    63.8   56.4    54.1   51.7    47.6   50.1
+ Rand. Forest        73.7+  68.9+   66.6*  60.8*   55.8-  53.1    46.9   51.2
+ New Features        76.6*  71.6*   69.1*  62.9*   58.0*  54.4+   49.3   51.2

Significantly better than Werner et al. [7], with: * p < 0.001, + p < 0.01, - p < 0.05
V. EXPERIMENTS AND RESULTS
All experiments are conducted with the BioVid Heat Pain
Database. Due to technical problems during recording, data of
some modalities is missing for some subjects. We only use
the 87 subjects for which all data is available. The sample set
consists of 5 classes: stimulus at pain threshold (PA1), stimulus
at pain tolerance (PA4), at the two intermediate levels (PA2
and PA3), and no pain stimulus (baseline, BLN, extracted from
the pause after PA1). Depending on the classification task, the
samples of a subset are used. There are 20 samples per class
giving a total of 100 samples for each person. The placement
of the corresponding time windows for feature extraction is
determined from the stimulus temperature signal as done by
Werner et al. [7].
In our experiments with person-specific models, we test
how well the trained classifiers generalize to unseen samples
by applying 10-fold stratified cross validation for each subject.
With the generic models we test the generalization capabilities
to unseen persons, i. e. we apply leave-one-person-out cross
validation.
We use random (but stratified) partitioning of samples
for the 10-fold cross validation to reduce influence of the
sample order. The same is done for the embedded 5-fold cross validation on the training set aiming at parameter selection, for both person-specific and generic models. Each outer cross-
validation procedure has been repeated 5 times to get a more
stable result for each subject by averaging over 5 results.
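Both evaluation protocols can be sketched with scikit-learn (an assumption; clf would be the random forest of Sect. IV, and the embedded parameter search is omitted here for brevity).

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, StratifiedKFold, cross_val_score

def evaluate(clf, X, y, person_ids):
    # Person-specific models: 10-fold stratified CV within each subject
    person_specific = []
    for pid in np.unique(person_ids):
        rows = person_ids == pid
        cv = StratifiedKFold(n_splits=10, shuffle=True)  # random but stratified partitioning
        person_specific.append(cross_val_score(clf, X[rows], y[rows], cv=cv).mean())

    # Generic models: leave-one-person-out CV across all subjects
    generic = cross_val_score(clf, X, y, groups=person_ids, cv=LeaveOneGroupOut())

    return np.mean(person_specific), generic.mean()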
A. Improvements in Video-Based Pain Recognition
Table I compares the state of the art method in video-
based pain detection [7] with the improvements proposed
in this paper, namely employing random forest ensemble
learning instead of a support vector machine (SVM), using a
better facial feature point detection algorithm, extracting less
noisy frame level features, and applying low-pass Butterworth
filtering instead of median filtering on the temporal signals.
The table shows the mean accuracy across all subjects (in %)
for several classification tasks (e. g. BLN, no pain versus PA4,
the highest pain level, or BLN versus PA1, the lowest pain
level), each for person-specific and generic classifier models.
To ensure full comparability with our new results, the
method of Werner et al. [7] has been applied on the same
data. Replacing the SVM of [7] with a random forest already
results in a significant performance gain for the high pain
intensities PA4 and PA3 for both person-specific and generic
models. Using the new features improves the results further,
reaching an overall improvement of about 5-6% for PA4 and
PA3 and 3-4 % for PA2. For the lowest pain intensity PA1
TABLE II. GENERALIZATION CAPABILITY FOR PAIN DETECTION WITH MULTI-MODAL DATA FUSION. MEAN ACCURACIES ACROSS SUBJECTS. FOR EACH COLUMN, THE BEST RESULT IS BOLD, AND THE BEST OF THE RESPECTIVE CATEGORY (VIDEO RESP. BIOSIGNALS) IS UNDERLINED.

                      BLN vs. PA4    BLN vs. PA3    BLN vs. PA2    BLN vs. PA1
Used features         p.-s.  gen.    p.-s.  gen.    p.-s.  gen.    p.-s.  gen.
Facial expression     74.9   70.8    67.1   62.1    57.1   53.7    49.3   50.7
Head movement         69.9   64.6    64.1   57.8    56.6   54.5    46.4   51.0
All video             76.6b  71.6a   69.1b  62.9b   58.0a  54.4    49.3   51.2
GSR                   71.9   73.8    64.0   65.9b   57.9   60.2a   47.6   55.4
EMG                   63.1   57.9    55.9   52.7    51.3   49.6    46.3   49.0
ECG                   64.0   62.0    60.0   56.5    54.5   51.6    49.6   48.7
All biosignals        75.6b  74.1    65.5   65.0    58.7   59.3    49.1   54.9
All video + bio       80.6c  77.8c   72.0c  67.7    60.5   60.0    49.6   54.6

a Significantly better (p < 0.05) than each other result in the category (video or bio)
b Highly significantly better (p < 0.01) than each other result in the category
c Highly significantly better (p < 0.01) than each other result in the column
the generalization capability is not significantly above chance,
i. e. the method is not able to distinguish whether unseen
samples are no pain or on the pain threshold. However, this
is not surprising as most of the subjects do not visibly react
at all to this low pain stimulus (also see Werner et al. [7]).
Usually, the person-specific models perform better than the
corresponding generic model, which is in line with facial
expression recognition research (e. g. [11]). Interestingly, it
is different for BLN vs. PA1. Probably, the missing difference between the classes leads to severe overfitting of the person-specific models, as the sample sets for training are much smaller there.
As already pointed out in [7] the predictive performance
varies strongly between subjects, which is mainly due to differ-
ences in expressiveness, age and other person-specific factors.
So the standard deviation between subjects is not meaningful
for assessing the uncertainty of the results. However, the statis-
tical significance of an improvement can be tested with a paired
samples t-test [28] or a permutation test [29]. We decided to
use the permutation test, as it makes fewer assumptions and
provides greater accuracy [29]. To check the hypothesis that
algorithm A is better than B, we permuted the assignments of
results to the algorithms 50,000 times. For each permutation,
we calculated the difference of the mean result of algorithm
A and the mean result of B, yielding the distribution of
the performance differences under the assumption that both
algorithms perform equally well. Then p is the probability of differences being greater than or equal to the observed difference, determined from the aforementioned distribution. Low values of p contradict the hypothesis that both algorithms perform equally. Thus, the lower p, the more significant is the tested improvement.
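The described paired permutation test can be sketched as follows; the per-subject accuracies of the two algorithms are the input and the one-sided p-value is returned.

import numpy as np

def paired_permutation_test(acc_a, acc_b, n_perm=50000, seed=None):
    rng = np.random.default_rng(seed)
    a = np.asarray(acc_a, dtype=float)   # per-subject results of algorithm A
    b = np.asarray(acc_b, dtype=float)   # per-subject results of algorithm B
    observed = a.mean() - b.mean()       # observed difference of the mean results

    count = 0
    for _ in range(n_perm):
        swap = rng.random(len(a)) < 0.5  # randomly permute the A/B assignment per subject
        diff = np.where(swap, b, a).mean() - np.where(swap, a, b).mean()
        if diff >= observed:             # difference at least as large as the observed one
            count += 1
    return count / n_perm                # one-sided p-value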
B. Multi-Modal Data Fusion
Table II lists the results of our multi-modal data fusion
experiments. It compares mean accuracies across subjects for several modalities and feature (sub)sets used for classification.
These accuracies are given for the four pain intensity levels,
each with person-specific and generic models.
In terms of intensity and type of models, the general
observations are the same as in the previous section. The more intense the pain stimulus, the better the performance.
Fig. 3. Comparison of mean accuracies of the video-only approach by Werner et al. [7] and the data fusion of video and biomedical signals, shown for person-specific and generic models and the pain intensities PA1 to PA4. Fusion results are significantly better with (+) p < 0.05, (∗) p < 0.001, (∗∗) p ≤ 0.0001.
More pain leads to more behavioral and physiological
feedback, which is easier to detect. The lowest pain stimulus
(PA1) is almost undetectable with our system. The reactions
are either too subtle to be distinguished from noise, or not present at all, as can often be observed for the facial expression in PA1. Generally, person-specific models perform better than
generic ones, except for PA1 as already described above.
When considering the video-based recognition, the per-
formance increases in most cases when facial expression is
combined with head movement information. For high intensity
models the effect is statistically significant (person-specific: PA4 and PA3 with p ≈ 0.001; generic: PA4 with p = 0.02 and PA3 with p = 0.01).
In pain detection from biosignals, the role of the GSR is particularly interesting. It is the only modality whose recognition rate generally does not benefit from person-specific models, suggesting that the GSR feedback is not as person-specific as the other modalities, i. e. the ratio of intra- to inter-person variation is more favorable. Thus, the generalization performance of the generic GSR-based model can benefit from the higher training sample count available for generic model training. For generic models and low pain intensities, using only GSR features is even better than all data fusion trials. However, for high pain intensities the other modalities can contribute further information, leading to highly significant performance gains. E. g. for PA4, person-specific models using all biosignal features perform, on average, highly significantly better than those using only GSR features (75.6% vs. 71.9%, p = 0.0007).
Further, when using all video and biosignal features the results
are also highly significantly better than with each of the
considered feature subsets for high intensity tests (person-
specific: PA4 and PA3 with p < 0.001, generic: PA4 with
p < 0.003). These results show the capabilities of multi-modal data fusion in pain recognition. Nevertheless, the combination of all modalities does not yield the best results in all cases, calling for more advanced data fusion methods that can better exploit
the strengths of the individual modalities.
Figure 3 visually compares the generalization capabilities of the approach by Werner et al. [7] with those of the data fusion of video and physiological signals. In this overall comparison, all fusion results are significantly better, most of them with the highest significance.
VI. CONCLUSION
The problem of automatically recognizing pain is challeng-
ing, because the feeling and expression of pain is influenced
by many factors, e. g. the personality, the social context, the
source of pain or previous experiences. This work took a
step towards increasing the validity and reliability of existing
pain recognition approaches by combining multiple sources of
information, namely video (facial expression, head pose) and
biomedical signals (GSR, trapezius muscle EMG and ECG).
We proposed advancements to the video-based method by
Werner et al. [7]. First, modifications in feature extraction
reduced the level of noise in the features. Second, we proposed
to use a random forest instead of a support vector machine.
Each of the two changes resulted in a significant improvement in
the classification rates. Further, we suggested a set of features
to extract from the biomedical signals. The classification per-
formance was evaluated in detail, comparing the separate use
of each information source, and their combination. The results
improved for most combinations. For high pain intensity the
improvement was of high statistical significance. Although
the combination of video and biomedical signals performed
significantly better than the state-of-the-art approach [7], the
results showed that the data fusion method is not performing
well for the low pain intensities. This calls for the application
of more advanced data fusion methods. Further, we will work
towards taking the step from pain detection to measurement
of pain intensity.
ACKNOWLEDGMENT
This work was funded by the German Research Foundation
(DFG), project AL 638/3-1 AOBJ 585843. We also thank
Prof. Adriano Andrade for his support with processing of the
biomedical signals.
REFERENCES
[1] M. Serpell, Ed., Handbook of Pain Management. Springer, 2008.
[2] H. McQuay, A. Moore, and D. Justins, “Treating acute pain in
hospital.” BMJ: British Medical Journal, vol. 314, no. 7093, p. 1531,
1997.
[3] H. Kehlet, “Acute pain control and accelerated postoperative surgical
recovery,” Surgical Clinics of North America, vol. 79, no. 2, pp.
431–443, Apr. 1999.
[4] S. M. G. Zwakhalen, J. P. H. Hamers, H. H. Abu-Saad, and M. P. F.
Berger, “Pain in elderly people with severe dementia: A systematic
review of behavioural pain assessment tools,” BMC Geriatrics, vol. 6,
no. 1, p. 3, Jan. 2006.
[5] J. Strong, A. M. Unruh, A. Wright, and G. D. Baxter, Pain: a textbook
for therapists. Churchill Livingstone Edinburgh, Scotland, 2002.
[6] A. Williams, H. T. O. Davies, and Y. Chadury, “Simple pain rating
scales hide complex idiosyncratic meanings,” Pain, vol. 85, no. 3, pp.
457–463, Apr. 2000.
[7] P. Werner, A. Al-Hamadi, R. Niese, S. Walter, S. Gruss, and H. C.
Traue, “Towards pain monitoring: Facial expression, head pose, a
new database, an automatic system and remaining challenges,” in
Proceedings of the British Machine Vision Conference, 2013.
[8] S. Walter, P. Werner, S. Gruss, H. Ehleiter, J. Tan, H. C. Traue, A. Al-
Hamadi, A. O. Andrade, G. Moreira da Silva, and S. Crawcour, “The
BioVid heat pain database: Data for the advancement and systematic
validation of an automated pain recognition system,” in IEEE Interna-
tional Conference on Cybernetics (CYBCONF), 2013, pp. 128–131.
[9] P. Lucey, J. F. Cohn, K. M. Prkachin, P. E. Solomon, S. Chew, and
I. Matthews, “Painful monitoring: Automatic pain monitoring using the
UNBC-McMaster shoulder pain expression archive database,” Image
and Vision Computing, vol. 30, no. 3, pp. 197–205, Mar. 2012.
[10] K. M. Prkachin, “The consistency of facial expressions of pain: a
comparison across modalities,” Pain, vol. 51, no. 3, pp. 297–306, Dec.
1992.
[11] J. Chen, X. Liu, P. Tu, and A. Aragones, “Person-specific expression
recognition with transfer learning,” in Image Processing (ICIP),
2012 19th IEEE International Conference on, Orlando, 2012, pp.
2621–2624.
[12] Z. Hammal and J. F. Cohn, “Automatic detection of pain intensity,” in
Proceedings of the 14th ACM International Conference on Multimodal
Interaction. New York: ACM, 2012, pp. 47–52.
[13] R. Niese, A. Al-Hamadi, A. Panning, D. Brammen, U. Ebmeyer, and
B. Michaelis, “Towards pain recognition in post-operative phases using
3D-based features from video and support vector machines,” Inter-
national Journal of Digital Content Technology and its Applications,
vol. 3, no. 4, pp. 21–33, 2009.
[14] P. Werner, A. Al-Hamadi, and R. Niese, “Pain recognition and intensity
rating based on comparative learning,” in Image Processing (ICIP),
2012 19th IEEE International Conference on, Orlando, 2012, pp. 2313–
2316.
[15] Z. Hammal and M. Kunz, “Pain monitoring: A dynamic and context-
sensitive system,” Pattern Recognition, vol. 45, no. 4, pp. 1265–1280,
Apr. 2012.
[16] G. C. Littlewort, M. S. Bartlett, and K. Lee, “Automatic coding of
facial expressions displayed during posed and genuine pain,” Image
and Vision Computing, vol. 27, no. 12, pp. 1797–1803, Nov. 2009.
[17] R. Treister, M. Kliger, G. Zuckerman, I. G. Aryeh, and E. Eisenberg,
“Differentiating between heat pain intensities: The combined effect of
multiple autonomic parameters,” Pain, vol. 153, no. 9, p. 1807–1814,
2012.
[18] N. Ben-Israel, M. Kliger, G. Zuckerman, Y. Katz, and R. Edry,
“Monitoring the nociception level: a multi-parameter approach,”
Journal of Clinical Monitoring and Computing, vol. 27, no. 6, pp.
659–668, Dec. 2013.
[19] T. D. Wager, L. Y. Atlas, M. A. Lindquist, M. Roy, C.-W. Woo, and
E. Kross, “An fMRI-based neurologic signature of physical pain,” New
England Journal of Medicine, vol. 368, no. 15, p. 1388–1397, 2013.
[20] S. Walter, J. Kim, D. Hrabal, S. Crawcour, H. Kessler, and H. Traue,
“Transsituational individual-specific biopsychological classification of
emotions,” IEEE Transactions on Systems, Man, and Cybernetics:
Systems, vol. 43, no. 4, pp. 988–995, 2013.
[21] M. Schels, M. Glodek, S. Meudt, S. Scherer, M. Schmidt, G. Layher,
S. Tschechne, T. Brosch, D. Hrabal, S. Walter, H. C. Traue, G. Palm,
F. Schwenker, M. Rojc, and N. Campbell, “Multi-modal classifier-
fusion for the recognition of emotions,” in Converbal Synchrony in
Human-Machine Interaction. CRC Press, 2013.
[22] X. Xiong and F. De la Torre, “Supervised descent method and its
applications to face alignment,” IEEE Computer Vision and Pattern
Recognition, 2013.
[23] R. Niese, P. Werner, and A. Al-Hamadi, “Accurate, fast and robust real-
time face pose estimation using kinect camera,” in IEEE International
Conference on Systems, Man, and Cybernetics (SMC), 2013, pp. 487–
490.
[24] R. Niese, A. Al-Hamadi, A. Panning, and B. Michaelis, “Emotion
recognition based on 2D-3D facial feature extraction from color image
sequences,” Journal of Multimedia, vol. 5, Oct. 2010.
[25] C. Cao and S. Slobounov, “Application of a novel measure of EEG
non-stationarity as Shannon entropy of the peak frequency shifting
for detecting residual abnormalities in concussed individuals,” Clinical
Neurophysiology, vol. 122, no. 7, pp. 1314–1321, Jul. 2011.
[26] A. de Oliveira Andrade, “Decomposition and analysis of
electromyographic signals,” Ph.D. dissertation, University of Reading,
2005.
[27] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp.
5–32, Oct. 2001.
[28] W. A. Rosenkrantz, Introduction to probability and statistics for science,
engineering, and finance. CRC Press, 2009.
[29] D. S. Moore, G. P. McCabe, W. M. Duckworth, and S. L. Sclove, The
practice of business statistics: using data for decisions. W. H. Freeman,
2003.