Automatic Pain Recognition
from Video and Biomedical Signals
Philipp Werner,
Ayoub Al-Hamadi,
Robert Niese
Institute for Information Technology and Communications,
University of Magdeburg, Germany
E-Mail: {Philipp.Werner, Ayoub.Al-Hamadi}@ovgu.de
Steffen Walter,
Sascha Gruss,
Harald C. Traue
Department for Psychosomatic Medicine and Psychotherapy,
University of Ulm, Germany
Abstract—How much does it hurt? Accurate assessment of pain is very important for selecting the right treatment; however, current methods are not sufficiently valid and reliable in many cases. Automatic pain monitoring may help by providing an
objective and continuous assessment. In this paper we propose an
automatic pain recognition system combining information from
video and biomedical signals, namely facial expression, head
movement, galvanic skin response, electromyography and elec-
trocardiogram. Using the BioVid Heat Pain Database, the system
is evaluated in the task of pain detection showing significant
improvement over the current state of the art. Further, we discuss
the relevance of the modalities and compare person-specific and
generic classification models.
I. INTRODUCTION
The key to successful pain management is accurate as-
sessment [1]. If valid assessment of the pain is not possible,
a treatment may lead to problems and risks for the patient.
E.g., over-usage of opioids can slow down or even stop the patient's breathing, or it can lead to addiction [2].
Further, medication may have various adverse effects like
nausea, vomiting or constipation [2]. In contrast, the lack of
sufficient pain relief not only causes mental suffering, but it
is also associated with several pathophysiological effects, e.g.
increased blood pressure and heart rate [1]. Generally, valid
and reliable assessment is necessary to facilitate pain relief
without complications and support a faster recovery of the
patient [2], [3].
Currently, self-report is the gold-standard in pain measure-
ment, as pain is a very individual sensation. However, self-
report is not always reliable and valid, e. g. for demented
patients [4]. Furthermore, it cannot be applied at all for uncon-
scious or newborn patients. Observational and physiological
measures [5] can help in these cases. They may also help to overcome the weaknesses of simple rating scales [6], which are common practice in clinics. For adequate pain management, the assessment must be repeated regularly, especially if the patient cannot call for help by himself. To provide a continuous assessment, we work towards an automatic system for measurement and monitoring of pain [7], [8], which can alert hospital staff in a timely manner and provide additional information
for the patient’s medical record. In this paper, we present first
results in multi-modal pain recognition from video (i. e. facial
expression, head movement) and biomedical signals (i. e. ECG,
GSR, EMG).
A. Related Work
Previous work in pain recognition can be classified into video-based and biosignal-based approaches. The combination of both in a multi-modal recognition system has so far only been explored in related application domains.
1) Video: Commonly, pain recognition from video is based
on facial expressions. Lucey et al. [9] employ Active Appear-
ance Models to track and align the face based on manually
labeled key-frames. They extract shape and appearance fea-
tures and apply a Support Vector Machine (SVM) to classify
at frame level whether there is a facial expression of pain, i.e.
whether any of the pain related action units previously found
by Prkachin [10] is present. At sequence level they classify
three intensities by fusing frame level results via majority vot-
ing. As they try to mimic an expert observer, the ground truth pain intensity labels were assigned by extensively trained observers. All experiments are conducted on the UNBC-
McMaster Shoulder Pain Expression Archive Database, which
is publicly available. Chen et al. [11] compare several learning
algorithms for training person-specific classifiers on the same
database. Based on the landmarks provided by Lucey et al.
and Local Binary Pattern features, their Inductive Transfer
Learning outperforms Lucey et al. [9] on the frame level. Also
on the UNBC database, Hammal and Cohn [12] applied Log-
Normal filter based features and SVMs to classify four pain
expression intensities (no pain and three pain levels). Niese
et al. [13] distinguish pain from five other expression classes
(neutral and four basic emotions) based on facial distances
and angles taken as input of an SVM. Werner et al. [14]
extend their work by measuring the pain expression intensities
in a continuous scale and integrating gradient based features
for measuring facial wrinkles. Hammal and Kunz [15] utilize
distances and measure nasal wrinkles, which are automatically
extracted from the frontal face. Based on the Transferable Belief
Model which is designed for handling noisy and partial input,
they provide a prediction for each frame, but incorporate
dynamics using a temporal evolution model to refine the
prediction until the end of the sequence. They propose the
inclusion of context variables to bias the classifier towards
the most relevant expression. All above mentioned works
aim at predicting an observational measure, i. e. they employ
computer vision and pattern recognition methods for analyzing
facial expressions. Werner et al. [7] go a step further and try
to predict the pain stimulus that the person was subjected to.
They reveal challenges that pain recognition research has to face to gain clinical relevance, most notably the diversity of individuals, e.g. in expressiveness. Whereas some persons show strong facial reactions even for low stimulation, others show little or no facial expression even during high stimulation. More details on [7] are given below, as we advance
this work in the present paper. The issue of pain posing is
addressed by Littlewort et al. [16]. They use Gabor features
and SVMs to classify real versus posed pain at sequence level.
Their system outperforms untrained observers in this task.
2) Biomedical Signals: Pain recognition from biomedical
signals is a quite new field of research with only few contri-
butions. In a study by Treister et al. [17] tonic heat stimuli
(1 minute) were individually adjusted to induce no pain, low,
medium, and high pain in 45 healthy volunteers. Electrocardio-
gram (ECG), photoplethysmogram (PPG), and galvanic skin
response (GSR) were recorded. The linear combination of
parameters significantly differentiated not only between pain
and no pain, but also between all levels of pain (P < .001 to .02). Similarly, Ben-Israel et al. [18] developed and validated
an index for nociception levels of patients under general anes-
thesia. The combination of the physiological parameters heart
rate (HR), heart rate variability (HRV), PPG wave amplitude,
GSR level, and GSR fluctuations outperformed any individual
parameter in the evaluation of the nociceptive response. Fur-
ther, fMRI was successfully applied to differentiate between
heat pain stimuli and no pain by Wager et al. [19]. Next to
this, there is a lot of basic research with fMRI focusing on
the understanding of pain-related processes in the brain. Aside
from pain, there are more works in automatic recognition of
mental states in the field of affective computing, which use
similar methods, e. g. [20].
3) Multi-modal: To the best knowledge of the authors,
there is currently no work combining video and biomedical
signals for recognizing pain. However, there are several ap-
proaches for classifying mental states in the field of affective
computing, showing that combining these modalities can sig-
nificantly improve the classification results, e. g. [21].
B. Contributions
Our main contributions are the following (also see Fig. 1):
- This is the first work combining video and physiological data for automatic pain recognition.
- We improve the video-based system proposed by Werner et al. [7] by reducing the level of noise in the extracted features (Sect. II). Further, we propose a set of features to extract from the biomedical signals GSR, trapezius muscle EMG and ECG (Sect. III).
- We explore the effectiveness of ensemble-based classification, namely random forests (Sect. IV), for the first time in video-based pain recognition.
- We conduct experiments in pain detection with the BioVid Heat Pain Database, i.e. we try to predict whether the person was subjected to a painful stimulus. The results (Sect. V) show several significant improvements compared to the state of the art. First, the random forest outperforms the previously used support vector machine in nearly all cases. Second, the noise reduction in the video features leads to an additional improvement. Third, the results improve most significantly when data fusion with biosignals is applied.
- We further discuss the relevance of the combined modalities and the benefit of generic and person-specific models. Finally, we give a conclusion and an outlook on future work (Sect. VI).

Fig. 1. Overview of the recognition concept, experiments and the database used: heat pain stimuli at four intensity levels elicit behavioral and physiological feedback, recorded as video (color & depth) and biomedical signals (GSR, EMG, ECG) in the BioVid Heat Pain Database; features are extracted from facial expression, head movement, GSR, EMG and ECG, followed by early fusion and classification; the pain detection experiments examine the role of the modalities, person-specific versus generic models, and a comparison with [7].
C. BioVid Heat Pain Database
Our experiments are conducted with the BioVid Heat Pain
Database [7], [8] which was collected in a study with 90
participants. Heat pain was induced experimentally at the right arm (see Fig. 1) with stimuli lasting about five seconds at four different intensities. The four temperatures for stimulation
were equally distributed between the person’s pain threshold
and pain tolerance. The sub-experiment considered in this
work consists of 80 stimuli per person, i. e. 20 per pain level.
Between the stimuli there was a randomized rest of 8 to 12
seconds. The rest periods following the lowest intensity stimuli
were selected as baseline (no pain). Among other data, the
database contains high resolution frontal color video of the
face, depth map video (from Kinect camera), GSR, trapezius
muscle EMG and ECG, which we all use in our recognition
approach.
II. FEATURE EXTRACTION FROM VIDEO
This section describes the extraction of pain-relevant fea-
tures from video. The first steps are facial feature point
detection in the color video and the estimation of the head pose
from the depth map video. Based on this information, facial distances and gradient-based features are extracted at the frame level. Next, the dynamics of the frame-level features during a time window are condensed into a descriptor. This is the feature
vector later used for classification (see Sect. IV). The described
method is an advancement of the work by Werner et al. [7].
Fig. 2. Video feature extraction at frame level. (a) Measured 3D point cloud,
model fitting residuals (cyan) and nose-tip coordinate system illustrating the
determined pose. (b) Calculation of facial distances. (c) Regions for mean
gradient magnitude features (green: nasal wrinkles, blue/yellow: nasolabial
folds) based on anchors (cyan) and facial axes (white).
A. Facial Feature Point Detection and Head Pose Estimation
The facial expression analysis is based on a set of land-
marks, which we extract automatically using IntraFace by
Xiong et al. [22]. This state-of-the-art facial feature point
detector is more robust and accurate than the previously used
approach, which is reflected in the recognition rates in Section V.
To estimate the head pose, we utilize depth information.
For a volume of interest the depth map is converted into a
3D point cloud by a pinhole camera model. Afterwards, a
generic face model is registered with the measured point cloud
using a variant of the Iterative Closest Point (ICP) algorithm
as presented by Niese et al. [23]. It provides a 6D head
pose vector including the 3D position and 3D orientation (see
Fig. 2a).
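The depth-to-point-cloud conversion mentioned above is a standard back-projection; the ICP-based registration itself is described in [23]. The following minimal Python/NumPy sketch illustrates only the back-projection step; the function name and the intrinsic parameters fx, fy, cx, cy are illustrative assumptions, not values from the paper.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (in meters) into a 3D point cloud using a
    pinhole camera model; invalid (zero-depth) pixels are discarded."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx          # X = (u - c_x) * Z / f_x
    y = (v - cy) * z / fy          # Y = (v - c_y) * Z / f_y
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]
```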
B. Frame-level Facial Expression Features
For each image frame we extract a set of distance and
gradient features. They are selected to capture pain related
facial actions which have been identified by several previous
studies, e. g. by Prkachin [10]. These actions include lowering
of the brows, tightening of the lid, closing of the eyes, raising
of the cheeks and the upper lip, wrinkling of the nose, and
stretching and opening of the mouth. To uncouple facial
expression from head pose, distances are calculated in 3D,
as proposed by Niese et al. [24]. Using a pinhole camera
model, the previously detected landmarks are projected onto
the surface of the generic face model placed according to the
current head pose. From the obtained 3D points (depicted in
Fig. 2b), we calculate the distances between brows and eyes,
eyes and mouth, brows and mouth, as well as the width and
height of the mouth. In contrast to [7], we measure the closing of the eye by the distance between an upper and a lower eyelid landmark.
In addition to these distances, some facial changes are measured from the texture. Based on the landmarks we define rectangular
regions of interest (see Fig. 2c) and calculate the mean gradient
magnitude for each of these regions. This way, we measure
nasal wrinkles and the nasolabial furrows as done in [14] and
[7]. All regions are anchored by the eye center and mouth
corner points which are also utilized to define the eye axis
(upper white line in Fig. 2c) and the vertical face axis (line
between the center of the eyes and the mouth). Based on the
anchor points and the axes, the regions are placed according to
assumptions derived from empirical investigations of our data.
The width and horizontal position of the nasal wrinkles area
is determined from the innermost eyebrow landmarks.
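The gradient-based measurements can be implemented with any standard gradient operator; the sketch below uses a Sobel filter from SciPy, which is an assumption, and takes the region boundaries as given (the anchoring of the regions to landmarks and facial axes follows the description above and is not reproduced here).

```python
import numpy as np
from scipy import ndimage

def mean_gradient_magnitude(gray, roi):
    """Mean gradient magnitude inside a rectangular region of interest,
    e.g. the nasal wrinkle or nasolabial fold regions.
    roi = (top, bottom, left, right) in pixel coordinates."""
    top, bottom, left, right = roi
    patch = gray[top:bottom, left:right].astype(np.float64)
    gx = ndimage.sobel(patch, axis=1)   # horizontal gradient
    gy = ndimage.sobel(patch, axis=0)   # vertical gradient
    return float(np.mean(np.hypot(gx, gy)))
```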
C. Time-Window Descriptors
Whereas a single image may contain enough information
to estimate the intensity of a facial expression, it misses the
dynamics which we consider to contain valuable information
about the underlying feeling of pain. So, we classify time
windows rather than single images. The previously described
processing steps provide 1) a 6D vector of pose parameters per
depth frame and 2) a 13D vector of facial expression features
per color frame. For our experiments we use time windows
of 5.5 seconds length, clipping 6 discrete-time signals of pose
parameters and 13 discrete-time signals of facial expression
features. To reduce the number of dimensions, we calculate a
descriptor of each signal. We first apply a Butterworth low-pass filter (cutoff 1 Hz) for noise reduction and estimate the first and second temporal derivatives of the signal. Then, we calculate 7 statistical measures (mean, median, range, standard and median absolute deviation, interquartile and interdecile range) for each of the smoothed signal and its first and second derivatives, resulting in a 21D descriptor per signal. The 6
descriptors of the head pose signals are combined into the
head movement feature vector, the 13 descriptors of the facial
expression signals into the facial expression feature vector.
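A minimal sketch of the 21D per-signal descriptor is given below; the Butterworth filter order and the use of finite differences for the temporal derivatives are assumptions, while the statistics follow the list above.

```python
import numpy as np
from scipy.signal import butter, filtfilt
from scipy.stats import iqr

def window_descriptor(signal, fs):
    """21D descriptor of one frame-level signal over a time window:
    7 statistics of the low-pass filtered signal and of its first and
    second temporal derivatives."""
    b, a = butter(3, 1.0 / (fs / 2.0))          # low-pass, 1 Hz cutoff
    smooth = filtfilt(b, a, signal)
    d1 = np.gradient(smooth) * fs               # first temporal derivative
    d2 = np.gradient(d1) * fs                   # second temporal derivative

    def stats(x):
        return [np.mean(x), np.median(x), np.ptp(x),             # mean, median, range
                np.std(x), np.median(np.abs(x - np.median(x))),  # std and median absolute deviation
                iqr(x), iqr(x, rng=(10, 90))]                    # interquartile and interdecile range

    return np.array(stats(smooth) + stats(d1) + stats(d2))
```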
III. FEATURE EXTRACTION FROM BIOMEDICAL SIGNALS
This section describes feature extraction from GSR, EMG
and ECG. All signals $x_i$ with $i \in [1, N]$ were recorded and are processed at a sampling rate of 512 Hz. As for the extraction
of video-based features, only a time-window of 5.5 seconds
during the stimulus is considered.
A. Galvanic Skin Response (GSR)
From the GSR signal we extract amplitude and variability
features, namely the peak (maximum), the range, the stan-
dard deviation, the inter-quartile range, the root mean square,
the mean value of local maxima and the mean value of
local minima. Further, we extract the mean absolute value
$\mathrm{mav}(x) = \frac{1}{N}\sum_{i=1}^{N}|x_i|$, the mean of the absolute values of the first differences $\mathrm{mavfd}(x) = \frac{1}{N-1}\sum_{i=1}^{N-1}|x_{i+1} - x_i|$, and the mean of the absolute values of the second differences $\mathrm{mavsd}(x) = \frac{1}{N-2}\sum_{i=1}^{N-2}|x_{i+2} - x_i|$. The features $\mathrm{mavfd}(x)$ and $\mathrm{mavsd}(x)$ are calculated for both the raw signal $x$ and the standardized signal, i.e. the signal converted to a z-score. Further, we extract stationarity features, i. e. the degree
of stationarity in the spectrum domain [25], and the variation
of the first and second moment of the signal over time [25].
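A sketch of the amplitude and variability features (the stationarity features of [25] are omitted) could look as follows; implementation details, e.g. the use of scipy.signal.argrelextrema for local extrema, are assumptions rather than the authors' code.

```python
import numpy as np
from scipy.signal import argrelextrema

def gsr_features(x):
    """Amplitude and variability features of a GSR time window."""
    z = (x - np.mean(x)) / np.std(x)                    # standardized signal (z-score)
    mavfd = lambda s: np.mean(np.abs(np.diff(s)))       # mean abs. value of first differences
    mavsd = lambda s: np.mean(np.abs(s[2:] - s[:-2]))   # mean abs. value of second differences
    maxima = argrelextrema(x, np.greater)[0]
    minima = argrelextrema(x, np.less)[0]
    return {
        "peak": np.max(x), "range": np.ptp(x), "std": np.std(x),
        "iqr": np.percentile(x, 75) - np.percentile(x, 25),
        "rms": np.sqrt(np.mean(x ** 2)),
        "mean_local_max": np.mean(x[maxima]) if maxima.size else np.max(x),
        "mean_local_min": np.mean(x[minima]) if minima.size else np.min(x),
        "mav": np.mean(np.abs(x)),
        "mavfd_raw": mavfd(x), "mavfd_std": mavfd(z),
        "mavsd_raw": mavsd(x), "mavsd_std": mavsd(z),
    }
```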
B. Electromyography (EMG) at trapezius muscle
First, we filter the EMG signal with a Butterworth band-
pass filter (20-250 Hz). The resulting signal is further denoised
by the method of Andrade [26], which is based on Empir-
ical Mode Decomposition. Next, time windows of activity
are selected by thresholding the envelope of the denoised
signal [26]. Only these activity windows are considered for
feature extraction. The above listed amplitude, variability and
stationarity features are calculated here as well. Additionally,
we extract the following frequency features: mode and mean
frequency of the signal’s power spectrum, the width and central
frequency of the band, whose cut-off frequencies are given
by the upper and lower -3 dB points of the power spectrum.
Further, the median frequency of the band (which divides the
spectrum in two areas with equal power) and the count of
zero-crossings of the time-domain signal are extracted.
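The frequency features can be derived from a power spectrum estimate of each activity window; the sketch below uses Welch's method, which is an assumption (the paper does not state how the power spectrum is estimated).

```python
import numpy as np
from scipy.signal import welch

def emg_frequency_features(x, fs=512):
    """Frequency features of one EMG activity window."""
    f, pxx = welch(x, fs=fs, nperseg=min(len(x), 512))
    mode_freq = f[np.argmax(pxx)]                       # frequency with maximum power
    mean_freq = np.sum(f * pxx) / np.sum(pxx)           # power-weighted mean frequency
    cum = np.cumsum(pxx)
    median_freq = f[np.searchsorted(cum, cum[-1] / 2)]  # splits the spectrum into equal-power halves
    band = f[pxx >= np.max(pxx) / 2]                    # -3 dB points (half of maximum power)
    bandwidth = band[-1] - band[0]
    central_freq = (band[-1] + band[0]) / 2
    zero_crossings = int(np.count_nonzero(np.diff(np.signbit(x))))
    return mode_freq, mean_freq, median_freq, bandwidth, central_freq, zero_crossings
```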
C. Electrocardiogram (ECG)
The ECG signal is first filtered with a Butterworth band-
pass filter (0.1-250 Hz). Afterwards we extract three heart rate
variability features from the signal: 1) the mean of the RR
intervals, i. e. the arithmetic mean of the time in between
consecutive heart beats, 2) the root mean square of the suc-
cessive differences (RMSSD), i.e. the quadratic mean of the
differences of consecutive RR intervals [9], and 3) the slope
of the linear regression of RR intervals in its time series, i. e.
a measure of acceleration of the heart rate.
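Assuming the R-peaks have already been detected (R-peak detection is not part of this sketch), the three heart rate variability features can be computed as follows.

```python
import numpy as np

def hrv_features(r_peak_times):
    """Heart rate variability features from R-peak times (in seconds)
    within the analysis window."""
    rr = np.diff(r_peak_times)                        # RR intervals between consecutive beats
    mean_rr = np.mean(rr)                             # 1) mean RR interval
    rmssd = np.sqrt(np.mean(np.diff(rr) ** 2))        # 2) RMSSD of successive RR differences
    slope = np.polyfit(r_peak_times[1:], rr, 1)[0]    # 3) linear trend of the RR time series
    return mean_rr, rmssd, slope
```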
IV. FUSION AND CLASSIFICATION
The next steps after feature extraction are data fusion
and classification of the stimulus. In this work, we apply an
early fusion model, i. e. the feature vectors of the different
modalities are concatenated. The resulting higher-dimensional
feature vectors are standardized per person, i. e. each variable
is converted to a z-score based on the corresponding person-
specific mean and standard deviation. The transformed feature
vectors are passed to the classifier, which is a random forest
in this work.
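A minimal sketch of this early fusion and per-person standardization is given below; variable names are illustrative.

```python
import numpy as np

def fuse_and_standardize(feature_blocks, person_ids):
    """Concatenate per-modality feature matrices (early fusion) and
    convert each variable to a z-score per person."""
    X = np.concatenate(feature_blocks, axis=1)        # samples x (sum of modality dimensions)
    X_std = np.empty_like(X, dtype=np.float64)
    for pid in np.unique(person_ids):
        rows = person_ids == pid
        mu = X[rows].mean(axis=0)
        sigma = X[rows].std(axis=0)
        sigma[sigma == 0] = 1.0                       # guard against constant features
        X_std[rows] = (X[rows] - mu) / sigma          # person-specific z-score
    return X_std
```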
A. Random Forest Classifier Ensemble
The random forest classifier [27] is an ensemble of several
decision trees. Given a test pattern, each tree is evaluated
and the final prediction is made by majority voting. Each
of the trees is trained on a randomly selected subset of the
training data. For each node of the tree, only a randomly
selected subspace of the feature space is considered to find
the optimal split. Further, the trees are fully grown and not
pruned. Random forests have been successfully applied in many application domains and combine many advantages, e.g. high predictive performance and fast training and prediction.
B. Training Strategy
To maintain comparability, we choose the same training
strategy as Werner et al. [7]. We first apply a grid search with
5-fold stratified cross validation on the training set to select
optimal values for the random forest training parameters. As
we observed that it is the only critical parameter, we only
searched for the optimal variable count to use for splitting at
the nodes. We fixed the number of trees to 100, the maximum
depth to 6 and the minimum sample count for node splitting
to 5. After parameter search, the random forest is trained with
the whole training set and the selected parameters.
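With scikit-learn (an assumption; the paper does not name its implementation), the described training strategy could be sketched as follows; the candidate values for the variable count per split are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def train_random_forest(X_train, y_train):
    """Grid search with 5-fold stratified cross validation over the number
    of variables considered per split; 100 trees, maximum depth 6 and a
    minimum sample count of 5 for node splitting are kept fixed."""
    base = RandomForestClassifier(n_estimators=100, max_depth=6,
                                  min_samples_split=5, random_state=0)
    grid = {"max_features": ["sqrt", 0.1, 0.25, 0.5]}  # fraction of features per split (illustrative)
    search = GridSearchCV(base, grid,
                          cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
                          scoring="accuracy")
    search.fit(X_train, y_train)        # refits the best model on the whole training set
    return search.best_estimator_
```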
TABLE I. GENERALIZATION CAPABILITY FOR VIDEO-BASED PAIN DETECTION. MEAN ACCURACY ACROSS SUBJECTS FOR PERSON-SPECIFIC AND GENERIC MODELS AND THE FOUR PAIN INTENSITIES.

                     BLN vs. PA4   BLN vs. PA3   BLN vs. PA2   BLN vs. PA1
Approach             p.-s.  gen.   p.-s.  gen.   p.-s.  gen.   p.-s.  gen.
Werner et al. [7]    71.7   66.3   63.8   56.4   54.1   51.7   47.6   50.1
+ Rand. Forest       73.7+  68.9+  66.6*  60.8*  55.8-  53.1   46.9   51.2
+ New Features       76.6*  71.6*  69.1*  62.9*  58.0*  54.4+  49.3   51.2

Significantly better than Werner et al. [7], with: * p < 0.001, + p < 0.01, - p < 0.05
V. EXPERIMENTS AND RESULTS
All experiments are conducted with the BioVid Heat Pain
Database. Due to technical problems during recording, data of
some modalities is missing for some subjects. We only use
the 87 subjects for which all data is available. The sample set
consists of 5 classes: stimulus at pain threshold (PA1), stimulus
at pain tolerance (PA4), at the two intermediate levels (PA2
and PA3), and no pain stimulus (baseline, BLN, extracted from
the pause after PA1). Depending on the classification task, the
samples of a subset are used. There are 20 samples per class
giving a total of 100 samples for each person. The placement
of the corresponding time windows for feature extraction is
determined from the stimulus temperature signal as done by
Werner et al. [7].
In our experiments with person-specific models, we test
how well the trained classifiers generalize to unseen samples
by applying 10-fold stratified cross validation for each subject.
With the generic models we test the generalization capabilities
to unseen persons, i. e. we apply leave-one-person-out cross
validation.
We use random (but stratified) partitioning of samples
for the 10-fold cross validation to reduce the influence of the sample order. The same is done for the embedded 5-fold cross validation on the training set, which aims at parameter selection, for both person-specific and generic models. Each outer cross-
validation procedure has been repeated 5 times to get a more
stable result for each subject by averaging over 5 results.
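A condensed sketch of this evaluation protocol (omitting the 5 repetitions and the embedded parameter search) is shown below, again assuming scikit-learn.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, StratifiedKFold, cross_val_score

def evaluate(X, y, person_ids, make_classifier):
    """Leave-one-person-out cross validation for the generic model and
    10-fold stratified cross validation per subject for the person-specific models."""
    generic = cross_val_score(make_classifier(), X, y, groups=person_ids,
                              cv=LeaveOneGroupOut()).mean()
    person_specific = np.mean([
        cross_val_score(make_classifier(), X[person_ids == pid], y[person_ids == pid],
                        cv=StratifiedKFold(n_splits=10, shuffle=True)).mean()
        for pid in np.unique(person_ids)
    ])
    return person_specific, generic
```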
A. Improvements in Video-Based Pain Recognition
Table I compares the state of the art method in video-
based pain detection [7] with the improvements proposed
in this paper, namely employing random forest ensemble
learning instead of a support vector machine (SVM), using a
better facial feature point detection algorithm, extracting less
noisy frame level features, and applying low-pass Butterworth
filtering instead of median filtering on the temporal signals.
The table shows the mean accuracy across all subjects (in %)
for several classification tasks (e. g. BLN, no pain versus PA4,
the highest pain level, or BLN versus PA1, the lowest pain
level), each for person-specific and generic classifier models.
To ensure full comparability with our new results, the
method of Werner et al. [7] has been applied on the same
data. Replacing the SVM of [7] with a random forest already
results in a significant performance gain for the high pain
intensities PA4 and PA3 for both person-specific and generic
models. Using the new features improves the results further,
reaching an overall improvement of about 5-6% for PA4 and
PA3 and 3-4 % for PA2. For the lowest pain intensity PA1
TABLE II. GENERALIZATION CAPABILITY FOR PAIN DETECTION WITH MULTI-MODAL DATA FUSION. MEAN ACCURACIES ACROSS SUBJECTS. FOR EACH COLUMN, THE BEST RESULT IS BOLD, AND THE BEST OF THE RESPECTIVE CATEGORY (VIDEO RESP. BIOSIGNALS) IS UNDERLINED.

                     BLN vs. PA4   BLN vs. PA3   BLN vs. PA2   BLN vs. PA1
Used features        p.-s.  gen.   p.-s.  gen.   p.-s.  gen.   p.-s.  gen.
Facial expression    74.9   70.8   67.1   62.1   57.1   53.7   49.3   50.7
Head movement        69.9   64.6   64.1   57.8   56.6   54.5   46.4   51.0
All video            76.6b  71.6a  69.1b  62.9b  58.0a  54.4   49.3   51.2
GSR                  71.9   73.8   64.0   65.9b  57.9   60.2a  47.6   55.4
EMG                  63.1   57.9   55.9   52.7   51.3   49.6   46.3   49.0
ECG                  64.0   62.0   60.0   56.5   54.5   51.6   49.6   48.7
All biosignals       75.6b  74.1   65.5   65.0   58.7   59.3   49.1   54.9
All video + bio      80.6c  77.8c  72.0c  67.7   60.5   60.0   49.6   54.6

a Significantly better (p < 0.05) than each other result in the category (video or bio)
b Highly significantly better (p < 0.01) than each other result in the category
c Highly significantly better (p < 0.01) than each other result in the column
the generalization capability is not significantly above chance,
i. e. the method is not able to distinguish whether unseen
samples are no pain or on the pain threshold. However, this
is not surprising as most of the subjects do not visibly react
at all to this low pain stimulus (also see Werner et al. [7]).
Usually, the person-specific models perform better than the
corresponding generic model, which is in line with facial
expression recognition research (e. g. [11]). Interestingly, it
is different for BLN vs. PA1. Probably, the missing difference between the classes leads to severe overfitting of the person-specific models, as the sample sets for training are much smaller there.
As already pointed out in [7], the predictive performance
varies strongly between subjects, which is mainly due to differ-
ences in expressiveness, age and other person-specific factors.
So the standard deviation between subjects is not meaningful
for assessing the uncertainty of the results. However, the statis-
tical significance of an improvement can be tested with a paired
samples t-test [28] or a permutation test [29]. We decided to
use the permutation test, as it makes fewer assumptions and
provides greater accuracy [29]. To check the hypothesis that
algorithm A is better than B, we permuted the assignments of
results to the algorithms 50,000 times. For each permutation,
we calculated the difference of the mean result of algorithm
A and the mean result of B, yielding the distribution of
the performance differences under the assumption that both
algorithms perform equally well. Then p is the probability of differences being greater than or equal to the observed difference, determined from the aforementioned distribution. Low values of p contradict the hypothesis that both algorithms perform equally. Thus, the lower p, the more significant the tested improvement.
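The permutation test described above can be implemented directly; the sketch below follows that description (the per-subject results of the two algorithms are randomly swapped and the resulting mean differences are compared with the observed one).

```python
import numpy as np

def permutation_test(acc_a, acc_b, n_perm=50000, seed=0):
    """One-sided paired permutation test for the hypothesis that
    algorithm A performs better than algorithm B."""
    rng = np.random.default_rng(seed)
    acc_a, acc_b = np.asarray(acc_a, float), np.asarray(acc_b, float)
    observed = acc_a.mean() - acc_b.mean()
    count = 0
    for _ in range(n_perm):
        swap = rng.random(acc_a.size) < 0.5            # swap the assignment per subject
        a = np.where(swap, acc_b, acc_a)
        b = np.where(swap, acc_a, acc_b)
        if a.mean() - b.mean() >= observed:            # difference at least as large as observed
            count += 1
    return count / n_perm                              # p-value
```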
B. Multi-Modal Data Fusion
Table II lists the results of our multi-modal data fusion
experiments. It compares mean accuracies across subjects for
several modality and feature (sub)sets used for classification.
These accuracies are given for the four pain intensity levels,
each with person-specific and generic models.
In terms of intensity and type of models, the general observations are the same as in the previous section. The more intense the pain stimulus, the better the performance: more pain leads to more behavioral and physiological feedback, which is easier to detect. The lowest pain stimulus (PA1) is almost undetectable with our system. The reactions are either too subtle to be distinguished from noise or not present at all, as can often be observed for facial expression in PA1. Generally, person-specific models perform better than generic ones, except for PA1, as already described above.

Fig. 3. Comparison of mean accuracies of the video-only approach by Werner et al. [7] and the data fusion of video and biomedical signals, for person-specific and generic models at each pain intensity. Fusion results are significantly better with (+) p < 0.05, (∗) p < 0.001, (∗∗) p ≤ 0.0001.
When considering the video-based recognition, the per-
formance increases in most cases when facial expression is
combined with head movement information. For high intensity
models the effect is statistically significant (person-specific: PA4 and PA3 with p ≤ 0.001; generic: PA4 with p = 0.02 and PA3 with p = 0.01).
In pain detection from biosignals, the role of GSR is particularly interesting. It is the only modality whose recognition rate generally does not benefit from person-specific models, suggesting that the GSR feedback is not as person-specific as the other modalities, i.e. the ratio of intra- to inter-person variation is more favorable. Hence, the generalization performance of the generic GSR-based model can benefit from the higher training sample count available for generic model training. For generic models and low pain intensities, using only GSR features is even better than all data fusion trials. However, for high pain intensities the other modalities can contribute further information, leading to highly significant performance gains. E.g. for PA4, person-specific models using all biosignal features perform on average highly significantly better than those using only GSR features (75.6% vs 71.9%, p = 0.0007). Further, when using all video and biosignal features, the results are also highly significantly better than with each of the considered feature subsets for the high intensity tests (person-specific: PA4 and PA3 with p < 0.001, generic: PA4 with p < 0.003). These results show the capabilities of multi-modal data fusion in pain recognition. Nevertheless, the combination of all modalities does not yield the best results in all cases, calling for more advanced data fusion methods that can better exploit the strengths of the individual modalities.
Figure 3 visually compares the generalization capabilities
of the approach by Werner et al. [7] with those of the data fusion of video and physiological signals. In the overall
comparison all fusion results are significantly better, most of
them with highest significance.
VI. CONCLUSION
The problem of automatically recognizing pain is challeng-
ing, because the feeling and expression of pain is influenced
by many factors, e. g. the personality, the social context, the
source of pain or previous experiences. This work took a
step towards increasing the validity and reliability of existing
pain recognition approaches by combining multiple sources of
information, namely video (facial expression, head pose) and
biomedical signals (GSR, trapezius muscle EMG and ECG).
We proposed advancements to the video-based method by
Werner et al. [7]. First, modifications in feature extraction
reduced the level of noise in the features. Second, we proposed
to use a random forest instead of a support vector machine.
Each of these changes resulted in a significant improvement in
the classification rates. Further, we suggested a set of features
to extract from the biomedical signals. The classification per-
formance was evaluated in detail, comparing the separate use
of each information source, and their combination. The results
improved for most combinations. For high pain intensity the
improvement was of high statistical significance. Although
the combination of video and biomedical signals performed
significantly better than the state-of-the-art approach [7], the
results showed that the data fusion method is not performing
well for the low pain intensities. This calls for the application
of more advanced data fusion methods. Further, we will work
towards taking the step from pain detection to measurement
of pain intensity.
ACKNOWLEDGMENT
This work was funded by the German Research Foundation
(DFG), project AL 638/3-1 AOBJ 585843. We also thank
Prof. Adriano Andrade for his support with processing of the
biomedical signals.
REFERENCES
[1] M. Serpell, Ed., Handbook of Pain Management. Springer, 2008.
[2] H. McQuay, A. Moore, and D. Justins, “Treating acute pain in
hospital.” BMJ: British Medical Journal, vol. 314, no. 7093, p. 1531,
1997.
[3] H. Kehlet, “Acute pain control and accelerated postoperative surgical
recovery,” Surgical Clinics of North America, vol. 79, no. 2, pp.
431–443, Apr. 1999.
[4] S. M. G. Zwakhalen, J. P. H. Hamers, H. H. Abu-Saad, and M. P. F.
Berger, “Pain in elderly people with severe dementia: A systematic
review of behavioural pain assessment tools,” BMC Geriatrics, vol. 6,
no. 1, p. 3, Jan. 2006.
[5] J. Strong, A. M. Unruh, A. Wright, and G. D. Baxter, Pain: a textbook
for therapists. Churchill Livingstone Edinburgh, Scotland, 2002.
[6] A. Williams, H. T. O. Davies, and Y. Chadury, “Simple pain rating
scales hide complex idiosyncratic meanings,” Pain, vol. 85, no. 3, pp.
457–463, Apr. 2000.
[7] P. Werner, A. Al-Hamadi, R. Niese, S. Walter, S. Gruss, and H. C.
Traue, “Towards pain monitoring: Facial expression, head pose, a
new database, an automatic system and remaining challenges,” in
Proceedings of the British Machine Vision Conference, 2013.
[8] S. Walter, P. Werner, S. Gruss, H. Ehleiter, J. Tan, H. C. Traue, A. Al-
Hamadi, A. O. Andrade, G. Moreira da Silva, and S. Crawcour, “The
BioVid heat pain database: Data for the advancement and systematic
validation of an automated pain recognition system,” in IEEE Interna-
tional Conference on Cybernetics (CYBCONF), 2013, pp. 128–131.
[9] P. Lucey, J. F. Cohn, K. M. Prkachin, P. E. Solomon, S. Chew, and
I. Matthews, “Painful monitoring: Automatic pain monitoring using the
UNBC-McMaster shoulder pain expression archive database,” Image
and Vision Computing, vol. 30, no. 3, pp. 197–205, Mar. 2012.
[10] K. M. Prkachin, “The consistency of facial expressions of pain: a
comparison across modalities,” Pain, vol. 51, no. 3, pp. 297–306, Dec.
1992.
[11] J. Chen, X. Liu, P. Tu, and A. Aragones, “Person-specific expression
recognition with transfer learning,” in Image Processing (ICIP),
2012 19th IEEE International Conference on, Orlando, 2012, pp.
2621–2624.
[12] Z. Hammal and J. F. Cohn, “Automatic detection of pain intensity,” in
Proceedings of the 14th ACM International Conference on Multimodal
Interaction. New York: ACM, 2012, pp. 47–52.
[13] R. Niese, A. Al-Hamadi, A. Panning, D. Brammen, U. Ebmeyer, and
B. Michaelis, “Towards pain recognition in post-operative phases using
3D-based features from video and support vector machines,” Inter-
national Journal of Digital Content Technology and its Applications,
vol. 3, no. 4, pp. 21–33, 2009.
[14] P. Werner, A. Al-Hamadi, and R. Niese, “Pain recognition and intensity
rating based on comparative learning,” in Image Processing (ICIP),
2012 19th IEEE International Conference on, Orlando, 2012, pp. 2313–
2316.
[15] Z. Hammal and M. Kunz, “Pain monitoring: A dynamic and context-
sensitive system,” Pattern Recognition, vol. 45, no. 4, pp. 1265–1280,
Apr. 2012.
[16] G. C. Littlewort, M. S. Bartlett, and K. Lee, “Automatic coding of
facial expressions displayed during posed and genuine pain,” Image
and Vision Computing, vol. 27, no. 12, pp. 1797–1803, Nov. 2009.
[17] R. Treister, M. Kliger, G. Zuckerman, I. G. Aryeh, and E. Eisenberg,
“Differentiating between heat pain intensities: The combined effect of
multiple autonomic parameters,” Pain, vol. 153, no. 9, pp. 1807–1814,
2012.
[18] N. Ben-Israel, M. Kliger, G. Zuckerman, Y. Katz, and R. Edry,
“Monitoring the nociception level: a multi-parameter approach,”
Journal of Clinical Monitoring and Computing, vol. 27, no. 6, pp.
659–668, Dec. 2013.
[19] T. D. Wager, L. Y. Atlas, M. A. Lindquist, M. Roy, C.-W. Woo, and
E. Kross, “An fMRI-based neurologic signature of physical pain,” New England Journal of Medicine, vol. 368, no. 15, pp. 1388–1397, 2013.
[20] S. Walter, J. Kim, D. Hrabal, S. Crawcour, H. Kessler, and H. Traue,
“Transsituational individual-specific biopsychological classification of
emotions,” IEEE Transactions on Systems, Man, and Cybernetics:
Systems, vol. 43, no. 4, pp. 988–995, 2013.
[21] M. Schels, M. Glodek, S. Meudt, S. Scherer, M. Schmidt, G. Layher,
S. Tschechne, T. Brosch, D. Hrabal, S. Walter, H. C. Traue, G. Palm,
F. Schwenker, M. Rojc, and N. Campbell, “Multi-modal classifier-
fusion for the recognition of emotions,” in Converbal Synchrony in
Human-Machine Interaction. CRC Press, 2013.
[22] X. Xiong and F. De la Torre, “Supervised descent method and its
applications to face alignment,” IEEE Computer Vision and Pattern
Recognition, 2013.
[23] R. Niese, P. Werner, and A. Al-Hamadi, “Accurate, fast and robust real-
time face pose estimation using kinect camera,” in IEEE International
Conference on Systems, Man, and Cybernetics (SMC), 2013, pp. 487–
490.
[24] R. Niese, A. Al-Hamadi, A. Panning, and B. Michaelis, “Emotion
recognition based on 2D-3D facial feature extraction from color image
sequences,” Journal of Multimedia, vol. 5, Oct. 2010.
[25] C. Cao and S. Slobounov, “Application of a novel measure of EEG
non-stationarity as Shannon entropy of the peak frequency shifting
for detecting residual abnormalities in concussed individuals,” Clinical
Neurophysiology, vol. 122, no. 7, pp. 1314–1321, Jul. 2011.
[26] A. de Oliveira Andrade, “Decomposition and analysis of
electromyographic signals,” Ph.D. dissertation, University of Reading,
2005.
[27] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp.
5–32, Oct. 2001.
[28] W. A. Rosenkrantz, Introduction to probability and statistics for science,
engineering, and finance. CRC Press, 2009.
[29] D. S. Moore, G. P. McCabe, W. M. Duckworth, and S. L. Sclove, The
practice of business statistics: using data for decisions. W. H. Freeman,
2003.