
Toward Development and Evaluation of Pain Level-Rating Scale for Emergency Triage based on Vocal Characteristics and Facial Expressions

Authors:
Fu-Sheng Tsai1, Ya-Ling Hsu1, Wei-Chen Chen1, Yi-Ming Weng2, Chip-Jin Ng2, Chi-Chun Lee1
1Department of Electrical Engineering, National Tsing Hua University, Taiwan
2Department of Emergency Medicine, Chang Gung Memorial Hospital, Taiwan
Abstract
In order to allocate healthcare resources, the triage classification system plays an important role in assessing the severity of illness of patients boarding at the emergency department. The self-report pain intensity numerical rating scale (NRS) is one of the major modifiers of the current triage system based on the Taiwan Triage and Acuity Scale (TTAS). The validity and reliability of the self-report scheme for pain level assessment is a major concern. In this study, we model the observed expressive behaviors, i.e., facial expressions and vocal characteristics, directly from audio-video recordings in order to measure pain level for patients during triage. This work demonstrates a feasible model, which achieves accuracies of 72.3% and 51.6% in binary and ternary pain intensity classification, respectively. Moreover, the study results reveal a significant association between the current model and analgesic prescription/patient disposition after adjusting for patient-reported NRS and triage vital signs.
Index Terms: behavioral signal processing (BSP), facial expressions, triage, pain scale, vocal characteristics
1. Introduction
Deriving behavioral informatics from signals, e.g., audio-video and/or physiological data recordings, offers a new paradigm for quantitative decision-making across the behavioral sciences [1]. Behavioral informatics, i.e., computational methods that measure human attributes-of-interest, are developed grounded in their desired domain applications. For example, notable algorithmic advances have been observed in the medical domain: detection of depression [2, 3], assessment of Parkinson's disease [4, 5], modeling of therapist empathy in motivational interviewing [6, 7], analysis of disorders [8, 9], etc. In this work, we carry out a research effort to objectify pain level, one of the six major regulators in the Taiwan Triage and Acuity Scale (TTAS) [10], for on-boarding emergency patients by modeling their facial expressions and vocal characteristics.
TTAS was jointly developed by the Taiwan Society of Emergency Medicine and the Critical Care Society; it modifies the Canadian Triage and Acuity Scale (CTAS) [11] by tailoring it toward Taiwan's particular medical situation. It was officially announced in 2010 by the Ministry of Health and Welfare as the triage system of Taiwan. TTAS includes six major factors for assessing severity and screening life-threatening patients: respiratory distress, circulation, consciousness level, body temperature, pain level, and injury mechanism. Specifically, the intensity of pain is currently measured by the numerical rating scale (NRS) [12, 13], a 0-10 self-report pain scale. In clinical practice, physicians and nurses have noticed the difficulty of systematically implementing this instrument, especially for elderly people, foreigners, or patients with a low education level. This often leads either to the practice of using the FACES rating scale [14], which is designed for children, or to the triage nurse selecting the level based on his/her own observations instead of soliciting an answer from the patient. Furthermore, even when the nurses succeed in carrying out the NRS, this self-report rating still suffers from various unwanted idiosyncratic factors, e.g., age and body-part dependency and inconsistent comprehension of the pain scale. These issues, centered around subjectivity in measuring pain, create deviations in the consistency and validity of the triage classification system.
Related previous works have concentrated mainly on recognizing the occurrence of pain by monitoring facial expressions. For example, Ashraf et al. [15] use an active appearance model to recognize frame-level pain, Kaltwang et al. [16] use a relevance vector regression model for continuous pain intensity estimation, and Werner et al. [17] model head pose for pain detection. In this work, we propose to include vocal characteristics in addition to facial expressions for measuring pain level. Moreover, we contribute not only in the multimodal aspect of pain level measurement but also in the realism of contextualizing the application in real medical settings. We collect data from a total of 182 real patients as they seek emergency medical service at Chang Gung Memorial Hospital¹. The data include audio-video samples during triage and follow-up sessions after treatment, vital sign (physiological) data during triage, and finally a set of clinical outcomes. The data recordings are made in real medical settings (in-the-wild), and the interactions are spontaneous in nature - all of which poses a challenging yet contextualized situation for deriving appropriate informatics.
Our proposed multimodal framework achieves 72.3% accuracy in classifying between the extreme (severe versus mild) pain levels and 51.6% accuracy in performing three-class (severe, moderate, and mild) pain level recognition. The inclusion of the audio modality is essential in improving the overall recognition rate, indicating that the intensity of pain is also reflected in the patient's vocal characteristics. Furthermore, while comparing to the so-called ground truth, i.e., the NRS, is a straightforward means of evaluating the framework, in this work we further utilize this audio-video based system in combination with the NRS and vital sign data in order to analyze clinically-relevant outcome variables, specifically analgesic prescription and patient disposition, as another evaluation scheme. We demonstrate that even after accounting for the best currently available medical instruments (physiological data and NRS), the audio-video based pain level assessment can improve the prediction of whether a doctor will end up prescribing an analgesic or ordering the patient to be hospitalized. This initial result is quite promising, as the research effort will continue to derive a novel pain-level rating and validate its ability to clinically improve the current triage classification system.
¹ IRB#: 104-3625B
Figure 1: Complete flow diagram of the proposed work. We segment the raw audio recordings manually and then extract acoustic low-level descriptors; for the video data, we apply a pre-trained constrained local neural field (CLNF) to track the (x, y) positions of the 68 facial landmark points and then extract descriptors based on pain-related facial action units. Two encoding methods, statistical functional descriptors and a k-means bag-of-words model, are used to derive a session-level feature vector. Finally, we conduct pain level recognition using a fusion of audio-video features and further analyze it with respect to clinical outcomes.
The rest of the paper is organized as follows: Section 2 describes the data collection and audio-video feature extraction, Section 3 includes the experimental setups and results, and Section 4 concludes with future work.
2. Research Methodology
2.1. Database Collection
The triage session included audio-video recordings, physiological vital sign data (heart rate, systolic and diastolic blood pressure), and other clinically-related outcomes (analgesic prescription and patient disposition) of on-boarding emergency patients at Chang Gung Memorial Hospital. We excluded pediatric and trauma patients, as well as referral patients or patients with prior treatment before arrival, and further included only patients with symptoms of chest, abdominal, lower-back, or limb pain, and headaches. Two sessions were recorded for each patient, i.e., at triage and at follow-up, where the follow-up session occurred approximately 1 hour after treatment, if any, was given to the patient. These sessions essentially involved nurses asking the patient for the location of the body pain, the NRS rating of pain intensity (0-10, where 10 means the worst pain ever), and a brief description of the type of pain felt (for example, cramps or aches); each session usually lasted around 30 seconds. The audio-video data was recorded using a Sony HDR handycam on a tripod in a designated assessment room, and the placement of the camera was set to consistently capture the patients' facial expressions.
In our current database, we have collected a total of 182 patients, each recorded at the two designated points in time. After excluding non-usable data (e.g., cases where the patient's relative responds to the pain level assessment instead of the patient, low audio-video quality due to various uncontrollable factors, or loss of either physiological data or clinical outcomes), we have a total of 205 audio-video samples from 117 unique patients, which constitutes the dataset of interest for this work. Lastly, the pain level is often grouped into three levels based on the number reported (mild: 0-3, moderate: 4-6, severe: 7-10); we adopt the same convention in this work to serve as the learning target for our signal-based pain level assessment system.
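This three-level grouping can be expressed as a trivial mapping from the self-reported 0-10 NRS value to the learning target; the sketch below (Python, assuming integer NRS inputs) simply encodes the bins stated above.

```python
def nrs_to_level(nrs: int) -> str:
    """Map a 0-10 NRS self-report to the three-level learning target."""
    if nrs <= 3:
        return "mild"      # NRS 0-3
    if nrs <= 6:
        return "moderate"  # NRS 4-6
    return "severe"        # NRS 7-10
```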
2.2. Audio-Video Feature Extraction
Figure 1 depicts the overall framework, including audio-video data preprocessing, low-level descriptor extraction, and session-level encoding. In the following sections, we briefly describe each component.
Figure 2: The red dots are the 68 facial landmarks tracked for each image. The action units shown are the ones indicated as related to pain in past literature. Lastly, the figure shows the various parameterizations of the 68 facial landmarks that we compute as video features in this work. Facial Action Coding System photos (http://www.cs.cmu.edu/~face/facs.htm).
2.2.1. Acoustic Characteristics
For each recorded session, we first perform manual segmentation on the audio file to obtain the speaking portions corresponding to the patient, the patient's relatives, and the interviewer. In this work, we concentrate only on the patient's vocal characteristics. We extract 45 low-level descriptors in total, including 13 MFCCs, 1 fundamental frequency, and 1 intensity, together with their associated delta and delta-delta, every 10 ms. This set of spectral-prosodic features is extracted due to their common usage in characterizing paralinguistic and emotion information [18]. The audio features are further z-normalized per speaker.
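To make this extraction step concrete, the sketch below computes the 45 frame-level descriptors (13 MFCCs, F0, intensity, and their deltas and delta-deltas at a 10 ms step) and applies per-speaker z-normalization. It is a minimal sketch using librosa; the exact frame settings, pitch extractor, and intensity measure used by the authors are not specified, so the choices here (YIN pitch, RMS energy as an intensity proxy, 16 kHz audio) are assumptions.

```python
import librosa
import numpy as np

def extract_llds(wav_path, sr=16000, hop_ms=10):
    """Return (T, 45) frame-level descriptors: 13 MFCCs + F0 + intensity, plus deltas and delta-deltas."""
    y, sr = librosa.load(wav_path, sr=sr)
    hop = int(sr * hop_ms / 1000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)          # (13, T)
    f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr, hop_length=hop)[np.newaxis]   # (1, T), assumed pitch extractor
    rms = librosa.feature.rms(y=y, hop_length=hop)                              # (1, T), intensity proxy
    T = min(mfcc.shape[1], f0.shape[1], rms.shape[1])
    base = np.vstack([mfcc[:, :T], f0[:, :T], rms[:, :T]])                      # (15, T)
    llds = np.vstack([base,
                      librosa.feature.delta(base),                              # deltas
                      librosa.feature.delta(base, order=2)])                    # delta-deltas
    return llds.T

def znorm_per_speaker(frames):
    """Z-normalize frame-level descriptors within one speaker's data."""
    mu, sigma = frames.mean(axis=0), frames.std(axis=0) + 1e-8
    return (frames - mu) / sigma
```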
2.2.2. Facial Expressions
On the video side, for each session, we first apply constrained local neural fields (CLNF) [19] as a pre-processing step. CLNF tracks the positions of a patient's 68 facial landmarks based on the Active Orientation Model (AOM) [20], which is an extension of the Active Appearance Model for describing the shape and appearance of a face. CLNF, i.e., an instance of the constrained local model, essentially involves three major technical components: a point distribution model (describing the position of feature points in an image), local neural field patch experts (a layered unidirectional graphical model), and an optimization fitting approach (non-uniform regularized landmark mean-shift fitting). By applying CLNF, we are able to track the 68 feature points (Figure 2), e.g., around the face, eye, and nose contours, for each patient in each frame of the recorded video session.
Table 1: Unweighted Average Recall (UAR) obtained in Exp I. 2-Class indicates the binary classification task between the extreme pain levels (severe versus mild). 3-Class indicates the ternary classification between severe, moderate, and mild pain levels. For the multimodal fusion rows, results are reported as early-fusion / late-fusion; the best accuracy within each task is achieved by the FuncA+BoWV fusion.

                        2-Class        3-Class
Chance                  50.0           33.3
Audio-Only, Functional  67.9           46.0
Audio-Only, BoW         61.3           42.9
Video-Only, Functional  55.9           40.9
Video-Only, BoW         61.9           40.8
Fusion FuncA+FuncV      66.8 / 68.7    43.5 / 48.3
Fusion FuncA+BoWV       72.3 / 68.1    49.7 / 51.6
Fusion BoWA+FuncV       56.6 / 61.5    40.8 / 43.7
Fusion BoWA+BoWV        61.1 / 64.8    41.4 / 44.8
Past works have identified several facial action units that are related to the feeling of pain [21, 22], e.g., AU 4, 6, 7, 9, 10, 12, 16, 25, and 43 (Figure 2). In this work, instead of recognizing these facial action units, we compute features characterizing these expressions directly from the tracked key points' (x, y) positions:
Eyebrows (7): the distance between the inner eyebrows divided by the distance between the outer eyebrows (1), and the quadratic polynomial coefficients of the right and left eyebrows (6)
Nose (2): the normalized distance between nose and philtrum (1), and of the nasolabial folds (1)
Eyes (5): the outer eye corner openings (2), the distance between the inner eye corners divided by the distance between the outer eye corners (1), and the distance between the upper and lower eyelids divided by the distance from the head to the corner of the eyes (2)
Mouth (14): the quadratic polynomial coefficients of the shape of the upper lip, including outer and inner parts, and of the lower lip, including outer and inner parts (12), and the two-sided mouth corner opening angles (2)
There are a total of 28 features per frame extracted from the face to represent the facial expression of the patient. Figure 2 also shows a schematic of the features extracted in this work.
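To make the geometric parameterization concrete, the sketch below computes the seven eyebrow features (the inner/outer eyebrow distance ratio and the quadratic polynomial coefficients of each brow) from one frame of tracked landmarks; the remaining feature groups follow the same pattern. The 68-point indexing used here follows the common iBUG annotation convention, which is an assumption since the paper does not list exact landmark indices.

```python
import numpy as np

# Assumed 68-point convention: indices 17-21 are one eyebrow, 22-26 the other;
# 21 and 22 are the inner ends, 17 and 26 the outer ends.
RIGHT_BROW, LEFT_BROW = np.arange(17, 22), np.arange(22, 27)

def eyebrow_features(pts):
    """pts: (68, 2) array of tracked (x, y) landmarks for one frame.
    Returns 7 values: inner/outer eyebrow distance ratio (1) and the
    quadratic polynomial coefficients of each eyebrow shape (2 x 3 = 6)."""
    inner = np.linalg.norm(pts[21] - pts[22])    # distance between inner eyebrow ends
    outer = np.linalg.norm(pts[17] - pts[26])    # distance between outer eyebrow ends
    feats = [inner / (outer + 1e-8)]
    for idx in (RIGHT_BROW, LEFT_BROW):
        x, y = pts[idx, 0], pts[idx, 1]
        feats.extend(np.polyfit(x, y, deg=2))    # coefficients of y = a*x^2 + b*x + c
    return np.asarray(feats)
```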
2.3. Session-level Encodings
Since each session is approximately 30 seconds long, we additionally utilize two different encoding approaches to form a fixed-length feature vector at the session level. The first is based on computing 15 different statistical functionals on the audio and video low-level descriptors (Functional). The list of functionals includes maximum, minimum, mean, median, standard deviation, 1st percentile, 99th percentile, 99th minus 1st percentile, skewness, kurtosis, minimum position, maximum position, lower quartile, upper quartile, and interquartile range. The second approach is based on k-means bag-of-words (BoW) encoding, which encodes a variable-length sequence of low-level descriptors as a histogram count of cluster occurrences. In general, BoW characterizes the quantized behavior types over a duration of time. The number of clusters is set to 256 for both audio and video.
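A minimal sketch of the two session-level encodings is given below, assuming the frame-level descriptors of a session are stacked into a (T, D) array: the 15 statistical functionals listed above, and a bag-of-words histogram over a 256-cluster k-means codebook fit on training-set frames. The use of scikit-learn for k-means and the normalization choices are assumptions; the paper does not specify the implementation.

```python
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans

def functional_encoding(llds):
    """llds: (T, D) frame-level descriptors -> (15*D,) session-level vector."""
    p1, p99 = np.percentile(llds, 1, axis=0), np.percentile(llds, 99, axis=0)
    q1, q3 = np.percentile(llds, 25, axis=0), np.percentile(llds, 75, axis=0)
    funcs = [llds.max(0), llds.min(0), llds.mean(0), np.median(llds, 0),
             llds.std(0), p1, p99, p99 - p1,
             stats.skew(llds, axis=0), stats.kurtosis(llds, axis=0),
             llds.argmin(0) / len(llds), llds.argmax(0) / len(llds),  # positions, normalized by length (a choice)
             q1, q3, q3 - q1]
    return np.concatenate(funcs)

def bow_encoding(llds, codebook):
    """Histogram of k-means cluster assignments over the session (normalized counts)."""
    labels = codebook.predict(llds)
    hist = np.bincount(labels, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

# The codebook is fit once on frames from the training set only, e.g.:
# codebook = KMeans(n_clusters=256, random_state=0).fit(np.vstack(train_session_llds))
```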
3. Experimental Setup and Results
In this work, we set up two different experiments:
Exp I: the NRS pain scale recognition task
Exp II: clinical outcomes analyses
Exp I is designed to validate that the pain-related facial and vocal expressions can indeed be modeled and used in the development of a signal-based pain scale, and Exp II is designed to analyze the predictive information that the signal-based pain scale possesses, in addition to the NRS and the patient's physiology, with respect to the clinical judgments of painkiller prescription and patient disposition (hospitalization or discharge).
3.1. Exp I: NRS Pain Level Classification
In Exp I, we perform two different recognition tasks: 1) binary classification between the extreme pain levels, i.e., severe vs. mild pain, on the corresponding subset of the dataset, and 2) ternary classification of the three commonly-used pain levels, i.e., severe vs. moderate vs. mild, on the entire dataset. Severe pain corresponds to an NRS score between 7-10, moderate to 4-6, and mild to 0-3. We design two different tasks because the NRS rating relies only on the patient's self-report, which can be subjective, especially for the moderate portion of the data. By running an additional binary classification on the extreme set, where there is less concern about the reliability of the labels, we can better assess the technical feasibility of our framework. The classifier of choice for this experiment is the linear-kernel support vector machine. We employ two different multimodal fusion techniques. One is an early-fusion technique, i.e., concatenating audio and video features after performing univariate feature selection (ANOVA) on each modality separately. The other is a late-fusion technique, i.e., fusing the decision scores from the audio and video modalities using logistic regression. All evaluation is done via leave-one-patient-out cross-validation, and the performance metric is unweighted average recall.
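The early-fusion condition of this protocol can be sketched as follows: per-modality ANOVA (univariate) feature selection fit on the training folds, feature concatenation, a linear SVM, and leave-one-patient-out cross-validation scored with unweighted average recall. The number of selected features and the SVM cost are not reported in the paper, so the values below are placeholders; late fusion would instead train one SVM per modality and combine their decision scores with a logistic regression over the same splits.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import recall_score

def early_fusion_lopo_uar(X_audio, X_video, y, patient_ids, k=200, C=1.0):
    """X_audio, X_video: (N, D) session-level features; y, patient_ids: (N,) arrays.
    Returns the unweighted average recall (UAR) under leave-one-patient-out CV."""
    y_true, y_pred = [], []
    for tr, te in LeaveOneGroupOut().split(X_audio, y, groups=patient_ids):
        sel_a = SelectKBest(f_classif, k=min(k, X_audio.shape[1])).fit(X_audio[tr], y[tr])
        sel_v = SelectKBest(f_classif, k=min(k, X_video.shape[1])).fit(X_video[tr], y[tr])
        Xtr = np.hstack([sel_a.transform(X_audio[tr]), sel_v.transform(X_video[tr])])
        Xte = np.hstack([sel_a.transform(X_audio[te]), sel_v.transform(X_video[te])])
        clf = make_pipeline(StandardScaler(), LinearSVC(C=C, max_iter=10000)).fit(Xtr, y[tr])
        y_true.extend(y[te])
        y_pred.extend(clf.predict(Xte))
    return recall_score(y_true, y_pred, average="macro")  # UAR
```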
3.1.1. Results and Discussions
Table 1 summarizes the results of Exp I. 2-Class indicates the binary classification task between the extremes. 3-Class indicates the ternary classification between the three pain levels. There are a couple of points to note in these results. The best accuracies achieved are 72.3% and 51.6%, i.e., with multimodal fusion of the audio and video modalities, for the 2-Class and 3-Class classification tasks respectively. Both of these results are significantly better than the chance baseline, indicating that there indeed exists pain-related information that can be modeled through audio-video signals. Another point is that while past works concentrate mostly on facial expressions, our work demonstrates that vocal characteristics are also indicative of the patient's experience of pain. In fact, comparing the audio-only and video-only accuracies, the results obtained with audio-only features are slightly higher than with video-only features.
Secondly, the type of encoding method affects the recognition accuracies. We show that the functionals-based method works better for audio features and the bag-of-words approach works better for video features. In fact, the best accuracy reported is obtained by fusing the functional-based audio features with the bag-of-words encoding of the video features. We hypothesize that this could be because pain-related audio characteristics are non-linearly distributed across the session (hence the functional descriptor approach works better), whereas our video features are inherently trying to capture a specific configuration of appearances (hence a counting-based method of encoding is superior). Another thing to note is that, in the three-class problem, the error rate for the moderate class is considerably higher than for the mild and severe classes. This could be because this class is inherently ambiguous; not only is the data itself ambiguous, but the ground truth itself can be unreliable. In summary, we demonstrate that our proposed audio-video based pain scale is capable of reaching a substantial reliability compared to the established NRS self-report-based instrument for assessing pain.
3.2. Exp II: Clinical Outcomes Analyses
The overarching goal of this research effort is not just to replicate the NRS self-report pain scale; instead, the aim is to derive signal-based (i.e., from audio-video data) informatics that can supplement the current decision-making protocol. A physician's decision on the type of treatment, if any, for the patient is often largely based on a holistic clinical assessment of the patient's overall condition. Hence, in Exp II, our aim is to design a simple quantitative score that combines the measures available at triage with the audio-video based pain level (the system output in Section 3.1). We will demonstrate that this score carries added information that is relevant to the patient's clinical outcomes of analgesic prescription and disposition. The exact analysis procedure goes as follows. For each triage, we have the following measures for every patient:
PHY: age, systolic/diastolic blood pressure, heart rate
NRS-3C: the three pain levels, i.e., mild, moderate, and severe, derived from the patient's NRS rating
SYS-2C: one of the two predicted pain levels (mild / severe) derived from the 2-Class SVM
SYS-2C(d): the decision score derived from the 2-Class SVM
SYS-3C: one of the three predicted pain levels (mild / moderate / severe) derived from the 3-Class SVM
SYS-3C(d): the decision score derived from the 3-Class SVM
PHY measures are all normalized with respect to the age of each patient. Further, we have two dichotomous clinical outcome variables for each patient, i.e., painkiller prescription and disposition. We design a score for each outcome, painK and dispT, by training a linear regression model for each on the training set, using the measures mentioned above as the independent variables; we then apply the learned regression model to assign an outcome score to each patient i. Lastly, by utilizing the following simple rule, we can predict whether a patient i will end up being prescribed medication or being hospitalized:
prescription: painK_i > AVG{ painK_j }, j ∈ training set
hospitalization: dispT_i > AVG{ dispT_j }, j ∈ training set
where AVG denotes the average value of the score within the training set. All of these procedures are done completely via leave-one-patient-out cross-validation. The main idea of the analysis is to show that the audio-video based pain-scale system enhances the quantitative (i.e., objective and measurable) evidence for the doctor's clinical judgment even when accounting for the current clinical instruments.
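A sketch of this analysis is given below, assuming the chosen predictors (e.g., PHY + NRS-3C + SYS-3C(d)) are assembled into a per-sample feature matrix: a linear regression score is learned on the training folds, and a sample is predicted positive when its score exceeds the average score over the training set, all within leave-one-patient-out cross-validation as described above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import recall_score

def outcome_score_uar(X, outcome, patient_ids):
    """X: (N, D) triage measures; outcome: (N,) binary clinical outcome
    (analgesic prescription or hospitalization); patient_ids: (N,) groups.
    Returns UAR of the rule: score_i > training-set average score."""
    y_true, y_pred = [], []
    for tr, te in LeaveOneGroupOut().split(X, outcome, groups=patient_ids):
        reg = LinearRegression().fit(X[tr], outcome[tr])
        threshold = reg.predict(X[tr]).mean()                 # AVG of scores over the training set
        y_pred.extend((reg.predict(X[te]) > threshold).astype(int))
        y_true.extend(outcome[te])
    return recall_score(y_true, y_pred, average="macro")
```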
3.2.1. Experimental Results and Discussions
Table 2 summarizes the results of Exp II as measured in UAR. There are some interesting points to note in this analysis. For the outcome of analgesic prescription, we see that the NRS scale by itself is naturally already capable of achieving an accuracy of 66.3%, in accordance with phenomena known from the past [23], and the PHY measures alone do not contribute at all. However, by combining the NRS with SYS-3C(d), the accuracy improves to 71.0% (a 4.7% absolute improvement). This result seems to indicate that the decision scores output by the 3-Class SVM encode additional information beyond the NRS scale that is relevant to understanding how physicians make a judgment on analgesic prescription. Furthermore, for the outcome of patient disposition (hospitalization or not), we see that PHY (vital signs) by itself obtains 56.4% accuracy, whereas the NRS scale does not provide information here. However, by combining PHY with SYS-2C, the accuracy improves to 65.7% (a 9.3% absolute improvement), signifying the added information that the NRS originally lacks but the audio-video based pain scale possesses with respect to the patient's disposition outcome.
Table 2: Summary of Exp II: accuracy numbers are measured in unweighted average recall (UAR).
Analgesic Pres. Hospitalization
PHY 49.6 56.4
NRS-3C 66.3 42.7
PHY+NRS-3C 63.5 56.4
SYS-2C 51.5 58.7
SYS-3C 58.8 57.1
PHY+SYS-2C 47.8 65.7
PHY+SYS-3C 53.3 58.6
PHY+SYS-2C(d) 54.4 56.4
PHY+SYS-3C(d) 58.4 54.4
NRS-3C+SYS-2C 66.3 58.7
NRS-3C+SYS-3C 66.3 57.1
NRS-3C+SYS-2C(d) 66.3 43.3
NRS-3C+SYS-3C(d) 71.0 44.7
PHY+NRS-3C+SYS-2C 62.3 65.7
PHY+NRS-3C+SYS-3C 62.7 55.9
PHY+NRS-3C+SYS-2C(d) 66.0 55.1
PHY+NRS-3C+SYS-3C(d) 69.6 55.8
In summary, while the audio-video based system is trained from the NRS, it seems to differ, possibly because it models the facial expressions and vocal characteristics directly. We demonstrate that these signal-based pain scales indeed possess additional clinically-relevant information about the outcome variables of emergency triage beyond what is already captured in the NRS scale and conventional vital sign measures.
4. Conclusions
In this work, we develop an initial predictive framework to assess the pain level of patients at emergency triage. The system provides reliable estimates relative to the established NRS pain scale. Furthermore, we evaluate the usefulness of such a system by demonstrating that it can capture important information about patient outcomes beyond the instruments currently available at triage. This initial result is quite promising, as the goal of the research is to devise novel, objective, and quantifiable informatics - not to replicate the current instrumentation but to provide supplemental clinically-relevant information beyond the established protocols.
There are multiple future directions. Technically, employing state-of-the-art speech/video processing and machine learning algorithms is an immediate next step as we continue to collect more data samples (our aim is to collect data from at least 500 unique patients). On the analysis side, we will put effort into understanding exactly what additional information the system is able to capture from the facial and vocal expressions of pain that is missing from the NRS scale, and whether such information is related to the physiology of the patient (e.g., muscle movement in response to the pain felt that may correlate with measures of heart rate or blood pressure). With more such insights, we hope to advance and benefit current medical practice at emergency triage through the introduction of such informatics.
5. Acknowledgments
Thanks to MOST (103-2218-E-007-012-MY3) and Chang Gung Memorial Hospital (CMRPG3E1791) for funding.
6. References
[1] S. Narayanan and P. G. Georgiou, “Behavioral signal process-
ing: Deriving human behavioral informatics from speech and lan-
guage,” Proceedings of the IEEE, vol. 101, no. 5, pp. 1203–1233,
2013.
[2] J. F. Cohn, T. S. Kruez, I. Matthews, Y. Yang, M. H. Nguyen,
M. T. Padilla, F. Zhou, and F. D. La Torre, “Detecting depression
from facial actions and vocal prosody,” in Affective Computing
and Intelligent Interaction and Workshops, 2009. ACII 2009. 3rd
International Conference on. IEEE, 2009, pp. 1–7.
[3] Z. Liu, B. Hu, L. Yan, T. Wang, F. Liu, X. Li, and H. Kang,
“Detection of depression in speech,” in Affective Computing and
Intelligent Interaction (ACII), 2015 International Conference on.
IEEE, 2015, pp. 743–747.
[4] A. Tsanas, M. A. Little, C. Fox, and L. O. Ramig, "Objective automatic assessment of rehabilitative speech treatment in Parkinson's disease," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 22, no. 1, pp. 181–190, 2014.
[5] A. Bayestehtashk, M. Asgari, I. Shafran, and J. McNames, “Fully
automated assessment of the severity of parkinson’s disease from
speech,” Computer speech & language, vol. 29, no. 1, pp. 172–
185, 2015.
[6] J. Gibson, N. Malandrakis, F. Romero, D. C. Atkins, and
S. Narayanan, “Predicting therapist empathy in motivational in-
terviews using language features inspired by psycholinguistic
norms,” in Sixteenth Annual Conference of the International
Speech Communication Association, 2015.
[7] B. Xiao, D. Can, P. G. Georgiou, D. Atkins, and S. S. Narayanan,
“Analyzing the language of therapist empathy in motivational in-
terview based psychotherapy,” in Signal & Information Process-
ing Association Annual Summit and Conference (APSIPA ASC),
2012 Asia-Pacific. IEEE, 2012, pp. 1–4.
[8] J. Kim, N. Kumar, A. Tsiartas, M. Li, and S. S. Narayanan, “Au-
tomatic intelligibility classification of sentence-level pathological
speech,” Computer speech & language, vol. 29, no. 1, pp. 132–
144, 2015.
[9] D. Bone, C.-C. Lee, M. P. Black, M. E. Williams, S. Lee, P. Levitt, and S. Narayanan, "The psychologist as an interlocutor in autism spectrum disorder assessment: Insights from a study of spontaneous prosody," Journal of Speech, Language, and Hearing Research, vol. 57, no. 4, pp. 1162–1177, 2014.
[10] C.-J. Ng, Z.-S. Yen, J. C.-H. Tsai, L. C. Chen, S. J. Lin, Y. Y. Sang, J.-C. Chen et al., "Validation of the Taiwan Triage and Acuity Scale: a new computerised five-level triage system," Emergency Medicine Journal, vol. 28, no. 12, pp. 1026–1031, 2011.
[11] M. J. Bullard, T. Chan, C. Brayman, D. Warren, E. Musgrave,
B. Unger et al., “Revisions to the canadian emergency department
triage and acuity scale (ctas) guidelines,” CJEM, vol. 16, no. 06,
pp. 485–489, 2014.
[12] K. Eriksson, L. Wikström, K. Årestedt, B. Fridlund, and A. Broström, "Numeric rating scale: patients' perceptions of its use in postoperative pain assessments," Applied Nursing Research, vol. 27, no. 1, pp. 41–46, 2014.
[13] E. Castarlenas, E. Sánchez-Rodríguez, R. de la Vega, R. Roset, and J. Miró, "Agreement between verbal and electronic versions of the numerical rating scale (NRS-11) when used to assess pain intensity in adolescents," The Clinical Journal of Pain, vol. 31, no. 3, pp. 229–234, 2015.
[14] G. Garra, A. J. Singer, B. R. Taira, J. Chohan, H. Cardoz, E. Chisena, and H. C. Thode, "Validation of the Wong-Baker FACES pain rating scale in pediatric emergency department patients," Academic Emergency Medicine, vol. 17, no. 1, pp. 50–54, 2010.
[15] A. B. Ashraf, S. Lucey, J. F. Cohn, T. Chen, Z. Ambadar, K. M. Prkachin, and P. E. Solomon, "The painful face – pain expression recognition using active appearance models," Image and Vision Computing, vol. 27, no. 12, pp. 1788–1796, 2009.
[16] S. Kaltwang, O. Rudovic, and M. Pantic, “Continuous pain in-
tensity estimation from facial expressions,” in Advances in Visual
Computing. Springer, 2012, pp. 368–377.
[17] P. Werner, A. Al-Hamadi, R. Niese, S. Walter, S. Gruss, and H. C.
Traue, “Towards pain monitoring: Facial expression, head pose, a
new database, an automatic system and remaining challenges,” in
Proceedings of the British Machine Vision Conference, 2013, pp.
119–1.
[18] B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. Müller, and S. Narayanan, "Paralinguistics in speech and language – state-of-the-art and the challenge," Computer Speech & Language, vol. 27, no. 1, pp. 4–39, 2013.
[19] T. Baltrusaitis, P. Robinson, and L.-P. Morency, "Constrained local neural fields for robust facial landmark detection in the wild," in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2013, pp. 354–361.
[20] G. Tzimiropoulos, J. Alabort-i-Medina, S. Zafeiriou, and M. Pantic, "Generic active appearance models revisited," in Computer Vision – ACCV 2012. Springer, 2012, pp. 650–663.
[21] P. Lucey, J. F. Cohn, I. Matthews, S. Lucey, S. Sridharan,
J. Howlett, and K. M. Prkachin, “Automatically detecting pain in
video through facial action units,” Systems, Man, and Cybernet-
ics, Part B: Cybernetics, IEEE Transactions on, vol. 41, no. 3, pp.
664–674, 2011.
[22] N. Rathee and D. Ganotra, “A novel approach for pain intensity
detection based on facial feature deformations,” Journal of Visual
Communication and Image Representation, vol. 33, pp. 247–254,
2015.
[23] H. C. Bhakta and C. A. Marco, "Pain management: association with patient satisfaction among emergency department patients," The Journal of Emergency Medicine, vol. 46, no. 4, pp. 456–464, 2014.