Toward Development and Evaluation of Pain Level-Rating Scale for Emergency Triage based on Vocal Characteristics and Facial Expressions

Fu-Sheng Tsai1, Ya-Ling Hsu1, Wei-Chen Chen1, Yi-Ming Weng2, Chip-Jin Ng2, Chi-Chun Lee1
1Department of Electrical Engineering, National Tsing Hua University, Taiwan
2Department of Emergency Medicine, Chang Gung Memorial Hospital, Taiwan
Abstract
In order to allocate healthcare resources, the triage classification system plays an important role in assessing the severity of illness of patients arriving at the emergency department. The
self-report pain intensity numerical-rating scale (NRS) is one
of the major modifiers of the current triage system based on
the Taiwan Triage and Acuity Scale (TTAS). The validity and reliability of the self-report scheme for pain level assessment are a major concern. In this study, we model observed expressive behaviors, i.e., facial expressions and vocal characteristics,
directly from audio-video recordings in order to measure pain
level for patients during triage. This work demonstrates a feasible model, which achieves accuracies of 72.3% and 51.6% in binary and ternary pain intensity classification, respectively. Moreover, the study results reveal a significant association between the current model and analgesic prescription/patient disposition after adjusting for patient-reported NRS and triage vital signs.
Index Terms: behavioral signal processing (BSP), facial ex-
pressions, triage, pain scale, vocal characteristics
1. Introduction
Deriving behavioral informatics from signals, e.g., audio-video
and/or physiological data recordings, offers a new paradigm
for quantitative decision-making across behavior sciences [1].
Behavioral informatics, i.e., computational methods that measure human attributes of interest, are developed grounded in their target domain applications. For example, notable algo-
rithmic advances have been observed in the medical domains:
detection of depression [2, 3], assessment of Parkinson’s dis-
ease [4, 5], modeling of therapist’s empathy in motivational in-
terview [6, 7], analysis of speech and language disorders [8, 9], etc. In this work,
we carry out a research effort into objectifying pain level, i.e.,
one of the six major regulators in the Taiwan Triage and Acu-
ity Scale (TTAS) [10], of an on-boarding emergency patient by
modeling his/her facial expressions and vocal characteristics.
TTAS was jointly developed by the Taiwan Society of Emergency Medicine and the Critical Care Society, modifying the Canadian Triage and Acuity Scale (CTAS) [11] by tailoring it toward Taiwan's particular medical situations. It was officially announced in 2010 by the Ministry of Health and Welfare as the triage system of Taiwan. TTAS includes six major factors in assessing severity and screening for life-threatening patients:
respiratory distress, circulation, consciousness level, body tem-
perature, pain level, and injury mechanism. Specifically, the intensity of pain is currently measured by the numerical rating scale (NRS) [12, 13], a 10-point self-report pain scale.
In clinical practice, physicians and nurses have noticed the dif-
ficulty in the systematic implementation of this instrument es-
pecially for elderly people, foreigners, or patients with a low
education level. This often leads either to a practice of using the FACES rating scale [14], which is designed for children, or to the triage nurse selecting the level through his/her own observations instead of soliciting an answer from the patient. Further-
more, even when the nurses succeed in carrying out NRS, this
self-report rating still suffers from various unwanted idiosyn-
cratic factors, e.g., age and body part dependency and incon-
sistent comprehension of the pain scale. These issues centered
around subjectivity in measuring pain create a deviation on the
consistency and validity of the triage classification system.
Related previous works have concentrated mainly on rec-
ognizing the occurrences of pain by monitoring facial expres-
sions. For example, Ashraf et al. [15] use an active appearance model to recognize frame-level pain, Kaltwang et al. [16] use a relevance vector regression model for continuous pain intensity estimation, and Werner et al. [17] model head pose for pain detection. In this work, we propose to include vocal characteristics
in addition to facial expression for measuring pain level. More-
over, we contribute not only in the multimodal aspect of pain
level measurement but also in the realism of contextualizing ap-
plications in real medical settings. In this work, we collect data
from a total of 182 real patients as they seek emergency medical
service at Chang Gung Memorial Hospital1. The data includes
audio-video samples during triage and follow-up sessions after
treatment, vital sign (physiological) data during triage, and fi-
nally a set of clinical outcomes. The data recordings are in real
medical settings (in-the-wild), and the interactions are sponta-
neous in nature - all posed as a challenging yet contextualized
situation for deriving appropriate informatics.
Our proposed multimodal framework achieves a 72.3% accuracy in classifying between the extreme (severe versus mild) pain levels and a 51.6% accuracy in a three-class (severe, moderate, and mild) pain level recognition. The inclusion of the audio modality is essential in improving the overall recognition rate, indicating that the intensity of pain is also reflected
in the patient's vocal characteristics. Furthermore, while comparing to the so-called ground truth, i.e., NRS, is a straightforward means of evaluating the framework, in this work we further utilize this audio-video based system in combination with NRS and vital sign data to analyze clinically-relevant outcome variables, specifically analgesic prescription and patient disposition, as another evaluation scheme.
We demonstrate that even after accounting for the best currently available medical instruments (physiological data and NRS), the audio-video based pain level assessment can improve the prediction of whether a doctor will end up prescribing an analgesic or ordering the patient to be hospitalized. This initial result is quite promising as the re-
search effort will continue to derive novel pain-level rating and
validate its ability to improve the current triage classification
system clinically.
Copyright © 2016 ISCA
September 8–12, 2016, San Francisco, USA
Figure 1: Complete flow diagram of the proposed work. We segment the raw audio recordings manually and then extract acoustic low-level descriptors; for the video data, we apply a pre-trained constrained local neural field (CLNF) to track the (x, y) positions of the 68 facial landmark points and then extract descriptors based on pain-related facial action units. Two encoding methods, statistical functional descriptors and a k-means bag-of-words model, are used to derive a session-level feature vector. Finally, we conduct pain level recognition using a fusion of audio-video features and further analyze it with respect to clinical outcomes.
The rest of the paper is organized as follows: section 2 describes the data collection and audio-video feature extraction,
section 3 includes experimental setups and results, and section
4 concludes with future work.
2. Research Methodology
2.1. Database Collection
The triage session included audio-video recordings, physiolog-
ical (heart rate, systolic and diastolic blood pressure) vital sign
data, and other clinically-related outcomes (analgesic prescrip-
tion and patient disposition) of on-boarding emergency patients
at Chang Gung Memorial Hospital. We excluded pediatric and trauma patients, as well as referral patients and patients with prior treatment before arrival, and further included only patients with symptoms of chest, abdominal, lower-back, or limb pain,
and headaches. There were two sessions recorded for each pa-
tient, i.e., at triage and follow-up, where the follow-up session
occurred approximately 1 hour after the treatment, if any, was
given to the patient. These sessions essentially involved nurses
asking the patient for the location of the body pain, the NRS scale of pain intensity (0–10, where 10 means the worst pain ever), and a brief description of the type of pain felt (for example, cramps or aches); each session usually lasted around 30 seconds. The audio-video data was recorded using a Sony HDR Handycam on a tripod in a designated assessment
room, and the placement of the camera was set attempting to
consistently capture the patients’ facial expressions.
In our current database, we have collected a total of 182 pa-
tients, each recorded at the two designated points in time. After
excluding non-usable data (e.g., cases where the patient’s rela-
tive responds to the pain level assessment instead of the patient,
low audio-video quality due to various uncontrollable factors,
loss of either physiological data or clinical outcomes), we have
a total of 205 audio-video samples from 117 unique patients,
which constitutes the dataset of interest for this work. Lastly,
the pain level is often grouped into three levels based on the number reported (mild: 0–3, moderate: 4–6, severe: 7–10); we adopt the same convention in this work to serve as the learning target for our signal-based pain level assessment system.
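As a minimal illustration (the function name is ours), the three-level binning convention above can be written as:

```python
def nrs_to_level(nrs: int) -> str:
    """Map a self-reported 0-10 NRS score to the three-level convention
    (mild: 0-3, moderate: 4-6, severe: 7-10) used as the learning target."""
    if not 0 <= nrs <= 10:
        raise ValueError("NRS must be between 0 and 10")
    if nrs <= 3:
        return "mild"
    if nrs <= 6:
        return "moderate"
    return "severe"
```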
2.2. Audio-Video Feature Extraction
Figure 1 depicts the overall framework including audio-
video data preprocessing, low-level descriptors extraction, and
session-level encoding. In the following sections, we will
briefly describe each component.
Figure 2: The red dots are the 68 facial landmarks tracked for each image. The action units listed are those identified as indicative of pain in past literature. Lastly, it shows the various parameterizations of the 68 facial landmarks that we compute as video features in this work. Facial Action Coding System photos ( face/facs.htm)
2.2.1. Acoustic Characteristics
For each recorded session, we first perform manual segmenta-
tion on the audio file to obtain the speaking portions correspond-
ing to the patient, the patient’s relatives, and the interviewer. In
this work, we concentrate only on the patient’s voice character-
istics. We extract 45 low-level descriptors in total: 13 MFCCs, 1 fundamental frequency, and 1 intensity, together with their associated delta and delta-delta coefficients, every 10 ms. This set of spectral-
prosodic features is extracted due to their common usage in
characterizing paralinguistic and emotion information [18]. The
audio features are further z-normalized per speaker.
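The delta stacking and per-speaker normalization can be sketched as follows; this assumes the 15 base descriptors (13 MFCCs, F0, intensity) are already extracted at a 10 ms step (e.g., with openSMILE or librosa), and the use of np.gradient as the delta approximation, as well as the function names, are our choices rather than the paper's exact recipe.

```python
import numpy as np

def stack_deltas(lld: np.ndarray) -> np.ndarray:
    """lld: (T, 15) frame-level descriptors (13 MFCCs, F0, intensity)
    sampled every 10 ms. Returns (T, 45): the descriptors with their
    delta and delta-delta coefficients appended."""
    delta = np.gradient(lld, axis=0)     # finite-difference delta approximation
    delta2 = np.gradient(delta, axis=0)  # delta-delta
    return np.hstack([lld, delta, delta2])

def znorm_per_speaker(feats: np.ndarray) -> np.ndarray:
    """z-normalize each descriptor over all frames of one speaker."""
    mu = feats.mean(axis=0, keepdims=True)
    sd = feats.std(axis=0, keepdims=True) + 1e-8  # guard against zero variance
    return (feats - mu) / sd
```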
2.2.2. Facial Expressions
On the video side, for each session, we first apply constrained
local neural fields (CLNF) [19] as a pre-processing step. CLNF
tracks the positions of a patient's 68 facial landmarks based on the Active Orientation Model (AOM) [20], an extension of the Active Appearance Model for describing the shape and appearance of a face. CLNF, i.e., an instance of a constrained lo-
cal model, essentially involves three major technical compo-
nents: point distribution model (describing the position of fea-
ture points in an image), local neural field patch experts (layered
unidirectional graphical model), and optimization fitting ap-
proach (non-uniform regularized landmark mean shift fitting).
By applying CLNF, we then are able to track the 68 feature
points (Figure 2), e.g., around face, eyes, and nose contour, for
each patient in each image of the recorded video session.
Past works have identified several facial action units that are
related to the feeling of pain [21, 22], e.g., AU4, 6, 7, 9, 10, 12,
Table 1: It summarizes the Unweighted Average Recall (UAR) obtained in Exp I. 2-Class indicates the binary classification task between
the extreme pain levels (severe versus mild). 3-Class indicates the ternary classification between severe, moderate, and mild pain levels.
The numbers in bold indicate the best accuracy achieved within that specific task.
         Chance | Audio-Only        | Video-Only        | Multimodal Fusion (early-fusion / late-fusion)
                | Functional  BoW   | Functional  BoW   | FuncA,FuncV  FuncA,BoWV   BoWA,FuncV   BoWA,BoWV
2-Class  50.0   | 67.9        61.3  | 55.9        61.9  | 66.8 / 68.7  72.3 / 68.1  56.6 / 61.5  61.1 / 64.8
3-Class  33.3   | 46.0        42.9  | 40.9        40.8  | 43.5 / 48.3  49.7 / 51.6  40.8 / 43.7  41.4 / 44.8
16, 25, 43 (Figure 2). In this work, instead of recognizing these facial action units, we compute features characterizing these expressions directly from the tracked key points' (x, y) positions:
• Eyebrows (7): the distance between the inner eyebrows divided by the distance between the outer eyebrows (1); the quadratic polynomial coefficients of the right and the left eyebrows (6)
• Nose (2): the normalized distance between nose and philtrum (1), and of the nasolabial folds (1)
• Eyes (5): the outer eye corner openings (2); the distance between the inner eye corners divided by the distance between the outer eye corners (1); the distance between upper and lower eyelids divided by the distance from the head to the corner of the eyes (2)
• Mouth (14): the quadratic polynomial coefficients of the shape of the upper lip, including outer and inner parts, and the lower lip, including outer and inner parts (12); the two-sided mouth corner opening angles (2)
There are a total of 28 features per frame extracted from the face
to represent the facial expression of the patient. Figure 2 also shows a schematic of the features extracted in this work.
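As an illustrative sketch of one feature group, the 7 eyebrow features could be computed from the tracked landmarks as below; the iBUG 68-point indexing (brows at indices 17-26) and the exact normalization are our assumptions, since the paper does not spell them out.

```python
import numpy as np

def eyebrow_features(pts: np.ndarray) -> np.ndarray:
    """pts: (68, 2) facial landmarks, assumed in the iBUG ordering
    (right brow: indices 17-21, left brow: 22-26). Returns the 7 eyebrow
    features: inner/outer brow distance ratio (1) and quadratic polynomial
    coefficients of each brow (6)."""
    inner = np.linalg.norm(pts[21] - pts[22])  # distance between inner brow ends
    outer = np.linalg.norm(pts[17] - pts[26])  # distance between outer brow ends
    ratio = inner / (outer + 1e-8)
    right = np.polyfit(pts[17:22, 0], pts[17:22, 1], deg=2)  # 3 coefficients
    left = np.polyfit(pts[22:27, 0], pts[22:27, 1], deg=2)   # 3 coefficients
    return np.concatenate([[ratio], right, left])
```

The remaining nose, eye, and mouth groups follow the same pattern of normalized landmark distances and polynomial fits, giving the 28 features per frame.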
2.3. Session-level Encodings
Since each session is approximately 30 seconds long, we ad-
ditionally utilize two different encoding approaches to form a
fixed-length feature vector at the session-level. The first one
is based on computing 15 different statistical functionals on
audio and video low-level descriptors (Functional). The list
of functionals includes maximum, minimum, mean, median,
standard deviation, 1st percentile, 99th percentile, 99th-1st per-
centile, skewness, kurtosis, minimum position, maximum po-
sition, lower quartile, upper quartile, and interquartile range.
The second approach is based on k-means bag-of-words (BoW) encoding, which encodes variable-length sequences of low-level descriptors as a histogram of cluster occurrences.
In general, BoW characterizes the quantified behavior types
over a duration of time. The number of clusters is set to be
256 for both audio and video.
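Both session-level encodings can be sketched as follows; normalizing the min/max positions by session length and normalizing the BoW histogram to sum to one are our assumptions, and a small cluster count is used for illustration instead of 256.

```python
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans

def functionals(lld: np.ndarray) -> np.ndarray:
    """Apply the 15 statistical functionals to a (T, D) session,
    returning a fixed-length (15 * D,) vector."""
    p1, p99 = np.percentile(lld, [1, 99], axis=0)
    q1, q3 = np.percentile(lld, [25, 75], axis=0)
    fns = [lld.max(0), lld.min(0), lld.mean(0), np.median(lld, axis=0),
           lld.std(0), p1, p99, p99 - p1,
           stats.skew(lld, axis=0), stats.kurtosis(lld, axis=0),
           lld.argmin(0) / len(lld), lld.argmax(0) / len(lld),  # min/max position
           q1, q3, q3 - q1]
    return np.concatenate(fns)

def bow_encode(lld: np.ndarray, km: KMeans) -> np.ndarray:
    """Bag-of-words: histogram of k-means cluster assignments over a session."""
    labels = km.predict(lld)
    hist = np.bincount(labels, minlength=km.n_clusters).astype(float)
    return hist / hist.sum()
```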
3. Experimental Setup and Results
In this work, we set up two different experiments:
Exp I: the NRS pain scale recognition task
Exp II: clinical outcomes analyses
Exp I is designed to validate that the pain-related facial and vo-
cal expressions can indeed be modeled and used in the develop-
ment toward a signal-based pain scale, and Exp II is designed
to analyze the predictive information that the signal-based pain
scale possess in addition to the NRS and patient’s physiology
to the clinical judgment of painkiller prescription and patient’s
disposition (hospitalization or discharge).
3.1. Exp I: NRS Pain Level Classification
In Exp I, we perform two different recognition tasks: 1) bi-
nary classification between the extreme pain levels, i.e., severe
vs. mild pain, on the subset of the dataset and 2) ternary clas-
sification of the three commonly-used pain levels, i.e., severe
vs. moderate vs. mild, on the entire dataset. Severe pain corresponds to an NRS score of 7–10, moderate to 4–6, and mild to 0–3. We design two different tasks due
to the fact that NRS rating itself only relies on patient’s self-
report, which can be subjective especially for the moderate por-
tion of the data. By running an additional binary classification
on the extreme set, where there is less concern on the reliabil-
ity of the label, we can better assess the technical feasibility
of our framework. The classifier of choice for this experiment
is the linear-kernel support vector machine. We employ two
different multimodal fusion techniques. One is based on early-
fusion technique, i.e., concatenating audio and video features
after performing univariate feature selection (i.e., ANOVA) on
each modality separately. Another one is based on late-fusion
technique, i.e., by fusing the decision scores from the audio and
video modalities separately using logistic regression. All evalu-
ation is done via leave-one-patient-out cross-validation, and the
performance metric is unweighted average recall.
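The early-fusion variant of this setup can be sketched with scikit-learn as follows, using LeaveOneGroupOut with patient IDs as groups for the leave-one-patient-out protocol; the number of selected features k is a placeholder, since the paper does not report it.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import recall_score

def lopo_early_fusion_uar(Xa, Xv, y, groups, k=20):
    """Leave-one-patient-out evaluation of early fusion: univariate ANOVA
    feature selection per modality, concatenation, then a linear-kernel SVM.
    Returns the unweighted average recall (UAR)."""
    preds = np.empty_like(y)
    for tr, te in LeaveOneGroupOut().split(Xa, y, groups):
        sa = SelectKBest(f_classif, k=min(k, Xa.shape[1])).fit(Xa[tr], y[tr])
        sv = SelectKBest(f_classif, k=min(k, Xv.shape[1])).fit(Xv[tr], y[tr])
        Xtr = np.hstack([sa.transform(Xa[tr]), sv.transform(Xv[tr])])
        Xte = np.hstack([sa.transform(Xa[te]), sv.transform(Xv[te])])
        preds[te] = SVC(kernel="linear").fit(Xtr, y[tr]).predict(Xte)
    return recall_score(y, preds, average="macro")  # macro-averaged recall == UAR
```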
3.1.1. Results and Discussions
Table 1 summarizes the results of Exp I. 2-Class indicates the
binary classification task between the extremes. 3-Class indi-
cates the ternary classification between the three pain levels.
The numbers in bold indicate the best accuracy achieved. There
are a couple of points to note in these results. The best accuracies achieved are 72.3% and 51.6%, i.e., via multimodal fusion of the audio and video modalities, for the 2-Class and 3-Class classification tasks respectively. Both of these results are significantly
better than the chance baseline indicating that there indeed
exists pain-related information that can be modeled through
audio-video signals. Another point to make is that while past
works concentrate mostly on the facial expressions, in our work,
we demonstrate that the vocal characteristics are also indicative
of the patient’s experience of pain. In fact, if we compare the
audio-only and video-only accuracies, the results obtained with audio-only features are slightly higher than those with video-only features.
Secondly, the type of encoding method affects the recognition accuracies. We show that the functionals-based method works
better for audio features and bag-of-word approach works better
for video features. In fact, the best accuracy reported is by fus-
ing functional-based audio feature with bag-of-word encoding
of video feature. We hypothesize that this could be due to the
fact that pain-related audio characteristics are non-linearly dis-
tributed across the session (hence, the functional descriptor ap-
proach works better), and our video features are inherently try-
ing to capture a specific configuration of appearances (hence, a
counting-based method of encoding is superior). Another thing to note is that, in the three-class problem, the error rate for the moderate class is considerably higher than for the mild and severe classes. This could be because this class is inherently ambiguous; hence, not only is the data itself ambiguous, but the ground truth can also be unreliable. In summary, we demon-
strate that our proposed audio-video-based pain scale is capable
of reaching a substantial reliability compared to the established
NRS self-report-based instrument for assessing pain.
3.2. Exp II: Clinical Outcomes Analyses
The overarching goal of the research effort is not just to replicate the NRS self-report pain scale; instead, the aim is to derive signal-based (i.e., from audio-video data) informatics that can supplement the current decision-making protocol. A physi-
cian’s decision on the type of treatment, if any, to the patient
is often largely based on a holistic clinical assessment of a pa-
tient’s overall condition. Hence, in Exp II, our aim is to design a
simple quantitative score that combines the available measures
at triage with the audio-video based pain level (system output
in section 3.1). We will demonstrate that this score has added
information that is relevant to the patient’s clinical outcomes of
analgesic prescription and disposition. The exact analysis pro-
cedure goes as follows. For each triage, we have the following
measures for every patient:
• PHY: age, systolic/diastolic blood pressure, heart rate
• NRS-3C: the three pain levels, i.e., mild, moderate, and severe, derived from the patient's NRS scale
• SYS-2C: one of the two predicted pain levels (mild / severe) derived from the 2-Class SVM
• SYS-2C(d): the decision score derived from the 2-Class SVM
• SYS-3C: one of the three predicted pain levels (mild / moderate / severe) derived from the 3-Class SVM
• SYS-3C(d): the decision score derived from the 3-Class SVM
PHY measures are all normalized with respect to the age of each
patient. Further, we have two clinical dichotomous outcome
variables for each patient, i.e., painkiller prescription and dis-
position. We design two scores, painK and dispT, one for each outcome, by training a linear regression model on the training set using the measures mentioned above as the independent variables, and then we apply the learned regression model to assign
an outcome score for each patient i. Lastly, by utilizing the following simple rule, we can predict whether a patient i will end up being prescribed medication or being hospitalized:
prescription: painK_i > AVG{painK_j}, j ∈ train-set
hospitalization: dispT_i > AVG{dispT_j}, j ∈ train-set
where AVG denotes the average value of the score within the training set. All of these procedures are done completely via
leave-one-patient-out cross validation. The main idea of the analyses is to show that the audio-video based pain-scale system enhances the quantitative (i.e., objective and measurable) evidence supporting the doctor's clinical judgment even when accounting for the current clinical instruments.
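Under the same leave-one-patient-out protocol, the scoring-and-thresholding rule can be sketched as below; the design matrix X stands for whichever combination of PHY, NRS-3C, and SYS measures is being evaluated, and the binary outcome vector encodes prescription or disposition.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneGroupOut

def lopo_outcome_predictions(X, outcome, groups):
    """For each held-out patient, fit a linear regression of the binary
    outcome on the triage measures, then predict positive iff the held-out
    score exceeds the average score over the training set."""
    preds = np.zeros(len(outcome), dtype=int)
    for tr, te in LeaveOneGroupOut().split(X, outcome, groups):
        reg = LinearRegression().fit(X[tr], outcome[tr])
        threshold = reg.predict(X[tr]).mean()  # AVG over the training set
        preds[te] = (reg.predict(X[te]) > threshold).astype(int)
    return preds
```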
3.2.1. Experimental Results and Discussions
Table 2 summarizes the results of Exp II as measured in UAR.
There are some interesting points to note in this analysis. For
the outcome of analgesic prescription, we see that the NRS scale by itself is naturally already capable of achieving an accuracy of 66.3%, in accordance with phenomena reported in the past [23], and PHY measures alone do not contribute at all. However, by combining NRS with SYS-3C(d), the accuracy improves to 71.0% (a 4.7% absolute improvement). This result
seems to indicate that the decision scores outputted from the 3-
Class SVM encodes additional information beyond NRS scale
that is relevant in understanding how physicians make a judg-
ment on analgesic prescription. Furthermore, for the outcome
of patient’s disposition (hospitalization or not), we see that PHY
(vital sign) by itself obtains 56.4% accuracy, where NRS scale
does not provide information here. However, by combining PHY with SYS-2C, the accuracy improves to 65.7% (a 9.3% absolute improvement), signifying the added information that the NRS is
Table 2: Summary of Exp II: the accuracy number is measured
in unweighted average recall
Analgesic Pres. Hospitalization
PHY 49.6 56.4
NRS-3C 66.3 42.7
PHY+NRS-3C 63.5 56.4
SYS-2C 51.5 58.7
SYS-3C 58.8 57.1
PHY+SYS-2C 47.8 65.7
PHY+SYS-3C 53.3 58.6
PHY+SYS-2C(d) 54.4 56.4
PHY+SYS-3C(d) 58.4 54.4
NRS-3C+SYS-2C 66.3 58.7
NRS-3C+SYS-3C 66.3 57.1
NRS-3C+SYS-2C(d) 66.3 43.3
NRS-3C+SYS-3C(d) 71.0 44.7
PHY+NRS-3C+SYS-2C 62.3 65.7
PHY+NRS-3C+SYS-3C 62.7 55.9
PHY+NRS-3C+SYS-2C(d) 66.0 55.1
PHY+NRS-3C+SYS-3C(d) 69.6 55.8
lacking originally, yet the audio-video based pain scale possesses, with respect to the patient's disposition outcome.
In summary, while the audio-video based system is trained from the NRS, it seems to differ, possibly because it models the facial expressions and vocal characteristics directly.
We demonstrate that these signal-based pain scales indeed possess additional clinically-relevant information about the outcome variables of emergency triage beyond what is already captured in the NRS scale and conventional vital sign measures.
4. Conclusions
In this work, we develop an initial predictive framework to assess the pain level of patients at emergency triage. The system shows reliable estimates relative to the established NRS pain scale. Fur-
thermore, we evaluate the usefulness of such system by demon-
strating that it can capture important information about the out-
come of the patient beyond the current available instrumenta-
tions used at triage. This initial result is quite promising, as the goal of the research is to devise novel, objective, and quantifiable informatics, not to replicate the current instrumentation but to provide supplemental clinically-relevant information beyond the established protocols.
There are multiple future directions. Technically, employ-
ing state-of-the-art speech/video processing and machine learn-
ing algorithms will be an immediate future direction as we con-
tinue to collect more data samples (our aim is to collect at least
500 unique patients’ data). On the analysis part, we will put
effort into understanding exactly what additional information the system is able to capture from the facial and vocal expressions about pain that is missing from the NRS scale,
and whether such information is related to the physiology of
the patient (e.g., muscle movement in response to pain felt that
may correlate with the measures of heart rate or blood pres-
sure). Having more insights discovered, we can hopefully help
advance and benefit the current medical practices at the emer-
gency triage with the introduction of such an informatics.
5. Acknowledgments
Thanks to MOST (103-2218-E-007-012-MY3) and Chang Gung Memorial Hospital (CMRPG3E1791) for funding.
6. References
[1] S. Narayanan and P. G. Georgiou, “Behavioral signal process-
ing: Deriving human behavioral informatics from speech and lan-
guage,” Proceedings of the IEEE, vol. 101, no. 5, pp. 1203–1233, 2013.
[2] J. F. Cohn, T. S. Kruez, I. Matthews, Y. Yang, M. H. Nguyen,
M. T. Padilla, F. Zhou, and F. D. La Torre, “Detecting depression
from facial actions and vocal prosody,” in Affective Computing
and Intelligent Interaction and Workshops, 2009. ACII 2009. 3rd
International Conference on. IEEE, 2009, pp. 1–7.
[3] Z. Liu, B. Hu, L. Yan, T. Wang, F. Liu, X. Li, and H. Kang,
“Detection of depression in speech,” in Affective Computing and
Intelligent Interaction (ACII), 2015 International Conference on.
IEEE, 2015, pp. 743–747.
[4] A. Tsanas, M. A. Little, C. Fox, and L. O. Ramig, “Objective
automatic assessment of rehabilitative speech treatment in parkin-
son’s disease,” Neural Systems and Rehabilitation Engineering,
IEEE Transactions on, vol. 22, no. 1, pp. 181–190, 2014.
[5] A. Bayestehtashk, M. Asgari, I. Shafran, and J. McNames, “Fully
automated assessment of the severity of parkinson’s disease from
speech,” Computer speech & language, vol. 29, no. 1, pp. 172–
185, 2015.
[6] J. Gibson, N. Malandrakis, F. Romero, D. C. Atkins, and
S. Narayanan, “Predicting therapist empathy in motivational in-
terviews using language features inspired by psycholinguistic
norms,” in Sixteenth Annual Conference of the International
Speech Communication Association, 2015.
[7] B. Xiao, D. Can, P. G. Georgiou, D. Atkins, and S. S. Narayanan,
“Analyzing the language of therapist empathy in motivational in-
terview based psychotherapy,” in Signal & Information Process-
ing Association Annual Summit and Conference (APSIPA ASC),
2012 Asia-Pacific. IEEE, 2012, pp. 1–4.
[8] J. Kim, N. Kumar, A. Tsiartas, M. Li, and S. S. Narayanan, “Au-
tomatic intelligibility classification of sentence-level pathological
speech,” Computer speech & language, vol. 29, no. 1, pp. 132–
144, 2015.
[9] D. Bone, C.-C. Lee, M. P. Black, M. E. Williams, S. Lee, P. Levitt,
and S. Narayanan, “The psychologist as an interlocutor in autism
spectrum disorder assessment: Insights from a study of sponta-
neous prosody,” Journal of Speech, Language, and Hearing Re-
search, vol. 57, no. 4, pp. 1162–1177, 2014.
[10] C.-J. Ng, Z.-S. Yen, J. C.-H. Tsai, L. C. Chen, S. J. Lin, Y. Y.
Sang, J.-C. Chen et al., “Validation of the taiwan triage and acuity
scale: a new computerised five-level triage system,” Emergency
Medicine Journal, vol. 28, no. 12, pp. 1026–1031, 2011.
[11] M. J. Bullard, T. Chan, C. Brayman, D. Warren, E. Musgrave,
B. Unger et al., “Revisions to the canadian emergency department
triage and acuity scale (ctas) guidelines,” CJEM, vol. 16, no. 06,
pp. 485–489, 2014.
[12] K. Eriksson, L. Wikström, K. Årestedt, B. Fridlund, and
A. Broström, “Numeric rating scale: patients’ perceptions of its
use in postoperative pain assessments,” Applied nursing research,
vol. 27, no. 1, pp. 41–46, 2014.
[13] E. Castarlenas, E. Sánchez-Rodríguez, R. de la Vega, R. Roset,
and J. Miró, “Agreement between verbal and electronic versions
of the numerical rating scale (nrs-11) when used to assess pain
intensity in adolescents,” The Clinical journal of pain, vol. 31,
no. 3, pp. 229–234, 2015.
[14] G. Garra, A. J. Singer, B. R. Taira, J. Chohan, H. Cardoz,
E. Chisena, and H. C. Thode, “Validation of the wong-baker faces
pain rating scale in pediatric emergency department patients,”
Academic Emergency Medicine, vol. 17, no. 1, pp. 50–54, 2010.
[15] A. B. Ashraf, S. Lucey, J. F. Cohn, T. Chen, Z. Ambadar, K. M.
Prkachin, and P. E. Solomon, “The painful face–pain expression
recognition using active appearance models,” Image and vision
computing, vol. 27, no. 12, pp. 1788–1796, 2009.
[16] S. Kaltwang, O. Rudovic, and M. Pantic, “Continuous pain in-
tensity estimation from facial expressions,” in Advances in Visual
Computing. Springer, 2012, pp. 368–377.
[17] P. Werner, A. Al-Hamadi, R. Niese, S. Walter, S. Gruss, and H. C.
Traue, “Towards pain monitoring: Facial expression, head pose, a
new database, an automatic system and remaining challenges,” in
Proceedings of the British Machine Vision Conference, 2013.
[18] B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers,
C. Müller, and S. Narayanan, “Paralinguistics in speech and
language: state-of-the-art and the challenge,” Computer Speech &
Language, vol. 27, no. 1, pp. 4–39, 2013.
[19] T. Baltrusaitis, P. Robinson, and L.-P. Morency, “Constrained lo-
cal neural fields for robust facial landmark detection in the wild,”
in Proceedings of the IEEE International Conference on Com-
puter Vision Workshops, 2013, pp. 354–361.
[20] G. Tzimiropoulos, J. Alabort-i Medina, S. Zafeiriou, and M. Pan-
tic, “Generic active appearance models revisited,” in Computer
Vision–ACCV 2012. Springer, 2012, pp. 650–663.
[21] P. Lucey, J. F. Cohn, I. Matthews, S. Lucey, S. Sridharan,
J. Howlett, and K. M. Prkachin, “Automatically detecting pain in
video through facial action units,” Systems, Man, and Cybernet-
ics, Part B: Cybernetics, IEEE Transactions on, vol. 41, no. 3, pp.
664–674, 2011.
[22] N. Rathee and D. Ganotra, “A novel approach for pain intensity
detection based on facial feature deformations,” Journal of Visual
Communication and Image Representation, vol. 33, pp. 247–254, 2015.
[23] H. C. Bhakta and C. A. Marco, “Pain management: association
with patient satisfaction among emergency department patients,”
The Journal of emergency medicine, vol. 46, no. 4, pp. 456–464, 2014.
... Apart from facial expressions, attempts have been made to use other modalities either individually or in combination for automatic pain detection. For example, Aung et al. [27] examined the use of body posture, body motion, and muscle activity for detecting patterns in body movements that could be indicative of pain during physical exercise; Tsai et al. [28] combined facial expressions and acoustic features to detect pain intensities in emergency cases; Werner et al. [29] investigated the use of facial expressions, electromyogram (EMG) recorded from trapezius muscles, and autonomic signals such as skin conductance and electrocardiogram (ECG) to automatically detect the different heat pain stimulus levels. Attempts have also been made to investigate the use of brain activation-acquired via either electroencephalography (EEG) [30] [31] or functional imaging [32]-for automatic pain assessment. ...
... Zhang et al. [76] averaged the probabilities of selected pain-related AUs to calculate the pain intensity estimate. [126] statistical features from sequence of facial landmark distances and quadratic polynomial coefficients of mouth shape Tsai et al. [28] bag of words from k-means based clusters of sequence of geometric features (facial landmark distances and quadratic polynomial coefficients of mouth shape) ...
... Among spatio-temporal texture representations: Tsai et al. [28] used HOG from Three Orthogonal Planes (HOG-TOP); Chen et al. [75] used LBP from Three Orthogonal Planes (LBP-TOP); Kaltwang et al. [113] used combinations of LBP-TOP, LPQ-TOP, and BSIF-TOP; Yang et al. [104] used energy from optical flow; Ghasemi et al. [74] used time [130]. Almost all classification and regression tasks were supervised. Ground truth in the form of pain or AU labels, and discrete or continuous-valued pain or AU intensities, was used to train the machine learning models. ...
Pain sensation is essential for survival, since it draws attention to physical threat to the body. Pain assessment is usually done through self-reports. However, self-assessment of pain is not available in the case of noncommunicative patients, and therefore, observer reports should be relied upon. Observer reports of pain could be prone to errors due to subjective biases of observers. Moreover, continuous monitoring by humans is impractical. Therefore, automatic pain detection technology could be deployed to assist human caregivers and complement their service, thereby improving the quality of pain management, especially for noncommunicative patients. Facial expressions are a reliable indicator of pain, and are used in all observer-based pain assessment tools. Following the advancements in automatic facial expression analysis, computer vision researchers have tried to use this technology for developing approaches for automatically detecting pain from facial expressions. This paper surveys the literature published in this field over the past decade, categorizes it, and identifies future research directions. The survey covers the pain datasets used in the reviewed literature, the learning tasks targeted by the approaches, the features extracted from images and image sequences to represent pain-related information, and finally, the machine learning methods used.
... Thiam et al. [60], [61] analyzed the audio signals of the SenseEmotion database, which do not contain verbal interaction, but mostly breathing noises and sporadic moaning sounds. In contrast, Tsai et al. [97], [98] and Li et al. [84] analyzed audio signals recorded during clinical interviews in an emergency triage situation. Whereas audio outperformed video-based facial expression recognition in Tsai et al. [97], the opposite results were found by Thiam et al. [60]. ...
... The verbal communication during the interview in Tsai's work leads to (1) more audio material with potentially discriminative information and (2) facial movements due to speaking that may interfere with facial expression recognition, which together may explain the superiority they found with audio. ...
... Generally, if the predictive performances of the single modalities are sufficiently good, their fusion tends to improve the results. This has been shown for combining facial expression and head pose [38], [56], [57]; EDA, ECG, and sEMG [44], [57]; video, EDA, ECG, and sEMG [46], [47], [49], [57]; video, RSP, ECG, and remote PPG [59]; video and audio [60], [97]; video, audio, EDA, ECG, EMG, and RSP [61], and MoCap and sEMG [64]. Similarly, unimodal systems for infant pain recognition have been outperformed by integrating facial expression, body movement, vital signs, and crying sound modalities [72]. ...
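The late fusion referenced in the excerpt above can be illustrated with a minimal sketch: each modality's classifier emits a class-probability vector, and the fused decision is their (optionally weighted) average. The probability values below are hypothetical, not results from any of the cited systems.

```python
import numpy as np

def late_fusion(prob_list, weights=None):
    """Weighted average of per-modality class-probability vectors.
    Returns the fused distribution and the argmax class index."""
    probs = np.stack(prob_list)              # (n_modalities, n_classes)
    if weights is None:
        weights = np.ones(len(prob_list))
    weights = np.asarray(weights, dtype=float)
    fused = (weights[:, None] * probs).sum(axis=0) / weights.sum()
    return fused, int(fused.argmax())

# Hypothetical 3-class (e.g., low/mid/high pain) outputs per modality.
video_probs = np.array([0.2, 0.5, 0.3])
audio_probs = np.array([0.1, 0.3, 0.6])
fused, label = late_fusion([video_probs, audio_probs])
```

Weighting the modalities (e.g., by their validation accuracy) is a common refinement when one modality is consistently more reliable than another.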
Pain is a complex phenomenon, involving sensory and emotional experience, that is often poorly understood, especially in infants, anesthetized patients, and others who cannot speak. Technology supporting pain assessment has the potential to help reduce suffering; however, advances are needed before it can be adopted clinically. This survey paper assesses the state of the art and provides guidance for researchers to help make such advances. First, we overview pain’s biological mechanisms, physiological and behavioral responses, emotional components, as well as assessment methods commonly used in the clinic. Next, we discuss the challenges hampering the development and validation of pain recognition technology, and we survey existing datasets together with evaluation methods. We then present an overview of all automated pain recognition publications indexed in the Web of Science as well as from the proceedings of the major conferences on biomedical informatics and artificial intelligence, to provide understanding of the current advances that have been made. We highlight progress in both non-contact and contact-based approaches, tools using face, voice, physiology, and multi-modal information, the importance of context, and discuss challenges that exist, including identification of ground truth. Finally, we identify underexplored areas such as chronic pain and connections to treatments, and describe promising opportunities for continued advances.
... While much research has already indicated that facial muscle movements, i.e., action units, provide an indication of different pain levels [8,9], several recent works have started to investigate the relationship between pain intensity and vocal cues. For example, Oshrat et al. analyzed prosodic variation as a bio-signaling indicator of pain [10], Ren et al. recently proposed a database for evaluating pain from speech [11], and Tsai et al. proposed several automated machine learning methods for recognizing self-reported pain levels using speech and face multimodally in a real triage database [12,13]. These studies tend to focus more on the prosodic and spectral properties of speech. ...
... Specifically, we conduct a multivariate statistical analysis on a cohort of 181 unique patients within the Triage Pain-Level Multimodal Database [13]. We analyze the variations of different voice quality measures with the self-reported pain levels as a function of the following list of clinical parameters: ...
... • Age: Young Adult, Adult, Senior • Gender: Male and Female • Pain-Sites: Head, Chest, Abdomen, Limb, Back, Others Our analysis reveals that: 1) voice quality varies statistically with pain levels only when considering other clinical contributing factors, 2) the senior age group displays a more intense (higher) value of voicing probability and shimmer when experiencing severe pain, 3) patients with abdomen pain show a lower jitter and shimmer when experiencing more severe pain, which is counter-intuitive and different from patients experiencing pain in the neck and shoulder, and 4) there seems to be a complex relationship between the expressed voice quality and the nociceptive pain resulting from our analysis of the interacting factor of pain-sites with pain-levels. In this study, we utilize the Triage Pain Level Database that was collected at the Department of Emergency of the Chang Gung Memorial Hospital [13], which included audio-video recordings of on-boarding patients during triage sessions. A 10-point self-reported (NRS) pain level [23] and a variety of clinical parameters including age, gender, vital signs, and pain-sites were also recorded as a part of the standard procedure during triage. ...
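Jitter and shimmer, the voice-quality measures analyzed in the excerpt above, are commonly computed as the mean absolute cycle-to-cycle perturbation of pitch periods and peak amplitudes, respectively, normalized by the mean. The sketch below shows the "local" variants of both; it is a common textbook formulation, not necessarily the exact extractor used in the study.

```python
import numpy as np

def jitter_local(periods):
    """Local jitter: mean absolute difference between consecutive
    pitch periods, normalized by the mean period."""
    p = np.asarray(periods, dtype=float)
    return float(np.mean(np.abs(np.diff(p))) / p.mean())

def shimmer_local(amplitudes):
    """Local shimmer: the same measure applied to cycle peak amplitudes."""
    a = np.asarray(amplitudes, dtype=float)
    return float(np.mean(np.abs(np.diff(a))) / a.mean())

# Perfectly periodic voicing yields zero jitter and shimmer;
# cycle-to-cycle perturbation raises both.
```

In practice these measures are extracted from voiced segments only, after pitch-period estimation, and are often reported as percentages.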
... Tsai et al. [8] demonstrated that pain intensity can be assessed from audio recorded during triage interviews in an emergency department. In another work, they fused audio and facial expression, outperforming each single modality [9]. Head pose is another behavioral signal that has been shown to be useful for pain recognition. ...
... Our results show that head pose is similarly useful in a lying position. Audio is the worst performing modality in our X-ITE experiments, which is consistent with the quite poor performance of audio on the SenseEmotion database, but in contrast to the results achieved by Tsai et al. [9]. Tsai et al. used an interview scenario yielding more audio material with potentially discriminative information, whereas moaning and pain-related breathing patterns (which were observed on SenseEmotion and X-ITE) may occur less consistently than other pain responses. ...
Conference Paper
Automatic pain recognition has great potential to improve pain management. In this work, we investigate multi-modality in pain recognition in two regards. First, we compare and combine multiple sensor modalities, which capture both behavioral and physiological pain responses. Second, we compare and distinguish the heat and the electrical pain stimulus modalities in both phasic (short) and tonic (long) variants. Experiments show (1) that pain intensity can be recognized automatically in all stimulus variants, (2) that pain of different qualities (heat and electrical stimuli) can be distinguished, (3) that electrodermal activity (EDA) is the best performing single modality, and (4) that fusion of modalities can improve results further.
... To mitigate the biases caused by conventional evaluation methods, automatic evaluation of pain has the potential to establish an objective and unified standard. In recent studies, models for automatic detection of pain have been investigated and proposed based on multiple modalities, including facial expression [184,185,186], body gestures, and motion descriptors [187,188]. As an important factor in evaluating physiological health, such as the cardiovascular system [189], and mental health, such as depression [190], voice has the potential to indicate pain level. ...
Automatically recognising audio signals plays a crucial role in the development of intelligent computer audition systems. Particularly, audio signal classification, which aims to predict a label for an audio wave, has promoted many real-life applications. Considerable effort has been made to develop effective audio signal classification systems in the real world. However, several challenges in deep learning techniques for audio signal classification remain to be addressed. For instance, training a deep neural network (DNN) from scratch is time-consuming when extracting high-level deep representations. Furthermore, DNNs have not been well explained, which is needed to construct trust between humans and machines and to facilitate developing realistic intelligent systems. Moreover, most DNNs are vulnerable to adversarial attacks, resulting in many misclassifications. To deal with these challenges, this thesis proposes and presents a set of deep-learning-based approaches for audio signal classification. In particular, to tackle the challenge of extracting high-level deep representations, transfer learning frameworks, benefiting from pre-trained models on large-scale image datasets, are introduced to produce effective deep spectrum representations. Furthermore, attention mechanisms at both the frame level and the time-frequency level are proposed to explain the DNNs by respectively estimating the contributions of each frame and each time-frequency bin to the predictions. Likewise, the convolutional neural networks (CNNs) with an attention mechanism at the time-frequency level are extended to atrous CNNs with attention, aiming to explain the CNNs by visualising high-resolution attention tensors. Additionally, to interpret the CNNs evaluated on multi-device datasets, the atrous CNNs with attention are trained in conditional training frameworks. Moreover, to improve the robustness of the DNNs against adversarial attacks, models are trained in adversarial training frameworks. In addition, the transferability of adversarial attacks is enhanced by a lifelong learning framework. Finally, the experiments conducted with various datasets demonstrate that these presented approaches are effective in addressing the challenges.
... Depending on the amount and diversity of sensors used during the data collection phase, several signals have been assessed and evaluated in various settings for the development of pain assessment systems. Some of the most prominently used signals include the audio signal (e.g., paralinguistic vocalizations) (Tsai et al., 2016, 2017; Thiam and Schwenker, 2019), the video signal (e.g., facial expressions) (Rodriguez et al., 2017; Werner et al., 2017; Tavakolian and Hadid, 2019; Thiam et al., 2020b), specific bio-physiological signals such as the Electrodermal Activity (EDA), the Electrocardiogram (ECG), the Electromyography (EMG), or the Respiration (RSP) signal (Campbell et al., 2019; Thiam et al., 2019a), and also bodily expression signals (Dickey et al., 2002; Olugbade et al., 2019; Uddin and Canavan, 2020). ...
Traditional pain assessment approaches, ranging from self-reporting methods to observational scales, rely on the ability of an individual to accurately assess and successfully report observed or experienced pain episodes. Automatic pain assessment tools are therefore more than desirable in cases where this specific ability is negatively affected by various psycho-physiological dispositions, as well as distinct physical traits such as in the case of professional athletes, who usually have a higher pain tolerance than regular individuals. Hence, several approaches have been proposed during the past decades for the implementation of an autonomous and effective pain assessment system. These approaches range from more conventional supervised and semi-supervised learning techniques applied on a set of carefully hand-designed feature representations, to deep neural networks applied on preprocessed signals. Some of the most prominent advantages of deep neural networks are the ability to automatically learn relevant features, as well as the inherent adaptability of trained deep neural networks to related inference tasks. Yet, some significant drawbacks such as requiring large amounts of data to train deep models and over-fitting remain. Both of these problems are especially relevant in pain intensity assessment, where labeled data is scarce and generalization is of utmost importance. In the following work we address these shortcomings by introducing several novel multi-modal deep learning approaches (characterized by specific supervised, as well as self-supervised learning techniques) for the assessment of pain intensity based on measurable bio-physiological data.
While the proposed supervised deep learning approach is able to attain state-of-the-art inference performances, our self-supervised approach is able to significantly improve the data efficiency of the proposed architecture by automatically generating physiological data and simultaneously performing a fine-tuning of the architecture, which has been previously trained on a significantly smaller amount of data.
... With the release of publicly available pain databases (e.g., the UNBC-McMaster Pain Archive) and advancements in computer vision and machine learning, automatic assessment of pain from behavioral measures (e.g., facial expression) has emerged as a possible alternative to manual observations [10]. Using either spatial features or spatio-temporal features [10], researchers have automatically detected pain in the flow of behavior [1,16,19], differentiated feigned from genuine pain [2,16,17], detected ordinal pain intensity [11,12,15,22-27,29], and distinguished pain from expressions of emotion [3,13,14] (see [10,28] for a detailed review). ...
Conference Paper
The standard clinical assessment of pain is limited primarily to self-reported pain or clinician impression. While the self-reported measurement of pain is useful, in some circumstances it cannot be obtained. Automatic facial expression analysis has emerged as a potential solution for an objective, reliable, and valid measurement of pain. In this study, we propose a video-based approach for the automatic measurement of self-reported pain and the observer pain intensity, respectively. To this end, we explore the added value of three self-reported pain scales, i.e., the Visual Analog Scale (VAS), the Sensory Scale (SEN), and the Affective Motivational Scale (AFF), as well as the Observer Pain Intensity (OPI) rating for a reliable assessment of pain intensity from facial expression. Using a spatio-temporal Convolutional Neural Network - Recurrent Neural Network (CNN-RNN) architecture, we propose to jointly minimize the mean absolute error of pain score estimation for each of these scales while maximizing the consistency between them. The reliability of the proposed method is evaluated on the benchmark database for pain measurement from videos, namely, the UNBC-McMaster Pain Archive. Our results show that enforcing the consistency between different self-reported pain intensity scores collected using different pain scales enhances the quality of predictions and improves the state of the art in automatic self-reported pain estimation. The obtained results suggest that automatic assessment of self-reported pain intensity from videos is feasible, and could be used as a complementary instrument to unburden caregivers, especially for vulnerable populations that need constant monitoring.
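The joint objective described in this abstract (minimizing per-scale estimation error while maximizing cross-scale consistency) can be written, in simplified form, as a per-scale MAE plus a penalty on pairwise prediction disagreement. This numpy sketch is an illustration of that idea, not the paper's exact loss, and it assumes the scale scores have been normalized to a common range.

```python
import numpy as np

def joint_pain_loss(preds, targets, lam=0.1):
    """preds/targets: dicts mapping scale name (e.g., 'VAS', 'SEN', 'AFF')
    to arrays of scores. Loss = sum of per-scale MAE plus lam times the
    mean pairwise disagreement between the predictions on different scales."""
    scales = list(preds)
    mae = sum(float(np.mean(np.abs(preds[s] - targets[s]))) for s in scales)
    disagreement = [
        float(np.mean(np.abs(preds[scales[i]] - preds[scales[j]])))
        for i in range(len(scales))
        for j in range(i + 1, len(scales))
    ]
    return mae + lam * float(np.mean(disagreement))
```

In a training loop this quantity would be minimized over the network parameters; the consistency term only makes sense once the different scales are mapped onto comparable ranges.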
... Pain evaluation [3], [4] collects all the information related to pain, such as physical sensations, severity, physical examination results, and related special examination results, including the results of previous treatments, and interprets these results to diagnose the cause and mechanism of the pain, because each type of pain responds differently to painkillers. ...
Conference Paper
This research designs and builds a system for managing and monitoring drug use in pharmacies, based on database-system principles and a database connection to the drug administration and monitoring program. The system consists of a drug inventory management system, a storefront system for pharmacies, and a drug-use tracking system based on pain scores. The resulting system functions according to its design and construction objectives in all respects.
Conference Paper
Automatic pain recognition is an evolving research area with promising applications in health care. In this paper, we propose the first fully automatic approach to continuous pain intensity estimation from facial images. We first learn a set of independent regression functions for continuous pain intensity estimation using different shape (facial landmarks) and appearance (DCT and LBP) features, and then perform their late fusion. We show on the recently published UNBC-McMaster Shoulder Pain Expression Archive Database that late fusion of the aforementioned features leads to better pain intensity estimation compared to feature-specific pain intensity estimation.
Conference Paper
Facial feature detection algorithms have seen great progress over the recent years. However, they still struggle in poor lighting conditions and in the presence of extreme pose or occlusions. We present the Constrained Local Neural Field model for facial landmark detection. Our model includes two main novelties. First, we introduce a probabilistic patch expert (landmark detector) that can learn non-linear and spatial relationships between the input pixels and the probability of a landmark being aligned. Secondly, our model is optimised using a novel Non-uniform Regularised Landmark Mean-Shift optimisation technique, which takes into account the reliabilities of each patch expert. We demonstrate the benefit of our approach on a number of publicly available datasets over other state-of-the-art approaches when performing landmark detection in unseen lighting conditions and in the wild.
Empathy is an important aspect of social communication, especially in medical and psychotherapy applications. Measures of empathy can offer insights into the quality of therapy. We use an N-gram language model based maximum likelihood strategy to classify empathic versus non-empathic utterances and report the precision and recall of classification for various parameters. High recall is obtained with unigram features, while bigram features achieve the highest F1-score. Based on the utterance level models, a group of lexical features are extracted at the therapy session level. The effectiveness of these features in modeling session level annotator perceptions of empathy is evaluated through correlation with expert-coded session level empathy scores. Our combined feature set achieved a correlation of 0.558 between predicted and expert-coded empathy scores. Results also suggest that the longer term empathy perception process may be more related to isolated empathic salient events.
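The unigram maximum-likelihood scheme described above can be sketched as follows: fit an add-alpha smoothed unigram model per class, then label an utterance with the class whose model assigns it the higher log-likelihood. The tokenized example utterances in the test are invented for illustration, not drawn from the paper's data.

```python
import math
from collections import Counter

def train_unigram(utterances, alpha=1.0):
    """Fit an add-alpha smoothed unigram model from tokenized utterances.
    Returns (counts, total token count, vocabulary, alpha)."""
    counts = Counter(w for u in utterances for w in u)
    return counts, sum(counts.values()), set(counts), alpha

def log_likelihood(model, utterance):
    """Smoothed unigram log-likelihood of an utterance under a class model."""
    counts, total, vocab, alpha = model
    denom = total + alpha * (len(vocab) + 1)   # +1 reserves mass for unseen words
    return sum(math.log((counts.get(w, 0) + alpha) / denom) for w in utterance)

def classify(utterance, model_emp, model_non):
    """Maximum-likelihood decision between the two class models."""
    if log_likelihood(model_emp, utterance) >= log_likelihood(model_non, utterance):
        return "empathic"
    return "non-empathic"
```

Extending this to the bigram features mentioned in the abstract amounts to counting word pairs instead of single words, with the same smoothing and decision rule.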
The pain intensity detection approach proposed in this paper is based on the fact that facial features get deformed during pain. To model facial feature deformations, Thin Plate Spline is adopted, which separates rigid and non-rigid deformations very well. For efficient pain level detection, we have mapped the deformation parameters to a more discriminative space using a Distance Metric Learning (DML) method. In DML, we seek a common distance metric such that features belonging to the same pain intensity are pulled close to each other and features belonging to different pain intensities are pushed as far apart as possible. The assessment of the proposed approach is carried out on the popularly accepted UNBC-McMaster Shoulder Pain Expression Archive Database using a Support Vector Machine as the classifier. To prove the efficacy of the proposed approach, it is compared with state-of-the-art approaches mentioned in the literature.
Conference Paper
The proposed Active Orientation Models (AOMs) are generative models of facial shape and appearance. Their main differences with the well-known paradigm of Active Appearance Models (AAMs) are (i) they use a different statistical model of appearance, (ii) they are accompanied by a robust algorithm for model fitting and parameter estimation, and (iii), most importantly, they generalize well to unseen faces and variations. Their main similarity is computational complexity. The project-out version of AOMs is as computationally efficient as the standard project-out inverse compositional algorithm, which is admittedly the fastest algorithm for fitting AAMs. We show that not only does the AOM generalize well to unseen identities, but also it outperforms state-of-the-art algorithms for the same task by a large margin. Finally, we prove our claims by providing Matlab code for reproducing our experiments.
Electronic pain measures are becoming common tools in the assessment of pediatric pain intensity. The aims of this study were (1) to examine the agreement between the verbal and the electronic versions of the NRS-11 (vNRS-11 and eNRS-11, respectively) when used to assess pain intensity in adolescents and (2) to report participants' preferences for each of the two alternatives. 191 schoolchildren enrolled in grades 7 to 11 (mean age=14.61; range=12-18) participated. They were asked to report the highest intensity of the most frequent pain that they had experienced during the last three months using both the vNRS-11 and the eNRS-11. Agreement analyses were done using: (1) the Bland-Altman method, with confidence intervals (CI) of both 95% and 80%, and a maximum limit of agreement of ±1; and (2) weighted intra-rater Kappa coefficients between the ratings for each participant on the vNRS-11 and eNRS-11. The limits of agreement at 95% fell outside the limit established a priori (scores ranged from -1.42 to 1.69), except for participants in grade 11 (0.88). Meanwhile, the limits of agreement at 80% CI fell inside the maximum limit established a priori (scores ranged from -0.88 to 0.94), except for participants in grade 8 (1.16). The Kappa coefficients ranged from 0.786 to 0.912, indicating "almost perfect" agreement. A total of 83% of participants preferred the eNRS-11. Pain intensity ratings on the vNRS-11 and eNRS-11 seem to be comparable, at least for the 80% CI.
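The Bland-Altman analysis used in this study compares two measurement methods via the mean and standard deviation of their paired differences. A minimal sketch, with hypothetical paired NRS-11 ratings (the z value 1.96 corresponds to 95% limits of agreement):

```python
import numpy as np

def bland_altman_limits(a, b, z=1.96):
    """Bland-Altman limits of agreement between two rating methods:
    bias (mean difference) plus/minus z times the SD of the differences."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    bias = d.mean()
    sd = d.std(ddof=1)          # sample SD of the paired differences
    return bias - z * sd, bias, bias + z * sd

# Hypothetical paired verbal vs. electronic NRS-11 ratings.
v = [5, 7, 3, 8, 6, 4]
e = [5, 6, 3, 8, 7, 4]
lo, bias, hi = bland_altman_limits(v, e)
```

Agreement is then judged by whether the interval (lo, hi) falls inside a pre-specified clinical limit, such as the ±1 used in the study above; a narrower z (e.g., 1.28 for 80% CI) yields tighter limits.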