“How are you?” Estimation of anxiety, sleep quality, and mood using computational voice analysis

Samuel Kim¹, Namhee Kwon¹, Henry O’Connell¹, Nathan Fisk², Scott Ferguson² and Mark Bartlett²

¹Canary Speech, LLC, Provo, Utah, U.S.A. {sam, namhee, henry}@canaryspeech.com
²Pharmanex Research, NSE Products Inc., Provo, UT, U.S.A. {nafisk, scottf, mrbartle}@nuskin.com
Abstract— We developed a method of estimating impactors of cognitive function (ICF), such as anxiety, sleep quality, and mood, using computational voice analysis. Clinically validated questionnaires (VQs) were used to score anxiety, sleep, and mood, while salient voice features were extracted to train regression models with deep neural networks. Experiments with 203 subjects showed promising results, with significant concordance correlation coefficients (CCC) between actual VQ scores and predicted scores (0.46 for anxiety, 0.50 for sleep quality, and 0.45 for mood).
I. INTRODUCTION
There is growing interest in using the voice as a biomarker to estimate human health conditions. Some conditions, like Parkinson’s disease [1], [2] and amyotrophic lateral sclerosis (ALS) [3], are directly related to speech production mechanisms. Others, like schizophrenia [4], [5], [6], depression [7], [8], and bipolar disorder [9], [10], are indirectly related through neurological processes that can modulate the voice.
This study focuses on known ICFs, specifically anxiety [11], sleep quality [12], [13], and mood [14], [15], in a generally healthy population rather than a medically challenged one. Like the related applications mentioned above, we assume that a subject’s condition is embedded in the voice through voluntary or involuntary modulation of the articulatory organs. Based on this assumption, we extract salient features from the voice with respect to the target measurements and use them to train machine learning models that automatically estimate the measurements of interest.
In this work, we use VQs as the ground truth for anxiety, sleep quality, and mood status. Dealing with subjective measures like these is difficult, partly because they depend on personal perspectives, feelings, and opinions that vary across subjects [16]. We therefore use a questionnaire-based approach [16], [17] that consolidates responses from multiple instruments to address the target measurements.
The VQs and voice collection methodology are discussed in Section II, followed by the proposed voice analysis method in Section III and the experimental results in Section IV.
II. DATA
A. Data collection procedures
219 subjects were recruited for data collection. The native language of the subject population was English, including 12 French bilingual subjects. Along with other demographic information, subjects were asked to report their medical conditions. No previous medical conditions were reported for 155 subjects, while 48 reported having at least one medical condition, notably 22 with depression and 13 with migraine or headache.
The data collection consisted of one training session (not analyzed) followed by four evaluation sessions per subject. In each session, subjects used mobile devices to answer four VQs and record a sequence of voice responses. 203 subjects completed at least three of the four evaluation sessions. See Table I for the demographic distribution of the participating subjects. We conducted Wilcoxon significance tests to assess potential impacts of the demographic characteristics and found that none had a significant influence on the primary research objective.
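As an illustration of such a check, the minimal sketch below applies a Wilcoxon rank-sum test to compare VQ scores between two demographic groups; the data layout and placeholder scores are our own assumptions, not the paper’s.

```python
# Hedged sketch: Wilcoxon rank-sum test for a demographic effect on a
# VQ score. Group sizes follow Table I; the scores are placeholders.
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
stai_male = rng.normal(38.4, 13.2, size=88)     # placeholder STAI scores
stai_female = rng.normal(38.4, 13.2, size=112)

stat, p = ranksums(stai_male, stai_female)
print(f"Wilcoxon rank-sum statistic={stat:.2f}, p={p:.3f}")
# p >= 0.05 would be consistent with the paper's finding that gender
# had no significant influence on the primary research objective.
```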
TABLE I: Distribution of demographic groups of the collected database.

Demographics        Group                 Num of subjects   %
Age group           25-34                 138               67.9
                    35-44                 35                17.2
                    45-54                 30                14.7
Gender              Male                  88                43.3
                    Female                112               55.2
                    Refused               3                 1.4
Ethnicity group     Caucasian             120               59.1
                    Asian                 46                22.7
                    Other                 18                8.8
                    African American      10                4.9
                    Hispanic              5                 2.5
                    Middle East           1                 0.5
                    Refused               3                 1.4
Education level     Bachelors             92                45.3
                    CEGEP/Professional    14                6.9
                    High School           21                10.3
                    Post Graduate         52                25.6
                    Vocational            24                11.8
Medical condition   Stated Condition      48                23.6
                    No Condition          155               76.4
TABLE II: Statistics of VQ measurements.

         Possible range   Collected range   Collected mean ± STD
STAI     20 - 80          20 - 80           38.4 ± 13.2
GAD7     0 - 21           0 - 21            6.1 ± 4.9
PSQI     0 - 21           0 - 15            5.3 ± 2.8
PANAS    10 - 50          10 - 50           24.4 ± 6.7
B. Questionnaires
1) Clinically Validated Questionnaires: As discussed earlier, we used VQs to collect measurements of the subjects’ well-being. The VQs are designed to measure anxiety, sleep quality, and mood as follows:
Anxiety: The State-Trait Anxiety Inventory (STAI) [18] and the Generalized Anxiety Disorder scale (GAD7) [19] were used to measure levels of anxiety. STAI-6 has six emotional variables on which users score themselves based on how they feel at the moment, while GAD7 has seven emotional variables scoring how they have felt over the past two weeks. Both instruments have consolidating rules that generate a final level of anxiety from the answers; higher values mean higher anxiety (a sketch of such a rule follows this list).
Sleep quality: The Pittsburgh Sleep Quality Index (PSQI) [20] was used to measure subject sleep quality. The questionnaire evaluates various aspects of sleep quality, such as length of sleep as well as disturbing factors and their frequencies. The scoring guideline generates a final level of sleep discomfort; higher values indicate worse sleep quality.
Mood: The Positive Affect and Negative Affect Schedule (PANAS) [21] was used to evaluate the subject’s mood. The applied PANAS contains ten emotional variables (five positive and five negative) that evaluate the user’s mood over the past week. It gives an aggregated score representing the subject’s overall mood; a higher value indicates a negative mood and a lower value a positive mood.
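As an illustration of a consolidating rule, here is a minimal sketch of GAD7-style scoring, assuming the instrument’s standard form of seven items each answered on a 0–3 scale (total range 0–21, matching Table II); the function name and responses are ours, not the paper’s.

```python
# Hedged sketch of GAD7-style score consolidation: seven items, each
# answered 0-3, are summed into a 0-21 anxiety score (higher = more
# anxious). The item responses below are placeholders.
def gad7_score(item_responses):
    assert len(item_responses) == 7
    assert all(0 <= r <= 3 for r in item_responses)
    return sum(item_responses)

print(gad7_score([1, 2, 0, 1, 3, 2, 1]))  # -> 10, one scalar label
```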
Table II summarizes the statistics of the collected scores along with the possible score ranges. Note that the collected data covers the possible range of each measurement except for the PSQI, which covers only the lower range.
2) Voice responses: Subject speech responses were designed to capture vocal behaviors for use in estimating the actual VQ scores. We asked seven questions to elicit three types of voice responses: one spontaneous speech, four sentence readings, and two paragraph readings.
For spontaneous speech, participants were instructed to speak freely on any topic they wanted for about a minute. For sentence and paragraph readings, subjects were prompted to read phonetically balanced materials aloud.
Table III shows the statistics of the collected voice responses in terms of recorded length. Overall, approximately 54 hours of voice data were collected.
TABLE III: Length of voice responses.

Voice responses            Length (in secs.)
Spontaneous          Q1    64.0 ± 10.3
Sentence reading     Q2    5.4 ± 1.5
                     Q3    5.1 ± 1.6
                     Q4    4.8 ± 1.6
                     Q5    5.1 ± 1.6
Paragraph reading    Q6    48.6 ± 8.6
                     Q7    48.6 ± 12.4
III. METHODOLOGY
A. Feature extraction
Voice features fell into two categories: acoustic and linguistic. Acoustic features capture signal-level modulations due to the speaker’s status, while linguistic features capture language-level patterns that may be influenced by the condition.
Acoustic features were calculated on a per-frame basis. Frames were defined as 25 ms sliding windows created every 10 ms. A 41-dimensional supervector of features such as mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP), prosody, and voice quality features was generated for every frame [22]. MFCC and PLP provide spectral characteristics of speech signals that account for human auditory characteristics. Prosody and voice quality features relate to voice tonality and its characteristics, i.e., fundamental frequency (pitch), shimmer/jitter, harmonic-to-noise ratio, etc. Each feature’s delta and delta-delta were concatenated to capture frame-level context. To summarize the features at the response level, we used 19 statistical functions such as mean, median, skewness, kurtosis, quartiles, percentiles, and slope.
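As an illustration of this pipeline, the sketch below extracts frame-level MFCCs with the stated 25 ms / 10 ms framing, appends deltas and delta-deltas, and summarizes each feature track with a few of the statistical functionals named above; librosa, the 13-coefficient MFCC setup, and the reduced functional set are our assumptions, not the paper’s toolchain.

```python
# Hedged sketch of the acoustic pipeline: 25 ms frames every 10 ms,
# MFCC + delta + delta-delta, summarized by response-level statistics.
# librosa is an assumption; the paper does not name its signal toolkit.
import numpy as np
import librosa
from scipy.stats import skew, kurtosis

def acoustic_features(path):
    y, sr = librosa.load(path, sr=16000)
    n_fft = int(0.025 * sr)          # 25 ms analysis window
    hop = int(0.010 * sr)            # 10 ms frame shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)
    frames = np.vstack([mfcc,
                        librosa.feature.delta(mfcc),
                        librosa.feature.delta(mfcc, order=2)])
    t = np.arange(frames.shape[1])
    stats = []
    for row in frames:               # a few of the 19 functionals
        slope = np.polyfit(t, row, 1)[0]
        stats += [row.mean(), np.median(row), skew(row), kurtosis(row),
                  np.percentile(row, 25), np.percentile(row, 75), slope]
    return np.array(stats)           # one fixed-length vector per response
```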
Linguistic features were based on the output of automatic speech recognition (ASR). We used Canary’s general English model, trained on publicly available datasets such as Tedlium and Librispeech using the time delayed neural network (TDNN) architecture in Kaldi [23]. On top of common features such as part-of-speech ratios, syllable duration, the filler (ah, hmm, eh, uh, etc.) ratio, and the word repetition ratio over the total number of spoken words, we extracted a different feature set depending on whether the response was spontaneous or read.
For spontaneous voice responses, where no prompted text was given, semantic features were extracted, including word popularity percentiles¹, word frequencies of depression-related terms², and positive and negative sentiment likelihoods³. For read responses, on the other hand, ASR insertion, deletion, and substitution errors were computed against the given text.
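A minimal sketch of the transcript-level features described above follows; the filler list, tokenization, and function names are our own assumptions, and the real system derives these counts from Kaldi ASR output rather than raw strings.

```python
# Hedged sketch of transcript-level features: filler and repetition
# ratios for all responses, and insertion/deletion/substitution counts
# against the prompted text for read responses.
FILLERS = {"ah", "hmm", "eh", "uh"}   # example filler inventory

def common_ratios(tokens):
    """Filler ratio and word-repetition ratio over total spoken words."""
    n = len(tokens)
    fillers = sum(t in FILLERS for t in tokens)
    repeats = sum(a == b for a, b in zip(tokens, tokens[1:]))
    return fillers / n, repeats / n

def asr_error_counts(ref, hyp):
    """Insertions/deletions/substitutions of the ASR hypothesis vs. the
    prompted text, via a standard Levenshtein alignment."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = (cost, insertions, deletions, substitutions)
    dp = [[None] * (n + 1) for _ in range(m + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, m + 1):
        dp[i][0] = (i, 0, i, 0)        # delete all of ref[:i]
    for j in range(1, n + 1):
        dp[0][j] = (j, j, 0, 0)        # insert all of hyp[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
                continue
            c, ins, dele, sub = dp[i - 1][j]
            cands = [(c + 1, ins, dele + 1, sub)]            # deletion
            c, ins, dele, sub = dp[i][j - 1]
            cands.append((c + 1, ins + 1, dele, sub))        # insertion
            c, ins, dele, sub = dp[i - 1][j - 1]
            cands.append((c + 1, ins, dele, sub + 1))        # substitution
            dp[i][j] = min(cands)
    return dp[m][n][1:]  # (insertions, deletions, substitutions)

print(asr_error_counts("the quick brown fox".split(),
                       "the quick brawn fox fox".split()))
# -> (1, 0, 1): one insertion, no deletions, one substitution
```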
The feature dimension was 2,357 for a read response and 2,364 for a spontaneous response. For feature selection, we computed Pearson’s correlation coefficient between the extracted features and the individual VQ scores and selected the top n correlated features for the model. Note that this was done at the response and measurement levels, so that different features could be selected from different responses depending on the target measurement.
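A minimal sketch of this selection step, assuming a feature matrix with one row per response and a vector of VQ scores; ranking by the absolute Pearson correlation is our reading of the description above, not a stated detail.

```python
# Hedged sketch of correlation-based feature selection: rank features by
# |Pearson r| against the target VQ score and keep the top n.
import numpy as np

def select_top_n(X, y, n):
    """X: (responses, features); y: (responses,) VQ scores."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    idx = np.argsort(-np.abs(r))[:n]   # indices of the n best features
    return idx, X[:, idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2364))       # e.g., spontaneous-response features
y = rng.normal(38.4, 13.2, size=200)   # e.g., STAI scores (placeholders)
idx, X_sel = select_top_n(X, y, 100)
print(X_sel.shape)                     # (200, 100)
```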
As a baseline, we used the openSMILE toolkit [26] with the eGeMAPS configuration [27], which is widely accepted and shows promising results in affective-computing fields such as speech emotion recognition. Since it generates an 88-dimensional feature vector per speech signal, we set the threshold of the proposed feature selection procedure to retain the same number of features for a fair comparison against this conventional feature extraction.

¹The word popularity was computed from the general English language model with 130K words; the distribution of the values per response was examined and its percentile values were used.
²We built a dictionary of depression-related words and negative expressions, as well as common words observed in depression patients’ speech, resulting in 486 terms [24].
³The sentiment likelihood score was generated by a binary classification model trained on the Stanford Sentiment Treebank using Sentiment Neuron [25].

TABLE IV: Comparison of the proposed and conventional methods in terms of concordance correlation coefficients (CCC) between VQ scores and predicted scores using voice responses. Statistical significance is denoted with stars (∗ for p < 10⁻² and ∗∗ for p < 10⁻⁵).

                      Individual features                                                 Concatenated
                      Spontaneous  Sentence reading                   Paragraph reading   features
                      Q1       Q2       Q3       Q4       Q5       Q6       Q7
STAI    eGeMAPS       0.05     0.03     0.07∗    0.02     0.04     0.03     0.07     0.04∗∗
        Proposed      0.09∗∗   0.16∗∗   0.19∗∗   0.15∗∗   0.14∗∗   0.14∗∗   0.14∗∗   0.30∗∗
GAD7    eGeMAPS       0.02     0.05∗    0.04     0.03     0.01     0.05∗    0.04∗    0.03∗∗
        Proposed      0.14∗∗   0.19∗∗   0.26∗∗   0.22∗∗   0.17∗∗   0.08∗∗   0.09∗∗   0.41∗∗
PSQI    eGeMAPS       0.03     0.08∗    0.04     0.01     0.02     0.10∗∗   0.10∗∗   0.06∗∗
        Proposed      0.14∗∗   0.17∗∗   0.21∗∗   0.20∗∗   0.21∗∗   0.12∗∗   0.09∗∗   0.44∗∗
PANAS   eGeMAPS       -0.01    0.00     0.01     0.01     0.00     0.03     0.05∗    0.02
        Proposed      0.12∗∗   0.18∗∗   0.20∗∗   0.16∗∗   0.17∗∗   0.12∗∗   0.15∗∗   0.38∗∗

Fig. 1: Scatter plots of VQ scores and predicted scores using the proposed method: (a) predicted vs. actual STAI, (b) predicted vs. actual GAD7, (c) predicted vs. actual PSQI, and (d) predicted vs. actual PANAS. Colors represent the density of the individual data points (red for high density and blue for low density).
B. Modeling
We used a fully connected deep neural network (FC-DNN)
with 4 hidden layers of 256 neuron units. Each layer had
the rectified linear unit (ReLU) as an activation function,
and l2 regularizer and 50% of dropout to prevent overfitting.
The output layer had only one unit as we targeted to build
a regression model. We used an Adam optimizer with a
0.0001 learning rate using mean squared error (MSE) as a
loss function. The batch size was 32 and iteration stops after
100 epochs.
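The architecture described above can be sketched as follows in Keras; the L2 weight of 1e-4 and the per-layer dropout placement are our assumptions, as the paper does not specify them.

```python
# Hedged sketch of the FC-DNN regressor: 4 hidden layers x 256 ReLU units,
# L2 regularization and 50% dropout, one linear output unit, Adam (lr=1e-4),
# MSE loss, batch size 32, 100 epochs. The L2 weight (1e-4) is assumed.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_model(input_dim, l2_weight=1e-4):
    model = tf.keras.Sequential([tf.keras.Input(shape=(input_dim,))])
    for _ in range(4):
        model.add(layers.Dense(256, activation="relu",
                               kernel_regularizer=regularizers.l2(l2_weight)))
        model.add(layers.Dropout(0.5))
    model.add(layers.Dense(1))            # single-unit regression output
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="mse")
    return model

# model = build_model(input_dim=100)
# model.fit(X_train, y_train, batch_size=32, epochs=100)
```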
IV. EXPERIMENTAL RESULTS
A. Setup
We performed 5-fold cross-validation by randomly splitting the whole dataset into five exclusive folds, iteratively using one fold as test data and the remaining folds as training data. Each fold was subject-independent, so that different folds did not share data from the same subjects. Note that all sessions were treated independently; a longitudinal study of changes over time is left for future work.
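A minimal sketch of this subject-independent split, using scikit-learn’s GroupKFold with subject IDs as the grouping key; the variable names and synthetic shapes are ours.

```python
# Hedged sketch of subject-independent 5-fold CV: sessions from the same
# subject never appear in both the training and test folds.
import numpy as np
from sklearn.model_selection import GroupKFold

n_sessions, n_feats = 800, 100              # ~200 subjects x 4 sessions
X = np.random.randn(n_sessions, n_feats)
y = np.random.randn(n_sessions)
subject_ids = np.repeat(np.arange(200), 4)  # grouping key per session

for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=subject_ids):
    assert not set(subject_ids[train_idx]) & set(subject_ids[test_idx])
    # fit the FC-DNN on X[train_idx], evaluate CCC on X[test_idx]
```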
Performance was measured in terms of the concordance correlation coefficient (CCC) between VQ scores and predicted scores [28].
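For completeness, a small implementation of Lin’s CCC [28], which rewards both correlation and agreement in scale and location; this sketch adopts the population (ddof = 0) variance convention.

```python
# Hedged sketch of Lin's concordance correlation coefficient [28]:
# CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2),
# computed with population (ddof=0) moments.
import numpy as np

def ccc(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return 2 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

print(round(ccc([1, 2, 3, 4], [1.1, 2.0, 2.9, 4.2]), 3))  # ~0.994
```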
B. Results
Table IV shows CCC between VQ scores and predicted
scores using voice responses with respect to the target
measurement. The left part of the table shows the results
with features from individual voice responses, while the right
part of the table shows the results with concatenated features
from individual voice responses.
The results with individual voice responses show that the proposed method outperforms the conventional general-purpose feature extraction method in estimating all measurements across all voice responses. The same holds for the results with concatenated features. This indicates that the proposed feature selection strategy, which considers differences between individual responses, is beneficial. The proposed linguistic features likely also contribute to the improvement.
It is notable that the correlation coefficients with features from sentence readings are higher than those with features from spontaneous or paragraph reading responses. The results also show that concatenating features from individual voice responses boosts performance. These findings suggest that concatenating features from multiple short voice responses, rather than using a fixed feature set from one long voice response, is beneficial for estimating self-assessed measurements. This may be partly because salient features are averaged out over a long recording, but further research is needed for clarification. Additional study will also examine the impact of ASR performance on extracting linguistic features, and longitudinal research is needed to investigate changes over time.
TABLE V: Comparison of different numbers of selected features in terms of CCC between VQ scores and predicted scores using voice responses.

Num of features   STAI   GAD7   PSQI   PANAS
88 (Table IV)     0.30   0.41   0.44   0.38
100               0.46   0.46   0.50   0.45
Table V shows the performance with different numbers of selected features. Although we chose this hyperparameter empirically and it needs further investigation, the result indicates that there is room for improvement in how features are selected as input to the machine learning algorithms. Fig. 1 depicts scatter plots of VQ scores and estimated scores using this feature set. The red oval and blue line represent a bivariate normal ellipse encircling 95% of the data points and their linear fit, respectively. The colored zones are percentile density contours that move from red to blue as the percentage of the data in the quantiles increases. As reflected in the correlation coefficients, the figures illustrate that the VQ scores and estimated scores are significantly correlated across all instruments.
V. C ONCLUSION
We show promising results using computational voice
analysis in predicting three ICFs (anxiety, sleep quality, and
mood). The proposed method utilizes acoustic and linguistic
features along with response and measurement level feature
selection strategy followed by a deep neural network re-
gression model. Experimental results show that sleep quality
using PSQI has the strongest correlation while anxiety using
STAI has the weakest correlation. All assessments have
statistically significant correlations with p < 105. Voice as
a non-intrusive measurement of ICFs is a promising research
pathway that will benefit from the collection of more data to
detect hidden patterns and subtle differences among larger
populations.
REFERENCES
[1] Kara M. Smith, James R. Williamson, and Thomas F. Quatieri, “Vocal markers of motor, cognitive, and depressive symptoms in Parkinson’s disease,” in 2017 7th International Conference on Affective Computing and Intelligent Interaction (ACII 2017), pp. 71–78, 2018.
[2] Yermiyahu Hauptman, Ruth Aloni-Lavi, Itshak Lapidot, Tanya Gurevich, Yael Manor, Stav Naor, Noa Diamant, and Irit Opher, “Identifying distinctive acoustic and spectral features in Parkinson’s disease,” in Proc. Interspeech 2019, 2019, pp. 2498–2502.
[3] Hannah P. Rowe and Jordan R. Green, “Profiling speech motor impairments in persons with amyotrophic lateral sclerosis: An acoustic-based approach,” in Proc. Interspeech 2019, 2019, pp. 4509–4513.
[4] Francisco Martínez-Sánchez, José Antonio Muela-Martínez, Pedro Cortés-Soto, Juan José García Meilán, Juan Antonio Vera Ferrándiz, Amaro Egea Caparrós, and Isabel María Pujante Valverde, “Can the acoustic analysis of expressive prosody discriminate schizophrenia?,” The Spanish Journal of Psychology, vol. 18, p. E86, 2015.
[5] Rohit Voleti, Stephanie Woolridge, Julie M. Liss, Melissa Milanovic, Christopher R. Bowie, and Visar Berisha, “Objective assessment of social skills using automated language analysis for identification of schizophrenia and bipolar disorder,” in Proc. Interspeech 2019, 2019, pp. 1433–1437.
[6] Armen C. Arevian, Daniel Bone, Nikolaos Malandrakis, Victor R. Martinez, Kenneth B. Wells, David J. Miklowitz, and Shrikanth Narayanan, “Clinical state tracking in serious mental illness through computational analysis of speech,” PLoS ONE, vol. 15, no. 1, p. e0225695, 2020.
[7] Denis Dresvyanskiy, Danila Mamontov, Maxim Markitantov, and Albert Ali Salah, “Predicting depression and emotions in the crossroads of cultures, para-linguistics, and non-linguistics,” in AVEC 2019, 2019.
[8] Carol Espy-Wilson, Adam C. Lammert, Nadee Seneviratne, and Thomas F. Quatieri, “Assessing neuromotor coordination in depression using inverted vocal tract variables,” in Proc. Interspeech 2019, 2019, pp. 1448–1452.
[9] Zakaria Aldeneh, Mimansa Jaiswal, Michael Picheny, Melvin G. McInnis, and Emily Mower Provost, “Identifying mood episodes using dialogue features from clinical interviews,” in Proc. Interspeech 2019, 2019, pp. 1926–1930.
[10] Katie Matton, Melvin G. McInnis, and Emily Mower Provost, “Into the wild: Transitioning from recognizing mood in clinical interactions to personal conversations for individuals with bipolar disorder,” in Proc. Interspeech 2019, 2019, pp. 1438–1442.
[11] Bruce S. McEwen, “The neurobiology of stress: From serendipity to clinical relevance,” Brain Research, vol. 886, no. 1-2, pp. 172–189, 2000.
[12] Andreana Benitez and John Gunstad, “Poor sleep quality diminishes cognitive functioning independent of depression and anxiety in healthy young adults,” The Clinical Neuropsychologist, vol. 26, pp. 214–223, 2012.
[13] Steven Gilbert and Cameron Weaver, “Sleep quality and academic performance in university students: A wake-up call for college psychologists,” Journal of College Student Psychotherapy, vol. 24, pp. 295–306, 2010.
[14] Karuna Subramaniam, John Kounios, Todd Parrish, and Mark Beeman, “A brain mechanism for facilitation of insight by positive affect,” Journal of Cognitive Neuroscience, vol. 21, pp. 415–432, 2008.
[15] Tabitha Payne and Michael Schnapp, “The relationship between negative affect and reported cognitive failures,” Depression Research and Treatment, vol. 2014, p. 396195, 2014.
[16] Francis Guillemin, Claire Bombardier, and Dorcas Beaton, “Cross-cultural adaptation of health-related quality of life measures: Literature review and proposed guidelines,” Journal of Clinical Epidemiology, vol. 46, no. 12, pp. 1417–1432, Dec. 1993.
[17] Theodore Pincus, Christopher Swearingen, and Frederick Wolfe, “Toward a multidimensional health assessment questionnaire (MDHAQ): Assessment of advanced activities of daily living and psychological status in the patient-friendly health assessment questionnaire format,” Arthritis & Rheumatism, vol. 42, pp. 2220–2230, 2001.
[18] Audrey Tluczek, Jeffrey B. Henriques, and Roger L. Brown, “Support for the reliability and validity of a six-item state anxiety scale derived from the State-Trait Anxiety Inventory,” Journal of Nursing Measurement, vol. 17, no. 1, pp. 19–28, 2009.
[19] Robert L. Spitzer, Kurt Kroenke, Janet B. W. Williams, and Bernd Löwe, “A brief measure for assessing generalized anxiety disorder: The GAD-7,” JAMA Internal Medicine, vol. 166, no. 10, pp. 1092–1097, 2006.
[20] Daniel J. Buysse, Charles F. Reynolds, Timothy H. Monk, Susan R. Berman, and David J. Kupfer, “The Pittsburgh Sleep Quality Index: A new instrument for psychiatric practice and research,” Psychiatry Research, vol. 28, no. 2, pp. 193–213, 1989.
[21] David Watson, Lee Anna Clark, and Auke Tellegen, “Development and validation of brief measures of positive and negative affect: The PANAS scales,” Journal of Personality and Social Psychology, vol. 54, no. 6, pp. 1063–1070, 1988.
[22] Thomas Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, Prentice Hall Press, USA, first edition, 2001.
[23] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely, “The Kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Dec. 2011, IEEE Signal Processing Society.
[24] Mohammed Al-Mosaiwi and Tom Johnstone, “In an absolute state: Elevated use of absolutist words is a marker specific to anxiety, depression, and suicidal ideation,” Clinical Psychological Science, vol. 6, no. 4, pp. 529–542, 2018. PMID: 30886766.
[25] Alec Radford, Rafal Józefowicz, and Ilya Sutskever, “Learning to generate reviews and discovering sentiment,” CoRR, vol. abs/1704.01444, 2017.
[26] Florian Eyben, Felix Weninger, Florian Gross, and Björn Schuller, “Recent developments in openSMILE, the Munich open-source multimedia feature extractor,” in Proceedings of the 21st ACM International Conference on Multimedia (MM ’13), New York, NY, USA, 2013, pp. 835–838, ACM.
[27] Florian Eyben, Klaus Scherer, Björn Schuller, Johan Sundberg, Elisabeth André, Carlos Busso, Laurence Devillers, Julien Epps, Petri Laukka, Shrikanth Narayanan, and Khiet Truong, “The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing,” IEEE Transactions on Affective Computing, vol. 7, 2016.
[28] Lawrence I-Kuei Lin, “A concordance correlation coefficient to evaluate reproducibility,” Biometrics, vol. 45, no. 1, pp. 255–268, 1989.