Segment-dependent dynamics in predicting Parkinson’s disease
James R. Williamson1, Thomas F. Quatieri1, Brian S. Helfer1, Joseph Perricone1,
Satrajit S. Ghosh2, Gregory Ciccarelli1, Daryush D. Mehta1
1 MIT Lincoln Laboratory, Lexington, Massachusetts, USA
2 Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
Early, accurate detection of Parkinson’s disease may aid in
possible intervention and rehabilitation. Thus, simple
noninvasive biomarkers are desired for determining severity.
In this study, a novel set of acoustic speech biomarkers is
introduced and fused with conventional features for predicting
clinical assessment of Parkinson’s disease. We introduce
acoustic biomarkers reflecting the segment dependence of
changes in speech production components, motivated by
disturbances in underlying neural motor, articulatory, and
prosodic brain centers of speech. Such changes occur at
phonetic and larger time scales, including multi-scale
perturbations in formant frequency and pitch trajectories, in
phoneme durations and their frequency of occurrence, and in
temporal waveform structure. We also introduce articulatory
features based on a neural computational model of speech
production, the Directions into Velocities of Articulators
(DIVA) model. The database used is from the Interspeech
2015 Computational Paralinguistic Challenge. By fusing
conventional and novel speech features, we obtain Spearman
correlations between predicted scores and clinical assessments
of r = 0.63 on the training set (four-fold cross validation),
r = 0.70 on a held-out development set, and r = 0.97 on a held-
out test set.
Index Terms: Parkinson’s disease, speech biomarkers,
phoneme and pause duration, articulatory coordination, neural
computational models of motor control
1. Introduction
Parkinson’s disease is a neurological disorder with associated
progressive decline in motor precision and sensorimotor
integration stemming presumably from the basal ganglia. In
this disorder, there is a steady loss of cells in the midbrain,
leading to speech impairment in nearly 90% of subjects [1].
Speech and voice characteristics of Parkinson’s disease
include imprecise and uncoordinated articulation, monotonous
and reduced pitch and loudness, variable speech rate and
rushes of breath and pause segments, breathy and harsh voice
quality, and changes in intonation and rhythm [2][3][4][5][6].
*This work is sponsored by the Assistant Secretary of Defense for
Research & Engineering under Air Force contract #FA8721-05-C-
0002. Opinions, interpretations, conclusions, and recommendations
are those of the authors and are not necessarily endorsed by the United
States Government. SG was partly supported by a joint collaboration
between the McGovern Institute for Brain Research and MIT LL.
Early and accurate detection of Parkinson’s disease can aid
in possible intervention and rehabilitation. Thus, simple
noninvasive biomarkers are desired for determining severity of
the condition. In this paper, a novel set of speech biomarkers is
introduced for predicting the clinical assessment of
Parkinson’s disease severity. Specifically, we introduce speech
biomarkers representing essential speech production
mechanisms of phonation, articulation, and prosody, motivated
by changes that are known to occur at different time scales and
differentially across speech segments in the underlying neural
motor and prosodic brain centers of speech production.
With this motivation, our new features are designed based
on two basic tenets: (1) there is a phoneme dependence of the
dynamic speech characteristics that is altered in Parkinson’s
disease, and (2) vocal tract dynamics and stability are altered
in Parkinson’s disease. We exploit the phoneme dependence of
durations and pitch and formant slopes, consistent with
findings that certain speech segments are more prone to
variation than others in Parkinson’s disease [7][8][9]. We also
exploit the decline in the precision of articulatory trajectories
over different time scales and tasks [10][11].
Our paper is organized as follows. In Section 2, we
describe the data collection and preprocessing approaches. In
Section 3, we describe our signal models and corresponding
signal processing methods for speech feature extraction.
Section 4 reports predictor types and prediction results.
Section 5 gives conclusions and projections toward future work.
2. Parkinson’s database
2.1. Audio recordings
The Parkinson’s disease database used in the Interspeech 2015
Computational Paralinguistic Challenge is described in [1].
Assessments of Parkinson’s severity are based on the Unified
Parkinson’s Disease Rating Scale (UPDRS) [12]. The data set
is divided into 42 tasks per speaker, yielding 1470 recordings
in the training set (35 speakers) and 630 recordings in the
development set (15 speakers), both with UPDRS scores
provided. It also contains 462 recordings (11 speakers) in the
test set, without UPDRS scores provided. The duration of
recordings ranges from 0.24 seconds to 154 seconds.
2.2. Audio enhancement
To address noise in the test data, we use an adaptive
Wiener-filter approach that preserves the dynamic components
of a speech signal while reducing noise [13][14]. The
approach uses a measure of spectral change that allows robust
and rapid adaptation of the Wiener filter to speech and
background events. The approach reduces speech distortion by
using time-varying smoothing parameters, with constants
selected to produce less temporal smoothing in rapidly-changing
regions and greater smoothing in more stationary regions.
Copyright © 2015 ISCA. September 6–10, 2015, Dresden, Germany
2.3. Test set annotation
Identification (ID) labels for 11 test subjects were
manually assigned to the monologue, the read-text, and the ten
sentence recordings by listening. A 128-component Universal
Background Model/Gaussian Mixture Model (UBM/GMM)
classifier was used [15], with a feature vector consisting of 16
Mel-frequency cepstral coefficients (MFCCs) and 16 delta-
MFCCs, to classify the remaining 330 recordings on the test
set. The UBM was trained from all 2,562 recordings in the
data set, and subject ID assignments on the test recordings
were obtained using 11 subject GMMs, adapted using
manually labeled tasks on the test set. Finally, 28 of the test
subject assignments were corrected manually. The
final subject ID assignments were not perfect, with the number of
assigned recordings per subject ranging from 40 to 44.
3. Feature extraction
3.1. Overview of feature sets
Feature development was designed to reflect the three basic
aspects of speech production: phonation (source), articulation
(vocal tract), and prosody (intonation and timing). The focus
of the features is to characterize variations in dynamics of
pitch (phonation), formants (articulation), and rate and
waveform (prosody) that reflect Parkinson’s severity.
Ten different feature sets are used, designed to
characterize changes in speech as a function of Parkinson’s
severity. Each feature set (FS) comprises a set of raw features,
described in Sections 3.2–3.4. Dimensionality reduction of all
FSs is done by z-scoring the raw features and then extracting
lower dimensional features using principal components
analysis (PCA). The PCA features provide input into
regression models that map each FS into a Parkinson’s
severity (UPDRS) prediction (see Section 4). The FSs,
summarized in Table 1, are categorized in terms of three
different classes: summary statistics, phoneme dependence,
and correlation structure.
Four of the FSs (1, 3, 5, and 7) are effective on short
duration tasks, and are applied to all the recordings in the
dataset. For these FSs, the data is divided into six different
time bins, based on durations of recordings (see Table 2 for
details). Each statistical model is trained only on tasks of
similar duration. The remaining six FSs are designed to
capture longer duration speech dynamics, and are used to
characterize changes across subjects when speaking the same
sentence. For these FSs, the same three declarative sentences
(sentences 2, 4, and 6) are analyzed.
3.2. Summary statistics
FS 1: Delta-MFCC means. Mel-frequency cepstral
coefficients (MFCCs) are obtained with openSMILE at a 100
Hz frame rate [16]. Delta-MFCC coefficients are computed
using regression over two frames before and after each frame.
Then, the mean values of the 16 delta-MFCCs are computed
across all frames in each recording.
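The delta computation described for FS 1 can be sketched as follows. This is a minimal illustration, assuming the standard regression formula for delta coefficients with a window of two frames on each side. The function names `delta` and `delta_means` are hypothetical, edge frames are padded by repetition (an assumption), and openSMILE's own implementation may differ in detail.

```python
def delta(frames, n=2):
    """Regression-based delta coefficients over n frames before and after.

    frames: list of per-frame coefficient lists (T frames x D coefficients).
    Edge frames reuse the first/last frame (an assumption of this sketch).
    """
    num_frames = len(frames)
    num_coeffs = len(frames[0])
    denom = 2 * sum(k * k for k in range(1, n + 1))
    out = []
    for t in range(num_frames):
        row = []
        for d in range(num_coeffs):
            num = 0.0
            for k in range(1, n + 1):
                ahead = frames[min(t + k, num_frames - 1)][d]
                behind = frames[max(t - k, 0)][d]
                num += k * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out


def delta_means(frames, n=2):
    """Mean of each delta coefficient across all frames of one recording."""
    deltas = delta(frames, n)
    count = len(deltas)
    return [sum(row[d] for row in deltas) / count
            for d in range(len(deltas[0]))]
```

For a recording with 16 MFCCs per frame, `delta_means` returns the 16 values that form the raw FS 1 feature vector in this sketch.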
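FS   Feature                      Feature type            Data type
1    Delta-MFCC means             Summary statistics      Time bins
2    Loudness statistics          Summary statistics      Sentences
3    Phn duration                 Phoneme dependence      Time bins
4    Phn dur. & pitch             Phoneme dependence      Sentences
5    Phn dur. & formant slopes    Phoneme dependence      Time bins
6    Phn freq.                    Phoneme dependence      Sentences
7    Waveform corr.               Correlation structure   Time bins
8    Delta-MFCC corr.             Correlation structure   Sentences
9    Formant corr.                Correlation structure   Sentences
10   Articulator position corr.   Correlation structure   Sentences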
FS 2: Loudness statistics. Loudness is computed from the
Perceived Evaluation of Audio Quality (PEAQ) algorithm
[17], a psychoacoustic measure of audio quality that has been
used for analysis of Lombard speech [18]. Loudness is the average
energy across critical auditory bands per frame; FS 2
comprises the mean and standard deviation of this feature.
3.3. Phoneme-based features
Changes in phoneme durations and frequencies and in
phoneme-dependent pitch and formant slopes reflect the
phonemic segment dependence of alterations in phonation and
articulation with Parkinson’s severity. Phonemic segments are
used, along with estimated pitch and formant frequency
contours, to generate several phoneme-based feature sets.
Using an automatic phoneme recognition algorithm [19],
phonemic boundaries are detected, with each segment labeled
with one of 40 phoneme classes. The fundamental frequency
(pitch) contour is estimated using an autocorrelation method
over a 40-ms Hanning window every 1 ms [20]. Formant
frequency contours are estimated using a Kalman filter that
smoothly tracks the first three spectral modes while also
smoothly coasting through non-speech regions [21].
For FS 3, 4, and 5, an aggregation step is performed in
which a subset of the 40 phoneme-based measures that are the
most highly correlated with Parkinson’s severity (on the
training set) is linearly combined. In all cases, the 10 most
highly correlated measures are combined using weights
$w = \operatorname{sign}(r)/(1 - r^2)$ [22], where $r$ is each measure's
correlation with severity. For measures derived from time-bin
data, the aggregation is done independently in each time bin.
For measures derived from sentences, aggregation is done
independently in each sentence.
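The aggregation step can be illustrated with a short sketch. It assumes that r is a Pearson correlation between each phoneme-based measure and severity on the training data (the text does not state which correlation is used), and the names `pearson` and `aggregate` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def aggregate(measures, severity, top_k=10):
    """Combine the top_k measures most correlated with severity.

    measures: dict mapping a phoneme label to per-recording values.
    Weights follow w = sign(r) / (1 - r^2); |r| = 1 is not handled here.
    """
    corr = {name: pearson(vals, severity) for name, vals in measures.items()}
    top = sorted(corr, key=lambda name: abs(corr[name]), reverse=True)[:top_k]
    weights = {}
    for name in top:
        r = corr[name]
        sign = (r > 0) - (r < 0)
        weights[name] = sign / (1.0 - r * r)
    combined = [sum(weights[name] * measures[name][i] for name in top)
                for i in range(len(severity))]
    return weights, combined
```

In this sketch the weights are learned on training data only and would then be reused unchanged on held-out recordings, once per time bin or per sentence.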
Table 1. Feature Set (FS) summary. Three feature
types (summary statistics, phoneme dependence, and
correlation structure) are applied to time-binned and
sentence data.
FS 3: Average phoneme duration. A linear fit is made of
the logarithm of pitch over time (within each phonemic
segment), yielding a pitch slope (Δlog(Hz)/s) for each
phonemic segment. Phoneme durations are then computed for
those segments where the pitch slope is marked as valid (i.e.,
where the absolute pitch slope is less than eight, indicating that
the slope is likely derived from a continuous pitch contour).
Average phoneme durations were used originally in
classifying depression severity [23].
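The pitch-slope validity check for FS 3 can be sketched as below. This is an illustration under stated assumptions: each segment is a hypothetical dictionary holding a phoneme label, boundary times in seconds, and the voiced pitch samples inside the segment; segments with fewer than two pitch samples are skipped, which is a choice of this sketch and not taken from the text.

```python
import math


def log_pitch_slope(times, pitch_hz):
    """Least-squares slope of log(pitch) against time, in delta-log(Hz)/s."""
    logs = [math.log(f) for f in pitch_hz]
    n = len(times)
    mean_t = sum(times) / n
    mean_l = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (l - mean_l) for t, l in zip(times, logs))
    return sxy / sxx


def average_valid_durations(segments, max_abs_slope=8.0):
    """Average duration per phoneme over segments with a valid pitch slope."""
    durations = {}
    for seg in segments:
        if len(seg["times"]) < 2:
            continue  # no slope can be fitted (assumption: skip the segment)
        slope = log_pitch_slope(seg["times"], seg["pitch_hz"])
        if abs(slope) < max_abs_slope:
            duration = seg["end"] - seg["start"]
            durations.setdefault(seg["phoneme"], []).append(duration)
    return {ph: sum(d) / len(d) for ph, d in durations.items()}
```

The same per-segment slope would also supply the average log-pitch slope used in FS 4.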
FS 4: Average phoneme duration and pitch slope. In
addition to the average phoneme duration feature, an average
log-pitch slope is also used where pitch slopes are valid [22].
FS 5: Average phoneme duration and formant slopes.
Linear fits to formant frequencies f1 and f2, along with average
phoneme durations, are computed from all phoneme segments
regardless of pitch slope validity.
FS 6: Phoneme frequencies. The number of occurrences of
each phoneme is computed. These counts are normalized to
sum to one and sorted from highest to lowest. The top seven of
these phoneme frequency measures are used as features.
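A minimal sketch of the FS 6 computation follows. The function name `phoneme_frequency_features` is hypothetical, and padding with zeros when fewer than seven distinct phonemes occur is an assumption of this sketch.

```python
from collections import Counter


def phoneme_frequency_features(phoneme_labels, top_n=7):
    """Rank-ordered relative phoneme frequencies for one recording."""
    counts = Counter(phoneme_labels)
    total = sum(counts.values())
    # Normalize counts to sum to one, then sort from highest to lowest.
    freqs = sorted((c / total for c in counts.values()), reverse=True)
    # Assumption: pad with zeros if fewer than top_n phonemes were observed.
    freqs += [0.0] * (top_n - len(freqs))
    return freqs[:top_n]
```

Because the values are sorted by rank, each feature describes the shape of the phoneme distribution and is not tied to a particular phoneme identity.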
3.4. Correlation structure features
Measures of the structure of correlations among low-level
speech features have previously been applied in the estimation
of depression [22][24], the estimation of cognitive
performance associated with dementia [25], the detection of
changes in cognitive performance associated with mild
traumatic brain injury [26], and were first introduced for
analysis of EEG signals for epileptic seizure prediction [27].
Channel-delay correlation and covariance matrices are
computed from multiple time series. Each matrix contains
correlation or covariance coefficients between the channels at
multiple time delays. Changes over time in the coupling
strengths among the channel signals cause changes in the
eigenvalue spectra of the channel-delay matrices. The matrices
are computed at four separate time scales, in which successive
time delays correspond to different size frame spacings.
Overall power (logarithm of the trace) and entropy (logarithm
of the determinant) are extracted from the channel-delay
covariance matrices at each scale for FSs 8–10. A detailed
description of the correlation structure approach can be found
in [27] and its application to speech signals in [22][24].
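The construction of a channel-delay covariance matrix and the power and entropy measures can be sketched in plain Python. This is an illustration under assumptions, not the authors' implementation: the matrix is formed by time-delay embedding each channel, the function names are hypothetical, and the log-determinant is computed with a Cholesky factorization, which requires a positive-definite matrix. Eigenvalue spectra would additionally need an eigen-solver, which is not shown.

```python
import math


def channel_delay_covariance(channels, n_delays, spacing):
    """Covariance among all channels at all delays.

    channels: list of equal-length time series.
    Each channel contributes n_delays rows, shifted by multiples of spacing.
    """
    length = len(channels[0]) - (n_delays - 1) * spacing
    rows = []
    for series in channels:
        for d in range(n_delays):
            start = d * spacing
            rows.append(series[start:start + length])
    means = [sum(r) / length for r in rows]
    centered = [[v - m for v in r] for r, m in zip(rows, means)]
    size = len(rows)
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / (length - 1)
             for j in range(size)] for i in range(size)]


def log_trace(cov):
    """Overall power: logarithm of the trace."""
    return math.log(sum(cov[i][i] for i in range(len(cov))))


def log_det(cov):
    """Entropy measure: logarithm of the determinant, via Cholesky."""
    size = len(cov)
    lower = [[0.0] * size for _ in range(size)]
    total = 0.0
    for i in range(size):
        for j in range(i + 1):
            s = cov[i][j] - sum(lower[i][m] * lower[j][m] for m in range(j))
            if i == j:
                lower[i][j] = math.sqrt(s)  # fails if cov is not positive definite
                total += 2.0 * math.log(lower[i][j])
            else:
                lower[i][j] = s / lower[j][j]
    return total
```

Repeating the computation with a different `spacing` for each of the four time scales would give one power value and one entropy value per scale.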
FS 7: Waveform correlation structure. Correlation
structure features are extracted directly from waveform
segments, thereby characterizing instability in temporal
envelope and phase structure (rhythm and regularity) on
different time scales. This technique was applied to 0.5 s
frames with 50% overlap. Each frame is divided into five 0.1 s
segments that are treated as separate channels. Utterances ≥
0.75 s in duration resulted in multiple frames. In these cases
the average eigenvalue at each eigenvalue rank is computed
across frames. Features are computed at four scales with delay
spacings of 3, 7, 15, and 31, with 30 delays per scale. These
features reveal an association between Parkinson’s severity
and reduction in dynamical complexity, as illustrated in
Section 3.5.
FS 8: Delta-MFCC correlation structure. Correlation
structure features are derived from all 16 delta-MFCCs using
four delay scales with spacings 1, 3, 7, and 15, and using 15
delays per scale. Fewer delays are used for the largest scale on
sentences 2 and 4 due to their short duration.
FS 9: Formant correlation structure. Correlation structure
features are derived from three formant frequencies using the
same delay and scale parameters as FS 8.
FS 10: Positions of speech articulators correlation
structure. Here, we take advantage of a neurologically
plausible, fMRI-validated computational model of speech
production, the Directions into Velocities of Articulators
(DIVA) model [28]. The DIVA model takes as inputs the first
three formants and the fundamental frequency of a speech
utterance. Then, through an iterative learning process, the
model computes a set of synaptic weights that correspond to
different aspects of the speech production process including
articulatory commands and auditory and somatosensory
feedback errors. We hypothesize that Parkinsonian speech
results from impairments along the speech production
pathway, and therefore, when the model is trained on
Parkinsonian speech, the internal variables will reflect the
severity of the disorder. Correlation structure features are
derived from the DIVA model’s 13 time-varying articulatory
position states, which are sampled at 200 Hz. The same delay and
scale parameters are applied as with FS 8 and FS 9.
3.5. Example of discriminative value
In this section we show the discriminative value of the
waveform correlation structure features (FS 7). Figure 1 shows
channel-delay correlation matrices obtained from two women
with low and high Parkinson’s severity (training set files 154
and 303), both speaking the word “crema”. The channels are
five 0.1 s segments of the audio waveform. The matrix
elements are correlation coefficients between channels at
different relative time delays. These matrices are obtained at
the 3rd time scale, and so successive matrix elements
correspond to delays of 15 sub-frames. The matrix from the
low severity speaker exhibits more heterogeneity in the
correlation patterns, indicating higher waveform complexity.
This difference is quantified using matrix eigenvalues
(Figure 2, left panel), ordered largest to smallest. Low UPDRS
speech contains greater power in the small eigenvalues. This
effect is summarized across multiple speakers by plotting the
average eigenvalue for different ranges of Parkinson’s severity
at each rank. Figure 2 (right panel) shows the eigenvalue
averages (in standard units) from all training/development
waveforms < 0.5 s in duration for three UPDRS ranges. These
averages reveal distinct differences related to Parkinson’s
severity, even across multiple different speech tasks.
Figure 1: Correlation matrices from waveform-
segment channel-delay matrices for a low (left) and
high (right) UPDRS from utterance of “crema”.
Figure 2: Left: Eigenspectra for the utterance
“crema” from two female speakers with different
Parkinson’s severity. Right: Average eigenspectra for
low (blue), medium (green), and high (red) UPDRS.
4. Parkinson’s score prediction
4.1. Gaussian staircase regression models
There is a separate predictor for each FS, and the predictor
outputs are linearly combined to produce the system’s net
UPDRS prediction. Each predictor uses a Gaussian staircase
(GS) regression model [22][24], which comprises an ensemble
of six Gaussian classifiers, each trained on data with different
UPDRS partitions between Class 1 (lower UPDRS) and Class
2 (higher UPDRS). The Class 1 partitions are 0–15, 0–24,
0–33, 0–42, 0–51, and 0–60, and the Class 2 partitions are
the complement of these partitions. Additional regularization
of the densities is obtained by adding the constant 4 to the
diagonal elements of the data-normalized covariance matrices.
The GS output score is the ratio of the log of the summed
likelihoods from each ensemble of Gaussians. This is followed
by applying a 2nd-order univariate regression model, created
from the GS training scores, to the GS test score to generate a
UPDRS prediction.
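The staircase idea can be illustrated with a deliberately simplified sketch. Several assumptions apply: the feature is a single scalar (the paper uses PCA feature vectors), the regularizer `reg` is simply added to each variance (a stand-in for the diagonal loading described above), and the score is taken as the log of the ratio of the summed Class 2 and Class 1 likelihoods, which is one reading of the description. All function names are hypothetical, and the final 2nd-order regression is not shown.

```python
import math

# Upper UPDRS bounds of the six Class 1 partitions (0-15, 0-24, ..., 0-60).
THRESHOLDS = [15, 24, 33, 42, 51, 60]


def fit_gaussian(values, reg):
    """Mean and regularized variance; values must be non-empty."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n + reg
    return mean, var


def likelihood(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def train_staircase(features, updrs, reg=1.0):
    """One (Class 1, Class 2) Gaussian pair per UPDRS partition."""
    models = []
    for thr in THRESHOLDS:
        low = [f for f, u in zip(features, updrs) if u <= thr]
        high = [f for f, u in zip(features, updrs) if u > thr]
        models.append((fit_gaussian(low, reg), fit_gaussian(high, reg)))
    return models


def staircase_score(x, models):
    """Log of summed Class 2 likelihoods minus log of summed Class 1 likelihoods."""
    low = sum(likelihood(x, *pair[0]) for pair in models)
    high = sum(likelihood(x, *pair[1]) for pair in models)
    return math.log(high) - math.log(low)
```

Higher scores indicate that a recording looks more like the higher-UPDRS classes across the ensemble; a regression fitted on training scores would then map the score to a UPDRS value.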
4.2. Within-subject prediction averaging
For each FS, subject ID labels are the basis for computing,
across all tasks for that subject, a weighted prediction average
based on the tasks processed by each FS. The provided subject
IDs are used on the training/development sets, and the subject
IDs from our procedure (Section 2.3) are used on the test set.
Time-bin predictions. For FS 1, 3, 5 and 7, there are six
different regression models, one for each time bin. The utility
of each FS in each time bin is assessed based on the Spearman
correlation of its predictions on the training set (using 4-fold
cross validation). Within each FS and subject ID, the predictor
outputs are combined across time bins using weights
$w = r^2/(1 - r^2)$ if $r > 0$ and $w = 0$ otherwise. Table 2 lists the normalized
weights for each FS, with weights summing to one in each
column. Observe that the FSs accumulate most of their
evidence from short duration tasks.
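The time-bin weighting can be sketched as follows. The function names are hypothetical, and representing a missing prediction with `None` (for a subject with no task in a given bin) is an assumption of this sketch.

```python
def time_bin_weights(correlations):
    """Normalized weights w = r^2 / (1 - r^2) for r > 0, else 0."""
    raw = [r * r / (1.0 - r * r) if r > 0 else 0.0 for r in correlations]
    total = sum(raw)
    return [w / total for w in raw] if total > 0 else raw


def combine_predictions(predictions, weights):
    """Weighted average over the bins in which a prediction exists.

    predictions: one value per time bin, or None if the subject has no
    recording in that bin. At least one usable bin is required.
    """
    pairs = [(p, w) for p, w in zip(predictions, weights)
             if p is not None and w > 0]
    total = sum(w for _, w in pairs)
    return sum(p * w for p, w in pairs) / total
```

Because the weight grows quickly as r approaches one, bins whose cross-validated predictions track severity well dominate the within-subject average.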
[Table 2 column headings: time bin weights for FS 1, FS 3, FS 5, and FS 7]
Sentence predictions. The remaining FSs use data from
matched sentences (sentences 2, 4 and 6), which range in
duration from 1.56 s to 9.63 s. The sentence tasks allow for
phonemic timing and correlation structure biomarkers that
have been used to predict various neurological states
[11][12][21][22][23]. Within each FS and test subject ID, the
GS training and test scores are averaged across the three
sentences. A 2nd-order univariate regression model obtains a
sentence-based prediction for each FS and test subject.
4.3. Fusing predictors
The ten FS predictors are applied individually to the
training set (four-fold cross-validation with held out speakers)
and to the development set (Table 3). We used linear
combinations of the predictors with three different weight
vectors w1, w2, and w3. With w1, each predictor is weighted
equally. To improve performance, we also applied differential
weighting of the ten predictors. To choose the weights, we
conducted a grid search over all 1,024 possible weight
combinations in which each weight can have a value of 1 or 2.
From this search, we obtained two different weight vectors
based on different constraints. The first weight vector is w2,
the weights that yield the highest average of 1) Spearman
correlation on the training set, 2) Spearman correlation on the
development set, and 3) a test metric score. The second weight
vector is w3, the weights that yield the highest test metric
score. The test metric is the Pearson correlation between
previous submission scores and the Spearman correlations
between the prediction vectors that had produced the previous
scores and a candidate prediction vector. The test metric
equals one if a candidate prediction vector has Spearman
correlation of one with the true UPDRS scores. To compute
this metric, we used ten previously obtained scores, which
range between 0.12 and 0.89. These scores are our previous
eight submissions and two Baseline results (Table 3, rows 1, 3
of [29]). After training on the combined train and development
sets, we submitted test results using w2 and w3, obtaining
Spearman correlations of r = 0.96 and r = 0.97, respectively.
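The weight search and the test metric can be illustrated with a small sketch. It is written under assumptions: predictor outputs are fused by a weighted sum, the objective is passed in as a function, and the names `fuse`, `submission_metric`, and `grid_search` are hypothetical. The correlations are undefined for constant vectors, which this sketch does not guard against.

```python
import math
from itertools import product


def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of ranks."""
    return pearson(ranks(x), ranks(y))


def fuse(predictions, weights):
    """Weighted sum of per-predictor prediction vectors."""
    n = len(predictions[0])
    return [sum(w * p[i] for w, p in zip(weights, predictions)) for i in range(n)]


def submission_metric(candidate, previous_vectors, previous_scores):
    """Pearson correlation between earlier scores and the Spearman agreement
    of each earlier prediction vector with the candidate."""
    agreement = [spearman(vec, candidate) for vec in previous_vectors]
    return pearson(previous_scores, agreement)


def grid_search(predictions, objective):
    """Try every weight vector with entries 1 or 2; keep the best objective."""
    return max(product((1, 2), repeat=len(predictions)),
               key=lambda w: objective(fuse(predictions, w)))
```

With ten predictors the search visits 2^10 = 1,024 weight vectors. If a candidate is perfectly rank-correlated with the true scores, its agreement with each earlier submission equals that submission's score, so the metric equals one, as stated above.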
[Table 3 column headings: Train r, Devel. r]
5. Conclusions and discussion
We applied standard and novel speech features to predict
levels of Parkinson’s disease. Our features capture segment-based
dynamics across phonemes, formant frequencies, and
articulatory positions, based on an understanding of the effect
of Parkinson’s disease on speech production components. Our
ongoing work involves the further enhancement of current
features and establishment of new features, with emphasis on
approaches that provide a neurological basis for understanding
the effect of Parkinson’s disease on speech.
6. Acknowledgements
The authors thank Dr. Elizabeth Godoy and Dr. Christopher
Smalt for providing code and guidance on the extraction of
loudness features, and Dr. Douglas Sturim for helpful
discussions on automatic speaker identification.
Table 2. Weights used for combining predictions
across recording duration bins for feature sets (FS).
Table 3. Performance for each feature set predictor
and for linear combinations of predictors.
7. References
[1] J. Orozco-Arroyave, J. Arias-Londono, J. Vargas-Bonilla, M.
González-Rátiva, and E. Nöth, “New Spanish speech corpus
database for the analysis of people suffering from Parkinson’s
disease,” in Proceedings of the 9th Language Resources and
Evaluation Conference (LREC), 2014, pp. 342–347.
[2] G. J. Canter, “Speech characteristics of patients with Parkinson’s
disease: I. Intensity, pitch, and duration,” Journal of Speech and
Hearing Disorders, vol. 28, no. 3, pp. 221–229, 1963.
[3] G. J. Canter, “Speech characteristics of patients with Parkinson’s
disease: II. Physiological support for speech,” Journal of Speech
and Hearing Disorders, vol. 30, no. 1, pp. 44–49, 1965.
[4] G. J. Canter, “Speech characteristics of patients with Parkinson’s
disease: III. Articulation, diadochokinesis, and over-all speech
adequacy,” Journal of Speech and Hearing Disorders, vol. 30,
no. 3, pp. 217–224, 1965.
[5] J. A. Logemann, H. B. Fisher, B. Boshes, and E. R. Blonsky,
“Frequency and cooccurrence of vocal tract dysfunctions in the
speech of a large sample of Parkinson patients,” Journal of
Speech and Hearing Disorders, vol. 43, no. 1, pp. 47–57, 1978.
[6] J. E. Sussman and K. Tjaden, “Perceptual measures of speech
from individuals with Parkinson’s disease and multiple sclerosis:
Intelligibility and beyond,” Journal of Speech, Language, and
Hearing Research, vol. 55, no. 4, pp. 1208–1219, 2012.
[7] J. A. Logemann and H. B. Fisher, “Vocal tract control in
Parkinson’s disease,” Journal of Speech and Hearing Disorders,
vol. 46, no. 4, pp. 348–352, 1981.
[8] S. Skodda and U. Schlegel, “Speech rate and rhythm in
Parkinson’s disease,” Movement Disorders, vol. 23, no. 7, pp.
985–992, 2008.
[9] S. Skodda, “Aspects of speech rate and regularity in Parkinson’s
disease,” Journal of the neurological sciences, vol. 310, no. 1–2,
pp. 231–236, 2011.
[10] S. G. Hoberman, “Speech techniques in aphasia and
Parkinsonism,” Journal - Michigan State Medical Society, vol.
57, no. 12, pp. 1720–1723, 1958.
[11] I. J. Rusz, R. Cmejla, T. Tykalova, H. Ruzickova, J. Klempir, V.
Majerova, J. Picmausova, J. Roth, and E. Ruzicka, “Imprecise
vowel articulation as a potential early marker of Parkinson’s
disease: Effect of speaking task,” The Journal of the Acoustical
Society of America, vol. 134, no. 3, pp. 2171–2181, 2013.
[12] C. G. Goetz, B. C. Tilley, S. R. Shaftman, G. T. Stebbins, S.
Fahn, P. Martinez-Martin, ... and N. LaPelle, “Movement
Disorder Society-sponsored revision of the Unified Parkinson’s
Disease Rating Scale (MDS-UPDRS): Scale presentation and
clinimetric testing results,” Movement Disorders, vol. 23, no. 15,
pp. 2129–2170, 2008.
[13] T. F. Quatieri and R. B. Dunn, “Speech enhancement based on
auditory spectral change,” in Proceedings of the IEEE
International Conference on Acoustics, Speech, and Signal
Processing, 2002, pp. I-257.
[14] T. F. Quatieri and R. A. Baxter, “Noise reduction based on
spectral change,” in Proceedings of the IEEE Workshop on
Applications of Signal Processing to Audio and Acoustics
(WASPAA), 1997.
[15] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker
verification using adapted Gaussian mixture models,” Digital
signal processing, vol. 10, no. 1, pp. 19–41, 2000.
[16] F. Eyben, F. Weninger, F. Gross, and B. Schuller, “Recent
developments in openSMILE, the Munich Open-Source
Multimedia Feature Extractor,” in Proceedings of ACM
Multimedia (MM), Barcelona, Spain, 2013, pp. 835–838.
[17] T. Thiede, W. C. Treurniet, R. Bitto, C. Schmidmer, T. Sporer, J.
G. Beerends, and C. Colomes, “PEAQ-The ITU standard for
objective measurement of perceived audio quality,” Journal of
the Audio Engineering Society, vol. 48, no. 1/2, pp. 3–29, 2000.
[18] E. Godoy and Y. Stylianou, “Unsupervised acoustic analyses of
normal and Lombard speech, with spectral envelope
transformation to improve intelligibility,” in 13th Annual
Conference of the International Speech Communication
Association, September 9–13, Portland, Oregon, Proceedings,
2012, pp. 1472–1475.
[19] W. Shen, C. White, and T. J. Hazen, “A comparison of query-by-
example methods for spoken term detection,” in Proceedings of
the IEEE International Conference on Acoustics Speech and
Signal Processing, 2010.
[20] P. Boersma and D. Weenink, “Praat, a system for doing
phonetics by computer,” 2001.
[21] D. D. Mehta, D. Rudoy, and P. J. Wolfe, “Kalman-based
autoregressive moving average modeling and inference for
formant and antiformant tracking,” The Journal of the Acoustical
Society of America, vol. 132, no. 3, pp. 1732–1746, 2012.
[22] J. R. Williamson, T. F. Quatieri, B. S. Helfer, G. Ciccarelli, and
D. D. Mehta, “Vocal and facial biomarkers of depression based
on motor incoordination and timing,” in Proceedings of the 4th
ACM International Workshop on Audio/Visual Emotion
Challenge (AVEC), 2014, pp. 65–72.
[23] A. Trevino, T. F. Quatieri, and N. Malyska, “Phonologically-
based biomarkers for major depressive disorder,” EURASIP
Journal on Advances in Signal Processing, vol. 42, pp. 1–18,
[24] J. R. Williamson, T. F. Quatieri, B. S. Helfer, R. Horwitz, B. Yu,
and D. D. Mehta, “Vocal biomarkers of depression based on
motor incoordination,” in Proceedings of the 3rd ACM
International Workshop on Audio/Visual Emotion Challenge,
2013, pp. 41–48.
[25] B. Yu, T. F. Quatieri, J. W. Williamson, and J. Mundt,
“Prediction of cognitive performance in an animal fluency task
based on rate and articulatory markers,” in 15th Annual
Conference of the International Speech Communication
Association, September 9–13, Portland, Oregon, Proceedings,
[26] B. S. Helfer, T. F. Quatieri, J. R. Williamson, L. Keyes, B.
Evans, W. N. Greene, J. Palmer, and K. Heaton, “Articulatory
dynamics and coordination in classifying cognitive change with
preclinical mTBI,” in 15th Annual Conference of the
International Speech Communication Association, September 9–
13, Portland, Oregon, Proceedings, 2014.
[27] J. R. Williamson, D. Bliss, D. W. Browne, and J. T. Narayanan,
“Seizure prediction using EEG spatiotemporal correlation
structure,” Epilepsy and Behavior, vol. 25, no. 2, pp. 230–238,
[28] F. H. Guenther, S. S. Ghosh, and J. A. Tourville, “Neural
modeling and imaging of the cortical interactions underlying
syllable production,” Brain and Language, vol. 96, no. 3, pp.
280–301, 2006.
[29] B. Schuller, S. Steidl, A. Batliner, S. Hantke, F. Hönig, J. R.
Orozco-Arroyave, E. Nöth, Y. Zhang, F. Weninger, “The
INTERSPEECH 2015 Computational Paralinguistics Challenge:
Nativeness, Parkinson’s & Eating Condition,” in
INTERSPEECH 2015 – 16th Annual Conference of the
International Speech Communication Association, September 6–
10, Dresden, Germany, Proceedings, 2015.
... The efficacy of speech-based diagnostic models has been explored in a range of conditions, including Parkinson's disease (Benba et al., 2015;Orozco-Arroyave et al., 2016;Williamson et al., 2015), cognitive impairment (Garrard et al., 2014;Orimaye et al., 2017;Roark et al., 2011;Yu et al., 2014), and depression (Cummins et al., 2011;Sturim et al., 2011;Williamson et al., 2014). Generally speaking, these models collect some type of sensor data (acoustic, kinematic, etc.), apply signal processing to extract salient information from the raw signals, then use machine learning to map from the extracted features to clinically relevant information (e.g., whether or not a person has a given disease). ...
... One approach for extracting salient information from speech acoustic or kinematic data that has shown promise across a range of different pathologies is to analyze how different characteristic measures of speech (such as a localized acoustic feature or the position of a certain articulator) are correlated with one another across time. Prior studies have consistently observed reduced complexity in the structure of these correlations for individuals with communication disorders relative to healthy controls (Williamson et al., 2013(Williamson et al., , 2015Yu et al., 2014). This field of research has grown rapidly in recent years and yielded important insights into the feasibility of using speech signal data in clinical practice. ...
... The spatiotemporal correlation structure approach used in this article was first introduced for seizure detection based on electroencephalogram (EEG) signals (Williamson et al., 2012) and has since been applied to detect changes in articulatory coordination resulting from different neurological conditions, such as Parkinson's disease or cognitive impairment Williamson et al., 2013Williamson et al., , 2015Yu et al., 2014). The broad applicability of this approach stems from the fact that it can be applied to virtually any multichannel signal and has previously been applied to EEG signals, acoustic feature signals, and articulation signals. ...
Purpose: Kinematic measurements of speech have demonstrated some success in automatic detection of early symptoms of amyotrophic lateral sclerosis (ALS). In this study, we examined how the region of symptom onset (bulbar vs. spinal) affects the ability of data-driven models to detect ALS. Method: We used a correlation structure of articulatory movements combined with a machine learning model (i.e., artificial neural network) to detect differences between people with ALS and healthy controls. The performance of this system was evaluated separately for participants with bulbar onset and spinal onset to examine how region of onset affects classification performance. We then performed a regression analysis to examine how different severity measures and region of onset affect model performance. Results: The proposed model was significantly more accurate in classifying the bulbar-onset participants, achieving an area under the curve of 0.809 relative to the 0.674 achieved for spinal-onset participants. The regression analysis, however, found that differences in classifier performance across participants were better explained by their speech performance (intelligible speaking rate), and no significant differences were observed based on region of onset when intelligible speaking rate was accounted for. Conclusions: Although we found a significant difference in the model's ability to detect ALS depending on the region of onset, this disparity can be primarily explained by observable differences in speech motor symptoms. Thus, when the severity of speech symptoms (e.g., intelligible speaking rate) was accounted for, symptom onset location did not affect the proposed computational model's ability to detect ALS.
... This is a burden on clinical work, and also limits the scalability of automated patient screening and follow-up using clinically interpretable features. However, recent technical developments in automatic speech recognition (ASR) have been successfully involved in feature extraction in automatic assessment of various types of pathological speech [19][20][21]. This raises the question whether analysis of vowel articulation could also be fully automated in order to support clinical practice. ...
... Given this background, the main purpose of the present study is to develop an automatic and language-independent method for vowel articulation measurement in terms of acoustic features. Inspired by the recent successes of using ASR for feature extraction in automatic pathological speech assessment [19][20][21], a universal phone/phoneme recognizer is adopted to detect speech frames representative of corner vowel articulation, followed by statistical analysis of the formant frequencies across the detected frames. ...
... Earlier work using ASR for pathological speech assessment has already been carried out in the context of other acoustic features. In [19,20], phoneme statistics, duration and confidence measures derived from off-the-shelf Spanish ASR systems were applied to speech assessment of Spanish-speaking patients with PD. In [21], a Cantonese ASR system was used to generate utterance-level posterior related features for broad phoneme classes in voice disorders assessment. ...
Dysarthria is a common symptom in people with Parkinson's disease (PD). It affects respiration, phonation, articulation and prosody, and reduces speech intelligibility as a result. Imprecise vowel articulation can be observed in people with PD. Acoustic features measuring vowel articulation have been demonstrated to be effective indicators of PD in its detection and assessment. Standard clinical vowel articulation features include the first two formants of the three corner vowels /a/, /i/ and /u/, from which clinically relevant parameters such as vowel working space area (VSA), vowel articulation index (VAI) and formant centralization ratio (FCR) are derived. Conventionally, manual annotation of the corner vowels from speech data is required before measuring vowel articulation. This process is time-consuming and requires specific expertise in speech signal analysis. The present work aims to reduce human effort in clinical analysis of PD speech by proposing an automatic pipeline for vowel articulation assessment. The method is based on automatic corner vowel detection using a language-universal phoneme recognizer, followed by statistical analysis of the formant data. The approach removes the need for prior knowledge of the spoken content and of the language in question. Experimental results on a Finnish PD speech corpus demonstrate the efficacy and reliability of the proposed method in deriving VAI, VSA, FCR and F2i/F2u (the second formant ratio for vowels /i/ and /u/) scores in a fully automated manner. The automatically computed parameters are shown to be highly correlated with features computed with manual annotations of corner vowels. In addition, automatically and manually computed vowel articulation features have comparable correlations with experts' ratings of speech intelligibility, voice impairment and overall severity of communication disorder.
Language independence of the proposed approach is further validated on a Spanish PD database, PC-GITA, as well as on the TORGO corpus of English dysarthric speech. Results from these two corpora further demonstrate the efficacy of the automated features in separating PD/dysarthric speakers from controls, and show that the features are correlated with Parkinson's disease severity ratings on PC-GITA and with level of dysarthria on TORGO.
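The vowel articulation parameters named in the abstract above can be illustrated with a minimal sketch. The formulas are the commonly used definitions of the triangular VSA, the VAI and the FCR computed from corner-vowel formants; whether the cited work uses exactly these forms is an assumption. The function name `vowel_metrics` and the formant values in the usage lines are hypothetical and serve only as an illustration.

```python
def vowel_metrics(f1a, f2a, f1i, f2i, f1u, f2u):
    """Vowel articulation metrics from mean corner-vowel formants (in Hz).

    Assumed (commonly used) definitions:
      VSA     = area of the /i/-/a/-/u/ triangle in the F1-F2 plane
      VAI     = (F2i + F1a) / (F1i + F1u + F2u + F2a)
      FCR     = 1 / VAI
      F2i/F2u = ratio of the second formants of /i/ and /u/
    """
    # Shoelace formula for the triangle with vertices (F1, F2) of /i/, /a/, /u/.
    vsa = 0.5 * abs(
        f1i * (f2a - f2u) + f1a * (f2u - f2i) + f1u * (f2i - f2a)
    )
    vai = (f2i + f1a) / (f1i + f1u + f2u + f2a)
    fcr = 1.0 / vai
    f2_ratio = f2i / f2u
    return {"VSA": vsa, "VAI": vai, "FCR": fcr, "F2i/F2u": f2_ratio}


# Hypothetical formant means, chosen only to exercise the function.
metrics = vowel_metrics(f1a=750, f2a=1300, f1i=300, f2i=2300, f1u=350, f2u=800)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these definitions, vowel centralization shrinks the triangle, so VSA and VAI decrease while FCR increases.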
... It has been suggested that changes in complexity in movement dimensionality can reflect subtle physiological change as it manifests over time [21]. This has been used in several studies to relate coordination patterns with changes in neurological state and motor capability [21][22][23][24][25][26]. ...
... This points to the fact that more subtle changes in movement are being impacted by overpressure exposure, which makes the overt clinical detection challenging. Prior works by Williamson et al. [21][22][23] have shown that the eigenvalue features, as utilized here, provide insight into changes in motor coordination, that in turn can be used to assess neurophysiological status. ...
Conference Paper
Repetitive exposure to non-concussive blasts may result in sub-clinical neurological symptoms. These changes may be reflected in the neural control of gait and balance. In this study, we collected body-worn accelerometry data on individuals who were exposed to repetitive blast overpressures as part of their occupation. Accelerometry features were generated within periods of low movement and gait. These features were the eigenvalues of high-dimensional correlation matrices, which were constructed with time-delay embedding at multiple delay scales. When focusing on the gait windows, there were significant correlations of the changes in features with the cumulative dose of blast exposure. When focusing on the low-movement frames, the correlations with exposure were lower than those of the gait frames and statistically insignificant. In a cross-validated model, the overpressure exposure was predicted from gait features alone. The model was statistically significant and yielded an RMSE of 1.27 dB. With continued development, the model may be used to assess the physiological effects of repetitive blast exposure and guide training procedures to minimize impact on the individual.
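The feature construction described in this abstract (eigenvalues of correlation matrices built from time-delay embeddings at several delay scales) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the number of delays, and the delay spacings are all assumptions, and NumPy is used only for the correlation and eigenvalue computations.

```python
import numpy as np


def delay_embed(x, n_delays, spacing):
    """Stack n_delays time-shifted copies of every channel.

    x has shape (channels, samples); the result has shape
    (channels * n_delays, samples - (n_delays - 1) * spacing).
    """
    n_channels, n_samples = x.shape
    length = n_samples - (n_delays - 1) * spacing
    if length <= 1:
        raise ValueError("signal too short for this embedding")
    rows = [
        x[c, d * spacing : d * spacing + length]
        for c in range(n_channels)
        for d in range(n_delays)
    ]
    return np.vstack(rows)


def correlation_eigenspectrum(x, n_delays=15, spacing=1):
    """Eigenvalues (descending) of the correlation matrix of the embedding."""
    embedded = delay_embed(x, n_delays, spacing)
    corr = np.corrcoef(embedded)      # rows are treated as variables
    eigvals = np.linalg.eigvalsh(corr)  # symmetric matrix, ascending order
    return eigvals[::-1]


def multiscale_features(x, n_delays=15, spacings=(1, 3, 7)):
    """Concatenate eigenspectra over several delay scales (values assumed)."""
    return np.concatenate(
        [correlation_eigenspectrum(x, n_delays, s) for s in spacings]
    )


# Hypothetical three-axis signal, used only to exercise the functions.
rng = np.random.default_rng(0)
signal = rng.standard_normal((3, 1000))
print(multiscale_features(signal).shape)
```

The eigenvalues of a correlation matrix sum to its dimension, so a spectrum concentrated in a few large eigenvalues indicates lower-dimensional, more tightly coupled movement.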
... A general approach was developed that quantifies movement dimensionality in this way, using correlation patterns among multichannel feature sets. This approach has been used in many studies to detect and estimate alterations in neuromotor coordination from speech [23][24][25][26], including motor and cognitive symptoms due to Parkinson's disease [21,22]. The approach has also been used for detecting alterations in torso accelerations during gait due to load carriage [20] and mild traumatic brain injury [25], as well as for detecting alterations in hand movements during drawing due to autism [26]. ...
... The accelerometry analysis approaches (movement dispersion and dimensionality) have been validated previously on torso-worn accelerometry data [20,25,43]. The movement dimensionality approach has also been applied to many alternative modalities, including the analysis of speech differences due to PD [21,22]. As the primary focus of this work was on activity segmentation and movement characterization, a single standard machine learning approach was adopted to quantify the effectiveness of the features in detecting PD. ...
Parkinson’s disease (PD) is a chronic movement disorder that produces a variety of characteristic movement abnormalities. The ubiquity of wrist-worn accelerometry suggests a possible sensor modality for early detection of PD symptoms and subsequent tracking of PD symptom severity. As an initial proof of concept for this technological approach, we analyzed the U.K. Biobank data set, consisting of one week of wrist-worn accelerometry from a population with a PD primary diagnosis and an age-matched healthy control population. Measures of movement dispersion were extracted from automatically segmented gait data, and measures of movement dimensionality were extracted from automatically segmented low-movement data. Using machine learning classifiers applied to one week of data, PD was detected with an area under the curve (AUC) of 0.69 on gait data, AUC = 0.84 on low-movement data, and AUC = 0.85 on a fusion of both activities. It was also found that classification accuracy steadily improved across the one-week data collection, suggesting that higher accuracy could be achievable from a longer data collection. These results suggest the viability of using a low-cost and easy-to-use activity sensor for detecting movement abnormalities due to PD and motivate further research on early PD detection and tracking of PD symptom severity.
... Fig. 2. This results in a final accuracy of 95%. It is crucial to emphasize that, unlike standard dysarthria detection approaches [18,19,20], no specific training has been performed: we directly infer the speaker's health status from the similarity coefficients α, which provide an interpretable model. Note that no existing SA method is able to also perform dysarthria detection. ...
This work addresses the mismatch between the distribution of training data (source) and testing data (target) in the challenging context of dysarthric speech recognition. We focus on Speaker Adaptation (SA) in command speech recognition, where data from multiple sources (i.e., multiple speakers) are available. Specifically, we propose an unsupervised Multi-Source Domain Adaptation (MSDA) algorithm based on optimal transport, called MSDA via Weighted Joint Optimal Transport (MSDA-WJDOT). We achieve a Command Error Rate relative reduction of 16% and 7% over the speaker-independent model and the best competitor method, respectively. The strength of the proposed approach is that, unlike any other existing SA method, it offers an interpretable model that can also be exploited, in this context, to diagnose dysarthria without any specific training. Indeed, it provides a closeness measure between the target and the source speakers, reflecting their similarity in terms of speech characteristics. Based on the similarity between the target speaker and the healthy/dysarthric source speakers, we then define the healthy/dysarthric score of the target speaker, which we leverage to perform dysarthria detection. This approach does not require any additional training and achieves 95% accuracy in dysarthria diagnosis.
... Therefore, a rapid and objective dysarthria detection procedure could help the therapist in the diagnosis. In recent years, the research community has started to look at dysarthria detection by learning a mapping from the acoustic features to the text label [4,5]. In [6], the authors proposed an interpretable DNN model in which they added an intermediate layer that acts as a bottleneck feature extractor providing nasality, vocal quality, articulatory precision and prosody features. ...
In many real-world applications, the mismatch between distributions of training data (source) and test data (target) significantly degrades the performance of machine learning algorithms. In speech data, causes of this mismatch include different acoustic environments or speaker characteristics. In this paper, we address this issue in the challenging context of dysarthric speech, by multi-source domain/speaker adaptation (MSDA/MSSA). Specifically, we propose the use of an optimal-transport based approach, called MSDA via Weighted Joint Optimal Transport (MSDA-WJDOT). We confront the mismatch problem in dysarthria detection, for which the proposed approach outperforms both the baseline and the state-of-the-art MSDA models, improving the detection accuracy by 0.9% over the best competitor method. We then employ MSDA-WJDOT for dysarthric speaker adaptation in command speech recognition. This provides a Command Error Rate relative reduction of 16% and 7% over the baseline and the best competitor model, respectively. Interestingly, MSDA-WJDOT provides a similarity score between the source and the target, i.e. between speakers in this case. We leverage this similarity measure to define a Dysarthric and Healthy score of the target speaker and diagnose dysarthria with an accuracy of 95%.
... INTERSPEECH, an international conference in the field of speech signal processing, held two challenges on pathological speech research, in 2012 [27] and 2015 [28]. However, the reported features are mostly drawn from the spectrum, prosodic features [29][30][31][32], and glottal features [33], and lack articulatory features. ...
The exploration of pathological articulation, especially the study of the kinematic characteristics of the motor organs, helps to further reveal the nature of motor dysarthria. Because pathological pronunciation databases are scarce, little research has addressed the statistical distribution analysis of patients and normal controls. This paper applies a distribution method to the TORGO database to discover the cognitive and motor patterns of dysarthric patients. Single-phoneme analysis is effective for locating the specific tongue muscle involved but ignores the assessment of cognitive ability, particularly patients' understanding of content and fluency of expression. The paper therefore focuses on the word/sentence level rather than on single-phoneme analysis. Reaction time was designed to reveal the relationship between brain cognition and motor neuron activation. The statistical distribution indicates that cerebral palsy or amyotrophic lateral sclerosis does affect people's reactions and makes it hard for patients to control the tongue muscles effectively, resulting in unstable reaction times. The articulation velocity of patients, at 85 mm/s, appears to be 5 mm/s faster than that of normal speakers, perhaps owing to the word/sentence data and the large proportion of extra-large displacements. This illustrates that the tongue moves relatively coherently and fluently once patients activate the muscles, but is hard to slow down as muscle control ability decreases. Spatial occupancy was represented by the maximum articulation movement range (MAMR). We adapted the log-normal distribution to find the significance threshold for the diagnosis of dysarthria: an MAMR exceeding 7 mm in the left-right direction and a number of abnormal ranges surpassing 10% of the total. A preliminary test of MAMR as an articulatory feature for speech classification was carried out and achieved 81% accuracy.
These explorations convinced us to apply the features to the pathological speech recognition task in future work.
... A similar reduction, but in the acoustic feature space, has also been found in depressed speech [29]. The aforementioned feature sets have also been trialled for other disorders such as Parkinson's disease [30], PTSD (post-traumatic stress disorder) [28], Alzheimer's disease, and schizophrenia, which, however, have received less attention than depression in automatic systems. ...
The massive and growing burden imposed on modern society by depression has motivated investigations into early detection through automated, scalable and non-invasive methods, including those based on speech. However, speech-based methods that capture articulatory information effectively across different recording devices and in naturalistic environments are still needed. This article proposes two feature sets associated with speech articulation events based on counts and durations of sequential landmark groups or n-grams. Statistical analysis of the duration-based features reveals that durations from several consecutive landmark bigrams and onset-offset landmark pairs are significant in discriminating depressed from non-depressed speakers. In addition to investigating different normalization approaches and values of n for landmark n-gram features, experiments across different elicitation tasks suggest that the features can be tailored to capture different articulatory aspects of depressed voices. Evaluations of both landmark duration features and landmark n-gram features on the DAIC-WOZ and SH2 datasets show that they are highly effective, either alone or fused, relative to existing approaches.
To assist the clinical diagnosis and treatment of neurological diseases that cause speech dysarthria such as Parkinson's disease (PD), it is of paramount importance to craft robust features which can be used to automatically discriminate between healthy and dysarthric speech. Since dysarthric speech of patients suffering from PD is breathy, semi-whispery, and is characterized by abnormal pauses and imprecise articulation, it can be expected that its spectro-temporal sparsity differs from the spectro-temporal sparsity of healthy speech. While we have recently successfully used temporal sparsity characterization for dysarthric speech detection, characterizing spectral sparsity poses the challenge of constructing a valid feature vector from signals with a different number of unaligned time frames. Further, although several non-parametric and parametric measures of sparsity exist, it is unknown which sparsity measure yields the best performance in the context of dysarthric speech detection. The objective of this paper is to demonstrate the advantages of spectro-temporal sparsity characterization for automatic dysarthric speech detection. To this end, we first provide a numerical analysis of the suitability of different non-parametric and parametric measures (i.e., $\ell_1$-norm, kurtosis, Shannon entropy, Gini index, shape parameter of a Chi distribution, and shape parameter of a Weibull distribution) for sparsity characterization. It is shown that kurtosis, the Gini index, and the parametric sparsity measures are advantageous sparsity measures, whereas the $\ell_1$-norm and entropy measures fail to robustly characterize the temporal sparsity of signals with a different number of time frames. Second, we propose to characterize the spectral sparsity of an utterance by initially time-aligning it to the same utterance uttered by an (arbitrarily selected) reference speaker using dynamic time warping. 
Experimental results on a Spanish database of healthy and dysarthric speech show that estimating the spectro-temporal sparsity using the Gini index or the parametric sparsity measures and using it as a feature in a support vector machine results in a high classification accuracy of 83.3%.
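As a small illustration of one of the sparsity measures listed in this abstract, the sketch below computes the Gini index of a coefficient vector using the widely used sorted-magnitude definition. Whether the cited paper uses exactly this normalization is an assumption, and the function name `gini_index` and the example vectors are hypothetical.

```python
def gini_index(coefficients):
    """Gini index of a vector as a sparsity measure (0 = flat, near 1 = sparse).

    Assumed definition, with magnitudes sorted in ascending order
    |c_(1)| <= ... <= |c_(N)|:
        G = 1 - 2 * sum_k (|c_(k)| / ||c||_1) * (N - k + 0.5) / N
    """
    magnitudes = sorted(abs(c) for c in coefficients)
    n = len(magnitudes)
    l1_norm = sum(magnitudes)
    if n == 0 or l1_norm == 0:
        raise ValueError("Gini index is undefined for an empty or all-zero vector")
    weighted = sum(
        (m / l1_norm) * (n - k + 0.5) / n
        for k, m in enumerate(magnitudes, start=1)
    )
    return 1.0 - 2.0 * weighted


# Hypothetical magnitude vectors: one flat, one with a single active bin.
flat = [1.0, 1.0, 1.0, 1.0]
peaky = [0.0, 0.0, 0.0, 1.0]
print(gini_index(flat), gini_index(peaky))
```

Because the measure is normalized by the l1 norm, it is invariant to scaling of the input, and repeating a vector end to end leaves its value unchanged, which matters when the inputs have different numbers of time frames.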
In Major Depressive Disorder (MDD), neurophysiologic changes can alter motor control [1, 2] and therefore alter speech production by influencing the characteristics of the vocal source, tract, and prosodics. Clinically, many of these characteristics are associated with psychomotor retardation, where a patient shows sluggishness and motor disorder in vocal articulation, affecting coordination across multiple aspects of production [3, 4]. In this paper, we exploit such effects by selecting features that reflect changes in coordination of vocal tract motion associated with MDD. Specifically, we investigate changes in correlation that occur at different time scales across formant frequencies and also across channels of the delta-mel-cepstrum. Both feature domains provide measures of coordination in vocal tract articulation while reducing effects of a slowly-varying linear channel, which can be introduced by time-varying microphone placements. With these two complementary feature sets, using the AVEC 2013 depression dataset, we design a novel Gaussian mixture model (GMM)-based multivariate regression scheme, referred to as Gaussian Staircase Regression, that provides a root-mean-squared-error (RMSE) of 7.42 and a mean-absolute-error (MAE) of 5.75 on the standard Beck depression rating scale. We are currently exploring coordination measures of other aspects of speech production, derived from both audio and video signals.
Conference Paper
Parkinson's disease (PD) is the second most prevalent neurodegenerative disorder after Alzheimer's, affecting about 1% of people older than 65; about 89% of people with PD develop speech disorders. Researchers are currently analyzing the speech of people with PD along different dimensions such as phonation, articulation, prosody, and intelligibility. The study of phonation and articulation has been addressed mainly considering sustained vowels; however, the analysis of prosody and intelligibility requires the inclusion of words, sentences and monologue. In this paper we present a new database with speech recordings of 50 patients with PD and their respective healthy controls, matched by age and gender. All of the participants are Spanish native speakers and the recordings were collected following a protocol that considers both technical requirements and several recommendations given by experts in linguistics, phoniatry and neurology. This corpus includes tasks such as sustained phonations of the vowels, diadochokinetic evaluation, 45 words, 10 sentences, a reading text and a monologue. The paper also includes results of the characterization of the Spanish vowels considering different measures used in other works to characterize different speech impairments.
Of increasing importance in the civilian and military population is the recognition of major depressive disorder at its earliest stages and intervention before the onset of severe symptoms. Toward the goal of more effective monitoring of depression severity, we introduce vocal biomarkers that are derived automatically from phonologically-based measures of speech rate. To assess our measures, we use a 35-speaker free-response speech database of subjects treated for depression over a 6-week duration. We find that dissecting average measures of speech rate into phone-specific characteristics and, in particular, combined phone-duration measures uncovers stronger relationships between speech rate and depression severity than global measures previously reported for a speech-rate biomarker. Results of this study are supported by correlation of our measures with depression severity and classification of depression state with these vocal measures. Our approach provides a general framework for analyzing individual symptom categories through phonological units, and supports the premise that speaking rate can be an indicator of psychomotor retardation severity.
The purpose of this study was to analyze vowel articulation across various speaking tasks in a group of 20 early Parkinson's disease (PD) individuals prior to pharmacotherapy. Vowels were extracted from sustained phonation, sentence repetition, reading passage, and monologue. Acoustic analysis was based upon measures of the first (F1) and second (F2) formant of the vowels /a/, /i/, and /u/, vowel space area (VSA), F2i/F2u and vowel articulation index (VAI). Parkinsonian speakers manifested abnormalities in vowel articulation across F2u, VSA, F2i/F2u, and VAI in all speaking tasks except sustained phonation, compared to 15 age-matched healthy control participants. Findings suggest that sustained phonation is an inappropriate task to investigate vowel articulation in early PD. In contrast, monologue was the most sensitive in differentiating between controls and PD patients, with classification accuracy up to 80%. Measurements of vowel articulation were able to capture even minor abnormalities in speech of PD patients with no perceptible dysarthria. In conclusion, impaired vowel articulation may be considered as a possible early marker of PD. A certain type of speaking task can exert significant influence on vowel articulation. Specifically, complex tasks such as monologue are more likely to elicit articulatory deficits in parkinsonian speech, compared to other speaking tasks.
The "Lombard effect" describes how humans modify their speech in noisy environments to make it more intelligible. The present work analyzes Normal and Lombard speech from multiple speakers in an unsupervised context, using meaningful acoustic criteria for speech classification (according to voicing and stationarity) and evaluation (using loudness and intelligibility). These acoustic analyses using generalized classes offer alternative and informative interpretations of the Lombard effect. For example, the Lombard increase in intelligibility is shown to be isolated primarily to voiced speech. Also, while transients are shown to be less intelligible overall, the Lombard effect does not appear to distinguish between stationary and transient speech. In addition to these analyses, following recently published results illustrating that Lombard spectral modifications account for the largest increases in intelligibility, this work also examines spectral envelope transformation to improve speech intelligibility. In particular, speaker-dependent Normal-to-Lombard correction filters are estimated and, when applied in transformation, shown to yield higher overall objective intelligibility than Normal, and even Lombard, speech.
Conference Paper
Speech analysis has shown potential for identifying neurological impairment. With brain trauma, changes in brain structure or connectivity may result in changes in source, prosodic, or articulatory aspects of voice. In this work, we examine the articulatory components of speech reflected in formant tracks, and how changes in track dynamics and coordination map to cognitive decline. We address a population of athletes regularly receiving impacts to the head and showing signs of preclinical mild traumatic brain injury (mTBI), a state indicated by impaired cognitive performance occurring prior to concussion. We hypothesize that this preclinical damage results in 1) changes in average vocal tract dynamics measured by formant frequencies, their velocities, and acceleration, and 2) changes in articulatory coordination measured by a novel formant-frequency cross-correlation characterization. These features allow machine learning algorithms to detect preclinical mTBI identified by a battery of cognitive tests. A comparison is performed of the effectiveness of vocal tract dynamics features versus articulatory coordination features. This evaluation is done using receiver operating characteristic (ROC) curves along with confidence bounds. The articulatory dynamics features achieve area under the ROC curve (AUC) values between 0.72 and 0.98, whereas the articulatory coordination features achieve AUC values between 0.94 and 0.97.
Conference Paper
In individuals with major depressive disorder, neurophysiological changes often alter motor control and thus affect the mechanisms controlling speech production and facial expression. These changes are typically associated with psychomotor retardation, a condition marked by slowed neuromotor output that is behaviorally manifested as altered coordination and timing across multiple motor-based properties. Changes in motor outputs can be inferred from vocal acoustics and facial movements as individuals speak. We derive novel multi-scale correlation structure and timing feature sets from audio-based vocal features and video-based facial action units from recordings provided by the 4th International Audio/Video Emotion Challenge (AVEC). The feature sets enable detection of changes in coordination, movement, and timing of vocal and facial gestures that are potentially symptomatic of depression. Combining complementary features in Gaussian mixture model and extreme learning machine classifiers, our multivariate regression scheme predicts Beck depression inventory ratings on the AVEC test set with a root-mean-square error of 8.12 and mean absolute error of 6.31. Future work calls for continued study into detection of neurological disorders based on altered coordination and timing across audio and video modalities.
Consonant articulation patterns of 200 Parkinson patients were defined by two expert listeners from high fidelity tape recordings of the sentence version of the Fisher-Logemann Test of Articulation Competence (1971). Phonetic transcription and phonetic feature analysis were the methodologies used. Of the 200 patients, 90 (45%) exhibited some misarticulations. Phonetic data on these 90 dysarthric Parkinson patients revealed articulatory errors highly consistent in detailed production characteristics. Manner changes predominated. Phoneme classes that were most affected were the stop-plosives, affricates, and fricatives. In terms of perception features (Chomsky & Halle, 1968), the stop-plosives and affricates, which are normally [– continuant], were produced as [+ continuant] fricatives; fricatives that are [+ strident] were produced as [– strident]. There is no implication, however, that Parkinsonism involves a perception deficit. Analysis of the articulatory deficit reveals inadequate tongue elevation to achieve complete closure on stop-plosives and affricates, which can be expressed in production features as a change from [+ stop] to [+ fricative]. There was also inadequate close constriction of the airway in lingual fricatives, which in articulatory features can be expressed as a change from [+ fricative] to [– fricative]. Both the incomplete contact for stops and the partial constriction for fricatives represent an inadequate narrowing of the vocal tract at the point of articulation. These results are discussed in relation to recent EMG studies and other physiologic examinations of Parkinsonian dysarthria.
Conference Paper
We present recent developments in the openSMILE feature extraction toolkit. Version 2.0 now unites feature extraction paradigms from speech, music, and general sound events with basic video features for multi-modal processing. Descriptors from audio and video can be processed jointly in a single framework allowing for time synchronization of parameters, on-line incremental processing as well as off-line and batch processing, and the extraction of statistical functionals (feature summaries), such as moments, peaks, regression parameters, etc. Postprocessing of the features includes statistical classifiers such as support vector machine models or file export for popular toolkits such as Weka or HTK. Available low-level descriptors include popular speech, music and video features including Mel-frequency and similar cepstral and spectral coefficients, Chroma, CENS, auditory model based loudness, voice quality, local binary pattern, color, and optical flow histograms. Besides, voice activity detection, pitch tracking and face detection are supported. openSMILE is implemented in C++, using standard open source libraries for on-line audio and video input. It is fast, runs on Unix and Windows platforms, and has a modular, component based architecture which makes extensions via plug-ins easy. openSMILE 2.0 is distributed under a research license and can be downloaded from