Segment-dependent dynamics in predicting Parkinson’s disease
James R. Williamson1, Thomas F. Quatieri 1, Brian S. Helfer1, Joseph Perricone1,
Satrajit S. Ghosh2, Gregory Ciccarelli1, Daryush D. Mehta1
1 MIT Lincoln Laboratory, Lexington, Massachusetts, USA
2 Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
jrw@ll.mit.edu, quatieri@ll.mit.edu, brian.helfer@ll.mit.edu, joey.perricone@ll.mit.edu,
satra@mit.edu, gregory.ciccarelli@ll.mit.edu, daryush.mehta@ll.mit.edu
Abstract
Early, accurate detection of Parkinson’s disease may aid in
possible intervention and rehabilitation. Thus, simple
noninvasive biomarkers are desired for determining severity.
In this study, a novel set of acoustic speech biomarkers is
introduced and fused with conventional features for predicting
clinical assessment of Parkinson’s disease. We introduce
acoustic biomarkers reflecting the segment dependence of
changes in speech production components, motivated by
disturbances in underlying neural motor, articulatory, and
prosodic brain centers of speech. Such changes occur at
phonetic and larger time scales, including multi-scale
perturbations in formant frequency and pitch trajectories, in
phoneme durations and their frequency of occurrence, and in
temporal waveform structure. We also introduce articulatory
features based on a neural computational model of speech
production, the Directions into Velocities of Articulators
(DIVA) model. The database used is from the Interspeech
2015 Computational Paralinguistic Challenge. By fusing
conventional and novel speech features, we obtain Spearman
correlations between predicted scores and clinical assessments
of r = 0.63 on the training set (four-fold cross validation),
r = 0.70 on a held-out development set, and r = 0.97 on a held-
out test set.
Index Terms: Parkinson’s disease, speech biomarkers,
phoneme and pause duration, articulatory coordination, neural
computational models of motor control
1. Introduction
Parkinson’s disease is a neurological disorder with associated
progressive decline in motor precision and sensorimotor
integration stemming presumably from the basal ganglia. In
this disorder, there is a steady loss of cells in the midbrain,
leading to speech impairment in nearly 90% of subjects [1].
Speech and voice characteristics of Parkinson’s disease
include imprecise and incoordinated articulation, monotonous
and reduced pitch and loudness, variable speech rate and
rushes of breath and pause segments, breathy and harsh voice
quality, and changes in intonation and rhythm [2][3][4][5][6].
*This work is sponsored by the Assistant Secretary of Defense for
Research & Engineering under Air Force contract #FA8721-05-C-
0002. Opinions, interpretations, conclusions, and recommendations
are those of the authors and are not necessarily endorsed by the United
States Government. SG was partly supported by a joint collaboration
between the McGovern Institute for Brain Research and MIT LL.
Early and accurate detection of Parkinson’s disease can aid
in possible intervention and rehabilitation. Thus, simple
noninvasive biomarkers are desired for determining severity of
the condition. In this paper, a novel set of speech biomarkers is
introduced for predicting the clinical assessment of the
Parkinson’s disease severity. Specifically, we introduce speech
biomarkers representing essential speech production
mechanisms of phonation, articulation, and prosody, motivated
by changes that are known to occur at different time scales and
differentially across speech segments in the underlying neural
motor and prosodic brain centers of speech production.
With this motivation, our new features are designed based
on two basic tenets: (1) there is a phoneme dependence of the
dynamic speech characteristics that is altered in Parkinson’s
disease, and (2) vocal tract dynamics and stability are altered
in Parkinson’s disease. We exploit the phoneme dependence of
durations and pitch and formant slopes, consistent with
findings that certain speech segments are more prone to
variation than others in Parkinson’s disease [7][8][9]. We also
exploit the decline in the precision of articulatory trajectories
over different time scales and tasks [10][11].
Our paper is organized as follows. In Section 2, we
describe the data collection and preprocessing approaches. In
Section 3, we describe our signal models and corresponding
signal processing methods for speech feature extraction.
Section 4 reports predictor types and prediction results.
Section 5 gives conclusions and projections toward future
work.
2. Parkinson’s database
2.1. Audio recordings
The Parkinson’s disease database used in the Interspeech 2015
Computational Paralinguistic Challenge is described in [1].
Assessments of Parkinson’s severity are based on the Unified
Parkinson’s Disease Rating Scale (UPDRS) [12]. The data set
is divided into 42 tasks per speaker, yielding 1470 recordings
in the training set (35 speakers) and 630 recordings in the
development set (15 speakers), both with UPDRS scores
provided. It also contains 462 recordings (11 speakers) in the
test set, without UPDRS scores provided. The duration of
recordings ranges from 0.24 seconds to 154 seconds.
2.2. Audio enhancement
To address noise in the test data, we use an adaptive
Wiener-filter approach that preserves the dynamic components
of a speech signal while reducing noise [13][14]. The
approach uses a measure of spectral change that allows robust
and rapid adaptation of the Wiener filter to speech and
background events. The approach reduces speech distortion by
using time-varying smoothing parameters, with constants
selected to produce less temporal smoothing in rapidly-
changing regions and greater smoothing in more stationary
regions.
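The enhancement algorithm of [13][14] is not publicly distributed. As a rough illustration only, the following sketch applies a Wiener-style gain whose temporal smoothing adapts to a frame-to-frame spectral-change measure; the noise estimate, change threshold, and smoothing constants are our assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.signal import stft, istft

def enhance(x, fs, noise_frames=10, alpha_fast=0.3, alpha_slow=0.9, change_thresh=50.0):
    """Wiener-style gain with time-varying smoothing: less smoothing where the
    spectrum changes rapidly, more in stationary regions (cf. [13][14])."""
    _, _, X = stft(x, fs=fs, nperseg=512)
    noise_psd = np.mean(np.abs(X[:, :noise_frames]) ** 2, axis=1, keepdims=True)  # assumed noise-only lead-in
    S = np.abs(X) ** 2
    gain = np.maximum(1.0 - noise_psd / np.maximum(S, 1e-12), 0.0)
    smoothed = np.copy(gain)
    for i in range(1, gain.shape[1]):
        # Frame-to-frame log-spectral change selects the smoothing constant.
        change = np.linalg.norm(np.log(S[:, i] + 1e-12) - np.log(S[:, i - 1] + 1e-12))
        a = alpha_fast if change > change_thresh else alpha_slow
        smoothed[:, i] = a * smoothed[:, i - 1] + (1 - a) * gain[:, i]
    _, y = istft(X * smoothed, fs=fs, nperseg=512)
    return y
```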
2.3. Test set annotation
Identification (ID) labels for 11 test subjects were
manually assigned to the monologue, the read-text, and the ten
sentence recordings by listening. A 128-component Universal
Background Model/Gaussian Mixture Model (UBM/GMM)
classifier was used [15], with a feature vector consisting of 16
Mel-frequency cepstral coefficients (MFCCs) and 16 delta-
MFCCs, to classify the remaining 330 recordings on the test
set. The UBM was trained from all 2,562 recordings in the
data set, and subject ID assignments on the test recordings
were obtained using 11 subject GMMs, adapted using
manually labeled tasks on the test set. Finally, manual
correction of 28 of the test subject assignments was done. The
final subject ID assignments were not perfect, with the number
of assigned recordings per subject ranging from 40 to 44.
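A simplified sketch of this speaker-ID pipeline follows, using scikit-learn in place of a full MAP-adapted UBM/GMM system [15]; warm-starting EM from the UBM is a crude stand-in for MAP adaptation, and all function names are ours.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(frame_lists, n_components=128):
    """Universal background model over pooled 32-dim MFCC + delta-MFCC frames."""
    ubm = GaussianMixture(n_components=n_components, covariance_type='diag')
    return ubm.fit(np.vstack(frame_lists))

def adapt_subject_gmm(ubm, subject_frames):
    """Stand-in for MAP adaptation [15]: a few EM steps warm-started at the UBM."""
    gmm = GaussianMixture(n_components=ubm.n_components, covariance_type='diag',
                          weights_init=ubm.weights_, means_init=ubm.means_, max_iter=5)
    return gmm.fit(np.vstack(subject_frames))

def assign_subject(recording_frames, subject_gmms):
    """Label a test recording with the subject whose GMM scores it highest."""
    return int(np.argmax([g.score(recording_frames) for g in subject_gmms]))
```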
3. Feature extraction
3.1. Overview of feature sets
Feature development was designed to reflect the three basic
aspects of speech production: Phonation (source), articulation
(vocal tract), and prosody (intonation and timing). The focus
of the features is to characterize variations in dynamics of
pitch (phonation), formants (articulation), and rate and
waveform (prosody) that reflect Parkinson’s severity.
Ten different feature sets are used, designed to
characterize changes in speech as a function of Parkinson’s
severity. Each feature set (FS) comprises a set of raw features,
described in Sections 3.2–3.4. Dimensionality reduction of all
FSs is done by z-scoring the raw features and then extracting
lower dimensional features using principal components
analysis (PCA). The PCA features provide input into
regression models that map each FS into a Parkinson’s
severity (UPDRS) prediction (see Section 4). The FSs,
summarized in Table 1, are categorized in terms of three
different classes: summary statistics, phoneme dependence,
and correlation structure.
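As a minimal sketch of this reduction step (assuming scikit-learn; per-FS component counts follow Table 1):

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def reduce_feature_set(raw_features, n_components):
    """z-score each raw feature, then keep the leading principal components.
    raw_features: (n_recordings, n_raw) matrix for one feature set."""
    return make_pipeline(StandardScaler(),
                         PCA(n_components=n_components)).fit_transform(raw_features)
```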
Four of the FSs (1, 3, 5, and 7) are effective on short
duration tasks, and are applied to all the recordings in the
dataset. For these FSs, the data is divided into six different
time bins, based on durations of recordings (see Table 2 for
details). Each statistical model is trained only on tasks of
similar duration. The remaining six FSs are designed to
capture longer duration speech dynamics, and are used to
characterize changes across subjects when speaking the same
sentence. For these FSs, the same three declarative sentences
(sentences 2, 4, and 6) are analyzed.
3.2. Summary statistics
FS 1: Delta-MFCC means. Mel-frequency cepstral
coefficients (MFCCs) are obtained with openSMILE at a 100
Hz frame rate [16]. Delta-MFCC coefficients are computed
using regression over two frames before and after each frame.
Then, the mean values of the 16 delta-MFCCs are computed
across all frames in each recording.
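A sketch of the standard delta computation this describes (regression over two frames on each side), followed by the per-recording mean; the function name is ours:

```python
import numpy as np

def delta(mfcc, N=2):
    """Delta coefficients via linear regression over N frames before and after
    each frame; mfcc has shape (n_frames, n_coeffs)."""
    padded = np.pad(mfcc, ((N, N), (0, 0)), mode='edge')
    T = len(mfcc)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return sum(n * (padded[N + n:T + N + n] - padded[N - n:T + N - n])
               for n in range(1, N + 1)) / denom

# FS 1: mean of each of the 16 delta-MFCCs across all frames of a recording.
# fs1 = delta(mfcc_16).mean(axis=0)
```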
Table 1. Feature Set (FS) summary. Three feature types
(summary statistics, phoneme dependence, and correlation
structure) are applied to time-binned and sentence data.

FS | Description                            | Data Types | # Raw Features | # PCA Features
 1 | Delta-MFCC means                       | Time bins  | 16             | 15
 2 | Loudness statistics                    | Sentences  | 2              | 1
 3 | Phn duration                           | Time bins  | 1              | 1
 4 | Phn dur. & pitch slope                 | Sentences  | 2              | 2
 5 | Phn dur. & formant slopes              | Time bins  | 3              | 2
 6 | Phn freq.                              | Sentences  | 7              | 1
 7 | Waveform corr. structure               | Time bins  | 600            | 6
 8 | Delta-MFCC corr. structure             | Sentences  | 864–912        | 3
 9 | Formant corr. structure                | Sentences  | 162–171        | 3
10 | Articulatory position corr. structure  | Sentences  | 324–342        | 3
FS 2: Loudness statistics. Loudness is computed from the
Perceived Evaluation of Audio Quality (PEAQ) algorithm
[17], a psychoacoustic measure of audio quality that has been
used for analysis of Lombard speech [18]. Loudness is the
average energy across critical auditory bands per frame; FS 2
comprises the mean and standard deviation of this feature.
3.3. Phoneme-based features
Changes in phoneme durations and frequencies and in
phoneme-dependent pitch and formant slopes reflect the
phonemic segment dependence of alterations in phonation and
articulation with Parkinson’s severity. Phonemic segments are
used, along with estimated pitch and formant frequency
contours, to generate several phoneme-based feature sets.
Using an automatic phoneme recognition algorithm [19],
phonemic boundaries are detected, with each segment labeled
with one of 40 phoneme classes. The fundamental frequency
(pitch) contour is estimated using an autocorrelation method
over a 40-ms Hanning window every 1 ms [20]. Formant
frequency contours are estimated using a Kalman filter that
smoothly tracks the first three spectral modes while also
smoothly coasting through non-speech regions [21].
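Pitch tracking here is done with Praat [20]; the toy sketch below shows only the core autocorrelation idea (40-ms Hanning window, 1-ms hop) and omits Praat's voicing decisions and path costs, with an arbitrary voicing threshold of our choosing:

```python
import numpy as np

def pitch_autocorr(x, fs, fmin=75.0, fmax=500.0):
    """Crude autocorrelation pitch contour: 40 ms Hanning window, 1 ms hop."""
    win, hop = int(0.040 * fs), int(0.001 * fs)
    lag_min, lag_max = int(fs / fmax), int(fs / fmin)
    f0 = []
    for start in range(0, len(x) - win, hop):
        frame = x[start:start + win] * np.hanning(win)
        r = np.correlate(frame, frame, mode='full')[win - 1:]   # lags >= 0
        lag = lag_min + int(np.argmax(r[lag_min:lag_max]))
        f0.append(fs / lag if r[lag] > 0.3 * r[0] else np.nan)  # simplistic voicing test
    return np.array(f0)
```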
For FS 3, 4, and 5, an aggregation step is performed in
which a subset of the 40 phoneme-based measures that are
most highly correlated with Parkinson’s severity (on the
training set) is linearly combined. In all cases, the top 10 most
highly correlated measures are combined using weights
w = sign(r)/(1 − r²) [22]. For measures derived from time-bin
data, the aggregation is done independently in each time bin.
For measures derived from sentences, aggregation is done
independently in each sentence.
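A sketch of this aggregation, assuming Spearman correlation for the ranking step (the paper does not state which correlation is used for selection):

```python
import numpy as np
from scipy.stats import spearmanr

def aggregate_measures(X, updrs, n_top=10):
    """X: (n_recordings, 40) phoneme-based measures for one time bin or
    sentence. Combine the n_top measures most correlated with UPDRS on the
    training set, weighted by w = sign(r) / (1 - r**2) as in [22]."""
    r = np.array([spearmanr(X[:, j], updrs, nan_policy='omit')[0]
                  for j in range(X.shape[1])])
    top = np.argsort(-np.abs(r))[:n_top]
    w = np.sign(r[top]) / (1.0 - r[top] ** 2)
    return X[:, top] @ w
```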
FS 3: Average phoneme duration. A linear fit is made of
the logarithm of pitch over time (within each phonemic
segment), yielding a pitch slope (Δlog(Hz)/s) for each
phonemic segment. Phoneme durations are then computed for
those segments where the pitch slope is marked as valid (i.e.,
where the absolute pitch slope is less than eight, indicating that
the slope is likely derived from a continuous pitch contour).
Average phoneme durations were used originally in
classifying depression severity [23].
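A minimal sketch of the per-segment pitch-slope fit and the validity screen described above (segment bookkeeping and names are ours):

```python
import numpy as np

def log_pitch_slope(times, f0):
    """Slope of a linear fit to log-pitch within one phonemic segment,
    in delta-log(Hz) per second; NaN if fewer than two voiced frames."""
    voiced = np.isfinite(f0)
    if voiced.sum() < 2:
        return np.nan
    return np.polyfit(times[voiced], np.log(f0[voiced]), 1)[0]

def valid_duration_slope_pairs(segments, slope_limit=8.0):
    """Keep (duration, slope) for segments with |slope| < 8, i.e. slopes
    likely fit to a continuous pitch contour (FS 3 and FS 4)."""
    pairs = []
    for times, f0, duration in segments:   # one tuple per phonemic segment
        s = log_pitch_slope(times, f0)
        if np.isfinite(s) and abs(s) < slope_limit:
            pairs.append((duration, s))
    return pairs
```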
FS 4: Average phoneme duration and pitch slope. In
addition to the average phoneme duration feature, an average
log-pitch slope is also used where pitch slopes are valid [22].
FS 5: Average phoneme duration and formant slopes.
Linear fits to formant frequencies f1 and f2, along with average
phoneme durations, are computed from all phoneme segments
regardless of pitch slope validity.
FS 6: Phoneme frequencies. The number of occurrences of
each phoneme is computed. These counts are normalized to
sum to one and sorted from highest to lowest. The top seven of
these phoneme frequency measures are used as features.
3.4. Correlation structure features
Measures of the structure of correlations among low-level
speech features have previously been applied in the estimation
of depression [22][24], the estimation of cognitive
performance associated with dementia [25], the detection of
changes in cognitive performance associated with mild
traumatic brain injury [26], and were first introduced for
analysis of EEG signals for epileptic seizure prediction [27].
Channel-delay correlation and covariance matrices are
computed from multiple time series. Each matrix contains
correlation or covariance coefficients between the channels at
multiple time delays. Changes over time in the coupling
strengths among the channel signals cause changes in the
eigenvalue spectra of the channel-delay matrices. The matrices
are computed at four separate time scales, in which successive
time delays correspond to different size frame spacings.
Overall power (logarithm of the trace) and entropy (logarithm
of the determinant) are extracted from the channel-delay
covariance matrices at each scale for FSs 8–10. A detailed
description of the correlation structure approach can be found
in [27] and its application to speech signals in [22][24].
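A compact sketch of the channel-delay computation, under our assumptions about the embedding conventions (exact frame handling in [27][22][24] may differ); with FS 7's parameters, five waveform channels embedded at 30 delays give 150 eigenvalues per scale, i.e., 600 raw features over four scales:

```python
import numpy as np

def channel_delay_features(channels, spacing, n_delays=30):
    """channels: (n_channels, n_samples) array of low-level time series.
    Stack each channel at n_delays delays spaced `spacing` samples apart,
    then return the eigenvalue spectrum of the channel-delay correlation
    matrix plus overall power (log trace) and entropy (log determinant)
    of the channel-delay covariance matrix."""
    n_ch, n = channels.shape
    span = spacing * (n_delays - 1)
    X = np.vstack([channels[c, d:n - span + d]
                   for c in range(n_ch)
                   for d in range(0, span + 1, spacing)])
    corr, cov = np.corrcoef(X), np.cov(X)
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
    _, logdet = np.linalg.slogdet(cov)
    return eigvals, np.log(np.trace(cov)), logdet
```

For FS 7, the channels would be the five 0.1 s waveform segments of a 0.5 s frame; for FSs 8–10, the 16 delta-MFCCs, 3 formant tracks, and 13 DIVA position states, respectively, with the per-scale spacings listed in the FS descriptions that follow.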
FS 7: Waveform correlation structure. Correlation
structure features are extracted directly from waveform
segments, thereby characterizing instability in temporal
envelope and phase structure (rhythm and regularity) on
different time scales. This technique is applied to 0.5 s
frames with 50% overlap. Each frame is divided into five 0.1 s
segments that are treated as separate channels. Utterances ≥
0.75 s in duration result in multiple frames; in these cases,
the average eigenvalue at each eigenvalue rank is computed
across frames. Features are computed at four scales with delay
spacings of 3, 7, 15, and 31, with 30 delays per scale. These
features reveal an association between Parkinson’s severity
and reduction in dynamical complexity, as illustrated in
Section 3.5.
FS 8: Delta-MFCC correlation structure. Correlation
structure features are derived from all 16 delta-MFCCs using
four delay scales with spacings 1, 3, 7, and 15, and using 15
delays per scale. Fewer delays are used for the largest scale on
sentences 2 and 4 due to their short duration.
FS 9: Formant correlation structure. Correlation structure
features are derived from three formant frequencies using the
same delay and scale parameters as FS 8.
FS 10: Positions of speech articulators correlation
structure. Here, we take advantage of a neurologically
plausible, fMRI-validated computational model of speech
production, the Directions into Velocities of Articulators
(DIVA) model [28]. The DIVA model takes as inputs the first
three formants and the fundamental frequency of a speech
utterance. Then, through an iterative learning process, the
model computes a set of synaptic weights that correspond to
different aspects of the speech production process including
articulatory commands and auditory and somatosensory
feedback errors. We hypothesize that Parkinsonian speech
results from impairments along the speech production
pathway, and therefore, when the model is trained on
Parkinsonian speech, the internal variables will reflect the
severity of the disorder. Correlation structure features are
derived from the DIVA model’s 13 time-varying articulatory
position states, which are sampled at 200 Hz. The same delay and
scale parameters are applied as with FS 8 and FS 9.
3.5. Example of discriminative value
In this section we show the discriminative value of the
waveform correlation structure features (FS 7). Figure 1 shows
channel-delay correlation matrices obtained from two women
with low and high Parkinson’s severity (training set files 154
and 303), both speaking the word “crema”. The channels are
five 0.1 s segments of the audio waveform. The matrix
elements are correlation coefficients between channels at
different relative time delays. These matrices are obtained at
the 3rd time scale, and so successive matrix elements
correspond to delays of 15 sub-frames. The matrix from the
low severity speaker exhibits more heterogeneity in the
correlation patterns, indicating higher waveform complexity.
This difference is quantified using matrix eigenvalues
(Figure 2, left panel), ordered largest to smallest. Low UPDRS
speech contains greater power in the small eigenvalues. This
effect is summarized across multiple speakers by plotting the
average eigenvalue for different ranges of Parkinson’s severity
at each rank. Figure 2 (right panel) shows the eigenvalue
averages (in standard units) from all training/development
waveforms < 0.5 s in duration for three UPDRS ranges. These
averages reveal distinct differences related to Parkinson’s
severity, even across multiple different speech tasks.
Figure 1: Correlation matrices from waveform-
segment channel-delay matrices for speakers with low
(left) and high (right) UPDRS, from utterances of “crema”.
Figure 2: Left: Eigenspectra for the utterance
“crema” from two female speakers with different
Parkinson’s severity. Right: Average eigenspectra for
low (blue), medium (green), and high (red) UPDRS.
4. Parkinson’s score prediction
4.1. Gaussian staircase regression models
There is a separate predictor for each FS, and the predictor
outputs are linearly combined to produce the system’s net
UPDRS prediction. Each predictor uses a Gaussian staircase
(GS) regression model [22][24], which comprises an ensemble
of six Gaussian classifiers, each trained on data with different
UPDRS partitions between Class 1 (lower UPDRS) and Class
2 (higher UPDRS). The Class 1 partitions are 0–15, 0–24,
0–33, 0–42, 0–51, and 0–60, and the Class 2 partitions are
the complement of these partitions. Additional regularization
of the densities is obtained by adding the constant 4 to the
diagonal elements of the data-normalized covariance matrices.
The GS output score is the log of the ratio of the summed
likelihoods from the two class ensembles of Gaussians. This is followed
by applying a 2nd-order univariate regression model, created
from the GS training scores, to the GS test score to generate a
UPDRS prediction.
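A sketch of the Gaussian staircase scorer, assuming z-scored input features and non-empty classes on both sides of every partition (edge cases are not handled):

```python
import numpy as np
from scipy.stats import multivariate_normal

CUTS = [15, 24, 33, 42, 51, 60]   # Class 1 = UPDRS in [0, cut]; Class 2 = the rest

def gaussian(X, diag_load=4.0):
    """Gaussian with the regularization of Section 4.1: a constant of 4 is
    added to the diagonal of the covariance of the (z-scored) data."""
    cov = np.cov(X, rowvar=False) + diag_load * np.eye(X.shape[1])
    return multivariate_normal(mean=X.mean(axis=0), cov=cov)

def staircase_score(X_train, y_train, x):
    """Log ratio of the summed likelihoods of the two class ensembles."""
    low = sum(gaussian(X_train[y_train <= c]).pdf(x) for c in CUTS)
    high = sum(gaussian(X_train[y_train > c]).pdf(x) for c in CUTS)
    return np.log(low / high)

# The UPDRS prediction is then a 2nd-order polynomial map fit on training
# scores, e.g. np.polyval(np.polyfit(train_scores, train_updrs, 2), score).
```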
4.2. Within-subject prediction averaging
For each FS, subject ID labels are the basis for computing,
across all tasks for that subject, a weighted prediction average
based on the tasks processed by each FS. The provided subject
IDs are used on the training/development sets, and the subject
IDs from our procedure (Section 2.3) are used on the test set.
Time-bin predictions. For FS 1, 3, 5 and 7, there are six
different regression models, one for each time bin. The utility
of each FS in each time bin is assessed based on the Spearman
correlation of its predictions on the training set (using 4-fold
cross validation). Within each FS and subject ID, the predictor
outputs are combined across time bins using weights
w = r²/(1 − r²) if r > 0 and w = 0 otherwise. Table 2 lists the normalized
weights for each FS, with weights summing to one in each
column. Observe that the FSs accumulate most of their
evidence from short duration tasks.
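A sketch of this per-subject weighting (array bookkeeping is ours):

```python
import numpy as np

def combine_time_bins(bin_predictions, bin_train_r):
    """Combine one FS's per-bin UPDRS predictions for one subject.
    bin_train_r: training-set Spearman correlations of each bin's predictor;
    weights are w = r**2 / (1 - r**2) when r > 0 and zero otherwise."""
    r = np.asarray(bin_train_r)
    w = np.where(r > 0, r ** 2 / (1.0 - r ** 2), 0.0)
    return float(np.dot(w / w.sum(), bin_predictions))
```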
Table 2. Weights used for combining predictions across
recording duration bins for feature sets (FS).

Time bin | FS 1 | FS 3 | FS 5 | FS 7
    1    | 0.54 | 0.00 | 0.00 | 0.32
    2    | 0.03 | 0.44 | 0.56 | 0.00
    3    | 0.23 | 0.56 | 0.11 | 0.03
    4    | 0.20 | 0.00 | 0.01 | 0.58
    5    | 0.00 | 0.00 | 0.00 | 0.01
    6    | 0.00 | 0.00 | 0.32 | 0.06
Sentence predictions. The remaining FSs use data from
matched sentences (sentences 2, 4 and 6), which range in
duration from 1.56 s to 9.63 s. The sentence tasks allow for
phonemic timing and correlation structure biomarkers that
have been used to predict various neurological states
[11][12][21][22][23]. Within each FS and test subject ID, the
GS training and test scores are averaged across the three
sentences. A 2nd-order univariate regression model obtains a
sentence-based prediction for each FS and test subject.
4.3. Fusing predictors
The ten FS predictors are applied individually to the
training set (four-fold cross-validation with held out speakers)
and to the development set (Table 3). We used linear
combinations of the predictors with three different weight
vectors w1, w2, and w3. With w1, each predictor is weighted
equally. To improve performance, we also applied differential
weighting of the ten predictors. To choose the weights, we
conducted a grid search over all 1,024 possible weight
combinations in which each weight can have a value of 1 or 2.
From this search, we obtained two different weight vectors
based on different constraints. The first weight vector is w2,
the weights that yield the highest average of 1) Spearman
correlation on the training set, 2) Spearman correlation on the
development set, and 3) a test metric score. The second weight
vector is w3, the weights that yield the highest test metric
score. The test metric is the Pearson correlation between
previous submission scores and the Spearman correlations
between the prediction vectors that had produced the previous
scores and a candidate prediction vector. The test metric
equals one if a candidate prediction vector has Spearman
correlation of one with the true UPDRS scores. To compute
this metric, we used ten previously obtained scores, which
range between 0.12 and 0.89. These scores are our previous
eight submissions and two Baseline results (Table 3, rows 1, 3
of [29]). After training on the combined train and development
sets, we submitted test results using w2 and w3, obtaining
Spearman correlations of r = 0.96 and r = 0.97, respectively.
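The grid search is small enough to enumerate. The sketch below scores each of the 1,024 candidate weight vectors by the mean of training and development Spearman correlations (for w2 the authors also average in the test-metric score, which is omitted here):

```python
import itertools
import numpy as np
from scipy.stats import spearmanr

def search_weights(train_preds, train_updrs, dev_preds, dev_updrs):
    """train_preds, dev_preds: (n_subjects, 10) per-FS predictions.
    Enumerate all weight vectors in {1, 2}**10 and return the one that
    maximizes mean Spearman r over the training and development sets."""
    best_r, best_w = -np.inf, None
    for w in itertools.product([1, 2], repeat=10):
        w = np.array(w)
        r = 0.5 * (spearmanr(train_preds @ w, train_updrs)[0] +
                   spearmanr(dev_preds @ w, dev_updrs)[0])
        if r > best_r:
            best_r, best_w = r, w
    return best_w
```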
Table 3. Performance for each feature set predictor and for
linear combinations of predictors.

Predictor | Train r | Devel. r | w1 | w2 | w3
    1     |  0.59   |  0.39    |  1 |  2 |  1
    2     |  0.29   |  0.35    |  1 |  2 |  2
    3     |  0.56   |  0.67    |  1 |  1 |  1
    4     |  0.24   |  0.37    |  1 |  1 |  1
    5     |  0.46   |  0.10    |  1 |  2 |  1
    6     |  0.10   |  0.09    |  1 |  1 |  1
    7     |  0.56   |  0.53    |  1 |  1 |  2
    8     |  0.43   |  0.49    |  1 |  2 |  1
    9     |  0.46   |  0.21    |  1 |  1 |  2
   10     |  0.39   |  0.25    |  1 |  1 |  1
   w1     |  0.61   |  0.75    |    |    |
   w2     |  0.62   |  0.78    |    |    |
   w3     |  0.63   |  0.70    |    |    |
5. Conclusions and discussion
We applied standard and novel speech features to predict the
severity of Parkinson’s disease. Our features capture segment-
based dynamics across phonemes, formant frequencies and
articulatory positions, based on an understanding of the effect
of Parkinson’s disease on speech production components. Our
ongoing work involves the further enhancement of current
features and establishment of new features, with emphasis on
approaches that provide a neurological basis for understanding
the effect of Parkinson’s disease on speech.
6. Acknowledgements
The authors thank Dr. Elizabeth Godoy and Dr. Christopher
Smalt for providing code and guidance on the extraction of
loudness features, and Dr. Douglas Sturim for helpful
discussions on automatic speaker identification.
7. References
[1] J. Orozco-Arroyave, J. Arias-Londono, J. Vargas-Bonilla, M.
González-Rátiva, and E. Nöth, “New Spanish speech corpus
database for the analysis of people suffering from Parkinson’s
disease,” in Proceedings of the 9th Language Resources and
Evaluation Conference (LREC), 2014, pp. 342–347.
[2] G. J. Canter, “Speech characteristics of patients with Parkinson’s
disease: I. Intensity, pitch, and duration,” Journal of Speech and
Hearing Disorders, vol. 28, no. 3, pp. 221–229, 1963.
[3] G. J. Canter, “Speech characteristics of patients with Parkinson’s
disease: II. Physiological support for speech.” Journal of Speech
and Hearing Disorders, vol. 30, no. 1, pp. 44–49, 1965.
[4] G. J. Canter, “Speech characteristics of patients with Parkinson’s
disease: III. Articulation, diadochokinesis, and over-all speech
adequacy,” Journal of Speech and Hearing Disorders, vol. 30,
no. 3, pp. 217–224, 1965.
[5] J. A. Logemann, H. B. Fisher, B. Boshes, and E. R. Blonsky,
“Frequency and cooccurrence of vocal tract dysfunctions in the
speech of a large sample of Parkinson patients,” Journal of
Speech and Hearing Disorders, vol. 43, no. 1, pp. 47–57, 1978.
[6] J. E. Sussman and K. Tjaden, “Perceptual measures of speech
from individuals with Parkinson’s disease and multiple sclerosis:
Intelligibility and beyond,” Journal of Speech, Language, and
Hearing Research, vol. 55, no. 4, pp. 1208–1219, 2012.
[7] J. A. Logemann and H. B. Fisher, “Vocal tract control in
Parkinson’s disease,” Journal of Speech and Hearing Disorders,
vol. 46, no. 4, pp. 348–352, 1981.
[8] S. Skodda and U. Schlegel, “Speech rate and rhythm in
Parkinson’s disease,” Movement Disorders, vol. 23, no. 7, pp.
985–992, 2008.
[9] S. Skodda, “Aspects of speech rate and regularity in Parkinson’s
disease,” Journal of the Neurological Sciences, vol. 310, no. 1–2,
pp. 231–236, 2011.
[10] S. G. Hoberman, “Speech techniques in aphasia and
Parkinsonism,” Journal - Michigan State Medical Society, vol.
57, no. 12, pp. 1720–1723, 1958.
[11] J. Rusz, R. Cmejla, T. Tykalova, H. Ruzickova, J. Klempir, V.
Majerova, J. Picmausova, J. Roth, and E. Ruzicka, “Imprecise
vowel articulation as a potential early marker of Parkinson’s
disease: Effect of speaking task,” The Journal of the Acoustical
Society of America, vol. 134, no. 3, pp. 2171–2181, 2013.
[12] C. G. Goetz, B. C. Tilley, S. R. Shaftman, G. T. Stebbins, S.
Fahn, P. Martinez-Martin, ... and N. LaPelle, “Movement
Disorder Society-sponsored revision of the Unified Parkinson’s
Disease Rating Scale (MDS-UPDRS): Scale presentation and
clinimetric testing results,” Movement Disorders, vol. 23, no. 15,
pp. 2129–2170, 2008.
[13] T. F. Quatieri and R. B. Dunn, “Speech enhancement based on
auditory spectral change,” in Proceedings of the IEEE
International Conference on Acoustics, Speech, and Signal
Processing, 2002, pp. I-257.
[14] T. F. Quatieri and R. A. Baxter, “Noise reduction based on
spectral change,” in Proceedings of the IEEE Workshop on
Applications of Signal Processing to Audio and Acoustics
(WASPAA), 1997.
[15] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker
verification using adapted Gaussian mixture models,” Digital
signal processing, vol. 10, no. 1, pp. 19–41, 2000.
[16] F. Eyben, F. Weninger, F. Gross, and B. Schuller, “Recent
developments in openSMILE, the Munich Open-Source
Multimedia Feature Extractor,” in Proceedings of ACM
Multimedia (MM), Barcelona, Spain, 2013, pp. 835–838.
[17] T. Thiede, W. C. Treurniet, R. Bitto, C. Schmidmer, T. Sporer, J.
G. Beerends, and C. Colomes, “PEAQ-The ITU standard for
objective measurement of perceived audio quality,” Journal of
the Audio Engineering Society, vol. 48, no. 1/2, pp. 3–29, 2000.
[18] E. Godoy and Y. Stylianou, “Unsupervised acoustic analyses of
normal and Lombard speech, with spectral envelope
transformation to improve intelligibility,” in 13th Annual
Conference of the International Speech Communication
Association, September 9–13, Portland, Oregon, Proceedings,
2012, pp. 1472–1475.
[19] W. Shen, C. White, T. J. Hazen, “A comparison of query-by-
example methods for spoken term detection,” in Proceedings of
the IEEE International Conference on Acoustics Speech and
Signal Processing, 2010.
[20] P. Boersma and D. Weenink, “Praat, a system for doing
phonetics by computer,” 2001.
[21] D. D. Mehta, D. Rudoy, and P. J. Wolfe, “Kalman-based
autoregressive moving average modeling and inference for
formant and antiformant tracking,” The Journal of the Acoustical
Society of America, vol. 132, no. 3, pp. 1732–1746, 2012.
[22] J. R. Williamson, T. F. Quatieri, B. S. Helfer, G. Ciccarelli, and
D. D. Mehta, “Vocal and facial biomarkers of depression based
on motor incoordination and timing,” in Proceedings of the 4th
ACM International Workshop on Audio/Visual Emotion
Challenge (AVEC), 2014, pp. 65–72.
[23] A. Trevino, T. F. Quatieri, and N. Malyska, “Phonologically-
based biomarkers for major depressive disorder,” EURASIP
Journal on Advances in Signal Processing, vol. 42, pp. 1–18,
2011.
[24] J. R. Williamson, T. F. Quatieri, B. S. Helfer, R. Horwitz, B. Yu,
and D. D. Mehta, “Vocal biomarkers of depression based on
motor incoordination,” in Proceedings of the 3rd ACM
International Workshop on Audio/Visual Emotion Challenge,
2013, pp. 41–48.
[25] B. Yu, T. F. Quatieri, J. R. Williamson, and J. Mundt,
“Prediction of cognitive performance in an animal fluency task
based on rate and articulatory markers,” in 15th Annual
Conference of the International Speech Communication
Association (INTERSPEECH), Singapore, Proceedings, 2014.
[26] B. S. Helfer, T. F. Quatieri, J. R. Williamson, L. Keyes, B.
Evans, W. N. Greene, J. Palmer, and K. Heaton, “Articulatory
dynamics and coordination in classifying cognitive change with
preclinical mTBI,” in 15th Annual Conference of the
International Speech Communication Association
(INTERSPEECH), Singapore, Proceedings, 2014.
[27] J. R. Williamson, D. Bliss, D. W. Browne, and J. T. Narayanan,
“Seizure prediction using EEG spatiotemporal correlation
structure,” Epilepsy and Behavior, vol. 25, no. 2, pp. 230–238,
2012.
[28] F. H. Guenther, S. S. Ghosh, and J. A. Tourville, “Neural
modeling and imaging of the cortical interactions underlying
syllable production,” Brain and Language, vol. 96, no. 3, pp.
280–301, 2006.
[29] B. Schuller, S. Steidl, A. Batliner, S. Hantke, F. Hönig, J. R.
Orozco-Arroyave, E. Nöth, Y. Zhang, F. Weninger, “The
INTERSPEECH 2015 Computational Paralinguistics Challenge:
Nativeness, Parkinson’s & Eating Condition,” in
INTERSPEECH 2015 – 16th Annual Conference of the
International Speech Communication Association, September
6–10, Dresden, Germany, Proceedings, 2015.
We present recent developments in the openSMILE feature extraction toolkit. Version 2.0 now unites feature extraction paradigms from speech, music, and general sound events with basic video features for multi-modal processing. Descriptors from audio and video can be processed jointly in a single framework allowing for time synchronization of parameters, on-line incremental processing as well as off-line and batch processing, and the extraction of statistical functionals (feature summaries), such as moments, peaks, regression parameters, etc. Postprocessing of the features includes statistical classifiers such as support vector machine models or file export for popular toolkits such as Weka or HTK. Available low-level descriptors include popular speech, music and video features including Mel-frequency and similar cepstral and spectral coefficients, Chroma, CENS, auditory model based loudness, voice quality, local binary pattern, color, and optical flow histograms. Besides, voice activity detection, pitch tracking and face detection are supported. openSMILE is implemented in C++, using standard open source libraries for on-line audio and video input. It is fast, runs on Unix and Windows platforms, and has a modular, component based architecture which makes extensions via plug-ins easy. openSMILE 2.0 is distributed under a research license and can be downloaded from http://opensmile.sourceforge.net/.