A Comparison of GMM-HMM and DNN-HMM Based Pronunciation Verification Techniques for Use in the Assessment of Childhood Apraxia of Speech

Mostafa Shahin 1, Beena Ahmed 1, Jacqueline McKechnie 2, Kirrie Ballard 2, Ricardo Gutierrez-Osuna 3
1 Dept. of Electrical and Computer Engineering, Texas A&M University, Doha, Qatar
2 Faculty of Health Sciences, The University of Sydney, Sydney, Australia
3 Dept. of Computer Science and Engineering, Texas A&M University, College Station, Texas
Abstract
This paper introduces a pronunciation verification method for use in an automated tool for the assessment and therapy of disordered child speech. The proposed method creates a phone-based search lattice that is flexible enough to cover all probable mispronunciations. This allows us to verify the correctness of the pronunciation and detect the incorrect phonemes produced by the child. We compare two different acoustic models, the conventional GMM-HMM and the hybrid DNN-HMM. Results show that the hybrid DNN-HMM outperforms the conventional GMM-HMM in all experiments on both normal and disordered speech. The total correctness accuracy of the system at the phoneme level is above 85% when used with disordered speech.
Index Terms: Pronunciation verification, speech therapy, automatic speech recognition, computer-aided pronunciation learning, deep learning
1. Introduction
Language production and speech articulation can be delayed in
children due to developmental disabilities and neuromotor
disorders such as childhood apraxia of speech (CAS) [1].
Traditional CAS therapy requires a child to undergo extended therapy sessions with a trained speech language pathologist (SLP) in a clinic; this can be both logistically and financially prohibitive. Interactive and automatic speech monitoring tools that can be used remotely by children in their own homes offer a practical, adaptive and cost-effective complement to face-to-face intervention sessions with an SLP.
A number of technology-based tools have been
developed to facilitate general speech therapy, but a very
limited number of them target the specific articulation
problems of children with CAS [2], [3], [4]. The intuitive and
engaging environment provided by tablets and smartphones
has led to the development of generic speech therapy
applications for mobile devices [5], [6]. The main drawback of
all these systems is the absence of automatic feedback, which
makes it hard to adapt the therapy regimen based on the
specific needs of each child.
There has been limited success in incorporating
automatic speech recognition (ASR) systems into speech
therapy tools. This is due to the higher error rates that ASR systems still exhibit for developing children, caused by variations in vocal tract length, formant frequencies, pronunciation and grammar. Perceptual evaluations of apraxic speakers can be
inconsistent and prone to error [7]. The Speech Training,
Assessment, and Remediation system (STAR) [8] evaluates
phoneme production by calculating the likelihood ratio
produced by aligning the subject’s speech using the target
phoneme and alternative phonemes. In Vocaliza [9], a set of
confidence measures are used to score the phoneme
pronunciation level. Both systems decide whether the
phoneme was pronounced correctly or incorrectly without
actually detecting the errors made by the child. ASR has also
been used widely in the area of second language learning. As an example, Kim et al. [10] defined a set of rules describing the mispronunciations expected from native Korean speakers pronouncing English words and used them to detect pronunciation errors. In Hafss [11], a search lattice was
created from all probable pronunciation variants and fed to a
speech decoder to identify errors in Quranic Arabic.
In our previous work [12], we proposed an automated therapy tool for children with CAS. The proposed system consists
of 1) a clinician interface where the SLP can create and assign
exercises to different children and monitor each child’s
progress, 2) a tablet-based mobile application which prompts
the child with the assigned exercises and records their speech,
and 3) a speech processing module installed on a server that
receives the recorded speech, analyzes it and provides
feedback to the SLP with the assessment results. The SLP can
then update the exercises assigned to each child as per the
feedback received. The speech processing module consists of
multiple components that specialize in identifying the types of
errors made by children with CAS. In [13] we presented a
lexical stress classifier to detect prosodic errors.
In this paper, we enhance our earlier pronunciation verification method [12] by creating a search lattice that contains all the expected mispronunciation phonemes and includes a garbage model to absorb any unexpectedly inserted phonemes. We also use a penalty value in both the alternative and garbage paths to control the strictness of the system. We
compare the performance of two different acoustic models, the
conventional GMM-HMM and the hybrid DNN-HMM [14],
which has been reported to outperform the conventional
GMM-HMM model in other applications [15], [16]
particularly with smaller training datasets [17]. The proposed
method allows us to verify the correctness of phoneme
pronunciation with higher accuracy than previous
pronunciation verification systems [8], [9] and provides a
mechanism to detect the error type (insertion, deletion or
substitution) made, if any.
The remainder of this paper is structured as follows.
Section 2 describes the method and the speech corpora used. Section 3 presents the experiments performed and the results. Finally, the conclusions are summarized in Section 4.
2. Methods
2.1. System description
In speech therapy for CAS, the child is asked to produce a set
of phonemes (utterance) based upon their prescribed regime
using either a visual or oral prompt. In our automated speech
therapy tool [12], the goal of the pronunciation verification
module is to compare the pronunciation of each phoneme in
the child’s production to the prompt, and identify any
mispronounced phonemes. This is achieved by creating a
search lattice for each word prompt in the therapy protocol.
The paths in the search lattice are based on rules of expected
mispronunciations made in the children’s productions as
developed by an SLP after assessing 20 children with CAS.
Figure 1 shows the block diagram of the system. The
prompted word is first transcribed as per the corresponding
phoneme sequence (referred to as the correct phoneme
sequence) using the CMU pronunciation dictionary [18] and
then passed to the lattice generator along with the expected
mispronunciation rules. The speech signal is segmented into frames with a length of 25 msec and a 15 msec overlap, and a set of features is extracted from each frame. The extracted features
are then fed to the speech recognizer along with the created
lattice and the pre-trained acoustic models to generate a
sequence of phonemes from the child’s utterance. An
evaluation report is then generated by matching the recognized
phoneme sequence with the correct phoneme sequence and
specifying the errors made by the child.
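As a concrete illustration of this last matching step, the sketch below (our own hypothetical code, not the authors' implementation) aligns the recognized phoneme sequence with the correct sequence using a standard edit-distance alignment and labels each difference as a substitution, deletion or insertion:

```python
# Minimal sketch (not the authors' code): align the recognized phoneme
# sequence against the correct (prompt) sequence with dynamic programming
# and label each difference as a substitution, deletion or insertion.
def align_and_report(correct, recognized):
    n, m = len(correct), len(recognized)
    # dp[i][j] = minimum edit cost between correct[:i] and recognized[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (correct[i - 1] != recognized[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Trace back to produce the per-phoneme evaluation report.
    report, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (correct[i - 1] != recognized[j - 1]):
            kind = "correct" if correct[i - 1] == recognized[j - 1] else "substitution"
            report.append((correct[i - 1], recognized[j - 1], kind))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            report.append((correct[i - 1], None, "deletion"))
            i -= 1
        else:
            report.append((None, recognized[j - 1], "insertion"))
            j -= 1
    return list(reversed(report))

# Example: prompt "buy" /B AY/ recognized as /P AY/
# -> [('B', 'P', 'substitution'), ('AY', 'AY', 'correct')]
```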
2.2. Lattice creation
We used a search lattice with a specific number of alternative
pronunciations for each phoneme; this limits the decoder
search, making it faster and more accurate in determining the
various errors made by the child. Each phoneme in the correct
phoneme sequence is compared with the expected
mispronunciation rules; if a rule is matched, the pronunciation
variants in this rule are added as alternative arcs to the current
phoneme sequence. The mispronunciation rules depend on the
type of the phoneme (consonant/vowel), the phoneme position
in the word (Initial/Medial/Final) and the context of the
phoneme (the next phoneme in the word). Table 1 shows
examples of the mispronunciation rules.
Table 1: Examples of mispronunciation rules used
Phoneme | Next phoneme | Position in word | Expected mispronunciations
P       | Any          | Initial          | D
P       | Any          | Medial           | B
K       | L            | Any              | D/T
AE      | Any          | Any              | AA/EY/AH
After examining all the phonemes with the
mispronunciation rules, the search lattice is created using all
the rules that match the correct prompt sequence. An example
of a lattice created for the word “buy” is shown in Figure 2. A
garbage model is added as an alternative to each phoneme and
between phonemes to absorb any insertion of out-of-
vocabulary phonemes. The garbage model consists of all the
phonemes in parallel in addition to null and loop arcs to allow
the decoder to either skip the garbage node (in case of no
insertion) or repeat (if multiple insertions occur).
The terms PA and PG represent insertion penalties added
to the alternative and the garbage arcs respectively. These
penalties are added so the decoder will not align the speech to
the alternative error phoneme or the garbage node unless it has
enough confidence. By increasing these values, the system
will tend to align the speech to the correct phonemes. These
penalties can thus be used to control how strict the system is.
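A minimal sketch of how such a lattice could be assembled from the rules is shown below; the rule and arc data structures, the penalty values and the helper names are our own illustrative assumptions, since the paper does not specify the decoder's lattice format:

```python
# Hedged sketch of lattice construction: for each prompt phoneme, add the
# correct arc, alternative arcs from any matching mispronunciation rule
# (with insertion penalty PA), and an optional garbage node (penalty PG).
RULES = [
    # (phoneme, next phoneme, position in word, expected mispronunciations)
    ("P", "Any", "Medial", ["B"]),
    ("AE", "Any", "Any", ["AA", "EY", "AH"]),
]

def matching_alternatives(phoneme, next_phoneme, position):
    alts = []
    for ph, nxt, pos, alternatives in RULES:
        if ph == phoneme and nxt in ("Any", next_phoneme) and pos in ("Any", position):
            alts.extend(alternatives)
    return alts

def build_lattice(prompt_phonemes, pa=2.0, pg=4.0):
    """Return one slot of arcs per prompt position, with optional garbage
    slots between positions; each arc is (label, penalty)."""
    lattice = []
    last = len(prompt_phonemes) - 1
    for i, ph in enumerate(prompt_phonemes):
        position = "Initial" if i == 0 else ("Final" if i == last else "Medial")
        next_ph = prompt_phonemes[i + 1] if i < last else None
        arcs = [(ph, 0.0)]                                    # correct phoneme, no penalty
        arcs += [(alt, pa) for alt in matching_alternatives(ph, next_ph, position)]
        arcs += [("GARB", pg)]                                # garbage absorbs substitutions
        lattice.append(arcs)
        lattice.append([("GARB", pg), (None, 0.0)])           # inter-phoneme garbage (None = skip)
    return lattice

# Example: word "buy" -> /B/ /AY/
print(build_lattice(["B", "AY"]))
```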
2.3. Acoustic models
We tested two different acoustic models, a GMM-HMM and a
hybrid DNN-HMM for use in our system.
2.3.1. Conventional GMM-HMM
The system uses speaker-independent acoustic HMMs for the
garbage node, which are context-independent to simplify and
speed up the decoding process. However, it uses tied-state
context-dependent and speaker-independent HMMs for both
the correct and alternative phonemes. The models are then
adapted using Maximum Likelihood Linear Regression
(MLLR) to produce a set of speaker-dependent models for
each speaker in the test data. The features used in building the model are the Mel-frequency cepstral coefficients (MFCCs): 13 coefficients are computed for each frame, plus delta and acceleration coefficients, to obtain a 39-dimensional feature vector per frame. The number of tied states and the number of mixtures
per state are tuned using a development data set.
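For reference, a minimal sketch of this 39-dimensional MFCC front end (25 msec windows with a 10 msec shift, i.e. 15 msec overlap) is given below; it uses librosa and a 16 kHz sampling rate purely as illustrative assumptions, since the paper does not name a toolkit:

```python
# Hedged sketch of the 39-dimensional MFCC front end: 13 MFCCs per 25 msec
# frame (10 msec shift), plus delta and acceleration coefficients.
import librosa
import numpy as np

def mfcc_39(wav_path, sr=16000):          # 16 kHz sampling rate is an assumption
    y, _ = librosa.load(wav_path, sr=sr)
    win = int(0.025 * sr)                 # 25 msec window
    hop = int(0.010 * sr)                 # 15 msec overlap -> 10 msec shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=win, win_length=win, hop_length=hop)
    delta = librosa.feature.delta(mfcc)
    accel = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, accel]).T   # shape: (num_frames, 39)
```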
2.3.2. Hybrid DNN-HMM
The setup of this model is similar to the one proposed in [14].
As the DNN can easily model correlated data, Mel-scale
filterbank features are used instead of the MFCCs [19]. For each frame we compute a Mel-scale filterbank with 40 coefficients plus the energy, which, together with the delta and acceleration coefficients, are combined into a 123-dimensional feature vector. Every n successive frames are grouped to produce one input window to the DNN, with the target label taken from the middle frame.
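A rough sketch of this input construction is shown below, under the same illustrative assumptions as the MFCC sketch above; the exact energy definition and edge-padding policy are our guesses rather than details given in the paper:

```python
# Hedged sketch of the DNN input: 40 log Mel filterbank coefficients plus
# frame energy (41 dims), with deltas and accelerations (123 dims), stacked
# over a context window whose target label comes from the centre frame.
import librosa
import numpy as np

def dnn_input_windows(y, sr=16000, context=27):
    win, hop = int(0.025 * sr), int(0.010 * sr)
    fbank = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40,
                                       n_fft=win, win_length=win, hop_length=hop))
    energy = np.log(librosa.feature.rms(y=y, frame_length=win, hop_length=hop) ** 2 + 1e-10)
    feats = np.vstack([fbank, energy])                            # (41, T)
    feats = np.vstack([feats,
                       librosa.feature.delta(feats),
                       librosa.feature.delta(feats, order=2)]).T  # (T, 123)
    half = context // 2
    padded = np.pad(feats, ((half, half), (0, 0)), mode="edge")   # replicate edge frames
    return np.stack([padded[t:t + context].reshape(-1)            # one window per frame
                     for t in range(feats.shape[0])])
```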
Figure 1: Block diagram of the pronunciation verification system, which uses a lattice generator and speech recognition module to compare the child’s production to the given prompt.
Figure 2: (a) Lattice example for the word “buy”, where Garb is the garbage node. The filled nodes represent the correct phoneme sequence. (b) The construction of the garbage node.
Here too, the number of layers of the DNN, size of each layer
and number of frames in the input window are tuned with a
development data set.
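At decoding time, a hybrid system of this kind typically converts the DNN's softmax outputs into scaled acoustic likelihoods by dividing by the state priors before the Viterbi search. The paper follows [14] and does not spell this step out, so the sketch below shows the standard recipe rather than the authors' exact code:

```python
# Standard hybrid DNN-HMM step (assumed, not detailed in the paper): turn
# state posteriors p(state | frame) into scaled likelihoods for decoding by
# dividing by state priors estimated from the training alignments.
import numpy as np

def scaled_log_likelihoods(posteriors, state_priors, floor=1e-8):
    """posteriors: (num_frames, num_states) DNN softmax outputs.
    state_priors: (num_states,) relative frequency of each HMM state."""
    return np.log(posteriors + floor) - np.log(state_priors + floor)
```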
2.4. Speech corpus
We used two speech corpora to test our pronunciation
verification system. The first speech corpus is the Oregon
Graduate Institute of Science & Technology (OGI) kids’
speech corpus [20]. This corpus consists of 1100 normal
children from kindergarten through grade 10 saying 205
isolated words, 100 sentences and 10 numeric strings. Each
utterance was verified by two individuals and classified as
“good” utterances (the word is clearly intelligible with no
significant background noise or extraneous speech),
“questionable” utterances (intelligible but accompanied by
other sounds) or “bad” utterances (unintelligible or wrong
word spoken). We worked with only good utterances of the
isolated words; 880 children were used to train the model, 110
children for testing and another 110 children for development.
The second speech corpus consists of disordered speech
from 41 children between the ages of 4-12 years diagnosed
with CAS. Inclusion criteria included no reported impairment
in cognition, language comprehension, hearing or vision,
orofacial structure or lower level movement
programming/execution (i.e., dysarthria). Each child pronounced 90 isolated words in a speech therapy clinic under the supervision of an SLP. Each word in the data was phonetically annotated by an SLP, and the corpus was divided into 30, 6 and 5 speakers for the training, development and testing sets respectively.
3. Experimental results
3.1. Acoustic model parameter tuning
We performed a phone decoding process using a bi-gram
language model trained on the CMU pronunciation dictionary
to identify the best parameter values for both the GMM-
HMMs and DNN-HMMs. Decoding was applied to both speech corpora, and the parameters resulting in the lowest phone error rate (PER) were chosen for the GMM-HMMs and DNN-HMMs using the same training and development sets.
3.1.1. GMM-HMM parameter tuning
For the GMM-HMMs, we tuned the number of tied states
and the number of mixtures per state. As the normal speech
corpus is much larger than the disordered speech corpus, the
number of tied states for the normal speech corpus was tuned
from 3000 states to 7000 states while the disordered speech
corpus was tuned from 200 to 1700 tied states. We evaluated
different mixtures per state (4, 8, 16, 32 and 64). The PER of
the development sets of both normal and disordered speech for
the different parameter values is shown in Figure 3. The results show that the best PER for normal speech (37%) is obtained with 5128 states and 32 mixtures per state. For disordered speech the best PER (43.4%) is obtained with 846 states and 4 mixtures per state; PERs increased significantly
when the number of mixtures increased as there was not
enough data to train each mixture.
3.1.2. DNN-HMM parameter tuning
Next, we tested the effect of varying the number of hidden layers from 1 to 6 on a DNN-HMM with 1024 units per hidden layer and an input window of 27 frames (270 msec). The models were trained using training sets from both the
normal and disordered corpora separately and the PER
measured on development sets from both corpora. Figure 4
shows the PER of each speech corpus for different hidden
layers. For normal speech, the best accuracy was obtained
with 4 hidden layers, while for disordered speech the lowest
PER was obtained using 2 layers; in both cases, PERs
increased with additional DNN layers.
Previous work using the DNN-HMM demonstrated that
increasing the input window length above 37 frames degraded
the performance significantly [14].
Figure 3: Phone error rate of the development sets of both normal and disordered speech for different numbers of tied states and mixtures per state.
Figure 4: Phone error rate (PER) for both normal and disordered speech corpora as a function of the number of hidden layers. The number of units in each layer is fixed to 1024 and the number of frames in the input window is fixed to 27 frames (270 msec).
Figure 5: Phone error rate (PER) of the development sets for both normal and disordered speech as a function of the length of the input window. The number of hidden layers is 4 for normal speech and 2 for disordered speech, and the number of units per hidden layer is 1024 for both.
We thus tested the effect of varying the input window length on our system performance. The
PER was computed for input window sizes of 11, 19, 27, 37,
47, 59 and 67 frames with 1024 units per hidden layer and 4
hidden layers for normal and 2 layers for disordered speech.
Our results in Figure 5 show that the performance kept improving up to a window length of 59 frames (590 msec) and 47 frames (470 msec) for the normal and disordered speech corpora
respectively. Given that the words in both the training and
development data of our speech corpus sets are repeated by the
speakers, an increased window length is needed to develop
more accurate models for each phoneme in its specific context.
The resulting reduction in the generalization of the model when applied to other words does not affect the performance of our application, as the words used are limited to the words in
the training set. The best PER for normal speech was 21%
while for disordered speech, the best PER was 39.3%.
Given that recognition of disordered speech is a
particularly challenging problem, both models performed
better with normal speech than with disordered speech. The
disordered speech corpus also had a reduced amount of
available data for training compared to the normal speech
corpus. Our results also show that the DNN-HMM performs
much better than the GMM-HMM for both the normal and
disordered speech corpus. The best PER obtained using the
GMM-HMM was 37% and 43.4% for normal and disordered
speech respectively; the PER dropped to 21% and 36.2%
respectively with the DNN-HMM.
3.2. Multiple pronunciations lattice decoding
To further validate the performance of the acoustic models, we
created a search lattice for each target word in the test sets of both corpora, as described in Section 2.2, which was then fed to a Viterbi decoder along with the extracted feature vectors and pre-trained acoustic models. We used the GMM-HMMs and
DNN-HMMs with the parameters that gave the best accuracy
on the phone decoder as determined in Section 3.1. For normal
speech we simulated pronunciation errors in correctly
pronounced words by changing the labeled pronunciation
sequence based on the CAS mispronunciation rules. For
example, if the prompted word was “boy”, the correct
pronunciation sequence is “/B/ /OY/”. As “B” is an
alternative phoneme of “P”, an error was simulated by
replacing the “B” with “P” in the labeled pronunciation. The
system was also tested against naturally occurring errors using
the disordered speech corpus. Errors were simulated in 30% of the phonemes in each word of the normal speech corpus, while around 10% of the words in the disordered speech corpus contained errors.
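A self-contained sketch of how such errors might be injected is given below; the 30% rate and the idea of swapping a phoneme for one of its rule-based alternatives follow the description above, while the tiny rule set and function names are purely illustrative:

```python
# Hedged sketch of the error simulation: swap roughly 30% of the phonemes
# in each labeled pronunciation for an alternative drawn from the CAS
# mispronunciation rules (toy rule set, for illustration only).
import random

ALTERNATIVES = {"B": ["P"], "P": ["B"], "AE": ["AA", "EY", "AH"]}

def simulate_errors(phonemes, error_rate=0.3, seed=0):
    rng = random.Random(seed)
    simulated = []
    for ph in phonemes:
        if ph in ALTERNATIVES and rng.random() < error_rate:
            simulated.append(rng.choice(ALTERNATIVES[ph]))    # injected error
        else:
            simulated.append(ph)                              # left unchanged
    return simulated

# Example: simulate_errors(["B", "OY"]) may return ["P", "OY"]
```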
Tables 2 and 3 summarize the performance with both
normal and disordered speech respectively using both acoustic
models. The “correct/correct” cell represents the percentage of
true positives, while the “correct/wrong” cell represents the
percentage of false negatives. The percentage of false positives is listed in the “wrong/correct” cell. The percentage
of the true negatives that were rejected with the same error
phonemes are listed in the “wrong/wrong with same error”
cell; the true negatives that were rejected with different errors
from the actual error phoneme are listed in the “wrong/wrong
with different error” cell. The results show that for normal
speech with simulated errors, the DNN-HMM detected the
mispronounced phoneme with an increase in accuracy of 4%
compared to the GMM-HMM but with a small decrease in the
correct acceptance. For disordered speech, the improvement
with the DNN-HMM was more significant. The detection rate of mispronounced phonemes increased by 27.2 percentage points to 53.8%, and that of correct phonemes by 1.9 percentage points. The total phoneme matching
accuracy is around 93% for normal speech and around 89%
for disordered speech. Given the absence of work on
automated ASR in speech therapy for children with CAS, it is
not possible to compare our resultant accuracies to existing
systems. To give context to our results, in a study on the
phonological difficulties of children with CAS, inter-rater
reliability for the phonemic transcriptions of disordered speech by SLPs was between 78.4% and 97.3% [21].
4. Conclusions
In this paper, we present a pronunciation verification method
for integration into an automated speech therapy tool for
children with CAS. Our approach consists of creating a search
lattice for each prompt which contains the correct phoneme
sequence path and a set of alternative paths that cover
expected mispronunciation errors. The resulting DNN-HMM
system had an overall phoneme level accuracy of 89% when
used with disordered speech, which is comparable to the SLP
phonetic transcription agreement of apraxic speech of around
80% [22]. Our system was able to detect correctly pronounced phonemes with an accuracy of 94% and identify the specific errors made with an accuracy of 54%. Our
proposed approach can thus be used to accurately classify
mispronunciation errors in disordered children’s speech
collected in a noisy speech therapy environment.
We also compared the performance of conventional
GMM-HMMs and hybrid DNN-HMMs. For the GMM-HMM,
we achieved a minimum PER of 37% when tested with normal
speech and 43% with disordered speech. When using the
DNN-HMM, the PER decreased to 21% and 36% for normal
and disordered speech respectively. For the DNN-HMM
model, we demonstrated that increased window lengths are
required to develop accurate phoneme models to account for
the limited number of words used in speech therapy. The
DNN-HMM performed better than the GMM-HMM with
disordered speech, due to its ability to train better with the limited-size training set. We found the DNN-HMM produced
an improvement of 27% over the GMM-HMM in correctly
detecting mispronounced phonemes and an improvement of
about 2% in detecting correctly pronounced phonemes.
5. Acknowledgement
This work was made possible by NPRP grant # [4-638-2-236]
from the Qatar National Research Fund (a member of Qatar
Foundation). The statements made herein are solely the
responsibility of the authors.
Table 2: Phoneme-level confusion matrix for normal speech
System evaluation        | Reference: Correct   | Reference: Wrong
                         | GMM      DNN         | GMM      DNN
Correct                  | 96.5%    96%         | 21.2%    16.3%
Wrong (same error)       | 3.5%     4%          | 70.1%    74.6%
Wrong (different error)  | -        -           | 8.6%     9%
Table 3: Phoneme-level confusion matrix for disordered speech
System evaluation        | Reference: Correct   | Reference: Wrong
                         | GMM      DNN         | GMM      DNN
Correct                  | 91.8%    93.9%       | 45.3%    40.5%
Wrong (same error)       | 8.2%     6.1%        | 26.6%    53.8%
Wrong (different error)  | -        -           | 28.8%    5.7%
6. References
[1] Ad Hoc Committee on CAS, ASHA, “Childhood Apraxia of Speech”, American Speech-Language-Hearing Association, 2007.
[2] Rvachew, S. and Brosseau-Lapre, F., “Speech perception intervention”, in Interventions for Speech Sound Disorders in Children, Brookes Pub, 2006.
[3] Williams, A., “Multiple oppositions intervention”, in Interventions for Speech Sound Disorders in Children, Brookes Pub, 2006.
[4] Wren, Y., Roulstone, S. and Williams, A.L., “Computer-Based Interventions”, in Interventions for Speech Sound Disorders in Children, Brookes Pub, 2006.
[5] Apraxiaville. Available: http://smartyearsapps.com/
[6] Pocket SLP. Available: http://pocketslp.com/
[7] Ballard, K.J., Robin, D.A., McCabe, P. and McDonald, J., “A Treatment for Dysprosody in Childhood Apraxia of Speech”, J Speech Lang Hear Res, 2010, 53(5):1227-1245.
[8] Bunnell, H.T., Yarrington, D.M. and Polikoff, J.B., “STAR: articulation training for young children”, International Conference on Spoken Language Processing, 2000, pp. 85-88.
[9] Saz, O., Yin, S., Lleida, E., Rose, R., Vaquero, C. and Rodriguez, W.R., “Tools and Technologies for Computer-Aided Speech and Language Therapy”, Speech Communication, 51 (2009): 948-967.
[10] Kim, J.-M., Wang, C., Peabody, M. and Seneff, S., “An interactive English pronunciation dictionary for Korean learners”, in INTERSPEECH, 2004, pp. 1145-1148.
[11] Abdou, S.M., Hamid, S.E., Rashwan, M., Samir, A., Abd-Elhamid, O., Shahin, M. and Nazih, W., “Computer aided pronunciation learning system using speech recognition technology”, in INTERSPEECH, 2006.
[12] Parnandi, A., Karappa, V., Son, Y., Shahin, M., McKechnie, J., Ballard, K., Ahmed, B. and Gutierrez-Osuna, R., “Architecture of an Automated Therapy Tool for Childhood Apraxia of Speech”, ACM ASSETS, 2013.
[13] Shahin, M., Ahmed, B. and Ballard, K., “Automatic classification of unequal lexical stress patterns using machine learning algorithms”, IEEE Spoken Language Technology Workshop (SLT), 2012, pp. 388-391.
[14] Mohamed, A., Dahl, G.E. and Hinton, G., “Acoustic Modeling using Deep Belief Networks”, IEEE Trans. on Audio, Speech, and Language Processing, 2011.
[15] Dahl, G., Yu, D., Deng, L. and Acero, A., “Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition”, IEEE Trans. on Audio, Speech, and Language Processing, 2012.
[16] Li, L., Zhao, Y., Jiang, D., Zhang, Y. et al., “Hybrid Deep Neural Network–Hidden Markov Model (DNN-HMM) Based Speech Emotion Recognition”, Proc. Conf. Affective Computing and Intelligent Interaction (ACII), pp. 312-317, Sept. 2013.
[17] Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N. and Kingsbury, B., “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The shared views of four research groups”, IEEE Signal Processing Magazine, 29, November 2012.
[18] CMU pronunciation dictionary, http://www.speech.cs.cmu.edu/cgi-bin/cmudict.
[19] Mohamed, A., Hinton, G. and Penn, G., “Understanding how Deep Belief Networks perform acoustic modelling”, in ICASSP, 2012.
[20] Shobaki, K., Hosom, J.P. and Cole, R.A., “The OGI kids’ speech corpus and recognizers”, presented at the International Conference on Spoken Language Processing, Beijing, 2000.
[21] McNeill, B.C., Gillon, G.T. and Dodd, B., “Phonological awareness and early reading development in childhood apraxia of speech (CAS)”, International Journal of Language & Communication Disorders, 2009, 44(2), pp. 175-19.
[22] Shriberg, L., Austin, D., Lewis, B., McSweeny, J. and Wilson, D. (1997), “The percentage of consonants correct (PCC) metric: extensions and reliability data”, Journal of Speech, Language, and Hearing Research: JSLHR, 40(4), 708-722.