Automatic Speech Recognition
for ageing voices
Ravichander Vipperla
Doctor of Philosophy
Institute for Language, Cognition and Computation
School of Informatics
University of Edinburgh
2011
Abstract
With ageing, human voices undergo several changes, typically characterised by
increased hoarseness, breathiness, changes in articulatory patterns and a slower
speaking rate. The focus of this thesis is to understand the impact of ageing on
Automatic Speech Recognition (ASR) performance and to improve ASR accuracies for
older voices.
Baseline results on three corpora indicate that the word error rates (WER) for older
adults are significantly higher than those for younger adults, and that the decrease
in accuracy is greater for male speakers than for female speakers.
Acoustic parameters such as jitter and shimmer, which measure glottal source
disfluencies, were found to be significantly higher for older adults. However, the
hypothesis that these changes explain the differences in WER between the two age
groups is shown to be incorrect: experiments that artificially introduce glottal
source disfluencies into speech from younger adults do not show a significant impact
on WERs. Changes in fundamental frequency, observed quite often in older voices,
have only a marginal impact on ASR accuracies.
Analysis of phoneme errors for younger and older speakers shows that certain
phonemes, especially lower vowels, are more affected by ageing. These changes,
however, are seen to vary across speakers. Another factor strongly associated with
ageing voices is a decrease in the rate of speech. Experiments analysing the impact
of slower speaking rate on ASR accuracies indicate that insertion errors increase
when slower speech is decoded with models trained on relatively faster speech.
We then propose a way to characterise speakers in acoustic space based on speaker
adaptation transforms and observe that speakers (especially males) can be segregated
by age with reasonable accuracy. Inspired by this, we look at supervised
hierarchical acoustic models based on gender and age. Such models achieve
significant improvements in word accuracy over the baseline results. The idea is
then extended to construct unsupervised hierarchical models, which also outperform
the baseline models by a good margin.
Finally, we hypothesize that ASR accuracies can be improved by augmenting the
adaptation data with speech from the acoustically closest speakers. A strategy to
select the augmentation speakers is proposed. Experimental results on two corpora
indicate that the hypothesis holds true only when the amount of available adaptation
data is limited to a few seconds. The efficacy of such a speaker selection strategy
is analysed for both younger and older adults.
Acknowledgements
First and foremost, I am sincerely grateful to my thesis advisor Prof. Steve Renals
for his expert guidance, support and encouragement throughout the period of my
doctoral work. His deep understanding of the subject matter has been an invaluable
resource for this research work. He has been a wonderful mentor and has inspired me
in several ways to pursue the scientific quest further in my life.
I am also deeply indebted to my supervisor Dr. Joe Frankel who, despite moving
on to become an entrepreneur, regularly found time in his busy schedule to review
my work and provide helpful guidance.
I would like to express my sincere thanks to Dr. Maria Wolters for providing critical
feedback on my work and to Prof. Simon King, who has given me valuable advice time
and again and helped me get started with the cluster computers.
I am extremely thankful to Prof. Phil Green and Prof. Hiroshi Shimodaira for
agreeing to be on my examination panel and for providing me with some constructive
feedback to improve this manuscript.
The financial support for this work from the Scottish Funding Council and HCRC,
University of Edinburgh is gratefully acknowledged. I wish to thank all the members
of the MATCH project for providing a collaborative research environment and for
helping me broaden my perspective on the wider range of technologies suited to home
care systems.
I am indebted to Dr. Junichi Yamagishi, Dr. Giulia Garau, Dr. Mike Lincoln,
and other members of CSTR for always extending a helping hand to resolve issues in
experimental design and setup. I would also like to thank Prof. Mark Liberman and
Prof. Jerry Goldman for their advice in setting up experiments using the SCOTUS
corpus. I have cherished the company of my colleagues at CSTR, with whom I have
shared unforgettable hours of fun and intellectual uplift. I am thankful to them for
making this whole experience so much more worthwhile.
I would like to acknowledge the timely support from our wonderful admin and IT
assist teams, and the infrastructure provided by the cluster compute team, Edinburgh
University Library and the School of Informatics.
Several open source tools such as HTK, HTS, Praat, Cluto, LibSVM, and R have
been used in this work. I sincerely thank the developers of these tools for their effort.
Finally and most importantly, I owe my loving thanks to my wife Neelima, my parents
(Shri. V. Nagendra Rao and Smt. V. Asoka Rani), my sister and my close family,
my extended family and my friends, who have patiently put up with me through
this roller coaster ride and have been a constant source of support and encouragement.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is
my own except where explicitly stated otherwise in the text, and that this work has not
been submitted for any other degree or professional qualification except as specified.
(Ravichander Vipperla)
To shri Ganapati deva.
Table of Contents
1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Publications
2 Ageing voices
2.1 Human speech production mechanism
2.2 Changes in the speech production mechanism with ageing
2.2.1 Changes in the Respiratory system
2.2.2 Changes in the Larynx
2.2.3 Changes in the vocal tract
2.2.4 Neuromuscular control
2.3 Acoustic effects of ageing
2.3.1 Average fundamental frequency
2.3.2 Fundamental frequency variation and Amplitude variation
2.3.3 Jitter
2.3.4 Shimmer
2.3.5 Breathiness
2.3.6 Sound pressure level
2.3.7 Speech rate
3 Automatic Speech Recognition
3.1 ASR architecture
3.1.1 Feature extraction
3.1.2 Acoustic models
3.1.3 Language models
3.1.4 Lexicon
3.1.5 Decoder
3.1.6 Performance measures
3.2 Normalisation approaches in acoustic space
3.2.1 Cepstral mean and variance normalisation
3.2.2 Vocal tract length normalisation
3.3 Adaptation approaches in acoustic space
3.3.1 Maximum likelihood adaptation
3.3.2 Maximum a posteriori adaptation
3.3.3 Speaker space adaptation approaches
3.4 Automatic age recognition
3.5 Automatic speech recognition on older voices
4 ASR accuracy on ageing voices: Baseline Experiments
4.1 Corpora
4.1.1 SCOTUS
4.1.2 MATCH
4.1.3 JNAS
4.2 ASR WERs on older voices
4.2.1 Experiments with SCOTUS corpus
4.2.2 Experiments with MATCH corpus
4.2.3 Experiments with JNAS corpus
4.3 Summary
5 Impact of changes in glottal source parameters with ageing on ASR
5.1 Experimental setup
5.2 Fundamental frequency
5.3 Jitter
5.4 Shimmer
5.5 Harmonicity
5.6 Summary
6 Articulatory changes in older voices
6.1 Phoneme recognition accuracies
6.1.1 Results on the SCOTUS corpus
6.1.2 Longitudinal results on the SCOTUS corpus
6.1.3 Results on the JNAS corpus
6.2 Vowel centralisation
6.3 Speaking rate
6.3.1 Speaking rate comparison on SCOTUS corpus
6.3.2 Speaking rate comparison on JNAS corpus
6.3.3 Impact of speaking rate changes on ASR accuracies
6.4 Summary
7 Acoustic models for older voices
7.1 Speaker classification and clustering
7.1.1 Age group classification using SVMs
7.1.2 Speaker clustering based on MLLR transforms
7.2 Supervised hierarchical models
7.2.1 Experimental setup
7.2.2 Results
7.3 Unsupervised hierarchical models
7.3.1 Acoustic models
7.3.2 Testing
7.3.3 Results
7.4 Modifying HMM transition parameters
7.4.1 Experimental results on the JNAS corpus
7.4.2 Experimental results on the SCOTUS corpus
7.4.3 Discussion
7.5 Summary
8 Speaker selection to augment adaptation data
8.1 Distance measure
8.2 Speaker identification task
8.3 Speaker selection
8.4 Experiments
8.4.1 Experiments with the AMI corpus
8.4.2 Experiments on SCOTUS Corpus
8.4.3 Results
8.4.4 Discussion
8.5 Summary
9 Conclusions
9.1 Summary
9.2 Future work
A Appendix: Experimental result tables
Bibliography
List of Figures
2.1 Human speech production mechanism
2.2 Cepstral peak prominence
3.1 Parametric representation of speech
3.2 Automatic speech recognition system
3.3 Speech signal and vocal tract response
3.4 MFCC and PLP feature extraction
3.5 Windowing or Short time analysis of speech signal
3.6 Filter bank illustration
3.7 Three state left-right HMM
3.8 Example of a word network lattice
3.9 Example of a small segment of a finite state network
3.10 Piecewise linear warping in VTLN
3.11 Regression trees
4.1 Age distribution of speakers in the JNAS and the S-JNAS corpora
4.2 Age distribution of speakers in the training set of the SCOTUS corpus
4.3 Comparison of WERs on younger adult and older adult voices in the SCOTUS corpus
4.4 WERs (%) with increasing age on older adult voices in the SCOTUS corpus
4.5 WERs (%) with increasing age on older adult voices in the SCOTUS corpus with speaker adaptation
4.6 WERs (%) for young and older speakers of the MATCH corpus using different language models
4.7 WERs (%) for young and older speakers of the MATCH corpus using different acoustic models
5.1 Illustration of artificial modification of fundamental frequency
5.2 Illustration of waveforms with artificial increase in jitter
5.3 Illustration of waveform with artificial increase in shimmer
6.1 Phoneme loop decoder
6.2 Phoneme correct recognition (%) on the SCOTUS corpus
6.3 Correct recognition (%) of most used Japanese phonemes for the younger and older adults
6.4 Mean vowel space areas for younger and older male adults in SCOTUS corpus
6.5 Centroid positions of common vowels in younger and older adults
7.1 MATCH speakers in 3D space using multi dimensional scaling on the distance matrix
7.2 Clustering of speakers in the MATCH corpus
7.3 Training gender and age dependent acoustic models
7.4 Unsupervised hierarchical models
7.5 Statistics of models chosen by test speakers in the unsupervised hierarchical models
8.1 Speaker Selection
8.2 Augmentation of adaptation data. Results on the AMI corpus
8.3 Augmentation of adaptation data. Results on the SCOTUS corpus for younger adult male speakers
8.4 Augmentation of adaptation data. Results on the SCOTUS corpus for older adult male speakers
List of Tables
4.1 Perplexity and OOV rate for the younger adult and older adult test sets in SCOTUS corpus
4.2 Comparison of the perplexities of the language model and OOV rates on MATCH corpus
4.3 Comparison of WERs (%) of younger and older adults in the JNAS corpus
4.4 WERs (%) for older adults in different age groups in the JNAS corpus
5.1 Fundamental frequency analysis for the phonations of vowel ‘aa’ in the SCOTUS corpus
5.2 WER (%) with artificial reduction in fundamental frequency of the speech from younger adults in the SCOTUS corpus
5.3 Jitter analysis for the phonations of vowel ‘aa’ in the SCOTUS corpus
5.4 Jitter values computed on phonations of the vowel ‘aa’ in the original and modified waveforms
5.5 WER (%) with artificial increase of jitter in the speech from younger adults in the SCOTUS corpus
5.6 Shimmer analysis for the phonations of vowel ‘aa’ in the SCOTUS corpus
5.7 Shimmer values computed on phonations of the vowel ‘aa’ in the original and modified waveforms
5.8 WER (%) with artificial increase of shimmer in the speech from younger adults in the SCOTUS corpus
5.9 Harmonicity analysis for the phonations of vowel ‘aa’ in the SCOTUS corpus
6.1 Phonemes with largest drop in recognition rates in longitudinal study on the SCOTUS corpus
6.2 Vowel space area comparison between younger adult and older adult males in SCOTUS corpus
6.3 Speaking rate differences between younger and older adults on the SCOTUS corpus
6.4 Speaking rate differences between younger and older adults in the JNAS corpus
6.5 Phoneme accuracies using phoneme loop decoder for older speakers
6.6 Phoneme accuracies using phoneme loop decoder for younger speakers
6.7 Word correct recognition and accuracies for older speakers in the JNAS corpus with original and transition parameter modified models
6.8 Substitution, deletion and insertion errors for older speakers in the JNAS corpus with original and transition parameter modified models
6.9 Word correct recognition and accuracies for younger speakers
6.10 Substitution, deletion and insertion errors for younger speakers in the JNAS corpus with original and transition parameter modified models
7.1 Precision and recall for each class in age group classification task on MATCH corpus using support vector machines
7.2 Comparison of WERs (%) of younger and older adults in the JNAS corpus using gender dependent models
7.3 Comparison of WERs (%) of younger and older adults using ‘Gender + Age’ dependent models
7.4 WERs (%) of younger and older adults using unsupervised hierarchical models
7.5 WERs (%) on older speakers in the JNAS corpus using acoustic models with modified transition parameters
7.6 WERs (%) on older speakers in the SCOTUS corpus with modified transition parameters
8.1 Speaker identification task
8.2 AMI Corpus: Baseline results (WER %)
8.3 AMI Corpus: Results with augmented adaptation data (WER %)
8.4 SCOTUS Corpus: Baseline results (WER %) for younger adult speakers
8.5 SCOTUS Corpus: Baseline results (WER %) for older adult speakers
8.6 SCOTUS Corpus: Results with augmented adaptation data (WER %) on younger adult speakers
8.7 SCOTUS Corpus: Results with augmented adaptation data (WER %) on older adult speakers
A.1 Comparison of WER (%) on younger adult and older adult voices in the SCOTUS corpus
A.2 Comparison of WER (%) using MLLR speaker adaptation on younger adult and older adult voices in the SCOTUS corpus
A.3 Comparison of WER (%) using vocal tract length normalisation on younger adult and older adult voices in the SCOTUS corpus
A.4 Comparison of WER (%) using speaker adaptive training on younger adult and older adult voices in the SCOTUS corpus
A.5 WER (%) with increasing age on older adult voices in the SCOTUS corpus
A.6 WER (%) with increasing age on older adult voices using MLLR speaker adaptation in the SCOTUS corpus
A.7 Comparison of WER (%) of young and older voices on MATCH corpus using different language models
A.8 Comparison of WER (%) of younger and older voices on MATCH corpus using different acoustic models
A.9 F1 and F2 for the monophthongs in the SCOTUS corpus
A.10 Correct recognition (%) of phonemes on younger and older adult males in the SCOTUS corpus
A.11 Correct recognition (%) of phonemes on younger and older adult males in the JNAS corpus
A.12 Speaking rate (Frames/Phoneme) on the SCOTUS corpus
A.13 Speaking rate (Frames/Phoneme) on the JNAS corpus
Chapter 1
Introduction
1.1 Motivation
Speech is the most natural form of communication between humans. With advances in
Automatic Speech Recognition (ASR) systems, speech as a mode of communication
with computing devices is finding wider acceptance in society. Today, ASR is used
in a large array of applications, including interactive voice response systems
such as telephone banking and ticket booking, dictation systems on personal
computers, command and control in automobiles, easy dialing on mobile phones, and
the creation of electronic medical records in health-care organisations.
While the use of ASR systems is beneficial for everyone, it could be particularly
useful for older people, especially those with mobility and visual impairments.
Easy-to-use voice-based interactive systems in health-care and home-care would make
life a lot easier for them [Müller et al., 2003]. Several initiatives such as MATCH¹
and Gator Tech Smart houses² focus on the research and development of home-care
technologies to assist the independent living of elderly people. These systems see
voice as one of the important modes of interaction.
¹ ‘Mobilising Advanced Technologies For Care at Home’, a research project focused on
technologies for home care: www.match-project.org.uk
² http://www.icta.ufl.edu/gt.htm
During the last century the world’s ageing population has been growing at a
staggering rate. According to the United Nations, in 2006, close to 500 million
people in the world were aged 65 and older. Based on projections, the number will
increase to 1 billion by 2030, which means one in every eight of the earth’s
inhabitants will be aged 65 or above [Kevin and Philips, 2005]. This is a large
segment of the population, and from an ASR research point of view it is of interest
to be able to cater to their voices.
Over the years, numerous studies have examined the structural changes in the speech
production mechanism observed with ageing. These studies come mainly from speech
pathology, speech therapy and geratology research, and are motivated by the need to
understand the differences between natural changes in voice with ageing and vocal
changes associated with pathological conditions. Deterioration of voice quality with
ageing has been widely reported [Linville, 2001; Ramig and Ringel, 1983; Ramig et
al., 2001]. Ageing also affects fine motor control capabilities and thereby tongue
movement and speaking rate. These changes impact the intelligibility of speech from
older people. Cognitive abilities such as fluid intelligence, working memory span
and information processing speed tend to decline as people grow old [Bäckman et al.,
2001]. These cognitive factors have a large impact on the way older people interact
with spoken dialogue systems [Wolters et al., 2009]. The impact of ageing on voice
also depends on several factors specific to individuals, such as their health and
well-being, smoking habits and profession. These factors increase the variability
and make it difficult to find a correspondence between chronological age and vocal
age. All of the above changes pose interesting challenges for ASR systems that need
to be addressed.
ASR systems have been evolving rapidly over the last couple of decades with
advances in machine learning techniques [Renals and Hain, 2010]. The problem of
acoustic modeling has been studied from various perspectives, such as making the
models robust to variations in background noise, speaker characteristics, dialect
and accent. From an age perspective, there has been a lot of work focused on
acoustic modeling for children's voices [Gerosa et al., 2009], but limited work on
understanding the impact on ASR systems of the changes in acoustic characteristics
associated with older voices. Relatively poor recognition accuracies for older
voices have been reported before [Baba et al., 2004; Anderson et al., 1999; Wilpon
and Jacobsen, 1996], but to the best of our knowledge there has not been an in-depth
study addressing this problem. In this thesis, we address this problem and present
our research work and experimental results focused on the domain of ASR for ageing
voices.
1.2 Objectives
There are several components in an ASR system including the acoustic models,
language models, lexicon and decoder. There is scope to adapt each of these
components in order to make the ASR systems work better for older voices. In this
thesis, we address the problem from an acoustic modeling perspective.
We approach the problem from a twofold perspective. Firstly, it is of interest to
analyse the changes in glottal source and articulatory characteristics of older
voices and the impact of such changes on ASR accuracies. Secondly, it is of interest
to understand the improvements in accuracy possible with state-of-the-art speaker
adaptation techniques, and to explore and propose other acoustic modeling strategies
targeted at older voices to enhance the accuracies.
The main objectives of the thesis are outlined as follows:
1. To perform a systematic comparative study of the glottal source parameters of
adult and older voices and to analyse the impact of changes in any of those
parameters on ASR accuracies.
2. To study articulatory changes with ageing.
3. To analyse the impact of slower speaking rate on ASR accuracies.
4. To report the baseline accuracies for older voices for a few chosen corpora.
5. To explore the possibility of speaker clustering based on gender and age group.
6. To explore the effectiveness of hierarchical models to improve the accuracies for
older voices.
7. To explore the idea of improving the accuracies for a target speaker by using
speech from other acoustically close speakers.
The approach to address these research objectives and the experimental results are
explained in detail in the following chapters.
A couple of important factors need to be mentioned beforehand. In general, there
are several disfluencies associated with very old speakers due to various
pathological conditions. For the purpose of this thesis, we investigate only the
speech of healthy older adults. It is also well known that chronological ageing
and vocal ageing are weakly correlated. Nevertheless, in this thesis we categorise
speakers above 60 years of age as older adults.
1.3 Publications
Some of the ideas and results appearing in this thesis have been published in
peer-reviewed conference proceedings and journal articles during the course of this
research work. The publications are listed below, each with the thesis chapter in
which its results are discussed.
Ravichander Vipperla, Steve Renals, and Joe Frankel. Longitudinal study of
ASR performance on ageing voices. In Proceedings of Interspeech, Brisbane,
2008. (Chapter 4)
Ravichander Vipperla, Maria Wolters, Kallirroi Georgila, and Steve Renals. Speech
input from older users in smart environments: Challenges and perspectives. In
Proc. HCI International: Universal Access in Human-Computer Interaction.
Intelligent and Ubiquitous Interaction Environments, number 5615 in Lecture
Notes in Computer Science. Springer, 2009. (Chapter 4)
Maria Wolters, Ravichander Vipperla, and Steve Renals. Age Recognition for
Spoken Dialogue Systems: Do We Need It? In Proceedings of Interspeech,
Brighton, 2009. (Chapter 7)
Ravichander Vipperla, Steve Renals, and Joe Frankel. Ageing voices: The effect
of changes in voice parameters on ASR performance. EURASIP Journal on
Audio, Speech and Music Processing, 2010. (Chapter 5)
Ravichander Vipperla, Steve Renals, and Joe Frankel. Augmentation of adapta-
tion data. Proceedings of Interspeech, Makuhari, 2010. (Chapter 8)
Chapter 2
Ageing voices
In this chapter, we review the important structural and functional changes that occur
in speech production when people grow old. We then review previous studies on how
these changes impact voice quality and look at various measures used by researchers
to analyse the quality of voice.
2.1 Human speech production mechanism
The human vocal mechanism (Figure 2.1) consists of the lungs, the larynx (which
houses the vocal cords), and the vocal tract comprised of the pharynx, the mouth and
the nose.
Depending on the sound that needs to be generated, articulatory motor control
mechanisms include positioning the jaw, shaping the tongue, shaping the lips, posi-
tioning the velum (to control the acoustic flow through the nasal cavity), control of the
vocal cord vibrations and flow of air in and out of the lungs. As air is expelled from
the lungs through the trachea, the vocal cords in the larynx are caused to vibrate by the
air flow. The air flow is thus chopped into quasi periodic pulses which are modulated
as they pass through the pharynx cavity, mouth cavity and nasal cavity. The combination
of the shape of the vocal tract and the presence or absence of vocal cord vibrations
results in the production of various sounds [Rabiner and Juang, 1993].
[Figure 2.1: Human speech production mechanism. Labelled parts: lungs, diaphragm, trachea, larynx, pharynx, tongue, nasal cavity, velum, palate, lips, jaw, epiglottis, oesophagus.]
2.2 Changes in the speech production mechanism with
ageing
Several physical and physiological changes occur in the human body with ageing. Typical
changes include a decline in vision and hearing, weakening of muscles, mobility
restrictions and weakened immune system. Similar to other body parts, organs in the
human speech production mechanism also undergo age related changes such as reduc-
tion in the respiratory muscle strength, restricted vocal fold adjustments during phona-
tion and difficulty in adjustments of tongue and lip shapes [Linville, 2001]. The rate at
which voices age does not however depend only on the chronological age of a person,
but also on other factors such as lifestyle, physiological condition, smoking habits and
profession. Even with the above mentioned factors being identical between two indi-
viduals, the extent of vocal ageing could differ between them. Described below are
some of the changes seen in the voices showing signs of ageing.
2.2.1 Changes in the Respiratory system
Apart from breathing, the respiratory system plays a crucial role in producing speech.
It acts as the energy source for speech production by forcing air through the vocal cords
and the vocal tract resulting in various sounds.
The most significant changes seen in the respiratory system of aged people are the
loss of lung elasticity, increase in the stiffness of the chest wall, and decrease in the
respiratory muscle strength [Mahler et al., 1986; Rossi et al., 1996].
Lung recoil elasticity is the ease with which lungs rebound after having been
stretched during inhalation. A decline in lung elasticity has been reported by Mahler
et al. [1986] due to ageing. The loss of lung elastic recoil with age is found to be faster
in males as compared to females [Bode et al., 1976].
Due to the alterations in the muscles of the chest wall, the thorax becomes increas-
ingly rigid with ageing [Kahane, 1981]. This leads to a reduced movement in response
to the respiratory muscle forces. Due to the degeneration of the upper and middle re-
gions of the thoracic vertebral column, a pronounced curvature of the back is observed
in some older adults. This phenomenon, called kyphosis, alters the shape of the thorax
and may affect the amount of air that can be inhaled and exhaled.
Several research studies have reported weakening of respiratory muscles during
old age [Black and Hyatt, 1969; Kahane, 1981]. This leads to reduced respiratory
forces during inhalation and exhalation. A decline in maximal respiratory pressure
progressively beyond the age of 65 has been reported by Enright et al. [1994]. The
decline is more prominent in males compared to females. A loss in diaphragm strength
leading to an average reduction of 25% of maximum transdiaphragmatic pressure in
an elderly group as compared to younger subjects has also been reported [Tolep et al.,
1995].
While the total lung volume remains unaltered in older people, the forced expiratory
volume and the lung pressure are decreased. This leads to a decline in the
amount of air that can be moved in and out of the lungs and the efficiency with which
it can be moved [Linville, 2004; Ramig et al., 2001]. The rate of this decline accelerates
with advancing age [Mahler et al., 1986]. Also, the amount of air left in the lungs after
exhalation, known as the ‘residual volume’, has been found to increase by about 40% from
the age of 20 to the age of 70 [Lynne-Davies, 1977].
2.2.2 Changes in the Larynx
The parts of the larynx that form the vocal apparatus are the laryngeal cartilages (to
which the vocal folds are attached), the vocal folds that play a key role in phonation,
and the intrinsic muscles that regulate the vocal cord tension and the vocal fold open-
ing [Pretterklieber, 2003]. Several anatomical changes are seen in these organs with
ageing.
Among the several cartilages in the larynx, the thyroid, cricoid and arytenoid car-
tilages are the most significant from the speech production point of view. The thyroid
and cricoid cartilages form the skeleton of the larynx. A pair of arytenoid cartilages
are located on the upper edge of the cricoid cartilage. The vocal cords are attached
posteriorly to the arytenoid cartilages and anteriorly to the thyroid cartilage. The
cricoarytenoid joints allow the arytenoid and thus the vocal apparatus to move laterally
or medially. The arytenoids can also glide on the surface of the cricoid and move closer
or recede away from each other. The most significant change in the cartilages observed
as an individual moves from adulthood to old age is the toughening of the soft tissue
into bone like structure (ossification). This phenomenon is observed in both males and
females. It occurs at an earlier age and is more prominent in males as compared to
females. Each of the cartilages has its own pattern of ossification. The arytenoid cartilage
ossifies only partially, sparing the vocal process. Significant age-related changes have
been reported in the cricoarytenoid joint [Paulsen and Tillmann, 1998; Dedivitis et al.,
2001]. Changes include thinning of the joint surface, reduced collagen fibers in the
cartilage matrix and surface irregularities. These changes are again more prominent in
males compared to females and hamper overall positional or postural movements of
the arytenoid cartilages. This reduces the degree and extent of vocal ligament
closure and makes vocal fold adjustments during phonation difficult. The
result of this is impaired vocal quality and reduced vocal intensity due to air leakage
through incomplete vocal fold closure.
The vocal folds have a complex layered structure. They are comprised of five
discrete histological layers: the Epithelium, three layers jointly called Lamina Propria
and the Thyroarytenoid muscle. The thin layer of Epithelium forms the protective
covering for the vocal folds. The epithelial cells are bound together firmly and form
a smooth lining reducing the friction to the air flow. The superficial layer of Lamina
Propria is a thin layer made of elastin fibres. This layer can be stretched in several
directions. The intermediate layer which is formed of elastin and collagen fibres is
more densely packed and can only be stretched in anterior-posterior direction. The
deep layer is formed of collagen fibres and is the least stretchable. This layer protects the
vocal cords from overextension. The Thyroarytenoid muscle lies below the Lamina
Propria. It is mainly concerned with pulling together the thyroid and arytenoid
cartilages, thus relaxing the vocal folds.
Several changes in the structure with ageing alter the biomechanical properties of
the vocal folds [Linville, 2001]. Glandular changes in the laryngeal mucosa (the mu-
cous lining of larynx) [Linville, 2004] cause drying of the epithelial tissue, increasing
the stiffness of vocal cord cover. This increase in cover stiffness leads to instability of
vocal fold vibration. Some investigations [Hirano et al., 1989] have reported thicken-
ing of laryngeal epithelium progressively with age. Tissues age at varying rates and to
varying extents [Kahane and Hammons, 1987] and substantial structural changes need
to occur before observing noticeable changes in voice.
In the Lamina Propria, several age related changes have been documented in all
the three layers. The thickness of the superficial layers alters [Hirano et al., 1989] and
atrophy and degeneration of the elastic fibres in the layer has been observed [Sato and
Hirano, 1997]. Changes seen in the intermediate layer include thinning of the layer,
decrease in the density of the fibres, atrophy of the fibres and changes in the contour
of the layer [Linville, 2001]. The fibrous protein loses elasticity and the layer stiffens.
The deep layer thickens with an increase in the collagen fibres. Such morphological
changes in the fibres of the vocal folds contribute partially to the ageing of the voices.
The thyroarytenoid muscle also displays atrophy with ageing. Changes in mus-
cle fibres have been reported [Sato and Tauchi, 1982]. A decrease in thyroarytenoid
muscle activity has been reported [Baker et al., 1998] in older speakers compared to young
speakers. This affects the fine control of the position of the arytenoid joint and thereby
the fine control of pitch of the voice.
Intrinsic laryngeal muscles are responsible for control of the vocal cords. The
tension in the vocal cords is regulated by the cricothyroid muscle. The opening (abduction)
of the gap between the vocal folds (called the rima glottidis) is controlled by the posterior
cricoarytenoid muscle and the closing (adduction) is controlled by the lateral cricoarytenoid
and thyroarytenoid muscles. Regressive changes and atrophy have been reported
in all these muscles with ageing [Rodeño et al., 1993; Bach et al., 1941]. The
changes include accumulation of fats, degeneration of muscle fibers and unusual vari-
ations in the cross sectional areas [Linville, 2001]. As a result, precise control of the
vocal cord tension and complete abduction/adduction is affected.
2.2.3 Changes in the vocal tract
The human vocal tract consists of all the organs above the vocal folds that are involved
in speech production. It is comprised of the pharynx (throat), the oral cavity, the nasal
cavity, soft palate (velum) and the articulators viz., the tongue and the lips. The human
speech production mechanism can be viewed as a source-filter model. The lungs in
conjunction with vocal cords act as the source and expel air into the vocal tract. De-
pending on the presence or absence of the vocal cord vibrations, the source is either
voiced or unvoiced. This quasi periodic air then resonates in the pharynx, oral and
nasal cavities to generate a rich timbre. The vocal tract thus acts as the filter.
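The source-filter view described above can be illustrated with a toy sketch. It is not the model used in this thesis: the impulse-train source, the single two-pole resonator standing in for the whole vocal tract, the parameter values (F0 of 100 Hz, one resonance at 500 Hz with 80 Hz bandwidth, 16 kHz sampling) and all function names are illustrative assumptions.

```python
import cmath
import math


def impulse_train(f0_hz, fs_hz, n_samples):
    """Voiced source: one unit pulse per pitch period (idealised glottal pulses)."""
    period = round(fs_hz / f0_hz)  # pitch period in samples
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(x, centre_hz, bandwidth_hz, fs_hz):
    """Filter: a two-pole resonance standing in for a single vocal tract formant."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)   # pole radius from bandwidth
    theta = 2.0 * math.pi * centre_hz / fs_hz       # pole angle from centre frequency
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    y, y1, y2 = [], 0.0, 0.0
    for sample in x:
        out = sample + a1 * y1 + a2 * y2            # y[n] = x[n] + a1*y[n-1] + a2*y[n-2]
        y.append(out)
        y2, y1 = y1, out
    return y


def magnitude_at(x, freq_hz, fs_hz):
    """Magnitude of the discrete-time Fourier transform of x at one frequency."""
    w = -2j * math.pi * freq_hz / fs_hz
    return abs(sum(v * cmath.exp(w * n) for n, v in enumerate(x)))


FS = 16000                                          # assumed sampling rate in Hz
source = impulse_train(100, FS, FS // 2)            # 0.5 s of pulses at F0 = 100 Hz
speech = resonator(source, 500, 80, FS)             # shaped by one resonance at 500 Hz
harmonics = range(100, 1100, 100)
strongest = max(harmonics, key=lambda f: magnitude_at(speech, f, FS))
```

The source fixes where the harmonics lie (multiples of F0), while the filter decides which of them are emphasised; in this sketch the harmonic nearest the resonance ends up strongest.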
The vocal tract can be broadly thought to be comprised of three resonating cavities,
the pharynx, and the oral and nasal cavities. The pharynx is involved in the production
of all speech sounds. The pharynx can change shape to a limited extent and thus alter
the resonance patterns. The pharynx can be constricted, and it can be raised or lowered.
The position of the velum also alters the shape of the pharyngeal cavity. The velum
controls the flow of air into the nasal cavity. During the production of nasal sounds
such as /m/ and /n/, the velum is moved forward to open the air passage through the
nasal cavity. The oral cavity is the most flexible among the three cavities in varying
the shape. The resonating property of the oral cavity depends on the position of the
temporomandibular joint, the shape of the tongue and the lips and the position of the
velum.
Thinning of pharyngeal epithelium and degeneration of the pharyngeal muscles have
been reported with ageing [Linville, 2001]. However, these changes in the pharynx are
not found to be extensive.
The temporomandibular joint (TMJ) is the joint at which the jaw is hinged to the
skull. It is used in controlling the position of the jaw and hence influences the oral
resonance during speech production. Jaw movement has a significant role to play in
articulation of certain phonemes as well as in the co-articulation of adjacent phonemes.
With ageing, degenerative changes are observed in the TMJ [Weinstein, 2000]. Dis-
placement of the TMJ disk is commonly observed leading to a lowering of the articu-
lating surface. Xue and Hao [2003] have reported increase in vocal tract dimensions in
older speakers. The vocal tract volume of older speakers in particular is significantly
higher compared to the younger speakers. This could lead to changes in the resonance
patterns in older voices.
The tongue plays a major role in speech production. It is very flexible and can be
moved up, down, forward and backward. By adjusting the shape of the tongue and the
position of the tongue tip, the oral cavity’s shape is modified affecting the resonance
patterns and hence the sound produced. Significant changes have been reported in the
tongue with ageing [Rother et al., 2002]. Decrease in the thickness of epithelium and
glandular atrophy have been reported in people over 50 years of age [Nakayama, 1991].
However the most significant change in the tongue that affects the speech production is
the atrophy of the tongue muscles. From ultrasound observations, a decline in tongue
motor skills in the elderly in comparison to young adults was reported by Koshino
et al. [1997]. A decline in tongue strength has also been reported in older individuals
[Crow and Ship, 1996]. These changes in the tongue could affect the articulatory
patterns.
Other changes observed in the mouth with ageing include loss of oral mucosa (the
mucous membrane that covers all the structures inside the oral cavity other than the
teeth), decline in the salivary function leading to oral dryness, and degeneration and
loss of teeth. These changes could also have a small impact on speech production.
2.2.4 Neuromuscular control
Age related changes also take place in the peripheral and central nervous system that
have implications for speech production. One of the changes in the peripheral neural
system is the decline of motor neurons. This loss in the motor units has been impli-
cated as the primary mechanism for muscle atrophy and loss of contractile strength in
the muscles [Doherty et al., 1993]. An average loss of 25% of these neurons has been reported
from the second to the tenth decade of life. However this loss of motor units is par-
tially compensated by increase in the size of the motor units along with the slowing of
contractile speed. This affects various muscles involved in the speech production and
is a possible cause of the slower speaking rate observed in older speakers.
Age related memory impairment is commonly observed in elderly people [Hedden
and Gabrieli, 2004]. In particular reduction in working memory and the associated
difficulty in refreshing recently processed information have implications on speech
production behavior and interaction styles.
2.3 Acoustic effects of ageing
Several studies have been made to understand the effect of ageing on various acoustic
parameters of speech. These studies have been mainly in the field of speech pathology
to differentiate normal voice changes due to ageing from pathological vocal conditions
affecting elderly patients. Most of these studies [Ramig and Ringel, 1983; Ramig
et al., 2001; Linville, 2000; Edward, 1959] have indicated that speakers experience
certain changes, mainly deterioration, of vocal acoustic output as they age.
To analyse the voice quality, different parameters of speech signal have been pro-
posed and widely used. This section provides a brief description of the parameters
that have been used in this thesis. Some of these parameters such as the fundamen-
tal frequency, jitter and shimmer relate to the characteristics of the glottis and hence
can be treated as source related parameters. Other parameters, such as formant fre-
quencies and speaking rate relate to the shape and movement of the vocal tract and
are thereby treated as filter related parameters. Although these parameters have been
primarily used to differentiate between healthy voices and those suffering from patho-
logical conditions, they have also been used to study the change in voice quality with
ageing. These parameters are typically measured on sustained phonations of a few seconds
in duration, recorded in noise-free sound booths.
2.3.1 Average fundamental frequency
Among the several parameters affected by ageing, the average fundamental frequency
(F0) has been one of the most extensively studied parameters. Although there is no
general agreement on the trend, it appears [Schötz and Müller, 2007; Linville, 2000]
that in females, the fundamental frequency remains fairly constant until menopause
and later decreases. A drop of approximately 10-15 Hz is observed. This is attributed
to the thickening of laryngeal mucosa. In males, F0 decreases until around 60 years
of age and increases significantly after that. However, the experiments in
[Xue and Deliyski, 2001; Endres et al., 1971] indicate that F0 reduces significantly for
both males and females. A decrease of 40-60 Hz in F0 has been reported for both
males and females.
2.3.2 Fundamental frequency variation and Amplitude variation
Older voices are generally associated with tremor and increased hoarseness. These
characteristics are related to F0 and amplitude instability. The standard deviations
of the fundamental frequency and of its amplitude indicate the gross stability of the
voice over time. These measures tend to increase with age for both males and
females [Linville, 2000]. The F0 standard deviation more than doubles between young
adulthood and old age for men, while an increase of over 70% has been observed in
older women’s voices. These observations are also confirmed experimentally by Xue
and Deliyski [2001]; Bruckl and Sendlmeier [2003].
2.3.3 Jitter
Jitter is the cycle to cycle variation of the pitch period, i.e., the average of the absolute
distance between consecutive periods. It is measured in µsec.
\[
\mathrm{Jitter(absolute)} = \frac{1}{N-1} \sum_{i=1}^{N-1} \left| T_i - T_{i+1} \right| \tag{2.1}
\]
where $T_i$ is the extracted F0 period length and $N$ is the number of extracted F0 pitch
periods [Boersma, 2001].
A relative measure for frequency perturbations known as ‘Jitter Local’ is often
used. It is the ratio of pitch period variation from cycle to cycle to the average pitch
period. It is expressed as a percentage.
\[
\mathrm{Jitter(Local)} = \frac{\frac{1}{N-1} \sum_{i=1}^{N-1} \left| T_i - T_{i+1} \right|}{\frac{1}{N} \sum_{i=1}^{N} T_i} \tag{2.2}
\]
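Equations 2.1 and 2.2 can be sketched in a few lines of Python. This is an illustrative sketch, not the tool used for the measurements in this thesis; it assumes the pitch periods have already been extracted into a list of durations in microseconds, and the function names and example values are hypothetical.

```python
def jitter_absolute(periods):
    """Equation 2.1: mean absolute difference between consecutive pitch periods.

    The result is in the same unit as the input periods (assumed microseconds).
    """
    n = len(periods)
    if n < 2:
        raise ValueError("at least two pitch periods are needed")
    return sum(abs(periods[i] - periods[i + 1]) for i in range(n - 1)) / (n - 1)


def jitter_local(periods):
    """Equation 2.2: absolute jitter over the mean period, scaled to a percentage."""
    mean_period = sum(periods) / len(periods)
    return 100.0 * jitter_absolute(periods) / mean_period


# Hypothetical periods in microseconds, roughly a 100 Hz voice
example_periods = [10000.0, 10100.0, 9900.0, 10000.0]
example_absolute = jitter_absolute(example_periods)   # (100 + 200 + 100) / 3
example_local = jitter_local(example_periods)         # relative to a 10000 us mean period
```

Because the local measure is normalised by the mean period, it allows speakers with different average F0 to be compared, which the absolute measure does not.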
The other measures of jitter that are averaged over larger number of pitch periods
are as follows:
Relative Average Perturbations (Jitter RAP): The average absolute difference
between a period and the average of it and its two neighbours, divided by the
average period.
Five point Period Perturbation Quotient (Jitter PPQ5): The average absolute
difference between a period and the average of it and its four closest neighbours,
divided by the average period.
Difference of differences between periods (Jitter DDP): The average absolute
difference between consecutive differences between consecutive periods, divided
by the average period.
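The three averaged measures above can be sketched directly from their verbal definitions. This is a minimal illustration under stated assumptions: the input is a list of already extracted pitch periods, the results are returned as fractions of the mean period (multiply by 100 for a percentage), and the function names are hypothetical rather than taken from any particular tool.

```python
def _mean(values):
    return sum(values) / len(values)


def _smoothed_perturbation(periods, half_width):
    """Average |T_i - mean of the window centred on T_i|, divided by the mean period."""
    n = len(periods)
    if n < 2 * half_width + 1:
        raise ValueError("not enough pitch periods for this window size")
    deviations = [
        abs(periods[i] - _mean(periods[i - half_width:i + half_width + 1]))
        for i in range(half_width, n - half_width)
    ]
    return _mean(deviations) / _mean(periods)


def jitter_rap(periods):
    """Jitter RAP: each period against the average of it and its two neighbours."""
    return _smoothed_perturbation(periods, half_width=1)


def jitter_ppq5(periods):
    """Jitter PPQ5: each period against the average of it and its four closest neighbours."""
    return _smoothed_perturbation(periods, half_width=2)


def jitter_ddp(periods):
    """Jitter DDP: average absolute difference between consecutive period differences."""
    if len(periods) < 3:
        raise ValueError("at least three pitch periods are needed")
    diffs = [periods[i + 1] - periods[i] for i in range(len(periods) - 1)]
    second = [abs(diffs[i + 1] - diffs[i]) for i in range(len(diffs) - 1)]
    return _mean(second) / _mean(periods)
```

Under these definitions DDP is always three times RAP, since (T_{i+1} - T_i) - (T_i - T_{i-1}) equals minus three times the deviation of T_i from its three-point average, so the two measures carry the same information.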