Impact of variabilities on speech recognition
Mohamed Benzeghiba(1), Renato De Mori(2), Olivier Deroo(3), Stephane Dupont(4),
Denis Jouvet(5), Luciano Fissore(6), Pietro Laface(7), Alfred Mertins(8)
Christophe Ris(4), Richard Rose(9), Vivek Tyagi(1), Christian Wellekens(1)
1-Institut Eurecom, Sophia Antipolis, France
2-LIA, Avignon, France
3-Acapela, Mons, Belgium
4-Multitel, Mons, Belgium
5-France Telecom, Lannion, France
6-Loquendo, Torino, Italy
7-Politecnico, Torino, Italy
8-Carl von Ossietzky University, Oldenburg, Germany
9-McGill University, Montreal, Canada
christian.wellekens@eurecom.fr
Abstract
Major progress is being recorded regularly on both the technol-
ogy and exploitation of Automatic Speech Recognition (ASR)
and spoken language systems. However, there are still techno-
logical barriers to flexible solutions and user satisfaction under
some circumstances. This is related to several factors, such as
the sensitivity to the environment (background noise or chan-
nel variability), or the weak representation of grammatical and
semantic knowledge.
Current research is also emphasizing deficiencies in deal-
ing with variation naturally present in speech. For instance, the
lack of robustness to foreign accents precludes the use by spe-
cific populations. There are actually many factors affecting the
speech realization: regional, sociolinguistic, or related to the
environment or the speaker itself. These create a wide range
of variations that may not be modeled correctly (speaker, gen-
der, speech rate, vocal effort, regional accents, speaking style,
non stationarity...), especially when resources for system train-
ing are scarce.
This paper outlines some current advances related to vari-
abilities in ASR.
1. Introduction
The weaknesses of ASR systems are pointed out even by non-experts: Is it possible to recognize speech in a noisy environment? What happens if the speaker has a sore throat or is too stressed? These two questions put into evidence the main sources of variability in ASR. Extrinsic variabilities are due to the environment: the signal-to-noise ratio may be high but can also vary within a short time, telecommunication channels (wired or wireless) show variable properties, and simply changing microphones may cause a strong increase in error rate. The speech signal not only conveys semantic information (the message) but also a lot of information about the speaker himself: gender, age, social and regional origin, health and emotional state and, with a rather strong reliability, his identity; these are intrinsic variabilities.
Characterization of the effect of some of these specific vari-
ations, together with related techniques to improve ASR robust-
ness is a major research topic.
As a first obvious theme, the speech signal is non-
stationary. The power spectral density of speech varies over
time according to the glottal signal (which for instance affects the
pitch) and the configuration of the speech articulators (tongue,
jaws, lips...). This signal is modeled, through Hidden Markov
Models (HMMs), as a sequence of stationary random regimes.
At a first stage of processing, most ASR processes analyze short
signal frames (typically covering 30 ms of speech) on which sta-
tionarity is assumed. More subtle signal analysis techniques are
being studied in the framework of ASR.
Compensation for noise degradation (additive noise) can be
done at several levels: either by enhancing speech signal, or
by training models on noisy databases, or by designing specific
models for noise and speech, or by considering noise as missing
information that can be marginalized in a statistical training of
models by making hypotheses on the parametric distributions
of noise and speech.
Known as convolution noise, degradations due to the chan-
nel come from its slowly varying spectral properties (or im-
pulse response) that can be reduced by averaging speech fea-
tures (Cepstral Mean Subtraction) or by estimating the impulse response as missing data, combined with additive noise reduction.
Among intrinsic variabilities, modification of the speech
production at the level of articulatory mechanisms under spe-
cific conditions plays a crucial role. Studies on the impact of
coarticulation have yielded segment based, articulatory, as well
as widely used context dependent (CD) modeling techniques.
Even in carefully articulated speech, the production of a partic-
ular phoneme results from a continuous gesture of the articula-
tors, coming from the configuration of the previous phoneme,
and going to the configuration of the following phoneme. In
different and more relaxed speaking styles, stronger pronunci-
ation effects always appear. Some of these are particular to a language (and mostly unconscious). Others are related to re-
gional origin, and are referred to as accents (or dialects for the
linguistic counterpart) or to social groups and are referred to as
sociolects. Although some of these phenomena may be mod-
eled appropriately by CD modeling techniques, their impact is
rather characterized more simply at the pronunciation model
level. At this stage, phonological knowledge may be helpful,
especially in the case of strong effects like foreign accent. Fully
data-driven techniques have also been proposed.
Following coarticulation and pronunciation effects, speaker
related spectral characteristics (and gender) have been identi-
fied as another major dimension of speech variability. Spe-
cific models of frequency warping (based on vocal tract length
differences) have been proposed, as well as more general fea-
tures compensation and model adaptation techniques, relying
on Maximum Likelihood or Maximum a Posteriori criteria.
These model adaptation techniques provide a general formalism
for re-estimation based on moderate amounts of speech data.
Besides these speaker specific properties outlined above,
other extra-linguistic variabilities are admittedly affecting the
signal and ASR systems. A person can change his voice to be
louder, quieter, more tense or softer, or even whisper. Also,
some reflex effects exist, such as speaking louder when the en-
vironment is noisy.
Speaking faster or slower also has an influence on the speech
signal. This impacts both temporal and spectral characteristics
of the signal, both affecting the acoustic models. Obviously,
faster speaking rates may also translate into more frequent and
stronger pronunciation changes.
Speech also varies with age, due to both generational and
physiological reasons. The two “extremes” of the range are gen-
erally put at a disadvantage due to the fact that research corpora,
as well as corpora used for model estimation, are typically not
designed to be representative of children and elderly speech.
Some general adaptation techniques can however be applied to
counteract this problem.
Emotions are also becoming a hot topic, as they can indeed
have a negative effect on ASR; and also because added-value
can emerge from applications that are able to identify the user
emotional state (frustration due to compromised usability for
instance).
Finally, research on recognition of spontaneous conversa-
tions has highlighted the strong detrimental impact of
this elocution style; and current studies are trying to better char-
acterize pronunciation variation phenomena inherent in sponta-
neous speech.
This paper reviews some current advances related to these
topics. It focuses on variations within the speech signal that
make the ASR task difficult. Intrinsic variations to the speech
signal affect the different levels of the ASR processing chain.
Extrinsic variations have been more studied in the past but re-
cent new approaches deserve a special report in this paper that
summarizes the current literature without pretending to be ex-
haustive and highlights specific feature extraction or modeling
weaknesses.
The paper is organized as follows. In a first section, intrin-
sic variability factors are reviewed individually according to the
major trends identified in the literature. The section gathers in-
formation on the effect of variations on the structure of speech
as well as the ASR performance. Typical modeling or engineer-
ing solutions that have been adopted at the different stages of
the ASR chain are also introduced.
In general, this review further motivates research on the
acoustic, phonetic and pronunciation limitations of speech
recognition by machines. It is for instance acknowledged that
pronunciation discrepancies are a major factor of reduced per-
formance (in the case of accented and spontaneous speech).
Section 3 reviews ongoing trends and possible breakthroughs
in general feature extraction and modeling techniques that
provide more resistance to speech production variability. It
also includes recent techniques for noise/channel compensa-
tion. The issues that are being addressed include the fact that
temporal representations/models may not match the structure
of speech, as well as the fact that some analysis and modeling
assumptions can be detrimental. General techniques such as
compensation, adaptation, multiple models, additional acoustic
cues and more accurate models are briefly surveyed.
2. Variation in speech
2.1. Speaker characteristics
Obviously, the speech signal not only conveys the linguistic in-
formation (the message) but also a lot of information about the
speaker himself: gender, age, social and regional origin, health
and emotional state and, with a rather strong reliability, his identity. Apart from the intra-speaker variability (emotion, health,
age), it is commonly admitted that the speaker uniqueness re-
sults from a complex combination of physiological and cultural
aspects [1]. While investigating the variability between speak-
ers through statistical analysis methods, [2] found that the first
two principal components correspond to the gender and accent
respectively. Gender would then appear as the prime factor re-
lated to physiological differences, and accent would be one of
the most important from the cultural point of view. This section
deals mostly with physiological factors.
The complex shape of the vocal organs determines the
unique "timbre" of every speaker. The larynx, which is the location of the source of the speech signal, conveys the pitch and important speaker information. The vocal tract can be modeled
by a tube resonator [3]. The resonant frequencies (the formants)
structure the global shape of the instantaneous voice spectrum and mostly define the phonetic content and quality
of the vowels.
Standard feature extraction methods (PLP, MFCC) simply
ignore the pitch component. On the other hand, the effect of the
vocal tract shape on the intrinsic variability of the speech signal
between different speakers has been widely studied and many
solutions to compensate for its impact on ASR performance
have been proposed: ”speaker independent” feature extraction,
speaker normalization, speaker adaptation. The formant struc-
ture of vowel spectra has been the subject of early studies [4]
that amongst others have established the standard view that the
F1-F2 plane is the most descriptive, two-dimensional represen-
tation of the phonetic quality of spoken vowel sounds. On the
other hand, similar studies underlined the speaker specificity of
higher formants and spectral content above 2.5 kHz [5]. Other
important studies [6] suggested that relative positions of the for-
mant frequencies are rather constant for a given sound spoken
by different speakers and, as a corollary, that absolute formant
positions are speaker-specific. These observations are corrobo-
rated by the acoustic theory applied to the tube resonator model
of the vocal tract which states that positions of the resonant fre-
quencies are inversely proportional to the length of the vocal
tract [7]. This observation is at the root of different techniques
that increase the robustness of ASR systems to inter-speaker
variability.
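To make the inverse-length relation concrete, the following worked example applies the standard quarter-wavelength formula for a uniform lossless tube closed at the glottis and open at the lips; the numerical values (17.5 cm and 14 cm tract lengths, c = 350 m/s) are illustrative choices, not figures taken from [7].
```latex
% Resonances of a uniform tube of length L, closed at one end:
%   F_n = (2n - 1) c / (4L),   n = 1, 2, 3, ...
% With c = 350 m/s:
%   L = 17.5 cm:  F_1 = 500 Hz,  F_2 = 1500 Hz,  F_3 = 2500 Hz
%   L = 14.0 cm:  F_1 = 625 Hz,  F_2 = 1875 Hz,  F_3 = 3125 Hz
\[
  F_n \;=\; \frac{(2n-1)\,c}{4L}, \qquad
  \frac{F_n^{(\mathrm{short})}}{F_n^{(\mathrm{long})}} \;=\; \frac{L_{\mathrm{long}}}{L_{\mathrm{short}}}
\]
```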
The preponderance of lower frequencies for carrying the
linguistic information has been assessed by both perceptual and
acoustical analysis and justify the success of the non-linear fre-
quency scales such as Mel, Bark, Erb. Other approaches aim
at building acoustic features invariant to the frequency warp-
ing [8, 9]. A direct application of the tube resonator model of
the vocal tract leads to the different vocal tract length normalization (VTLN) techniques: speaker-dependent formant mapping [10], transformation of the LPC pole modeling [11], and frequency warping, either linear [12] or non-linear [13]; all consist in modifying the position of the formants in order to get closer to an "average" canonical speaker. Incidentally, channel compensation techniques such as cepstral mean subtraction or the RASTA filtering of spectral trajectories also compensate for the speaker-dependent component of the long-term spectrum [14, 15].
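As an illustration of the frequency-warping idea behind VTLN, the sketch below applies a piecewise-linear warp to a magnitude spectrum before filterbank analysis. It is a minimal example under simplifying assumptions (a single global warping factor, warping applied directly to FFT bins); the function name, the 0.85 break point and the interpolation-based resampling are illustrative choices, not the exact procedures of [10, 11, 12, 13].
```python
import numpy as np

def warp_spectrum(mag_spec, alpha, sample_rate=16000, f_cut_ratio=0.85):
    """Piecewise-linear VTLN-style warping of one magnitude spectrum (sketch)."""
    n_bins = len(mag_spec)
    f_nyq = sample_rate / 2.0
    freqs = np.linspace(0.0, f_nyq, n_bins)
    f_cut = f_cut_ratio * f_nyq
    # Below f_cut, frequencies are scaled by alpha; above, a linear segment
    # maps the Nyquist frequency onto itself so the axis stays valid.
    warped = np.where(
        freqs <= f_cut,
        freqs * alpha,
        f_cut * alpha + (freqs - f_cut) * (f_nyq - f_cut * alpha) / (f_nyq - f_cut),
    )
    warped = np.clip(warped, 0.0, f_nyq)
    # Resample the spectrum along the warped frequency axis.
    return np.interp(warped, freqs, mag_spec)
```
The warped spectrum can then be passed to the usual Mel filterbank and cepstral analysis, which is how such a warp would typically be inserted into a front-end.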
On the other hand, general adaptation techniques reduce speaker specificities and tend to further close the gap between speaker-dependent and speaker-independent ASR by adapting the acoustic models to a particular speaker [16, 17].
2.2. Foreign and regional accents
As introduced earlier, accent is one of the major components
of interspeaker variability, as demonstrated in [2]. And in-
deed, compared to native speech recognition, performance degrades when recognizing accented speech and even more for
non-native speech recognition [18]. In fact accented speech
is associated with a shift within the feature space [19]. For na-
tive accents the shift is applied by large groups of speakers, is
more or less important, more or less global, but overall acoustic
confusability is not changed significantly. On the contrary, for
foreign accents, the shift is very variable, is influenced by the
native language, and depends also on the level of proficiency of
the speaker.
Regional variants correspond to significantly different data,
and enriched modelling is generally used to handle such vari-
ants. This can be achieved through the use of multiple acous-
tic models associated to large groups of speakers as in [20]
or through the introduction of detailed pronunciation variants
at the lexical level [21]. However adding too many systematic
pronunciation variants may be harmful [22].
Non-native speech recognition is not properly handled by
native speech models, no matter how much dialect data is in-
cluded in the training [23]. This is due to the fact that non-native speakers may replace a phoneme of the target language that is absent from their native phoneme inventory with the phoneme of their native language considered as the closest [24]. This behaviour makes the
non-native alterations dependent on both the native language
and the speaker. Some sounds may be replaced by other sounds,
or inserted or omitted, and such insertion/omission behaviour
cannot be handled by the usual triphone-based modelling [25].
In the specific context of speaker dependent recognition, adap-
tation techniques can be used [18]. For speaker independent
systems this is not feasible. Introducing multiple phonetic tran-
scriptions that handle alterations produced by non-native speak-
ers is a usual approach, and is generally associated to a combi-
nation of phone models of the native language with phone mod-
els of the target language [26]. When a single foreign accent is
handled, some accented data can be used for training or adapt-
ing the acoustic models [27]. Proper and foreign name process-
ing is another topic strongly related with foreign accent [28].
Multilingual phone models have been investigated for many years in the hope of achieving language-independent units [29].
Language independent phone models are often useful when lit-
tle or no data exists in a particular language and their use re-
duces the size of the phoneme inventory of multilingual speech
recognition systems. The mapping between phoneme models of
different languages can be derived from data [30] or determined
from phonetic knowledge [31], but this is far from obvious as
each language has its own characteristic set of phonetic units
and associated distinctive features. Moreover, a phonemic dis-
tinguishing feature for a given language may hardly be audible
to a native of another language.
Although accent robustness is a desirable property of spoken language systems, accent classification has also been studied for many years [32]. As a related topic, speech recognition
technology is also used in foreign language learning for rating
the quality of the pronunciation [33].
2.3. Speaking rate and style
Speaking rate, expressed for instance in phonemes or syllables
per second, is an important factor of intra-speaker variability.
When speaking at a fast rate, the timing and acoustic realization
of syllables are strongly affected due in part to the limitations
of the articulatory machinery.
In automatic speech recognition, the significant perfor-
mance degradations caused by speech rate variations stimulated
many studies for modeling the spectral effects of speech rate
variations. All the schemes presented in the literature make use
of a speech rate estimator, based on different methods, provid-
ing the number of phones or syllables per second. The most
common methods rely upon the evaluation of the frequency of
phonemes or syllables in a sentence [34], through a prelimi-
nary segmentation of the test utterance; other approaches per-
form a normalization by dividing the measured phone duration
by the average duration of the underlying phone [35]. Some
approaches address the pronunciation correlates of fast speech.
In [36], the authors rely upon an explicit modeling strategy, us-
ing different variants of pronunciation.
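The two rate measures mentioned above (phones per second, and phone durations normalized by phone-specific average durations) can be computed directly from a preliminary segmentation; the short sketch below illustrates this under the assumption that such a segmentation and a table of average durations are available (all names are hypothetical).
```python
def speaking_rate_stats(phone_segments, avg_durations):
    """Sketch of two common rate-of-speech measures.

    phone_segments: list of (phone_label, start_s, end_s) from a preliminary
                    segmentation of the utterance
    avg_durations:  dict phone_label -> average duration in seconds
    Returns (phones per second, mean normalized phone duration)."""
    total_time = phone_segments[-1][2] - phone_segments[0][1]
    rate = len(phone_segments) / total_time
    # Duration of each phone divided by the average duration of that phone.
    normalized = [
        (end - start) / avg_durations[label]
        for label, start, end in phone_segments
    ]
    return rate, sum(normalized) / len(normalized)
```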
In casual situations or under time pressure, slurred pronunciations of certain phonemes indeed occur. Besides physi-
ology, this builds on the speech redundancy and it has been
hypothesized that this slurring affects more strongly sections
that are more easily predicted. In contrast, speech portions
where confusability is higher tend to be articulated more care-
fully [37].
In circumstances where the transmission and intelligibility
of the message is at risk, a person can make use of an op-
posite articulatory behaviour, and for instance articulate more
distinctly. Another related phenomenon happens in noisy en-
vironments where the speaker adapts (maybe unconsciously)
with the communicative purpose of increasing the intelligibil-
ity. This effect of augmented tension on the vocal folds as well
as augmented loudness is known as the Lombard reflex [38].
These are crucial issues and research on speaking style
specificities as well as spontaneous speech modeling is hence
very active. Techniques to increase accuracy towards sponta-
neous speech have mostly focused on pronunciation studies (besides language modeling, which is out of the scope of this paper).
Also, the strong dependency of pronunciation phenomena with re-
spect to the syllable structure has been highlighted [39, 40]. As
a consequence, extensions of acoustic modeling dependency to
the phoneme position in a syllable and to the syllable position
in word and sentences have been proposed [39].
Variations in spontaneous speech can also extend beyond
the typical phonological alterations outlined previously. Phe-
nomena called disfluencies can also be present, such as false
starts, repetitions, hesitations and filled pauses. The reader will
find useful information in [41, 42].
2.4. Age
Age is another major cause of variability and mismatch in
speech recognition systems. The first reason is of physiological
nature [43]. Children have shorter vocal tracts and vocal folds compared with adults. This results in higher positions of the formants and a higher fundamental frequency. The high fundamental fre-
quency is reflected as a large distance between the harmonics,
resulting in poor spectral resolution of voiced sounds. The dif-
ference in vocal tract size results in a non-linear increase of the
formant frequencies.
In order to reduce this effect, previous studies have focused
on the acoustic analysis of children speech [44, 45]. This work
has put into evidence the challenges faced by speech recognition systems developed to automatically recognize
children speech. For example, it has been shown that children
below the age of 10 exhibit a wider range of vowel durations rel-
ative to older children and adults, larger spectral and supraseg-
mental variations, and wider variability in formant locations and
fundamental frequencies in the speech signal.
Obviously, younger children may not have a correct pro-
nunciation. Sometimes they have not yet learnt how to articulate
specific phonemes [46]. Finally, children are using language in
a different way. The vocabulary is smaller but may also contain
words that do not appear in adult speech. The correct inflec-
tional forms of certain words may not have been acquired fully,
especially for those words that are exceptions to common rules.
Spontaneous speech is also believed to be less grammatical than
for adults. A number of different solutions have been proposed: modification of the pronunciation dictionary and the use of language models customized for children speech have both been tried [47].
Several studies have attempted to address this problem by
adapting the acoustic features of children speech to match those of acoustic models trained on adult speech [48]. Such ap-
proaches include vocal tract length normalization (VTLN) [49]
as well as spectral normalization [50]. However, most of these
studies point to the lack of children acoustic data and resources to estimate speech recognition parameters, relative to the overabundance of existing resources for adult speech recognition.
Simply training a conventional speech recognizer on children
speech is not sufficient to yield high accuracies, as demonstrated
by Wilpon and Jacobsen [51]. Recently, corpora for children
speech recognition have begun to emerge (for instance [52, 53]
and [54] of the PF-STAR project).
2.5. Emotions
Similarly to the previously discussed speech intrinsic variations,
emotional state is found to significantly influence the speech
spectrum. It is recognized that a speaker's mood change has a
considerable impact on the features extracted from his speech,
hence directly affecting the basis of all speech recognition sys-
tems.
Research on speaker emotions is a fairly recent, emerging field, and most of today's literature that remotely deals with emo-
tions in speech recognition is concentrated on attempting to
classify a “stressed” speech signal into its correct emotion cat-
egory. The purpose of these efforts is to further improve man-
machine communication. The studies that interest us are differ-
ent. Being interested in speech intrinsic variabilities, we focus
our attention on the recognition of speech produced in differ-
ent emotional states. The stressed speech categories studied are
generally a collection of all the previously described intrinsic
variabilities: loud, soft, Lombard, fast, angry, scared; and noise.
As Hansen formulates it in [55], approaches for robust
recognition can be summarized under three areas: (i) better
training methods, (ii) improved front-end processing, and (iii)
improved back-end processing or robust recognition measures.
A majority of work undertaken up to now revolves around in-
specting the specific differences in the speech signal under the
different stress conditions. Concerning the research specifically
geared towards robust recognition, the first approach, based
on improved training methods, comprises the following works:
multi-style training [56], and simulated stress token genera-
tion [57]. As for all the improved training methods, recognition
performance is increased only around the training conditions
and degradation in results is observed as the test conditions drift
from the original training data.
The second category of research is front-end processing,
the goal being to devise feature extraction methods tailored for
the recognition of stressed and non-stressed speech simultane-
ously [55, 58].
Finally, some interest has been focused on improving back-
end processing as means of robust recognition. These tech-
niques rely on adapting the model structure within the recog-
nition system to account for the variability in the input signal.
Owing to the drawbacks of the “improved modeling” approach, one practice has been to bring the training and test conditions closer by space projection [59].
3. ASR techniques
In this section, we review methodologies towards improved
ASR analysis/modeling accuracy and resistance towards vari-
ability sources.
3.1. Front-end techniques
An update on feature extraction front-end is proposed, particu-
larly showing how to take advantage of techniques targeting the
non-stationarity assumption. Also, the feature extraction stage
can be the appropriate level to target some other variations, like
the speaker spectral characteristics (through feature compensa-
tion [60] or else improved invariance [9]) and other dimensions
of speech variability. Also, noise reduction can be achieved by
feature compensation. Finally, techniques for combining estimation based on different feature sets are recalled. This may
also involve dimensionality reduction approaches.
3.1.1. Overcoming assumptions
Most of the Automatic Speech Recognition (ASR) acoustic fea-
tures, such as Mel-Frequency Cepstral Coefficients (MFCC) [61] or Perceptual Linear Prediction (PLP) coefficients [62], are based
on some sort of representation of the smoothed spectral enve-
lope, usually estimated over fixed analysis windows of typi-
cally 20 ms to 30 ms [61, 63]. Such analysis is based on the
assumption that the speech signal is quasi-stationary over these
segment durations. However, it is well known that the voiced
speech sounds such as vowels are quasi-stationary for 40-80 ms, while stops and plosives are time-limited to less than 20 ms [63]. This implies that spectral analysis based on a fixed-size window of 20-30 ms has some limitations, in-
cluding:
- The frequency resolution obtained for quasi-stationary segments (QSS) longer than 20 ms is quite low compared to what could be obtained using larger analysis windows.
- In certain cases, the analysis window can span the transition between two QSSs, thus blurring the spectral properties of the QSSs as well as of the transitions. Indeed, in theory, the Power Spectral Density (PSD) cannot even be defined for such non-stationary segments [64]. Furthermore, on a more practical note, the feature vectors extracted from such transition segments do not belong to a single unique (stationary) class and may lead to poor discrimination in a pattern recognition problem.
In [65], the usual assumption is made that the piecewise
quasi-stationary segments (QSS) of the speech signal can be
modeled by a Gaussian AR process of a fixed order p, as in [66, 67, 68]. The problem of detecting QSSs is then formulated using a Maximum Likelihood (ML) criterion, defining a QSS as the longest segment that has most probably been generated by the same AR process (equivalent to detecting the transition point between two adjoining QSSs). Given a p-th order AR Gaussian
QSS, the Minimum Mean Square Error (MMSE) linear predic-
tion (LP) filter parameters [a(1), a(2), ... a(p)] are the most
“compact” representation of that QSS amongst all the p-th order all-pole filters [64]. In other words, the normalized “coding error” (the power of the residual signal normalized by the number of samples in the window) is minimum amongst all the p-th order LP filters.
When erroneously analyzing two distinct p-th order AR Gaussian QSSs in the same non-stationary analysis window, it can be shown that the “coding error” will then always be greater than the ones resulting from QSSs analyzed individually in stationary windows [64]. Therefore, a higher coding error is expected in the
former case as compared to the optimal case when each QSS
is analyzed in a stationary window. Once the “start” and the
“end” points of a QSS are known, all the speech samples com-
ing from this QSS are analyzed within that window, resulting in
(variable-scale) acoustic vectors.
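A simplified illustration of this idea is sketched below: the per-sample LP coding error of a candidate window is computed with the autocorrelation method, and the window is grown as long as this error does not rise noticeably. This approximates, but does not reproduce, the ML segmentation criterion of [65]; the LP order, window sizes and threshold are illustrative choices.
```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lp_coding_error(frame, order=12):
    """Normalized LP residual energy of one segment (autocorrelation method)."""
    frame = frame - np.mean(frame)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if r[0] <= 0:
        return 0.0
    a = solve_toeplitz(r[:order], r[1:order + 1])   # LP predictor coefficients
    err = r[0] - np.dot(a, r[1:order + 1])          # residual (coding) energy
    return err / len(frame)                         # per-sample coding error

def grow_qss(signal, start, min_len=160, step=80, order=12, tol=1.5):
    """Greedy sketch of QSS detection: extend the analysis window while the
    per-sample coding error stays within `tol` times the error of the initial
    window (the threshold is a hypothetical choice)."""
    end = start + min_len
    base = lp_coding_error(signal[start:end], order)
    while end + step <= len(signal):
        if lp_coding_error(signal[start:end + step], order) > tol * base:
            break
        end += step
    return end  # boundary of the detected quasi-stationary segment
```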
Another approach is proposed in [69], which described a
temporal decomposition technique to represent the continuous
variation of the LPC parameters as a linearly weighted sum of
a number of discrete elementary components. These elemen-
tary components are designed such that they have the minimum
temporal spread (highly localized in time) resulting in superior
coding efficiency. However, the relationship between the opti-
mization criterion of “the minimum temporal spread” and the
quasi-stationarity is not obvious. Therefore, the discrete ele-
mentary components are not necessarily quasi-stationary and
vice-versa.
Coifman et al [70] have described a minimum entropy ba-
sis selection algorithm to achieve the minimum information cost
of a signal relative to the designed orthonormal basis. In [66],
Svendsen et al have proposed a ML segmentation algorithm us-
ing a single fixed window size for speech analysis, followed
by a clustering of the frames which were spectrally similar for
sub-word unit design. More recently, Achan et al [71] have pro-
posed a segmental HMM for speech waveforms which identifies
waveform samples at the boundaries between glottal pulse pe-
riods with applications in pitch estimation and time-scale mod-
ifications.
As a complementary principle to developing features that
“work around” the non-stationarity of speech, significant efforts
have also been made to develop new speech signal representa-
tions which can better describe the non-stationarity inherent in
the speech signal. Some representative examples are temporal patterns features [72], MLP and the several modulation spectrum related techniques [73, 74, 75, 76]. In this approach, tem-
poral trajectories of spectral energies in individual critical bands
over windows as long as one second are used as features for pat-
tern classification. Another methodology is to use the notion of
the amplitude modulation (AM) and the frequency modulation
(FM) [77]. In theory, the AM signal modulates a narrow-band
carrier signal (specifically, a monochromatic sinusoidal signal).
Therefore, to be able to extract the AM signals of a wide-band signal such as speech (typically 4 kHz), it is necessary to de-
compose the speech signal into narrow spectral bands. In [78],
this approach is opposed to the previous use of the speech mod-
ulation spectrum [73, 74, 75, 76] which was derived by decom-
posing the speech signal into increasingly wider spectral bands
(such as critical, Bark or Mel). Similar arguments from the
modulation filtering point of view, were presented by Schimmel
and Atlas[79]. In their experiment, they consider a wide-band
filtered speech signal x(t) = a(t)c(t), where a(t)is the AM
signal and c(t)is the broad-band carrier signal. Then, they per-
form a low-pass modulation filtering of the AM signal a(t)to
obtain aLP (t). The low-pass filtered AM signal aLP (t)is then
multiplied with the original carrier c(t)to obtain a new signal
˜x(t). They show t hat the acoustic bandwidth of ˜x(t)is not
necessarily less than that of the original signal x(t). This unex-
pected result is a consequence of the signal decomposition into
wide spectral bands that results in a broad-band carrier.
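To make the AM/carrier decomposition concrete, the sketch below performs modulation filtering on a single narrow sub-band using the Hilbert envelope; the band edges and modulation cut-off are illustrative values, and the code is not a reconstruction of the experiments in [78, 79].
```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def modulation_filter_band(x, fs, band=(1000, 1200), mod_cutoff=16.0):
    """Sketch of AM modulation filtering on one narrow sub-band."""
    # 1. Isolate a narrow band so the AM/carrier decomposition is meaningful.
    sos_band = butter(4, band, btype="bandpass", fs=fs, output="sos")
    sub = sosfiltfilt(sos_band, x)
    # 2. Analytic signal: magnitude = AM envelope, angle = carrier phase.
    analytic = hilbert(sub)
    am = np.abs(analytic)
    carrier = np.cos(np.angle(analytic))
    # 3. Low-pass filter the envelope (modulation filtering).
    sos_lp = butter(4, mod_cutoff, btype="lowpass", fs=fs, output="sos")
    am_lp = sosfiltfilt(sos_lp, am)
    # 4. Recombine; with a narrow-band carrier the resulting acoustic
    #    bandwidth stays controlled, unlike the wide-band case discussed above.
    return am_lp * carrier
```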
Finally, as an extension to the “traditional” AR-process (all-pole model) speech modeling, pole-zero transfer functions, which are used for modeling the frequency response of a signal, have been well studied and understood [80]. Lately, Kumaresan et al. [81, 82] have proposed to model analytic signals using pole-zero models in the temporal domain. Along similar lines, Athineos et al. [83] have used the dual of the linear prediction in the
frequency domain to improve upon the TRAP features.
3.1.2. Compensation and invariance
Simple models may exist that appropriately reflect the effect of a variability on speech features. This is for instance the case for long-term spectral characteristics, mostly related to the Vocal Tract Length (VTL) of the speaker. Simple yet pow-
erful techniques for normalizing (compensating) the features to
the VTL are widely used [60].
An alternative to normalization is the generation of invari-
ant features. For vocal tract length for instance, [8, 9] propose to
exploit the fact that vocal-tract length variations can be approx-
imated via linear frequency warping. In [8], the scale trans-
form and the scale cepstrum have been introduced. Both trans-
forms exhibit the interesting property that their magnitudes are
invariant to linear frequency warping. In [9], the continuous
wavelet transform has been used as a preprocessing step, in or-
der to obtain a speech representation in which linear frequency
scaling leads to a translation in the time-scale plane. In a sec-
ond step, frequency-warping invariant features were generated.
These include the auto- and cross-correlation of magnitudes of
local wavelet spectra as well as linear and nonlinear transforms
thereof. It could be shown that these features not only lead to
better recognition scores than standard MFCCs, but that they
are also more robust to mismatches between training and test
conditions, such as training on male and testing on female data.
The best results were obtained when MFCCs and the vocal tract
length invariant features were combined, showing that both sets
contain complementary information [9].
Normalization (compensation) or invariance with respect to
other dimensions may also be useful (e.g. with respect to speak-
ing rate).
When simple parametric models of the effect of the
variability are not appropriate, feature compensation can be
performed using more generic non-parametric transformation
schemes, including linear and non-linear transformation. This
becomes a dual approach to model adaptation, which is the topic
of Section 3.2.2.
3.1.3. Noise compensation
One of the most popular techniques for increasing recognition accuracy in noise is spectral subtraction [84, 85], where the noise spectrum is estimated during short pauses and subtracted
from the spectrum of noisy speech. Although this method is
not appropriate for non-stationary noise, slowly varying noise
can be removed from the signal since noise spectrum is regu-
larly updated. Two major drawbacks are the difficulty of detecting pauses (non-speech) at low SNR, and the fact that the subtraction must be carefully controlled to avoid negative values in the "clean" speech spectrum, which lead to the so-called musical noise effect [86, 87]. Also, the assumption that noisy speech power
spectrum is the sum of noise power spectrum and the clean
speech power spectrum is not correct (see more recent tech-
niques where this hypothesis is relaxed).
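A minimal power spectral subtraction sketch is given below; it assumes the first frames of the STFT are speech-free (side-stepping the pause-detection problem mentioned above) and uses spectral flooring to limit the musical-noise effect. Parameter values and the function name are illustrative.
```python
import numpy as np

def spectral_subtraction(noisy_stft, noise_frames=10, over_sub=1.0, floor=0.01):
    """Minimal power spectral subtraction (sketch).

    noisy_stft: complex STFT matrix of shape (n_freq, n_frames); the first
    `noise_frames` frames are assumed to contain noise only."""
    power = np.abs(noisy_stft) ** 2
    noise_power = power[:, :noise_frames].mean(axis=1, keepdims=True)
    # Subtract the noise estimate; flooring avoids the negative values that
    # otherwise cause 'musical noise' artefacts.
    clean_power = np.maximum(power - over_sub * noise_power, floor * noise_power)
    # Re-use the noisy phase to go back to a complex spectrum.
    return np.sqrt(clean_power) * np.exp(1j * np.angle(noisy_stft))
```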
Another noise reduction method consists in filtering speech
with a high order adaptive FIR filter [88]. When no reference to
an external noise source is available, a Wiener linear prediction filter may suppress interfering noise under the hypotheses that the input and the noise are stationary and that the noise spectrum is much wider than the spectrum of the input. The main advantage of this method is that no noise reference source is required. In the case of speech, most of these hypotheses are not valid, but for voiced speech the signal can be seen as a sequence of sinusoids and interesting results can be demonstrated.
Statistical approaches for noise reduction have been re-
ported in [89]. More recently, several new approaches like un-
certainty decoding [90] and the SPLICE algorithm described
by [91, 92] have raised a strong interest for these techniques
that simultaneously estimate noise and clean speech, making a priori hypotheses on their distributions. SPLICE works on a spectral representation of speech. Another similar algorithm, ALGONQUIN, works on log-spectra and has been described recently in Kristjansson's thesis [93], where the hypothesis of decorrelation between noise and clean speech is shown to be unnecessary. These authors also deal with convolution noise. Up to now, convolution noise, which usually varies slowly, can be low-pass filtered out: this is achieved by removing the cepstral mean from all feature vectors of the utterance [94, 95].
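Cepstral mean subtraction itself reduces to a one-line operation per utterance, as sketched below: since a slowly varying convolutional channel becomes an approximately constant additive offset in the cepstral domain, removing the per-utterance mean removes most of its effect.
```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Remove the per-utterance cepstral mean (sketch).

    cepstra: array of shape (n_frames, n_coeffs)."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```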
3.1.4. Additional cues and feature combinations
As a complementary perspective to improving or compensating
single feature sets, one can also make use of several “streams”
of features that rely on different underlying assumptions and
exhibit different properties.
Intrinsic feature variability depends on the set of classes that the features have to discriminate. Given a set of acoustic mea-
surements, algorithms have been described to select subsets of
them that improve automatic classification of speech data into
phonemes or phonetic features. Unfortunately, pertinent algo-
rithms are computationally intractable with these types of classes, as stated in [96, 97], where a sub-optimal solution is proposed. It consists in selecting a set of acoustic measurements that guar-
antees a high value of the mutual information between acoustic
measurements and phonetic distinctive features.
Without attempting to find an optimal set of acoustic mea-
surements, many recent automatic speech recognition systems
combine streams of different acoustic measurements on the as-
sumption that some characteristics that are de-emphasized by a
particular feature are emphasized by another feature, and there-
fore the combined feature streams capture complementary in-
formation present in individual features. In [98], it is shown
that log-linear combination provides good results when used for
integrating probabilities provided by acoustic models.
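A log-linear combination of per-stream acoustic scores amounts to a weighted sum in the log domain; the small sketch below illustrates this with an optional renormalization step (stream weights are assumed to be tuned on development data; all names are illustrative).
```python
import numpy as np

def log_linear_combination(stream_log_probs, weights):
    """Log-linear combination of per-stream log-likelihoods (sketch).

    stream_log_probs: list of arrays of shape (n_states,), one per stream.
    weights: one exponent per stream."""
    combined = sum(w * lp for w, lp in zip(weights, stream_log_probs))
    # Optional renormalization so the result is a proper log-distribution.
    return combined - np.logaddexp.reduce(combined)
```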
In order to take into account different temporal behaviors
in different bands, it has been proposed ([99, 100, 101]) to con-
sider separate streams of features extracted in separate chan-
nels with different frequency bands. Other approaches integrate
some specific parameters into a single stream of features. Ex-
amples of added parameters are: periodicity and jitter [102], voicing [103, 104], and rate of speech and pitch [105].
To benefit from the strengths of both MLP-HMM and Gaussian-
HMM techniques, the Tandem solution was proposed in [106],
using posterior probability estimation obtained at MLP outputs
as observations for a Gaussian-HMM. An error analysis of Tan-
dem MLP features showed that the errors using MLP features
are different from the errors using cepstral features. This moti-
vates the combination of both feature styles. In [107], combination techniques were applied to increasingly more advanced systems, showing the benefits of the MLP-based features. These features have been combined with TRAP features [98, 108]. In [109], Gabor filters are proposed, in conjunction with MLP
features, to model the characteristics of neurons in the auditory
system as they do in the visual system. There is evidence that
in primary auditory cortex each individual neuron is tuned to a
specific combination of spectral and temporal modulation fre-
quencies.
In [110], it is proposed to use Gaussian mixtures to represent the presence and absence of features.
Additional features have also been considered as cues for
speech recognition failures [111].
3.1.5. Dimensionality reduction and feature selection
Using additional features/cues as reviewed in the previous sec-
tion, or simply extending the context by concatenating fea-
ture vectors from adjacent frames may yield very long fea-
ture vectors in which several features contain redundant infor-
mation, thus requiring an additional dimension-reduction stage
[112, 113] and/or improved training procedures [114].
The most common feature-reduction technique is the use
of a linear transform y = Ax, where x and y are the original and the reduced feature vectors, respectively, and A is a p × n matrix with p < n, where n and p are the original and the desired number of features, respectively. Principal component analysis (PCA) [115, 116] is the simplest way of finding A. It allows for the best reconstruction of x from y in the sense
of a minimal average squared Euclidean distance. However, it
does not take the final classification task into account and is
therefore only suboptimal for finding reduced feature sets. A
more classification-related approach is the linear discriminant
analysis (LDA), which is based on Fisher’s ratio (F-ratio) of
between-class and within-class covariances [115, 116]. Here
the columns of matrix A are the eigenvectors belonging to the p largest eigenvalues of the matrix S_w^{-1} S_b, where S_w and S_b are the within-class and between-class scatter matrices, respectively.
Good results with LDA have been reported for small vocabulary
speech recognition tasks, but for large-vocabulary speech recog-
nition, results were mixed [112]. In [112] it was found that the
LDA should best be trained on sub-phone units in order to serve
as a preprocessor for a continuous mixture density based recog-
nizer. A limitation of LDA is that it cannot effectively take into
account the presence of different within-class covariance matri-
ces for different classes. Heteroscedastic discriminant analysis
(HDA) [113] overcomes this problem, but the method usually
requires the use of numerical optimization techniques to find
the matrix A. An exception is the method in [117], which uses
the Chernoff distance to measure between-class distances and
leads to a straightforward solution for A. Finally, LDA and
HDA can be combined with maximum likelihood linear trans-
form (MLLT) [118], which is a special case of semi-tied co-
variance matrices (STC) [119]. Both aim at transforming the
reduced features in such a way that they better fit with the di-
agonal covariance matrices that are applied in many HMM rec-
ognizers. It has been reported [120] that such a combination
performs better than LDA or HDA alone. Also, HDA has been
combined with minimum phoneme error (MPE) analysis [121].
Recently, the problem of finding optimal dimension-reducing
feature transformations has been studied from the viewpoint of
maximizing the mutual information between the obtained fea-
ture set and the corresponding phonetic class [96, 122].
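As a concrete illustration of the LDA projection described above, the sketch below estimates the within-class and between-class scatter matrices and keeps the eigenvectors of S_w^{-1} S_b with the p largest eigenvalues, solved as a generalized eigenvalue problem. It is a plain textbook version, not the sub-phone-unit training setup of [112].
```python
import numpy as np
from scipy.linalg import eigh

def lda_transform(features, labels, p):
    """Classical LDA sketch: project n-dim features to p dimensions.

    features: (N, n) array; labels: (N,) array of class indices.
    Assumes every class has several samples so S_w is well conditioned."""
    classes = np.unique(labels)
    mean_all = features.mean(axis=0)
    n_dim = features.shape[1]
    s_w = np.zeros((n_dim, n_dim))
    s_b = np.zeros((n_dim, n_dim))
    for c in classes:
        x_c = features[labels == c]
        mean_c = x_c.mean(axis=0)
        s_w += np.cov(x_c, rowvar=False) * (len(x_c) - 1)   # within-class scatter
        diff = (mean_c - mean_all)[:, None]
        s_b += len(x_c) * diff @ diff.T                      # between-class scatter
    # Generalized eigenvalue problem S_b v = lambda S_w v (ascending order).
    eigvals, eigvecs = eigh(s_b, s_w)
    a = eigvecs[:, ::-1][:, :p].T                            # p x n projection matrix
    return features @ a.T, a
```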
A problem of the use of linear transforms for feature re-
duction is that the entire feature vector x needs to be computed before the reduced vector y can be generated. This may lead
to a large computational cost for feature generation, although
the final number of features may be relatively low. An alterna-
tive is the direct selection of feature subsets, which, expressed
by matrix A, means that each row of A contains a single one while all other elements are zero. The question is then the one of which features to include and which to exclude. Because the elements of A have to be binary, simple algebraic solutions like
with PCA or LDA cannot be found, and iterative strategies have
been proposed. For example, in [123], the maximum entropy
principle was used to decide on the best feature space.
3.2. Acoustic modeling techniques
Concerning acoustic modeling, good performance is generally
achieved when the model is matched to the task, which can be
obtained through adequate training data. Systems with stronger
generalization capabilities can then be built through a so-called
multi-style training. Estimating the parameters of a traditional
modeling architecture in this way however has some limitations
due to the inhomogeneity of the data, which increases the spread
of the models, and hence negatively impacts accuracy compared
to task-specific models. This is partly to be related to the inabil-
ity of the framework to properly model long-term correlations
of the speech signals.
Also, within the acoustic modeling framework, adaptation
techniques provide a general formalism for reestimating opti-
mal model parameters for given circumstances based on mod-
erate amounts of speech data.
Then, the modeling framework can be extended to allow
multiple specific models to cover the space of variation. These
can be obtained through generalizations of the HMM modeling
framework, or through explicit construction of multiple models
built on knowledge-based or data-driven clusters of data.
In the following, extensions for modeling using additional
cues and features are also reviewed.
3.2.1. Model compensation
In section 3.1.3, feature compensation techniques were reported
for enhancing speech features. A dual approach is to apply
acoustic model compensation. Two main techniques were pro-
posed. In [124], Moore proposed MM decomposition, where dynamic time warping was extended to a 3D array in which the additional dimension represents a noise reference, and an optimal path has to be found in this 3D domain. The major problem was the definition of a local probability for each box [125].
Existence of a single noise model is also a severe limitation.
This last difficulty was circumvented by parallel model combination (PMC) [126, 127], where clean speech and noise
are both modeled by HMM and where the local probabilities are
combined at the level of linear spectrum: this implies that only
additive noise can be taken into account.
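The core of the additive-noise case can be illustrated with the log-normal approximation often used in PMC: a clean-speech Gaussian and a noise Gaussian defined in the log-spectral domain are mapped to the linear domain, added, and mapped back. The per-dimension sketch below is a simplification (no DCT/cepstral step, no delta parameters) and is not the exact procedure of [126, 127].
```python
import numpy as np

def pmc_log_add(mean_speech, var_speech, mean_noise, var_noise):
    """Log-normal-approximation sketch of PMC for one Gaussian per dimension,
    in the log-spectral domain, for additive noise only."""
    # Log-normal moments in the linear spectral domain.
    lin_mean_s = np.exp(mean_speech + 0.5 * var_speech)
    lin_var_s = lin_mean_s ** 2 * (np.exp(var_speech) - 1.0)
    lin_mean_n = np.exp(mean_noise + 0.5 * var_noise)
    lin_var_n = lin_mean_n ** 2 * (np.exp(var_noise) - 1.0)
    # Additive combination of independent speech and noise.
    lin_mean = lin_mean_s + lin_mean_n
    lin_var = lin_var_s + lin_var_n
    # Back to the log-spectral domain under the log-normal assumption.
    var_noisy = np.log(1.0 + lin_var / lin_mean ** 2)
    mean_noisy = np.log(lin_mean) - 0.5 * var_noisy
    return mean_noisy, var_noisy
```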
Feature compensation methods seem to be more successful
than model compensation. However, there is a strong relation between the two techniques that is particularly well illustrated by constrained MLLR (C-MLLR) [128], where the transformation matrix for the covariance matrices is the same as the matrix for the mean vectors. In that case, for Gaussian distributions, it is trivial to observe that model compensation is strictly equivalent to feature transformation (this is no longer valid in the case of compensation of state classes).
3.2.2. Adaptation
In Section 3.1.2, we illustrated techniques that can be used to compensate the features for speech variation. The dual approach is to adapt the ASR acoustic models.
In some cases, some variations in the speech signal could
be considered as long term given the applicative scenario. For
instance, a system embedded in a personal device and hence
mainly designed to be used by a single person, or a system
designed to transcribe and index spontaneous speech, or char-
acterized by utilization in a particular environment. In these
cases, it is often possible to adapt the models to these particular
conditions, hence partially factoring out the detrimental effect
of these. A popular technique is to estimate a linear transfor-
mation of the model parameters using a Maximum Likelihood
(ML) criterion [17]. A Maximum a Posteriori (MAP) objective
function may also be used.
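Under strong simplifications (a single global transform, identity covariances, and a hard frame-to-Gaussian alignment), the ML estimate of an MLLR-style mean transform reduces to a least-squares problem, as sketched below; full MLLR as in [17] instead accumulates occupancy-weighted statistics, typically per regression class.
```python
import numpy as np

def mllr_global_mean_transform(frames, aligned_means):
    """Sketch of a global MLLR-style mean transform under the simplifying
    assumptions of identity covariances and hard frame-to-Gaussian alignment.

    frames:        (T, d) adaptation observations
    aligned_means: (T, d) mean of the Gaussian each frame is aligned to
    Returns W of shape (d, d+1) such that adapted_mean = W @ [1, mu]."""
    t, d = frames.shape
    xi = np.hstack([np.ones((t, 1)), aligned_means])   # extended means (T, d+1)
    # Under the simplifications, the ML estimate is the least-squares solution.
    w, *_ = np.linalg.lstsq(xi, frames, rcond=None)    # shape (d+1, d)
    return w.T

# Applying the transform to one Gaussian mean mu of the model:
#   new_mu = W @ np.concatenate(([1.0], mu))
```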
Being able to perform this adaptation using limited amounts
of condition-specific data would be a very desirable property for
such adaptation methodologies, as this would reduce the cost
and hassle of such adaptation phases. Such “rapid” (sometimes
on-line) adaptation schemes have been proposed a few years ago,
mostly based on speaker-space methods, such as eigenvoices
and cluster-based adaptation [129, 130].
Intuitively, these techniques rest on the principle of acquir-
ing knowledge from the training corpora that represent the prior
distribution (or clusters) of model parameters given a variability
factor under study. With these adaptation techniques, knowledge about the effect of the inter-speaker variabilities is gathered in the model. In the traditional approach, this knowledge is simply discarded: although all the speakers are used to build the model, and pdfs are modeled using mixtures of Gaussians, the ties between particular mixture components across the
several CD phonemes are not represented/used.
Recent publications have been extending and refining this
class of techniques. In [131], rapid adaptation is further ex-
tended through a more accurate speaker space model, and an
on-line algorithm is also proposed. In [132], the correlations
between the means of mixture components of the different fea-
tures are modeled using a Markov Random Field, which is then
used to constrain the transformation matrix used for adaptation.
Other publications include [132, 133, 134, 135, 136, 137].
Other forms of transformations for adaptation are also pro-
posed in [138], where the Maximum Likelihood criterion is
used but the transformations are allowed to be nonlinear.
3.2.3. Multiple modeling
Instead of adapting the models to particular conditions, one may
also train ensembles of models specialized to specific conditions
or variations. These models may then be used within a selec-
tion, competition or else combination framework. Such tech-
niques are the object of this section.
Acoustic models are estimated from speech corpora, and
they provide their best recognition performances when the op-
erating (or testing) conditions are consistent with the training
conditions. Hence many adaptation procedures were studied to
adapt generic models to specific tasks and conditions. When
the speech recognition system has to handle various possible
conditions, several speech corpora can be used together for es-
timating the acoustic models, leading to mixed models or hy-
brid systems [139, 140], which provide good performances
in those various conditions (for example in both landline and
wireless networks). However, merging too many heterogeneous
data in the training corpus makes acoustic models less discrim-
inant. Hence the numerous investigations along multiple mod-
eling, that is the usage of several models for each unit, each
model being trained on a subset of the training data, defined ac-
cording to a priori criteria such as gender, age, rate-of-speech
(ROS) or through automatic clustering procedures. Ideally sub-
sets should contain homogeneous data, and be large enough for
making possible a reliable training of the acoustic models.
Gender information is one of the most often used criteria. It
leads to gender-dependent models that are either directly used in
the recognition process itself [141, 142] or used as a better seed
for speaker adaptation [143]. Gender dependence is applied to
whole word units, for example digits [144], or to context de-
pendent phonetic units [142], as a result of an adequate splitting
of the training data.
Age-dependent modeling has been less investigated, maybe due to the lack of large-size children speech corpora. The
results presented in [145] fail to demonstrate a significant im-
provement when using age dependent acoustic models, possibly
due to the limited amount of training data for each class of age.
Speaking rate notably affects recognition performance,
thus speaking rate dependent models were studied [34]. It
was also noticed that speaking rate dependent models are often
getting less speaker-independent because the range of speak-
ing rate shown by different speakers is not the same [146],
and that training procedures robust to sparse data need to be
used. Speaking rate can be estimated on line [146], or com-
puted from a decoding result using a generic set of acoustic
models, in which case a rescoring is applied for fast or slow
sentences [147]; or the various rate dependent models may be
used simultaneously during decoding [148, 149].
Signal-to-Noise Ratio (SNR) also impacts recognition per-
formances, hence, besides or in addition to noise reduction
techniques, SNR-dependent models have been investigated. In
[150] multiple sets of models are trained according to several
noise masking levels and the model set appropriate for the esti-
mated noise level is selected automatically in recognition phase.
On the contrary, in [151] acoustic models composed under var-
ious SNR conditions are run in parallel during decoding.
Automatic clustering techniques have also been used for
elaborating several models per word for connected-digit recog-
nition [152]. Clustering the trajectories deliver more accurate
modeling for the different groups of speech samples [153]; and
clustering training data at the utterance level provides the best
performances [154].
Multiple modeling of phonetic units may be handled also
through the usual triphone-based modeling approach by incor-
porating questions on some variability sources in the set of
questions used for building the decision trees: gender informa-
tion in [155]; syllable boundary and stress tags in [156]; and
voice characteristics in [157].
When multiple modeling is available, all the available mod-
els may be used simultaneously during decoding, as done in
many approaches, or the most adequate set of acoustic models
may be selected from a priori knowledge (for example network
or gender), or their combination may be handled dynamically
by the decoder. This is the case of parallel hidden Markov mod-
els [158] where the acoustic densities are modulated depending
on the probability of a master context HMM being in certain
states. More recently Dynamic Bayesian Networks have been
used to handle dependencies of the acoustic models with re-
spect to auxiliary variables, such as local speaking rate [159],
or hidden factors related to a clustering of the data [160, 161].
Multiple models can also be used in a parallel decoding
framework [162]; then the final answer results from a ”vot-
ing” process [163], or from the application of elaborated deci-
sion rules that take into account the recognized word hypotheses
[164]. Multiple decoding is also useful for estimating reliable
confidence measures [165].
At the pronunciation level, multiple pronunciations are
generally used for the vocabulary words. Hidden model se-
quences offer a possible way of handling multiple realizations
of phonemes [166] possibly depending on phone context. For
handling hyper-articulated speech where pauses may be inserted
between syllables, ad hoc variants are necessary [161]. And, as
detailed in section 2.2, adding more variants is usually required
for handling foreign accents.
Also, if models of some of the factors affecting speech
variation are known, adaptive training schemes can be devel-
oped, avoiding training data sparsity issues that could result
from cluster-based techniques. This has been used for instance
in the case of VTLN normalization, where a specific estimation
of the vocal tract length (VTL) is associated to each speaker of
the training data [60]. This makes it possible to build “canonical” mod-
els based on appropriately normalized data. During recogni-
tion, a VTL is estimated in order to be able to normalize the
feature stream before recognition. More general normalization
schemes have also been investigated [167], based on associat-
ing transforms (mostly linear transforms) to each speaker, or
more generally, to different cluster of the training data. These
transforms can also be constrained to reside in a reduced-dimensionality eigenspace [130]. A technique for “factoring-
in” selected transformations back in the canonical model is also
proposed in [168], providing a flexible way of building factor-
specific models, for instance multi-speaker models within a par-
ticular noise environment, or multi-environment models for a
particular speaker.
3.2.4. Auxiliary parameters
Most of speech recognition systems rely on acoustic parame-
ters that represent the speech spectrum, for example cepstral
coefficients. However, these features are sensitive to auxiliary
information such as pitch, energy, rate-of-speech, etc. Hence
attempts have been made in taking into account this auxiliary
information in the modeling and in the decoding processes.
Pitch and voicing parameters have been used for a long
time, but mainly for endpoint detection purposes [169] mak-
ing it much more robust in noisy environments [170]. Many
algorithms have been developed and tuned for computing these
parameters, but are out of the scope of this paper.
Concerning speech recognition itself, the simplest way of using such parameters (pitch and/or voicing) is their
direct introduction in the feature vector, along with the cepstral
coefficients, for example periodicity and jitter are used in [171]
for connected digits and large vocabulary. Correlation between
pitch and acoustic features is taken into account in [172], and an LDA is applied to the full set of features (i.e. energy, MFCC,
voicing and pitch) in [173].
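A minimal sketch of this direct augmentation, assuming the cepstral coefficients and the frame-level pitch and voicing values have already been computed (hypothetical arrays), is just a concatenation along the feature axis:

import numpy as np

def augment(mfcc, pitch, voicing):
    """mfcc: (T, D) cepstra; pitch: (T,) F0 estimates (0 on unvoiced
    frames); voicing: (T,) voicing degree in [0, 1]."""
    return np.hstack([mfcc, pitch[:, None], voicing[:, None]])

# Hypothetical example: 100 frames of 13 MFCCs plus the two extra streams
T = 100
feats = augment(np.random.randn(T, 13),
                120.0 * np.random.rand(T),
                np.random.rand(T))
print(feats.shape)                        # (100, 15)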
Pitch has to be taken into account for the recognition of tonal languages. Tone can be modeled separately through specific HMMs [174] or decision trees [175], the pitch parameter can be included in the feature vector [176], or both information streams (acoustic and tonal features) can be handled directly by the decoder, possibly with different optimized weights [177]. Various coding and normalization schemes are generally applied to the pitch parameter to make it less speaker-dependent; the derivative of the pitch is the most useful feature [178], and pitch tracking and voicing are investigated in [179]. A comparison of various modeling approaches is available in [180]. Pitch modeling usually covers the whole syllable; however, limiting the modeling to the main vowel seems sufficient [181].
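One possible coding of the pitch stream along these lines (only a sketch; the exact normalization differs from system to system) removes the speaker's mean log-F0 and adds the derivative, with unvoiced frames filled by interpolation:

import numpy as np

def pitch_features(f0):
    """f0: (T,) fundamental frequency track with 0 on unvoiced frames."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    t = np.arange(len(f0))
    filled = np.interp(t, t[voiced], f0[voiced])   # fill unvoiced gaps
    logf0 = np.log(filled)
    logf0 -= logf0[voiced].mean()                  # remove speaker/utterance mean
    delta = np.gradient(logf0)                     # derivative of the pitch
    return np.stack([logf0, delta], axis=1)        # (T, 2)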
Voicing has been used in the decoder to constrain the Viterbi search: when the characteristics of a phoneme node are not consistent with the voiced/unvoiced nature of the segment, the corresponding paths are not extended, which makes the system more robust to noise [182].
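The sketch below illustrates the idea on a plain Viterbi pass (hypothetical voicing labels per state): frames whose voiced/unvoiced decision contradicts a state's label receive a log-score of minus infinity, so the corresponding paths are never extended.

import numpy as np

def viterbi_voicing(log_emis, log_trans, state_voiced, frame_voiced):
    """log_emis: (T, S) frame log-likelihoods; log_trans: (S, S);
    state_voiced: (S,) bool; frame_voiced: (T,) bool."""
    T, S = log_emis.shape
    scores = log_emis.copy()
    for t in range(T):                        # voicing-consistency pruning
        scores[t, state_voiced != frame_voiced[t]] = -np.inf
    delta = np.empty((T, S))
    back = np.zeros((T, S), dtype=int)
    delta[0] = scores[0]
    for t in range(1, T):
        cand = delta[t - 1][:, None] + log_trans   # predecessor scores
        back[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0) + scores[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]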
Pitch, energy and duration have also been used as prosodic parameters in speech recognition systems, or for reducing ambiguity in post-processing steps. These aspects are beyond the scope of this paper.
Dynamic Bayesian Networks (DBNs) offer an integrated formalism for introducing dependence on auxiliary features. This approach is used in [105] with pitch and energy as auxiliary features. Other information can also be taken into account, such as articulatory information in [183], where the DBN uses an additional variable to represent the state of the articulators. As mentioned in the previous section, speaking rate is another factor that can be handled in such a framework. Most experiments deal with limited vocabulary sizes; an extension to large vocabulary continuous speech recognition is proposed through a hybrid HMM/BN acoustic model in [184].
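In its simplest discrete form, such a dependency amounts to making each state's emission density a mixture over the values of a hidden auxiliary variable, p(x|q) = Σ_a P(a|q) p(x|q,a); a minimal sketch with diagonal Gaussians (hypothetical parameters) is:

import numpy as np

def emission_loglik(x, means, variances, aux_prior):
    """x: (D,) frame; means, variances: (A, D), one Gaussian per value of
    the auxiliary variable; aux_prior: (A,) probabilities P(a | state)."""
    diff = x - means
    log_g = -0.5 * np.sum(diff**2 / variances + np.log(2 * np.pi * variances), axis=1)
    return np.log(np.dot(aux_prior, np.exp(log_g)))   # marginalize over a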
Another approach for handling heterogeneous features is
the TANDEM approach used with pitch, energy or rate of
speech in [185]. The TANDEM approach transforms the in-
put features into posterior probabilities of sub-word units us-
ing artificial neural networks (ANNs), which are then processed
to form input features for conventional speech recognition sys-
tems.
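A minimal sketch of the feature-side processing (the ANN itself is not shown; its frame-level posteriors are assumed to be given) follows one common TANDEM recipe: take the log, remove the mean, and decorrelate with PCA before feeding a conventional GMM-HMM system.

import numpy as np

def tandem_features(posteriors, n_components=20, eps=1e-8):
    """posteriors: (T, K) phone posteriors from an ANN, n_components <= K."""
    x = np.log(posteriors + eps)                       # log posteriors
    x -= x.mean(axis=0)                                # remove the mean
    _, _, vt = np.linalg.svd(x, full_matrices=False)   # PCA via SVD
    return x @ vt[:n_components].T                     # decorrelated TANDEM features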
Finally, auxiliary parameters may be used to normalize the spectral parameters, for example based on the pitch value in [186], or to modify the parameters of the densities during decoding through multiple regressions, as with pitch and speaking rate in [187].
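In the multiple-regression view, each Gaussian mean is shifted by a linear function of an auxiliary vector (e.g. normalized pitch and speaking rate); a minimal sketch, with a hypothetical regression matrix assumed to have been estimated beforehand on training data:

import numpy as np

def regressed_mean(mu, B, z):
    """mu: (D,) baseline mean; B: (D, P) regression matrix; z: (P,)
    auxiliary vector for the current frame or utterance."""
    return mu + B @ z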
4. Conclusion
This paper gathers important literature references related to the endogenous variations of the speech signal and their importance in automatic speech recognition. References addressing specific individual sources of speech variation are surveyed first, covering accent, speaking style, speaker physiology, age and emotions. The paper then gives an overview of general techniques for better handling intrinsic and extrinsic variation sources in ASR, mostly tackling the speech analysis and acoustic modeling aspects.
5. Acknowledgments
This work has been partly supported by the EU 6th Frame-
work Programme, under contract number IST-2002-002034
(DIVINES project). The views expressed here are those of the
authors only. The Community is not liable for any use that may
be made of the information contained therein.
References
[1] F. Nolan, The phonetic bases of speaker recognition.
Cambridge: Cambridge University Press, 1983.
[2] C. Huang, T. Chen, S. Li, E. Chang, and J. Zhou, “Anal-
ysis of speaker variability,” in Proc. of Eurospeech, (Aal-
borg, Denmark), pp. 1377–1380, Sept. 2001.
[3] X. Huang, A. Acero, and H.-W. Hon, Spoken language
processing: a guide to theory, algorithm, and system de-
velopment. New Jersey: Prentice-Hall, 2001.
[4] R. K. Potter and J. C. Steinberg, “Toward the specifica-
tion of speech,” The Journal of the Acoustical Society of
America, vol. 22, pp. 807–820, 1950.
[5] L. C. W. Pols, L. J. T. van der Kamp, and R. Plomp, “Perceptual and physical space of vowel sounds,” The Journal of the Acoustical Society of America, vol. 46, pp. 458–467, 1969.
[6] T. M. Nearey, Phonetic feature systems for vowels.
Bloomington, Indiana, USA: Indiana University Lin-
guistics Club, 1978.
[7] D. O’Shaughnessy, Speech communication - human and
machine. Addison-Wesley, 1987.
[8] S. Umesh, L. Cohen, N. Marinovic, and D. Nel-
son, “Scale transform in speech analysis,” IEEE Trans.
Speech Audio Process., vol. 7, pp. 40–45, Jan. 1999.
[9] A. Mertins and J. Rademacher, “Vocal tract length invari-
ant features for automatic speech recognition,” in Proc.
of ASRU, (Cancun, Mexico), Dec. 2005.
[10] M.-G. Di Benedetto and J.-S. Liénard, “Extrinsic normalization of vowel formant values based on cardinal vowels mapping,” in Proc. of ICSLP, pp. 579–582, 1992.
[11] J. Slifka and T. R. Anderson, “Speaker modification with
LPC pole analysis,” in Proc. of ICASSP, (Detroit, MI),
pp. 644–647, May 1995.
[12] P. Zhan and M. Westphal, “Speaker normalization based on frequency warping,” in Proc. of ICASSP, (Munich, Germany), 1997.
[13] Y. Ono, H. Wakita, and Y. Zhao, “Speaker normaliza-
tion using constrained spectra shifts in auditory filter do-
main,” in Proc. of Eurospeech, pp. 355–358, 1993.
[14] S. Kajarekar, N. Malayath, and H. Hermansky, “Analysis of source of variability in speech,” in Proc. of Eurospeech, (Budapest, Hungary), Sept. 1999.
[15] M. Westphal, “The use of cepstral means in conversa-
tional speech recognition,” in Proc. of Eurospeech, (Rho-
dos, Greece), 1997.
[16] C. Lee, C. Lin, and B. Juang, “A study on speaker adaptation of the parameters of continuous density Hidden Markov Models,” IEEE Trans. Signal Processing, vol. 39, pp. 806–813, April 1991.
[17] C. Leggetter and P. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density Hidden Markov Models,” Computer Speech and Language, vol. 9, pp. 171–185, April 1995.
[18] F. Kubala, A. Anastasakos, J. Makhoul, L. Nguyen, R. Schwartz, and E. Zavaliagkos, “Comparative experiments on large vocabulary speech recognition,” in Proc. of ICASSP, Apr. 1994.
[19] D. V. Compernolle, “Recognizing speech of goats, wolves, sheep and ... non-natives,” Speech Communication, pp. 71–79, Aug. 2001.
[20] D. V. Compernolle, J. Smolders, P. Jaspers, and T. Helle-
mans, “Speaker clustering for dialectic robustness in
speaker independent speech recognition,” in Proc. of Eu-
rospeech, 1991.
[21] J. J. Humphries, P. C. Woodland, and D. Pearce, “Us-
ing accent-specific pronunciation modelling for robust
speech recognition,” in Proc. of ICSLP, 1996.
[22] H. Strik and C. Cucchiarini, “Modeling pronunciation variation for ASR: a survey of the literature,” Speech Communication, pp. 225–246, Nov. 1999.
[23] V. Beattie, S. Edmondson, D. Miller, Y. Patel, and
G. Talvola, “An integrated multidialect speech recogni-
tion system with optional speaker adaptation,” in Proc.
of Eurospeech, 1995.
[24] J. E. Flege, C. Schirru, and I. R. A. MacKay, “Interaction
between the native and second language phonetic subsys-
tems,” in Speech Communication, pp. 467–491, 2003.
[25] D. Jurafsky, W. Ward, Z. Jianping, K. Herold, Y. Xi-
uyang, and Z. Sen, “What kind of pronunciation varia-
tion is hard for triphones to model?,” in Proc. of ICASSP,
(Salt Lake City, Utah), May 2001.
[26] K. Bartkova and D. Jouvet, “Language based phone
model combination for ASR adaptation to foreign accent,”
in Proc. of ICPhS, (San Francisco, USA), Aug. 1999.
[27] W. K. Liu and P. Fung, “MLLR-based accent model
adaptation without accented data,” in Proc. of ICSLP,
(Beijing, China), 2000.
[28] K. Bartkova, “Generating proper name pronunciation variants for automatic speech recognition,” in Proc. of ICPhS, (Barcelona, Spain), 2003.
[29] T. Schultz and A. Waibel, “Language independent and
language adaptive large vocabulary speech recognition,”
in Proc. of ICSLP, 1998.
[30] F. Weng, H. Bratt, L. Neumeyer, and A. Stolcke, “A study of multilingual speech recognition,” in Proc. of Eurospeech, (Rhodes, Greece), 1997.
[31] U. Uebler, “Multilingual speech recognition in seven lan-
guages,” in Speech Communication, pp. 53–69, Aug.
2001.
[32] L. Arslan and J. Hansen, “Language accent classification
in american english,” Speech Communication, vol. 18,
no. 4, pp. 353–367, 1996.
[33] H. Franco, L. Neumeyer, V. Digalakis, and O. Ronen, “Combination of machine scores for automatic grading of pronunciation quality,” Speech Communication, pp. 121–130, Feb. 2000.
[34] N. Mirghafori, E. Fosler, and N. Morgan, “Towards robustness to fast speech in ASR,” in Proc. of ICASSP, (Atlanta, Georgia), pp. 335–338, May 1996.
[35] M. Richardson, M. Hwang, A. Acero, and X. D. Huang, “Improvements on speech recognition for fast talkers,” in Proc. of Eurospeech, (Budapest, Hungary), Sept. 1999.
[36] E. Fosler-Lussier and N. Morgan, “Effects of speak-
ing rate and word predictability on conversational pro-
nunciations,” Speech Communication, vol. 29, no. 2-4,
pp. 137–158, 1999.
[37] A. Bell, D. Jurafsky, E. Fosler-Lussier, C. Girand,
M. Gregory, and D. Gildea, “Effects of disfluencies, pre-
dictability, and utterance position on word form variation
in English conversation,” The Journal of the Acoustical
Society of America, vol. 113, pp. 1001–1024, Feb. 2003.
[38] J.-C. Junqua, “The Lombard reflex and its role on hu-
man listeners and automatic speech recognisers,” JASA,
vol. 93, pp. 510–524, Jan. 1993.
[39] S. Greenberg and S. Chang, “Linguistic dissection of
switchboard-corpus automatic speech recognition sys-
tems,” in Proc. of ISCA Workshop on Automatic Speech
Recognition: Challenges for the New Millenium, (Paris,
France), Sept. 2000.
[40] M. Adda-Decker, P. B. de Mareuil, G. Adda, and
L. Lamel, “Investigating syllabic structures and their
variation in spontaneous French,” Speech Communica-
tion, vol. 46, pp. 119–139, June 2005.
[41] S. Furui, M. B. J. Hirschberg, S. Itahashi, T. Kawahara,
S. Nakamura, and S. Narayanan, “Introduction to the
special issue on spontaneous speech processing,” IEEE
Trans. Speech Audio Process., vol. 12, pp. 349–350, July
2004.
[42] W. Byrne, D. Doermann, M. Franz, S. Gustman, J. Ha-
jic, D. Oard, M. Picheny, J. Psutka, B. Ramabhadran,
D. Soergel, T. Ward, and Z. Wei-Jin, “Automatic recognition of spontaneous speech for access to multilingual oral history archives,” IEEE Trans. Speech Audio Process., vol. 12, pp. 420–435, July 2004.
[43] D. Elenius and M. Blomberg, “Comparing speech
recognition for adults and children,” in Proceedings of
FONETIK 2004, (Stockholm, Sweden), 2004.
[44] G. Potamianos, S. Narayanan, and S. Lee, “Analysis of children speech: duration, pitch and formants,” in Proc. of Eurospeech, (Rhodes, Greece), Sept. 1997.
[45] S. Lee, A. Potamianos, and S. Narayanan, “Acoustics of children speech: developmental changes of temporal and spectral parameters,” The Journal of the Acoustical Society of America, vol. 105, pp. 1455–1468, Mar. 1999.
[46] S. Schötz, “A perceptual study of speaker age,” in Working Papers 49, pp. 136–139, Lund University, Dept. of Linguistics, Nov. 2001.
[47] G. P. M. Eskenazi, “Pinpointing pronunciation errors in
children speech: examining the role of the speech rec-
ognizer,” in Proceedings of the PMLA Workshop, (Col-
orado, USA), Sept. 2002.
[48] D. Giuliani and M. Gerosa, “Investigating recognition of
children speech,” in Proc. of ICASSP, (Hong Kong), Apr.
2003.
[49] S. Das, D. Nix, and M. Picheny, “Improvements in
children speech recognition performance,” in Proc. of
ICASSP, vol. 1, (Seattle, USA), pp. 433–436, May 1998.
[50] L. Lee and R. C. Rose, “Speaker normalization using efficient frequency warping procedures,” in Proc. of ICASSP, vol. 1, (Atlanta, Georgia), pp. 353–356, May 1996.
[51] J. G. Wilpon and C. N. Jacobsen, “A study of speech recognition for children and the elderly,” in Proc. of ICASSP, vol. 1, (Atlanta, Georgia), pp. 349–352, May 1996.
[52] M. Eskenazi, “Kids: a database of children’s speech,” The Journal of the Acoustical Society of America, vol. 100, no. 4, pt. 2, Dec. 1996.
[53] K. Shobaki, J.-P. Hosom, and R. Cole, “The OGI kids’ speech corpus and recognizers,” in Proc. of ICSLP, (Beijing, China), Oct. 2000.
[54] M. Blomberg and D. Elenius, “Collection and recognition of children speech in the PF-STAR project,” in Proc. of Fonetik 2003, Umeå University, Department of Philosophy and Linguistics, PHONUM, 2003.
[55] J. H. L. Hansen, “Analysis and compensation of speech
under stress and noise for environmental robustness in
speech recognition,” Speech Communication, vol. 20,
1996.
[56] R. P. Lippmann, E. Martin, and D. Paul, “Multi-style training for robust isolated-word speech recognition,” in Proc. of ICASSP, 1987.
[57] S. E. Bou-Ghazale and J. H. L. Hansen, “Improving recognition and synthesis of stressed speech via feature perturbation in a source generator framework,” in ESCA-NATO Proc. Speech Under Stress Workshop, Lisbon, Portugal, 1995.
[58] B. A. Hanson and T. Applebaum, “Robust speaker-
independent word recognition using instantaneous, dy-
namic and acceleration features: experiments with Lom-
bard and noisy speech,” in Proc. of ICASSP, 1990.
[59] B. Carlson and M. Clements, “Speech recognition in
noise using a projection-based likelihood measure for
mixture density HMMs,” in Proc. of ICASSP, 1992.
[60] L. Welling, H. Ney, and S. Kanthak, “Speaker adaptive training by vocal tract normalization,” IEEE Trans. Speech Audio Process., vol. 10, pp. 415–426, Sept. 2002.
[61] S. B. Davis and P. Mermelstein, “Comparison of para-
metric representations for monosyllabic word recogni-
tion in continuously spoken sentences,” IEEE Trans.
Acoust. Speech Signal Process., vol. 28, pp. 357–366,
August 1980.
[62] H. Hermansky, “Perceptual linear predictive (PLP) anal-
ysis of speech,” The Journal of the Acoustical Society of
America, vol. 87, pp. 1738–1752, Apr. 1990.
[63] L. Rabiner and B. H. Juang, Fundamentals of speech
recognition, ch. 2, pp. 20–37. Englewood Cliffs, NJ,
USA: Prentice Hall PTR, 1993.
[64] S. Haykin, Adaptive filter theory. Prentice-Hall Publish-
ers, N.J., USA., 1993.
[65] V. Tyagi, C. Wellekens, and H. Bourlard, “On variable-
scale piecewise stationary spectral analysis of speech
signals for ASR,” in Proc. of Eurospeech, (Lisbon, Por-
tugal), September 2005.
[66] T. Svendsen, “On the automatic segmentation of speech
signals,” in Proc. of ICASSP, 1987.
[67] T. Svendsen, K. K. Paliwal, E. Harborg, and P. O. Husoy, “An improved sub-word based speech recognizer,” in Proc. of ICASSP, 1989.
[68] R. André-Obrecht, “A new statistical approach for the automatic segmentation of continuous speech signals,” IEEE Trans. Acoust. Speech Signal Process., vol. 36, January 1988.
[69] B. Atal, “Efficient coding of LPC parameters by tempo-
ral decomposition,” in Proc. of ICASSP, (Boston, USA),
1983.
[70] R. R. Coifman and M. V. Wickerhauser, “Entropy based
algorithms for best basis selection,” IEEE Trans. on In-
formation Theory, vol. 38, March 1992.
[71] K. Achan, S. Roweis, A. Hertzmann, and B. Frey, “A segmental HMM for speech waveforms,” UTML Technical Report 2004-001, University of Toronto, Toronto, Canada, 2004.
[72] H. Hermansky and S. Sharma, “TRAPS: classifiers of
temporal patterns,” in Proc. of ICSLP, (Sydney, Aus-
tralia), pp. 1003–1006, 1998.
[73] V. Tyagi, I. McCowan, H. Bourlard, and H. Misra, “Mel-cepstrum modulation spectrum (MCMS) features for robust ASR,” in Proc. of ASRU, (St. Thomas, US Virgin Islands), 2003.
[74] B. Kingsbury, N. Morgan, and S. Greenberg, “Robust
speech recognition using the modulation spectrogram,”
Speech Communication, vol. 25, August 1998.
[75] Q. Zhu and A. Alwan, “AM-demodulation of speech
spectra and its application to noise robust speech recog-
nition,” in Proc. of ICSLP, vol. 1, pp. 341–344, 2000.
[76] B. P. Milner, “Inclusion of temporal information into fea-
tures for speech recognition,” in Proc. of ICSLP, 1996.
[77] S. Haykin, Communication systems. New York, USA:
John Wiley and Sons, 3 ed., 1994.
[78] V. Tyagi and C. Wellekens, “Fepstrum representation of
speech,” in Proc. of ASRU, (Cancun, Mexico), December
2005.
[79] S. Schimmel and L. Atlas, “Coherent envelope detection
for modulation filtering of speech,” in Proc. of ICASSP,
(Philadephia, USA), 2005.
[80] J. Makhoul, “Linear prediction: a tutorial review,” Pro-
ceedings of IEEE, vol. 63, April 1975.
[81] R. Kumaresan and A. Rao, “Model-based approach to
envelope and positive instantaneous frequency estima-
tion of signals with speech applications,” J. Acoust. Soc.
Am, vol. 105, March 1999.
[82] R. Kumaresan, “An inverse signal approach to computing the envelope of a real valued signal,” IEEE Signal Processing Letters, vol. 5, October 1998.
[83] M. Athineos and D. Ellis, “Frequency domain linear pre-
diction for temporal features,” in Proc. of ASRU, (St.
Thomas,US Virgin Islands, USA), December 2003.
[84] J.-C. Junqua and J.-P. Haton, Robustness in automatic
speech recognition. Kluwer, 1996.
[85] S. Boll, “Suppression of acoustic noise in speech us-
ing spectral subtraction,” IEEE Trans. ASSP, vol. 27(2),
1979.
[86] M. Berouti, R. Schwartz, and J. Makhoul, “Enhancement
of speech corrupted by acoustic noise,” in Proc. of
ICASSP, 1979.
[87] P. Lockwood and J. Boudy, “Experiments with a non-
linear spectral subtractor (NSS), hidden Markov models
and the projection, for robust speech recognition in cars,”
Speech Communication, vol. 11, 1992.
[88] V. Tyagi and C. J. Wellekens, “Least squares filtering of speech signals for robust ASR,” in Proc. of MLMI, 2005.
[89] Y. Ephraim, “Statistical-model-based speech enhance-
ment systems,” Proc. IEEE, vol. 80(10), 1992.
[90] H. Liao and M. Gales, “Joint uncertainty decoding for noise robust speech recognition,” in Proc. of Interspeech, 2005.
[91] L. Deng, J. Droppo, and A. Acero, “Recursive estimation of non-stationary noise using iterative stochastic approximation for robust speech recognition,” IEEE Trans. SAP, vol. 11, 2003.
[92] L. Deng, J. Droppo, and A. Acero, “Dynamic compen-
sation of hmm variances using the feature enhancement
uncertainty computed from a parametric model of speech
distortion,” IEEE Trans. SAP, vol. 13, 2005.
[93] T. Kristjansson, Speech Recognition in adverse environ-
ments: a probabilistic approach. PhD thesis, University
of Waterloo, Canada, 2002.
[94] L. Fissore, P. Laface, G. Micca, and G. Sperto, “Channel adaptation for a continuous speech recognizer,” in Proc. of ICSLP, 1992.
[95] A. Anastasakos, F. Kubala, J. Makhoul, and R. Schwartz,
“Adaptation to new microphones using tied-mixture nor-
malization,” in Proc. of ICASSP, 1994.
[96] M. K. Omar and M. Hasegawa-Johnson, “Maximum mu-
tual information based acoustic features representation
of phonological features for speech recognition,” in Proc.
of ICASSP, (Montreal, Canada), 2002.
[97] M. K. Omar, K. Chena, M. Hasegawa-Johnson, and
Y. Bradman, “An evaluation on using mutual informa-
tion for selection of acoustic features representation of
phonemes for speech recognition,” in Proc. of ICSLP,
(Denver, CO), pp. 2129–2132, 2002.
[98] A. Zolnay, R. Schlüter, and H. Ney, “Acoustic feature combination for robust speech recognition,” in Proc. of ICASSP, vol. I, (Philadelphia, PA), pp. 457–460, 2005.
[99] H. Bourlard and S. Dupont, “Sub-band based speech
recognition,” in Proc. of ICASSP, (Munich Germany),
pp. 1251–1254, April, 1997.
[100] S. Tibrewala and H. Hermansky, “Sub-band based recog-
nition of noisy speech,” in Proc. of ICASSP, (Munich
Germany), pp. 1255–1258, 1997.
[101] M. J. Tomlinson, M. J. Russell, R. K. Moore, A. P.
Buckland, and M. A. Fawley, “Modelling asynchrony in
speech using elementary single-signal decomposition,”
in Proc. of ICASSP, (Munich Germany), pp. 1247–1250,
1997.
[102] D. L. Thomson and R. Chengalvarayan, “Use of period-
icity and jitter as speech recognition feature,” in Proc. of
ICASSP, vol. 1, (Seattle, WA), pp. 21–24, 1998.
[103] A. Zolnay, R. Schlüter, and H. Ney, “Robust speech recognition using a voiced-unvoiced feature,” in Proc. of ICSLP, vol. 2, (Denver, CO), pp. 1065–1068, 2002.
[104] M. Graciarena, H. Franco, J. Zheng, D. Vergyri, and A. Stolcke, “Voicing feature integration in SRI’s DECIPHER LVCSR system,” in Proc. of ICASSP, (Montreal,
Canada), 2004.
[105] T. A. Stephenson, M. M. Doss, and H. Bourlard, “Speech
recognition with auxiliary information,” IEEE Trans.
Speech Audio Process., vol. SAP-12, no. 3, pp. 189–203,
2004.
[106] D. Ellis, R. Singh, and S. Sivadas, “Tandem acoustic modeling in large-vocabulary recognition,” in Proc. of ICASSP, (Salt Lake City, USA), May 2001.
[107] Q. Zhu, B. Chen, N. Morgan, and A. Stolcke, “On us-
ing MLP features in LVCSR,” in Proc. of ICSLP, (Jeju
Island, Korea), 2004.
[108] N. Morgan, B. Chen, Q. Zhu, and A. Stolcke, “TRAP-
ping conversational speech: extending trap/tandem ap-
proaches to conversational telephone speech recogni-
tion,” in Proc. of ICASSP, (Montreal, Canada), 2004.
[109] M. Kleinschmidt and D. Gelbart, “Improving word accuracy with Gabor feature extraction,” in Proc. of ICSLP, (Denver, Colorado), pp. 25–28, 2002.
[110] E. Eide, “Distinctive features for use in automatic speech
recognition,” in Proc. of Eurospeech, (Aalborg, Den-
mark), pp. 1613–1616, Sept. 2001.
[111] D. Litman, J. Hirschberg, and M. Swerts, “Prosodic and
other cues to speech recognition failures,” Speech Com-
munication, vol. 43, no. 1-2, pp. 155–175, 2004.
[112] R. Haeb-Umbach and H. Ney, “Linear discriminant anal-
ysis for improved large vocabulary continuous speech
recognition,” in Proc. of ICASSP, (San Francisco),
pp. 13–16, 1992.
[113] N. Kumar and A. G. Andreou, “Heteroscedastic dis-
criminant analysis and reduced rank HMMs for improved
speech recognition,” Speech Communication, vol. 26,
no. 4, pp. 283–297, 1998.
[114] R. Schlüter, W. Macherey, B. Müller, and H. Ney, “Com-
parison of discriminative training criteria and optimiza-
tion methods for speech recognition,” Speech Communi-
cation, vol. 34, no. 3, pp. 287–310, 2001.
[115] K. Fukunaga, Introduction to statistical pattern recogni-
tion. New York: Academic Press, 1972.
[116] R. O. Duda and P. E. Hart, Pattern classification and
scene analysis. New York: Wiley, 1973.
[117] M. Loog and R. P. W. Duin, “Linear dimensionality reduction via a heteroscedastic extension of LDA: The Chernoff criterion,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, pp. 732–739, June 2004.
[118] R. A. Gopinath, “Maximum likelihood modeling with Gaussian distributions for classification,” in Proc. of ICASSP, 1998.
[119] M. J. F. Gales, “Semi-tied covariance matrices,” in Proc. of ICASSP, 1998.
[120] G. Saon, M. Padmanabhan, R. Gopinath, and S. Chen, “Maximum likelihood discriminant feature spaces,” in Proc. of ICASSP, pp. 1129–1132, June 2000.
[121] B. Zhang and S. Matsoukas, “Minimum phoneme er-
ror based heteroscedastic linear discriminant analysis for
speech recognition,” in Proc. of ICASSP, vol. 1, pp. 925–
928, Mar. 2005.
[122] M. Padmanabhan and S. Dharanipragada, “Maximizing information content in feature extraction,” IEEE Trans. Speech Audio Process., vol. 13, pp. 512–519, July 2005.
[123] Y. H. Abdel-Haleem, S. Renals, and N. D. Lawrence,
“Acoustic space dimensionality selection and combina-
tion using the maximum entropy principle,” in Proc. of
ICASSP, (Montreal, Canada), May 2004.
[124] R. Moore, “Signal decomposition using Markov model-
ing techniques,” Tech. Rep. Memo no 3931, Royal Signal
and Radar Establishment, Malvern, Worcs, UK, 1986.
[125] A. Varga and R. K. Moore, “Hidden Markov decomposition of speech and noise,” in Proc. of ICASSP, 1990.
[126] M. Gales and S. Young, “An improved approach to the hidden Markov model decomposition of speech and noise,” in Proc. of ICASSP, 1992.
[127] M. J. F. Gales and S. Young, “Cepstral parameter compensation for HMM recognition,” Speech Communication, vol. 12(3), 1993.
[128] M. Gales, “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models,” Computer Speech and Language, vol. 9, 1998.
[129] P. Nguyen, R. Kuhn, J.-C. Junqua, N. Niedzielski,
and C. Wellekens, “Eigenvoices: a compact representation of speakers in a model space,” Annales des Télécommunications, vol. 55, March-April 2000.
[130] M. J. F. Gales, “Cluster adaptive training for speech
recognition,” in Proc. of ICSLP, pp. 1783–1786, 1998.
[131] D. K. Kim and N. S. Kim, “Rapid online adaptation using speaker space model evolution,” Speech Communication, vol. 42, pp. 467–478, Apr. 2004.
[132] X. Wu and Y. Yan, “Speaker adaptation using con-
strained transformation,” IEEE Trans. Speech Audio Pro-
cess., vol. 12, pp. 168–174, Mar. 2004.
[133] B. Mak and R. Hsiao, “Improving eigenspace-based mllr
adaptation by kernel PCA,” in Proc. of ICSLP, (Jeju Is-
land, Korea), Sept. 2004.
[134] B. Zhou and J. Hansen, “Rapid discriminative acous-
tic model based on eigenspace mapping for fast speaker
adaptation,” IEEE Trans. Speech Audio Process., vol. 13,
pp. 554–564, July 2005.
[135] P. Kenny, G. Boulianne, and P. Dumouchel, “Eigenvoice
modeling with sparse training data,” IEEE Trans. Speech
Audio Process., vol. 13, pp. 345–354, May 2005.
[136] S. Tsakalidis, V. Doumpiotis, and W. Byrne, “Discrim-
inative linear transforms for feature normalization and
speaker adaptation in HMM estimation,” IEEE Trans.
Speech Audio Process., vol. 13, pp. 367–376, May 2005.
[137] Y. Tsao, S.-M. Lee, and L.-S. Lee, “Segmental eigen-
voice with delicate eigenspace for improved speaker
adaptation,” IEEE Trans. Speech Audio Process., vol. 13,
pp. 399–411, May 2005.
[138] M. Padmanabhan and S. Dharanipragada, “Maximum-
likelihood nonlinear transformation for acoustic adap-
tation,” IEEE Trans. Speech Audio Process., vol. 12,
pp. 572–578, Nov. 2004.
[139] C. Mokbel, L. Mauuary, L. Karray, D. Jouvet, J. Monné, J. Simonin, and K. Bartkova, “Towards improving ASR robustness for PSN and GSM telephone applications,” Speech Communication, pp. 141–159, Oct. 1997.
[140] S. Das, D. Lubensky, and C. Wu, “Towards robust speech
recognition in the telephony network environment - cel-
lular and landline conditions,” in Proc. of Eurospeech,
pp. 1959–1962, 1999.
[141] Y. Konig and N. Morgan, “GDNN: a gender-dependent neural network for continuous speech recognition,” in Proc. of Int. Joint Conf. on Neural Networks, pp. 332–337, June 1992.
[142] P. C. Woodland, J. J. Odell, V. Valtchev, and S. Young, “Large vocabulary continuous speech recognition using HTK,” in Proc. of ICASSP, pp. 125–128, Apr. 1994.
[143] C.-H. Lee and J.-L. Gauvain, “Speaker adaptation based
on MAP estimation of HMM parameters,” in Proc. of
ICASSP, pp. 558–561, Apr. 1993.
[144] S. Gupta, F. Soong, and R. Haimi-Cohen, “High-
accuracy connected digit recognition for mobile appli-
cations,” in Proc. of ICASSP, pp. 57–60, May 1996.
[145] S. M. D’Arcy, L. P. Wong, and M. J. Russell, “Recognition of read and spontaneous children’s speech using two new corpora,” in Proc. of ICSLP, (Jeju Island, Korea), Sept. 2004.
[146] T. Pfau and G. Ruske, “Creating Hidden Markov Models
for fast speech,” in Proc. of ICSLP, p. 0255, 1998.
[147] H. Nanjo and T. Kawahara, “Speaking-rate dependent
decoding and adaptation for spontaneous lecture speech
recognition,” in Proc. of ICASSP, pp. 725–728, 2002.
[148] C. Chesta, P. Laface, and F. Ravera, “Connected digit
recognition using short and long duration models,” in
Proc. of ICASSP, pp. 557–560, Mar. 1999.
[149] J. Zheng, H. Franco, and A. Stolcke, “Effective acoustic modeling for rate-of-speech variation in large vocabulary conversational speech recognition,” in Proc. of ICSLP, (Jeju Island, Korea), pp. 401–404, Sept. 2004.
[150] M. G. Song, H. Jung, K.-J. Shim, and H. S. Kim, “Speech recognition in car noise environments using multiple models according to noise masking levels,” in Proc. of ICSLP, p. 1065, 1998.
[151] S. Sakauchi, Y. Yamaguchi, S. Takahashi, and S. Kobashikawa, “Robust speech recognition based on HMM composition and modified Wiener filter,” in Proc. of Interspeech, (Jeju Island, Korea), pp. 2053–2056, 2004.
[152] L. Rabiner, C. Lee, B. Juang, and J. Wilpon, “HMM
clustering for connected word recognition,” in Proc. of
ICASSP, pp. 405–408, May 1989.
[153] F. Korkmazskiy, B.-H. Juang, and F. Soong, “General-
ized mixture of HMMs for continuous speech recogni-
tion,” in Proc. of ICASSP, pp. 1443–1446, Apr. 1997.
[154] T. Shinozaki and S. Furui, “Spontaneous speech recognition using a massively parallel decoder,” in Proc. of ICSLP, (Jeju Island, Korea), pp. 1705–1708, Sept. 2004.
[155] C. Neti and S. Roukos, “Phone-context specific gender-
dependent acoustic-models for continuous speech recog-
nition,” in Proc. of ASRU, pp. 192–198, Dec. 1997.
[156] D. Paul, “Extensions to phone-state decision-tree clus-
tering: single tree and tagged clustering,” in Proc. of
ICASSP, pp. 1487–1490, Apr. 1997.
[157] H. Suzuki, H. Zen, Y. Nankaku, C. Miyajima, K. Tokuda,
and T. Kitamura, “Speech recognition using voice-
characteristic-dependent acoustic models,” in Proc. of
ICASSP, pp. 740–743, Apr. 2003.
[158] F. Brugnara, R. D. Mori, D. Giuliani, and M. Omologo,
“A family of parallel Hidden Markov Models,” in Proc.
of ICASSP, pp. 377–380, Mar. 1992.
[159] T. Shinozaki and S. Furui, “Hidden mode HMM using
bayesian network for modeling speaking rate fluctua-
tion,” in Proc. of ASRU, (US Virgin Islands), pp. 417–
422, Dec. 2003.
[160] F. Korkmazsky, M. Deviren, D. Fohr, and I. Illina, “Hid-
den factor dynamic bayesian networks for speech recog-
nition,” in Proc. of ICSLP, (Jeju Island, Korea), Sept.
2004.
[161] S. Matsuda, T. Jitsuhiro, K. Markov, and S. Nakamura,
“Speech recognition system robust to noise and speaking
styles,” in Proc. of ICSLP, (Jeju Island, Korea), Sept.
2004.
[162] Y. Zhang, C. Desilva, A. Togneri, M. Alder, and Y. At-
tikiouzel, “Speaker-independent isolated word recogni-
tion using multiple Hidden Markov Models,” in Proc.
IEE Vision, Image and Signal Processing, pp. 197–202,
June 1994.
[163] J. Fiscus, “A post-processing system to yield reduced
word error rates: Recognizer Output Voting Error Re-
duction (ROVER),” in Proc. of ASRU, pp. 347–354, Dec.
1997.
[164] L. Barrault, R. de Mori, R. Gemello, F. Mana, and
D. Matrouf, “Variability of automatic speech recognition
systems using different features,” in Proc. of Interspeech,
(Lisboa, Portugal), pp. 221–224, 2005.
[165] T. Utsuro, T. Harada, H. Nishizaki, and S. Nakagawa, “A
confidence measure based on agreement among multi-
ple LVCSR models - correlation between pair of acoustic
models and confidence,” in Proc. of ICSLP, pp. 701–704,
2002.
[166] T. Hain and P. C. Woodland, “Dynamic HMM selec-
tion for continuous speech recognition,” in Proc. of Eu-
rospeech, pp. 1327–1330, 1999.
[167] M. J. F. Gales, “Multiple-cluster adaptive training
schemes,” in Proc. of ICASSP, 2001.
[168] M. J. F. Gales, “Acoustic factorization,” in Proc. of ASRU, 2001.
[169] B. Atal and L. Rabiner, “A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition,” IEEE Trans. on Acoustics, Speech, and Signal Processing, pp. 201–212, June 1976.
[170] A. Martin and L. Mauuary, “Voicing parameter and energy-based speech/non-speech detection for speech recognition in adverse conditions,” in Proc. of Eurospeech, (Geneva, Switzerland), pp. 3069–3072, Sept. 2003.
[171] D. L. Thomson and R. Chengalvarayan, “Use of voicing
features in HMM-based speech recognition,” in Speech
Communication, pp. 197–211, July 2002.
[172] N. Kitaoka, D. Yamada, and S. Nakagawa, “Speaker in-
dependent speech recognition using features based on
glottal sound source,” in Proc. of ICSLP, (Denver, USA),
pp. 2125–2128, Sept. 2002.
[173] A. Ljolje, “Speech recognition using fundamental frequency and voicing in acoustic modeling,” in Proc. of ICSLP, (Denver, USA), pp. 2137–2140, Sept. 2002.
[174] W.-J. Yang, J.-C. Lee, Y.-C. Chang, and H.-C. Wang,
“Hidden Markov Model for Mandarin lexical tone recog-
nition,” in IEEE Trans. on Acoustics, Speech, and Signal
Processing, pp. 988–992, July 1988.
[175] P.-F. Wong and M.-H. Siu, “Decision tree based tone
modeling for Chinese speech recognition,” in Proc. of
ICASSP, pp. 905–908, May 2004.
[176] C. J. Chen, R. A. Gopinath, M. D. Monkowski, M. A.
Picheny, and K. Shen, “New methods in continuous
Mandarin speech recognition,” in Proc. of Eurospeech,
pp. 1543–1546, 1997.
[177] Y. Y. Shi, J. Liu, and R. Liu, “Discriminative HMM
stream model for Mandarin digit string speech recog-
nition,” in Proc. of Int. Conf. on Signal Processing,
pp. 528–531, Aug. 2002.
[178] S. Liu, S. Doyle, A. Morris, and F. Ehsani, “The effect of fundamental frequency on Mandarin speech recogni-
tion,” in Proc. of ICSLP, (Sydney, Australia), pp. 2647–
2650, Nov/Dec 1998.
[179] H. C.-H. Huang and F. Seide, “Pitch tracking and tone
features for Mandarin speech recognition,” in Proc. of
ICASSP, pp. 1523–1526, June 2000.
[180] T. Demeechai and K. Mäkeläinen, “Recognition of syllables in a tone language,” Speech Communication,
pp. 241–254, Feb. 2001.
[181] C. Chen, H. Li, L. Shen, and G. Fu, “Recognize tone
languages using pitch information on the main vowel of
each syllable,” in Proc. of ICASSP, pp. 61–64, May 2001.
[182] D. O’Shaughnessy and H. Tolba, “Towards a robust/fast
continuous speech recognition system using a voiced-
unvoiced decision,” in Proc. of ICASSP, pp. 413–416,
Mar. 1999.
[183] T. A. Stephenson, H. Bourlard, S. Bengio, and A. C.
Morris, “Automatic speech recognition using dynamic
Bayesian networks with both acoustic and articulatory
variables,” in Proc. of ICSLP, (Beijing, China), pp. 951–
954, Oct. 2000.
[184] K. Markov and S. Nakamura, “Hybrid HMM/BN
LVCSR system integrating multiple acoustic features,”
in Proc. of ICASSP, pp. 840–843, Apr. 2003.
[185] M. Magimai-Doss, T. A. Stephenson, S. Ikbal, and
H. Bourlard, “Modelling auxiliary features in tandem
systems,” in Proc. of ICSLP, (Jeju Island, Korea), Sept.
2004.
[186] H. Singer and S. Sagayama, “Pitch dependent phone
modelling for HMM based speech recognition,” in Proc.
of ICASSP, pp. 273–276, Mar. 1992.
[187] K. Fujinaga, M. Nakai, H. Shimodaira, and S. Sagayama,
“Multiple-regression Hidden Markov Model,” in Proc. of
ICASSP, (Salt Lake City, USA), pp. 513–516, May 2001.