ROLE OF NUCLEUS BASED CONTEXT IN WORD-INDEPENDENT SYLLABLE STRESS
CLASSIFICATION
Harish Doddala, Om D Deshmukh, Ashish Verma
IBM Research - India
ABSTRACT
An acoustic-phonetics based word-independent technique which
uses syllable context for classifying the lexical syllable stress of
spoken English words is presented. Nucleus based clustering is re-
markably successful in moving from word-dependent syllable stress
classification which is intrinsically not scalable to word-independent
classification. This however is not possible without an inherent drop
in accuracy due to the loss of important contextual information of
the syllables. An approach based on incorporating the left and the
right context-ID of the syllable nucleus is proposed which results in
a 10% improvement in word-level accuracy for word-independent
syllable stress classification. The proposed approach exhibits perfor-
mances comparable to that of the best performing word-dependent
classifiers without suffering from the latter’s scalability issues. A
7% improvement in the syllable level accuracy is also reported.
Index Terms— Lexical Syllable Stress, Speech Analysis, Lan-
guage Learning, Context
1. INTRODUCTION
Syllable stress plays an important role in efficient spoken communication
in English, producing intelligible and natural sounding speech.
Learning to pronounce words with the correct stress pattern is
particularly important as the meaning of the word can change depending
on which of its constituent syllables is stressed (e.g., address, content
and so on).
Several different word-independent syllable stress classification
techniques have been proposed in [1, 2, 3, 4]. The system proposed
in [2] poses the problem as a two-class classification problem.
The authors in [3] proposed a Hidden Markov Model (HMM) based
technique to detect the degree of stress of syllables in phrases,
concluding that a two-stage classification (stressed/unstressed followed
by primary/secondary stress classification) is more accurate than a
single-stage three-way classification (primary vs. secondary vs. no
stress). Word-independent methods, with the exception of [3], em-
ploy a single classifier to determine if a syllable is stressed or not.
To train the classifier, all the stressed syllables in the training data
are grouped in one class and all the unstressed syllables are grouped
in the other. The suboptimalities of such a generic grouping are
discussed in [1]. It was shown in [5] that word-dependent techniques,
although not scalable, lead to a better performance as compared to
the word-independent techniques.
It can be concluded from [2, 6, 3, 4] that duration, energy and
fundamental frequency of a syllable are the three basic features in
distinguishing stressed syllables from unstressed syllables. The nucleus
level clustering proposed in [1] uses an acoustic-phonetics motivated
grouping of syllables and trains separate classifiers for each syllable
nucleus to improve the syllable stress classification accuracy.
Differences in acoustic manifestations of various syllable nuclei based on
the average energy and duration of the nucleus phone, and the acoustic
differences between the stressed and the unstressed instances of
the corresponding syllable nuclei, were used to group the syllables.
This approach is hereafter referred to in the paper as the
acoustic-phonetics based approach or AP based approach.
Section 2 describes the AP based clustering of [1] and its short-
comings. The proposed technique extends the AP based approach
by incorporating the nucleus-ID of the contexts, in addition to the
acoustic-phonetic features of the syllable nucleus, for robust syllable
stress classification. This acoustic-phonetics context based modelling,
which relies on the context-ID of the nucleus, is hereafter referred to as
the CAP based approach. The CAP modelling is proposed in Section
3. The syllable-level and word-level performances of the CAP techniques
are evaluated on adult and child data in Section 4.
2. NUCLEUS LEVEL CLUSTERING BASED ON
ACOUSTIC-PHONETICS
2.1. Shortcomings of generic grouping of syllables
In the training phase of the syllable stress classifiers in [6, 2] etc,
acoustic features from all the stressed syllables across all the words
are grouped in one class and features from all the unstressed
syllables are grouped in the other class to train a syllable-level
stressed/unstressed classifier.
Such a generic grouping, referred to as the 1-C technique in [1],
leads to suboptimal performance because the acoustic cues of stress
overlap across nucleus types: stressed low-vowels tend to have the same
energy as unstressed high-vowels, and stressed lax-vowels are said to
have the same duration as unstressed tense vowels. It was further
observed in [1] that (a) combining the syllable-level features across
syllables with different types of nuclei reduces the stress
discriminability of these features and (b) the relative significance
of the features depends on the type of the syllable nucleus.
2.2. Acoustic-phonetics based grouping of syllables
In [1] the syllables are grouped based on their acoustic similarities
into separate groups with a classifier trained for each such group.
Grouping was based on the average energy and duration of the vari-
ous syllable nuclei in their correctly stressed pronunciations. Sylla-
bles were grouped into 3 and 5 clusters referred to as 3-C and 5-C
respectively as shown in Table 1. A separate group was also formed
for each syllable nucleus leading to 12 groups. This case was re-
ferred to as ind-C. (AP based ind-C)
A separate classifier was trained for each group using a Decision
Tree based classifier to distinguish stressed syllables from their
unstressed counterparts in the same cluster. In the testing phase, the
group for each syllable of the word is identified based on the
syllable's nucleus. The classifier trained for that particular group is then
used to decide whether the syllable is stressed or unstressed. Once all
the individual syllables of the spoken word are classified, the word
is classified as correctly stressed if only one syllable is detected as
stressed and that syllable corresponds to the primary stressed syllable
in the correct pronunciation. Syllable-level accuracy of 90.9%
was reported on adult data for the ind-C case. Word-level accuracy
of 76.9% was reported for the ind-C case as opposed to a
word-dependent accuracy of 85%.

3-Cluster   (1) High-vowel nuclei
            (2) Non-high-vowel nuclei
            (3) Diphthong nuclei
5-Cluster   (1) High, lax-vowel nuclei
            (2) High, tense-vowel nuclei
            (3) Non-high, lax-vowel nuclei
            (4) Non-high, tense-vowel nuclei
            (5) Diphthong nuclei
Table 1. Grouping criteria used to form 3 and 5 clusters of syllables
with acoustically similar nuclei.

978-1-4577-0539-7/11/$26.00 ©2011 IEEE, ICASSP 2011
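
The test-time procedure described above (pick the classifier of the syllable's group, classify every syllable, then apply the word-level rule) can be sketched as follows. This is a minimal illustration, not the implementation of [1]: the names `classify_word` and `word_correctly_stressed`, the group labels and the threshold classifiers in the usage lines are all hypothetical stand-ins for trained decision trees.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Features = Sequence[float]
Classifier = Callable[[Features], bool]  # returns True for "stressed"


def classify_word(syllables: Sequence[Tuple[str, Features]],
                  group_of: Dict[str, str],
                  classifiers: Dict[str, Classifier]) -> List[bool]:
    """Classify each (nucleus, features) pair with the classifier of its group."""
    return [classifiers[group_of[nucleus]](feats) for nucleus, feats in syllables]


def word_correctly_stressed(predicted: Sequence[bool], primary_index: int) -> bool:
    """Word-level rule: exactly one syllable is detected as stressed and it is
    the primary stressed syllable of the correct pronunciation."""
    stressed = [i for i, flag in enumerate(predicted) if flag]
    return stressed == [primary_index]


# Hypothetical usage with toy threshold classifiers (assumed, not from the paper).
group_of = {"AE": "non_high", "IY": "high"}
classifiers = {"non_high": lambda f: f[0] > 0.5, "high": lambda f: f[0] > 0.3}
decisions = classify_word([("AE", [0.7]), ("IY", [0.2])], group_of, classifiers)
```

Note that the word-level rule rejects a word both when no syllable is detected as stressed and when more than one is.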
3. PROPOSED CAP BASED MODELLING OF SYLLABLES
Word-dependent syllable stress evaluation systems were reported to
have the best performance as the classifiers are able to learn the sub-
tle differences in stressed and unstressed syllables and the across-
syllable interactions for a given word. These subtleties are averaged
out when word-independent techniques are built by grouping sylla-
bles from different words. The word-dependent techniques however,
are less appealing in real-life scenarios as they are not scalable.
Even though the approach proposed in [1] gains significant ad-
vantage by finer modelling of the syllable-nuclei, it doesn’t apply
the AP based grouping to model the context of the nuclei. The drop
in accuracies as a result of moving from the word-dependent sys-
tems to word-independent systems may be attributed to the loss of
context information across syllables. It is therefore important to fur-
ther group the syllables in subclusters based on the structure and the
identity of the contexts in which they occur.
3.1. Schema
Fig. 1. Schematic of the training and testing phases of CAP based
approach
Fig. 1 shows the schematic of the proposed CAP based approach.
The user is asked to record a word (randomly chosen from a
predefined set) in his/her natural voice. The input utterance (i.e., the
word) is time aligned with its corresponding phonetic transcription
using an Automatic Speech Recognition (ASR) system [7]. A
phone-to-syllable mapping of the corresponding word is used to
obtain the syllable-level time alignment of the input utterance and
thus get the syllable boundaries. The eight syllable level features,
described in Section 3.1 in [1], are listed in Table 2. In addition
to these acoustic features, the left and the right context-ID of the
syllable-nuclei are also incorporated for classifier training. Each
classifier is modelled based on these 8 acoustic features of the
syllable nucleus and the context-ID of the nucleus described in Section
3.2 and Section 3.3. A Decision Tree based classifier is used for
training this CAP based classifier to distinguish between stressed
syllables and their unstressed counterparts. During testing, the
classifier trained for the group of each syllable of the word along with its
context-ID is identified and then a decision is made if the syllable is
stressed or unstressed. After having classified each of the individual
syllables of the spoken word, the word is classified as correctly
stressed or not.

1. Average fundamental frequency (F0)
2. Average normalized energy
3. Normalized duration
4. Average high frequency energy
5. Average energy × duration
6. Average F0 × duration
7. Ratio of average F0 of adjacent syllables
8. Ratio of average energies of adjacent syllables
Table 2. Syllable-level acoustic features.
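
As a rough illustration of how the eight features of Table 2 could be assembled once the four base measurements of each syllable are available, consider the sketch below. The class and function names are hypothetical, and two choices are assumptions of this sketch rather than details given in the text: the "adjacent" syllable is taken to be the previous one (the next one for the first syllable), and a neutral ratio of 1.0 is used when no neighbour exists or the denominator is zero.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SyllableMeasurements:
    f0: float         # average fundamental frequency
    energy: float     # average normalized energy
    duration: float   # normalized duration
    hf_energy: float  # average high frequency energy


def _ratio(a: float, b: float) -> float:
    # Assumption: fall back to a neutral ratio of 1.0 when the denominator is zero.
    return a / b if b else 1.0


def feature_vector(syllables: List[SyllableMeasurements], i: int) -> List[float]:
    """Return the eight Table 2 features for syllable i of a word."""
    s = syllables[i]
    # Assumption: "adjacent" means the previous syllable, or the next one
    # for the first syllable of the word.
    j = i - 1 if i > 0 else i + 1
    if 0 <= j < len(syllables):
        n = syllables[j]
        f0_ratio = _ratio(s.f0, n.f0)
        energy_ratio = _ratio(s.energy, n.energy)
    else:
        # Monosyllabic word: no neighbour, so use neutral ratios.
        f0_ratio = energy_ratio = 1.0
    return [s.f0, s.energy, s.duration, s.hf_energy,
            s.energy * s.duration, s.f0 * s.duration,
            f0_ratio, energy_ratio]
```

The two ratio features are the only entries that look beyond the syllable itself, which is why the left and right context-IDs are introduced as additional inputs.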
In addition to the acoustic discriminability criteria such as pitch,
duration and energy, the context-ID of the nucleus also becomes an
important discriminative feature.
3.2. CAP based Grouping
For training, a separate group based on the syllable-nucleus is
formed. Each such nucleus in a polysyllabic word has a left and a
right context-ID. The context-syllable information, in addition to the
acoustic-phonetic features of the syllables, forms the basis for training
the classifiers.
The CAP based models are obtained by extracting the features
necessary for context modelling as follows.
The context for a given syllable-nucleus, NUC, is determined from
all the instances of that syllable-nucleus of the form
$\mathrm{NUC}^{3/5}_{*-[R/L]}$ in the training data.
$\mathrm{NUC}^{3/5}_{*-[R/L]}$ is the syllable-nucleus NUC whose
right context nucleus-ID (R) or left context nucleus-ID (L) belongs to
one among the several nuclei from the ind-3C or the ind-5C group
(Section 2.2) represented by *. Therefore, * ranges over {1-3} for
the ind-3C case or over {1-5} for the ind-5C case. The case itself
is identified by the superscript 3/5. These acoustic-phonetics based
context-ID features are then used to model the syllable-nucleus
contexts for each nucleus.
For example, the word ASSET has 2 syllable nuclei, AE and EH.
The right context of the first syllable nucleus is represented as
$\mathrm{AE}^{3}_{2-R}$, which means that the syllable nucleus AE has
a right context-ID EH which belongs to the 2nd group (non-high vowel)
for the 3-C grouping of contexts. Similarly, the left context of EH may
be represented as $\mathrm{EH}^{3}_{3-L}$, which means that the left
context-ID of EH, which is AE, belongs to the 3rd group (diphthong) for
the 3-C grouping of contexts. The left context of AE and the right
context of EH do not exist in this case. Choosing the context models
$\mathrm{AE}^{3}_{2-R}$ and $\mathrm{EH}^{3}_{3-L}$ as opposed
to just the nuclei models of AE and EH results in a substantial
improvement in the classification accuracies. For non-boundary cases
where the syllable nucleus lies in between 2 nuclei, both the left and
the right contexts are considered by looking up the groups the left
and right nuclei-ID belong to. A model choice is then made based on
the classification confidence measure each case generates. The
confidence measure is essentially the prediction confidence value
generated by the classifier, which is typically the impurity measure of the
corresponding leaf node.
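
The labelling scheme can be sketched in a few lines. The function below derives the left and right context labels for every nucleus of a word from a nucleus-to-group lookup. The function name, the plain-text label format (`NUC^3_g-L` / `NUC^3_g-R`) and, in particular, the group assignments in the usage lines are illustrative assumptions; they are not the grouping obtained from the training data in this work.

```python
from typing import Dict, List, Optional, Sequence, Tuple

ContextPair = Tuple[Optional[str], Optional[str]]  # (left label, right label)


def context_labels(nuclei: Sequence[str],
                   group_of: Dict[str, int],
                   n_groups: int = 3) -> List[ContextPair]:
    """Build the left/right context labels for each syllable nucleus of a word.

    A boundary nucleus has no left (first syllable) or no right (last
    syllable) context, which is represented by None.
    """
    labels: List[ContextPair] = []
    last = len(nuclei) - 1
    for i, nuc in enumerate(nuclei):
        left = f"{nuc}^{n_groups}_{group_of[nuclei[i - 1]]}-L" if i > 0 else None
        right = f"{nuc}^{n_groups}_{group_of[nuclei[i + 1]]}-R" if i < last else None
        labels.append((left, right))
    return labels


# Hypothetical three-syllable word and a toy 3-C group assignment
# (1 = high vowel, 2 = non-high vowel, 3 = diphthong), assumed for illustration.
toy_groups = {"IH": 1, "AE": 2, "AY": 3}
toy_labels = context_labels(["IH", "AE", "AY"], toy_groups)
```

Only the middle nucleus receives two labels, which is the non-boundary case where a choice between two context models has to be made.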
3.3. Dynamic Model Choice
This section describes a dynamic model choice for CAP based
classification. The case where a separate group is trained for each
syllable nucleus based on the acoustic-phonetic features of the nucleus
is referred to as the AP based ind-C case of grouping. The
context-dependent models are also trained based on the left and the right
syllable nuclei-IDs of the given syllable nucleus as described in
Section 3.2. The case where the contexts are classified into 3 groups
based on the energy of the context-syllables is referred to as
context-ind-3C and the case where the contexts are classified into 5 groups
based on energy and duration of the context-syllables is referred to as
context-ind-5C. It may be noted here that the context is not modelled
if there are insufficient stressed-unstressed instances for that context
case. This is because the classifier has a tendency to bias the decision
if only stressed or only unstressed instances are present.
In order to classify the given syllable, we first seek the context
model for that syllable based on its right/left nucleus-ID. If the
context model does not exist, we step up in granularity to seek the
nucleus model based on the nucleus-ID alone. For example, for the
word ASSET to be modelled using the context-ind-3C models, if
the CAP based model $\mathrm{AE}^{3}_{2-R}$ does not exist, the AP
based ind-C model of AE is selected. Furthermore, if the AP based
ind-C models themselves are biased, they are grouped with other
nucleus/nuclei by moving further up the granularity scale to ensure
stressed-unstressed balance of data. This dynamic switching ensures an
optimal model choice between the CAP based models and the AP based
ind-C models.
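
The back-off logic described in this section can be sketched as below. It is a simplified illustration under stated assumptions: models are represented as hypothetical callables returning a (decision, confidence) pair, a larger confidence value is assumed to mean a more reliable decision, and the final step of merging biased ind-C models with other nuclei is omitted.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Prediction = Tuple[bool, float]  # (is_stressed, confidence)
Model = Callable[[Sequence[float]], Prediction]


def classify_with_backoff(features: Sequence[float],
                          nucleus: str,
                          labels: Sequence[Optional[str]],
                          cap_models: Dict[str, Model],
                          ap_models: Dict[str, Model]) -> Prediction:
    """Prefer an existing context (CAP) model; otherwise back off to the
    nucleus-only (AP based ind-C) model.

    `labels` holds the left and right context labels of the syllable; a
    label is None at a word boundary and may be absent from `cap_models`
    when that context was not modelled.
    """
    candidates = [cap_models[lab](features)
                  for lab in labels
                  if lab is not None and lab in cap_models]
    if candidates:
        # Non-boundary case: keep the decision with the highest confidence.
        return max(candidates, key=lambda pred: pred[1])
    return ap_models[nucleus](features)
```

When both context models exist, the two candidate decisions may disagree, and the confidence comparison is what resolves the conflict.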
4. EXPERIMENTS AND RESULTS
The syllable stress experiments presented here are conducted on 92
unique words where each word has syllables varying from 2 to 5.
The training data consists of 13,594 correctly stressed words and
9879 incorrectly stressed words as assessed by one expert human as-
sessor considered as the master assessor by the call center. The test
data consists of recordings from real-life assessments of 350 candi-
dates with a total of about 6870 word instances. The rating assigned
by the assessor is 1 if the stress pattern of the word is correct and
is 0 otherwise. The syllable models can be trained using only those
instances of words which are labeled as correctly stressed. This is
because an incorrectly stressed word can be pronounced in multiple
ways (e.g., none of the syllables is stressed, more than one syllable
is stressed, the wrong syllable is stressed and so on) and the
syllable-level stress information for incorrectly stressed words is not
available. The test data was assessed by three expert human assessors.
The performance of all the stress classification techniques is
evaluated on the subset of the test data where all three human assessors
assign the same score. The data thus collected is hereafter referred to
as the adult data.
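
The selection of the evaluation subsets can be sketched as follows: word-level evaluation uses the instances on which all assessors agree, and syllable-level evaluation further keeps only those rated 1. The function names and the utterance identifiers are hypothetical.

```python
from typing import Dict, List, Tuple


def unanimous_subset(ratings: Dict[str, Tuple[int, ...]]) -> Dict[str, int]:
    """Keep the word instances on which every assessor gave the same rating."""
    return {utt: scores[0]
            for utt, scores in ratings.items()
            if scores and len(set(scores)) == 1}


def correctly_stressed_only(agreed: Dict[str, int]) -> List[str]:
    """Instances usable for syllable-level evaluation (agreed rating of 1)."""
    return [utt for utt, rating in agreed.items() if rating == 1]


# Hypothetical ratings from three assessors (1 = correct stress, 0 = incorrect).
ratings = {"utt_a": (1, 1, 1), "utt_b": (1, 0, 1), "utt_c": (0, 0, 0)}
agreed = unanimous_subset(ratings)
```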
In addition to the adult data described above, the analysis was
also extended to a children's speech database recorded at schools. 798
children who were third-graders or younger and who came from
bilingual English/Spanish schools participated in the effort. About
half of the recordings were made in a natural classroom setting with some
background noise. Data collection was designed to build a corpus
of only correctly stressed phrases. The recording material in total
consisted of about 51,092 utterances with about 702 unique words.
The data thus collected is hereafter referred to as the child data.
4.1. Results on Adult Data
The word-level performance is evaluated on a subset of the test data
(2801 word instances) where the 3 human assessors assign the same
rating. The syllable-level performance is evaluated on the same sub-
set of test-data where all 3 human assessors assigned the same rating
and the rating was 1 (i.e., correctly stressed). This is because an
incorrectly stressed word can be pronounced in multiple ways (e.g.,
none of the syllables is stressed, more than one syllable is stressed,
the wrong syllable is stressed and so on) and the syllable-level stress
information for incorrectly stressed words is not available. The
numbers in Table 3 indicate the percentage of syllables which are
correctly classified. Note that the performance improves as we move
from the AP based ind-C models to the CAP based context-ind-3C
and context-ind-5C models. The context features, used together with the
acoustic features, capture the nuances of co-articulation of the
syllables more robustly.
For example, in the context-ind-3C case there are a total of 14
syllable nuclei, where the right or the left context nuclei-ID may
belong to one of the 3 groups (high-vowel, non-high vowel and
diphthong). This leads to a total of 14 × 2 × 3 = 84 CAP models, where
2 refers to the context (left or right), and 3 refers to the number
of groups the context-ID could be a part of in the 3-C grouping of
nuclei. In addition to the CAP based models described above, the
individual AP based ind-C models are also considered for
classification, which results in a sum-total of 84 + 14 = 98 models. However, it
may be noted that not all the context-models may exist, due to lack
of training data or phonotactic constraints.
The net syllable-level performance improvement achieved by the
proposed grouping is approximately 7% (absolute, Table 3) for both the
context-ind-3C and the context-ind-5C case. The post-processing step
of [1] further corrects unstressed decisions to stressed decisions.
Syllable    AP based     CAP based             CAP based
Nucleus     ind-C (%)    context-ind-3C (%)    context-ind-5C (%)
AA          86.07        98.89                 98.95
AE          94.58        98.16                 98.41
AO          90.72        99.11                 99.00
AW          96.23        96.23                 96.23
AX          93.21        98.59                 98.91
AY          98.75        98.75                 98.75
EH          86.70        97.97                 98.68
ER          88.64        99.95                 99.95
EY          78.00        97.49                 97.49
IH          83.97        95.47                 96.36
IY          93.45        99.06                 99.06
OW          96.57        98.08                 98.36
UH          90.32        99.35                 99.67
UW          87.38        87.38                 87.38
Average     90.33        97.47                 97.66
Table 3. This table compares the syllable-level performance of the
various syllable stress evaluation techniques presented in Section 3.2.
There is significant improvement in the accuracies of the
syllable nuclei AA, AO, EH, EY, UH and IH in moving from the AP
based models to the CAP based models. For example, there is a
12% improvement in the syllable-level accuracy in moving from the
AP based ind-C models to CAP based context-ind-3C models for
the EY nucleus. This improvement can be attributed to the presence
of a wider choice of context-dependent models such as
$\mathrm{EH}^{3}_{1-L}$, $\mathrm{EH}^{3}_{1-R}$, $\mathrm{EH}^{3}_{2-L}$,
$\mathrm{EH}^{3}_{2-R}$, $\mathrm{EH}^{3}_{3-L}$ and $\mathrm{EH}^{3}_{3-R}$
as opposed to just the nucleus model of EH for classification. Certain
nuclei such as AW, AY and UW do not show any improvements because the
training data of the CAP based models is skewed towards only stressed or
only unstressed instances. In such cases, the AP based ind-C models
are chosen so as to not bias the decision. Also, there is no
significant improvement in moving from the context-ind-3C to the
context-ind-5C models, perhaps due to the lack of sufficient data to train the
context-ind-5C models and thereby exploit their advantages.
This improvement at the syllable-nucleus level also translates to
improvements at the word level as shown in Table 4. For example,
a word such as ASSOCIATION has 5 syllable nuclei, AE, OW, IH,
EY and AX. The accuracy of the word jumps from 59% for the AP
based ind-C case to 96% for the CAP based context-ind-3C
(context-dependent) case for a total of 49 instances of the word in the dataset.
The CAP models $\mathrm{AE}^{3}_{2-R}$, $\mathrm{OW}^{3}_{3-L}$,
$\mathrm{OW}^{3}_{1-R}$, $\mathrm{IH}^{3}_{3-L}$, $\mathrm{IH}^{3}_{2-R}$,
$\mathrm{EY}^{3}_{1-L}$, $\mathrm{EY}^{3}_{1-R}$ and $\mathrm{AX}^{3}_{2-L}$,
in addition to the context-independent
models of AE, OW, IH, EY and AX, are used to classify the instances
of the word (Section 3.2). The non-boundary nuclei such as OW, IH
and EY have both the right and the left context-models, based on
the nucleus-ID of the context, to choose from. In such a case, the
decision of whichever of the 2 context models yields
the highest confidence score is considered. This improvement can
be attributed to the inherent advantage of modelling contexts as
opposed to choosing the AP based ind-C models of AE, OW, IH, EY and
AX. It is also noted that the context based accuracy of 86.95% for
the context-ind-3C case is comparable to that of the best performing
word-dependent systems.
ind-C       context-ind-3C    context-ind-5C
74.87%      86.95%            86.98%
Table 4. Word-level performance for the ind-C, context-ind-3C and
context-ind-5C cases (see Section 3.2).
4.2. Results on child data
The performance of the CAP based models was also evaluated on
child data. This robust data-set incorporates important structure and
context information of syllables in addition to the eight syllable level
features. Table 5 shows the syllable-level average 10-fold cross
validation performance for the AP based ind-C, CAP based context-ind-3C
and context-ind-5C cases. Furthermore, 10% of the child data is
held out as test data, and the classifiers are trained on the remaining
90% of the data. The utterances used for testing do not overlap with
those used for training. The syllable level context accuracy evaluated
on this held-out data is about 96.94% or 2.5% lower compared to the 10-fold
accuracy reported in Table 5 for the context-ind-3C case. The CAP
based techniques also result in a word level accuracy of 96.67% for
the context-ind-3C case. The relatively higher performance on the
child data compared to the adult data may be due to the fact that the
test and train data used in the child data are all correctly stressed.

ind-C       context-ind-3C    context-ind-5C
93.75%      95.54%            96.34%
Table 5. Syllable-level average 10-fold cross validation performance
for child data.
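
For reference, a 10-fold split over utterance indices can be generated as sketched below. This is a generic illustration of k-fold cross validation rather than the exact protocol used here: the function name, the shuffling and the fixed seed are assumptions. Splitting at the utterance level keeps the test and training utterances of each fold disjoint.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_items: int, k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k-fold cross validation.

    Every item appears in exactly one test fold, and the train and test
    lists of a fold never overlap.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # assumed: shuffle before splitting
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for m, fold in enumerate(folds) if m != i for idx in fold]
        yield train, test
```

The reported figure is then the average of the per-fold accuracies.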
4.3. Adult data tested on models trained using child data
ind-C       context-ind-3C    context-ind-5C
67.01%      69.62%            72.37%
Table 6. Syllable-level performance of adult data tested on models
trained using child data.
To evaluate the cross-data performance of the CAP based models,
the adult data was tested on models trained using child data.
A syllable-level accuracy of 72.37% and a word-level accuracy of 69.83%
were obtained for the context-ind-5C case. This performance is only
about 5% lower than what was reported for AP based modelling in
[1] evaluated on the same adult test data.
5. CONCLUSIONS
It is shown that the relative knowledge of the syllable nuclei, when
incorporated into the models trained by the classifiers, significantly
improves the classification accuracies of the syllables. The CAP based
system outperforms the AP based techniques and exhibits
performances comparable to the word-dependent systems. A dynamic
context model selection and optimal grouping further improve the
classifiers, making them more robust. The relative advantages of the
proposed system, however, may be exploited only when there is
sufficient data to capture the nuclei-contexts.
6. REFERENCES
[1] Om D. Deshmukh and Ashish Verma, “Nucleus-level clustering
for word-independent syllable stress classification,” Speech
Commun., December 2009.
[2] J Tepperman and S Narayanan, “Automatic syllable stress
detection using prosodic features for pronunciation evaluation of
language learners,” in Internat. Conf. on Acoust. Speech and
Signal Process, 2005.
[3] K Imoto, Y Tsubota, A Raux, T Kawahara, and M Dantsuji,
“Modeling and automatic detection of English sentence stress for
computer-assisted English prosody learning system,” in ICSLP,
2002.
[4] K Jenkin and M Scordilis, “Development and comparison of
three syllable stress classifiers,” in Internat. Conf. on Speech
and Language Process., 1996.
[5] A Verma, K Lal, Y.Y Lo, and J Basak, “Word independent
model for syllable stress evaluation,” in Internat. Conf. on
Speech and Language Process., 2006.
[6] G Ying, L Jamieson, R Chen, and C Michell, “Lexical stress
detection on stress-minimal word pairs,” in Internat. Conf. on
Speech and Language Process., 1996.
[7] B Ramabhadran, O Siohan, and A Sethy, “The IBM 2007 speech
transcription system for European parliamentary speeches,” in
IEEE Automatic Speech Recognition and Understanding Workshop,
2007.