Kobe, Japan
Automatic Evaluation and Training in English Pronunciation
Jared Bernstein, Michael Cohen, Hy Murveit, Dimitry Rtischev and Mitchel Weintraub
Speech Research Program, SRI International, Menlo Park, California 94025 USA
SRI is developing a system that uses real-time speech recognition to
diagnose, evaluate and provide training in spoken English. The paper first
describes the methods and results of a study of the feasibility of
automatically grading the performance of Japanese students when reading
English aloud. Utterances recorded from Japanese speakers were independently
rated by expert listeners. Speech grading software was developed from a
speaker-independent hidden-Markov-model speech recognition system. The
automatic grading procedure first aligned the speech with a model and then
compared the segments of the speech signal with models of those segments
that have been developed from a database of speech from native speakers of
English. The evaluation study showed that ratings of speech quality by
experts are very reliable and automatic grades correlate well (r > 0.8) with
those expert ratings.
SRI is now extending this technology and integrating it in a spoken-language
training system. This effort involves (1) porting SRI's DECIPHER speech
recognition system to a microcomputer platform, and (2) extending the
speech-evaluation software to more exactly diagnose a learner's
pronunciation deficits and lead the learner through an appropriate regimen
of exercises.
1.0 Introduction
Computer-assisted foreign language instruction is a natural extension of
language laboratory technology that is based on audio tape. Computer-assisted
language instruction has been the focus of many research projects and
commercial products over the last two decades (Ahmad et al. 1985). Most
language instruction systems implemented to date can be grouped into several
broad classes. The simplest systems offer highly structured lessons and drills
relying solely on text and static pictures. Such simple systems offer only a
minor improvement over traditional printed textbooks. More advanced
designs attempt natural language processing, permitting greater flexibility in
the user's input text and moving away from the highly constrained traditional
drill paradigms toward more life-like linguistic interactions. Certain more
attractive systems use moving video portraying language use in appropriate
cultural contexts.
Still, most computer-based language instruction systems have been designed
with a focus on reading and listening comprehension, since it is the practice
of these receptive skills that is most easily accomplished using a computer
capable of controlling simple text, video, and audio output. Some systems
also permit writing practice by accepting constrained textual input from the
user. Speaking, however, remains the most difficult aspect of language
learning to incorporate into a computer-based instruction system. Thus, the
pivotal practice of active conversation skills is still restricted to live
classroom instruction and real-life "sink-or-swim" situations.
This paper is concerned with spoken language instruction using a computer
capable of both audio input and output. We first describe the methods and
results of a study of the feasibility of automatically evaluating the spoken
reading performance of Japanese students of English. We then outline our
latest efforts for developing a voice-interactive computer-assisted language
instruction system capable of speech diagnosis, instruction, and evaluation.
2.0 Automatic evaluation of spoken sentences
The initial objective was to determine the feasibility of automatically
evaluating the intelligibility of English sentences read aloud by native
Japanese speakers. This technology might be useful, for example, as part of
an admission procedure for entering a university. We sought a method for
automatically deriving intelligibility scores for spoken sentences that
correlated well with those assigned by human expert listeners. In principle,
such a method would be useful not only for testing student spoken language
skills, but also as a basis for a more comprehensive system providing
diagnosis and instruction in speaking the foreign language.
2.1 Method
2.1.1 Materials
Each Japanese speaker read six sentences aloud. Of the six sentences chosen
for reading, two were specially designed diagnostic sentences that include
one or more examples of extreme vowel and consonant sounds, as well as
several sounds and sequences that vary among dialects of English. The other
four sentences are sentences for which SRI already had recorded a balanced
sample of 40 American English speakers. These sentences, which were
designed to provide breadth and depth of coverage of phones and phones-in-
context, are:
(1) She had your dark suit in greasy wash water all year. (diagnostic)
(2) Don't ask me to carry an oily rag like that. (diagnostic)
(3) What is England's estimated time of arrival in Townsville?
(4) Show the same display increasing letter size to the maximum value.
(5) How many ships were in Galveston May third?
(6) Is Puffer's remaining fuel sufficient to arrive in port at the present
2.1.2 Speakers
SRI recorded 37 speakers: 6 at SRI, 25 at Stanford University housing, and
then 6 more at Stanford. All of the analysis reported below was performed
on the 31 speakers comprising the first two groups. None of the 31 speakers
had lived in the United States for more than three years at the time of their
recording, and only two had been in the United States longer than two years.
Most speakers had been in the United States more than one month and less
than 15 months. All speakers are adults - 25 men and 12 women.
2.1.3 Equipment
Recordings were made wherever the speaker was found: e.g., in an office or
a living room. The recordings were made using a head-mounted microphone
and a high-quality, portable, analog tape recorder. The automatic gain
control circuit of the tape recorder was employed. The recorded utterances
were digitized at 16,000 16-bit samples per second and were stored on disk.
2.1.4 Ratings
The plan for obtaining human ratings of the quality and intelligibility of the
spoken material was to have a small number of experts rate the
pronunciation quality of each recorded utterance and then measure the
intelligibility of a stratified sample of the speakers.
Two listening tests were administered. The first test presented expert
listeners with the same sentence as spoken by each of the 31 readers, so that
the listeners could gauge the range of English skill in the sample of speakers.
Subsequently, all 186 recorded utterances (6 sentences by 31 speakers) were
presented in a random order for rating by the listeners. Three expert listeners
were instructed to estimate the pronunciation skill (segmental and prosodic)
of the speaker. Each expert listener rated the utterances on two occasions,
separated by several days.
The second test measured the intelligibility of six speakers selected from the
31 Japanese speakers. These six speakers were selected to cover the range of
quality ratings among the 31 speakers. Six balanced forms of the test were
administered to three sets of six naive listeners. Each form presented six
utterances: one from each speaker and one of each sentence type.
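One way to build forms with this property is a Latin square, sketched below in Python. This is an illustrative assumption, not the procedure documented for the study: the function name and the rotation rule are hypothetical, the speaker codes are those listed in Table 3, and the sentence numbers follow Section 2.1.1.

```python
def balanced_forms(speakers, sentences):
    """Build n test forms from n speakers and n sentences (Latin square).

    Form f pairs speaker s with sentence (s + f) mod n, so every form
    contains each speaker once and each sentence once, and every
    speaker-sentence pair appears in exactly one form.
    """
    n = len(speakers)
    if len(sentences) != n:
        raise ValueError("need as many sentences as speakers")
    return [
        [(speakers[s], sentences[(s + f) % n]) for s in range(n)]
        for f in range(n)
    ]


# Hypothetical assignment: speaker codes from Table 3, sentences 1-6.
speakers = ["INAK", "JDOT", "TRIM", "HOHT", "KHIR", "CDOT"]
sentences = [1, 2, 3, 4, 5, 6]
forms = balanced_forms(speakers, sentences)
for number, form in enumerate(forms, start=1):
    print(number, form)
```

With six forms of six utterances each, all 36 speaker-sentence combinations are heard, yet no listener hears the same speaker or the same sentence twice.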
2.1.5 Speech Processing
In the development of automatic grading, the speech signals were processed
in a manner similar to that used in discrete density hidden-Markov model
speech recognition (Cohen et al. 1990). The sampled speech signal was
spectrally transformed on a frame-by-frame basis via a Fast Fourier
Transform algorithm. Acoustic features were calculated and quantized for
each frame from the discrete Fourier coefficients.
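The frame-by-frame processing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not SRI's front end: the frame length, hop size, Hamming window, log band energies and the two-entry codebook are all hypothetical choices, and the paper does not specify which features or codebooks were used.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def frame_features(samples, frame_len=256, hop=128, n_bands=4):
    """Return one list of log band energies per analysis frame."""
    if frame_len & (frame_len - 1):
        raise ValueError("frame_len must be a power of two")
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_len)]
        power = [abs(c) ** 2 for c in fft(frame)[:frame_len // 2]]
        width = len(power) // n_bands
        feats.append([math.log(sum(power[b * width:(b + 1) * width]) + 1e-10)
                      for b in range(n_bands)])
    return feats


def quantize(vec, codebook):
    """Index of the nearest codebook vector (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(vec, codebook[j])))


# Usage: a 1 kHz test tone sampled at 16 kHz, the rate used for the recordings.
rate = 16000
tone = [math.sin(2 * math.pi * 1000 * n / rate) for n in range(1024)]
features = frame_features(tone)
codebook = [[10.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 10.0]]  # hypothetical
symbols = [quantize(f, codebook) for f in features]
```

Each frame is thereby reduced to a small integer symbol, which is the form of input a discrete-density hidden-Markov model expects.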
Hidden-Markov models trained on a large number of examples of sentences
and words spoken by a diverse sample of American English speakers were
used as a stochastic model of the pronunciation of English. Hidden-Markov
models that represent phonemes, words, or whole utterances could be
formed. Given adequate training, the larger the speech unit, the tighter the
model will be (tighter models discriminate more reliably). Two extremes of
model size were tested: context-free phoneme models and whole sentence
models. Regardless of size, each model consisted of a number of underlying
states with each state characterized by (1) discrete probability densities (one
for each of the acoustic features) and (2) a set of likelihoods of the
transitions into allowed subsequent states. Figure 1 is a diagram of states
for the word [she].
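To make the state structure concrete, the following Python sketch aligns a sequence of quantized symbols to a left-to-right model with the Viterbi algorithm and returns the log likelihood of the best path. It is a simplified illustration, not the DECIPHER implementation: it assumes a single discrete feature stream, transitions only to the same or the next state, and toy probabilities; the function name and the per-frame normalization are hypothetical.

```python
import math


def viterbi_align(obs, emit, stay):
    """Align symbols to a left-to-right HMM.

    obs:  list of symbol indices, one per frame
    emit: emit[s][k] = P(symbol k | state s)
    stay: stay[s] = P(remaining in state s); 1 - stay[s] = P(advancing)
    The path must start in state 0 and end in the last state.
    Returns (log likelihood of the best path, state index per frame).
    """
    n_states, n_frames = len(emit), len(obs)
    if n_frames < n_states:
        raise ValueError("need at least one frame per state")
    neg_inf = float("-inf")

    def lg(p):
        return math.log(p) if p > 0 else neg_inf

    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = lg(emit[0][obs[0]])
    for t in range(1, n_frames):
        for s in range(n_states):
            best, arg = score[t - 1][s] + lg(stay[s]), s
            if s > 0:
                cand = score[t - 1][s - 1] + lg(1 - stay[s - 1])
                if cand > best:
                    best, arg = cand, s - 1
            score[t][s] = best + lg(emit[s][obs[t]])
            back[t][s] = arg
    # Trace back from the final state to recover the alignment.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[n_frames - 1][n_states - 1], path


# Toy two-state model with two symbols (all values hypothetical).
emit = [[0.9, 0.1], [0.1, 0.9]]
stay = [0.5, 0.5]
obs = [0, 0, 1, 1]
loglik, path = viterbi_align(obs, emit, stay)
per_frame = loglik / len(obs)  # length-normalized score, an assumed grading statistic
print(path, per_frame)
```

The returned path gives the segment boundaries (which frames belong to which state), and the likelihood of the aligned frames under native-speaker models is the kind of quantity an automatic grade could be derived from.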
2.1.6 Preliminary Separation Studies
Before the human ratings were available for these sentences, a series of
studies were done to identify which kinds of states and which features of
states were most useful in separating the populations of Japanese and
American speakers. These state-feature separators (SFS) were identified for
the two diagnostic sentences and were used in the studies reported below.
2.2 Results
2.2.1 Ratings
Rating of the pronunciation quality of the sentences by expert listeners
yielded the following results:
Raters used a seven-point scale adequately to distinguish among the
speakers. The average ratings assigned to speakers ranged from 1.8
to 6.5.
The standard deviation of ratings among all three expert listeners was
0.33 averaged over sentences.
Figure 1. State diagram for the word "she".
Each state (sh1, sh2, sh3, i1, i2, i3) has
1. a probability of transition to the next state
2. a probability of transition to the same state
3. probability density functions for four
acoustic features.
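Table 1. Reliability of each expert's repeated ratings
Expert 1: 0.98
Expert 2: 0.96
Expert 3: 0.94
Table 2. Agreement of each expert with the other two experts
Expert 1 vs 2 & 3
Expert 2 vs 1 & 3
Expert 3 vs 1 & 2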
As shown in Tables 1 and 2, the reliability of the ratings was excellent.
That is, each listener's ratings of the speech samples were consistent
for two sets of ratings that were separated by several days. There
was also agreement among the listeners in the ratings given to each
speech sample.
The judgments of pronunciation quality were robust over sentence
content. The average quality rating of the six sentences over all
speakers ranged from 3.28 to 3.61, which suggests that the judgments
were largely independent of the sentence material.
2.2.2 Intelligibility
A stratified sample of six speakers was selected for intelligibility testing.
Table 3 shows the percentage of words correctly spoken (intelligibility
score) and the average human-assigned quality rating for these six speakers.
Sentence-intelligibility scores for the speakers were more variable than the
quality ratings. Unlike tests of word-intelligibility (which can be designed to
achieve considerable precision), developing precise tests of sentence
intelligibility is impeded by the listeners' ability to predict words from
context even when individual words are unintelligible. Listeners differ
considerably in their ability to predict unintelligible words, and this
contributes to the imprecision of the results. Thus, it seems that the quality
rating is a more desirable number to correlate with the automatic-grading
score for sentences. Furthermore, it accords with the traditional method of
judging pronunciation.

Table 3. Average quality rating and intelligibility score for six speakers
Speaker  Quality rating  Intelligibility
INAK     5.33            95%
JDOT     4.28            84%
TRIM     3.72            82%
HOHT     3.27            88%
KHIR     2.50            67%
CDOT     1.67            49%
2.2.3 Automatic Grading
Three grading and aligning procedures were evaluated for both wideband
and telephone-band speech:
(1) Alignment and grading using sentence models.
(2) Alignment and grading using phoneme models.
(3) Alignment using sentence models and grading using phoneme models.
The most effective automatic grading performance was observed when each
spoken sentence was aligned and graded with a model (or a set of models)
for that sentence. Sentence model alignments yielded a correlation with the
average quality ratings by human experts of 0.81 for wideband speech and
0.76 for telephone-band speech. Table 4 summarizes the results. In the table,
"alignment" and "grading" specify the kind of models used in aligning and
grading the input speech (either sentence models or phoneme models).
Table 4. Correlation of automatic grades with average expert quality ratings
Alignment: Sentence Phoneme Sentence
Grading: Sentence Phoneme Phoneme
320-5600 Hz (wideband): 0.81 0.71
200-3600 Hz (telephone): 0.76 0.11 0.73
At telephone bandwidth, phonetic alignment using context-free phoneme
models was degraded, as evidenced by the low correlation of 0.11. However,
the phoneme models are nevertheless quite robust when properly aligned.
When used in conjunction with sentence model alignment, for example, the
phoneme models produced an automatic grade that correlates 0.73 with the
human quality ratings. Figure 2 shows a scatter plot of automatic grader vs.
human judgment scores. These data are for signals band-passed at 200-3600
Hz. The correlation of the data between the two dimensions is 0.76.
A correlation of 0.81 means that the knowledge of sentence-model scores
enables us to predict about two-thirds (0.81 * 0.81 = 0.66) of the variation
expected in listeners' ratings of quality.
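The relation between a correlation coefficient and the share of variation it accounts for can be checked with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and the sample data are the six rows of Table 3 (quality rating and intelligibility), used here simply because they are the only paired values printed in the paper; they are not the data behind the 0.81 figure.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def variance_explained(r):
    """Proportion of variance in one variable predictable from the other."""
    return r * r


# Sentence-model correlation reported for wideband speech.
print(round(variance_explained(0.81), 2))

# Paired values from Table 3: quality rating and intelligibility (%).
quality = [5.33, 4.28, 3.72, 3.27, 2.50, 1.67]
intelligibility = [95, 84, 82, 88, 67, 49]
r = pearson_r(quality, intelligibility)
print(r, variance_explained(r))
```

Squaring the coefficient is what turns "correlates 0.81" into "predicts about two-thirds of the variation"; a correlation of 0.73 would by the same arithmetic account for roughly half.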
2.2.4 Reliability
In test theory, reliability is an index based on how well two measures of the
same variable correlate. Any score is composed of a component of true
variance plus a component of error variance. In this case, the true variance
is the true difference among the speakers' English pronunciation in terms of
the rating criteria used by the listeners. The reliability of a score is an
estimate of the true variance component.
Various strategies are commonly used to estimate reliability. Our strategy
was to use repeated ratings by each listener to estimate the reliability of the
speech quality ratings. As shown in Table 1, the reliabilities of ratings of
speech quality by expert listeners were excellent, ranging from 0.94 to 0.98.
Experience with the time alignment of SRI's speech processing routine leads
us to conclude that the grading scores would have a retest correlation of near
1.0. The high reliability of the ratings is important because it provides
ample true variance to correlate with the scores from automatic grading.
2.3 Conclusions from the Evaluation Study
The two principal conclusions that can be drawn from the evaluation study
are that (1) ratings of speech quality by experts are very reliable and (2)
automatic grading using sentence models correlates highly with those expert
ratings. Results when using phoneme models with 200-3600 Hz band-limited
speech (simulated telephone quality) suggest that high correlations with
quality ratings can be achieved, but that further development of the time
alignment will be required. We presume that intelligibility would also
correlate well with the models if enough listeners and sentences were used to
stabilize the intelligibility scores.
The evaluation study has pointed to several important issues, among which
are the following:
Development of automatic grading has relied predominantly on spectral
(segmental) features of the recorded speech sample. A more
complete treatment of prosodic features may be needed to improve
automatic grading significantly.
Results with the SFS distances suggest that improved discrimination can
be realized with carefully selected states and features. Searching
for particularly effective features should be part of further
development.
Cross-gender models of American adults were used for comparison with
Japanese adults. Performance could probably be improved by
using specific models for male and female speakers.
Furthermore, the training and matching may need to be based on a more
elaborate sample of speakers (e.g. samples that are stratified by
age and size of the speaker).
There are important trade-offs involved in deciding between phoneme
models and sentence models. Sentence-level models are much
more expensive to train, but they align the speech signal much
better and consistently correlate better with human judgments.
On the other hand, phoneme-level models permit an arbitrarily
large set of phrases and/or sentences to be constructed with no
extra development cost. Achieving adequate performance with
phoneme models will require some additional research.
3.0 Instructional Development
3.1 Introduction
Although there are many directions in which SRI's research on
pronunciation evaluation might be extended, we are now taking the first
steps in the development of instructional systems that can take a more active
part in the shaping of spoken language production. In particular, students
need to speak the language in a situation in which good models are available
and critical feedback is given within some task that has reasonable intrinsic
interest.
3.2 Target System
SRI is aiming development toward a language teaching system in which
both graphic and spoken forms of language can be taught and tested, and in
which both the receptive (listening and reading) and productive (speaking
and writing) skills are trained. The target system would ideally have
excellent interactive graphics and tools that allow fast development of
lessons and the automatic tracking of student progress. However, as a first
step, SRI is focusing on a demonstration system that highlights the
possibilities of spoken input in language instruction.
3.3 Initial System
The initial demonstration system supports the development of language
lessons in which the student has access to some graphical information like a
map or a table of data forming the basis for a limited-context "conversation"
between the student and the computer. The student is presented with
questions and several full-sentence answers in a multiple-choice format. A
native pronunciation of the question and all the answers is available for
playback at the student's request. The student reads one of the answers into
the microphone. The system immediately indicates which answer was picked
and how well it was pronounced.
The development of the initial system involves building a small-vocabulary
English and Japanese word recognizer in a portable microcomputer
environment, improvement of sentence grading and user feedback mechanisms,
and collection of speech data for recognizer training. We are also addressing
the design of a thematically integrated lesson battery along with a simple and
flexible user interface.
4.0 Conclusion
We have demonstrated that automatic evaluation of English pronunciation is
possible. Further, it seems that similar technology should be adequate for use
in the training of foreign language pronunciation and in the diagnosis of
spoken errors. If teaching materials are carefully adapted to use the speech
recognition tools that already exist and if speech recognition system
components are properly trained and adapted to the requirements of
language teaching, many new ways of teaching, analyzing and evaluating
pronunciation will soon emerge.
References
K. Ahmad, G. Corbett, M. Rogers, R. Sussex, "Computers,
Language Learning and Language Teaching," Cambridge
Univ. Press, Cambridge, 1985.
M. Cohen, H. Murveit, J. Bernstein, P. Price, M. Weintraub,
"The Decipher Speech Recognition System," Proc. IEEE
ICASSP-90, Albuquerque, 1990.
Figure 2. Scatter plot of automatic grader scores versus human judgment of
pronunciation quality.