L2-ARCTIC: A Non-Native English Speech Corpus
Guanlong Zhao1, Sinem Sonsaat2, Alif Silpachai2, Ivana Lucic2
Evgeny Chukharev-Hudilainen2, John Levis2 and Ricardo Gutierrez-Osuna1
1Department of Computer Science and Engineering, Texas A&M University, United States
2Department of English, Iowa State University, United States
{gzhao, rgutier}, {sonsaat, alif, ilucic, evgeny, jlevis}
Abstract

In this paper, we introduce L2-ARCTIC, a speech corpus of
non-native English that is intended for research in voice
conversion, accent conversion, and mispronunciation detection.
This initial release includes recordings from ten non-native
speakers of English whose first languages (L1s) are Hindi,
Korean, Mandarin, Spanish, and Arabic, each L1 containing
recordings from one male and one female speaker. Each
speaker recorded approximately one hour of read speech from
the Carnegie Mellon University ARCTIC prompts, from which
we generated orthographic and forced-aligned phonetic
transcriptions. In addition, we manually annotated 150
utterances per speaker to identify three types of
mispronunciation errors: substitutions, deletions, and additions,
making it a valuable resource not only for research in voice
conversion and accent conversion but also in computer-assisted
pronunciation training. The corpus is publicly accessible at
Index Terms: speech corpus, voice conversion, accent
conversion, mispronunciation detection
1. Introduction
Voice conversion (VC) [1] aims to transform utterances from a
source speaker to make them sound as if a target speaker had
uttered them. The closely related problem of accent conversion
(AC) [2] goes a step further, mixing the source speech’s
linguistic content and accent with the target speaker’s voice
quality to create utterances with the target’s voice but the
content and pronunciation of the source speaker. When teaching
a second language (L2), accent conversion can be used to create
a “golden speaker,” a synthesized voice that has the learner’s
voice quality but with a native speaker’s accent (e.g., prosody,
intonation, pronunciation) [3]. Several studies [4, 5] have
suggested that having such a “golden speaker” to imitate can be
beneficial in pronunciation training. Furthermore, in addition to
providing language learners with a suitable voice to mimic,
detecting mispronunciations is also a critical component for
providing useful feedback to the learners in computer-assisted
pronunciation training [6].
To train and evaluate voice and accent conversion systems
designed for non-native speakers, one needs high-quality
parallel recordings from the source and target speakers.
Likewise, to develop and benchmark mispronunciation
detection algorithms, detailed phoneme level annotations on
pronunciation errors (e.g., phone substitution, additions, and
deletions) are required. However, existing non-native English
corpora (e.g., Speech Accent Archive [7] and IDEA [8]) do not
fulfill these requirements (refer to Section 2 for a detailed
discussion).
To fill this gap, we have built a non-native English speech
corpus that contains ten non-native speakers of English in the
initial release. The end goal for this corpus is to include 20
speakers from five different native languages: Hindi, Korean,
Mandarin, Spanish, and Arabic. For each speaker, the corpus
contains the following data:
- Speech recordings: over one hour of prompted recordings
of phonetically balanced short sentences
- Word-level transcriptions: orthographic transcription and
forced-aligned word boundaries for each sentence
- Phoneme-level transcriptions: forced-aligned phoneme
transcription for each sentence
- Manual annotations: a selected subset of utterances (~150),
including 100 sentences produced by all speakers and 50
sentences that include phonemes likely to be difficult
according to each speaker’s L1, all annotated with corrected
word and phone boundaries; phone substitution, deletion,
and addition errors are also tagged
The dataset is hosted on an online archive and is freely
available to the research community for non-commercial use.
To the best of our knowledge, L2-ARCTIC is the first openly
available corpus of its kind.
2. The need for a new L2 English corpus
A number of voice conversion studies [9-12] have relied on the
Carnegie Mellon University (CMU) ARCTIC speech corpus
[13] and, more recently, the Voice Conversion Challenge
(VCC) dataset [14]. However, little attention has been paid to
voice conversion between non-native speakers of English, in
part due to the lack of high-quality speech recordings from
those speakers, despite 80% of the English speakers in the
world being non-native [15]. For example, CMU ARCTIC only
has a few accented English speakers (JMK: Canadian accent;
AWB: Scottish accent; KSP: Indian accent), either native
speakers of different English dialects or highly proficient
non-native speakers, whereas the VCC dataset was recorded
solely by professional voice talents who are native English
speakers. Therefore, these standard corpora are not suitable for
either voice conversion between non-native speakers or accent
conversion tasks.
Among the non-native English corpora, the Speech Accent
Archive [7] and IDEA [8] cover a wide range of native
languages and speakers. However, each speaker only recorded
a short paragraph (Speech Accent Archive) or a short free
speech task (IDEA), and most of the recordings have strong
background noise, making them ill-suited for voice/accent
conversion. The Wildcat [16], LDC2007S08 [17], and
NUFAESD [18] datasets have a limited number of recordings
for each non-native speaker, and have restricted access:
LDC2007S08 requires a fee, while Wildcat and NUFAESD are
limited to designated research groups.
As for corpora for mispronunciation detection, the CU-
CHLOE [19] and College Learners’ Spoken English Corpus
(COLSEC) [20] only contain speech and error tags from
Chinese learners of English, and CU-CHLOE is (to our
knowledge) not publicly available. The ISLE Speech Corpus
[21] contains mispronunciation tags and is open for academic
access, but it only focuses on a limited group of English learners
(German and Italian). SingaKids-Mandarin [22] has a rich set
of speech data, but it only focuses on mispronunciation patterns
in Singapore children’s Mandarin speech. In fact, most existing
mispronunciation detection systems use their private datasets,
which makes it difficult to compare experimental results across
different publications [19, 23-25].
To overcome the insufficiencies outlined above, we
constructed (and are now releasing) L2-ARCTIC to provide an
open corpus for voice conversion between accented speakers,
accent conversion, and mispronunciation detection. Zhao et al.
[26] have performed a preliminary evaluation on voice/accent
conversion tasks using a subset of the speakers in L2-ARCTIC.
Using a joint-density GMM with MLPG and global variance
compensation [9] (128 mixtures, ~5 min of parallel training
data) as the voice conversion system, they obtained a Mean
Opinion Score (MOS) of 2.5 on the converted speech, which was also
rated as similar to the target voice. Furthermore, an accent-
conversion algorithm based on frame-alignment using
posteriorgrams was able to generate speech that was perceived
as similar to a non-native target voice but markedly less
accented (98% preference compared to non-native speech).
This manuscript presents preliminary results on a new task:
mispronunciation detection.
3. Corpus curation procedure
This initial release of L2-ARCTIC contains English speech of
speakers from five different first languages: Hindi,
Korean, Mandarin, Spanish, and Arabic. We chose these L1s
because each one has a distinct foreign/non-native accent in
English and provides unique challenges. Indian speakers of
English typically have native-like English fluency but use
segmental and suprasegmental features in ways that are distinct
from American English. Thus, Indian speakers have both
advantages in approaching pronunciation changes (e.g.,
familiarity and comfort with English) and disadvantages
(comfort with their English variety makes it particularly
difficult to adjust their speech to salient differences with
American English.) Korean learners of English have a large
number of high functional load consonant and vowel
difficulties (errors with many minimal pairs). Prosodically,
Korean and English employ suprasegmental systems that have
little overlap [28, 29]. Mandarin (Chinese/Putonghua) learners
of English have difficulty with a range of consonant and vowel
sounds and in producing correct English stress, intonation, and
juncture [30-32]. Spanish learners of English may have
difficulties distinguishing a number of high functional load
contrasts in English [33, 34]. Spanish is also a five-vowel
language, and Spanish learners find the more complex English
vowel system especially challenging. Like English, Spanish
uses both word stress and nuclear stress for emphasis but,
because it does not use the unstressed vowel schwa, realizes
stress differently. Finally, Arabic also has significantly fewer
vowels than English, and while Arabic has word stress, it does
not use stress in the same way that English does [35, 36]. In the
future, we may also include speakers from other L1s if we find
them to be useful to the research community.
3.1. Participants
For this initial release, we recruited two speakers (one male and
one female) for each of the L1s, for a total of ten speakers.
Speakers were recruited from Iowa State University’s student
body; their age range was from 22 to 43 years, with an average
of 29 years (std: 6.9). Demographic information of the speakers
is summarized in Table 1. The proficiency level of English was
measured using TOEFL iBT scores [37].
Table 1: Demographic information of the speakers
3.2. Recording the corpus
To create the corpus, we used the 1,132 sentences in the CMU
ARCTIC prompts. There were multiple reasons to choose these
sentences. First, the ARCTIC prompts are phonetically
balanced (100%, 79.6%, and 13.7% coverage for phonemes,
diphones, and triphones, respectively), are open source, and can
produce around one hour of edited speech. Second, the
ARCTIC corpus itself has proven to work well with speech
synthesis [38] and voice conversion tasks [9-11, 39]. Finally,
the ARCTIC prompts are challenging for non-native English
speakers, so they can elicit potential pronunciation problems.
The speech was recorded in a quiet room at Iowa State
University (ISU). We used a Samson C03U microphone and
Earamble studio microphone pop filter for recordings; the
microphone was placed 20 cm from the speaker to avoid air
puffing. During each recording session, a linguist guided the L2
speaker through the process, asking the speaker to re-record a
sentence if the production contained significant disfluency or
deviated from the prompt. All speakers were instructed to speak
in a natural manner. The speech was sampled at 44.1 kHz and
saved as a WAV file.
Once the recording was finished, we removed repetitions
and false starts, performed amplitude normalization, and
segmented the utterances into individual WAV files. All of the
above were done in Audacity [40]. The utterances were
carefully trimmed to remove the leading and trailing silence and
non-speech sounds such as lip smacks.
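The amplitude normalization and silence trimming were done manually in Audacity; as an automated approximation, here is a rough numpy sketch (the peak target, silence floor, and padding are illustrative assumptions, not the settings used for the corpus):

```python
import numpy as np

def normalize_and_trim(x, sr, peak=0.95, silence_db=-40.0, pad_ms=100):
    """Peak-normalize a mono signal and trim leading/trailing silence.

    A rough stand-in for the manual Audacity editing; all thresholds
    are illustrative, not the values used for L2-ARCTIC.
    """
    x = x / (np.abs(x).max() + 1e-12) * peak        # amplitude normalization
    thresh = peak * 10.0 ** (silence_db / 20.0)     # silence floor
    voiced = np.flatnonzero(np.abs(x) > thresh)
    if voiced.size == 0:
        return x[:0]                                # all-silent input
    pad = int(sr * pad_ms / 1000)                   # keep ~100 ms of context
    start = max(voiced[0] - pad, 0)
    end = min(voiced[-1] + pad, x.size)
    return x[start:end]

# Toy example: 0.5 s of silence, a 1 s tone burst, then silence, at 16 kHz.
sr = 16000
sig = np.concatenate([np.zeros(sr // 2),
                      0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr),
                      np.zeros(sr // 2)])
trimmed = normalize_and_trim(sig, sr)
print(len(sig), len(trimmed))
```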
(Footnote: Hindi is an Indo-Aryan language that is both an L1 and a
language of wider communication. Thus, Hindi speakers in the corpus
may use Hindi as an L2, speaking another Indian language as an L1.
Educated Indian English is a stable contact variety of English.)
3.3. Corpus annotations
Our corpus provides orthographic transcriptions at the word
level. We used the Montreal Forced Aligner [41] to produce
phonetic transcriptions in Praat’s TextGrid format [42], which
contains word and phone boundaries (Figure 1). Further, we
performed manual annotations on a selected subset of sentences
for each speaker. For all the speakers, we annotated a common
set of 100 sentences. In addition, we annotated 50 sentences that
included phoneme difficulties that were L1-dependent. In the
end, the corpus contains up to 150 curated phonetic
transcriptions per speaker (some speakers did not read all
sentences, and a few sentences were removed for some speakers
because those recordings did not have the required quality).
Those transcriptions contain manually adjusted word and phone
boundaries, correct phoneme labels, mispronunciation error
tags (phone additions, deletions, and substitutions), and
comments from the annotators. To facilitate computer
processing, we used the ARPAbet phoneme set for the phonetic
transcriptions as well as the error tags. In the comment part of
the transcriptions, however, annotators were allowed to use IPA
symbols. To ensure high-quality annotations, we developed
automated scripts to check the annotation consistency and then
asked human annotators to fix problems. The annotators (N=3)
were PhD students in the Applied Linguistics and Technology
program at ISU. They were experienced in transcribing speech
samples of native or non-native English speakers.
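The TextGrid files can be read with standard tools; as an illustration, here is a minimal parser for interval tiers in Praat's long TextGrid format (the fragment below is a toy example, not an actual corpus file; for real work a dedicated library such as praatio is more robust):

```python
import re

def parse_intervals(textgrid_text, tier_name):
    """Extract (xmin, xmax, label) triples for one interval tier.

    A minimal reader for Praat's long TextGrid text format.
    """
    # Isolate the requested tier, then grab every interval inside it.
    tiers = re.split(r'item \[\d+\]:', textgrid_text)
    for tier in tiers:
        if f'name = "{tier_name}"' not in tier:
            continue
        pat = (r'intervals \[\d+\]:\s*'
               r'xmin = ([\d.]+)\s*xmax = ([\d.]+)\s*text = "([^"]*)"')
        return [(float(a), float(b), t) for a, b, t in re.findall(pat, tier)]
    return []

# A toy long-format TextGrid fragment.
tg = '''
item [1]:
    class = "IntervalTier"
    name = "phones"
    intervals [1]:
        xmin = 0.00
        xmax = 0.12
        text = "sil"
    intervals [2]:
        xmin = 0.12
        xmax = 0.21
        text = "DH"
'''
print(parse_intervals(tg, "phones"))
```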
4. Corpus statistics
In total, the dataset contains 11,026 utterances, with most
speakers recording the full ARCTIC set (1,132 utterances).
The total duration of the corpus is 11.2 hours, with an average
of 67 minutes (std: 9 minutes) of speech per L2 speaker. On
average, each utterance is 3.7 seconds in duration. The pause
before and after each utterance is generally no longer than 100
ms. Using the forced alignment results, we estimate a speech-to-
silence ratio of 7:1 across the whole dataset. The dataset
contains over 97,000 word segments, giving an average of
around nine words per utterance, and over 349,000 phone
segments (excluding silence). The phoneme distribution is
shown in Figure 2.
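Statistics such as the speech-to-silence ratio follow directly from the forced-alignment intervals. A sketch with toy intervals (the silence labels used here are assumptions; the corpus alignments may use different markers):

```python
# Toy phone intervals: (start, end, label); "sil"/"sp"/"" mark silence.
intervals = [(0.0, 0.1, "sil"), (0.1, 0.4, "DH"), (0.4, 0.8, "AH"),
             (0.8, 0.9, "sp"), (0.9, 1.5, "K"), (1.5, 1.6, "sil")]

SILENCE = {"sil", "sp", ""}

# Total durations of speech vs. silence segments.
speech = sum(e - s for s, e, lab in intervals if lab not in SILENCE)
silence = sum(e - s for s, e, lab in intervals if lab in SILENCE)

print(f"speech-to-silence ratio: {speech / silence:.1f}:1")
```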
Human annotators manually examined 1,499 utterances,
annotating 5,199 phone substitutions, 1,048 phone deletions,
and 497 phone additions. Figure 3 (a) shows the top-20 most
frequent phoneme substitution tags in the corpus. The most
dominant substitution errors were “Z→S” (voicing), “DH→D”
(fricative→stop), “IH→IY,” and “OW→AO” (use of a tense
vowel for lax, and vice versa). Each contains English phoneme
distinctions that lead to common substitution errors for varied
English learners. Figure 3 (b) shows the phone deletion errors
in the annotations. In our sample group, the most frequent
phoneme deletions were “D,” “T,” and “R,” almost always in
non-initial position. Many non-native speakers of English do
not pronounce the American English phoneme “R” in
postvocalic position (e.g., in car and farm). “T” and “D” often
occur as word endings and in consonant clusters both within
and across words, where they were often omitted. Figure 3 (c)
shows the phone addition errors in the annotations. The ones
that stood out were “AH,” “EH,” “R,” “AX” (schwa), “G,” and
“IH.” The vowel additions simplify complex syllable structures
with consonant clusters and so may serve to make the word
more pronounceable. Table 2 provides a breakdown of
pronunciation errors by L1s. Although others have used L1 to
predict L2 pronunciation errors [33, 34, 43], such predictions
are often inaccurate when applied to individual learners. Thus,
this list is meant to start a discussion of the types of errors that
actually occur in L2-ARCTIC.
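Tallies like those in Figure 3 and Table 2 can be computed by aggregating the per-phone error tags. A sketch assuming a hypothetical (canonical, produced, error-type) tag format, which may differ from the actual L2-ARCTIC annotation layout:

```python
from collections import Counter

# Hypothetical per-phone annotations: (canonical, produced, error_type),
# with error_type "s" (substitution), "d" (deletion), or "a" (addition).
tags = [("Z", "S", "s"), ("DH", "D", "s"), ("Z", "S", "s"),
        ("T", "", "d"), ("", "AH", "a"), ("IH", "IY", "s")]

# Tally each error type separately.
subs = Counter((c, p) for c, p, t in tags if t == "s")
dels = Counter(c for c, p, t in tags if t == "d")
adds = Counter(p for c, p, t in tags if t == "a")

print(subs.most_common(2))   # most frequent substitution pairs
print(dels, adds)
```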
Table 2: Most frequent errors by native language; the top-5
error occurrences are listed in descending order
Figure 1: A TextGrid with manual annotations. Top to
bottom: speech waveform, spectrogram, words, phonemes
and error tags, comments from the annotator
Figure 2: Phoneme distribution of the corpus
5. Mispronunciation detection evaluation
This section provides initial results on mispronunciation
detection using the 10 speakers that we have currently released.
Our implementation is based on the conventional Goodness of
Pronunciation (GOP) method as defined in [44]. The acoustic
model we used was a triphone model (tri6b) as defined by
Kaldi’s Librispeech training script [45]. It is a GMM trained
with 960 hours of native English speech [46], and contains
150,000 Gaussian mixtures. Three-state left-to-right HMMs
were used for non-silent sounds. The Kaldi implementation
does not have a fixed number of Gaussians for each HMM state.
The Word Error Rate (WER) of this acoustic model was around
8% on clean speech when combined with a 3-gram language model.
We used the phone-independent thresholding variation of
the GOP method to make the classification decisions, i.e., if the
GOP score of a phone segment was higher than a threshold,
then it was accepted as a correct pronunciation; otherwise, it was
rejected as an error. As a preliminary result, we only focused on
substitution errors since the GOP is not suited for detecting
additions and deletions.
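As a sketch of the phone-independent thresholding just described, the following implements the common log-posterior-ratio approximation of GOP (not the exact Kaldi recipe used in this paper); the posterior matrices and phone indices below are toy values:

```python
import numpy as np

def gop_score(posteriors, canonical_idx):
    """Posterior-based approximation of Goodness of Pronunciation.

    posteriors: (frames, phones) per-frame phone posteriors for one
    phone segment; canonical_idx: the phone the speaker was supposed
    to produce. Follows the log-posterior-ratio approximation of
    Witt & Young's GOP, averaged over the segment's frames.
    """
    log_post = np.log(posteriors + 1e-10)
    # Canonical-phone log posterior vs. the best competing phone,
    # normalized by segment duration (frame count).
    return float(np.mean(log_post[:, canonical_idx] - log_post.max(axis=1)))

def detect(posteriors, canonical_idx, threshold=-4.2):
    """Phone-independent thresholding: flag as mispronounced if below."""
    return gop_score(posteriors, canonical_idx) < threshold

# Toy segment: 3 frames, 4 phones; phone 0 is the canonical phone.
good = np.array([[0.9, 0.05, 0.03, 0.02]] * 3)   # canonical dominates
bad = np.array([[0.01, 0.9, 0.05, 0.04]] * 3)    # a competitor dominates
print(detect(good, 0), detect(bad, 0))
```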
Two hundred and six (206) utterances were withheld to
determine the search range of the phoneme-independent
detection threshold. The remaining 1,293 utterances were used
as the testing set. In the testing data, excluding the additions and
deletion tags, there are 41,353 phone samples in total, where
4,415 (10.7%) were tagged as substitution errors. We set the log
GOP threshold between -16 and 0 and made the step size 0.1.
For each experiment condition, we computed the detection
precision as P = N_correct / N_predicted and the recall as
R = N_correct / N_true, where N_correct is the number of
correctly predicted substitution errors, N_predicted is the total
number of segments predicted as substitution errors, and N_true
is the total number of substitution errors in the testing set. The
Precision-Recall curve
is shown in Figure 4. When we set the threshold to -4.2 (in log
scale), the precision equals recall (0.29). From this result, we
can see that the dataset is quite challenging, because it contains
speech data from different L1 backgrounds and recorded by
speakers with a wide range of pronunciation challenges. This
GOP implementation is open source and is available online
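The threshold sweep behind the Precision-Recall curve can be sketched as follows; the scores and labels are toy values, and only three thresholds are swept here instead of the -16 to 0 grid with step 0.1 used above:

```python
import numpy as np

def precision_recall_sweep(scores, is_error, thresholds):
    """Sweep a phone-independent GOP threshold and compute P/R.

    A segment is predicted as a substitution error when its GOP score
    falls below the threshold; is_error holds the ground-truth tags.
    """
    scores = np.asarray(scores)
    is_error = np.asarray(is_error, dtype=bool)
    curve = []
    for th in thresholds:
        pred = scores < th
        n_correct = np.sum(pred & is_error)   # correctly flagged errors
        n_pred = np.sum(pred)                 # all flagged segments
        n_true = np.sum(is_error)             # all true errors
        p = n_correct / n_pred if n_pred else 1.0
        r = n_correct / n_true if n_true else 0.0
        curve.append((th, p, r))
    return curve

# Toy GOP scores: errors tend to score lower than correct phones.
scores = [-6.0, -5.0, -1.0, -0.5, -4.5, -0.2]
is_error = [True, True, False, False, True, False]
for th, p, r in precision_recall_sweep(scores, is_error, [-5.5, -4.0, 0.0]):
    print(f"th={th:5.1f}  precision={p:.2f}  recall={r:.2f}")
```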
6. Conclusion
This paper has presented L2-ARCTIC, a new non-native
English speech corpus designed for voice conversion, accent
conversion, and mispronunciation detection tasks. Each speaker
in L2-ARCTIC produced sufficient speech data to capture their
voice identity and accent characteristics. Detailed annotations
on mispronunciation errors are also included. Thus, it is possi-
ble to use this corpus to develop and evaluate mispronunciation
detection algorithms. To the best of our knowledge, L2-
ARCTIC is the first of its kind, and we believe it fills gaps
in both voice/accent conversion and pronunciation training.
The corpus is released under the CC BY-NC 4.0 license
[47] and is available at
arctic-corpus/. Future work will focus on adding ten
more speakers to the corpus.
7. Acknowledgments
This work was supported by NSF awards 1619212 and
1623750. We would like to thank the anonymous participants
for recording the corpus. We also would like to thank Ziwei
Zhou for his assistance with the annotations. We appreciate
suggestions from Christopher Liberatore and Shaojin Ding on
early versions of this manuscript.
Figure 4: Precision-Recall curve of a phone-independent GOP
system demonstrating mispronunciation detection on L2-ARCTIC
(at threshold -4.2, precision = recall = 0.29)
Figure 3: Pronunciation error distributions per L1 (Hindi,
Korean, Mandarin, Spanish, Arabic) and the aggregated results:
(a) substitutions, (b) deletions, (c) additions. Errors with low
frequencies were omitted; all the values are the percentages with
respect to the total number of each error type (i.e., normalized
universally); “ERR” means an erroneous pronunciation that is
not in the ARPAbet phoneme set.
8. References
[1] S. H. Mohammadi and A. Kain, "An overview of voice conversion
systems," Speech Communication, vol. 88, pp. 65-82, 2017.
[2] S. Aryal and R. Gutierrez-Osuna, "Can Voice Conversion Be
Used to Reduce Non-Native Accents?," in ICASSP, 2014, pp.
[3] D. Felps, H. Bortfeld, and R. Gutierrez-Osuna, "Foreign accent
conversion in computer assisted pronunciation training," Speech
Communication, vol. 51, no. 10, pp. 920-932, 2009.
[4] K. Probst, Y. Ke, and M. Eskenazi, "Enhancing foreign language
tutors: in search of the golden speaker," Speech Communication,
vol. 37, no. 3, pp. 161-173, 2002.
[5] M. P. Bissiri, H. R. Pfitzinger, and H. G. Tillmann, "Lexical stress
training of German compounds for Italian speakers by means of
resynthesis and emphasis," in Australian International
Conference on Speech Science & Technology, 2006, pp. 24-29.
[6] J. Levis, "Computer technology in teaching and researching
pronunciation," Annual Review of Applied Linguistics, vol. 27, pp.
184-202, 2007.
[7] S. Weinberger. Speech accent archive [Online]. Available:
[8] P. Meier. IDEA: International Dialects of English Archive
[Online]. Available:
[9] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based
on maximum-likelihood estimation of spectral parameter
trajectory," IEEE Transactions on Audio, Speech, and Language
Processing, vol. 15, no. 8, pp. 2222-2235, 2007.
[10] L. Sun, S. Kang, K. Li, and H. Meng, "Voice conversion using
deep bidirectional long short-term memory based recurrent neural
networks," in ICASSP, 2015, pp. 4869-4873.
[11] G. Zhao and R. Gutierrez-Osuna, "Exemplar selection methods in
voice conversion," in ICASSP, 2017, pp. 5525-5529.
[12] Y.-C. Wu, H.-T. Hwang, C.-C. Hsu, Y. Tsao, and H.-M. Wang,
"Locally Linear Embedding for Exemplar-Based Spectral
Conversion," in Interspeech, 2016, pp. 1652-1656.
[13] J. Kominek and A. W. Black, "The CMU Arctic speech
databases," in Fifth ISCA Workshop on Speech Synthesis, 2004,
pp. 223-224.
[14] T. Toda et al., "The Voice Conversion Challenge 2016," in
Interspeech, 2016, pp. 1632-1636.
[15] J. Jenkins. (2008). English as a lingua franca. Available:
[16] K. J. Van Engen, M. Baese-Berk, R. E. Baker, A. Choi, M. Kim,
and A. R. Bradlow, "The Wildcat Corpus of native- and foreign-
accented English: Communicative efficiency across
conversational dyads with varying language alignment profiles,"
Language and speech, vol. 53, no. 4, pp. 510-540, 2010.
[17] T. Lander. CSLU: Foreign Accented English Release 1.2
LDC2007S08 [Online]. Available:
[18] T. Bent and A. R. Bradlow, "The interlanguage speech
intelligibility benefit," The Journal of the Acoustical Society of
America, vol. 114, no. 3, pp. 1600-1610, 2003.
[19] K. Li, X. Qian, and H. Meng, "Mispronunciation detection and
diagnosis in L2 English speech using multidistribution deep neural
networks," IEEE/ACM Transactions on Audio, Speech, and
Language Processing, vol. 25, no. 1, pp. 193-207, 2017.
[20] H. Yang and N. Wei, "Construction and data analysis of a Chinese
learner spoken English corpus," ed: Shanghai Foreign Language
Education Press, 2005.
[21] W. Menzel et al., "The ISLE corpus of non-native spoken
English," in Proceedings of LREC 2000: Language Resources
and Evaluation Conference, vol. 2, 2000, pp. 957-964.
[22] N. F. Chen, R. Tong, D. Wee, P. X. Lee, B. Ma, and H. Li,
"SingaKids-Mandarin: Speech Corpus of Singaporean Children
Speaking Mandarin Chinese," in Interspeech, 2016.
[23] W. Hu, Y. Qian, F. K. Soong, and Y. Wang, "Improved
mispronunciation detection with deep neural network trained
acoustic models and transfer learning based logistic regression
classifiers," Speech Communication, vol. 67, pp. 154-166, 2015.
[24] Y.-B. Wang and L.-s. Lee, "Supervised detection and
unsupervised discovery of pronunciation error patterns for
computer-assisted language learning," IEEE/ACM Transactions
on Audio, Speech and Language Processing, vol. 23, no. 3, pp.
564-579, 2015.
[25] H. Huang, H. Xu, X. Wang, and W. Silamu, "Maximum F1-score
discriminative training criterion for automatic mispronunciation
detection," IEEE/ACM Transactions on Audio, Speech, and
Language Processing, vol. 23, no. 4, pp. 787-797, 2015.
[26] G. Zhao, S. Sonsaat, J. Levis, E. Chukharev-Hudilainen, and R.
Gutierrez-Osuna, "Accent Conversion Using Phonetic
Posteriorgrams," in ICASSP, 2018.
[27] P. Pramod, "Indian English Pronunciation," in The Handbook of
English Pronunciation, M. Reed and J. Levis, Eds.: Wiley
Blackwell, 2015, pp. 301-319.
[28] S.-A. Jun, "Prosody in sentence processing: Korean vs. English,"
UCLA Working Papers in Phonetics, vol. 104, pp. 26-45, 2005.
[29] M. Ueyama and S.-A. Jun, "Focus realization of Japanese English
and Korean English intonation," UCLA Working Papers in
Phonetics, pp. 110-125, 1996.
[30] J. Anderson-Hsieh, R. Johnson, and K. Koehler, "The relationship
between native speaker judgments of nonnative pronunciation and
deviance in segmentals, prosody, and syllable structure,"
Language learning, vol. 42, no. 4, pp. 529-555, 1992.
[31] M. C. Pennington and N. C. Ellis, "Cantonese speakers' memory
for English sentences with prosodic cues," The Modern Language
Journal, vol. 84, no. 3, pp. 372-389, 2000.
[32] J. Chang, "Chinese speakers," Learner English, vol. 2, pp. 310-
324, 1987.
[33] J. Morley, "Teaching American English Pronunciation," ed:
JSTOR, 1993.
[34] B. Smith, Learner English: A teacher's guide to interference and
other problems. Cambridge University Press, 2001.
[35] M. Benrabah, "Word-stress: a source of unintelligibility in
English," IRAL-International Review of Applied Linguistics in
Language Teaching, vol. 35, no. 3, pp. 157-166, 1997.
[36] K. De Jong and B. A. Zawaydeh, "Stress, duration, and intonation
in Arabic word-level prosody," Journal of Phonetics, vol. 27, no.
1, pp. 3-22, 1999.
[37] Y. Cho and B. Bridgeman, "Relationship of TOEFL iBT® scores
to academic performance: Some evidence from American
universities," Language Testing, vol. 29, no. 3, pp. 421-442, 2012.
[38] H. Zen et al., "The HMM-based speech synthesis system (HTS)
version 2.0," in SSW, 2007, pp. 294-299.
[39] D. Erro, E. Navas, and I. Hernaez, "Parametric voice conversion
based on bilinear frequency warping plus amplitude scaling,"
IEEE Transactions on Audio, Speech, and Language Processing,
vol. 21, no. 3, pp. 556-566, 2013.
[40] Audacity®. Available:
[41] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M.
Sonderegger, "Montreal Forced Aligner: trainable text-speech
alignment using Kaldi," in Interspeech, 2017, pp. 498-502.
[42] P. Boersma, "Praat, a system for doing phonetics by
computer," Glot International, vol. 5, 2002.
[43] M. Munro, "How well can we predict L2 learners' pronunciation
difficulties?," CATESOL Journal, vol. 30, no. 1, pp. 267-282,
[44] S. M. Witt and S. J. Young, "Phone-level pronunciation scoring
and assessment for interactive language learning," Speech
communication, vol. 30, no. 2, pp. 95-108, 2000.
[45] D. Povey et al., "The Kaldi speech recognition toolkit," in IEEE
2011 Workshop on Automatic Speech Recognition &
Understanding, 2011.
[46] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur,
"Librispeech: an ASR corpus based on public domain audio
books," in ICASSP, 2015, pp. 5206-5210.
[47] Creative Commons Attribution-NonCommercial 4.0 International
Public License. Available:
... In doing such kind of input augmentation, we can not only distill knowledge about linguistic and phonetic discrimination from the frame-level output of a hybrid DNN-HMM acoustic model, but to some extent make the input to the E2E MD model less affected by a wide variety of subtle factors such as speaker, gender, age, accent, channel, and among others, which a CAPT system is often confronted with. In this work, the hybrid DNN-HMM acoustic model was trained on both the L1 (viz. the TIMIT dataset [26]) and the L2 (viz. the L2-ARCTIC dataset [27]) speech corpora and in turn used to extract the phonetic PPG vector that corresponds to each speech fame of an L2 learning's utterance. The notion of leveraging PPG-related information to replace or augment spectral features has been recently investigated voice conversion [24] and cross-accent voice conversion [25]. ...
... A series of mispronunciation detection experiments were conducted the L2-ARCTIC benchmark corpus [27]. L2-ARCTIC is an open-access L2-English speech corpus compiled for research on CAPT, accent conversion, and others. ...
... As it is evident from Figure 2, the MD performance improvements for the Hindi and Arabic speakers are more obvious than the speakers of other mother-tongue languages. In particular, as pointed out in [27], Hindi speakers often use segmental and suprasegmental features in ways that are distinct from American speakers or non-native speakers of other mothertongue languages. ...
Full-text available
Recently, end-to-end (E2E) models, which allow to take spectral vector sequences of L2 (second-language) learners' utterances as input and produce the corresponding phone-level sequences as output, have attracted much research attention in developing mispronunciation detection (MD) systems. However, due to the lack of sufficient labeled speech data of L2 speakers for model estimation, E2E MD models are prone to overfitting in relation to conventional ones that are built on DNN-HMM acoustic models. To alleviate this critical issue, we in this paper propose two modeling strategies to enhance the discrimination capability of E2E MD models, each of which can implicitly leverage the phonetic and phonological traits encoded in a pretrained acoustic model and contained within reference transcripts of the training data, respectively. The first one is input augmentation, which aims to distill knowledge about phonetic discrimination from a DNN-HMM acoustic model. The second one is label augmentation, which manages to capture more phonological patterns from the transcripts of training data. A series of empirical experiments conducted on the L2-ARCTIC English dataset seem to confirm the efficacy of our E2E MD model when compared to some top-of-the-line E2E MD models and a classic pronunciation-scoring based method built on a DNN-HMM acoustic model.
... The proposal utilizes enormous L1 ASR datasets to relieve the data scarcity of MD&D. Meanwhile, experiments conducted on the latest version of L2-ARCTIC corpus [15] verify the proposal's efficiency. ...
... Our experiments are conducted on TIMIT [17] and L2-ARCTIC (V5.0) [15] corpus. TIMIT contains recordings of 630 US native speakers and L2-ARCTIC includes recordings of 24 non-native speakers whose mother tongues are Hindi, Korean, Mandarin, Spanish, Arabic and Vietnamese. ...
... And those cases in TR are further split into correct diagnosis (CD) and diagnosis error (DE). Then the metrics for mispronunciation detection are calculated follow (10) - (15). ...
Many mispronunciation detection and diagnosis (MD&D) research approaches try to exploit both the acoustic and linguistic features as input. Yet the improvement of the performance is limited, partially due to the shortage of large amount annotated training data at the phoneme level. Phonetic embeddings, extracted from ASR models trained with huge amount of word level annotations, can serve as a good representation of the content of input speech, in a noise-robust and speaker-independent manner. These embeddings, when used as implicit phonetic supplementary information, can alleviate the data shortage of explicit phoneme annotations. We propose to utilize Acoustic, Phonetic and Linguistic (APL) embedding features jointly for building a more powerful MD\&D system. Experimental results obtained on the L2-ARCTIC database show the proposed approach outperforms the baseline by 9.93%, 10.13% and 6.17% on the detection accuracy, diagnosis error rate and the F-measure, respectively.
... Datasets. We use the publicly available datasets TIMIT [19] and L2-arctic [20] to conduct our experiments. TIMIT is a native (L1) English corpus containing 6,300 utterances from 630 speakers. ...
Mispronunciation detection and diagnosis (MDD) technology is a key component of computer-assisted pronunciation training (CAPT) systems. In assessing the pronunciation quality of constrained speech, the given transcriptions can play the role of a teacher. Conventional methods have fully utilized the prior texts for model construction or for improving system performance, e.g., forced alignment and extended recognition networks. Recently, some end-to-end based methods have attempted to incorporate the prior texts into model training and have preliminarily shown their effectiveness. However, previous studies mostly consider applying a raw attention mechanism to fuse audio representations with text representations, without taking possible text-pronunciation mismatch into account. In this paper, we present a gating strategy that assigns more importance to the relevant audio features while suppressing irrelevant text information. Moreover, given the transcriptions, we design an extra contrastive loss to reduce the gap between the learning objectives of phoneme recognition and MDD. We conducted experiments using two publicly available datasets (TIMIT and L2-Arctic), and our best model improved the F1 score from 57.51% to 61.75% compared to the baselines. Besides, we provide a detailed analysis to shed light on the effectiveness of the gating mechanism and contrastive learning on MDD.
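The gating strategy can be illustrated with an elementwise sigmoid gate that interpolates between audio and text features. The weights and toy vectors below are placeholders, not the paper's learned parameters — just a sketch of the mechanism.

```python
import math

def gated_fusion(audio, text, w_a, w_t, b):
    """Elementwise gated fusion: per dimension, a sigmoid gate decides
    how much audio vs. text information passes through."""
    fused = []
    for a, t, wa, wt, bb in zip(audio, text, w_a, w_t, b):
        g = 1.0 / (1.0 + math.exp(-(wa * a + wt * t + bb)))  # gate in (0, 1)
        fused.append(g * a + (1.0 - g) * t)
    return fused

audio = [0.5, -1.0]
text = [2.0, 2.0]
# A strongly positive bias opens the gate: output follows the audio stream.
print(gated_fusion(audio, text, [0, 0], [0, 0], [10, 10]))
# A strongly negative bias closes it: output follows the text stream.
print(gated_fusion(audio, text, [0, 0], [0, 0], [-10, -10]))
```

With learned weights, the gate can do this selectively per dimension, suppressing text features wherever they mismatch the actual pronunciation.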
... The overall length of the corpus is 27.1 hours and the average duration of speech per L2 user is 67.7 minutes. Over 238,702 word segments are included in the dataset, providing an average of about 9 words per utterance, and over 851,830 phone segments (Zhao et al., 2018). Human annotators evaluated 3,599 utterances manually, annotating 1,092 phoneme addition errors, 14,098 phoneme substitution errors and 3,420 phoneme deletion errors. ...
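As a quick sanity check, the figures quoted above are mutually consistent; the short calculation below reproduces the 27.1-hour total from the per-speaker average (assuming the 24 speakers of the current release) and tallies the annotated errors.

```python
# Reproduce the corpus totals from the quoted per-speaker figures.
speakers = 24
minutes_per_speaker = 67.7
total_hours = speakers * minutes_per_speaker / 60
print(round(total_hours, 1))  # 27.1, matching the quoted corpus length

annotated_utts = 3599
errors = {"additions": 1092, "substitutions": 14098, "deletions": 3420}
total_errors = sum(errors.values())
print(total_errors)  # 18610 annotated phoneme errors in total
print(round(total_errors / annotated_utts, 1))  # about 5.2 errors per annotated utterance
```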
This report presents state-of-the-art research in the field of Computer Assisted Language Learning (CALL). Mispronunciation detection is one of the core components of Computer Assisted Pronunciation Training (CAPT) systems, a subset of CALL. Studies on automated pronunciation error detection began in the 1990s, but the development of full-fledged CAPT systems has only accelerated in the last decade due to increases in computing power and the availability of mobile devices for recording the speech required for pronunciation analysis. Detecting pronunciation errors is a hard problem to solve, as there is no formal definition of correct and incorrect pronunciation. As a result, typically prosodic and phoneme errors such as phoneme substitution, insertion, and deletion are detected. Also, it has been agreed upon that learning pronunciation should focus on speaker intelligibility rather than on sounding like an L1 English speaker. Initially, methods were developed on a posterior likelihood called Goodness of Pronunciation, using Gaussian Mixture Model-Hidden Markov Model and Deep Neural Network-Hidden Markov Model approaches. These are complex systems to implement compared with the recently proposed ASR-based end-to-end mispronunciation detection systems. The purpose of this research is to create end-to-end (E2E) models using Connectionist Temporal Classification (CTC) and an attention-based sequence decoder. Recently, E2E models have shown considerable improvement in mispronunciation detection accuracy. This research will draw comparisons among baseline CNN-RNN-CTC models, CNN-RNN-CTC with a character sequence-based attention decoder, and CNN-RNN-CTC with phoneme-based decoder systems. This study will help in deciding on a better approach towards developing an efficient mispronunciation detection system.
... In Ref. [6], the L2-Arctic speech corpus was presented. This corpus includes recordings of speakers whose first languages are Hindi, Korean, Mandarin, Spanish, and Arabic. ...
This paper describes foundational efforts with SautiDB-Naija, a novel corpus of non-native (L2) Nigerian English speech. We describe how the corpus was created and curated as well as preliminary experiments with accent classification and learning Nigerian accent embeddings. The initial version of the corpus includes over 900 recordings from L2 English speakers of Nigerian languages, such as Yoruba, Igbo, Edo, Efik-Ibibio, and Igala. We further demonstrate how fine-tuning on a pre-trained model like wav2vec can yield representations suitable for related speech tasks such as accent classification. SautiDB-Naija has been published to Zenodo for general use under a flexible Creative Commons License.
... Datasets: We conduct our experiments on two publicly available English corpora, TIMIT [20] and the L2-ARCTIC corpus [21]. L2-ARCTIC consists of recordings from 24 speakers (12 male and 12 female) of six different first languages (Hindi, Korean, Mandarin, Spanish, Arabic and Vietnamese). ...
End-to-end models are becoming popular approaches for mispronunciation detection and diagnosis (MDD). A streaming MDD framework, which is demanded by many practical applications, still remains a challenge. This paper proposes a streaming end-to-end MDD framework called CCA-MDD. CCA-MDD supports online processing and is able to run strictly in real time. The encoder of CCA-MDD consists of a conv-Transformer network based streaming acoustic encoder and an improved cross-attention mechanism named coupled cross-attention (CCA). The coupled cross-attention integrates encoded acoustic features with pre-encoded linguistic features. An ensemble of decoders trained via multi-task learning is applied for the final MDD decision. Experiments on publicly available corpora demonstrate that CCA-MDD achieves performance comparable to published offline end-to-end MDD models.
... A series of mispronunciation detection experiments was conducted on the L2-ARCTIC benchmark corpus [26]. L2-ARCTIC is an open-access L2-English speech corpus compiled for research on CAPT, accent conversion, and other topics. ...
End-to-end (E2E) neural modeling has emerged as one predominant school of thought for developing computer-assisted pronunciation training (CAPT) systems, showing performance competitive with conventional pronunciation-scoring based methods. However, current E2E neural methods for CAPT face at least two pivotal challenges. On one hand, most E2E methods operate in an autoregressive manner with left-to-right beam search to dictate the pronunciations of an L2 learner. This, however, leads to very slow inference, which inevitably hinders their practical use. On the other hand, E2E neural methods are normally data-greedy, and an insufficient amount of non-native training data often reduces their efficacy on mispronunciation detection and diagnosis (MD&D). In response, we put forward a novel MD&D method that leverages non-autoregressive (NAR) E2E neural modeling to dramatically speed up inference while maintaining performance in line with conventional E2E neural methods. In addition, we design and develop a pronunciation modeling network stacked on top of the NAR E2E models of our method to further boost the effectiveness of MD&D. Empirical experiments conducted on the L2-ARCTIC English dataset seem to validate the feasibility of our method in comparison to some top-of-the-line E2E models and an iconic pronunciation-scoring based method built on a DNN-HMM acoustic model.
... We conducted our MDD experiments on the L2-ARCTIC dataset [22]. The L2-ARCTIC dataset is an open-access L2-English speech corpus compiled for research on CAPT, accent conversion, and other tasks. ...
End-to-end (E2E) neural models are increasingly attracting attention as a promising modeling approach for mispronunciation detection and diagnosis (MDD). Typically, these models are trained by optimizing a cross-entropy criterion, which corresponds to improving the log-likelihood of the training data. However, there is a discrepancy between the objective of model training and that of MDD evaluation, since the performance of an MDD model is commonly evaluated in terms of F1-score instead of word error rate (WER). In view of this, in this paper we explore the use of a discriminative objective function for training E2E MDD models that aims to maximize the expected F1-score directly. To further facilitate maximum F1-score training, we randomly perturb fractions of the labels of phonetically confusing pairs in the training utterances of L2 (second language) learners to generate artificial pronunciation error patterns for data augmentation. A series of experiments conducted on the L2-ARCTIC dataset shows that our proposed method can yield considerable performance improvements over some state-of-the-art E2E MDD approaches and the conventional GOP method.
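The label-perturbation augmentation described above can be sketched as randomly swapping phones with confusable counterparts at some rate p. The confusion pairs below are illustrative placeholders, not the pairs used in the paper.

```python
import random

# Hypothetical confusion pairs (ARPAbet-style); the paper's actual
# confusing pairs are not specified in this abstract.
CONFUSABLE = {"IY": "IH", "IH": "IY", "Z": "S", "S": "Z", "DH": "D", "D": "DH"}

def perturb_labels(phones, p, rng):
    """Swap each confusable phone with its counterpart with
    probability p, synthesizing artificial mispronunciations."""
    out = []
    for ph in phones:
        if ph in CONFUSABLE and rng.random() < p:
            out.append(CONFUSABLE[ph])
        else:
            out.append(ph)
    return out

# With p=1 every confusable phone is swapped.
print(perturb_labels(["DH", "IH", "S", "IH", "Z"], 1.0, random.Random(0)))
# -> ['D', 'IY', 'Z', 'IY', 'S']
```

In training, a small p keeps most labels canonical while injecting enough artificial error patterns to make the F1-oriented objective see positive (mispronounced) examples.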
Foreign accent conversion (FAC) aims to create a new voice that has the voice identity of a given second-language (L2) speaker but with a native (L1) accent. Previous FAC approaches usually require training a separate model for each L2 speaker and, more importantly, generally require considerable speech data from each L2 speaker for training. To address these limitations, we propose Accentron, an approach that can generate accent-converted speech for arbitrary L2 speakers unseen during training. In the proposed approach, we first train a speaker-independent acoustic model on L1 corpora to extract bottleneck features that represent the linguistic content of utterances. Then, we develop a speaker encoder and an accent encoder to generate embedding vectors for the desired voice identity (the L2 speaker's) and accent (the L1 accent), respectively. Lastly, we use a sequence-to-sequence model to transform the bottleneck features into Mel-spectrograms, conditioned on the L2 speaker embedding and the L1 accent embedding. We conducted experiments on the L2-ARCTIC corpus under two testing conditions: the standard FAC setting, where test L2 speakers were seen during training, and a zero-shot FAC setting, where test L2 speakers were unseen during training. Accentron achieves over 27% relative improvement in accentedness ratings compared to two state-of-the-art FAC systems in the standard FAC setting. More importantly, our results show that Accentron generalizes to the zero-shot FAC setting with no performance loss. Therefore, in practical use scenarios (e.g., computer-assisted pronunciation training software), Accentron can effectively avoid the need to adapt or retrain the model, which significantly reduces computation and the users' waiting time.
Speakers of English as a lingua franca (ELF) represent the largest contemporary group of English users around the world. Much empirical research into the phenomenon of ELF over the past 10 years or so has focused on identifying the linguistic regularities of their English in diverse contexts of use. More recently, research has demonstrated the extent to which ELF is characterized by extensive online variability, with speakers accommodating their English to an extent not found in other language use, so as to make it appropriate to the interaction in hand and the diverse interlocutors involved. The acknowledged variability of ELF has resulted in a dilemma for English language teaching and assessment. That is, grammatical and pragmatic norms conventionally associated with an idealized “standard” version of the native language no longer seem appropriate as benchmarks, because the majority of ELF use takes place among non-native English speakers. English language assessment thus needs to develop new benchmarks that are able to evaluate the degree to which ELF users’ English is fit for purpose. This chapter begins by discussing ELF from the perspective of teaching and learning, and, in doing so, dispels the myth that ELF eschews any kind of standards. It goes on to explore the notion of “standard” English and the extent to which it is grounded in language ideology rather than appropriate use of language in context, as well as the ways in which this ideology is being reinforced by the powerful transnational English language testing enterprises and, in turn, diverting attention from the changes needed in English language teaching. Finally, taking note of work in the field of critical language assessment and paying attention to empirical findings regarding ELF-mediated communication, the factors that need to be taken into account in order to enrich the agenda for future developments are considered.
Voice transformation (VT) aims to change one or more aspects of a speech signal while preserving linguistic information. A subset of VT, Voice conversion (VC) specifically aims to change a source speaker’s speech in such a way that the generated output is perceived as a sentence uttered by a target speaker. Despite many years of research, VC systems still exhibit deficiencies in accurately mimicking a target speaker spectrally and prosodically, and simultaneously maintaining high speech quality. In this work we provide an overview of real-world applications, extensively study existing systems proposed in the literature, and discuss remaining challenges.
This paper investigates the use of multi-distribution deep neural networks (DNNs) for mispronunciation detection and diagnosis (MDD), to circumvent the difficulties encountered in an existing approach based on Extended Recognition Networks (ERNs). The ERN approach leverages existing automatic speech recognition technology by constraining the search space to include the likely phonetic error patterns of the target words in addition to the canonical transcriptions. MDD is achieved by comparing the recognized transcriptions with the canonical ones. Although this approach performs reasonably well, it has the following issues: (1) learning the error patterns of the target words to generate the ERNs remains a challenging task, and phones or phone errors missing from the ERNs cannot be recognized even with well-trained acoustic models; (2) acoustic models and phonological rules are trained independently, and hence contextual information is lost. To address these issues, we propose an Acoustic-Graphemic-Phonemic Model (AGPM) using a multi-distribution DNN, whose input features include acoustic features as well as corresponding graphemes and canonical transcriptions (encoded as binary vectors). The AGPM can implicitly model both grapheme-to-likely-pronunciation and phoneme-to-likely-pronunciation conversions, which are integrated into acoustic modeling. With the AGPM, we develop a unified mispronunciation detection and diagnosis framework that works much like free-phone recognition. Experiments show that our method achieves a phone error rate (PER) of 11.1%. The false rejection rate (FRR), false acceptance rate (FAR) and diagnostic error rate (DER) for MDD are 4.6%, 30.5% and 13.5%, respectively. It outperforms the ERN approach using DNNs as acoustic models, whose PER, FRR, FAR and DER are 16.8%, 11.0%, 43.6% and 32.3%, respectively.
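An extended recognition network of the kind described above can be thought of as the cross-product of per-phone candidate sets: each canonical phone expands to its likely realizations, and the network encodes every resulting path. The toy rules below are illustrative, not the learned phonological rules from the paper.

```python
from itertools import product

# Illustrative error patterns: each canonical phone maps to its
# likely realizations (canonical one first).
RULES = {"TH": ["TH", "S", "T"], "R": ["R", "L"], "V": ["V", "W"]}

def ern_variants(canonical):
    """Enumerate the pronunciation variants an extended recognition
    network would encode for a canonical phone sequence."""
    options = [RULES.get(ph, [ph]) for ph in canonical]
    return [list(seq) for seq in product(*options)]

# "thank you": only TH has alternatives, so 3 variants are encoded.
variants = ern_variants(["TH", "AE", "NG", "K", "Y", "UW"])
print(len(variants))  # 3
```

This also makes the paper's first criticism concrete: a realization absent from RULES (say, TH -> F) can never be recognized, no matter how good the acoustic model is.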
This chapter illustrates the significant features of the General Indian English (GIE) and some of its variants. It first addresses the circumstances in which GIE has emerged as the representative variety of Indian English (IndE). Then the chapter discusses the main features of the segmental and prosodic phonology of IndE. It ends with a brief discussion of an overview of issues relating to the stability of IndE. The inventory of monophthong vowel phonemes is presented in tabular form. The vocalic allophones of GIE differ to a much greater extent than the consonant allophones from other varieties of English in terms of their phonetic realization. Studies of IndE phonology generally acknowledge the significance of the prosodic phenomena in lending it its character. The word-stress system in native English is significant on many counts. The speech rhythm of IndE is generally labeled as “syllable-timed” compared to native English, which is “stress-timed”.
This study examined the relationship between scores on the TOEFL Internet-Based Test (TOEFL iBT®) and academic performance in higher education, defined here in terms of grade point average (GPA). The academic records for 2594 undergraduate and graduate students were collected from 10 universities in the United States. The data consisted of students' GPA, detailed course information, and admissions-related test scores including TOEFL iBT, GRE, GMAT, and SAT scores. Correlation-based analyses were conducted for subgroups by academic status and disciplines. Expectancy graphs were also used to complement the correlation-based analyses by presenting the predictive validity in terms of individuals in one of the TOEFL iBT score subgroups belonging to one of the GPA subgroups. The predictive validity expressed in terms of correlation did not appear to be strong. Nevertheless, the general pattern shown in the expectancy graphs indicated that students with higher TOEFL iBT scores tended to earn higher GPAs and that the TOEFL iBT provided information about the future academic performance of non-native English speaking students beyond that provided by other admissions tests. These observations led us to conclude that even a small correlation might indicate a meaningful relationship between TOEFL iBT scores and GPA. Limitations and implications are discussed.