Improving Automatic Speech Recognition for
Non-Native English with Transfer Learning and
Language Model Decoding
Peter Sullivan, Toshiko Shibano, Muhammad Abdul-Mageed
Abstract ASR systems designed for native English (L1) usually underperform on
non-native English (L2). To address this performance gap, (i) we extend our previous
work to investigate fine-tuning of a pre-trained wav2vec 2.0 model [2, 56] under a
rich set of L1 and L2 training conditions. We further (ii) incorporate language model
decoding in the ASR system, along with the fine-tuning method. Quantifying the gains
from each of these two approaches separately, together with an error analysis, allows
us to identify different sources of improvement within our models. We find that
while the large self-trained wav2vec 2.0 may be internalizing sufficient decoding
knowledge for clean L1 speech [56], this does not hold for L2 speech, which accounts
for the utility of employing language model decoding on L2 data.
1 Introduction
Although non-native English speakers (L2) outnumber native English speakers (L1)
[7], automatic speech recognition (ASR) systems perform considerably worse on L2
speech than on L1 speech. This gap is mainly due to the influence of L1 pronunciation
on the learned language and the lack of annotated L2 speech data on which
ASR systems can be trained [42, 50]. To meet these challenges, previous work has
generally followed two distinct approaches. The first is to make L2 speech represen-
tations more closely match those of L1 speech [42]. The second approach leverages
L2 speech data to improve model robustness. Due to L2 data scarcity, this second
approach necessitates employment of transfer learning or domain adaptation [45, 47].
Peter Sullivan
The University of British Columbia, BC, Canada e-mail: prsull@student.ubc.ca
Toshiko Shibano
The University of British Columbia, BC, Canada e-mail: tshibano@student.ubc.ca
Muhammad Abdul-Mageed
The University of British Columbia, BC, Canada e-mail: muhammad.mageed@ubc.ca
State-of-the-art ASR models based on self-supervised pre-training, such as
wav2vec [44] and wav2vec 2.0 [2]¹, offer a tantalizing starting point for applying the
transfer learning approach we list above, especially given the strong performance
of self-trained wav2vec 2.0 models on ASR in low-resource settings even without a
language model [56]. However, challenges remain in identifying how best to apply
models such as wav2vec 2.0 in L2 fine-tuning scenarios. Even with the advantages of
a fine-tuned model, it is not clear whether the knowledge it acquires is orthogonal to
that of a language model, especially on L2 speech. Hence, we are interested in investigating
the practical sufficiency of fine-tuned models on their own, and the extent to
which they may benefit from external language model decoding on both L1 and L2
speech. As such, our main objective in the current work is to investigate a rich set of
conditions under which we can fine-tune ASR models for optimal L2 performance
and the utility of integrating language model decoding along with fine-tuning in
an overall ASR model. Concretely, we pursue this primary objective through the
following sub-objectives:
1. Evaluate fine-tuning and language model decoding strategies for adapting pre-trained L1 English ASR models to L2 English;
2. Explore the impact of non-native (L2) accents on performance of these ASR models fine-tuned under various conditions, comparing multi-accent training to single-accent training;
3. Quantify the impact of L2 fine-tuning on model performance for L1 English speech recognition; and
4. Analyze error categories associated with fine-tuning, as well as language model decoding.
Fig. 1 The overall ASR pipeline. We (a) evaluate performance of wav2vec 2.0 without LM using best path decoding. We also (b) incorporate language model decoding with beam search along with the fine-tuned model.
¹ Although sometimes referred to as ‘unsupervised’, these models employ a self-supervised objective.
Our investigation of the role of language model decoding in L2 ASR performance
extends our previous work [46]. We also better contextualize the magnitude of impact
of fine-tuning alone vs. fine-tuning combined with LM decoding on the downstream tasks for both
L1 and L2 speech. The rest of the paper is organized as follows: Section 2 is an
overview of related work. We introduce our methods in Section 3. We describe our
data in Section 4, and Section 5 presents our experiments and results. Section 6
offers an error analysis, and we conclude in Section 7.
2 Related Work
Because of the difficulty in linguistically annotating corpora for Hidden Markov
Model (HMM)-based ASR [12], researchers have broadly embraced End-to-End
(E2E) deep learning architectures either based on Connectionist Temporal Classifi-
cation (CTC) [13, 12], Attention [5, 4, 14], or hybrids of the two [55, 53]. Recent
efforts inspired by work such as BERT [9] have improved on these purely supervised
learning baselines through self-supervised pre-training [44, 1, 2] and self-training
[56]. These self-supervised wav2vec models represent one line of research in speech
representation. Other works include models similar to wav2vec that also use a con-
trastive loss [37], models using an autoregressive loss function [28, 6], as well as
models using a masked language model closer to the original BERT [29].
With these efforts, ASR technologies for native languages have evolved signif-
icantly. However, we still observe problems in many applications. In particular,
several researchers have emphasized how performance of ASR models drops when
the input speech is from non-native speakers whose native languages are different
from the models’ target languages [42, 31, 54, 41, 52]. For systems developed for
English ASR, this can be a real issue due to the large populations of English lan-
guage speakers who are non-native [7]. In line with this argument, Ping [41] points
out the necessity to improve speech recognition technology for L2 speakers given
that many people speak more than one language for economic and social reasons.
It is hoped that continued efforts aiming at improving ASR for non-native speakers
will eventually lead to improved results for many as voice recognition technology
becomes increasingly pervasive in our daily lives [41].
There are two distinct approaches to improve current ASR performance on L2
speech: (i) accent conversion as an extension to the active area of research of voice
conversion; and (ii) incorporation of L2 speech data, which is often limited in quantity
and quality, during the model training process. The first approach takes inspiration
from voice conversion, but instead of focusing on modifying the pitch, it modifies the
pronunciation to reduce accents. Additionally, voice conversion models aim to gen-
erate results that are speaker-dependent, while accent conversion models deal with
generalizing accents from a group of speakers, hence being speaker-independent.
With this approach, the resulting model can be used as a pre-processing step to
remove accents in the data prior to feeding these data into an ASR model. Bearman
et al. [3] adopt this approach but focus on L1 English accents, while Radzikowski
et al. [42] work on L2 English accents with speakers’ L1 being Japanese. Liu
et al. [30] take this a step further and convert Hindi-accented English to native American
English without utilizing native utterances.
The second approach often employs methods such as domain adversarial training
and transfer learning in order to utilize as much available accented speech data as
possible. Domain adversarial training (DAT) is a popular approach as it encourages
models to learn accent-invariant features [47, 19, 21]. Transfer learning is another
popular approach in L2 speech recognition, as it possibly allows a model to gain
knowledge from both the base task and the new task, even when the new task has
limited data [34, 8, 45]. In the Accented English Speech Recognition Challenge
2020 (AESRC2020), many teams utilize transfer learning to tackle the L2 accent
recognition task [45]. In a recent work, Das et al. [8] combine both DAT and transfer
learning to achieve robust accented speech recognition performance.
One method that is common in ASR systems is language model decoding, which
re-weights output probabilities to account for the greater likelihood of some word
sequences occurring in the language. Language models such as KenLM [17] give probabilities of
tokens occurring in a sequence, and thus represent corpus-level statistics of the language.
Language model decoding can help prevent unlikely sequences of words
("the mat chased the rat") from being selected, favoring more likely predictions ("the cat
chased the rat") instead.
While integration of language models has been a standard part of ASR systems,
only recent works have been able to reach parity without using an explicit language
model, either through knowledge distillation techniques [10], data augmentation
[40], or self-training [56, 48]. Language model-free ASR systems are appealing due
to their simplicity, but most still struggle with difficult ASR tasks, such as the noisy
recordings of LibriSpeech dev/test-other. To our knowledge, there has been no work
to date examining whether the properties of these systems transfer to L2 ASR.
3 Methods
We provide a background about our main methods in this section. We first introduce
transfer learning for ASR, then follow with CTC and language model decoding.
3.1 Transfer Learning
For tasks with limited labeled data, training models from scratch becomes impractical.
One approach that has seen great success is transfer learning. Transfer learning
involves taking an existing model trained on one or more tasks from a given domain
and transferring its knowledge to a target downstream task or domain [38]. Tasks
which share the same label and feature space, but perhaps differ in feature distribu-
tion, can allow for a simple transfer learning method called model adaptation [51].
This allows for simply taking an existing model and re-training (i.e., ’fine-tuning’)
it using a smaller domain-specific dataset. Model adaptation for ASR can be per-
formed easily by freezing part of an existing model and re-training the rest on the
new domain [26].
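To make this concrete, below is a minimal PyTorch sketch of model adaptation under stated assumptions: `TinyASRModel` is a toy stand-in for a pre-trained acoustic model (not wav2vec 2.0 or our actual training code), and the audio and targets are random placeholders for a smaller domain-specific dataset.

```python
import torch
from torch import nn

# Toy stand-in for a pre-trained ASR model: a "feature extractor" plus upper layers.
class TinyASRModel(nn.Module):
    def __init__(self, feat_dim=64, vocab_size=29):
        super().__init__()
        self.feature_extractor = nn.Sequential(nn.Conv1d(1, feat_dim, 10, stride=5), nn.ReLU())
        self.encoder = nn.GRU(feat_dim, 128, batch_first=True)
        self.head = nn.Linear(128, vocab_size)

    def forward(self, audio):                       # audio: (batch, 1, samples)
        feats = self.feature_extractor(audio)       # (batch, feat_dim, frames)
        out, _ = self.encoder(feats.transpose(1, 2))
        return self.head(out).log_softmax(-1)       # per-frame log-probs for CTC

model = TinyASRModel()  # imagine loading pre-trained weights here

# Model adaptation: freeze the feature extractor, re-train (fine-tune) the rest.
for param in model.feature_extractor.parameters():
    param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5)

# One illustrative CTC training step on random "domain-specific" data.
ctc_loss = nn.CTCLoss(blank=0)
audio = torch.randn(2, 1, 16000)                    # placeholder batch of audio
targets = torch.randint(1, 29, (2, 20))             # placeholder label sequences
log_probs = model(audio)                            # (batch, frames, vocab)
input_lengths = torch.full((2,), log_probs.shape[1], dtype=torch.long)
target_lengths = torch.full((2,), 20, dtype=torch.long)
loss = ctc_loss(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
loss.backward()
optimizer.step()
```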
One particularly promising base model for transfer learning is wav2vec 2.0 [2],
which is composed of a multi-layer convolutional neural network feature extractor
and a Transformer context network. The network uses a contrastive task for self-
supervised pre-training to learn good general representations of audio. Following
pre-training, the CNN feature extractor layers of the model are frozen, and the
model is fine-tuned on domain specific tasks by adding a linear layer on top of the
Transformer context network followed by training with CTC loss [2].
While the original models are strong baselines, the self-trained wav2vec 2.0 Large
(LV-60) version of the model [56], which we will refer to as Wav2Vec 2.0-ST²,
extends the original wav2vec 2.0 work by applying a self-training approach. The
model is pre-trained on 960 hours of speech data from LibriSpeech [39], followed
by self-training on 53.2k hours of Libri-Light [24]. During the self-training process,
pseudo-labels are generated using language models trained on the LibriSpeech
corpus, allowing knowledge to be transferred from the language model into the ASR
model proper and ultimately resulting in a model with little need for an external
language model at inference time [56].
Fine-tuning of pre-trained wav2vec 2.0 is performed with CTC and the transcriptions
of the audio segments. For each model, we identify the optimal hyperparameters
on the respective Dev set. We choose hyperparameters as follows: for
mask_feature_prob, we pick from {0.25, 0.5}; for mask_feature_length, we
choose from {15, 30}; for mask_time_prob, we use {0.5, 0.75}; and we use a batch size of
16. To mimic the tri-state learning rate schedule [2], we set different learning rates
for different stages: warm-up (1e-5, 3e-5), constant stage (1e-5, 3e-5), and decay
(1e-5, 3e-5, 5e-6). The decay stage is followed by another constant stage (1e-5, 2e-6,
5e-6) to simulate Fairseq's fine-tuning configuration.
² https://github.com/pytorch/fairseq/tree/master/examples/wav2vec
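As an illustration of this schedule (not our exact training script; the stage lengths below are arbitrary placeholders, and only the peak and final rates correspond to values listed above), the tri-state shape can be expressed with a PyTorch LambdaLR:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def tri_state_schedule(warmup_steps, constant_steps, decay_steps, final_scale):
    """LR multiplier: linear warm-up, constant hold, linear decay to
    `final_scale`, then a final constant stage at `final_scale`."""
    def multiplier(step):
        if step < warmup_steps:                                   # warm-up stage
            return step / max(1, warmup_steps)
        if step < warmup_steps + constant_steps:                  # constant stage
            return 1.0
        progress = (step - warmup_steps - constant_steps) / max(1, decay_steps)
        if progress < 1.0:                                        # decay stage
            return 1.0 - (1.0 - final_scale) * progress
        return final_scale                                        # final constant stage
    return multiplier

params = torch.nn.Linear(10, 10).parameters()       # placeholder for model parameters
optimizer = torch.optim.AdamW(params, lr=3e-5)      # peak learning rate
scheduler = LambdaLR(optimizer, tri_state_schedule(
    warmup_steps=500, constant_steps=2000, decay_steps=2500,
    final_scale=2e-6 / 3e-5))                        # end near the final constant rate

for step in range(10):          # inside a training loop: optimizer.step(), then scheduler.step()
    optimizer.step()
    scheduler.step()
```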
3.2 CTC Decoding
Because the output of CTC-trained models is a table of character probabilities for
each timestep, this output must be decoded to find the most probable sequence of
characters. One simple approach is to use a best path decoding strategy (see top
left of Fig. 1), which simply involves outputting the highest-probability token for
each timestep, condensing duplicate tokens, and removing CTC blank symbols [11].
Following Graves [11], we can write the decoding as:

$W^* = \arg\max_{W} p(W \mid X) \quad (1)$

where $W^*$ is our most likely sequence of characters (the labeling) and $p(W \mid X)$ is
the probability of a labeling given a signal $X$. We can then write best path decoding as:

$W^* \approx \mathcal{F}(\pi^*) \quad (2)$

where $\mathcal{F}$ is the CTC collapsing function, which removes duplicate letters and blank
tokens, and $\pi^*$ is the path formed by taking the highest activation in the CTC output at each
timestep. The simplicity of this method allows for fast predictions, but at the cost of potential errors
introduced by not considering combined probability states. As Graves [11] notes,
this matters when the "label is weakly predicted for several consecutive timesteps" (p. 71).
Several algorithms have been introduced to address this shortcoming: prefix search,
which accounts for the probability of child nodes in the search graph [11]; token passing,
which allows integration of a dictionary [11]; and decoding with attention, which uses
a secondary RNN model to correct errors [58].
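As a concrete illustration of Equation 2, here is a minimal sketch of best path decoding, assuming a `log_probs` array of shape (time, vocab) and a blank symbol at index 0 (the toy vocabulary is purely illustrative):

```python
import numpy as np

def best_path_decode(log_probs: np.ndarray, vocab: list, blank_id: int = 0) -> str:
    """Greedy CTC decoding: take the argmax at each timestep (pi*),
    collapse repeated tokens, and drop blanks (the collapsing function F)."""
    best_path = log_probs.argmax(axis=-1)          # pi*: one token id per timestep
    output, previous = [], None
    for token_id in best_path:
        if token_id != previous and token_id != blank_id:
            output.append(vocab[token_id])
        previous = token_id
    return "".join(output)

# Toy example: vocabulary with the blank token at index 0.
vocab = ["<blank>", "a", "c", "t"]
log_probs = np.log(np.array([
    [0.1, 0.1, 0.7, 0.1],   # c
    [0.1, 0.6, 0.2, 0.1],   # a
    [0.1, 0.6, 0.2, 0.1],   # a (repeat, collapsed)
    [0.7, 0.1, 0.1, 0.1],   # blank (removed)
    [0.1, 0.1, 0.1, 0.7],   # t
]))
print(best_path_decode(log_probs, vocab))  # -> "cat"
```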
Many decoding strategies also aim to integrate a language model, which allows lexical
information to be incorporated into the decoding process. N-gram language models
can be formalized as:

$p(w_i \mid w_{i-1}, w_{i-2}, \ldots, w_{i-n+1}) \quad (3)$

where $w_i$ is the $i$th word in the sequence, whose probability we would like to estimate,
and $n$ is our n-gram size. Probabilities are generally calculated from a text corpus,
either through efficient statistical packages such as KenLM [17] or by training neural
networks to generate probability distributions over tokens.
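As a small usage sketch (assuming the kenlm Python bindings are installed and that a trained model exists at the hypothetical path lm.arpa), such a model can be queried for sentence log-probabilities; the more plausible sentence from the "the cat chased the rat" example in Section 2 should receive the higher score.

```python
import kenlm

lm = kenlm.Model("lm.arpa")  # hypothetical path to a trained n-gram model

# Total log10 probability of each sentence (with begin/end-of-sentence tokens).
likely = lm.score("the cat chased the rat", bos=True, eos=True)
unlikely = lm.score("the mat chased the rat", bos=True, eos=True)

print(likely, unlikely)  # we expect the first (more plausible) sentence to score higher
```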
Additional decoding strategies that use language model probability re-weighting
include modified beam search strategies [12, 16], weighted finite state transducers
[36, 43], character-level recurrent neural network (RNN) decoding [33, 22], and
word-level RNN decoding [18].
In our experiments, we choose to apply the prefix beam search strategy for both
decoding and including an external language model (see top right of Fig. 1). Instead
of rehashing the full prefix beam search algorithm (see [16]), we focus on the
main components needed to understand the hyperparameter optimization process
of this decoding strategy. Prefix beam search attempts to find transcriptions which
maximize the following equation (see [16]):
$p_{CTC}(W; X)\, p_{LM}(W)^{\alpha}\, |W|^{\beta} \quad (4)$

Here $p_{CTC}(W; X)$ is the probability our CTC-trained neural network assigns to a character
sequence $W$ given an input audio $X$, $p_{LM}(W)$ is the language model probability of
sequence $W$, $\alpha$ is the language model weight term, and $\beta$ is a word insertion penalty
modifier. The algorithm that maximizes the value in Equation 4 is similar to normal beam search
in the sense that it keeps track of a set of at most $k$ possible contenders, where $k$ is called
the beam width [32, 35]. For CTC, the complexities of duplicates and blank tokens
mean that the actual probability of a given proposed sequence needs to be calculated
as follows:

$p(\ell; x_{1:t}) = \big(p_{b}(\ell; x_{1:t}) + p_{nb}(\ell; x_{1:t})\big)\, |W(\ell)|^{\beta} \quad (5)$

where $p(\ell; x_{1:t})$ is the probability of a given prefix, $p_{b}(\ell; x_{1:t})$ is the probability
of a blank token being appended to the current sequence, $p_{nb}(\ell; x_{1:t})$ is the
probability of the next token being a character or punctuation (i.e., non-blank),
and $|W(\ell)|^{\beta}$ is the word insertion term based on the words $W(\ell)$ in the proposed
sequence $\ell$. A list of these probabilities is kept and updated based on the probabilities
of each of the characters in the next segment of the CTC output table. When a space
character is added to an existing sequence, the sequence probability is multiplied by the
language model term $p(W(\ell^{+}) \mid W(\ell))^{\alpha}$, where $W(\ell^{+})$ is
the new set of words in the sequence. The values of $\alpha$, which indicates how much to
emphasize the language model, and $\beta$ must be set via a hyperparameter tuning
process.
In our experiments with adding language model decoding, we use the pyctcdecode³
implementation of prefix beam search. It functions much the same way as
normal prefix beam search, differing only in several minor ways: first, by using
caching to speed up the decoding process, and second, by adding a partial word score
which penalizes out-of-vocabulary word matches (based on checking whether the
prefix is in a trie of the unigram vocabulary). For hyperparameter tuning, we perform
a small grid search using the development set of L2-ARCTIC, with the ranges
$\alpha \in \{0.5, 1, 1.5\}$ (considering both down-weighting and over-emphasizing the LM),
$\beta \in \{0.5, 1, 1.5\}$, and beam width $\in \{50, 100, 150, 200\}$, with the final hyperparameters
being $\alpha = 1$, $\beta = 1.5$, beam width $= 200$. For experiments on LibriSpeech, we similarly
set hyperparameters on the development set (dev-other) and find that the combination
$\alpha = 0.5$, $\beta = 0.5$, beam width $= 100$ works best. To ablate the contribution of the
language model, we also conduct an experiment on the full splits of L2-ARCTIC
with $\alpha = 0$, effectively neutralizing the impact of the language model, keeping the
rest of the hyperparameters the same.
³ https://github.com/kensho-technologies/pyctcdecode
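A minimal usage sketch of this setup follows; it assumes a character label list matching the acoustic model's CTC vocabulary, a KenLM file at the hypothetical path lm.arpa, and stand-in log-probabilities in place of real model output, and exact argument names may differ across pyctcdecode versions.

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

# Labels must match the CTC output vocabulary; "" marks the blank token here.
labels = [""] + list(" abcdefghijklmnopqrstuvwxyz'")

# Decoder with an external KenLM model; alpha/beta as selected on the L2-ARCTIC dev set.
decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="lm.arpa",  # hypothetical path to the 4-gram model
    alpha=1.0,                   # language model weight
    beta=1.5,                    # word insertion bonus/penalty
)

# Stand-in (time, vocab) log-probabilities; in practice these come from the acoustic model.
dummy = np.random.rand(50, len(labels)).astype(np.float32)
logits = np.log(dummy / dummy.sum(axis=1, keepdims=True))

transcript = decoder.decode(logits, beam_width=200)
print(transcript)
```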
4 Data
4.1 Corpus Information
We choose L2-ARCTIC, a non-native English speech corpus [59], for L2 fine-
tuning. The recordings are from 24 non-native speakers of English with a total of
six different L1s, and each of the L1s consists of two female speakers and two male
speakers. The L1s we use for our experiments are Arabic (AR), Hindi (HI), Korean
(KO), Mandarin (ZH), Spanish (ES), and Vietnamese (VI). Because L2-ARCTIC
is based on the original L1 English corpus, CMU ARCTIC [25] (henceforth L1-
ARCTIC, for simplicity), we can easily evaluate performance from fine-tuning on
same-domain L1 data.
Each speaker in L2-ARCTIC contributed approximately one hour of phonetically-
balanced read speech based on the L1-ARCTIC prompts, which consist of carefully
selected sentences (1,132 sentence prompts) from Project Gutenberg [25]. We note
this, as the pretrained wav2vec 2.0 model we use was first pre-trained on LibriSpeech⁴ [39]
and then self-trained on Libri-Light⁵ [24]. Both corpora rely on
audiobooks from the LibriVox project⁶, much of which comes from Project Gutenberg⁷.
However, because the ARCTIC corpus was selected to create a good phonological
balance of sentences and is weighted towards fiction [25], there may be a domain
mismatch between the sets of texts selected for these different corpora, and we
aim to measure this with experiments using L1 fine-tuned models. Finally, we ensure
there is no overlap in sentences between our L2-ARCTIC dev and test sets and the
LibriSpeech training sets.
We also evaluate our fine-tuned models on 1) LibriSpeech to compare the fine-
tuning with the original performance of Wav2Vec 2.0-ST. In addition, we evaluate on
2) L1-ARCTIC, identical to our L2-ARCTIC corpus but spoken by four native US
English speakers, allowing us to identify any degradation on same-domain L1 speech
performance, as well as estimate potential domain mismatch between the LibriSpeech
corpus used to train Wav2Vec 2.0-ST and ARCTIC. Each of the L1-ARCTIC speakers'
datasets contains approximately the same number of utterances ($n \approx 1{,}132 \times 4$) as
each of the L2-ARCTIC speakers' datasets.
For the purpose of our experiments, we define native (L1) accents as those
represented in the LibriSpeech and L1-ARCTIC, and non-native (L2) accents as
those represented in L2-ARCTIC.
⁴ http://www.openslr.org/12/
⁵ https://github.com/facebookresearch/libri-light
⁶ https://librivox.org
⁷ http://www.gutenberg.org
Table 1 Summary of data splits, fine-tuning, and evaluation setups.

                                    Accent dependency           Speaker dependency
                                    Dependent    Independent    Dependent    Independent
Multi-accent    Model-1 (Split 1)       x                           x
                Model-2 (Split 2)       x                                        x
                Model-3 (Split 3)                     x                          x
Single-accent   Model-4 (Split 4)       x             x             x            x
                Model-5 (Split 5)       x                                        x
4.2 Data Splits
For both L2-ARCTIC and L1-ARCTIC, we split the data into three distinct Train,
Dev, and Test sets with an 80:10:10 ratio. Importantly, we ensure there is no overlap
between utterances. For L2-ARCTIC, we split the data across the following settings
(see Fig. 2; a minimal sketch of the splitting procedure follows the list).
Fig. 2 The various data splits we use in our experiments. Each color represents a different run of our training, with the rainbow blocks in Split 4 being present in all runs. For cross validation splits, we show a single fold as an example. Speakers are indicated by pattern, with ‘held-out’ speakers blacked out in the training set.
•Split-1 (speaker-dependent, multi-accent split): All speakers from all accents in
the Train set are also included in the Dev and Test sets; however, no utterances
are shared between Train, Dev, and Test.
•Split-2 (speaker-independent cross-validation splits with multiple accents): A
speaker from each accent⁸ is removed from the Train and Dev sets, but the other
speakers with the same accent remain in the Train and Dev sets.
•Split-3 (speaker-independent zero-shot splits with multiple accents): All speakers
from one of the accents are entirely removed from the Train and Dev sets. The
removed speakers are included in Test.
•Split-4 (all-speaker, single-accent split): Speakers are broken down by accents
(six accents in total) and all speakers in a given accent are split into the Train,
Dev, and Test sets (3 data splits x 6 accents).
•Split-5 (speaker-independent cross-validation splits with single accent): One
speaker in each accent is removed from the Train and Dev sets, but the other
speakers with the same accent remain in the Train and Dev sets. As there are four
speakers per accent, four splits are created for each accent, which are further split
into the Train, Dev, and Test sets (3 data splits x 6 accents x 4 speakers).
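Below is the promised sketch of prompt-level splitting. It assumes a hypothetical `utterances` list of (speaker, accent, prompt_id) tuples rather than the real corpus layout, and it illustrates only the utterance-overlap constraint, not the speaker or accent hold-outs of Splits 2, 3, and 5.

```python
import random

def split_by_prompt(utterances, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Assign whole sentence prompts to Train/Dev/Test so that no prompt
    (and hence no utterance text) is shared across the three sets."""
    prompt_ids = sorted({u[2] for u in utterances})
    random.Random(seed).shuffle(prompt_ids)
    n = len(prompt_ids)
    n_train, n_dev = int(ratios[0] * n), int(ratios[1] * n)
    buckets = {pid: "train" for pid in prompt_ids[:n_train]}
    buckets.update({pid: "dev" for pid in prompt_ids[n_train:n_train + n_dev]})
    buckets.update({pid: "test" for pid in prompt_ids[n_train + n_dev:]})
    splits = {"train": [], "dev": [], "test": []}
    for utt in utterances:
        splits[buckets[utt[2]]].append(utt)
    return splits

# Toy demonstration with hypothetical speaker/accent/prompt identifiers.
demo = [("spk1", "AR", f"p{i:04d}") for i in range(1, 11)]
print({name: len(items) for name, items in split_by_prompt(demo).items()})
```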
5 Experiments
For all our wav2vec 2.0 models, we use Fairseq⁹ fine-tuning default settings as
a reference and convert the hyperparameters to align with Huggingface's implementation.
We evaluate all our models in terms of word error rate (WER). For L2
fine-tuning, we train each model with three random seeds and report the average
WER. Our experiment code is available online¹⁰.
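For reference, WER is the word-level edit distance between reference and hypothesis normalized by the reference length; a minimal sketch using the jiwer package (one common implementation, not necessarily the exact tooling behind our numbers) is shown below.

```python
import jiwer

references = ["at lake linderman i had one canoe very good peterborough canoe"]
hypotheses = ["at lake linderman i had once canoe very good peterborough canoe"]

# One substitution over 11 reference words -> WER of about 0.0909
print(jiwer.wer(references, hypotheses))
```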
5.1 Baselines
We use the following baselines:
•Baseline-I: We use Wav2Vec 2.0-ST as a baseline, due to its strong performance
on L1 English speech. We use the model released via HuggingFace¹¹.
•Baseline-II: This is Wav2Vec 2.0-ST, the same as Baseline-I, fine-tuned on L1-ARCTIC
described earlier. The purpose of Baseline-II is to measure potential domain shift
between LibriSpeech/Libri-Light and ARCTIC, as well as to measure potential trade-offs
from the fine-tuning process itself.
⁸ We use the term ‘accent’ here to loosely refer to variation in speakers with L1 other than English.
⁹ https://github.com/pytorch/fairseq
¹⁰ https://github.com/UBC-NLP/L2ASR
¹¹ https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self

Table 2 Model-1 performance in word error rate (WER) (lower is better) on non-native accents (L2-ARCTIC) and native accents (L1-ARCTIC, LSdev and LStest). Baseline-I and Baseline-II are reported on the same Dev and Test sets of each corpus for comparison.

                L2-ARCTIC         L1-ARCTIC         LSdev             LStest
Model           Dev      Test     Dev      Test     Clean    Other    Clean    Other
Baseline-I      13.47    12.47    2.30     2.23     1.69     3.55     1.86     3.89
Baseline-II     17.29    15.95    1.26     1.30     2.19     5.13     2.32     5.00
Model-1         9.78     9.27     1.94     1.86     2.75     5.55     2.82     6.36
5.2 Multi-Accent Models
With our multi-accent models, we examine performance using multiple accents
during training. We introduce each of our models here, and present the results
acquired with each. We provide a summary of our different data splits and models
across accent and speaker dependency categories in Table 1.
Model-1 (speaker- and accent-dependent): The model is fine-tuned with Split-
1 data to identify any speaker-dependent training impact, as well as an upper limit
on performance. In addition to evaluating on L2-ARCTIC Test, we evaluate on L1-
ARCTIC Test and LibriSpeech in order to observe any changes in model performance
on L1 English.
As Table 2 shows, our Model-1 achieves best performance on both Dev and Test
of L2-ARCTIC as compared to our two baselines. On Test, our Model-1 achieves a
25.66% relative improvement over our Baseline-I wav2vec 2.0 system on L2-ARCTIC (9.27
WER for our model vs. 12.47 WER for Baseline-I). This gain is not surprising and
simply means that a model with access to L2 data for fine-tuning will improve over
models fine-tuned with L1 data (Baseline-II, which is fine-tuned on L1-ARCTIC) or
not-fine-tuned at all (Baseline-I). Nor is performance on L1-ARCTIC surprising:
a model fine-tuned with native data (Baseline-II) outperforms one fine-tuned with
accented data (our Model-1), both of which outperform a model without fine-tuning
(Baseline-I). These results, however, show that in absence of L1 data, L2 data can
be valuable for improving ASR model performance even on L1. For LibriSpeech,
Baseline-I, which is trained on LibriSpeech data, outperforms the two fine-tuned
models (our Model-1 and Baseline-II). The reason is that these two latter models
are fine-tuned on a domain that is different from LibriSpeech. That is, fine-tuning
models on out-of-domain data will, as we see here, result in deterioration
of performance on in-domain data. We also note that our Model-1's performance
on LibriSpeech is worse than that of Baseline-II on both the ‘Clean’ (LSClean, native
speech under quiet recording environments) and ‘Other’ (LSOther, both noisy-environment
and accented recordings) Dev and Test splits. This may be because
LibriSpeech is mostly composed of L1 data and because of the greater variability in our
L2-ARCTIC Train set (24 non-native speakers in our Model-1 vs. 4 native speakers in
Baseline-II).

Table 3 Model-2 cross-validated performance on L2-ARCTIC Dev and Test sets, alongside Baseline-I and Baseline-II performance on the same cross validation splits. Mean refers to the average WER over the four runs and SD refers to the standard deviation.

                DevL2               TestL2
Model           Mean      SD        Mean      SD
Baseline-I      13.47     0.23      12.47     0.84
Baseline-II     17.29     0.41      15.96     1.58
Model-2         9.57      0.19      9.96      0.64
Model-2 (speaker-independent, accent-dependent): While Model-1 mimics a
situation where we have some training data from speakers that we serve (i.e., test
on), this is rarely a realistic scenario. We instead switch to a speaker-independent
(but still accent-dependent) setting, Split-2. We carry out four-fold cross-validation
with the 24 speakers in the data, every time using 18 speakers (three speakers per
accent) in Train¹² and six speakers in Test (one per accent). We report the average
of the four folds/runs, along with standard deviation.
As Table 3 shows, Model-2 performance is consistent with that of Model-1. Our Model-2
outperforms the two baselines on both Dev and Test, reaching 9.96 WER on Test
compared to 12.47 for Baseline-I and 15.96 for Baseline-II. These results demon-
strate that fine-tuning with multiple accents improves the accented ASR system
without access to test speaker data.
Model-3 (speaker- and accent-independent): To evaluate performance on un-
seen accents, we adopt a zero-shot strategy by removing one accent at a time from
both Train and Dev sets and evaluating on the Test set of the removed accent, Split-3.
To evaluate model performance on each accent, we conduct six runs in total with
one accent removed at a time.
As Table 4 shows, fine-tuning on accented speech benefits unseen accents and
speakers (Model-3 setting). All the multi-accent, zero-shot models outperform
Baseline-I and Baseline-II, which means each of the six accents benefits from the other
accents through this process of transfer learning. Our results also show that, in ab-
sence of in-accent data, some unseen accents are easier for the model than others.
For example, on Testzeroshot, Vietnamese (VI) is the most challenging (with 18.81
WER) and Hindi (HI) is the least challenging (with only 6.67 WER).
¹² We use 10% of the utterances from these 18 speakers for development (Dev).
Table 4 Model-3 setting, where a different accent is removed in each run. Testall refers to the Test set of all 24 speakers, and Testzeroshot refers to the Test set of the four speakers whose L1 is the removed accent. Baseline-I acquires 12.47 on Testall, while Baseline-II acquires 15.95 on the same test set (i.e., Testall).

              Baseline-I      Baseline-II     Model-3
L1removed     Testzeroshot    Testzeroshot    DevL2     Testzeroshot    Testall
VI            23.30           28.81           7.96      18.81           9.43
ZH            14.85           19.32           9.02      12.13           9.08
AR            10.95           14.82           9.40      10.10           9.13
ES            10.48           13.48           9.38      8.89            8.98
KO            8.18            10.22           10.10     6.95            9.01
HI            6.93            8.93            10.29     6.67            9.11
5.3 Accent-Specific Models
We evaluate the accent-dependent performance by fine-tuning our models on a single
type of L1-specific accent at a time.
Model-4 (speaker-dependent, accent-dependent): The model is fine-tuned with
Split-4 data to identify any accent-dependent training impact on downstream per-
formance, as well as an upper bound on performance when the model is optimized
for a single accent. In addition to evaluating on L2-ARCTIC Test, we test the model
on L1-ARCTIC Test and LibriSpeech as a means to identify any degradation on L1
English data.
As Table 5 shows, while the multi-accent model (Model-1) outperforms Baseline-
I for all six accents, all of the accent-specific models (Model-4 setting) outperform
Model-1 on the TestL2 setting despite the small amount of data (roughly five hours)
used for fine-tuning each of the versions of Model-4.
Fig. 3 Model-4 correlations of fine-tuning accent vs. test-accent percent performance change. Here we present the correlations based on Table 6, to show the accents that most benefit each other.
Table 5 Model-4 performance on L2 accent (TestL2) and native accent (TestL1, LSClean, LSOther), compared with Baseline-I, Baseline-II, and Model-1. SD refers to the standard deviation.

        Baseline-I    Baseline-II    Model-1    Model-4
L1      TestL2        TestL2         TestL2     TestL2    TestL1    LSClean    LSOther
VI      23.30         28.81          15.14      12.12     2.02      3.08       6.96
ZH      14.85         19.32          11.49      8.95      1.82      2.84       6.22
AR      10.95         14.82          8.90       6.92      1.55      2.66       6.24
ES      10.48         13.48          8.92       6.68      1.56      2.53       6.11
KO      8.18          10.22          6.60       4.99      1.71      2.51       5.63
HI      6.93          8.93           5.51       4.99      1.52      2.36       6.05
Mean    12.45         15.93          9.43       7.44      1.70      2.66       6.20
SD      5.97          7.30           3.49       2.72      0.20      0.26       0.43
Table 6 Model-4 performance in the zero-shot setting. Bold fonts represent the accent whose WER drops the most in the zero-shot setting. For example, compared with Baseline-I, the VI-specific fine-tuning not only improves performance on VI (i.e., a drop in WER), but also improves on ZH despite ZH being the unseen accent. One notable pattern is that HI-specific fine-tuning only benefits HI-accented speech recognition, while all the other fine-tuning runs hinder performance on the HI accent.

               VI        ZH        AR        ES        KO        HI
Baseline-I     23.30     14.85     10.95     10.48     8.18      6.93
VI-specific    12.12     13.62     13.01     9.95      8.55      9.62
  ΔWER         -11.18    -1.23     2.06      -0.53     0.37      2.69
  Δ%           -48.00    -8.31     18.84     -5.03     4.52      38.77
ZH-specific    20.37     8.95      11.42     9.79      6.82      10.91
  ΔWER         -2.93     -5.90     0.47      -0.69     -1.36     3.98
  Δ%           -12.58    -39.75    4.26      -6.62     -16.67    57.43
AR-specific    23.88     14.86     6.92      9.86      9.16      7.74
  ΔWER         0.58      0.01      -4.03     -0.62     0.98      0.81
  Δ%           2.47      0.07      -36.83    -5.92     11.94     11.69
ES-specific    20.71     13.99     11.00     6.68      7.92      8.66
  ΔWER         -2.59     -0.86     0.05      -3.80     -0.26     1.73
  Δ%           -11.13    -5.81     0.43      -36.23    -3.22     25.01
KO-specific    20.07     12.12     11.66     10.04     4.99      9.09
  ΔWER         -3.23     -2.73     0.71      -0.44     -3.19     2.16
  Δ%           -13.88    -18.38    6.45      -4.23     -39.04    31.17
HI-specific    26.18     18.39     13.51     11.90     10.72     4.99
  ΔWER         2.88      3.54      2.56      1.42      2.54      -1.94
  Δ%           12.37     23.82     23.35     13.55     31.01     -27.99
On average, the Model-4 setting is two points WER better than Model-1. In addition, Model-4 type models (each of
which is fine-tuned on one non-native accent) perform reasonably well on L1 data
(TestL1, LSClean, and LSOther). Further, large accent-specific variability is observed
across different model types on TestL2 (SD = [2.72-7.30]), compared with native
counterparts such as TestL1 (SD = [0.20-0.43]). An interesting result is the apparent
difficulty difference between accents (HI and KO easiest, VI hardest),
regardless of model type. We provide sample outputs from Model-4 in Table 11.
Table 7 Model-5 performance on L2 accent. Testall contains utterances by all speakers within each L1, whereas Testzeroshot-speaker contains utterances by a single speaker that is absent in the training phase. Mean refers to the average WER over four folds for each L1, and SD refers to the standard deviation.

        Testall               Testzeroshot-speaker
L1      Mean      SD          Mean      SD
VI      12.67     0.38        14.28     4.87
ZH      9.65      0.31        11.26     3.03
AR      7.28      0.29        8.56      2.28
ES      6.95      0.26        7.76      3.99
KO      5.22      0.18        5.69      2.20
HI      5.27      0.11        5.79      1.12
As shown in Table 6, we also perform accent-wise zero-shot evaluation. Results
of this set of experiments reveal an interesting pattern: while fine-tuning on a single
accent generally benefits at least one other accent, fine-tuning on the Hindi accent
only benefits Hindi (the same accent) and hinders performance on all the other
accents.
Strong to moderate positive correlations (see Fig. 3) are observed among ZH, KO,
and VI (0.79 between ZH and KO; 0.44 between VI and ZH; 0.34 between VI and
KO). On the contrary, HI has negative correlations with all the other L1s. Its strong
negative correlations with ZH, KO, and VI (-0.95, -0.73, and -0.67, respectively) suggest
that the more we fine-tune on HI accents, the more detrimental it is to the performance on
those three accents (and vice versa: those three accents would have negative impacts
on HI performance).
Model-5 (speaker-independent and accent-dependent): This setup simulates
a more realistic scenario where we target a single accent, without access to all
speakers during development time. Thus, we use Split-5 data which mimics a speaker-
independent setting. We cross-validate each L1 subset, holding out one of the four speakers
per fold. The hyperparameters we use are those identified for Model-4. To evaluate
the performance on each speaker, we conduct 24 folds in total with one speaker
removed at a time, and report the average and standard deviation of the four folds
per accent. As Table 7 shows, speaker-dependent variability is small for Testall
(SD = [0.11-0.38]) but large for Testzeroshot-speaker (SD = [1.12-4.87]). These
results suggest that individual speaker differences may play an important role in
how much performance gain can be obtained by fine-tuning.¹³
Table 8 Comparison of models with and without language model decoding on the full L2-ARCTIC Test set. We further ablate this by setting α to 0 to demonstrate performance with beam search but without language model re-weighting. ΔWER indicates increase (+) or decrease (-) in WER given as a percent relative to the No LM results.

            Best Path ↓    Beam Search ↓    ΔWER ↓
B1          12.47          8.43             -32.40%
B1 (α=0)    12.47          12.74            +2.17%
B2          15.95          11.96            -25.02%
B2 (α=0)    15.95          16.38            +2.70%
M1          9.27           5.53             -40.32%
M1 (α=0)    9.27           9.42             +1.62%
Table 9 Comparison of models with and without language model decoding on the language background splits (subscript). Results in WER, and relative decrease (-) in WER (ΔWER%).

          Best Path ↓    Beam Search ↓    ΔWER% ↓
B1VI      23.30          17.01            -27.00%
B2VI      28.81          22.60            -21.56%
M4VI      12.12          7.36             -39.23%
B1ZH      14.85          9.76             -34.28%
B2ZH      19.32          14.28            -26.09%
M4ZH      8.95           5.98             -33.20%
B1AR      10.95          7.53             -31.23%
B2AR      14.82          10.90            -26.45%
M4AR      6.92           4.72             -31.81%
B1ES      10.48          7.04             -32.82%
B2ES      13.48          9.96             -26.11%
M4ES      6.68           3.96             -40.80%
B1KO      8.18           4.75             -41.93%
B2KO      10.22          7.25             -29.06%
M4KO      4.99           2.80             -43.85%
B1HI      6.93           4.43             -36.08%
B2HI      8.93           6.67             -25.31%
M4HI      4.99           2.78             -44.22%
5.4 Language Model Decoding
We evaluate the impact of language model decoding in comparison to the fine-tuning
techniques already identified. We use a 4-gram KenLM model [17] trained on the
concatenated LibriSpeech and ARCTIC training corpora. We find performance gain
from language model decoding to be relatively similar to fine-tuning for most splits,
with the combination of the two methods even more beneficial (see Tables 8 & 9). To
further quantify the results, for each target accent we calculate the average reduction
from adding language model decoding and compare it with the average reduction from
fine-tuning, calculated as $\Delta WER_{LM} = \mathrm{AVG}(B1_{LM} - B1,\; M_{LM} - M)$, while the
fine-tuning reduction is $\Delta WER_{FT} = \mathrm{AVG}(M - B1,\; M_{LM} - B1_{LM})$ (see Fig. 4). For
more difficult accents (VI, ZH), fine-tuning appears to play a much bigger role in
performance improvements, with easier accents (HI, KO) benefiting more from language
model decoding.

Fig. 4 Here we show the average absolute WER reduction from adding a language model compared with fine-tuning on the different test splits.

¹³ For those speakers whose TOEFL scores are known [59], a strong negative correlation was observed between speaker-specific WERs of Baseline-I and speakers' TOEFL scores, $r(8) \approx -.77$, $p < .01$.
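To make the two quantities concrete, consider the VI rows of Table 9 as an illustration (using M4 as the fine-tuned model $M$ for that accent):

$\Delta WER_{LM} = \mathrm{AVG}(17.01 - 23.30,\; 7.36 - 12.12) = \mathrm{AVG}(-6.29, -4.76) \approx -5.5$

$\Delta WER_{FT} = \mathrm{AVG}(12.12 - 23.30,\; 7.36 - 17.01) = \mathrm{AVG}(-11.18, -9.65) \approx -10.4$

so for VI the reduction attributable to fine-tuning is roughly twice that attributable to language model decoding, consistent with Fig. 4.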
In evaluating beam search on its own ($\alpha = 0$), performance degrades slightly
compared to the baseline. This indicates that most of the performance gain in decoding
comes from the inclusion of the language model. We note that the B2 baseline
not only performs worse than the B1 baseline but also benefits the least from
language model decoding (perhaps indicating that it has overfit to the L1-ARCTIC
corpus). For L2 ASR, this suggests that simply fine-tuning on domain-specific but
L1-accented corpora is counterproductive.
For performance on the LibriSpeech corpus (see Table 10), results are more
mixed. As already shown by earlier work [56], Wav2Vec 2.0-ST benefits only slightly
from language model decoding on the clean split of LibriSpeech ($\Delta WER$ of -0.05).
The fine-tuned models similarly gain only a mild benefit, with some models ($\Delta WER$ M4ZH:
+0.12, M4ES: +0.02) actually performing worse. For test-other, we observe
mild improvements in performance for most models using language model decoding,
although some improvements are negligible ($\Delta WER$ M4ZH: -0.07, M4ES: -0.01).
The overall performance gains from adding a language model and beam search when
decoding L1 speech are minor in comparison to the benefits in L2 speech decoding,
and fine-tuning on L2 speech decreases performance on L1 speech substantially
even with the better decoding (M1 WER increases by 40.33% on test-clean and
47.68% on test-other relative to B1). This suggests an L1 vs. L2 tradeoff that cannot
entirely be overcome by the combination of fine-tuning and decoding strategies we
have tried.
Table 10 Comparison of models with and without language model decoding on the test-clean and test-other splits of LibriSpeech. Results in WER, and relative decrease (-) or increase (+) in WER (ΔWER%).

          LS Test-Clean                              LS Test-Other
          Best Path ↓   Beam Search ↓   ΔWER% ↓      Best Path ↓   Beam Search ↓   ΔWER% ↓
B1        1.86          1.81            -2.69%       3.89          3.67            -5.66%
B2        2.32          1.96            -15.52%      5.00          4.34            -13.20%
M1        2.91          2.54            -12.71%      6.44          5.42            -15.84%
M4VI      2.95          2.78            -5.76%       6.56          5.97            -8.99%
M4ZH      2.44          2.56            +4.92%       5.45          5.38            -1.28%
M4AR      2.62          2.42            -7.63%       6.13          5.38            -12.23%
M4ES      2.29          2.31            +0.87%       5.31          5.30            -0.19%
M4KO      2.48          2.24            -9.68%       5.53          4.82            -12.84%
M4HI      2.29          2.12            -7.42%       5.78          5.19            -10.21%
Table 11 Examples of transcription output of selected utterances from the Test set of Model-4
among all six L1s without a language model. Capitalized words indicate errors. We show samples
from two speakers per accent.
Model Model output
Ref at lake linderman i had one canoe very good peterborough canoe
VI at LAY LINDEMAN i had one canoe very good PETERBORROUG CANOES
A lake LNDER MAN i had one canoe very good BIET OF ROCK canoe
ZH at lake LINGERMAN i had ONCE canoe very good PETERBROUGH canoe
at lake LINERMAN i had one canoe very good PETERE BROUGHTA canoe
AR at lake LUNDERBOGH i had one canoe very good BITTERBOROUGH canoe
at lake LUNDERMAN i had one canoe very good BETTER BORT canoe
ES at lake linderman i had one canoe a very good PETERBOURN canoe
at lake linderman i had ONCE canoe very good PIERREBOROUGH canoe
KO at lake linderman i had one canoe very good peterborough canoe
at lake LINDEMAN i had ONCE canoe very good PITTEBRAUG canoe
HI at lake LINDEMAN i had one canoe very good PETERBURGH canoe
at lake linderman i had one canoe A very good PEACHERBROROU canoe
6 Error Analysis
To understand the benefit of fine-tuning and language model decoding, we further
analyze the types of errors corrected by the respective approaches, using Levenshtein
single-character edit operations (as measured from gold standard to predicted utterance)
as our proxy for error types. For this analysis, we use the L2-ARCTIC
development set. An interesting finding of our analysis (see Fig. 5) is that while
adding language model decoding to the B1 baseline improves WER on L2-ARCTIC,
it increases the number of deletion operations, indicative of over-generation. For fine-tuned
models (M1 and M4), there is a reduction in error types across the board, with
particular benefit to substitution (M4: -43%) and deletion operations (M4: -55%)
and mild benefit to insertion operations (M4: -30%). When adding a language model
on top of the fine-tuning, we see further reduction in the substitution operations
(M4LM: -64%) and insertion operations (M4LM: -52%), with a milder additional benefit to
deletion operations (M4LM: -65%).

Fig. 5 The change in number of Levenshtein edit operations compared to our baseline B1 with best path decoding and no language model.
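A minimal sketch of how such per-category counts can be obtained is given below: a standard dynamic-programming Levenshtein alignment with a backtrace (illustrative only, not necessarily the exact script we used).

```python
from collections import Counter

def edit_operation_counts(ref: str, hyp: str) -> Counter:
    """Count single-character Levenshtein operations (substitutions,
    insertions, deletions) needed to turn `ref` into `hyp`."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion from ref
                           dp[i][j - 1] + 1,          # insertion into ref
                           dp[i - 1][j - 1] + cost)   # substitution / match
    # Backtrace one optimal alignment to recover which operations were used.
    ops = Counter()
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] != hyp[j - 1]:
                ops["substitution"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops["deletion"] += 1
            i -= 1
        else:
            ops["insertion"] += 1
            j -= 1
    return ops

print(edit_operation_counts("the cat chased the rat", "the mat chased the rat"))
# Counter({'substitution': 1})
```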
We give examples of some of these errors in Table 12, and use the B1 predictions
on the dev set of L2-ARCTIC to explore them in more detail. When looking at cases
of single Levenshtein edit operations, we notice the following patterns: out of the 18
deletion operations, 15 are spelling errors, one is a pluralization error, one is a tense
error, and one is a spacing error. All 18 substitution operations are indicative of
spelling errors. Of the 11 insertion operations, seven are spacing errors, three are spelling
errors, and one is a pluralization error. From this rough analysis, it appears that
while both fine-tuning and language model decoding substantially improve the model's
spelling, language model decoding is more effective at ensuring proper spacing
of words.
7 Conclusion
We demonstrated the potential of developing accent-independent and accent-
dependent models that improve non-native speech recognition simply by fine-tuning
the pre-trained wav2vec 2.0 model on a small amount of labeled data. Both our multi-
and single-accent models improve performance on L2 English speakers. However,
each accent benefits differently: results of the multi-accent, zero-shot experiments
suggest that transfer learning on accent is possible and single-accent models improve
the most for the target L2 accents. Comparing the benefit from using a language model
in decoding the ASR outputs with simply fine-tuning the models, we find that both
these methods yield comparable improvements. We also find that the combination
of the two methods greatly closes the gap between L1 and L2 ASR. We summarize
our findings as follows:
Table 12 Examples of transcription output of different categories of edit operation. We use a shorthand to indicate the applied method, fine-tuning: +FT, language model decoding: +LM, or lack thereof (−). Errors capitalized.

Edit Type      Model       Model output
Substitution
               Ref         the portuguese boy crawled nearer and nearer
               −FT −LM     the portuguese boy CROWLED nearer and nearer
               −FT +LM     the portuguese boy crawled nearer and nearer
               +FT −LM     the portuguese boy CROWLED nearer and nearer
               +FT +LM     the portuguese boy crawled nearer and nearer
Insertion
               Ref         tomorrow it will be strong enough for you to stand upon
               −FT −LM     TO MORROW it will be strong enough for you to stand upon
               −FT +LM     TO MORROW it will be strong enough for you to stand upon
               +FT −LM     tomorrow it will be strong enough for you to stand upon
               +FT +LM     tomorrow it will be strong enough for you to stand upon
Deletion
               Ref         there are no kiddies and half grown youths among them
               −FT −LM     there are no kiddies and half grown YOUTH among them
               −FT +LM     there are no kiddies and half grown YOUTH among them
               +FT −LM     there are no kiddies and half grown YOUTH among them
               +FT +LM     there are no kiddies and half grown youths among them
•Fine-tuning either on single accents (most effective) or groups of accents (more
generalizable) significantly improves L2 ASR performance. This is achieved
through large reductions in calculated substitution and deletion operations.
•Fine-tuning on domain-specific L1 accents is counterproductive to L2 ASR.
•Language model decoding is useful for L2 ASR, even for models with strong lan-
guage model-free performance on L1 speech, and is particularly good at reducing
calculated substitution and insertion operations.
When looking at future research directions, it is important to stress the need the
field has for benchmark L2 ASR datasets. The dataset proposed by Wang et al. [52] is
one of the few L2 ASR datasets collected with this intention in mind (although it
only covers L1 Chinese speakers). Without datasets that cover a wide variety of L2
English accents, relying on accent embeddings [50, 49, 23] and multi-task learning
might be a vital addition to L2 ASR work if the goal is wide-accent coverage. While
our fine-tuning of Wav2vec 2.0-ST did not seem to preserve the model's language
model-free performance, there may be other directions to take to try to maintain
it. These can include looking more closely at fine-tuning techniques, such as Layer
Norm and Attention fine-tuning [27] or Adapter fine-tuning [20]. These might be
better at preserving this internal language model, as they freeze more of the original
model weights. Finally, smaller ASR model sizes and federated learning (although
early results have noted difficulties in applying it to ASR [57], effort is underway to
lower the training cost and improve accuracy [15]) might bring about the potential
for ASR individualization targeted at the level of idiolects, with people able to
use ASR models tailored to their personal accent profile.
Acknowledgements We would like to thank Mia Li, Jeremy Zhang, and Haejin Cho for having
contributed to an initial phase of this work.
References
[1] Baevski A, Schneider S, Auli M (2019) vq-wav2vec: Self-supervised learning
of discrete speech representations. arXiv preprint arXiv:191005453
[2] Baevski A, Zhou H, Mohamed A, Auli M (2020) wav2vec 2.0: A frame-
work for self-supervised learning of speech representations. arXiv preprint
arXiv:200611477
[3] Bearman A, Josund K, Fiore G (2017) Accent conversion using artificial neural
networks. Tech. rep., Stanford University
[4] Chan W, Jaitly N, Le Q, Vinyals O (2016) Listen, attend and spell: A neu-
ral network for large vocabulary conversational speech recognition. In: 2016
IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, pp 4960–4964
[5] Chorowski JK, Bahdanau D, Serdyuk D, Cho K, Bengio Y (2015) Attention-
based models for speech recognition. In: Advances in neural information pro-
cessing systems, pp 577–585
[6] Chung YA, Hsu WN, Tang H, Glass J (2019) An unsupervised autoregressive
model for speech representation learning. arXiv preprint arXiv:190403240
[7] Crystal D (2003) English as a global language. Ernst Klett Sprachen
[8] Das N, Bodapati S, Sunkara M, Srinivasan S, Chau DH (2021) Best of both
worlds: Robust accented speech recognition with adversarial transfer learning.
arXiv preprint arXiv:210305834
[9] Devlin J, Chang MW, Lee K, Toutanova K (2018) Bert: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint
arXiv:181004805
[10] Futami H, Inaguma H, Ueno S, Mimura M, Sakai S, Kawahara T (2020)
Distilling the knowledge of bert for sequence-to-sequence asr. arXiv preprint
arXiv:200803822
[11] Graves A (2012) Connectionist temporal classification. In: Supervised Se-
quence Labelling with Recurrent Neural Networks, Springer, pp 61–93
[12] Graves A, Jaitly N (2014) Towards end-to-end speech recognition with recurrent
neural networks. In: International conference on machine learning, pp 1764–
1772
[13] Graves A, Fernández S, Gomez F, Schmidhuber J (2006) Connectionist tempo-
ral classification: labelling unsegmented sequence data with recurrent neural
networks. In: Proceedings of the 23rd international conference on Machine
learning, pp 369–376
[14] Gulati A, Qin J, Chiu CC, Parmar N, Zhang Y, Yu J, Han W, Wang S, Zhang Z,
Wu Y, et al. (2020) Conformer: Convolution-augmented transformer for speech
recognition. arXiv preprint arXiv:200508100
[15] Guliani D, Beaufays F, Motta G (2021) Training speech recognition models
with federated learning: A quality/cost framework. In: ICASSP 2021-2021
IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, pp 3080–3084
[16] Hannun AY, Maas AL, Jurafsky D, Ng AY (2014) First-pass large vocabu-
lary continuous speech recognition using bi-directional recurrent dnns. arXiv
preprint arXiv:14082873
[17] Heafield K (2011) Kenlm: Faster and smaller language model queries. In:
Proceedings of the sixth workshop on statistical machine translation, pp 187–
197
[18] Hori T, Cho J, Watanabe S (2018) End-to-end speech recognition with word-
based rnn language models. In: 2018 IEEE Spoken Language Technology
Workshop (SLT), IEEE, pp 389–396
[19] Hou J, Guo P, Sun S, Soong FK, Hu W, Xie L (2019) Domain adversarial
training for improving keyword spotting performance of esl speech. In: ICASSP
2019-2019 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), IEEE, pp 8122–8126
[20] Houlsby N, Giurgiu A, Jastrzebski S, Morrone B, De Laroussilhe Q, Gesmundo
A, Attariyan M, Gelly S (2019) Parameter-efficient transfer learning for nlp.
In: International Conference on Machine Learning, PMLR, pp 2790–2799
[21] Hu H, Yang X, Raeesy Z, Guo J, Keskin G, Arsikere H, Rastrow A, Stolcke
A, Maas R (2021) Redat: Accent-invariant representation for end-to-end asr by
domain adversarial training with relabeling. In: ICASSP 2021-2021 IEEE In-
ternational Conference on Acoustics, Speech and Signal Processing (ICASSP),
IEEE, pp 6408–6412
[22] Hwang K, Sung W (2016) Character-level incremental speech recognition with
recurrent neural networks. In: 2016 IEEE international conference on acoustics,
speech and signal processing (ICASSP), IEEE, pp 5335–5339
[23] Jain A, Upreti M, Jyothi P (2018) Improved accented speech recognition using
accent embeddings and multi-task learning. In: Interspeech, pp 2454–2458
[24] Kahn J, Rivière M, Zheng W, Kharitonov E, Xu Q, Mazaré PE, Karadayi J,
Liptchinsky V, Collobert R, Fuegen C, et al. (2020) Libri-light: A benchmark for
asr with limited or no supervision. In: ICASSP 2020-2020 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp
7669–7673
[25] Kominek J, Black AW, Ver V (2003) Cmu arctic databases for speech synthesis
[26] Kunze J, Kirsch L, Kurenkov I, Krug A, Johannsmeier J, Stober S
(2017) Transfer learning for speech recognition on a budget. arXiv preprint
arXiv:170600290
[27] Li X, Wang C, Tang Y, Tran C, Tang Y, Pino J, Baevski A, Conneau A, Auli M
(2020) Multilingual speech translation with efficient finetuning of pretrained
models. arXiv preprint arXiv:201012829
[28] Ling S, Liu Y, Salazar J, Kirchhoff K (2020) Deep contextualized acoustic
representations for semi-supervised speech recognition. In: ICASSP 2020-2020
IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, pp 6429–6433
[29] Liu AT, Yang Sw, Chi PH, Hsu Pc, Lee Hy (2020) Mockingjay: Unsupervised
speech representation learning with deep bidirectional transformer encoders.
In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), IEEE, pp 6419–6423
[30] Liu S, Wang D, Cao Y, Sun L, Wu X, Kang S, Wu Z, Liu X, Su D, Yu D,
et al. (2020) End-to-end accent conversion without using native utterances. In:
ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), IEEE, pp 6289–6293
[31] Livescu K, Glass J (2000) Lexical modeling of non-native speech for automatic
speech recognition. In: 2000 IEEE International Conference on Acoustics,
Speech, and Signal Processing. Proceedings (Cat. No. 00CH37100), IEEE,
vol 3, pp 1683–1686
[32] Lowerre BT (1976) The harpy speech recognition system. Carnegie Mellon
University
[33] Maas A, Xie Z, Jurafsky D, Ng AY (2015) Lexicon-free conversational speech
recognition with neural networks. In: Proceedings of the 2015 Conference of
the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, pp 345–354
[34] Matassoni M, Gretter R, Falavigna D, Giuliani D (2018) Non-native chil-
dren speech recognition through transfer learning. In: 2018 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp
6229–6233
[35] Meister C, Vieira T, Cotterell R (2020) If beam search is the answer, what was
the question? arXiv preprint arXiv:201002650
[36] Miao Y, Gowayyed M, Metze F (2015) Eesen: End-to-end speech recognition
using deep rnn models and wfst-based decoding. In: 2015 IEEE Workshop on
Automatic Speech Recognition and Understanding (ASRU), IEEE, pp 167–174
[37] Oord Avd, Li Y, Vinyals O (2018) Representation learning with contrastive
predictive coding. arXiv preprint arXiv:180703748
[38] Pan SJ, Yang Q (2009) A survey on transfer learning. IEEE Transactions on
knowledge and data engineering 22(10):1345–1359
[39] Panayotov V, Chen G, Povey D, Khudanpur S (2015) Librispeech: an asr corpus
based on public domain audio books. In: 2015 IEEE international conference
on acoustics, speech and signal processing (ICASSP), IEEE, pp 5206–5210
[40] Park DS, Chan W, Zhang Y, Chiu CC, Zoph B, Cubuk ED, Le QV (2019)
Specaugment: A simple data augmentation method for automatic speech recog-
nition. arXiv preprint arXiv:190408779
[41] Ping TT (2008) Automatic speech recognition for non-native speakers. PhD
thesis, Université Joseph-Fourier-Grenoble I
[42] Radzikowski K, Wang L, Yoshie O, Nowak R (2021) Accent modification for
speech recognition of non-native speakers using neural style transfer. EURASIP
Journal on Audio, Speech, and Music Processing 2021(1):1–10
[43] Sak H, Senior A, Rao K, Irsoy O, Graves A, Beaufays F, Schalkwyk J (2015)
Learning acoustic frame labeling for speech recognition with recurrent neural
networks. In: 2015 IEEE international conference on acoustics, speech and
signal processing (ICASSP), IEEE, pp 4280–4284
[44] Schneider S, Baevski A, Collobert R, Auli M (2019) wav2vec: Unsupervised
pre-training for speech recognition. arXiv preprint arXiv:190405862
[45] Shi X, Yu F, Lu Y, Liang Y, Feng Q, Wang D, Qian Y, Xie L (2021) The
accented english speech recognition challenge 2020: open datasets, tracks,
baselines, results and methods. In: ICASSP 2021-2021 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp
6918–6922
[46] Shibano T, Zhang X, Li MT, Cho H, Sullivan P, Abdul-Mageed M (2021)
Speech technology for everyone: Automatic speech recognition for non-native
english with transfer learning. arXiv preprint arXiv:211000678
[47] Sun S, Yeh CF, Hwang MY, Ostendorf M, Xie L (2018) Domain adversarial
training for accented speech recognition. In: 2018 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp 4854–
4858
[48] Synnaeve G, Xu Q, Kahn J, Likhomanenko T, Grave E, Pratap V, Sri-
ram A, Liptchinsky V, Collobert R (2019) End-to-end asr: from super-
vised to semi-supervised learning with modern architectures. arXiv preprint
arXiv:191108460
[49] Turan MAT, Vincent E, Jouvet D (2020) Achieving multi-accent asr via unsu-
pervised acoustic model adaptation. In: INTERSPEECH 2020
[50] Viglino T, Motlicek P, Cernak M (2019) End-to-end accented speech recogni-
tion. In: Interspeech, pp 2140–2144
[51] Wang D, Zheng TF (2015) Transfer learning for speech and language pro-
cessing. In: 2015 Asia-Pacific Signal and Information Processing Association
Annual Summit and Conference (APSIPA), IEEE, pp 1225–1237
[52] Wang Y, Luan H, Yuan J, Wang B, Lin H (2020) Laix corpus of chinese
learner english: Towards a benchmark for l2 english asr. In: INTERSPEECH,
pp 414–418
[53] Wang Y, Mohamed A, Le D, Liu C, Xiao A, Mahadeokar J, Huang H, Tjandra A,
Zhang X, Zhang F, et al. (2020) Transformer-based acoustic modeling for hybrid
speech recognition. In: ICASSP 2020-2020 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp 6874–6878
[54] Wang Z, Schultz T, Waibel A (2003) Comparison of acoustic model adaptation
techniques on non-native speech. In: 2003 IEEE International Conference on
Acoustics, Speech, and Signal Processing, 2003. Proceedings.(ICASSP’03).,
IEEE, vol 1, pp I–I
[55] Watanabe S, Hori T, Kim S, Hershey JR, Hayashi T (2017) Hybrid ctc/attention
architecture for end-to-end speech recognition. IEEE Journal of Selected Topics
in Signal Processing 11(8):1240–1253
[56] Xu Q, Baevski A, Likhomanenko T, Tomasello P, Conneau A, Collobert R,
Synnaeve G, Auli M (2021) Self-training and pre-training are complementary
for speech recognition. In: ICASSP 2021-2021 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp 3030–3034
[57] Yu W, Freiwald J, Tewes S, Huennemeyer F, Kolossa D (2021) Federated
learning in asr: Not as easy as you think. In: Speech Communication; 14th ITG
Conference, VDE, pp 1–5
[58] Zenkel T, Sanabria R, Metze F, Niehues J, Sperber M, Stüker S, Waibel A (2017)
Comparison of decoding strategies for ctc acoustic models. arXiv preprint
arXiv:170804469
[59] Zhao G, Sonsaat S, Silpachai AO, Lucic I, Chukharev-Hudilainen E, Levis
J, Gutierrez-Osuna R (2018) L2-arctic: A non-native english speech corpus.
Perception Sensing Instrumentation Lab