Adaptation and Optimization of Automatic Speech Recognition (ASR) for
the Maritime Domain in the Field of VHF Communication
Emin Çağatay Nakilcioğlu, Fraunhofer CML, Hamburg/GERMANY, emin.nakilcioglu@cml.fraunhofer.de
Maximilian Reimann, Fraunhofer CML, Hamburg/GERMANY, maximilian.reimann@cml.fraunhofer.de
Ole John, Fraunhofer CML, Hamburg/GERMANY, ole.john@cml.fraunhofer.de
Abstract
This paper introduces a multilingual automatic speech recognizer (ASR) for maritime radio communication that automatically converts received VHF radio signals into text. The challenges of maritime radio communication are described first, and the deep learning architecture of marFM®, consisting of audio processing techniques and machine learning algorithms, is presented. Subsequently, maritime radio data of interest is analyzed and then used to evaluate the transcription performance of our ASR model for various maritime radio data.
1. Introduction
Maritime communication is a critical aspect of global trade and transportation, enabling ships to com-
municate with one another and with shore-based facilities. VHF stands for Very High Frequency and
is a commonly used radio communication technology in the maritime domain, providing reliable voice
communication. VHF is a type of electromagnetic radiation used in maritime radio communications
mainly in the frequency range from 156.025 MHz to 162.025 MHz, ITU (2022). VHF radio is used for
ship-to-ship, ship-to-shore (and vice versa), and onboard communications, Alagha and Løge (2023). Due to its vast range, simplicity of use, and technological robustness, VHF communication has established itself as a useful instrument over the decades and has become part of the mandatory equipment on board according to SOLAS, IMO (1974). However, using VHF radio technology for communication poses several obstacles. In addition to the background noise that is a characteristic feature of VHF radio, the noise on board adds to the difficulty and can significantly impair speech intelligibility.
ity. In the maritime context, there are many sources of background noise, such as the noise of machinery
or the sound of the sea. On the other hand, the signal quality depends strongly on the weather. Weather
influences on the signal quality include fog, rain, or snow, whereby rain in particular reduces the power
of the radio wave and can thus lead to signal interruptions, Meng et al. (2009). A poor connection can
also affect intelligibility. If the signal between the vessel and the coast station is weak or affected by
sources of interference in the vicinity, distortion and interruptions in the radio call may occur. In addition to the factors mentioned so far, which are primarily technological in nature and degrade the reception quality of the radio signal, there are other factors that affect the linguistic characteristics of maritime radio communication.
Another challenge is the internationality of the ship's crews with their different language levels. Lan-
guage richness poses a challenge for speech recognition because it is difficult to develop models that
cover all possible language variations. As people from different countries and language regions work
together at sea, language barriers can arise, making communication more difficult. Since English is the
main language of communication, a variety of different dialects and accents occur in radio communi-
cation, John et al. (2017). Depending on the level of English proficiency and the strength of the respec-
tive accents, this circumstance has a great impact on the intelligibility of maritime radio communica-
tions, especially if speakers are not used to talking to people from other regions of the world. To reduce the problem of language barriers and to mitigate the occurrence of misunderstandings, the IMO introduced the "Standard Marine Communication Phrases" (SMCP), which replaced the Standard Marine Navigational Vocabulary (SMNV) of 1977. The SMCP is a framework containing phrases for routine situations and standard responses for emergency situations, IMO (2001). In practice, these phrases, which aim to reduce language barriers and avoid misunderstandings in maritime communication, see little use, Mockel et al. (2014).
Despite its shortcomings, VHF radio is an indispensable tool for communication between ships and
shore stations in the maritime domain and provides a fast, easy and reliable way to transmit information
between different stakeholders. Automatic speech recognition technologies hold great potential to address the challenges in the field of VHF radio communication.
Automatic speech recognition (ASR) is an increasingly important component of human-computer in-
teraction since it is a technology that enables machines to transcribe human speech into text. This tech-
nology has made considerable progress in recent years and has also found application in the end-con-
sumer sector. Examples of areas where ASR is used today include voice control, voice input, automated
customer support systems and many more. ASR is based on the ability of machines to recognize and
interpret spoken language, calculating the most likely spoken sentence based on audio signals. The
technology uses complex algorithms and mathematical models to capture the acoustic signals of speech
and convert them into digital formats that can be processed.
In the field of maritime communication, deep learning-based ASR has been used to improve the accu-
racy of speech recognition. Various models have been proposed for different applications, including
the use of RNNs for decoding of Maritime VHF communications, Goudsmit et al. (2016), and the use
of CNNs and RNNs for detecting keywords in distress calls, Yu et al. (2019). However, there is still a
need for more accurate and efficient ASR models for the maritime domain.
The introduction of transfer learning has also shown promising results in ASR. Transfer learning in-
volves using a pre-trained model and fine-tuning it on a specific task. This approach has been used in
various fields, including computer vision, He et al. (2016), and natural language processing, Peters et al. (2018). In the field of ASR, transfer learning has been used to improve the performance of models on low-resource languages, Zhao and Zhang (2022), and tasks, Chorowski et al. (2019).
The Wav2Vec2 model, introduced by Baevski et al. (2020), is a state-of-the-art ASR model that utilizes
self-supervised learning to learn speech representations from raw audio. The model is trained with a contrastive objective over quantized latent representations, yielding contextualized representations of audio that can then be used for downstream tasks such as ASR. Fine-tuning the Wav2Vec2 model on specific speech recognition tasks has shown significant improvements in performance, Baevski et al. (2020), Schneider et al. (2019).
Considering the international nature of maritime communication, ships and ports from different coun-
tries often communicate with each other, and there are a variety of languages used in maritime commu-
nication. A multilingual ASR system would enable efficient and accurate communication between dif-
ferent language speakers, improving safety and efficiency in the maritime domain.
In recent years, the development of multilingual ASR models has gained significant attention. Building on recent advancements in deep learning, large-scale pre-trained models such as the Cross-lingual Language Model Pre-training (XLM) and XLSR models were introduced and have shown great potential for improving the performance of ASR systems. The XLSR model, in particular, is a pre-trained model that has been trained on a large corpus of multilingual data and has been shown to be effective in low-resource languages and domains. For example, in a recent study by Conneau et al. (2020), the
XLSR models were pre-trained on a large corpus of speech data in hundreds of languages and achieved
state-of-the-art performance on several benchmarks for cross-lingual speech recognition.
In the context of maritime communication, where there is a need for accurate and efficient ASR models,
the use of XLSR models can be beneficial due to their ability to handle multilingual and low-resource
data. The XLSR model has been shown to be effective in various multilingual ASR tasks, such as
speech-to-text translation and code-switching speech recognition. In a recent study by Li et al. (2021),
the authors developed a multilingual ASR system for the maritime domain by fine-tuning the XLSR
model on a custom maritime audio dataset. The study demonstrated the effectiveness of the XLSR
model in improving the performance of ASR in the maritime domain.
In this paper, we introduce marFM®, a multilingual ASR system for the maritime domain. Our approach
is based on transfer learning: we use a pretrained XLSR model to initialize the weights of our model
and then fine-tune it on our custom maritime audio database which consists of maritime audio record-
ings in English and German. By doing so, we aim to improve the accuracy and efficiency of ASR for
maritime communication in multiple languages simultaneously and thus to facilitate safer and more
efficient communication between vessels of different nationalities.
2. Methodology
This chapter lays out the underlying methodology applied in developing/training a multilingual ASR
model for multilingual maritime communication. As shown in Fig.1, our methodology consists of
two main steps: data pre-processing and fine-tuning of a pretrained ASR-Model. In this section, the
details of the fine-tuning process are presented, and our custom maritime database and the data prepro-
cessing steps are discussed in detail.
Fig.1: Preprocessing and fine-tuning of ASR-model
2.1 Supervised Fine-Tuning
Our approach in this work is to use the Wav2Vec2-XLSR-53 model pretrained on large-scale German data as our base model and to fine-tune it on our custom maritime database. The pretrained model was acquired through the Hugging Face Hub. Hugging Face provides an open-source software library (Transformers) with easy-to-use interfaces for popular NLP models, including Wav2Vec2 and XLSR, HuggingFace (2023).
It allows users to easily access and fine-tune these models on their own datasets. Hugging Face also
provides pre-trained models that can be used as a starting point for fine-tuning. The pre-trained models
are trained on large-scale datasets, which makes them suitable for transfer learning on smaller datasets.
In our work, we utilized the Wav2Vec2-XLSR-53 German model from Hugging Face as our base model
for fine-tuning on our maritime audio dataset.
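To make this step concrete, the following minimal sketch shows how such a pretrained checkpoint can be loaded from the Hugging Face Hub as the starting point for fine-tuning; the checkpoint identifier and the frozen feature encoder are illustrative assumptions and not necessarily our exact configuration.

```python
# Minimal sketch: load a pretrained Wav2Vec2-XLSR-53 German checkpoint from the
# Hugging Face Hub as the starting point for fine-tuning. The model ID below is
# an assumption; any compatible XLSR-53 German checkpoint could be substituted.
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "facebook/wav2vec2-large-xlsr-53-german"  # assumed checkpoint

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)  # feature extractor + tokenizer
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)         # CTC head on top of the encoder

# Freezing the convolutional feature encoder is a common choice when fine-tuning
# wav2vec 2.0-style models on comparatively small domain datasets.
model.freeze_feature_encoder()
```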
The architecture of the Wav2Vec2-XLSR-53 German model is based on the XLSR approach, which builds on the wav2vec 2.0 architecture and is designed to learn cross-lingual speech representations by pretraining a single model on the raw speech waveform in multiple languages, Conneau et al. (2020). As shown in Fig.2, the XLSR approach consists of a multi-layer convolutional neural network (CNN) encoder followed by a transformer context network. The CNN encoder extracts latent speech features from the input waveform; quantized versions of these features then act as targets for training the transformer with a contrastive loss. Similar to the self-supervised training of wav2vec 2.0, XLSR only requires raw unlabelled speech audio in multiple languages. The Wav2Vec2-XLSR-53 German model is pretrained on large-scale multilingual speech corpora and fine-tuned on the Common Voice German dataset, Conneau et al. (2020).
Fig.2: The XLSR approach on multilingual speech recognition, Conneau et al. (2020)
To fine-tune the Wav2Vec2-XLSR-53 model for the maritime domain, we used a custom maritime
audio database that was collected by Fraunhofer CML. The database contains various types of maritime
radio communication, such as distress calls, navigational messages, and weather reports as well as
situational reports. More details regarding the database are provided in the following sub-section.
For fine-tuning, we followed a transfer learning approach. We initialized the model with the pre-trained
weights and fine-tuned it on our dataset using the Connectionist Temporal Classification (CTC) loss
function. CTC loss function is a popular method for training ASR models, where the model learns to
align the predicted output sequence with the ground truth by maximizing the log-likelihood of the
correct transcription, Graves et al. (2006). This approach has been shown to be effective in various
ASR tasks, including phone recognition, Graves et al. (2006), keyword spotting, Park et al. (2020),
and speaker diarization, Wang et al. (2021). In our work, we use CTC loss to train our multilingual
ASR model to predict character sequences from audio waveforms. We use the CTC loss implementation
provided by the PyTorch framework, Paszke et al. (2019). During the fine-tuning process, we used a learning rate of 3e-5 and a batch size of 8. We also used early stopping to prevent overfitting and reduce training time. Fine-tuning on a single GPU (Nvidia RTX A6000) took approximately 34 hours in total.
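The following sketch illustrates a fine-tuning configuration of this kind with the Hugging Face Trainer (learning rate 3e-5, batch size 8, early stopping). The dataset objects, data collator, metric function, output path and early-stopping patience are assumptions shown only for illustration; Wav2Vec2ForCTC computes the CTC loss internally when labels are provided.

```python
# Illustrative fine-tuning setup (not the authors' exact script): learning rate 3e-5,
# batch size 8, early stopping, CTC loss handled inside Wav2Vec2ForCTC.
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="marfm-finetuned",           # assumed output path
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    evaluation_strategy="steps",
    eval_steps=500,                          # assumed evaluation interval
    save_steps=500,
    num_train_epochs=30,                     # assumed upper bound; early stopping ends earlier
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,                             # Wav2Vec2ForCTC from the previous sketch
    args=training_args,
    train_dataset=train_set,                 # preprocessed maritime training split (assumed)
    eval_dataset=val_set,                    # preprocessed validation split (assumed)
    data_collator=data_collator,             # CTC padding collator (not shown)
    compute_metrics=compute_metrics,         # WER computation (not shown)
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # assumed patience
)
trainer.train()
```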
2.2 Database and Preprocessing
For this project, Fraunhofer CML created its own custom maritime database by gathering maritime audio recordings. This database consists of about 62 hours of audio recordings of real VHF radio conversations in English and German, together with their respective transcriptions.
Since the pretrained XLSR model was trained on data with a sampling rate of 16 kHz, the fine-tuning data needed to be converted into the same format so that the convolutional filters could operate on the same timescale. To this end, every audio recording in the database was resampled to 16 kHz. Preprocessing was done using the librosa library in Python, McFee et al. (2015).
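A minimal sketch of this resampling step, assuming the recordings are stored as individual audio files; the file names and the use of the soundfile library for writing the result are illustrative choices.

```python
# Minimal sketch: resample a recording to 16 kHz so it matches the sampling rate
# the pretrained XLSR model expects. File paths are illustrative.
import librosa
import soundfile as sf

audio, sr = librosa.load("vhf_call_0001.wav", sr=16000)  # resampled to 16 kHz on load
sf.write("vhf_call_0001_16k.wav", audio, sr)             # persist the resampled audio
```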
As mentioned in Section 1, another main challenge regarding VHF radio recordings is the heavy background noise that is ever present regardless of the type of VHF radio hardware. Depending on the range of the VHF radio receivers, the hardware quality and the distance between the communicating parties, the prominence of the background noise differs between recordings. To address background noise with such varying characteristics, we applied non-stationary noise gating to the raw audio data. The "noisereduce" library in Python was used for noise gating/reduction, Sainburg et al. (2020). The library offers two different noise reduction algorithms: stationary and non-stationary. The main difference between the algorithms is that non-stationary noise reduction allows the noise gate to change over time by using a sliding window, where the noise statistics are recomputed every time the window is moved to an adjacent part of the audio clip. The working schematics of the stationary and non-stationary noise reduction are shown in Fig.3. The following steps are applied during the noise reduction process (a code sketch follows the list):
• First, a spectrogram is calculated over the signal.
• A time-smoothed version of the spectrogram is computed using an IIR filter applied forward and backward on each frequency channel, Grout (2008).
• A mask is computed based on that time-smoothed spectrogram.
• The mask is smoothed with a filter over frequency and time.
• The mask is applied to the spectrogram of the signal, and the masked spectrogram is inverted back into a waveform.
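The sketch below shows how this non-stationary noise gating can be applied with the noisereduce library; file names are illustrative and all other parameters are left at the library defaults.

```python
# Sketch: non-stationary spectral noise gating with the noisereduce library.
# The sliding-window noise statistics are recomputed internally when stationary=False.
import librosa
import noisereduce as nr
import soundfile as sf

audio, sr = librosa.load("vhf_call_0001_16k.wav", sr=16000)
denoised = nr.reduce_noise(y=audio, sr=sr, stationary=False)  # non-stationary noise gate
sf.write("vhf_call_0001_denoised.wav", denoised, sr)
```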
The other preprocessing steps, text correction and cleaning, were carried out in parallel. The objective behind text correction and cleaning was to remove all characters that do not contribute to the meaning of a word and cannot be properly represented by an acoustic sound, thus normalizing the text. A quick investigation revealed that our transcriptions contained some punctuation marks which do not correspond to any characteristic sound unit, i.e., phoneme. For example, the letter "d" has its own phonetic sound, whereas the punctuation mark "-" cannot be represented as a phoneme. To finalize the text cleaning stage, we consulted a native German speaker to find out whether any further simplification and cleaning of the transcriptions was possible. We were informed that certain vowels that should be written with an umlaut, such as ä, ö and ü, had been written using the English alphabet. Thus, as the final step of the preprocessing, we rewrote these vowels according to the German alphabet.
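The sketch below illustrates such a normalization step; the removed character set and the umlaut restitution table are assumptions for illustration, since the exact rules are not listed here.

```python
# Illustrative text normalization: strip punctuation that has no phonetic counterpart
# and restore German umlauts written with English letters. The regex and the
# replacement table are assumptions, not the exact rules used in this work.
import re

UMLAUT_MAP = {"ae": "ä", "oe": "ö", "ue": "ü", "Ae": "Ä", "Oe": "Ö", "Ue": "Ü"}  # assumed mapping

def normalize_transcript(text: str) -> str:
    text = re.sub(r"[^\w\säöüÄÖÜß]", "", text)    # drop punctuation such as '-' or ','
    for ascii_form, umlaut in UMLAUT_MAP.items():
        text = text.replace(ascii_form, umlaut)   # naive restitution; real rules need word context
    return text.lower().strip()

print(normalize_transcript("Ueber-nahme bestaetigt."))  # -> "übernahme bestätigt"
```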
3. Experiments
For the experimental setup, we split the dataset into training and test sets with a ratio of 90:10, which corresponds to ca. 56 h of audio for training and 6 h of audio for testing. The training set was further divided into two splits, train and validation, at a ratio of 80:20, which corresponds to ca. 45 h of audio for training and 11 h of audio for validation. In summary, 6 hours of maritime recordings consisting of real VHF radio calls in both English and German were used to evaluate the model performance.
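A minimal sketch of this splitting scheme, assuming the database is held as a Hugging Face datasets object; the field names, file list and random seed are illustrative assumptions.

```python
# Sketch of the 90:10 train/test split followed by an 80:20 train/validation split
# of the training portion, using the Hugging Face datasets API.
from datasets import Dataset

audio_paths = [f"vhf_call_{i:04d}_denoised.wav" for i in range(100)]  # illustrative file list
transcripts = ["placeholder transcript"] * 100                        # illustrative transcripts

full = Dataset.from_dict({"audio_path": audio_paths, "transcript": transcripts})

split_90_10 = full.train_test_split(test_size=0.10, seed=42)   # ~56 h train+val, ~6 h test
test_set = split_90_10["test"]

split_80_20 = split_90_10["train"].train_test_split(test_size=0.20, seed=42)  # ~45 h / ~11 h
train_set = split_80_20["train"]
val_set = split_80_20["test"]
```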
Three different models were considered for the comparative performance analysis. Our base model, Wav2Vec2-XLSR-53 German, was included in the experiments in order to assess the full effect of fine-tuning on the model's performance. We also evaluated the performance of Wav2Vec2-XLSR-53 German with a language model. As shown in Udagawa et al. (2022), Vaibhav and Padmapriya (2022) and on the model evaluation pages of the Hugging Face Hub, attaching a language model to a Wav2Vec2 model has a positive effect on the decoding performance of ASR models. Thus, we also included Wav2Vec2-XLSR-53 German with a language model trained on English and German phrases.
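The sketch below shows one common way of attaching an n-gram language model to a Wav2Vec2 model for CTC beam-search decoding, using pyctcdecode; the language model file and base checkpoint are assumptions and do not necessarily reflect the exact setup evaluated here.

```python
# Sketch: attach an n-gram language model to a Wav2Vec2 CTC model for decoding,
# using pyctcdecode. The ARPA file path and the base checkpoint are assumptions.
from pyctcdecode import build_ctcdecoder
from transformers import Wav2Vec2Processor, Wav2Vec2ProcessorWithLM

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-xlsr-53-german")
vocab = processor.tokenizer.get_vocab()
sorted_vocab = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

decoder = build_ctcdecoder(
    labels=sorted_vocab,
    kenlm_model_path="maritime_en_de.arpa",   # assumed n-gram LM trained on EN/DE phrases
)

processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder,
)
# processor_with_lm.batch_decode(logits) then performs LM-rescored beam-search decoding.
```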
In late 2022, a general-purpose speech recognition model called Whisper was introduced by the OpenAI research lab, Radford et al. (2022). It is trained on a large dataset of diverse audio, 680,000 hours of multilingual and multitask supervised data collected from the web, and is a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.
Fig.3: Stationary vs. non-stationary noise reduction, Sainburg and Gentner (2021)
Whisper's zero-shot performance was measured across many diverse datasets, and it showed much greater robustness and made fewer errors than the other state-of-the-art models. We concluded that including a Whisper-based model in our experimental setup would be a worthwhile addition and thus, in the evaluation phase, we also considered the transcription performance of the largest available Whisper model, the Whisper large-v2 model, which is the biggest model in size and was trained for 2.5x more epochs with added regularization for improved performance compared to the Whisper large model, Cho et al. (2021).
Speech recognition research typically evaluates and compares systems based on the word error rate
(WER) metric. To calculate WER, the reference and recognized transcripts are aligned. This is accomplished by minimizing the Levenshtein distance (or edit distance) between the two texts. Following this, the number of substitution (S), insertion (I), and deletion (D) errors is counted. In its simplest form, the WER is the proportion of the total number of errors over the number of words in the reference. In a transcription output where all the words are correct, the WER would be zero, which is the value we aim to drive marFM®'s performance towards. The WER is calculated using the following equation:
WER = (S + D + I) / N    (1)
where N is the total number of words in the reference, S the number of substituted words, D the number of deleted words, and I the number of inserted words in the hypothesis with respect to the reference.
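As a worked example of Eq.(1): with a ten-word reference, one substitution, one deletion and one insertion yield WER = (1 + 1 + 1) / 10 = 0.30. The sketch below computes the same value with the jiwer library, which is one possible implementation choice and not necessarily the tooling used in this work.

```python
# Worked WER example for Eq.(1): a 10-word reference with 1 substitution
# ("alpha" -> "alfa"), 1 deletion ("to") and 1 insertion ("now") gives WER = 0.30.
import jiwer

reference  = "this is station alpha requesting permission to enter the fairway"
hypothesis = "this is station alfa requesting permission enter the fairway now"

print(f"WER = {jiwer.wer(reference, hypothesis):.2f}")  # -> WER = 0.30
```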
4. Results
In this study, we compared the multilingual speech recognition performance between four different
models including marFM® on the test split of our custom maritime database which contains 6 hours of
real VHF radio calls. Wav2Vec2-XLSR-53 German model performance was evaluated both with and
without an additional language model (LM). Other models performed the multilingual transcription
task without the attachment of any external language model.
The results of the multilingual audio transcription task are presented in Table I. Each model used the capacity of a single shared GPU (Nvidia RTX A6000) while generating its predictions, i.e., transcriptions. The total inference time of the models added up to approximately 19 minutes.
As expected, the positive influence of attaching a language model to the base model, Wav2Vec2-XLSR-53 German, was apparent in the model's transcription performance. With the help of the language model, its WER improved by approximately 4 percentage points. The base model without any language model attached ended up performing the worst among the models. Considering that the dataset the model was trained on consisted mostly of everyday speech recordings, the relatively poor performance of the base model was expected. However, even though it was also trained on data with similar characteristics, the Whisper large-v2 model outperformed both XLSR models. We observed an improvement of close to 7 percentage points in WER compared to the base model and about 3 percentage points compared to the XLSR model with the language model extension. Although both Wav2Vec2 and Whisper are transformer-based architectures, the Whisper implementation demonstrated a certain level of superiority regarding transcription accuracy on the multilingual VHF calls.
Table I: Results of multilingual speech recognition performance among different ASR models
Model                                 Word Error Rate (WER%)
Wav2Vec2-XLSR-53 DE                   44.36
Wav2Vec2-XLSR-53 DE w/ LM             40.58
Whisper Large V2                      37.62
marFM®                                31.59
Despite the strong performance of the Whisper large-v2 model, marFM® produced the most accurate transcriptions among the models. With a WER of 31.59%, marFM® shows a significant improvement over its closest contender, the Whisper large-v2 model, with a WER of 37.62%. As hypothesized, the fine-tuning process elevated the model's transcription performance, reducing the WER by approximately 13 percentage points in comparison to the transcription accuracy of the base model, Wav2Vec2-XLSR-53 German.
5. Conclusion and Future Work
In this study, we introduced marFM®, a multilingual automatic speech recognizer for maritime communications. We also showcased its superior performance on the multilingual transcription task for VHF radio calls in comparison with other open-source and readily available ASR models.
One of the main challenges in developing an ASR system for the maritime domain was collecting real maritime data with corresponding transcriptions for our fine-tuning process. Over the span of multiple projects, we were able to gather a total of 62 hours of maritime recordings. Thanks to state-of-the-art ASR architectures such as Wav2Vec2 and XLSR, we were able to develop marFM® and to provide an ASR system that transcribes maritime calls with higher accuracy than the other open-source, state-of-the-art ASR systems available.
As the experimental results have demonstrated, the addition of a language model can help Wav2Vec2-based ASR models transcribe with higher accuracy. Thus, the performance of marFM® is expected to increase once we attach a language model trained on a language dataset expanded with a maritime-related corpus. To this end, we have initiated a development process in which we started training our own language model with this maritime expansion.
Among the non-fine-tuned models, the Whisper large-v2 model showed the best transcription performance. Thus, we are of the opinion that it would be a worthwhile research effort to develop a Whisper-based maritime ASR in order to evaluate whether we can achieve higher accuracy levels with such an implementation for maritime communication. To this end, we have also initiated a side research project in which we explore the capabilities of Whisper in the maritime domain further.
References
ALAGHA, N.; LØGE (2023), Opportunities and challenges of maritime VHF data exchange systems,
Int. J. Satellite Communications and Networking 41/2, pp.99-101
CHO, H.; LEE, J.; KIM, J.; LEE, H.; LEE, K. (2021), Whisper Large v2: An effective end-to-end ASR
for whispered speech
CHOROWSKI, J.; WEISS, R. J.; BENGIO, S.; VAN DEN OORD, A. (2019), Unsupervised speech
representation learning using wavenet autoencoders, IEEE/ACM Trans. Audio, Speech, And
Language Processing 27/12, pp.2041-2053
CONNEAU, A.; KHANDELWAL, K.; GOYAL, N.; CHAUDHARY, V.; WENZEK, G.; GUZMÁN,
F.; GRAVE, E.; OTT, M.; ZETTLEMOYER, L. (2020), Unsupervised Cross-lingual Representation
Learning at Scale, 58th Annual Meeting of the Association for Computational Linguistics, pp.8440-
8451
DOMINGUEZ-PÉRY, C.; VUDDARAJU, L.N.R.; CORBETT-ETCHEVERS, I.; TASSABEHJI, R.
(2021), Reducing maritime accidents in ships by tackling human error: a bibliometric review and re-
search agenda, J. Shipp. Trd. 6/1
GALES, M.; YOUNG, S. (2008), Application of Hidden Markov Models in Speech Recognition, Now
Foundations and Trends
GEORGESCU, A.L.; PAPPALARDO, A.; CUCU, H.; BLOTT, M. (2021), Performance vs. hardware
requirements in state-of-the-art automatic speech recognition, J. Audio Speech Music Proc. 2021/1
GOUDSMIT, J.; SCHAVEMAKER, J.; JANSEN, S. (2016), Robust maritime speech recognition with
recurrent neural networks, IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP),
pp.5570-5574
GRAVES, A.; FERNÁNDEZ, S.; GOMEZ, F.; SCHMIDHUBER, J. (2006), Connectionist temporal
classification: labelling unsegmented sequence data with recurrent neural networks, 23rd Int. Conf.
Machine Learning, pp.369-376
GRAVES, A.; WAYNE, G.; DANIHELKA, I. (2014), Neural turing machines, arXiv preprint
GROUT, I. (2008), Digital Systems Design with FPGAs and CPLDs, Newnes, pp.475-536
HE, K.; ZHANG, X.; REN, S.; SUN, J. (2016), Deep residual learning for image recognition, IEEE
Conf. Computer Vision and Pattern Recognition, pp.770-778
HANNUN, A.; CASE, C.; CASPER, J.; CATANZARO, B.; DIAMOS, G.; ELSEN, E.; PRENGER,
R.; SATHEESH, S.; SENGUPTA, S.; COATES, A.; NG, A. (2014), Deep speech: Scaling up end-to-
end speech recognition, arXiv preprint
HUGGINGFACE (2023), Transformers: State-of-the-art Natural Language Processing,
https://huggingface.co/
IMO (1974), International Convention for the Safety of Life at Sea (SOLAS), Int. Mar. Org., London
IMO (2001), IMO Standard Marine Communication Phrases, Int. Mar. Org., London
ITU (2022), Technical characteristics for a VHF data exchange system in the VHF maritime mobile
band: M Series Mobile, radiodetermination, amateur and related satellite services
JOHN, P.; BROOKS, B.; SCHRIEVER, U. (2017), Profiling maritime communication by non-native
speakers: A quantitative comparison between the baseline and standard marine communication
phraseology, English for Specific Purposes, vol. 47, pp.1-14
KLAKOW, D.; PETERS, J. (2002), Testing the correlation of word error rate and perplexity, Speech
Communication 38/1-2, pp.19-28
LI, J.; CAO, J.; ZHANG, Y.; LIU, X.; ZHAO, J.; LIU, X. (2021), An Investigation of Pretrained Cross-
Lingual Models for Maritime Domain Automatic Speech Recognition, Sensors 21/7, pp.2561
LI, J. (2022), Recent advances in end-to-end automatic speech recognition, APSIPA Trans. Signal and
Information Processing 11/1
McFEE, B.; RAFFEL, C.; LIANG, D.; ELLIS, D.P.; MCVICAR, M.; BATTENBERG, E.;
PARASCANDOLO, G. (2015), Librosa: Audio and music signal analysis in python, 14th Python in
Science Conf., pp.18-25
MENG, Y. S.; LEE, Y. H.; NG, B. C. (2009), Further Study of Rainfall Effect on VHF Forested Radio-
Wave Propagation With Four-Layered Model, PIER, vol. 99, pp.149-161
MOCKEL, S.; BRENKER, M.; STROHSCHNEIDER, S. (2014), Enhancing Safety through Generic
Competencies, TransNav 8/1, pp.97-102
PARASKEVOPOULOS, G.; PARTHASARATHY, S.; KHARE, A.; SUNDARAM, S. (2020),
Multimodal and Multiresolution Speech Recognition with Transformers, 58th Annual Meeting of the
Association for Computational Linguistics, pp.2381-2387
PARK, D.H.; HAN, K.; LEE, J.H. (2020), Listen, attend, and spot: A framework for end-to-end keyword
spotting with recurrent neural networks, Neural Networks 128, pp.29-40
PASZKE, A.; GROSS, S.; MASSA, F.; LERER, A.; BRADBURY, J.; CHANAN, G.; DESMAISON,
A. (2019), PyTorch: An imperative style, high-performance deep learning library, Advances in Neural
Information Processing Systems, pp.8024-8035
RADFORD, A.; KIM, J.; XU, T.; BROCKMAN, G.; MCLEAVEY, C.; SUTSKEVER, I. (2022),
Robust Speech Recognition via Large-Scale Weak Supervision
SAINBURG, T.; THIELK, M.; GENTNER, T.Q. (2020), Finding, visualizing, and quantifying latent
structure across diverse animal vocal repertoires, PLoS Computational Biology, Public Library of
Science, pp.1-48
SCHNEIDER, S.; BAEVSKI, A.; COLLOBERT, R.; AULI, M. (2019), wav2vec: Unsupervised pre-
training for speech recognition, arXiv preprint
UDAGAWA, T.; SUZUKI, M.; KURATA, G.; ITOH, N.; SAON, G. (2022), Effect and analysis of
large-scale language model rescoring on competitive ASR system, arXiv preprint
VAIBHAV, H.; PADMAPRIYA, M. (2022), Methods to Optimize Wav2Vec with Language Model for
Automatic Speech Recognition in Resource Constrained Environment, 19th Int. Conf. Natural
Language Processing (ICON), pp.149-153
WANG, X.; ZHU, L.; WAN, Y. (2021), Speaker diarization with connectionist temporal classification,
IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), pp.4956-4960
YADAV, H.; SITARAM S. (2022), A Survey of Multilingual Models for Automatic Speech Recognition
YU, K.; CHEN, K.; WANG, W. (2019), Keyword detection from maritime distress calls with
convolutional and recurrent neural networks, IEEE Access, 7, pp.87333-87341
ZHAO, J.; ZHANG, W. Q. (2022), Improving Automatic Speech Recognition Performance for Low-
Resource Languages with Self-Supervised Models, IEEE J. Selected Topics in Signal Processing 16/6,
pp.1227-1241