Adaptation and Optimization of Automatic Speech Recognition (ASR) for
the Maritime Domain in the Field of VHF Communication
Emin Çağatay Nakilcioğlu, Fraunhofer CML, Hamburg/GERMANY, emin.nakilcioglu@cml.fraunhofer.de
Maximilian Reimann, Fraunhofer CML, Hamburg/GERMANY, maximilian.reimann@cml.fraunhofer.de
Ole John, Fraunhofer CML, Hamburg/GERMANY, ole.john@cml.fraunhofer.de
Abstract
This paper introduces a multilingual automatic speech recognizer (ASR) for maritime radio communication that automatically converts received VHF radio signals into text. The challenges of maritime radio communication are described first, and the deep learning architecture of marFM®, consisting of audio processing techniques and machine learning algorithms, is presented. Subsequently, maritime radio data of interest is analyzed and then used to evaluate the transcription performance of our ASR model for various maritime radio data.
1. Introduction
Maritime communication is a critical aspect of global trade and transportation, enabling ships to communicate with one another and with shore-based facilities. VHF stands for Very High Frequency and is a commonly used radio communication technology in the maritime domain, providing reliable voice communication. VHF is a band of electromagnetic radiation used in maritime radio communications mainly in the frequency range from 156.025 MHz to 162.025 MHz, ITU (2022). VHF radio is used for ship-to-ship, ship-to-shore (and vice versa) and onboard communications, Alagha and Løge (2023). Due to its long range, simplicity of use, and technological robustness in application, VHF communication has established itself as a useful instrument over decades and became part of the mandatory equipment on board according to SOLAS, IMO (1974). However, using VHF radio technology for communication poses several obstacles. In addition to the background noise that is a characteristic feature of VHF radio, the noise on board adds to the difficulty and can significantly impair speech intelligibility. In the maritime context, there are many sources of background noise, such as the noise of machinery or the sound of the sea. The signal quality also depends strongly on the weather. Weather influences on the signal quality include fog, rain, or snow, whereby rain in particular reduces the power of the radio wave and can thus lead to signal interruptions, Meng et al. (2009). A poor connection can also affect intelligibility. If the signal between the vessel and the coast station is weak or affected by sources of interference in the vicinity, distortion and interruptions in the radio call may occur. In addition to the factors mentioned so far, which are primarily technological and have a negative impact on the reception quality of the radio signal, there are other factors that affect the linguistic characteristics of maritime radio communication.
Another challenge is the international composition of ships' crews with their different language levels. Language diversity poses a challenge for speech recognition because it is difficult to develop models that cover all possible language variations. As people from different countries and language regions work together at sea, language barriers can arise, making communication more difficult. Since English is the main language of communication, a variety of different dialects and accents occur in radio communication, John et al. (2017). Depending on the level of English proficiency and the strength of the respective accents, this circumstance has a great impact on the intelligibility of maritime radio communications, especially if speakers are not used to talking to people from other regions of the world. To reduce the problem of language barriers and to mitigate the occurrence of misunderstandings, the IMO introduced the Standard Marine Communication Phrases (SMCP) in 2001, which replaced the Standard Marine Navigational Vocabulary (SMNV) of 1977. The SMCP is a framework containing phrases for routine situations and standard responses for emergency situations, IMO (2001). In practice, these phrases, which aim to reduce language barriers and avoid misunderstandings in maritime communication, see little use, Mockel et al. (2014).
Despite its shortcomings, VHF radio is an indispensable tool for communication between ships and shore stations in the maritime domain and provides a fast, easy and reliable way to transmit information between different stakeholders. Automatic speech recognition technologies hold great potential to address the challenges in the field of VHF radio communication.
Automatic speech recognition (ASR) is an increasingly important component of human-computer interaction since it is a technology that enables machines to transcribe human speech into text. This technology has made considerable progress in recent years and has also found application in the end-consumer sector. Examples of areas where ASR is used today include voice control, voice input, automated customer support systems and many more. ASR is based on the ability of machines to recognize and interpret spoken language, calculating the most likely spoken sentence based on audio signals. The technology uses complex algorithms and mathematical models to capture the acoustic signals of speech and convert them into digital formats that can be processed.
In the field of maritime communication, deep learning-based ASR has been used to improve the accuracy of speech recognition. Various models have been proposed for different applications, including the use of RNNs for decoding maritime VHF communications, Goudsmit et al. (2016), and the use of CNNs and RNNs for detecting keywords in distress calls, Yu et al. (2019). However, there is still a need for more accurate and efficient ASR models for the maritime domain.
The introduction of transfer learning has also shown promising results in ASR. Transfer learning involves using a pre-trained model and fine-tuning it on a specific task. This approach has been used in various fields, including computer vision, He et al. (2016), and natural language processing, Peters et al. (2018). In the field of ASR, transfer learning has been used to improve the performance of models on low-resource languages, Zhao and Zhang (2022), and tasks, Chorowski et al. (2019).
The Wav2Vec2 model, introduced by Baevski et al. (2020), is a state-of-the-art ASR model that utilizes self-supervised learning to learn speech representations from raw audio. The model uses a contrastive predictive coding (CPC) objective to learn contextualized representations of audio, which can then be used for downstream tasks such as ASR. Fine-tuning the Wav2Vec2 model on specific speech recognition tasks has shown significant improvements in performance, Baevski et al. (2020), Schneider et al. (2019).
Considering the international nature of maritime communication, ships and ports from different countries often communicate with each other, and a variety of languages are used in maritime communication. A multilingual ASR system would enable efficient and accurate communication between speakers of different languages, improving safety and efficiency in the maritime domain.
In recent years, the development of multilingual ASR models has gained significant attention. Building on recent advances in deep learning, large-scale pre-trained models such as the Cross-lingual Language Model Pre-training (XLM) and XLSR models were introduced and have shown great potential for improving the performance of ASR systems. The XLSR model, in particular, is a pre-trained model that has been trained on a large corpus of multilingual data and has been shown to be effective in low-resource languages and domains. For example, in a recent study by Conneau et al. (2020), the XLSR models were pre-trained on a large corpus of speech data in dozens of languages and achieved state-of-the-art performance on several benchmarks for cross-lingual speech recognition.
In the context of maritime communication, where there is a need for accurate and efficient ASR models,
the use of XLSR models can be beneficial due to their ability to handle multilingual and low-resource
data. The XLSR model has been shown to be effective in various multilingual ASR tasks, such as
speech-to-text translation and code-switching speech recognition. In a recent study by Li et al. (2021),
the authors developed a multilingual ASR system for the maritime domain by fine-tuning the XLSR
model on a custom maritime audio dataset. The study demonstrated the effectiveness of the XLSR
model in improving the performance of ASR in the maritime domain.
In this paper, we introduce marFM®, a multilingual ASR system for the maritime domain. Our approach
is similar to transfer learning, as we use a pretrained XLSR model to initialize the weights of our model
and then fine-tune it on our custom maritime audio database, which consists of maritime audio recordings in English and German. By doing so, we aim to improve the accuracy and efficiency of ASR for
maritime communication in multiple languages simultaneously and thus to facilitate safer and more
efficient communication between vessels of different nationalities.
2. Methodology
This chapter lays out the underlying methodology applied in developing and training a multilingual ASR model for multilingual maritime communication. As shown in Fig.1, our methodology consists of two main steps: data preprocessing and fine-tuning of a pretrained ASR model. In this section, the details of the fine-tuning process are presented, and our custom maritime database and the data preprocessing steps are discussed in detail.
Fig.1: Preprocessing and fine-tuning of ASR-model
2.1 Supervised Fine-Tuning
Our approach in this work is to use a Wav2Vec2-XLSR-53 model pretrained on large-scale German data as our base model and to fine-tune it on our custom maritime database. The pretrained model was acquired through the Hugging Face Hub. Hugging Face provides an open-source software library that offers easy-to-use interfaces for popular NLP and speech models, including Wav2Vec2 and XLSR, HuggingFace (2023). It allows users to easily access and fine-tune these models on their own datasets. Hugging Face also provides pre-trained models that can be used as a starting point for fine-tuning. These pre-trained models are trained on large-scale datasets, which makes them suitable for transfer learning on smaller datasets. In our work, we utilized the Wav2Vec2-XLSR-53 German model from Hugging Face as our base model for fine-tuning on our maritime audio dataset.
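The exact loading code is not given in the paper; the following is a minimal sketch, assuming the Hugging Face Transformers API and a publicly available German XLSR-53 checkpoint (the model ID below is illustrative, not necessarily the checkpoint the authors used).

```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Illustrative checkpoint name; the paper only states that a German
# Wav2Vec2-XLSR-53 model from the Hugging Face Hub was used.
MODEL_ID = "facebook/wav2vec2-large-xlsr-53-german"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Freezing the CNN feature encoder during fine-tuning is common practice
# for Wav2Vec2 models (an assumption here, not stated in the paper).
model.freeze_feature_encoder()
```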
The base architecture of the Wav2Vec2-XLSR-53 German model follows the XLSR approach, which builds on the wav2vec 2.0 architecture and is designed to learn cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages, Conneau et al. (2020). As shown in Fig.2, the XLSR approach consists of a multi-layer convolutional neural network (CNN) encoder followed by a Transformer network. The CNN encoder extracts latent speech features from the input waveform; quantized versions of these features then act as targets for training the Transformer using contrastive learning. Like the self-supervised training of wav2vec 2.0, XLSR only requires raw unlabelled speech audio in multiple languages. The Wav2Vec2-XLSR-53 German model starts from the multilingual XLSR-53 checkpoint and is fine-tuned on the Common Voice German dataset, Conneau et al. (2020).
Fig.2: The XLSR approach on multilingual speech recognition, Conneau et al. (2020)
To fine-tune the Wav2Vec2-XLSR-53 model for the maritime domain, we used a custom maritime audio database that was collected by Fraunhofer CML. The database contains various types of maritime radio communication, such as distress calls, navigational messages, weather reports and situational reports. More detail regarding the database is provided in the following sub-section.
For fine-tuning, we followed a transfer learning approach. We initialized the model with the pre-trained weights and fine-tuned it on our dataset using the Connectionist Temporal Classification (CTC) loss function. The CTC loss is a popular objective for training ASR models, where the model learns to align the predicted output sequence with the ground truth by maximizing the log-likelihood of the correct transcription, Graves et al. (2006). This approach has been shown to be effective in various ASR tasks, including phone recognition, Graves et al. (2006), keyword spotting, Park et al. (2020), and speaker diarization, Wang et al. (2021). In our work, we use the CTC loss to train our multilingual ASR model to predict character sequences from audio waveforms. We use the CTC loss implementation provided by the PyTorch framework, Paszke et al. (2019). During the fine-tuning process, we used a learning rate of 3e-5 and a batch size of 8. We also used early stopping to prevent overfitting and reduce training time. The training for fine-tuning on a single GPU (Nvidia RTX A6000) took approximately 34 hours in total.
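The paper does not list the training code; the sketch below shows one plausible way to realize the stated configuration (learning rate 3e-5, batch size 8, early stopping) with the Hugging Face Trainer API. The dataset objects, the CTC padding collator, and the model/processor from the previous step are assumed to exist and are not part of the published material.

```python
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="marfm-finetuned",            # placeholder output path
    per_device_train_batch_size=8,           # batch size stated in the text
    learning_rate=3e-5,                      # learning rate stated in the text
    evaluation_strategy="steps",
    save_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",       # early stopping on validation loss (assumed)
    greater_is_better=False,
)

trainer = Trainer(
    model=model,                             # Wav2Vec2ForCTC computes the CTC loss internally
    args=training_args,
    train_dataset=train_set,                 # assumed preprocessed dataset objects
    eval_dataset=val_set,
    data_collator=data_collator,             # assumed CTC padding collator
    tokenizer=processor.feature_extractor,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```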
2.2 Database and Preprocessing
For this project, we at Fraunhofer CML created our own custom maritime database by gathering maritime audio recordings. This database consists of about 62 hours of audio recordings of real VHF radio conversations in English and German, together with their respective transcriptions.
Since the pretrained XLSR model was trained on data with a sampling rate of 16 kHz, the fine-tuning data needed to be converted into the same format so that the convolutional filters could operate on the same timescale. To this end, every audio recording in the database was resampled to 16 kHz. Preprocessing was done using the librosa library in Python, McFee et al. (2015).
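A minimal sketch of this resampling step, assuming librosa and soundfile are available; the file paths are placeholders.

```python
import librosa
import soundfile as sf

TARGET_SR = 16_000  # sampling rate expected by the XLSR feature encoder

def resample_to_16k(in_path: str, out_path: str) -> None:
    # librosa resamples on load when a target sampling rate is given
    audio, sr = librosa.load(in_path, sr=TARGET_SR, mono=True)
    sf.write(out_path, audio, sr)
```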
As mentioned in Section 1, another main challenge regarding VHF radio recordings is the heavy background noise that is ever present regardless of the type of VHF radio hardware. Depending on the range of the VHF radio receivers, the hardware quality and the distance between the communicators, the prominence of the background noise differs between recordings. To address background noise with different characteristics, we applied non-stationary noise gating to the raw audio data. The noisereduce library in Python was used for noise gating/reduction, Sainburg et al. (2020); a minimal usage sketch is given after the list of steps below. The library offers two different noise reduction algorithms: stationary and non-stationary. The main difference between the algorithms is that non-stationary noise reduction allows the noise gate to change over time by using a sliding window in which the noise statistics are recomputed every time the window is moved to an adjacent part of the audio clip. The working schematics of the stationary and non-stationary noise reduction are shown in Fig.3. The following steps are applied during the noise reduction process:
- First, a spectrogram is calculated over the signal.
- A time-smoothed version of the spectrogram is computed using an IIR filter applied forward and backward on each frequency channel, Grout (2008).
- A mask is computed based on the time-smoothed spectrogram.
- The mask is smoothed with a filter over frequency and time.
- The mask is applied to the spectrogram of the signal, and the masked spectrogram is inverted back into a waveform.
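A minimal usage sketch of the non-stationary noise gate, based on the public noisereduce API; the input file name is a placeholder and no project-specific parameter values are implied.

```python
import librosa
import noisereduce as nr

audio, sr = librosa.load("vhf_call.wav", sr=16_000)   # placeholder file name

# stationary=False selects the sliding-window algorithm, so the noise
# estimate is recomputed as the window moves over the clip.
denoised = nr.reduce_noise(y=audio, sr=sr, stationary=False)
```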
The other preprocessing steps, text correction and clearing, were carried out in parallel. The objective behind text correction and clearing was to remove all characters that do not contribute to the meaning of a word and cannot be properly represented by an acoustic sound, and thus to normalize the text. A quick investigation revealed that our transcriptions contained some punctuation marks which do not correspond to any characteristic sound unit, i.e., phoneme. For example, the letter "d" has its own phonetic sound, whereas the punctuation mark "-" cannot be represented as a phoneme. To finalize the text clearing stage, we consulted a native German speaker to find out whether any further simplification and clearing of the transcriptions was possible. We were informed that certain vowels that should be written with an umlaut, such as ä, ö and ü, had been written according to the English alphabet. Thus, as the final step of the preprocessing, we rewrote these vowels according to the German alphabet.
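A simplified sketch of the described text normalization; the character set and the word-level umlaut corrections shown here are illustrative assumptions, not the project's actual replacement lists.

```python
import re

# Characters assumed to have no phonetic counterpart (illustrative set only)
CHARS_TO_REMOVE = re.compile(r'[,?.!;:()"-]')

# Example word-level corrections for umlauts written with the English
# alphabet; the real replacement list was derived from the transcripts.
UMLAUT_FIXES = {"faehre": "fähre", "muendung": "mündung", "hoehe": "höhe"}

def clean_transcript(text: str) -> str:
    text = CHARS_TO_REMOVE.sub("", text).lower()
    words = [UMLAUT_FIXES.get(word, word) for word in text.split()]
    return " ".join(words)
```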
3. Experiments
For the experimental setup, we split the dataset into training and test sets with a ratio of 90:10, which corresponds to ca. 56 h of audio for training and 6 h of audio for testing. The training dataset was further divided into two splits, train and validation, with a ratio of 80:20, which adds up to ca. 45 h of audio for training and 11 h of audio for validation. In summary, 6 hours of maritime recordings consisting of real VHF radio calls in both English and German were used to evaluate the model performance.
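One way to reproduce these splits, assuming the audio and transcripts are held in a Hugging Face datasets object; the seed and the API choice are assumptions.

```python
from datasets import Dataset

def make_splits(full_dataset: Dataset, seed: int = 42):
    # 90:10 split into train+validation and held-out test (≈ 56 h / 6 h)
    outer = full_dataset.train_test_split(test_size=0.10, seed=seed)
    # 80:20 split of the remaining data into train and validation (≈ 45 h / 11 h)
    inner = outer["train"].train_test_split(test_size=0.20, seed=seed)
    return inner["train"], inner["test"], outer["test"]  # train, validation, test
```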
Three different models were considered for the comparative performance analysis. Our base model, Wav2Vec2-XLSR-53 German, was included in the experiments in order to see the full effect of fine-tuning on the model's performance. We also evaluated the performance of Wav2Vec2-XLSR-53 German with a language model. As shown in Udagawa et al. (2022), Vaibhav and Padmapriya (2022) and on model evaluation pages of the Hugging Face Hub, attaching a language model to a Wav2Vec2 model has a positive effect on the decoding performance of ASR models. Thus, we also included Wav2Vec2-XLSR-53 German with a language model trained on English and German phrases.
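The paper does not specify how the language model was attached; a common option (an assumption here, not the authors' confirmed setup) is CTC beam-search decoding with an n-gram model via pyctcdecode, wrapped in the Transformers Wav2Vec2ProcessorWithLM. The KenLM file name below is hypothetical.

```python
from pyctcdecode import build_ctcdecoder
from transformers import Wav2Vec2ProcessorWithLM

# Vocabulary tokens sorted by their IDs, as expected by pyctcdecode
vocab = processor.tokenizer.get_vocab()                     # processor from Section 2.1
labels = [tok for tok, _ in sorted(vocab.items(), key=lambda item: item[1])]

# "maritime_de_en.arpa" is a hypothetical KenLM n-gram model trained on
# English and German phrases; the real model is not published with the paper.
decoder = build_ctcdecoder(labels=labels, kenlm_model_path="maritime_de_en.arpa")

processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder,
)
# processor_with_lm.batch_decode(logits) then replaces plain argmax decoding.
```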
In late 2022, a general-purpose speech recognition model called Whisper was introduced by the OpenAI research lab, Radford et al. (2022). It is trained on a large dataset of diverse audio, 680,000 hours of multilingual and multitask supervised data collected from the web, and is a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.
Fig.3: Stationary vs. non-stationary noise reduction, Sainburg and Gentner (2021)
Whisper's zero-shot performance was measured across many diverse datasets, and it showed much more robustness and made fewer errors than other state-of-the-art models. We concluded that the inclusion of a Whisper-based model in our experimental setup would be a worthwhile addition. Thus, in the evaluation phase, we also considered the transcription performance of the largest available Whisper model, Whisper large-v2, which is the biggest model in size and was trained for 2.5x more epochs with added regularization for improved performance compared to the Whisper large model, Cho et al. (2021).
Speech recognition research typically evaluates and compares systems based on the word error rate (WER) metric. To calculate the WER, the reference and the recognized transcript are aligned. This is accomplished by minimizing the Levenshtein distance (or edit distance) between the two texts. Following this, the number of substitution (S), insertion (I), and deletion (D) errors can be counted. In its simplest form, the WER is the proportion of the total number of errors over the number of words in the reference. In a transcription output where all words are correct, the WER would be zero, which is the value we aim to drive marFM®'s performance towards. The WER is calculated using the following equation:

$$\mathrm{WER} = \frac{S + D + I}{N} \qquad (1)$$

where N is the total number of words in the reference, D the number of deleted words in the hypothesis with respect to the reference, S the number of substituted words, and I the number of inserted words.
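For illustration, Eq.(1) can be computed with the jiwer library, which performs the Levenshtein alignment and error counting described above; the example strings are invented.

```python
import jiwer

reference = "vessel alpha this is hamburg port control over"     # invented example
hypothesis = "vessel alpha this hamburg sport control over"

error_rate = jiwer.wer(reference, hypothesis)   # (S + D + I) / N as in Eq.(1)
print(f"WER = {error_rate:.2%}")
```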
4. Results
In this study, we compared the multilingual speech recognition performance of four different models, including marFM®, on the test split of our custom maritime database, which contains 6 hours of real VHF radio calls. The Wav2Vec2-XLSR-53 German model was evaluated both with and without an additional language model (LM). The other models performed the multilingual transcription task without any external language model attached.
The results of the multilingual audio transcription task are presented in Table I. Each model used a shared capacity of a single GPU (Nvidia RTX A6000) while generating its predictions, i.e., transcriptions. The total inference time of the models added up to approximately 19 minutes.
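The decoding procedure itself is not spelled out in the paper; the sketch below shows the standard greedy CTC inference pattern for Wav2Vec2 models, which we assume was used in a comparable form.

```python
import torch

def transcribe(audio_array, processor, model, sampling_rate=16_000):
    # Feature extraction and padding for a single utterance
    inputs = processor(audio_array, sampling_rate=sampling_rate,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # Greedy (argmax) CTC decoding followed by conversion of IDs to text
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```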
As expected, the positive influence of attaching a language model to the base model, Wav2Vec2-XLSR-53 German, was apparent in the model's transcription performance. With the help of the language model, its transcription accuracy increased by approximately 4%. The base model without any language model attached performed the worst among the models. Considering that the dataset the model was trained on consisted mostly of everyday speech recordings, the relatively poor performance of the base model was expected. However, even though it was trained on data with similar characteristics, the Whisper large-v2 model outperformed both XLSR models. We observed close to a 7% improvement in accuracy compared to the base model and about 3% compared to the XLSR model with the language model extension. Although both architectures, Wav2Vec2 and Whisper, are transformer-based, the Whisper implementation demonstrated a certain level of superiority regarding transcription accuracy on the multilingual VHF calls.
Table I: Results of multilingual speech recognition performance among different ASR models

Model                           | Word Error Rate (WER%)
--------------------------------|-----------------------
Wav2Vec2-XLSR-53 DE             | 44.36
Wav2Vec2-XLSR-53 DE w/ LM       | 40.58
Whisper Large V2                | 37.62
marFM®                          | 31.59
Despite the high transcription accuracy of the Whisper large-v2 model, marFM® produced the most accurate transcriptions among the models. With a WER of 31.59%, marFM® shows a significant improvement over its closest contender, the Whisper large-v2 model with a WER of 37.62%. As hypothesized, the fine-tuning process elevated the model's transcription performance, with a reduction of approximately 13 percentage points in WER compared to the transcription accuracy of the base model, Wav2Vec2-XLSR-53 German.
5. Conclusion and Future Work
In this study, we introduced marFM®, a multilingual automatic speech recognizer for maritime communications. We also showcased its superior performance on the multilingual transcription task for VHF radio calls in comparison with other open-source and readily available ASR models.
One of the main challenges in developing an ASR system for the maritime domain was collecting real maritime data with corresponding transcriptions for our fine-tuning process. Over the span of multiple projects, we were able to gather a total of 62 hours of maritime recordings. Thanks to state-of-the-art ASR architectures such as Wav2Vec2 and XLSR, we were able to develop marFM® and to provide an ASR system that transcribes maritime calls with higher accuracy than the other open-source, state-of-the-art ASR systems available.
As the experimental results have demonstrated, the addition of a language model can help Wav2Vec2-based ASR models transcribe more accurately. Thus, the performance of marFM® is expected to increase once we attach a language model trained on a language dataset expanded with a maritime-related corpus. To this end, we have initiated a development process in which we are training our own language model on such a maritime-expanded dataset.
Among the non-finetuned models, the Whisper large-v2 model showed the best transcription performance. Thus, we are of the opinion that it would be a worthwhile research effort to develop a Whisper-based maritime ASR model in order to evaluate whether higher accuracy levels can be achieved with such an implementation for maritime communication. To this end, we have also initiated a side research project in which we explore the capabilities of Whisper in the maritime domain further.
References
ALAGHA, N.; LØGE (2023), Opportunities and challenges of maritime VHF data exchange systems,
IJSC&N Special Issue - Satell Network 41/2, pp.99-101
CHO, H.; LEE, J.; KIM, J.; LEE, H.; LEE, K. (2021), Whisper Large v2: An effective end-to-end ASR
for whispered speech
CHOROWSKI, J.; WEISS, R. J.; BENGIO, S.; VAN DEN OORD, A. (2019), Unsupervised speech
representation learning using wavenet autoencoders, IEEE/ACM Trans. Audio, Speech, And
Language Processing 27/12, pp.2041-2053
CONNEAU, A.; KHANDELWAL, K.; GOYAL, N.; CHAUDHARY, V.; WENZEK, G.; GUZMÁN,
F.; GRAVE, E.; OTT, M.; ZETTLEMOYER, L. (2020), Unsupervised Cross-lingual Representation
Learning at Scale, 58th Annual Meeting of the Association for Computational Linguistics, pp.8440-
8451
DOMINGUEZ-PÉRY, C.; VUDDARAJU, L.N.R.; CORBETT-ETCHEVERS, I.; TASSABEHJI, R.
(2021), Reducing maritime accidents in ships by tackling human error: a bibliometric review and re-
search agenda, J. Shipp. Trd. 6/1
GALES, M.; YOUNG, S. (2008), Application of Hidden Markov Models in Speech Recognition, Now
Foundations and Trends
GEORGESCU, A.L.; PAPPALARDO, A.; CUCU, H.; BLOTT, M. (2021), Performance vs. hardware
requirements in state-of-the-art automatic speech recognition, J. Audio Speech Music Proc. 2021/1
GOUDSMIT, J.; SCHAVEMAKER, J.; JANSEN, S. (2016), Robust maritime speech recognition with
recurrent neural networks, IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP),
pp.5570-5574
GRAVES, A.; FERNÁNDEZ, S.; GOMEZ, F.; SCHMIDHUBER, J. (2006), Connectionist temporal
classification: labelling unsegmented sequence data with recurrent neural networks, 23rd Int. Conf.
Machine Learning, pp.369-376
GRAVES, A.; WAYNE, G.; DANIHELKA, I. (2014), Neural turing machines, arXiv preprint
GROUT, I. (2008), Digital Systems Design with FPGAs and CPLDs, Newnes, pp.475-536
HE, K.; ZHANG, X.; REN, S.; SUN, J. (2016), Deep residual learning for image recognition, IEEE
Conf. Computer Vision and Pattern Recognition, pp.770-778
HANNUN, A.; CASE, C.; CASPER, J.; CATANZARO, B.; DIAMOS, G.; ELSEN, E.; PRENGER,
R.; SATHEESH, S.; SENGUPTA, S.; COATES, A.; NG, A. (2014), Deep speech: Scaling up end-to-
end speech recognition, arXiv preprint
HUGGINGFACE (2023), Transformers: State-of-the-art Natural Language Processing,
https://huggingface.co/
IMO (1974), International Convention for the Safety of Life at Sea (SOLAS), Int. Mar. Org., London
IMO (2001), IMO Standard Marine Communication Phrases, Int. Mar. Org., London
ITU (2022), Technical characteristics for a VHF data exchange system in the VHF maritime mobile
band: M Series Mobile, radiodetermination, amateur and related satellite services
JOHN, P.; BROOKS, B.; SCHRIEVER, U. (2017), Profiling maritime communication by non-native
speakers: A quantitative comparison between the baseline and standard marine communication
phraseology, English for Specific Purposes, vol. 47, pp.1-14
KLAKOW, D.; PETERS, J. (2002), Testing the correlation of word error rate and perplexity, Speech
Communication 38/1-2, pp.19-28
LI, J.; CAO, J.; ZHANG, Y.; LIU, X.; ZHAO, J.; LIU, X. (2021), An Investigation of Pretrained Cross-
Lingual Models for Maritime Domain Automatic Speech Recognition, Sensors 21/7, pp.2561
LI, J. (2022), Recent advances in end-to-end automatic speech recognition, APSIPA Trans. Signal and
Information Processing 11/1
McFEE, B.; RAFFEL, C.; LIANG, D.; ELLIS, D.P.; MCVICAR, M.; BATTENBERG, E.;
PARASCANDOLO, G. (2015), librosa: Audio and music signal analysis in Python, 14th Python in
Science Conf., pp.18-25
MENG, Y. S.; LEE, Y. H.; NG, B. C. (2009), Further Study of Rainfall Effect on VHF Forested Radio-
Wave Propagation With Four-Layered Model, PIER, vol. 99, pp.149-161
MOCKEL, S.; BRENKER, M.; STROHSCHNEIDER, S. (2014), Enhancing Safety through Generic
Competencies, TransNav 8/1, pp.97-102
PARASKEVOPOULOS, G.; PARTHASARATHY, S.; KHARE, A.; SUNDARAM, S. (2020),
Multimodal and Multiresolution Speech Recognition with Transformers, 58th Annual Meeting of the
Association for Computational Linguistics, pp.2381-2387
PARK, D.H.; HAN, K.; LEE, J.H. (2020), Listen, attend, and spot: A framework for end-to-end keyword
spotting with recurrent neural networks, Neural Networks 128, pp.29-40
PASZKE, A.; GROSS, S.; MASSA, F.; LERER, A.; BRADBURY, J.; CHANAN, G.; DESMAISON,
A. (2019), PyTorch: An imperative style, high-performance deep learning library, Advances in Neural
Information Processing Systems, pp.8024-8035
RADFORD, A.; KIM, J.; XU, T.; BROCKMAN, G.; MCLEAVEY, C.; SUTSKEVER, I. (2022),
Robust Speech Recognition via Large-Scale Weak Supervision
SAINBURG, T.; THIELK, M.; GENTNER, T.Q. (2020), Finding, visualizing, and quantifying latent
structure across diverse animal vocal repertoires, PLoS Computational Biology, Public Library of
Science, pp.1-48
SCHNEIDER, S.; BAEVSKI, A.; COLLOBERT, R.; AULI, M. (2019), wav2vec: Unsupervised pre-
training for speech recognition, arXiv preprint
UDAGAWA, T.; SUZUKI, M.; KURATA, G.; ITOH, N.; SAON, G. (2022), Effect and analysis of
large-scale language model rescoring on competitive ASR system, arXiv preprint
VAIBHAV, H.; PADMAPRIYA, M. (2022), Methods to Optimize Wav2Vec with Language Model for
Automatic Speech Recognition in Resource Constrained Environment, 19th Int. Conf. Natural
Language Processing (ICON), pp.149-153
WANG, X.; ZHU, L.; WAN, Y. (2021), Speaker diarization with connectionist temporal classification,
IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), pp.4956-4960
YADAV, H.; SITARAM S. (2022), A Survey of Multilingual Models for Automatic Speech Recognition
YU, K.; CHEN, K.; WANG, W. (2019), Keyword detection from maritime distress calls with
convolutional and recurrent neural networks, IEEE Access, 7, pp.87333-87341
ZHAO, J.; ZHANG, W. Q. (2022), Improving Automatic Speech Recognition Performance for Low-
Resource Languages with Self-Supervised Models, IEEE J. Selected Topics in Signal Processing 16/6,
pp.1227-1241