Article · PDF Available

Morph-based speech recognition and modeling of out-of-vocabulary words across languages

Abstract

We explore the use of morph-based language models in large-vocabulary continuous-speech recognition systems across four so-called morphologically rich languages: Finnish, Estonian, Turkish, and Egyptian Colloquial Arabic. The morphs are subword units discovered in an unsupervised, data-driven way using the Morfessor algorithm. By estimating n-gram language models over sequences of morphs instead of words, the quality of the language model is improved through better vocabulary coverage and reduced data sparsity. Standard word models suffer from high out-of-vocabulary (OOV) rates, whereas the morph models can recognize previously unseen word forms by concatenating morphs. It is shown that the morph models perform fairly well on OOVs without compromising the recognition accuracy on in-vocabulary words. The Arabic experiment is the only exception: there the standard word model outperforms the morph model. Differences in the datasets and the amount of data are discussed as a plausible explanation.
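As a concrete illustration of the idea in the abstract, the sketch below (a minimal example, not the authors' implementation; the segmented corpus, the word-boundary symbol `<w>`, and the add-one smoothing are all illustrative assumptions) shows how an n-gram model estimated over morph sequences can assign a probability to a word form that never occurred in the training text, as long as its morphs did.

```python
from collections import defaultdict
from math import exp, log

# Hypothetical training text, already segmented into morphs; '<w>' marks word boundaries.
segmented_corpus = [
    ["<w>", "talo", "ssa", "<w>", "talo", "i", "ssa", "<w>"],
    ["<w>", "auto", "ssa", "<w>", "auto", "i", "lla", "<w>"],
]

# Estimate a bigram model over morphs (add-one smoothing keeps the sketch short).
bigram = defaultdict(lambda: defaultdict(int))
vocab = set()
for seq in segmented_corpus:
    for prev, cur in zip(seq, seq[1:]):
        bigram[prev][cur] += 1
        vocab.update((prev, cur))

def logprob(prev, cur):
    return log((bigram[prev][cur] + 1) / (sum(bigram[prev].values()) + len(vocab)))

def score_word(morphs):
    """Log probability of one word, modeled as a morph sequence between boundary symbols."""
    seq = ["<w>"] + morphs + ["<w>"]
    return sum(logprob(p, c) for p, c in zip(seq, seq[1:]))

# 'autoissa' never occurs as a full word above, but its morphs do, so it still gets a score;
# a word-level model trained on the same text would treat it as out-of-vocabulary.
print(exp(score_word(["auto", "i", "ssa"])))
```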
... When subword tokens are used for language modeling, a dummy symbol is added to identify the positions where the tokens can be glued together to form words [10]. When subwords replace words, the ASR vocabulary contains morphemes, syllables, or other character sequences that together can be used to create an unlimited number of words [11,12]. ...
... However, to segment text into subword units, there are data-driven as well as linguistically informed algorithms [1,7]. Morfessor [11,17,18], byte pair encoding (BPE) [19,20], and Unigram [21] are a few data-driven algorithms in popular use. These algorithms do not ensure that subword tokenization happens at valid pronunciation boundaries. ...
... Subword token-based language modeling has been proposed for applications in speech recognition [7,11,12,22,23], statistical machine translation [24], neural machine translations [20,21] and handwriting recognition [25]. The choice of subword tokens used in language modeling impacts the performance of the model on many downstream tasks [26] including speech recognition [7]. ...
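To make the boundary-marker idea from these excerpts concrete, the following sketch glues recognizer output over subword tokens back into words. The `+` continuation marker and the token sequence are illustrative assumptions; several marker conventions are used in practice.

```python
def join_subwords(tokens):
    """Rejoin subword tokens into words; a token ending in '+' attaches to the next token."""
    words, current = [], ""
    for tok in tokens:
        if tok.endswith("+"):
            current += tok[:-1]          # word continues
        else:
            words.append(current + tok)  # this token closes the current word
            current = ""
    if current:                          # trailing fragment (e.g. a truncated hypothesis)
        words.append(current)
    return words

# Hypothetical recognizer output over subword tokens:
print(join_subwords(["epä+", "miellyttä+", "vä", "talo+", "ssa"]))
# -> ['epämiellyttävä', 'talossa']
```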
Article
Full-text available
This article presents the research work on improving speech recognition systems for the morphologically complex Malayalam language using subword tokens for language modeling. The speech recognition system is built using a deep neural network–hidden Markov model (DNN-HMM)-based automatic speech recognition (ASR). We propose a novel method, syllable-byte pair encoding (S-BPE), that combines linguistically informed syllable tokenization with the data-driven tokenization method of byte pair encoding (BPE). The proposed method ensures words are always segmented at valid pronunciation boundaries. On a text corpus that has been divided into tokens using the proposed method, we construct statistical n-gram language models and assess the modeling effectiveness in terms of both information-theoretic and corpus linguistic metrics. A comparative study of the proposed method with other data-driven (BPE, Morfessor, and Unigram), linguistic (Syllable), and baseline (Word) tokenization algorithms is also presented. Pronunciation lexicons of subword tokenized units are built with pronunciation described as graphemes. We develop ASR systems employing the subword tokenized language models and pronunciation lexicons. The resulting ASR models are comprehensively evaluated to answer the research questions regarding the impact of subword tokenization algorithms on language modeling complexity and on ASR performance. Our study highlights the strong performance of the hybrid S-BPE tokens, achieving a notable 10.6% word error rate (WER), which represents a substantial 16.8% improvement over the baseline word-level ASR system. The ablation study has revealed that the performance of S-BPE segmentation, which initially underperformed compared to syllable tokens with lower amounts of textual data for language modeling, exhibited steady improvement with the increase in LM training data. The extensive ablation study indicates that there is a limited advantage in raising the n-gram order of the language model beyond n = 3. Such an increase results in considerable model size growth without significant improvements in WER. The implementation of the algorithm and all associated experiments are available under an open license, allowing for reproduction, adaptation, and reuse.
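For background to the comparison above, here is a minimal sketch of the standard BPE merge-learning loop (Sennrich-style); the toy vocabulary and the number of merges are illustrative, and this is plain BPE, not the S-BPE method proposed in the article.

```python
import re
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the pair into a single symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy frequency list; words are space-separated symbols ending with the marker '</w>'.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(10):                        # each iteration adds one subword to the vocabulary
    pairs = pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges[:5])                          # the first learned merge operations
```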
... Subword modeling is another effective technique to reduce vocabulary size. We use the Morfessor method [23], [24], which has been successfully applied in speech recognition of many agglutinative languages [4], [25]. Morfessor is an unsupervised method that uses a statistical model to split words into smaller fragments. ...
... One reason why the Estonian vocabulary is smaller than the Finnish vocabulary, even though the Estonian data set is larger, is that colloquial Estonian is written in a more systematic way. The standard Estonian vocabulary is also smaller than the standard Finnish vocabulary [25], probably because standard Finnish uses more inflected word forms. We use only spontaneous conversations as development and evaluation data. ...
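The excerpts above use Morfessor to keep the vocabulary manageable. A minimal usage sketch follows; it assumes the morfessor 2.0 Python package and its Baseline model API, and the corpus file name and example word are placeholders, so it should be read as an approximation rather than the exact setup of the cited work.

```python
import morfessor  # assumes the 'morfessor' 2.0 package (pip install morfessor)

# Read a plain-text training corpus and train an unsupervised Baseline model.
io = morfessor.MorfessorIO()
train_data = list(io.read_corpus_file("training_corpus.txt"))

model = morfessor.BaselineModel()
model.load_data(train_data)
model.train_batch()

# Segment a (possibly unseen) word form into morphs with a Viterbi search.
morphs, cost = model.viterbi_segment("epämiellyttävässä")
print(morphs)
```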
Preprint
Today, the vocabulary size for language models in large vocabulary speech recognition is typically several hundreds of thousands of words. While this is already sufficient in some applications, out-of-vocabulary words still limit usability in others. In agglutinative languages the vocabulary for conversational speech should include millions of word forms to cover the spelling variations due to colloquial pronunciations, in addition to word compounding and inflections. Very large vocabularies are also needed, for example, when the recognition of rare proper names is important.
... Morphological segmentation is an important sub-task in many natural language processing (NLP) applications, aiming to break words into meaning-bearing sub-word units called morphemes (Creutz et al. 2007). Numerous methods in NLP, information retrieval, and text mining make use of word-level information. ...
... Creutz and Lagus (2002) proposed an unsupervised word segmentation approach that relies on the MDL principle and maximum likelihood optimization. One of the most well-known systems is Morfessor (Creutz et al. 2007), a generative probabilistic model. Poon, Cherry, and Toutanova (2009) presented an unsupervised method that makes segmentation decisions based on the classic log-linear model framework, into which contextual and global features like lexicon priors are incorporated. ...
Article
Morphological segmentation, which aims to break words into meaning-bearing morphemes, is an important task in natural language processing. Most previous work relies heavily on linguistic preprocessing. In this paper, we instead propose novel neural network architectures that learn the structure of input sequences directly from raw input words and are subsequently able to predict morphological boundaries. Our architectures rely on Long Short-Term Memory (LSTM) units to accomplish this, but exploit windows of characters to capture more contextual information. Experiments on multiple languages confirm the effectiveness of our models on this task.
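The sketch below shows the general shape of a character-level boundary tagger in PyTorch. It is a plain bidirectional LSTM rather than the window-based architectures proposed in the article, and the dimensions, random batch, and dummy targets are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BoundaryTagger(nn.Module):
    """For each character, predict whether a morph boundary follows it."""
    def __init__(self, n_chars, emb_dim=32, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(n_chars, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, char_ids):                # char_ids: (batch, word_length)
        hidden, _ = self.lstm(self.emb(char_ids))
        return self.out(hidden).squeeze(-1)     # per-character boundary logits

model = BoundaryTagger(n_chars=30)
char_ids = torch.randint(0, 30, (8, 12))        # a batch of 8 words, 12 characters each
logits = model(char_ids)
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros(8, 12))  # gold boundary labels go here
```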
... Due to its importance for downstream tasks (Creutz et al., 2007; Dyer et al., 2008), segmentation has been tackled in many different ways, considering unsupervised (Creutz and Lagus, 2002), supervised (Ruokolainen et al., 2013), and semi-supervised settings (Ruokolainen et al., 2014). Here, we add three new questions to this line of research: (i) Are data-hungry neural network models applicable to segmentation of polysynthetic languages in minimal-resource settings? ...
Preprint
Morphological segmentation for polysynthetic languages is challenging, because a word may consist of many individual morphemes and training data can be extremely scarce. Since neural sequence-to-sequence (seq2seq) models define the state of the art for morphological segmentation in high-resource settings and for (mostly) European languages, we first show that they also obtain competitive performance for Mexican polysynthetic languages in minimal-resource settings. We then propose two novel multi-task training approaches (one with and one without the need for external unlabeled resources) and two corresponding data augmentation methods, improving over the neural baseline for all languages. Finally, we explore cross-lingual transfer as a third way to fortify our neural model and show that we can train one single multi-lingual model for related languages while maintaining comparable or even improved performance, thus reducing the number of parameters by close to 75%. We provide our morphological segmentation datasets for Mexicanero, Nahuatl, Wixarika and Yorem Nokki for future research.
... Finally, it is useful to quantify the number of total and unique words being uttered. Each language has a different number of unique words, which results in different ratios of Out-Of-Vocabulary (OOV) words for the same dataset size when performing an unstratified train/test set split [56]. Lexicon vastness and its corresponding phonetic inventory play a pivotal role in the performance of the ASR model, as OOV words often cause misrecognition of neighboring words [3,271]. ...
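As a concrete reading of the OOV point in this excerpt, the short sketch below computes the out-of-vocabulary rate of a test set against a training vocabulary; the token lists are placeholders.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens whose word form never occurs in the training data."""
    vocab = set(train_tokens)
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)

train_tokens = "the house was in the town".split()
test_tokens = "the houses were in the town".split()
print(f"OOV rate: {oov_rate(train_tokens, test_tokens):.1%}")  # 'houses' and 'were' are unseen
```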
Preprint
Full-text available
Speech datasets are crucial for training Speech Language Technologies (SLT); however, the lack of diversity of the underlying training data can lead to serious limitations in building equitable and robust SLT products, especially along dimensions of language, accent, dialect, variety, and speech impairment, and the intersectionality of speech features with socioeconomic and demographic features. Furthermore, there is often a lack of oversight on the underlying training data (commonly built on massive web crawling and/or publicly available speech) with regard to the ethics of such data collection. To encourage standardized documentation of such speech data components, we introduce an augmented datasheet for speech datasets, which can be used in addition to "Datasheets for Datasets". We then exemplify the importance of each question in our augmented datasheet based on in-depth literature reviews of speech data used in domains such as machine learning, linguistics, and health. Finally, we encourage practitioners, ranging from dataset creators to researchers, to use our augmented datasheet to better define the scope, properties, and limits of speech datasets, while also encouraging consideration of data-subject protection and user community empowerment. Ethical dataset creation is not a one-size-fits-all process, but dataset creators can use our augmented datasheet to reflexively consider the social context of related SLT applications and data sources in order to foster more inclusive SLT products downstream.
Article
Automatic speech recognition systems for low-resource languages typically have smaller corpora on which the language model is trained. Decoding with such a language model leads to a high word error rate due to the large number of out-of-vocabulary words in the test data. Larger language models can be used to rescore the lattices generated from initial decoding. This approach, however, gives only a marginal improvement. Decoding with a larger augmented language model, though helpful, is memory intensive and not feasible for a low-resource system setup. The objective of our research is to perform initial decoding with a minimally augmented language model. The lattices thus generated are then rescored with a larger language model. We thus obtain a significant reduction in error for low-resource Indic languages, namely, Kannada and Telugu. This paper addresses the problem of improving speech recognition accuracy with lattice rescoring in low-resource languages where the baseline language model is not sufficient for generating inclusive lattices. We minimally augment the baseline language model with unigram counts of words that are present in a larger text corpus of the target language but absent in the baseline. The lattices generated after decoding with a minimally augmented baseline language model are more comprehensive for rescoring. We obtain 21.8% (for Telugu) and 41.8% (for Kannada) relative word error reduction with our proposed method. This reduction in word error rate is comparable to the 21.5% (for Telugu) and 45.9% (for Kannada) relative word error reduction obtained by decoding with the language model augmented with the full Wikipedia text, while our approach consumes only 1/8th of the memory. We demonstrate that our method is comparable with various text selection-based language model augmentation methods and is also consistent for data sets of different sizes. Our approach is applicable for training speech recognition systems under low-resource conditions where speech data and compute resources are insufficient, while a large text corpus is available in the target language. Our research addresses the issue of out-of-vocabulary words of the baseline in general and does not focus on resolving the absence of named entities. Our proposed method is simple and yet computationally inexpensive.
Article
Full-text available
Recent work shows that ambient exposure in everyday situations can yield implicit knowledge of a language that an observer does not speak. We replicate and extend this work in the context of Spanish in California and Texas. In Word Identification and Wellformedness Rating experiments, non-Spanish-speaking Californians and Texans show implicit lexical and phonotactic knowledge of Spanish, which may be affected by both language structure and attitudes. Their knowledge of Spanish appears to be weaker than New Zealanders’ knowledge of Māori established in recent work, consistent with structural differences between Spanish and Māori. Additionally, the strength of a participant’s knowledge increases with the value they place on Spanish and its speakers in their state. These results showcase the power and generality of statistical learning of language in adults, while also highlighting how it cannot be divorced from the structural and attitudinal factors that shape the context in which it occurs.
Article
Tokenization is an important text preprocessing step to prepare input tokens for deep language models. WordPiece and BPE are de facto methods employed by important models, such as BERT and GPT. However, the impact of tokenization can be different for morphologically rich languages, such as Turkic languages, in which many words can be generated by adding prefixes and suffixes. We compare five tokenizers at different granularity levels (their outputs vary from the smallest pieces of characters to the surface form of words), including a Morphological-level tokenizer. We train these tokenizers and pretrain medium-sized language models using the RoBERTa pretraining procedure on the Turkish split of the OSCAR corpus. We then fine-tune our models on six downstream tasks. Our experiments, supported by statistical tests, reveal that the Morphological-level tokenizer delivers performance competitive with the de facto tokenizers. Furthermore, we find that increasing the vocabulary size improves the performance of Morphological- and Word-level tokenizers more than that of de facto tokenizers. The ratio of the number of vocabulary parameters to the total number of model parameters can be empirically chosen as 20% for de facto tokenizers and 40% for other tokenizers to obtain a reasonable trade-off between model size and performance.
Article
Full-text available
We present a model family called Morfessor for the unsupervised induction of a simple morphology from raw text data. The model is formulated in a probabilistic maximum a posteriori framework. Morfessor can handle highly inflecting and compounding languages where words can consist of lengthy sequences of morphemes. A lexicon of word segments, called morphs, is induced from the data. The lexicon stores information about both the usage and form of the morphs. Several instances of the model are evaluated quantitatively in a morpheme segmentation task on different-sized sets of Finnish as well as English data. Morfessor is shown to perform very well compared to a widely known benchmark algorithm, in particular on Finnish data.
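In the maximum a posteriori framework mentioned here, the learning objective can be stated schematically as

$$ \mathcal{M}^{*} \;=\; \arg\max_{\mathcal{M}} P(\mathcal{M} \mid \text{corpus}) \;=\; \arg\max_{\mathcal{M}} P(\mathcal{M})\, P(\text{corpus} \mid \mathcal{M}), $$

where \(\mathcal{M}\) denotes the morph lexicon together with its parameters, the prior \(P(\mathcal{M})\) favors compact lexicons, and \(P(\text{corpus} \mid \mathcal{M})\) is the likelihood of the training text under its segmentation into morphs. The notation is a generic paraphrase of this family of models, not the paper's exact formulation.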
Article
Full-text available
This study reports the results of using minimum description length (MDL) analysis to model unsupervised learning of the morphological segmentation of European languages, using corpora ranging in size from 5,000 words to 500,000 words. We develop a set of heuristics that rapidly develop a probabilistic morphological grammar, and use MDL as our primary tool to determine whether the modifications proposed by the heuristics will be adopted or not. The resulting grammar matches well the analysis that would be developed by a human morphologist. In the final section, we discuss the relationship of this style of MDL grammatical analysis to the notion of evaluation metric in early generative grammar.
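The minimum description length criterion used in this line of work can be summarized by the standard two-part code,

$$ \mathrm{DL}(\text{grammar}, \text{corpus}) \;=\; L(\text{grammar}) + L(\text{corpus} \mid \text{grammar}), $$

and a modification proposed by one of the heuristics is adopted only if it lowers this total description length. This is a generic statement of MDL rather than the article's exact notation.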
Article
Full-text available
This paper presents an algorithm for the unsupervised learning of a simple morphology of a natural language from raw text. A generative probabilistic model is applied to segment word forms into morphs. The morphs are assumed to be generated by one of three categories, namely prefix, suffix, or stem, and we make use of some observed asymmetries between these categories. The model learns a word structure, where words are allowed to consist of lengthy sequences of alternating stems and affixes, which makes the model suitable for highly inflecting languages. The ability of the algorithm to find real morpheme boundaries is evaluated against a gold standard for both Finnish and English. In comparison with a state-of-the-art algorithm the new algorithm performs best on the Finnish data, and on a roughly equal level on the English data.
Chapter
For the science of linguistics we seek objective and formally describable operations with which to analyze language. The phonemes of a language can be determined by means of an explicit behavioral test (the pair test, involving two speakers of the language) and distributional simplifications, i.e., the defining of symbols which express the way in which the outcomes of that test occur in respect to each other in sentences of the language. The syntax, and most of the morphology, of a language is discovered by seeing how the morphemes occur in respect to each other in sentences. As a bridge between these two sets of methods we need a test for determining what are the morphemes of a language, or at least a test that would tentatively segment a phonemic sequence (as a sentence) into morphemes, leaving it for a distributional criterion to decide which of these tentative segments are to be accepted as morphemes.