Article

Morph-based speech recognition and modeling of out-of-vocabulary words across languages

Abstract

We explore the use of morph-based language models in large-vocabulary continuous-speech recognition systems across four so-called morphologically rich languages: Finnish, Estonian, Turkish, and Egyptian Colloquial Arabic. The morphs are subword units discovered in an unsupervised, data-driven way using the Morfessor algorithm. By estimating n-gram language models over sequences of morphs instead of words, the quality of the language model is improved through better vocabulary coverage and reduced data sparsity. Standard word models suffer from high out-of-vocabulary (OOV) rates, whereas the morph models can recognize previously unseen word forms by concatenating morphs. It is shown that the morph models do perform fairly well on OOVs without compromising the recognition accuracy on in-vocabulary words. The Arabic experiment constitutes the only exception since here the standard word model outperforms the morph model. Differences in the datasets and the amount of data are discussed as a plausible explanation.
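To make the modeling idea concrete, the following minimal sketch (plain Python, with hypothetical Finnish-style segmentations rather than actual Morfessor output) shows how an n-gram model estimated over morph sequences assigns non-zero probability to a word form never seen as a whole, which is what lets the recognizer hypothesize OOV words by concatenating morphs.

```python
from collections import Counter

# Hypothetical morph-segmented training corpus; in the paper the
# segmentations come from Morfessor, and "<w>" marks word boundaries.
corpus = [
    ["<w>", "talo", "ssa", "<w>", "talo", "i", "ssa", "<w>"],
    ["<w>", "auto", "ssa", "<w>", "auto", "i", "lla", "<w>"],
]

# Count morph unigrams and bigrams.
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))

def bigram_prob(prev, morph, alpha=0.1):
    """Add-alpha smoothed P(morph | prev) over the morph vocabulary."""
    v = len(unigrams)
    return (bigrams[(prev, morph)] + alpha) / (unigrams[prev] + alpha * v)

# "autoissa" never occurs as a whole word in the corpus, but its morph
# sequence auto + i + ssa gets a non-zero probability, so the
# recognizer can hypothesize this previously unseen word form.
seq = ["<w>", "auto", "i", "ssa", "<w>"]
p = 1.0
for prev, morph in zip(seq, seq[1:]):
    p *= bigram_prob(prev, morph)
print(p)
```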
... Morfessor is a model family for unsupervised morphological segmentation of natural languages, originally developed by Creutz and Lagus [86,20,88]. Morfessor Baseline [20,88] is one of the most popular tools in the family, and it is capable of providing an unsupervised subword segmentation of a lexicon that can be utilized in the language model of an ASR ...
... Morfessor, which was originally developed by Creutz and Lagus [86,20,88]. Morfessor is based on the MDL (Minimum Description Length) principle and uses a complex loss function and training algorithm to optimize the subword lexicon of a task (see Section 4.1). ...
... However, not all experiments with Morfessor were successful. For Egyptian Colloquial Arabic, morph-based language modeling produced worse results than the word baseline, which the authors attributed to a severe shortage of language model training data and to the flatter vocabulary growth curves of spontaneous speech [86]. ...
Thesis
Automatic speech recognition (ASR) systems enable the machine transcription of human speech. The language model plays an important role in ASR systems, as the final automatic transcript is selected from the acoustically fitting hypotheses based on the probability estimates of this model. Language modeling for Hungarian speech recognition is a field that has not been thoroughly explored yet. Therefore, the general goal of my research was to develop, adapt and evaluate new language modeling approaches for Hungarian ASR. A great challenge in training language models is coping with data sparsity, which is especially pronounced in morphologically rich languages like Hungarian due to the large number of word forms. In my thesis, therefore, I perform extensive research on the application and optimization of language models estimated on morpheme-like lexical units called morphs. I show that morph-based language modeling can reduce the word error rate on 40 Hungarian ASR tasks. The amount of error rate reduction, however, varies from task to task, so I explore the root of this variance and show that morph-based language models can be especially efficient in less-resourced tasks and in good acoustic conditions. In order to facilitate the practical application of morph-based ASR systems, I present an interpolation method and a named entity modeling technique for morph-based language models. I also investigate the performance and applicability of neural language models in Hungarian ASR. Although many tasks require low-latency ASR, neural language models are computationally very heavy and cannot be effectively utilized in one-pass, real-time ASR systems. This is why my research focuses primarily on transferring knowledge from neural language models to the n-gram model of online ASR systems. I show that by using a text-based data augmentation technique, up to 30% of the error reduction available with a word-based recurrent neural language model (RNNLM) can be transferred to the online system. I also demonstrate that performing the data augmentation with a GPT-2 Transformer language model yields a further improvement on our speech transcription task. In the last part of my thesis, I combine morph-based lexical modeling methods with neural language models. I show that by augmenting the training corpus with a morph-based RNNLM, more than 50% of the offline error reduction can be transferred to the one-pass ASR. Finally, I propose a new, Transformer language model based data augmentation technique that outperforms all other approaches while also significantly reducing the vocabulary size and memory footprint of the ASR system. This technique improves the WER of the baseline word-based n-gram model by almost 13% relative and is capable of detecting every fourth out-of-vocabulary word on the investigated conversational speech recognition task.
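As a rough illustration of the text-based data augmentation idea described in the thesis, here is a sketch assuming a Hugging Face causal language model; the model name, prompt, sample counts, and file paths are placeholders, not the thesis setup.

```python
# Sketch of neural-to-n-gram knowledge transfer via data augmentation:
# sample text from a causal LM and append it to the n-gram training
# corpus. "gpt2" and the prompt are placeholders for illustration.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder LM

with open("augmented_corpus.txt", "w", encoding="utf-8") as f:
    for out in generator(["The"] * 5, max_new_tokens=40, do_sample=True):
        f.write(out[0]["generated_text"].replace("\n", " ") + "\n")

# The augmented corpus would then be concatenated with the original
# training text and the n-gram model re-estimated on it, e.g. with a
# tool such as KenLM:  lmplz -o 4 < all_text.txt > model.arpa
```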
... Morphological segmentation is an important sub-task in many natural language processing (NLP) applications, aiming to break words into meaning-bearing sub-word units called morphemes (Creutz et al. 2007). Numerous methods in NLP, information retrieval, and text mining make use of word-level information. ...
... Creutz and Lagus (2002) proposed an unsupervised word segmentation approach that relies on the MDL principle and maximum likelihood optimization. One of the most well-known systems is Morfessor (Creutz et al. 2007), a generative probabilistic model. Poon, Cherry, and Toutanova (2009) presented an unsupervised method that makes segmentation decisions based on the classic log-linear model framework, into which contextual and global features like lexicon priors are incorporated. ...
Article
Morphological segmentation, which aims to break words into meaning-bearing morphemes, is an important task in natural language processing. Most previous work relies heavily on linguistic preprocessing. In this paper, we instead propose novel neural network architectures that learn the structure of input sequences directly from raw input words and are subsequently able to predict morphological boundaries. Our architectures rely on Long Short-Term Memory (LSTM) units to accomplish this, but exploit windows of characters to capture more contextual information. Experiments on multiple languages confirm the effectiveness of our models on this task.
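A minimal PyTorch sketch of the general idea described above (character embeddings feeding an LSTM that predicts a per-character boundary label); the layer sizes are illustrative, and the character-window mechanism of the paper is omitted here.

```python
import torch
import torch.nn as nn

class BoundaryTagger(nn.Module):
    """Illustrative char-level segmenter: embed characters, run a
    bidirectional LSTM, and predict a boundary label per character."""
    def __init__(self, n_chars, emb=32, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(n_chars, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, 2)  # boundary / no boundary

    def forward(self, char_ids):          # (batch, seq_len)
        h, _ = self.lstm(self.emb(char_ids))
        return self.out(h)                # (batch, seq_len, 2) logits

# Toy usage: score boundary positions for the word "unhappiness".
chars = sorted(set("unhappiness"))
ids = torch.tensor([[chars.index(c) for c in "unhappiness"]])
model = BoundaryTagger(n_chars=len(chars))
logits = model(ids)
pred = logits.argmax(-1)  # 1 = predicted morph boundary (untrained
print(pred)               # here, so the labels are still arbitrary)
```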
... In order to solve these issues, many NLP models have used linguistically motivated subwords (Bazzi, 2002; Creutz et al., 2007; Luong et al., 2013; Mikolov et al., 2012). Sennrich et al. (2015) first adapted the byte pair encoding (BPE) algorithm for word segmentation, so that instead of merging pairs of bytes, it merges pairs of characters or character sequences. ...
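The merge step referred to here can be sketched compactly; this toy version follows the pair-merging scheme of Sennrich et al. on a handful of made-up word frequencies.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: words is a dict {word: freq}; repeatedly merge the most
    frequent adjacent symbol pair, as in Sennrich et al. (2016)."""
    vocab = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for syms, freq in vocab.items():
            for pair in zip(syms, syms[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for syms, freq in vocab.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])  # merge the pair
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = bpe_merges({"lower": 5, "lowest": 2, "newer": 6}, 4)
print(merges)  # the most frequent pair ('w', 'e') is merged first
```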
Article
Software developers frequently reuse source code from repositories as it saves development time and effort. Code clones (similar code fragments) accumulated in these repositories represent often-repeated functionalities and are candidates for reuse in exploratory or rapid development. To facilitate code clone reuse, we previously presented DeepClone, a novel deep learning approach for modeling code clones along with non-cloned code to predict the next set of tokens (possibly a complete clone method body) based on the code written so far. The probabilistic nature of language modeling, however, can lead to code output with minor syntax or logic errors. To resolve this, we propose a novel approach called Clone-Advisor. We apply an information retrieval technique on top of the DeepClone output to recommend real clone methods closely matching the predicted clone method, thus improving the original output of DeepClone. In this paper, we discuss and refine our previous work on DeepClone in much more detail. Moreover, we quantitatively evaluate the performance and effectiveness of Clone-Advisor in clone method recommendation.
... Different types of sub-word units (Morfessor-based, syllable, phoneme) are investigated with the aim of detecting OOV keywords. The first set of units is derived using Morfessor, a tool for unsupervised morphological decomposition [20,21]. Given the list of words in the training texts with their frequencies, Morfessor learns a set of morphological units that are then used to segment the training texts and the keyword list. ...
... The explicit detection of out-of-vocabulary (OOV) words in large-scale continuous speech recognition is known to improve accuracy [7,8]. Methods used for this task include the introduction of garbage models [9,8,10] or the use of word confidence models [11,9,12,13]. Keyword spotting often uses related techniques, at times incorporating the entire recognition lattice to effectively ignore large swaths of non-keyword speech [14,15,16]. In dialogue systems, on the other hand, the utterance verification problem is occasionally attacked with multiple domain-specific recognizers evaluated concurrently to classify utterances based on a comparison and scoring of hypotheses from different systems [17,18,19]. ...
... In statistical machine translation, Dyer, Muresan, and Resnik (2008) integrate segmentation of Arabic and Chinese into the decoding process to improve translation into English. In speech recognition, Creutz et al. (2007) use morpheme-based language models to improve performance in four morphologically rich languages: Finnish, Estonian, Turkish, and Arabic. More recently, using word segments instead of words as processing units has been a major breakthrough (Sennrich, Haddow, and Birch, 2016) in addressing the rare-word problem in end-to-end neural encoder-decoder systems. ...
Article
Tokenization is an important text preprocessing step that prepares input tokens for deep language models. WordPiece and BPE are the de facto methods employed by important models such as BERT and GPT. However, the impact of tokenization can be different for morphologically rich languages, such as Turkic languages, in which many words can be generated by adding prefixes and suffixes. We compare five tokenizers at different granularity levels, whose outputs range from the smallest pieces of characters to the surface forms of words, including a Morphological-level tokenizer. We train these tokenizers and pretrain medium-sized language models using the RoBERTa pretraining procedure on the Turkish split of the OSCAR corpus. We then fine-tune our models on six downstream tasks. Our experiments, supported by statistical tests, reveal that the Morphological-level tokenizer delivers performance competitive with the de facto tokenizers. Furthermore, we find that increasing the vocabulary size improves the performance of Morphological- and Word-level tokenizers more than that of the de facto tokenizers. The ratio of the number of vocabulary parameters to the total number of model parameters can be empirically chosen as 20% for de facto tokenizers and 40% for the other tokenizers to obtain a reasonable trade-off between model size and performance.
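A back-of-the-envelope check of that vocabulary-to-total parameter ratio, with hypothetical RoBERTa-like dimensions rather than the paper's exact configurations:

```python
# Illustrative arithmetic for the vocabulary/total parameter ratio.
# Dimensions are hypothetical, not the paper's model configurations.
hidden = 768
vocab_size = 32_000
n_layers, ffn = 12, 4 * 768

emb_params = vocab_size * hidden                   # embedding matrix
# rough per-layer transformer cost: attention + feed-forward weights
layer_params = 4 * hidden * hidden + 2 * hidden * ffn
total = emb_params + n_layers * layer_params

print(f"vocab share: {emb_params / total:.0%}")  # ~22% at these sizes
```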
Chapter
Morphological Segmentation involves decomposing words into morphemes, the smallest meaning-bearing units of language. This is an important NLP task for morphologically-rich agglutinative languages such as the Southern African Nguni language group. In this paper, we investigate supervised and unsupervised models for two variants of morphological segmentation: canonical and surface segmentation. We train sequence-to-sequence models for canonical segmentation, where the underlying morphemes may not be equal to the surface form of the word, and Conditional Random Fields (CRF) for surface segmentation. Transformers outperform LSTMs with attention on canonical segmentation, obtaining an average F1 score of 72.5% across 4 languages. Feature-based CRFs outperform bidirectional LSTM-CRFs to obtain an average of 97.1% F1 on surface segmentation. In the unsupervised setting, an entropy-based approach using a character-level LSTM language model fails to outperform a Morfessor baseline, while on some of the languages neither approach performs much better than a random baseline. We hope that the high accuracy of the supervised segmentation models will help to facilitate the development of better NLP tools for Nguni languages.
Article
We present a model family called Morfessor for the unsupervised induction of a simple morphology from raw text data. The model is formulated in a probabilistic maximum a posteriori framework. Morfessor can handle highly inflecting and compounding languages where words can consist of lengthy sequences of morphemes. A lexicon of word segments, called morphs, is induced from the data. The lexicon stores information about both the usage and form of the morphs. Several instances of the model are evaluated quantitatively in a morpheme segmentation task on different sized sets of Finnish as well as English data. Morfessor is shown to perform very well compared to a widely known benchmark algorithm, in particular on Finnish data.
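The two-part cost that Morfessor-style models minimize can be sketched schematically: a lexicon cost for spelling out each distinct morph, plus a corpus cost for coding the text given the lexicon. The coding assumptions below are deliberately simplified and are not Morfessor's exact priors.

```python
import math
from collections import Counter

def morfessor_style_cost(segmented_corpus, bits_per_char=5.0):
    """Simplified two-part MDL cost: lexicon cost (spelling out each
    distinct morph) + corpus cost (-log2 likelihood of the morph
    sequence under its unigram distribution)."""
    tokens = [m for word in segmented_corpus for m in word]
    counts = Counter(tokens)
    n = len(tokens)
    lexicon_cost = sum(len(m) * bits_per_char for m in counts)
    corpus_cost = -sum(c * math.log2(c / n) for c in counts.values())
    return lexicon_cost + corpus_cost

# A finer segmentation shrinks the lexicon but lengthens the corpus
# code; the search balances the two terms. Here the morph-level
# segmentation wins because morphs are reused across word forms.
whole_words = [["taloissa"], ["talossa"], ["autossa"]]
morphs = [["talo", "i", "ssa"], ["talo", "ssa"], ["auto", "ssa"]]
print(morfessor_style_cost(whole_words), morfessor_style_cost(morphs))
```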
Article
This study reports the results of using minimum description length (MDL) analysis to model unsupervised learning of the morphological segmentation of European languages, using corpora ranging in size from 5,000 words to 500,000 words. We develop a set of heuristics that rapidly develop a probabilistic morphological grammar, and use MDL as our primary tool to determine whether the modifications proposed by the heuristics will be adopted or not. The resulting grammar matches well the analysis that would be developed by a human morphologist. In the final section, we discuss the relationship of this style of MDL grammatical analysis to the notion of evaluation metric in early generative grammar.
Article
This paper presents an algorithm for the unsupervised learning of a simple morphology of a natural language from raw text. A generative probabilistic model is applied to segment word forms into morphs. The morphs are assumed to be generated by one of three categories, namely prefix, suffix, or stem, and we make use of some observed asymmetries between these categories. The model learns a word structure where words are allowed to consist of lengthy sequences of alternating stems and affixes, which makes the model suitable for highly inflecting languages. The ability of the algorithm to find real morpheme boundaries is evaluated against a gold standard for both Finnish and English. In comparison with a state-of-the-art algorithm, the new algorithm performs best on the Finnish data and at a roughly equal level on the English data.
Chapter
For the science of linguistics we seek objective and formally describable operations with which to analyze language. The phonemes of a language can be determined by means of an explicit behavioral test (the pair test, involving two speakers of the language) and distributional simplifications, i.e., the defining of symbols which express the way in which the outcomes of that test occur in respect to each other in sentences of the language. The syntax, and most of the morphology, of a language is discovered by seeing how the morphemes occur in respect to each other in sentences. As a bridge between these two sets of methods we need a test for determining what are the morphemes of a language, or at least a test that would tentatively segment a phonemic sequence (as a sentence) into morphemes, leaving it for a distributional criterion to decide which of these tentative segments are to be accepted as morphemes.
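Harris's tentative segmentation test is often operationalized as successor variety: count how many distinct phonemes (here, letters) can follow each prefix of an utterance and place tentative morpheme boundaries at the peaks. A toy sketch under that assumption, not Harris's full procedure:

```python
from collections import defaultdict

def successor_variety(word, lexicon):
    """For each prefix of `word`, count the distinct next letters seen
    after that prefix anywhere in the lexicon (Harris-style test)."""
    succ = defaultdict(set)
    for w in lexicon:
        for i in range(1, len(w)):
            succ[w[:i]].add(w[i])
    return [len(succ[word[:i]]) for i in range(1, len(word))]

lexicon = ["playing", "played", "player", "plays", "playful", "plant"]
sv = successor_variety("playing", lexicon)
print(sv)  # variety peaks after the prefix "play": a tentative boundary
```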