Conference Paper

Scalable Modified Kneser-Ney Language Model Estimation

Authors: Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, Philipp Koehn

Abstract

We present an efficient algorithm to estimate large modified Kneser-Ney models including interpolation. Streaming and sorting enables the algorithm to scale to much larger models by using a fixed amount of RAM and variable amount of disk. Using one machine with 140 GB RAM for 2.8 days, we built an unpruned model on 126 billion tokens. Machine translation experiments with this model show improvement of 0.8 BLEU point over constrained systems for the 2013 Workshop on Machine Translation task in three language pairs. Our algorithm is also faster for small models: we estimated a model on 302 million tokens using 7.7% of the RAM and 14.0% of the wall time taken by SRILM. The code is open source as part of KenLM.
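The estimation pipeline described in the abstract ships with KenLM as the lmplz tool. Below is a minimal sketch of estimating and then querying such a model, assuming lmplz is on the PATH and the kenlm Python bindings are installed; the file names, the 40% memory cap, and the example sentence are illustrative.

```python
# Minimal sketch: estimate a 5-gram modified Kneser-Ney model with KenLM's
# lmplz and query it from Python. Assumes lmplz is on PATH and the kenlm
# Python bindings are installed; file names and the memory cap are illustrative.
import subprocess

import kenlm

# Stream the corpus through lmplz: -o sets the order, -S caps RAM usage,
# -T points temporary sorting files at disk (the fixed-RAM / variable-disk
# behaviour described in the abstract).
with open("corpus.txt", "rb") as text, open("model.arpa", "wb") as arpa:
    subprocess.run(["lmplz", "-o", "5", "-S", "40%", "-T", "/tmp"],
                   stdin=text, stdout=arpa, check=True)

# Load the ARPA file and score a sentence (KenLM returns log10 probabilities).
model = kenlm.Model("model.arpa")
print(model.score("this is a test", bos=True, eos=True))
print(model.perplexity("this is a test"))
```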


... Language Model The n-gram language model (LM) has long been widely used in natural language processing (NLP) applications (Jurafsky, 2000). The emergence of advanced smoothing techniques enables the n-gram model to provide better estimates of human language (Kneser and Ney, 1995; Chen and Goodman, 1996; Heafield et al., 2013). In statistical machine translation (Brown et al., 1990) and automatic speech recognition (Bahl et al., 1983), the decoder-side n-gram model is critical for estimating the quality of generated candidates. ...
... Among the many variants of n-gram LMs, the n-gram LM with modified Kneser-Ney smoothing is widely adopted in related tasks because of its low perplexity and efficiency (Kneser and Ney, 1995; Chen and Goodman, 1996; Heafield et al., 2013). Like most n-gram LMs, Kneser-Ney approximates the entire context x_1^{k-1} in Eq. (1) by the last n − 1 words of the context: ...
... where w denotes a word appearing after x_{k-n+1}^{k-1}, b(·) is the backoff value for the lower-order estimate, c(·) is the adjusted count, and d is the discount used for smoothing (Jurafsky, 2000; Heafield et al., 2013). According to Eq. (3), Kneser-Ney allows us to assign probabilities to unseen n-grams (e.g., 5-grams) using lower-order information (e.g., 4-, 3-, or even uni-grams). ...
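For illustration, here is a toy Python sketch of the interpolated Kneser-Ney recursion described in the excerpt above, using a single discount D rather than the three count-dependent discounts of the modified variant; `counts` is assumed to already hold the adjusted (continuation) counts per order, which is what the paper computes with streaming and sorting rather than in-memory dictionaries.

```python
# Toy interpolated Kneser-Ney scorer with a single discount D per order
# (the modified variant uses three count-dependent discounts d1, d2, d3+).
# `counts[n]` maps n-gram tuples to the adjusted counts c(.) of the excerpt.
def kn_prob(word, context, counts, D=0.75, unk=1e-7):
    if not context:                                   # unigram base case
        total = sum(counts[1].values())
        return counts[1].get((word,), 0.0) / total if total else unk
    order = len(context) + 1
    history = tuple(context)
    hist_total = sum(c for g, c in counts[order].items() if g[:-1] == history)
    if hist_total == 0:                               # unseen history: back off
        return kn_prob(word, context[1:], counts, D, unk)
    c = counts[order].get(history + (word,), 0.0)
    distinct = sum(1 for g in counts[order] if g[:-1] == history)
    b = D * distinct / hist_total                     # backoff weight b(.)
    return (max(c - D, 0.0) / hist_total
            + b * kn_prob(word, context[1:], counts, D, unk))

# Example with tiny hand-made counts (orders 1 and 2 only):
counts = {1: {("cat",): 2, ("dog",): 1, ("the",): 3},
          2: {("the", "cat"): 2, ("the", "dog"): 1}}
print(kn_prob("cat", ("the",), counts))   # discounted estimate plus backoff mass
```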
... However, extending these representations to the large vocabularies needed for open-domain MT is an open area of research (Jean et al., 2015a;Luong et al., 2015;Sennrich et al., 2015;Chitnis and DeNero, 2015). By contrast, Hiero (and other symbolic systems) can easily use translation grammars and language models with very large vocabularies (Heafield et al., 2013;Lin and Dyer, 2010). Moreover, words and phrases can be easily added to a fully-trained symbolic MT system. ...
... The rules for En-Fr were extracted from the full data set available at the WMT'15 website using a shallow-1 grammar (de Gispert et al., 2010). 5-gram Kneser-Ney language models (KN-LM) for the Hiero systems were trained on WMT'15 parallel and monolingual data (Heafield et al., 2013). (Jean et al., 2015a, Tab. ...
Preprint
We investigate the use of hierarchical phrase-based SMT lattices in end-to-end neural machine translation (NMT). Weight pushing transforms the Hiero scores for complete translation hypotheses, with the full translation grammar score and full n-gram language model score, into posteriors compatible with NMT predictive probabilities. With a slightly modified NMT beam-search decoder we find gains over both Hiero and NMT decoding alone, with practical advantages in extending NMT to very large input and output vocabularies.
... We implemented the language model that provides all the necessary information for the search algorithm as a recurrent neural network (Mikolov & Zweig, 2012) and an n-gram model (Heafield, Pouzyrevsky, Clark, & Koehn, 2013). ...
... An n-gram model is a structure that stores historic data about the n-grams, sequences of n tokens, seen in a training corpus. For constructing these models, we first have to set the order of the model, n, and usually the smoothing function, although in this case the Kneser-Ney smoothing (Heafield et al., 2013;Kneser & Ney, 1995) seems to be the best option overall (Chen & Goodman, 1996). ...
Preprint
Full-text available
Word segmentation is the task of inserting or deleting word boundary characters in order to separate character sequences that correspond to words in some language. In this article we propose an approach based on a beam search algorithm and a language model working at the byte/character level, the latter component implemented either as an n-gram model or a recurrent neural network. The resulting system analyzes the text input with no word boundaries one token at a time, which can be a character or a byte, and uses the information gathered by the language model to determine if a boundary must be placed in the current position or not. Our aim is to use this system in a preprocessing step for a microtext normalization system. This means that it needs to effectively cope with the data sparsity present on this kind of texts. We also strove to surpass the performance of two readily available word segmentation systems: The well-known and accessible Word Breaker by Microsoft, and the Python module WordSegment by Grant Jenks. The results show that we have met our objectives, and we hope to continue to improve both the precision and the efficiency of our system in the future.
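A schematic version of the beam-search segmenter described above is sketched below; the `score` function stands in for the byte/character-level language model, and the toy lexicon scorer at the end is only for demonstration.

```python
# Sketch of LM-driven word segmentation with beam search, in the spirit of the
# preprint above. `score` is any log-probability (or other) scorer over a token
# sequence; it is a placeholder here.
def segment(text, score, beam_width=8):
    beams = [([], 0.0)]                       # (words so far, running score)
    for ch in text:
        candidates = []
        for words, _ in beams:
            # Either extend the current word or start a new one at this char.
            if words:
                candidates.append(words[:-1] + [words[-1] + ch])
            candidates.append(words + [ch])
        # Deduplicate and keep the highest-scoring partial segmentations.
        unique = {tuple(w): score(w) for w in candidates}
        beams = sorted(unique.items(), key=lambda kv: kv[1], reverse=True)[:beam_width]
        beams = [(list(w), s) for w, s in beams]
    return beams[0][0]

# Example with a toy scorer that prefers dictionary words:
lexicon = {"the", "cat", "sat"}
toy_score = lambda ws: sum(1.0 if w in lexicon else -1.0 for w in ws)
print(segment("thecatsat", toy_score))        # ['the', 'cat', 'sat']
```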
... Expectation-based theories claim that humans predict upcoming words during incremental sentence processing (Clark, 2013). ...
... For simplicity, some functional words (e.g., "the," "を") are merged into a single node. For example, if we use count-based LMs (Heafield et al., 2013), even a single 5-gram Japanese LM took 27 GB in model size. ...
... The baseline regression models contain as predictors word length in characters, index of word position within the sentence, unigram surprisal (all datasets), and whether the previous word was fixated (ET datasets only). Unigram surprisal was calculated using the KenLM toolkit (Heafield et al., 2013) with parameters estimated on the OpenWebText Corpus (Gokaslan and Cohen, 2019). On top of these baseline regression models, surprisal of the current word and the preceding word was included to capture spillover effects (Rayner et al., 1983). ...
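A sketch of how such surprisal predictors are typically computed with the kenlm Python bindings follows; the model file name and test sentence are illustrative, and the conversion from log10 probabilities to bits follows the usual definition of surprisal.

```python
# Sketch: per-word surprisal (in bits) from a KenLM model, as used for the
# regression predictors above. Assumes the kenlm Python bindings and a
# pre-estimated model; the file name is illustrative.
import math
import kenlm

model = kenlm.Model("openwebtext.5gram.binary")

def surprisals(sentence):
    # full_scores yields (log10 prob, ngram length, is_oov) for every word
    # plus the end-of-sentence symbol; surprisal is -log2 p.
    words = sentence.split() + ["</s>"]
    scores = model.full_scores(sentence, bos=True, eos=True)
    return [(w, -logp / math.log10(2.0)) for w, (logp, _, _) in zip(words, scores)]

for word, bits in surprisals("the cat sat on the mat"):
    print(f"{word}\t{bits:.2f}")
```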
Preprint
Full-text available
Word-by-word language model surprisal is often used to model the incremental processing of human readers, which raises questions about how various choices in language modeling influence its predictive power. One factor that has been overlooked in cognitive modeling is the granularity of subword tokens, which explicitly encodes information about word length and frequency, and ultimately influences the quality of vector representations that are learned. This paper presents experiments that manipulate the token granularity and evaluate its impact on the ability of surprisal to account for processing difficulty of naturalistic text and garden-path constructions. Experiments with naturalistic reading times reveal a substantial influence of token granularity on surprisal, with tokens defined by a vocabulary size of 8,000 resulting in surprisal that is most predictive. In contrast, on garden-path constructions, language models trained on coarser-grained tokens generally assigned higher surprisal to critical regions, suggesting their increased sensitivity to syntax. Taken together, these results suggest a large role of token granularity on the quality of language model surprisal for cognitive modeling.
... • KN: The Kneser-Ney (KN) model is a widely used language model proposed by Heafield et al. (2013). We use the KenLM toolkit to train 5-gram models without pruning. ...
Preprint
Table-to-text generation aims to generate a description for a factual table which can be viewed as a set of field-value records. To encode both the content and the structure of a table, we propose a novel structure-aware seq2seq architecture which consists of field-gating encoder and description generator with dual attention. In the encoding phase, we update the cell memory of the LSTM unit by a field gate and its corresponding field value in order to incorporate field information into table representation. In the decoding phase, dual attention mechanism which contains word level attention and field level attention is proposed to model the semantic relevance between the generated description and the table. We conduct experiments on the WIKIBIO dataset which contains over 700k biographies and corresponding infoboxes from Wikipedia. The attention visualizations and case studies show that our model is capable of generating coherent and informative descriptions based on the comprehensive understanding of both the content and the structure of a table. Automatic evaluations also show our model outperforms the baselines by a great margin. Code for this work is available on https://github.com/tyliupku/wiki2bio.
... Rules for our En-De Hiero system were extracted as described in (de Gispert et al., 2010). A 5-gram language model for the Hiero system was trained on WMT16 parallel and monolingual data (Heafield et al., 2013). ...
Preprint
This paper presents the University of Cambridge submission to WMT16. Motivated by the complementary nature of syntactical machine translation and neural machine translation (NMT), we exploit the synergies of Hiero and NMT in different combination schemes. Starting out with a simple neural lattice rescoring approach, we show that the Hiero lattices are often too narrow for NMT ensembles. Therefore, instead of a hard restriction of the NMT search space to the lattice, we propose to loosely couple NMT and Hiero by composition with a modified version of the edit distance transducer. The loose combination outperforms lattice rescoring, especially when using multiple NMT systems in an ensemble.
... 2. KN5: a standard 5-gram Kneser-Ney language model in KenLM (Heafield et al., 2013). ...
Preprint
We propose a new generative model of sentences that first samples a prototype sentence from the training corpus and then edits it into a new sentence. Compared to traditional models that generate from scratch either left-to-right or by first sampling a latent sentence vector, our prototype-then-edit model improves perplexity on language modeling and generates higher quality outputs according to human evaluation. Furthermore, the model gives rise to a latent edit vector that captures interpretable semantics such as sentence similarity and sentence-level analogies.
... We used the grow-diag-final-and heuristic for extracting phrases, lexicalised reordering and Batch MIRA (Cherry and Foster, 2012) for tuning (the default parameters on Moses). We trained 5-gram language models with Kneser-Ney smoothing using KenLM (Heafield et al., 2013). With these parameters, we trained SMT systems for en-ta and en-hi language pairs, with and without the use of extracted parallel sentence pairs. ...
Preprint
Resources for the non-English languages are scarce and this paper addresses this problem in the context of machine translation, by automatically extracting parallel sentence pairs from the multilingual articles available on the Internet. In this paper, we have used an end-to-end Siamese bidirectional recurrent neural network to generate parallel sentences from comparable multilingual articles in Wikipedia. Subsequently, we have showed that using the harvested dataset improved BLEU scores on both NMT and phrase-based SMT systems for the low-resource language pairs: English--Hindi and English--Tamil, when compared to training exclusively on the limited bilingual corpora collected for these language pairs.
... For each triples set on the validation and test set, the random system generates a response by randomly selecting a Wikipedia summary from our training set. Secondly, we use the KenLM toolkit [33] in order to build a 5-gram Kneser-Ney (KN) language model. During testing, similarly to the case of our neural network approach, for each triple set in the validation and test set, we use beam search with a beam of size 10, in order to generate the 10 most probable summaries. ...
Preprint
Most people do not interact with Semantic Web data directly. Unless they have the expertise to understand the underlying technology, they need textual or visual interfaces to help them make sense of it. We explore the problem of generating natural language summaries for Semantic Web data. This is non-trivial, especially in an open-domain context. To address this problem, we explore the use of neural networks. Our system encodes the information from a set of triples into a vector of fixed dimensionality and generates a textual summary by conditioning the output on the encoded vector. We train and evaluate our models on two corpora of loosely aligned Wikipedia snippets and DBpedia and Wikidata triples with promising results.
... We used a 5-gram language model trained with modified Kneser-Ney smoothing (Kneser and Ney, 1995) using the KenLM toolkit (Heafield et al., 2013). Our PB-SMT log-linear features include: (a) 4 translational features (forward and backward phrase and lexical probabilities), (b) 8 lexicalised reordering probabilities (wbe-mslr-bidirectional-fe-allff), (c) 5-gram LM probabilities, (d) 5 OSM features (Durrani et al., 2011), and (e) word-count and distortion penalties. ...
... Language Model and Decoder At test time, we apply a simple, lexicon-free, character-level n-gram language model (LM) to the model predictions. In our experiments, we use a 6-gram modified Kneser-Ney LM (Heafield et al., 2013) generated from the WikiText-103 raw dataset (Merity et al., 2016), and built using the KenLM package (Heafield, 2011). The LM is integrated with the CTC logits using an efficient first-pass beam-search decoder similar to prior work. ...
Preprint
Full-text available
Surface electromyography (sEMG) non-invasively measures signals generated by muscle activity with sufficient sensitivity to detect individual spinal neurons and richness to identify dozens of gestures and their nuances. Wearable wrist-based sEMG sensors have the potential to offer low friction, subtle, information rich, always available human-computer inputs. To this end, we introduce emg2qwerty, a large-scale dataset of non-invasive electromyographic signals recorded at the wrists while touch typing on a QWERTY keyboard, together with ground-truth annotations and reproducible baselines. With 1,135 sessions spanning 108 users and 346 hours of recording, this is the largest such public dataset to date. These data demonstrate non-trivial, but well defined hierarchical relationships both in terms of the generative process, from neurons to muscles and muscle combinations, as well as in terms of domain shift across users and user sessions. Applying standard modeling techniques from the closely related field of Automatic Speech Recognition (ASR), we show strong baseline performance on predicting key-presses using sEMG signals alone. We believe the richness of this task and dataset will facilitate progress in several problems of interest to both the machine learning and neuroscientific communities. Dataset and code can be accessed at https://github.com/facebookresearch/emg2qwerty.
... The baseline predictors are word length in characters, index of word position within each sentence, unigram surprisal (both SPR and ET corpora), as well as whether the previous word was fixated (ET corpora only). Unigram surprisal was estimated using the KenLM toolkit (Heafield et al., 2013) with default smoothing hyperparameters on the OpenWebText Corpus (Gokaslan and Cohen, 2019), which contains about 6.5 billion whitespace-delimited words. On top of these baseline regression models, surprisal at the current word and the previous word was included to capture lingering effects of the previous word (i.e. ...
Preprint
Full-text available
Recent psycholinguistic research has compared human reading times to surprisal estimates from language models to study the factors shaping human sentence processing difficulty. Previous studies have shown a strong fit between surprisal values from Transformers and reading times. However, standard Transformers work with a lossless representation of the entire previous linguistic context, unlike models of human language processing that include memory decay. To bridge this gap, this paper evaluates a modification of the Transformer model that uses ALiBi (Press et al., 2022), a recency bias added to attention scores. Surprisal estimates with ALiBi show an improved fit to human reading times compared to a standard Transformer baseline. A subsequent analysis of attention heads suggests that ALiBi's mixture of slopes -- which determine the rate of memory decay in each attention head -- may play a role in the improvement by helping models with ALiBi to track different kinds of linguistic dependencies.
... To further improve multilingual ASR performance, one direct method is enhancing language models (LMs) to correct errors, which encompasses enhancing grammar, fixing spelling mistakes, clarifying homophones, adding punctuation marks, recovering missed words, and normalizing text to ensure readability. Incorporating language models such as N-gram [16] and RNN-based LMs [17] into ASR systems can effectively improve recognition performance, but it is a long-established practice. More advanced pretrained language models (PLMs) [18,19] and large language models (LLMs) [20,21,22,23], independently trained on large-scale corpora, have been developed in recent years. ...
... We rely on the small-scale WikiText-2 (Merity et al., 2017) and IWSLT-14 (Cettolo et al., 2014) data sets, respectively, and compare the performance of standard MLE and label smoothing to the performance obtained by using regularizers based on the smoothing methods illustrated in §3, available through the KenLM (Heafield, 2011; Heafield et al., 2013) library. The remaining smoothing methods were implemented natively in fairseq. ...
... Early efforts to build large-scale language models utilized n-grams and simple smoothing techniques [9] [10]. Various neural network architectures were later applied to the language modeling task, including feedforward networks [11] and recurrent networks [12]. ...
Preprint
Full-text available
Pretrained Large Language Models (LLM) such as ChatGPT, Claude, etc. have demonstrated strong capabilities in various fields of natural language generation. However, there are still many problems when using LLM in specialized domain-specific fields. When using generative AI to process downstream tasks, a common approach is to add new knowledge (e.g., private domain knowledge, cutting-edge information) to a pretrained model through continued training or fine-tuning. However, whether there is a universal paradigm for domain adaptation training is still an open question. In this article, we proposed Information Gain Optimized Tokenizer (IGOT), which analyzes the special token set of downstream tasks, constructs a new subset using a heuristic function φ with the special token and its information gain, to build a new domain-specific tokenizer, and continues pretraining on the downstream task data. We explored the many positive effects of this method's customized tokenizer on domain-adaptive pretraining and verified this method can perform better than the ordinary method of just collecting data and fine-tuning. Based on our experiment, the continued pretraining process of IGOT with LLaMA-7B achieved 11.9% token saving, 12.2% training time saving, and 5.8% maximum GPU VRAM usage saving; combined with the T5 model, we can even reach 31.5% training time saving, making porting general generative AI to specific domains more effective than before. In domain-specific tasks, supervised IGOT_τ shows great performance on reducing both the convergence radius and convergence point during continued pretraining.
... Five-gram surprisal. The negative log probability of a word in context as computed by KenLM 5-gram language models (Heafield, Pouzyrevsky, Clark, & Koehn, 2013) from frequency counts in the Gigaword 3 corpus (Graff, Kong, Chen, & Maeda, 2007; Figure A2, 5-gram surprisal). Five-gram models condition the probability distribution over the upcoming word on the sequence of four words that precede it, using default interpolation and backoff settings as described in Heafield and colleagues (2013). ...
Article
Full-text available
Human language is expressive because it is compositional: The meaning of a sentence (semantics) can be inferred from its structure (syntax). It is commonly believed that language syntax and semantics are processed by distinct brain regions. Here, we revisit this claim using precision fMRI methods to capture separation or overlap of function in the brains of individual participants. Contrary to prior claims, we find distributed sensitivity to both syntax and semantics throughout a broad frontotemporal brain network. Our results join a growing body of evidence for an integrated network for language in the human brain within which internal specialization is primarily a matter of degree rather than kind, in contrast with influential proposals that advocate distinct specialization of different brain areas for different types of linguistic functions.
... • Rate (ET, SPR, fMRI): a "deconvolutional intercept"; that is, a timestamped vector of 1's that is convolved by the model to yield an IRF representing the baseline response to an event, so named because variability in the response is driven by the rate of stimulus events in time. • Unigram surprisal (ET, SPR, fMRI): the negative log probability of a word derived from a KenLM unigram model (Heafield et al., 2013) trained on the Gigaword 3 corpus (Graff et al., 2007). To account for the possibility of qualitatively different scan path responses to linguistic variables in regressive vs. non-regressive eye movements, in the Dundee scan path analyses we follow Shain and Schuler (2021) in partitioning all variables in the scan path analyses into +reg and -reg variants as a function of whether the fixation occurred within a regression (+reg) or not (-reg). ...
Article
Full-text available
The dynamics of the mind are complex. Mental processes unfold continuously in time and may be sensitive to a myriad of interacting variables, especially in naturalistic settings. But statistical models used to analyze data from cognitive experiments often assume simplistic dynamics. Recent advances in deep learning have yielded startling improvements to simulations of dynamical cognitive processes, including speech comprehension, visual perception, and goal-directed behavior. But due to poor interpretability, deep learning is generally not used for scientific analysis. Here, we bridge this gap by showing that deep learning can be used, not just to imitate, but to analyze complex processes, providing flexible function approximation while preserving interpretability. To do so, we define and implement a nonlinear regression model in which the probability distribution over the response variable is parameterized by convolving the history of predictors over time using an artificial neural network, thereby allowing the shape and continuous temporal extent of effects to be inferred directly from time series data. Our approach relaxes standard simplifying assumptions (e.g., linearity, stationarity, and homoscedasticity) that are implausible for many cognitive processes and may critically affect the interpretation of data. We demonstrate substantial improvements on behavioral and neuroimaging data from the language processing domain, and we show that our model enables discovery of novel patterns in exploratory analyses, controls for diverse confounds in confirmatory analyses, and opens up research questions in cognitive (neuro)science that are otherwise hard to study.
... It utilizes phrase-based language models to interpret source language text and generate questions in the target language. To bolster system performance, we trained a tri-gram language model on target-side texts with the help of KenLM [104] and tuned it using minimum error rate training (MERT) on the development set. Performance results were evaluated on the test set. ...
Article
Full-text available
The goal of this article is to develop a multiple-choice question generation system that has a number of advantages, including quick scoring, consistent grading, and a short exam period. To overcome this difficulty, we suggest treating the problem of question creation as a sequence-to-sequence learning problem, where a sentence from a text passage can be directly mapped to a question. Our approach is data-driven, which eliminates the need for manual rule implementation. This strategy is more effective and gets rid of potential errors that could result from incorrect human input. Our work on question generation, particularly the usage of the transformer model, has been impacted by recent developments in a number of domains, including neural machine translation, generalization, and picture captioning.
... In our research, the same language-dependent statistical LM was used for the modular and the end-to-end approach, for both Spanish and Irish. These LMs were initially created in ARPA format but were converted to binary using KenLM (Heafield et al., 2013) to decrease the time required to load the models. The integration of the LM with the AM was performed using shallow fusion through the CTC decoder library pyctcdecode. ...
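A sketch of this shallow-fusion setup with pyctcdecode and a KenLM model is shown below; the label set, file name, weights, and dummy logits are placeholders and would have to match the acoustic model's actual vocabulary.

```python
# Sketch of shallow fusion between a CTC acoustic model and a KenLM n-gram LM
# via pyctcdecode. Labels, file name, weights, and logits are illustrative.
import numpy as np
from pyctcdecode import build_ctcdecoder

labels = list("abcdefghijklmnopqrstuvwxyz '") + [""]   # "" used as the CTC blank here
decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="lm.arpa",   # ARPA or binary KenLM model
    alpha=0.7,                    # LM weight (the study above tunes 0.7-0.8)
    beta=1.0,                     # word-insertion bonus
)

# Pretend acoustic output: 100 frames of uniform log-probabilities.
logits = np.log(np.full((100, len(labels)), 1.0 / len(labels)))
print(decoder.decode(logits))
```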
Conference Paper
Full-text available
We present a comparative study of a state-of-the-art traditional modular Automatic Speech Recognition system (Kaldi ASR) and an end-to-end ASR (wav2vec 2.0) for a well-resourced language (Spanish) and a low-resourced language (Irish). We created ASRs for both languages and evaluated their performance under different update regimes. Our results show that the end-to-end wav2vec 2.0 outperforms the modular ASR for both languages in terms of Word Error Rate (WER) but performs worse in terms of real-time decoding. We also addressed the issue of non-lexical words in wav2vec 2.0's output. We found that, for wav2vec 2.0, LM integration with shallow fusion and increasing the LM weight to 0.7 and 0.8 for Spanish and Irish, respectively, provided the optimum ASR performance by reducing non-lexical words. However, this does not eliminate all non-lexical words. Finally, our study found that Kaldi ASR performs best for real-time decoding of longer audio inputs compared to the wav2vec 2.0 model trained on the same dataset on minimal infrastructure, although wav2vec 2.0's performance can be improved with GPU acceleration in the backend. These results may have significant implications for creating real-time ASR services, especially for low-resourced languages.
... For the statistical models, we trained 2- to 6-gram models using the KenLM toolkit (Heafield, 2011). KenLM implements a modified Kneser-Ney smoothing technique (Heafield et al., 2013), which has been demonstrated to produce sequences with low perplexity. In all of our experiments, we maintained the default hyperparameters. ...
... It offers tools for both model training and query generation and provides a comprehensive solution for harnessing the power of n-gram language models. The toolkit's approach includes techniques such as trie-based data structures and bit-level encoding, enabling faster and more memory-efficient queries [22]. We used this toolkit to build the n-gram models for our experiments and used its query tools to score sequences during decoding in the integration. ...
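Conceptually, the trie-based storage mentioned above maps each n-gram to a path of words with the probability stored at the final node. The toy Python illustration below shows only the idea; KenLM's real trie is array-backed and bit-packed.

```python
# Toy n-gram trie: each n-gram is a path of words, with the log probability
# (and backoff) kept at the final node. Conceptual only.
class NGramTrie:
    def __init__(self):
        self.children = {}            # word -> NGramTrie
        self.logprob = None
        self.backoff = 0.0

    def insert(self, ngram, logprob, backoff=0.0):
        node = self
        for word in ngram:
            node = node.children.setdefault(word, NGramTrie())
        node.logprob, node.backoff = logprob, backoff

    def lookup(self, ngram):
        node = self
        for word in ngram:
            node = node.children.get(word)
            if node is None:
                return None
        return node.logprob

trie = NGramTrie()
trie.insert(("the", "cat"), -1.2, backoff=-0.4)
print(trie.lookup(("the", "cat")))    # -1.2
print(trie.lookup(("the", "dog")))    # None -> would trigger backoff
```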
Preprint
Full-text available
End-to-end automatic speech recognition (ASR) models have achieved state-of-the-art performance by leveraging extensive training data. However, the absence of an explicit language model in recent end-to-end architectures poses challenges in feeding text-only data to the model and may result in sub-optimal recognition, especially for rare and context-specific words. This study investigates the efficacy of integrating an external language model into the end-to-end ASR system. The central hypothesis states that leveraging an external language model at inference time will enhance system performance by effectively utilizing in-domain text data available to the system by the language model. A comprehensive set of experiments across various tasks demonstrates a substantial reduction in word error rate, thus underscoring the positive impact of external language model integration on both field and general data. It also shows that this method simplifies integrating contextual information into the system by utilizing text-only data which mitigates the cost of fine-tuning and compensates for the scarcity of audio-text data.
... The overall training time is approximately 15.3 days on the whole dataset on a single P100 GPU. While evaluating, we consider a single type of language model for acoustic model decoding: a 4-gram model using KenLM toolkit [21] which is trained on the corpus based on the train-set data. We use a small budget of 10 trials, each with a beam-width of 200, to estimate the best hyperparameters (α = 0.69, β = 1.68) while decoding using CMA-ES sampler [22,23]. ...
... This substantial leap was especially pertinent to enhancing machine translation quality. Even though the early techniques, such as the "Stupid Backoff" for smoothing, were rudimentary, advancements were made by Heafield et al. (2013). The transformative potential of scaling was further emphasized with the evolution of transformer architectures, which carved out novel benchmarks in numerous NLP challenges. ...
Preprint
In natural language processing, transformer-based large language models (LLMs) like GPT-x models developed by OpenAI have revolutionized the landscape. Despite their impressive capabilities, these models often encounter challenges when handling tasks that differ from their training data, resulting in compromised performance. To address this, few-shot learning has emerged as a valuable technique, allowing LLMs to adapt with minimal task-specific data. One innovative strategy, known as Chain-of-Thought Prompting (CoT), has been introduced to guide LLMs in revealing cognitive processes during multi-step reasoning. In this paper, we propose Code Chain-of-Thought (CodeCoT), which consists of two components: the Vanilla CodeCoT and the Self-exam CodeCoT. The latter incorporates self-examination, empowering the model to iteratively generate code, formulate test cases, and refine its outputs. Specifically, the process entails the generation of test examples by the model corresponding to the code it is tasked to implement. If it fails on the test examples, then it regenerates the code based on the erroneous code and associated error types. Through comprehensive experiments, we observed that both techniques significantly enhance code generation accuracy across various LLM variants. Our evaluation results reveal that CodeCoT improves the code generation effectiveness, including an unprecedented pass@1 accuracy of 79.27% using the Self-exam CodeCoT approach on the gpt-3.5-turbo-0613 model in the HumanEval dataset.
... In addition to the baseline predictors, two surprisal predictors were also included in all regression models evaluated in this experiment. The first is unigram surprisal as a measure of word frequency, which was calculated using the KenLM toolkit (Heafield et al., 2013) with parameters estimated on the English Gigaword Corpus (Parker et al., 2009). The second is surprisal from GPT-2 Small (Radford et al., 2019), which is trained on ∼8B tokens of the WebText dataset. ...
... These are sentences that are most likely from the point of view of the language model. We use a three-gram language model with Kneser-Ney smoothing trained on the Newspaper Corpus with KenLM [25] to estimate the sentence probability. ...
Preprint
Full-text available
Grammatical error correction is one of the fundamental tasks in Natural Language Processing. For the Russian language, most of the spellcheckers available correct typos and other simple errors with high accuracy, but often fail when faced with non-native (L2) writing, since the latter contains errors that are not typical for native speakers. In this paper, we propose a pipeline involving a language model intended for correcting errors in L2 Russian writing. The language model proposed is trained on untagged texts of the Newspaper subcorpus of the Russian National Corpus, and the quality of the model is validated against the RULEC-GEC corpus.
... Models. We implement Master-ASR and baseline methods on a pretrained XLSR-53 (Conneau et al., 2020) model without using a language model, such as a 4-gram language model (Heafield et al., 2013), to ensure a fair comparison. XLSR-53 is pretrained on 53 languages in an SSL manner. ...
Preprint
Full-text available
Despite the impressive performance recently achieved by automatic speech recognition (ASR), we observe two primary challenges that hinder its broader applications: (1) The difficulty of introducing scalability into the model to support more languages with limited training, inference, and storage overhead; (2) The low-resource adaptation ability that enables effective low-resource adaptation while avoiding over-fitting and catastrophic forgetting issues. Inspired by recent findings, we hypothesize that we can address the above challenges with modules widely shared across languages. To this end, we propose an ASR framework, dubbed Master-ASR, that, for the first time, simultaneously achieves strong multilingual scalability and low-resource adaptation ability thanks to its modularize-then-assemble strategy. Specifically, Master-ASR learns a small set of generalizable sub-modules and adaptively assembles them for different languages to reduce the multilingual overhead and enable effective knowledge transfer for low-resource adaptation. Extensive experiments and visualizations demonstrate that Master-ASR can effectively discover language similarity and improve multilingual and low-resource ASR performance over state-of-the-art (SOTA) methods, e.g., under multilingual ASR, our framework achieves a 0.13∼2.41 lower character error rate (CER) with 30% smaller inference overhead over SOTA solutions on multilingual ASR and a comparable CER, with nearly 50 times fewer trainable parameters over SOTA solutions on low-resource tuning, respectively.
... Thus, we report the 3-reference BLEU score. We use fast-align (Dyer, Chahuneau, and Smith 2013) to extract the alignment information for sentences in Table 4 and strategies more suitable for SiMT, and use KenLM (Heafield et al. 2013) to calculate the source language model score in the chunk length-based strategy. ...
Article
Simultaneous machine translation (SiMT) is usually done via sequence-level knowledge distillation (Seq-KD) from a full-sentence neural machine translation (NMT) model. However, there is still a significant performance gap between NMT and SiMT. In this work, we propose to leverage monolingual data to improve SiMT, which trains a SiMT student on the combination of bilingual data and external monolingual data distilled by Seq-KD. Preliminary experiments on En-Zh and En-Ja news domain corpora demonstrate that monolingual data can significantly improve translation quality (e.g., +3.15 BLEU on En-Zh). Inspired by the behavior of human simultaneous interpreters, we propose a novel monolingual sampling strategy for SiMT, considering both chunk length and monotonicity. Experimental results show that our sampling strategy consistently outperforms the random sampling strategy (and other conventional typical NMT monolingual sampling strategies) by avoiding the key problem of SiMT -- hallucination, and has better scalability. We achieve +0.72 BLEU improvements on average against random sampling on En-Zh and En-Ja. Data and codes can be found at https://github.com/hexuandeng/Mono4SiMT.
... Given an incomplete sentence, e.g., "The book is on the", such models use the training data to generate a probability distribution to determine the most probable next words, e.g., "table" or "shelf". Early efforts to build large-scale language models used Ngrams and simple smoothing techniques (Brants et al., 2007;Heafield et al., 2013;Buck et al., 2014). Other approaches applied various types of neural networks architectures, such as feedforward networks (Bengio et al., 2000) and recurrent networks (Mikolov et al., 2010;Jozefowicz et al., 2016), to the language modeling task. ...
Preprint
Full-text available
The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python, can be prompted to achieve 40% pass@1 on HumanEval, and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.
... We also estimate the probability and perplexity of each sentence by training two 4-gram models on ukWaC (uncased tokens and lemmas) in the third model. This was created using KenLM (Heafield et al., 2013), a language modeling toolkit based on modified Kneser-Ney smoothing (Kneser and Ney, 1995). The n-gram model added 4 features. ...
Book
The "Second Workshop on Tools and Resources for REAding DIfficulties" (READI), collocated with the "International Conference on Language Resources and Evaluation" (LREC 2020), aims at presenting current state-of-the-art techniques and achievements for text adaptations together with existing reading aids and resources for lifelong learning. The materials are addressed to children struggling with difficulties in learning to read, to the community of teachers, speech-language pathologists and parents seeking solutions, but also to adults and professionals working with adults struggling with reading (illiterates, aphasic readers, low-vision readers, etc.).
Preprint
Efficient methods for storing and querying are critical for scaling high-order n-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes up to 2500x, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying).
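The on-the-fly probability computation above rests on counting arbitrary n-grams by binary search over a suffix structure. The toy Python sketch below uses an explicit suffix array (Python 3.10+ for the key= argument of bisect); a compressed suffix tree supports the same queries in far less memory.

```python
# Toy sketch of on-the-fly n-gram counting with a suffix array: any n-gram's
# frequency is found by binary search over sorted suffixes, so no probability
# table has to be precomputed.
from bisect import bisect_left, bisect_right

def build_suffix_array(tokens):
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def count(tokens, sa, pattern):
    pat = list(pattern)
    prefix = lambda i: tokens[i:i + len(pat)]       # suffix truncated to |pat|
    lo = bisect_left(sa, pat, key=prefix)
    hi = bisect_right(sa, pat, key=prefix)
    return hi - lo

tokens = "the cat sat on the mat the cat ran".split()
sa = build_suffix_array(tokens)
bigram = count(tokens, sa, ("the", "cat"))
unigram = count(tokens, sa, ("the",))
print(bigram / unigram)    # maximum-likelihood P(cat | the) = 2/3
```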
Article
Full-text available
This article introduces a novel representation of Arabic text as an alternative approach for Arabic NLP, inspired by the dotless script of ancient Arabic. We explored this representation through extensive analysis on various text corpora, differing in size and domain, and tokenized using multiple tokenization techniques. Furthermore, we examined the information density of this representation and compared it with the standard dotted Arabic text using text entropy analysis. Utilizing parallel corpora, we also drew comparisons between Arabic and English text analysis to gain additional insights. Our investigation extended to various upstream and downstream NLP tasks, including language modeling, text classification, sequence labeling, and machine translation, examining the implications of both representations. Specifically, we performed seven different downstream tasks using various tokenization schemes, comparing the standard dotted text with dotless Arabic text representations. Performance using both representations was comparable across different tokenizations. However, the dotless representation achieves these results with a significant reduction in vocabulary size, in some scenarios showing reductions of up to 50%. Additionally, we present a system that restores dots to dotless Arabic text. This system is useful for tasks that require Arabic texts as output.
Article
Full-text available
Many studies of human language processing have shown that readers slow down at less frequent or less predictable words, but there is debate about whether frequency and predictability effects reflect separable cognitive phenomena: are cognitive operations that retrieve words from the mental lexicon based on sensory cues distinct from those that predict upcoming words based on context? Previous evidence for a frequency-predictability dissociation is mostly based on small samples (both for estimating predictability and frequency and for testing their effects on human behavior), artificial materials (e.g., isolated constructed sentences), and implausible modeling assumptions (discrete-time dynamics, linearity, additivity, constant variance, and invariance over time), which raises the question: do frequency and predictability dissociate in ordinary language comprehension, such as story reading? This study leverages recent progress in open data and computational modeling to address this question at scale. A large collection of naturalistic reading data (six datasets, >2.2 M datapoints) is analyzed using nonlinear continuous-time regression, and frequency and predictability are estimated using statistical language models trained on more data than is currently typical in psycholinguistics. Despite the use of naturalistic data, strong predictability estimates, and flexible regression models, results converge with earlier experimental studies in supporting dissociable and additive frequency and predictability effects.
Article
Full-text available
During real-time language comprehension, our minds rapidly decode complex meanings from sequences of words. The difficulty of doing so is known to be related to words’ contextual predictability, but what cognitive processes do these predictability effects reflect? In one view, predictability effects reflect facilitation due to anticipatory processing of words that are predictable from context. This view predicts a linear effect of predictability on processing demand. In another view, predictability effects reflect the costs of probabilistic inference over sentence interpretations. This view predicts either a logarithmic or a superlogarithmic effect of predictability on processing demand, depending on whether it assumes pressures toward a uniform distribution of information over time. The empirical record is currently mixed. Here, we revisit this question at scale: We analyze six reading datasets, estimate next-word probabilities with diverse statistical language models, and model reading times using recent advances in nonlinear regression. Results support a logarithmic effect of word predictability on processing difficulty, which favors probabilistic inference as a key component of human language processing.
Chapter
Taiwan-accented speech bears similarities to the Mandarin Min dialect, but with substantial differences in vocabulary, which significantly impacts spoken language recognition outcomes. This paper concentrates on integrating pre-trained language models (PLMs) with state-of-the-art self-supervised learning (SSL)-based speech recognition systems for Taiwan-accented speech recognition tasks. We propose a progressive error correction process in tandem with recognition to fully exploit the autoregressive nature of PLM models. Experimental results demonstrate that our method effectively addresses recognition errors stemming from misspelled vocabulary in accented speech. Our proposed progressive approach achieves roughly a 0.5% improvement compared to the conventional method. Furthermore, we demonstrate that fine-tuning PLMs solely with the text from the accented dataset can enhance recognition performance, despite the limitations of accented speech resources.
Chapter
Information retrieval (IR) systems evaluation aims at comparing IR systems either (1) one to another with respect to a single test collection, or (2) across multiple collections. In the first case, the evaluation environment (test collection and evaluation metrics) stays the same, while in the second case the environment changes. Different evaluation environments may be seen, in fact, as evolutionary versions of some given evaluation environment. In this work, we propose a methodology to predict the statistically significant change in the performance of an IR system (i.e. result delta, RΔ) by quantifying the differences between test collections (i.e. knowledge delta, KΔ). In a first phase, we quantify differences between document collections (i.e. K_dΔ) in the test collections by means of TF-IDF and Language Model (LM) representations. We use the K_dΔ to train SVM classification models to predict the statistically significant performance changes of various IR systems using evolving test collections derived from the Robust and TREC-COVID collections. We evaluate our approach against our previous K_dΔ experiments. Keywords: Evolving Test Collections, Performance Prediction, Knowledge Delta, Result Delta
Article
This paper examines in depth the algorithms used in Natural Language Understanding (NLU) with Machine Learning (ML) in order to develop natural language applications such as sentiment analysis, text classification, and question answering. The paper thoroughly investigates the diverse applications, inherent challenges, and promising future prospects of machine learning in NLU, providing valuable insights into its revolutionary influence on language processing and comprehension.
Thesis
Full-text available
This thesis explores flexible techniques for automatically recognising text in historical documents. It delves into various approaches to optical character recognition for printed documents and handwritten text recognition for manuscripts. The effectiveness of these approaches is tested using both the traditional convolutional neural network and recurrent neural network architecture, as well as the newer Transformer-based architecture. Additionally, new datasets are introduced. Results show that the Transformer-based models perform equally well or better than the traditional approaches. The thesis also introduces and compares methods for evaluating text generated by automatic text recognition methods without ground truth material.
Conference Paper
Full-text available
The Norwegian Parliamentary Speech Corpus (NPSC) is a speech dataset with recordings of meetings from Stortinget, the Norwegian parliament. It is the first, publicly available dataset containing unscripted, Norwegian speech designed for training of automatic speech recognition (ASR) systems. The recordings are manually transcribed and annotated with language codes and speakers, and there are detailed metadata about the speakers. The transcriptions exist in both normalized and non-normalized form, and non-standardized words are explicitly marked and annotated with standardized equivalents. To test the usefulness of this dataset, we have compared an ASR system trained on the NPSC with a baseline system trained on only manuscript-read speech. These systems were tested on an independent dataset containing spontaneous, dialectal speech. The NPSC-trained system performed significantly better, with a 22.9% relative improvement in word error rate (WER). Moreover, training on the NPSC is shown to have a "democratizing" effect in terms of dialects, as improvements are generally larger for dialects with higher WER from the baseline system.
Article
Full-text available
Table-to-text generation aims to generate descriptions for structured data (i.e., tables) and has been applied in many fields like question-answering systems and search engines. Current approaches mostly use neural language models to learn alignment between output and input based on the attention mechanisms, which are still flawed by the gradual weakening of attention when processing long texts and the inability to utilize the records’ structural information. To solve these problems, we propose a novel generative model SAN-T2T, which consists of a field-content selective encoder and a descriptive decoder, connected with a selective attention network. In the encoding phase, the table’s structure is integrated into its field representation, and a content selector with self-aligned gates is applied to take advantage of the fact that different records can determine each other’s importance. In the decoding phase, the content selector’s semantic information enhances the alignment between description and records, and a featured copy mechanism is applied to solve the rare word problem. Experiments on WikiBio and WeatherGov datasets show that SAN-T2T outperforms the baselines by a large margin, and the content selector indeed improves the model’s performance.
Chapter
The Transformer, a model relying entirely on the attention mechanism, brought significant improvements in performance on several natural language processing tasks. This chapter presents its impact on the speech processing domain and, more specifically, on the automatic speech recognition task. A short history of the evolution of automatic speech recognition systems is also given. A selection of important works making use of transformers is presented, as well as pretraining self-supervised architectures. Keywords: Transformers, Automatic Speech Recognition (ASR)
Chapter
Social media users differ in how they write such as writing style and topics. This suggests that personalized language models—language models tailored to a specific person—could outperform a single generic language model. One challenge, however, is that language models typically require a large volume of text to train on, but for many people such a volume of text is not available. In this paper, we train n-gram and neural language models on relatively large in-domain background corpora, and on relatively small amounts of text from individual social media users, specifically authors of blogs. In experiments with interpolated language models, we find that, although user-specific language models trained on a small amount of text from a user perform relatively poorly, they can be interpolated with language models trained on a large background corpus to give improvements over either approach on its own. We further find that n-gram and neural language models are complementary, and can be interpolated to give improvements over either approach used individually. Our evaluation considers perplexity, and two evaluation measures motivated by next word suggestion on smart-phones. We find that although perplexity is widely used for intrinsic evaluation of language models, it is a poor indicator of performance in terms of these other measures.
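The interpolation described in this chapter reduces to a weighted mixture of the two models' next-word distributions. Below is a minimal sketch, where p_user and p_background stand for any next-word probability functions and the held-out data is assumed to be a list of (context, word) pairs.

```python
# Minimal sketch of interpolating a small user-specific LM with a large
# background LM; the mixture weight lam is tuned on held-out user text.
import math

def interpolated_prob(word, context, p_user, p_background, lam=0.3):
    return lam * p_user(word, context) + (1.0 - lam) * p_background(word, context)

def perplexity(heldout, prob_fn):
    # heldout: list of (context, word) pairs
    logp = sum(math.log(prob_fn(w, ctx)) for ctx, w in heldout)
    return math.exp(-logp / len(heldout))

def tune_lambda(heldout, p_user, p_background, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    # Pick the mixture weight that minimizes held-out perplexity.
    def mix(lam):
        return lambda w, ctx: interpolated_prob(w, ctx, p_user, p_background, lam)
    return min(grid, key=lambda lam: perplexity(heldout, mix(lam)))
```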
Article
The issue of duplicate elimination for large data files in which many occurrences of the same record may appear is addressed. A comprehensive cost analysis of the duplicate elimination operation is presented. This analysis is based on a combinatorial model developed for estimating the size of intermediate runs produced by a modified merge-sort procedure. The performance of this modified merge-sort procedure is demonstrated to be significantly superior to the standard duplicate elimination technique of sorting followed by a sequential pass to locate duplicate records. The results can also be used to provide critical input to a query optimizer in a relational database system.
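A small in-memory sketch of the modified merge-sort idea analyzed above: duplicates are dropped during the merge itself rather than in a separate pass, so intermediate runs shrink as well. A disk-based implementation would stream the runs instead of holding them as Python lists.

```python
# Merge sorted runs and emit each distinct record once; dropping duplicates
# inside the merge is what shrinks the intermediate runs.
import heapq

def merge_unique(runs):
    last = object()                    # sentinel that equals nothing
    for record in heapq.merge(*runs):
        if record != last:
            yield record
            last = record

runs = [[1, 3, 3, 7], [2, 3, 8], [3, 7, 9]]
print(list(merge_unique(runs)))        # [1, 2, 3, 7, 8, 9]
```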
Article
We present the software library STXXL that is an implementation of the C++ standard template library (STL) for processing huge data sets that can fit only on hard disks. It supports parallel disks, overlapping between disk I/O and computation and it is the first I/O-efficient algorithm library that supports the pipelining technique that can save more than half of the I/Os. STXXL has been applied both in academic and industrial environments for a range of problems including text processing, graph algorithms, computational geometry, Gaussian elimination, visualization, and analysis of microscopic images, differential cryptographic analysis, etc. The performance of STXXL and its applications are evaluated on synthetic and real-world inputs. We present the design of the library, how its performance features are supported, and demonstrate how the library integrates with STL. Copyright © 2007 John Wiley & Sons, Ltd.