Conference Paper

Corpora Generation for Grammatical Error Correction

... In recent years, it has been the subject of many modeling efforts due to its ability to improve the grammaticality and readability of user-generated texts [2]. Most of the previously proposed approaches for GEC view the task as monolingual text-to-text rewriting or machine translation [3], [4], [5], [2], [6] and [7]. While these models can handle the full dependency between outputs [8], some serious issues still need to be addressed before we obtain a GEC model that matches human performance. ...
... In this work, we propose a novel multi-head model to improve the efficiency and performance of GEC by dividing the task into seven subtasks: Insertion, Deletion, Merge, Substitution, Transformation, Detection and Correction, shown in Figure 1. Since previous works like [9], [7], [22], [10], and [2] have proven the efficiency of multistage training, our GEC sequence tagging pipeline consists of three training stages: pretraining on synthetic data, fine-tuning on an errorful parallel corpus, and finally, fine-tuning on a combination of errorful and error-free parallel corpora. ...
... The GEC problem suffers from a lack of clean parallel datasets [7]. The publicly available annotated datasets are small or noisy [22]. ...
Preprint
Full-text available
To solve the Grammatical Error Correction (GEC) problem, a mapping between a source sequence and a target one is needed, where the two differ only in a few spans. For this reason, attention has shifted to non-autoregressive or sequence tagging models, in which GEC is simplified from Seq2Seq to labeling the input tokens with edit commands chosen from a large edit space. Due to this large number of classes and the limitations of the available datasets, current sequence tagging approaches still have issues handling a broad range of grammatical errors when laser-focused on one single task. To this end, we simplify GEC further by dividing it into seven related subtasks: Insertion, Deletion, Merge, Substitution, Transformation, Detection, and Correction, with Correction being our primary focus. A distinct classification head is dedicated to each of these subtasks. A novel multi-head and multi-task learning model is proposed to effectively utilize training data and harness the information from related task training signals. To mitigate the limited number of available training samples, a new denoising autoencoder is used to generate a new synthetic dataset for pretraining. Additionally, a new character-level transformation is proposed to enhance the sequence-to-edit function and improve the model's vocabulary coverage. Our single/ensemble model achieves an F0.5 of 74.4/77.0 on BEA-19 (test) and 68.6/69.1 on CoNLL-14 (test), respectively. Moreover, evaluated on the JFLEG test set, the GLEU scores are 61.6 and 61.7 for the single and ensemble models, respectively. It mostly outperforms recently published state-of-the-art results by a considerable margin.
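The head-per-subtask design described in this abstract can be sketched compactly. The following is a minimal illustration only, assuming a shared encoder feeding one classification head per subtask; all module names, sizes, and label inventories are assumptions, not the authors' implementation.

# Minimal sketch of a multi-head GEC tagger: a shared encoder with one
# classification head per subtask. Sizes and label sets are assumptions.
import torch.nn as nn

SUBTASKS = {
    "insertion": 2, "deletion": 2, "merge": 2, "substitution": 2,
    "transformation": 30,   # assumed number of transformation labels
    "detection": 2, "correction": 5000,  # assumed edit-vocabulary size
}

class MultiHeadGECTagger(nn.Module):
    def __init__(self, hidden=768):
        super().__init__()
        # Stand-in encoder; a pretrained transformer would be typical here.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.heads = nn.ModuleDict({name: nn.Linear(hidden, n)
                                    for name, n in SUBTASKS.items()})

    def forward(self, token_embeddings):
        h = self.encoder(token_embeddings)            # (batch, seq, hidden)
        # One logit tensor per subtask; a multi-task loss would sum
        # per-head cross-entropies (weighting is a free design choice).
        return {name: head(h) for name, head in self.heads.items()}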
... Many studies have been conducted on generating corpora using natural texts for various natural language processing tasks, such as [2][3][4][5][6][7]. ...
... Such a strong dependence of runtime not only on the amount but also on other parameters of the input data can be explained by the authors' lack of attempts to evaluate or measure it. However, it can be argued that the methods that work with textual data (reported in [2][3][4][5][6]) are more effective in terms of speed than the one proposed by the authors, as they do not require working with images. ...
... In [5], the authors describe two approaches to the generation of large parallel corpora for their use in solving the task of correcting grammatical errors. Both approaches use Wikipedia as a source of natural texts (not necessarily in English): the first approach uses page editing history, and the second approach uses two-way machine translation. ...
Article
Full-text available
The object of research is the process of generating text data corpora using the CorDeGen method. The problem addressed in this study is the insufficient speed of corpus generation with the CorDeGen method. Based on an analysis of the abstract CorDeGen method – the steps it consists of and the algorithm that implements it – its opportunities for parallelization were determined. As a result, two new modifications of the base CorDeGen method were developed: "naive" parallel and parallel. These methods differ in whether they preserve the order of terms in the generated texts compared to the texts generated by the base method (the "naive" parallel method does not preserve it; the parallel method does). Using the .NET platform and the C# programming language, both proposed methods were implemented in software, and a property-based testing methodology was used to validate both implementations. Efficiency testing showed that for sufficiently large corpora, the parallel CorDeGen methods speed up generation by a factor of two compared to the base method. The acceleration is explained precisely by parallelizing the generation of each term – its creation, the calculation of its number of occurrences across texts, and its recording – which takes most of the time in the base method. This means that when sufficiently large corpora must be generated in limited time, it is reasonable in practice to use the developed parallel CorDeGen methods instead of the base one. The choice of a particular parallel method (naive or conventional) for a practical application depends on whether the ability to predict the order of terms in the generated texts is important.
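The parallelization scheme is simple enough to sketch. The authors implemented it in C# on .NET; the Python sketch below is only an illustration of the idea under assumed details (per-term work farmed out to a process pool, with input order preserved as in the order-preserving variant).

# Illustrative sketch (not the authors' C#/.NET code): parallelize the
# per-term generation work that dominates the base method's runtime.
from concurrent.futures import ProcessPoolExecutor

def generate_term_entry(term_index):
    # Stand-in for the expensive per-term step: creating the term and
    # computing its occurrences across the corpus texts.
    term = f"term{term_index}"
    return term * (term_index % 5 + 1)

def generate_corpus(n_terms):
    with ProcessPoolExecutor() as pool:
        # map() returns results in input order, mirroring the
        # order-preserving "parallel" variant; collecting results as they
        # complete would mirror the "naive" parallel variant instead.
        return list(pool.map(generate_term_entry, range(n_terms)))

if __name__ == "__main__":
    corpus = generate_corpus(1000)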
... We achieve our best results using an unsupervised synthetic data generation method based on round-trip translations, i.e. sentence pairs that were generated by translating an English sentence into another language (e.g. German) and back, a technique that was previously proposed for GEC pre-training (Lichtarge et al., 2019). We construct additional data sets by creating mappings from the longest to the shortest reference in multi-reference machine translation (MT) test sets. ...
... into English. This idea of generating sentence pairs via round-trip translation was initially proposed by Lichtarge et al. (2019) to pre-train GEC systems. ...
... The Transformer system is pre-trained on round-trip translations of sentences crawled from news websites, prepared as described in Sec. 3.2, following the recipe of Lichtarge et al. (2019). ...
... The advantage of this is that the model gets a second chance to correct errors it might have missed during the first iteration. Lichtarge et al. (2019) thus proposed an iterative decoding algorithm that allows a model to make multiple incremental corrections. In each iteration, the model is allowed to generate ...
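Stripped to its core, iterative decoding is a refinement loop. The sketch below is a simplified illustration: the published algorithm additionally accepts a rewrite only when it beats an identity-translation cost threshold, which is omitted here, and correct_once stands in for any trained GEC model.

# Simplified sketch of iterative decoding: re-apply the model until the
# output stops changing or a maximum number of refinement rounds is hit.
def iterative_decode(sentence, correct_once, max_iters=5):
    for _ in range(max_iters):
        corrected = correct_once(sentence)   # one incremental pass
        if corrected == sentence:            # converged: no further edits
            break
        sentence = corrected
    return sentence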
... Ehsan and Faili (2013) apply one error to each sentence from pre-defined error templates that include omitting prepositions, repeating words, and so on. Lichtarge et al. (2019) introduce spelling errors into Wikipedia edit history data by performing deletion, insertion, replacement, and transposition of characters. Zhao et al. (2019) also apply a similar noising strategy but at the word level, that is, deleting, adding, shuffling, and replacing words in a sentence. ...
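The word-level variant attributed to Zhao et al. (2019) above amounts to a few random operations per sentence. Here is a self-contained sketch under assumed rates; the probabilities and the duplicate-as-insertion choice are illustrative, not from the paper.

# Sketch of word-level noising: randomly delete, add, replace, and
# locally shuffle words. All rates here are assumptions.
import random

def noise_words(tokens, p=0.1):
    out = []
    for tok in tokens:
        r = random.random()
        if r < p / 4:
            continue                           # delete the word
        elif r < p / 2:
            out.extend([tok, tok])             # add (here: duplicate)
        elif r < 3 * p / 4:
            out.append(random.choice(tokens))  # replace with another word
        else:
            out.append(tok)
    if len(out) > 1 and random.random() < p:   # shuffle: swap neighbors
        i = random.randrange(len(out) - 1)
        out[i], out[i + 1] = out[i + 1], out[i]
    return out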
... The assumption is that the MT system will make translation errors and so the output via the bridge language will be noisy in relation to the input. This strategy was employed by Madnani, Tetreault, and Chodorow (2012) and Lichtarge et al. (2019), who furthermore both explored the effect of using different bridge languages. Zhou et al. (2020) explore a similar technique, except use a bridge language as the input to both a low-quality and high-quality translation system (namely SMT vs. NMT), and treat the output from the former as an ungrammatical noisy sentence and the output from the latter as the reference. ...
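Round-trip translation as a corruption process is easy to express. In the sketch below, translate is a placeholder for any MT system (no real API is assumed), and the bridge language is a free parameter, echoing the bridge-language comparisons mentioned above.

# Sketch of round-trip translation for synthetic GEC data: the noisy
# round-tripped text becomes the source, the original the target.
# `translate` is a hypothetical function; no real MT API is assumed.
def round_trip_pair(sentence, translate, bridge="de"):
    forward = translate(sentence, src="en", tgt=bridge)
    noisy = translate(forward, src=bridge, tgt="en")
    return noisy, sentence   # (ungrammatical-ish source, clean target)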
Article
Full-text available
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject–verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors, respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems, which represent the current dominant state of the art. In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarize the various methods and techniques that have been developed with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgments, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as a comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
... • Controllable Rule-based Corpora Corruption (Edunov et al., 2018)
• Data Generation from Round-trip Translations (Lichtarge et al., 2019)
• Non-Autoregressive Translation (NAT)-based Data Construction (Sun et al., 2022)
To explore the preceding synthetic data construction approaches, we utilize the Bangla monolingual data from the Bangla-English parallel machine translation corpus (Hasan et al., 2020) comprising 2.75 million sentence pairs. ...
... This is not only time-intensive but also makes it hard to cover all the error phenomena produced by learners. Hence, we utilize the round-trip translation mechanism for synthesizing an erroneous text set (Lichtarge et al., 2019). Round-trip translations introduce noise arising both from the weaknesses of the translation models and from several inherent ambiguities of translation. ...
... However, prior works have primarily focused on showing the effectiveness of their proposed augmentation methods, without considering sample efficiency. Training GEC models with excessive samples (e.g. over 100M (Lichtarge et al., 2019; Stahlberg and Kumar, 2021)) for poorly scalable improvements is expensive and often unfeasible for most researchers. Additionally, existing studies suffer from a lack of consistent experimental settings, making it intractable to systematically and fairly compare various data augmentation methods. ...
... Round-trip translation (RT). RT is an alternative method for generating pseudo data, based on the assumption that NMT systems may produce translation errors, resulting in noisy outputs via the bridge languages (Lichtarge et al., 2019; Zhou et al., 2020). The diverse outputs, however, may change the structure of the sentence due to the heterogeneity of different languages. ...
... Generating synthetic data Standard data corruption methods typically use a variety of heuristics: random character and token transformations (Schmaltz et al., 2016;Lichtarge et al., 2019a), confusion sets generated from a spellchecker (Grundkiewicz and Junczys-Dowmunt, 2019; Naplava and Straka, 2019), or a morphological analyzer (Choe et al., 2019), or round-trip translation (Lichtarge et al., 2019a). ...
... In recent years, English GEC task has attracted wide attention from researchers. By employing pre-trained models (Kaneko et al., 2020;Katsumata and Komachi, 2020) or incorporating synthetic data (Grundkiewicz et al., 2019;Lichtarge et al., 2019), the sequence-to-sequence models achieve remarkable performance on English GEC task. Besides, several sequence labeling approaches are proposed to cast text generation as token-level edit prediction (Malmi et al., 2019;Awasthi et al., 2019;Omelianchuk et al., 2020). ...
... PIE (Awasthi et al., 2019) and GECToR (Omelianchuk et al., 2020) manually design detailed English-specific labels regarding case and tense. Synthetic data is generated to enhance model performance (Ge et al., 2018; Grundkiewicz et al., 2019; Lichtarge et al., 2019). Besides the two mainstream model structures, ESD-ESC first detects erroneous spans and generates correct content only for the annotated spans. ...
... In general, characters with higher than 80% confidence scores have an overall accuracy close to 100% based on human validation, and characters with low confidence scores, such as 30%, require special attention. Recent NLP advances in the Transformer architecture (Vaswani et al., 2017), which primarily consists of a multi-head self-attention mechanism stacked in combination with an encoder/decoder structure, have proven adept at correcting erroneous English text and performing context-based grammatical error correction on incorrect characters/words (Junczys-Dowmunt et al., 2018; Lichtarge et al., 2019; Zhao et al., 2019). Inspired by this idea, we built on these advancements by implementing an additional Confidence Score mechanism in the standard Transformer architecture to further improve the Transformer's performance. ...
... Many recent neural Grammatical Error Correction (GEC) models are trained on this type of Transformer architecture (Junczys-Dowmunt et al., 2018;Lichtarge et al., 2019;Zhao et al., 2019). As a whole, our task of correcting the spelling errors in the OCR-ed Tibetan corpus is very similar to the GEC task. ...
Preprint
Scholars in the humanities rely heavily on ancient manuscripts to study history, religion, and socio-political structures in the past. Many efforts have been devoted to digitizing these precious manuscripts using OCR technology, but most manuscripts were blemished over the centuries so that an Optical Character Recognition (OCR) program cannot be expected to capture faded graphs and stains on pages. This work presents a neural spelling correction model built on Google OCR-ed Tibetan Manuscripts to auto-correct OCR-ed noisy output. This paper is divided into four sections: dataset, model architecture, training and analysis. First, we feature-engineered our raw Tibetan etext corpus into two sets of structured data frames -- a set of paired toy data and a set of paired real data. Then, we implemented a Confidence Score mechanism into the Transformer architecture to perform spelling correction tasks. According to the Loss and Character Error Rate, our Transformer + Confidence score mechanism architecture proves to be superior to Transformer, LSTM-2-LSTM and GRU-2-GRU architectures. Finally, to examine the robustness of our model, we analyzed erroneous tokens, visualized Attention and Self-Attention heatmaps in our model.
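One plausible way to wire OCR confidence into a Transformer, sketched below purely as an assumed illustration (the excerpts above do not spell out the exact mechanism), is to project the per-character confidence score and add it to the character embedding before encoding.

# Assumed illustration of fusing per-character OCR confidence into the
# input embeddings of a Transformer encoder; not the paper's exact design.
import torch.nn as nn

class ConfidenceAwareEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.conf_proj = nn.Linear(1, d_model)  # scalar confidence -> vector

    def forward(self, char_ids, confidences):
        # char_ids: (batch, seq) int64; confidences: (batch, seq) in [0, 1]
        return self.embed(char_ids) + self.conf_proj(confidences.unsqueeze(-1))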
... Zhao et al. [10] used a large-scale unlabeled corpus to pre-train a denoising autoencoder [11,12] with a copy mechanism for the Transformer model [13], and achieved results close to those of Ge et al. [14], which were based on word-level and sentence-level multi-task learning, while using only publicly available "error-corrected" parallel corpora. Based on the idea of round-trip translation, Lichtarge et al. [15] used Wikipedia data to generate a large number of pseudo-parallel sentence pairs, using an intermediate bridge language for the round trip, to pre-train a forward grammatical error correction model. ...
Preprint
Full-text available
Asymmetric codecs have recently been successful in many deep learning tasks, and applying this kind of deep learning neural network to English grammar error detection is a forward-looking line of research. Traditional machine translation models have poor semantic analysis ability, which leads to low accuracy when such models are used to detect English grammatical errors. To solve this problem, this paper designs a deep learning neural network model based on an asymmetric Encoder-Decoder. First, English sentences are pre-processed and converted into a tuple probability table recognizable by the model, which is then fed into the encoder-decoder model. In the encoder, pre-embedding is first implemented using a temporal convolutional network to obtain initial word mapping relationships in the sentences, followed by a BiLSTM for accurate word embedding that captures word-to-word relationships. In addition, to extract relevant features with global information more accurately and adaptively from the context vector output by the encoder, an attention mechanism processes the features output by the BiLSTM according to their corresponding weights. In the decoder, a BiGRU decodes the context vector output by the encoder and emits the translation result once decoding is complete. After training the machine translation model, word analysis is performed through the model's decoding process. Finally, experiments detected English grammatical errors involving articles, prepositions, nouns, verbs, and subject-predicate agreement. The experimental results show that the designed model outperforms the comparison methods on evaluation metrics such as precision, recall, and F1 for English grammar detection. The necessity of each component of the model is verified, showing that the model can effectively improve the accuracy of English grammar error detection. This research provides important theoretical guidance for applying deep learning neural networks with asymmetric codecs to English grammar error detection.
... A special form of translation is round trip translation, which focuses on translating a given text from one language to the second and back to the first. Round trip translation has been increasingly used in several research areas, including correcting grammatical errors (Lichtarge et al., 2019;Madnani et al., 2012), evaluating machine translation models (Crone et al., 2021;Cao et al., 2020;Moon et al., 2020), paraphrasing (Guo et al., 2021) and rewriting questions (Chu et al., 2020). It is also used extensively as part of the quality assurance process in critical domains such as medical, legal and market search domains. ...
Preprint
Full-text available
Language Models today provide a high accuracy across a large number of downstream tasks. However, they remain susceptible to adversarial attacks, particularly against those where the adversarial examples maintain considerable similarity to the original text. Given the multilingual nature of text, the effectiveness of adversarial examples across translations and how machine translations can improve the robustness of adversarial examples remain largely unexplored. In this paper, we present a comprehensive study on the robustness of current text adversarial attacks to round-trip translation. We demonstrate that 6 state-of-the-art text-based adversarial attacks do not maintain their efficacy after round-trip translation. Furthermore, we introduce an intervention-based solution to this problem, by integrating Machine Translation into the process of adversarial example generation and demonstrating increased robustness to round-trip translation. Our results indicate that finding adversarial examples robust to translation can help identify the insufficiency of language models that is common across languages, and motivate further research into multilingual adversarial attacks.
... Ge et al. (2018) and Fu et al. (2018b) proposed to use recurrent neural networks, while recent work (Grundkiewicz et al., 2019; Lichtarge et al., 2019; Fu et al., 2018a) made use of the Transformer (Vaswani et al., 2017). ...
... Currently, there are many studies devoted to the generation of text data corpora for use in various natural language processing tasks (e.g., [1][2][3][4]), but most of them have little applicability to solving software engineering tasks. This is due to various factors, the main ones of which are the following: ...
Article
This paper is devoted to the issue of generating corpora of text data for use in solving software engineering problems in the context of developing information systems for natural language processing (NLP). One of the methods intended for this is the basic CorDeGen method; however, analysis revealed certain disadvantages. One such disadvantage is that NLP methods at the pre-processing stage can remove some of the terms generated by this method from the texts, treating them as stop words of a certain language. Removing some of the terms distorts the distribution of terms between the texts predicted by the CorDeGen method, so the result of processing the corpus with a given NLP method differs significantly from the expected one. To address this disadvantage, a new modified CorDeGen+ method is proposed in the paper, which introduces an additional, language-dependent stage that checks each generated term for admissibility and, if necessary, replaces it with another one. At the same time, the proposed method preserves all the advantages of the basic CorDeGen method, as well as its other possible disadvantages, except for the corrected one. The paper examines language variations of the proposed method for the four most common European languages and a variation for languages that use non-Latin letters. Experimental testing showed the effectiveness of the CorDeGen+ method in correcting the described disadvantage of the basic CorDeGen method. The testing also showed that the degree of slowdown of the corpus generation process due to the additional stage depends on the corpus size: for micro-corpora (100 unique terms) the slowdown reaches 39%, but as the corpus size increases the degree drops sharply, reaching at most 6.8% for super-large corpora (312,500 unique terms).
... Nowadays, using synthetic data or data augmentation to improve the performance of GEC models has become a mainstream approach (Madnani et al., 2012;Grundkiewicz and Junczys-Dowmunt, 2014;Grundkiewicz et al., 2019). Common construction methods can be categorized into rule-based substitution (Awasthi et al., 2019;Choe et al., 2019) and model-based generation methods (Xie et al., 2018;Lichtarge et al., 2019;Fang et al., 2023a). However, the synthetic data constructed by the above methods are mainly used in the pre-training phase to initialize a better GEC model. ...
Preprint
Nowadays, data augmentation through synthetic data has been widely used in the field of Grammatical Error Correction (GEC) to alleviate the problem of data scarcity. However, these synthetic data are mainly used in the pre-training phase rather than the data-limited fine-tuning phase due to inconsistent error distribution and noisy labels. In this paper, we propose a synthetic data construction method based on contextual augmentation, which can ensure an efficient augmentation of the original data with a more consistent error distribution. Specifically, we combine rule-based substitution with model-based generation, using the generative model to generate a richer context for the extracted error patterns. Besides, we also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data. Experiments on CoNLL14 and BEA19-Test show that our proposed augmentation method consistently and substantially outperforms strong baselines and achieves the state-of-the-art level with only a few synthetic data.
... To investigate whether ChatLang-8 has value as a dataset, we train a vanilla Transformer (Vaswani et al., 2017) and BART (Lewis et al., 2020), a much larger model, on both ChatLang-8 and Lang-8 (Mizumoto et al., 2012; Tajiri et al., 2012). Note that for the fairness of the experiment, we choose Lang-8, which has a similar dataset structure and corpus size, and which effectively improves performance (Lichtarge et al., 2019, 2020; Flachs et al., 2021). In addition, though there is a previous attempt to improve Lang-8 (Flachs et al., 2021), it reports only five error types on the BEA test set, while we consider 25 error types and three benchmarks. ...
Preprint
We explore and improve the capabilities of LLMs to generate data for grammatical error correction (GEC). When merely producing parallel sentences, their patterns are too simplistic to be valuable as a corpus. To address this issue, we propose an automated framework that includes a Subject Selector, Grammar Selector, Prompt Manager, and Evaluator. Additionally, we introduce a new dataset for GEC tasks, named ChatLang-8, which encompasses eight types of subject nouns and 23 types of grammar. It consists of 1 million pairs featuring human-like grammatical errors. Our experiments reveal that ChatLang-8 exhibits a more uniform pattern composition compared to existing GEC datasets. Furthermore, we observe improved model performance when using ChatLang-8 instead of existing GEC datasets. The experimental results suggest that our framework and ChatLang-8 are valuable resources for enhancing ChatGPT's data generation capabilities.
... Mainly, there are two techniques used in building such datasets: noisy injection and back-translation (Kiyono et al., 2019). The noisy injection technique involves corrupting an already clean text by inserting pre-defined errors in a rule-based way (Ehsan and Faili, 2013; Lichtarge et al., 2019; Zhao et al., 2019), or by injecting probabilistic error patterns (Rozovskaya and Roth, 2010; Felice and Yuan, 2014; Rei et al., 2017). The back-translation technique, on the other hand, involves training a noisy channel model to predict a probable source text given a correct text (Xie et al., 2018). ...
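The back-translation technique in the excerpt above trains a reverse (clean-to-errorful) model and runs it over clean text. Below is a hedged sketch with a generic seq2seq interface; noise_model and its generate method are placeholders, not a specific library API.

# Sketch of back-translation for GEC: a "noisy channel" seq2seq trained on
# (correct -> errorful) pairs manufactures synthetic sources from clean
# monolingual text. `noise_model.generate` is a placeholder interface.
def synthesize_pairs(clean_sentences, noise_model):
    pairs = []
    for target in clean_sentences:
        # Sampling rather than greedy decoding keeps injected errors
        # diverse, which pre-training generally benefits from.
        source = noise_model.generate(target, do_sample=True)
        pairs.append((source, target))
    return pairs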
Preprint
Grammatical Error Correction has seen significant progress with the recent advancements in deep learning. As those methods require huge amounts of data, synthetic datasets are being built to fill this gap. Unfortunately, synthetic datasets are not organic enough in some cases and even require clean data to start with. Furthermore, most prior work has focused on English. In this work, we introduce a new organic data-driven approach, clean insertions, to build parallel Turkish Grammatical Error Correction datasets from any organic data, and to clean the data used for training Large Language Models. We achieve state-of-the-art results on two of the three publicly available Turkish Grammatical Error Correction test sets. We also show the effectiveness of our method on the training losses of language models.
... We build on the CoNLL-2014 validation dataset, rather than create a new one with only synthetically induced errors, to capture the interaction of character- and word-level errors, described in Section 2. We preserve all non-character-level errors in order to capture their interaction with the character-level ones. We rely on the algorithm for generating datasets suggested in [35]; that is, we probabilistically introduce spelling errors into the source sentences at a rate of 1-3 per sentence, randomly selecting deletion, insertion, replacement, or transposition of adjacent characters for each introduced error. The new density of character-level errors is much higher than in the original dataset. ...
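The recipe from [35] as paraphrased above fits in a few lines. Here is a minimal sketch, with the character pool and sampling details as assumptions.

# Sketch of the character-level noising recipe described above: 1-3
# spelling errors per sentence via random deletion, insertion,
# replacement, or adjacent transposition. Details are assumptions.
import random
import string

def corrupt(sentence):
    chars = list(sentence)
    for _ in range(random.randint(1, 3)):
        if len(chars) < 2:
            break
        i = random.randrange(len(chars) - 1)
        op = random.choice(("delete", "insert", "replace", "transpose"))
        if op == "delete":
            del chars[i]
        elif op == "insert":
            chars.insert(i, random.choice(string.ascii_lowercase))
        elif op == "replace":
            chars[i] = random.choice(string.ascii_lowercase)
        else:                                  # transpose adjacent chars
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)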
Article
The study focuses on how modern GEC systems handle character-level errors. We discuss the ways these errors affect the performance of models and test how models of different architectures handle them. We conclude that specialized GEC systems do struggle to correct non-existent words, and that a simple spellchecker considerably improves the overall performance of a model. To evaluate this, we assess the models over several datasets. In addition to the CoNLL-2014 validation dataset, we contribute a synthetic dataset with a higher density of character-level errors and conclude that, given that models generally show very high scores, validation datasets with a higher density of tricky errors are a useful tool for comparing models. Lastly, we notice cases of incorrect treatment of non-existent words in the experts' annotation and contribute a cleaned version of this dataset. In contrast to specialized GEC systems, the LLaMA model used for the GEC task handles character-level errors well. We suggest that this better performance is explained by the fact that Alpaca is not extensively trained on annotated texts with errors, but receives grammatically and orthographically correct texts as input.
... Synthetic data generation. Synthetic data generation for GEC commonly adopts two strategies: backtranslation-based corruption methods using labeled data (Kiyono et al., 2019;Stahlberg and Kumar, 2021;Xie et al., 2018), and error injection corruption methods via edit pairs or confusion sets extracted from labeled data (Awasthi et al., 2019;Lichtarge et al., 2019;Yuan and Felice, 2013). Methods that do not require labeled GEC data have been explored by Grundkiewicz et al. (2019) and Sun et al. (2022). ...
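Error injection via edit pairs or confusion sets, as mentioned in the excerpt above, replaces words with attested confusable alternatives. A minimal sketch follows; the confusion sets shown are invented examples, while real ones are mined from labeled GEC data or a spellchecker.

# Sketch of confusion-set error injection. The sets below are invented
# examples; in practice they are extracted from labeled data.
import random

CONFUSION = {
    "their": ["there", "they're"],
    "a": ["an", ""],          # "" simulates an article-deletion error
    "is": ["are", "was"],
}

def inject_errors(tokens, p=0.15):
    out = []
    for tok in tokens:
        if tok.lower() in CONFUSION and random.random() < p:
            out.append(random.choice(CONFUSION[tok.lower()]))
        else:
            out.append(tok)
    return [t for t in out if t]   # drop empties left by deletions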
... According to the authors of [5], the majority of researchers approaching the problem of interpreting noisy ASR output focus on correcting the errors first with the use of text correction tools, such as in [6]. The corrected ASR output is then interpreted using an NLU module. ...
... Learner corpora have recently been the main focus when it comes to creating GEC learning datasets. Synthetically generated data was used before learner corpora and is still used as additional data during the development of GEC systems (Kaneko et al., 2020; Kiyono et al., 2019; Zhao et al., 2019; Rothe et al., 2021; Omelianchuk et al., 2020; Lichtarge et al., 2019; Grundkiewicz et al., 2019). There have been other semi-automatic alternatives for creating learning datasets, such as extracting Wikipedia edits (Grundkiewicz and Junczys-Dowmunt, 2014) as correct-erroneous sentence pairs. ...
... Back-translation is a reverse task that uses deep learning (DL) models to recreate error patterns in existing human-annotated datasets (Sennrich et al., 2016). Round-trip translation is based on the assumption that many translation models are still imperfect and that flow and style mistakes will be produced through the chain of translation (Lichtarge et al., 2019). Finally, the Wiki Edits and Lang8 datasets are available for any language (Faruqui et al., 2018). ...
... A transformer-based sequence-to-sequence model is adopted for GEC. It is initialised from Gramformer, a T5 model [19] trained on WikiEdits processed with synthetic error generation techniques [25]. The Gramformer is further tuned on L2 English learner data from the CLC [9] and the BEA 2019 shared task [8]. ...
... Recent work mainly formulates GEC as a monolingual translation task and handles it with burgeoning encoder-decoder-based MT models (Yuan and Briscoe, 2016; Junczys-Dowmunt et al., 2018), among which the Transformer (Vaswani et al., 2017) has become a dominant paradigm. With the help of synthetic training data (Lichtarge et al., 2019; Yasunaga et al., 2021) and large PLMs (Katsumata and Komachi, 2020), Transformer-based GEC models have achieved SOTA performance on various benchmark datasets ...
... In real-world applications, controlling specific GEC settings, such as minimal and fluency edits and learner level-based corrections, is crucial to address diverse learning needs and scenarios (Napoles et al., 2017;Bryant et al., 2019;Flachs et al., 2020). Although recent GEC approaches based on supervised learning have achieved remarkable progress, they heavily rely on large training datasets comprising both genuine and pseudo data (Xie et al., 2018;Ge et al., 2018;Zhao et al., 2019;Lichtarge et al., 2019;Xu et al., 2019;Choe et al., 2019;Qiu et al., 2019;Grundkiewicz et al., 2019;Kiyono et al., 2019;Grundkiewicz and Junczys-Dowmunt, 2019;Wang and Zheng, 2020;Zhou et al., 2020;Wan et al., 2020;Koyama et al., 2021a). Collecting such data for each specific setting is challenging and time-consuming, which limits the scalability of these methods in various learning situations. ...
... They propose a copy-augmented architecture for the GEC task which is pre-trained with unlabeled data. A series of works focus on data augmentation (Grundkiewicz et al., 2019; Ge et al., 2018; Lichtarge et al., 2019); Xie et al. (2018) propose to synthesize "realistic" parallel corpora with grammatical errors by back-translation. Zhao and Wang (2020) add a dynamic masking method to the original source sentence during training, which enhances model performance without requiring additional data. ...
... This paper studies a multiagent collaborative framework where one language model can generate critiques to improve its peer's performance. Prior work includes text editing (Malmi et al., 2022), grammatical (Lichtarge et al., 2019) or factual error correction (Mitchell et al., 2022b), and debiasing and detoxification (Schick et al., 2021). Unlike humans who can understand natural language feedback and improve using the information, most of the previous work relied on sequence tagging (Reid and Neubig, 2022), retraining from scratch (Sun et al., 2019) or parameter editing (Mitchell et al., 2022a) to repair model predictions. ...
... Qorib et al. (2022) propose a simple logistic regression method to combine GEC models much more effectively. It is noted that constructing pseudo datasets is especially useful for the GEC task, as noise can easily be injected into error-free sentences automatically, yielding large numbers of pseudo sentence pairs that can be used to pre-train GEC models (Zhao et al., 2019; Zhou et al., 2020; Lichtarge et al., 2019; Kiyono et al., 2020; Yasunaga et al., 2021; Sun et al., 2022; Fang et al., 2023b). Previous works have made preliminary attempts to incorporate detection label knowledge into GEC models in order to improve correction results. ...
... There are many studies on pseudo data generation. Lichtarge et al. [15] accumulated source-target pairs from the Wikipedia revision histories. Wan et al. [23] generated synthetic pairs by editing latent representations of grammatical sentences. ...
Preprint
Automated audio captioning (AAC) is an important cross-modality translation task, aiming at generating descriptions for audio clips. However, captions generated by previous AAC models have faced "false-repetition" errors due to the training objective. In such scenarios, we propose a new task of AAC error correction and hope to reduce such errors by post-processing AAC outputs. To tackle this problem, we use observation-based rules to corrupt captions without errors, for pseudo grammatically-erroneous sentence generation. One pair of corrupted and clean sentences can thus be used for training. We train a neural network-based model on the synthetic error dataset and apply the model to correct real errors in AAC outputs. Results on two benchmark datasets indicate that our approach significantly improves fluency while maintaining semantic information.
... Footnotes: 35 https://huggingface.co/rinna/japanese-gpt-1b, 36 https://huggingface.co/sonoisa/sentence-bert-base-ja-mean-tokens-v2. Pseudo-data generation (Xie et al. 2018; Ge et al. 2018a; Zhao et al. 2019; Lichtarge et al. 2019, 2020; Kiyono et al. 2020; Wang and Zheng 2020; Zhou et al. 2020; Wan et al. 2020; Stahlberg and Kumar 2021; Yasunaga et al. 2021); round-trip translation (Lichtarge et al. 2019). ...
Article
Full-text available
This study constructed an error-tagged evaluation corpus for Japanese grammatical error correction (GEC). Evaluation corpora are essential for assessing the performance of models. The availability of various evaluation corpora for English GEC has facilitated comprehensive comparison between models and the development of the English GEC community. However, the development of the Japanese GEC community has been hindered by the lack of available evaluation corpora for Japanese GEC. As a result, we constructed a new evaluation corpus for Japanese GEC and made it available to the public. We used texts written by learners of Japanese in the Lang-8 corpus, a representative learner corpus in GEC, to create the evaluation corpus. The specification of the evaluation corpus was modified to align with the representative corpora and tools in English GEC, making it easy for GEC researchers and developers to use. Finally, we evaluated representative GEC models on the created evaluation corpus and reported baseline scores for future Japanese GEC.
... In real-world applications, controlling specific GEC settings, such as minimal and fluency edits and learner level-based corrections, is crucial to address diverse learning needs and scenarios (Napoles et al., 2017;Bryant et al., 2019;Flachs et al., 2020). Although recent GEC approaches based on supervised learning have achieved remarkable progress, they heavily rely on large training datasets comprising both genuine and pseudo data (Xie et al., 2018;Ge et al., 2018;Zhao et al., 2019;Lichtarge et al., 2019;Xu et al., 2019;Choe et al., 2019;Qiu et al., 2019;Kiyono et al., 2019;Wang and Zheng, 2020;Zhou et al., 2020;Wan et al., 2020;Koyama et al., 2021a). Collecting such data for each specific setting is challenging and time-consuming, which limits the scalability of these methods in various learning situations. ...
Preprint
Large-scale pre-trained language models such as GPT-3 have shown remarkable performance across various natural language processing tasks. However, applying prompt-based methods with GPT-3 to Grammatical Error Correction (GEC) tasks and their controllability remains underexplored. Controllability in GEC is crucial for real-world applications, particularly in educational settings, where the ability to tailor feedback according to learner levels and specific error types can significantly enhance the learning process. This paper investigates the performance and controllability of prompt-based methods with GPT-3 for GEC tasks using zero-shot and few-shot settings. We explore the impact of task instructions and examples on GPT-3's output, focusing on controlling aspects such as minimal edits, fluency edits, and learner levels. Our findings demonstrate that GPT-3 can effectively perform GEC tasks, outperforming existing supervised and unsupervised approaches. We also show that GPT-3 can achieve controllability when appropriate task instructions and examples are given.
... Such platforms encourage users to revise and improve existing content, such as encyclopedias (Faruqui et al., 2018), how-to instructions (Anthonio et al., 2020), Q&A sites (Li et al., 2015), and debate portals (Skitalinskaya et al., 2021). Studies have explored ways to automate content regulation, namely text simplification (Botha et al., 2018), detection of grammar errors (Lichtarge et al., 2019), lack of citations (Redi et al., 2019), biased language (De Kock and Vlachos, 2022), and vagueness (Debnath and Roth, 2021). While Bhat et al. (2020) consider a task similar to ours -detecting sentences in need of revision in the domain of instructional texts -their findings do not fully transfer to argumentative texts, as different domains have different goals, different notions of quality, and, subsequently, different revision types performed. ...
Preprint
Full-text available
Optimizing the phrasing of argumentative text is crucial in higher education and professional development. However, assessing whether and how the different claims in a text should be revised is a hard task, especially for novice writers. In this work, we explore the main challenges to identifying argumentative claims in need of specific revisions. By learning from collaborative editing behaviors in online debates, we seek to capture implicit revision patterns in order to develop approaches aimed at guiding writers in how to further improve their arguments. We systematically compare the ability of common word embedding models to capture the differences between different versions of the same text, and we analyze their impact on various types of writing issues. To deal with the noisy nature of revision-based corpora, we propose a new sampling strategy based on revision distance. As opposed to approaches from prior work, such sampling can be done without employing additional annotations and judgments. Moreover, we provide evidence that using contextual information and domain knowledge can further improve prediction results. How useful a certain type of context is, though, depends on the issue the claim suffers from.
... Natural languages are rich, and their grammars contain many rules and exceptions; therefore, professional linguists are often utilized to annotate high-quality corpora for further training of ML-based systems, mostly in a supervised manner (Dahlmeier et al., 2013). However, human annotation is expensive, so researchers are working on methods for augmentation of training data, synthetic data generation, and strategies for its efficient usage (Lichtarge et al., 2019; Stahlberg and Kumar, 2021). The majority of GEC systems today use synthetic data to pre-train Transformer-based components of their models. ...
Article
Self-correction is an approach to improving responses from large language models (LLMs) by refining the responses using LLMs during inference. Prior work has proposed various self-correction frameworks using different sources of feedback, including self-evaluation and external feedback. However, there is still no consensus on the question of when LLMs can correct their own mistakes, as recent studies also report negative results. In this work, we critically survey broad papers and discuss the conditions required for successful self-correction. We first find that prior studies often do not define their research questions in detail and involve impractical frameworks or unfair evaluations that over-evaluate self-correction. To tackle these issues, we categorize research questions in self-correction research and provide a checklist for designing appropriate experiments. Our critical survey based on the newly categorized research questions shows that (1) no prior work demonstrates successful self-correction with feedback from prompted LLMs, except for studies in tasks that are exceptionally suited for self-correction, (2) self-correction works well in tasks that can use reliable external feedback, and (3) large-scale fine-tuning enables self-correction.
Chapter
This chapter presents a novel grammatical error correction (GEC) system leveraging the powerful T5 transformer model. Unlike traditional GEC tools, this system goes beyond simply correcting errors. It integrates ERRANT, a specialized tool that classifies the specific type of error present (e.g., subject-verb agreement, verb tense inconsistencies). This combined approach gives users a deeper understanding of their grammatical mistakes, fostering improved language proficiency; evaluation metrics confirm significant advancements in GEC performance. The research delves further by exploring the potential of quantum natural language processing (QNLP). Quantum networking supports distributed, secure, real-time QNLP applications, leveraging quantum networks for complex language processing tasks. Together, they create context-aware, secure language processing tools, addressing complexities of human language beyond classical computing's reach. This research enhances language proficiency and communication clarity by integrating T5 for GEC with insights on QNLP and quantum networking.
Article
Lyric rewriting involves taking the original lyrics of a song and creatively rephrasing them while preserving their core meaning and emotional essence. Sequence-to-sequence methods often face a lack of annotated corpora and difficulty in understanding lyrics when dealing with the lyric rewriting task. Inspired by grammatical error correction (GEC), a language rewriting technique, and neural machine translation (NMT), a sequence-to-sequence generation technique, we propose novel self-supervised learning methods that effectively address the lack of a lyric rewriting corpus. In addition, we propose a new pre-trained DAE Transformer model with data prior knowledge fusion to enhance lyric rewriting ability. The reference-as-context model (RaC-Large) we build on these two methods achieves the best results in comparison with baselines including large language models, fully verifying the effectiveness of the new method. We also validate the effectiveness of our approach on GEC and NMT tasks, further demonstrating its potential on a broad range of sequence-to-sequence tasks.
Article
Grammar error correction systems are pivotal in the field of natural language processing (NLP), with a primary focus on detecting and correcting errors that compromise the grammatical integrity of written text. This is crucial for both language learning and formal communication. Recently, neural machine translation (NMT) has emerged as a promising and in-demand approach. However, this approach faces significant challenges, particularly the scarcity of training data and the complexity of grammar error correction (GEC), especially for low-resource languages such as Indonesian. To address these challenges, we propose InSpelPoS, a confusion method that combines two synthetic data generation methods: the Inverted Spellchecker and Patterns+POS. Furthermore, we introduce an adapted seq2seq framework equipped with a dynamic decoding method and state-of-the-art Transformer-based neural language models to enhance the accuracy and efficiency of GEC. The dynamic decoding method is capable of navigating the complexities of GEC and correcting a wide range of errors, including contextual and grammatical errors. The proposed model leverages the contextual information of words and sentences to generate a corrected output. To assess the effectiveness of our proposed framework, we conducted experiments using synthetic data and compared its performance with existing GEC systems. The results demonstrate a significant improvement in the accuracy of Indonesian GEC compared to existing methods.
Article
Scholars in the humanities heavily rely on ancient manuscripts to study history, religion, and socio-political structures of the past. Significant efforts have been devoted to digitizing these precious manuscripts using OCR technology. However, most manuscripts have been blemished over the centuries, making it unrealistic for OCR programs to accurately capture faded characters. This work presents the Transformer + Confidence Score mechanism architecture for post-processing Google’s Tibetan OCR-ed outputs. According to the Loss and Character Error Rate metrics, our Transformer + Confidence Score mechanism architecture proves superior to the Transformer, LSTM-to-LSTM, and GRU-to-GRU architectures. Our method can be adapted to any language dealing with post-processing OCR outputs.
Article
Full-text available
This article is part of a larger project aiming at identifying discursive strategies in social media discourses revolving around the topic of gender diversity, for which roughly 350,000 comments were scraped from the comments sections below YouTube videos relating to the topic in question. This article focuses on different methods of standardizing social media data in order to enhance further processing. More specifically, the data are corrected in terms of casing, spelling, and punctuation. Different tools and models (LanguageTool, T5, seq2seq, GPT-2) were tested. The best outcome was achieved by the German GPT-2 model: It scored highest in all of the applied scores (ROUGE, GLEU, BLEU), making it the best model for the task of Grammatical Error Correction in German social media data.
Chapter
The task of Chinese Grammatical Error Diagnosis (CGED) is considered challenging due to the diversity of error types and subtypes, as well as the imbalanced distribution of subtype occurrences and the emergence of new subtypes, which pose a threat to the generalization ability of CGED models. In this paper, we propose a sentence editing and character filling-based CGED strategy that conducts task decomposition and transformation based on different types of grammatical errors, and provides corresponding solutions. To improve error detection accuracy, a refined set of error types is designed to better utilize training data. The correction task is transformed into a character slot filling task, the performance of which, as well as its generalization for long-tail scenarios and the open domain, can be improved by large-scale pre-trained models. Experiments conducted on CGED evaluation datasets show that our approach outperforms comparison models in all evaluation metrics and has good generalization.
Article
Full-text available
AI has introduced a new reform direction for traditional education, such as automating Grammatical Error Correction (GEC) to reduce teachers' workload and improve efficiency. However, current GEC models still have flaws because human language is very variable, and the available labeled datasets are often too small to learn everything automatically. One of the key principles of GEC is to preserve correct parts of the input text while correcting grammatical errors. However, previous sequence-to-sequence (Seq2Seq) models may be prone to over-correction as they generate corrections from scratch. Over-correction is a phenomenon where a grammatically correct sentence is incorrectly flagged as containing errors that require correction, leading to incorrect corrections that can change the meaning or structure of the original sentence. This can significantly reduce the accuracy and usefulness of GEC systems, highlighting the need for improved approaches that can reduce over-correction and ensure more accurate and natural corrections. Recently, sequence tagging-based models have been used to mitigate this issue by only predicting edit operations that convert the source sentence to a corrected one. Despite their good performance on datasets with minimal edits, they struggle to restore texts with drastic changes. This issue artificially restricts the type of changes that can be made to a sentence and does not reflect those required for native speakers to find sentences fluent or natural sounding. Moreover, sequence tagging-based models are usually conditioned on human-designed language-specific tagging labels, hindering generalization and failing to reflect the real error distribution generated by diverse learners from different nationalities. In this work, we introduce a novel Seq2Seq-based approach that can handle a wide variety of grammatical errors on a low-fluency dataset. Our approach enhances the Seq2Seq architecture with a novel copy mechanism based on supervised attention. Instead of merely predicting the next token in context, the model predicts additional correctness-related information for each token. This auxiliary objective propagates into the weights of the model during training without requiring extra labels at testing time. Experimental results on benchmark datasets show that our model achieves competitive performance compared to state-of-the-art (SOTA) models.
Preprint
Despite their unprecedented success, even the largest language models make mistakes. Similar to how humans learn and improve using feedback, previous work proposed providing language models with natural language feedback to guide them in repairing their outputs. Because human-generated critiques are expensive to obtain, researchers have devised learned critique generators in lieu of human critics while assuming one can train downstream models to utilize generated feedback. However, this approach does not apply to black-box or limited access models such as ChatGPT, as they cannot be fine-tuned. Moreover, in the era of large general-purpose language agents, fine-tuning is neither computationally nor spatially efficient as it results in multiple copies of the network. In this work, we introduce RL4F (Reinforcement Learning for Feedback), a multi-agent collaborative framework where the critique generator is trained to maximize end-task performance of GPT-3, a fixed model more than 200 times its size. RL4F produces critiques that help GPT-3 revise its outputs. We study three datasets for action planning, summarization and alphabetization and show improvements (~5% on average) in multiple text similarity metrics over strong baselines across all three tasks.
Article
Full-text available
Chinese grammatical error correction (GEC) is under continuous development and improvement, and this is a challenging task in the field of natural language processing due to the high complexity and flexibility of Chinese grammar. Nowadays, the iterative sequence tagging approach is widely applied to Chinese GEC tasks because it has a faster inference speed than sequence generation approaches. However, the training phase of the iterative sequence tagging approach uses sentences for only one round, while the inference phase is an iterative process. This makes the model focus only on the current sentence’s current error correction results rather than considering the results after multiple rounds of correction. In order to address this problem of mismatch between the training and inference processes, we propose a Chinese GEC method based on iterative training and sequence tagging (CGEC-IT). First, in the iterative training phase, we dynamically generate the target tags for each round by using the final target sentences and the input sentences of the current round. The final loss is the average of each round’s loss. Next, by adding conditional random fields for sequence labeling, we ensure that the model pays more attention to the overall labeling results. In addition, we use the focal loss to solve the problem of category imbalance caused by the fact that most words in text error correction do not need error correction. Furthermore, the experiments on NLPCC 2018 Task 2 show that our method outperforms prior work by up to 2% on the F0.5 score, which verifies the efficiency of iterative training on the Chinese GEC model.
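The focal loss mentioned in the abstract above down-weights the abundant, easy "no edit" tokens: FL(p_t) = -(1 - p_t)^γ · log p_t. A compact sketch follows; the γ value is a common default and an assumption here, since the paper's exact formulation is not reproduced.

# Sketch of focal loss for token-level GEC tagging: cross-entropy scaled
# by (1 - p_t)^gamma so easy "keep" tokens contribute less.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # logits: (n_tokens, n_labels); targets: (n_tokens,) int64
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                    # probability of the true class
    return ((1.0 - p_t) ** gamma * ce).mean()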
Article
Full-text available
Grammatical error correction aims to detect and correct grammatical errors covering all types of mistaken, disordered, missing, and redundant characters. However, most existing methods focus more on detecting errors than on correcting them. In this paper, we propose a domain-adaptive model with Interoperable Layer Normalization (ILN) and dynamic word embedding enhancement to optimize error correction capability. To further improve Chinese correction capability, we introduce multiple rounds of error correction to refine the sequence tagging model’s ability to fix mistakes. In addition, we propose a data augmentation method based on complex tags to represent textual error-correction traces more completely. We also explore a migration training method based on multiple training datasets. Further, we offer a unique unsupervised domain adaptation technique based on ILN, an innovative channel fusion approach that can significantly improve a model’s domain adaptability. Finally, experimental results show that our proposed method substantially outperforms all robust baseline methods and achieves the best results on position-level and correction-level errors on the CGED-2020 dataset.
Article
Full-text available
Previously, neural methods in grammatical error correction (GEC) did not reach state-of-the-art results compared to phrase-based statistical machine translation (SMT) baselines. We demonstrate parallels between neural GEC and low-resource neural MT and successfully adapt several methods from low-resource MT to neural GEC. We further establish guidelines for trustable results in neural GEC and propose a set of model-independent methods for neural GEC that can be easily applied in most GEC settings. Proposed methods include adding source-side noise, domain-adaptation techniques, a GEC-specific training objective, transfer learning with monolingual data, and ensembling of independently trained GEC models and language models. The combined effects of these methods result in better than state-of-the-art neural GEC models that outperform previously best neural GEC systems by more than 10% M2 on the CoNLL-2014 benchmark and 5.9% on the JFLEG test set. Non-neural state-of-the-art systems are outperformed by more than 2% on the CoNLL-2014 benchmark and by 4% on JFLEG.
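One of the listed methods, adding source-side noise, is easy to make concrete. The corruption operations and probabilities below are illustrative assumptions, not the authors' exact recipe:

```python
import random

def add_source_noise(tokens, p_drop=0.1, p_swap=0.1):
    """Corrupt a clean sentence to synthesize a (noisy, clean) training pair."""
    out = [tok for tok in tokens if random.random() >= p_drop]  # random deletions
    for i in range(len(out) - 1):                               # adjacent swaps
        if random.random() < p_swap:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out

random.seed(1)
clean = "she goes to school every day".split()
print(add_source_noise(clean), "->", clean)
```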
Article
Full-text available
We improve automatic correction of grammatical, orthographic, and collocation errors in text using a multilayer convolutional encoder-decoder neural network. The network is initialized with embeddings that make use of character N-gram information to better suit this task. When evaluated on common benchmark test data sets (CoNLL-2014 and JFLEG), our model substantially outperforms all prior neural approaches on this task as well as strong statistical machine translation-based systems with neural and task-specific features trained on the same data. Our analysis shows the superiority of convolutional neural networks over recurrent neural networks such as long short-term memory (LSTM) networks in capturing the local context via attention, and thereby improving the coverage in correcting grammatical errors. By ensembling multiple models, and incorporating an N-gram language model and edit features via rescoring, our novel method becomes the first neural approach to outperform the current state-of-the-art statistical machine translation-based approach, both in terms of grammaticality and fluency.
Conference Paper
Full-text available
Automated methods for identifying whether sentences are grammatical have various potential applications (e.g., machine translation, automated essay scoring, computer-assisted language learning). In this work, we construct a statistical model of grammaticality using various linguistic features (e.g., misspelling counts, parser outputs, n-gram language model scores). We also present a new publicly available dataset of learner sentences judged for grammaticality on an ordinal scale. In evaluations, we compare our system to the one from Post (2011) and find that our approach yields state-of-the-art performance.
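A toy version of such a feature-based grammaticality model, assuming scikit-learn; the feature columns follow the examples named in the abstract, but all numbers are fabricated purely for illustration:

```python
from sklearn.linear_model import Ridge

# Columns: [misspelling count, LM log-prob per token, parser confidence]
X = [[0, -2.1, 0.9],
     [3, -4.5, 0.4],
     [1, -3.0, 0.7],
     [5, -5.2, 0.2]]
y = [4.0, 1.0, 3.0, 1.0]  # ordinal grammaticality judgments (1 = worst)

model = Ridge().fit(X, y)
print(model.predict([[0, -2.3, 0.8]]))  # score an unseen sentence
```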
Conference Paper
Full-text available
This paper introduces the freely available WikEd Error Corpus. We describe the data mining process from Wikipedia revision histories, the corpus content, and its format. The corpus consists of more than 12 million sentences with a total of 14 million edits of various types. As one possible application, we show that WikEd can be successfully adapted to improve a strong baseline in a grammatical error correction task for English-as-a-Second-Language (ESL) learners' writings by 2.63%. Used together with an ESL error corpus, the composed system gains 1.64% compared to the ESL-trained system.
Conference Paper
Full-text available
In this paper, we consider the problem of generating candidate corrections for the task of correcting errors in text. We focus on the task of correcting errors in preposition usage made by non-native English speakers, using discriminative classifiers. The standard approach to the problem assumes that the set of candidate corrections for a preposition consists of all preposition choices participating in the task. We determine likely preposition confusions using an annotated corpus of non-native text and use this knowledge to produce smaller sets of candidates. We propose several methods of restricting candidate sets. These methods exclude candidate prepositions that are not observed as valid corrections in the annotated corpus and take into account the likelihood of each preposition confusion in the non-native text. We find that restricting candidates to those that are observed in the non-native data improves both the precision and the recall compared to the approach that views all prepositions as possible candidates. Furthermore, the approach that takes into account the likelihood of each preposition confusion is shown to be the most effective.
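The candidate-restriction idea can be sketched in a few lines; the confusion counts and the min_count threshold below are invented for illustration:

```python
# Preposition confusions observed in annotated non-native text (toy counts).
confusions = {
    "in": {"at": 40, "on": 25, "into": 5},
    "on": {"in": 30, "at": 20},
}

def candidates(source_prep, min_count=10):
    """Propose only corrections actually observed, and only likely ones."""
    observed = confusions.get(source_prep, {})
    return [p for p, n in observed.items() if n >= min_count]

print(candidates("in"))  # ['at', 'on'] -- 'into' is filtered out as unlikely
```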
Conference Paper
Full-text available
This paper presents a pilot study of the use of phrasal Statistical Machine Translation (SMT) techniques to identify and correct writing errors made by learners of English as a Second Language (ESL). Using examples of mass noun errors found in the Chinese Learner Error Corpus (CLEC) to guide creation of an engineered training set, we show that application of the SMT paradigm can capture errors not well addressed by widely-used proofing tools designed for native speakers. Our system was able to correct 61.81% of mistakes in a set of naturally-occurring examples of mass noun errors found on the World Wide Web, suggesting that efforts to collect alignable corpora of pre- and post-editing ESL writing samples can enable the development of SMT-based writing assistance tools capable of repairing many of the complex syntactic and lexical problems found in the writing of ESL learners.
Article
We explore six challenges for neural machine translation: domain mismatch, amount of training data, rare words, long sentences, word alignment, and beam search. We show both deficiencies and improvements over the quality of phrase-based statistical machine translation.
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
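The core operation is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V; a single-head sketch in PyTorch (the multi-head projections and the rest of the architecture are omitted):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ v

q = torch.randn(2, 4, 8)  # (batch, sequence length, d_k)
k = torch.randn(2, 4, 8)
v = torch.randn(2, 4, 8)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 4, 8])
```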
Article
We propose an algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems, including classification, regression, and reinforcement learning. The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks using only a small number of training samples. In our approach, the parameters of the model are explicitly trained such that a small number of gradient steps with a small amount of training data from a new task will produce good generalization performance on that task. In effect, our method trains the model to be easy to fine-tune. We demonstrate that this approach leads to state-of-the-art performance on a few-shot image classification benchmark, produces good results on few-shot regression, and accelerates fine-tuning for policy gradient reinforcement learning with neural network policies.
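The inner/outer structure fits in a short sketch. Below is a deliberately minimal one-parameter version on toy 1-D regression tasks; real MAML averages over batches of tasks and may take several inner steps, and all hyperparameters here are illustrative:

```python
import torch

# Tasks: y = a * x with a random slope a per task.
theta = torch.zeros(1, requires_grad=True)  # the shared initialization
opt = torch.optim.SGD([theta], lr=0.01)
inner_lr = 0.1

for step in range(100):
    a = torch.rand(1) * 4 - 2                           # sample a task
    x_tr, x_te = torch.randn(10, 1), torch.randn(10, 1)
    # Inner loop: one gradient step on the task's training set.
    loss_tr = ((x_tr * theta - a * x_tr) ** 2).mean()
    grad, = torch.autograd.grad(loss_tr, theta, create_graph=True)
    theta_adapted = theta - inner_lr * grad
    # Outer loop: optimize the post-adaptation loss w.r.t. theta itself,
    # which requires differentiating through the inner update.
    loss_te = ((x_te * theta_adapted - a * x_te) ** 2).mean()
    opt.zero_grad()
    loss_te.backward()
    opt.step()
```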
Article
We propose a simple, elegant solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages. Our solution requires no change in the model architecture from our base system but instead introduces an artificial token at the beginning of the input sentence to specify the required target language. The rest of the model, which includes encoder, decoder and attention, remains unchanged and is shared across all languages. Using a shared wordpiece vocabulary, our approach enables Multilingual NMT using a single model without any increase in parameters, which is significantly simpler than previous proposals for Multilingual NMT. Our method often improves the translation quality of all involved language pairs, even while keeping the total number of model parameters constant. On the WMT'14 benchmarks, a single multilingual model achieves comparable performance for English→French and surpasses state-of-the-art results for English→German. Similarly, a single multilingual model surpasses state-of-the-art results for French→English and German→English on WMT'14 and WMT'15 benchmarks respectively. On production corpora, multilingual models of up to twelve language pairs allow for better translation of many individual pairs. In addition to improving the translation quality of language pairs that the model was trained with, our models can also learn to perform implicit bridging between language pairs never seen explicitly during training, showing that transfer learning and zero-shot translation is possible for neural translation. Finally, we show analyses that hint at a universal interlingua representation in our models and show some interesting examples when mixing languages.
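The mechanism itself is a one-line preprocessing step; a sketch, where the exact <2xx> token format is our assumption about the paper's convention:

```python
def to_multilingual_example(src_tokens, tgt_lang):
    """Prepend an artificial token telling the shared model which target
    language to produce; nothing else about the input changes."""
    return ["<2{}>".format(tgt_lang)] + src_tokens

print(to_multilingual_example("How are you ?".split(), "fr"))
# ['<2fr>', 'How', 'are', 'you', '?']
```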
Conference Paper
The CoNLL-2013 shared task was devoted to grammatical error correction. In this paper, we give the task definition, present the data sets, and describe the evaluation metric and scorer used in the shared task. We also give an overview of the various approaches adopted by the participating teams, and present the evaluation results.
Article
Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.7 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a strong phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increased to 36.5, which beats the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence which made the optimization problem easier.
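The source-reversal trick described at the end is pure preprocessing; a sketch:

```python
def make_training_pair(src_tokens, tgt_tokens):
    """Reverse only the source side: the start of the source then sits next
    to the start of the target, adding short-term dependencies that make
    optimization easier, exactly as the abstract argues."""
    return list(reversed(src_tokens)), tgt_tokens

src, tgt = "je suis étudiant".split(), "i am a student".split()
print(make_training_pair(src, tgt))
# (['étudiant', 'suis', 'je'], ['i', 'am', 'a', 'student'])
```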
Conference Paper
We present a novel beam-search decoder for grammatical error correction. The decoder iteratively generates new hypothesis corrections from current hypotheses and scores them based on features of grammatical correctness and fluency. These features include scores from discriminative classifiers for specific error categories, such as articles and prepositions. Unlike all previous approaches, our method is able to perform correction of whole sentences with multiple and interacting errors while still taking advantage of powerful existing classifier approaches. Our decoder achieves an F1 correction score significantly higher than all previously published scores on the Helping Our Own (HOO) shared task data set.
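A generic sketch of this iterative hypothesis expansion; the toy edit proposer and scorer below merely stand in for the paper's classifier- and fluency-based features:

```python
import heapq

def beam_search_correct(sentence, propose_edits, score, beam_size=4, rounds=3):
    """Repeatedly expand each hypothesis with candidate edits and keep the
    best-scoring ones, so multiple interacting errors get fixed in turn."""
    beam = [sentence]
    for _ in range(rounds):
        pool = set(beam)
        for hyp in beam:
            pool.update(propose_edits(hyp))  # new hypotheses from this one
        beam = heapq.nlargest(beam_size, pool, key=score)
    return max(beam, key=score)

# Toy demo: propose article swaps; reward the corrected phrase.
def toy_edits(s):
    return {s.replace(" a apple", " an apple"), s.replace(" an apple", " a apple")}

toy_score = lambda s: ("an apple" in s) - 0.01 * len(s)
print(beam_search_correct("she ate a apple", toy_edits, toy_score))  # she ate an apple
```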
Conference Paper
We present a novel method for evaluating grammatical error correction. The core of our method, which we call MaxMatch (M2), is an algorithm for efficiently computing the sequence of phrase-level edits between a source sentence and a system hypothesis that achieves the highest overlap with the gold-standard annotation. This optimal edit sequence is subsequently scored using F1 measure. We test our M2 scorer on the Helping Our Own (HOO) shared task data and show that our method results in more accurate evaluation for grammatical error correction.
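A simplified stand-in for the edit-extraction step using difflib; note the real M2 scorer searches for the edit sequence that maximizes overlap with the gold annotation, whereas difflib's default alignment is only an approximation:

```python
import difflib

def edits(src, hyp):
    """Phrase-level edits that turn src into hyp, keyed by source span."""
    sm = difflib.SequenceMatcher(a=src, b=hyp)
    return {(i1, i2, tuple(hyp[j1:j2]))
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"}

src = "she go to school".split()
hyp = "she goes to school".split()   # system output
gold = "she goes to school".split()  # gold-standard correction

sys_edits, gold_edits = edits(src, hyp), edits(src, gold)
tp = len(sys_edits & gold_edits)
p = tp / len(sys_edits) if sys_edits else 1.0
r = tp / len(gold_edits) if gold_edits else 1.0
f1 = 2 * p * r / (p + r) if p + r else 0.0
print(p, r, f1)  # 1.0 1.0 1.0 on this toy pair
```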
Conference Paper
This paper describes challenges and solutions for building a successful voice search system as applied to Japanese and Korean at Google. We describe the techniques used to deal with an infinite vocabulary, how modeling completely in the written domain for the language model and dictionary can avoid some system complexity, and how we built dictionaries, language models, and acoustic models in this framework. We show how to deal with the difficulty of scoring results for multiple-script languages because of ambiguities. The development of voice search for these languages led to a significant simplification of the original process of building a system for any new language, which in part became our default process for the internationalization of voice search.
Aoife Cahill, Nitin Madnani, Joel Tetreault, and Diane Napolitano. 2013. Robust systems for preposition error correction using Wikipedia revisions. In Proceedings of NAACL, pages 507-517. Association for Computational Linguistics.
Fabian Cromieres, Chenhui Chu, Toshiaki Nakazawa, and Sadao Kurohashi. 2016. Kyoto University participation to WAT 2016. In Proceedings of the 3rd Workshop on Asian Translation (WAT2016).
A. Désilets and M. Hermet. 2009. Using automatic roundtrip translation to repair general errors in second language writing. In Proceedings of MT Summit XII.
Jennifer Foster and Øistein E. Andersen. 2009. GenERRate: Generating errors for use in grammatical error detection. In Proceedings of the Fourth Workshop on Innovative Use of NLP for Building Educational Applications, EdAppsNLP '09, pages 82-90, Stroudsburg, PA, USA. Association for Computational Linguistics.
Minh-Thang Luong and Christopher D. Manning. 2015. Stanford neural machine translation systems for spoken language domains. In International Workshop on Spoken Language Translation.
Nitin Madnani, Joel Tetreault, and Martin Chodorow. 2012. Exploring grammatical error correction with not-so-crummy machine translation. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, NAACL HLT '12, pages 44-53, Stroudsburg, PA, USA. Association for Computational Linguistics.
Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata, and Yuji Matsumoto. 2011. Mining revision log of language learning SNS for automated Japanese error correction of second language learners. In Proceedings of IJCNLP, pages 147-155.
Tao Ge, Furu Wei, and Ming Zhou. 2018b. Reaching human-level performance in automatic grammatical error correction: An empirical study. arXiv:1807.01270.
Courtney Napoles, Keisuke Sakaguchi, Matt Post, and Joel Tetreault. 2016. GLEU without tuning. arXiv:1605.02592.
Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. arXiv:1804.04235.
Sara Tonelli, Alessio Palmero Aprosio, and Francesca Saltori. 2016. Simpitiki: a simplification corpus for Italian. In Proceedings of CLiC-it.
Ziang Xie, Anand Avati, Naveen Arivazhagan, Dan Jurafsky, and Andrew Y. Ng. 2016. Neural language correction with character-based attention. arXiv:1603.09727.
Ziang Xie, Guillaume Genthial, Stanley Xie, Andrew Ng, and Dan Jurafsky. 2018. Noising and denoising natural language: Diverse backtranslation for grammar correction. In Proceedings of NAACL.