Simon Tong’s scientific contributions


Publications (2)


Corpora Generation for Grammatical Error Correction
  • Preprint · April 2019 · 72 Reads

Jared Lichtarge · Chris Alberti · [...] · Simon Tong

Grammatical Error Correction (GEC) has been recently modeled using the sequence-to-sequence framework. However, unlike sequence transduction problems such as machine translation, GEC suffers from the lack of plentiful parallel data. We describe two approaches for generating large parallel datasets for GEC using publicly available Wikipedia data. The first method extracts source-target pairs from Wikipedia edit histories with minimal filtration heuristics, while the second method introduces noise into Wikipedia sentences via round-trip translation through bridge languages. Both strategies yield similarly sized parallel corpora containing around 4B tokens. We employ an iterative decoding strategy that is tailored to the loosely supervised nature of our constructed corpora. We demonstrate that neural GEC models trained using either type of corpus give similar performance. Fine-tuning these models on the Lang-8 corpus and ensembling allows us to surpass the state of the art on both the CoNLL-2014 benchmark and the JFLEG task. We provide systematic analysis that compares the two approaches to data generation and highlights the effectiveness of ensembling.
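The second data-generation approach in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `translate` is a hypothetical stand-in for a real NMT system (here a toy substitution table makes the example runnable), and the noise it injects on the return leg mimics the drift that real round-trip translation introduces.

```python
# Sketch: generating a (noisy source, clean target) pair for GEC training
# via round-trip translation through a bridge language.

# Toy confusion table standing in for translation drift (hypothetical).
NOISE_TABLE = {
    "their": "there",
    "received": "recieved",
}

def translate(sentence: str, src: str, tgt: str) -> str:
    """Hypothetical MT call. A real pipeline would invoke an NMT model;
    this stub injects noise only on the return leg (back into English)."""
    if tgt == "en":
        return " ".join(NOISE_TABLE.get(w, w) for w in sentence.split())
    return sentence  # outbound leg: pretend a faithful translation

def round_trip(sentence: str, bridge: str = "fr") -> tuple[str, str]:
    """Round-trip a clean sentence through a bridge language, yielding a
    (noisy_source, clean_target) training pair for a GEC model."""
    bridged = translate(sentence, "en", bridge)
    noisy = translate(bridged, bridge, "en")
    return noisy, sentence

pair = round_trip("they received their award")
# pair[0] is the artificially noised source; pair[1] the clean target.
```

At scale, running this over clean Wikipedia sentences produces the pseudo-parallel corpus the abstract describes, with the clean sentence as the correction target.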


Citations (1)


... Zhao et al. [10] used a large-scale unlabeled corpus to pre-train a denoising autoencoder [11,12] with a copy mechanism for the Transformer model [13], achieving results close to those of Ge et al. [14], whose word-level and sentence-level multi-task learning methods used only publicly available error-corrected parallel corpora. Building on the idea of round-trip translation, Lichtarge et al. [15] used Wikipedia data to generate a large number of pseudo-parallel sentence pairs, pre-training a forward grammatical error-correction model by round-tripping sentences through an intermediate bridge language. ...

Reference:

Establishment of machine translation model based on asymmetric Encoder-Decoder structure and its application in English grammar error detection
Corpora Generation for Grammatical Error Correction
  • Citing Conference Paper · January 2019