Anastasia Shimorina's research while affiliated with French National Centre for Scientific Research and other places

Publications (18)

Article
Full-text available
The metrics standardly used to evaluate Natural Language Generation (NLG) models, such as BLEU or METEOR, fail to provide information on which linguistic factors impact performance. Focusing on Surface Realization (SR), the task of converting an unordered dependency tree into a well-formed sentence, we propose a framework for error analysis which ...
Preprint
Full-text available
This paper introduces the Human Evaluation Datasheet, a template for recording the details of individual human evaluation experiments in Natural Language Processing (NLP). Originally taking inspiration from seminal papers by Bender and Friedman (2018), Mitchell et al. (2019), and Gebru et al. (2020), the Human Evaluation Datasheet is intended to facilitate ...
Preprint
Full-text available
Against the background of what has been termed a reproducibility crisis in science, the NLP field is becoming increasingly interested in, and conscientious about, the reproducibility of its results. The past few years have seen an impressive range of new initiatives, events and active research in the area. However, the field is far from reaching a...
Thesis
Natural language generation is the process of generating a natural language text from some input. This input can be texts, documents, images, tables, knowledge graphs, databases, dialogue acts, meaning representations, etc. Recent methods in natural language generation, mostly based on neural modelling, have yielded significant improvements in the field ...
Preprint
Full-text available
We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. However, due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established ...
Preprint
This paper discusses two existing approaches to the correlation analysis between automatic evaluation metrics and human scores in the area of natural language generation. Our experiments show that depending on whether a system- or sentence-level correlation analysis is used, correlation results between automatic scores and human judgments are inconsistent ...
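To make the distinction concrete, here is a minimal sketch of the two correlation designs (not the paper's code; the scores below are invented): sentence-level analysis pools all (metric, human) score pairs across systems, while system-level analysis first averages each system's scores and correlates the aggregates.

```python
# Minimal sketch (invented scores, not the paper's code) contrasting
# sentence-level and system-level correlation between an automatic
# metric and human judgments.
from statistics import mean
from scipy.stats import pearsonr

scores = {  # system -> list of (metric_score, human_score) per sentence
    "sys_a": [(0.62, 4.0), (0.40, 3.0), (0.75, 5.0)],
    "sys_b": [(0.55, 4.5), (0.30, 2.5), (0.70, 4.0)],
    "sys_c": [(0.20, 2.0), (0.35, 3.5), (0.50, 3.0)],
}

# Sentence-level: correlate over all individual sentences, pooled.
pairs = [p for sents in scores.values() for p in sents]
sent_r, _ = pearsonr([m for m, _ in pairs], [h for _, h in pairs])

# System-level: average each system's scores first, then correlate
# over the (much smaller) set of systems.
sys_metric = [mean(m for m, _ in sents) for sents in scores.values()]
sys_human = [mean(h for _, h in sents) for sents in scores.values()]
sys_r, _ = pearsonr(sys_metric, sys_human)

print(f"sentence-level r = {sent_r:.2f}, system-level r = {sys_r:.2f}")
```

With only a handful of systems, the system-level coefficient is computed over very few points, which is one reason the two designs can disagree.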
Conference Paper
Full-text available
The ModelWriter platform provides a generic framework for automated traceability analysis. In this paper, we demonstrate how this framework can be used to trace the consistency and completeness of technical documents that consist of a set of System Installation Design Principles used by Airbus to ensure the correctness of aircraft system installation ...
Article
Full-text available
We propose a new sentence simplification task (Split-and-Rephrase) where the aim is to split a complex sentence into a meaning-preserving sequence of shorter sentences. Like sentence simplification, splitting-and-rephrasing has the potential to benefit both natural language processing and societal applications. Because shorter sentences are generally ...
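To make the task concrete, here is a hypothetical input/output pair (invented for illustration; not taken from the paper's dataset):

```python
# Hypothetical Split-and-Rephrase pair (invented for illustration):
# a single complex sentence is mapped to a meaning-preserving
# sequence of shorter sentences.
complex_sentence = (
    "Alan Bean, who was born in Wheeler, Texas, served as a crew member "
    "of the NASA-operated Apollo 12 mission."
)
rephrased = [
    "Alan Bean was born in Wheeler, Texas.",
    "He served as a crew member of the Apollo 12 mission.",
    "Apollo 12 was operated by NASA.",
]
```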

Citations

... Automatic metrics do not capture all the nuances of text simplification, and human evaluation is costly and difficult to reproduce. Still, the research community is putting a lot of effort into systematising ATS evaluation and making manual evaluation as reliable as possible [8,9]. Moreover, it is worth noting that most work on ATS focuses exclusively on English. ...
... The situation memorably caricatured by Pedersen (2008) still happens all the time: you download some code you read about in a paper and liked the sound of, you run it on the data provided, only to find that the results are not the same as reported in the paper; in fact, they are likely to be worse (Belz et al., 2021a). When both data and code are provided, the number of potential causes of such differences is limited, and the NLP field has shared increasingly detailed information about systems, dependencies and evaluation to chase down sources of differences. ...
... For SQuAD QG and MSMARCO NLG, we use the original evaluation scripts provided by Du et al. (2017) and Bajaj et al. (2016), respectively. For WebNLG-en and CommonGen, we use the versions from the GEM benchmark (Gehrmann et al., 2021) and report using the GEM evaluation framework. Those scripts mainly differ in text tokenization methods. ...
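As a toy illustration of why tokenization differences matter (invented strings, not the actual evaluation scripts), the unigram overlap that BLEU-style metrics build on changes substantially once punctuation handling differs between hypothesis and reference:

```python
import re

# Toy illustration (invented strings, not the actual evaluation scripts):
# the same hypothesis/reference pair yields very different unigram
# overlap depending on how punctuation is tokenized.
hyp = "a well-formed sentence, isn't it?"
ref = "a well - formed sentence , is n't it ?"  # reference is pre-tokenized

def whitespace_tok(s):
    return s.split()

def punct_split_tok(s):
    # Split off punctuation marks and hyphens as separate tokens.
    return re.findall(r"\w+|[^\w\s]", s)

for name, tok in (("whitespace", whitespace_tok), ("punct-split", punct_split_tok)):
    h, r = tok(hyp), tok(ref)
    matches = sum(min(h.count(t), r.count(t)) for t in set(h))
    print(f"{name}: unigram precision = {matches}/{len(h)}")
# whitespace: 1/5, punct-split: 8/11 -- tokenization alone moves the score.
```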
... There has been some effort to automate this process. For example, Shimorina et al. (2021) describe an automatic error analysis procedure for shallow surface realisation, and Stevens-Guille et al. (2020) automate the detection of repetitions, omissions, and hallucinations. However, for many NLG tasks, this kind of automation is still out of reach, given the wide range of possible correct outputs that are available in language generation tasks. ...
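As a rough sketch of what such automation can look like (a minimal string-matching illustration, not the cited authors' methods; the slot values and output below are invented): omissions can be flagged as input values missing from the output, repetitions as duplicated n-grams; hallucination detection would need the inverse check, i.e. content in the output with no support in the input.

```python
import re

# Minimal string-matching sketch (not the cited methods; data invented).
# Assumes the input is a set of slot values that should each surface
# exactly once in the generated text.
def analyse(slot_values, generated):
    text = generated.lower()
    omissions = [v for v in slot_values if v.lower() not in text]
    tokens = re.findall(r"\w+", text)
    bigrams = list(zip(tokens, tokens[1:]))
    repetitions = {b for b in bigrams if bigrams.count(b) > 1}
    return omissions, repetitions

slots = ["Blue Spice", "riverside", "coffee shop"]
output = "Blue Spice is a coffee shop. It is a coffee shop in the city."
omitted, repeated = analyse(slots, output)
print(omitted)   # ['riverside']
print(repeated)  # {('is', 'a'), ('a', 'coffee'), ('coffee', 'shop')}
```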
... Yu et al. (2019) showed that, for their system, error rates correlate with word order freedom, and reported linearization error rates for some frequent dependency types. In a similar vein, Shimorina and Gardent (2019) looked at their system performance in terms of dependency relations, which shed light on the differences between their non-delexicalized and delexicalized models. ...
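A sketch of that kind of per-relation breakdown (illustrative only; the relations and judgements below are invented, not the cited systems' outputs):

```python
from collections import Counter

# Illustrative per-dependency-relation error breakdown (data invented):
# given (relation, correctly_linearized) judgements for individual
# dependents, report an error rate per relation.
observations = [
    ("nsubj", True), ("nsubj", True), ("nsubj", False),
    ("advmod", False), ("advmod", False), ("advmod", True),
    ("amod", True), ("amod", True),
]

totals, errors = Counter(), Counter()
for rel, correct in observations:
    totals[rel] += 1
    if not correct:
        errors[rel] += 1

for rel, n in totals.items():
    print(f"{rel}: error rate {errors[rel] / n:.2f} over {n} cases")
```

Freer-order modifiers such as advmod would then surface with higher linearization error rates than rigid relations, consistent with the reported correlation between error rate and word order freedom.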
... (Supervised) neural data-to-text NLG involves the collection of parallel data-text datasets, aligning data and linguistic realisations of these data. However, collecting these datasets is difficult because sets of texts and corresponding data are not a common natural occurrence (Shimorina, Khasanova, & Gardent, 2019). On the other hand, unpaired texts and data are significantly more common and easily collected (Qader, Portet, & Labbé, 2019). ...
... These placeholder tokens are later replaced with tokens copied from the input data instance [71]. In comparison to copy-based methods for handling rare entities, delexicalization has been shown to yield better results on constrained datasets [108]. From the notion that delexicalization of the data instance may cause the loss of vital information that can aid seq2seq models in sentence planning, and that some data instance slots may even be deemed non-delexicalizable [38], Nayak et al. ...
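A minimal sketch of the delexicalize/relexicalize round trip (illustrative only; the slot names and text are invented, not the cited systems):

```python
# Minimal delexicalization sketch (data invented, not the cited systems):
# entity values are replaced by placeholder tokens before training, and
# placeholders in the generated text are filled back in afterwards.
def delexicalize(text, slots):
    for slot, value in slots.items():
        text = text.replace(value, f"<{slot}>")
    return text

def relexicalize(text, slots):
    for slot, value in slots.items():
        text = text.replace(f"<{slot}>", value)
    return text

slots = {"name": "Blue Spice", "area": "riverside"}
template = delexicalize("Blue Spice is located in the riverside area.", slots)
print(template)                       # <name> is located in the <area> area.
print(relexicalize(template, slots))  # Blue Spice is located in the riverside area.
```

The lossiness mentioned above is visible even here: once "riverside" becomes <area>, a model can no longer condition on the value itself, which is what makes some slots effectively non-delexicalizable.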
... Conditional language generation aims to generate natural language text based on input context, and includes many useful and hard tasks such as abstractive summarization (Mani, 2001; Nenkova and McKeown, 2011), generative question answering (Bajaj et al., 2016), question generation (Zhou et al., 2017), and data-to-text (Wiseman et al., 2017; Gardent et al., 2017) tasks. Pretraining large Transformer encoder-decoder models and fine-tuning them on downstream tasks is the common paradigm to address these tasks (Raffel et al., 2020; Lewis et al., 2019; Tay et al., 2022; Zhang et al., 2019a). ...
... The CRF is trained on a set of manual alignments of complex-simple articles. In [Narayan et al., 2017], which we detail in Chapter 3, a sentence-split sample is composed of a complex sentence C aligned with two consecutive simple sentences, denoted S1 and S2. They extract these samples from different temporal snapshots of EW (English Wikipedia) by matching sentences where C and S1 start with the same trigram and C and S2 end with the same trigram. ...
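A sketch of that trigram-matching heuristic as described in the excerpt (sentences invented; not the authors' code):

```python
# Sketch of the trigram-matching heuristic described above (data invented):
# C is paired with consecutive simple sentences S1 and S2 when C and S1
# share their first trigram and C and S2 share their last trigram.
def edge_trigrams(sentence):
    toks = sentence.lower().split()
    return tuple(toks[:3]), tuple(toks[-3:])

def is_split_pair(c, s1, s2):
    c_first, c_last = edge_trigrams(c)
    return edge_trigrams(s1)[0] == c_first and edge_trigrams(s2)[1] == c_last

c = "The film was directed by John Smith and was released in 1999 ."
s1 = "The film was directed by John Smith ."
s2 = "It was released in 1999 ."
print(is_split_pair(c, s1, s2))  # True
```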
... The paradigm shift introduced by model-driven development (MDD) [5,6], in which the focus changes from code to models, raises the level of abstraction and promotes software development for various application domains (e.g. [7][8][9][10][11][12][13]). Moreover, domain-specific languages (DSLs) / domain-specific modeling languages (DSMLs) [14][15][16][17][18], which have notations and constructs tailored toward a particular application domain, assist developers during the execution of MDD processes by providing first a user-friendly syntax for modeling systems (mostly in a visual manner) and then a translational semantics for generating application software and any other artifacts automatically [19]. ...