Marie-Catherine de Marneffe’s research while affiliated with The Ohio State University and other places


Publications (68)


Explanation sensitivity to the randomness of large language models: the case of journalistic text classification
  • Preprint

October 2024 · 10 Reads

Jeremie Bogaert · Marie-Catherine de Marneffe · [...]
Large language models (LLMs) perform very well on several natural language processing tasks but raise explainability challenges. In this paper, we examine the effect of random elements in the training of LLMs on the explainability of their predictions. We do so on a task of opinionated journalistic text classification in French. Using a fine-tuned CamemBERT model and an explanation method based on relevance propagation, we find that training with different random seeds produces models with similar accuracy but variable explanations. We therefore claim that characterizing the explanations' statistical distribution is needed for the explainability of LLMs. We then explore a simpler model based on textual features which offers stable explanations but is less accurate. Hence, this simpler model corresponds to a different tradeoff between accuracy and explainability. We show that it can be improved by inserting features derived from CamemBERT's explanations. We finally discuss new research directions suggested by our results, in particular regarding the origin of the observed sensitivity to training randomness.
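As a rough, hypothetical illustration of what characterizing the explanations' statistical distribution could look like (this is not the authors' code, and the attribution scores below are synthetic), one can measure how token-level relevance scores vary across models trained with different random seeds:

    # Sketch: quantify how much token-level relevance explanations vary across
    # models fine-tuned with different random seeds (synthetic scores for illustration).
    import numpy as np
    from itertools import combinations
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)
    n_seeds, n_tokens = 5, 12
    # Hypothetical relevance scores: one row per seed, one column per input token.
    attributions = rng.normal(size=(n_seeds, n_tokens))

    # Per-token spread across seeds: large values mean the explanation is unstable.
    per_token_std = attributions.std(axis=0)

    # Pairwise Spearman correlation of token rankings between seeds:
    # values near 1 mean different seeds highlight the same tokens in the same order.
    pairwise_rho = [spearmanr(attributions[i], attributions[j])[0]
                    for i, j in combinations(range(n_seeds), 2)]

    print("mean per-token std:", round(float(per_token_std.mean()), 3))
    print("mean pairwise Spearman rho:", round(float(np.mean(pairwise_rho)), 3))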



Figures (from the paper below): (1) data collection interface; (2) average distributions from the normalized multilabel responses for each item, showing that label variation is widespread even for items with unanimous agreement in MNLI; (3) entropy of the unigram distributions in the explanations for each item and label (error bars show variability across items); (4) proportion of highlighted words that are also mentioned in the explanation.
Understanding and Predicting Human Label Variation in Natural Language Inference through Explanation
  • Preprint
  • File available

April 2023 · 48 Reads

Human label variation (Plank 2022), or annotation disagreement, exists in many natural language processing (NLP) tasks. To be robust and trusted, NLP models need to identify such variation and be able to explain it. To this end, we created the first ecologically valid explanation dataset with diverse reasoning, LiveNLI. LiveNLI contains annotators' highlights and free-text explanations for the label(s) of their choice for 122 English Natural Language Inference items, each with at least 10 annotations. We used its explanations for chain-of-thought prompting, and found there is still room for improvement in GPT-3's ability to predict label distribution with in-context learning.
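As a minimal sketch of the kind of comparison involved (not code from the paper; the vote counts and model probabilities are invented), a per-item human label distribution can be built from the annotations and compared against a model's predicted distribution:

    # Sketch: compare a human NLI label distribution with a model's prediction.
    import numpy as np
    from scipy.stats import entropy

    LABELS = ["entailment", "neutral", "contradiction"]

    # Hypothetical: 10 annotators labeled one premise/hypothesis pair.
    votes = {"entailment": 4, "neutral": 5, "contradiction": 1}
    human_dist = np.array([votes[l] for l in LABELS], dtype=float)
    human_dist /= human_dist.sum()

    # Hypothetical model distribution (e.g., averaged over several prompted samples).
    model_dist = np.array([0.70, 0.25, 0.05])

    # Entropy measures how much annotators disagree on this item;
    # KL(human || model) is lower when the model matches the human distribution.
    print("human entropy:", round(float(entropy(human_dist)), 3))
    print("KL(human || model):", round(float(entropy(human_dist, model_dist)), 3))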



Figures (from the paper below): (1) ChaosNLI annotations of the 450 sampled items, each stacked bar showing an item's vote counts per label, with horizontal lines marking 80 votes; (2) boxplots of annotation entropy by predicted label, from the original 5 MNLI annotations and from the 100 ChaosNLI annotations (triangles mark means; p-values from two-sided Mann-Whitney tests); (3) for each disagreement category, the percentage of ChaosNLI items with converging annotations (>80 majority vote) versus the percentage predicted as Complicated in the 4-way setup or as having more than one label in the multilabel setup.
Investigating Reasons for Disagreement in Natural Language Inference

December 2022 · 58 Reads · 28 Citations

Transactions of the Association for Computational Linguistics

We investigate how disagreement in natural language inference (NLI) annotation arises. We developed a taxonomy of disagreement sources with 10 categories spanning 3 high-level classes. We found that some disagreements are due to uncertainty in the sentence meaning, others to annotator biases and task artifacts, leading to different interpretations of the label distribution. We explore two modeling approaches for detecting items with potential disagreement: a 4-way classification with a “Complicated” label in addition to the three standard NLI labels, and a multilabel classification approach. We found that the multilabel classification is more expressive and gives better recall of the possible interpretations in the data.
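As a minimal sketch of how the two setups differ at prediction time (illustrative only; the scores are invented and the actual systems are fine-tuned classifiers):

    # Sketch: 4-way classification vs. multilabel classification for NLI disagreement.
    import numpy as np

    NLI_LABELS = ["entailment", "neutral", "contradiction"]

    # 4-way setup: softmax over the three NLI labels plus "Complicated";
    # exactly one label is predicted per item.
    four_way_probs = np.array([0.15, 0.20, 0.10, 0.55])  # last entry = "Complicated"
    four_way_pred = (NLI_LABELS + ["Complicated"])[int(four_way_probs.argmax())]

    # Multilabel setup: independent sigmoid scores per NLI label; every label whose
    # score clears a threshold is kept, so an item can receive more than one label.
    multilabel_scores = np.array([0.62, 0.58, 0.08])
    multilabel_pred = [l for l, s in zip(NLI_LABELS, multilabel_scores) if s >= 0.5]

    print("4-way prediction:", four_way_pred)          # -> Complicated
    print("multilabel prediction:", multilabel_pred)   # -> ['entailment', 'neutral']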


Figure and table (from the preprint below): boxplots of annotation entropy by predicted label, from the original 5 MNLI annotations and from the 100 ChaosNLI annotations (as in Figure 2 above); multilabel classification performance on the full test set and on the two test-set subsets, where the Orig subset contains no items with all three labels present.
Investigating Reasons for Disagreement in Natural Language Inference

September 2022 · 25 Reads

We investigate how disagreement in natural language inference (NLI) annotation arises. We developed a taxonomy of disagreement sources with 10 categories spanning 3 high-level classes. We found that some disagreements are due to uncertainty in the sentence meaning, others to annotator biases and task artifacts, leading to different interpretations of the label distribution. We explore two modeling approaches for detecting items with potential disagreement: a 4-way classification with a "Complicated" label in addition to the three standard NLI labels, and a multilabel classification approach. We found that the multilabel classification is more expressive and gives better recall of the possible interpretations in the data.




Figure 1: Fitted probabilities of the true expected inference category being predicted by each item's label, as given by the ordered logistic regression model, organized by signature and polarity. Some verb-frame examples with mean probability below 0.5 are labeled.
He Thinks He Knows Better than the Doctors: BERT for Event Factuality Fails on Pragmatics

October 2021 · 52 Reads · 14 Citations

Transactions of the Association for Computational Linguistics

We investigate how well BERT performs on predicting factuality in several existing English datasets, encompassing various linguistic constructions. Although BERT obtains a strong performance on most datasets, it does so by exploiting common surface patterns that correlate with certain factuality labels, and it fails on instances where pragmatic reasoning is necessary. Contrary to what the high performance suggests, we are still far from having a robust system for factuality prediction.
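For context, event factuality is typically annotated on a [−3, 3] certainty scale (−3: certainly did not happen, 3: certainly happened) and evaluated as regression. The sketch below (not the paper's code; gold and predicted values are invented) shows the usual MAE and correlation metrics and why strong aggregate scores can hide failures on individual, pragmatically loaded items:

    # Sketch: evaluating factuality predictions on a [-3, 3] scale.
    import numpy as np
    from scipy.stats import pearsonr

    gold = np.array([3.0, 3.0, -3.0, 0.0, 2.25])  # hypothetical annotator scores
    pred = np.array([2.8, 3.0, -1.2, 0.4, 2.0])   # hypothetical model outputs

    mae = float(np.abs(gold - pred).mean())
    r, _ = pearsonr(gold, pred)

    # Aggregate numbers can look fine even though the model badly misses the one
    # item whose gold score depends on pragmatic reasoning (the -3.0 above).
    print(f"MAE: {mae:.2f}  Pearson r: {r:.2f}")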


Table (from the preprint below): performance on the test sets under different BERT training setups; the best score obtained by the models for each dataset and metric is marked with †, and the overall best scores are highlighted.
He Thinks He Knows Better than the Doctors: BERT for Event Factuality Fails on Pragmatics

July 2021 · 19 Reads

We investigate how well BERT performs on predicting factuality in several existing English datasets, encompassing various linguistic constructions. Although BERT obtains a strong performance on most datasets, it does so by exploiting common surface patterns that correlate with certain factuality labels, and it fails on instances where pragmatic reasoning is necessary. Contrary to what the high performance suggests, we are still far from having a robust system for factuality prediction.


Citations (56)


... With good word embeddings, nouns denoting larger objects are expected to have higher projection scores. The validity of deriving adjectival scales from word representations is shown by Kim and de Marneffe (2013). ...

Reference:

Visual Commonsense in Pretrained Unimodal and Multimodal Models
Deriving Adjectival Scales from Continuous Space Word Representations
  • Citing Conference Paper
  • January 2013

... A study with a similar aim is presented by Chen et al. (2024). They explore whether the softmax probability distributions elicited with a few explanations from two LLMs (Mixtral and Llama 3) can approximate human judgement distributions (HJD) on the ChaosNLI and VariErr NLI (Weber-Genzel et al., 2024) benchmarks. Similarly, Baan et al. (2024) compare human and model judgement distributions on ChaosNLI, finding that model confidences fail to capture the human distributions. ...

VariErr NLI: Separating Annotation Error from Human Label Variation
  • Citing Conference Paper
  • January 2024

... To the best of our knowledge, no such dataset exists yet. (ii) A new methodology to detect errors: we collect multiple annotations, where each label comes with an ecologically valid explanation inspired by Jiang et al. (2023), and propose to pair them with validity judgments to identify errors. (iii) Finally, we benchmark existing AED methods and GPTs in a challenging setup, where the task is to tease apart error from plausible human label variation. ...

Ecologically Valid Explanations for Label Variation in NLI
  • Citing Conference Paper
  • January 2023

... This is likely due to the additional selection and ranking steps implemented by Whistely et al. (2022) and the lack thereof in the LS system provided by North et al. (2022a) (Section 3.2). Wilkens et al. (2022) likewise experimented with a range of monolingual transformers for SG. They employed an ensemble of BERT-like models with three distinct masking strategies: 1) copy, 2) query expansion, and 3) paraphrase. ...

CENTAL at TSAR-2022 Shared Task: How Does Context Impact BERT-Generated Substitutions for Lexical Simplification?
  • Citing Conference Paper
  • January 2022

... Santy et al. (2023) and Forbes et al. (2020) explore annotator disagreement in safety, looking specifically at how morality and toxicity judgments vary across users of different backgrounds. Prior works have analyzed disagreements in NLI (Pavlick & Kwiatkowski, 2019; Liu et al., 2023), and Jiang & de Marneffe (2022) develop an NLI-specific taxonomy of disagreement causes. Sandri et al. (2023) similarly explore annotator disagreements in toxicity detection and develop a taxonomy of disagreement causes for their task. ...

Investigating Reasons for Disagreement in Natural Language Inference

Transactions of the Association for Computational Linguistics

... Generated Text Detection The burgeoning progress in the generation capabilities of large language models has led to a corresponding increase in research and development efforts in the field of detection. Several recent efforts look at methods, varying from simple feature-based classifiers to fine-tuned language model-based detectors, in order to classify whether a piece of input text is human-written or AI-generated (Ippolito et al., 2019; Gehrmann et al., 2019; Mitchell et al., 2023), along with methods that specifically focus on AI-generated news (Zellers et al., 2019; Bogaert et al., 2022; Kumarage et al., 2023). A related direction of work is that of authorship attribution (AA). ...

Automatic and Manual Detection of Generated News: Case Study, Limitations and Challenges
  • Citing Conference Paper
  • June 2022

... We frame belief prediction as a regression task, following previous literature [9,17,18,19]. Given an utterance and a sequence of corresponding features S = [f1, f2, ..., fn], we wish to produce a value ŷ ∈ [−3, 3] that is as close to the annotation value y as possible. ...

He Thinks He Knows Better than the Doctors: BERT for Event Factuality Fails on Pragmatics

Transactions of the Association for Computational Linguistics

... In other words, the world is not just black and white. Human label variation (HLV, as termed by Plank 2022) has been shown on a wide range of NLP tasks (de Marneffe et al., 2012; Plank et al., 2014; Aroyo and Welty, 2015), including in natural language inference (NLI; Pavlick and Kwiatkowski 2019; Zhang and de Marneffe 2021). NLI involves determining whether a hypothesis is true (Entailment), false (Contradiction), or neither (Neutral), assuming the truth of a given premise; see Figure 1 for an example with plausible labels. ...

Identifying inherent disagreement in natural language inference

... This task falls within the scope known as enhanced dependencies (ED). EDs were first developed for English (Schuster & Manning, 2016), then generalized to other languages through the Universal Dependencies (UD) approach (Nivre et al., 2018; de Marneffe et al., 2021), and instantiated for Portuguese (Pagano et al., 2023). ...

Universal Dependencies

Computational Linguistics

... Yeomans et al. [174] develop a package in R that provides functions to extract politeness markers in the English language, graphically compare these markers to covariates of interest, develop a supervised model to identify politeness in new documents, and inspect high- and low-politeness documents. Aljanaideh et al. [3] model the problem of politeness detection in natural language requests as a clustering task. The authors create a set of clusters for every word in the request. ...

Contextualized Embeddings for Enriching Linguistic Analyses on Politeness