Adam Poliak’s research while affiliated with Johns Hopkins University and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (37)


Biases in Large Language Model-Elicited Text: A Case Study in Natural Language Inference
  • Preprint

March 2025 · 1 Read

Grace Proebsting · Adam Poliak

We test whether NLP datasets created with Large Language Models (LLMs) contain annotation artifacts and social biases like NLP datasets elicited from crowdsource workers. We recreate a portion of the Stanford Natural Language Inference corpus using GPT-4, Llama-2 70b Chat, and Mistral 7b Instruct. We train hypothesis-only classifiers to determine whether LLM-elicited NLI datasets contain annotation artifacts. Next, we use pointwise mutual information to identify the words in each dataset that are associated with gender, race, and age-related terms. On our LLM-generated NLI datasets, fine-tuned BERT hypothesis-only classifiers achieve 86-96% accuracy. Our analyses further characterize the annotation artifacts and stereotypical biases in LLM-generated datasets.
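
The bias analysis in this abstract rests on a simple statistic. As a rough illustration (not the authors' released code), pointwise mutual information between each hypothesis word and a set of demographic terms can be computed as follows; the function name and term list are hypothetical.

```python
import math
from collections import Counter

def pmi_by_term_group(hypotheses, group_terms):
    """Rank words by PMI with a group of demographic terms.

    hypotheses:  list of tokenized, lowercased hypotheses (lists of words)
    group_terms: set of terms (e.g., gendered words) defining the class
    """
    n = len(hypotheses)
    word_counts = Counter()   # sentences containing each word
    joint_counts = Counter()  # sentences containing the word AND a group term
    n_group = 0               # sentences containing any group term

    for tokens in hypotheses:
        vocab = set(tokens)
        has_group = bool(vocab & group_terms)
        n_group += has_group
        for w in vocab:
            word_counts[w] += 1
            if has_group:
                joint_counts[w] += 1

    if n_group == 0:
        raise ValueError("no hypothesis contains a group term")

    p_group = n_group / n
    scores = {}
    for w, count in word_counts.items():
        p_word = count / n
        p_joint = joint_counts[w] / n
        if p_joint > 0:  # PMI is undefined for words never seen with the group
            scores[w] = math.log(p_joint / (p_word * p_group))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative call with a hypothetical gendered-term list:
# top_words = pmi_by_term_group(tokenized_hypotheses, {"he", "she", "man", "woman"})
```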


Hypothesis-only Biases in Large Language Model-Elicited Natural Language Inference

October 2024 · 1 Read

We test whether replacing crowdsource workers with LLMs to write Natural Language Inference (NLI) hypotheses similarly results in annotation artifacts. We recreate a portion of the Stanford NLI corpus using GPT-4, Llama-2, and Mistral 7b, and train hypothesis-only classifiers to determine whether LLM-elicited hypotheses contain annotation artifacts. On our LLM-elicited NLI datasets, BERT-based hypothesis-only classifiers achieve 86-96% accuracy, indicating these datasets contain hypothesis-only artifacts. We also find frequent "give-aways" in LLM-generated hypotheses, e.g., the phrase "swimming in a pool" appears in more than 10,000 contradictions generated by GPT-4. Our analysis provides empirical evidence that well-attested biases in NLI can persist in LLM-generated data.
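
For readers unfamiliar with hypothesis-only baselines, the following is a minimal sketch of the idea, assuming a Hugging Face setup: fine-tune BERT on the hypothesis alone, never showing it the premise, so any accuracy above the majority-class baseline must come from artifacts in the hypotheses. File and field names are placeholders, not the authors' code.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # entailment / neutral / contradiction

# Hypothetical JSONL files with a "hypothesis" string and an integer "label".
data = load_dataset("json", data_files={"train": "llm_nli_train.jsonl",
                                        "test": "llm_nli_test.jsonl"})

def encode(batch):
    # Crucially, only the hypothesis is encoded; the premise is never shown.
    return tokenizer(batch["hypothesis"], truncation=True, max_length=128)

data = data.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hypothesis-only",
                           num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    tokenizer=tokenizer,  # enables dynamic padding of batches
)
trainer.train()
# Add a compute_metrics function to report accuracy; scores far above the
# majority-class baseline indicate hypothesis-only artifacts.
```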


Figure 1: Number of times a model changes its predictions from correct to incorrect (left blue bar) or incorrect to correct (right red bar).
Figure 2: Percentage of examples with one paraphrased sentence (left blue bar) or two paraphrased sentences (right red bar) where models' predictions change.
Figure 3: Percentage of examples where models' predictions change when the gold label is entailed (left blue bar) or not-entailed (right red bar).
Figure 5: Percentage of examples where models change their predictions, broken down by the examples' sources. We omit BoW and BiLSTM for space.
Figure 6: Types of errors made by the paraphraser model.
Evaluating Paraphrastic Robustness in Textual Entailment Models
  • Preprint
  • File available

June 2023 · 46 Reads · 1 Citation

Dhruv Verma · Shreyashee Sinha · [...] · Adam Poliak

We present PaRTE, a collection of 1,126 pairs of Recognizing Textual Entailment (RTE) examples to evaluate whether models are robust to paraphrasing. We posit that if RTE models understand language, their predictions should be consistent across inputs that share the same meaning. We use the evaluation set to determine whether RTE models' predictions change when examples are paraphrased. In our experiments, contemporary models change their predictions on 8-16% of paraphrased examples, indicating that there is still room for improvement.
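
The evaluation protocol reduces to a small loop: score each original pair and its paraphrase, then report the fraction of flipped predictions. The sketch below illustrates that protocol; `model_predict` and the field names are assumptions, not the released PaRTE harness.

```python
def prediction_change_rate(model_predict, examples):
    """Fraction of RTE examples whose predicted label flips under paraphrase.

    model_predict: callable (premise, hypothesis) -> label
    examples: iterable of dicts with 'premise', 'hypothesis',
              'paraphrased_premise', and 'paraphrased_hypothesis' keys
    """
    changed, total = 0, 0
    for ex in examples:
        original = model_predict(ex["premise"], ex["hypothesis"])
        paraphrased = model_predict(ex["paraphrased_premise"],
                                    ex["paraphrased_hypothesis"])
        changed += original != paraphrased
        total += 1
    return changed / total  # the paper reports 8-16% for contemporary models
```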



Discovering changes in birthing narratives during COVID-19

April 2022 · 24 Reads

We investigate whether, and if so how, birthing narratives written by new parents on Reddit changed during COVID-19. Our results indicate that the presence of family members significantly decreased and themes related to induced labor significantly increased in the narratives during COVID-19. Our work builds upon recent research that analyzes how new parents use Reddit to describe their birthing experiences.



Accuracy of different models on our datasets focusing on similes, metaphors, and irony.
Figurative Language in Recognizing Textual Entailment

June 2021 · 107 Reads

We introduce a collection of recognizing textual entailment (RTE) datasets focused on figurative language. We leverage five existing datasets annotated for a variety of figurative language -- simile, metaphor, and irony -- and frame them into over 12,500 RTE examples. We evaluate how well state-of-the-art models trained on popular RTE datasets capture different aspects of figurative language. Our results and analyses indicate that these models might not sufficiently capture figurative language, struggling to perform pragmatic inference and reasoning about world knowledge. Ultimately, our datasets provide a challenging testbed for evaluating RTE models.
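
To make the framing concrete, here is a hedged sketch of how a figurative sentence can be posed as an RTE pair and scored with an off-the-shelf NLI model; the example pair and model choice are illustrative, not drawn from the released datasets.

```python
from transformers import pipeline

# An off-the-shelf MNLI-trained model; its label set is
# ENTAILMENT / NEUTRAL / CONTRADICTION.
nli = pipeline("text-classification", model="roberta-large-mnli")

# A hypothetical simile framed as an RTE pair (not taken from the datasets):
# resolving the figurative meaning should yield an entailment prediction.
premise = "He was as brave as a lion."
hypothesis = "He was brave."

result = nli({"text": premise, "text_pair": hypothesis})
print(result)  # e.g. [{'label': 'ENTAILMENT', 'score': ...}]
```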


Fine-Tuning Transformers for Identifying Self-Reporting Potential Cases and Symptoms of COVID-19 in Tweets

April 2021 · 16 Reads

We describe our straightforward approach for Tasks 5 and 6 of the 2021 Social Media Mining for Health Applications (SMM4H) shared tasks. Our system is based on fine-tuning DistilBERT on each task, as well as first fine-tuning the model on the other task. We explore how much fine-tuning is necessary for accurately classifying tweets as containing self-reported COVID-19 symptoms (Task 5) or whether a tweet related to COVID-19 is self-reporting, non-personal reporting, or a literature/news mention of the virus (Task 6).
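
The two-stage recipe can be sketched as follows, assuming the Hugging Face Trainer API; the dataset variables (`task5_train`, `task6_train`), label counts, and output paths are placeholders rather than the shared-task code. Dataset construction is elided; see the encoding sketch above for how tweets would be tokenized.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def fine_tune(model, train_dataset, output_dir, epochs=3):
    """Fine-tune a sequence classifier and return the updated model."""
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=output_dir, num_train_epochs=epochs),
        train_dataset=train_dataset,
        tokenizer=tokenizer,
    )
    trainer.train()
    trainer.save_model(output_dir)
    return trainer.model

# Stage 1: fine-tune on one task first (e.g., Task 6's 3-way labels).
stage1 = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3)
stage1 = fine_tune(stage1, task6_train, "stage1-task6")

# Stage 2: continue from the stage-1 weights on the other task (Task 5 is
# binary), swapping in a fresh classification head for the new label space.
stage2 = AutoModelForSequenceClassification.from_pretrained(
    "stage1-task6", num_labels=2, ignore_mismatched_sizes=True)
stage2 = fine_tune(stage2, task5_train, "stage2-task5")
```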




Citations (24)


... While performing strongly on various benchmarks, LLMs still struggle with many types of consistency, including paraphrastic consistency (Srikanth et al., 2024a; Verma et al., 2023), hypothetical consistency (Chen et al., 2023), or even preferential consistency (Zhao et al., 2024). When LLMs make entailment judgments, another desirable property is logical consistency. ...

Reference:

NLI under the Microscope: What Atomic Hypothesis Decomposition Reveals
Evaluating Paraphrastic Robustness in Textual Entailment Models
  • Citing Conference Paper
  • January 2023

... This study presents a comparative analysis of popular artificial intelligence models for paraphrasing Russian-language texts. Particular attention is paid to how effectively these models generate paraphrases that are not only accurate and consistent but, importantly, match the context and preserve the original semantics of the text [3,4]. ...

Evaluating Paraphrastic Robustness in Textual Entailment Models

... shows an example UDS1.0 graph augmented with (i) a subset of the properties we add in bold (see §3); and (ii) document-level edges in purple (see §6). The UDS annotations and associated toolkit have supported research in a variety of areas, including syntactic and semantic parsing, semantic role labeling (Teichert et al., 2017) and induction (White et al., 2017), and event factuality prediction ...

Semantic Proto-Role Labeling
  • Citing Article
  • February 2017

Proceedings of the AAAI Conference on Artificial Intelligence

... Datasets and Models. We measure the perplexity for language generation tasks with Wikitext2 [60], C4 [61] and PTB [62], and accuracy for the zero-shot tasks including Winogrande [35], OBQA [36], Hellaswag [37], BoolQ [63], ARC [64] and RTE [65]. We conduct extensive experiments on LLaMA-1/2/3 [2,32], OPT [33], and Mistral [34]. ...

Figurative Language in Recognizing Textual Entailment

... NLP and computational linguistics pedagogy has a rich history of research and applications, with much work in developing online learning resources (Artemova et al., 2021; Baglini and Hjorth, 2021) and live instruction techniques (Bender et al., 2008; Agarwal, 2013; Gaddy et al., 2021; Durrett et al., 2021; Kennington, 2021). Research has explored how best to teach language technology concepts to learners without a computer science background (Fosler-Lussier, 2008; Poliak and Jenifer, 2021; Vajjala, 2021) and in non-English instruction settings. Camacho and Zevallos (2020) recommend integrating (computational) linguistics into high school curricula as a method to fight language decline. ...

An Immersive Computational Text Analysis Course for Non-Computer Science Students at Barnard College
  • Citing Conference Paper
  • January 2021

... EPM predicts EMSs from a post by leveraging textual entailment, a sentence-level inference technique from natural language processing [16]. In textual entailment, a text t entails a hypothesis h if a human reader would infer that h is likely true based on t [44]. EPM computes the entailment between sentences in a post and statements from the YSQ. ...

A survey on Recognizing Textual Entailment as an NLP Evaluation
  • Citing Conference Paper
  • January 2020

... The dataset covers five different event relations that overlap with Allen's relations. Vashishtha et al. (2020) provide five datasets with a focus on ordering events and duration computation. TORQUE presents a reading comprehension dataset to investigate the temporal ordering of events (Ning et al., 2020). ...

Temporal Reasoning in Natural Language Inference
  • Citing Conference Paper
  • January 2020

... Wadden et al. 90 created the SCI-FACT dataset with 1.4K clinical medicine-related scientific claims paired with abstracts including their corresponding evidence, and the annotated labels (including supports, refutes, and neutral) as well as rationales. Poliak et al. 91 collected verified COVID-19 question-answer pairs. There were also efforts to develop automatic medical fact-checking methods with these data resources. Specifically, Kotonya et al. 89 proposed an explainable automatic fact-checking method with a classifier for predicting the label of the given ...

Collecting Verified COVID-19 Question Answer Pairs

... This advancement has facilitated various applications, including opinion mining via sentiment analysis (SA) [2], topic modeling [3], predictive analytics of social texts, automated text document summarization [4,5], semantic relation extraction (SRE) [4], and textual entailment (TE) studies [6], among others. Notably, SRE has gained substantial attention as a research technique for evaluating NLP tasks [7], particularly as it extends beyond basic text summarization by extracting mutual semantic entities across documents. By identifying similarities or differences within a network of documents addressing similar subjects, SRE can function as an automated systematic literature review (SLR) tool [8], helping establish ground truths within datasets to address research questions; this forms the core focus of this study. ...

A Survey on Recognizing Textual Entailment as an NLP Evaluation
  • Citing Preprint
  • October 2020

... where subclaims are concatenated using the [SEP] token (https://huggingface.co/google-bert/bert-base-uncased). We employ an UNLI model (Chen et al., 2020) to estimate the conditional probability P(· | H) in Eq. 5. We use the same protocol as in Eq. 1 to label new subclaims: a claim is true only if all its subclaims are true. ...

Uncertain Natural Language Inference
  • Citing Conference Paper
  • January 2020