Matthew Lease’s research while affiliated with University of Texas at Austin and other places


Publications (165)


Figure 1. Trade-offs between accuracy difference (AD) and overall accuracy (OA) on the BERT-based model, with SUHNPF acting as the hypernetwork, for three methods (GAP (ours), CLA, and ADV) across the two datasets for α ∈ [0,1], with α = 0 optimizing AD only and α = 1 optimizing OA only. GAP achieves lower AD consistently across α settings and datasets, while a more modest drop in OA is observed across methods as AD is reduced.
Finding Pareto trade-offs in fair and accurate detection of toxic speech
  • Article
  • Full-text available

March 2025 · 12 Reads

Information Research: an international electronic journal

[...] · Matthew Lease

Introduction. Optimizing NLP models for fairness poses many challenges. Lack of differentiable fairness measures prevents gradient-based loss training or requires surrogate losses that diverge from the true metric of interest. In addition, competing objectives (e.g., accuracy vs. fairness) often require making trade-offs based on stakeholder preferences, but stakeholders may not know their preferences before seeing system performance under different trade-off settings. Method. We formulate the GAP loss, a differentiable version of a fairness measure, Accuracy Parity, to provide balanced accuracy across binary demographic groups. Analysis. We show how model-agnostic, HyperNetwork optimization can efficiently train arbitrary NLP model architectures to learn Pareto-optimal trade-offs between competing metrics like predictive performance vs. group fairness. Results. Focusing on the task of toxic language detection, we show the generality and efficacy of our proposed GAP loss function across two datasets, three neural architectures, and three fairness loss functions. Conclusions. Our GAP loss for the task of toxic language detection demonstrates promising results: improved fairness and computational efficiency. Our work can be extended to other tasks, datasets, and neural models in any practical situation where ensuring equal accuracy across different demographic groups is a desired objective.
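
As a rough illustration of the trade-off described above (and plotted in Figure 1), the sketch below scalarizes an overall-accuracy term and an accuracy-difference term with a preference weight α. The soft-accuracy surrogate, the two-group setup, and all function names are illustrative assumptions; this is not the paper's exact GAP formulation nor its SUHNPF hypernetwork training.

```python
# Minimal sketch (not the paper's method): an alpha-scalarized objective trading
# off overall accuracy (OA) against the accuracy difference (AD) between two
# demographic groups. A differentiable surrogate (mean predicted probability of
# the true class) stands in for 0/1 accuracy.
import torch

def soft_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Differentiable stand-in for accuracy: mean probability of the true class."""
    probs = torch.softmax(logits, dim=-1)
    return probs[torch.arange(len(labels)), labels].mean()

def scalarized_fairness_loss(logits, labels, groups, alpha):
    """alpha = 1 optimizes overall accuracy (OA) only; alpha = 0 optimizes the
    accuracy-difference (AD) term only, mirroring the alpha sweep in Figure 1."""
    acc_g0 = soft_accuracy(logits[groups == 0], labels[groups == 0])
    acc_g1 = soft_accuracy(logits[groups == 1], labels[groups == 1])
    loss_oa = 1.0 - soft_accuracy(logits, labels)   # overall-accuracy term
    loss_ad = torch.abs(acc_g0 - acc_g1)            # accuracy-difference term
    return alpha * loss_oa + (1.0 - alpha) * loss_ad
```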


Exploring Multidimensional Checkworthiness: Designing AI-assisted Claim Prioritization for Human Fact-checkers

December 2024 · 16 Reads · 1 Citation

Given the massive volume of potentially false claims circulating online, claim prioritization is essential in allocating the limited human resources available for fact-checking. In this study, we perceive claim prioritization as an information retrieval (IR) task: just as IR relevance is multidimensional, with many factors influencing which search results a user deems relevant, checkworthiness is also multi-faceted, subjective, and even personal, with many factors influencing how fact-checkers triage and select which claims to check. Our study investigates both the multidimensional nature of checkworthiness and effective tool support to assist fact-checkers in claim prioritization. Methodologically, we pursue Research through Design combined with mixed-method evaluation. We develop an AI-assisted claim prioritization prototype as a probe to explore how fact-checkers use multidimensional checkworthiness factors in claim prioritization, simultaneously probing fact-checker needs while also exploring the design space to meet those needs. Our study with 16 professional fact-checkers investigates: 1) how participants assessed the relative importance of different checkworthiness dimensions and applied different priorities in claim selection; 2) how they created customized GPT-based search filters and the corresponding benefits and limitations; and 3) their overall user experiences with our prototype. Our work makes a conceptual contribution connecting multidimensional IR relevance and fact-checking checkworthiness, with findings demonstrating the value of corresponding tooling support. Specifically, we uncovered a hierarchical prioritization strategy that fact-checkers implicitly use, revealing an underexplored aspect of their workflow, with actionable design recommendations for improving claim triage across multi-dimensional checkworthiness and tailoring this process with LLM integration.
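
To make the notion of multidimensional checkworthiness concrete, here is a deliberately simple sketch that ranks claims by a weighted combination of per-dimension scores. The dimension names, weights, and scores are hypothetical; the prototype described above elicits priorities from fact-checkers and uses GPT-based filters rather than a fixed weighted sum.

```python
# Illustrative only: rank claims by a weighted sum of per-dimension
# checkworthiness scores. Dimension names, weights, and scores are hypothetical.
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    scores: dict[str, float]  # per-dimension checkworthiness scores in [0, 1]

def prioritize(claims: list[Claim], weights: dict[str, float]) -> list[Claim]:
    """Order claims by a (hypothetical) fact-checker's dimension weights."""
    def priority(claim: Claim) -> float:
        return sum(weights.get(dim, 0.0) * score for dim, score in claim.scores.items())
    return sorted(claims, key=priority, reverse=True)

claims = [
    Claim("Claim A", {"harm": 0.9, "verifiability": 0.4, "virality": 0.7}),
    Claim("Claim B", {"harm": 0.3, "verifiability": 0.9, "virality": 0.2}),
]
ranked = prioritize(claims, weights={"harm": 0.5, "verifiability": 0.3, "virality": 0.2})
```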


Fig. 3. Flowchart of the experimental procedure.
Fig. 5. Comparison of normalized fixation duration and fixation counts on attitude-consistent and inconsistent information across user behavior groups.
Fig. 6. Stacked percentage bar charts of belief change, baseline vs. experiment, across topic familiarity and confirmation-bias tendency: participants with higher topic familiarity were less likely to change their beliefs in both systems, while no such trend was evident across different levels of confirmation bias.
Fig. 8. Screenshot of Daxenberger et al. [17]'s "ArgumentSearch" system. When a user enters a query, such as "Do violent video games contribute to youth violence?", the system identifies relevant articles and presents content snippets representing both sides, including those in favor (PRO) and those opposed (CON) to the topic.
Fig. 9. Distribution of user behavior patterns across different topics.
Argumentative Experience: Reducing Confirmation Bias on Controversial Issues through LLM-Generated Multi-Persona Debates

December 2024 · 74 Reads

Large language models (LLMs) are enabling designers to give life to exciting new user experiences for information access. In this work, we present a system that generates LLM personas to debate a topic of interest from different perspectives. How might information seekers use and benefit from such a system? Can centering information access around diverse viewpoints help to mitigate thorny challenges like confirmation bias, in which information seekers over-trust search results matching existing beliefs? How do potential biases and hallucinations in LLMs play out alongside human users who are also fallible and possibly biased? In a mixed-methods, within-subjects study, we expose participants to multiple viewpoints on controversial issues. We use eye-tracking metrics to quantitatively assess cognitive engagement alongside qualitative feedback. Compared to a baseline search system, we see more creative interactions and diverse information-seeking with our multi-persona debate system, which more effectively reduces user confirmation bias and conviction toward their initial beliefs. Overall, our study contributes to the emerging design space of LLM-based information access systems, specifically investigating the potential of simulated personas to promote greater exposure to information diversity, emulate collective intelligence, and mitigate bias in information seeking.
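
For readers curious how such multi-persona debates might be orchestrated, below is a minimal sketch that alternates turns between two LLM personas. It is not the study's system; the persona descriptions, prompts, and model name are assumptions made for illustration.

```python
# Minimal sketch of orchestrating a multi-persona LLM debate on a topic.
# Not the study's system: persona descriptions, prompts, and the model name
# ("gpt-4o-mini") are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def debate_turn(topic: str, persona: str, transcript: list[str]) -> str:
    """Ask one persona for its next argument, given the debate so far."""
    messages = [
        {"role": "system",
         "content": (f"You are {persona}. Argue your perspective on: {topic}. "
                     "Respond to earlier points and stay concise.")},
        {"role": "user", "content": "\n".join(transcript) or "Open the debate."},
    ]
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return reply.choices[0].message.content

topic = "Do violent video games contribute to youth violence?"
personas = ["a public-health researcher arguing PRO", "a media scholar arguing CON"]
transcript: list[str] = []
for _ in range(2):  # two debate rounds
    for persona in personas:
        transcript.append(f"{persona}: {debate_turn(topic, persona, transcript)}")
```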


Human-centered NLP Fact-checking: Co-Designing with Fact-checkers using Matchmaking for AI

November 2024 · 12 Reads · 15 Citations

Proceedings of the ACM on Human-Computer Interaction

While many Natural Language Processing (NLP) techniques have been proposed for fact-checking, both academic research and fact-checking organizations report limited adoption of such NLP work due to poor alignment with fact-checker practices, values, and needs. To address this, we investigate a co-design method, Matchmaking for AI, to enable fact-checkers, designers, and NLP researchers to collaboratively identify what fact-checker needs should be addressed by technology, and to brainstorm ideas for potential solutions. Co-design sessions we conducted with 22 professional fact-checkers yielded a set of 11 design ideas that offer a "north star", integrating fact-checker criteria into novel NLP design concepts. These concepts range from pre-bunking misinformation and efficient, personalized misinformation monitoring to proactively reducing potential fact-checker biases and collaboratively writing fact-check reports. Our work provides new insights into human-centered fact-checking research and practice as well as AI co-design research.



Figure 3: Visualization of the BA values achieved by each loss function over the 7 demographic groups. The maximum difference (Max. Diff.) between the maximum and minimum BA achieved for each loss across groups is also shown. See Table 1 for additional detail and discussion. GAP performs best with lowest Max Diff. of 5.5.
Figure 4: Heatmap of pairwise absolute difference of BA across groups in test set as an indicator for bias and disparate impact. OE has the highest performance gap (Max Diff = 21.9) across groups as indicated by the extremes of color, not only across one group-pair but consistently across multiple group pairs. GAP has the least spread in pairwise error values (Max Diff = 5.5), evident from the flatness of color, indicating least disparate impact across groups.
Figure 5: Our Multi-Label Architecture
Figure 6: Training loss trajectory for our GAP loss with the AdaMax optimizer.
Fairly Accurate: Optimizing Accuracy Parity in Fair Target-Group Detection

July 2024 · 22 Reads

In algorithmic toxicity detection pipelines, it is important to identify which demographic group(s) are the subject of a post, a task commonly known as target (group) detection. While accurate detection is clearly important, we further advocate a fairness objective: to provide equal protection to all groups who may be targeted. To this end, we adopt Accuracy Parity (AP), balanced detection accuracy across groups, as our fairness objective. However, in order to align model training with our AP fairness objective, we require an equivalent loss function. Moreover, for gradient-based models such as neural networks, this loss function needs to be differentiable. Because no such loss function exists today for AP, we propose Group Accuracy Parity (GAP): the first differentiable loss function having a one-to-one mapping to AP. We empirically show that GAP addresses disparate impact on groups for target detection. Furthermore, because a single post often targets multiple groups in practice, we also provide a mathematical extension of GAP to larger multi-group settings, something typically requiring heuristics in prior work. Our findings show that by optimizing AP, GAP better mitigates bias in comparison with other commonly employed loss functions.
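
A hedged sketch of the core idea, a differentiable penalty on the spread of per-group soft accuracies extended beyond two groups, appears below. It illustrates the accuracy-parity objective in general terms; it is not the paper's exact GAP loss or its multi-label extension.

```python
# Sketch of a differentiable accuracy-parity-style penalty over multiple groups:
# penalize the spread (here, variance) of per-group soft accuracies. The soft
# accuracy is the mean predicted probability of the true class per group.
import torch

def group_soft_accuracies(logits, labels, groups, num_groups):
    """Per-group mean predicted probability of the true class (a soft accuracy)."""
    probs = torch.softmax(logits, dim=-1)
    true_prob = probs[torch.arange(len(labels)), labels]
    return torch.stack([true_prob[groups == g].mean() for g in range(num_groups)])

def accuracy_parity_penalty(logits, labels, groups, num_groups):
    """Zero when all groups have equal soft accuracy; grows with the spread."""
    accs = group_soft_accuracies(logits, labels, groups, num_groups)
    return ((accs - accs.mean()) ** 2).mean()
```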



A General Model for Aggregating Annotations Across Simple, Complex, and Multi-object Annotation Tasks (Abstract Reprint)

March 2024

Proceedings of the AAAI Conference on Artificial Intelligence

Human annotations are vital to supervised learning, yet annotators often disagree on the correct label, especially as annotation tasks increase in complexity. A common strategy to improve label quality is to ask multiple annotators to label the same item and then aggregate their labels. To date, many aggregation models have been proposed for simple categorical or numerical annotation tasks, but far less work has considered more complex annotation tasks, such as those involving open-ended, multivariate, or structured responses. Similarly, while a variety of bespoke models have been proposed for specific tasks, our work is the first we are aware of to introduce aggregation methods that generalize across many, diverse complex tasks, including sequence labeling, translation, syntactic parsing, ranking, bounding boxes, and keypoints. This generality is achieved by applying readily available task-specific distance functions, then devising a task-agnostic method to model these distances between labels, rather than the labels themselves. This article presents a unified treatment of our prior work on complex annotation modeling and extends that work with investigation of three new research questions. First, how do complex annotation task and dataset properties impact aggregation accuracy? Second, how should a task owner navigate the many modeling choices in order to maximize aggregation accuracy? Finally, what tests and diagnoses can verify that aggregation models are specified correctly for the given data? To understand how various factors impact accuracy and to inform model selection, we conduct large-scale simulation studies and broad experiments on real, complex datasets. Regarding testing, we introduce the concept of unit tests for aggregation models and present a suite of such tests to ensure that a given model is not mis-specified and exhibits expected behavior. Beyond investigating these research questions above, we discuss the foundational concept and nature of annotation complexity, present a new aggregation model as a conceptual bridge between traditional models and our own, and contribute a new general semisupervised learning method for complex label aggregation that outperforms prior work.
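
To illustrate the distance-based idea, the sketch below aggregates complex annotations by choosing, per item, the annotation with the smallest average task-specific distance to the others (a medoid). This is a simple baseline in the spirit of "modeling distances between labels rather than the labels themselves", not the article's full aggregation model; the distance function and labels are toy examples.

```python
# Minimal sketch of distance-based aggregation for complex annotations: for one
# item, pick the annotation with the smallest average task-specific distance to
# the other annotations (a medoid). A simple baseline, not the article's model.
def aggregate_by_medoid(annotations, distance):
    """annotations: labels from different workers for one item; distance: task-specific metric."""
    def avg_dist(a):
        others = [b for b in annotations if b is not a]
        return sum(distance(a, b) for b in others) / max(len(others), 1)
    return min(annotations, key=avg_dist)

# Toy distance over sequence labels (hypothetical example data):
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

labels = [["B-PER", "O", "O"], ["B-PER", "I-PER", "O"], ["B-PER", "O", "O"]]
consensus = aggregate_by_medoid(labels, hamming)  # -> ["B-PER", "O", "O"]
```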




Citations (55)


... La Barbera et al. [25] highlighted the influence of worker bias in assessing statements, an issue further explored by Roitero et al. [42], who observed similar agreement levels across various truthfulness scales. Expanding on this, Soprano et al. [47] and Liu et al. [29] introduced multidimensional truthfulness scales, showing that crowd judgments are reliable across multiple truthfulness dimensions, each capturing different aspects of truthfulness. The comparative performance of crowd workers and experts has also been examined. ...

Reference:

Efficiency and Effectiveness of LLM-Based Summarization of Evidence in Crowdsourced Fact-Checking
Exploring Multidimensional Checkworthiness: Designing AI-assisted Claim Prioritization for Human Fact-checkers

... Comparative analyses also have been developed to study variability in model performance across datasets [21,40,45,48]. Recent research has also focused on generating human-readable explanations [3,6,49,52] and incorporating complex interactions within machine learning pipelines [28]. Hybrid approaches that integrate human input with machine learning have gained attention as well [9,39,51]. ...

Human-centered NLP Fact-checking: Co-Designing with Fact-checkers using Matchmaking for AI
  • Citing Article
  • November 2024

Proceedings of the ACM on Human-Computer Interaction

... However, our work can be extended to other tasks, datasets, and neural models in any practical situation where ensuring equal accuracy across different demographic groups is a desired objective. Recently, Kovatchev and Lease (Kovatchev & Lease, 2024) demonstrated the significant impact of imbalanced data in popular NLP benchmarks. Our work can help address that challenge. ...

Benchmark Transparency: Measuring the Impact of Data on Evaluation
  • Citing Conference Paper
  • January 2024

... While these specialized models were shown to be effective for their respective tasks, one key limitation of such an approach is that a custom algorithm is needed for each type of complex annotation. Recent work has explored pairwise similarities between annotations as a general approach for identifying good labelers across different types of complex annotations (Braylan et al. 2023; Meir et al. 2023). The assumption that good workers are similar to one another in terms of their reported annotations (whereas poor workers are not) is referred to as the Anna Karenina principle in Meir et al. (2023). ...

A General Model for Aggregating Annotations Across Simple, Complex, and Multi-Object Annotation Tasks
  • Citing Article
  • December 2023

Journal of Artificial Intelligence Research

... Others found that socio-demographic factors such as age, gender, race and ethnicity, education, and English proficiency, influence annotation outcomes, with younger and minority annotators more likely to label content as hate speech (Al Davani et al. 2023). Additionally, annotator beliefs and demographics can introduce further inconsistencies in crowdsourced annotations (Hettiachchi et al. 2023). Although much research has explored socio-demographic biases in hate speech annotations (Garg et al. 2023), most studies focused on a limited set of attributes, primarily race and ethnicity, gender, and age. ...

How Crowd Worker Factors Influence Subjective Annotations: A Study of Tagging Misogynistic Hate Speech in Tweets
  • Citing Article
  • November 2023

Proceedings of the AAAI Conference on Human Computation and Crowdsourcing

... While preferable to the use of singular gold labels, this approach still limits the number of identifiable perspectives to two; there is no room for more nuanced minority perspectives to emerge. Metadata-based methodologies encode annotator metadata with the aim of capturing a clear signal from groups which share metadata information labels (Rottger et al. 2022; Davani et al. 2023; Fleisig, Abebe, and Klein 2023; Gupta et al. 2023). These approaches assume that individuals who share similar metadata (e.g., same gender) will also annotate similarly, which might be why findings supportive of this methodology seem to be both data set (Lee, An, and Thorne 2023) and task (Welch et al. 2020) specific. ...

Same Same, But Different: Conditional Multi-Task Learning for Demographic-Specific Toxicity Detection
  • Citing Conference Paper
  • April 2023

... Human-based inconsistencies can only be reduced, not eliminated, by resource-intensive manual reviews. One emerging method to account for the intrinsic challenges of human labeling has been the use of consensus crowdsourcing [6], [7], [8] to build databases. While these procedures excel at using probabilistic models to acquire agreement on features with relatively small levels of ambiguity, challenges remain when analyzing rare or highly ambiguous features. ...

Editorial: Human-centered AI: Crowd computing

Frontiers in Artificial Intelligence

... Furthermore, the output of fact-checking requires not only the stance but also the explanation for the stance of a claim, because the judgment of the claim verification step is an essential process in conducting fact-checking in journalism [4]. In addition, the explanation of the claim verdict helps readers understand the fact-checking process [5]. On the other hand, according to [6], a claim does not always have available evidence, due to gaps in the knowledge source and the lack of an up-to-date evidence database. ...

The state of human-centered NLP technology for fact-checking
  • Citing Article
  • March 2023

Information Processing & Management

... Furthermore, this allows users to grasp why an image is classified in a certain way, such as understanding why a bird is identified as a 'red-bellied woodpecker' due to its distinct red belly and head, along with black and white wing stripes. Following this work, researchers explored incorporating a prototype layer with transformer-based encoders, such as the Universal Sentence Encoder, BERT, and BART (Bidirectional and Auto-Regressive Transformers), in fake news detection and hotel review classification (Das et al., 2022; Hong et al., 2024; Wen, 2024). Sarcasm, due to its nature, can benefit from such reasoning provided by prototype-based models. ...

ProtoTEx: Explaining Model Decisions with Prototype Tensors

... Explanations also benefit data quality: the need to explain a decision tends to improve its accuracy [33]. Although the time required for annotation with both label and explanation is greater at first, it largely decreases over time to nearly equal the time required to annotate labels only [34]. ...

Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments
  • Citing Article
  • September 2016

Proceedings of the AAAI Conference on Human Computation and Crowdsourcing