Carolyn Penstein Rosé’s research while affiliated with Carnegie Mellon University and other places


Publications (402)


Figure 2: The rankings of the top 10 models on the DocVQA leaderboard, before and after applying our composite score with α = 0.25. Left segment: Rankings based on ANLS versus our score. Middle segment: Our rankings broken down by question type. Right segment: Our rankings broken down by answer type.
Figure 3: The correlation between the rankings produced by our method (with α = 0.25) and the original ANLS-based ranking, broken down by the type of answer. All τ values are significant at p ≪ 0.05.
Figure 4: Kendall's τ rank correlation with the original DocVQA leaderboard, broken down by question types. All τ values are significant at p ≪ 0.05.
Figure 5: Pearson R correlation with the calibration error of models based on the DUDE leaderboard, broken down by answer type.
Figure 6: The mean volatility of each model's score versus its ranking. Red dots represent ANLS scores and blue dots represent SMuDGE with α = 0.25.


Where is this coming from? Making groundedness count in the evaluation of Document VQA models
  • Preprint
  • File available

March 2025 · 23 Reads
Siddharth Parekh · Pranav Shetty · [...] · Carolyn Rose

Document Visual Question Answering (VQA) models have evolved at an impressive rate over the past few years, coming close to or matching human performance on some benchmarks. We argue that common evaluation metrics used by popular benchmarks do not account for the semantic and multimodal groundedness of a model's outputs. As a result, hallucinations and major semantic errors are treated the same way as well-grounded outputs, and the evaluation scores do not reflect the reasoning capabilities of the model. In response, we propose a new evaluation methodology that accounts for the groundedness of predictions with regard to the semantic characteristics of the output as well as the multimodal placement of the output within the input document. Our proposed methodology is parameterized in such a way that users can configure the score according to their preferences. We validate our scoring methodology using human judgment and show its potential impact on existing popular leaderboards. Through extensive analyses, we demonstrate that our proposed method produces scores that are a better indicator of a model's robustness and tends to give higher rewards to better-calibrated answers.
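The figure captions above refer to a composite score that blends answer correctness with groundedness under a user-set weight α. As a rough illustration of that idea only (not the paper's actual SMuDGE formulation), the sketch below combines a standard per-answer ANLS term with a hypothetical groundedness estimate via a linear weight `alpha`; the `groundedness` input and the linear combination are assumptions.

```python
# Illustrative sketch only: blends ANLS-style answer correctness with a
# groundedness term via a weight alpha, in the spirit of the composite score
# described above. The groundedness estimate and the linear combination are
# assumptions, not the paper's SMuDGE definition.

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(prediction: str, reference: str, tau: float = 0.5) -> float:
    """Per-answer Normalized Levenshtein Similarity with the usual 0.5 cutoff."""
    if not prediction and not reference:
        return 1.0
    nls = 1.0 - edit_distance(prediction.lower(), reference.lower()) / max(
        len(prediction), len(reference), 1
    )
    return nls if nls >= tau else 0.0

def composite_score(prediction: str, reference: str,
                    groundedness: float, alpha: float = 0.25) -> float:
    """Reward correctness, but discount answers that are poorly grounded in the
    source document. groundedness in [0, 1] is assumed to come from a separate
    semantic/spatial verification step, which is not sketched here."""
    return (1.0 - alpha) * anls(prediction, reference) + alpha * groundedness

# A correct but ungrounded answer scores lower than a correct, grounded one.
print(composite_score("1998", "1998", groundedness=0.1))  # 0.775
print(composite_score("1998", "1998", groundedness=1.0))  # 1.0
```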


Figure 5: Case study 1. The original score_explicit_question function and its context extracted from the original GitHub repository. The function calls the text completion function from the OpenAI API.
RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing

March 2025 · 3 Reads

We present RepoST, a scalable method to construct environments that provide execution feedback for repository-level code generation, for both training and evaluation. Unlike existing works that aim to build entire repositories for execution, which is challenging for both humans and LLMs, we provide execution feedback with sandbox testing, which isolates a given target function and its dependencies into a separate script for testing. Sandbox testing reduces the complexity of external dependencies and enables constructing environments at a large scale. We use our method to construct RepoST-Train, a large-scale training set with 7,415 functions from 832 repositories. Training with the execution feedback provided by RepoST-Train leads to a performance gain of 5.5% Pass@1 on HumanEval and 3.5% Pass@1 on RepoEval. We also build an evaluation dataset, RepoST-Eval, and benchmark 12 code generation models.
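Since the abstract describes sandbox testing as isolating a target function and its dependencies into a separate script and using its execution as feedback, here is a minimal sketch of that pattern. The template, helper names, and the plain subprocess call are assumptions for illustration, not the RepoST implementation.

```python
# Hypothetical sketch of sandbox testing: copy a target function and its
# sliced dependencies into a standalone script with generated tests, and use
# the script's exit status as execution feedback. Not the RepoST code.
import subprocess
import tempfile
import textwrap
from pathlib import Path

SANDBOX_TEMPLATE = textwrap.dedent("""\
    {dependencies}

    {target_function}

    # --- generated tests ---
    {tests}

    if __name__ == "__main__":
        run_tests()
""")

def build_sandbox(target_function: str, dependencies: str, tests: str) -> str:
    """Assemble a self-contained script: sliced dependencies, the target
    function under test, and generated tests exposing a run_tests() entry."""
    return SANDBOX_TEMPLATE.format(
        dependencies=dependencies, target_function=target_function, tests=tests
    )

def execution_feedback(script: str, timeout: int = 30) -> bool:
    """Execute the sandbox script in a temporary directory; a zero exit code
    is treated as passing execution feedback."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sandbox_test.py"
        path.write_text(script)
        try:
            result = subprocess.run(
                ["python", str(path)], capture_output=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            return False
    return result.returncode == 0
```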


Figure 3: Prompt used for synthesizing single sound laws from the protoforms (inputs) and attested forms (outputs).
Figure 4: Instantiation of the BasicAction class to represent a sound law. This example shows a rule where "a" goes to "e" when it occurs before a "j". The predicates match an environment of "a@j", where '@' is the separator, and then the first character of the environment (i.e., "a") goes to "e" as described by change_pos and mapping_fn. In other words, it represents the rule a > e / _ j.
Does fine-tuning produce a significant improvement? (PySLICoder vs. base Magicoder comparison). Wilcoxon signed-rank test results show that each fine-tuning setting achieves a statistically significant improvement over the base Magicoder runs for reward per program, number of passing programs, and pass rate. Note that we use significance level α = 0.05/m = 0.05/7 ≈ 0.007 in accordance with the Bonferroni correction (Weisstein, 2004) to account for the 7 comparisons.
Statistical significance: Wilcoxon signed-rank tests performed with the ptk variants of IDP-PI and RP-PI (significance level α = 0.05/5 = 0.01 after Bonferroni correction; Weisstein, 2004). We highlight statistically significant observations.
Programming by Examples Meets Historical Linguistics: A Large Language Model Based Approach to Sound Law Induction

January 2025 · 8 Reads

Historical linguists have long written "programs" that convert reconstructed words in an ancestor language into their attested descendants via ordered string rewrite functions (called sound laws). However, writing these programs is time-consuming, motivating the development of automated Sound Law Induction (SLI), which we formulate as Programming by Examples (PBE) with Large Language Models (LLMs) in this paper. While LLMs have been effective for code generation, recent work has shown that PBE is challenging but improvable by fine-tuning, especially with training data drawn from the same distribution as evaluation data. In this paper, we create a conceptual framework of what constitutes a "similar distribution" for SLI and propose four kinds of synthetic data generation methods with varying amounts of inductive bias to investigate what leads to the best performance. Based on the results, we create a SOTA open-source model for SLI as PBE (+6% pass rate with a third of the parameters of the second-best LLM) and also highlight exciting future directions for PBE research.
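Figure 4 above describes a sound law as a conditioned rewrite (a > e / _ j) encoded with predicates, a change position, and a mapping function. As a rough, self-contained illustration of ordered string rewriting for SLI, with class and field names that are assumptions rather than the paper's BasicAction interface:

```python
# Minimal sketch of a sound law as a conditioned, ordered string rewrite.
# Illustrative only; not the paper's actual representation.
import re
from dataclasses import dataclass

@dataclass
class SoundLaw:
    target: str              # segment to rewrite, e.g. "a"
    replacement: str         # what it becomes, e.g. "e"
    right_context: str = ""  # environment after the target, e.g. "j"
    left_context: str = ""   # environment before the target

    def apply(self, form: str) -> str:
        # Build a regex with optional lookbehind/lookahead for the environment.
        pattern = (
            (f"(?<={re.escape(self.left_context)})" if self.left_context else "")
            + re.escape(self.target)
            + (f"(?={re.escape(self.right_context)})" if self.right_context else "")
        )
        return re.sub(pattern, self.replacement, form)

def derive(protoform: str, laws: list[SoundLaw]) -> str:
    """Apply an ordered list of sound laws to a reconstructed protoform."""
    for law in laws:
        protoform = law.apply(protoform)
    return protoform

# Example: a > e / _ j rewrites "a" only when it precedes "j".
law = SoundLaw(target="a", replacement="e", right_context="j")
print(derive("kajta", [law]))  # "kejta"
print(derive("kanta", [law]))  # "kanta" (unchanged)
```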


Format of the reflections for each of the conditions. [NSI] is constituted by T1 and T2, while [GEN] and [CONT] are made up of T1, T2, T3 and T4 with [GEN] having the generic alternative scenario turn on the right and [CONT] having the GPT‐4 generated tailored alternatives shown on the left in T3.
This figure displays the interaction between condition and split on the least squares mean of combined posttest scores. A simple t-test shows that the difference between GEN and CONT is significant for low-pretest students and marginal for high-pretest students, with the ranking of the two conditions reversed between the groups.
Providing tailored reflection instructions in collaborative learning using large language models

December 2024 · 89 Reads · 1 Citation

The relative effectiveness of reflection either through student generation of contrasting cases or through provided contrasting cases is not well-established for adult learners. This paper presents a classroom study to investigate this comparison in a college-level Computer Science (CS) course where groups of students worked collaboratively to design database access strategies. Forty-four teams were randomly assigned to three reflection conditions ([GEN] a directive to generate a contrasting case to the student solution and evaluate their trade-offs in light of the principle, [CONT] a directive to compare the student solution with a provided contrasting case and evaluate their trade-offs in light of a principle, and [NSI] a control condition with a non-specific directive for reflection evaluating the student solution in light of a principle). In the CONT condition, as an illustration of the use of LLMs to exemplify knowledge transformation beyond knowledge construction in the generation of an automated contribution to a collaborative learning discussion, an LLM generated a contrasting case to a group's solution to exemplify application of an alternative problem-solving strategy in a way that highlighted the contrast by keeping many concrete details the same as those the group had most recently collaboratively constructed. While there was no main effect of condition on learning based on a content test, low-pretest students learned more from CONT than GEN, with NSI not distinguishable from the other two, while high-pretest students learned marginally more from the GEN condition than the CONT condition, with NSI not distinguishable from the other two.

Practitioner notes

What is already known about this topic
  • Reflection during or even in place of computer programming is beneficial for learning of principles for advanced computer science when the principles are new to students.
  • Generation of contrasting cases and comparing contrasting cases have both been demonstrated to be effective as opportunities to learn from reflection in some contexts, though questions remain about ideal applicability conditions for adult learners.
  • Intelligent conversational agents can be used effectively to deliver stimuli for reflection during collaborative learning, though room for improvement remains, which provides an opportunity to demonstrate the potential positive contribution of large language models (LLMs).

What this paper adds
  • The study contributes new knowledge related to the differences in applicability conditions between generation of contrasting cases and comparison across provided contrasting cases for adult learning.
  • The paper presents an application of LLMs as a tool to provide contrasting cases tailored to the details of actual student solutions.
  • The study provides evidence from a classroom intervention study for positive impact on student learning of an LLM-enabled intervention.

Implications for practice and/or policy
  • Advanced computer science curricula should make substantial room for reflection alongside problem solving.
  • Instructors should provide reflection opportunities for students tailored to their level of prior knowledge.
  • Instructors would benefit from training to use LLMs as tools for providing effective contrasting cases, especially for low-prior-knowledge students.
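In the CONT condition described above, the tailored contrasting case must reuse the group's own concrete details while switching the problem-solving strategy. A hypothetical sketch of how such a prompt might be assembled and sent to an OpenAI chat model follows; the prompt wording, model name, and function are assumptions rather than the study's actual materials.

```python
# Hypothetical prompt-construction sketch for a tailored contrasting case.
# The wording and model choice are assumptions, not the study's materials.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def contrasting_case(group_solution: str, principle: str,
                     alternative_strategy: str) -> str:
    prompt = (
        "A student team designed the following database access strategy:\n"
        f"{group_solution}\n\n"
        f"Write an alternative solution that instead applies {alternative_strategy}, "
        "keeping the tables, fields, and scenario details the same, so the two "
        f"solutions can be compared with respect to {principle}."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in; the study used GPT-4
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```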


Improving Model Factuality with Fine-grained Critique-based Evaluator

October 2024 · 2 Reads

Factuality evaluation aims to detect factual errors produced by language models (LMs) and hence guide the development of more factual models. Towards this goal, we train a factuality evaluator, FenCE, that provides LM generators with claim-level factuality feedback. We conduct data augmentation on a combination of public judgment datasets to train FenCE to (1) generate textual critiques along with scores and (2) make claim-level judgments based on diverse source documents obtained by various tools. We then present a framework that leverages FenCE to improve the factuality of LM generators by constructing training data. Specifically, we generate a set of candidate responses, leverage FenCE to revise and score each response without introducing lesser-known facts, and train the generator by preferring highly scored revised responses. Experiments show that our data augmentation methods improve the evaluator's accuracy by 2.9% on LLM-AggreFact. With FenCE, we improve Llama3-8B-chat's factuality rate by 14.45% on FActScore, outperforming state-of-the-art factuality finetuning methods by 6.96%.
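As a rough sketch of the data-construction loop described in the abstract (sample candidates, have the evaluator revise and score them, then prefer highly scored revisions), the code below builds preference pairs from placeholder generate/judge callables. The interfaces and the margin heuristic are assumptions, not the actual FenCE pipeline.

```python
# Illustrative preference-data construction loop; placeholder interfaces only.
from dataclasses import dataclass

@dataclass
class Judged:
    response: str   # original candidate response
    revised: str    # evaluator-revised version of the response
    score: float    # claim-level factuality score in [0, 1]

def build_preference_data(prompts, generate, judge, n_samples=4, margin=0.2):
    """generate(prompt) -> str; judge(prompt, response) -> Judged."""
    pairs = []
    for prompt in prompts:
        judged = [judge(prompt, generate(prompt)) for _ in range(n_samples)]
        judged.sort(key=lambda j: j.score, reverse=True)
        best, worst = judged[0], judged[-1]
        # Prefer the highly scored revised response over the low-scored original.
        if best.score - worst.score >= margin:
            pairs.append(
                {"prompt": prompt, "chosen": best.revised, "rejected": worst.response}
            )
    return pairs
```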


Distilling Multi-Scale Knowledge for Event Temporal Relation Extraction

Event Temporal Relation Extraction (ETRE) is paramount but challenging. Within a discourse, event pairs are situated at different distances, or so-called proximity bands. The temporal ordering communicated about event pairs at more remote (i.e., "long") or less remote (i.e., "short") proximity bands is encoded differently. SOTA models have tended to perform well on events situated at either short or long proximity bands, but not both. Nonetheless, real-world, natural texts contain all types of temporal event pairs. In this paper, we present MulCo: Distilling Multi-Scale Knowledge via Contrastive Learning, a knowledge co-distillation approach that shares knowledge across multiple event pair proximity bands to improve performance on all types of temporal datasets. Our experimental results show that MulCo successfully integrates linguistic cues pertaining to temporal reasoning across both short and long proximity bands and achieves new state-of-the-art results on several ETRE benchmark datasets.
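To make the co-distillation idea concrete, here is a simplified PyTorch sketch that distills two band-specific teachers (short- and long-proximity) into one student with temperature-scaled KL terms alongside the supervised loss. MulCo's actual objective is contrastive; this KL-based version is only an illustrative stand-in, and the weighting scheme is an assumption.

```python
# Simplified co-distillation loss across proximity bands (illustrative only).
import torch
import torch.nn.functional as F

def co_distillation_loss(student_logits, short_teacher_logits,
                         long_teacher_logits, labels,
                         temperature=2.0, alpha=0.5):
    t = temperature
    # Supervised loss on gold temporal relation labels.
    ce = F.cross_entropy(student_logits, labels)
    # Distill from each band-specific teacher's softened distribution.
    kd_short = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(short_teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * t * t
    kd_long = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(long_teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * t * t
    return alpha * ce + (1 - alpha) * 0.5 * (kd_short + kd_long)
```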


Figure 1: Example diff with multiple valid reviews. The ground truth and model-generated reviews focus on different topics like the performance of the added check and how likely it is to be triggered. However, a reference-based metric like the BLEU score assigns this review a low score of 0.0458.
Figure 3: Supervised fine-tuning pipeline for training Magicoder-6.7B for claim generation. We generate synthetic data by using GPT-4 to generate claims for the code changes in CodeReviewer validation set.
Figure 5: Histogram of sentence similarity of randomly sampled 100K sentence pairs from the CodeReviewer test set showing the scores are roughly normally distributed, justifying the usage of the 5-sigma rule for coming up with the threshold of 0.85 for high similarity used in metric computation.
Figure 6: Q-Q plot comparing quantiles of empirically observed sentence similarity scores computed over 100K sentence pairs from the CodeReviewer test set, showing that the theoretical quantiles match a normal distribution except for very high values. The discrepancy seen here is likely due to the random sample being a small subset of the more than 100M sentence pairs for which we compute similarities.
The various types of errors identified, their descriptions, examples (pseudo-references before and after correction of the error), and relative frequencies (as percentages) are shown here. For this analysis, we annotated 46 erroneous pseudo-references.
CRScore: Grounding Automated Evaluation of Code Review Comments in Code Claims and Smells

September 2024 · 35 Reads · 1 Citation

The task of automated code review has recently gained a lot of attention from the machine learning community. However, current review comment evaluation metrics rely on comparisons with a human-written reference for a given code change (also called a diff), even though code review is a one-to-many problem, like generation and summarization, with many "valid reviews" for a diff. To tackle these issues, we develop CRScore, a reference-free metric to measure dimensions of review quality like conciseness, comprehensiveness, and relevance. We design CRScore to evaluate reviews in a way that is grounded in claims and potential issues detected in the code by LLMs and static analyzers. We demonstrate that CRScore can produce valid, fine-grained scores of review quality that have the greatest alignment with human judgment (0.54 Spearman correlation) and are more sensitive than reference-based metrics. We also release a corpus of 2.6k human-annotated review quality scores for machine-generated and GitHub review comments to support the development of automated metrics.
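Combining the abstract with the threshold analysis in Figures 5 and 6 (similarity scores treated as roughly normal, with 0.85 as the high-similarity cutoff), the sketch below shows one way a reference-free, claim-grounded score could be computed with off-the-shelf sentence embeddings. The embedding model, the sub-scores, and their aggregation are assumptions, not the released CRScore code.

```python
# Illustrative reference-free scoring in the spirit of CRScore: compare review
# sentences against claims/smells detected for the diff. Assumptions throughout.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

def crscore_like(review_sentences, claims, threshold=0.85):
    rev = model.encode(review_sentences, convert_to_tensor=True)
    clm = model.encode(claims, convert_to_tensor=True)
    sim = util.cos_sim(rev, clm)  # shape: [n_review_sentences, n_claims]
    # Conciseness: share of review sentences grounded in at least one claim.
    conciseness = float((sim.max(dim=1).values >= threshold).float().mean())
    # Comprehensiveness: share of claims addressed by at least one sentence.
    comprehensiveness = float((sim.max(dim=0).values >= threshold).float().mean())
    # One simple aggregation choice: harmonic mean of the two sub-scores.
    if conciseness + comprehensiveness == 0:
        return 0.0
    return 2 * conciseness * comprehensiveness / (conciseness + comprehensiveness)
```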





Citations (57)


... Strategies for IE from LRDs. Graph-based models like GCNs (Liu et al., 2019a) and Ali-GATr (Nourbakhsh et al., 2024) enhance relation extraction by capturing textual-visual relationships. Reading order is critical; Token Path Prediction (TPP) resolves OCR layout ambiguities, while global tagging (Shaojie et al., 2023) mitigates text ordering issues for better extraction. ...

Reference:

Problem Solved? Information Extraction Design Space for Layout-Rich Documents using LLMs
AliGATr: Graph-based layout generation for form understanding
  • Citing Conference Paper
  • January 2024

... address these limitations, however, Google's approach [40] only focuses on issue classification without generating specific review comments, while Tencent [50] primarily addresses code maintainability concerns. Our investigation reveals three fundamental challenges in current LLM-based solutions: i) insufficient precision in generating technically accurate comments, ii) low practicality of comments that are technically correct but fail to provide substantial value [26,29], and iii) lack of systematic mechanisms for targeted improvement, preventing data-driven evolution in both model precision and suggestion practicality. ...

CRScore: Grounding Automated Evaluation of Code Review Comments in Code Claims and Smells

... As Nourbakhsh et al. (2024) argued, grounding is an important requirement (and challenge) for the operationalization of Document VQA models especially in enterprise domains. Nevertheless it is difficult to determine how much grounding might matter to one downstream application versus another. ...

Towards a new research agenda for multimodal enterprise document understanding: What are we missing?
  • Citing Conference Paper
  • January 2024

... For the LLM-as-a-judge evaluation, we utilized the DocLens metric, which employs GPT-4o mini to extract claims from generated reports, verify their presence in ground-truth reports, and compute a precision score for the generated content (for the prompt used, please refer to Supplementary Table 8; for the claim and precision examples, see Supplementary Table 9). Five models were evaluated, including BrainGPT-plain, BrainGPT-example, BrainGPT-template, BrainGPT-keyword, and the baseline Otter model. Additionally, we compute the Pearson's correlation coefficient (Pearson's r) under two-sided paired t-test conditions for the DocLens score with the FORTE categories and traditional metrics, offering a deeper understanding of their applicability and limitations in radiological report evaluation. ...

DocLens: Multi-aspect Fine-grained Medical Text Evaluation
  • Citing Conference Paper
  • January 2024

... For example, GPT-based systems analyze student inputs to provide context-aware responses, making them particularly effective in knowledge tracing and personalized tutoring. Research further suggests that LLMs are integral to improving student satisfaction by creating dynamic, on-demand learning experiences [9]. ...

Generating Situated Reflection Triggers About Alternative Solution Paths: A Case Study of Generative AI for Computer-Supported Collaborative Learning
  • Citing Chapter
  • July 2024

... Emerging technologies and the need to tackle complex, unprecedented problems will shape future labor markets, creating new sectors and roles that do not yet exist. In this context, it is essential to design educational approaches that foster the development of key competencies, such as critical thinking, collaboration, effective communication, and creative thinking (Araya, 2023;Kafai et al., 2024). ...

What Does it Mean to be Literate in the Time of AI? Different Perspectives on Learning and Teaching AI Literacies in K-12 Education
  • Citing Conference Paper
  • June 2024

... For those who have embraced the advancement of AI, staff members indicate that digital literacy skills are essential in the education philosophy as the same skills impact how educators and administrators work, student engagement and the productivity of research (Prior et al., 2016). However, many faculty members express their challenges with integrating AI into education, such as a lack of AI teaching tools, confidence in delivering and using AI in day-to-day operations and a lack of resources (Tatar et al., 2024). Despite research showing the benefit of using AI in increasing productivity in HEIs for both students and teachers, there is a gap between educators' awareness of the use of AI technologies and the actual implication of the application in higher education settings (Gaber et al., 2023;Xiao et al., 2024). ...

Exploring Teachers’ Views and Confidence in the Integration of an Artificial Intelligence Curriculum into Their Classrooms: a Case Study of Curricular Co-Design Program
  • Citing Article
  • May 2024

International Journal of Artificial Intelligence in Education

... A study at the University for Development Studies revealed that 76.9% of students reported using AI tools, with 31.6% using them on a daily basis (Iddrisu et al., 2025). This aligns with the findings of Jiang et al. (2024), which indicate that marginalized female students perform better in machine learning and artificial intelligence tasks when incorporating diverse cultural perspectives and holistic language analysis. Table 3 presents the level of efficiency among teachers when using AI-powered teaching tools when grouped to sex. ...

Towards inclusivity in AI: A comparative study of cognitive engagement between marginalized female students and peers

... Such a restricted understanding of young people's stances toward and practices with AI fails to capture the complexities of how they are making sense of and composing with generative AI in their everyday lives, an oversight with increasingly urgent implications as the most-used commercial composing platforms (e.g., Microsoft Word, Google Docs) race to embed AI tools in their interfaces. Emerging scholarship offers an initial glimpse into how young people understand the tensions and dilemmas they face as generative AI complicates notions of "reading" and "writing" (e.g., Malmström et al., 2023;Morales-Navarro et al., 2023;Singer, 2023), but young people's voices are still absent from many conversations about AI and writing (e.g., Tate et al., 2023), even as some scholars begin to think with youth about AI and media (e.g., Lee et al., 2022). Young people, particularly those from nondominant communities, are often left out of conversations that directly impact their futures. ...

Making Sense of Machine Learning: Integrating Youth’s Conceptual, Creative, and Critical Understandings of AI

... Code translation refers to translating code from one programming language to another while preserving the functionality of the source code [1][2][3][4]. It has broad applications, including refactoring code written in outdated languages [5], transitioning from simple but slow languages to more complex and faster ones [6], enabling programming language migration in software development [7][8][9][10][11][12][13][14], and addressing data scarcity issues through synthetic data generation [15,16]. Automatic code translation can significantly reduce manual effort and has thus garnered widespread attention in recent years [17][18][19][20][21][22][23][24][25]. ...

Data Augmentation for Code Translation with Comparable Corpora and Multiple References
  • Citing Conference Paper
  • January 2023