Stephen H. Bach’s research while affiliated with Brown University and other places


Publications (55)


[Figure 2: How often each language is studied ("Frequency") and the average number of languages covered by the papers in which it appears ("Paper's Average Multilinguality").]
The State of Multilingual LLM Safety Research: From Measuring the Language Gap to Mitigating It
  • Preprint
  • File available

May 2025 · 4 Reads

Zheng-Xin Yong · Beyza Ermis · [...]
This paper presents a comprehensive analysis of the linguistic diversity of LLM safety research, highlighting the English-centric nature of the field. Through a systematic review of nearly 300 publications from 2020--2024 across major NLP conferences and workshops at *ACL, we identify a significant and growing language gap in LLM safety research, with even high-resource non-English languages receiving minimal attention. We further observe that non-English languages are rarely studied as standalone languages and that English safety research exhibits poor language documentation practices. To motivate future research into multilingual safety, we make several recommendations based on our survey, and we then pose three concrete future directions on safety evaluation, training data generation, and crosslingual safety generalization. Based on our survey and proposed directions, the field can develop more robust, inclusive AI safety practices for diverse global populations.


Crosslingual Reasoning through Test-Time Scaling

May 2025 · 2 Reads

Reasoning capabilities of large language models are primarily studied for English, even when pretrained models are multilingual. In this work, we investigate to what extent English reasoning finetuning with long chains of thought (CoTs) can generalize across languages. First, we find that scaling up inference compute for English-centric reasoning language models (RLMs) improves multilingual mathematical reasoning across many languages, including low-resource languages, to an extent where they outperform models twice their size. Second, we reveal that while English-centric RLMs' CoTs are naturally predominantly English, they consistently follow a quote-and-think pattern to reason about quoted non-English inputs. Third, we discover an effective strategy to control the language of long CoT reasoning, and we observe that models reason better and more efficiently in high-resource languages. Finally, we observe poor out-of-domain reasoning generalization, in particular from STEM to cultural commonsense knowledge, even for English. Overall, we demonstrate the potential, study the mechanisms, and outline the limitations of crosslingual generalization of English reasoning test-time scaling. We conclude that practitioners should let English-centric RLMs reason in high-resource languages, while further work is needed to improve reasoning in low-resource languages and out-of-domain contexts.
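The abstract's closing recommendation, steering an English-centric RLM to reason in a high-resource language while giving it more test-time compute, can be approximated at the prompt level. Below is a minimal sketch of that idea; the model name and prompt wording are assumptions for illustration, not the paper's setup.

```python
# Minimal sketch (assumed model name and prompt template) of steering the
# language of a reasoning model's long chain-of-thought and scaling test-time
# compute via the generation budget. Not the authors' exact method.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed English-centric RLM

def reason_in_language(question: str, think_language: str = "French") -> str:
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL)
    # Ask the model to keep its chain-of-thought in a chosen high-resource language.
    prompt = (
        f"Question: {question}\n"
        f"Think step by step. Write all of your reasoning in {think_language}, "
        f"then give the final answer on its own line.\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    # A larger max_new_tokens budget corresponds to more test-time compute for the long CoT.
    outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```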


Beyond Contrastive Learning: Synthetic Data Enables List-wise Training with Multiple Levels of Relevance

March 2025 · 4 Reads

Recent advancements in large language models (LLMs) have allowed the augmentation of information retrieval (IR) pipelines with synthetic data in various ways. Yet, the main training paradigm remains: contrastive learning with binary relevance labels and the InfoNCE loss, where one positive document is compared against one or more negatives. This objective treats all documents that are not explicitly annotated as relevant on an equally negative footing, regardless of their actual degree of relevance, thus (a) missing subtle nuances that are useful for ranking and (b) being susceptible to annotation noise. To overcome this limitation, in this work we forgo real training documents and annotations altogether and use open-source LLMs to directly generate synthetic documents that answer real user queries according to several different levels of relevance. This fully synthetic ranking context of graduated relevance, together with an appropriate list-wise loss (Wasserstein distance), enables us to train dense retrievers in a way that better captures the ranking task. Experiments on various IR datasets show that our proposed approach outperforms conventional training with InfoNCE by a large margin. Without using any real documents for training, our dense retriever significantly outperforms the same retriever trained through self-supervision. More importantly, it matches the performance of the same retriever trained on real, labeled training documents of the same dataset, while being more robust to distribution shift and clearly outperforming it when evaluated zero-shot on the BEIR dataset collection.
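As a rough illustration of the list-wise objective described above, the sketch below pushes a retriever's score distribution over a candidate list toward a target distribution derived from graded relevance labels, using a 1-D Wasserstein distance. This is a simplified sketch, not the paper's implementation; in particular, treating list positions as an ordered 1-D support is an assumption made here for brevity.

```python
# A minimal PyTorch sketch of a list-wise loss over graded (multi-level) relevance:
# match the softmax over the retriever's scores to a target distribution from the
# labels via the Wasserstein-1 distance between their cumulative distributions.
import torch
import torch.nn.functional as F

def listwise_wasserstein_loss(scores: torch.Tensor, relevance: torch.Tensor) -> torch.Tensor:
    """scores: (list_len,) similarity scores for one query's candidate documents.
    relevance: (list_len,) graded labels, e.g. 3 > 2 > 1 > 0 for decreasing relevance."""
    pred = F.softmax(scores, dim=-1)                      # predicted distribution over the list
    target = relevance.float() / relevance.float().sum()  # target distribution from graded labels
    # Wasserstein-1 between the two distributions, taking list positions as the 1-D support.
    return torch.abs(torch.cumsum(pred - target, dim=-1)).sum()

# Example: four synthetic documents for one query, generated at relevance levels 3..0.
scores = torch.tensor([2.1, 1.4, 0.3, -0.5], requires_grad=True)
labels = torch.tensor([3, 2, 1, 0])
loss = listwise_wasserstein_loss(scores, labels)
loss.backward()
```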


K-Paths: Reasoning over Graph Paths for Drug Repurposing and Drug Interaction Prediction

February 2025 · 5 Reads

Drug discovery is a complex and time-intensive process that requires identifying and validating new therapeutic candidates. Computational approaches using large-scale biomedical knowledge graphs (KGs) offer a promising solution to accelerate this process. However, extracting meaningful insights from large-scale KGs remains challenging due to the complexity of graph traversal. Existing subgraph-based methods are tailored to graph neural networks (GNNs), making them incompatible with other models, such as large language models (LLMs). We introduce K-Paths, a retrieval framework that extracts structured, diverse, and biologically meaningful paths from KGs. Integrating these paths enables LLMs and GNNs to effectively predict unobserved drug-drug and drug-disease interactions. Unlike traditional path-ranking approaches, K-Paths retrieves and transforms paths into a structured format that LLMs can directly process, facilitating explainable reasoning. K-Paths employs a diversity-aware adaptation of Yen's algorithm to retrieve the K shortest loopless paths between entities in an interaction query, prioritizing biologically relevant and diverse relationships. Our experiments on benchmark datasets show that K-Paths improves Llama 3.1 8B's zero-shot F1-score by 12.45 points on drug repurposing and 13.42 points on interaction severity prediction. We also show that Llama 70B achieves F1-score gains of 6.18 and 8.46 points, respectively. K-Paths also improves the supervised training efficiency of EmerGNN, a state-of-the-art GNN, by reducing KG size by 90% while maintaining strong predictive performance. Beyond its scalability and efficiency, K-Paths uniquely bridges the gap between KGs and LLMs, providing explainable rationales for predicted interactions. These capabilities show that K-Paths is a valuable tool for efficient data-driven drug discovery.
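To make the retrieve-and-verbalize pattern concrete, here is a minimal sketch, not the released K-Paths code: networkx's shortest_simple_paths yields loopless paths in increasing length order via a variant of Yen's algorithm, a naive edge-overlap filter stands in for the paper's diversity-aware adaptation, and the paths are rendered as text an LLM can consume. The toy graph and relation names are made up for illustration.

```python
# Sketch: K loopless paths between two KG entities, filtered for diversity and
# verbalized into an LLM-readable prompt fragment.
import networkx as nx

def k_diverse_paths(G: nx.DiGraph, src: str, dst: str, k: int = 3, max_overlap: float = 0.5):
    kept, kept_edges = [], []
    for path in nx.shortest_simple_paths(G, src, dst):          # Yen's algorithm (loopless paths)
        edges = set(zip(path, path[1:]))
        # Naive diversity criterion: skip paths that mostly reuse already-kept edges.
        if any(len(edges & e) / len(edges) > max_overlap for e in kept_edges):
            continue
        kept.append(path)
        kept_edges.append(edges)
        if len(kept) == k:
            break
    return kept

def verbalize(G: nx.DiGraph, paths) -> str:
    lines = []
    for path in paths:
        hops = [f"{u} --{G[u][v].get('relation', 'related_to')}--> {v}"
                for u, v in zip(path, path[1:])]
        lines.append(" ; ".join(hops))
    return "\n".join(lines)

# Toy example: a drug-disease query over a tiny illustrative KG.
G = nx.DiGraph()
G.add_edge("DrugA", "GeneX", relation="inhibits")
G.add_edge("GeneX", "DiseaseY", relation="associated_with")
G.add_edge("DrugA", "DiseaseY", relation="indicated_for")
print(verbalize(G, k_diverse_paths(G, "DrugA", "DiseaseY", k=2)))
```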



100K or 100 Days: Trade-offs when Pre-Training with Academic Resources

October 2024 · 5 Reads · 5 Citations

Pre-training is notoriously compute-intensive and academic researchers are notoriously under-resourced. It is, therefore, commonly assumed that academics can't pre-train models. In this paper, we seek to clarify this assumption. We first survey academic researchers to learn about their available compute and then empirically measure the time to replicate models on such resources. We introduce a benchmark to measure the time to pre-train models on given GPUs and also identify ideal settings for maximizing training speed. We run our benchmark on a range of models and academic GPUs, spending 2,000 GPU-hours on our experiments. Our results reveal a brighter picture for academic pre-training: for example, although Pythia-1B was originally trained on 64 GPUs for 3 days, we find it is also possible to replicate this model (with the same hyper-parameters) in 3x fewer GPU-days: i.e. on 4 GPUs in 18 days. We conclude with a cost-benefit analysis to help clarify the trade-offs between price and pre-training time. We believe our benchmark will help academic researchers conduct experiments that require training larger models on more data. We fully release our codebase at: https://github.com/apoorvkh/academic-pretraining.
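The headline comparison reduces to GPU-day arithmetic, sketched below. The per-GPU-hour price is an illustrative assumption added here, not a figure from the paper.

```python
# Back-of-the-envelope reproduction of the GPU-days comparison quoted above.
def gpu_days(num_gpus: int, days: float) -> float:
    return num_gpus * days

original = gpu_days(64, 3)    # Pythia-1B's original run: 192 GPU-days
academic = gpu_days(4, 18)    # replicated setting from the abstract: 72 GPU-days
print(original / academic)    # ~2.7x fewer GPU-days, i.e. roughly the "3x" in the abstract

# Illustrative cost comparison at an assumed hourly GPU price (USD), for the trade-off
# between price and wall-clock pre-training time.
PRICE_PER_GPU_HOUR = 2.0      # assumption for illustration only
for label, gd in [("64 GPUs x 3 days", original), ("4 GPUs x 18 days", academic)]:
    print(label, gd * 24 * PRICE_PER_GPU_HOUR, "USD")
```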


Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages

July 2024 · 2 Reads · 1 Citation

Many recent works have explored using language models for planning problems. One line of research focuses on translating natural language descriptions of planning tasks into structured planning languages, such as the planning domain definition language (PDDL). While this approach is promising, accurately measuring the quality of generated PDDL code continues to pose significant challenges. First, generated PDDL code is typically evaluated using planning validators that check whether the problem can be solved with a planner. This method is insufficient because a language model might generate valid PDDL code that does not align with the natural language description of the task. Second, existing evaluation sets often have natural language descriptions of the planning task that closely resemble the ground truth PDDL, reducing the challenge of the task. To bridge this gap, we introduce Planetarium, a benchmark designed to evaluate language models' ability to generate PDDL code from natural language descriptions of planning tasks. We begin by creating a PDDL equivalence algorithm that rigorously evaluates the correctness of PDDL code generated by language models by flexibly comparing it against a ground truth PDDL. Then, we present a dataset of 132,037 text-to-PDDL pairs across 13 different tasks, with varying levels of difficulty. Finally, we evaluate several API-access and open-weight language models that reveal this task's complexity. For example, 87.6% of the PDDL problem descriptions generated by GPT-4o are syntactically parseable, 82.2% are valid, solvable problems, but only 35.1% are semantically correct, highlighting the need for a more rigorous benchmark for this problem.
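The three evaluation levels form a funnel, which the small sketch below makes explicit. The data structure is assumed for illustration, not the benchmark's actual code, and the demo numbers are merely shaped like the GPT-4o figures quoted above.

```python
# Sketch of the parseable -> valid/solvable -> semantically-equivalent funnel.
from dataclasses import dataclass

@dataclass
class EvalResult:
    parseable: bool          # does the generated PDDL parse at all?
    solvable: bool           # does a planner find a plan for it?
    equivalent: bool         # is it equivalent to the ground-truth PDDL (the hard check)?

def funnel(results: list[EvalResult]) -> dict:
    n = len(results)
    return {
        "parseable": sum(r.parseable for r in results) / n,
        "valid_solvable": sum(r.parseable and r.solvable for r in results) / n,
        "semantically_correct": sum(r.parseable and r.solvable and r.equivalent for r in results) / n,
    }

# Illustrative split over 100 generations: valid-and-solvable does not imply correct.
demo = ([EvalResult(True, True, True)] * 35 + [EvalResult(True, True, False)] * 47
        + [EvalResult(True, False, False)] * 6 + [EvalResult(False, False, False)] * 12)
print(funnel(demo))   # ~0.88 parseable, ~0.82 solvable, ~0.35 semantically correct
```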


Preference Tuning For Toxicity Mitigation Generalizes Across Languages

June 2024 · 7 Reads

Detoxifying multilingual Large Language Models (LLMs) has become crucial due to their increasing global use. In this work, we explore zero-shot cross-lingual generalization of preference tuning in detoxifying LLMs. Unlike previous studies that show limited cross-lingual generalization for other safety tasks, we demonstrate that Direct Preference Optimization (DPO) training with only English data can significantly reduce toxicity in multilingual open-ended generations. For example, the probability of mGPT-1.3B generating toxic continuations drops from 46.8% to 3.9% across 17 different languages after training. Our results also extend to other multilingual LLMs, such as BLOOM, Llama3, and Aya-23. Using mechanistic interpretability tools like causal intervention and activation analysis, we identify the dual multilinguality property of MLP layers in LLMs, which explains the cross-lingual generalization of DPO. Finally, we show that bilingual sentence retrieval can predict the cross-lingual transferability of DPO preference tuning.
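For reference, the DPO objective used in this kind of preference tuning needs only the log-probabilities of the preferred (non-toxic) and dispreferred (toxic) continuations under the policy and a frozen reference model. The sketch below is the standard DPO loss, not the authors' training code, and the numbers in the toy example are made up.

```python
# Standard DPO loss: maximize the reward margin between chosen (non-toxic) and
# rejected (toxic) continuations, with the reference model as an anchor.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """All inputs are summed log-probabilities of the chosen and rejected
    continuations under the policy and the frozen reference model."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy example with a batch of two English preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-15.0, -8.0]),
                torch.tensor([-13.0, -10.0]), torch.tensor([-14.0, -9.0]))
```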


Language Models in the Loop: Incorporating Prompting into Weak Supervision

April 2024

·

11 Reads

·

51 Citations

ACM / IMS Journal of Data Science

We propose a new strategy for applying large pre-trained language models to novel tasks when labeled training data is limited. Rather than apply the model in a typical zero-shot or few-shot fashion, we treat the model as the basis for labeling functions in a weak supervision framework. To create a classifier, we first prompt the model to answer multiple distinct queries about an example and define how the possible responses should be mapped to votes for labels and abstentions. We then denoise these noisy label sources using the Snorkel system and train an end classifier with the resulting training data. Our experimental evaluation shows that prompting large language models within a weak supervision framework can provide significant gains in accuracy. On the WRENCH weak supervision benchmark, this approach significantly improves over zero-shot performance, with an average 19.5% reduction in errors. We also find that this approach produces classifiers with comparable or superior accuracy to those trained from hand-engineered rules.
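A compressed sketch of that loop follows: each prompt becomes a labeling function whose response maps to a vote or an abstention, and Snorkel's label model denoises the resulting label matrix before an end classifier is trained on the probabilistic labels. The prompts, label map, and the ask_llm placeholder are illustrative assumptions, not the paper's released code.

```python
# Sketch: prompted labeling functions + Snorkel label model for denoising.
import numpy as np
from snorkel.labeling.model import LabelModel

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("call your LLM of choice here")  # placeholder, not a real API

def lf_from_prompt(template: str, label_map: dict):
    """Turn a yes/no prompt into a labeling function: map responses to votes, else abstain."""
    def lf(example_text: str) -> int:
        answer = ask_llm(template.format(text=example_text)).strip().lower()
        return label_map.get(answer, ABSTAIN)
    return lf

lfs = [
    lf_from_prompt("Is the following review positive? Answer yes or no.\n{text}",
                   {"yes": POSITIVE, "no": NEGATIVE}),
    lf_from_prompt("Would the author recommend this product? Answer yes or no.\n{text}",
                   {"yes": POSITIVE, "no": NEGATIVE}),
]

def build_label_matrix(texts: list[str]) -> np.ndarray:
    return np.array([[lf(t) for lf in lfs] for t in texts])

# L_train = build_label_matrix(unlabeled_texts)
# label_model = LabelModel(cardinality=2)
# label_model.fit(L_train)
# probabilistic_labels = label_model.predict_proba(L_train)  # then train the end classifier
```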



Citations (32)


... Extracting text from school textbooks has been demonstrated as an effective approach for curating high-quality datasets in low-resource languages (Anand et al., 2024d,b,a,c). Furthermore, the APS model can serve as a reward model to align LLMs for generating more informative answers, facilitated by differential performance preference tuning algorithms (Li et al., 2024b; Naseem et al., 2024). ...

Reference:

Long-Context Non-Factoid Question Answering in Indic Languages
Preference Tuning For Toxicity Mitigation Generalizes Across Languages
  • Citing Conference Paper
  • January 2024

... However, the current practice of generating multilingual reasoning training data simply through translation technology [39,17] is insufficient, as it is well-established that translation models including LLMs still suffer from poor cultural alignment and Western-centric bias [65,66] as well as poor translation performance with low-resource languages [67,68,69]. Future work should systematically explore the effectiveness of multilingual augmentation techniques such as back-translation or synthetic data generation [70,71]. ...

LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons
  • Citing Conference Paper
  • January 2024

... Prompt embeddings can be learned for fitting a specific task [44] or a test sample [53], integrating connections between training and test distribution. They can also be mined with the help of an external LLM [15]. Our approach is orthogonal to most of these strategies. ...

If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions
  • Citing Conference Paper
  • January 2024

... However, they commonly relied on proprietary LLMs accessed through APIs. As revealed by recent evaluations [19][20][21] and Section 4.6, these proprietary models often underperform specialized finetuned models [20,22,23]. Additionally, using proprietary language models through external services may encounter latency issues or reduced availability, especially during periods of high traffic or service outages. ...

Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation
  • Citing Conference Paper
  • January 2024

... Unlike traditional supervised approaches, we did not use the training set for model learning. Instead, we employed prompt-based weak labeling and an ensemble verification mechanism (Smith et al., 2024). The test set contained unlabeled examples, and final system evaluation was conducted by the task organizers. ...

Language Models in the Loop: Incorporating Prompting into Weak Supervision
  • Citing Article
  • April 2024

ACM / IMS Journal of Data Science

... The OpenGPT-X project initially adopted the Megatron-DeepSpeed codebase, developed by NVIDIA, extended by Microsoft researchers, and further adapted during the BigScience research workshop [47]. Other codebases, such as Meta's Open Pretrained Transformer (OPT) [31], also emerged, promising potential advantages in abstraction and usability. ...

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

... Explanations for weakly supervised systems. There is a large body of work on explaining the results of weakly supervised systems, targeting either improving the final model or better involving human annotators [5,7,14,34,[37][38][39]. For instance, [37] uses influence functions to identify LFs responsible for erroneous labels; WeShap [14] measures the Shapley value of LFs to rank and prune LFs. ...

Alfred: A System for Prompted Weak Supervision
  • Citing Conference Paper
  • January 2023

... Qin et al. 2023; Gilardi, Alizadeh, and Kubli 2023). While numerous guides for LLM prompting exist (DAIR 2023; Giray 2023; Akin 2023; Bach et al. 2022), they do not all offer the same guidance and leave many empirical questions unanswered. For example, most guides suggest making the prompts as "descriptive and detailed" as possible (DAIR 2023). ...

PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts

... It is a popular choice for research and development projects, supported by a large developer community. The flexibility of customization makes it applicable to a wide variety of projects [24]. ...

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

... In white-box models, the internal logic is fully accessible and interpretable, providing a ground-truth rationale against which generated explanantia can be directly evaluated. Typical white-box models used for comparison include linear regressors (Crabbe et al., 2020; Dai et al., 2022), feature-additive models (Carmichael & Scheirer, 2023), small neural networks with manually set parameters (Antwarg et al., 2019; Brandt et al., 2023), or symbolic models such as decision trees (Ribeiro et al., 2016). Given that the model's reasoning is known, explanation methods can be assessed by comparing their output against this ground-truth explanans. ...

Fairness via Explanation Quality: Evaluating Disparities in the Quality of Post hoc Explanations
  • Citing Conference Paper
  • July 2022