Alexander M. Rush’s research while affiliated with Cornell University and other places


Publications (182)


Compute-Constrained Data Selection
  • Preprint
  • File available

October 2024 · 2 Reads

Junjie Oscar Yin · Alexander M. Rush

Data selection can reduce the amount of training data needed to finetune LLMs; however, the efficacy of data selection scales directly with its compute. Motivated by the practical challenge of compute-constrained finetuning, we consider the setting in which both the cost of selecting data and the cost of training fall under a single budget. We first formalize the problem of data selection with a cost-aware utility function, modeling data selection as a trade-off between initial selection cost and training gain. We run a comprehensive sweep of experiments across multiple tasks, varying the compute budget by scaling finetuning tokens, model sizes, and data selection compute. These experiments validate the cost-aware model in real-world settings. Interestingly, we find that many powerful data selection methods are almost never compute-optimal, and that cheaper data selection alternatives dominate from both a theoretical and an empirical perspective.
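
A minimal sketch of the cost-aware framing, under assumed names: a hypothetical `pick_method` compares selection methods by the gain they buy once their selection cost is charged against a shared FLOPs budget. The gain functions and the 6ND training-FLOPs rule of thumb are illustrative, not the paper's exact formulation.

```python
def training_flops(tokens: float, model_params: float) -> float:
    # Rule-of-thumb training cost: ~6 * parameters * tokens FLOPs.
    return 6.0 * model_params * tokens

def utility(gain: float, selection_cost: float, train_cost: float,
            budget: float) -> float:
    # A method is infeasible if selection plus training exceeds the budget;
    # otherwise its utility is the training gain it buys.
    return gain if selection_cost + train_cost <= budget else float("-inf")

def pick_method(methods: dict, budget: float, model_params: float):
    """Pick the selection method with the highest utility under the budget.

    `methods` maps a name to (selection_cost_flops, gain_fn, tokens), where
    gain_fn(tokens) is an assumed estimate of the downstream gain from
    finetuning on `tokens` selected tokens.
    """
    best_name, best_u = None, float("-inf")
    for name, (sel_cost, gain_fn, tokens) in methods.items():
        u = utility(gain_fn(tokens), sel_cost,
                    training_flops(tokens, model_params), budget)
        if u > best_u:
            best_name, best_u = name, u
    return best_name, best_u
```

In this framing, random selection is nearly free, while perplexity- or gradient-based selection spends roughly a forward pass per candidate example, consistent with the abstract's finding that expensive selectors are rarely compute-optimal.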


Contextual Document Embeddings

October 2024 · 9 Reads

Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. In this work, we argue that these embeddings, while effective, are implicitly out-of-context for targeted use cases of retrieval, and that a contextualized document embedding should take into account both the document and neighboring documents in context, analogous to contextualized word embeddings. We propose two complementary methods for contextualized document embeddings: first, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss; second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation. Results show that both methods achieve better performance than biencoders in several settings, with differences especially pronounced out-of-domain. We achieve state-of-the-art results on the MTEB benchmark with no hard negative mining, score distillation, dataset-specific instructions, intra-GPU example-sharing, or extremely large batch sizes. Our method can be applied to improve performance on any contrastive learning dataset and any biencoder.
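
As a rough illustration of the first method, the sketch below implements an InfoNCE-style loss in which the in-batch negatives are assumed to be neighboring documents; the batching-by-neighborhood and the temperature value are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def contextual_contrastive_loss(query_emb: torch.Tensor,
                                doc_emb: torch.Tensor,
                                temperature: float = 0.02) -> torch.Tensor:
    """InfoNCE over a batch assumed to be built from neighboring documents.

    query_emb, doc_emb: (B, D) L2-normalized embeddings. Row i of doc_emb
    is the positive for row i of query_emb; the remaining rows, being
    neighbors of the query's document, act as hard in-batch negatives.
    """
    logits = query_emb @ doc_emb.T / temperature            # (B, B) similarities
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)                  # positives on diagonal
```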


[Figures and tables: overview of results across extension types; training hyperparameters for the long-sequence methods; the scale factor and its relationship with perplexity on PG19 (computed on the first 2 documents only); LongLoRA perplexity as reported and as reproduced on PG19 and Proof-pile; LongBench results, where N-32 and N-64 denote NTK finetuned at 32K and 64K context lengths, Inf denotes LM-Infinite, SE denotes Self-Extend, LLR denotes LongLoRA, AvgLen is the average dataset length, Train Len is the longest example seen during training or finetuning, Eval Len is the maximum input-prompt length, and ✓ marks exact-attention methods.]

A Controlled Study on Long Context Extension and Generalization in LLMs

September 2024 · 11 Reads

Yi Lu · Jing Nathan Yan · Songlin Yang · [...] · Alexander M. Rush

Broad textual understanding and in-context learning require language models that utilize full document contexts. Due to the implementation challenges associated with directly training long-context models, many methods have been proposed for extending models to handle long contexts. However, owing to differences in data and model classes, it has been challenging to compare these approaches, leading to uncertainty as to how to evaluate long-context performance and whether it differs from standard evaluation. We implement a controlled protocol for extension methods with a standardized evaluation, utilizing consistent base models and extension data. Our study yields several insights into long-context behavior. First, we reaffirm the critical role of perplexity as a general-purpose performance indicator even in longer-context tasks. Second, we find that current approximate attention methods systematically underperform across long-context tasks. Finally, we confirm that exact fine-tuning based methods are generally effective within the range of their extension, whereas extrapolation remains challenging. All codebases, models, and checkpoints will be made available open-source, promoting transparency and facilitating further research in this critical area of AI development.
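
For concreteness, here is a small sketch of two common exact-attention extension tricks in the family this study compares: linear position interpolation and NTK-style rescaling of RoPE frequencies. These are the standard community formulations; the paper's exact configurations may differ.

```python
import numpy as np

def rope_frequencies(dim: int, base: float = 10000.0) -> np.ndarray:
    # Inverse frequencies for rotary position embeddings (RoPE).
    return 1.0 / (base ** (np.arange(0, dim, 2) / dim))

def position_interpolation(positions: np.ndarray, scale: float) -> np.ndarray:
    # Linear interpolation: compress positions so an extended context
    # maps back into the range seen during pretraining.
    return positions / scale

def ntk_scaled_frequencies(dim: int, scale: float,
                           base: float = 10000.0) -> np.ndarray:
    # NTK-aware scaling: enlarge the RoPE base so low frequencies stretch
    # more than high ones, extending context with less finetuning.
    return rope_frequencies(dim, base * scale ** (dim / (dim - 2)))
```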


[Figures: Fig. 1, transferring a Transformer to Mamba: weights (in orange) are initialized from the Transformer, with Mamba's linear projections for C, B, and X initialized from the attention projections for Q, K, and V, respectively; attention heads are replaced with Mamba heads, and Mamba blocks are finetuned while MLP blocks stay frozen; shapes are kept largely the same, and new parameters (in green) are introduced for the learned A and ∆. Fig. 3, performance of the multi-step SSM kernel when generating 32 tokens.]
The Mamba in the Llama: Distilling and Accelerating Hybrid Models

August 2024 · 47 Reads

Linear RNN architectures, like Mamba, can be competitive with Transformer models in language modeling while having advantageous deployment characteristics. Given the focus on training large-scale Transformer models, we consider the challenge of converting these pretrained models for deployment. We demonstrate that it is feasible to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers with academic GPU resources. The resulting hybrid model, which incorporates a quarter of the attention layers, achieves performance comparable to the original Transformer in chat benchmarks and outperforms open-source hybrid Mamba models trained from scratch with trillions of tokens in both chat benchmarks and general benchmarks. Moreover, we introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and hybrid models. Overall, we show how, with limited computation resources, we can remove many of the original attention layers and generate from the resulting model more efficiently. Our top-performing model, distilled from Llama3-8B-Instruct, achieves a 29.61 length-controlled win rate on AlpacaEval 2 against GPT-4 and 7.35 on MT-Bench, surpassing the best instruction-tuned linear RNN model.
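
A minimal sketch of the weight-reuse step, with hypothetical module handles: the Mamba block's C, B, and X projections are initialized from a pretrained attention layer's Q, K, and V projections, after which the new SSM parameters are trained while the MLP blocks stay frozen.

```python
import torch
import torch.nn as nn

def init_mamba_from_attention(proj_C: nn.Linear, proj_B: nn.Linear,
                              proj_X: nn.Linear, attn_q: nn.Linear,
                              attn_k: nn.Linear, attn_v: nn.Linear) -> None:
    """Copy attention projections into the matching Mamba projections.

    proj_C/B/X are assumed handles to the Mamba block's input projections;
    attn_q/k/v come from the pretrained Transformer layer being replaced.
    The learned A and Delta parameters are new and keep their fresh
    initialization for the distillation finetuning stage.
    """
    with torch.no_grad():
        proj_C.weight.copy_(attn_q.weight)   # C <- Q
        proj_B.weight.copy_(attn_k.weight)   # B <- K
        proj_X.weight.copy_(attn_v.weight)   # X <- V
```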


Great Memory, Shallow Reasoning: Limits of kNN-LMs

August 2024 · 11 Reads

k-nearest neighbor language models (kNN-LMs), which integrate retrieval with next-word prediction, have demonstrated strong performance in language modeling as well as downstream NLP benchmarks. These results have led researchers to argue that models trained on poor quality or outdated data could perform well by employing a kNN extension that has access to a higher-quality datastore. In this work, we ask whether this improved ability to recall information really translates into downstream abilities. We extensively evaluate kNN-LMs on a diverse set of tasks, ranging from sentiment classification and commonsense reasoning to multi-hop reasoning. Results show that kNN-LMs excel at memory-intensive tasks, where utilizing the patterns in the input is sufficient for determining the output, but struggle with reasoning tasks that require integrating multiple pieces of information to derive new knowledge. We further demonstrate through oracle experiments and qualitative analysis that even with perfect retrieval, kNN-LMs still fail to determine the correct answers, placing an upper bound on their reasoning performance. Code and datastores are released at https://github.com/GSYfate/knnlm-limits/.
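
For reference, the standard kNN-LM interpolation the paper evaluates blends the base LM's next-token distribution with one induced by retrieved neighbors; the sketch below uses illustrative shapes and an assumed interpolation weight.

```python
import numpy as np

def knn_lm_probs(lm_probs, neighbor_dists, neighbor_tokens, vocab_size: int,
                 lam: float = 0.25, temperature: float = 1.0) -> np.ndarray:
    """p(y|x) = lam * p_kNN(y|x) + (1 - lam) * p_LM(y|x).

    neighbor_dists: (k,) distances from the query context embedding to its
    k retrieved datastore contexts; neighbor_tokens: (k,) the token that
    followed each retrieved context in the datastore.
    """
    weights = np.exp(-np.asarray(neighbor_dists, dtype=float) / temperature)
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    for w, tok in zip(weights, neighbor_tokens):
        p_knn[tok] += w                  # aggregate neighbor mass per token
    return lam * p_knn + (1.0 - lam) * np.asarray(lm_probs)
```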


[Figures and tables: Fig. 2, example GPT-4 outputs: via prompting, the model detects unanswerable questions, then reformulates them with a second prompt; Fig. 4, qualitative analysis of a rule-based heuristic for gaming the proposed success-rate metric; aspects of question reformulation (Ans: whether the reformulated question can be answered by the document; Rel: whether it remains relevant to the original question); standard zero-shot prompting compared against baselines that attempt to game the success-rate metric.]
I Could've Asked That: Reformulating Unanswerable Questions

July 2024 · 12 Reads

When seeking information from unfamiliar documents, users frequently pose questions that cannot be answered by the documents. While existing large language models (LLMs) identify these unanswerable questions, they do not assist users in reformulating their questions, thereby reducing their overall utility. We curate CouldAsk, an evaluation benchmark composed of existing and new datasets for document-grounded question answering, specifically designed to study reformulating unanswerable questions. We evaluate state-of-the-art open-source and proprietary LLMs on CouldAsk. The results demonstrate the limited capabilities of these models in reformulating questions. Specifically, GPT-4 and Llama2-7B successfully reformulate questions only 26% and 12% of the time, respectively. Error analysis shows that 62% of the unsuccessful reformulations stem from the models merely rephrasing the questions or even generating identical questions. We publicly release the benchmark and the code to reproduce the experiments.
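
A sketch of the two-stage prompting pattern shown in Figure 2, with hypothetical prompt wording and an assumed `ask_llm` wrapper that maps a prompt string to the model's text response: first detect answerability, then reformulate.

```python
DETECT_PROMPT = (
    "Document:\n{document}\n\nQuestion: {question}\n"
    "Can this question be answered using only the document? Answer yes or no."
)

REFORMULATE_PROMPT = (
    "Document:\n{document}\n\nThe question \"{question}\" cannot be answered "
    "from the document. Rewrite it into a closely related question that the "
    "document can answer."
)

def reformulate_if_unanswerable(ask_llm, document: str, question: str) -> str:
    """Two-stage pattern: detect answerability, then reformulate."""
    verdict = ask_llm(DETECT_PROMPT.format(document=document, question=question))
    if verdict.strip().lower().startswith("yes"):
        return question  # answerable as-is; no reformulation needed
    return ask_llm(REFORMULATE_PROMPT.format(document=document, question=question))
```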



[Figure 1: (Left) the proposed masked diffusion language model (MDLM) is trained using a weighted average of masked cross-entropy losses. (Top right) In comparison to masked language models (MLMs), MDLM's objective corresponds to a principled variational lower bound and supports generation via ancestral sampling. (Bottom right) Perplexity (PPL) on the One Billion Words benchmark. Table: GLUE evaluation results, where the measures (↑) are F1 for QQP and MRPC, Spearman correlation for STS-B, and accuracy for the rest (match/mismatch accuracies for MNLI).]
Simple and Effective Masked Diffusion Language Models

June 2024 · 28 Reads

While diffusion models excel at generating high-quality images, prior work reports a significant performance gap between diffusion and autoregressive (AR) methods in language modeling. In this work, we show that simple masked discrete diffusion is more performant than previously thought. We apply an effective training recipe that improves the performance of masked diffusion models and derive a simplified, Rao-Blackwellized objective that results in additional improvements. Our objective has a simple form -- it is a mixture of classical masked language modeling losses -- and can be used to train encoder-only language models that admit efficient samplers, including ones that can generate arbitrary lengths of text semi-autoregressively like a traditional language model. On language modeling benchmarks, a range of masked diffusion models trained with modern engineering practices achieves a new state-of-the-art among diffusion models, and approaches AR perplexity. We release our code at: https://github.com/kuleshov-group/mdlm
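
A minimal sketch of the objective's shape under an assumed linear masking schedule (alpha_t = 1 - t, which yields a 1/t weight): a weighted average of masked cross-entropy losses. Parameterization details such as zero-masking probabilities are omitted.

```python
import torch
import torch.nn.functional as F

def mdlm_loss(model, x: torch.Tensor, mask_id: int, eps: float = 1e-3):
    """Weighted masked cross-entropy under a linear schedule (sketch).

    x: (B, L) token ids. A masking rate t is sampled per sequence, each
    token is masked independently with probability t, and masked-token
    losses are weighted by 1/t, the NELBO weight of alpha_t = 1 - t.
    """
    b, l = x.shape
    t = torch.rand(b, 1, device=x.device).clamp(min=eps)    # masking rate
    masked = torch.rand(b, l, device=x.device) < t          # Bernoulli(t) mask
    x_t = torch.where(masked, torch.full_like(x, mask_id), x)
    logits = model(x_t)                                     # (B, L, V)
    ce = F.cross_entropy(logits.transpose(1, 2), x, reduction="none")  # (B, L)
    return ((ce * masked) / t).sum() / masked.sum().clamp(min=1)
```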




Citations (51)


... Recently, large language models (LLMs) have been proposed as cost-effective alternatives to human evaluation, acting as proxies for assessing text quality. Such methods often first provide explanations for judgments of the response, then output a discrete score or preference label as the prediction (Li et al., 2023; Yan et al., 2024). CriticGPT (McAleese et al., 2024) has also extended this line of work to coding tasks, where the LLM critic model is fine-tuned to pinpoint problems in code from real-world assistant tasks. ...

Reference:

Self-Generated Critiques Boost Reward Modeling for Language Models
Predicting Text Preference Via Structured Comparative Reasoning
  • Citing Conference Paper
  • January 2024

... Attempts at having transformers with sub-quadratic complexity [11,47,76] introduce the additional constraint of fixing the number of tokens, which prevents generating images or videos of different sizes. Alternatively, recurrent models such as State-Space Models (SSM) [26,27] have been investigated for the task [38,69,79] since their complexity is linear with the sequence length [25]. However, they introduce an arbitrary causal raster scan of the sequence that does not fit the 2D geometry of images very well. ...

Diffusion Models Without Attention
  • Citing Conference Paper
  • June 2024

... We strive for transparency, reproducibility, and accountability, three key ingredients of privacy-related research. Therefore, we experiment with fully open-source and transparent models, such as BLOOM (Le Scao et al., 2023). All our source code and datasets are also publicly available for further scrutiny. ...

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

... The Hop, Union, Generate (HUG) [13] framework introduces an explainable MHR framework that uses a pretrained model to score paths for their relevance to the reasoning task. The Tree-of-Mixed-Thought (ToMT) [7] model combines rapid, one-stop reasoning with iterative refinement from a learned model, striving for a balance between efficiency and accuracy in MHR tasks. ...

Hop, Union, Generate: Explainable Multi-hop Reasoning without Rationale Supervision
  • Citing Conference Paper
  • January 2023

... Beyond these traditional concerns, LLMs introduce a novel privacy challenge: prompt theft attacks, which threaten intellectual property rights and personal privacy. These attacks have emerged in various forms, including extracting system prompts by adversarial prompting [63], [91], [141], [144], inverting prompts from embedding vectors [16], [59], [79], [105] or model responses [67], [97], [101], [138], and recovering input prompts by exploiting next-token probability distributions [78] or token-length sequences [129]. These attacks pose significant risks to the growing prompt marketplace and to user privacy in downstream LLM applications, particularly when prompts contain sensitive information or proprietary instructions. ...

Text Embeddings Reveal (Almost) As Much As Text
  • Citing Conference Paper
  • January 2023

... Those works obtain very competitive performance and even beat various attention-variant architectures by a large margin on several long-sequence tasks. In recent work [84], a state space model even reaches performance comparable to BERT-type models on large-scale NLP pretraining tasks. It is possible to further boost S3 Attention's performance by combining it with those breakthroughs. ...

Pretraining Without Attention
  • Citing Conference Paper
  • January 2023

... Besides tuning on curated instruction-finetuning datasets, a series of recent works [59, 50, 38, 16] generates instructions for a specific task given a few examples. Instead of improving on specific tasks, some works [42, 15] generate task-agnostic, large-scale instruction-tuning data without any given examples. ...

Explaining Data Patterns in Natural Language with Language Models
  • Citing Conference Paper
  • January 2023

... Our Induction-Gram LM is also based on a nonparametric LM, but unlike these other works, it maintains complete interpretability during inference. In simplified settings such as text classification, some works have built fully interpretable models that bridge LLMs and n-gram models (Li et al., 2017; Singh et al., 2023a) or built partially interpretable models based on approximating concepts with natural language (Yang et al., 2023a; Sun et al., 2024; Morris et al., 2023). ...

Tree Prompting: Efficient Task Adaptation without Fine-Tuning
  • Citing Conference Paper
  • January 2023

... Cornille et al. (2024), we complement perplexity, which does not directly assess generated text, with generation metrics. We report ROUGE-2 (F1) (Lin, 2004) and MAUVE (Pillutla et al., 2021) to evaluate generated texts at the surface level, and Levenshtein distance (Levenshtein et al., 1966) and latent perplexity (Deng et al., 2022) to assess text quality at an abstract level. For the surface level, ROUGE-2 evaluates bigram overlap between generated and real text, while MAUVE measures the divergence between model and true data distributions by comparing generated and real texts unconditionally. ...

Model Criticism for Long-Form Text Generation
  • Citing Conference Paper
  • January 2022
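
To make the surface-level metric in the excerpt above concrete, here is a minimal sketch of ROUGE-2 (F1) as plain bigram overlap between generated and reference text; real evaluations should use the reference implementation, which handles stemming and tokenization.

```python
from collections import Counter

def rouge2_f1(candidate: str, reference: str) -> float:
    """Bigram-overlap F1 between two whitespace-tokenized strings."""
    def bigrams(text: str) -> Counter:
        toks = text.split()
        return Counter(zip(toks, toks[1:]))
    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())      # clipped bigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```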