Adam Fisch’s scientific contributions


Publications (37)


Figure 2 | Left: an example of an unshared, full-size model with 6 layers. Middle: three proposed methodologies for initializing looped layers in a Recursive Transformer; each layer number indicates the source layer in the full-size model used for initialization. Right: an example of a Relaxed Recursive Transformer initialized by the SVD method, where the looped layers are initialized using the Average method.
Figure C.1 | We visualize LoRA modules to show which residual matrices they target for initialization under three different looping initialization methods, assuming a full-size model with six layers and two looping blocks. For ease of understanding, A matrices are colored according to the full-size model weights at the corresponding depth, while B matrices are colored based on the looped layer weights. White B matrices indicate cases where the full-size model and looped model weights are identical, resulting in standard zero initialization.
Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA
  • Preprint

October 2024 · 6 Reads · Sangmin Bae · Adam Fisch · Hrayr Harutyunyan · [...] · Tal Schuster

Large language models (LLMs) are expensive to deploy. Parameter sharing offers a possible path towards reducing their size and cost, but its effectiveness in modern LLMs remains fairly limited. In this work, we revisit "layer tying" as a form of parameter sharing in Transformers, and introduce novel methods for converting existing LLMs into smaller "Recursive Transformers" that share parameters across layers, with minimal loss of performance. Here, our Recursive Transformers are efficiently initialized from standard pretrained Transformers, but only use a single block of unique layers that is then repeated multiple times in a loop. We further improve performance by introducing Relaxed Recursive Transformers that add flexibility to the layer tying constraint via depth-wise low-rank adaptation (LoRA) modules, yet still preserve the compactness of the overall model. We show that our recursive models (e.g., recursive Gemma 1B) outperform both similar-sized vanilla pretrained models (such as TinyLlama 1.1B and Pythia 1B) and knowledge distillation baselines -- and can even recover most of the performance of the original "full-size" model (e.g., Gemma 2B with no shared parameters). Finally, we propose Continuous Depth-wise Batching, a promising new inference paradigm enabled by the Recursive Transformer when paired with early exiting. In a theoretical analysis, we show that this has the potential to lead to significant (2-3x) gains in inference throughput.
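A minimal sketch of the core idea, not the authors' implementation: one shared layer is reused across depth, and each loop iteration carries its own small depth-wise LoRA correction, so the model starts as exact layer tying and can relax it during training. Module names, sizes, and the single-linear "layer" stand-in are illustrative assumptions.

```python
import torch
import torch.nn as nn


class DepthwiseLoRA(nn.Module):
    """Low-rank residual (B @ A) applied on top of a shared, tied layer."""

    def __init__(self, d_model: int, rank: int = 8):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, d_model) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_model, rank))  # zero init: starts as exact tying

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.A.T @ self.B.T


class LoopedBlock(nn.Module):
    """One shared transformer-style layer repeated `num_loops` times with per-loop LoRA."""

    def __init__(self, d_model: int = 256, num_loops: int = 3, rank: int = 8):
        super().__init__()
        self.shared = nn.Linear(d_model, d_model)  # stands in for a full transformer layer
        self.norm = nn.LayerNorm(d_model)
        self.loras = nn.ModuleList(
            [DepthwiseLoRA(d_model, rank) for _ in range(num_loops)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for lora in self.loras:  # same tied weights at every depth, different LoRA per loop
            h = self.norm(x)
            x = x + torch.relu(self.shared(h) + lora(h))
        return x


x = torch.randn(2, 16, 256)        # (batch, sequence, d_model)
print(LoopedBlock()(x).shape)      # torch.Size([2, 16, 256])
```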


Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation

June 2024 · 15 Reads

Prediction-powered inference (PPI) is a method that improves statistical estimates based on limited human-labeled data. PPI achieves this by combining small amounts of human-labeled data with larger amounts of data labeled by a reasonably accurate -- but potentially biased -- automatic system, in a way that results in tighter confidence intervals for certain parameters of interest (e.g., the mean performance of a language model). In this paper, we propose a method called Stratified Prediction-Powered Inference (StratPPI), in which we show that the basic PPI estimates can be considerably improved by employing simple data stratification strategies. Without making any assumptions on the underlying automatic labeling system or data distribution, we derive an algorithm for computing provably valid confidence intervals for population parameters (such as averages) that is based on stratified sampling. In particular, we show both theoretically and empirically that, with appropriate choices of stratification and sample allocation, our approach can provide substantially tighter confidence intervals than unstratified approaches. Specifically, StratPPI is expected to improve in cases where the performance of the autorater varies across different conditional distributions of the target data.
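A hedged sketch of the estimator the abstract describes: within each stratum, the autorater mean on a large unlabeled pool is debiased by human-vs-autorater residuals on a small labeled set, and the per-stratum estimates are combined with known stratum proportions. The toy data, stratum names, and weights below are assumptions for illustration, not the paper's implementation or allocation rule.

```python
import numpy as np


def ppi_mean(auto_unlabeled, auto_labeled, human):
    """Classic PPI point estimate: autorater mean plus a human-label bias correction."""
    return auto_unlabeled.mean() + (human - auto_labeled).mean()


def strat_ppi_mean(strata, weights):
    """Weighted combination of per-stratum PPI estimates (weights sum to 1)."""
    return sum(
        weights[s] * ppi_mean(d["auto_unlabeled"], d["auto_labeled"], d["human"])
        for s, d in strata.items()
    )


rng = np.random.default_rng(0)
strata = {
    "easy": {  # autorater is accurate here
        "auto_unlabeled": rng.binomial(1, 0.9, 5000).astype(float),
        "auto_labeled": rng.binomial(1, 0.9, 100).astype(float),
        "human": rng.binomial(1, 0.85, 100).astype(float),
    },
    "hard": {  # autorater is biased upward here
        "auto_unlabeled": rng.binomial(1, 0.6, 5000).astype(float),
        "auto_labeled": rng.binomial(1, 0.6, 100).astype(float),
        "human": rng.binomial(1, 0.5, 100).astype(float),
    },
}
print(strat_ppi_mean(strata, weights={"easy": 0.7, "hard": 0.3}))
```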


Figure 2: Pareto frontier of throughput versus language modeling performance. Throughput denotes the number of generated tokens per second, and the numbers next to each point represent the number of non-embedding parameters. (a) Pareto frontier in the prefill-heavy setting. (b) Pareto frontier in the decode-heavy setting. (c) Throughput in the prefill-heavy setting with varying prompt lengths. Each point corresponds to the same order of model sizes as in the left figures.
Figure 3: (Left: (a), (d)) Average and position-wise loss by the ratio of parameter allocation between block and token decoders. The ratio is expressed as block decoder to token decoder. (Center: (b), (e)) Average and position-wise loss in relation to block length L_B. (Right: (c), (f)) Training loss curve for variants of the embedder and token decoder. We consider four different lengths for the prefix-based token decoder. We use models with 302M non-embedding parameters and a one-to-one ratio trained on 8 billion tokens.
Figure 12: Loss for varying block lengths and parameter allocation ratios. The numbers indicate the sum of non-embedding parameters in the block and token decoders.
Figure 21: Visualization of attention scores in the token decoder. The total sequence length of the attention scores is 5, since the block length is 4 and the prefix length is 2. The causal mask parts are marked in gray.
Block Transformer: Global-to-Local Language Modeling for Fast Inference

June 2024 · 44 Reads

This paper presents the Block Transformer architecture, which applies hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks of self-attention. To apply self-attention, the key-value (KV) cache of all previous sequences must be retrieved from memory at every decoding step, so this KV cache IO becomes a significant bottleneck in batch inference. We observe that these costs stem from applying self-attention on the global context; we therefore isolate the expensive bottlenecks of global modeling to lower layers and apply fast local modeling in upper layers. To mitigate the remaining costs in the lower layers, we aggregate input tokens into fixed-size blocks and then apply self-attention at this coarse level. Context information is aggregated into a single embedding so that upper layers can decode the next block of tokens without global attention. Free of global attention bottlenecks, the upper layers can fully utilize the compute hardware to maximize inference throughput. By leveraging global and local modules, the Block Transformer architecture demonstrates 10-20x gains in inference throughput compared to vanilla transformers with equivalent perplexity. Our work introduces a new approach to optimizing language model inference through a novel application of global-to-local modeling. Code is available at https://github.com/itsnamgyu/block-transformer.
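A minimal illustration of the global-to-local split described above (not the released code at the linked repository): a "block decoder" attends causally over coarse block embeddings, while a "token decoder" attends only within each block plus a small block-context prefix. The mask shapes and the prefix-length default are illustrative assumptions.

```python
import torch


def block_causal_mask(num_blocks: int) -> torch.Tensor:
    """Causal mask over block embeddings (global, but num_blocks << num_tokens)."""
    return torch.tril(torch.ones(num_blocks, num_blocks, dtype=torch.bool))


def local_mask(block_len: int, prefix_len: int = 1) -> torch.Tensor:
    """Causal mask inside one block; the first `prefix_len` positions hold block context."""
    size = prefix_len + block_len
    mask = torch.tril(torch.ones(size, size, dtype=torch.bool))
    mask[:, :prefix_len] = True  # every token may read the block-context prefix
    return mask


tokens, block_len = 4096, 4
print(block_causal_mask(tokens // block_len).shape)  # global attention over 1024 blocks
print(local_mask(block_len))                         # per-block local attention pattern
```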


Figure 2: Conformal sampling results for C_λ as a function of ε. We report the loss, relative excess samples, and overall size (normalized by k_max). We also report the AUC over achieved/non-trivial ε.
Figure 3: Conformal component selection results for C_γ^inner as a function of α. We report the number of components identified in C_γ^inner.
Figure G.1: Conformal component selection results for C_γ^inner as a function of α. We report the recall achieved by C_γ^inner.
Figure H.1: Chest X-ray and reference radiology report for study id 55663120.
Conformal Language Modeling

June 2023 · 115 Reads · 1 Citation

We propose a novel approach to conformal prediction for generative language models (LMs). Standard conformal prediction produces prediction sets -- in place of single predictions -- that have rigorous, statistical performance guarantees. LM responses are typically sampled from the model's predicted distribution over the large, combinatorial output space of natural language. Translating this process to conformal prediction, we calibrate a stopping rule for sampling different outputs from the LM that get added to a growing set of candidates until we are confident that the output set is sufficient. Since some samples may be low-quality, we also simultaneously calibrate and apply a rejection rule for removing candidates from the output set to reduce noise. Similar to conformal prediction, we prove that the sampled set returned by our procedure contains at least one acceptable answer with high probability, while still being empirically precise (i.e., small) on average. Furthermore, within this set of candidate responses, we show that we can also accurately identify subsets of individual components -- such as phrases or sentences -- that are each independently correct (e.g., that are not "hallucinations"), again with statistical guarantees. We demonstrate the promise of our approach on multiple tasks in open-domain question answering, text summarization, and radiology report generation using different LM variants.
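A hedged sketch of the sample-then-stop loop the abstract outlines. The scoring functions and thresholds (`quality_fn`, `confidence_fn`, `lambda_reject`, `lambda_stop`) are placeholders that the paper calibrates jointly via risk control; here they are assumed to be given.

```python
import random
from typing import Callable, List


def conformal_sample(
    sample_fn: Callable[[], str],                  # draws one response from the LM
    quality_fn: Callable[[str], float],            # per-sample quality score for rejection
    confidence_fn: Callable[[List[str]], float],   # set-level score for the stopping rule
    lambda_reject: float,
    lambda_stop: float,
    k_max: int = 20,
) -> List[str]:
    """Grow a candidate set until the calibrated stopping rule fires (or k_max is hit)."""
    candidates: List[str] = []
    for _ in range(k_max):
        y = sample_fn()
        if quality_fn(y) >= lambda_reject:              # rejection rule: drop likely-bad samples
            candidates.append(y)
        if confidence_fn(candidates) >= lambda_stop:    # stopping rule: set deemed sufficient
            break
    return candidates


# Toy usage with dummy scoring functions.
print(conformal_sample(
    sample_fn=lambda: random.choice(["A", "B", "C"]),
    quality_fn=lambda y: 1.0,
    confidence_fn=lambda s: len(set(s)) / 3,
    lambda_reject=0.5,
    lambda_stop=0.9,
))
```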


Efficiently Controlling Multiple Risks with Pareto Testing

October 2022 · 45 Reads

Machine learning applications frequently come with multiple diverse objectives and constraints that can change over time. Accordingly, trained models can be tuned with sets of hyper-parameters that affect their predictive behavior (e.g., their run-time efficiency versus error rate). As the number of constraints and hyper-parameter dimensions grow, naively selected settings may lead to sub-optimal and/or unreliable results. We develop an efficient method for calibrating models such that their predictions provably satisfy multiple explicit and simultaneous statistical guarantees (e.g., upper-bounded error rates), while also optimizing any number of additional, unconstrained objectives (e.g., total run-time cost). Building on recent results in distribution-free, finite-sample risk control for general losses, we propose Pareto Testing: a two-stage process which combines multi-objective optimization with multiple hypothesis testing. The optimization stage constructs a set of promising combinations on the Pareto frontier. We then apply statistical testing to this frontier only to identify configurations that have (i) high utility with respect to our objectives, and (ii) guaranteed risk levels with respect to our constraints, with specifiable high probability. We demonstrate the effectiveness of our approach to reliably accelerate the execution of large-scale Transformer models in natural language processing (NLP) applications. In particular, we show how Pareto Testing can be used to dynamically configure multiple inter-dependent model attributes -- including the number of layers computed before exiting, number of attention heads pruned, or number of text tokens considered -- to simultaneously control and optimize various accuracy and cost metrics.
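An illustrative sketch of the two-stage idea, under simplifying assumptions: keep only Pareto-optimal configurations on an (estimated risk, cost) grid, order them, then run a fixed-sequence test along that ordering, accepting configurations whose empirical risk is provably below the target. The Hoeffding-style p-value and the toy configurations are assumptions for illustration, not the paper's exact procedure.

```python
import math
from typing import List, Tuple


def pareto_front(configs: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Keep configs not dominated in both (estimated risk, cost); sort by estimated risk."""
    return sorted(
        c for c in configs
        if not any(o[0] <= c[0] and o[1] <= c[1] and o != c for o in configs)
    )


def hoeffding_p_value(emp_risk: float, alpha: float, n: int) -> float:
    """P-value for H0: risk > alpha, given the empirical mean of a [0, 1] loss on n points."""
    return math.exp(-2 * n * max(0.0, alpha - emp_risk) ** 2)


def fixed_sequence_test(front, emp_risks, alpha=0.1, delta=0.05, n=500):
    """Walk the frontier in order; stop at the first configuration whose test fails."""
    accepted = []
    for cfg, r in zip(front, emp_risks):
        if hoeffding_p_value(r, alpha, n) <= delta:
            accepted.append(cfg)
        else:
            break
    return accepted


configs = [(0.04, 9.0), (0.06, 5.0), (0.08, 3.0), (0.12, 2.0), (0.07, 7.0)]
front = pareto_front(configs)                  # (0.07, 7.0) is dominated and dropped
print(front)
print(fixed_sequence_test(front, emp_risks=[r for r, _ in front]))
```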


Calibrated Selective Classification

August 2022 · 17 Reads

Selective classification allows models to abstain from making predictions (e.g., say "I don't know") when in doubt in order to obtain better effective accuracy. While typical selective models can be effective at producing more accurate predictions on average, they may still allow for wrong predictions that have high confidence, or skip correct predictions that have low confidence. Providing calibrated uncertainty estimates alongside predictions -- probabilities that correspond to true frequencies -- can be as important as having predictions that are simply accurate on average. However, uncertainty estimates can be unreliable for certain inputs. In this paper, we develop a new approach to selective classification in which we propose a method for rejecting examples with "uncertain" uncertainties. By doing so, we aim to make predictions with well-calibrated uncertainty estimates over the distribution of accepted examples, a property we call selective calibration. We present a framework for learning selectively calibrated models, where a separate selector network is trained to improve the selective calibration error of a given base model. In particular, our work focuses on achieving robust calibration, where the model is intentionally designed to be tested on out-of-domain data. We achieve this through a training strategy inspired by distributionally robust optimization, in which we apply simulated input perturbations to the known, in-domain training data. We demonstrate the empirical effectiveness of our approach on multiple image classification and lung cancer risk assessment tasks.
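A hedged sketch of the quantity at the center of the abstract: a calibration error computed only over examples the selector accepts. The binning scheme, the weighting by selector scores, and the hard toy "selector" below are illustrative choices, not the paper's exact estimator or training objective.

```python
import numpy as np


def selective_ece(conf, correct, accept, bins: int = 10) -> float:
    """Weighted |confidence - accuracy| gap over confidence bins, weighted by selector scores."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    total, ece = accept.sum(), 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf >= lo) & (conf < hi)
        w = accept[in_bin].sum()
        if w > 0:
            gap = abs(np.average(conf[in_bin], weights=accept[in_bin])
                      - np.average(correct[in_bin], weights=accept[in_bin]))
            ece += (w / total) * gap
    return ece


rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, 1000)
correct = (rng.uniform(size=1000) < conf).astype(float)   # roughly calibrated toy model
accept = (conf > 0.7).astype(float)                       # a hard "selector" for illustration
print(round(selective_ece(conf, correct, accept), 4))
```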


Conformal Risk Control

August 2022 · 39 Reads · 1 Citation

We extend conformal prediction to control the expected value of any monotone loss function. The algorithm generalizes split conformal prediction together with its coverage guarantee. Like conformal prediction, the conformal risk control procedure is tight up to an O(1/n) factor. Worked examples from computer vision and natural language processing demonstrate the usage of our algorithm to bound the false negative rate, graph distance, and token-level F1-score.
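A minimal sketch of the threshold selection behind this kind of guarantee: given calibration losses L_i(λ) that are non-increasing in λ and bounded by B, pick the smallest λ whose inflated empirical risk, (n/(n+1)) R̂(λ) + B/(n+1), falls below the target level α. The toy false-negative-style loss below is an illustrative assumption.

```python
import numpy as np


def crc_threshold(loss_fn, lambdas, n: int, alpha: float, B: float = 1.0) -> float:
    """Smallest lambda with (n/(n+1)) * empirical_risk(lambda) + B/(n+1) <= alpha."""
    for lam in np.sort(lambdas):
        emp_risk = loss_fn(lam).mean()               # average of L_1(lam), ..., L_n(lam)
        if (n / (n + 1)) * emp_risk + B / (n + 1) <= alpha:
            return float(lam)
    return float(np.max(lambdas))                    # fall back to the most conservative setting


# Toy example: a loss that shrinks as lambda (e.g., set size) grows.
rng = np.random.default_rng(0)
scores = rng.uniform(size=200)                       # one "difficulty" score per calibration point
loss_fn = lambda lam: (scores > lam).astype(float)   # miss the answer if score exceeds lambda
print(crc_threshold(loss_fn, np.linspace(0, 1, 101), n=200, alpha=0.1))
```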


Confident Adaptive Language Modeling

July 2022 · 55 Reads · 2 Citations

Recent advances in Transformer-based large language models (LLMs) have led to significant performance improvements across many tasks. These gains come with a drastic increase in the models' size, potentially leading to slow and costly use at inference time. In practice, however, the series of generations made by LLMs is composed of varying levels of difficulty. While certain predictions truly benefit from the models' full capacity, other continuations are more trivial and can be solved with reduced compute. In this work, we introduce Confident Adaptive Language Modeling (CALM), a framework for dynamically allocating different amounts of compute per input and generation timestep. Early exit decoding involves several challenges that we address here, such as: (1) what confidence measure to use; (2) connecting sequence-level constraints to local per-token exit decisions; and (3) attending back to missing hidden representations due to early exits in previous tokens. Through theoretical analysis and empirical experiments on three diverse text generation tasks, we demonstrate the efficacy of our framework in reducing compute -- a potential speedup of up to 3× -- while provably maintaining high performance.
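A hedged sketch of per-token early exiting in the spirit of the abstract (not the released implementation): after each decoder layer, a confidence measure is computed and, once it clears a threshold (calibrated in the paper to meet sequence-level constraints), the remaining layers are skipped. The softmax-gap confidence, the random toy "layers", and the threshold value are assumptions for illustration; state propagation for skipped layers is not shown.

```python
import torch


def early_exit_forward(layers, lm_head, x: torch.Tensor, threshold: float):
    """Run layers until the top-1 vs. top-2 softmax gap at some layer exceeds `threshold`."""
    for depth, layer in enumerate(layers):
        x = layer(x)
        probs = torch.softmax(lm_head(x), dim=-1)
        top2 = torch.topk(probs, k=2, dim=-1).values
        confidence = (top2[..., 0] - top2[..., 1]).item()   # gap between top-1 and top-2
        if confidence >= threshold:
            return x, depth + 1                              # exited early at this depth
    return x, len(layers)


# Toy usage with random linear "layers" acting on a single token's hidden state.
d_model, vocab = 64, 1000
layers = [torch.nn.Linear(d_model, d_model) for _ in range(12)]
lm_head = torch.nn.Linear(d_model, vocab)
hidden, used = early_exit_forward(layers, lm_head, torch.randn(1, d_model), threshold=5e-4)
print(f"exited after {used} of {len(layers)} layers")
```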


Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

June 2022 · 1,003 Reads · 66 Citations

Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.


Conformal Prediction Sets with Limited False Positives

February 2022 · 16 Reads

We develop a new approach to multi-label conformal prediction in which we aim to output a precise set of promising prediction candidates with a bounded number of incorrect answers. Standard conformal prediction provides the ability to adapt to model uncertainty by constructing a calibrated candidate set in place of a single prediction, with guarantees that the set contains the correct answer with high probability. In order to obey this coverage property, however, conformal sets can become inundated with noisy candidates -- which can render them unhelpful in practice. This is particularly relevant to practical applications where there is a limited budget, and the cost (monetary or otherwise) associated with false positives is non-negligible. We propose to trade coverage for a notion of precision by enforcing that the presence of incorrect candidates in the predicted conformal sets (i.e., the total number of false positives) is bounded according to a user-specified tolerance. Subject to this constraint, our algorithm then optimizes for a generalized notion of set coverage (i.e., the true positive rate) that allows for any number of true answers for a given query (including zero). We demonstrate the effectiveness of this approach across a number of classification tasks in natural language processing, computer vision, and computational chemistry.
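An illustrative sketch of the calibration step implied above, under simplifying assumptions: sweep a score threshold from high to low and keep the lowest threshold whose estimated expected number of false positives per query stays within the user-specified budget; candidates scoring above that threshold then form the prediction set. The grid sweep and toy data are assumptions, not the paper's calibrated procedure.

```python
import numpy as np


def calibrate_fp_threshold(scores, labels, budget: float, grid: int = 200) -> float:
    """scores, labels: (num_queries, num_candidates); labels are 1 for true answers."""
    thresholds = np.linspace(scores.max(), scores.min(), grid)   # high to low
    best = thresholds[0]
    for t in thresholds:
        fp_per_query = ((scores >= t) & (labels == 0)).sum(axis=1).mean()
        if fp_per_query <= budget:
            best = t                 # lower thresholds admit more candidates, so keep going
        else:
            break
    return float(best)


rng = np.random.default_rng(0)
labels = rng.binomial(1, 0.2, size=(300, 50))
scores = np.clip(labels * 0.5 + rng.normal(0.3, 0.2, size=(300, 50)), 0, 1)
threshold = calibrate_fp_threshold(scores, labels, budget=1.0)
print(threshold, ((scores >= threshold) & (labels == 0)).sum(axis=1).mean())
```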


Citations (20)


... Early exit methods (Schwartz et al., 2020; Xin et al., 2020; Schuster et al., 2022), which speed up LLMs by processing only the bottom part of the model, also provide evidence that the top layers of the model have already gained relevant information from previous tokens. Some of our setups are similar to previous work. ...

Reference:

Attend First, Consolidate Later: On the Importance of Attention in Different LLM Layers
Confident Adaptive Language Modeling
  • Citing Preprint
  • July 2022

... In this work, we focus on Early-Exiting (EE), an inference optimization technique that allocates budget adaptively to the test samples, based on their perceived difficulty. Early-exit strategies (Grubb & Bagnell, 2012; Huang et al., 2017; Elbayad et al., 2019a; Schuster et al., 2021; Chen et al., 2023) involve establishing exit points at intermediate layers of a network based on the confidence levels of the predictions at each layer. The most common approach within these strategies is to make predictions at each intermediate layer and evaluate their confidence, allowing the model to exit early if the confidence exceeds a predetermined threshold. ...

Consistent Accelerated Inference via Confident Adaptive Transformers
  • Citing Article
  • January 2021

... Image captioning has been extensively studied and applied to various applications in society, such as generating fetching instructions for robots, assisting blind people, and answering questions from images (Magassouba et al., 2019; Ogura et al., 2020; Kambara et al., 2021; Gurari et al., 2020; White et al., 2021; Fisch et al., 2020). In this field, it is important that the quality of the generated captions is evaluated appropriately. ...

CapWAP: Image Captioning with a Purpose
  • Citing Article
  • January 2020

... The rapid dissemination of disinformation and misinformation in the digital era has catastrophic consequences, impacting public opinion. Since manual fact-checking is time-consuming and does not scale, automated fact-checking approaches have been proposed [7,11,26,32] to combat misinformation. Automated fact-checking pipelines mimic the human fact-checking process by collecting multiple pieces of evidence for different aspects of the claim, followed by reasoning over ...

Get Your Vitamin C! Robust Fact Verification with Contrastive Evidence
  • Citing Article
  • January 2021

... The use of NLP for pharmaceutical applications is a relatively new and evolving space of research. Recent studies have found success in utilizing NLP for the parsing of electronic health records as well as for target identification for drug discovery, suggesting great potential for the impact of NLP on drug discovery efforts in the future (Fisch, Schuster, Jaakkola, & Barzilay, 2021; Santus et al., 2019). ...

Few-shot Conformal Prediction with Auxiliary Tasks
  • Citing Article
  • January 2021

... There are now books on detailed philosophical conversations with an NLM (Leib, 2023). Srivastava et al. (2022) called their giant benchmark setup for evaluating the capabilities of NLMs Beyond the Imitation Game (BIG-bench) and assume that this type of review will be far surpassed. A team at AI21 Labs has developed a type of social imitation game in which most people are unable to distinguish whether their conversation partner is a human or an NLM (Jannai et al., 2023). ...

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
  • Citing Preprint
  • June 2022

... In machine translation, [Gio23] and [ZM24] used CP to assess translation quality, providing calibrated confidence estimates for both human and machine evaluations. Additionally, [Sch+21;Sch+22b] proposed confident early exiting methods for Transformers, where intermediate layers assess uncertainty to speed up inference while maintaining consistency with the full model. ...

Consistent Accelerated Inference via Confident Adaptive Transformers
  • Citing Conference Paper
  • January 2021

... Their work demonstrated that combining soft and hard templates yielded impressive results in multiple general NLP tasks such as Sentence-Pair Classification, Multiple-Choice Classification, and Single-Sentence Classification. Some researchers have focused on prompt template design, with Gao et al. [38] being the first to propose automated methods for generating label sets and templates. Shin et al. [39] introduced a gradient-based approach to automatically generate words for label sets and templates. ...

Making Pre-trained Language Models Better Few-shot Learners
  • Citing Conference Paper
  • January 2021

... Figure 1 provides a specific illustration for Vietnamese fact-checking. Although substantial efforts have been devoted to fact-checking in English (Thorne et al. 2018; Aly et al. 2021; Schuster, Fisch, and Barzilay 2021), resources for fact-checking in low-resource languages like Vietnamese are limited. This scarcity primarily stems from the limited availability of guidance resources to analyze the structure and semantics of Vietnamese. ...

Get Your Vitamin C! Robust Fact Verification with Contrastive Evidence
  • Citing Conference Paper
  • January 2021

... To address the small dataset issue, we incorporated additional data from the 2024 spring migration season to enhance the model's generalization. We applied a "pre-training and fine-tuning" strategy [53], where the CLA model, initially trained on the 2023 data, was fine-tuned using the 2024 dataset. This process involved freezing the CNN layers and adjusting the parameters of the LSTM and attention layers. ...

Making Pre-trained Language Models Better Few-shot Learners
  • Citing Preprint
  • December 2020