Rabeeh Karimi Mahabadi’s research while affiliated with the Swiss Federal Institute of Technology in Lausanne and other institutions

Publications (16)


Figure 4: Time taken to generate the given number of tokens with 100 diffusion steps. We report the average time over five runs. Diffusion-based models use RoBERTa-large as their backbone.
Comparison of TESS and RoBERTa on GLUE tasks on the development set. Following Devlin et al. (2019), we report F1 scores for MRPC and QQP, the Spearman correlation coefficient for STS-B, Matthews correlation for CoLA, and accuracy for all other tasks. Bold fonts indicate the best results. Following RoBERTa, results on the development set are a median over five runs, and for RTE, STS-B, and MRPC we fine-tune starting from the MNLI model instead of the baseline pretrained model.
Ablation on the effect of self-conditioning. We compare our proposed self-conditioning to the original method of Chen et al. (2022) and to the model without self-conditioning. Bold fonts indicate the best results.
TESS: Text-to-Text Self-Conditioned Simplex Diffusion
  • Preprint
  • File available

May 2023 · 63 Reads

Rabeeh Karimi Mahabadi · Jaesung Tae · Hamish Ivison · [...] · Arman Cohan

Diffusion models have emerged as a powerful paradigm for generation, obtaining strong performance in various domains with continuous-valued inputs. Despite the promises of fully non-autoregressive text generation, applying diffusion models to natural language remains challenging due to its discrete nature. In this work, we propose Text-to-text Self-conditioned Simplex Diffusion (TESS), a text diffusion model that is fully non-autoregressive, employs a new form of self-conditioning, and applies the diffusion process on the logit simplex space rather than the typical learned embedding space. Through extensive experiments on natural language understanding and generation tasks including summarization, text simplification, paraphrase generation, and question generation, we demonstrate that TESS outperforms state-of-the-art non-autoregressive models and is competitive with pretrained autoregressive sequence-to-sequence models.
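
A rough sketch of the two ideas highlighted in the abstract, diffusion over an almost-one-hot logit simplex and self-conditioning on the model's previous prediction, is given below. The simplex scale `K`, the helper names, and the `model(...)` call signature are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of simplex-based text diffusion with self-conditioning.
import torch
import torch.nn.functional as F

K = 5.0  # scale of the almost-one-hot logit simplex (assumed hyperparameter)

def encode_to_simplex(token_ids: torch.Tensor, vocab_size: int) -> torch.Tensor:
    """Map token ids to {-K, +K} logits over the vocabulary."""
    one_hot = F.one_hot(token_ids, vocab_size).float()
    return K * (2.0 * one_hot - 1.0)

def add_noise(s0: torch.Tensor, alpha_bar_t: torch.Tensor) -> torch.Tensor:
    """Forward diffusion in logit space: interpolate toward Gaussian noise."""
    noise = torch.randn_like(s0)
    return alpha_bar_t.sqrt() * s0 + (1.0 - alpha_bar_t).sqrt() * noise

def denoise_step(model, s_t, t, prev_logits):
    """One reverse step. Self-conditioning: the model also sees a softmax of
    its own prediction from the previous step (zeros at the first step)."""
    self_cond = F.softmax(prev_logits, dim=-1)
    # `model` is a placeholder for a RoBERTa-style denoiser; this call
    # signature is an assumption for the sketch.
    pred_logits = model(noisy_simplex=s_t, timestep=t, self_cond=self_cond)
    # pred_logits is reused as the next step's self-conditioning input and is
    # re-noised into s_{t-1} via add_noise.
    return pred_logits
```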


Figure 1: Existing few-shot fine-tuning methods require manual engineering to reduce new tasks to masked language modeling. PERFECT does not rely on any handcrafting, removing both patterns and verbalizers (see Figure 3).
Validation performance for sentence-pair benchmarks for different locations of mask tokens. Bold fonts indicate the best results in each row.
PERFECT: Prompt-free and Efficient Few-shot Learning with Language Models

April 2022 · 128 Reads

Current methods for few-shot fine-tuning of pretrained masked language models (PLMs) require carefully engineered prompts and verbalizers for each new task to convert examples into a cloze-format that the PLM can score. In this work, we propose PERFECT, a simple and efficient method for few-shot fine-tuning of PLMs without relying on any such handcrafting, which is highly effective given as few as 32 data points. PERFECT makes two key design choices: First, we show that manually engineered task prompts can be replaced with task-specific adapters that enable sample-efficient fine-tuning and reduce memory and storage costs by roughly factors of 5 and 100, respectively. Second, instead of using handcrafted verbalizers, we learn new multi-token label embeddings during fine-tuning, which are not tied to the model vocabulary and which allow us to avoid complex auto-regressive decoding. These embeddings are not only learnable from limited data but also enable nearly 100x faster training and inference. Experiments on a wide range of few-shot NLP tasks demonstrate that PERFECT, while being simple and efficient, also outperforms existing state-of-the-art few-shot learning methods. Our code is publicly available at https://github.com/rabeehk/perfect.
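
The two design choices can be illustrated with a short, hedged sketch: a residual bottleneck adapter that replaces handcrafted prompts, and multi-token label embeddings scored against the hidden states at inserted mask positions. Class names, shapes, and the similarity-based scoring rule below are assumptions for illustration, not the released PERFECT code.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Task-specific bottleneck adapter added to each transformer layer."""
    def __init__(self, hidden: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))  # residual adapter

class MultiTokenLabelClassifier(nn.Module):
    """Learned label embeddings over `num_mask` mask positions, not tied to
    the PLM vocabulary; the prediction is the label with the highest total
    similarity, so no autoregressive decoding is needed."""
    def __init__(self, hidden: int, num_labels: int, num_mask: int = 2):
        super().__init__()
        self.label_emb = nn.Parameter(torch.randn(num_labels, num_mask, hidden))

    def forward(self, mask_hidden):              # (batch, num_mask, hidden)
        # score[b, c] = sum over mask positions of <h_bm, e_cm>
        return torch.einsum("bmh,cmh->bc", mask_hidden, self.label_emb)
```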



Variational Information Bottleneck for Effective Low-Resource Fine-Tuning

June 2021 · 18 Reads

While large-scale pretrained language models have obtained impressive results when fine-tuned on a wide variety of tasks, they still often suffer from overfitting in low-resource scenarios. Since such models are general-purpose feature extractors, many of these features are inevitably irrelevant for a given target task. We propose to use Variational Information Bottleneck (VIB) to suppress irrelevant features when fine-tuning on low-resource target tasks, and show that our method successfully reduces overfitting. Moreover, we show that our VIB model finds sentence representations that are more robust to biases in natural language inference datasets, and thereby obtains better generalization to out-of-domain datasets. Evaluation on seven low-resource datasets in different tasks shows that our method significantly improves transfer learning in low-resource scenarios, surpassing prior work. Moreover, it improves generalization on 13 out of 15 out-of-domain natural language inference benchmarks. Our code is publicly available in https://github.com/rabeehk/vibert.
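
The sketch below shows, under simplifying assumptions, how a variational information bottleneck layer can sit between a sentence representation and the task classifier: a Gaussian posterior is sampled via the reparameterization trick, and its KL divergence to a standard-normal prior is added to the task loss. Layer sizes and the `beta` weighting are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class VIBLayer(nn.Module):
    """Compress a sentence representation through a stochastic bottleneck."""
    def __init__(self, hidden: int, bottleneck: int):
        super().__init__()
        self.mu = nn.Linear(hidden, bottleneck)
        self.logvar = nn.Linear(hidden, bottleneck)

    def forward(self, sent_repr):
        mu, logvar = self.mu(sent_repr), self.logvar(sent_repr)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        # KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch
        kl = 0.5 * (logvar.exp() + mu.pow(2) - 1.0 - logvar).sum(-1).mean()
        return z, kl

# Total objective (sketch): task_loss = cross_entropy(classifier(z), labels)
# loss = task_loss + beta * kl   # beta controls how strongly irrelevant
#                                # features are suppressed
```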


Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks

June 2021 · 32 Reads

State-of-the-art parameter-efficient fine-tuning methods rely on introducing adapter modules between the layers of a pretrained language model. However, such modules are trained separately for each task and thus do not enable sharing information across tasks. In this paper, we show that we can learn adapter parameters for all layers and tasks by generating them using shared hypernetworks, which condition on task, adapter position, and layer id in a transformer model. This parameter-efficient multi-task learning framework allows us to achieve the best of both worlds by sharing knowledge across tasks via hypernetworks while enabling the model to adapt to each individual task through task-specific adapters. Experiments on the well-known GLUE benchmark show improved performance in multi-task learning while adding only 0.29% parameters per task. We additionally demonstrate substantial performance improvements in few-shot domain generalization across a variety of tasks. Our code is publicly available in https://github.com/rabeehk/hyperformer.
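
A minimal sketch of the core idea, a single shared hypernetwork that maps (task, layer, adapter-position) embeddings to adapter weights, is shown below. The embedding sizes, the concatenation scheme, and the class name `SharedHypernet` are assumptions for illustration rather than the exact architecture in the paper.

```python
import torch
import torch.nn as nn

class SharedHypernet(nn.Module):
    """Generates bottleneck-adapter weights from shared condition embeddings."""
    def __init__(self, n_tasks, n_layers, n_positions, emb=64,
                 hidden=768, bottleneck=24):
        super().__init__()
        self.task_emb = nn.Embedding(n_tasks, emb)
        self.layer_emb = nn.Embedding(n_layers, emb)
        self.pos_emb = nn.Embedding(n_positions, emb)   # e.g. after-attention / after-FFN
        self.to_down = nn.Linear(3 * emb, hidden * bottleneck)
        self.to_up = nn.Linear(3 * emb, bottleneck * hidden)
        self.hidden, self.bottleneck = hidden, bottleneck

    def forward(self, task_id, layer_id, pos_id):
        cond = torch.cat([self.task_emb(task_id),
                          self.layer_emb(layer_id),
                          self.pos_emb(pos_id)], dim=-1)
        w_down = self.to_down(cond).view(self.bottleneck, self.hidden)
        w_up = self.to_up(cond).view(self.hidden, self.bottleneck)
        # plugged into an adapter as: h + w_up @ relu(w_down @ h)
        return w_down, w_up

# usage (sketch): hyper = SharedHypernet(n_tasks=8, n_layers=12, n_positions=2)
#                 w_down, w_up = hyper(torch.tensor(0), torch.tensor(3), torch.tensor(1))
```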


Compacter: Efficient Low-Rank Hypercomplex Adapter Layers

June 2021 · 22 Reads · 2 Citations

Adapting large-scale pretrained language models to downstream tasks via fine-tuning is the standard method for achieving state-of-the-art performance on NLP benchmarks. However, fine-tuning all weights of models with millions or billions of parameters is sample-inefficient, unstable in low-resource settings, and wasteful as it requires storing a separate copy of the model for each task. Recent work has developed parameter-efficient fine-tuning methods, but these approaches either still require a relatively large number of parameters or underperform standard fine-tuning. In this work, we propose Compacter, a method for fine-tuning large-scale language models with a better trade-off between task performance and the number of trainable parameters than prior work. Compacter accomplishes this by building on top of ideas from adapters, low-rank optimization, and parameterized hypercomplex multiplication layers. Specifically, Compacter inserts task-specific weight matrices into a pretrained model's weights, which are computed efficiently as a sum of Kronecker products between shared ``slow'' weights and ``fast'' rank-one matrices defined per Compacter layer. By only training 0.047% of a pretrained model's parameters, Compacter performs on par with standard fine-tuning on GLUE and outperforms fine-tuning in low-resource settings. Our code is publicly available in https://github.com/rabeehk/compacter/
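
The parameterization described above can be sketched as an adapter projection whose weight is a sum of Kronecker products between shared "slow" n x n matrices and per-layer rank-one "fast" matrices. The class name, dimensions, and initialization below are illustrative assumptions, not the released Compacter code.

```python
import torch
import torch.nn as nn

class PHMDownProjection(nn.Module):
    """W (in_dim x out_dim) = sum_i  A_i (n x n)  kron  B_i,  with B_i = s_i t_i^T."""
    def __init__(self, shared_A: nn.Parameter, in_dim: int, out_dim: int):
        super().__init__()
        n = shared_A.shape[1]                      # shared_A: (n, n, n), shared across layers/tasks
        self.A = shared_A                          # "slow" weights
        self.s = nn.Parameter(torch.randn(n, in_dim // n, 1))   # "fast" rank-one factors
        self.t = nn.Parameter(torch.randn(n, 1, out_dim // n))

    def forward(self, x):
        B = self.s @ self.t                        # (n, in/n, out/n), rank one each
        W = torch.stack([torch.kron(self.A[i], B[i])
                         for i in range(self.A.shape[0])]).sum(0)
        return x @ W                               # (batch, out_dim)

# usage (sketch): shared_A = nn.Parameter(torch.randn(4, 4, 4))
#                 down = PHMDownProjection(shared_A, in_dim=768, out_dim=96)
```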



ParsiNLU: A Suite of Language Understanding Challenges for Persian

December 2020 · 337 Reads

Despite the progress made in recent years on natural language understanding (NLU) challenges, the majority of this progress remains concentrated on resource-rich languages such as English. This work focuses on Persian, one of the most widely spoken languages in the world, for which few NLU datasets are available. The availability of high-quality evaluation datasets is a necessity for reliably assessing progress on different NLU tasks and domains. We introduce ParsiNLU, the first Persian benchmark that includes a range of high-level tasks -- Reading Comprehension, Textual Entailment, etc. These datasets are collected in a multitude of ways, often involving manual annotation by native speakers, resulting in over 14.5k new instances across 6 distinct NLU tasks. In addition, we present the first results of state-of-the-art monolingual and multilingual pretrained language models on this benchmark and compare them with human performance, which provides valuable insights into our ability to tackle natural language understanding challenges in Persian. We hope ParsiNLU fosters further research and advances in Persian language understanding.



Simple but effective techniques to reduce biases

September 2019 · 19 Reads · 1 Citation

There have been several studies recently showing that strong natural language inference (NLI) models are prone to relying on unwanted dataset biases, resulting in models which fail to capture the underlying generalization and are likely to perform poorly in real-world scenarios. Biases are identified as statistical cues or superficial heuristics correlated with certain labels that are effective for the majority of examples but fail on more challenging, hard examples. In this work, we propose several learning strategies to train neural models which are more robust to such biases and transfer better to out-of-domain datasets. We first introduce an additive lightweight model which learns dataset biases. We then use its predictions to adjust the loss of the base model and reduce the biases. In other words, our methods down-weight the importance of the biased examples and focus training on hard examples which require grounded reasoning to deduce the label. Our approaches are model-agnostic and simple to implement. We experiment on large-scale natural language inference and fact-verification datasets and show that our debiased models obtain significant gains over the baselines on several challenging out-of-domain datasets.
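
One of the strategies described above can be sketched as a product-of-experts-style training loss: the detached log-probabilities of a bias-only model are added to the main model's before the cross-entropy, which effectively down-weights examples the bias model already answers; at test time the main model is used alone. This is a generic formulation under stated assumptions, not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def debiased_training_loss(main_logits, bias_logits, labels):
    """Combine the log-probabilities of the main and bias-only models, then
    take the cross-entropy; the bias-only model receives no gradient."""
    log_p_main = F.log_softmax(main_logits, dim=-1)
    log_p_bias = F.log_softmax(bias_logits.detach(), dim=-1)
    combined = log_p_main + log_p_bias          # log of the (unnormalized) product
    return F.cross_entropy(combined, labels)    # cross_entropy renormalizes via softmax
```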


Citations (9)


... P-Tuning v2 [23] introduces continuous, trainable prompts at each layer of the model, significantly enhancing its adaptability and flexibility, and allowing it to better handle complex natural language understanding (NLU) tasks. PERFECT [24] enhances the model's ability to adapt to downstream tasks by adding task-specific learnable adapters, thereby bypassing the need for designing templates and verbalizers. However, these methods require ample training data to achieve satisfactory results. ...

Reference:

Construction of Prompt Verbalizer Based on Dynamic Search Tree for Text Classification
Prompt-free and Efficient Few-shot Learning with Language Models
  • Citing Conference Paper
  • January 2022

... Various approaches to low-rank adaptation have been proposed to enhance these techniques (Renduchintala et al., 2024; Sheng et al., 2024; Zhang et al., 2023; Xia et al., 2024; Wang et al., 2023b; Hao et al., 2024; Wang et al., 2025), including efforts to train models from scratch (Kamalakara et al., 2022; Wang et al., 2023a; Zhao et al., 2023). Broadly, memory-efficient training also encompasses methods such as adapters (Houlsby et al., 2019; Karimi Mahabadi et al., 2021), which insert trainable layers, and prompt tuning (Li & Liang, 2021; Lester et al., 2021), which optimizes continuous prompts. Additionally, its combination with quantization techniques (Kwon et al., 2022) and other methods that update subparts of the parameter vector (Guo et al., 2021; Ben Zaken et al., 2022; Sung et al., 2021) are also relevant. ...

Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks
  • Citing Conference Paper
  • January 2021

... [30] ConRe encourages the student model to assign less attention to samples that the teacher model considers biased: L = −S(p_t, p_b^i) · log p_d, where S(p_t, p_b^i) denotes the soft predictions with temperature p_b^i. In addition to the above basic debiasing methods, we also compare our method with some complex baselines such as Learned-Mixin [1], Modeling Bias [19], Soft Label Encoding [12], Debias Mask [21], and DCT [18]. ...

End-to-End Bias Mitigation by Modelling Biases in Corpora
  • Citing Conference Paper
  • January 2020

... Mahabadi et al. [18] introduced three new strategies to reduce bias: the first is an ensemble-based approach that combines multiple probabilistic models of the same data by multiplying the probabilities together and then renormalizing them. The idea is still to combine the probability distributions of the problem-only model and the main model, enabling them to make predictions based on different characteristics of the input. ...

Simple but effective techniques to reduce biases
  • Citing Preprint
  • September 2019

... The quantization inevitably introduces noise that degrades the reconstruction performance of image-based BCS. Many researchers have thus studied the effects of quantization on the reconstruction performance of the quantized CS in recent years and proposed robust strategies to accommodate the quantization errors [16][17][18][19][20]. The authors of ref. [16] presented a detailed theoretical analysis of the effects of quantization-induced noise on the reconstruction performance of CS. ...

A Learning-Based Framework for Quantized Compressed Sensing
  • Citing Article
  • February 2019

IEEE Signal Processing Letters

... However, random sampling is susceptible to some major limitations, which make it practically insignificant. Thus, to mitigate the limitations of random subsampling, one should aim for structured subsampling of the signal of interest, as shown in [7] and [18]. The fundamental query at this point is, "Which structured samples should one aim towards?". ...

Real-Time DCT Learning-based Reconstruction of Neural Signals
  • Citing Conference Paper
  • September 2018

... We would like to mention that there are cases and frameworks where one can define a non-Euclidean version of gradient descent (see [HKM+18]). The only thing that remains to be done is to prove that NNet also fits in the decorated framework. Thus, we are going to consider its hom-decorated counterpart as a subcategory of the category Para. ...

A Non-Euclidean Gradient Descent Framework for Non-Convex Matrix Factorization
  • Citing Article
  • September 2018

IEEE Transactions on Signal Processing

... Statistical experiment design techniques for MRI sampling prediction were proposed that used the Cramer-Rao lower bound [33], [34]. Later, the greedy algorithm and its variations were used to learn a single population-adaptive sampling pattern over a training set of images with a specific choice of reconstruction method [35], [36]. Since these approaches learn the undersampling pattern using greedy algorithms over a large number of images, the computational cost involved is high and it scales quadratically with the number of lines in the mask. ...

Learning-Based Compressive MRI

IEEE Transactions on Medical Imaging

... The theory developed in this paper, on the one hand, is a generalization of the well-known self-concordance notion developed in [45]; on the other hand, it also covers the work in [1, 61, 72] as specific examples. Several concrete applications and extensions of the self-concordance notion can also be found in the literature, including [28, 32, 49, 53]. Recently, [14] exploited smooth structures of exponential functions to design interior-point methods for solving two fundamental problems in scientific computing called matrix scaling and balancing. ...

Scalable Sparse Covariance Estimation via Self-Concordance

Proceedings of the AAAI Conference on Artificial Intelligence