Ali Modarressi’s scientific contributions


Publications (19)


BERT on a Data Diet: Finding Important Examples by Gradient-Based Pruning
  • Preprint
  • November 2022 · 24 Reads
  • Ehsan Aghazadeh · Ali Modarressi · [...]

Current pre-trained language models rely on large datasets to achieve state-of-the-art performance. However, past research has shown that not all examples in a dataset are equally important during training. In fact, it is sometimes possible to prune a considerable fraction of the training set while maintaining test performance. Two gradient-based scoring metrics for finding important examples, established on standard vision benchmarks, are GraNd and its estimated variant, EL2N. In this work, we employ these two metrics in NLP for the first time. We demonstrate that these metrics need to be computed after at least one epoch of fine-tuning and that they are not reliable in the early steps. Furthermore, we show that by pruning a small portion of the examples with the highest GraNd/EL2N scores, we can not only preserve test accuracy but also surpass it. This paper details the adjustments and implementation choices that enable GraNd and EL2N to be applied to NLP.
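
The EL2N score itself is straightforward to reproduce. The sketch below is written against a hypothetical HuggingFace-style classifier (`model`, `train_loader`, and the `.logits` output field are assumptions, not the authors' code); it computes per-example EL2N scores and prunes the top-scoring fraction, mirroring the procedure the abstract describes. GraNd would instead use per-example gradient norms.

```python
import torch
import torch.nn.functional as F

def el2n_scores(model, dataloader, num_classes, device="cpu"):
    """EL2N score per example: the L2 norm of the error vector
    ||softmax(logits) - onehot(y)||_2. Per the abstract, compute this
    only after at least one epoch of fine-tuning; earlier scores are
    unreliable."""
    model.eval()
    all_scores = []
    with torch.no_grad():
        for batch in dataloader:
            labels = batch.pop("labels").to(device)
            inputs = {k: v.to(device) for k, v in batch.items()}
            logits = model(**inputs).logits              # (batch, num_classes)
            probs = F.softmax(logits, dim=-1)
            onehot = F.one_hot(labels, num_classes).float()
            all_scores.append((probs - onehot).norm(p=2, dim=-1))
    return torch.cat(all_scores)

# Prune the small top-scoring fraction; the paper reports this can
# preserve, and sometimes surpass, full-data test accuracy.
scores = el2n_scores(model, train_loader, num_classes=2)
keep_idx = scores.argsort()[: int(0.95 * len(scores))]   # drop the top 5%
```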


GlobEnc: Quantifying Global Token Attribution by Incorporating the Whole Encoder Layer in Transformers
  • May 2022 · 16 Reads

There has been a growing interest in interpreting the underlying dynamics of Transformers. While self-attention patterns were initially deemed the primary option, recent studies have shown that integrating other components can yield more accurate explanations. This paper introduces a novel token attribution analysis method that incorporates all the components in the encoder block and aggregates these attributions throughout the layers. Through extensive quantitative and qualitative experiments, we demonstrate that our method can produce faithful and meaningful global token attributions. Our experiments reveal that incorporating almost every encoder component results in increasingly accurate analyses in both local (single-layer) and global (whole-model) settings. Our global attribution analysis significantly outperforms previous methods on various tasks in terms of correlation with gradient-based saliency scores. Our code is freely available at https://github.com/mohsenfayyaz/GlobEnc.
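
The cross-layer aggregation step is the easiest part to illustrate. The sketch below shows a rollout-style aggregation of per-layer token-to-token attribution matrices; it assumes `layer_attributions` has already been extracted. GlobEnc's actual contribution, computing those per-layer matrices from the whole encoder block via norm-based analysis, lives in the linked repository.

```python
import numpy as np

def rollout_aggregate(layer_attributions):
    """Fold per-layer (seq_len x seq_len) token-to-token attribution
    matrices into one global map by recursive matrix multiplication,
    attention-rollout style. GlobEnc applies this kind of aggregation
    to norm-based, whole-encoder-block attributions rather than to raw
    attention weights."""
    rollout = None
    for attr in layer_attributions:
        attr = attr / attr.sum(axis=-1, keepdims=True)  # row-normalize each layer
        rollout = attr if rollout is None else attr @ rollout
    return rollout  # rollout[i, j]: influence of input token j on output position i
```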


AdapLeR: Speeding up Inference by Adaptive Length Reduction
  • March 2022 · 13 Reads

Pre-trained language models have shown stellar performance in various downstream tasks. However, this usually comes at the cost of high latency and computation, hindering their usage in resource-limited settings. In this work, we propose a novel approach for reducing the computational cost of BERT with minimal loss in downstream performance. Our method dynamically eliminates less contributing tokens through the layers, resulting in shorter sequence lengths and, consequently, lower computational cost. To determine the importance of each token representation, we train a Contribution Predictor for each layer using a gradient-based saliency method. Our experiments on several diverse classification tasks show speedups of up to 22x during inference without much sacrifice in performance. We also validate the quality of the tokens selected by our method using human annotations from the ERASER benchmark. Compared to other widely used strategies for selecting important tokens, such as saliency and attention, our proposed method has a significantly lower false positive rate in generating rationales. Our code is freely available at https://github.com/amodaresi/AdapLeR.
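
To make the mechanism concrete, here is a minimal sketch of the gradient-based saliency signal that per-layer Contribution Predictors could be trained to mimic, assuming a HuggingFace-style model that accepts `inputs_embeds`; the authors' exact training objective and token-elimination schedule are in the linked repository.

```python
import torch

def token_saliency(model, input_ids, attention_mask, labels):
    """Gradient-x-input saliency per token: an importance signal of the
    kind AdapLeR's Contribution Predictors learn to approximate
    (illustrative sketch, not the authors' exact objective)."""
    embeds = model.get_input_embeddings()(input_ids).detach().requires_grad_(True)
    out = model(inputs_embeds=embeds, attention_mask=attention_mask, labels=labels)
    out.loss.backward()
    return (embeds.grad * embeds).norm(dim=-1)  # (batch, seq_len)

# At inference, tokens whose predicted contribution falls below a
# per-layer threshold are dropped, shortening every subsequent layer.
```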




Not All Models Localize Linguistic Knowledge in the Same Place: A Layer-wise Probing on BERToids' Representations
  • September 2021 · 24 Reads

Most recent work on probing representations has focused on BERT, with the presumption that the findings might carry over to other models. In this work, we extend the probing studies to two other models in the family, namely ELECTRA and XLNet, showing that variations in pre-training objectives or architectural choices can result in different behaviors in encoding linguistic information in the representations. Most notably, we observe that ELECTRA tends to encode linguistic knowledge in the deeper layers, whereas XLNet instead concentrates it in the earlier layers. Also, the former model undergoes only a slight change during fine-tuning, whereas the latter experiences significant adjustments. Moreover, we show that drawing conclusions based on the weight-mixing evaluation strategy, which is widely used in the context of layer-wise probing, can be misleading given the norm disparity of the representations across different layers. Instead, we adopt an alternative information-theoretic probing with minimum description length, which has recently been shown to provide more reliable and informative results.
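
For readers unfamiliar with MDL probing, the sketch below shows the prequential (online-code) variant in the spirit of Voita and Titov (2020), on hypothetical inputs `X` (layer representations) and `y` (linguistic labels); it illustrates the evaluation strategy, not the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def online_codelength(X, y, fractions=(0.001, 0.01, 0.1, 0.2, 0.4, 0.8, 1.0)):
    """Prequential MDL: train a probe on a growing prefix of the data and
    pay the codelength (in bits) of predicting each next chunk. A lower
    total codelength means the property is more extractable from the layer."""
    cuts = [max(1, int(f * len(y))) for f in fractions]
    bits = cuts[0] * np.log2(len(set(y)))       # first block: uniform code
    for start, end in zip(cuts[:-1], cuts[1:]):
        probe = LogisticRegression(max_iter=1000).fit(X[:start], y[:start])
        preds = probe.predict_proba(X[start:end])
        bits += log_loss(y[start:end], preds, labels=probe.classes_,
                         normalize=False) / np.log(2)
    return bits
```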


Exploring the Role of BERT Token Representations to Explain Sentence Probing Results
  • April 2021 · 18 Reads · 1 Citation

Several studies have been carried out on revealing the linguistic features captured by BERT. This is usually achieved by training a diagnostic classifier on the representations obtained from different layers of BERT. The subsequent classification accuracy is then interpreted as the ability of the model to encode the corresponding linguistic property. Despite providing insights, these studies have left out the potential role of token representations. In this paper, we provide an analysis of BERT's representation space in search of distinct and meaningful subspaces that can explain probing results. Based on a set of probing tasks, and with the help of attribution methods, we show that BERT tends to encode meaningful knowledge in specific token representations (which are often ignored in standard classification setups), allowing the model to detect syntactic and semantic abnormalities and to distinctly separate grammatical-number and tense subspaces.
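
As a starting point for this kind of analysis, the sketch below extracts per-token representations from a chosen BERT layer, the vectors that standard sentence-level classification setups ignore, so each one can feed a diagnostic classifier. The layer index and example sentence are illustrative choices, not the paper's setup.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

sent = "The keys to the cabinet is on the table."  # number-agreement anomaly
batch = tok(sent, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).hidden_states  # (embeddings + 12 layers) of (1, seq_len, 768)

token_reps = hidden[8][0]  # layer 8: (seq_len, 768), one vector per token
# Each row can now feed a diagnostic classifier, e.g. to test whether the
# verb token's representation separates grammatical-number subspaces.
```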




Citations (8)


... However, they also result in a significant performance drop on hard or slightly modified test data (Naik et al., 2018). For example, in the area of natural language inference (NLI), models like BERT (Devlin et al., 2019) tend to misclassify premise-hypothesis pairs that contain "negation" words in their hypotheses as "contradiction," which happen to be predictive features associated with the contradiction label in certain NLI datasets (Gururangan et al., 2018; Poliak et al., 2018; Modarressi et al., 2023). ...

Reference: FairFlow: Mitigating Dataset Biases through Undecided Learning
Guide the Learner: Controlling Product of Experts Debiasing Method Based on Token Attribution Similarities
  • Citing Conference Paper
  • January 2023

... respectively. Moreover, REX-augmented LIME and SHAP outperform DecompX (Modarressi et al. 2023), a state-of-the-art explanation method for text models, on its target model. We also run a user study, which shows that REX helps end users better understand and predict the behaviors of target models. ...

DecompX: Explaining Transformers Decisions by Propagating Token Decomposition
  • Citing Conference Paper
  • January 2023

... However, whether it is preferable to model such a feature in a controlled manner or to rely on the result of a technical limitation is debatable. The technical limitation could be mitigated by using summarized input or including memory mechanisms with retrieval methods (Modarressi et al., 2023; Zhong et al., 2024), although these approaches all require extra effort in designing peripheral agent workflows. ...

RET-LLM: Towards a General Read-Write Memory for Large Language Models
  • Citing Preprint
  • May 2023

... In the case of local methods, some methods determine the relevance of each token in the prediction of the model, either by altering the data input (Modarressi et al., 2023), by studying the gradients associated with the different tokens (Kindermans et al., 2017), or through the representation of the various tokens in their respective vectors along the model (Modarressi et al., 2022). Other local methods seek to study the behaviour of the transformer models in a granular way, distinguishing at this point those that focus on the analysis of the attentional mechanism (Kobayashi et al., 2020) and on the multilayer perceptron (MLP) (Geva et al., 2022). ...

GlobEnc: Quantifying Global Token Attribution by Incorporating the Whole Encoder Layer in Transformers
  • Citing Conference Paper
  • January 2022

... Prompt compression techniques had already been explored in the era of BERT-scale (Devlin 2018) language models (Kim and Cho 2021; Modarressi, Mohebbi, and Pilehvar 2022). With the widespread success of large generative language models (Raffel et al. 2020; Brown et al. 2020) across various tasks (Zhao et al. 2024), prompt compression has garnered significant attention and can broadly be categorized into two main approaches: black-box compression and white-box compression. ...

AdapLeR: Speeding up Inference by Adaptive Length Reduction
  • Citing Conference Paper
  • January 2022

... In Table 5 we have included instances of some additional specific syntactic phenomena encoded in a series of layerwise locations. This is a fact that is explained by Fayyaz et al. [43]. In their case, they observe that it is the first three layers, out of a total of six, that encode most syntactic information, and that data on sentence length starts to disappear from the third layer onward. ...

Not All Models Localize Linguistic Knowledge in the Same Place: A Layer-wise Probing on BERToids’ Representations
  • Citing Conference Paper
  • January 2021

... [123]; gradient-based methods, which compute per-token importance scores using the partial derivatives of the output with respect to each input dimension, e.g. [110]; surrogate models, e.g. [35], which approximate the "black-box" PLM model with more interpretable models, i.e. ...

Exploring the Role of BERT Token Representations to Explain Sentence Probing Results
  • Citing Conference Paper
  • January 2021

... We utilized TensorFlow's implementation of BERTTokenizer [10] for tokenizing all complaint data in our dataset and WordNet [8] Lemmatizer from Scikit Learn to convert all tokens in a sentence to their root form. Lemmatization was preferred over Stemming as we wanted the root words to retain their context. ...

Exploring the Role of BERT Token Representations to Explain Sentence Probing Results
  • Citing Preprint
  • April 2021