Ali Modarressi’s scientific contributions


Publications (19)


Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence
  • Preprint

March 2025 · 1 Read
Ali Modarressi · Hinrich Schuetze · Nanyun Peng

Dense retrieval models are commonly used in Information Retrieval (IR) applications, such as Retrieval-Augmented Generation (RAG). Since they often serve as the first step in these systems, their robustness is critical to avoid failures. In this work, by repurposing a relation extraction dataset (e.g., Re-DocRED), we design controlled experiments to quantify the impact of heuristic biases, such as favoring shorter documents, in retrievers like Dragon+ and Contriever. Our findings reveal significant vulnerabilities: retrievers often rely on superficial patterns like over-prioritizing document beginnings, shorter documents, repeated entities, and literal matches. Additionally, they tend to overlook whether the document contains the query's answer, lacking deep semantic understanding. Notably, when multiple biases combine, models exhibit catastrophic performance degradation, selecting the answer-containing document in less than 3% of cases over a biased document without the answer. Furthermore, we show that these biases have direct consequences for downstream applications like RAG, where retrieval-preferred documents can mislead LLMs, resulting in a 34% performance drop compared to not providing any documents at all.
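
A minimal sketch of the kind of pairwise preference test the abstract describes: given a query, an answer-containing document, and a shorter distractor that repeats the query entity, check which one a dual-encoder scores higher. The Contriever checkpoint and mean pooling follow its common Hugging Face usage; the query and documents are made-up placeholders, not items from the paper's Re-DocRED-based benchmark.

```python
# Sketch: does the retriever prefer a biased distractor over the answer document?
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
model = AutoModel.from_pretrained("facebook/contriever")

def mean_pool(last_hidden, attention_mask):
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden * mask).sum(1) / mask.sum(1)

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return mean_pool(out.last_hidden_state, batch["attention_mask"])

query = "Who founded the company Acme Corp?"            # illustrative example
answer_doc = "A long passage ... Acme Corp was founded by Jane Doe in 1998 ..."
biased_doc = "Acme Corp. Acme Corp. A short passage repeating the entity early."

q, docs = embed([query]), embed([answer_doc, biased_doc])
scores = (q @ docs.T).squeeze(0)                         # dot-product relevance scores
print(scores.tolist(), "answer doc preferred:", bool(scores[0] > scores[1]))
```

Repeating such comparisons over many controlled document pairs gives the per-bias preference rates the paper reports.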


[Figures from the NoLiMa entry below. Fig. 1: haystack conflicting-information filtering pipeline; Fig. 4: needle placements in the full sweep vs. the last-2K-token sweep (placements aligned across context lengths in the last-2K setup, proportion-based in the full sweep); Fig. 5: normalized performance of GPT-4o and Llama 3.3 70B with and without distractors, with the 0.85 effective threshold marked.]
NoLiMa: Long-Context Evaluation Beyond Literal Matching
  • Preprint
  • File available

February 2025 · 73 Reads
Ali Modarressi · Hanieh Deilamsalehy · [...] · Hinrich Schütze

Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information.
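
A rough sketch of a needle-in-a-haystack sweep in the spirit of this benchmark, where the question shares no content words with the needle, so the model must infer a latent association (here, that the Semper Opera House is in Dresden). The needle/question pair, the filler text, and the `ask_llm` placeholder are illustrative assumptions, not the released NoLiMa data.

```python
# Sketch: sweep context length and needle depth with a low-lexical-overlap needle.
needle = "Yuki actually lives next to the Semper Opera House."
question = "Which character has been to Dresden?"   # no word overlap with the needle
haystack_sentences = ["Filler sentence number %d about unrelated topics." % i
                      for i in range(5000)]

def build_context(n_sentences, needle_position):
    context = haystack_sentences[:n_sentences]
    context.insert(int(needle_position * n_sentences), needle)
    return " ".join(context)

def ask_llm(context, question):
    raise NotImplementedError("call the long-context model under test here")

for n_sentences in (500, 2000, 4000):                # proxy for context length
    for position in (0.0, 0.25, 0.5, 0.75, 1.0):     # depth of the needle
        context = build_context(n_sentences, position)
        # answer = ask_llm(context, question)
        # score whether `answer` makes the Semper Opera House / Dresden connection
```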


MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment

October 2024 · 22 Reads

English-centric large language models (LLMs) often show strong multilingual capabilities. However, the multilingual performance of these models remains unclear and is not thoroughly evaluated for many languages. Most benchmarks for multilinguality focus on classic NLP tasks, or cover a minimal number of languages. We introduce MEXA, a method for assessing the multilingual capabilities of pre-trained English-centric LLMs using parallel sentences, which are available for more languages than existing downstream tasks. MEXA leverages the fact that English-centric LLMs use English as a kind of pivot language in their intermediate layers. It computes the alignment between English and non-English languages using parallel sentences to evaluate the transfer of language understanding from English to other languages. This alignment can be used to estimate model performance in other languages. We conduct studies using various parallel datasets (FLORES-200 and Bible), models (Llama family, Gemma family, Mistral, and OLMo), and established downstream tasks (Belebele, m-MMLU, and m-ARC). We explore different methods to compute embeddings in decoder-only models. Our results show that MEXA, in its default settings, achieves a statistically significant average Pearson correlation of 0.90 with three established downstream tasks across nine models and two parallel datasets. This suggests that MEXA is a reliable method for estimating the multilingual capabilities of English-centric LLMs, providing a clearer understanding of their multilingual potential and the inner workings of LLMs. Leaderboard: https://huggingface.co/spaces/cis-lmu/Mexa, Code: https://github.com/cisnlp/Mexa.
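
A simplified illustration of the alignment idea: for a set of parallel sentence pairs, score how often an English sentence's most similar embedding in the target language is its own translation. The random placeholder arrays stand in for real mean-pooled hidden states from an English-centric LLM; the exact scoring used by MEXA may differ.

```python
# Sketch: cross-lingual alignment as translation-retrieval accuracy over parallel pairs.
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 768
emb_en = rng.normal(size=(N, d))                        # placeholder English embeddings
emb_xx = emb_en + 0.1 * rng.normal(size=(N, d))         # placeholder target-language embeddings

def alignment_score(a, b):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sims = a @ b.T                                      # cosine similarity matrix
    return (sims.argmax(axis=1) == np.arange(len(a))).mean()  # fraction correctly aligned

print("alignment:", alignment_score(emb_en, emb_xx))
```

Correlating such per-language scores with downstream benchmark accuracy is what yields the Pearson correlations reported in the abstract.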


Consistent Document-Level Relation Extraction via Counterfactuals

July 2024 · 1 Read

Many datasets have been developed to train and evaluate document-level relation extraction (RE) models. Most of these are constructed using real-world data. It has been shown that RE models trained on real-world data suffer from factual biases. To evaluate and address this issue, we present CovEReD, a counterfactual data generation approach for document-level relation extraction datasets using entity replacement. We first demonstrate that models trained on factual data exhibit inconsistent behavior: while they accurately extract triples from factual data, they fail to extract the same triples after counterfactual modification. This inconsistency suggests that models trained on factual data rely on spurious signals such as specific entities and external knowledge – rather than on the input context – to extract triples. We show that by generating document-level counterfactual data with CovEReD and training models on them, consistency is maintained with minimal impact on RE performance. We release our CovEReD pipeline as well as Re-DocRED-CF, a dataset of counterfactual RE documents, to assist in evaluating and addressing inconsistency in document-level RE.
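
A toy sketch of counterfactual generation by entity replacement as described above. The document, triples, and replacement map are invented for illustration; the actual pipeline operates on Re-DocRED annotations and handles entity mentions more carefully.

```python
# Sketch: build a counterfactual document and the triples a consistent model should extract.
doc = "Barack Obama was born in Honolulu. Obama served as president of the United States."
triples = [("Barack Obama", "place of birth", "Honolulu")]

replacements = {"Barack Obama": "Angela Merkel", "Obama": "Merkel", "Honolulu": "Lyon"}

def make_counterfactual(text, mapping):
    # replace longer surface forms first so "Barack Obama" is handled before "Obama"
    for old, new in sorted(mapping.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(old, new)
    return text

def replace_triple(triple, mapping):
    return tuple(mapping.get(x, x) for x in triple)

cf_doc = make_counterfactual(doc, replacements)
expected = [replace_triple(t, replacements) for t in triples]
print(cf_doc)
print(expected)
# A consistent RE model should extract `expected` from `cf_doc` whenever it extracts
# `triples` from `doc`; divergence signals reliance on entity priors, not context.
```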



DecompX: Explaining Transformers Decisions by Propagating Token Decomposition

June 2023 · 29 Reads

An emerging solution for explaining Transformer-based models is to use vector-based analysis on how the representations are formed. However, providing a faithful vector-based explanation for a multi-layer model could be challenging in three aspects: (1) Incorporating all components into the analysis, (2) Aggregating the layer dynamics to determine the information flow and mixture throughout the entire model, and (3) Identifying the connection between the vector-based analysis and the model's predictions. In this paper, we present DecompX to tackle these challenges. DecompX is based on the construction of decomposed token representations and their successive propagation throughout the model without mixing them in between layers. Additionally, our proposal provides multiple advantages over existing solutions for its inclusion of all encoder components (especially nonlinear feed-forward networks) and the classification head. The former allows acquiring precise vectors while the latter transforms the decomposition into meaningful prediction-based values, eliminating the need for norm- or summation-based vector aggregation. According to the standard faithfulness evaluations, DecompX consistently outperforms existing gradient-based and vector-based approaches on various datasets. Our code is available at https://github.com/mohsenfayyaz/DecompX.
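
A toy illustration of the bookkeeping behind decomposition-based explanation: each token's representation is stored as a sum of per-source-token components, and a linear mixing step (a stand-in for one attention-weighted value mixing) is applied to the components so that they still sum exactly to the full output. This shows only the core idea; DecompX additionally handles nonlinear feed-forward networks and the classification head.

```python
# Sketch: propagate per-token decomposition components through one linear mixing layer.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 4, 8
x = rng.normal(size=(n_tokens, d))                     # input token representations
W_v = rng.normal(size=(d, d)) / np.sqrt(d)             # toy value projection
A = rng.random(size=(n_tokens, n_tokens))
A = A / A.sum(axis=1, keepdims=True)                   # toy attention weights

# decomposition[i, k] = contribution of input token k to token i's representation
decomposition = np.zeros((n_tokens, n_tokens, d))
decomposition[np.arange(n_tokens), np.arange(n_tokens)] = x   # each token starts as its own source

# out_i = sum_j A[i, j] * (x_j @ W_v) distributes over the per-source components
out_decomp = np.einsum("ij,jkd->ikd", A, decomposition @ W_v)
out_full = A @ (x @ W_v)

assert np.allclose(out_decomp.sum(axis=1), out_full)   # components still sum to the output
token_importance = np.linalg.norm(out_decomp, axis=-1) # per-source norms as attribution scores
print(token_importance)
```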


RET-LLM: Towards a General Read-Write Memory for Large Language Models

May 2023 · 29 Reads · 1 Citation

Large language models (LLMs) have significantly advanced the field of natural language processing (NLP) through their extensive parameters and comprehensive data utilization. However, existing LLMs lack a dedicated memory unit, limiting their ability to explicitly store and retrieve knowledge for various tasks. In this paper, we propose RET-LLM, a novel framework that equips LLMs with a general write-read memory unit, allowing them to extract, store, and recall knowledge from the text as needed for task performance. Inspired by Davidsonian semantics theory, we extract and save knowledge in the form of triplets. The memory unit is designed to be scalable, aggregatable, updatable, and interpretable. Through qualitative evaluations, we demonstrate the superiority of our proposed framework over baseline approaches in question answering tasks. Moreover, our framework exhibits robust performance in handling temporal-based question answering tasks, showcasing its ability to effectively manage time-dependent information.
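
A minimal sketch of a triplet read-write memory of the kind the abstract describes. The class and its API are hypothetical simplifications; the paper's memory is additionally designed to be scalable, aggregatable, updatable, and interpretable.

```python
# Sketch: a simple triplet store with write and read operations keyed on entities/relations.
from collections import defaultdict

class TripletMemory:
    def __init__(self):
        self.triplets = []                       # (subject, relation, object)
        self.index = defaultdict(list)           # entity/relation -> triplet ids

    def write(self, subj, rel, obj):
        tid = len(self.triplets)
        self.triplets.append((subj, rel, obj))
        for key in (subj, rel, obj):
            self.index[key.lower()].append(tid)

    def read(self, query_term):
        return [self.triplets[tid] for tid in self.index.get(query_term.lower(), [])]

memory = TripletMemory()
memory.write("Alice", "works_for", "Acme Corp")
memory.write("Acme Corp", "located_in", "Berlin")
print(memory.read("acme corp"))
# -> [('Alice', 'works_for', 'Acme Corp'), ('Acme Corp', 'located_in', 'Berlin')]
```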


Guide the Learner: Controlling Product of Experts Debiasing Method Based on Token Attribution Similarities

February 2023 · 11 Reads

Several proposals have been put forward in recent years for improving out-of-distribution (OOD) performance through mitigating dataset biases. A popular workaround is to train a robust model by re-weighting training examples based on a secondary biased model. Here, the underlying assumption is that the biased model resorts to shortcut features. Hence, those training examples that are correctly predicted by the biased model are flagged as being biased and are down-weighted during the training of the main model. However, assessing the importance of an instance merely based on the predictions of the biased model may be too naive. It is possible that the prediction of the main model can be derived from another decision-making process that is distinct from the behavior of the biased model. To circumvent this, we introduce a fine-tuning strategy that incorporates the similarity between the main and biased model attribution scores in a Product of Experts (PoE) loss function to further improve OOD performance. With experiments conducted on natural language inference and fact verification benchmarks, we show that our method improves OOD results while maintaining in-distribution (ID) performance.
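
A hedged sketch of a Product-of-Experts loss in which the biased expert's contribution is gated by the similarity between the main and biased models' token attribution scores. The gating function and names here are illustrative; the paper's exact weighting scheme may differ.

```python
# Sketch: attribution-similarity-gated Product-of-Experts debiasing loss (PyTorch).
import torch
import torch.nn.functional as F

def poe_loss(main_logits, biased_logits, main_attr, biased_attr, labels):
    # cosine similarity of per-token attribution scores, one value per example
    sim = F.cosine_similarity(main_attr, biased_attr, dim=-1)        # in [-1, 1]
    weight = sim.clamp(min=0.0).unsqueeze(-1)                        # illustrative gating choice

    # product of experts in log space, with the biased expert scaled by `weight`
    combined = F.log_softmax(main_logits, dim=-1) + weight * F.log_softmax(biased_logits, dim=-1)
    return F.cross_entropy(combined, labels)

# toy batch: 4 examples, 3 classes, attribution vectors over 10 tokens
main_logits = torch.randn(4, 3, requires_grad=True)
biased_logits = torch.randn(4, 3)
main_attr, biased_attr = torch.rand(4, 10), torch.rand(4, 10)
labels = torch.tensor([0, 2, 1, 0])
print(poe_loss(main_logits, biased_logits, main_attr, biased_attr, labels))
```

The intended effect is that examples where both models attend to the same (likely shortcut) tokens are down-weighted more strongly than examples where the main model's decision process diverges from the biased one.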




Citations (8)


... However, they also result in a significant performance drop on hard or slightly modified test data (Naik et al., 2018). For example, in the area of natural language inference (NLI), models like BERT (Devlin et al., 2019) tend to misclassify premise-hypothesis pairs that contain "negation" words in their hypotheses as "contradiction," which happen to be predictive features associated with the contradiction label in certain NLI datasets (Gururangan et al., 2018; Poliak et al., 2018; Modarressi et al., 2023). ...

Reference:

FairFlow: Mitigating Dataset Biases through Undecided Learning
Guide the Learner: Controlling Product of Experts Debiasing Method Based on Token Attribution Similarities
  • Citing Conference Paper
  • January 2023

... respectively. Moreover, REX-augmented LIME and SHAP outperform DecompX (Modarressi et al. 2023), a state-of-the-art explanation method for text models on its target model. We also run a user study, which shows that REX helps end users better understand and predict the behaviors of target models. ...

DecompX: Explaining Transformers Decisions by Propagating Token Decomposition
  • Citing Conference Paper
  • January 2023

... However, whether it is preferable to model such a feature in a controlled manner or to rely on the result of a technical limitation is debatable. The technical limitation could be mitigated by using summarized input or by including memory mechanisms with retrieval methods (Modarressi et al., 2023; Zhong et al., 2024), although these approaches all require extra effort in designing peripheral agent workflows. ...

RET-LLM: Towards a General Read-Write Memory for Large Language Models
  • Citing Preprint
  • May 2023

... In the case of local methods, some determine the relevance of each token to the model's prediction, either by altering the input data (Modarressi et al., 2023), by studying the gradients associated with the different tokens (Kindermans et al., 2017), or by studying the representations of the various tokens in their respective vectors along the model (Modarressi et al., 2022). Other local methods seek to study the behaviour of transformer models in a granular way, distinguishing those that focus on the analysis of the attention mechanism (Kobayashi et al., 2020) from those that focus on the multilayer perceptron (MLP) (Geva et al., 2022). ...

GlobEnc: Quantifying Global Token Attribution by Incorporating the Whole Encoder Layer in Transformers
  • Citing Conference Paper
  • January 2022

... Prompt compression techniques had already been explored in the era of BERT-scale (Devlin 2018) language models (Kim and Cho 2021; Modarressi, Mohebbi, and Pilehvar 2022). With the widespread success of large generative language models (Raffel et al. 2020; Brown et al. 2020) across various tasks (Zhao et al. 2024), prompt compression has garnered significant attention and can broadly be categorized into two main approaches: black-box compression and white-box compression. ...

AdapLeR: Speeding up Inference by Adaptive Length Reduction
  • Citing Conference Paper
  • January 2022

... In Table 5 we have included instances of some additional specific syntactic phenomena encoded in a series of layerwise locations. This is a fact that is explained by Fayyaz et al. [43]. In their case, they observe that it is the first three layers, out of a total of six, that encode most syntactic information, and that information about sentence length starts to disappear from the third layer onward. ...

Not All Models Localize Linguistic Knowledge in the Same Place: A Layer-wise Probing on BERToids’ Representations
  • Citing Conference Paper
  • January 2021

... [123]; gradient-based methods, which compute per-token importance scores using the partial derivatives of the output with respect to each input dimension, e.g. [110]; surrogate models, e.g. [35], which approximate the "black-box" PLM with more interpretable models, i.e. ...

Exploring the Role of BERT Token Representations to Explain Sentence Probing Results
  • Citing Conference Paper
  • January 2021

... We utilized TensorFlow's implementation of BERTTokenizer [10] for tokenizing all complaint data in our dataset and WordNet [8] Lemmatizer from Scikit Learn to convert all tokens in a sentence to their root form. Lemmatization was preferred over Stemming as we wanted the root words to retain their context. ...

Exploring the Role of BERT Token Representations to Explain Sentence Probing Results
  • Citing Preprint
  • April 2021