Mehdi Rezagholizadeh’s research while affiliated with Huawei Technologies and other places


Publications (124)


X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression
  • Preprint
  • File available

March 2025 · 2 Reads · Guihong Li · Mehdi Rezagholizadeh · Mingyu Yang · [...]

Multi-head latent attention (MLA) is designed to optimize KV cache memory through low-rank key-value joint compression. Rather than caching keys and values separately, MLA stores their compressed latent representations, reducing memory overhead while maintaining performance. Although MLA improves memory efficiency without compromising language model accuracy, its major limitation is that it must be integrated during the pre-training phase, requiring models to be trained from scratch. This raises a key question: can MLA's benefits be realized, fully or partially, in models that have already been pre-trained with different attention mechanisms? In this paper, we propose X-EcoMLA, which uses post-training distillation to upcycle Transformer-based attention into an efficient hybrid (i.e., a combination of regular attention and MLA layers) or full MLA variant through lightweight post-training adaptation, bypassing the need for extensive pre-training. We demonstrate that leveraging the dark knowledge of a well-trained model can enhance training accuracy and enable extreme KV cache compression in MLA without compromising model performance. Our results show that using an 8B teacher model allows us to compress the KV cache size of the Llama3.2-1B-Inst baseline by 6.4x while preserving 100% of its average score across multiple tasks on the LM Harness Evaluation benchmark. This is achieved with only 3.6B training tokens and about 70 GPU hours on AMD MI300 GPUs, compared to the 370K GPU hours required for pre-training the Llama3.2-1B model.
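
The core memory-saving idea, caching one small joint latent per token instead of full keys and values, can be illustrated with a minimal PyTorch sketch. This is not the paper's implementation; the module names, dimensions, and the omission of RoPE and causal masking are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankKVAttention(nn.Module):
    """Illustrative MLA-style attention: cache one small latent per token instead of full K/V."""

    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)   # joint low-rank compression of K and V
        self.k_up = nn.Linear(d_latent, d_model)      # reconstruct keys from the cached latent
        self.v_up = nn.Linear(d_latent, d_model)      # reconstruct values from the cached latent
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        latent = self.kv_down(x)                      # (b, t, d_latent): this is all that gets cached
        if latent_cache is not None:
            latent = torch.cat([latent_cache, latent], dim=1)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)  # causal masking omitted for brevity
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out), latent                # the returned latent is the new KV cache
```

In this toy configuration each token caches a 64-dimensional latent instead of 2 x 512 key/value channels, which is where the KV cache reduction comes from.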


Balcony: A Lightweight Approach to Dynamic Inference of Generative Language Models

March 2025 · 9 Reads

Deploying large language models (LLMs) in real-world applications is often hindered by strict computational and latency constraints. While dynamic inference offers the flexibility to adjust model behavior based on varying resource budgets, existing methods are frequently limited by hardware inefficiencies or performance degradation. In this paper, we introduce Balcony, a simple yet highly effective framework for depth-based dynamic inference. By freezing the pretrained LLM and inserting additional transformer layers at selected exit points, Balcony maintains the full model's performance while enabling real-time adaptation to different computational budgets. These additional layers are trained using a straightforward self-distillation loss, aligning the sub-model outputs with those of the full model. This approach requires significantly fewer training tokens and tunable parameters, drastically reducing computational costs compared to prior methods. When applied to the LLaMA3-8B model, using only 0.2% of the original pretraining data, Balcony achieves minimal performance degradation while enabling significant speedups. Remarkably, we show that Balcony outperforms state-of-the-art methods such as Flextron and Layerskip as well as other leading compression techniques on multiple models and at various scales, across a variety of benchmarks.
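
The self-distillation step described above can be sketched as follows, assuming a decoder-only model exposed as an embedding, a list of frozen transformer blocks, and a shared LM head. The exit point, the KL loss form, and the temperature are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F


def balcony_distillation_step(embed, frozen_layers, lm_head, balcony_layer,
                              input_ids, exit_at, optimizer, temperature=1.0):
    """One illustrative self-distillation step: only `balcony_layer` receives gradients."""
    with torch.no_grad():                              # the pretrained model stays frozen
        h = embed(input_ids)
        hidden_states = [h]
        for layer in frozen_layers:
            h = layer(h)
            hidden_states.append(h)
        teacher_logits = lm_head(h)                    # full-depth outputs

    # Sub-model: frozen prefix up to `exit_at`, then the trainable balcony layer + shared head.
    student_hidden = balcony_layer(hidden_states[exit_at])
    student_logits = lm_head(student_hidden)

    loss = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                    F.softmax(teacher_logits / temperature, dim=-1),
                    reduction="batchmean") * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```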


The Order Effect: Investigating Prompt Sensitivity in Closed-Source LLMs

February 2025

As large language models (LLMs) become integral to diverse applications, ensuring their reliability under varying input conditions is crucial. One key issue affecting this reliability is order sensitivity, wherein slight variations in input arrangement can lead to inconsistent or biased outputs. Although recent advances have reduced this sensitivity, the problem remains unresolved. This paper investigates the extent of order sensitivity in closed-source LLMs by conducting experiments across multiple tasks, including paraphrasing, relevance judgment, and multiple-choice questions. Our results show that input order significantly affects performance across tasks, with shuffled inputs leading to measurable declines in output accuracy. Few-shot prompting demonstrates mixed effectiveness, offering partial mitigation but failing to fully resolve the problem. These findings highlight persistent risks, particularly in high-stakes applications, and point to the need for more robust LLMs or improved input-handling techniques in future development.
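
As a rough illustration of the kind of probe such experiments rely on, the sketch below permutes the options of a multiple-choice question and measures how often the model's answer changes. `query_model` is a hypothetical wrapper around a closed-source LLM API and is not part of the paper.

```python
import itertools
import random


def order_sensitivity(question, options, query_model, n_perms=6, seed=0):
    """Fraction of option orderings on which the answer deviates from the first ordering.

    `query_model(prompt) -> str` is a hypothetical stand-in for a closed-source LLM call,
    assumed to return the text of the chosen option.
    """
    rng = random.Random(seed)
    perms = list(itertools.permutations(options))
    rng.shuffle(perms)
    answers = []
    for perm in perms[:n_perms]:
        lines = [f"{chr(65 + i)}. {opt}" for i, opt in enumerate(perm)]
        prompt = (question + "\n" + "\n".join(lines)
                  + "\nAnswer with the text of the correct option only.")
        answers.append(query_model(prompt).strip())
    return sum(a != answers[0] for a in answers) / len(answers)
```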


ReGLA: Refining Gated Linear Attention

February 2025 · 2 Reads

Recent Large Language Models (LLMs) have set themselves apart with their exceptional performance in complex language modelling tasks. However, these models are also known for their significant computational and storage requirements, primarily due to the quadratic complexity of softmax attention. To mitigate this issue, linear attention has been designed to reduce the quadratic space-time complexity inherent in standard transformers. In this work, we embarked on a comprehensive exploration of three key components that substantially impact the performance of the Gated Linear Attention module: feature maps, normalization, and the gating mechanism. We developed a feature mapping function to address crucial issues that previous proposals overlooked. We then offered further rationale for integrating normalization layers to stabilize the training process. Moreover, we explored the saturation phenomenon of the gating mechanism and augmented it with a refining module. We conducted extensive experiments and showed that our architecture outperforms previous Gated Linear Attention mechanisms across a wide range of tasks, including training from scratch and post-linearization with continual pre-training.
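
For context, a generic gated linear attention recurrence (the module the paper refines) can be written as S_t = g_t ⊙ S_{t-1} + φ(k_t) v_tᵀ with output o_t = φ(q_t)ᵀ S_t. The sketch below is illustrative only: the feature map, the gating shape, and the omitted normalization are assumptions, not ReGLA's actual design choices.

```python
import torch
import torch.nn.functional as F


def gated_linear_attention(q, k, v, gate):
    """Generic gated linear attention recurrence (illustrative, not ReGLA's exact design).

    q, k, v, gate: (batch, seq, d); gate values are assumed to lie in (0, 1),
    e.g. the sigmoid of a learned projection. Normalization is omitted for brevity.
    """
    phi = lambda x: F.elu(x) + 1                          # a simple positive feature map
    b, t, d = q.shape
    state = torch.zeros(b, d, d, dtype=q.dtype, device=q.device)
    outputs = []
    for i in range(t):
        g = gate[:, i].unsqueeze(-1)                      # (b, d, 1): per-row decay of the state
        state = g * state + phi(k[:, i]).unsqueeze(-1) * v[:, i].unsqueeze(1)
        outputs.append(torch.einsum("bd,bde->be", phi(q[:, i]), state))
    return torch.stack(outputs, dim=1)                    # (b, t, d)
```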


Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression

December 2024 · 1 Read

Several works have developed eviction policies to remove key-value (KV) pairs from the KV cache for more efficient inference. The focus has been on compressing the KV cache after the input prompt has been processed, for faster token generation. In settings with limited GPU memory, and when the input context is longer than the generation length, we show that by also compressing the KV cache during the input processing phase, larger batch sizes can be used, resulting in significantly higher throughput while still maintaining the original model's accuracy.
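
A minimal sketch of the general idea: evict low-importance KV pairs already during the input (prefill) phase so that the freed memory can be spent on a larger batch. The accumulated-attention ("heavy hitter") scoring used here is a common eviction heuristic chosen for illustration; it is not necessarily the policy used in the paper.

```python
import torch


def evict_kv_during_prefill(keys, values, attn_weights, budget):
    """Keep only `budget` KV positions per sequence, ranked by accumulated attention mass.

    keys, values: (batch, heads, seq, d_head); attn_weights: (batch, heads, q_len, seq).
    A generic heavy-hitter-style policy for illustration, not Batch-Max's exact rule.
    """
    # Total attention each cached position received, summed over heads and query positions.
    scores = attn_weights.sum(dim=(1, 2))                             # (batch, seq)
    keep = scores.topk(budget, dim=-1).indices.sort(dim=-1).values    # keep positional order
    idx = keep[:, None, :, None].expand(-1, keys.size(1), -1, keys.size(-1))
    return keys.gather(2, idx), values.gather(2, idx)
```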


Do Robot Snakes Dream like Electric Sheep? Investigating the Effects of Architectural Inductive Biases on Hallucination

October 2024

The growth in prominence of large language models (LLMs) in everyday life can be largely attributed to their generative abilities, yet some of this is also owed to the risks and costs associated with their use. On one front is their tendency to hallucinate false or misleading information, limiting their reliability. On another is the increasing focus on the computational limitations associated with traditional self-attention based LLMs, which has brought about new alternatives, in particular recurrent models, meant to overcome them. Yet it remains uncommon to consider these two concerns simultaneously. Do changes in architecture exacerbate or alleviate existing concerns about hallucinations? Do they affect how and where hallucinations occur? Through an extensive evaluation, we study how these architecture-based inductive biases affect the propensity to hallucinate. While hallucination remains a general phenomenon not limited to specific architectures, the situations in which hallucinations occur and the ease with which specific types of hallucinations can be induced can differ significantly based on the model architecture. These findings highlight the need to better understand both of these problems in conjunction with each other, and to consider how to design more universal techniques for handling hallucinations.


Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity

October 2024 · 6 Reads

We present a simple on-the-fly method for faster inference of large language models. Unlike other (self-)speculative decoding techniques, our method does not require fine-tuning or black-box optimization to generate a fixed draft model, relying instead on simple rules to generate varying draft models adapted to the input context. We show empirically that our lightweight algorithm is competitive with the current SOTA for self-speculative decoding, while being a truly plug-and-play method.
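
A rough sketch of a self-speculative draft built by skipping layers: measure, on the current context, the cosine similarity between each layer's input and output and skip layers that barely change their input. The 0.95 threshold and the per-layer criterion are illustrative assumptions, not the paper's exact rule.

```python
import torch
import torch.nn.functional as F


def select_skippable_layers(layers, hidden, threshold=0.95):
    """Pick layers whose output is nearly identical to their input on the current context.

    `layers`: iterable of transformer blocks mapping (batch, seq, d) -> (batch, seq, d).
    High input/output cosine similarity suggests the layer can be skipped when drafting.
    Illustrative rule of thumb; the paper's exact criterion may differ.
    """
    skippable = []
    with torch.no_grad():
        h = hidden
        for i, layer in enumerate(layers):
            out = layer(h)
            sim = F.cosine_similarity(h.flatten(1), out.flatten(1), dim=-1).mean()
            if sim > threshold:
                skippable.append(i)
            h = out
    return skippable


def draft_forward(layers, hidden, skippable):
    """Cheap draft pass: run only the layers that were not marked as skippable."""
    h = hidden
    for i, layer in enumerate(layers):
        if i not in skippable:
            h = layer(h)
    return h
```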


EchoAtt: Attend, Copy, then Adjust for More Efficient Large Language Models

September 2024 · 5 Reads

Large Language Models (LLMs), with their increasing depth and number of parameters, have demonstrated outstanding performance across a variety of natural language processing tasks. However, this growth in scale leads to increased computational demands, particularly during inference and fine-tuning. To address these challenges, we introduce EchoAtt, a novel framework aimed at optimizing transformer-based models by analyzing and leveraging the similarity of attention patterns across layers. Our analysis reveals that many inner layers in LLMs, especially larger ones, exhibit highly similar attention matrices. By exploiting this similarity, EchoAtt enables the sharing of attention matrices in less critical layers, significantly reducing computational requirements without compromising performance. We incorporate this approach within a knowledge distillation setup, where a pre-trained teacher model guides the training of a smaller student model. The student model selectively shares attention matrices in layers with high similarity while inheriting key parameters from the teacher. Our best results with TinyLLaMA-1.1B demonstrate that EchoAtt improves inference speed by 15%, training speed by 25%, and reduces the number of parameters by approximately 4%, all while improving zero-shot performance. These findings highlight the potential of attention matrix sharing to enhance the efficiency of LLMs, making them more practical for real-time and resource-limited applications.
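
A minimal sketch of the cross-layer attention-similarity analysis that motivates the sharing: compare each layer's attention map with the previous layer's and mark layers above a similarity threshold as candidates for reuse. The cosine metric and the threshold are illustrative assumptions, not the paper's exact procedure.

```python
import torch.nn.functional as F


def attention_sharing_plan(attn_maps, threshold=0.9):
    """Decide which layers can reuse an earlier layer's attention matrix.

    `attn_maps`: list of per-layer attention tensors of shape (batch, heads, seq, seq).
    Returns a dict mapping layer index -> index of the earlier layer to copy attention from.
    Cosine similarity of flattened maps is used here purely for illustration.
    """
    plan = {}
    for i in range(1, len(attn_maps)):
        a = attn_maps[i - 1].flatten(1)
        b = attn_maps[i].flatten(1)
        sim = F.cosine_similarity(a, b, dim=-1).mean().item()
        if sim > threshold:
            # If the previous layer is itself a copy, point back to its original source.
            plan[i] = plan.get(i - 1, i - 1)
    return plan
```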


Table captions: acceleration on SpecBench using a Vicuna-33B expert with Vicuna-68m, Vicuna-160m, and LLaMA-68m as draft candidates, with a greedy policy selecting a drafter per query; reproduced translation and summarization results (target model, draft model, temperature, draft tokens, ours vs. original, relative difference); optimization hyperparameters.
Context-Aware Assistant Selection for Improved Inference Acceleration with Large Language Models

August 2024 · 20 Reads

Despite their widespread adoption, large language models (LLMs) remain prohibitive to use under resource constraints, with their ever-growing sizes only increasing the barrier to use. One noted issue is the high latency associated with auto-regressive generation, rendering the use of large LLMs dependent on advanced computing infrastructure. Assisted decoding, where a smaller draft model guides a larger target model's generation, has helped alleviate this, but remains dependent on alignment between the two models. Thus, if the draft model is insufficiently capable on some domain relative to the target model, performance can degrade. Alternatively, one can leverage multiple draft models to better cover the expertise of the target, but when multiple black-box draft models are available, selecting an assistant without details about its construction can be difficult. To better understand this decision-making problem, we frame it as a contextual bandit, where a policy must choose a draft model based on a context. We show that even without prior knowledge of the draft models, creating an offline dataset from only the outputs of independent draft/target models and training a policy over the alignment of these outputs can accelerate inference across multiple domains, provided the candidates are effective. Further results show this to hold in various settings with multiple assisted decoding candidates, highlighting its flexibility and the advantageous role that such decision making can play.
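
A minimal sketch of the offline decision-making setup described above, assuming a context embedding per query, a drafter choice, and a scalar reward such as the draft/target token agreement rate. The ridge-regression policy and the reward definition are illustrative assumptions, not the paper's method.

```python
import numpy as np
from sklearn.linear_model import Ridge


class DrafterPolicy:
    """Illustrative contextual-bandit-style drafter selection.

    One ridge regressor per draft model predicts the reward (e.g. draft/target token
    agreement rate) from a context embedding; at inference, pick the argmax drafter.
    """

    def __init__(self, n_drafters):
        self.models = [Ridge(alpha=1.0) for _ in range(n_drafters)]

    def fit(self, contexts, drafter_ids, rewards):
        contexts, drafter_ids, rewards = map(np.asarray, (contexts, drafter_ids, rewards))
        for d, model in enumerate(self.models):
            mask = drafter_ids == d
            if mask.any():
                model.fit(contexts[mask], rewards[mask])
        return self

    def select(self, context):
        preds = [m.predict(np.asarray(context).reshape(1, -1))[0] for m in self.models]
        return int(np.argmax(preds))
```

Trained on the offline dataset, `select` takes the embedding of an incoming query and returns the index of the draft model with the highest predicted reward.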


Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models

August 2024 · 45 Reads · 47 Citations

Recently, large language models (LLMs) have shown remarkable capabilities including understanding context, engaging in logical reasoning, and generating responses. However, this is achieved at the expense of stringent computational and memory requirements, hindering their ability to effectively support long input sequences. This survey provides an inclusive review of the recent techniques and methods devised to extend the sequence length in LLMs, thereby enhancing their capacity for long-context understanding. In particular, we review and categorize a wide range of techniques, including architectural modifications such as modified positional encodings and altered attention mechanisms, which are designed to enhance the processing of longer sequences while avoiding a proportional increase in computational cost. The diverse methodologies investigated in this study can be leveraged across different phases of LLMs, i.e., training, fine-tuning, and inference, enabling LLMs to efficiently process extended sequences. The limitations of the current methodologies are discussed in the final section, along with suggestions for future research directions, underscoring the importance of sequence length in the continued advancement of LLMs.
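
As a concrete example of one family of techniques the survey covers (modified positional encodings), the sketch below shows simple RoPE position interpolation, where positions are rescaled so that a model trained on short contexts can be run on longer ones. The function and its defaults are illustrative; the survey discusses many variants of this idea (e.g., NTK-aware interpolation and YaRN) alongside other approaches.

```python
import torch


def rope_angles(seq_len, d_head, base=10000.0, scale=1.0):
    """Rotary position embedding angles with simple position interpolation.

    `scale` > 1 compresses positions (e.g. scale=4.0 stretches a model trained on 2k
    tokens toward 8k). This shows only one of the many approaches the survey reviews.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, d_head, 2).float() / d_head))
    positions = torch.arange(seq_len).float() / scale     # linear position interpolation
    angles = torch.outer(positions, inv_freq)              # (seq_len, d_head // 2)
    return angles.cos(), angles.sin()
```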


Citations (44)


... For example, DyLoRA (Valipour et al., 2022) employs this approach by randomly sampling rank values at each training step and truncating LoRA matrices accordingly. QDyLoRA (Rajabzadeh et al., 2024) extends this framework to quantized models by combining dynamic rank allocation with memory-efficient quantization. ...

Reference:

ElaLoRA: Elastic & Learnable Low-Rank Adaptation for Efficient Model Fine-Tuning
QDyLoRA: Quantized Dynamic Low-Rank Adaptation for Efficient Large Language Model Tuning
  • Citing Conference Paper
  • January 2024

... Dense retrieval algorithms (e.g., BGE-M3) have introduced concepts like embeddings to enhance both retrieval and ranking capabilities. Conversational retrieval methods, such as CHIQ (Mo et al., 2024), have attempted to improve retrieval result accuracy by considering historical search queries. None of them provide personalized results tailored to users' interaction history or profiles. ...

CHIQ: Contextual History Enhancement for Improving Query Rewriting in Conversational Search
  • Citing Conference Paper
  • January 2024

... Self-speculative decoding methods eliminate the overhead of serving two models by using a subset or modified version of the target model to generate draft tokens. Draft & Verify (Zhang et al., 2024), SWIFT (Xia et al., 2025), and Draft on the Fly (Metel et al., 2024) reduce draft generation time by selecting intermediate layers based on Bayesian optimization. Kangaroo uses a shallow sub-network as the draft model, and a lightweight adapter module is trained on it to align its representations with the full model. ...

Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity
  • Citing Conference Paper
  • January 2024

... (1) In the query-routing scenario, one line of work trains additional classification models to route queries to other LLMs for answers based on designated performance metrics [13,20,21] and routing data [43,46], which eventually form a routing pipeline with mixture-of-experts (MoE) systems [26,24] to boost the performance. Another line of work leverages routers to measure utility-efficiency tradeoffs, aiming to reconstruct model architectures to reduce inference overhead in terms of cost and execution time while maintaining utility [8,58]. ...

Context-Aware Assistant Selection for Improved Inference Acceleration with Large Language Models
  • Citing Conference Paper
  • January 2024

... Moreover, an examination of 19 languages at various resource levels in a free-form text generation task reveals a notable factuality gap, measured by FACTSCORE, between low-resource and high-resource languages, with hallucination proving significantly more severe in the former [40]. To assist a more thorough evaluation of LLM robustness in retrieval augmented generation, Thakur et al. [170] curate NoMIRACL, a human-annotated dataset covering 18 languages. Although most LLMs produce a high percentage of non-relevant information across languages, ironically, low-resource languages like Swahili and Yoruba exhibit the lowest hallucination rates, likely because their limited resources lead them to frequently respond with "I don't know." ...

“Knowing When You Don’t Know”: A Multilingual Relevance Assessment Dataset for Robust Retrieval-Augmented Generation
  • Citing Conference Paper
  • January 2024

... YaRN [401] discovered that using a ramp function to perform NTK interpolation at different ratios across dimensions could achieve better results. Building on YaRN, Resonance RoPE [537] further optimized RoPE's features using integer wavelengths. LongRoPE [102] directly uses evolution search to find optimal frequency scaling parameters for each dimension. ...

Resonance RoPE: Improving Context Length Generalization of Large Language Models
  • Citing Conference Paper
  • January 2024

... We also considered EWEK-QA (Dehghan et al., 2024), which is an enhanced citation-based QA system that improves knowledge extraction by combining an adaptive Web retriever with efficient knowledge graph (KG) integration. However, this method primarily focuses on citation-based question answering, which diverges from the scope of our work. ...

EWEK-QA : Enhanced Web and Efficient Knowledge Graph Retrieval for Citation-based Question Answering Systems
  • Citing Conference Paper
  • January 2024

... To extract an interpretable sequential progression, it was previously assumed necessary to divide the long story into shorter segments (Wagner et al., 2023). However, recent advances in NLP have introduced significant increases in the context lengths of models (Wang et al., 2024), allowing the extraction of sequences as an end-to-end task. ...

Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models
  • Citing Conference Paper
  • August 2024

... LLM Citation Generation has gained attention as a way to enhance the verifiability and transparency of model-generated responses by including citations linked to external evidence. Citation generation improves the factual accuracy of LLM answers (Gao et al., 2023a) and allows users to trace the sources of information, thereby increasing the credibility and explainability of outputs (Tahaei et al., 2024). Early work like ALCE (Gao et al., 2023a) proposed foundational methods and evaluation metrics for enabling LLMs to generate citations. ...

Efficient Citer: Tuning Large Language Models for Enhanced Answer Quality and Verification

... Although Somali NLP research has made some initial progress over the past few years in several tasks, including text preprocessing (Mohamed and Mohamed, 2023), PoS tagging (Mohammed, 2020), text classification (Adelani et al., 2023; Alabi et al., 2022), machine translation (Wang et al., 2024; Adelani et al., 2022), and text retrieval (Badel et al., 2023; Adeyemi et al., 2024), the use of PLLMs for its various applications remains in its infancy compared to other languages. Generally, existing related research employing PLLMs can be roughly put into two categories: studies that focus on the development and pre-training of language models, and works that emphasize fine-tuning models and downstream tasks, including the creation of labelled datasets. ...

CIRAL: A Test Collection for CLIR Evaluations in African Languages