Hanieh Deilamsalehy’s scientific contributions


Publications (25)


From Selection to Generation: A Survey of LLM-based Active Learning
  • Preprint

February 2025 · 21 Reads

Yu Xia · [...] · Zhouhang Xie · [...] · Julian McAuley

Active Learning (AL) has been a powerful paradigm for improving model efficiency and performance by selecting the most informative data points for labeling and training. In recent active learning frameworks, Large Language Models (LLMs) have been employed not only for selection but also for generating entirely new data instances and providing more cost-effective annotations. Motivated by the increasing importance of high-quality data and efficient model training in the era of LLMs, we present a comprehensive survey on LLM-based Active Learning. We introduce an intuitive taxonomy that categorizes these techniques and discuss the transformative roles LLMs can play in the active learning loop. We further examine the impact of AL on LLM learning paradigms and its applications across various domains. Finally, we identify open challenges and propose future research directions. This survey aims to serve as an up-to-date resource for researchers and practitioners seeking to gain an intuitive understanding of LLM-based AL techniques and deploy them to new applications.
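
As a concrete illustration of the loop described above, the following is a minimal sketch of one pool-based round in which an LLM replaces the human annotator. The `query_llm` stub, the confidence-based selection criterion, and the scikit-learn-style `fit`/`predict_proba` interface are illustrative assumptions, not methods prescribed by the survey.

# Minimal pool-based active learning loop with an LLM as annotator.
# `query_llm` is a stand-in for any chat-completion call; the classifier
# interface (fit / predict_proba) is an assumption for illustration.

def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., an API request)."""
    raise NotImplementedError

def uncertainty(model, example) -> float:
    """Higher value = model is less confident on this unlabeled example."""
    probs = model.predict_proba(example)
    return 1.0 - max(probs)

def active_learning_loop(model, labeled, unlabeled, rounds=5, batch=16):
    for _ in range(rounds):
        # 1. Selection: pick the most informative unlabeled examples.
        ranked = sorted(unlabeled, key=lambda x: uncertainty(model, x), reverse=True)
        batch_items, unlabeled = ranked[:batch], ranked[batch:]
        # 2. Annotation: ask the LLM for labels instead of a human.
        for x in batch_items:
            label = query_llm(f"Label the following example:\n{x}\nLabel:")
            labeled.append((x, label))
        # 3. Training: update the task model on the enlarged labeled set.
        model.fit([x for x, _ in labeled], [y for _, y in labeled])
    return model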


NoLiMa: Long-Context Evaluation Beyond Literal Matching
  • Preprint
  • File available

Figures: haystack conflicting-information filtering pipeline; needle placements in a full sweep vs. a last-2K-token sweep (in the last-2K setup, placement positions are aligned across context lengths, unlike the proportion-based positioning of the full sweep); normalized performance of GPT-4o and Llama 3.3 70B with and without distractors, against the 0.85 effective threshold.

February 2025 · 79 Reads

Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information.
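
For readers unfamiliar with the NIAH setup that NoLiMa extends, the following is a minimal sketch of a needle-in-a-haystack probe: a fact is inserted at varying depths of a long filler context and the model is asked a question that requires retrieving it. The `ask_llm` stub and the containment-based scoring are illustrative assumptions and do not reflect NoLiMa's carefully designed needle set or its scoring.

# Sketch of a needle-in-a-haystack probe: insert a "needle" fact at a
# chosen depth inside long filler text, then ask the model a question
# whose answer requires retrieving that fact.

def ask_llm(prompt: str) -> str:
    """Placeholder for a long-context LLM call."""
    raise NotImplementedError

def build_haystack(filler_sentences, needle, depth: float) -> str:
    """Place the needle at a relative depth in [0, 1] of the context."""
    idx = int(len(filler_sentences) * depth)
    return " ".join(filler_sentences[:idx] + [needle] + filler_sentences[idx:])

def run_probe(filler_sentences, needle, question, gold_answer,
              depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    scores = {}
    for d in depths:
        context = build_haystack(filler_sentences, needle, d)
        answer = ask_llm(f"{context}\n\nQuestion: {question}\nAnswer:")
        # Crude scoring: exact containment of the gold answer string.
        scores[d] = float(gold_answer.lower() in answer.lower())
    return scores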


Personalized Graph-Based Retrieval for Large Language Models

January 2025 · 39 Reads

As large language models (LLMs) evolve, their ability to deliver personalized and context-aware responses offers transformative potential for improving user experiences. Existing personalization approaches, however, often rely solely on user history to augment the prompt, limiting their effectiveness in generating tailored outputs, especially in cold-start scenarios with sparse data. To address these limitations, we propose Personalized Graph-based Retrieval-Augmented Generation (PGraphRAG), a framework that leverages user-centric knowledge graphs to enrich personalization. By directly integrating structured user knowledge into the retrieval process and augmenting prompts with user-relevant context, PGraphRAG enhances contextual understanding and output quality. We also introduce the Personalized Graph-based Benchmark for Text Generation, designed to evaluate personalized text generation tasks in real-world settings where user history is sparse or unavailable. Experimental results show that PGraphRAG significantly outperforms state-of-the-art personalization methods across diverse tasks, demonstrating the unique advantages of graph-based retrieval for personalization.
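
The following is a minimal sketch of the general idea of graph-based retrieval for personalization: facts attached to a user node are retrieved and prepended to the prompt. The networkx representation, the keyword-overlap ranking, and the prompt template are illustrative assumptions, not PGraphRAG's actual retriever or schema.

# Sketch: retrieve facts connected to a user node in a small knowledge
# graph and prepend them to the generation prompt.
import networkx as nx

def retrieve_user_facts(graph: nx.Graph, user_id: str, query: str, k: int = 5):
    """Rank the user's neighboring facts by naive keyword overlap with the query."""
    query_terms = set(query.lower().split())
    candidates = []
    for neighbor in graph.neighbors(user_id):
        fact = graph.nodes[neighbor].get("text", "")
        overlap = len(query_terms & set(fact.lower().split()))
        candidates.append((overlap, fact))
    candidates.sort(reverse=True)
    return [fact for _, fact in candidates[:k]]

def build_prompt(graph: nx.Graph, user_id: str, query: str) -> str:
    facts = retrieve_user_facts(graph, user_id, query)
    context = "\n".join(f"- {f}" for f in facts)
    return f"Known facts about this user:\n{context}\n\nTask: {query}"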


Multi-LLM Text Summarization

Figures: centralized and decentralized topologies illustrated with a 5-LLM example (similar topologies apply to any number k of LLMs; in the centralized setting all models communicate with a central model, while in the decentralized setting each model communicates with every other model and with itself); the evaluation prompt used to score the summaries generated by the k LLMs in the conversational (decentralized) framework; and the prompt for generating the initial summary in the first round.

December 2024 · 216 Reads

In this work, we propose a multi-LLM summarization framework and investigate two strategies: centralized and decentralized. Both strategies have two fundamentally important steps at each round of conversation, generation and evaluation, whose details differ between the centralized and decentralized settings. In both strategies, k different LLMs generate diverse summaries of the text. During evaluation, however, the centralized approach leverages a single LLM to evaluate the summaries and select the best one, whereas the decentralized approach uses all k LLMs as evaluators. Overall, we find that our multi-LLM summarization approaches significantly outperform baselines that leverage only a single LLM, by up to 3x. These results indicate the effectiveness of multi-LLM approaches for summarization.
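
The following is a minimal sketch of one round of the centralized strategy described above: k models each generate a candidate summary, and a single central model evaluates the candidates and selects the best one. The `call_llm` stub and the selection prompt are illustrative assumptions, not the paper's actual prompts.

# Sketch of one centralized multi-LLM summarization round: k models each
# generate a summary, then a single evaluator model picks the best one.

def call_llm(model_name: str, prompt: str) -> str:
    """Placeholder for an API call to the named model."""
    raise NotImplementedError

def centralized_round(text: str, generator_models, evaluator_model: str) -> str:
    # Generation step: each of the k LLMs produces a candidate summary.
    candidates = [call_llm(m, f"Summarize the following text:\n{text}")
                  for m in generator_models]
    # Evaluation step: a single central LLM selects the best candidate.
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(candidates))
    choice = call_llm(
        evaluator_model,
        f"Here are {len(candidates)} candidate summaries:\n{numbered}\n"
        "Reply with the index of the best summary only.",
    )
    return candidates[int(choice.strip().strip("[]"))]

In a decentralized round, each of the k models would evaluate the candidates instead of a single central model.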


Personalized Multimodal Large Language Models: A Survey

December 2024 · 28 Reads

Multimodal Large Language Models (MLLMs) have become increasingly important due to their state-of-the-art performance and ability to integrate multiple data modalities, such as text, images, and audio, to perform complex tasks with high accuracy. This paper presents a comprehensive survey on personalized multimodal large language models, focusing on their architecture, training methods, and applications. We propose an intuitive taxonomy for categorizing the techniques used to personalize MLLMs to individual users, and discuss the techniques accordingly. Furthermore, we discuss how such techniques can be combined or adapted when appropriate, highlighting their advantages and underlying rationale. We also provide a succinct summary of personalization tasks investigated in existing research, along with the evaluation metrics commonly used. Additionally, we summarize the datasets that are useful for benchmarking personalized MLLMs. Finally, we outline critical open challenges. This survey aims to serve as a valuable resource for researchers and practitioners seeking to understand and advance the development of personalized multimodal large language models.


A Survey of Small Language Models

October 2024 · 190 Reads · 1 Citation

Small Language Models (SLMs) have become increasingly important due to their efficiency and their ability to perform a wide range of language tasks with minimal computational resources, making them ideal for many settings, including on-device, mobile, and edge deployments. In this article, we present a comprehensive survey on SLMs, focusing on their architectures, training techniques, and model compression techniques. We propose a novel taxonomy for categorizing the methods used to optimize SLMs, including model compression, pruning, and quantization techniques. We summarize the datasets that are useful for benchmarking SLMs, along with the evaluation metrics commonly used. Additionally, we highlight key open challenges that remain to be addressed. Our survey aims to serve as a valuable resource for researchers and practitioners interested in developing and deploying small yet efficient language models.
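
As a small illustration of two of the compression techniques the survey covers, the following numpy sketch applies magnitude pruning and 8-bit affine quantization to a single weight matrix. It is a generic textbook-style example, not a method proposed in the survey.

# Sketch of two common SLM compression steps on a single weight matrix:
# magnitude pruning (zero out the smallest weights) and 8-bit affine
# quantization (store weights as uint8 plus a scale and zero point).
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

def quantize_int8(weights: np.ndarray):
    """Affine (asymmetric) 8-bit quantization; returns codes, scale, zero point."""
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / 255.0 if w_max > w_min else 1.0
    zero_point = np.round(-w_min / scale)
    codes = np.clip(np.round(weights / scale + zero_point), 0, 255).astype(np.uint8)
    return codes, scale, zero_point

def dequantize_int8(codes: np.ndarray, scale: float, zero_point: float) -> np.ndarray:
    """Approximate reconstruction of the original weights."""
    return (codes.astype(np.float32) - zero_point) * scale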


Taipan: Efficient and Expressive State Space Language Models with Selective Attention

Figures: an overview of the Taipan architecture; the attention mechanisms in Taipan's Selective Attention Layers, comparing full causal attention, sliding-window attention (w = 4), and selective attention (C = 0.3, w = 5), with white areas indicating no attention; the effect of the attention budget capacity C on Taipan's performance; and a perplexity comparison of Taipan variants with and without positional embeddings across context lengths (lower perplexity is better).

October 2024 · 27 Reads

Efficient long-context language modeling remains a significant challenge in Natural Language Processing (NLP). While Transformers dominate language tasks, they struggle with long sequences due to quadratic computational complexity in training and linearly scaling memory costs during inference. Recent State Space Models (SSMs) such as Mamba offer alternatives with constant memory usage, but they underperform in tasks requiring extensive in-context retrieval. We introduce Taipan, a novel hybrid architecture that combines Mamba-2 with Selective Attention Layers (SALs). These SALs identify tokens requiring long-range interactions, remove less important features, and then augment their representations using the attention module. This approach balances Mamba's efficiency with Transformer-like performance in memory-intensive tasks. By constraining the attention budget, Taipan extends accurate predictions to context lengths of up to 1 million tokens while preserving computational efficiency. Our experiments demonstrate Taipan's superior performance across various scales and tasks, offering a promising solution for efficient long-context language modeling.
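
The following numpy sketch illustrates the three mask patterns compared in the figures above: full causal attention, sliding-window attention, and a budget-constrained selective mask in which only a C-fraction of tokens attends beyond the window. The random importance scores stand in for whatever criterion the Selective Attention Layers actually learn, so this is an illustration of the masking idea, not Taipan's implementation.

# Sketch of the three mask types: full causal attention, sliding-window
# attention, and a budget-constrained "selective" mask where only a
# C-fraction of tokens gets full causal attention.
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n: int, w: int) -> np.ndarray:
    m = causal_mask(n)
    for i in range(n):
        m[i, : max(0, i - w + 1)] = False  # only the last w positions remain visible
    return m

def selective_mask(n: int, w: int, budget: float, importance: np.ndarray) -> np.ndarray:
    """Tokens in the top `budget` fraction by importance attend causally to
    everything before them; all others fall back to the sliding window."""
    m = sliding_window_mask(n, w)
    k = int(np.ceil(budget * n))
    selected = np.argsort(importance)[-k:]
    m[selected] = causal_mask(n)[selected]
    return m

if __name__ == "__main__":
    n = 10
    scores = np.random.rand(n)  # stand-in for learned importance scores
    print(selective_mask(n, w=5, budget=0.3, importance=scores).astype(int))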


A Multi-LLM Debiasing Framework

September 2024 · 18 Reads

Large Language Models (LLMs) are powerful tools with the potential to benefit society immensely, yet they have demonstrated biases that perpetuate societal inequalities. Despite significant advancements in bias mitigation techniques using data augmentation, zero-shot prompting, and model fine-tuning, biases persist, including subtle biases that may elude human detection. Recent research has shown growing interest in multi-LLM approaches, which have been demonstrated to be effective in improving the quality of reasoning and factuality in LLMs. Building on this approach, we propose a novel multi-LLM debiasing framework aimed at reducing bias in LLMs. Our work is the first to introduce and evaluate two distinct approaches within this framework for debiasing LLMs: a centralized method, where the conversation is facilitated by a single central LLM, and a decentralized method, where all models communicate directly. Our findings reveal that our multi-LLM framework significantly reduces bias in LLMs, outperforming the baseline method across several social groups.
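
The following is a minimal sketch of what a single decentralized round could look like, with every model reading all current responses (including its own) and producing a revised answer; the prompt wording and the `call_llm` stub are illustrative assumptions, not the paper's protocol.

# Sketch of one decentralized round: every model sees all current
# responses and produces a revised, less biased answer; no single
# central model facilitates the conversation.

def call_llm(model_name: str, prompt: str) -> str:
    """Placeholder for an API call to the named model."""
    raise NotImplementedError

def decentralized_round(question: str, models, responses):
    shared = "\n".join(f"{m}: {r}" for m, r in zip(models, responses))
    revised = []
    for m in models:
        prompt = (
            f"Question: {question}\n"
            f"Current responses from all models:\n{shared}\n"
            "Revise your own response to remove any social bias."
        )
        revised.append(call_llm(m, prompt))
    return revised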



Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models

Figure: an example from the MediaSum dataset; in the SpeakerID setting, speaker names are not provided at test time, only anonymized identities such as "speaker1" and "speaker2" produced by a speaker diarization system, and a SpeakerID model must recover the actual names from the transcript.

July 2024 · 77 Reads

We introduce an approach to identifying speaker names in dialogue transcripts, a crucial task for enhancing content accessibility and searchability in digital media archives. Despite advances in speech recognition, text-based speaker identification (SpeakerID) has received limited attention and lacks large-scale, diverse datasets for effective model training. Addressing these gaps, we present a novel, large-scale dataset derived from the MediaSum corpus, encompassing transcripts from a wide range of media sources. We propose novel transformer-based models tailored for SpeakerID that leverage contextual cues within dialogues to accurately attribute speaker names. Through extensive experiments, our best model achieves a precision of 80.3%, setting a new benchmark for SpeakerID. The data and code are publicly available at https://github.com/adobe-research/speaker-identification
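
The following is a minimal sketch of one way text-based SpeakerID can be framed: extract candidate names mentioned in the transcript and score each candidate against an anonymized speaker's turns with a pretrained-LM scorer. The candidate extraction heuristic and the `score_name` stub are illustrative assumptions; the paper's actual models and data are in the linked repository.

# Sketch: recover "speaker1", "speaker2", ... labels by scoring candidate
# names mentioned in the transcript against each anonymized speaker's turns.
import re

def score_name(context: str, hypothesis: str) -> float:
    """Placeholder for a pretrained-LM scorer, e.g. the plausibility of
    `hypothesis` given `context`."""
    raise NotImplementedError

def candidate_names(transcript: str):
    """Naive candidate extraction: capitalized bigrams mentioned in the text."""
    return set(re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", transcript))

def identify_speakers(turns):
    """`turns` is a list of (speaker_id, utterance) pairs with ids like 'speaker1'."""
    transcript = "\n".join(f"{sid}: {utt}" for sid, utt in turns)
    names = candidate_names(transcript)
    if not names:
        return {}
    assignment = {}
    for sid in {sid for sid, _ in turns}:
        context = "\n".join(utt for s, utt in turns if s == sid)
        assignment[sid] = max(names, key=lambda n: score_name(context, f"{sid} is {n}"))
    return assignment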


Citations (10)


... Some models offer smaller versions that can count as SLMs, such as Gemma from Google, Llama 3.2 from Meta, Phi-3-mini from Microsoft, and Qwen 2.5 from Alibaba. Such model families can offer impressive language processing capabilities, and they are constantly improving thanks to ongoing research efforts [2]. ...

Reference:

Exploring LLM function calling for automation solutions in embedded systems via microcontroller-based IoT networks
A Survey of Small Language Models

... Drawing inspiration from the remarkable achievements of large language models (LLMs) in a wide range of tasks, Huang et al. [2025], the potential of multimodal large language models (MLLMs) is increasingly being realized in the domain of visual scene analysis. The research trajectory of Video-LLMs Lin et al. [2023], Maaz et al. [2024], Cheng et al. [2024] is progressively advancing from the understanding of short, vision-centric videos Li et al. [2023a], Jin et al. [2024], Lin et al. [2024a], Li et al. [2024d] to the integration of omni-modal information (video, audio, and subtitles) Chen et al. [2023], Duan et al. [2023], Li et al. [2023b], Tan et al. [2024], Li et al. [2024c] for comprehensive long video analysis. ...

Koala: Key Frame-Conditioned Long Video-LLM
  • Citing Conference Paper
  • June 2024

... The advent of large language models, particularly multimodal large language models (MM-LLMs), has revolutionized deep visual understanding [1], [2]. These models excel in tasks such as image captioning [3], visual question answering [4], and video summarization [5], [6]. Long videos, spanning over an hour and containing tens of thousands of frames, pose unique challenges, including maintaining long-term dependencies [7], managing complex temporal dynamics [8], and processing vast amounts of visual information [9]. ...

Scaling Up Video Summarization Pretraining with Large Language Models
  • Citing Conference Paper
  • June 2024

... The availability of large labeled data led to the increased use of supervised methods for addressing unique segmentation nuances across different domains. These approaches generally involve training a boundary classifier on a sequence of input sentences [9,30,52,53,58,65] or on pairs of left and right context blocks [39,54]. To represent the input, these methods utilize various techniques, including statistical features [14,53], neural networks [3,12,30,32,50,56,58], or transformers [4,33,35,38,39,52,65]. ...

Curricular Next Conversation Prediction Pretraining for Transcript Segmentation
  • Citing Conference Paper
  • January 2023

... As mentioned previously, in the English-language domain, research projects on text-only punctuation models such as [6,8,25,26] almost always utilize the IWSLT 2011 and 2012 datasets, both derived from the transcription of TED talks. Models derived from text and acoustics, such as [21,23,24,27], have very often used the MuST-C dataset [28], which is also derived from TED talks. ...

Boosting Punctuation Restoration with Data Generation and Reinforcement Learning
  • Citing Conference Paper
  • August 2023

... • Our top-performing model, using contextually rich sentences as guidance, outperforms the previous SOTA model CURRSUM (Sotudeh et al., 2022a), achieving improvements of 0.40, 0.82, and 4.07 in ROUGE-1, ROUGE-2, and ROUGE-L scores, respectively. Furthermore, it achieves a 2.5% higher FactCC score compared to BART, and a 3.0% increase over the original GSUM. ...

Curriculum-guided Abstractive Summarization for Mental Health Online Posts
  • Citing Conference Paper
  • January 2022

... Extractive Q&A. Prasad et al. (2023) explore extractive Q&A on meeting transcripts, although without testing generative models, and find that predictions do not stick to the sentences in the transcript and can include hallucinations. Mallick et al. (2023) propose to make a generative model generate the answer index instead of generating the complete answer to reduce hallucinations. ...

MeetingQA: Extractive Question-Answering on Meeting Transcripts
  • Citing Conference Paper
  • January 2023

... on the ROUGE-2 metric compared to LLMLingua-2, while having a higher compression ratio (12.0x→12.9x). Our proposed LLM-DCP is not optimal on the BLEU metric compared to LLMLingua-2; the main reason is that our DCP-Agent is trained on conversation data, while LLMLingua-2 is trained on MeetingBank [65], a summarization-task dataset. At the same time, this shows that the proposed LLM-DCP still achieves better prompt compression performance in the cross-task setting. ...

MeetingBank: A Benchmark Dataset for Meeting Summarization
  • Citing Conference Paper
  • January 2023

... Detection of illness is another issue dealt with via abstractive summarization. In particular, mental health problem detection has been tested for Reddit [71], and depression detection for Twitter [72]. ...

Curriculum-guided Abstractive Summarization for Mental Health Online Posts

... However, due to the lack of publicly available training data, summarization of short user-generated texts is underrepresented. Within the research area of MDS, the practicability of extractive and abstractive methods was evaluated in the context of summarizing posts on the Reddit platform (Sotudeh et al., 2021), product reviews (Angelidis et al., 2021;Oved and Levy, 2021), and Twitter streaming data (Dusart et al., 2023). ...

TLDR9+: A Large Scale Resource for Extreme Summarization of Social Media Posts