Soroush Vosoughi’s research while affiliated with Dartmouth College and other places


Publications (196)


Figure 3: The changes in AUC and logloss when perturbations are applied to the input embeddings of the original CTR methods and those after applying our method. Left y-axis represents AUC (↑) and right y-axis corresponds to logloss (↓).
Figure 4: The AUC score of DCNV2 (Wang et al. 2021), WideDeep (Cheng et al. 2016), and xDeepFM (Lian et al. 2018) on the ML-1M, Yelp2018, and Amazon-book datasets with different ratios λ_r in training.
FiGNN
The statistics of the benchmark datasets.
Scaled Supervision is an Implicit Lipschitz Regularizer
  • Preprint
  • File available

March 2025 · 6 Reads

Zhongyu Ouyang · Chunhui Zhang · Yaning Jia · Soroush Vosoughi

In modern social media, recommender systems (RecSys) rely on the click-through rate (CTR) as the standard metric to evaluate user engagement. CTR prediction is traditionally framed as a binary classification task to predict whether a user will interact with a given item. However, this approach overlooks the complexity of real-world social modeling, where the user, item, and their interactive features change dynamically in fast-paced online environments. This dynamic nature often leads to model instability, reflected in overfitting to short-term fluctuations rather than higher-level interactive patterns. While overfitting calls for more scaled and refined supervision, current solutions often rely on binary labels that overly simplify fine-grained user preferences through the thresholding process, which significantly reduces the richness of the supervision. Therefore, we aim to alleviate the overfitting problem by increasing the supervision bandwidth in CTR training. Specifically, (i) theoretically, we formulate the impact of fine-grained preferences on model stability as a Lipschitz constraint; (ii) empirically, we discover that scaling the supervision bandwidth can act as an implicit Lipschitz regularizer, stably optimizing existing CTR models to achieve better generalizability. Extensive experiments show that this scaled supervision significantly and consistently improves the optimization process and the performance of existing CTR models, even without the need for additional hyperparameter tuning.


Figure 1: The Superficial Self-Improved Reasoners phenomenon is mitigated by iterative model merging. Our method improves ID and OOD reasoning performances.
Figure 9: The weight change over layers for (i) Fine-tuning Qwen2.5-1.5B with synthetic MATH (Hendrycks et al., 2021) dataset data and limited training data (7.5k real MATH training data)
Superficial Self-Improved Reasoners Benefit from Model Merging

March 2025 · 2 Reads

Xiangchi Yuan · Chunhui Zhang · Zheyuan Liu · [...] · Wenke Lee

As scaled language models (LMs) approach human-level reasoning capabilities, self-improvement emerges as a solution to synthesizing high-quality data corpora. While previous research has identified model collapse as a risk in self-improvement, where model outputs become increasingly deterministic, we discover a more fundamental challenge: the superficial self-improved reasoners phenomenon. In particular, our analysis reveals that even when LMs show improved in-domain (ID) reasoning accuracy, they actually compromise their generalized reasoning capabilities on out-of-domain (OOD) tasks due to memorization rather than genuine reasoning. Through a systematic investigation of LM architecture, we discover that during self-improvement, LM weight updates are concentrated in less reasoning-critical layers, leading to superficial learning. To address this, we propose Iterative Model Merging (IMM), a method that strategically combines weights from original and self-improved models to preserve generalization while incorporating genuine reasoning improvements. Our approach effectively mitigates both LM collapse and superficial learning, moving towards more stable self-improving systems.
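The idea of combining weights from an original and a self-improved model can be illustrated with plain linear interpolation. This is only a minimal sketch under stated assumptions, not the IMM procedure from the paper: the abstract does not specify the merging rule, so `merge_weights`, `iterative_merge`, the mixing coefficient `alpha`, and the `self_improve` callback are all hypothetical names chosen for illustration, and models are represented as dictionaries of flat float lists.

```python
from typing import Callable, Dict, List

# Assumption: a "model" is a mapping from parameter name to a flat list of floats.
Weights = Dict[str, List[float]]


def merge_weights(base: Weights, tuned: Weights, alpha: float) -> Weights:
    """Return (1 - alpha) * base + alpha * tuned for every parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != tuned.keys():
        raise ValueError("both models must share the same parameter names")
    merged: Weights = {}
    for name in base:
        if len(base[name]) != len(tuned[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * t
            for b, t in zip(base[name], tuned[name])
        ]
    return merged


def iterative_merge(
    base: Weights,
    self_improve: Callable[[Weights], Weights],
    rounds: int,
    alpha: float,
) -> Weights:
    """Alternate a (hypothetical) self-improvement step with merging.

    Assumption: each round merges the self-improved weights back into the
    weights that round started from.
    """
    current = base
    for _ in range(rounds):
        tuned = self_improve(current)
        current = merge_weights(current, tuned, alpha)
    return current
```

A smaller `alpha` keeps the merged model closer to the starting weights, which is one simple way to trade in-domain gains against retained generalization.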


Figure 2: Comparisons of different setups for models on the MSR-VTT dataset: (a) freezing modules, (b) scales of LLMs, (c) usage of image-text pairs in pretrained BLIP-2, and (d) supervision with and without SCST. We also replicate the comparisons and ablations on other datasets (e.g., MSVD and VATEX) in App. D.4.
Fig. 2(b) and Fig. 5). The BLIP-2 framework was selected for its state-of-the-art performance on the MSCOCO image captioning benchmark, which remains the most canonical dataset for captioning evaluation. The chosen language models (OPT-2.7B, Flan-T5-XL-3B, and Vicuna-7B) are all extensively used within BLIP-2 for vision-language tasks and represent a range of architectures and parameter sizes. Their open-source nature and community adoption further enhance their relevance and comparability in this domain. The results demonstrate that Flan-T5-XL-3B, a mid-sized model, achieves superior performance in generating video captions, outperforming both the smaller OPT-2.7B and the larger Vicuna-7B on the key metric CIDEr. This challenges the notion that larger LMs always yield better results in multimodal tasks.
Figure 3: (a) temporal fusion by average vs. concatenation; (b) different resolutions.
Pretrained Image-Text Models are Secretly Video Captioners

February 2025 · 1 Read

Developing video captioning models is computationally expensive. The dynamic nature of video also complicates the design of multimodal models that can effectively caption these sequences. However, we find that by using minimal computational resources and without complex modifications to address video dynamics, an image-based model can be repurposed to outperform several specialised video captioning systems. Our adapted model demonstrates top-tier performance on major benchmarks, ranking 2nd on MSR-VTT and MSVD, and 3rd on VATEX. We transform a typical image captioning model, BLIP-2, into a competitive video captioner by post-training it with only 6,000 video-text pairs and simply concatenating frames; this is significantly less data than other methods, which use 2.5 to 144 million pairs. From a resource optimization perspective, this video captioning study focuses on three fundamental factors: optimizing model scale, maximizing data efficiency, and incorporating reinforcement learning. This extensive study demonstrates that a lightweight, image-based adaptation strategy can rival state-of-the-art video captioning systems, offering a practical solution for low-resource scenarios.


Figure 2: Frequency Distribution of Persuasion Strategies in Independently Generated Dialogues. The Y-axis indicates the proportion of each strategy used within the model-generated dialogues. Each bar represents the strategy distribution of a single dialogue, organized by generation topic. Our framework adapts to various persuasion topics.
Figure A3: Using GPT-4o for all the agents leads to fluent language, while the generations periodically go off-topic.
Figure A4: The Claude 3 model consistently refuses to generate persuasive text in scenarios that challenge moral standards.
Communication is All You Need: Persuasion Dataset Construction via Multi-LLM Communication

February 2025 · 7 Reads

Large Language Models (LLMs) have shown proficiency in generating persuasive dialogue, yet concerns about the fluency and sophistication of their outputs persist. This paper presents a multi-LLM communication framework designed to enhance the generation of persuasive data automatically. This framework facilitates the efficient production of high-quality, diverse linguistic content with minimal human oversight. Through extensive evaluations, we demonstrate that the generated data excels in naturalness, linguistic diversity, and the strategic use of persuasion, even in complex scenarios involving social taboos. The framework also proves adept at generalizing across novel contexts. Our results highlight the framework's potential to significantly advance research in both computational and social science domains concerning persuasive communication.


Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding

February 2025

Multimodal foundation models (MFMs) have demonstrated significant success in tasks such as visual captioning, question answering, and image-text retrieval. However, these models face inherent limitations due to their finite internal capacity, which restricts their ability to process extended temporal sequences, a crucial requirement for comprehensive video and audio analysis. To overcome these challenges, we introduce a specialized cognitive module, temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of MFMs. It selectively retains task-relevant information across temporal dimensions, ensuring that critical details are preserved throughout the processing of video and audio content. The TWM uses a query-guided attention approach to focus on the most informative multimodal segments within temporal sequences. By retaining only the most relevant content, TWM optimizes the use of the model's limited capacity, enhancing its temporal modeling ability. This plug-and-play module can be easily integrated into existing MFMs. With our TWM, nine state-of-the-art models exhibit significant performance improvements across tasks such as video captioning, question answering, and video-text retrieval. By enhancing temporal modeling, TWM extends the capability of MFMs to handle complex, time-sensitive data effectively. Our code is available at https://github.com/xid32/NAACL_2025_TWM.
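The retention step described above, keeping only the segments most relevant to a query, can be sketched in a few lines. This is an illustrative simplification and not the released TWM code: it swaps the paper's query-guided attention for plain cosine-similarity top-k selection, and `cosine`, `select_segments`, and the toy embedding vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_segments(
    query: Sequence[float],
    segments: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Return indices of the k segments most similar to the query.

    Indices are returned in temporal (ascending) order so that the retained
    segments can be passed downstream as a shortened sequence.
    """
    scores = [cosine(query, seg) for seg in segments]
    ranked = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy example: four segment embeddings, keep the two closest to the query.
kept = select_segments([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]], k=2)
```

Returning the indices in temporal order, rather than by score, keeps the shortened sequence chronologically consistent for whatever model consumes it.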


Figure 1: Recognizing Navajo (and other Athabaskan languages) presents significant challenges for centralized models like Google Translate. Our model addresses these challenges effectively.
Figure 2: Misdetected languages by Google Language Detection API, along with their frequency counts.
Figure 4: Family Tree for Athabaskan Languages
Is It Navajo? Accurate Language Detection in Endangered Athabaskan Languages

January 2025 · 18 Reads

Endangered languages such as Navajo, the most widely spoken Native American language, are significantly underrepresented in contemporary language technologies, exacerbating the challenges of their preservation and revitalization. This study evaluates Google's large language model (LLM)-based language identification system, which consistently misidentifies Navajo, exposing inherent limitations when applied to low-resource Native American languages. To address this, we introduce a random forest classifier trained on Navajo and eight frequently confused languages. Despite its simplicity, the classifier achieves near-perfect accuracy (97-100%), significantly outperforming Google's LLM-based system. Additionally, the model demonstrates robustness across other Athabaskan languages (a family of Native American languages spoken primarily in Alaska, the Pacific Northwest, and parts of the Southwestern United States), suggesting its potential for broader application. Our findings underscore the pressing need for NLP systems that prioritize linguistic diversity and adaptability over centralized, one-size-fits-all solutions, especially in supporting underrepresented languages in a multicultural world. This work directly contributes to ongoing efforts to address cultural biases in language models and advocates for the development of culturally localized NLP tools that serve diverse linguistic communities.



NüshuRescue: Revitalization of the endangered Nüshu Language with AI

November 2024 · 49 Reads

The preservation and revitalization of endangered and extinct languages is a meaningful endeavor, conserving cultural heritage while enriching fields like linguistics and anthropology. However, these languages are typically low-resource, making their reconstruction labor-intensive and costly. This challenge is exemplified by Nüshu, a rare script historically used by Yao women in China for self-expression within a patriarchal society. To address this challenge, we introduce NüshuRescue, an AI-driven framework designed to train large language models (LLMs) on endangered languages with minimal data. NüshuRescue automates evaluation and expands target corpora to accelerate linguistic revitalization. As a foundational component, we developed NCGold, a 500-sentence Nüshu-Chinese parallel corpus, the first publicly available dataset of its kind. Leveraging GPT-4-Turbo, with no prior exposure to Nüshu and only 35 short examples from NCGold, NüshuRescue achieved 48.69% translation accuracy on 50 withheld sentences and generated NCSilver, a set of 98 newly translated modern Chinese sentences of varying lengths. A sample of both NCGold and NCSilver is included in the Supplementary Materials. Additionally, we developed FastText-based and Seq2Seq models to further support research on Nüshu. NüshuRescue provides a versatile and scalable tool for the revitalization of endangered languages, minimizing the need for extensive human input.


ImpScore: A Learnable Metric For Quantifying The Implicitness Level of Language

November 2024 · 2 Reads

Handling implicit language is essential for natural language processing systems to achieve precise text understanding and facilitate natural interactions with users. Despite its importance, the absence of a robust metric for accurately measuring the implicitness of language significantly constrains the depth of analysis possible in evaluating models' comprehension capabilities. This paper addresses this gap by developing a scalar metric that quantifies the implicitness level of language without relying on external references. Drawing on principles from traditional linguistics, we define "implicitness" as the divergence between semantic meaning and pragmatic interpretation. To operationalize this definition, we introduce ImpScore, a novel, reference-free metric formulated through an interpretable regression model. This model is trained using pairwise contrastive learning on a specially curated dataset comprising 112,580 (implicit sentence, explicit sentence) pairs. We validate ImpScore through a user study that compares its assessments with human evaluations on out-of-distribution data, demonstrating its accuracy and strong correlation with human judgments. Additionally, we apply ImpScore to hate speech detection datasets, illustrating its utility and highlighting significant limitations in current large language models' ability to understand highly implicit content. The metric model and its training data are available at https://github.com/audreycs/ImpScore.


Figure 2: (a) The average percentage of successful adversarial attacks by TextFooler [34] on a host of models [57, 56, 16, 43] and the IMDB [47] dataset regressed with the average of knowledge continuity coefficients across all hidden layers (R² = 0.35). (b) k-Volatility as k is varied across a model's relative depth. (c) Correlation between k-volatility and adversarial vulnerability (averaged across all models shown in (b)) with respect to TextFooler [34] as k varies.
Figure 4: Regularization k-volatility for a host of vision models. We apply two adversarial attacks FGSM [24] (top row) and SI-NI-FGSM [40] (bottom row) with various attack strengths. Attack strength is measured in terms of maximum ℓ2-norm of the applied perturbation to the image.
Achieving Domain-Independent Certified Robustness via Knowledge Continuity

November 2024 · 5 Reads

We present knowledge continuity, a novel definition inspired by Lipschitz continuity which aims to certify the robustness of neural networks across input domains (such as continuous and discrete domains in vision and language, respectively). Most existing approaches that seek to certify robustness, especially Lipschitz continuity, lie within the continuous domain with norm and distribution-dependent guarantees. In contrast, our proposed definition yields certification guarantees that depend only on the loss function and the intermediate learned metric spaces of the neural network. These bounds are independent of domain modality, norms, and distribution. We further demonstrate that the expressiveness of a model class is not at odds with its knowledge continuity. This implies that achieving robustness by maximizing knowledge continuity should not theoretically hinder inferential performance. Finally, to complement our theoretical results, we present several applications of knowledge continuity such as regularization, a certification algorithm, and show that knowledge continuity can be used to localize vulnerable components of a neural network.


Citations (49)


... To achieve this, we developed CoT-RNA-Transfer, which significantly enhances RNA contact prediction through transfer learning using a publicly available protein language model. 59 Our findings suggest that structural patterns learned from proteins can be successfully transferred to RNAs, paving the way for new research opportunities. ...

Reference:

AI-integrated network for RNA complex structure and dynamic prediction
Knowledge from Large-Scale Protein Contact Prediction Models Can Be Transferred to the Data-Scarce RNA Contact Prediction Task
  • Citing Chapter
  • December 2024

... With the emergence of LVMs, they have performed well in cross-modal understanding [12,27]. Jia et al. [14] proposed to use LVMs such as GPT 4V and Gemini 1.0 Pro [36,44] to detect forged face images, and designed text prompts. However, the above methods still face two major challenges: First, the existing datasets are not diverse in scene and domain diversity, resulting in the model overfitting specific artifacts; Second, the generalization ability of the detection algorithm to new generation models (such as Runway and Kling) [42] is insufficient. ...

Is GPT-4V (ision) All You Need for Automating Academic Data Visualization? Exploring Vision-Language Models’ Capability in Reproducing Academic Charts
  • Citing Conference Paper
  • January 2024

... In addition, Hodgkinson et al. (2022); Simsekli et al. (2020); Wang et al. (2024a) proved generalization bounds dependent on the HT distributions in either model weights or the ESDs of the weight matrices, which are validated through extensive experiments. Motivated by these studies, some efforts have begun to leverage the degree of HT for model training Qing et al., 2024;, model selection (Agrawal et al., 2022;Yang et al., 2023), and model compression (Barsbey et al., 2021;, as well as to enhance model robustness (Nassar et al., 2020). ...

AlphaLoRA: Assigning LoRA Experts Based on Layer Training Quality
  • Citing Conference Paper
  • January 2024

... Based on their objections, we designed a set of examples that put into question the interpretability of prototypes. For example, Figure 1 shows an interpretation of the prototype-based network (specifically ProtoViT, Ma et al., 2024), where an image of a bird is classified based on features extracted from car images. Our proposed prototype manipulation highlights that visual confirmation bias (Klayman, 1995;Kim et al., 2022) is a threat potentially masking these models' inherent uninterpretability. ...

Interpretable Image Classification with Adaptive Prototype-based Vision Transformers

... Furthermore, LLMs seem to be particularly useful in reducing the administrative burden for clinicians when combined with audio recordings and a transcription service. [ 15 ] This is called ambient listening . In the intensive care unit (ICU), a plethora of potential use cases can be considered valuable, for example, during rounds, during family conversations, or during multidisciplinary meetings. ...

Preparing for the Widespread Adoption of Clinic Visit Recording
  • Citing Article
  • October 2024

NEJM AI

... To advance the analysis of manipulative dialogues, Wang et al. (2024b) introduces the first dataset, MentalManip, specialized for mental manipulation detection and classification. Despite their strengths, LLMs exhibit notable difficulties in identifying manipulative dialogues; in particular, the false negative rate is almost twice the false positive rate, as evidenced by our pilot study (see Section 2). ...

MentalManip: A Dataset For Fine-grained Analysis of Mental Manipulation in Conversations
  • Citing Conference Paper
  • January 2024

... In misinformation detection, the emotion and empathy perspective provides a new analytical dimension. Ma et al. [22] simulated user responses through the susceptibility to misinformation test, while [60] improved detection performance by mining the relationship between publisher emotions and social emotions. These studies indicate that introducing emotional and empathetic perspectives helps reveal the deep psychological mechanisms of misinformation propagation, providing a powerful complement to traditional technical feature analysis. ...

Simulated Misinformation Susceptibility (SMISTS): Enhancing Misinformation Research with Large Language Model Simulations
  • Citing Conference Paper
  • January 2024

... Video Swin Transformer [79] applies joint spatiotemporal attention within localized 3D windows. TimeSformer [7], ViViT [3], TESTA [102] and EVLGen [43] apply self-attention mechanism along the spatial and temporal dimensions, respectively. We also introduce an additional temporal attention module to ViT layers, but the attention is causal to realize progressive feature encoding. ...

Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction
  • Citing Conference Paper
  • January 2024

... In the only study to investigate LGBTQIA+ bias in LLMs in healthcare thus far, Xie et al. (2024) generated short sentences including LGBTQIA+ or racial identities and investigated the degree to which these identities were associated with stereotypical conditions such as HIV. 13 They found that larger models trained on biomedical corpora exhibited greater degrees of bias, implying that latent bias ...

Addressing Healthcare-related Racial and LGBTQ+ Biases in Pretrained Language Models
  • Citing Conference Paper
  • January 2024