James Caverlee’s research while affiliated with Texas A&M University and other places


Publications (289)


Figure 1: An overview of our two-part framework: (i) Using our Gendered Discourse Correlation Framework (GDCF, as shown in Figure 2), we obtain gendered discourse word lists. (ii) We then perform our Discourse Word-Embedding Association Test (D-WEAT, as shown here in Figure 1). We form parallel sentences, s and s′, by swapping masculine discourse words (e.g. "going") for feminine discourse words (e.g. "like"): s = And I was going, hey, it's cold outside..., and s′ = And I was like, hey, it's cold outside... We find that the masculine discourse words have a more stable embedding representation; this is a representational harm and a masculine default.
Figure 4: Impact of τ on the average percentage of S_m segments which move closer to the women concept (A_w) versus the men concept (A_m).
Figure 5: Impact of γ on the average percentage of S_w segments which move closer to the women concept (A_w) versus the men concept (A_m).
Significant correlations between duration, speech rate, and gender.
Significant correlations between pairs of content topics.
Topic N: Topic 3 (word list: women, woman, men, baby, pregnant, girls, men, doctor, health, birth; categories: Content - Pregnancy)
Topic M: Topic 34 (word list: patients, pain, patient, disease, treatment, injury, risk, test, type, symptoms; categories: Content - Medical)
r = 0.10


Masculine Defaults via Gendered Discourse in Podcasts and Large Language Models
  • Preprint
  • File available

April 2025 · 3 Reads

Maria Teleki · Xiangjue Dong · Haoran Liu · James Caverlee

Masculine defaults are widely recognized as a significant type of gender bias, but they are often unseen as they are under-researched. Masculine defaults involve three key parts: (i) the cultural context, (ii) the masculine characteristics or behaviors, and (iii) the reward for, or simply acceptance of, those masculine characteristics or behaviors. In this work, we study discourse-based masculine defaults, and propose a twofold framework for (i) the large-scale discovery and analysis of gendered discourse words in spoken content via our Gendered Discourse Correlation Framework (GDCF); and (ii) the measurement of the gender bias associated with these gendered discourse words in LLMs via our Discourse Word-Embedding Association Test (D-WEAT). We focus our study on podcasts, a popular and growing form of social media, analyzing 15,117 podcast episodes. We analyze correlations between gender and discourse words -- discovered via LDA and BERTopic -- to automatically form gendered discourse word lists. We then study the prevalence of these gendered discourse words in domain-specific contexts, and find that gendered discourse-based masculine defaults exist in the domains of business, technology/politics, and video games. Next, we study the representation of these gendered discourse words from a state-of-the-art LLM embedding model from OpenAI, and find that the masculine discourse words have a more stable and robust representation than the feminine discourse words, which may result in better system performance on downstream tasks for men. Hence, men are rewarded for their discourse patterns with better system performance by one of the state-of-the-art language models -- and this embedding disparity is a representational harm and a masculine default.


Learning Disentangled Equivariant Representation for Explicitly Controllable 3D Molecule Generation

April 2025

Proceedings of the AAAI Conference on Artificial Intelligence

Haoran Liu · Tianxiao Li · [...] · Martin Renqiang Min

We consider the conditional generation of 3D drug-like molecules with explicit control over molecular properties such as drug-likeness (e.g., Quantitative Estimate of Druglikeness or Synthetic Accessibility score) and effective binding to specific protein sites. To tackle this problem, we propose an E(3)-equivariant Wasserstein autoencoder and factorize the latent space of our generative model into two disentangled aspects: molecular properties and the remaining structural context of 3D molecules. Our model ensures explicit control over these molecular attributes while maintaining equivariance of coordinate representation and invariance of data likelihood. Furthermore, we introduce a novel alignment-based coordinate loss to adapt equivariant networks for auto-regressive de-novo 3D molecule generation from scratch. Extensive experiments validate our model's effectiveness on property-guided and context-guided molecule generation, both for de-novo 3D molecule design and structure-based drug discovery against protein targets.



Towards An Efficient LLM Training Paradigm for CTR Prediction

March 2025 · 5 Reads

Large Language Models (LLMs) have demonstrated tremendous potential as the next-generation ranking-based recommendation system. Many recent works have shown that LLMs can significantly outperform conventional click-through-rate (CTR) prediction approaches. Despite such promising results, the computational inefficiency inherent in the current training paradigm makes it particularly challenging to train LLMs for ranking-based recommendation tasks on large datasets. To train LLMs for CTR prediction, most existing studies adopt the prevalent "sliding-window" paradigm. Given a sequence of m user interactions, a unique training prompt is constructed for each interaction by designating it as the prediction target, with its preceding n interactions serving as context. The sliding-window paradigm thus results in an overall complexity of O(mn^2) that scales linearly with the length of user interactions. Consequently, directly adopting this strategy to train LLMs can result in prohibitively high training costs as the length of interactions grows. To alleviate the computational inefficiency, we propose a novel training paradigm, namely Dynamic Target Isolation (DTI), that structurally parallelizes the training of k (where k >> 1) target interactions. Furthermore, we identify two major bottlenecks - hidden-state leakage and positional bias overfitting - that limit DTI to scaling up to only a small value of k (e.g., 5), and propose a computationally light solution to effectively tackle each. Through extensive experiments on three widely adopted public CTR datasets, we empirically show that DTI reduces training time by an average of 92% (e.g., from 70.5 hrs to 5.31 hrs), without compromising CTR prediction performance.


Figure 1: The overall flow of the decentralized historical interests estimation stage
Figure 2: The overall flow of the decentralized interactive preference elicitation stage
Figure 3: Effectiveness of learned embeddings and adopted policy agent in FedCRS and CRIF.
Federated Conversational Recommender System

March 2025 · 6 Reads

Conversational Recommender Systems (CRSs) have become increasingly popular as a powerful tool for providing personalized recommendation experiences. By directly engaging with users in a conversational manner to learn their current and fine-grained preferences, a CRS can quickly derive recommendations that are relevant and justifiable. However, existing CRSs typically rely on a centralized training and deployment process, which involves collecting and storing explicitly-communicated user preferences in a centralized repository. These fine-grained user preferences are completely human-interpretable and can easily be used to infer sensitive information (e.g., financial status, political stances, and health information) about the user, if leaked or breached. To address the user privacy concerns in CRS, we first define a set of privacy protection guidelines for preserving user privacy under the conversational recommendation setting. Based on these guidelines, we propose a novel federated conversational recommendation framework that effectively reduces the risk of exposing user privacy by (i) de-centralizing both the historical interests estimation stage and the interactive preference elicitation stage and (ii) strictly bounding privacy leakage by enforcing user-level differential privacy with meticulously selected privacy budgets. Through extensive experiments, we show that the proposed framework not only satisfies these user privacy protection guidelines, but also enables the system to achieve competitive recommendation performance even when compared to the state-of-the-art non-private conversational recommendation approach.


Complex LLM Planning via Automated Heuristics Discovery

February 2025 · 8 Reads

We consider enhancing large language models (LLMs) for complex planning tasks. While existing methods allow LLMs to explore intermediate steps to make plans, they either depend on unreliable self-verification or external verifiers to evaluate these steps, which demand significant data and computations. Here, we propose automated heuristics discovery (AutoHD), a novel approach that enables LLMs to explicitly generate heuristic functions to guide inference-time search, allowing accurate evaluation of intermediate states. These heuristic functions are further refined through a heuristic evolution process, improving their robustness and effectiveness. Our proposed method requires no additional model training or fine-tuning, and the explicit definition of heuristic functions generated by the LLMs provides interpretability and insights into the reasoning process. Extensive experiments across diverse benchmarks demonstrate significant gains over multiple baselines, including nearly twice the accuracy on some datasets, establishing our approach as a reliable and interpretable solution for complex planning tasks.


GraphCheck: Breaking Long-Term Text Barriers with Extracted Knowledge Graph-Powered Fact-Checking

February 2025 · 23 Reads

Large language models (LLMs) are widely used, but they often generate subtle factual errors, especially in long-form text. These errors are fatal in some specialized domains such as medicine. Existing fact-checking methods that ground against documents face two main challenges: (1) they struggle to understand complex multihop relations in long documents, often overlooking subtle factual errors; (2) most specialized methods rely on pairwise comparisons, requiring multiple model calls, leading to high resource and computational costs. To address these challenges, we propose GraphCheck, a fact-checking framework that uses extracted knowledge graphs to enhance text representation. Graph Neural Networks further process these graphs as a soft prompt, enabling LLMs to incorporate structured knowledge more effectively. Enhanced with graph-based reasoning, GraphCheck captures multihop reasoning chains that are often overlooked by existing methods, enabling precise and efficient fact-checking in a single inference call. Experimental results on seven benchmarks spanning both general and medical domains demonstrate a 6.1% overall improvement over baseline models. Notably, GraphCheck outperforms existing specialized fact-checkers and achieves comparable performance with state-of-the-art LLMs, such as DeepSeek-V3 and OpenAI-o1, with significantly fewer parameters.


Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights

February 2025 · 15 Reads

We examine the reasoning and planning capabilities of large language models (LLMs) in solving complex tasks. Recent advances in inference-time techniques demonstrate the potential to enhance LLM reasoning without additional training by exploring intermediate steps during inference. Notably, OpenAI's o1 model shows promising performance through its novel use of multi-step reasoning and verification. Here, we explore how scaling inference-time techniques can improve reasoning and planning, focusing on understanding the tradeoff between computational cost and performance. To this end, we construct a comprehensive benchmark, known as Sys2Bench, and perform extensive experiments evaluating existing inference-time techniques on eleven diverse tasks across five categories, including arithmetic reasoning, logical reasoning, common sense reasoning, algorithmic reasoning, and planning. Our findings indicate that simply scaling inference-time computation has limitations, as no single inference-time technique consistently performs well across all reasoning and planning tasks.


Figure 1: Illustration of VAE and diffusion-based CF models, the proposed FlowCF, along with a trajectory comparison between diffusion process and flow in FlowCF.
Figure 3: Performance comparison under natural noise setting on MovieLens-20M and Amazon-Beauty.
Statistics of the datasets used in the experiments.
Flow Matching for Collaborative Filtering

February 2025 · 6 Reads

Generative models have shown great promise in collaborative filtering by capturing the underlying distribution of user interests and preferences. However, existing approaches struggle with inaccurate posterior approximations and misalignment with the discrete nature of recommendation data, limiting their expressiveness and real-world performance. To address these limitations, we propose FlowCF, a novel flow-based recommendation system leveraging flow matching for collaborative filtering. We tailor flow matching to the unique challenges in recommendation through two key innovations: (1) a behavior-guided prior that aligns with user behavior patterns to handle the sparse and heterogeneous user-item interactions, and (2) a discrete flow framework to preserve the binary nature of implicit feedback while maintaining the benefits of flow matching, such as stable training and efficient inference. Extensive experiments demonstrate that FlowCF achieves state-of-the-art recommendation accuracy across various datasets with the fastest inference speed, making it a compelling approach for real-world recommender systems.


Cold-Start Recommendation towards the Era of Large Language Models (LLMs): A Comprehensive Survey and Roadmap

January 2025 · 53 Reads

The cold-start problem is one of the long-standing challenges in recommender systems, focusing on accurately modeling new or interaction-limited users or items to provide better recommendations. Due to the diversification of internet platforms and the exponential growth of users and items, the importance of cold-start recommendation (CSR) is becoming increasingly evident. At the same time, large language models (LLMs) have achieved tremendous success and possess strong capabilities in modeling user and item information, providing new potential for cold-start recommendations. However, the research community on CSR still lacks a comprehensive review and reflection in this field. Based on this, in this paper we stand in the context of the era of large language models and provide a comprehensive review and discussion of the roadmap, related literature, and future directions of CSR. Specifically, we explore how existing CSR methods have evolved in the information they utilize, from content features, graph relations, and domain information to the world knowledge possessed by large language models, aiming to provide new insights for both the research and industrial communities on CSR. Related resources for cold-start recommendation are collected and continuously updated for the community at https://github.com/YuanchenBei/Awesome-Cold-Start-Recommendation.


Citations (50)


... LLMRG maintains a knowledge store to cache validated reasoning chains for later reuse, which can alleviate redundant reasoning by LLMs and help train recommenders with more informative samples. Besides, Sachdeva et al. [50] adopt an auto-regressive dataset distillation strategy [49] in LLM4Rec, achieving up to 98-120% of full-data performance using only 0.1% of the data. ...

Reference:

A Survey on Efficient Solutions of Large Language Models for Recommendation
Improving Data Efficiency for Recommenders and LLMs
  • Citing Conference Paper
  • October 2024

... The prevailing mainstream approaches leverage the exceptional zero-shot capabilities of CLIP (Radford et al. 2021) to develop "forward methods" that directly enable semantic segmentation, achieving notable progress (Xu et al. 2023; Cho et al. 2024; Xie et al. 2024; Ghiasi et al. 2022). However, as the CLIP model suffers from the class-imbalance problem (Chuang et al. 2023; Parashar et al. 2024), this issue inevitably affects CLIP-based segmentation models as well, causing them to be biased towards recognizing seldom-seen or novel categories as common ones. ...

The Neglected Tails in Vision-Language Models
  • Citing Conference Paper
  • June 2024

... Uncertain results depend on heuristics: RA-SIM [146], KG-Rank [201], SKP [38], NuTrea [27], Zeshel [118], Conll [70], RNG-KBQA [204], ArcaneQA [59], HybridRAG [154], MedGraphRAG [193], TOG2 [123], DepsRAG [3], KELP [111], KGQA [157], Graph-LLM [25], GLBench [102]. Learning-based Algorithms ...

KG-Rank: Enhancing Large Language Models for Medical QA with Knowledge Graphs and Ranking Techniques
  • Citing Conference Paper
  • January 2024

... Adversarial attacks are not unique to image classifiers; deep learning models for many different real-world tasks are vulnerable to adversarial attacks. This includes speech-to-text models [2], radio signal classification models [21], and graph models [13]. ...

Everything Perturbed All at Once: Enabling Differentiable Graph Attacks
  • Citing Conference Paper
  • May 2024

... The emergence of Large Language Models (LLMs) marks a transformative shift in recommendation systems research, offering novel approaches to overcome conventional recommendation frameworks' traditional constraints and performance limitations [43,52]. LLMs offer three key strengths: their built-in knowledge for recommendation reasoning, enhanced performance through domain-specific fine-tuning, and effective handling of cold-start scenarios [30,39]. ...

Large Language Models as Data Augmenters for Cold-Start Item Recommendation
  • Citing Conference Paper
  • May 2024

... The integration of artificial intelligence (AI), and in particular, large language models (LLMs), into clinical medicine is on the horizon [7]. LLMs have advanced text generation capability and extensive domain-specific knowledge [8,9]. Notably, these models have demonstrated their proficiency by successfully passing advanced medical examinations [10] and scoring clinical risk gradings on par with experienced physicians [11]. ...

KG-Rank: Enhancing Large Language Models for Medical QA with Knowledge Graphs and Ranking Techniques

... Fairness and Toxicity These approaches focus on protected variables such as race. Existing methods span counterfactual data augmentation (Zmigrod et al., 2019; Dinan et al., 2020; Barikeri et al., 2021), comparisons between network architectures (Meade et al., 2022), debiasing with counterfactual inference (Qian et al., 2021), adversarial training (Madanagopal and Caverlee, 2023), prompt perturbation (Guo et al., 2023), data balancing (Han et al., 2022), contrastive learning (Cheng et al., 2021), detecting toxic outputs (Schick et al., 2021), performance degradation incurred by debiasing methods (Meade et al., 2022), and benchmarks (Nadeem et al., 2021; Hartvigsen et al., 2022; Sun et al., 2022). Social debiasing methods may underperform in OOD settings because OOD examples may not contain social stereotypes or biases. ...

Bias Neutralization in Non-Parallel Texts: A Cyclic Approach with Auxiliary Guidance
  • Citing Conference Paper
  • January 2023

... Considerable prior research has examined gender differences in social media (e.g., Wang, Pappu, and Cramer (2021); Kalhor et al. (2023); Johnson et al. (2021); Wang and Horvát (2019)) and in LLMs (e.g., Dong et al. (2023); Caliskan, Bryson, and Narayanan (2017); May et al. (2019); Bolukbasi et al. (2016)). But how do masculine defaults manifest on social media? ...

Co2PT: Mitigating Bias in Pre-trained Language Models through Counterfactual Contrastive Prompt Tuning
  • Citing Conference Paper
  • January 2023

... The dynamic nature of graph data necessitates the graph class-incremental learning (GCIL) task, in which a neural network is trained on few-shot sample graphs arriving as a data stream. This approach enables class-incremental learning of new categories while maintaining discriminative capability on samples from previous categories [4][5][6][7][8][9]. ...

Robust Graph Meta-Learning for Weakly Supervised Few-Shot Node Classification
  • Citing Article
  • November 2023

ACM Transactions on Knowledge Discovery from Data

... Finally, we design an encoder of timestamps to learn . The idea of incorporating timestamps has been studied in sequential models [2, 12, 17], and we use a simplified variant of [17] to learn timestamp embeddings. For a timestamp ∈ T where each dimension is a discrete value (e.g., representing [day, hour, minute, second]), each dimension will first be encoded into a one-hot vector and further embedded with a learnable matrix. ...

Incorporating Time in Sequential Recommendation Models
  • Citing Conference Paper
  • September 2023