ChengXiang Zhai’s research while affiliated with University of Illinois Urbana-Champaign and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (491)


Figure 1: Overview of the various uses of user simulation.
Figure 3: An innovation ecosystem, where academic researchers develop open-source user simulators, which industry partners validate using real user data, thereby bridging the data divide between academia and industry.
Examples of user simulation, ranging from single actions to more complex behaviours.
User Simulation in the Era of Generative AI: User Modeling, Synthetic Data Generation, and System Evaluation
  • Preprint
  • File available

January 2025 · 11 Reads · ChengXiang Zhai

User simulation is an emerging interdisciplinary topic with multiple critical applications in the era of Generative AI. It involves creating an intelligent agent that mimics the actions of a human user interacting with an AI system, enabling researchers to model and analyze user behaviour, generate synthetic data for training, and evaluate interactive AI systems in a controlled and reproducible manner. User simulation has profound implications for diverse fields and plays a vital role in the pursuit of Artificial General Intelligence. This paper provides an overview of user simulation, highlighting its key applications and connections to various disciplines, and outlining future research directions to advance this increasingly important technology.
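As an illustration of the system-evaluation use case described in the abstract, the sketch below runs a toy simulated-user loop against a search system. The `SimulatedUser` query model, the stopping rule, and the stub system are hypothetical placeholders, not the framework proposed in the paper.

```python
import random

class SimulatedUser:
    """Toy simulated user: issues queries, scans results, stops when satisfied."""

    def __init__(self, information_need, patience=3, seed=0):
        self.information_need = information_need   # set of relevant doc ids
        self.patience = patience                   # max query reformulations
        self.rng = random.Random(seed)

    def next_query(self, turn):
        # Placeholder query model: vary a base query across turns.
        return f"query about {sorted(self.information_need)[0]} (reformulation {turn})"

    def examine(self, results):
        # Return the ids of results the simulated user judges relevant.
        return [doc_id for doc_id in results if doc_id in self.information_need]


def run_session(user, search_system, k=10):
    """Run one simulated session and return a simple satisfaction signal."""
    found = set()
    for turn in range(user.patience):
        query = user.next_query(turn)
        results = search_system(query, k)   # any callable: (query, k) -> ranked doc ids
        found |= set(user.examine(results))
        if found:                           # toy stopping rule: stop once something relevant is found
            break
    return {"turns": turn + 1, "relevant_found": len(found)}


# Example with a stub "system" that returns fixed document ids.
if __name__ == "__main__":
    user = SimulatedUser(information_need={"doc7", "doc42"})
    stub_system = lambda query, k: [f"doc{i}" for i in range(k)]
    print(run_session(user, stub_system))
```

Because the whole session is driven by code, the same user model can be replayed against different systems, which is what makes the evaluation controlled and reproducible.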


Interactive Information Need Prediction with Intent and Context

January 2025 · 3 Reads

The ability to predict a user's information need would have wide-ranging implications, from saving time and effort to mitigating vocabulary gaps. We study how to interactively predict a user's information need by letting them select a pre-search context (e.g., a paragraph, sentence, or single word) and specify an optional partial search intent (e.g., "how", "why", "applications", etc.). We examine how various generative language models can explicitly make this prediction by generating a question, as well as how retrieval models can implicitly make this prediction by retrieving an answer. We find that this prediction is possible in many cases and that user-provided partial search intent can help mitigate large pre-search contexts. We conclude that this framework is promising and suitable for real-world applications.
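A minimal sketch of the generative variant of this prediction, assuming access to any text-generation callable: the selected pre-search context and the optional partial intent are folded into a prompt, and the model is asked to produce the predicted question. The prompt wording and the `generate` function are illustrative assumptions, not the paper's setup.

```python
from typing import Optional

def build_prediction_prompt(pre_search_context: str, partial_intent: Optional[str] = None) -> str:
    """Assemble a prompt asking an LLM to predict the user's information need
    as a single question, given selected context and an optional partial intent."""
    intent_line = f"Partial search intent: {partial_intent}\n" if partial_intent else ""
    return (
        "A user selected the following passage while reading:\n"
        f"\"{pre_search_context}\"\n"
        f"{intent_line}"
        "Predict the single question the user most likely wants answered next. "
        "Answer with only the question."
    )

# Usage with any text-generation callable (hosted LLM client or local model):
# question = generate(build_prediction_prompt(
#     pre_search_context="BM25 weights terms by inverse document frequency ...",
#     partial_intent="why",
# ))
```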


Figure 2: Illustration of our proposed LMS3 method.
Figure 3: Few-shot Answer Accuracy of Llama3-8B.
What Makes In-context Learning Effective for Mathematical Reasoning: A Theoretical Analysis

December 2024 · 4 Reads · Jiayu Liu · Chaokun Wang · [...] · Enhong Chen

Owing to the capability of in-context learning, large language models (LLMs) have shown impressive performance across diverse mathematical reasoning benchmarks. However, we find that few-shot demonstrations can sometimes hurt performance, and their effect on LLMs' reasoning abilities remains unreliable. To this end, in this paper we aim to theoretically analyze the impact of in-context demonstrations on LLMs' reasoning performance. We prove that the reasoning efficacy (measured by empirical prediction loss) can be bounded by an LLM-oriented semantic similarity and an inference stability of demonstrations, a result that holds for both one-shot and few-shot scenarios. Based on this finding, we propose a straightforward, generalizable, and low-complexity demonstration selection method named LMS3. It adaptively selects the most pertinent samples for different LLMs and includes a novel demonstration rejection mechanism to automatically filter out samples that are unsuitable for few-shot learning. Through experiments on three representative benchmarks, two LLM backbones, and multiple few-shot settings, we verify that LMS3 is superior, achieving consistent improvements on all datasets, which existing methods have been unable to accomplish.
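For intuition, a generic demonstration selector along these lines can be sketched as follows: score candidate demonstrations by similarity to the test question and reject any that fall below a threshold. The embedding source, cosine similarity, and threshold value here are simplifying assumptions; the actual LMS3 criterion is LLM-oriented and additionally uses an inference-stability term.

```python
import numpy as np

def select_demonstrations(query_vec, demo_vecs, demo_texts, k=4, reject_below=0.5):
    """Pick up to k demonstrations by cosine similarity to the query embedding,
    dropping any candidate whose similarity falls below the rejection threshold."""
    query_vec = query_vec / np.linalg.norm(query_vec)
    demo_vecs = demo_vecs / np.linalg.norm(demo_vecs, axis=1, keepdims=True)
    sims = demo_vecs @ query_vec                      # cosine similarities
    order = np.argsort(-sims)                         # best candidates first
    chosen = [demo_texts[i] for i in order[:k] if sims[i] >= reject_below]
    return chosen                                     # may be empty: fall back to zero-shot

# Example with random embeddings standing in for model-derived representations.
rng = np.random.default_rng(0)
demos = [f"worked example {i}" for i in range(20)]
picked = select_demonstrations(rng.normal(size=64), rng.normal(size=(20, 64)), demos)
```

The rejection branch mirrors the idea that an unsuitable demonstration is worse than none: when every candidate is filtered out, the prompt simply reverts to zero-shot.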


Figure 6: Examples of original and rewritten questions. The entity names, numerical values, and semantics differ after rewriting, while the computational graphs remain the same.
Learning by Analogy: Enhancing Few-Shot Prompting for Math Word Problem Solving with Computational Graph-Based Retrieval

November 2024 · 3 Reads

Large language models (LLMs) are known to struggle with complicated reasoning tasks such as math word problems (MWPs). In this paper, we show how analogies drawn from similarly structured questions can improve LLMs' problem-solving capabilities for MWPs. Specifically, we retrieve problems whose computational graphs are similar to that of the given question and use them as exemplars in the prompt, providing a correct reasoning path for the generation model to refer to. Empirical results across six math word problem datasets demonstrate the effectiveness of our proposed method, which achieves significant improvements of up to 6.7 absolute percentage points on average compared to baseline methods. These results highlight our method's potential in addressing the reasoning challenges in current LLMs.
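A rough sketch of the retrieval idea, under the simplifying assumption that a computational graph can be summarized as a multiset of its operation labels; the matching score and the data fields are illustrative, not the paper's exact graph representation or retriever.

```python
from collections import Counter

def graph_signature(operations):
    """Encode a computational graph as a multiset of its operation labels,
    e.g. ['mul', 'sub'] -> Counter({'mul': 1, 'sub': 1})."""
    return Counter(operations)

def graph_overlap(sig_a, sig_b):
    """Multiset-Jaccard overlap between two graph signatures."""
    inter = sum((sig_a & sig_b).values())
    union = sum((sig_a | sig_b).values())
    return inter / union if union else 0.0

def retrieve_exemplars(test_ops, training_problems, k=4):
    """Return the k training problems whose operation structure is closest
    to the (predicted) computational graph of the test question."""
    test_sig = graph_signature(test_ops)
    scored = [(graph_overlap(test_sig, graph_signature(p["ops"])), p) for p in training_problems]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]

# Usage: each training problem carries its question text and operation list.
bank = [
    {"question": "Tom has 3 bags of 4 apples ...", "ops": ["mul"]},
    {"question": "A shirt costs 20 dollars after a 5 dollar discount ...", "ops": ["sub"]},
    {"question": "3 boxes of 6 pens, and 2 pens are removed ...", "ops": ["mul", "sub"]},
]
exemplars = retrieve_exemplars(["mul", "sub"], bank, k=2)
```

The retrieved exemplars, together with their worked solutions, would then be placed before the test question in the few-shot prompt.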


Large language models for whole-learner support: opportunities and challenges

October 2024 · 47 Reads · Frontiers in Artificial Intelligence

In recent years, large language models (LLMs) have seen rapid advancement and adoption, and are increasingly being used in educational contexts. In this perspective article, we explore the open challenge of leveraging LLMs to create personalized learning environments that support the “whole learner” by modeling and adapting to both cognitive and non-cognitive characteristics. We identify three key challenges toward this vision: (1) improving the interpretability of LLMs’ representations of whole learners, (2) implementing adaptive technologies that can leverage such representations to provide tailored pedagogical support, and (3) authoring and evaluating LLM-based educational agents. For interpretability, we discuss approaches for explaining LLM behaviors in terms of their internal representations of learners; for adaptation, we examine how LLMs can be used to provide context-aware feedback and scaffold non-cognitive skills through natural language interactions; and for authoring, we highlight the opportunities and challenges involved in using natural language instructions to specify behaviors of educational agents. Addressing these challenges will enable personalized AI tutors that can enhance learning by accounting for each student’s unique background, abilities, motivations, and socioemotional needs.


Figure 1. Left: Panel discussion. Right: Breakout group participants.
Figure 2. Depiction of the multi-layered factors of a user for the example task of learning about evaluation methodologies for RAG.
Timeline of the Sim4IA workshop.
Report on the Workshop on Simulations for Information Access (Sim4IA 2024) at SIGIR 2024

September 2024 · 24 Reads

This paper reports on the Workshop on Simulations for Information Access (Sim4IA 2024), held at SIGIR 2024. The workshop had two keynotes, a panel discussion, nine lightning talks, and two breakout sessions. Key takeaways were the importance of user simulation in academia and industry, the potential to bridge online and offline evaluation, and the challenges of organizing a companion shared task on user simulation for information access. We describe how we organized the workshop, give a brief overview of what happened, and summarize its main topics, findings, and future work.


UOUO: Uncontextualized Uncommon Objects for Measuring Knowledge Horizons of Vision Language Models

July 2024 · 16 Reads

Smaller-scale Vision-Language Models (VLMs) often claim to perform on par with larger models in general-domain visual grounding and question-answering benchmarks while offering advantages in computational efficiency and storage. However, their ability to handle rare objects, which fall into the long tail of data distributions, is less understood. To rigorously evaluate this aspect, we introduce the "Uncontextualized Uncommon Objects" (UOUO) benchmark. This benchmark focuses on systematically testing VLMs with both large and small parameter counts on rare and specialized objects. Our comprehensive analysis reveals that while smaller VLMs maintain competitive performance on common datasets, they significantly underperform on tasks involving uncommon objects. We also propose an advanced, scalable pipeline for data collection and cleaning, ensuring the UOUO benchmark provides high-quality, challenging instances. These findings highlight the need to consider long-tail distributions when assessing the true capabilities of VLMs.
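A minimal sketch of the kind of split-wise comparison such a benchmark enables, assuming a generic `vlm_answer(image, question)` inference callable and hypothetical item fields; this is not the UOUO evaluation code.

```python
def evaluate_vlm(vlm_answer, items):
    """Compute exact-match accuracy separately for common and uncommon object splits.

    `vlm_answer(image_path, question) -> str` is any VLM inference callable;
    each item carries an image path, a question, a gold answer, and its split."""
    correct = {"common": 0, "uncommon": 0}
    total = {"common": 0, "uncommon": 0}
    for item in items:
        pred = vlm_answer(item["image"], item["question"]).strip().lower()
        split = item["split"]
        total[split] += 1
        correct[split] += int(pred == item["answer"].strip().lower())
    return {s: correct[s] / total[s] for s in total if total[s]}
```

Comparing the two accuracies directly exposes the long-tail gap the abstract describes.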



Large Language Models for Relevance Judgment in Product Search

May 2024 · 16 Reads

High relevance of retrieved and re-ranked items to the search query is the cornerstone of successful product search, yet measuring the relevance of items to queries is one of the most challenging tasks in product information retrieval, and the quality of product search is highly influenced by the precision and scale of available relevance-labelled data. In this paper, we present an array of techniques for leveraging Large Language Models (LLMs) to automate the relevance judgment of query-item pairs (QIPs) at scale. Using a unique dataset of several million QIPs annotated by human evaluators, we test and optimize hyperparameters for finetuning billion-parameter LLMs with and without Low-Rank Adaptation (LoRA), as well as various modes of item attribute concatenation and prompting in LLM finetuning, and consider trade-offs in item attribute inclusion for the quality of relevance predictions. We demonstrate considerable improvement over baselines from prior generations of LLMs, as well as off-the-shelf models, achieving relevance annotations on par with human relevance evaluators. Our findings have immediate implications for the growing field of relevance judgment automation in product search.
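As a rough illustration of a LoRA-based finetuning setup of the kind described here, the sketch below wires a sequence-classification head and low-rank adapters using Hugging Face `transformers` and `peft`. The base model name, target modules, label scheme, and hyperparameters are assumptions for illustration, not the configuration used in the paper.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE_MODEL = "meta-llama/Llama-2-7b-hf"   # placeholder base model

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    BASE_MODEL, num_labels=2              # binary relevant / not relevant
)

# Wrap the base model with low-rank adapters so only a small set of
# parameters is trained; attention projections are typical targets.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

def encode_pair(query, item_attributes):
    """Concatenate the query with selected item attributes into one input,
    mirroring the attribute-concatenation modes discussed in the abstract."""
    text = f"query: {query} item: {item_attributes}"
    return tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
```

Which item attributes to concatenate, and in what order, is exactly the trade-off the paper studies: richer inputs can improve predictions but lengthen sequences and training cost.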



Citations (48)


... As a research topic, user simulation is inherently interdisciplinary, intersecting with diverse fields both within and beyond computer science. For example, it draws upon concepts from psychology, economics, and human-computer interaction to create accurate and representative models of user behaviour [5]. The recent success of large language models (LLMs) has made it possible to simulate complex user actions and led to their widespread use ...

Reference:

User Simulation in the Era of Generative AI: User Modeling, Synthetic Data Generation, and System Evaluation
User Simulation for Evaluating Information Access Systems
  • Citing Article
  • January 2024

Foundations and Trends® in Information Retrieval

... in simulation tasks across different domains and application scenarios, including generating realistic conversations for dialogue systems [19], performing automatic relevance assessment of search results [27], simulating specific human subpopulations in social science research [1], and simulating the behaviour of communities [22]. Realizing the potential of user simulation, there has been a surge of interest and activity in this area, evidenced by the growing number of related workshops [3,24] and tutorials [4,12]. This article aims to synthesize research dispersed across various fields, review the current state of the art, and highlight potential future directions. ...

Tutorial on User Simulation for Evaluating Information Access Systems on the Web
  • Citing Conference Paper
  • May 2024

... Market demand is high for valuable and efficient systems and algorithms that can give concise advice or conclusions from big datasets [1]. A major challenge of information extraction is a frequent lack of sufficiently labeled data for training machine learning algorithms due to labor intensiveness and cost of acquiring labels [2]. Generating synthetic datasets is often quite useful, particularly for testing purposes in numerous areas of computing, including artificial intelligence, data mining, data visualization and software engineering [3]. ...

Exploring Large Language Models for Low-Resource IT Information Extraction

... Yanagi et al. [65] generated comments with news articles and actual comments provided to replace actual ones. Again, recent studies explored instructing LLMs to generate highly readable and human-like responses [22,36,44,54,69]. A contemporary work, DELL [50], leverages LLMs to generate comments for social graph simulation and assists in graph-based fake news detection. ...

Decoding the Silent Majority: Inducing Belief Augmented Social Graph with Large Language Model for Response Forecasting
  • Citing Conference Paper
  • January 2023

... MemeCap As shown in Table 10, the GPT-4o prompt without template context achieved a BLEURT score of 0.525 on the whole test set. For comparison, a human-level model in (Bhavya et al., 2022) scored 0.448 on analogy generation, suggesting that our result represents a human-level annotation. The method also attained a BERTscore of 0.879, indicating high-quality annotation. ...

Analogy Generation by Prompting Large Language Models: A Case Study of InstructGPT

... With recent advances in NLP research, natural language generation methods (e.g., LLMs) have been increasingly used to enrich training scenarios [10,29]. In the domain of AI/ML, user simulation aims to extract valuable insights and patterns from user behaviors in different in-context scenarios [1,47]. This process is instrumental in adapting systems to cater to the unique requirements of specific user groups. ...

Rethinking Conversational Agents in the Era of LLMs: Proactivity, Non-collaborativity, and Beyond
  • Citing Conference Paper
  • November 2023

... Other recent examples of renewed interest in the topic of (user) simulation were the Sim4IR workshop held at SIGIR 2021 [Balog et al., 2021], the SIMIIR 2.0 framework 1 [Zerhoudi et al., 2022], tutorials [Balog and Zhai, 2023], and a recurring theme of how generative models can be used for simulation [Azzopardi et al., 2024]. At ECIR and SIGIR, a reasonable number of relevant papers on user simulation have been accepted, and a study on simulating user queries even won the best paper award at ECIR 2022 [Penha et al., 2022]. ...

Tutorial on User Simulation for Evaluating Information Access Systems
  • Citing Conference Paper
  • October 2023

... Procedural generation is also referred to as script construction [98]. As in other tasks, cooking recipes [71,95,97,123] and everyday tasks [98,112,122,123,145,163,192] like "baking a cake" or "fueling a car", drawn from WikiHow and other sources, are the most common domains. While most datasets are derived from existing resources [71,98,122,123,142,163], several are created specifically for this task, such as the proScript [145], CoScript [192], WIKIPLAN and RECIPEPLAN [97], InScript [112], and XIACHUFANG [95] datasets. ...

Incorporating Task-Specific Concept Knowledge into Script Learning
  • Citing Conference Paper
  • January 2023

... Another study created a dataset, OR-QuAC, to facilitate research on open-retrieval conversational question answering [27]. The dataset created by Ros et al. [32] is perhaps the most similar to ProCIS. It is also based on Reddit threads; however, ProCIS has multiple advantages in comparison: it is more than an order of magnitude larger in terms of both training examples and corpus size, it has a carefully annotated test set, and it uses a corpus of clean Wikipedia articles, which can be more useful for research experiments that involve specific concepts. ...

Retrieving Webpages Using Online Discussions
  • Citing Conference Paper
  • August 2023

... Various strategies have been proposed to improve the inference speed of large-scale transformer models. These include employing model pruning techniques (Fan et al., 2019;Gale et al., 2019;Michel et al., 2019;Voita et al., 2019;Sanh et al., 2020;Kurtic et al., 2022;Kwon et al., 2022;Campos & Zhai, 2023); implementing knowledge distillation methods to downsize the models (Jiao et al., 2019;Sanh et al., 2019); and adopting quantization procedures (Zafrir et al., 2019;Shen et al., 2020;Zadeh et al., 2020;Kim et al., 2021;Dettmers et al., 2022;Wu et al., 2022;Yao et al., 2022;Frantar et al., 2022). However, these approaches do not necessarily guarantee the original inference quality since they do not have a mechanism that verifies the validity of the generated token. ...

To Asymmetry and Beyond: Structured Pruning of Sequence to Sequence Models for Improved Inference Efficiency
  • Citing Conference Paper
  • January 2023