Siheng Chen’s research while affiliated with Shanghai Jiao Tong University and other places


Publications (166)


Figure 1: Illustration of traditional centralized data collection with human annotation versus our automatic distributed approach.
Figure 2: System overview of FedMobileAgent. During individual users' daily phone usage, Auto-Annotation automatically constructs training data through step-wise descriptions and episode-wide summarization. Each participating user then locally trains an agent and uploads it to the server. By applying adapted global aggregation, we obtain the target global mobile agent with enhanced capabilities.
Figure 3: Performance and cost comparison between FedMobileAgent and baselines. Our approach achieves the optimal balance.
Figure 5: Accuracy across action types in the action space of Android Control.
Figure 6: Episode example from the Android Control dataset. The high-level task is "Open the Zoho Meet app and view the scheduled meetings." Instructions in grey indicate ground truth from the original dataset, while those in green are predictions generated by Auto-Annotation. Our generated data sample achieves quality comparable to human-annotated ground truth.


FedMobileAgent: Training Mobile Agents Using Decentralized Self-Sourced Data from Diverse Users
  • Preprint

February 2025

Wenhao Wang · Zijie Yu · William Liu · [...] · Yanfeng Wang

The advancement of mobile agents has opened new opportunities for automating tasks on mobile devices. Training these agents requires large-scale, high-quality data, which is costly to obtain through human labor. Given the vast number of mobile phone users worldwide, if automated data collection from them were feasible, the resulting data volume and the subsequently trained mobile agents could reach unprecedented levels. Nevertheless, two major challenges arise: (1) extracting high-level and low-level user instructions without human involvement, and (2) utilizing distributed data from diverse users while preserving privacy. To tackle these challenges, we propose FedMobileAgent, a collaborative framework that trains mobile agents using self-sourced data from diverse users. Specifically, it includes two techniques. First, we propose Auto-Annotation, which enables the automatic collection of high-quality datasets during users' routine phone usage with minimal cost. Second, we introduce adapted aggregation to improve federated training of mobile agents on non-IID user data by incorporating both episode- and step-level distributions. In distributed settings, FedMobileAgent achieves performance comparable to centralized human-annotated models at less than 0.02% of the cost, highlighting its potential for real-world applications.
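One plausible reading of "adapted aggregation" over episode- and step-level distributions is a weighted FedAvg where each client's share interpolates between its fraction of episodes and its fraction of steps. The function name, the interpolation parameter `alpha`, and the exact weighting scheme below are assumptions for illustration, not the paper's method:

```python
def adapted_aggregate(client_params, episode_counts, step_counts, alpha=0.5):
    """Average per-parameter client weights; each client's share blends its
    episode-level and step-level data fractions (alpha controls the blend)."""
    total_eps = sum(episode_counts)
    total_steps = sum(step_counts)
    weights = [
        alpha * (e / total_eps) + (1 - alpha) * (s / total_steps)
        for e, s in zip(episode_counts, step_counts)
    ]
    # Weighted average of each named parameter across clients.
    agg = {}
    for name in client_params[0]:
        agg[name] = sum(w * p[name] for w, p in zip(weights, client_params))
    return agg
```

With `alpha=1` this reduces to episode-count FedAvg, and with `alpha=0` to step-count FedAvg; intermediate values trade off the two granularities on non-IID data.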


SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents

December 2024


With the integration of large language models (LLMs), embodied agents have strong capabilities to execute complicated instructions in natural language, paving the way for the potential deployment of embodied robots. However, a foreseeable issue is that such embodied agents can also flawlessly execute hazardous tasks, potentially causing damage in the real world. To study this issue, we present SafeAgentBench -- a new benchmark for safety-aware task planning of embodied LLM agents. SafeAgentBench includes: (1) a new dataset with 750 tasks, covering 10 potential hazards and 3 task types; (2) SafeAgentEnv, a universal embodied environment with a low-level controller, supporting multi-agent execution with 17 high-level actions for 8 state-of-the-art baselines; and (3) reliable evaluation methods from both execution and semantic perspectives. Experimental results show that the best-performing baseline achieves a 69% success rate on safe tasks but only a 5% rejection rate on hazardous tasks, indicating significant safety risks. More details and code are available at https://github.com/shengyin1224/SafeAgentBench.


ChatDyn: Language-Driven Multi-Actor Dynamics Generation in Street Scenes

December 2024


Generating realistic and interactive dynamics of traffic participants according to specific instructions is critical for street scene simulation. However, there is currently no comprehensive method that generates realistic dynamics for different types of participants, including vehicles and pedestrians, with the various kinds of interactions between them. In this paper, we introduce ChatDyn, the first system capable of generating interactive, controllable, and realistic participant dynamics in street scenes based on language instructions. To achieve precise control through complex language, ChatDyn employs a multi-LLM-agent role-playing approach, which utilizes natural language inputs to plan the trajectories and behaviors of different traffic participants. To generate realistic fine-grained dynamics based on the planning, ChatDyn designs two novel executors: the PedExecutor, a unified multi-task executor that generates realistic pedestrian dynamics under different task plans; and the VehExecutor, a physical transition-based policy that generates physically plausible vehicle dynamics. Extensive experiments show that ChatDyn can generate realistic driving scene dynamics with multiple vehicles and pedestrians, and significantly outperforms previous methods on subtasks. Code and model will be available at https://vfishc.github.io/chatdyn.


Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review

December 2024


Scholarly peer review is a cornerstone of scientific advancement, but the system is under strain due to increasing manuscript submissions and the labor-intensive nature of the process. Recent advancements in large language models (LLMs) have led to their integration into peer review, with promising results such as substantial overlap between LLM- and human-generated reviews. However, the unchecked adoption of LLMs poses significant risks to the integrity of the peer review system. In this study, we comprehensively analyze the vulnerabilities of LLM-generated reviews by focusing on manipulation and inherent flaws. Our experiments show that injecting covert deliberate content into manuscripts allows authors to explicitly manipulate LLM reviews, leading to inflated ratings and reduced alignment with human reviews. In a simulation, we find that manipulating 5% of the reviews could potentially cause 12% of the papers to lose their position in the top 30% of rankings. Implicit manipulation, where authors strategically highlight minor limitations in their papers, further demonstrates LLMs' susceptibility compared to human reviewers, with a 4.5 times higher consistency with disclosed limitations. Additionally, LLMs exhibit inherent flaws, such as potentially assigning higher ratings to incomplete papers than to full papers and favoring well-known authors in a single-blind review process. These findings highlight the risks of over-reliance on LLMs in peer review, underscoring that we are not yet ready for widespread adoption and emphasizing the need for robust safeguards.


FedRSU: Federated Learning for Scene Flow Estimation on Roadside Units

November 2024


IEEE Transactions on Intelligent Transportation Systems

Roadside units (RSUs) can significantly improve the safety and robustness of autonomous vehicles through Vehicle-to-Everything (V2X) communication. Currently, the usage of a single RSU mainly focuses on real-time inference and V2X collaboration, while neglecting the potential value of the high-quality data collected by RSU sensors. Integrating the vast amounts of data from numerous RSUs can provide a rich source of data for model training. However, the absence of ground truth annotations and the difficulty of transmitting enormous volumes of data are two inevitable barriers to fully exploiting this hidden value. In this paper, we introduce FedRSU, an innovative federated learning framework for self-supervised scene flow estimation. In FedRSU, we present a recurrent self-supervision training paradigm, where, for each RSU, the scene flow prediction of points at every timestamp can be supervised by its subsequent multi-modality observations. Another key component of FedRSU is federated learning (FL), where multiple devices collaboratively train an ML model while keeping the training data local and private. With the power of the recurrent self-supervised learning paradigm, FL is able to leverage the vast amounts of underutilized data from RSUs. To verify the FedRSU framework, we construct a large-scale multi-modality dataset, RSU-SF. The dataset consists of 17 RSU clients and an additional 4 vehicle clients, covering various scenarios, modalities, and sensor settings. Based on RSU-SF, we show that FedRSU can greatly improve model performance in ITS and provide a comprehensive benchmark under diverse FL scenarios. To the best of our knowledge, we provide the first real-world LiDAR-camera multi-modal dataset and benchmark for the FL community. Code and dataset are available at https://github.com/wwh0411/FedRSU.
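The self-supervision signal described above can be sketched as follows: warp the frame-t points by the predicted flow and measure their nearest-neighbor distance to the next observation. This is a generic Chamfer-style proxy loss under assumed 3D point tuples, not the paper's actual multi-modality loss:

```python
import math

def nn_dist(p, cloud):
    # Distance from point p to its nearest neighbor in the target cloud.
    return min(math.dist(p, q) for q in cloud)

def self_supervised_flow_loss(points_t, flow, points_t1):
    """Warp frame-t points by predicted per-point flow, then score how well
    the warped cloud lands on the frame-(t+1) observation (lower is better)."""
    warped = [tuple(c + f for c, f in zip(p, v)) for p, v in zip(points_t, flow)]
    return sum(nn_dist(w, points_t1) for w in warped) / len(warped)
```

A perfect flow prediction drives the loss to zero without any ground-truth flow labels, which is what lets each RSU supervise itself from its own future observations.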


Self-Evolving Multi-Agent Collaboration Networks for Software Development

October 2024


LLM-driven multi-agent collaboration (MAC) systems have demonstrated impressive capabilities in automatic software development at the function level. However, their heavy reliance on human design limits their adaptability to the diverse demands of real-world software development. To address this limitation, we introduce EvoMAC, a novel self-evolving paradigm for MAC networks. Inspired by traditional neural network training, EvoMAC obtains text-based environmental feedback by verifying the MAC network's output against a target proxy and leverages a novel textual backpropagation to update the network. To extend coding capabilities beyond function-level tasks to more challenging software-level development, we further propose rSDE-Bench, a requirement-oriented software development benchmark, which features complex and diverse software requirements along with automatic evaluation of requirement correctness. Our experiments show that: i) The automatic requirement-aware evaluation in rSDE-Bench closely aligns with human evaluations, validating its reliability as a software-level coding benchmark. ii) EvoMAC outperforms previous SOTA methods on both the software-level rSDE-Bench and the function-level HumanEval benchmarks, reflecting its superior coding capabilities. The benchmark can be downloaded at https://yuzhu-cai.github.io/rSDE-Bench/.
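The verify-and-update loop described above can be schematized with the LLM calls abstracted as injected callables; the control flow is the point, while the actual prompts, verifier, and update rule of EvoMAC are not reproduced here:

```python
def evolve(network, generate, verify, update, max_rounds=5):
    """Iteratively run the MAC network, check its output against the target
    proxy, and rewrite the network from textual feedback ("textual
    backpropagation"). generate/verify/update stand in for LLM calls."""
    for _ in range(max_rounds):
        output = generate(network)          # run the agent network
        ok, feedback = verify(output)       # text-based environmental feedback
        if ok:
            return network, output
        network = update(network, feedback) # textual backpropagation step
    return network, output
```

The analogy to neural training is direct: `verify` plays the role of the loss, its feedback text plays the role of the gradient, and `update` applies that gradient to the network description.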


Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation

October 2024


Post-training is essential for enabling large language models (LLMs) to follow human instructions. Inspired by the recent success of using LLMs to simulate human society, we leverage multi-agent simulation to automatically generate diverse text-based scenarios, capturing a wide range of real-world human needs. We propose MATRIX, a multi-agent simulator that creates realistic and scalable scenarios. Leveraging these outputs, we introduce a novel scenario-driven instruction generator MATRIX-Gen for controllable and highly realistic data synthesis. Extensive experiments demonstrate that our framework effectively generates both general and domain-specific data. Notably, on AlpacaEval 2 and Arena-Hard benchmarks, Llama-3-8B-Base, post-trained on datasets synthesized by MATRIX-Gen with just 20K instruction-response pairs, outperforms Meta's Llama-3-8B-Instruct model, which was trained on over 10M pairs; see our project at https://github.com/ShuoTang123/MATRIX-Gen.


Data Quality Control in Federated Instruction-tuning of Large Language Models

October 2024


By leveraging massively distributed data, federated learning (FL) enables collaborative instruction tuning of large language models (LLMs) in a privacy-preserving way. While FL effectively expands the data quantity, the issue of data quality remains under-explored in the current literature on FL for LLMs. To address this gap, we propose a new framework for federated instruction tuning of LLMs with data quality control (FedDQC), which measures data quality to facilitate the subsequent filtering and hierarchical training processes. Our approach introduces an efficient metric to assess each client's instruction-response alignment (IRA), identifying potentially noisy data through single-shot inference. Low-IRA samples are potentially noisy and are filtered to mitigate their negative impact. To further utilize the IRA value, we propose a quality-aware hierarchical training paradigm, where the LLM is progressively fine-tuned from high-IRA to low-IRA data, mirroring an easy-to-hard learning process. We conduct extensive experiments on four synthetic datasets and a real-world dataset, and compare our method with baselines adapted from the centralized setting. Results show that our method consistently and significantly improves the performance of LLMs trained on mixed-quality data in FL.
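The filter-then-curriculum pipeline above can be sketched by treating IRA as the gain in response log-likelihood when the instruction is given as context. The exact metric, the `loglik` interface, and the threshold are assumptions for illustration, with the model call abstracted as an injected callable:

```python
def ira_score(loglik, instruction, response):
    """Higher when conditioning on the instruction makes the response more
    likely; near or below zero suggests a misaligned (noisy) pair."""
    return loglik(response, context=instruction) - loglik(response, context="")

def filter_and_order(samples, loglik, threshold):
    """Drop low-IRA pairs, then order the rest high-to-low IRA so training
    proceeds easy-to-hard, mirroring the quality-aware hierarchical paradigm."""
    scored = [(ira_score(loglik, i, r), i, r) for i, r in samples]
    kept = [(s, i, r) for s, i, r in scored if s >= threshold]
    kept.sort(key=lambda t: -t[0])  # high-IRA (easy) samples first
    return [(i, r) for _, i, r in kept]
```

Because each score needs only a single forward pass with and without the instruction, the filtering cost stays small relative to fine-tuning itself.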



KnowledgeSG: Privacy-Preserving Synthetic Text Generation with Knowledge Distillation from Server

October 2024


The success of large language models (LLMs) has led many parties to fine-tune LLMs on their own private data. However, this practice raises privacy concerns due to the memorization of LLMs. Existing solutions, such as substituting synthetic data, struggle to simultaneously improve performance and preserve privacy. They either rely on a local model for generation, resulting in a performance decline, or take advantage of APIs, directly exposing the data to API servers. To address this issue, we propose KnowledgeSG, a novel client-server framework which enhances synthetic data quality and improves model performance while ensuring privacy. We achieve this by learning local knowledge from the private data with differential privacy (DP) and distilling professional knowledge from the server. Additionally, inspired by federated learning, we transmit models rather than data between the client and server to prevent privacy leakage. Extensive experiments in medical and financial domains demonstrate the effectiveness of KnowledgeSG. Our code is now publicly available at https://github.com/wwh0411/KnowledgeSG.


Citations (47)


... For example, Li et al. leveraged random sample consensus (RANSAC) to sample a subset of collaborators and calculate the intersection over union (IoU) of the bounding boxes to verify whether there is any malicious agent among the collaboration network. Zhao et al. (Zhao et al., 2024) designed a match loss and a reconstruction loss as statistics to measure the consensus between the ego CAV and the collaborators. However, these methods all follow a hypothesize-and-verify paradigm, which requires generating multiple hypothetical perception results and verifying the consistency between the ego CAV and the collaborators. ...

Reference:

CP-Guard+: A New Paradigm for Malicious Agent Detection and Defense in Collaborative Perception
MADE: Malicious Agent Detection for Robust Multi-Agent Collaborative Perception
  • Citing Conference Paper
  • October 2024

... Wan et al. enhanced human-agent interactions within social virtual environments by developing LLM-based AI agents capable of memory-enhanced, context-aware responses [63]. Wei et al. created ChatSim, a system that allows for the editing of photorealistic 3D driving scenes via natural language commands, integrating external digital assets and utilizing a collaborative framework of LLM agents for greater realism and efficiency [66]. Bayat et al. focused on improving the user experience in virtual museums by employing a unified design that includes an Intelligent Virtual Avatar and a Virtual Environment, both powered by an LLM [9]. ...

Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents
  • Citing Conference Paper
  • June 2024

... However, this passive method often results in information redundancy or delays, potentially limiting the real-time decision-making capabilities of autonomous driving systems [21]. To strengthen cooperative perception, the study CodeFilling adopts two key strategies: optimizing collaborative messages through improved representation and selection [22]. Consequently, optimizing V2X communication to actively filter and prioritize useful information for cooperative perception remains a critical area of research. ...

Communication-Efficient Collaborative Perception via Information Filling with Codebook
  • Citing Conference Paper
  • June 2024

... Although significant efforts have been made for safety alignment, LLMs still potentially exhibit vulnerabilities in producing harmful generations [Wei et al., 2024, Zou et al., 2023, Yi et al., 2024]. Prior research has shown that even if LLMs are trained to be safe and harmless, they can still be misused. ...

On the Vulnerability of Safety Alignment in Open-Access LLMs

... Most current research focuses on optimizing single performance metrics, utilizing diverse model segmentation schemes and user selection strategies [21]-[24]. However, under multifactor conditions, existing federated learning algorithms lack quantitative and theoretical modeling of various metrics, making it challenging to balance model performance, communication efficiency, and privacy security [25]-[28]. ...

OpenFedLLM: Training Large Language Models on Decentralized Private Data via Federated Learning
  • Citing Conference Paper
  • August 2024

... In other words, Pragmatic Communications (PragComm), which aims to deliver compact latent representations tailored to specific downstream decision-making tasks [20], can better account for both collaborative perception based on sensor data and subsequent driving decisions simultaneously [21]. In the context of V2X-AD, PragComm is commonly deployed as a compression paradigm [15]-[17], [19], [22]. These methods operate under a fundamental assumption: during each time interval τ, all participating agents first broadcast Basic Safety Messages (BSMs) and subsequently decide whether to engage in communication [16] or exchange valuable perception blocks [15]. ...

Robust Collaborative Perception without External Localization and Clock Devices
  • Citing Conference Paper
  • May 2024

... This enhanced flexibility has motivated a growing body of literature on extending classical graph neural network architectures to hypergraphs, including message-passing (Huang and Yang, 2021) and transformer-based models (Liu et al., 2024). Typical validation studies compare hypergraph architectures against each other, but not against standard graph neural networks (GNNs). ...

Hypergraph Transformer for Semi-Supervised Classification
  • Citing Conference Paper
  • April 2024

... Additionally, the application of message passing or diffusion on cellular sheaves [52] over graphs [24, 25, 53-55] has proven effective in heterophilic scenarios. Models without message passing have been introduced for simplicial complexes [41, 56] and hypergraphs [57]. An architecture for inferring a latent regular cell complex to improve a downstream task has been introduced in [58]. ...

Hypergraph-Mlp: Learning on Hypergraphs Without Message Passing
  • Citing Conference Paper
  • April 2024

... V2X communication enables cooperative perception in intelligent transportation systems (ITS), allowing vehicles to exchange messages with other agents such as infrastructure and pedestrians [6]. As shown in Fig. 2, the V2X communication framework has four layers. ...

Interruption-Aware Cooperative Perception for V2X Communication-Aided Autonomous Driving
  • Citing Article
  • April 2024

IEEE Transactions on Intelligent Vehicles

... This enables GNNs to make informed predictions even with missing values. The pioneering works [29, 30] applied a graph aggregation method after the first encoding stage in RNNs to enhance imputation accuracy or representation learning. Alternatively, [31] addressed this issue by learning the embeddings and dynamics of sensors through temporal-aware attention, with [32, 33] expanding upon this foundation by considering additional interval information and a transfer learning strategy. ...

Compatible Transformer for Irregularly Sampled Multivariate Time Series
  • Citing Conference Paper
  • December 2023