Xiao-Ming Wu’s research while affiliated with China University of Petroleum and other places


Publications (115)


GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis
  • Preprint

November 2024 · 5 Reads

Bo Liu · Ke Zou · Liming Zhan · [...]

Medical Visual Question Answering (VQA) is an essential technology that integrates computer vision and natural language processing to automatically respond to clinical inquiries about medical images. However, current medical VQA datasets exhibit two significant limitations: (1) they often lack visual and textual explanations for answers, which impedes their ability to satisfy the comprehension needs of patients and junior doctors; (2) they typically offer a narrow range of question formats, inadequately reflecting the diverse requirements encountered in clinical scenarios. These limitations pose significant challenges to the development of a reliable and user-friendly Med-VQA system. To address these challenges, we introduce a large-scale, Groundable, and Explainable Medical VQA benchmark for chest X-ray diagnosis (GEMeX), featuring several innovative components: (1) a multi-modal explainability mechanism that offers detailed visual and textual explanations for each question-answer pair, thereby enhancing answer comprehensibility; (2) four distinct question types (open-ended, closed-ended, single-choice, and multiple-choice) that better reflect diverse clinical needs. We evaluated 10 representative large vision-language models on GEMeX and found that they underperformed, highlighting the dataset's complexity. However, after fine-tuning a baseline model on the training set, we observed a significant performance improvement, demonstrating the dataset's effectiveness. The project is available at www.med-vqa.com/GEMeX.
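
To make the benchmark's structure concrete, here is a minimal sketch of what a GEMeX-style sample with multi-modal explanations might look like; the field names and schema are illustrative assumptions, not the dataset's published format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical record layout for one question-answer pair; the actual
# GEMeX schema may differ.
@dataclass
class GEMeXSample:
    image_path: str                     # chest X-ray image
    question: str
    question_type: str                  # "open", "closed", "single-choice", or "multi-choice"
    options: List[str] = field(default_factory=list)  # empty for open/closed questions
    answer: str = ""
    visual_explanation: List[Tuple[int, int, int, int]] = field(default_factory=list)  # grounding boxes (x, y, w, h)
    textual_explanation: str = ""       # free-text reasoning behind the answer

sample = GEMeXSample(
    image_path="images/cxr_00042.png",
    question="Is there evidence of pleural effusion?",
    question_type="closed",
    answer="Yes",
    visual_explanation=[(412, 580, 160, 140)],
    textual_explanation="Blunting of the left costophrenic angle suggests effusion.",
)
```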


Figure 3: Overall approach of our weather generalist foundation model (WeatherGFM).
Figure 6: Visual results of our WeatherGFM on OOD tasks.
Figure 7: The effect of dataset size and model size. ViT-ST: single-task ViT trained on 0.5 million samples. Base: our WeatherGFM with 100M parameters trained on 4 million samples. Large: our WeatherGFM with 330M parameters trained on 4 million samples.
Figure 9: RMSE performance comparison across different model configurations.
WeatherGFM: Learning A Weather Generalist Foundation Model via In-context Learning
  • Preprint
  • File available

November 2024 · 28 Reads

The Earth's weather system encompasses intricate weather data modalities and diverse weather understanding tasks, which hold significant value to human life. Existing data-driven models focus on single weather understanding tasks (e.g., weather forecasting). Although these models have achieved promising results, they fail to tackle various complex tasks within a single and unified model. Moreover, the paradigm that relies on limited real observations for a single scenario hinders the model's performance upper bound. In response to these limitations, we draw inspiration from the in-context learning paradigm employed in state-of-the-art visual foundation models and large language models. In this paper, we introduce the first generalist weather foundation model (WeatherGFM), designed to address a wide spectrum of weather understanding tasks in a unified manner. More specifically, we initially unify the representation and definition of the diverse weather understanding tasks. Subsequently, we devise weather prompt formats to manage different weather data modalities, namely single, multiple, and temporal modalities. Finally, we adopt a visual prompting question-answering paradigm for the training of unified weather understanding tasks. Extensive experiments indicate that our WeatherGFM can effectively handle up to ten weather understanding tasks, including weather forecasting, super-resolution, weather image translation, and post-processing. Our method also showcases generalization ability on unseen tasks.
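
As a rough illustration of the visual prompting question-answering paradigm described in the abstract, the sketch below builds a prompt from one in-context example pair plus a query field and lets a stand-in network predict the query output; the tensor shapes and the toy backbone are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

def build_weather_prompt(example_in, example_out, query_in):
    """Stack (prompt input, prompt output, query input) into one sequence.

    Each tensor is a gridded weather field of shape (C, H, W); the model
    must complete the missing fourth panel, the query output. The task
    (forecasting, super-resolution, translation, ...) is implied by the
    example pair rather than by a task-specific head.
    """
    return torch.stack([example_in, example_out, query_in], dim=0)

class ToyWeatherGFM(nn.Module):
    """Stand-in for the real masked-prediction vision backbone."""
    def __init__(self, channels=1):
        super().__init__()
        self.net = nn.Conv2d(3 * channels, channels, kernel_size=3, padding=1)

    def forward(self, prompt):                 # prompt: (3, C, H, W)
        x = prompt.flatten(0, 1).unsqueeze(0)  # (1, 3*C, H, W)
        return self.net(x).squeeze(0)          # predicted query output: (C, H, W)

C, H, W = 1, 64, 64
model = ToyWeatherGFM(C)
prompt = build_weather_prompt(torch.randn(C, H, W), torch.randn(C, H, W), torch.randn(C, H, W))
prediction = model(prompt)
```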




Figure 1: Layer importance ranking of LLAMA 2-7B (Touvron et al., 2023) and Mistral-7B-v0.1 (Jiang et al., 2023) by ILA across the Alpaca-GPT4 (Peng et al., 2023), LIMA (Zhou et al., 2023), and No Robots (Rajani et al., 2023) datasets. Layers ranked in the top 75% by scores (s_i) are considered important. The x-axis represents the transformer block index, and the y-axis shows the names of linear layers within each block. The figure illustrates two key findings: (1) There is a significant overlap (up to 90%) in the important layers identified by ILA across different alignment datasets, as supported by the Jaccard similarity values in Table 2. This high consistency indicates that similar capabilities are needed for alignment, regardless of substantial differences in dataset content. (2) The important layers vary between different network architectures, suggesting that each model's structure and dynamics uniquely affect which layers are most crucial for alignment.
Understanding Layer Significance in LLM Alignment

October 2024 · 26 Reads

Aligning large language models (LLMs) through fine-tuning is essential for tailoring them to specific applications. Therefore, understanding what LLMs learn during the alignment process is crucial. Recent studies suggest that alignment primarily adjusts a model's presentation style rather than its foundational knowledge, indicating that only certain components of the model are significantly impacted. To delve deeper into LLM alignment, we propose to identify which layers within LLMs are most critical to the alignment process, thereby uncovering how alignment influences model behavior at a granular level. We propose a novel approach to identify the important layers for LLM alignment (ILA). It involves learning a binary mask for each incremental weight matrix in the LoRA algorithm, indicating the significance of each layer. ILA consistently identifies important layers across various alignment datasets, with nearly 90% overlap even with substantial dataset differences, highlighting fundamental patterns in LLM alignment. Experimental results indicate that freezing non-essential layers improves overall model performance, while selectively tuning the most critical layers significantly enhances fine-tuning efficiency with minimal performance loss.
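
The masking idea is easy to state in code. Below is a minimal sketch, assuming a straight-through relaxation of the binary mask over each LoRA incremental weight matrix; the exact parameterization in ILA may differ.

```python
import torch
import torch.nn as nn

class MaskedLoRALinear(nn.Module):
    """LoRA layer whose update is gated by a learnable importance score."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank
        self.score = nn.Parameter(torch.zeros(1))   # s_i: layer importance logit

    def forward(self, x):
        p = torch.sigmoid(self.score)
        # Straight-through estimator: the forward pass uses a hard 0/1 gate,
        # while gradients flow through the soft probability p.
        gate = (p > 0.5).float() + p - p.detach()
        delta = x @ self.A.T @ self.B.T * self.scale
        return self.base(x) + gate * delta

layer = MaskedLoRALinear(nn.Linear(512, 512))
y = layer(torch.randn(4, 512))
# After training, layers whose sigmoid(score) ranks in the top 75% would be
# treated as important; the remaining layers can be frozen.
```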


UGotMe: An Embodied System for Affective Human-Robot Interaction

October 2024 · 12 Reads

Equipping humanoid robots with the capability to understand the emotional states of human interactants and to express emotions appropriately according to the situation is essential for affective human-robot interaction. However, enabling current vision-aware multimodal emotion recognition models for affective human-robot interaction in the real world raises embodiment challenges: addressing environmental noise and meeting real-time requirements. First, in multiparty conversation scenarios, noise inherent in the robot's visual observations, which may come from either 1) distracting objects in the scene or 2) inactive speakers appearing in the robot's field of view, hinders the models from extracting emotional cues from vision inputs. Second, real-time response, a desired feature for an interactive system, is also challenging to achieve. To tackle both challenges, we introduce an affective human-robot interaction system called UGotMe, designed specifically for multiparty conversations. Two denoising strategies are proposed and incorporated into the system to solve the first issue. Specifically, to filter out distracting objects in the scene, we extract face images of the speakers from the raw images and introduce a customized active face extraction strategy to rule out inactive speakers. As for the second issue, we employ efficient data transmission from the robot to a local server to improve real-time response capability. We deploy UGotMe on the humanoid robot Ameca to validate its real-time inference capabilities in practical scenarios. Videos demonstrating real-world deployment are available at https://pi3-141592653.github.io/UGotMe/.
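
As a sketch of the active face extraction step, the snippet below filters detected faces by a lip-motion cue so only the active speaker reaches the emotion model; the Face structure and the motion heuristic are assumptions, since the deployed system may use different activity cues.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Face:
    crop: np.ndarray      # face patch cropped from the raw frame
    lip_motion: float     # mouth-region motion energy across recent frames

def extract_active_faces(faces: List[Face], motion_threshold: float = 0.3) -> List[Face]:
    """Rule out inactive speakers so that only the current speaker's
    emotional cues are passed to the recognition model."""
    return [f for f in faces if f.lip_motion > motion_threshold]
```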


Real-to-Sim Grasp: Rethinking the Gap between Simulation and Real World in Grasp Detection

October 2024 · 11 Reads

For 6-DoF grasp detection, simulated data can be scaled up to train more powerful models, but it faces the challenge of a large gap between simulation and the real world. Previous works bridge this gap in a sim-to-real manner. However, this approach explicitly or implicitly forces the simulated data to adapt to noisy real data when training grasp detectors, where the positional drift and structural distortion within the camera noise harm grasp learning. In this work, we propose a Real-to-Sim framework for 6-DoF grasp detection, named R2SGrasp, with the key insight of bridging this gap in a real-to-sim way, which directly bypasses the camera noise in grasp detector training through an inference-time real-to-sim adaptation. To achieve this real-to-sim adaptation, our R2SGrasp designs a Real-to-Sim Data Repairer (R2SRepairer) to mitigate the camera noise of real depth maps at the data level, and a Real-to-Sim Feature Enhancer (R2SEnhancer) to enhance real features with precise simulated geometric primitives at the feature level. To endow our framework with generalization ability, we cost-efficiently construct a large-scale simulated dataset to train our grasp detector, comprising 64,000 RGB-D images with 14.4 million grasp annotations. Extensive experiments show that R2SGrasp is powerful and that our real-to-sim perspective is effective. Real-world experiments further show the strong generalization ability of R2SGrasp. The project page is available at https://isee-laboratory.github.io/R2SGrasp.
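
For intuition only, here is a crude hand-crafted stand-in for the data-level repair idea: smooth sensor jitter and fill dropout holes so a real depth map looks closer to clean simulated depth. The actual R2SRepairer is learned; this filtering-based version is purely an illustrative assumption.

```python
import numpy as np
from scipy.ndimage import median_filter

def repair_depth(depth: np.ndarray) -> np.ndarray:
    """Suppress positional jitter and fill missing readings in a depth map."""
    repaired = median_filter(depth, size=5)       # smooth local sensor noise
    holes = depth <= 0                            # zero depth marks a dropout
    repaired[holes] = median_filter(repaired, size=9)[holes]  # fill holes coarsely
    return repaired
```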


AI Can Be Cognitively Biased: An Exploratory Study on Threshold Priming in LLM-Based Batch Relevance Assessment

September 2024 · 21 Reads

Cognitive biases are systematic deviations in thinking that lead to irrational judgments and problematic decision-making, extensively studied across various fields. Recently, large language models (LLMs) have shown advanced understanding capabilities but may inherit human biases from their training data. While social biases in LLMs have been well-studied, cognitive biases have received less attention, with existing research focusing on specific scenarios. The broader impact of cognitive biases on LLMs in various decision-making contexts remains underexplored. We investigated whether LLMs are influenced by the threshold priming effect in relevance judgments, a core task and widely-discussed research topic in the Information Retrieval (IR) community. The priming effect occurs when exposure to certain stimuli unconsciously affects subsequent behavior and decisions. Our experiment employed 10 topics from the TREC 2019 Deep Learning passage track collection and tested AI judgments under different document relevance scores, batch lengths, and LLMs, including GPT-3.5, GPT-4, LLaMa2-13B, and LLaMa2-70B. Results showed that LLMs tend to give lower scores to later documents if earlier ones have high relevance, and vice versa, regardless of the combination and model used. Our findings demonstrate that LLMs' judgments, like human judgments, are influenced by threshold priming biases, and suggest that researchers and system engineers should account for potential human-like cognitive biases when designing, evaluating, and auditing LLMs in IR tasks and beyond.
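
The priming manipulation can be sketched as follows: the same target passage is judged at the end of a batch of high-relevance versus low-relevance passages, and the assigned scores are compared. The prompt wording is an illustrative assumption, not the study's exact instrument.

```python
def build_batch_prompt(topic: str, passages: list) -> str:
    """Assemble a batch relevance assessment prompt for an LLM judge."""
    lines = ["Rate each passage's relevance to the topic on a 0-3 scale.",
             f"Topic: {topic}", ""]
    for i, passage in enumerate(passages, 1):
        lines.append(f"Passage {i}: {passage}")
    lines.append("Return one score per passage.")
    return "\n".join(lines)

# Same target passage, different priming context.
high_prime = ["<highly relevant passage>"] * 4 + ["<target passage>"]
low_prime = ["<barely relevant passage>"] * 4 + ["<target passage>"]
# Threshold priming predicts a lower score for the target after the
# high-relevance batch than after the low-relevance one.
```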


Diversity-grounded Channel Prototypical Learning for Out-of-Distribution Intent Detection

September 2024

In the realm of task-oriented dialogue systems, a robust intent detection mechanism must effectively handle malformed utterances encountered in real-world scenarios. This study presents a novel fine-tuning framework for large language models (LLMs) aimed at enhancing in-distribution (ID) intent classification and out-of-distribution (OOD) intent detection, which utilizes semantic matching with prototypes derived from ID class names. By harnessing the highly distinguishable representations of LLMs, we construct semantic prototypes for each ID class using a diversity-grounded prompt tuning approach. We rigorously test our framework in a challenging OOD context, where ID and OOD classes are semantically close yet distinct, referred to as "near" OOD detection. For a thorough assessment, we benchmark our method against the prevalent fine-tuning approaches. The experimental findings reveal that our method demonstrates superior performance in both few-shot ID intent classification and near-OOD intent detection tasks.
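
A minimal sketch of the prototype matching step, assuming each ID class name yields a prototype embedding and that OOD is flagged by a cosine similarity threshold (the encoder and threshold value are assumptions):

```python
import torch
import torch.nn.functional as F

def detect_intent(utterance_emb: torch.Tensor,
                  prototypes: torch.Tensor,        # (num_id_classes, dim)
                  ood_threshold: float = 0.5):
    """Classify by nearest prototype; flag OOD below the threshold."""
    sims = F.cosine_similarity(utterance_emb.unsqueeze(0), prototypes, dim=-1)
    best = int(sims.argmax())
    if sims[best] < ood_threshold:
        return "OOD", float(sims[best])
    return best, float(sims[best])   # predicted ID class index and its score

prototypes = F.normalize(torch.randn(10, 768), dim=-1)  # stand-in prototypes
label, score = detect_intent(torch.randn(768), prototypes)
```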


STORE: Streamlining Semantic Tokenization and Generative Recommendation with A Single LLM

September 2024 · 19 Reads

Traditional recommendation models often rely on unique item identifiers (IDs) to distinguish between items, which can hinder their ability to effectively leverage item content information and generalize to long-tail or cold-start items. Recently, semantic tokenization has been proposed as a promising solution that aims to tokenize each item's semantic representation into a sequence of discrete tokens. In this way, it preserves the item's semantics within these tokens and ensures that semantically similar items are represented by similar tokens. These semantic tokens have become fundamental in training generative recommendation models. However, existing generative recommendation methods typically involve multiple sub-models for embedding, quantization, and recommendation, leading to an overly complex system. In this paper, we propose to streamline the semantic tokenization and generative recommendation process with a unified framework, dubbed STORE, which leverages a single large language model (LLM) for both tasks. Specifically, we formulate semantic tokenization as a text-to-token task and generative recommendation as a token-to-token task, supplemented by a token-to-text reconstruction task and a text-to-token auxiliary task. All these tasks are framed in a generative manner and trained using a single LLM backbone. Extensive experiments have been conducted to validate the effectiveness of our STORE framework across various recommendation tasks and datasets. We will release the source code and configurations for reproducible research.
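
To illustrate how one LLM backbone can serve both roles, the sketch below frames the two core tasks as plain sequence-to-sequence training examples; the templates and the "<t_##>" token format are illustrative assumptions, not STORE's actual vocabulary.

```python
def text_to_token_example(item_text: str, semantic_tokens: list) -> dict:
    """Semantic tokenization: item content text -> discrete semantic tokens."""
    return {"input": f"Tokenize item: {item_text}",
            "target": " ".join(semantic_tokens)}

def token_to_token_example(history: list, next_item_tokens: list) -> dict:
    """Generative recommendation: user history tokens -> next-item tokens."""
    hist = " ; ".join(" ".join(tokens) for tokens in history)
    return {"input": f"User history: {hist} Next item:",
            "target": " ".join(next_item_tokens)}

example = text_to_token_example("Wireless noise-cancelling headphones",
                                ["<t_12>", "<t_87>", "<t_03>"])
```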


Citations (41)


... Different from 6-DoF grasping pose prediction for rigid object manipulation, we need to predict both the 6-DoF part grasping pose and the 2-DoF movement direction after grasping. We adapt the SOTA method EconomicGrasp [42] as our actionable pose estimator, dubbed Part-aware EcoGrasp, and use the pretrained GAPartNet [9] to predict the part movement direction. ...

Reference:

GAPartManip: A Large-scale Part-centric Dataset for Material-Agnostic Articulated Object Manipulation
An Economic Framework for 6-DoF Grasp Detection
  • Citing Chapter
  • November 2024

... Continual Learning in task-oriented dialogue systems, focusing on mitigating catastrophic forgetting, has employed various methods such as architecture-based (Shen et al., 2019; Geng et al., 2021; Xu et al., 2023a), rehearsal-based (Rebuffi et al., 2017; Hou et al., 2019; Lu et al., 2021b), and regularization-based (Li and Hoiem, 2017; Feng et al., 2024; Lu et al., 2021a). In DST, contributions like Madotto et al. (2020) and Liu et al. (2021) have utilized these CL strategies, with Zhu et al. (2022) introducing Continual Prompt Tuning (CPT) to fine-tune domain-specific soft prompts. ...

TaSL: Continual Dialog State Tracking via Task Skill Localization and Consolidation
  • Citing Conference Paper
  • January 2024

... AIGC refers to content produced by AI in various forms, such as text, images, audio, and more. Notable advancements in this field include Generative Adversarial Networks (GANs) [10], diffusion models [11], and multimodal generation techniques [12]. These technologies can be applied across various fields, providing potential solutions for wireless communication systems. ...

EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs
  • Citing Conference Paper
  • January 2024

... However, current Continual DST methods (Madotto et al., 2020; Liu et al., 2021; Cho et al., 2023; Feng et al., 2024) mainly focus on mitigating forgetting through memory replay or regularization, overlooking the advantages of KT that can be derived from the inherent correlations between different DST domains. ...

Continual Dialogue State Tracking via Reason-of-Select Distillation
  • Citing Conference Paper
  • January 2024

... Recently, numerous universal (also known as all-in-one) image restoration frameworks [24,41,61] have been proposed and are viewed as potential foundation models; they aim to handle multiple restoration tasks simultaneously within a single model. However, these approaches [24,35,41,53,61] simply combine several public synthetic datasets [1,17,23,27,36,38,39,52,63] as their corresponding all-in-one training sets. ...

Selective Hourglass Mapping for Universal Image Restoration Based on Diffusion Model
  • Citing Conference Paper
  • June 2024

... Modality-based Sequential Recommendation (MoRec). The Recommender System community has shown a growing interest in incorporating various modality information into recommendation systems [33], [5], [34], [35], [24]. These systems utilize large-scale multimodal foundation models [30], [36], [4], [34] from NLP and CV [22], [37] to encode text and image data. ...

Multimodal Pretraining, Adaptation, and Generation for Recommendation: A Survey
  • Citing Conference Paper
  • August 2024

... To address these limitations, semantic tokenization has recently emerged as a promising solution and has gained rapid traction in the community [18,25,28,42]. As illustrated in Figure 1, instead of representing each item with a unique ID embedding, semantic tokenization encodes each item's semantic representation into a compact sequence of discrete tokens. ...

Discrete Semantic Tokenization for Deep CTR Prediction
  • Citing Conference Paper
  • May 2024

... Simplifying complex computational processes is an effective approach for sustainable algorithm development. Liu et al. [169] developed GreenRec, a Green AI benchmark for news recommendation systems, which uses an "Only-Encode-Once" training paradigm to pre-train content encoders and cache content vectors, thereby reducing redundant processing and energy consumption. Lu et al. [162] proposed GreenFlow, a computational allocation framework designed to minimize the carbon footprint and energy demands of recommendation systems. ...

Benchmarking News Recommendation in the Era of Green AI
  • Citing Conference Paper
  • May 2024

... UHD restoration. Since current general learning-based image restoration algorithms (Wang et al., 2020b; Yao et al., 2021; Zhao et al., 2021; Chen et al., 2021c, 2020a; Zamir et al., 2021, 2022; Chen et al., 2021a; Tu et al., 2022; Wang et al., 2024a, 2022a, 2023a, 2024c; Mei et al., 2023; Zhang et al., 2022a) cannot effectively process UHD images (Wang et al., 2024b), several UHD restoration models (Zheng et al., 2021; Deng et al., 2021; Wang et al., 2023b; Li et al., 2023; Wang et al., 2024b) have been developed, as well as UHD restoration benchmarks. ...

Correlation Matching Transformation Transformers for UHD Image Restoration

Proceedings of the AAAI Conference on Artificial Intelligence