Botian Shi’s research while affiliated with Chinese Association for Artificial Intelligence and other places


Publications (49)


OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
  • Preprint

December 2024

Linke Ouyang · Yuan Qu · Hongbin Zhou · [...]

Document content extraction is crucial in computer vision, especially for meeting the high-quality data needs of large language models (LLMs) and retrieval-augmented generation (RAG) technologies. However, current document parsing methods suffer from significant limitations in terms of diversity and comprehensive evaluation. To address these challenges, we introduce OmniDocBench, a novel multi-source benchmark designed to advance automated document content extraction. OmniDocBench includes a meticulously curated and annotated high-quality evaluation dataset comprising nine diverse document types, including academic papers, textbooks, and slides. Our benchmark provides a flexible and comprehensive evaluation framework with 19 layout category labels and 14 attribute labels, enabling multi-level assessments across entire datasets, individual modules, or specific data types. Using OmniDocBench, we perform an exhaustive comparative analysis of existing modular pipelines and multimodal end-to-end methods, highlighting their limitations in handling document diversity and ensuring fair evaluation. OmniDocBench establishes a robust, diverse, and fair evaluation standard for the document content extraction field, offering crucial insights for future advancements and fostering the development of document parsing technologies. The code and dataset are available at https://github.com/opendatalab/OmniDocBench.
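
To illustrate the kind of multi-level assessment described above, the sketch below aggregates page-level scores over the whole dataset, per document type, and per attribute label. The field names (doc_type, attributes) and the similarity metric are illustrative assumptions, not the benchmark's actual schema or metrics.

from collections import defaultdict
from difflib import SequenceMatcher

def page_score(predicted_text, reference_text):
    # Similarity between predicted and reference text for one page, in [0, 1].
    return SequenceMatcher(None, predicted_text, reference_text).ratio()

def evaluate(pages, predictions):
    # Aggregate scores over the full set, per document type, and per attribute label.
    overall, by_type, by_attr = [], defaultdict(list), defaultdict(list)
    for page in pages:
        s = page_score(predictions[page["id"]], page["reference"])
        overall.append(s)
        by_type[page["doc_type"]].append(s)
        for attr in page["attributes"]:  # e.g. "rotated", "watermarked"
            by_attr[attr].append(s)
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return {
        "overall": mean(overall),
        "per_type": {t: mean(v) for t, v in by_type.items()},
        "per_attribute": {a: mean(v) for a, v in by_attr.items()},
    }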


Chimera: Improving Generalist Model with Domain-Specific Experts

December 2024


Recent advancements in Large Multi-modal Models (LMMs) underscore the importance of scaling by increasing image-text paired data, achieving impressive performance on general tasks. Despite their effectiveness in broad applications, generalist models are primarily trained on web-scale datasets dominated by natural images, resulting in the sacrifice of specialized capabilities for domain-specific tasks that require extensive domain prior knowledge. Moreover, directly integrating expert models tailored for specific domains is challenging due to the representational gap and imbalanced optimization between the generalist model and experts. To address these challenges, we introduce Chimera, a scalable and low-cost multi-modal pipeline designed to boost the ability of existing LMMs with domain-specific experts. Specifically, we design a progressive training strategy to integrate features from expert models into the input of a generalist LMM. To address the imbalanced optimization caused by the well-aligned general visual encoder, we introduce a novel Generalist-Specialist Collaboration Masking (GSCM) mechanism. This results in a versatile model that excels across the chart, table, math, and document domains, achieving state-of-the-art performance on multi-modal reasoning and visual content extraction tasks, both of which are challenging tasks for assessing existing LMMs.


Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

December 2024


We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image/video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems. A HuggingFace demo is available at https://huggingface.co/spaces/OpenGVLab/InternVL


ZOPP: A Framework of Zero-shot Offboard Panoptic Perception for Autonomous Driving

November 2024


Offboard perception aims to automatically generate high-quality 3D labels for autonomous driving (AD) scenes. Existing offboard methods focus on 3D object detection with a closed-set taxonomy and fail to match human-level recognition capability on the rapidly evolving perception tasks. Due to heavy reliance on human labels and the prevalence of data imbalance and sparsity, a unified framework for offboard auto-labeling of the various elements in AD scenes that meets the distinct needs of perception tasks has not yet been fully explored. In this paper, we propose a novel multi-modal Zero-shot Offboard Panoptic Perception (ZOPP) framework for autonomous driving scenes. ZOPP integrates the powerful zero-shot recognition capabilities of vision foundation models and 3D representations derived from point clouds. To the best of our knowledge, ZOPP represents a pioneering effort in the domain of multi-modal panoptic perception and auto labeling for autonomous driving scenes. We conduct comprehensive empirical studies and evaluations on the Waymo Open Dataset to validate the proposed ZOPP on various perception tasks. To further explore the usability and extensibility of our proposed ZOPP, we also conduct experiments in downstream applications. The results further demonstrate the great potential of our ZOPP for real-world scenarios.
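
As a rough illustration of one step such a multi-modal zero-shot pipeline relies on (not ZOPP's actual implementation), the sketch below transfers 2D instance masks produced by a vision foundation model onto LiDAR points via camera projection, so that points inherit open-set labels. The matrix names and mask source are assumptions for illustration.

import numpy as np

def label_points_with_mask(points_xyz, mask, K, T_cam_from_lidar):
    # points_xyz: (N, 3) LiDAR points, mask: (H, W) integer instance-id image,
    # K: (3, 3) camera intrinsics, T_cam_from_lidar: (4, 4) extrinsics.
    # Returns an (N,) array of instance ids (-1 for unlabeled points).
    n = points_xyz.shape[0]
    homo = np.hstack([points_xyz, np.ones((n, 1))])       # homogeneous coordinates
    cam = (T_cam_from_lidar @ homo.T).T[:, :3]            # points in the camera frame
    in_front = cam[:, 2] > 0.1                            # keep points in front of the camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                           # perspective divide to pixel coords
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    h, w = mask.shape
    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    labels = np.full(n, -1, dtype=int)
    labels[valid] = mask[v[valid], u[valid]]              # points inherit the 2D mask label
    return labels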


Fig. 2. A potential architecture of Sora, inspired by [21], [136].
Fig. 5. Timeline of world models in autonomous driving. End-to-end driving and neural driving simulator (both 2D and 3D) approaches have been emerging since 2023.
Fig. 6. Chronological overview of video generation models, presenting the world model-based autonomous agents proposed in recent years. Colors indicate different world model structures: RSSM dominated earlier efforts, while Transformer, JEPA, and diffusion architectures have gained increasing attention since 2022.
Fig. 7. The three-level hierarchy of intelligence [163]. World models are expected to conduct counterfactual reasoning.
Datasets for video generation. ASR: automatic speech recognition. This table is reported in [34].


Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond
  • Preprint
  • File available

October 2024


General world models represent a crucial pathway toward achieving Artificial General Intelligence (AGI), serving as the cornerstone for various applications ranging from virtual environments to decision-making systems. Recently, the emergence of the Sora model has attracted significant attention due to its remarkable simulation capabilities, exhibiting an incipient comprehension of physical laws. In this survey, we embark on a comprehensive exploration of the latest advancements in world models. Our analysis navigates through the forefront of generative methodologies in video generation, where world models stand as pivotal constructs facilitating the synthesis of highly realistic visual content. Additionally, we scrutinize the burgeoning field of autonomous-driving world models, meticulously delineating their indispensable role in reshaping transportation and urban mobility. Furthermore, we delve into the intricacies inherent in world models deployed within autonomous agents, shedding light on their profound significance in enabling intelligent interactions within dynamic environmental contexts. Finally, we examine challenges and limitations of world models and discuss their potential future directions. We hope this survey can serve as a foundational reference for the research community and inspire continued innovation. This survey will be regularly updated at: https://github.com/GigaAI-research/General-World-Models-Survey.


Training-Free Adaptive Diffusion with Bounded Difference Approximation Strategy

October 2024


Diffusion models have recently achieved great success in the synthesis of high-quality images and videos. However, the existing denoising techniques in diffusion models are commonly based on step-by-step noise predictions, which suffer from high computation costs, resulting in prohibitive latency for interactive applications. In this paper, we propose AdaptiveDiffusion to relieve this bottleneck by adaptively reducing the noise prediction steps during the denoising process. Our method considers the potential of skipping as many noise prediction steps as possible while keeping the final denoised results identical to the original full-step ones. Specifically, the skipping strategy is guided by the third-order latent difference, which indicates the stability between timesteps during the denoising process and enables the reuse of previous noise prediction results. Extensive experiments on image and video diffusion models demonstrate that our method can significantly speed up the denoising process while generating results identical to the original process, achieving an average 2-5x speedup without quality degradation.
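
The core skip-or-predict idea can be sketched as follows, assuming a generic step-wise sampler: track a third-order finite difference of successive latents and reuse the previous noise prediction when the trajectory is locally stable. The threshold and the denoiser/scheduler interfaces are hypothetical simplifications, not the paper's exact criterion.

def adaptive_denoise(denoiser, scheduler_step, latent, timesteps, threshold=0.01):
    # latent: a torch.Tensor; denoiser(latent, t) predicts noise; scheduler_step applies one update.
    history, prev_noise = [], None
    for t in timesteps:
        history.append(latent.detach())
        if len(history) >= 4 and prev_noise is not None:
            # Third-order finite difference of the latent trajectory.
            l0, l1, l2, l3 = history[-4:]
            d3 = (l3 - 3 * l2 + 3 * l1 - l0).abs().mean()
            if d3 < threshold:
                noise = prev_noise            # trajectory is stable: skip the network call
            else:
                noise = denoiser(latent, t)   # otherwise run the full noise prediction
        else:
            noise = denoiser(latent, t)
        prev_noise = noise
        latent = scheduler_step(latent, noise, t)
    return latent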


Few-Shot Cross-Domain Object Detection With Instance-Level Prototype-Based Meta-Learning

October 2024


IEEE Transactions on Circuits and Systems for Video Technology

In typical unsupervised domain adaptive object detection, it is assumed that extensive unlabeled training data from the target domain can be easily obtained. However, in some access-constrained scenarios, massive target data cannot be guaranteed, while acquiring only a few target samples and annotating them may cost less. Therefore, inspired by the success of meta-learning in few-shot tasks, in this work we propose an Instance-level Prototype learning Network (IPNet) for domain adaptive object detection under the supervised few-shot scenario. To compensate for the target domain data deficiency, we fuse cropped instances from labeled images in both domains to learn a representative prototype for each class, by enforcing features of the same class's instances from different domains to be as close as possible. These prototypes are further employed to discriminate the salience of various features in an image and to separate foreground and background regions for their respective domain alignment. Extensive experiments are conducted on several cross-domain scenarios, and the results show consistent accuracy gains of the IPNet over state-of-the-art methods, e.g., a 10.4% mAP increase on the Cityscapes-to-FoggyCityscapes setting and a 3.0% mAP increase on the Sim10k-to-Cityscapes setting.
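
A minimal sketch of the instance-level prototype idea described above: maintain one prototype per class from instance features of both domains (here via an assumed exponential-moving-average update) and pull same-class features from either domain toward that prototype. The exact update and loss form in IPNet may differ.

import torch
import torch.nn.functional as F

class PrototypeBank:
    def __init__(self, num_classes, dim, momentum=0.9):
        self.protos = torch.zeros(num_classes, dim)   # one prototype per class
        self.momentum = momentum

    def update(self, feats, labels):
        # feats: (N, dim) instance features from either domain; labels: (N,) class ids.
        for c in labels.unique():
            mean_c = feats[labels == c].mean(dim=0)
            self.protos[c] = self.momentum * self.protos[c] + (1 - self.momentum) * mean_c

    def alignment_loss(self, feats, labels):
        # Pull each instance feature toward its class prototype (cosine distance),
        # so same-class instances from both domains end up close in feature space.
        protos = F.normalize(self.protos[labels], dim=1)
        feats = F.normalize(feats, dim=1)
        return (1.0 - (feats * protos).sum(dim=1)).mean()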



MinerU: An Open-Source Solution for Precise Document Content Extraction

September 2024


Document content analysis has been a crucial research area in computer vision. Despite significant advancements in methods such as OCR, layout detection, and formula recognition, existing open-source solutions struggle to consistently deliver high-quality content extraction due to the diversity in document types and content. To address these challenges, we present MinerU, an open-source solution for high-precision document content extraction. MinerU leverages the sophisticated PDF-Extract-Kit models to extract content from diverse documents effectively and employs finely-tuned preprocessing and postprocessing rules to ensure the accuracy of the final results. Experimental results demonstrate that MinerU consistently achieves high performance across various document types, significantly enhancing the quality and consistency of content extraction. The MinerU open-source project is available at https://github.com/opendatalab/MinerU.
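
As a hypothetical sketch of the pipeline stages the abstract describes (layout detection, OCR, formula recognition, then rule-based postprocessing), the code below outlines how such a modular extractor could be wired together. None of these names are MinerU's real API; the actual interface is documented at https://github.com/opendatalab/MinerU.

def parse_document(pdf_pages, layout_model, ocr_model, formula_model):
    blocks_out = []
    for page in pdf_pages:
        for block in layout_model.detect(page):              # layout detection: region + type
            if block.type == "formula":
                text = formula_model.recognize(block.image)   # LaTeX for formula regions
            elif block.type in ("text", "title", "table"):
                text = ocr_model.recognize(block.image)       # OCR for textual regions
            else:
                continue                                      # skip figures, headers, etc.
            blocks_out.append({"type": block.type, "bbox": block.bbox, "text": text})
    return postprocess(blocks_out)

def postprocess(blocks):
    # Stand-in for the finely-tuned rules mentioned above, e.g. reading-order sorting.
    return sorted(blocks, key=lambda b: (b["bbox"][1], b["bbox"][0]))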


DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes

September 2024


Recent advances in diffusion models have significantly enhanced the controllable generation of streetscapes and facilitated downstream perception and planning tasks. However, challenges such as maintaining temporal coherence, generating long videos, and accurately modeling driving scenes persist. Accordingly, we propose DreamForge, an advanced diffusion-based autoregressive video generation model designed for the long-term generation of 3D-controllable and extensible video. In terms of controllability, DreamForge supports flexible conditions such as text descriptions, camera poses, 3D bounding boxes, and road layouts, while also providing perspective guidance to produce driving scenes that are both geometrically and contextually accurate. For consistency, we ensure inter-view consistency through cross-view attention and temporal coherence via an autoregressive architecture enhanced with motion cues. Code will be available at https://github.com/PJLab-ADG/DriveArena.
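
The autoregressive rollout pattern described above can be sketched roughly as follows: each new multi-view frame is generated conditioned on a window of previous frames (the motion cues) plus per-frame scene conditions. The generator interface and condition dictionary are assumptions for illustration, not DreamForge's actual code.

import torch

def autoregressive_rollout(generator, conditions, num_frames, num_views, context_len=2):
    # conditions[t]: per-frame conditioning (text, camera poses, 3D boxes, road layout).
    frames = []  # each entry: a (num_views, C, H, W) tensor of synchronized camera views
    for t in range(num_frames):
        context = frames[-context_len:]                     # motion cues from previous frames
        frame = generator(
            context=torch.stack(context) if context else None,
            cond=conditions[t],
            num_views=num_views,
        )
        frames.append(frame)
    return torch.stack(frames)                              # (T, num_views, C, H, W)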


Citations (19)


... Experimental results show that OccFusion performs exceptionally well in challenging environments such as rainy and nighttime conditions. LiCROcc [108] proposes a method that uses LiDAR and camera data to teach radar for accurate semantic occupancy prediction. By combining the rich information from LiDAR and cameras with the cost-effectiveness of radar, this method significantly enhances radar's performance in 3D occupancy prediction tasks. ...

Reference:

A Comprehensive Review of 3D Object Detection in Autonomous Driving: Technological Advances and Future Directions
LiCROcc: Teach Radar for Accurate Semantic Occupancy Prediction using LiDAR and Camera
  • Citing Article
  • January 2024

IEEE Robotics and Automation Letters

... Comprehensive perception and understanding of 3D scenes are important for autonomous driving (AD). We have witnessed the evolution of machine perception at different levels within a short period: from single-modal [1,2,3,4,5,6,7,8,9,10,11,12] to multi-modal inputs [13,14,15,16,17,18], from limited categories to open set [19,20,21,22,23,24,25], from 3D box to 3D occupancy [26,27,28,29,30,31,32], and from low-level detection to high-level understanding. Though remarkable, to train a model for different AD perception tasks, huge amounts of high-quality data and labels are still required, which is time-consuming and expensive. ...

VeloVox: A Low-Cost and Accurate 4D Object Detector with Single-Frame Point Cloud of Livox LiDAR
  • Citing Conference Paper
  • May 2024

... Despite its success in single-modality segmentation, extending SAM to multi-modal semantic segmentation poses significant challenges. Each modality, such as LiDAR, radar, and event cameras, exhibits distinct spatial, temporal, and noise characteristics, complicating their seamless integration into SAM's architecture [9]. SAM's pre-trained features, optimized for RGB images, often result in suboptimal performance when directly applied to heterogeneous multi-modal data. ...

Zero-training LiDAR-Camera Extrinsic Calibration Method Using Segment Anything Model
  • Citing Conference Paper
  • May 2024

... In contrast, closed-loop simulations [10,23,46,55] offer feedback-driven systems where an agent's actions influence and are influenced by other agents and the environment. For example, traffic flow-based simulation methods [12,46,55] successfully enable multi-agent simulations. However, they lack the ability to process visual sensor inputs, limiting their interplay with vision-based end-to-end models. ...

LimSim++: A Closed-Loop Platform for Deploying Multimodal LLMs in Autonomous Driving
  • Citing Conference Paper
  • June 2024

... One promising approach to addressing these challenges is through the use of world models [21,37]. In autonomous driving, world models serve as neural simulators [74] that generate synthetic environments and scenarios, supplementing real-world data, which can be difficult or dangerous to collect, and enabling closed-loop evaluations of autonomous driving systems. Early research explored the use of world models on simulator as policy models [16,27,46], directly guiding decision-making. ...

Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond

... Currently, methods utilizing MLLMs for driving tasks are primarily categorized into the following three types: 1. Finetuning MLLMs (Ding et al. 2024; Wen et al. 2023; Sima et al. 2023; Cui et al. 2023; Fu et al. 2024) directly for tasks such as prediction and planning. 2. The dual-branch system (Tian et al. 2024b; Ding et al. 2023; Mei et al. 2024) for separating and managing tasks based on real-time requirements, addressing time constraints with fast and slow branches. ...

Drive Like a Human: Rethinking Autonomous Driving with Large Language Models
  • Citing Conference Paper
  • January 2024

... An equally crucial area of research is the development of language-guided closed-loop autonomous driving systems. These systems leverage multimodal sensor data from simulators, as demonstrated by Lim-Sim++ (Fu et al., 2024a) and LMDrive (Shao et al., 2024). Additionally, RAG-Driver (Yuan et al., 2024) introduces a novel retrieval-augmented incontext learning approach, significantly enhancing the zero-shot generalization capabilities of driving LLMs. ...

LimSim++: A Closed-Loop Platform for Deploying Multimodal LLMs in Autonomous Driving

... It fuses all point cloud frames of one sequence and surpasses the detection accuracy of humans in the challenging Waymo benchmark. The top-performing offboard detectors [47,48] can serve as the auto labelling function that can significantly reduce the expensive cost of the labelling process. ...

DetZero: Rethinking Offboard 3D Object Detection with Long-term Sequential Point Clouds
  • Citing Conference Paper
  • October 2023

... However, during the transition to lane-free environments, human-driven vehicles (HVs) and AVs using end-to-end camera-based strategies, like Tesla's, will coexist with CAVs. These vehicles may not respond to CAV nudging suggestions and lack coordinated control, resembling the weak-lane-discipline driving common in developing countries, which can affect CAV operations (Sekeran et al., 2022, 2023; Chi et al., 2024). Given these complexities, further exploration of non-CAV participants' driving behavior in lane-free traffic is necessary. ...

Spatiotemporal-restricted A* algorithm as a support for lane-free traffic at intersections with mixed flows
  • Citing Article
  • April 2024

Green Energy and Intelligent Transportation

... The main goal is to achieve a smooth and logical interaction between the ego car and its surroundings vehicles, while also imitating the subtle behaviors of human drivers. However, translating the intricate rhythms of real-world traffic into the realm of simulation presents a considerable challenge [7], [8]. This challenge involves multiple dimensions, including the intricate interplay of road layouts and vehicle dynamics, as well as the subtleties of driver preferences and social interactions. ...

Human-Like Decision Making at Unsignalized Intersections Using Social Value Orientation
  • Citing Article
  • January 2023

IEEE Intelligent Transportation Systems Magazine