Baining Guo’s research while affiliated with Tsinghua University and other places

Publications (94)


ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation
  • Preprint
  • File available

February 2025 · 39 Reads

Yifan Pu · Yiming Zhao · Zhicong Tang · [...] · Baining Guo

Multi-layer image generation is a fundamental task that enables users to isolate, select, and edit specific image layers, thereby revolutionizing interactions with generative models. In this paper, we introduce the Anonymous Region Transformer (ART), which facilitates the direct generation of variable multi-layer transparent images based on a global text prompt and an anonymous region layout. Inspired by Schema theory, which suggests that knowledge is organized in frameworks (schemas) that enable people to interpret and learn from new information by linking it to prior knowledge, this anonymous region layout allows the generative model to autonomously determine which set of visual tokens should align with which text tokens, in contrast to the previously dominant semantic layout for the image generation task. In addition, the layer-wise region crop mechanism, which only selects the visual tokens belonging to each anonymous region, significantly reduces attention computation costs and enables the efficient generation of images with numerous distinct layers (e.g., 50+). When compared to the full attention approach, our method is over 12 times faster and exhibits fewer layer conflicts. Furthermore, we propose a high-quality multi-layer transparent image autoencoder that supports the direct encoding and decoding of the transparency of variable multi-layer images in a joint manner. By enabling precise control and scalable layer generation, ART establishes a new paradigm for interactive content creation.
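
The efficiency claim above rests on the layer-wise region crop: each anonymous region attends only over its own visual tokens plus the shared text tokens, rather than over the full multi-layer sequence. Below is a minimal sketch of that idea in PyTorch; linear projections are omitted and the tensor shapes and names are assumptions, not the paper's code.

```python
# Minimal sketch of layer-wise region-crop attention (names and shapes are
# assumptions, not the authors' implementation). Each anonymous region attends
# only over its own visual tokens plus the shared text tokens, so the cost
# scales with region size instead of the full multi-layer token sequence.
import torch
import torch.nn.functional as F

def region_crop_attention(vis_tokens, txt_tokens, region_ids, num_heads=8):
    """vis_tokens: (N_vis, C); txt_tokens: (N_txt, C);
    region_ids: (N_vis,) long tensor assigning each visual token to a region."""
    C = vis_tokens.shape[-1]
    out = torch.zeros_like(vis_tokens)
    for r in region_ids.unique():
        idx = (region_ids == r).nonzero(as_tuple=True)[0]
        # queries: this region's visual tokens; keys/values: same tokens + text
        q = vis_tokens[idx].view(1, -1, num_heads, C // num_heads).transpose(1, 2)
        kv = torch.cat([vis_tokens[idx], txt_tokens], dim=0)
        kv = kv.view(1, -1, num_heads, C // num_heads).transpose(1, 2)
        # linear projections are omitted for brevity
        attn = F.scaled_dot_product_attention(q, kv, kv)
        out[idx] = attn.transpose(1, 2).reshape(-1, C)
    return out

# Example: two regions of visual tokens plus a shared text prompt
vis = torch.randn(64, 128)
txt = torch.randn(16, 128)
regions = torch.randint(0, 2, (64,))
print(region_crop_attention(vis, txt, regions).shape)  # torch.Size([64, 128])
```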


Diffusion Models without Classifier-free Guidance

February 2025 · 1 Read

This paper presents Model-guidance (MG), a novel objective for training diffusion models that addresses and removes the commonly used Classifier-free guidance (CFG). Our approach goes beyond standard modeling of the data distribution alone to also incorporate the posterior probability of conditions. The proposed technique originates from the idea of CFG and is simple yet effective, making it a plug-and-play module for existing models. Our method significantly accelerates the training process, doubles the inference speed, and achieves exceptional quality that parallels and even surpasses concurrent diffusion models with CFG. Extensive experiments demonstrate its effectiveness, efficiency, and scalability across different models and datasets. Finally, we establish state-of-the-art performance on the ImageNet 256 benchmark with an FID of 1.34. Our code is available at https://github.com/tzco/Diffusion-wo-CFG.
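
The abstract describes folding the effect of classifier-free guidance into the training objective itself, so that sampling needs only a single conditional forward pass. The sketch below illustrates one plausible way to shape such a training target; the guidance weight w, the toy noise schedule, and the exact form of the target are illustrative assumptions, not the paper's published Model-guidance loss.

```python
# Hedged sketch: shaping a training target that bakes a CFG-like correction
# into the model itself so that no guidance pass is needed at sampling time.
# The weight w, the toy schedule, and the target are illustrative assumptions.
import torch
import torch.nn.functional as F

def mg_style_training_step(model, x0, cond, null_cond, w=1.5):
    """model(x_t, t, cond) predicts noise for a conditional diffusion model."""
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)           # time in [0, 1]
    alpha = (1.0 - t).view(-1, 1, 1, 1)                      # toy linear schedule
    x_t = alpha.sqrt() * x0 + (1.0 - alpha).sqrt() * noise

    with torch.no_grad():                                    # guidance direction, no gradient
        eps_c = model(x_t, t, cond)
        eps_u = model(x_t, t, null_cond)
    target = noise + w * (eps_c - eps_u)                     # CFG-shaped target

    pred = model(x_t, t, cond)                               # single conditional pass to train
    return F.mse_loss(pred, target)
```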


Fig. 1 A 2D illustration of sparse points in a 4 × 4 window. (a) A fully-occupied window. (b) A sparsely-occupied window, with white cells being empty. (c) Regularly-distributed sparse points in a window. (d) Sparse points irregularly distributed in the window, where different circle colors indicate the varying point-wise signal, such as the RGB color. For simplicity, only one point is drawn on non-empty cells.
Category-wise segmentation results evaluated on ScanNet validation set
Quantitative evaluation on 3D detection (ScanNet). The methods in the upper part of the table are supervised methods, while those in the lower part are based on pretraining
Quantitative comparison of 3D detection (S3DIS)
Quantitative comparison of 3D detection (ScanNet)

+1

Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding

January 2025 · 10 Reads · 22 Citations

The use of pretrained backbones with finetuning has shown success for 2D vision and natural language processing tasks, with advantages over task-specific networks. In this paper, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We designed a 3D Swin Transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrained a large Swin3D model on a synthetic Structured3D dataset, which is an order of magnitude larger than the ScanNet dataset. Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real 3D point datasets but also outperforms state-of-the-art methods on downstream tasks with +2.3 mIoU and +2.2 mIoU on S3DIS Area5 and 6-fold semantic segmentation, respectively, +1.8 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. A series of extensive ablation studies further validated the scalability, generality, and superior performance enabled by our approach.
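
The linear-memory claim comes from restricting self-attention to the occupied voxels inside each window. A minimal sketch of window-wise sparse attention is given below; the shifted windows and the generalized contextual relative positional embedding described in the abstract are omitted, and the hashing scheme, projections, and shapes are illustrative assumptions.

```python
# Sketch of window-wise self-attention over sparse voxels (illustrative only;
# Swin3D additionally uses shifted windows and a generalized contextual
# relative positional embedding). Only occupied voxels are stored, so memory
# grows linearly with their number.
import torch
import torch.nn.functional as F

def sparse_window_attention(feats, coords, window_size=4, num_heads=4):
    """feats: (N, C) features of occupied voxels; coords: (N, 3) integer voxel coords."""
    C = feats.shape[-1]
    win = coords // window_size                               # window index of each voxel
    # hash each 3D window index into one integer key (assumed hashing scheme)
    key = (win[:, 0] * 73856093) ^ (win[:, 1] * 19349663) ^ (win[:, 2] * 83492791)
    out = torch.empty_like(feats)
    for k in key.unique():
        idx = (key == k).nonzero(as_tuple=True)[0]
        x = feats[idx].view(1, -1, num_heads, C // num_heads).transpose(1, 2)
        y = F.scaled_dot_product_attention(x, x, x)           # projections omitted for brevity
        out[idx] = y.transpose(1, 2).reshape(-1, C)
    return out

feats = torch.randn(1000, 64)
coords = torch.randint(0, 32, (1000, 3))
print(sparse_window_attention(feats, coords).shape)           # torch.Size([1000, 64])
```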


MageBench: Bridging Large Multimodal Models to Agents

December 2024 · 5 Reads

LMMs have shown impressive visual understanding capabilities, with the potential to be applied in agents, which demand strong reasoning and planning abilities. Nevertheless, existing benchmarks mostly assess their reasoning abilities on the language side, where the chain-of-thought is entirely composed of text. We consider the scenario where visual signals are continuously updated and required along the decision-making process. Such a vision-in-the-chain reasoning paradigm is more aligned with the needs of multimodal agents, yet it is rarely evaluated. In this paper, we introduce MageBench, a reasoning-capability-oriented multimodal agent benchmark that, while having light-weight environments, poses significant reasoning challenges and holds substantial practical value. This benchmark currently includes three types of environments: WebUI, Sokoban, and Football, comprising a total of 483 different scenarios. It thoroughly validates the agent's knowledge and engineering capabilities, visual intelligence, and interaction skills. The results show that only a few product-level models perform better than random acting, and all of them are far inferior to human-level performance. More specifically, we found that current models severely lack the ability to modify their planning based on visual feedback, as well as visual imagination, interleaved image-text long-context handling, and other abilities. We hope that our work will provide optimization directions for LMMs from the perspective of acting as agents. We release our code and data at https://github.com/microsoft/MageBench.
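
The vision-in-the-chain setting means the agent receives a fresh image after every action and must re-plan from it, rather than reasoning over a fixed prompt. A schematic evaluation loop is sketched below; the env and agent interfaces are placeholders for illustration, not the benchmark's actual API.

```python
# Schematic "vision-in-the-chain" evaluation loop: the agent gets a new image
# after every action and must re-plan from it. The env/agent interfaces below
# are placeholders, not MageBench's actual API.
from dataclasses import dataclass, field

@dataclass
class EpisodeLog:
    steps: list = field(default_factory=list)

def run_episode(env, agent, max_steps=50):
    obs_image, instruction = env.reset()                  # initial frame + task text
    log, reward = EpisodeLog(), 0.0
    for _ in range(max_steps):
        # the model conditions on the *current* frame plus the textual history
        action, thought = agent.act(image=obs_image,
                                    instruction=instruction,
                                    history=log.steps)
        obs_image, reward, done, info = env.step(action)  # visual signal updates each step
        log.steps.append({"thought": thought, "action": action, "reward": reward})
        if done:
            break
    return reward, log
```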



UniGraspTransformer: Simplified Policy Distillation for Scalable Dexterous Robotic Grasping

December 2024 · 4 Reads

We introduce UniGraspTransformer, a universal Transformer-based network for dexterous robotic grasping that simplifies training while enhancing scalability and performance. Unlike prior methods such as UniDexGrasp++, which require complex, multi-step training pipelines, UniGraspTransformer follows a streamlined process: first, dedicated policy networks are trained for individual objects using reinforcement learning to generate successful grasp trajectories; then, these trajectories are distilled into a single, universal network. Our approach enables UniGraspTransformer to scale effectively, incorporating up to 12 self-attention blocks for handling thousands of objects with diverse poses. Additionally, it generalizes well to both idealized and real-world inputs, evaluated in state-based and vision-based settings. Notably, UniGraspTransformer generates a broader range of grasping poses for objects of various shapes and orientations, resulting in more diverse grasp strategies. Experimental results demonstrate significant improvements over the state-of-the-art method, UniDexGrasp++, across various object categories, achieving success rate gains of 3.5%, 7.7%, and 10.1% on seen objects, unseen objects within seen categories, and completely unseen objects, respectively, in the vision-based setting. Project page: https://dexhand.github.io/UniGraspTransformer.
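
The second stage of the pipeline described above is plain supervised distillation: trajectories collected by the per-object RL policies become (state, action) training pairs for one universal transformer. A sketch under that assumption follows; module sizes, data layout, and the MSE objective are illustrative, not the released implementation.

```python
# Sketch of the distillation stage: per-object RL policies have already produced
# successful grasp trajectories; a single transformer policy is then trained by
# behavior cloning on the pooled (state, action) pairs. Shapes and sizes are
# illustrative assumptions.
import torch
import torch.nn as nn

class UniversalGraspPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, d_model=256, n_blocks=12):
        super().__init__()
        self.embed = nn.Linear(state_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_blocks)
        self.head = nn.Linear(d_model, action_dim)

    def forward(self, states):                             # states: (B, T, state_dim)
        return self.head(self.encoder(self.embed(states)))

def distill(policy, loader, epochs=10, lr=1e-4):
    opt = torch.optim.AdamW(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for states, expert_actions in loader:              # pairs from the RL teachers
            loss = nn.functional.mse_loss(policy(states), expert_actions)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy
```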


CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

November 2024 · 58 Reads

The advancement of large Vision-Language-Action (VLA) models has significantly improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. While existing VLAs adapted from pretrained large Vision-Language Models (VLMs) have demonstrated promising generalizability, their task performance is still unsatisfactory, as indicated by low task success rates in different environments. In this paper, we present a new advanced VLA architecture derived from VLMs. Unlike previous works that directly repurpose a VLM for action prediction through simple action quantization, we propose a componentized VLA architecture that has a specialized action module conditioned on the VLM output. We systematically study the design of the action module and demonstrate the strong performance enhancement with diffusion action transformers for action sequence modeling, as well as their favorable scaling behaviors. We also conduct comprehensive experiments and ablation studies to evaluate the efficacy of our models with varied designs. The evaluation on 5 robot embodiments in simulation and the real world shows that our model not only significantly surpasses existing VLAs in task performance but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds. It exceeds the average success rate of OpenVLA, which has a similar model size (7B) to ours, by over 35% in simulated evaluation and 55% in real robot experiments. It also outperforms the large RT-2-X model (55B) by 18% absolute success rate in simulation. Code and models can be found on our project page (https://cogact.github.io/).
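
The componentized design separates cognition from action: the VLM produces a conditioning feature, and a separate diffusion transformer denoises an action sequence conditioned on it. The sketch below shows only that structure; layer counts, dimensions, and interfaces are assumptions rather than CogACT's released architecture.

```python
# Structural sketch of a componentized VLA: a VLM backbone yields a cognition
# feature, and a separate diffusion action transformer conditioned on it
# denoises an action chunk. Dimensions and interfaces are assumptions.
import torch
import torch.nn as nn

class DiffusionActionHead(nn.Module):
    def __init__(self, action_dim, cond_dim, d_model=512, n_layers=6):
        super().__init__()
        self.action_in = nn.Linear(action_dim, d_model)
        self.cond_in = nn.Linear(cond_dim, d_model)
        self.time_in = nn.Linear(1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, action_dim)

    def forward(self, noisy_actions, t, cond):
        """noisy_actions: (B, H, A); t: (B,) diffusion step; cond: (B, cond_dim) from the VLM."""
        tok = (self.action_in(noisy_actions)
               + self.cond_in(cond).unsqueeze(1)
               + self.time_in(t.float().view(-1, 1)).unsqueeze(1))
        return self.out(self.blocks(tok))                  # predicted noise for the action chunk

head = DiffusionActionHead(action_dim=7, cond_dim=4096)
noise_pred = head(torch.randn(2, 16, 7), torch.randint(0, 1000, (2,)), torch.randn(2, 4096))
print(noise_pred.shape)                                    # torch.Size([2, 16, 7])
```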


RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models

July 2024 · 6 Reads

We present RodinHD, which can generate high-fidelity 3D avatars from a portrait image. Existing methods fail to capture intricate details such as hairstyles, which we tackle in this paper. We first identify an overlooked problem of catastrophic forgetting that arises when fitting triplanes sequentially on many avatars, caused by the MLP decoder sharing scheme. To overcome this issue, we propose a novel data scheduling strategy and a weight consolidation regularization term, which improves the decoder's capability of rendering sharper details. Additionally, we optimize the guiding effect of the portrait image by computing a finer-grained hierarchical representation that captures rich 2D texture cues and injecting it into the 3D diffusion model at multiple layers via cross-attention. When trained on 46K avatars with a noise schedule optimized for triplanes, the resulting model can generate 3D avatars with notably better details than previous methods and can generalize to in-the-wild portrait inputs.
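
The weight consolidation regularization term mentioned above penalizes drift of shared-decoder parameters that mattered for previously fitted avatars. A generic sketch in the spirit of elastic weight consolidation is shown below; the paper's exact penalty and importance weights may differ.

```python
# Hedged sketch of a weight-consolidation penalty for a decoder that is shared
# across sequentially fitted triplanes: parameters that were important for
# previously fitted avatars are pulled back toward their earlier values.
# RodinHD's exact term may differ from this generic EWC-style recipe.
import torch

def consolidation_penalty(decoder, anchor_params, importance, lam=1e-2):
    """anchor_params / importance: dicts of tensors snapshotted after earlier fits."""
    penalty = torch.zeros((), device=next(decoder.parameters()).device)
    for name, p in decoder.named_parameters():
        if name in anchor_params:
            penalty = penalty + (importance[name] * (p - anchor_params[name]) ** 2).sum()
    return lam * penalty

# During fitting of the next avatar's triplane (hypothetical usage):
#   loss = reconstruction_loss + consolidation_penalty(decoder, anchors, weights)
```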




Citations (68)


... However, the sparse nature of point clouds creates parallelisation challenges due to varied window sizes. Various solutions have been proposed to alleviate this issue [10,29,44,59], at the cost of bulky implementations. Serialisation-based transformers overcome these inefficiencies by converting point clouds into ordered sequences, enabling structured attention over equally sized windows. ...

Reference:

HOTFormerLoc: Hierarchical Octree Transformer for Versatile Lidar Place Recognition Across Ground and Aerial Views
Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding

... In contrast to the previous work that focuses on expensive real data, we overcome these limitations and propose SynShot, a new method that builds a prior on synthetic data, and adapts to a real test subject requiring only a few input images. Building on the success of ML models trained on synthetic data for tasks like 3D face regression [54], 2D landmark prediction [66], rigid face alignment [3], and few-shot head reconstruction [6,64,74], SynShot is trained solely on a large synthetic dataset generated from 3DMM samples and diverse assets. Synthetic data offers complete control over dataset creation to meet size and diversity needs for training an expressive head prior, eliminating the need for costly capture hardware and addressing privacy concerns with real subjects. ...

RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models
  • Citing Chapter
  • December 2024

... Several works have focused on instruction-following image editing models, e.g. InstructPix2Pix [4] and Instruct-Diffusion [12]. These models take an input image and natural language instruction and perform the desired edit on the image. ...

InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
  • Citing Conference Paper
  • June 2024

... Layout generation is an important task in graphic design intelligence (Li et al. 2021) (e.g., layout representation learning (Feng et al. 2022), layout reverse engineering (Hao et al. 2023; Shi et al. 2023; Zhu et al. 2024; Huang et al. 2021)). Traditional works (Hurst, Li, and Marriott 2009; Kumar et al. 2011; O'Donovan, Agarwala, and Hertzmann 2014; Tabata et al. 2019) are mostly based on heuristics with constraint optimization, which usually ensure high-quality but limited outputs. ...

Unsupervised Graphic Layout Grouping with Transformers
  • Citing Conference Paper
  • January 2024

... However, in the SR task, the input size differs between training and testing, making it unstable to apply this approach consistently. AFFNet [15] multiplies mixed tokens with Fourier features. However, the multiplication of complex tensors produces unstable values, leading to gradient explosion in SR training. ...

Adaptive Frequency Filters As Efficient Global Token Mixers
  • Citing Conference Paper
  • October 2023

... However, these works mainly focus on studying Far-OOD scenarios where the distributions of in-distribution (ID) and OOD data are distant. Additionally, while zero-shot OOD detection methods using CLIP have shown great potential, they may struggle to capture the nuances and specific characteristics of downstream tasks (Wei et al. 2023). Therefore, when the distributions of ID and OOD data are similar for the Near-OOD scenarios, it is still a challenging problem requiring more discriminative and detailed information. ...

Improving CLIP Fine-tuning Performance
  • Citing Conference Paper
  • October 2023

... In general, unimodal learning tasks can be broadly categorized into three types based on the availability and quantity of labeled data: (i) Supervised Learning, where a large amount of labeled data is available for training, allowing models to learn features and perform accurate recognition (Yan et al., 2023a; Li et al., 2023a); (ii) Few-shot Learning, where only a limited number of labeled samples are provided for each class, challenging the model to generalize effectively from minimal data (Luo et al., 2023); and (iii) Zero-shot Learning, where no labeled examples are available for certain classes (Wei et al., 2023; Li et al., 2024a; Mirza et al., 2024). ...

iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-training for Visual Recognition
  • Citing Conference Paper
  • June 2023

... Recent advancements in multi-modal machine learning have significantly enhanced models' ability to process and integrate data from diverse modalities, such as language, acoustic, vision, and tabular data [25,66,91]. With the development of deep learning architectures and sophisticated interaction designs, models are able to learn, infer, and reason by integrating multiple communicative modalities. ...

MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
  • Citing Conference Paper
  • June 2023