November 2024 · 3 Reads · 82 Citations
November 2024 · 38 Reads
Diffusion models have been proven highly effective at generating high-quality images. However, as these models grow larger, they require significantly more memory and suffer from higher latency, posing substantial challenges for deployment. In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits. At such an aggressive level, both weights and activations are highly sensitive, where conventional post-training quantization methods for large language models like smoothing become insufficient. To overcome this limitation, we propose SVDQuant, a new 4-bit quantization paradigm. Different from smoothing which redistributes outliers between weights and activations, our approach absorbs these outliers using a low-rank branch. We first consolidate the outliers by shifting them from activations to weights, then employ a high-precision low-rank branch to take in the weight outliers with Singular Value Decomposition (SVD). This process eases the quantization on both sides. However, naïvely running the low-rank branch independently incurs significant overhead due to extra data movement of activations, negating the quantization speedup. To address this, we co-design an inference engine Nunchaku that fuses the kernels of the low-rank branch into those of the low-bit branch to cut off redundant memory access. It can also seamlessly support off-the-shelf low-rank adapters (LoRAs) without the need for re-quantization. Extensive experiments on SDXL, PixArt-Σ, and FLUX.1 validate the effectiveness of SVDQuant in preserving image quality. We reduce the memory usage for the 12B FLUX.1 models by 3.5×, achieving a 3.0× speedup over the 4-bit weight-only quantized baseline on the 16GB laptop 4090 GPU, paving the way for more interactive applications on PCs. Our quantization library and inference engine are open-sourced.
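The core decomposition can be illustrated in a few lines of PyTorch. The sketch below is a simplified reading of the abstract, not the released Nunchaku implementation: the function names, the rank, and the fake 4-bit quantizer are placeholders, and the activation-to-weight smoothing step is omitted.

```python
import torch

def fake_quantize_4bit(w):
    # Symmetric 4-bit fake quantization (a stand-in for a real INT4 kernel).
    scale = w.abs().max() / 7.0
    return (w / scale).round().clamp(-8, 7) * scale

def svdquant_decompose(W, rank=32):
    # The low-rank branch (kept in 16-bit) absorbs the dominant, outlier-heavy
    # components of the weight; the residual is then easier to quantize to 4 bits.
    U, S, V = torch.svd_lowrank(W, q=rank)
    L1 = U * S                              # (out_features, rank)
    L2 = V.t()                              # (rank, in_features)
    R_q = fake_quantize_4bit(W - L1 @ L2)   # low-bit residual
    return L1, L2, R_q

def low_rank_plus_lowbit_forward(x, L1, L2, R_q):
    # y = x W^T  ≈  x R_q^T + (x L2^T) L1^T
    return x @ R_q.t() + (x @ L2.t()) @ L1.t()
```

At inference, the low-rank matmuls stay in high precision while the residual uses 4-bit kernels; per the abstract, Nunchaku's contribution is fusing these branches so the extra low-rank path does not add redundant memory traffic.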
October 2024 · 3 Reads · 1 Citation
October 2024 · 2 Citations
October 2024 · 2 Reads
We introduce Hybrid Autoregressive Transformer (HART), an autoregressive (AR) visual generation model capable of directly generating 1024x1024 images, rivaling diffusion models in image generation quality. Existing AR models face limitations due to the poor image reconstruction quality of their discrete tokenizers and the prohibitive training costs associated with generating 1024px images. To address these challenges, we present the hybrid tokenizer, which decomposes the continuous latents from the autoencoder into two components: discrete tokens representing the big picture and continuous tokens representing the residual components that cannot be represented by the discrete tokens. The discrete component is modeled by a scalable-resolution discrete AR model, while the continuous component is learned with a lightweight residual diffusion module with only 37M parameters. Compared with the discrete-only VAR tokenizer, our hybrid approach improves reconstruction FID from 2.11 to 0.30 on MJHQ-30K, leading to a 31% generation FID improvement from 7.85 to 5.38. HART also outperforms state-of-the-art diffusion models in both FID and CLIP score, with 4.5-7.7x higher throughput and 6.9-13.4x lower MACs. Our code is open sourced at https://github.com/mit-han-lab/hart.
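The hybrid-tokenizer idea in the abstract can be sketched as a plain vector-quantization step plus a residual. The nearest-neighbor codebook lookup and the tensor shapes below are illustrative assumptions, not HART's actual tokenizer.

```python
import torch

def hybrid_tokenize(latents, codebook):
    # latents: (N, D) continuous autoencoder latents; codebook: (K, D) embeddings.
    dists = torch.cdist(latents, codebook)   # (N, K) pairwise distances
    idx = dists.argmin(dim=1)                # discrete tokens: the "big picture"
    discrete = codebook[idx]                 # quantized reconstruction
    residual = latents - discrete            # continuous residual tokens
    # The discrete tokens go to the scalable-resolution AR model; the residual
    # is modeled by the lightweight residual diffusion module.
    return idx, residual
```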
October 2024 · 348 Reads
We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on a laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with a modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion models (e.g., Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.
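As an illustration of why the Linear DiT component scales to high resolutions, here is a generic linear-attention formulation with a ReLU feature map: it builds a d×d key-value summary once, so cost grows linearly in the token count N rather than quadratically. This is a textbook variant, not necessarily the exact layer used in the paper.

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, heads, N, d). Cost is O(N * d^2) instead of O(N^2 * d).
    q, k = torch.relu(q), torch.relu(k)
    kv = torch.einsum('bhnd,bhne->bhde', k, v)                      # key-value summary
    normalizer = torch.einsum('bhnd,bhd->bhn', q, k.sum(dim=2)) + eps
    out = torch.einsum('bhnd,bhde->bhne', q, kv)
    return out / normalizer.unsqueeze(-1)
```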
October 2024 · 78 Reads
We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder models for accelerating high-resolution diffusion models. Existing autoencoder models have demonstrated impressive results at a moderate spatial compression ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy for high spatial compression ratios (e.g., 64x). We address this challenge by introducing two key techniques: (1) Residual Autoencoding, where we design our models to learn residuals based on the space-to-channel transformed features to alleviate the optimization difficulty of high spatial-compression autoencoders; (2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phase training strategy for mitigating the generalization penalty of high spatial-compression autoencoders. With these designs, we improve the autoencoder's spatial compression ratio up to 128 while maintaining the reconstruction quality. Applying our DC-AE to latent diffusion models, we achieve significant speedup without accuracy drop. For example, on ImageNet 512x512, our DC-AE provides 19.1x inference speedup and 17.9x training speedup on H100 GPU for UViT-H while achieving a better FID, compared with the widely used SD-VAE-f8 autoencoder. Our code is available at https://github.com/mit-han-lab/efficientvit.
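A minimal reading of the Residual Autoencoding idea: the downsampling block only has to learn a correction on top of a lossless space-to-channel rearrangement of its input. The block below uses PixelUnshuffle as that rearrangement and is purely illustrative; the released DC-AE architecture differs in its channel handling and training phases.

```python
import torch.nn as nn

class ResidualSpaceToChannelDown(nn.Module):
    # Downsample by `factor` while learning only a residual on top of the
    # space-to-channel shortcut, easing optimization at high compression ratios.
    def __init__(self, channels, factor=2):
        super().__init__()
        self.space_to_channel = nn.PixelUnshuffle(factor)  # (B, C, H, W) -> (B, C*f*f, H/f, W/f)
        hidden = channels * factor * factor
        self.refine = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1)

    def forward(self, x):
        shortcut = self.space_to_channel(x)   # lossless rearrangement
        return shortcut + self.refine(shortcut)
```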
September 2024 · 53 Reads
VILA-U is a Unified foundation model that integrates Video, Image, Language understanding and generation. Traditional visual language models (VLMs) use separate modules for understanding and generating visual content, which can lead to misalignment and increased complexity. In contrast, VILA-U employs a single autoregressive next-token prediction framework for both tasks, eliminating the need for additional components like diffusion models. This approach not only simplifies the model but also achieves near state-of-the-art performance in visual language understanding and generation. The success of VILA-U is attributed to two main factors: the unified vision tower that aligns discrete visual tokens with textual inputs during pretraining, which enhances visual perception, and the finding that autoregressive image generation can achieve quality similar to diffusion models when trained on a high-quality dataset. This allows VILA-U to perform comparably to more complex models using a fully token-based autoregressive framework.
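The unified next-token setup described above boils down to placing text tokens and discrete visual tokens in one causal sequence. The snippet below is only a schematic of that interleaving; the vocabulary offset and shapes are made-up placeholders, not VILA-U's actual tokenization.

```python
import torch

def build_unified_sequence(text_ids, image_ids, image_vocab_offset=32000):
    # Shift discrete visual tokens into their own vocabulary range and append
    # them to the text tokens, so a single autoregressive transformer can
    # handle both modalities with ordinary next-token prediction.
    return torch.cat([text_ids, image_ids + image_vocab_offset], dim=-1)
```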
September 2024 · 72 Reads · 1 Citation
Communications Engineering
Understanding a person’s behavior from their 3D motion sequence is a fundamental problem in computer vision with many applications. An important component of this problem is 3D action localization, which involves recognizing what actions a person is performing, and when the actions occur in the sequence. To promote the progress of the 3D action localization community, we introduce a new, challenging, and more complex benchmark dataset, BABEL-TAL (BT), for 3D action localization. Important baselines and evaluation metrics, as well as human evaluations, are carefully established on this benchmark. We also propose a strong baseline model, i.e., Localizing Actions with Transformers (LocATe), that jointly localizes and recognizes actions in a 3D sequence. The proposed LocATe shows superior performance on BABEL-TAL as well as on the large-scale PKU-MMD dataset, achieving state-of-the-art performance by using only 10% of the labeled training data. Our research could advance the development of more accurate and efficient systems for human behavior analysis, with potential applications in areas such as human-computer interaction and healthcare.
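Temporal action localization is typically scored by matching predicted segments to ground truth by temporal IoU; a minimal version of that overlap computation is sketched below. Threshold conventions vary by benchmark, and this is not the paper's exact evaluation code.

```python
def temporal_iou(pred, gt):
    # pred, gt: (start, end) frame indices or timestamps of an action segment.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```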
July 2024 · 11 Reads
This paper proposes an algorithm for automatically labeling 3D objects from 2D point or box prompts, especially focusing on applications in autonomous driving. Unlike prior methods, our auto-labeler predicts 3D shapes instead of bounding boxes and does not require training on a specific dataset. We propose a Segment, Lift, and Fit (SLF) paradigm to achieve this goal. Firstly, we segment high-quality instance masks from the prompts using the Segment Anything Model (SAM) and transform the remaining problem into predicting 3D shapes from given 2D masks. Due to the ill-posed nature of this problem, it presents a significant challenge as multiple 3D shapes can project into an identical mask. To tackle this issue, we then lift 2D masks to 3D forms and employ gradient descent to adjust their poses and shapes until the projections fit the masks and the surfaces conform to surrounding LiDAR points. Notably, since we do not train on a specific dataset, the SLF auto-labeler does not overfit to biased annotation patterns in the training set as other methods do. Thus, the generalization ability across different datasets improves. Experimental results on the KITTI dataset demonstrate that the SLF auto-labeler produces high-quality bounding box annotations, achieving an AP@0.5 IoU of nearly 90%. Detectors trained with the generated pseudo-labels perform nearly as well as those trained with actual ground-truth annotations. Furthermore, the SLF auto-labeler shows promising results in detailed shape predictions, providing a potential alternative for the occupancy annotation of dynamic objects.
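The "lift and fit" step of the abstract amounts to gradient-based pose/shape refinement against two signals: the SAM mask and the surrounding LiDAR points. The sketch below assumes hypothetical differentiable helpers (project_silhouette, surface_distance) and made-up loss weights; it shows the optimization pattern, not the authors' implementation.

```python
import torch

def fit_shape(params, mask, lidar_points, project_silhouette, surface_distance,
              steps=200, lr=1e-2, w_mask=1.0, w_lidar=0.5):
    # params: differentiable pose/shape parameters of the lifted 3D object.
    params = params.clone().requires_grad_(True)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        mask_loss = (project_silhouette(params) - mask).abs().mean()   # projection fits the 2D mask
        lidar_loss = surface_distance(params, lidar_points).mean()     # surface conforms to LiDAR
        loss = w_mask * mask_loss + w_lidar * lidar_loss
        loss.backward()
        opt.step()
    return params.detach()
```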
... The images of previous frames are processed as well; the total outputs of this module are $V_I^t, \ldots, V_I^{t-k}$. 2) Spatio-temporal Alignment: The effectiveness of temporal information has been proven in previous methods, such as [27], [28], which align features in the BEV plane and cause the loss of height information. Due to the rugged surface of off-road environments, height changes frequently. ...
January 2024
IEEE Transactions on Pattern Analysis and Machine Intelligence
... We denote such an evaluator as the LLM Grader, and include the prompting and evaluation setup in Appendix A. Lastly, on T2I-CompBench, we use the evaluation pipeline provided by Huang et al. [30] to assess the performance of our framework in compositional generation tasks. ...
November 2024
... However, as the demand for higher resolution and more complex shapes increases, existing methodologies struggle to keep pace, particularly when scaling to larger sequence lengths or higher voxel resolutions. Traditional approaches, such as diffusion transformers (DiT-3D) [11,12] that leverage self-attention mechanisms, although promising, are hindered by the cubic complexity of attention operations relative to input length. This complexity barrier poses significant challenges, particularly in resource-intensive scenarios involving high-resolution 3D shape generation. ...
October 2024
... What is more, as the basic setting of depth inpainting, we choose depth as another essential input. This is because monocular depth estimation is, more likely, an ill-posed problem [9]-[12]. As shown in Fig. 2, it is impossible to infer precise spatial information from a single RGB image. ...
October 2024
... Alternatively, we can update the RGB component first and use it to condition the alpha update, but we observed that the former approach worked best in practice. When sampling from diffusion models, it is common to use fewer sampling steps than at training time for faster image generation [53]. Therefore, in order to make our training regime more flexible and applicable to a variety of sampling strategies, we additionally use pairs $(y_t^{RGB}, y_k^{\alpha})$ and $(y_k^{RGB}, y_t^{\alpha})$ as conditioning input in the second half of training iterations, with $k$ randomly sampled in $[0, t-1]$. ...
June 2024
... However, there are many other applications where DEA can be very promising. Three such applications are face recognition [43], drug discovery [44], and activity recognition from sensor data [45]-[47]. We demonstrated the high-performance application of DEA for face recognition in supplementary section 1. ...
September 2024
Communications Engineering
... These images are initially converted into perspective-view features, $\{F_{PV_i}\}_{i=1}^{N}$, using a feature extraction function $f_{PV}$, such that $F_{PV_i} = f_{PV}(I_i)$ and $F_{PV_i} \in \mathbb{R}^{H_{PV} \times W_{PV} \times C_{PV}}$. Subsequently, the perspective-view features are aggregated and projected into a single Bird's Eye View (BEV) feature map, $F_{BEV}$, using a projection function $f_{BEV}$, which can be implemented using methods such as Lift-Splat-Shoot (LSS) [18,29,47], transformers [6,30,56,72], or other projection techniques [15,28,61]. This projection is defined as ...
June 2024
IEEE Transactions on Pattern Analysis and Machine Intelligence
... [Comparison table of accident-anticipation datasets and methods (CADP, VIENA2, DADA-2000, GTACrash, CCD, DoTA, ROL, MM-AU, DRIVE, DSTA, CAP, FOL-Ensemble, AM-Net, THAT-Net, DAA-GNN, TTHF) omitted.] DeepAccident [20] is the first accident prediction dataset designed for Vehicle-to-Everything (V2X) applications. ...
March 2024
Proceedings of the AAAI Conference on Artificial Intelligence
... Moreover, the performance of these multi-modal detectors can significantly degrade in the absence of LiDAR data, potentially falling behind the capabilities of camera-only detectors. Some methods (Liang et al. 2022; Yan et al. 2023; Ge et al. 2023; Wang et al. 2024) focus on solving this problem but usually introduce redundant model architectures or rely on mask-modal data augmentation, which increases training time or wastes data. This highlights the need for more efficient and robust approaches that can maintain high performance even in scenarios where certain data modalities are unavailable. ...
October 2023
... These results highlight the need for textual focus to adapt dynamically to varying visual content during cross-modal interaction. Second, although some works [4,8,9,10] try to merge cross-scale visual information, they directly concatenate the hierarchical visual features and leverage standard self-processing operations (e.g., self-attention [11]), before decoding the final predictions (as shown in Fig. 1a). However, the spatial priors derived from both visual and linguistic inputs impose greater demands on the representation of global context and local details. ...
October 2023