Chao-Yuan Wu’s research while affiliated with Meta and other places


Publications (38)


SAM 2: Segment Anything in Images and Videos
  • Preprint

August 2024 · 129 Reads · 4 Citations

Nikhila Ravi · Valentin Gabeur · Yuan-Ting Hu · [...] · Christoph Feichtenhofer

We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video processing. SAM 2 trained on our data provides strong performance across a wide range of tasks. In video segmentation, we observe better accuracy, using 3x fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6x faster than the Segment Anything Model (SAM). We believe that our data, model, and insights will serve as a significant milestone for video segmentation and related perception tasks. We are releasing a version of our model, the dataset and an interactive demo.
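The abstract describes a simple transformer with streaming memory for real-time video processing. As a rough illustration of that general pattern only (not the released SAM 2 code; the module names, shapes, and FIFO memory size below are hypothetical), the current frame's features can cross-attend to a small bank of past-frame features:

```python
# Illustrative sketch of a generic "streaming memory" pattern: the current
# frame's tokens cross-attend to a small FIFO bank of past-frame tokens.
# This is NOT the released SAM 2 code; all names and sizes are hypothetical.
from collections import deque

import torch
import torch.nn as nn


class StreamingMemoryAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, memory_size: int = 6):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.memory = deque(maxlen=memory_size)  # fixed-size bank of past frames

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_tokens, dim) features of the current frame.
        if self.memory:
            bank = torch.cat(list(self.memory), dim=1)       # past-frame tokens
            fused, _ = self.cross_attn(frame_tokens, bank, bank)
            frame_tokens = frame_tokens + fused              # condition on the past
        self.memory.append(frame_tokens.detach())            # store for later frames
        return frame_tokens


if __name__ == "__main__":
    layer = StreamingMemoryAttention()
    video = torch.randn(8, 2, 64, 256)   # 8 frames, batch 2, 64 tokens, dim 256
    for t in range(video.shape[0]):
        out = layer(video[t])            # frames processed one at a time (streaming)
    print(out.shape)                     # torch.Size([2, 64, 256])
```

Because the bank has a fixed maximum size, per-frame cost stays constant regardless of video length, which is the point of a streaming design; the released model, dataset, and demo remain the authoritative reference.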




Figure 3. Ablation experiments. (a) Learning without activation caching does not hurt reversible accuracy for RevViT-B of varying depths, and internal residual connections train well for shallow models but training diverges for deeper ones. (b) Rev-MViT has higher throughput at higher input resolutions and deeper models, increasing up to 2.3× at 224 resolution for 80 layers. (c) Maximum batch size benchmarked for Rev-ViT Base (B), Large (L), and Huge (H) against their non-reversible counterparts.
[Additional preview fragments from the paper below: lateral fusion strategies over residual streams I1 and I2; Rev-ViT is a reversible adaptation of ViT with exactly matched FLOPs, parameters, and accuracy under identical conditions but a much lower GPU memory footprint; Rev-MViT-B (8.7 GFLOPs, 39M parameters, 66.8 MB/img activation memory, 82.5% top-1 accuracy) is a reversible adaptation of the MViT-B architecture [15].]
Reversible Vision Transformers
  • Preprint
  • File available

February 2023 · 101 Reads

We present Reversible Vision Transformers, a memory efficient architecture design for visual recognition. By decoupling the GPU memory requirement from the depth of the model, Reversible Vision Transformers enable scaling up architectures with efficient memory usage. We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants and benchmark extensively across both model sizes and tasks of image classification, object detection and video classification. Reversible Vision Transformers achieve a reduced memory footprint of up to 15.5x at roughly identical model complexity, parameters and accuracy, demonstrating the promise of reversible vision transformers as an efficient backbone for hardware resource limited training regimes. Finally, we find that the additional computational burden of recomputing activations is more than overcome for deeper models, where throughput can increase up to 2.3x over their non-reversible counterparts. Full code and trained models are available at https://github.com/facebookresearch/slowfast. A simpler, easy to understand and modify version is also available at https://github.com/karttikeya/minREV
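The memory saving comes from a two-stream reversible coupling: each block's inputs can be reconstructed exactly from its outputs, so intermediate activations need not be cached during training. Below is a minimal sketch of that coupling with placeholder attention/MLP sub-blocks; it is not the official code (see https://github.com/facebookresearch/slowfast and https://github.com/karttikeya/minREV for the real implementations):

```python
# Minimal sketch of a two-stream reversible block: the inputs can be recovered
# exactly from the outputs, so activations need not be stored for backprop.
# Placeholder sub-blocks; not the official slowfast / minREV implementation.
import torch
import torch.nn as nn


class ReversibleBlock(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # F and G are the two residual functions (attention and MLP here).
        self.F = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.G = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                               nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x1, x2):
        y1 = x1 + self.F(x2, x2, x2)[0]   # Y1 = X1 + F(X2)
        y2 = x2 + self.G(y1)              # Y2 = X2 + G(Y1)
        return y1, y2

    @torch.no_grad()
    def invert(self, y1, y2):
        # Exact reconstruction of the inputs from the outputs.
        x2 = y2 - self.G(y1)
        x1 = y1 - self.F(x2, x2, x2)[0]
        return x1, x2


if __name__ == "__main__":
    block = ReversibleBlock()
    x1, x2 = torch.randn(2, 64, 256), torch.randn(2, 64, 256)
    y1, y2 = block(x1, x2)
    r1, r2 = block.invert(y1, y2)
    print(torch.allclose(r1, x1, atol=1e-5), torch.allclose(r2, x2, atol=1e-5))
```

During backpropagation each block re-derives its inputs from its outputs instead of reading cached activations, which is why GPU memory becomes independent of depth at the cost of the recomputation the abstract mentions.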


Multiview Compressive Coding for 3D Reconstruction

January 2023 · 56 Reads

A central goal of visual recognition is to understand objects and scenes from a single image. 2D recognition has witnessed tremendous progress thanks to large-scale learning and general-purpose representations. Comparatively, 3D poses new challenges stemming from occlusions not depicted in the image. Prior works try to overcome these by inferring from multiple views or rely on scarce CAD models and category-specific priors which hinder scaling to novel settings. In this work, we explore single-view 3D reconstruction by learning generalizable representations inspired by advances in self-supervised learning. We introduce a simple framework that operates on 3D points of single objects or whole scenes coupled with category-agnostic large-scale training from diverse RGB-D videos. Our model, Multiview Compressive Coding (MCC), learns to compress the input appearance and geometry to predict the 3D structure by querying a 3D-aware decoder. MCC's generality and efficiency allow it to learn from large-scale and diverse data sources with strong generalization to novel objects imagined by DALL·E 2 or captured in-the-wild with an iPhone.
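The abstract describes predicting 3D structure by querying a 3D-aware decoder at 3D points, conditioned on compressed appearance and geometry. The sketch below shows that query pattern in a rough, hypothetical form (latent tokens cross-attended by embeddings of 3D query points, decoded to occupancy and color); it is not the released MCC code, and the heads and shapes are illustrative assumptions:

```python
# Rough sketch of the "query a 3D-aware decoder" pattern the abstract
# describes: latent tokens from the seen RGB-D input are cross-attended by
# embeddings of 3D query points, which are decoded to occupancy and color.
# Hypothetical names and shapes; not the released MCC implementation.
import torch
import torch.nn as nn


class QueryPointDecoder(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.point_embed = nn.Linear(3, dim)          # embed (x, y, z) queries
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.occupancy_head = nn.Linear(dim, 1)       # is the point occupied?
        self.color_head = nn.Linear(dim, 3)           # RGB at the point

    def forward(self, latent_tokens, query_points):
        # latent_tokens: (B, N, dim) encoding of the seen appearance/geometry.
        # query_points:  (B, Q, 3) 3D locations at which to predict structure.
        q = self.point_embed(query_points)
        q, _ = self.cross_attn(q, latent_tokens, latent_tokens)
        return self.occupancy_head(q), self.color_head(q)


if __name__ == "__main__":
    decoder = QueryPointDecoder()
    latents = torch.randn(1, 197, 256)   # e.g. tokens from an image encoder
    queries = torch.rand(1, 1024, 3)     # points sampled in the scene volume
    occ, rgb = decoder(latents, queries)
    print(occ.shape, rgb.shape)          # (1, 1024, 1) (1, 1024, 3)
```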







Citations (27)


... Instead of directly applying it for diffusion learning, our novel design concretizes the unordered tokens into the shape of the 3D input. Specifically, this is achieved by cross-attending (Huang et al., 2024b) the set latent via a sparse point cloud sampled from the input 3D shape, as visualized in Fig. 2. The resulting point-cloud-structured latent space significantly facilitates shape-texture disentanglement and 3D editing. Afterward, a DiT-based 3D decoder (Peebles & Xie, 2023) gradually decodes and upsamples the latent point cloud into a set of dense surfel Gaussians (Huang et al., 2024a), which are rasterized to high-resolution renderings to supervise 3D VAE training. ...

Reference:

GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation
PointInfinity: Resolution-Invariant Point Diffusion Models
  • Citing Conference Paper
  • June 2024

... Both deep learning architectures are extensively employed in the medical image segmentation area and have achieved satisfactory results. In addition to classic networks, recent models like the Segment Anything Model (SAM) [18] and SAM 2 [19], which are trained with billions of annotated images, have also demonstrated generalizable performance in image segmentation tasks. ...

SAM 2: Segment Anything in Images and Videos
  • Citing Preprint
  • August 2024

... Earlier works [10,19,25,37,59,62,66] in this field typically predict only the geometry and train on small datasets [5,50], which limits their generalization ability. Recently, larger 3D datasets [11,42] have been collected, unlocking the potential to train feedforward 3D models at scale [24,27,61]. These models exhibit great generalization ability to unseen images, and excel at producing reconstructions that tightly align with the observed cues in the input image. ...

Multiview Compressive Coding for 3D Reconstruction
  • Citing Conference Paper
  • June 2023

... We used an off-the-shelf MViTv2-Huge [41] in the Detectron2 framework [42], trained on MS-COCO [43], as the 2D object detector, and Metric3DV2-giant [9] for metric depth estimation. We use our pseudo-labels to train the MonoDETR [19] monocular 3D object detector, using the AdamW [44] optimiser with learning rate and weight decay equal to 0.0002 and 0.0001, respectively, while keeping other hyper-parameters as in [19]. ...

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
  • Citing Conference Paper
  • June 2022
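The excerpt above spells out a concrete optimizer configuration (AdamW with learning rate 0.0002 and weight decay 0.0001). For reference, such a setup typically looks as follows in PyTorch; the model here is a stand-in, not the actual MonoDETR detector:

```python
# Optimizer setup matching the hyper-parameters quoted above (AdamW,
# lr = 0.0002, weight decay = 0.0001). `model` is a placeholder module,
# not the MonoDETR detector itself.
import torch
import torch.nn as nn

model = nn.Linear(256, 256)  # stand-in for the detector being trained
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=1e-4)

# One illustrative training step.
loss = model(torch.randn(4, 256)).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```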

... Figure 4 presents saliency maps highlighting regions influencing the stenoses predictions. Additional performance metrics using the MeMViT architecture [30] instead of the R(2+1)D are presented in Supplemental Materials Table S13. ...

MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition
  • Citing Conference Paper
  • June 2022

... Deep learning, and particularly convolutional neural networks (CNNs), has gained traction in radar localization due to their strong performance in extracting detailed features [17,18]. While CNNs can capture informative features for estimation, their focus on localized relationships limits their ability to fully capture global dependencies across arrays, which is critical in FDA-based localization [19]. ...

A ConvNet for the 2020s
  • Citing Conference Paper
  • June 2022

... Some approaches [11,13,42] extend MAE [12] to the video field. MaskFeat [43] studies five different types of features in reconstructed masked regions. Hiera [15] combines MAE pre-training with a hierarchical transformer architecture in a simple way. ...

Masked Feature Prediction for Self-Supervised Visual Pre-Training
  • Citing Conference Paper
  • June 2022

... Several techniques have been developed to optimize GPU memory utilization during LLM training. Reversible subnetworks (Liao, Tan, and Monz 2023; Mangalam et al. 2022; Zhao et al. 2024a; Gomez et al. 2017b; Kitaev, Kaiser, and Levskaya 2020) minimize activation memory by recalculating activations during backpropagation. Gradient checkpointing (Chen et al. 2016) improves memory efficiency by discarding and later reconstructing some intermediate activations through an additional forward pass. ...

Reversible Vision Transformers
  • Citing Conference Paper
  • June 2022
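The excerpt above contrasts reversible subnetworks with gradient checkpointing, which discards intermediate activations and reconstructs them through an extra forward pass during backpropagation. A minimal illustration using PyTorch's standard torch.utils.checkpoint utility (the wrapped block is just a stand-in):

```python
# Minimal illustration of gradient checkpointing as described in the excerpt
# above: activations inside `block` are not stored during the forward pass and
# are rebuilt by an extra forward pass during backpropagation. Standard
# torch.utils.checkpoint usage; the block itself is just a stand-in.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))
x = torch.randn(8, 256, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)  # intermediate activations discarded
y.sum().backward()                             # recomputed here via a second forward
print(x.grad.shape)                            # torch.Size([8, 256])
```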

... [8,9] Convolutional Neural Networks (CNNs), as introduced by LeCun et al. in 1989 [10] and popularized by Krizhevsky et al. in 2012, have brought transformative advancements to medical imaging [10,11]. These networks have revolutionized the field by harnessing their capacity to autonomously learn intricate data representations. The renaissance of CNNs has yielded remarkable progress across diverse medical imaging modalities, including Radiography [12], Endoscopy [13], and Computed Tomography (CT). ...

A ConvNet for the 2020s

... This method masks a certain percentage of the image, then uses an encoder to learn features of the visible patches and a decoder to reconstruct the original image, continuously optimizing the model's ability to learn general features. The reconstruction target takes many forms, such as pixels [10,11], discretized tokens [12,13], gradient histograms [14], features generated by other networks [15][16][17][18], and features extracted by a teacher network [19]. Currently, the last of these offers the best performance and efficiency on natural images. ...

Masked Feature Prediction for Self-Supervised Visual Pre-Training
  • Citing Preprint
  • December 2021
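The excerpt above summarizes the masked-image-modeling recipe: mask a fraction of the image, encode only the visible patches, and reconstruct a target for the masked ones. A bare-bones sketch of that recipe with hypothetical modules and shapes (not the code of any of the cited papers):

```python
# Bare-bones sketch of masked autoencoding as summarized above: drop a random
# fraction of patch tokens, encode only the visible ones, then let a light
# decoder fill in learnable mask tokens and reconstruct the masked patches.
# Hypothetical shapes and modules, not any specific paper's code.
import torch
import torch.nn as nn


class TinyMaskedAutoencoder(nn.Module):
    def __init__(self, dim=256, patch_pixels=768, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        layer = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.TransformerEncoder(layer, num_layers=1)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pixel_head = nn.Linear(dim, patch_pixels)   # reconstruct patch pixels

    def forward(self, patches):
        B, N, D = patches.shape
        num_keep = int(N * (1 - self.mask_ratio))
        order = torch.rand(B, N, device=patches.device).argsort(dim=1)
        ids_keep, ids_mask = order[:, :num_keep], order[:, num_keep:]
        visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

        latent = self.encoder(visible)                   # encode visible patches only
        masked = self.mask_token.expand(B, N - num_keep, D)
        decoded = self.decoder(torch.cat([latent, masked], dim=1))
        pred = self.pixel_head(decoded[:, num_keep:])    # predictions at masked slots
        return pred, ids_mask                            # compare against true patches


if __name__ == "__main__":
    model = TinyMaskedAutoencoder()
    pred, ids_mask = model(torch.randn(2, 196, 256))     # 14x14 patch tokens
    print(pred.shape)                                    # torch.Size([2, 147, 768])
```

Swapping the pixel head's target for discretized tokens, gradient histograms, or teacher features yields the other reconstruction targets the excerpt lists.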