Evan Shelhamer’s research while affiliated with University of California, Berkeley and other places


Publications (38)


Simpler Fast Vision Transformers with a Jumbo CLS Token
  • Preprint
  • File available

February 2025 · 5 Reads

Anthony Fuller · Yousef Yassin · Daniel G. Kyrollos · [...]

We introduce a simple enhancement to the global processing of vision transformers (ViTs) to improve accuracy while maintaining throughput. Our approach, Jumbo, creates a wider CLS token, which is split to match the patch token width before attention, processed with self-attention, and reassembled. After attention, Jumbo applies a dedicated, wider FFN to this token. Jumbo significantly improves over ViT+Registers on ImageNet-1K at high speeds (by 3.2% for ViT-tiny and 13.5% for ViT-nano); these Jumbo models even outperform specialized compute-efficient models while preserving the architectural advantages of plain ViTs. Although Jumbo sees no gains for ViT-small on ImageNet-1K, it gains 3.4% on ImageNet-21K over ViT+Registers. Both findings indicate that Jumbo is most helpful when the ViT is otherwise too narrow for the task. Finally, we show that Jumbo can be easily adapted to excel on data beyond images, e.g., time series.
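The split-and-reassemble step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the list-based token representation, and the sizes (patch width 4, a CLS token three times wider) are assumptions chosen for illustration, and the attention and wider FFN are omitted.

```python
def split_jumbo(cls_token, patch_width):
    """Split a wide CLS vector into chunks that each match the patch token width."""
    if len(cls_token) % patch_width != 0:
        raise ValueError("jumbo width must be a multiple of the patch width")
    return [cls_token[i:i + patch_width]
            for i in range(0, len(cls_token), patch_width)]


def reassemble_jumbo(chunks):
    """Concatenate the chunks back into a single wide CLS vector."""
    return [value for chunk in chunks for value in chunk]


# Hypothetical sizes: patch tokens of width 4, jumbo CLS token 3x wider.
patch_width = 4
jumbo_cls = [float(i) for i in range(12)]

chunks = split_jumbo(jumbo_cls, patch_width)
# Self-attention over the patch tokens and these chunks would run here
# (omitted in this sketch), followed by the dedicated wider FFN.
restored = reassemble_jumbo(chunks)
```

Because each chunk has the same width as a patch token, the chunks can pass through an unmodified attention layer alongside the patches.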


Galileo: Learning Global and Local Features in Pretrained Remote Sensing Models

February 2025 · 8 Reads

From crop mapping to flood detection, machine learning in remote sensing has a wide range of societally beneficial applications. The commonalities between remote sensing data in these applications present an opportunity for pretrained machine learning models tailored to remote sensing to reduce the labeled data and effort required to solve individual tasks. However, such models must be: (i) flexible enough to ingest input data of varying sensor modalities and shapes (i.e., of varying spatial and temporal dimensions), and (ii) able to model Earth surface phenomena of varying scales and types. To fill this gap, we present Galileo, a family of pretrained remote sensing models designed to flexibly process multimodal remote sensing data. We also introduce a novel and highly effective self-supervised learning approach to learn both large- and small-scale features, a challenge not addressed by previous models. Our Galileo models obtain state-of-the-art results across diverse remote sensing tasks.


Figure 1: Two-step ARS for L∞-bounded attacks. Step M1 adds noise to input X and post-processes the
Figure 2: Certified Test Accuracy on CIFAR-10 (20kBG). (a)-(c) show the effect of dimensionality for (a) no background / k = 32, (b) k = 48, and (c) k = 64 for constant σ = 0.75. (d)-(f) show the effect of noise for (d) σ = 0.12, (e) σ = 0.5, and (f) σ = 1.5 with dimensionality fixed to k = 48. (g)-(i) show the effect of noise for (g) σ = 0.12, (h) σ = 0.5, and (i) σ = 1.5 with dimensionality fixed to k = 64. These results are in our 20kBG setting, where a CIFAR-10 image is placed randomly along the edges of a background image. Each line is the mean, and the shaded interval covers ± one standard deviation across seeds.
Figure 3: ARS Masks on CIFAR-10 (20kBG) select the task-relevant input over the distractor background.
Figure 5: ARS masks are localized and input specific.
Figure 6: ImageNet σ = 0.5.


Adaptive Randomized Smoothing: Certifying Multi-Step Defences against Adversarial Examples

June 2024 · 27 Reads

We propose Adaptive Randomized Smoothing (ARS) to certify the predictions of our test-time adaptive models against adversarial examples. ARS extends the analysis of randomized smoothing using f-Differential Privacy to certify the adaptive composition of multiple steps. For the first time, our theory covers the sound adaptive composition of general and high-dimensional functions of noisy input. We instantiate ARS on deep image classification to certify predictions against adversarial examples of bounded L∞ norm. In the L∞ threat model, our flexibility enables adaptation through high-dimensional input-dependent masking. We design adaptivity benchmarks, based on CIFAR-10 and CelebA, and show that ARS improves accuracy by 2 to 5% points. On ImageNet, ARS improves accuracy by 1 to 3% points over standard RS without adaptivity.
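For readers unfamiliar with the standard randomized smoothing (RS) baseline that ARS is compared against, the sketch below shows its prediction step only: a majority vote over Gaussian-perturbed copies of the input. It does not implement ARS, its adaptive masking, or the certification step. The toy classifier, the function names, and the parameter values are all assumptions made for illustration.

```python
import random
from collections import Counter


def smoothed_predict(classifier, x, sigma, num_samples, rng):
    """Return the class that wins a majority vote over noisy copies of x."""
    votes = Counter()
    for _ in range(num_samples):
        # Add independent Gaussian noise to every input coordinate.
        noisy = [value + rng.gauss(0.0, sigma) for value in x]
        votes[classifier(noisy)] += 1
    return votes.most_common(1)[0][0]


def toy_classifier(x):
    """Hypothetical stand-in for a trained model: the sign of the coordinate sum."""
    return 1 if sum(x) > 0 else 0


rng = random.Random(0)  # fixed seed so the sketch is repeatable
prediction = smoothed_predict(toy_classifier, [2.0, 2.0],
                              sigma=0.5, num_samples=200, rng=rng)
```

A certified radius would additionally require a statistical lower bound on the winning class's vote share, which this sketch leaves out.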




Seasoning Model Soups for Robustness to Adversarial and Natural Distribution Shifts

February 2023 · 7 Reads

Adversarial training is widely used to make classifiers robust to a specific threat or adversary, such as ℓ_p-norm bounded perturbations of a given p-norm. However, existing methods for training classifiers robust to multiple threats require knowledge of all attacks during training and remain vulnerable to unseen distribution shifts. In this work, we describe how to obtain adversarially-robust model soups (i.e., linear combinations of parameters) that smoothly trade off robustness to different ℓ_p-norm bounded adversaries. We demonstrate that such soups allow us to control the type and level of robustness, and can achieve robustness to all threats without jointly training on all of them. In some cases, the resulting model soups are more robust to a given ℓ_p-norm adversary than the constituent model specialized against that same adversary. Finally, we show that adversarially-robust model soups can be a viable tool to adapt to distribution shifts from a few examples.
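The phrase "linear combinations of parameters" can be made concrete with a small sketch. It is an illustration only, not the paper's code: the parameter names, the dict-of-lists representation, and the mixing weights (chosen here to sum to 1) are assumptions.

```python
def make_soup(models, weights):
    """Combine parameter dicts entry by entry using the given mixing weights."""
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per model")
    soup = {}
    for name in models[0]:
        length = len(models[0][name])
        soup[name] = [
            sum(w * model[name][i] for model, w in zip(models, weights))
            for i in range(length)
        ]
    return soup


# Hypothetical ingredients: two models with identical architectures,
# each imagined as trained against a different norm-bounded adversary.
model_linf = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
model_l2 = {"layer.weight": [3.0, 6.0], "layer.bias": [1.0]}

soup = make_soup([model_linf, model_l2], [0.25, 0.75])
```

Changing the weights moves the soup between its ingredients, which is the knob the abstract describes for trading off robustness to different adversaries.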


Object Discovery and Representation Networks

October 2022 · 12 Reads · 39 Citations

Lecture Notes in Computer Science

The promise of self-supervised learning (SSL) is to leverage large amounts of unlabeled data to solve complex tasks. While there has been excellent progress with simple, image-level learning, recent methods have shown the advantage of including knowledge of image structure. However, by introducing hand-crafted image segmentations to define regions of interest, or specialized augmentation strategies, these methods sacrifice the simplicity and generality that make SSL so powerful. Instead, we propose a self-supervised learning paradigm that discovers this image structure by itself. Our method, Odin, couples object discovery and representation networks to discover meaningful image segmentations without any supervision. The resulting learning paradigm is simpler, less brittle, and more general, and achieves state-of-the-art transfer learning results for object detection and instance segmentation on COCO, and semantic segmentation on PASCAL and Cityscapes, while strongly surpassing supervised pre-training for video segmentation on DAVIS.


Figure 1: Segmentation accuracy after pre-training ViT-B models on ALIGN or JFT. Supervision consistently outperforms self-supervision in both accuracy (up) and efficiency (left) for this image segmentation evaluation, with MAE closely following supervised pre-training. Best viewed in color.
Figure 3: Per step FLOP count of various supervised and self-supervised pre-training methods.
Figures A.6.4 and A.6.5 show the FLOP efficiency of each method, measured on the object detection task. We obtain similar results to segmentation, with the exception that MAE overtakes one or two supervised methods, depending on the pre-training dataset. MAE performs strongly both in absolute terms and with respect to its compute efficiency.

Where Should I Spend My FLOPS? Efficiency Evaluations of Visual Pre-training Methods

September 2022 · 75 Reads

Self-supervised methods have achieved remarkable success in transfer learning, often achieving the same or better accuracy than supervised pre-training. Most prior work has done so by increasing pre-training computation by adding complex data augmentation, multiple views, or lengthy training schedules. In this work, we investigate a related, but orthogonal question: given a fixed FLOP budget, what are the best datasets, models, and (self-)supervised training methods for obtaining high accuracy on representative visual tasks? Given the availability of large datasets, this setting is often more relevant for both academic and industry labs alike. We examine five large-scale datasets (JFT-300M, ALIGN, ImageNet-1K, ImageNet-21K, and COCO) and six pre-training methods (CLIP, DINO, SimCLR, BYOL, Masked Autoencoding, and supervised). In a like-for-like fashion, we characterize their FLOP and CO2 footprints, relative to their accuracy when transferred to a canonical image segmentation task. Our analysis reveals strong disparities in the computational efficiency of pre-training methods and their dependence on dataset quality. In particular, our results call into question the commonly-held assumption that self-supervised methods inherently scale to large, uncurated data. We therefore advocate for (1) paying closer attention to dataset curation and (2) reporting of accuracies in context of the total computational cost.


Back to the Source: Diffusion-Driven Test-Time Adaptation

July 2022 · 26 Reads

Test-time adaptation harnesses test inputs to improve the accuracy of a model trained on source data when tested on shifted target data. Existing methods update the source model by (re-)training on each target domain. While effective, re-training is sensitive to the amount and order of the data and the hyperparameters for optimization. We instead update the target data, by projecting all test inputs toward the source domain with a generative diffusion model. Our diffusion-driven adaptation method, DDA, shares its models for classification and generation across all domains. Both models are trained on the source domain, then fixed during testing. We augment diffusion with image guidance and self-ensembling to automatically decide how much to adapt. Input adaptation by DDA is more robust than prior model adaptation approaches across a variety of corruptions, architectures, and data regimes on the ImageNet-C benchmark. With its input-wise updates, DDA succeeds where model adaptation degrades on too little data in small batches, dependent data in non-uniform order, or mixed data with multiple corruptions.


Exploring Simple and Transferable Recognition-Aware Image Processing

June 2022 · 291 Reads · 24 Citations

IEEE Transactions on Pattern Analysis and Machine Intelligence

Recent progress in image recognition has stimulated the deployment of vision systems at an unprecedented scale. As a result, visual data are now often consumed not only by humans but also by machines. Existing image processing methods only optimize for better human perception, yet the resulting images may not be accurately recognized by machines. This can be undesirable, e.g., the images can be improperly handled by search engines or recommendation systems. In this work, we examine simple approaches to improve machine recognition of processed images: optimizing the recognition loss directly on the image processing network or through an intermediate input transformation model. Interestingly, the processing model's ability to enhance recognition quality can transfer when evaluated on models of different architectures, recognized categories, tasks and training datasets. This makes the methods applicable even when we do not have the knowledge of future recognition models, e.g., when uploading processed images to the Internet. We conduct experiments on multiple image processing tasks paired with ImageNet classification and PASCAL VOC detection as recognition tasks. With these simple yet effective methods, substantial accuracy gain can be achieved with strong transferability and minimal image quality loss. Through a user study we further show that the accuracy gain can transfer to a black-box cloud model. Finally, we try to explain this transferability phenomenon by demonstrating the similarities of different models' decision boundaries. Code is available at https://github.com/liuzhuang13/Transferable_RA.


Citations (23)


... Fully Convolutional Network (FCN): FCN eliminates traditional fully connected layers, making it capable of handling images of any size and directly outputting segmentation masks. With upsampling and skip connections, FCN effectively fuses low-level texture information with high-level semantic data, significantly improving segmentation accuracy and efficiency (Shelhamer et al. 2016). UNet: Known for its symmetrical encoder-decoder structure, UNet's encoders extract image features while decoders use skip connections to combine high-resolution feature maps with upsampled feature maps. ...

Reference:

KDANet: a farmland extraction network using band selection and dual attention fusion – a case study of paddy fields and irrigated land in Qingtongxia, China
Fully Convolutional Networks for Semantic Segmentation
  • Citing Preprint
  • May 2016

... The soup's weights are averaged across the included ingredients at each step. This technique has since gained popularity for fine-tuning Large Language Models (LLMs) [33] and has been applied in domains such as Adversarial Networks [34], [35], LiDAR Segmentation [36], and Image Classification [37], [38]. ...

Seasoning Model Soups for Robustness to Adversarial and Natural Distribution Shifts
  • Citing Conference Paper
  • June 2023

... Beyond adjusting model parameters, image restoration offers an alternative for mitigating distribution shifts. Diffusion models have been explored for noise removal, effectively filtering away perturbations [31]. Instead of shifting the controller to align with a known new distribution (as in domain adaptation), our method enables controller generalization to unseen disturbances by shifting the inputs closer to the initial training distribution. ...

Back to the Source: Diffusion-Driven Adaptation to Test-Time Corruption
  • Citing Conference Paper
  • June 2023

... Despite the success of our framework, there are still challenges yet to be explored. Given the remarkable performance of SACL with the assistance of image-computable pseudo masks, one challenge is to develop a learnable module for pseudo mask generation [78]. An end-to-end learning paradigm should have the potential to adaptively segment images, hence bringing simplicity and improving generalization. ...

Object Discovery and Representation Networks
  • Citing Chapter
  • October 2022

Lecture Notes in Computer Science

... Thanks to the recent achievements in task-driven image quality enhancement (IQE) models like ESTR [1], the image enhancement model and the visual recognition model can mutually enhance each other's quantitation while producing high-quality processed images that are perceivable by our human vision systems. However, existing task-driven IQE models tend to overlook an underlying fact: different levels of vision tasks have varying and sometimes conflicting requirements of image features. ...

Exploring Simple and Transferable Recognition-Aware Image Processing

IEEE Transactions on Pattern Analysis and Machine Intelligence

... In these metric-based methods, Prototypical Networks [19] play a crucial role, with many metric-based methods considered variants of prototypical networks, collectively referred to as prototype-based methods. These methods typically assume the existence of a feature space where sample features from the same class cluster around a class prototype [22, 24-26]. Generally, these methods first construct prototypes for each class of data using the support set from the few-shot classification task, and then classify by calculating the distance between the query sample features and the class prototypes. ...

Infinite mixture prototypes for few-shot learning
  • Citing Article
  • January 2019
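The two-step procedure described in the citing passage above (build one prototype per class from the support set, then classify a query by its nearest prototype) can be sketched as follows. This is the generic single-prototype scheme, not the infinite mixture variant of the cited paper. The feature vectors, class labels, and function names are hypothetical, and the mean and Euclidean distance are assumed as the prototype and metric.

```python
import math


def class_prototypes(support):
    """Average the support features of each class into one prototype vector."""
    prototypes = {}
    for label, features in support.items():
        dim = len(features[0])
        prototypes[label] = [
            sum(feature[d] for feature in features) / len(features)
            for d in range(dim)
        ]
    return prototypes


def classify(query, prototypes):
    """Assign the query to the class whose prototype is closest (Euclidean)."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Hypothetical 2-D features for a 2-way, 2-shot task.
support = {
    "cat": [[0.0, 0.0], [2.0, 0.0]],
    "dog": [[10.0, 10.0], [12.0, 10.0]],
}
prototypes = class_prototypes(support)
label = classify([1.5, 0.5], prototypes)
```

In practice the features would come from a learned embedding network rather than being given directly.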

... Self-Attention Block and Classifier: Inspired by Perceiver IO and to further improve the predictive capacity of our model, a self-attention module was added after the cross-attention block [56]. The self-attention matrix of the output of the cross-attention block was calculated, following Equation (1) as well. Finally, a three-layer fully connected neural network was used to complete the functional phosphorylation site prediction. ...

Perceiver IO: A General Architecture for Structured Inputs & Outputs
  • Citing Preprint
  • July 2021

... We argue that our approach could serve a large range of practical applications. For example, determining the most suitable model for a given scenario within a fleet of lightweight AI models able to analyze the environment, as proposed in previous works [8,30,35], detecting significant domain transitions to collect new data for domain adaptation methods that rely on buffers or adaptable internal statistics [10,15,49-52] or activating adaptation mechanisms based on clustering [2,45,53,54]. ...

Fully Test-time Adaptation by Entropy Minimization
  • Citing Preprint
  • June 2020