Piotr Bojanowski’s research while affiliated with Meta and other places


Publications (67)


You Don't Need Data-Augmentation in Self-Supervised Learning
  • Preprint

June 2024 · 15 Reads

Maxime Oquab · Marc Szafraniec · [...] · Piotr Bojanowski

Self-supervised learning (SSL) with Joint-Embedding Architectures (JEA) has led to outstanding performance. All instantiations of this paradigm were trained using strong, well-established hand-crafted data augmentations, leading to the general belief that they are required for the proper training and performance of such models. On the other hand, generative reconstruction-based models such as BEiT and MAE, or Joint-Embedding Predictive Architectures such as I-JEPA, have shown strong performance without using any data augmentation except masking. In this work, we challenge the importance of invariance and data augmentation in JEAs at scale. By running a case study on a recent SSL foundation model, DINOv2, we show that strong image representations can be obtained with JEAs using only cropping without resizing, provided the training data is large enough, reaching state-of-the-art results with the least amount of augmentation in the literature. Through this study, we also discuss the impact of compute constraints on the outcomes of experimental deep learning research, showing that they can lead to very different conclusions.
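As a rough illustration of the crop-only setup discussed in this abstract, the sketch below contrasts a typical hand-crafted JEA augmentation stack with a cropping-only variant. It uses torchvision; all parameter values are illustrative assumptions, not the paper's exact configuration.

    # Typical hand-crafted SSL view generator vs. a crop-only variant.
    # Values are illustrative, not the paper's settings.
    from torchvision import transforms

    # Common JEA-style augmentations (SimCLR/DINO-like).
    full_augmentation = transforms.Compose([
        transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),
        transforms.RandomHorizontalFlip(),
        transforms.ColorJitter(0.4, 0.4, 0.2, 0.1),
        transforms.RandomGrayscale(p=0.2),
        transforms.GaussianBlur(kernel_size=23),
        transforms.ToTensor(),
    ])

    # Crop-only variant: fixed-size crops, no resizing, no photometric changes.
    crop_only = transforms.Compose([
        transforms.RandomCrop(224, pad_if_needed=True),
        transforms.ToTensor(),
    ])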


Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach
  • Preprint
  • File available

May 2024 · 33 Reads

Self-supervised features are the cornerstone of modern machine learning systems. They are typically pre-trained on data collections whose construction and curation require extensive human effort. This manual process has limitations similar to those encountered in supervised learning: for example, the crowd-sourced selection of data is costly and time-consuming, preventing scaling of the dataset size. In this work, we consider the problem of automatic curation of high-quality datasets for self-supervised pre-training. We posit that such datasets should be large, diverse, and balanced, and propose a clustering-based approach for building datasets satisfying all these criteria. Our method involves successive and hierarchical applications of k-means on a large and diverse data repository to obtain clusters that distribute uniformly among data concepts, followed by a hierarchical, balanced sampling step from these clusters. Extensive experiments on three data domains, including web-based images, satellite images, and text, show that features trained on our automatically curated datasets outperform those trained on uncurated data, while being on par with or better than ones trained on manually curated data.
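A minimal sketch of the clustering-based curation idea described above: repeated k-means builds a cluster hierarchy, then balanced sampling draws evenly from the top-level clusters. Function names and parameters are illustrative assumptions, not the paper's exact algorithm.

    import numpy as np
    from sklearn.cluster import KMeans

    def hierarchical_kmeans(features, ks):
        """Apply k-means repeatedly: each level clusters the centroids of the
        previous level, yielding a fine-to-coarse hierarchy of clusters."""
        levels, points = [], features
        for k in ks:                      # e.g., ks = [10_000, 1_000, 100]
            km = KMeans(n_clusters=k, n_init=10).fit(points)
            levels.append(km)
            points = km.cluster_centers_  # next level clusters these centroids
        return levels

    def balanced_sample(top_level_assignments, per_cluster, seed=0):
        """Draw (up to) the same number of points from every top-level cluster.
        Mapping each original point to its top-level cluster requires chaining
        the per-level assignments from hierarchical_kmeans."""
        rng = np.random.default_rng(seed)
        chosen = []
        for c in np.unique(top_level_assignments):
            idx = np.flatnonzero(top_level_assignments == c)
            take = min(per_cluster, len(idx))
            chosen.extend(rng.choice(idx, size=take, replace=False))
        return np.asarray(chosen)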



Figure 2. Visualizations of the morphology feature space automatically discovered by DINO: UMAP plots of single-cell features in the Allen Institute WTC-11 hiPSC dataset, colored by tagged cell structure and cell-cycle stage.
Figure 3. Quantitative evaluation of feature representations in supervised downstream tasks: cell-cycle stage classification (WTC-11), cell line and protein localization classification (Human Protein Atlas, Kaggle challenges), mechanism-of-action prediction (LINCS Cell Painting), and compound bioactivity prediction (BBBC036).
Figure 4. Image-based profiling of cellular state under different conditions: cell line similarity from mRNA levels vs. DINO features, canonical correlation analysis, protein localization groupings, and pseudotime analysis of cell-cycle stages.
Figure 5. Single-cell heterogeneity of protein localization patterns in the Human Protein Atlas, illustrated with the genes EFHC2, NUSAP, and PSTPIP2.
Figure 6. Vision transformers encode biologically meaningful features of subcellular structures: attention heads overlap with fluorescent channels, and principal components of patch tokens discriminate cell states.
Unbiased single-cell morphology with self-supervised vision transformers

June 2023 · 326 Reads · 9 Citations

Accurately quantifying cellular morphology at scale could substantially empower existing single-cell approaches. However, measuring cell morphology remains an active field of research, which has inspired multiple computer vision algorithms over the years. Here, we show that DINO, a vision-transformer-based, self-supervised algorithm, has a remarkable ability to learn rich representations of cellular morphology without manual annotations or any other type of supervision. We evaluate DINO on a wide variety of tasks across three publicly available imaging datasets of diverse specifications and biological focus. We find that DINO encodes meaningful features of cellular morphology at multiple scales, from subcellular and single-cell resolution to multi-cellular and aggregated experimental groups. Importantly, DINO successfully uncovers a hierarchy of biological and technical factors of variation in imaging datasets. The results show that DINO can support the study of unknown biological variation, including single-cell heterogeneity and relationships between samples, making it an excellent tool for image-based biological discovery.
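The evaluation protocol visible in the figure captions above (identical MLP heads trained on different frozen feature sets) can be sketched as follows; the architecture and hyperparameters are illustrative assumptions, not the paper's exact setup.

    # Identical MLP classification heads on precomputed, frozen features.
    import torch
    import torch.nn as nn

    def make_mlp(in_dim, n_classes, hidden=512):
        return nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def train_head(features, labels, n_classes, epochs=10):
        """features: (N, D) tensor from a frozen backbone (e.g., DINO);
        labels: (N,) integer class labels (e.g., cell-cycle stages)."""
        head = make_mlp(features.shape[1], n_classes)
        opt = torch.optim.Adam(head.parameters(), lr=1e-3)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss = loss_fn(head(features), labels)
            loss.backward()
            opt.step()
        return head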




Think Before You Act: Unified Policy for Interleaving Language Reasoning with Actions

April 2023 · 10 Reads

The success of transformer models trained with a language modeling objective brings a promising opportunity to the reinforcement learning framework. Decision Transformer is a step in this direction, showing how to train transformers with a similar next-step prediction objective on offline data. Another important development in this area is the recent emergence of large-scale datasets collected from the internet, such as those composed of tutorial videos with captions in which people describe what they are doing. To take advantage of this language component, we propose a novel method for unifying language reasoning with actions in a single policy. Specifically, we augment a transformer policy with word outputs, so it can generate textual captions interleaved with actions. When tested on the most challenging task in BabyAI, with captions describing the next subgoals, our reasoning policy consistently outperforms the caption-free baseline.
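One way to read the unified-vocabulary idea above is as a single transformer head whose output space is the union of action ids and word tokens, so the policy can emit captions interleaved with actions. The sketch below is a minimal PyTorch version; all sizes and names are illustrative assumptions, not the paper's architecture.

    import torch
    import torch.nn as nn

    N_ACTIONS, N_WORDS, D = 7, 1000, 256   # hypothetical sizes
    VOCAB = N_ACTIONS + N_WORDS            # actions and words share one space

    class UnifiedPolicy(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, D)
            layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
            self.trunk = nn.TransformerEncoder(layer, num_layers=4)
            self.head = nn.Linear(D, VOCAB)   # next token: an action or a word

        def forward(self, tokens):            # tokens: (B, T) interleaved history
            causal = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
            h = self.trunk(self.embed(tokens), mask=causal)
            return self.head(h[:, -1])        # logits over actions and words

    policy = UnifiedPolicy()
    logits = policy(torch.randint(0, VOCAB, (1, 12)))  # shape: (1, VOCAB)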


DINOv2: Learning Robust Visual Features without Supervision

April 2023 · 621 Reads · 27 Citations

The recent breakthroughs in natural language processing, driven by model pretraining on large quantities of data, have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset, instead of relying on uncurated data as is typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021), on most benchmarks at the image and pixel levels.
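In practice, the distilled models are commonly used as frozen, all-purpose feature extractors. A minimal sketch, assuming the torch.hub entry point published by the DINOv2 repository (verify the entry name for your environment):

    import torch

    # Hub entry from facebookresearch/dinov2; downloads weights on first use.
    model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
    model.eval()

    # Placeholder batch; real images should be resized to a multiple of the
    # 14-pixel patch size and ImageNet-normalized.
    images = torch.randn(2, 3, 224, 224)
    with torch.no_grad():
        feats = model(images)   # (2, 384) global features for ViT-S/14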


Sub-meter resolution canopy height maps using self-supervised learning and a vision transformer trained on Aerial and GEDI Lidar

April 2023 · 1,514 Reads

Vegetation structure mapping is critical for understanding the global carbon cycle and monitoring nature-based approaches to climate adaptation and mitigation. Repeat measurements of these data allow for the observation of deforestation or degradation of existing forests, natural forest regeneration, and the implementation of sustainable agricultural practices like agroforestry. Assessments of tree canopy height and crown projected area at high spatial resolution are also important for monitoring carbon fluxes and assessing tree-based land uses, since forest structures can be highly spatially heterogeneous, especially in agroforestry systems. Very high resolution satellite imagery (less than one meter (1m) ground sample distance) makes it possible to extract information at the tree level while allowing monitoring at a very large scale. This paper presents the first high-resolution canopy height map concurrently produced for multiple sub-national jurisdictions. Specifically, we produce canopy height maps for the states of California and São Paulo at sub-meter resolution, a significant improvement over the ten meter (10m) resolution of previous Sentinel/GEDI-based worldwide maps of canopy height. The maps are generated by applying a vision transformer to features extracted with a self-supervised model from Maxar imagery acquired between 2017 and 2020, trained against aerial lidar and GEDI observations. We evaluate the proposed maps with set-aside validation lidar data, as well as by comparing them with other remotely sensed maps and field-collected data, and find that our model produces an average Mean Absolute Error (MAE) of 3.0 meters within set-aside validation areas.


Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

January 2023 · 199 Reads · 3 Citations

This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) predict several target blocks in the image, (b) sample target blocks with sufficiently large scale (occupying 15%-20% of the image), and (c) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/16 on ImageNet using 32 A100 GPUs in under 38 hours to achieve strong downstream performance across a wide range of tasks requiring various levels of abstraction, from linear classification to object counting and depth prediction.
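The multi-block masking strategy described above can be sketched as sampling several target blocks, each covering roughly 15-20% of the image, on a patch grid. The grid size, block count, and aspect-ratio range below are illustrative assumptions, not the paper's exact values.

    import random

    def sample_block(grid=14, scale=(0.15, 0.20), aspect=(0.75, 1.5)):
        """Return (top, left, h, w) of one target block on a grid x grid patch map."""
        area = grid * grid * random.uniform(*scale)   # 15-20% of the patches
        ar = random.uniform(*aspect)                  # aspect ratio h / w
        h = max(1, min(grid, round((area * ar) ** 0.5)))
        w = max(1, min(grid, round((area / ar) ** 0.5)))
        top = random.randrange(grid - h + 1)
        left = random.randrange(grid - w + 1)
        return top, left, h, w

    # Predict the representations of several target blocks from one context block.
    targets = [sample_block() for _ in range(4)]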


Citations (44)


... On the contrary, contrastive techniques like SimCLR or MoCo (He et al., 2020) involve generating positive (similar) and negative (dissimilar) pairs by applying deformations to input images, such that the models are trained to maximize agreement between positive pairs and minimize it between negative pairs (see the sketch after this reference). Self-distillation, a technique where the model improves itself by learning from its own soft (i.e., probabilistic) predictions, showed promising results with ViTs (Caron et al., 2021; Oquab et al., 2023), with applications in remote sensing (Tolan et al., 2024). At the embedding and classifier level, pretrained models can be used as feature extractors to extract embeddings, which are then used as fixed predictors in any classification or regression model. ...

Reference:

Introduction to deep learning methods for multi-species predictions
Very high resolution canopy height maps from RGB imagery using self-supervised vision transformer and convolutional decoder trained on aerial lidar

Remote Sensing of Environment
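The contrastive objective mentioned in the excerpt above can be written as a SimCLR-style NT-Xent loss; the sketch below is a minimal PyTorch version with an assumed temperature, not any specific paper's implementation.

    import torch
    import torch.nn.functional as F

    def nt_xent(z1, z2, tau=0.1):
        """z1, z2: (B, D) embeddings of two augmented views of the same batch."""
        z = F.normalize(torch.cat([z1, z2]), dim=1)   # (2B, D) unit vectors
        sim = z @ z.t() / tau                          # pairwise cosine similarities
        sim.fill_diagonal_(float("-inf"))              # exclude self-similarity
        B = z1.shape[0]
        # The positive for each row is the other view of the same image.
        targets = torch.cat([torch.arange(B, 2 * B), torch.arange(B)])
        return F.cross_entropy(sim, targets)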

... Some studies [40,60] use the entire network as the teacher and an early-exit network as the student. CoSub [48] notably improved model performance in visual recognition tasks with sub-model-based self-distillation. MaskSub [13] introduced a drop-based technique for sub-model self-distillation, improving model performance and cost efficiency for image classification tasks. ...

Co-training 2L Submodels for Visual Recognition
  • Citing Conference Paper
  • June 2023

... Feature Hallucination. Inspired by previous works on feature learning [53][54][55], we use feature hallucination as an auxiliary loss during training. Specifically, we compute the MSE loss between the output image tokens and those directly extracted from future frames (see the sketch after this reference). ...

Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
  • Citing Conference Paper
  • June 2023
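The auxiliary loss in the excerpt above reduces to a mean-squared error between predicted tokens and future-frame features; the sketch below shows the shape of that computation, with assumed dimensions and random stand-in tensors.

    import torch
    import torch.nn.functional as F

    B, N, D = 8, 196, 768                      # hypothetical batch/token/dim sizes
    pred_tokens = torch.randn(B, N, D)         # model's output image tokens
    with torch.no_grad():
        future_tokens = torch.randn(B, N, D)   # stand-in for future-frame features

    aux_loss = F.mse_loss(pred_tokens, future_tokens)  # auxiliary hallucination loss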

... This is probably the case in botany and ornithology (53), but also in the medical domain (42) and research in biology (5,28,36). As biology and drug discovery stand to benefit massively from Deep Learning and SSRL (25), there is an ongoing focus on proving the transferability of SSRL approaches pretrained on natural images to the biological domain (17), or on leveraging existing unlabeled biological images for biology-specific SSRL pretraining, either with a fixed number of channels (24) or through channel-agnostic approaches (6,23). ...

Unbiased single-cell morphology with self-supervised vision transformers

... We consider three different (frozen) backbones to extract features from the input images: ResNet-50 [13], DINOv2 [31], and CLIP [37]. We use the libraries LAMPE [41] and ZUKO [40] to learn a model (of type NPE) and to manipulate the normalizing flows (of type NSF), respectively. ...

DINOv2: Learning Robust Visual Features without Supervision
  • Citing Preprint
  • April 2023

... According to LeCun, instead of focusing on the input distribution, one should aim to achieve high-quality reconstruction in an abstract space [129]. The JEPA proposal from his group is to produce a joint embedding of an image and of a noisy version of that image, and to learn to reconstruct the latent representation of the original image from the latent representation of the noisy one [130]. ...

Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
  • Citing Preprint
  • January 2023

... At test time, these low-level actions guide agent movement and produce new observations as input for the upper levels of the hierarchy (Mezghan et al., 2022). However, this approach still suffers from several issues, such as sample inefficiency, handling sparse rewards, and the long-horizon problem (Fujimoto & Gu, 2021; Le et al., 2018). ...

Memory-Augmented Reinforcement Learning for Image-Goal Navigation
  • Citing Conference Paper
  • October 2022

... Path Foundation (Lai et al., 2023) is a Vision Transformer (ViT) (Dosovitskiy et al., 2020) encoder for histopathology image patches trained with self-supervised learning (Masked Siamese Networks; Assran et al., 2022). It incorporates pathology-specific optimizations, including approaches to help learn stain-agnostic features and to generalize across patch magnifications. ...

Masked Siamese Networks for Label-Efficient Learning
  • Citing Chapter
  • October 2022

Lecture Notes in Computer Science

... However, owing to the high computational cost of self-attention, studies have attempted to replace it with alternative token mixers. Accordingly, multilayer perceptron (MLP)-like token mixers [23,41,42,51,57] have emerged as a dominant approach, employing an MLP operator to mix spatial tokens (see the sketch after this reference). As another mainstream approach, depthwise convolution has been studied as a token mixer. ...

ResMLP: Feedforward Networks for Image Classification With Data-Efficient Training
  • Citing Article
  • September 2022

IEEE Transactions on Pattern Analysis and Machine Intelligence
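The MLP token mixer mentioned in the excerpt above can be sketched as a linear layer applied across the token (spatial) dimension rather than the channel dimension, ResMLP-style; the sizes below are illustrative assumptions.

    import torch
    import torch.nn as nn

    class TokenMixer(nn.Module):
        def __init__(self, n_tokens=196, dim=384):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.mix = nn.Linear(n_tokens, n_tokens)  # mixes across tokens

        def forward(self, x):                 # x: (B, n_tokens, dim)
            y = self.norm(x).transpose(1, 2)  # (B, dim, n_tokens)
            return x + self.mix(y).transpose(1, 2)  # residual token mixing

    block = TokenMixer()
    out = block(torch.randn(2, 196, 384))     # shape preserved: (2, 196, 384)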

... The sampling of pairs of images with the same weak label is a concept adaptable to other networks. Self-supervised methods are an active field of research, and new algorithms have outperformed DINO on ImageNet classification, for example Masked Siamese Networks [31]. Future work could incorporate these studies, which would be adaptable in a very similar way to WS-DINO. ...

Masked Siamese Networks for Label-Efficient Learning