Masked Autoencoders Are Scalable Vision Learners
Kaiming He*,†  Xinlei Chen*  Saining Xie  Yanghao Li  Piotr Dollár  Ross Girshick
*equal technical contribution  †project lead
Facebook AI Research (FAIR)
Abstract
This paper shows that masked autoencoders (MAE) are
scalable self-supervised learners for computer vision. Our
MAE approach is simple: we mask random patches of the
input image and reconstruct the missing pixels. It is based
on two core designs. First, we develop an asymmetric
encoder-decoder architecture, with an encoder that oper-
ates only on the visible subset of patches (without mask to-
kens), along with a lightweight decoder that reconstructs
the original image from the latent representation and mask
tokens. Second, we find that masking a high proportion
of the input image, e.g., 75%, yields a nontrivial and
meaningful self-supervisory task. Coupling these two de-
signs enables us to train large models efficiently and ef-
fectively: we accelerate training (by 3× or more) and im-
prove accuracy. Our scalable approach allows for learning
high-capacity models that generalize well: e.g., a vanilla
ViT-Huge model achieves the best accuracy (87.8%) among
methods that use only ImageNet-1K data. Transfer per-
formance in downstream tasks outperforms supervised pre-
training and shows promising scaling behavior.
1. Introduction
Deep learning has witnessed an explosion of archi-
tectures of continuously growing capability and capacity
[28,24,47]. Aided by the rapid gains in hardware, mod-
els today can easily overfit one million images [13] and
begin to demand hundreds of millions of—often publicly
inaccessible—labeled images [16].
This appetite for data has been successfully addressed in
natural language processing (NLP) by self-supervised pre-
training. The solutions, based on autoregressive language
modeling in GPT [40,41,4] and masked autoencoding in
BERT [14], are conceptually simple: they remove a portion
of the data and learn to predict the removed content. These
methods now enable training of generalizable NLP models
containing over one hundred billion parameters [4].
The idea of masked autoencoders, a form of more gen-
eral denoising autoencoders [48], is natural and applicable
in computer vision as well. Indeed, closely related research in vision [49,39] preceded BERT.
Figure 1. Our MAE architecture. During pre-training, a large
random subset of image patches (e.g., 75%) is masked out. The
encoder is applied to the small subset of visible patches. Mask
tokens are introduced after the encoder, and the full set of en-
coded patches and mask tokens is processed by a small decoder
that reconstructs the original image in pixels. After pre-training,
the decoder is discarded and the encoder is applied to uncorrupted
images to produce representations for recognition tasks.
However, despite signif-
icant interest in this idea following the success of BERT,
progress of autoencoding methods in vision lags behind
NLP. We ask: what makes masked autoencoding different
between vision and language? We attempt to answer this
question from the following perspectives:
(i) Until recently, architectures were different. In vision,
convolutional networks [29] were dominant over the last
decade [28]. Convolutions typically operate on regular grids
and it is not straightforward to integrate ‘indicators’ such as
mask tokens [14] or positional embeddings [47] into con-
volutional networks. This architectural gap, however, has
been addressed with the introduction of Vision Transform-
ers (ViT) [16] and should no longer present an obstacle.
(ii) Information density is different between language
and vision. Languages are human-generated signals that
are highly semantic and information-dense. When training
a model to predict only a few missing words per sentence,
this task appears to induce sophisticated language under-
standing. Images, on the contrary, are natural signals with
heavy spatial redundancy—e.g., a missing patch can be re-
covered from neighboring patches with little high-level understanding of parts, objects, and scenes.
Figure 2. Example results on ImageNet validation images. For each triplet, we show the masked image (left), our MAE reconstruction
(middle), and the ground-truth (right). The masking ratio is 80%, leaving only 39 out of 196 patches. More examples are in the appendix.
As no loss is computed on visible patches, the model output on visible patches is qualitatively worse. One can simply overlay the output with the visible
patches to improve visual quality. We intentionally opt not to do this, so we can more comprehensively demonstrate the method’s behavior.
Figure 3. Example results on COCO validation images, using an MAE trained on ImageNet (the same model weights as in Figure 2).
Observe the reconstructions on the two right-most examples, which, although different from the ground truth, are semantically plausible.
To overcome this
difference and encourage learning useful features, we show
that a simple strategy works well in computer vision: mask-
ing a very high portion of random patches. This strategy
largely reduces redundancy and creates a challenging self-
supervisory task that requires holistic understanding beyond
low-level image statistics. To get a qualitative sense of our
reconstruction task, see Figures 2–4.
(iii) The autoencoder’s decoder, which maps the latent
representation back to the input, plays a different role be-
tween reconstructing text and images. In vision, the decoder
reconstructs pixels, hence its output is of a lower semantic
level than common recognition tasks. This is in contrast
to language, where the decoder predicts missing words that
contain rich semantic information. While in BERT the de-
coder can be trivial (an MLP) [14], we found that for im-
ages, the decoder design plays a key role in determining the
semantic level of the learned latent representations.
Driven by this analysis, we present a simple, effective,
and scalable form of a masked autoencoder (MAE) for
visual representation learning. Our MAE masks random
patches from the input image and reconstructs the missing
patches in the pixel space. It has an asymmetric encoder-
decoder design. Our encoder operates only on the visible
subset of patches (without mask tokens), and our decoder is
lightweight and reconstructs the input from the latent rep-
resentation along with mask tokens (Figure 1). Shifting
the mask tokens to the small decoder in our asymmetric
encoder-decoder results in a large reduction in computation.
Under this design, a very high masking ratio (e.g., 75%) can
achieve a win-win scenario: it optimizes accuracy while al-
lowing the encoder to process only a small portion (e.g.,
25%) of patches. This can reduce overall pre-training time
by 3× or more and likewise reduce memory consumption,
enabling us to easily scale our MAE to large models.
Our MAE learns very high-capacity models that gen-
eralize well. With MAE pre-training, we can train data-
hungry models like ViT-Large/-Huge [16] on ImageNet-1K
with improved generalization performance. With a vanilla
ViT-Huge model, we achieve 87.8% accuracy when fine-
tuned on ImageNet-1K. This outperforms all previous re-
sults that use only ImageNet-1K data. We also evaluate
transfer learning on object detection, instance segmentation,
and semantic segmentation. In these tasks, our pre-training
achieves better results than its supervised pre-training coun-
terparts, and more importantly, we observe significant gains
by scaling up models. These observations are aligned
with those witnessed in self-supervised pre-training in NLP
[14,40,41,4] and we hope that they will enable our field to
explore a similar trajectory.
Figure 4. Reconstructions of ImageNet validation images using an MAE pre-trained with a masking ratio of 75% but applied on inputs with higher masking ratios (panels: original, mask 75%, mask 85%, mask 95%). The predictions differ plausibly from the original images, showing that the method can generalize.
2. Related Work
Masked language modeling and its autoregressive coun-
terparts, e.g., BERT [14] and GPT [40,41,4], are highly
successful methods for pre-training in NLP. These methods
hold out a portion of the input sequence and train models
to predict the missing content. These methods have been
shown to scale excellently [4] and a large abundance of ev-
idence indicates that these pre-trained representations gen-
eralize well to various downstream tasks.
Autoencoding is a classical method for learning representa-
tions. It has an encoder that maps an input to a latent repre-
sentation and a decoder that reconstructs the input. For ex-
ample, PCA and k-means are autoencoders [25]. Denoising
autoencoders (DAE) [48] are a class of autoencoders that
corrupt an input signal and learn to reconstruct the origi-
nal, uncorrupted signal. A series of methods can be thought
of as a generalized DAE under different corruptions, e.g.,
masking pixels [49,39,6] or removing color channels [59].
Our MAE is a form of denoising autoencoding, but different
from the classical DAE in numerous ways.
Masked image encoding methods learn representations
from images corrupted by masking. The pioneering work
of [49] presents masking as a noise type in DAE. Context
Encoder [39] inpaints large missing regions using convolu-
tional networks. Motivated by the success in NLP, related
recent methods [6,16,2] are based on Transformers [47].
iGPT [6] operates on sequences of pixels and predicts un-
known pixels. The ViT paper [16] studies masked patch
prediction for self-supervised learning. Most recently, BEiT
[2] proposes to predict discrete tokens [37,43].
Self-supervised learning approaches have seen significant
interest in computer vision, often focusing on different pre-
text tasks for pre-training [15,50,35,59,38,17]. Re-
cently, contrastive learning [3,21] has been popular, e.g.,
[51,36,22,7], which models image similarity and dis-
similarity (or only similarity [20,8]) between two or more
views. Contrastive and related methods strongly depend on
data augmentation [7,20,8]. Autoencoding pursues a con-
ceptually different direction, and it exhibits different behav-
iors as we will present.
3. Approach
Our masked autoencoder (MAE) is a simple autoencod-
ing approach that reconstructs the original signal given its
partial observation. Like all autoencoders, our approach
has an encoder that maps the observed signal to a latent
representation, and a decoder that reconstructs the origi-
nal signal from the latent representation. Unlike classical
autoencoders, we adopt an asymmetric design that allows
the encoder to operate only on the partial, observed signal
(without mask tokens) and a lightweight decoder that re-
constructs the full signal from the latent representation and
mask tokens. Figure 1 illustrates the idea, introduced next.
Masking. Following ViT [16], we divide an image into reg-
ular non-overlapping patches. Then we sample a subset of
patches and mask (i.e., remove) the remaining ones. Our
sampling strategy is straightforward: we sample random
patches without replacement, following a uniform distribu-
tion. We simply refer to this as “random sampling”.
Random sampling with a high masking ratio (i.e., the ra-
tio of removed patches) largely eliminates redundancy, thus
creating a task that cannot be easily solved by extrapolation
from visible neighboring patches (see Figures 2–4). The
uniform distribution prevents a potential center bias (i.e.,
more masked patches near the image center). Finally, the
highly sparse input creates an opportunity for designing an
efficient encoder, introduced next.
MAE encoder. Our encoder is a ViT [16] but applied only
on visible, unmasked patches. Just as in a standard ViT, our
encoder embeds patches by a linear projection with added
positional embeddings, and then processes the resulting set
via a series of Transformer blocks. However, our encoder
only operates on a small subset (e.g., 25%) of the full set.
Masked patches are removed; no mask tokens are used.
This allows us to train very large encoders with only a frac-
tion of compute and memory. The full set is handled by a
lightweight decoder, described next.
MAE decoder. The input to the MAE decoder is the full
set of tokens consisting of (i) encoded visible patches, and
(ii) mask tokens. See Figure 1. Each mask token [14] is a
shared, learned vector that indicates the presence of a miss-
ing patch to be predicted. We add positional embeddings to
all tokens in this full set; without this, mask tokens would
have no information about their location in the image. The
decoder has another series of Transformer blocks.
The MAE decoder is only used during pre-training to
perform the image reconstruction task (only the encoder
is used to produce image representations for recognition).
Therefore, the decoder architecture can be flexibly designed
in a manner that is independent of the encoder design. We
experiment with very small decoders, narrower and shal-
lower than the encoder. For example, our default decoder
has <10% computation per token vs. the encoder. With this
asymmetrical design, the full set of tokens are only pro-
cessed by the lightweight decoder, which significantly re-
duces pre-training time.
Reconstruction target. Our MAE reconstructs the input
by predicting the pixel values for each masked patch. Each
element in the decoder’s output is a vector of pixel values
representing a patch. The last layer of the decoder is a lin-
ear projection whose number of output channels equals the
number of pixel values in a patch. The decoder’s output is
reshaped to form a reconstructed image. Our loss function
computes the mean squared error (MSE) between the recon-
structed and original images in the pixel space. We compute
the loss only on masked patches, similar to BERT [14].1
We also study a variant whose reconstruction target is
the normalized pixel values of each masked patch. Specif-
ically, we compute the mean and standard deviation of all
pixels in a patch and use them to normalize this patch. Us-
ing normalized pixels as the reconstruction target improves
representation quality in our experiments.
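As an illustration of this loss, a minimal sketch in PyTorch style follows; it assumes the decoder output and the original image have already been patchified into (N, L, patch_dim) tensors and that a binary mask marks the removed patches. The function name and the normalization epsilon are ours, not taken from the paper's code.

```python
import torch

def mae_reconstruction_loss(pred, target, mask, norm_pix=True, eps=1e-6):
    """MSE on masked patches only (a sketch, not the official implementation).

    pred:   (N, L, patch_dim) decoder output, one pixel vector per patch
    target: (N, L, patch_dim) patchified original image
    mask:   (N, L) binary mask, 1 for masked (removed) patches, 0 for visible
    """
    if norm_pix:
        # per-patch normalization variant: normalize each target patch
        # by its own mean and standard deviation
        mean = target.mean(dim=-1, keepdim=True)
        var = target.var(dim=-1, keepdim=True)
        target = (target - mean) / (var + eps) ** 0.5
    loss = ((pred - target) ** 2).mean(dim=-1)   # per-patch MSE, shape (N, L)
    return (loss * mask).sum() / mask.sum()      # average over masked patches only
```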
Simple implementation. Our MAE pre-training can be im-
plemented efficiently, and importantly, does not require any
specialized sparse operations. First we generate a token for
every input patch (by linear projection with an added po-
sitional embedding). Next we randomly shuffle the list of
tokens and remove the last portion of the list, based on the
masking ratio. This process produces a small subset of to-
kens for the encoder and is equivalent to sampling patches
without replacement. After encoding, we append a list of
mask tokens to the list of encoded patches, and unshuffle
this full list (inverting the random shuffle operation) to align
all tokens with their targets. The decoder is applied to this
full list (with positional embeddings added). As noted, no
sparse operations are needed. This simple implementation
introduces negligible overhead as the shuffling and unshuf-
fling operations are fast.
1 Computing the loss only on masked patches differs from traditional
denoising autoencoders [48] that compute the loss on all pixels. This
choice is purely result-driven: computing the loss on all pixels leads to
a slight decrease in accuracy (e.g., 0.5%).
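The shuffle-and-unshuffle bookkeeping described above can be sketched as follows. This is our reading of the procedure (argsort over random noise, gather, then the inverse permutation before the decoder), with hypothetical helper names and the class token omitted; it is not the authors' released code.

```python
import torch

def random_masking(x, mask_ratio=0.75):
    """Keep a random subset of patch tokens. Returns the visible tokens,
    a binary mask (1 = removed), and indices that undo the shuffle."""
    N, L, D = x.shape
    len_keep = int(L * (1 - mask_ratio))
    noise = torch.rand(N, L, device=x.device)         # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)         # random permutation
    ids_restore = torch.argsort(ids_shuffle, dim=1)   # its inverse
    ids_keep = ids_shuffle[:, :len_keep]              # first part of the shuffled list
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(N, L, device=x.device)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)         # back to the original patch order
    return x_visible, mask, ids_restore

def prepare_decoder_input(encoded, mask_token, ids_restore):
    """Append shared mask tokens after encoding and unshuffle the full list
    so every token is aligned with its target patch."""
    N, L = ids_restore.shape
    D = encoded.shape[-1]
    mask_tokens = mask_token.expand(N, L - encoded.shape[1], D)
    full = torch.cat([encoded, mask_tokens], dim=1)
    return torch.gather(full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D))
```

Positional embeddings would then be added to this full list before the decoder, as described above.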
Figure 5. Masking ratio. A high masking ratio (75%) works well for both fine-tuning (top) and linear probing (bottom). The y-axes are ImageNet-1K validation accuracy (%) in all plots in this paper; the x-axes are the masking ratio (%).
4. ImageNet Experiments
We do self-supervised pre-training on the ImageNet-1K
(IN1K) [13] training set. Then we do supervised training to
evaluate the representations with (i) end-to-end fine-tuning
or (ii) linear probing. We report top-1 validation accuracy
of a single 224×224 crop. Details are in Appendix A.1.
Baseline: ViT-Large. We use ViT-Large (ViT-L/16) [16]
as the backbone in our ablation study. ViT-L is very big (an
order of magnitude bigger than ResNet-50 [24]) and tends
to overfit. The following is a comparison between ViT-L
trained from scratch vs. fine-tuned from our baseline MAE:
scratch, original [16] scratch, our impl. baseline MAE
76.5 82.5 84.9
We note that it is nontrivial to train supervised ViT-L from
scratch and a good recipe with strong regularization is
needed (82.5%, see Appendix A.2). Even so, our MAE pre-
training contributes a big improvement. Here fine-tuning is
only for 50 epochs (vs. 200 from scratch), implying that the
fine-tuning accuracy heavily depends on pre-training.
4.1. Main Properties
We ablate our MAE using the default settings in Table 1
(see caption). Several intriguing properties are observed.
Masking ratio. Figure 5 shows the influence of the mask-
ing ratio. The optimal ratios are surprisingly high. The ra-
tio of 75% is good for both linear probing and fine-tuning.
This behavior is in contrast with BERT [14], whose typical
masking ratio is 15%. Our masking ratios are also much
higher than those in related works [6,16,2] in computer
vision (20% to 50%).
The model infers missing patches to produce different,
yet plausible, outputs (Figure 4). It makes sense of the
gestalt of objects and scenes, which cannot be simply com-
pleted by extending lines or textures. We hypothesize that
this reasoning-like behavior is linked to the learning of use-
ful representations.
Figure 5 also shows that linear probing and fine-tuning results follow different trends.
(a) Decoder depth. A deep decoder can improve linear probing accuracy.
blocks   ft     lin
1        84.8   65.5
2        84.9   70.0
4        84.9   71.9
8        84.9   73.5
12       84.4   73.3

(b) Decoder width. The decoder can be narrower than the encoder (1024-d).
dim      ft     lin
128      84.9   69.1
256      84.8   71.3
512      84.9   73.5
768      84.4   73.1
1024     84.3   73.1

(c) Mask token. An encoder without mask tokens is more accurate and faster (Table 2).
case               ft     lin    FLOPs
encoder w/ [M]     84.2   59.6   3.3×
encoder w/o [M]    84.9   73.5   1×

(d) Reconstruction target. Pixels as reconstruction targets are effective.
case               ft     lin
pixel (w/o norm)   84.9   73.5
pixel (w/ norm)    85.4   73.9
PCA                84.6   72.3
dVAE token         85.3   71.6

(e) Data augmentation. Our MAE works with minimal or no augmentation.
case               ft     lin
none               84.0   65.7
crop, fixed size   84.7   73.1
crop, rand size    84.9   73.5
crop + color jit   84.3   71.9

(f) Mask sampling. Random sampling works the best. See Figure 6 for visualizations.
case     ratio   ft     lin
random   75      84.9   73.5
block    50      83.9   72.3
block    75      82.8   63.9
grid     75      84.0   66.0

Table 1. MAE ablation experiments with ViT-L/16 on ImageNet-1K. We report fine-tuning (ft) and linear probing (lin) accuracy (%). If not specified, the default is: the decoder has depth 8 and width 512, the reconstruction target is unnormalized pixels, the data augmentation is random resized cropping, the masking ratio is 75%, and the pre-training length is 800 epochs. Default settings are marked in gray.
For linear probing, the accuracy increases steadily with the masking ratio until the
sweet point: the accuracy gap is up to 20% (54.6% vs.
73.5%). For fine-tuning, the results are less sensitive to the
ratios, and a wide range of masking ratios (40–80%) work
well. All fine-tuning results in Figure 5are better than train-
ing from scratch (82.5%).
Decoder design. Our MAE decoder can be flexibly de-
signed, as studied in Table 1a and 1b.
Table 1a varies the decoder depth (number of Trans-
former blocks). A sufficiently deep decoder is important
for linear probing. This can be explained by the gap be-
tween a pixel reconstruction task and a recognition task: the
last several layers in an autoencoder are more specialized
for reconstruction, but are less relevant for recognition. A
reasonably deep decoder can account for the reconstruction
specialization, leaving the latent representations at a more
abstract level. This design can yield up to 8% improvement
in linear probing (Table 1a, ‘lin’). However, if fine-tuning
is used, the last layers of the encoder can be tuned to adapt
to the recognition task. The decoder depth is less influential
for improving fine-tuning (Table 1a, ‘ft’).
Interestingly, our MAE with a single-block decoder can
perform strongly with fine-tuning (84.8%). Note that a sin-
gle Transformer block is the minimal requirement to propa-
gate information from visible tokens to mask tokens. Such
a small decoder can further speed up training.
In Table 1b we study the decoder width (number of chan-
nels). We use 512-d by default, which performs well un-
der fine-tuning and linear probing. A narrower decoder also
works well with fine-tuning.
Overall, our default MAE decoder is lightweight. It has
8 blocks and a width of 512-d (gray in Table 1). It only
has 9% FLOPs per token vs. ViT-L (24 blocks, 1024-d).
As such, while the decoder processes all tokens, it is still a
small fraction of the overall compute.
encoder          dec. depth   ft acc   hours    speedup
ViT-L, w/ [M]    8            84.2     42.4     -
ViT-L            8            84.9     15.4     2.8×
ViT-L            1            84.8     11.6     3.7×
ViT-H, w/ [M]    8            -        119.6†   -
ViT-H            8            85.8     34.5     3.5×
ViT-H            1            85.9     29.3     4.1×

Table 2. Wall-clock time of our MAE training (800 epochs), benchmarked in 128 TPU-v3 cores with TensorFlow. The speedup is relative to the entry whose encoder has mask tokens (gray). The decoder width is 512, and the mask ratio is 75%. †: this entry is estimated by training ten epochs.
Mask token. An important design of our MAE is to skip
the mask token [M] in the encoder and apply it later in the
lightweight decoder. Table 1c studies this design.
If the encoder uses mask tokens, it performs worse: its
accuracy drops by 14% in linear probing. In this case,
there is a gap between pre-training and deploying: this en-
coder has a large portion of mask tokens in its input in pre-
training, which does not exist in uncorrupted images. This
gap may degrade accuracy in deployment. By removing the
mask token from the encoder, we constrain the encoder to
always see real patches and thus improve accuracy.
Moreover, by skipping the mask token in the encoder,
we greatly reduce training computation. In Table 1c, we
reduce the overall training FLOPs by 3.3×. This leads to
a 2.8×wall-clock speedup in our implementation (see Ta-
ble 2). The wall-clock speedup is even bigger (3.5–4.1×),
for a smaller decoder (1-block), a larger encoder (ViT-H),
or both. Note that the speedup can be >4× for a masking
ratio of 75%, partially because the self-attention complexity
is quadratic. In addition, memory is greatly reduced, which
can enable training even larger models or speeding up more
by large-batch training. The time and memory efficiency
makes our MAE favorable for training very large models.
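As a rough estimate of why the speedup can exceed the overall FLOPs reduction: with 25% of the tokens, the quadratic self-attention cost scales as roughly 0.25² ≈ 6% of the full-token cost, while the per-token MLP cost scales only linearly to 25%.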
Figure 6. Mask sampling strategies determine the pretext task
difficulty, influencing reconstruction quality and representations
(Table 1f). Here each output is from an MAE trained with the spec-
ified masking strategy. Left: random sampling (our default). Mid-
dle: block-wise sampling [2] that removes large random blocks.
Right: grid-wise sampling that keeps one of every four patches.
Images are from the validation set.
Reconstruction target. We compare different reconstruc-
tion targets in Table 1d. Our results thus far are based on
pixels without (per-patch) normalization. Using pixels with
normalization improves accuracy. This per-patch normal-
ization enhances the contrast locally. In another variant, we
perform PCA in the patch space and use the largest PCA
coefficients (96 here) as the target. Doing so degrades ac-
curacy. Both experiments suggest that the high-frequency
components are useful in our method.
We also compare an MAE variant that predicts tokens,
the target used in BEiT [2]. Specifically for this variant,
we use the DALLE pre-trained dVAE [43] as the tokenizer,
following [2]. Here the MAE decoder predicts the token in-
dices using cross-entropy loss. This tokenization improves
fine-tuning accuracy by 0.4% vs. unnormalized pixels, but
has no advantage vs. normalized pixels. It also reduces lin-
ear probing accuracy. In §5 we further show that tokeniza-
tion is not necessary in transfer learning.
Our pixel-based MAE is much simpler than tokeniza-
tion. The dVAE tokenizer requires one more pre-training
stage, which may depend on extra data (250M images [43]).
The dVAE encoder is a large convolutional network (40%
FLOPs of ViT-L) and adds nontrivial overhead. Using pix-
els does not suffer from these problems.
Data augmentation. Table 1e studies the influence of data
augmentation on our MAE pre-training.
Our MAE works well using cropping-only augmenta-
tion, either fixed-size or random-size (both having random
horizontal flipping). Adding color jittering degrades the re-
sults and so we do not use it in other experiments.
Surprisingly, our MAE behaves decently even if using
no data augmentation (only center-crop, no flipping). This
property is dramatically different from contrastive learning
and related methods [51,22,7,20], which heavily rely
on data augmentation. It was observed [20] that using cropping-only augmentation reduces the accuracy by 13% and 28% respectively for BYOL [20] and SimCLR [7].
Figure 7. Training schedules. A longer training schedule gives a noticeable improvement. Here each point is a full training schedule. The model is ViT-L with the default setting in Table 1.
epochs (log-scale):  100    200    400    800    1600
fine-tuning:         82.3   83.3   84.3   84.9   85.1
linear probing:      57.3   64.4   69.7   73.5   75.1

In
addition, there is no evidence that contrastive learning can
work without augmentation: the two views of an image are
the same and can easily satisfy a trivial solution.
In MAE, the role of data augmentation is mainly per-
formed by random masking (ablated next). The masks are
different for each iteration and so they generate new training
samples regardless of data augmentation. The pretext task
is made difficult by masking and requires less augmentation
to regularize training.
Mask sampling strategy. In Table 1f we compare different
mask sampling strategies, illustrated in Figure 6.
The block-wise masking strategy, proposed in [2], tends
to remove large blocks (Figure 6, middle). Our MAE with
block-wise masking works reasonably well at a ratio of
50%, but degrades at a ratio of 75%. This task is harder
than that of random sampling, as a higher training loss is
observed. The reconstruction is also blurrier.
We also study grid-wise sampling, which regularly keeps
one of every four patches (Figure 6, right). This is an eas-
ier task and has lower training loss. The reconstruction is
sharper. However, the representation quality is lower.
Simple random sampling works the best for our MAE. It
allows for a higher masking ratio, which provides a greater
speedup benefit while also enjoying good accuracy.
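As a small illustration of the alternatives in Table 1f, the sketch below builds a grid-wise mask that keeps one of every four patches; which position inside each 2×2 cell is kept is our assumption. Block-wise masking would instead follow the block sampling of [2] and is not sketched here.

```python
import torch

def grid_mask(h, w):
    """Grid-wise sampling sketch for an h x w patch grid: keep one patch per
    2x2 cell (here the top-left one) and mask the other three (75% masked)."""
    mask = torch.ones(h, w)       # 1 = masked
    mask[0::2, 0::2] = 0          # keep every other row/column position
    return mask.flatten()         # (h*w,) in row-major patch order

# e.g., grid_mask(14, 14) masks 147 of the 196 patches of a 224x224 image with 16x16 patches
```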
Training schedule. Our ablations thus far are based on
800-epoch pre-training. Figure 7 shows the influence of the
training schedule length. The accuracy improves steadily
with longer training. Indeed, we have not observed sat-
uration of linear probing accuracy even at 1600 epochs.
This behavior is unlike contrastive learning methods, e.g.,
MoCo v3 [9] saturates at 300 epochs for ViT-L. Note that
the MAE encoder only sees 25% of patches per epoch,
while in contrastive learning the encoder sees 200% (two-
crop) or even more (multi-crop) patches per epoch.
method               pre-train data   ViT-B   ViT-L   ViT-H   ViT-H448
scratch, our impl.   -                82.3    82.6    83.1    -
DINO [5]             IN1K             82.8    -       -       -
MoCo v3 [9]          IN1K             83.2    84.1    -       -
BEiT [2]             IN1K+DALLE       83.2    85.2    -       -
MAE                  IN1K             83.6    85.9    86.9    87.8

Table 3. Comparisons with previous results on ImageNet-1K. The pre-training data is the ImageNet-1K training set (except the tokenizer in BEiT was pre-trained on 250M DALLE data [43]). All self-supervised methods are evaluated by end-to-end fine-tuning. The ViT models are B/16, L/16, H/14 [16]. The best for each column is underlined. All results are on an image size of 224, except for ViT-H with an extra result on 448. Here our MAE reconstructs normalized pixels and is pre-trained for 1600 epochs.
Figure 8. MAE pre-training vs. supervised pre-training, evaluated by fine-tuning in ImageNet-1K (224 size). We compare with the original ViT results [16] trained in IN1K or JFT300M. (The plot shows accuracy vs. params (M) for ViT-B/L/H; curves: MAE, IN1K; supervised, IN1K, our impl.; supervised, IN1K [16]; supervised, JFT300M [16].)
4.2. Comparisons with Previous Results
Comparisons with self-supervised methods. In Table 3
we compare the fine-tuning results of self-supervised ViT
models. For ViT-B, all methods perform closely. For ViT-L,
the gaps among methods are bigger, suggesting that a chal-
lenge for bigger models is to reduce overfitting.
Our MAE can scale up easily and has shown steady im-
provement from bigger models. We obtain 86.9% accuracy
using ViT-H (224 size). By fine-tuning with a 448 size, we
achieve 87.8% accuracy, using only IN1K data. The pre-
vious best accuracy, among all methods using only IN1K
data, is 87.1% (512 size) [56], based on advanced networks.
We improve over the state-of-the-art by a nontrivial margin
in the highly competitive benchmark of IN1K (no external
data). Our result is based on vanilla ViT, and we expect
advanced networks will perform better.
Comparing with BEiT [2], our MAE is more accurate
while being simpler and faster. Our method reconstructs
pixels, in contrast to BEiT that predicts tokens: BEiT re-
ported a 1.8% degradation [2] when reconstructing pixels
with ViT-B.2 We do not need dVAE pre-training. Moreover, our MAE is considerably faster (3.5× per epoch) than
BEiT, for the reason as studied in Table 1c.
2 We observed the degradation also in BEiT with ViT-L: it produces
85.2% (tokens) and 83.5% (pixels), reproduced from the official code.
Figure 9. Partial fine-tuning results of ViT-L w.r.t. the number of fine-tuned Transformer blocks under the default settings from Table 1. Tuning 0 blocks is linear probing; 24 is full fine-tuning. Our MAE representations are less linearly separable, but are consistently better than MoCo v3 if one or more blocks are tuned.
# blocks fine-tuned:  0      1      2      4      6      12     18     24
MAE baseline:         73.5   81.0   83.1   84.2   84.4   84.6   84.7   84.9
MoCo v3:              77.6   79.9   80.8   81.6   81.9   83.2   83.8   84.1
The MAE models in Table 3 are pre-trained for 1600
epochs for better accuracy (Figure 7). Even so, our total
pre-training time is less than all other methods if they were
trained in the same hardware. For example, with ViT-L,
our MAE’s training time is 31 hours for 1600 epochs and
MoCo v3’s is 36 hours for 300 epochs [9], using the same
128 TPU-v3 cores.
Comparisons with supervised pre-training. In the origi-
nal ViT paper [16], ViT-L degrades when trained in IN1K.
See Figure 8. Our improved supervised recipe works better
for training from scratch (Figure 8, “our impl.”; see A.2),
but the accuracy is saturated.
Our MAE pre-training, using only IN1K, can general-
ize better: the gain over training from scratch is bigger for
higher-capacity models. It follows a trend similar to the
JFT-300M supervised pre-training in [16]. This compari-
son shows that our MAE can help scale up model sizes.
4.3. Partial Fine-tuning
Table 1 shows that linear probing and fine-tuning results
are largely uncorrelated. Linear probing has been a popular
protocol in the past few years; however, it misses the oppor-
tunity of pursuing strong but non-linear features—which is
indeed a strength of deep learning. As a middle ground, we
study a partial fine-tuning protocol: fine-tune the last sev-
eral layers while freezing the others. This protocol was also
used in early works, e.g., [54,59,35].
Figure 9 shows the results. Notably, fine-tuning only one
Transformer block boosts the accuracy significantly from
73.5% to 81.0%. Moreover, if we fine-tune only “half” of
the last block (i.e., its MLP sub-block), we can get 79.1%,
much better than linear probing. This variant is essentially
fine-tuning an MLP head. Fine-tuning a few blocks (e.g.,
4 or 6) can achieve decent accuracy, which is still a small
fine-tuning head compared with the frozen backbone.
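A minimal sketch of this protocol is given below; it assumes a ViT-style module that exposes its Transformer blocks as `.blocks` and its classifier as `.head` (both names are assumptions), and it leaves the optimization schedule to the fine-tuning recipe in Appendix A.1.

```python
import torch.nn as nn

def configure_partial_finetuning(vit: nn.Module, n_blocks_tuned: int):
    """Freeze the backbone, then unfreeze only the last `n_blocks_tuned`
    Transformer blocks plus the classifier head (a sketch).
    n_blocks_tuned = 0 corresponds to linear probing; 24 (for ViT-L)
    corresponds to full fine-tuning."""
    for p in vit.parameters():
        p.requires_grad = False
    if n_blocks_tuned > 0:
        for block in vit.blocks[-n_blocks_tuned:]:
            for p in block.parameters():
                p.requires_grad = True
    for p in vit.head.parameters():
        p.requires_grad = True
```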
In Figure 9 we also compare with MoCo v3 [9], which is a contrastive method with ViT-L results available. It has higher linear probing accuracy than our MAE. However, all of its partial fine-tuning results are worse than ours.
                              AP^box            AP^mask
method       pre-train data   ViT-B   ViT-L     ViT-B   ViT-L
supervised   IN1K w/ labels   47.9    49.3      42.9    43.9
MoCo v3      IN1K             47.9    49.3      42.7    44.0
BEiT         IN1K+DALLE       49.8    53.3      44.4    47.1
MAE          IN1K             50.3    53.3      44.9    47.2

Table 4. COCO object detection and segmentation using a ViT Mask R-CNN baseline. All entries are based on our implementation. Self-supervised entries use IN1K data without labels. Mask AP follows a similar trend as box AP.
The gap
is 2.6% when tuning 4 blocks. These results show that the
MAE representations are less linearly separable, but they
are stronger non-linear features and perform well when a
non-linear head is tuned.
These observations suggest that linear separability is not
the sole metric for evaluating representation quality. It has
also been observed (e.g., [8]) that linear probing is not well
correlated with transfer learning performance, e.g., for ob-
ject detection. To our knowledge, linear evaluation is not
often used in NLP for benchmarking pre-training.
5. Transfer Learning Experiments
We evaluate transfer learning in object detection and seg-
mentation on COCO [32] and semantic segmentation on
ADE20K [60]. We use the pre-trained models in Table 3.
Object detection and segmentation. We fine-tune Mask
R-CNN [23] end-to-end on COCO. The ViT backbone is
adapted for use with FPN [31] (see Appendix A.3). We
apply this object detection system to all entries in Table 4.
We report box AP for object detection and mask AP for
instance segmentation.
Compared to supervised pre-training, our MAE performs
better under all configurations (Table 4). With the smaller
ViT-B, our MAE is 2.4 points higher than supervised pre-
training (50.3 vs. 47.9, APbox). More significantly, with the
larger ViT-L, our MAE pre-training outperforms supervised
pre-training by 4.0 points (53.3 vs. 49.3).
The pixel-based MAE is better than or on par with the
token-based BEiT, while MAE is much simpler and faster.
Both MAE and BEiT are better than MoCo v3 and MoCo
v3 is on par with supervised pre-training.
Semantic segmentation. Our experiments on ADE20K
use UperNet [52] following the code in [2]. Details are in
A.4. Table 5 shows that our MAE significantly improves
the transferring results of ViT-L, which is 3.7 points better
than the supervised pre-training counterpart (53.6 vs. 49.9).
The pixel-based MAE outperforms the token-based BEiT.
These observations are consistent with those in COCO.
Pixels vs. tokens. Table 6 presents an all-around comparison on pixels vs. tokens as the MAE reconstruction target. While using dVAE tokens is better than using unnormalized pixels, it is statistically similar to just using normalized pixels across all tasks and models we studied. It again shows that tokenization is not necessary for our MAE.
method       pre-train data   ViT-B   ViT-L
supervised   IN1K w/ labels   47.4    49.9
MoCo v3      IN1K             47.3    49.1
BEiT         IN1K+DALLE       47.1    53.3
MAE          IN1K             48.1    53.6

Table 5. ADE20K semantic segmentation (mIoU) using UperNet. BEiT results are reproduced using the official code. Other entries are based on our implementation. Self-supervised entries use IN1K data without labels.
                    IN1K                     COCO             ADE20K
                    ViT-B   ViT-L   ViT-H    ViT-B   ViT-L    ViT-B   ViT-L
pixel (w/o norm)    83.3    85.1    86.2     49.5    52.8     48.0    51.8
pixel (w/ norm)     83.6    85.9    86.9     50.3    53.3     48.1    53.6
dVAE token          83.6    85.7    86.9     50.3    53.2     48.1    53.4
Δ                   0.0     -0.2    0.0      0.0     -0.1     0.0     -0.2

Table 6. Pixels vs. tokens as the MAE reconstruction target. Δ is the difference between using dVAE tokens and using normalized pixels. The difference is statistically insignificant.
6. Discussion and Conclusion
Simple algorithms that scale well are the core of deep
learning. In NLP, simple self-supervised learning methods
(e.g., [40,14,41,4]) enable benefits from exponentially
scaling models. In computer vision, practical pre-training
paradigms are dominantly supervised (e.g. [28,44,24,16])
despite progress in self-supervised learning. In this study,
we observe on ImageNet and in transfer learning that
an autoencoder—a simple self-supervised method similar
to techniques in NLP—provides scalable benefits. Self-
supervised learning in vision may now be embarking on a
similar trajectory as in NLP.
On the other hand, we note that images and languages
are signals of a different nature and this difference must
be addressed carefully. Images are merely recorded light
without a semantic decomposition into the visual analogue
of words. Instead of attempting to remove objects, we re-
move random patches that most likely do not form a seman-
tic segment. Likewise, our MAE reconstructs pixels, which
are not semantic entities. Nevertheless, we observe (e.g.,
Figure 4) that our MAE infers complex, holistic reconstruc-
tions, suggesting it has learned numerous visual concepts,
i.e., semantics. We hypothesize that this behavior occurs
by way of a rich hidden representation inside the MAE. We
hope this perspective will inspire future work.
Broader impacts. The proposed method predicts content
based on learned statistics of the training dataset and as such
will reflect biases in those data, including ones with nega-
tive societal impacts. The model may generate inexistent
content. These issues warrant further research and consid-
eration when building upon this work to generate images.
References
[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton.
Layer normalization. arXiv:1607.06450, 2016.
[2] Hangbo Bao, Li Dong, and Furu Wei. BEiT: BERT pre-
training of image transformers. arXiv:2106.08254, 2021.
Accessed in June 2021.
[3] Suzanna Becker and Geoffrey E Hinton. Self-organizing
neural network that discovers surfaces in random-dot stere-
ograms. Nature, 1992.
[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub-
biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan-
tan, Pranav Shyam, Girish Sastry, Amanda Askell, Sand-
hini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom
Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler,
Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric
Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack
Clark, Christopher Berner, Sam McCandlish, Alec Radford,
Ilya Sutskever, and Dario Amodei. Language models are
few-shot learners. In NeurIPS, 2020.
[5] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou,
Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg-
ing properties in self-supervised vision transformers. In
ICCV, 2021.
[6] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Hee-
woo Jun, David Luan, and Ilya Sutskever. Generative pre-
training from pixels. In ICML, 2020.
[7] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Ge-
offrey Hinton. A simple framework for contrastive learning
of visual representations. In ICML, 2020.
[8] Xinlei Chen and Kaiming He. Exploring simple Siamese
representation learning. In CVPR, 2021.
[9] Xinlei Chen, Saining Xie, and Kaiming He. An empirical
study of training self-supervised Vision Transformers. In
ICCV, 2021.
[10] Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christo-
pher D Manning. ELECTRA: Pre-training text encoders as
discriminators rather than generators. In ICLR, 2020.
[11] Corinna Cortes and Vladimir Vapnik. Support-vector net-
works. Machine learning, 1995.
[12] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V
Le. Randaugment: Practical automated data augmentation
with a reduced search space. In CVPR Workshops, 2020.
[13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li,
and Li Fei-Fei. ImageNet: A large-scale hierarchical image
database. In CVPR, 2009.
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina
Toutanova. BERT: Pre-training of deep bidirectional trans-
formers for language understanding. In NAACL, 2019.
[15] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsuper-
vised visual representation learning by context prediction. In
ICCV, 2015.
[16] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov,
Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner,
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl-
vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is
worth 16x16 words: Transformers for image recognition at
scale. In ICLR, 2021.
[17] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Un-
supervised representation learning by predicting image rota-
tions. In ICLR, 2018.
[18] Xavier Glorot and Yoshua Bengio. Understanding the diffi-
culty of training deep feedforward neural networks. In AIS-
TATS, 2010.
[19] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noord-
huis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch,
Yangqing Jia, and Kaiming He. Accurate, large minibatch
SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017.
[20] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin
Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch,
Bernardo Avila Pires, Zhaohan Guo, Mohammad Ghesh-
laghi Azar, Bilal Piot, Koray Kavukcuoglu, Remi Munos,
and Michal Valko. Bootstrap your own latent - a new ap-
proach to self-supervised learning. In NeurIPS, 2020.
[21] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimension-
ality reduction by learning an invariant mapping. In CVPR,
2006.
[22] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross
Girshick. Momentum contrast for unsupervised visual rep-
resentation learning. In CVPR, 2020.
[23] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Gir-
shick. Mask R-CNN. In ICCV, 2017.
[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition. In CVPR,
2016.
[25] Geoffrey E Hinton and Richard S Zemel. Autoencoders,
minimum description length, and helmholtz free energy. In
NeurIPS, 1994.
[26] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q
Weinberger. Deep networks with stochastic depth. In ECCV,
2016.
[27] Sergey Ioffe and Christian Szegedy. Batch normalization:
Accelerating deep network training by reducing internal co-
variate shift. In ICML, 2015.
[28] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Ima-
genet classification with deep convolutional neural networks.
In NeurIPS, 2012.
[29] Yann LeCun, Bernhard Boser, John S Denker, Donnie
Henderson, Richard E Howard, Wayne Hubbard, and
Lawrence D Jackel. Backpropagation applied to handwrit-
ten zip code recognition. Neural computation, 1989.
[30] Yanghao Li, Saining Xie, Xinlei Chen, Piotr Dollár, Kaim-
ing He, and Ross Girshick. Benchmarking detection transfer
learning with vision transformers. In preparation, 2021.
[31] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He,
Bharath Hariharan, and Serge Belongie. Feature pyramid
networks for object detection. In CVPR, 2017.
[32] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays,
Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence
Zitnick. Microsoft COCO: Common objects in context. In
ECCV, 2014.
[33] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradi-
ent descent with warm restarts. In ICLR, 2017.
[34] Ilya Loshchilov and Frank Hutter. Decoupled weight decay
regularization. In ICLR, 2019.
[35] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of
visual representations by solving jigsaw puzzles. In ECCV,
2016.
[36] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Rep-
resentation learning with contrastive predictive coding.
arXiv:1807.03748, 2018.
[37] Aaron van den Oord, Oriol Vinyals, and Koray
Kavukcuoglu. Neural discrete representation learning.
In NeurIPS, 2017.
[38] Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell,
and Bharath Hariharan. Learning features by watching ob-
jects move. In CVPR, 2017.
[39] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor
Darrell, and Alexei A Efros. Context encoders: Feature
learning by inpainting. In CVPR, 2016.
[40] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya
Sutskever. Improving language understanding by generative
pre-training. 2018.
[41] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario
Amodei, and Ilya Sutskever. Language models are unsuper-
vised multitask learners. 2019.
[42] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee,
Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and
Peter J. Liu. Exploring the limits of transfer learning with a
unified text-to-text transformer. JMLR, 2020.
[43] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray,
Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever.
Zero-shot text-to-image generation. In ICML, 2021.
[44] Karen Simonyan and Andrew Zisserman. Very deep convo-
lutional networks for large-scale image recognition. In ICLR,
2015.
[45] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe,
Jonathon Shlens, and Zbigniew Wojna. Rethinking the in-
ception architecture for computer vision. In CVPR, 2016.
[46] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco
Massa, Alexandre Sablayrolles, and Hervé Jégou. Training
data-efficient image transformers & distillation through at-
tention. In ICML, 2021.
[47] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
reit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia
Polosukhin. Attention is all you need. In NeurIPS, 2017.
[48] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and
Pierre-Antoine Manzagol. Extracting and composing robust
features with denoising autoencoders. In ICML, 2008.
[49] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua
Bengio, Pierre-Antoine Manzagol, and Léon Bottou.
Stacked denoising autoencoders: Learning useful represen-
tations in a deep network with a local denoising criterion.
JMLR, 2010.
[50] Xiaolong Wang and Abhinav Gupta. Unsupervised learning
of visual representations using videos. In ICCV, 2015.
[51] Zhirong Wu, Yuanjun Xiong, Stella Yu, and Dahua Lin. Un-
supervised feature learning via non-parametric instance dis-
crimination. In CVPR, 2018.
[52] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and
Jian Sun. Unified perceptual parsing for scene understand-
ing. In ECCV, 2018.
[53] Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr
Dollár, and Ross Girshick. Early convolutions help trans-
formers see better. In NeurIPS, 2021.
[54] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson.
How transferable are features in deep neural networks? In
NeurIPS, 2014.
[55] Yang You, Igor Gitman, and Boris Ginsburg. Large batch
training of convolutional networks. arXiv:1708.03888, 2017.
[56] Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, and
Shuicheng Yan. VOLO: Vision outlooker for visual recogni-
tion. arXiv:2106.13112, 2021.
[57] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk
Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regu-
larization strategy to train strong classifiers with localizable
features. In ICCV, 2019.
[58] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and
David Lopez-Paz. mixup: Beyond empirical risk minimiza-
tion. In ICLR, 2018.
[59] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful
image colorization. In ECCV, 2016.
[60] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fi-
dler, Adela Barriuso, and Antonio Torralba. Semantic un-
derstanding of scenes through the ADE20K dataset. IJCV,
2019.
A. Implementation Details
A.1. ImageNet Experiments
ViT architecture. We follow the standard ViT architecture
[16]. It has a stack of Transformer blocks [47], and each
block consists of a multi-head self-attention block and an
MLP block, both having LayerNorm (LN) [1]. The encoder
ends with LN. As the MAE encoder and decoder have dif-
ferent width, we adopt a linear projection layer after the
encoder to match it. Our MAE adds positional embeddings
[47] (the sine-cosine version) to both the encoder and de-
coder inputs. Our MAE does not use relative position or
layer scaling (which are used in the code of [2]).
We extract features from the encoder output for fine-
tuning and linear probing. As ViT has a class token [16],
to adapt to this design, in our MAE pre-training we append
an auxiliary dummy token to the encoder input. This token
will be treated as the class token for training the classifier in
linear probing and fine-tuning. Our MAE works similarly
well without this token (with average pooling).
Pre-training. The default setting is in Table 7. We do
not use color jittering, drop path, or gradient clip. We use
xavier uniform [18] to initialize all Transformer blocks, fol-
lowing ViT’s official code [16]. We use the linear lr scaling
rule [19]: lr = base_lr × batchsize / 256.
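For example, with the base lr of 1.5e-4 and the batch size of 4096 in Table 7, the effective lr is 1.5e-4 × 4096 / 256 = 2.4e-3.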
End-to-end fine-tuning. Our fine-tuning follows common
practice of supervised ViT training. The default setting is in
Table 8. We use layer-wise lr decay [10] following [2].
Linear probing. Our linear classifier training follows [9].
See Table 9. We observe that linear probing requires a very
different recipe than end-to-end fine-tuning. In particular,
regularization is in general harmful for linear probing. Fol-
lowing [9], we disable many common regularization strate-
gies: we do not use mixup [58], cutmix [57], drop path [26],
or color jittering, and we set weight decay as zero.
It is a common practice to normalize the classifier input
when training a classical linear classifier (e.g., SVM [11]).
Similarly, it is beneficial to normalize the pre-trained fea-
tures when training the linear probing classifier. Follow-
ing [15], we adopt an extra BatchNorm layer [27] without
affine transformation (affine=False). This layer is ap-
plied on the pre-trained features produced by the encoder,
and is before the linear classifier. We note that the layer
does not break the linear property, and it can be absorbed
into the linear classifier after training: it is essentially a re-
parameterized linear classifier.3Introducing this layer helps
calibrate the feature magnitudes across different variants in
our ablations, so that they can use the same setting without
further lr search.
3 Alternatively, we can pre-compute the mean and std of the features
and use the normalized features to train linear classifiers.
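A minimal sketch of this probing head follows, assuming frozen encoder features of dimension `feat_dim`; the class and parameter names are placeholders, not the exact training code.

```python
import torch.nn as nn

class LinearProbe(nn.Module):
    """Linear probing head sketch: a parameter-free BatchNorm (affine=False)
    normalizes the frozen pre-trained features before the linear classifier.
    Because the normalization has no learned affine parameters, it can be
    absorbed into the linear layer after training, so the probe stays linear."""
    def __init__(self, feat_dim=1024, num_classes=1000):
        super().__init__()
        self.bn = nn.BatchNorm1d(feat_dim, affine=False)
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, features):        # features: (N, feat_dim) from the frozen encoder
        return self.fc(self.bn(features))
```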
config value
optimizer AdamW [34]
base learning rate 1.5e-4
weight decay 0.05
optimizer momentum β1, β2=0.9,0.95 [6]
batch size 4096
learning rate schedule cosine decay [33]
warmup epochs [19] 40
augmentation RandomResizedCrop
Table 7. Pre-training setting.
config value
optimizer AdamW
base learning rate 1e-3
weight decay 0.05
optimizer momentum β1, β2=0.9,0.999
layer-wise lr decay [10,2] 0.75
batch size 1024
learning rate schedule cosine decay
warmup epochs 5
training epochs 100 (B), 50 (L/H)
augmentation RandAug (9, 0.5) [12]
label smoothing [45] 0.1
mixup [58] 0.8
cutmix [57] 1.0
drop path [26] 0.1 (B/L) 0.2 (H)
Table 8. End-to-end fine-tuning setting.
config value
optimizer LARS [55]
base learning rate 0.1
weight decay 0
optimizer momentum 0.9
batch size 16384
learning rate schedule cosine decay
warmup epochs 10
training epochs 90
augmentation RandomResizedCrop
Table 9. Linear probing setting. We use LARS with a large batch
for faster training; SGD works similarly with a 4096 batch size.
Partial fine-tuning. Our MAE partial fine-tuning (§4.3)
follows the setting in Table 8, except that we adjust the num-
ber of fine-tuning epochs. We observe that tuning fewer
blocks requires a longer schedule. We set the numbers of
fine-tuning epochs as {50, 100, 200} and use the optimal
one for each number of blocks tuned.
A.2. Supervised Training ViT-L/H from Scratch
We find that it is nontrivial to train supervised ViT-L/H
from scratch on ImageNet-1K. The training is unstable.
While there have been strong baselines with publicly avail-
able implementations [46] for smaller models, the recipes
for the larger ViT-L/H are unexplored. Directly applying
the previous recipes to these larger models does not work.
A NaN loss is frequently observed during training.
We provide our recipe in Table 10. We use a wd of 0.3,
a large batch size of 4096, and a long warmup, following
the original ViT [16]. We use β2=0.95 following [6]. We
use the regularizations listed in Table 10 and disable others,
following [53]. All these choices are for improving training
stability. Our recipe can finish training with no NaN loss.
config value
optimizer AdamW
base learning rate 1e-4
weight decay 0.3
optimizer momentum β1, β2=0.9,0.95
batch size 4096
learning rate schedule cosine decay
warmup epochs 20
training epochs 300 (B), 200 (L/H)
augmentation RandAug (9, 0.5) [12]
label smoothing [45] 0.1
mixup [58] 0.8
cutmix [57] 1.0
drop path [26] 0.1 (B), 0.2 (L/H)
exp. moving average (EMA) 0.9999
Table 10. Supervised training ViT from scratch.
The accuracy is 82.6% for ViT-L (81.5% w/o EMA), and
83.1% for ViT-H (80.9% w/o EMA). Both ViT-L and ViT-H
show an overfitting trend if not using EMA.
As a by-product, our recipe for ViT-B has 82.3% accu-
racy (82.1% w/o EMA), vs. 81.8% in [46].
A.3. Object Detection and Segmentation in COCO
We adapt the vanilla ViT for the use of an FPN backbone
[31] in Mask R-CNN [23]. ViT has a stack of Transformer
blocks that all produce feature maps at a single scale (e.g.,
stride 16). We equally divide this stack into 4 subsets and
apply convolutions to upsample or downsample the inter-
mediate feature maps for producing different scales (stride
4, 8, 16, or 32, the same as a standard ResNet [24]). FPN is
built on these multi-scale maps.
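The sketch below illustrates the resampling step described above; the specific layer choices (strided deconvolutions for upsampling, max pooling for downsampling) are our assumptions for illustration and not necessarily the exact configuration used here.

```python
import torch.nn as nn

class ViTMultiScaleAdapter(nn.Module):
    """Sketch: take the intermediate feature maps after each quarter of the
    ViT blocks (all at stride 16) and resample them to strides 4, 8, 16, 32
    so that FPN can be built on them."""
    def __init__(self, dim=1024):
        super().__init__()
        self.to_stride4 = nn.Sequential(                     # 16 -> 4: upsample 4x
            nn.ConvTranspose2d(dim, dim, 2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(dim, dim, 2, stride=2),
        )
        self.to_stride8 = nn.ConvTranspose2d(dim, dim, 2, stride=2)  # 16 -> 8
        self.to_stride16 = nn.Identity()                             # 16 -> 16
        self.to_stride32 = nn.MaxPool2d(2, stride=2)                 # 16 -> 32

    def forward(self, feats):
        # feats: list of 4 feature maps (N, C, H, W), one per quarter of the blocks
        f4, f8, f16, f32 = feats
        return [self.to_stride4(f4), self.to_stride8(f8),
                self.to_stride16(f16), self.to_stride32(f32)]
```

FPN [31] would then be applied to the four returned maps, as with a standard ResNet backbone.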
For fair comparisons among different methods, we
search for hyper-parameters for each entry in Table 4 (in-
cluding all competitors). The hyper-parameters we search
for are the learning rate, weight decay, drop path rate, and
fine-tuning epochs. We will release code along with the
specific configurations. For full model and training details,
plus additional experiments, see [30].
A.4. Semantic Segmentation in ADE20K
We use UperNet [52] following the semantic segmenta-
tion code of [2]. We fine-tune end-to-end for 100 epochs
with a batch size of 16. We search for the optimal lr for
each entry in Table 5 (including all competitors).
The semantic segmentation code of [2] uses relative po-
sition bias [42]. Our MAE pre-training does not use it. For
fair comparison, we turn on relative position bias only dur-
ing transfer learning, initialized as zero. We note that our
BEiT reproduction uses relative position bias in both pre-
training and fine-tuning, following their code.
B. Comparison on Linear Probing Results
In §4.3 we have shown that linear probing accuracy and
fine-tuning accuracy are largely uncorrelated and they have
different focuses about linear separability. We notice that
method      model      params   acc
iGPT [6]    iGPT-L     1362 M   69.0
iGPT [6]    iGPT-XL    6801 M   72.0
BEiT [2]†   ViT-L      304 M    52.1
MAE         ViT-B      86 M     68.0
MAE         ViT-L      304 M    75.8
MAE         ViT-H      632 M    76.6

Table 11. Linear probing results of masked encoding methods. Our fine-tuning results are in Table 3. †: our implementation.
existing masked image encoding methods are generally less
competitive in linear probing (e.g., than contrastive learn-
ing). For completeness, in Table 11 we compare on linear
probing accuracy with masking-based methods.
Our MAE with ViT-L has 75.8% linear probing accu-
racy. This is substantially better than previous masking-
based methods. On the other hand, it still lags behind con-
trastive methods under this protocol: e.g., MoCo v3 [9] has
77.6% linear probing accuracy for the ViT-L (Figure 9).
Figure 10. Uncurated random samples on ImageNet validation images. For each triplet, we show the masked image (left), our MAE
reconstruction (middle), and the ground-truth (right). The masking ratio is 75%.
Figure 11. Uncurated random samples on COCO validation images, using an MAE trained on ImageNet. For each triplet, we show the
masked image (left), our MAE reconstruction (middle), and the ground-truth (right). The masking ratio is 75%.