This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
1. Introduction
Deep learning has witnessed an explosion of archi-
tectures of continuously growing capability and capacity
[28,24,47]. Aided by the rapid gains in hardware, mod-
els today can easily overfit one million images [13] and
begin to demand hundreds of millions of—often publicly
inaccessible—labeled images [16].
This appetite for data has been successfully addressed in
natural language processing (NLP) by self-supervised pre-
training. The solutions, based on autoregressive language
modeling in GPT [40,41,4] and masked autoencoding in
BERT [14], are conceptually simple: they remove a portion
of the data and learn to predict the removed content. These
methods now enable training of generalizable NLP models
containing over one hundred billion parameters [4].
Figure 1. Our MAE architecture (diagram labels: input, target).
During pre-training, a large
random subset of image patches (e.g., 75%) is masked out. The
encoder is applied to the small subset of visible patches. Mask
tokens are introduced after the encoder, and the full set of
encoded patches and mask tokens is processed by a small decoder
that reconstructs the original image in pixels. After pre-training,
the decoder is discarded and the encoder is applied to uncorrupted
images to produce representations for recognition tasks.
The idea of masked autoencoders, a form of more general
denoising autoencoders [48], is natural and applicable
in computer vision as well. Indeed, closely related research
in vision [49,39] preceded BERT. However, despite significant
interest in this idea following the success of BERT,
progress of autoencoding methods in vision lags behind
NLP. We ask: what makes masked autoencoding different
between vision and language? We attempt to answer this
question from the following perspectives:
(i) Until recently, architectures were different. In vision,
convolutional networks [29] were dominant over the last
decade [28]. Convolutions typically operate on regular grids
and it is not straightforward to integrate ‘indicators’ such as
mask tokens [14] or positional embeddings [47] into con-
volutional networks. This architectural gap, however, has
been addressed with the introduction of Vision Transform-
ers (ViT) [16] and should no longer present an obstacle.
(ii) Information density is different between language
and vision. Languages are human-generated signals that
are highly semantic and information-dense. When training
a model to predict only a few missing words per sentence,
this task appears to induce sophisticated language under-
standing. Images, on the contrary, are natural signals with
heavy spatial redundancy: e.g., a missing patch can be
recovered from neighboring patches with little high-level
understanding of parts, objects, and scenes. To overcome this
difference and encourage learning useful features, we show
that a simple strategy works well in computer vision: masking
a very high portion of random patches. This strategy
largely reduces redundancy and creates a challenging
self-supervisory task that requires holistic understanding beyond
low-level image statistics. To get a qualitative sense of our
reconstruction task, see Figures 2–4.
Figure 2. Example results on ImageNet validation images. For each triplet, we show the masked image (left), our MAE reconstruction
(middle), and the ground-truth (right). The masking ratio is 80%, leaving only 39 out of 196 patches. More examples are in the appendix.
As no loss is computed on visible patches, the model output on visible patches is qualitatively worse. One can simply overlay the output with the visible
patches to improve visual quality. We intentionally opt not to do this, so we can more comprehensively demonstrate the method’s behavior.
Figure 3. Example results on COCO validation images, using an MAE trained on ImageNet (the same model weights as in Figure 2).
Observe the reconstructions on the two right-most examples, which, although different from the ground truth, are semantically plausible.
(iii) The autoencoder’s decoder, which maps the latent
representation back to the input, plays a different role be-
tween reconstructing text and images. In vision, the decoder
reconstructs pixels, hence its output is of a lower semantic
level than common recognition tasks. This is in contrast
to language, where the decoder predicts missing words that
contain rich semantic information. While in BERT the de-
coder can be trivial (an MLP) [14], we found that for im-
ages, the decoder design plays a key role in determining the
semantic level of the learned latent representations.
Driven by this analysis, we present a simple, effective,
and scalable form of a masked autoencoder (MAE) for
visual representation learning. Our MAE masks random
patches from the input image and reconstructs the missing
patches in the pixel space. It has an asymmetric encoder-
decoder design. Our encoder operates only on the visible
subset of patches (without mask tokens), and our decoder is
lightweight and reconstructs the input from the latent rep-
resentation along with mask tokens (Figure 1). Shifting
the mask tokens to the small decoder in our asymmetric
encoder-decoder results in a large reduction in computation.
Under this design, a very high masking ratio (e.g., 75%) can
achieve a win-win scenario: it optimizes accuracy while al-
lowing the encoder to process only a small portion (e.g.,
25%) of patches. This can reduce overall pre-training time
by 3× or more and likewise reduce memory consumption,
enabling us to easily scale our MAE to large models.
Our MAE learns very high-capacity models that gen-
eralize well. With MAE pre-training, we can train data-
hungry models like ViT-Large/-Huge [16] on ImageNet-1K
with improved generalization performance. With a vanilla
ViT-Huge model, we achieve 87.8% accuracy when fine-
tuned on ImageNet-1K. This outperforms all previous re-
sults that use only ImageNet-1K data. We also evaluate
transfer learning on object detection, instance segmentation,
and semantic segmentation. In these tasks, our pre-training
achieves better results than its supervised pre-training coun-
terparts, and more importantly, we observe significant gains
by scaling up models. These observations are aligned
with those witnessed in self-supervised pre-training in NLP
[14,40,41,4] and we hope that they will enable our field to
explore a similar trajectory.
Figure 4. Reconstructions of ImageNet validation images using
an MAE pre-trained with a masking ratio of 75% but applied on
inputs with higher masking ratios (columns: original, mask 75%,
mask 85%, mask 95%). The predictions differ plausibly
from the original images, showing that the method can generalize.
2. Related Work
Masked language modeling and its autoregressive coun-
terparts, e.g., BERT [14] and GPT [40,41,4], are highly
successful methods for pre-training in NLP. These methods
hold out a portion of the input sequence and train models
to predict the missing content. These methods have been
shown to scale excellently [4] and a large abundance of ev-
idence indicates that these pre-trained representations gen-
eralize well to various downstream tasks.
Autoencoding is a classical method for learning representa-
tions. It has an encoder that maps an input to a latent repre-
sentation and a decoder that reconstructs the input. For ex-
ample, PCA and k-means are autoencoders [25]. Denoising
autoencoders (DAE) [48] are a class of autoencoders that
corrupt an input signal and learn to reconstruct the origi-
nal, uncorrupted signal. A series of methods can be thought
of as a generalized DAE under different corruptions, e.g.,
masking pixels [49,39,6] or removing color channels [59].
Our MAE is a form of denoising autoencoding, but different
from the classical DAE in numerous ways.
Masked image encoding methods learn representations
from images corrupted by masking. The pioneering work
of [49] presents masking as a noise type in DAE. Context
Encoder [39] inpaints large missing regions using convolu-
tional networks. Motivated by the success in NLP, related
recent methods [6,16,2] are based on Transformers [47].
iGPT [6] operates on sequences of pixels and predicts un-
known pixels. The ViT paper [16] studies masked patch
prediction for self-supervised learning. Most recently, BEiT
[2] proposes to predict discrete tokens [37,43].
Self-supervised learning approaches have seen significant
interest in computer vision, often focusing on different pre-
text tasks for pre-training [15,50,35,59,38,17]. Re-
cently, contrastive learning [3,21] has been popular, e.g.,
[51,36,22,7], which models image similarity and dis-
similarity (or only similarity [20,8]) between two or more
views. Contrastive and related methods strongly depend on
data augmentation [7,20,8]. Autoencoding pursues a con-
ceptually different direction, and it exhibits different behav-
iors as we will present.
3. Approach
Our masked autoencoder (MAE) is a simple autoencod-
ing approach that reconstructs the original signal given its
partial observation. Like all autoencoders, our approach
has an encoder that maps the observed signal to a latent
representation, and a decoder that reconstructs the origi-
nal signal from the latent representation. Unlike classical
autoencoders, we adopt an asymmetric design that allows
the encoder to operate only on the partial, observed signal
(without mask tokens) and a lightweight decoder that re-
constructs the full signal from the latent representation and
mask tokens. Figure 1 illustrates the idea, introduced next.
Masking. Following ViT [16], we divide an image into reg-
ular non-overlapping patches. Then we sample a subset of
patches and mask (i.e., remove) the remaining ones. Our
sampling strategy is straightforward: we sample random
patches without replacement, following a uniform distribu-
tion. We simply refer to this as “random sampling”.
Random sampling with a high masking ratio (i.e., the ra-
tio of removed patches) largely eliminates redundancy, thus
creating a task that cannot be easily solved by extrapolation
from visible neighboring patches (see Figures 2–4). The
uniform distribution prevents a potential center bias (i.e.,
more masked patches near the image center). Finally, the
highly sparse input creates an opportunity for designing an
efficient encoder, introduced next.
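The patch division and uniform sampling described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments; the helper names `patchify` and `sample_visible` and the 224×224 input with 16×16 patches are assumptions chosen for the example.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into non-overlapping patches, one row per patch."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    x = image.reshape(gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return x.reshape(gh * gw, patch_size * patch_size * c)

def sample_visible(num_patches, mask_ratio, rng):
    """Indices of visible patches, drawn uniformly without replacement."""
    num_keep = int(num_patches * (1 - mask_ratio))
    return np.sort(rng.choice(num_patches, size=num_keep, replace=False))

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))                   # stand-in for a real image
patches = patchify(image, 16)                       # 14 x 14 = 196 patches
visible = sample_visible(len(patches), 0.75, rng)   # 49 of 196 patches stay visible
```

With a masking ratio of 80% the same function keeps 39 of the 196 patches, which matches the count quoted in the Figure 2 caption.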
MAE encoder. Our encoder is a ViT [16] but applied only
on visible, unmasked patches. Just as in a standard ViT, our
encoder embeds patches by a linear projection with added
positional embeddings, and then processes the resulting set
via a series of Transformer blocks. However, our encoder
only operates on a small subset (e.g., 25%) of the full set.
Masked patches are removed; no mask tokens are used.
This allows us to train very large encoders with only a frac-
tion of compute and memory. The full set is handled by a
lightweight decoder, described next.
MAE decoder. The input to the MAE decoder is the full
set of tokens consisting of (i) encoded visible patches, and
(ii) mask tokens. See Figure 1. Each mask token [14] is a
shared, learned vector that indicates the presence of a miss-
ing patch to be predicted. We add positional embeddings to
all tokens in this full set; without this, mask tokens would
have no information about their location in the image. The
decoder has another series of Transformer blocks.
The MAE decoder is only used during pre-training to
perform the image reconstruction task (only the encoder
is used to produce image representations for recognition).
Therefore, the decoder architecture can be flexibly designed
in a manner that is independent of the encoder design. We
experiment with very small decoders, narrower and shal-
lower than the encoder. For example, our default decoder
has <10% computation per token vs. the encoder. With this
asymmetrical design, the full set of tokens is processed only
by the lightweight decoder, which significantly reduces
pre-training time.
Reconstruction target. Our MAE reconstructs the input
by predicting the pixel values for each masked patch. Each
element in the decoder’s output is a vector of pixel values
representing a patch. The last layer of the decoder is a lin-
ear projection whose number of output channels equals the
number of pixel values in a patch. The decoder’s output is
reshaped to form a reconstructed image. Our loss function
computes the mean squared error (MSE) between the recon-
structed and original images in the pixel space. We compute
the loss only on masked patches, similar to BERT [14].¹
We also study a variant whose reconstruction target is
the normalized pixel values of each masked patch. Specif-
ically, we compute the mean and standard deviation of all
pixels in a patch and use them to normalize this patch. Us-
ing normalized pixels as the reconstruction target improves
representation quality in our experiments.
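A minimal NumPy sketch of this loss is given below, assuming predictions and targets are already arranged as one row of pixel values per patch. The function name `mae_loss`, the `norm_pix` flag, and the small constant `eps` added to the standard deviation are assumptions made for illustration, not details taken from the text.

```python
import numpy as np

def mae_loss(pred, target, mask, norm_pix=False, eps=1e-6):
    """Mean squared error averaged over masked patches only.

    pred, target: (N, D) arrays, one row of pixel values per patch.
    mask: (N,) array with 1 for masked (removed) patches and 0 for visible ones.
    norm_pix: if True, normalize each target patch by its own mean and std.
    eps: assumed stabilizer to avoid division by zero on constant patches.
    """
    if norm_pix:
        mean = target.mean(axis=-1, keepdims=True)
        std = target.std(axis=-1, keepdims=True)
        target = (target - mean) / (std + eps)
    per_patch = ((pred - target) ** 2).mean(axis=-1)  # MSE of each patch
    return (per_patch * mask).sum() / mask.sum()      # visible patches contribute nothing
```

Because the per-patch error is multiplied by the mask, the decoder output on visible patches does not affect the loss.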
Simple implementation. Our MAE pre-training can be im-
plemented efficiently, and importantly, does not require any
specialized sparse operations. First we generate a token for
every input patch (by linear projection with an added po-
sitional embedding). Next we randomly shuffle the list of
tokens and remove the last portion of the list, based on the
masking ratio. This process produces a small subset of to-
kens for the encoder and is equivalent to sampling patches
without replacement. After encoding, we append a list of
mask tokens to the list of encoded patches, and unshuffle
this full list (inverting the random shuffle operation) to align
all tokens with their targets. The decoder is applied to this
full list (with positional embeddings added). As noted, no
sparse operations are needed. This simple implementation
introduces negligible overhead as the shuffling and unshuf-
fling operations are fast.
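The shuffle, truncate, and unshuffle steps can be sketched with NumPy as follows. This is an illustrative sketch under simplifying assumptions: the encoder is replaced by the identity, positional embeddings are omitted, and the names `random_masking` and `unshuffle` are hypothetical.

```python
import numpy as np

def random_masking(tokens, mask_ratio, rng):
    """Shuffle the token list and keep the first (1 - mask_ratio) portion."""
    n = tokens.shape[0]
    num_keep = int(n * (1 - mask_ratio))
    shuffle = rng.permutation(n)       # random order of token indices
    restore = np.argsort(shuffle)      # inverse permutation, used to unshuffle
    keep = shuffle[:num_keep]
    mask = np.ones(n)
    mask[keep] = 0                     # 0 = visible, 1 = masked
    return tokens[keep], mask, restore

def unshuffle(encoded, mask_token, restore):
    """Append mask tokens, then invert the shuffle to realign with targets."""
    num_mask = restore.shape[0] - encoded.shape[0]
    full = np.concatenate([encoded, np.tile(mask_token, (num_mask, 1))], axis=0)
    return full[restore]

rng = np.random.default_rng(0)
tokens = rng.random((196, 32))         # one token per patch (toy width 32)
mask_token = np.zeros(32)              # stands in for the shared learned vector
visible, mask, restore = random_masking(tokens, 0.75, rng)
full = unshuffle(visible, mask_token, restore)  # identity "encoder" for the demo
```

Only dense indexing and concatenation are involved, which is why no sparse operations are required.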
¹ Computing the loss only on masked patches differs from traditional
denoising autoencoders [48] that compute the loss on all pixels. This
choice is purely result-driven: computing the loss on all pixels leads to
a slight decrease in accuracy (e.g., 0.5%).
Figure 5. Masking ratio. A high masking ratio (75%) works well
for both fine-tuning (top) and linear probing (bottom). The x-axes
are the masking ratio (%), from 10 to 90. The y-axes
are ImageNet-1K validation accuracy (%) in all plots in this paper.
4. ImageNet Experiments
We do self-supervised pre-training on the ImageNet-1K
(IN1K) [13] training set. Then we do supervised training to
evaluate the representations with (i) end-to-end fine-tuning
or (ii) linear probing. We report top-1 validation accuracy
of a single 224×224 crop. Details are in Appendix A.1.
Baseline: ViT-Large. We use ViT-Large (ViT-L/16) [16]
as the backbone in our ablation study. ViT-L is very big (an
order of magnitude bigger than ResNet-50 [24]) and tends
to overfit. The following is a comparison between ViT-L
trained from scratch vs. fine-tuned from our baseline MAE:
scratch, original [16]    scratch, our impl.    baseline MAE
76.5                      82.5                  84.9
We note that it is nontrivial to train supervised ViT-L from
scratch and a good recipe with strong regularization is
needed (82.5%, see Appendix A.2). Even so, our MAE pre-
training contributes a big improvement. Here fine-tuning is
only for 50 epochs (vs. 200 from scratch), implying that the
fine-tuning accuracy heavily depends on pre-training.
4.1. Main Properties
We ablate our MAE using the default settings in Table 1
(see caption). Several intriguing properties are observed.
Masking ratio. Figure 5 shows the influence of the masking
ratio. The optimal ratios are surprisingly high. The ratio
of 75% is good for both linear probing and fine-tuning.
This behavior is in contrast with BERT [14], whose typical
masking ratio is 15%. Our masking ratios are also much
higher than those in related works [6,16,2] in computer
vision (20% to 50%).
The model infers missing patches to produce different,
yet plausible, outputs (Figure 4). It makes sense of the
gestalt of objects and scenes, which cannot be simply com-
pleted by extending lines or textures. We hypothesize that
this reasoning-like behavior is linked to the learning of use-
ful representations.
Figure 5 also shows that linear probing and fine-tuning
results follow different trends. For linear probing, the
accuracy increases steadily with the masking ratio until the
sweet spot: the accuracy gap is nearly 20% (54.6% vs.
73.5%). For fine-tuning, the results are less sensitive to the
ratios, and a wide range of masking ratios (40–80%) work
well. All fine-tuning results in Figure 5 are better than
training from scratch (82.5%).
Table 1. MAE ablation experiments with ViT-L/16 on ImageNet-1K. We report fine-tuning (ft) and linear probing (lin) accuracy (%). If
not specified, the default is: the decoder has depth 8 and width 512, the reconstruction target is unnormalized pixels, the data augmentation
is random resized cropping, the masking ratio is 75%, and the pre-training length is 800 epochs. Default settings are marked in gray.
(a) Decoder depth. A deep decoder can improve linear probing accuracy.
blocks   ft     lin
1        84.8   65.5
2        84.9   70.0
4        84.9   71.9
8        84.9   73.5
12       84.4   73.3
(b) Decoder width. The decoder can be narrower than the encoder (1024-d).
dim      ft     lin
128      84.9   69.1
256      84.8   71.3
512      84.9   73.5
768      84.4   73.1
1024     84.3   73.1
(c) Mask token. An encoder without mask tokens is more accurate and faster (Table 2).
case               ft     lin    FLOPs
encoder w/ [M]     84.2   59.6   3.3×
encoder w/o [M]    84.9   73.5   1×
(d) Reconstruction target. Pixels as reconstruction targets are effective.
case               ft     lin
pixel (w/o norm)   84.9   73.5
pixel (w/ norm)    85.4   73.9
PCA                84.6   72.3
dVAE token         85.3   71.6
(e) Data augmentation. Our MAE works with minimal or no augmentation.
case               ft     lin
none               84.0   65.7
crop, fixed size   84.7   73.1
crop, rand size    84.9   73.5
crop + color jit   84.3   71.9
(f) Mask sampling. Random sampling works the best. See Figure 6 for visualizations.
case     ratio   ft     lin
random   75      84.9   73.5
block    50      83.9   72.3
block    75      82.8   63.9
grid     75      84.0   66.0
Decoder design. Our MAE decoder can be flexibly
designed, as studied in Tables 1a and 1b.
Table 1a varies the decoder depth (number of Trans-
former blocks). A sufficiently deep decoder is important
for linear probing. This can be explained by the gap be-
tween a pixel reconstruction task and a recognition task: the
last several layers in an autoencoder are more specialized
for reconstruction, but are less relevant for recognition. A
reasonably deep decoder can account for the reconstruction
specialization, leaving the latent representations at a more
abstract level. This design can yield up to 8% improvement
in linear probing (Table 1a, ‘lin’). However, if fine-tuning
is used, the last layers of the encoder can be tuned to adapt
to the recognition task. The decoder depth is less influential
for improving fine-tuning (Table 1a, ‘ft’).
Interestingly, our MAE with a single-block decoder can
perform strongly with fine-tuning (84.8%). Note that a sin-
gle Transformer block is the minimal requirement to propa-
gate information from visible tokens to mask tokens. Such
a small decoder can further speed up training.
In Table 1b we study the decoder width (number of chan-
nels). We use 512-d by default, which performs well un-
der fine-tuning and linear probing. A narrower decoder also
works well with fine-tuning.
Overall, our default MAE decoder is lightweight. It has
8 blocks and a width of 512-d (gray in Table 1). It only
has 9% FLOPs per token vs. ViT-L (24 blocks, 1024-d).
As such, while the decoder processes all tokens, it is still a
small fraction of the overall compute.
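A back-of-the-envelope check of the per-token cost is sketched below. It assumes a standard Transformer block with an MLP expansion ratio of 4 and counts only the linear layers (about 12·dim² multiply-accumulates per token per block), ignoring the token-to-token attention term, embeddings, and prediction layers. These simplifications and the function name are assumptions, so the result is only expected to land near the 9% figure rather than match it.

```python
def block_macs_per_token(dim, mlp_ratio=4):
    """Approximate multiply-accumulates per token in one Transformer block."""
    attn_proj = 4 * dim * dim          # query, key, value, and output projections
    mlp = 2 * mlp_ratio * dim * dim    # two linear layers with hidden size mlp_ratio * dim
    return attn_proj + mlp

decoder = 8 * block_macs_per_token(512)     # default decoder: 8 blocks, 512-d
encoder = 24 * block_macs_per_token(1024)   # ViT-L encoder: 24 blocks, 1024-d
ratio = decoder / encoder                   # equals 1/12, i.e. roughly 8%
```

Under these assumptions the decoder costs about one twelfth of the encoder per token, which is in line with the stated value of 9%.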
encoder          dec. depth   ft acc   hours    speedup
ViT-L, w/ [M]    8            84.2     42.4     -
ViT-L            8            84.9     15.4     2.8×
ViT-L            1            84.8     11.6     3.7×
ViT-H, w/ [M]    8            -        119.6†   -
ViT-H            8            85.8     34.5     3.5×
ViT-H            1            85.9     29.3     4.1×
Table 2. Wall-clock time of our MAE training (800 epochs),
benchmarked in 128 TPU-v3 cores with TensorFlow. The speedup
is relative to the entry whose encoder has mask tokens (gray). The
decoder width is 512, and the mask ratio is 75%. †: This entry is
estimated by training ten epochs.
Mask token. An important design of our MAE is to skip
the mask token [M] in the encoder and apply it later in the
lightweight decoder. Table 1c studies this design.
If the encoder uses mask tokens, it performs worse: its
accuracy drops by 14% in linear probing. In this case,
there is a gap between pre-training and deploying: this en-
coder has a large portion of mask tokens in its input in pre-
training, which does not exist in uncorrupted images. This
gap may degrade accuracy in deployment. By removing the
mask token from the encoder, we constrain the encoder to
always see real patches and thus improve accuracy.
Moreover, by skipping the mask token in the encoder,
we greatly reduce training computation. In Table 1c, we
reduce the overall training FLOPs by 3.3×. This leads to
a 2.8× wall-clock speedup in our implementation (see Table 2).
The wall-clock speedup is even bigger (3.5–4.1×)
for a smaller decoder (1-block), a larger encoder (ViT-H),
or both. Note that the speedup can be >4× for a masking
ratio of 75%, partially because the self-attention complexity
is quadratic. In addition, memory is greatly reduced, which
can enable training even larger models or speeding up more
by large-batch training. The time and memory efficiency
makes our MAE favorable for training very large models.
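The FLOPs reduction can be approximated from numbers already given in the text: 196 patches per image, 25% of them visible, and a decoder that costs about 9% of the encoder per token. The sketch below treats cost as linear in the number of tokens, which is an assumption that ignores the quadratic self-attention term, so it slightly underestimates the saving. The function name `relative_cost` is hypothetical.

```python
def relative_cost(enc_tokens, dec_tokens, dec_cost_per_token=0.09):
    """Training cost in units of 'one token through the encoder'."""
    return enc_tokens * 1.0 + dec_tokens * dec_cost_per_token

num_patches = 196
num_visible = 49                                        # 25% of 196
with_mask = relative_cost(num_patches, num_patches)     # encoder also sees mask tokens
without_mask = relative_cost(num_visible, num_patches)  # encoder sees visible patches only
saving = with_mask / without_mask                       # roughly 3.2
```

The estimate of roughly 3.2× is close to the 3.3× reported in Table 1c; the remaining difference is consistent with the quadratic attention cost that this linear model leaves out.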
Figure 6. Mask sampling strategies determine the pretext task
difficulty, influencing reconstruction quality and representations
(Table 1f). Here each output is from an MAE trained with the spec-
ified masking strategy. Left: random sampling (our default). Mid-
dle: block-wise sampling [2] that removes large random blocks.
Right: grid-wise sampling that keeps one of every four patches.
Images are from the validation set.
Reconstruction target. We compare different reconstruc-
tion targets in Table 1d. Our results thus far are based on
pixels without (per-patch) normalization. Using pixels with
normalization improves accuracy. This per-patch normal-
ization enhances the contrast locally. In another variant, we
perform PCA in the patch space and use the largest PCA
coefficients (96 here) as the target. Doing so degrades ac-
curacy. Both experiments suggest that the high-frequency
components are useful in our method.
We also compare an MAE variant that predicts tokens,
the target used in BEiT [2]. Specifically for this variant,
we use the DALLE pre-trained dVAE [43] as the tokenizer,
following [2]. Here the MAE decoder predicts the token in-
dices using cross-entropy loss. This tokenization improves
fine-tuning accuracy by 0.4% vs. unnormalized pixels, but
has no advantage vs. normalized pixels. It also reduces
linear probing accuracy. In §5 we further show that
tokenization is not necessary in transfer learning.
Our pixel-based MAE is much simpler than tokeniza-
tion. The dVAE tokenizer requires one more pre-training
stage, which may depend on extra data (250M images [43]).
The dVAE encoder is a large convolutional network (40%
FLOPs of ViT-L) and adds nontrivial overhead. Using pix-
els does not suffer from these problems.
Data augmentation. Table 1e studies the influence of data
augmentation on our MAE pre-training.
Our MAE works well using cropping-only augmenta-
tion, either fixed-size or random-size (both having random
horizontal flipping). Adding color jittering degrades the re-
sults and so we do not use it in other experiments.
Surprisingly, our MAE behaves decently even if using
no data augmentation (only center-crop, no flipping). This
property is dramatically different from contrastive learning
and related methods [51,22,7,20], which heavily rely
on data augmentation. It was observed [20] that using
cropping-only augmentation reduces the accuracy by 13%
and 28% respectively for BYOL [20] and SimCLR [7]. In
addition, there is no evidence that contrastive learning can
work without augmentation: the two views of an image are
the same and can easily satisfy a trivial solution.
Figure 7. Training schedules. A longer training schedule gives a
noticeable improvement. Here each point is a full training schedule.
The model is ViT-L with the default setting in Table 1.
Panels: fine-tuning and linear probing; the x-axes are epochs
(log-scale), from 100 to 1600.
In MAE, the role of data augmentation is mainly per-
formed by random masking (ablated next). The masks are
different for each iteration and so they generate new training
samples regardless of data augmentation. The pretext task
is made difficult by masking and requires less augmentation
to regularize training.
Mask sampling strategy. In Table 1f we compare different
mask sampling strategies, illustrated in Figure 6.
The block-wise masking strategy, proposed in [2], tends
to remove large blocks (Figure 6 middle). Our MAE with
block-wise masking works reasonably well at a ratio of
50%, but degrades at a ratio of 75%. This task is harder
than that of random sampling, as a higher training loss is
observed. The reconstruction is also blurrier.
We also study grid-wise sampling, which regularly keeps
one of every four patches (Figure 6 right). This is an
easier task and has lower training loss. The reconstruction is
sharper. However, the representation quality is lower.
Simple random sampling works the best for our MAE. It
allows for a higher masking ratio, which provides a greater
speedup benefit while also enjoying good accuracy.
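Two of the compared strategies are easy to sketch as boolean masks on the 14×14 patch grid. This is an illustration only: the function names are hypothetical, and keeping the top-left patch of every 2×2 cell is an assumed way to realize "one of every four patches". Block-wise masking [2] is omitted because its sampling procedure is not described here.

```python
import numpy as np

def random_mask(grid, mask_ratio, rng):
    """Uniform random sampling: True marks a masked (removed) patch."""
    n = grid * grid
    num_keep = int(n * (1 - mask_ratio))
    mask = np.ones(n, dtype=bool)
    mask[rng.permutation(n)[:num_keep]] = False
    return mask.reshape(grid, grid)

def grid_mask(grid):
    """Grid-wise sampling: keep one patch (assumed top-left) in every 2x2 cell."""
    mask = np.ones((grid, grid), dtype=bool)
    mask[::2, ::2] = False
    return mask

rng = np.random.default_rng(0)
m_random = random_mask(14, 0.75, rng)   # 147 of 196 patches masked
m_grid = grid_mask(14)                  # also 147 of 196 masked, in a regular pattern
```

Both masks remove 75% of the patches; they differ only in where the visible patches fall, which is what changes the difficulty of the pretext task.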
Training schedule. Our ablations thus far are based on
800-epoch pre-training. Figure 7 shows the influence of the
training schedule length. The accuracy improves steadily
with longer training. Indeed, we have not observed sat-
uration of linear probing accuracy even at 1600 epochs.
This behavior is unlike contrastive learning methods, e.g.,
MoCo v3 [9] saturates at 300 epochs for ViT-L. Note that
the MAE encoder only sees 25% of patches per epoch,
while in contrastive learning the encoder sees 200% (two-
crop) or even more (multi-crop) patches per epoch.
method               pre-train data   ViT-B   ViT-L   ViT-H   ViT-H (448)
scratch, our impl.   -                82.3    82.6    83.1    -
DINO [5]             IN1K             82.8    -       -       -
MoCo v3 [9]          IN1K             83.2    84.1    -       -
BEiT [2]             IN1K+DALLE       83.2    85.2    -       -
MAE                  IN1K             83.6    85.9    86.9    87.8
Table 3. Comparisons with previous results on ImageNet-1K.
The pre-training data is the ImageNet-1K training set (except
the tokenizer in BEiT was pre-trained on 250M DALLE data
[43]). All self-supervised methods are evaluated by end-to-end
fine-tuning. The ViT models are B/16, L/16, H/14 [16]. The best
for each column is underlined. All results are on an image size of
224, except for ViT-H with an extra result on 448. Here our MAE
reconstructs normalized pixels and is pre-trained for 1600 epochs.
Figure 8. MAE pre-training vs. supervised pre-training, evaluated
by fine-tuning in ImageNet-1K (224 size). We compare with
the original ViT results [16] trained in IN1K or JFT300M.
The x-axis is params (M), from 0 to 600; the supervised curves are
labeled "supervised, IN1K, our impl.", "supervised, IN1K", and
"supervised, JFT300M".
4.2. Comparisons with Previous Results
Comparisons with self-supervised methods. In Table 3
we compare the fine-tuning results of self-supervised ViT
models. For ViT-B, all methods perform closely. For ViT-L,
the gaps among methods are bigger, suggesting that a chal-
lenge for bigger models is to reduce overfitting.
Our MAE can scale up easily and has shown steady im-
provement from bigger models. We obtain 86.9% accuracy
using ViT-H (224 size). By fine-tuning with a 448 size, we
achieve 87.8% accuracy, using only IN1K data. The pre-
vious best accuracy, among all methods using only IN1K
data, is 87.1% (512 size) [56], based on advanced networks.
We improve over the state-of-the-art by a nontrivial margin
in the highly competitive benchmark of IN1K (no external
data). Our result is based on vanilla ViT, and we expect
advanced networks will perform better.
Comparing with BEiT [2], our MAE is more accurate
while being simpler and faster. Our method reconstructs
pixels, in contrast to BEiT that predicts tokens: BEiT re-
ported a 1.8% degradation [2] when reconstructing pixels
with ViT-B.² We do not need dVAE pre-training. Moreover,
our MAE is considerably faster (3.5× per epoch) than
BEiT, for the reason studied in Table 1c.
² We observed the degradation also in BEiT with ViT-L: it produces
85.2% (tokens) and 83.5% (pixels), reproduced from the official code.
Figure 9. Partial fine-tuning results of ViT-L w.r.t. the number
of fine-tuned Transformer blocks under the default settings from
Table 1. Tuning 0 blocks is linear probing; 24 is full fine-tuning.
Our MAE representations are less linearly separable, but are
consistently better than MoCo v3 if one or more blocks are tuned.
The x-axis is the number of blocks fine-tuned (0, 1, 2, 4, 6, 12, 18, 24);
the curves are the MAE baseline and MoCo v3.
The MAE models in Table 3 are pre-trained for 1600
epochs for better accuracy (Figure 7). Even so, our total
pre-training time is less than all other methods if they were
trained on the same hardware. For example, with ViT-L,
our MAE’s training time is 31 hours for 1600 epochs and
MoCo v3’s is 36 hours for 300 epochs [9], using the same
128 TPU-v3 cores.
Comparisons with supervised pre-training. In the origi-
nal ViT paper [16], ViT-L degrades when trained in IN1K.
See Figure 8. Our improved supervised recipe works better
for training from scratch (Figure 8, “our impl.”; see A.2),
but the accuracy is saturated.
Our MAE pre-training, using only IN1K, can general-
ize better: the gain over training from scratch is bigger for
higher-capacity models. It follows a trend similar to the
JFT-300M supervised pre-training in [16]. This compari-
son shows that our MAE can help scale up model sizes.
4.3. Partial Fine-tuning
Table 1 shows that linear probing and fine-tuning results
are largely uncorrelated. Linear probing has been a popular
protocol in the past few years; however, it misses the oppor-
tunity of pursuing strong but non-linear features—which is
indeed a strength of deep learning. As a middle ground, we
study a partial fine-tuning protocol: fine-tune the last sev-
eral layers while freezing the others. This protocol was also
used in early works, e.g., [54,59,35].
Figure 9 shows the results. Notably, fine-tuning only one
Transformer block boosts the accuracy significantly from
73.5% to 81.0%. Moreover, if we fine-tune only “half” of
the last block (i.e., its MLP sub-block), we can get 79.1%,
much better than linear probing. This variant is essentially
fine-tuning an MLP head. Fine-tuning a few blocks (e.g.,
4 or 6) can achieve decent accuracy, which is still a small
fine-tuning head compared with the frozen backbone.
In Figure 9 we also compare with MoCo v3 [9], which
is a contrastive method with ViT-L results available. It has
higher linear probing accuracy than our MAE. However, all
APbox APmask
method pre-train data ViT-B ViT-L ViT-B ViT-L
supervised IN1K w/ labels 47.9 49.3 42.9 43.9
MoCo v3 IN1K 47.9 49.3 42.7 44.0
BEiT IN1K+DALLE 49.8 53.3 44.4 47.1
MAE IN1K 50.3 53.3 44.9 47.2
Table 4. COCO object detection and segmentation using a ViT
Mask R-CNN baseline. All entries are based on our implementa-
tion. Self-supervised entries use IN1K data without labels. Mask
AP follows a similar trend as box AP.
of its partial fine-tuning results are worse than ours. The gap
is 2.6% when tuning 4 blocks. These results show that the
MAE representations are less linearly separable, but they
are stronger non-linear features and perform well when a
non-linear head is tuned.
These observations suggest that linear separability is not
the sole metric for evaluating representation quality. It has
also been observed (e.g., [8]) that linear probing is not well
correlated with transfer learning performance, e.g., for ob-
ject detection. To our knowledge, linear evaluation is not
often used in NLP for benchmarking pre-training.
5. Transfer Learning Experiments
We evaluate transfer learning in object detection and seg-
mentation on COCO [32] and semantic segmentation on
ADE20K [60]. We use the pre-trained models in Table 3.
Object detection and segmentation. We fine-tune Mask
R-CNN [23] end-to-end on COCO. The ViT backbone is
adapted for use with FPN [31] (see Appendix A.3). We
apply this object detection system to all entries in Table 4.
We report box AP for object detection and mask AP for
instance segmentation.
Compared to supervised pre-training, our MAE performs
better under all configurations (Table 4). With the smaller
ViT-B, our MAE is 2.4 points higher than supervised pre-
training (50.3 vs. 47.9, APbox). More significantly, with the
larger ViT-L, our MAE pre-training outperforms supervised
pre-training by 4.0 points (53.3 vs. 49.3).
The pixel-based MAE is better than or on par with the
token-based BEiT, while MAE is much simpler and faster.
Both MAE and BEiT are better than MoCo v3, and MoCo
v3 is on par with supervised pre-training.
Semantic segmentation. Our experiments on ADE20K
use UperNet [52] following the code in [2]. Details are in
A.4. Table 5 shows that our MAE significantly improves
the transfer results of ViT-L, which are 3.7 points better
than the supervised pre-training counterpart (53.6 vs. 49.9).
The pixel-based MAE outperforms the token-based BEiT.
These observations are consistent with those in COCO.
method      pre-train data    ViT-B  ViT-L
supervised  IN1K w/ labels    47.4   49.9
MoCo v3     IN1K              47.3   49.1
BEiT        IN1K+DALLE        47.1   53.3
MAE         IN1K              48.1   53.6
Table 5. ADE20K semantic segmentation (mIoU) using UperNet.
BEiT results are reproduced using the official code. Other
entries are based on our implementation. Self-supervised entries
use IN1K data without labels.
Pixels vs. tokens. Table 6 presents an all-around comparison
of pixels vs. tokens as the MAE reconstruction target.
While using dVAE tokens is better than using unnormalized
pixels, it is statistically similar to just using normalized pixels
across all tasks and models we studied. It again shows
that tokenization is not necessary for our MAE.
                    IN1K                   COCO            ADE20K
                    ViT-B  ViT-L  ViT-H    ViT-B  ViT-L    ViT-B  ViT-L
pixel (w/o norm)    83.3   85.1   86.2     49.5   52.8     48.0   51.8
pixel (w/ norm)     83.6   85.9   86.9     50.3   53.3     48.1   53.6
dVAE token          83.6   85.7   86.9     50.3   53.2     48.1   53.4
Δ                    0.0   -0.2    0.0      0.0   -0.1      0.0   -0.2
Table 6. Pixels vs. tokens as the MAE reconstruction target. Δ is
the difference between using dVAE tokens and using normalized
pixels. The difference is statistically insignificant.
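The difference row of Table 6 can be recomputed directly from the two rows it compares. The sketch below is only a consistency check on the reported numbers; the list names are illustrative, and the column order follows the table (three ImageNet columns, two COCO columns, two ADE20K columns).

```python
# Rows of Table 6, in column order.
pixel_norm = [83.6, 85.9, 86.9, 50.3, 53.3, 48.1, 53.6]
dvae_token = [83.6, 85.7, 86.9, 50.3, 53.2, 48.1, 53.4]

# Difference (dVAE tokens minus normalized pixels), rounded to one decimal
# place to hide floating-point noise.
delta = [round(t - p, 1) for t, p in zip(dvae_token, pixel_norm)]
largest_gap = max(abs(d) for d in delta)

print(delta)
print(largest_gap)
```

No column differs by more than 0.2 points, which is the basis for treating the two targets as equivalent.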
6. Discussion and Conclusion
Simple algorithms that scale well are the core of deep
learning. In NLP, simple self-supervised learning methods
(e.g., [40,14,41,4]) enable benefits from exponentially
scaling models. In computer vision, practical pre-training
paradigms are dominantly supervised (e.g., [28,44,24,16])
despite progress in self-supervised learning. In this study,
we observe on ImageNet and in transfer learning that
an autoencoder—a simple self-supervised method similar
to techniques in NLP—provides scalable benefits. Self-
supervised learning in vision may now be embarking on a
similar trajectory as in NLP.
On the other hand, we note that images and languages
are signals of a different nature and this difference must
be addressed carefully. Images are merely recorded light
without a semantic decomposition into the visual analogue
of words. Instead of attempting to remove objects, we re-
move random patches that most likely do not form a seman-
tic segment. Likewise, our MAE reconstructs pixels, which
are not semantic entities. Nevertheless, we observe (e.g.,
Figure 4) that our MAE infers complex, holistic reconstruc-
tions, suggesting it has learned numerous visual concepts,
i.e., semantics. We hypothesize that this behavior occurs
by way of a rich hidden representation inside the MAE. We
hope this perspective will inspire future work.
Broader impacts. The proposed method predicts content
based on learned statistics of the training dataset and as such
will reflect biases in those data, including ones with negative
societal impacts. The model may generate nonexistent
content. These issues warrant further research and consid-
eration when building upon this work to generate images.
A. Implementation Details
A.1. ImageNet Experiments
ViT architecture. We follow the standard ViT architecture
[16]. It has a stack of Transformer blocks [47], and each
block consists of a multi-head self-attention block and an
MLP block, both having LayerNorm (LN) [1]. The encoder
ends with LN. As the MAE encoder and decoder have different
widths, we adopt a linear projection layer after the
encoder to match them. Our MAE adds positional embeddings
[47] (the sine-cosine version) to both the encoder and de-
coder inputs. Our MAE does not use relative position or
layer scaling (which are used in the code of [2]).
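For reference, the sine-cosine positional embedding of [47] can be sketched in a few lines. This shows only the 1D formulation from the Transformer paper; how the encoding is laid out over the 2D patch grid is not specified in this section, so the function below (a hypothetical helper, with the usual temperature of 10000 assumed) should be read as an illustration of the idea rather than the exact embedding used here.

```python
import math


def sincos_position_embedding(num_positions, dim, temperature=10000.0):
    """1D sine-cosine embeddings: even channels use sin, odd channels use cos."""
    assert dim % 2 == 0, "dim must be even"
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim // 2):
            # Each channel pair shares one frequency.
            angle = pos / (temperature ** (2 * i / dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


pe = sincos_position_embedding(num_positions=4, dim=8)
print(len(pe), len(pe[0]))
```

Because the table is a fixed function of position, it adds no learnable parameters and can be built separately for the encoder and decoder widths.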
We extract features from the encoder output for fine-
tuning and linear probing. As ViT has a class token [16],
to adapt to this design, in our MAE pre-training we append
an auxiliary dummy token to the encoder input. This token
will be treated as the class token for training the classifier in
linear probing and fine-tuning. Our MAE works similarly
well without this token (with average pooling).
Pre-training. The default setting is in Table 7. We do
not use color jittering, drop path, or gradient clipping. We use
xavier_uniform [18] to initialize all Transformer blocks,
following ViT’s official code [16]. We use the linear lr scaling
rule [19]: lr = base_lr × batchsize / 256.
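As a worked instance of the linear scaling rule, the sketch below plugs in the pre-training values from Table 7. The function name `scaled_lr` is illustrative.

```python
def scaled_lr(base_lr, batch_size, reference_batch=256):
    """Linear lr scaling rule: lr = base_lr * batch_size / 256."""
    return base_lr * batch_size / reference_batch


# Pre-training values from Table 7: base lr 1.5e-4, batch size 4096.
pretrain_lr = scaled_lr(1.5e-4, 4096)
print(pretrain_lr)
```

With a batch size of 4096 the multiplier is 16, so the effective pre-training learning rate is 16 times the base value.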
End-to-end fine-tuning. Our fine-tuning follows common
practice of supervised ViT training. The default setting is in
Table 8. We use layer-wise lr decay [10] following [2].
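Layer-wise lr decay assigns smaller learning rates to earlier layers. The section does not spell out the formula, so the sketch below assumes a common geometric scheme in which the multiplier shrinks by a fixed factor per layer, counting back from the classifier head; the indexing convention and the function name are assumptions.

```python
def layerwise_lr_scales(num_layers, decay):
    """Per-layer lr multipliers under an assumed geometric scheme.

    Index 0 is the embedding layer, indices 1..num_layers are the
    Transformer blocks, and index num_layers + 1 is the classifier head.
    """
    return [decay ** (num_layers + 1 - i) for i in range(num_layers + 2)]


# 12 blocks with the decay value listed in Table 8.
scales = layerwise_lr_scales(num_layers=12, decay=0.75)
print(scales[-1], scales[-2], scales[0])
```

Each entry would multiply the global learning rate for the parameters of that layer, so the head trains at the full rate and the embedding layer at the smallest one.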
Linear probing. Our linear classifier training follows [9].
See Table 9. We observe that linear probing requires a very
different recipe than end-to-end fine-tuning. In particular,
regularization is in general harmful for linear probing. Fol-
lowing [9], we disable many common regularization strate-
gies: we do not use mixup [58], cutmix [57], drop path [26],
or color jittering, and we set weight decay to zero.
It is a common practice to normalize the classifier input
when training a classical linear classifier (e.g., SVM [11]).
Similarly, it is beneficial to normalize the pre-trained fea-
tures when training the linear probing classifier. Follow-
ing [15], we adopt an extra BatchNorm layer [27] without
affine transformation (affine=False). This layer is applied
to the pre-trained features produced by the encoder,
and precedes the linear classifier. We note that the layer
does not break the linear property, and it can be absorbed
into the linear classifier after training: it is essentially a
re-parameterized linear classifier.³ Introducing this layer helps
calibrate the feature magnitudes across different variants in
our ablations, so that they can use the same setting without
further lr search.
³ Alternatively, we can pre-compute the mean and std of the features
and use the normalized features to train linear classifiers.
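The claim that a normalization layer without affine parameters can be absorbed into the linear classifier can be checked numerically. The sketch below assumes the statistics are fixed after training (as with pre-computed mean and std), uses a small `eps` for numerical stability, and works on a toy feature matrix; all names and values are illustrative.

```python
import math


def feature_stats(features, eps=1e-6):
    """Per-dimension mean and std (biased variance) over a set of features."""
    n, d = len(features), len(features[0])
    mean = [sum(row[j] for row in features) / n for j in range(d)]
    var = [sum((row[j] - mean[j]) ** 2 for row in features) / n for j in range(d)]
    std = [math.sqrt(v + eps) for v in var]
    return mean, std


def normalized_linear(x, mean, std, w, b):
    """Linear classifier applied to normalized features."""
    z = [(xj - mj) / sj for xj, mj, sj in zip(x, mean, std)]
    return sum(wj * zj for wj, zj in zip(w, z)) + b


def absorb(mean, std, w, b):
    """Fold the fixed normalization into the classifier weights and bias."""
    w_folded = [wj / sj for wj, sj in zip(w, std)]
    b_folded = b - sum(wj * mj / sj for wj, mj, sj in zip(w, mean, std))
    return w_folded, b_folded


features = [[1.0, 10.0], [2.0, 30.0], [4.0, 20.0]]
mean, std = feature_stats(features)
w, b = [0.5, -0.25], 0.1

w_folded, b_folded = absorb(mean, std, w, b)
x = features[1]
out_normalized = normalized_linear(x, mean, std, w, b)
out_folded = sum(wj * xj for wj, xj in zip(w_folded, x)) + b_folded
print(abs(out_normalized - out_folded) < 1e-9)
```

The two outputs agree up to floating-point error because w·(x − μ)/σ + b can be rewritten as (w/σ)·x + (b − w·μ/σ), which is again a linear classifier on the raw features.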
config                  value
optimizer               AdamW [34]
base learning rate      1.5e-4
weight decay            0.05
optimizer momentum      β1, β2 = 0.9, 0.95 [6]
batch size              4096
learning rate schedule  cosine decay [33]
warmup epochs [19]      40
augmentation            RandomResizedCrop
Table 7. Pre-training setting.
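The schedule in Table 7 (warmup followed by cosine decay) can be sketched as follows. The linear shape of the warmup, the per-epoch granularity, and the zero final learning rate are assumptions; the 1600-epoch length is the schedule mentioned for the Table 3 models and is used here only for illustration.

```python
import math


def lr_at_epoch(epoch, peak_lr, warmup_epochs, total_epochs, min_lr=0.0):
    """Linear warmup to peak_lr, then half-cycle cosine decay to min_lr."""
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + (peak_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))


# Peak lr from the linear scaling rule with the Table 7 values.
peak = 1.5e-4 * 4096 / 256
schedule = [lr_at_epoch(e, peak, warmup_epochs=40, total_epochs=1600)
            for e in range(1600)]
print(schedule[0], schedule[40], schedule[-1])
```

In practice the learning rate is often updated per iteration rather than per epoch; passing a fractional epoch to the same function gives that behavior.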
config                      value
optimizer                   AdamW
base learning rate          1e-3
weight decay                0.05
optimizer momentum          β1, β2 = 0.9, 0.999
layer-wise lr decay [10,2]  0.75
batch size                  1024
learning rate schedule      cosine decay
warmup epochs               5
training epochs             100 (B), 50 (L/H)
augmentation                RandAug (9, 0.5) [12]
label smoothing [45]        0.1
mixup [58]                  0.8
cutmix [57]                 1.0
drop path [26]              0.1 (B/L), 0.2 (H)
Table 8. End-to-end fine-tuning setting.
config                  value
optimizer               LARS [55]
base learning rate      0.1
weight decay            0
optimizer momentum      0.9
batch size              16384
learning rate schedule  cosine decay
warmup epochs           10
training epochs         90
augmentation            RandomResizedCrop
Table 9. Linear probing setting. We use LARS with a large batch
for faster training; SGD works similarly with a 4096 batch size.
Partial fine-tuning. Our MAE partial fine-tuning (§4.3)
follows the setting in Table 8, except that we adjust the num-
ber of fine-tuning epochs. We observe that tuning fewer
blocks requires a longer schedule. We set the numbers of
fine-tuning epochs as {50, 100, 200} and use the optimal
one for each number of blocks tuned.
A.2. Supervised Training ViT-L/H from Scratch
We find that it is nontrivial to train supervised ViT-L/H
from scratch on ImageNet-1K. The training is unstable.
While there have been strong baselines with publicly avail-
able implementations [46] for smaller models, the recipes
for the larger ViT-L/H are unexplored. Directly applying
the previous recipes to these larger models does not work.
A NaN loss is frequently observed during training.
We provide our recipe in Table 10. We use a weight decay of 0.3,
a large batch size of 4096, and a long warmup, following
the original ViT [16]. We use β2=0.95 following [6]. We
use the regularizations listed in Table 10 and disable others,
following [53]. All these choices are for improving training
stability. Our recipe can finish training with no NaN loss.
config                      value
optimizer                   AdamW
base learning rate          1e-4
weight decay                0.3
optimizer momentum          β1, β2 = 0.9, 0.95
batch size                  4096
learning rate schedule      cosine decay
warmup epochs               20
training epochs             300 (B), 200 (L/H)
augmentation                RandAug (9, 0.5) [12]
label smoothing [45]        0.1
mixup [58]                  0.8
cutmix [57]                 1.0
drop path [26]              0.1 (B), 0.2 (L/H)
exp. moving average (EMA)   0.9999
Table 10. Supervised training ViT from scratch.
The accuracy is 82.6% for ViT-L (81.5% w/o EMA), and
83.1% for ViT-H (80.9% w/o EMA). Both ViT-L and ViT-H
show an overfitting trend if not using EMA.
As a by-product, our recipe for ViT-B has 82.3% accu-
racy (82.1% w/o EMA), vs. 81.8% in [46].
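The exponential moving average listed in Table 10 keeps a slowly updated copy of the weights for evaluation. The sketch below assumes one update per training step and no bias correction, neither of which is stated in the text; the flat list of floats stands in for the model parameters.

```python
def ema_update(ema_weights, weights, decay=0.9999):
    """One EMA step: ema <- decay * ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Toy example: the raw weights sit at 1.0 and the EMA copy follows slowly.
ema = [0.0, 0.0]
for _ in range(10000):
    ema = ema_update(ema, [1.0, 1.0])
print(ema[0])
```

With a decay of 0.9999 the averaged copy has covered only about 63% of the gap after 10000 updates, which illustrates how strongly it smooths over recent fluctuations in the raw weights.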
A.3. Object Detection and Segmentation in COCO
We adapt the vanilla ViT for the use of an FPN backbone
[31] in Mask R-CNN [23]. ViT has a stack of Transformer
blocks that all produce feature maps at a single scale (e.g.,
stride 16). We equally divide this stack into 4 subsets and
apply convolutions to upsample or downsample the inter-
mediate feature maps for producing different scales (stride
4, 8, 16, or 32, the same as a standard ResNet [24]). FPN is
built on these multi-scale maps.
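The multi-scale adaptation described above can be sketched as a plan that pairs each subset of blocks with a target stride. Which subset feeds which stride, and the choice of tapping the last block of each subset, are assumptions made for illustration; the text only states that the stack is divided equally into 4 subsets and that the maps are resized to strides 4, 8, 16, and 32.

```python
def fpn_stage_plan(num_blocks, patch_stride=16, target_strides=(4, 8, 16, 32)):
    """Split the block stack into equal subsets and pair each subset's last
    block with a target stride and the resize factor needed to reach it."""
    num_stages = len(target_strides)
    assert num_blocks % num_stages == 0, "stack must divide evenly"
    per_stage = num_blocks // num_stages
    plan = []
    for stage, stride in enumerate(target_strides):
        last_block = (stage + 1) * per_stage - 1  # 0-based index of the tap point
        scale = patch_stride / stride             # >1 upsample, <1 downsample
        plan.append({"block": last_block, "stride": stride, "scale": scale})
    return plan


# A 24-block stack (the depth of ViT-L).
plan = fpn_stage_plan(num_blocks=24)
for entry in plan:
    print(entry)
```

A scale of 1.0 means the stride-16 map is used as is, while the other entries indicate where an upsampling or downsampling convolution would be applied.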
For fair comparisons among different methods, we
search for hyper-parameters for each entry in Table 4
(including all competitors). The hyper-parameters we search
for are the learning rate, weight decay, drop path rate, and
fine-tuning epochs. We will release code along with the
specific configurations. For full model and training details,
plus additional experiments, see [30].
A.4. Semantic Segmentation in ADE20K
We use UperNet [52] following the semantic segmenta-
tion code of [2]. We fine-tune end-to-end for 100 epochs
with a batch size of 16. We search for the optimal lr for
each entry in Table 5 (including all competitors).
The semantic segmentation code of [2] uses relative po-
sition bias [42]. Our MAE pre-training does not use it. For
fair comparison, we turn on relative position bias only dur-
ing transfer learning, initialized as zero. We note that our
BEiT reproduction uses relative position bias in both pre-
training and fine-tuning, following their code.
B. Comparison on Linear Probing Results
In §4.3 we have shown that linear probing accuracy and
fine-tuning accuracy are largely uncorrelated and that the two
protocols place different emphasis on linear separability. We
notice that existing masked image encoding methods are
generally less competitive in linear probing (e.g., than
contrastive learning). For completeness, in Table 11 we compare
linear probing accuracy with masking-based methods.
method     model    params  acc
iGPT [6]   iGPT-L   1362 M  69.0
iGPT [6]   iGPT-XL  6801 M  72.0
BEiT [2]   ViT-L    304 M   52.1†
MAE        ViT-B    86 M    68.0
MAE        ViT-L    304 M   75.8
MAE        ViT-H    632 M   76.6
Table 11. Linear probing results of masked encoding methods.
Our fine-tuning results are in Table 3. †: our implementation.
Our MAE with ViT-L has 75.8% linear probing accu-
racy. This is substantially better than previous masking-
based methods. On the other hand, it still lags behind con-
trastive methods under this protocol: e.g., MoCo v3 [9] has
77.6% linear probing accuracy for the ViT-L (Figure 9).
Figure 10. Uncurated random samples on ImageNet validation images. For each triplet, we show the masked image (left), our MAE
reconstruction (middle), and the ground-truth (right). The masking ratio is 75%.
Figure 11. Uncurated random samples on COCO validation images, using an MAE trained on ImageNet. For each triplet, we show the
masked image (left), our MAE reconstruction (middle), and the ground-truth (right). The masking ratio is 75%.
