WILL LARGE-SCALE GENERATIVE MODELS CORRUPT FUTURE
DATASETS?
A PREPRINT
Ryuichiro Hataya
RIKEN ADSP
Han Bao
Kyoto University
Hiromi Arai
RIKEN AIP
ABSTRACT
Recently proposed large-scale text-to-image generative models such as DALL·E 2 [49], Midjourney [44], and StableDiffusion [52] can generate high-quality and realistic images from users' prompts. Not limited to the research community, ordinary Internet users enjoy these generative models, and consequently a tremendous amount of generated images have been shared on the Internet. Meanwhile, today's success of deep learning in the computer vision field owes a lot to images collected from the Internet. These trends lead us to a research question: will such generated images impact the quality of future datasets and the performance of computer vision models positively or negatively? This paper empirically answers this question by simulating contamination. Namely, we generate ImageNet-scale and COCO-scale datasets using a state-of-the-art generative model and evaluate models trained on "contaminated" datasets on various tasks including image classification and image generation. Throughout experiments, we conclude that generated images negatively affect downstream performance, while the significance depends on tasks and the amount of generated images. The generated datasets are available via https://github.com/moskomule/dataset-contamination.
1 Introduction
Deep generative models for image generation have advanced progressively since the original GANs [18] and VAEs
[34]. Recently, denoising diffusion models [29,58,59] have beaten GANs in image quality [12,30] and are becoming
the de-facto standard generative models. Among them, some large models trained on billion-scale captioned images
collected from the Internet achieved high-fidelity image generation conditioned by users’ prompts [49,54,52,44,43,
67,66,17,3]. Particularly, DALL·E 2 [49], Midjourney [44], and StableDiffusion [52] have web and smartphone applications; many Internet users enjoy image generation, and consequently a tremendous number of generated images have been uploaded to the Internet.¹
At the same time, highly realistic generated images may have potentially significant impacts on society. For example, face images generated by StyleGANs [33] were reportedly used to create fake profiles on SNSs and dating apps to deceive other users [22,28]. Furthermore, recent text-to-image generative models can generate images that look real at first glance from users' instructions and can be used to support fake news [60]. They also amplify demographic stereotypes [4].
Another concern is that generated images might affect the quality of image datasets newly curated from the Internet in the future, similar to how the outputs of machine translation models have degraded the quality of corpora [56,50,13]. Without a doubt, today's success of deep learning and computer vision, including generative models themselves, largely owes to image datasets collected from the Internet, such as ImageNet [53,10]. However, when generated images are shared on the Internet, they may contaminate the sources of image datasets.² Based on these
¹ According to OpenAI's blog post, the DALL·E 2 model alone generated two million images per day in September 2022: https://openai.com/blog/dall-e-now-available-without-waitlist/.
² Although the "official" web applications embed watermarks in generated images, which can thus be filtered, we found that some software has options to disable such functions.
Figure 1: Schematic view of the problem. Some large-scale generative models are public, and many users are playing
with them to share generated images on the Internet (top). Dataset collection heavily relies on images on the Internet,
which may be contaminated by generated images (bottom). This paper discusses the effect of such dataset corruption.
backgrounds, our research question in this paper arises: what will happen if datasets are contaminated by generated images?
We aim to answer this question through experiments: simulating such contamination with large-scale datasets of generated images and measuring the downstream performance of models trained on them. Specifically, we generate nearly two million images from ImageNet categories and COCO captions using StableDiffusion and emulate the contamination by replacing real images in datasets with generated ones. Then, we measure the performance of models trained on such contaminated datasets in various tasks, namely, image classification, image captioning, self-supervised learning, and image generation. Throughout experiments, we find that generated images have negative effects on downstream performance, although in some cases and to some extent, the effects may be endurable. We hypothesize that such negative effects are caused by the fact that the generative models capture fewer modes than the actual data, although existing synthetic experiments have shown high coverage [64].
In summary, our contributions are as follows:
• To simulate the effects of possible contamination, we create large-scale datasets consisting of generated images corresponding to ImageNet and COCO captions (Section 3).
• We conduct experiments over four distinct tasks on the generated datasets and discover negative effects of contamination, which can be partially attributed to generated images covering fewer modes than real data (Sections 4 and 5).
• Based on the empirical results, we provide recommendations to researchers on how to publish generative models and how to collect datasets (Section 7).
2 Background and Related Work
A deep generative model aims to approximate the underlying data distribution with neural networks, and its samples are expected to resemble real data. Since the emergence of GANs [18,48], the research of deep
generative models has advanced progressively. In particular, denoising diffusion models, equivalently, score-based
generative models, have achieved high-quality image generation with diversity [29,58,59], capturing modes of data
distributions faithfully [64]. Text-to-image generative models based on diffusion models can generate high-quality
images from users’ text instructions with high fidelity, even for unseen novel combinations of concepts, such as “a
photo of an astronaut riding a horse” [49,54,52,44,43,67,66,17,3]. Notably, some models have publicly accessible
applications [52,44,49,67], and many Internet users are generating images and posting them to the Internet with related texts. Such generated images are sometimes difficult to distinguish from real ones, and thus some of them have the potential to contaminate future datasets collected from the web.
The NLP community has experienced similar problems in the last decade: thanks to the development of NLP technologies, much content on the Internet has become machine-generated, e.g., by machine translation and optical character recognition systems, and such generated texts have degraded the quality of corpora [56,13]. As a result, filtering out such low-quality samples is essential to maintain downstream performance [50]. Although such issues with generated data have been investigated by the NLP community, the effects of images generated by text-to-image models on downstream performance in computer vision have rarely been studied.
The dataset contamination issue in general has been studied from various aspects, including dataset poisoning [9], adversarial training [19], label noise [21], outlier robustness [31], and distribution shifts [57]. The existing studies
usually posit an attacker/contamination model that is plausible yet mathematically convenient, such as Huber’s con-
tamination model [31]. By contrast, we are rather interested in realistic contamination of the web images induced by
generative models and its potential effects.
3 Dataset Corruption by Generated Images
The goal of this paper is to answer our research question: will contamination by generated images affect downstream performance positively or negatively? To empirically answer this question, we simulate realistic dataset contamination by generated images and evaluate the quality of contaminated datasets by training commonly used models on such datasets in several tasks. In this section, we describe the dataset creation.
3.1 Dataset Creation
To simulate image generation by users, we create datasets using a StableDiffusion model [52], a state-of-the-art text-
to-image generative model, pre-trained on LAION 2B [55]. These datasets are generated from category names of the
ImageNet ILSVRC-2012 classification dataset and captions of the COCO caption dataset, which are referred to as
SD-ImageNet and SD-COCO in the remaining text. For the generation of both datasets, we disabled the watermarking functionality, which marks outputs as generated images, and the safety checker, which suppresses explicit outputs.
SD-ImageNet
The ImageNet ILSVRC-2012 classification dataset [53] is a subset of ImageNet [10] and a dataset for the image
classification task. Its training set contains 1.2 million photo images over 1,000 categories selected from synsets of
WordNet, e.g., "African elephant" (n02504458). Using these category names, we prepared prompts like "A photo of African elephant" for each category and generated 1,400 photography-like images per class.
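As an illustration of this generation procedure, the following sketch produces class-conditional images with StableDiffusion via the Hugging Face diffusers library. The paper used the original CompVis scripts with a PLMS sampler (see Appendix A.1), so the pipeline class, checkpoint name, and output paths here are assumptions for demonstration, not the exact code used.

```python
# Illustrative sketch: generating "A photo of <class name>" images with StableDiffusion.
# The diffusers pipeline and checkpoint name are assumptions; the paper used the original
# CompVis scripts with a PLMS sampler (50 steps, guidance scale 7.5; see Appendix A.1).
import os
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-1",      # assumed checkpoint corresponding to sd-v1-1.ckpt
    safety_checker=None,                  # the safety checker was disabled for dataset creation
    torch_dtype=torch.float16,
).to("cuda")

class_names = ["African elephant", "titi"]   # in practice, all 1,000 ImageNet category names
images_per_class = 1400

for name in class_names:
    os.makedirs(f"sd-imagenet/{name}", exist_ok=True)
    prompt = f"A photo of {name}"
    for i in range(images_per_class):
        image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
        image.save(f"sd-imagenet/{name}/{i:05d}.png")
```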
Figure 2 (left) shows examples from SD-ImageNet. The images look natural at first glance but contain some flaws; for example, the elephant at the top left has two trunks.
SD-COCO
The COCO caption dataset [8] is a dataset for the image captioning task. Following the dataset split in [32], this dataset has 113,000 images with five captions per image, such as "A small child wearing headphones plays on the computer". These captions were used as prompts to generate 565,000 images.
Figure 2 (right) presents some examples from SD-COCO with their captions. Similar to the examples of SD-ImageNet, the images appear faithful to the captions used as prompts, but a closer look reveals unnatural or unfaithful details. For example, the bottom right example fails to produce a "blue and white plate."
Figure 2: Randomly selected examples from the generated datasets, namely SD-ImageNet (left) and SD-COCO (right). At a single glance, the images are of high quality and faithful to the prompts, i.e., category names and captions, while details are unnatural, e.g., an elephant with two trunks.
In the remaining text, we refer to the ILSVRC-2012 dataset as ImageNet and to the COCO caption dataset as COCO for simplicity.
3.2 Simulation of Corruption
To simulate possible corruption, we randomly replace 20, 40, and 80% of the real images in the original datasets with generated ones, sampling without replacement. We refer to these mixed datasets as IN/SD-n%, where n indicates the ratio of generated data. Similarly, we also create mixtures of COCO and SD-COCO, which are referred to as CO/SD-n%.
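A minimal sketch of this mixing procedure is given below, assuming the real and generated datasets are stored as ImageFolder-style directory trees; the directory paths and the exact sampling routine are illustrative assumptions rather than the authors' code.

```python
# Minimal sketch of the contamination simulation: replace a fraction of real images
# with generated ones so the mixture keeps the original dataset size.
# Directory paths are assumptions; both roots are assumed to share the same class folders.
import torch
from torch.utils.data import ConcatDataset, Subset
from torchvision.datasets import ImageFolder

def make_contaminated_dataset(real_root: str, generated_root: str,
                              ratio: float, seed: int = 0) -> ConcatDataset:
    """Build an IN/SD-style mixture, e.g., ratio=0.4 corresponds to IN/SD-40%."""
    real = ImageFolder(real_root)
    fake = ImageFolder(generated_root)
    g = torch.Generator().manual_seed(seed)
    n_replaced = int(ratio * len(real))
    keep_real = torch.randperm(len(real), generator=g)[: len(real) - n_replaced]
    take_fake = torch.randperm(len(fake), generator=g)[:n_replaced]
    # keep (1 - ratio) of the real images and fill the rest with generated images
    return ConcatDataset([Subset(real, keep_real.tolist()),
                          Subset(fake, take_fake.tolist())])

# Example: IN/SD-40%, i.e., 40% of ImageNet training images replaced by SD-ImageNet images
train_set = make_contaminated_dataset("imagenet/train", "sd-imagenet/train", ratio=0.4)
```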
In the next section (Section 4), we empirically investigate the effect of the corruption using the downstream performance of models trained on these contaminated datasets. However, mixtures of ImageNet and SD-ImageNet, e.g., IN/SD-20%, still entangle the effect of artifacts in generated images with the domain shift between generated and real images. To estimate the effect of domain shift, we additionally use datasets consisting of real images similar to ImageNet and COCO and compare the downstream performance of models trained on them with that of the IN/SD and CO/SD mixtures. As a counterpart of ImageNet, we adopt a subset of the WebVision dataset [38], which was collected by querying ImageNet category names on Google and Flickr; we then mix it with ImageNet. Because this subset is imbalanced and some of its categories contain fewer images than needed, we use sampling with replacement to create balanced mixtures. Similar to IN/SD-n%, we refer to these mixed datasets as IN/WV-n%. Correspondingly, as a counterpart of COCO, we use Flickr-30k [47], which contains 32,000 images collected from Flickr with five captions per image. Because it is much smaller than COCO, we only prepare CO/FL-40% as a mixture of COCO and Flickr-30k.
4 Experimental Results
In this section, we evaluate the effect of contamination using the datasets created in Section 3 on several downstream tasks.
Shared Experimental Settings
We used neural network models implemented with PyTorch v1.12 [46] and its accompanying torchvision on
CUDA 11.3. Experiments, including the dataset creation described in Section 3, were conducted on NVIDIA V-100 GPUs
and NVIDIA A-100 GPUs. Further description of settings and configurations can be found in the supplemental mate-
rial.
4.1 Image Classification
This task classifies images into the 1,000 categories of ImageNet. We used ResNet-50 [24], SwinTransformer-S (Swin-S) [40], and ConvNeXt-T [41] from torchvision, training them according to the standardized training protocols³ on ImageNet, SD-ImageNet, WebVision, and their mixtures. ResNet is a convolutional neural network with residual connections, SwinTransformer is a variant of the Vision Transformer, and ConvNeXt is a CNN inspired by Vision Transformers; together they represent modern vision models.
Table 1 shows accuracy on the ImageNet validation set. As can be seen, the performance decreases as the ratio of SD-ImageNet in the training data increases. When the ratio of generated images is at most 40%, the performance drops are marginal and may be endurable in most practical scenarios. However, when the ratio is 80%, the performance degradation is not negligible. Compared to SD-ImageNet, WebVision images have less influence on the performance. In the extreme case where no ImageNet data are included in the training data, i.e., SD-ImageNet versus WebVision, the difference in performance is significant, which suggests that the performance drop cannot be solely attributed to the domain gap.
Additionally, Fig. 3 presents confusion matrices of ResNet-50 trained on ImageNet and on SD-ImageNet. For clarity, categories are subsampled and rearranged according to 12 superclasses, adopting the "big 12" classes from [16]. As can be seen, the mispredictions of ResNet-50 trained on ImageNet mostly fall within the same superclasses, represented by diagonal blocks. In contrast, the model trained on SD-ImageNet uniformly misclassifies certain classes, partially because the category names of such classes are ambiguous and the generated images for them are therefore semantically diverse. Titi (monkey) is such an example: it is intended to mean a New World monkey in ImageNet, but it is also a name of people, plants, and places, so the generated images vary widely in semantics (see the supplemental material for example images).
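A minimal sketch of how such a superclass confusion matrix can be aggregated is shown below; the mapping from the 1,000 ImageNet classes to the 12 superclasses is assumed to be available (e.g., from the robustness library of [16]) and is not reproduced here.

```python
# Sketch of aggregating 1,000-class predictions into the 12 superclasses of Fig. 3.
# `superclass_of` is an assumed mapping from ImageNet class index to superclass name,
# e.g., taken from the robustness library [16]; it is not reproduced here.
import numpy as np
from sklearn.metrics import confusion_matrix

SUPERCLASSES = ["dog", "carnivore", "primate", "bird", "reptile", "insect",
                "food", "clothing", "instrument", "furniture", "vehicle", "structure"]

def superclass_confusion(labels, preds, superclass_of):
    """labels, preds: 1D arrays of ImageNet class indices for the validation subset."""
    idx = {name: i for i, name in enumerate(SUPERCLASSES)}
    sup_labels = np.array([idx[superclass_of[int(y)]] for y in labels])
    sup_preds = np.array([idx[superclass_of[int(p)]] for p in preds])
    return confusion_matrix(sup_labels, sup_preds, labels=list(range(len(SUPERCLASSES))))

# Large off-block-diagonal mass means mispredictions that leave the superclass,
# which is what the SD-ImageNet-trained model shows for ambiguous categories such as "titi".
```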
              ResNet-50   Swin-S   ConvNeXt-T
ImageNet         75.6      83.1       80.8
IN/SD-20%        74.5      82.1       79.7
IN/SD-40%        72.6      81.0       78.3
IN/SD-80%        65.3      74.3       70.8
SD-ImageNet      15.7      19.3       19.6
IN/WV-20%        75.1      82.5       80.0
IN/WV-40%        73.9      81.8       78.8
IN/WV-80%        68.3      NaN        73.9
WebVision        61.3      70.9       66.2
Table 1: Test accuracy of the image classification task on the ImageNet validation set. We could not stably train Swin-S on IN/WV-80%, which resulted in loss explosion. The performance drop is marginal when ImageNet images dominate the dataset.
4.2 Image Captioning
Image captioning is the task of generating appropriate captions for given images. We used a pre-trained BLIP model [37], a state-of-the-art vision-language model, and fine-tuned its captioner and filter modules on COCO, SD-COCO, Flickr-30k, and their mixtures for five epochs, following [36].
Table 2 reports performance in various metrics on the COCO test set, where captions are generated by beam search with a beam size of three. Aligned with the results of image classification, a performance drop due to generated images can also be observed. In particular, CO/SD-0.2 yields performance comparable or inferior to CO/FL-0.4, even in metrics designed for image captioning like SPICE [1] and CIDEr [62], indicating that generated images degrade dataset quality. Moreover, a comparison between the results of SD-COCO and Flickr-30k suggests that such performance drops cannot be fully attributed to domain shift.
³ https://github.com/pytorch/vision/tree/v0.13.1/references/classification. Exceptionally, ConvNeXt was trained for 300 epochs.
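For illustration, the sketch below generates a caption with a BLIP captioner using beam search (beam size three), matching the evaluation setting above. It uses the Hugging Face transformers API with an assumed checkpoint name; the paper itself fine-tuned BLIP through the LAVIS library, so this is a demonstrative stand-in rather than the actual pipeline.

```python
# Illustrative caption generation with a BLIP captioner and beam search (beam size 3).
# The checkpoint name is an assumption; the paper fine-tuned BLIP through LAVIS.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
# Beam search with a beam size of three, matching the evaluation setting in Table 2
output_ids = model.generate(**inputs, num_beams=3, max_length=30)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```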
Figure 3: Confusion matrices of ResNet-50 predictions on a subset of ImageNet validation data, for models trained on ImageNet (left) and SD-ImageNet (right). Class indices are rearranged; the axes depict the 12 superclasses (dog, carnivore, primate, bird, reptile, insect, food, clothing, instrument, furniture, vehicle, structure) obtained by gathering categories. Colors correspond to the number of data at each cell on a log scale, and block-diagonal components are highlighted.
                BLEU-1  BLEU-2  BLEU-3  BLEU-4 [45]  SPICE [1]  METEOR [11]  ROUGE-L [39]  CIDEr [62]
COCO             0.791   0.641   0.508     0.400       0.240       0.310         0.602        1.335
CO/SD-0.2        0.787   0.634   0.500     0.391       0.235       0.306         0.596        1.320
CO/SD-0.4        0.786   0.632   0.499     0.390       0.236       0.307         0.596        1.319
CO/SD-0.8        0.780   0.623   0.486     0.377       0.233       0.300         0.588        1.279
SD-COCO          0.711   0.534   0.394     0.287       0.191       0.252         0.514        1.000
CO/FL-0.4        0.787   0.634   0.501     0.393       0.238       0.308         0.598        1.326
Flickr-30k       0.754   0.587   0.439     0.321       0.215       0.275         0.554        1.092
w/o fine-tuning  0.473   0.392   0.308     0.237       0.158       0.212         0.488        0.838
Table 2: Test metrics in image captioning of the BLIP model on the COCO test split. Higher values are better.
4.3 Self-supervised Learning
Self-supervised learning aims to acquire useful representations without using explicit supervision. We used a masked autoencoder (MAE) [23], whose pretext task is to reconstruct missing patches, like a denoising autoencoder [63]. Unlike other popular self-supervised learning methods such as SimCLR [7] and BYOL [20], MAE requires only minimal data augmentation for pre-training, namely random cropping and random flipping. This makes it suitable for purely evaluating the effect of generated images on representations, because more intense perturbations could make image characteristics deviate from those of the original and generated datasets.
We pre-trained a Vision Transformer, specifically ViT-B [14], as MAE's encoder for 200 epochs with a mask ratio of 0.75. Table 3 presents the test accuracy after 90 epochs of linear probing, which trains only the final classifier layer on top of the extracted features. In this case, each setting yields almost identical performance except for SD-ImageNet, which again supports the negative effects of generated images.
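To make the linear-probing protocol concrete, the following sketch trains only a linear head on top of a frozen encoder. The actual experiments followed the official MAE recipe (LARS optimizer, 90 epochs, batch size 16,384; see Appendix A.4), so the optimizer and training loop below are simplified assumptions.

```python
# Minimal linear-probing sketch: the pre-trained encoder is frozen and only a linear
# classifier is trained on its features. The actual runs used the official MAE recipe
# (LARS optimizer, 90 epochs, batch size 16,384); this loop is a simplified assumption.
import torch
import torch.nn as nn

def linear_probe(encoder: nn.Module, loader, feat_dim: int, num_classes: int = 1000,
                 epochs: int = 90, lr: float = 0.1, device: str = "cuda") -> nn.Linear:
    encoder.eval().to(device)
    for p in encoder.parameters():
        p.requires_grad_(False)              # freeze the pre-trained encoder
    head = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.SGD(head.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = encoder(images)      # extract features without gradients
            loss = criterion(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```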
              Accuracy
ImageNet        43.9
IN/SD-20%       44.0
IN/SD-40%       44.1
IN/SD-80%       42.5
SD-ImageNet     38.8
IN/WV-20%       43.3
IN/WV-40%       43.3
IN/WV-80%       40.9
WebVision       44.1
Table 3: Linear probing test accuracy of MAE [23] on the ImageNet validation set.
4.4 Image Generation
Finally, we evaluate whether generated images are useful as training data for the image generation task. We evaluated an improved denoising diffusion probabilistic model (IDDPM) [42] on datasets resized to 64×64, which we refer to as, for example, ImageNet-64 and IN/SD-20%-64. We trained the model for 1.8×10^6 iterations using the L_hybrid objective (see [42]) with a batch size of 512 and generated 5.0×10^4 images with 250 sampling steps.
Table 4 reports the quality of unconditionally generated images in terms of Fréchet Inception Distance (FID) [27] and the improved precision and recall metrics with 5 nearest neighbors [35], computed between Inception features of the generated images and all validation data from ImageNet and WebVision. The improved precision and recall metrics are computed by estimating the volumes of the real and generated image distributions in the embedding space with nearest neighbors [35], from which we can deduce how much the two image distributions overlap. From Table 4, we see the trend that precision increases and recall decreases as the ratio of generated images in the training data increases. To put it differently, heavier contamination results in generated images that are more likely to lie in the support of the test images, while the coverage of the test support worsens. This indicates that generated images may concentrate in a smaller subset of the test support. We suppose that the contaminated training datasets cover fewer modes than the original ImageNet, causing IDDPM to achieve higher precision and lower recall. The lack of modes can also be observed visually from generated images in the supplemental material. This hypothesis can also explain the performance degradation in other tasks: contaminated datasets are biased toward some modes, and thus models trained on them generalize worse.
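To make the metric computation concrete, the sketch below estimates the improved precision and recall of [35] with k = 5 nearest neighbors from two pre-extracted feature matrices (e.g., Inception-V3 embeddings). It is a simplified re-implementation for illustration, not the exact evaluation code used for Table 4, and it materializes full pairwise distance matrices, so it only suits moderate sample sizes.

```python
# Simplified sketch of the improved precision and recall metrics [35] with k nearest
# neighbors. `real` and `fake` are (N, D) matrices of pre-extracted features
# (e.g., Inception-V3 embeddings); full pairwise distances are materialized for brevity.
import torch

def knn_radii(feats: torch.Tensor, k: int = 5) -> torch.Tensor:
    # distance from each point to its k-th nearest neighbor within the same set
    d = torch.cdist(feats, feats)
    return d.kthvalue(k + 1, dim=1).values   # k+1 because the closest point is the point itself

def coverage(query: torch.Tensor, ref: torch.Tensor, ref_radii: torch.Tensor) -> float:
    # fraction of query points lying inside at least one reference hypersphere
    d = torch.cdist(query, ref)                          # (N_query, N_ref)
    return (d <= ref_radii.unsqueeze(0)).any(dim=1).float().mean().item()

def improved_precision_recall(real: torch.Tensor, fake: torch.Tensor, k: int = 5):
    precision = coverage(fake, real, knn_radii(real, k))  # generated samples on the real manifold
    recall = coverage(real, fake, knn_radii(fake, k))     # real samples on the generated manifold
    return precision, recall
```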
For reference, randomly selected generated images are shown in Fig. 4, but the perceptual differences among them look marginal.
                     FID          Precision@5     Recall@5
ImageNet-64       14.9 / 15.6    0.665 / 0.679   0.644 / 0.653
IN/SD-20%-64      12.6 / 12.7    0.699 / 0.708   0.621 / 0.634
IN/SD-40%-64      11.0 / 10.8    0.730 / 0.739   0.585 / 0.608
IN/SD-80%-64      12.5 / 11.3    0.795 / 0.802   0.490 / 0.512
SD-ImageNet-64    16.9 / 15.4    0.831 / 0.835   0.364 / 0.379
Table 4: Image quality comparison against unconditional ImageNet-64 validation data and WebVision-64 validation data using Inception-V3 features, shown on the left and right of each cell, respectively.
Figure 4: Randomly selected images generated by class-unconditional IDDPM [42] with 250 sampling steps, for models trained on ImageNet-64, IN/SD-20%-64, IN/SD-40%-64, IN/SD-80%-64, and SD-ImageNet-64.
5 Analysis
In the main experiments, we demonstrated that generated images cause performance degradation in downstream tasks. In this section, we conduct extended experiments to see how the reduced mode coverage of generated images affects downstream tasks, by comparing performance on out-of-distribution evaluation data and by training on subsampled data.
5.1 Effects on Robustness
In the main experiments of the classification task, we measured performance only by accuracy on the validation data. To further investigate the effect of generated images on the learned representations, we measured accuracy on other validation sets, namely ImageNet-A [26] and ImageNet-R [25]. These datasets share the same categories as ImageNet but are curated independently to measure robustness to out-of-distribution data.
Table 5 summarizes the results, which generally indicate that generated images degrade robustness, except for IN/SD-20% on ImageNet-A. In contrast, WebVision images consistently enhance robustness to out-of-distribution data on both ImageNet-A and ImageNet-R. These results further support the hypothesis that generated images have fewer modes than the real data and thus cause the downstream performance drop on test data and out-of-distribution data.
Source        ImageNet Val acc.   ImageNet-A [26]   ImageNet-R [25]
ImageNet            75.6               1.76              36.7
IN/SD-20%           74.5               2.12              35.7
IN/SD-40%           72.6               1.67              35.0
IN/SD-80%           65.3               1.61              30.0
IN/WV-20%           75.1               3.55              40.7
IN/WV-40%           73.9               3.77              41.5
IN/WV-80%           68.3               5.00              40.0
Table 5: Robustness metrics of ResNet-50 on the ImageNet validation set, ImageNet-A, and ImageNet-R. In contrast to the IN/WV mixtures, the IN/SD mixtures generally degrade robustness to out-of-distribution data.
5.2 Comparison with Subsampled Data
In the main experiments, we compared the performance of networks trained on the contaminated datasets with that of networks trained on the full-size clean datasets. However, one may argue that the performance degradation results from the
reduced amount of real data in the training set. In this section, we compare the performance of models trained on subsampled real datasets with that of models trained on contaminated datasets to disentangle the effect of the amount of clean data.
Table 6 shows the test accuracy of ResNet-50 trained on a 5% subset of ImageNet and on IN/SD-95%, which fills the missing 95% of the data with generated images. Although IN/SD-95% yields a 7.4-point improvement over the subsampled ImageNet, the gain is inferior to that of IN/WV-95%.
ImageNet (5%) IN/SD-95% IN/WV-95%
44.7 52.1 61.4
Table 6: Validation accuracy of ResNet-50 trained on a 5% subsampled ImageNet, IN/SD-95%, and IN/WV-95%.
Table 7 presents test metrics of BLIP trained on 5% and 20% subsets of COCO and on their corresponding CO/SD mixtures. In this case, adding generated data has a negative effect even when only 5% of the real data are available.
             BLEU-4   SPICE   CIDEr
COCO (5%)     0.386   0.233   1.290
CO/SD-95%     0.362   0.226   1.220
COCO (20%)    0.385   0.235   1.305
CO/SD-80%     0.377   0.233   1.279
Table 7: Test metrics of BLIP trained on 5% and 20% COCO subsets and on mixtures that complement the missing data with generated images. Higher values are better.
Additionally, Table 8 compares the quality of images generated by IDDPM trained on IN/SD-80% and on a 20% subset of ImageNet. Aligned with the results in Section 4, the recall metric diminishes with contaminated data, supporting the hypothesis that generated images cover fewer modes than real ones.
                     FID    Precision@5   Recall@5
ImageNet-64 (20%)   16.5       0.639        0.646
IN/SD-80%-64        12.5       0.795        0.490
Table 8: Image quality comparison of images generated by IDDPM trained on IN/SD-80% and on a 20% ImageNet subset.
These results emphasize the negative effects of generated data. Additionally, the observations imply that using generated images for data augmentation needs careful consideration. Such an idea has been studied in image classification using conditional GANs [61,2], particularly in medical imaging [65], but it is also known to hinder final performance in large-scale settings [51]. Our results align with the latter finding, namely that generated images are not always effective for data augmentation.
6 Discussion
We have empirically seen the negative effects of generated images. In this section, we discuss the detection of generated images and the limitations of this research.
6.1 Detection of Generated Images
To avoid the negative effects of generated images, one may want to detect them easily. For example, exploiting the differences between real and generated images in the high-frequency spectrum is a simple and convincing approach [15]. However, this discrepancy mainly arises when the upsampling operation (used to decode images from low-dimensional latent representations) relies on zero-pixel insertion; otherwise, detecting generated images by frequency spectra alone is difficult [5]. Probably because StableDiffusion uses nearest-neighbor upsampling rather than zero-pixel insertion, distinguishing real from generated images using frequency information alone may be difficult. Figure 5 presents the power spectra of 1,000 images from ImageNet and SD-ImageNet, which overlap heavily at all frequencies, agreeing with [5]. Additionally, we trained a linear classifier and a multi-layer perceptron on ImageNet-pre-trained ResNet features to detect generated images. When trained on 10^4 images from both datasets,
the classifiers achieved around 85% test accuracy on 2,000 held-out test images, which is still an unsatisfactory detection rate for a binary classification task and indicates the difficulty of detecting generated images.
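As an illustration of the frequency analysis in Figure 5, the sketch below computes an azimuthally averaged power spectrum for a single grayscale image; the exact preprocessing (resizing, normalization, and averaging over 1,000 images) used in the paper is assumed rather than reproduced.

```python
# Sketch of an azimuthally averaged power spectrum for a single grayscale image,
# the statistic compared between ImageNet and SD-ImageNet in Figure 5.
# Resizing and normalization choices are assumptions.
import numpy as np

def radial_power_spectrum(img: np.ndarray) -> np.ndarray:
    """img: 2D grayscale array; returns log power averaged over annuli of equal radius."""
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = img.shape
    y, x = np.indices((h, w))
    r = np.sqrt((y - h // 2) ** 2 + (x - w // 2) ** 2).astype(int)  # radius of each frequency bin
    sums = np.bincount(r.ravel(), weights=power.ravel())
    counts = np.bincount(r.ravel())
    return np.log(sums / counts + 1e-12)   # log power versus spatial frequency

# Averaging this curve over many real and generated images yields plots like Figure 5.
```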
Figure 5: Average and standard deviation of the power spectra of 10^3 images from ImageNet and SD-ImageNet (power spectrum plotted against spatial frequency).
6.2 Limitations
This paper has revealed the potential effects of generated images on datasets through various experiments. Nevertheless, the discussion has some limitations. Firstly, we could only use StableDiffusion trained on LAION 2B, because its model and pre-trained weights are publicly available, which is important for generating images without identifiable watermarks. Different generative models and source datasets may lead to other conclusions, which are left for future work.
Another limitation concerns the types of created datasets and experimental tasks. Specifically, we created datasets from the category names of ImageNet and the captions of COCO and conducted experiments on image classification, image captioning, self-supervised learning, and image generation. Such a dataset generation scheme may be too simple to approximate the data generation processes induced by users' prompts. In addition, these datasets and tasks may not be complex enough, so the insights of this paper may not cover some important aspects of other visual recognition tasks. For example, the object counting task [6] on contaminated data may be challenging because generative models cannot always handle numbers correctly [54]. We leave further in-depth analysis for future work.
7 Conclusion
Recent generative models trained on billion-scale data enable the generation of high-quality and high-fidelity images, and many users play with these models and share the generated images on the web. Observing such a trend, we questioned whether such generated images affect the quality of future datasets collected from the Internet. To answer this question, we simulated contamination by generated images using a state-of-the-art generative model and conducted experiments on such data in various tasks, namely image classification, image captioning, self-supervised learning, and image generation. Throughout the experiments, we found that generated images negatively impact downstream performance, although the extent depends on the ratio of generated images and on the downstream task. Additional analysis revealed that generated images degrade robustness to out-of-distribution data; that applying generated images to data augmentation needs careful consideration; and that easy detection of generated images may not be applicable to up-to-date generative models.
Based on these observations, we recommend that researchers who publish generative models carefully implement watermarks to enable identification of generated images. As we discussed in this paper, generated images have negative impacts on downstream performance, and their effect on new tasks is immeasurable; thus, publishers of generative models have a responsibility to avoid possible contamination. One simple way to mitigate this problem is to implement either identifiable or invisible watermarks, as some publishers have already done, e.g., [49,52], so that dataset curators can easily identify and filter such images out. We also suggest that researchers who develop image datasets collected from the Internet should filter out or mark generated images, because adding generated images may degrade final downstream performance, as shown in Section 5.2.
Another important implication of this paper is the need for further research on detection methods for generated images, in parallel with the development of generative models. As shown in Section 6.1, images generated by the latest generative models cannot be detected by simple methods that were once effective. Consequently, developing detection methods for filtering is essential for the soundness of future research.
Acknowledgement
This work was supported by JST, ACT-X Grant Number JPMJAX210H, Japan. We used computational resources of
“mdx: a platform for the data-driven future” and RAIDEN (Riken AIp Deep learning ENvironment). R.H. thanks Kai
Katsumata at the University of Tokyo for his suggestions on image generation experiments.
References
[1] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evalua-
tion. In ECCV, 2016. 5,6
[2] Anthreas Antoniou, Amos Storkey, and Harrison Edwards. Data augmentation generative adversarial networks. In ICLR,
2018. 9
[3] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli
Laine, Bryan Catanzaro, et al. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv, 2022. 1,3
[4] Federico Bianchi, Pratyusha Kalluri, Esin Durmus, Faisal Ladhak, Myra Cheng, Debora Nozza, Tatsunori Hashimoto, Dan
Jurafsky, James Zou, and Aylin Caliskan. Easily accessible text-to-image generation amplifies demographic stereotypes at
large scale, 2022. 1
[5] Keshigeyan Chandrasegaran, Ngoc-Trung Tran, and Ngai-Man Cheung. A closer look at fourier spectrum discrepancies for
cnn-generated images detection. In CVPR, 2021. 9
[6] Prithvijit Chattopadhyay, Ramakrishna Vedantam, Ramprasaath R Selvaraju, Dhruv Batra, and Devi Parikh. Counting every-
day objects in everyday scenes. In CVPR, 2017. 10
[7] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of
visual representations. In ICML, 2020. 6
[8] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick.
Microsoft coco captions: Data collection and evaluation server. arXiv, 2015. 3
[9] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using
data poisoning. arXiv, 2017. 3
[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database.
In CVPR, 2009. 1,3
[11] Michael Denkowski and Alon Lavie. Meteor universal: Language specific translation evaluation for any target language. In
Proceedings of the Ninth Workshop on Statistical Machine Translation, 2014. 6
[12] Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat GANs on image synthesis. In NeurIPS, 2021. 1
[13] Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. In EMNLP, 2021. 1,3
[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa
Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16
words: Transformers for image recognition at scale. In ICLR, 2021. 7
[15] Tarik Dzanic, Karan Shah, and Freddie Witherden. Fourier spectrum discrepancies in deep network generated images. In
NeurIPS, 2020. 9
[16] Logan Engstrom, Andrew Ilyas, Hadi Salman, Shibani Santurkar, and Dimitris Tsipras. Robustness, 2019. 5
[17] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based
text-to-image generation with human priors. In ECCV, 2022. 1,3
[18] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Networks. In NIPS, 2014. 1,3
[19] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
3
[20] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Remi Munos, and Michal Valko. Bootstrap your own latent - a new approach to self-supervised learning. In NeurIPS, 2020. 6
[21] Bo Han, Quanming Yao, Tongliang Liu, Gang Niu, Ivor W Tsang, James T Kwok, and Masashi Sugiyama. A survey of
label-noise representation learning: Past, present and future. arXiv, 2020. 3
[22] Drew Harwell. Dating apps need women. advertisers need diversity. ai companies offer a solution: fake people. Washington
Post, 2020. 1
[23] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, pages 16000–16009, 2022. 6,7
[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
5
[25] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak
Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of
out-of-distribution generalization. In ICCV, 2021. 8
[26] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In CVPR,
2021. 8
[27] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two
time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017. 7
[28] Kashmir Hill and Jeremy White. Designed to deceive: Do these people look real to you? New York Times, 2020. 1
[29] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020. 1,3
[30] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion
models for high fidelity image generation. JMLR, 23:47–1, 2022. 1
[31] Peter J Huber. Robust statistics. In International encyclopedia of statistical science, pages 1248–1251. Springer, 2011. 3
[32] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015. 3
[33] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR,
2019. 1
[34] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014. 1
[35] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In NeurIPS, 2019. 7
[36] Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, and Steven C. H. Hoi. Lavis: A library for language-vision
intelligence. arXiv, 2022. 5
[37] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-
language understanding and generation. In ICML, 2022. 5
[38] Wen Li, Limin Wang, Wei Li, Eirikur Agustsson, and Luc Van Gool. Webvision database: Visual learning and understanding
from web data. arXiv, 2017. 4
[39] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, 2004. 6
[40] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer:
Hierarchical vision transformer using shifted windows. In ICCV, 2021. 5
[41] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s.
In CVPR, 2022. 5
[42] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, 2021. 7,8
[43] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever,
and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML,
2022. 1,3
[44] Jonas Oppenlaender. The creativity of text-to-image generation. arXiv, 2022. 1,3
[45] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine
translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002. 6
[46] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin,
Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan
Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style,
high-performance deep learning library. In NeurIPS, 2019. 4
[47] Bryan A. Plummer, Liwei Wang, Christopher M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik.
Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. IJCV, 123(1):74–93,
2017. 4
[48] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative
adversarial networks. In ICLR, 2015. 3
[49] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation
with clip latents. arXiv, 2022. 1,3,10
[50] Spencer Rarrick, Chris Quirk, and Will Lewis. Mt detection in web-scraped parallel corpora. In Proceedings of Machine
Translation Summit XIII: Papers, 2011. 1,3
[51] Suman Ravuri and Oriol Vinyals. Seeing is not necessarily believing: Limitations of BigGANs for data augmentation. In
ICLR Learning from Limited Labeled Data Workshop, 2019. 9
[52] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. 1,3,10
[53] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya
Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV,
2015. 1,3
[54] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour,
Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep
language understanding. arXiv, 2022. 1,3,10
[55] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes,
Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig
Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for training next generation image-
text models. In NeurIPS, 2022. 3
[56] Michel Simard. Clean data for training statistical mt: the case of mt contamination. In Proceedings of the 11th Conference of
the Association for Machine Translation in the Americas: MT Researchers Track, pages 69–82, 2014. 1,3
[57] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training.
In ICLR, 2018. 3
[58] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequi-
librium thermodynamics. In ICML, 2015. 1,3
[59] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based
generative modeling through stochastic differential equations. In ICLR, 2021. 1,3
[60] Nitasha Tiku. Ai can now create any image in seconds, bringing wonder and danger. Washington Post, 2022. 1
[61] Toan Tran, Trung Pham, Gustavo Carneiro, Lyle Palmer, and Ian Reid. A Bayesian Data Augmentation Approach for Learning
Deep Models. In NIPS, 2017. 9
[62] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In
CVPR, 2015. 5,6
[63] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11(12), 2010. 6
[64] Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion GANs.
In ICLR, 2022. 2,3
[65] Xin Yi, Ekta Walia, and Paul Babyn. Generative Adversarial Network in Medical Imaging: A Review. Medical Image
Analysis, 58:101552, 2019. 9
[66] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei
Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui
Wu. Scaling autoregressive models for content-rich text-to-image generation. arXiv, 2022. 1,3
[67] Han Zhang, Weichong Yin, Yewei Fang, Lanxin Li, Boqiang Duan, Zhihua Wu, Yu Sun, Hao Tian, Hua Wu, and Haifeng
Wang. Ernie-vilg: Unified generative pre-training for bidirectional vision-language generation. arXiv, 2021. 1,3
A Detailed Experimental Configurations
This section describes the detailed experimental settings and configurations.
A.1 Dataset Creation
We generated images using the StableDiffusion model⁴ and its accompanying pre-trained weights (sd-v1-1.ckpt) on eight NVIDIA A100 GPUs. Each image in the datasets was sampled with 50 steps of the PLMS sampler and an unconditional guidance scale of 7.5, which is identical to the setting of its web application.⁵
A.2 Image Classification
We trained ResNet-50 and Swin-S models following torchvision's training protocol⁶ on eight NVIDIA A100 GPUs. The results of Swin-S were calculated using exponential-moving-average parameters.
A.3 Image Captioning
We fine-tuned the captioner and filter modules of the BLIP model following LAVIS's training script⁷ on two NVIDIA A100 GPUs.
A.4 Self-supervised Learning
We pre-trained MAE following the official implementation⁸ on eight NVIDIA A-100 GPUs and fine-tuned the last layer of its encoder on 16 NVIDIA V-100 GPUs. Pre-training ran for 200 epochs with a 40-epoch warmup and a batch size of 4,096, using gradient accumulation once every two iterations. Fine-tuning ran for 90 epochs using the LARS optimizer with a batch size of 16,384.
⁴ https://github.com/CompVis/stable-diffusion
⁵ https://huggingface.co/spaces/stabilityai/stable-diffusion/blob/main/app.py
⁶ https://github.com/pytorch/vision/tree/v0.13.1/references/classification
⁷ https://github.com/salesforce/LAVIS/blob/v0.1.0/run_scripts/blip/train/train_caption_coco.sh
⁸ https://github.com/facebookresearch/mae/tree/main
A.5 Image Generation
We trained IDDPM and generated images from it following the official instructions for the ImageNet-64 dataset⁹ on eight NVIDIA V-100 GPUs. The model was trained for 1.8×10^6 iterations using the L_hybrid objective with a batch size of 512. We generated 50,000 images with 250 sampling steps from EMA models. The computation of metrics is based on https://github.com/NVlabs/stylegan2-ada-pytorch/tree/main/metrics.
A.6 Detection of Generated Images
For the experiments in Section 6.1, we first extracted ImageNet-pre-trained ResNet-50 features for 12,000 images from ImageNet and SD-ImageNet. Each feature vector has a dimension of 2,048. Then, we trained a linear classifier and a two-layer MLP with a hidden size of 1,024 and a ReLU activation to classify them, using 10,000 feature vectors for 5,000 iterations with the Adam optimizer and a batch size of 128. Their performance was evaluated on the remaining 2,000 test vectors. The linear classifier and the MLP achieved 83% and 86% accuracy, respectively.
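A minimal sketch of this detector is given below, assuming the 2,048-dimensional features and binary labels are already extracted; the iteration-based loop mirrors the described schedule (Adam, batch size 128, 5,000 iterations), but it is an illustrative reconstruction rather than the exact training code.

```python
# Sketch of the generated-image detector: a two-layer MLP on 2,048-dimensional ResNet-50
# features. `feats` is an (N, 2048) float tensor, `labels` is 0 (real) or 1 (generated);
# feature extraction is assumed to have been done beforehand.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_detector(feats: torch.Tensor, labels: torch.Tensor, iterations: int = 5000,
                   batch_size: int = 128, device: str = "cuda") -> nn.Module:
    mlp = nn.Sequential(nn.Linear(2048, 1024), nn.ReLU(), nn.Linear(1024, 2)).to(device)
    opt = torch.optim.Adam(mlp.parameters())
    criterion = nn.CrossEntropyLoss()
    loader = DataLoader(TensorDataset(feats, labels), batch_size=batch_size, shuffle=True)
    step = 0
    while step < iterations:
        for x, y in loader:
            if step >= iterations:
                break
            loss = criterion(mlp(x.to(device)), y.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
            step += 1
    return mlp
```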
B Example Images
This section displays some example images from SD-ImageNet and ImageNet.
B.1 Comparison of Real and Generated Images
In Section 4.4, we hypothesized that generated images have fewer modes than real ones, which causes the performance degradation. Comparing randomly selected images from ImageNet and SD-ImageNet in Figs. 6 and 7 visually supports this hypothesis.
⁹ https://github.com/openai/improved-diffusion/tree/main
Figure 6: Real images of African elephants from ImageNet.
Figure 7: Generated images of African elephants from SD-ImageNet.
B.2 Examples of titi
In Section 4.1, we argued that some categories are semantically diverse because of the ambiguity of their names. Figure 8 presents randomly selected images from the titi category. Although ImageNet intends this class to mean a New World monkey, the generated images are mostly photos of humans, because "titi" is also a name of people.
Figure 8: Generated images of the titi category from SD-ImageNet.
... Each successive dataset obtained in this way tends to be less diverse than the previous one. This has been referred to as 'model collapse' [122], and has been observed in text [123] and images [124]. From a statistical point of view, there is evidence that 'rare' features are discarded over successive generations, thereby increasingly converging on average or majority features. ...
Chapter
Recent text-to-image generation methods provide a simple yet exciting conversion capability between text and image domains. While these methods have incrementally improved the generated image fidelity and text relevancy, several pivotal gaps remain unanswered, limiting applicability and quality. We propose a novel text-to-image method that addresses these gaps by (i) enabling a simple control mechanism complementary to text in the form of a scene, (ii) introducing elements that substantially improve the tokenization process by employing domain-specific knowledge over key image regions (faces and salient objects), and (iii) adapting classifier-free guidance for the transformer use case. Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high fidelity images in a resolution of 512×512 pixels, significantly improving visual quality. Through scene controllability, we introduce several new capabilities: (i) Scene editing, (ii) text editing with anchor scenes, (iii) overcoming out-of-distribution text prompts, and (iv) story illustration generation, as demonstrated in the story we wrote.