Abertay University
School of Design and Informatics
Using Generative Adversarial Networks
for Content Generation in Games
Dissertation
Submitted in partial fulfillment of the requirements for the degree of
MSc in Computer Games Technology
August 2020
Ridam S.S. Rahman
Abstract
Building game worlds for the players to explore can be a particularly time-consuming
activity, especially for game designers. Editing tools in game engines can greatly improve this process by allowing users to place and edit sets of identical level components through interfaces that often resemble Microsoft Paint's tool system; nonetheless, these methods still require a lot of input, and designers often need to attend to a multitude of small details when creating levels and game worlds.
While there is undoubtedly a certain positive value in manually crafted details,
sometimes these small minutiae replicate patterns seen elsewhere. A question
arises: what if we could encapsulate these patterns into replicable features? Until
now, such attempts at replication have been done through standard procedural
content generation techniques. But technological advancements made in the field of
machine learning now allow us to generate content by unprecedented means.
The aim of this dissertation is to determine, using a generative adversarial network, whether it is possible to create game worlds using a semantic tool-brush, which would allow users to paint a map with colours that represent entities such as mountains, hills, rivers, etc., and generate a 3D world accordingly. In order to achieve such a result, an open-source GAN model by Nvidia called SPADE will be trained on game data to convert a 2D image into a 3D game world, in a similar fashion to Nvidia's GauGAN model.
Contents
1. Introduction
1.1. Problem Definition
1.2. Aims and Objectives
1.3. Project Overview
2. Background and Literature Review
2.1. Overview on Procedural Content Generation
2.2. Image Synthesis using GANs
2.2.1. Random Generation
2.2.2. Paired Image to Image Translation
2.2.3. Unpaired Image to Image Translation
2.2.4. Image Synthesis using Text
2.3. Music Generation
2.4. Terrain Generation using GANs
3. Requirements Specification and Design
3.1. Training Data Set
3.1.1. Requirements
3.1.2. Acquiring the Data Set
3.1.3. Improving the Data Set
3.2. Compute Power
3.2.1. Limitations
3.2.2. Possibilities
4. Implementation
4.1. SPADE
5. Evaluation
5.1. Fréchet Inception Distance
5.2. Summary of Results
6. Conclusion
6.1. Future work
References
Chapter 1
Introduction
Creating fictional worlds is an activity mankind has been doing for ages. If we do not limit ourselves to strictly computer-generated content, the set of virtual worlds we can draw on becomes far larger, full of creations that have been around for longer than anyone can remember. Human folklore has always been filled with imaginary worlds that mimic reality in one way or another. Depictions of the Garden of Eden, for instance, almost always end up representing a pleasant green forest, with only a few exceptions.
i.1 [On the left: Lucas Cranach the Elder, The Garden of Eden (1530)]
i.2 [On the right: Peter Paul Rubens and Jan Brueghel the Elder, The garden of Eden with the fall of man (1615)]
i.3 [On the left: Thomas Cole, The Garden of Eden (1828)]
i.4 [On the right: Hieronymus Bosch, The Garden of Earthly Delights (1503-1515)]
While some would assume these similarities are due to a lack of imagination, an
interesting implication is that we simply rework certain concepts and ideas that we
grow up with. We subconsciously build our imaginary worlds around certain tropes, or
features, that we absorb through the environments we live in and our cultural
backgrounds. It is pretty much the reason why the fantasy genre in western countries
has become a reiteration of certain tropes seen in J.R.R. Tolkien’s Lord of the Rings.
From a certain perspective, without even realizing, humans do something similar to
what happens in the context of machine learning: we see, we learn, and we elaborate
outputs based on the inputs we have received.
Before our modern-day computers, all we could do to describe and depict imaginary
worlds was to rely on the talent of writers and artists. Although the results of
handcrafted work can be stunning, it still remains a really tedious process that can
take a lot of time. But ever since the power of automata has become part of our lives,
we managed to figure out new ways to create virtual worlds using technology. Among
the many, one that clearly comes to mind is procedural content generation. The idea at
the core of the concept is more or less to replicate patterns. According to cognitive
neuroscience, humans have a knack for pattern recognition. We even have a marked tendency towards pareidolia, a brain mechanism that can lead people to see
dinosaur-shaped clouds, for example. Knowing all of this, it shouldn’t come as a
surprise then that humans are good at coming up with techniques to algorithmically
generate pseudo-randomized patterns. These methods have been around for years,
with impressive results that can easily be seen in different kinds of entertainment
media, from movies to video games; but although it is true that there are already
techniques that can mimic natural features based on patterns, the implementation
process usually requires establishing a set of rules for content generation, which may
not be immediate, and still requires a lot of input from the designers. In layman's terms,
we need to figure out the rules that shape our world, before being able to create a
world.
Technological advancements achieved by mankind over the course of the last decade
however have opened up a whole new set of possibilities for content generation using
machine learning. What if the patterns we try to replicate were extrapolated by a
neural network? What if we could turn these patterns into feature vectors? After all,
any information regarding virtual worlds can be represented as a data distribution.
Patterns and features can be automatically gathered during the training process from
the content itself we’re trying to synthesize. The real issue though is gathering such
training data.
1.1. Problem Definition
Most common game editing tools allow the users to create entire virtual worlds starting
from simple blocks. Sometimes these tools are even integrated in the game itself:
creative players can build extremely rich and diverse environments from scratch just by
using these in-game editors, although these are only the tip of the iceberg compared to all the
tools available nowadays for game world editing, including third-party software, or
actual proper game engines.
i.5 [On the left, screenshot from a world generated using Minecraft’s Creative Mode, credits to Mojang Studios]
i.6 [On the right, screenshot from Dreams, credits to Media Molecule]
i.7 [On the left, screenshot from Little Big Planet’s Create Mode, credits to Media Molecule]
i.8 [On the right, screenshot from Mario Maker, credits to Nintendo]
Games such as Minecraft (Mojang Studios, 2009), Dreams (Media Molecule, 2020),
Little Big Planet (Media Molecule, 2008), or Mario Maker (Nintendo, 2015) allow players
to create the game world by playing the game itself. While being able to create game
content has become an in-game feature for some commercially successful games,
turning an often tedious process for developers into an enjoyable experience for players,
one thing that hasn't changed is the effort that has to be put into the process of turning one's imagination into a game level. The aforementioned third-party software, such as MCEdit (MCEdit.net, 2016) for Minecraft, becomes quite useful in these cases, as tools like this can dramatically expand the ability to create worlds: MCEdit, for example, allows users to modify a Minecraft save file, helping them reshape the game world in ways that are more complex than what can be achieved by merely using the classic in-game editor in Creative Mode. The amount of features these tools offer makes them almost comparable to low-level game engines.
i.9 [Screenshot from MCEdit, credits to MCEdit.net]
As complex as they can be, these tools nonetheless present certain limitations: for instance, although entertaining, designing and building game worlds can still be a tiring experience, as current world-editing tools only let users create worlds by interacting with individual sets of game components in a way that is similar to Microsoft Paint, a method that requires a lot of interaction from the user.
Is it possible to build environments using a more efficient method? Is there some way to
design worlds using artificial intelligence, without losing the intrinsic complexity that is derived from the time and work spent on more conventional world-building techniques? To answer such questions, it is opportune to first focus on the technological
advancements that have been made in the field of machine learning, particularly for
what concerns content generation.
Generative Adversarial Networks, or GANs for short, have recently acquired the
spotlight as a machine learning technique (Goodfellow I. et al., 2014), due to the
amazing possibilities they offer. In this machine learning model two neural networks
compete against each other in order to make one of the networks become good at
“fooling” the other one, by acquiring the ability to replicate any data distribution. In other
words, given a proper training set a GAN can imitate or actually create any kind of
content, from music to art.
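As an illustration of this adversarial setup, the sketch below shows how a generator and a discriminator could be trained against each other in PyTorch (the framework SPADE itself is built on). The network sizes, learning rates and data shapes here are arbitrary assumptions for illustration, not the configuration used later in this project.

import torch
import torch.nn as nn

# Toy generator: maps a 64-dimensional noise vector to a flattened 28x28 image.
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
# Toy discriminator: maps a flattened image to a single real/fake logit.
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def training_step(real_images):
    batch = real_images.size(0)
    real_images = real_images.view(batch, -1)

    # 1) Train the discriminator to tell real images apart from generated ones.
    fake_images = G(torch.randn(batch, 64)).detach()
    loss_d = bce(D(real_images), torch.ones(batch, 1)) + \
             bce(D(fake_images), torch.zeros(batch, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Train the generator to "fool" the discriminator into predicting "real".
    loss_g = bce(D(G(torch.randn(batch, 64))), torch.ones(batch, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()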
Ever since their inception, GANs have been used for a wide variety of purposes: the
fashion industry for example has benefited from GAN-made imaginary models, thus
avoiding all the expenses associated with hiring models, photographers, makeup artists,
and so forth (Wong, C., 2019); GANs have also been used to make portraits, landscapes, and high-fidelity replications of art styles using image style transfer. Using
the latter technology anyone can become an artist, and turn any picture into something
that looks as if it was made by Van Gogh, or Gauguin (Gatys et al., 2015). Ethical
concerns have also been raised as GANs have been recently combined with
autoencoders to improve Deepfakes, synthetic media where typically a person in an
existing image or video is replaced with someone else's likeness.
GANs offer limitless possibilities, even within the scope of games: in this context, one of
the most popular GAN uses is upscaling 2D game content, such as textures, in order to obtain higher-resolution versions of it (Wang et al., 2015). AI upscaling has also been used at a commercial level in games like Resident Evil.
Last but not least, there’s one recent GAN model that is extremely relevant for the
potential it offers: GauGAN, a trained model created by Nvidia.
i.10 [Screenshot from GauGAN, credits to Nvidia]
This software features a semantic tool-brush (Park T. et al., 2019), a particular
two-dimensional brush that can represent semantic references as colours. Using
GauGAN users can paint rocky pinnacles using a “Mountain brush”, or similarly use a
“Sea brush” or a “Lake brush” to paint bodies of water on a realistic landscape.
The idea proposed here is that it is possible to train a model like GauGAN on a training
set based on game data, and possibly even develop a tool that allows users to create 3D worlds by merely drawing a 2D representation of them.
1.2. Aims and Objectives
The aim of this dissertation can be summarized quite easily in a few steps.
The first objective is to analyze thoroughly the technology behind content generation,
with particular interest towards the context of machine learning, in order to gather a full
understanding of what can be done and what the limits of content generation are in its current state. Although the primary focus of this dissertation will be on analyzing the technology behind Nvidia's GauGAN, and how to generate images using semantic
representations, insight shall be gathered on other techniques as well, to avoid having a
narrowed view on the topic, and to evaluate the possibility of game content generation
using different approaches.
After a brief overview on such techniques, semantic-image-to-photo translation shall be
explained in more detail. In particular, requirements shall be analyzed, to explore the
possibility of generating game content using SPADE, a specific generative adversarial
network model used by Nvidia to create GauGAN. If such requirements can be met, a
SPADE model shall be trained, in order to obtain a GAN model that can process semantic inputs to generate game data.
Once the training session is over, the results generated by the model shall be assessed
using a quantitative scoring system known as the Fréchet Inception Distance. If the results prove valid according to this score, the experiment shall be considered successful.
1.3. Project Overview
The project’s most critical sections can be split into four parts. The first is understanding
content generation using GANs, with particular focus on semantic-image-to-photo
translation. This part will be covered in chapter 2, Background and Literature Review.
The second part of the project involves acquiring data that is suitable for training a
model based on such technology. This section will be covered in chapter 3,
Requirements Specification and Design.
The third part shall be mostly related to the training session. This section will be covered
in chapter 4, Implementation.
Last but not least, the fourth part of this project will be centered around the evaluation
of the results obtained from the training, and the possibilities they open up. This last section shall be covered in chapter 5, Evaluation.
Chapter 2
Background and Literature Review
In order to understand how to use a GAN model for generating game content, it is first
necessary to get a more in-depth insight into how artificial content generation works and how it has been used both in the context of neural networks and within the game industry. Before diving into that, though, a quick overview shall be given of standard procedural content generation.
2.1. Overview on Procedural Content Generation
Many games nowadays rely in one way or another on some sort of algorithm-based
procedural content generation technique in order to generate the game world. Some of
these techniques are usually used to generate terrains and trees, for instance, although
there’s no specific limitation to what can be created, which can even include planets and
monsters as well. Another interesting use of procedural content generation techniques
is to create dungeons, particularly for the game genre known as rogue-likes.
Although implementation-wise all these methods are quite different, they have
something in common: they’re all based on a rule system, and the general objective is to
replicate patterns in an apparently random way, in order to simulate our perception of nature as something inherently chaotic. Some of these implementations do indeed include a certain degree of randomness, but usually this is kept to a minimum.
Pseudo-randomness in procedurally generated game worlds is usually achieved by relying on seeds, fixed numbers that initialize the pseudo-random functions (such as Perlin noise) used to build the world. The seed is meant to remain constant between playthroughs of the game world it originates, so if the seed and the rules used for content generation are known, it is technically possible to predict what is going to be generated. Although powerful, the problem with content generated using techniques such as Perlin noise is the little control designers have over the result. Developers can change the rules for content generation, but in order to edit the game world to one's heart's content, classic editing tools remain preferable.
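As a small illustration of how a seed drives this kind of generation, the following sketch builds a heightmap from simple value noise (a simplification of Perlin noise; the grid size and cell size are arbitrary assumptions). The same seed always reproduces exactly the same terrain.

import random

def heightmap(seed, size=16, cell=4):
    # The seed fully determines the pseudo-random lattice, so the same seed
    # always reproduces the same terrain between playthroughs.
    rng = random.Random(seed)
    points = size // cell + 1
    lattice = [[rng.random() for _ in range(points + 1)] for _ in range(points + 1)]

    def sample(x, z):
        # Bilinear interpolation between the four surrounding lattice values,
        # the same smoothing idea used by noise functions such as Perlin noise.
        gx, gz = x / cell, z / cell
        x0, z0 = int(gx), int(gz)
        tx, tz = gx - x0, gz - z0
        top = lattice[z0][x0] * (1 - tx) + lattice[z0][x0 + 1] * tx
        bot = lattice[z0 + 1][x0] * (1 - tx) + lattice[z0 + 1][x0 + 1] * tx
        return top * (1 - tz) + bot * tz

    return [[sample(x, z) for x in range(size)] for z in range(size)]

# The same seed yields the same world; a different seed yields a new one.
assert heightmap(42) == heightmap(42)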
2.2. Image Synthesis using GANs
The most famous example of content generation in common culture is image synthesis,
the creation of images using computer algorithms. The ability of machines to generate
pictures has been improving way beyond expectations over the course of the last
decades. Such improvements have led researchers to analyse the techniques used to
generate these images, thus allowing a more precise classification system that will be
used here to explain the separate approaches to image synthesis through machine
learning. According to “Image Synthesis Using Machine Learning Techniques” (Gupta P.
et al., 2019), there are four generation techniques.
i.11 [Classification diagram, taken from Image Synthesis Using Machine Learning Techniques, 2019]
2.2.1. Random Generation
The synthesis of random images of a particular class. A random image generator
that is trained using a set of pictures of real faces will synthesize realistic images
of new faces previously unseen by the generator. The major limitation of this
technique is that a large training set is required. In the original paper by Ian
Goodfellow, et al. “Generative Adversarial Networks” (2014), GANs were used to
generate new realistic examples for the MNIST handwritten digit dataset, the
CIFAR-10 small object photograph dataset, and the Toronto Face Database.
i.12 [Examples of GANs used to generate new plausible examples for image datasets. Taken from Generative
Adversarial Networks, 2014.]
In “Unsupervised Representation Learning with Deep Convolutional Generative
Adversarial Networks” (Alec Radford, et al., 2015), a model called DCGAN is
used to demonstrate how to train stable GANs at scale in order to generate
examples of bedrooms.
i.13 [Example of GAN-Generated Photographs of Bedrooms. Taken from Unsupervised Representation Learning
with Deep Convolutional Generative Adversarial Networks, 2015.]
A notable feat accomplished by the DCGAN is the ability to perform vector
arithmetic in the latent space between two inputs.
i.14 [Example of Vector Arithmetic for GAN-Generated Faces. Taken from Unsupervised Representation Learning
with Deep Convolutional Generative Adversarial Networks, 2015.]
In “Progressive Growing of GANs for Improved Quality, Stability, and Variation”
(Tero Karras, et al., 2017), it is demonstrated that the generation of realistic
photographs of human faces can be achieved. The model was trained using the
physical appearance of celebrities, meaning that there are features from existing
well-known personalities in the generated faces, making them seem oddly familiar to a certain degree.
i.15 [Examples of Photorealistic GAN-Generated Faces. Taken from Progressive Growing of GANs for Improved
Quality, Stability, and Variation, 2017.]
DCGAN has also been used to generate objects and scenes.
i.16 [Example of Photorealistic GAN-Generated Objects and Scenes. Taken from Progressive Growing of GANs for
Improved Quality, Stability, and Variation, 2017.]
In “The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and
Mitigation” (M. Brundage et al., 2018), the results achieved by DCGAN are
discussed in order to showcase the rapid progress of GANs from 2014 to 2017.
i.17 [Example of the Progression in the Capabilities of GANs from 2014 to 2017. Taken from The Malicious Use of
Artificial Intelligence: Forecasting, Prevention, and Mitigation, 2018.]
In “Large Scale GAN Training for High Fidelity Natural Image Synthesis” (Andrew
Brock, et al., 2018), another model called BigGAN produces results that resemble realistic photographs.
i.18 [Example of Realistic Synthetic Photographs Generated with BigGAN. Taken from Large Scale GAN Training
for High Fidelity Natural Image Synthesis, 2018.]
In “Towards the Automatic Anime Characters Creation with Generative
Adversarial Networks” (Yanghua Jin, et al., 2017) it is also shown that GANs can be used to generate faces of anime characters (i.e. Japanese comic book
characters).
i.19 [Example of GAN-Generated Anime Character Faces. Taken from Towards the Automatic Anime Characters
Creation with Generative Adversarial Networks, 2017.]
2.2.2. Paired Image to Image Translation
Used for synthesizing an image that belongs to a certain category or set using an
image of another category or set when paired images belonging to both sets are
available. In “Image-to-Image Translation with Conditional Adversarial
Networks” (Phillip Isola, et al., 2016) it is demonstrated that GANs can also be
used for Image-To-Image Translation. Examples include tasks such as the
translation of:
●Semantic images to photographs of cityscapes and buildings.
●Satellite photographs to Google Maps.
●Photos from day to night.
●Black and white photographs to color.
●Sketches to color photographs.
i.20 [Example of Photographs of Daytime Cityscapes to Nighttime With pix2pix. Taken from Image-to-Image
Translation with Conditional Adversarial Networks, 2016.]
i.21 [Example of Sketches to Color Photographs With pix2pix. Taken from Image-to-Image Translation with
Conditional Adversarial Networks, 2016.]
In “Beyond Face Rotation: Global and Local Perception GAN for Photorealistic
and Identity Preserving Frontal View Synthesis” (Rui Huang, et al., 2017) GANs
are used to generate frontal-view photographs of human faces given
photographs taken at an angle.
i.22 [Example of GAN-based Face Frontal View Photo Generation. Taken from Beyond Face Rotation: Global and
Local Perception GAN for Photorealistic and Identity Preserving Frontal View Synthesis, 2017.]
In “Pose Guided Person Image Generation” (Liqian Ma, et al., 2017) photographs
of human models striking new poses are generated.
i.23 [Example of GAN-Generated Photographs of Human Poses. Taken from Pose Guided Person Image
Generation, 2017.]
In “Unsupervised Cross-Domain Image Generation” (Yaniv Taigman, et al., 2016)
a GAN model is used to translate images from one domain to another, including
from street numbers to MNIST handwritten digits, and from photographs of
celebrities to what they call emojis or small cartoon faces.
i.24 [Example of Celebrity Photographs and GAN-Generated Emojis. Taken from Unsupervised Cross-Domain
Image Generation, 2016.]
In “Invertible Conditional GANs For Image Editing” (Guim Perarnau, et al., 2016)
a model called IcGAN is used to reconstruct photographs of faces with specific
features, such as changes in hair color, style, facial expression, and even gender.
i.25 [Example of Face Photo Editing with IcGAN. Taken from Invertible Conditional GANs For Image Editing, 2016.]
In “Coupled Generative Adversarial Networks” (Ming-Yu Liu, et al., 2016) the
generation of faces with specific properties such as hair color, facial expression,
and glasses is explored. Images with varied color and depth are also generated.
i.26 [Example of GANs used to Generate Faces With and Without Blond Hair. Taken from Coupled Generative
Adversarial Networks, 2016.]
In “Neural Photo Editing with Introspective Adversarial Networks” (Andrew
Brock, et al., 2016) a face photo editor is presented using a hybrid of variational
autoencoders and GANs. The editor allows rapid realistic modification of human
faces including changing hair color, hairstyles, facial expression, poses, and
adding facial hair.
i.27 [Example of Face Photo Editing with the Neural Photo Editor. Taken from Neural Photo Editing with Introspective Adversarial Networks, 2016.]
In “Image De-raining Using a Conditional Generative Adversarial Network” (He
Zhang, et al., 2017) GANs are used for image editing, including examples such as
removing rain and snow from photographs.
i.28 [Example of Using a GAN to Remove Rain From Photographs. Taken from Image De-raining Using a Conditional Generative Adversarial Network, 2017.]
In “Face Aging With Conditional Generative Adversarial Networks” (Grigory
Antipov, et al., 2017) GANs are used to generate photographs of faces with
different apparent ages, from younger to older.
i.29 [Example of Photographs of Faces Generated With a GAN With Different Apparent Ages. Taken from Face
Aging With Conditional Generative Adversarial Networks, 2017.]
In “Age Progression/Regression by Conditional Adversarial Autoencoder” (Zhifei
Zhang et al., 2017) GANs are used to rejuvenate photographs of faces.
i.30 [Example of Using a GAN to Age Photographs of Faces. Taken from Age Progression/Regression by
Conditional Adversarial Autoencoder, 2017.]
In “GP-GAN: Towards Realistic High-Resolution Image Blending” (Huikai Wu, et
al., 2017) it is demonstrated that GANs can be used to blend photographs,
specifically elements from different photographs such as fields, mountains, and
other large structures.
i.31 [Example of GAN-based Photograph Blending.Taken from GP-GAN: Towards Realistic High-Resolution
Image Blending, 2017.]
In “Photo-Realistic Single Image Super-Resolution Using a Generative
Adversarial Network” (Christian Ledig, et al., 2016) the SRGAN model is used to
generate output images with higher, sometimes much higher, pixel resolution.
i.32 [Example of GAN-Generated Images With Super Resolution. Taken from Photo-Realistic Single Image
Super-Resolution Using a Generative Adversarial Network, 2016.]
In “High-Quality Face Image SR Using Conditional Generative Adversarial
Networks” (Huang Bin, et al., 2017) GANs are used to create high-resolution versions of photographs of human faces.
i.33 [Example of High-Resolution Generated Human Faces. Taken from High-Quality Face Image SR Using
Conditional Generative Adversarial Networks, 2017.]
In “Analyzing Perception-Distortion Tradeoff using Enhanced Perceptual
Super-resolution Network” (Subeesh Vasu, et al., 2018) an example of GANs
creating high-resolution photographs is provided.
i.34 [Example of GAN-Generated Photograph Inpainting Using Context Encoders. Taken from Context Encoders: Feature Learning by Inpainting, 2016.]
In “Semantic Image Inpainting with Deep Generative Models” (Raymond A. Yeh,
et al., 2016) GANs are used to fill in and repair intentionally damaged photographs of
human faces.
i.35 [Example of GAN-based Inpainting of Photographs of Human Faces. Taken from Semantic Image Inpainting with Deep Generative Models, 2016.]
In “Generative Face Completion” (Yijun Li, et al., 2017) GANs are used for
inpainting and reconstructing damaged photographs of human faces.
i.36 [Example of GAN Reconstructed Photographs of Faces. Taken from Generative Face Completion, 2017.]
In “Pixel-Level Domain Transfer” (Donggeun Yoo, et al., 2016) it is demonstrated
that GANs can be used to generate photographs of clothing as may be seen in a
catalog or online store, based on photographs of models wearing the clothing.
i.37 [Example of Input Photographs and GAN-Generated Clothing Photographs. Taken from Pixel-Level Domain
Transfer, 2016.]
In “Generating Videos with Scene Dynamics” (Carl Vondrick, et al., 2016) GANs
are used for video prediction, specifically predicting up to a second of video
frames with success, mainly for static elements of the scene.
i.38 [Example of Video Frames Generated With a GAN. Taken from Generating Videos with Scene Dynamics,
2016.]
In “Learning a Probabilistic Latent Space of Object Shapes via 3D
Generative-Adversarial Modeling” (Jiajun Wu, et al., 2016) GANs are used to
generate new three-dimensional objects (e.g. 3D models) such as chairs, cars,
sofas, and tables.
i.39 [Example of GAN-Generated Three Dimensional Objects. Taken from Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling, 2016.]
In “3D Shape Induction from 2D Views of Multiple Objects” (Matheus Gadelha, et
al., 2016) GANs are used to generate three-dimensional models given
two-dimensional pictures of objects from multiple perspectives.
i.40 [Example of Three-Dimensional Reconstructions of a Chair From Two-Dimensional Images. Taken from 3D
Shape Induction from 2D Views of Multiple Objects, 2016.]
GauGAN, the model on which we will be focusing within this project in order to train our own model for game content generation, belongs to a
subcategory of paired-image-to-image translation models based on semantic
inputs. In “High-Resolution Image Synthesis and Semantic Manipulation with
Conditional GANs” (Ting-Chun Wang, et al., 2017) photorealistic images
generated from semantic images are showcased as examples.
i.41 [Example of Semantic Image and GAN-Generated Cityscape Photograph. Taken from High-Resolution Image
Synthesis and Semantic Manipulation with Conditional GANs, 2017.]
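For context, models of this kind do not consume the label colours directly: the semantic map is first expanded into one binary channel per class and then fed to the generator as conditioning input. A minimal sketch of that encoding step is shown below; the class ids and tensor shapes are illustrative assumptions.

import torch
import torch.nn.functional as F

def encode_label_map(label_map, num_classes):
    # label_map: (H, W) tensor of integer class ids, e.g. 0 = grass, 1 = water, ...
    # Returns a (num_classes, H, W) tensor with one binary mask per class, the
    # kind of conditioning input consumed by semantic image synthesis models.
    one_hot = F.one_hot(label_map.long(), num_classes)   # (H, W, C)
    return one_hot.permute(2, 0, 1).float()              # (C, H, W)

# Example: a tiny 2x3 label map with three classes.
labels = torch.tensor([[0, 0, 1],
                       [2, 1, 1]])
print(encode_label_map(labels, num_classes=3).shape)      # torch.Size([3, 2, 3])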
2.2.3. Unpaired Image to Image Translation
Used when paired data belonging to the input set and the target set do not exist
for training. Images of both sets must be used for training but each input image
does not require a corresponding target image present in the training dataset.
In “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial
Networks” (Jun-Yan Zhu et al., 2017), the CycleGAN model performs an
impressive set of image-to-image translations.
Five image translation cases are shown below as an example:
●Photograph to artistic painting style.
●Horse to zebra.
●Photograph from summer to winter.
●Satellite photograph to Google Maps view.
●Painting to photograph.
i.42 [Four Image-to-Image Translations performed with CycleGAN. Taken from Unpaired Image-to-Image
Translation using Cycle-Consistent Adversarial Networks, 2017.]
i.43 [Example of Translation from Paintings to Photographs With CycleGAN. Taken from Unpaired
Image-to-Image Translation using Cycle-Consistent Adversarial Networks, 2017.]
2.2.4. Image Synthesis using Text
Used to synthesize images using a description of the content of the image to be
synthesized in the form of text. Image synthesis using text models will produce
an image of a particular class that the model is trained for when provided a
detailed description of the image of that particular class. For example, a model
can be created for synthesizing images of birds using a detailed description of
the bird. Images of a particular class along with paired text description must be
provided for training. In “StackGAN: Text to Photo-realistic Image Synthesis with
Stacked Generative Adversarial Networks” (Han Zhang, et al., 2016) it is
demonstrated that GANs can be used to generate realistic looking photographs
from textual descriptions of simple objects like birds and flowers.
i.44 [Example of Textual Descriptions and GAN-Generated Photographs of Birds. Taken from StackGAN: Text to
Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks, 2016.]
“Generative Adversarial Text to Image Synthesis” (Scott Reed, et al., 2016) also
features an interesting example of text to image generation of small objects and
scenes including birds and flowers. In “TAC-GAN – Text Conditioned Auxiliary
Classifier Generative Adversarial Network“ (Ayushman Dash, et al., 2017),
another model is trained on the same dataset, producing similar results.
i.45 [Example of Textual Descriptions and GAN-Generated Photographs of Birds and Flowers. Taken from
Generative Adversarial Text to Image Synthesis.]
In “Learning What and Where to Draw” (Scott Reed, et al., 2016), the capability
of GANs to generate images from text is expanded by using bounding boxes and
key points as hints as to where to draw a described object.
i.46 [Example of Photos of Object Generated From Text and Position Hints With a GAN. Taken from Learning
What and Where to Draw, 2016.]
2.3. Music Generation
Although it does not strictly relate to the scope of this project, music generation is another interesting feat that has been achieved in the field of machine learning. A good example of this can be seen in MuseNet (Payne C., 2019), a neural network trained to predict the next note in a sequence, given an input sequence. It is built on a model called Sparse Transformer (Child R. et al., 2019), which can reconstruct sequences using a concept known as “sparse attention”. Using sparse attention, Sparse Transformers can create not only music, but also images or sequences from any other dataset.
2.4. Terrain Generation using GANs
GANs have also been used for terrain generation. In “Interactive example-based terrain
authoring with conditional generative adversarial networks” (Guérin É. et al., 2017),
Conditional GANs have been used in order to generate terrains using sketches, similarly
to what is the aim of this project. Rivers, mountains and other entities can be drawn and
turned into a 3D terrain. In “A step towards procedural terrain generation with GANs” (Beckham C. et al., 2017) there is another interesting use of GANs for height-map generation, using data from NASA’s Visible Earth project.
In the latter example, the input contains data from satellite images that capture Earth’s
morphology. These inputs are then processed by a DCGAN to generate height maps,
and Pix2Pix to generate textures.
i.47-50 [Examples of Terrains Generated With a GAN. Taken from Interactive example-based terrain authoring with
conditional generative adversarial networks, 2017.]
Chapter 3
Requirements Specification and Design
Based on “Image Synthesis Using Machine Learning Techniques” (Gupta P. et al., 2019)
and “Interactive example-based terrain authoring with conditional generative adversarial
networks” (Guérin É. et al., 2017), it is assumed that a model trained using SPADE may produce the results with the best inception score and the best height-maps, as it is also a conditional GAN, while it can be assumed that Pix2PixHD generates better
textures than its predecessor, Pix2Pix. The research conducted in “A step towards
procedural terrain generation with GANs” (Beckham C. et al., 2017) may thus be adapted
to a more modern framework.
The first fundamental step that has to be taken in order to train SPADE is to obtain a
proper dataset to train on. In order to do so, the ideal plan is to obtain game world data
from a video game, such as Minecraft for example, and 2D images that can be mapped into corresponding 3D landscapes.
While the optimal case would be to start with available game data, in the unlikely event
that the aforementioned condition is not fulfilled the scope of this research would pivot
towards acquiring data. This data could be acquired by capturing game information
from any game that is deemed suitable for terrain generation using GauGAN. Before
defining how to do so, it is important to establish some requirements, as not all games
could be suitable for the scope of this project.
To elaborate more, a suitable game should have level content that features entities that
can be mapped into semantic data. As a semantic tool-brush could generate any
content it is trained to generate, it is important to have entities that can be mapped
using semantic rules. In other words, it is required that the training data is labelled,
associating entities with data representations of them. This can be a taxing task, as sometimes labelling data can only be achieved through human effort; even with automation tools for data scraping, gathering training data can be a heavily time-consuming process.
Although this makes it harder to develop a general tool for games with specific unique
entities (like a game containing a fictional alien planet with particularly uncommon
features), a potential tool can still be really helpful for generating entities that are easier
to find in most games, like natural terrains. Once these fundamental requirements are
sorted out, training activities can be discussed in more detail. Using Minecraft as an
example, natural entities could be easily defined as block types belonging to a certain
kind of biome, while more challenging entities such as man-made artefacts like temples
and villages could be added later on, using conditional generation methods.
3.1. Training Data Set
The first step for training our model, as can be easily guessed, is to acquire suitable
training data out of a game. Luckily this has been kindly provided by Microsoft, using
game save data from fifty Minecraft worlds generated through Project Malmo, a research project centered around artificial intelligence. Specifically, a bot has
been sent to travel around each of these worlds, filling the missing chunks as it walked.
This raw data has then been processed through an open-source software called
Mapcrafter (Moritz Hilscher, 2018). Mapcrafter is a powerful tool that can generate
isometric zoomable views of the processed maps, navigable through a browser-ready
HTML file. As it parses through a save file in order to generate the aforementioned map,
it also creates a nested folder structure that contains images of sections from the map.
These are quite convenient for the training process, as they can indeed be fed to a GAN
as ground truth images.
Last but not least, Mapcrafter allows users to tweak how the map is rendered by specifying certain options, such as lighting, biome rendering, shadow rendering, block masking, and most importantly, custom texturing. These options have been carefully selected in order to generate the training set, along with the labelled set.
The means by which this feat has been achieved will be discussed in the next sections.
3.1.1. Requirements
As mentioned earlier, one of the most important things to do for semantic labeling is the
definition of semantic entities. In the case of GauGAN, such entities are things like
flowers, trees, mountains, rocks, stones, and so on and so forth.
Although Minecraft does indeed feature things that look like geomorphological features, labeling them is neither immediate nor easy. There are no specific data structures that
define mountains or hills, for instance. However, there are certain elements that could
be clustered into semantic entities, like trees. Certain block types for example do appear
clustered together in specific biomes. But let us acquire some context first.
Blocks are building materials that can be used to create structures in Minecraft. They are
the very core of Minecraft’s gameplay, and they feature in all versions of Minecraft. They
can be crafted or can be naturally found in biomes, although it should be noted that
some blocks are exclusive to the Creative Mode. For the scope of this project, we shall
be limiting the block type labeling to the ones that belong to the Overworld. Blocks
belonging to the Nether or the End won’t be considered.
Biomes are areas with specific height, light levels, vegetation, and types of blocks that
could be easily compared to ecosystems we find in real life. Minecraft currently has 34
biomes, listed here quickly for reference, and categorized by temperature:
●Snowy
These biomes are known for their inclusion of Snow and Ice.
○Frozen River
○Ice Plains
○Ice Plains Spikes
○Cold Beach
○Cold Taiga
○Cold Taiga (Mountainous)
●Cold
These biomes are cold, but not cold enough to have snow everywhere.
○Extreme Hills
○Extreme Hills M
○Taiga
○Taiga M
○Mega Taiga
○Mega Spruce Taiga
○Extreme Hills+
○Extreme Hills+ M
○Stone Beach
●Lush
Lush biomes are warm and often contain Flowers.
○Plains
○Sunflower Plains
○Forest
○Flower Forest
○Swamp
○Swamp M
○River
○Beach
○Jungle
○Jungle M
○Jungle Edge
○Jungle Edge M
○Birch Forest
○Birch Forest M
○Birch Forest Hills M
○Roofed Forest
○Roofed Forest M
○Mushroom Island
○Mushroom Island Shore
●Dry
Dry Biomes are very hot, and rarely contain any moisture or any foliage.
○Desert
○Desert M
○Savannah
○Savannah M
○Mesa
○Mesa (Bryce)
○Plateau
○Plateau M
●Neutral
These biomes are either completely filled with Water, or are variants that differ depending on the biome they appear in.
○Ocean (Variants)
○Hills (Variants)
As can be seen, biomes are not trivial entities to encode in labels. While such environmental diversity is certainly appreciable, it comes with a fair share of problems when trying to establish semantic entities. Different biomes, in fact, require different
semantic labels. An entity like “grass” would require a different label for every different
hue of grass existing across biomes. The same thought process can be applied to
anything that looks different in a particular biome.
Although it is noted that a more complex labeling can doubtlessly be achieved, for the
scope of this project biome diversity has been excluded from the training requirements,
by using a rendering option in Mapcrafter that can turn off biome rendering.
Given the aforementioned issue and the fact that blocks are usually defined by a small
subset of associated textures, entities have thus been clustered using a different ad-hoc
structure, i.e. by custom block clustering, as follows:
i.51 [Semantic Entity Block Clustering used for training]
i.52 [Semantic Entity Block Clustering used for training]
i.53 [Semantic Entity Block Clustering used for training]
i.54 [Semantic Entity Block Clustering used for training]
i.55 [Semantic Entity Block Clustering used for training]
i.56 [Semantic Entity Block Clustering used for training]
The elements listed in the block section are texture references, matching the names of
the textures in Mapcrafter. This has been a fundamental step in order to achieve
labeling. The process will be described in more detail in the following section.
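A sketch of how such a clustering could be represented in code is given below; the cluster names, texture file names and label colours are illustrative placeholders, while the actual clusters are the ones listed in i.51-56.

# Hypothetical mapping from semantic entities to Mapcrafter texture names and the
# RGB colour chosen to label them; the real clusters used for training are the ones
# in the Semantic Entity Block Clustering tables above.
SEMANTIC_CLUSTERS = {
    "grass": {"color": (0, 255, 0),     "textures": ["grass_top.png", "tallgrass.png"]},
    "water": {"color": (0, 0, 255),     "textures": ["water_still.png"]},
    "sand":  {"color": (255, 255, 0),   "textures": ["sand.png"]},
    "stone": {"color": (128, 128, 128), "textures": ["stone.png", "cobblestone.png"]},
}

def label_color_for(texture_name):
    # Look up which semantic label colour a given block texture should receive.
    for entity in SEMANTIC_CLUSTERS.values():
        if texture_name in entity["textures"]:
            return entity["color"]
    return (0, 0, 0)  # unlabeled textures are painted black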
3.1.2. Acquiring the Data Set
In order to train a GauGAN-like model using SPADE, two kinds of inputs are required.
The first kind of input is a real image, the kind of final result we want to obtain, or
replicate. The second input, on the other hand, has to be a labeled version of the same
image. Getting this labeled image set is not trivial.
GauGAN, for instance, has been trained on a dataset known as COCO-Stuff, an open-source repository of labelled images. The labelling of that dataset was achieved through a combination of human labor and computer vision-based algorithms that can distinguish distinct objects to a certain degree. As such resources were not available in the short span of time in which the project was developed, alternative methods for labeling had to be sought.
As the first step, in order to label the dataset, all the maps have been processed through
Mapcrafter by using two custom Python scripts. These scripts generated a configuration file and executed a Mapcrafter command with the generated configuration for each of the maps in the Minecraft game save dataset. These generated configuration
templates are shown here for reference.
Mapcrafter original map generator sample configuration file:
output_dir = ..\output_original\***
[global:map]
world = world
render_view = isometric
render_mode = daylight
render_biomes = false
rotations = top-left
[world:world]
input_dir = ..\minecraft-java-raw-worlds\***
[map:map_world]
name = World
world = world
render_view = isometric
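A minimal sketch of what such a driver script might look like is given below; the folder names and the exact Mapcrafter invocation are assumptions made for illustration, and the actual scripts are referenced in the Appendix.

import os
import subprocess

# Template mirroring the configuration shown above; {world_name} is filled in per world.
CONFIG_TEMPLATE = """output_dir = ..\\output_original\\{world_name}
[global:map]
world = world
render_view = isometric
render_mode = daylight
render_biomes = false
rotations = top-left
[world:world]
input_dir = ..\\minecraft-java-raw-worlds\\{world_name}
[map:map_world]
name = World
world = world
render_view = isometric
"""

worlds_dir = "../minecraft-java-raw-worlds"
for world_name in os.listdir(worlds_dir):
    # Write one configuration file per Minecraft world...
    config_path = f"config_{world_name}.conf"
    with open(config_path, "w") as f:
        f.write(CONFIG_TEMPLATE.format(world_name=world_name))
    # ...then invoke Mapcrafter on it, using 12 render threads per map.
    subprocess.run(["mapcrafter", "-c", config_path, "-j", "12"], check=True)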
Although we won’t dive deep into how Mapcrafter works for the scope of this project, it
should be noted that further documentation is available at mapcrafter.org. Nonetheless,
the most relevant options that have been chosen to generate these files will be briefly
discussed here.
As can be seen in the original map generator sample configuration file, we have
specified certain options in the global map section related to view, mode, biomes, and
rotations.
Mapcrafter labeled map generator sample configuration file:
output_dir = ..\output_original\***
background_color = #000000
[global:map]
world = world
render_view = isometric
render_mode = plain
render_biomes = false
rotations = top-left
[world:world]
input_dir = ..\minecraft-java-raw-worlds\***
[map:map_world]
name = World
world = world
texture_dir = data\labeled_textures
lighting_intensity = 0.0
render_view = isometric
Here’s a quick rundown on the configuration options:
render_view = isometric | topdown
Default: isometric - This is the view that your world is rendered from. You
can choose from different render views:
isometric
A 3D isometric view looking at north-east, north-west, south-west or
south-east (depending on the rotation of the world).
topdown
A simple 2D top view. This view could be used to generate an alternative
training set. Its potential for training will be discussed in a later section.
render_mode = plain|daylight|nightlight|cave
Default: daylight - This is the render mode to use when rendering the world.
Possible render modes are:
plain
Plain render mode without lighting or other special effects. This was chosen
for the labeled set rendering. As the labeling doesn't require any particular kind of lighting effect, it was deemed the optimal choice.
daylight
Renders the world with lighting. This was selected to generate the real
images, with high quality shadows.
nightlight
Like daylight, but renders at night. This could be potentially used for training
to generate a night version of the same maps.
cave
Renders only caves and colors blocks depending on their height to make
them easier to recognize. This is not suitable for training, and it is hard to
label.
render_biomes = true|false
Default: true - This setting makes the renderer use the original biome colors
for blocks like grass and leaves. It was important to set this feature to false in the
original map configuration file, in order to exclude biome complexity. Although
the latter can be kept, it requires a deeper understanding of how textures are
processed in Mapcrafter.
rotations = [top-left] [top-right] [bottom-right] [bottom-left]
Default: top-left - This is a list of directions to render the world from. You can
rotate the world by n*90 degrees. Later in the output file you can interactively
rotate your world. Possible values for this space-separated list are: top-left,
top-right, bottom-right, bottom-left. Top left means that north is on the top left
side on the map (same thing for other directions).
texture_dir = data\labeled_textures
This is the directory with the Minecraft Texture files. Labeled images have been
generated by modifying this value in the labeled map configuration file. The new
folder address contains a labelled version of the textures. Here’s a screenshot of
the content of that folder:
i.57 [Screenshot of the labeled texture folder content ]
The process to generate these textures is simple. Given the texture elements that belong to a certain block, and given the labeled cluster of blocks to which the aforementioned block belongs, we choose a color to represent the label, and we change every non-empty pixel in the textures to that color. Due to a lack of time, this has been quickly achieved by using Photoshop and Microsoft Paint, although it is believed that the labeling process could be improved by algorithmic means. Knowing what textures are associated with a
block type, some simple software tool could easily be made in order to select
certain blocks, cluster them into an arbitrary entity, choose the entity’s color, and
swap all selected textures’ colors with such semantic label color. Textures that
are not to be labeled have been swapped with a black texture, to avoid artifacts
from being visible in the map view.
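A sketch of how this texture re-colouring could be automated with Pillow instead of manual editing is shown below; the folder names and the label colour are placeholders.

from PIL import Image
import os

def relabel_texture(src_path, dst_path, label_color):
    # Replace every non-transparent pixel with the flat semantic label colour,
    # keeping the alpha channel so block shapes are preserved.
    img = Image.open(src_path).convert("RGBA")
    pixels = img.load()
    for y in range(img.height):
        for x in range(img.width):
            r, g, b, a = pixels[x, y]
            if a != 0:  # non-empty pixel
                pixels[x, y] = (*label_color, a)
    img.save(dst_path)

# Example: paint every grass texture with a flat green label colour.
for name in ["grass_top.png", "tallgrass.png"]:
    relabel_texture(os.path.join("data/textures", name),
                    os.path.join("data/labeled_textures", name),
                    label_color=(0, 255, 0))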
i.58-59 [Full map images generated using the map processing script. The map in the top image shows the real images. The
map in the bottom shows labeled maps. ]
On each iteration of the map processing script, a real map and an equivalent labeled map are generated in a new directory. Once all the maps are generated, a Python script processes the folders containing each map in order to generate one large image per map, obtaining a full representation of the map in a single image.
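A sketch of that stitching step using Pillow is shown below; the tile size and the flat list of tile paths are assumptions, as Mapcrafter's actual output is a nested folder structure that first has to be walked to build such a list.

from PIL import Image

def stitch_tiles(tile_paths, cols, rows, tile_size=512):
    # Paste individual map tiles into one large image. tile_paths is assumed to be
    # ordered row by row, left to right.
    full = Image.new("RGB", (cols * tile_size, rows * tile_size))
    for index, path in enumerate(tile_paths):
        col, row = index % cols, index // cols
        full.paste(Image.open(path), (col * tile_size, row * tile_size))
    return full

# Hypothetical usage:
# stitch_tiles([f"tiles/{r}_{c}.png" for r in range(4) for c in range(4)],
#              cols=4, rows=4).save("world_full.png")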
These large images are then processed through a third party software called
IrfanView to make them suitable for training, and to generate the dataset that
our GAN model, SPADE, will use to train.
IrfanView (Irfan Skiljan, 2020) is a powerful tool for processing images; it is quite convenient for quickly applying a huge variety of graphical effects to an image or, most importantly, for batch processing images. This tool was indeed
fundamental in order to obtain the labeled images, as changing the textures in
Mapcrafter to render the labeled maps is not enough to obtain a properly labeled
GAN-ready dataset.
Mapcrafter, in fact, applies shadows to textures, and there’s no trivial way of
disabling this feature through the configuration options. Due to this, the labeled
images generated from Mapcrafter are like the one shown below.
i.60 [Unprocessed Labeled Map ]
This is quite problematic. Each color chosen for labeling ends up accidentally generating up to two extra hues of the original label color. As can be seen in the
image, there is one hue for each side that is illuminated. This makes the data
unsuitable for training, due to there being too many colors.
An efficient solution for labeling would indeed be to look through the code of
Mapcrafter (as it’s an open source project), find how shadows are generated, and
get rid of any line of code that renders the extra shadows. Although this option
has been evaluated, due to technical requirements and Mapcrafter’s complexity it
has been deemed inefficient to try to pursue this solution in a short amount of
time. Instead, the labeled full maps have been processed all together through
IrfanView, by using an option that allows image batch processing.
i.61 [On the left, File>Batch Conversion/Rename option, from
IrfanView]
i.62 [Above, File>Batch Conversion/Rename>Advanced
button, from IrfanView]
Although some attempts had previously been made to create a script that removes the extra colors efficiently by using OpenCV and nearest-neighbour color processing algorithms, IrfanView revealed itself to be a more efficient solution than expected. Why reinvent the wheel when the wheel already exists? The batch processing option allows users to select a set of images and process them
using specific conditions. In our case, the only relevant condition that was
selected in order to generate a properly labeled image set was the “Replace
color” function.
i.63 [File>Batch Conversion/Rename>Advanced>Replace color>Settings button, from IrfanView]
The “Replace Color” function can not only replace one color with another, but
also replace multiple colors together by using a “tolerance value”, similarly to
what has been attempted by using the nearest-neighbour algorithm.
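For reference, the abandoned script-based approach essentially amounted to snapping every pixel to the nearest label colour. A sketch of that idea with NumPy is shown below; the label colours are placeholders.

import numpy as np
from PIL import Image

def snap_to_labels(image_path, label_colors):
    # Map every pixel to the closest label colour, collapsing the extra hues
    # introduced by Mapcrafter's per-face shading into a single flat colour.
    img = np.asarray(Image.open(image_path).convert("RGB"), dtype=np.float32)
    palette = np.asarray(label_colors, dtype=np.float32)           # (K, 3)
    # Squared distance from every pixel to every label colour: (H, W, K)
    dist = ((img[:, :, None, :] - palette[None, None, :, :]) ** 2).sum(axis=-1)
    nearest = dist.argmin(axis=-1)                                 # (H, W)
    return Image.fromarray(palette[nearest].astype(np.uint8))

# Hypothetical usage with three placeholder label colours:
# snap_to_labels("labeled_map.png",
#                [(0, 255, 0), (0, 0, 255), (255, 255, 0)]).save("clean_labels.png")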
As only 13 labels have been chosen as semantic entities for the scope of this
project, this color swapping process has been done manually for each label
across all maps, but it should also be noted that the same process can be
achieved in a more efficient way. According to IrfanView’s documentation in fact,
custom scripts can be written in order to do batch processing, by using a .ini file.
Although this hasn’t been done for this project, it is suggested that gathering
further insight on how to process image batches using .ini files may lead to a
more efficient way to process images, especially if the number of labels is
greater.
i.64 [File>Batch Conversion/Rename>Advanced>Replace color>Settings>Replace Color window, from IrfanView]
Once the properly labeled images have been obtained, the large full-map images have been split again into squares using another IrfanView option that turns images into tiles. This last processing step has been run on both the original maps and the labeled maps. The final result is a GAN-ready dataset, perfectly suitable for training!
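The equivalent of that tiling step, sketched with Pillow (the 256x256 tile size and the output paths are assumptions), would look as follows; the same function must be run on both the real and the labeled full maps so that the tiles stay paired.

from PIL import Image

def split_into_tiles(image_path, out_prefix, tile_size=256):
    # Cut a large full-map image into square tiles suitable for training.
    img = Image.open(image_path)
    index = 0
    for top in range(0, img.height - tile_size + 1, tile_size):
        for left in range(0, img.width - tile_size + 1, tile_size):
            tile = img.crop((left, top, left + tile_size, top + tile_size))
            tile.save(f"{out_prefix}_{index:05d}.png")
            index += 1

# Hypothetical usage:
# split_into_tiles("world_full.png", "dataset/train_img/world01")
# split_into_tiles("world_full_labeled.png", "dataset/train_label/world01")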
3.1.3. Improving the Data Set
The dataset that has been generated for the training is the one obtained through the
process described in the previous section. However, some relevant information concerning the generation of the labeled set, and how it could be improved, is reported in this section as well.
As can be seen in the Semantic Entity Block Clustering images (i.51-56), some blocks
belonging to multiple biomes have been clustered together under certain semantic
entities in the labeling process. Most shrubbery has been clustered under “Grass”, for
instance. It should be mentioned though that patches of tall grass, flowers and cacti are
technically small features shared across different biomes. This is not a trivial notion, as
this would cause labeling issues if the biome rendering feature was turned on. By
turning off such a feature, this problem of oddly clustered textures has been avoided,
but not solved as it could have been. While at first labeling all these textures seemed a
good idea, an issue came to light during the training process, as cacti appear in two different biomes, the regular sand desert and the red sand desert. As cacti were included in the regular sand cluster, a first naive attempt at generating labels produced some odd-looking artifacts in red desert biomes: cacti appeared as “regular sand”-labeled objects scattered across the red sand biome, even though cacti are part of this biome as well. The reason why this wasn't noticed at first is that
the red sand biome is rare, and appears only in three maps, among the fifty provided by
Microsoft.
An efficient solution to this issue, as well as for anything that is shared across different
biomes, is to turn off the rendering of anything that is not needed, by using a Mapcrafter
configuration option.
block_mask = <block mask>
Default: show all blocks
With the block mask option it is possible to hide or show only specific blocks.
The block mask is a space separated list of block groups you want to hide/show.
If a ! precedes a block group, all blocks of this block group are hidden, otherwise
they are shown. Per default, all blocks are shown. Possible block groups are:
● All blocks:
○*
●A single block (independent of block data):
○[blockid]
●A single block with specific block data:
○[blockid]:[blockdata]
●A range of blocks:
○[blockid1]-[blockid2]
●All blocks with a specific id and (block data & bitmask) ==
specified data:
○[blockid]:[blockdata]b[bitmask]
For example:
●Hide all blocks except blocks with id 1,7,8,9 or id 3 / data 2:
○!* 1 3:2 7-9
●Show all blocks except jungle wood and jungle leaves:
○!17:3b3 !18:3b3
○Jungle wood and jungle leaves have id 17 and 18 and use
data value 3 for first two bits (bitmask 3 = 0b11)
○other bits are used otherwise -> ignore all those bits
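For instance, a sketch of how the labeled map configuration could hide cacti entirely (assuming cacti keep their classic block id 81) would be:

[map:map_world]
name = World
world = world
texture_dir = data\labeled_textures
block_mask = !81
render_view = isometric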
As can be guessed, the block mask configuration option has a lot of potential, especially
if biome-enabled labeling is desired.
Last but not least, the dataset can definitely be expanded by rendering the other rotations as well using Mapcrafter, as the currently available real and labeled dataset is solely based on the top-left view. This would increase the size of the image dataset up to fourfold, with one rendering per rotation (top-right, top-left, bottom-right, bottom-left). This entire set could then be further expanded by batch processing the image set and generating a mirrored view of the same images.
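A sketch of that mirroring step with Pillow is given below (the folder paths are placeholders); it must be applied identically to the real and the labeled tiles so that the pairs stay aligned.

from PIL import Image, ImageOps
import glob
import os

def mirror_dataset(folder):
    # Add a horizontally mirrored copy of every tile, doubling the dataset.
    for path in glob.glob(os.path.join(folder, "*.png")):
        mirrored = ImageOps.mirror(Image.open(path))
        root, ext = os.path.splitext(path)
        mirrored.save(root + "_mirror" + ext)

# Hypothetical usage on paired folders:
# mirror_dataset("dataset/train_img")
# mirror_dataset("dataset/train_label")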
The implication is that the dataset used for training for the scope of this project could be expanded up to 16 times. Although it would have been desirable to train on such a larger dataset, this was sadly not possible, due to limitations that will be discussed in the next sections.
3.2. Compute Power
Compute power is a non-negotiable requirement for machine learning. Anybody who has ever spent time training a neural network knows how terribly slow a training process can be, and having more computational power can definitely speed up the process. For the scope of this project the accessible resources were a local machine, a remote server provided by Abertay University, and 5000 credits of compute power on Microsoft Azure provided by Microsoft, with a few usage limitations on the credits.
The local machine has the following specifications:
●Operating System: Windows 10 Professional
●CPU: AMD Ryzen 5 1600 Six-Core Processor, 3400 MHz, 6 cores,
12 logical processors
●Motherboard: B350 TOMAHAWK (MS-7A34)
●GPU: NVIDIA GTX 1060 6GB
●RAM: 8 GB DDR4
Abertay’s remote machine, also known as Talisker, has been provided with the
following specifications:
●Architecture: x86_64
●CPU op-mode(s): 32-bit, 64-bit
●Byte Order: Little Endian
●CPU(s): 40
●On-line CPU(s) list: 0-39
●Thread(s) per core: 2
●Core(s) per socket: 10
●Socket(s): 2
●NUMA node(s): 2
●Vendor ID: GenuineIntel
●CPU family: 6
●Model: 79
●Model name: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
●Stepping: 1
●CPU MHz: 1297.800
●CPU max MHz: 3100.0000
●CPU min MHz: 1200.0000
●BogoMIPS: 4390.24
●Virtualisation: VT-x
●L1d cache: 32K
●L1i cache: 32K
●L2 cache: 256K
●L3 cache: 25600K
●NUMA node0 CPU(s): 0-9,20-29
●NUMA node1 CPU(s): 10-19,30-39
The Azure Virtual Machines have been configured with these options.
Virtual machine
●Computer name: VM-02
●Operating system: Linux (ubuntu 18.04)
●SKU: 1804
●Publisher: microsoft-dsvm
●VM generation: V1
●Agent status: Ready
●Agent version: 2.2.50
●Host: None
●Proximity placement group: N/A
●Colocation status: N/A
Availability + scaling
●Availability zone: N/A
Extensions:
●DependencyAgentLinux
●OMSAgentForLinux
Size
●Standard NC24r_Promo
●vCPUs: 24
●RAM: 224 GiB
Each Mapcrafter map generation can use at most 12 threads. Multiplying this by the
number of maps, at least 12 × 50 = 600 threads are required to generate the maps.
This value has to be doubled, since a real map and a labeled map are generated for
each of the maps present in the Minecraft game-save dataset, bringing the number of
possible concurrent threads to 1200.
Last but not least, an improved version of the dataset could be generated by rendering
all four isometric viewpoints, which raises the number of possible concurrent threads to
4800. If Mapcrafter were modded to generate the labeled images directly, no further
processing would be required. As such a feat has not been achieved at the current state
of the work, further image processing is still needed.
An image batch-processing step therefore needs to be run to turn the rendered images into
properly labeled ones. IrfanView can also be used to double the size of the dataset by
generating mirrored views; a scripted alternative for the labeling step is sketched below.
3.2.2 Limitations
During training it was observed that Talisker processes images faster, as it has more
powerful GPUs, but because only a small number of GPUs are available it cannot process
many images in parallel. The problem on Azure is the opposite: more images can be
processed in parallel, but processing takes considerably longer, since the GPUs are less
powerful than the ones on Talisker.
3.2.3 Possibilities
Due to a lack of time more computational power couldn’t be provided, nor could there
be generated a larger dataset with the data generation tweaks discussed at the end of
the Compute Power section, as having time to process a dataset 16 times larger than
the one used for the scope of this project would’ve been unrealistic. Nonetheless, the
labeling process can also not only be improved, but the dataset can also be expanded in
other ways as well. Mapcrafter in fact has a dimension configuration option, beside a
biome rendering option.
dimension = nether|overworld|end
Default: overworld - You can specify with this option the dimension of the
world Mapcrafter should render. If you choose The Nether or The End,
Mapcrafter will automatically detect the corresponding region directory.
The “dimension” option has been left at its default value of “overworld” for the current
training but, similarly to the “night” option of the rendering mode, it can come in handy
for expanding the labeling set and training the model to render different kinds of maps;
a sketch of such a map section is shown below. The Nether, in fact, features its own biomes,
and although it is visually less interesting, training could also be done on the End dimension.
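For instance, an additional map section for the Nether could be declared as in the sketch below; the map and world names are placeholders, while the dimension value itself comes from the documentation quoted above.

[map:example_nether_map]
name = Example Nether map
world = example_world
dimension = nether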
Labeling Workflow Diagram
The diagram shown on the left summarizes all the steps described in the sections above.
The code used in all of the steps, including links to the open-source repositories used
for this project, can be found at the end of this document, in the Appendix section.
Although this workflow relies on IrfanView for the image processing required for labeling,
it is important to note that this processing can be avoided if Mapcrafter is tweaked not
to render shadows. Achieving this is not trivial, but it is certainly possible.
While technical limitations have prevented the use of what is believed to be the most
effective technique for obtaining a labeled dataset from Minecraft, the alternative
solution that has been used is still reported in this workflow, in order to allow
replication of the results obtained in this project.
Chapter 4
Implementation
The architecture behind GauGAN is called SPADE, and it is the core of the training
model. It is an image-processing layer that uses spatially-adaptive normalization to
synthesize images from an input semantic layout. Using this open-source model
provided by (Park T. et al., 2019) and Nvidia, the goal of this project is to train SPADE
on the data discussed in the previous sections. The training data should contain
images of the terrains seen from above, as well as a semantic labeling of those images.
Example training datasets can be seen in the SPADE repository: COCO-Stuff,
Cityscapes and ADE20K are useful references for what a proper training dataset
should look like. Alternatively, the Visible Earth project from NASA seems another
viable option to consider. Once a proper training dataset is acquired, the training phase
begins. This phase tends to be somewhat erratic, as tuning a training model and
balancing its weights are processes prone to trial and error. Nonetheless, given enough
time, suitable tuning values can be found. Ideally, once the model is properly trained on
a good dataset, it will be able to generate 3D height-map textures from semantic inputs,
in a way that is similar to what GauGAN does.
4.1. SPADE
i.65 [SPADE batch normalization. Taken from Semantic Image Synthesis with Spatially-Adaptive Normalization]
In many common normalization techniques such as Batch Normalization (Ioffe et al.,
2015), there are learned affine layers (as in PyTorch and TensorFlow) that are applied
after the actual normalization step. In SPADE, the affine layer is learned from semantic
segmentation maps. This is similar to Conditional Normalization (De Vries et al., 2017
and Dumoulin et al., 2016), except that the learned affine parameters need to be
spatially-adaptive, which means SPADE uses different scaling and bias for each
semantic label. Using this simple method, semantic signals can act on all layer outputs,
unaffected by the normalization process which may cause loss of such information.
Moreover, because the semantic information is provided via SPADE layers, random
latent vectors may be used as input for the neural network, which can be used to
manipulate the style of the generated images (Park T. et al., 2019). In order to train the
model, Nvidia's open-source repository has been cloned from GitHub; it is available at
https://github.com/NVlabs/SPADE
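To make the mechanism concrete, the block below is a simplified PyTorch sketch of a SPADE layer. It is not the repository's actual implementation (which lives under models/networks/ and differs in several details): a parameter-free batch normalization is followed by a per-pixel scale (gamma) and bias (beta) predicted from the downsampled segmentation map.

import torch.nn as nn
import torch.nn.functional as F

class SPADESketch(nn.Module):
    # Simplified sketch: normalize without learned affine parameters, then
    # modulate each pixel with a scale and bias predicted from the semantic map.
    def __init__(self, feature_channels, label_channels, hidden_channels=128):
        super().__init__()
        self.param_free_norm = nn.BatchNorm2d(feature_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(label_channels, hidden_channels, kernel_size=3, padding=1),
            nn.ReLU())
        self.gamma = nn.Conv2d(hidden_channels, feature_channels, kernel_size=3, padding=1)
        self.beta = nn.Conv2d(hidden_channels, feature_channels, kernel_size=3, padding=1)

    def forward(self, features, segmap):
        normalized = self.param_free_norm(features)
        # Resize the one-hot segmentation map to the current feature resolution.
        segmap = F.interpolate(segmap, size=features.shape[2:], mode='nearest')
        activations = self.shared(segmap)
        return normalized * (1 + self.gamma(activations)) + self.beta(activations)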
Instructions to set up the training are reported from the official documentation. New
models can be trained with the following commands.
1. Prepare a dataset.
To train on custom datasets, the easiest way is to use ./data/custom_dataset.py
by specifying the option --dataset_mode custom, along with --label_dir
[path_to_labels] --image_dir [path_to_images]. You also need to specify options
such as --label_nc for the number of label classes in the dataset,
--contain_dontcare_label to specify whether it has an unknown label, or
--no_instance to denote the dataset doesn't have instance maps.
2. Train.
To train on your own custom dataset the command pattern to use is the
following: python train.py --name [experiment_name] --dataset_mode custom
--label_dir [path_to_labels] --image_dir [path_to_images] --label_nc
[num_labels]
E.g., for the training of our model:
python train.py --name <insert_name> --dataset_mode custom --label_dir
./datasets/027/train_label --image_dir ./datasets/027/train_img --label_nc 13
--no_instance --contain_dontcare_label --gpu_ids 0,1,2,3 --batchSize 4
There are many options that can be specified. More instructions can be accessed by
typing python train.py --help. The specified options are printed to the console. To
specify the number of GPUs to utilize, use --gpu_ids. If you want to use the second and
third GPUs for example, use --gpu_ids 1,2.
To log training, use --tf_log for Tensorboard. The logs are stored at
[checkpoints_dir]/[name]/logs.
Testing is similar to testing pretrained models.
python test.py --name [name_of_experiment] --dataset_mode [dataset_mode]
--dataroot [path_to_dataset]
Use --results_dir to specify the output directory. --how_many will specify the maximum
number of images to generate. By default, it loads the latest checkpoint. It can be
changed using --which_epoch.
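By analogy with the training command above, a test run on the custom dataset of this project might look as follows; the experiment name and the paths are placeholders, and the exact flags should be verified against python test.py --help.

python test.py --name <insert_name> --dataset_mode custom --label_dir
./datasets/027/test_label --image_dir ./datasets/027/test_img --label_nc 13
--no_instance --contain_dontcare_label --results_dir ./results/ --which_epoch latest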
Code Structure Reference:
●train.py, test.py: the entry point for training and testing.
●trainers/pix2pix_trainer.py: harnesses and reports the progress of training.
●models/pix2pix_model.py: creates the networks and computes the losses
●models/networks/: defines the architecture of all models
●options/: creates option lists using the argparse package. More options are added
dynamically in other files as well (see the Options paragraph below).
●data/: defines the class for loading images and label maps.
Options:
Some options belong to only one specific model, while others have different default
values depending on other options. The BaseOption class dynamically loads and sets
options depending on what model, network, and datasets are used.
Training has been done on 90% of the input dataset (4145 images per world, randomly
sampled).
Here’s some additional information gathered during the implementation process.
●On the Azure Nodes an epoch (i.e. an entire cycle through the input dataset)
takes 30 minutes, so training takes around 30 hours on the Azure Virtual
Machines.
●On Talisker, using a 1080 Ti and a Titan X, training takes 14-16 hours.
Testing has been done using only 10% of the dataset (461 images randomly sampled).
This set was then used during the evaluation process to gather further results on the
quality of the model.
Default parameters used for training:
● Epochs: 50
● Learning rate: 0.0002
● Optimizer: Adam, used for both the generator and the discriminator, with β1 = 0.5 and β2 = 0.999
● Model: pix2pix
● Generator network (netG): SPADE
● ngf (number of convolution filters in the first generator layer): 64
● Worlds trained: 20 out of 50
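For reference, these defaults roughly correspond to the training invocation below. The flag names are assumptions based on the pix2pix-style option system used by the repository and should be double-checked against python train.py --help before use.

python train.py --name <insert_name> --dataset_mode custom --label_dir
./datasets/027/train_label --image_dir ./datasets/027/train_img --label_nc 13
--no_instance --contain_dontcare_label --gpu_ids 0,1,2,3 --batchSize 4
--niter 50 --lr 0.0002 --beta1 0.5 --beta2 0.999 --ngf 64 --netG spade --model pix2pix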
Chapter 5
Evaluation
According to SPADE's open-source repository, the results of the model trained by
Nvidia, GauGAN, as reported in the paper, can be replicated using an NVIDIA DGX-1
machine with 8 V100 GPUs, i.e. 128 GB of total GPU memory. A benchmark reference
for the training time could, in principle, be derived from this information by considering
the size of the dataset used for training and the number of labels. However, as the time
required for training is not reported in the paper, and as the resources needed to replicate
such results were not available, these benchmark statistics could not be included in the
quantitative assessment. Other means of evaluation have therefore been found and used
to assess the training results.
Results are evaluated using the scoring methodology of (Gupta P. et al., 2019), i.e. the
Fréchet Inception Distance (FID) of the generated game worlds.
Further useful feedback could have been gathered through qualitative assessment with
surveys, but the extraordinary circumstances in which this dissertation was written,
namely a global pandemic, made this impractical. Nonetheless, there is still plenty of
data for quantitative assessment, which also provides insight into how the model could
be improved. Furthermore, once the model is trained, a tool like GauGAN could be
developed in order to generate 3D game worlds. While this is beyond the scope of this
project, it is worth noting how much can be built on top of a successfully trained model.
5.1. Fréchet Inception Distance
In GANs, the objective function for the generator and the discriminator estimates how
good they are at fooling each other. For example, we measure how well the generator is
capable of generating content that the discriminator finds close to the Ground Truth
image. However, this is not a good metric for measuring image quality or diversity.
A common evaluation metric is the Inception Score. It uses two criteria in measuring the
performance of GAN: the quality of the generated images, and their diversity.
Entropy can be viewed as randomness. If the value of a random variable x is highly
predictable, it has low entropy. On the contrary, if it is highly unpredictable, the entropy
is high. In the Fréchet Inception Distance, or FID, the Inception network is used to extract
features from an intermediate layer. Then the data distribution of these features is
modeled as a multivariate Gaussian with mean µ and covariance Σ. The FID between the
real images x and the generated images g is computed as

FID(x, g) = \lVert \mu_x - \mu_g \rVert_2^2 + \mathrm{Tr}\left( \Sigma_x + \Sigma_g - 2 (\Sigma_x \Sigma_g)^{1/2} \right)

where Tr sums the diagonal elements of a matrix. FID is more robust to noise than the classic
Inception Score. If the model generates only a single image per class, the distance will
be high, so FID is a better measurement of image diversity. FID has a rather high bias
but low variance. By computing the FID between a training dataset and a testing
dataset, we would expect the score to be close to zero, since both consist of real images;
however, running the test with different batches of training samples yields different FID
values. This characteristic has been used to establish a threshold called World-variance
FID, reported as the Reference FID in the table shown in the next section. This value
measures the mean distance between worlds, or, in layman's terms, "how much the
worlds vary from each other". As it was calculated over the 20 worlds used for training,
the number of unique pairwise combinations was 20 · 19 / 2 = 190.
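For reference, the distance between two fitted Gaussians can be computed with a few lines of NumPy/SciPy. The sketch below is illustrative only: extraction of Inception features (and the estimation of the means and covariances) is assumed to have happened beforehand, and the function name is a placeholder.

import numpy as np
from scipy import linalg

def frechet_distance(mu_x, sigma_x, mu_g, sigma_g):
    # Fréchet distance between two multivariate Gaussians fitted to
    # Inception features of real (x) and generated (g) images.
    diff = mu_x - mu_g
    covmean, _ = linalg.sqrtm(sigma_x.dot(sigma_g), disp=False)
    if np.iscomplexobj(covmean):
        # Numerical error can introduce a negligible imaginary component.
        covmean = covmean.real
    return diff.dot(diff) + np.trace(sigma_x + sigma_g - 2.0 * covmean)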
5.2. Summary of Results
The following table has been partially filled with results obtained from training.
Epoch | Reference FID (average reference FID of n=190) | Training FID score (real images vs. training synthesized images) | FID score on test images
10 | 93.099 | 41.471 | 77.515
15 | 93.099 | 43.761 | N/A
20 | 93.099 | 39.125 | N/A
25 | 93.099 | 42.656 | N/A
30 | 93.099 | 41.669 | 88.498
35 | 93.099 | 43.662 | N/A
40 | 93.099 | 34.898 | 72.169
45 | 93.099 | 36.656 | N/A
50 | 93.099 | 40.367 | 77.473
As can be noticed, the FID score between the synthesized images and the real images
decreases over the course of training, although slight spikes occur every now and then;
it is also expected that unseen images from the test set obtain a higher FID score.
Considering the partial results that could be extrapolated, and the downward trend over
time, training can be considered moderately successful given the limits of this first
attempt. The increases in FID towards the end are assumed to be related to attempts at
generating certain artifacts present in Minecraft. Some visual results are also included
below for qualitative assessment.
i.66 [Training results sample at epoch 35]
i.67 [Training results sample at epoch 35]
i.68 [Training results sample at epoch 35]
i.69 [Training results sample at epoch 35]
i.70 [ Training results sample at epoch 35]
As can be seen, although SPADE struggles to extrapolate the features of less common
biomes (in particular the red sand label, or villages), it is still able to offer a quasi-realistic
representation. It is assumed that a larger dataset featuring more red desert biomes
would improve content generation, as the generation of the more frequent biomes
already shows promising results.
The increase of the FID score over the last epochs can be explained: the GAN model is
starting to acquire the ability to encode features learned across different images, and is
thus beginning to diverge from the original images it has been trained on, while still
maintaining the ability to generate features related to the labels. While this is ultimately
the result we want to obtain, it is always a good idea to check the visual results at each
epoch, in order to avoid situations where unrealistic artifacts are generated.
i.71 [ Real Image sample at epoch 50]
i.72 [Synthesized sample at epoch 50]
Chapter 6
Conclusion
If trained correctly, it is assumed that a SPADE model could not only provide a tool like
GauGAN for game-specific world generation (depending on the dataset the model is
trained on), but it is also believed that it could go beyond such limitations and generate
all-purpose content for multiple other games. It is firmly believed that a semantic
tool-brush based terrain generator could greatly improve the work of artists and game
designers.
Procedurally generated content, for instance, always carries a certain degree of
randomness that cannot be controlled. While on one hand it allows a lot of content to be
generated, it does not give designers much power over content creation. A tool like the
one discussed in this dissertation, on the other hand, could greatly improve the ability of
content creators to shape their own creations, allowing them to sketch entire maps
without having to focus on small details, saving time and letting them concentrate on
creating other content. Further training may also substantially increase the range of
things the model can generate, increasing the variety of elements that could be created
by inventive users.
Through this research, an answer to the main question has been found: can
semantic-labeling-based generation be used to create game content? The answer is
positive. The labeling process is not trivial, and finding suitable game data to parse is
what requires most of the effort, but, as has been shown in this dissertation, it can be
achieved.
6.1. Future work
Due to a lack of time, a global pandemic and technical limitations, understanding in full
depth how Mapcrafter works is still on the to-do list, especially in order to remove
shadow rendering for labeling (and thus, ideally, the extra processing required through
IrfanView). Moreover, an interesting feat achieved by C. Acornley from Abertay
University has opened new possibilities for training: a modified version of Mapcrafter
(here referred to as Heightcrafter) can extract the terrain height and encode it into the
alpha channel. This feature, combined with what has been found over the course of this
project, opens up a whole new set of possibilities. Retraining a new model with
Heightcrafter using the top-down view could possibly allow the creation of a tool that
parses the synthesized results and turns them into actual Minecraft worlds.
Last but not least, training done with biome-enabled labeling could allow style transfer
to be applied as well, in a fashion similar to GauGAN's demo. As most biomes share
similar features, changing the style would be straightforward, but the fundamental
requirement for achieving this is to train on the biomes.
References
Bibliography
Games
➢Mojang (2019) Minecraft
➢Media Molecule (2020) Dreams
➢Media Molecule (2008) Little Big Planet
➢Nintendo (2015) Super Mario Maker
Softwares
➢Mcedit.net (2016) Mcedit - World Editor For Minecraft. [online] Available at: https://www.mcedit.net/
➢IrfanView (Irfan Skiljan, 2020) [online] Available at: https://www.irfanview.com/
Papers
➢Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron
Courville, Yoshua Bengio (2014) Generative Adversarial Networks
➢Ceecee Wong (2019) The Rise of AI Supermodels, CDO Trends
➢Leon A. Gatys, Alexander S. Ecker, Matthias Bethge (2015) A Neural Algorithm of Artistic Style
➢Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Chen Change Loy, Yu Qiao, Xiaoou
Tang (2018) ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks
➢Taesung Park, Ming-Yu Liu, Ting-Chun Wang, Jun-Yan Zhu (2019) Semantic Image Synthesis with
Spatially-Adaptive Normalization
➢Gupta P., Shukla S. (2020) Image Synthesis Using Machine Learning Techniques. In: Hemanth D.,
Shakya S., Baig Z. (eds) Intelligent Data Communication Technologies and Internet of Things. ICICI
2019. Lecture Notes on Data Engineering and Communications Technologies, vol 38. Springer, Cham
➢Christine Payne (2019) MuseNet. OpenAI, 25 Apr. 2019. Available at: openai.com/blog/musenet
➢Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever, (2019) Generating Long Sequences with Sparse
Transformers
➢Éric Guérin, Julie Digne, Éric Galin, Adrien Peytavie, Christian Wolf, Bedrich Benes, Benoît Martinez
(2017) Interactive example-based terrain authoring with conditional generative adversarial networks
➢Christopher Beckham, Christopher Pal (2017) A step towards procedural terrain generation with GANs
➢Alec Radford, Luke Metz, Soumith Chintala (2015), Unsupervised Representation Learning with Deep
Convolutional Generative Adversarial Networks
➢Tero Karras, Timo Aila, Samuli Laine, Jaakko Lehtinen (2017) Progressive Growing of GANs for
Improved Quality, Stability, and Variation
➢Miles Brundage, Shahar Avin, Jack Clark, Helen Toner, Peter Eckersley, Ben Garfinkel, Allan Dafoe, Paul
Scharre, Thomas Zeitzoff, Bobby Filar, Hyrum Anderson, Heather Roff, Gregory C. Allen, Jacob
Steinhardt, Carrick Flynn, Seán Ó hÉigeartaigh, Simon Beard, Haydn Belfield, Sebastian Farquhar, Clare
Lyle, Rebecca Crootof, Owain Evans, Michael Page, Joanna Bryson, Roman Yampolskiy, Dario Amodei
(2018) The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation
➢Andrew Brock, Jeff Donahue, Karen Simonyan (2018) Large Scale GAN Training for High Fidelity
Natural Image Synthesis
➢Yanghua Jin, Jiakai Zhang, Minjun Li, Yingtao Tian, Huachun Zhu, Zhihao Fang (2017) Towards the
Automatic Anime Characters Creation with Generative Adversarial Networks
➢Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros (2016) Image-to-Image Translation with
Conditional Adversarial Networks
➢Rui Huang, Shu Zhang, Tianyu Li, Ran He (2017) Beyond Face Rotation: Global and Local Perception
GAN for Photorealistic and Identity Preserving Frontal View Synthesis
➢Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, Luc Van Gool (2017) Pose Guided
Person Image Generation
➢Guim Perarnau, Joost van de Weijer, Bogdan Raducanu, Jose M. Álvarez (2016), Invertible Conditional
GANs For Image Editing
➢Ming-Yu Liu, Oncel Tuzel (2016), Coupled Generative Adversarial Networks
➢Andrew Brock, Theodore Lim, J.M. Ritchie, Nick Weston (2016) Neural Photo Editing with Introspective
Adversarial Networks
➢He Zhang, Vishwanath Sindagi, Vishal M. Patel (2017), Image De-raining Using a Conditional
Generative Adversarial Network
➢Grigory Antipov, Moez Baccouche, Jean-Luc Dugelay (2017), Face Aging With Conditional Generative
Adversarial Networks
➢Zhifei Zhang, Yang Song, Hairong Qi (2017), Age Progression/Regression by Conditional Adversarial
Autoencoder
➢Huikai Wu, Shuai Zheng, Junge Zhang, Kaiqi Huang (2017), GP-GAN: Towards Realistic
High-Resolution Image Blending
➢Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta,
Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, Wenzhe Shi (2016), Photo-Realistic
Single Image Super-Resolution Using a Generative Adversarial Network
➢Huang Bin, Chen Weihai, Wu Xingming, Lin Chun-Liang (2017), High-Quality Face Image SR Using
Conditional Generative Adversarial Networks
➢Subeesh Vasu, Nimisha Thekke Madam, Rajagopalan A.N (2018) Analyzing Perception-Distortion
Tradeoff using Enhanced Perceptual Super-resolution Network
➢Raymond A. Yeh, Chen Chen, Teck Yian Lim, Alexander G. Schwing, Mark Hasegawa-Johnson, Minh N.
Do (2016) Semantic Image Inpainting with Deep Generative Models
➢Yijun Li, Sifei Liu, Jimei Yang, Ming-Hsuan Yang (2017), Generative Face Completion
➢Donggeun Yoo, Namil Kim, Sunggyun Park, Anthony S. Paek, In So Kweon (2016), Pixel-Level Domain
Transfer
➢Carl Vondrick, Hamed Pirsiavash, Antonio Torralba (2016), Generating Videos with Scene Dynamics
➢Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T. Freeman, Joshua B. Tenenbaum (2016), Learning a
Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling
➢Matheus Gadelha, Subhransu Maji, Rui Wang (2016) 3D Shape Induction from 2D Views of Multiple
Objects
➢Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, Bryan Catanzaro (2017),
High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs
➢Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros (2017), Unpaired Image-to-Image Translation
using Cycle-Consistent Adversarial Networks
➢Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, Dimitris Metaxas
(2016), StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial
Networks
➢Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, Honglak Lee (2016),
Generative Adversarial Text to Image Synthesis
➢Ayushman Dash, John Cristian Borges Gamboa, Sheraz Ahmed, Marcus Liwicki, Muhammad Zeshan
Afzal (2017), TAC-GAN – Text Conditioned Auxiliary Classifier Generative Adversarial Network
➢Scott Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, Honglak Lee (2016), Learning
What and Where to Draw
➢Sergey Ioffe, Christian Szegedy (2015), Batch Normalization: Accelerating Deep Network Training by
Reducing Internal Covariate Shift
➢Harm de Vries, Florian Strub, Jérémie Mary, Hugo Larochelle, Olivier Pietquin, Aaron Courville (2017),
Modulating early visual processing by language
➢Vincent Dumoulin, Jonathon Shlens, Manjunath Kudlur (2016), A Learned Representation For Artistic
Style
List of Figures
➢(i.1) Lucas Cranach the Elder, The Garden of Eden (1530), oil on poplar wood, 81 cm x 114 cm,
Gemäldegalerie Alte Meister, Dresden
➢(i.2) Peter Paul Rubens and Jan Brueghel the Elder, The garden of Eden with the fall of man (1615), oil
on panel, 74.3 cm x 114.7 cm, Mauritshuis art museum in The Hague, Netherlands
➢(i.3) Thomas Cole, The Garden of Eden (1828), 97.7 cm x 133.9 cm, Amon Carter Museum of American
Art
➢(i.4) Hieronymus Bosch, The Garden of Earthly Delights (1503-1515), oil on oak panels, 205.5 cm ×
384.9 cm (81 in × 152 in), Museo del Prado, Madrid
➢(i.5) Mojang Studios, screenshot from Minecraft
➢(i.6) Media Molecule, screenshot from Dreams
➢(i.7) Media Molecule, screenshot from Little Big Planet
➢(i.8) Nintendo, screenshot from Mario Maker
➢(i.9) MCEdit.net, screenshot from MCEdit 2.0
➢(i.10) Nvidia, screenshot from GauGAN
➢(i.11) Classification diagram, taken from Image Synthesis Using Machine Learning Techniques, 2019
➢(i.12) Examples of GANs used to generate new plausible examples for image datasets. Taken from
Generative Adversarial Networks, 2014
➢(i.13) Example of GAN-Generated Photographs of Bedrooms. Taken from Unsupervised Representation
Learning with Deep Convolutional Generative Adversarial Networks, 2015
➢(i.14) Example of Vector Arithmetic for GAN-Generated Faces. Taken from Unsupervised
Representation Learning with Deep Convolutional Generative Adversarial Networks, 2015
➢(i.15) Examples of Photorealistic GAN-Generated Faces. Taken from Progressive Growing of GANs for
Improved Quality, Stability, and Variation, 2017
➢(i.16) Example of Photorealistic GAN-Generated Objects and Scenes. Taken from Progressive Growing
of GANs for Improved Quality, Stability, and Variation, 2017
➢(i.17) Example of the Progression in the Capabilities of GANs from 2014 to 2017. Taken from The
Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation, 2018
➢(i.18) Example of Realistic Synthetic Photographs Generated with BigGAN. Taken from Large Scale
GAN Training for High Fidelity Natural Image Synthesis, 2018
➢(i.19) Example of GAN-Generated Anime Character Faces. Taken from Towards the Automatic Anime
Characters Creation with Generative Adversarial Networks, 2017
➢i.20 Example of Photographs of Daytime Cityscapes to Nighttime With pix2pix. Taken from
Image-to-Image Translation with Conditional Adversarial Networks, 2016
➢i.21 Example of Sketches to Color Photographs With pix2pix. Taken from Image-to-Image Translation
with Conditional Adversarial Networks, 2016.
➢i.22 Example of GAN-based Face Frontal View Photo Generation. Taken from Beyond Face Rotation:
Global and Local Perception GAN for Photorealistic and Identity Preserving Frontal View Synthesis,
2017
➢i.23 Example of GAN-Generated Photographs of Human Poses. Taken from Pose Guided Person Image
Generation, 2017
➢i.24 Example of Celebrity Photographs and GAN-Generated Emojis. Taken from Unsupervised
Cross-Domain Image Generation, 2016
➢i.25 Example of Face Photo Editing with IcGAN. Taken from Invertible Conditional GANs For Image
Editing, 2016
➢i.26 Example of GANs used to Generate Faces With and Without Blond Hair. Taken from Coupled
Generative Adversarial Networks, 2016
➢i.27 Example of GANs used to Generate Faces With and Without Blond Hair. Taken from Coupled
Generative Adversarial Networks, 2016
➢i.28 Example of Using a GAN to Remove Rain From Photographs. Taken from Image De-raining Using a
Conditional Generative Adversarial Network
➢i.29 Example of Photographs of Faces Generated With a GAN With Different Apparent Ages. Taken
from Face Aging With Conditional Generative Adversarial Networks, 2017
➢i.30 Example of Using a GAN to Age Photographs of Faces. Taken from Age Progression/Regression by
Conditional Adversarial Autoencoder, 2017
➢i.31 Example of GAN-based Photograph Blending. Taken from GP-GAN: Towards Realistic
High-Resolution Image Blending, 2017
➢i.32 Example of GAN-Generated Images With Super Resolution. Taken from Photo-Realistic Single
Image Super-Resolution Using a Generative Adversarial Network, 2016
➢i.33 Example of High-Resolution Generated Human Faces. Taken from High-Quality Face Image SR
Using Conditional Generative Adversarial Networks, 2017
➢i.34 Example of GAN-Generated Photograph Inpainting Using Context Encoders. Taken from Context
Encoders: Feature Learning by Inpainting, 2016
➢i.35 Example of GAN-based Inpainting of Photographs of Human Faces. Taken from Semantic Image
Inpainting with Deep Generative Models, 2016
➢i.36 Example of GAN Reconstructed Photographs of Faces. Taken from Generative Face Completion,
2017
➢i.37 Example of Input Photographs and GAN-Generated Clothing Photographs. Taken from Pixel-Level
Domain Transfer, 2016
➢i.38 Example of Video Frames Generated With a GAN. Taken from Generating Videos with Scene
Dynamics, 2016
➢i.39 Example of GAN-Generated Three Dimensional Objects. Taken from Learning a Probabilistic Latent
Space of Object Shapes via 3D Generative-Adversarial Modeling
➢i.40 Example of Three-Dimensional Reconstructions of a Chair From Two-Dimensional Images. Taken
from 3D Shape Induction from 2D Views of Multiple Objects, 2016
➢i.41 Example of Semantic Image and GAN-Generated Cityscape Photograph. Taken from
High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs, 2017
➢i.42 Four Image-to-Image Translations performed with CycleGAN. Taken from Unpaired
Image-to-Image Translation using Cycle-Consistent Adversarial Networks, 2017
➢i.43 Example of Translation from Paintings to Photographs With CycleGAN. Taken from Unpaired
Image-to-Image Translation using Cycle-Consistent Adversarial Networks, 2017
➢i.44 Example of Textual Descriptions and GAN-Generated Photographs of Birds. Taken from
StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks,
2016.
➢i.45 Example of Textual Descriptions and GAN-Generated Photographs of Birds and Flowers. Taken
from Generative Adversarial Text to Image Synthesis
➢i.46 Example of Photos of Object Generated From Text and Position Hints With a GAN. Taken from
Learning What and Where to Draw, 2016
➢i.47-50 Examples of Terrains Generated With a GAN. Taken from Interactive example-based terrain
authoring with conditional generative adversarial networks, 2017.
➢i.51-56 Semantic Entity Block Clustering used for training
➢i.57 Screenshot of the labeled texture folder content
➢i.58-59 Full map images generated using the map processing script. The map in the top image shows
the real images. The map in the bottom shows labeled maps
➢i.60 Unprocessed Labeled Map
➢i.61 File>Batch Conversion/Rename option, from IrfanView
➢i.62 Above, File>Batch Conversion/Rename>Advanced button, from IrfanView
➢i.63 File>Batch Conversion/Rename>Advanced>Replace color>Settings button, from IrfanView
➢i.64 File>Batch Conversion/Rename>Advanced>Replace color>Settings>Replace Color window, from
IrfanView
➢i.65 SPADE batch normalization. Taken from Semantic Image Synthesis with Spatially-Adaptive
Normalization
➢i.66-70 Training results samples at epoch 35
➢i.71 Real Image samples at epoch 50
➢i.72 Synthesized sample at epoch 50