A Complete Survey on Generative AI (AIGC): Is ChatGPT from GPT-4 to GPT-5
All You Need?
CHAONING ZHANG, Kyung Hee University, South Korea
CHENSHUANG ZHANG, KAIST, South Korea
SHENG ZHENG, Beijing Institute of Technology, China
YU QIAO, Kyung Hee University, South Korea
CHENGHAO LI, KAIST, South Korea
MENGCHUN ZHANG, KAIST, South Korea
SUMIT KUMAR DAM, Kyung Hee University, South Korea
CHU MYAET THWAL, Kyung Hee University, South Korea
YE LIN TUN, Kyung Hee University, South Korea
LE LUANG HUY, Kyung Hee University, South Korea
DONGUK KIM, Kyung Hee University, South Korea
SUNG-HO BAE, Kyung Hee University, South Korea
LIK-HANG LEE, Hong Kong Polytechnic University, Hong Kong (China)
YANG YANG, University of Electronic Science and Technology of China, China
HENG TAO SHEN, University of Electronic Science and Technology of China, China
IN SO KWEON, KAIST, South Korea
CHOONG SEON HONG, Kyung Hee University, South Korea
As ChatGPT goes viral, generative AI (AIGC, a.k.a AI-generated content) has made headlines everywhere because of its ability to
analyze and create text, images, and beyond. With such overwhelming media coverage, it is almost impossible for us to miss the
opportunity to glimpse AIGC from a certain angle. In the era of AI transitioning from pure analysis to creation, it is worth noting that
ChatGPT, with its most recent language model GPT-4, is just one tool among numerous AIGC tasks. Impressed by the capability of ChatGPT, many people are wondering about its limits: can GPT-5 (or other future GPT variants) help ChatGPT unify all AIGC tasks for
Authors’ addresses: Chaoning Zhang, Kyung Hee University, South Korea, chaoningzhang1990@gmail.com; Chenshuang Zhang, KAIST, South Korea,
zcs15@kaist.ac.kr; Sheng Zheng, Beijing Institute of Technology, China, zszhx2021@gmail.com; Yu Qiao, Kyung Hee University, South Korea, qiaoyu@
khu.ac.kr; Chenghao Li, KAIST, South Korea, lch17692405449@gmail.com; Mengchun Zhang, KAIST, South Korea, zhangmengchun527@gmail.com; Sumit
Kumar Dam, Kyung Hee University, South Korea, skd160205@khu.ac.kr; Chu Myaet Thwal, Kyung Hee University, South Korea, chumyaet@khu.ac.kr;
Ye Lin Tun, Kyung Hee University, South Korea, yelintun@khu.ac.kr; Le Luang Huy, Kyung Hee University, South Korea, quanghuy69@khu.ac.kr;
Donguk Kim, Kyung Hee University, South Korea, g9896@khu.ac.kr; Sung-Ho Bae, Kyung Hee University, South Korea, shbae@khu.ac.kr; Lik-Hang Lee,
Hong Kong Polytechnic University, Hong Kong (China), iskweon77@kaist.ac.kr; Yang Yang, University of Electronic Science and Technology of China, China,
dlyyang@gmail.com; Heng Tao Shen, University of Electronic Science and Technology of China, China, shenhengtao@hotmail.com; In So Kweon, KAIST, South
Korea, iskweon77@kaist.ac.kr; Choong Seon Hong, Kyung Hee University, South Korea, cshong@khu.ac.kr.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components
of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
©2022 Association for Computing Machinery.
Manuscript submitted to ACM
diversified content creation? Toward answering this question, a comprehensive review of existing AIGC tasks is needed. As such, our work comes to fill this gap promptly by offering a first look at AIGC, ranging from its techniques to applications. Modern generative AI relies on various technical foundations, ranging from model architecture and self-supervised pretraining to generative modeling methods (like GAN and diffusion models). After introducing the fundamental techniques, this work focuses on the technological development of various AIGC tasks based on their output type, including text, images, videos, 3D content, etc., which depicts the full potential of ChatGPT's future. Moreover, we summarize their significant applications in some mainstream industries, such as education and creative content. Finally, we discuss the challenges currently faced and present an outlook on how generative AI might evolve in the near future.
CCS Concepts: • Computing methodologies → Computer vision tasks; Natural language generation; Machine learning approaches.
Additional Key Words and Phrases: Survey, Generative AI, AIGC, ChatGPT, GPT-4, GPT-5, Text Generation, Image Generation
ACM Reference Format:
Chaoning Zhang, Chenshuang Zhang, Sheng Zheng, Yu Qiao, Chenghao Li, Mengchun Zhang, Sumit Kumar Dam, Chu Myaet Thwal,
Ye Lin Tun, Le Luang Huy, Donguk Kim, Sung-Ho Bae, Lik-Hang Lee, Yang Yang, Heng Tao Shen, In So Kweon, and Choong Seon
Hong. 2022. A Complete Survey on Generative AI (AIGC): Is ChatGPT from GPT-4 to GPT-5 All You Need?. 1, 1 (March 2022), 57 pages.
https://doi.org/XXXXXXX.XXXXXXX
Contents
Abstract
Contents
1 Introduction
2 Overview
2.1 Popularity indicated by search interest
2.2 Why does it get popular?
2.2.1 Content need
2.2.2 Technology conditions
3 Fundamental techniques behind AIGC
3.1 General techniques in AI
3.1.1 Backbone architecture
3.1.2 Self-supervised pretraining
3.2 Creation techniques in AI
3.2.1 Likelihood-based models
3.2.2 Energy-based models
3.2.3 Two star models: from GAN to diffusion model
4 AIGC task: text generation
4.1 Text to text
4.1.1 Chatbots
4.1.2 Machine translation
4.2 Multimodal text generation
4.2.1 Image-to-text
4.2.2 Speech-to-Text
5 AIGC task: image generation
5.1 Image-to-image
5.1.1 Image restoration
5.1.2 Image editing
5.2 Multimodal image generation
5.2.1 Text-to-image
5.2.2 Talking face
6 AIGC task: beyond text and image
6.1 Video
6.2 3D generation
6.3 Speech
6.4 Graph
6.5 Others
7 Industry Applications
7.1 Education
7.2 Game and metaverse
7.3 Media
7.4 Advertising
7.5 Movie
7.6 Music
7.7 Painting
7.8 Code development
7.9 Phone apps and features
7.10 Other fields
8 Challenges and outlook
8.1 Challenges
8.2 Outlook
References
1 INTRODUCTION
Generative AI (AIGC, a.k.a. AI-generated content) has made headlines with intriguing tools like ChatGPT or DALL-E [346], suggesting a new era of AI is coming. Under such overwhelming media coverage, the general public is offered many opportunities to have a glimpse of AIGC. However, the content in media reports tends to be biased or sometimes misleading. Moreover, impressed by the powerful capability of ChatGPT, many people are wondering about its limits. Very recently, OpenAI released GPT-4 [310], which demonstrates remarkable performance improvement over the previous variant GPT-3 as well as multimodal generation capability like understanding images. Impressed by the powerful capability of GPT-4 powered by AIGC, many are wondering about its limits: can GPT-5 (or other GPT variants) help next-generation ChatGPT unify all AIGC tasks? Therefore, a comprehensive review of generative AI serves as a groundwork to respond to the inevitable trend of AI-powered content creation. More importantly, our work comes to fill this gap in a timely manner.
The goal of conventional AI is mainly to perform classification [266] or regression [230]. Such a discriminative approach renders its role mainly for analyzing existing data. Therefore, conventional AI is also often termed analytical AI. By contrast, generative AI differentiates itself by creating new content. However, generative AI often also requires the model to first understand some existing data (like a text instruction) before generating new content [41, 345]. From this perspective, analytical AI can be seen as the foundation of modern generative AI, and the boundary between them is often ambiguous. Note that analytical AI tasks also generate content. For example, label content is generated in image classification [219]. Nonetheless, image recognition is often not considered in the category of generative AI because the label content has low dimensionality. Typical tasks for generative AI involve generating high-dimensional data, like text or images. Such generated content can also be used as synthetic data for alleviating the need for more data in deep learning [148]. An overview of the popularity of generative AI, as well as its underlying reasons, is presented in Sec. 2.
As stated above, what distinguishes generative AI from conventional AI lies in its generated content. With this said, generative AI is conceptually similar to AIGC (a.k.a. AI-generated content) [307]. In the context of describing AI-based content generation, these two terms are often interchangeable. In this work, we call the content generation tasks AIGC for simplicity. For example, ChatGPT is a tool for the AIGC task termed ChatBot [44], which is the tip of the iceberg considering the variety of AIGC tasks. Despite the high resemblance between generative AI and AIGC, these two terms have a nuanced difference. AIGC focuses on the tasks for content generation, while generative AI additionally considers the fundamental technical foundations that support the development of various AIGC tasks. In this work, we divide those underlying techniques into two classes. The first class refers to generative modeling techniques, like GAN [127] and diffusion models [159], which are directly related to generative AI for content creation. The second class of AI techniques mainly consists of backbone architectures (like Transformer [448]) and self-supervised pretraining (like BERT [89] or MAE [145]). Some of them were developed in the context of analytical AI. However, they have also become essential for demonstrating competitive performance, especially in challenging AIGC tasks. Considering this, both classes of underlying techniques are summarized in Sec. 3.
On top of these basic techniques, numerous AIGC tasks have become possible and can be straightforwardly categorized based on the generated content type. The development of various AIGC tasks is summarized in Sec. 4, Sec. 5, and Sec. 6. Specifically, Sec. 4 and Sec. 5 focus on text output and image output, respectively. For text generation, ChatBot [44] and machine translation [503] are two dominant tasks. Some text generation tasks also take other modalities as the input, for which we mainly focus on image and speech. For image generation, two dominant tasks are image restoration and editing [256]. More recently, text-to-image has attracted significant attention. Beyond the above two dominant output types (i.e., text and image), Sec. 6 covers other types of output, such as video, 3D, speech, etc.
As technology advances, AIGC performance has become satisfactory for more and more tasks. For example, chatbots used to be limited to answering simple questions. However, the recent ChatGPT has been shown to understand jokes and generate code under simple instruction. Text-to-image used to be considered a challenging task; however, recent DALL-E 2 [345] and Stable Diffusion [361] have been able to generate photorealistic images. Therefore, opportunities for applying AIGC in industry emerge. Sec. 7 covers the application of AIGC in various industries, including entertainment, digital art, media/advertising, education, etc. Along with the application of AIGC in the real world, numerous challenges like ethical concerns have also emerged, and they are discussed in Sec. 8. Alongside the current challenges, an outlook on how generative AI might evolve is also presented.
Fig. 1. Search interest of generative AI: Timeline trend (left) and region-wise interest (right). The color darkness on the right part indicates the rank interest level.
Overall, this work conducts a survey on generative AI through the lens of generated content (i.e., AIGC tasks), covering its underlying basic techniques, task-wise technological development, applications in industry, as well as its social impact. An overview of the paper structure is presented in Figure 4.
2 OVERVIEW
Adopting AI for content creation has a long history. IBM made the first public demonstration of a machine translation system at its head office in New York in 1954. The first computer-generated music came out under the name "Illiac Suite" in 1957. Such early attempts and proof-of-concept successes created high expectations for the future of AI, which motivated governments and companies to invest numerous resources in AI. Such a boom in investment, however, did not yield the expected output. After that, a period called the AI winter came, which dramatically undermined the development of AI and its applications. Entering the 2010s, AI became popular again, especially after the success of AlexNet [219] for ImageNet classification in 2012. Entering the 2020s, AI has entered a new era of not only understanding existing data but also creating new content [41, 345]. This section provides an overview of generative AI by focusing on its popularity and why it gets popular.
2.1 Popularity indicated by search interest
A good indicator of how popular a certain term is, is its search interest. Google provides a useful tool to visualize search frequency, called Google Trends. Although alternative search engines might provide similar functions, we adopt Google Trends because Google is one of the most widely used search engines in the world.
Interest over time and by region. Figure 1 (left) shows the search interest of generative AI, which indicates that the search interest significantly increased in the past year, especially after October 2022. Entering 2023, this search interest reached a new height. A similar trend is observed for the term AIGC, see Figure 2 (left). Beyond interest over time, Google Trends also provides region-wise search interest. The search heatmaps for generative AI and AIGC are shown in Figure 1 (right) and Figure 2 (right), respectively. For both terms, the main hot regions include Asia, Northern America, and Western Europe. Most notably, for both terms, China ranks highest among all countries with a search interest of 100, followed by around 30 in Northern America and 20 in Western Europe. It is also worth mentioning that some small but tech-oriented countries have a very high search interest in generative AI; for example, Singapore (59), Israel (58), and South Korea (43) rank among the top countries in country-wise search interest.
Manuscript submitted to ACM
6 Zhang et al.
Fig. 2. Search interest of AIGC: Timeline trend (left) and region-wise interest (right). The color darkness on the right part indicates the rank interest level.
Fig. 3. Search interest comparison between generative AI and AIGC: Timeline trend (left) and region-wise interest (right).
Generative AI vs. AIGC. Figure 3 shows a comparison of search interest between generative AI and AIGC. Here, we define the interest ratio of generative AI to AIGC as GAI/AIGC. A major observation is that China prefers the term AIGC over generative AI, with a GAI/AIGC ratio of 15/85. By contrast, the GAI/AIGC ratio in the US is 90/10. In many countries, including Russia and Brazil, the GAI/AIGC ratio is 100/0. Overall, most countries prefer generative AI to AIGC, which gives generative AI an overall higher search interest than AIGC. The reason that China has become the leading country in adopting the term AIGC is not fully clear. A possible explanation is that AIGC is shortened to a single word and thus is easier to use. We also searched the Chinese versions of generative AI and AIGC on Google Trends; however, the available data are currently not sufficient for a meaningful comparison.
2.2 Why does it get popular?
The surging interest in generative AI over the last year can be mainly attributed to the emergence of intriguing tools like Stable Diffusion or ChatGPT. Here, we discuss why generative AI gets popular by focusing on what factors contributed to the advent of such powerful AIGC tools. The reasons are summarized from two perspectives: content need and technology conditions.
2.2.1 Content need. The way we communicate and interact with the world has been fundamentally changed by the Internet, for which digital content plays a key role. Over the last few decades, the content on the web has also undergone multiple major changes. In the Web 1.0 era (the 1990s-2004), the Internet was primarily used to access and share information, with websites being mainly static. There was little interaction between users, and the primary mode of communication was one-way, with users accessing information but not contributing or sharing their own content. The content was largely text-based and mainly generated by professionals in the relevant fields, like journalists writing news articles. Therefore, such content is often called Professional Generated Content (PGC), which has since been overshadowed by another type of content, termed User Generated Content (UGC) [217, 325, 431]. In contrast to PGC, UGC in Web 2.0 [311] is mainly generated by users on social media, like Facebook [206], Twitter [260], YouTube [162], etc. Compared with PGC, the volume of UGC is significantly larger; however, its quality might be inferior.
We are currently transitioning from Web 2.0 to Web 3.0 [367]. With its defining features of being decentralized and intermediary-free, Web 3.0 also relies on a new content generation type beyond PGC and UGC to address the trade-off between volume and quality. AI is widely recognized as a promising tool for addressing this trade-off. For example, in the past, only those users with a long period of practice could draw images of decent quality. With text-to-image tools (like Stable Diffusion [361]), anyone can create images from a plain text description. Such a combination of user imagination and AI execution power makes it possible to generate new types of images at an unprecedented speed. Beyond image generation, AIGC tasks also facilitate generating other types of content.
Another change AIGC brings is that the boundary between content consumers and creators becomes vague. In Web 2.0, content generators and consumers are often different users. With AIGC in Web 3.0, however, data consumers are now able to become data creators: they can use AI algorithms to generate their own original content, which gives them more control over what they produce and consume and lets them tailor content to their specific needs and interests. Overall, the shift towards AIGC has the potential to greatly transform the way data is consumed and produced, giving individuals and organizations more control and flexibility over the content they create and consume. In the following, we discuss why AIGC has become popular now.
2.2.2 Technology conditions. When it comes to AIGC technology, the first thing that comes to mind is often the machine (deep) learning algorithm, while its two important preconditions are easily overlooked: data access and compute resources.
Advances in data access. Deep learning refers to the practice of training a model on data. The model performance heavily relies on the size of the training data; typically, performance increases with more training samples. Taking image classification as an example, ImageNet [85] with more than 1 million images is a commonly used dataset for training the model and validating the performance. Generative AI often requires an even larger dataset, especially for challenging AIGC tasks like text-to-image. For example, approximately 250M images were used for training DALL-E [346]. DALL-E 2 [345], on the other hand, used approximately 650M images. ChatGPT was built on top of GPT-3 [41], which was partly trained on the CommonCrawl dataset, comprising 45TB of compressed plaintext before filtering and 570GB after filtering. Other datasets like WebText2, Books1/2, and Wikipedia were also involved in the training of GPT-3. Accessing such huge datasets has become possible mainly due to the Internet.
Advances in computing resources. Another important factor contributing to the development of AIGC is the advance in computing resources. Early AI algorithms ran on CPUs, which cannot meet the needs of training large deep learning models. For example, AlexNet [219] was the first model trained on the full ImageNet, and the training was done on Graphics Processing Units (GPUs). GPUs were originally designed for rendering graphics in video games but have become increasingly common in deep learning. GPUs are highly parallelized and can perform matrix operations much faster than CPUs. Nvidia is a leading company in manufacturing GPUs, and the computing capability of its CUDA-enabled GPUs has improved from the first CUDA-capable GPU (GeForce 8800) in 2006 to the recent Hopper architecture with hundreds of times more computing power. The price of GPUs can range from a few hundred dollars to several thousand dollars, depending on the number of cores and memory. Tensor Processing Units (TPUs) are specialized processors designed by Google specifically for accelerating neural network training. TPUs are available on the Google Cloud Platform, and the pricing varies depending on usage and configuration. Overall, the price of computing resources is on a trend of becoming more affordable.
Fig. 4. An overview of generative AI (AIGC): fundamental techniques, core AIGC tasks, and industrial applications.
3 FUNDAMENTAL TECHNIQUES BEHIND AIGC
In this work, we perceive AIGC as a set of tasks or applications that generate content with AI methods. Before introducing AIGC, we first visit the fundamental techniques behind AIGC, which fall within the scope of generative AI at the technical level. Here, we summarize the fundamental techniques by roughly dividing them into two classes: General techniques and Creation techniques. Specifically, Creation techniques refer to the techniques that are able to generate various content, e.g., GAN and diffusion models. Meanwhile, General techniques cannot generate content directly but are essential for the development of AIGC, e.g., the Transformer architecture. In this section, we provide a brief summary of the required techniques for AIGC.
3.1 General techniques in AI
After the phenomenal success of AlexNet [219], there has been a surging interest in deep learning, which has somewhat become a synonym for AI. In contrast to traditional rule-based algorithms, deep learning is a data-driven method that optimizes the model parameters with stochastic gradient descent. The success of deep learning in obtaining a superior feature representation depends on better backbone architectures and more data, which greatly accelerates the development of AIGC.
3.1.1 Backbone architecture. As two mainstream fields in deep learning, research on natural language processing (NLP) and computer vision (CV) has significantly improved the backbone architectures and inspired various applications of the improved backbones in other fields, e.g., the speech area. In the NLP field, Transformer [448] has replaced recurrent neural networks (RNNs) [284, 288] as the de-facto standard backbone. In the CV area, the vision Transformer (ViT) [99] has also shown its power besides the traditional convolutional neural networks (CNNs). Here, we briefly introduce how these mainstream backbones work and their representative variants.
RNN architecture. RNNs are mainly adopted for handling data with time sequences, like language or audio. A vanilla RNN has three layers: input, hidden, and output. The information flow in an RNN goes in two directions. The first direction is from the input to the hidden layer and then to the output. What captures the recurrent nature of the RNN lies in its second information flow along the time direction: besides the corresponding input, the hidden state at time t also depends on the hidden state at time t-1. This two-flow design handles the sequence order well but suffers from exploding or vanishing gradients when the sequence gets long. To mitigate the long-term dependency issue, LSTM [161] was introduced with a cell state that acts like a freeway to facilitate the information flow along the sequence direction. LSTM is one of the most popular methods for alleviating the gradient vanishing/exploding issue. With three types of gates, however, LSTM suffers from high complexity and a higher memory requirement. The Gated Recurrent Unit (GRU) [66] simplifies LSTM by merging its cell and hidden states and replacing the forget and input gates with a so-called update gate. Unitary RNN [19] handles the gradient issue by implementing unitary matrices. The Gated Orthogonal Recurrent Unit [188] leverages the merits of both gates and unitary matrices. Bidirectional RNN [379] improves the vanilla RNN by capturing both past and future information in the cell, i.e., the state at time t is calculated based on both time t-1 and t+1. Depending on the task, an RNN can have various architectures with different numbers of inputs and outputs: one-to-one, many-to-one, one-to-many, and many-to-many. The many-to-many form can be used in machine translation and is also called the sequence-to-sequence (seq2seq) model [417]. Attention was introduced in [25] to let the decoder see every encoder token and automatically decide the weights on them based on their importance.
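To make the recurrence concrete, below is a minimal NumPy sketch of a vanilla RNN cell using the common update h_t = tanh(W_x x_t + W_h h_{t-1} + b); the tanh nonlinearity and the toy weight shapes are illustrative assumptions rather than details from any specific work cited above.

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, b, h0):
    """Run a vanilla RNN over a sequence.

    xs: array of shape (T, input_dim), one row per time step.
    h0: initial hidden state of shape (hidden_dim,).
    Returns the hidden states for all T steps.
    """
    h = h0
    hs = []
    for x_t in xs:  # second information flow: along the time direction
        h = np.tanh(W_x @ x_t + W_h @ h + b)  # h_t depends on x_t and h_{t-1}
        hs.append(h)
    return np.stack(hs)

# Toy usage with random weights.
rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 5, 3, 4
xs = rng.normal(size=(T, input_dim))
W_x = rng.normal(size=(hidden_dim, input_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)
hs = rnn_forward(xs, W_x, W_h, b, np.zeros(hidden_dim))
print(hs.shape)  # (5, 4)
```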
Transformer. Different from seq2seq with attention [25, 270, 318], a new variant of architecture discards the recurrent seq2seq design and claims that attention is all you need [448]. Such attention is called self-attention, and the proposed architecture is termed Transformer [448] (see Figure 5). A standard Transformer consists of an encoder and a
decoder and is developed based on residual connections [147] and layer normalization [23]. Besides the Add & Norm module, the Transformer has two core components: multi-head attention and a feed-forward neural network (a.k.a. MLP). The attention module adopts a multi-head design with self-attention in the form of scaled dot-product attention, defined as:

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \qquad (1) \]
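A minimal NumPy sketch of Eq. (1) is given below; a row-wise softmax is assumed, and masking as well as the multi-head split are omitted.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute Eq. (1): softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ V                             # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 16))
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 16)
```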
Unlike RNNs, which encode positional information by processing the sentence sequentially, the Transformer obtains powerful modeling capabilities by constructing global dependencies, but it thereby loses the positional inductive bias. Therefore, positional encoding is needed to enable the model to sense the positional information of the input signal. There are two types of positional encoding: fixed positional encoding is represented by sinusoids and cosines of different frequencies, while learnable positional encoding is composed of a set of learnable parameters. The Transformer has become the de-facto standard method for NLP tasks.
Fig. 5. Transformer structure (figure obtained from [448]).
CNN architecture. After introducing RNN and Transformer in the NLP field, we now visit two mainstream backbones in the CV area, i.e., CNN and ViT. CNNs have become a standard backbone in the field of computer vision. The core of a CNN lies in its convolution layer. The convolution kernel (also known as a filter) in the convolution layer is a set of shared weight parameters for operating on images, inspired by the biological visual cortex cells. The convolution kernel slides over the image and performs correlation operations with the pixel values of the image, finally obtaining the feature map and realizing the feature extraction of the image. GoogLeNet [421], with its Inception module allowing multiple convolutional filter sizes to be chosen in each block, increased the diversity of convolutional kernels and thus improved the performance of CNNs. ResNet [147] was a milestone for CNNs, introducing residual connections that stabilized training and enabled the models to achieve better performance through deeper modeling. After that, residual connections became a standard component of CNNs. To extend the work of ResNet, DenseNet [168] establishes dense connections between all the previous layers and the subsequent layers, giving the model better modeling ability. EfficientNet [422] uses a scaling method that applies a set of fixed scaling coefficients to uniformly scale the width, depth, and resolution of the convolutional neural network architecture, making the model more efficient.
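To illustrate the sliding-window correlation described above, a minimal NumPy sketch follows; the vertical-edge kernel is a toy, hand-chosen example rather than a learned filter, and padding/stride options are omitted.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a kernel over a single-channel image and compute the
    correlation at each location (no padding, stride 1)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kH, j:j + kW]
            out[i, j] = np.sum(patch * kernel)  # shared weights reused at every location
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])  # a simple vertical-edge filter
print(conv2d_valid(image, edge_kernel))  # (3, 3) feature map
```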
Fig. 6. ViT structure (figure obtained from [99]).
ViT architecture. Inspired by the success of the Transformer in NLP, numerous works have tried to apply the Transformer to the field of CV, with ViT [99] (see Figure 6) being the first of its kind. ViT first flattens the image into a sequence of 2D patches and inserts a class token at the beginning of the sequence to extract classification information. After adding positional encoding to the patch embeddings, the token embeddings are fed into a standard Transformer. This simple and effective implementation makes ViT highly scalable. Swin [264] efficiently deals with image classification and dense recognition tasks by constructing hierarchical feature maps through merging image patches at deeper levels, and because it computes self-attention only within each local window, it reduces computational complexity. DeiT [435] uses a teacher-student training strategy, reducing the dependence of Transformer models on large data by introducing distillation tokens. CaiT [436] introduces class attention to effectively increase the depth of the model. T2T [514] introduces locality into the model through token fusion, recursively aggregating adjacent tokens into one token, and adopts a hierarchical deep-and-narrow structure following the prior of CNNs. Relying on permutation equivariance rather than the translation invariance of CNNs, Transformers allow for long-range dependencies and less inductive bias, making them more powerful modeling tools that transfer better to downstream tasks than CNNs. In the current paradigm of large models and large datasets, Transformers have gradually replaced CNNs as the mainstream model in the field of computer vision.
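Below is a minimal NumPy sketch of the patch flattening and class-token step described above, assuming a 224x224 RGB image, 16x16 patches, and a toy linear projection; positional encoding and the Transformer blocks themselves are omitted.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches,
    i.e. a sequence of tokens of length (H/P) * (W/P)."""
    H, W, C = image.shape
    P = patch_size
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)  # (num_patches, patch_dim)

rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))
tokens = patchify(image, patch_size=16)            # (196, 768)
embed = rng.normal(size=(768, 192))                # toy linear projection to model dim
cls_token = np.zeros((1, 192))                     # learnable class token in practice
sequence = np.vstack([cls_token, tokens @ embed])  # (197, 192) fed to the Transformer
print(sequence.shape)
```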
3.1.2 Self-supervised pretraining. Parallel to better backbone architectures, deep learning also benefits from self-supervised pretraining, which can exploit a larger (unlabeled) training dataset. Here, we summarize the pretraining techniques most relevant to AIGC and categorize them according to the training data type (e.g., language, vision, and joint pretraining).
Language pretraining. There are three major types of language pretraining methods. The first type pretrains an encoder with masking, for which the representative work is BERT [89] (see Figure 7). Specifically, BERT predicts the masked language tokens from the unmasked tokens. There is a significant discrepancy between the mask-then-predict pretraining task and downstream tasks; therefore, masked language modeling like BERT is rarely used for text generation without fine-tuning. By contrast, autoregressive language pretraining methods are suitable for few-shot or zero-shot text generation. The GPT family [41, 341, 342] is the most popular one, which adopts a decoder instead of an encoder. Specifically, GPT-1 [341] is the first of its kind, with GPT-2 [342] and GPT-3 [41] further investigating the role of massive data and large models in transfer capability. Based on GPT-3, the unprecedented success of ChatGPT has attracted great attention recently. Moreover, a stream of language models adopts both an encoder and a decoder as in the original Transformer. BART [229] perturbs the input with various types of noise and predicts the original clean input, like a denoising autoencoder. MASS [404] and ProphetNet [335] follow BERT in taking a masked sequence as the input of the encoder, with the decoder predicting the masked tokens in an autoregressive manner. T5 [343] replaces the masked tokens with some random tokens.
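A minimal sketch of the mask-then-predict data preparation follows, assuming a toy whitespace tokenizer and a 15% masking rate; BERT's 80/10/10 replacement heuristic is simplified here to always inserting [MASK].

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Randomly mask a fraction of tokens; the model is trained to
    predict the original tokens at the masked positions."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)      # supervise only the masked positions
        else:
            inputs.append(tok)
            labels.append(None)     # ignored by the loss
    return inputs, labels

sentence = "generative ai creates text images and beyond".split()
inputs, labels = mask_tokens(sentence)
print(inputs)
print(labels)
```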
Fig. 7. BERT structure (figure obtained from [89]).
Visual pretraining. To learn better representations of vision data during pretraining, self-supervised learning (SSL) has been widely applied, and we term it visual SSL. Visual SSL has undergone three stages. Early works focused on designing various pretext tasks like solving jigsaw puzzles [306] or predicting rotation [124]. Such pretraining yields better performance on downstream tasks than training from scratch, which motivated contrastive learning methods [55, 146, 527]. Contrastive learning adopts joint embedding to minimize the representation distance between augmented images for learning augmentation-invariant representations. The representation in pure joint embedding can collapse to a constant regardless of the inputs, so contrastive learning simultaneously maximizes the representation distance from negative samples. Negative-free joint-embedding methods have also been investigated in SimSiam [56] and BYOL [133]. How SimSiam works without negative samples has been investigated in [528]. Inspired by the success of BERT in NLP pretraining, BEiT [31] applied masked modeling to vision, and its success relies on a pre-trained VAE to obtain the visual tokens. Masked autoencoder (MAE) [145] (see Figure 8) simplifies it to an end-to-end denoising framework by predicting the masked patches from the unmasked patches. Outperforming contrastive learning and negative-free joint-embedding methods, MAE has become a new dominant variant of the visual SSL framework. Interested readers can refer to [526] for more details.
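To make the masked-patch idea concrete, a minimal NumPy sketch of MAE-style random masking follows; the 75% mask ratio is a typical choice used here as an assumption, and the encoder/decoder that operate on the visible and masked tokens are omitted.

```python
import numpy as np

def random_masking(patch_tokens, mask_ratio=0.75, seed=0):
    """Keep a random subset of patch tokens (MAE-style); the encoder sees
    only the visible tokens and the decoder reconstructs the masked ones."""
    rng = np.random.default_rng(seed)
    num_patches = patch_tokens.shape[0]
    num_keep = int(num_patches * (1 - mask_ratio))
    perm = rng.permutation(num_patches)
    visible_idx = np.sort(perm[:num_keep])   # tokens fed to the encoder
    masked_idx = np.sort(perm[num_keep:])    # targets for reconstruction
    return patch_tokens[visible_idx], visible_idx, masked_idx

tokens = np.random.default_rng(1).normal(size=(196, 768))  # e.g. 14x14 patches
visible, vis_idx, mask_idx = random_masking(tokens)
print(visible.shape, len(mask_idx))  # (49, 768) 147
```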
Fig. 8. MAE structure (figure obtained from [145]).
Joint pretraining. With large datasets of image-text pairs collected from the Internet, multimodal learning [30, 493] has made unprecedented progress in learning data representations, at the forefront of which is cross-modal matching [118]. Contrastive pretraining is widely used to match the image embedding and text embedding in the same representation space [184, 339, 513]. CLIP [339] (see Figure 9) is a pioneering work in this direction and is used in numerous text-to-image models, such as DALL-E 2 [345], Upainting [244], and DiffusionCLIP [209]. ALIGN [184] extended CLIP with noisy text supervision so that the text-image dataset requires no cleaning and can be scaled to a much larger size (from 400M to 1.8B). Florence [513] further expands the cross-modal shared representation from coarse scenes to fine-grained objects and from static images to dynamic videos, etc. Therefore, the learned shared representation is more universal and shows superior performance [513].
Fig. 9. CLIP structure (figure obtained from [339]).
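Below is a minimal NumPy sketch of the symmetric contrastive objective used for image-text matching, assuming already-computed image and text embeddings and a fixed temperature; CLIP itself learns the temperature and uses very large batches, which are simplifications here.

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text
    embeddings: matched pairs sit on the diagonal of the similarity matrix."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) cosine similarities
    labels = np.arange(len(img))                # i-th image matches i-th text

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

rng = np.random.default_rng(0)
print(contrastive_loss(rng.normal(size=(8, 64)), rng.normal(size=(8, 64))))
```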
3.2 Creation techniques in AI
Deep generative models (DGMs) are a group of probabilistic models that use neural networks to generate samples. Early attempts at generative modeling focused on pre-training with an autoencoder [29, 157, 368]. A variant of the autoencoder with masking has emerged to become a dominant self-supervised learning framework, and interested readers are encouraged to check a survey on masked autoencoders [526]. A discussion on the connection between generative models (AIGC) and self-supervised learning is given in section xx. Unless specified, the use cases of deep generative models in this survey only consider generating new data. The generated data is typically high-dimensional; therefore, predicting the label of a sample is considered discriminative rather than generative modeling, even though something like a label is also technically generated.
Numerous DGMs have emerged and can be categorized into two major groups: likelihood-based and energy-based. Likelihood-based probabilistic models, like autoregressive models [130] and flow models [92], have a tractable likelihood, which provides a straightforward way to optimize the model weights w.r.t. the log-likelihood of the observed (training) data. The likelihood is not fully tractable in variational autoencoders (VAEs) [213], but a tractable lower bound can be optimized; thus, VAE is also considered to lie in the likelihood-based group, which specifies a normalized probability. By contrast, energy-based models [132, 156] are characterized by an unnormalized probability, a.k.a. an energy function. Without the constraint on the tractability of the normalizing constant, energy-based models are more flexible to parameterize but difficult to train [407]. Notably, GAN and diffusion models are highly related to energy-based models even though they were developed from different motivations. In the following, we present an introduction to each class of likelihood-based models, followed by how energy-based models can be trained as well as the mechanisms behind GAN and diffusion models.
3.2.1 Likelihood-based models. Autoregressive models. Autoregressive models learn the joint distribution of sequential data and predict each variable in the sequence with the previous time-step variables as inputs. As shown in Eq. 2, autoregressive models assume that the joint distribution $p_\theta(x)$ can be decomposed into a product of conditional distributions:

\[ p_\theta(x) = p_\theta(x_1)\, p_\theta(x_2 \mid x_1) \cdots p_\theta(x_n \mid x_1, x_2, \ldots, x_{n-1}). \qquad (2) \]

Although both rely on previous timesteps, autoregressive models differ from the RNN architecture since the previous timesteps are given to the model as inputs instead of hidden states as in RNN. In other words, autoregressive models can be seen as a feed-forward network that takes all the previous time-step variables as inputs. Early works model discrete data with different functions estimating the conditional distribution, e.g., logistic regression in the Fully Visible Sigmoid Belief Network (FVSBN) [117] and one-hidden-layer neural networks in Neural Autoregressive Distribution Estimation (NADE) [224]. Later research further extends these to model continuous variables [442, 443]. Autoregressive methods have been widely applied in multiple areas, including computer vision (PixelCNN [446] and PixelCNN++ [376]), audio generation (WaveNet [445]), and natural language processing (Transformer [448]).
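A minimal sketch of ancestral sampling under the factorization in Eq. (2) follows; the three-token vocabulary and the hypothetical conditional distribution stand in for a learned model and are purely illustrative.

```python
import numpy as np

def sample_autoregressive(cond_prob, vocab, length, seed=0):
    """Sample x_1, ..., x_n one variable at a time, each conditioned on the
    prefix, following the factorization in Eq. (2)."""
    rng = np.random.default_rng(seed)
    sequence = []
    for _ in range(length):
        probs = cond_prob(sequence)            # p(x_t | x_1, ..., x_{t-1})
        token = rng.choice(vocab, p=probs)
        sequence.append(token)
    return sequence

vocab = ["a", "b", "c"]

def toy_cond_prob(prefix):
    # A hypothetical conditional distribution: favor repeating the last token.
    if not prefix:
        return np.array([1 / 3, 1 / 3, 1 / 3])
    probs = np.full(len(vocab), 0.15)
    probs[vocab.index(prefix[-1])] = 0.7
    return probs

print(sample_autoregressive(toy_cond_prob, vocab, length=8))
```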
VAE. Autoencoders are a family of models that first map the input to a low-dimensional latent space with an encoder and then reconstruct the input with a decoder. The entire encoder-decoder process aims to learn the underlying data patterns and generate unseen samples [313]. The variational autoencoder (VAE) [213] is an autoencoder that learns the data distribution $p(x)$ through the latent space $z$, i.e., $p(x) = \int p(x \mid z)\, p(z)\, \mathrm{d}z$, where $p(x \mid z)$ is learned by the decoder. In order to obtain $p(z)$, VAE [213] adopts Bayes' theorem and approximates the posterior distribution $p(z \mid x)$ with the encoder. The VAE model is optimized toward a likelihood objective with a regularizer [14].
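To make the objective concrete, below is a minimal PyTorch sketch of the VAE training loss (reconstruction plus KL regularizer) together with the reparameterization trick; the Gaussian posterior, the MSE reconstruction term, and the toy tensor shapes are illustrative assumptions, not a specific implementation from the cited works.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, log_var):
    """Negative ELBO: reconstruction term plus the KL regularizer that pulls
    the approximate posterior q(z|x) = N(mu, sigma^2) towards the prior N(0, I)."""
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps so gradients flow through mu and log_var."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

# Toy usage with random tensors standing in for encoder/decoder outputs.
x = torch.randn(4, 784)
mu, log_var = torch.randn(4, 16), torch.randn(4, 16)
z = reparameterize(mu, log_var)
x_recon = torch.sigmoid(torch.randn(4, 784))  # placeholder decoder output
print(vae_loss(x, x_recon, mu, log_var).item())
```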
3.2.2 Energy-based models. With a tractable likelihood, autoregressive models and flow models allow a straightforward optimization of the parameters w.r.t. the log-likelihood of the data. This forces the model to be constrained to a certain form. For example, the autoregressive model needs to be factorized as a product of conditional probabilities, and the flow model must adopt invertible transformations.

Energy-based models specify the probability up to an unknown normalizing constant; therefore, they are also known as non-normalized probabilistic models. Without loss of generality, assuming the energy-based model is over a single variable $\mathbf{x}$, we denote its energy as $E_\theta(\mathbf{x})$. Its probability density is then calculated as

\[ p_\theta(\mathbf{x}) = \frac{\exp(-E_\theta(\mathbf{x}))}{z_\theta}, \qquad (3) \]

where $z_\theta$ is the so-called normalizing constant, defined as $z_\theta = \int \exp(-E_\theta(\mathbf{x}))\, \mathrm{d}\mathbf{x}$. Since $z_\theta$ is an intractable integral, optimizing energy-based models is a challenging task.
MCMC and NCE. Early attempts at optimizing energy-based models opted to estimate the gradient of the log-likelihood with Markov chain Monte Carlo (MCMC) approaches, which require cumbersome drawing of random samples. Therefore, some works aim to improve the efficiency of MCMC, with a representative work being Langevin MCMC [132, 319]. Nonetheless, performing MCMC to obtain samples still requires large computation, and contrastive divergence (CD) [156] is a popular method to reduce the computation via approximation, with various variants: persistent CD [429], mean-field CD [474], and multi-grid CD [119]. Another line of work optimizes energy-based models via noise contrastive estimation (NCE) [141], which contrasts the probabilistic model with another noise distribution. Specifically, it optimizes the following loss:

\[ -\mathbb{E}_{p_d}\!\left[\ln \frac{p_\theta(\mathbf{x})}{p_\theta(\mathbf{x}) + q_\phi(\mathbf{x})}\right] - \mathbb{E}_{q_\phi}\!\left[\ln \frac{q_\phi(\mathbf{x})}{p_\theta(\mathbf{x}) + q_\phi(\mathbf{x})}\right]. \qquad (4) \]
Score matching. For optimizing energy-based models, another popular MCMC-free method minimizes the difference between the derivatives of the log probability density of the model and of the observed data. The first-order derivative of a log probability density function is called the score of the distribution, $s(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x})$; therefore, this method is often termed score matching. Unfortunately, the data score function $s_d(\mathbf{x})$ is unavailable. Various attempts [317, 377, 392, 405, 406, 451] have been made to mitigate this issue, with a representative method called denoising score matching [451]. Denoising score matching approximates the score of the data with noisy samples. The model takes a noisy sample as the input and predicts its noise. Therefore, it can be used for sampling clean samples from noise by iteratively removing the noise [377, 405].
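A minimal PyTorch sketch of denoising score matching with a single Gaussian noise scale follows: the network is regressed onto the score of the perturbation kernel, which equals (x - x_noisy)/sigma^2. The small MLP, the single noise scale, and the loss weighting are illustrative assumptions.

```python
import torch

def denoising_score_matching_loss(score_net, x, sigma=0.1):
    """Perturb clean samples with Gaussian noise and regress the network
    output onto the score of the perturbation kernel, (x - x_noisy) / sigma^2."""
    noise = torch.randn_like(x) * sigma
    x_noisy = x + noise
    target = -noise / sigma ** 2           # equals (x - x_noisy) / sigma^2
    pred = score_net(x_noisy)
    return ((pred - target) ** 2).sum(dim=1).mean()

# Toy usage: a small MLP as the score network on 2-D data.
score_net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
x = torch.randn(128, 2)
print(denoising_score_matching_loss(score_net, x).item())
```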
3.2.3 Two star models: from GAN to diffusion model. When it comes to deep generative models, what first comes to your mind? The answer depends on your background; however, GAN is definitely one of the most mentioned models. GAN stands for generative adversarial network [127], which was first proposed by Ian J. Goodfellow and his team in 2014 and rated as "the most interesting idea in the last 10 years in machine learning" by Yann LeCun in 2016. As the pioneering work to generate images of reasonably high quality, GAN has been widely regarded as the de facto standard model for the challenging task of image synthesis. This long-time dominance has recently been challenged by a new family of deep generative models termed diffusion models [159]. The overwhelming success of diffusion models started from image synthesis but extends to other modalities, like video, audio, text, graph, etc. Considering their dominant influence on the development of generative AI, we first summarize GAN and diffusion models before introducing other families of deep generative models.
GAN. The architecture of GAN is shown in Figure 10. GAN is characterized by its two network components: a discriminator (D) and a generator (G). D distinguishes real images from those generated by G, while G aims to fool D. Given a latent variable $\mathbf{z} \sim p_{\mathbf{z}}$, the output of G is $G(\mathbf{z})$, constituting a probability distribution $p_g$. The goal of GAN is to make $p_g$ approximate the observed data distribution $p_{data}$. This objective is achieved through adversarial learning, which can be interpreted as a min-max game [378]:

\[ \min_{G} \max_{D} \; \mathbb{E}_{\mathbf{x} \sim p_{data}} \log[D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}} \log[1 - D(G(\mathbf{z}))], \qquad (5) \]

where D is trained to maximize the probability of assigning correct labels to real images and generated ones, and is used to guide the optimization of G towards generating more realistic images. GANs have the weakness of potentially unstable training and less diversity in generation due to their adversarial training nature. The basic difference between GANs and autoregressive models is that GANs learn an implicit data distribution, whereas the latter learn an explicit distribution governed by a prior imposed by the model structure.
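To make the alternating optimization of Eq. (5) concrete, below is a minimal PyTorch sketch of one discriminator step and one generator step on toy 1-D data; the network sizes, learning rates, and data distribution are illustrative assumptions, and practical GANs usually replace the literal generator term with the non-saturating loss.

```python
import torch
import torch.nn as nn

# Toy 1-D GAN: G maps noise to samples, D outputs a probability of "real".
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
eps = 1e-8  # avoid log(0)

real = torch.randn(64, 1) * 2 + 3          # stand-in for the data distribution
z = torch.randn(64, 8)

# Discriminator step: maximize log D(x) + log(1 - D(G(z))).
fake = G(z).detach()
loss_d = -(torch.log(D(real) + eps).mean()
           + torch.log(1 - D(fake) + eps).mean())
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: minimize log(1 - D(G(z))), the inner term of Eq. (5).
loss_g = torch.log(1 - D(G(z)) + eps).mean()
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
print(loss_d.item(), loss_g.item())
```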
Fig. 10. A schematic of GAN structure.
Diusion model. The use of diusion models, a special form of hierarchical VAEs, has seen explosive growth in the
past few years [
46
,
75
,
248
,
323
,
440
]. Diusion models (Figure 11) are also known as denoising diusion probabilistic
Manuscript submitted to ACM
A Complete Survey on Generative AI (AIGC): Is ChatGPT from GPT-4 to GPT-5 All You Need? 17
models (DDPMs) or score-based generative models that generate new data similar to the data on which they are
trained [
159
]. Inspired by non-equilibrium thermodynamics, DDPMs can be dened as a parameterized Markov chain
of diusion steps to slowly add random noise to the training data and learn to reverse the diusion process to construct
desired data samples from the pure noise.
Fig. 11. Diffusion model for image generation (figure obtained from [159]).
In the forward diffusion process, DDPM destroys the training data through the successive addition of Gaussian noise. Given a data distribution $\mathbf{x}_0 \sim q(\mathbf{x}_0)$, DDPM maps the training data to noise by gradually perturbing the input data. This is formally achieved by a simple stochastic process that starts from a data sample and iteratively generates noisier samples $\mathbf{x}_T$ with $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$, using a simple Gaussian diffusion kernel:

\[ q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) := \prod_{t=1}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1}), \qquad (6) \]
\[ q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) := \mathcal{N}\big(\mathbf{x}_t; \sqrt{1-\beta_t}\, \mathbf{x}_{t-1}, \beta_t \mathbf{I}\big), \qquad (7) \]

where $T$ and $\beta_t$ are the diffusion steps and hyper-parameters, respectively. We only discuss the case of Gaussian noise as the transition kernel for simplicity, indicated as $\mathcal{N}$ in Eq. 7. With $\alpha_t := 1 - \beta_t$ and $\bar{\alpha}_t := \prod_{s=0}^{t} \alpha_s$, we can obtain the noised image at an arbitrary step $t$ as follows:

\[ q(\mathbf{x}_t \mid \mathbf{x}_0) := \mathcal{N}\big(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0, (1-\bar{\alpha}_t) \mathbf{I}\big). \qquad (8) \]
During the reverse denoising process, DDPM learns to recover the data by reversing the noising process, i.e., it undoes the forward diffusion by performing iterative denoising. This process represents data synthesis, and DDPM is trained to generate data by converting random noise into real data. It is also formally defined as a stochastic process, which iteratively denoises the input data starting from $p_\theta(\mathbf{x}_T)$ and generates $p_\theta(\mathbf{x}_0)$, which can follow the true data distribution $q(\mathbf{x}_0)$. Therefore, the optimization objective of the model is as follows:

\[ \mathbb{E}_{t \sim \mathcal{U}(1,T),\, \mathbf{x}_0 \sim q(\mathbf{x}_0),\, \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \big[ \lambda(t)\, \| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \|^2 \big]. \qquad (9) \]

Both the forward and reverse processes of DDPMs often use thousands of steps for gradual noise injection and, during generation, for denoising.
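A minimal PyTorch sketch of the closed-form forward sampling in Eq. (8) and the training objective in Eq. (9) with lambda(t) = 1 follows; the linear beta schedule, the toy MLP noise predictor, and the simple timestep conditioning are illustrative assumptions rather than the configuration of any cited work.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # a common linear schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)       # cumulative product \bar{alpha}_t

def q_sample(x0, t, noise):
    """Eq. (8): sample x_t ~ q(x_t | x_0) in closed form."""
    a = alpha_bar[t].view(-1, 1)
    return torch.sqrt(a) * x0 + torch.sqrt(1 - a) * noise

def ddpm_loss(eps_net, x0):
    """Eq. (9) with lambda(t) = 1: predict the injected noise from x_t and t."""
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    t_feat = (t.float() / T).view(-1, 1)       # simple timestep conditioning
    pred = eps_net(torch.cat([x_t, t_feat], dim=1))
    return ((pred - noise) ** 2).mean()

# Toy usage on 2-D data with a small MLP as the noise predictor.
eps_net = torch.nn.Sequential(
    torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
x0 = torch.randn(128, 2)
print(ddpm_loss(eps_net, x0).item())
```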
4 AIGC TASK: TEXT GENERATION
NLP studies natural language with two fundamental tasks: understanding and generation. These two tasks are not exclusively separate because the generation of an appropriate text often depends on the understanding of some text inputs. For example, language models often transform a sequence of text into another, which constitutes the core task of text generation, including machine translation, text summarization, and dialogue systems. Beyond this, text generation evolves in two directions: controllability and multi-modality. The first direction aims to make the generated content controllable, while the second takes inputs beyond text, such as images or speech.
4.1 Text to text
4.1.1 Chatbots. The main task of a dialogue system (chatbot) is to provide better communication between humans and machines [87, 302]. According to whether a task is specified in the application, dialogue systems can be divided into two categories: (1) task-oriented dialogue systems (TOD) [326, 508, 540] and (2) open-domain dialogue systems [4, 539, 548]. Specifically, task-oriented dialogue systems focus on task completion and solve specific problems (e.g., restaurant reservations and ticket booking) [540]. Meanwhile, open-domain dialogue systems are often data-driven and aim to chat with humans without task or domain restrictions [357, 540].
Task-oriented systems. Task-oriented dialogue systems can be divided into modular and end-to-end systems. The modular methods include four main parts: natural language understanding (NLU) [399, 413], dialogue state tracking (DST) [385, 468], dialogue policy learning (DPL) [172, 489], and natural language generation (NLG) [26, 102]. After NLU encodes the user inputs into semantic slots, DST and DPL decide the next action, which is then converted into natural language by NLG as the final response. These four modules aim to generate responses in a controllable way and can be optimized individually. However, some modules may not be differentiable, and the improvement of a single module may not lead to the improvement of the whole system [540]. To solve these problems, end-to-end methods either achieve an end-to-end training pipeline by making each module differentiable [143, 165] or use a single end-to-end module in the system [504, 538]. There still exist several challenges for both modular and end-to-end systems, including how to improve tracking efficiency for DST [211, 315] and how to increase the response quality of end-to-end systems with limited data [149, 151, 285].
Fig. 12. A diagram illustrating the three steps of how ChatGPT is trained by OpenAI (figure obtained from [314]).
Open-domain systems. Open-domain systems aim to chat with users without task and domain restrictions [357, 540] and can be categorized into three types: retrieval-based systems, generative systems, and ensemble systems [540]. Specifically, retrieval-based systems always select an existing response from a response corpus, while generative systems can generate responses that may not appear in the training set. Ensemble systems combine retrieval-based and generative methods by either choosing the best response or refining the retrieval-based result with the generative model [381, 540, 553]. Previous works improve open-domain systems from multiple aspects, including dialogue context modeling [108, 185, 253, 285], improving response coherence [9, 120, 254, 489], and diversity [32, 214, 338, 412]. Most recently, ChatGPT (see Figure 12) has achieved unprecedented success and also falls into the scope of open-domain dialogue systems. Apart from answering various questions, ChatGPT can also be used for paper writing, code debugging, and table generation, to name but a few.
Fig. 13. An example of machine translation (figure obtained from [39]).
4.1.2 Machine translation. As the term suggests, machine translation automatically translates text from one language to another [174, 503] (see Figure 13). With deep learning replacing rule-based [111] and statistical [215, 216] methods, neural machine translation (NMT) requires minimal linguistic expertise [403, 457] and has become the mainstream approach, featured by its higher capacity to capture long dependencies in the sentence [63]. The success of neural machine translation can be mainly attributed to language models [35], which predict the probability of a word conditioned on previous ones. Seq2seq [417] is a pioneering work that applied an encoder-decoder RNN structure [195] to machine translation. When the sentence gets long, the performance of Seq2seq [417] deteriorates, for which an attention mechanism was proposed in [25] to help translate long sentences with additional word alignment. With increasing attention, in 2016, Google's NMT system helped reduce the translation effort of humans by around 60% compared to Google's phrase-based production system, bridging the gap between human and machine translation [481]. CNN-based architectures have also been investigated for NMT in numerous attempts [194, 196], but failed to achieve comparable performance to RNNs boosted by attention [25]. Convolutional Seq2seq [123] makes CNNs compatible with the attention mechanism, showing that CNNs can achieve comparable or even better performance than RNNs. However, this improvement was later outperformed by another architecture termed the Transformer [448]. With an RNN or Transformer as the architecture, NMT often utilizes an autoregressive generative model, where a greedy search only considers the word with the highest probability when predicting the next word during inference.
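A minimal sketch of greedy decoding follows; the hypothetical step function and five-token vocabulary stand in for a trained NMT model's conditional distribution.

```python
import numpy as np

def greedy_decode(step_fn, bos, eos, max_len=20):
    """Greedy search: at every step keep only the single most probable
    next token given the partial translation generated so far."""
    tokens = [bos]
    for _ in range(max_len):
        probs = step_fn(tokens)               # p(next token | prefix)
        next_tok = int(np.argmax(probs))      # greedy choice, no beam
        tokens.append(next_tok)
        if next_tok == eos:
            break
    return tokens

# Toy usage with a hypothetical 5-token vocabulary (0 = BOS, 4 = EOS).
def toy_step_fn(prefix):
    rng = np.random.default_rng(len(prefix))  # deterministic toy distribution
    logits = rng.normal(size=5)
    logits[4] += len(prefix) * 0.5            # make EOS more likely over time
    e = np.exp(logits - logits.max())
    return e / e.sum()

print(greedy_decode(toy_step_fn, bos=0, eos=4))
```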
A trend in NMT is to achieve satisfactory performance in low-resource setups, where the model is trained with a limited bilingual corpus [464]. One way to mitigate this data scarcity is to utilize auxiliary languages, like multilingual training with other language pairs [191, 386, 554] or pivot translation with English as the intermediate pivot language [59, 353].
Another popular approach is to utilize pre-trained language models, like BERT [89] or GPT [341]. For example, it is shown in [363] that initializing the model weights with BERT [89] or RoBERTa [262] significantly improves the English-German translation performance. Without the need for fine-tuning, GPT-family models [41, 341, 342] also show competitive performance. Most recently, ChatGPT has shown its power in machine translation, performing competitively with commercial products (e.g., Google Translate) [186].
4.2 Multimodal text generation
4.2.1 Image-to-text. Image-to-text, also known as image captioning, refers to describing the content of a given image in natural language (see Figure 14). A seminal work in this area is the Neural Image Caption (NIC) [452], which employs a CNN as an encoder to extract high-level representations of input images and then feeds these representations into an RNN decoder to generate image descriptions. This two-step encoder-decoder architecture has been widely applied in later works on image captioning, and we term the two steps visual encoding [411] and language decoding, respectively. Here, we first revisit the history and recent trends of both stages in image captioning.
Fig. 14. An example of image captioning (figure obtained from [112]).
Visual encoding. Extracting an effective representation of images is the main task of the visual encoding module. Starting from NIC [452], which uses GoogleNet [421] to extract a global feature of the input image, multiple works adopt various CNN
backbones as the encoder, including AlexNet [219] in [199] and the VGG network [397] in [94, 275]. However, it is hard for a language model to generate fine-grained captions with only global visual features. Follow-up works introduce attention mechanisms for fine-grained visual features, including attention over different grids of CNN features [57, 267, 469, 490] or over different visual regions [17, 204, 524]. Another branch of work [506, 543] adopts graph neural networks to encode the semantic and spatial relationships between different regions. However, human-defined graph structures may limit the interactions among elements [411], which can be mitigated by self-attention methods [234, 507, 537] (including ViT [259]) that connect all the elements.
Language decoding. In image captioning, a language decoder generates captions by predicting the probability of a given word sequence [411]. Inspired by breakthroughs in the NLP area, the backbones of language decoders have evolved from RNNs [204, 267, 452, 462] to Transformers [136, 152, 234], achieving significant performance improvements. Beyond the visual encoder-language decoder architecture, a branch of work adopts a BERT-like architecture that fuses the image and captions in the early stage of a single model [247, 533, 549]. For example, [549] adopts a single encoder to learn a shared space for image and text, which is first pre-trained on a large image-text corpus and then fine-tuned specifically for image captioning tasks.
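The encoder-decoder recipe can be summarized with a minimal PyTorch sketch in the NIC style, where a CNN produces a global image feature that conditions an RNN language decoder; the backbone choice and hyperparameters are illustrative, not those of any specific published model.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptioningModel(nn.Module):
    """Minimal visual-encoder / language-decoder sketch (NIC-style)."""

    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        backbone = models.resnet18()                         # visual encoding backbone (illustrative)
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.encoder = backbone
        self.embed = nn.Embedding(vocab_size, embed_dim)     # language decoding
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        img_feat = self.encoder(images).unsqueeze(1)         # (B, 1, embed_dim)
        tokens = self.embed(captions)                        # (B, L, embed_dim)
        seq = torch.cat([img_feat, tokens], dim=1)           # image feature acts as the first "word"
        hidden, _ = self.rnn(seq)
        return self.out(hidden)                              # next-token logits over the vocabulary
```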
4.2.2 Speech-to-Text. Speech-to-text generation, also known as automatic speech recognition (ASR), is the process of converting spoken language, specifically a speech signal, into the corresponding text [177, 350] (see Figure 15). With many potential applications such as voice dialing, computer-assisted language learning, caption generation, and virtual assistants like Alexa and Siri, ASR has been an exciting field of research [198, 273, 348] since the 1950s, and has evolved from hidden Markov models (HMM) [192, 228] to DNN-based systems [77, 131, 155, 300, 479].
Fig. 15. An example of speech recognition (figure obtained from [47]).
Various research topics and challenges. Previous works have improved ASR systems in various aspects. Multiple works discuss different feature extraction methods for speech signals [273], including temporal features (e.g., the discrete wavelet transform [290, 423]) and spectral features such as the most commonly used mel-frequency cepstral coefficients (MFCC) [62, 71, 434]. Another branch of work improves the system pipeline [359] from multi-model [271] to end-to-end ones [164, 236, 237, 299, 459]. Specifically, a multi-model system [271, 273] first learns an acoustic model (e.g., a phoneme classifier that maps the features to phonemes) and then a language model for the word outputs [359]. On the other hand, end-to-end models directly predict the transcriptions from the audio input [164, 236, 237, 299, 459]. Although end-to-end models achieve impressive performance on various languages and dialects, many challenges still exist. First, their application to under-resourced speech tasks remains challenging, as it is costly and time-consuming to acquire vast amounts of annotated training data [107, 359]. Second, these systems may struggle to handle speech with specialized out-of-vocabulary words, and may perform well on the training data but not generalize well to new or
unseen data [107, 337]. Moreover, biases in the training data can also affect the performance of supervised ASR systems, leading to poor accuracy on certain groups of people or speech styles [36].
Fig. 16. Examples of image restoration (figure obtained from [458]).
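As a concrete example of the spectral features mentioned above, the following few lines extract MFCCs from a waveform, assuming the librosa library is available; the file path and frame parameters are placeholders.

```python
import librosa

# Minimal sketch of spectral feature extraction for ASR (assuming librosa).
waveform, sample_rate = librosa.load("speech.wav", sr=16000)            # "speech.wav" is a placeholder path
mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=13)      # shape: (13, n_frames)
print(mfcc.shape)  # one 13-dimensional MFCC vector per analysis frame
```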
Under-resourced speech tasks. Researchers are working on new technologies to overcome the challenges in ASR systems; here we mainly discuss the under-resourced speech problem, where data for impaired speech is lacking [359]. A branch of work [324, 349] adopts multi-task learning to optimize a shared encoder for different tasks. Meanwhile, self-supervised ASR systems, which do not rely on a large number of labeled samples, have recently become an active area of research. Specifically, self-supervised ASR systems first pre-train a model on huge volumes of unlabeled speech data and then fine-tune it on a smaller set of labeled data, improving the data efficiency of ASR systems. This recipe can be applied to low-resource languages, to handling different speaking styles or noise conditions, and to transcribing multiple languages [24, 73, 258, 498].
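The pre-train-then-fine-tune recipe can be sketched as follows: a speech encoder already pre-trained on unlabeled audio (passed in as a generic module here) is topped with a linear layer and fine-tuned on a small labeled set with the CTC objective. The module and function names are illustrative, not the API of any specific toolkit.

```python
import torch
import torch.nn as nn

class CTCFineTuner(nn.Module):
    """Fine-tuning head on top of a (hypothetical) pre-trained speech encoder."""

    def __init__(self, encoder: nn.Module, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.encoder = encoder                              # pre-trained, possibly frozen
        self.head = nn.Linear(hidden_dim, vocab_size + 1)   # +1 output for the CTC blank token

    def forward(self, waveforms):                           # waveforms: (batch, time)
        features = self.encoder(waveforms)                  # (batch, frames, hidden_dim)
        return self.head(features).log_softmax(dim=-1)

def finetune_step(model, waveforms, targets, input_lens, target_lens, optimizer):
    """One supervised step on the small labeled set, using the CTC loss.

    `input_lens` are the encoder output frame lengths, `target_lens` the
    transcript lengths; the blank index is the last class of the head.
    """
    log_probs = model(waveforms).transpose(0, 1)            # CTCLoss expects (frames, batch, classes)
    ctc = nn.CTCLoss(blank=model.head.out_features - 1)
    loss = ctc(log_probs, targets, input_lens, target_lens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```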
5 AIGC TASK: IMAGE GENERATION
Similar to text generation, the task of image synthesis can also be categorized into different classes based on its input control. Since the output is an image, a straightforward type of control is another image. Image-type control induces numerous tasks, like super-resolution, deblurring, editing, translation, etc. A limitation of image-type control is its lack of flexibility. By contrast, text-guided control enables the generation of any image content in any style at the free will of humans. Text-to-image falls into the category of cross-modal generation, since the input text is a different modality from the output image.
5.1 Image-to-image
5.1.1 Image restoration. Image restoration solves a typical inverse problem: restoring clean images from their corresponding degraded versions, with examples shown in Figure 16. Such an inverse problem is non-trivial due to its ill-posed nature, because there are infinitely many possible mappings from a degraded image to a clean one. There are two sources of degradation: missing information from the original image, and something undesirable added to the clean image. The former type of degradation includes capturing a photo at a low resolution and thus losing detailed information, cropping out a certain region, and transforming a colorful image into its grayscale form; the restoration tasks that recover them are, in order, image super-resolution, inpainting, and colorization, respectively. Another class of restoration tasks aims to
remove undesirable perturbations, such as denoising, deraining, dehazing, deblurring, etc. Early restoration techniques primarily use mathematical and statistical modeling to remove image degradations, including spatial filters for denoising [126, 395, 536] and kernel estimation for deblurring [491, 495]. Lately, deep learning-based methods [43, 60, 95, 181, 251, 255, 487, 492] have become predominant in image restoration tasks due to their versatility and superior visual quality over their traditional counterparts. CNNs are widely used as the building block in image restoration [96, 415, 447, 465], while recent works explore the more powerful Transformer architecture and achieve impressive performance in various tasks, such as image super-resolution [250], colorization [221], and inpainting [243]. There are also works that combine the strengths of CNNs and Transformers [106, 541, 542].
Generative methods for restoration. Typical image restoration models learn a mapping between the source (degraded) and target (clean) images with a reconstruction loss. Depending on the task, training data pairs can be generated by degrading clean images with various perturbations, such as resolution downsampling and grayscale transformation. To keep more high-frequency details and create more realistic images, generative models are widely used for restoration, such as GANs in super-resolution [226, 466, 535] and inpainting [43, 255, 301]. However, GAN-based models typically suffer from a complex training process and mode collapse. These drawbacks, together with the massive popularity of DMs, have led numerous recent works to adopt DMs for image restoration tasks [203, 235, 268, 352, 370, 372]. Generative approaches like GANs and DMs can also produce multiple variations of the clean output from a single degraded image.
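A minimal sketch of the supervised recipe described above is given below: clean images are synthetically degraded (here by bicubic down/up-sampling, as in super-resolution) to form training pairs, and the restoration network is trained with a simple reconstruction loss. The degradation choice and function names are illustrative.

```python
import torch
import torch.nn.functional as F

def make_pair(clean, scale=4):
    """Create a (degraded, clean) pair by downsampling and re-upsampling a clean
    image tensor of shape (B, C, H, W). Other tasks would instead add noise,
    masks, or grayscale conversion."""
    low = F.interpolate(clean, scale_factor=1.0 / scale, mode="bicubic", align_corners=False)
    degraded = F.interpolate(low, size=clean.shape[-2:], mode="bicubic", align_corners=False)
    return degraded, clean

def restoration_loss(model, clean_batch):
    """Reconstruction (L1) loss between the restored image and its clean target."""
    degraded, target = make_pair(clean_batch)
    restored = model(degraded)          # `model` is any restoration network
    return F.l1_loss(restored, target)
```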
From single-task to multi-task. A majority of existing restoration approaches train separate models for different forms of image degradation. This limits their effectiveness in practical use cases where images are corrupted by a combination of degradations. To address this, several studies [6, 210, 394, 547] introduce multi-distortion datasets that combine various forms of degradation at different intensities. Some studies [210, 261, 511, 515] propose restoration models in which different sub-networks are responsible for different degradations. Another line of work [231, 245, 394, 414, 547] relies on attention modules or a guiding sub-network to assist the restoration network through different degradations, allowing a single network to handle multiple degradations.
5.1.2 Image editing. In contrast to image restoration, which enhances image quality, image editing refers to modifying an image to meet a certain need, like style transfer (see Figure 17). Technically, some image restoration tasks like colorization might also be perceived as image editing if adding color is the desired modification. Modern cameras often have basic editing features such as sharpness adjustment [531], automatic cropping [532], red-eye removal [400], etc. However, in AIGC, we are more interested in advanced image editing tasks that change the image semantics in various forms, such as content, style, object attributes, etc.
One family of image editing methods aims to modify the attributes (like age) of the main object (like a face) in the image. A typical use case is facial attribute editing, which can change the hairstyle, age, or even gender. Based on a pre-trained CNN encoder, a line of pioneering works adopts optimization-based approaches [239, 441], which are time-consuming due to their iterative nature. Another line of works adopts learning-based approaches to directly generate the image, with a trend from a single attribute [240, 388] to multiple ones [150, 212, 484]. A drawback of most aforementioned methods is their dependence on annotated attribute labels; therefore, unsupervised learning has been introduced to disentangle different attributes [61, 389].
Fig. 17. Examples of style transfer as a form of image editing (figure obtained from [121]).
Another family of image editing changes the semantics by combining two images. For example, image morphing [189] interpolates the content of two images, while style transfer [122] yields a new image with the content of one image and the style of the other. A naive method for image morphing is to perform interpolation in the pixel space, which causes obvious artifacts. By contrast, interpolating in the latent space can account for the view change and generate a smooth image. The latent codes for the two images can be obtained via a GAN inversion method [483]. Numerous works [1, 496, 551, 552] have explored the latent space of a pre-trained GAN for image morphing. For the task of style transfer, a style-based variant of GAN termed StyleGAN [201] is a popular choice. From the earlier layers to the latter ones, StyleGAN controls attributes from coarser-grained (like structure) to finer-grained ones (like texture). Therefore, StyleGAN can be used for style transfer by mixing the earlier layers' latent representation of the content image with the latter layers' latent representation of the style image [1, 135, 449, 473].
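The latent-space interpolation idea can be illustrated with the sketch below; `generator` stands for any pre-trained GAN generator, and `z_a`, `z_b` for latent codes recovered for two real images via GAN inversion. All names and shapes are illustrative.

```python
import torch

@torch.no_grad()
def morph(generator, z_a, z_b, steps=8):
    """Image morphing by linear interpolation in a GAN's latent space."""
    frames = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        z = (1 - alpha) * z_a + alpha * z_b   # interpolate in latent space, not pixel space
        frames.append(generator(z))           # decode each intermediate code to an image
    return frames
```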
Compared with restoration tasks, the various editing tasks enable more flexible image generation. However, their diversity is still limited, which can be alleviated by additionally allowing text as the input. More recently, image editing based on diffusion models has been widely discussed and has achieved impressive results [49, 153, 209, 456]. DiffusionCLIP [209] is a pioneering work that fine-tunes a pre-trained diffusion model to align the target image and text. By contrast, LDEdit [49] avoids fine-tuning by building on LDM [361]. A branch of works discusses the mask problem in image editing, including how to connect a manually designed masked region and the background seamlessly [3, 20, 22]. On the other hand, DiffEdit [74] proposes to automatically predict a mask indicating which part is to be edited. There are also works editing 3D objects based on diffusion models and text guidance [48, 208, 233].
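One common way to handle the mask problem, sketched below, is to blend generated content inside the mask with a re-noised copy of the original background outside the mask at every reverse step (in the spirit of blended or inpainting-style diffusion editing); the `denoiser` and `add_noise` callables are hypothetical placeholders, not the API of any specific method discussed above.

```python
import torch

@torch.no_grad()
def masked_edit_step(denoiser, x_t, background, mask, t, add_noise):
    """One reverse step of mask-guided diffusion editing (blending sketch).

    `denoiser(x_t, t)` stands for one reverse (denoising) step of a pre-trained
    diffusion model, and `add_noise(img, t)` for the forward process that noises
    the clean background image to the same noise level. `mask` is 1 inside the
    region to be edited and 0 elsewhere.
    """
    edited = denoiser(x_t, t)                   # propose content for the whole image
    kept = add_noise(background, t)             # background re-noised to the matching level
    return mask * edited + (1 - mask) * kept    # only the masked region is actually edited
```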
5.2 Multimodal image generation
5.2.1 Text-to-image. The text-to-image (T2I) task aims to generate images from textual descriptions (see Figure 18), and can be traced back to image generation from tags or attributes [409, 501]. AlignDRAW [274] is a pioneering work on generating images from natural language, and it is impressive that AlignDRAW [274] can generate images from novel text like 'a stop sign is flying in blue skies'. More recently, advances in the text-to-image area can be categorized into three branches: GAN-based methods, autoregressive methods, and diffusion-based methods.
GAN-based methods. The limitation of AlignDRAW [274] is that the generated images are unrealistic and require an additional GAN for post-processing. Based on the deep convolutional generative adversarial network (DCGAN) [340], the work in [351] is the first end-to-end differentiable architecture from the character level to the pixel level. To generate high-resolution images while stabilizing the training process, StackGAN [529] and StackGAN++ [530] propose a multi-stage mechanism in which multiple generators produce images at different scales, and high-resolution image generation is conditioned on the low-resolution images. Moreover, AttnGAN [494] and ControlGAN [232] adopt attention networks to obtain fine-grained control over subregions according to the relevant words.
Autoregressive methods. Inspired by the success of autoregressive Transformers [448], a branch of works generates images in an autoregressive manner by mapping images to a sequence of tokens, among which DALL-E [346] is a pioneering work. Specifically, DALL-E [346] first converts images to image tokens with a pre-trained discrete variational autoencoder (dVAE), then trains an autoregressive Transformer to learn the joint distribution of text and image tokens. A concurrent work, CogView [90], independently proposes the same idea as DALL-E [346] but achieves a superior FID [154] to DALL-E [346] on the blurred MS COCO dataset. CogView2 [91] extends CogView [90] to various tasks, e.g., image captioning, by masking different tokens. Parti [510] further improves the image quality by scaling the model size to 20 billion parameters.
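The DALL-E-style pipeline can be summarized with the conceptual sketch below: a discrete image tokenizer maps pixels to tokens, and a decoder-only Transformer models the concatenated text-then-image token sequence autoregressively. Both components are placeholders rather than released model parts.

```python
import torch
import torch.nn as nn

class AutoregressiveT2I(nn.Module):
    """Conceptual sketch of DALL-E-style autoregressive text-to-image training.

    `image_tokenizer` stands for a pre-trained discrete VAE mapping an image to a
    grid of discrete tokens, and `transformer` for a decoder-only Transformer over
    the joint token sequence; both are hypothetical modules.
    """

    def __init__(self, image_tokenizer: nn.Module, transformer: nn.Module):
        super().__init__()
        self.image_tokenizer = image_tokenizer
        self.transformer = transformer

    def forward(self, text_tokens, images):
        image_tokens = self.image_tokenizer(images)               # (B, n_image_tokens)
        sequence = torch.cat([text_tokens, image_tokens], dim=1)  # text tokens first, then image tokens
        return self.transformer(sequence)                         # next-token logits, trained with cross-entropy
```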
Diusion-based methods. Diusion model-based methods have achieved unprecedented success and attention
recently, which can be categorized by either working on the pixel space directly [
303
,
371
] or the latent space [
345
,
361
].
GLIDE [
303
] outperforms DALL-E by extending class-conditional diusion models to text-conditional settings, while
Imagen [
371
] improves the image quality further with a pre-trained large language model (e.g., T5) capturing the text
semantics. To reduce resource consumption of diusion models in pixel space, Stable Diusion [
361
] rst compresses
the high-resolution images to a low-dimensional latent space, then trains the diusion model in the latent space. This
method is also known as Latent Diusion Models (LDM) [
361
]. Dierent from Stable Diusion [
361
] that learns the
latent space based on only images, DALL-E2 [
345
] applies diusion model to learn a prior as alignment between image
space and text space of CLIP. Other works also improve the model from multiple aspects, including introducing spatial
control [21,455] and reference images [37,390].
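For reference, generating an image with a latent diffusion model typically takes only a few lines, assuming the Hugging Face diffusers library and the publicly released Stable Diffusion weights are available; the checkpoint name and arguments reflect common usage and may differ across library versions.

```python
# Minimal text-to-image usage sketch (assumes the `diffusers` library and a GPU).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The prompt echoes the AlignDRAW example mentioned earlier in this section.
image = pipe("a stop sign flying in blue skies", num_inference_steps=50).images[0]
image.save("output.png")
```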
5.2.2 Talking face. From the perspective of the output, the task of talking face generation [544] produces a series of image frames, which technically constitute a video (see Figure 19). Different from general video generation (see Sec. 6.1), talking face generation requires a face image as an identity reference and edits it based on the speech input. In this sense, talking face generation is more closely related to image editing. Moreover, talking face generation converts a speech clip into corresponding face images, resembling speech recognition, which converts a speech clip into corresponding text. Since speech recognition is treated as a multimodal text generation task, this survey considers talking face generation a multimodal image generation task. Driven by deep learning models, speech-to-head video synthesis has attracted wide attention; existing methods can be divided into 2D-based and 3D-based methods.
With 2D-based methods, talking face video synthesis mainly relies on landmarks, semantic maps, or similar representations. The work in [67] uses landmarks as an intermediate layer between low-dimensional audio and high-dimensional video, together with two decoders that decouple speech and speaker identity so that the generated video is unaffected by the speaker identity; it is also the first work to use deep generative models to create talking faces. In addition, image-to-image translation [182] can also be used for lip synthesis, while the combination of separate audio-visual representations and neural networks can further optimize the synthesis [408, 546].
Another line of work builds a 3D model and controls the motion process through rendering technology [222, 418], with the drawback of a high construction cost. Later, many generative talking face models based on 3DMM parameters [76, 114, 200, 427] were established, using representations such as blendshapes [76], FLAME [242], and 3D meshes [356], with audio as the model input for content generation. At present, most methods are directly reconstructed from training videos. NeRF uses multi-layer perceptrons to learn implicit representations, which can store 3D spatial coordinates and appearance information and are used for high-resolution scenes [241, 289, 297]. In addition, a pipeline and an end-to-end framework for unrestricted talking face video synthesis have also been proposed [218, 331], taking arbitrary unidentified videos and arbitrary speech as input.
6 AIGC TASK: BEYOND TEXT AND IMAGE
6.1 Video
Compared with image generation, progress in video generation lags behind, largely because of the complexity of modeling higher-dimensional video data. Video generation involves not only generating pixels but also ensuring semantic coherence between different frames. Video generation works can be categorized into unguided and guided generation (e.g., guided by text, images, video, or action classes), with text-guided generation (see Figure 20) receiving the most attention due to its high influence.
Unguided video generation. Early works extending image generation from a single frame to multiple frames were limited to creating monotonous yet regular content like sea waves. The generated dynamic textures [98, 472] often have a spatially repetitive pattern with time-varying visualization. With the development of generative models, numerous works [2, 69, 308, 373, 438, 454, 518] have extended the exploration from naive dynamic textures to real video generation. Nonetheless, their success is limited to short videos of simple scenes, given the availability of only low-resolution datasets. More recent works [68, 160, 374, 428] improve the video quality further, among which [160] is regarded as a pioneering work on video diffusion models.
Fig. 19. Examples of talking face (image obtained from [52]).
Fig. 20. Examples of text-guided video generation (figure obtained from [398]).
Text-guided video generation. Compared to text-to-image models that can create almost photorealistic pictures, text-guided video generation is more challenging. Early works [140, 249, 263, 279, 293, 316] based on VAEs or GANs concentrate on creating video in simple settings, such as a bouncing digit or a walking human. Given the great success of the VQ-VAE model in text-guided image generation, some works [163, 478] extend it to text-guided video generation, resulting in more realistic video scenes. To achieve high-quality video, [160] first applies the diffusion model to text-guided video generation, refreshing the evaluation benchmarks. After that, Meta and Google propose Make-A-Video [398] and Imagen Video [158], respectively, both based on diffusion models. Specifically, Make-A-Video extends a diffusion-based text-guided image generation model to video generation, which can speed up generation and eliminates the need for paired text-video data during training. However, Make-A-Video requires a large-scale text-video dataset for fine-tuning, which consumes a significant amount of computational resources. The latest Tune-A-Video [480] proposes one-shot video generation, driven by text guidance and image inputs, where a single text-video pair is used to train an open-domain generator.
6.2 3D generation
The tremendous success of deep generative models on 2D images has prompted researchers to explore 3D data generation (see Figure ??), which is effectively a modeling of the real physical world. Different from the single format of 2D data, a 3D object can be represented by depth images, voxel grids [482], point clouds [333, 334], meshes [144], and neural fields [286], each of which has its advantages and disadvantages.
According to the type of input and guidance, 3D objects can be generated from text, images, and 3D data. Although multiple methods [115, 179, 265] have explored shape editing guided by semantic tags or language descriptions, 3D generation is still challenging due to the lack of 3D data and suitable architectures. Based on the diffusion model, DreamFusion [329] proposes to address these problems with a pre-trained text-to-2D model. Another branch of works reconstructs 3D objects from single-view images [34, 125, 246, 437, 463, 516] or multi-view images [64, 170, 460, 486], termed image-to-3D. A newer branch of multi-view 3D reconstruction uses Neural Radiance Fields (NeRF) [289] for an implicit representation of 3D information. The 3D-to-3D task includes completion from partial 3D data [461] and transformation [27], with 3D object retrieval as a representative transformation task.
6.3 Speech
Speech synthesis is an important research area in speech processing that aims to make machines generate natural and understandable speech from text. Traditional speech synthesis methods include articulatory [220, 383], formant [13, 380], and concatenative synthesis [296, 309], as well as statistical parametric speech synthesis (SPSS) [202, 295]. These methods have been widely studied and applied; e.g., formant synthesis is still used in the open-source NVDA (one of the leading free screen readers for Windows). However, the generated speech is distinguishable from the human voice, and artifacts in the synthesized speech reduce intelligibility.
Early works [105, 336, 520-522] consist of three modules: text analysis, an acoustic model, and a vocoder. WaveNet [445] was revolutionary for speech synthesis, as it can generate the raw waveform directly from linguistic features. To improve the quality of speech and the diversity of voices, generative models such as GANs [127] were introduced into speech synthesis. Compared with GANs, diffusion models do not require a discriminator, making training more stable and simple. Therefore, adopting diffusion models for speech synthesis has become a rising trend. One branch of works [58, 171, 223, 485] focuses on efficient speech synthesis, adopting different ways to reduce generation time by accelerating inference, such as combining the schedule and score networks during training or jointly training with a GAN. Another branch of study [53, 292, 365, 393] concentrates on end-to-end models, which directly generate the waveform from text without any intermediate representations. A fully end-to-end model not only simplifies training and inference but also reduces the demand for human annotations. Diffusion-based speech synthesis is not limited to the two branches mentioned above; it also covers, for example, speech enhancement and guided speech synthesis.
6.4 Graph
Graphs are ubiquitous in the world; they aid in visualizing and defining the relationships between objects in a wide range of domains, from social networks to chemical compounds. Graph generation, which creates new graphs from a learned distribution similar to that of existing graphs, has received a lot of attention.
Traditional graph generation works [12, 227, 470] create new graphs with specific features that match hand-crafted statistical properties of real graphs, which simplifies the process but fails to capture relational structure in complex scenarios. With the success of deep learning algorithms, researchers have begun to apply them to graph generation; unlike traditional methods, deep models can be trained directly on real data and automatically extract features. Among them, works [78, 252, 509] based on autoregressive models create graph structures sequentially in a step-wise fashion, which allows for greater scalability but fails to model permutation invariance and is computationally expensive. Meanwhile, one-shot models [257, 272] such as VAEs and normalizing flows are limited in how accurately they can model structural information, because they must ensure tractable likelihood computation. Although GAN-based graph generation [82, 187, 281] sidesteps likelihood-based optimization by using a discriminator, its training is unstable.
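The step-wise (autoregressive) strategy can be illustrated with the sketch below, which adds nodes one at a time and samples edges to previously added nodes from a learned conditional model; `edge_model` is a hypothetical placeholder, not the API of any specific method above.

```python
import torch

@torch.no_grad()
def sample_graph(edge_model, num_nodes):
    """Sketch of step-wise (autoregressive) graph generation.

    `edge_model(adj, i, j)` is assumed to return the probability of an edge
    between nodes i and j given the partially built adjacency matrix.
    """
    adj = torch.zeros(num_nodes, num_nodes)
    for i in range(1, num_nodes):                  # add node i to the partial graph
        for j in range(i):                         # decide edges to earlier nodes
            p = edge_model(adj, i, j)
            if torch.rand(()) < p:
                adj[i, j] = adj[j, i] = 1.0        # undirected edge
    return adj
```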
Recently, there has been surging interest in developing diffusion models for graph-structured data. EDP-GNN [305] is a pioneering work showing the capability of diffusion models for graph generation, with the goal of addressing non-invariant properties. After that, on the one hand, diffusion-based works [142, 169, 190, 269, 450] focus on realistic graph generation, producing graphs that are similar to a given set of graphs. On the other hand, [15, 391, 488, 519] concentrate on goal-directed graph generation, which generates graphs that optimize given objectives, as in molecular and material generation.
6.5 Others
There are also other interesting tasks generating content in different modalities, e.g., music generation [183] and
lip-reading [
109