Towards Computational Architecture of Liberty: A
Comprehensive Survey on Deep Learning for Generating
Virtual Architecture in the Metaverse
ANQI WANG, Emerging Interdisciplinary Areas, Hong Kong University of Science and Technology, Hong
Kong SAR, China and Computational Media and Arts, Hong Kong University of Science and Technology
(Guangzhou), Guangzhou, China
JIAHUA DONG, School of Architecture, The Chinese University of Hong Kong, Hong Kong SAR, China
JIACHUAN SHEN, The Bartlett School of Architecture, University College London, London, UK
LIK-HANG LEE, The Hong Kong Polytechnic University, Hong Kong SAR, China
PAN HUI, Computational Media and Arts, Hong Kong University of Science and Technology (Guangzhou),
Guangzhou, China and Emerging Interdisciplinary Areas, Hong Kong University of Science and Technology,
Hong Kong, China
3D shape generation techniques utilizing deep learning are attracting increasing attention from both computer vision and architectural design. This survey investigates and compares the latest approaches to 3D object generation with deep generative models (DGMs), including Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), 3D-aware images, and diffusion models. We discuss 187 articles (80.7% published between 2018 and 2022) to review the field of generative possibilities of architecture in virtual environments, limited to architectural form. We provide an overview of architectural research, virtual environments, and related technical approaches, followed by a review of recent trends in discrete voxel generation, 3D models generated from 2D images, and conditional parameters. We highlight under-explored issues in 3D generation and parameterized control that are worth further investigation. Moreover, we speculate that four research agendas, including data limitation, editability, evaluation metrics, and human-computer interaction, are important enablers of ubiquitous interaction with immersive systems in architecture for computer-aided design. Our work contributes to researchers' understanding of the current potential and future needs of deep learning in generating virtual architecture.
CCS Concepts: • Human-centered computing → Interaction design process and methods; Virtual reality; • Computing methodologies → Machine learning; • Applied computing → Architecture (buildings).
Additional Key Words and Phrases: Deep Learning, virtual environment, architectural design, computational
architecture, 3D shape generation, 3D-aware image synthesis, human-computer interaction, metaverse, AIGC
Authors’ addresses: Anqi Wang, Emerging Interdisciplinary Areas, Hong Kong University of Science and Technology, Hong
Kong SAR, China and Computational Media and Arts, Hong Kong University of Science and Technology (Guangzhou),
Guangzhou, China; Jiahua Dong, School of Architecture, The Chinese University of Hong Kong, Hong Kong SAR, China;
Jiachuan Shen, The Bartlett School of Architecture, University College London, London, UK; Lik-Hang Lee, The Hong Kong
Polytechnic University, Hong Kong SAR, China; Pan Hui, Computational Media and Arts, Hong Kong University of Science
and Technology (Guangzhou), Guangzhou, China and Emerging Interdisciplinary Areas, Hong Kong University of Science
and Technology, Hong Kong, China.
Unpublished working draft. Not for distribution.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2018 Association for Computing Machinery.
0004-5411/2018/8-ART111 $15.00
https://doi.org/XXXXXXX.XXXXXXX
2023-04-30 10:37. Page 1 of 1–35. J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2018.
ACM Reference Format:
Anqi Wang, Jiahua Dong, Jiachuan Shen, Lik-Hang Lee, and Pan Hui. 2018. Towards Computational Architecture of Liberty: A Comprehensive Survey on Deep Learning for Generating Virtual Architecture in the Metaverse. J. ACM 37, 4, Article 111 (August 2018), 35 pages. https://doi.org/XXXXXXX.XXXXXXX
1 INTRODUCTION
In the past decades, the study of architectural space has shifted its exploration from reinforced concrete to digital architecture, and then to information frameworks virtualized beyond the physical layer. The digital innovations and technological advances associated with splines, pixels, voxels, and bits have enabled architectural forms to be reconceptualized. Architecture is no longer treated as static, permanent objects but as part of a larger data network and the evolving communication between different kinds of architectural systems [32]. For example, the scenes of video games and augmented reality (AR) or virtual reality (VR) cooperate in such virtual environments for architecture. This virtual fabrication of digital space pushes the boundaries of what is being produced and designed. The purpose and subject of architecture have radically changed in this digitalization. It is free from the constraints of physical construction, socioeconomic factors, and environmental conditions, such as daylight, architectural materials or structure, and budget. Instead, the visually appealing experiences they bring enable virtual architectures to serve as intersections within this infinity beyond reality. It becomes a spatial medium full of infinite possibilities to carry society and culture.

The infinitely expanding spatial field of virtual worlds (VWs) faces many tasks that require efficient modeling. Creating various object models through generative techniques is a timely research topic [3]. Research on 3D model generation and 3D-aware image synthesis through deep learning (DL) has been booming in recent years. Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and the very recent diffusion models (DDPMs) are deep learning approaches. In contrast to other classic ML algorithms, they belong to unsupervised learning (USL), which does not rely on large sets of labeled data. DL has surpassed human perception regarding abstraction strategies through deep neural networks. For instance, AlphaGo can beat top human players in board games, and DALL-E, a drawing tool performing multi-modal text-to-image generation, learns human intention from natural language. DL has pushed the potential for output farther and farther beyond human imagination.
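To make the "unsupervised" point concrete, the sketch below (illustrative only, not drawn from any surveyed system) trains a minimal linear autoencoder purely on unlabeled points: the reconstruction objective uses the data itself as the target, so no labels are required.

```python
import numpy as np

# Unlabeled data: 256 points in 4-D with correlated coordinates.
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 4)) @ rng.standard_normal((4, 4))
W_enc = rng.standard_normal((4, 2)) * 0.1   # encoder: 4-D -> 2-D code
W_dec = rng.standard_normal((2, 4)) * 0.1   # decoder: 2-D -> 4-D

def loss(X, W_enc, W_dec):
    """Mean squared reconstruction error; the data is its own target."""
    R = X @ W_enc @ W_dec
    return float(np.mean((R - X) ** 2))

lr = 0.01
initial = loss(X, W_enc, W_dec)
for _ in range(500):
    Z = X @ W_enc                 # encode
    R = Z @ W_dec                 # decode
    G = 2 * (R - X) / X.size      # gradient of the loss w.r.t. R
    gWd = Z.T @ G                 # gradient for the decoder weights
    gWe = X.T @ (G @ W_dec.T)     # gradient for the encoder weights
    W_dec -= lr * gWd
    W_enc -= lr * gWe
print(loss(X, W_enc, W_dec) < initial)  # → True: error drops without labels
```

The same self-supervised principle, scaled up with deep nonlinear networks and probabilistic latent spaces, underlies the VAEs and GANs discussed in this survey.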
1.1 Preamble: 3D Virtual Architecture Generated by Deep Learning
The generative virtual architecture is a vast domain spreading over computer-aided design (CAD), 3D shape generation techniques, and human-computer interaction (HCI). On the other hand, 3D shape generation techniques by DL are a fundamental viewpoint in computer vision and computer graphics. Thus, we need to define these terms at the beginning of this survey.
Deep learning. Deep learning (DL), a subclass of machine learning (ML) and artificial intelligence (AI), has developed rapidly with a boost in data processing and computation. A new class of DL is deep generative models (DGMs), which combine generative models and deep neural networks. They rely on paradigms of unsupervised learning. Neural networks such as ANNs, CNNs, and RNNs, as signature deep learning architectures, have played essential roles in manipulating the relationship between input and output data. The definition of DL signifies that a system masters the capability of self-learning and experience enhancement [149]. DL has broad applications to all aspects of life, with plenty of notable examples: the first fully automatic self-driving car, Navlab5 (Fig. 1a); AlphaGo, a computer program that can beat top human Go players (Fig. 1b); Mirror
Fig. 1. Applications of deep learning impact our lives in all aspects. (a) Self-driving cars; (b) AlphaGo; (c) segmentation in city recognition with computer vision; (d) Mirror World NFT, which shows an AI dialogue character with personality that develops through learning; (e) OpenCV recognizing object types in the camera view; (f) Apple Watch paired with deep learning detecting atrial fibrillation with 97% accuracy; (g) ChatGPT developed by OpenAI; (h) the recommendation system in TikTok; (i) smart agriculture implemented by deep learning with drones; (j) DALL-E 2, a powerful painting tool empowered by machine learning; (k) D.O.U.G., a collaborative robotic arm that interacts with humans, learning human behaviors and gestures, created by artist Soug Wen; (l) digital human body generation by 3D reconstruction techniques; (m) an AI art movie created with GANs (Casey Reas); (n) BCI (brain-computer interface).
World NFT's intelligent character, which can learn and grow from human text conversations (Fig. 1d); ChatGPT, developed by OpenAI [130], a pioneering intelligent conversational model; one of the best recommendation systems in the world (Fig. 1h), which made TikTok stand out with 13.47 million DAUs; DALL-E 2, which has been described as the best AI painting tool to date (Fig. 1j); and BCIs (brain-computer interfaces) with infinite potential (Fig. 1n). Moreover, the intelligence revolution could not be ongoing without DL, for instance in smart agriculture (Fig. 1i) and smart transportation. Computer vision relies closely on DL, as in the notable OpenCV (Fig. 1e), segmentation with vision cognition (Fig. 1c), and 3D scanning techniques (Fig. 1l). Additionally, much contemporary digital art has been created through deep learning by inputting and processing data from human gestures and bio-signals (Fig. 1k). Figure 1m represents a cutting-edge example of experimental AI-art films created with GANs.
3D Shape Generation Technique. With an increasing surge of AI-Generated Content (AIGC), DGMs have the capability to perform 3D shape generation through various approaches. There are plenty of frameworks, such as GANs, VAEs, flow models, and so on. DGMs have the widest applications and the most prominent influence in the field of two-dimensional (2D) image processing, such as textures, style-transfer art, photorealistic faces, and text-to-image generation [79, 97, 141, 187]. With innovative techniques and boosted arithmetic power, DGMs for 3D shape generation have burgeoned in recent years. DGMs can achieve this leveraging effect in 3D object generation by building on their outcomes with 2D images. Soon, a GAN-based method named 3D-GAN [169] was applied to 3D shape generation in a probability space over voxel grids (see Fig. 4a). 3D shape generation also inspires downstream operations, from object classification and part segmentation to scene semantic parsing.
Computer-aided Design and Deep Learning-Assisted Form Generation. Computer-aided design (CAD) is an extensive research field regarding digital tool-assisted creation and optimization in the design phase [146]. Especially in the architecture field, design with the involvement of computational tools has spread over Building Information Modeling (BIM), structural performance analysis, robotics and digital fabrication, urban analytics, environmental performance, and so on [12]. Architectural design aided by DL is one of the typical classifications of the CAD field by
2 Mirror World's official website: https://link3.to/mirrorworld
3 A report on TikTok, source: https://www.statista.com/statistics/1090659/tiktok-dau-worldwide-android/
providing a wide range of options in design processes [161]. DL-assisted architecture generation has expanded from rule-based generative systems, such as cellular automata 4 and shape grammars 5, to neural network tools, which provide more flexibility and more controllable parameters in generation. Reviewing DL-aided form generation, deep neural networks in DGMs have proved efficient and powerful in architectural design. In this field, workforce and computational power need to coordinate with each other in studying form generation and future construction. Nevertheless, there are no such requirements in virtual environment generation. We found that more knowledge is needed about form generation in the transformation between physical and virtual spaces.
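As a concrete illustration of the rule-based generative systems mentioned above (an illustrative sketch, not taken from any cited paper), a minimal cellular automaton shows how purely local rules can grow and stabilize form; the grid size, rule, and seed pattern here are arbitrary choices.

```python
def ca_step(grid):
    """One step of a Conway-style cellular automaton on a toroidal 2D
    occupancy grid: a cell is born with 3 live neighbours and survives
    with 2 or 3."""
    h, w = len(grid), len(grid[0])
    nxt = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            n = sum(grid[(y + dy) % h][(x + dx) % w]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0))
            nxt[y][x] = 1 if (n == 3 or (grid[y][x] and n == 2)) else 0
    return nxt

# Seed a "blinker": three occupied cells in a row. It oscillates with
# period 2, so two steps reproduce the original configuration.
grid = [[0] * 5 for _ in range(5)]
for x in (1, 2, 3):
    grid[2][x] = 1
after_two = ca_step(ca_step(grid))
print(after_two == grid)  # → True
```

Form-generation variants replace the survival rule with design-driven ones (e.g., growth toward daylight or circulation), but the control surface is the same handful of local rules, which is exactly the rigidity that neural network tools relax.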
Generative 3D Virtual Architecture. Generating architectural spaces efficiently and practically from 3D representations is a popular and worthwhile research topic in both the architecture and computer science domains. For architecture, it is crucial to clarify the rules of digital space, which align with functionality, aesthetics, and satisfaction. As Roberto Bottazzi states, as opposed to transforming digital architecture, the urgency is how digital space can be architecturized [19]. This unveils the significance of virtual architecture. The increasing tendency to build virtual worlds (VWs) is associated with the reality of owning digitalized lives and products, which refers to the metaverse.
1.2 Towards an Interdisciplinary Area among Architecture, HCI, and AIGC
The research on virtual environments has gradually stood out at the intersection of human-computer interaction (HCI) and immersive techniques such as augmented reality (AR) and virtual reality (VR) [99]. Moreover, the demands in VR or AR environments are surging due to modeling productivity and efficiency. Consequently, these demands and developments have also raised the profile of 3D generative approaches. The feasibility of virtual architecture clearly benefits from CAD approaches, in terms of the modeling task load, and from HCI approaches. However, despite the popularity of exploring the possibilities of space in 3D object generation, research on architecture is still limited. Most studies have focused on two-dimensional (2D) image processing. ArchiGAN [24], studied by the MIT team, explores the potential of GANs trained on large numbers of building floor plans for automatic spatial layout and functional delineation, with further subsequent applications [25]. Most other architectural research with GANs, such as the fantastic-style images generated by [83], satisfies an imagination for designs that are either beyond the constraints of physical worlds or have not been effectively proposed and illustrated before. For example, Özel uses creative images for architecture, relying on artificial intelligence, to predict the future of architecture [133]. These studies consider the creativity of generating absences from an aesthetic perspective beyond reality. Abstract two-dimensional images are far from architecture, even in virtual worlds, due to the lack of methods to transfer those automatically generated images to a three-dimensional format. Additionally, the traditional method of constructing 2D floor plans for real space is unsuitable for virtual environments. In other words, there has been considerable divergence in the production approach between physical and virtual architecture. The former has to consider the external environment (e.g., daylight) and construction constraints, while the virtual one is more concerned with the sense of identity, definition of self, and aesthetics. Therefore, to close the gaps mentioned above, a novel discipline is urged to construct a niche domain encompassing architectural components, constraint requirements in VEs, and user needs. At the same time, deep learning could support generating computational architecture freely. It is worthwhile to mention that these gaps remain for most existing research.
4 A cellular automaton (CA) is a discrete model of computation studied in automata theory.
5 Shape grammars in computation are a specific class of production systems that generate geometric shapes.
Fig. 2. The survey's scope and profile of related articles. (a) This survey investigates this intersection area. (b) A profile of the number of works cited in this paper in different categories and years. Legend: architectural studies on DGMs; computer vision studies; works on rules in VWs.
To summarize the problems mentioned above: first, generating architecture with design purpose is rarely considered by 3D shape generation techniques, since generative architecture requires sophisticated consideration and innovative techniques, especially for non-tech-savvy architects. Second, architectural generation approaches rarely regard the "virtual" and lack the usage of 3D shape generation. As a result, the design dimensions for virtual architecture generated by 3D approaches have not been systematically considered. Therefore, our survey addresses how to leverage 3D shape generation techniques to produce 3D virtual spaces from a user-centered perspective. We note that this article defines the user-centered perspective as 'inclusive', considering the needs of non-tech-savvy architects who lack technical (computer science) backgrounds and lay users who intend to create virtual buildings in the Metaverse.
1.3 Methodology and Related Articles
This survey article presents findings of a systematic literature review on deep learning for 3D shape generation in computer vision and CAD for computational architecture, in terms of generating virtual architecture in recent years. Given the problem mentioned above, the intersection of 3D generation techniques and virtual architecture is still nearly blank. Therefore, we anchored three fields to conduct this survey, so that their key insights complement each other: 3D shape generation techniques, DL-assisted architectural design, and the design considerations in VWs in terms of HCI (see Fig. 2a).

We reviewed a sample of 187 articles and primarily focused on works published between 2018 and 2022 (five years, 80.7%) as follows: 2023 or later: 4 (8.7%), 2022: 33 (17.6%), 2021: 39 (20.9%), 2020: 36 (19.3%), 2019: 30 (16%), 2018: 33 (5.8%), before 2018: 32 (17.1%), from those three fields. We found the articles primarily through publication databases such as the ACM Digital Library, IEEE Xplore, ScienceDirect, Springer Link, and CumInCAD. We used the following keywords: Augmented Reality (AR), Virtual Reality (VR), deep generative models, 3D representation, 3D model or shape or geometry, object generation, 3D-aware image, shape synthesis, point cloud, voxel grid, mesh, implicit neural field, virtual architecture, virtual environment, deep learning design, generative design, Generative Adversarial Network (GAN), 3D GAN, VAE (Variational Autoencoder), diffusion model (DDPM), text to 3D, image to 3D, zero-shot, computational architecture, spatial objects, flexible spaces, virtual rules, design discipline, human-computer interaction (HCI), evaluation metric, human perception, emotion, simulation, participatory design, aesthetics, real-time interaction, and combinations of these keywords. Additionally, we include some of the latest or highly influential research
on Computer Vision (CV) published only on arXiv. We screened the titles and keywords to ensure we included only full papers and extended abstracts; short papers and abstracts were excluded from this scope. When the keywords and abstract did not reveal whether the key information or elements fell within our investigated scope, we read the whole publication to check whether it should be included. After the screening, we had 147 articles and 19 CV papers published on arXiv to review, i.e., 166 authoritative articles. Additionally, online resources were searched directly through the Google search engine; from these we include 19 articles and 2 relevant architectural projects from the perspective of architectural design, categorized by virtual world, computational architecture, architectural theory, and so on. Eventually, a total of 187 articles and 2 architectural projects are included in this survey.
Various other surveys further locate this scope, as follows. Category 1 (machine learning [92, 137] or deep learning [1, 5, 6, 63, 86, 92, 139, 148]): 3D shape generation [23, 132, 156], scene synthesis [172, 184], applications [35, 47], 3D representation [58], 3D reconstruction from 2D [176], and generative models [3, 4, 33, 34, 62, 73, 134, 167]. Category 2 (DL-assisted architectural design [123, 137]): infrastructure [161], intelligent construction [5, 12], life cycle [68], or other design [142]. Category 3 (architecture in a virtual environment or virtual worlds): design discipline in virtual reality (theories and applications) [15], HCI in the virtual architecture (human senses and emotions) [99], and metaverse or virtual worlds [16, 46, 98]. In contrast, this article reviews the approaches to 3D shape generation and the factors of virtualization in architecture in recent years, especially the last five years (2018-2022), regarding virtual rules, design principles, social parameters, and HCI methods for CAD design. We argue for combining these research areas and consider this an interdisciplinary problem. Finally, the research outlines the crucial challenges of HCI in virtual architectural model generation tasks. Our survey article uniquely considers the prominent features of the above categories and further paves a path towards the computational architecture of liberty, with the contributions below.
(1) We provide a comprehensive investigation of the inclusion of DL-assisted architectural design and deep generative models, dedicated to developing a critical lens for computational architecture in virtual environments.
(2) We highlight an opportunity to address the academic gap between the two existing areas of research, attempting to respond algorithmically to social factors.
(3) We propose research topics for the future of virtual architecture towards liberty, considering disciplines beyond reality such as humanism and spirituality.
1.4 Scope and Structure
Although the intersection of DL and architecture spans all aspects, we only investigate articles where DL serves the generative deep learning of 3D virtual architecture, especially 3D DGMs. We include research limited to 2D-style imaginary drawings only when it is coupled with innovative approaches to the 3D transition. We also excluded articles that only consider real-world problems, such as BIM, rather than implementation in a purely virtual environment. This scope reflects the automatic generation of timely design issues in virtual space.

The paper reviews the current problem space in this field, consisting of rules of the virtual world, social parameters, and a civilization of formal liberty, starting from Section 2. Section 3 covers innovative form generation in architecture under generation approaches, in terms of 3D form transposition and 3D solid form generation. Four topics are covered in this section: GANs fitted to specific training, VAEs for specific information extraction, 3D-aware image synthesis, and diffusion models based on conditional text (see Fig. 3). Subsequently, we revisit the field from the perspective of HCI, formulating research agendas in four grand challenges, ranging from
Fig. 3. The survey structure (Sections 2–4).
data limitation, editability, and evaluation metrics to HCI design, collecting user information, operation, and perception. We indicate that researchers can explore new possibilities for optimizing and inventing innovative methods for automatically generating virtual architecture with human- and social-group-centric considerations.
2 OVERVIEW OF GENERATED 3D VIRTUAL ARCHITECTURE
Since the last century, discussion of methods and rules for computer-aided design has never stopped. Initially, however, digitized models simply worked to simulate the physical environment or document the design process. The discussion around virtual architecture started when web-based social media or games were invented. After the emergence of the metaverse concept, enthusiasm for virtual architecture research has intensified, indicating that we have entered a new era of existence with a proliferation of different kinds of virtual environments (VEs). On the one hand, a VE is built on a liminal reproduction of reality, which has no legitimacy and produces no consequences [121]. On the other hand, the VE is built on potential interactions of humans as social attributes in virtual spaces. Lee et al. state that digital natives are essential for developing the ultimate form of the metaverse [98]. Those digital natives boost their impacts on all crafts as well as user-generated content (UGC) through social interaction with avatars. As virtual worlds evolve, social attributes are expanded, such as property rights, identity, roles, and group differentiation [150]. As Roberto Bottazzi argues, the roadmap for our modern society should be the architecturalization of unregulated digital spaces [19]. Increasingly, people are integrating virtual spaces into a coexistence of living spaces. In general, the research from the primary "virtual worlds" until the explosion of the metaverse is rather diverse. However, most studies have focused on the layer of virtual reality or interactive games, not specifically on the subject of space itself.
2.1 The Rules of Virtual Worlds
The discussion of virtual worlds, which are constructed and internally coherent, has never ceased. Three primary layers are stacked in the concept of virtual worlds, from semantics and the virtual environment to the architecture [122]. Consequently, these virtual worlds are everywhere, from politics to video games to token economies. Immersion, presence, and interactivity intertwine as
the three pillars of virtual worlds [117]. Gilbert identified five essential characteristics of virtual worlds [54]. They are embodied in every aspect, such as spatial perception, public or private activities, social experience, and emotional expression. Furthermore, many scholars have tried to define and classify the virtual world in terms of layers or development stages. The confused definition of VWs has been mitigated by the exuberance of the metaverse. Dionisio divided the virtual world into five developing stages, ranging from text and 2D graphical interfaces to UGC and then to a complete decentralized economic system [46]. In the latest research on the metaverse, Lee et al. state that there are three stages toward the co-existence of physical and virtual space: digital twins, digital natives, and surreality [98]. The digital twin is a reproduced version of the physical world, depending on the development of CAD for both industry and architecture. The surging numbers of digital natives boost their positive impacts on interaction with avatars and all crafts as well as user-generated content. Surreality is the ultimate ideal world that the metaverse aims for, supporting heterogeneous activities with real-time interoperability between the physical and virtual worlds.

From the technology perspective, software and hardware architecture defines spatial functionality, constraints, and social interaction. These architectures form the politics of VWs [100], while code forms the laws of the graphic VE. Every law invented by humans has intrinsic value with specific intention and elaborate design. Therefore, the design discipline consisting of codes and computing ought to satisfy complex parameters, including computing capability, cost-benefit ratio, and user preference. Although codes and programs provide the rules and laws in VWs, it is undeniable that unique expertise is required to handle their design and organization; the codability of computers alone is not enough. This composite capability, as a special case of visual, analogically integrated reasoning, is fully capable of being a key expertise. It can operate at multiple scales and in multiple contexts to map, analyze, and organize VWs, while being able to introduce new systems, rules, and forms into them [81].
2.1.1 Virtuality in the Architecture. Virtuality is a fundamental characteristic specific to architecture in a VE and encompasses three elements: immersion, presence, and interactivity. Stemming from the three pillars of the virtual world [117], virtual architecture shares the same interpretation but with more specific deployment. In other words, since a virtual building is a specific type of mediated matter with a 3D representation in the VW, we can regard these buildings as a subtype of the virtual environment. Various components and frameworks of the virtual world are assumed to apply within it. Therefore, the theory of the VW is equally applicable to virtual architecture.
The design discipline of virtual architecture has to regard these three pillars as essential. Some architectural design research follows these rules: VRoamer [31] reports an interactive VE that achieves immersion by releasing users' attention. Beyond research, architecture projects also tend to integrate immersive technology. Zaha Hadid Architects and JOURNEE have jointly developed a virtual NFT gallery, "NFTism", one of a handful of virtual buildings with interactivity. This gallery inherits Zaha's representative fluidic form, supporting MMO (massively multiplayer online) technology and integrating audio-video interaction6.
2.2 Design Discipline
2.2.1 The Absence of Reality. To better clarify the design principles of virtual architecture, we compare them to real-world architecture. First, real-world factors are absent when building a virtual architecture, encompassing design considerations, construction structure, and economic cost. Precisely, a kind of emerging factor, running on immersive technologies and fitting the virtual logic,
6A report by ArchDaily: https://www.archdaily.com/972886/zaha-hadid-architects-presents-virtual-gallery-exploring-architecture-nfts-and-the-metaverse
A Survey on Deep Learning for Generating Virtual Architecture in the Metaverse 111:9
Fig. 4. Social factors as parameters in different theories. (a) Algorithmic Social Sciences Research Unit (ASSRU) [10]. Source: http://www.assru.org/index.html (b) Parametric Semiology: Semio-field, differentiation of public vs. private as a parametric range. Source: https://www.patrikschumacher.com/Texts/Design%20of%20Information%20Rich%20Environments.html
replaces the original position occupied by the real ones. The following elaborates three exact aspects.
First, from environmental factors to social factors: environmental factors such as the direction of wind and light affect the spatial layout of real architecture, but they have no effect on a virtual building. They are replaced in the virtual world by a more neoliberal virtual logic that emphasizes social activities, since humans can gather socially unrestricted by time and space.
Second, from building structure to unrestricted form: the structure of a virtual building tends toward a more accessible and open form. Many architectural structures that existing technology could not implement have been consecrated as paper architecture7 with cutting-edge conceptions. For example, Zaha's early works were not structurally possible with the technology and construction of the time. There are plenty of schools of thought in this regard, such as bionic architecture8 and responsive architecture9.
Third, from construction costs to the costs of scene and computation: real construction costs, such as material and labor, are transferred to the costs of modeling, lighting, and rendering, while the complex data computation for tasks such as rendering consumes computer memory. Building equally complex and fantastical scenes in reality would be exponentially more expensive; in the virtual world they are achieved through special effects and modeling and rendering techniques that are impossible to realize physically.
Apart from the above, virtual buildings and real buildings are built with a transforming logic, from construction to unrestrained form. Real-world architecture needs to start from the spatial planning and functional layout of the floor plan in order to deduce and complete the architectural design. In a virtual environment, the alternative is to start directly from the functional layout in 3D space, mostly modeling through a game engine or 3D modeling software. This is not only the production method of virtual assets; even some advanced real buildings are starting to be designed this way because of its efficiency and the higher accuracy of spatial perception.
7Visionary architecture that could not be built in reality, existing only as drawings, collages, or models.
8Bionic architecture is usually computationally adapted to the structure or form of organic matter in nature. Design considerations for biomimetic architecture include the physiological, behavioral, and structural adaptations of living organisms.
9Responsive architecture refers to the ability of a building to change its form to constantly reflect the conditions of its surroundings. It reflects the idea of interactive architecture.
2.2.2 Social Factors As Parameters. It is important to highlight that the discipline of virtual spaces cannot abandon social impacts. The built environment is a vast, navigable, and information-rich communication interface, especially in the virtual world. It provides potential social participants with information about the communicative interactions expected within its scope [151]. Although virtuality has unique attributes beyond reality, people are active in a virtual environment with social abilities. In other words, virtual technology supports social activities and goals in an immersive environment. Additionally, from a psychological perspective, familiarity with a realistic scene boosts presence and self-awareness, due to the preference conferred by the exposure rate [118].
Driven by the information society and the virtual world, socialization is an increasingly complex medium for communication, conveying a rich diversity of social systems and sophisticated information across multiple scenarios. For example, Lam et al. have developed a context-aware, contextually interactive AR urban interface enabling users to locate websites intuitively with minimal modifications [95]. Architecture signifies a spatial place containing activities, where the study of the semiotics of spatial forms has always revolved around simulating or restoring social scenarios, including public, semi-public, and private spaces (see Fig. 4b). Space conveys an invitation to participate in framing social situations [151]. For example, many historical studies in urban design discuss the human perception of space. Jacobs [74] introduced walkable streets as a concept in the forming of neighborhoods, considering visual qualities, connectivity of circulation, and other indicators. Following the tendency of human-centric design, many researchers have explored the making of desirable streets and places at different scales. Appleton [7] introduced prospect-refuge theory to address humans' sense of safety in placemaking, which significantly influences socializing. Hall [60] introduced proxemic zones to represent different types of social distance. These theories of the human sense of space are still widely used in today's design discipline. All of this attests to the significance of social parameters in the architectural discipline.
2.2.3 The Goal of Construction. The design principles of virtual architecture serve the purpose of constructing buildings in VEs. With the boom in virtual technology and the rise of social platforms for 3D virtual worlds, the production demand for infinite and sprawling virtual environments has surged rapidly. The main task is building rapid, large-scale architectural environments. These construction tasks are mostly done collaboratively with 3D modeling software and game engines such as Unity. 3D building models are 3D spatial representations of artificial spatial elements [174]. Their most relevant quality criteria are completeness, the spatial accuracy of location, and the level of detail [85]. In addition, realistic simulation regarding scale and size, including granularity, is also important. All of these are significant for the user experience. A non-uniform approach causes various problems, such as inconsistent buildings produced by mixed manual and automatic operation [85], as well as expensive human-resource costs.
One reliable solution is to handle the many construction tasks in a virtual environment with efficient, automatic methods. A computation-based solution is relatively consistent, since the same automation frameworks are applied to all spatial objects across various scales, including urban and architectural. MineDojo uses autonomous agents that utilize large pre-trained video-language models to generate 3D scenes of VWs [48]. A recently released solution from Tencent for the automatic generation of 3D virtual scenes contains three modules, ranging from city layout generation to building exterior generation and interior mapping generation [94]10. It is a new paradigm that combines design perspectives using multiple CV techniques. Similarly, there are several studies on computational generation in architecture.
10Source: https://gdcvault.com/play/1028921/Recorded-AI-Enhanced-Procedural-City
Fig. 5. The frameworks of GAN, VAE and Diffusion Models.
Most of them are generated by extracting the logic of urban planning [72, 165], i.e., designing 3D layouts and functional divisions from urban plan layouts. Another approach [165] integrates three methods with different specific objectives; it generates building massing configurations by autonomously inferring the composition rules of existing urban areas.
2.2.4 The Problem of the Collective. However, there is still a mismatch between this new paradigm of mechanisms running on virtuality and the purposes of the emerging virtual environment: collective, presence, and unencumbered [121]. The collective pertains to the notion that virtual spaces are communal environments wherein individuals from diverse backgrounds and cultures can converge and engage with one another. The nature of the collective refers to the highly multi-social experience. We only proliferate products that manifest our unique identities and personal needs [121]. That conflicts with the collective and with products for everyday use, especially space or architecture. We live in a world with a seamless fusion of reality and the virtual: exquisite information and goods powered by recommendation systems on social media, creation or living spaces beyond reality, virtual economic mechanisms, ownership, identity, and so on. All of these exhibit the precision of individual values, and the collective and the congregate have apparently become blind spots. In other words, virtual technologies should support collective social activities and goals cued by individual experiences in immersive environments. Many theories of virtual worlds emphasize this point. Activity Theory argues that virtual worlds should be designed to support users in achieving their goals [80]. Bartle's Four Keys to Virtual World Design states that providing players with a sense of purpose is one of the key metrics for designing virtual worlds [15]. Purpose enhances user engagement by providing a sense of progress and accomplishment, thereby creating a sense of immersion.
2.3 Deep Generative Models
We briey overview the progression of deep generative models for 3D representation, including 3D
shape generation and 3D aware image generation.
2.3.1 3D Shape Generation. is contributed by traditional deep generative models, in addition to
the well-known Generative Adversarial Networks (GANs) and variational autoencoders (VAEs),
normalizing ows (N-Flows), very recent diusion probabilistic models (DDPMs) as well as energy-
based models (EBMs), which learn by maximizing from the similarity of the given data. These deep
generative models generate a tangible 3D object that is ready for rendering. It conveys a latent
variable to a high-quality image. Although every model has its own benets and great progress in
recent years, the domain in architecture relies on the GAN mostly, while VAE and the very latest
diusion models are designed for a few research. Considering the relevance, we thus introduce the
GAN, VAE and diusion models in detail rather than an exhaustive list of models included in other
CV survey articles.
Fig. 6. The 3D representations for (a) Voxel grids, (b) Meshes, (c) Point cloud, (d) Neural fields.
GANs. GANs are generative models trained from latent noise in an adversarial fashion. A GAN constructs two neural networks: one is a generator, and the other is a discriminator. It trains on a large database by means of a zero-sum game between these two networks, in which the generator learns to produce samples the discriminator cannot distinguish from real data.
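The adversarial objective can be sketched numerically. The toy generator and discriminator below (single linear maps with tanh/sigmoid, with dimensions invented for illustration) only show how the zero-sum loss is evaluated, not a trained architecture generator:

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, W):
    # Map latent noise z to a fake sample (here a flat feature vector).
    return np.tanh(z @ W)

def discriminator(x, V):
    # Score a sample: sigmoid probability that x came from the real data.
    return 1.0 / (1.0 + np.exp(-(x @ V)))

# Toy dimensions: 4-D latent space, 8-D sample space.
W = rng.normal(size=(4, 8))
V = rng.normal(size=(8,))

z = rng.normal(size=(16, 4))               # a batch of latent noise
x_real = rng.normal(loc=1.0, size=(16, 8)) # stand-in "real" data
x_fake = generator(z, W)

# Zero-sum objective: the discriminator maximizes it, the generator
# minimizes it; training alternates gradient steps on V and W.
d_loss = -np.mean(np.log(discriminator(x_real, V) + 1e-8)
                  + np.log(1.0 - discriminator(x_fake, V) + 1e-8))
```

In a full training loop, both networks would be deep and updated by backpropagation; this sketch only fixes the shapes and the objective.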
Variational Autoencoders. Variational autoencoders are probabilistic generative models built from neural networks. Along with inputs and outputs, they comprise an encoder and a decoder. The latent space is the learned, simplified representation of the data features that facilitates model training for a specific purpose. To guarantee that the latent space of a variational autoencoder has acceptable qualities and can be used to create fresh data, the distribution of its encodings is regularized during training [89]. The name "variational" originates from the tight connection between this regularization and the variational inference technique used in statistical analysis.
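The regularized latent space can be made concrete with a minimal numerical sketch of the reparameterization trick and the KL regularizer (toy linear encoder, all dimensions invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W_mu, W_logvar):
    # The encoder maps an input to the parameters of a Gaussian in latent space.
    return x @ W_mu, x @ W_logvar

def reparameterize(mu, logvar, rng):
    # Sample z = mu + sigma * eps so gradients can flow through mu and logvar.
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_divergence(mu, logvar):
    # Regularizer pulling the encoding distribution toward the standard normal prior.
    return -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar), axis=1)

x = rng.normal(size=(8, 16))            # a batch of flattened shape features
W_mu = rng.normal(size=(16, 4)) * 0.1
W_logvar = rng.normal(size=(16, 4)) * 0.1

mu, logvar = encode(x, W_mu, W_logvar)
z = reparameterize(mu, logvar, rng)     # latent codes fed to the decoder
kl = kl_divergence(mu, logvar)          # added to the reconstruction loss
```

The total training loss would be reconstruction error plus this KL term; sampling fresh data then only requires decoding z drawn from the prior.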
Diusion Models. Through modeling the dispersion of data points in latent space, we discover the
underlying structure of a dataset of images or volumetric, e.g., Denoising Diusion Probabilistic
Models [
65
]. This entails teaching a neural network to remove the blurring eect of Gaussian noise
on an image. It has the prominent advantage of generating sharp and detailed features.
2.3.2 3D-Aware Image Synthesis. This approach extracts latent vectors from the latent space and decodes them into a target representation using a GAN. Generally, the generation pipeline produces an image with 3D awareness as its result, and it also starts from an image as its generative source.
2.4 3D Representations
These two types of 3D generation develop diverse representations of 3D scenes in computer vision and computer graphics. The representations used in 3D shape generation generally include explicit ones, such as voxel grids, point clouds, and meshes, and implicit neural fields. 3D-aware images use depth or normal maps, voxel grids, neural fields, and hybrid representations. Their integration with architecture generation also differs: for example, a point cloud is often chosen when 3D data serves as the input source for training the generative model. These 3D representations are articulated as a classification in existing survey research; brief descriptions follow.
Architectural design prefers explicit representations because of their controllability, familiarity, visualization, and availability for modification in 3D modeling software. Explicit geometric representations are easier to visualize and interpret, as they directly represent 3D space. Designers can precisely position and adjust each point or voxel, allowing more accurate control over the shape and form of the generated geometry. Nevertheless, implicit representations (neural fields) hold huge possibilities for architectural research, given their more flexible, continuous, and efficient representations of geometry.
Voxel grids. A voxel grid is a three-dimensional grid of values organized into rows, columns, and layers; each intersection is a voxel, i.e., a miniature 3D cube [40].
Point clouds. A point cloud [144] is a discrete collection of data points in space, which can represent a three-dimensional form or item through Cartesian coordinates (X, Y, Z) assigned to each point location.
Meshes. A 3D mesh is the polygonal framework upon which a 3D object is built [126]. Reference points along the X, Y, and Z axes describe the height, breadth, and depth of a mesh's constituent forms. Note that creating a photorealistic 3D model often requires many polygons.
Neural elds. It creates images by using traditional volume rendering methods to query 5D
coordinates along camera rays and projects the resulting colours and densities onto a 2D plane.
Despite its use of depth data, the scene geometry is rendered in exquisite detail, complete with
intricate occlusions [112].
Hybrid representation. This refers to a hybrid pipeline of 3D representation for pre-training in a 3D feature space embedded in both the virtual and actual worlds. The hybrid pipeline can include multitudinous data sources and image-frame features [155], depending on the generation purposes of the 3D volume.
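The representations above can be contrasted as concrete data layouts. The arrays below are illustrative stand-ins (the "neural field" is a hand-written sphere-occupancy function in place of a trained MLP):

```python
import numpy as np

# Voxel grid: a dense 3D occupancy array.
voxels = np.zeros((16, 16, 16), dtype=np.uint8)
voxels[4:12, 4:12, 0:8] = 1                 # a simple box-shaped massing

# Point cloud: an unordered (N, 3) array of XYZ samples.
points = np.argwhere(voxels == 1) + 0.5     # occupied-voxel centres as points

# Mesh: vertices plus triangular faces indexing into them.
vertices = np.array([[0, 0, 0], [1, 0, 0],
                     [0, 1, 0], [0, 0, 1]], dtype=float)
faces = np.array([[0, 1, 2], [0, 1, 3],
                  [0, 2, 3], [1, 2, 3]])    # a tetrahedron

# Neural field: a function from continuous coordinates to occupancy/density.
def field(xyz):
    # Stand-in for a trained network: occupancy of the unit sphere.
    return (np.linalg.norm(xyz, axis=-1) < 1.0).astype(float)

query = field(np.array([[0.2, 0.2, 0.2], [2.0, 0.0, 0.0]]))  # inside, outside
```

The contrast makes the trade-off visible: explicit arrays can be edited element by element, while the field is queried at arbitrary continuous coordinates.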
2.5 The Design Factors in DL-Aided Architecture
In the previous subsections, we summarized the design principles for virtual architecture: first, adapting to the essential characteristics of virtuality as a guideline; using computational generation massively and efficiently as a method; and emphasizing social factors as parameters. On this foundation, we identified the mismatch between architectural collectives and the logic of private production. In this section, we explain how to design a virtual building within the above framework using a specific automated algorithmic framework.
2.5.1 Interpretability and Input Datasets. Relying on the interpretability of the input data is crucial as a first step in generating virtual buildings aided by ML algorithms. Generally, interpretability requires a valid illustration of the input dataset itself in the field of architecture [22, 91]. Generated results that meet this goal can support participants across a variety of design purposes. Bridging the gap between data and purpose requires the massive human and computational exertion that drives interpretability toward design goals. For example, Başak Çakmak explored extended design cognition with GANs and an encoder-decoder [22]. This methodology takes as input partitioned 3D point clouds captured by lidar, segmented according to component type. The research extends those models both manually and automatically to encourage the DL framework to learn spatial organization.
The approach of matching datasets with required parameters associated with spatial design goals has been widely used in architectural design solutions. Such applications of DL frameworks are often capable of producing solutions with explicit design goals. For instance, Adaptive Acoustics implements a methodology with a CGN to generate a computational 3D concert hall. The designers trained on two datasets, meshes of concert-hall interiors together with spatial and acoustic parameters, to pursue an architecture fitting acoustic requirements. The latter were manually refined into quantifiable parameters, including seats, volume, reverberation time, acoustical absorption area, absorption coefficient, and so on, resulting in AI-generated concert-hall forms with acoustic features.
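As a hedged illustration of this dataset-plus-parameters pairing (file names and values below are hypothetical, not taken from the Adaptive Acoustics project), a single training record might look like:

```python
# Hypothetical training record pairing a concert-hall interior mesh with
# quantifiable acoustic parameters used as conditioning inputs.
sample = {
    "mesh_path": "halls/hall_001.obj",   # geometry dataset entry
    "parameters": {                       # refined acoustic dataset entry
        "seats": 1800,
        "volume_m3": 21000.0,
        "reverberation_time_s": 2.1,
        "absorption_area_m2": 3400.0,
        "absorption_coefficient": 0.35,
    },
}

# A conditional generator would consume this parameter vector alongside noise.
condition = [float(v) for v in sample["parameters"].values()]
```

The design choice here is that every acoustic quantity is reduced to a number, which is what makes the mapping from data to design goal interpretable.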
2.5.2 The Algorithmic Form. The algorithmic form is “the relationship between computation and
information about computationally generated objects (such as strings or any other data structures)"
Fig. 7. Some architecture projects with the liberty of form. (a) George Guida, 2022. Multimodal Architecture: Applications of Language in a Machine Learning Aided Design Process. (b) Joris Putteneers, 2016. Synesthesia. (c) The generating process in Synesthesia. (d) A data-driven architectural project named E-motion: a digital interface allows users to capture real-time data to interact with, for a rethinking of co-living among various species. (e) Viviane Toraci Fiorella, Taza Celilia, Prandini Alvaro Campo, "ISOS", in the "Volumetric Cinema" workshop by Current.CAM, 2022. (f) Tane Moleta and Mizuho Nishioka, the co-constructive project "Populating Virtual Worlds Together", 2021 [115]. (g) [14] generates a 3D volumetric architecture for virtual environments by utilising BCI to capture affective-driven dynamic noise. (h) Current.CAM, VR gallery, 2021.
[27]. A growing number of social-algorithm proposals promise that neural networks and machine learning algorithms are research areas that can take social factors into account. For example, with different data input to housing-generation algorithms, various master plans can be generated with varying perceptions of privacy or construction-price choices [91].
Additionally, there is a growing emphasis on social parameters, ranging from the data-driven methods of the algorithmic social sciences to agent-based parametric semiology in architectural form [10, 151]. In this regard, the social goal stands between computationally generated forms and architectural designs.
2.5.3 The Liberty of Form. The virtual architectural form is more flexible than ever before, and the boundaries of its definition are more indistinct and inclusive (see Fig. 7). The generative logic of forms has undergone a transformation: the geometry of space has been expanded to the intelligence of space. The intelligence of space represents a multisensory approach in which we are free to generate form with embedded assistance. Joris Putteneers' project creates a surreal and complex architectural construction by simulating particle motions in Houdini11 (Fig. 7b and 7c). This is a figurative abstraction of the algorithm in 3D space. The form of virtual architecture can also be a visualization of data in 3D space. For example, an architectural project named E-motion designs an interface for data visualization driven by the redistribution and simulation of animal and human movement
11Source: https://putteneersjoris.xyz/projects/synesthesia/synesthesia.html
habits, thus linking human and non-human intelligence12 (Fig. 7d). Meanwhile, George Guida visualizes the influence of intelligent algorithms and multimodality on architecture in another project13 (Fig. 7a). In the Fourth Virtual Dimension, the authors propose a redefinition of the dimensionality of thermoception in VR to understand and engage with the spatial and directional aspects of virtual scenes [154] (Fig. 7f). The form of virtual architecture can even be a kind of co-construction. For instance, a project named Populating Virtual Worlds Together encourages artists with no experience in 3D modeling to create using a participatory design approach [115]. It leads to an autonomous virtual world consisting of cubes, corresponding columns of varying heights, and forests.
Second, in the generation of virtual architecture, VWs generally take into account more aesthetic, cultural, and human-centered intentions [14] (Fig. 7g). The definition of "good" architecture has practically been the core of architectural discourse [17]. Discussions around digital architecture often address this question by escaping into the realm of taste or artistic judgment. Aesthetics in digital architecture is criticized for using the digital and data homogeneously as a supportive tool and solution. In contrast, aesthetics in virtual architecture can justify the grand plan. For instance, Current.CAM's VR exhibition is formed by continuous partitioned spaces in purely fluid blue, reinforcing the digital interface's shaping of the human senses (Fig. 7h). In a workshop, they organised interaction between virtual avatars and fantastically dramatic environments to explore human perception of space (Fig. 7e). This transcendent novelty of virtual architecture encourages the user's quest for novel audiovisual sensations.
3 GENERATED 3D ARCHITECTURE: A PARADIGM SHIFT
Before 3D algorithms could automatically store and process 3D data, 3D generation methods were mostly developed based on 2D images. The remarkable growth of 3D generation in recent years has revealed the tremendous power of this field. Compared to 2D image generation, 3D generation is a daunting task with respect to 3D datasets, computational consumption, feature learning, and probability distributions in 3D space. We investigate virtual architecture generation based on various DGMs, both in CV methods (Table ??, Table ??) and in architecture (Table ??). In the first part, we introduce research on architectural designs concentrating on 2D deep generative models aiming for 3D transposition, especially GANs. Then, we divide deep learning generation approaches for 3D representations into four categories based on DGMs (see Fig. 8):
(1) 3D form generation from probabilistic spaces or 2D image sets with GANs.
(2) 3D information extraction from latent space with VAEs.
(3) Recent advances in 3D-aware image synthesis and the possibilities of incorporating them into architecture.
(4) The latest research on diffusion models based on conditional text.
3.1 3D Form Transposition with Constrained Approaches
In the past few years, 2D image generation by deep generative models has developed rapidly. Most DL-assisted architecture with deep generative models relies on handling 2D drawings [77, 114, 158], such as composition by overlapping sections or plans. This has led to the opinion that deep learning in DL-assisted architectural design simply equals pix2pix. Significant progress in this methodology lies in post-processing the generated images into target 3D models. Those
12Fei Chen, Mochen Jiang, Haojun Cui, and Yuankai Wang, E-motion, 2020. Source: https://bproautumn2020.bartlettarchucl.com/rc18/e-motion
13Source: https://www.gsd.harvard.edu/project/2022-digital-design-prize-george-guidas-multimodal-architecture-applications-of-language-in-a-machine-learning-aided-design-process/
Fig. 8. A systematic taxonomy for a review of generation approaches on virtual architecture design with DGMs.
Fig. 9. Examples of generated objects in the computer vision field. (a) 3D GAN [169], (b) PointFlow [173], (c) HoloGAN [125], (d) StyleSDF [131], (e) Point-E [127], (f) DreamFusion [138].
Fig. 10. The architectural projects through 3D form transposition with 2D deep generative models. (a) A pipeline in the described work [182]. (b) A process signifying a pixel filter from 2D to a 3D lattice in [143].
methods intuitively rely on post-processing through heuristic algorithms or human labor rather than scientific methods, aiming for critical ideas and innovative concepts.
3.1.1 3D Form Transposition. 3D transposition indicates a methodology that commences with segmenting a 3D model into discrete images, such as sections, plans, and projections from multiple viewpoints. It transforms the resulting abstractions into 3D representations using tedious computational methods or intuitive manual manipulation (see Fig. 10a and Fig. 10b). As an illustration, Zhang and Blasetti employed section transformation between two models to manipulate form from 2D to 3D [182]. Inherited from 2D design thinking, 3D form transposition experiments are likewise mainly conducted on 2D-to-3D composition using image-to-image translation networks such as style transfer [106, 133, 143, 182], StyleGAN [38, 180, 183], and pix2pixGAN [42, 175]. They act as 2D-based form-finding tools that support the decision-making process for
A Survey on Deep Learning for Generating Virtual Architecture in the Metaverse 111:17
Method Names | Publication & Year | 3D Representations | Models
3D GAN [169] | NIPS 2016 | Voxel grid | GAN
Text2Shape [29] | ACCV 2018 | Voxel grid | GAN
PLATONICGAN [64] | CVPR 2019 | Voxel grid | GAN
IG GAN [109] | arXiv 2020 | Voxel grid | GAN
Achlioptas et al. [2] | ICML 2018 | Point cloud | GAN
Shu et al. [157] | CVPR 2019 | Point cloud | GAN
Get3D [50] | NeurIPS 2022 | Mesh | GAN
IM-NET [30] | CVPR 2019 | Neural field | GAN
Kleineberg et al. [90] | arXiv 2020 | Neural field | GAN
Brock et al. [20] | arXiv 2016 | Voxel grid | VAE
AutoSDF [113] | CVPR 2022 | Voxel grid | VAE
SAGNet [171] | TOG 2019 | Voxel grid | VAE
Li et al. [101] | AAAI 2020 | Voxel grid | VAE
PQ-NET [170] | CVPR 2020 | Voxel grid | VAE
AdversarialAE [177] | CVPR 2020 | Voxel grid | VAE
Multi-Chart [18] | TOG 2018 | Mesh | VAE
SDM-NET [51] | TOG 2019 | Mesh | VAE
TM-NET [52] | TOG 2021 | Mesh | VAE
PolyGen [120] | ICML 2020 | Mesh | VAE
PointFlow [173] | CVPR 2019 | Point cloud | Normalizing flow model
CLIP-Forge [145] | CVPR 2022 | Voxel grid | Normalizing flow model
PVD [186] | CVPR 2021 | Hybrid: point-voxel | Diffusion model
Magic3D [104] | arXiv 2022 | Neural field / Mesh | Diffusion model
LION [179] | arXiv 2022 | Mesh | Diffusion model
Point-E [127] | arXiv 2022 | Neural field / Point cloud | Diffusion model
Table 1. An overview of 3D generative approaches to 3D shape generation. Each method allows generating editable models with explicit representations. Models indicates the DGM type, including GAN, VAE, normalizing flow model, and diffusion model.
designers in transforming 3D models into different formats. The representation of 2D images was developed further, from pixels or voxels to lateral thinking: data can be compressed into a high-dimensional latent space with enhanced connections. El Asmar and Sareen employ vector arithmetic and interpolation to navigate the latent space and generate various images as options for 3D voxelization [9]. Bank et al. developed an interactive tool that manipulates data in latent spaces, where the generated stylized images are represented as point clouds and can be assembled into spectral entities at adjustable resolution [13]. Similarly, a continuous sequence of images can be generated in the latent space by feature interpolation. The project 'Generali Center' developed by Del Campo et al. also utilized StyleGAN to present latent walks [39]. Using pixel projection to convert pixel values into a 3D model is a significant advancement in their research.
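The pixel-projection idea can be reduced to a small sketch: stack sliced 2D images along a new axis and threshold pixel intensities into a voxel occupancy grid. This is a minimal illustration of the general technique, not code from any surveyed project; the threshold value and toy slice images are assumptions.

```python
import numpy as np

def slices_to_voxels(slices, threshold=0.5):
    """Stack 2D grayscale slices (values in [0, 1]) along a new depth axis
    and threshold pixel intensities into a binary occupancy grid."""
    volume = np.stack(slices, axis=0)          # shape: (depth, height, width)
    return (volume > threshold).astype(np.uint8)

# Two toy 4x4 "section" images: bright pixels become occupied voxels.
s0 = np.zeros((4, 4)); s0[1:3, 1:3] = 0.9
s1 = np.zeros((4, 4)); s1[0:2, 0:2] = 0.8
voxels = slices_to_voxels([s0, s1])
print(voxels.shape)        # (2, 4, 4)
print(int(voxels.sum()))   # 8 occupied voxels
```

In practice the slices would be GAN-generated sections or plans, and the resulting grid would be meshed or refined by hand, as in the projects above.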
Huang et al. [69] employ the latent space to encode 3D information from GAN-generated images. The generated image is technically a set of points mapped from the latent space to a 2D graph, which then undergoes an interpolation across a sequence of perspectives. The perspectival GAN [88] extends this research with latent-space rotation to learn 3D information from 2D images.
The 2D image-to-image translation algorithms utilized in 3D form generation are evidence of computational freedom in innovative 3D generation, since the future direction of virtual spaces lies in complex variations. This generation method preserves high resolution in both input and output, as 2D images are lightweight and simple to process. Furthermore, training algorithms for 2D GAN networks are well developed, providing a wide range of possibilities for human interaction with algorithms, as well as various adjustable output options through contouring 2D patterns into 3D forms.
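The latent-space operations recurring in this section (latent walks, vector arithmetic, feature interpolation) share one primitive: linear interpolation between latent codes. A minimal sketch, with illustrative 128-dimensional codes standing in for the latent vectors of any of the GANs above:

```python
import numpy as np

def lerp(z_a, z_b, steps):
    """Linearly interpolate between two latent codes, yielding a
    continuous sequence of intermediate vectors (a 'latent walk')."""
    return [(1.0 - t) * z_a + t * z_b for t in np.linspace(0.0, 1.0, steps)]

rng = np.random.default_rng(0)
z_a, z_b = rng.normal(size=128), rng.normal(size=128)  # two sampled codes

walk = lerp(z_a, z_b, steps=5)       # each code would be decoded into an image
z_mid = z_a + 0.5 * (z_b - z_a)      # vector arithmetic: the halfway code
print(np.allclose(walk[2], z_mid))   # True: the walk passes through it
```

Decoding each code in `walk` through a trained generator yields the continuous image sequences that the projects above assemble into 3D forms.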
3.2 3D Solid Form Generation with GANs
3.2.1 3D Shape Generation with GANs. GANs have been developed as a controlled 3D generation method from image data that can generate different explicit representations, including point clouds [21] and voxel grids [44, 53, 57, 96, 102, 105, 108, 111, 129], as well as implicit neural functions, such
(a) Results for voxel grids from [165].
(b) 3D-GAN-Housing utilizes a 3D GAN to generate housing with 3D assets as the input dataset.
(c) The overview methodology of [22].
Fig. 11. Three examples of applying GANs to architectural design in 3D solid form generation.
as occupancy eld and signed distance function (SDF). Wu et al. adopted the architecture of a
generative adversarial network to generate the 3D voxel grids relying on capturing the probability
distribution of 3D shapes [
169
] (Fig. 9a). Many approaches already achieve the outstanding outcome
in a more ne-grained shape [
101
,
170
,
171
]. However, the general disadvantage of this approach
for voxel grids is that ne-grained voxels cannot be accomplished due to the cubic increase in
computational cost. PLATONICGAN and IG GAN [
64
,
109
] also generate the 3D voxel grids models
from the unstructured 2D image data with GAN. While another 3D representation, point cloud, is
the output as raw data through depth scanning. For the various problems in generating point clouds
with GAN, numbers of researchers introduce dierent approaches, ranging from the converge [
2
],
utilizing the local contexts [
8
,
70
,
157
,
164
], as well as the high memory consumption [
140
]. The mesh
representation is usually utilized as the target object in the 3D modeling software. Nevertheless the
popularity in the design discipline and traditional computer graphics, the diculties lie in applying
the deep generation models to the mesh. There are two main reasons. Firstly, non-Euclidean data
could not directly apply to the convolutional neural networks (CNN). Secondly, the diculty of
connecting the mesh vertices to composite the shape is high [
156
]. Get3D [
50
] enables the high-
quality geometry and texture from the 2D image collections by incorporating the dierentiable
surface modeling and dierentiable rendering to GANs.
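The voxel-based GANs above share an upsampling generator that grows a small latent seed into a full voxel grid through 3D transposed convolutions, and the cubic cost mentioned above follows directly from the output-size arithmetic. A rough illustration; the kernel, stride, and padding values are assumptions loosely patterned on the 3D-GAN generator [169], not taken from its implementation:

```python
def deconv3d_out(size, kernel, stride, padding):
    """Output edge length of a 3D transposed convolution on a cubic input:
    out = (in - 1) * stride - 2 * padding + kernel."""
    return (size - 1) * stride - 2 * padding + kernel

# A latent vector is reshaped to a 1^3 seed, then grown to a 64^3 voxel grid.
size = 1
size = deconv3d_out(size, kernel=4, stride=1, padding=0)   # 1 -> 4
for _ in range(4):                                         # 4 -> 8 -> 16 -> 32 -> 64
    size = deconv3d_out(size, kernel=4, stride=2, padding=1)
print(size)  # 64

# Each stride-2 layer doubles the edge, so voxel count grows as size**3:
# doubling the resolution multiplies memory and compute by 8.
print(size ** 3)  # 262144 voxels at 64^3
```

This is why fine-grained voxel generation quickly becomes infeasible: a 256^3 grid already holds 64 times the voxels of a 64^3 grid.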
3.2.2 3D Solid Form Generation. Currently, architectural design utilizing GANs for 3D solid form generation is based entirely on explicit representations, including voxel grids, point clouds, and meshes. 3D solid form generation refers to direct 3D data acquisition, evaluation, transformation, and rearrangement using deep generative models [159].
Meanwhile, with improved algorithms, GANs can recognize 3D representations such as meshes and point clouds, which shifts the paradigm of generation from 2D to 3D via a direct route using 3D point cloud semantic segmentation in 3D spaces. Immanuel Koh uses a 3D GAN network trained on a large dataset of both exterior and interior Singapore high-rise buildings to generate innovative housing typologies automatically [91] (See Fig. 11b). It tested the agency of generative spaces using deep neural networks, which inherit the configurations of architectural forms by extracting building block arrangements. Moreover, connecting 3D GANs with Houdini can expand the algorithm to integrate with 3D form generation. For instance, Joris Puteneers uses a 3D GAN as a form-finding tool in a project named ugly & stupid 14, which tested the agency of algorithms in creating artifacts based on image recognition. Besides, Cakmak added an encoder-decoder network to a GAN to process the datasets and generate new 3D models, which are then represented in alternative formats like point clouds and meshes [22] (See Fig. 11c). This also meant extending design cognition by adding AI as an agent in the design thinking process. A noteworthy study has been
14 https://putteneersjoris.xyz/projects/Ugly%20Stupid%20Honest/ugly_stupid_honest.html
Reference | Category | Objective | Methodology | 3D Representations | Generative Models
[38] | 2D to 3D | Test AI agency in design | Utilizing Style Transfer to train two datasets of Baroque and Modern images as a basis to form a 3D model | - | -
[133] | 2D to 3D | HCI in urban design | Utilizing Style Transfer to generate different stylized images and generate 3D geometry through procedural modeling | - | -
[143] | 2D to 3D | Test AI agency in design | Utilizing Style Transfer to replace pixels with voxelization units to generate 3D forms | - | -
[106] | 2D to 3D | Toolkits for 3D generation | Utilizing Style Transfer to assist the generation of 3D structure from 2D images | - | -
[175] | 2D to 3D | Generate building massing | Utilizing pix2pixGAN to generate plan and section patterns, then converting them to 3D massing | - | -
[42] | 2D to 3D | Generate building massing | Utilizing pix2pixGAN to generate urban morphology to create building massing | - | -
[182] | 2D to 3D | Form finding to assist design | Slicing a 3D model, training with different combinations of 2D StyleGAN networks, and finally stitching into a 3D model | - | -
[180] | 2D to 3D | Form finding to assist design | 3D model generation based on 2D plan and section using Style Transfer | - | -
[183] | 2D to 3D | Form finding to assist design | Combining spatial sequence information to generate 3D form from 2D images through multi-level deep generative networks such as StyleGAN | - | -
[13] | 2D to 3D | Human and neural network interface | Utilizing 3D solids for training to map spatial semantics to a latent space assembled using point cloud representations | Point cloud | GAN
[9] | 2D to 3D | Integrate latent space in design | GAN allows navigation in the latent space to create digital designs using vector arithmetic and interpolation techniques, then converting resulting images to 3D voxel structures | Voxel grid | GAN
[69] | 2D to 3D | Recognize 2D pattern to 3D form | Utilizing latent space rotation and perspective projection to generate a 3D model | Voxel grid | GAN
[88] | 2D to 3D | Recognize 2D pattern to 3D form | Utilizing latent space rotation and perspective projection to generate a 3D model | Voxel grid | GAN
[22] | 3D Solid | Extend design cognition | Utilizing a GAN model with a pair of encoder-decoder networks to process datasets and generate new 3D models, resulting in different 3D representations like point cloud and mesh | Point cloud / Mesh | GAN
[165] | 3D Solid | Generate building massing | Utilizing the 3D BAG dataset as a basis to train urban morphology data to generate 3D building massing | - | GAN
[36] | 3D Solid | Generation, manipulation, and form finding of structural typologies | Utilizing a VAE to learn a continuous latent space to generate new geometries | Voxel grid | VAE
[82] | 3D Solid | Solve design problems incorporating deep learning | Utilizing a VAE to manipulate objects according to different selected criteria | Voxel grid | VAE
[153] | 3D Solid | New way to design parametric models | Utilizing a VAE to encode and decode geometries through dimensionality manipulation | Voxel grid | VAE
[181] | NLP-3D Solid | Language-assisted design | Utilizing a language model to predict housing plans by training a large dataset relating texts to forms | - | -
[56] | NLP-3D Solid | Language-assisted design in HCI | Utilizing diffusion models to generate 3D forms from text input | - | Diffusion model
Table 2. Related works in the architectural field, sorted into different categories: 2D to 3D transposition, 3D solid generation, and NLP-based 3D form generation. Category denotes the generation methodology: '2D to 3D' means the 3D form generation is based on 2D images; '3D Solid' means generating 3D form directly with DGMs, including GAN, VAE, and diffusion models; 'NLP' means the generation process includes text input to assist output control. '3D Representations' in architecture encompass explicit point clouds, voxel grids, and meshes. The Objective column states the directions and goals the research aims to address; the Methodology column summarizes the proposed generation workflow. This table compares research in the architectural field.
conducted on geometry extraction within urban environments using 3D GANs [165] (See Fig. 11a). This enables generating a 3D dataset automatically through 3D BAG 15. The 3D BAG dataset includes different levels of 3D detail, which can be read and manipulated by a GAN. The model was trained on three layers of information: building geometry, site context, and area of interest. The information is stored as raster data for the 3D voxel-grid representation in deep learning.
This method of matching training datasets with parameters associated with spatial design goals has been widely used in architectural design solutions. Such applications, with GANs as the generation framework, can often deliver solutions with explicit design goals.
15 An overview of 3D BAG, source: https://docs.3dbag.nl/en/
3.2.3 Limitations. The methods discussed above have several limitations.
Fewer variations in style. The performance of GAN frameworks heavily depends on the quality and nature of the input data, resulting in limited variations in style. In the context of 2D-to-3D form finding, a significant proportion of studies have relied on StyleGAN as the basis for form generation, which tends to produce outputs that mimic the style of the "style image" without adjustable options.
Singleness of category. Similarly, since one training process can only handle a single category of dataset, the output is constrained to that category for design purposes. For example, the 3D-GAN-Housing project shows a certain degree of repetition in high-rise building design due to the limitation of the trained structure [91]. Since the dataset is limited, the blocks always follow the same evolutionary rule, which lowers stylistic variation and limits outputs to specific categories. This can be adapted to modular housing design but is not suitable for virtual space generation.
Unpredictability in design. The image-to-image translation of GANs is characterized by unpredictability. The training process of these algorithms is time-consuming and requires substantial computing resources. Moreover, under the underlying generative logic, designers can only evaluate the effects once they observe the final output. The latent vector undergoes arbitrary modifications across different epochs, adding to the complexity and unpredictability of the output.
Topological inconsistency. Firstly, when 3D forms are constructed solely from 2D images, the forms inevitably carry traces of the slicing process, leading to a loss of interior details and of the overall consistency of the structure. Secondly, using a constrained 3D segmentation algorithm poses a significant challenge to generating consistent forms, leading to topological inconsistencies such as gaps and defects in the final output. For instance, applying this method to the reconstruction of furniture reveals inconsistencies in the generated 3D shapes, as the algorithm needed to be pre-trained on the specific type of object [169].
Computing requirements of 3D data. Compared to other models, GAN models can typically produce 3D structures that are more detailed and realistic, but they are also more unstable and challenging to train. Moreover, converting data from 2D to 3D usually takes a long time.
3.3 Architectural Form from Latent Space with Variational Autoencoder
3.3.1 3D Shape Generation with VAEs. Following the aforementioned GAN approaches, and to improve on the instability of GANs, Brock et al. introduced a variational autoencoder to process 3D voxel grids [20]. It utilizes a pair of encoder-decoder networks: the encoder consists of four 3D convolutional layers that map the input to latent vectors, and the decoder transforms the latent vectors back into a 3D voxel grid. As aforementioned, later work proposes improvements to blurry voxels for smooth rounded edges [113]. For point clouds as the representation, although research progress is overcoming the difficulties, the instability of GANs has driven the invention of other encoder-based generative models for 3D generation: the VAE and the adversarial autoencoder (AAE) [177]. The difficulties in generating meshes with VAEs are similar to those with GANs. For the complexity of processing topology, the mesh parameterization called the multi-chart approach [18] can handle the irregular structure of meshes, and many approaches work on simplifying the process using this method [51, 52, 120]. TM-Net proposes an improved approach that defines a textured space on the template cube mesh based on SDM-Net [52].
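The latent-space machinery these voxel VAEs share, independent of any particular encoder or decoder architecture, is the reparameterization trick and the KL regularizer toward a standard normal prior. A minimal NumPy sketch; the 64-dimensional latent size and zero-valued "encoder outputs" are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps: the VAE reparameterization trick,
    which keeps the sampling step differentiable w.r.t. mu and log_var."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior; this term
    plus the reconstruction loss forms the (negative) ELBO being minimized."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# Toy "encoder output" for one voxel grid, mapped to a 64-d latent space.
mu, log_var = np.zeros(64), np.zeros(64)
z = reparameterize(mu, log_var)     # a decoder would map z back to voxels
print(z.shape)                       # (64,)
print(kl_to_standard_normal(mu, log_var))  # 0.0 when q already matches the prior
```

Because the learned posterior is pulled toward a single continuous Gaussian, nearby latent codes decode to similar voxel grids, which is what makes the form-hybrid interpolations described below possible.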
3.3.2 Latent Space. For research on architectural generation, VAEs extract information through the latent space with a pair of encoder-decoder networks. As aforementioned, the limitations in producing scientific and accurate design results with GANs derive from the framework itself. Furthermore, most existing DL-aided generated architecture has focused on 2D drawings. Consequently, there is a gap in these approaches regarding their ability to extract and utilize essential low-level spatial semantic and structural features to understand design intent and factors. Azizi et al. [11]
(a) The framework of a VAE processing 3D-structure data in Deep Form Finding [36].
(b) An overview of the methodology in [11].
Fig. 12. Two pioneering architectural designs utilizing VAEs.
proposed a VAE to encode and decode information about the spatial utilization of people's movements and activities in space, in order to generate reliable and plausible architectural compositions.
3.3.3 Architectural information extraction from latent space. In the pioneering VAE-integrated structural generation project Deep Form Finding [36], the researchers used labeled connectivity vectors extracted from a "3D canvas" as the data representation in rectangular 3D cubes, since such cubes conveniently illustrate the 3D structural information of any form (See Fig. 12a). The outcome achieved 3D voxelized wireframes of architectural forms through VAE models, where the encoder processes the input data and maps it to a lower-dimensional latent space, while the decoder takes the latent representation and maps it back to the original input space. The VAE model can learn continuous latent distributions of the input data and output hybrids of different forms with different style strengths. Since researchers found that 3D GANs struggle to learn 3D information, VAEs were considered to have higher potential and have been used to test the capabilities of deep neural networks in manipulating 3D geometries in the architectural field. Another proof-of-concept application is the design of a 3D voxel chair using a multi-object VAE [82]. This application aims to generate different types of chairs based on pre-defined criteria, ranging from leisure to work. VAEs have also been used to morph multiple simple objects such as cylinders, cubes, and spheres into new shapes within a given composition range [153]. From the above applications, we can see that although the VAE is a well-developed neural network, its use for complex space generation in architecture is still very limited. While another approach has targeted training by examining the floor plan of the building [11], which is not associated with the construction logic of virtual space, the approach does consider the human factors involved in the HCI methodology.
The oor plan is a potential representation that encodes multiple features. Autoencoders represent
the graph as a vector in continuous space. The attributed graph as the intermediate representation
encodes spatial semantics, structural information, and crowd behavioral features (See Fig. 12b).
VAEs utilize pointwise loss to nd a probability density by explicit representations to obtain an
optimal solution by minimizing a lower bound on the log-likelihood function, which results in
accurate generation results but lower resolution. GANs learn to generate from training distributions
through playing zero-sum-game, resulting in uncertain generation results but can ensure high-
quality data input. This results in dierent applications in the architectural eld. For example, the
applications in GANs are typically used for testing the agencies of AI, providing conceptual design
options and approaching the democratization of design. While VAEs are always being tested in
the form-nding process, to generate dierent design options available for dierent criteria and
scenarios. However, most research incorporating either GANs or VAEs in design only provides
a general approach to visual aesthetics instead of the design solutions on spatial functions and
structures [142].
Method Names | Publication & Year | 3D Representations | Single/multiple-view | Geometry | Editability (camera / object) | Controllability | Highlight
NeRF [55] | CVPR 2019 | Neural field | multiple | - | relighting | 3D novel view synthesis
HoloGAN [125] | CVPR 2019 | Voxel grid | single | - | - | - | -
Pi-GAN [28] | CVPR 2021 | Neural field | single | - | position | - | -
GIRAFFE [128] | CVPR 2021 | Neural field | single | - | -
StyleSDF [131] | CVPR 2022 | Neural field | single | - | - | -
StyleNeRF [55] | arXiv 2021 | Neural field | single | - | - | -
DreamField [76] | CVPR 2022 | Neural field | multiple | - | CLIP: text input
DreamFusion [138] | arXiv 2022 | Neural field | single | text input
CLIP-NeRF [166] | CVPR 2022 | Neural field | single | - | - | - | CLIP: text input
Table 3. An overview of 3D generative approaches to 3D-aware image synthesis. Single/multiple indicates whether the result is generated from a single-view sample or from multiple-view images. Geometry indicates whether the method allows export to mesh. Editability indicates whether the generation process enables editing, such as composing objects in a scene. 3D-aware image synthesis offers controllability over camera pose, object pose, position, location, relighting, and so on.
3.4 3D-Aware Image Synthesis
3D-aware image synthesis introduces expressive and efficient neural scene representations inspired by 3D view synthesis methods such as NeRF [172]. It exhibits the capability of 3D view-consistent rendering, efficient and expressive representation, and interactive editability. This approach is highly appropriate for the field of architecture, since it fills the gap left by the lack of large-scale, high-quality 3D datasets in DL-assisted architecture. 3D-aware synthesis relies only on supervision from 2D images, adopting differentiable neural rendering. The process involves sophisticated techniques such as depth estimation and multi-view stereo to generate a 3D-aware image from 2D images. Since VAE-based models have no 3D representations to render, most 3D-aware image syntheses utilize a GAN-based model that samples latent vectors and decodes them into a target 3D representation. Although some methods implement the export of mesh models [28, 55, 131], according to this survey, existing architectural studies have not adopted this novel approach. We provide proof of its potential in virtual building generation.
3.4.1 3D-aware image synthesis and its editability. 3D-aware image synthesis has achieved tremendous progress in the implicit representation of 3D models [28, 78, 128, 152], in terms of two mainstream problems: resolution and multi-view consistency. It utilizes image synthesis in a more controllable way to generate synthetic 3D scene representations by incorporating generative models. Later research has focused on generating 3D-aware images with the integration of GAN-based models [28, 125] (Fig. 9c, d, f). For instance, HoloGAN (Fig. 9c) can be trained end-to-end from unlabeled 2D images without pose labels, 3D shapes, or multiple views of the same scene [125]; it is the first unsupervised model for learning from natural images. Some of the latest studies [28, 55, 131] show that their frameworks can substantially improve the two dominant problems of 3D-aware synthesis: high resolution and consistency across multiple synthetic views. The SDF-based method defines detailed 3D surfaces, leading to consistent volume rendering. For instance, StyleSDF shows higher-quality results in terms of visual and geometric quality [131] (See Fig. 9d). Moreover, cutting-edge methods demonstrate that integrating 3D-aware images with the CLIP model [76, 166] enables 3D geometry generation from natural language descriptions, while DreamFusion derives a loss from the distillation of a 2D diffusion model instead of CLIP [138]. The originality of DreamField lies in a pre-training process of 2D image-to-text models to optimize the underlying 3D representations. On the other hand, some advanced progress leaps out of the solid box of pre-trained 2D image-to-text models: DreamFusion incorporates the diffusion model as a strong image prior to
this text-to-image pre-training, improving the efficiency of the generation. In addition, 3D-aware synthesis can incorporate other deep generative models, such as the very recent diffusion models; this advanced diffusion model will be specifically elucidated in the following subsection. The main goal of 3D-aware image synthesis is gaining an explicit camera pose in the task [156]. This controllability takes users into a more engaging and interactive environment. Some approaches also support editability of the object pose. For instance, GIRAFFE allows panning and rotating the obtained 3D objects in the scene [128]. StyleNeRF also allows altering style attributes while supporting style blending, inversion, and semantic editing of the generated results [55]. This editability provides various solutions from the perspective of subdividing the generated target.
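The view-consistent rendering these methods share rests on differentiable volume rendering: densities sampled along a camera ray are alpha-composited into per-sample weights. An illustrative NumPy version of the standard weight computation, not code from any surveyed system (the densities and colors below are toy values):

```python
import numpy as np

def render_weights(sigmas, deltas):
    """Compositing weights along one ray: w_i = T_i * (1 - exp(-sigma_i * delta_i)),
    where the transmittance T_i is the product of (1 - alpha_j) for samples j < i."""
    alphas = 1.0 - np.exp(-sigmas * deltas)
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    return trans * alphas

sigmas = np.array([0.0, 5.0, 50.0, 0.1])   # densities sampled along one ray
deltas = np.full(4, 0.25)                  # spacing between the samples
w = render_weights(sigmas, deltas)
pixel = w @ np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])  # per-sample RGB
print(w.round(3))   # empty space gets zero weight; dense samples dominate
```

Because every step is differentiable, gradients from a 2D image loss flow back to the sampled densities, which is what lets these generators train from 2D supervision alone.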
3.4.2 3D-aware images utilized in architecture. The work in [28] provides an integrated method that transforms implicit neural representations into mesh representations, affording editability in 3D space for architecture. StyleSDF [131] and StyleNeRF [55] also implement methods for converting to geometry. Meng et al., as pioneering architects, have launched a configurable Colab with user-friendly interaction for creators, supporting conditional text input and some parameters, including style attributes, based on DreamField [76] 16. However, the concern is that it is effective only for simple objects. These methods have plenty of limitations at high resolution and are incapable of generating precise 3D structures with internal spaces. Efficiency is hugely reduced for the task of generating complex architectural structures. As a result, it is difficult to obtain a valid building with internal structure and functional space from image synthesis.
Although 3D-aware synthesis is relatively premature for virtual architecture, its ability to convert to mesh, its controllability, and its multi-modality with linguistic descriptions have demonstrated its potential for generating complex and unique architectural forms. In contrast to generation with explicit representations, it offers more flexible, continuous, and efficient representations of geometry, as well as the capability of integration with other deep learning generation techniques. As the field of 3D DGMs and 3D-aware image synthesis evolves, architects may increasingly explore the potential of this technique as a common toolkit for virtual architectural design.
3.5 Emerging Generation Based on Diusion Model
Recently, diusion as one of the deep generative models has gained a growing interest in generating
3D shapes due to its high quality with ne details and controllable attributes. It outperforms
Generative Adversarial Networks (GANs) in delity due to intricate details and sharp edges while
maintaining stability during training and reducing the risk of mode collapse. This superiority stems
from their ability to enable ne-grained control over the generation process with specic attributes
or interpolation between shapes smoothly and continuously. In contrast, a less controlled approach
to GANs dictates its diculty in specifying the desired properties of the output.
3.5.1 3D Diusion. DreamFusion adopts diusion models to denoising images for a high-quality
image for 3D-aware image synthesis [
138
]. Despite the exibility of conditional diusion sampling,
as revealed by studies of GANs, traditional diusion as a DGM only samples pixels. Ben et al.
abandoned processing large amounts of data from 2D images to 3D while generating a 3D model
directly. In DreamFusion, a parameter of 3D volume, instead of images’ indicators,
𝜃
, and g is
a volumetric renderer. It yields a sample through an optimization performed by minimizing a
loss function. Two limitations exist, DreamFusion was improved in the latest research, known
as Magic3D [
104
], which are the low resolution of geometry and textures and the expensive
computation as well as intensive memory. LION [
179
] has a higher quality performance by utilizing
the diusion models combined with a hierarchy VAE. Its exibility of operation and application has
16 Source: https://github.com/shengyu-meng/dreamfields-3D
2023-04-30 10:37. Page 23 of 1–35. J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2018.
111:24 Anqi Wang, Jiahua Dong, Jiachuan Shen, Lik-Hang Lee, and Pan Hui
Fig. 13. Architectural designs utilizing diffusion models. (a) Selected results in [181] of the
linguistics-based architectural-form DGMs that make 3D form predictions based on text
descriptions. (b) Methodology incorporating stable diffusion (SD) with LoRA by AIG; Source:
https://www.bilibili.com/video/BV1Qb411Z7UP/. (c) A plugin utilizing a diffusion model in the
3D modeling software Rhino [56].
also increased compared to previous 3D DDMs [66] due to conditional synthesis and shape
interpolation. Unlike most existing DDPMs, PVD [186] employs a unified probabilistic formulation
to generate multiple high-fidelity 3D shapes from a single-view depth scan of a real object.
Moreover, diffusion models allow the generation of 3D shapes with controllable attributes such
as shape and texture, which can be modified by conditioning the generation process on specific
attributes. These findings suggest that diffusion models may offer a more robust and controlled
approach to 3D shape generation, particularly for complex shapes with intricate details and
specific attributes.
3.5.2 3D Diusion Applications in Architecture. The use of diusion models in architecture is an
emerging and promising eld for development. Integrating an application programming interface
(API) directly into the diusion model in Rhino’s visual programming environment, Grasshopper
has the potential to usher in a paradigm shift in the generation of architectural 3D forms. For
example, morphological heatmap images transforming from 3D architecture models can be trained
using Lora models, in which stable diusion can further edit (See Fig. 13b). The generated grayscale
images processed by stable diusion include height information, which can be easily transformed
into a mesh model in the same modeling environment 17.
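The heightmap-to-mesh step mentioned above is straightforward to illustrate. The sketch below is a minimal NumPy version, assuming a grayscale array of height values in [0, 1]; the function name and scale parameters are hypothetical, not taken from the cited workflow, which performs the conversion inside Grasshopper.

```python
import numpy as np

def heightmap_to_mesh(height, xy_scale=1.0, z_scale=1.0):
    """Convert a 2D grayscale heightmap (values in [0, 1]) into a
    triangle mesh: one vertex per pixel, two triangles per grid cell."""
    h, w = height.shape
    ys, xs = np.mgrid[0:h, 0:w]
    vertices = np.stack(
        [xs.ravel() * xy_scale,
         ys.ravel() * xy_scale,
         height.ravel() * z_scale], axis=1)
    faces = []
    for y in range(h - 1):
        for x in range(w - 1):
            i = y * w + x                        # top-left vertex of the cell
            faces.append((i, i + 1, i + w))      # upper triangle
            faces.append((i + 1, i + w + 1, i + w))  # lower triangle
    return vertices, np.array(faces)

# Example: a tiny 3x3 synthetic "heatmap" with a central peak.
hm = np.array([[0.0, 0.2, 0.0],
               [0.2, 1.0, 0.2],
               [0.0, 0.2, 0.0]])
verts, faces = heightmap_to_mesh(hm, xy_scale=1.0, z_scale=5.0)
```

The resulting vertex and face arrays map directly onto the mesh constructors of modeling environments such as Rhino, which is why a diffusion-generated grayscale image can be lifted into 3D with so little glue code.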
3.5.3 Controllability and Generative Models Conditioned on Text. Text-to-3D models have surged
as a featured method for generating 3D shapes in the past two years [29, 104, 107, 127, 138]. The
earliest research we tracked is Text2Shape [29], in which datasets of 3D models with color and
shape paired with natural language were used to build implicit semantic links. Recent research
has made remarkable breakthroughs in associating text and 3D models through unsupervised
learning. Similar to 3D-aware synthesis, most methods utilize CLIP with unsupervised learning
[104, 127]. The text-to-3D approach demonstrates superior controllability compared to other
methods for generating 3D models, along with customized style attributes. It interprets textual
prompts into intuitive visual representations that cater to design intention. For instance, Magic3D
[104] offers advanced control over generated 3D styles and content through various
image-conditioning and prompt-based editing techniques to achieve the desired result. As a
result, the text-to-3D approach democratizes 3D geometry generation, giving individuals with
varying levels of expertise access to creative production. As aforementioned, integrating Colab
with DL algorithms also serves as a gateway for designers, artists, and amateurs to participate in
the burgeoning field of content production.
17 Source: https://www.bilibili.com/video/BV1Qb411Z7UP/
Fig. 14. Research agendas in a full process of DGMs-assisted architectural design.
3.5.4 3D Form Driven by Text in Architecture. There are some initial applications that use
language as a starting point for design, although the generative models in such applications
remain constrained. For example, Del Campo used an attentional GAN (AttnGAN) to assist the
brainstorming process by transforming written ideas of multipurpose spaces into visual outputs
[37]; the final outcome was then based on the previously demonstrated visuals. Zhang developed
a machine-learning framework capable of encoding input geometry into new geometry by using
text to form a prediction [181] (see Fig. 13a). In this framework, different usages of spaces were
trained with adjacency matrices to understand the linguistic instructions. With the integration of
natural language supervision, diffusion models exhibit high-quality performance in form
generation and have great potential to become HCI tools. George Guida explored