
GAUDI: A Neural Architect for Immersive 3D Scene Generation

Miguel Angel Bautista∗ Pengsheng Guo∗ Samira Abnar Walter Talbott Alexander Toshev Zhuoyuan Chen Laurent Dinh Shuangfei Zhai Hanlin Goh Daniel Ulbricht Afshin Dehghan Josh Susskind

Apple

https://github.com/apple/ml-gaudi

Abstract

We introduce GAUDI, a generative model capable of capturing the distribution of complex and realistic 3D scenes that can be rendered immersively from a moving camera. We tackle this challenging problem with a scalable yet powerful approach, where we first optimize a latent representation that disentangles radiance fields and camera poses. This latent representation is then used to learn a generative model that enables both unconditional and conditional generation of 3D scenes. Our model generalizes previous works that focus on single objects by removing the assumption that the camera pose distribution can be shared across samples. We show that GAUDI obtains state-of-the-art performance in the unconditional generative setting across multiple datasets and allows for conditional generation of 3D scenes given conditioning variables like sparse image observations or text that describes the scene.

1 Introduction

In order for learning systems to be able to understand and create 3D spaces, progress in generative models for 3D is sorely needed. The quote "The creation continues incessantly through the media of humans." is often attributed to Antoni Gaudí, whom we pay homage to with our method's name. We are interested in generative models that can capture the distribution of 3D scenes and then render views from scenes sampled from the learned distribution. Extensions of such generative models to conditional inference problems could have tremendous impact on a wide range of tasks in machine learning and computer vision. For example, one could sample plausible scene completions that are consistent with an image observation or a text description (see Fig. 1 for 3D scenes sampled from GAUDI). In addition, such models would be of great practical use in model-based reinforcement learning and planning [12], SLAM [39], or 3D content creation.

Recent works on generative modeling for 3D objects or scenes [56, 5, 7] employ a Generative Adversarial Network (GAN) where the generator explicitly encodes radiance fields — a parametric function that takes as input the coordinates of a point in 3D space and a camera pose, and outputs a density scalar and RGB value for that 3D point. Images can be rendered from the radiance field generated by the model by passing the queried 3D points through the volume rendering equation to project onto any 2D camera view.

∗denotes equal contribution. Corresponding email: mbautistamartin@apple.com

Preprint. Under review.



Figure 1: GAUDI models both conditional and unconditional distributions over complex 3D scenes. Sampled scenes and poses from (left) the unconditional distribution, and (right) a distribution conditioned on an image observation or a text prompt.

While compelling on small or simple 3D datasets (e.g. single objects or a small number of indoor scenes), GANs suffer from training pathologies including mode collapse [54, 61] and are difficult to train on data for which a canonical coordinate system does not exist, as is the case for 3D scenes [57]. In addition, one key difference between modeling distributions of 3D objects vs. scenes is that when modeling objects it is often assumed that camera poses are sampled from a distribution that is shared across objects (i.e. typically over SO(3)), which is not true for scenes. This is because the distribution of valid camera poses depends on each particular scene independently (based on the structure and location of walls and other objects). In addition, for scenes this distribution can encompass all poses over the SE(3) group. This fact becomes clearer when we think about camera poses as a trajectory through the scene (cf. Fig. 3(b)).

In GAUDI, we map each trajectory (i.e. a sequence of posed images from a 3D scene) into a latent representation that encodes a radiance field (e.g. the 3D scene) and camera path in a completely disentangled way. We find these latent representations by interpreting them as free parameters and formulating an optimization problem where the latent representation for each trajectory is optimized via a reconstruction objective. This simple training process is scalable to thousands of trajectories. Interpreting the latent representation of each trajectory as a free parameter also makes it simple to handle a large and variable number of views for each trajectory rather than requiring a sophisticated encoder architecture to pool across a large number of views. After optimizing latent representations for an observed empirical distribution of trajectories, we learn a generative model over the set of latent representations. In the unconditional case, the model can sample radiance fields entirely from the prior distribution learned by the model, allowing it to synthesize scenes by interpolating within the latent space. In the conditional case, conditioning variables available to the model at training time (e.g. images, text prompts, etc.) can be used to generate radiance fields consistent with those variables.

Our contributions can be summarized as:

• We scale 3D scene generation to thousands of indoor scenes containing hundreds of thousands of images, without suffering from mode collapse or canonical orientation issues during training.

• We introduce a novel denoising optimization objective to find latent representations that jointly model a radiance field and the camera poses in a disentangled manner.

• Our approach obtains state-of-the-art generation performance across multiple datasets.

• Our approach allows for various generative setups: unconditional generation as well as conditional on images or text.

2 Related Work

In recent years the field has witnessed outstanding progress in generative modeling for the 2D image domain, with most approaches focusing either on adversarial [19, 20] or auto-regressive models [64, 42, 9]. More recently, score matching based approaches [16, 58] have gained popularity. In particular, Denoising Diffusion Probabilistic Models (DDPMs) [15, 33, 48, 63] have emerged as strong contenders to both adversarial and auto-regressive approaches. In DDPMs, the goal is to learn a step-by-step inversion of a fixed diffusion Markov Chain that gradually transforms an empirical data distribution to a fixed posterior, which typically takes the form of an isotropic Gaussian distribution.

In parallel, the last couple of years have seen a revolution in how 3D data is represented within neural networks. By representing a 3D scene as a radiance field, NeRF [29] introduces an approach to optimize the weights of an MLP to represent the radiance of 3D points that fall inside the field-of-view of a given set of posed RGB images. Given the radiance for a set of 3D points that lie on a ray shot from a given camera pose, NeRF [29] uses volumetric rendering to compute the color for the corresponding pixel and optimizes the MLP weights via a reconstruction loss in image space.

A few attempts have also been made at incorporating a radiance field representation within generative models. Most approaches have focused on the problem of single objects with known canonical orientations, like faces or ShapeNet objects, with shared camera pose distributions across samples in a dataset [56, 5, 34, 22, 4, 10, 70, 43]. Extending these approaches from single objects to completely unconstrained 3D scenes is an unsolved problem. One paper worth mentioning in this space is GSN [7], which breaks the radiance field into a grid of local radiance fields that collectively represent a scene. While this decomposition of radiance fields endows the model with high representational capacity, GSN still suffers from the standard training pathologies of GANs, like mode collapse [61], which are exacerbated by the fact that unconstrained 3D scenes do not have a canonical orientation. As we show in our experiments (cf. Sect. 4), these issues become prominent as the training set size increases, impacting the capacity of the generative model to capture complex distributions. Separately, a line of recent approaches has also studied the problem of learning generative models of scenes without employing radiance fields [36, 65, 47]. These works assume that the model has access to room layouts and a database of object CAD models during training, simplifying the problem of scene generation to a selection of objects from the database and pose predictions for each object.

Finally, approaches that learn to predict a target view given a single (or multiple) source view and a relative pose transformation have been recently proposed [24, 69, 53, 8, 11]. The pure reconstruction objective employed by these approaches forces them to learn a deterministic conditional function that maps a source image and a relative camera transformation to a target image. One issue is that this scene completion problem is ill-posed (e.g. given a single source view of a scene there are multiple target completions that are equally likely). Attempts at modeling the problem in a probabilistic manner have been proposed [49, 45]. However, these approaches suffer from inconsistency in predicted scenes because they do not explicitly model a 3D-consistent representation like a radiance field.

3 GAUDI

Our goal is to learn a generative model given an empirical distribution of trajectories over 3D scenes. Let X = {x_{i ∈ {0,...,n}}} denote a collection of examples defining an empirical distribution, where each example x_i is a trajectory. Every trajectory x_i is defined as a variable-length sequence of corresponding RGB images, depth images and 6DOF camera poses (see Fig. 3).

We decompose the task of learning a generative model into two stages. First, we obtain a latent representation z = [z_scene, z_pose] for each example x ∈ X that represents the scene radiance field and pose in separate disentangled vectors. Second, given a set of latents Z = {z_{i ∈ {0,...,n}}} we learn the distribution p(Z).
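For concreteness, each example x can be stored as a small container of aligned tensors. The following sketch is our own illustration in PyTorch-style Python; the field names and shapes are assumptions made for the example, not something the paper prescribes.

from dataclasses import dataclass
import torch

@dataclass
class Trajectory:
    """One example x: a variable-length sequence of posed RGB-D frames."""
    rgb: torch.Tensor    # (L, 3, H, W) RGB images
    depth: torch.Tensor  # (L, 1, H, W) depth images
    poses: torch.Tensor  # (L, 4, 4) 6DOF camera poses (elements of SE(3))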

3.1 Optimizing latent representations for radiance ﬁelds and camera poses

We now turn to the task of finding a latent representation z ∈ Z for each example x ∈ X (i.e. for each trajectory in the empirical distribution). To obtain this latent representation we take an encoder-less view and interpret the z's as free parameters to be found via an optimization problem [2, 35]. To map latents z to trajectories x, we design a network architecture (i.e. a decoder) that disentangles camera poses and radiance field parameterization. Our decoder architecture is composed of 3 networks (shown in Fig. 2):

• The camera pose decoder network c (parameterized by θ_c) is responsible for predicting camera poses T̂_s ∈ SE(3) at the normalized temporal position s ∈ [−1, 1] in the trajectory, conditioned on z_pose, which represents the camera poses for the whole trajectory. To ensure that the output of c is a valid camera pose (e.g. an element of SE(3)), we output a 3D vector representing a normalized quaternion q_s for the orientation and a 3D translation vector t_s.


Figure 2: Architecture of the decoder model that disentangles camera poses from the 3D geometry and appearance of the scene. Our decoder is composed of 3 submodules. A decoder d that takes as input a latent code representing the scene, z_scene, and produces a factorized representation of 3D space via a tri-plane latent encoding W. A radiance field network f that takes as input points p ∈ R^3 and is conditioned on W to predict a density σ and a signal a to be rendered via volumetric rendering (Eq. 1). Finally, we decode the camera poses through a network c that takes as input a normalized temporal position s ∈ [−1, 1] and is conditioned on z_pose, which represents camera poses for the whole trajectory x, to predict the camera pose T̂_s ∈ SE(3).

• The scene decoder network d (parameterized by θ_d) is responsible for predicting a conditioning variable for the radiance field network f. This network takes as input a latent code that represents the scene, z_scene, and predicts an axis-aligned tri-plane representation [37, 4] W ∈ R^{3×S×S×F}, which corresponds to 3 feature maps [W_xy, W_xz, W_yz] of spatial dimension S×S and F channels, one for each axis-aligned plane: xy, xz and yz.

• The radiance field decoder network f (parameterized by θ_f) is tasked with reconstructing image-level targets using the volumetric rendering equation in Eq. 1. The input to f is p ∈ R^3 and the tri-plane representation W = [W_xy, W_xz, W_yz]. Given a 3D point p = [i, j, k] for which radiance is to be predicted, we orthogonally project p onto each plane in W and perform bi-linear sampling. We concatenate the 3 bi-linearly sampled vectors into w_xyz = [W_xy(i, j), W_xz(j, k), W_yz(i, k)] ∈ R^{3F}, which is used to condition the radiance field function f. We implement f as an MLP that outputs a density value σ and a signal a. To predict the value v of a pixel, the volumetric rendering equation (cf. Eq. 1) is used, where a 3D point is expressed as a ray direction r (corresponding to the pixel location) at a particular depth u (a minimal code sketch of this look-up and of Eq. 1 is given after the equation).

v(\mathbf{r}, W) = \int_{u_n}^{u_f} T_r(u)\,\sigma(\mathbf{r}(u), \mathbf{w}_{xyz})\,a(\mathbf{r}(u), \mathbf{d}, \mathbf{w}_{xyz})\,du, \qquad T_r(u) = \exp\left(-\int_{u_n}^{u} \sigma(\mathbf{r}(u), \mathbf{w}_{xyz})\,du\right). \quad (1)
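To make the tri-plane conditioning and the quadrature form of Eq. 1 concrete, the sketch below gives a minimal PyTorch-style implementation. It is an illustration under our own assumptions (uniform depth sampling, a caller-supplied MLP f_mlp returning density and signal, arbitrary near/far bounds and sample counts), not the released GAUDI code.

import torch
import torch.nn.functional as F

def sample_triplane(W, p):
    """Bi-linearly sample tri-plane features at 3D points.
    W: (3, C, S, S) feature maps for the xy, xz and yz planes.
    p: (N, 3) points with coordinates normalized to [-1, 1].
    Returns w_xyz: (N, 3C), the concatenation of the three per-plane features."""
    i, j, k = p[:, 0], p[:, 1], p[:, 2]
    # Per-plane 2D coordinates, following w_xyz = [W_xy(i,j), W_xz(j,k), W_yz(i,k)].
    coords = [torch.stack(c, dim=-1) for c in [(i, j), (j, k), (i, k)]]
    feats = []
    for plane, uv in zip(W, coords):
        grid = uv.view(1, -1, 1, 2)                       # (1, N, 1, 2) sampling grid
        f = F.grid_sample(plane.unsqueeze(0), grid, align_corners=True)
        feats.append(f.reshape(plane.shape[0], -1).t())   # (N, C)
    return torch.cat(feats, dim=-1)                       # (N, 3C)

def render_pixel(f_mlp, W, origin, direction, u_near=0.1, u_far=8.0, n_samples=64):
    """Quadrature approximation of Eq. 1 along one ray r(u) = origin + u * direction.
    f_mlp(points, dirs, w_xyz) -> (sigma, a) plays the role of the radiance field f."""
    u = torch.linspace(u_near, u_far, n_samples)
    pts = origin + u[:, None] * direction                 # r(u) at n_samples depths
    w_xyz = sample_triplane(W, pts)
    sigma, a = f_mlp(pts, direction.expand_as(pts), w_xyz)
    delta = u[1] - u[0]
    alpha = 1.0 - torch.exp(-sigma * delta)               # per-segment opacity
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10], dim=0)[:-1], dim=0)
    weights = trans * alpha                               # T_r(u) * (1 - exp(-sigma * du))
    return (weights[:, None] * a).sum(dim=0)              # rendered value v(r, W)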

We formulate a denoising reconstruction objective to jointly optimize for θ_d, θ_c, θ_f and {z}_{i={0,...,n}}, shown in Eq. 2. Note that while latents z are optimized for each example x independently, the parameters of the networks θ_d, θ_c, θ_f are amortized across all examples x ∈ X. As opposed to previous auto-decoding approaches [2, 35], each latent z is perturbed during training with additive noise that is proportional to the empirical standard deviation across all latents, z = z + β·N(0, std(Z)), inducing a contractive representation [46]. In this setting, β controls the trade-off between the entropy of the distribution z ∈ Z and the reconstruction term: with β = 0 the distribution of z's becomes a collection of indicator functions, whereas non-trivial structure in latent space arises for β > 0. We use a small β > 0 value to enforce a latent space in which interpolated samples (or samples that contain small deviations from the empirical distribution, as the ones that one might get from sampling a subsequent generative model) are included in the support of the decoder.

\min_{\theta_d,\,\theta_f,\,\theta_c,\,Z}\ \mathbb{E}_{x \sim X}\left[\mathcal{L}_{\mathrm{scene}}(x^{\mathrm{im}}_s, z_{\mathrm{scene}}, T_s) + \lambda\,\mathcal{L}_{\mathrm{pose}}(T_s, z_{\mathrm{pose}}, s)\right] \quad (2)

We optimize the parameters θ_d, θ_f, θ_c and the latents z ∈ Z with two different losses. The first loss function L_scene measures the reconstruction error between the radiance field encoded in z_scene and the images in the trajectory x^im_s (where s denotes the normalized temporal position of the frame in the trajectory), given the ground-truth camera poses T_s required for rendering. We use an l2 loss for RGB and an l1 loss for depth^1. The second loss function L_pose measures the camera pose reconstruction error between the poses T̂_s encoded in z_pose and the ground-truth poses. We employ an l2 loss on the translation and an l1 loss for the normalized quaternion part of the camera pose. Although normalized quaternions are theoretically not unique (e.g. q and −q encode the same rotation), we do not observe any issues empirically during training.
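A compact way to read this first stage is as the following optimization loop, sketched here under our own assumptions: decode_and_losses stands in for the networks d, f, c together with L_scene + λ·L_pose, and all hyper-parameters are illustrative rather than the values used in the paper.

import torch

def optimize_latents(decode_and_losses, decoder_params, trajectories,
                     latent_dim=2048, beta=0.1, steps=100_000, lr=1e-3, batch=8):
    """Sketch of the auto-decoder stage (Eq. 2): latents are free parameters
    optimized jointly with the amortized decoder weights; each latent is
    perturbed with noise proportional to the empirical std of all latents."""
    n = len(trajectories)
    Z = torch.nn.Parameter(0.01 * torch.randn(n, 2 * latent_dim))  # [z_scene, z_pose] per example
    opt = torch.optim.Adam([Z] + list(decoder_params), lr=lr)
    for _ in range(steps):
        idx = torch.randint(0, n, (batch,))
        z = Z[idx]
        # Contractive perturbation: z <- z + beta * N(0, std(Z)).
        z = z + beta * torch.randn_like(z) * Z.detach().std(dim=0)
        z_scene, z_pose = z.chunk(2, dim=-1)
        loss = decode_and_losses(z_scene, z_pose, [trajectories[i] for i in idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return Z.detach()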

3.2 Prior Learning

Given a set of latents z ∈ Z resulting from minimizing the objective in Eq. 2, our goal is to learn a generative model p(Z) that captures their distribution (i.e. after minimizing the objective in Eq. 2 we interpret z ∈ Z as examples from an empirical distribution in latent space). In order to model p(Z) we employ a Denoising Diffusion Probabilistic Model (DDPM) [15], a recent score-matching [16] based model that learns to reverse a diffusion Markov Chain with a large but finite number of timesteps. In DDPMs [15] it is shown that this reverse process is equivalent to learning a sequence of denoising auto-encoders with tied weights. The supervised denoising objective in DDPMs makes learning p(Z) simple and scalable, and allows us to learn a powerful generative model that enables both unconditional and conditional generation of 3D scenes. For training our prior p_{θ_p}(Z) we use the objective function of [15], defined in Eq. 3. In Eq. 3, t denotes the timestep, ε ∼ N(0, I) is the noise and ᾱ_t is a noise magnitude parameter with a fixed schedule. Finally, θ_p denotes the denoising model.

\min_{\theta_p}\ \mathbb{E}_{t,\, z \sim Z,\, \epsilon \sim \mathcal{N}(0, I)}\left\|\epsilon - \theta_p\!\left(\sqrt{\bar{\alpha}_t}\, z + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right)\right\|^2 \quad (3)
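As a concrete illustration (not the authors' implementation), once the optimized latents are treated as training data for the denoiser, the objective in Eq. 3 reduces to a few lines:

import torch

def ddpm_prior_loss(denoiser, z0, alpha_bar):
    """One training step of the prior over latents (Eq. 3), as a sketch.
    denoiser(z_t, t) plays the role of theta_p; alpha_bar is the fixed
    schedule of cumulative noise magnitudes with shape (T,)."""
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (z0.shape[0],))          # a random timestep per latent
    eps = torch.randn_like(z0)                       # epsilon ~ N(0, I)
    a = alpha_bar[t].unsqueeze(-1)
    z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * eps     # diffused latent
    return ((eps - denoiser(z_t, t)) ** 2).mean()    # || eps - theta_p(., t) ||^2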

At inference time, we sample z ∼ p_{θ_p}(Z) by following the inference process in DDPMs. We start by sampling z_T ∼ N(0, I) and iteratively apply θ_p to gradually denoise z_T, thus reversing the diffusion Markov Chain to obtain z_0. We then feed z_0 as input to the decoder architecture (cf. Fig. 2) and reconstruct a radiance field and a camera path.
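A matching sketch of this sampling procedure, using the standard DDPM ancestral update [15] and our own variable names, is:

import torch

@torch.no_grad()
def sample_prior(denoiser, shape, alpha_bar, beta):
    """Reverse the diffusion chain from z_T ~ N(0, I) down to z_0, which is
    then decoded as in Fig. 2. beta is the per-step noise schedule."""
    alpha = 1.0 - beta
    z = torch.randn(shape)                                        # z_T
    for t in reversed(range(alpha_bar.shape[0])):
        eps_hat = denoiser(z, torch.full((shape[0],), t))         # predicted noise
        z = (z - beta[t] / (1.0 - alpha_bar[t]).sqrt() * eps_hat) / alpha[t].sqrt()
        if t > 0:
            z = z + beta[t].sqrt() * torch.randn_like(z)          # stochastic step
    return z                                                      # z_0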

If the goal is to learn a conditional distribution of the latents p(Z|Y), given paired data {z ∈ Z, y ∈ Y}, the denoising model θ_p is augmented with a conditioning variable y, resulting in θ_p(z, t, y). Implementation details about how the conditioning variable is used in the denoising architecture can be found in Appendix C.

4 Experiments

In this section we show the applicability of GAUDI to multiple problems. First, we evaluate

reconstruction quality and performance of the reconstruction stage. Then, we evaluate the performance

of our model in generative tasks including unconditional and conditional inference, in which radiance

ﬁelds are generated from conditioning variables corresponding to images or text prompts. Full

experimental settings and details can be found in the appendix B.

4.1 Data

We report results on 4 datasets: Vizdoom [21], Replica [60], VLN-CE [23] and ARKitScenes [1], which vary in number of scenes and complexity (see Fig. 3 and Tab. 1).

Vizdoom [21]: Vizdoom is a synthetic simulated environment with simple texture and geometry. We use the data provided by [7] to train our model. It is the simplest dataset in terms of number of scenes and trajectories, as well as texture, serving as a test bed to examine GAUDI in the simplest setting.

Replica [60]: Replica is a dataset comprised of 18 realistic scenes from which trajectories are rendered via Habitat [55]. We use the data provided by [7] to train our model.

VLN-CE [23]: VLN-CE is a dataset originally designed for vision and language navigation in continuous environments. It is composed of 3.6K trajectories of an agent navigating between two points in a 3D scene from the Matterport3D dataset [6]. We render observations via Habitat [55]. Notably, this dataset also contains textual descriptions of the trajectories taken by the agent. In Sect. 4.5 we train GAUDI in a conditional manner to generate 3D scenes given a description.

1 We obtain depth predictions by aggregating densities across a ray as in [29].


Figure 3: (a) Examples of the 4 datasets we use in this paper (from left to right): Vizdoom [21], Replica [60], VLN-CE [23], ARKitScenes [1]. (b) Top-down views of 2 different camera paths in VLN-CE [23]. Blue and red dots represent start and end positions, and the camera path is highlighted in blue.

ARKitScenes [1]: ARKitScenes is a dataset of scans of indoor spaces. It contains more than 5K scans of about 1.6K different indoor spaces. As opposed to the previous datasets, where RGB, depth and camera poses are obtained via rendering in a simulator (i.e. either Vizdoom [21] or Habitat [55]), ARKitScenes provides raw RGB and depth from the scans, together with camera poses estimated using ARKit SLAM. In addition, whereas trajectories from the previous datasets are point-to-point, as is typical in navigation, the camera trajectories for ARKitScenes resemble a natural scan of a full indoor space. In our experiments we use a subset of 1K scans from ARKitScenes to train our models.

4.2 Reconstruction

We first validate the hypothesis that the optimization problem described in Eq. 2 can find latent codes z that are able to reconstruct the trajectories in the empirical distribution in a satisfactory way. In Tab. 1 we report the reconstruction performance of our model across all datasets. Fig. 4 shows reconstructions of random trajectories for each dataset. For all our experiments we set the dimension of z_scene and z_pose to 2048 and β = 0.1 unless otherwise stated. During training, we normalize camera poses for each trajectory so that the middle frame in a trajectory becomes the origin of the coordinate system. See Appendix E for ablation experiments.
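The pose normalization mentioned above can be written in one line if poses are stored as homogeneous 4×4 camera-to-world matrices (an assumed representation for this sketch):

import torch

def normalize_trajectory_poses(poses):
    """poses: (L, 4, 4) camera-to-world matrices of one trajectory.
    Re-expresses every pose relative to the middle frame, so that the
    middle frame becomes the origin of the coordinate system."""
    T_mid = poses[poses.shape[0] // 2]
    return torch.linalg.inv(T_mid).unsqueeze(0) @ poses   # T_mid^{-1} @ T_s for every s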

Figure 4: Qualitative reconstruction results of random trajectories on different datasets (one for each column): Vizdoom [21], Replica [60], VLN-CE [23] and ARKitScenes [1]. For each pair of images the left is the ground truth and the right is the reconstruction.

                 #sc-#tr-#im     l1 ↓    PSNR ↑   SSIM ↑   Rot. Err. ↓   Trans. Err. ↓
Vizdoom [21]     1-32-1k         0.004   44.42    0.98     0.01          1.26
Replica [60]     18-100-1k       0.006   38.86    0.99     0.03          0.01
VLN-CE [23]      90-3.6k-600k    0.031   25.17    0.73     0.30          0.02
ARKitScenes [1]  300-1k-600k     0.039   24.51    0.76     0.16          0.04

Table 1: Reconstruction results of the optimization process described in Eq. 2. The first column shows the number of scenes (#sc), trajectories (#tr) and images (#im) per dataset. Due to the large number of images on the VLN-CE [23] and ARKitScenes [1] datasets we sample 10 random images per trajectory to compute the reconstruction metrics.

4.3 Interpolation

In addition, to evaluate the structure of the latent representation obtained from minimizing the optimization problem in Eq. 2, we show interpolation results between pairs of latents (z_i, z_j) in Fig. 5. To render images while interpolating the scene we place a fixed camera at the origin of the coordinate system. We observe a smooth transition of scenes in both geometry (walls, ceilings) and texture (stairs, carpets). More visualizations are included in Appendix E.1.


Figure 5: Interpolation of 3D scenes in latent space (e.g. interpolating the encoded radiance ﬁeld) for

the VLN-CE dataset [23]. Each row corresponds to a different interpolation path.

4.4 Unconditional generative modeling

Given latent representations z ∈ Z that can reconstruct samples x ∈ X with high accuracy, as shown in Sect. 4.2, we now evaluate the capacity of the prior p_{θ_p}(Z) to capture the empirical distribution x ∈ X by learning the distribution of latents z_i ∈ Z. To do so, we sample z ∼ p_{θ_p}(Z) by following the inference process in DDPMs, and then feed z through the decoder network, which results in trajectories of RGB images that are then used for evaluation. We compare our approach with the following baselines: GRAF [56], π-GAN [5] and GSN [7]. We sample 5k images from the predicted and target distributions for each model and dataset and report both FID [14] and SwAV-FID [31] scores. We report quantitative results in Tab. 2, where we can see that GAUDI obtains state-of-the-art performance across all datasets and metrics. We attribute this performance improvement to the fact that GAUDI learns disentangled yet corresponding latents for radiance fields and camera poses, which is key when modeling scenes (see ablations in Appendix E). We note that to obtain these results GAUDI needs to simultaneously find latents with high reconstruction fidelity while also efficiently learning their distribution.

             VizDoom [21]               Replica [60]               VLN-CE [23]                ARKitScenes [1]
             FID ↓         SwAV-FID ↓   FID ↓         SwAV-FID ↓   FID ↓         SwAV-FID ↓   FID ↓          SwAV-FID ↓
GRAF [56]    47.50 ±2.13   5.44 ±0.43   65.37 ±1.64   5.76 ±0.14   90.43 ±4.83   8.65 ±0.27   87.06 ±9.99    13.44 ±0.26
π-GAN [5]    143.55 ±4.81  15.26 ±0.15  166.55 ±3.61  13.17 ±0.20  151.26 ±4.19  14.07 ±0.56  134.80 ±10.60  15.58 ±0.13
GSN [7]      37.21 ±1.17   4.56 ±0.19   41.75 ±1.33   4.14 ±0.02   43.32 ±8.86   6.19 ±0.49   79.54 ±2.60    10.21 ±0.15
GAUDI        33.70 ±1.27   3.24 ±0.12   18.75 ±0.63   1.76 ±0.05   18.52 ±0.11   3.63 ±0.65   37.35 ±0.38    4.14 ±0.03

Table 2: Generative performance of state-of-the-art approaches for generative modeling of radiance fields on 4 scene datasets: Vizdoom [21], Replica [60], VLN-CE [23] and ARKitScenes [1], according to the FID [14] and SwAV-FID [31] metrics.

In Fig. 6 we show samples from the unconditional distribution learnt by GAUDI for different datasets.

We observe that GAUDI is able to generate diverse and realistic 3D scenes from the empirical

distribution which can be rendered from the sampled camera poses.

4.5 Conditional Generative Modeling

In addition to modeling the distribution p(Z), with GAUDI we can also tackle conditional generative problems p(Z|Y), where a conditioning variable y ∈ Y is given to modulate p(Z). For all conditioning variables y we assume the existence of paired data {z, y} to train the conditional model [42, 9, 41]. In this section we show both quantitative and qualitative results for conditional inference problems. The first conditioning variable we consider is textual descriptions of trajectories. Second, we consider a conditional model where randomly sampled RGB images in a trajectory act as conditioning variables.

Figure 6: Different scenes sampled from unconditional GAUDI (one sample per row) and rendered from their corresponding sampled camera poses (one dataset per column): Vizdoom [21], Replica [60], VLN-CE [23] and ARKitScenes [1]. The resolutions are 64×64, 64×64, 128×128 and 64×64, respectively.

Finally, we use a categorical variable that indicates the 3D environment (i.e. the particular indoor space) from which each trajectory was obtained (i.e. a one-hot vector). Tab. 3 shows quantitative results for the different conditional inference problems.

                           FID ↓    SwAV-FID ↓
Text Conditioning          18.50    3.75
Image Conditioning         19.51    3.93
Categorical Conditioning   18.74    3.61

Avg. Δ Per-Environment     FID ↓: −50.79    SwAV-FID ↓: −4.10

Table 3: Quantitative results of conditional generative modeling on the VLN-CE [23] dataset. GAUDI is able to produce high-quality scene renderings with low FID and SwAV-FID scores. The bottom row shows the difference in average per-environment FID and SwAV-FID scores between the conditional and unconditional models.

4.5.1 Text Conditioning

We tackle the challenging task of training a text-conditional model for 3D scene generation. We use the navigation text descriptions provided in VLN-CE [23] to condition our model. These text descriptions contain high-level information about the scene as well as the navigation path (e.g. "Walk out of the bedroom and into the living room", "Exit the room through the swinging doors and then enter the bedroom"). We employ a pre-trained RoBERTa-base [26] text encoder and use its intermediate representation to condition the diffusion model. Fig. 7 shows qualitative results of GAUDI for this task. To the best of our knowledge, this is the first model that allows for conditional 3D scene generation from text in an amortized manner (i.e. without distilling CLIP [40] through a costly optimization problem [17, 28]).
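A hedged sketch of how such a text encoder could be wired in is given below; the Hugging Face calls are standard, but the mean-pooling of token features and the way y enters the denoiser are our own assumptions (the exact conditioning mechanism is described in Appendix C).

import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
text_encoder = RobertaModel.from_pretrained("roberta-base")

def encode_prompts(prompts):
    """Map navigation instructions to conditioning vectors y."""
    tokens = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        features = text_encoder(**tokens).last_hidden_state   # (B, L, 768)
    return features.mean(dim=1)                               # (B, 768)

# During prior training the denoiser then simply receives y as an extra input:
# loss = ((eps - denoiser(z_t, t, y=encode_prompts(batch_texts))) ** 2).mean()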

Figure 7: Text conditional 3D scene generation using GAUDI (one sample per row), for the prompts "Go through the hallway", "Go up the stairs", "Walk into the kitchen" and "Go down the stairs". Our model is able to capture the conditional distributions of scenes by generating multiple plausible scenes and camera paths that match the given text prompts.

4.5.2 Image Conditioning

We now analyze whether GAUDI is able to pick up information from RGB images to predict a distribution over Z. In this experiment we randomly pick images in a trajectory x ∈ X and use them as a conditioning variable y. For this experiment we use trajectories in the VLN-CE dataset [23]. During each training iteration we sample a random image for each trajectory x and use it as the conditioning variable. We employ a pre-trained ResNet-18 [13] as the image encoder. During inference, the resulting conditional GAUDI model is able to sample radiance fields in which the given image is observed from a stochastic viewpoint. In Fig. 8 we show samples from the model conditioned on different RGB images.
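The image-conditioning path admits a similarly small sketch; removing the classification head and using the pooled 512-dimensional feature is our assumption, not a detail stated in the paper.

import torch
from torchvision.models import ResNet18_Weights, resnet18

# A pre-trained ResNet-18 maps one randomly chosen frame of a trajectory
# to a conditioning vector y for the denoising model.
image_encoder = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
image_encoder.fc = torch.nn.Identity()   # drop the classification head
image_encoder.eval()

def encode_image(frame):
    """frame: (3, H, W) ImageNet-normalized RGB image."""
    with torch.no_grad():
        return image_encoder(frame.unsqueeze(0))   # (1, 512) conditioning vector y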

Figure 8: Image conditional 3D scene generation using GAUDI (one sample per row). Given a

conditioned image (top row), our model is able to sample scenes where the same or contextually

similar view is observed from a stochastic viewpoint.

Figure 9: Samples from the GAUDI model conditioned on a categorical variable denoting the indoor scene (one sample per row; the columns correspond to Environment IDs 1, 2 and 3).

4.5.3 Categorical Conditioning

Finally, we analyze how GAUDI performs when conditioned on a categorical variable that indicates the underlying 3D indoor environment in which each trajectory was recorded. We perform experiments on the VLN-CE [23] dataset, where we employ a trainable embedding layer to learn a representation for the categorical variables indicating each environment. We compare the per-environment FID score of the conditional model with its unconditional counterpart. This per-environment FID score is computed only on real images of the same indoor environment that the model is conditioned on. Our hypothesis is that if the model efficiently captures the information in the conditioning variable it should capture the environment-specific distribution better than its unconditional counterpart trained on the same data. In Tab. 3 the last row shows the difference (i.e. the Δ) in the average per-environment FID score between the conditional and unconditional model on the VLN-CE dataset. We observe that the conditional model consistently obtains a better FID score than the unconditional model across all indoor environments, resulting in a sharp reduction of average FID and SwAV-FID scores. In addition, in Fig. 9 we show samples from the model conditioned on a given categorical variable.
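As a small illustration, the categorical conditioning path amounts to a lookup table; the embedding width below is an assumption of this sketch, while the 90 environments correspond to the VLN-CE scene count in Tab. 1.

import torch

env_embedding = torch.nn.Embedding(num_embeddings=90, embedding_dim=512)
y = env_embedding(torch.tensor([3]))   # (1, 512) conditioning vector for environment ID 3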

5 Conclusion

We have introduced GAUDI, a generative model that captures distributions of complex and realistic 3D scenes. GAUDI uses a scalable two-stage approach which first involves learning a latent representation that disentangles radiance fields and camera poses. The distribution of disentangled latent representations is then modeled with a powerful prior. Our model obtains state-of-the-art performance when compared with recent baselines across multiple 3D datasets and metrics. GAUDI can be used for both conditional and unconditional problems, enabling new tasks like generating 3D scenes from text descriptions.

References

[1]

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Yuri Feigin, Peter Fu, Thomas Gebauer, Daniel Kurz,

Tal Dimry, Brandon Joffe, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor

scene understanding using mobile rgb-d data. In Thirty-ﬁfth Conference on Neural Information Processing

Systems Datasets and Benchmarks Track (Round 1), 2021.

[2]

Piotr Bojanowski, Armand Joulin, David Lopez-Paz, and Arthur Szlam. Optimizing the latent space of

generative networks. arXiv preprint arXiv:1707.05776, 2017.

[3] Marcus Carter and Ben Egliston. Ethical implications of emerging mixed reality technologies. 2020.

[4]

Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio

Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efﬁcient geometry-aware 3d generative

adversarial networks. arXiv preprint arXiv:2112.07945, 2021.

[5]

Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-GAN: Periodic

implicit generative adversarial networks for 3d-aware image synthesis. arXiv preprint arXiv:2012.00926,

2020.

[6]

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran

Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments.

arXiv preprint arXiv:1709.06158, 2017.

[7]

Terrance DeVries, Miguel Angel Bautista, Nitish Srivastava, Graham W Taylor, and Joshua M Susskind.

Unconstrained scene generation with locally conditioned radiance ﬁelds. ICCV, 2021.

[8]

Emilien Dupont, Miguel Angel Bautista, Alex Colburn, Aditya Sankar, Carlos Guestrin, Josh Susskind,

and Qi Shan. Equivariant neural rendering. In International Conference on Machine Learning, pages

2761–2770. PMLR, 2020.

[9]

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image

synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,

pages 12873–12883, 2021.

[10]

Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. Stylenerf: A style-based 3d-aware generator

for high-resolution image synthesis. arXiv preprint arXiv:2110.08985, 2021.

[11]

Pengsheng Guo, Miguel Angel Bautista, Alex Colburn, Liang Yang, Daniel Ulbricht, Joshua M Susskind,

and Qi Shan. Fast and explicit neural view synthesis. In Proceedings of the IEEE/CVF Winter Conference

on Applications of Computer Vision, pages 3791–3800, 2022.

[12] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.

[13]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.

In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[14]

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans

trained by a two time-scale update rule converge to a local nash equilibrium. arXiv preprint

arXiv:1706.08500, 2017.

[15]

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural

Information Processing Systems, 33:6840–6851, 2020.

[16]

Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching.

Journal of Machine Learning Research, 6(4), 2005.

[17]

Ajay Jain, Ben Mildenhall, Jonathan T Barron, Pieter Abbeel, and Ben Poole. Zero-shot text-guided object

generation with dream ﬁelds. arXiv preprint arXiv:2112.01455, 2021.

[18]

Niharika Jain, Alberto Olmo, Sailik Sengupta, Lydia Manikonda, and Subbarao Kambhampati. Imperfect

imaganation: Implications of gans exacerbating biases on facial data augmentation and snapchat selﬁe

lenses. arXiv preprint arXiv:2001.09528, 2020.


[19]

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial

networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,

pages 4401–4410, 2019.

[20]

Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and

improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision

and Pattern Recognition, pages 8110–8119, 2020.

[21]

Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski. Vizdoom:

A doom-based ai research platform for visual reinforcement learning. In 2016 IEEE Conference on

Computational Intelligence and Games (CIG), pages 1–8. IEEE, 2016.

[22]

Adam R Kosiorek, Heiko Strathmann, Daniel Zoran, Pol Moreno, Rosalia Schneider, Sona Mokrá, and

Danilo Jimenez Rezende. Nerf-vae: A geometry aware 3d scene generative model. In International

Conference on Machine Learning, pages 5742–5752. PMLR, 2021.

[23]

Jacob Krantz, Erik Wijmans, Arjun Majundar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph:

Vision and language navigation in continuous environments. In European Conference on Computer Vision

(ECCV), 2020.

[24]

Zihang Lai, Sifei Liu, Alexei A Efros, and Xiaolong Wang. Video autoencoder: self-supervised disentan-

glement of static 3d structure and motion. In Proceedings of the IEEE/CVF International Conference on

Computer Vision, pages 9730–9740, 2021.

[25]

Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel ﬁelds.

arXiv preprint arXiv:2007.11571, 2020.

[26]

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis,

Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv

preprint arXiv:1907.11692, 2019.

[27]

Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy

networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE Conference on

Computer Vision and Pattern Recognition, pages 4460–4470, 2019.

[28]

Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Rana Hanocka. Text2mesh: Text-driven neural

stylization for meshes. arXiv preprint arXiv:2112.03221, 2021.

[29]

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng.

Nerf: Representing scenes as neural radiance ﬁelds for view synthesis. arXiv preprint arXiv:2003.08934,

2020.

[30]

Yisroel Mirsky and Wenke Lee. The creation and detection of deepfakes: A survey. ACM Computing

Surveys (CSUR), 54(1):1–41, 2021.

[31]

Stanislav Morozov, Andrey Voynov, and Artem Babenko. On self-supervised image representations for

GAN evaluation. In International Conference on Learning Representations, 2021.

[32]

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya

Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided

diffusion models. arXiv preprint arXiv:2112.10741, 2021.

[33]

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In

International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.

[34]

Michael Niemeyer and Andreas Geiger. Giraffe: Representing scenes as compositional generative neural

feature ﬁelds. arXiv preprint arXiv:2011.12100, 2020.

[35]

Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf:

Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF

Conference on Computer Vision and Pattern Recognition, pages 165–174, 2019.

[36]

Despoina Paschalidou, Amlan Kar, Maria Shugrina, Karsten Kreis, Andreas Geiger, and Sanja Fidler.

Atiss: Autoregressive transformers for indoor scene synthesis. Advances in Neural Information Processing

Systems, 34, 2021.

[37]

Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. Convolutional

occupancy networks. arXiv preprint arXiv:2003.04618, 2, 2020.

[38]

Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reason-

ing with a general conditioning layer. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence,

volume 32, 2018.

[39]

Benjamin Planche, Xuejian Rong, Ziyan Wu, Srikrishna Karanam, Harald Kosch, YingLi Tian, Jan Ernst,

and Andreas Hutter. Incremental scene synthesis. In Advances in Neural Information Processing Systems,

pages 1668–1678, 2019.


[40]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish

Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from

natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR,

2021.

[41]

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and

Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning,

pages 8821–8831. PMLR, 2021.

[42]

Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-ﬁdelity images with vq-vae-2.

Advances in neural information processing systems, 32, 2019.

[43]

Daniel Rebain, Mark Matthews, Kwang Moo Yi, Dmitry Lagun, and Andrea Tagliasacchi. Lolnerf: Learn

from one look. arXiv preprint arXiv:2111.09996, 2021.

[44]

Christian Reiser, Songyou Peng, Yiyi Liao, and Andreas Geiger. Kilonerf: Speeding up neural radiance

ﬁelds with thousands of tiny mlps. In Proceedings of the IEEE/CVF International Conference on Computer

Vision, pages 14335–14345, 2021.

[45]

Xuanchi Ren and Xiaolong Wang. Look outside the room: Synthesizing a consistent long-term 3d scene

video from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern

Recognition (CVPR), 2022.

[46]

Salah Rifai, Grégoire Mesnil, Pascal Vincent, Xavier Muller, Yoshua Bengio, Yann Dauphin, and Xavier

Glorot. Higher order contractive auto-encoder. In Joint European conference on machine learning and

knowledge discovery in databases, pages 645–660. Springer, 2011.

[47]

Daniel Ritchie, Kai Wang, and Yu-an Lin. Fast and ﬂexible indoor scene synthesis via deep convolu-

tional generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern

Recognition, pages 6182–6190, 2019.

[48]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution

image synthesis with latent diffusion models. arXiv preprint arXiv:2112.10752, 2021.

[49]

Robin Rombach, Patrick Esser, and Björn Ommer. Geometry-free view synthesis: Transformers and no 3d

priors, 2021.

[50]

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical

image segmentation. In International Conference on Medical image computing and computer-assisted

intervention, pages 234–241. Springer, 2015.

[51]

Negar Rostamzadeh, Emily Denton, and Linda Petrini. Ethics and creativity in computer vision. arXiv

preprint arXiv:2112.03111, 2021.

[52]

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed

Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho,

David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language

understanding, 2022.

[53]

Mehdi S. M. Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani

Vora, Mario Lucic, Daniel Duckworth, Alexey Dosovitskiy, Jakob Uszkoreit, Thomas Funkhouser, and

Andrea Tagliasacchi. Scene Representation Transformer: Geometry-Free Novel View Synthesis Through

Set-Latent Scene Representations. CVPR, 2022.

[54]

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved

techniques for training gans. NeurIPS’16, page 2234–2242, Red Hook, NY, USA, 2016. Curran Associates

Inc.

[55]

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian

Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In

Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9339–9347, 2019.

[56]

Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. GRAF: Generative radiance ﬁelds for

3d-aware image synthesis. arXiv preprint arXiv:2007.02442, 2020.

[57]

Edward J Smith and David Meger. Improved adversarial systems for 3d object generation and reconstruction.

In Conference on Robot Learning, pages 87–96. PMLR, 2017.

[58]

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised

learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages

2256–2265. PMLR, 2015.

[59]

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint

arXiv:2010.02502, 2020.

[60]

Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J Engel, Raul

Mur-Artal, Carl Ren, Shobhit Verma, et al. The replica dataset: A digital replica of indoor spaces. arXiv

preprint arXiv:1906.05797, 2019.


[61]

Hoang Thanh-Tung and Truyen Tran. Catastrophic forgetting and mode collapse in gans. In 2020

International Joint Conference on Neural Networks (IJCNN), pages 1–10. IEEE, 2020.

[62]

Patrick Tinsley, Adam Czajka, and Patrick Flynn. This face does not exist... but it might be yours! identity

leakage in generative models. In Proceedings of the IEEE/CVF Winter Conference on Applications of

Computer Vision, pages 1320–1328, 2021.

[63]

Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. Advances

in Neural Information Processing Systems, 34:11287–11302, 2021.

[64]

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural

information processing systems, 30, 2017.

[65]

Xinpeng Wang, Chandan Yeshwanth, and Matthias Nießner. Sceneformer: Indoor scene generation with

transformers. In 2021 International Conference on 3D Vision (3DV), pages 106–115. IEEE, 2021.

[66]

Daniel Watson, William Chan, Jonathan Ho, and Mohammad Norouzi. Learning fast samplers for diffusion

models by differentiating through sample quality. In International Conference on Learning Representations,

2021.

[67]

Daniel Watson, Jonathan Ho, Mohammad Norouzi, and William Chan. Learning to efﬁciently sample from

diffusion probabilistic models. arXiv preprint arXiv:2106.03802, 2021.

[68]

Alex Yu, Sara Fridovich-Keil, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa.

Plenoxels: Radiance ﬁelds without neural networks. arXiv preprint arXiv:2112.05131, 2021.

[69]

Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance ﬁelds from one or

few images. arXiv preprint arXiv:2012.02190, 2020.

[70]

Peng Zhou, Lingxi Xie, Bingbing Ni, and Qi Tian. Cips-3d: A 3d-aware generator of gans based on

conditionally-independent pixel synthesis. arXiv preprint arXiv:2110.09788, 2021.

13

A Limitations, Future Work and Societal Impact

Although GAUDI represents a step forward in generative models for 3D scenes, we would like to clearly discuss its limitations. One current limitation of our model is the fact that inference is not real-time. The reason for this is twofold: (i) sampling from the DDPM prior is slow, even if it is amortized over the whole 3D scene. Techniques for improving inference efficiency in DDPMs have been recently proposed [59, 66, 67] and can complement GAUDI. (ii) Rendering from a radiance field is not as efficient as rendering other 3D structures like meshes. Recent works have also tackled this problem [25, 68, 44] and could be applied to our approach. In addition, many of the latest image generative models [41, 32, 52] use multiple stages of up-sampling through diffusion models to render high-resolution images. These up-sampling stages could be directly applied to GAUDI. One could also consider studying efficient encoders to replace the optimization process used to find latents. While attempts have been made at using transformers [53] for short trajectories (5-10 frames), it is unclear how to scale to thousands of images per trajectory like the ones in [1]. Finally, the main limitation preventing a model like GAUDI from exhibiting improved generation and generalization abilities is the lack of massive-scale and open-domain 3D datasets, in particular ones with other associated modalities like textual descriptions.

When considering the societal impact of generative models, a few aspects that need attention are the use of generative models for creating disingenuous data, e.g. "DeepFakes" [30], training data leakage and privacy [62], and amplification of the biases present in training data [18]. One specific ethical consideration that applies to GAUDI is the impact that a model which can easily create immersive 3D scenes can have on future generations and their detachment from reality [3]. For an in-depth review of ethical considerations in generative modeling we refer the reader to [51].

B Experimental Settings and Details

In this section we describe details about the data and model hyper-parameters. For all experiments our latents zscene and zpose have 2048 dimensions. In the first stage, when latents are optimized via Eq. 2, zscene is reshaped to an 8×8×32 feature map before being fed to the scene decoder network. In the second stage, when training the DDPM prior, we reshape zscene and zpose into an 8×8×64 latent and leverage the power of a UNet [50] denoising architecture.
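For concreteness, a minimal sketch of these reshaping steps is shown below; the tensor names, and the assumption that the two 8×8×32 maps are stacked along the channel dimension in the second stage, are ours and serve only as illustration.

```python
import torch

batch = 4
z_scene = torch.randn(batch, 2048)  # per-trajectory scene latent (first-stage optimization)
z_pose = torch.randn(batch, 2048)   # per-trajectory camera pose latent

# First stage: z_scene is reshaped into an 8x8x32 feature map for the scene decoder.
scene_feat = z_scene.view(batch, 32, 8, 8)

# Second stage (DDPM prior): the latents form a single 8x8x64 map for the UNet denoiser.
# We assume here that the two 8x8x32 maps are stacked along the channel dimension.
prior_latent = torch.cat([z_scene.view(batch, 32, 8, 8),
                          z_pose.view(batch, 32, 8, 8)], dim=1)  # (batch, 64, 8, 8)
```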

For each dataset, trajectories have different lengths, physical scales, as well as near and far planes for rendering, which we adjust accordingly in our model.

Vizdoom [21]: In Vizdoom, trajectories contain 600 steps on average. In each step the camera is allowed to move forward 0.5 game units or rotate left or right by 30 degrees. We set the unit length of an element in the tri-plane representation to 0.05 game units (meaning each latent code wxyz covers a cubic volume of space with side 0.05 game units). The near plane is at 0.0 game units and the far plane at 800 game units. We use the data and splits provided by [7].

Replica [60]: In Replica, all trajectories contain 100 steps. In each step, the camera can either rotate left or right by 25 degrees or move forward 15 centimeters. We set the unit length of an element in the tri-plane representation to 25 centimeters (meaning each latent code wxyz covers a cubic volume of space with side 25 centimeters). The near plane is at 0.0 meters and the far plane at 6 meters. We use the data and splits provided by [7].

VLN-CE [23]: In VLN-CE, trajectories contain a variable number of steps, between 30 and 150 approximately. In each step, the camera can either rotate left or right by 25 degrees or move forward 15 centimeters. We set the unit length of an element in the tri-plane representation to 50 centimeters. The near plane is at 0.0 meters and the far plane at 12 meters. We use the data and training splits provided by [23].

ARKitScenes [1]: In ARKitScenes, trajectories contain around 1000 steps on average. In these trajectories the camera is able to move continuously in any direction and orientation. We set the unit length of an element in the tri-plane representation to 20 centimeters. The near plane is at 0.0 meters and the far plane at 8 meters. We use the 3DOD split of the data provided by [1].

C Decoder Architecture Design and Details

In this section we describe the decoder model in Fig. 2 in the main paper. The decoder network is composed of 3 modules: the scene decoder, the camera pose decoder and the radiance field decoder.

• The scene decoder network follows the architecture of the VQGAN decoder [9], parameterized as a convolutional network with a self-attention layer at the end of each block. The output of the scene decoder is a feature map of shape 64×64×768. To obtain the tri-plane representation W = [Wxy, Wxz, Wyz] we split the channel dimension of this feature map into 3 chunks of equal size 64×64×256.

Figure 10: (a) Architecture of the camera pose decoder network. (b) Architecture of the radiance field network.

• The camera pose decoder is implemented as an MLP with 4 conditional batch normalization (CBN) blocks with residual connections and a hidden size of 256, as in [27]. The conditional batch normalization parameters are predicted from zpose. We apply positional encoding to the inputs of the camera pose decoder (s ∈ [−1, 1]). Fig. 10(a) shows the architecture of the camera pose decoder module.

• The radiance field decoder is implemented as an MLP with 8 linear layers with a hidden dimension of 512 and LeakyReLU activations. We apply positional encoding to the inputs of the radiance field decoder (p ∈ R3) and concatenate the conditioning variable wxyz to the output of every other layer in the MLP, starting from the input layer (i.e. layers 0, 2, 4, and 6). To improve efficiency, we render a small-resolution feature map with 512 channels (two times smaller than the output resolution) instead of an RGB image, and use a UNet [50] with additional deconvolution layers to predict the final image [7, 34]. Fig. 10(b) shows the architecture of the radiance field decoder module (a code sketch of the tri-plane split and this decoder follows this list).
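To make the tri-plane split and the radiance field decoder concrete, a minimal PyTorch sketch is given below. This is an illustration under stated assumptions, not the exact implementation: the module and function names are ours, wxyz is assumed to be the concatenation of the features sampled from the three planes (hence 3 × 256 = 768 dimensions), the positional encoding of p is assumed to use 10 frequency bands (63 dimensions), the output is assumed to be one density value plus a 512-channel feature, and the placement of the wxyz concatenation follows our reading of "every other layer, starting from the input layer".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def split_triplane(feat_map: torch.Tensor):
    """Split the (B, 768, 64, 64) scene-decoder output into the tri-plane
    representation W = [W_xy, W_xz, W_yz], three chunks of shape (B, 256, 64, 64)."""
    return torch.chunk(feat_map, chunks=3, dim=1)

class RadianceFieldDecoder(nn.Module):
    """MLP with 8 linear layers, hidden size 512 and LeakyReLU activations.
    The tri-plane feature w_xyz of each query point is concatenated at layers
    0, 2, 4 and 6."""
    def __init__(self, pe_dim: int = 63, w_dim: int = 768,
                 hidden: int = 512, out_dim: int = 1 + 512):
        super().__init__()
        self.layers = nn.ModuleList()
        in_dim = pe_dim
        for i in range(8):
            extra = w_dim if i % 2 == 0 else 0  # w_xyz skip-in at even layers
            self.layers.append(nn.Linear(in_dim + extra, hidden))
            in_dim = hidden
        self.out = nn.Linear(hidden, out_dim)   # density + feature channels

    def forward(self, p_enc: torch.Tensor, w_xyz: torch.Tensor) -> torch.Tensor:
        h = p_enc                                # positionally encoded 3D points
        for i, layer in enumerate(self.layers):
            if i % 2 == 0:
                h = torch.cat([h, w_xyz], dim=-1)
            h = F.leaky_relu(layer(h))
        return self.out(h)

# Example: split a dummy decoder output and query the MLP for 1024 points.
w_xy, w_xz, w_yz = split_triplane(torch.randn(2, 768, 64, 64))
decoder = RadianceFieldDecoder()
sigma_and_feat = decoder(torch.randn(2, 1024, 63), torch.randn(2, 1024, 768))  # (2, 1024, 513)
```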

For training we initialize all latents z = 0 and train them jointly with the parameters of the 3 modules. We use the Adam optimizer with a learning rate of 0.001 for latents and 0.0001 for model parameters. We train our model on 8 A100 NVIDIA GPUs for 2-7 days (depending on dataset size), with a batch size of 16 trajectories where we randomly sample 2 images per trajectory.
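A minimal sketch of this two-learning-rate setup is shown below; the stand-in decoder module, the trajectory count, and the variable names are ours, for illustration only.

```python
import torch
import torch.nn as nn

decoder = nn.Sequential(nn.Linear(64, 64))   # stand-in for the three decoder modules
num_trajectories = 1000                       # placeholder for the dataset size

# All latents are initialized to zero and optimized jointly with the decoder parameters.
latents = nn.Parameter(torch.zeros(num_trajectories, 2 * 2048))  # z = [z_scene, z_pose]

optimizer = torch.optim.Adam([
    {"params": [latents], "lr": 1e-3},             # learning rate 0.001 for latents
    {"params": decoder.parameters(), "lr": 1e-4},  # learning rate 0.0001 for model parameters
])
```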

D Prior Architecture Design and Details

We employ a Denoising Diffusion Probabilistic Model (DDPM) [15] to learn the distribution p(Z). Specifically, we adopt the UNet architecture from [33] to denoise the latent at each timestep. During training, we sample t ∈ {1, ..., T} uniformly and take a gradient descent step on θp from Eq. 3. Different from [33], we keep the original DDPM training scheme with a fixed time-dependent covariance matrix and a linear noise schedule. During inference, we start by sampling a latent from a zero-mean unit-variance Gaussian distribution and perform the denoising step iteratively. To accelerate sampling, we leverage DDIM [59] to denoise in only 50 steps by modeling the deterministic non-Markovian diffusion process.
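As a reference for the inference procedure, a generic sketch of the deterministic DDIM update (eta = 0) over a 50-step sub-sequence is shown below; `eps_model` stands in for the trained denoising UNet and its call signature is an assumption, not the exact interface in our codebase.

```python
import torch

@torch.no_grad()
def ddim_sample(eps_model, alphas_cumprod, shape, num_steps=50, device="cpu"):
    """Deterministic DDIM sampling: start from Gaussian noise and denoise in num_steps."""
    T = alphas_cumprod.shape[0]
    timesteps = torch.linspace(T - 1, 0, num_steps, device=device).long()
    z = torch.randn(shape, device=device)                   # zero-mean unit-variance start
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps \
            else torch.tensor(1.0, device=device)
        eps = eps_model(z, t.expand(shape[0]))               # predicted noise at step t
        z0 = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # predicted clean latent
        z = a_prev.sqrt() * z0 + (1 - a_prev).sqrt() * eps   # deterministic DDIM step
    return z
```

Called with the trained UNet and the cumulative products of (1 − βt), this returns a denoised latent that can then be reshaped into z = [zscene, zpose] and passed to the decoder.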

For conditional generative modeling tasks, the conditioning mechanism should be general enough to support conditioning inputs from diverse modalities (i.e. text, image, categorical class, etc.). To fulfill this requirement, we first project the conditional inputs into an embedding representation c via a modality-specific encoder. For text conditioning, we employ a pre-trained RoBERTa-base [26]. For image conditioning, we employ a ResNet-18 [13] pre-trained on ImageNet. For categorical conditioning, we employ a trainable per-environment embedding layer. We freeze the encoders for text and image inputs to avoid over-fitting issues. We borrow the cross-attention module from LDM [48] to fuse the conditioning representation c with the intermediate activations at multiple levels in the UNet [50]. The cross-attention module implements an attention mechanism where the keys and values are generated from c and the queries are generated from the intermediate activations in the UNet architecture (we refer readers to [48] for more details).
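A minimal single-head sketch of this cross-attention fusion is shown below; the module and variable names are ours, and in the prior the mechanism is applied with multiple heads and at multiple UNet levels.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Fuse a conditioning sequence c (e.g. text token embeddings) with a UNet feature
    map: queries come from the UNet activations, keys and values from c."""
    def __init__(self, feat_dim: int, cond_dim: int, attn_dim: int = 256):
        super().__init__()
        self.to_q = nn.Linear(feat_dim, attn_dim)
        self.to_k = nn.Linear(cond_dim, attn_dim)
        self.to_v = nn.Linear(cond_dim, attn_dim)
        self.proj = nn.Linear(attn_dim, feat_dim)

    def forward(self, feat: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) UNet activations, cond: (B, L, cond_dim) embeddings
        B, C, H, W = feat.shape
        x = feat.flatten(2).transpose(1, 2)                 # (B, H*W, C)
        q, k, v = self.to_q(x), self.to_k(cond), self.to_v(cond)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        out = self.proj(attn @ v)                           # (B, H*W, C)
        return feat + out.transpose(1, 2).reshape(B, C, H, W)  # residual fusion

# Example: fuse text embeddings with an 8x8 UNet activation map.
block = CrossAttentionBlock(feat_dim=224, cond_dim=768)
fused = block(torch.randn(2, 224, 8, 8), torch.randn(2, 16, 768))
```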

For training the DDPM prior, we use the Adam optimizer and a learning rate of 4.0e−06. We train our model on 1 A100 NVIDIA GPU for 1-3 days for unconditional prior learning and 3-5 days for conditional prior learning experiments (depending on dataset size), with a batch size of 256 and 32, respectively. For the hyper-parameters of the DDPM model, we set the number of diffusion steps to 1000, the noise schedule as linearly decreasing from 0.0195 to 0.0015, the base channel size to 224, attention resolutions at [8, 4, 2, 1], and the number of attention heads to 8.
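For reference, these hyper-parameters can be collected as follows (a sketch; the dictionary keys are ours and do not correspond to a specific configuration format):

```python
import torch

ddpm_prior_config = {
    "num_diffusion_steps": 1000,
    # Linear noise schedule, decreasing from 0.0195 to 0.0015 over the diffusion steps.
    "betas": torch.linspace(0.0195, 0.0015, 1000),
    "base_channels": 224,
    "attention_resolutions": [8, 4, 2, 1],
    "num_attention_heads": 8,
    "optimizer": {"name": "adam", "lr": 4.0e-6},
    "batch_size": {"unconditional": 256, "conditional": 32},
}
```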

E Ablation Study


We now provide additional ablation studies for the critical components in GAUDI. First, we analyze how the dimensionality of the latent code zd and the magnitude of β affect the optimization problem defined in Eq. 2. Tab. 4 shows reconstruction metrics for both RGB images and camera poses for a subset of 100 trajectories in the VLN-CE dataset [23]. We observe a clear trend where increasing the magnitude of β makes it harder to find latent codes with high reconstruction accuracy. This drop in accuracy is expected since β controls the amount of noise in latent codes during training. Finally, we observe that reconstruction performance starts to degrade when the latent code dimensionality grows past 2048.

                          l1 ↓       PSNR ↑    SSIM ↑    Rot. Err. ↓    Trans. Err. ↓
β = 0.1,  zd = 2048       7.63e-3    39.12     0.984     4.61e-3        2.90e-3
β = 0.1,  zd = 4096       7.89e-3    38.55     0.982     4.91e-3        2.76e-3
β = 0.1,  zd = 8192       9.02e-3    36.33     0.978     5.62e-3        3.36e-3
β = 1.0,  zd = 2048       1.00e-2    34.82     0.972     6.32e-3        3.77e-3
β = 1.0,  zd = 4096       1.11e-2    34.46     0.965     7.27e-3        5.69e-3
β = 1.0,  zd = 8192       1.54e-2    32.28     0.916     1.11e-2        7.13e-3
β = 10.0, zd = 8192       3.89e-2    24.89     0.799     7.59e-2        3.61e-2
β = 10.0, zd = 4096       9.25e-2    17.52     0.499     1.56e-1        6.30e-2
β = 10.0, zd = 2048       1.35e-1    12.74     0.275     5.25e-1        1.29e-1

Table 4: Ablation experiment for the critical parameters of the optimization process described in Eq. 2.

In addition, we also provide ablation experiments for the second stage of our model, where we learn the prior p(Z). In particular, we ablate critical factors of our model: the importance of learning corresponding scene and pose latents, the width of the denoising network in the DDPM prior, and the noise scale parameter β. In Tab. 5 we show results for each factor. In the first two rows of Tab. 5 we show the result of training the prior while breaking the correspondence of z = [zpose, zscene]. We break this correspondence by forming random pairs of z = [zpose, zscene] after optimizing the latent representations, and then training the prior on these random pairs. We observe that training the prior to render scenes from a random pose latent impacts both the FID and SwAV-FID scores substantially, which supports our claim that the distribution of valid camera poses depends on the scene. In addition, we can see how the width of the denoising model affects performance: by increasing the number of channels, the DDPM prior is able to better capture the distribution of latents. Finally, we also show how different noise scales β impact the capacity of the generative model to capture the distribution of scenes. All results in Tab. 5 are computed on the full VLN-CE dataset [23].

                                           VLN-CE [23]
                                           FID ↓     SwAV-FID ↓
GAUDI                                      18.52     3.63
GAUDI w. Random Pose                       83.66     10.73
Base Channel Size = 64                     104.27    13.21
Base Channel Size = 128                    22.04     4.35
Base Channel Size = 192                    18.61     3.79
Base Channel Size = 224                    18.52     3.63
Noise Scale β = 0.0                        18.48     3.68
Noise Scale β = 0.1 (same as 1st stage)    18.52     3.63
Noise Scale β = 0.2                        18.48     3.67
Noise Scale β = 0.5                        20.20     4.11

Table 5: Ablation study for different design choices of GAUDI.

In Tab. 6 we report ablation results for the conditioning mechanism used to modulate the denoising architecture in the DDPM prior. We compare cross-attention style conditioning as in LDM [48] with FiLM style conditioning [38]. For FiLM style conditioning, we take the mean of the conditioning representation c across the spatial dimension and project it into the same space as the denoising timestep embedding. We then take the sum of the conditioning and timestep embeddings and predict the scaling and shift factors of the affine transformation applied to the UNet intermediate activations. Comparing the two conditioning mechanisms in Tab. 6, we observe that cross-attention style conditioning performs better than FiLM style conditioning across all our conditional generative modeling experiments.
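A minimal sketch of this FiLM-style path is shown below; the module and variable names are ours and only a single modulation site is shown.

```python
import torch
import torch.nn as nn

class FiLMConditioning(nn.Module):
    """FiLM-style conditioning: pool the conditioning sequence c, project it to the
    timestep-embedding space, add it to the timestep embedding, and predict a
    per-channel scale and shift applied to UNet intermediate activations."""
    def __init__(self, cond_dim: int, time_dim: int, num_channels: int):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, time_dim)
        self.to_scale_shift = nn.Linear(time_dim, 2 * num_channels)

    def forward(self, feat: torch.Tensor, cond: torch.Tensor, t_emb: torch.Tensor):
        # feat: (B, C, H, W), cond: (B, L, cond_dim), t_emb: (B, time_dim)
        c = self.cond_proj(cond.mean(dim=1))               # mean over the spatial/token dim
        scale, shift = self.to_scale_shift(t_emb + c).chunk(2, dim=-1)
        return feat * scale[:, :, None, None] + shift[:, :, None, None]

# Example: modulate an 8x8 activation map with pooled text embeddings.
film = FiLMConditioning(cond_dim=768, time_dim=512, num_channels=224)
out = film(torch.randn(2, 224, 8, 8), torch.randn(2, 16, 768), torch.randn(2, 512))
```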

                        Text Conditioning       Image Conditioning      Categorical Conditioning
                        FID ↓    SwAV-FID ↓     FID ↓    SwAV-FID ↓     FID ↓    SwAV-FID ↓
FiLM Module [38]        20.99    4.11           21.01    4.21           18.75    3.63
Cross Attention [48]    18.50    3.75           19.51    3.93           18.74    3.61

Table 6: Ablation study for the conditioning mechanism of GAUDI.

E.1 Additional Visualizations

In this section we provide additional visualizations, both as figures in this appendix and as videos attached in the supplementary material. In Fig. 11 we provide additional interpolations between random pairs of latents obtained for the VLN-CE dataset [23], where each row represents an interpolation path between a random pair of latents (i.e. rightmost and leftmost columns). We can see how the model tends to produce smoothly changing interpolation paths which align similar scene content. In addition, we refer readers to the folder ./interpolations, which contains videos of interpolations; for each interpolated scene we navigate it immersively by moving the camera forwards and rotating left and right.

In addition, we provide more visualizations of samples from the unconditional GAUDI model in Fig. 12 for VLN-CE [23], Fig. 13 for ARKitScenes [1] and Fig. 14 for Replica [60]. In all these figures, each row represents a sample from the prior that is rendered from its corresponding sampled camera path. We note how these qualitative results reinforce the fidelity and variability of the distribution captured by GAUDI, which is also reflected in the quantitative results in Tab. 2 of the main paper. In addition, the folder ./uncond_samples contains videos of more samples from the unconditional GAUDI model for all datasets.

Finally, the folder ./cond_samples contains a video showing samples from GAUDI conditioned on different modalities like text, images or categorical variables. These visualizations correspond to the results in Sect. 4.5 of the main paper.

F License

Due to licensing issues we cannot release the VLN-CE [23] raw trajectory data, and we refer the reader to https://github.com/jacobkrantz/VLN-CE and to the license of the Matterport3D data: http://kaldir.vc.in.tum.de/matterport/MP_TOS.pdf.
