IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 2025.
LiLMaps: Learnable Implicit Language Maps
Evgenii Kruzhkov
Autonomous Intelligent Systems, Computer Science Institute VI
University of Bonn, Germany
ekruzhkov@ais.uni-bonn.de
Sven Behnke
Autonomous Intelligent Systems, Computer Science Institute VI - Intelligent Systems and Robotics
Center for Robotics and the Lamarr Institute for Machine Learning and Artificial Intelligence
University of Bonn, Germany
behnke@cs.uni-bonn.de
Abstract
One of the current trends in robotics is to employ large
language models (LLMs) to provide non-predefined com-
mand execution and natural human-robot interaction. It
is useful to have an environment map together with its
language representation, which can be further utilized by
LLMs. Such a comprehensive scene representation en-
ables numerous ways of interaction with the map for au-
tonomously operating robots. In this work, we present
an approach that enhances incremental implicit mapping
through the integration of vision-language features. Specif-
ically, we (i) propose a decoder optimization technique for
implicit language maps which can be used when new ob-
jects appear on the scene, and (ii) address the problem of
inconsistent vision-language predictions between different
viewing positions. Our experiments demonstrate the ef-
fectiveness of LiLMaps and solid improvements in perfor-
mance.
1. Introduction
Classic robotic maps are commonly used for estimat-
ing distances to obstacles and costs of motions in naviga-
tion and localization tasks. However, more comprehensive
tasks, as well as natural human-robot interaction, may re-
quire a deeper understanding of the environment, and thus
imply more advanced map representations. For example,
visual-language navigation is the task where a robot must
interpret a natural language command from a non-expert
user and proceed towards the goal according to the com-
mand. The environment might be unknown in advance, but
the robot still must navigate in the shortest possible time.
Figure 1. Reconstructed implicit language map built with
LiLMaps. Semantic colors are assigned based on the similarity
of reconstructed language features and CLIP [33] encodings of se-
mantic categories from the Matterport3D dataset [5].
From this example, it is clear that a map that makes it easy to correlate a given language command with a partially or fully mapped environment can be more useful than a pure obstacle costmap.
In this work, we take a step towards the creation of efficient yet compact natural language environment representations and introduce Learnable implicit Language Maps (LiLMaps). We chose an implicit representation because of its ability to represent data compactly and for the possibility of further detailed reconstruction.
Recent studies in implicit mapping demonstrate outstanding results in geometry reconstruction. While the geometry decoders in these studies can easily be pre-trained or even trained in the first few iterations, we demonstrate that LiLMaps performs better than pre-trained language decoders, because some language features can be poorly represented in them. Moreover, pre-training the decoder to represent every possible language feature could make its structure and training process significantly more complicated and reduce its flexibility for different applications.
Another challenging problem that frequently appears in
incremental language learning is the inconsistency of mea-
surements taken from different viewing positions. Pre-
cise range sensors, such as LiDARs and RGB-D cameras,
are commonly used in implicit mapping, but usually they
do not provide contradictory measurements. However, in
visual-language navigation tasks, information about the en-
vironment is often derived from RGB images. Vision-
language features extracted from RGB images may have
many sources of inconsistency: a painting can be recog-
nized as a wall from a greater distance; a bed object can be
misclassified as a sofa at different angles of view; objects
on image borders and occluded objects might not be visi-
ble enough to provide correct features; inaccurate detection
on the object edge can spoil features of the objects behind
them; etc.
LiLMaps focuses on incremental implicit language map-
ping, i.e., when new observations become available incre-
mentally, one-by-one. This is a typical condition for SLAM,
and LiLMaps can be integrated into existing implicit SLAM
approaches with minimal changes. We achieve this with the
following key techniques that are presented in this work:
Adaptive Language Decoder Optimization dynamically updates the decoder to newly discovered language features in the environment, providing flexible and sufficient coverage of language representations.
Measurement Update Strategy adjusts incoming measurements by reusing accumulated and implicitly stored knowledge about the environment to reduce measurement inconsistency.
Our experiments show that LiLMaps enables incremen-
tal vision-language environment exploration with just a
small overhead.
2. Related Works
Language and Vision-Language Models. Dynamic exe-
cution of natural language commands has been an active
research topic for a long time [2,13]. Previous works of-
ten propose custom environment representations that are
difficult to reuse in real applications. Recently, large lan-
guage models (LLMs) have received increased attention in
this field. The advantage of LLMs is their ability to be ap-
plied to a wide range of tasks. LLMs can enhance robot
abilities to understand and execute natural language com-
mands [14,29]. In addition, LLMs have been shown to
be successful in guiding object grasping [9,23], naviga-
tion [1,6,15,34,37] and scene understanding [7,11,31].
Vision-Language Models (VLMs) [10,20,22,35], which ex-
tract language information about the environment from the
provided images, are frequently used in conjunction with
LLMs. For instance, CLIP [33] is one of the most widely
used models capable of mapping images into a natural lan-
guage space.
Often, VLMs transfer only single or batched images into
a natural language vector space. However, some tasks may
benefit from mapping the entire environment into the lan-
guage space. For example, VLMaps [15] suggests enhanc-
ing 2D maps with language features and then demonstrates
that navigation and detection tasks can be solved directly
on these enhanced maps. Although VLMaps can be used
alongside simultaneous localization and mapping (SLAM),
its performance depends on localization and mapping qual-
ity, and the method is limited to building only 2D language
maps. On the other hand, OpenScene [31] can generate
maps as 3D point clouds with corresponding language fea-
tures. However, the method performs best in batch-like op-
erations, where all RGB images of the environment, their
poses, and 3D point clouds are available in advance, mean-
ing no real-time data is processed. Another approach, Con-
ceptFusion [16], demonstrates that language features can be
fused into 3D maps using traditional SLAM approaches.
SAM3D [41] projects SAM segmentation masks [19] into
3D and creates 3D scene masks.
Implicit Representations. Implicit representations [3,
24,26,30,36] have gained popularity for their compact-
ness and ability to achieve high-resolution reconstructions.
They demonstrate great capabilities in environment map-
ping. iMAP [38] uses an RGB-D sensor to perform a real-
time SLAM task. NICE-SLAM [46] extended the possible
sizes of the mapped environment. SHINE-Mapping [43]
demonstrated implicit mapping of outdoor environments.
Recently, works based on Gaussian splatting [17] demon-
strated exceptional results [24,28,44,45]. In addition to ge-
ometry, implicit maps demonstrate a successful reconstruc-
tion of semantic information [42], physical properties [12],
and visual features [25].
Integration of language features into implicit representa-
tion is an actively researched topic. LERF [18] studies the
fusion of language features into Radiance Fields, but is lim-
ited to small scenes. LangSplat [32] uses Gaussian Splat-
ting [17] to achieve higher precision and training speed.
However, LangSplat demonstrates only the reconstruction
Figure 2. Implicit language mapping. Vision-language features φ are extracted from the RGB image. The corresponding points of the depth image are projected to the world coordinate system. Each point can be encoded using its coordinates and the octree: the coordinates are used to find the corresponding octree voxels (blue, red, green); learnable features stored in the voxels' corners are interpolated and summed, producing the point encoding. F vectors are stored only in the voxels of the coarse octree level (blue). The language decoder reconstructs the language feature φ̄ in the spatial coordinates of the point based on its encoding and the vector F. The language loss optimizes the learnable features and F vectors. After optimization, the language map can be reconstructed at arbitrary spatial coordinates. The language decoder is optimized independently of the implicit mapping (Sec. 3.2).
of small table-sized scenes and does not consider incremen-
tal mapping.
Implicit representations are highly dependent on the type
of encoding they use. The encodings employed in the orig-
inal NeRF work [26] were able to generalize the predic-
tions [12,25], but were constrained by the limited size of
the environments. Subsequent works successfully increased
training and reconstruction speeds [27,36]. Gaussian Splat-
ting [17] is currently one of the most popular encodings
due to its speed, simplicity, and high quality reconstruc-
tion. However, some works may benefit from structured
encodings such as grid-based [40], octree-based feature vol-
umes [39], or combined ones [21].
Compared to the works discussed above, our approach
can build large-scale 3D implicit language maps and can
be seamlessly integrated with implicit SLAM methods.
LiLMaps uses a sparse octree-based representation [39] to store learnable features, but our method is not tied to any particular representation and can be adapted to others with minimal effort.
3. Method
We address the task of building an implicit vision-
language representation along with environment mapping.
Sec. 3.1 describes our model architecture used in LiLMaps.
During incremental mapping, the future observed objects
and their encoded representations are unknown in advance,
which makes it challenging to train a language decoder
to represent all language features. To address this issue,
in Sec. 3.2 we propose the adaptive language decoder opti-
mization strategy that can effectively adjust the decoder to
new language features while retaining previously observed
ones. In this work, we employ a visual language encoder
but pixel-wise language features are often inconsistent be-
tween frames. We address this issue in Sec. 3.3.
3.1. LiLMaps Architecture
Fig. 2 shows the architecture of the proposed approach.
Input data for our pipeline are point clouds associated with
CLIP language features (language point clouds), as well as
camera poses estimated by any external SLAM method. We
produce language point clouds by extracting language fea-
tures from an RGB image. Extracted language features are
projected to the point clouds in the world coordinate system
using the corresponding depth image and camera pose. The
extraction of language features can be done using per-pixel
visual language encoders such as LSeg, OpenSeg, Segment-
Anything-CLIP [22], etc. For example, VLMaps [15] uti-
lizes LSeg for this purpose. In this study, the visual lan-
guage encoder is treated as an external module, and improv-
ing its performance is not our focus.
Our goal is to enable implicit vision-language mapping
under the conditions of environment exploration when fu-
ture measurements are not available. We use the octree
structure as positional encoding to build the implicit rep-
resentation. Unless otherwise specified, we consistently
use three different levels of the octree to store the features.
Each level of the octree is made up of voxels, and each
voxel from a higher level can encompass multiple voxels
from lower, more detailed levels. It should be noted that
we use a sparse octree representation, meaning voxels are
only present where observations have been made. When the
point clouds are projected to the world coordinates, we find
the corresponding voxels in the octree for each point. Each
voxel holds learnable features at its corners. These features
are shared among voxels that have common corners.
The language features have a high dimensionality, and a straightforward way to encode them in the octree would be to increase the size of the learnable features stored in the corners. However, storing high-dimensional learnable features consumes a significant amount of memory. Instead, we suggest storing one high-dimensional learnable feature vector F per voxel of the first (coarse) octree level, while keeping the corner features low-dimensional.
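To make the encoding concrete, the sketch below shows how a single query point could be encoded under the stated setup (three octree levels, m = 16, L = 512); the container layout, helper names, and the 0.20/0.10/0.05 m voxel sizes of the coarser levels are illustrative assumptions, not the authors' implementation.

```python
import torch

m, L = 16, 512                                   # corner feature size, F vector size
corner_features = [{}, {}, {}]                   # one dict of learnable corners per octree level
coarse_F = {}                                    # one F vector per coarse-level voxel
voxel_sizes = [0.20, 0.10, 0.05]                 # assumed coarse-to-fine voxel resolution [m]


def get_corner(level, key):
    # Lazily create a learnable corner feature the first time a corner is touched.
    return corner_features[level].setdefault(key, torch.zeros(m, requires_grad=True))


def encode_point(p):
    """Trilinearly interpolate corner features at point p and sum them over all levels."""
    enc = torch.zeros(m)
    for level, vs in enumerate(voxel_sizes):
        q = p / vs
        base = torch.floor(q)
        w = q - base                             # interpolation weights in [0, 1]
        for dx in (0, 1):
            for dy in (0, 1):
                for dz in (0, 1):
                    key = (int(base[0]) + dx, int(base[1]) + dy, int(base[2]) + dz)
                    weight = ((w[0] if dx else 1 - w[0])
                              * (w[1] if dy else 1 - w[1])
                              * (w[2] if dz else 1 - w[2]))
                    enc = enc + weight * get_corner(level, key)
    # One high-dimensional F vector per coarse voxel, shared by all points inside it.
    coarse_key = tuple(int(c) for c in torch.floor(p / voxel_sizes[0]))
    F = coarse_F.setdefault(coarse_key, torch.zeros(L, requires_grad=True))
    return enc, F


encoding, F = encode_point(torch.tensor([1.23, 0.45, 0.67]))
```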
To train the implicit representation, we apply the cosine similarity loss between the features φ̄ decoded from the octree and the vision-language features φ of the input point cloud:

$$\mathcal{L}_{vl} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{CosineSimilarity}(\varphi_i, \bar{\varphi}_i), \qquad (1)$$

where N is the number of points with vision-language features in the point cloud.
Note that we encode every newly available measurement into the learnable features using (1), but the weights of the language decoder are not updated with this loss. The optimization of the decoder is described in Sec. 3.2.
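A minimal sketch of this mapping step, assuming a decoder module with the interface of Fig. 2 (a decoder sketch follows after the next paragraph) and, for simplicity, treating the point encodings as directly optimizable leaf tensors rather than differentiating back to the corner features; the loss is written as the negative mean cosine similarity so that gradient descent increases similarity, as Eq. (1) intends.

```python
import torch
import torch.nn.functional as Fn


def mapping_step(decoder, encodings, f_vectors, target_features, n_iter=100, lr=1e-2):
    """Optimize only the octree-side parameters with the loss of Eq. (1).

    `decoder` maps (encodings, f_vectors) -> predicted language features; its
    weights stay frozen in this step (restore requires_grad afterwards if the
    decoder is optimized elsewhere)."""
    for p in decoder.parameters():
        p.requires_grad_(False)                  # decoder is not updated by this loss
    optimizer = torch.optim.Adam([encodings, f_vectors], lr=lr)
    for _ in range(n_iter):
        pred = decoder(encodings, f_vectors)     # reconstructed features (phi_bar)
        loss = -Fn.cosine_similarity(pred, target_features, dim=-1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()
```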
The decoder reconstructs the language feature φ̄ at given spatial coordinates using the feature vector F and the corner features of the corresponding voxels. Before being fed to the decoder, the corner features are linearly interpolated at the reconstruction point and summed across all octree levels. The decoder consists of three fully connected layers. The first two layers expand the dimensions of the corner features and produce an element-wise scaling vector for F; the scaled F is then passed to the last fully connected layer, which outputs the predicted language feature φ̄.
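A rough PyTorch sketch of such a decoder; the input and output sizes follow Tab. 1 and the 512-dimensional CLIP features, while the hidden width is our assumption.

```python
import torch
import torch.nn as nn


class LanguageDecoder(nn.Module):
    """Sketch of the three-layer decoder described above: two fully connected
    layers expand the point encoding into an element-wise scaling of F, and a
    final layer maps the scaled F to the predicted language feature."""

    def __init__(self, enc_dim=16, f_dim=512, out_dim=512, hidden=128):
        super().__init__()
        self.expand = nn.Sequential(
            nn.Linear(enc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, f_dim),            # element-wise scaling vector for F
        )
        self.head = nn.Linear(f_dim, out_dim)    # outputs the predicted feature

    def forward(self, encoding, f_vec):
        scale = self.expand(encoding)
        return self.head(scale * f_vec)


decoder = LanguageDecoder()
phi_bar = decoder(torch.randn(8, 16), torch.randn(8, 512))   # a batch of 8 points
```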
3.2. Adaptive Language Decoder Optimization
Equation (1) uses our MLP-based language decoder to predict vision-language features φ̄ based on the encodings stored in the octree. However, new data can contain features that have not been observed before. In this case, the decoder weights must be updated to be able to reconstruct the new features without forgetting the old ones, but re-training on the whole previously mapped environment is computationally expensive.
We propose adaptive language decoder optimization in
Algorithm 1. When a new point cloud with vision-language
features arrives, we perform decoder optimization before
the optimization of octree features described in Sec. 3.1.
Algorithm 1 Adaptive Language Decoder Optimization
 1: global names
 2:     LDec                                  ▷ Language decoder
 3:     kFeatures                             ▷ Known features
 4:     Encodings                             ▷ Learnable parameter
 5:     FVectors                              ▷ Learnable parameter
 6:     τ                                     ▷ Cosine similarity threshold
 7:     LOSS                                  ▷ Cosine similarity loss
 8: end global names
 9: procedure OPTIMIZE(inFeatures)
10:     uniqueFeatures ← UNIQUE(inFeatures, τ)
11:     newFeatures ← UNKNOWN(uniqueFeatures, kFeatures, τ)
12:     if len(newFeatures) = 0 then
13:         return                            ▷ No optimization if no new features
14:     end if
15:     encodings ← RANDOM(len(newFeatures), m)
16:     if len(FVectors) = 0 then
17:         fvectors ← RANDOM(1, L)
18:     else
19:         fvectors ← MEAN(FVectors, dim = 0, keepdim = true)
20:     end if
21:     optimizer ← ADAM([encodings, fvectors, LDec])
22:     allEncodings ← Encodings ∪ encodings
23:     allFVectors ← FVectors ∪ fvectors
24:     allFeatures ← kFeatures ∪ newFeatures
25:     for i ← 0 to Nopt do
26:         φ̄ ← LDec(allEncodings, allFVectors)
27:         fvectors1 ← SHUFFLE(allFVectors)
28:         loss ← LOSS(allFeatures, φ̄)
29:         lossF ← LOSS(fvectors1, allFVectors)
30:         optimizer.optimize(loss + lossF)
31:     end for
32:     kFeatures ← allFeatures               ▷ Update known features
33:     Encodings ← allEncodings              ▷ Update encodings
34:     FVectors ← allFVectors                ▷ Update F vectors
35: end procedure
The proposed optimization operates on a language decoder (Line 2) and learnable parameters. The learnable parameters are the inputs that are directly forwarded to the decoder. In our work (Fig. 2), the learnable parameters are the F vectors (Line 5) and the interpolated and summed point encodings (Line 4). Note that Algorithm 1 is not limited to our network architecture and can be adapted to other implicit representations by simply replacing the corresponding learnable parameters.
Firstly, we extract only unique language features
(Line 10) from all available ones in the input point cloud
and then filter out already known features (Line 11). In
both cases, cosine similarity and a predefined threshold τ
Parameter                               Symbol   Value
Similarity threshold                    τ        0.02
Used octree levels                      -        8, 9, 10
Fine level resolution                   -        0.05 m
Learnable features size                 m        16
F vectors size                          L        512
Iterations per decoder optimization     Nopt     100
Iterations per mapping loss, Eq. (1)    -        100

Table 1. LiLMaps parameters and their values in the experiments.
are used to estimate similarity.
For the new features (unobserved features without duplicates), we initialize the learnable parameters: the encodings and F vectors (Lines 15, 17, and 19). Note that the initialized encodings (Line 15) correspond to the linearly interpolated and summed features of the corners of the octree (Point Encoding in Fig. 2). We initialize only a single F feature vector for all new language features (Line 17), because an F feature vector stored in the coarse octree level may be used for points with different language features (Sec. 3.1). We also regularize new F vectors by enforcing them to be similar to existing ones (Lines 27 and 29). However, this regularization is optional and can be omitted or changed to any other regularization required by the corresponding implicit representation.
The proposed adaptive optimization approach optimizes
the decoder for unobserved language features and finds the
learnable parameters for them. Only the decoder and new
learnable parameters are optimized (Line 21). The already
known features and their encodings are not optimized, but
used to prevent forgetting (Lines 22 to 24). After optimiza-
tion, we update the list of known vision-language features
and their learnable parameters (Lines 32 to 34).
The proposed optimization efficiently stores only a small
number of known features for replay (Line 3), as demon-
strated in experiments (Fig. 6). It enables fast decoder op-
timization using vectorization. Moreover, the language de-
coder and the learnable parameters are optimized only if
new language features are observed.
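The following condensed Python sketch mirrors Algorithm 1 under the parameters of Tab. 1; the state layout, helper names, learning rate, and the reading of τ as a cosine-distance threshold for duplicates are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as Fn


def cos_dist(f, bank):
    """Cosine distance between one feature f (D,) and a feature bank (K, D)."""
    return 1.0 - Fn.cosine_similarity(f.unsqueeze(0), bank, dim=-1)


def adaptive_decoder_optimization(decoder, state, in_features,
                                  tau=0.02, m=16, big_l=512, n_opt=100, lr=1e-3):
    """Condensed sketch of Algorithm 1. `state` holds the replay buffers
    kFeatures (K, 512), Encodings (K, m), and FVectors (K, 512), initialized as
    empty tensors of those shapes."""
    # Lines 10-11: unique features of the frame, minus already-known ones.
    new = []
    for f in in_features:
        bank = torch.cat([state["kFeatures"]] + [n.unsqueeze(0) for n in new])
        if bank.shape[0] == 0 or cos_dist(f, bank).min() >= tau:
            new.append(f)
    if not new:
        return                                           # Lines 12-14: nothing new
    new_feats = torch.stack(new)

    # Lines 15-20: initialize learnable parameters for the new features;
    # a single F vector is shared by all of them.
    enc = torch.randn(len(new), m, requires_grad=True)
    if state["FVectors"].shape[0] == 0:
        fvec = torch.randn(1, big_l, requires_grad=True)
    else:
        fvec = state["FVectors"].mean(0, keepdim=True).clone().requires_grad_(True)

    # Line 21: only the decoder and the new parameters are optimized.
    optimizer = torch.optim.Adam([enc, fvec] + list(decoder.parameters()), lr=lr)
    all_feats = torch.cat([state["kFeatures"], new_feats])
    for _ in range(n_opt):                               # Lines 25-31
        all_enc = torch.cat([state["Encodings"], enc])
        all_fv = torch.cat([state["FVectors"], fvec.expand(len(new), -1)])
        pred = decoder(all_enc, all_fv)
        loss = -Fn.cosine_similarity(pred, all_feats, dim=-1).mean()
        # Optional regularization: keep new F vectors similar to existing ones.
        shuffled = all_fv[torch.randperm(all_fv.shape[0])]
        loss = loss - Fn.cosine_similarity(shuffled, all_fv, dim=-1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Lines 32-34: commit the new features and parameters to the replay buffers.
    state["kFeatures"] = all_feats.detach()
    state["Encodings"] = torch.cat([state["Encodings"], enc.detach()])
    state["FVectors"] = torch.cat([state["FVectors"], fvec.detach()])
```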
3.3. Measurement Update Strategy
During incremental mapping, new observations added to
the map should not corrupt previous measurements. How-
ever, vision-language features predicted by the visual en-
coder may not be consistent between frames. VLMaps [15]
averaged the language features of the objects received from
different views. We note that the averaging can be done in
a recursive form:
$$A_n = \frac{1}{n} \sum_{i=1}^{n} \varphi_i = \frac{n-1}{n} A_{n-1} + \frac{\varphi_n}{n}, \quad \text{with } A_0 = 0. \qquad (2)$$
Figure 3. Left: Environments reconstructed without measure-
ment update; Middle: Ground Truth; Right: Environments recon-
structed with measurement update.
In this work, we propose to use as the target for training in Eq. (1) a weighted average φ*_n between the observations φ_n and the features φ̄_{n-1} already stored in the map:

$$\varphi^{*}_{n} = \frac{n-1}{n} \bar{\varphi}_{n-1} + \frac{\varphi_n}{n}, \quad \text{with } \bar{\varphi}_0 = 0. \qquad (3)$$
This averaging is especially useful for noisy measurements such as vision-language features, because they may vary significantly with the distance to the objects or the point of view (Sec. 4.1). Mapping with Eq. (3) forces the map to store all observations, similarly to [15]. However, we observed better results when not all previous data are stored and therefore use exponential smoothing instead of averaging:

$$\varphi^{*}_{n} = \alpha \bar{\varphi}_{n-1} + (1 - \alpha) \varphi_n, \quad \text{with } \bar{\varphi}_0 = 0, \qquad (4)$$

where α is set dynamically to higher values if the new measurements φ_n are more different from the previously optimized map features φ̄_{n-1} and lower otherwise:

$$\alpha = \frac{\mathrm{CosineSimilarity}(\varphi_i, \bar{\varphi}_i)}{0.5 + \mathrm{CosineSimilarity}(\varphi_i, \bar{\varphi}_i)}. \qquad (5)$$
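A small sketch of this measurement update as we read Eqs. (4) and (5), applied per point; the clamp on α is our guard for negative similarities and is not specified above.

```python
import torch
import torch.nn.functional as Fn


def update_targets(phi_new, phi_map):
    """Exponential smoothing of vision-language measurements, Eqs. (4)-(5).

    phi_new: features of the current frame, (N, D);
    phi_map: features currently reconstructed from the map at the same points,
    (N, D), zeros where nothing has been stored yet (so alpha = 0 there and the
    first observation is taken as-is)."""
    sim = Fn.cosine_similarity(phi_new, phi_map, dim=-1)      # per-point similarity
    alpha = sim / (0.5 + sim)                                 # Eq. (5)
    alpha = alpha.clamp(0.0, 1.0).unsqueeze(-1)               # guard for negative similarities (our addition)
    return alpha * phi_map + (1.0 - alpha) * phi_new          # Eq. (4): training targets


targets = update_targets(torch.randn(4, 512), torch.zeros(4, 512))
```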
4. Experiments
In the experiments, we validate that our method can be
used for incremental implicit mapping of language features.
We use depth, semantic, and RGB images provided by [15]
through the Habitat simulator using Matterport3D [5]. Mat-
terport3D provides ground truth meshes with each face as-
signed to a class label. To get ground truth point clouds with
the corresponding language features, we sample the meshes
and encode their labels using CLIP [33]. During all exper-
iments, we project depth images into 3D using the current
camera pose to obtain input point clouds. The parameters
we use during the experiments are summarized in Tab. 1.
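For reference, one common way to perform this projection is to back-project the depth image with the pinhole intrinsics and transform the points with the camera-to-world pose; the intrinsics values and conventions below are assumptions for illustration, not the exact setup used in our pipeline.

```python
import numpy as np


def depth_to_world(depth, K, T_wc):
    """Back-project a depth image (H, W) in meters into world coordinates.
    K: 3x3 pinhole intrinsics; T_wc: 4x4 camera-to-world pose."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    valid = z > 0                                       # skip missing depth
    uv1 = np.stack([u.reshape(-1), v.reshape(-1), np.ones(h * w)], axis=0)
    rays = np.linalg.inv(K) @ uv1                       # camera-frame rays
    pts_c = rays * z                                    # scale rays by depth
    pts_c = np.vstack([pts_c, np.ones(h * w)])          # homogeneous coordinates
    pts_w = (T_wc @ pts_c)[:3].T                        # transform to world frame
    return pts_w[valid]                                 # (N, 3) valid points


# Example with a synthetic 480x640 depth image and assumed intrinsics:
K = np.array([[525.0, 0, 320.0], [0, 525.0, 240.0], [0, 0, 1.0]])
pts = depth_to_world(np.full((480, 640), 2.0), K, np.eye(4))
```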
4.1. Mapping Quality
We evaluate accuracy, recall, precision, and intersection over union for our implicit language map.
Approach          5LpN3gDmAk7_1        YmJkqBEsHnH_1        gTV8FGcVJC9_1        jh4fc5c5qoQ_1        JmbYfDe2QKZ_2
                  A   mR  mP  mIoU     A   mR  mP  mIoU     A   mR  mP  mIoU     A   mR  mP  mIoU     A   mR  mP  mIoU
LiLMaps*_GT       97  97  96  93       98  93  92  89       98  98  95  94       98  96  88  85       95  96  92  89
LiLMaps_GT        97  97  93  91       96  94  96  90       97  98  93  92       98  98  92  89       95  96  90  87
LiLMaps*_SEM      84  77  70  57       84  77  82  65       85  85  84  73       83  78  73  61       78  77  77  63
LiLMaps_SEM       88  86  75  66       90  82  87  72       90  90  86  79       90  84  82  71       85  88  82  74
OpenScene [31]    68  45  67  36       63  50  77  41       61  49  60  36       77  52  59  39       56  51  67  41
LiLMaps*_LSeg     64  29  46  21       52  44  58  31       65  48  59  32       59  32  44  21       53  41  50  33
LiLMaps_LSeg      68  37  57  26       57  56  60  34       70  50  63  33       73  39  58  27       56  42  56  34
VLMaps_2D^mean    28  -   -   19       28  -   -   19       28  -   -   19       28  -   -   19       28  -   -   19

Table 2. Language mapping quality evaluation: accuracy (A), recall (mR), precision (mP) and mean IoU (mIoU) in [%].
Figure 4. Left: Language map produced by OpenScene 3D [31];
Middle: Ground Truth; Right: Language map created by
LiLMaps.
Accuracy is defined as the number of points with correctly reconstructed language features divided by the total number of points. The values are compared with the OpenScene 3D model [31] trained on Matterport3D [5]. The results on random sequences are presented in Tab. 2. To validate the measurement update strategy, we also present results with deactivated measurement update, marked by an asterisk (LiLMaps*).
For LiLMaps_GT and LiLMaps*_GT, the language features
of a measurement are obtained directly from the closest
points of the ground truth point cloud. This allows us to es-
timate upper-bound performance with close-to-ideal input
data. However, simulation measurements and ground truth
points sampled from uneven meshes do not always match
perfectly. As a result, some input points may have wrong
language features or do not have language features at all.
LiLMaps_SEM and LiLMaps*_SEM denote experiments in
which language features are extracted from semantic im-
ages. Simulated semantic images have incorrect labels due
to mesh discontinuities and on object edges. This allows us
to test LiLMaps when the input data are closer to the real
ones, e.g. when the data are imprecise and inconsistent be-
tween frames.
Our approach demonstrates the best performance when used with the GT data. The introduction of the Measurement Update does not change the performance
Figure 5. Language map incrementally created with our adaptive
optimization. Bottom Left: A region mapped in the beginning.
Bottom Right: The same region after the mapping is completed.
All initially mapped objects remain unchanged.
significantly, as the input features are precise and consistent between frames in this case.
As expected, LiLMaps_SEM and LiLMaps*_SEM yield worse results due to the inconsistency of the input data, but both still outperform OpenScene [31]. LiLMaps_SEM outperforms LiLMaps*_SEM thanks to the proposed measurement update technique, which addresses potential data inconsistencies. Fig. 3 shows maps learned with the measurement update procedure activated and deactivated. Enabling the measurement
Decoder                        100%-90%   90%-80%   80%-70%        70%-50%      50%-0%
LiLMaps^simple_Adaptive           20         2      1 (picture)        0            0
LiLMaps^simple_Pretrained         20         2          0         1 (towel)         0
OpenScene^MT_HEAD                 16         4          2              0        1 (objects)
OpenScene^SC_HEAD                 16         5          1              0        1 (objects)
OpenScene^NS_HEAD                  9         8          1              3        2 (picture, shelving)

Table 3. Number of classes falling into different F1-score ranges for the compared decoders. A method is considered better if a larger number of classes fall into the first column with F1 > 90%.
update results in a cleaner final map, which is crucial for object detection and navigation.
Fig. 4 compares a LiLMaps reconstruction with the prediction of the OpenScene 3D model. Despite OpenScene being trained on the same dataset, it struggles with certain labels and completely misses labels such as "TV monitor", "appliances", and "stool", whereas our approach achieves high accuracy for these labels and does not completely miss objects.
Tab. 2 compares our approach combined with the LSeg model (LiLMaps_LSeg and LiLMaps*_LSeg) and VLMaps (VLMaps_2D^mean) with the metrics reported in [15]. LSeg frequently misses objects (e.g., segments a painting as a wall) or provides wrong language features (e.g., detects a bed as a sofa), which significantly influences the final results. Improving the quality of per-pixel language segmentation is beyond the scope of this research, however. In all cases, our 3D reconstructed language maps yield better results than the mean results reported in VLMaps [15] for their 2D maps.
4.2. Adaptive Language Decoder Optimization
We demonstrate the impact of our Adaptive Optimization Strategy on sequence 5LpN3gDmAk7_1 with GT labels. We compare the performance of the decoder trained with our Adaptive Optimization with other decoders built from pre-trained models. We chose the OpenScene [31] 3D model's head as the pre-trained decoder because it can predict language features for arbitrary input point clouds. For a fair comparison, we changed our language decoder architecture (LiLMaps^simple) to match the architecture of OpenScene's head. This head does not allow us to use the learnable vectors F, however, and therefore the results in Tab. 3 are presented when only the learnable corner features with m = 96 are optimized.
Tab. 3 summarizes the number of classes with distinct F1-score qualities using different optimization types: the decoder trained with the proposed adaptive optimization (LiLMaps^simple_Adaptive); the decoder pre-trained with the proposed
Figure 6. Number of language features stored for the adaptive optimization with varying feature similarity thresholds τ. Blue: language features extracted from ground truth (GT) data; Orange: language features extracted by LSeg; Red line: total number of different GT classes present in the scene.
optimization (LiLMaps^simple_Pretrained); and the pre-trained and fixed heads of OpenScene [31], trained on different datasets (Matterport3D [5], nuScenes [4], ScanNet [8]).
To obtain LiLMaps^simple_Pretrained, we extract all available labels from the scene, convert them to language features using CLIP, and run our adaptive language decoder optimization with all features at once. The additional possibility of pre-training the decoder in advance without any real measurements may be useful for some applications. Our adaptive language decoder optimization allows pre-trained decoders to be adjusted online if necessary, but for this experiment we do not update the pre-trained models (LiLMaps^simple_Pretrained, OpenScene^MT_HEAD, OpenScene^SC_HEAD, OpenScene^NS_HEAD) during the mapping.
The final results of LiLMaps^simple_Adaptive are similar to those of LiLMaps^simple_Pretrained because every time a new object is observed, the corresponding language features are included in the Adaptive Optimization and the decoder of LiLMaps^simple_Adaptive is updated to represent the new features without forgetting the old ones. The adaptive and pre-trained LiLMaps models may show minor differences in the results because their initial states and optimization processes differ. In Fig. 5 we demonstrate that the proposed adaptive optimization can incrementally extend the decoder to represent new features without catastrophic forgetting of the language features observed in the beginning.
The results in Tab. 3 show that the proposed adaptive optimization LiLMaps^simple_Adaptive performs better than pre-trained and fixed models. Our adaptive optimization fits the model to a specific scene, while pre-trained decoders (in this case OpenScene's heads) are trained for general language prediction. If a model trained for general prediction is used, then some language features of the environment may be poorly represented in it (e.g., picture and shelving in Tab. 3), while well-represented features may be irrelevant for the specific scene. This can be seen in the results of OpenScene^NS_HEAD. OpenScene^MT_HEAD is trained on Matterport3D [5] and achieves better results. OpenScene^SC_HEAD is trained on ScanNet [8], which is similar to Matterport3D, explaining the similar results. However, OpenScene^NS_HEAD, pre-trained on nuScenes [4], shows a significant degradation in the results because the nuScenes environment differs more strongly from Matterport3D. Moreover, our adaptive language decoder optimization allows one to build a custom decoder architecture, while employing pre-trained models could restrict the available architecture options.
We analyze different values of the threshold τ used to extract unique and unknown features from all input features. Fig. 6 shows the final number of features that were considered distinct and were involved in the optimization at the end of mapping. Lower τ values lead to a larger number of features, but they are still memory efficient. For comparison, the stored features are collected from hundreds of high-resolution images, but their final number is less than 0.5% of the number of pixels in a single image with resolution 640×480. During all tests, adaptive language decoder optimization operated at a rate of 4 frames per second (fps). To achieve real-time performance, adaptive optimization can be executed in parallel to mapping.
5. Conclusion
In this work, we presented an implicit language map-
ping approach called LiLMaps. We address the problem
of unseen language features that appear during the map-
ping process and the problem of inconsistencies between
frames. Currently, LiLMaps is the only approach capable of
large-scale incremental implicit language mapping. It can
be used alone and enables a variety of interactions with the
environment, for instance, 3D language-based object detec-
tion (Fig. 7). Additionally, it can be integrated into existing implicit mapping approaches, introducing only slight overhead.
Figure 7. 3D language-based object detection performed on our language map. LiLMaps creates an implicit language map which is reconstructed and queried for different objects. The highest correspondences between the reconstructed language map and the corresponding request (blue) are highlighted in red.
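As an illustration of such queries, the reconstructed per-point features can be scored against the CLIP text embedding of a request; the snippet below uses the standard open-source CLIP interface, and the variable names are hypothetical.

```python
import torch
import torch.nn.functional as Fn
import clip   # OpenAI CLIP package


def query_map(map_features, prompt, device="cpu"):
    """Score reconstructed map points against a text query.

    map_features: (N, 512) language features reconstructed from the implicit map
    (ViT-B/32 CLIP features are 512-dimensional, matching L in Tab. 1)."""
    model, _ = clip.load("ViT-B/32", device=device)
    with torch.no_grad():
        text = clip.tokenize([prompt]).to(device)
        text_emb = model.encode_text(text).float()            # (1, 512)
    scores = Fn.cosine_similarity(map_features, text_emb, dim=-1)
    return scores                                             # color or threshold the top-scoring points


# Example: scores = query_map(reconstructed_features, "a TV monitor")
```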
We evaluated LiLMaps on the public dataset commonly
used in related works. Based on the results, we outperform
similar works in terms of language mapping quality. How-
ever, LiLMaps significantly depends on the quality of visual
language features produced by the encoder, which is consid-
ered to be an external module in our study. The importance
of this dependency is reduced by the proposed Measure-
ment Update strategy that handles inconsistency between
frames. We demonstrated that LiLMaps can adapt to the
environment outperforming pre-trained decoders. The pro-
posed Adaptive Optimization demonstrates the ability to
prepare decoders given arbitrary language features without
the need for actual observations.
Acknowledgment
This research was funded by the German Federal Min-
istry of Education and Research (BMBF) in the project
WestAI AI Service Center West, grant no. 01IS22094A.
References
[1] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Cheb-
otar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu,
Keerthana Gopalakrishnan, Karol Hausman, et al. Do as I
can, not as I say: Grounding language in robotic affordances.
arXiv preprint arXiv:2204.01691, 2022. 2
[2] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark
Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and
Anton Van Den Hengel. Vision-and-language navigation:
Interpreting visually-grounded navigation instructions in real
environments. In IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 3674–3683, 2018. 2
[3] Michael Bloesch, Jan Czarnowski, Ronald Clark, Stefan
Leutenegger, and Andrew J Davison. CodeSLAM learn-
ing a compact, optimisable representation for dense visual
SLAM. In IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 2560–2568, 2018. 2
[4] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora,
Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi-
ancarlo Baldan, and Oscar Beijbom. nuScenes: A multi-
modal dataset for autonomous driving. In IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition (CVPR),
pages 11621–11631, 2020. 7,8
[5] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Hal-
ber, Matthias Niessner, Manolis Savva, Shuran Song, Andy
Zeng, and Yinda Zhang. Matterport3D: Learning from rgb-
d data in indoor environments. International Conference on
3D Vision (3DV), pages 667–676, 2017. 1,5,6,7,8
[6] Boyuan Chen, Fei Xia, Brian Ichter, Kanishka Rao,
Keerthana Gopalakrishnan, Michael S Ryoo, Austin Stone,
and Daniel Kappler. Open-vocabulary queryable scene
representations for real world planning. In IEEE Inter-
national Conference on Robotics and Automation (ICRA),
pages 11509–11522, 2023. 2
[7] William Chen, Siyi Hu, Rajat Talak, and Luca Carlone.
Leveraging large language models for robot 3D scene un-
derstanding. arXiv preprint arXiv:2209.05629, 2022. 2
[8] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Hal-
ber, Thomas A. Funkhouser, and Matthias Nießner. Scan-
Net: Richly-annotated 3d reconstructions of indoor scenes.
In IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR), pages 2432–2443, 2017. 7,8
[9] Murtaza Dalal, Tarun Chiruvolu, Devendra Chaplot, and
Ruslan Salakhutdinov. Plan-Seq-Learn: Language model
guided RL for solving long horizon robotics tasks. In
12th International Conference on Learning Representations
(ICLR), 2024. 2
[10] Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin.
Scaling open-vocabulary image segmentation with image-
level labels. In European Conference on Computer Vision
(ECCV), pages 540–557. Springer, 2022. 2
[11] Huy Ha and Shuran Song. Semantic Abstraction: Open-
world 3D scene understanding from 2D vision-language
models. In Conference on Robot Learning (CoRL), volume
205 of Proceedings of Machine Learning Research, pages
643–653. PMLR, 2022. 2
[12] Iain Haughton, Edgar Sucar, André Mouton, Edward Johns,
and Andrew J. Davison. Real-time mapping of physical
scene properties with an autonomous robot experimenter.
In Conference on Robot Learning (CoRL), volume 205 of
Proceedings of Machine Learning Research, pages 118–127.
PMLR, 2022. 2,3
[13] Yicong Hong, Zun Wang, Qi Wu, and Stephen Gould. Bridg-
ing the gap between learning in discrete and continuous envi-
ronments for vision-and-language navigation. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition
(CVPR), pages 15439–15449, 2022. 2
[14] Yafei Hu, Quanting Xie, Vidhi Jain, Jonathan Francis, Jay
Patrikar, Nikhil Keetha, Seungchan Kim, Yaqi Xie, Tianyi
Zhang, Zhibo Zhao, et al. Toward general-purpose robots
via foundation models: A survey and meta-analysis. arXiv
preprint arXiv:2312.08782, 2023. 2
[15] Chenguang Huang, Oier Mees, Andy Zeng, and Wolfram
Burgard. Visual language maps for robot navigation. In
IEEE International Conference on Robotics and Automation
(ICRA), 2023. 2,3,5,7
[16] Krishna Murthy Jatavallabhula, Alihusein Kuwajerwala,
Qiao Gu, Mohd. Omama, Ganesh Iyer, Soroush Saryazdi,
Tao Chen, Alaa Maalouf, Shuang Li, Nikhil Varma Keetha,
Ayush Tewari, Joshua B. Tenenbaum, Celso Miguel de Melo,
K. Madhava Krishna, Liam Paull, Florian Shkurti, and Anto-
nio Torralba. ConceptFusion: Open-set multimodal 3D map-
ping. In Robotics: Science and Systems XIX (RSS), 2023. 2
[17] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler,
and George Drettakis. 3D Gaussian splatting for real-time
radiance field rendering. ACM Transactions on Graphics
(TOG), 42(4):139–1, 2023. 2,3
[18] Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo
Kanazawa, and Matthew Tancik. LERF: Language embed-
ded radiance fields. In IEEE/CVF International Conference
on Computer Vision (ICCV), pages 19729–19739, 2023. 2
[19] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao,
Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White-
head, Alexander C Berg, Wan-Yen Lo, et al. Segment any-
thing. In IEEE/CVF International Conference on Computer
Vision (ICCV), pages 4015–4026, 2023. 2
[20] Boyi Li, Kilian Q. Weinberger, Serge J. Belongie, Vladlen
Koltun, and René Ranftl. Language-driven semantic seg-
mentation. In 10th International Conference on Learning
Representations (ICLR), 2022. 2
[21] Jiaze Li, Zhengyu Wen, Luo Zhang, Jiangbei Hu, Fei Hou,
Zhebin Zhang, and Ying He. GS-Octree: Octree-based 3D
Gaussian splatting for robust object-level 3D reconstruction
under strong lighting. Computer Graphics Forum (CGF),
43(7):i–xxii, 2024. 3
[22] Ming-Feng Li. Per-pixel features: Mating segment-anything
with CLIP, 2023. 2,3
[23] Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli
Ding, James Betker, Robert Baruch, Travis Armstrong, and
Pete Florence. Interactive language: Talking to robots in real
time. IEEE Robotics and Automation Letters (RA-L), 2023.
2
[24] Hidenobu Matsuki, Riku Murai, Paul HJ Kelly, and Andrew J
Davison. Gaussian splatting SLAM. In IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition (CVPR),
pages 18039–18048, 2024. 2
[25] Kirill Mazur, Edgar Sucar, and Andrew J Davison. Feature-
realistic neural fusion for real-time, open set scene under-
standing. In IEEE International Conference on Robotics and
Automation (ICRA), pages 8201–8207, 2023. 2,3
[26] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik,
Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF:
Representing scenes as neural radiance fields for view syn-
thesis. Communications of the ACM, 65(1):99–106, 2021. 2,
3
[27] Thomas Müller, Alex Evans, Christoph Schied, and Alexan-
der Keller. Instant neural graphics primitives with a mul-
tiresolution hash encoding. ACM Transactions on Graphics
(TOG), 41(4):1–15, 2022. 3
[28] Jens Naumann, Binbin Xu, Stefan Leutenegger, and Xingx-
ing Zuo. NeRF-VO: Real-time sparse visual odometry with
neural radiance fields. IEEE Robotics and Automation Let-
ters (RA-L), 9(8):7278–7285, 2024. 2
[29] Yoshiki Obinata, Naoaki Kanazawa, Kento Kawaharazuka,
Iori Yanokura, Soonhyo Kim, Kei Okada, and Masayuki In-
aba. Foundation model based open vocabulary task planning
and executive system for general purpose service robots.
arXiv preprint arXiv:2308.03357, 2023. 2
[30] Joseph Ortiz, Alexander Clegg, Jing Dong, Edgar Sucar,
David Novotný, Michael Zollhöfer, and Mustafa Mukadam.
iSDF: Real-time neural signed distance fields for robot per-
ception. In Robotics: Science and Systems XVIII (RSS),
2022. 2
[31] Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea
Tagliasacchi, Marc Pollefeys, Thomas Funkhouser, et al.
OpenScene: 3D scene understanding with open vocabular-
ies. In IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 815–824, 2023. 2,6,7
[32] Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and
Hanspeter Pfister. LangSplat: 3D language Gaussian splat-
ting. In IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 20051–20060, 2024. 2
[33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn-
ing transferable visual models from natural language super-
vision. In International Conference on Machine Learning
(ICML), pages 8748–8763. PMLR, 2021. 1,2,5
[34] Shreyas Sundara Raman, Vanya Cohen, Eric Rosen, Ifrah
Idrees, David Paulius, and Stefanie Tellex. Planning with
large language models via corrective re-prompting. In
NeurIPS Foundation Models for Decision Making Workshop
(FMDM), 2022. 2
[35] Kanchana Ranasinghe, Brandon McKinzie, Sachin Ravi,
Yinfei Yang, Alexander Toshev, and Jonathon Shlens. Per-
ceptual grouping in contrastive vision-language models. In
IEEE/CVF International Conference on Computer Vision
(ICCV), pages 5548–5561, 2023. 2
[36] Christian Reiser, Songyou Peng, Yiyi Liao, and Andreas
Geiger. KiloNeRF: Speeding up neural radiance fields with
thousands of tiny MLPs. In IEEE/CVF International Confer-
ence on Computer Vision (ICCV), pages 14335–14345, 2021.
2,3
[37] Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M
Sadler, Wei-Lun Chao, and Yu Su. LLM-Planner: Few-shot
grounded planning for embodied agents with large language
models. In IEEE/CVF International Conference on Com-
puter Vision (ICCV), pages 2998–3009, 2023. 2
[38] Edgar Sucar, Shikun Liu, Joseph Ortiz, and Andrew J Davi-
son. iMAP: Implicit mapping and positioning in real-time.
In IEEE/CVF International Conference on Computer Vision
(ICCV), pages 6229–6238, 2021. 2
[39] Towaki Takikawa, Joey Litalien, Kangxue Yin, Karsten
Kreis, Charles Loop, Derek Nowrouzezahrai, Alec Jacob-
son, Morgan McGuire, and Sanja Fidler. Neural geometric
level of detail: Real-time rendering with implicit 3D shapes.
In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), pages 11358–11367, 2021. 3
[40] Jingwen Wang, Tymoteusz Bleja, and Lourdes Agapito. GO-
Surf: Neural feature grid optimization for fast, high-fidelity
RGB-D surface reconstruction. In International Conference
on 3D Vision (3DV), pages 433–442. IEEE, 2022. 3
[41] Yunhan Yang, Xiaoyang Wu, Tong He, Hengshuang Zhao,
and Xihui Liu. SAM3D: Segment anything in 3D scenes.
arXiv preprint arXiv:2306.03908, 2023. 2
[42] Shuaifeng Zhi, Edgar Sucar, Andre Mouton, Iain Haughton,
Tristan Laidlow, and Andrew J Davison. iLabel: Interactive
neural scene labelling. arXiv preprint arXiv:2111.14637,
2021. 2
[43] Xingguang Zhong, Yue Pan, Jens Behley, and Cyrill Stach-
niss. SHINE-Mapping: Large-scale 3D mapping us-
ing sparse hierarchical implicit neural representations. In
IEEE International Conference on Robotics and Automation
(ICRA), pages 8371–8377, 2023. 2
[44] Liyuan Zhu, Yue Li, Erik Sandström, Konrad Schindler, and
Iro Armeni. LoopSplat: Loop closure by registering 3D
Gaussian splats. arXiv preprint arXiv:2408.10154, 2024. 2
[45] Siting Zhu, Renjie Qin, Guangming Wang, Jiuming Liu, and
Hesheng Wang. SemGauss-SLAM: Dense semantic Gaus-
sian splatting SLAM. arXiv preprint arXiv:2403.07494, 2024.
2
[46] Zihan Zhu, Songyou Peng, Viktor Larsson, Weiwei Xu, Hu-
jun Bao, Zhaopeng Cui, Martin R Oswald, and Marc Polle-
feys. NICE-SLAM: Neural implicit scalable encoding for
SLAM. In IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), pages 12786–12796, 2022. 2