Francesco Milano’s research while affiliated with ETH Zurich and other places


Publications (12)


NeuSurfEmb: A Complete Pipeline for Dense Correspondence-based 6D Object Pose Estimation without CAD Models
  • Conference Paper

October 2024 · 11 Reads · 2 Citations

Francesco Milano · Jen Jen Chung · [...] · Lionel Ott

NeuSurfEmb: A Complete Pipeline for Dense Correspondence-based 6D Object Pose Estimation without CAD Models

July 2024 · 11 Reads

State-of-the-art approaches for 6D object pose estimation assume the availability of CAD models and require the user to manually set up physically-based rendering (PBR) pipelines for synthetic training data generation. Both factors limit the application of these methods in real-world scenarios. In this work, we present a pipeline that does not require CAD models and allows training a state-of-the-art pose estimator using only a small set of real images as input. Our method is based on a NeuS2 object representation, which we learn through a semi-automated procedure based on Structure-from-Motion (SfM) and object-agnostic segmentation. We exploit the novel-view synthesis ability of NeuS2 and simple cut-and-paste augmentation to automatically generate photorealistic object renderings, which we use to train the correspondence-based SurfEmb pose estimator. We evaluate our method on the LINEMOD-Occlusion dataset, extensively studying the impact of its individual components and showing competitive performance with respect to approaches based on CAD models and PBR data. We additionally demonstrate the ease of use and effectiveness of our pipeline on self-collected real-world objects, showing that our method outperforms state-of-the-art CAD-model-free approaches, with better accuracy and robustness to mild occlusions. To allow the robotics community to benefit from this system, we will publicly release it at https://www.github.com/ethz-asl/neusurfemb.
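
As an illustration of the cut-and-paste augmentation step mentioned in the abstract, the sketch below pastes a rendered object onto a background image using its mask. It is a minimal, hypothetical example, not code from the NeuSurfEmb release.

```python
import numpy as np

def cut_and_paste(render_rgb: np.ndarray, mask: np.ndarray,
                  background: np.ndarray, top_left=(0, 0)) -> np.ndarray:
    """Paste a rendered object onto a background image using its binary mask.

    render_rgb: (H, W, 3) object rendering; mask: (H, W) boolean foreground mask;
    background: (H', W', 3), large enough to contain the crop at top_left = (y, x).
    """
    out = background.copy()
    y, x = top_left
    h, w = render_rgb.shape[:2]
    # copy only the foreground pixels of the rendering into the background
    out[y:y + h, x:x + w][mask] = render_rgb[mask]
    return out
```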


Panoptic Vision-Language Feature Fields

March 2024 · 10 Reads · 10 Citations

IEEE Robotics and Automation Letters

Recently, methods have been proposed for 3D open-vocabulary semantic segmentation. Such methods are able to segment scenes into arbitrary classes based on text descriptions provided during runtime. In this paper, we propose, to the best of our knowledge, the first algorithm for open-vocabulary panoptic segmentation in 3D scenes. Our algorithm, Panoptic Vision-Language Feature Fields (PVLFF), learns a semantic feature field of the scene by distilling vision-language features from a pretrained 2D model, and jointly fits an instance feature field through contrastive learning using 2D instance segments on input frames. Despite not being trained on the target classes, our method achieves panoptic segmentation performance similar to state-of-the-art closed-set 3D systems on the HyperSim, ScanNet and Replica datasets, and additionally outperforms current 3D open-vocabulary systems in terms of semantic segmentation. We ablate the components of our method to demonstrate the effectiveness of our model architecture. Our code will be available at https://github.com/ethz-asl/pvlff.
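
The contrastive instance objective described above can be read as: pixel features rendered from the same 2D instance segment are pulled together, features from different segments pushed apart. The PyTorch sketch below implements a generic supervised-contrastive loss under that reading; it is an illustration, not the exact PVLFF loss.

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(features: torch.Tensor, instance_ids: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
    """Contrastive loss over sampled pixel features.

    features:     (N, D) instance features rendered at N sampled pixels.
    instance_ids: (N,)   id of the 2D instance segment each pixel falls into.
    """
    z = F.normalize(features, dim=-1)
    logits = z @ z.t() / temperature                           # (N, N) similarities
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = (instance_ids.unsqueeze(0) == instance_ids.unsqueeze(1)) & ~eye

    # log-softmax over all other samples (self-similarity excluded)
    log_prob = logits - torch.logsumexp(
        logits.masked_fill(eye, float('-inf')), dim=1, keepdim=True)

    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                                     # anchors with >= 1 positive
    loss = -(log_prob * pos_mask).sum(dim=1)[valid] / pos_counts[valid]
    return loss.mean()
```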





Fig. 1. Our method enables real-time segmentation of scenes into arbitrary text classes provided at run-time.
Fig. 2. A diagram of the model used for our feature field.
Fig. 3. Randomly sampled 2D segmentation examples from the ScanNet validation set. Top row shows the original RGB images, second row shows our segmentation and the bottom row shows the ground truth segmentation from the ScanNet dataset. Black pixels in the ground truth segmentation correspond to classes not included in the 20 ScanNet evaluation classes.
Fig. 4. Snapshots from real-time zero-shot volumetric segmentations from a fixed viewpoint at given intervals. Our representation is able to learn in real-time and is already useful after a dozen seconds. Each image shows the RGB rendering output for the viewpoint, overlaid with the semantic segmentation given the 6 class prompts shown.
Neural Implicit Vision-Language Feature Fields
  • Preprint
  • File available

March 2023 · 133 Reads

Recently, groundbreaking results have been presented on open-vocabulary semantic image segmentation. Such methods segment each pixel in an image into arbitrary categories provided at run-time in the form of text prompts, as opposed to a fixed set of classes defined at training time. In this work, we present a zero-shot volumetric open-vocabulary semantic scene segmentation method. Our method builds on the insight that we can fuse image features from a vision-language model into a neural implicit representation. We show that the resulting feature field can be segmented into different classes by assigning points to natural language text prompts. The implicit volumetric representation enables us to segment the scene both in 3D and 2D by rendering feature maps from any given viewpoint of the scene. We show that our method works on noisy real-world data and can run in real-time on live sensor data, dynamically adjusting to text prompts. We also present quantitative comparisons on the ScanNet dataset.
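
The assignment of points to natural-language prompts described above boils down to a nearest-prompt lookup in the shared vision-language embedding space. A minimal sketch, assuming rendered per-pixel features and precomputed prompt embeddings (e.g. from a CLIP-style text encoder):

```python
import torch
import torch.nn.functional as F

def segment_by_prompts(feature_map: torch.Tensor,
                       text_embeddings: torch.Tensor) -> torch.Tensor:
    """Assign each rendered pixel feature to its most similar text prompt.

    feature_map:     (H, W, D) vision-language features rendered from the field.
    text_embeddings: (C, D) prompt embeddings for the C text classes.
    Returns an (H, W) tensor of class indices.
    """
    feats = F.normalize(feature_map, dim=-1)
    texts = F.normalize(text_embeddings, dim=-1)
    scores = torch.einsum('hwd,cd->hwc', feats, texts)   # cosine similarity per prompt
    return scores.argmax(dim=-1)
```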


Table: Effect of the ℓ1 depth loss L_d and of different types of semantic losses (either the original one proposed in [56] or ours) on the pseudo-label quality. The performance is evaluated on the training views of each scene and averaged over 3 runs.
Unsupervised Continual Semantic Adaptation through Neural Rendering

November 2022 · 31 Reads

An increasing number of applications rely on data-driven models that are deployed for perception tasks across a sequence of scenes. Due to the mismatch between training and deployment data, adapting the model on the new scenes is often crucial to obtain good performance. In this work, we study continual multi-scene adaptation for the task of semantic segmentation, assuming that no ground-truth labels are available during deployment and that performance on the previous scenes should be maintained. We propose training a Semantic-NeRF network for each scene by fusing the predictions of a segmentation model and then using the view-consistent rendered semantic labels as pseudo-labels to adapt the model. Through joint training with the segmentation model, the Semantic-NeRF model effectively enables 2D-3D knowledge transfer. Furthermore, due to its compact size, it can be stored in a long-term memory and subsequently used to render data from arbitrary viewpoints to reduce forgetting. We evaluate our approach on ScanNet, where we outperform both a voxel-based baseline and a state-of-the-art unsupervised domain adaptation method.
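
To make the adaptation mechanism concrete, the sketch below shows one possible training step that mixes pseudo-labels rendered from a per-scene Semantic-NeRF with samples replayed from long-term memory. The function, tensor shapes, and ignore index are assumptions for illustration, not the paper's actual training code.

```python
import torch.nn.functional as F

def adaptation_step(model, optimizer, new_batch, replay_batch, replay_weight=1.0):
    """One adaptation step mixing pseudo-labelled new-scene data with replayed samples.

    new_batch:    (images, pseudo_labels), where the labels are view-consistent
                  semantic renderings used as pseudo-labels (255 = ignore).
    replay_batch: (images, labels) drawn from a long-term memory of earlier data.
    The model is assumed to return (N, C, H, W) logits for (N, 3, H, W) images.
    """
    model.train()
    optimizer.zero_grad()
    imgs_new, pseudo = new_batch
    imgs_old, labels_old = replay_batch
    loss = F.cross_entropy(model(imgs_new), pseudo, ignore_index=255)
    loss = loss + replay_weight * F.cross_entropy(model(imgs_old), labels_old,
                                                  ignore_index=255)
    loss.backward()
    optimizer.step()
    return loss.item()
```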


Continual Adaptation of Semantic Segmentation Using Complementary 2D-3D Data Representations

October 2022 · 15 Reads · 23 Citations

IEEE Robotics and Automation Letters

Semantic segmentation networks are usually pre-trained once and not updated during deployment. As a consequence, misclassifications commonly occur if the distribution of the training data deviates from the one encountered during the robot's operation. We propose to mitigate this problem by adapting the neural network to the robot's environment during deployment, without any need for external supervision. Leveraging complementary data representations, we generate a supervision signal by probabilistically accumulating consecutive 2D semantic predictions in a volumetric 3D map. We then train the network on renderings of the accumulated semantic map, effectively resolving ambiguities and enforcing multi-view consistency through the 3D representation. In contrast to scene adaptation methods, we aim to retain the previously-learned knowledge, and therefore employ a continual learning experience replay strategy to adapt the network. Through extensive experimental evaluation, we show successful adaptation to real-world indoor scenes both on the ScanNet dataset and on in-house data recorded with an RGB-D sensor. Our method increases the segmentation accuracy on average by 9.9% compared to the fixed pre-trained neural network, while retaining knowledge from the pre-training dataset.
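
As a toy illustration of probabilistically accumulating 2D semantic predictions in a volumetric 3D map, the sketch below sums per-class probabilities in each voxel and derives pseudo-labels and a simple confidence from them. It is a hypothetical, simplified stand-in (no ray tracing or fusion details), not the authors' implementation.

```python
import numpy as np

class SemanticVoxelMap:
    """Toy probabilistic accumulation of 2D semantic predictions in a voxel grid.

    For every voxel we sum the class probabilities of all observations projected
    into it; the pseudo-label of a voxel is the argmax class, and the normalized
    maximum serves as a simple multi-view-consistency confidence.
    """

    def __init__(self, grid_shape, num_classes):
        self.counts = np.zeros((*grid_shape, num_classes), dtype=np.float64)

    def integrate(self, voxel_indices, class_probs):
        # voxel_indices: (N, 3) integer voxel coordinates of observed 3D points
        # class_probs:   (N, C) softmax output of the 2D segmentation network
        np.add.at(self.counts, tuple(voxel_indices.T), class_probs)

    def pseudo_labels(self):
        labels = self.counts.argmax(axis=-1)                  # (X, Y, Z) class ids
        total = self.counts.sum(axis=-1)
        peak = self.counts.max(axis=-1)
        confidence = np.where(total > 0, peak / np.maximum(total, 1e-9), 0.0)
        return labels, confidence
```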


Fig. 2: Overview: An RGB-D camera provides inputs to a segmentation network (yellow) and pose estimation (green). 2D semantic estimates are accumulated in a 3D voxel map. Using ray tracing, we render 2D pseudo-labels from the map. These are used to adapt the network using a continual learning strategy (red), which can access previously stored samples in a memory buffer (dark blue).
Fig. 4: Resulting mesh in the pseudo-label generation of the first ScanNet scene. Left: Pseudo-label map (1-Pseudo) generated using the pre-trained neural network. Right: Pseudo-label map (GT-Pseudo) generated using the ground-truth labels.
Fig. 5: Multi-view consistency measured as per-voxel confidence in the first ScanNet scene. Left: Confidence of the voxel volume when mapping the pre-trained neural network predictions (1-Pred). Right: Confidence of the voxel volume when mapping the adapted neural network predictions in the second iteration (2-Pred).
Fig. 6: Resulting mesh in the pseudo-label generation of the lab data conference room scene.
Continual Learning of Semantic Segmentation using Complementary 2D-3D Data Representations

November 2021 · 120 Reads

Semantic segmentation networks are usually pre-trained and not updated during deployment. As a consequence, misclassifications commonly occur if the distribution of the training data deviates from the one encountered during the robot's operation. We propose to mitigate this problem by adapting the neural network to the robot's environment during deployment, without any need for external supervision. Leveraging complementary data representations, we generate a supervision signal by probabilistically accumulating consecutive 2D semantic predictions in a volumetric 3D map. We then retrain the network on renderings of the accumulated semantic map, effectively resolving ambiguities and enforcing multi-view consistency through the 3D representation. To preserve the previously-learned knowledge while performing network adaptation, we employ a continual learning strategy based on experience replay. Through extensive experimental evaluation, we show successful adaptation to real-world indoor scenes both on the ScanNet dataset and on in-house data recorded with an RGB-D sensor. Our method increases the segmentation performance on average by 11.8% compared to the fixed pre-trained neural network, while effectively retaining knowledge from the pre-training dataset.
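
The continual learning strategy above relies on experience replay from stored pre-training samples. The sketch below shows one common way to keep such a memory bounded, reservoir sampling; it is a generic illustration, not necessarily the exact replay policy used in the paper.

```python
import random

class ReplayBuffer:
    """Fixed-size experience-replay buffer using reservoir sampling."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.samples = []   # stored (image, label) pairs or dataset indices
        self.seen = 0       # total number of samples offered to the buffer

    def add(self, sample):
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            # keep each seen sample with probability capacity / seen
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.samples[j] = sample

    def sample(self, k):
        return random.sample(self.samples, min(k, len(self.samples)))
```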


Citations (7)


... Present tasks on the BOP benchmark still lack a model-free 6D object localization task; iNeRF-OPE should be considered the first work to implement such a model-free localization task for one of the BOP datasets, i.e., T-LESS. NeuSurfEmb [23] is a successful model-free method that is similar to iNeRF-OPE in some ways. The experiments are conducted on the LINEMOD-Occlusion [24] dataset and in the real world. ...

Reference:

A Survey of Robotic Monocular Pose Estimation
NeuSurfEmb: A Complete Pipeline for Dense Correspondence-based 6D Object Pose Estimation without CAD Models
  • Citing Conference Paper
  • October 2024

... The combination of class-agnostic detection and some kind of object re-identification has also been explored in related work. A two-step approach in which objects are first class-agnostically localized based on the Segment Anything Model (SAM) [23] and then feature vectors of DINO v2 [57] are used for re-identifying the novel categories is described in [58]. However, it should be noted that all experiments are performed on synthesized data, which limits the value of the results. ...

ISAR: A Benchmark for Single- and Few-Shot Object Instance Segmentation and Re-Identification
  • Citing Conference Paper
  • January 2024

... Recent years have witnessed great success in implicit neural representations [3], [4], which allow for end-to-end semantic reconstruction [5], [6], [7]. Empowered by foundation visual language models (VLMs) [8], [9], open-vocabulary semantic cues are further employed for zero-shot semantic reconstruction, eliminating human annotation and segmentation-network training for unseen classes [10], [11]. ...

Neural Implicit Vision-Language Feature Fields
  • Citing Conference Paper
  • October 2023

... It integrates semantic understanding of the environment into 3D scene reconstruction and estimates camera pose simultaneously. Traditional semantic SLAM struggles with predicting unknown areas and requires significant map storage [5]. While NeRF-based methods [6]-[9] mitigated these issues, they still suffer from inefficient per-pixel raycasting rendering [10]. ...

Unsupervised Continual Semantic Adaptation Through Neural Rendering
  • Citing Conference Paper
  • June 2023

... Conversely, self-supervised active learning methods automatically generate pseudo labels from maps incrementally built during a mission [16][17][18], without relying on human labelling. However, their applicability to diverse sets of unknown environments is limited since they require large labelled in-domain pre-training datasets to produce high-quality pseudo labels without systematic prediction errors. ...

Continual Adaptation of Semantic Segmentation Using Complementary 2D-3D Data Representations
  • Citing Article
  • October 2022

IEEE Robotics and Automation Letters

... Unsupervised methods, on the other hand, are a more interesting choice for learning object semantics, usually in the form of dynamic/static binary labels; these are data-driven methods that require minimal or no supervision. For example, [23] segments the indoor scene into foreground and background classes; the segmentation is performed w.r.t. the floor plan, so any point that does not match the floor plan is labelled as dynamic, and as static otherwise. Those labels are used in a neural network model to improve agent localization. ...

Self-Improving Semantic Perception for Indoor Localisation