May 2024 · 20 Reads
March 2024 · 10 Reads · 10 Citations
IEEE Robotics and Automation Letters
Recently, methods have been proposed for 3D open-vocabulary semantic segmentation. Such methods are able to segment scenes into arbitrary classes based on text descriptions provided during runtime. In this paper, we propose, to the best of our knowledge, the first algorithm for open-vocabulary panoptic segmentation in 3D scenes. Our algorithm, Panoptic Vision-Language Feature Fields (PVLFF), learns a semantic feature field of the scene by distilling vision-language features from a pretrained 2D model, and jointly fits an instance feature field through contrastive learning using 2D instance segments on input frames. Despite not being trained on the target classes, our method achieves panoptic segmentation performance similar to the state-of-the-art closed-set 3D systems on the HyperSim, ScanNet and Replica datasets and additionally outperforms current 3D open-vocabulary systems in terms of semantic segmentation. We ablate the components of our method to demonstrate the effectiveness of our model architecture. Our code will be available at https://github.com/ethz-asl/pvlff.
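As a rough illustration of the instance branch described above, the sketch below shows one way a contrastive objective over rendered instance features could look, assuming per-pixel features rendered from the instance feature field and their 2D instance-segment IDs are already available. The function name, temperature and masking details are illustrative assumptions, not taken from the PVLFF code.

```python
import torch
import torch.nn.functional as F

def contrastive_instance_loss(features, instance_ids, temperature=0.1):
    """Pull rendered features of pixels in the same 2D instance segment
    together and push features of different segments apart.

    features:     (N, D) instance features rendered at N sampled pixels
    instance_ids: (N,)   2D instance-segment ID of each pixel
    """
    features = F.normalize(features, dim=-1)             # cosine-similarity space
    logits = features @ features.t() / temperature       # (N, N) pairwise similarities
    eye = torch.eye(len(features), dtype=torch.bool, device=features.device)
    pos_mask = (instance_ids.unsqueeze(0) == instance_ids.unsqueeze(1)) & ~eye

    # Supervised-contrastive style objective: for each anchor pixel, maximize
    # the log-probability of its positives against all other pixels.
    logits = logits.masked_fill(eye, float('-inf'))       # exclude self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)) / pos_counts
    return loss[pos_mask.any(dim=1)].mean()               # skip anchors with no positive
```

The intuition, consistent with the abstract, is that pixels within the same 2D segment pull their rendered features together, and because the field is shared across views, consistent 3D instance clusters can emerge without any 3D instance labels.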
January 2024 · 5 Reads · 2 Citations
October 2023 · 3 Reads · 9 Citations
October 2023 · 9 Reads · 8 Citations
May 2023 · 19 Reads · 1 Citation
March 2023 · 133 Reads
Recently, groundbreaking results have been presented on open-vocabulary semantic image segmentation. Such methods segment each pixel in an image into arbitrary categories provided at run-time in the form of text prompts, as opposed to a fixed set of classes defined at training time. In this work, we present a zero-shot volumetric open-vocabulary semantic scene segmentation method. Our method builds on the insight that we can fuse image features from a vision-language model into a neural implicit representation. We show that the resulting feature field can be segmented into different classes by assigning points to natural language text prompts. The implicit volumetric representation enables us to segment the scene both in 3D and 2D by rendering feature maps from any given viewpoint of the scene. We show that our method works on noisy real-world data and can run in real time on live sensor data, dynamically adjusting to text prompts. We also present quantitative comparisons on the ScanNet dataset.
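The zero-shot assignment step described above can be pictured as a cosine-similarity lookup between rendered features and prompt embeddings. The sketch below assumes the feature field has already been rendered into per-point features and that prompt embeddings from a CLIP-style text encoder are precomputed; names and dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def assign_points_to_prompts(point_features, text_embeddings):
    """Zero-shot labelling: give each 3D point (or rendered pixel) the class
    whose text-prompt embedding it is most similar to.

    point_features:  (N, D) features rendered from the fused feature field
    text_embeddings: (C, D) embeddings of C natural-language prompts,
                     assumed to come from a CLIP-style text encoder
    """
    point_features = F.normalize(point_features, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)
    similarity = point_features @ text_embeddings.t()    # (N, C) cosine similarities
    return similarity.argmax(dim=-1)                     # class index per point

# Stand-in example with random tensors; real inputs would come from the
# feature field and the vision-language model's text encoder.
labels = assign_points_to_prompts(torch.randn(1000, 512), torch.randn(5, 512))
```

Because the prompt embeddings enter only at this last step, the class list can be changed at run-time without touching the underlying representation.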
September 2022 · 5 Reads · 1 Citation
Methods have recently been proposed that densely segment 3D volumes into classes using only color images and expert supervision in the form of sparse semantically annotated pixels. While impressive, these methods still require a relatively large amount of supervision and segmenting an object can take several minutes in practice. Such systems typically only optimize their representation on the particular scene they are fitting, without leveraging any prior information from previously seen images. In this paper, we propose to use features extracted with models trained on large existing datasets to improve segmentation performance. We bake this feature representation into a Neural Radiance Field (NeRF) by volumetrically rendering feature maps and supervising on features extracted from each input image. We show that by baking this representation into the NeRF, we make the subsequent classification task much easier. Our experiments show that our method achieves higher segmentation accuracy with fewer semantic annotations than existing methods over a wide range of scenes.
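The feature-rendering step described above follows the usual NeRF compositing equations, only applied to feature vectors instead of colors. Below is a minimal PyTorch sketch under that reading of the abstract; tensor shapes and the small epsilon are illustrative, not the paper's implementation.

```python
import torch

def render_features(sigmas, features, deltas):
    """Alpha-composite per-sample features along each ray, exactly as NeRF
    composites color, so rendered feature maps can be supervised with
    features extracted from the corresponding input image.

    sigmas:   (R, S)    densities at S samples along R rays
    features: (R, S, D) feature vectors at those samples
    deltas:   (R, S)    distances between consecutive samples
    """
    alphas = 1.0 - torch.exp(-sigmas * deltas)                         # (R, S)
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=-1)                # transmittance
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = alphas * trans                                           # (R, S)
    return (weights.unsqueeze(-1) * features).sum(dim=1)               # (R, D)

# Supervision would then be, e.g., an L2 loss between the rendered features
# and features extracted from the matching pixels of the input image:
# loss = ((render_features(sigmas, feats, deltas) - target_feats) ** 2).mean()
```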
August 2022 · 5 Reads · 9 Citations
January 2022 · 236 Reads
Creating computer vision datasets requires careful planning and lots of time and effort. In robotics research, we often have to use standardized objects, such as the YCB object set, for tasks such as object tracking, pose estimation, grasping and manipulation, as there are datasets and pre-learned methods available for these objects. This limits the impact of our research, since learning-based computer vision methods can only be used in scenarios that are supported by existing datasets. In this work, we present a full object keypoint tracking toolkit, encompassing the entire process from data collection and labeling to model learning and evaluation. We present a semi-automatic way of collecting and labeling datasets using a wrist-mounted camera on a standard robotic arm. Using our toolkit and method, we are able to obtain a working 3D object keypoint detector and go through the whole process of data collection, annotation and learning in just a couple of hours of active time.
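A sketch of the geometric step that such semi-automatic labeling typically relies on: once a keypoint has been annotated in a few frames and lifted to world coordinates, the known wrist-camera poses allow it to be reprojected into every other image. The function below is illustrative, assumes standard pinhole intrinsics and a camera-to-world pose per frame, and is not the toolkit's actual API.

```python
import numpy as np

def project_keypoint(point_world, T_world_cam, K):
    """Project a 3D keypoint (world frame) into an image whose camera pose
    is known, e.g. from the robot arm's forward kinematics.

    point_world: (3,)   keypoint in world coordinates
    T_world_cam: (4, 4) camera-to-world transform of this frame
    K:           (3, 3) pinhole intrinsics
    Returns (u, v) pixel coordinates, or None if the point is behind the camera.
    """
    T_cam_world = np.linalg.inv(T_world_cam)
    p_cam = T_cam_world[:3, :3] @ point_world + T_cam_world[:3, 3]
    if p_cam[2] <= 0:
        return None
    uv = K @ (p_cam / p_cam[2])
    return uv[0], uv[1]
```

Reprojection of this kind is what turns a handful of manual clicks into labels for thousands of frames, which is how the active annotation time can stay in the range of a couple of hours.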
... The combination of class-agnostic detection and some kind of object re-identification has also been explored in related work. A two-step approach in which objects are first localized class-agnostically with the Segment Anything Model (SAM) [23] and then re-identified as novel categories using DINOv2 [57] feature vectors is described in [58]. However, it should be noted that all experiments are performed on synthesized data, which limits the value of the results. ...
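To make the two-step scheme in this excerpt concrete, the sketch below matches class-agnostic detections to novel categories by cosine similarity of their embeddings; the embeddings are assumed to come from a DINOv2-like backbone applied to SAM mask crops, and the threshold and function name are illustrative rather than taken from [58].

```python
import numpy as np

def reidentify_crops(crop_embeddings, reference_embeddings, threshold=0.5):
    """Assign class-agnostic detections (e.g. SAM mask crops) to novel
    categories via cosine similarity of feature vectors (e.g. from DINOv2).

    crop_embeddings:      (N, D) one embedding per detected crop
    reference_embeddings: (C, D) one embedding per reference category
    Returns a category index per crop, or -1 if nothing is similar enough.
    """
    crops = crop_embeddings / np.linalg.norm(crop_embeddings, axis=1, keepdims=True)
    refs = reference_embeddings / np.linalg.norm(reference_embeddings, axis=1, keepdims=True)
    sims = crops @ refs.T                                 # (N, C) cosine similarities
    best = sims.argmax(axis=1)
    return np.where(sims.max(axis=1) >= threshold, best, -1)
```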
January 2024
... We first introduce the experiment setup in Sec. 4 ...
March 2024
IEEE Robotics and Automation Letters
... Recent years have witnessed great success in implicit neural representations [3], [4], which allow for end-to-end semantic reconstruction [5], [6], [7]. Empowered by foundation vision-language models (VLMs) [8], [9], open-vocabulary semantic cues are further employed for zero-shot semantic reconstruction, eliminating human annotation and segmentation-network training for unseen classes [10], [11]. ...
October 2023
... Besides, benefiting from the joint encoding [64], we additionally apply a low-frequency positional encoding [37] for reasoning about spatial semantics. As shown in [4], semantic information is lower-frequency than color information. For semantic property decoding, we therefore concatenate the positional encoding with the latent feature as the input. ...
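A minimal sketch of the decoding scheme this excerpt describes: a positional encoding restricted to a few low-frequency bands is concatenated with the latent feature before the semantic head. The number of bands and the layer sizes below are illustrative assumptions, not values from the cited paper.

```python
import math
import torch
import torch.nn as nn

def low_freq_encoding(xyz, num_bands=2):
    """Sine/cosine positional encoding with only a few low-frequency bands,
    reflecting the observation that semantics vary more slowly than color.
    xyz: (N, 3) sample positions. Returns (N, 3 + 6 * num_bands).
    """
    outs = [xyz]
    for i in range(num_bands):
        freq = (2.0 ** i) * math.pi
        outs += [torch.sin(freq * xyz), torch.cos(freq * xyz)]
    return torch.cat(outs, dim=-1)

class SemanticHead(nn.Module):
    """Decode semantic logits from the latent feature concatenated with the
    low-frequency positional encoding (sizes are illustrative)."""
    def __init__(self, latent_dim=64, num_classes=20, num_bands=2):
        super().__init__()
        self.num_bands = num_bands
        in_dim = latent_dim + 3 + 6 * num_bands
        self.mlp = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, num_classes))

    def forward(self, latent, xyz):
        encoding = low_freq_encoding(xyz, self.num_bands)
        return self.mlp(torch.cat([latent, encoding], dim=-1))
```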
October 2023
... Keypoint annotation can be a costly process, requiring multiple points to be labeled in each image of a large image dataset (e.g., with tens of thousands of images). We draw inspiration from prior methods of speeding up keypoint labeling by using known camera poses at training time 41,42, enabling a small number of manually labeled keypoints to be projected into many images. Figure 4A presents an overview of the annotation pipeline. ...
August 2022
... In our interaction model, we envision a minimal brushing approach to determine points of interest and discriminate them against the rest of the data points. This brushing approach allows us to translate human intent into input, and it is in some sense the 3D equivalent of minimal brushing interaction on regions of interest in 2D images used in AI image manipulation [3]. ...
September 2022
... The knowledge of artificial agents usually includes databases of objects that they do not need to learn, and the steps necessary to achieve goals are specified in advance. Blomqvist et al. (2020) presented a mobile manipulation system capable of perception, localization, navigation, motion planning, and grasping. The artificial agent is mounted on an omnidirectional mobile base and can navigate using a 3D global pre-built map of its environment. ...
April 2020