Kenneth Blomqvist’s research while affiliated with ETH Zurich and other places


Publications (13)


Under pressure: learning-based analog gauge reading in the wild
  • Conference Paper

May 2024 · 20 Reads

Maurits Reitsma · Julian Keller · Kenneth Blomqvist ·


Panoptic Vision-Language Feature Fields

March 2024 · 10 Reads · 10 Citations

IEEE Robotics and Automation Letters

Recently, methods have been proposed for 3D open-vocabulary semantic segmentation. Such methods are able to segment scenes into arbitrary classes based on text descriptions provided at runtime. In this paper, we propose, to the best of our knowledge, the first algorithm for open-vocabulary panoptic segmentation in 3D scenes. Our algorithm, Panoptic Vision-Language Feature Fields (PVLFF), learns a semantic feature field of the scene by distilling vision-language features from a pretrained 2D model, and jointly fits an instance feature field through contrastive learning using 2D instance segments on input frames. Despite not being trained on the target classes, our method achieves panoptic segmentation performance comparable to state-of-the-art closed-set 3D systems on the HyperSim, ScanNet and Replica datasets, and additionally outperforms current 3D open-vocabulary systems in terms of semantic segmentation. We ablate the components of our method to demonstrate the effectiveness of our model architecture. Our code will be available at https://github.com/ethz-asl/pvlff.
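As a rough illustration of the contrastive instance objective described in the abstract, the sketch below pulls together rendered instance features that fall inside the same 2D segment and pushes apart features from different segments. It is a generic supervised-contrastive formulation in PyTorch, not the authors' implementation; the function name, tensor shapes and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(inst_feats: torch.Tensor,
                              inst_ids: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
    """Pull features of rays that fall in the same 2D instance segment
    together and push features from different segments apart."""
    feats = F.normalize(inst_feats, dim=-1)                 # (N, D)
    sim = feats @ feats.T / temperature                     # (N, N)
    same = inst_ids.unsqueeze(0) == inst_ids.unsqueeze(1)   # same-segment mask
    eye = torch.eye(len(inst_ids), dtype=torch.bool)
    pos_mask = (same & ~eye).float()
    # log-softmax over all other samples (exclude self-similarity)
    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, float("-inf")),
                                     dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                                  # anchors with at least one positive
    loss = -(log_prob * pos_mask).sum(dim=1)[valid] / pos_counts[valid]
    return loss.mean()

# Toy example: 6 rendered instance features belonging to two 2D segments.
features = torch.randn(6, 16)
segment_ids = torch.tensor([0, 0, 0, 1, 1, 1])
print(instance_contrastive_loss(features, segment_ids))
```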

Fig. 1. Our method enables real-time segmentation of scenes into arbitrary text classes provided at run-time.
Fig. 2. A diagram of the model used for our feature field.
Fig. 3. Randomly sampled 2D segmentation examples from the ScanNet validation set. Top row shows the original RGB images, second row shows our segmentation and the bottom row shows the ground truth segmentation from the ScanNet dataset. Black pixels in the ground truth segmentation correspond to classes not included in the 20 ScanNet evaluation classes.
Fig. 4. Snapshots from real-time zero-shot volumetric segmentation from a fixed viewpoint at given intervals. Our representation is able to learn in real time and is already useful after a dozen seconds. Each image shows the RGB rendering output for the viewpoint, overlaid with the semantic segmentation given the 6 class prompts shown.
Neural Implicit Vision-Language Feature Fields
  • Preprint
  • File available

March 2023 · 133 Reads

Recently, groundbreaking results have been presented on open-vocabulary semantic image segmentation. Such methods segment each pixel in an image into arbitrary categories provided at run-time in the form of text prompts, as opposed to a fixed set of classes defined at training time. In this work, we present a zero-shot volumetric open-vocabulary semantic scene segmentation method. Our method builds on the insight that we can fuse image features from a vision-language model into a neural implicit representation. We show that the resulting feature field can be segmented into different classes by assigning points to natural language text prompts. The implicit volumetric representation enables us to segment the scene both in 3D and 2D by rendering feature maps from any given viewpoint of the scene. We show that our method works on noisy real-world data and can run in real-time on live sensor data, dynamically adjusting to text prompts. We also present quantitative comparisons on the ScanNet dataset.
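The open-vocabulary query step described above can be pictured as a cosine-similarity lookup between rendered feature vectors and text-prompt embeddings. The snippet below is a minimal sketch of that idea, with random tensors standing in for the feature-field output and for CLIP-style text embeddings; `open_vocab_segment` and the shapes are hypothetical, not the paper's API.

```python
import torch
import torch.nn.functional as F

def open_vocab_segment(rendered_features: torch.Tensor,
                       text_embeddings: torch.Tensor) -> torch.Tensor:
    """Assign each pixel's rendered feature to the closest text prompt.

    rendered_features: (H, W, D) feature map rendered from the feature field.
    text_embeddings:   (C, D) embeddings of the C class prompts.
    Returns an (H, W) map of class indices.
    """
    feats = F.normalize(rendered_features, dim=-1)
    texts = F.normalize(text_embeddings, dim=-1)
    similarity = torch.einsum("hwd,cd->hwc", feats, texts)   # cosine similarity
    return similarity.argmax(dim=-1)

# Random stand-ins for a rendered feature map and two text prompts.
labels = open_vocab_segment(torch.randn(120, 160, 512), torch.randn(2, 512))
print(labels.shape)  # torch.Size([120, 160])
```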


Baking in the Feature: Accelerating Volumetric Segmentation by Rendering Feature Maps

September 2022 · 5 Reads · 1 Citation
Methods have recently been proposed that densely segment 3D volumes into classes using only color images and expert supervision in the form of sparse semantically annotated pixels. While impressive, these methods still require a relatively large amount of supervision, and segmenting an object can take several minutes in practice. Such systems typically only optimize their representation on the particular scene they are fitting, without leveraging any prior information from previously seen images. In this paper, we propose to use features extracted with models trained on large existing datasets to improve segmentation performance. We bake this feature representation into a Neural Radiance Field (NeRF) by volumetrically rendering feature maps and supervising on features extracted from each input image. We show that by baking this representation into the NeRF, we make the subsequent classification task much easier. Our experiments show that our method achieves higher segmentation accuracy with fewer semantic annotations than existing methods over a wide range of scenes.
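The core operation, rendering feature maps volumetrically the same way NeRF renders color, can be sketched as alpha-compositing per-sample feature vectors along each ray and supervising the result against 2D features from a pretrained model. The code below is an illustrative sketch, not the paper's implementation; the function name, shapes and the MSE supervision are assumptions.

```python
import torch
import torch.nn.functional as F

def composite_features(point_feats, densities, deltas):
    """Alpha-composite per-sample features along each ray, as NeRF does for colors.

    point_feats: (R, S, D) features predicted at S samples along R rays.
    densities:   (R, S) volume densities.
    deltas:      (R, S) distances between consecutive samples.
    Returns (R, D) rendered feature vectors, one per ray.
    """
    alpha = 1.0 - torch.exp(-densities * deltas)                        # (R, S)
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = alpha * trans                                             # (R, S)
    return (weights.unsqueeze(-1) * point_feats).sum(dim=1)

# Supervise rendered features against 2D features from a pretrained model.
rays, samples, dim = 1024, 64, 384
pred = composite_features(torch.randn(rays, samples, dim),
                          torch.rand(rays, samples),
                          torch.full((rays, samples), 0.02))
target = torch.randn(rays, dim)   # feature at each ray's pixel in the input image
feature_loss = F.mse_loss(pred, target)
print(feature_loss.item())
```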



Fig. 1. StereoLabel, our keypoint labeling tool. The user is presented with two images of the scene to label. The images are selected to maximize the orthogonality of the views.
Fig. 2. The components for both of the proposed keypoint tracking pipelines.
Fig. 3. (a) Valve setup showing keypoints for the valve. (b) An image from the cup tracking scene.
Semi-automatic 3D Object Keypoint Annotation and Detection for the Masses

January 2022 · 236 Reads

Creating computer vision datasets requires careful planning and lots of time and effort. In robotics research, we often have to use standardized objects, such as the YCB object set, for tasks such as object tracking, pose estimation, grasping and manipulation, as there are datasets and pre-learned methods available for these objects. This limits the impact of our research, since learning-based computer vision methods can only be used in scenarios that are supported by existing datasets. In this work, we present a full object keypoint tracking toolkit, encompassing the entire process from data collection and labeling to model learning and evaluation. We present a semi-automatic way of collecting and labeling datasets using a wrist-mounted camera on a standard robotic arm. Using our toolkit and method, we are able to obtain a working 3D object keypoint detector and go through the whole process of data collection, annotation and learning in just a couple of hours of active time.
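A minimal sketch of the projection step that makes this kind of semi-automatic labeling possible: with the camera pose known from the arm's kinematics, a keypoint labeled once in 3D can be reprojected into every other image of the scene. The function below uses a standard pinhole model; the name and the example intrinsics are assumptions, not the toolkit's API.

```python
import numpy as np

def project_keypoint(p_world, T_world_cam, K):
    """Project a 3D keypoint (world frame) into an image with a known camera pose.

    p_world:     (3,) keypoint position in the world frame.
    T_world_cam: (4, 4) camera-to-world pose, e.g. from the arm's forward kinematics.
    K:           (3, 3) pinhole camera intrinsics.
    Returns (2,) pixel coordinates (u, v).
    """
    T_cam_world = np.linalg.inv(T_world_cam)                   # world -> camera
    p_cam = T_cam_world[:3, :3] @ p_world + T_cam_world[:3, 3]
    uv = K @ p_cam
    return uv[:2] / uv[2]

# Toy example: a keypoint 1 m in front of a camera at the world origin.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
print(project_keypoint(np.array([0.0, 0.0, 1.0]), np.eye(4), K))  # ~[320. 240.]
```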


Citations (7)


... The combination of class-agnostic detection and some kind of object re-identification has also been explored in related work. A two-step approach in which objects are first class-agnostically localized based on the Segment Anything Model (SAM) [23] and then feature vectors of DINO v2 [57] are used for re-identifying the novel categories is described in [58]. However, it should be noted that all experiments are performed on synthesized data, which limits the value of the results. ...

Reference:

Detection of Novel Objects without Fine-Tuning in Assembly Scenarios by Class-Agnostic Object Detection and Object Re-Identification
ISAR: A Benchmark for Single- and Few-Shot Object Instance Segmentation and Re-Identification
  • Citing Conference Paper
  • January 2024

... Recent years have witnessed great success in implicit neural representations [3], [4], which allow for end-to-end semantic reconstruction [5], [6], [7]. Empowered by foundation vision-language models (VLMs) [8], [9], open-vocabulary semantic cues are further employed for zero-shot semantic reconstruction, eliminating human annotation and segmentation-network training for unseen classes [10], [11]. ...

Neural Implicit Vision-Language Feature Fields
  • Citing Conference Paper
  • October 2023

... Besides, benefiting from the joint encoding [64], we additionally apply a low-frequency positional encoding [37] for reasoning about spatial semantics. Additionally, as shown in [4], semantic information is lower-frequency than color information. So, for semantic property decoding, we concatenate the positional encoding with the latent feature as the input. ...

Baking in the Feature: Accelerating Volumetric Segmentation by Rendering Feature Maps
  • Citing Conference Paper
  • October 2023

... Keypoint annotation can be a costly process, requiring multiple points to be labeled in each image of a large image dataset (e.g., with tens of thousands of images). We draw inspiration from prior methods of speeding up keypoint labeling by using known camera poses at training time [41], [42], enabling a small number of manually labeled keypoints to be projected into many images. Figure 4A presents an overview of the annotation pipeline. ...

Semi-automatic 3D Object Keypoint Annotation and Detection for the Masses
  • Citing Conference Paper
  • August 2022

... In our interaction model, we envision a minimal brushing approach to determine points of interest and discriminate them against the rest of the data points. This brushing approach allows us to translate human intent into input, and it is, in a sense, the 3D equivalent of the minimal brushing interaction on regions of interest in 2D images used in AI image manipulation [3]. ...

Baking in the Feature: Accelerating Volumetric Segmentation by Rendering Feature Maps
  • Citing Preprint
  • September 2022

... The knowledge of artificial agents usually includes databases of objects that they do not need to learn, and the steps necessary to achieve goals are specified in advance. Blomqvist et al. (2020) presented a mobile manipulation system capable of perception, localization, navigation, motion planning, and grasping. The artificial agent is mounted on an omnidirectional mobile base and can navigate using a pre-built global 3D map of its environment. ...

Go Fetch: Mobile Manipulation in Unstructured Environments