Conference Paper

IMU-Aided Event-based Stereo Visual Odometry


... All the values for EVO [29] are taken from [22]. ESVO values are taken from the authors' latest paper [25], except for the TUM-VIE dataset on which we re-ran ESVO [37]. Rotation errors on the RPG dataset were taken from [22]. ...
... Similar to recent event-based VO methods in the literature [22,25,37], we perform quantitative evaluation by computing the root-mean-square error (RMSE) of the Absolute Trajectory Error (ATE) and the Absolute Rotation Error (ARE) on tracked camera poses using the tool in [34] (Tab. 2). ...
... For this dataset, we found different ATE values reported for ESVO in different papers, which may be due to its non-deterministic nature. Here, we report the values from the latest work by the same authors [25]. Since ARE is missing from that paper, we take them from [22]. ...
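For readers unfamiliar with these metrics, the sketch below shows how RMSE ATE and ARE are commonly computed once ground-truth and estimated poses have been time-associated and aligned. It is a minimal illustration under those assumptions, not the evaluation tool in [34], and the function names are our own.

```python
import numpy as np

def ate_rmse(gt_positions, est_positions):
    """RMSE of the Absolute Trajectory Error between time-associated
    ground-truth and estimated camera positions (N x 3 arrays),
    assuming the estimated trajectory has already been aligned."""
    errors = np.linalg.norm(gt_positions - est_positions, axis=1)
    return np.sqrt(np.mean(errors ** 2))

def are_rmse(gt_rotations, est_rotations):
    """RMSE of the Absolute Rotation Error (geodesic angle, in degrees)
    between lists of 3x3 rotation matrices."""
    angles = []
    for R_gt, R_est in zip(gt_rotations, est_rotations):
        R_err = R_gt.T @ R_est
        cos = np.clip((np.trace(R_err) - 1.0) / 2.0, -1.0, 1.0)
        angles.append(np.degrees(np.arccos(cos)))
    return np.sqrt(np.mean(np.square(angles)))
```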
Preprint
Full-text available
Visual Odometry (VO) and SLAM are fundamental components for spatial perception in mobile robots. Despite enormous progress in the field, current VO/SLAM systems are limited by their sensors' capability. Event cameras are novel visual sensors that offer advantages to overcome the limitations of standard cameras, enabling robots to expand their operating range to challenging scenarios, such as high-speed motion and high dynamic range illumination. We propose a novel event-based stereo VO system by combining two ideas: a correspondence-free mapping module that estimates depth by maximizing ray density fusion and a tracking module that estimates camera poses by maximizing edge-map alignment. We evaluate the system comprehensively on five real-world datasets, spanning a variety of camera types (manufacturers and spatial resolutions) and scenarios (driving, flying drone, hand-held, egocentric, etc.). The quantitative and qualitative results demonstrate that our method outperforms the state of the art in the majority of the test sequences by a margin, e.g., trajectory error reductions of 45% on the RPG dataset, 61% on the DSEC dataset, and 21% on the TUM-VIE dataset. To benefit the community and foster research on event-based perception systems, we release the source code and results: https://github.com/tub-rip/ES-PTAM
... Event-based Odometry: Existing Event Odometry (EO) approaches are developed specifically for event processing. While some approaches combine events with frames [8,24,25,46,67], event-only approaches can be classified as monocular EO [35,55], monocular EO with IMU [23,26], stereo EO [20,68], and stereo EO with IMU [44,53,54]. Due to the short history of event cameras, these systems require extensive research and development efforts to work reliably in practice. ...
Preprint
Event-based keypoint detection and matching holds significant potential, enabling the integration of event sensors into highly optimized Visual SLAM systems developed for frame cameras over decades of research. Unfortunately, existing approaches struggle with the motion-dependent appearance of keypoints and the complex noise prevalent in event streams, resulting in severely limited feature matching capabilities and poor performance on downstream tasks. To mitigate this problem, we propose SuperEvent, a data-driven approach to predict stable keypoints with expressive descriptors. Due to the absence of event datasets with ground truth keypoint labels, we leverage existing frame-based keypoint detectors on readily available event-aligned and synchronized gray-scale frames for self-supervision: we generate temporally sparse keypoint pseudo-labels considering that events are a product of both scene appearance and camera motion. Combined with our novel, information-rich event representation, we enable SuperEvent to effectively learn robust keypoint detection and description in event streams. Finally, we demonstrate the usefulness of SuperEvent by its integration into a modern sparse keypoint and descriptor-based SLAM framework originally developed for traditional cameras, surpassing the state-of-the-art in event-based SLAM by a wide margin. Source code and multimedia material are available at smartroboticslab.github.io/SuperEvent.
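The self-supervision idea above can be illustrated with a short sketch: a frame-based detector run on event-aligned, synchronized grayscale frames produces sparse keypoint pseudo-labels. This is a generic stand-in using OpenCV's Shi-Tomasi detector, not SuperEvent's actual labeling pipeline; all names and parameter values here are our own.

```python
import cv2
import numpy as np

def keypoint_pseudo_labels(gray_frame, max_corners=200):
    """Detect frame-based keypoints on a grayscale frame that is
    time-aligned with the event stream; the returned pixel locations
    serve as sparse pseudo-labels for training an event-based detector."""
    corners = cv2.goodFeaturesToTrack(
        gray_frame, maxCorners=max_corners,
        qualityLevel=0.01, minDistance=7)
    if corners is None:
        return np.empty((0, 2), dtype=np.float32)
    return corners.reshape(-1, 2)  # (N, 2) array of (x, y) pixel positions
```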
... The mapping module builds a semi-dense 3D scene map, and the tracking module determines the camera pose by addressing the 3D-2D registration problem. Building upon this, Niu et al. [44] integrate IMU data to present a direct visual-inertial odometry. A more compact event representation is introduced, called adaptive accumulation, which preserves relatively complete edges while maintaining a high signal-to-noise ratio. ...
Preprint
Full-text available
Pose tracking of uncooperative spacecraft is an essential technology for space exploration and on-orbit servicing, which remains an open problem. Event cameras possess numerous advantages, such as high dynamic range, high temporal resolution, and low power consumption. These attributes hold the promise of overcoming challenges encountered by conventional cameras, including motion blur and extreme illumination, among others. To address the standard on-orbit observation missions, we propose a line-based pose tracking method for uncooperative spacecraft utilizing a stereo event camera. To begin with, we estimate the wireframe model of uncooperative spacecraft, leveraging the spatio-temporal consistency of stereo event streams for line-based reconstruction. Then, we develop an effective strategy to establish correspondences between events and projected lines of uncooperative spacecraft. Using these correspondences, we formulate the pose tracking as a continuous optimization process over 6-DOF motion parameters, achieved by minimizing event-line distances. Moreover, we construct a stereo event-based uncooperative spacecraft motion dataset, encompassing both simulated and real events. The proposed method is quantitatively evaluated through experiments conducted on our self-collected dataset, demonstrating an improvement in terms of effectiveness and accuracy over competing methods. The code will be open-sourced at https://github.com/Zibin6/SE6PT.
... The tracking module projects the accumulated event frames onto these reference frames for pose estimation. Building on these two modules, different event representation methods have been introduced in event-based visual odometry, see [2]-[4]. However, both mapping and tracking modules of these systems depend on fixed-rate triggers determined by the platform's processing capacity, resulting in considerable computational overhead. ...
Preprint
Full-text available
Event-based visual odometry has recently gained attention for its high accuracy and real-time performance in fast-motion systems. Unlike traditional synchronous estimators that rely on constant-frequency (zero-order) triggers, event-based visual odometry can actively accumulate information to generate temporally high-order estimation triggers. However, existing methods primarily focus on adaptive event representation after estimation triggers, neglecting the decision-making process for efficient temporal triggering itself. This oversight leads to computational redundancy and noise accumulation. In this paper, we introduce a temporally high-order event-based visual odometry with spiking event accumulation networks (THE-SEAN). To the best of our knowledge, it is the first event-based visual odometry capable of dynamically adjusting its estimation trigger decision in response to motion and environmental changes. Inspired by biological systems that regulate hormone secretion to modulate heart rate, a self-supervised spiking neural network is designed to generate estimation triggers. This spiking network extracts temporal features to produce triggers, with rewards based on block matching points and the Fisher information matrix (FIM) trace acquired from the estimator itself. Finally, THE-SEAN is evaluated across several open datasets, demonstrating average improvements of 13% in estimation accuracy, 9% in smoothness, and 38% in triggering efficiency compared to the state-of-the-art methods.
... The mapping module builds a semi-dense 3D scene map, and the tracking module determines the camera pose by addressing the 3D-2D registration problem. Building upon this, Niu et al. [44] integrate IMU data to present a direct visual-inertial odometry. A more compact event representation is introduced, called adaptive accumulation, which preserves relatively complete edges while maintaining a high signal-to-noise ratio. ...
Article
Full-text available
Pose tracking of uncooperative spacecraft is an essential technology for space exploration and on-orbit servicing, which remains an open problem. Event cameras possess numerous advantages, such as high dynamic range, high temporal resolution, and low power consumption. These attributes hold the promise of overcoming challenges encountered by conventional cameras, including motion blur and extreme illumination, among others. To address the standard on-orbit observation missions, we propose a line-based pose tracking method for uncooperative spacecraft utilizing a stereo event camera. To begin with, we estimate the wireframe model of uncooperative spacecraft, leveraging the spatio-temporal consistency of stereo event streams for line-based reconstruction. Then, we develop an effective strategy to establish correspondences between events and projected lines of uncooperative spacecraft. Using these correspondences, we formulate the pose tracking as a continuous optimization process over 6-DOF motion parameters, achieved by minimizing event-line distances. Moreover, we construct a stereo event-based uncooperative spacecraft motion dataset, encompassing both simulated and real events. The proposed method is quantitatively evaluated through experiments conducted on our self-collected dataset, demonstrating an improvement in terms of effectiveness and accuracy over competing methods. The code will be open-sourced at https://github.com/Zibin6/SE6PT.
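The core residual of such line-based tracking is easy to state: the perpendicular distance between an event pixel and a projected model line. Below is a minimal sketch of that residual and the resulting cost, with the line projection and the event-line correspondence steps assumed to be given; it is our own simplification, not the paper's full pipeline.

```python
import numpy as np

def event_line_distance(event_xy, line):
    """Perpendicular distance of an event pixel to a projected model line
    given in homogeneous form l = (a, b, c), i.e. a*x + b*y + c = 0."""
    a, b, c = line
    x, y = event_xy
    return abs(a * x + b * y + c) / np.hypot(a, b)

def total_cost(events_xy, lines, correspondences):
    """Sum of squared event-line distances over established event-to-line
    correspondences (pairs of indices); pose tracking would minimize this
    cost over the 6-DOF motion parameters that move the projected lines."""
    return sum(event_line_distance(events_xy[i], lines[j]) ** 2
               for i, j in correspondences)
```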
... For stereo configurations, Zhou et al. [39] proposed ESVO, the first event-based stereo visual odometry, which maximized spatio-temporal consistency using TS. Niu et al. [40] extended this work to incorporate gyroscope measurements to mitigate degeneracy issues. Tang et al. [41] introduced an adaptive decay-based TS for feature extraction and proposed a polarity-aware strategy to enhance robustness. ...
Preprint
Event cameras, as bio-inspired sensors, are asynchronously triggered with high temporal resolution compared to intensity cameras. Recent work has focused on fusing the event measurements with inertial measurements to enable ego-motion estimation in high-speed and HDR environments. However, existing methods predominantly rely on IMU preintegration designed mainly for synchronous sensors and discrete-time frameworks. In this paper, we propose a continuous-time preintegration method based on the Temporal Gaussian Process (TGP) called GPO. Concretely, we model the preintegration as a time-indexed motion trajectory and leverage an efficient two-step optimization to initialize the precise preintegration pseudo-measurements. Our method achieves linear and constant time cost for initialization and query, respectively. To further validate the proposal, we leverage the GPO to design an asynchronous event-inertial odometry and compare it with other asynchronous fusion schemes within the same odometry system. Experiments conducted on both public and self-collected datasets demonstrate that the proposed GPO offers significant advantages in terms of precision and efficiency, outperforming existing approaches in handling asynchronous sensor fusion.
... ESVO [18] proposed the first event-based stereo visual odometry by maximizing the spatio-temporal consistency using TS. Extending [18], Niu et al. [19] proposed a gyroscope measurement prior via pre-integration to circumvent the degeneracy issue. Tang et al. [20] utilized an adaptive decay-based TS to extract features and proposed a polarity-aware strategy to enhance robustness; these tracked features and IMU data were then fused in an MSCKF state estimator. Overall, all these methods transform asynchronous event streams into synchronous data associations and convert high-rate IMU data into inter-frame motion constraints through IMU pre-integration. ...
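The gyroscope prior mentioned in these snippets builds on standard IMU pre-integration, which accumulates bias-corrected angular-velocity samples into a relative-rotation pseudo-measurement that is independent of the absolute state. As a compact reference, in the common formulation (e.g., Forster et al.):

```latex
% Relative rotation pre-integrated between times t_i and t_j from
% gyroscope samples \tilde{\omega}_k, gyro bias b_g, and sample period \Delta t:
\Delta R_{ij} \;=\; \prod_{k=i}^{j-1} \operatorname{Exp}\!\big( (\tilde{\omega}_k - b_g)\,\Delta t \big)
```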
Preprint
Event cameras, when combined with inertial sensors, show significant potential for motion estimation in challenging scenarios, such as high-speed maneuvers and low-light environments. There are many methods for producing such estimations, but most boil down to a synchronous discrete-time fusion problem. However, the asynchronous nature of event cameras and their unique fusion mechanism with inertial sensors remain underexplored. In this paper, we introduce a monocular event-inertial odometry method called AsynEIO, designed to fuse asynchronous event and inertial data within a unified Gaussian Process (GP) regression framework. Our approach incorporates an event-driven frontend that tracks feature trajectories directly from raw event streams at a high temporal resolution. These tracked feature trajectories, along with various inertial factors, are integrated into the same GP regression framework to enable asynchronous fusion. By deriving analytical residual Jacobians and noise models, our method constructs a factor graph that is iteratively optimized and pruned using a sliding-window optimizer. Comparative assessments highlight the performance of different inertial fusion strategies, suggesting optimal choices for varying conditions. Experimental results on both public datasets and our own event-inertial sequences indicate that AsynEIO outperforms existing methods, especially in high-speed and low-illumination scenarios.
... Unit: ATE/RMSE (cm); the whole ground-truth trajectory is aligned with the estimated poses. Baseline numbers (EVO, ESVO, and ES-PTAM) are taken from [46], while DH-PTAM and Ultimate-SLAM are sourced from [22], and DEVO is from [9]. For the second comparison (same unit and alignment), baseline numbers (ESVO and ESVO+IMU) are taken from [50], and ES-PTAM is sourced from [46]. ... importance of complementarity between visual events and IMU sensors. ...
Preprint
Event cameras are bio-inspired, motion-activated sensors that demonstrate impressive potential in handling challenging situations, such as motion blur and high dynamic range. Despite their promise, existing event-based simultaneous localization and mapping (SLAM) approaches exhibit limited performance in real-world applications. On the other hand, state-of-the-art SLAM approaches incorporate deep neural networks for better robustness and applicability. However, there is a lack of research in fusing learning-based event SLAM methods with the IMU, which could be indispensable to push event-based SLAM to large-scale, low-texture, or complex scenarios. In this paper, we propose DEIO, the first monocular deep event-inertial odometry framework that combines a learning-based method with traditional nonlinear graph-based optimization. Specifically, we tightly integrate a trainable event-based differentiable bundle adjustment (e-DBA) with IMU pre-integration in a factor graph that employs keyframe-based sliding-window optimization. Numerical experiments on nine public challenging datasets show that our method achieves superior performance compared with image-based and event-based benchmarks. The source code is available at: https://github.com/arclab-hku/DEIO.
... The most recent works by Junkai Niu et al. have continued to push the boundaries of event-based stereo VO. Their IMU-aided event-based stereo visual odometry system [39] and ESVO2 [40] incorporate direct visual-inertial odometry with stereo event cameras, significantly improving real-time pose estimation accuracy. ...
Preprint
Full-text available
Event cameras, inspired by biological vision, are asynchronous sensors that detect changes in brightness, offering notable advantages in environments characterized by high-speed motion, low lighting, or wide dynamic range. These distinctive properties render event cameras particularly effective for sensor fusion in robotics and computer vision, especially in enhancing traditional visual or LiDAR-inertial odometry. Conventional frame-based cameras suffer from limitations such as motion blur and drift, which can be mitigated by the continuous, low-latency data provided by event cameras. Similarly, LiDAR-based odometry encounters challenges related to the loss of geometric information in environments such as corridors. To address these limitations, unlike the existing event camera-related surveys, this paper presents a comprehensive overview of recent advancements in event-based sensor fusion for odometry applications, particularly investigating fusion strategies that incorporate frame-based cameras, inertial measurement units (IMUs), and LiDAR. The survey critically assesses the contributions of these fusion methods to improving odometry performance in complex environments, while highlighting key applications, and discussing the strengths, limitations, and unresolved challenges. Additionally, it offers insights into potential future research directions to advance event-based sensor fusion for next-generation odometry applications.
... The latest work [15] in this category is a follow-up by the first author of the ESVO paper. They have improved ESVO by adding IMU integration, reducing latency, and remodeling the depth estimation of horizontal edges. ...
Preprint
Full-text available
Stereopsis has widespread appeal in robotics as it is the predominant way by which living beings perceive depth to navigate our 3D world. Event cameras are novel bio-inspired sensors that detect per-pixel brightness changes asynchronously, with very high temporal resolution and high dynamic range, enabling machine perception in high-speed motion and broad illumination conditions. The high temporal precision also benefits stereo matching, making disparity (depth) estimation a popular research area for event cameras ever since its inception. Over the last 30 years, the field has evolved rapidly, from low-latency, low-power circuit design to current deep learning (DL) approaches driven by the computer vision community. The bibliography is vast and difficult to navigate for non-experts due to its highly interdisciplinary nature. Past surveys have addressed distinct aspects of this topic, in the context of applications, or focusing only on a specific class of techniques, but have overlooked stereo datasets. This survey provides a comprehensive overview, covering both instantaneous stereo and long-term methods suitable for simultaneous localization and mapping (SLAM), along with theoretical and empirical comparisons. It is the first to extensively review DL methods as well as stereo datasets, even providing practical suggestions for creating new benchmarks to advance the field. The main advantages and challenges faced by event-based stereo depth estimation are also discussed. Despite significant progress, challenges remain in achieving optimal performance in not only accuracy but also efficiency, a cornerstone of event-based computing. We identify several gaps and propose future research directions. We hope this survey inspires future research in this area, by serving as an accessible entry point for newcomers, as well as a practical guide for seasoned researchers in the community.
... Early works using a monocular event-based camera (e.g., [3,4]) require a very gentle motion (typically a local-loopy behavior) for the initialization of a local 3D map, based on which the camera pose can be tracked using a 3D-2D registration pipeline. To remove such a limitation on the initialization, Zhou et al. [5,6] further use a stereo event-based camera to improve the efficiency and accuracy of mapping. However, tracking failure is still witnessed when the ego motion of the camera suddenly becomes violent (mainly in terms of the angular velocity). ...
... Conversely, direct methods [22] attempt to process all available sensor data, such as individual pixel intensity changes in images (events) or all RGB frame pixels, without any intermediate filtering or feature extraction in the front-end, relying on the back-end to handle the entire data. The proposed method adopts a hybrid approach where all events are directly processed during the events-frames fusion in the front-end. ...
Article
This paper presents a robust approach for a visual parallel tracking and mapping (PTAM) system that excels in challenging environments. Our proposed method combines the strengths of heterogeneous multi-modal visual sensors, including stereo event-based and frame-based sensors, in a unified reference frame through a novel spatio-temporal synchronization approach. We employ deep learning-based feature extraction and description for estimation to enhance robustness further. We also introduce an end-to-end parallel tracking and mapping optimization layer complemented by a simple loop-closure algorithm for efficient SLAM behavior. Through comprehensive experiments on both small-scale and large-scale real-world sequences of VECtor and TUM-VIE benchmarks, our proposed method (DH-PTAM) demonstrates superior performance in terms of robustness and accuracy in adverse conditions, especially in large-scale HDR scenarios. Our implementation's research-based Python API is publicly available on GitHub for further research and development: https://github.com/AbanobSoliman/DH-PTAM .
Article
Event-based visual odometry is a specific branch of visual Simultaneous Localization and Mapping (SLAM) techniques, which aims at solving tracking and mapping sub-problems (typically in parallel) by exploiting the special working principles of neuromorphic (i.e., event-based) cameras. Due to the motion-dependent nature of event data, explicit data association (i.e., feature matching) under large-baseline viewpoint changes is difficult to establish, making direct methods a more rational choice. However, state-of-the-art direct methods are limited by the high computational complexity of the mapping sub-problem and the degeneracy of camera pose tracking in certain degrees of freedom (DoF) in rotation. In this paper, we tackle these issues by building an event-based stereo visual-inertial odometry system on top of a direct pipeline [1]. Specifically, to speed up the mapping operation, we propose an efficient strategy for sampling contour points according to the local dynamics of events. The mapping performance is also improved in terms of structure completeness and local smoothness by merging the temporal stereo and static stereo results. To circumvent the degeneracy of camera pose tracking in recovering the pitch and yaw components of general 6-DoF motion, we introduce IMU measurements as motion priors via pre-integration. To this end, a compact back-end is proposed for continuously updating the IMU bias and predicting the linear velocity, enabling an accurate motion prediction for camera pose tracking. The resulting system scales well with modern high-resolution event cameras and leads to better global positioning accuracy in large-scale outdoor environments. Extensive evaluations on five publicly available datasets featuring different resolutions and scenarios justify the superior performance of the proposed system against five state-of-the-art methods. Compared to ESVO [1], our new pipeline significantly reduces the camera pose tracking error by 40%-80% and 20%-80% in terms of absolute trajectory error and relative pose error, respectively; at the same time, the mapping efficiency is improved by a factor of five. We release our pipeline as open-source software for future research in this field.
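To make the role of the IMU prior concrete, here is a minimal sketch of the kind of motion prediction described above: rotation predicted by integrating bias-corrected gyroscope samples, translation by a constant-linear-velocity model. This is our own simplified illustration, not the paper's back-end; all names are ours.

```python
import numpy as np

def so3_exp(w):
    """Rodrigues' formula: rotation matrix for an axis-angle vector w."""
    theta = np.linalg.norm(w)
    if theta < 1e-9:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def predict_pose(R_prev, t_prev, v_lin, gyro_samples, bias_g, dt):
    """Predict the next camera pose from the previous one: rotation by
    integrating bias-corrected gyroscope samples, translation with a
    constant-linear-velocity model over the integration window."""
    R = R_prev
    for w in gyro_samples:
        R = R @ so3_exp((w - bias_g) * dt)
    t = t_prev + v_lin * (len(gyro_samples) * dt)
    return R, t
```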
Article
Full-text available
Recent advances in event-based cameras have led to significant developments in robotics, particularly in visual simultaneous localization and mapping (VSLAM) applications. This technique enables real-time camera motion estimation and simultaneous environment mapping using visual sensors on mobile platforms. Event cameras offer several distinct advantages over frame-based cameras, including a high dynamic range, high temporal resolution, low power consumption, and low latency. These attributes make event cameras highly suitable for addressing performance issues in challenging scenarios such as high-speed motion and environments with high-range illumination. This review paper delves into event-based VSLAM (EVSLAM) algorithms, leveraging the advantages inherent in event streams for localization and mapping endeavors. The exposition commences by explaining the operational principles of event cameras, providing insights into the diverse event representations applied in event data preprocessing. A crucial facet of this survey is the systematic categorization of EVSLAM research into three key parts: event preprocessing, event tracking, and sensor fusion algorithms in EVSLAM. Each category undergoes meticulous examination, offering practical insights and guidance for comprehending each approach. Moreover, we thoroughly assess state-of-the-art (SOTA) methods, emphasizing evaluation on a specific dataset for enhanced comparability. This evaluation sheds light on current challenges and outlines promising avenues for future research, emphasizing the persisting obstacles and potential advancements in this dynamically evolving domain.
Article
Full-text available
The increasing interest in developing robots capable of navigating autonomously has led to the necessity of developing robust methods that enable these robots to operate in challenging and dynamic environments. Visual odometry (VO) has emerged in this context as a key technique, offering the possibility of estimating the position of a robot using sequences of onboard cameras. In this paper, a VO algorithm is proposed that achieves sub-pixel precision by combining optical flow and direct methods. This approach uses only a downward-facing, monocular camera, eliminating the need for additional sensors. The experimental results demonstrate the robustness of the developed method across various surfaces, achieving minimal drift errors in calculation.
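As a rough illustration of the flow-based part of such a pipeline, the sketch below tracks corners between consecutive downward-facing frames with pyramidal Lucas-Kanade optical flow and takes the median pixel shift; the direct-method refinement described in the paper is omitted, and the function name and parameters are our own.

```python
import cv2
import numpy as np

def frame_displacement(prev_gray, curr_gray):
    """Median inter-frame pixel shift for a downward-facing camera:
    track Shi-Tomasi corners with pyramidal Lucas-Kanade optical flow
    and take the median displacement of successfully tracked points."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=300,
                                  qualityLevel=0.01, minDistance=7)
    if pts is None:
        return np.zeros(2)
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    good = status.ravel() == 1
    flow = (nxt[good] - pts[good]).reshape(-1, 2)
    return np.median(flow, axis=0)  # (dx, dy) in pixels
```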
Article
Full-text available
There have been a number of corner detection methods proposed for event cameras in the last years, since event-driven computer vision has become more accessible. Current state-of-the-art methods have either unsatisfactory accuracy or real-time performance when considered for practical use, for example when a camera is randomly moved in an unconstrained environment. In this paper, we present yet another method to perform corner detection, dubbed look-up event-Harris (luvHarris), that employs the Harris algorithm for high accuracy but manages an improved event throughput. Our method has two major contributions: 1) a novel “threshold ordinal event-surface” that removes certain tuning parameters and is well suited for Harris operations, and 2) an implementation of the Harris algorithm such that the computational load per event is minimised and computationally heavy convolutions are performed only ‘as-fast-as-possible’, i.e., only as computational resources are available. The result is a practical, real-time, and robust corner detector that runs at more than 2.6× the speed of the current state of the art; a necessity when using a high-resolution event-camera in real-time. We explain the considerations taken for the approach, compare the algorithm to current state-of-the-art in terms of computational performance and detection accuracy, and discuss the validity of the proposed approach for event cameras.
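A minimal sketch of the two-stage idea, under our own simplifications: the expensive Harris response is computed over the whole event surface only as resources allow, while each incoming event is scored with a cheap per-pixel look-up. The threshold-ordinal surface itself is replaced here by a generic event surface, and the parameter values are illustrative.

```python
import cv2
import numpy as np

def harris_lookup(event_surface, events_xy, k=0.04, thresh_ratio=0.1):
    """luvHarris-style two-stage scheme (sketch): run the heavy Harris
    response once over the whole event surface, then score each incoming
    event with a cheap look-up into that response map."""
    surf = event_surface.astype(np.float32)
    response = cv2.cornerHarris(surf, blockSize=5, ksize=3, k=k)
    thresh = thresh_ratio * response.max()
    return [(x, y) for x, y in events_xy if response[y, x] > thresh]
```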
Article
Full-text available
Event-based cameras are bioinspired vision sensors whose pixels work independently from each other and respond asynchronously to brightness changes, with microsecond resolution. Their advantages make it possible to tackle challenging scenarios in robotics, such as high-speed and high dynamic range scenes. We present a solution to the problem of visual odometry from the data acquired by a stereo event-based camera rig. Our system follows a parallel tracking-and-mapping approach, where novel solutions to each subproblem (three-dimensional (3-D) reconstruction and camera pose estimation) are developed with two objectives in mind: being principled and efficient, for real-time operation with commodity hardware. To this end, we seek to maximize the spatio-temporal consistency of stereo event-based data while using a simple and efficient representation. Specifically, the mapping module builds a semidense 3-D map of the scene by fusing depth estimates from multiple viewpoints (obtained by spatio-temporal consistency) in a probabilistic fashion. The tracking module recovers the pose of the stereo rig by solving a registration problem that naturally arises due to the chosen map and event data representation. Experiments on publicly available datasets and on our own recordings demonstrate the versatility of the proposed method in natural scenes with general 6-DoF motion. The system successfully leverages the advantages of event-based cameras to perform visual odometry in challenging illumination conditions, such as low-light and high dynamic range, while running in real-time on a standard CPU. We release the software and dataset under an open source license to foster research in the emerging topic of event-based simultaneous localization and mapping.
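The spatio-temporal consistency exploited by this system is computed on event representations such as time surfaces. As a minimal sketch of the standard exponentially-decayed time surface (the decay constant here is illustrative):

```python
import numpy as np

def update_time_map(last_t, event):
    """Per-event update: store the timestamp of the most recent event
    at its pixel (polarity handling omitted for brevity)."""
    x, y, t, p = event
    last_t[y, x] = t

def time_surface(last_t, t_now, tau=0.03):
    """Exponentially-decayed time surface: pixels with recent events map
    to values near 1, stale pixels decay toward 0 (tau in seconds)."""
    return np.exp(-(t_now - last_t) / tau)
```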
Conference Paper
Full-text available
Recently, the emerging bio-inspired event cameras have demonstrated potential for a wide range of robotic applications in dynamic environments. In this paper, we propose a novel fast and asynchronous event-based corner detection method which is called FA-Harris. FA-Harris consists of several components, including an event filter, a Global Surface of Active Events (G-SAE) maintaining unit, a corner candidate selecting unit, and a corner candidate refining unit. The proposed G-SAE maintenance algorithm and corner candidate selection algorithm greatly enhance the real-time performance for corner detection, while the corner candidate refinement algorithm maintains accuracy by using an improved event-based Harris detector. Additionally, FA-Harris does not require artificially synthesized event-frames and can operate on asynchronous events directly. We implement the proposed method in C++ and evaluate it on public Event Camera Datasets. The results show that our method achieves approximately 8× speed-up compared with the previously reported event-based Harris detector, with no compromise in accuracy.
Conference Paper
Full-text available
Event cameras are novel vision sensors that output pixel-level brightness changes ("events") instead of traditional video frames. These asynchronous sensors offer several advantages over traditional cameras, such as high temporal resolution, very high dynamic range, and no motion blur. To unlock the potential of such sensors, motion compensation methods have been recently proposed. We present a collection and taxonomy of twenty-two objective functions to analyze event alignment in motion compensation approaches. We call them Focus Loss Functions since they have strong connections with functions used in traditional shape-from-focus applications. The proposed loss functions allow bringing mature computer vision tools to the realm of event cameras. We compare the accuracy and runtime performance of all loss functions on a publicly available dataset, and conclude that the variance, the gradient and the Laplacian magnitudes are among the best loss functions. The applicability of the loss functions is shown on multiple tasks: rotational motion, depth and optical flow estimation. The proposed focus loss functions allow unlocking the outstanding properties of event cameras.
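As an example of this family of objectives, the sketch below scores a candidate image-plane velocity by the variance of the image of warped events (IWE), one of the best-performing losses according to the paper. Nearest-pixel accumulation is used for brevity; the function name and parameters are our own.

```python
import numpy as np

def variance_focus(events_xyt, velocity, t_ref, resolution):
    """Variance focus loss (contrast maximization, sketch): warp each
    event to a reference time along a candidate constant image-plane
    velocity, accumulate an image of warped events (IWE), and score the
    candidate by the variance of that image. The motion parameter that
    maximizes the variance best aligns the events along edges."""
    h, w = resolution
    iwe = np.zeros((h, w))
    for x, y, t in events_xyt:
        xw = int(round(x - velocity[0] * (t - t_ref)))
        yw = int(round(y - velocity[1] * (t - t_ref)))
        if 0 <= xw < w and 0 <= yw < h:
            iwe[yw, xw] += 1.0
    return iwe.var()
```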
Conference Paper
Full-text available
In this tutorial, we provide principled methods to quantitatively evaluate the quality of an estimated trajectory from visual(-inertial) odometry (VO/VIO), which is the foundation of benchmarking the accuracy of different algorithms. First, we show how to determine the transformation type to use in trajectory alignment based on the specific sensing modality (i.e., monocular, stereo and visual-inertial). Second, we describe commonly used error metrics (i.e., the absolute trajectory error and the relative error) and their strengths and weaknesses. To make the methodology presented for VO/VIO applicable to other setups, we also generalize our formulation to any given sensing modality. To facilitate the reproducibility of related research, we publicly release our implementation of the methods described in this tutorial.
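The trajectory-alignment step discussed in this tutorial is typically solved in closed form with Umeyama's method; a compact sketch follows. Note the tutorial additionally recommends modality-specific choices (e.g., yaw-only rotation for visual-inertial setups), which this generic version does not implement.

```python
import numpy as np

def umeyama_align(est, gt, with_scale=False):
    """Closed-form Umeyama alignment of an estimated trajectory to
    ground truth (both N x 3). Returns scale s, rotation R, translation t
    such that gt ≈ s * R @ est + t. Use with_scale=True for monocular
    (Sim(3)) and False for stereo / visual-inertial (SE(3)) alignment."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    E, G = est - mu_e, gt - mu_g
    U, D, Vt = np.linalg.svd(G.T @ E / len(est))   # cross-covariance SVD
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:   # avoid reflections
        S[2, 2] = -1
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / E.var(0).sum() if with_scale else 1.0
    return s, R, mu_g - s * R @ mu_e
```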
Chapter
Full-text available
We present a method that leverages the complementarity of event cameras and standard cameras to track visual features with low latency. Event cameras are novel sensors that output pixel-level brightness changes, called “events”. They offer significant advantages over standard cameras, namely a very high dynamic range, no motion blur, and a latency in the order of microseconds. However, because the same scene pattern can produce different events depending on the motion direction, establishing event correspondences across time is challenging. By contrast, standard cameras provide intensity measurements (frames) that do not depend on motion direction. Our method extracts features on frames and subsequently tracks them asynchronously using events, thereby exploiting the best of both types of data: the frames provide a photometric representation that does not depend on motion direction and the events provide low-latency updates. In contrast to previous works, which are based on heuristics, this is the first principled method that uses raw intensity measurements directly, based on a generative event model within a maximum-likelihood framework. As a result, our method produces feature tracks that are both more accurate (subpixel accuracy) and longer than the state of the art, across a wide variety of scenes.
Chapter
Full-text available
Event cameras are bio-inspired sensors that offer several advantages, such as low latency, high-speed and high dynamic range, to tackle challenging scenarios in computer vision. This paper presents a solution to the problem of 3D reconstruction from data captured by a stereo event-camera rig moving in a static scene, such as in the context of stereo Simultaneous Localization and Mapping. The proposed method consists of the optimization of an energy function designed to exploit small-baseline spatio-temporal consistency of events triggered across both stereo image planes. To improve the density of the reconstruction and to reduce the uncertainty of the estimation, a probabilistic depth-fusion strategy is also developed. The resulting method has no special requirements on either the motion of the stereo event-camera rig or on prior knowledge about the scene. Experiments demonstrate our method can deal with both texture-rich scenes as well as sparse scenes, outperforming state-of-the-art stereo methods based on event data image representations.
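The probabilistic depth-fusion strategy amounts, at its simplest, to products of Gaussians over inverse depth. A minimal sketch of fusing two inverse-depth estimates of the same map point (our own simplification of the paper's strategy):

```python
def fuse_inverse_depth(mu1, var1, mu2, var2):
    """Gaussian fusion of two inverse-depth estimates of the same map
    point observed from multiple viewpoints: the product of two Gaussians
    yields a fused mean and a strictly smaller variance."""
    var = (var1 * var2) / (var1 + var2)
    mu = (mu1 * var2 + mu2 * var1) / (var1 + var2)
    return mu, var
```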
Article
Full-text available
Event cameras are bioinspired vision sensors that output pixel-level brightness changes instead of standard intensity frames. They offer significant advantages over standard cameras, namely a very high dynamic range, no motion blur, and a latency in the order of microseconds. However, due to the fundamentally different structure of the sensor’s output, new algorithms that exploit the high temporal resolution and the asynchronous nature of the sensor are required. Recent work has shown that a continuous-time representation of the event camera pose can deal with the high temporal resolution and asynchronous nature of this sensor in a principled way. In this paper, we leverage such a continuous-time representation to perform visual-inertial odometry with an event camera. This representation allows direct integration of the asynchronous events with microsecond accuracy and the inertial measurements at high frequency. The event camera trajectory is approximated by a smooth curve in the space of rigid-body motions using cubic splines. This formulation significantly reduces the number of variables in trajectory estimation problems. We evaluate our method on real data from several scenes and compare the results against ground truth from a motion-capture system. We show that our method provides improved accuracy over the result of a state-of-the-art visual odometry method for event cameras. We also show that both the map orientation and scale can be recovered accurately by fusing events and inertial data. To the best of our knowledge, this is the first work on visual-inertial fusion with event cameras using a continuous-time framework.
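The continuous-time representation can be illustrated with a uniform cubic B-spline over position control points. Note the paper works in the space of rigid-body motions, which this simplified vector-space sketch does not capture:

```python
import numpy as np

def cubic_bspline_position(ctrl_pts, u):
    """Uniform cubic B-spline evaluation (sketch): smooth position at
    normalized time u in [0, 1) within a segment defined by 4 control
    points (4 x 3 array). A continuous-time trajectory lets asynchronous
    events and high-rate IMU samples be queried at arbitrary timestamps."""
    M = (1.0 / 6.0) * np.array([[1, 4, 1, 0],
                                [-3, 0, 3, 0],
                                [3, -6, 3, 0],
                                [-1, 3, -3, 1]])
    U = np.array([1.0, u, u * u, u ** 3])
    return (U @ M) @ ctrl_pts
```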
Article
Full-text available
Event cameras are bio-inspired vision sensors that output pixel-level brightness changes instead of standard intensity frames. These cameras do not suffer from motion blur and have a very high dynamic range, which enables them to provide reliable visual information during high speed motions or in scenes characterized by high dynamic range. However, event cameras output only little information when the amount of motion is limited, such as in the case of almost still motion. Conversely, standard cameras provide instant and rich information about the environment most of the time (in low-speed and good lighting scenarios), but they fail severely in case of fast motions, or difficult lighting such as high dynamic range or low light scenes. In this paper, we present the first state estimation pipeline that leverages the complementary advantages of these two sensors by fusing in a tightly-coupled manner events, standard frames, and inertial measurements. We show on the publicly available Event Camera Dataset that our hybrid pipeline leads to an accuracy improvement of 130% over event-only pipelines, and 85% over standard-frames-only visual-inertial systems, while still being computationally tractable. Furthermore, we use our pipeline to demonstrate - to the best of our knowledge - the first autonomous quadrotor flight using an event camera for state estimation, unlocking flight scenarios that were not reachable with traditional visual-inertial odometry, such as low-light environments and high-dynamic range scenes.
Article
Full-text available
Event cameras are bio-inspired vision sensors that output pixel-level brightness changes instead of standard intensity frames. They offer significant advantages over standard cameras, namely a very high dynamic range, no motion blur, and a latency in the order of microseconds. However, because the output is composed of a sequence of asynchronous events rather than actual intensity images, traditional vision algorithms cannot be applied, so that a paradigm shift is needed. We introduce the problem of event-based multi-view stereo (EMVS) for event cameras and propose a solution to it. Unlike traditional MVS methods, which address the problem of estimating dense 3D structure from a set of known viewpoints, EMVS estimates semi-dense 3D structure from an event camera with known trajectory. Our EMVS solution elegantly exploits two inherent properties of an event camera: (1) its ability to respond to scene edges—which naturally provide semi-dense geometric information without any pre-processing operation—and (2) the fact that it provides continuous measurements as the sensor moves. Despite its simplicity (it can be implemented in a few lines of code), our algorithm is able to produce accurate, semi-dense depth maps, without requiring any explicit data association or intensity estimation. We successfully validate our method on both synthetic and real data. Our method is computationally very efficient and runs in real-time on a CPU.
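A minimal sketch of the ray-counting idea, with our own simplified geometry: each event is back-projected along its viewing ray, the ray is sampled at a set of depth planes, and votes are accumulated in a discretized volume (a disparity space image, DSI) anchored in a reference view. Local maxima of the resulting ray density yield a semi-dense depth map without explicit data association.

```python
import numpy as np

def build_dsi(events_xy, poses, K, depths, shape):
    """EMVS-style ray counting (sketch). `poses` hold (R, t) mapping each
    event-camera frame into the reference frame; `depths` are the DSI
    depth planes of the reference view; `shape` is (height, width)."""
    K_inv = np.linalg.inv(K)
    depths = np.asarray(depths, dtype=float)
    h, w = shape
    dsi = np.zeros((len(depths), h, w))
    for (x, y), (R, t) in zip(events_xy, poses):
        ray = K_inv @ np.array([x, y, 1.0])       # viewing ray, event-camera frame
        for z in depths:                          # sample the ray at depth planes
            Xr = R @ (ray * z) + t                # 3D point in the reference frame
            if Xr[2] <= 0:
                continue
            u = K @ (Xr / Xr[2])                  # project into the reference view
            ui, vi = int(round(u[0])), int(round(u[1]))
            d = int(np.argmin(np.abs(depths - Xr[2])))  # nearest depth plane
            if 0 <= ui < w and 0 <= vi < h:
                dsi[d, vi, ui] += 1.0             # vote along the ray
    return dsi
```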
Article
Full-text available
Event cameras are bio-inspired vision sensors that output pixel-level brightness changes instead of standard intensity frames. These cameras do not suffer from motion blur and have a very high dynamic range, which enables them to provide reliable visual information during high-speed motions or in scenes characterized by high dynamic range. These features, along with a very low power consumption, make event cameras an ideal complement to standard cameras for VR/AR and video game applications. With these applications in mind, this paper tackles the problem of accurate, low-latency tracking of an event camera from an existing photometric depth map (i.e., intensity plus depth information) built via classic dense reconstruction pipelines. Our approach tracks the 6-DOF pose of the event camera upon the arrival of each event, thus virtually eliminating latency. We successfully evaluate the method in both indoor and outdoor scenes and show that, because of the technological advantages of the event camera, our pipeline works in scenes characterized by high-speed motion, which are still inaccessible to standard cameras.
Conference Paper
Full-text available
Event cameras offer many advantages over standard frame-based cameras, such as low latency, high temporal resolution, and a high dynamic range. They respond to pixel-level brightness changes and, therefore, provide a sparse output. However, in textured scenes with rapid motion, millions of events are generated per second. Therefore, state-of-the-art event-based algorithms either require massive parallel computation (e.g., a GPU) or depart from the event-based processing paradigm. Inspired by frame-based pre-processing techniques that reduce an image to a set of features, which are typically the input to higher-level algorithms, we propose a method to reduce an event stream to a corner event stream. Our goal is twofold: extract relevant tracking information (corners do not suffer from the aperture problem) and decrease the event rate for later processing stages. Our event-based corner detector is very efficient due to its design principle, which consists of working on the Surface of Active Events (a map with the timestamp of the latest event at each pixel) using only comparison operations. Our method asynchronously processes event by event with very low latency. Our implementation is capable of processing millions of events per second on a single core (less than a microsecond per event) and reduces the event rate by a factor of 10 to 20.
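A sketch of the Surface of Active Events that this detector operates on, plus an illustrative comparison-based candidate test. The test below is our own invented heuristic for exposition, not the paper's detector; only the SAE data structure itself is taken from the description above.

```python
import numpy as np

def update_sae(sae, event):
    """Maintain the Surface of Active Events: per pixel, the timestamp of
    the most recent event. Corner tests then reduce to comparison
    operations on this map, processed event by event."""
    x, y, t, p = event
    sae[y, x] = t

def is_corner_candidate(sae, x, y, t_now, window=9, horizon=0.005):
    """Illustrative (hypothetical) candidate test: a corner-like event
    should sit in a neighborhood where only a small fraction of pixels
    fired very recently (an edge activates a full line of pixels)."""
    r = window // 2
    patch = sae[y - r:y + r + 1, x - r:x + r + 1]
    recent = np.count_nonzero(t_now - patch < horizon)
    return recent < (window * window) // 4
```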
Article
Full-text available
We present an algorithm to estimate the rotational motion of an event camera. In contrast to traditional cameras, which produce images at a fixed rate, event cameras have independent pixels that respond asynchronously to brightness changes, with microsecond resolution. Our method leverages the type of information conveyed by these novel sensors (i.e., edges) to directly estimate the angular velocity of the camera, without requiring optical flow or image intensity estimation. The core of the method is a contrast maximization design. The method performs favorably against ground truth data and gyroscopic measurements from an Inertial Measurement Unit, even in the presence of very high-speed motions (close to 1000 deg/s).
Article
Full-text available
We present EVO, an Event-based Visual Odometry algorithm. Our algorithm successfully leverages the outstanding properties of event cameras to track fast camera motions while recovering a semi-dense 3D map of the environment. The implementation runs in real-time on a standard CPU and outputs up to several hundred pose estimates per second. Due to the nature of event cameras, our algorithm is unaffected by motion blur and operates very well in challenging, high dynamic range conditions with strong illumination changes. To achieve this, we combine a novel, event-based tracking approach based on image-to-model alignment with a recent event-based 3D reconstruction algorithm in a parallel fashion. Additionally, we show that the output of our pipeline can be used to reconstruct intensity images from the binary event stream, though our algorithm does not require such intensity information. We believe that this work makes significant progress in SLAM by unlocking the potential of event cameras. This allows us to tackle challenging scenarios that are currently inaccessible to standard cameras.
Conference Paper
Full-text available
Event cameras are bio-inspired vision sensors that output pixel-level brightness changes instead of standard intensity frames. They offer significant advantages over standard cameras, namely a very high dynamic range, no motion blur, and a latency in the order of microseconds. However, because the output is composed of a sequence of asynchronous events rather than actual intensity images, traditional vision algorithms cannot be applied, so that a paradigm shift is needed. We introduce the problem of Event-based Multi-View Stereo (EMVS) for event cameras and propose a solution to it. Unlike traditional MVS methods, which address the problem of estimating dense 3D structure from a set of known viewpoints, EMVS estimates semi-dense 3D structure from an event camera with known trajectory. Our EMVS solution elegantly exploits two inherent properties of an event camera: (i) its ability to respond to scene edges—which naturally provide semi-dense geometric information without any pre-processing operation—and (ii) the fact that it provides continuous measurements as the sensor moves. Despite its simplicity (it can be implemented in a few lines of code), our algorithm is able to produce accurate, semi-dense depth maps. We successfully validate our method on both synthetic and real data. Our method is computationally very efficient and runs in real-time on a CPU.
Conference Paper
Full-text available
In the last few years, we have witnessed impressive demonstrations of aggressive flights and acrobatics using quadrotors. However, those robots are actually blind. They do not see by themselves, but through the "eyes" of an external motion capture system. Flight maneuvers using onboard sensors are still slow compared to those attainable with motion capture systems. At the current state, the agility of a robot is limited by the latency of its perception pipeline. To obtain more agile robots, we need to use faster sensors. In this paper, we present the first onboard perception system for 6-DOF localization during high-speed maneuvers using a Dynamic Vision Sensor (DVS). Unlike a standard CMOS camera, a DVS does not wastefully send full image frames at a fixed frame rate. Conversely, similar to the human eye, it only transmits pixel-level brightness changes at the time they occur with microsecond resolution, thus offering the possibility to create a perception pipeline whose latency is negligible compared to the dynamics of the robot. We exploit these characteristics to estimate the pose of a quadrotor with respect to a known pattern during high-speed maneuvers, such as flips, with rotational speeds up to 1,200°/s. Additionally, we provide a versatile method to capture ground-truth data using a DVS. Supplementary material: a video attachment to this work is available at http://rpg.ifi.uzh.ch.
Article
Full-text available
This paper presents ORB-SLAM, a feature-based monocular SLAM system that operates in real time, in small and large, indoor and outdoor environments. The system is robust to severe motion clutter, allows wide baseline loop closing and relocalization, and includes full automatic initialization. Building on excellent algorithms of recent years, we designed from scratch a novel system that uses the same features for all SLAM tasks: tracking, mapping, relocalization, and loop closing. A survival of the fittest strategy that selects the points and keyframes of the reconstruction leads to excellent robustness and generates a compact and trackable map that only grows if the scene content changes, allowing lifelong operation. We present an exhaustive evaluation in 27 sequences from the most popular datasets. ORB-SLAM achieves unprecedented performance with respect to other state-of-the-art monocular SLAM approaches. For the benefit of the community, we make the source code public.
Article
Full-text available
This paper presents a number of new methods for visual tracking using the output of an event-based asynchronous neuromorphic dynamic vision sensor. It allows the tracking of multiple visual features in real time, achieving an update rate of several hundred kilohertz on a standard desktop PC. The approach has been specially adapted to take advantage of the event-driven properties of these sensors by combining both spatial and temporal correlations of events in an asynchronous iterative framework. Various kernels, such as Gaussian, Gabor, combinations of Gabor functions, and arbitrary user-defined kernels, are used to track features from incoming events. The trackers described in this paper are capable of handling variations in position, scale, and orientation through the use of multiple pools of trackers. This approach avoids the N² operations per event associated with conventional kernel-based convolution operations with N × N kernels. The tracking performance was evaluated experimentally for each type of kernel in order to demonstrate the robustness of the proposed solution.
Conference Paper
Full-text available
In this paper, we present a novel benchmark for the evaluation of RGB-D SLAM systems. We recorded a large set of image sequences from a Microsoft Kinect with highly accurate and time-synchronized ground truth camera poses from a motion capture system. The sequences contain both the color and depth images in full sensor resolution (640 × 480) at video frame rate (30 Hz). The ground-truth trajectory was obtained from a motion-capture system with eight high-speed tracking cameras (100 Hz). The dataset consists of 39 sequences that were recorded in an office environment and an industrial hall. The dataset covers a large variety of scenes and camera motions. We provide sequences for debugging with slow motions as well as longer trajectories with and without loop closures. Most sequences were recorded from a handheld Kinect with unconstrained 6-DOF motions but we also provide sequences from a Kinect mounted on a Pioneer 3 robot that was manually navigated through a cluttered indoor environment. To stimulate the comparison of different approaches, we provide automatic evaluation tools both for the evaluation of drift of visual odometry systems and the global pose error of SLAM systems. The benchmark website [1] contains all data, detailed descriptions of the scenes, specifications of the data formats, sample code, and evaluation tools.
Article
Full-text available
Conventional vision-based robotic systems that must operate quickly require high video frame rates and consequently high computational costs. Visual response latencies are lower-bounded by the frame period, e.g., 20 ms for 50 Hz frame rate. This paper shows how an asynchronous neuromorphic dynamic vision sensor (DVS) silicon retina is used to build a fast self-calibrating robotic goalie, which offers high update rates and low latency at low CPU load. Independent and asynchronous per pixel illumination change events from the DVS signify moving objects and are used in software to track multiple balls. Motor actions to block the most “threatening” ball are based on measured ball positions and velocities. The goalie also sees its single-axis goalie arm and calibrates the motor output map during idle periods so that it can plan open-loop arm movements to desired visual locations. Blocking capability is about 80% for balls shot from 1 m from the goal even with the fastest shots, and approaches 100% accuracy when the ball does not beat the limits of the servo motor to move the arm to the necessary position in time. Running with standard USB buses under a standard preemptive multitasking operating system (Windows), the goalie robot achieves median update rates of 550 Hz, with latencies of 2.2 ± 2 ms from ball movement to motor command at a peak CPU load of less than 4%. Practical observations and measurements of USB device latency are provided.
Conference Paper
Full-text available
Balancing a normal pencil on its tip requires rapid feedback control with latencies on the order of milliseconds. This demonstration shows how a pair of spike-based silicon retina dynamic vision sensors (DVS) is used to provide fast visual feedback for controlling an actuated table to balance an ordinary pencil. Two DVSs view the pencil from right angles. Movements of the pencil cause spike address-events (AEs) to be emitted from the DVSs. These AEs are transmitted to a PC over USB interfaces and are processed procedurally in real time. The PC updates its estimate of the pencil's location and angle in 3d space upon each incoming AE, applying a novel tracking method based on spike-driven fitting to a model of the vertical shape of the pencil. A PD-controller adjusts X-Y-position and velocity of the table to maintain the pencil balanced upright. The controller also minimizes the deviation of the pencil's base from the center of the table. The actuated table is built using ordinary high-speed hobby servos which have been modified to obtain feedback from linear position encoders via a microcontroller. Our system can balance any small, thin object such as a pencil, pen, chop-stick, or rod for many minutes. Balancing is only possible when incoming AEs are processed as they arrive from the sensors, typically at intervals below the millisecond range. Controlling at normal image sensor sample rates (e.g., 60 Hz) results in latencies too long for a stable control loop.
Article
Full-text available
We propose a non-iterative solution for the perspective-n-point (PnP) problem, which can robustly retrieve the optimum by solving a seventh-order polynomial. The central idea consists of three steps: (1) divide the reference points into 3-point subsets to achieve a series of fourth-order polynomials, (2) compute the sum of the squares of the polynomials to form a cost function, and (3) find the roots of the derivative of the cost function to determine the optimum. The advantages of the proposed method are as follows: Firstly, it can stably deal with the planar case, the ordinary 3D case, and the quasi-singular case, and it is as accurate as the state-of-the-art iterative algorithms with much less computational time. Secondly, it is the first non-iterative PnP solution that can achieve more accurate results than the iterative algorithms when no redundant reference points can be used (n ≤ 5). Thirdly, large-size point sets can be handled efficiently because its computational complexity is O(n).
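The paper's seventh-order-polynomial solver is not available in common libraries, but the PnP problem it addresses can be tried with OpenCV's non-iterative EPnP as a stand-in. The correspondences below are hypothetical placeholders generated for the example.

```python
import cv2
import numpy as np

# Hypothetical correspondences: n = 6 reference points in front of the
# camera and their ideal pinhole projections (identity camera pose).
rng = np.random.default_rng(0)
object_pts = rng.random((6, 3)).astype(np.float32) + [0.0, 0.0, 5.0]
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float64)
proj = (K @ (object_pts / object_pts[:, 2:3]).T).T
image_pts = proj[:, :2].astype(np.float32)

# EPnP is a stand-in non-iterative solver shipped with OpenCV; the
# paper's own seventh-order-polynomial method is not part of OpenCV.
ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, None,
                              flags=cv2.SOLVEPNP_EPNP)
R, _ = cv2.Rodrigues(rvec)   # recovered rotation should be near identity
```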
Conference Paper
Full-text available
Where feature points are used in real-time frame-rate applications, a high-speed feature detector is necessary. Feature detectors such as SIFT (DoG), Harris and SUSAN are good methods which yield high quality features, however they are too computationally intensive for use in real-time applications of any complexity. Here we show that machine learning can be used to derive a feature detector which can fully process live PAL video using less than 7% of the available processing time. By comparison neither the Harris detector (120%) nor the detection stage of SIFT (300%) can operate at full frame rate. Clearly a high-speed detector is of limited use if the features produced are unsuitable for downstream processing. In particular, the same scene viewed from two different positions should yield features which correspond to the same real-world 3D locations [1]. Hence the second contribution of this paper is a comparison of corner detectors based on this criterion applied to 3D scenes. This comparison supports a number of claims made elsewhere concerning existing corner detectors. Further, contrary to our initial expectations, we show that despite being principally constructed for speed, our detector significantly outperforms existing feature detectors according to this criterion.
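The detector derived in this paper is FAST, which later shipped with OpenCV; a minimal usage example on a synthetic image:

```python
import cv2
import numpy as np

# A synthetic test image: a bright square on a dark background.
img = np.zeros((240, 320), dtype=np.uint8)
cv2.rectangle(img, (100, 80), (220, 160), color=255, thickness=-1)

fast = cv2.FastFeatureDetector_create(threshold=25, nonmaxSuppression=True)
keypoints = fast.detect(img, None)
print(f"{len(keypoints)} FAST corners detected")  # the square's corners
```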
Article
Full-text available
This paper describes a 128 × 128 pixel CMOS vision sensor. Each pixel independently and in continuous time quantizes local relative intensity changes to generate spike events. These events appear at the output of the sensor as an asynchronous stream of digital pixel addresses. These address-events signify scene reflectance change and have sub-millisecond timing precision. The output data rate depends on the dynamic content of the scene and is typically orders of magnitude lower than those of conventional frame-based imagers. By combining an active continuous-time front-end logarithmic photoreceptor with a self-timed switched-capacitor differencing circuit, the sensor achieves an array mismatch of 2.1% in relative intensity event threshold and a pixel bandwidth of 3 kHz under 1 klux scene illumination. Dynamic range is > 120 dB and chip power consumption is 23 mW. Event latency shows weak light dependency with a minimum of 15 μs at > 1 klux pixel illumination. The sensor is built in a 0.35 μm 4M2P process. It has 40 × 40 μm² pixels with 9.4% fill factor. By providing high pixel bandwidth, wide dynamic range, and precisely timed sparse digital output, this silicon retina provides an attractive combination of characteristics for low-latency dynamic vision under uncontrolled illumination with low post-processing requirements.
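The pixel's operating principle can be summarized in a few lines; a toy simulation of the event-generation rule, where a pixel emits an ON/OFF event whenever log intensity has moved by more than a contrast threshold since its reference level (threshold value illustrative):

```python
import numpy as np

# Toy simulation of the pixel principle described above: an ON/OFF event is
# emitted whenever log intensity moves by more than a contrast threshold from
# the reference level set at the last event (threshold value illustrative).
def dvs_events(intensity, theta=0.15):
    """intensity: 1-D array of one pixel's brightness samples over time."""
    log_i = np.log(np.maximum(intensity, 1e-6))
    ref, events = log_i[0], []
    for t in range(1, len(log_i)):
        while log_i[t] - ref > theta:          # brightness rose: ON event(s)
            ref += theta
            events.append((t, +1))
        while ref - log_i[t] > theta:          # brightness fell: OFF event(s)
            ref -= theta
            events.append((t, -1))
    return events

print(dvs_events(np.linspace(1.0, 3.0, 50)))   # a brightness ramp yields ON events
```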
Article
Once an academic venture, autonomous driving has received unparalleled corporate funding in the last decade. Still, operating conditions of current autonomous cars are mostly restricted to ideal scenarios. This means that driving in challenging illumination conditions such as night, sunrise, and sunset remains an open problem. In these cases, standard cameras are being pushed to their limits in terms of low-light and high-dynamic-range performance. To address these challenges, we propose DSEC, a new dataset that contains such demanding illumination conditions and provides a rich set of sensory data. DSEC offers data from a wide-baseline stereo setup of two color frame cameras and two high-resolution monochrome event cameras. In addition, we collect lidar data and RTK GPS measurements, both hardware-synchronized with all camera data. One of the distinctive features of this dataset is the inclusion of high-resolution event cameras. Event cameras have received increasing attention for their high temporal resolution and high dynamic range performance. However, due to their novelty, event camera datasets in driving scenarios are rare. This work presents the first high-resolution, large-scale stereo dataset with event cameras. The dataset contains over 40 sequences collected by driving in a variety of illumination conditions and provides ground-truth depth for the development and evaluation of event-based stereo algorithms. Code and dataset are available at https://github.com/uzh-rpg/DSEC
Conference Paper
We present a unifying framework to solve several computer vision problems with event cameras: motion, depth and optical flow estimation. The main idea of our framework is to find the point trajectories on the image plane that are best aligned with the event data by maximizing an objective function: the contrast of an image of warped events. Our method implicitly handles data association between the events, and therefore, does not rely on additional appearance information about the scene. In addition to accurately recovering the motion parameters of the problem, our framework produces motion-corrected edge-like images with high dynamic range that can be used for further scene analysis. The proposed method is not only simple, but more importantly, it is, to the best of our knowledge, the first method that can be successfully applied to such a diverse set of important vision tasks with event cameras.
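A minimal sketch of the objective for the simplest case, a constant image-plane velocity: warp events along a candidate velocity, accumulate an image of warped events, and score its variance (grid search here; the framework itself is far more general):

```python
import numpy as np

# Sketch of the contrast objective for the simplest motion model, a constant
# image-plane velocity: warp events to a reference time, accumulate an image
# of warped events, and score the candidate by the image's variance.
def contrast(events, v, shape=(64, 64)):
    """events: (N, 3) array of (x, y, t); v: candidate velocity (vx, vy)."""
    x, y, t = events[:, 0], events[:, 1], events[:, 2]
    xw = np.clip(np.round(x - v[0] * t).astype(int), 0, shape[1] - 1)
    yw = np.clip(np.round(y - v[1] * t).astype(int), 0, shape[0] - 1)
    img = np.zeros(shape)
    np.add.at(img, (yw, xw), 1.0)              # image of warped events
    return img.var()

def best_velocity(events, grid=np.linspace(-20.0, 20.0, 41)):
    """Brute-force maximization over a velocity grid (a real implementation
    would use a gradient-based optimizer)."""
    return max(((contrast(events, (vx, vy)), (vx, vy))
                for vx in grid for vy in grid))[1]
```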
Conference Paper
Dynamic vision sensors (DVS) output asynchronous log-intensity change events. They have potential applications in high-speed robotics, autonomous cars, and drones. The precise event timing, sparse output, and wide dynamic range of the events are well suited for optical flow, but conventional optical flow (OF) algorithms are not well matched to the event stream data. This paper proposes an event-driven OF algorithm called adaptive block-matching optical flow (ABMOF). ABMOF uses time slices of accumulated DVS events. The time slices are adaptively rotated based on the input events and OF results. Compared with other methods such as gradient-based OF, ABMOF can be efficiently implemented in compact logic circuits. We developed both ABMOF and Lucas-Kanade (LK) algorithms using our adapted slices. Results show that ABMOF accuracy is comparable with LK accuracy on natural scene data including sparse and dense texture, high dynamic range, and fast motion exceeding 30,000 pixels per second.
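The matching step itself is standard block matching applied to event time slices; a numpy sketch (the adaptive slice rotation that gives ABMOF its name is omitted):

```python
import numpy as np

# Standard block matching applied to two accumulated event time slices; the
# adaptive slice rotation that gives ABMOF its name is omitted. Assumes the
# block and search window stay inside the slices.
def block_flow(slice_old, slice_new, x, y, block=9, search=7):
    """Match the block around (x, y) in slice_old against slice_new by
    minimum sum of absolute differences (SAD); the displacement is the flow."""
    r = block // 2
    ref = slice_old[y - r:y + r + 1, x - r:x + r + 1].astype(float)
    best, flow = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = slice_new[y - r + dy:y + r + 1 + dy,
                             x - r + dx:x + r + 1 + dx].astype(float)
            sad = np.abs(ref - cand).sum()
            if sad < best:
                best, flow = sad, (dx, dy)
    return flow
```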
Article
The recent emergence of bioinspired event cameras has opened up exciting new possibilities in high-frequency tracking, bringing robustness to common problems in traditional vision, such as lighting changes and motion blur. In order to leverage these attractive attributes of the event cameras, research has been focusing on understanding how to process their unusual output: an asynchronous stream of events. With the majority of existing techniques discretizing the event stream, essentially forming frames of events grouped according to their timestamps, the full power of these cameras has yet to be exploited. In this spirit, this letter proposes a new, purely event-based corner detector and a novel corner tracker, demonstrating that it is possible to detect corners and track them directly on the event stream in real time. Evaluation on benchmarking datasets reveals a significant boost in the number of detected corners and the repeatability of such detections over the state of the art, even in challenging scenarios, while enabling more than a 4× speed-up compared to the most efficient algorithm in the literature. The proposed pipeline detects and tracks corners at a rate of more than 7.5 million events per second, promising great impact in high-speed applications.
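Detectors of this kind typically operate on a surface of active events (SAE), a per-pixel, per-polarity map of the latest event timestamp; a minimal sketch of the per-event flow, with the actual corner criterion left as a hypothetical placeholder:

```python
import numpy as np

# Per-event skeleton around a surface of active events (SAE): a per-pixel,
# per-polarity map of the latest timestamp. corner_test is a hypothetical
# placeholder for the actual event-based corner criterion.
H, W = 180, 240
sae = np.zeros((2, H, W))                      # [polarity, y, x] -> last time

def on_event(x, y, t, p, corner_test):
    sae[p, y, x] = t                           # O(1) update per event
    if corner_test(sae[p], x, y):              # decide directly on the stream
        return (x, y, t, p)                    # emit a corner event
    return None
```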
Conference Paper
We propose a method which can perform real-time 3D reconstruction from a single hand-held event camera with no additional sensing, and works in unstructured scenes of which it has no prior knowledge. It is based on three decoupled probabilistic filters, each estimating 6-DoF camera motion, scene logarithmic (log) intensity gradient and scene inverse depth relative to a keyframe, and we build a real-time graph of these to track and model over an extended local workspace. We also upgrade the gradient estimate for each keyframe into an intensity image, allowing us to recover a real-time video-like intensity sequence with spatial and temporal super-resolution from the low bit-rate input event stream. To the best of our knowledge, this is the first algorithm provably able to track a general 6D motion along with reconstruction of arbitrary structure including its intensity and the reconstruction of grayscale video that exclusively relies on event camera data.
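As a rough illustration of what each decoupled filter does per event, here is a generic scalar Kalman-style measurement update; the paper's actual measurement models for motion, gradient, and inverse depth are substantially more involved:

```python
# Generic scalar Kalman-style measurement update, a rough stand-in for the
# per-event updates the three decoupled filters perform; the paper's models
# for motion, gradient, and inverse depth are substantially more involved.
def kalman_update(mu, var, z, h, H, R):
    """mu, var: state mean/variance; z: measurement; h: predicted measurement
    h(mu); H: measurement Jacobian; R: measurement noise variance."""
    S = H * var * H + R                        # innovation covariance
    K = var * H / S                            # Kalman gain
    mu = mu + K * (z - h)                      # corrected mean
    var = (1.0 - K * H) * var                  # corrected variance
    return mu, var
```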
Conference Paper
A large number of absolute pose algorithms have been presented in the literature. Common performance criteria are computational complexity, geometric optimality, global optimality, structural degeneracies, and the number of solutions. The ability to handle minimal sets of correspondences, resulting solution multiplicity, and generalized cameras are further desirable properties. This paper presents the first PnP solution that unifies all the above desirable properties within a single algorithm. We compare our result to state-of-the-art minimal, non-minimal, central, and non-central PnP algorithms, and demonstrate universal applicability, competitive noise resilience, and superior computational efficiency. Our algorithm is called Unified PnP (UPnP).
Conference Paper
The fusion of visual and inertial cues has become popular in robotics due to the complementary nature of the two sensing modalities. While most fusion strategies to date rely on filtering schemes, the visual robotics community has recently turned to non-linear optimization approaches for tasks such as visual Simultaneous Localization And Mapping (SLAM), following the discovery that this comes with significant advantages in quality of performance and computational complexity. Following this trend, we present a novel approach to tightly integrate visual measurements with readings from an Inertial Measurement Unit (IMU) in SLAM. An IMU error term is integrated with the landmark reprojection error in a fully probabilistic manner, resulting in a joint non-linear cost function to be optimized. Employing the powerful concept of 'keyframes', we partially marginalize old states to maintain a bounded-size optimization window, ensuring real-time operation. Comparing against both vision-only and loosely-coupled visual-inertial algorithms, our experiments confirm the benefits of tight fusion in terms of accuracy and robustness.
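Structurally, the tightly-coupled objective is a single nonlinear least-squares cost mixing both residual types; a schematic sketch in which the residual models are passed in as stand-ins rather than reproduced:

```python
# Schematic of the joint cost: visual reprojection residuals and IMU error
# terms in one nonlinear least-squares objective, each weighted by its
# information matrix. reproject and imu_residual are caller-supplied
# stand-ins for the actual measurement models.
def joint_cost(states, landmarks, visual_terms, imu_terms,
               reproject, imu_residual):
    """visual_terms: (state idx i, landmark idx k, measurement z, info W);
    imu_terms: (state idx i, state idx j, info W) for consecutive states."""
    J = 0.0
    for i, k, z, W in visual_terms:
        r = reproject(states[i], landmarks[k]) - z   # reprojection residual
        J += r @ W @ r
    for i, j, W in imu_terms:
        r = imu_residual(states[i], states[j])       # IMU error term
        J += r @ W @ r
    return J                                   # minimized jointly over all states
```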
Article
Conventional image sensors produce massive amounts of redundant data and are limited in temporal resolution by the frame rate. This paper reviews our recent breakthrough in the development of a high-performance spike-event based dynamic vision sensor (DVS) that discards the frame concept entirely, and then describes novel digital methods for efficient low-level filtering and feature extraction and high-level object tracking that are based on the DVS spike events. These methods filter events, label them, or use them for object tracking. Filtering reduces the number of events but improves the ratio of informative events. Labeling attaches additional interpretation to the events, e.g., orientation or local optical flow. Tracking uses the events to track moving objects. Processing occurs on an event-by-event basis and uses the event time and identity as the basis for computation. A common memory object for filtering and labeling is a spatial map of the most recent past event times. Processing methods typically use these past event times together with the present event in integer branching logic to filter, label, or synthesize new events. These methods are straightforwardly computed on serial digital hardware, resulting in a new event- and timing-based approach for visual computation that efficiently integrates a neural style of computation with digital hardware. All code is open-sourced in the jAER project (jaer.wiki.sourceforge.net).
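The "spatial map of the most recent past event times" pattern is worth showing concretely; a sketch of the classic background-activity filter built on it, which passes an event only if a neighboring pixel fired recently (window length illustrative):

```python
import numpy as np

# Sketch of the "map of most recent past event times" in action: a
# background-activity filter that passes an event only if some pixel in its
# 3x3 neighborhood fired within a short window (window length illustrative).
H, W, DT = 128, 128, 30_000                    # sensor size; 30 ms in microseconds
last_t = np.full((H, W), -10**9, dtype=np.int64)

def filter_event(x, y, t):
    y0, y1 = max(0, y - 1), min(H, y + 2)
    x0, x1 = max(0, x - 1), min(W, x + 2)
    supported = bool((t - last_t[y0:y1, x0:x1] < DT).any())
    last_t[y, x] = t                           # record this event's timestamp
    return supported                           # uncorrelated noise is rejected
```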
Conference Paper
In this paper, we present an extended Kalman filter (EKF)-based algorithm for real-time vision-aided inertial navigation. The primary contribution of this work is the derivation of a measurement model that is able to express the geometric constraints that arise when a static feature is observed from multiple camera poses. This measurement model does not require including the 3D feature position in the state vector of the EKF and is optimal, up to linearization errors. The vision-aided inertial navigation algorithm we propose has computational complexity only linear in the number of features, and is capable of high-precision pose estimation in large-scale real-world environments. The performance of the algorithm is demonstrated in extensive experimental results, involving a camera/IMU system localizing within an urban area.
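The key trick of such a measurement model can be sketched in a few lines (illustrative, not the paper's code): project the stacked residual onto the left null space of the feature Jacobian so the feature position drops out of the update.

```python
import numpy as np
from scipy.linalg import null_space

# Illustration of the measurement model's key step: the stacked residual
# r = H_x dx + H_f df + n depends on the feature error df; projecting onto
# the left null space of H_f removes that dependence, so the 3D feature
# never needs to enter the EKF state vector.
def marginalize_feature(r, H_x, H_f):
    A = null_space(H_f.T)                      # columns span left null space of H_f
    return A.T @ r, A.T @ H_x                  # feature-free residual and Jacobian
```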
Article
An efficient algorithmic solution to the classical five-point relative pose problem is presented. The problem is to find the possible solutions for relative camera pose between two calibrated views given five corresponding points. The algorithm consists of computing the coefficients of a tenth-degree polynomial in closed form and, subsequently, finding its roots. It is the first algorithm well suited for numerical implementation that also corresponds to the inherent complexity of the problem. We investigate the numerical precision of the algorithm. We also study its performance under noise in minimal as well as overdetermined cases. The performance is compared to that of the well-known 8- and 7-point methods and a 6-point scheme. The algorithm is used in a robust hypothesize-and-test framework to estimate structure and motion in real time with low delay. The real-time system uses solely visual input and has been demonstrated at major conferences.
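The hypothesize-and-test framework around the solver is a generic RANSAC loop; a sketch in which five_point_solver and epipolar_error are hypothetical stand-ins for the actual minimal solver and error metric:

```python
import random

# Generic hypothesize-and-test (RANSAC) skeleton around a minimal solver;
# five_point_solver and epipolar_error are hypothetical stand-ins for the
# actual solver (up to 10 essential matrices per sample) and error metric.
def ransac_pose(matches, five_point_solver, epipolar_error,
                iters=500, thresh=1e-3):
    best_E, best_inliers = None, []
    for _ in range(iters):
        sample = random.sample(matches, 5)     # minimal five-point sample
        for E in five_point_solver(sample):    # test every returned hypothesis
            inliers = [m for m in matches if epipolar_error(E, m) < thresh]
            if len(inliers) > len(best_inliers):
                best_E, best_inliers = E, inliers
    return best_E, best_inliers
```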
Conference Paper
This paper presents a method of estimating camera pose in an unknown scene. While this has previously been attempted by adapting SLAM algorithms developed for robotic exploration, we propose a system specifically designed to track a hand-held camera in a small AR workspace. We propose to split tracking and mapping into two separate tasks, processed in parallel threads on a dual-core computer: one thread deals with the task of robustly tracking erratic hand-held motion, while the other produces a 3D map of point features from previously observed video frames. This allows the use of computationally expensive batch optimisation techniques not usually associated with real-time operation: The result is a system that produces detailed maps with thousands of landmarks which can be tracked at frame-rate, with an accuracy and robustness rivalling that of state-of-the-art model-based systems.
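The tracking/mapping split maps naturally onto two threads sharing a map; a structural Python sketch (all function arguments are illustrative placeholders, not the system's actual interfaces):

```python
import threading
import queue

# Structural sketch of the tracking/mapping split: a fast per-frame tracking
# loop and a slow batch mapping loop share a map and communicate through a
# keyframe queue (all function arguments are illustrative placeholders).
keyframes = queue.Queue()
map_lock = threading.Lock()
shared_map = {}                                # landmark id -> 3D point

def tracking_loop(camera, estimate_pose, needs_keyframe):
    for frame in camera:                       # must keep up with frame rate
        with map_lock:
            pose = estimate_pose(frame, shared_map)
        if needs_keyframe(pose):
            keyframes.put((frame, pose))       # hand the frame to the mapper

def mapping_loop(bundle_adjust):
    while True:
        frame, pose = keyframes.get()          # wakes only on new keyframes
        with map_lock:
            bundle_adjust(shared_map, frame, pose)   # expensive batch step

# Launch with: threading.Thread(target=tracking_loop, args=(...)).start()
# and likewise for mapping_loop.
```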
Article
The fundamental matrix is a basic tool in the analysis of scenes taken with two uncalibrated cameras, and the eight-point algorithm is a frequently cited method for computing the fundamental matrix from a set of eight or more point matches. It has the advantage of simplicity of implementation. The prevailing view is, however, that it is extremely susceptible to noise and hence virtually useless for most purposes. This paper challenges that view, by showing that by preceding the algorithm with a very simple normalization (translation and scaling) of the coordinates of the matched points, results are obtained comparable with the best iterative algorithms. This improved performance is justified by theory and verified by extensive experiments on real images.
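The normalization itself takes only a few lines: translate each point set so its centroid is at the origin, isotropically scale it so the mean distance from the origin is √2, and denormalize the estimated matrix afterwards. A numpy sketch:

```python
import numpy as np

# The normalization that rescues the eight-point algorithm: translate each
# point set so its centroid is at the origin and scale it so the mean
# distance from the origin is sqrt(2); denormalize the estimate afterwards.
def normalize(pts):
    """pts: (N, 2) array; returns normalized points and the 3x3 transform T."""
    c = pts.mean(axis=0)
    s = np.sqrt(2.0) / np.linalg.norm(pts - c, axis=1).mean()
    T = np.array([[s, 0.0, -s * c[0]],
                  [0.0, s, -s * c[1]],
                  [0.0, 0.0, 1.0]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])  # homogeneous coords
    return (T @ pts_h.T).T[:, :2], T

# With F_hat estimated from normalized matches (T1 @ p1, T2 @ p2), the
# fundamental matrix in the original coordinates is F = T2.T @ F_hat @ T1.
```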
In defense of the eight-point algorithm
  • R. I. Hartley
About the algebraic structure of the orthogonal group and the other classical groups in a field of characteristic zero or a prime characteristic
  • Cayley