THE-SEAN: A Heart Rate Variation-Inspired Temporally High-Order
Event-Based Visual Odometry with Self-Supervised
Spiking Event Accumulation Networks
Chaoran Xiong1, Student Member, IEEE, Litao Wei1, Kehui Ma1, Zhen Sun1, Yan Xiang1, Zihan Nan2,
Trieu-Kien Truong1, Life Fellow, IEEE, and Ling Pei1,∗, Senior Member, IEEE
Abstract— Event-based visual odometry has recently gained
attention for its high accuracy and real-time performance in
fast-motion systems. Unlike traditional synchronous estimators
that rely on constant-frequency (zero-order) triggers, event-
based visual odometry can actively accumulate information to
generate temporally high-order estimation triggers. However,
existing methods primarily focus on adaptive event represen-
tation after estimation triggers, neglecting the decision-making
process for efficient temporal triggering itself. This oversight
leads to computational redundancy and noise accumulation.
In this paper, we introduce a temporally high-order event-based
visual odometry with spiking event accumulation networks
(THE-SEAN). To the best of our knowledge, it is the first
event-based visual odometry capable of dynamically adjusting
its estimation trigger decision in response to motion and envi-
ronmental changes. Inspired by biological systems that regulate
hormone secretion to modulate heart rate, a self-supervised
spiking neural network is designed to generate estimation
triggers. This spiking network extracts temporal features to
produce triggers, with rewards based on block matching points
and Fisher information matrix (FIM) trace acquired from the
estimator itself. Finally, THE-SEAN is evaluated across several
open datasets, thereby demonstrating average improvements of
13% in estimation accuracy, 9% in smoothness, and 38% in
triggering efficiency compared to the state-of-the-art methods.
I. INTRODUCTION
Event-based visual odometry is a state estimation framework that uses data from event cameras and offers the advantages of low latency, high accuracy, and low power
consumption [1]–[4]. Unlike traditional frame-based syn-
chronous estimators [5]–[9] that rely on constant-frequency
external triggers (zero-order sampling), event-based estima-
tors autonomously generate internal triggers for estimation
[10]. This capability allows them to adapt dynamically to
the motion and environmental context, thus improving per-
formance in fast-motion systems and enhancing information
processing efficiency.
This work was supported in part by the Basic Science Center Program
of the National Natural Science Foundation of China (Grant No.62388101),
in part by the National Natural Science Foundation of China (NSFC) (Grant
Number: 62273229) and in part by Science and Technology Commission of
Shanghai Municipality (Grant Number: 24TS1402600 and 24TS1402800).
∗Corresponding author: Ling Pei.
1The authors are with the Shanghai Jiao Tong University, Shang-
hai 200240, China (e-mail: sjtu4742986; oscar0371; khma0929; zhensun;
yan.xiang; truong@isu.edu.tw; ling.pei@sjtu.edu.cn).
2Zihan Nan is with the Beijing Institute of Aerospace Control Devices, 100039, Beijing, China (e-mail: nan657584155@163.com).
The code will be released at https://github.com/Franky-X/THE-SEAN.
[Fig. 1 graphic: event arrival timelines comparing the traditional zero-order trigger (redundant computation, noise accumulation) with the high-order trigger with SEAN (efficient computation, noise reduction), each with the resulting trajectory.]
Fig. 1. Comparison of the proposed temporally high-order event-based
visual odometry system with the traditional zero-order event-based esti-
mator. Traditional estimators typically rely on constant-frequency triggers
with adaptive accumulation, resulting in computational redundancy and
noise accumulation. In contrast, our proposed temporally high-order system dynamically determines the optimal trigger moments, inspired by the heart rate variation mechanism of humans, thereby enhancing both computational efficiency and estimation accuracy.
Typically, event-based estimators require the accumula-
tion of asynchronous event data for processing [1]–[4]. Various
methods have been explored for the event stream represen-
tation. Common approaches for event frame representation
include constant window methods, such as the naive direct
accumulation approach [11] and time surface [12], which
are straightforward to implement and yield intuitive results.
Alternatively, adaptive window methods, such as adaptive
time surface [13], [14] and adaptive accumulation [15],
offer more efficient event representations, but they often
involve higher computational complexity during the event
frame construction at each estimation trigger. They aim to
accumulate sufficient information at the estimation trigger
while minimizing the accumulation of irrelevant data.
However, existing research has primarily focused on event
accumulation after the estimation trigger [12]–[15], over-
looking the significance of the trigger decision itself. On
the one hand, high estimation trigger frequency can lead
to redundant computational overhead, especially when using
adaptive representations. For instance, frequent triggering
still accumulates events from a long, stale time window under sparse data conditions, resulting in low information
gain and inefficient use of computational resources. On the
other hand, low trigger frequency may lead to insufficient
information utilization, degrading the estimation accuracy,
see [11]. Therefore, adaptive trigger decisions in the temporal
dimension are crucial for event-based estimators. This capa-
bility enables the estimator to increase the trigger frequency
during fast-motion conditions to enhance accuracy, while
reducing it in static or low-motion scenarios to conserve
computational resources and minimize accumulated noise.
In this paper, we propose a temporally high-order event-
based (THE) odometry system, in which the trigger decision
in the temporal domain is generated by spiking event accu-
mulation networks (SEAN). Inspired by biological mecha-
nisms, which regulate hormone and pheromone signaling to
adapt to dynamic motion and environmental conditions [16],
SEAN is designed to emulate this process. Firstly, SEAN
employs a leaky integrate-and-fire (LIF) neural network to
extract features from the incoming event stream. Then, the
leaky integrate (LI) neurons of SEAN are used to perform
value regression in order to generate rewards for triggering
or maintaining idle states. Finally, the weights of SEAN
are adjusted based on the information gain derived from
the estimator itself. This process simulates the biological
process of regulating heart rate. To the best of our knowledge,
THE-SEAN is the first event-based visual odometry system
capable of dynamically adjusting its estimation trigger deci-
sion based on its own motion and surrounding environmental
changes, as illustrated in Fig. 1. Our main contributions are
as follows:
1) A bio-inspired temporally high-order event-based
(THE) visual odometry with spiking event accumula-
tion networks (SEAN). THE-SEAN can dynamically
adjust its mapping and tracking trigger decision pol-
icy based on its motion and surrounding environment,
mimicking biological mechanisms regulating hormone
secretion to modulate heart rate.
2) A self-supervised Q-learning strategy using the in-
formation gain computed from the estimator itself,
including valid block matching points for when to map
and Fisher information matrix (FIM) trace for when
to track, as rewards instead of labeled ground truth
trajectory. This ensures that THE-SEAN can operate
effectively across various scenarios without the need
for pre-training or a large number of parameters.
3) New evaluation metrics including tracking and map-
ping triggering rate to assess estimation triggering ef-
ficiency for event-based visual odometry. Experiments
conducted across various open datasets demonstrate
that, on average, THE-SEAN improves estimation ac-
curacy by 13%, enhances smoothness by 8%, and
reduces the estimation trigger rate by 38% compared
to the latest event-based odometry algorithms.
The remainder of this paper is organized as follows: The
related work on event-based visual odometry is reviewed in
Section II. Section III describes the formulation of decision-
making process for the temporally high-order event-based
estimator. Section IV introduces the main technical contribu-
tions of this work. Section V presents a comparison of THE-
SEAN with the latest event-based visual odometry methods
across multiple open datasets, including an ablation study.
Finally, the conclusion is given in Section VI.
II. RELATED WORK
Event-based visual odometry has gained significant atten-
tion due to its low latency, low power consumption, and high
accuracy in fast-motion systems [1]–[4]. Unlike traditional
cameras that rely on external constant-frequency triggers,
event cameras capture asynchronous data, mimicking human
vision. Stereo event cameras, in particular, enable more
effective depth perception by emulating human binocular
vision.
Different from traditional camera-based estimation meth-
ods, event-based visual odometry depends on internal mech-
anisms for accumulating event data, as it lacks external
triggers [11]. Existing approaches focus on accumulating
events after a trigger, typically within a temporal window,
and can be categorized into constant-window and adaptive-
window accumulation methods. The constant window accu-
mulation methods, such as simple accumulation [11] and
time surface decay [12], are easy to implement but may
result in either insufficient or excessive data accumulation. In
contrast, adaptive window methods adjust the time window
based on the amount of accumulated event information
[15], ensuring adequate data is available at each trigger.
Although more efficient, adaptive representation methods
tend to incur higher computational costs and are subject to
noise accumulation.
Building on event representation, various odometry algo-
rithms have been developed. In [1], Zhou et al. first proposed
to perform the mapping and tracking processes simultaneously
in classic stereo event-based visual odometry (ESVO). The
mapping module estimates depth through stereo disparity,
generating reference frames to perform pose estimation in
the tracking module. The tracking module projects the accu-
mulated event frames onto these reference frames for pose
estimation. Building on these two modules, different event
representation methods have been introduced in the event-
based visual odometry, see [2]–[4]. However, both mapping
and tracking modules of these systems depend on fixed-rate
triggers determined by the platform’s processing capacity,
resulting in considerable computational overhead. Moreover,
fixed-frequency triggering can lead to limited information
gain in certain situations, especially when overwhelmed by
noise, negatively impacting estimation accuracy and stability.
To date, existing stereo event-based visual odometry meth-
ods often neglect the critical importance of the trigger deci-
sion. In constant window methods, optimal trigger decision-
making ensures adequate data accumulation at each trigger
point. Similarly, adaptive methods rely on well-timed trig-
ger decisions to prevent redundant accumulation, especially
when event data is sparse. Therefore, it is essential to adjust
the trigger frequency adaptively based on motion dynamics
and environmental changes. Nevertheless, current methods
lack temporally high-order triggering mechanisms that can
adjust based on scene dynamics.
[Fig. 2 graphic: block diagram in which the event arrival stream feeds a mapping trigger decider (when to map) and a tracking trigger decider (when to track); their estimation triggers drive the reference map and pose tracking modules of the estimator via event re-projection.]
Fig. 2. Problem formulation of temporally high-order event-based estima-
tor. The asynchronous estimator must determine when to trigger the mapping
and tracking process in order to minimize both estimation error and power
consumption.
III. PROBLEM FORMULATION
This section formulates temporally high-order state esti-
mation for asynchronous event-based systems. Section III-
A outlines the key challenges in current asynchronous
estimation paradigms. Section III-B introduces a Markov
Decision Process (MDP)-based approach to tackle the high-
order triggering problem. This formulation enables time-
aware triggering policies that optimize the timing of state
estimate updates.
A. Asynchronous vs. Synchronous Estimation
Traditional state estimation systems rely on external
triggers to update the state within synchronous frame-
works. Conversely, asynchronous event-based estimation au-
tonomously determines state updates based on irregular,
event-driven inputs. This introduces challenges in balancing
computational efficiency, accuracy, and responsiveness to
sparse, temporally irregular events. However, current event-
based estimation methods still rely on constant-frequency
triggering, which yields $\mathrm{d}\Delta t / \mathrm{d}t = 0$, i.e., the trigger time interval $\Delta t$ has a derivative equal to zero with respect to time $t$, and thus is referred to as zero-order estimation. This
synchronous approach is inefficient, resulting in suboptimal
resource utilization and poor temporal fusion of information.
Existing methods focus on event processing at each trigger,
but neglect the crucial task of modeling the temporal process
for active asynchronous triggering. Therefore, a new formu-
lation for temporal modeling in asynchronous estimation is
needed.
B. Formulation of Temporally High-Order Estimation
To address the lack of principled temporal modeling in
existing event-based estimation frameworks, this paper intro-
duces the concept of temporally high-order state estimation
shown in Fig. 2. We define asynchronous estimation as a
decision-making process where the estimator must deter-
mine:
•When to map: At each event trigger, the estimator
decides whether to create or update the reference depth
frame through the mapping process.
•When to track: At each event trigger, the estimator
determines whether to update the agent’s pose by per-
forming the tracking process.
The decision-making process for when to trigger tracking
and mapping in an estimator is modeled as a MDP. The goal
is to minimize both estimation error and power consumption.
The components of the MDP are outlined below.
1) State Representation: The state $s_t$ of the agent at time $t$ consists of the current input event stream within a temporal window, denoted by
$$s_t = \{(x, y, p, i) \mid i \in [t - t_w, t]\}, \qquad (1)$$
where $x, y$ are the coordinates of active pixels, $p$ is the polarity of the active event, and $i$ is the timestamp of the event. $t_w$ represents the temporal window over which the state is considered for decision-making.
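For illustration, the windowed state in (1) is simply a timestamp filter over the raw event stream; a minimal NumPy sketch (the structured-array layout and field names are our own assumptions, not the paper's data structures):

import numpy as np

def event_window_state(events, t, t_w):
    """Return the state s_t of Eq. (1): all events (x, y, p, i) with i in [t - t_w, t]."""
    mask = (events['i'] >= t - t_w) & (events['i'] <= t)
    return events[mask]

# Toy usage with three synthetic events (x, y, polarity, timestamp), sorted by timestamp.
events = np.array([(3, 5, 1, 0.010), (7, 2, 0, 0.018), (1, 9, 1, 0.031)],
                  dtype=[('x', int), ('y', int), ('p', int), ('i', float)])
s_t = event_window_state(events, t=0.030, t_w=0.015)  # keeps only the event at i = 0.018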
2) Action Space: The action $a_t$ represents the decisions the estimator can take, which includes
$$a_t = \{a^{\text{map}}_t, a^{\text{track}}_t\}, \qquad (2)$$
where $a^{\text{map}}_t$ and $a^{\text{track}}_t$ are binary indicators (either 0 or 1) denoting whether the mapping or tracking process should be triggered.
3) Energy Consumption: Each action incurs an associated energy cost for computation. The energy consumption $E(a_t)$ for a given action $a_t$ is given by
$$E(a_t) = a^{\text{map}}_t E_{\text{map}} + a^{\text{track}}_t E_{\text{track}}, \qquad (3)$$
where $E_{\text{map}}$ and $E_{\text{track}}$ represent the power consumption of the mapping and tracking computations, respectively.
4) Policy: The policy $\pi(\cdot)$ governs the action selection process and aims to minimize both the estimation error and the power consumption. The optimal policy $\pi^*$ minimizes the following objective function
$$\pi^* = \arg\min_{\pi} \left( \lambda_E \frac{1}{N}\sum_{t=0}^{N} E(a_t) + \lambda_P \frac{1}{N}\sum_{i=1}^{N} \left\| T_{g,i}^{-1} T_{\text{est},i} \right\| \right), \qquad (4)$$
where $T_{g,i}$ is the ground truth pose at time step $i$, and $T_{\text{est},i}$ is the estimated pose at the same time. $\lambda_E$ and $\lambda_P$ are predefined weights that reflect the relative importance of energy consumption and pose accuracy, respectively. $\pi^*$ determines when to map or track for the estimator. Hence, the trigger time interval $\Delta t$ is adaptive with respect to time, and thus is referred to as the high-order estimation.
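As a concrete reading of (4), the sketch below evaluates the energy term and the pose-error term for one finished decision sequence. It assumes poses are 4×4 homogeneous matrices and uses the translation magnitude of $T_{g,i}^{-1} T_{\text{est},i}$ as the pose-error norm; both choices are illustrative assumptions rather than the paper's exact error metric.

import numpy as np

def objective(actions, E_map, E_track, T_gt, T_est, lam_E, lam_P):
    """Evaluate the weighted objective of Eq. (4) for one episode.

    actions: list of dicts {'map': 0/1, 'track': 0/1}
    T_gt, T_est: lists of 4x4 ground-truth / estimated poses (same length as actions)
    """
    N = len(actions)
    energy = sum(a['map'] * E_map + a['track'] * E_track for a in actions) / N
    # Pose-error term: translation magnitude of the relative transform T_g^{-1} T_est.
    err = sum(np.linalg.norm((np.linalg.inv(Tg) @ Te)[:3, 3])
              for Tg, Te in zip(T_gt, T_est)) / N
    return lam_E * energy + lam_P * err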
IV. METHODOLOGY
In this section, we first introduce THE-SEAN, a bio-
inspired framework for asynchronous state estimation. We
then present the use of spiking neural networks and self-supervised reinforcement learning methods for the mapping and tracking processes. Finally, the implementation settings of SEAN in the event-based estimator are provided.
[Fig. 3 graphic: on the left, the biological pathway (eye spike signal → neuron feature perception → adrenaline encouragement → heart pulse variation, with dopamine as the reward signal); on the right, THE-SEAN (input event spikes → LIF layer spike feature extraction → LI layer and FC Q-value regression → output trigger variation, with FIM trace / block matching rewards fed back from the event-based estimator as the asynchronous state update loop).]
Fig. 3. System overview of the bio-inspired temporally high-order estimation framework, THE-SEAN. The left green section illustrates the biological
mechanism for asynchronous state estimation, where sensory spike signals pass through neurons, triggering hormone secretion that regulates heart rate.
Dopamine, as a reward signal, adjusts hormone levels in a feedback loop. The right blue section shows the proposed THE-SEAN, which emulates this
process. Event spikes are processed by spiking neural networks, and Q-values are regressed to regulate the trigger rate. The rewards acquired from the
estimator itself, including Fisher information matrix (FIM) trace and valid block matching points, supervise the network weights in a closed-loop system.
The corresponding processes are color-coded by A, B, C, D for clarity.
A. System Overview
THE-SEAN, our bio-inspired temporally high-order esti-
mation framework, is illustrated in Fig. 3. In the biological mechanism, the human eye generates sensory pulses and
transmits the signal to the neurons. Then these neurons
trigger hormone secretion that regulates heart rate. In reality,
rapid environmental changes increase hormone secretion
by reward signals like dopamine, thereby speeding up the
heart rate and enhancing the body’s sensory processing.
Conversely, slow environmental changes decrease hormone
release, slowing the heart rate and reducing the frequency
and sensitivity of state estimation.
Inspired by this biological asynchronous estimation sys-
tem, we design THE-SEAN, an asynchronous estimation
framework for event cameras. Firstly, the pulse signals
from the camera are processed by the leaky integrate-and-
fire (LIF) neurons to simulate neuron activity. Then, leaky
integrate (LI) neurons are used to generate voltage values,
which are regressed to ON and OFF values. These values are
compared to generate triggers for state updates. Furthermore,
network weights are adjusted through self-supervised reward-
based learning. Finally, SEAN outputs asynchronous triggers
for the event-based estimator and the estimator produces
rewards for the network weights adjustment.
B. Architecture and Dynamics of SEAN
To mimic the function of neuron feature perception in humans, an asynchronous spiking event accumulation network (SEAN) is designed to extract the features of the asynchronous event stream. The temporal dynamics of SEAN are introduced as follows.
Firstly, the input event stream is processed through the
fully connected leaky integrate-and-fire (LIF) layer of the
SNN. The LIF neuron dynamics are described as
$$H^i_t = f\!\left(V^i_{t-1}, E^i_t\right), \qquad (5)$$
$$S^i_t = \Theta\!\left(H^i_t - V_{th}\right), \qquad (6)$$
$$V^i_t = H^i_t\left(1 - S^i_t\right) + V_r S^i_t, \qquad (7)$$
where $H^i_t$ and $V^i_t$ denote the $i$th membrane voltage after the neural dynamics and after the trigger of a spike at time-step $t$, respectively. $E^i_t$ denotes the $i$th pixel event trigger input, and $S^i_t$ is the $i$th LIF neuron output spike at time-step $t$, which equals 1 if there is a spike and 0 otherwise. $V_{th}$ denotes the threshold voltage and $V_r$ denotes the membrane rest voltage. The function $f(\cdot)$ of the LIF neuron is defined as
$$f\!\left(V^i_{t-1}, E^i_t\right) = V^i_{t-1} + \frac{1}{\tau}\left(-\left(V^i_{t-1} - V_r\right) + E^i_t\right), \qquad (8)$$
where $\tau$ is the membrane time constant. $\Theta(x)$ is the Heaviside step function, defined by $\Theta(x) = 1$ for $x \geq 0$ and $\Theta(x) = 0$ otherwise. Note that $V_0 = V_r$.
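The LIF update in (5)–(8) amounts to a leaky charge step, a threshold comparison, and a reset. A minimal NumPy sketch of one time-step (the parameter values are placeholders, not the settings of Table I):

import numpy as np

def lif_step(V_prev, E_t, tau=2.0, V_th=1.0, V_r=0.0):
    """One LIF time-step for a vector of neurons, following Eqs. (5)-(8)."""
    H_t = V_prev + (1.0 / tau) * (-(V_prev - V_r) + E_t)   # Eq. (8): leaky charge
    S_t = (H_t >= V_th).astype(float)                       # Eq. (6): Heaviside spike
    V_t = H_t * (1.0 - S_t) + V_r * S_t                     # Eq. (7): reset after a spike
    return S_t, V_t

V = np.zeros(4)                                              # V_0 = V_r
spikes, V = lif_step(V, E_t=np.array([0.0, 0.5, 1.5, 3.0]))  # only the last neuron spikes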
Then the output of the LIF layer is regressed to the voltages of a smaller number of leaky integrate (LI) neurons. These LI voltages are prepared for the Q-value regression in reinforcement learning and are computed by the LI neuron dynamics, which are expressed by
$$V^j_t = V^j_{t-1} + \frac{1}{\tau}\left(-\left(V^j_{t-1} - V_r\right) + \sum_{i=1}^{N} w^{\text{LIF}}_i \cdot S^i_t\right), \qquad (9)$$
where $V^j_t$ denotes the $j$th LI membrane voltage at time-step $t$, $w^{\text{LIF}}_i$ is the weight of the $i$th LIF neuron output, and $N$ is the number of LIF neurons.
Next, the LI voltages are regressed to two Q-values through a fully connected layer, given by
$$Q^{\text{ON}}_t = \sum_{j=1}^{M} \sigma\!\left(w^{\text{ON}}_j V^j_t + b^{\text{ON}}_j\right), \qquad (10)$$
$$Q^{\text{OFF}}_t = \sum_{j=1}^{M} \sigma\!\left(w^{\text{OFF}}_j V^j_t + b^{\text{OFF}}_j\right), \qquad (11)$$
where $Q^{\text{ON}}_t$ and $Q^{\text{OFF}}_t$ denote the ON and OFF Q-values at time-step $t$, respectively. $w^{\text{ON}}_j$ and $w^{\text{OFF}}_j$ are the weights of the $j$th LI neuron voltage for the ON and OFF value regression, respectively, and $b^{\text{ON}}_j$ and $b^{\text{OFF}}_j$ are the corresponding biases. $M$ is the number of LI neurons, and $\sigma(\cdot)$ is the activation function. Finally, the action $a_t$ is taken according to the output Q-values for the ON and OFF trigger, that is,
$$a_t = \Theta\!\left(Q^{\text{ON}}_t - Q^{\text{OFF}}_t\right), \qquad (12)$$
for which, if $a_t = 1$, SEAN enables the tracking or mapping estimation process; conversely, if $a_t = 0$, SEAN lets the estimator remain idle.
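Putting (9)–(12) together: the LI layer accumulates weighted LIF spikes without spiking itself, a sigmoid readout produces the ON/OFF Q-values, and the trigger is the sign of their difference. A minimal NumPy sketch; the per-LI-neuron weight matrix w_lif (where the paper writes a single weight per LIF output) and all random values are illustrative assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def li_step(V_prev, spikes, w_lif, tau=2.0, V_r=0.0):
    """Eq. (9): non-spiking LI voltage update driven by weighted LIF spikes."""
    return V_prev + (1.0 / tau) * (-(V_prev - V_r) + w_lif @ spikes)

def q_values_and_action(V_li, w_on, b_on, w_off, b_off):
    """Eqs. (10)-(12): regress ON/OFF Q-values and take the trigger decision."""
    q_on = np.sum(sigmoid(w_on * V_li + b_on))
    q_off = np.sum(sigmoid(w_off * V_li + b_off))
    a_t = 1 if q_on >= q_off else 0   # Heaviside of Q_ON - Q_OFF
    return q_on, q_off, a_t

# Toy dimensions: N = 8 LIF neurons feeding M = 3 LI neurons.
rng = np.random.default_rng(0)
w_lif = rng.normal(size=(3, 8))
V_li = li_step(np.zeros(3), rng.integers(0, 2, size=8).astype(float), w_lif)
q_on, q_off, a_t = q_values_and_action(V_li, *rng.normal(size=(4, 3)))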
To train the network, the discrete Q-learning strategy is adopted to update the weights and biases of SEAN, including $w^{\text{LIF}}_i$, $w^{\text{ON}}_j$, $b^{\text{ON}}_j$, $w^{\text{OFF}}_j$ and $b^{\text{OFF}}_j$. Details of the backpropagation method for this network can be found in [17]. The reward feedback is acquired from the estimator itself, as introduced in the next subsection.
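For reference, a sketch of the one-step temporal-difference target such a discrete Q-learning update could use; the discount factor and the specific TD(0) form are assumptions, and the surrogate-gradient backpropagation through the spiking layers (see [17]) is not reproduced here.

def td_target(reward, q_next_on, q_next_off, gamma=0.9):
    """Bootstrapped target for the Q-value of the action actually taken.

    The reward comes from the estimator itself (Eqs. (13)-(16)); the max over the
    next ON/OFF Q-values plays the role of the future value estimate.
    """
    return reward + gamma * max(q_next_on, q_next_off)

# The squared error (q_taken - td_target(...))**2 is then backpropagated through the
# LI and LIF layers using a sigmoid surrogate gradient, as in [17].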
C. Self-Supervised Reward Construction of SEAN
In order to update the SEAN weights, rewards are con-
structed for tracking and mapping trigger policy networks
SEAN. While humans can adjust their hormonal states
such as heart rate online during estimation, an online self-
supervised strategy for reinforcement learning is adopted to
ensure the lightweight, real-time, and generalizable nature
of SEAN. Instead of relying on any external ground truth
supervision for reward, the information that the estimator
can output itself is used as feedback to adjust the network
weights. This approach enables the network to adapt to
arbitrary scenes and sensor configurations. Below, we present
the reward design for mapping and tracking.
1) Depth Generation and Fusion Reward for Mapping: In
order to define the reward for triggering mapping estimation,
it is necessary to assess the ability of the mapping process
to generate a valid reference depth frame for tracking. THE-
SEAN uses the ESVO series as the baseline estimator.
Within the ESVO framework, two distinct scenarios that
influence the success of effective mapping and reference
frame construction are identified: 1) the initialization of the
depth map, and 2) the online updating of the reference depth
frame.
During initialization, the effectiveness of mapping process
depends on the number of valid depth points it initializes. A
higher number of initialized depth points provides additional re-projection constraints for tracking and contributes to updating the reference depth frame, which results in more points for future updates. Consequently, the reward for mapping during initialization is defined as a function of the number of valid depth points initialized, which is given by
$$R_{\text{init}}(t) = \begin{cases} N_{\text{SGM}}(t) - N_{\text{SGM}}(t-1), & a^{\text{init}}_t = 1, \\ -\alpha, & a^{\text{init}}_t = 0, \end{cases} \qquad (13)$$
where $R_{\text{init}}(t)$ is the reward for mapping initialization at time $t$, $N_{\text{SGM}}(t)$ is the number of valid depth points generated by the modified semiglobal matching (SGM) algorithm [18] during the initialization, $\alpha$ is a constant representing the punishment for the initialization delay, and $a^{\text{init}}_t$ is the action taken for ESVO initialization.
The effectiveness of the mapping update process is as-
sessed by the number of points successfully fused during the
depth fusion procedure. A larger number of fused points re-
flects a higher consistency between the current and previous
frame mappings. Consequently, the reward for the mapping
update is designed to be proportional to the number of fused
points in the depth fusion process, as expressed by
$$R_{\text{map}}(t) = \begin{cases} N_{\text{BM}}(t) + \lambda_e N_e(t), & a^{\text{map}}_t = 1, \\ \gamma_{\text{map}} N_{\text{BM}}(t_{\text{last}}) - \lambda_e N_e(t) + R^{\text{map}}_{\text{idle}}, & a^{\text{map}}_t = 0, \end{cases} \qquad (14)$$
where $R_{\text{map}}(t)$ is the reward for the mapping update at time $t$, $N_{\text{BM}}(t)$ is the number of valid depth fusion points generated by the block matching (BM) algorithm, and $N_e(t)$ is the number of active events during a fixed interval (e.g., 30 ms). $\gamma_{\text{map}}$ is a decay factor representing the reduction of mapping information during the idle interval, $t_{\text{last}}$ is the last time a valid trigger was produced, $\lambda_e$ is a ratio that balances the scales of $N_{\text{BM}}(t)$ and $N_e(t)$, and $a^{\text{map}}_t$ is the action taken for ESVO mapping estimation.
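For clarity, the mapping rewards (13)–(14) can be read as plain functions of quantities the estimator already exposes (valid SGM depth points, block-matching fusions, and recent event counts); the sketch below uses placeholder constants, not the paper's tuned values.

def reward_map_init(n_sgm_t, n_sgm_prev, a_init, alpha=10.0):
    """Eq. (13): reward for triggering (or delaying) the depth-map initialization."""
    return (n_sgm_t - n_sgm_prev) if a_init == 1 else -alpha

def reward_map_update(n_bm_t, n_bm_last, n_events, a_map,
                      lam_e=0.01, gamma_map=0.9, r_idle=1.0):
    """Eq. (14): reward for triggering or skipping a mapping (depth-fusion) update."""
    if a_map == 1:
        return n_bm_t + lam_e * n_events
    return gamma_map * n_bm_last - lam_e * n_events + r_idle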
2) Fisher Information Matrix Trace Reward for Tracking:
In order to formulate the reward for triggering tracking
estimation, it is essential to assess the information gain
in pose estimation throughout the tracking process. Pose
optimization is achieved through the reprojection of event-
based representations onto the reference depth frame. During
the optimization process, FIM is typically used to quantify
the information gain from the observations. The trace of FIM
serves as the metric for evaluating the information gain for
pose tracking.
In the optimization framework, FIM is equivalent to the
Hessian matrix. To maintain the real-time performance, we
approximate the trace of the Hessian using the Jacobian
matrix and the optimized residuals. The information gain
$I_{\text{track}}(t)$ of the pose estimation at time $t$ is given by
$$I_{\text{track}}(t) \approx \frac{\operatorname{Trace}\!\left(J_{\text{track}}(t)^{\top} J_{\text{track}}(t)\right)}{\sum_{i=1}^{M} \operatorname{res}_i^2(t)}, \qquad (15)$$
where $J_{\text{track}}(t)$ is the Jacobian matrix corresponding to the measurement model of the event representation reprojection; details on the Jacobian computation can be found in [1]. $\operatorname{res}_i(t)$ is the residual of each measurement after the optimization, and $M$ is the total number of measurements.
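The approximation in (15) only needs the stacked Jacobian and the post-optimization residuals, both of which a Gauss-Newton-style tracker already computes at each update; a NumPy sketch (the 6-DoF Jacobian shape and the toy data are assumptions):

import numpy as np

def tracking_information_gain(J, residuals):
    """Eq. (15): trace of J^T J normalized by the sum of squared residuals.

    J: (M, 6) Jacobian of the reprojection measurement model w.r.t. the pose.
    residuals: (M,) residuals after the pose optimization.
    """
    return np.trace(J.T @ J) / np.sum(residuals ** 2)

# Toy example with M = 100 measurements.
rng = np.random.default_rng(0)
I_track = tracking_information_gain(rng.normal(size=(100, 6)),
                                    rng.normal(scale=0.1, size=100))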
Based on $I_{\text{track}}(t)$, the reward for triggering the tracking process can be constructed, which is expressed by
$$R_{\text{track}}(t) = \begin{cases} I_{\text{track}}(t) + \lambda_e N_e(t), & a^{\text{track}}_t = 1, \\ \gamma_{\text{track}} I_{\text{track}}(t_{\text{last}}) - \lambda_e N_e(t) + R^{\text{track}}_{\text{idle}}, & a^{\text{track}}_t = 0, \end{cases} \qquad (16)$$
where $R_{\text{track}}(t)$ is the reward for the tracking pose estimation at time $t$, $\gamma_{\text{track}}$ is a decay factor representing the reduction of tracking information during the idle interval, $t_{\text{last}}$ is the last time a valid trigger was produced, $\lambda_e$ is a ratio that balances the scales of $I_{\text{track}}(t)$ and $N_e(t)$, and $a^{\text{track}}_t$ is the action taken for ESVO tracking estimation.
TABLE I
PARAMETER SETTINGS OF THE SEAN IMPLEMENTATION

Component   Parameter                  Value
SNN         (LIF, LI, OUT)             (IN, 128, 2)
            Time resolution            0.001 s
            Surrogate gradient         Sigmoid
Training    Batch size                 32
            Replay buffer              100 / 10
            Learning rate              0.2
            Initial exploration rate   0.8
            Exploration rate decay     0.001
D. Implementation of SEAN in Event-based Estimator
THE-SEAN is implemented using the SpikingJelly [19]
framework designed for asynchronous event SNN process-
ing. The detailed parameter settings are shown in Table I.
The network structure is constructed according to Section
IV-B. Our SEAN is configured with an input layer of LIF neurons for event spikes, a hidden layer with 128 LI neurons, and an output layer of 2 neurons for Q-value regression. The time resolution of the network is set to 0.001 seconds, and a sigmoid function is employed for the surrogate gradient during training. For the online self-supervised weight update phase, the network is trained using a batch size of 32. The replay buffer is set to store 100/10 (100 for the low-resolution 346×260 event camera and 10 for the high-resolution 640×480) previous experiences for experience replay. The learning rate is initialized at 0.2, while the initial exploration rate is set to 0.8, decaying by 0.001 at every training time-
step to promote exploitation of learned policies as training
progresses. These parameter choices are designed to optimize
the performance of SEAN while balancing computational
efficiency and learning effectiveness.
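For orientation, a minimal PyTorch-style sketch of a network with the Table I shape (an input LIF layer, 128 LI neurons, and a 2-neuron Q-value readout). This is an illustrative stand-in written directly from Eqs. (5)–(12), not the authors' SpikingJelly implementation; the class name, layer arrangement, and the assumption V_r = 0 are ours.

import torch
import torch.nn as nn

class SEANSketch(nn.Module):
    """Illustrative SEAN-like network: LIF input layer -> 128 LI neurons -> 2 Q-values."""
    def __init__(self, n_in, n_li=128, tau=2.0, v_th=1.0):
        super().__init__()
        self.fc_lif = nn.Linear(n_in, n_in)   # fully connected LIF input layer
        self.fc_li = nn.Linear(n_in, n_li)    # LIF spikes -> LI voltages (Eq. (9))
        self.fc_q = nn.Linear(n_li, 2)        # LI voltages -> (Q_ON, Q_OFF)
        self.tau, self.v_th = tau, v_th

    def forward(self, event_seq):
        """event_seq: (T, n_in) binary event spikes over the decision window."""
        v_lif = torch.zeros(event_seq.shape[1])
        v_li = torch.zeros(self.fc_li.out_features)
        for e_t in event_seq:
            h = v_lif + (1.0 / self.tau) * (-v_lif + self.fc_lif(e_t))  # leaky charge
            s = (h >= self.v_th).float()                                 # spike
            v_lif = h * (1.0 - s)                                        # reset to V_r = 0
            v_li = v_li + (1.0 / self.tau) * (-v_li + self.fc_li(s))     # LI accumulation
        return torch.sigmoid(self.fc_q(v_li))  # trigger if output[0] (ON) >= output[1] (OFF)

net = SEANSketch(n_in=64)
q_on_off = net(torch.randint(0, 2, (10, 64)).float())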
V. EXPERIMENTS
In this section, THE-SEAN is evaluated against some
state-of-the-art event-based visual odometry algorithms.
First, we introduce the experimental setup, evaluation met-
rics, and event-based datasets used for testing. Then, ex-
periments conducted on popular open datasets are presented,
focusing on two key aspects: 1) the overall estimation
accuracy, and 2) the temporal computational efficiency of
THE-SEAN.
A. Experimental Setup
1) Evaluation Metrics: To evaluate the overall estimation
accuracy and the computational efficiency of the estimator,
we develop the following three evaluation metrics:
Absolute Positioning Error (APE): APE is used to evaluate
the overall estimation accuracy of the event-based estimator,
with root mean square (RMS) and standard deviation (STD)
metrics evaluating the absolute positioning accuracy and tra-
jectory smoothness, respectively. APE is given in centimeters
(cm) in this paper.
Tracking Triggering Rate (TTR): We define TTR to assess
the energy consumption of the tracking process of the estimator.
[Fig. 4 graphic: (a) part of Seq. indoor1, trajectories of GT, ESVO, and TS-THE-SEAN; (b) part of Seq. indoor3, trajectories of GT, ESVO2, and AA-THE-SEAN; (c) agent velocity (m/s) versus MTR (Hz) of TS-THE-SEAN on Seq. indoor1.]
Fig. 4. Trajectory comparison and MTR analysis. (a) and (b) illustrate part of the representative estimated trajectories by THE-SEAN and baselines on MVSEC. (c) shows the MTR variation of THE-SEAN corresponding to the agent velocity in sequence indoor1.
TTR is the average triggering rate of the tracking process for the event-based estimator; that is,
$$\mathrm{TTR} = \frac{1}{N}\sum_{t=1}^{N} a^{\text{track}}_t, \qquad (17)$$
where $N$ is the length of the decision chain for the tracking process and $a^{\text{track}}_t$ is the action taken for tracking estimation. This metric quantifies the average number of tracking estimations performed by the event-based estimator.
Mapping Triggering Rate (MTR): Similar to TTR, MTR is defined to assess the energy consumption of the mapping process of the estimator. MTR is the average triggering rate of the mapping process for the event-based estimator, which is given by
$$\mathrm{MTR} = \frac{1}{N}\sum_{t=1}^{N} a^{\text{map}}_t, \qquad (18)$$
where $N$ is the length of the decision chain for the mapping process and $a^{\text{map}}_t$ is the action taken for mapping estimation. This metric quantifies the average number of mapping estimations performed by the event-based estimator.
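Both metrics are simple averages over the logged binary decision chain; a sketch, assuming the tracking and mapping actions are recorded as 0/1 sequences:

def ttr(track_actions):
    """Eq. (17): average triggering rate of the tracking process."""
    return sum(track_actions) / len(track_actions)

def mtr(map_actions):
    """Eq. (18): average triggering rate of the mapping process."""
    return sum(map_actions) / len(map_actions)

# e.g. ttr([1, 0, 1, 1]) == 0.75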
2) Development of Experimental Datasets: To demon-
strate the effectiveness of THE-SEAN, experiments are
conducted on public real-world datasets with various event
resolutions and motion types.
RPG Dataset [20]: RPG is a hand-held stereo event camera dataset. The motion is relatively gentle and focuses on one particular area.
MVSEC Dataset [21]: MVSEC is a stereo event camera dataset collected by drones. The motion is relatively aggressive and the variation of velocity is large.
TABLE II
COMPARISON OF ESTIMATION ACCURACY APE [cm] BETWEEN THE-SEAN AND THE LATEST STEREO EVENT VISUAL ODOMETRY ALGORITHMS.

Dataset  Seq.      ESVO           ES-PTAM         ESVO2 w/o IMU   TS-THE-SEAN     AA-THE-SEAN
                   RMS    STD     RMS     STD     RMS    STD      RMS    STD      RMS    STD
RPG      box       6.1    2.1     4.1     2.1     4.1    1.6      6.1    1.8      3.7    1.3
         monitor   6.7    3.4     2.3     1.5     2.8    1.2      5.8    3.1      1.7    0.8
         bin       4.1    1.4     2.6     0.9     2.5    0.8      3.4    1.1      2.2    0.7
         desk      3.4    1.2     2.8     1.5     2.5    1.0      2.7    0.9      2.2    0.9
MVSEC    indoor1   15.9   5.6     15.0    6.3     9.6    4.9      10.5   4.5      8.6    4.4
         indoor2   16.6   5.3     -       -       14.7   7.1      13.0   5.3      11.5   5.5
         indoor3   10.2   4.9     -       -       9.0    4.8      8.2    3.9      7.4    3.5
DSEC     city04a   139.4  66.9    131.6   72.0    75.8   18.5     109.1  51.4     60.0   16.5
         city04b   42.9   20.9    29.0    13.2    63.7   24.1     42.0   17.7     60.4   21.7
         city04c   798.7  312.9   1184.4  588.8   571.1  241.5    730.4  290.3    549.2  241.1
         city04d   992.7  393.1   1053.9  349.7   615.5  266.7    833.5  347.8    509.1  226.7
         city04e   58.1   24.6    75.90   28.6    58.6   20.7     51.8   25.6     45.1   17.8
∗ “-” indicates that results are not available for the algorithm.
TABLE III
COMPARISON OF TRIGGERING RATE BETWEEN THE-SEAN AND BASELINE STEREO EVENT VISUAL ODOMETRY WITH TS/AA REPRESENTATIONS.

Dataset  Seq.      ESVO                   TS-THE-SEAN                               ESVO2 w/o IMU          AA-THE-SEAN
                   APE    TTR    MTR      APE          TTR          MTR             APE    TTR    MTR      APE           TTR          MTR
RPG      box       6.1    99.5   19.6     6.1(=0%)     80.0(↓20%)   11.6(↓41%)      4.1    99.6   19.8     3.7(↓9.7%)    84.1(↓16%)   13.5(↓32%)
         monitor   6.7    99.7   19.1     5.8(↓13%)    79.2(↓20%)   10.0(↓48%)      2.8    99.7   19.8     1.7(↓39%)     79.7(↓20%)   15.4(↓22%)
         bin       4.1    99.5   19.0     3.4(↓17%)    78.3(↓20%)   8.3(↓56%)       2.5    100    19.7     2.2(↓12%)     79.6(↓20%)   16.3(↓17%)
         desk      3.4    100    19.8     2.7(↓21%)    73.7(↓26%)   19.1(↓3%)       2.5    99.5   19.6     2.2(↓12%)     84.7(↓15%)   15.2(↓23%)
MVSEC    indoor1   15.9   99.9   19.8     10.5(↓34%)   56.5(↓43%)   3.3(↓83%)       9.6    100    19.9     8.6(↓10%)     73.6(↓26%)   8.3(↓59%)
         indoor2   16.6   100    19.9     13.0(↓26%)   65.6(↓35%)   4.3(↓78%)       14.7   100    20       11.5(↓22%)    74.6(↓25%)   8.8(↓56%)
         indoor3   10.2   99.9   19.1     8.2(↓20%)    56.4(↓43%)   4.5(↓76%)       9.0    100    19.9     7.4(↓18%)     75.8(↓24%)   7.9(↓60%)
DSEC     city04a   139.4  94.6   19.9     109.1(↓22%)  82.1(↓13%)   10.0(↓50%)      75.8   100    19.9     60.0(↓21%)    91.1(↓9%)    12.1(↓39%)
         city04b   42.9   94.5   19.6     42.0(↓2%)    75.1(↓21%)   10.6(↓46%)      63.7   94.9   19.8     60.4(↓5%)     81.4(↓14%)   10.7(↓46%)
         city04c   798.7  94.0   19.8     730.4(↓9%)   84.1(↓11%)   12.7(↓36%)      571.1  94.0   19.9     549.2(↓4%)    80.5(↓4%)    12.6(↓37%)
         city04d   992.7  93.9   19.9     833.5(↓16%)  82.3(↓12%)   8.3(↓58%)       615.5  93.9   19.9     509(↓17.3%)   95.7(↑2%)    11.3(↓43%)
         city04e   58.1   97.8   19.5     50.9(↓11%)   87.3(↓12%)   11.0(↓44%)      58.6   97.7   19.8     45.1(↓23%)    94.7(↓3%)    14.9(↓25%)
∗ APE [cm] (RMS) is listed in this table.
DSEC Dataset [22]: DSEC is an autonomous driving
dataset with stereo event cameras. The motion is very fierce,
and the acceleration is large.
3) Compared Algorithms: THE-SEAN is compared against the following stereo-only event-based estimators, each utilizing a different event representation:
•ESVO: A classic real-time stereo-only event-based es-
timator using time surface (TS) event representation
proposed by Zhou et al. in 2021 [1].
•ES-PTAM: A recent multi-camera event-based multi-
view stereo (MC-EMVS) depth estimator designed for
stereo-only event-based odometry introduced by Ghosh
et al. in 2024 [3].
•ESVO2 w/o IMU: The latest stereo event-based esti-
mator with TS event representation for tracking and
adaptive accumulation (AA) for mapping presented by
Niu et al. in 2024 [4]. The original ESVO2 integrates
IMU assistance, but for fair comparison, we modify it
to a stereo-only event camera setup in this study.
Note that the experiments focus on validating the tem-
porally high-order strategies of the event-based estimator.
Our SEAN is implemented with both the ESVO and ESVO2
w/o IMU estimators, referred to as TS-THE-SEAN and AA-
THE-SEAN, to compare overall estimation accuracy and
triggering rate. ES-PTAM is only included for comparison
of overall estimation accuracy (APE) but cannot be directly
compared for temporal computational efficiency (TTR or
MTR) due to its high CPU demands, making it unsuitable
for real-time implementation with SEAN.
B. Experimental Results and Discussions
1) Overall Estimation Accuracy: Table II compares the
estimation accuracy of TS-THE-SEAN and AA-THE-SEAN
with classic and state-of-the-art algorithms, including ESVO,
ES-PTAM, and ESVO2 w/o IMU. On the RPG dataset, AA-
THE-SEAN improves RMS by 18% and STD by 19% for
APE on average. On the MVSEC indoor dataset, it improves
RMS by 17% and STD by 20% on average. Fig. 4 (a) and
(b) show part of the estimated trajectories by THE-SEAN
and baselines. On the DSEC dataset, AA-THE-SEAN
improves RMS by 11% and STD by 8% on average. These
results demonstrate that AA-THE-SEAN achieves superior
accuracy, along with smoother and more stable trajectories
compared to existing methods.
2) Triggering Rate Analysis: Table III compares the
performance of TS-THE-SEAN and AA-THE-SEAN with
ESVO and ESVO2 w/o IMU in terms of TTR and MTR
across various datasets. TS-THE-SEAN improves TTR by
23% and MTR by 51% while AA-THE-SEAN improves
TTR by 16% and MTR by 38% compared to their respective
baselines across all testing sequences on average. These
results emphasize that TS-THE-SEAN and AA-THE-SEAN
not only achieve higher estimation accuracy and stability,
but also reduce the computational cost by triggering updates
more efficiently. Fig. 4 (c) shows that THE-SEAN can
dynamically adjust the estimation trigger decision to adapt
to the agent motion. The improvements in triggering rates
are crucial for low-power systems, ensuring that the systems
can operate with minimal computational overhead while maintaining robust performance in various scenarios.
TABLE IV
ABLATION STUDY OF THE-SEAN ON RPG AND DSEC DATASETS.

Dataset  Mapping SEAN  Tracking SEAN  APE     TTR     MTR
RPG      ×             ×              2.98    99.7    19.7
         ✓             ×              2.55    98.86   15.49
         ×             ✓              2.68    79.63   19.89
         ✓             ✓              2.45    82.03   15.1
DSEC     ×             ×              203.4   96.6    19.9
         ✓             ×              167.7   96.24   12.46
         ×             ✓              186.7   87.16   18.95
         ✓             ✓              168.6   90.74   12.24
TABLE V
NUMBER OF OPERATIONS AND PROPORTIONS FOR EACH MODULE IN AA-THE-SEAN.

Module       ESVO2 w/o IMU                        SEAN    AA-THE-SEAN
             TS      AA      Tracking  Mapping    BP      SUM
OPs          39M     63M     1800M     2600M      69M     4571M
Proportion   0.8%    1.4%    39.4%     56.9%      1.5%    100%
C. Ablation Study
Table IV presents the results of an ablation study evaluat-
ing the contributions of Mapping SEAN and Tracking SEAN
in THE-SEAN across the RPG and DSEC datasets, measur-
ing APE, TTR, and MTR. The results of the ablation study
confirm that both mapping SEAN and tracking SEAN sig-
nificantly enhance THE-SEAN’s performance, with the best
balanced results achieved by combining both components,
which optimize accuracy, efficiency, and computational cost.
D. Computational Cost Analysis
Table V presents the computational cost analysis of the
various modules in AA-THE-SEAN, measured in the number
of operations (OPs) per module. The tracking and mapping
modules of the baseline ESVO2 w/o IMU, crucial for state
estimation, necessitate 1800M (39.4%) and 2600M (56.9%)
OPs, respectively, highlighting their resource-intensive na-
ture. Conversely, SEAN, which utilizes lightweight spiking networks, requires only 69M OPs (1.5%) per backpropagation. Moreover, SEAN reduces tracking triggering times by about 16% and mapping triggering times by 38%. This results in a
substantial reduction in computational cost while maintaining
high estimation performance. The SEAN module’s low OPs,
combined with the inherent SNN low-power nature, makes
it well-suited for real-time, event-based state estimation in
dynamic environments.
VI. CONCLUSIONS
In this paper, we introduce THE-SEAN, a temporally
high-order event-based visual odometry system utilizing self-
supervised spiking event accumulation networks. Inspired by
biological mechanisms regulating heart rate, THE-SEAN dy-
namically adjusts its estimation trigger decision policy based
on changes in motion and the environment. Experimen-
tal results demonstrate that THE-SEAN not only enhances
estimation accuracy and smoothness but also significantly
improves triggering efficiency compared to state-of-the-art
methods. Future work will focus on the neuromorphic hard-
ware implementation of THE-SEAN and its integration with
synchronous sensors, such as IMUs and traditional cameras,
to further optimize the temporally high-order system.
REFERENCES
[1] Y. Zhou, G. Gallego, and S. Shen, “Event-based stereo visual odom-
etry,” IEEE Trans. Robot., vol. 37, no. 5, pp. 1433–1450, 2021.
[2] J. Niu, S. Zhong, and Y. Zhou, “IMU-aided event-based stereo visual odometry,” in 2024 IEEE Int. Conf. Robot. Autom. (ICRA), 2024, pp. 11977–11983.
[3] S. Ghosh, V. Cavinato, and G. Gallego, “ES-PTAM: Event-based
stereo parallel tracking and mapping,” in Eur. Conf. Comput. Vis.
(ECCV) Workshops, 2024.
[4] J. Niu, S. Zhong, X. Lu, S. Shen, G. Gallego, and Y. Zhou, “Esvo2:
Direct visual-inertial odometry with stereo event cameras,” 2025.
[Online]. Available: https://arxiv.org/abs/2410.09374
[5] T. Qin, P. Li, and S. Shen, “Vins-mono: A robust and versatile
monocular visual-inertial state estimator,” IEEE Trans. Robot., vol. 34,
no. 4, pp. 1004–1020, 2018.
[6] T. Qin and S. Shen, “Online temporal calibration for monocular
visual-inertial systems,” in 2018 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS), 2018, pp. 3662–3669.
[7] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. M. Montiel, and J. D. Tardós, “ORB-SLAM3: An accurate open-source library for visual, visual–inertial, and multimap SLAM,” IEEE Trans. Robot., vol. 37, no. 6, pp. 1874–1890, 2021.
[8] Q. Wu, X. Xu, X. Chen, L. Pei, C. Long, J. Deng, G. Liu, S. Yang,
S. Wen, and W. Yu, “360-vio: A robust visual–inertial odometry using
a 360° camera,” IEEE Trans. Ind. Electron., vol. 71, no. 9, pp. 11136–11145, 2024.
[9] C. Xiong, G. Liu, Q. Wu, S. Xia, T. Hua, K. Ma, Z. Sun, Y. Xiang,
and L. Pei, “Ton-vio: Online time offset modeling networks for robust
temporal alignment in high dynamic motion vio,” arXiv preprint
arXiv:2403.12504, 2024.
[10] S. Ghosh and G. Gallego, “Event-based stereo depth estimation: A
survey,” arXiv preprint arXiv:2409.17680, 2024.
[11] K. Xiao, G. Wang, Y. Chen, Y. Xie, H. Li, and S. Li, “Research on
event accumulator settings for event-based slam,” in 2022 6th Inter-
national Conference on Robotics, Control and Automation (ICRCA),
2022, pp. 50–56.
[12] X. Lagorce, G. Orchard, F. Galluppi, B. E. Shi, and R. B. Benosman,
“Hots: A hierarchy of event-based time-surfaces for pattern recog-
nition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 7, pp.
1346–1359, 2017.
[13] M. Liu and T. Delbruck, “Adaptive time-slice block-matching optical
flow algorithm for dynamic vision sensors,” in BMVC, 2018.
[14] Z. Liu, D. Shi, R. Li, Y. Zhang, and S. Yang, “T-esvo: Improved event-
based stereo visual odometry via adaptive time-surface and truncated
signed distance function,” Advanced Intelligent Systems, vol. 5, no. 9,
p. 2300027, 2023.
[15] U. M. Nunes, R. Benosman, and S.-H. Ieng, “Adaptive global decay
process for event cameras,” in Proceedings of the IEEE Conf. Comput.
Vis. Pattern Recognit., 2023, pp. 9771–9780.
[16] H. F. Brown, D. DiFrancesco, and S. J. Noble, “How does adrenaline
accelerate the heart?” Nature, vol. 280, no. 5719, pp. 235–236, Jul
1979. [Online]. Available: https://doi.org/10.1038/280235a0
[17] D. Chen, P. Peng, T. Huang, and Y. Tian, “Deep reinforcement learning
with spiking q-learning,” arXiv preprint arXiv:2201.09754, 2022.
[18] H. Hirschmuller, “Stereo processing by semiglobal matching and
mutual information,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30,
no. 2, pp. 328–341, 2008.
[19] W. Fang, Y. Chen, J. Ding, Z. Yu, T. Masquelier, D. Chen, L. Huang,
H. Zhou, G. Li, and Y. Tian, “Spikingjelly: An open-source machine
learning infrastructure platform for spike-based intelligence,” Science
Advances, vol. 9, no. 40, p. eadi1480, 2023.
[20] Y. Zhou, G. Gallego, H. Rebecq, L. Kneip, H. Li, and D. Scaramuzza,
“Semi-dense 3d reconstruction with a stereo event camera,” in Pro-
ceedings of the Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 235–251.
[21] A. Z. Zhu, D. Thakur, T. Özaslan, B. Pfrommer, V. Kumar, and
K. Daniilidis, “The multivehicle stereo event camera dataset: An event
camera dataset for 3d perception,” IEEE Rob. Autom. Lett., vol. 3,
no. 3, pp. 2032–2039, 2018.
[22] M. Gehrig, W. Aarents, D. Gehrig, and D. Scaramuzza, “Dsec: A
stereo event camera dataset for driving scenarios,” IEEE Rob. Autom.
Lett., vol. 6, no. 3, pp. 4947–4954, 2021.