Neuromorphic Vision Data Coding: Classifying
and Reviewing the Literature
Catarina Brites1, Member, IEEE, and João Ascenso2, Senior Member, IEEE
1Instituto de Telecomunicações, Instituto Universitário de Lisboa (ISCTE-IUL), 1649-026 Lisbon, Portugal
2Instituto de Telecomunicações, Instituto Superior Técnico, University of Lisbon, 1049-001 Lisbon, Portugal
Corresponding author: Catarina Brites (e-mail: catarina.brites@lx.it.pt).
This work was supported by the RayShaper SA, Valais, Switzerland, through the Project entitled Event Aware Sensor Compression, and by the Fundação
para a Ciência e a Tecnologia (FCT), Portugal, through the Project entitled Deep Compression: Emerging Paradigm for Image Coding under Grant PTDC/EEI-
COM/7775/2020.
ABSTRACT In recent years, visual sensors have been quickly improving towards mimicking the visual
information acquisition process of the human brain by responding to illumination changes as they occur in time
rather than at fixed time intervals. In this context, the so-called neuromorphic vision sensors depart from the
conventional frame-based image sensors by adopting a paradigm shift in the way visual information is
acquired. This new way of visual information acquisition enables faster and asynchronous per-pixel
responses/recordings driven by the scene dynamics with a very high dynamic range and low power
consumption. However, the huge amount of data outputted by the emerging neuromorphic vision sensors
critically demands highly efficient coding solutions so that applications can take full advantage of these
new, attractive sensors’ capabilities. For this reason, considerable research efforts have been invested in
recent years towards developing increasingly efficient neuromorphic vision data coding (NVDC) solutions.
In this context, the main objective of this paper is to provide a comprehensive overview of NVDC solutions
in the literature, guided by a novel classification taxonomy, which allows better organizing this emerging
field. In this way, more solid conclusions can be drawn about the current NVDC status quo, thus helping to
better drive future research and standardization developments in this emerging technical area.
INDEX TERMS Dynamic vision sensor, Event camera, Neuromorphic vision data coding, Spike camera,
Taxonomy
I. INTRODUCTION
Neuromorphic vision sensors are bio-inspired sensors that
try to mimic the sensing behavior of a biological retina.
These emerging sensors pose a paradigm shift in the visual
information acquisition (sensing) model, the so-called
frameless paradigm, where visual information is no longer
acquired as a 2D matrix, i.e., a frame. As is well known, in
the conventional frame-based paradigm, all sensor pixels
acquire visual information simultaneously, independently
of the scene dynamics, at regular time intervals (i.e.,
constant framerate). However, in the emerging frameless
paradigm, each sensor pixel is in charge of controlling its
own visual information acquisition process, in an
asynchronous and independent way, according to the scene
dynamics, thus producing a variable data rate output; scene
dynamics refers to changes in lighting/illumination
conditions and/or motion in the scene and/or sensor/camera
motion.
By incorporating intelligent pixels, frameless-based
(i.e., neuromorphic) vision sensors provide interesting
advantages over conventional frame-based image sensors,
such as high temporal resolution (smaller time interval at
which a sensor pixel can react/respond to the scene
dynamics), very high dynamic range, low latency, and low
power consumption. These are rather compelling properties
notably in scenarios that are particularly challenging to
conventional frame-based image sensors, such as the ones
involving visual scenes with high-speed motion and/or
uncontrolled lighting conditions, where this type of
(frame-based) image sensor usually fails to provide good
performance; autonomous driving, drones and robotics are
just a few examples of increasingly relevant applications in
humans’ daily lives where those challenging scenarios occur
and, thus, can benefit from the neuromorphic vision sensors
usage. Moreover, neuromorphic vision sensors may also
have application in industrial automation, visual surveillance,
augmented reality, and mobile environments, where fast
response, high-dynamic range or low-power consumption is
critically needed. Due to their potential and availability,
neuromorphic vision sensors, also known in the literature
as event-based sensors, dynamic vision sensors, silicon
retinas, spike cameras, asynchronous image sensors or
frameless imaging sensors, are nowadays attracting a great
deal of attention from the research community in both
academia and industry.
Currently, there are two main types of neuromorphic
vision cameras, the so-called event cameras and spike
cameras, which, roughly speaking, differ mainly in two
aspects: 1) the way visual information is acquired, i.e.,
sampled, and consequently on the output data produced
(event data versus spike data); and 2) the visual
information they represent (moving areas only versus both
static and moving areas of the visual scene acquired). The
event cameras follow a differential sampling model in
which time-domain changes in the incoming light intensity,
i.e., temporal contrast, are pixel-wise detected and
compared to a threshold, triggering a so-called event if it
exceeds the threshold. On the other hand, the spike cameras
follow an integral sampling model in which time-domain
accumulation of the incoming light intensity is carried out
pixel-wise and compared to a threshold, firing a so-called
spike if the threshold is exceeded. In this context, events are
triggered by event sensor pixels ‘observing’ the scene’s
moving objects only, while spikes are fired by spike sensor
pixels ‘observing’ both static and moving objects of the
scene to be acquired. While the dynamic vision sensor
(DVS) [1][2][3] was the first event camera to be made
commercially available, among several nowadays available
[4], the so-called Vidar camera [5][6] is the only spike
camera currently reported in the literature and, to the best
of the authors’ knowledge, is not commercially available.
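To make the two sampling models concrete, the following minimal sketch contrasts the per-pixel triggering logic of the differential (event) and integral (spike) sampling models; the threshold values, the function names and the use of a log-intensity signal for the event pixel are illustrative assumptions of this sketch and are not tied to any particular sensor.

```python
def event_pixel(log_intensities, times, contrast_threshold=0.2):
    """Differential sampling (event camera), single pixel: emit an event
    whenever the log-intensity deviates from the last reference level by
    more than the contrast threshold. Returns (time, polarity) pairs."""
    events, reference = [], log_intensities[0]
    for t, value in zip(times, log_intensities):
        delta = value - reference
        if abs(delta) >= contrast_threshold:
            events.append((t, 1 if delta > 0 else -1))
            reference = value  # new reference level after the event
    return events


def spike_pixel(intensities, times, accumulation_threshold=5.0):
    """Integral sampling (spike camera), single pixel: accumulate the
    incoming intensity and fire a spike whenever the accumulator exceeds
    the threshold. Returns the list of spike firing times."""
    spikes, accumulator = [], 0.0
    for t, value in zip(times, intensities):
        accumulator += value
        if accumulator >= accumulation_threshold:
            spikes.append(t)
            accumulator -= accumulation_threshold  # keep the residual charge
    return spikes
```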
Differently from the absolute intensity value
(simultaneously) outputted by every pixel in a conventional
frame-based image sensor (grayscale/color value resulting
from incoming light intensity accumulation over a specific
exposure time), an event is typically represented by a 4-tuple (x, y, t, p). This ordinary representation of an event contains the location (x, y) within the (event-based) sensor (pixel coordinates or addresses) where the event occurred, the timestamp t, i.e., the time at which the event was triggered, and the polarity p of the event, indicating whether a light intensity increase (p = 1 or ON event) or decrease (p = −1 or OFF event) occurred; thus, (x, y), t, and p
constitute the event location information, the event time
information, and the event polarity information,
respectively. This 4-tuple event representation is
sometimes referred to in the literature as address event
representation (AER), as this is the data representation used
by the AER communication protocol to asynchronously
transmit information (such as events) from the sensor to the
event camera output or between asynchronous chips. As far
as the spike is concerned, it is typically associated with an ‘ON’ or ‘OFF’ value, corresponding to a spike fired (‘1’) or not (‘0’). Although a (spike-based) sensor pixel can fire
spikes asynchronously and continuously (at an arbitrary
time), the spike camera currently reported in the literature
(the Vidar camera) only reads out the spikes fired, also
known as spike firing states, periodically with a fixed time
interval T (in the order of μs) and does that for every sensor
pixel. This means that, at each sampling time t = kT, the spike firing state (‘0’ or ‘1’) of every pixel is read, forming a (binary) spike frame with the height and width of the sensor pixel array; naturally, as time passes, spike frames are formed one after the other, creating a 3D array of binary spike frames.
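The representations just described can be summarized with the short sketch below; the tuple field names, the dictionary-based pixel indexing and the readout loop are illustrative assumptions, giving only a simplified view of the periodic spike-frame formation with interval T.

```python
from collections import namedtuple
import numpy as np

# Ordinary event representation: location (x, y), timestamp t, polarity p.
Event = namedtuple("Event", ["x", "y", "t", "p"])

def build_spike_frames(firing_times, height, width, T, n_frames):
    """Form a 3D array of binary spike frames: a pixel is set to 1 in frame
    k if it fired at least once in the k-th readout interval [kT, (k+1)T).
    `firing_times` maps each pixel (x, y) to its list of firing times."""
    frames = np.zeros((n_frames, height, width), dtype=np.uint8)
    for (x, y), times in firing_times.items():
        for t in times:
            k = int(t // T)            # the frame index embeds the time kT
            if 0 <= k < n_frames:
                frames[k, y, x] = 1
    return frames
```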
In the related literature, the event camera pixel output is
most of the times called event sequence or event stream
while the spike camera pixel output is typically known as
spike train. The asynchronous sensor event
sequence/sensor spike train, resulting from the (pixel)
event sequences/spike trains of all sensor pixels, represent,
therefore, the visual information ‘observed’ by the
emerging event/spike sensors. It is worth noting that the
sensor event sequence/sensor spike train can be further
processed elsewhere in the camera or by a vision
application for several purposes/tasks, including a more
efficient representation (through coding). In this paper,
neuromorphic vision data (NVD) is the generic
terminology used to refer to both event and spike data.
Despite the significant differences between
neuromorphic vision data and conventional frame-based
images, particularly in terms of the type and spatio-
temporal characteristics of the visual information they
represent, numerous algorithms have been recently
developed to reconstruct images and videos from
neuromorphic vision data, e.g. [7][8]; this is largely
because it allows existing image and video processing
applications, typically designed to take image-based inputs,
to use neuromorphic vision data. Either way, neuromorphic
vision data are already successfully used in different tasks,
including object tracking [9][10], object recognition
[11][12], high-speed motion estimation [12][14], HDR
image reconstruction [15][16], simultaneous localization
and mapping (SLAM) [17][18], among others.
The high temporal resolution neuromorphic vision
sensors can achieve (typically in the order of µs), which is
equivalent to a high framerate (in the order of MHz), combined
with the scene dynamics complexity, usually leads to a huge
amount of raw data being produced by the neuromorphic
vision sensors. The output data of neuromorphic vision
sensors following a differential sampling model (i.e., event
cameras) are typically sparse in the spatial dimension, as
only pixels corresponding to moving/illumination changing
areas trigger events, while exhibiting high temporal
correlation. The output data of neuromorphic vision sensors
following an integral sampling model (i.e., spike cameras)
exhibit also high temporal correlation but are typically
denser than the event camera output data in the spatial
dimension, as pixels corresponding to both static and
moving areas may fire spikes. Independently of the
sampling model, highly efficient neuromorphic visual data
coding solutions are very much needed, especially
considering the limited transmission and storage resources
of the main applications of neuromorphic (particularly
event-based) vision, such as embedded systems.
In the last few years, several (almost twenty)
neuromorphic visual data coding (NVDC) solutions have
been proposed, mainly driven by the growing availability
of this type of cameras and the advantages they offer to
scenarios where the conventional frame-based image
cameras struggle to perform well; it is worth noting that,
although some neuromorphic vision cameras may carry an
inertial measurement unit (IMU), no coding solution for
this type of data is reviewed in this paper, as none has been
found in the relevant NVDC literature. Nevertheless,
a comparison of their performances is still a rather difficult
task since the reported performance results have been
obtained most of the times under different test conditions
(even the evaluation methodologies are sometimes
different) and there is no public software available to obtain
comparable results. Acknowledging the practical
importance of developing efficient NVDC solutions, JPEG
has recently launched an exploration activity on event-
based vision, denominated JPEG XE [19]. The main goal
of JPEG XE is to create and develop a standard to
represent such events in an efficient way allowing
interoperability between sensing, storage, and processing,
targeting machine vision applications [19]. To achieve
this goal, JPEG XE is currently focused on defining the use
cases and requirements for potential standardization of the
coding of events [20].
In this context, this paper first proposes a meaningful
classification taxonomy for NVDC solutions that makes it possible to
identify and abstract their differences, commonalities, and
relationships. Guided by this classification taxonomy, a
large set of relevant NVDC solutions available in the
literature are then reviewed; for a comprehensive overview
of the NVDC field, both event-based and spike-based
NVDC solutions will be considered. This paper does not
purposely include any performance evaluation based on
experimental results since its target is conceptual and
algorithmic; in fact, it may still be early to derive final
quantitative conclusions on the best NVDC coding
approaches, considering the lack of technical maturity of
most of these coding solutions, still requiring additional
research. Therefore, the main objective of this paper is not
to propose a novel NVDC solution, nor to provide a
comparative performance evaluation of NVDC solutions,
but rather to organize and classify a technical area that has
received many contributions in recent years. This type of
paper is essential to gather a systematic, high-level, and
more abstract view of the field to further launch solid and
consistent advancements in this emerging technical area.
With this purpose in mind, the rest of this paper is
organized as follows: Section II will propose a
classification taxonomy for the many NVDC solutions
available in the literature, while Section III will provide an
exhaustive review of the NVDC solutions available in the
literature driven by the proposed taxonomy. Section IV will
present an overview of the datasets used in the available
NVDC literature, while Section V will present an overview
of the performance evaluation metrics and relevant anchors
used for benchmarking also in the available NVDC
literature. Section VI will present some final remarks and,
finally, Section VII will present some relevant challenges
associated with NVDC, that emerged from the exhaustive
literature reviewing.
II. NVDC: PROPOSING A CLASSIFICATION TAXONOMY
The increasing popularity of neuromorphic vision sensors and,
consequently, the increasing availability of neuromorphic
vision content, has motivated the development of many coding
solutions. Since multiple technical approaches have been
adopted for the NVDC solutions available in the literature, it
is essential to identify their main commonalities, differences,
and relationships, thus providing a better understanding of the
full neuromorphic vision data coding landscape and promising
future research and standardization directions. In this context,
this section first proposes a meaningful classification
taxonomy and will afterwards exercise it by referencing and
classifying the many NVDC solutions found in the literature
according to the proposed taxonomy; Section III will then go
one step further by reviewing those solutions with a level of
detail that allows the involved key concepts and designs to be
understood, although naturally not as detailed as the
corresponding referenced papers. In the following sub-
sections, the classification dimensions for the proposed taxonomy will be presented first. Afterwards, the classes for each taxonomy classification dimension will be proposed. The classification dimensions and the classes within each dimension have been defined based on the exhaustive reviewing of (almost) twenty NVDC solutions available in the literature, so that a robust taxonomy could be defined [21]-[39]; this list of references is also a useful contribution of this paper.
A. TAXONOMY CLASSIFICATION DIMENSIONS
This sub-section presents and defines the classification
dimensions for the taxonomy proposed for NVDC solutions.
After an exhaustive study of the NVDC solutions available in
the literature [21]-[39], it was concluded that the most
appropriate taxonomy classification dimensions are:
1) Raw Data Type: Refers to the type of (raw)
elementary information that is asynchronously
outputted by each sensor pixel, e.g., an event or spike,
and that is targeted by coding; the raw data type is
intimately linked to the neuromorphic vision sensor
type and its visual data acquisition model.
2) Fidelity: Refers to the fidelity with which the data are
coded. Depending on the application domain, the data
to be coded, hereafter referred to as raw input data,
may correspond to the sensor output data or to the
output data of an (optional) pre-processing module,
placed in between the sensor and the (NVD) encoder.
The pre-processing module may involve data filtering
or data sampling, and might be used, for instance, to
remove noise data from the sensor output data
sequence or to control the amount of sensor output
data to be effectively coded (i.e., fed to the NVD
encoder). Since this (optional) pre-processing step
takes place out of the encoder module, it is not
directly related to the coding process and, thus, it is
out of the scope of the proposed taxonomy (i.e., no
taxonomy dimension is associated to it).
3) Data Structure: Refers to the way the raw input data
sequence, i.e., sequence of elements of a specific raw
data type at the NVD encoder input, is
arranged/transformed to be then coded while
exploiting the available spatial and temporal
redundancies; depending on the adopted fidelity, this
may involve data partitioning, temporal aggregation,
data sampling, and data conversion.
4) Basic Coding Unit: Refers to the basic processing
entity in which the structured data are further divided
for coding purposes.
5) Components: Refers to the specific constituent, i.e.,
basic, elements of the raw data type in the structured
data to be directly coded; depending on the raw data
type, the components may be the basic elements of the
ordinary raw data type representation (introduced in
Section I), e.g., the polarity, or of an alternative
representation with basic elements obtained from the
ordinary raw data type representation during data
structuring, e.g., the time interval.
6) Components Coding Approach: Refers to the way in
which the components of the structured data are
coded to reach a more compact neuromorphic visual
data coded representation, e.g., independently or
jointly.
7) Prediction: Refers to the way the component-wise
spatial and temporal correlations in the basic
processing entity of the structured data are exploited
to create a lower energy signal, the so-called residue.
8) Transform: Refers to the way the remaining
correlation in the component-wise residue signal is
exploited to reach a more compact energy
representation, usually in some type of frequency
domain. A key issue related to the Transform
dimension is that the impact of transform on the
performance of NVDC solutions is a poorly
understood domain (only 2 out of the 19 coding solutions
reviewed use a transform); this might be a possible
area of future research investment.
9) Quantization: Refers to the way the component-wise
residue signal or the transform coefficients are mapped
to a reduced set of values, producing the so-called
quantized signal at the cost of some precision loss. A
key issue related to the Quantization dimension is that
the impact of quantization on the performance of
NVDC solutions is also a poorly understood domain
(only 4 out of 19 coding solutions reviewed use
quantization); this might be also a possible area of
future research investment, notably if lossy coding is
targeted, as quantization induces some loss or
distortion in the decoded data.
Using these dimensions, each NVDC solution may be
characterized by a taxonomy classification path connecting a
set of classes along these dimensions, thus allowing the identification of
commonalities through the overlapping of the corresponding
classification paths.
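As a small illustration, a classification path can be thought of as one class selected per dimension, so that two solutions can be compared simply by counting the dimensions on which their paths overlap; the codecs characterized below are hypothetical, and the class names anticipate the definitions given in Section II.B.

```python
# Hypothetical event-based codec described as a taxonomy classification path:
# one class per dimension (multi-valued entries list all classes involved).
solution_a = {
    "Raw Data Type": "Event",
    "Fidelity": "Lossless",
    "Data Structure": "Cuboid Grid",
    "Basic Coding Unit": "Polyhedron",
    "Components": ("Location", "Timestamp", "Polarity"),
    "Components Coding Approach": "Independent",
    "Prediction": "Intra",
    "Transform": "None",
    "Quantization": "None",
}

# A second hypothetical codec differing only in two dimensions.
solution_b = dict(solution_a, **{"Fidelity": "Lossy", "Quantization": "Uniform"})

# Overlap of the two classification paths (dimensions with the same class).
overlap = [dim for dim in solution_a if solution_a[dim] == solution_b[dim]]
```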
B. CLASSES FOR EACH CLASSIFICATION
DIMENSION
Using the proposed taxonomy classification dimensions, it is
now necessary to define the classes within each classification
dimension, naturally based on the NVDC solutions already
available in the literature. After exhaustive analysis, the
following classes are proposed for each dimension:
1) Raw Data Type In terms of raw data type, the
following classes are proposed:
a) Event: Asynchronous and independent response of
a vision sensor pixel to a detected time-domain
(incoming) light intensity relative change (i.e.,
above a preset threshold), so-called temporal
contrast, at a precise time instant. It is typically
represented by the 4-tuple (x, y, t, p) containing the location (x, y) within the sensor (pixel coordinates) where the event was triggered, the timestamp t at which the event was triggered, and the polarity p of the event indicating whether a light intensity increase (p = 1) or decrease (p = −1) occurred.
b) Spike: Asynchronous and independent response of
a vision sensor pixel to an accumulated time-
domain (incoming) light intensity exceeding a
preset threshold at a precise time instant. It is
typically represented by the 3-tuple (x, y, f)
containing the location (x, y) within the sensor
(pixel coordinates) where the spike was fired, and
the spike firing state f indicating whether a spike has
been fired (f=1) or not (f=0). For the spike camera
currently reported in the literature (the Vidar
camera), the spike firing state of every pixel is
read periodically, with a fixed time interval T,
forming a (binary) spike frame; as the time passes
by, spike frames are formed one after the other and the temporal information associated with the spike occurrences within each spike frame is inherently embedded in the respective spike frame index k (corresponding to readout time kT). For this reason, no time information is
explicitly included in the spike (3-tuple)
representation.
2) Fidelity In terms of fidelity, the following classes are
proposed:
a) Lossless: Codecs keeping the original (i.e., raw
input) data fidelity, meaning that the decoded and
original data are mathematically equal (up to a
certain precision, if required).
b) Lossy: Codecs not keeping the original (i.e., raw
input) data fidelity, typically to increase the
compression factor; high fidelity may still be
achieved with the appropriate coding parameters
configuration.
3) Data Structure In terms of data structure (DS), the
following classes are proposed:
a) 3D Point Set: The raw input data sequence,
corresponding to a sequence of elements of a
specific raw data type at the NVD encoder input,
are arranged as a set of points in the 3-dimensional
(3D) space with the temporal dimension playing the
role of a geometric dimension, notably the Z
coordinate axis; thus, each 3D point is defined with
Cartesian coordinates (x, y, t) and possibly an attribute, e.g., the polarity p (a minimal structuring sketch for this class and the 3D Array of Frames class is given at the end of this sub-section). This data structure is
typically spatially (x, y) sparse and preserves the
original information (i.e., number of elements and
values) of all raw data type components, without
involving any data temporal aggregation or
sampling. The 3D Point Set data structure is
suitable to be coded with standard-based point
cloud geometry coding solutions, e.g., G-PCC [41],
or with coding schemes involving point-based
geometry processing.
b) Cuboid Grid: The raw input data sequence,
corresponding to a sequence of elements of a
specific raw data type at the NVD encoder input,
are arranged in a space-time grid of cuboids (or
cubes if all its 3 dimensions are equal); each cuboid
represents a local spatio-temporal neighborhood of
elements of a specific raw data type. While the grid
resolution in the spatial dimensions (X and Y) is
typically regular, meaning that the cuboids’ length
on those dimensions is the same over the entire
sensor resolution, in the temporal dimension the
grid resolution may be regular or irregular,
depending on the criterion used to split the sensor
data on that dimension; for example, the temporal
grid resolution can be determined from external
information, e.g., the time interval between RGB
images in a DAVIS-like camera, or by task-related
motion requirements. This data structure tends to be
spatially denser than the 3D Point Set data structure,
as the distribution of the elements (events/spikes) in
the spatial dimensions tends to concentrate into
small areas (such as a cuboid spatial area)
corresponding, for instance, to objects movement;
naturally, the elements distribution varies with the
scene dynamics. The Cuboid Grid data structure
also preserves the original information (i.e., number
of elements and values) of all raw data type
components, without involving any data temporal
aggregation or sampling. This data structure is
suitable to be coded with solutions based on spatio-
temporal volumes such as cuboids; this type of data
structure allows exploiting spatial correlation
between spatially neighboring (i.e., in the X and Y
dimensions) cuboids and the temporal correlation
within a cuboid and between co-located (in the
spatial dimensions) temporally adjacent cuboids.
c) 1D Array of Elements: The raw input data
sequence, corresponding to a sequence of elements
of a specific raw data type at the NVD encoder
input, are arranged as a 1-dimensional (1D) array,
i.e., as a single sequence, of elements that may or
may not be ordered in some component(s), e.g.
elements may be ordered by the triggering/firing
time. This data structure is dense, in the sense that
each array position refers to a triggering/firing
output, and preserves the original information (i.e.,
number of elements and values) of all raw data type
components without involving any data temporal
aggregation or sampling. The 1D Array of Elements
data structure is suitable to be coded with element-
wise coding solutions or coding schemes that
exploit components correlation through prediction
or entropy coding strategies within small portions
of the 1D array, i.e. chunks, with fixed or arbitrary
size.
d) 3D Array of Frames: The raw input data sequence,
corresponding to a sequence of elements of a
specific raw data type at the NVD encoder input,
are converted into a 3-dimensional array of frames,
e.g., by pixel-wise polarity accumulation or pixel-
wise counting the elements of a specific raw data
type (e.g., events) over a given time interval or by
simply sampling elements over time; a frame is
basically a 2D array, with the sensor spatial
resolution, whose entry (pixel) values are typically
obtained from event temporal aggregation at each
pixel location (e.g., a pixel may correspond to an
accumulated polarity value or a histogram count).
The 3D Array of Frames data structure is typically
denser (than the 3D Point Set data structure) and
usually does not preserve the original information
(i.e., number of elements and values) of some raw
data type components, as it typically involves data
temporal aggregation and polarity accumulation.
This structure is suitable to be coded with standard
video coding solutions, e.g., HEVC, or with coding
solutions inspired on the intra and inter coding
modes adopted in the standard image/video coding
solutions or even with lookup table (LUT)-based
coding schemes; this type of data structure allows
exploiting spatial correlation (within a frame)
and/or correlation between temporally adjacent
frames in the 3D array.
4) Basic Coding Unit In terms of basic coding unit
(BCU), the following classes are proposed:
a) Single Element: Basic processing entity of
structured data corresponding to an individual
element of a given raw data type. This basic coding
unit is characteristic of element-by-element
processing methods and allows spatio-temporal
correlation exploitation, although it might not be
that efficient due to the small support region for
correlation exploitation (1 element only).
b) Chunk: Basic processing entity of structured data
corresponding to a group of NC elements of a given
raw data type, i.e., a chunk. The chunk size NC may
be fixed or adjusted dynamically using, for
instance, a criterion associated to the triggering time
instant. This basic coding unit may potentiate high
spatio-temporal correlation exploitation depending
on the criterion used to define the elements
belonging to a chunk (and the chunk size), which
impacts on the similarities between neighboring
elements within and between chunks.
c) Polyhedron: Basic processing entity of structured
data corresponding to a 3D shape with flat
polygonal faces, straight edges, and sharp vertices,
containing elements of a given raw data. A cuboid,
i.e., a NX×NY×NT (3D) volume of elements, is a
particular case of this basic coding unit, which
corresponds to a polyhedron with six quadrilateral
(flat) faces; NX and NY stand for the volume length
in the spatial dimensions X and Y, respectively, and
NT stands for the volume length in the temporal
dimension. In a cuboid, NX and NY are usually set to
the same value, which in theory can vary between 1
(corresponding to a sensor pixel) and the sensor
height and width, respectively, while NT is typically
adjusted dynamically using a criterion associated to
possible (task-related) motion requirements.
Depending on the scene dynamics, other 3D shape
polyhedrons (than cuboids) may allow a finer
adaptation to the motion characteristics, by
aggregating in it neighboring elements (of a given
raw data) with similar motion characteristics; in this
case, a motion plane-based representation of the
polyhedron, characterized by a more or less
complex set of parameters (e.g., represented as a set
of the 3D shape’s vertices and edges), can be for
instance adopted. This basic coding unit is suitable
for a fine adaptation to the motion characteristics
and allows exploring the spatial or spatio-temporal
correlation between spatially neighboring
polyhedrons and the temporal correlation within or
between co-located (in the spatial dimensions)
temporally adjacent 3D polyhedrons.
d) Group of Frames: Basic processing entity of
structured data corresponding to a set, i.e., a group,
of NI frames, typically contiguous in time, where a
frame typically corresponds to a 2D array (NX×NY)
of values of a specific component (e.g., polarity),
accumulated over a certain time interval or
sampled over the temporal dimension; NX and NY
correspond to the frame height and width, which
typically are equal to the sensor height and width,
respectively; a single frame is a particular case of
this basic coding unit, which corresponds to NI
equal to one. In case NI =1, this basic coding unit
may allow exploiting spatial correlation (i.e., within
a frame) with standard-based image/video coding
solutions, e.g. HEVC Intra. When NI >1, this basic
coding unit may potentiate the spatio-temporal
correlation exploitation (within and between frames
of a group of frames), facilitating the adoption of
coding schemes according to the components to be
coded, e.g., different coding schemes for location
and polarity.
5) Components In terms of components, the following
classes are proposed:
a) Location: Information to be coded includes
location data, i.e., 2D coordinates (x, y) identifying
the position in the sensor where an asynchronous
sensor output of a specific raw data type was
triggered; this information is typically present in the
representation of both event and spike raw data
types.
b) Timestamp: Information to be coded includes time
data, i.e., temporal information identifying when an
asynchronous sensor output of a specific raw data
type was triggered at a given pixel location; this
information is present in the event raw data type
representation only.
c) Polarity: Information to be coded includes polarity
data, i.e., 1-bit code indicating the variation sign of
the temporal contrast (corresponding to a light
intensity increase or decrease) associated to an
asynchronous sensor output of a specific raw data
type triggered at a given pixel location; this
information is present in the event raw data type
representation only.
d) Time Interval: Information to be coded includes
time interval data, i.e., information indicating the
time interval between two consecutive occurrences
of asynchronous outputs of the sensor triggered at a
given pixel location; this information is present in
the spike and event raw data type representation.
Contrary to the previous component classes, the
Time Interval class corresponds to information that
is not directly recorded by the sensor; the time
interval information is converted from the sequence
of spike firing states (‘1’/’0’ values) or timestamps
outputted by each sensor pixel along time. It is
worth noting that, for the spike raw data type, the
time interval data, known as inter-spike interval
(ISI) data, is typically characterized by higher
spatio-temporal correlation than the (recorded)
spike firing state data, which makes them preferable
for coding purposes [23]; this is, in fact, the reason
why (all) the spike-based NVDC solutions available
in the literature code the time interval data (i.e., ISI
data) instead of the data recorded by the sensor
(spike firing states).
6) Components Coding Approach In terms of
components coding approach, the following classes are
proposed:
a) Independent: The components data, associated to a
specific raw data type representation, are coded
separately, meaning that each component is
independently encoded/decoded from the
remaining one(s); this may involve using different
coding strategies to code each component, possibly
with knowledge of some coding information of
another component, e.g., its elements’ coding order.
Depending on the adopted data structure,
component embedding in the data structure itself or
on the BCU may be involved; this means that the
embedded components are not directly coded.
b) Joint: The components data, associated to a specific
raw data type representation, are coded jointly,
meaning that all the components are
encoded/decoded jointly; this typically involves
using a single coding strategy to code all the
components at once. Depending on the adopted data
structure, component embedding in the data
structure itself or on the BCU may be involved; this
means that the embedded components are not
directly coded.
7) Prediction In terms of prediction, the following
classes are proposed:
a) None: No prediction is applied at all.
b) Intra: The basic processing unit of the structured
data is coded while exploiting the correlation
among its elements only; this is called Intra
prediction since the correlation is exploited within
a single basic processing unit. This prediction class
may involve exploitation of temporal correlation
only or both spatial and temporal correlation,
typically depending on whether the basic
processing unit spans over temporal dimension only
(corresponding to a sequence of elements of a given
pixel) or over both spatial and temporal dimensions
(corresponding to sequences of elements of pixels
in a spatial neighborhood). It may also involve the
definition of different Intra coding modes, which
are adaptively selected for different parts of the
content depending on the spatial distribution of the
elements.
c) Inter: The basic processing unit of the structured
data is coded while exploiting the correlation
between basic processing units in some spatio-
temporal neighborhood, considering motion; this is
called Inter prediction as, taking motion into
account, previous encoded basic processing units
may be used as reference to the current basic
processing unit coding. This prediction class may
involve exploitation of temporal correlation only
(in this case between spatially co-located
neighboring basic processing units) or both spatial
and temporal correlation depending on whether
motion in a spatio-temporal neighborhood is
considered or not. It may also involve the definition
of different Inter coding modes, which are
adaptively selected for different parts of the content
depending on the spatio-temporal distribution of the
elements.
d) Hybrid: The basic processing unit of the structured
data is coded while exploiting the correlation
among its elements and between elements
belonging to basic processing units in some spatio-
temporal neighborhood; this is called hybrid
prediction and may involve the definition of (Intra
and Inter) coding modes, which are adaptively
selected for different parts of the content.
8) Transform In terms of transform, the following
classes are proposed:
a) None: No transform is applied at all.
b) Block-based: A transform is applied to some
appropriate signal or residual signal, structured as a
regular block; this includes for example the discrete
cosine transform (DCT). The transform may be
fixed or hand-crafted (e.g., DCT), adaptive (e.g.,
Karhunen–Loève Transform, KLT), or learned
(e.g., deep learning-based).
9) Quantization In terms of quantization, the following
classes are proposed:
a) None: No quantization is applied at all.
b) Uniform: A quantization is applied to some
appropriate signal or residual signal where the
quantization levels are uniformly spaced.
c) Non-Uniform: A quantization is applied to some
appropriate signal or residual signal where the
quantization levels are unequally spaced.
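As a minimal illustration of the data structuring step for event data (referred to in the Data Structure classes above), the sketch below converts a list of (x, y, t, p) events into a 3D Point Set and into a 3D Array of Frames obtained by pixel-wise polarity accumulation; the fixed-interval temporal aggregation and all parameter names are assumptions of this sketch rather than the choices of any specific NVDC solution.

```python
import numpy as np

def to_3d_point_set(events):
    """3D Point Set: each event (x, y, t, p) becomes a 3D point (x, y, t),
    with the polarity p kept as a per-point attribute."""
    points = np.array([(x, y, t) for x, y, t, _ in events], dtype=np.float64)
    polarities = np.array([p for _, _, _, p in events], dtype=np.int8)
    return points, polarities

def to_frame_array(events, height, width, delta_t, n_frames):
    """3D Array of Frames: pixel-wise polarity accumulation over fixed time
    intervals of length delta_t; this temporal aggregation is lossy, as the
    individual timestamps and polarities are no longer recoverable."""
    frames = np.zeros((n_frames, height, width), dtype=np.int32)
    for x, y, t, p in events:
        k = int(t // delta_t)
        if 0 <= k < n_frames:
            frames[k, y, x] += p
    return frames
```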
An overview of the proposed classification taxonomy is
shown in FIGURE 1; note that the arrows simply intend to
highlight example connection paths between classes along the
nine dimensions. Because the Raw Data Type dimension is a
strong dividing factor, the next section will provide a
comprehensive review of the NVDC solutions available in the
literature guided by the proposed taxonomy’s Raw Data Type
dimension, to better understand the involved key concepts and
designs in NVDC, both within and between classes of the Raw
Data Type dimension.
FIGURE 1. Overview of the proposed NVDC classification taxonomy.
C. NVDC LITERATURE OVERVIEW
To experience and appreciate the power of the proposed
classification taxonomy, this section offers a summary table
where the NVDC references available in the literature are
classified according to the proposed taxonomy, see TABLE I;
this table allows identifying related NVDC solutions with
respect to one or more taxonomy dimensions. Since the
references are chronologically ordered, it is easy to see the
NVDC technical approaches evolution along time. For
example, it is interesting to note that event is the most adopted
Raw Data Type class while lossy is the most adopted Fidelity
class. Moreover, the first learning-based NVDC solution has
just emerged in 2022, adopting a lossy coding approach.
In TABLE I, ‘?’ means that not enough information is
provided in the reference to clarify the respective
classification. For each entry in TABLE I, whenever there is
only one class in the ‘Pred.’, ‘Transf.’, and/or ‘Quant.’ columns
for more than one class in the ‘Compon.’ column, this means
that all components listed in ‘Compon.’ are predicted,
transformed and/or quantized in the same way; for each entry
in TABLE I, whenever there is more than one class in the
‘Compon.’/‘Pred’/‘Quant.’ dimensions, they are separated by
&. In TABLE I, ‘/’ is used to identify different variants
proposed for a specific classification dimension in each
reference.
From TABLE I, the following conclusions can be drawn:
The vast majority of the NVDC solutions currently
available in the literature, including the most recent ones,
are event-based (17 out of 19). This seems a natural
consequence of the significantly higher number of event-
based cameras commercially available [4] when
compared to the single spike-based camera currently
reported in the literature (the Vidar camera [5]).
Most of the NVDC solutions are lossy (12 out of 19),
although some of those solutions (e.g., [25], [28], [30],
[34], [37] and [38]) are classified as lossy not because
they employ lossy coding tools but rather because they
perform lossy operations, such as temporal quantization
and/or polarity accumulation, during the data structuring
of the raw input data sequence; while temporal
quantization induces some precision loss in the
timestamp component of the raw input data sequence,
polarity accumulation induces some precision loss in the
polarity component of the raw input data sequence.
While lossless NVDC seems to be more appropriate for
typical NVDC use-cases, lossy coding with adjustable
CR/quality tradeoff may also be adequate/advantageous
for certain application scenarios, e.g., on-demand slow
motion, for further reducing the events coding data rate
(compared to lossless coding); please refer to [20] for
more details on possible event-based vision use cases
and requirements under study on the recent JPEG
exploration activity on event-based vision (JPEG XE
[19]). Moreover, it is worth noting that lossy coding has
always been targeted in spike-based NVDC solutions.
In the event-based NVDC solutions, there is a trend
towards reducing the number of event’s components to
be directly coded by embedding one (or more)
component(s), e.g. timestamp or polarity, in the way the
events are arranged to be coded (i.e., in the data
structure); a similar trend is observed in the spike-based
NVDC solutions, as the location and temporal
information are somehow embedded in the BCU
processing order and in the spike firing states readout
time by the spike camera, respectively. The embedding
approach seems to allow some event/spike data rate
saving in comparison to coding all the components of the
event/spike representation.
Some of the most recent event-based NVDC solutions
jointly encode the event’s components, as opposed to
the independent coding approach followed by the former
NVDC solutions. While the joint approach may allow
some event data rate saving (in comparison to the
components independent coding), it may also increase
the coding solution complexity, as it becomes a multi-
variate coding approach.
In terms of the data structure/basic coding unit
dimensions, cuboid grid/polyhedron (notably cuboid-
like polyhedron) seem to have been the trend in the
earlier NVDC solutions, but for the more recent ones
(published since 2022) the trend is not that clear. A
possible reason for this may be the willingness to exploit,
in the neuromorphic vision data context, new
visual data coding strategies that meanwhile emerged
(e.g., G-PCC and deep learning-based coding), and that,
to be effective, require the adoption of more appropriate
data structures/basic coding units (such as 3D point
set/polyhedron and 3D array of frames/group of frames).
Finally, regarding the coding tools addressed by the
proposed taxonomy, it is possible to conclude from
TABLE I that while (some sort of) prediction is used by
most of the NVDC solutions available (10 out of 19), the
same does not apply to transform, which is only used in 2 out of the 19 NVDC solutions (in both cases spike-based solutions). In terms of quantization, it is interesting to notice that, of the 4 NVDC solutions where the quantization tool is employed, only 2 of them are event-based and, from these 2, both adopt a uniform
quantization; regarding the spike-based NVDC
solutions, both use the quantization tool, with a non-
uniform spacing of the quantization levels. Moreover,
from the small set of NVDC solutions adopting the
transform and quantization coding tools, it is possible to
infer that the impact of those coding tools on the NVDC
solutions’ performance is a poorly understood domain;
these might be two possible areas of future research.
III. NVDC: REVIEWING GUIDED BY THE TAXONOMY’S
RAW DATA TYPE DIMENSION
The NVDC area has received many contributions in the recent
years and is considered critical for the future of visual data
coding solutions. In the following, an exhaustive list of the
neuromorphic vision data coding (NVDC) solutions available
in the literature is presented together with a description of each
solution; the solutions’ comprehensive review is guided by the
proposed taxonomy’s Raw Data Type dimension and follows
the chronological order within each class of the Raw Data
Type dimension. Some performance results are also reported
for each NVDC solution to better understand its strengths and
weaknesses. However, it is important to stress that the
performance results reported in sections III.A and III.B are, in
most cases, not directly comparable between different coding
solutions, as they have been obtained under different test
conditions, e.g., different evaluating event/spike sequences
sets (even when the same dataset is used) and/or different
duration of the coded sequences. Sometimes even different
evaluation methodologies are used in the coding solutions
performance assessment; while some solutions evaluate their
performances with respect to the raw input data, other
solutions may evaluate its performance with respect to
aggregated data obtained from the raw input data structuring
process, typically involving operations that lead to a loss of
information in the raw input data. An earlier performance
comparison of lossless coding strategies, including the
NVDC solution in [21] and some generic lossless data
compression strategies adapted to NVDC, can be found in
[42]. For an overview on the datasets and performance
evaluation metrics/benchmarks adopted by the NVDC
solutions reviewed in this section, please refer to sections IV
and V, respectively.
A. EVENT-BASED NVDC SCHEMES
1) SPIKE CODING FOR DYNAMIC VISION SENSORS [21]
In 2018, Bi et al. proposed a cube-based coding framework to
losslessly code data generated from a DVS event camera
(consisting of location, polarity, and timestamp) [21].
Although [21] uses the term spike data to refer to the DVS
camera output data, in most NVDC literature (notably the
most recent one) event data is the term commonly used to refer
to the output data produced by an event camera, such as the
DVS event camera (see Section I); for this reason, the term
event has been adopted in this solution description.
In the proposed coding framework, which was the first
made available to code event data, the sensor event sequence
(outputted by the sensor pixel array) is first organized as a set
of points in the 3D (space-time) volume, with the pixel
location (x, y) and the timestamp defining the 3D coordinate
axes X, Y and Z, respectively, and the polarity being the value
attributed to each 3D point. An adaptive macro-cube
partitioning of the sensor event sequence in the temporal (i.e.,
Z) dimension is then performed based on a binary-tree
structure, targeting to obtain approximately the same number
of events within each macro-cube; a macro-cube is a cube of
event data in the 3D space whose size in X and Y dimensions
corresponds to the sensor pixel array full spatial resolution and
the size in Z (temporal dimension) results from binary-tree
partitioning. Each macro-cube is further split into 32×32
event-cubes along the spatial dimensions (X, Y), constituting
the basic unit for encoding and, thus, to exploit the (events)
spatio-temporal redundancies. It is worth recalling that, in the
related literature, events temporal redundancy is related to the
similarity between time intervals between consecutive events
at a given pixel, typically resulting from a constant changing
rate of luminance intensity (such as linear increase or
decrease). Events spatial redundancy is related to the
similarity between events triggered by adjacent (sensor)
pixels, typically resulting from the fact that adjacent pixels
tend to simultaneously receive almost the same luminance
intensities.
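A simplified sketch of the adaptive temporal (macro-cube) partitioning described above is given next; the median split and the maximum-events stopping criterion are assumptions of this illustration and may differ from the exact binary-tree criterion used in [21].

```python
def split_temporally(event_times, max_events):
    """Recursively split a time-ordered list of event timestamps with a
    binary tree so that each resulting temporal segment (the macro-cube
    extent along the Z axis) holds roughly the same number of events."""
    if len(event_times) <= max_events:
        return [event_times]
    mid = len(event_times) // 2  # split at the median event
    return (split_temporally(event_times[:mid], max_events)
            + split_temporally(event_times[mid:], max_events))
```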
The proposed event-cube encoding procedure consists of
separate encoding of (event) location, timestamp, and polarity
data. The event location encoding process involves evaluating
two intra-cube prediction modes designed to tackle different
event spatial distributions, the so-called address-prior mode
and time-prior mode, and selecting the one leading to the
lowest rate cost; while the address-prior mode is designed for
spatially sparse cubes, resulting from events scattering over
the entire spatial resolution, the time-prior mode is designed
for spatially dense cubes, resulting from a high events
concentration over neighboring pixels. As intra-cube
prediction modes, the address-prior mode and the time-prior
mode only exploit the events correlation within an event-cube.
Thus, in the address-prior mode, the events within the cube
to be encoded are first accumulated and stored in the so-called
location histogram, a 2D array where each entry represents the
number of events that were triggered in the corresponding
pixel location within the cube; the location histogram is
complemented with a location histogram binary map, a 2D
array where each entry indicates whether events were
triggered or not in the corresponding pixel location. Then, the
location histogram and the location histogram binary map are
(separately) fed into a context-based adaptive arithmetic
entropy encoder, generating the address-prior mode-based
location coding bitstream. For each pixel within the cube, the
events’ timestamps are differentially encoded with respect to
the previous (event) timestamp, followed by context-based
adaptive arithmetic entropy encoding, which generates the
address-prior mode-based timestamp coding bitstream.
In the time-prior mode, the so-called center point is first determined, i.e., the event pixel location within the cube that minimizes the spatial distance (in x and y pixel coordinates) to all the other event pixel locations within the cube to be encoded. Next, the displacement (∆x, ∆y) is computed
between the (x, y) coordinates of every event (within the cube)
and the center point coordinates (xc, yc) previously determined;
the set of (∆x, ∆y) displacements computed is then fed into a
context-based adaptive arithmetic entropy encoder, generating
the time-prior mode-based location coding bitstream. The
events timestamps (within the cube) are also differentially
encoded with respect to the previous (event) timestamp (the
coding order of the event timestamp follows the event location
coding order), followed by context-based adaptive arithmetic
entropy encoding, which generates the time-prior mode-based
timestamp coding bitstream. The elementary encoded
bitstreams associated to the (event) location and timestamp
coding result then from the corresponding bitstreams
generated by the intra-cube prediction mode with the lowest
rate (considering both the location and timestamp rates).
As far as event polarity coding is concerned, the polarity data
is context-based adaptive arithmetic entropy encoded (with a
coding order following the event location coding order of the
intra-cube prediction mode with the lowest rate), generating
the polarity coding bitstream. The elementary encoded
bitstreams resulting from (event) location, timestamp, and
polarity coding are then multiplexed to generate the event
coding bitstream.
Experimental results on the PKU-DVS event dataset,
proposed in [21] for the event data coding algorithm
evaluation, show an average compression ratio (over the
whole dataset) of 19.52 with respect to the raw event data size
(where each event is represented by 64 bits); hereafter, the
compression ratio with respect to the raw event data size will
be simply referred to as compression ratio (CR). Compared to
the LZ77 and LZMA benchmarks, two Lempel-Ziv-based
(generic) lossless coding algorithms, the proposed coding
framework achieves an average CR 4.43× and 1.54× higher,
respectively.
2) SPIKE CODING FOR DYNAMIC VISION SENSOR IN
INTELLIGENT DRIVING [22]
In 2019, Dong et al. extended the lossless event data coding
solution in [21], by proposing an adaptive octree-based
partitioning of the sensor event sequence into the so-called
coding tree cubes in both spatial (XY axes) and temporal (Z
axis) dimensions, to code event data generated from a DAVIS
event camera [22]; DAVIS (Dynamic and Active Pixel Vision
Sensor) [43] is a hybrid sensor that concurrently outputs
event data (through a DVS sensor) and conventional intensity
images/frames (through an active pixel sensor, APS).
In the proposed coding framework, each 64×64×32768
coding tree cube can be further adaptively divided along the
spatial and temporal dimensions into smaller cubes, called
coding cubes, using an octree structure, targeting to obtain
approximately the same number of events within each smaller
cube (i.e., coding cube); the coding cube constitutes thus the
basic coding unit of the proposed coding solution.
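The adaptive partitioning into coding cubes can be illustrated with the recursive Python sketch below; the actual criterion in [22] targets approximately the same number of events per coding cube, so the maximum-events-per-cube threshold used here is a simplified stand-in, and power-of-two cube dimensions (as in the 64×64×32768 coding tree cubes) are assumed.

def split_coding_tree_cube(events, origin, size, max_events):
    # Recursively split a cube into 8 children by halving the X, Y and Z (time)
    # axes, stopping when a cube holds few enough events (or cannot be split).
    x0, y0, t0 = origin
    sx, sy, st = size
    if len(events) <= max_events or min(sx, sy, st) <= 1:
        return [(origin, size, events)]
    hx, hy, ht = sx // 2, sy // 2, st // 2
    coding_cubes = []
    for ox in (0, hx):
        for oy in (0, hy):
            for ot in (0, ht):
                bx, by, bt = x0 + ox, y0 + oy, t0 + ot
                child = [(x, y, t, p) for (x, y, t, p) in events
                         if bx <= x < bx + hx and by <= y < by + hy
                         and bt <= t < bt + ht]
                coding_cubes += split_coding_tree_cube(
                    child, (bx, by, bt), (hx, hy, ht), max_events)
    return coding_cubes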
The proposed coding cube encoding procedure consists of
separate encoding of (event) location, timestamp, and polarity
data. Similarly to [21], two intra-cube prediction modes, i.e.,
the address-prior mode and the time-prior mode, are evaluated
and the one leading to the lowest rate cost is selected as the
best prediction mode. However, in [22], the evaluation of each
intra-cube prediction mode includes not only the (event)
location and timestamp data coding but also the polarity data
coding; the (event) location, timestamp, and polarity coding
strategies adopted in [22] are similar to the ones described in
the previous solution ([21]). Thus, the intra-cube prediction
mode with the lowest sum of (event) location, timestamp, and
polarity rates is selected as the best prediction mode; the
elementary encoded bitstreams resulting from the location,
timestamp and polarity coding for the best prediction mode are
then multiplexed to generate the event coding bitstream.
Experimental results on the DDD17 dataset (only event data
considered) show that the proposed solution slightly
outperforms the lossless event coding solution in [21],
achieving an average CR (with respect to the raw event data
size) of 2.65 whereas the event coding solution in [21]
achieves an average CR of 2.64. Compared to the LZ77
(Lempel-Ziv compression algorithm) and LZMA (Lempel-
Ziv-Markov chain algorithm) lossless benchmarks, the
proposed coding framework achieves an average CR 1.78×
and 1.24× higher, respectively. An inter-cube prediction
strategy, in which coding cubes previously encoded can be
used as reference to predict the current coding cube, was also
proposed in [22] but the coding performance achieved
(average CR of 2.64) was rather similar to the one attained
using only the intra-cube prediction strategy (average CR of
2.65).
3) SPIKE CODING: TOWARDS LOSSY COMPRESSION
FOR DYNAMIC VISION SENSOR [24]
Also in 2019, Fu et al. proposed the first lossy coding scheme
to code event data generated from a DVS event camera [24].
The proposed solution extends the lossless coding framework
in [22] to the lossy scenario by incorporating quantization of
the prediction residual data together with an optimized inter-
cube prediction and a new intra-cube prediction mode. It is
worth noting that the details on the proposed coding scheme
in [24] are scarce due to the paper length (1-page paper).
Experimental results on the MNIST-DVS dataset show the
evolution of the average distortion and classification accuracy
with the CR, for the proposed coding solution only; no details
are given on how the CR, average distortion and classification
accuracy were computed.
4) TIME-AGGREGATION-BASED LOSSLESS VIDEO
ENCODING FOR NEUROMORPHIC VISION SENSOR
DATA [25]
In 2021, Khan et al. proposed the so-called Time-Aggregation-
based Lossless Video Encoding (TALVEN) solution, a coding
solution based on event aggregation in the temporal dimension
and conventional lossless video coding, to code pseudo video
sequences created from event data generated from a DAVIS
event camera (consisting of location, polarity, and timestamp)
[25].
The TALVEN solution starts by aggregating (i.e.,
accumulating), over a fixed time interval called aggregation
time interval, the number of events triggered at each pixel
location according to their polarity, creating two polarity-
based event frames; a polarity-based event frame is, thus, a
location histogram, i.e., a 2D array with the full sensor pixel
array spatial resolution representing, at each pixel location, the
event count for a given polarity in a fixed aggregation time
interval. This process of event aggregation over a fixed time
interval, performed during the raw input data structuring,
involves (raw input) event timestamps quantization, i.e., a
lossy operation where all the (continuous) raw input (event)
timestamps within the aggregation time interval are mapped to
a single timestamp value, typically with reduced bit
representation; event timestamps quantization induces, thus,
some precision loss in the timestamp component of the raw
input event sequence. Therefore, from the event coding
pipeline point of view, which includes structuring of the raw
input data towards subsequent coding (see Section II.A), the
TALVEN solution is a lossy coding scheme; the (raw input)
event timestamps quantization is, however, the only operation
inducing loss of information in the TALVEN solution.
The two polarity-based event frames created from event
aggregation are then concatenated side by side, creating a
(new) bigger frame called superframe, targeting to take the
most benefit of the conventional video coding techniques and,
thus, achieving higher compression gains. This (event
accumulation and polarity-based event frames concatenation)
process is repeated over the whole duration of the sensor event
sequence (i.e., raw input event sequence) and the resulting set
of superframes, structured into a 3D array of superframes, is
then treated as a pseudo video sequence; the pseudo video
frames are not exactly conventional video frames, notably in
terms of their content (the value stored in each pixel is not an
intensity value), hence the pseudo term. The pseudo video
sequence is then HEVC losslessly encoded, generating the
event coding bitstream; it is worth noting that, in the TALVEN
solution, the polarity information is embedded within the
video frame (each superframe concatenates both polarity-
based event frames side by side) while the quantized
timestamp information (for all the events in each video frame)
is embedded in the frame number field of the encoded video
sequence.
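A minimal Python sketch of the TALVEN superframe construction, assuming events given as (x, y, t, p) tuples, is shown below; which polarity is placed on which side of the superframe is an arbitrary choice here, and the subsequent HEVC lossless encoding is not shown.

import numpy as np

def build_talven_superframes(events, width, height, delta_t):
    # Aggregate events into two per-polarity count frames (location histograms)
    # over consecutive windows of length delta_t, then concatenate the two
    # frames side by side into a superframe.
    t_start = min(t for (_, _, t, _) in events)
    superframes = {}
    for x, y, t, p in events:
        k = int((t - t_start) // delta_t)  # aggregation time interval index
        frame = superframes.setdefault(
            k, np.zeros((height, 2 * width), dtype=np.uint16))
        column = x if p > 0 else x + width
        frame[y, column] += 1
    # Return the pseudo video sequence as a list ordered by interval index.
    return [superframes[k] for k in sorted(superframes)]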
Experimental results on 10 (indoor and outdoor) event
sequences of the DAVIS 240C dataset (only event data
considered) show higher coding performance for medium to
high aggregation time intervals compared to the event coding
solution in [22], achieving CRs up to 4× higher. Besides the
usual CR, between the raw event data size (in bits) and the
event coding bitstream size (in bits), [25] also reports results
on the CR between the raw (pseudo) video sequence size (in
bits) and the event coding bitstream (in bits), called video
encoder CR; the coding performance of the proposed solution
in terms of video encoder CR follows a similar trend to the
usual CR.
5) LOSSY EVENT COMPRESSION BASED ON IMAGE-
DERIVED QUAD TREES AND POISSON DISK SAMPLING
[27]
Also in 2021, Banerjee et al. proposed the so-called Poisson
Disk Sampling-Lossy Event Compression (PDS-LEC), a lossy
coding solution based on Poisson disk sampling and quad-tree
segmentation of intensity images to code event data generated
from a DAVIS event camera (consisting of location, polarity,
and timestamp) [27]; please recall that the DAVIS sensor [43]
is a hybrid sensor that produces both event data (through a
DVS sensor) and conventional intensity images (through an
APS sensor). While the input of the proposed coding
framework includes both event data (location, polarity, and
timestamp) and RGB images, the work in [27] is focused on
the event data coding only; RGB images are assumed to be
coded in (some of) the experimental results but there is no
reference on the coding solution employed to compress them.
In the PDS-LEC solution, a quad-tree structure is first
derived for the intensity frame I_t at time instant t, through
dynamic programming (Viterbi algorithm), based on the
decoded frame Î_(t-1) at time instant t-1, targeting to guide the
coding process of the event data triggered between time
instants t-1 and t. Next, the events triggered between time instants
t-1 and t are aggregated according to their polarity and
temporally quantized into 16 bins, creating (two) polarity-
based event cuboid grids; each (polarity-based) event cuboid
is a volume of event data in a 3D space-time neighborhood
and constitutes the basic coding unit of the PDS-LEC solution.
The (polarity-based) event cuboids are then adaptively
sampled via Poisson disk sampling according to the priority
established by the quad-tree based segmentation map; the
priority of a 2D spatial region within the 3D space-time
volume (of event data triggered between time
instants t-1 and t) is inversely proportional to the
corresponding block size in the quad-tree based segmentation
map. The higher the region priority, the higher the importance
of keeping the respective events after sampling. Afterwards,
the locations (x, y) of the sampled events are differentially
encoded followed by Huffman encoding, generating the
location coding bitstream, while the polarity information of
the sampled events is run-length encoded followed by
Huffman encoding, generating the polarity coding bitstream.
The elementary encoded bitstreams resulting from (event)
location and polarity coding are then multiplexed to generate
the event coding bitstream.
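The construction of the polarity-based event cuboid grids can be sketched in Python as follows; storing per-voxel event counts is an assumption, and the quad-tree derivation and Poisson disk sampling stages of [27] are not reproduced here.

import numpy as np

def build_polarity_event_cuboids(events, width, height, t_prev, t_curr, bins=16):
    # Aggregate the events triggered between t_prev and t_curr into two
    # polarity-based cuboid grids, quantizing the time axis into 16 bins.
    cuboids = {1: np.zeros((bins, height, width), dtype=np.uint16),
               -1: np.zeros((bins, height, width), dtype=np.uint16)}
    bin_length = (t_curr - t_prev) / bins
    for x, y, t, p in events:
        if t_prev <= t < t_curr:
            b = min(int((t - t_prev) / bin_length), bins - 1)  # temporal bin
            cuboids[1 if p > 0 else -1][b, y, x] += 1
    return cuboids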
Experimental results on the DAVIS 240C dataset show
higher compression performance, measured in terms of CR
versus aggregation time interval, compared to the lossless and
lossy event coding benchmark solutions in [22] and [25],
respectively; however, the precise conditions under which this
lossy-lossless comparison was performed are not clear from
[27]. More recently, the PDS-
LEC solution has been adopted by the same authors for object
tracking in a communication system called host-chip
architecture [40]; in the proposed host-chip architecture, the
data acquisition on the chip (event and intensity images) is
driven by information transmitted by the host, an object
tracking application, through a feedback channel.
Experimental results show the robustness of the proposed
system when compared to benchmark object tracking methods
using DVS.
6) LOSSLESS COMPRESSION OF EVENT CAMERA
FRAMES [28]
In 2022, Schiopu and Bilcu proposed a performance-oriented,
context-based image coding solution to code groups of event
frames generated from a DVS-like camera (consisting of
location, polarity, and timestamp), where the event location
information and the event polarity information are encoded
separately using different strategies [28].
In the proposed coding framework, the polarity of the
events that occurred at each (sensor) pixel location is first
accumulated (summed up) over a fixed time interval ∆ and the
(polarity) sum’s sign {-1, 0, 1} is then stored in the
corresponding pixel location in a 2D array known as event
frame (EF); an EF has, thus, the full sensor pixel array spatial
resolution and represents, at each pixel location, the polarity
of a single event that represents all the events triggered in the
accumulation time interval. Depending on the ∆ value, the
process of event aggregation over a fixed time interval with
polarity accumulation, performed during the raw input data
structuring, may involve (raw input) event timestamps
quantization and polarity accumulation, two operations that
induce some precision loss in the timestamp and polarity
components of the raw input event sequence, respectively;
while in event timestamps quantization all the raw input
(event) timestamps within the aggregation time interval are
mapped to a single timestamp value, typically with reduced bit
representation, in polarity accumulation, all the raw input
(event) polarities within the aggregation time interval are
summed up and converted to a single (representative) polarity
value. In this context, from the event coding pipeline point of
view, which includes structuring of the raw input data towards
subsequent coding (see Section II.A), the proposed framework
[28] is a lossy/lossless coding scheme; the (raw input) event
timestamps quantization and polarity accumulation are the
two operations inducing loss of information in the proposed
coding framework.
The proposed process of event aggregation over a fixed
time interval with polarity accumulation is repeated over the
whole sensor event sequence duration and the resulting set of
(synchronous) EFs is then structured into a 3D array of EFs,
with each group of 8 (consecutive) EFs constituting the basic
coding unit of the proposed coding framework.
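The EF construction with polarity accumulation described above can be sketched as follows in Python; mapping ON/OFF polarities to +1/-1 is an assumption, and the subsequent EMI/CPV coding is not shown.

import numpy as np

def build_event_frame(events, width, height, t0, delta_t):
    # One event frame (EF): the sign {-1, 0, 1} of the per-pixel polarity sum
    # over the aggregation interval [t0, t0 + delta_t).
    accumulator = np.zeros((height, width), dtype=np.int32)
    for x, y, t, p in events:
        if t0 <= t < t0 + delta_t:
            accumulator[y, x] += 1 if p > 0 else -1
    return np.sign(accumulator).astype(np.int8)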
For coding purposes, each group of 8 (consecutive) EFs is
then represented by a pair of an event map image (EMI),
storing the spatial information, and a concatenated polarity
vector (CPV), storing the polarity information. The EMI is
further represented by: i) a (2D) binary map (BM), signaling
the pixel locations (x, y) where at least one event has occurred
in an EF; ii) the number of events per signaled position in BM;
and iii) the EF indices, indicating the positions in the EF group
of the EFs associated to the events previously identified.
The BM is encoded in raster scan order using template
context modelling (TCM), where the context is computed
using the causal neighborhood of the pixel to be encoded. The
number of events is encoded bitplane by bitplane, starting with
the least significant one in a 3-bitplane representation
(maximum number of events is 8), also using template context
modelling; the context for each bitplane is computed using the
causal neighborhood of the current bitplane and a template
context from the previously coded bitplane(s). The EF indices
are encoded using adaptive Markov modelling (AMM). As far
as the coding of polarity information is concerned, both AMM
and TCM are applied, and the final encoding strategy is the
one with the lowest estimated codelength. In the proposed
coding solution, the event coding bitstream results from
multiplexing the elementary bitstreams resulting from EMI
and CPV encoding.
Experimental results on the DSEC dataset show that the
proposed solution achieves CRs up to 5.8 (with respect to the
raw event data size) for the time interval ∆ = 10⁻⁶ s. Compared
to the conventional lossless video/image coding solutions used
as benchmarks, HEVC (High Efficiency Video Coding), VVC
(Versatile Video Coding), CALIC (Context-based Adaptive
Lossless Image Coding) and FLIF (Free Lossless Image
Format), [28] reports, for ∆ = 10⁻⁶ s, CR improvements of
198.01%, 238.94%, 125.04%, and 84.92%, respectively. Note
that, when the 3D array of EFs is generated from the raw input
(sensor) event sequence considering the time interval ∆ = 10⁻⁶ s
(i.e., a framerate of 10⁶ frames per second (fps)), all events of
the raw input event sequence having the same timestamp are
aggregated in one EF; this means that, for ∆ = 10⁻⁶ s, there is
no (raw input) event timestamps quantization nor polarity
accumulation, i.e., it corresponds to a lossless scenario. For the
other time intervals (i.e., framerates) evaluated, notably
5.555 ms (180 fps), 1 ms (10³ fps) and 0.1 ms (10⁴ fps), [28]
reports average coding performance improvements of
70.68%, 58.06% and 20.66% compared to HEVC, VVC and
FLIF, respectively. This coding performance is measured in
terms of a so-called aggregation CR (see Section V), i.e., a CR
relative to the size (in bits) of an event (temporal) aggregation
based raw input data structure, in this case the 3D array of EFs
(where each EF pixel is represented by 2 bits, as 3 symbols are
possible {-1, 0, 1}); please recall that the 3D array of EFs
results from a process of event aggregation over a fixed time
interval with polarity accumulation (as described in the
beginning of this section). It is worth noting that the input to
the conventional image/video coding solutions are (event)
images obtained by combining, through some mathematical
function, the information enclosed in each set of 5 consecutive
EFs (represented by 8-bit values), for improved coding
performance of those codecs (please refer to [28] for more
details).
7) SPATIAL-TEMPORAL DATA COMPRESSION OF
DYNAMIC VISION SENSOR OUTPUT WITH HIGH PIXEL-
LEVEL SALIENCY USING LOW-PRECISION SPARSE
AUTOENCODER [29]
Also in 2022, Hasssan et al. proposed the first learning-based
lossy coding framework to code event frames created from
event data generated from DVS-like cameras (consisting of
location, polarity, and timestamp) [29].
In the proposed framework, the sensor event sequence (i.e.,
raw input event sequence) is first converted into
(synchronous) binary frames by sampling the (raw input)
events (x, y, p, t) over the temporal dimension (timestamp t);
however, this conversion process is not described in detail in
[29]. The binary frames are then fed as input to the proposed
low-precision sparse convolutional autoencoder architecture,
where the encoder comprises 2 sparse convolutional layers
(each including a ReLU activation function) and 2 max
pooling layers, and the decoder includes 2 upsampling layers
and 2 convolutional layers (each of which including also a
ReLU activation function). To reduce the amount of
computation and storage resources needed, low-precision (2-
bit and 4-bit) convolution operations were implemented
during the training phase by passing the full precision weights
and activations of each convolution layer to a quantization
module before performing convolution operations, which
compresses the full precision weights and activations to 2-bit
and 4-bit precision levels, respectively. The sparse
compressed representation model (also known as latent space)
is obtained by adding a L1 norm sparsity penalty term to the
loss function during the training phase; the latent space
corresponds to the event coding bitstream of the proposed
coding solution.
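The architecture described above can be sketched in PyTorch as follows; the channel widths and kernel sizes are illustrative assumptions, and the 2-bit/4-bit weight and activation quantization used in [29] is omitted, so this is only a minimal sketch of the sparse convolutional autoencoder idea.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseConvAutoencoder(nn.Module):
    # Encoder: 2 convolutional layers with ReLU and 2 max-pooling layers;
    # decoder: 2 upsampling layers and 2 convolutional layers with ReLU.
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2),
            nn.Conv2d(16, 1, kernel_size=3, padding=1), nn.ReLU())

    def forward(self, binary_frame):
        latent = self.encoder(binary_frame)   # sparse compressed representation
        return self.decoder(latent), latent

def training_loss(reconstruction, target, latent, sparsity_weight=1e-4):
    # Reconstruction term plus an L1 sparsity penalty on the latent space.
    return F.mse_loss(reconstruction, target) + sparsity_weight * latent.abs().mean()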
Autoencoder reconstructed images are then used for
inference by a classification network and an object detection
network, both of which were independently trained on the
original, i.e., raw input, (DVS) event data. It is worth noting
that, contrary to all the previously reviewed solutions, the
learning-based lossy coding framework proposed in [29] was
not evaluated standalone but in the context of specific
computer vision tasks, notably event-based classification and
object detection tasks. In the classification task, experimental
results on the datasets MNIST-DVS (generated by converting
the standard frame-based MNIST dataset to events),
DvsGesture and Gen1 N-CARS show average CRs up to 29.1
with an accuracy drop of 3.0%. In the object detection task,
experimental results on the Gen1 Automotive Detection
dataset show an average CR of 11.9 with a drop of 0.07 in
mean average precision.
8) LOW-COMPLEXITY LOSSLESS CODING FOR
MEMORY-EFFICIENT REPRESENTATION OF EVENT
CAMERA FRAMES [30]
Still in 2022, Schiopu and Bilcu proposed a low-complexity
coding framework based on run-length and Elias coding to
code, in a memory efficient way, event frames created from
event data generated from a DVS-like camera (consisting of
location, polarity, and timestamp) while targeting to be
suitable for hardware implementation in low-cost event signal
processing chips [30].
In the proposed coding framework, the polarity of the
events triggered at each pixel location is first summed up
over a fixed time interval ∆ and the polarity sum’s sign {-1, 0,
1} is then stored in the corresponding pixel location in the so-
called event frame (EF), as in [28]. Hence, similarly to
solution 6) reviewed above ([28]), the coding framework
proposed in [30] is classified as lossy/lossless, since,
depending on the ∆ value, it may involve raw input event
timestamps quantization and polarity accumulation, two
operations that lead to some precision loss in the timestamp
and the polarity components, respectively; these are, however,
the only two operations inducing loss of information in the
proposed coding framework.
The proposed process of event aggregation over a fixed
time interval with polarity accumulation is repeated over the
whole sensor event sequence duration (or any pre-defined time
length) and the resulting (synchronous) EFs are then grouped
together forming the so-called EF volume (i.e., a 3D array of
EFs). The EF volume serves as input for the two proposed
coding solutions: i) SAFE (Simple And Fast lossless Event
frame) codec, for fast coding of large sets of EFs (thousands
of EFs); and ii) MER (Memory-Efficient Representation)
codec, for a memory-efficient representation of EFs while
providing random access (RA) to any group of pixels within
the EF volume. The EF volume constitutes, thus, the basic
coding unit of the proposed coding solutions.
In the SAFE solution, the EF volume is further represented
by a (2D) binary map (BM), signaling the pixel locations (x,
y) where at least one event was triggered in time, and a vector
of concatenated time intervals (VCT), associated to each pixel
location signaled in BM. Next, the run-length encoding
scheme is adapted for coding both vectorized BM and VCT,
by counting the number of consecutive event/no-event
symbols in each, and then the Elias coding algorithm is applied
to code those sets of counts, generating the event coding
bitstream.
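The run-length and Elias coding combination used by SAFE can be sketched as follows in Python; [30] does not specify which member of the Elias code family is used, so the Elias gamma code shown here is an assumption.

def run_lengths(binary_sequence):
    # Count runs of consecutive identical event/no-event symbols.
    if len(binary_sequence) == 0:
        return []
    runs, count = [], 1
    for previous, current in zip(binary_sequence, binary_sequence[1:]):
        if current == previous:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return runs

def elias_gamma(n):
    # Elias gamma code of a positive integer n, returned as a string of bits.
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

# Example: encode a vectorized binary map fragment.
bitstream = "".join(elias_gamma(r) for r in run_lengths([1, 1, 0, 0, 0, 1]))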
In the MER solution, the EF volume (3D array of EFs) is
further divided into a set of 8×8×8 cubes, constituting the RA
units, which are then arranged as a set of vectors V. As for the
SAFE codec, the MER codec adapts the run-length coding
scheme and Elias coding to code the set of vectors V,
generating the RA event coding bitstream.
Experimental results on the DSEC dataset show that the
proposed MER solution achieves an average aggregation CR
(over the DSEC dataset) of 8.77 for ∆ = 1 ms (10³ fps), with an
average EF encoding speed of 9.07 ms per EF (ms/EF); please
recall that the aggregation CR is a CR with respect to the size
(in bits) of the (uncompressed) 3D array of EFs (see Section
V), where each EF pixel is represented by 2 bits (as 3 symbols
are possible {-1, 0, 1}). The so-called average EF encoding
speed metric measures the (average) time needed to encode
one EF (see TABLE IV). Experimental results also show that
the MER solution provides fast and low-complexity RA to any
8×8×8 volume of pixels. For the same time interval ∆ = 1 ms,
the proposed SAFE solution achieves an average aggregation
CR of 9.77, with an average EF encoding speed of 5.59ms/EF.
Compared to the conventional lossless video/image coding
solutions used as benchmarks, HEVC, VVC and CALIC, [30]
reports, for the best performing proposed solution (SAFE),
average aggregation CR improvements of 34.57%, 24.62%,
and 35.51%, respectively, with average EF encoding
speed reductions of 92.79%, 99.97%, and 65.21%,
respectively. For the same time interval ∆ = 1 ms, the
conventional lossless image coding solution FLIF outperforms
the best performing proposed solution (SAFE) with an average
aggregation CR 5% higher but requires a higher average EF
encoding speed (26.10× higher). When the EF volume is
generated from the sensor event sequence considering a time
interval ∆ = 10⁻⁶ s (10⁶ fps), an average CR of 4073 is reported
for the proposed SAFE solution, with an average EF encoding
speed of 59.86 ms/EF; similarly to the solution 6) reviewed
above ([28]), there is no (raw input) event timestamps
quantization nor polarity accumulation for ∆ = 10⁻⁶ s (lossless
scenario). For ∆ = 10⁻⁶ s, the proposed SAFE solution achieves
average CR gains (over the DSEC dataset) 2.37×, 2.70×,
1.47×, and 1.76× higher than the ones obtained with HEVC,
VVC, FLIF, and CALIC, respectively, while the average EF
encoding speeds are 0.77×, 0.07×, 0.64×, and 5× the ones
required by HEVC, VVC, FLIF, and CALIC, respectively.
Please recall that, as in [28], the input to the conventional
image/video coding solutions are (event) images obtained by
combining, through some mathematical function, the
information enclosed in each set of 5 consecutive EFs
(represented by 8-bit values), for improved coding
performance of those codecs (please refer to [28] for more
details).
9) LOSSLESS COMPRESSION OF NEUROMORPHIC
VISION SENSOR DATA BASED ON POINT CLOUD
REPRESENTATION [31]
In 2022, Martini et al. proposed a lossless coding solution
based on standard point cloud compression to code event data
generated from a DAVIS event camera (consisting of location,
polarity, and timestamp) [31].
In the proposed solution, the sensor event sequence is
organized as a set of points in 3D (space-time) volume, where
the pixel location (x, y) and the timestamp t correspond to the
3D coordinate axes X, Y and Z, respectively, and the polarity
is the value attributed to each 3D point, thus resembling a 3D
point cloud representation. By splitting the events (of the
sensor event sequence) according to their polarity, two 3D
point clouds are obtained, one for each polarity.
Then, the standard Geometry-based Point Cloud
Compression (G-PCC) codec [41] is applied to separately
encode the geometry information, i.e., the (x, y, t) triplet, of
each 3D (polarity-based) point cloud. G-PCC applies a
transformation to the input (x, y, t) coordinates and structures
the resulting data into voxelized octrees; the geometry is then
losslessly coded with an octree of appropriate depth. The two
elementary encoded bitstreams resulting from applying G-
PCC to each 3D (polarity-based) point cloud are then
multiplexed to generate the event coding bitstream.
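A minimal Python sketch of the polarity-based point cloud construction is given below; the actual G-PCC encoding in [31] is carried out by an external codec and is not shown here.

import numpy as np

def events_to_polarity_point_clouds(events):
    # Split the (x, y, t, p) event sequence into two 3D point clouds, one per
    # polarity, with (x, y, t) as the geometry of each point.
    positive = np.array([(x, y, t) for (x, y, t, p) in events if p > 0], dtype=np.int64)
    negative = np.array([(x, y, t) for (x, y, t, p) in events if p <= 0], dtype=np.int64)
    return positive, negative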
Experimental results on the DAVIS 240C dataset (only
event data considered) show higher compression performance,
measured in terms of CR, when compared to the lossless
benchmark solution in [21] and LZMA, a Lempel-Ziv-based
(generic) lossless coding algorithm; CRs up to 30% and 49.4%
higher are reported in [31] with respect to [21] and LZMA,
respectively.
10) LOW-COMPLEXITY LOSSLESS CODING OF
ASYNCHRONOUS EVENT SEQUENCES FOR LOW-
POWER CHIP INTEGRATION [32]
Also in 2022, Schiopu and Bilcu proposed the so-called Low-
complexity Lossless Compression of AsynchRonous Event
Sequence (LLC-ARES), a lossless coding framework based on
a novel event data representation, called same-timestamp
representation, and triple threshold-based range partition
(TTP) algorithm. The LLC-ARES framework codes event
data generated from a DVS-like camera (consisting of
location, polarity, and timestamp) while targeting to be
suitable for hardware implementation in low-power event
signal processing chips [32].
In the proposed LLC-ARES framework, the sensor event
sequence is first divided into multiple sub-sequences, called
same-timestamp (ST) sub-sequences, each of which encloses
all the events (in the sensor event sequence) with the same
timestamp; the ST sub-sequences constitute the basic coding
unit of the proposed solution.
Each ST sub-sequence is then ordered in increasing order
of the largest spatial coordinate and represented by four data
structures (DSs), i.e., two DSs containing the event spatial
(location) information (x and y coordinates, respectively), one
DS containing the polarity information and one DS containing
the number of events that were triggered at the timestamp of
the ST sub-sequence (which corresponds to the ST sub-
sequence length). Each DS is subsequently encoded, where
binarization is employed to encode the polarity information
DS and the TTP algorithm is employed to predictively encode
the other three DSs.
The TTP algorithm uses a short-depth decision tree based on
a triple threshold to partition the range of the input data
(notably x and y coordinates of the events and the ST sub-
sequence length) into several smaller coding ranges
distributed at concentric distances from the prediction,
obtained from the corresponding previously coded data.
Afterwards, each TTP input data value is represented by the
binary representation of the prediction error and the binary
representation of the decision tree structure, generating the
corresponding elementary data structure encoded bitstream;
the event coding bitstream is then formed by multiplexing the
4 elementary encoded bitstreams resulting from encoding the
4 DSs used to represent every ST sub-sequence.
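The splitting of the sensor event sequence into ST sub-sequences and the construction of the four DSs can be sketched in Python as follows; the TTP predictive coding itself is not reproduced, and the exact ordering rule of [32] beyond "increasing order of the largest spatial coordinate" is an assumption.

from collections import defaultdict

def same_timestamp_data_structures(events):
    # Group events by timestamp into ST sub-sequences and build, for each, the
    # four DSs coded by LLC-ARES: x coordinates, y coordinates, polarities, and
    # the ST sub-sequence length.
    by_timestamp = defaultdict(list)
    for x, y, t, p in events:
        by_timestamp[t].append((x, y, p))
    data_structures = []
    for t in sorted(by_timestamp):
        # Order the sub-sequence by increasing largest spatial coordinate.
        sub_sequence = sorted(by_timestamp[t], key=lambda e: max(e[0], e[1]))
        data_structures.append({
            "x": [e[0] for e in sub_sequence],
            "y": [e[1] for e in sub_sequence],
            "polarity": [e[2] for e in sub_sequence],
            "length": len(sub_sequence)})
    return data_structures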
A coding solution providing random access (RA)
functionality, called LLC-ARES-RA, is also proposed in [32]. In
LLC-ARES-RA, the sensor event sequence is first divided
into multiple so-called packages of a given time length, which
constitute the RA units. Each package is then coded with the
LLC-ARES solution, and a header information bitstream,
obtained by employing the TTP algorithm to code the length
of the package bitstreams, is added to the resulting package
encoded bitstreams.
Experimental results on the DSEC dataset show that LLC-
ARES has, on average, higher compression performance,
measured in terms of CR, compared to the lossless benchmark
solutions Bzip2 (lossless compression method based on
Burrows-Wheeler algorithm), LZMA, and ZLib, with average
CR improvements of 5.49%, 11.45%, and 35.57%,
respectively; for event sequences with high density (i.e., with
high number of events per second), Bzip2 and LZMA perform
better than LLC-ARES, with an average CR up to 1.2× and
1.1× higher, respectively. As far as the LLC-ARES-RA
solution is concerned, experimental results show a
performance close to the LLC-ARES for the smallest
(package) time length considered (100µs).
11) FEATURE REPRESENTATION AND COMPRESSION
METHODS FOR EVENT-BASED DATA [33]
In 2023, Wang et al. proposed a conceptually different lossless
coding approach, which is based on a character-like
representation of the event representation components (i.e., x,
y, p, t), to code event data generated from a DAVIS event
camera (consisting of location, polarity, and timestamp) [33].
The proposed approach includes two coding methods,
namely the Characteristic Parameter Jointed Coding (CPJC)
method and the ASCII Coding based on Bit Operation (ACBO)
method. Both methods are applied at the event level of the
sensor event sequence and are based on the conversion of the
x, y, p, t event representation into a sequence of characters
(i.e., character-like representation); the event constitutes, thus,
the basic coding unit of both proposed methods.
In the CPJC method, the event polarity, ‘0’ or ‘1’, is first
mapped to the special characters ‘-’ or ‘null’, respectively.
Then, the pixel coordinates x and y, i.e., the event location, are
converted to an alphabetic character (letter), represented by 2
digits, followed by a numeric character, represented by 1 digit;
the 3-digit representation of the x and y coordinates results
from the fact that, in a DAVIS346 sensor with 346×260 spatial
resolution, the maximum coordinate value, 346, requires 3
digits to be represented. In the proposed x coordinate
conversion process, the numeric character corresponds to the
last (rightmost) digit of the x coordinate value while the
alphabetic character is obtained from a predefined dictionary,
which associates a 2-digit value (the two leftmost digits of x
coordinate value) to a letter; a similar conversion process is
applied to the y coordinate value. Finally, the event timestamp
is first subtracted from the timestamp of the previously
triggered event and the difference is then represented by a
numeric character. After applying the CPJC method to every
event of the sensor event sequence, the corresponding
sequences of characters are multiplexed and Zip compression
is applied as entropy coding, generating the event coding
bitstream.
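The CPJC coordinate-to-character conversion can be sketched in Python as follows; the letter dictionary below is only illustrative, since the actual dictionary used in [33] is not specified in detail, and the Zip entropy coding stage is not shown.

import string

# Illustrative dictionary mapping the 2-digit prefixes 00-34 (enough for the
# 346-wide DAVIS346 sensor) to single letters.
letters = string.ascii_uppercase + string.ascii_lowercase
prefix_dictionary = {f"{i:02d}": letters[i] for i in range(35)}

def cpjc_coordinate_to_characters(coordinate):
    # 3-digit representation of the coordinate, e.g., 17 -> "017".
    digits = f"{coordinate:03d}"
    # Alphabetic character for the two leftmost digits, numeric character for
    # the rightmost digit.
    return prefix_dictionary[digits[:2]] + digits[2]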
In the ACBO method, the pixel coordinate x (respectively,
y) of every event in the sensor event sequence is first converted
to a binary sequence, whose length is stored in a structure for
posterior entropy coding, and the binary sequences associated
to the x (respectively, y) coordinate of all events in the sensor
event sequence are then concatenated forming a long binary
sequence. Next, the long binary sequence is split into multiple
7-bit-long binary sub-sequences, each of which is then
converted to its corresponding decimal value for posterior
ASCII character representation according to a predefined fine-
tuned ASCII table; Zip compression is then applied to the
corresponding sequence of ASCII characters as entropy
coding, generating the event x (respectively, y) coordinate
encoded bitstream. The event timestamp is converted to a
numeric character, as in the CPJC method, while the event
polarity, ‘0’ and ‘1’, is represented by the ‘-’ and ‘+’
characters, respectively; the character representation of the
polarity and timestamp information of each event are
concatenated, with the polarity character acting as the
separator between the timestamp characters of two
consecutive events; Zip compression is then applied to the
corresponding sequence of characters, generating the event
polarity-timestamp encoded bitstream. The (full) event coding
bitstream in the ACBO method results from multiplexing
the elementary (event) x and y coordinate encoded bitstreams
with the polarity-timestamp encoded bitstream.
Experimental results on four event sequences, acquired by
the authors with the DAVIS346 event camera, show that the
proposed coding methods have higher compression
performance, measured in terms of CR, compared to the
lossless benchmark solution in [21] and direct Zip coding of
the x, y, p, t event representation; CR improvements of
17.93% and 14.92% compared to [21] are reported for CPJC
and ACBO methods, respectively. It is also observed that, on
average, the CPJC method performs better than the ACBO
method.
12) MEMORY-EFFICIENT FIXED-LENGTH
REPRESENTATION OF SYNCHRONOUS EVENT
FRAMES FOR VERY-LOW-POWER CHIP INTEGRATION
[34]
Also in 2023, Schiopu and Bilcu proposed a low-complexity
coding framework based on a memory-efficient fixed-length
representation using multi-level lookup tables (LUTs) to code
event frames created from event data generated from a DVS-
like camera (consisting of location, polarity, and timestamp)
while targeting to be suitable for hardware implementation in
very low-power event signal processing chips [34].
As in [28] and [30], the polarity of the events triggered at
each pixel location is first summed up over a fixed time
interval ∆ and the polarity sum’s sign {-1, 0, 1} is then stored
in the corresponding pixel location in the so-called event frame
(EF), which constitutes the basic coding unit of the proposed
framework. Hence, similarly to the solutions 6) and 8)
reviewed above ([28] and [30], respectively), the coding
framework proposed in [34] is classified as lossy/lossless,
since, depending on the ∆ value, it may involve event
timestamp quantization and polarity accumulation, two
processes that lead to some precision loss; these are, however,
the only two operations inducing loss of information in the
proposed coding framework.
Each EF is then partitioned into blocks of 32×32 pixels, each
of which is vectorized and further split into 205 subsets of 5
ternary symbols {-1, 0, 1}. Each subset, i.e., 5 ternary symbols,
is then remapped into an 8-bit symbol through a mathematical
model and stored in a 205-symbol-long vector. A fixed-length
LUT-based encoding solution is then applied to every 205-
symbol-long vector, where a LUT-based representation is
used to store all the unique combinations of 205 symbols
found in any 205-symbol-long vector (obtained from the EF)
and an index matrix is used to store in each entry the position,
i.e., index, in the LUT-based representation from where the
corresponding 205-symbol-long vector can be extracted. The
index matrix and the LUT-based representation are then
encoded through binarization, generating the event coding
bitstream.
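The remapping of 5 ternary symbols into one 8-bit symbol can be sketched as follows in Python; a base-3 positional code is one possible mapping (3⁵ = 243 ≤ 256), and the exact mathematical model used in [34] may differ.

def pack_five_ternary_symbols(symbols):
    # Remap 5 ternary symbols {-1, 0, 1} into one 8-bit symbol.
    assert len(symbols) == 5
    value = 0
    for s in symbols:
        value = 3 * value + (s + 1)   # map {-1, 0, 1} to the digits {0, 1, 2}
    return value                      # always fits in a single byte (0..242)

def unpack_five_ternary_symbols(value):
    symbols = []
    for _ in range(5):
        symbols.append(value % 3 - 1)
        value //= 3
    return symbols[::-1]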
Experimental results on the DSEC dataset show that, for
time intervals ∆ = 1000 µs and ∆ = 5555 µs, the compression
performance of the proposed solution, measured in terms of
the aggregation CR (see Section V), approaches (but is still
below) the performance obtained with the conventional
lossless video/image coding benchmarks. For time intervals ∆
= 1000 µs (10³ fps) and ∆ = 5555 µs (180 fps), [34] reports
average aggregation CR reductions up to 5.79%, 12.76%,
5.13%, and 33.27% compared to HEVC, VVC, CALIC, and
FLIF, respectively; compared to the event-based coding
benchmark in [30], an average aggregation CR reduction up to
46.85% is reported for the same time intervals. For the smaller
time intervals, ∆ = 1 µs (10⁶ fps) and ∆ = 100 µs (10⁴ fps),
significantly larger average aggregation CR reductions are
observed when compared to the same
benchmarks; [34] reports average aggregation CR reductions
up to 81.92%, 79.42%, 86.57%, 88.83%, and 94.40%
compared to HEVC, VVC, CALIC, FLIF, and [30],
respectively. Please recall that, as in [28] and [30], the input to
the conventional image/video coding solutions are (event)
images obtained by combining, through some mathematical
function, the information enclosed in each group of 5
(consecutive) EFs (please refer to [28] for more details).
13) ENTROPY CODING-BASED LOSSLESS
COMPRESSION OF ASYNCHRONOUS EVENT
SEQUENCES [35]
Still in 2023, Schiopu and Bilcu extended the low-complexity
lossless coding framework in [32] (denominated LLC-ARES),
by modifying the TTP algorithm to employ entropy coding
based techniques, to code the same-timestamp (ST)
representation of event data generated from a DVS-like sensor
(consisting of location, polarity, and timestamp) [35].
In the proposed framework, named Entropy coding-based
Lossless Compression of ARES (ELC-ARES), after ordering a
ST sub-sequence in increasing order of the largest spatial
coordinate, a triple threshold-based range partition (TTP)
algorithm employing a set of adaptive Markov models
(AMMs) is applied to predictively encode the four data
structures (DSs) in which a ST sub-sequence is represented,
i.e., the two DSs containing the event spatial (location)
information (x and y coordinates, respectively), the DS
containing the polarity information and the DS containing the
number of events that were triggered at the timestamp of the
ST sub-sequence; the Laplace estimator is used to compute the
probability distribution for the TTP algorithm.
Different from the LLC-ARES solution [32], ELC-ARES
adopts also a new x-coordinate prediction strategy, where the
prediction of any element within the DS storing the x-
coordinate data corresponds to the x-coordinate associated to
the first event triggered at the ST sub-sequence timestamp; as
a consequence, ELC-ARES also adopts a new initialization for
the triple threshold needed for the TTP algorithm associated to
the x-coordinate. In addition to the 4 DSs coding processes, a
TTP algorithm similar to the one described above (i.e.,
employing a set of AMMs) is also applied to encode the
decision trees resulting from the coding processes of all DSs
except the polarity related one; the event coding bitstream is
then formed by multiplexing the elementary bitstreams
resulting from encoding the 4 DSs and the decision trees.
Experimental results on the DSEC dataset show that ELC-
ARES has higher compression performance, measured in
terms of CR, compared to the lossless benchmark solutions
LLC-ARES, Bzip2, LZMA and ZLib, with average CR
improvements (over the DSEC dataset) of 21.40%, 28.03%,
35.27%, and 64.54% respectively. Experimental results also
show that ELC-ARES provides average event encoding speed
reductions of 21.41%, 91.10%, and 54.74% compared to
Bzip2, LZMA and ZLib, respectively, and the LLC-ARES
benchmark solution achieves an average event encoding speed
46% lower than the proposed ELC-ARES solution; the
average event encoding speed metric measures the (average)
time needed to encode one event (see TABLE IV).
14) EVENT DATA STREAM COMPRESSION BASED ON
POINT CLOUD REPRESENTATION [36]
In 2023, Huang and Ebrahimi proposed a lossless coding
solution based on a point cloud representation to code event
data generated from a DAVIS event camera (consisting of
location, polarity, and timestamp) [36].
Similarly to [31], the sensor event sequence is organized as
a set of points in 3D (space-time) volume, where the pixel
location (x, y) and the timestamp t define the 3D coordinate
axes X, Y and Z, respectively, and the polarity is the value
attributed to each 3D point, thus resembling a 3D point cloud
representation. In this context, the events of the sensor event
sequence are first aggregated, according to their polarity, into
multiple sets with a fixed number of events each, generating
multiple sets of two 3D point clouds (one for each polarity),
all with the same number of events. Next, the (x, y, t) values
of each point in each point cloud are scaled to a range of values
in the x, y, and t coordinates appropriate for coding; in this
case, the timestamp t is multiplied by a temporal scaling factor of
1×10⁶ and the x and y coordinates are multiplied by a spatial
scaling factor of 1×10³. Then, the standard Geometry-based
Point Cloud Compression (G-PCC) codec [41] is applied to
encode the geometry information (i.e., the (x, y, t) triplet) of
every 3D (polarity-based) point cloud separately. The
elementary encoded bitstreams resulting from applying G-
PCC to every 3D (polarity-based) point cloud are then
multiplexed to generate the event coding bitstream.
Experimental results on 8 (indoor and outdoor) event
sequences of the DAVIS 240C dataset (only event data
considered) show higher compression performance, measured
in terms of CR, compared to the lossless benchmark solutions
[21], LZMA, Sprintz-Delta (compression algorithm for
internet of things devices characterized by low memory
consumption and low latency), Huffman coding, and SIMD-
BP128 (vectorized binary packing coding scheme), with
average CR improvements (over the 8 event sequences) of
22.2%, 36.2%, 100.4%, 161.4%, and 264.9% respectively.
15) A NOVEL APPROACH FOR NEUROMORPHIC VISION
DATA COMPRESSION BASED ON DEEP BELIEF
NETWORK [37]
Also in 2023, Khaidem et al. proposed a deep learning-based
coding framework to code pseudo video sequences created
from event data generated from a DAVIS event camera
(consisting of location, polarity, and timestamp) [37].
Similarly to [25], the proposed solution first accumulates,
over a fixed time interval, the number of events triggered at
each pixel location according to their polarity, generating
(two) polarity-based event frames; these frames have the full
sensor pixel array spatial resolution and represent, at each
pixel location, the event count for a given polarity, i.e.,
correspond to location histograms. As for solution 4) reviewed
above ([25]), the coding framework proposed in [37] is
classified as lossy since it requires event timestamp
quantization, an operation that induces some precision loss in
the timestamp component.
The two polarity-based event frames created from event
aggregation are then concatenated side by side creating a so-
called superframe, i.e., a frame with twice the sensor pixel
array horizontal resolution. This process is repeated over the
whole sensor event sequence duration and the resulting set of
superframes, structured into a 3D array of superframes, is then
treated as a pseudo video sequence, with each superframe
constituting the basic coding unit of the proposed framework.
The proposed superframe coding process starts by dividing
each superframe into 30×30 blocks and feeding them to a deep
belief network (DBN), comprising a 4-layer autoencoder,
generating low-dimensional (20×1) latent feature vectors.
The resulting latent feature vectors are then encoded using
Huffman arithmetic coding, generating the corresponding
(superframe blocks) coding bitstreams. The elementary
bitstreams resulting from encoding all 30×30 blocks of all
superframes are finally multiplexed, forming the event coding
bitstream.
Experimental results on 3 (indoor) sequences of the DAVIS
240C dataset (only event data considered) show significant
compression performance gains in general, measured in terms
of CR versus aggregation time interval, with respect to several
benchmark solutions, such as the event coding solutions in
[21] (lossless solution) and [25] (lossy solution), and the
generic lossless data compression algorithms Huffman
arithmetic coding, LZMA, LZ4, ZLib (Zeta Library), Zstd,
Brotli, and Snappy (fast integer compression algorithm). For
the largest time aggregation interval considered (30ms), the
proposed solution achieves an average CR (over the 3 event
sequences) 44.35× higher than the one achieved with the
lossless event coding solution in [21] and an average CR up to
108.95× higher than the ones achieved with the generic data
compression algorithms; the average CR improvements
decrease as the time aggregation interval decreases. However,
it is worth mentioning that, according to [37], the DBN was
trained on blocks derived from the first 10 seconds of the event
sequences used for validation, which may bias the results and
somehow justify the high performance obtained.
16) LOSSLESS ENCODING OF TIME-AGGREGATED
NEUROMORPHIC VISION SENSOR DATA BASED ON
POINT-CLOUD COMPRESSION [38]
In 2024, Adhuran et al. extended the TALVEN solution in
[25], by proposing to use a standard point cloud compression
scheme to code the time-aggregated event data created from
event data generated from a DAVIS event camera (consisting
of location, polarity, and timestamp) [38].
Similarly to [25], the proposed solution, so-called Time-
Aggregated Lossless Encoding of Events based on Point-
Cloud Compression (TALEN-PCC), first accumulates, over a
fixed time interval ∆, the number of events triggered at each
pixel location according to their polarity. This process
generates (two) polarity-based event matrices, each with the
full spatial resolution of the sensor’s pixel array. These
matrices correspond to location histograms, where each pixel
location reflects the event count for a given polarity. As for
solution 4) reviewed above ([25]), the coding framework
proposed in [38] is classified as lossy since it requires event
timestamp quantization, an operation that induces some
precision loss in the timestamp component.
The two polarity-based event matrices created from event
aggregation are then (independently) raster scanned and the
matrix entries with non-zero event counts are arranged in
two so-called multivariate streams (one for each polarity),
where each non-zero matrix entry is represented by 4
variables: x, y, Event Count and k; (x, y) corresponds to the
pixel location and k corresponds to the aggregation time
interval index in which the Event Count events occurred. This
event accumulation and multivariate streams creation process
is repeated over the whole duration of the sensor event
sequence (i.e., raw input event sequence). The resulting
multivariate streams constitute, thus, the basic coding unit of
the proposed solution.
Each resulting multivariate stream is organized as a 3D
point cloud, i.e., a set of points in 3D (space-time) volume,
where the pixel location (x, y) and the aggregation time
interval index k correspond to the 3D coordinate axes X, Y
and Z, respectively, and the Event Count is the attribute
associated with each 3D point. Each 3D point cloud is then
encoded with the standard Geometry-based Point Cloud
Compression (G-PCC) codec [41]; while x, y, and k are
encoded as geometry information, the Event Count is encoded
as attribute information (i.e., reflectance input of the G-PCC
codec). The elementary encoded bitstreams (geometry and
attribute) resulting from applying G-PCC to every 3D point
cloud (i.e., multivariate stream) are then multiplexed to
generate the event coding bitstream.
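The construction of one multivariate stream from the per-interval event-count matrices of a given polarity can be sketched in Python as follows; the G-PCC geometry and attribute encoding itself is performed by an external codec and is not shown.

def build_multivariate_stream(polarity_matrices):
    # Each non-zero entry yields (x, y, Event Count, k), where k is the
    # aggregation time interval index; matrices are raster scanned row by row.
    stream = []
    for k, matrix in enumerate(polarity_matrices):
        for y, row in enumerate(matrix):
            for x, count in enumerate(row):
                if count != 0:
                    stream.append((x, y, int(count), k))
    return stream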
Experimental results on the DAVIS 240C dataset (only
event data considered), show higher coding performance,
measured in terms of CR, when compared to the TALVEN
solution in [25], achieving CRs up to 30% higher. Additionally, for
medium to high aggregation time intervals, the TALEN-PCC
method significantly outperforms the lossless event coding
solution in [21].
17) FLOW-BASED VISUAL STREAM COMPRESSION
FOR EVENT CAMERAS [39]
Also in 2024, Stumpp et al. proposed the so-called flow-based
compression (FBC), a conceptually different lossy coding
approach, called stream-to-stream approach. Unlike the
traditional coding approach adopted by all the remaining
reviewed NVDC solutions, where the encoder output is a
coding bitstream, the proposed FBC solution encodes the
sensor event sequence into another (reconstructed) event
sequence. The FBC encoding operation is based on not
transmitting future events for some time period; instead, those
future events are predicted at the receiver using real-time
optical flow estimates computed and sent by the transmitter.
The proposed FBC approach is designed specifically for real-
time compression of event data generated from a DAVIS
event camera (consisting of location, polarity, and timestamp)
[39].
The proposed FBC solution operates in two phases: a
sending phase followed by a predicting phase. During the time
period of the sending phase, event-wise optical flow is first
computed using the hARMS method [44]. Then, the events are
classified into flow events or no-flow events based on the
computed flow consistency. A flow event corresponds to an
event with consistent flow that can be reliably used to perform
prediction of future events at the receiver side. Flow events are
transmitted along with their computed optical flow estimates.
These flow estimates are used by the receiver to predict future
events, allowing the FBC to reduce the actual number of
events transmitted, thus leading to compression. A no-flow
event corresponds to an event with a non-consistent flow. No-
flow events are transmitted as they are, i.e., with 64 bits per
event, without any additional compression.
During the time period of the predicting phase, the receiver
uses each flow event received during the previous sending
phase to predict the spatio-temporal location of future events;
those future events are not transmitted, as they are assumed to
be accurately predicted based on the consistency of the optical
flow computed at the transmitter. This prediction process
reduces the need to transmit every event, leading to
compression. Once the predicting phase is concluded for every
flow event received, the receiver combines the predicted
events with the no-flow events (received during the predicting
phase) generating the (FBC) reconstructed event sequence.
While the proposed FBC approach targets to be suitable for
real-time compression of event data, it can be combined with
other compression strategies for improved performance [39],
i.e., the (FBC) reconstructed event sequence can be further
compressed.
Experimental results on 4 event sequences (obtained from
the DAVIS 240C dataset, [45] and [46]) show an average CR
of 2.8, corresponding to an average event reduction of 68%,
with a median temporal error of 0.48ms and an average spatio-
temporal event stream distance of 3.07, in a coding framework
where only the proposed FBC is used; please refer to TABLE
IV for more details on the event reduction, temporal error and
spatio-temporal event stream distance. By applying LZMA on
top of the event sequence reconstructed by the proposed FBC,
experimental results show a CR improvement of 3.72 with
respect to the scenario where LZMA is not used.
B. SPIKE-BASED NVDC SCHEMES
1) AN EFFICIENT CODING METHOD FOR SPIKE
CAMERA USING INTER-SPIKE INTERVALS [23]
In 2019, Dong et al. proposed the first lossy coding framework
to code time interval sequences created from spike data
generated from a spike camera (consisting of ‘ON’/‘OFF’, i.e.,
binary, values) [23].
In the proposed framework, each spike train, i.e., each
sequence of spikes (‘ON’/‘OFF’ or ‘1’/’0’ values) outputted
by a single sensor pixel along time, is first converted into a
sequence of ‘waiting’ times between consecutive spikes, i.e.,
relative latency of spikes, known as inter-spike intervals
(ISIs). Then, each ISI sequence is adaptively partitioned into
multiple sub-sequences (temporally), called segments, such
that adjacent segments are characterized by different ISI distributions, thus allowing a more efficient exploitation of the spatio-temporal redundancies; each (ISI) segment (from a sensor pixel) therefore constitutes the basic coding unit of the proposed framework, with the ISI corresponding to the time interval component of the spike representation targeted by coding (see Section II.B).
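The spike-train-to-ISI conversion itself is straightforward, as illustrated by the minimal sketch below; the adaptive partitioning of each ISI sequence into segments with distinct ISI distributions is not reproduced here.

```python
import numpy as np

def spike_train_to_isi(spike_train):
    """Convert a binary spike train (1 = spike fired at that time step) into
    the sequence of inter-spike intervals, i.e., the waiting times (in time
    steps) between consecutive spikes."""
    spike_times = np.flatnonzero(np.asarray(spike_train) == 1)
    return np.diff(spike_times)

# Example: a pixel firing at time steps 2, 5, 6, and 10 yields ISIs [3, 1, 4].
print(spike_train_to_isi([0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1]))
```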
The proposed (ISI) segment encoding process involves
evaluating two intra-pixel prediction modes, the so-called
mean value mode (MVM) and forward mode (FM), and one
inter-pixel prediction mode, and selecting the mode leading to
the lowest rate-distortion cost; considering that the average pixel intensity is inversely proportional to the ISI, an intensity-based distance between two spike trains is proposed as the distortion measure. While intra-pixel prediction considers ISI data only from segments belonging to the pixel to be encoded, inter-pixel prediction considers ISI data from segments belonging to both the pixel to be encoded and neighboring pixels; the spike location component is implicitly conveyed by the order in which the basic coding units (each segment from a sensor pixel) are scanned and, thus, is not directly coded.
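As a rough illustration of this mode selection, the sketch below uses 1/ISI as an intensity proxy for the distortion term and a Lagrangian cost D + λR; the exact distance definition and cost formulation adopted in [23] may differ.

```python
import numpy as np

def intensity_distance(isi_a, isi_b):
    """Distance between two equal-length ISI segments computed in the
    intensity domain, using 1/ISI as an intensity proxy (the exact scaling
    and norm used in the original work may differ)."""
    ia = 1.0 / np.asarray(isi_a, dtype=float)
    ib = 1.0 / np.asarray(isi_b, dtype=float)
    return float(np.mean(np.abs(ia - ib)))

def select_mode(candidates, lam):
    """Pick, among candidate prediction modes, the one minimizing the
    rate-distortion cost D + lam * R; each candidate is a
    (mode_name, distortion, rate_bits) triple."""
    return min(candidates, key=lambda c: c[1] + lam * c[2])
```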
The intra-pixel prediction mode MVM, designed to deal
with homogeneous segments, i.e., segments with similar ISI
values, involves the computation of the mean value of the ISIs
within a segment followed by the subtraction of that mean
value from all the segment ISI values, obtaining ISI
(prediction) residuals. The intra-pixel prediction mode FM,
designed to deal with non-homogeneous segments, i.e.,
segments with varying ISI values, involves motion estimation
and compensation in the temporal dimension only, i.e.,
considering previously coded segments from the same pixel
location only. After finding the reference (segment) candidate
that minimizes the intensity-based distance with respect to the
segment to be encoded, the ISI (prediction) residuals are obtained and the associated motion vector is predicted from the average of the motion vectors of segments previously coded with the FM mode.
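The two intra-pixel modes can be sketched as follows, assuming equal-length segments; the reference search range, the handling of segment boundaries, and the motion vector coding details of [23] are omitted.

```python
import numpy as np

def mvm_residuals(segment):
    """Mean value mode (MVM): subtract the segment's mean ISI from every ISI,
    leaving the mean value plus the residuals to be coded."""
    segment = np.asarray(segment, dtype=float)
    mean_isi = segment.mean()
    return mean_isi, segment - mean_isi

def fm_residuals(segment, past_segments):
    """Forward mode (FM): search previously coded segments of the same pixel
    for the reference minimizing the intensity-based distance (here a simple
    1/ISI proxy) and code the ISI residuals plus the temporal displacement."""
    segment = np.asarray(segment, dtype=float)
    best_idx = min(
        range(len(past_segments)),
        key=lambda i: np.mean(np.abs(1.0 / segment -
                                     1.0 / np.asarray(past_segments[i], dtype=float))))
    reference = np.asarray(past_segments[best_idx], dtype=float)
    return best_idx, segment - reference
```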
As far as the inter-pixel prediction mode is concerned, it
involves motion estimation and compensation in both
temporal and spatial dimensions, i.e., considering previously
coded segments from both the same pixel and neighboring
pixels in a causal spatio-temporal window. Thus, for each
reference (segment) candidate, the intensity-based distance
with respect to the segment to be encoded is computed and the
one with the lowest distance is selected as the prediction
segment; the (prediction) residuals of ISIs are then obtained
by subtracting the prediction segment from the segment to be
encoded. The spatio-temporal motion vector associated with the prediction segment (with components in the x, y, and temporal dimensions) is then
predicted from the average of the spatio-temporal motion
vectors of previously coded segments from the same pixel and
from segments of 4 neighboring pixels (in a causal spatial
window). After the intra- and inter-pixel coding, the
corresponding prediction residuals are quantized (with a
varying quantization step size at each ISI) and entropy coded with a context-based adaptive entropy coder, generating the spike coding bitstream.
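The sketch below illustrates the motion vector prediction and residual quantization steps just described; the per-ISI step size adaptation rule and the context modeling of the entropy coder in [23] are not reproduced.

```python
import numpy as np

def predict_motion_vector(same_pixel_mvs, neighbor_mvs):
    """Predict the spatio-temporal motion vector (dx, dy, dt) as the average
    of the motion vectors of previously coded segments from the same pixel
    and from the 4 causal neighboring pixels."""
    mvs = np.asarray(list(same_pixel_mvs) + list(neighbor_mvs), dtype=float)
    return mvs.mean(axis=0)

def quantize_residuals(residuals, step_sizes):
    """Uniform quantization of the ISI prediction residuals, with a possibly
    different step size per ISI position (the exact step adaptation rule of
    the original work is not reproduced here)."""
    residuals = np.asarray(residuals, dtype=float)
    steps = np.asarray(step_sizes, dtype=float)
    return np.round(residuals / steps).astype(int)
```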
Experimental results on the PKU-Spike dataset, proposed in [23] for the evaluation of spike data coding algorithms, report the compression performance in terms of CR for several quantization parameter (QP) values, with average CR values (over the PKU-Spike dataset) ranging between 23.04 (for QP = 4) and 53.41 (for QP = 32). Reference [23] also shows the PSNR and SSIM evolution with QP, where each PSNR/SSIM value is the average of 1000 values computed between still intensity images reconstructed from the raw spike data and from the decoded spike data.
2) HYBRID CODING OF SPATIOTEMPORAL SPIKE DATA
FOR A BIO-INSPIRED CAMERA [26]
In 2021, Zhu et al. extended the lossy coding framework in [23] by incorporating adaptive polyhedron partitioning, intra- and inter-polyhedron prediction, transform, and multi-layer quantization, to code time interval sequences created from the spike data generated by a spike camera (consisting of ‘ON’/‘OFF’, i.e., binary, values) [26].
In the proposed framework, each spike train (sequence of ‘ON’/‘OFF’ spikes) outputted by a single sensor pixel along time is first converted into a sequence of inter-spike intervals (ISIs), i.e., ‘waiting’ times between consecutive spikes. Then,
the ISI sequences associated with all the sensor pixels are structured into an ISI volume and divided into macro cuboids, i.e., cuboids with the full sensor pixel array spatial resolution and a predefined time length. Each macro cuboid is in turn partitioned into multiple spike cuboids, each corresponding to a 2×2 pixel area in the spatial dimension, called a pixel group. The set of ISI sequences belonging to a pixel group is further adaptively partitioned into multiple polyhedrons according to the motion characteristics; the polyhedron constitutes, thus, the basic coding unit of the proposed coding framework.
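For illustration, the sketch below partitions a dense ISI volume into macro cuboids and 2×2 pixel groups; representing the ISI data as a dense (H, W, T) array is a simplifying assumption, and the adaptive, motion-driven polyhedron partitioning of each pixel group is not reproduced.

```python
import numpy as np

def partition_isi_volume(isi_volume, macro_len):
    """Split an ISI volume of shape (H, W, T) into macro cuboids of
    predefined temporal length macro_len (full spatial resolution), and each
    macro cuboid into 2x2 pixel groups."""
    H, W, T = isi_volume.shape
    pixel_groups = []
    for t0 in range(0, T, macro_len):
        macro = isi_volume[:, :, t0:t0 + macro_len]      # one macro cuboid
        for y0 in range(0, H, 2):
            for x0 in range(0, W, 2):
                pixel_groups.append(macro[y0:y0 + 2, x0:x0 + 2, :])
    return pixel_groups
```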