Sensor Applications
Multi-class Object Classification Using Ultra-low Resolution Time-of-Flight
Sensors
Andrea Fasolino1*, Paola Vitolo1*, Rosalba Liguori1**, Luigi Di Benedetto1***, Alfredo Rubino1**,
Danilo Pau2****, and Gian Domenico Licciardo1***
1Department of Industrial Engineering, University of Salerno, Fisciano, 84084 Salerno, Italy
2System Research and Applications, STMicroelectronics, 20041 Agrate Brianza, Italy
*Graduate Student Member, IEEE
** Member, IEEE
***Senior Member, IEEE
****Fellow, IEEE
Abstract—Time-of-Flight (ToF) sensors are generally used in combination with RGB sensors in image processing for
adding the third dimension to 2D scenes. Because of their low lateral resolution and contrast, they are scarcely used in
object detection or classification. In this work, we demonstrate that Ultra-Low Resolution (ULR) ToF sensors with 8x8 pixels
can be successfully used as stand-alone sensors for multi-class object classification when combined with simple machine
learning (ML) models, which can be implemented in a very compact and low-power custom circuit. Specifically, addressing an
STMicroelectronics VL53L8CX 8×8 pixel ToF sensor, the designed ToF+ML system is capable of classifying up to 10 classes
with an overall mean accuracy of 90.21%. The resulting hardware architecture, prototyped on an AMD Xilinx Artix-7 FPGA,
achieves an Energy per Inference of 65.6 nJ and a power consumption of 1.095 µW at the maximum Output
Data Rate of the sensor. These values are lower than the typical energy and power consumption of the sensor, enabling
real-time post-processing of depth images with significantly better performance than the state-of-the-art in the literature.
Index Terms—Time-of-flight, Ultra-low-resolution, FPGA, Machine Learning, Edge Computing, image classification, Hardware Design,
on-tiny-device learning.
I. INTRODUCTION
Time-of-Flight (ToF) sensors are usually employed in image
processing applications to enhance the performance of red-blue-green
(RGB) sensors, by the addition of depth information to bidimensional
pictures [1]–[4]. That is because, generally, the depth map of the scene
alone is unsuitable for conventional image processing, such as
edge detection or morphological operations, due to ToFs' very low
lateral resolution (up to 640×480 pixels for commercial sensors)
and contrast compared to typical RGB sensors. However, employing
ToF and RGB sensors simultaneously, along with the complex setup
for processing the acquired data, can be challenging [5] in edge
devices that must be compact, lightweight, and aesthetically pleasing
[6]. On the other hand, ToF is very fast compared to laser scanning
methods for capturing 3D scenarios [7], which makes this sensor very
interesting for real-time applications. The recent literature shows that
the use of ToF as the only sensor for classification is very challenging.
In [8], an 8×8 ToF is used for binary-only classification in combination
with a custom CNN with 2 conv layers; in [9], images from a 64×64
ToF-SPAD array, upscaled to 320×320, are used with a 3-conv-layer
CNN for 3-class classification. The 2-conv-layer CNN in
[10] achieves 86.3% mean accuracy on 6 classes with a 32×32 ToF.
Therefore, ToF classification is performed with relatively complex
network models [8], [11]–[13], requiring computing capabilities not
suitable for the implementation in a dedicated HW core embedded
Corresponding author: Gian Domenico Licciardo (e-mail: gdlicciardo@unisa.it).
Digital Object Identifier 10.1109/LSENS.2024.3467165
in the sensing element [14]–[16]. In this work, we demonstrate for
the first time in the literature that an Ultra-Low Resolution (ULR) ToF sensor
(8x8 pixels) can be successfully used as a stand-alone sensor for
multi-class object classification, even when it is combined with only a simple
Convolutional Neural Network (CNN) suitable to be processed in the
circuitry already equipping the sensor. In doing so, taking as reference
the STMicroelectronics VL53L8CX 8×8 pixel ULR ToF [17], on the
one hand the complexity of the CNN model has been constrained to
an implementation whose power dissipation and memory requirements
are negligible compared to those of the sensor. On the other hand, we
trained and validated the ToF+CNN system with different datasets,
each one composed of an increased number of objects with respect to
the previous one, ensuring a mean classification accuracy higher than
90%, set as the threshold for acceptable accuracy. Therefore,
the primary contributions of this study can be summarized as follows:
- demonstration of a multi-class classification system using an 8x8 stand-alone ToF sensor;
- design of an ultra-tiny CNN that optimizes the complexity-accuracy trade-off, resulting in a model capable of in-sensor performance.
The derived system obtains a mean accuracy ranging from 93.6%
on the smallest dataset (4 classes) to 90.21% on 10 classes, and the
HW implementation of the CNN model has an Energy per Inference
E_inf = 65.6 nJ and a dynamic power P_dyn = 1.095 µW when prototyped
on an AMD Xilinx Artix-7 FPGA (xc7a35tfgg484-1). These
results surpass the state of the art in the literature, both in terms
of classification capability and compactness of the system. The
paper is organized as follows: Section II illustrates the dataset used
for the training, the designed model, and its performance when
optimized through post-training 8-bit quantization; Section
III describes the HW post-processing architecture, the results of the
FPGA deployment, and the comparison with the sensor performance;
Section IV concludes the paper.
II. MACHINE LEARNING MODEL
The hardware-aware design of the classification model leverages
a dataset that contains images from a publicly available dataset [18]
and images acquired with the STMicroelectronics VL53L8CX 8×8
pixel ToF sensor [17].
A. Dataset
The dataset used for model definition and training is the open-
source dataset presented in [18]. The dataset comprises 300 objects
in 51 categories, with depth images provided in 640x480 pixel
format. All scenes are described by RGB images, depth images,
binary masks, coordinates of the objects' centers, and cropped images
containing only the central object of the scene.
Considering that the VL53L8CX ToF ranges up to 400 cm with
a 65° diagonal FoV and provides 8x8 pixel depth maps, the
cropped images are rescaled and subsampled accordingly. These
transformations are consistent with the adopted ToF sensor, which frames
a limited portion of the scene. This is not a significant limitation to
the sensor employment, since the object of interest is assumed to lie within
the central portion of the image, which is coherent with several use cases,
including those where the point of focus lies along the line of sight.
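As an illustration, the sketch below shows one way the 640x480 depth crops from [18] could be rescaled and subsampled to emulate the 8x8 depth maps of the VL53L8CX. The block-averaging strategy, the 8-bit normalization against the 4 m range, and the helper name depth_crop_to_8x8 are our assumptions, not the authors' exact preprocessing pipeline.

```python
import numpy as np

def depth_crop_to_8x8(depth_mm: np.ndarray, max_range_mm: int = 4000) -> np.ndarray:
    """Emulate an 8x8 ToF depth map from a higher-resolution depth crop.

    depth_mm: 2D array of depth values in millimetres (e.g., a cropped
    640x480 frame from the RGB-D dataset). The crop is block-averaged
    down to 8x8 zones and quantized to 8 bits over the 4 m sensor range.
    """
    h, w = depth_mm.shape
    # Trim so the image splits evenly into an 8x8 grid of zones.
    h8, w8 = h - h % 8, w - w % 8
    zones = depth_mm[:h8, :w8].astype(np.float32).reshape(8, h8 // 8, 8, w8 // 8)
    zone_depth = zones.mean(axis=(1, 3))                 # one averaged depth per zone
    zone_depth = np.clip(zone_depth, 0, max_range_mm)
    return (zone_depth / max_range_mm * 255).astype(np.uint8)  # 8-bit input tensor

# Example with a synthetic 640x480 depth crop.
frame = np.random.randint(200, 4000, size=(480, 640)).astype(np.float32)
print(depth_crop_to_8x8(frame).shape)  # (8, 8)
```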
Several subsets have been used, composed of an incremental number
of classes. The smallest subset is composed of 4 classes, already more than
the datasets in the existing literature, chosen for their shapes close
to basic geometric shapes: apple, calculator, water bottle, and food
box. The other subsets have been composed by gradually adding
new classes while keeping the overall mean accuracy higher than 90%. This
limit is reached with 10 classes, obtained by adding dry battery,
glue stick, peach, plate, and coffee mug to the initial subset.
B. Training, optimization, and performance
The proposed NN model, schematized in Fig. 1, is composed as
follows:
1) input tensor with shape 8x8x1 composed of 8-bit data;
2) first convolutional layer (Conv1) with three channels, padding
’same’, and kernels of size 3x3;
3) second convolutional layer (Conv2) with four channels, padding
’valid’, and kernels of size 3x3;
4) one fully connected (fc) layer, with one neuron per class, that executes the
classification over the selected classes with a softmax activation
function.
All the convolutional layers have a Rectified Linear Unit (ReLU)
activation function and unitary stride. No pooling layers are used due
to the small dimensions of the input tensor.
The model training was conducted by partitioning the datasets
into 70% for training, 15% for validation, and 15% for testing. The
stochastic gradient descent optimizer was chosen to achieve a more
stable solution compared to the faster Adam optimizer. The other training
parameters are listed in Table 1. A dropout of 25% was
applied after the activation of each convolutional layer. A piece-wise
learning rate schedule with a decrement factor of 0.5 every 15 epochs is used. The
resulting confusion matrices on the 4-class and 10-class datasets are reported in Fig.
2 and Fig. 3, while Table 2 reports the main performance parameters of
the CNN on the three datasets composed of 4, 8, and 10 classes.

Fig. 1: Output shapes of the proposed custom model, where the fc
dimension, N, depends on the number of classes. The conv layers
have a unitary stride and 3x3 kernels. The Conv1 layer has padding 'same',
whereas the Conv2 layer and the fc layer have padding 'valid'.
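A possible training setup consistent with Table 1 is sketched below, assuming TensorFlow/Keras. The SGD optimizer, categorical cross-entropy loss, initial learning rate, batch sizes, epoch budget, patience, and the halving of the learning rate every 15 epochs come from the text and Table 1, whereas the data arguments are placeholders for the 70/15/15 split of the 8x8 depth maps.

```python
import tensorflow as tf

def halve_every_15_epochs(epoch: int, lr: float) -> float:
    # Piece-wise schedule: multiply the learning rate by 0.5 every 15 epochs.
    return lr * 0.5 if epoch > 0 and epoch % 15 == 0 else lr

def train(model, x_train, y_train, x_val, y_val, batch_size=32):
    """Train with the Table 1 settings; data come from the 70/15/15 split."""
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),   # SGD, initial lr = 0.001
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    callbacks = [
        tf.keras.callbacks.LearningRateScheduler(halve_every_15_epochs),
        tf.keras.callbacks.EarlyStopping(patience=6, restore_best_weights=True),  # patience 6
    ]
    return model.fit(
        x_train, y_train,
        validation_data=(x_val, y_val),
        epochs=300,                       # Table 1
        batch_size=batch_size,            # 32 (4 classes) to 128 (10 classes)
        callbacks=callbacks,
    )
```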
To test the robustness of the model for HW implementation, its
activations and weights have been quantized from 32-bit
floating point to 8-bit fixed point with a post-training quantization.
On the 4-class dataset, the quantized model achieves an overall
accuracy of 93.23% and a loss of 0.0919, showing a negligible decrease
of 0.37% in accuracy and an increase of 0.0381 in loss with respect
to the full-precision model. No more aggressive quantization is considered
because of a strong reduction of the model performance when going
below 8-bit precision.
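As an example, a comparable post-training 8-bit quantization could be obtained with the TensorFlow Lite converter in full-integer mode, as sketched below. The representative-dataset generator and the uint8 input/output types are assumptions, since the paper targets a custom 8-bit fixed-point HW implementation rather than a TFLite runtime.

```python
import numpy as np
import tensorflow as tf

def quantize_int8(model: tf.keras.Model, calib_images: np.ndarray) -> bytes:
    """Full-integer post-training quantization of weights and activations."""
    def representative_data():
        # A few hundred 8x8 training depth maps calibrate the activation ranges.
        for sample in calib_images[:200]:
            yield [sample.reshape(1, 8, 8, 1).astype(np.float32)]

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_data
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.uint8    # 8-bit sensor data in
    converter.inference_output_type = tf.uint8   # 8-bit class scores out
    return converter.convert()                   # serialized int8 model
```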
III. FPGA IMPLEMENTATION
A. HW architecture
The proposed quantized CNN model has been used to design a
custom post-processing core to verify the feasibility of the CNN
integration inside the sensor. The HW architecture (Fig. 4) has
been implemented on an AMD Xilinx Artix-7 (xc7a35tfgg484-
1) FPGA with Verilog Hardware Description Language using
the Vivado 2024.1 toolchain.
Table 1: Training parameters obtained through progressive fine-tuning.
The patience value is the number of epochs that the early-stopping
callback waits before stopping the training.
optimizer stochastic gradient descent
loss function categorical cross-entropy
initial learning rate 0.001
batch size 32 (4 classes)-128 (10 classes)
epochs 300
patience 6
Table 2: Overall performances of the proposed model over four, eight
and ten classes.
# classes 4 8 10
accuracy [%] 93.60 92.07 90.21
loss 0.054 0.028 0.026
precision [%] 93.58 90.12 88.19
recall [%] 93.48 89.60 90.26
F1 score [%] 93.45 89.77 88.31
Fig. 2: Confusion matrix of the proposed model on 4 classes.
Fig. 3: Confusion matrix of the proposed model on 10 classes.
The proposed scheme directly reflects the classification model, to ensure
minimal HW under-utilization and energy consumption. It is composed of:
- the Serial Peripheral Interface (SPI) circuitry: a serial-input parallel-output (SIPO) register to receive data from the master-output slave-input (mosi) bus and a parallel-input serial-output (PISO) register to send the classification results on the master-input slave-output (miso) bus;
- three computational cores, namely conv1, conv2, and fc, which apply the arithmetic operations;
- two memories, the image memory (image mem) to store the input data from the ToF sensor and the feature memory (feat mem) to store the intermediate feature map produced by the filtering operation of conv1.
Input data are received from the ToF sensor through an SPI mosi
bus. The 8-bit data stored in the image mem feeds the conv1 processing
core, which produces the intermediate feature map. This is stored
in the feature memory and feeds the conv2 processing core. Finally,
the fc processing core produces the final classification.
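To make this dataflow concrete, the following NumPy reference model mimics the sequence image mem -> conv1 -> feat mem -> conv2 -> fc. The integer widths, the absence of per-layer requantization, and the function names are simplifying assumptions and do not reproduce the exact fixed-point arithmetic of the core.

```python
import numpy as np

def conv2d(x, w, b, padding):
    """Naive stride-1 2D convolution over an HxWxC input, mirroring conv1/conv2."""
    if padding == "same":
        x = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    kh, kw, cin, cout = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.zeros((oh, ow, cout), dtype=np.int32)        # wide accumulator, as in the HW MACs
    for i in range(oh):
        for j in range(ow):
            patch = x[i:i + kh, j:j + kw, :]
            y[i, j, :] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2])) + b
    return np.maximum(y, 0)                             # ReLU

def classify(image_mem, w1, b1, w2, b2, wf, bf):
    feat_mem = conv2d(image_mem.astype(np.int32), w1, b1, "same")   # conv1 -> feature memory
    feat = conv2d(feat_mem, w2, b2, "valid")                        # conv2 on the 8x8x3 map
    logits = feat.reshape(-1) @ wf + bf                             # fc over the flattened 6x6x4 map
    return int(np.argmax(logits))                                   # class index returned over miso
```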
Fig. 4: Data-driven post-processing core directly derived from the
model in Fig. 1. Mosi and miso form the Serial Peripheral Interface
(SPI) with the sensor. The image memory stores the input image,
while the feature (feat) memory stores the intermediate features.
B. Results
The proposed HW architecture utilizes 2867 Look-Up Tables
(LUTs), 1088 Flip-Flops (FFs), 3.5 Block Random Access Memories
(BRAMs), and no Digital Signal Processors (DSPs). The maximum
number of DSPs available for the implementation was set to 0 to
obtain target-independent prototyping. The overall
power consumption is P_onchip = 74 mW, of which the dynamic power
consumption amounts to P_dyn = 2.187 mW at a clock frequency
f_clk = 12 MHz. The implemented architecture is divided into five
blocks to provide a breakdown of utilized resources in Fig. 5a and
Fig. 5b and power consumption in Fig. 5c: Control Unit (C.U.),
memories (MEM), conv1, conv2, fc, and glue logic (others). The
computing cores (conv1, conv2, and fc) account for more than half of the
utilized resources, whereas the main contribution to the power consumption
is due to the BRAMs. Since the maximum Output Data Rate of the
sensor is 15 Hz, an operating frequency of 6 kHz is sufficient to ensure
real-time processing of the depth images. When f_clk = 6
kHz, the proposed post-processing architecture has an Energy per
Inference E_inf = 65.6 nJ and a P_dyn = 1.095 µW. As reported in
Table 3, the above results are orders of magnitude below the 200
mW power consumption and 14 mJ per acquisition of the ToF
sensor in active ranging mode.
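The consistency of these figures can be cross-checked with the short calculation below. The assumption that dynamic power scales roughly linearly with clock frequency, and the cycle count derived from it, are ours and serve only to show that the reported values fit within the 66.7 ms frame period at 15 Hz.

```python
# Reported figures: P_dyn = 2.187 mW at f_clk = 12 MHz, E_inf = 65.6 nJ at f_clk = 6 kHz.
P_DYN_12MHZ = 2.187e-3        # W
E_INF = 65.6e-9               # J
F_SLOW = 6e3                  # Hz

# Assuming dynamic power scales roughly linearly with clock frequency:
p_dyn_6khz = P_DYN_12MHZ * F_SLOW / 12e6
print(f"P_dyn @ 6 kHz ~ {p_dyn_6khz * 1e6:.2f} uW")        # ~1.09 uW, close to the reported 1.095 uW

# Implied inference latency and cycle count at 6 kHz:
t_inf = E_INF / p_dyn_6khz                                  # ~60 ms
cycles = t_inf * F_SLOW                                      # ~360 clock cycles per inference
print(f"t_inf ~ {t_inf * 1e3:.1f} ms ({cycles:.0f} cycles), frame period at 15 Hz = {1000/15:.1f} ms")
```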
Table 3: Comparison between the proposed model and the state-of-the-art ULR ToF classification models. A comparison with the most compact RGB classification model exploiting the [18] dataset has been added (marked with *).

work         | [19]*     | this work   | [8]      | [9]                     | [10]
model        | custom    | custom      | custom   | VGG-16                  | custom
conv layers  | 3         | 2           | 3        | 17                      | 2
# kernels    | 32,64,128 | 3,4         | 16,32,64 | -                       | NA
accuracy [%] | 76.3      | 93.6 / 90.2 | 91.8     | 95                      | 86.3
# classes    | 51        | 4 / 10      | 2        | 3                       | 6
input size   | 640x480   | 8x8         | 8x8      | 320x320 (64x64 resized) | 32x32
Fig. 5: Breakdown of the C.U., MEM, conv1, conv2, fc, and others blocks
in terms of utilized resources and power consumption, with respect
to a total of 2867 LUTs, 1088 FFs, and 2.187 mW: (a) LUTs, (b)
FFs, (c) power.
The above results prove that the proposed ToF+CNN system meets
the initial constraints of a negligible impact on the circuitry of the
sensor, while being capable of real-time performance, namely, it
processes an input frame before the acquisition of the next one.
Together with the results in Fig. 2, Fig. 3, and Table 2, this proves that
the performance of the proposed system represents the state-of-the-
art of the classification systems exploiting only ULR ToF sensors.
Comparisons with the recent literature exploiting ULR ToF-only
sensors are reported in Table 3 in terms of mean overall accuracy,
complexity of the NN (estimated by the number of conv layers,
number of kernels per conv layer, and input size of the models),
and number of classes. As a reference, the most compact RGB-based
classification model presented in the recent literature and exploiting
the [18] dataset has been added [19]. Unfortunately, a fair comparison
of the proposed architecture is not possible, since the alternatives
do not target HW designs: they exploit commercial platforms (e.g.,
Raspberry Pi) which require orders of magnitude more resources than
the proposed one. Our custom HW architecture enables the real-time
operation of the multi-class classification task while reducing the
energy per inference by more than five orders of magnitude with
respect to the sensor and [8]. Moreover, only 33% additional power
is needed compared to the sensor power consumption.
IV. CONCLUSIONS AND FUTURE PROSPECTS
This paper presents a new multi-class image classification system
exploiting only a ULR ToF sensor. To enable the integration of the
classification feature in the ToF's packaging, closely coupling the smart
capability with the sensing element, a very compact post-processing
circuit is designed. As the first implementation of a feasibility study
in this field, the proposed system achieves state-of-the-art performance over
10 classes and meets the real-time constraints of the target sensor,
with a negligible energy consumption per inference, five orders of
magnitude lower than the state-of-the-art works. Future developments
will explore the capability of the system to detect objects in scenes
with partial occlusions and object overlaps, as well as the implementation
of the CNN in a CMOS technology compatible with that of the
sensor.
REFERENCES
[1] W. Zhou, E. Yang, J. Lei, J. Wan, and L. Yu, “Pgdenet: Progressive guided fusion
and depth enhancement network for rgb-d indoor scene parsing,” IEEE Transactions
on Multimedia, vol. 25, pp. 3483–3494, 2023.
[2] M. Zhang, S. Yao, B. Hu, Y. Piao, and W. Ji, “C2dfnet: Criss-cross dynamic filter
network for rgb-d salient object detection,” IEEE Transactions on Multimedia,
vol. 25, pp. 5142–5154, 2023.
[3] M. Oppliger, J. Gutknecht, R. Gubler, M. Ludwig, and T. Loeliger, “Sensor fusion
of 3d time-of-flight and thermal infrared camera for presence detection of living
beings,” in 2022 IEEE Sensors, 2022, pp. 1–4.
[4] Z. Zeng, H. Liu, F. Chen, and X. Tan, “Airsod: A lightweight network for rgb-d
salient object detection,” IEEE Transactions on Circuits and Systems for Video
Technology, pp. 1–1, 2023.
[5] Q. Zhu, W. Sun, Y. Dai, C. Li, S. Zhou, R. Feng, Q. Sun, C. C. Loy, J. Gu,
Y. Yu et al., “Mipi 2023 challenge on rgb+ tof depth completion: Methods and
results,” in Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 2023, pp. 2864–2870.
[6] D. Kim and Y. Choi, “Applications of smart glasses in applied sciences: A
systematic review,” Applied Sciences, vol. 11, no. 11, 2021. [Online]. Available:
https://www.mdpi.com/2076-3417/11/11/4956
[7] C. Bamji, J. Godbaz, M. Oh, S. Mehta, A. Payne, S. Ortiz, S. Nagaraja, T. Perry, and
B. Thompson, “A review of indirect time-of-flight technologies,” IEEE Transactions
on Electron Devices, vol. 69, no. 6, pp. 2779–2793, 2022.
[8] J. Pleterski, G. Škulj, C. Esnault, J. Puc, R. Vrabič, and P. Podržaj, “Miniature
mobile robot detection using an ultralow-resolution time-of-flight sensor,” IEEE
Transactions on Instrumentation and Measurement, vol. 72, pp. 1–9, 2023.
[9] A. D. Ruvalcaba-Cardenas, T. Scoleri, and G. Day, “Object classification using
deep learning on extremely low-resolution time-of-flight data,” in 2018 Digital
Image Computing: Techniques and Applications (DICTA), 2018, pp. 1–7.
[10] G. Nash and V. Devrelis, “Flash lidar imaging and classification of vehicles,” in
2020 IEEE SENSORS, 2020, pp. 1–4.
[11] G. Mora-Martín, A. Turpin, A. Ruget, A. Halimi, R. Henderson, J. Leach,
and I. Gyongy, “High-speed object detection with a single-photon time-of-flight
image sensor,” Opt. Express, vol. 29, no. 21, pp. 33184–33196, Oct. 2021.
[Online]. Available: https://opg.optica.org/oe/abstract.cfm?URI=oe-29-21-33184
[12] S. Salerno, “8x8 tof gesture classification - arduino rp2040 connect,” last
accessed: 2024.01.04. [Online]. Available: https://docs.edgeimpulse.com/experts/
novel-sensor-projects/tof-gesture-classification-arduino-rp2040-connect
[13] S.-Y. Lee, T.-Y. Huang, C.-Y. Yen, I.-P. Lee, J.-Y. Chen, and C.-R. Huang, “A
deep learning-based cloud-edge healthcare system with time-of-flight cameras,”
IEEE Sensors Journal, vol. 24, no. 5, pp. 7064–7074, 2024.
[14] P. Vitolo, A. De Vita, L. D. Benedetto, D. Pau, and G. D. Licciardo, “Low-power
detection and classification for in-sensor predictive maintenance based on vibration
monitoring,” IEEE Sensors Journal, vol. 22, no. 7, pp. 6942–6951, 2022.
[15] A. De Vita, G. D. Licciardo, A. Femia, L. Di Benedetto, A. Rubino, and D. Pau,
“Embeddable circuit for orientation independent processing in ultra low-power
tri-axial inertial sensors,” IEEE Transactions on Circuits and Systems II: Express
Briefs, vol. 67, no. 6, pp. 1124–1128, 2020.
[16] A. Fasolino, P. Vitolo, R. Liguori, L. Di Benedetto, A. Rubino, and G. D. Licciardo,
“Dynamically adaptive accumulator for in-sensor ann hardware accelerators,” in
2024 IEEE International Symposium on Circuits and Systems (ISCAS), 2024, pp.
1–5.
[17] STMicroelectronics, “VL53L8CX,” last accessed: 2024.01.29. [Online]. Available:
https://www.st.com/en/imaging-and-photonics-solutions/vl53l8cx.html
[18] K. Lai, L. Bo, X. Ren, and D. Fox, “A large-scale hierarchical multi-view rgb-d
object dataset,” in 2011 IEEE International Conference on Robotics and Automation,
2011, pp. 1817–1824.
[19] S. Soltan, A. Oleinikov, M. F. Demirci, and A. Shintemirov, “Deep learning-based
object classification and position estimation pipeline for potential use in robotized
pick-and-place operations,” Robotics, vol. 9, no. 3, 2020. [Online]. Available:
https://www.mdpi.com/2218-6581/9/3/63