Article

A Deep Learning-Based Cloud-Edge Healthcare System With Time-of-Flight Cameras


Abstract

This study proposes a comprehensive, vision-based long-term healthcare system that comprises time-of-flight (ToF) cameras at the front end, a Raspberry Pi at the edge, and an image database with classification on a cloud server. First, the ToF cameras capture human actions as depth maps. Next, the Raspberry Pi performs image preprocessing and sends the resulting images to the cloud server over a wireless link. Finally, the cloud server carries out human action recognition using the proposed temporal frame correlation recognition model, which extends object detection into three-dimensional space based on continuous ToF images. Because the depth maps of ToF images do not record users' identities or environments, the system inherently protects user privacy. The study also builds a human action dataset in which each frame is recorded and labeled with one of five actions: sitting, standing, lying, getting up, and falling. With further optimization, the system can improve the long-term healthcare environment and relieve the nursing burden in elderly care.
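As a rough illustration of the edge-to-cloud flow described above, the sketch below shows a Raspberry Pi-side loop that preprocesses a depth map and posts it to a cloud classification endpoint. The driver call read_depth_frame(), the /classify endpoint, the 4 m clipping range, and the 224x224 input size are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical edge-side loop; read_depth_frame() and the /classify
# endpoint are placeholders, not the authors' actual interfaces.
import numpy as np
import cv2
import requests

CLOUD_URL = "http://cloud.example.com/classify"  # placeholder endpoint

def preprocess(depth_frame: np.ndarray) -> np.ndarray:
    """Clip, normalize to 8 bits, and resize the raw ToF depth map."""
    depth = np.clip(depth_frame, 0, 4000)            # keep the 0-4 m range
    depth = (depth / 4000.0 * 255).astype(np.uint8)  # scale to 0-255
    return cv2.resize(depth, (224, 224))             # assumed network input size

def send_to_cloud(frame_8bit: np.ndarray) -> dict:
    """Encode the preprocessed depth map as PNG and post it to the server."""
    ok, png = cv2.imencode(".png", frame_8bit)
    resp = requests.post(CLOUD_URL, files={"frame": png.tobytes()}, timeout=5)
    return resp.json()  # e.g. {"action": "falling", "score": 0.93}

# while True:
#     frame = read_depth_frame()            # hypothetical ToF driver call
#     result = send_to_cloud(preprocess(frame))
```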


... The two-convolutional-layer CNN in [10] achieves 86.3% mean accuracy on 6 classes with a 32 × 32 ToF sensor. ToF classification is therefore performed with relatively complex network models [8], [11]-[13], requiring computing capabilities unsuitable for implementation in a dedicated HW core embedded in the sensing element [14]-[16]. In this work, we demonstrate for the first time in the literature that an Ultra-Low Resolution (ULR) ToF sensor (8x8 pixels) can be successfully used as a stand-alone sensor for multi-class object classification, even when combined with only a simple Convolutional Neural Network (CNN) suitable for processing in the circuitry already equipping the sensor. ...
Article
Full-text available
Time-of-Flight (ToF) sensors are generally used in combination with RGB sensors in image processing for adding the third dimension to 2D scenes. Because of their low lateral resolution and contrast, they are scarcely used in object detection or classification. In this work, we demonstrate that Ultra-Low Resolution (ULR) ToF sensors with 8x8 pixels can be successfully used as stand-alone sensors for multi-class object detection even when combined only with machine learning (ML) models that can be implemented in a very compact and low-power custom circuit. Specifically, addressing an STMicroelectronics VL53L8CX 8×8 pixel ToF sensor, the designed ToF+ML system is capable of classifying up to 10 classes with an overall mean accuracy of 90.21%. The resulting hardware architecture, prototyped on an AMD Xilinx Artix-7 FPGA, achieves an energy per inference of 65.6 nJ and a power consumption of 1.095 μW at the maximum output data rate of the sensor. These values are lower than the typical energy and power consumption of the sensor itself, enabling real-time post-processing of depth images with significantly better performance than the state of the art in the literature.
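To make the scale of such a model concrete, below is a minimal PyTorch sketch of a small CNN over a single-channel 8x8 depth frame. The layer sizes and the 10-class output are assumptions for illustration, not the architecture actually mapped onto the sensor circuitry.

```python
# Illustrative tiny CNN for 8x8 single-channel ToF frames (not the paper's exact model).
import torch
import torch.nn as nn

class TinyToFNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # 8x8 -> 8x8
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 8x8 -> 4x4
            nn.Conv2d(8, 16, kernel_size=3, padding=1),  # 4x4 -> 4x4
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 4x4 -> 2x2
        )
        self.classifier = nn.Linear(16 * 2 * 2, num_classes)

    def forward(self, x):                                # x: (N, 1, 8, 8)
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

logits = TinyToFNet()(torch.randn(4, 1, 8, 8))           # -> (4, 10)
```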
Article
Full-text available
Human Action Recognition (HAR) aims to understand human behavior and assign a label to each action. It has a wide range of applications, and therefore has been attracting increasing attention in the field of computer vision. Human actions can be represented using various data modalities, such as RGB, skeleton, depth, infrared, point cloud, event stream, audio, acceleration, radar, and WiFi signal, which encode different sources of useful yet distinct information and have various advantages depending on the application scenarios. Consequently, lots of existing works have attempted to investigate different types of approaches for HAR using various modalities. In this paper, we present a comprehensive survey of recent progress in deep learning methods for HAR based on the type of input data modality. Specifically, we review the current mainstream deep learning methods for single data modalities and multiple data modalities, including the fusion-based and the co-learning-based frameworks. We also present comparative results on several benchmark datasets for HAR, together with insightful observations and inspiring future research directions.
Article
Full-text available
Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or "temporally deep", are effective for tasks involving sequences, visual and otherwise. We develop a novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and demonstrate the value of these models on benchmark video recognition tasks, image description and retrieval problems, and video narration challenges. In contrast to current models which assume a fixed spatio-temporal receptive field or simple temporal averaging for sequential processing, recurrent convolutional models are "doubly deep" in that they can be compositional in spatial and temporal "layers". Such models may have advantages when target concepts are complex and/or training data are limited. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Long-term RNN models are appealing in that they can directly map variable-length inputs (e.g., video frames) to variable-length outputs (e.g., natural language text) and can model complex temporal dynamics, yet they can be optimized with backpropagation. Our recurrent long-term models are directly connected to modern visual convnet models and can be jointly trained to simultaneously learn temporal dynamics and convolutional perceptual representations. Our results show such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.
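The core idea, per-frame convolutional features fed into a recurrent layer, can be sketched in PyTorch as follows; the backbone, feature size, and five-class output are illustrative assumptions rather than the architecture from the paper.

```python
# Minimal CNN+LSTM ("long-term recurrent convolutional") sketch.
import torch
import torch.nn as nn

class SimpleLRCN(nn.Module):
    def __init__(self, num_classes: int = 5, feat_dim: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(                  # per-frame feature extractor
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, 64, batch_first=True)
        self.head = nn.Linear(64, num_classes)

    def forward(self, clip):                       # clip: (N, T, 1, H, W)
        n, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(n, t, -1)
        out, _ = self.lstm(feats)                  # temporal modeling over frames
        return self.head(out[:, -1])               # classify from the last step

scores = SimpleLRCN()(torch.randn(2, 16, 1, 64, 64))   # -> (2, 5)
```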
Conference Paper
Full-text available
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For 300×300 input, SSD achieves 74.3% mAP on VOC2007 test at 59 FPS on an Nvidia Titan X, and for 512×512 input, SSD achieves 76.9% mAP, outperforming a comparable state-of-the-art Faster R-CNN model. Compared to other single-stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at https://github.com/weiliu89/caffe/tree/ssd.
Article
Full-text available
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of bounding box priors over different aspect ratios and scales per feature map location. At prediction time, the network generates confidences that each prior corresponds to objects of interest and produces adjustments to the prior to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. Our SSD model is simple relative to methods that require object proposals, such as R-CNN and MultiBox, because it completely discards the proposal generation step and encapsulates all the computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the ILSVRC DET and PASCAL VOC datasets confirm that SSD has comparable performance with methods that utilize an additional object proposal step and yet is 100-1000x faster. Compared to other single-stage methods, SSD has similar or better performance, while providing a unified framework for both training and inference.
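For intuition about the default boxes that both SSD abstracts describe, the short sketch below generates SSD-style priors for a single feature map; the scale and aspect ratios are example values, not the paper's exact configuration.

```python
# Generate (cx, cy, w, h) default boxes, normalized to [0, 1], for one feature map.
import itertools
import math

def default_boxes(fmap_size: int, scale: float, aspect_ratios=(1.0, 2.0, 0.5)):
    boxes = []
    for i, j in itertools.product(range(fmap_size), repeat=2):
        cx, cy = (j + 0.5) / fmap_size, (i + 0.5) / fmap_size  # cell center
        for ar in aspect_ratios:
            boxes.append((cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar)))
    return boxes

print(len(default_boxes(fmap_size=8, scale=0.2)))  # 8 * 8 * 3 = 192 boxes
```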
Conference Paper
Full-text available
http://www.utdallas.edu/~kehtar/UTD-MHAD.html
Article
Full-text available
We propose a deep convolutional neural network architecture codenamed "Inception", which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC 2014). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC 2014 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
Conference Paper
Full-text available
This article presents an interactive hand shape recognition user interface for American Sign Language (ASL) finger-spelling. The system makes use of a Microsoft Kinect device to collect appearance and depth images, and of the OpenNI+NITE framework for hand detection and tracking. Hand shapes corresponding to letters of the alphabet are characterized using appearance and depth images and classified using random forests. We compare classification using appearance and depth images, show that a combination of both leads to the best results, and validate on a dataset of four different users. The hand shape detection works in real time and is integrated into an interactive user interface that allows the signer to select between ambiguous detections; it is also coupled with an English dictionary for efficient writing.
Conference Paper
Full-text available
The paper presents an active vision system for the automatic detection of falls and the recognition of several postures for elderly homecare applications. A wall-mounted Time-of-Flight camera provides accurate measurements of the acquired scene in all illumination conditions, allowing the reliable detection of critical events. Preliminarily, an off-line calibration procedure estimates the external camera parameters automatically without landmarks, calibration patterns, or user intervention. The calibration procedure searches for different planes in the scene, selecting the one that satisfies the floor-plane constraints. Subsequently, the moving regions are detected in real time by applying a Bayesian segmentation to the whole 3D point cloud. The distance of the 3D human centroid from the floor plane is evaluated by using the previously defined calibration parameters, and the corresponding trend is used as a feature in a thresholding-based clustering for fall detection. The fall detection shows high performance in terms of efficiency and reliability on a large real dataset in which almost half of the events are falls acquired under different conditions. The posture recognition is carried out by using both the 3D human centroid distance from the floor plane and the orientation of the body spine, estimated by applying a topological approach to the range images. Experimental results on synthetic data validate the correctness of the proposed posture recognition approach.
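The centroid-to-floor-plane cue can be sketched as follows, assuming the calibrated floor plane is already available; the 0.40 m height threshold and the 15-frame persistence window are illustrative values, not the thresholds from the paper.

```python
# Threshold the 3D centroid's distance to the floor plane over time.
import numpy as np

def plane_distance(point: np.ndarray, plane: np.ndarray) -> float:
    """Distance of a 3D point to plane (a, b, c, d) with unit normal (a, b, c)."""
    a, b, c, d = plane
    return abs(a * point[0] + b * point[1] + c * point[2] + d)

def is_fall(centroid_track, plane, height_thresh=0.40, min_frames=15):
    """Flag a fall if the centroid stays below ~40 cm for min_frames consecutive frames."""
    low = [plane_distance(p, plane) < height_thresh for p in centroid_track]
    run = 0
    for flag in low:
        run = run + 1 if flag else 0
        if run >= min_frames:
            return True
    return False

floor = np.array([0.0, 1.0, 0.0, 0.0])          # assume y = 0 is the floor plane
track = [np.array([0.1, 0.2, 2.0])] * 20        # centroid stuck near the floor
print(is_fall(track, floor))                    # True
```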
Article
The explosive growth in video streaming gives rise to challenges in performing video understanding at high accuracy and low computation cost. Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships; 3D-CNN-based methods can achieve good performance but are computationally intensive, making them expensive to deploy. In this paper, we propose a generic and effective Temporal Shift Module (TSM) that enjoys both high efficiency and high performance. Specifically, it can achieve the performance of 3D CNNs while maintaining 2D CNN complexity. TSM shifts part of the channels along the temporal dimension, thus facilitating information exchange among neighboring frames. It can be inserted into 2D CNNs to achieve temporal modeling at zero computation and zero parameters. We also extend TSM to the online setting, which enables real-time low-latency online video recognition and video object detection. TSM is accurate and efficient: it ranked first on the Something-Something leaderboard upon publication; on Jetson Nano and Galaxy Note8, it achieves a low latency of 13 ms and 35 ms for online video recognition. The code is available at https://github.com/mit-han-lab/temporal-shift-module.
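A simplified re-implementation of the shift operation described above is given below; shifting one eighth of the channels each way (fold_div=8) follows the commonly reported setting, and the function is a sketch rather than the released code.

```python
# Shift a fraction of channels forward/backward along the temporal axis.
import torch

def temporal_shift(x: torch.Tensor, fold_div: int = 8) -> torch.Tensor:
    """x: (N, T, C, H, W); returns a tensor of the same shape."""
    n, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # pull features from the next frame
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # pull features from the previous frame
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels untouched
    return out

shifted = temporal_shift(torch.randn(2, 8, 64, 7, 7))      # same shape out
```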
Article
The number of elderly people is increasing every year, and fall detection for the elderly has become an important research topic. Fall detection based on image processing is considered a good solution. However, algorithms based on motion recognition have difficulty distinguishing between a person who has fallen and a person who is sleeping. This research tracks and analyzes the motion speed of human joints, which improves the accuracy of fall detection. The experimental results show that this method effectively distinguishes falling from sleeping.
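The velocity cue that separates a fall from lying down to sleep can be sketched as follows; the joint-array layout, the 30 fps frame rate, and the 1.5 m/s peak-speed threshold are assumptions for illustration only.

```python
# Peak joint speed: high for a fall, low for a slow transition to lying/sleeping.
import numpy as np

def peak_joint_speed(joints: np.ndarray, fps: float = 30.0) -> float:
    """joints: (T, J, 3) positions in meters; return the peak joint speed in m/s."""
    vel = np.diff(joints, axis=0) * fps              # (T-1, J, 3) per-frame velocities
    return float(np.linalg.norm(vel, axis=2).max())

def lying_event_is_fall(joints: np.ndarray, speed_thresh: float = 1.5) -> bool:
    return peak_joint_speed(joints) > speed_thresh

slow = np.linspace(0, 0.2, 60).reshape(-1, 1, 1) * np.ones((60, 15, 3))
print(lying_event_is_fall(slow))                     # False: slow descent, not a fall
```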
Article
As the world elderly population is increasing rapidly, the use of technology for the development of accurate and fast automatic fall detection systems has become a necessity. Most of the fall detection systems are developed for specific devices which reduces the versatility of the fall detection system. This paper proposes a centralized unobtrusive IoT based device-type invariant fall detection and rescue system for monitoring of a large population in real-time. Any type of devices such as Smartphones, Raspberry Pi, Arduino, NodeMcu, and Custom Embedded Systems can be used to monitor a large population in the proposed system. The devices are placed into the users’ left or right pant pocket. The accelerometer data from the devices are continuously sent to a multithreaded server which hosts a pre-trained machine learning model that analyzes the data to determine whether a fall has occurred or not. The server sends the classification results back to the corresponding devices. If a fall is detected, the server notifies the mediator of the user's location via an SMS. As a failsafe, the corresponding device alerts nearby individuals by sounding the buzzer and contacts emergency medical services and mediators via SMS for immediate medical assistance, thus saving the user's life. The proposed system achieved 99.7% accuracy, 96.3% sensitivity, and 99.6% specificity. Finally, the proposed system can be implemented on a variety of devices and used to reliably monitor a large population with low false alarm rate, without obstructing the users’ daily living, as no external connections are required.
Article
The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. Code is at: https://github.com/facebookresearch/Detectron.
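Because the contribution centers on reshaping the cross-entropy loss, a generic PyTorch rendering of the binary focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) is shown below; it follows the published formula but is not the Detectron implementation.

```python
# Binary focal loss: down-weights well-classified (easy) examples.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha: float = 0.25, gamma: float = 2.0):
    """logits and targets have the same shape; targets are in {0, 1}."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)           # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()     # easy negatives contribute little

loss = focal_loss(torch.randn(8), torch.randint(0, 2, (8,)).float())
```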
Article
We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that's pretty swell. It's a little bigger than last time but more accurate. It's still fast though, don't worry. At 320x320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 mAP@50 in 51 ms on a Titan X, compared to 57.5 mAP@50 in 198 ms by RetinaNet, similar performance but 3.8x faster. As always, all the code is online at https://pjreddie.com/yolo/
Article
Interest in enhancing medical services and healthcare is emerging exploiting recent technological capabilities. An integrable fall detection sensor is an essential component toward achieving smart healthcare solutions. Traditional vision-based methods rely on tracking a skeleton and estimating the change in height of key body parts such as head, hips, and shoulders. These methods are often challenged by occluded body parts and abrupt posture changes. This paper presents a fall detection system consisting of a novel skeleton-free posture recognition method and an activity recognition stage. The posture recognition method analyzes local variations in depth pixels to identify the adopted posture. An input depth frame acquired using a Kinect-like sensor is densely represented using a depth comparison feature and fed to a random decision forest to discriminate among standing, sitting, and fallen postures. The proposed approach simplifies the posture recognition into a simple pixel labeling problem, after which determining the posture is as simple as counting votes from all labeled pixels. The falling event is recognized using a support vector machine. The proposed approach records a sensitivity rate of 99% on synthetic and live datasets as well as a specificity rate of 99% on synthetic datasets and 96% on popular live datasets without invasive accelerometer support.
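In the spirit of the per-pixel depth comparison feature described above, a single such feature can be computed as in the sketch below; the probe offsets are illustrative, and in practice many such features per pixel would feed the random decision forest.

```python
# One depth-comparison feature: d(p + u/d(p)) - d(p + v/d(p)) for pixel p.
import numpy as np

def depth_comparison_feature(depth, x, y, u=(10, 0), v=(0, 10)):
    h, w = depth.shape
    d = max(depth[y, x], 1e-3)                        # avoid division by zero
    def probe(off):
        yy = int(np.clip(y + off[1] / d, 0, h - 1))   # offsets scaled by the center depth
        xx = int(np.clip(x + off[0] / d, 0, w - 1))
        return depth[yy, xx]
    return probe(u) - probe(v)

frame = np.random.rand(240, 320).astype(np.float32) + 0.5   # fake depth map in meters
feature = depth_comparison_feature(frame, x=160, y=120)
```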
Article
Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our non-local models can compete or outperform current competition winners on both Kinetics and Charades datasets. In static image recognition, our non-local models improve object detection/segmentation and pose estimation on the COCO suite of tasks. Code will be made available.
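A compact sketch of an embedded-Gaussian non-local block for video features of shape (N, C, T, H, W) follows; halving the channels in the embedding convolutions is the usual convention, and this is a re-implementation sketch rather than the authors' released code.

```python
# Non-local block: each position attends to all positions, then a residual add.
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        inter = channels // 2
        self.theta = nn.Conv3d(channels, inter, 1)
        self.phi = nn.Conv3d(channels, inter, 1)
        self.g = nn.Conv3d(channels, inter, 1)
        self.out = nn.Conv3d(inter, channels, 1)

    def forward(self, x):                             # x: (N, C, T, H, W)
        n = x.shape[0]
        q = self.theta(x).flatten(2).transpose(1, 2)  # (N, THW, C/2)
        k = self.phi(x).flatten(2)                    # (N, C/2, THW)
        v = self.g(x).flatten(2).transpose(1, 2)      # (N, THW, C/2)
        attn = torch.softmax(q @ k, dim=-1)           # pairwise weights over all positions
        y = (attn @ v).transpose(1, 2).reshape(n, -1, *x.shape[2:])
        return x + self.out(y)                        # residual connection

out = NonLocalBlock(32)(torch.randn(1, 32, 4, 14, 14))
```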
Conference Paper
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [18], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
Article
The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data, with 400 human action classes and over 400 clips per class, and is collected from realistic, challenging YouTube videos. We provide an analysis on how current architectures fare on the task of action classification on this dataset and how much performance improves on the smaller benchmark datasets after pre-training on Kinetics. We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation: filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, making it possible to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and even their parameters. We show that, after pre-training on Kinetics, I3D models considerably improve upon the state-of-the-art in action classification, reaching 80.7% on HMDB-51 and 98.0% on UCF-101.
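The 2D-to-3D inflation that defines I3D can be sketched in a few lines: a pretrained 2D kernel is repeated along a new temporal axis and rescaled so that a static ("boring") video produces the same activations as the original image model. The shapes below are illustrative.

```python
# Inflate a 2D convolution kernel into a 3D one of temporal extent t.
import torch

def inflate_conv_weight(w2d: torch.Tensor, t: int) -> torch.Tensor:
    """w2d: (out_c, in_c, kH, kW) -> (out_c, in_c, t, kH, kW), rescaled by 1/t."""
    return w2d.unsqueeze(2).repeat(1, 1, t, 1, 1) / t

w2d = torch.randn(64, 3, 7, 7)          # e.g. an ImageNet-pretrained first-layer kernel
w3d = inflate_conv_weight(w2d, t=7)     # usable as nn.Conv3d weights
```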
Article
Falls are a major health problem for frail, community-dwelling older people. For more than two decades, falls have been extensively investigated by medical institutions to mitigate their impact (e.g., lack of independence, fear of falling) and minimize their consequences (e.g., cost of hospitalization). However, the problem of elderly falls does not only concern health professionals but has also drawn the interest of the scientific community. In fact, falls have been the object of many research studies and the purpose of many commercial products from academia and industry. These studies have tackled the problem with fall detection approaches covering a wide variety of sensing methods. Lately, researchers have shifted their efforts to fall prevention, where falls might be spotted before they even happen. Despite their restriction to clinical studies, early fall prediction systems have started to emerge. At the same time, current reviews in this field lack a common-ground classification. In this context, the main contribution of this article is to give a comprehensive overview of elderly falls and to propose a generic classification of fall-related systems based on their sensor deployment. An extensive review spanning fall detection to fall prevention systems has also been conducted based on this common-ground classification. Data processing techniques in both the fall detection and fall prevention tracks are also highlighted. The objective of this work is to give medical technologists in the field of public health a solid grounding in fall-related systems.
Book
This book provides a comprehensive overview of the key technologies and applications related to new cameras that have brought 3D data acquisition to the mass market. It covers both the theoretical principles behind the acquisition devices and the practical implementation aspects of the computer vision algorithms needed for the various applications. Real data examples are used in order to show the performances of the various algorithms. The performance and limitations of the depth camera technology are explored, along with an extensive review of the most effective methods for addressing challenges in common applications. Applications covered in specific detail include scene segmentation, 3D scene reconstruction, human pose estimation and tracking and gesture recognition. This book offers students, practitioners and researchers the tools necessary to explore the potential uses of depth data in light of the expanding number of devices available for sale. It explores the impact of these devices on the rapidly growing field of depth-based computer vision.
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
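The residual learning idea reduces to the small building block below: the stacked layers learn F(x) and the block outputs F(x) + x. This is the basic two-layer variant as a sketch; projection shortcuts for dimension changes are omitted.

```python
# Basic residual block: output = ReLU(F(x) + x).
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)   # identity shortcut carries x through

y = BasicBlock(64)(torch.randn(1, 64, 56, 56))
```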
Article
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
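A minimal sketch of the RPN head structure (a shared 3x3 convolution followed by 1x1 convolutions predicting per-anchor objectness scores and box regressions) is given below; the channel width and the 9 anchors per position are typical values, not necessarily the exact configuration used with VGG-16.

```python
# RPN head: objectness score and 4 box deltas for each of k anchors per position.
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels: int = 256, num_anchors: int = 9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.cls = nn.Conv2d(in_channels, num_anchors, 1)      # objectness per anchor
        self.reg = nn.Conv2d(in_channels, num_anchors * 4, 1)  # box deltas per anchor

    def forward(self, feat):                    # feat: (N, C, H, W) shared features
        h = torch.relu(self.conv(feat))
        return self.cls(h), self.reg(h)         # (N, k, H, W), (N, 4k, H, W)

scores, deltas = RPNHead()(torch.randn(1, 256, 38, 50))
```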
Article
This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
Conference Paper
New forms of natural interaction between human operators and UAVs (Unmanned Aerial Vehicles) are demanded by the military industry to achieve a better balance between UAV control and the burden on the human operator. In this work, a human-machine interface (HMI) based on a novel gesture recognition system using depth imagery is proposed for the control of UAVs. Hand gesture recognition based on depth imagery is a promising approach for HMIs because it is more intuitive, natural, and non-intrusive than alternatives that use complex controllers. The proposed system is based on a Support Vector Machine (SVM) classifier that uses spatio-temporal depth descriptors as input features. The designed descriptor is based on a variation of the Local Binary Pattern (LBP) technique adapted to work efficiently with depth video sequences. Another major consideration is the special hand sign language used for UAV control. A trade-off between the use of natural hand signs and the minimization of inter-sign interference has been established. Promising results have been achieved on a depth-based database of hand gestures specially developed for the validation of the proposed system.
Conference Paper
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in understanding an object's precise 2D location. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old, along with per-instance segmentation masks. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting, and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
Article
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Article
The paper deals with the issue of action recognition as an application of the new 3D time-of-flight (ToF) camera, exploiting the special ability of the device to measure distances. Segmentation of moving people is straightforward from the distance information, and the subsequent steps of the processing chain follow in a classical way. We describe first results on action recognition using ToF camera distance images for the simple task of deciding the actions of a single person. The total variation of a function proves to be a very useful feature in these applications.
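The total-variation cue mentioned above can be computed as a simple motion-energy feature over the segmented distance images, as in the sketch below; the segmentation step is assumed to have been done already.

```python
# Summed absolute frame-to-frame change of the distance images as a motion feature.
import numpy as np

def temporal_total_variation(depth_clip: np.ndarray) -> float:
    """depth_clip: (T, H, W) distance images of the segmented person."""
    return float(np.abs(np.diff(depth_clip, axis=0)).sum())

clip = np.random.rand(30, 120, 160).astype(np.float32)
motion_energy = temporal_total_variation(clip)   # larger for faster actions
```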
Conference Paper
This paper presents a multi-sensor system for the detection of falls in the home environment. Two kinds of devices are used: a MEMS wearable wireless accelerometer with onboard fall detection algorithms and a 3D Time-of-Flight camera. An embedded computing system receives the possible fall alarm data from the two sub-sensory systems together with their associated levels of confidence. The computing module hosts data fusion software that validates and correlates the data delivered by the two subsystems in order to raise overall system performance with respect to each single-sensor subsystem.
More is less: Learning efficient video representations by big-little network and depthwise temporal aggregation
  • Fan
Single-Board Computer