
S3D-CNN: skeleton-based 3D consecutive-low-pooling neural network for fall detection


Abstract

Most existing deep-learning-based fall detection methods use either a 2D neural network that ignores movement representation sequences, or whole sequences instead of only those in the fall period. These characteristics result in inaccurate extraction of human action features and failure to detect falls because of background interference or activity representation beyond the fall period. To alleviate these problems, a skeleton-based 3D consecutive-low-pooling neural network (S3D-CNN) for fall detection is proposed in this paper. In the S3D-CNN, an activity feature clustering selector is designed to extract the skeleton representation in depth videos using a pose estimation algorithm and to form an optimized skeleton sequence of the fall period. A 3D consecutive-low-pooling (3D-CLP) neural network is proposed to process these representation sequences by improving the network in terms of layer number, pooling kernel size, and single input frame number. The proposed method is evaluated on public and self-collected datasets, outperforming the existing methods.
Xin Xiong¹ · Weidong Min²,³ · Wei-Shi Zheng⁴ · Pin Liao¹ · Hao Yang¹ · Shuai Wang¹
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Keywords: Fall detection · Optimized skeleton representation · Depth video · Pose estimation · 3D-CLP network
1 Introduction
Falls are a leading cause of injury and the most common reason for non-fatal hospitalization among the elderly. The World Health Organization reports that more than 28% of all persons aged over 65 years fall every year; it also projects the global incidence of falls among those aged 70 years and over to rise to 32%–42% [1]. Falls are a main cause of death from unintentional injury, second only to road traffic injuries. Among the 37.3 million patients who see a doctor for a fall each year, 646,000 die, and over 80% reside in low- to middle-income countries. As the population ages, these figures are expected to worsen; falls are the leading cause of accidental death among persons aged 79 years and older [2]. According to the National Institutes of Health, about 1.6 million elderly Americans are injured by falling each year [3]. Over half of the seniors who lie on the floor for more than 1 h after a fall are reported to die within 6 months [4]. As China has a large elderly population, the fall problem is of great importance. If falls can be detected automatically and in time, medical services can be delivered rapidly to the injured. Existing general-purpose action recognition methods do not detect falls well, because fall datasets are scarce and complex networks, which over-fit easily during training, extract fall features poorly. As such, development of an intelligent system for automatic fall detection is a crucial undertaking.
In general, fall behaviors can be identified using several approaches, such as wearable sensors, detection via traditional geometric and movement features, and deep-learning methods. Most existing deep-learning-based fall detection methods use either a 2D neural network that ignores movement representation sequences, or whole sequences instead of only those in the fall period. These characteristics result in inaccurate extraction of human action features and failure to detect falls because of background interference or activity representation beyond the fall period. Modeling based on depth videos and applying the 3D feature extraction of a 3D-CLP neural network to eliminate major interference is a reasonable way to focus on fall features. Unfortunately, research on this topic is limited.
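The abstract names pooling kernel size and the number of input frames as the design choices of the 3D-CLP network. As a rough illustration of why small pooling kernels matter for short skeleton clips, the sketch below computes how many temporal steps survive a stack of unpadded pooling layers. The clip length (16 frames) and the kernel sizes are hypothetical values chosen for illustration; the paper's actual configuration is not stated in this section.

```python
from typing import List, Optional


def pooled_length(n: int, kernel: int, stride: Optional[int] = None) -> int:
    """Output length of one unpadded pooling layer (floor mode)."""
    stride = kernel if stride is None else stride
    if n < kernel:
        return 0  # the window no longer fits into the sequence
    return (n - kernel) // stride + 1


def stack_lengths(n: int, kernels: List[int]) -> List[int]:
    """Temporal length after each pooling layer, applied in order."""
    lengths = []
    for k in kernels:
        n = pooled_length(n, k)
        lengths.append(n)
    return lengths


# Hypothetical 16-frame clip passed through three temporal pooling layers.
low_pooling = stack_lengths(16, [1, 2, 2])   # small kernels keep more frames
high_pooling = stack_lengths(16, [2, 4, 2])  # larger kernels collapse the clip
print(low_pooling, high_pooling)
```

With the small kernels, four temporal steps remain after the third layer, whereas the larger kernels reduce the same clip to a single step, leaving little room for deeper layers to model motion over time.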
* Weidong Min, minweidong@ncu.edu.cn
¹ School of Information Engineering, Nanchang University, Nanchang 330031, China
² School of Software, Nanchang University, Nanchang 330047, China
³ Jiangxi Key Laboratory of Smart City, Nanchang 330047, China
⁴ School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China
https://doi.org/10.1007/s10489-020-01751-y
Published online: 13 June 2020
Applied Intelligence (2020) 50:3521–3534
... Yet, despite their inherent flexibility in deployment, 2D vision systems remain acutely sensitive to the vagaries of environmental conditions and often give rise to legitimate privacy concerns owing to the perpetual nature of video surveillance. Conversely, 3D vision systems [16], armed with the formidable prowess of depth sensors and stereo camera setups, ascend to new heights, deftly capturing and encapsulating spatial information, thereby significantly mitigating sensitivity to environmental factors while simultaneously augmenting posture analysis. However, the adoption of 3D vision systems comes at a cost, both figuratively and literally, as they entail higher financial investments and often entail an augmented level of algorithmic complexity. ...
Article
In this study, we introduce a robust system for real-time fall detection and tracking to enhance the safety and autonomy of individuals, particularly older adults and those with mobility challenges. Leveraging advanced machine learning techniques, we integrate BlazePose, a real-time pose detection system, with XGBoost, a powerful distributed gradient boosting library, and DeepSORT for 3D fall detection and tracking in the proposed system. The system analyzes skeleton data from BlazePose, extracting critical features such as angles, linear velocity, and angular velocity from upper body segments to identify falls swiftly and accurately. In the event of a fall, the system can activate alarms or notify caregivers and emergency services, facilitating timely medical intervention and reducing the risk of severe injuries or complications. Comparative evaluations against three other methods, namely Logistic Regression (LR), Support Vector Machine (SVM), and Random Forest (RF), demonstrate the superior performance of the proposed XGBoost-based approach, achieving a precision of 89.29%, recall of 81.20%, F1-Score of 84.82%, and accuracy of 94.26%. Additionally, dataset comparison across different fall detection scenarios further validates the effectiveness of the proposed system, with XGBoost consistently outperforming other methods across various datasets. Overall, this research presents a significant advancement in intelligent surveillance technology, showcasing the potential for proactive fall detection and intervention to enhance safety and well-being in diverse populations.
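The features named in this abstract (angles, linear velocity, and angular velocity of upper body segments) can be illustrated with a minimal sketch. It assumes 2D keypoints with the y axis pointing up, a hip-to-shoulder segment as the trunk, and a fixed frame rate; the function names and the toy coordinates are hypothetical and do not reproduce the BlazePose API or the authors' feature extraction.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def segment_angle(hip: Point, shoulder: Point) -> float:
    """Inclination of the hip->shoulder segment from vertical, in degrees.

    Assumes the y axis points up: 0 means upright, 90 means horizontal.
    """
    dx = shoulder[0] - hip[0]
    dy = shoulder[1] - hip[1]
    return math.degrees(math.atan2(abs(dx), dy))


def trunk_features(hips: List[Point], shoulders: List[Point], fps: float):
    """Per-frame angle, plus finite-difference angular and linear velocity."""
    angles = [segment_angle(h, s) for h, s in zip(hips, shoulders)]
    # Angular velocity in degrees per second between consecutive frames.
    ang_vel = [(a1 - a0) * fps for a0, a1 in zip(angles, angles[1:])]
    # Linear velocity of the shoulder point in coordinate units per second.
    lin_vel = [math.dist(p0, p1) * fps for p0, p1 in zip(shoulders, shoulders[1:])]
    return angles, ang_vel, lin_vel


# Toy sequence: the trunk tips from upright to horizontal over three frames.
hips = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
shoulders = [(0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
angles, ang_vel, lin_vel = trunk_features(hips, shoulders, fps=10.0)
print(angles, ang_vel, lin_vel)
```

Feature vectors of this kind could then be passed to any tabular classifier; the choice of XGBoost in the abstract is independent of how the features are computed.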
... The comparison of the execution speed between our algorithm and several fall detection methods is shown in Table 8. The processing speed of our algorithm was 72 fps, which was faster than the processing speed reported by Yadav et al. (2022), Xiong et al. (2020), and Wang et al. (2020). Moreover, the fall recognition accuracy of our algorithm was significantly improved, which further improved the applicability of the algorithm. ...
Article
Accidental falls are the second leading cause of accidental death of the elderly. Early intervention measures can reduce the problem. However, so far, there are few related studies using Transformer coding module for fall detection feature extraction, and the real-time performance of existing algorithms is not so good. Therefore, we propose a fall detection method based on Transformer to extract spatiotemporal features. Specifically, we use an image reduction module based on a convolutional neural network to reduce the image size for computation. Then, we design a pyramid network based on an improved Transformer to extract spatial features. Finally, we design a feature fusion module that fuses spatial features of different scales. The fused features are input into the gate recurrent unit to extract time features and complete the recognition of falls and normal postures. Experimental results show that the proposed approach achieves an accuracy of 99.61% and 99.33% when tested with UR Fall Detection Dataset and Le2i Fall Detection Dataset. Compared with the state-of-the-art fall detection algorithms, our method has high accuracy while maintaining high detection speed.
... But how to obtain human gait information from the skeleton sequence is the key to detect early Parkinson's disease. Skeletons estimated from RGB images are efficient in identifying people's actions [38][39][40][41][42][43]. Recently, Ref. [26] proposed an end-to-end method for early diagnosis of Parkinson's disease based on graph convolutional network, which takes patient skeleton sequence as input and returns diagnostic results. ...
Article
Parkinson’s disease is a chronic neurodegenerative condition accompanied by a variety of motor and non-motor clinical symptoms. Diagnosing Parkinson’s disease presents many challenges, such as excessive reliance on subjective scale scores and a lack of objective indicators in the diagnostic process. Developing efficient and convenient methods to assist doctors in diagnosing Parkinson’s disease is necessary. In this paper, we study the skeleton sequences obtained from gait videos of Parkinsonian patients for early detection of the disease. We designed a Transformer network based on feature tensor fusion to capture the subtle manifestations of Parkinson’s disease. Initially, we fully utilized the distance information between joints, converting it into a multivariate time series classification task. We then built twin towers to discover dependencies within and across sequence channels. Finally, a tensor fusion layer was employed to integrate the features from both towers. In our experiments, our model demonstrated superior performance over the current state-of-the-art algorithm, achieving an 86.8% accuracy in distinguishing Parkinsonian patients from healthy individuals using the PD-Walk dataset.
... The method based on vision devices usually detects falls through extensive analysis of videos or images [13][14][15][16][17][18]. Specifically, fall detection algorithms based on deep learning have become increasingly popular; for example, a deep learning approach with a skeleton-based 3D-CNN and consecutive-low-pooling is used to obtain a discriminative skeleton representation for fall detection [19]. In addition, many studies concentrate on skeleton data in deep learning models, such as skeleton data collected through OpenPose [20]. ...
Article
Three-dimensional convolutional neural networks (3D-CNNs) and fully connected long short-term memory networks (FC-LSTMs) have been demonstrated to be powerful non-intrusive approaches to fall detection. However, the feature extraction of 3D-CNN-based methods requires a large-scale dataset. Meanwhile, FC-LSTM flattens its input into one dimension, which leads to the loss of spatial information. To this end, a novel model combining a lightweight 3D-CNN and convolutional long short-term memory (ConvLSTM) networks is proposed in this paper. In this model, a lightweight 3D convolutional neural network with five layers is presented to avoid over-fitting. To further explore the discriminative features, channel- and spatial-wise attention modules are adopted in each layer to improve the detection performance. In addition, the ConvLSTM is presented to extract the long-term spatial–temporal features of 3D tensors. Finally, we verify our model through extensive experiments and comprehensive comparisons on HMDB5, UCF11, URFD, and MCFD. Experimental results on the public benchmarks demonstrate that our method outperforms current state-of-the-art single-stream networks with 62.55 ± 7.99% on HMDB5, 97.28 ± 0.36% on UCF11, 98.06 ± 0.32% on URFD, and 94.84 ± 4.64% on MCFD.
Article
Falls in the elderly have become one of the major risks for the growing elderly population. Therefore, the application of automatic fall detection system for the elderly is particularly important. In recent years, a large number of deep learning methods (such as CNN) have been applied to such research. This paper proposed a sparse convolution method 3D Sparse Convolutions and the corresponding 3D Sparse Convolutional Neural Network (3D-SCNN), which can achieve faster convolution at the approximate accuracy, thereby reducing computational complexity while maintaining high accuracy in video analysis and fall detection task. Additionally, the preprocessing stage involves a dynamic key frame selection method, using the jitter buffers to adjust frame selection based on current network conditions and buffer state. To ensure feature continuity, overlapping cubes of selected frames are intentionally employed, with dynamic resizing to adapt to network dynamics and buffer states. Experiments are conducted on Multi-camera fall dataset and UR fall dataset, and the results show that its accuracy exceeds the three compared methods, and outperforms the traditional 3D-CNN methods in both accuracy and losses.
Preprint
New technologies for the quantification of behavior have revolutionized animal studies in social, cognitive, and pharmacological neurosciences. However, comparable studies in understanding human behavior, especially in psychiatry, are lacking. In this study, we utilized data-driven machine learning to analyze natural, spontaneous open-field human behaviors from people with euthymic bipolar disorder (BD) and non-BD participants. Our computational paradigm identified representations of distinct sets of actions (motifs) that capture the physical activities of both groups of participants. We propose novel measures for quantifying dynamics, variability, and stereotypy in BD behaviors. These fine-grained behavioral features reflect patterns of cognitive functions of BD and better predict BD compared with traditional ethological and psychiatric measures and action recognition approaches. This research represents a significant computational advancement in human ethology, enabling the quantification of complex behaviors in real-world conditions and opening new avenues for characterizing neuropsychiatric conditions from behavior.
Article
In the field of human action recognition, it is a long-standing challenge to characterize the video-level spatio-temporal features effectively. This is attributable in part to the inability of CNN to model long-range temporal information, especially for actions that consist of multiple staged behaviors. In this paper, a novel attention-based spatio-temporal VLAD network (AST-VLAD) with self-attention model is developed to aggregate the informative deep features across the video according to the adaptive deep feature selected. Moreover, an overall automatic approach to adaptive video sequences optimization (AVSO) is proposed through shot segmentation and dynamic weighted sampling; the AVSO increases the proportion of action-related frames and eliminates the redundant intervals. Then, based on the optimized video, a self-attention model is introduced in AST-VLAD to model the intrinsic spatio-temporal relationship of deep features instead of solving the frame-level features in an average or max pooling manner. Extensive experiments are conducted on two public benchmarks, HMDB51 and UCF101, for evaluation. As compared to the existing frameworks, results show that the proposed approach performs better or as well in the accuracy of classification on both HMDB51 (73.1%) and UCF101 (96.0%) datasets.
Article
This study proposes an innovative fall detection system that leverages the capabilities of depth sensors and a Convolutional Neural Network (CNN) model. The objective is to enhance fall detection accuracy by using raw data from depth images for ground reference establishment and to distinguish foreground elements by using a background subtraction algorithm for comprehensive analysis. The performance of the proposed CNN-based system was compared with that of systems based on three learning methods: support vector machine, multilayer perceptron, and radial basis function neural network models. Experimental results indicated that the proposed system achieved an accuracy rate of approximately 95% and a Kappa coefficient of 0.96 in fall detection, thereby outperforming the other systems. These findings indicate that the proposed system has high efficacy and reliability; thus, it has potential applications in real-world scenarios requiring accurate fall detection.
Article
In this study, pre-impact fall detection algorithms were developed based on data gathered by a custom-made inertial measurement unit (IMU). Four types of simulated falls were performed by 40 healthy subjects (age: 23.4 ± 4.4 years). The IMU recorded acceleration and angular velocity during all activities. Acceleration, angular velocity, and trunk inclination thresholds were set to 0.9 g, 47.3°/s, and 24.7°, respectively, for a pre-impact fall detection algorithm using vertical angles (VA algorithm); and 0.9 g, 47.3°/s, and 0.19, respectively, for an algorithm using the triangle feature (TF algorithm). The algorithms were validated by the results of a blind test using four types of simulated falls and six types of activities of daily living (ADL). VA and TF algorithms resulted in lead times of 401 ± 46.9 ms and 427 ± 45.9 ms, respectively. Both algorithms were able to detect falls with 100% accuracy. The performance of the algorithms was evaluated using a public dataset. Both algorithms detected every fall in the SisFall dataset with 100% sensitivity. The VA algorithm had a specificity of 78.3%, and the TF algorithm had a specificity of 83.9%. The algorithms had higher specificity when interpreting data from elderly subjects. This study showed that algorithms using angles could more accurately detect falls. Public datasets are needed to improve the accuracy of the algorithms.
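A threshold rule of the kind described for the VA algorithm can be sketched as follows. The three threshold values come from the abstract; everything else is an assumption made for illustration: the abstract does not say how the conditions are combined or in which direction each comparison runs, so the sketch assumes that a fall is flagged when acceleration drops below its threshold while angular velocity and trunk inclination both exceed theirs. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VAThresholds:
    """Threshold values quoted in the abstract for the VA algorithm."""
    acceleration_g: float = 0.9
    angular_velocity_dps: float = 47.3
    trunk_inclination_deg: float = 24.7


def is_pre_impact_fall(acc_g: float, gyro_dps: float, tilt_deg: float,
                       th: VAThresholds = VAThresholds()) -> bool:
    """Flag a sample as a pre-impact fall.

    Assumption (not stated in the abstract): all three conditions must hold,
    with acceleration below its threshold and the other two above theirs.
    """
    return (acc_g < th.acceleration_g
            and gyro_dps > th.angular_velocity_dps
            and tilt_deg > th.trunk_inclination_deg)


# Hypothetical sensor samples: (acceleration in g, deg/s, trunk tilt in deg).
samples = [(1.0, 5.0, 3.0), (0.95, 60.0, 20.0), (0.6, 80.0, 35.0)]
flags = [is_pre_impact_fall(*s) for s in samples]
print(flags)
```

Requiring all conditions at once trades sensitivity for fewer false alarms on daily activities such as sitting down quickly; a different combination rule would shift that balance.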
Article
Automatic fall detection in videos could enable timely delivery of medical service to injured elders who have fallen and live alone. Deep ConvNets have been used to detect fall actions. However, problems remain in deep video representations for fall detection. First, video frames are directly inputted to deep ConvNets, so the visual features of human actions may be disturbed by the surrounding environment. Second, redundant frames increase the difficulty of time encoding for human actions. To address these problems, this paper presents the Trajectory-weighted Deep-convolutional Rank-pooling Descriptor (TDRD) for fall detection, which is robust to surrounding environments and can effectively describe the dynamics of human actions in long videos. First, the CNN feature map of each frame is extracted through a deep ConvNet. Then, we present a new kind of trajectory attention map, built with improved dense trajectories, to optimally localize the subject area. Next, the CNN feature map of each frame is weighted with its corresponding trajectory attention map to get the trajectory-weighted convolutional visual feature of the human region. Further, we propose a cluster pooling method to reduce the redundancy of the trajectory-weighted convolutional features of a video in the time sequence. Finally, the rank pooling method is used to encode the dynamics of the cluster-pooled sequence to get our TDRD. With TDRD, we get superior results on the SDUFall dataset and comparable performance on the UR dataset and the Multiple cameras dataset with SVM classifiers.
Article
Falls are one of the greatest risks for older adults living alone at home. This paper presents a novel visual-based fall detection approach to support independent living for older adults through analysing the motion and shape of the human body. The proposed approach employs a new set of features to detect a fall. Motion information of a segmented silhouette when extracted can provide a useful cue for classifying different behaviours, while variation in shape and the projection histogram can be used to describe human body postures and subsequent fall events. The proposed approach presented here extracts motion information using best-fit approximated ellipse and bounding box around the human body, produces projection histograms and determines the head position over time, to generate 10 features to identify falls. These features are fed into a multilayer perceptron neural network for fall classification. Experimental results show the reliability of the proposed approach with a high fall detection rate of 99.60% and a low false alarm rate of 2.62% when tested with the UR Fall Detection dataset. Comparisons with state of the art fall detection techniques show the robustness of the proposed approach.
Article
In video surveillance, automatic human fall detection is important to protect vulnerable groups such as the elderly. When the camera layout varies, the shape aspect ratio (SAR) of a human body may change substantially. In order to rectify these changes, in this paper, we propose an automatic human fall detection method using the normalized shape aspect ratio (NSAR). A calibration process and bicubic interpolation are implemented to generate the NSAR table for each camera. Compared with some representative fall detection methods using the SAR, the proposed method integrates the NSAR with the moving speed and direction information to robustly detect human fall, as well as being able to detect falls toward eight different directions for multiple humans. Moreover, while most of the existing fall detection methods were designed only for indoor environment, experimental results demonstrate that this newly proposed method can effectively detect human fall in both indoor and outdoor environments.
Article
Behavior analysis through posture recognition is an essential research in robotic systems. Sitting with unhealthy sitting posture for a long time seriously harms human health and may even lead to lumbar disease, cervical disease and myopia. Automatic vision-based detection of unhealthy sitting posture, as an example of posture detection in robotic systems, has become a hot research topic. However, the existing methods only focus on extracting features of human themselves and lack understanding relevancies among objects in the scene, and henceforth fail to recognize some types of unhealthy sitting postures in complicated environments. To alleviate these problems, a scene recognition and semantic analysis approach to unhealthy sitting posture detection in screen-reading is proposed in this paper. The key skeletal points of human body are detected and tracked with a Microsoft Kinect sensor. Meanwhile, a deep learning method, Faster R-CNN, is used in the scene recognition of our method to accurately detect objects and extract relevant features. Then our method performs semantic analysis through Gaussian-Mixture behavioral clustering for scene understanding. The relevant features in the scene and the skeletal features extracted from human are fused into the semantic features to discriminate various types of sitting postures. Experimental results demonstrated that our method accurately and effectively detected various types of unhealthy sitting postures in screen-reading and avoided error detection in complicated environments. Compared with the existing methods, our proposed method detected more types of unhealthy sitting postures including those that the existing methods could not detect. Our method can be potentially applied and integrated as a medical assistance in robotic systems of health care and treatment.
Article
Nowadays, Convolutional Neural Network (CNN) has achieved great success in various computer vision tasks. However, in classic CNN models, convolution and fully connected (FC) layers just perform linear transformations to their inputs. Non-linearity is often added by activation and pooling layers. It is natural to explore and extend convolution and FC layers non-linearly with affordable costs. In this paper, we first investigate the power mean function, which is proved effective and efficient in SVM kernel learning. Then, we investigate the power mean kernel, which is a non-linear kernel having linear computational complexity with the asymmetric kernel approximation function. Motivated by this scalable kernel, we propose Power Mean Transformation, which nonlinearizes both convolution and FC layers. It only needs a small modification on current CNNs, and improves the performance with a negligible increase of model size and running time. Experiments on various tasks show that Power Mean Transformation can improve classification accuracy, bring generalization ability and add different non-linearity to CNN models. Large performance gain on tiny models shows that Power Mean Transformation is especially effective in resource restricted deep learning scenarios like mobile applications. Finally, we add visualization experiments to illustrate why Power Mean Transformation works.
Conference Paper
In this work we establish dense correspondences between an RGB image and a surface-based representation of the human body, a task we refer to as dense human pose estimation. We gather dense correspondences for 50K persons appearing in the COCO dataset by introducing an efficient annotation pipeline. We then use our dataset to train CNN-based systems that deliver dense correspondence 'in the wild', namely in the presence of background, occlusions and scale variations. We improve our training set's effectiveness by training an inpainting network that can fill in missing ground truth values and report improvements with respect to the best results that would be achievable in the past. We experiment with fully-convolutional networks and region-based models and observe a superiority of the latter. We further improve accuracy through cascading, obtaining a system that delivers highly-accurate results at multiple frames per second on a single gpu. Supplementary materials, data, code, and videos are provided on the project page http://densepose.org.
Article
Accurate and reliable travel time predictions in public transport networks are essential for delivering an attractive service that is able to compete with other modes of transport in urban areas. The traditional application of this information, where arrival and departure predictions are displayed on digital boards, is highly visible in the city landscape of most modern metropolises. More recently, the same information has become critical as input for smart-phone trip planners in order to alert passengers about unreachable connections, alternative route choices and prolonged travel times. More sophisticated Intelligent Transport Systems (ITS) include the predictions of connection assurance, i.e. to hold back services in case a connecting service is delayed. In order to operate such systems, and to ensure the confidence of passengers in the systems, the information provided must be accurate and reliable. Traditional methods have trouble with this as congestion, and thus travel time variability, increases in cities, consequently making travel time predictions in urban areas a non-trivial task. This paper presents a system for bus travel time prediction that leverages the non-static spatio-temporal correlations present in urban bus networks, allowing the discovery of complex patterns not captured by traditional methods. The underlying model is a multi-output, multi-time-step, deep neural network that uses a combination of convolutional and long short-term memory (LSTM) layers. The method is empirically evaluated and compared to other popular approaches for link travel time prediction and currently available services, including the currently deployed model in Copenhagen, Denmark. We find that the proposed model significantly outperforms all the other methods we compare with, and is able to detect small irregular peaks in bus travel times very quickly.
Article
Fall is one of the leading causes of injury for the elderly individuals. Systems that automatically detect falls can significantly reduce the delay of assistance. Most of commercialized fall detection systems are based on wearable devices, which elderly individuals tend to forget wearing. Using surveillance cameras to detect falls based on computer vision is ideal, because anyone in the monitoring scopes can be under protection. However, the privacy protection issue using surveillance cameras has been bothering people. To effectively protect the privacy, we proposed an optical level anonymous image sensing system, which can protect the privacy by hiding the facial regions optically at the video capturing phase. We apply the system to fall detection. In detecting falls, we propose a neural network by combining a 3D convolutional neural network for feature extraction and an autoencoder for modelling the normal behaviors. The learned autoencoder reconstructs the features extracted from videos with normal behaviors with smaller average errors than those extracted from videos with falls. We evaluated our neural network by a hold-out validation experiment, and showed its effectiveness. In field tests, we showed and discussed the applicability of the optical level anonymous image sensing system for privacy protection and fall detection.