Conference Paper

Human Activity Recognition in Video Surveillance Using Long-Term Recurrent Convolutional Network

... The workflow begins with a set of raw video data, depicted as a series of fragmented images that undergo a process of initial fusion. This initial fusion is likely a preprocessing step that involves techniques such as frame normalization, noise reduction, and data augmentation to prepare the video segments for further analysis [43][44][45][46][47]. This step may also include the synchronization of video frames to align temporal sequences, which is critical for accurate motion analysis. ...
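As a rough illustration of what such an initial-fusion stage could involve, the sketch below normalizes, smooths, and temporally aligns a raw clip; the fixed clip length, the smoothing window, and the flip-based augmentation are illustrative assumptions, not the surveyed paper's actual pipeline.

```python
import numpy as np

def preprocess_clip(frames, target_len=16):
    """Illustrative initial-fusion step: align, normalize, and denoise a
    list of raw frames (each H x W x 3, uint8)."""
    # Temporal alignment: uniformly resample the clip to a fixed length
    idx = np.linspace(0, len(frames) - 1, target_len).astype(int)
    clip = np.stack([frames[i] for i in idx]).astype(np.float32)

    # Frame normalization to zero mean / unit variance per clip
    clip = (clip - clip.mean()) / (clip.std() + 1e-6)

    # Simple noise reduction: temporal smoothing over adjacent frames
    smoothed = clip.copy()
    smoothed[1:-1] = (clip[:-2] + clip[1:-1] + clip[2:]) / 3.0

    # Data augmentation example: random horizontal flip of the whole clip
    if np.random.rand() < 0.5:
        smoothed = smoothed[:, :, ::-1, :]
    return smoothed
```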
Article
Pedestrian dead reckoning (PDR) uses built-in sensors in smartphones to track user positions, offering both versatility and portability. However, differences among individuals and their behavior patterns decrease positioning accuracy. In this study, a two-stage personalized pedestrian dead reckoning method based on neural networks (P2Net) is proposed. The first stage employs a human activity and smartphone location recognition (HSR) module, integrating a convolutional neural network with focal loss (FLCNN) to recognize nine modes, which constrains the subsequent optimized PDR procedure. In the second stage, a hybrid feature temporal attention network (HFTAN) is constructed to achieve generalized step length estimation across individuals. Temporal features from a temporal convolutional network (TCN) and physical features are combined to generate hybrid features with time-series modeling capability and interpretability, which are then fed into a bi-directional long short-term memory (BiLSTM) model with a feature attention mechanism to estimate pedestrian step length. Experimental results demonstrate that P2Net achieves a class average accuracy of 92.75% for the nine modes, and the total traveled distance error under various states is within 10%, outperforming the other positioning methods considered.
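A minimal sketch of the focal-loss component used in the recognition module, assuming a standard multi-class focal loss; the gamma and alpha values are illustrative defaults, not the values reported for P2Net.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Multi-class focal loss: down-weights well-classified examples so the
    recognizer focuses on hard activity/location modes."""
    log_p = F.log_softmax(logits, dim=-1)                      # (N, C)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # (N,)
    pt = log_pt.exp()
    loss = -alpha * (1.0 - pt) ** gamma * log_pt
    return loss.mean()
```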
Article
Full-text available
This paper proposes a deep learning model to efficiently detect salient regions in videos. It addresses two important issues: (1) training a deep video saliency model in the absence of sufficiently large, pixel-wise annotated video data; and (2) fast video saliency training and detection. The proposed deep video saliency network consists of two modules that capture spatial and temporal saliency information, respectively. The dynamic saliency model, which explicitly incorporates saliency estimates from the static saliency model, directly produces spatiotemporal saliency inference without time-consuming optical flow computation. We further propose a novel data augmentation technique that simulates video training data from existing annotated image datasets, which enables our network to learn diverse saliency information and prevents overfitting with the limited number of training videos. Leveraging our synthetic video data (150K video sequences) and real videos, our deep video saliency model successfully learns both spatial and temporal saliency cues, thus producing accurate spatiotemporal saliency estimates. We advance the state of the art on the DAVIS dataset (MAE of .06) and the FBMS dataset (MAE of .07), and do so with much improved speed (2 fps with all steps).
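The two-module structure can be sketched as follows, with tiny placeholder CNNs standing in for the actual static and dynamic saliency networks; the key point is that the dynamic module consumes a frame pair together with the static estimate, so no optical flow is computed.

```python
import torch
import torch.nn as nn

class StaticSaliency(nn.Module):
    """Spatial module: per-frame saliency from RGB (illustrative tiny CNN)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1), nn.Sigmoid())
    def forward(self, frame):
        return self.net(frame)

class DynamicSaliency(nn.Module):
    """Temporal module: consumes a frame pair plus the static estimate,
    producing spatiotemporal saliency directly."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 3 + 1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1), nn.Sigmoid())
    def forward(self, frame_t, frame_t1, static_sal):
        x = torch.cat([frame_t, frame_t1, static_sal], dim=1)
        return self.net(x)
```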
Article
Full-text available
In this paper, we study the problem of activity recognition and abnormal behaviour detection for elderly people with dementia. Very few studies have attempted to address this problem, presumably because of the lack of experimental data in the context of dementia care. In particular, the paper investigates three variants of Recurrent Neural Networks (RNNs): Vanilla RNNs (VRNN), Long Short-Term Memory RNNs (LSTM) and Gated Recurrent Unit RNNs (GRU). Here activity recognition is treated as a sequence labelling problem, while abnormal behaviour is flagged based on the deviation from normal patterns. To provide an adequate discussion of the performance of RNNs in this context, we compare them against state-of-the-art methods such as Support Vector Machines (SVMs), Naïve Bayes (NB), Hidden Markov Models (HMMs), Hidden Semi-Markov Models (HSMM) and Conditional Random Fields (CRFs). The results obtained indicate that RNNs are competitive with those state-of-the-art methods. Moreover, the paper presents a methodology for generating synthetic data reflecting some behaviours of people with dementia, given the difficulty of obtaining real-world data.
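A minimal sketch of activity recognition as sequence labelling with an LSTM, plus one possible way to flag abnormal behaviour as a deviation from the model's usual confidence; the layer sizes and the threshold rule are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ActivityLabeller(nn.Module):
    """Sequence labelling with an LSTM: one activity label per time step."""
    def __init__(self, n_sensors, n_activities, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(n_sensors, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_activities)

    def forward(self, x):         # x: (batch, time, n_sensors)
        h, _ = self.rnn(x)        # (batch, time, hidden)
        return self.head(h)       # per-step activity logits

def is_abnormal(logits, threshold=0.2):
    """Flag a time step as abnormal when the model's confidence in its own
    prediction falls below a threshold (one possible deviation measure)."""
    probs = logits.softmax(dim=-1)
    return probs.max(dim=-1).values < threshold
```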
Article
Full-text available
Typical human actions last several seconds and exhibit characteristic spatio-temporal structure. Recent methods attempt to capture this structure and learn action representations with convolutional neural networks. Such representations, however, are typically learned at the level of a few video frames, failing to model actions at their full temporal extent. In this work we learn video representations using neural networks with long-term temporal convolutions (LTC). We demonstrate that LTC-CNN models with increased temporal extents improve the accuracy of action recognition. We also study the impact of different low-level representations, such as raw video pixel values and optical flow vector fields, and demonstrate the importance of high-quality optical flow estimation for learning accurate action models. We report state-of-the-art results on two challenging benchmarks for human action recognition: UCF101 (92.7%) and HMDB51 (67.2%).
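A minimal sketch of a long-term temporal convolution block, assuming a PyTorch-style 3D convolution over a 60-frame clip instead of the usual 16 frames; channel counts and resolution are illustrative.

```python
import torch
import torch.nn as nn

# A single long-term temporal convolution block: the temporal extent of the
# input clip (t frames) is much larger than the common 16-frame setting.
clip = torch.randn(1, 3, 60, 58, 58)               # (batch, channels, t=60, H, W)
ltc_block = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1),     # 3x3x3 space-time filter
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=(2, 2, 2)))            # halve time and space
features = ltc_block(clip)                          # (1, 64, 30, 29, 29)
```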
Article
Full-text available
This paper presents a review of different classification techniques used to recognize human activities from wearable inertial sensor data. Three inertial sensor units were used in this study and were worn by healthy subjects at key points of upper/lower body limbs (chest, right thigh and left ankle). Three main steps describe the activity recognition process: sensors’ placement, data pre-processing and data classification. Four supervised classification techniques namely, k-Nearest Neighbor (k-NN), Support Vector Machines (SVM), Gaussian Mixture Models (GMM), and Random Forest (RF) as well as three unsupervised classification techniques namely, k-Means, Gaussian mixture models (GMM) and Hidden Markov Model (HMM), are compared in terms of correct classification rate, F-measure, recall, precision, and specificity. Raw data and extracted features are used separately as inputs of each classifier. The feature selection is performed using a wrapper approach based on the RF algorithm. Based on our experiments, the results obtained show that the k-NN classifier provides the best performance compared to other supervised classification algorithms, whereas the HMM classifier is the one that gives the best results among unsupervised classification algorithms. This comparison highlights which approach gives better performance in both supervised and unsupervised contexts. It should be noted that the obtained results are limited to the context of this study, which concerns the classification of the main daily living human activities using three wearable accelerometers placed at the chest, right shank and left ankle of the subject.
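A sketch of the supervised comparison using scikit-learn, with random placeholder features standing in for the vectors extracted from the three inertial units; hyper-parameters are illustrative, not those tuned in the study.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# X: feature vectors extracted from the three inertial units (placeholder
# random data here); y: activity labels.
X = np.random.randn(200, 30)
y = np.random.randint(0, 5, size=200)

for name, clf in [("k-NN", KNeighborsClassifier(n_neighbors=5)),
                  ("SVM", SVC(kernel="rbf")),
                  ("RF", RandomForestClassifier(n_estimators=100))]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```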
Article
Full-text available
Recent results suggest that state-of-the-art saliency models perform far from optimal in predicting fixations. This lack of performance has been attributed to an inability to model the influence of high-level image features such as objects. Recent seminal advances in applying deep neural networks to tasks like object recognition suggest that they are able to capture this kind of structure. However, the enormous amount of training data necessary to train these networks makes them difficult to apply directly to saliency prediction. We present a novel way of reusing existing neural networks that have been pretrained on the task of object recognition in models of fixation prediction. Using the well-known network of Krizhevsky et al., 2012, we come up with a new saliency model that significantly outperforms all state-of-the-art models on the MIT Saliency Benchmark. We show that the structure of this network allows new insights into the psychophysics of fixation selection and potentially their neural implementation. To train our network, we build on recent work on the modeling of saliency as point processes.
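A sketch of the transfer idea: freeze a network pretrained for object recognition and train only a small readout that maps its deep features to a fixation map. VGG-16 is used here purely for brevity; the paper builds on the Krizhevsky et al. (AlexNet-style) network, and the readout design is an assumption.

```python
import torch
import torch.nn as nn
from torchvision import models

# Frozen object-recognition features plus a trainable 1x1 readout that maps
# deep features to a coarse fixation-density map.
backbone = models.vgg16().features     # load pretrained ImageNet weights in practice
for p in backbone.parameters():
    p.requires_grad = False

readout = nn.Conv2d(512, 1, kernel_size=1)          # only this layer is trained

image = torch.randn(1, 3, 224, 224)
saliency = readout(backbone(image))                  # (1, 1, 7, 7) fixation map
```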
Conference Paper
For the past few years, smartphone-based human activity recognition (HAR) has gained much popularity due to embedded sensors, which have found various applications in healthcare, surveillance, human-device interaction, pattern recognition, etc. In this paper, we propose a neural network model to classify human activities, which uses activity-driven hand-crafted features. First, feature selection based on neighborhood component analysis is used to choose a subset of important features from the available time- and frequency-domain parameters. Next, a dense neural network consisting of four hidden layers is modeled to classify the input features into different categories. The model is evaluated on the publicly available UCI HAR dataset consisting of six daily activities; our approach achieved 95.79% classification accuracy. When compared with existing state-of-the-art methods, our proposed model outperformed most other methods while using fewer features, thus showing the importance of proper feature selection.
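A sketch of the dense classifier with four hidden layers; the layer widths and the number of selected features are assumptions, and the NCA-based feature selection is assumed to have been applied beforehand.

```python
import torch.nn as nn

# Dense classifier over a selected subset of hand-crafted features.
n_selected_features = 100   # illustrative; depends on the selection step
n_activities = 6            # UCI HAR: walking, upstairs, downstairs, sitting, standing, laying

model = nn.Sequential(
    nn.Linear(n_selected_features, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, n_activities))
```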
Article
Convolutional Neural Network based action recognition methods have achieved significant improvements in recent years. The 3D convolution extends the 2D convolution to the spatial-temporal domain for better analysis of human activities in videos. The 3D convolution, however, involves many more parameters than the 2D convolution. Thus, it is much more expensive to compute, costly to store, and difficult to learn. This work proposes efficient asymmetric one-directional 3D convolutions to approximate the traditional 3D convolution. To improve the feature learning capacity of asymmetric 3D convolutions, a set of local 3D convolutional networks, called MicroNets, are proposed by incorporating multi-scale 3D convolution branches. Then, an asymmetric 3D-CNN deep model is constructed from MicroNets for the action recognition task. Moreover, to avoid training two networks on the RGB and Flow frames separately as most works do, a simple but effective multi-source enhanced input is proposed, which fuses useful information from the RGB and Flow frames at the pre-processing stage. The asymmetric 3D-CNN model is evaluated on two of the most challenging action recognition benchmarks, UCF-101 and HMDB-51. The asymmetric 3D-CNN model outperforms all the traditional 3D-CNN models in both effectiveness and efficiency, and its performance is comparable with that of recent state-of-the-art action recognition methods on both benchmarks.
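A sketch of the asymmetric approximation, assuming a PyTorch implementation: one 3x3x3 convolution is replaced by three one-directional convolutions along the temporal, vertical, and horizontal axes.

```python
import torch.nn as nn

# A full 3x3x3 convolution is approximated by three one-directional
# convolutions; with equal input/output channels C this cuts the kernel
# weights from 27*C*C to 9*C*C.
def asymmetric_3d(c_in, c_out):
    return nn.Sequential(
        nn.Conv3d(c_in, c_out, kernel_size=(3, 1, 1), padding=(1, 0, 0)),   # temporal
        nn.Conv3d(c_out, c_out, kernel_size=(1, 3, 1), padding=(0, 1, 0)),  # vertical
        nn.Conv3d(c_out, c_out, kernel_size=(1, 1, 3), padding=(0, 0, 1)))  # horizontal
```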
Article
Interest in enhancing medical services and healthcare by exploiting recent technological capabilities is growing. An integrable fall detection sensor is an essential component toward achieving smart healthcare solutions. Traditional vision-based methods rely on tracking a skeleton and estimating the change in height of key body parts such as the head, hips, and shoulders. These methods are often challenged by occluded body parts and abrupt posture changes. This paper presents a fall detection system consisting of a novel skeleton-free posture recognition method and an activity recognition stage. The posture recognition method analyzes local variations in depth pixels to identify the adopted posture. An input depth frame acquired using a Kinect-like sensor is densely represented using a depth comparison feature and fed to a random decision forest to discriminate among standing, sitting, and fallen postures. The proposed approach simplifies posture recognition into a simple pixel labeling problem, after which determining the posture is as simple as counting votes from all labeled pixels. The falling event is recognized using a support vector machine. The proposed approach records a sensitivity rate of 99% on synthetic and live datasets as well as a specificity rate of 99% on synthetic datasets and 96% on popular live datasets without invasive accelerometer support.
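A rough sketch of the two ingredients described above, with illustrative helper names: a per-pixel depth-comparison feature and majority voting over the forest's pixel labels to obtain the posture (boundary checks and the forest itself are omitted).

```python
import numpy as np

def depth_comparison_feature(depth, x, y, offsets):
    """Depth-comparison feature for one pixel: differences between depth
    values at pairs of depth-normalized offsets around (x, y); the offset
    pairs are illustrative and bounds checks are omitted."""
    d = max(float(depth[y, x]), 1.0)
    feats = []
    for (du, dv), (du2, dv2) in offsets:
        u = (y + int(dv / d), x + int(du / d))
        v = (y + int(dv2 / d), x + int(du2 / d))
        feats.append(float(depth[u]) - float(depth[v]))
    return np.array(feats)

def classify_posture(pixel_labels):
    """Posture = majority vote over per-pixel labels from the random forest
    (0 = standing, 1 = sitting, 2 = fallen)."""
    return np.bincount(pixel_labels, minlength=3).argmax()
```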
Article
Though recent advanced convolutional neural networks (CNNs) have been improving image recognition accuracy, the models are becoming more complex and time-consuming. For real-world applications in industrial and commercial scenarios, engineers and developers are often faced with a constrained time budget. In this paper, we investigate the accuracy of CNNs under constrained time cost. Under this constraint, network architecture design becomes a matter of trade-offs among factors such as depth, number of filters, and filter sizes. With a series of controlled comparisons, we progressively modify a baseline model while preserving its time complexity. This is also helpful for understanding the importance of these factors in network design. We present an architecture that achieves very competitive accuracy on the ImageNet dataset (11.8% top-5 error, 10-view test), yet is 20% faster than "AlexNet" (16.0% top-5 error, 10-view test).
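The trade-off analysis rests on the usual per-layer time complexity of a convolution, proportional to the number of input channels, the squared filter size, the number of filters, and the squared output map size. The snippet below uses illustrative layer sizes to compare two design choices at similar cost.

```python
# Rough per-layer time complexity of a convolutional layer, following the
# usual n_in * k^2 * n_out * m^2 accounting used when trading off depth,
# filter counts, and filter sizes under a fixed time budget.
def conv_layer_cost(n_in, n_out, kernel, out_map):
    return n_in * (kernel ** 2) * n_out * (out_map ** 2)

# Example trade-off: shrinking the filter from 5 to 3 (~0.36x the cost)
# leaves room to roughly double the number of filters at similar cost.
base = conv_layer_cost(n_in=96, n_out=256, kernel=5, out_map=27)
alt  = conv_layer_cost(n_in=96, n_out=512, kernel=3, out_map=27)
print(alt / base)   # ~0.72; deeper or wider variants are compared the same way
```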
Article
We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to incorporate into the network design aspects of the best performing hand-crafted features. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it matches the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.
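A minimal late-fusion sketch of the two-stream idea, with tiny placeholder backbones: the spatial stream sees one RGB frame, the temporal stream sees a stack of L dense optical-flow fields (2L channels), and class scores are averaged.

```python
import torch
import torch.nn as nn

class TwoStream(nn.Module):
    """Late-fusion two-stream sketch; the backbones are placeholders, not the
    ConvNet architectures used in the paper."""
    def __init__(self, n_classes, flow_len=10):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes))
        self.temporal = nn.Sequential(
            nn.Conv2d(2 * flow_len, 32, 7, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes))

    def forward(self, rgb, flow_stack):
        # Average the per-stream class scores (late fusion)
        scores = self.spatial(rgb).softmax(-1) + self.temporal(flow_stack).softmax(-1)
        return scores / 2
```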
Conference Paper
Despite significant recent progress, the best available visual saliency models still lag behind human performance in predicting eye fixations during free viewing of natural scenes. The majority of models are based on low-level visual features, and the importance of top-down factors has not yet been fully explored or modeled. Here, we combine low-level features such as orientation, color, intensity, and saliency maps of previous best bottom-up models with top-down cognitive visual features (e.g., faces, humans, cars, etc.) and learn a direct mapping from those features to eye fixations using Regression, SVM, and AdaBoost classifiers. By extensive experimentation over three benchmark eye-tracking datasets using three popular evaluation scores, we show that our boosting model outperforms 27 state-of-the-art models and is so far the closest model to human accuracy in fixation prediction. Furthermore, our model successfully detects the most salient object in a scene without sophisticated image processing such as region segmentation.
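A sketch of the learned mapping, assuming each pixel is a sample whose features are the stacked bottom-up and top-down maps; the map count, labels, and AdaBoost settings are placeholders for illustration.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Each pixel becomes a training sample whose features are the values of the
# bottom-up maps (orientation, color, intensity, prior model saliency) plus
# top-down detector maps (face, person, car); the label is fixated / not.
# The maps below are random placeholders with shape (H, W).
H, W = 60, 80
maps = [np.random.rand(H, W) for _ in range(7)]
X = np.stack([m.ravel() for m in maps], axis=1)        # (H*W, 7)
y = (np.random.rand(H * W) < 0.1).astype(int)           # fixation labels

model = AdaBoostClassifier(n_estimators=50).fit(X, y)
saliency = model.predict_proba(X)[:, 1].reshape(H, W)   # learned saliency map
```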
Human Activity Recognition using Smartphones
  • Pinki Pradhan
  • Puneet Tiwari

Human Activity Recognition
  • Akash Kumar
  • Varshini Shenoy