Article

Enhancing temple surveillance through human activity recognition: A novel dataset and YOLOv4-ConvLSTM approach


Abstract

Automated identification of human activities remains a complex endeavor, particularly in unique settings such as temple environments. This study employs machine learning and deep learning techniques to analyze human activities for intelligent temple surveillance. Because standardized datasets tailored to temple surveillance are scarce, specialized data are needed. In response, this research introduces a pioneering dataset featuring eight distinct classes of human activities, predominantly centered on hand gestures and body postures. To identify the most effective solution for Human Activity Recognition (HAR), a comprehensive ablation study is conducted involving a variety of conventional machine learning and deep learning models. By integrating YOLOv4's robust object detection with ConvLSTM's ability to model spatial and temporal dependencies, the approach can recognize and understand human activities in sequences of images or video frames. Notably, the proposed YOLOv4-ConvLSTM approach emerges as the optimal choice, achieving a remarkable accuracy of 93.68%. This outcome underscores the suitability of the outlined methodology for diverse HAR applications in temple environments.
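A minimal sketch of the kind of pipeline the abstract describes, under stated assumptions: person crops supplied by a YOLOv4 detector (not shown) are stacked into a fixed-length clip and classified into one of eight activity classes by a small ConvLSTM network. The clip length, crop size, and layer widths are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 8      # eight temple-activity classes described in the abstract
CLIP_LEN = 16        # assumed number of frames per activity clip
H, W = 64, 64        # assumed size of each resized person crop

def build_convlstm_classifier():
    model = models.Sequential([
        layers.Input(shape=(CLIP_LEN, H, W, 3)),
        # ConvLSTM2D models spatial structure and temporal evolution jointly.
        layers.ConvLSTM2D(32, kernel_size=3, padding="same", return_sequences=True),
        layers.BatchNormalization(),
        layers.ConvLSTM2D(16, kernel_size=3, padding="same"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# In the full pipeline a YOLOv4 detector would supply the person crops;
# here a random clip stands in for one detected-and-tracked person.
clip = np.random.rand(1, CLIP_LEN, H, W, 3).astype("float32")
model = build_convlstm_classifier()
probs = model.predict(clip)                 # shape (1, NUM_CLASSES)
print("predicted activity:", int(np.argmax(probs)))
```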


Article
The rapid increase in population density has posed significant challenges to medical sciences in the auto-detection of various diseases. Intelligent systems play a crucial role in assisting medical professionals with early disease detection and providing consistent treatment, ultimately reducing mortality rates. Skin-related diseases, particularly those that can become severe if not detected early, require timely identification to expedite diagnosis and improve patient outcomes. This paper proposes a transfer learning-based ensemble deep learning model for diagnosing dermatological conditions at an early stage. Data augmentation techniques were employed to increase the number of samples and create a diverse data pattern within the dataset. The study applied ResNet50, InceptionV3, and DenseNet121 transfer learning models, leading to the development of a weighted and average ensemble model. The system was trained and tested using the International Skin Imaging Collaboration (ISIC) dataset. The proposed ensemble model demonstrated superior performance, achieving 98.5% accuracy, 97.50% Kappa, 97.67% MCC (Matthews Correlation Coefficient), and 98.50% F1 score. The model outperformed existing state-of-the-art models in dermatological disease classification and provides valuable support to dermatologists and medical specialists in early disease detection. Compared to previous research, the proposed model offers high accuracy with lower computational complexity, addressing a significant challenge in the classification of skin-related diseases.
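A hedged sketch of the weighted soft-voting ensemble idea described above, combining ResNet50, InceptionV3, and DenseNet121 transfer-learning branches. The class count, input size, and ensemble weights are illustrative assumptions, not the paper's values.

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50, InceptionV3, DenseNet121

NUM_CLASSES = 7            # e.g. ISIC lesion categories (assumed)
INPUT_SHAPE = (224, 224, 3)

def build_branch(backbone_cls):
    base = backbone_cls(weights="imagenet", include_top=False,
                        input_shape=INPUT_SHAPE)
    base.trainable = False                     # transfer learning: freeze features
    x = layers.GlobalAveragePooling2D()(base.output)
    out = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return models.Model(base.input, out)

branches = [build_branch(b) for b in (ResNet50, InceptionV3, DenseNet121)]
weights = [0.4, 0.3, 0.3]                      # assumed per-model weights

def ensemble_predict(image_batch):
    """Weighted average of the three branches' softmax outputs."""
    preds = [w * m.predict(image_batch) for w, m in zip(weights, branches)]
    return np.sum(preds, axis=0)

batch = np.random.rand(2, *INPUT_SHAPE).astype("float32")
print(ensemble_predict(batch).shape)           # (2, NUM_CLASSES)
```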
Article
The timely and precise identification of traffic signs is essential for maintaining the effectiveness and safety of contemporary roads, particularly in light of the increasing number of self-driving cars. Conventional image processing methods have faced challenges because of the intricate and fluctuating variables present in real-world settings, including varied signage, erratic weather, and inconsistent illumination. This study utilizes recent breakthroughs in deep learning, particularly the YOLOv8 (You Only Look Once version 8) model, to tackle these difficulties. YOLOv8 incorporates cutting-edge neural network architectural advancements, such as an anchor-free detection methodology, adaptive spatial feature pooling, and dynamic neural configurations. To further increase detection efficiency and accuracy, this study presents two innovative models, YOLOv8-DH and YOLOv8-TDHSA, which make use of improvements such as decoupled heads and transformer-based self-attention mechanisms. Experimental results indicate that the suggested models substantially surpass current deep learning models, attaining enhanced performance across multiple measures, including accuracy, recall, F-score, and mean average precision (mAP). This research enhances traffic sign detection technology, facilitating the development of safer and more intelligent transportation systems.
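A usage-only sketch of running a pretrained YOLOv8 detector with the ultralytics package, as context for the abstract above. The checkpoint name and image path are placeholders; the paper's YOLOv8-DH and YOLOv8-TDHSA variants are custom architectures that are not reproduced here.

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                             # small pretrained checkpoint
results = model.predict("road_scene.jpg", conf=0.25)   # placeholder image, confidence threshold

for r in results:
    for box in r.boxes:
        cls_id = int(box.cls[0])              # predicted class index
        score = float(box.conf[0])            # detection confidence
        print(model.names[cls_id], round(score, 3), box.xyxy[0].tolist())
```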
Conference Paper
Full-text available
Analyzing an individual's behavior from a video stream is a problem considered by many researchers, with applications in computer-vision-based human activity recognition (HAR). HAR is widely used in surveillance systems, healthcare, online education, and many other areas. There is also growing interest within the HAR research community in work that recognizes multiple faces and predicts their activities. This paper details recent contributions by researchers in this area, gives a comprehensive analysis of the methods adopted, and concludes by examining the accuracy of the various contributions. Finally, the paper outlines future directions for this application.
Article
Full-text available
Renewable energy (RE) power plants are deployed globally because renewable energy sources (RESs) are sustainable, clean, and environmentally friendly. However, the demand for power increases on a daily basis due to population growth, technology, marketing, and the number of installed industries. This challenge has raised a critical issue of how to intelligently match the power generation with the consumption for efficient energy management. To handle this issue, we propose a novel architecture called ‘AB-Net’: a one-step forecast of RE generation for short-term horizons by incorporating an autoencoder (AE) with bidirectional long short-term memory (BiLSTM). Firstly, the data acquisition step is applied, where the data are acquired from various RESs such as wind and solar. The second step performs deep preprocessing of the acquired data via several de-noising and cleansing filters to clean the data and normalize them prior to actual processing. Thirdly, an AE is employed to extract the discriminative features from the cleaned data sequence through its encoder part. BiLSTM is used to learn these features to provide a final forecast of power generation. The proposed AB-Net was evaluated using two publicly available benchmark datasets where the proposed method obtains state-of-the-art results in terms of the error metrics.
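A rough sketch of the AB-Net idea under stated assumptions: a 1-D convolutional encoder compresses a window of past renewable-power readings, and a bidirectional LSTM maps the encoded sequence to a one-step-ahead forecast. The window length and layer sizes are illustrative, not the paper's settings.

```python
import numpy as np
from tensorflow.keras import layers, models

WINDOW = 24                                     # assumed look-back horizon (hours)

inputs = layers.Input(shape=(WINDOW, 1))
# Encoder part of the autoencoder: extract compact discriminative features.
x = layers.Conv1D(32, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling1D(2)(x)
x = layers.Conv1D(16, 3, padding="same", activation="relu")(x)
# BiLSTM learns temporal dependencies over the encoded feature sequence.
x = layers.Bidirectional(layers.LSTM(32))(x)
output = layers.Dense(1)(x)                     # one-step-ahead power forecast

ab_net = models.Model(inputs, output)
ab_net.compile(optimizer="adam", loss="mse")

history = np.random.rand(8, WINDOW, 1).astype("float32")   # toy batch
print(ab_net.predict(history).shape)            # (8, 1)
```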
Article
Full-text available
Sensor-based human activity recognition (S-HAR) has become an important and high-impact topic of research within human-centered computing. In the last decade, successful applications of S-HAR have been presented through fruitful academic research and industrial applications, including for healthcare monitoring, smart home controlling, and daily sport tracking. However, the growing requirements of many current applications for recognizing complex human activities (CHA) have begun to attract the attention of the HAR research field when compared with simple human activities (SHA). S-HAR has shown that deep learning (DL), a type of machine learning based on complicated artificial neural networks, has a significant degree of recognition efficiency. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are two different types of DL methods that have been successfully applied to the S-HAR challenge in recent years. In this paper, we focused on four RNN-based DL models (LSTMs, BiLSTMs, GRUs, and BiGRUs) that performed complex activity recognition tasks. The efficiency of four hybrid DL models that combine convolutional layers with the efficient RNN-based models was also studied. Experimental studies on the UTwente dataset demonstrated that the suggested hybrid RNN-based models achieved a high level of recognition performance along with a variety of performance indicators, including accuracy, F1-score, and confusion matrix. The experimental results show that the hybrid DL model called CNN-BiGRU outperformed the other DL models with a high accuracy of 98.89% when using only complex activity data. Moreover, the CNN-BiGRU model also achieved the highest recognition performance in other scenarios (99.44% by using only simple activity data and 98.78% with a combination of simple and complex activities).
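A minimal CNN-BiGRU sketch in the spirit of the hybrid models described above: 1-D convolutions over raw inertial windows followed by a bidirectional GRU. The channel count, window length, and class count are assumptions, not the UTwente configuration.

```python
import numpy as np
from tensorflow.keras import layers, models

WINDOW, CHANNELS, NUM_CLASSES = 128, 6, 7       # e.g. accelerometer + gyroscope axes (assumed)

model = models.Sequential([
    layers.Input(shape=(WINDOW, CHANNELS)),
    layers.Conv1D(64, 5, activation="relu", padding="same"),
    layers.MaxPooling1D(2),
    layers.Conv1D(64, 5, activation="relu", padding="same"),
    # Bidirectional GRU reads the convolutional feature sequence in both directions.
    layers.Bidirectional(layers.GRU(64)),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

x = np.random.rand(4, WINDOW, CHANNELS).astype("float32")
print(model.predict(x).shape)                   # (4, NUM_CLASSES)
```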
Article
Full-text available
In this article, we present an in-depth comparative analysis of conventional and sequential learning algorithms for electricity load forecasting and optimally select the most appropriate algorithm for energy consumption prediction (ECP). ECP reduces the misuse and wastage of energy using mathematical modeling and supervised learning algorithms. However, the existing ECP research lacks comparative analysis of various algorithms to reach an optimal model with real-world implementation potential and convincingly reduced error rates. Furthermore, these methods are less friendly towards the energy management chain between smart grids and residential buildings, with limited contributions to saving energy resources and maintaining an appropriate equilibrium between energy producers and consumers. Considering these limitations, we dive deep into load forecasting methods, analyze their performance, and finally present a novel three-tier framework for ECP. The first tier applies data preprocessing for refinement and organization prior to the actual training, facilitating effective output generation. The second tier is the learning process, employing ensemble learning algorithms (ELAs) and sequential learning techniques to train over energy consumption data. In the third tier, we obtain the final ECP model, evaluate our method, and visualize the data for energy data analysts. We experimentally prove that deep sequential learning models are dominant over mathematical modeling techniques and their several variants by utilizing available residential electricity consumption data, reaching an optimal proposed model with the smallest mean square error (MSE) of 0.1661 and root mean square error (RMSE) of 0.4075 against recent rivals.
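An illustrative sketch of the comparison idea in the abstract above: an ensemble learner and a sequential (LSTM) model are fitted on the same windowed consumption series and compared by MSE and RMSE. The synthetic series and hyper-parameters are placeholders, not the paper's data or configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from tensorflow.keras import layers, models

rng = np.random.default_rng(0)
series = np.sin(np.arange(500) / 10.0) + 0.1 * rng.standard_normal(500)  # toy load curve

W = 24                                            # assumed look-back window
X = np.stack([series[i:i + W] for i in range(len(series) - W)])
y = series[W:]
X_tr, X_te, y_tr, y_te = X[:400], X[400:], y[:400], y[400:]

# Tier 2a: ensemble learning algorithm.
gbr = GradientBoostingRegressor().fit(X_tr, y_tr)

# Tier 2b: sequential learning model.
lstm = models.Sequential([layers.Input(shape=(W, 1)),
                          layers.LSTM(32), layers.Dense(1)])
lstm.compile(optimizer="adam", loss="mse")
lstm.fit(X_tr[..., None], y_tr, epochs=5, verbose=0)

# Tier 3: evaluation with MSE and RMSE.
for name, pred in [("ensemble", gbr.predict(X_te)),
                   ("LSTM", lstm.predict(X_te[..., None], verbose=0).ravel())]:
    mse = mean_squared_error(y_te, pred)
    print(name, "MSE=%.4f RMSE=%.4f" % (mse, np.sqrt(mse)))
```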
Article
Full-text available
The use of electrical energy is directly proportional to the increase in global population, both concerning growing industrialization and rising residential demand. The need to achieve a balance between electrical energy production and consumption inspires researchers to develop forecasting models for optimal and economical energy use. Mostly, the residential and industrial sectors use metering sensors that only measure the consumed energy but are unable to manage electricity. In this paper, we present a comparative analysis of a variety of deep features with several sequential learning models to select the optimized hybrid architecture for energy consumption prediction. The best results are achieved using convolutional long short-term memory (ConvLSTM) integrated with bidirectional long short-term memory (BiLSTM). The ConvLSTM initially extracts features from the input data to produce encoded sequences that are decoded by BiLSTM and then proceeds with a final dense layer for energy consumption prediction. The overall framework consists of preprocessing raw data, extracting features, training the sequential model, and then evaluating it. The proposed energy consumption prediction model outperforms existing models over publicly available datasets, including Household and Korean commercial building datasets.
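A sketch of the ConvLSTM-BiLSTM hybrid described above, under assumptions about window length and feature count: a ConvLSTM layer encodes the input window into a feature sequence that a BiLSTM then decodes into a consumption forecast. Layer sizes are illustrative only.

```python
import numpy as np
from tensorflow.keras import layers, models

WINDOW, FEATURES = 60, 4          # assumed minutes of history and sub-metering channels

model = models.Sequential([
    # ConvLSTM1D expects (time, rows, channels); each step is a 1-D "row" of features.
    layers.Input(shape=(WINDOW, FEATURES, 1)),
    layers.ConvLSTM1D(32, kernel_size=3, padding="same", return_sequences=True),
    layers.TimeDistributed(layers.Flatten()),    # flatten spatial dims per time step
    layers.Bidirectional(layers.LSTM(32)),       # decode the encoded sequence
    layers.Dense(1),                             # predicted energy consumption
])
model.compile(optimizer="adam", loss="mse")

x = np.random.rand(2, WINDOW, FEATURES, 1).astype("float32")
print(model.predict(x).shape)                    # (2, 1)
```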
Conference Paper
Full-text available
Falls are one of the primary causes of both fatal and non-fatal injuries among the elderly. Falls in the elderly may have various consequences and, in serious cases, may cause death. Timely treatment is critical, as immediate treatment may reduce the risk of serious injuries, so detection should be automated and accurate. This paper presents an image-based fall detection system that integrates the YOLO object detection algorithm with a fall detection algorithm. The system first tracks the person in the video frame using the object detection algorithm, and the fall detection algorithm then tracks the person's height to detect fall events immediately and accurately and notify caregivers. The system was evaluated under different use cases and conditions. The results show that the system can detect fall events with an accuracy of 92% under daylight conditions and 60% under low-light conditions. An email notification is sent as an alarm to notify caregivers whenever a fall event is detected. The system's quick fall detection and notification help ensure that the safety of the elderly is well monitored and that timely treatment can take place when a fall is detected.
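A toy sketch of the height-tracking rule outlined above: track a person's bounding-box height from a detector (YOLO in the paper) and flag a fall when the height drops sharply and the box becomes wider than it is tall. The detector call and the thresholds are placeholders and assumptions.

```python
def is_fall(prev_box, curr_box, drop_ratio=0.6):
    """Boxes are (x1, y1, x2, y2) in pixels; returns True if a fall is suspected."""
    prev_h = prev_box[3] - prev_box[1]
    curr_h = curr_box[3] - curr_box[1]
    curr_w = curr_box[2] - curr_box[0]
    height_dropped = curr_h < drop_ratio * prev_h     # person suddenly much "shorter"
    lying_shape = curr_w > curr_h                      # box wider than tall
    return height_dropped and lying_shape

# Example: a standing box followed by a low, wide box triggers the alarm.
standing = (100, 50, 160, 250)    # ~200 px tall
fallen = (80, 200, 260, 260)      # ~60 px tall, 180 px wide
if is_fall(standing, fallen):
    print("fall detected -> send email notification to caregiver")
```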
Conference Paper
Full-text available
Object detection algorithms such as You Only Look Once (YOLOv3 and YOLOv4) are implemented for traffic and surveillance applications. A neural network consists of an input layer, at least one hidden layer, and an output layer. The multiple-object dataset (KITTI images and video) consists of classes such as car, truck, person, and two-wheeler, captured as RGB and grayscale images under varying illumination. Among the YOLO model variants, YOLOv3 is implemented for the image dataset and YOLOv4 for the video dataset. The obtained results show that the algorithms effectively detect objects with an accuracy of approximately 98% for the image dataset and 99% for the video dataset.
Article
Full-text available
Mitochondrial proteins of Plasmodium falciparum (MPPF) are an important target for anti-malarial drugs, but their identification through manual experimentation is costly, and in turn, the production of related drugs by pharmaceutical institutions involves a prolonged time duration. Therefore, it is highly desirable for pharmaceutical companies to develop a computationally automated and reliable approach to identify proteins precisely, resulting in appropriate drug production in a timely manner. In this direction, several computationally intelligent techniques have been developed to extract local features from biological sequences using machine learning methods followed by various classifiers to discriminate the nature of proteins. Unfortunately, these techniques demonstrate poor performance while capturing contextual features from sequence patterns, yielding non-representative classifiers. In this paper, we propose a sequence-based framework to extract deep and representative features that are trustworthy for Plasmodium mitochondrial protein identification. The backbone of the proposed framework is the MPPF identification network (MPPIF-Net), which is based on a convolutional neural network (CNN) with multilayer bi-directional long short-term memory (MBD-LSTM). MPPIF-Net takes protein sequences as input and passes them through various convolution and pooling layers to optimally extract learned features. We pass these features into our sequence learning mechanism, MBD-LSTM, which is particularly trained to classify them into their relevant classes. Our proposed model is experimentally evaluated on the newly prepared dataset PF2095 and two existing benchmark datasets, i.e., PF175 and MPD, using the holdout method. The proposed method achieved 97.6%, 97.1%, and 99.5% testing accuracy on the PF2095, PF175, and MPD datasets, respectively, outperforming state-of-the-art approaches.
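A schematic version of the CNN-plus-stacked-BiLSTM idea described above, under assumptions: amino-acid tokens are embedded, passed through convolution and pooling layers, and then through stacked bidirectional LSTMs for binary classification. Vocabulary size, sequence length, and layer widths are illustrative only, not the paper's MPPIF-Net configuration.

```python
import numpy as np
from tensorflow.keras import layers, models

VOCAB, MAX_LEN = 21, 500          # 20 amino acids + padding token (assumed)

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB, 32),
    layers.Conv1D(64, 7, activation="relu", padding="same"),
    layers.MaxPooling1D(3),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),  # stacked (multilayer) BiLSTM
    layers.Bidirectional(layers.LSTM(32)),
    layers.Dense(1, activation="sigmoid"),       # mitochondrial vs. non-mitochondrial
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

seqs = np.random.randint(1, VOCAB, size=(4, MAX_LEN))   # toy encoded protein sequences
print(model.predict(seqs).shape)                 # (4, 1)
```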
Article
Full-text available
In this paper we address the problem of multi-cue affect recognition in challenging scenarios such as child-robot interaction. Towards this goal we propose a method for automatic recognition of affect that leverages body expressions alongside facial ones, as opposed to traditional methods that typically focus only on the latter. Our deep-learning based method uses hierarchical multi-label annotations and multi-stage losses, can be trained both jointly and separately, and offers us computational models for both individual modalities, as well as for the whole body emotion. We evaluate our method on a challenging child-robot interaction database of emotional expressions collected by us, as well as on the GEMEP public database of acted emotions by adults, and show that the proposed method achieves significantly better results than facial-only expression baselines.
Conference Paper
Full-text available
People express emotions through different modalities. Integration of verbal and non-verbal communication channels creates a system in which the message is easier to understand. Expanding the focus to several expression forms can facilitate research on emotion recognition as well as human-machine interaction. In this article, the authors present a Polish emotional database composed of three modalities: facial expressions, body movement and gestures, and speech. The corpus contains recordings registered in studio conditions, acted out by 16 professional actors (8 male and 8 female). The data is labeled with six basic emotion categories, according to Ekman's emotion categories. To check the quality of performance, all recordings are evaluated by experts and volunteers. The database is available to the academic community and might be useful in the study of audiovisual emotion recognition.
Article
Full-text available
Automatic emotion recognition has become a trending research topic in the past decade. While works based on facial expressions or speech abound, recognizing affect from body gestures remains a less explored topic. We present a new comprehensive survey hoping to boost research in the field. We first introduce emotional body gestures as a component of what is commonly known as "body language" and comment on general aspects such as gender differences and culture dependence. We then define a complete framework for automatic emotional body gesture recognition. We introduce person detection and comment on static and dynamic body pose estimation methods in both RGB and 3D. We then comment on the recent literature related to representation learning and emotion recognition from images of emotionally expressive gestures. We also discuss multi-modal approaches that combine speech or face with body gestures for improved emotion recognition. While pre-processing methodologies (e.g. human detection and pose estimation) are nowadays mature technologies fully developed for robust large-scale analysis, we show that for emotion recognition the quantity of labelled data is scarce, there is no agreement on clearly defined output spaces, and the representations are shallow and largely based on naive geometrical representations.
Data
Full-text available
The JAFFE images may be used for non-commercial research provided the user agrees to a few conditions. More information about obtaining the images and a download link is available at the page: http://www.kasrl.org/jaffe.html
Article
Full-text available
Human motion modelling is a classical problem at the intersection of graphics and computer vision, with applications spanning human-computer interaction, motion synthesis, and motion prediction for virtual and augmented reality. Following the success of deep learning methods in several computer vision tasks, recent work has focused on using deep recurrent neural networks (RNNs) to model human motion, with the goal of learning time-dependent representations that perform tasks such as short-term motion prediction and long-term human motion synthesis. We examine recent work, with a focus on the evaluation methodologies commonly used in the literature, and show that, surprisingly, state-of-the-art performance can be achieved by a simple baseline that does not attempt to model motion at all. We investigate this result, and analyze recent RNN methods by looking at the architectures, loss functions, and training procedures used in state-of-the-art approaches. We propose three changes to the standard RNN models typically used for human motion, which result in a simple and scalable RNN architecture that obtains state-of-the-art performance on human motion prediction.
Conference Paper
Full-text available
In this paper, we propose an Internet of Things (IoT) system application for remote medical monitoring. The body pressure distribution is acquired through a pressure-sensing mattress under the person's body, data is sent to a computer workstation for processing, and results are communicated for monitoring and diagnosis. The area of application of such a system is large in the medical domain, making the system convenient for clinical use such as in sleep studies, non- or partially anesthetic surgical procedures, medical-imaging techniques, and other areas involving the determination of body posture on a mattress. In this vein, we propose a novel method for human body posture recognition that provides an optimal combination of signal acquisition, processing, and data storage to perform the recognition task on a quasi-real-time basis. A supervised learning approach was used to build a model using robust synthetic data. The data was generated beforehand in a way that enhances and generalizes the recognition capability while maintaining both geometrical and spatial performance. Low cost and fast per-sample processing, along with autonomy, make the system suitable for long-term operation and IoT applications. The recognition results, with a Cohen's Kappa coefficient κ = 0.866, were sufficiently encouraging for further investigation in this field.
Conference Paper
Full-text available
The recent advancement of social media has given users a platform to socially engage and interact with a global population. With millions of images being uploaded onto social media platforms, there is an increasing interest in inferring the emotion and mood display of a group of people in images. Automatic affect analysis research has come a long way but has traditionally focussed on a single subject in a scene. In this paper, we study the problem of inferring the emotion of a group of people in an image. This group affect has wide applications in retrieval, advertisement, content recommendation and security. The contributions of the paper are: 1) a novel emotion labelled database of groups of people in images; 2) a Multiple Kernel Learning based hybrid affect inference model; 3) a scene context based affect inference model; 4) a user survey to better understand the attributes that affect the perception of affect of a group of people in an image. The detailed experimentation validation provides a rich baseline for the proposed database.
Conference Paper
Full-text available
We present a novel method for real-time continuous pose recovery of markerless complex articulable objects from a single depth image. Our method consists of the following stages: a randomized decision forest classifier for image segmentation, a robust method for labeled dataset generation, a convolutional network for dense feature extraction, and finally an inverse kinematics stage for stable real-time pose recovery. As one possible application of this pipeline, we show state-of-the-art results for real-time puppeteering of a skinned hand-model.
Conference Paper
Full-text available
Recognizing human faces in the wild is emerging as a critically important, and technically challenging, computer vision problem. With a few notable exceptions, most previous works in the last several decades have focused on recognizing faces captured in a laboratory setting. However, with the introduction of databases such as LFW and Pubfigs, the face recognition community is gradually shifting its focus to much more challenging unconstrained settings. Since its introduction, the LFW verification benchmark has received a lot of attention, with various researchers contributing towards state-of-the-art results. To further boost unconstrained face recognition research, we introduce a more challenging Indian Movie Face Database (IMFDB) that has much more variability compared to LFW and Pubfigs. The database consists of 34512 faces of 100 known actors collected from approximately 103 Indian movies. Unlike LFW and Pubfigs, which used face detectors to automatically detect the faces from web collections, faces in IMFDB are detected manually from all the movies. Manual selection of faces from movies resulted in a high degree of variability (in scale, pose, expression, illumination, age, occlusion, makeup) such as one could see in the natural world. IMFDB is the first face database that provides detailed annotation in terms of age, pose, gender, expression, and amount of occlusion for each face, which may help other face-related applications.
Conference Paper
Full-text available
Human pose estimation has made significant progress during the last years. However, current datasets are limited in their coverage of the overall pose estimation challenges. Still, these serve as the common sources to evaluate, train and compare different models on. In this paper we introduce a novel benchmark, "MPII Human Pose", that makes a significant advance in terms of diversity and difficulty, a contribution that we feel is required for future developments in human body models. This comprehensive dataset was collected using an established taxonomy of over 800 human activities [1]. The collected images cover a wider variety of human activities than previous datasets, including various recreational, occupational and householding activities, and capture people from a wider range of viewpoints. We provide a rich set of labels including positions of body joints, full 3D torso and head orientation, occlusion labels for joints and body parts, and activity labels. For each image we provide adjacent video frames to facilitate the use of motion information. Given these rich annotations we perform a detailed analysis of leading human pose estimation approaches, gaining insights into the successes and failures of these methods.
Article
Full-text available
Facial expression is central to human experience. Its efficiency and valid measurement are challenges that automated facial image analysis seeks to address. Most publicly available databases are limited to 2D static images or video of posed facial behavior. Because posed and un-posed (aka “spontaneous”) facial expressions differ along several dimensions including complexity and timing, well-annotated video of un-posed facial behavior is needed. Moreover, because the face is a three-dimensional deformable object, 2D video may be insufficient, and therefore 3D video archives are required. We present a newly developed 3D video database of spontaneous facial expressions in a diverse group of young adults. Well-validated emotion inductions were used to elicit expressions of emotion and paralinguistic communication. Frame-level ground-truth for facial actions was obtained using the Facial Action Coding System. Facial features were tracked in both 2D and 3D domains. To the best of our knowledge, this new database is the first of its kind for the public. The work promotes the exploration of 3D spatiotemporal features in subtle facial expression, better understanding of the relation between pose and motion dynamics in facial action units, and deeper understanding of naturally occurring facial action.
Conference Paper
Full-text available
Although action recognition in videos is widely studied, current methods often fail on real-world datasets. Many recent approaches improve accuracy and robustness to cope with challenging video sequences, but it is often unclear what affects the results most. This paper attempts to provide insights based on a systematic performance evaluation using thoroughly-annotated data of human actions. We annotate human Joints for the HMDB dataset (J-HMDB). This annotation can be used to derive ground truth optical flow and segmentation. We evaluate current methods using this dataset and systematically replace the output of various algorithms with ground truth. This enables us to discover what is important - for example, should we work on improving flow algorithms, estimating human bounding boxes, or enabling pose estimation? In summary, we find that high-level pose features greatly outperform low/mid level features, in particular, pose over time is critical, but current pose estimation algorithms are not yet reliable enough to provide this information. We also find that the accuracy of a top-performing action recognition framework can be greatly increased by refining the underlying low/mid level features, this suggests it is important to improve optical flow and human detection algorithms. Our analysis and J-HMDB dataset should facilitate a deeper understanding of action recognition algorithms.
Conference Paper
Full-text available
Typical approaches to articulated pose estimation combine spatial modelling of the human body with appearance modelling of body parts. This paper aims to push the state-of-the-art in articulated pose estimation in two ways. First we explore various types of appearance representations aiming to substantially improve the body part hypotheses. And second, we draw on and combine several recently proposed powerful ideas such as more flexible spatial models as well as image-conditioned spatial models. In a series of experiments we draw several important conclusions: (1) we show that the proposed appearance representations are complementary, (2) we demonstrate that even a basic tree-structure spatial human body model achieves state-of-the-art performance when augmented with the proper appearance representation, and (3) we show that the combination of the best performing appearance model with a flexible image-conditioned spatial model achieves the best result, significantly improving over the state of the art, on the "Leeds Sports Poses" and "Parse" benchmarks.
Conference Paper
Full-text available
In this paper, we explored the use of features that represent body posture and movement for automatically detecting people's emotions in non-acted standing scenarios. We focused on four emotions that are often observed when people are playing video games: triumph, frustration, defeat, and concentration. The dataset consists of recordings of the rotation angles of the player's joints while playing Wii sports games. We applied various machine learning techniques and bagged them for prediction. When body pose and movement features are used we can reach an overall accuracy of 66.5% for differentiating between these four emotions. In contrast, when using the raw joint rotations, limb rotation movement, or posture features alone, we were only able to achieve accuracy rates of 59%, 61%, and 62% respectively. Our results suggest that features representing changes in body posture can yield improved classification rates over using static postures or joint information alone.
Conference Paper
Full-text available
Recent years have seen increased research in the automatic recognition of emotions in whole-body gestures. However, most approaches rely on emotional models that are still being contested or require an obtrusive way of collecting the data. We study primitive postures based on 7 primary-process and clinically measured emotions. We portray postures from theatre in front of a motion capture sensor and conduct online surveys to discriminate primary-process emotions. We analyze low-level features from postural joint data and reveal RAGE patterns which we will use in future real-time affective interactions.
Article
Full-text available
Thanks to the decreasing cost of whole-body sensing technology and its increasing reliability, there is an increasing interest in, and understanding of, the role played by body expressions as a powerful affective communication channel. The aim of this survey is to review the literature on affective body expression perception and recognition. One issue is whether there are universal aspects to affect expression perception and recognition models or if they are affected by human factors such as culture. Next, we discuss the difference between form and movement information as studies have shown that they are governed by separate pathways in the brain. We also review psychological studies that have investigated bodily configurations to evaluate if specific features can be identified that contribute to the recognition of specific affective states. The survey then turns to automatic affect recognition systems using body expressions as at least one input modality. The survey ends by raising open questions on data collecting, labeling, modeling, and setting benchmarks for comparing automatic recognition systems.
Article
In the era of cutting-edge technology, demand for electricity is rising day by day due to the exponential growth of population, electricity-reliant vehicles, and home appliances. Precise energy consumption prediction (ECP) and integrated local energy systems (ILES) are critical to boost clean energy management systems between consumers and suppliers. Various obstacles such as environmental factors and occupant behavior affect the performance of existing approaches for long- and short-term ECP. Thus, to address such concerns, we present a novel hybrid network model ‘DB-Net’ by incorporating a dilated convolutional neural network (DCNN) with bidirectional long short-term memory (BiLSTM). The proposed approach allows efficient control of power energy in ILES between consumer and supplier when employed for long- and short-term ECP. The first phase combines data acquisition and refinement procedures into a preprocessing module whose main goal is to optimize the collected data and handle outliers. In the next phase, the refined data is passed into DCNN layers for feature encoding, followed by BiLSTM layers to learn hidden sequential patterns and decode the feature maps. In the final phase, the DB-Net model forecasts multi-step power consumption (PC), including hourly, daily, weekly, and monthly output. The proposed approach attains better predictive performance than existing methods, thereby confirming its effectiveness.
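A hedged sketch of the DB-Net layout described above (dilated 1-D convolutions followed by a BiLSTM), with an assumed window length and dilation rates and a single-step output rather than the paper's full hourly/daily/weekly/monthly multi-step forecasts.

```python
import numpy as np
from tensorflow.keras import layers, models

WINDOW = 168                                      # assumed one week of hourly readings

model = models.Sequential([
    layers.Input(shape=(WINDOW, 1)),
    # Dilated causal convolutions enlarge the receptive field without pooling.
    layers.Conv1D(32, 3, dilation_rate=1, padding="causal", activation="relu"),
    layers.Conv1D(32, 3, dilation_rate=2, padding="causal", activation="relu"),
    layers.Conv1D(32, 3, dilation_rate=4, padding="causal", activation="relu"),
    layers.Bidirectional(layers.LSTM(64)),        # decode the dilated feature maps
    layers.Dense(1),                              # next-step power consumption
])
model.compile(optimizer="adam", loss="mse")

x = np.random.rand(2, WINDOW, 1).astype("float32")
print(model.predict(x).shape)                     # (2, 1)
```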
Article
3D skeleton-based action recognition and motion prediction are two essential problems of human activity understanding. In many previous works: 1) they studied two tasks separately, neglecting internal correlations; 2) they did not capture sufficient relations inside the body. To address these issues, we propose a symbiotic model to handle two tasks jointly; and we propose two scales of graphs to explicitly capture relations among body-joints and body-parts. Together, we propose symbiotic graph neural networks, which contain a backbone, an action-recognition head, and a motion-prediction head. Two heads are trained jointly and enhance each other. For the backbone, we propose multi-branch multiscale graph convolution networks to extract spatial and temporal features. The multiscale graph convolution networks are based on joint-scale and part-scale graphs. The joint-scale graphs contain actional graphs, capturing action-based relations, and structural graphs, capturing physical constraints. The part-scale graphs integrate body-joints to form specific parts, representing high-level relations. Moreover, dual bone-based graphs and networks are proposed to learn complementary features. We conduct extensive experiments for skeleton-based action recognition and motion prediction with four datasets, NTU-RGB+D, Kinetics, Human3.6M, and CMU Mocap. Experiments show that our symbiotic graph neural networks achieve better performances on both tasks compared to the state-of-the-art methods.
Article
Fall events are one of the greatest risks to public safety, especially in complex scenes with a large number of people. Nevertheless, there is little research on fall detection in complex scenes, and no public datasets. We construct a fall event dataset in crowded and complex scenes. Aiming at detecting fall events in complex scenes, we further propose an attention-guided LSTM model. Our method provides the spatial and temporal locations of fall events, which are indispensable information for danger alarms in complex public scenes. Specifically, the effective YOLO v3 is employed to detect pedestrians in videos, followed by a tracking module. CNN features are extracted for each tracked bounding box. Fall events are detected by the attention-guided LSTM. Experimental results show that our method achieves good performance, outperforming state-of-the-art methods.
Conference Paper
Skeleton-based human action recognition has recently drawn increasing attention with the availability of large-scale skeleton datasets. The most crucial factors for this task lie in two aspects: the intra-frame representation of joint co-occurrences and the inter-frame representation of skeletons' temporal evolution. In this paper we propose an end-to-end convolutional co-occurrence feature learning framework. The co-occurrence features are learned with a hierarchical methodology, in which different levels of contextual information are aggregated gradually. Firstly, point-level information of each joint is encoded independently. Then it is assembled into semantic representations in both the spatial and temporal domains. Specifically, we introduce a global spatial aggregation scheme, which is able to learn superior joint co-occurrence features over local aggregation. Besides, raw skeleton coordinates as well as their temporal differences are integrated with a two-stream paradigm. Experiments show that our approach consistently outperforms other state-of-the-art methods on action recognition and detection benchmarks such as NTU RGB+D, SBU Kinect Interaction and PKU-MMD.
Article
Fall detection is an important public healthcare problem. Timely detection could enable instant delivery of medical service to the injured. A popular non-intrusive solution for fall detection is based on videos obtained through an ambient camera, and the corresponding methods usually require a large dataset to train a classifier and are inclined to be influenced by image quality. However, it is hard to collect fall data, and instead simulated falls are recorded to construct the training dataset, which is restricted to limited quantity. To address these problems, a three-dimensional convolutional neural network (3D CNN) based method for fall detection is developed which only uses video kinematic data to train an automatic feature extractor, circumventing the large fall dataset usually required by deep learning solutions. A 2D CNN can only encode spatial information, whereas the employed 3D convolution extracts motion features from the temporal sequence, which is important for fall detection. To further locate the region of interest in each frame, an LSTM (Long Short-Term Memory) based spatial visual attention scheme is incorporated. The sports dataset Sports-1M, which contains no fall examples, is employed to train the 3D CNN, which is then combined with the LSTM to train a classifier on a fall dataset. Experiments verified the proposed scheme on a fall detection benchmark with accuracy as high as 100%. Superior performance has also been obtained on other activity databases.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Conference Paper
Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we explore ways to scale up networks that aim at utilizing the added computation as efficiently as possible through suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set and demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single-frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and using fewer than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error and 17.3% top-1 error.
Conference Paper
A system for recognizing emotions from videos by studying facial expressions, hand gestures and body postures is presented. A stochastic context-free grammar (SCFG) containing 8 combinations of hand gestures and body postures for each emotion is used and we show that increasing the number of combinations in SCFG improves the system's generalization for new hand gesture and body posture combinations. We show that hand gestures and body postures contribute to improving the emotion recognition rate with up to 5% for Anger, Sadness and Fear compared to the standard facial emotion recognition system, while for Happiness, Surprise and Disgust no significant improvement was noticed.
Conference Paper
Supporting learning with rich natural language dialogue has been the focus of increasing attention in recent years. Many adaptive learning environments model students' natural language input, and there is growing recognition that these systems can be improved by leveraging multimodal cues to understand learners better. This paper investigates multimodal features related to posture and gesture for the task of classifying students' dialogue acts within tutorial dialogue. In order to accelerate the modeling process by eliminating the manual annotation bottleneck, a fully unsupervised machine learning approach is utilized for this task. The results indicate that these unsupervised models are significantly improved with the addition of automatically extracted posture and gesture information. Further, even in the absence of any linguistic features, a model that utilizes posture and gesture features alone performed significantly better than a majority class baseline. This work represents a step toward achieving better understanding of student utterances by incorporating multimodal features within adaptive learning environments. Additionally, the technique presented here is scalable to very large student datasets.
Conference Paper
Recent years have seen a growing recognition of the central role of affect and motivation in learning. In particular, nonverbal behaviors such as posture and gesture provide key channels signaling affective and motivational states. Developing a clear understanding of these mechanisms will inform the development of personalized learning environments that promote successful affective and motivational outcomes. This paper investigates posture and gesture in computer-mediated tutorial dialogue using automated techniques to track posture and hand-to-face gestures. Annotated dialogue transcripts were analyzed to identify the relationships between student posture, student gesture, and tutor and student dialogue. The results indicate that posture and hand-to-face gestures are significantly associated with particular tutorial dialogue moves. Additionally, two-hands-to-face gestures occurred significantly more frequently among students with low self-efficacy. The results shed light on the cognitive-affective mechanisms that underlie these nonverbal behaviors. Collectively, the findings provide insight into the interdependencies among tutorial dialogue, posture, and gesture, revealing a new avenue for automated tracking of embodied affect during learning.
Article
One of the challenges in affect recognition is accurate estimation of the emotion intensity level. This research proposes development of an affect intensity estimation model based on a weighted sum of classification confidence levels, displacement of feature points and speed of feature point motion. The parameters of the model were calculated from data captured using multiple modalities such as face, body posture, hand movement and speech. A preliminary study was conducted to compare the accuracy of the model with the annotated intensity levels. An emotion intensity scale ranging from 0 to 1 along the arousal dimension in the emotion space was used. Results indicated speech and hand modality significantly contributed in improving accuracy in emotion intensity estimation using the proposed model.
Article
Video affective content analysis has been an active research area in recent decades, since emotion is an important component in the classification and retrieval of videos. Video affective content analysis can be divided into two approaches: Direct and implicit. Direct approaches infer the affective content of videos directly from related audiovisual features. Implicit approaches, on the other hand, detect affective content from videos based on an automatic analysis of a user's spontaneous response while consuming the videos. This paper first proposes a general framework for video affective content analysis, which includes video content, emotional descriptors, and users' spontaneous nonverbal responses, as well as the relationships between the three. Then, we survey current research in both direct and implicit video affective content analysis, with a focus on direct video affective content analysis. Lastly, we identify several challenges in this field and put forward recommendations for future research.
Conference Paper
We propose a multimodal, decomposable model for articulated human pose estimation in monocular images. A typical approach to this problem is to use a linear structured model, which struggles to capture the wide range of appearance present in realistic, unconstrained images. In this paper, we instead propose a model of human pose that explicitly captures a variety of pose modes. Unlike other multimodal models, our approach includes both global and local pose cues and uses a convex objective and joint training for mode selection and pose estimation. We also employ a cascaded mode selection step which controls the trade-off between speed and accuracy, yielding a 5x speedup in inference and learning. Our model outperforms state-of-the-art approaches across the accuracy-speed trade-off curve for several pose datasets. This includes our newly-collected dataset of people in movies, FLIC, which contains an order of magnitude more labeled data for training and testing than existing datasets.
Conference Paper
We propose a novel approach for human pose estimation in real-world cluttered scenes, and focus on the challenging problem of predicting the pose of both arms for each person in the image. For this purpose, we build on the notion of poselets [4] and train highly discriminative classifiers to differentiate among arm configurations, which we call armlets. We propose a rich representation which, in addition to standard HOG features, integrates the information of strong contours, skin color and contextual cues in a principled manner. Unlike existing methods, we evaluate our approach on a large subset of images from the PASCAL VOC detection dataset, where critical visual phenomena, such as occlusion, truncation, multiple instances and clutter are the norm. Our approach outperforms Yang and Ramanan [26], the state-of-the-art technique, with an improvement from 29.0% to 37.5% PCP accuracy on the arm keypoint prediction task, on this new pose estimation dataset.
Conference Paper
In this work, we address the problem of estimating 2d human pose from still images. Recent methods that rely on discriminatively trained deformable parts organized in a tree model have shown to be very successful in solving this task. Within such a pictorial structure framework, we address the problem of obtaining good part templates by proposing novel, non-linear joint regressors. In particular, we employ two-layered random forests as joint regressors. The first layer acts as a discriminative, independent body part classifier. The second layer takes the estimated class distributions of the first one into account and is thereby able to predict joint locations by modeling the interdependence and co-occurrence of the parts. This results in a pose estimation framework that takes dependencies between body parts already for joint localization into account and is thus able to circumvent typical ambiguities of tree structures, such as for legs and arms. In the experiments, we demonstrate that our body parts dependent joint regressors achieve a higher joint localization accuracy than tree-based state-of-the-art methods.
Article
We describe a method for articulated human detection and human pose estimation in static images based on a new representation of deformable part models. Rather than modeling articulation using a family of warped (rotated and foreshortened) templates, we use a mixture of small, nonoriented parts. We describe a general, flexible mixture model that jointly captures spatial relations between part locations and co-occurrence relations between part mixtures, augmenting standard pictorial structure models that encode just spatial relations. Our models have several notable properties: 1) They efficiently model articulation by sharing computation across similar warps, 2) they efficiently model an exponentially large set of global mixtures through composition of local mixtures, and 3) they capture the dependency of global geometry on local appearance (parts look different at different locations). When relations are tree structured, our models can be efficiently optimized with dynamic programming. We learn all parameters, including local appearances, spatial relations, and co-occurrence relations (which encode local rigidity) with a structured SVM solver. Because our model is efficient enough to be used as a detector that searches over scales and image locations, we introduce novel criteria for evaluating pose estimation and human detection, both separately and jointly. We show that currently used evaluation criteria may conflate these two issues. Most previous approaches model limbs with rigid and articulated templates that are trained independently of each other, while we present an extensive diagnostic evaluation that suggests that flexible structure and joint training are crucial for strong performance. We present experimental results on standard benchmarks that suggest our approach is the state-of-the-art system for pose estimation, improving past work on the challenging Parse and Buffy datasets while being orders of magnitude faster.
Article
Nonparametric regression is a set of techniques for estimating a regression curve without making strong assumptions about the shape of the true regression function. These techniques are therefore useful for building and checking parametric models, as well as for data description. Kernel and nearest-neighbor regression estimators are local versions of univariate location estimators, and so they can readily be introduced to beginning students and consulting clients who are familiar with such summaries as the sample mean and median.
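A small worked example of kernel (Nadaraya-Watson) regression as described above: the estimate at a query point is a locally weighted mean of the observed responses, with Gaussian weights controlled by a bandwidth h chosen here arbitrarily for illustration.

```python
import numpy as np

def nadaraya_watson(x_query, x_obs, y_obs, h=0.3):
    """Gaussian-kernel weighted average of y_obs around each query point."""
    # weights[i, j] = K((x_query[i] - x_obs[j]) / h)
    weights = np.exp(-0.5 * ((x_query[:, None] - x_obs[None, :]) / h) ** 2)
    return (weights @ y_obs) / weights.sum(axis=1)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))
y = np.sin(x) + 0.2 * rng.standard_normal(100)    # noisy sine curve

grid = np.linspace(0, 2 * np.pi, 5)
print(np.round(nadaraya_watson(grid, x, y), 3))   # smooth estimate close to sin(grid)
```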
Article
In this article Dr Belson describes a technique for matching population samples. This depends upon the combination of empirically developed predictors to give the best available predictive, or matching, composite. The underlying principle is quite distinct from that inherent in the multiple correlation method.
Book