Figure 1 - uploaded by Altaf Hussain
The proposed camera prioritisation framework for a large-scale surveillance network. In the preprocessing step, the surveillance video streams are processed using a background subtraction method to extract only motion-specific frames. In the second step, a 3DCNN extracts spatio-temporal features from the sequence of frames to classify the underlying activity. Finally, when a violent activity is detected in a video stream, the corresponding camera is assigned a high priority.

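The preprocessing step in the caption above, keeping only motion-specific frames before they reach the 3DCNN, can be illustrated with simple frame differencing. This is a minimal numpy-only sketch, not the paper's background subtraction method; the function name and the `threshold`/`min_changed_ratio` parameters are hypothetical defaults chosen for illustration.

```python
import numpy as np

def motion_frames(frames, threshold=25, min_changed_ratio=0.01):
    """Indices of frames whose pixel change vs. the previous frame
    exceeds min_changed_ratio of the image area (illustrative sketch)."""
    selected = []
    prev = None
    for i, frame in enumerate(frames):
        f = frame.astype(np.int16)  # avoid uint8 wrap-around on subtraction
        if prev is not None:
            changed = np.abs(f - prev) > threshold
            if changed.mean() > min_changed_ratio:
                selected.append(i)
        prev = f
    return selected
```

Frames that pass this filter would then be batched into clips for the 3DCNN, so static scenes never consume classifier compute.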

Similar publications

Article
Full-text available
Three-dimensional (3D) real-scale models derived from digital photogrammetric techniques have rapidly increased to meet the requirements of many applications in different fields of daily life. This paper deals with establishing a 3D real-scale model from a block of 18 images captured using a Canon EOS 500D digital camera to cover a test...

Citations

... Thereby, 3DCNNs were proposed, which are able to learn spatial as well as temporal features. In this direction, different methods have been developed, for example pseudo-3D CNNs [11], MiCT-Net [12], Inflated 3DCNN [13], and 3DCNN [14] for automatic HAR. However, the computational complexity of 3DCNNs increases exponentially when recognizing lengthy videos, and achieving higher performance further requires large-scale pre-trained video datasets [14,15]. ...
... In this direction, different methods have been developed, for example pseudo-3D CNNs [11], MiCT-Net [12], Inflated 3DCNN [13], and 3DCNN [14] for automatic HAR. However, the computational complexity of 3DCNNs increases exponentially when recognizing lengthy videos, and achieving higher performance further requires large-scale pre-trained video datasets [14,15]. Therefore, to tackle the computation problem, researchers have used hybrid approaches in which spatial features are extracted by a 2DCNN and spatiotemporal learning is then performed by variants of the Recurrent Neural Network (RNN) for HAR [16][17][18]. ...
... However, HAR using only spatial features requires less computation, but researchers have found that for accurate HAR in real-world environments spatial information alone is not enough; temporal information also needs to be analyzed [37]. In this direction, conventional 2DCNN methods have been upgraded to 3-Dimensional CNNs (3DCNN) to capture both spatial and temporal information [10,14]. ...
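The cost gap between 2D and 3D convolution discussed in these excerpts can be made concrete with a back-of-the-envelope multiply-accumulate (MAC) count. The function names and the assumption of "same"-padded convolutions are illustrative, not from any of the cited papers.

```python
def conv3d_macs(t, h, w, c_in, c_out, k=3):
    """Approximate MACs for one 'same'-padded k x k x k 3D conv layer
    over a clip of t frames of size h x w."""
    return t * h * w * (k ** 3) * c_in * c_out

def conv2d_macs(h, w, c_in, c_out, k=3):
    """Approximate MACs for one 'same'-padded k x k 2D conv layer
    on a single frame."""
    return h * w * (k ** 2) * c_in * c_out
```

Per layer, the 3D kernel is k times larger than its 2D counterpart, and the whole clip must be held in memory at once; doubling the clip length doubles the layer cost, which is why hybrid 2DCNN-plus-RNN designs are attractive for lengthy videos.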
Article
Full-text available
Nowadays, for controlling crime, surveillance cameras are typically installed in public places to ensure urban safety and security. However, automating Human Activity Recognition (HAR) using computer vision techniques faces several challenges, such as low lighting, complex spatiotemporal features, cluttered backgrounds, and inefficient utilization of surveillance system resources. Existing attempts at HAR designed straightforward networks that analyze either spatial or motion patterns, resulting in limited performance, while dual-stream methods are based entirely on Convolutional Neural Networks (CNNs), which are inadequate for learning the long-range temporal information needed for HAR. To overcome these challenges, this paper proposes an optimized dual-stream framework for HAR that consists of three main steps. First, a shot segmentation module is introduced to utilize surveillance system resources efficiently by enhancing the low-light video stream and then detecting the salient video frames that contain humans. This module is trained on our own challenging Lowlight Human Surveillance Dataset (LHSD), which consists of both normal data and data at different levels of low lighting, to recognize humans in complex, uncertain environments. Next, to learn HAR from both contextual and motion information, a dual-stream approach is used for feature extraction. The first stream freezes the learned weights of the backbone Vision Transformer (ViT) B-16 model to select discriminative contextual information. In the second stream, ViT features are fused with the intermediate encoder layers of the FlowNet2 optical flow model to extract a robust motion feature vector.
Finally, a two-stream Parallel Bidirectional Long Short-Term Memory (PBiLSTM) is proposed for sequence learning to capture the global semantics of activities, followed by Dual Stream Multi-Head Attention (DSMHA) with a late fusion strategy to optimize the large feature vector for accurate HAR. To assess the strength of the proposed framework, extensive experiments were conducted on real-world surveillance scenarios and various benchmark HAR datasets, achieving accuracies of 78.6285%, 96.0151%, and 98.875% on HMDB51, UCF101, and YouTube Action, respectively. The results show that the proposed strategy outperforms state-of-the-art (SOTA) methods, providing accurate and reliable recognition of human activities in surveillance systems.
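The multi-head attention and late fusion described in this abstract build on standard scaled dot-product attention followed by per-timestep feature concatenation. The sketch below is a minimal single-head numpy version with hypothetical function names, not the paper's DSMHA module.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d) arrays; returns (output, attention weights)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

def late_fusion(contextual, motion):
    """Concatenate per-timestep features from the two streams."""
    return np.concatenate([contextual, motion], axis=-1)
```

A multi-head version would run several such attentions over learned projections of the fused features and concatenate their outputs.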
... However, sign expressions are not easily understood by non-signing listeners, resulting in a communication gap between the speech- and hearing-impaired population and the hearing population [10]. Recently, deep learning (DL) has achieved encouraging outcomes in different domains, such as activity recognition [11,12], disease recognition [13], and energy forecasting [14]. Therefore, HGRoc based on computer vision can serve as a translator for sign motion, creating a bridge between these two communities. ...
Article
Full-text available
Hand gestures have been used as a significant mode of communication since the advent of human civilization. By facilitating human-computer interaction (HCI), hand gesture recognition (HGRoc) technology is crucial for seamless and error-free HCI. HGRoc technology is pivotal in healthcare and communication for the deaf community. Despite significant advancements in computer vision-based gesture recognition for language understanding, two considerable challenges persist in this field: (a) only limited and common gestures are considered, and (b) processing multiple channels of information across a network takes huge computational time during discriminative feature extraction. Therefore, a novel hand-vision-based convolutional neural network (CNN) model named HVCNNM is proposed, offering several benefits, notably enhanced accuracy, robustness to variations, real-time performance, reduced channels, and scalability. Additionally, such models can be optimized for real-time performance, learn from large amounts of data, and scale to handle complex recognition tasks for efficient human-computer interaction. The proposed model was evaluated on two challenging datasets, namely the Massey University Dataset (MUD) and the American Sign Language Alphabet Dataset (ASLAD). On the MUD and ASLAD datasets, HVCNNM achieved scores of 99.23% and 99.00%, respectively. These results demonstrate the effectiveness of the CNN as a promising HGRoc approach. The findings suggest that the proposed model has potential roles in applications such as sign language recognition, human-computer interaction, and robotics.
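The "reduced channels" benefit claimed in this abstract comes down to how convolution parameter counts scale with channel width. A small illustrative calculation (the layer sizes below are hypothetical, not from the paper):

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases of one k x k convolution layer."""
    return k * k * c_in * c_out + c_out

# Halving both the input and output channel counts cuts the layer's
# parameters (and MACs) to roughly a quarter.
wide = conv_params(64, 128)    # 9*64*128 + 128 = 73856
narrow = conv_params(32, 64)   # 9*32*64  + 64  = 18496
```

This quadratic dependence on channel width is why channel reduction is a common lever for real-time models.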
... Their approach attained the highest accuracy for activity recognition compared with the other models because their filters are randomly initialized. Another study was conducted in [17], where a lightweight 3DCNN is explored for abnormal activity recognition and the camera viewing the abnormal scene is prioritized over the normal ones in a visual sensor network. In this direction, different variants of 3DCNN and end-to-end networks have been developed to learn discriminative spatial and temporal features for HAR [9]. ...
... Although 3DCNNs effectively learn spatial and temporal information from videos to predict ongoing activities, they require higher computation to identify activities in lengthy videos, producing an exponential increase in time complexity as the temporal dimension grows [9,17]. ...
Article
Full-text available
Human Activity Recognition (HAR) plays a crucial role in communication and the Internet of Things (IoT) by enabling vision sensors to understand and respond to human behavior more intelligently and efficiently. Existing deep learning models struggle to deal with low illumination, diverse viewpoints, and cluttered backgrounds; they require substantial computing resources and are not appropriate for edge devices. Furthermore, without an effective video analysis technique they process entire frames, resulting in inadequate performance. To address these key challenges, a cloud-assisted IoT computing framework is proposed for HAR in uncertain low-lighting environments, composed of two tiers: edge and cloud computing. Initially, a lightweight Convolutional Neural Network (CNN) model is developed to enhance the low-light frames, followed by a human detection algorithm that processes only the selective frames, enabling efficient resource utilization. Next, these refined frames are transmitted to the cloud for accurate HAR, where a dual-stream CNN and transformer fusion network extracts both short- and long-range spatiotemporal discriminative features, followed by the proposed Optimized Parallel Sequential Temporal Network (OPSTN) with squeeze-and-excitation attention to efficiently learn HAR in complex scenarios. Finally, extensive experiments are conducted on three challenging HAR datasets to examine the proposed framework from various perspectives, such as complex activity recognition and low lighting, where the results outperform state-of-the-art methods.
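The squeeze-and-excitation attention mentioned in this abstract can be sketched in a few lines of numpy. This is a generic illustrative sketch of the standard SE block, not the paper's OPSTN; `w1` and `w2` stand in for the learned reduction and expansion weight matrices.

```python
import numpy as np

def squeeze_excite(x, w1, w2):
    """Squeeze-and-excitation channel gating.

    x: feature map of shape (C, H, W); w1: (C//r, C) reduction weights;
    w2: (C, C//r) expansion weights, for reduction ratio r."""
    z = x.mean(axis=(1, 2))                   # squeeze: global average pool
    s = np.maximum(w1 @ z, 0.0)               # reduce + ReLU
    gates = 1.0 / (1.0 + np.exp(-(w2 @ s)))   # expand + sigmoid, one gate per channel
    return x * gates[:, None, None]           # reweight channels
```

The learned gates cheaply emphasize informative channels, which is why SE attention is popular in lightweight edge models.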
... Another notable study was performed by Abdullahi, who designed a bidirectional long short-term memory-fast Fisher vector algorithm to train on 3D hand-skeletal motion and orientation-angle features and further used it to classify dynamic sign words [10], [11]. Representative works [12], [13] from Sejong University proposed a cloud-assisted IoT computing framework for human activity recognition in uncertain low-lighting environments and applied a lightweight three-dimensional convolutional neural network architecture to extract spatiotemporal features from significant frames to identify violent behaviors in video. This group also pretrained a vision transformer to extract frame features and investigated the identification of abnormal behaviors in video [14], [15]. ...
Article
Full-text available
Human motion prediction is a popular method for predicting future motion sequences from past ones and is widely used in human-computer interaction. The Space-Time-Separable Graph Convolutional Network (STS-GCN) is a conventional model for human motion prediction. However, the uncertainty of human movements often leads to significant prediction errors. This paper first proposes a Multi-scale STS-GCN (MSTS-GCN) model based on the conventional STS-GCN method to find the relevant factors that affect the prediction results. In our study, the constructed Multi-scale Temporal Convolutional Network (MTCN) decoder effectively reduced the human motion prediction error at specific time nodes. To improve the transmission and utilization of features over a larger receptive field, a Gated Recurrent Unit-TCN decoder was also designed. Finally, a new STS-GCN (NSTS-GCN) human motion prediction model is proposed, which realizes the transmission and utilization of motion sequence features under a larger temporal receptive field. To verify the effectiveness of NSTS-GCN, the Human3.6M, AMASS, and 3DPW datasets were tested. The experimental results show that the MPJPE of the proposed model for human joint prediction at each time node is reduced compared with the conventional STS-GCN model, with a mean reduction of 3.0 mm. All experimental results validate the effectiveness of the proposed NSTS-GCN model, which further improves human motion prediction performance.
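The MPJPE figure reported above is the standard Mean Per-Joint Position Error: the Euclidean distance between each predicted joint and its ground-truth position, averaged over joints and frames, in the units of the input (here millimetres). A minimal numpy implementation:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error.

    pred, gt: arrays of shape (frames, joints, 3) in the same units."""
    return np.linalg.norm(pred - gt, axis=-1).mean()
```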
... The task of nonlinear mapping and feature extraction is extremely challenging; therefore, the best way to tackle these challenges is to employ deep learning models with the ability to extract discriminative features end-to-end [29,30]. In recent years, the application of deep learning models has significantly improved image classification [31,32], video classification [33][34][35][36][37], and power forecasting in TS data [38][39][40][41][42]. For instance, Khan et al. [43] proposed a hybrid model for electricity forecasting in residential and commercial buildings. ...
Article
Full-text available
For efficient energy distribution, microgrids (MGs) provide significant assistance to main grids and act as a bridge between power generation and consumption. Renewable energy generation resources, particularly photovoltaics (PVs), are considered a clean source of energy but are highly complex, volatile, and intermittent in nature, making their forecasting challenging. Thus, a reliable, optimized, and robust forecasting method deployed at the MG addresses these challenges by providing accurate renewable energy production forecasts and establishing a precise match between power generation and consumption at the MG. Furthermore, it ensures effective planning, operation, and acquisition from the main grid in the case of surplus or deficit amounts of energy, respectively. Therefore, in this work we develop an end-to-end hybrid network for automatic PV power forecasting, comprising three basic steps. First, data preprocessing is performed to normalize the data, remove outliers, and handle missing values. Next, temporal features are extracted using deep sequential modelling schemes, followed by the extraction of spatial features via convolutional neural networks. These features are then fed to fully connected layers for optimal PV power forecasting. In the third step, the proposed model is evaluated on publicly available PV power generation datasets, where it shows lower error rates compared to state-of-the-art methods.
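The preprocessing step described in this abstract (normalization, outlier removal, missing-value handling) can be sketched for a single series as follows. The specific choices here, linear interpolation for gaps, 3-sigma clipping, and z-score normalization, are common defaults, not necessarily the paper's.

```python
import numpy as np

def preprocess_series(x, clip_sigma=3.0):
    """Fill missing values, clip outliers, and z-score normalise a 1-D series.

    x: 1-D float array in which NaN marks a missing reading."""
    x = np.asarray(x, dtype=float).copy()
    mask = np.isnan(x)
    if mask.any():  # fill gaps by linear interpolation over valid points
        idx = np.arange(len(x))
        x[mask] = np.interp(idx[mask], idx[~mask], x[~mask])
    mu, sigma = x.mean(), x.std()
    if sigma == 0:  # constant series: nothing to scale
        return x - mu
    x = np.clip(x, mu - clip_sigma * sigma, mu + clip_sigma * sigma)
    return (x - x.mean()) / x.std()   # z-score normalise after clipping
```

Each PV series would be transformed this way before being windowed into sequences for the temporal model.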
... (7) Establishing a blockchain for video surveillance equipment: In a technological environment where digital surveillance systems are ubiquitous and continuously produce large amounts of data, manual monitoring is still required to identify human activities in the public realm. Smart surveillance systems that can distinguish normal from abnormal activities are therefore urgently needed: they allow the effective monitoring of images sent from cameras designed to capture abnormal activities, and their implementation can alleviate the shortage of surveillance personnel [42,43]. Furthermore, such systems can enhance the community's control over the number of people entering the mountain area and the entry of unauthorized personnel. ...
Article
Full-text available
Forest protection is crucial to ensuring the balance between human beings and ecology. This study explores the key role played by communities that originally lived in forest-protected areas in implementing the traditional management of forests. The unified management mode previously imposed by state power can no longer meet the demands of modern times; hence, multiple management systems, adapted to the original ecology of forest areas, should be implemented so that a forest can be restored to its original state (i.e., the state that existed prior to the entry of state power). The forest has been in a state of ecological balance involving numerous species since ancient times. However, from the modern scientific perspective, the passive restoration of a community's self-governance ability can be unsustainable and unstable. To improve this situation, blockchain technology can first be used to improve the community management of a forest, so that the capabilities of the original local community are strengthened. Second, tourism promotion benefits both the community and the forest: when a community in a forest develops the tourism industry with the support of blockchain technology, the information and resources of all parties can be widely connected with the larger world, which considerably increases success rates. Finally, the traditional spiritual culture of a community, such as the culture of sharing, should be promoted. In addition to the skillful utilization of technology, culture can improve the traditional forest-management ability of tribal communities living in native forest areas in terms of their personality traits.
Overall, we conclude that, viewed against an evolution of more than one hundred years, the adoption of new technology for forest management is inherently a creative innovation for the tribal community's entrepreneurial development.
... The experiments concluded that their method achieved higher accuracy compared to randomly initialized filters. Similarly, Hussain et al. [16] proposed a lightweight 3D CNN model for anomalous activity recognition and camera prioritization in surveillance environments. The variants of 3D CNNs include the two-stream 3D CNN [17], pseudo-3D CNN [18], and MiCT-Net [19]. ...
... However, the existing 3D-CNN models can only process 10 to 16 frames effectively. They cannot recognize lengthy activities due to the exponential increase in time complexity caused mainly by the temporal dimension [16]. To overcome this issue, researchers have experimented with hybrid models in which spatial features are extracted from pretrained CNN models and the temporal information is learned using variants of Recurrent Neural Networks (RNNs). ...
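The 10-to-16-frame limit discussed above is commonly worked around by sliding a fixed-length clip window over the video and classifying each clip independently. A minimal sketch (the function name and the `clip_len`/`stride` defaults are illustrative):

```python
def clip_starts(num_frames, clip_len=16, stride=8):
    """Start indices of fixed-length clips sliding over a video.

    Each clip [s, s + clip_len) fits a 3D CNN's input budget; overlap
    (stride < clip_len) trades compute for temporal coverage."""
    return list(range(0, num_frames - clip_len + 1, stride))
```

Per-clip predictions are then aggregated (e.g. averaged) into a video-level label, so the cost grows only linearly with video length.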
Article
Full-text available
Human Activity Recognition (HAR) is an active research area, with several Convolutional Neural Network (CNN)-based feature extraction and classification methods employed for surveillance and other applications. However, accurate HAR from a sequence of frames is a challenging task due to cluttered backgrounds, different viewpoints, low resolution, and partial occlusion. Current CNN-based techniques use large-scale computational classifiers along with convolutional operators having local receptive fields, limiting their ability to capture long-range temporal information. Therefore, in this work we introduce a convolution-free approach for accurate HAR that overcomes the above-mentioned problems and accurately encodes relative spatial information. In the proposed framework, frame-level features are extracted via a pretrained Vision Transformer; these features are then passed to a multilayer long short-term memory network to capture the long-range dependencies of the actions in surveillance videos. To validate the performance of the proposed framework, we carried out extensive experiments on the UCF50 and HMDB51 benchmark HAR datasets, improving accuracy by 0.944% and 1.414%, respectively, compared to state-of-the-art deep models.
... Therefore, in future work, the main strategy is to use a more advanced pose estimation technique that is more accurate in detecting human body key points. Moreover, a lightweight sequential model can be added to the proposed network so that it can be deployed on embedded systems and boards such as FPGAs to increase the frame rate and improve decision quality, in order to integrate with other IoT applications concerning camera prioritization [88], localization [89,90], security [91], healthcare [92], greenery [93], smart home automation [94], and energy efficiency [34,95]. ...
Thesis
Full-text available
The recent surge in the number of CCTV cameras, and their impact on the surveillance environment, provides a great opportunity for researchers to develop more advanced and accurate recognition and detection systems. To advance a traditional CCTV monitoring system into a smart surveillance environment, monitoring humans and analyzing their behavior is a major task to be accomplished. Human behavior analysis and mutual interaction comprise three subbranches: human action recognition, human activity recognition, and human interaction recognition. Human interaction recognition is challenging due to the presence of multiple humans and their mutual interaction within a single frame, generated from their movements. The mainstream existing literature is based on 3D CNNs that process only visual frames, whereas human joint data play a vital role in accurate interaction recognition. In this thesis, a two-stream network for human interaction recognition (HIR) is proposed that intelligently learns from skeleton key points and spatiotemporal visual representations. The first stream localizes the joints of the human body using a pose estimation model and transmits them to a 1D CNN and a bidirectional long short-term memory network to efficiently extract features of the dynamic movements of each human skeleton. The second stream feeds the series of visual frames to a 3D convolutional neural network to extract discriminative spatiotemporal features. Finally, the outputs of both streams are integrated via fully connected layers that precisely classify the ongoing interactions between humans. To validate the performance of the proposed network, a comprehensive set of experiments is conducted on two benchmark datasets, UT-Interaction and TV Human Interaction, yielding accuracy increases of 1.15% and 10.7%, respectively.
... AlexNet outperformed traditional handcrafted techniques in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [23]. Because AlexNet performs exceptionally well in the image classification task, researchers from the computer vision community are exploring and using CNNs for several problems: segmentation [24], object tracking [25], plant disease recognition [26], chest disease detection [27], activity identification [28], and other similar areas. The main advantages of the CNN architecture are local connectivity and weight sharing, which help in processing high-dimensional data and extracting meaningful discriminative features. ...
Article
Full-text available
Background and motivation: Every year, millions of Muslims worldwide come to Mecca to perform the Hajj. To maintain the security of the pilgrims, the Saudi government has installed about 5000 closed-circuit television (CCTV) cameras to monitor crowd activity efficiently. Problem: These cameras generate an enormous amount of visual data through manual or offline monitoring, requiring numerous human resources for efficient tracking. Therefore, there is an urgent need to develop an intelligent, automatic system to efficiently monitor crowds and identify abnormal activity. Method: Existing methods are incapable of extracting discriminative features from surveillance videos because pre-trained weights of different architectures were used. This paper develops a lightweight approach for accurately identifying violent activity in surveillance environments. In the first step of the proposed framework, a lightweight CNN model is trained on our own pilgrims dataset to detect pilgrims in the surveillance footage. In the second step, these preprocessed salient frames are passed to a lightweight CNN model for spatial feature extraction. In the third step, a Long Short-Term Memory (LSTM) network is developed to extract temporal features. Finally, in the case of violent activity or accidents, the proposed system generates an alarm in real time to inform law enforcement agencies to take appropriate action, thus helping to avoid accidents and stampedes. Results: We conducted multiple experiments on two publicly available violent activity datasets, Surveillance Fight and Hockey Fight; our proposed model achieved accuracies of 81.05% and 98.00%, respectively.
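Real-time alarm generation of the kind described above typically smooths per-clip classifier scores before alerting, so that a single misclassified clip does not trigger a false alarm. The sketch below shows one such policy; the function name and the `threshold`/`min_consecutive` defaults are hypothetical, not from the paper.

```python
def should_alarm(scores, threshold=0.7, min_consecutive=3):
    """True once min_consecutive consecutive violence scores reach threshold.

    scores: per-clip violence probabilities in [0, 1], in temporal order."""
    run = 0
    for s in scores:
        run = run + 1 if s >= threshold else 0  # length of current high-score run
        if run >= min_consecutive:
            return True
    return False
```

Requiring a sustained run of high scores trades a small detection delay (a few clips) for a much lower false-alarm rate, which matters when alarms dispatch law enforcement.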