Tech Science Press
Computers, Materials & Continua
DOI: 10.32604/cmc.2022.018181
Article
Anomaly Based Camera Prioritization in Large Scale Surveillance Networks
Altaf Hussain1,2, Khan Muhammad1, Hayat Ullah1, Amin Ullah1,4, Ali Shariq Imran3, Mi Young Lee1,
Seungmin Rho1 and Muhammad Sajjad2,3,*
1Department of Software, Sejong University, 143-747, Seoul, Korea
2Digital Image Processing Lab, Islamia College Peshawar, 25000, Pakistan
3Color Lab, Department of Computer Science, Norwegian University of Science and Technology (NTNU),
2815, Gjøvik, Norway
4CORIS Institute, Oregon State University, 97331, Oregon, USA
*Corresponding Author: Muhammad Sajjad. Email: muhammad.sajjad@icp.edu.pk
Received: 28 February 2021; Accepted: 06 May 2021
Abstract: Digital surveillance systems are ubiquitous and continuously generate massive amounts of data, and manual monitoring is required in order to recognise human activities in public areas. Intelligent surveillance systems that can automatically identify normal and abnormal activities are highly desirable, as these would allow for efficient monitoring by selecting only those camera feeds in which abnormal activities are occurring. This paper proposes an energy-efficient camera prioritisation framework that intelligently adjusts the priority of cameras in a vast surveillance network using feedback from the activity recognition system. The proposed system addresses the limitations of existing manual monitoring surveillance systems using a three-step framework. In the first step, the salient frames are selected from the online video stream using a frame differencing method. A lightweight 3D convolutional neural network (3DCNN) architecture is applied to extract spatio-temporal features from the salient frames in the second step. Finally, the probabilities predicted by the 3DCNN network and the metadata of the cameras are processed using a linear threshold gate sigmoid mechanism to control the priority of the camera. The proposed system performs well compared to state-of-the-art violent activity recognition methods in terms of efficient camera prioritisation in large-scale surveillance networks. Comprehensive experiments and an evaluation of activity recognition and camera prioritisation showed that our approach achieved an accuracy of 98% with an F1-score of 0.97 on the Hockey Fight dataset, and an accuracy of 99% with an F1-score of 0.98 on the Violent Crowd dataset.
Keywords: Camera prioritisation; surveillance networks; convolutional
neural network; computer vision; deep learning; resource-constrained
device; violent activity recognition
This work is licensed under a Creative Commons Attribution 4.0 International License,
which permits unrestricted use, distribution, and reproduction in any medium, provided
the original work is properly cited.
1 Introduction
Following the development of computer vision and pattern recognition technology, video
surveillance systems have been widely deployed, and their functionality is improving rapidly with
the aim of preventing emergencies and crimes [1]. Current systems mainly rely on traditional
approaches that are time-consuming, laborious, and prone to misdetection of violent activities due to the need for round-the-clock monitoring of large numbers of camera nodes. Video surveillance, and
particularly human activity recognition systems, have strong prevention capabilities due to their
improved visualisation, accuracy and real-time feedback, and these are now considered essential
aspects of a security system. Wireless visual sensor networks (WVSNs) have recently emerged
as a new type of sensor-based intelligent system, and their performance exceeds that of existing
surveillance sensor networks. Furthermore, due to their small size and the dense spatial coverage
that can be achieved, WVSNs can be flexibly deployed for various computer vision tasks, including
patient monitoring [2], distributed multimedia-based surveillance systems [3], military, police, airport [4], border [5], urban and transport [6] applications, and other areas of public interest. WVSN
surveillance systems have an extensive range of applications worldwide, but their implementation
has remained a challenge, due to the need for connectivity between multiple sensors and their
complicated setup. For instance, monitoring a large area requires a large number of WVSNs. To
accurately monitor an entire set of cameras in real time, WVSNs require massive amounts of
human resources, computational power and bandwidth. In addition, vision-based sensors require
a high bandwidth to transmit data between cameras and servers. Efficient monitoring of the target
area also requires extensive computation to detect events and anomalies. A similar study reported
the need for extensive computational resources due to the use of multiple sets of parameters in
the proposed model [7]. The literature in the area of violent activity recognition is mainly divided
into handcrafted and deep learning-based methods, and these are discussed below.
1.1 Handcrafted Feature-Based Approaches
The success of handcrafted feature-based approaches relies heavily on manually engineered
feature extraction techniques. Many researchers have utilised handcrafted features for the detection
of violent activities. For instance, Hassner et al. [8] introduced a violent flow descriptor (ViF) to extract flow vectors. They then used a support vector machine (SVM) to classify these features into violent and non-violent crowds. Similarly, Huang et al. [9] used an SVM to analyse the statistical properties of the optical flow for violent crowd recognition. Zhang et al. [10] used a Gaussian model of the optical flow for violent region extraction, and an orientation histogram of the optical flow with a linear SVM was used to classify video frames into violent and non-violent classes. Gao et al. [11] proposed an oriented violent flow descriptor (OViF) that utilised both the magnitude of motion and orientation information for the recognition of violent activity. Chen et al. [12] used spatiotemporal interest points, including space-time interest points (STIPs) [13] and a motion scale-invariant feature transform (MoSIFT) [14], for violent activity recognition. Similarly, Lloyd et al. [15] proposed a new descriptor called a grey level co-occurrence texture measure to detect violence and abnormal activity in crowds. Fu et al. [16] analysed three attributes of motion (region, magnitude, and acceleration) for violence detection from surveillance videos. A histogram of optical flow magnitude and orientation (HOMO) [17] has also been used for feature extraction. A sliding window method was used in [18] with an improved Fisher vector (IFV) for violent activity recognition.
1.2 Deep Learning-Based Approaches
Due to the wide range of variations in pose, viewpoint and scale in visual data, accurate
recognition of violent activities is complex and challenging. Researchers from the artificial intelligence and computer vision communities have recently contributed to this area by identifying
human activities using deep learning-based approaches. CNNs extract features in a hierarchical
way, where the initial layers learn local features and the higher layers learn global features from
the visual data [19]. Although recurrent neural networks (RNNs) and 3DCNNs are mostly used for violent activity recognition [20], an RNN is not able to extract features directly; a CNN is typically used for feature extraction purposes, and these features are then passed to an RNN for classification. A 3DCNN is an end-to-end method that is very widely used to extract spatio-temporal features for violent activity recognition [21–28]. For instance, Tran et al. [21] used a 3DCNN architecture for activity recognition in which 3 × 3 × 3 filters were convolved with a sequence of 16 frames.
Similarly, Carreira et al. [22] modified 2D filters pretrained on ImageNet to create 3D versions for activity recognition. These modified filters achieved better accuracy than filters that were randomly initialised. Stroud et al. [26] introduced a distilled 3D network (D3D) for activity recognition. Diba et al. [23] employed a temporal transition layer with DenseNet3D, in a scheme that used transfer learning from a 2DCNN model. Serrano et al. [29] used Hough forests with a 2DCNN to detect violent activity, and their proposed approach obtained an accuracy of 94.6% on the Hockey Fight dataset. However, 3DCNNs have high computational requirements, making them unsuitable for use in standard surveillance systems due to resource constraints. In this paper, we solve this issue by introducing a new lightweight 3DCNN model that is less computationally intensive and can easily be deployed in common CCTV camera surveillance systems.
Within such large amounts of video data, very few scenes are important in terms of allowing
a machine to understand human activity. For example, theft from a shopping mall happens
very rarely. When a human performs any activity, there is some sort of bodily movement, such
as movements of arms and legs. In these situations, the detection of moving objects from a
sequence of frames is a challenging and crucial task, and is the first step in any video analytics
system such as video surveillance [30], target detection [31], human tracking [32], and robot
control [33]. The selection of salient motion frames from WVSN nodes is a crucial aspect of video
processing that helps us analyse only important clips, thereby effectively minimising the execution
time and improving the accuracy of the violent activity recognition system. Several techniques
for automatically detecting salient frames have been developed that can separate the moving
objects (foreground) from the scene (background) in videos. It is difficult to accurately segment
foreground objects due to variations in illumination, ghosting of the foreground aperture, and the
presence of unconstrained environments. Over the years, approaches based on optical flow [34], background subtraction [35], and frame differencing [36] have been actively used to detect motion between two consecutive frames. Optical flow is used to detect the simple apparent motion of an object in two consecutive frames; however, it is computationally expensive and produces inaccurate results due to its susceptibility to noise, variations in illumination, and fake motion.
The second method, frame differencing, is a straightforward and commonly used technique for
identifying moving objects, but is susceptible to variations in illumination and camera jitter. Unlike
optical ow, the frame difference technique is computationally efcient and is particularly used by
resource-constrained devices to detect moving objects in video frames.
Existing large-scale WVSNs consist of numerous wireless camera sensors and are used to
monitor suspicious human activities. However, these systems have several drawbacks, such as
inadequate recognition of salient information, streaming of all imaging data, high bandwidth
and storage requirements, ineffective monitoring, and late responses to crime or abnormal activ-
ities. Other significant issues related to WVSN-based surveillance include scattered backgrounds, viewpoint variation and changes in lighting. The task of camera prioritisation in large-scale
WVSNs also becomes more challenging when large numbers of nodes and continuous streaming
are used [37]. Researchers around the world are making efforts to tackle these challenges. For
instance, Mehmood et al. [38] proposed a saliency-aware data prioritisation framework that selects
semantically relevant information for each node and then transmits these data to a sink node for
further processing. The main limitation of their method is that they used handcrafted features to
extract salient information, an approach that gives limited performance with real-time data. To
rank salient information and remove redundancy, Thomas et al. [39] used a perceptual system
to detect road events. Another technique inspired by the idea of perceptual computation was
a low-computation video summarisation framework for resource-constrained devices such as the
Raspberry Pi [40].
The abovementioned approaches were developed in an attempt to solve several challenges
such as variations in pose, viewpoint and scale, complex crowd patterns in visual data, and
data prioritisation. However, there are still numerous challenges that need to be addressed. For
instance, the studies in [8–17] used handcrafted approaches that were not capable of learning discriminative features from violence datasets. The authors of [20–28] proposed 3DCNN models for violent activity recognition that performed rather well, but due to the large number of computations
involved, these were not suitable for deployment in real-world surveillance systems. Similarly,
the authors of [36,37] developed a surveillance system that prioritised video camera content.
A surveillance system usually consists of resource-constrained devices such as CCTV cameras, and
there is therefore an urgent need for a system that can accurately recognise violent activity in a
complex environment with lower computational requirements. In addition, there is a need for an
intelligent prioritisation technique that can select only those cameras that are transmitting violent
activity from among a set of normal feeds, to allow for smart monitoring of large WVSNs, reduce
the memory storage required, and utilise the resources of WVSNs more efficiently.
In this paper, we propose an energy-efcient deep learning-based framework for camera
prioritisation in large-scale WVSNs. The key contributions of this work can be summarised as
follows.
i. A novel camera prioritisation framework is proposed for economical hardware devices
(e.g. the Raspberry Pi) based on violent activity recognition in large-scale WVSNs. Salient
motion is the key feature for activity recognition. A lightweight frame differencing tech-
nique is incorporated to extract frames containing salient motion, thus ensuring the
efcient utilisation of resources.
ii. Human activity consists of a sequence of motion patterns in consecutive frames, meaning
that both spatial and temporal features need to be learned for this task. A novel lightweight
3DCNN architecture for resource-constrained devices is proposed, which outperforms
other state-of-the-art techniques in terms of accuracy on benchmark datasets.
iii. A novel linear threshold gate sigmoid (LTGS) method is used to prioritise cameras based
on violent activity in large-scale WVSNs by exploiting both the metadata and the prob-
abilities predicted by the proposed 3DCNN model, thereby reducing the dependency on
human monitoring, the energy required, and the bandwidth consumption.
The remainder of this paper is organised as follows. Section 2 presents the proposed method-
ology. The experimental setup and evaluation are described in Section 3, and Section 4 concludes
the paper and suggests future work.
2 Proposed Method
This section introduces our novel camera prioritisation framework for efficient surveillance, in
which an individual camera node from a large-scale WVSN is prioritised based on the detection of
violent activity. The proposed framework is divided into three steps. In the first step, we perform motion detection, and select only motion-specific frames from the video streams captured by the
WVSNs. In the second step, a sequence of 16 salient frames is forwarded to a trained 3DCNN
model to extract the spatio-temporal features. The extracted features are then fed to a Softmax
classier, which generates categorical probabilities in which the class with higher probability is
considered to be the predicted class. Finally, based on the predicted probabilities, our proposed
framework prioritises cameras with a high probability of showing violent activity, using LTGS.
The overall workow of the proposed intelligent camera prioritisation framework is illustrated in
Fig. 1.
Figure 1: The proposed camera prioritisation framework for a large-scale surveillance network.
In the preprocessing step, the surveillance video streams are preprocessed using the background
subtraction method to extract only motion-specific frames. In the second step, a 3DCNN is utilised to extract spatio-temporal features from the sequence of frames to classify the underlying activity. Finally, based on the violent activity detected in the video stream, a specific camera is assigned high priority
2.1 Preprocessing Phase
The frame differencing technique plays a very important role in the efficient use of resources
in large-scale WVSNs, as it can help in selecting only motion-specific frames, e.g., moving cars or
pedestrians. In our case, a resource-constrained Raspberry Pi device is used in the preprocessing
stage to select only frames containing moving objects from the input video stream. The detection
of salient objects is a difficult task due to the range of different viewpoints, occlusions, and
cluttered backgrounds in the frames [17]. In WVSNs, multiple visual sensors are interconnected
to enable the target area to be efficiently monitored. Processing the video stream from each camera in full is unnecessary, as it is an inefficient use of the available computational resources. There is therefore a need for a system that can select only motion-relevant frames. A lightweight frame differencing technique is applied here to identify salient motion frames, as frame differencing is the most precise and straightforward technique for detecting minor temporal changes. A pair of consecutive frames (f_i, f_{i+1}) are smoothed and processed to remove noise; this removes high-frequency content such as noise and edges from the images. There are several different techniques for smoothing, such as averaging, median, bilateral filtering, and Gaussian blur, in which the pixels close to the centre of the filter are given more weight than those further away from the centre.
We conducted experiments and concluded that, due to its use of a Gaussian kernel, Gaussian blur is highly effective in removing noise from images. For the selection of salient frames, a pixel-wise absolute difference is calculated for each pair of consecutive frames, D_image = |f_i - f_{i+1}|. When the value of the absolute difference is higher than a pre-defined threshold T (in our case, T = 0.7), these video frames are selected as salient frames and considered for further processing. The complete flow of the motion detection process is described in Algorithm 1.
Algorithm 1: Selection of salient frames
Input: Video stream
1. Take two consecutive input video frames (f_i, f_{i+1}).
2. Apply Gaussian blur to (f_i, f_{i+1}) to remove noise.
3. Find the pixel-wise mean absolute difference of each pair of consecutive frames:
   D_image = (1/N) \sum_{n=1}^{N} |f_i(n) - f_{i+1}(n)|
4. For each frame pair (f_i, f_{i+1}):
       If D_image <= 0.7 then
           f_i and f_{i+1} are non-salient frames.
       Else
           f_i and f_{i+1} are salient frames.
       End If
   End For
Output: Salient frames
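To make this step concrete, the sketch below implements Algorithm 1 in Python with OpenCV, matching the Raspberry Pi setting described above. The 5 × 5 Gaussian kernel, the grey-scale conversion and the application of the threshold to 8-bit intensities are our assumptions, as the paper does not specify these details.

```python
import cv2
import numpy as np

T = 0.7  # saliency threshold from Algorithm 1

def is_salient_pair(frame_a, frame_b, threshold=T):
    """Decide whether two consecutive frames contain salient motion by
    thresholding the mean absolute pixel difference D_image."""
    grey_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    grey_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    # Gaussian blur removes high-frequency content (noise and edges).
    grey_a = cv2.GaussianBlur(grey_a, (5, 5), 0)
    grey_b = cv2.GaussianBlur(grey_b, (5, 5), 0)
    d_image = float(np.mean(cv2.absdiff(grey_a, grey_b)))
    return d_image > threshold

def select_salient_frames(video_path):
    """Yield only the motion-specific frames of a video stream."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    while ok:
        ok, curr = cap.read()
        if not ok:
            break
        if is_salient_pair(prev, curr):
            yield curr
        prev = curr
    cap.release()
```

Because it needs only one subtraction and one mean per frame pair, this selection step is cheap enough for a resource-constrained device, which is why frame differencing is preferred over optical flow here.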
2.2 Violence Detection
2DCNN architectures are widely used to extract spatial information from still images for a variety of computer vision tasks, such as image classification and image retrieval. However, the analysis of human activity in videos is challenging compared to the classification of still images, as a human activity/action is encoded across multiple frames that involve both spatial and temporal information. A variety of existing methods have used a 2DCNN to extract the spatial correlations from a video, even though a video also includes temporal information. For example, in [41,42], the
authors used a 2DCNN to process multiple frames, so that all the temporal features were collapsed. In a 3DCNN, by contrast, a 3D filter is convolved both spatially and across multiple frames to extract spatial and temporal information. After the 3D convolutional operation, a feature map is obtained that captures the spatial and temporal information. The feature maps are extracted from multiple frames to capture the temporal information. In our case, the feature maps are a combination of 16 consecutive frames with spatial dimensions of 112 × 112. The values at location (x, y, z) of the qth extracted feature map in the pth layer, with bias t_{pq}, are given by Eq. (1):
N_{pq}^{xyz} = \tanh\left( t_{pq} + \sum_{k} \sum_{a=0}^{A_p - 1} \sum_{b=0}^{B_p - 1} \sum_{c=0}^{C_p - 1} w_{pqk}^{abc} \, N_{(p-1)k}^{(x+a)(y+b)(z+c)} \right)    (1)
Here, C_p represents the size of the 3D filter used to learn the temporal dimension, and in w_{pqk}^{abc}, the value at position (a, b, c) represents the filter convolved on the kth feature map in
the previous layer. The first version of the 3DCNN was introduced in 2014; it was implemented in a deep learning framework called Caffe [43], and achieved state-of-the-art performance in an activity recognition task. Inspired by the exceptional performance of this network, we designed a similar 3DCNN architecture with pruning of the higher layers for the task of violent activity recognition. In our proposed 3DCNN architecture, we used eight 3D convolutional layers, two max-pooling layers, and one fully connected layer. In the final layer, a Softmax activation function is used as a binary classifier. To extract the deep spatial features from the input video stream, a 3 × 3 × 3 filter is used in the convolution layer with a stride of one. The input shape of the model is based on the batch size, depth, rows, columns and channels, and can be represented as 16 × 16 × 112 × 112 × 3. A leaky ReLU is used as the activation function to overcome the dying ReLU problem. In a CNN, the fully connected layer is extremely expensive in terms of computation, since each neuron is directly connected to each pixel of the input image. We therefore use only one fully connected layer, with 512 neurons.
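Since the exact per-layer configuration is not listed in the text, the following Keras sketch is one plausible realisation of this design: eight 3 × 3 × 3 Conv3D layers with stride one and leaky ReLU, max-pooling for feature reduction, a single 512-neuron fully connected layer, and a two-way Softmax over 16-frame 112 × 112 RGB clips. The filter counts and pooling positions are our assumptions, so the parameter count will differ from the 14,352,811 reported in Section 3.4.2.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_lightweight_3dcnn(num_classes=2):
    """A plausible lightweight 3DCNN along the lines described above;
    filter counts and pooling positions are assumptions."""
    model = models.Sequential()
    filters = [32, 32, 64, 64, 128, 128, 256, 256]  # assumed progression
    for i, f in enumerate(filters):
        if i == 0:
            # Input: 16 frames of 112 x 112 RGB (batch dimension omitted).
            model.add(layers.Conv3D(f, 3, strides=1, padding="same",
                                    input_shape=(16, 112, 112, 3)))
        else:
            model.add(layers.Conv3D(f, 3, strides=1, padding="same"))
        model.add(layers.LeakyReLU(alpha=0.1))  # avoids the dying-ReLU problem
        if i in (1, 3, 5):  # pooling positions are assumed
            model.add(layers.MaxPooling3D(pool_size=2))
    model.add(layers.GlobalAveragePooling3D())
    model.add(layers.Dense(512))
    model.add(layers.LeakyReLU(alpha=0.1))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model

model = build_lightweight_3dcnn()
```

Keeping a single, small fully connected layer is the main lever behind the reduced parameter count discussed in Section 3.4.2.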
2.3 Camera Selection
In large-scale WVSNs, the streaming and monitoring of gigantic amounts of data is impracti-
cal due to limited human resources. Each surveillance system consists of large numbers of sensors
that require a high bandwidth to transmit their raw video streams, and these video streams involve
costly computation when detecting events and anomalies. Due to its continuous nature, a video stream poses critical problems for a human analyst in terms of identifying the important portions, and it is therefore vital to apply visual analytics at the point of data collection. To overcome the
drawbacks of a traditional multi-camera WSN, we propose a salient motion detection scheme
based on a prioritisation framework for visual data, in which all nodes autonomously prioritise
certain visual content and the cameras showing it. The main goal of the proposed surveillance
system is to focus solely on anomaly-specific cameras, not only to avoid unimportant streams but also to provide an efficient surveillance system. An overview of the proposed scheme is shown in Fig. 2. The first step involves understanding the scene by classifying the input video stream into
salient and non-salient classes; the output is a predicted probability score for each frame of the
streams captured by different camera nodes. These scores are then forwarded to the second step
together with the weight values of the corresponding cameras.
Figure 2: The proposed camera prioritisation scheme with a symbolic representation of LTGS
In the second step, the sensitivity of each camera is computed based on the input frame
saliency by the LTGS module (using a sigmoid activation function), which processes two inputs:
the probability of the frame under observation and the metadata (weight) value of the correspond-
ing camera. It then generates a single value that determines the priority of the camera. Metadata
are data about the camera, such as its importance and location. This is crucial, since in some
locations such as banks and border control stations, we want to create a high level of surveillance
by selecting camera nodes with high priority. In our experiments, we used metadata values from
one to six, where a score of six means that the location of the camera is very important and one
means the location is normal.
\phi = \sigma(w_i x_i)    (2)
where i is the index of the camera, w_i represents the metadata weight of the specific camera, x_i represents the predicted probability for that camera, \sigma is the sigmoid function, and \phi indicates the priority of the camera.
C_F = \begin{cases} 1 & \text{if } \phi \geq T \\ 0 & \text{otherwise} \end{cases}    (3)
where C indicates the specific camera node, F represents the priority flag, and T is a threshold with a default value of 0.7. If the value of \phi exceeds the threshold T, then the camera priority flag F is set, and this particular camera is prioritised over the others.
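A minimal sketch of this selection step follows; the metadata weights from one to six, the threshold T = 0.7 and the sigmoid combination are taken from the text, while the example feed values are hypothetical.

```python
import math

T = 0.7  # default priority threshold from Eq. (3)

def ltgs_priority(prob_violent, weight):
    """Eq. (2): phi = sigmoid(w * x), combining the 3DCNN's violence
    probability x with the camera's metadata weight w (1..6)."""
    return 1.0 / (1.0 + math.exp(-weight * prob_violent))

def prioritise(cameras):
    """Eq. (3): set the priority flag F for each camera.
    `cameras` maps camera id -> (predicted probability, metadata weight)."""
    return {cam: 1 if ltgs_priority(x, w) >= T else 0
            for cam, (x, w) in cameras.items()}

# Hypothetical feeds: camera 3 shows likely violence at an important site.
feeds = {1: (0.12, 2), 2: (0.55, 1), 3: (0.93, 5)}
print(prioritise(feeds))  # -> {1: 0, 2: 0, 3: 1}
```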
3 Experimental Results
In this section, we describe the details of our experiments and present a comparison of the
proposed framework with other state-of-the-art techniques. To evaluate the performance of the
proposed method, we conducted various experiments on the publicly available Hockey Fight [44]
and Violent Crowd [8] violence detection datasets. We examined our network with different learn-
ing rates, numbers of convolutional layers, activation functions, and other temporal parameters
such as variations in the sequence size. We also compared our proposed model with different
handcrafted and deep learning methods. Finally, the proposed system was quantitatively evaluated
for camera prioritisation tasks, with and without a weighted strategy.
3.1 Experimental Setting
All the experiments were conducted on a Core i5 CPU equipped with NVIDIA GeForce GTX
1070 (8 GB) GPU and 24 GB onboard memory running the Windows 10 operating system, using the TensorFlow deep learning framework. For video recording and preprocessing, a Raspberry Pi model 3 was
used with a 64 GB Micro SD card, 1 GB of LPDDR2 RAM (900 MHz), a 1.2 GHz CPU (4× ARM Cortex-A53), and networking (10/100 Ethernet, 2.4 GHz 802.11n wireless). The camera aperture (F) was 2.0, the focal length was 3.04 mm, the angle of view (diagonal) was 62.2°, and the image resolution was 3280 × 2464 (with support for 1080p30, 720p60 and 640 × 480p90 video recording), running the Raspbian operating system.
3.2 Datasets
This section presents a detailed overview of the datasets used in our experiments. We used
two existing benchmark datasets, Hockey Fight and Violent Crowd.
3.2.1 Hockey Fight Dataset
This dataset was introduced by Nievas et al. [44] and contains a total of 1000 videos, 500 of which show violence (fights), while the remaining 500 show non-violent (normal) activities. All clips in the violence class were collected from fights during hockey games. The entire set of video clips was taken from National Hockey League (NHL) games, and each video clip consists of 50 video frames with a resolution of 360 × 288 × 3. Examples are shown in Fig. 3, and the details are
given in Tab. 1. After training, we evaluated the performance of our proposed architecture on the
test data. The confusion matrix for this dataset is shown in Fig. 4a.
3.2.2 Violent Crowd Dataset
The Violent Crowd dataset contains 246 videos taken from YouTube, and was introduced
by Hassner et al. [8]. Originally, this dataset consisted of five different categories of violent and
non-violent activities. In our experiments, we merged these categories into two classes containing
violent and non-violent activity. In each category, there are 123 videos, and each clip has 50 to
150 frames with dimensions of 320 × 240 × 3. An example frame from the Violent Crowd dataset is
shown in Fig. 3, and details of the dataset are given in Tab. 1. The experimental results obtained
from the Violent Crowd dataset are shown in the form of a confusion matrix in Fig. 4b. Detailed
quantitative results can be found in Tabs. 2 and 4.
3.3 Evaluation Metrics
There are several methods for evaluating the performance of a classification model, but the most common metrics are precision, recall, and accuracy. These are represented mathematically in Eqs. (4)–(6).
\text{Precision} = \frac{TP}{TP + FP}    (4)
\text{Recall} = \frac{TP}{TP + FN}    (5)
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}    (6)
where true positive (TP) is the number of violent activities correctly identified; false positive (FP) is the number of non-violent activities incorrectly predicted as violent; true negative (TN) is the number of non-violent activities correctly identified; and false negative (FN) is the number of violent activities incorrectly predicted as non-violent.
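As a small illustration, the helper below computes Eqs. (4)–(6) together with the F1-score quoted in our results; the F1 formula (the harmonic mean of precision and recall) is standard and the example counts are illustrative, not taken from our confusion matrices.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall and accuracy (Eqs. 4-6) plus the F1-score,
    computed from raw confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

print(classification_metrics(tp=485, fp=10, tn=490, fn=15))
```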
Figure 3: Example frames from the Hockey Fight and Violent Crowd datasets. Images shown in the first section are scenes from the Hockey Fight dataset, while those in the second section are scenes from the Violent Crowd dataset
Table 1: Description and statistics for the datasets
| Dataset | Samples | Resolution | Violent scenes (clips / frame rate) | Non-violent scenes (clips / frame rate) |
| Hockey Fight [44] | 1000 | 360 × 288 | 500 / 25 | 500 / 25 |
| Violent Crowd [8] | 246 | 320 × 240 | 123 / 25 | 123 / 25 |
Figure 4: Confusion matrices for (a) the Hockey Fight dataset, and (b) the Violent Crowd dataset
Table 2: Comparison of the proposed method with other state-of-the-art methods on the Violent Crowd and Hockey Fight datasets
| Method | Violent Crowd accuracy (%) | Hockey Fight accuracy (%) |
| ViF, OViF, AdaBoost and SVM [11] | 88 | 87.50 |
| Hough forests and 2D CNN [29] | – | 94.6 |
| ViF [8] | 81.3 | 82.90 |
| Improved Fisher vectors [18] | 96.4 | 93.7 |
| 3DCNN [7] | 98 | 96 |
| Proposed method | 99.89 | 98.80 |
3.4 Results and Discussion
In a CNN, the most challenging tasks involve finding the optimal kernel size, the optimal number of filters per layer, and the optimal arrangement of the layers. These hyperparameters are highly correlated with the input data. The hidden layers in a CNN architecture act as a black box, meaning that it is very difficult to identify the correct number of filters and arrangement of layers directly, and we therefore used a heuristic approach to develop an efficient model. We performed multiple experiments with different numbers of convolutional layers, hyperparameter settings, activation functions, learning rates and temporal information. To efficiently learn the
patterns of violent and non-violent activity, we need to input massive amounts of labelled data
to train the deep architecture. Each violent activity dataset contains a number N of short clips
with different durations, and each video in the dataset belongs to one of two categories: violent
or non-violent. During training, each individual input video clip is fed to the 3DCNN as a set
of batches of length 16 frames, to allow the model to learn the temporal features of the input
video.
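The sketch below shows one way to window a labelled clip into such 16-frame sequences; the non-overlapping stride is an assumption, as the paper does not state how the clips are windowed.

```python
import numpy as np

def clip_to_sequences(frames, seq_len=16):
    """Split a clip of shape [num_frames, 112, 112, 3] into consecutive
    16-frame sequences for the 3DCNN (non-overlapping stride assumed)."""
    starts = range(0, frames.shape[0] - seq_len + 1, seq_len)
    return np.stack([frames[s:s + seq_len] for s in starts])

# A 50-frame Hockey Fight clip yields three 16-frame training sequences.
clip = np.zeros((50, 112, 112, 3), dtype=np.float32)
print(clip_to_sequences(clip).shape)  # (3, 16, 112, 112, 3)
```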
3.4.1 Analyses of Different Learning Rates
When training a neural network, the most important hyperparameter is the learning rate,
and the choice of the optimal learning rate greatly affects the generalisation of the deep learning
model. There are two key assumptions that should be taken into consideration when selecting
the learning rate during training. Firstly, the learning rate should not be too high, as this will
cause overshooting while finding the minimal points, and the model will make large updates to the weights, causing divergent behaviour. At the beginning of our experiments, we therefore used a
low learning rate of 0.00001, as shown in Fig. 6a. However, from our results, we found that a low
learning rate did not always perform very well. We therefore performed further experiments based
on the second key assumption, in which the learning rate should not be too low since numerous
updates will be required before the optimal point is reached. We finally applied a learning rate
of 0.001 with the Adam optimiser, and achieved state-of-the-art results. The highest values of
accuracy were 98% for the Hockey Fight dataset and 99% for the Violent Crowd dataset, as shown
in Fig. 6b. It can be seen that the highest accuracy of 98% was obtained with a learning rate of
0.001 over 500 epochs. After conducting numerous experiments, we observed that changing the
learning rate affected the loss, accuracy, and number of iterations. Fig. 6a shows that the changes
in loss and accuracy are strongly related to the variation in the learning rate. For instance, for the first 100 epochs, the training loss, validation loss, training accuracy and validation accuracy were 0.299, 0.218, 88.3% and 94.5%, respectively. Over time, as the number of iterations increased, the model loss decreased while the accuracy increased. Finally, at 500 epochs, the loss was reduced to 0.02 and the accuracy reached 99.44% when the learning rate was set to 0.00001 and the rest of the experimental setup was kept the same as before. The accuracy for the first hundred epochs was 94%, while at 500 epochs, the obtained training loss, validation loss, training accuracy, and validation accuracy were 0.01, 0.15, 99%, and 98%, as shown in Fig. 6b.
We also trained our model with the same setup on the Violent Crowd dataset, and the results
of the training and validation process are shown in Fig. 6a. The pink line indicates the experiment
in which we achieved the highest accuracy of 99% and a loss of 0.22 over 500 epochs.
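A sketch of the corresponding training configuration is shown below, assuming the Keras model from the sketch in Section 2.2; the sparse categorical cross-entropy loss and the validation split are assumptions, while the Adam optimiser, the learning rate of 0.001 and 500 epochs follow the text.

```python
import numpy as np
import tensorflow as tf

# Placeholder data shaped like the real inputs: 16-frame salient
# sequences with integer labels (0 = non-violent, 1 = violent).
train_clips = np.zeros((32, 16, 112, 112, 3), dtype=np.float32)
train_labels = np.zeros((32,), dtype=np.int32)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # best LR found
    loss="sparse_categorical_crossentropy",  # assumed; matches the Softmax
    metrics=["accuracy"],
)
history = model.fit(train_clips, train_labels, batch_size=16,
                    epochs=500, validation_split=0.2)
```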
3.4.2 Computational Complexity Analysis
A CNN architecture consists of three different types of layers: convolutional, max-pooling and fully connected layers. The convolutional layers share filters of a fixed size, which are used to extract spatial features. The max-pooling layer does not learn anything during training. However, the fully connected layer is extremely expensive in terms of computation, as it involves one-to-one
connectivity for each pixel in an image. When a depth channel or a temporal dimension is added,
a 3DCNN becomes computationally more complex than a classical 2DCNN architecture. For
instance, in [7], the authors presented a 3DCNN architecture for the recognition of violent activity in a surveillance system. At the feature extraction stage, eight convolutional layers were used, feature reduction was achieved through five max-pooling layers, and the classification stage used two fully connected layers with 4096 neurons in each. To limit the computational complexity of our model, we conducted multiple experiments and finally developed a lightweight 3DCNN architecture consisting of eight convolutional layers, three max-pooling layers, and one fully connected layer with 512 neurons rather than 4096. The proposed network involved 14,352,811 trainable parameters,
whereas the network in [7] involved 22,228,802, a reduction of 7,875,991. The final result was a computationally inexpensive model that was capable of achieving high accuracy compared to state-of-the-art models, as shown in Tab. 2. We reduced the overall complexity of our proposed network by using only one fully connected layer.
3.4.3 Ablation Study of Different Sequences
Activity recognition with a 2DCNN involves extracting only the spatial features from video frames, which is insufficient for activity recognition. Khan et al. [45] used a CNN architecture called MobileNet for violence detection in movies that only extracted spatial features from videos, whereas a 3DCNN extracts both spatial and temporal features from videos. Temporal or sequence information plays a critical role in activity recognition. In view of this, we conducted several experiments to analyse the temporal information, as shown in Fig. 5. When the sequence length was increased, the model learned the data efficiently, but there was a trade-off between the sequence length and the computation time. To find the optimal length of a sequence, we tested our method with different sequence lengths ranging from six to 16 consecutive frames.
The experimental results are shown in Fig. 5, and it can be observed that the lowest accuracy
was achieved for a sequence of six frames. As the length of the sequence was increased, the
accuracy improved. The highest values of accuracy for both the Hockey Fight (98%) and Violent
Crowd datasets (99%) were achieved for a sequence of 16 frames. We evaluated our proposed
system using the metrics of precision, recall, and accuracy, and the results are listed in Tab. 2.
For the Hockey Fight dataset, the highest accuracy was 98% with an F1-score of 0.97, while for
the Violent Crowd dataset, the highest accuracy was 99% with an F1-score of 0.98.
Figure 5: Temporal analysis of the proposed 3DCNN architecture in terms of accuracy on the Hockey Fight and Violent Crowd datasets
Figure 6: Impact of different learning rates on the Hockey Fight and Violent Crowd datasets: (a) the optimal learning rate of 0.001 with the Adam optimiser, and (b) a smaller learning rate of 0.00001 with the Adam optimiser
3.5 Comparative Analysis
In this section, we compare the results obtained from our proposed 3DCNN architecture with
those of state-of-the-art methods. A comparative analysis is shown in Tab. 2. The first row shows the results obtained by the method proposed in [11], which used a violent flows descriptor to estimate the magnitude of the violence and the AdaBoost algorithm for feature extraction. Finally, an SVM was deployed as a violent activity classifier.
Their method achieved an accuracy of 87.50% on the Hockey Fight dataset and 88% for
the Violent Crowd dataset. Recently, Serrano et al. [29] used Hough forests with a 2DCNN to
detect violent activity, and this approach gave an accuracy of 94.6% on the Hockey Fight dataset.
Similarly, Hassner et al. [8] experimented on the same datasets with a scheme that used ViF
for feature extraction and an SVM for classification, and obtained accuracies of 82.90% and 81.3%,
respectively. A sliding window method was used with an improved Fisher vector in [18], and this
approach achieved accuracy values of 93.7% and 96.4%, respectively. In [7], the authors used a
3DCNN and obtained accuracy scores of 96% and 98% on the Hockey Fight and Violent Crowd
datasets, respectively. The last row of Tab. 2 shows the performance of the proposed model.
In general, the computational complexity of CNN architectures is directly related to the fully
connected layer. For instance, the method presented in [7] used eight convolutional layers and two
fully connected layers with 4096 neurons per layer. In contrast, our proposed 3DCNN architecture
consists of eight convolution layers and only one fully connected layer with 512 neurons. Our
model therefore performed favourably against the scheme in [7] in terms of both accuracy and network complexity.
3.6 Quantitative Evaluation of Our Proposed Camera Prioritisation Method with and without a
Weighting Strategy
For camera prioritisation, there is no standard evaluation metric with which to check performance. However, to achieve the best performance, a camera prioritisation metric should capture the impact on overall system accuracy and consider the vulnerability of the camera node's position. This is very important, since some locations such as banks and border control stations require a high level of surveillance, based on selecting camera nodes with a high priority.
Video streams from camera nodes are fed to a trained 3DCNN model to classify and calculate
the probability of violent or non-violent activity. If violent activity is detected, the probability
calculated by the trained 3DCNN model and the metadata about the camera are passed to the
LTGS module, as shown in Eqs. (2) and (3). The weights or metadata reect the importance of
the location of the node: a low weight means that the position is not important, while a higher
weight indicates that the position of the node is more important. A mathematical representation
of our camera prioritisation scheme is given in Eqs. (2) and (3).Tab. 3 presents the results of
our experiments without a weighting strategy, in which the weight is set to a constant value of
one. In Tab. 3, the column marked ‘Ground truth’ represents the number of cameras showing
violent activity, which need to be prioritised. In contrast, the column marked ‘Cameras prioritised
by the model’ shows the number of cameras selected for prioritisation by our proposed system.
In the rst experiment, we used 10 cameras over which violent activity was randomly distributed
to check the robustness of our proposed model on a small network of WVSNs. In the rst test,
cameras 1 to 8 captured violent scenes, while cameras 9 and 10 recorded normal events. When we
passed these streams to the trained 3DCNN model, it prioritised seven of the eight cameras that
recorded these activities. To calculate the overall accuracy, we divided the number of correctly prioritised cameras by the total number of cameras that should have been prioritised. The
proposed model achieved an average accuracy of 86.1%.
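This calculation is simple enough to state directly; the function below reproduces the Test 1 figure of 87.5% (7 of 8 violent-scene cameras prioritised).

```python
def prioritisation_accuracy(correctly_prioritised, ground_truth):
    """Correctly prioritised cameras over the cameras that should
    have been prioritised, as a percentage."""
    return 100.0 * correctly_prioritised / ground_truth

# Test 1 of experiment 1: 7 of the 8 violent-scene cameras were prioritised.
print(prioritisation_accuracy(7, 8))  # 87.5
```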
Table 3: Experiments on camera prioritisation with a constant weight value (w = 1)
| Experiment | Test | Number of cameras | Ground truth | Cameras prioritised by the model (ϕ) | Accuracy (%) |
| Experiment 1 | Test 1 | 10 | 8 | 7 | 87.5 |
| | Test 2 | 10 | 7 | 6 | 85.7 |
| | Test 3 | 10 | 9 | 7 | 77.7 |
| | Test 4 | 10 | 6 | 6 | 100 |
| | Test 5 | 10 | 5 | 4 | 80 |
| | Average | 10 | 7 | 6 | 86.1 |
| Experiment 2 | Test 1 | 15 | 13 | 11 | 84.6 |
| | Test 2 | 15 | 11 | 10 | 90.9 |
| | Test 3 | 15 | 9 | 9 | 100 |
| | Test 4 | 15 | 14 | 14 | 100 |
| | Test 5 | 15 | 8 | 6 | 75 |
| | Average | 15 | 11 | 10 | 90.1 |
| Experiment 3 | Test 1 | 20 | 14 | 14 | 100 |
| | Test 2 | 20 | 15 | 14 | 93.3 |
| | Test 3 | 20 | 16 | 15 | 93.7 |
| | Test 4 | 20 | 13 | 13 | 100 |
| | Test 5 | 20 | 17 | 14 | 82.3 |
| | Average | 20 | 15 | 14 | 93.8 |
In the second experiment, the number of cameras was increased from 10 to 15, and the
total number of cameras that needed to be prioritised was 11. For each individual test in this
set, the number of cameras reporting violent activity is listed under the ‘Ground truth’ column.
In this experiment, our proposed system prioritised 10 cameras, giving an average accuracy of
90.1%. Similarly, in experiment 3, we evaluated 20 cameras and achieved an accuracy of 93.8%.
In test 1 of this experiment, our model prioritised all 14 cameras in the ground truth, achieving an accuracy of 100%. The details of tests 2–5 are shown in Tab. 3. The results of the experiments with a weighting strategy are presented in Tab. 4.
3.7 Result Derivatives
Existing systems use highly intensive computational methods for the recognition of abnormal
activity, as discussed in Sections 1.1 and 1.2, and cannot be deployed in real-time surveillance
systems. Techniques in the literature on camera prioritisation are mostly based on handcrafted
features, and no statistical analyses of their results are provided. Furthermore, no discussion
of the validity of camera prioritisation is available for large-scale WVSNs. In this paper, we
have proposed a lightweight 3DCNN model for abnormal activity recognition using resource-
constrained devices, which outperformed existing state-of-the-art methods in terms of accuracy and
F1-score. Furthermore, we have proposed an intelligent algorithm called LTGS that prioritises
cameras showing abnormal activity above those showing a normal stream. The results of multiple
comprehensive experiments and evaluations show that the proposed system performs well in terms
of both violent activity recognition and efficient camera prioritisation in large-scale surveillance
networks.
Table 4: Experiments on camera prioritisation with a weighting strategy
| Experiment | Test | Number of cameras | Ground truth | Cameras prioritised by the model | Weight value (w) | w × x | Total net accuracy ϕ (%) |
| Experiment 1 | Test 1 | 10 | 8 | 7 | 4 | 3.5 | 97 |
| | Test 2 | 10 | 7 | 6 | 5 | 4.625 | 99 |
| | Test 3 | 10 | 9 | 7 | 3 | 2.571 | 92.8 |
| | Test 4 | 10 | 6 | 6 | 1 | 1 | 73.1 |
| | Test 5 | 10 | 5 | 4 | 2 | 1.6 | 83.2 |
| | Average | 10 | 7 | 6 | 3 | 2.659 | 89.02 |
| Experiment 2 | Test 1 | 15 | 13 | 11 | 2 | 1.692 | 84.4 |
| | Test 2 | 15 | 11 | 10 | 3 | 2.727 | 93.8 |
| | Test 3 | 15 | 9 | 9 | 1 | 1 | 73.1 |
| | Test 4 | 15 | 14 | 14 | 2 | 2 | 88 |
| | Test 5 | 15 | 8 | 6 | 2 | 1.5 | 81.7 |
| | Average | 15 | 11 | 10 | 2 | 1.7838 | 84.2 |
| Experiment 3 | Test 1 | 20 | 14 | 14 | 1 | 1 | 73.1 |
| | Test 2 | 20 | 15 | 14 | 4 | 3.732 | 97.6 |
| | Test 3 | 20 | 16 | 15 | 6 | 5.622 | 99.6 |
| | Test 4 | 20 | 13 | 13 | 2 | 2 | 88 |
| | Test 5 | 20 | 17 | 14 | 2 | 1.646 | 83.8 |
| | Average | 20 | 15 | 14 | 3 | 2.8 | 88.42 |
4 Concluding Remarks and Directions for Future Work
In this study, we have investigated the strength and capabilities of a 3DCNN for intelli-
gent camera prioritisation in large-scale WVSNs, based on violent activity recognition. WVSNs
typically consist of a large number of visual nodes, each of which continuously generates a
massive amount of video data. The efficient monitoring and streaming of such huge amounts of
data is very challenging, due to the limited availability of computational resources. To overcome
the drawbacks of traditional surveillance systems, we have proposed a novel intelligent camera
prioritisation framework for large-scale WVSNs. A 3DCNN is used for violent activity recognition
that is capable of learning not only spatial but also temporal information. The proposed model
can prioritise cameras with an accuracy of 98%–99%. In future work, we aim to further reduce
the cost of the proposed framework by customising the Raspberry Pi, for example by removing
unwanted hardware such as the mouse, keyboard and GPIO pins. Removing unnecessary functionalities from this device can significantly reduce the overall computational complexity. For activity
recognition, the features extracted by our proposed architecture could be fed to a recurrent neural
network such as an LSTM to classify the input video stream more efficiently.
Acknowledgement: This work was supported by the faculty research fund of Sejong University in
2020 and also supported by Institute of Information & communications Technology Planning &
Evaluation (IITP) grant funded by the Korea government (MSIT) (2019-0-00136, Development of
AI-Convergence Technologies for Smart City Industry Productivity Innovation).
Funding Statement: Institute of Information & communications Technology Planning & Eval-
uation (IITP) grant funded by the Korea government (MSIT) (2019-0-00136, Development of
AI-Convergence Technologies for Smart City Industry Productivity Innovation).
Conicts of Interest: The authors declare that they have no conicts of interest to report regarding
the present study.
References
[1] W. Ullah, A. Ullah, I. U. Haq, K. Muhammad and S. W. Baik, “CNN features with bi-directional
LSTM for real-time anomaly detection in surveillance networks,” Multimedia Tools and Applications,
vol. 1, pp. 1–17, 2020.
[2] K. Karboub, M. Tabaa, S. Dellagi, A. Dandache and F. Moutaouakkil, “Intelligent patient monitoring for arrhythmia and congestive failure patients using internet of things and convolutional neural network,” in 31st Int. Conf. on Microelectronics, Cairo, Egypt, pp. 292–295, 2019.
[3] Y.-H. Tsai, J.-K. Hsu, Y.-E. Wu and W.-F. Huang, “Distributed multimedia content processing in
ONVIF surveillance system,” in 2011 Int. Conf. on Future Computer Sciences and Application, Hong Kong,
China, pp. 70–73, 2011.
[4] C. A. T. Stelios, D. Stelios and D. Antonios, “Automated real-time risk assessment for airport passengers using a deep learning architecture,” in Signal Processing, Sensor/Information Fusion and Target Recognition XXVIII, Maryland, United States, pp. 110180, 2019.
[5] D. Arjun, P. K. Indukala and K. A. U. Menon, “PANCHENDRIYA: A Multi-sensing framework
through wireless sensor networks for advanced border surveillance and human intruder detection,” in
2019 Int. Conf. on Communication and Electronics Systems, Coimbatore, India, pp. 295–298, 2019.
[6] S. Telang, A. Chel, A. Nemade and G. Kaushik, “Intelligent Transport System for a Smart City,” in
Security and Privacy Applications for Smart City Development. Cham: Springer, pp. 171–187, 2021.
[7] F. U. M. Ullah, A. Ullah, K. Muhammad, I. U. Haq and S. W. Baik, “Violence detection using
spatiotemporal features with 3D convolutional neural network,” Sensors, vol. 19, no. 11, pp. 2472,
2019.
[8] T. Hassner, Y. Itcher and O. Kliper-Gross, “Violent flows: Real-time detection of violent crowd
behavior,” in 2012 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition Workshops,
Providence, RI, USA, pp. 1–6, 2012.
[9] J.-F. Huang and S.-L. Chen, “Detection of violent crowd behavior based on statistical characteristics of
the optical ow,” in 11th Int. Conf. on Fuzzy Systems and Knowledge Discovery, Xiamen, China, pp. 565–
569, 2014.
[10] T. Zhang, Z. Yang, W. Jia, B. Yang, J. Yang et al., “A new method for violence detection in surveillance scenes,” Multimedia Tools and Applications, vol. 75, no. 12, pp. 7327–7349, 2016.
[11] Y. Gao, H. Liu, X. Sun, C. Wang and Y. Liu, “Violence detection using oriented VIolent flows,” Image and Vision Computing, vol. 48–49, no. 6, pp. 37–41, 2016.
[12] D. Chen, H. Wactlar, M.-Y. Chen, C. Gao, A. Bharucha et al., “Recognition of aggressive human
behavior using binary local motion descriptors,” in 30th Annual Int. Conf. of the IEEE Engineering in
Medicine and Biology Society, Vancouver, BC, Canada, pp. 5238–5241, 2008.
[13] F. D. De Souza, G. C. Chavez, E. A. do Valle Jr and A. D. A. Araújo, “Violence detection in video
using spatio-temporal features,” in 23rd SIBGRAPI Conf. on Graphics, Patterns and Images, Gramado, Brazil, pp. 224–230, 2010.
[14] L. Xu, C. Gong, J. Yang, Q. Wu and L. Yao, “Violent video detection based on MoSIFT feature and
sparse coding,” in 2014 IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Florence, Italy, pp.
3538–3542, 2014.
[15] K. Lloyd, P. L. Rosin, D. Marshall and S. C. Moore, “Detecting violent and abnormal crowd activity
using temporal analysis of grey level co-occurrence matrix (GLCM)-based texture measures,” Machine
Vision and Applications, vol. 28, no. 3–4, pp. 361–371, 2017.
[16] E. Y. Fu, H. V. Leong, G. Ngai and S. C. Chan, “Automatic fight detection in surveillance videos,” International Journal of Pervasive Computing and Communications, vol. 3, pp. 1–11, 2017.
[17] J. Mahmoodi and A. Salajeghe, “A classification method based on optical flow for violence detection,” Expert Systems with Applications, vol. 127, no. 1, pp. 121–127, 2019.
[18] P. Bilinski and F. Bremond, “Human violence recognition and detection in surveillance videos,” in 2016
13th IEEE Int. Conf. on Advanced Video and Signal Based Surveillance, Colorado Springs, CO, USA, pp.
30–36, 2016.
[19] A. Deshmukh, K. Warang, Y. Pente and N. Marathe, “Violence detection through surveillance system,”
in ICT Systems and Sustainability. Berlin, Germany: Springer, pp. 503–511, 2021.
[20] R. Maqsood, U. I. Bajwa, G. Saleem, R. H. Raza et al., “Anomaly recognition from surveillance videos using 3D convolutional neural networks,” Multimedia Tools and Applications, 2021. https://doi.org/10.1007/s11042-021-10570-3.
[21] D. Tran, L. Bourdev, R. Fergus, L. Torresani and M. Paluri, “Learning spatiotemporal features with
3d convolutional networks,” in Proc. of the IEEE Int. Conf. on Computer Vision, Montreal, Canada,
pp. 4489–4497, 2015.
[22] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the kinetics dataset,” in
Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Nashville, United States, pp. 6299–
6308, 2017.
[23] A. Diba, M. Fayyaz, V. Sharma, A. Hossein Karami, M. Mahdi Arzani et al., “Temporal 3d convnets
using temporal transition layer,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition
Workshops, Nashville, United States, pp. 1117–1121, 2018.
[24] Z. Qiu, T. Yao and T. Mei, “Learning spatio-temporal representation with pseudo-3d residual net-
works,” in Proc. of the IEEE Int. Conf. on Computer Vision, Montreal, Canada, pp. 5533–5541,
2017.
[25] D. Tran, J. Ray, Z. Shou, S.-F. Chang and M. Paluri, “Convnet architecture search for spatiotemporal
feature learning,” ArXiv, vol. abs/1708.05038, 2017. https://arxiv.org/abs/1708.05038.
[26] J. Stroud, D. Ross, C. Sun, J. Deng and R. Sukthankar, “D3d: Distilled 3d networks for video action
recognition,” in Proc. of the IEEE/CVF Winter Conf. on Applications of Computer Vision, Waikoloa,
United States, pp. 625–634, 2020.
[27] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun et al., “A closer look at spatiotemporal convolutions
for action recognition,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Nashville,
United States, pp. 6450–6459, 2018.
[28] S. Xie, C. Sun, J. Huang, Z. Tu and K. Murphy, “Rethinking spatiotemporal feature learning: Speed-
accuracy trade-offs in video classification,” in Proc. of the European Conf. on Computer Vision, Montreal,
Canada, pp. 305–321, 2018.
[29] I. Serrano, O. Deniz, J. L. Espinosa-Aranda and G. Bueno, “Fight recognition in video using hough
forests and 2D convolutional neural network,” IEEE Transactions on Image Processing, vol. 27, no. 10,
pp. 4787–4797, 2018.
[30] A. Ullah, K. Muhammad, K. Haydarov, I. U. Haq, M. Lee et al., “One-shot learning for surveillance
anomaly recognition using siamese 3D CNN,” in 2020 Int. Joint Conf. on Neural Networks, Glasgow,
UK, pp. 1–8, 2020.
[31] S. Maresca, G. Serafino, F. Scotti, F. Amato, L. Lembo et al., “Photonics for coherent MIMO radar: An experimental multi-target surveillance scenario,” in 2019 20th Int. Radar Symp., Ulm, Germany, pp. 1–6, 2019.
[32] N. Kumar and N. Sukavanam, “A cascaded CNN model for multiple human tracking and re-localization in complex video sequences with large displacement,” Multimedia Tools and Applications, vol. 79, no. 9, pp. 6109–6134, 2020.
[33] C. Sampedro, A. Rodriguez-Ramos, H. Bavle, A. Carrio, P. de la Puente et al., “A fully-autonomous aerial robot for search and rescue applications in indoor environments using learning-based techniques,” Journal of Intelligent & Robotic Systems, vol. 95, no. 2, pp. 601–627, 2019.
[34] S. S. Sengar and S. Mukhopadhyay, “Moving object area detection using normalized self adaptive
optical ow,” Optik, vol. 127, no. 16, pp. 6258–6267, 2016.
[35] J. Dou, Q. Qin and Z. Tu, “Background subtraction based on circulant matrix,” Signal, Image and Video
Processing, vol. 11, no. 3, pp. 407–414, 2017.
[36] M. Fei, J. Li and H. Liu, “Visual tracking based on improved foreground detection and perceptual hashing,” Neurocomputing, vol. 152, no. 8, pp. 413–428, 2015.
[37] A. Zam, M. R. Khayyambashi and A. Bohlooli, “Energy-aware strategy for collaborative target-
detection in wireless multimedia sensor network,” Multimedia Tools and Applications, vol. 78, no. 13,
pp. 18921–18941, 2019.
[38] I. Mehmood, M. Sajjad, W. Ejaz and S. W. Baik, “Saliency-directed prioritization of visual data in
wireless surveillance networks,” Information Fusion, vol. 24, no. 3, pp. 16–30, 2015.
[39] S. S. Thomas, S. Gupta and V. K. Subramanian, “Event detection on roads using perceptual video
summarization,” IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 9, pp. 2944–2954,
2018.
[40] K. Muhammad, T. Hussain and S. W. Baik, “Efficient CNN based summarization of surveillance videos for resource-constrained devices,” Pattern Recognition Letters, vol. 130, pp. 370–375, 2020.
[41] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy et al., “Recent advances in convolutional neural
networks,” Pattern Recognition, vol. 77, no. 11, pp. 354–377, 2018.
[42] A. Piergiovanni, A. Angelova and M. S. Ryoo, “Evolving losses for unsupervised video representation
learning,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, Shanghai, China,
pp. 133–142, 2020.
[43] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long et al., “Caffe: Convolutional architecture for fast feature embedding,” in Proc. of the 22nd ACM Int. Conf. on Multimedia, Florida, USA, pp. 675–678,
2014.
[44] E. B. Nievas, O. D. Suarez, G. B. García and R. Sukthankar, “Violence detection in video using
computer vision techniques,” in Int. Conf. on Computer Analysis of Images and Patterns, NY, USA,
pp. 332–339, 2011.
[45] S. U. Khan, I. U. Haq, S. Rho, S. W. Baik and M. Y. Lee, “Cover the violence: A novel deep-learning-
based approach towards violence-detection in movies,” Applied Sciences, vol. 9, no. 22, pp. 4963, 2019.
Article
Full-text available
Human Activity Recognition (HAR) plays a crucial role in communication and the Internet of Things (IoT), by enabling vision sensors to understand and respond to human behavior more intelligently and efficiently. Existing deep learning models are complex to deal with the low illumination, diverse viewpoints, and cluttered backgrounds, which require substantial computing resources and are not appropriate for edge devices. Furthermore, without an effective video analysis technique it processes entire frames, resulting inadequate performance. To address these key challenges, a cloud-assisted IoT computing framework is proposed for HAR in uncertain low-lighting environments, which is mainly composed of two tiers: edge and cloud computing. Initially, a lightweight Convolutional Neural Network (CNN) model is developed which is responsible to enhance the low-light frames, followed by the human detection algorithm to process the selective frames, thus enabling efficient resource utilization. Next, these refined frames are then transmitted to the cloud for accurate HAR, where dual stream CNN and transformer fusion network extract both short-and long-range spatiotemporal discriminative features followed by proposed Optimized Parallel Sequential Temporal Network (OPSTN) with squeeze and excitation attention to efficiently learn HAR in complex scenarios. Finally, extensive experiments are conducted over three challenging HAR datasets to deeply examine the proposed framework from various perspectives such as complex activity recognition, lowlighting, etc., where the results are outperformed compared with the state-of-art methods.
... However, sign expression is not simply understood by the listener to the community, resulting in an interaction difference between the speech and hearing-reduced and hearing populations [10]. Recently, deep learning (DL) has achieved encouraging outcomes in different domains, such as activity recognition [11,12], disease recognition [13], and energy forecasting [14]. Therefore, HGRoc, using computer knowledge, can serve as a translator for sign motion translation, creating a bridge between these two communities. ...
Article
Full-text available
Hand gestures have been used as a significant mode of communication since the advent of human civilization. By facilitating human-computer interaction (HCI), hand gesture recognition (HGRoc) technology is crucial for seamless and error-free HCI. HGRoc technology is pivotal in healthcare and communication for the deaf community. Despite significant advancements in computer vision-based gesture recognition for language understanding, two considerable challenges persist in this field: (a) limited and common gestures are considered, (b) processing multiple channels of information across a network takes huge computational time during discriminative feature extraction. Therefore, a novel hand vision-based convolutional neural network (CNN) model named (HVCNNM) offers several benefits, notably enhanced accuracy, robustness to variations, real-time performance, reduced channels, and scalability. Additionally, these models can be optimized for real-time performance, learn from large amounts of data, and are scalable to handle complex recognition tasks for efficient human-computer interaction. The proposed model was evaluated on two challenging datasets, namely the Massey University Dataset (MUD) and the American Sign Language (ASL) Alphabet Dataset (ASLAD). On the MUD and ASLAD datasets, HVCNNM achieved a score of 99.23% and 99.00%, respectively. These results demonstrate the effectiveness of CNN as a promising HGRoc approach. The findings suggest that the proposed model have potential roles in applications such as sign language recognition, human-computer interaction, and robotics.
... Another famous study was performed by Abdullahi, who designed a bidirectional long-short term memory-fast fisher vector algorithm to train 3D hand skeletal information of motion and orientation angle features and further used it to classify dynamic sign words [10], [11]. One of the representative works [12], [13] from Sejong University proposed a cloud-assisted IoT computing framework for human activity recognition in uncertain low-lighting environments and applied a lightweight three-dimensional convolutional neural network architecture to extract spatiotemporal features from significant frames to easily identify violent behaviors in video. This group also pretrained a vision transformer to extract frame features and did research on identifying abnormal behaviors in video [14], [15]. ...
Article
Full-text available
Human motion prediction is a popular method to predict future motion sequences based on past sequences, which is widely used in human-computer interaction. Space-time-separable graph Convolutional Network (STS-GCN) is a conventional mathematical model for human motion prediction. However, the uncertainty of human movements often leads to the problem of significant prediction error in the prediction results. This paper first proposed a Multi-scale STS-GCN (MSTS-GCN) model based on the conventional STS-GCN method to find the relevant factors that affect the prediction results. In our study, the constructed Multi-scale Temporal Convolutional Network (MTCN) decoder effectively reduced the human motion prediction error at specific time nodes. To expand the transmission and utilization performance in a larger receptive field, a Gated Recurrent Unit-TCN decoder was also designed. Finally, a new STS-GCN (NSTS-GCN) human motion prediction model was proposed, which realized the transmission and utilization of motion sequence features under a larger temporal perceptual field. To verify the effectiveness of NSTS-GCN, the Human3.6M dataset, AMASS and 3DPW dataset were tested. The experimental results show that the MPJPE error of the proposed model for human joint prediction at each time node is reduced compared with the conventional STS-GCN model, and the mean reduction was achieved by 3.0mm. All the experimental results validated the effectiveness of the proposed NSTS-GCN model, which further improved the performance of human motion prediction.
... e task of nonlinear mapping and feature extraction is extremely challenging; therefore, the best way to tackle these challenges is to employ deep learning models with the ability to extract the discriminative features end-toend [29,30]. In recent years, the application of deep learning models has significantly improved for image classification [31,32], video classification [33][34][35][36][37], and power forecasting in TS data [38][39][40][41][42]. For instance, Khan et al. [43] proposed a hybrid model for electricity forecasting in residential and commercial buildings. ...
Article
Full-text available
For efficient energy distribution, microgrids (MG) provide significant assistance to main grids and act as a bridge between the power generation and consumption. Renewable energy generation resources, particularly photovoltaics (PVs), are considered as a clean source of energy but are highly complex, volatile, and intermittent in nature making their forecasting challenging. Thus, a reliable, optimized, and a robust forecasting method deployed at MG objectifies these challenges by providing accurate renewable energy production forecasting and establishing a precise power generation and consumption matching at MG. Furthermore, it ensures effective planning, operation, and acquisition from the main grid in the case of superior or inferior amounts of energy, respectively. Therefore, in this work, we develop an end-to-end hybrid network for automatic PV power forecasting, comprising three basic steps. Firstly, data preprocessing is performed to normalize, remove the outliers, and deal with the missing values prominently. Next, the temporal features are extracted using deep sequential modelling schemes, followed by the extraction of spatial features via convolutional neural networks. These features are then fed to fully connected layers for optimal PV power forecasting. In the third step, the proposed model is evaluated on publicly available PV power generation datasets, where its performance reveals lower error rates when compared to state-of-the-art methods.
... (7) Establishing a blockchain for video surveillance equipment: In a technological environment where digital surveillance systems are ubiquitous and continuously producing large amounts of data, manual surveillance is required to identify human activities in the public realm. Meanwhile, smart surveillance systems that can identify normal and abnormal activities are urgently needed because they allow for the effective monitoring of images sent from cameras that are designed to capture abnormal activities; the implementation of these systems can alleviate the lack of surveillance personnel [42,43]. Furthermore, the inclusion of these systems can enhance the community's control over the number of people entering the mountain and the entry of unauthorized personnel. ...
Article
Full-text available
Forest protection is crucial to ensuring the balance between human beings and ecology. This study explores the key role played by communities that originally lived in forest-protected areas in implementing the traditional management of forests. The unified management mode previously used by the state power can no longer meet the demands of modern times; hence, multiple types of management systems should be implemented to enable adaption to the original ecology of forest areas. A multimodal management mode should be adopted to restore the original ecology of forest areas. The adoption of this management system can restore a forest to its original state (i.e., the state that existed prior to the entry of state power). The forest has been in a state of ecological balance involving numerous species since ancient times. However, in the modern field of science, the passive restoration of a community’s self-governance ability could be unsustainable and unstable. To improve this situation, blockchain technology can first be used to improve the community management of a forest, such that the capabilities of the original local community can be improved; second, tourism promotion benefits both the community and the forest. When a community in a forest develops the tourism industry with the support of blockchain technology, the information and resources of all parties can be widely connected with the larger world, and this considerably increases success rates; finally, the traditional spiritual culture of a community, such as the culture of sharing, should be promoted. In addition to the skillful utilization of technology, culture can improve the traditional forest management ability of tribal communities who live in native forest areas in terms of their personality traits. Overall, we conclude that: against the evolution history of the over one hundred years, the adoption of new technology for forest management is inherently a creative innovation for the tribal community’s entrepreneurial development.
Chapter
Car theft is a constant problem in parking lots and places where cars are left unattended. Car theft detection is a time-consuming task due to the human resources that are required. Therefore, the task of checking closed circuit television (CCTV) cameras can be automated using machine learning techniques. The implementation of such a system would mean an optimization of the current technology. Even if a CCTV camera is installed, it requires human labor to supervise the area, which is a repetitive and time-consuming task. A machine learning algorithm could simplify the task and attend to many cameras without decreasing the attention on each one. In this context, a systematic review of the literature on machine learning was conducted based on four research questions using the PRISMA methodology. The research method may help to find the current methods used in similar applications and possible ways to implement the proposed automatic solution. This scientific study retrieved 384 articles from Web of Science, Scopus, and IEEE databases. The number of studies used to answer the research questions was 58. Finally, analyzing the most frequent models and metrics, Convolutional Neural Networks and Accuracy were the most referenced, with 30 and 42 mentions, respectively.KeywordsMachine learningcar theftmodelrecognitionpredictionvideo analysis
Article
Full-text available
The research on complex human body motion including sports and workout activity recognition is a major challenge and long-lasting problem for the computer vision community. Recent development in deep learning algorithms to track people’s workout activity characteristics based on video sensors can be used to infer the human body pose for further analysis. Specifically, tracking complex body movements while performing multi-pose physical exercise helps individuals provide fine granularity feedback including activity repetition counting and activity recognition. Therefore, this research proposes a system that provides a repetition counter and activity recognition of physical exercise from video frames (extracted 3D human skeleton using VIBE) based on the deep semantic features and repetitive segmentation algorithm. The proposed system locates both ends of the activity’s action and segments the activity into multiple unit actions which improves activity recognition, time intervals, # of sets, and other quantitative values of activity. The proposed system is evaluated on the physical activities dataset named “NOL-18 Exercise” through extensive experiments. The proposed system results show that the accuracy of the repetitive action segmentation is 96.27% with 0.23% time error, and action recognition reaches 99.06%. The system can be employed to fitness or rehabilitation centers and used for treating patients.
Article
Full-text available
To prevent economic, social, and ecological damage, fire detection and management at an early stage are significant yet challenging. Although computationally complex networks have been developed, attention has been largely focused on improving accuracy, rather than focusing on real-time fire detection. Hence, in this study, the authors present an efficient fire detection framework termed E-FireNet for real-time detection in a complex surveillance environment. The proposed model architecture is inspired by the VGG16 network, with significant modifications including the entire removal of Block-5 and tweaking of the convolutional layers of Block-4. This results in higher performance with a reduced number of parameters and inference time. Moreover, smaller convolutional kernels are utilized, which are particularly designed to obtain the optimal details from input images, with numerous channels to assist in feature discrimination. In E-FireNet, three steps are involved: preprocessing of collected data, detection of fires using the proposed technique, and, if there is a fire, alarms are generated and transmitted to law enforcement, healthcare, and management departments. Moreover, E-FireNet achieves 0.98 accuracy, 1 precision, 0.99 recall, and 0.99 F1-score. A comprehensive investigation of various Convolutional Neural Network (CNN) models is conducted using the newly created Fire Surveillance SV-Fire dataset. The empirical results and comparison of numerous parameters establish that the proposed model shows convincing performance in terms of accuracy, model size, and execution time.
Article
Full-text available
Anomalous activity recognition deals with identifying the patterns and events that vary from the normal stream. In a surveillance paradigm, these events range from abuse to fighting and road accidents to snatching, etc. Due to the sparse occurrence of anomalous events, anomalous activity recognition from surveillance videos is a challenging research task. The approaches reported can be generally categorized as handcrafted and deep learning-based. Most of the reported studies address binary classification i.e. anomaly detection from surveillance videos. But these reported approaches did not address other anomalous events e.g. abuse, fight, road accidents, shooting, stealing, vandalism, and robbery, etc. from surveillance videos. Therefore, this paper aims to provide an effective framework for the recognition of different real-world anomalies from videos. This study provides a simple, yet effective approach for learning spatiotemporal features using deep 3-dimensional convolutional networks (3D ConvNets) trained on the University of Central Florida (UCF) Crime video dataset. Firstly, the frame-level labels of the UCF Crime dataset are provided, and then to extract anomalous spatiotemporal features more efficiently a fine-tuned 3D ConvNets is proposed. Findings of the proposed study are twofold 1) There exist specific, detectable, and quantifiable features in UCF Crime video feed that associate with each other 2) Multiclass learning can improve generalizing competencies of the 3D ConvNets by effectively learning frame-level information of dataset and can be leveraged in terms of better results by applying spatial augmentation. The proposed study extracted 3D features by providing frame level information and spatial augmentation to a fine-tuned pre-trained model, namely 3DConvNets. Besides, the learned features are compact enough and the proposed approach outperforms significantly from state of art approaches in terms of accuracy on anomalous activity recognition having 82% AUC.
Article
Full-text available
In current technological era, surveillance systems generate an enormous volume of video data on a daily basis, making its analysis a difficult task for computer vision experts. Manually searching for unusual events in these massive video streams is a challenging task, since they occur inconsistently and with low probability in real-world surveillance. In contrast, deep learning-based anomaly detection reduces human labour and its decision making ability is comparatively reliable, thereby ensuring public safety. In this paper, we present an efficient deep features-based intelligent anomaly detection framework that can operate in surveillance networks with reduced time complexity. In the proposed framework, we first extract spatiotemporal features from a series of frames by passing each one to a pre-trained Convolutional Neural Network (CNN) model. The features extracted from the sequence of frames are valuable in capturing anomalous events. We then pass the extracted deep features to multilayer Bi-directional long short-term memory (BD-LSTM) model, which can accurately classify ongoing anomalous/normal events in complex surveillance scenes of smart cities. We performed extensive experiments on various anomaly detection benchmark datasets to validate the functionality of the proposed framework within complex surveillance scenarios. We reported a 3.41% and 8.09% increase in accuracy on UCF-Crime and UCFCrime2Local datasets compared to state-of-the-art methods.
Conference Paper
Full-text available
One-shot image recognition has been explored for many applications in computer vision community. However, its applications in video analytics is not deeply investigated yet. For instance, surveillance anomaly recognition is an open challenging problem and one of its hurdles is the lack of accurate temporally annotated data. This paper addresses the lack of data issue using one-shot learning strategy and proposes an anomaly recognition framework which exploits a 3D CNN siamese network that yields the similarity between two anomaly sequences. This paper also investigates the existing 3D CNNs for this task and then proposes a lightweight 3D CNN model that efficiently handles one-shot anomaly recognition. Once our network is trained, then we can use the powerful discriminative 3D CNN features to predict anomalies not only for the new data but also for entirely new classes. The proposed model is trained using temporally annotated test set of UCF Crime dataset. Finally, the trained model is used to recognize the anomalies and produce temporal automatic labels for the video level weakly annotated training set of the dataset.
Article
Full-text available
Human tracking and localization play a crucial role in many applications like accident avoidance, action recognition, safety and security, surveillance and crowd analysis. Inspired by its use and scope, we introduced a novel method for human tracking (one or many) and re-localization in a complex environment with large displacement. The model can handle complex background, variations in illumination, changes in target pose, the presence of similar target and appearance (pose and clothes), the motion of target and camera, occlusion of the target, background variation, and massive displacement of the target. Our model uses three convolutional neural network based deep architecture and cascades their learning such that it improves the overall efficiency of the model. The first network learns the pixel level representation of small regions. The second architecture uses these features and learns the displacement of a region with its category between moved, not-moved, and occluded classes. Whereas, the third network improves the displacement result of the second network by utilizing the previous two learning. We also create a semi-synthetic dataset for training purpose. The model is trained on this dataset first and tested on a subset of CamNeT, VOT2015, LITIV-tracking and Visual Tracker Benchmark database without training with real data. The proposed model yield comparative results with respect to current state-of-the-art methods based on evaluation criteria described in Object Tracking Benchmark, TPAMI 2015, CVPR 2013 and ICCV 2017.
Chapter
This project aims to deliver a system which detects violence in the crowd. The model does not need human intervention to detect violence as it is automated. A dataset has been collected for detecting whether violence is taking place or not. The dataset includes both the scenarios, the ones which contain violence and ones which does not. Then, the model is trained to analyze whether the scenario contains violence or not. To detect scenarios which contain violent instances, various deep learning algorithms are applied on the dataset. CNN and LSTM-based architectures are experimented separately and in combination on this dataset. The model can be easily implemented in CCTV camera systems because it is lightweight.
Chapter
The term smart city was coined in the early 1990s. This term includes urban development with new improved developments in the technology, innovation, and globalization. The major contribution is towards adapting the smart growth movement of the late 1990s. This has advocated improved urban planning and utilization of improved Wi-Fi enabled gadgets. It is needed for growth in new global knowledge economy. It also integrates the operation of urban infrastructure and services used in the buildings, transportation, electrical and water distribution and public safety. The smart city is part of urban development which has information and communication technology (ICT) to facilitate improved insight into as well as control over the various systems that affect the lives of residents.
Conference Paper
In the current paper we present a low cost intelligent system capable to collect data from one lead Electrocardiogram (ECG) sensor, process the collected data and classify the signal into one of three categories: arrhythmia, congestive failure or normal heart beat with 100 % positive predictive value and 100% negative predictive value. The achieved performance uses only 1 min data recording for every patient which increases the probability to save the patient’s life and outperform state of the art of similar systems. The proposed system can be used for a specified patient and can handle longer ECG records. The system can also be trained by other databases and can by then classify and monitor new types of recorded ECG signals, thanks to its simplicity and computational efficiency.