Tech Science Press
Computers, Materials & Continua
DOI:10.32604/cmc.2022.018181
Article
Anomaly Based Camera Prioritization in Large Scale Surveillance Networks
Altaf Hussain1,2, Khan Muhammad1, Hayat Ullah1, Amin Ullah1,4, Ali Shariq Imran3, Mi Young Lee1,
Seungmin Rho1 and Muhammad Sajjad2,3,*
1Department of Software, Sejong University, 143-747, Seoul, Korea
2Digital Image Processing Lab, Islamia College Peshawar, 25000, Pakistan
3Color Lab, Department of Computer Science, Norwegian University of Science and Technology (NTNU),
2815, Gjøvik, Norway
4CORIS Institute, Oregon State University, 97331, Oregon, USA
*Corresponding Author: Muhammad Sajjad. Email: muhammad.sajjad@icp.edu.pk
Received: 28 February 2021; Accepted: 06 May 2021
Abstract: Digital surveillance systems are ubiquitous and continuously generate massive amounts of data, and manual monitoring is required in order to recognise human activities in public areas. Intelligent surveillance systems that can automatically identify normal and abnormal activities are highly desirable, as these would allow for efficient monitoring by selecting only those camera feeds in which abnormal activities are occurring. This paper proposes an energy-efficient camera prioritisation framework that intelligently adjusts the priority of cameras in a vast surveillance network using feedback from the activity recognition system. The proposed system addresses the limitations of existing manual monitoring surveillance systems using a three-step framework. In the first step, the salient frames are selected from the online video stream using a frame differencing method. A lightweight 3D convolutional neural network (3DCNN) architecture is applied to extract spatio-temporal features from the salient frames in the second step. Finally, the probabilities predicted by the 3DCNN network and the metadata of the cameras are processed using a linear threshold gate sigmoid mechanism to control the priority of the camera. The proposed system performs well compared to state-of-the-art violent activity recognition methods in terms of efficient camera prioritisation in large-scale surveillance networks. Comprehensive experiments and an evaluation of activity recognition and camera prioritisation showed that our approach achieved an accuracy of 98% with an F1-score of 0.97 on the Hockey Fight dataset, and an accuracy of 99% with an F1-score of 0.98 on the Violent Crowd dataset.
Keywords: Camera prioritisation; surveillance networks; convolutional
neural network; computer vision; deep learning; resource-constrained
device; violent activity recognition
This work is licensed under a Creative Commons Attribution 4.0 International License,
which permits unrestricted use, distribution, and reproduction in any medium, provided
the original work is properly cited.
1 Introduction
Following the development of computer vision and pattern recognition technology, video
surveillance systems have been widely deployed, and their functionality is improving rapidly with
the aim of preventing emergencies and crimes [1]. Current systems mainly rely on traditional
approaches that are time-consuming, laborious, and prone to misdetection of violent activities due
to the need for round-the-clock monitoring of large numbers of camera nodes. Video surveillance, and
particularly human activity recognition systems, have strong prevention capabilities due to their
improved visualisation, accuracy and real-time feedback, and these are now considered essential
aspects of a security system. Wireless visual sensor networks (WVSNs) have recently emerged
as a new type of sensor-based intelligent system, and their performance exceeds that of existing
surveillance sensor networks. Furthermore, due to their small size and the dense spatial coverage
that can be achieved, WVSNs can be flexibly deployed for various computer vision tasks, including
patient monitoring [2], distributed multimedia-based surveillance systems [3], military, police, air-
port [4], border [5], urban and transport [6] applications, and other areas of public interest. WVSN
surveillance systems have an extensive range of applications worldwide, but their implementation
has remained a challenge, due to the need for connectivity between multiple sensors and their
complicated setup. For instance, monitoring a large area requires a large number of WVSNs. To
accurately monitor an entire set of cameras in real time, WVSNs require massive amounts of
human resources, computational power and bandwidth. In addition, vision-based sensors require
a high bandwidth to transmit data between cameras and servers. Efficient monitoring of the target
area also requires extensive computation to detect events and anomalies. A similar study reported
the need for extensive computational resources due to the use of multiple sets of parameters in
the proposed model [7]. The literature in the area of violent activity recognition is mainly divided
into handcrafted and deep learning-based methods, and these are discussed below.
1.1 Handcrafted Feature-Based Approaches
The success of handcrafted feature-based approaches relies heavily on manually engineered
feature extraction techniques. Many researchers have utilised handcrafted features for the detection
of violent activities. For instance, Hassner et al. [8] introduced the violent flow (ViF) descriptor to extract flow vectors. They then used a support vector machine (SVM) to classify these features into violent and non-violent crowds. Similarly, Huang et al. [9] used an SVM to analyse the statistical properties of the optical flow for violent crowd recognition. Zhang et al. [10] used a Gaussian model of the optical flow for violent region extraction, and an orientation histogram of the optical flow with a linear SVM was used to classify video frames into violent and non-violent classes. Gao et al. [11] proposed an oriented violent flow descriptor (OViF) that utilised both the
magnitude of motion and orientation information for the recognition of violent activity. Chen
et al. [12] used spatiotemporal interest points, including space-time interest points (STIPs) [13]
and the motion scale-invariant feature transform (MoSIFT) [14], for violent activity recognition.
Similarly, Lloyd et al. [15] proposed a new descriptor called a grey level co-occurrence texture
measure to detect violence and abnormal activity in crowds. Fu et al. [16] analysed three attributes
of motion (region, magnitude, and acceleration) for violence detection from surveillance videos.
The histogram of optical flow magnitude and orientation (HOMO) [17] was also used for feature extraction.
A sliding window method was used in [18] with an improved Fisher vector (IFV) for violent
activity recognition.
1.2 Deep Learning-Based Approaches
Due to the wide range of variations in pose, viewpoint and scale in visual data, accurate
recognition of violent activities is complex and challenging. Researchers from the artificial intelligence and computer vision communities have recently contributed to this area by identifying
human activities using deep learning-based approaches. CNNs extract features in a hierarchical
way, where the initial layers learn local features and the higher layers learn global features from
the visual data [19]. Although recurrent neural networks (RNNs) and 3DCNNs are mostly used
for violent activity recognition [20], an RNN is not able to extract features directly; a CNN is
typically used for feature extraction purposes, and these features are then passed to an RNN
for classification. A 3DCNN is an end-to-end method that is very widely used to extract spatio-temporal features for violent activity recognition [21–28]. For instance, Tran et al. [21] used a 3DCNN architecture for activity recognition in which 3 × 3 × 3 filters were convolved with a sequence of 16 frames.
Similarly, Carreira et al. [22] modified 2D filters pretrained on ImageNet to create 3D versions for activity recognition. These modified filters achieved better accuracy than a filter that was
randomly initialised. Stroud et al. [26] introduced a distilled 3D network (D3D) for activity
recognition. Diba et al. [23] employed a temporal transition layer with DenseNet3D, in a scheme
that used transfer learning from a 2DNN model. Serrano et al. [29] used Hough forests with
2DCNN to detect violent activity, and their proposed approach obtained an accuracy of 94.6%
on the Hockey Fight dataset. However, 3DCNNs have high computational requirements, making
them unsuitable for use in standard surveillance systems due to resource constraints. In this paper,
we solve this issue by introducing a new lightweight 3DCNN model that is less computationally
intensive and can easily be deployed in common CCTV camera surveillance systems.
Within such large amounts of video data, very few scenes are important in terms of allowing a machine to understand human activity. For example, theft from a shopping mall happens very rarely. When a human performs any activity, there is some sort of bodily movement, such as movements of the arms and legs. In these situations, the detection of moving objects from a sequence of frames is a challenging and crucial task, and is the first step in any video analytics system such as video surveillance [30], target detection [31], human tracking [32], and robot control [33]. The selection of salient motion frames from WVSN nodes is a crucial aspect of video processing that helps us analyse only important clips, thereby effectively minimising the execution time and improving the accuracy of the violent activity recognition system. Several techniques for automatically detecting salient frames have been developed that can separate the moving objects (foreground) from the scene (background) in videos. It is difficult to accurately segment foreground objects due to variations in illumination, ghosting of the foreground aperture, and the presence of unconstrained environments. Over the years, approaches based on optical flow [34], background subtraction [35], and frame differencing [36] have been actively used to detect motion between two consecutive frames. Optical flow is used to detect the simple apparent motion of an object in two consecutive frames; however, this is computationally expensive and produces inaccurate results due to its susceptibility to noise, variations in illumination, and fake motion. Frame differencing is a straightforward and commonly used technique for identifying moving objects, but is susceptible to variations in illumination and camera jitter. Unlike optical flow, the frame differencing technique is computationally efficient and is particularly suitable for resource-constrained devices that need to detect moving objects in video frames.
Existing large-scale WVSNs consist of numerous wireless camera sensors and are used to
monitor suspicious human activities. However, these systems have several drawbacks, such as
inadequate recognition of salient information, streaming of all imaging data, high bandwidth and storage requirements, ineffective monitoring, and late responses to crime or abnormal activities. Other significant issues related to WVSN-based surveillance include scattered backgrounds, viewpoint variation, and changes in lighting. The task of camera prioritisation in large-scale
WVSNs also becomes more challenging when large numbers of nodes and continuous streaming
are used [37]. Researchers around the world are making efforts to tackle these challenges. For
instance, Mehmood et al. [38] proposed a saliency-aware data prioritisation framework that selects
semantically relevant information for each node and then transmits these data to a sink node for
further processing. The main limitation of their method is that they used handcrafted features to
extract salient information, an approach that gives limited performance with real-time data. To
rank salient information and remove redundancy, Thomas et al. [39] used a perceptual system
to detect road events. Another technique inspired by the idea of perceptual computation was
a low-computation video summarisation framework for resource-constrained devices such as the
Raspberry Pi [40].
The abovementioned approaches were developed in an attempt to solve several challenges
such as variations in pose, viewpoint and scale, complex crowd patterns in visual data, and
data prioritisation. However, there are still numerous challenges that need to be addressed. For
instance, the studies in [8–17] used handcrafted approaches that were not capable of learning
discriminative features from violence datasets. The authors of [20–28] proposed 3DCNN models
for violent activity recognition that performed rather well, but due to the large number of computations
involved, these were not suitable for deployment in real-world surveillance systems. Similarly,
the authors of [36,37] developed a surveillance system that prioritised video camera content.
A surveillance system usually consists of resource-constrained devices such as CCTV cameras, and
there is therefore an urgent need for a system that can accurately recognise violent activity in a
complex environment with lower computational requirements. In addition, there is a need for an
intelligent prioritisation technique that can select only those cameras that are transmitting violent
activity from among a set of normal feeds, to allow for smart monitoring of large WVSNs, reduce
the memory storage required, and utilise the resources of WVSNs more efciently.
In this paper, we propose an energy-efcient deep learning-based framework for camera
prioritisation in large-scale WVSNs. The key contributions of this work can be summarised as
follows.
i. A novel camera prioritisation framework is proposed for economical hardware devices
(e.g. the Raspberry Pi) based on violent activity recognition in large-scale WVSNs. Salient
motion is the key feature for activity recognition. A lightweight frame differencing tech-
nique is incorporated to extract frames containing salient motion, thus ensuring the
efficient utilisation of resources.
ii. Human activity consists of a sequence of motion patterns in consecutive frames, meaning
that both spatial and temporal features need to be learned for this task. A novel lightweight
3DCNN architecture over resource-constrained devices is proposed, which outperforms
other state-of-the-art techniques in terms of accuracy on benchmark datasets.
iii. A novel linear threshold gate sigmoid (LTGS) method is used to prioritise cameras based
on violent activity in large-scale WVSNs by exploiting both the metadata and the prob-
abilities predicted by the proposed 3DCNN model, thereby reducing the dependency on
human monitoring, the energy required, and the bandwidth consumption.
The remainder of this paper is organised as follows. Section 2 presents the proposed method-
ology. The experimental setup and evaluation are described in Section 3, and Section 4 concludes
the paper and suggests future work.
2 Proposed Method
This section introduces our novel camera prioritisation framework for efficient surveillance, in which an individual camera node from a large-scale WVSN is prioritised based on the detection of violent activity. The proposed framework is divided into three steps. In the first step, we perform motion detection, and select only motion-specific frames from the video streams captured by the WVSNs. In the second step, a sequence of 16 salient frames is forwarded to a trained 3DCNN model to extract the spatio-temporal features. The extracted features are then fed to a Softmax classifier, which generates categorical probabilities, in which the class with the higher probability is considered to be the predicted class. Finally, based on the predicted probabilities, our proposed framework prioritises cameras with a high probability of showing violent activity, using LTGS. The overall workflow of the proposed intelligent camera prioritisation framework is illustrated in
Fig. 1.
Figure 1: The proposed camera prioritisation framework for a large-scale surveillance network.
In the preprocessing step, the surveillance video streams are preprocessed using the background
subtraction method to extract only motion-specific frames. In the second step, a 3DCNN is utilised to extract spatio-temporal features from the sequence of frames to classify the underlying activity. Finally, based on the violent activity detected in the video stream, a specific camera is assigned a high priority.
2.1 Preprocessing Phase
The frame differencing technique plays a very important role in the efficient use of resources in large-scale WVSNs, as it can help in selecting only motion-specific frames, i.e., those containing moving cars or pedestrians. In our case, a resource-constrained Raspberry Pi device is used in the preprocessing stage to select only frames containing moving objects from the input video stream. The detection of salient objects is a difficult task due to the range of different viewpoints, occlusions, and cluttered backgrounds in the frames [17]. In WVSNs, multiple visual sensors are interconnected to enable the target area to be efficiently monitored. Processing the complete video stream from each camera is not necessary, and is an inefficient use of the available computational resources. There is therefore a need for a system that can select only motion-relevant frames. A lightweight frame differencing technique is applied here to identify salient motion frames, as frame differencing is the most precise and straightforward technique for detecting minor temporal changes. Each pair of consecutive frames f_i and f_{i+1} is smoothed and processed to remove noise; this removes high-frequency content such as noise and edges from the images. There are several different techniques for smoothing, such as averaging, median, bilateral filtering, and Gaussian blur, in which the pixels close to the centre of the filter are given more weight than those further away from the centre. We conducted experiments and concluded that due to the use of a Gaussian kernel, Gaussian blur is highly effective in removing noise from images. For the selection of salient frames, a pixel-wise absolute difference is calculated for each pair of consecutive frames, D_image = |f_i − f_{i+1}|. When the value of the absolute difference is higher than a pre-defined threshold T (in our case, T = 0.7), these video frames are selected as salient frames and considered for further processing. The complete flow of the motion detection process is described in Algorithm 1.
Algorithm 1: Selection of salient frames
Input: Video stream
1. Take two consecutive input video frames (f_i, f_{i+1}).
2. Apply Gaussian blur to f_i and f_{i+1} to remove noise.
3. Find the mean pixel-wise absolute difference over the N pixels of each pair of consecutive frames:
   $D_{image} = \frac{1}{N}\sum_{j=1}^{N}\left|f_i(j) - f_{i+1}(j)\right|$
4. For each frame pair (f_i, f_{i+1}):
       If D_image <= 0.7:
           f_i and f_{i+1} are non-salient frames.
       Else:
           f_i and f_{i+1} are salient frames.
       End If
   End For
Output: Salient frames
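A minimal Python sketch of this selection step is given below, assuming OpenCV and NumPy are available. The 5 × 5 Gaussian kernel, the grayscale conversion, and the decision to compare the mean absolute grey-level difference directly against the threshold are assumptions, since the paper specifies only Gaussian blurring, the pixel-wise absolute difference, and the default threshold T = 0.7.

import cv2
import numpy as np

T = 0.7  # pre-defined threshold from Section 2.1 (the scaling of D_image is an assumption)

def is_salient_pair(frame_a, frame_b, threshold=T):
    """Return True if a pair of consecutive frames contains salient motion."""
    # Convert to grayscale and smooth with a Gaussian kernel to suppress noise;
    # the 5x5 kernel size is an assumption (the paper only states that Gaussian
    # blur is applied before differencing).
    gray_a = cv2.GaussianBlur(cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY), (5, 5), 0)
    gray_b = cv2.GaussianBlur(cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY), (5, 5), 0)
    # Mean pixel-wise absolute difference D_image over the N pixels (Algorithm 1).
    d_image = np.mean(np.abs(gray_a.astype(np.float32) - gray_b.astype(np.float32)))
    return d_image > threshold

def select_salient_frames(video_path):
    """Read a video stream and keep only the frame pairs flagged as salient."""
    cap = cv2.VideoCapture(video_path)
    salient = []
    ok, prev = cap.read()
    while ok:
        ok, curr = cap.read()
        if not ok:
            break
        if is_salient_pair(prev, curr):
            salient.extend([prev, curr])
        prev = curr
    cap.release()
    return salient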
2.2 Violence Detection
2DCNN architectures are widely used to extract spatial information from still images for a variety of computer vision tasks, such as image classification and image retrieval. However, the analysis of human activity in videos is challenging compared to the classification of still images, as human activity/action is encoded across multiple frames that involve both spatial and temporal information. A variety of existing methods have used a 2DCNN to extract the spatial correlations from a video, even though a video also contains temporal information. For example, in [41,42], the authors used a 2DCNN to process multiple frames, so all the temporal features are collapsed. In a 3DCNN, by contrast, a 3D filter is convolved spatially and across multiple frames to extract both spatial and temporal information. After the 3D convolution operation, a feature map is obtained that captures the spatial and temporal information. The feature maps are extracted from multiple frames to capture the temporal information. In our case, the feature maps are a combination of
16 consecutive frames with spatial dimensions of 112 × 112. The value at location (x, y, z) of the qth extracted feature map in the pth layer, with bias $t_{pq}$, is given by Eq. (1):

$$N_{pq}^{xyz} = \tanh\Big(t_{pq} + \sum_{k}\sum_{a=0}^{A_p-1}\sum_{b=0}^{B_p-1}\sum_{c=0}^{C_p-1} w_{pqk}^{abc}\, N_{(p-1)k}^{(x+a)(y+b)(z+c)}\Big) \qquad (1)$$
Here, $C_p$ represents the size of the 3D filter used to learn the temporal dimension, and in $w_{pqk}^{abc}$, the value at position (a, b, c) represents the filter weight convolved on the kth feature map in the previous layer. The first version of the 3DCNN was introduced in 2014; it was implemented in a deep learning framework called Caffe [43], and achieved state-of-the-art performance in an activity recognition task. Inspired by the exceptional performance of this network, we designed a similar 3DCNN architecture with pruning of the higher layers for the task of violent activity recognition.
In our proposed 3DCNN architecture, we used eight 3D convolutional layers, two max-pooling layers, and one fully connected layer. In the final layer, a Softmax activation function is used as a binary classifier. To extract the deep spatial features from the input video stream, a 3 × 3 × 3 filter is used in the convolution layers with a stride of one. The input shape of the model is based on the batch size, depth, rows, columns and channels, and can be represented as 16 × 16 × 112 × 112 × 3. A leaky ReLU is used as the activation function to overcome the dying ReLU problem. In a CNN, the fully connected layer is extremely expensive in terms of computation, since each neuron is directly connected to each pixel of the input image. We therefore use only one fully connected layer, with 512 neurons.
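A minimal TensorFlow/Keras sketch of an architecture of this shape is given below. The number of filters in each convolutional layer, the placement and sizes of the pooling layers, and the global-average-pooling step before the dense layer are assumptions, since the paper specifies only the layer counts, the 3 × 3 × 3 kernels with stride one, the leaky ReLU activation, the single 512-neuron fully connected layer, and the Softmax output; the resulting parameter count will therefore differ from the figure reported later in the paper.

import tensorflow as tf
from tensorflow.keras import layers, models

def conv_block(filters):
    # One 3x3x3 convolution with stride one followed by a leaky ReLU (Section 2.2).
    return [layers.Conv3D(filters, kernel_size=3, strides=1, padding="same"),
            layers.LeakyReLU(0.1)]

# Eight 3D convolutional layers, pooling stages, one 512-neuron dense layer and a
# Softmax binary classifier; filter counts and pooling placement are assumptions.
model = models.Sequential(
    [layers.Input(shape=(16, 112, 112, 3))]          # depth, rows, cols, channels
    + conv_block(32) + conv_block(32)
    + [layers.MaxPooling3D(pool_size=(1, 2, 2))]     # pool only spatially at first
    + conv_block(64) + conv_block(64)
    + [layers.MaxPooling3D(pool_size=(2, 2, 2))]
    + conv_block(128) + conv_block(128)
    + [layers.MaxPooling3D(pool_size=(2, 2, 2))]
    + conv_block(256) + conv_block(256)
    + [layers.GlobalAveragePooling3D(),
       layers.Dense(512), layers.LeakyReLU(0.1),
       layers.Dense(2, activation="softmax")]
)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()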
2.3 Camera Selection
In large-scale WVSNs, the streaming and monitoring of gigantic amounts of data is impracti-
cal due to limited human resources. Each surveillance system consists of large numbers of sensors
that require a high bandwidth to transmit their raw video streams, and these video streams involve
costly computation when detecting events and anomalies. Because of its continuous nature, a video stream poses critical problems for a human analyst in terms of identifying the important portions, so it is vital to apply visual analytics at the point of data collection. To overcome the drawbacks of a traditional multi-camera WSN, we propose a salient motion detection scheme based on a prioritisation framework for visual data, in which all nodes autonomously prioritise certain visual content and the cameras showing it. The main goal of the proposed surveillance system is to focus solely on anomaly-specific cameras, not only to avoid unimportant streams but also to provide an efficient surveillance system. An overview of the proposed scheme is shown in Fig. 2. The first step involves understanding the scene by classifying the input video stream into salient and non-salient classes; the output is a predicted probability score for each frame of the
streams captured by different camera nodes. These scores are then forwarded to the second step
together with the weight values of the corresponding cameras.
Figure 2: The proposed camera prioritisation scheme with a symbolic representation of LTGS
In the second step, the sensitivity of each camera is computed based on the input frame
saliency by the LTGS module (using a sigmoid activation function), which processes two inputs:
the probability of the frame under observation and the metadata (weight) value of the correspond-
ing camera. It then generates a single value that determines the priority of the camera. Metadata
are data about the camera, such as its importance and location. This is crucial, since in some
locations such as banks and border control stations, we want to create a high level of surveillance
by selecting camera nodes with high priority. In our experiments, we used metadata values from
one to six, where a score of six means that the location of the camera is very important and one
means the location is normal.
$$\phi = \sigma(w_i x_i) \qquad (2)$$

where i is the index of the camera, $w_i$ represents the metadata weight of the specific camera, $x_i$ represents the predicted probability for that camera, $\sigma$ is the sigmoid function, and $\phi$ indicates the priority of the camera.

$$C_F = \begin{cases} 1 & \text{if } \phi \geq T \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

where C indicates the specific camera node, F represents the priority flag, and T is a threshold with a default value of 0.7. If the value of $\phi$ exceeds the threshold T, then the camera priority flag F is set, and this particular camera is prioritised over the others.
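A minimal sketch of the LTGS step in Eqs. (2) and (3) is shown below; the camera identifiers, probabilities, and metadata weights in the example are hypothetical and serve only to illustrate how a high-importance location combined with a high predicted violence probability sets the priority flag.

import math

T = 0.7  # default priority threshold from Eq. (3)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def camera_priority_flag(violence_probability, metadata_weight, threshold=T):
    """Linear threshold gate sigmoid (LTGS): phi = sigmoid(w_i * x_i); the camera
    priority flag F is set to 1 when phi meets or exceeds the threshold T."""
    phi = sigmoid(metadata_weight * violence_probability)
    return (1 if phi >= threshold else 0), phi

# Hypothetical example: a camera at an important location (weight 6) with a high
# predicted violence probability is prioritised; a low-weight camera with a
# normal feed is not.
cameras = {"cam_01": (0.92, 6), "cam_02": (0.15, 1)}
for cam_id, (prob, weight) in cameras.items():
    flag, phi = camera_priority_flag(prob, weight)
    print(f"{cam_id}: phi={phi:.3f}, priority_flag={flag}")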
3 Experimental Results
In this section, we describe the details of our experiments and present a comparison of the
proposed framework with other state-of-the-art techniques. To evaluate the performance of the
proposed method, we conducted various experiments on the publicly available Hockey Fight [44]
and Violent Crowd [8] violence detection datasets. We examined our network with different learn-
ing rates, numbers of convolutional layers, activation functions, and other temporal parameters
such as variations in the sequence size. We also compared our proposed model with different
handcrafted and deep learning methods. Finally, the proposed system was quantitatively evaluated
for camera prioritisation tasks, with and without a weighted strategy.
3.1 Experimental Setting
All the experiments were conducted on a Core i5 CPU equipped with an NVIDIA GeForce GTX 1070 (8 GB) and 24 GB onboard memory, running the Windows 10 operating system and using the TensorFlow deep learning framework. For video recording and preprocessing, a Raspberry Pi model 3 was used with a 64 GB Micro SD card, 1 GB RAM (LPDDR2, 900 MHz), a 1.2 GHz CPU (4× ARM Cortex-A53), and networking (10/100 Ethernet, 2.4 GHz 802.11n wireless). The camera aperture (F) was 2.0, the focal length was 3.04 mm, the angle of view (diagonal) was 62.2°, and the image resolution was 3280 × 2464 (with support for 1080p30, 720p60 and 640 × 480p90 video recording), running the Raspbian operating system.
3.2 Datasets
This section presents a detailed overview of the datasets used in our experiments. We used
two existing benchmark datasets, Hockey Fight and Violent Crowd.
3.2.1 Hockey Fight Dataset
This dataset was introduced by Nievas et al. [44] and contains a total of 1000 videos, 500 of which show violence (fights), while the remaining 500 show non-violent (normal) activities. In the violence class, all clips were collected from fights during a hockey game. The entire set of video clips was taken from a National Hockey League (NHL) game, and each video clip consists of 50 video frames with a resolution of 360 × 288 × 3. Examples are shown in Fig. 3, and the details are given in Tab. 1. After training, we evaluated the performance of our proposed architecture on the test data. The confusion matrix for this dataset is shown in Fig. 4a.
3.2.2 Violent Crowd Dataset
The Violent Crowd dataset contains 246 videos taken from YouTube, and was introduced
by Hassner et al. [8]. Originally, this dataset consisted of five different categories of violent and non-violent activities. In our experiments, we merged these categories into two classes containing violent and non-violent activity. In each category, there are 123 videos, and each clip has 50 to 150 frames with dimensions of 320 × 240 × 3. An example frame from the Violent Crowd dataset is
shown in Fig. 3, and details of the dataset are given in Tab. 1. The experimental results obtained
from the Violent Crowd dataset are shown in the form of a confusion matrix in Fig. 4b. Detailed
quantitative results can be found in Tabs. 2 and 4.
3.3 Evaluation Metrics
There are several methods for evaluating the performance of a classification model, but the most common metrics are precision, recall, and accuracy. These are represented mathematically in Eqs. (4)–(6).

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (4)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (5)$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (6)$$
where true positive (TP) is the number of violent activities correctly identified; false positive (FP) is the number of non-violent activities incorrectly predicted as violent; true negative (TN) is the number of non-violent activities correctly identified; and false negative (FN) is the number of violent activities that are incorrectly predicted as non-violent.
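For completeness, a small sketch that computes these metrics (together with the F1-score reported in the abstract) from confusion-matrix counts is shown below; the counts in the example are illustrative only and are not taken from the confusion matrices in Fig. 4.

def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall, accuracy and F1-score from confusion-matrix
    counts, following Eqs. (4)-(6)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

# Illustrative counts only (not the paper's results).
precision, recall, accuracy, f1 = classification_metrics(tp=95, fp=3, tn=97, fn=5)
print(f"precision={precision:.3f} recall={recall:.3f} "
      f"accuracy={accuracy:.3f} f1={f1:.3f}")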
Figure 3: Example frames from the Hockey Fight and Violent Crowd datasets. Images shown in the first section are scenes from the Hockey Fight dataset, while those in the second section are scenes from the Violent Crowd dataset
Table 1: Description and statistics for the datasets

Dataset             Samples   Resolution   Violent scenes               Non-violent scenes
                                           No. of clips   Frame rate    No. of clips   Frame rate
Hockey Fight [44]   1000      360 × 288    500            25            500            25
Violent Crowd [8]   246       320 × 240    123            25            123            25
Figure 4: Confusion matrices for (a) the Hockey Fight dataset, and (b) the Violent Crowd dataset
Table 2: Comparison of the proposed method with other state-of-the-art methods on the Violent Crowd and Hockey Fight datasets

Methods                              Accuracy (%) on Violent Crowd   Accuracy (%) on Hockey Fight
ViF, OViF, AdaBoost and SVM [11]     88                              87.50
Hough forests and 2D CNN [29]        –                               94.6
ViF [8]                              81.3                            82.90
Improved Fisher vectors [18]         96.4                            93.7
3DCNN [7]                            98                              96
Proposed method                      99.89                           98.80
3.4 Results and Discussion
In a CNN, the most challenging tasks involve finding the optimal kernel size, the optimal number of filters per layer, and the optimal formation of the layers. These hyperparameters are highly correlated with the input data. The hidden layers in a CNN architecture act as a black box, meaning that it is very difficult to identify the correct number of filters and formation of layers directly, and we therefore used a heuristic approach to develop an efficient model. We performed multiple experiments with different numbers of convolutional layers, hyperparameter settings, activation functions, learning rates and temporal information. To efficiently learn the patterns of violent and non-violent activity, we need to input massive amounts of labelled data to train the deep architecture. Each violent activity dataset contains a number N of short clips with different durations, and each video in the dataset belongs to one of two categories: violent
or non-violent. During training, each individual input video clip is fed to the 3DCNN as a set
of batches of length 16 frames, to allow the model to learn the temporal features of the input
video.
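As an illustration of this input pipeline, the sketch below splits one video into 16-frame clips resized to 112 × 112, ready to be fed to the 3DCNN; the non-overlapping slicing, the use of OpenCV, and the [0, 1] normalisation are assumptions, since the paper specifies only the 16-frame clip length and the 112 × 112 spatial resolution.

import cv2
import numpy as np

def video_to_clips(video_path, clip_len=16, size=(112, 112)):
    """Split a video into non-overlapping clips of 16 frames resized to 112x112,
    shaped (num_clips, 16, 112, 112, 3) for the 3DCNN. Non-overlapping slicing and
    the [0, 1] scaling are assumptions; the paper only states 16-frame inputs."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(cv2.resize(frame, size))
        ok, frame = cap.read()
    cap.release()

    clips = [frames[i:i + clip_len]
             for i in range(0, len(frames) - clip_len + 1, clip_len)]
    return np.asarray(clips, dtype=np.float32) / 255.0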
3.4.1 Analyses of Different Learning Rates
When training a neural network, the most important hyperparameter is the learning rate,
and the choice of the optimal learning rate greatly affects the generalisation of the deep learning
model. There are two key assumptions that should be taken into consideration when selecting
the learning rate during training. Firstly, the learning rate should not be too high, as this will
cause overshoot while finding the minimal points and the model will allocate large updates to the
weights, causing divergent behaviour. At the beginning of our experiments, we therefore used a
low learning rate of 0.00001, as shown in Fig. 6a. However, from our results, we found that a low
learning rate did not always perform very well. We therefore performed further experiments based
on the second key assumption, in which the learning rate should not be too low since numerous
updates will be required before the optimal point is reached. We finally applied a learning rate
of 0.001 with the Adam optimiser, and achieved state-of-the-art results. The highest values of
accuracy were 98% for the Hockey Fight dataset and 99% for the Violent Crowd dataset, as shown
in Fig. 6b. It can be seen that the highest accuracy of 98% was obtained with a learning rate of
0.001 over 500 epochs. After conducting numerous experiments, we observed that changing the
learning rate affected the loss, accuracy, and number of iterations. Fig. 6a shows that the changes
in loss and accuracy are strongly related to the variation in the learning rate. For instance, for the
first 100 epochs, the training loss, validation loss, training accuracy and validation accuracy were
0.299, 0.218, 88.3% and 94.5%, respectively. Over time, as the number of iterations increased, the
model loss decreased while the accuracy increased. Finally, at 500 epochs, the loss was reduced
to 0.02 and the accuracy reached 99.44% when the learning rate was set to 0.00001 and the
experimental setup, training loss, validation loss, training accuracy and validation were kept at the
same values as before. The accuracy for the first hundred epochs was 94%, while at 500 epochs,
the obtained training loss, validation loss, training accuracy, and validation accuracy were 0.01,
0.15, 99%, and 98%, as shown in Fig. 6b.
We also trained our model with the same setup on the Violent Crowd dataset, and the results
of the training and validation process are shown in Fig. 6a. The pink line indicates the experiment
in which we achieved the highest accuracy of 99% and a loss of 0.22 over 500 epochs.
3.4.2 Computational Complexity Analysis
A CNN architecture consists of three different types of layers: convolutional, max-pooling and fully connected layers. The convolutional layers share a filter of fixed size, which is used to
extract spatial features. The max-pooling layer does not learn anything during training. However,
the fully connected layer is extremely expensive in terms of computation, as it involves one-to-one
connectivity for each pixel in an image. When a depth channel or a temporal dimension is added,
a 3DCNN becomes computationally more complex than a classical 2DCNN architecture. For
instance, in [7], the authors presented a 3DCNN architecture for recognition of violent activity in
a surveillance system. At the feature extraction stage, eight convolutional layers were used, feature
reduction was achieved through five max-pooling layers, and the classification stage used two fully
connected layers with 4096 neurons in each. To limit the computational complexity of our model,
we conducted multiple experiments and finally developed a lightweight 3DCNN architecture con-
sisting of eight convolutional layers, three max-pooling layers, and one fully connected layer with
512 neurons rather than 4,096. The proposed network involved 14,352,811 trainable parameters,
whereas the network in [7] involved 22,228,802, a reduction of 7,875,991. The final result was a computationally inexpensive model that was capable of achieving high accuracy compared to state-of-the-art models, as shown in Tab. 2. We reduced the overall complexity of our proposed
network by using only one fully connected layer.
3.4.3 Ablation Study of Different Sequences
Activity recognition with a 2DCNN involves extracting only the spatial features from video
frames, which is insufficient for activity recognition. Khan et al. [45] proposed a CNN architecture
called MobileNet for violence detection from movies that only extracted spatial features from
videos, whereas a 3DCNN extracts both spatial and temporal features from videos. Temporal or
sequence information plays a critical role in activity recognition. In view of this, we conducted
several experiments to analyse temporal information, as shown in Fig. 5. When the number of
sequences was increased, the model efficiently learned the data, but there was a trade-off between the number of sequences and the computation time. To find the optimal length of a sequence,
we tested our method with different sequence lengths ranging from six to 16 consecutive frames.
The experimental results are shown in Fig. 5, and it can be observed that the lowest accuracy
was achieved for a sequence of six frames. As the length of the sequence was increased, the
accuracy improved. The highest values of accuracy for both the Hockey Fight (98%) and Violent
Crowd datasets (99%) were achieved for a sequence of 16 frames. We evaluated our proposed
system using the metrics of precision, recall, and accuracy, and the results are listed in Tab. 2.
For the Hockey Fight dataset, the highest accuracy was 98% with an F1-score of 0.97, while for
the Violent Crowd dataset, the highest accuracy was 99% with an F1-score of 0.98.
Figure 5: Temporal analysis of the proposed 3DCNN architecture in terms of accuracy on the Hockey Fight and Violent Crowd datasets
Figure 6: Impact of different learning rates on the Hockey Fight and Violent Crowd datasets: (a) the optimal learning rate of 0.001 with the Adam optimiser, and (b) a smaller learning rate of 0.00001 with the Adam optimiser
3.5 Comparative Analysis
In this section, we compare the results obtained from our proposed 3DCNN architecture with
those of state-of-the-art methods. A comparative analysis is shown in Tab. 2. The first row shows the results obtained by the method proposed in [11], which used the violent flows descriptor to estimate the magnitude of the violence and the AdaBoost algorithm for feature extraction. Finally, an SVM was deployed as a violent activity classifier.
Their method achieved an accuracy of 87.50% on the Hockey Fight dataset and 88% for
the Violent Crowd dataset. Recently, Serrano et al. [29] used Hough forests with a 2DCNN to
detect violent activity, and this approach gave an accuracy of 94.6% on the Hockey Fight dataset.
Similarly, Hassner et al. [8] experimented on the same datasets with a scheme that used ViF
for feature extraction and SVM for classification, and obtained accuracies of 82.90% and 81.3%,
respectively. A sliding window method was used with an improved Fisher vector in [18], and this
approach achieved accuracy values of 93.7% and 96.4% on the Hockey Fight and Violent Crowd datasets, respectively. In [7], the authors used a
3DCNN and obtained accuracy scores of 96% and 98% on the Hockey Fight and Violent Crowd
datasets, respectively. The last row of Tab. 2 shows the performance of the proposed model.
In general, the computational complexity of CNN architectures is directly related to the fully
connected layer. For instance, the method presented in [7] used eight convolutional layers and two
fully connected layers with 4096 neurons per layer. In contrast, our proposed 3DCNN architecture
consists of eight convolution layers and only one fully connected layer with 512 neurons. Our
model therefore performed favourably against the scheme [7] in terms of the accuracy and network
complexity.
3.6 Quantitative Evaluation of Our Proposed Camera Prioritisation Method with and without a
Weighting Strategy
For camera prioritisation, there is no standard evaluation metric with which to assess performance.
However, to achieve the best performance, the camera prioritisation evaluation metrics should
maximise the impact of the overall system accuracy and consider the vulnerable position of the
camera node. This is very important, since some locations such as banks and border control
stations require a high level of surveillance, based on selecting camera nodes with a high priority.
Video streams from camera nodes are fed to a trained 3DCNN model to classify and calculate
the probability of violent or non-violent activity. If violent activity is detected, the probability
calculated by the trained 3DCNN model and the metadata about the camera are passed to the
LTGS module, as shown in Eqs. (2) and (3). The weights or metadata reflect the importance of
the location of the node: a low weight means that the position is not important, while a higher
weight indicates that the position of the node is more important. A mathematical representation
of our camera prioritisation scheme is given in Eqs. (2) and (3). Tab. 3 presents the results of
our experiments without a weighting strategy, in which the weight is set to a constant value of
one. In Tab. 3, the column marked ‘Ground truth’ represents the number of cameras showing
violent activity, which need to be prioritised. In contrast, the column marked ‘Cameras prioritised
by the model’ shows the number of cameras selected for prioritisation by our proposed system.
In the rst experiment, we used 10 cameras over which violent activity was randomly distributed
to check the robustness of our proposed model on a small network of WVSNs. In the rst test,
cameras 1 to 8 captured violent scenes, while cameras 9 and 10 recorded normal events. When we
passed these streams to the trained 3DCNN model, it prioritised seven of the eight cameras that
recorded these activities. To calculate the overall accuracy, we divided the total correct number
of prioritised cameras by the total number of cameras that should have been prioritised. The
proposed model achieved an average accuracy of 86.1%.
Table 3: Experiments on camera prioritisation with a constant weight value (w = 1)

Experiment     Test      Number of cameras   Ground truth   Cameras prioritised by the model (ϕ)   Accuracy (%)
Experiment 1   Test 1    10                  8              7                                      87.5
               Test 2    10                  7              6                                      85.7
               Test 3    10                  9              7                                      77.7
               Test 4    10                  6              6                                      100
               Test 5    10                  5              4                                      80
               Average   10                  7              6                                      86.1
Experiment 2   Test 1    15                  13             11                                     84.6
               Test 2    15                  11             10                                     90.9
               Test 3    15                  9              9                                      100
               Test 4    15                  14             14                                     100
               Test 5    15                  8              6                                      75
               Average   15                  11             10                                     90.1
Experiment 3   Test 1    20                  14             14                                     100
               Test 2    20                  15             14                                     93.3
               Test 3    20                  16             15                                     93.7
               Test 4    20                  13             13                                     100
               Test 5    20                  17             14                                     82.3
               Average   20                  15             14                                     93.8
In the second experiment, the number of cameras was increased from 10 to 15, and the
total number of cameras that needed to be prioritised was 11. For each individual test in this
set, the number of cameras reporting violent activity is listed under the ‘Ground truth’ column.
In this experiment, our proposed system prioritised 10 cameras, giving an average accuracy of
90.1%. Similarly, in experiment 3, we evaluated 20 cameras and achieved an accuracy of 93.8%.
In test 1 of this experiment, our model prioritised all 14 cameras and we achieved an accuracy of 100%. The details of tests 2–5 are shown in Tab. 3. The results of the experiments with a weighting strategy are presented in Tab. 4.
3.7 Result Derivatives
Existing systems use computationally intensive methods for the recognition of abnormal activity, as discussed in Sections 1.1 and 1.2, and cannot be deployed in real-time surveillance
systems. Techniques in the literature on camera prioritisation are mostly based on handcrafted
features, and no statistical analyses of their results are provided. Furthermore, no discussion
of the validity of camera prioritisation is available for large-scale WVSNs. In this paper, we
have proposed a lightweight 3DCNN model for abnormal activity recognition using resource-
constrained devices, which outperformed existing state-of-the-art methods in terms of accuracy and
F1-score. Furthermore, we have proposed an intelligent algorithm called LTGS that prioritises
cameras showing abnormal activity above those showing a normal stream. The results of multiple
comprehensive experiments and evaluations show that the proposed system performs well in terms
of both violent activity recognition and efficient camera prioritisation in large-scale surveillance
networks.
Table 4: Experiments on camera prioritisation with a weighting strategy

Experiment     Test      Number of cameras   Ground truth   Cameras prioritised by the model   Weight values (w)   Total net accuracy (ϕ)   Accuracy (%)
Experiment 1   Test 1    10                  8              7                                  4                   3.5                      97
               Test 2    10                  7              6                                  5                   4.625                    99
               Test 3    10                  9              7                                  3                   2.571                    92.8
               Test 4    10                  6              6                                  1                   1                        73.1
               Test 5    10                  5              4                                  2                   1.6                      83.2
               Average   10                  7              6                                  3                   2.659                    89.02
Experiment 2   Test 1    15                  13             11                                 2                   1.692                    84.4
               Test 2    15                  11             10                                 3                   2.727                    93.8
               Test 3    15                  9              9                                  1                   1                        73.1
               Test 4    15                  14             14                                 2                   2                        88
               Test 5    15                  8              6                                  2                   1.5                      81.7
               Average   15                  11             10                                 2                   1.7838                   84.2
Experiment 3   Test 1    20                  14             14                                 1                   1                        73.1
               Test 2    20                  15             14                                 4                   3.732                    97.6
               Test 3    20                  16             15                                 6                   5.622                    99.6
               Test 4    20                  13             13                                 2                   2                        88
               Test 5    20                  17             14                                 2                   1.646                    83.8
               Average   20                  15             14                                 3                   2.8                      88.42
4 Concluding Remarks and Directions for Future Work
In this study, we have investigated the strength and capabilities of a 3DCNN for intelli-
gent camera prioritisation in large-scale WVSNs, based on violent activity recognition. WVSNs
typically consist of a large number of visual nodes, each of which continuously generates a
massive amount of video data. The efficient monitoring and streaming of such huge amounts of
data is very challenging, due to the limited availability of computational resources. To overcome
the drawbacks of traditional surveillance systems, we have proposed a novel intelligent camera
prioritisation framework for large-scale WVSNs. A 3DCNN is used for violent activity recognition
that is capable of learning not only spatial but also temporal information. The proposed model
can prioritise cameras with an accuracy of 98%–99%. In future work, we aim to further reduce
the cost of the proposed framework by customising the Raspberry Pi, for example by removing
unwanted hardware like the mouse, keyboard and GPIO pins. Removing unnecessary functionalities from this device can significantly reduce the overall computational complexity. For activity
recognition, the features extracted by our proposed architecture could be fed to a recurrent neural
network such as an LSTM to classify the input video stream more efficiently.
Acknowledgement: This work was supported by the faculty research fund of Sejong University in
2020 and also supported by Institute of Information & communications Technology Planning &
Evaluation (IITP) grant funded by the Korea government (MSIT) (2019-0-00136, Development of
AI-Convergence Technologies for Smart City Industry Productivity Innovation).
Funding Statement: Institute of Information & communications Technology Planning & Eval-
uation (IITP) grant funded by the Korea government (MSIT) (2019-0-00136, Development of
AI-Convergence Technologies for Smart City Industry Productivity Innovation).
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
References
[1] W. Ullah, A. Ullah, I. U. Haq, K. Muhammad and S. W. Baik, “CNN features with bi-directional
LSTM for real-time anomaly detection in surveillance networks,” Multimedia Tools and Applications,
vol. 1, pp. 1–17, 2020.
[2] K. Karboub, M. Tabaa, S. Dellagi, A. Dandache and F. Moutaouakkil, “Intelligent patient monitor-
ing for arrhythmia and congestive failure patients using internet of things and convolutional neural
network,” in 31st Int. Conf. on Microelectronics, Cairo, Egypt, pp. 292–295, 2019.
[3] Y.-H. Tsai, J.-K. Hsu, Y.-E. Wu and W.-F. Huang, “Distributed multimedia content processing in
ONVIF surveillance system,” in 2011 Int. Conf. on Future Computer Sciences and Application, Hong Kong,
China, pp. 70–73, 2011.
[4] C. A. T. Stelios, D. Stelios and D. Antonios, “Automated real-time risk assessment for airport pas-
sengers using a deep learning architecture,” in Signal Processing, Sensor/Information Fusion and Target
Recognition XXVIII, Maryland, United States, pp. 110180, 2019.
[5] D. Arjun, P. K. Indukala and K. A. U. Menon, “PANCHENDRIYA: A Multi-sensing framework
through wireless sensor networks for advanced border surveillance and human intruder detection,” in
2019 Int. Conf. on Communication and Electronics Systems, Coimbatore, India, pp. 295–298, 2019.
[6] S. Telang, A. Chel, A. Nemade and G. Kaushik, “Intelligent Transport System for a Smart City,” in
Security and Privacy Applications for Smart City Development. Cham: Springer, pp. 171–187, 2021.
[7] F. U. M. Ullah, A. Ullah, K. Muhammad, I. U. Haq and S. W. Baik, “Violence detection using
spatiotemporal features with 3D convolutional neural network,” Sensors, vol. 19, no. 11, pp. 2472,
2019.
[8] T. Hassner, Y. Itcher and O. Kliper-Gross, “Violent flows: Real-time detection of violent crowd
behavior,” in 2012 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition Workshops,
Providence, RI, USA, pp. 1–6, 2012.
[9] J.-F. Huang and S.-L. Chen, “Detection of violent crowd behavior based on statistical characteristics of
the optical flow,” in 11th Int. Conf. on Fuzzy Systems and Knowledge Discovery, Xiamen, China, pp. 565–
569, 2014.
[10] T. Zhang, Z. Yang, W. Jia, B. Yang, J. Yang et al., “A new method for violence detection in surveillance
scenes,” Multimedia Tools and Applications, vol. 75, no. 12, pp. 7327–7349, 2016.
[11] Y. Gao, H. Liu, X. Sun, C. Wang and Y. Liu, “Violence detection using oriented violent flows,” Image
and Vision Computing, vol. 48–49, no. 6, pp. 37–41, 2016.
[12] D. Chen, H. Wactlar, M.-Y. Chen, C. Gao, A. Bharucha et al., “Recognition of aggressive human
behavior using binary local motion descriptors,” in 30th Annual Int. Conf. of the IEEE Engineering in
Medicine and Biology Society, Vancouver, BC, Canada, pp. 5238–5241, 2008.
[13] F. D. De Souza, G. C. Chavez, E. A. do Valle Jr and A. D. A. Araújo, “Violence detection in video
using spatio-temporal features,” in 23rd SIBGRAPI Conf. on Graphics, Patterns and Images,Gramado,
Brazil, pp. 224–230, 2010.
[14] L. Xu, C. Gong, J. Yang, Q. Wu and L. Yao, “Violent video detection based on MoSIFT feature and
sparse coding,” in 2014 IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Florence, Italy, pp.
3538–3542, 2014.
[15] K. Lloyd, P. L. Rosin, D. Marshall and S. C. Moore, “Detecting violent and abnormal crowd activity
using temporal analysis of grey level co-occurrence matrix (GLCM)-based texture measures,” Machine
Vision and Applications, vol. 28, no. 3–4, pp. 361–371, 2017.
[16] E. Y. Fu, H. V. Leong, G. Ngai and S. C. Chan, “Automatic fight detection in surveillance videos,”
International Journal of Pervasive Computing and Communications, vol. 3, pp. 1–11, 2017.
[17] J. Mahmoodi and A. Salajeghe, “A classification method based on optical flow for violence detection,”
Expert Systems with Applications, vol. 127, no. 1, pp. 121–127, 2019.
[18] P. Bilinski and F. Bremond, “Human violence recognition and detection in surveillance videos,” in 2016
13th IEEE Int. Conf. on Advanced Video and Signal Based Surveillance, Colorado Springs, CO, USA, pp.
30–36, 2016.
[19] A. Deshmukh, K. Warang, Y. Pente and N. Marathe, “Violence detection through surveillance system,”
in ICT Systems and Sustainability. Berlin, Germany: Springer, pp. 503–511, 2021.
[20] R. Maqsood, U. I. Bajwa, G. Saleem, R. H. Raza et al., “Anomaly recognition from surveillance videos using 3D convolutional neural networks,” Multimedia Tools and Applications, 2021. https://doi.org/10.1007/s11042-021-10570-3.
[21] D. Tran, L. Bourdev, R. Fergus, L. Torresani and M. Paluri, “Learning spatiotemporal features with
3d convolutional networks,” in Proc. of the IEEE Int. Conf. on Computer Vision, Montreal, Canada,
pp. 4489–4497, 2015.
[22] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in
Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Nashville, United States, pp. 6299–
6308, 2017.
[23] A. Diba, M. Fayyaz, V. Sharma, A. Hossein Karami, M. Mahdi Arzani et al., “Temporal 3d convnets
using temporal transition layer,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition
Workshops, Nashville, United States, pp. 1117–1121, 2018.
[24] Z. Qiu, T. Yao and T. Mei, “Learning spatio-temporal representation with pseudo-3d residual net-
works,” in Proc. of the IEEE Int. Conf. on Computer Vision, Montreal, Canada, pp. 5533–5541,
2017.
[25] D. Tran, J. Ray, Z. Shou, S.-F. Chang and M. Paluri, “Convnet architecture search for spatiotemporal
feature learning,” ArXiv, vol. abs/1708.05038, 2017. https://arxiv.org/abs/1708.05038.
[26] J. Stroud, D. Ross, C. Sun, J. Deng and R. Sukthankar, “D3d: Distilled 3d networks for video action
recognition,” in Proc. of the IEEE/CVF Winter Conf. on Applications of Computer Vision, Waikoloa,
United States, pp. 625–634, 2020.
[27] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun et al., “A closer look at spatiotemporal convolutions
for action recognition,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Nashville,
United States, pp. 6450–6459, 2018.
[28] S. Xie, C. Sun, J. Huang, Z. Tu and K. Murphy, “Rethinking spatiotemporal feature learning: Speed-
accuracy trade-offs in video classification,” in Proc. of the European Conf. on Computer Vision, Montreal,
Canada, pp. 305–321, 2018.
[29] I. Serrano, O. Deniz, J. L. Espinosa-Aranda and G. Bueno, “Fight recognition in video using hough
forests and 2D convolutional neural network,” IEEE Transactions on Image Processing, vol. 27, no. 10,
pp. 4787–4797, 2018.
[30] A. Ullah, K. Muhammad, K. Haydarov, I. U. Haq, M. Lee et al., “One-shot learning for surveillance
anomaly recognition using siamese 3D CNN,” in 2020 Int. Joint Conf. on Neural Networks,Glasgow,
UK, pp. 1–8, 2020.
[31] S. Maresca, G. Serafino, F. Scotti, F. Amato, L. Lembo et al., “Photonics for coherent MIMO radar: An experimental multi-target surveillance scenario,” in 2019 20th Int. Radar Symp., Ulm, Germany, pp.
1–6, 2019.
[32] N. Kumar and N. Sukavanam, “A cascaded CNN model for multiple human tracking and re-
localization in complex video sequences with large displacement,” Multimedia Tools and Applications,
vol. 79, no. 9, pp. 6109–6134, 2020.
[33] C. Sampedro, A. Rodriguez-Ramos, H. Bavle, A. Carrio, P. de la Puente et al., “A fully-autonomous
aerial robot for search and rescue applications in indoor environments using learning-based techniques,”
Journal of Intelligent & Robotic Systems, vol. 95, no. 2, pp. 601–627, 2019.
[34] S. S. Sengar and S. Mukhopadhyay, “Moving object area detection using normalized self adaptive
optical flow,” Optik, vol. 127, no. 16, pp. 6258–6267, 2016.
[35] J. Dou, Q. Qin and Z. Tu, “Background subtraction based on circulant matrix,” Signal, Image and Video
Processing, vol. 11, no. 3, pp. 407–414, 2017.
[36] M. Fei, J. Li and H. Liu, “Visual tracking based on improved foreground detection and perceptual
hashing,” Neurocomputing, vol. 152, no. 8, pp. 413–428, 2015.
[37] A. Zam, M. R. Khayyambashi and A. Bohlooli, “Energy-aware strategy for collaborative target-
detection in wireless multimedia sensor network,” Multimedia Tools and Applications, vol. 78, no. 13,
pp. 18921–18941, 2019.
[38] I. Mehmood, M. Sajjad, W. Ejaz and S. W. Baik, “Saliency-directed prioritization of visual data in
wireless surveillance networks,” Information Fusion, vol. 24, no. 3, pp. 16–30, 2015.
[39] S. S. Thomas, S. Gupta and V. K. Subramanian, “Event detection on roads using perceptual video
summarization,” IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 9, pp. 2944–2954,
2018.
[40] K. Muhammad, T. Hussain and S. W. Baik, “Efficient CNN based summarization of surveillance
videos for resource-constrained devices,” Pattern Recognition Letters, vol. 130, pp. 370–375, 2020.
[41] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy et al., “Recent advances in convolutional neural
networks,” Pattern Recognition, vol. 77, no. 11, pp. 354–377, 2018.
[42] A. Piergiovanni, A. Angelova and M. S. Ryoo, “Evolving losses for unsupervised video representation
learning,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, Shanghai, China,
pp. 133–142, 2020.
[43] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long et al., “Caffe: Convolutional architecture for
fast feature embedding,” in Proc. of the 22nd ACM Int. Conf. on Multimedia, Florida, USA, pp. 675–678,
2014.
[44] E. B. Nievas, O. D. Suarez, G. B. García and R. Sukthankar, “Violence detection in video using
computer vision techniques,” in Int. Conf. on Computer Analysis of Images and Patterns,NY,USA,
pp. 332–339, 2011.
[45] S. U. Khan, I. U. Haq, S. Rho, S. W. Baik and M. Y. Lee, “Cover the violence: A novel deep-learning-
based approach towards violence-detection in movies,” Applied Sciences, vol. 9, no. 22, pp. 4963, 2019.