Article

Fast Sparse Coding Networks for Anomaly Detection in Videos

Abstract

Semi-supervised video anomaly detection assumes that only normal video clips are available for training. The intuitive approaches are therefore either to learn a dictionary via sparse coding or to train encoding-decoding neural networks by minimizing reconstruction errors. For the former, optimizing the sparse coefficients is extremely time-consuming. For the latter, the strong generalization ability of neural networks cannot guarantee that abnormal data yield larger reconstruction errors. To remedy their weaknesses and leverage their strengths, we propose a Fast Sparse Coding Network (FSCN) based on high-level features. First, we propose a two-stream neural network that extracts Spatial-Temporal Fusion Features (STFF) in its hidden layers. With the STFF at hand, we use a fast sparse coding network to build a normal dictionary. By leveraging a predictor that produces approximate sparse coefficients, FSCN generates sparse coefficients within a single forward pass, which is simple and computationally efficient. Compared with traditional sparse-coding-based methods, FSCN is hundreds or even thousands of times faster at the test stage. Extensive experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance.
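The abstract gives no implementation details, so the following is a minimal, hypothetical sketch of the core idea only: a LISTA-style predictor that unrolls a few soft-thresholding steps to approximate sparse codes in one forward pass, with anomaly scored by reconstruction error against a dictionary learned on normal clips. Layer sizes, the number of unrolled steps, and the scoring form are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class SparseCodePredictor(nn.Module):
    """Hypothetical LISTA-style predictor: approximates sparse codes
    in a single forward pass instead of running iterative optimization."""
    def __init__(self, feat_dim, dict_size, n_steps=3):
        super().__init__()
        self.W = nn.Linear(feat_dim, dict_size, bias=False)   # input transform
        self.S = nn.Linear(dict_size, dict_size, bias=False)  # inhibition step
        self.theta = nn.Parameter(torch.full((dict_size,), 0.1))  # thresholds
        self.n_steps = n_steps

    def soft_threshold(self, z):
        return torch.sign(z) * torch.relu(z.abs() - self.theta)

    def forward(self, x):
        b = self.W(x)
        z = self.soft_threshold(b)
        for _ in range(self.n_steps):  # a few unrolled ISTA-like iterations
            z = self.soft_threshold(b + self.S(z))
        return z

def anomaly_score(x, z, D):
    """Reconstruction error of feature x under a normal dictionary D
    (feat_dim x dict_size), using the predicted sparse code z."""
    return torch.norm(x - z @ D.T, dim=-1) ** 2
```

At test time, one forward pass through the predictor replaces the per-sample iterative optimization of classical sparse coding, which is where the claimed orders-of-magnitude speedup would come from.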

... Deep learning (DL) has become increasingly prevalent for anomaly detection (AD) applications for reliability, safety, and health monitoring in several domains with the proliferation of sensor data in recent years [1][2][3]. AD has been applied for a diverse set of tasks, including but not limited to machinery fault diagnosis and prognosis [4,5], electronic device fault diagnosis [6][7][8][9], medical diagnosis [10][11][12][13], cybersecurity [14][15][16], crowd monitoring [17][18][19][20][21][22][23], traffic monitoring [24,25], environment monitoring [26], the Internet of things [3,27], and energy and power management [28,29]. AD aims to determine anomalies depending on the setting and application domain [2]. ...
... Deep learning has become effective for AD modeling because of its capability to capture complex structures, extract end-to-end automatic features, and scale for large data sets [1,2]. Several DL models have been proposed in the literature for diverse data types, such as structural [1], time series [7][8][9]12,13,16,27,[29][30][31][32][33][34][35][36][37][38], image [10,26], graph network data [14,15,24,25,39], and spatio-temporal [10,14,15,[17][18][19][20][21][22]24,25,39]. Spatio-temporal (ST) data are commonly collected in diverse domains, such as visual streaming data [17][18][19][20][21][22][23], transportation traffic flows [24,25], sensor networks [14,15,39], geoscience [26], medical diagnosis [10], and high-energy physics [40,41]. ...
... Several DL models have been proposed in the literature for diverse data types, such as structural [1], time series [7][8][9]12,13,16,27,[29][30][31][32][33][34][35][36][37][38], image [10,26], graph network data [14,15,24,25,39], and spatio-temporal [10,14,15,[17][18][19][20][21][22]24,25,39]. Spatio-temporal (ST) data are commonly collected in diverse domains, such as visual streaming data [17][18][19][20][21][22][23], transportation traffic flows [24,25], sensor networks [14,15,39], geoscience [26], medical diagnosis [10], and high-energy physics [40,41]. A unique quality of ST data that differentiates it from other classic data is the presence of dependencies among measurements induced by the spatial and temporal attributes, where data correlations are more complex to capture by conventional techniques [42]. ...
Article
Full-text available
The Compact Muon Solenoid (CMS) experiment is a general-purpose detector for high-energy collisions at the Large Hadron Collider (LHC) at CERN. It employs an online data quality monitoring (DQM) system to promptly spot and diagnose particle data acquisition problems to avoid data quality loss. In this study, we present a semi-supervised spatio-temporal anomaly detection (AD) monitoring system for the physics particle reading channels of the Hadron Calorimeter (HCAL) of the CMS using three-dimensional digi-occupancy map data of the DQM. We propose the GraphSTAD system, which employs convolutional and graph neural networks to learn local spatial characteristics induced by particles traversing the detector and the global behavior owing to shared backend circuit connections and housing boxes of the channels, respectively. Recurrent neural networks capture the temporal evolution of the extracted spatial features. We validate the accuracy of the proposed AD system in capturing diverse channel fault types using the LHC collision data sets. The GraphSTAD system achieves production-level accuracy and is being integrated into the CMS core production system for real-time monitoring of the HCAL. We provide a quantitative performance comparison with alternative benchmark models to demonstrate the promising leverage of the presented system.
... Compared with abnormal events which are difficult to collect and enumerate, normal events are easier to obtain. Hence, most of the existing works train the models based on datasets with only normal events [3,9,18,25,28,29,32,34]. Generally, there are three strategies to estimate the anomalies in the video: (1) features of normal and abnormal events are projected into a common space, and the anomaly is detected according to the distance of the spatial distribution [1]. ...
... Generally, there are three strategies to estimate the anomalies in the video: (1) features of normal and abnormal events are projected into a common space, and the anomaly is detected according to the distance of the spatial distribution [1]. (2) A dictionary is trained where patterns of normal events are recorded based on their high-level semantic features, and the score of abnormality is calculated with the help of the dictionary [23,29,33]. (3) Various feature extractors are trained to reconstruct or predict the previous or the next frames and anomalies are detected through reconstruction and prediction errors [15,16,19,25,32]. ...
... Stan et al. [13] train the model by generating the removed frames in consecutive frames with adversarial learning. Wu et al. [29] build a fast sparse coding network (FSCN) based on high-level features, which extracts spatiotemporal fusion features to learn a normal dictionary and uses abnormality scores to determine whether a test video is normal. Huang et al. [8] utilize deep contrastive self-supervised learning to capture high-level semantic features and tackle anomaly detection with multiple self-supervised tasks. ...
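As a generic illustration of the reconstruction-error strategy (3) mentioned in the snippets above, a per-frame anomaly score can be computed from reconstruction error and normalized over the clip; this is a sketch of the common recipe, not any single paper's rule.

```python
import numpy as np

def frame_anomaly_scores(frames, recon):
    """Per-frame mean squared reconstruction error, min-max normalized
    to [0, 1] over the clip. frames/recon: (T, H, W, C) arrays (assumed)."""
    err = ((frames - recon) ** 2).reshape(len(frames), -1).mean(axis=1)
    return (err - err.min()) / (err.max() - err.min() + 1e-8)
```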
Article
Full-text available
Anomaly detection in surveillance video is of great significance to social security and the protection of specific scenes. However, existing methods fail to achieve a balance between accuracy and real-time performance. In this paper, we propose a two-stream spatio-temporal generative model (TSSTGM) for surveillance videos that detects abnormal behaviors in real time. We construct an end-to-end video reconstruction and prediction framework based on deep learning to detect anomalies by reconstruction error and prediction error. Specifically, we elaborately design a fully convolutional structure, enabling the model to accept input videos of any size. To ensure strong performance in complex scenes, appearance, temporal, and motion features are fully explored and fed into the discriminator to train the model with adversarial learning. Moreover, the input design and the way optical flow is computed ensure the model runs in real time. Experiments on two real-world datasets show that, while satisfying the real-time requirement, TSSTGM remains competitive with both real-time and non-real-time existing methods on the AUC and EER metrics. Our model has been deployed in several campus security surveillance systems to detect dangerous behaviors, ensuring the personal safety of students.
... To exploit both spatial and temporal information from input frames, a ConvLSTM layer was added to the encoder and decoder, and a residual block was placed between the encoder and decoder to avoid the vanishing gradient problem. To address the false reconstruction problem of the autoencoder, [11] combined a human expert's feedback with the convolutional autoencoder's output, while [12] combined an autoencoder and an explanation method to enhance its interpretability. On the other hand, [12][13][14] proposed two-stream networks designed to extract both spatial and temporal features to improve the performance of their reconstruction networks. ...
... To address the false reconstruction problem of the autoencoder, [11] combined a human expert's feedback with the convolutional autoencoder's output, while [12] combined an autoencoder and an explanation method to enhance its interpretability. On the other hand, [12][13][14] proposed two-stream networks designed to extract both spatial and temporal features to improve the performance of their reconstruction networks. Similarly, [15] employed a two-stream network that has both a forward network for reconstructing the current frame and a backward network for generating the reversed frames. ...
Article
Full-text available
Anomaly detection is to identify abnormal events against normal ones within surveillance videos mainly collected in ground-based settings. Recently, the demand for processing drone-collected data has been growing rapidly with the expanding range of drone applications. However, as most aerial videos collected by flying drones contain dynamic backgrounds, it is necessary to deal with their spatio-temporal features when detecting anomalies. This study presents a transformer-based video anomaly detection method, evaluated on a challenging aerial dataset and three benchmark ground-based datasets. A multi-stage transformer is leveraged as an encoder to generate multi-scale feature maps, which are then conveyed to a hierarchical spatio-temporal transformer that is linked to a decoder and used to capture spatial and temporal information through a joint attention mechanism. Extensive evaluations, including several ablation studies, suggest that this network outperforms state-of-the-art methods. We expect the proposed transformer for U-Net to find diverse applications in the video processing area. Code and pre-trained models are publicly available at https://github.com/vt-le/HSTforU.
... Unsupervised video anomaly detection algorithms generally fall into two categories. The first category is sparse coding methods [29,43,55]. The core idea is to learn a dictionary to encode all normal events in the training set so that anomalous events cannot be well-reconstructed. ...
... Quantitative results. We compare our proposed MOFSTE-NM with current state-of-the-art models, including (1) other methods: AnoPCN [47], FSCN [43], DeepOC [44], SIGnet [12], AnomalyNet [60]; (2) reconstruction-based methods: AMC [36], ConvLSTM-AE [30], MemAE [13], MNAD-R [37], Stacked RNN [31], STEAL-Net [1], DDGAN [11], CDDA [5]; (3) frame-prediction-based methods: FFP [27], STCEN [14], AMMC-Net [2], MNAD-P [37], VABD [26], DSM-Net [42], *AMSTE [58]. The results are summarized in Table 2. ...
Article
Full-text available
Video anomaly detection aims to assign anomaly scores to video frames, and it is a challenging research area since the types of anomalies are limitless. In response to the fact that abnormal behavior is likely to be misidentified as normal and that anomalies are typically generated by the fast motion of foreground objects, this paper proposes a novel model called the Multi-scale Optical Flow Spatio-Temporal Enhancement and Normality Mining Network (MOFSTE-NM). It contains a Spatio-temporal Information Attention Enhancement Module (SIAEM) that incorporates reconstructed optical flows at multiple scales and considers spatial and temporal aspects. This strategy reduces the influence of the background and normal objects, enhancing the model's ability to focus on anomalous fast-moving objects in the foreground. Additionally, we propose a Normality Mining Convolution (NMC) module embedded in the decoder to refine the boundary between normality and abnormality. The NMC uses a multihead attention mechanism for dynamic weight adjustment, enabling the precise extraction of normal information. We compute the final anomaly score by fusing two components: (1) the reconstruction error of the optical flows and (2) the peak signal-to-noise ratio between the predicted frame and its ground truth. We evaluate our model on three well-established video anomaly detection datasets. A comparison of different models indicates that the proposed model achieves superior performance compared to state-of-the-art approaches, with area under the receiver operating characteristic curve (AUROC) values of 99.23% on UCSD Ped2, 88.84% on CUHK Avenue, and 74.80% on ShanghaiTech.
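The abstract fuses (1) the reconstruction error of the optical flows and (2) the PSNR between the predicted frame and its ground truth. A hedged sketch of one such fusion follows; the weight `alpha` and the min-max PSNR normalization are assumptions, not the MOFSTE-NM formula.

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio between a predicted frame and its target."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_val ** 2 / (mse + 1e-8))

def fused_anomaly_score(flow_err, psnr_val, psnr_min, psnr_max, alpha=0.5):
    """Low PSNR (poor prediction) and high flow reconstruction error
    both push the score toward 'anomalous'. alpha is an assumption."""
    regularity = (psnr_val - psnr_min) / (psnr_max - psnr_min + 1e-8)
    return alpha * flow_err + (1.0 - alpha) * (1.0 - regularity)
```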
... The common methods for locating the foreground include object detection algorithms [46] or optical flow [8], but these require high computation costs. Here, we propose a considerably simple and efficient method named SA², inspired by motion-prior-based works [59,65]. Specifically, given the frame-level feature and its corresponding spatial feature X ∈ R^{t×c×h×w}, where h and w are the height and width of the spatial feature, we argue that when most abnormal events occur, the corresponding location within the spatial feature changes significantly [65]. ...
... Here, we propose a considerably simple and efficient method named SA², inspired by motion-prior-based works [59,65]. Specifically, given the frame-level feature and its corresponding spatial feature X ∈ R^{t×c×h×w}, where h and w are the height and width of the spatial feature, we argue that when most abnormal events occur, the corresponding location within the spatial feature changes significantly [65]. Therefore, we compute the frame difference to obtain the motion magnitude: ...
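The equation after the colon did not survive extraction. A plausible form consistent with the surrounding description, frame differencing over spatial features as a motion prior, is sketched below; the tensor layout and the channel-mean reduction are assumptions.

```python
import torch

def motion_magnitude(feats):
    """feats: (T, C, H, W) per-frame spatial features (assumed layout).
    Returns a (T-1, H, W) map of motion magnitude from frame differences."""
    diff = feats[1:] - feats[:-1]   # temporal frame difference
    return diff.abs().mean(dim=1)   # average over channels
```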
Preprint
Full-text available
The current weakly supervised video anomaly detection (WSVAD) task aims to achieve frame-level anomalous-event detection with only coarse video-level annotations available. Existing works typically involve extracting global features from full-resolution video frames and training frame-level classifiers to detect anomalies in the temporal dimension. However, most anomalous events tend to occur in localized spatial regions rather than entire video frames, which implies that existing frame-level-feature-based works may be misled by the dominant background information and lack interpretation of the detected anomalies. To address this dilemma, this paper introduces a novel method called STPrompt that learns spatio-temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs). Our proposed method employs a two-stream network structure, with one stream focusing on the temporal dimension and the other primarily on the spatial dimension. By leveraging the learned knowledge from pre-trained VLMs and incorporating natural motion priors from raw videos, our model learns prompt embeddings that are aligned with spatio-temporal regions of videos (e.g., patches of individual frames) to identify specific local regions of anomalies, enabling accurate video anomaly detection while mitigating the influence of background information. Without relying on detailed spatio-temporal annotations or auxiliary object detection/tracking, our method achieves state-of-the-art performance on three public benchmarks for the WSVADL task.
... Consequently, researchers usually split the frame into multiple sub-portions and recognize the oddity in every sub-portion. To identify anomalies in video, various techniques have been proposed in the literature [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20]. Some preferred to apply handcrafted techniques such as trajectory features and optical flow for anomaly detection. ...
... As a result, researchers preferred to detect anomalies using deep learning-based approaches. Wu et al. [5] implemented a two-stream neural network for extracting fused spatial-temporal features, and afterward, these features were transformed into normal dictionary terms using fast sparse coding. To learn temporal regularity, Hasan et al. [6] passed the extracted features to a neural network-based auto-encoder. ...
Article
Full-text available
Video abnormality detection has become an essential component of surveillance video, identifying frames in a video sequence that contain events that do not conform to expected behavior. However, its application is limited due to major challenges during training such as mode collapse, non-convergence, and instability. This paper proposes a novel two-stream spatial and temporal architecture called Deep Stacked LSTM (DSLSTM) for abnormality detection, comprising a spatial and a temporal stream to extract the spatial and temporal features. MSE is computed separately for the extracted features of each stream and fused to form a joint representation of appearance and motion. Afterwards, the PSNR and then the anomaly score are measured from the joint representation. Only frames whose anomaly score exceeds a threshold are considered abnormal. The experimental results, evaluated and compared on four benchmark datasets (UCSD Ped1, Ped2, CUHK Avenue, and ShanghaiTech), show the high performance of DSLSTM in contrast to recent popular state-of-the-art methods. Besides, a report on three ablation experiments is also provided, and their impacts on the performance of DSLSTM are compared. We also further compare the performance of our deep DSLSTM with our own shallow SSLSTM model.
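A sketch of the scoring pipeline the DSLSTM abstract describes: per-stream MSE, fusion into a joint score, PSNR, then a thresholded anomaly score. The fusion weight `w` and the min-max normalization are assumptions.

```python
import numpy as np

def dslstm_style_scores(spatial_err, temporal_err, w=0.5):
    """spatial_err/temporal_err: per-frame MSE arrays from the two streams.
    Returns per-frame anomaly scores in [0, 1]; frames above a chosen
    threshold would be flagged abnormal."""
    joint = w * spatial_err + (1.0 - w) * temporal_err   # joint representation
    psnr = 10.0 * np.log10(1.0 / (joint + 1e-8))          # higher = more normal
    return 1.0 - (psnr - psnr.min()) / (psnr.max() - psnr.min() + 1e-8)
```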
... Video anomaly detection (VAD) is challenging due to the absence of training samples. To alleviate the issue, VAD has traditionally relied on unsupervised [27,28,29] or weakly supervised learning [5,7,6] for training. Regularization is another critical step to overcome the overfitting issue prevalent in VAD models. ...
... These methods make the assumption that unseen abnormal videos are difficult to reconstruct accurately and regard samples with high reconstruction errors as anomalies. Reconstruction can be done through sparse coding [27,28,30,31] or auto-encoders [29,12,2,32]. Sparse coding encodes normal patterns with a dictionary, and a sample that cannot be sparsely reconstructed from it is regarded as anomalous. ...
Article
Full-text available
Video anomaly detection aims to identify anomalous segments in a video. It is typically trained with weakly supervised video-level labels. This paper focuses on two crucial factors affecting the performance of video anomaly detection models. First, we explore how to capture local and global temporal dependencies more effectively. Previous architectures are effective at capturing either local or global information, but not both. We propose to employ a U-Net-like structure to model both types of dependencies in a unified structure, where the encoder learns global dependencies hierarchically on top of local ones; the decoder then propagates this global information back to the segment level for classification. Second, overfitting is a non-trivial issue for video anomaly detection due to limited training data. We propose weakly supervised contrastive regularization, which adopts a feature-based approach to regularize the network. Contrastive regularization learns more generalizable features by enforcing inter-class separability and intra-class compactness. Extensive experiments on the UCF-Crime dataset show that our approach outperforms several state-of-the-art methods.
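A minimal sketch of feature-based contrastive regularization in the spirit described above: intra-class compactness pulls normal snippet features toward their centroid, while inter-class separability pushes anomalous features at least a margin away from it. The margin form and the centroid choice are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def contrastive_regularizer(feats, labels, margin=1.0):
    """feats: (N, D) snippet features; labels: 0 = normal, 1 = anomalous
    (video-level labels propagated to snippets, as in weak supervision)."""
    normal = feats[labels == 0]
    center = normal.mean(dim=0, keepdim=True)
    compactness = ((normal - center) ** 2).sum(dim=1).mean()
    anomalous = feats[labels == 1]
    separability = F.relu(margin - torch.norm(anomalous - center, dim=1)).pow(2).mean()
    return compactness + separability
```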
... Within this context, [25] proposed a sparse coding based deep neural network using stacked recurrent neural networks to optimize the sparse coefficients, while [38] introduced an optimization network based on a novel LSTM network. A fast sparse coding network [32] adopted a two-stream neural network to extract spatiotemporal features, serving as a lightweight network to learn a normal-event dictionary. ...
... Since it contains a large amount of data having diverse types of normal and abnormal events, its performance is relatively lower than those of the other datasets. The bold entries show the best results.

Method             | Ped2 | Avenue | ShanghaiTech
[38]               | 94.9 | 86.1   | -
Abati et al. [1]   | 95.4 | -      | 72.5
Hu et al. [13]     | 95.9 | 84.2   | -
Wu et al. [32]     | 92.8 | 85.5   | -
Li et al. [15]     | 92.9 | 83.5   | -
Fan et al. [7]     | 92.2 | 83.4   | -
Fang et al. [8]    | 95.6 | 86.3   | 73.2
Chang et al. [2]   | 96.5 | 86.0   | 73.3
Deepak et al. [4]  | 83.0 | 82.0   | -
Prediction ...
Article
Full-text available
Automatic anomaly detection is a crucial task in video surveillance systems intensively used for public safety and other purposes. The present system adopts a spatial branch and a temporal branch in a unified network that exploits both spatial and temporal information effectively. The network has a residual autoencoder architecture, consisting of a deep convolutional neural network-based encoder and a multi-stage channel attention-based decoder, trained in an unsupervised manner. The temporal shift method is used for exploiting temporal features, whereas contextual dependency is extracted by channel attention modules. System performance is evaluated using three standard benchmark datasets. The results suggest that our network outperforms state-of-the-art methods, achieving Area Under Curve values of 97.4% for UCSD Ped2, 86.7% for CUHK Avenue, and 73.6% for the ShanghaiTech dataset.
... Recent DL models built on hybrids of CNNs [3,38,39], RNNs [40,41], and GNNs [4,41] have gained momentum for TS and ST data in AD and other data mining applications [10,12,17,32]. Thus far, most TL studies focus on feature extraction encoding networks and predominantly on forecasting tasks [37]. ...
Preprint
Full-text available
The proliferation of sensors brings an immense volume of spatio-temporal (ST) data in many domains for various purposes, including monitoring, diagnostics, and prognostics applications. Data curation is a time-consuming process for a large volume of data, making it challenging and expensive to deploy data analytics platforms in new environments. Transfer learning (TL) mechanisms promise to mitigate data sparsity and model complexity by utilizing pre-trained models for a new task. Despite the triumph of TL in fields like computer vision and natural language processing, efforts on complex ST models for anomaly detection (AD) applications are limited. In this study, we present the potential of TL within the context of AD for the Hadron Calorimeter of the Compact Muon Solenoid experiment at CERN. We have transferred the ST AD models trained on data collected from one part of a calorimeter to another. We have investigated different configurations of TL on semi-supervised autoencoders of the ST AD models -- transferring convolutional, graph, and recurrent neural networks of both the encoder and decoder networks. The experiment results demonstrate that TL effectively enhances the model learning accuracy on a target subdetector. The TL achieves promising data reconstruction and AD performance while substantially reducing the trainable parameters of the AD models. It also improves robustness against anomaly contamination in the training data sets of the semi-supervised AD models.
... Lu et al. [22] used a dictionary-based method to learn behavior and used reconstruction errors to detect anomalies. For a comprehensive comparison, Refs. [7], [22], and [23] are selected as representatives of the prediction method, the reconstruction method, and the MIL-based method, respectively. [8] and its improvements [11,13,[24][25][26][27][28][29][30]] are also compared as the same kind of methods using MIL. ...
Article
Full-text available
Video anomaly detection attempts to identify abnormal activity within an overwhelming volume of surveillance videos. In order to learn efficient features in both spatial and temporal dimensions for anomaly detection with a lower calculation cost, a temporal-spatial interactive shift module (TISM) is proposed, avoiding the huge computation cost of 3D convolution. The module shifts part of the feature channels over the time dimension via weighted spatial interaction with temporally neighbored features, allowing spatial information to be interactively exchanged between adjacent frames. Entropy is introduced into the interactive operation for channel selection and adaptive weighting. Experiments on the UCF-Crime dataset prove the superiority of the proposed method. The AUC is increased by 2.53% compared with the SOTA baseline and 6.24% compared with the 3D CNNs, while reducing the amount of calculation by 26%.
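For reference, below is a sketch of the plain channel temporal shift that TISM builds on: part of the channels is shifted forward or backward along time, so adjacent frames exchange spatial information without 3D convolution. The entropy-based channel selection and adaptive weighting that TISM adds are not reproduced here, and `shift_div` is an assumption.

```python
import torch

def temporal_shift(x, shift_div=8):
    """x: (N, T, C, H, W). Shifts 1/shift_div of the channels forward in
    time, another 1/shift_div backward, and leaves the rest untouched."""
    n, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                  # shift forward
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]  # shift backward
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # no shift
    return out
```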
... We can conclude from the analysis that our suggested approach outperforms the existing approaches and has a greater accuracy of approximately 99%.

Method                | Year | Dataset      | Reported results
[31]                  | 2021 | -            | 98.5%
Dong et al. [32]      | 2020 | UCSD         | 95.6%, 0.73, 0.32
Wu et al. [33]        | 2020 | CUHK dataset | 85.5%, 0.27
Sarker et al. [34]    | 2021 | Web dataset  | 96%, 80.3%
Avola et al. [35]     | 2022 | UMCD dataset | 97.2%
Ekanayake et al. [36] | 2022 | UCSD dataset | 98.6%, 99.51% ...
Article
Full-text available
In recent times, video surveillance has become indispensable for public security, leveraging computer vision advancements to analyze and comprehend lengthy video feeds. Anomaly detection and classification stand out as crucial components of this technology. Anomaly detection's primary objective is to swiftly identify irregularities within a given timeframe. Deep Neural Network (DNN) approaches are promising for anomaly detection because they combine the ideas of deep learning and reinforcement learning, enabling artificial agents to learn from and gain insights into real-world data. A modified DNN technique known as HSOE-FAST (Histo sigmoid of Orientation and Enthalpy with Fast Accelerated Segment Test) is proposed in this study. The input, obtained from the dataset, is first pre-processed using a Gaussian filter; features are then extracted using the HSOE-FAST method; and classification is finally performed using the modified DNN approach. Compared to other approaches, the suggested solution achieves an accuracy of almost 99% while overcoming the shortcomings of existing methodologies.
... This is because these techniques focus solely on visual patterns. Dictionary learning and sparse coding are two other video anomaly detection methods that have gained attention in recent years [16] [17]. In such schemes, normal scenes are usually encoded into a vocabulary (dictionary), and input features that cannot be represented by it are treated as exceptional events. ...
Article
Full-text available
Violence detection holds immense significance in ensuring public safety, security, and law enforcement in various domains. With the increasing availability of video data from surveillance cameras and social media platforms, the need for accurate and efficient violence detection algorithms has become paramount. Automated violence detection systems can aid law enforcement agencies in identifying and responding to violent incidents promptly, thereby preventing potential threats and ensuring public protection. This research focuses on violence detection in large video databases, proposing two keyframe-based models named DeepkeyFrm and AreaDiffKey. The keyframes selection process is critical in violence detection systems, as it reduces computational complexity and enhances accuracy. EvoKeyNet and KFCRNet are the proposed classification models that leverage feature extraction from optimal keyframes. EvoKeyNet utilizes an evolutionary algorithm to select optimal feature attributes, while KFCRNet employs an ensemble of LSTM, Bi-LSTM, and GRU models with a voting scheme. Our key contributions include the development of efficient keyframes selection methods and classification models, addressing the challenge of violence detection in dynamic surveillance scenarios. The proposed models outperform existing methods in terms of accuracy and computational efficiency, with accuracy results as follows: 98.98% (Hockey Fight), 99.29% (Violent Flow), 99% (RLVS), 91% (UCF-Crime), and 91% (ShanghaiTech). The ANOVA and Tukey tests were performed to validate the statistical significance of the differences among all models. The proposed approaches, supported by the statistical tests, pave the way for more effective violence detection systems, holding immense promise for a safer and secure future. As violence detection technology continues to evolve, our research stands as a crucial stepping stone towards achieving improved public safety and security in the face of dynamic challenges.
... Recent methods [43,126,127] have focused on exploiting both the appearance and motion information in video, by extracting structural features and the optical flow. The extraction of optical flow requires a high computational cost. ...
Article
Full-text available
Anomaly detection in video surveillance is a highly developed subject that is attracting increased attention from the research community. There is great demand for intelligent systems with the capacity to automatically detect anomalous events in streaming videos. Due to this, a wide variety of approaches have been proposed to build an effective model that would ensure public security. There have been a variety of surveys of anomaly detection, such as network anomaly detection, financial fraud detection, human behavioral analysis, and many more. Deep learning has been successfully applied to many aspects of computer vision. In particular, the strong growth of generative models means that these are the main techniques used in the proposed methods. This paper aims to provide a comprehensive review of the deep learning-based techniques used in the field of video anomaly detection. Specifically, deep learning-based approaches are categorized into different methods by their objectives and learning metrics. Additionally, preprocessing and feature engineering techniques are discussed thoroughly for the vision-based domain. This paper also describes the benchmark databases used in training and detecting abnormal human behavior. Finally, the common challenges in video surveillance are discussed, to offer some possible solutions and directions for future research.
... Wu et al. [67] proposed a straight-through pipeline to detect video captions. To recognize video subtitles, the CTPN is utilized, while the ResNet, GRU, and CTC are used to detect Chinese and English subtitles in video pictures. ...
Article
Full-text available
The objective of this study is to supply an overview of research on video-based networks and tiny object identification. The identification of tiny items and video objects, as well as research on current technologies, is discussed first. The detection, loss function, and optimization techniques are classified and described in a comparison table. These comparison tables are designed to help identify differences in research utility, accuracy, and computation. Finally, the study highlights some future trends in video and small object detection (people, cars, animals, etc.), loss functions, and optimization techniques for solving new problems.
... Although it does not consider the temporal coherence between neighboring frames within normal or abnormal events, an abnormal event is associated with a large reconstruction error. Wu et al. [345] proposed a two-stream neural network to extract spatiotemporal fusion features in hidden layers. With these features, they employed a fast sparse coding network to build a normal dictionary for anomaly detection. ...
Preprint
Full-text available
Crowd anomaly detection is one of the most popular topics in computer vision in the context of smart cities. A plethora of deep learning methods have been proposed that generally outperform other machine learning solutions. Our review primarily discusses algorithms that were published in mainstream conferences and journals between 2020 and 2022. We present datasets that are typically used for benchmarking, produce a taxonomy of the developed algorithms, and discuss and compare their performances. Our main findings are that the heterogeneities of pre-trained convolutional models have a negligible impact on crowd video anomaly detection performance. We conclude our discussion with fruitful directions for future research.
... Peng Wu et al. [61] recommended Fast Sparse Coding Networks (FSCN) for higher-level feature extraction in anomaly videos. A two-stream neural network was developed for extracting Spatial-Temporal Fusion Features (STFF) in hidden layers. ...
Article
Full-text available
Video Surveillance (VS) systems are popular. They are utilized to enhance the safety of public lives and assets in public places such as marketplaces, hospitals, streets, educational institutions, banks, shopping malls, city administrative offices, and smart cities. The main purpose of security applications is the timely and accurate detection of video anomalies. Video anomalies include anomalous activities and anomalous entities: irregular or abnormal patterns in the video that do not match the normal trained patterns. Anomalous activities, such as traffic rule infringements, riots, fighting, and stampedes, as well as anomalous entities, such as weapons in sensitive places and abandoned luggage, ought to be detected automatically. This paper reviews Anomaly Detection (AD) in VS. The survey concentrates on the application of Deep Learning (DL) for finding the exact count, the individuals involved, and the activity occurring in a large crowd under all climate conditions. The fundamental DL technologies involved in different kinds of crowd Video Analysis (VA) are discussed. Moreover, it presents the available datasets and metrics for performance evaluation, and describes examples of prevailing VS systems used in real life. Lastly, the challenges and promising directions for further research are outlined. Pattern recognition has been the subject of a great deal of study during the previous half-century. There is not a single technique that can be utilized for all kinds of applications, whether in bioinformatics, data mining, speech recognition, remote sensing, multimedia, text detection, localization, or any other area. Methodologies for object recognition are the other focus of this paper. All aspects of object recognition, including local and global feature-based algorithms, as well as various pattern-recognition approaches, are examined, and the findings of many technologies and the future scope of this particular technique are described. We used the datasets' properties and other evaluation parameters found in an easily accessible web database. This study identifies the research gaps and limits in the subject, and can greatly benefit research in pattern recognition and object recognition.
Conference Paper
The growing dependence on surveillance videos highlights the need to automate the detection of anomalous events in videos, to improve real-time response and reduce manual workload. However, the diverse nature of abnormal activities across various public and private settings poses a challenge. This paper proposes integrating a U-Net-based architecture with vision transformers, leveraging their capabilities in capturing long-range dependencies, for frame-level anomaly detection in surveillance videos. The model learns inherent patterns by training on normal event videos and identifies deviations during testing on unseen video frames. Experiments on the UCSD Ped1 dataset show better results compared to other methods.
Conference Paper
In recent years, the importance of text detection and recognition in images has grown due to the huge number of applications developed for mobile devices and computer vision. Text detection becomes difficult when backgrounds are complex, when capture conditions are not controlled, or when images contain complex textual content. In this work, a robust multilanguage text detection and recognition method is proposed, based on a pre-trained CRAFT (Character-Region Awareness For Text Detection) algorithm combined with an OCR (Optical Character Recognition) system for text recognition. The preprocessing stage includes advanced image pre-processing methods, such as channel modifications, grayscale conversion, and normalization, to enhance text visibility and detectability. The algorithm then effectively analyzes text from a variety of backgrounds and languages by combining a VGG16-BN backbone network with extra Up-Conv blocks, and finally uses the modified OCR system for text recognition. Experimental results show that the proposed text detection and recognition method is robust to distortions in geometry, noise, lighting, and resolution (high brightness, shadows, inconsistent illumination, low contrast). Furthermore, the proposed method often performs better than state-of-the-art methods when tested on natural scene datasets, as seen by its excellent recall and precision rates. It achieved a precision rate of 97.3% and a harmonic mean score of 93.5% on the ICDAR2015 dataset, indicating its dependability and efficiency in text detection for practical uses.
Article
Video anomaly detection has always been a challenging task in computer vision due to data imbalance and susceptibility to scene variations such as lighting and occlusions. In response to this challenge, this paper proposes an unsupervised video anomaly detection method based on an attention-enhanced memory network. The method utilizes a dual-stream network structure of autoencoders, enhancing the model’s learning ability for important features in appearance and motion by introducing coordinate attention mechanisms and variance attention mechanisms, emphasizing significant characteristics of static objects and rapidly moving regions. By adding memory modules to both the appearance and motion branches, the network structure’s memory information is reinforced, enabling it to capture long-term spatiotemporal dependencies in videos and thereby improving the accuracy of anomaly detection. Furthermore, by optimizing the network structure’s activation functions to handle negative inputs, it enhances its nonlinear modeling capabilities, enabling better adaptation to complex environments, including variations in lighting and occlusions, further improving the effectiveness of anomaly detection. The paper conducts comparative experiments and ablation studies using three public available datasets and various models. The results demonstrate that compared to baseline models, the AUC performance is improved by 3.9%, 4.7%, and 1.7% on UCSD Ped2, CHUK Avenue, and ShanghaiTech datasets, respectively. When compared with the other models, the average AUC performance is improved by 4.3%, 5.4%, and 6.2%, with an average improvement of 8.75% in the ERR metric, validating the effectiveness and adaptability of the proposed method. The code can be obtained at the following URL: https://github.com/AcademicWhite/AEMNet .
Conference Paper
Anomaly detection in surveillance videos with weakly supervised learning (WSL) is regarded as a complicated research problem in existing works. Recent studies apply MIL-based video violence detection under WSL, where a system is trained to produce frame-level anomaly scores for positive and negative bags. However, these techniques only consider present-time information as input data and mostly overlook past observations. Also, there is a lack of sufficiently discriminative features to distinguish between normal and anomalous segments. To address this challenge, we propose a novel long-term temporal dependency framework incorporating a SlowFast Network (SFNet) that performs feature discrimination and a flow transformer that focuses on temporal dependency to consider both historical data and current information. Also, we leverage a snippet-level classifier to measure the scores of anomalous and normal videos. Comprehensive experiments on two benchmark datasets, namely ShanghaiTech and UCF-Crime, show that our method performs better than existing methods. Keywords: Anomaly detection, Weakly-supervised Learning
Article
Full-text available
In poor lighting and in bad weather such as rain and fog, road traffic signs become blurred and hard to recognize. A super-resolution reconstruction algorithm for traffic sign images under complex lighting and bad weather is proposed. First, a novel attention residual module is designed that incorporates an aggregated feature attention mechanism on the skip-connection side of the base residual module, so that the deep network can obtain richer detail information. Second, a cross-layer skip-connection feature fusion mechanism is adopted to enhance the flow of information across layers, prevent the gradient-vanishing problem of deep networks, and enhance the reconstruction of edge detail information. Lastly, a forward-inverse dual-channel sub-pixel convolutional up-sampling method is designed to reconstruct super-resolution images with better pixel and spatial information expression. The model is trained on a Chinese traffic sign dataset in natural scenes. With a scaling factor of 4, the average PSNR and SSIM are improved by 0.031 dB and 0.083 compared with MICU (Multi-level Information Compensation and U-net), the latest deep learning-based super-resolution reconstruction algorithm for single-frame images, and the actual test averages reach 20.946 dB and 0.656. The experimental results show that the images reconstructed by this algorithm surpass the mainstream comparison algorithms in terms of both objective indexes and subjective perception, with a higher peak signal-to-noise ratio and perceptual similarity. The method can provide technical support for research on safe-driving assistive devices in natural scenes under varying illumination conditions and bad weather.
Article
Detecting anomalies in videos presents a significant challenge in the field of video surveillance. The primary goal is identifying and detecting uncommon actions or events within a video sequence. The difficulty arises from the limited availability of video frames depicting anomalies and the ambiguous definition of anomaly. Building on the extensive applications of Generative Adversarial Networks (GANs), which consist of a generator and a discriminator network, we propose an Attention-guided Generator with Dual Discriminator GAN (A2D-GAN) for real-time video anomaly detection (VAD). The generator network uses an encoder-decoder architecture with multi-stage self-attention added to the encoder and multi-stage channel attention added to the decoder. The framework uses adversarial learning from noise and video frame reconstruction to enhance the generalization of the generator network. Of the dual discriminators in A2D-GAN, one discriminates between the reconstructed video frame and the real video frame, while the other discriminates between the reconstructed noise and the real noise. Exhaustive experiments and ablation studies on four benchmark video anomaly datasets, namely UCSD Peds, CUHK Avenue, ShanghaiTech, and Subway, demonstrate the effectiveness of the proposed A2D-GAN compared to other state-of-the-art methods. The proposed A2D-GAN model is robust and can detect anomalies in videos in real time. The source code to replicate the results of the proposed A2D-GAN model is available at https://github.com/Rituraj-ksi/A2D-GAN.
Conference Paper
Video anomaly detection (VAD) aims to detect abnormal behaviors or events during video monitoring. Recent VAD methods use a proxy task that reconstructs the input video frames, quantifying the degree of anomaly by computing the reconstruction error. However, these methods do not consider the diversity of normal patterns and neglect the scale differences of the abnormal foreground image between different video frames. To address these issues, we propose an unsupervised video anomaly detection method termed enhanced memory adversarial network, which integrates a dilated convolution feature extraction encoder and a feature matching memory module. The dilated convolution feature extraction encoder extracts features at different scales by increasing the receptive field. The feature matching memory module stores multiple prototype features of normal video frames, ensuring that the query features are closer to the prototypes while maintaining a distinct separation between different prototypes. Our approach not only improves the prediction performance but also considers the diversity of normal patterns. At the same time, it reduces the representational capacity of the predictive networks while enhancing the model’s sensitivity to anomalies. Experiments on the UCSD Ped2 and CUHK Avenue dataset, comparing our method with existing unsupervised video anomaly detection methods, show that our proposed method is superior in the AUC metric, achieving an AUC of 96.3% on the UCSD Ped2 dataset, and an AUC of 86.5% on the CUHK Avenue dataset.
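A hedged sketch of the prototype-memory addressing described above: each query feature is re-expressed as a soft combination of stored normal prototypes, which limits the decoder's ability to reconstruct unseen anomalies. Dot-product addressing with a plain softmax is an assumption; the paper additionally enforces separation between prototypes.

```python
import torch
import torch.nn.functional as F

def memory_read(query, memory):
    """query: (N, D) encoder features; memory: (M, D) normal prototypes.
    Returns memory-filtered features as soft combinations of prototypes."""
    attn = F.softmax(query @ memory.T, dim=1)  # (N, M) addressing weights
    return attn @ memory                        # (N, D)
```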
Article
Although great progress has been sparked in video anomaly detection (VAD) by deep neural networks (DNNs), existing solutions still fall short in two aspects: (1) The extraction of video events cannot be both precise and comprehensive. (2) The semantics and temporal context are under-explored. To tackle above issues, we are inspired by cloze tests in language education and propose a novel approach named Visual Cloze Completion (VCC), which conducts VAD by completing visual cloze tests (VCTs). Specifically, VCC first localizes each video event and encloses it into a spatio-temporal cube (STC). To realize both precise and comprehensive event extraction, appearance and motion are used as complementary cues to mark the object region associated with each event. For each marked region, a normalized patch sequence is extracted from several neighboring frames and stacked into a STC. With each patch and the patch sequence of a STC regarded as a visual “word” and “sentence” respectively, we deliberately erase a certain “word” (patch) to yield a VCT. Then, the VCT is completed by training DNNs to infer the erased patch and its optical flow via video semantics. Meanwhile, VCC fully exploits temporal context by alternatively erasing each patch in temporal context and creating multiple VCTs. Furthermore, we propose localization-level, event-level, model-level and decision-level solutions to enhance VCC, which can further exploit VCC’s potential and produce significant VAD performance improvement. Extensive experiments demonstrate that VCC achieves highly competitive VAD performance.
Article
Full-text available
Abnormal event detection is one of the most challenging tasks in computer vision. Many existing deep anomaly detection models are based on reconstruction errors, where the training phase is performed using only videos of normal events and the model is then capable of estimating frame-level scores for an unknown input. It is assumed that the reconstruction error gap between frames of normal and abnormal events is high during the testing phase. Yet, this assumption may not always hold due to the superior capacity and generalization of deep neural networks. In this paper, we design a generalized framework (rpNet) proposing a series of deep models by fusing several options of a reconstruction network (rNet) and a prediction network (pNet) to detect anomalies in videos efficiently. In the rNet, either a convolutional autoencoder (ConvAE) or a skip-connected ConvAE (AEc) can be used, whereas in the pNet, either a traditional U-Net, a non-local block U-Net, or an attention block U-Net (aUnet) can be applied. The fusion of both rNet and pNet increases the error gap. Our deep models have distinct degrees of feature extraction capability. One of our models (AEcaUnet), consisting of an AEc with our proposed aUnet, is able to guarantee a better error gap and to extract the high-quality features needed for video anomaly detection. Experimental results on the UCSD-Ped1, UCSD-Ped2, CUHK-Avenue, ShanghaiTech-Campus, and UMN datasets with rigorous statistical analysis show the effectiveness of our models.
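As a generic illustration of why fusing the rNet and pNet signals can widen the normal-vs-abnormal error gap, the sketch below combines normalized reconstruction and prediction errors; the equal weighting is an assumption, not the rpNet fusion rule.

```python
import numpy as np

def fused_error_gap_score(recon_err, pred_err):
    """recon_err/pred_err: per-frame errors from the rNet-like and
    pNet-like branches. Both are normalized before combination so that
    neither signal dominates by scale alone."""
    r = (recon_err - recon_err.min()) / (np.ptp(recon_err) + 1e-8)
    p = (pred_err - pred_err.min()) / (np.ptp(pred_err) + 1e-8)
    return 0.5 * r + 0.5 * p  # high only for frames that are hard to
                              # both reconstruct and predict
```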
Article
Anomaly detection in public places using video surveillance gains significance from the real-time monitoring and security it provides for personal assets and public safety. Accordingly, this research proposes a deep CNN model with the Timber–Prairie wolf optimization algorithm (TPWO) for surveillance-based anomaly detection. To support the TPWO-based deep CNN anomaly detection model, a tracking model named OptSpatio tracks the location and movement of anomalous objects in any locality. The OptSpatio tracking model uses both visual and spatial tracking to monitor any anomalous activity, while TPWO is designed to tune the deep classifier for better detection performance. The TPWO-based model surpasses competing methods with an accuracy of 97.214%, a sensitivity of 97.831%, and a specificity of 96.668%, with a minimal EER of 2.786%. MOTP values are obtained at a rate of 0.7325; moreover, the effectiveness of the TPWO method is justified at the object, frame, and pixel levels.
Article
Weakly supervised video anomaly detection is generally formulated as a multiple instance learning (MIL) problem, where an anomaly detector learns to generate frame-level anomaly scores under the supervision of MIL-based video-level classification. However, most previous works suffer from two drawbacks: 1) they lack the ability to model temporal relationships between video segments, and 2) they cannot extract sufficiently discriminative features to separate normal and anomalous snippets. In this article, we develop a weakly supervised temporal discriminative (WSTD) paradigm that aims to leverage both temporal relations and feature discrimination to mitigate the above drawbacks. To this end, we propose a transformer-styled temporal feature aggregator (TTFA) and a self-guided discriminative feature encoder (SDFE). Specifically, TTFA captures multiple types of temporal relationships between video snippets from different feature subspaces, while SDFE enhances the discriminative power of features by clustering normal snippets and maximizing the separability between anomalous snippets and normal centers in the embedding space. Experimental results on three public benchmarks indicate that WSTD outperforms state-of-the-art unsupervised and weakly supervised methods, which verifies the superiority of the proposed method.
Article
Anomaly behavior detection plays a significant role in emergencies such as robbery. Although a lot of works have been proposed to deal with this problem, the performance in real applications is still relatively low. Here, to detect abnormal human behavior in videos, we propose a multiscale spatial temporal attention graph convolution network (MSTA-GCN) to capture and cluster the features of the human skeleton. First, based on the human skeleton graph, a multiscale spatial temporal attention graph convolution block (MSTA-GCB) is built which contains multiscale graphs in temporal and spatial dimensions. MSTA-GCB can simulate the motion relations of human body components at different scales where each scale corresponds to different granularity of annotation levels on the human skeleton. Then, static, globally-learned and attention-based adjacency matrices in the graph convolution module are proposed to capture hierarchical representation. Finally, extensive experiments are carried out on the ShanghaiTech Campus and CUHK Avenue datasets, the final results of the frame-level AUC/EER are 0.759/0.311 and 0.876/0.192, respectively. Moreover, the frame-level AUC is 0.768 for the human-related ShanghaiTech subset. These results show that our MSTA-GCN outperforms most of methods in video anomaly detection and we have obtained a new state-of-the-art performance in skeleton-based anomaly behavior detection.
Article
Block Diagonal Representation (BDR) has attracted massive attention in subspace clustering, yet the high computational cost limits its widespread application. To address this issue, we propose a novel approach called Projective Block Diagonal Representation (PBDR), which rapidly pursues a representation matrix with the block diagonal structure. Firstly, an effective sampling strategy is utilized to select a small subset of the original large-scale data. Then, we learn a projection mapping to match the block diagonal representation matrix on the selected subset. After training, we employ the learned projection mapping to quickly generate the representation matrix with an ideal block diagonal structure for the original large-scale data. Additionally, we further extend the proposed PBDR model (i.e., PBDRℓ1 and PBDR*) by capturing the global or local structure of the data to enhance block diagonal coding capability. This paper also proves the effectiveness of the proposed model theoretically. Especially, this is the first work to directly learn a representation matrix with a block diagonal structure to handle the large-scale subspace clustering problem. Finally, experimental results on publicly available datasets show that our approaches achieve faster and more accurate clustering results compared to the state-of-the-art block diagonal-based subspace clustering approaches, which demonstrates its practical usefulness.
Article
Full-text available
Smart surveillance is a difficult task that is gaining popularity due to its direct link to human safety. Today, many indoor and outdoor surveillance systems are in use at public places and in smart cities. Because these systems are expensive to deploy, they are out of reach for the vast majority of the public and private sectors. Due to the lack of a precise definition of an anomaly, automated surveillance is a challenging task, especially when large amounts of data, such as 24/7 CCTV footage, must be processed. When implementing such systems in real-time environments, the high computational resource requirements of automated surveillance become a major bottleneck. Another challenge is to recognize anomalies accurately, since achieving high accuracy while reducing computational cost is even harder. To address these challenges, this research develops a system that is both efficient and cost-effective. Although 3D convolutional neural networks have proven to be accurate, they are prohibitively expensive for practical use, particularly in real-time surveillance. In this article, we present two contributions: a resource-efficient framework for anomaly recognition problems, and two-class and multi-class anomaly recognition on spatially augmented surveillance videos. This research aims to address the problem of computation overhead while maintaining recognition accuracy. The proposed Temporal based Anomaly Recognizer (TAR) framework combines a partial shift strategy with a 2D convolutional architecture-based model, namely MobileNetV2. Extensive experiments were carried out to evaluate the model's performance on the UCF Crime dataset; with MobileNetV2 as the baseline architecture, it achieved an accuracy of 88%, a 2.47% improvement over the available state of the art. The proposed framework achieves 52.7% accuracy for multiclass anomaly recognition on the UCF Crime2Local dataset. The proposed model has been tested in real-time camera stream settings and can handle six streams simultaneously without the need for additional resources.
Article
Video anomaly detection (VAD) mainly refers to identifying anomalous events that do not occur in the training set, where only normal samples are available. Existing works usually formulate VAD as a reconstruction or prediction problem. However, the adaptability and scalability of these methods are limited. In this paper, we propose a novel distance-based VAD method to take advantage of all the available normal data efficiently and flexibly. In our method, the smaller the distance between a testing sample and the normal samples, the higher the probability that the testing sample is normal. Specifically, we propose to use locality-sensitive hashing (LSH) to map samples whose similarity exceeds a certain threshold into the same bucket in advance. To utilize multiple hashes and further reduce computation and memory usage, we propose to use the hash codes rather than the features as the representations of the samples. In this manner, the complexity of near neighbor search is cut down significantly. To make samples that are semantically similar get closer and those that are not get further apart, we propose a novel learnable version of LSH that embeds LSH into a neural network and optimizes the hash functions with a contrastive learning strategy. The proposed method is robust to data imbalance, handles the large intra-class variations in normal data flexibly, and scales well. Extensive experiments demonstrate the superiority of our method, which achieves new state-of-the-art results on VAD benchmarks.
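Below is a sketch of the distance-based idea using classical random-projection LSH as a stand-in for the learnable LSH the paper proposes; scoring by minimum Hamming distance to hashed normal samples is an assumption consistent with the abstract's distance formulation.

```python
import numpy as np

def lsh_codes(X, planes):
    """X: (N, D) features; planes: (P, D) random hyperplanes.
    Sign pattern against the hyperplanes, packed into compact hash codes."""
    bits = (X @ planes.T) > 0
    return np.packbits(bits, axis=1)  # (N, ceil(P/8)) uint8 codes

def anomaly_scores(test_codes, normal_codes):
    """Minimum Hamming distance from each test code to any hashed normal
    sample; small distances indicate the sample matches normal patterns."""
    xor = test_codes[:, None, :] ^ normal_codes[None, :, :]
    hamming = np.unpackbits(xor, axis=2).sum(axis=2)
    return hamming.min(axis=1)
```

Comparing packed hash codes instead of raw features is what cuts the memory and computation of the near-neighbor search, as the abstract argues.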
Chapter
Full-text available
Anomaly detection is a classical problem in computer vision, namely the determination of the normal from the abnormal when datasets are highly biased towards one class (normal) due to the insufficient sample size of the other class (abnormal). While this can be addressed as a supervised learning problem, a significantly more challenging problem is detecting the unknown/unseen anomaly case, which takes us instead into the space of a one-class, semi-supervised learning paradigm. We introduce such a novel anomaly detection model, using a conditional generative adversarial network that jointly learns the generation of high-dimensional image space and the inference of latent space. Employing encoder-decoder-encoder sub-networks in the generator network enables the model to map the input image to a lower-dimensional vector, which is then used to reconstruct the generated output image. The additional encoder network maps this generated image to its latent representation. Minimizing the distance between these images and the latent vectors during training aids in learning the data distribution for the normal samples. As a result, a larger distance from this learned data distribution at inference time is indicative of an outlier from that distribution: an anomaly. Experimentation over several benchmark datasets, from varying domains, shows the model's efficacy and superiority over previous state-of-the-art approaches.
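At inference time, such encoder-decoder-encoder models typically score a sample by the gap between the two latent codes. A minimal sketch of that scoring rule follows, assuming PyTorch and already-trained placeholder modules enc1, dec and enc2 standing in for the two encoders and the decoder:

```python
import torch
import torch.nn as nn

def anomaly_score(x: torch.Tensor, enc1: nn.Module, dec: nn.Module,
                  enc2: nn.Module) -> torch.Tensor:
    """Distance between the latent code of the input and the latent code of
    its reconstruction. Training pulls these together for normal data, so a
    large gap at test time flags an outlier."""
    with torch.no_grad():
        z = enc1(x)          # input image -> latent vector
        x_hat = dec(z)       # latent vector -> reconstructed image
        z_hat = enc2(x_hat)  # reconstruction -> latent vector again
    # Mean absolute gap per sample (batch dimension preserved).
    return (z - z_hat).abs().flatten(1).mean(dim=1)
```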
Chapter
Full-text available
Real-time detection of irregularities in visual data is invaluable in many prospective applications, including surveillance and patient monitoring systems. With the surge of deep learning methods in recent years, researchers have tried a wide spectrum of methods for different applications. However, for irregularity or anomaly detection in videos, training an end-to-end model is still an open challenge, since irregularity is often not well-defined and there are not enough irregular samples to use during training. In this paper, inspired by the success of generative adversarial networks (GANs) for training deep models in unsupervised or self-supervised settings, we propose an end-to-end deep network for detection and fine localization of irregularities in videos (and images). Our proposed architecture is composed of two networks, which are trained to compete with each other while collaborating to find the irregularity. One network works as a pixel-level irregularity Inpainter (I), and the other works as a patch-level Detector (D). After an adversarial self-supervised training, in which I tries to fool D into accepting its inpainted output as regular (normal), the two networks collaborate to detect and fine-segment the irregularity in any given testing video. Our results on three different datasets show that our method can outperform the state-of-the-art and fine-segment the irregularity.
Article
Full-text available
The detection of abnormal behaviour in crowded scenes has to deal with many challenges. This paper presents an efficient method for detection and localization of anomalies in videos. Using fully convolutional neural networks (FCNs) and temporal data, a pre-trained supervised FCN is transferred into an unsupervised FCN, ensuring the detection of (global) anomalies in scenes. High speed and accuracy are achieved through cascaded detection, which reduces computational complexity. This FCN-based architecture addresses two main tasks: feature representation and cascaded outlier detection. Experimental results on two benchmarks suggest that the proposed method outperforms existing methods in terms of detection and localization accuracy.
Article
Full-text available
Surveillance videos are able to capture a variety of realistic anomalies. In this paper, we propose to learn anomalies by exploiting both normal and anomalous videos. To avoid annotating the anomalous segments or clips in training videos, which is very time-consuming, we propose to learn anomalies through a deep multiple instance ranking framework by leveraging weakly labeled training videos, i.e., the training labels (anomalous or normal) are at video level instead of clip level. In our approach, we consider normal and anomalous videos as bags and video segments as instances in multiple instance learning (MIL), and automatically learn a deep anomaly ranking model that predicts high anomaly scores for anomalous video segments. Furthermore, we introduce sparsity and temporal smoothness constraints in the ranking loss function to better localize anomalies during training. We also introduce a new large-scale, first-of-its-kind dataset of 128 hours of videos. It consists of 1900 long and untrimmed real-world surveillance videos, with 13 realistic anomalies such as fighting, road accident, burglary, robbery, etc., as well as normal activities. This dataset can be used for two tasks: first, general anomaly detection, considering all anomalies in one group and all normal activities in another; second, recognizing each of the 13 anomalous activities. Our experimental results show that our MIL method for anomaly detection achieves significant improvement in anomaly detection performance compared to state-of-the-art approaches. We provide the results of several recent deep learning baselines on anomalous activity recognition. The low recognition performance of these baselines reveals that our dataset is very challenging and opens more opportunities for future work.
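The ranking objective described above is compact enough to sketch. The version below is a hedged reading of the abstract, assuming PyTorch, per-segment scores already produced by the network, and illustrative regularization weights:

```python
import torch

def mil_ranking_loss(scores_anom: torch.Tensor,
                     scores_norm: torch.Tensor,
                     lambda_smooth: float = 8e-5,
                     lambda_sparse: float = 8e-5) -> torch.Tensor:
    """scores_anom / scores_norm: anomaly scores in [0, 1] for the segments
    of one anomalous and one normal video (the two bags).

    Hinge ranking: the top-scoring segment of the anomalous bag should
    outrank the top-scoring segment of the normal bag. Temporal smoothness
    and sparsity terms on the anomalous scores sharpen localization.
    The lambda values here are placeholders, not the paper's settings."""
    hinge = torch.relu(1.0 - scores_anom.max() + scores_norm.max())
    smooth = ((scores_anom[1:] - scores_anom[:-1]) ** 2).sum()
    sparse = scores_anom.sum()
    return hinge + lambda_smooth * smooth + lambda_sparse * sparse
```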
Article
Full-text available
Temporal action detection is a very important yet challenging problem, since videos in real applications are usually long, untrimmed and contain multiple action instances. This problem requires not only recognizing action categories but also detecting the start time and end time of each action instance. Many state-of-the-art methods adopt the "detection by classification" framework: first generate proposals, and then classify them. The main drawback of this framework is that the boundaries of action instance proposals are fixed during the classification step. To address this issue, we propose a novel Single Shot Action Detector (SSAD) network based on 1D temporal convolutional layers to skip the proposal generation step by directly detecting action instances in untrimmed video. In pursuit of a particular SSAD network that works effectively for temporal action detection, we empirically search for the best SSAD architecture, since no existing models can be directly adopted. Moreover, we investigate input feature types and fusion strategies to further improve detection accuracy. We conduct extensive experiments on two challenging datasets: THUMOS 2014 and MEXaction2. When setting the Intersection-over-Union threshold to 0.5 during evaluation, SSAD significantly outperforms other state-of-the-art systems, increasing mAP from 19.0% to 24.6% on THUMOS 2014 and from 7.4% to 11.0% on MEXaction2.
Conference Paper
Full-text available
We propose a novel framework for abnormal event detection in video that is based on deep features extracted with pre-trained convolutional neural networks (CNN). The CNN features are fed into a one-class Support Vector Machine (SVM) classifier in order to learn a model of normality from training data. We compare our approach with several state-of-the-art methods on two benchmark data sets, namely the Avenue data set and the UMN data set. The empirical results indicate that our abnormal event detection framework can reach state-of-the-art results, while running in real time at 20 frames per second.
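The pipeline is simple to reproduce in spirit: extract features with a frozen CNN, then fit a one-class SVM on the normal data. A minimal sketch with scikit-learn, using random arrays as stand-ins for the deep features:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Stand-in for CNN activations of normal training frames (one row per frame).
train_feats = rng.standard_normal((500, 512))

# nu upper-bounds the fraction of training points treated as outliers.
model = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(train_feats)

test_feats = rng.standard_normal((10, 512))
# decision_function > 0 reads as "normal"; lower values are more abnormal.
print(model.decision_function(test_feats))
```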
Conference Paper
Full-text available
Obtaining models that capture imaging markers relevant for disease progression and treatment monitoring is challenging. Models are typically based on large amounts of data with annotated examples of known markers, aiming at automating detection. High annotation effort and the limitation to a vocabulary of known markers limit the power of such approaches. Here, we perform unsupervised learning to identify anomalies in imaging data as candidates for markers. We propose AnoGAN, a deep convolutional generative adversarial network that learns a manifold of normal anatomical variability, accompanied by a novel anomaly scoring scheme based on the mapping from image space to a latent space. Applied to new data, the model labels anomalies and scores image patches, indicating their fit into the learned distribution. Results on optical coherence tomography images of the retina demonstrate that the approach correctly identifies anomalous images, such as images containing retinal fluid or hyperreflective foci.
Article
Full-text available
We propose a novel framework for abnormal event detection in video that requires no training sequences. Our framework is based on unmasking, a technique previously used for authorship verification in text documents, which we adapt to our task. We iteratively train a binary classifier to distinguish between two consecutive video sequences while removing at each step the most discriminant features. Higher training accuracy rates of the intermediate classifiers indicate abnormal events. To the best of our knowledge, this is the first work to apply unmasking to a computer vision task. We compare our method with several state-of-the-art supervised and unsupervised methods on four benchmark data sets. The empirical results indicate that our abnormal event detection framework can achieve state-of-the-art results, while running in real time at 20 frames per second.
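The unmasking loop itself is short. Here is a hedged sketch with scikit-learn: train a linear classifier to separate two feature sets, record the accuracy, zero out the most discriminant features, and repeat; a slowly decaying accuracy curve signals that the two windows differ pervasively:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def unmasking_curve(feats_a: np.ndarray, feats_b: np.ndarray,
                    n_rounds: int = 10, k_drop: int = 5) -> list:
    """feats_a / feats_b: features of two consecutive video sequences.
    Returns the training accuracies over the unmasking rounds; if they stay
    high after repeatedly deleting the strongest features, the sequences
    differ in many features, which is read as an abnormal event."""
    X = np.vstack([feats_a, feats_b]).astype(float)
    y = np.r_[np.zeros(len(feats_a)), np.ones(len(feats_b))]
    accs = []
    for _ in range(n_rounds):
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        accs.append(clf.score(X, y))
        top = np.argsort(-np.abs(clf.coef_[0]))[:k_drop]
        X[:, top] = 0.0  # "unmask": remove the most discriminant features
    return accs
```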
Conference Paper
Full-text available
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles for designing effective ConvNet architectures for action recognition in videos and to learn these models given limited training samples. Our first contribution is the temporal segment network (TSN), a novel framework for video-based action recognition, which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. The other contribution is our study of a series of good practices for learning ConvNets on video data with the help of the temporal segment network. Our approach obtains state-of-the-art performance on the HMDB51 (69.4%) and UCF101 (94.2%) datasets. We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of the temporal segment network and the proposed good practices.
Article
Full-text available
We present a novel unsupervised deep learning framework for anomalous event detection in complex video scenes. While most existing works merely use hand-crafted appearance and motion features, we propose the Appearance and Motion DeepNet (AMDN), which utilizes deep neural networks to automatically learn feature representations. To exploit the complementary information of both appearance and motion patterns, we introduce a novel double fusion framework, combining the benefits of traditional early fusion and late fusion strategies. Specifically, stacked denoising autoencoders are proposed to separately learn appearance and motion features as well as a joint representation (early fusion). Based on the learned representations, multiple one-class SVM models are used to predict the anomaly scores of each input, which are then integrated with a late fusion strategy for final anomaly detection. We evaluate the proposed method on two publicly available video surveillance datasets, showing competitive performance with respect to state-of-the-art approaches.
Article
Full-text available
Image-based classification of histology sections, in terms of distinct components (e.g., tumor, stroma, normal), provides a series of indices for histology composition (e.g., the percentage of each distinct component in histology sections), and enables the study of nuclear properties within each component. Furthermore, the study of these indices, constructed from each whole slide image in a large cohort, has the potential to provide predictive models of clinical outcome. For example, correlations can be established between the constructed indices and the patients' survival information at cohort level, which is a fundamental step towards personalized medicine. However, the performance of existing techniques is hindered by the large technical variations (e.g., variations of color/texture in tissue images due to non-standard experimental protocols) and biological heterogeneities (e.g., cell type, cell state) that are always present in a large cohort. We propose a system that automatically learns a series of dictionary elements for representing the underlying spatial distribution using stacked predictive sparse decomposition. The learned representation is then fed into the spatial pyramid matching framework with a linear support vector machine classifier. The system has been evaluated for classification of distinct histological components for two cohorts of tumor types. Throughput has been increased by use of a graphics processing unit (GPU), and evaluation indicates superior performance compared with previous research.
Article
Full-text available
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (approx 2 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
Article
Full-text available
We introduce UCF101 which is currently the largest dataset of human actions. It consists of 101 action classes, over 13k clips and 27 hours of video data. The database consists of realistic user uploaded videos containing camera motion and cluttered background. Additionally, we provide baseline action recognition results on this new dataset using standard bag of words approach with overall performance of 44.5%. To the best of our knowledge, UCF101 is currently the most challenging dataset of actions due to its large number of classes, large number of clips and also unconstrained nature of such clips.
Conference Paper
Full-text available
We propose to detect abnormal events via sparse reconstruction over normal bases. Given an over-complete normal basis set (e.g., an image sequence or a collection of local spatio-temporal patches), we introduce the sparse reconstruction cost (SRC) over the normal dictionary to measure the normality of the testing sample. To condense the size of the dictionary, a novel dictionary selection method is designed with a sparsity consistency constraint. By introducing the prior weight of each basis during sparse reconstruction, the proposed SRC is more robust than other outlier detection criteria. Our method provides a unified solution to detect both local abnormal events (LAE) and global abnormal events (GAE). We further extend it to support online abnormal event detection by updating the dictionary incrementally. Experiments on three benchmark datasets and comparison to state-of-the-art methods validate the advantages of our algorithm.
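A minimal version of a sparse reconstruction cost can be written with an off-the-shelf sparse solver; the sketch below (scikit-learn's orthogonal matching pursuit, a generic choice rather than the paper's weighted formulation) returns the residual of a test feature against a normal dictionary:

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def sparse_reconstruction_cost(dictionary: np.ndarray, x: np.ndarray,
                               n_nonzero: int = 10) -> float:
    """dictionary: (dim, n_atoms) matrix of normal bases; x: (dim,) test
    feature. Fit a sparse code over the normal dictionary and return the
    squared reconstruction residual; large values suggest an abnormal event."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero, fit_intercept=False)
    omp.fit(dictionary, x)
    residual = x - dictionary @ omp.coef_
    return float(np.linalg.norm(residual) ** 2)
```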
Conference Paper
Full-text available
Suppose you are given some dataset drawn from an underlying probability distribution P and you want to estimate a "simple" subset S of input space such that the probability that a test point drawn from P lies outside of S equals some a priori specified value between 0 and 1. We propose a method to approach this problem by trying to estimate a function f which is positive on S and negative on the complement. The functional form of f is given by a kernel expansion in terms of a potentially small subset of the training data; it is regularized by controlling the length of the weight vector in an associated feature space. We provide a theoretical analysis of the statistical performance of our algorithm. The algorithm is a natural extension of the support vector algorithm to the case of unlabelled data.
Article
How can one build a generic deep one-class (DeepOC) model to solve one-class classification problems for anomaly detection, such as anomalous event detection in complex scenes? The characteristics of existing one-class labels lead to a dilemma: it is hard to directly use a multi-class classifier based on deep neural networks to solve one-class classification problems. Therefore, in this article, we propose a novel deep one-class neural network, termed DeepOC, which can simultaneously learn compact feature representations and train a one-class classifier. Using only the given normal samples, we employ a stacked convolutional encoder to generate their low-dimensional high-level features and train a one-class classifier to make these features as compact as possible. Meanwhile, to preserve a correct mapping relation and the diversity of the feature representations, we utilize a decoder to reconstruct raw samples from these low-dimensional feature representations. This structure is gradually established using an adversarial mechanism during the training stage. This mechanism is the key to our model: it organically combines two seemingly contradictory components and allows them to take advantage of each other, thus making the model robust and effective. Unlike methods that use handcrafted features or those that are separated into two stages (extracting features and training classifiers), DeepOC is a one-stage model using reliable features that are automatically extracted by neural networks. Experiments on various benchmark data sets show that DeepOC is feasible and achieves state-of-the-art anomaly detection results compared with a dozen existing methods.
Article
Sparse coding based anomaly detection has shown promising performance, the keys of which are feature learning, sparse representation, and dictionary learning. In this work, we propose a new neural network for anomaly detection (termed AnomalyNet) by deeply achieving feature learning, sparse representation, and dictionary learning in three joint neural processing blocks. Specifically, to learn better features, we design a motion fusion block accompanied by a feature transfer block to enjoy the advantages of eliminating noisy background, capturing motion, and alleviating data deficiency. Furthermore, to address some disadvantages (e.g., nonadaptive updating) of existing sparse coding optimizers and embrace the merits of neural networks (e.g., parallel computing), we design a novel recurrent neural network to learn the sparse representation and dictionary by proposing an adaptive iterative hard-thresholding algorithm (adaptive ISTA) and reformulating the adaptive ISTA as a new long short-term memory (LSTM). To the best of our knowledge, this could be one of the first works to bridge the l1-solver and LSTM, and it may provide novel insight into understanding LSTM and model-based optimization (also called differentiable programming), as well as sparse coding based anomaly detection. Extensive experiments show the state-of-the-art performance of our method on the abnormal event detection task.
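For context, the classical ISTA iteration that adaptive variants build on alternates a gradient step with soft-thresholding. A plain NumPy sketch follows (the paper's adaptive, LSTM-reformulated solver is not reproduced here):

```python
import numpy as np

def soft_threshold(v: np.ndarray, tau: float) -> np.ndarray:
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(D: np.ndarray, x: np.ndarray, lam: float = 0.1,
         n_iter: int = 100) -> np.ndarray:
    """Solve min_z 0.5*||x - D z||^2 + lam*||z||_1 by ISTA.
    D: (dim, n_atoms) dictionary; x: (dim,) signal. The step size 1/L uses
    the Lipschitz constant L = ||D||_2^2 of the data-fit gradient."""
    L = np.linalg.norm(D, ord=2) ** 2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)          # gradient of the data-fit term
        z = soft_threshold(z - grad / L, lam / L)
    return z
```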
Article
Network representation learning (NRL) aims to map the vertices of a network into a low-dimensional space which preserves the network structure and its inherent properties. Most existing methods for network representation adopt shallow models which have relatively limited capacity to capture highly non-linear network structures, resulting in sub-optimal network representations. Therefore, it is nontrivial to explore how to effectively capture highly non-linear network structure and preserve the global and local structure in NRL. To solve this problem, in this paper we propose a new graph convolutional autoencoder architecture based on a depth-based representation of graph structure, referred to as the depth-based subgraph convolutional autoencoder (DS-CAE), which integrates both the global topological and local connectivity structures within a graph. Our idea is to first decompose a graph into a family of K-layer expansion subgraphs rooted at each vertex, aiming to better capture long-range vertex inter-dependencies. Then a set of convolution filters slides over the entire sets of subgraphs of a vertex to extract the local structural connectivity information. This is analogous to the standard convolution operation on grid data. In contrast to most existing models for unsupervised learning on graph-structured data, our model can capture highly non-linear structure by simultaneously integrating node features and network structure into network representation learning. This significantly improves the predictive performance on a number of benchmark datasets.
Article
We propose a sparse Convolutional Autoencoder (CAE) for simultaneous nucleus detection and feature extraction in histopathology tissue images. Our CAE detects and encodes nuclei in image patches in tissue images into sparse feature maps that encode both the location and appearance of nuclei. A primary contribution of our work is the development of an unsupervised detection network by using the characteristics of histopathology image patches. The pretrained nucleus detection and feature extraction modules in our CAE can be fine-tuned for supervised learning in an end-to-end fashion. We evaluate our method on four datasets and achieve state-of-the-art results. In addition, we are able to achieve comparable performance with only 5% of the fully-supervised annotation cost.
Article
Anomaly detection in videos refers to the identification of events that do not conform to expected behavior. However, almost all existing methods tackle the problem by minimizing the reconstruction errors of training data, which cannot guarantee a larger reconstruction error for an abnormal event. In this paper, we propose to tackle the anomaly detection problem within a video prediction framework. To the best of our knowledge, this is the first work that leverages the difference between a predicted future frame and its ground truth to detect an abnormal event. To predict a future frame with higher quality for normal events, other than the commonly used appearance (spatial) constraints on intensity and gradient, we also introduce a motion (temporal) constraint in video prediction by enforcing the optical flow between predicted frames and ground truth frames to be consistent; this is the first work that introduces a temporal constraint into the video prediction task. Such spatial and motion constraints facilitate the future frame prediction for normal events, and consequently help to identify abnormal events that do not conform to the expectation. Extensive experiments on both a toy dataset and some publicly available datasets validate the effectiveness of our method in terms of robustness to the uncertainty in normal events and sensitivity to abnormal events.
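Prediction-based detectors of this kind commonly score each test frame by the PSNR between the predicted frame and its ground truth, then normalize per video. A small sketch under those assumptions:

```python
import numpy as np

def psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 1.0) -> float:
    """PSNR between a predicted frame and its ground truth (float arrays
    scaled to [0, max_val]). Normal frames are predicted well and score
    high; poorly predicted (abnormal) frames score low."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_val ** 2 / (mse + 1e-12))

def normality_scores(psnrs: np.ndarray) -> np.ndarray:
    # Min-max normalize the per-frame PSNRs of one video to [0, 1];
    # thresholding this score separates normal from abnormal frames.
    return (psnrs - psnrs.min()) / (psnrs.max() - psnrs.min() + 1e-12)
```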
Article
This paper addresses the problem of joint detection and recounting of abnormal events in videos. Recounting of abnormal events, i.e., explaining why they are judged to be abnormal, is an unexplored but critical task in video surveillance, because it helps human observers quickly judge if they are false alarms or not. To describe the events in the human-understandable form for event recounting, learning generic knowledge about visual concepts (e.g., object and action) is crucial. Although convolutional neural networks (CNNs) have achieved promising results in learning such concepts, it remains an open question as to how to effectively use CNNs for abnormal event detection, mainly due to the environment-dependent nature of the anomaly detection. In this paper, we tackle this problem by integrating a generic CNN model and environment-dependent anomaly detectors. Our approach first learns CNN with multiple visual tasks to exploit semantic information that is useful for detecting and recounting abnormal events. By appropriately plugging the model into anomaly detectors, we can detect and recount abnormal events while taking advantage of the discriminative power of CNNs. Our approach outperforms the state-of-the-art on Avenue and UCSD Ped2 benchmarks for abnormal event detection and also produces promising results of abnormal event recounting.
Technical Report
TensorFlow [1] is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.
Conference Paper
Learning to predict future images from a video sequence involves the construction of an internal representation that models the image evolution accurately, and therefore, to some degree, its content and dynamics. This is why pixel-space video prediction is viewed as a promising avenue for unsupervised feature learning. In this work, we train a convolutional network to generate future frames given an input sequence. To deal with the inherently blurry predictions obtained from the standard Mean Squared Error (MSE) loss function, we propose three different and complementary feature learning strategies: a multi-scale architecture, an adversarial training method, and an image gradient difference loss function. We compare our predictions to different published results based on recurrent neural networks on the UCF101 dataset.
Article
Abnormal event detection is now a widely concerned research topic, especially for crowded scenes. In recent years, many dictionary learning algorithms have been developed to learn normal event regularities, and have presented promising performance for abnormal event detection. However, they seldom consider the structural information, which plays important roles in many computer vision tasks, such as image denoising and segmentation. In this paper, structural information is explored within a sparse representation framework. On the one hand, we introduce a new concept named reference event, which indicates the potential event patterns in normal video events. Compared with abnormal events, normal ones are more likely to approximate these reference events. On the other hand, a smoothness regularization is constructed to describe the relationships among video events. The relationships consist of both similarities in the feature space and relative positions in the video sequences. In this case, video events related to each other are more likely to possess similar representations. The structured dictionary and sparse representation coefficients are optimized through an iterative updating strategy. In the testing phase, abnormal events are identified as samples which cannot be well represented using the learned dictionary. Extensive experiments and comparisons with state-of-the-art algorithms have been conducted to prove the effectiveness of the proposed algorithm.
Article
The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data, with 400 human action classes and over 400 clips per class, and is collected from realistic, challenging YouTube videos. We provide an analysis on how current architectures fare on the task of action classification on this dataset and how much performance improves on the smaller benchmark datasets after pre-training on Kinetics. We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation: filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, making it possible to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and even their parameters. We show that, after pre-training on Kinetics, I3D models considerably improve upon the state-of-the-art in action classification, reaching 80.7% on HMDB-51 and 98.0% on UCF-101.
Article
This paper proposes a fast and reliable method for anomaly detection and localization in video data showing crowded scenes. Time-efficient anomaly localization is an ongoing challenge and the subject of this paper. We propose a cubic-patch-based method, characterised by a cascade of classifiers, which makes use of an advanced feature-learning approach. Our cascade of classifiers has two main stages. First, a light but deep 3D auto-encoder is used for early identification of "many" normal cubic patches. This deep network operates on small cubic patches as the first stage, before carefully resizing the remaining candidates of interest and evaluating those at the second stage using a more complex and deeper 3D convolutional neural network (CNN). We divide the deep auto-encoder and the CNN into multiple sub-stages which operate as cascaded classifiers. Shallow layers of the cascaded deep networks (designed as Gaussian classifiers, acting as weak single-class classifiers) detect "simple" normal patches, such as background patches, while more complex normal patches are detected at deeper layers. It is shown that the proposed novel technique (a cascade of two cascaded classifiers) performs comparably to current top-performing detection and localization methods on standard benchmarks, but generally outperforms them with respect to required computation time.
Article
Anomaly detection is still a challenging task for video surveillance due to complex environments and unpredictable human behaviors. Most existing approaches train offline detectors using manually labeled data and predefined parameters, and are hard to adapt to changing scenes. This paper introduces a neural network based model called online Growing Neural Gas (online GNG) to perform unsupervised learning. Unlike a parameter-fixed GNG, our model updates learning parameters continuously, for which we propose several online neighbor-related strategies. Specific operations, namely neuron insertion, deletion, learning rate adaptation and stopping criteria selection, are upgraded to online modes. In the anomaly detection stage, behavior patterns far away from our model are labeled as anomalous, where "far away" is measured by a time-varying threshold. Experiments are implemented on three surveillance datasets, namely UMN, UCSD Ped1/Ped2 and the Avenue dataset. All datasets have changing scenes due to mutable crowd density and behavior types. Anomaly detection results show that our model can adapt to the current scene rapidly and reduce false alarms while still detecting most anomalies. Quantitative comparisons with 12 recent approaches further confirm our superiority.
Conference Paper
We address an anomaly detection setting in which training sequences are unavailable and anomalies are scored independently of temporal ordering. Current algorithms in anomaly detection are based on the classical density estimation approach of learning high-dimensional models and finding low-probability events. These algorithms are sensitive to the order in which anomalies appear and require either training data or early context assumptions that do not hold for longer, more complex videos. By defining anomalies as examples that can be distinguished from other examples in the same video, our definition inspires a shift in approaches from classical density estimation to simple discriminative learning. Our contributions include a novel framework for anomaly detection that is (1) independent of temporal ordering of anomalies, and (2) unsupervised, requiring no separate training sequences. We show that our algorithm can achieve state-of-the-art results even when we adjust the setting by removing training sequences from standard datasets.
Article
In this work, we propose an unsupervised approach for crowd scene anomaly detection and localization using a social network model. Using a window-based approach, a video scene is first partitioned at spatial and temporal levels, and a set of spatio-temporal cuboids is constructed. Objects exhibiting scene dynamics are detected and the crowd behavior in each cuboid is modeled using local social networks (LSN). From these local social networks, a global social network (GSN) is built for the current window to represent the global behavior of the scene. As the scene evolves with time, the global social network is updated accordingly using LSNs, to detect and localize abnormal behaviors. We demonstrate the effectiveness of the proposed Social Network Model (SNM) approach on a set of benchmark crowd analysis video sequences. The experimental results reveal that the proposed method outperforms the majority, if not all, of the state-of-the-art methods in terms of accuracy of anomaly detection.
Article
We consider image transformation problems, where an input image is transformed into an output image. Recent methods for such problems typically train feed-forward convolutional neural networks using a per-pixel loss between the output and ground-truth images. Parallel work has shown that high-quality images can be generated by defining and optimizing perceptual loss functions based on high-level features extracted from pretrained networks. We combine the benefits of both approaches, and propose the use of perceptual loss functions for training feed-forward networks for image transformation tasks. We show results on image style transfer, where a feed-forward network is trained to solve, in real time, the optimization problem proposed by Gatys et al. Compared to the optimization-based method, our network gives similar qualitative results but is three orders of magnitude faster. We also experiment with single-image super-resolution, where replacing a per-pixel loss with a perceptual loss gives visually pleasing results.
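A feature-reconstruction (perceptual) loss is easy to sketch with a frozen pretrained network; the following assumes torchvision's VGG16 and compares activations up to the relu2_2 layer, one common choice:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class PerceptualLoss(nn.Module):
    """Compare activations of a frozen, ImageNet-pretrained VGG16 instead
    of raw pixels. layer_idx=8 stops at relu2_2 in the features stack."""
    def __init__(self, layer_idx: int = 8):
        super().__init__()
        feats = vgg16(weights="IMAGENET1K_V1").features[:layer_idx + 1]
        for p in feats.parameters():
            p.requires_grad_(False)   # the loss network stays fixed
        self.feats = feats.eval()

    def forward(self, output: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        return nn.functional.mse_loss(self.feats(output), self.feats(target))
```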
Article
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
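The normalization itself is a few lines; a training-time forward pass in NumPy, with gamma and beta as the learned scale and shift:

```python
import numpy as np

def batch_norm_forward(x: np.ndarray, gamma: np.ndarray, beta: np.ndarray,
                       eps: float = 1e-5) -> np.ndarray:
    """x: (batch, features). Normalize each feature with the statistics of
    the current mini-batch, then apply the learned affine transform."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```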
Article
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions. The method is straightforward to implement and is based on adaptive estimates of lower-order moments of the gradients. The method is computationally efficient, has low memory requirements, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The method exhibits invariance to diagonal rescaling of the gradients by adapting to the geometry of the objective function. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. We demonstrate that Adam works well in practice when compared experimentally to other stochastic optimization methods.
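The update rule is compact enough to state directly; one Adam step in NumPy with the paper's default hyper-parameters:

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m and v are running estimates of the first and
    second moments of the gradient; t is the 1-based step count used for
    initialization-bias correction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```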
Article
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Conference Paper
Speedy abnormal event detection meets the growing demand to process an enormous number of surveillance videos. Based on the inherent redundancy of video structures, we propose an efficient sparse combination learning framework. It achieves decent performance in the detection phase without compromising result quality. The short running time is guaranteed because the new method effectively turns the original complicated problem into one in which only a few cheap, small-scale least-squares optimization steps are involved. Our method reaches high detection rates on benchmark datasets at a speed of 140-150 frames per second on average when computing on an ordinary desktop PC using MATLAB.
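The test-time step that makes this fast is a handful of closed-form least-squares fits. A hedged sketch (NumPy; the combination matrices are assumed to have been learned from normal data beforehand):

```python
import numpy as np

def min_combination_error(x: np.ndarray, combos: list) -> float:
    """combos: list of (dim, k) basis matrices, each a learned sparse
    combination of normal structures. Every check is a tiny least-squares
    projection, so testing stays cheap; a large minimum residual over all
    combinations flags x as abnormal."""
    best = np.inf
    for D in combos:
        coef, *_ = np.linalg.lstsq(D, x, rcond=None)  # project x onto span(D)
        best = min(best, float(np.linalg.norm(x - D @ coef) ** 2))
    return best
```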