Article

A Deep One-Class Neural Network for Anomalous Event Detection in Complex Scenes

Abstract

How can one build a generic deep one-class (DeepOC) model to solve one-class classification problems for anomaly detection, such as anomalous event detection in complex scenes? The characteristics of existing one-class labels lead to a dilemma: it is hard to directly use a multi-class classifier based on deep neural networks to solve one-class classification problems. Therefore, in this article, we propose a novel deep one-class neural network, termed DeepOC, which can simultaneously learn compact feature representations and train a one-class classifier. Given only normal samples, we use a stacked convolutional encoder to generate their low-dimensional high-level features and train a one-class classifier to make these features as compact as possible. Meanwhile, for the sake of a correct mapping relation and diversity of the feature representations, we utilize a decoder to reconstruct raw samples from these low-dimensional feature representations. This structure is gradually established using an adversarial mechanism during the training stage. This mechanism is the key to our model: it organically combines two seemingly contradictory components and allows them to take advantage of each other, thus making the model robust and effective. Unlike methods that use handcrafted features or that are separated into two stages (extracting features and training classifiers), DeepOC is a one-stage model using reliable features that are automatically extracted by neural networks. Experiments on various benchmark data sets show that DeepOC is feasible and achieves state-of-the-art anomaly detection results compared with a dozen existing methods.
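As a rough sketch of the joint objective described above, the following PyTorch snippet combines a one-class compactness term (pulling encoded normal samples toward a fixed center) with a reconstruction term (preserving the mapping relation and preventing feature collapse). The fully connected architecture, the fixed center, and the loss weighting lam are illustrative assumptions; the paper's stacked convolutional encoder and adversarial training mechanism are not reproduced here.

```python
import torch
import torch.nn as nn

class DeepOCSketch(nn.Module):
    """Toy encoder-decoder for a DeepOC-style one-class objective."""
    def __init__(self, in_dim=1024, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, in_dim),
        )

    def forward(self, x):
        z = self.encoder(x)          # low-dimensional high-level features
        return z, self.decoder(z)    # features and reconstruction

def deepoc_loss(model, x, center, lam=1.0):
    # Compactness: normal features should cluster around the fixed center.
    # Reconstruction: the decoder must still recover the raw input,
    # which keeps the representations informative and diverse.
    z, x_hat = model(x)
    compact = ((z - center) ** 2).sum(dim=1).mean()
    recon = ((x_hat - x) ** 2).mean()
    return compact + lam * recon
```

At test time, the distance of an encoded sample from the center (optionally combined with its reconstruction error) would serve as the anomaly score.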

... However, the end-to-end pipeline provides high accuracy at the cost of processing time. The authors propose a unique neural network called DeepOC [101], which has the capability to learn compact feature representations and train a DeepOC classifier simultaneously. The stacked convolutional encoder is employed only with the provided normal samples in order to produce low-dimensional high-level features. ...
... The pixel-level EER (PL-EER) is the percentage of misclassified pixels at the operating point where the pixel-level False Positive Rate (FPR) equals the pixel-level False Negative Rate (FNR) [2]. It is well suited to quantifying the performance of VAL [101]. ...
... Patch- or block-based VAL [52], [84], [85], [88]-[92], [94], [100], [101], [103], [105], [107]. Metrics: PL-AUROC, PL-AUPR, PL-EER, PL-F1 score, outside-inside error ratio. Provides better accuracy with competitive processing speed; easy to integrate into DL-based VAD methods. ...
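For reference, the EER quoted in the excerpts above can be computed from anomaly scores and ground-truth labels via the ROC curve; this generic scikit-learn helper is a sketch, not code from the surveyed papers.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    # EER is the operating point where the false positive rate equals
    # the false negative rate (FNR = 1 - TPR) along the ROC curve.
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))
    return (fpr[idx] + fnr[idx]) / 2.0
```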
Article
Full-text available
Video anomaly detection and localization is the process of spatiotemporally localizing the anomalous video segment corresponding to the abnormal event or activities. It is challenging due to the inherent ambiguity of anomalies, diverse environmental factors, the intricate nature of human activities, and the absence of adequate datasets. Further, the spatial localization of the video anomalies (video anomaly localization) after the temporal localization of the video anomalies (video anomaly detection) is also a complex task. Video anomaly localization is essential for pinpointing the anomalous event or object in the spatial domain. Hence, the intelligent video surveillance system must have video anomaly detection and localization as key functionalities. However, the state-of-the-art lacks a dedicated survey of video anomaly localization. Hence, this article comprehensively surveys the cutting-edge approaches for video anomaly localization, associated threshold selection strategies, publicly available datasets, performance evaluation criteria, and open trending research challenges with potential solution strategies.
... Semi-supervised VAD. The advent of deep learning revolutionized the field of semi-supervised VAD, with the mainstream of research focusing on convolutional neural networks (CNNs) [10,22,29,35,43,61,66,69,75], recurrent neural networks (RNNs) [51,74], and transformers [63,73], with many of these approaches adopting self-supervised learning principles. For example, several studies [17,38,77] utilize 2D-CNNs, 3D-CNNs, and RNN-based autoencoders to reconstruct normal events and identify abnormal events based on the magnitude of the reconstruction error. ...
... For instance, Li et al. [31] divided the visual field into overlapping regions and learned a global mixture model using only patches around the current frame, with regions least similar to their surroundings deemed most likely to be abnormal. Wu et al. [66] similarly divided the visual field into overlapping regions and trained a deep one-class model to discriminate abnormal regions. ...
Preprint
Full-text available
The current weakly supervised video anomaly detection (WSVAD) task aims to achieve frame-level anomalous event detection with only coarse video-level annotations available. Existing works typically involve extracting global features from full-resolution video frames and training frame-level classifiers to detect anomalies in the temporal dimension. However, most anomalous events tend to occur in localized spatial regions rather than the entire video frame, which implies existing frame-level-feature-based works may be misled by the dominant background information and lack the interpretation of the detected anomalies. To address this dilemma, this paper introduces a novel method called STPrompt that learns spatio-temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs). Our proposed method employs a two-stream network structure, with one stream focusing on the temporal dimension and the other primarily on the spatial dimension. By leveraging the learned knowledge from pre-trained VLMs and incorporating natural motion priors from raw videos, our model learns prompt embeddings that are aligned with spatio-temporal regions of videos (e.g., patches of individual frames) to identify specific local regions of anomalies, enabling accurate video anomaly detection while mitigating the influence of background information. Without relying on detailed spatio-temporal annotations or auxiliary object detection/tracking, our method achieves state-of-the-art performance on three public benchmarks for the WSVADL task.
... Due to the rarity and context-dependence of abnormal events, it is nearly impossible to collect a sufficiently representative set of such events, eliminating the possibility of applying traditional approaches based on supervised learning to perform video anomaly detection. Without the supervised option at hand, researchers turned their attention to alternative solutions, the most prominent alternative being based on outlier detection (Cong et al., 2011; Dutta and Banerjee, 2015; Hasan et al., 2016; Ionescu et al., 2019b; Kim and Grauman, 2009; Lee et al., 2019; Liu et al., 2018a; Lu et al., 2013; Luo et al., 2017; Park et al., 2020; Ramachandra et al., 2022; Sabokrou et al., 2017; Wu et al., 2019; Zhong et al., 2019). Methods based on outlier detection learn ...
... Baselines. We compare our method with state-of-the-art frame-level (Astrid et al., 2021a,b; Gong et al., 2019; Ionescu et al., 2019b; Liu et al., 2018a,b, 2021; Madan et al., 2023; Nguyen and Meunier, 2019; Park et al., 2020; Ravanbakhsh et al., 2017, 2018; Ristea et al., 2022; Smeureanu et al., 2017; Sultani et al., 2018; Sun et al., 2020; Tang et al., 2020; Wang et al., 2020; Wu et al., 2019, 2022; Yu et al., 2022a; Zhang et al., 2016) and object-level (Bȃrbȃlȃu et al., 2023; Doshi and Yilmaz, 2020; Georgescu et al., 2021, 2022b; Ionescu et al., 2019a; Madan et al., 2023; Ristea et al., 2022; Wang et al., 2022; Yu et al., 2020) frameworks. Hyperparameters. ...
... Hence, in such complex event detection, various challenges are posed, which imposes the necessity for efficient algorithms. These algorithms must enable the building of an MED (Multi-Media Event Detection) model that can be used practically in complex-event detection [2]. ...
... Focusing on these training-data-related challenges, two kinds of approaches are at the forefront: synthetic data generation and transfer learning. In transfer learning, an existing pre-trained model is fine-tuned on event-related images [2]. Concurrently, in synthetic data augmentation, training sets are populated with synthetic images generated by various techniques, including rotation and cropping [4]. ...
Article
Full-text available
Purpose: Interoception is a combination of the sensation, integration, and interpretation of internal bodily signals. Interoception is bidirectionally related to human mental and physiological health and well-being. Sleep and different interoceptive modalities have been shown to be related. The Heartbeat Evoked Potential (HEP) is known as a robust readout of interoceptive processes. In this study, we focused on the relation between HEP modulations and sleep-related disorders. Materials and Methods: We investigated four different sleep-related disorders, including insomnia, rapid eye movement behavior disorder, periodic limb movements, and nocturnal frontal lobe epilepsy, and provided HEP signals of multiple Electroencephalogram (EEG) channels over the right hemisphere to compare these disorders with the control group. Here, we investigated and compared the results of 35 subjects, including seven subjects for the control group and seven subjects for each of the above-mentioned sleep disorders. Results: By comparing HEP responses of the control group with the sleep-related patients' groups, statistically significant HEP differences were detected over right hemisphere EEG channels, including the FP2, F4, C4, P4, and O2 channels. These significant differences were also observed in the grand average HEP amplitude activity of channels over the right hemisphere in the sleep-related disorders. Conclusion: Our comparisons between the control group and groups of patients suffering from sleep-related disorders demonstrated that during different stages of sleep, HEPs show significant differences over multiple right hemisphere EEG channels. Interestingly, when comparing different sleep disorders with each other, we observed that the patterns of these HEP differences over specific channels and during certain sleep stages bear considerable resemblance to one another.
... The key to unsupervised learning-based methods lies in extracting representative, discriminative, and accurate features that reflect the actual characteristics of video data. To address this issue, some researchers [2][3][4][5][6] have employed approaches that focus on extracting local features for anomaly detection. Additionally, other researchers have concentrated their attention on global information. ...
... Let R_u denote the score of the u-th frame, where min(R_u) represents the minimum score among all scores, and max(R_u) represents the maximum score among all scores. The normalized reconstruction score for the u-th frame is calculated as shown in Eq. (6). ...
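Read as standard min-max scaling, the per-frame normalization quoted above can be sketched in NumPy as follows; the epsilon guard against division by zero is an added assumption.

```python
import numpy as np

def normalize_scores(R):
    # Min-max normalize per-frame reconstruction scores to [0, 1],
    # in the spirit of the Eq. (6) form quoted above.
    R = np.asarray(R, dtype=float)
    return (R - R.min()) / (R.max() - R.min() + 1e-12)
```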
Article
Full-text available
Video anomaly event detection is crucial for analyzing surveillance videos. Existing methods have limitations: frame-level detection fails to remove background interference, and object-level methods overlook object-environment interaction. To address these issues, this paper proposes a novel video anomaly event detection algorithm based on a dual-channel autoencoder with key region feature enhancement. The goal is to preserve valuable information in the global context while focusing on regions with a high anomaly occurrence. Firstly, a key region extraction network is proposed to perform foreground segmentation on video frames, eliminating background redundancy. Secondly, a dual-channel autoencoder is designed to enhance the features of key regions, enabling the model to extract more representative features. Finally, channel attention modules are inserted between each deconvolution layer of the decoder to enhance the model’s perception and discrimination of valuable information. Compared to existing methods, our approach accurately locates and focuses on regions with a high anomaly occurrence, improving the accuracy of anomaly event detection. Extensive experiments are conducted on the UCSD ped2, CUHK Avenue, and SHTech Campus datasets, and the results validate the effectiveness of the proposed method.
... Propagation models have brought various sophistications into people's lives. However, correctly and effectively identifying hot events and tracking users' opinions amid the proliferation of user-generated content and real-time events remain challenging [20,21]. ...
... In classical topic detection and tracking research, the event evolution model (Story Link Detection, SLD) is an essential method, which can discover the associations among documents. Qian et al. [21] constructed a multi-vector event model to describe events, where the relationship between events is described by calculating vector similarity. This method, however, treats different vectors equally and cannot distinguish between the different weights of words, geographic locations, names, and background information. ...
Article
Social networks are inevitable parts of our daily life, where an unprecedented amount of complex data corresponding to a diverse range of applications is generated. As such, it is imperative to conduct research on social events and patterns from the perspective of conventional sociology to optimize services that originate from social networks. Event tracking in social networks finds various applications, such as network security and societal governance, and involves analyzing data generated by user groups on social networks in real time. Moreover, as deep learning techniques continue to advance and make important breakthroughs in various fields, researchers are using this technology to progressively optimize the effectiveness of Event Detection (ED) and tracking algorithms. In this regard, this paper presents an in-depth, comprehensive review of the concepts and methods involved in ED and tracking in social networks. We introduce mainstream event tracking methods, which involve three primary technical steps: ED, event propagation, and event evolution. We then introduce benchmark datasets and evaluation metrics for ED and tracking, which allow comparative analysis of the performance of mainstream methods. Finally, we present a comprehensive analysis of the main research findings and existing limitations in this field, as well as future research prospects and challenges.
... • Receiver Operating Characteristic (ROC): There are two criteria for evaluating anomaly detection models: i) the frame-level criterion and ii) the pixel-level criterion, suggested in prior works [33,74]. We use two metrics: i) the Equal Error Rate (EER) [65] and ii) the Area under the ROC curve (AUC) [29]. These two measures are based on receiver operating characteristic (ROC) curves. ...
... In Figure 5.3, we have shown the ground truth and the predicted image along with its Mean Squared Error. Although our model produces potentially blurry edges due to imperfect predictions, the MSE score is much lower than that of other CNN frameworks [65]. ...
Thesis
Accounting for the increased concern for public safety, automatic abnormal event detection and recognition in a surveillance scene is crucial. It is a current open research subject because of its intricacy and utility. Automatically identifying aberrant events is a difficult undertaking because everyone's idea of abnormality is different: a typical occurrence in one circumstance could be seen as aberrant in another. Automatic anomaly identification becomes particularly challenging in surveillance footage with a large crowd due to congestion and high occlusion. Using machine learning techniques, this thesis aims to offer a solution for this use case so that human resources are not required to keep an eye out for unusual activity in surveillance recordings. We have developed a novel generative adversarial network (GAN) based anomaly detection model. This model is trained such that it jointly learns to construct a high-dimensional picture space and to determine the latent space from the video's context. The generator uses a residual autoencoder architecture made up of a multi-stage channel attention-based decoder and a two-stream, deep convolutional encoder that can capture both spatial and temporal data. We have also offered a technique for refining the GAN model that reduces training time while also generalising the model by utilising transfer learning between datasets. Using a variety of assessment measures, we compare our model to the current state-of-the-art techniques on four benchmark datasets. The empirical findings indicate that, in comparison to existing techniques, our network performs favourably on all datasets.
... One active application is the field of abnormal event detection [27], which raises awareness or alarms for the monitoring personnel. While many works have sought to improve the performance of video anomaly detection methods [6,12,13,15,22,24,26,31,33,34,36], only a handful of studies are concerned with the intrusion into the privacy of the general population [5,9,21,30]. A promising direction, which remains underexplored, is to use the less intrusive thermal domain, as opposed to the commonly used visual domain. ...
Article
Full-text available
Creating high-quality datasets for the task of video anomaly detection is challenging due to a subjective anomaly definition and the rarity of anomalies, which rules out the possibility of obtaining statistically significant data. This results in datasets where anomalies are placed in a single category and are often considered less relevant from a security standpoint. Instead, we propose to create video anomaly datasets based on a framework utilizing object annotations to ease the annotation process and allow users to decide on the anomaly definition. Furthermore, this allows for a fine-grained evaluation w.r.t. anomaly types, which represents a novelty in the area of video anomaly detection. The framework is demonstrated using the existing thermal long-term drift (LTD) dataset, identifying and evaluating five different types of anomalies (appearance, motion, localization, density, and tampering) on six test sets. State-of-the-art anomaly detection methods are evaluated and found to underperform on the thermal anomaly detection dataset, which emphasizes a need for an adjustable anomaly definition in order to produce better anomaly datasets and models that generalize towards practical use. We share the code of the proposed framework to extract anomaly types along with object annotations for the LTD dataset at https://github.com/jagob/harborfront-vad.
... The density-based models have four second-level categories: distribution-based, graph-based, tree-based, and encoding-based. We enumerate all the mentioned methods in Table 2, whose flattened rows are reconstructed below (columns: method, second-level category, base technique, input dimensionality (M = multivariate, I = univariate), learning setup (U = unsupervised, Se = semi-supervised, S = supervised), streaming support):

…                     | Distribution-based | MCD          | M | Se | ✗
OCSVM [150]           | Distribution-based | SVM          | M | Se | ✗
AOSVM [81]            | Distribution-based | SVM          | M | U  | ✓
Eros-SVMs [124]       | Distribution-based | SVM          | M | Se | ✗
S-SVM [20]            | Distribution-based | SVM          | I | Se | ✗
MS-SVDD [253]         | Distribution-based | SVM          | M | Se | ✗
NetworkSVM [266]      | Distribution-based | SVM          | M | Se | ✗
HMAD [87]             | Distribution-based | SVM          | I | Se | ✗
DeepSVM [250]         | Distribution-based | SVM          | M | U  | ✗
HBOS [79]             | Distribution-based | -            | M | U  | ✗
COPOD [133]           | Distribution-based | -            | M | U  | ✗
ConInd [7]            | Distribution-based | -            | M | Se | ✗
MGDD [233]            | Distribution-based | -            | M | U  | ✓
OC-KFD [208]          | Distribution-based | -            | M | U  | ✗
SmartSifter [256]     | Distribution-based | -            | M | U  | ✓
MedianMethod [18]     | Distribution-based | -            | I | U  | ✓
S-ESD [97]            | Distribution-based | ESD          | I | U  | ✗
S-H-ESD [97]          | Distribution-based | ESD          | I | U  | ✗
SH-ESD+ [244]         | Distribution-based | ESD          | I | U  | ✗
TwoFinger [156]       | Graph-based        | -            | I | Se | ✗
GeckoFSM [214]        | Graph-based        | -            | M | S  | ✗
Series2Graph [26]     | Graph-based        | Series2Graph | I | U  | ✗
DADS [217]            | Graph-based        | Series2Graph | I | U  | ✗
IForest [139]         | Tree-based         | IForest      | M | U  | ✗
IF-LOF [53]           | Tree-based         | IForest/LOF  | M | U  | ✗
Extended IForest [90] | Tree-based         | IForest      | M | U  | ✗
Hybrid IForest [157]  | Tree-based         | IForest      | M | Se | ✗
SurpriseEncode [42]   | Encoding-based     | -            | M | U  | ✗
GrammarViz [220]      | Encoding-based     | -            | I | U  | ✗
Ensemble GI [71]      | Encoding-based     | -            | I | U  | ✗
PST [234]             | Encoding-based     | Markov Ch.   | M | U  | ✗
EM-HMM [193]          | Encoding-based     | Markov Ch.   | … ...
Preprint
Full-text available
Recent advances in data collection technology, accompanied by the ever-rising volume and velocity of streaming data, underscore the vital need for time-series analytics. In this regard, time-series anomaly detection has been an important activity, entailing various applications in fields such as cyber security, financial markets, law enforcement, and health care. While traditional literature on anomaly detection is centered on statistical measures, the increasing number of machine learning algorithms in recent years calls for a structured, general characterization of the research methods for time-series anomaly detection. This survey groups and summarizes existing anomaly detection solutions under a process-centric taxonomy in the time-series context. In addition to giving an original categorization of anomaly detection methods, we also perform a meta-analysis of the literature and outline general trends in time-series anomaly detection research.
... One-Class Support Vector Machines (OC-SVM) [17] and Support Vector Data Description (SVDD) [18] are two classical one-class classification models. These models employ a kernel function to identify an optimal boundary-separating hyperplane or boundary hypersphere in the kernel space, which is used to determine whether a testing sample is normal based on the distance to the boundary [19][20][21]. Leveraging the powerful learning capabilities of deep learning, [22] introduced the Deep SVDD (DSVDD) model for anomaly detection by integrating a deep neural network into SVDD, which effectively learns complex data patterns and achieves remarkable performance in anomaly detection. ...
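As a concrete example of the classical one-class models discussed in this excerpt, the snippet below fits scikit-learn's OneClassSVM on synthetic "normal" feature vectors; the data and hyperparameters are illustrative only.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Train on normal feature vectors only; the RBF kernel induces a
# boundary in kernel space, and decision_function returns a signed
# distance-like score (negative values fall outside the boundary).
rng = np.random.default_rng(0)
normal_feats = rng.normal(0.0, 1.0, size=(500, 16))
test_feats = rng.normal(0.0, 3.0, size=(10, 16))

ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(normal_feats)
scores = ocsvm.decision_function(test_feats)
labels = ocsvm.predict(test_feats)   # +1 = normal, -1 = anomalous
```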
Article
Full-text available
In industrial contexts, anomaly detection is crucial for ensuring quality control and maintaining operational efficiency in manufacturing processes. Leveraging high-level features extracted from ImageNet-trained networks and the robust capabilities of the Deep Support Vector Data Description (SVDD) model for anomaly detection, this paper proposes an improved Deep SVDD model, termed Feature-Patching SVDD (FPSVDD), designed for unsupervised anomaly detection in industrial applications. This model integrates a feature-patching technique with the Deep SVDD framework. Features are extracted from a pre-trained backbone network on ImageNet, and each extracted feature is split into multiple small patches of appropriate size. This approach effectively captures both macro-structural information and fine-grained local information from the extracted features, enhancing the model’s sensitivity to anomalies. The feature patches are then aggregated and concatenated for further training with the Deep SVDD model. Experimental results on both the MvTec AD and CIFAR-10 datasets demonstrate that our model outperforms current mainstream approaches and provides significant improvements in anomaly detection performance, which is vital for industrial quality assurance and defect detection in real-time manufacturing scenarios.
... Quantitative results. We compare our proposed MOFSTE-NM with current state-of-the-art models, including (1) other methods: AnoPCN [47], FSCN [43], DeepOC [44], SIGnet [12], AnomalyNet [60]; (2) reconstruction-based methods: AMC [36], ConvLSTM-AE [30], MemAE [13], MNAD-R [37], Stacked RNN [31], STEAL-Net [1], DDGAN [11], CDDA [5]; (3) frame-prediction-based methods: FFP [27], STCEN [14], AMMC-Net [2], MNAD-P [37], VABD [26], DSM-Net [42], *AMSTE [58]. The results are summarized in Table 2. ...
Article
Full-text available
Video anomaly detection aims to assign anomaly scores to video frames, and it is a challenging research area since the types of anomalies are limitless. In response to the fact that abnormal behavior is likely to be misidentified as normal and anomalies are typically generated by the fast motion of foreground objects, this paper proposes a novel model called the Multi-scale Optical Flow Spatio-Temporal Enhancement and Normality Mining Network (MOFSTE-NM). It contains the Spatio-temporal Information Attention Enhancement Module (SIAEM), which incorporates reconstructed optical flows at multiple scales and considers spatial and temporal aspects. This strategy reduces the influence of the background and normal objects, enhancing the model's ability to focus on anomalous fast-moving objects in the foreground. Additionally, we propose a Normality Mining Convolution (NMC) module embedded in the decoder to refine the boundary between normality and abnormality. The NMC uses a multi-head attention mechanism for dynamic weight adjustment, enabling the precise extraction of normal information. We compute the final anomaly score by fusing two components: (1) the reconstruction error of the optical flows and (2) the peak signal-to-noise ratio between the predicted frame and its ground truth. We evaluate our model on three well-established video anomaly detection datasets. A comparison of different models indicates that the proposed model achieves superior performance compared to state-of-the-art approaches, with area under the receiver operating characteristic curve (AUROC) values of 99.23% on UCSD Ped2, 88.84% on CUHK Avenue, and 74.80% on ShanghaiTech.
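The PSNR component of the fused anomaly score described above follows the standard definition; a minimal NumPy sketch, where max_val and the epsilon guard are assumptions:

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    # Peak signal-to-noise ratio between a predicted frame and its
    # ground truth; a lower PSNR suggests a harder-to-predict, and
    # therefore more likely anomalous, frame.
    mse = np.mean((np.asarray(pred, float) - np.asarray(gt, float)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / (mse + 1e-12))
```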
... Note that training the model improperly with L_dist may yield trivial solutions where the normal and abnormal patterns are all located close to C in the embedding space. To prevent this, we fix the center vector C and calculate it in advance, which means the center vector is not updated during training, the same as in [32], [33]. Moreover, we train L_dist and L_next simultaneously; L_next acts as a regularizer preventing L_dist from learning trivial solutions, because these two objective functions are optimized based on h_i[CLS]. ...
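The fixed-center strategy quoted above mirrors the Deep SVDD recipe; the sketch below shows how such a center is typically precomputed and then used in a distance loss. The encoder interface and the near-zero nudge heuristic are assumptions, not the cited paper's exact code.

```python
import torch

@torch.no_grad()
def init_center(encoder, loader, eps=0.1):
    # Fix C as the mean embedding of the training data and never update
    # it, so the network cannot collapse everything onto a learnable
    # center (the trivial solution mentioned in the excerpt above).
    c = torch.cat([encoder(batch) for batch in loader]).mean(dim=0)
    # Nudge near-zero coordinates away from zero so the distance loss
    # stays informative (a common Deep SVDD heuristic).
    small = c.abs() < eps
    c[small] = eps * torch.sign(c[small])
    c[c == 0] = eps
    return c

def dist_loss(z, c):
    # Mean squared distance of embeddings to the fixed center C.
    return ((z - c) ** 2).sum(dim=1).mean()
```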
Article
Full-text available
Log anomaly detection is a crucial task in monitoring IT systems, along with metrics and traces. An anomaly can be detected from either of two types of logs: individual logs or log sequences. While an individual log indicates an independent system status, combining multiple logs describes the execution paths of systems. Once the patterns of log sequences deviate significantly from normal execution behaviors, this might indicate system anomalies. For log anomaly detection using log sequences, supervised learning methods are preferable due to their high performance. However, these methods require labeled data to train models. As systems evolve, the number of logs increases significantly, which makes labeling data labor-intensive and impractical. Therefore, other learning techniques, such as semi-supervised or unsupervised ones, are better alternatives for detection. In practice, detecting log anomalies is quite challenging because of several problems, such as unstable logs, new types of logs, and unexplored log semantics. To address these problems and enhance detection performance, we propose a lightweight semi-supervised multi-task learning method named MultiLog in this paper. The key components of the proposed method are the pre-trained language model BERT, dimension reduction, the attention mechanism from the Transformer, and multi-task learning. Similar to previous studies, we conduct comprehensive experiments on three widely used datasets: HDFS, BGL, and Thunderbird. In terms of efficiency, our proposed model is 50 times smaller while the F1-Scores are maintained compared to the original model. In terms of effectiveness, the proposed model outperforms baseline methods and achieves performance comparable to supervised learning models.
... The use of deep learning models for surveillance applications has been increasing in recent years. Specifically, convolutional neural networks (Gandapur, 2022; Khan et al., 2018; Wu et al., 2019) play a significant role in object detection for surveillance by learning and extracting features with high accuracy. In (Mansour et al., 2021), an intelligent anomaly detection mechanism is implemented using Faster R-CNN. ...
Article
Full-text available
Detection of abandoned and stationary objects like luggage, boxes, machinery, and so forth, in public places is one of the challenging and critical tasks in the video surveillance system. These objects may contain weapons, bombs, or other explosive materials that threaten the public. Though various applications have been developed to detect stationary objects, different challenges, like occlusions, changes in geometrical features of objects, and so forth, are still to be addressed. Considering the complexity of scenarios in public places and the variety of objects, a context-aware model is developed based on the mask region-based convolution network (M-RCNN) for detecting abandoned objects. A modified convolution operation is implemented in the backbone network to understand features from geometric variations near objects. These modified operation layers can be adapted based on geometric interpretations to extract required features. Finally, a bounding box operation is performed to locate the abandoned object and mask the particular object. Experiments have been performed on benchmark datasets like ABODA and our own dataset, which show that an mAP of 0.699 is achieved for model 1, 0.675 for model 2, and 0.734 for model 3. An ablation analysis has also been performed and compared with other state-of-the-art methods. Based on the results, the proposed model detects abandoned objects better than existing state-of-the-art methods.
... Ye, Peng, Gan, Wu, & Qiao, 2019)), and self-supervised classification methods (e.g., Tenenboim-Chekina, Rokach, and Shapira (2013)). Additionally, some methods are specifically tuned to certain anomaly measures, such as distance-based measures (e.g., Wang, Pang, Shen, and Ma (2020)), one-class classification measures (e.g., Wu, Liu, and Shen (2019)), and clustering-based measures (e.g., X. Yang, Deng, Zheng, Yan, and Liu (2019)). Most of these methods are easy to implement and only require semi-supervised training with non-anomalous data. ...
Preprint
Full-text available
Change point detection (CPD) and anomaly detection (AD) are essential techniques in various fields to identify abrupt changes or abnormal data instances. However, existing methods are often constrained to univariate data, face scalability challenges with large datasets due to computational demands, and experience reduced performance with high-dimensional or intricate data, as well as hidden anomalies. Furthermore, they often lack interpretability and adaptability to domain-specific knowledge, which limits their versatility across different fields. In this work, we propose a deep learning-based CPD/AD method called Probabilistic Predictive Coding (PPC) that jointly learns to encode sequential data to low dimensional latent space representations and to predict the subsequent data representations as well as the corresponding prediction uncertainties. The model parameters are optimized with maximum likelihood estimation by comparing these predictions with the true encodings. At the time of application, the true and predicted encodings are used to determine the probability of conformity, an interpretable and meaningful anomaly score. Furthermore, our approach has linear time complexity, scalability issues are prevented, and the method can easily be adjusted to a wide range of data types and intricate applications. We demonstrate the effectiveness and adaptability of our proposed method across synthetic time series experiments, image data, and real-world magnetic resonance spectroscopic imaging data.
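A minimal sketch of the kind of probabilistic prediction objective PPC describes, assuming the predictor outputs a diagonal Gaussian over the next latent encoding; this illustrative Gaussian negative log-likelihood stands in for the authors' exact formulation.

```python
import torch

def gaussian_nll(pred_mean, pred_logvar, target):
    # Maximize the likelihood of the true next encoding under the
    # predicted Gaussian, i.e., minimize the NLL; at test time the same
    # predictive density yields an interpretable conformity score.
    var = pred_logvar.exp()
    return 0.5 * (((target - pred_mean) ** 2) / var + pred_logvar).mean()
```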
... In Table 1, we provide a comprehensive comparison of our SYRFA framework with state-of-the-art (SOTA) methods [5], [6], [12], [25]- [35], [37]- [39] in terms of frame-level AUC (%). We further categorize the SOTA methods into four categories, including reconstruction and prediction-based methods, which have dominated the field of VAD. ...
Article
Full-text available
Video Anomaly Detection (VAD) has garnered significant attention in computer vision, especially with the exponential growth of surveillance videos. Recently, synthetic datasets have been released to address the imbalance problem between normal and abnormal scenarios in real-world datasets by providing various combinations of events. Motivated by the release of synthetic datasets, many studies have attempted to handle domain shifts by generating synthetic-real or real-synthetic abnormal scenarios. However, these approaches still suffer from a substantial computation burden due to the generation model. In this paper, we aim to alleviate the domain gap without relying on any generation model. We propose a novel framework named SYnthetic-to-Real via Feature Alignment (SYRFA) for VAD. The SYRFA consists of two learning phases: learning synthetic knowledge and adaptation to the real-world domain. These two learning phases facilitate the incorporation of rich synthetic knowledge into the real-world domain. To address the domain shift between synthetic and real domains, we introduce consistency learning, aligning feature representations to map closely between the synthetic and real-world domains. Additionally, in the adaptation phase, we propose Residual Additional Parameters (RAP), a simple yet effective approach for handling domain gaps. RAP is designed with a residual path for learning local patterns, crucial in VAD due to circumstantial feature representation. It contributes to obtaining transferable feature representations with fewer additional computations. The proposed framework demonstrates superior performance on VAD benchmark datasets. Notably, our framework outperforms other methods by a margin of 0.8% on ShanghaiTech. Moreover, the ablation study highlights the effectiveness of the proposed framework and RAP.
... Rainer et al. [19] used PCA to constrain the latent space dimensions. Constraining the latent space ensures that the generated video frame is of good quality and consistent with the normal data distribution [20,21]. ...
Article
Automatic detection of abnormal behavior in video sequences is a fundamental and challenging problem for intelligent video surveillance systems. However, the existing state-of-the-art Video Anomaly Detection (VAD) methods are computationally expensive and lack the desired robustness in real-world scenarios. The contemporary VAD methods cannot detect the fundamental features absent during training, which usually results in a high false positive rate while testing. To this end, we propose a Constrained Generative Adversarial Network (CVAD-GAN) for real-time VAD. Adding white Gaussian noise to the input video frame, together with the constrained latent space of CVAD-GAN, improves its fine-grained feature learning from normal video frames. Also, the dilated convolution layers and skip-connections preserve information across layers to capture the broader context of complex video scenes in real time. Our proposed approach achieves a higher Area Under Curve (AUC) score and a lower Equal Error Rate (EER) with enhanced computational efficiency than the existing state-of-the-art VAD methods. CVAD-GAN achieves AUC and EER scores of 98.0% and 6.0% on UCSD Peds1, 97.8% and 7.0% on UCSD Peds2, 94.0% and 8.1% on CUHK Avenue, and 76.2% and 21.7% on the ShanghaiTech dataset, respectively. Also, it detects 63 and 19 abnormal events, with false alarms of 3 and 1, respectively, on the Subway-Entry and Subway-Exit datasets. The source code to replicate the results of the proposed CVAD-GAN is available at https://github.com/Rituraj-ksi/CVAD-GAN.
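The noise-injection step in CVAD-GAN's training recipe is simple to sketch; sigma is an assumed hyperparameter, not a value from the paper.

```python
import torch

def add_gaussian_noise(frames, sigma=0.1):
    # Add white Gaussian noise to input video frames so the generator
    # must denoise while reconstructing, encouraging fine-grained
    # feature learning from normal frames.
    return frames + sigma * torch.randn_like(frames)
```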
... Misclassifying a sample from the minority class as belonging to the majority class can have serious consequences or lead to significant damage in terms of water pollution control. Recently, class-imbalanced classification has garnered considerable attention in various applications, such as feature selection [3][4][5], fault diagnosis [6,7], continuous supervising tasks [1], face recognition [8], cancer detection [9], and anomalous event detection [10]. ...
Article
Full-text available
Imbalanced class data are commonly observed in pattern analysis, machine learning, and various real-world applications. Conventional approaches often resort to resampling techniques in order to address the imbalance, which inevitably alter the original data distribution. This paper proposes a novel classification method that leverages optimal transport for handling imbalanced data. Specifically, we establish a transport plan between training and testing data without modifying the original data distribution, drawing upon the principles of optimal transport theory. Additionally, we introduce a non-convex interclass regularization term to establish connections between testing samples and training samples with the same class labels. This regularization term forms the basis of a regularized discrete optimal transport model, which is employed to address imbalanced classification scenarios. Then, in line with the concept of maximum minimization, a maximum minimization algorithm is introduced for regularized discrete optimal transport. Subsequent experiments on 17 Keel datasets with varying levels of imbalance demonstrate the superior performance of the proposed approach compared to 11 other widely used techniques for class-imbalanced classification. Additionally, the application of the proposed approach to water quality evaluation confirms its effectiveness.
... While they are powerful with non-linear kernels, their performance is limited by the quality of the underlying data representations. Early attempts at AD [52,46] rely on kernel tricks and hand-crafted feature engineering, but recent ones [16,62,40,44,43] advocate the capability of deep neural networks to automatically learn high-level representations, outperforming their kernel-based counterparts. However, naive training results in a trivial solution with a constant mapping, a.k.a. ...
Preprint
Full-text available
Video Anomaly Detection (VAD), aiming to identify abnormalities within a specific context and timeframe, is crucial for intelligent Video Surveillance Systems. While recent deep learning-based VAD models have shown promising results by generating high-resolution frames, they often lack competence in preserving detailed spatial and temporal coherence in video frames. To tackle this issue, we propose a self-supervised learning approach for VAD through an inter-patch relationship prediction task. Specifically, we introduce a two-branch vision transformer network designed to capture deep visual features of video frames, which can address the spatial and temporal dimensions responsible for modeling appearance and motion patterns, respectively. The inter-patch relationship in each dimension is decoupled into inter-patch similarity and the order information of each patch. To mitigate memory consumption, we convert the order information prediction task into a multi-label learning problem, and the inter-patch similarity prediction task into an inter-patch distance matrix regression problem. Comprehensive experiments demonstrate the effectiveness of our method, surpassing pixel-generation-based methods by a significant margin across three public benchmarks. Additionally, our approach outperforms other self-supervised learning-based methods.
Article
Video anomaly detection (VAD) plays a crucial role in intelligent surveillance. However, an essential type of anomaly named scene-dependent anomaly is overlooked. Moreover, the task of video anomaly anticipation (VAA) also deserves attention. To fill these gaps, we build a comprehensive dataset named NWPU Campus, which is the largest semi-supervised VAD dataset and the first dataset for scene-dependent VAD and VAA. Meanwhile, we introduce a novel forward-backward framework for scene-dependent VAD and VAA, in which the forward network individually solves the VAD and jointly solves the VAA with the backward network. Particularly, we propose a scene-dependent generative model in latent space for the forward and backward networks. First, we propose a hierarchical variational auto-encoder to extract scene-generic features. Next, we design a score-based diffusion model in latent space to refine these features into a more compact form for the task and generate scene-dependent features with a scene information auto-encoder, modeling the relationships between video events and scenes. Finally, we develop a temporal loss from key frames to constrain the motion consistency of video clips. Extensive experiments demonstrate that our method can handle both scene-dependent anomaly detection and anticipation well, achieving state-of-the-art performance on ShanghaiTech, CUHK Avenue, and the proposed NWPU Campus datasets.
Article
Weakly supervised video anomaly detection aims to locate abnormal activities in untrimmed videos without the need for frame-level supervision. Prior work has utilized graph convolution networks or self-attention mechanisms alongside multiple instance learning (MIL)-based classification loss to model temporal relations and learn discriminative features. However, these approaches are limited in two aspects: 1) Multi-branch parallel architectures, while capturing multi-scale temporal dependencies, inevitably lead to increased parameter and computational costs. 2) The binarized MIL constraint only ensures interclass separability while neglecting the fine-grained discriminability within anomalous classes. To this end, we introduce a novel WS-VAD framework that focuses on efficient temporal modeling and anomaly inner-class discriminability. We first construct a Temporal Context Aggregation (TCA) module that simultaneously captures local-global dependencies by reusing an attention matrix along with adaptive context fusion. In addition, we propose a Prompt-Enhanced Learning (PEL) module that incorporates semantic priors using knowledge-based prompts to boost the discrimination of visual features while ensuring separability across anomaly subclasses. The proposed components have been validated through extensive experiments, which demonstrate superior performance on three challenging datasets, UCF-Crime, XD-Violence and ShanghaiTech, with fewer parameters and reduced computational effort. Notably, our method can significantly improve the detection accuracy for certain anomaly subclasses and reduce the false alarm rate. Our code is available at: https://github.com/yujiangpu20/PEL4VAD .
Article
In the domain of video surveillance, describing the behavior of each individual within the video is becoming increasingly essential, especially in complex scenarios with multiple individuals present. This is because describing each individual's behavior provides more detailed situational analysis, enabling accurate assessment and response to potential risks, ensuring the safety and harmony of public places. Currently, video-level captioning datasets cannot provide fine-grained descriptions for each individual's specific behavior. Moreover, mere descriptions at the video level fail to provide an in-depth interpretation of individual behaviors, making it challenging to accurately determine the specific identity of each individual. To address this challenge, we construct a human-centric video surveillance captioning dataset, which provides detailed descriptions of the dynamic behaviors of 7,820 individuals. Specifically, we have labeled several aspects of each person, such as location, clothing, and interactions with other elements in the scene, and these people are distributed across 1,012 videos. Based on this dataset, we can link individuals to their respective behaviors, allowing for further analysis of each person's behavior in surveillance videos. Besides the dataset, we propose a novel video captioning approach that can describe individual behavior in detail on a person-level basis, achieving state-of-the-art results. We released the dataset at https://github.com/lingruzhou/UCCD to facilitate future research.
Article
Sharing digital resources is a common practice in both work and personal life. Yet, sharing identical credentials, such as passwords or physical cards, not only poses significant security risks but also falls short in addressing the specific requirements of small local groups, such as parental controls, tracking user modifications, and easily updating access. To address this, we suggest a touch behavior-based method tailored for sharing in small local groups, designed to balance between ensuring relaxed security and maintaining practical functionality. Our approach aims to concurrently identify in-group users and detect out-of-group imposters. Specifically, our approach extracts effective identity representations that are robust to in-group variability and out-of-group uncertainty by learning a pair of touch-behavioral and within-group similarity embeddings. While the former captures the unique features of user touch characteristics, the latter reflects the typical group-wide similarity structure that an in-group user is expected to possess from a holistic perspective. Experimental results showcase the effectiveness of our method even with few samples for training. It maintains accuracy despite the group growing larger and shows resilience against advanced attacks. This offers a promising way to keep group access both user-friendly and relatively secure, striking a crucial balance for small groups’ needs.
Article
Video anomaly detection (VAD) aims to identify events or scenes in videos that deviate from typical patterns. Existing approaches primarily focus on reconstructing or predicting frames to detect anomalies and have shown improved performance in recent years. However, they often depend highly on local spatio-temporal information and face the challenge of insufficient object feature modeling. To address the above issues, this paper proposes a video anomaly detection framework with Enhanced Object Information and Global Temporal Dependencies (EOGT), whose main novelties are: (1) A Local Object Anomaly Stream (LOAS) is proposed to extract local multimodal spatio-temporal anomaly features at the object level. LOAS integrates two modules: a Diffusion-based Object Reconstruction Network (DORN) with multimodal conditions detects anomalies with object RGB information, and an Object Pose Anomaly Refiner (OPA) discovers anomalies with human pose information. (2) A Global Temporal Strengthening Stream (GTSS) is proposed, which leverages video-level temporal dependencies to identify long-term and video-specific anomalies effectively. Both streams are jointly employed in EOGT to learn multimodal and multi-scale spatio-temporal anomaly features for VAD, and we finally fuse the anomaly features and scores to detect anomalies at the frame level. Extensive experiments are conducted to verify the performance of EOGT on three public datasets: ShanghaiTech Campus, CUHK Avenue, and UCSD Ped2.
Article
Nowadays, fast and accurate novelty detection is crucial for public safety and security in surveillance videos. Given the high accuracy of deep learning techniques, deep learning-based novelty detection is a trend. With the huge amount of surveillance video being generated by surveillance cameras at all times, it is challenging to perform novelty detection in surveillance videos efficiently while guaranteeing accuracy. To address this, we propose a dynamic frame sampling method called ORLNet that considers both frame similarity and the intensity of object movement. It is based on two observations: first, there is high similarity between adjacent frames in video data. Second, in practice, since novel behaviors are always generated by moving targets, we only need to focus on a small number of frames that contain key information, which we call key frames. Specifically, ORLNet speeds up surveillance video processing by using a reinforcement learning agent to dynamically determine the indexes of key frames at run-time, replacing end-to-end inference at non-key-frame positions by reusing the last key frame's calculation. It defines frame similarity as novelty energy, which is a combination of novel semantic and motion features. By calculating the distance in novelty energy between frames, the computation for key frames can be reused for other frames with similar novelty energies, thus accelerating novelty detection while maintaining accuracy. Finally, we evaluate ORLNet experimentally on two surveillance video datasets by comparing it with existing methods. Experimental results show that ORLNet reduces processing time by 42% while preserving accuracy.
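ORLNet learns its key-frame policy with a reinforcement learning agent; as a rough stand-in for that policy, the sketch below selects key frames greedily whenever the novelty energy drifts beyond a threshold and reuses the last key frame's result otherwise. The scalar energy signal and the fixed threshold are simplifying assumptions.

```python
import numpy as np

def select_key_frames(energies, threshold):
    # Run full inference only where novelty energy has drifted far from
    # the last key frame; non-key frames reuse that key frame's output.
    key_idx, last = [0], energies[0]
    for i in range(1, len(energies)):
        if abs(energies[i] - last) > threshold:
            key_idx.append(i)
            last = energies[i]
    return key_idx

# Toy demo: a slowly drifting energy signal yields sparse key frames.
frame_energies = np.abs(np.random.default_rng(1).normal(size=100)).cumsum()
print(select_key_frames(frame_energies, threshold=5.0))
```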
Article
Video anomaly detection (VAD) has received increasing attention due to its potential applications. Its current dominant tasks focus on detecting anomalies online, which can be roughly interpreted as binary or multiple event classification. However, such a setup, which builds relationships between complicated anomalous events and single labels, e.g., “vandalism”, is superficial, since single labels are insufficient to characterize anomalous events. In reality, users tend to search for a specific video rather than a series of approximate videos. Therefore, retrieving anomalous events using detailed descriptions is practical and valuable, but few studies focus on this. In this context, we propose a novel task called Video Anomaly Retrieval (VAR), which aims to pragmatically retrieve relevant anomalous videos by cross-modalities, e.g., language descriptions and synchronous audios. Unlike current video retrieval, where videos are assumed to be temporally well-trimmed with short duration, VAR is devised to retrieve long untrimmed videos that may be only partially relevant to the given query. To achieve this, we present two large-scale VAR benchmarks and design a model called Anomaly-Led Alignment Network (ALAN) for VAR. In ALAN, we propose an anomaly-led sampling to focus on key segments in long untrimmed videos. Then, we introduce an efficient pretext task to enhance semantic associations between video-text fine-grained representations. Besides, we leverage two complementary alignments to further match cross-modal contents. Experimental results on two benchmarks reveal the challenges of the VAR task and also demonstrate the advantages of our tailored method. Captions are publicly released at https://github.com/Roc-Ng/VAR .
Article
Video anomaly detection aims to find the events in a video that do not conform to the expected behavior. The prevalent methods mainly detect anomalies by snippet reconstruction or future frame prediction error. However, the error is highly dependent on the local context of the current snippet and lacks the understanding of normality. To address this issue, we propose to detect anomalous events not only by the local context, but also according to the consistency between the testing event and the knowledge about normality from the training data. Concretely, we propose a novel two-stream framework based on context recovery and knowledge retrieval, where the two streams can complement each other. For the context recovery stream, we propose a spatiotemporal U-Net which can fully utilize the motion information to predict the future frame. Furthermore, we propose a maximum local error mechanism to alleviate the problem of large recovery errors caused by complex foreground objects. For the knowledge retrieval stream, we propose an improved learnable locality-sensitive hashing, which optimizes hash functions via a Siamese network and a mutual difference loss. The knowledge about normality is encoded and stored in hash tables, and the distance between the testing event and the knowledge representation is used to reveal the probability of anomaly. Finally, we fuse the anomaly scores from the two streams to detect anomalies. Extensive experiments demonstrate the effectiveness and complementarity of the two streams, whereby the proposed two-stream framework achieves state-of-the-art performance on the ShanghaiTech, Avenue and Corridor datasets among the methods without object detection. Even when compared with the methods using object detection, our method reaches competitive or better performance on the ShanghaiTech, Avenue, and Ped2 datasets.
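The maximum-local-error mechanism named above can be sketched as scoring a frame by its worst error patch rather than the global mean, so that small anomalous regions are not washed out by a mostly well-predicted frame; the patch size and non-overlapping tiling are assumptions.

```python
import numpy as np

def max_local_error(err_map, patch=16):
    # err_map: per-pixel squared prediction error for one frame.
    # Score the frame by the mean error of its worst patch.
    h, w = err_map.shape
    best = 0.0
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            best = max(best, float(err_map[i:i + patch, j:j + patch].mean()))
    return best
```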
Article
Full-text available
Physical violence by students in the educational environment occurs frequently and can lead to criminal acts. Moreover, repeated acts of physical violence can be considered non-verbal bullying. This bullying can hurt the victim, causing physical disorders, mental health problems, impaired social relationships, and decreased academic performance. However, current monitoring for acts of violence has a weakness, namely weak supervision by the school. A deep learning-based physical violence detection system built on an LSTM network is a solution to this problem. In this research, we develop a Convolutional Neural Network to detect acts of violence. The Convolutional Neural Network extracts frame-level features from videos, which are processed by a long short-term memory with convolutional gates. Convolutional Neural Networks and convolutional long short-term memory can capture local spatio-temporal features, enabling local video motion analysis. The performance of the proposed feature-extraction pipeline is evaluated on standard benchmark datasets in terms of recognition accuracy. A comparison of the results obtained with state-of-the-art techniques reveals the promising capabilities of the proposed method for recognising violent videos. The trained and tested model will be integrated into a violence detection system, which can provide ease and speed in detecting acts of violence that occur in the school environment.
Chapter
In the present era, crime rates increase day by day; in such situations, people keep security as a topmost priority in their daily lives. As a result, the demand for surveillance systems surges for public, private, and remote areas. In this scenario, anomaly detection systems are gaining more attention in the domain of computer vision. Various machine learning (ML)- and deep learning (DL)-based approaches have been presented for anomaly detection over the decades, but the task remains challenging for many reasons, one of them being the vague quality of video content. Transfer learning (TL) plays a key role by providing already-trained information to achieve good accuracy. This paper is divided into three parts: the first part comprises a study of deep and machine learning methods for detecting violent and abnormal activities. In the second part, a basic architecture of a transfer learning-based framework for anomaly detection, along with TL approaches, is presented. The final section compares machine learning and deep learning algorithms on publicly available benchmark datasets based on the accuracy achieved. The main obstacles encountered while utilizing this technique are also discussed based on the study and analysis.
Article
Video Anomaly Detection (VAD) serves as a pivotal technology in the intelligent surveillance systems, enabling the temporal or spatial identification of anomalous events within videos. While existing reviews predominantly concentrate on conventional unsupervised methods, they often overlook the emergence of weakly-supervised and fully-unsupervised approaches. To address this gap, this survey extends the conventional scope of VAD beyond unsupervised methods, encompassing a broader spectrum termed Generalized Video Anomaly Event Detection (GVAED). By skillfully incorporating recent advancements rooted in diverse assumptions and learning frameworks, this survey introduces an intuitive taxonomy that seamlessly navigates through unsupervised, weakly-supervised, supervised and fully-unsupervised VAD methodologies, elucidating the distinctions and interconnections within these research trajectories. In addition, this survey facilitates prospective researchers by assembling a compilation of research resources, including public datasets, available codebases, programming tools, and pertinent literature. Furthermore, this survey quantitatively assesses model performance, delves into research challenges and directions, and outlines potential avenues for future exploration.
Article
This paper presents a novel fine-grained task for traffic accident analysis. Accident detection in surveillance or dashcam videos is a common task in the field of traffic accident analysis by using videos. However, common accident detection does not analyze the specific particulars of the accident, only identifies the accident’s existence or occurrence time in a video. In this paper, we define the novel fine-grained accident detection task which contains fine-grained accident classification, temporal-spatial occurrence region localization, and accident severity estimation. A transformer-based framework combining the RGB and optical flow information of videos is proposed for fine-grained accident detection. Additionally, we introduce a challenging Fine-grained Accident Detection (FAD) database that covers multiple tasks in surveillance videos which places more emphasis on the overall perspective. Experimental results demonstrate that our model could effectively extract the video features for multiple tasks, indicating that current traffic accident analysis has limitations in dealing with the FAD task and that further research is indeed needed.
Article
Detecting anomalies in videos presents a significant challenge in the field of video surveillance. The primary goal is to identify and detect uncommon actions or events within a video sequence. The difficulty arises from the limited availability of video frames depicting anomalies and the ambiguous definition of an anomaly. Building on the extensive applications of Generative Adversarial Networks (GANs), which consist of a generator and a discriminator network, we propose an Attention-guided Generator with Dual Discriminator GAN (A2D-GAN) for real-time video anomaly detection (VAD). The generator network uses an encoder-decoder architecture with multi-stage self-attention added to the encoder and multi-stage channel attention added to the decoder. The framework uses adversarial learning from noise and video frame reconstruction to enhance the generalization of the generator network. Of the two discriminators in A2D-GAN, one distinguishes the reconstructed video frame from the real video frame, while the other distinguishes the reconstructed noise from the real noise. Exhaustive experiments and ablation studies on four benchmark video anomaly datasets, namely UCSD Peds, CUHK Avenue, ShanghaiTech, and Subway, demonstrate the effectiveness of the proposed A2D-GAN compared to other state-of-the-art methods. The proposed A2D-GAN model is robust and can detect anomalies in videos in real-time. The source code to replicate the results of the proposed A2D-GAN model is available at https://github.com/Rituraj-ksi/A2D-GAN.
Article
Full-text available
Video anomaly detection and localization is one of the key components of an intelligent video surveillance system. Video anomaly detection refers to the process of spatiotemporal localization of the abnormal or anomalous patterns present in a video. The performance of a deep learning-based video anomaly detector depends on the quality and quantity of the video anomaly datasets used for training. However, effective video anomaly datasets are scarce because anomalies are inherently rare, context-dependent, and equivocal. Further, the state of the art lacks a review that presents a comprehensive study of video anomaly datasets, including issues associated with the existing datasets, comparative analysis of the available datasets, and potential solutions using both model-centric and data-centric approaches. Hence, a comprehensive review of the publicly available video anomaly datasets for video anomaly detection and localization is presented in this article. Further, a comparative study of the existing video anomaly datasets at qualitative and quantitative levels is presented to help decide the right strategies for the desired application. Subsequently, model-centric and data-centric approaches required to solve various problems associated with video anomaly datasets are presented. Finally, current research trends, research challenges, potential applications, and future research directions are outlined.
Article
Detecting abnormal events in surveillance involves identifying unexpected behavior through video analysis: recognizing patterns or deviations from normal behavior and taking actions to mitigate potential risks. However, the distribution of data can change over time, leading to concept drift, which makes it challenging to accurately detect abnormal events. To address this issue, a new approach using a global density network (GDN) has been proposed, which allows for more efficient identification of object distributions in surveillance videos and thus improves the accuracy of abnormal event detection. The proposed method combines features extracted by a backbone network with a global density joined network (GDJN), which refines density features using dilated convolutional networks. A multistage long short-term memory (LSTM) network is then used to classify abnormal events. Experiments are conducted on two datasets, UMN and UCSD Ped2, achieving F1 scores of 93.42 and 94.46, respectively, with corresponding AUC values of 93.5 and 94.8.
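The density-refinement step can be pictured with a small hedged sketch: stacking dilated convolutions enlarges the receptive field without pooling, so each per-pixel density feature sees wider spatial context, which is what the GDJN is described to do. The channel sizes and dilation rates below are illustrative guesses, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Dilated-convolution refiner: padding == dilation keeps the spatial size fixed
# while the receptive field grows exponentially with each layer.
refiner = nn.Sequential(
    nn.Conv2d(512, 256, 3, padding=1, dilation=1), nn.ReLU(),
    nn.Conv2d(256, 128, 3, padding=2, dilation=2), nn.ReLU(),
    nn.Conv2d(128, 64, 3, padding=4, dilation=4), nn.ReLU(),
    nn.Conv2d(64, 1, 1),                 # per-pixel density map
)

feat = torch.randn(1, 512, 28, 28)       # assumed backbone feature map
density = refiner(feat)                  # (1, 1, 28, 28), spatial size preserved
```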
Article
Full-text available
The automatic detection and recognition of anomalous events in crowded and complex video scenes are the research objectives of this paper. The main challenge is to create models for detecting such events, given their variability and the context-dependent nature of the scenes. To address these challenges, this paper proposes a novel HOME FAST (Histogram of Orientation, Magnitude, and Entropy with Fast Accelerated Segment Test) spatiotemporal feature extraction approach based on optical flow information to capture anomalies. This descriptor performs video analysis within the smart surveillance domain and detects anomalies. During training, a deep learning model learns all the normal patterns from high-level and low-level information. During testing, events are described in the same way and, if they deviate from the learned normal patterns, are considered anomalous. The overall proposed system robustly identifies both local and global abnormal events in complex scenes and handles detection under various transformations, with respect to state-of-the-art approaches. Performance assessment of the experimental results validated that the proposed model could handle different anomalous events in a crowded scene and recognize them automatically.
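For intuition, here is a rough sketch (my reading of the descriptor, not the authors' code) of a flow-based feature in the HOME FAST spirit: dense optical flow between two frames, a magnitude-weighted orientation histogram, and the histogram's entropy as a motion cue.

```python
import cv2
import numpy as np

def flow_descriptor(prev_gray, curr_gray, n_bins=8):
    # Dense optical flow (Farneback) between consecutive grayscale frames.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])   # angle in radians
    # Magnitude-weighted orientation histogram.
    hist, _ = np.histogram(ang, bins=n_bins, range=(0, 2 * np.pi), weights=mag)
    hist = hist / (hist.sum() + 1e-8)
    entropy = -np.sum(hist * np.log2(hist + 1e-8))            # motion-entropy cue
    return np.concatenate([hist, [mag.mean(), entropy]])

prev = np.random.randint(0, 255, (120, 160), dtype=np.uint8)  # stand-in frames
curr = np.random.randint(0, 255, (120, 160), dtype=np.uint8)
desc = flow_descriptor(prev, curr)                            # one descriptor per frame pair
```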
Conference Paper
Full-text available
We formulate the abnormal event detection problem as an outlier detection task and we propose a two-stage algorithm based on k-means clustering and one-class Support Vector Machines (SVM) to eliminate outliers. In the feature extraction stage, we propose to augment spatio-temporal cubes with deep appearance features extracted from the last convolutional layer of a pre-trained neural network. After extracting motion and appearance features from the training video containing only normal events, we apply k-means clustering to find clusters representing different types of normal motion and appearance features. In the first stage, we consider that clusters with fewer samples (with respect to a given threshold) contain mostly outliers, and we eliminate these clusters altogether. In the second stage, we shrink the borders of the remaining clusters by training a one-class SVM model on each cluster. To detect abnormal events in the test video, we analyze each test sample and consider its maximum normality score provided by the trained one-class SVM models, based on the intuition that a test sample can belong to only one cluster of normality. If the test sample does not fit well in any narrowed normality cluster, then it is labeled as abnormal. We compare our method with several state-of-the-art methods on three benchmark data sets. The empirical results indicate that our abnormal event detection framework can achieve better results in most cases, while processing the test video in real-time at 24 frames per second on a single CPU.
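A minimal scikit-learn sketch of this two-stage pipeline, with illustrative cluster counts, thresholds, and feature dimensions (the actual features would be the spatio-temporal cubes described above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import OneClassSVM

def fit_normality_models(X, k=10, min_size=20):
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    models = []
    for c in range(k):
        members = X[km.labels_ == c]
        if len(members) < min_size:        # stage 1: drop small (outlier-heavy) clusters
            continue
        models.append(OneClassSVM(kernel="rbf", nu=0.1).fit(members))  # stage 2: shrink borders
    return models

def normality_score(models, x):
    # Max over clusters: a test sample only needs to fit one normality cluster.
    return max(m.decision_function(x.reshape(1, -1))[0] for m in models)

X_train = np.random.randn(500, 64)         # stand-in for motion + appearance features
models = fit_normality_models(X_train)
score = normality_score(models, np.random.randn(64))  # low score => abnormal
```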
Article
Full-text available
We propose a deep learning-based solution for the problem of feature learning in one-class classification. The proposed method operates on top of a Convolutional Neural Network (CNN) of choice and produces descriptive features while maintaining a low intra-class variance in the feature space for the given class. For this purpose, two loss functions, compactness loss and descriptiveness loss, are proposed along with a parallel CNN architecture. A template matching-based framework is introduced to facilitate the testing process. Extensive experiments on publicly available anomaly detection, novelty detection and mobile active authentication datasets show that the proposed Deep One-Class (DOC) classification method achieves significant improvements over the state-of-the-art.
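A hedged PyTorch sketch of the two losses named above: a compactness loss penalising intra-batch feature spread for the one class, and a descriptiveness term simplified here to ordinary cross-entropy on an auxiliary multi-class reference set (my simplification of the paper's parallel-network setup).

```python
import torch
import torch.nn.functional as F

def compactness_loss(feats):
    # feats: (B, D) features of target-class samples; penalise spread around the batch mean.
    centered = feats - feats.mean(dim=0, keepdim=True)
    return (centered ** 2).sum(dim=1).mean()

def total_loss(target_feats, ref_logits, ref_labels, lam=0.1):
    # lam balances descriptiveness (cross-entropy) against compactness; illustrative value.
    return F.cross_entropy(ref_logits, ref_labels) + lam * compactness_loss(target_feats)
```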
Article
Full-text available
Surveillance videos are able to capture a variety of realistic anomalies. In this paper, we propose to learn anomalies by exploiting both normal and anomalous videos. To avoid annotating the anomalous segments or clips in training videos, which is very time consuming, we propose to learn anomaly through the deep multiple instance ranking framework by leveraging weakly labeled training videos, i.e., the training labels (anomalous or normal) are at video-level instead of clip-level. In our approach, we consider normal and anomalous videos as bags and video segments as instances in multiple instance learning (MIL), and automatically learn a deep anomaly ranking model that predicts high anomaly scores for anomalous video segments. Furthermore, we introduce sparsity and temporal smoothness constraints in the ranking loss function to better localize anomaly during training. We also introduce a new large-scale, first-of-its-kind dataset of 128 hours of videos. It consists of 1900 long and untrimmed real-world surveillance videos, with 13 realistic anomalies such as fighting, road accident, burglary, robbery, etc. as well as normal activities. This dataset can be used for two tasks. First, general anomaly detection considering all anomalies in one group and all normal activities in another group. Second, for recognizing each of 13 anomalous activities. Our experimental results show that our MIL method for anomaly detection achieves significant improvement on anomaly detection performance as compared to the state-of-the-art approaches. We provide the results of several recent deep learning baselines on anomalous activity recognition. The low recognition performance of these baselines reveals that our dataset is very challenging and opens more opportunities for future work.
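The ranking objective can be sketched directly from this description (my reading, not the released code): a hinge loss between the top instance score of an anomalous bag and that of a normal bag, plus temporal-smoothness and sparsity penalties over the anomalous bag's segment scores. The constraint weights below are illustrative.

```python
import torch

def mil_ranking_loss(scores_anom, scores_norm, l_smooth=8e-5, l_sparse=8e-5):
    # scores_*: (T,) anomaly scores for the T segments (instances) of each bag.
    hinge = torch.relu(1.0 - scores_anom.max() + scores_norm.max())
    smooth = ((scores_anom[1:] - scores_anom[:-1]) ** 2).sum()  # temporal smoothness
    sparse = scores_anom.sum()                                   # few segments are anomalous
    return hinge + l_smooth * smooth + l_sparse * sparse

loss = mil_ranking_loss(torch.rand(32, requires_grad=True), torch.rand(32))
```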
Conference Paper
Full-text available
The purpose of this study is to determine whether current video datasets have sufficient data for training very deep convolutional neural networks (CNNs) with spatio-temporal three-dimensional (3D) kernels. Recently, the performance levels of 3D CNNs in the field of action recognition have improved significantly. However, to date, conventional research has only explored relatively shallow 3D architectures. We examine the architectures of various 3D CNNs from relatively shallow to very deep ones on current video datasets. Based on the results of those experiments, the following conclusions could be obtained: (i) ResNet-18 training resulted in significant overfitting for UCF-101, HMDB-51, and ActivityNet but not for Kinetics. (ii) The Kinetics dataset has sufficient data for training deep 3D CNNs, enabling training of ResNets of up to 152 layers, interestingly similar to 2D ResNets on ImageNet; ResNeXt-101 achieved 78.4% average accuracy on the Kinetics test set. (iii) Simple 3D architectures pretrained on Kinetics outperform complex 2D architectures, and the pretrained ResNeXt-101 achieved 94.5% and 70.2% on UCF-101 and HMDB-51, respectively. The use of 2D CNNs trained on ImageNet has produced significant progress in various image tasks. We believe that using deep 3D CNNs together with Kinetics will retrace the successful history of 2D CNNs and ImageNet, and stimulate advances in computer vision for videos. The codes and pretrained models used in this study are publicly available. https://github.com/kenshohara/3D-ResNets-PyTorch
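For readers unfamiliar with spatio-temporal 3D kernels, a short PyTorch snippet shows the input/output layout such networks operate on; the kernel size, strides, and clip length are illustrative, not the paper's exact stem configuration.

```python
import torch
import torch.nn as nn

# A 3D convolution slides over time as well as space: input is (B, C, T, H, W).
conv3d = nn.Conv3d(3, 64, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3))
clip = torch.randn(1, 3, 16, 112, 112)   # a 16-frame RGB clip
out = conv3d(clip)                        # (1, 64, 16, 56, 56): time kept, space halved
```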
Conference Paper
Full-text available
We propose a novel framework for abnormal event detection in video that is based on deep features extracted with pre-trained convolutional neural networks (CNN). The CNN features are fed into a one-class Support Vector Machine (SVM) classifier in order to learn a model of normality from training data. We compare our approach with several state-of-the-art methods on two benchmark data sets, namely the Avenue data set and the UMN data set. The empirical results indicate that our abnormal event detection framework can reach state-of-the-art results, while running in real-time at 20 frames per second.
Article
Full-text available
We propose a novel framework for abnormal event detection in video that requires no training sequences. Our framework is based on unmasking, a technique previously used for authorship verification in text documents, which we adapt to our task. We iteratively train a binary classifier to distinguish between two consecutive video sequences while removing at each step the most discriminant features. Higher training accuracy rates of the intermediately obtained classifiers represent abnormal events. To the best of our knowledge, this is the first work to apply unmasking for a computer vision task. We compare our method with several state-of-the-art supervised and unsupervised methods on four benchmark data sets. The empirical results indicate that our abnormal event detection framework can achieve state-of-the-art results, while running in real-time at 20 frames per second.
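A small scikit-learn sketch of the unmasking loop as described above (window sizes, round counts, and the choice of linear classifier are assumptions): repeatedly separate two consecutive windows with a linear classifier, drop the most discriminant features each round, and read off the accuracy curve.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def unmasking_curve(win_a, win_b, rounds=10, drop=5):
    X = np.vstack([win_a, win_b])
    y = np.r_[np.zeros(len(win_a)), np.ones(len(win_b))]
    keep = np.arange(X.shape[1])
    accs = []
    for _ in range(rounds):
        clf = LogisticRegression(max_iter=1000).fit(X[:, keep], y)
        accs.append(clf.score(X[:, keep], y))            # training accuracy this round
        top = np.argsort(np.abs(clf.coef_[0]))[-drop:]   # most discriminant features
        keep = np.delete(keep, top)                      # unmask: remove them
    return np.array(accs)  # a high, slowly decaying curve suggests an abnormal transition

curve = unmasking_curve(np.random.randn(30, 100), np.random.randn(30, 100))
```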
Article
Full-text available
We present a novel unsupervised deep learning framework for anomalous event detection in complex video scenes. While most existing works merely use hand-crafted appearance and motion features, we propose Appearance and Motion DeepNet (AMDN) which utilizes deep neural networks to automatically learn feature representations. To exploit the complementary information of both appearance and motion patterns, we introduce a novel double fusion framework, combining both the benefits of traditional early fusion and late fusion strategies. Specifically, stacked denoising autoencoders are proposed to separately learn both appearance and motion features as well as a joint representation (early fusion). Based on the learned representations, multiple one-class SVM models are used to predict the anomaly scores of each input, which are then integrated with a late fusion strategy for final anomaly detection. We evaluate the proposed method on two publicly available video surveillance datasets, showing competitive performance with respect to state-of-the-art approaches.
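A minimal denoising-autoencoder sketch of the kind AMDN stacks (layer sizes are illustrative, and the double-fusion machinery is omitted): corrupt the input, reconstruct the clean version, and use the bottleneck as a learned appearance or motion feature.

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, d_in=1024, d_hid=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, d_hid), nn.Sigmoid())
        self.dec = nn.Linear(d_hid, d_in)

    def forward(self, x, noise=0.1):
        x_noisy = x + noise * torch.randn_like(x)   # corrupt the input
        h = self.enc(x_noisy)                        # learned feature representation
        return self.dec(h), h

ae = DenoisingAE()
x = torch.randn(16, 1024)                            # stand-in appearance/motion vectors
recon, feat = ae(x)
loss = ((recon - x) ** 2).mean()                     # reconstruct the *clean* input
```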
Conference Paper
Full-text available
Visual object tracking is challenging as target objects often undergo significant appearance changes caused by deformation, abrupt motion, background clutter and occlusion. In this paper, we exploit features extracted from deep convolutional neural networks trained on object recognition datasets to improve tracking accuracy and robustness. The outputs of the last convolutional layers encode the semantic information of targets and such representations are robust to significant appearance variations. However, their spatial resolution is too coarse to precisely localize targets. In contrast, earlier convolutional layers provide more precise localization but are less invariant to appearance changes. We interpret the hierarchies of convolutional layers as a non-linear counterpart of an image pyramid representation and exploit these multiple levels of abstraction for visual tracking. Specifically, we adaptively learn correlation filters on each convolutional layer to encode the target appearance. We hierarchically infer the maximum response of each layer to locate targets. Extensive experimental results on a large-scale benchmark dataset show that the proposed algorithm performs favorably against state-of-the-art methods.
Article
Full-text available
We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network, a corresponding decoder network followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network. The role of the decoder network is to map the low resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the fully convolutional network (FCN) architecture and its variants. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance. The design of SegNet was primarily motivated by road scene understanding applications. Hence, it is efficient both in terms of memory and computational time during inference. It is also significantly smaller in the number of trainable parameters than competing architectures and can be trained end-to-end using stochastic gradient descent. We also benchmark the performance of SegNet on Pascal VOC12 salient object segmentation and the recent SUN RGB-D indoor scene understanding challenge. We show that SegNet provides competitive performance although it is significantly smaller than other architectures. We also provide a Caffe implementation of SegNet and a web demo at http://mi.eng.cam.ac.uk/projects/segnet/
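The decoder trick highlighted here maps directly onto standard PyTorch operations, as this minimal illustration shows: the encoder's max-pooling saves argmax indices, and the decoder's unpooling scatters values back to those positions, so no upsampling weights are learned.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.randn(1, 64, 32, 32)                     # an encoder feature map
pooled, idx = pool(x)                               # (1, 64, 16, 16) plus argmax indices
sparse = unpool(pooled, idx)                        # values restored at their argmax positions
dense = nn.Conv2d(64, 64, 3, padding=1)(sparse)     # trainable filters densify the sparse map
```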
Article
Due to object detection's close relationship with video analysis and image understanding, it has attracted much research attention in recent years. Traditional object detection methods are built on handcrafted features and shallow trainable architectures, and their performance easily stagnates even when complex ensembles combining multiple low-level image features with high-level context from object detectors and scene classifiers are constructed. With the rapid development of deep learning, more powerful tools, which are able to learn semantic, high-level, deeper features, have been introduced to address the problems of traditional architectures. These models differ in network architecture, training strategy, and optimization function. In this paper, we provide a review of deep learning-based object detection frameworks. Our review begins with a brief introduction to the history of deep learning and its representative tool, the convolutional neural network. Then, we focus on typical generic object detection architectures along with modifications and useful tricks that further improve detection performance. As distinct detection tasks exhibit different characteristics, we also briefly survey several specific tasks, including salient object detection, face detection, and pedestrian detection. Experimental analyses are provided to compare various methods and draw meaningful conclusions. Finally, several promising directions and tasks are outlined to serve as guidelines for future work in both object detection and relevant neural network-based learning systems.
Article
Multi-modal tasks like visual question answering (VQA) are an important step towards human-level artificial intelligence. In general, the input of the VQA task consists of an image and a related question. In order to correctly answer the question, a model needs to extract and integrate useful information from both the image and the question. In this paper, we propose a model named AnswerNet to tackle this task. In the proposed model, discriminative features are extracted from both the image and the question. Specifically, high-level image features are extracted by the state-of-the-art convolutional neural network, i.e., Deep Residual Net. For question features, the semantic representations of the question and the term frequencies of the distinct words are captured by long short-term memory network and bag-of-words model, respectively. Then, a hierarchical fusion network is proposed to effectively fuse the image features with the question features. Experimental results on three large-scale datasets, VQA, COCO-QA, and VQA2, demonstrate the effectiveness of the proposed AnswerNet.
Article
Most existing object detection algorithms are trained based upon a set of fully annotated object regions or bounding boxes, which are typically labor-intensive. On the contrary, nowadays there is a significant amount of image-level annotations cheaply available on the Internet. It is hence a natural thought to explore such "weak" supervision to benefit the training of object detectors. In this paper, we propose a novel scheme to perform weakly supervised object localization, termed object-specific pixel gradient (OPG). The OPG is trained by using image-level annotations alone, which performs in an iterative manner to localize potential objects in a given image robustly and efficiently. In particular, we first extract an OPG map to reveal the contributions of individual pixels to a given object category, upon which an iterative mining scheme is further introduced to extract instances or components of this object. Moreover, a novel average and max pooling layer is introduced to improve the localization accuracy. In the task of weakly supervised object localization, the OPG achieves a state-of-the-art 44.5% top-5 error on ILSVRC 2013, which outperforms competing methods, including Oquab et al. and region-based convolutional neural networks on the Pascal VOC 2012, with gains of 2.6% and 2.3%, respectively. In the task of object detection, OPG achieves a comparable performance of 27.0% mean average precision on Pascal VOC 2007. In all experiments, the OPG only adopts the off-the-shelf pretrained CNN model, without using any object proposals. Therefore, it also significantly improves the detection speed, i.e., achieving three times faster compared with the state-of-the-art method.
Article
Anomaly detection in videos refers to the identification of events that do not conform to expected behavior. However, almost all existing methods tackle the problem by minimizing the reconstruction errors of training data, which cannot guarantee a larger reconstruction error for an abnormal event. In this paper, we propose to tackle the anomaly detection problem within a video prediction framework. To the best of our knowledge, this is the first work that leverages the difference between a predicted future frame and its ground truth to detect an abnormal event. To predict a future frame with higher quality for normal events, other than the commonly used appearance (spatial) constraints on intensity and gradient, we also introduce a motion (temporal) constraint in video prediction by enforcing the optical flow between predicted frames and ground truth frames to be consistent, and this is the first work that introduces a temporal constraint into the video prediction task. Such spatial and motion constraints facilitate future frame prediction for normal events, and consequently help to identify abnormal events that do not conform to the expectation. Extensive experiments on both a toy dataset and publicly available datasets validate the effectiveness of our method in terms of robustness to the uncertainty of normal events and sensitivity to abnormal events.
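A hedged PyTorch sketch of the appearance constraints described above: an intensity (L2) term plus a gradient-difference term between predicted and ground-truth frames. The additional optical-flow (motion) constraint requires a flow network, so it is only indicated by a placeholder comment here.

```python
import torch

def intensity_loss(pred, gt):
    return ((pred - gt) ** 2).mean()

def gradient_loss(pred, gt):
    # Penalise differences between horizontal/vertical image gradients.
    dx = lambda t: (t[..., :, 1:] - t[..., :, :-1]).abs()
    dy = lambda t: (t[..., 1:, :] - t[..., :-1, :]).abs()
    return (dx(pred) - dx(gt)).abs().mean() + (dy(pred) - dy(gt)).abs().mean()

def frame_loss(pred, gt, l_grad=1.0):
    # In the full method, a flow term comparing flow(pred, prev) to flow(gt, prev)
    # would be added here; l_grad is an illustrative weight.
    return intensity_loss(pred, gt) + l_grad * gradient_loss(pred, gt)

loss = frame_loss(torch.rand(1, 3, 64, 64, requires_grad=True), torch.rand(1, 3, 64, 64))
```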
Article
We present ClusterSVDD, a methodology that unifies support vector data descriptions (SVDDs) and k-means clustering into a single formulation. This allows both methods to benefit from one another, i.e., by adding flexibility using multiple spheres for SVDDs and increasing anomaly resistance and flexibility through kernels for k-means. In particular, our approach leads to a new interpretation of k-means as a regularized mode seeking algorithm. The unifying formulation further allows for deriving new algorithms by transferring knowledge from one-class learning settings to clustering settings and vice versa. As a showcase, we derive a clustering method for structured data based on a one-class learning scenario. Additionally, our formulation can be solved via a particularly simple optimization scheme. We evaluate our approach empirically to highlight some of the proposed benefits on artificially generated data, as well as on real-world problems, and provide a Python software package comprising various implementations of primal and dual SVDD as well as our proposed ClusterSVDD.
Article
Deep representation learning has been widely applied in action recognition. However, there have been few investigations of how to utilize the structural manifold information among different action videos to enhance recognition accuracy and efficiency. In this paper, we propose to incorporate the manifold of training samples into deep learning, which we define as deep manifold learning (DML). The proposed DML framework can be adapted to most existing deep networks to learn more discriminative features for action recognition. When applied to a convolutional neural network, DML embeds the previous convolutional layer's manifold into the next convolutional layer; thus, the discriminative capacity of the next layer is promoted. We also apply DML to a restricted Boltzmann machine, which can alleviate the overfitting problem. Experimental results on four standard action databases (i.e., UCF101, HMDB51, KTH, and UCF sports) show that the proposed method outperforms the state-of-the-art methods.
Technical Report
TensorFlow [1] is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.
Conference Paper
Learning to predict future images from a video sequence involves the construction of an internal representation that models the image evolution accurately, and therefore, to some degree, its content and dynamics. This is why pixel-space video prediction is viewed as a promising avenue for unsupervised feature learning. In this work, we train a convolutional network to generate future frames given an input sequence. To deal with the inherently blurry predictions obtained from the standard Mean Squared Error (MSE) loss function, we propose three different and complementary feature learning strategies: a multi-scale architecture, an adversarial training method, and an image gradient difference loss function. We compare our predictions to different published results based on recurrent neural networks on the UCF101 dataset.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Article
Over the past decade, video anomaly detection has been explored with remarkable results. However, research on methodologies suitable for online performance is still very limited. In this paper, we present an online framework for video anomaly detection. The key aspect of our framework is a compact set of highly descriptive features, which is extracted from a novel cell structure that helps to define support regions in a coarse-to-fine fashion. Based on the scene’s activity, only a limited number of support regions are processed, thus limiting the size of the feature set. Specifically, we use foreground occupancy and optical flow features. The framework uses an inference mechanism that evaluates the compact feature set via Gaussian Mixture Models, Markov Chains and Bag-of-Words in order to detect abnormal events. Our framework also considers the joint response of the models in the local spatio-temporal neighborhood to increase detection accuracy. We test our framework on popular existing datasets and on a new dataset comprising a wide variety of realistic videos captured by surveillance cameras. This particular dataset includes surveillance videos depicting criminal activities, car accidents and other dangerous situations. Evaluation results show that our framework outperforms other online methods and attains a very competitive detection performance compared to state-of-the-art non-online methods.
Article
Anomaly detection is still a challenging task for video surveillance due to complex environments and unpredictable human behaviors. Most existing approaches train offline detectors using manually labeled data and predefined parameters, and struggle to model changing scenes. This paper introduces a neural network based model called online Growing Neural Gas (online GNG) to perform unsupervised learning. Unlike a parameter-fixed GNG, our model updates learning parameters continuously, for which we propose several online neighbor-related strategies. Specific operations, namely neuron insertion, deletion, learning rate adaptation and stopping criteria selection, are upgraded to online modes. In the anomaly detection stage, behavior patterns far away from our model are labeled as anomalous, where "far away" is measured against a time-varying threshold. Experiments are implemented on three surveillance datasets, namely UMN, UCSD Ped1/Ped2 and Avenue. All datasets have changing scenes due to mutable crowd density and behavior types. Anomaly detection results show that our model can adapt to the current scene rapidly and reduce false alarms while still detecting most anomalies. Quantitative comparisons with 12 recent approaches further confirm our superiority.
Article
In this work, we propose an unsupervised approach for crowd scene anomaly detection and localization using a social network model. Using a window-based approach, a video scene is first partitioned at spatial and temporal levels, and a set of spatio-temporal cuboids is constructed. Objects exhibiting scene dynamics are detected and the crowd behavior in each cuboid is modeled using local social networks (LSN). From these local social networks, a global social network (GSN) is built for the current window to represent the global behavior of the scene. As the scene evolves with time, the global social network is updated accordingly using LSNs, to detect and localize abnormal behaviors. We demonstrate the effectiveness of the proposed Social Network Model (SNM) approach on a set of benchmark crowd analysis video sequences. The experimental results reveal that the proposed method outperforms the majority, if not all, of the state-of-the-art methods in terms of accuracy of anomaly detection.
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
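The residual-learning idea reduces to a few lines, sketched here in PyTorch as a generic basic block (not the paper's exact configuration): the block learns a residual F(x) and outputs F(x) + x through an identity shortcut.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return torch.relu(self.body(x) + x)  # identity shortcut: learn only the residual

y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))  # shape is preserved: (1, 64, 56, 56)
```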
Article
Clustering is central to many data-driven application domains and has been studied extensively in terms of distance functions and grouping algorithms. Relatively little work has focused on learning representations for clustering. In this paper, we propose Deep Embedded Clustering (DEC), a method that simultaneously learns feature representations and cluster assignments using deep neural networks. DEC learns a mapping from the data space to a lower-dimensional feature space in which it iteratively optimizes a clustering objective. Our experimental evaluations on image and text corpora show significant improvement over state-of-the-art methods.
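DEC's core step follows directly from the paper's formulation: soft assignments from a Student's-t kernel between embedded points and cluster centres, and a sharpened target distribution that drives self-training via KL divergence. A compact PyTorch sketch (embedding dimensions and cluster count are illustrative):

```python
import torch

def soft_assign(z, mu):                       # z: (N, D) embeddings, mu: (K, D) centres
    d2 = torch.cdist(z, mu) ** 2
    q = 1.0 / (1.0 + d2)                      # Student's t with one degree of freedom
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    p = q ** 2 / q.sum(dim=0, keepdim=True)   # sharpen, normalise by cluster frequency
    return p / p.sum(dim=1, keepdim=True)

q = soft_assign(torch.randn(100, 10), torch.randn(5, 10))
p = target_distribution(q).detach()           # fixed target for the current iteration
kl = (p * (p.log() - q.log())).sum(dim=1).mean()  # KL(P || Q), minimised w.r.t. the encoder
```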
Article
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
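The per-mini-batch computation is easy to verify numerically: normalise each feature by the batch mean and (biased) variance, then apply a learned scale and shift. The snippet below checks a manual implementation against PyTorch's BatchNorm1d, assuming its default epsilon.

```python
import torch

x = torch.randn(32, 8)                        # a mini-batch of 8-dimensional features
mu, var = x.mean(dim=0), x.var(dim=0, unbiased=False)
gamma, beta = torch.ones(8), torch.zeros(8)   # learned affine parameters (initial values)
x_hat = (x - mu) / torch.sqrt(var + 1e-5)     # normalise per feature
manual = gamma * x_hat + beta

bn = torch.nn.BatchNorm1d(8)                  # fresh layer in training mode: gamma=1, beta=0
assert torch.allclose(manual, bn(x), atol=1e-5)
```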
Conference Paper
Real-time anomaly detection is the need of the hour for security applications. In this paper, we propose a real-time anomaly detection algorithm that utilizes cues from motion vectors in the H.264/AVC compressed domain. The work is principally motivated by the observation that motion vectors (MVs) exhibit different characteristics during anomalies. We have observed that H.264 motion vector magnitude contains relevant information which can be used to model usual behavior (UB) effectively. This is subsequently extended to detect abnormality/anomaly based on the probability of occurrence of a behavior. Additionally, we suggest a hierarchical approach through a Motion Pyramid for high-resolution videos to further increase the detection rate. The proposed algorithm performs extremely well on the UMN and Peds anomaly detection video datasets, with detection speeds of >150 and 65-75 frames per second on the respective datasets, a more than 200× speedup, along with accuracy comparable to pixel-domain state-of-the-art algorithms.
Article
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions. The method is straightforward to implement and is based on adaptive estimates of lower-order moments of the gradients. The method is computationally efficient, has modest memory requirements and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The method exhibits invariance to diagonal rescaling of the gradients by adapting to the geometry of the objective function. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. We demonstrate that Adam works well in practice when experimentally compared to other stochastic optimization methods.
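The update rule is compact enough to write out. Below is a NumPy sketch with the commonly used default hyperparameters: bias-corrected first- and second-moment estimates drive a per-parameter adaptive step.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad             # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad ** 2        # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)                # bias corrections (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):                      # minimise ||theta - 1||^2 as a toy objective
    theta, m, v = adam_step(theta, 2 * (theta - 1), m, v, t)
```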
Article
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Article
Recently, Dufrenois and Noyer proposed a one-class Fisher's linear discriminant to isolate normal data from outliers. In this paper, a kernelized version of their criterion is presented. Originally on the basis of an iterative optimization process, alternating between subspace selection and clustering, I show here that their criterion has an upper bound making these two problems independent. In particular, the estimation of the label vector is formulated as an unconstrained binary linear problem (UBLP) which can be solved using an iterative perturbation method. Once the label vector is estimated, an optimal projection subspace is obtained by solving a generalized eigenvalue problem. Like many other kernel methods, the performance of the proposed approach depends on the choice of the kernel. Constructed with a Gaussian kernel, I show that the proposed contrast measure is an efficient indicator for selecting an optimal kernel width. This property simplifies the model selection problem which is typically solved by costly (generalized) cross-validation procedures. Initialization, convergence analysis, and computational complexity are also discussed. Lastly, the proposed algorithm is compared with recent novelty detectors on synthetic and real data sets.
Article
We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to incorporate into the network design aspects of the best performing hand-crafted features. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it matches the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.
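A hedged sketch of the two-stream layout using off-the-shelf torchvision backbones (the backbone choice, class count, and 10-frame flow stack are illustrative, not the paper's exact architecture): one network sees an RGB frame, the other a stack of optical-flow fields, and class scores are fused late by averaging.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

spatial = resnet18(num_classes=101)       # appearance stream on a single RGB frame
temporal = resnet18(num_classes=101)      # motion stream on stacked flow fields
# Flow stack: 10 flow frames x 2 channels (x/y) = 20 input channels instead of 3.
temporal.conv1 = nn.Conv2d(20, 64, kernel_size=7, stride=2, padding=3, bias=False)

rgb = torch.randn(1, 3, 224, 224)
flow = torch.randn(1, 20, 224, 224)
probs = (spatial(rgb).softmax(1) + temporal(flow).softmax(1)) / 2   # late fusion
```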
Conference Paper
Speedy abnormal event detection meets the growing demand to process an enormous number of surveillance videos. Based on the inherent redundancy of video structures, we propose an efficient sparse combination learning framework. It achieves decent performance in the detection phase without compromising result quality. The short running time is guaranteed because the new method effectively turns the original complicated problem into one in which only a few inexpensive small-scale least-squares optimization steps are involved. Our method reaches high detection rates on benchmark datasets at a speed of 140-150 frames per second on average when computing on an ordinary desktop PC using MATLAB.
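The testing step can be sketched as a handful of closed-form least-squares solves, which is where the speed comes from (my reading of the description): each learned "combination" is a small basis, and a feature is normal if at least one basis reconstructs it cheaply. The bases and threshold below are random stand-ins for the ones learned from training video.

```python
import numpy as np

combinations = [np.random.randn(64, 8) for _ in range(16)]   # stand-ins for learned bases D_i

def min_reconstruction_error(x, bases):
    errs = []
    for D in bases:
        beta, *_ = np.linalg.lstsq(D, x, rcond=None)         # argmin_beta ||D beta - x||
        errs.append(np.sum((D @ beta - x) ** 2))
    return min(errs)                                          # best combination wins

x = np.random.randn(64)                                       # a test feature
is_abnormal = min_reconstruction_error(x, combinations) > 5.0  # illustrative threshold
```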