Preprint

Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model


Abstract

Video anomaly detection (VAD) has attracted increasing attention due to its potential applications. Its current dominant tasks focus on detecting anomalies online at the frame level, which can be roughly interpreted as binary or multi-class event classification. However, such a setup, which ties complicated anomalous events to single labels, e.g., "vandalism", is superficial, since single labels cannot fully characterize anomalous events. In reality, users tend to search for a specific video rather than a series of approximate videos. Therefore, retrieving anomalous events using detailed descriptions is practical and valuable, but little research has focused on it. In this context, we propose a novel task called Video Anomaly Retrieval (VAR), which aims to pragmatically retrieve relevant anomalous videos via cross-modal queries, e.g., language descriptions and synchronous audio. Unlike current video retrieval, where videos are assumed to be temporally well-trimmed and of short duration, VAR is devised to retrieve long untrimmed videos that may be only partially relevant to the given query. To achieve this, we present two large-scale VAR benchmarks, UCFCrime-AR and XDViolence-AR, constructed on top of prevalent anomaly datasets. Meanwhile, we design a model called Anomaly-Led Alignment Network (ALAN) for VAR. In ALAN, we propose anomaly-led sampling to focus on key segments in long untrimmed videos. Then, we introduce an efficient pretext task to enhance semantic associations between fine-grained video and text representations. Besides, we leverage two complementary alignments to further match cross-modal content. Experimental results on the two benchmarks reveal the challenges of the VAR task and demonstrate the advantages of our tailored method.
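The abstract does not spell out how anomaly-led sampling is implemented; as a rough PyTorch sketch under that caveat, one plausible form keeps only the segments with the highest anomaly scores before cross-modal matching (the function name, the top-k rule, and the source of the scores are all assumptions, not the paper's code):

import torch

def anomaly_led_sampling(seg_feats: torch.Tensor,
                         anomaly_scores: torch.Tensor,
                         k: int = 8) -> torch.Tensor:
    """Keep the k segments with the highest anomaly scores.

    seg_feats:      (T, D) features of T segments from a long untrimmed video.
    anomaly_scores: (T,) per-segment anomaly probabilities, e.g. from a
                    weakly supervised VAD branch (an assumption here).
    Returns (k, D) features, restored to temporal order.
    """
    k = min(k, seg_feats.size(0))
    top = torch.topk(anomaly_scores, k).indices
    top, _ = torch.sort(top)  # keep the selected segments in temporal order
    return seg_feats[top]

# Example: reduce 64 segments of 512-d features to the 8 most anomalous.
key_feats = anomaly_led_sampling(torch.randn(64, 512), torch.rand(64), k=8)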


Conference Paper
Full-text available
Video Anomaly Detection (VAD) is an important topic in computer vision. Motivated by the recent advances in self-supervised learning, this paper addresses VAD by solving an intuitive yet challenging pretext task, i.e., spatio-temporal jigsaw puzzles, which is cast as a multi-label fine-grained classification problem. Our method exhibits several advantages over existing works: 1) the spatio-temporal jigsaw puzzles are decoupled in terms of spatial and temporal dimensions, responsible for capturing highly discriminative appearance and motion features, respectively; 2) full permutations are used to provide abundant jigsaw puzzles covering various difficulty levels, allowing the network to distinguish subtle spatio-temporal differences between normal and abnormal events; and 3) the pretext task is tackled in an end-to-end manner without relying on any pre-trained models. Our method outperforms state-of-the-art counterparts on three public benchmarks. On ShanghaiTech Campus in particular, the result surpasses reconstruction- and prediction-based methods by a large margin.
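As a hedged illustration of the temporal half of such a pretext task (the spatial jigsaw is analogous over image patches; all names here are assumptions rather than the paper's code), one can shuffle a short clip with one of the n! frame permutations and train a classifier to recover the permutation index:

import itertools
import random
import torch

# Full permutations of n frames give n! classes spanning easy to hard puzzles.
N_FRAMES = 4
PERMS = list(itertools.permutations(range(N_FRAMES)))

def make_temporal_puzzle(clip: torch.Tensor):
    """clip: (T, C, H, W) with T == N_FRAMES.
    Returns the shuffled clip and the permutation index as the label."""
    label = random.randrange(len(PERMS))
    order = torch.tensor(PERMS[label])
    return clip[order], label

clip = torch.randn(N_FRAMES, 3, 64, 64)
puzzle, label = make_temporal_puzzle(clip)  # a classifier predicts `label`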
Article
Full-text available
This paper attacks the challenging problem of video retrieval by text. In such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc queries described exclusively in the form of a natural-language sentence, with no visual example provided. Given videos as sequences of frames and queries as sequences of words, an effective sequence-to-sequence cross-modal matching is crucial. To that end, the two modalities need to be first encoded into real-valued vectors and then projected into a common space. In this paper we achieve this by proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own. Our novelty is twofold. First, different from prior art that resorts to a specific single-level encoder, the proposed network performs multi-level encoding that represents the rich content of both modalities in a coarse-to-fine fashion. Second, different from a conventional common space learning algorithm which is either concept based or latent space based, we introduce hybrid space learning which combines the high performance of the latent space and the good interpretability of the concept space. Dual encoding is conceptually simple, practically effective and end-to-end trained with hybrid space learning. Extensive experiments on four challenging video datasets show the viability of the new method.
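A minimal sketch of the latent-space half of this design, assuming simple linear projections and cosine scoring (the concept-space half would add a multi-label concept classifier on the same encodings; layer choices here are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpaceHead(nn.Module):
    """Project video and text encodings into a shared latent space and
    score pairs by cosine similarity."""
    def __init__(self, vid_dim: int, txt_dim: int, dim: int = 1024):
        super().__init__()
        self.vid_proj = nn.Linear(vid_dim, dim)
        self.txt_proj = nn.Linear(txt_dim, dim)

    def forward(self, vid_feat, txt_feat):
        # vid_feat: (B_v, vid_dim), txt_feat: (B_t, txt_dim)
        v = F.normalize(self.vid_proj(vid_feat), dim=-1)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return v @ t.t()  # (B_v, B_t) similarity matrix for ranking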
Article
Full-text available
The problem of inferring nonlinear and complex dynamical systems from available data is prominent in many fields, including engineering, biological, social, physical, and computer sciences. Many evolutionary algorithm (EA) based network reconstruction methods have been proposed to address this problem, but they ignore useful information about network structure, such as community structure, which widely exists in various complex networks. Inspired by community structure, this paper develops a community-based evolutionary multi-objective network reconstruction framework, referred to as CEMO-NR, to improve the reconstruction performance of EA-based methods. CEMO-NR is a generic framework, and any population-based multi-objective metaheuristic algorithm can be employed as the base optimizer. CEMO-NR uses the community structure of networks to divide the original decision space into multiple small decision spaces; any multi-objective evolutionary algorithm (MOEA) can then search for improved solutions in the reduced decision space. To verify the performance of CEMO-NR, this paper also designs a test suite for complex network reconstruction problems. Three representative MOEAs are embedded into CEMO-NR and compared with their original versions. The experimental results demonstrate significant improvements from the proposed CEMO-NR on 30 multi-objective network reconstruction problems.
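The decomposition step can be pictured as follows; this is a toy sketch with networkx that detects communities on a rough initial network estimate and groups decision variables by community, not the paper's exact procedure:

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def decompose_decision_space(initial_edges):
    """Split nodes into communities of a rough initial reconstruction so
    that each group of decision variables can be optimized by a separate
    MOEA run on a much smaller subspace.

    initial_edges: iterable of (u, v) edges of an initial network estimate.
    Returns a list of node sets, one per community.
    """
    G = nx.Graph(initial_edges)
    return [set(c) for c in greedy_modularity_communities(G)]

# Toy example with two obvious communities joined by one bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(decompose_decision_space(edges))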
Conference Paper
Full-text available
There have been a few recent methods for text-to-video moment retrieval using natural language queries, but they require full supervision during training. However, acquiring a large number of training videos with temporal boundary annotations for each text description is extremely time-consuming and often not scalable. To cope with this issue, in this work, we introduce the problem of learning from weak labels for the task of text-to-video moment retrieval. The supervision is weak because, during training, we only have access to video-text pairs rather than the temporal extent of the video to which different text descriptions relate. We propose a joint visual-semantic embedding framework that learns the notion of relevant segments from video using only video-level sentence descriptions. Specifically, our main idea is to utilize latent alignment between video frames and sentence descriptions using Text-Guided Attention (TGA). TGA is then used during the test phase to retrieve relevant moments. Experiments on two benchmark datasets demonstrate that our method achieves comparable performance to state-of-the-art fully supervised approaches.
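A minimal sketch of what TGA-style latent alignment could look like, assuming cosine-similarity attention in a joint embedding space (the paper's exact formulation may differ):

import torch
import torch.nn.functional as F

def text_guided_attention(frame_feats: torch.Tensor, sent_feat: torch.Tensor):
    """Weigh video frames by their relevance to a sentence embedding.

    frame_feats: (T, D) per-frame features, already in a joint space.
    sent_feat:   (D,) sentence embedding.
    Returns the attention weights (T,) and the attended video feature (D,).
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    sent_feat = F.normalize(sent_feat, dim=-1)
    attn = torch.softmax(frame_feats @ sent_feat, dim=0)  # (T,) frame weights
    video_feat = attn @ frame_feats                       # (D,) weighted pool
    return attn, video_feat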
Conference Paper
Full-text available
Ad-hoc video search (AVS) is an important yet challenging problem in multimedia retrieval. Different from previous concept-based methods, we propose a fully deep learning method for query representation learning. The proposed method requires no explicit concept modeling, matching, and selection. The backbone of our method is the proposed W2VV++ model, a super version of Word2VisualVec (W2VV) previously developed for visual-to-text matching. W2VV++ is obtained by tweaking W2VV with a better sentence encoding strategy and an improved triplet ranking loss. With these simple yet important changes, W2VV++ brings in a substantial improvement. As our participation in the TRECVID 2018 AVS task and retrospective experiments on the TRECVID 2016 and 2017 data show, our best single model, with an overall inferred average precision (infAP) of 0.157, outperforms the state-of-the-art. The performance can be further boosted by model ensemble using late average fusion, reaching a higher infAP of 0.163. With W2VV++, we establish a new baseline for ad-hoc video search.
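The improved triplet ranking loss is not detailed in this abstract; a common formulation it likely resembles is the max-margin loss with hardest in-batch negatives, sketched here under that assumption:

import torch

def hard_triplet_ranking_loss(sim: torch.Tensor, margin: float = 0.2):
    """Max-margin ranking loss with hardest in-batch negatives.

    sim: (B, B) similarity matrix; sim[i, j] compares video i with text j,
         so matched pairs sit on the diagonal.
    """
    pos = sim.diag()                                       # (B,) positives
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(mask, float('-inf'))
    # Hardest negative text per video and hardest negative video per text.
    loss_v = (margin + neg.max(dim=1).values - pos).clamp(min=0)
    loss_t = (margin + neg.max(dim=0).values - pos).clamp(min=0)
    return (loss_v + loss_t).mean()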
Article
Lately, video-language pre-training and text-video retrieval have attracted significant attention with the explosion of multimedia data on the Internet. However, existing approaches for video-language pre-training typically limit the exploitation of the hierarchical semantic information in videos, such as frame-level and global video-level semantics. In this work, we present an end-to-end pre-training network with Hierarchical Matching and Momentum Contrast, named HMMC. The key idea is to explore the hierarchical semantic information in videos via multilevel semantic matching between videos and texts. This design is motivated by the observation that if a video semantically matches a text (which can be a title, tag, or caption), the frames in this video usually have semantic connections with the text and show higher similarity than frames in other videos. Hierarchical matching is mainly realized by two proxy tasks: Video-Text Matching (VTM) and Frame-Text Matching (FTM). A further proxy task, Frame Adjacency Matching (FAM), is proposed to enhance the single visual modality representations while training from scratch. Furthermore, the momentum contrast framework is introduced into HMMC to form a multimodal momentum contrast framework, enabling HMMC to incorporate more negative samples for contrastive learning, which contributes to the generalization of representations. We also collected a large-scale Chinese video-language dataset (over 763k unique videos), named CHVTT, to explore the multilevel semantic connections between videos and texts. Experimental results on two major text-video retrieval benchmark datasets demonstrate the advantages of our methods. We release our code at https://github.com/cheetah003/HMMC .
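As a hedged sketch of the momentum contrast machinery referenced here (MoCo-style; the paper's multimodal variant builds more on top), the two core pieces are an EMA-updated key encoder and a fixed-size queue of negatives:

import torch

@torch.no_grad()
def momentum_update(key_encoder, query_encoder, m: float = 0.999):
    """EMA update of the key encoder from the query encoder."""
    for k, q in zip(key_encoder.parameters(), query_encoder.parameters()):
        k.data.mul_(m).add_(q.data, alpha=1 - m)

@torch.no_grad()
def enqueue(queue: torch.Tensor, keys: torch.Tensor, ptr: int) -> int:
    """Replace the oldest entries of a fixed-size negative queue.
    queue: (K, D) buffer; keys: (B, D) new keys; ptr: current write offset."""
    K, B = queue.size(0), keys.size(0)
    idx = torch.arange(ptr, ptr + B) % K
    queue[idx] = keys
    return (ptr + B) % K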
Article
For weakly supervised anomaly detection, most existing work is limited by inadequate video representations due to the inability to model long-term contextual information. To solve this, we propose a novel weakly supervised adaptive graph convolutional network (WAGCN) to model the complex contextual relationships among video segments. In this way, the influence of other video segments on the current one is fully considered when generating the anomaly probability score for each segment. First, we combine the temporal consistency and feature similarity of video segments to construct a global graph, which makes full use of the associations among the spatio-temporal features of anomalous events in videos. Second, we propose a graph learning layer that extracts the graph adjacency matrix adaptively and effectively from the data, breaking the limitation of manually set topologies. Extensive experiments on two public datasets (i.e., the UCF-Crime and ShanghaiTech datasets) demonstrate the effectiveness of our approach, which achieves state-of-the-art performance.
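A minimal sketch of how such a global graph could be assembled, assuming the two branches are simply averaged and row-normalized (the paper's learned adjacency will differ):

import torch
import torch.nn.functional as F

def build_adjacency(seg_feats: torch.Tensor, sigma: float = 1.0):
    """Combine feature similarity and temporal proximity into one
    adjacency matrix over video segments.

    seg_feats: (T, D) segment features.
    """
    f = F.normalize(seg_feats, dim=-1)
    sim = (f @ f.t()).clamp(min=0)              # feature-similarity branch
    t = torch.arange(seg_feats.size(0), dtype=torch.float)
    dist = (t[:, None] - t[None, :]).abs()
    temporal = torch.exp(-dist / sigma)         # temporal-consistency branch
    adj = (sim + temporal) / 2
    return adj / adj.sum(dim=1, keepdim=True)   # row-normalize for a GCN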
Article
The understanding of language plays a key role in video grounding, where a target moment is localized according to a text query. From a biological point of view, language is naturally hierarchical, with the main clause (predicate phrase) providing coarse semantics and modifiers providing detailed descriptions. In video grounding, moments described by the main clause may exist in multiple clips of a long video, including both the ground-truth and background clips. In order to correctly discriminate the ground-truth clip from the background ones, this co-existence therefore causes models to neglect the main clause and concentrate on the modifiers, which provide the discriminative information for distinguishing the target proposal from the others. We first demonstrate this phenomenon empirically, and then propose a Hierarchical Language Network (HLN) that exploits the language hierarchy, as well as a new learning approach called Multi-Instance Positive-Unlabelled Learning (MI-PUL), to alleviate the above problem. Specifically, in HLN, localization is performed on various layers of the language hierarchy, so that attention can be paid to different parts of the sentences rather than only the discriminative ones. Furthermore, MI-PUL allows the model to localize background clips that can possibly be described by the main clause, even without manual annotations. The union of the two proposed components thus enhances the learning of the main clause, which is of critical importance in video grounding. Finally, we show that HLN can be plugged into current methods and improve their performance. Extensive experiments on challenging datasets show that HLN significantly improves state-of-the-art methods, achieving a 6.15% gain in Recall@1 at IoU=0.5 on the TACoS dataset.
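MI-PUL's exact loss is not given in this abstract; a standard building block that positive-unlabeled training of clip scores could rest on is the non-negative PU risk of Kiryo et al. (2017), sketched here under that assumption:

import torch
import torch.nn.functional as F

def nn_pu_risk(pos_logits: torch.Tensor, unl_logits: torch.Tensor,
               prior: float = 0.3):
    """Non-negative positive-unlabeled risk estimator.

    pos_logits: scores of clips labeled positive (e.g. ground-truth clips).
    unl_logits: scores of unlabeled background clips.
    prior:      assumed fraction of true positives among unlabeled clips.
    """
    loss = lambda z, y: F.binary_cross_entropy_with_logits(
        z, torch.full_like(z, y))
    r_p_pos = loss(pos_logits, 1.0)   # positives treated as positive
    r_p_neg = loss(pos_logits, 0.0)   # positives treated as negative
    r_u_neg = loss(unl_logits, 0.0)   # unlabeled treated as negative
    return prior * r_p_pos + torch.clamp(r_u_neg - prior * r_p_neg, min=0.0)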
Article
Video moment retrieval with a text query aims to retrieve the most relevant segment from a whole video based on the given text query. It is a challenging cross-modal alignment task due to the huge gap between the visual and linguistic modalities and the noise introduced by the manual labeling of time segments. Most existing works only use language information in the cross-modal fusion stage, neglecting that language information also plays an important role in the retrieval stage. Besides, these works roughly compress the visual information in the video clips to reduce the computation cost, which loses subtle video information in long videos. In this paper, we propose a novel model termed Cross-modal Dynamic Networks (CDN), which dynamically generates convolution kernels from visual and language features. In the feature extraction stage, we also propose a frame selection module to capture the subtle video information in each video segment. In this way, CDN can reduce the impact of visual noise without significantly increasing the computation cost, leading to better video moment retrieval results. Experiments on two challenging datasets, i.e., Charades-STA and TACoS, show that our proposed CDN method outperforms a bundle of state-of-the-art methods with more accurately retrieved moment video clips. The implementation code and extensive instructions for our proposed CDN method are provided at https://github.com/CFM-MSG/Code_CDN .
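A minimal sketch of the dynamic-kernel idea, assuming a depthwise temporal kernel generated per sample from a fused query/visual feature (shapes and names are illustrative, not CDN's actual design):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv1d(nn.Module):
    """Generate a per-sample temporal convolution kernel from a fused
    feature and apply it depthwise to the segment sequence."""
    def __init__(self, dim: int, ksize: int = 3):
        super().__init__()
        self.ksize = ksize
        self.gen = nn.Linear(dim, ksize)  # one temporal kernel per sample

    def forward(self, seg_feats, fused_feat):
        # seg_feats: (B, D, T) segment sequence; fused_feat: (B, D)
        B, D, T = seg_feats.shape
        kernel = self.gen(fused_feat)                       # (B, ksize)
        kernel = kernel.unsqueeze(1).expand(B, D, self.ksize)
        kernel = kernel.reshape(B * D, 1, self.ksize)
        x = seg_feats.reshape(1, B * D, T)
        # groups=B*D applies each sample's kernel to its own channels.
        out = F.conv1d(x, kernel, padding=self.ksize // 2, groups=B * D)
        return out.reshape(B, D, T)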
Article
Dynamic neural networks are an emerging research topic in deep learning. Compared to static models, which have fixed computational graphs and parameters at the inference stage, dynamic networks can adapt their structures or parameters to different inputs, leading to notable advantages in terms of accuracy, computational efficiency, adaptiveness, etc. In this survey, we comprehensively review this rapidly developing area by dividing dynamic networks into three main categories: 1) sample-wise dynamic models that process each sample with data-dependent architectures or parameters; 2) spatial-wise dynamic networks that conduct adaptive computation with respect to different spatial locations of image data; and 3) temporal-wise dynamic models that perform adaptive inference along the temporal dimension for sequential data such as videos and texts. The important research problems of dynamic networks, e.g., architecture design, decision-making schemes, optimization techniques, and applications, are reviewed systematically. Finally, we discuss the open problems in this field together with interesting future research directions.
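A toy example of the sample-wise category, an early-exit classifier where confident samples skip the deeper stage (purely illustrative; all names and thresholds are made up for this sketch):

import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    """Sample-wise dynamic inference: easy inputs exit at the first head
    when its confidence clears a threshold; hard inputs pay for stage 2."""
    def __init__(self, dim: int = 128, classes: int = 10,
                 threshold: float = 0.9):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.head1 = nn.Linear(dim, classes)
        self.stage2 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.head2 = nn.Linear(dim, classes)
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, x):                        # x: (dim,) one sample
        h = self.stage1(x)
        p = torch.softmax(self.head1(h), dim=-1)
        if p.max() >= self.threshold:            # confident: exit early
            return p
        return torch.softmax(self.head2(self.stage2(h)), dim=-1)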
Conference Paper
In this paper, we introduce a novel task, referred to as Weakly-Supervised Spatio-Temporal Anomaly Detection (WSSTAD) in surveillance video. Specifically, given an untrimmed video, WSSTAD aims to localize a spatio-temporal tube (i.e., a sequence of bounding boxes at consecutive times) that encloses the abnormal event, with only coarse video-level annotations as supervision during training. To address this challenging task, we propose a dual-branch network which takes as input the proposals with multiple granularities in both the spatial and temporal domains. Each branch employs a relationship reasoning module to capture the correlation between tubes/videolets, which can provide rich contextual information and complex entity relationships for the concept learning of abnormal behaviors. A Mutually-Guided Progressive Refinement framework is set up to employ dual-path mutual guidance in a recurrent manner, iteratively sharing auxiliary supervision information across branches. It impels the learned concepts of each branch to serve as a guide for its counterpart, which progressively refines the corresponding branch and the whole framework. Furthermore, we contribute two datasets, i.e., ST-UCF-Crime and STRA, consisting of videos containing spatio-temporal abnormal annotations to serve as benchmarks for WSSTAD. We conduct extensive qualitative and quantitative evaluations to demonstrate the effectiveness of the proposed approach and analyze the key factors that contribute most to handling this task.
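The weak supervision signal can be pictured with standard MIL max-pooling over tube proposals (this sketch only shows a video-level loss; the paper's dual-branch mutual refinement adds much more on top):

import torch
import torch.nn.functional as F

def mil_video_loss(tube_scores: torch.Tensor, video_label: float):
    """Weak supervision from video-level labels only: the most anomalous
    tube proposal stands in for the whole video.

    tube_scores: (N,) anomaly logits for N spatio-temporal tube proposals.
    video_label: 1.0 if the video contains an abnormal event, else 0.0.
    """
    video_logit = tube_scores.max()   # MIL max-pooling over proposals
    target = torch.tensor(video_label, device=tube_scores.device)
    return F.binary_cross_entropy_with_logits(video_logit, target)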
Conference Paper
Multi-modal cues presented in videos are usually beneficial for the challenging video-text retrieval task on internet-scale datasets. Recent video retrieval methods take advantage of multi-modal cues by aggregating them into holistic high-level semantics for matching with text representations in a global view. In contrast to this global alignment, the local alignment of detailed semantics encoded within both multi-modal cues and distinct phrases has not been well explored. Thus, in this paper, we leverage hierarchical video-text alignment to fully explore the detailed, diverse characteristics in multi-modal cues for fine-grained alignment with local semantics from phrases, as well as to capture a high-level semantic correspondence. Specifically, multi-step attention is learned for progressively comprehensive local alignment, and a holistic transformer is utilized to summarize multi-modal cues for global alignment. With hierarchical alignment, our model outperforms state-of-the-art methods on three public video retrieval datasets.
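One plausible way to combine the two levels is a weighted blend of a holistic cosine score and a best-cue-per-phrase local score (the weighting and aggregation here are assumptions, not the paper's design):

import torch
import torch.nn.functional as F

def hierarchical_score(global_v, global_t, local_v, local_t, alpha=0.5):
    """Blend global video-sentence similarity with local phrase-to-cue
    alignment.

    global_v: (D,) holistic video feature   global_t: (D,) sentence feature
    local_v: (M, D) multi-modal cue features  local_t: (P, D) phrase features
    """
    g = F.cosine_similarity(global_v, global_t, dim=0)
    sim = F.normalize(local_t, dim=-1) @ F.normalize(local_v, dim=-1).t()
    l = sim.max(dim=1).values.mean()   # best cue per phrase, averaged
    return alpha * g + (1 - alpha) * l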
Article
Video anomaly detection under video-level labels is currently a challenging task. Previous works have made progress on discriminating whether a video sequence contains anomalies. However, most of them fail to accurately localize the anomalous events within videos in the temporal domain. In this paper, we propose a Weakly Supervised Anomaly Localization (WSAL) method focusing on temporally localizing anomalous segments within anomalous videos. Inspired by the appearance differences in anomalous videos, the evolution of adjacent temporal segments is evaluated for the localization of anomalous segments. To this end, a high-order context encoding model is proposed to not only extract semantic representations but also measure the dynamic variations, so that the temporal context can be effectively utilized. In addition, to fully utilize the spatial context information, the immediate semantics are directly derived from the segment representations. The dynamic variations as well as the immediate semantics are efficiently aggregated to obtain the final anomaly scores. An enhancement strategy is further proposed to deal with noise interference and the absence of localization guidance in anomaly detection. Moreover, to meet the diversity requirement of anomaly detection benchmarks, we also collect a new traffic anomaly (TAD) dataset, which specializes in traffic conditions and differs greatly from currently popular anomaly detection evaluation benchmarks. The dataset, benchmark test code, and experimental results are made public at http://vgg-ai.cn/pages/Resource/ and https://github.com/ktr-hubrt/WSAL . Extensive experiments are conducted to verify the effectiveness of different components, and our proposed method achieves new state-of-the-art performance on the UCF-Crime and TAD datasets.
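The dynamic-variation idea can be approximated very simply as differences between adjacent segment representations (a simplified stand-in for the high-order context encoding model, to be combined with the per-segment semantics):

import torch

def dynamic_variation(seg_feats: torch.Tensor) -> torch.Tensor:
    """Measure how fast adjacent temporal segments evolve; large jumps
    hint at anomalous transitions.

    seg_feats: (T, D) segment representations.
    Returns (T-1,) variation magnitudes between consecutive segments.
    """
    diff = seg_feats[1:] - seg_feats[:-1]
    return diff.norm(dim=-1)

feats = torch.randn(32, 1024)
variation = dynamic_variation(feats)  # fuse with immediate semantics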
Article
Weakly supervised anomaly detection is a challenging task since frame-level labels are not given in the training phase. Previous studies generally employ neural networks to learn features and produce frame-level predictions, then use a multiple instance learning (MIL)-based classification loss to ensure the interclass separability of the learned features; these operations simply take the current time information as input and ignore historical observations. Such solutions are general but overlook two essential factors, i.e., the temporal cue and feature discrimination. The former introduces temporal context to enhance the current time feature, and the latter enforces the samples of different categories to be more separable in the feature space. In this paper, we propose a method that consists of four modules to leverage these two ignored factors. The causal temporal relation (CTR) module captures local-range temporal dependencies among features to enhance them. The classifier (CL) projects enhanced features to the category space using causal convolution and further expands the temporal modeling range. Two additional modules, namely, the compactness (CP) and dispersion (DP) modules, are designed to learn the discriminative power of features, where the compactness module ensures the intraclass compactness of normal features, and the dispersion module enhances the interclass dispersion. Extensive experiments on three public benchmarks demonstrate the significance of causal temporal relations and feature discrimination for anomaly detection and the superiority of our proposed method.
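Plausible forms of the CP and DP objectives are a center-based compactness term and a margin-based dispersion term (the paper's exact losses may differ; these are assumptions):

import torch

def compactness_loss(normal_feats: torch.Tensor, center: torch.Tensor):
    """Pull normal features toward a shared center (intraclass compactness).
    normal_feats: (N, D); center: (D,)."""
    return ((normal_feats - center) ** 2).sum(dim=-1).mean()

def dispersion_loss(normal_feats: torch.Tensor, abnormal_feats: torch.Tensor,
                    margin: float = 1.0):
    """Push abnormal features at least a margin away from the normal ones
    (interclass dispersion), here via mean-feature distance."""
    d = (abnormal_feats.mean(0) - normal_feats.mean(0)).norm()
    return torch.clamp(margin - d, min=0.0)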
Article
Abnormal event detection is an important task in video surveillance systems. In this paper, we propose novel bidirectional multi-scale aggregation networks (BMAN) for abnormal event detection. The proposed BMAN learns spatiotemporal patterns of normal events and detects deviations from the learned normal patterns as abnormalities. BMAN consists of two main parts: an inter-frame predictor and an appearance-motion joint detector. The inter-frame predictor is devised to encode normal patterns, generating an intermediate frame via attention-based bidirectional multi-scale aggregation. With this feature aggregation, robustness to object scale variations and complex motions is achieved in normal pattern encoding. Based on the encoded normal patterns, abnormal events are detected by the appearance-motion joint detector, in which both the appearance and motion characteristics of scenes are considered. Comprehensive experiments show that the proposed method outperforms existing state-of-the-art methods. The resulting abnormal event detection is interpretable on the visual basis of where the detected events occur. Further, we validate the effectiveness of the proposed network designs through an ablation study and feature visualization.
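Prediction-based detectors of this kind typically score frames by prediction error; a common PSNR-based rule is sketched below (BMAN's joint detector also folds in motion cues, which this sketch omits):

import torch

def psnr_anomaly_score(pred_frame: torch.Tensor, true_frame: torch.Tensor,
                       eps: float = 1e-8) -> torch.Tensor:
    """Score a frame by how poorly it is predicted from its neighbors:
    low PSNR between the predicted and actual frame suggests an anomaly.

    Both tensors: (C, H, W) with values in [0, 1].
    """
    mse = torch.mean((pred_frame - true_frame) ** 2)
    psnr = 10 * torch.log10(1.0 / (mse + eps))
    return -psnr  # higher score = more anomalous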
Article
How can we build a generic deep one-class (DeepOC) model to solve one-class classification problems for anomaly detection, such as anomalous event detection in complex scenes? The characteristics of existing one-class labels lead to a dilemma: it is hard to directly use a multi-class classifier based on deep neural networks to solve one-class classification problems. Therefore, in this article, we propose a novel deep one-class neural network, termed DeepOC, which can simultaneously learn compact feature representations and train a one-class classifier. Given only normal samples, we use a stacked convolutional encoder to generate their low-dimensional high-level features and train a one-class classifier to make these features as compact as possible. Meanwhile, to ensure a correct mapping relation and diverse feature representations, we utilize a decoder to reconstruct raw samples from these low-dimensional features. This structure is gradually established using an adversarial mechanism during the training stage, and this mechanism is the key to our model: it organically combines two seemingly contradictory components and allows them to take advantage of each other, making the model robust and effective. Unlike methods that use handcrafted features or that are separated into two stages (extracting features and training classifiers), DeepOC is a one-stage model using reliable features that are automatically extracted by neural networks. Experiments on various benchmark data sets show that DeepOC is feasible and achieves state-of-the-art anomaly detection results compared with a dozen existing methods.
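A minimal sketch of the one-stage idea, assuming a center-based one-class term plus a reconstruction term (the adversarial training mechanism described above is omitted, and the layer sizes are illustrative):

import torch
import torch.nn as nn

class DeepOCSketch(nn.Module):
    """One-stage one-class model: an encoder whose features are pulled
    toward a center (one-class compactness) while a decoder keeps them
    informative via reconstruction."""
    def __init__(self, in_dim: int = 784, z_dim: int = 32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(),
                                 nn.Linear(128, in_dim))
        self.register_buffer('center', torch.zeros(z_dim))

    def forward(self, x):
        z = self.enc(x)
        x_hat = self.dec(z)
        compact = ((z - self.center) ** 2).sum(dim=-1).mean()
        recon = ((x_hat - x) ** 2).mean()
        # At test time, the distance to `center` serves as the anomaly score.
        return compact + recon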