Conference Paper

Violent flows: Real-time detection of violent crowd behavior

Authors: Tal Hassner, Yossi Itcher, Orit Kliper-Gross

Abstract

Although surveillance video cameras are now widely used, their effectiveness is questionable. Here, we focus on the challenging task of monitoring crowded events for outbreaks of violence. Such scenes require a human surveyor to monitor multiple video screens, presenting crowds of people in a constantly changing sea of activity, and to identify signs of breaking violence early enough to alert help. With this in mind, we propose the following contributions: (1) We describe a novel approach to real-time detection of breaking violence in crowded scenes. Our method considers statistics of how flow-vector magnitudes change over time. These statistics, collected for short frame sequences, are represented using the VIolent Flows (ViF) descriptor. ViF descriptors are then classified as either violent or non-violent using linear SVM. (2) We present a unique data set of real-world surveillance videos, along with standard benchmarks designed to test both violent/non-violent classification, as well as real-time detection accuracy. Finally, (3) we provide empirical tests, comparing our method to state-of-the-art techniques, and demonstrating its effectiveness.
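The pipeline described above is simple enough to sketch in code. Below is a minimal, illustrative Python version of a ViF-style descriptor; the flow estimator (OpenCV's Farneback), the per-clip adaptive threshold, the 4×4 mean-pooling grid, and all function names are our assumptions for illustration, not the authors' released code, and the pooled means simplify the per-cell histograms used in the paper.

```python
# Minimal sketch of a ViF-style pipeline (illustrative, not the authors' code).
import cv2
import numpy as np
from sklearn.svm import LinearSVC

def vif_descriptor(frames, grid=(4, 4)):
    """Mean binary flow-magnitude-change map, pooled over a spatial grid."""
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    mags = []
    for f in frames[1:]:
        gray = cv2.cvtColor(f, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(np.linalg.norm(flow, axis=2))   # per-pixel flow magnitude
        prev = gray
    mags = np.stack(mags)                    # (T-1, H, W)
    change = np.abs(np.diff(mags, axis=0))   # how magnitudes change over time
    binary = change >= change.mean()         # adaptive threshold (our choice)
    mean_map = binary.mean(axis=0)           # per-pixel frequency of change
    # Pool into grid cells and flatten into the descriptor vector.
    cells = [c.mean() for band in np.array_split(mean_map, grid[0], axis=0)
             for c in np.array_split(band, grid[1], axis=1)]
    return np.array(cells)

# descriptors: list of per-clip ViF vectors; labels: 1 = violent, 0 = non-violent
# clf = LinearSVC().fit(np.stack(descriptors), labels)
```

Note that the descriptor deliberately discards flow direction and keeps only how often magnitudes change significantly, which is part of what makes the method cheap enough for real-time use.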


... Violence 4D achieved better performance than previous studies on four violence benchmarks, namely RWF2000 [45], Movie fight [3], Hockey Fight [4], and Crowd violence [5]. ...
... Low-level characteristics have been used, such as local histograms of optical flow [13], local histograms of oriented gradients [14], and the absolute difference between successive frames together with features from motion blobs [15]. Other approaches [5] concentrate solely on the optical flow: a Gaussian model of optical flow is utilised to generate areas from which a descriptor is created and then categorised using an SVM. ...
... Even though it is an effective way of recognising actions, it necessitates the use of specialised hardware, making it computationally costly. Violent Flows (ViF) [5] was suggested for real-time detection of violence. However, the direction characteristic of optical flow was not used by ViF. ...
Article
Full-text available
As violence has increased around the world, surveillance cameras are everywhere, and they are only going to become more ubiquitous. Due to the massive volume of video footage, automatic activity detection systems must be used to create an online warning in the event of aberrant activity. A deep learning architecture is presented in this study using four-dimensional video-level convolutional neural networks. The proposed architecture includes residual blocks that are used with three-dimensional convolutional neural networks (3D CNNs) to learn long-term and short-term spatiotemporal representations from the video, as well as to capture inter-clip interaction. ResNet50 is used as the backbone for the three-dimensional convolution networks, and dense optical flow is used for the region of interest. The proposed architecture is applied to four benchmarks of violence and non-violence videos, which are commonly used for violence detection. It obtained test accuracies of 94.67% on RWF2000, 97.29% on Crowd violence, 100% on Movie fight and 100% on the Hockey Fight dataset. These results outperform the previous methods used on the RWF2000 dataset.
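As a rough illustration of the building block this abstract refers to, here is a minimal 3D-convolutional residual block in PyTorch. It is a generic sketch of residual 3D CNNs under our own assumptions, not the paper's exact 4D architecture or its ResNet50 backbone.

```python
# Generic 3D-conv residual block (a sketch of the idea, not the paper's model).
import torch
import torch.nn as nn

class Residual3DBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
        )

    def forward(self, x):                       # x: (batch, C, T, H, W)
        return torch.relu(self.body(x) + x)     # identity shortcut

# clips = torch.randn(2, 64, 16, 56, 56)        # e.g. 16-frame feature maps
# out = Residual3DBlock(64)(clips)
```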
... Violent Flows (ViF) [4], Oriented ViF (OViF) [5], Motion Weber Local Descriptor (MoWLD) [6], and Histogram of Optical flow Magnitude and Orientation (HOMO) [7] are only a few of the methods for video violence detection that are based on optical flow orientation and magnitude. Hassner et al. [4] proposed ViF by exploiting the magnitude of pixel flow and classified the video clip using a linear Support Vector Machine (SVM). To overcome the limitation of ViF, Zhou et al. [8] put forward OViF, and implemented it on non-crowded violence video clips. ...
... Each video clip consists of 50 frames with a 360 × 288 pixel resolution. • The violent crowd [4] dataset includes 123 violent and 123 non-violent video clips. Each clip comprises between 50 and 150 frames, each with a resolution of 320 × 240 pixels. ...
... In this work, an extensive experimental analysis of the modified network [8], adapted for violence detection in videos, is performed. Experimental results on three violence detection datasets [2,7,9] demonstrate that our modified model outperforms most other violence detection methods with simple hyperparameter adjustments. Our results also provide insights into the design of effective video classification models for future research. ...
... It is then fine-tuned with various hyperparameter and data augmentation configurations using a Bayesian-based tuning method [14]. Experiments are performed using three widely used benchmark violence detection datasets: RWF-2000 [7], the Surveillance Camera Fight Dataset (Surv/SCFD) [2], and ViolentFlows [9]. RWF-2000 [7] is the largest public dataset of surveillance footage for violence detection. ...
... ViolentFlows [9] contains 246 videos of crowds with clip-level violent or non-violent labels. All videos in the dataset have 240 × 320 pixel resolution and range from 1.04 to 6.52 seconds. ...
... Previously, researchers employed a variety of feature extraction approaches, such as ViF [11], STIPs [6], iDT [27], and fed the results to classic classification models, such as support vector machine (SVM). Real-world settings, on the other hand, are complicated, as seen in Fig. 1, and it is difficult to extract relevant information from hand-crafted feature descriptors. ...
... Hassner et al. [11] used optical flow magnitude series to detect violence in videos. The features are called violent flow (ViF) descriptors. ...
... Model complexity versus accuracy (params in M, FLOPs in G, accuracy in %):
I3D (RGB) [3]: 12.30, 111.30, 85.75
I3D (flow) [3]: 12.30, 102.52, 75.50
I3D (two-stream) [3]: 24.40, 213.85, 81.50
FlowGate (RGB) [5]: 0.25, 8.76, 84.50
FlowGate (flow) [5]: 0.25, 8.29, 75.50
FlowGate (fusion) [5]: 0.27, 16 ...
Accuracy (%) on Hockey / Movies:
ViF [11]: 82.9 / -
LHOG+LOF [32]: 95.1 / -
HOF+HIK [19]: 88.6 / 59.0
HOG+HIK [19]: 91.7 / 49.0
MoWLD+BoW [29]: 91.9 / -
MoSIFT+HIK [19]: 90.9 / 89.5
FightNet [31]: 97.0 / 100
3D ConvNet [22]: 99.62 / 99.9
ConvLSTM [23]: 97.1 / 100
C3D [25]: 96.5 / 100
I3D (RGB) [3]: 98.5 / 100
I3D (Flow) [3]: 84.0 / 100
FlowGate [5]: 98.0 / 100
2s-MDCN: 99.0 / 100
... and compare them with the single 3D CNN layer. From the visualization, it is evident that the combination of 1D, 2D, and 3D CNN layers extracts salient features from the input and makes our model efficient and accurate. ...
Preprint
Full-text available
The increasing number of surveillance cameras and security concerns have made automatic violent activity detection from surveillance footage an active area of research. Modern deep learning methods have achieved good accuracy in violence detection and proved to be successful because of their applicability in intelligent surveillance systems. However, the models are computationally expensive and large in size because of their inefficient methods of feature extraction. This work presents a novel architecture for violence detection called Two-stream Multi-dimensional Convolutional Network (2s-MDCN), which uses RGB frames and optical flow to detect violence. Our proposed method extracts temporal and spatial information independently by 1D, 2D, and 3D convolutions. Despite combining multi-dimensional convolutional networks, our models are lightweight and efficient due to reduced channel capacity, yet they learn to extract meaningful spatial and temporal information. Additionally, combining RGB frames and optical flow yields 2.2% more accuracy than a single RGB stream. Despite their lower complexity, our models obtained state-of-the-art accuracy of 89.7% on the largest violence detection benchmark dataset.
... Although the patterns relate to fighting, camera motion and other similar parameters still affect a model's performance whenever it is employed for real-world monitoring. The most challenging datasets so far in the VD domain are Violent-Flows and Real-World Fighting (RWF). [Fig. 6: The available VD datasets: Hockey Fight [55], Violent Flows/Violent Crowd [32], Violence in Movies [55], RWF-2000 [12], UCF Crime [73], UT-Interaction [64], Two-person interaction [95]. Datasets written in bold letters are not purely VD datasets, but contain some classes with human violence and can be utilized in the VD domain.] ...
... This dataset is challenging as it comprises several categories of violence, such as violence in sports and in crowds [32]. The videos are downloaded from YouTube, and the overall dataset consists of five sets of video clips, each set having two classes, i.e., violent and non-violent. ...
... That is why traditional low-level features-based techniques are comparatively less effective, even on simple datasets such as Hockey fight. The motion patterns in the Hockey fight dataset are not very complex, yet the best traditional features-based methods, such as MoSIFT+HIK [55], ViF [32], and MoSIFT+KDE+Sparse Coding [93], scored 90.9%, 82.9%, and 94.3% accuracy, respectively. Similarly, on Violent-Flows dataset, the best hand-crafted features methods i.e. ...
Preprint
Full-text available
The Big Video Data generated in today's smart cities has raised concerns from a purposeful-usage perspective, where surveillance cameras, among many other sources, are the most prominent contributors to the huge volumes of data, making automated analysis a difficult task in terms of computation and precision. Violence Detection (VD), broadly falling under the action and activity recognition domain, is used to analyze Big Video data for anomalous actions incurred by humans. The VD literature is traditionally based on manually engineered features, though advancements to deep learning-based standalone models have been developed for real-time VD analysis. This paper focuses on an overview of deep sequence learning approaches, along with localization strategies for the detected violence. The overview also dives into the initial image processing and machine learning-based VD literature and its possible advantages, such as efficiency, over the current complex models. Furthermore, the datasets are discussed to provide an analysis of the current models, explaining their pros and cons, with future directions in the VD domain derived from an in-depth analysis of the previous methods.
... This dataset is challenging as it comprises several categories of violence, such as violence in sports and in crowds (Hassner et al. 2012). The videos are downloaded from YouTube, and the overall dataset consists of five sets of video clips, each set having two classes, i.e., violent and non-violent. ...
... That is why traditional low-level features-based techniques are comparatively less effective, even on simple datasets such as Hockey fight. The motion patterns in the Hockey fight dataset are not very complex, yet the best traditional features-based methods, such as MoSIFT+HIK (Nievas et al. 2011), ViF (Hassner et al. 2012), and MoSIFT+KDE+Sparse Coding (Xu et al. 2014), scored 90.9%, 82.9%, and 94.3% accuracy, respectively. Similarly, on the Violent-Flows dataset, the best hand-crafted features methods i.e. ...
... Similarly, on the Violent-Flows dataset, the best hand-crafted features methods, i.e., ViF (Hassner et al. 2012), MoSIFT+KDE+Sparse Coding (Xu et al. 2014), and OViF (Gao et al. 2016), achieved 81.3%, 89.05%, and 88% accuracy, respectively. Although the mentioned accuracy and performance are high, the deployment of these methods in real-world environments is still questionable. ...
Article
Full-text available
The Big Video Data generated in today's smart cities has raised concerns from a purposeful-usage perspective, where surveillance cameras, among many other sources, are the most prominent contributors to the huge volumes of data, making automated analysis a difficult task in terms of computation and precision. Violence Detection (VD), broadly falling under the action and activity recognition domain, is used to analyze Big Video data for anomalous actions incurred by humans. The VD literature is traditionally based on manually engineered features, though advancements to deep learning-based standalone models have been developed for real-time VD analysis. This paper focuses on an overview of deep sequence learning approaches, along with localization strategies for the detected violence. The overview also dives into the initial image processing and machine learning-based VD literature and its possible advantages, such as efficiency, over the current complex models. Furthermore, the datasets are discussed to provide an analysis of the current models, explaining their pros and cons, with future directions in the VD domain derived from an in-depth analysis of the previous methods.
... The data resources [46]-[55] of action recognition datasets mostly come from institutions or websites (YouTube, surveillance systems, universities, etc.) containing many video segments. The scale of the datasets is also growing over time, reflected in the number of videos and tags. ...
... The datasets published as early as 2012 include Violent Flows-crowd violence [46], UCF50, and UCF101, marking the expansion and development of datasets. The data sources for these three datasets are YouTube videos. ...
Preprint
Full-text available
Currently, spatial-temporal behavior recognition is one of the most foundational tasks in computer vision. The 2D deep neural networks are built for recognizing pixel-level information such as images in RGB, RGB-D, or optical flow formats. With the increasingly wide usage of surveillance video and more tasks related to human action recognition, a growing number of tasks require temporal information for frame-dependency analysis. Researchers have widely studied video-based rather than image-based (pixel-based) recognition to extract more informative elements from geometric tasks. Our review addresses multiple recently proposed research works and compares the advantages and disadvantages of the derived deep learning frameworks rather than machine learning frameworks. The comparison covers existing frameworks and datasets containing video-format data only. Due to the specific properties of human actions and the increasingly wide usage of deep neural networks, we collected all research works published within the last three years, between 2020 and 2022. In our article, the performance of deep neural networks surpassed most other techniques in feature learning and extraction tasks, especially video action recognition.
... The proposed method outperforms competing algorithms evaluated on the same database. Violent Flows: Real-Time Detection of Violent Crowd Behavior [3], Tal Hassner, Yossi Itcher, Orit Kliper-Gross [3]: Although surveillance video cameras are now widely used, their effectiveness is questionable. Here, the authors focus on the challenging task of monitoring crowded events for outbreaks of violence. ...
Article
Human action recognition is the process of labelling image sequences with action labels. Robust solutions to this problem have applications in domains such as visual surveillance, video retrieval and human–computer interaction. The task is challenging due to variations in motion performance, recording settings and inter-personal differences. In this survey, we explicitly address these challenges. We provide a detailed overview of current advances in the field. Image representations and the subsequent classification process are discussed separately to focus on the novelties of recent research. Moreover, we discuss limitations of the state of the art and outline promising directions of research.
... To classify violent and non-violent frames, the Histogram of Optical Flow (HOF) feature descriptor is used for feature extraction, and Support Vector Machines (SVM) are used for classification. An interesting use of optical flow is presented in [11], which estimates the optical flow between consecutive frames in a sequence and summarizes it using a descriptor called Violent Flows (ViF). This descriptor gathers the most significant information, and the SVM machine learning algorithm is used to classify a video as violent or non-violent. ...
... The other datasets used for experimentation include violent crowd, also known as violent flow [11], violence in movies [30], surveillance fight, and industrial surveillance [21]. The violent flows dataset is derived from YouTube videos, and the industrial surveillance dataset contains videos downloaded from YouTube, where the events include violence in real-world environments. ...
Article
Full-text available
The study of automated video surveillance systems using computer vision techniques is a hot research topic, and such systems have been deployed in many real-world CCTV environments. The main focus of current systems is higher accuracy, while assisting surveillance experts in effective data analysis and instant decision making using efficient computer vision algorithms needs researchers' attention. In this research, to the best of our knowledge, we are the first to introduce a process control technique, control charts, for surveillance video data analysis. The control charts concept is merged with a novel deep learning-based violence detection framework. Different from existing methods, the proposed technique considers the importance of spatial information, as well as temporal representations of the input video data, to detect human violence. The spatial information is fused with the temporal dimension of the deep learning model using a multi-scale strategy to ensure that the temporal information is properly assisted by the spatial representations at multiple levels. The proposed framework's results are kept in the history-maintaining module of the control charts to validate the level of risk involved in the live input surveillance video. The detailed experimental results over the existing datasets and the real-world video data demonstrate that the proposed approach is a prominent solution towards automated surveillance with pre- and post-analyses of violent events.
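To make the control-chart idea concrete, a minimal, hypothetical Shewhart-style check over a history of per-clip violence scores might look as follows; the paper's actual charting scheme and history-maintaining module may differ, and the function name and threshold are our assumptions.

```python
# Hypothetical Shewhart-style control chart over per-clip violence scores.
# A new score is flagged when it leaves the mean +/- k*std control band
# built from the score history (illustrative only).
import numpy as np

def out_of_control(history, new_score, k=3.0):
    mu = np.mean(history)
    sigma = np.std(history) + 1e-8          # avoid division issues on flat history
    return abs(new_score - mu) > k * sigma

# history = [0.12, 0.08, 0.15, 0.10]        # past violence scores
# alert = out_of_control(history, 0.93)     # -> True: raise an alarm
```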
... In this case, the lack of diversity represents the main drawback because all the videos are captured in a single scene. Another dataset, named Violent-Flows, has been presented in [17]. It consists of about 250 video clips of violent/non-violent behaviors in general contexts. ...
... Violence detection datasets (number of classes, number of clips): NTU CCTV-Fights [18]: 2, 1000; AIRTLab [19,20]: 2, 350; Hockey and Movies Fight [16]: 2, 1000; Violent-Flows [17]: 2, 250; Surveillance Camera Fight [21]: 2, 300; RWF-2000 [22]: 2, 2000; Real-Life Violence Situations [23]: 2, 2000. To complement these datasets, in this work, a new large-scale benchmark suitable for human violence detection is constructed by gathering video clips from several cameras located inside a moving bus. To the best of our knowledge, our Bus Violence dataset is the first collection of videos depicting violent scenes concerning public transport. ...
Article
Full-text available
The automatic detection of violent actions in public places through video analysis is difficult because the employed Artificial Intelligence-based techniques often suffer from generalization problems. Indeed, these algorithms hinge on large quantities of annotated data and usually experience a drastic drop in performance when used in scenarios never seen during the supervised learning phase. In this paper, we introduce and publicly release the Bus Violence benchmark, the first large-scale collection of video clips for violence detection on public transport, where some actors simulated violent actions inside a moving bus in changing conditions, such as the background or light. Moreover, we conduct a performance analysis of several state-of-the-art video violence detectors pre-trained with general violence detection databases on this newly established use case. The achieved moderate performances reveal the difficulties in generalizing from these popular methods, indicating the need to have this new collection of labeled data, beneficial for specializing them in this new scenario. Ciampi, L.; Foszner, P.; Messina, N.; Staniszewski, M.; Gennaro, C.; Falchi, F.; Serao, G.; Cogiel, M.; Golba, D.; Szczęsna, A.; Amato, G. Bus Violence: An Open Benchmark for Video Violence Detection on Public Transport. Sensors 2022, 22, 8345. https://doi.org/10.3390/s22218345
... The following works classified violence by judging the presence of blood or abnormal sounds, such as gunshots, in the video [1][2][3]. Special descriptors, such as VIolent Flows (ViF) [4,5] and Oriented VIolent Flows (OViF) [6], were designed to extract the characteristics of violent behavior, such as a large range of action, short occurrence time, and large changes in movement direction. ViF considers statistics of how flow-vector magnitudes change over time. ...
... The training set includes 800 video clips, the validation set includes 100 video clips, and the test set includes 100 video clips. The Crowd Violence [4] dataset mainly contains crowd scenes, but due to the long shooting distance and low resolution, most of the scenes are chaotic and blurry. The recently published RWF-2000 [45] dataset contains 2000 surveillance video clips collected from YouTube. ...
Article
Full-text available
Most existing violence recognition methods have complex network structures and a high computational cost, and cannot meet the requirements of large-scale deployment. The purpose of this paper is to reduce the complexity of the model to enable violence recognition on mobile intelligent terminals. To solve this problem, we propose MobileNet-TSM, a lightweight network that uses MobileNet-V2 as its main structure. By incorporating temporal shift modules (TSM), which exchange information between frames, the capability of extracting dynamic characteristics between consecutive frames is strengthened. Extensive experiments are conducted to prove the validity of this method. Our proposed model has only 8.49 MB of parameters and an estimated total size of 175.86 MB. Compared with existing methods, this greatly reduces the model size, at the cost of an accuracy gap of about 3%. The proposed model achieved accuracies of 97.959%, 97.5% and 87.75% on three public datasets (Crowd Violence, Hockey Fights, and RWF-2000), respectively. Based on this, we also build a real-time violence recognition application on the Android terminal. The source code and trained models are available on https://github.com/1840210289/MobileNet-TSM.git.
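The temporal shift module (TSM) mentioned in the abstract is a parameter-free operation that can be sketched in a few lines of PyTorch: a fraction of the channels is shifted one step forward or backward along the time axis, so a 2D backbone mixes information between neighbouring frames at no extra FLOPs. The sketch below follows the published TSM formulation in general, not this paper's MobileNet-TSM code.

```python
# Sketch of the temporal shift operation used by TSM-style models.
import torch

def temporal_shift(x, fold_div=8):
    """x: (batch, T, C, H, W); shifts C/fold_div channels each way in time."""
    b, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift from future frame
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift from past frame
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels untouched
    return out
```

In practice the shift is applied inside residual branches, so the shifted features are added back to an unshifted identity path.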
... Several studies have been published on automatically detecting violent incidents in videos, in an effort to relieve authorities of the load of viewing hours of footage to identify occurrences that last only seconds. Recent publications [3][4][5] emphasised the accuracy of systems in violence detection, while earlier efforts [6][7][8] relied on custom-built features and flow forms, hallmarks of outdated approaches to identifying actions. Extraction of spatial-temporal features from videos, i.e., features that capture both the spatial information in a single frame and the motion information in a sequence of frames, has been demonstrated to be possible with deep learning approaches [9]. ...
... RWF-2000, Hockey-Fight, Crowd Violence, and Movies-Fight. The RWF-2000 [11], Hockey-Fight [4], Crowd Violence [24], and Movies-Fight [35] datasets are violence recognition datasets. These datasets contain two types of actions, violence and non-violence, with various people and backgrounds. ...
Preprint
Full-text available
This paper simultaneously addresses three limitations associated with conventional skeleton-based action recognition; skeleton detection and tracking errors, poor variety of the targeted actions, as well as person-wise and frame-wise action recognition. A point cloud deep-learning paradigm is introduced to the action recognition, and a unified framework along with a novel deep neural network architecture called Structured Keypoint Pooling is proposed. The proposed method sparsely aggregates keypoint features in a cascaded manner based on prior knowledge of the data structure (which is inherent in skeletons), such as the instances and frames to which each keypoint belongs, and achieves robustness against input errors. Its less constrained and tracking-free architecture enables time-series keypoints consisting of human skeletons and nonhuman object contours to be efficiently treated as an input 3D point cloud and extends the variety of the targeted action. Furthermore, we propose a Pooling-Switching Trick inspired by Structured Keypoint Pooling. This trick switches the pooling kernels between the training and inference phases to detect person-wise and frame-wise actions in a weakly supervised manner using only video-level action labels. This trick enables our training scheme to naturally introduce novel data augmentation, which mixes multiple point clouds extracted from different videos. In the experiments, we comprehensively verify the effectiveness of the proposed method against the limitations, and the method outperforms state-of-the-art skeleton-based action recognition and spatio-temporal action localization methods.
... This technique has been widely tested for recognition tasks in videos [21,42]; it is powerful for encoding the spatial-temporal information in a video stream. Instead, [13,19] have focused on local features and the bag-of-features method; the major difference between these methods lies in the form of the features used. ...
Article
Full-text available
Identifying anomalous activity is a heavy task, and this has led to progress in the domain of deep learning for video surveillance. With the development of deep learning, anomaly detection techniques have been widely used to improve the performance of various applications, including vision detection systems. However, it is still difficult to apply them directly to practical applications, which usually involve a lack of abnormal samples and diversity. This paper proposes a novel abnormality detection framework based on Stacked Auto Encoders (SAE) and Extreme Learning Machines (ELM) using multiple features. These features relate to speed of movement and appearance and are fed to a new neural network architecture as temporal and spatiotemporal streams. ELM algorithms offer an exceptionally fast learning speed when dealing with abnormal activity localization problems, in addition to excellent generalization abilities, so the deep learning network achieves good performance with quick learning and further improves the regression performance. The strength of our proposed approach is demonstrated by experiments with measured abnormal activity data. This approach can accurately identify and precisely locate abnormal events.
... It was intended to extract possible areas of violence and distinguish violent events by feeding the proposed novel descriptor, the orientation histogram of optical flow (OHOF), into a linear support vector machine (SVM) for classification. To exhibit the overall performance of the proposed algorithm, experiments were carried out on three public datasets: the BEHAVE dataset [5], the CAVIAR dataset [6], and the Violent Flows dataset [7]. The results were 85.29 ± 0.16% on the BEHAVE dataset, 86.75 ± 0.15% on the CAVIAR dataset, and 82.79 ± 0.19% on the Violent Flows dataset. ...
... As part of the evaluation, the proposed framework is compared with non-surveillance benchmark data for violence detection, such as the Hockey Fight dataset [58] and the Violent Flow dataset [59]. Hockey Fights is a dataset that covers violence events occurring during hockey matches within the National Hockey League. ...
Article
Full-text available
Since the advent of visual sensors, smart cities have generated massive surveillance video data, which can be intelligently inspected to detect anomalies. Computer vision-based automated anomaly detection techniques replace the human intervention of traditional video surveillance systems, which rely on tedious and inaccurate human involvement for anomaly detection. Due to the diverse nature of anomalous events and their complexity, it is, however, very challenging to detect them automatically in a real-world scenario. Using the Artificial Intelligence of Things (AIoT), this research work presents an efficient and robust framework for detecting anomalies in large surveillance video data. A hybrid model integrating a 2D-CNN and an ESN is proposed in this research study for smart surveillance, which is an important application of AIoT. The CNN is used as a feature extractor on the input videos; the extracted features are then input to an autoencoder for feature refinement, followed by an ESN for sequence learning and anomalous event detection. The proposed model is lightweight and implemented on edge devices to ensure its capability and applicability in AIoT environments in a smart city. The proposed model significantly enhanced performance on challenging surveillance datasets compared to other methods.
... In 2012, Hassner et al. [62] proposed a method for real-time detection of violence in crowded scenes using the Violent Flows (ViF) descriptor to capture optical flow information between consecutive video frames and a linear SVM to classify the videos based on the computed ViF descriptors. They demonstrated that their method was effective at classifying videos containing crowd violence and compared it to other existing methods at the time. ...
Article
Full-text available
Surveillance cameras are increasingly being used worldwide due to the proliferation of digital video capturing, storage, and processing technologies. However, the large volume of video data generated makes it difficult for humans to perform real-time analysis, and even manual approaches can result in delayed detection of events. Automatic violence detection in surveillance footage has therefore gained significant attention in the scientific community as a way to address this challenge. With the advancement of machine learning algorithms, automatic video recognition tasks such as violence detection have become increasingly feasible. In this study, we investigate the use of smart networks that model the dynamic relationships between actors and/or objects using 3D convolutions to capture both the spatial and temporal structure of the data. We also leverage the knowledge learned by a pre-trained action recognition model for efficient and accurate violence detection in surveillance footage. We extend and evaluate several public datasets featuring diverse and challenging video content to assess the effectiveness of our proposed methods. Our results show that our approach outperforms state-of-the-art methods, achieving approximately a 2% improvement in accuracy with fewer model parameters. Additionally, our experiments demonstrate the robustness of our approach under common compression artifacts encountered in remote server processing applications.
... The latter is fed into a pretrained CNN, which extracts deep features from several layers. Empirical validation is lastly conducted on violent-flows (Hassner et al., 2012), Hockey (Bermejo Nievas et al., 2011), and movies (Bermejo Nievas et al., 2011) datasets. ...
Article
Full-text available
Recently, developing automated video surveillance systems (VSSs) has become crucial to ensure the security and safety of the population, especially during events involving large crowds, such as sporting events. While artificial intelligence (AI) smooths the path of computers to think like humans, machine learning (ML) and deep learning (DL) pave the way more, even by adding training and learning components. DL algorithms require data labeling and high-performance computers to effectively analyze and understand surveillance data recorded from fixed or mobile cameras installed in indoor or outdoor environments. However, they might not perform as expected, take much time in training, or not have enough input data to generalize well. To that end, deep transfer learning (DTL) and deep domain adaptation (DDA) have recently been proposed as promising solutions to alleviate these issues. Typically, they can (i) ease the training process, (ii) improve the generalizability of ML and DL models, and (iii) overcome data scarcity problems by transferring knowledge from one domain to another or from one task to another. Although the increasing number of articles proposed to develop DTL- and DDA-based VSSs, a thorough review that summarizes and criticizes the state-of-the-art is still missing. To that end, this paper introduces, to the best of the authors’ knowledge, the first overview of existing DTL- and DDA-based video surveillance to (i) shed light on their benefits, (ii) discuss their challenges, and (iii) highlight their future perspectives.
... Even if we don't use the training part of the test dataset to blindly evaluate the model, to be completely fair we plan to build a new dataset adapted for semi-supervised scenarios. Moreover, one next step for evaluation will be to apply our methods on more datasets [93,28,48] that are more suitable for the supervised scenarios. ...
Thesis
Today, automatically solving transportation problems has become an active subject. In our PhD project, we aim to address a specific challenge in this domain: anomaly detection and tracking. Our ultimate goal is constructing a flexible and effective framework producing high performance on various public datasets. The context of our research is applying and improving previous successful approaches to achieve better results. Our first research evaluates the performance of a classical hand-crafted generative approach in future prediction and its capability for improving segmentation and tracking. The lack of visual information in the IOU tracker, combined with the failure detections of Mask R-CNN detectors, creates fragmented trajectories. We propose an enhanced tracker based on tracking-by-detection and optical flow estimation in vehicle tracking scenarios. Our solution generates new detections or segmentations by translating backward and forward results of CNN detectors by optical flow vectors. This task can fill in the gaps of trajectories. Then we match generated results with fragmented trajectories by SURF features. The DAVIS dataset is used for evaluating the best way to generate new detections. Finally, the entire process is tested on the DETRAC dataset. The qualitative results show that our solution achieved stable performance with different types of flow estimation methods and significantly improved the fragmented trajectories. For future work, we plan to apply the CGAN streams of the second work to the first task to propose a new competitive process of future prediction for segmentation and tracking. Despite the moderate success of the first work, there are significant limitations of classical approaches in dealing with our main task: anomaly detection. Facing those challenges, in this thesis, our contributions are two-fold. On the one hand, we propose a flexible multi-channel framework to generate multi-type frame-level features. On the other hand, we study how it is possible to improve the detection performance by supervised learning. The multi-channel framework is based on four Conditional GANs (CGANs) taking various types of appearance and motion information as input and producing prediction information as output. These CGANs provide a better feature space to represent the distinction between normal and abnormal events. Then, the difference between those generated and ground-truth pieces of information is encoded by the Peak Signal-to-Noise Ratio (PSNR). We propose to classify those features in a classical supervised scenario by building a small training set with some abnormal samples of the original test set of the dataset. A binary Support Vector Machine (SVM) is applied for frame-level anomaly detection. Finally, we use Mask R-CNN as a detector to perform object-centric anomaly localization. Our solution is extensively evaluated on the Avenue, Ped1, Ped2 and ShanghaiTech datasets. Our experimental results demonstrate that PSNR features combined with a supervised SVM are better than error maps computed by previous methods. We achieve SOTA performance for frame-level AUC on Avenue, Ped1 and ShanghaiTech. Especially, for the most challenging ShanghaiTech dataset, a supervised training model outperforms the SOTA unsupervised strategy by up to 9%. Furthermore, we keep several promising directions in progress: building a new dataset for semi-supervised anomaly detection containing both normal and abnormal samples in its training set, and applying a one-class SVM to propose an end-to-end framework.
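For reference, the PSNR used in the thesis to encode prediction error is standard and easy to state in code. A minimal version, assuming 8-bit frames, is:

```python
# Minimal PSNR between a predicted and a ground-truth frame (illustrative).
import numpy as np

def psnr(pred, gt, max_val=255.0):
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```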
... • CUHK Avenue Dataset for abnormal event detection [78], [79]: the films were shot on the campus avenue of CUHK (The Chinese University of Hong Kong). • Violent-Flows Crowd Violence and Non-violence Dataset [80], [81], [82]: a library of "real-world video footage of crowd violence", as well as established benchmark methods for violent/non-violent classification and for detecting outbreaks of violence. ...
... UMN [31], H.Fight [133], ViF [138]. ...
Preprint
Full-text available
Crowd anomaly detection is one of the most popular topics in computer vision in the context of smart cities. A plethora of deep learning methods have been proposed that generally outperform other machine learning solutions. Our review primarily discusses algorithms that were published in mainstream conferences and journals between 2020 and 2022. We present datasets that are typically used for benchmarking, produce a taxonomy of the developed algorithms, and discuss and compare their performances. Our main findings are that the heterogeneities of pre-trained convolutional models have a negligible impact on crowd video anomaly detection performance. We conclude our discussion with fruitful directions for future research.
... This paper deals with the automatic recognition of rare violent actions in a transport environment using multi-modal data. The automatic recognition of violent actions has been addressed for several years by modelling video streams using machine learning [1]-[3] and, more recently, deep learning [4]-[8]. Moreover, automatic sound scene and event recognition is being actively investigated [9], and several studies deal with the recognition and detection of violent scenes and screams [10]-[12]. ...
Conference Paper
Full-text available
This paper deals with improving the security of passengers in public transport by automatically processing the audio and video streams of an embedded surveillance system. In this paper, we analyse several levels of fusion of two deep audio and video recurrent network models for violent action recognition. Each audio and video model is based on recent generic feature extractors proposed in the state of the art, to benefit from powerful feature representation capabilities. Each level of fusion is trained and evaluated on new real-world audio-video surveillance streams recorded in a real train, with scenes of violence played by actors. The obtained results confirm the interest of jointly using the audio and video signals to detect violence and highlight the difficulty of defining the optimal level of fusion.
... The input of the HMM is a sequence of numerical values instead of a single value; this is needed because the HMM is based on the analysis of continuous values. Here, the feature values and count values are obtained, the density map is constructed, and its histogram is given as input to the HMM classifier for training [34]. The experimental results are shown in Section III. ...
Article
Each year, the computerized visual analysis of behavior provides several key building blocks toward a machine vision framework. The capacity to perceive people and their activities through vision is key for a machine to interact robustly and effectively in a human-computer interaction world. Due to various potential applications, "watching people" is currently one of the most active application domains in vision. One of the foremost applications of visual analysis is anomaly detection in human activities. An anomaly can appear in different forms, representing different levels of human security issues. The detection and tracking of unusual activities in surveillance have attracted increasing attention in computer vision. This work proposes a novel approach to monitoring anomalies in open environments: a generalized framework is created for tracking deviations by extracting local features using attributes such as local density and motion vectors. Slow feature motion corresponds to the normal behavior of people and fast feature motion to unusual behavior; for better detection, a motion map of the flow of motion vectors in the scene is computed by integrating motion and appearance. The experimental analysis illustrates the viability of this approach in comparison with classifiers: it is efficient to run and achieves 96% performance. For effective validation, the framework is tested with the standard UMN dataset and our own datasets.
... Many HC techniques used optical flow to describe human behavior. For instance, in [13], a descriptor is proposed by computing optical flow vectors in consecutive frames and building a histogram of optical flow information to describe video segments. The information obtained from this histogram is given to a Support Vector Machine (SVM) to learn violence features. ...
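A histogram-of-optical-flow descriptor of the kind sketched in [13] can be written compactly. In the version below, the bin count and normalization are our assumptions; it builds a magnitude-weighted orientation histogram from one dense flow field, and per-segment descriptors would concatenate or average these before the SVM.

```python
# Sketch of a histogram-of-optical-flow (HOF-style) descriptor:
# orientation histogram of flow vectors, weighted by magnitude.
import numpy as np

def hof(flow, bins=8):
    """flow: (H, W, 2) dense flow field -> normalized orientation histogram."""
    mag = np.linalg.norm(flow, axis=2)
    ang = np.arctan2(flow[..., 1], flow[..., 0])      # orientation in [-pi, pi]
    hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-8)
```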
Article
Full-text available
Numerous violent actions occur in the world every day, affecting victims mentally and physically. To reduce violence rates in society, an automatic system may be required to analyze human activities quickly and detect violent actions accurately. Violence detection is a complex machine vision problem involving insufficient violent datasets and wide variation in activities and environments. In this paper, an unsupervised framework is presented to discriminate between normal and violent actions overcoming these challenges. This is accomplished by analyzing the latent space of a double-stream convolutional AutoEncoder (AE). In the proposed framework, the input samples are processed to extract discriminative spatial and temporal information. A human detection approach is applied in the spatial stream to remove background environment and other noisy information from video segments. Since motion patterns in violent actions are entirely different from normal actions, movement information is processed with a novel Jerk feature in the temporal stream. This feature describes the long-term motion acceleration and is composed of 7 consecutive frames. Moreover, the classification stage is carried out with a one-class classifier using the latent space of AEs to identify the outliers as violent samples. Extensive experiments on Hockey and Movies datasets showed that the proposed framework surpassed the previous works in terms of accuracy and generality.
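The Jerk feature described above captures long-term motion acceleration over 7 consecutive frames. A rough numpy approximation, using third-order finite differences of a per-frame motion magnitude signal (our simplification, not the authors' exact formulation), is:

```python
# Approximate jerk-style temporal feature from per-frame motion magnitudes.
import numpy as np

def jerk_feature(mag, win=7):
    """mag: (T,) mean flow magnitude per frame; returns per-window mean |jerk|."""
    j = np.abs(np.diff(mag, n=3))      # 3rd-order difference ~ jerk
    k = win - 3                        # jerk samples spanned by a 7-frame window
    return np.array([j[i:i + k].mean() for i in range(len(j) - k + 1)])
```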
... The violent scenes dataset (VSD) [6] is a recently proposed benchmark dataset for violence detection, limited to only 3 audio-related categories ('explosions, screams, gunshots') and 6 video-related categories ('blood, fire, firearms, cold arms, car chases and gory scenes') based on 32 movies. The violent flows-crowd violence dataset consists of 246 short YouTube video clips [7] containing 'crowd violence'. The authors in [3] focus on the detection of aggressive behaviors, analyzed on the CareMedia aggression dataset containing 42 aggressive clips. ...
... Covering the dataset literature, we determined that most of the videos are obtained from movies or recorded with mobile phones. From our deep analysis, we noted that the most explored datasets in the field of VD are hockey fights [110], violence in movies [110], and violent crowd datasets [151], due to their challenging nature, which can be verified from Table 5, Table 6, and Table 7. Further, we present all the VD datasets in Table 9, where details, video sources, and resolutions are given. We classified the datasets as surveillance by viewing each video in each dataset. ...
Article
Recent advancements in intelligent surveillance systems for video analysis have been a topic of great interest in the research community due to the vast number of applications to monitor humans’ activities. The growing demand for these systems aims towards automatic violence detection (VD) systems enhancing and comforting human lives through artificial neural networks (ANN) and machine intelligence. Extremely overcrowded regions such as subways, public streets, banks, and the industries need such automatic VD system to ensure safety and security in the smart city. For this purpose, researchers have published extensive VD literature in the form of surveys, proposals, and extensive reviews. Existing VD surveys are limited to a single domain of study, i.e., coverage of VD for non-surveillance or for person-to-person data only. To deeply examine and contribute to the VD arena, we survey and analyze the VD literature into a single platform that highlights the working flow of VD in terms of machine learning strategies, neural networks (NNs)-based patterns analysis, limitations in existing VD articles, and their source details. Further, we investigate VD in terms of surveillance datasets and VD applications and debate on the challenges faced by researchers using these datasets. We comprehensively discuss the evaluation strategies and metrics for VD methods. Finally, we emphasize the recommendations in future research guidelines of VD that aid this arena with respect to trending research endeavors.
... a) Violent Flows [29]: The Violent Flows dataset is a classic dataset that has 246 total video sequences of various resolution and length, 123 of which are violent while the rest are non-violent. b) Movie Fights [33]: This is yet another classic dataset that is composed of 200 video sequences equally divided into violent and non-violent categories collected from action movies. ...
Preprint
Full-text available
Unmanned Aerial Vehicle (UAV) has gained significant traction in the recent years, particularly the context of surveillance. However, video datasets that capture violent and non-violent human activity from aerial point-of-view is scarce. To address this issue, we propose a novel, baseline simulator which is capable of generating sequences of photo-realistic synthetic images of crowds engaging in various activities that can be categorized as violent or non-violent. The crowd groups are annotated with bounding boxes that are automatically computed using semantic segmentation. Our simulator is capable of generating large, randomized urban environments and is able to maintain an average of 25 frames per second on a mid-range computer with 150 concurrent crowd agents interacting with each other. We also show that when synthetic data from the proposed simulator is augmented with real world data, binary video classification accuracy is improved by 5% on average across two different models.
Chapter
Dowry abuse, rape, domestic violence, forced marriage, witchcraft-related abuse, and honor killings are just a few of the myriad atrocities women encounter and fight against worldwide. The psychological impacts of abuse on the victim can lead to depression, PTSD, eating disorders, withdrawal from the outside world and society, and low self-esteem, to name a few. The physical implications could result in an inability to get to work, wage loss, dearth of involvement in routine activities, and not being able to take care of themselves and their families. Our initiative is dedicated to curbing violence against women by providing a forum for women to speak about violence as well as passing a signal about it through a dedicated hand gesture. Our designed solution has three modules, namely: Violence/Crime Scene Detection against women using audio and video, help hand signal detection, and multi-label story classification. Our approach uses Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) for video classification, along with Support Vector Machine (SVM) and Random forest for audio classification. Keywords: Violence detection, Abuse, Harassment, Hand gestures, Crime, Residual networks (ResNets), Convolutional neural network (CNN), Deep learning, Word embeddings
Article
The state of the art in violence detection in videos has improved in recent years thanks to deep learning models, but it is still below 90% average precision on the most complex datasets, which may pose a problem of frequent false alarms in video surveillance environments and may cause security guards to disable the artificial intelligence system. In this study, we propose a new neural network based on Vision Transformer (ViT) and Neural Structured Learning (NSL) with adversarial training. This network, called CrimeNet, outperforms previous works by a large margin and reduces false positives to practically zero. Our tests on the four most challenging violence-related datasets (binary and multi-class) show the effectiveness of CrimeNet, improving the state of the art by 9.4 to 22.17 percentage points in ROC AUC depending on the dataset. In addition, we present a generalisation study of our model by training and testing it on different datasets. The obtained results show that CrimeNet improves over competing methods with a gain of between 12.39 and 25.22 percentage points, showing remarkable robustness.
Article
Full-text available
In the last few years, due to the continuous advancement of technology, human behavior detection and recognition have become important scientific research in the field of computer vision (CV). However, one of the most challenging problems in CV is anomaly detection (AD) because of the complex environment and the difficulty in extracting a particular feature that correlates with a particular event. As the number of cameras monitoring a given area increases, it will become vital to have systems capable of learning from the vast amounts of available data to identify any potential suspicious behavior. Then, the introduction of deep learning (DL) has brought new development directions for AD. In particular, DL models such as convolution neural networks (CNNs) and recurrent neural networks (RNNs) have achieved excellent performance dealing with AD tasks, as well as other challenging domains like image classification, object detection, and speech processing. In this review, we aim to present a comprehensive overview of those research methods using DL to address the AD problem. Firstly, different classifications of anomalies are introduced, and then the DL methods and architectures used for video AD are discussed and analyzed, respectively. The revised contributions have been categorized by the network type, architecture model, datasets, and performance metrics that are used to evaluate these methodologies. Moreover, several applications of video AD have been discussed. Finally, we outlined the challenges and future directions for further research in the field.
Article
Anomalous event recognition has a complicated definition in complex backgrounds due to the sparse occurrence of anomalies. In this paper, we form a framework for classifying multiple anomalies present in video frames that happen in a context, such as the sudden movement of people in various directions and anomalous vehicles in a pedestrian park. An attention U-net model on video frames is utilized to create a binary segmented anomalous image that classifies each anomalous object in the video. White pixels indicate the anomaly, and black pixels serve as the background image. For better segmentation, we have assigned a border to every anomalous object in the binary image. Further, to distinguish each anomaly, a watershed algorithm is utilized that develops multi-level gray image masks for every anomalous class. This forms a multi-class problem, where each anomalous instance is represented by a different gray color level. We use pixel values, optical intensity, entropy values, and Gaussian filters with sigma 5 and 7 to form a feature extraction module for training video images along with their multi-instance gray-level masks. Pixel-level localization and identification of unusual items are done using the feature vectors acquired from the feature extraction module and a multi-class stack classifier model. The proposed methodology is evaluated on the UCSD Ped1, Ped2 and UMN datasets, obtaining pixel-level average accuracy results of 81.15%, 87.26% and 82.67%, respectively.
Article
Overcrowding and stampedes may occur in public places with the gathering of crowds. To mitigate and prevent risk, the accident mechanism and methods for monitoring and evaluating crowd-gathering risk were investigated. Related studies are reviewed and summarized in this paper. The evolution process of crowd-gathering risk and precipitating factors were explained systematically. Risk monitoring methods are classified into three types according to the key technologies adopted. Articles exploring risk evaluation methods for crowd gathering are outlined, and the three main paradigms were formed. Finally, the shortcomings and future research points are summarized to promote more in-depth and comprehensive studies on crowd-gathering risk, develop monitoring technologies, and build an integrated system of risk management.
Chapter
Violent video recognition is a challenging task in the field of computer vision, and multimodal methods have always been an important part of it. Because such videos contain sensitive content, violent videos are difficult to collect, resulting in a lack of large public datasets. Existing methods of learning violent video representations are limited by small datasets and lack efficient multimodal fusion models. Given this situation, we first propose to effectively transfer information from large datasets to small violent datasets based on mutual distillation with a self-supervised pretrained model for the vital RGB feature. Secondly, the multimodal attention fusion network (MAF-Net) is proposed to fuse the obtained RGB feature with flow and audio features to recognize violent videos with multimodal information. Thirdly, we build a new large-scale violent dataset, named the Violent Clip Dataset (VCD), which contains complete audio information. We performed experiments on the public VSD dataset and the self-built VCD dataset. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art methods on both datasets. Keywords: Violence recognition, Mutual distillation, Multimodal information fusion
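For intuition only, the following is a generic cross-modal attention fusion sketch in PyTorch; it is not the paper's MAF-Net, and the module structure, dimensions, and pooling are all assumptions.

```python
# Generic cross-modal attention fusion sketch (not the paper's MAF-Net).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # RGB tokens attend to flow/audio tokens (query = RGB).
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, 2)  # violent / non-violent

    def forward(self, rgb, other):
        # rgb:   (B, T, dim) clip-level RGB features
        # other: (B, S, dim) flow or audio features
        fused, _ = self.attn(query=rgb, key=other, value=other)
        fused = self.norm(rgb + fused)             # residual connection
        return self.classifier(fused.mean(dim=1))  # pool over time

rgb = torch.randn(8, 16, 256)
audio = torch.randn(8, 32, 256)
logits = CrossModalFusion()(rgb, audio)  # shape (8, 2)
```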
Preprint
Full-text available
The report describes the digital architecture developed for the West Cambridge Digital Twin, particularly focussed on real-time sensor data collection and analysis with a privacy framework allowing occupants of the buildings to be first-class participants in the system. The implementation has some notable characteristics. In particular, 'push' technology is used throughout, such that information streams from the incoming asynchronous individual sensor events through to the end-user web pages with minimal latency, including simple and complex events generated in real time from the underlying sensor data and the updating of visualisations such as an in-building heatmap. We believe the ability of the entire system to respond on the timescale of individual sensor messages to be unique. JSON structures are used to represent all data types, including sensor readings, sensor types, building objects, organisations, and people, with the idea that JSON-LD may represent a more suitable way than XML/RDF for managing relations between those objects (such as the 'occupies' relationship of people to offices, or the 'type' relationship of sensors to sensor types).
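As a purely hypothetical illustration of the kind of JSON-LD-flavoured structure described above, the sketch below builds one sensor reading as a Python dict; every field name and identifier is an assumption, not the report's actual schema.

```python
# Hypothetical JSON-LD-style sensor reading; field names are assumptions.
import json

reading = {
    "@context": "https://example.org/digital-twin/context.jsonld",
    "@type": "SensorReading",
    "sensor": {"@id": "sensor/temp-042", "@type": "TemperatureSensor"},
    "value": 21.4,
    "unit": "degC",
    "timestamp": "2021-03-15T09:30:00Z",
    # Relations between objects expressed as linked identifiers,
    # e.g. which office the sensor serves and who occupies it.
    "location": {"@id": "building/wgb/office/FN07"},
    "occupant": {"@id": "person/jb123"},
}
print(json.dumps(reading, indent=2))
```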
Article
Automatic violence detection has received continuous attention due to its broad application prospects. However, most previous work prefers building a generalized pipeline while ignoring the complexity and diversity of violent scenes. In most cases, people judge violence by a variety of sub-concepts, such as blood, fighting, screams, explosions, etc., which may show certain co-occurrence trends. Therefore, we argue that parsing abstract violence into specific semantics helps to obtain the essential representation of violence. In this paper, we propose a semantic multimodal violence detection framework based on local-to-global embedding. The local semantic detection is designed to capture fine-grained violent elements in the video via a set of local semantic detectors, which are generated from a variety of external word embeddings. Also, we introduce a global semantic alignment branch to mitigate the intra-class variance of violence, in which violent video embeddings are guided to form a compact cluster while keeping a semantic gap with non-violent embeddings. Furthermore, we construct a multimodal cross-fusion network (MCN) for multimodal feature fusion, which consists of a cross-adaptive module and a cross-perceptual module. The former aims to eliminate inter-modal heterogeneity, while the latter suppresses task-irrelevant redundancies to obtain robust video representations. Extensive experiments demonstrate the effectiveness of the proposed method, which has a superior generalization capacity and achieves competitive performance on five violence datasets.
Article
In potential disasters, real-world scenarios, and public events, understanding crowd psychology is challenging, and detecting crowd behaviours in those events is quite complicated. Therefore, in this article, an effective crowd analysis mechanism is introduced using a human behaviour analysis model based on a motion-heat-flow-enabled optical flow method. An OCP descriptor evaluates the required object points from the motion heat map to detect objects and generate feature weights. Three types of histogram gradients are evaluated using the proposed HBA model: inconsistencies between neighbouring points and the OCP descriptor, gradients obtained considering the spatial angle, and gradients obtained considering the temporal angle. Features obtained in both the spatial and temporal domains are encoded using a feature encoding scheme. The performance of the proposed Human Behaviour Analysis (HBA) model is tested on the Web and Violent-Flows video datasets and compared against several traditional behaviour analysis methods using performance metrics such as accuracy, precision, and recall for detecting behaviours in public events. The performance of the proposed HBA model is superior to multiple state-of-the-art crowd behaviour analysis techniques.
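A rough sketch of one building block mentioned above, a motion "heat map" accumulated from dense optical-flow magnitudes over a clip (OpenCV Farneback), follows; this is a generic illustration, not the paper's OCP descriptor, and the flow parameters are assumptions.

```python
# Accumulate per-pixel optical-flow magnitude into a motion heat map.
import cv2
import numpy as np

def motion_heat_map(video_path: str) -> np.ndarray:
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    if not ok:
        raise IOError(f"cannot read {video_path}")
    prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    heat = np.zeros(prev.shape, dtype=np.float64)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(
            prev, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        # Per-pixel flow magnitude accumulated as "motion heat".
        heat += np.linalg.norm(flow, axis=2)
        prev = gray
    cap.release()
    return heat / max(heat.max(), 1e-9)  # normalise to [0, 1]
```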
Article
Full-text available
This paper explores deep learning (DL) methods that are used, or have the potential to be used, for traffic video analysis, emphasising driving safety for both autonomous vehicles and human-operated vehicles. A typical processing pipeline is presented, which can be used to understand and interpret traffic videos by extracting operational safety metrics and providing general hints and guidelines to improve traffic safety. This processing framework includes several steps: video enhancement, video stabilisation, semantic and incident segmentation, object detection and classification, trajectory extraction, speed estimation, event analysis, modelling, and anomaly detection. The main goal is to guide traffic analysts in developing their own custom-built processing frameworks by selecting the best choices for each step and offering new designs for the missing modules, supported by a comparative analysis of the most successful conventional and DL-based algorithms proposed for each step. Existing open-source tools and public datasets that can help train DL models are also reviewed. More specifically, exemplary traffic problems are reviewed and the required steps are described for each problem. In addition, connections to the closely related research areas of driver cognition evaluation, crowd-sourcing-based monitoring systems, edge computing in roadside infrastructure, and vehicles equipped with automated driving systems are investigated, and the remaining gaps are highlighted. Finally, commercial implementations of traffic monitoring systems, their future outlook, and open problems and remaining challenges for the widespread use of such systems are reviewed.
Article
Activity recognition has gained immense popularity due to the increasing number of surveillance cameras. Its purpose is to detect actions from a series of observations under varying environmental conditions. In this paper, a Chaotic Whale Atom Search Optimisation based deep stacked autoencoder (CWASO-Deep SAE) is proposed for crowd behaviour recognition. Key frames are passed to a feature descriptor to extract the features that form the classifier's input vector. In this model, statistical features, optical flow features, and visual features are used to extract the important information. These significant features are then fed to the deep stacked autoencoder (Deep SAE) for activity recognition, where the training of the Deep SAE is guided by CWASO, which is designed by combining the Atom Search Optimisation (ASO) algorithm and the Chaotic Whale Optimisation Algorithm (CWOA). The proposed system's performance is analysed using two datasets. On the training data, the proposed method attains the highest performance for dataset-1, with maximum precision, sensitivity, and specificity of 96.826%, 96.790%, and 99.395%, respectively. Similarly, under K-fold evaluation, the method attains a maximum precision of 96.897%, sensitivity of 96.885%, and specificity of 97.245% for dataset-1.
Article
Full-text available
With the widespread use of closed-circuit television (CCTV) surveillance systems in public areas, crowd anomaly detection has become an increasingly critical aspect of intelligent video surveillance. Deciding on captured events requires manpower and continuous attention, which is hard for individuals to sustain. The available literature on human action detection includes various approaches that detect abnormal crowd behavior, articulated as an outlier detection problem. This paper presents a detailed review of recent developments in anomaly detection methods from the perspective of computer vision, across the different available datasets. A new taxonomic organization of existing work in crowd analysis and anomaly detection is introduced, and existing reviews and datasets related to anomaly detection are summarized and listed. The review covers an overview of different crowd concepts, including the analysis of and challenges in mass gathering events, types of anomalies, and surveillance systems. Additionally, research trends and future work prospects are analyzed.
Article
Full-text available
We introduce a new gesture recognition framework based on learning local motion signatures (LMSs) of HOG descriptors introduced by [1]. After generating these LMSs, computed for one individual by tracking Histograms of Oriented Gradient (HOG) [2] descriptors, we learn a codebook of video-words (i.e., clusters of LMSs) using the k-means algorithm on a training gesture video database. The video-words are then compacted into a codebook of codewords by the Maximization of Mutual Information (MMI) algorithm. In the final step, we compare the LMSs generated for a new gesture with the learned codebook via the k-nearest neighbors (k-NN) algorithm and a novel voting strategy. Our main contribution is the handling of the N-to-N mapping between codewords and gesture labels within the proposed voting strategy. Experiments have been carried out on two public gesture databases, KTH [3] and IXMAS [4]; the results show that the proposed method outperforms recent state-of-the-art methods.
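A minimal bag-of-video-words sketch in the spirit of this pipeline (k-means codebook followed by k-NN classification) is shown below; feature extraction and the MMI compaction step are omitted, and the descriptors are random placeholders.

```python
# Bag-of-video-words sketch: k-means codebook + k-NN classification.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Pretend each video yields ~200 local motion descriptors of dim 72.
train_descs = [rng.normal(size=(200, 72)) for _ in range(40)]
train_labels = rng.integers(0, 4, size=40)          # 4 gesture classes

# 1) Learn a codebook over all training descriptors.
codebook = KMeans(n_clusters=64, n_init=10, random_state=0)
codebook.fit(np.vstack(train_descs))

def bow_histogram(descs: np.ndarray) -> np.ndarray:
    """Quantize descriptors and build a normalized word histogram."""
    words = codebook.predict(descs)
    hist = np.bincount(words, minlength=64).astype(float)
    return hist / hist.sum()

# 2) Represent each video as a histogram and classify with k-NN.
X = np.array([bow_histogram(d) for d in train_descs])
clf = KNeighborsClassifier(n_neighbors=5).fit(X, train_labels)
pred = clf.predict(bow_histogram(rng.normal(size=(200, 72)))[None, :])
```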
Conference Paper
Full-text available
Automatic processing of video data is essential in order to allow efficient access to large amounts of video content, a crucial point in such applications as video mining and surveillance. In this paper we focus on the problem of identifying interesting parts of the video. Specifically, we seek to identify atypical video events, which are the events a human user is usually looking for. To this end we employ the notion of Bayesian surprise, as defined in [1,2], in which an event is considered surprising if its occurrence leads to a large change in the probability of the world model. We propose to compute this abstract measure of surprise by first modeling a corpus of video events using the Latent Dirichlet Allocation model. Subsequently, we measure the change in the Dirichlet prior of the LDA model as a result of each video event’s occurrence. This change of the Dirichlet prior leads to a closed form expression for an event’s level of surprise, which can then be inferred directly from the observed data. We tested our algorithm on a real dataset of video data, taken by a camera observing an urban street intersection. The results demonstrate our ability to detect atypical events, such as a car making a U-turn or a person crossing an intersection diagonally.
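To make the surprise measure tangible: one common formalization of Bayesian surprise is the KL divergence between the belief before and after an observation. The sketch below computes that divergence for Dirichlet priors over latent topics; it illustrates the general idea, not the paper's exact closed-form expression, and the pseudo-count update is invented for the example.

```python
# Bayesian surprise as KL( Dir(posterior) || Dir(prior) ).
import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_kl(alpha: np.ndarray, beta: np.ndarray) -> float:
    """KL divergence between Dirichlet(alpha) and Dirichlet(beta)."""
    a0, b0 = alpha.sum(), beta.sum()
    return (gammaln(a0) - gammaln(alpha).sum()
            - gammaln(b0) + gammaln(beta).sum()
            + ((alpha - beta) * (digamma(alpha) - digamma(a0))).sum())

prior = np.array([2.0, 2.0, 2.0])          # belief over 3 latent topics
# Observing an event dominated by topic 2 updates the pseudo-counts.
posterior = prior + np.array([0.1, 0.2, 5.0])
surprise = dirichlet_kl(posterior, prior)  # large value => atypical event
print(f"surprise = {surprise:.3f} nats")
```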
Conference Paper
Full-text available
Millions of surveillance cameras record video around the clock, producing huge video archives. Even when a video archive is known to include critical activities, finding them is like finding a needle in a haystack, making the archive almost worthless. Two main approaches were proposed to address this problem: action recognition and video summarization. Methods for automatic detection of activities still face problems in many scenarios. The video synopsis approach to video summarization is very effective, but may produce confusing summaries by the simultaneous display of multiple activities. A new methodology for the generation of short and coherent video summaries is presented, based on clustering of similar activities. Objects with similar activities are easy to watch simultaneously, and outliers can be spotted instantly. Clustered synopsis is also suitable for efficient creation of ground truth data.
Conference Paper
Full-text available
In this paper, we present a systematic framework for recognizing realistic actions from videos "in the wild". Such unconstrained videos are abundant in personal collections as well as on the Web. Recognizing action from such videos has not been addressed extensively, primarily due to the tremendous variations that result from camera motion, background clutter, changes in object appearance, and scale, etc. The main challenge is how to extract reliable and informative features from the unconstrained videos. We extract both motion and static features from the videos. Since the raw features of both types are dense yet noisy, we propose strategies to prune these features. We use motion statistics to acquire stable motion features and clean static features. Furthermore, PageRank is used to mine the most informative static features. In order to further construct compact yet discriminative visual vocabularies, a divisive information-theoretic algorithm is employed to group semantically related features. Finally, AdaBoost is chosen to integrate all the heterogeneous yet complementary features for recognition. We have tested the framework on the KTH dataset and our own dataset consisting of 11 categories of actions collected from YouTube and personal videos, and have obtained impressive results for action recognition and action localization.
Conference Paper
Full-text available
We present a novel model for human action categorization. A video sequence is represented as a collection of spatial and spatial-temporal features by extracting static and dynamic interest points. We propose a hierarchical model that can be characterized as a constellation of bags-of-features and that is able to combine both spatial and spatial-temporal features. Given a novel video sequence, the model is able to categorize human actions on a frame-by-frame basis. We test the model on a publicly available human action dataset [2] and show that our new method performs well on the classification task. We also conducted control experiments to show that the use of the proposed mixture of hierarchical models improves the classification performance over bag-of-features models. An additional experiment shows that using both dynamic and static features provides a richer representation of human actions than the use of a single feature type, as demonstrated by our evaluation on the classification task.
Conference Paper
Full-text available
In this paper we introduce a template-based method for recognizing human actions called Action MACH. Our approach is based on a Maximum Average Correlation Height (MACH) filter. A common limitation of template-based methods is their inability to generate a single template using a collection of examples. MACH is capable of capturing intra-class variability by synthesizing a single Action MACH filter for a given action class. We generalize the traditional MACH filter to video (3D spatiotemporal volume) and vector-valued data. By analyzing the response of the filter in the frequency domain, we avoid the high computational cost commonly incurred in template-based approaches. Vector-valued data is analyzed using the Clifford Fourier transform, a generalization of the Fourier transform intended for both scalar and vector-valued data. Finally, we perform an extensive set of experiments and compare our method with some of the most recent approaches in the field by using publicly available datasets, and two new annotated human action datasets which include actions performed in classic feature films and sports broadcast television.
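The sketch below is a deliberately simplified frequency-domain average-correlation filter, included only to illustrate the template-synthesis idea behind MACH-style methods; the real MACH criterion also optimizes similarity and noise terms, so this is not the paper's exact filter, and the regularizer eps is an assumption.

```python
# Simplified average-correlation filter in the frequency domain.
import numpy as np

def train_filter(examples: np.ndarray, eps: float = 1e-3) -> np.ndarray:
    """examples: (N, H, W) aligned training templates of one class.
    Returns a filter in the frequency domain."""
    F = np.fft.fft2(examples, axes=(1, 2))
    mean_spectrum = F.mean(axis=0)             # average template
    avg_power = (np.abs(F) ** 2).mean(axis=0)  # average energy spectrum
    return mean_spectrum / (avg_power + eps)   # favour stable frequencies

def correlate(H: np.ndarray, image: np.ndarray) -> np.ndarray:
    """Correlation plane; a sharp peak marks a likely detection."""
    X = np.fft.fft2(image)
    return np.real(np.fft.ifft2(np.conj(H) * X))

rng = np.random.default_rng(1)
train = rng.normal(size=(10, 64, 64))
H = train_filter(train)
plane = correlate(H, train[0])
peak = np.unravel_index(np.argmax(plane), plane.shape)
```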
Conference Paper
Full-text available
We present an approach for measuring similarity between visual entities (images or videos) based on matching internal self-similarities. What is correlated across images (or across video sequences) is the internal layout of local self-similarities (up to some distortions), even though the patterns generating those local self-similarities are quite different in each of the images/videos. These internal self-similarities are efficiently captured by a compact local "self-similarity descriptor", measured densely throughout the image/video, at multiple scales, while accounting for local and global geometric distortions. This gives rise to matching capabilities of complex visual data, including detection of objects in real cluttered images using only rough hand-sketches, handling textured objects with no clear boundaries, and detecting complex actions in cluttered video data with no prior learning. We compare our measure to commonly used image-based and video-based similarity measures, and demonstrate its applicability to object detection, retrieval, and action detection.
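A toy sketch of such a descriptor follows: correlate a small central patch against its surrounding region and bin the resulting correlation surface on a log-polar grid. The patch size, radius, bin counts, and variance normalizer are assumptions, not the paper's exact settings.

```python
# Toy local self-similarity descriptor (single image point, 2D case).
import numpy as np

def self_similarity(img, cy, cx, patch=5, radius=20,
                    n_angle=8, n_rad=3, var=1000.0):
    """img: 2D float array. Returns an (n_rad * n_angle,) descriptor
    at (cy, cx): max correlation per log-polar bin."""
    h = patch // 2
    center = img[cy - h:cy + h + 1, cx - h:cx + h + 1]
    desc = np.zeros((n_rad, n_angle))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            r = np.hypot(dy, dx)
            if r == 0 or r > radius:
                continue
            patch2 = img[cy + dy - h:cy + dy + h + 1,
                         cx + dx - h:cx + dx + h + 1]
            if patch2.shape != center.shape:
                continue  # offset fell off the image border
            ssd = np.sum((center - patch2) ** 2)
            corr = np.exp(-ssd / var)  # SSD -> correlation surface
            # Log-polar bin indices for this offset.
            ri = min(int(np.log1p(r) / np.log1p(radius) * n_rad), n_rad - 1)
            ai = int(((np.arctan2(dy, dx) + np.pi) / (2 * np.pi)) * n_angle) % n_angle
            desc[ri, ai] = max(desc[ri, ai], corr)
    return desc.ravel()
```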
Conference Paper
Full-text available
Visual recognition of human actions in video clips has been an active field of research in recent years. However, most published methods either analyse an entire video and assign it a single action label, or use relatively large look-ahead to classify each frame. Contrary to these strategies, human vision proves that simple actions can be recognised almost instantaneously. In this paper, we present a system for action recognition from very short sequences ("snippets") of 1-10 frames, and systematically evaluate it on standard data sets. It turns out that even local shape and optic flow for a single frame are enough to achieve ≈90% correct recognitions, and snippets of 5-7 frames (0.3-0.5 seconds of video) are enough to achieve a performance similar to the one obtainable with the entire video sequence.
Conference Paper
Full-text available
In this work, we present a novel method to detect violent shots in movies. The detection process is split into two views: audio and video. From the audio view, a weakly-supervised method is exploited to improve classification performance; from the video view, we use a classifier to detect violent shots. Finally, the auditory and visual classifiers are combined in a co-training manner. Experimental results on several movies with violent content preliminarily show the effectiveness of our method.
Article
Full-text available
Subspaces offer convenient means of representing information in many pattern recognition, machine vision, and statistical learning applications. Contrary to the growing popularity of subspace representations, the problem of efficiently searching through large subspace databases has received little attention in the past. In this paper, we present a general solution to the problem of Approximate Nearest Subspace search. Our solution uniformly handles cases where the queries are points or subspaces, where query and database elements differ in dimensionality, and where the database contains subspaces of different dimensions. To this end, we present a simple mapping from subspaces to points, thus reducing the problem to the well-studied Approximate Nearest Neighbor problem on points. We provide theoretical proofs of correctness and error bounds of our construction and demonstrate its capabilities on synthetic and real data. Our experiments indicate that an approximate nearest subspace can be located significantly faster than the nearest subspace, with little loss of accuracy.
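A hedged sketch of the core reduction described here, mapping each subspace to a point so that standard nearest-neighbor machinery applies, is shown below: a subspace is represented by its flattened orthogonal projection matrix. This illustrates the general mapping idea, not the paper's exact construction or error bounds.

```python
# Reduce nearest-subspace search to nearest-neighbor search on points.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def subspace_to_point(basis: np.ndarray) -> np.ndarray:
    """basis: (d, k) matrix with orthonormal columns spanning the
    subspace. Returns the flattened projection matrix Q Q^T."""
    return (basis @ basis.T).ravel()

rng = np.random.default_rng(0)
d, k, n = 20, 3, 500
db = []
for _ in range(n):
    Q, _ = np.linalg.qr(rng.normal(size=(d, k)))  # random orthonormal basis
    db.append(subspace_to_point(Q))

index = NearestNeighbors(n_neighbors=1).fit(np.array(db))
Qq, _ = np.linalg.qr(rng.normal(size=(d, k)))      # query subspace
dist, idx = index.kneighbors(subspace_to_point(Qq)[None, :])
print("nearest subspace:", idx[0, 0])
```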
Article
Full-text available
In the context of the automated surveillance field, automatic scene analysis and understanding systems typically consider only visual information, whereas other modalities, such as audio, are typically disregarded. This paper presents a new method able to integrate audio and visual information for scene analysis in a typical surveillance scenario, using only one camera and one monaural microphone. Visual information is analyzed by a standard visual background/foreground (BG/FG) modelling module, enhanced with a novelty detection stage and coupled with an audio BG/FG modelling scheme. These processes permit one to detect separate audio and visual patterns representing unusual unimodal events in a scene. The integration of audio and visual data is subsequently performed by exploiting the concept of synchrony between such events. The audio-visual (AV) association is carried out online and without need for training sequences, and is based on the computation of a characteristic feature called the audio-video concurrence matrix, allowing one to detect and segment AV events, as well as to discriminate between them. Experimental tests involving classification and clustering of events show the potential of the proposed approach, also in comparison with results obtained using the single modalities alone and without considering synchrony.
Article
Full-text available
Local image features or interest points provide compact and abstract representations of patterns in an image. In this paper, we extend the notion of spatial interest points into the spatio-temporal domain and show how the resulting features capture interesting events in video and can be used for a compact representation and for interpretation of video data.
Article
Full-text available
Given dense optical flow (u(x, y), v(x, y)) for an image sequence, scale-independent scalar features of each flow, based on moments of the moving points weighted by |u|, |v|, or |(u, v)|, characterize the spatial distribution of the flow. We then analyze the periodic structure of these sequences of scalars. The scalar sequences for an image sequence have the same fundamental period but differ in phase, which yields a phase feature for each signal. Some phase features are consistent for one person and show significant statistical variation among persons. We use the phase feature vectors to recognize individuals by the shape of their motion. As few as three features out of the full set of twelve lead to excellent discrimination. Keywords: action recognition, gait recognition, motion features, optic flow, motion energy, spatial frequency analysis
Article
LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
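Since scikit-learn's SVC is built on LIBSVM, one convenient way to exercise the library from Python is shown in the minimal sketch below; the data is synthetic and the descriptor dimension is arbitrary.

```python
# Minimal LIBSVM usage via scikit-learn's SVC wrapper.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))            # 200 descriptors, dim 64
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

clf = SVC(kernel="linear", C=1.0)         # linear SVM classifier
scores = cross_val_score(clf, X, y, cv=5) # 5-fold cross-validation
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```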
Article
The focus of motion analysis has been on estimating a flow vector for every pixel by matching intensities. In my thesis, I will explore motion representations beyond the pixel level and new applications to which these representations lead. I first focus on analyzing motion from video sequences. Traditional motion analysis suffers from the inappropriate modeling of the grouping relationship of pixels and from a lack of ground-truth data. Using layers as the interface for humans to interact with videos, we build a human-assisted motion annotation system to obtain ground-truth motion, missing in the literature, for natural video sequences. Furthermore, we show that with the layer representation, we can detect and magnify small motions to make them visible to human eyes. Then we move to a contour representation to analyze the motion for textureless objects under occlusion. We demonstrate that simultaneous boundary grouping and motion analysis can solve challenging data, where the traditional pixel-wise motion analysis fails. In the second part of my thesis, I will show the benefits of matching local image structures instead of intensity values. We propose SIFT flow that establishes dense, semantically meaningful correspondence between two images across scenes by matching pixel-wise SIFT features. Using SIFT flow, we develop a new framework for image parsing by transferring the metadata information, such as annotation, motion and depth, from the images in a large database to an unknown query image. We demonstrate this framework using new applications such as predicting motion from a single image and motion synthesis via object transfer. Based on SIFT flow, we introduce a nonparametric scene parsing system using label transfer, with very promising experimental results suggesting that our system outperforms state-of-the-art techniques based on training classifiers.
Article
LIBSVM is a library for support vector machines (SVM). Its goal is to help users to easily use SVM as a tool. In this document, we present all its implementation details. For the use of LIBSVM, the README file included in the package and the LIBSVM FAQ provide the information.
Article
We derive a statistical graphical model of video scenes with multiple, possibly occluded objects that can be efficiently used for tasks related to video search, browsing, and retrieval. The model is trained on a query (target) clip selected by the user. The shot retrieval process is based on the likelihood of a video frame under the generative model. Instead of using a combination of weighted Euclidean distances as a shot similarity measure, the likelihood model automatically separates and balances various causes of variability in video, including occlusion, appearance change, and motion. Thus, we overcome the tedious and complex user interventions required in previous studies. We use the model in an adaptive video fast-forward application that adapts video playback speed to the likelihood of the data. The similarity measure of each candidate clip to the target clip defines the playback speed. Given a query, the video is played at a higher speed as long as the video content has low likelihood, and when frames similar to the query clip start to come in, the playback rate drops. A set of experiments on typical home videos demonstrates the performance, ease of use, and utility of our application.
Article
Local image features or interest points provide compact and abstract representations of patterns in an image. In this paper, we extend the notion of spatial interest points into the spatio-temporal domain and show how the resulting features often reflect interesting events that can be used for a compact representation of video data as well as for interpretation of spatio-temporal events. To detect spatio-temporal events, we build on the idea of the Harris and Förstner interest point operators and detect local structures in space-time where the image values have significant local variations in both space and time. We estimate the spatio-temporal extents of the detected events by maximizing a normalized spatio-temporal Laplacian operator over spatial and temporal scales. To represent the detected events, we then compute local, spatio-temporal, scale-invariant N-jets and classify each event with respect to its jet descriptor. For the problem of human motion analysis, we illustrate how a video representation in terms of local space-time features allows for detection of walking people in scenes with occlusions and dynamic cluttered backgrounds.
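A compact sketch of a space-time Harris-style operator in this spirit follows: build the 3x3 spatio-temporal structure tensor from smoothed video gradients and score each voxel. The smoothing scales and the constant k are illustrative choices, not the paper's exact settings.

```python
# Space-time (Harris-style) interest operator over a video volume.
import numpy as np
from scipy.ndimage import gaussian_filter

def spacetime_harris(video, sigma=1.5, tau=1.5, s=2.0, k=0.005):
    """video: (T, H, W) float array. Returns a per-voxel response."""
    V = gaussian_filter(video, sigma=(tau, sigma, sigma))
    It, Iy, Ix = np.gradient(V)
    # Smoothed second-moment (structure tensor) entries.
    smooth = lambda a: gaussian_filter(a, sigma=(s * tau, s * sigma, s * sigma))
    Axx, Ayy, Att = smooth(Ix * Ix), smooth(Iy * Iy), smooth(It * It)
    Axy, Axt, Ayt = smooth(Ix * Iy), smooth(Ix * It), smooth(Iy * It)
    # det and trace of the symmetric 3x3 tensor at every voxel.
    det = (Axx * (Ayy * Att - Ayt ** 2)
           - Axy * (Axy * Att - Ayt * Axt)
           + Axt * (Axy * Ayt - Ayy * Axt))
    trace = Axx + Ayy + Att
    return det - k * trace ** 3  # large positive values = interest points

rng = np.random.default_rng(0)
response = spacetime_harris(rng.normal(size=(20, 64, 64)))
```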
Conference Paper
This paper presents a target tracking framework for unstructured crowded scenes, defined as scenes where the motion of a crowd appears random, with different participants moving in different directions over time. This means each spatial location in such scenes supports more than one, i.e., multi-modal, crowd behavior. The case of tracking in structured crowded scenes, where the crowd moves coherently in a common direction that does not vary over time, was handled in prior work. In this work, we propose to model the various crowd behavior (or motion) modalities at different locations of the scene by employing the Correlated Topic Model (CTM). In our construction, words correspond to low-level quantized motion features and topics correspond to crowd behaviors. It is then assumed that motion at each location in an unstructured crowd scene is generated by a set of behavior proportions, where behaviors represent distributions over low-level motion features. In this way, any one location in the scene may support multiple crowd behavior modalities, which can be used as prior information for tracking. Our approach enables us to model a diverse set of unstructured crowd domains, ranging from cluttered time-lapse microscopy videos of cell populations in vitro to footage of crowded sporting events.
Conference Paper
We present a novel action recognition method based on combining the effective description properties of Local Binary Patterns with the appearance invariance and adaptability of patch-matching-based methods. The resulting method is extremely efficient and thus suitable for real-time use, with simultaneous recovery of human actions of several lengths and starting points. Tested on all publicly available datasets in the literature known to us, our system repeatedly achieves state-of-the-art performance. Lastly, we present a new benchmark that focuses on uncut motion recognition in broadcast sports video.
Article
Recognizing actions in videos is rapidly becoming a topic of much research. To facilitate the development of methods for action recognition, several video collections, along with benchmark protocols, have previously been proposed. In this paper, we present a novel video database, the "Action Similarity LAbeliNg" (ASLAN) database, along with benchmark protocols. The ASLAN set includes thousands of videos collected from the web, in over 400 complex action classes. Our benchmark protocols focus on action similarity (same/not-same), rather than action classification, and testing is performed on never-before-seen actions. We propose this data set and benchmark as a means for gaining a more principled understanding of what makes actions different or similar, rather than learning the properties of particular action classes. We present baseline results on our benchmark, and compare them to human performance. To promote further study of action similarity techniques, we make the ASLAN database, benchmarks, and descriptor encodings publicly available to the research community.
Recent work shows how to use local spatio-temporal features to learn models of realistic human actions from video. However, existing methods typically rely on a predefined spatial binning of the local descriptors to impose spatial information beyond a pure “bag-of-words” model, and thus may fail to capture the most informative space-time relationships. We propose to learn the shapes of space-time feature neighborhoods that are most discriminative for a given action category. Given a set of training videos, our method first extracts local motion and appearance features, quantizes them to a visual vocabulary, and then forms candidate neighborhoods consisting of the words associated with nearby points and their orientation with respect to the central interest point. Rather than dictate a particular scaling of the spatial and temporal dimensions to determine which points are near, we show how to learn the class-specific distance functions that form the most informative configurations. Descriptors for these variable-sized neighborhoods are then recursively mapped to higher-level vocabularies, producing a hierarchy of space-time configurations at successively broader scales. Our approach yields state-of-the-art performance on the UCF Sports and KTH datasets.
The aim of this paper is to address recognition of natural human actions in diverse and realistic video settings. This challenging but important subject has mostly been ignored in the past due to several problems, one of which is the lack of realistic and annotated video datasets. Our first contribution is to address this limitation and to investigate the use of movie scripts for automatic annotation of human actions in videos. We evaluate alternative methods for action retrieval from scripts and show the benefits of a text-based classifier. Using the retrieved action samples for visual learning, we next turn to the problem of action classification in video. We present a new method for video classification that builds upon and extends several recent ideas, including local space-time features, space-time pyramids and multi-channel non-linear SVMs. The method is shown to improve state-of-the-art results on the standard KTH action dataset by achieving 91.8% accuracy. Given the inherent problem of noisy labels in automatic annotation, we particularly investigate and show the high tolerance of our method to annotation errors in the training set. We finally apply the method to learning and classifying challenging action classes in movies and show promising results.
Conference Paper
We present a novel approach for human activity recognition. The method uses dynamic texture descriptors to describe human movements in a spatiotemporal way. The same features are also used for human detection, which makes our whole approach computationally simple. Following recent trends in computer vision research, our method works on image data rather than silhouettes. We test our method on a publicly available dataset and compare our result to the state-of-the-art methods.
Conference Paper
Real-world actions often occur in crowded, dynamic environments. This poses a difficult challenge for current approaches to video event detection because it is difficult to segment the actor from the background due to distracting motion from other objects in the scene. We propose a technique for event recognition in crowded videos that reliably identifies actions in the presence of partial occlusion and background clutter. Our approach is based on three key ideas: (1) we efficiently match the volumetric representation of an event against oversegmented spatio-temporal video volumes; (2) we augment our shape-based features using flow; (3) rather than treating an event template as an atomic entity, we separately match by parts (both in space and time), enabling robustness against occlusions and actor variability. Our experiments on human actions, such as picking up a dropped object or waving in a crowd, show reliable detection with few false positives.
Conference Paper
Whereas the action recognition community has focused mostly on detecting simple actions like clapping, walking or jogging, the detection of fights or, in general, aggressive behavior has been comparatively less studied. Such capability may be extremely useful in some video surveillance scenarios, such as prisons, psychiatric or elderly care centers, or even camera phones. After an analysis of previous approaches, we test the well-known Bag-of-Words framework used for action recognition on the specific problem of fight detection, along with two of the best action descriptors currently available: STIP and MoSIFT. For the purpose of evaluation, and to foster research on violence detection in video, we introduce a new video database containing 1000 sequences divided into two groups: fights and non-fights. Experiments on this database and another one with fights from action movies show that fights can be detected with near 90% accuracy.
Conference Paper
The One-Shot-Similarity (OSS) is a framework for classifier-based similarity functions. It is based on the use of background samples and was shown to excel in tasks ranging from face recognition to document analysis. However, we found that its performance depends on the ability to effectively learn the underlying classifiers, which in turn depends on the underlying metric. In this work we present a metric learning technique that is geared toward improved OSS performance. We test the proposed technique using the recently presented ASLAN action similarity labeling benchmark. Enhanced, state of the art performance is obtained, and the method compares favorably to leading similarity learning techniques.
Article
We address the problem of detecting irregularities in visual data, e.g., detecting suspicious behaviors in video sequences, or identifying salient patterns in images. The term "irregular" depends on the context in which the "regular" or "valid" are defined. Yet, it is not realistic to expect explicit definition of all possible valid configurations for a given context. We pose the problem of determining the validity of visual data as a process of constructing a puzzle: we try to compose a new observed image region or a new video segment ("the query") using chunks of data ("pieces of puzzle") extracted from previous visual examples ("the database"). Regions in the observed data which can be composed using large contiguous chunks of data from the database are considered very likely, whereas regions in the observed data which cannot be composed from the database (or can be composed, but only using small fragmented pieces) are regarded as unlikely/suspicious. The problem is posed as an inference process in a probabilistic graphical model. We show applications of this approach to identifying saliency in images and video, and for suspicious behavior recognition.
Article
This paper addresses the problem of recognizing human gestures from videos using models that are built from the Riemannian geometry of shape spaces. We represent a human gesture as a temporal sequence of human poses, each characterized by a contour of the associated human silhouette. The shape of a contour is viewed as a point on the shape space of closed curves and, hence, each gesture is characterized and modeled as a trajectory on this shape space. We propose two approaches for modeling these trajectories. In the first template-based approach, we use dynamic time warping (DTW) to align the different trajectories using elastic geodesic distances on the shape space. The gesture templates are then calculated by averaging the aligned trajectories. In the second approach, we use a graphical model approach similar to an exemplar-based hidden Markov model, where we cluster the gesture shapes on the shape space, and build non-parametric statistical models to capture the variations within each cluster. We model each gesture as a Markov model of transitions between these clusters. To evaluate the proposed approaches, an extensive set of experiments was performed using two different data sets representing gesture and action recognition applications. The proposed approaches not only are successfully able to represent the shape and dynamics of the different classes for recognition, but are also robust against some errors resulting from segmentation and background subtraction.
Article
We propose a set of kinematic features that are derived from the optical flow for human action recognition in videos. The set of kinematic features includes divergence, vorticity, symmetric and antisymmetric flow fields, second and third principal invariants of flow gradient and rate of strain tensor, and third principal invariant of rate of rotation tensor. Each kinematic feature, when computed from the optical flow of a sequence of images, gives rise to a spatiotemporal pattern. It is then assumed that the representative dynamics of the optical flow are captured by these spatiotemporal patterns in the form of dominant kinematic trends or kinematic modes. These kinematic modes are computed by performing Principal Component Analysis (PCA) on the spatiotemporal volumes of the kinematic features. For classification, we propose the use of multiple instance learning (MIL) in which each action video is represented by a bag of kinematic modes. Each video is then embedded into a kinematic-mode-based feature space and the coordinates of the video in that space are used for classification using the nearest neighbor algorithm. The qualitative and quantitative results are reported on the benchmark data sets.
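Two of the kinematic features named above, divergence and vorticity, are straightforward to compute from a dense flow field with finite differences; a quick sketch follows (the toy flow at the end is invented for the example).

```python
# Divergence (expansion/contraction) and vorticity (local rotation)
# from a dense optical-flow field.
import numpy as np

def kinematic_features(u: np.ndarray, v: np.ndarray):
    """u, v: (H, W) horizontal/vertical flow components."""
    du_dy, du_dx = np.gradient(u)
    dv_dy, dv_dx = np.gradient(v)
    divergence = du_dx + dv_dy   # net outflow at each pixel
    vorticity = dv_dx - du_dy    # curl of the 2D flow field
    return divergence, vorticity

# Toy flow: a rotating field (u = -y, v = x) has zero divergence and
# constant vorticity (equal to 2).
ys, xs = np.mgrid[-32:32, -32:32].astype(float)
div, vor = kinematic_features(-ys, xs)
```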
Article
Dynamic texture (DT) is an extension of texture to the temporal domain. Description and recognition of DTs have attracted growing attention. In this paper, a novel approach for recognizing DTs is proposed and its simplifications and extensions to facial image analysis are also considered. First, the textures are modeled with volume local binary patterns (VLBP), which are an extension of the LBP operator widely used in ordinary texture analysis, combining motion and appearance. To make the approach computationally simple and easy to extend, only the co-occurrences of the local binary patterns on three orthogonal planes (LBP-TOP) are then considered. A block-based method is also proposed to deal with specific dynamic events such as facial expressions in which local information and its spatial locations should also be taken into account. In experiments with two DT databases, DynTex and Massachusetts Institute of Technology (MIT), both the VLBP and LBP-TOP clearly outperformed the earlier approaches. The proposed block-based method was evaluated with the Cohn-Kanade facial expression database with excellent results. The advantages of our approach include local processing, robustness to monotonic gray-scale changes, and simple computation.
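A compact LBP-TOP-style sketch is given below: an ordinary 2D LBP operator is applied to the three orthogonal central planes (XY, XT, YT) of a video volume and the histograms are concatenated. This is a simplified illustration of the idea, not the authors' full block-based implementation.

```python
# LBP-TOP-style descriptor from three orthogonal planes of a video.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_top(video: np.ndarray, P: int = 8, R: int = 1) -> np.ndarray:
    """video: (T, H, W) gray-level volume. Returns concatenated
    uniform-LBP histograms from the three orthogonal central planes."""
    T, H, W = video.shape
    planes = [video[T // 2],        # XY plane (one frame)
              video[:, H // 2, :],  # XT plane
              video[:, :, W // 2]]  # YT plane
    n_bins = P + 2                  # bin count for the 'uniform' method
    hists = []
    for plane in planes:
        codes = local_binary_pattern(plane, P, R, method="uniform")
        h, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins))
        hists.append(h / max(h.sum(), 1))
    return np.concatenate(hists)

rng = np.random.default_rng(0)
desc = lbp_top(rng.integers(0, 256, size=(30, 60, 80), dtype=np.uint8))
```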
Conference Paper
A common trend in object recognition is to detect and leverage the use of sparse, informative feature points. The use of such features makes the problem more manageable while providing increased robustness to noise and pose variation. In this work we develop an extension of these ideas to the spatio-temporal case. For this purpose, we show that the direct 3D counterparts to commonly used 2D interest point detectors are inadequate, and we propose an alternative. Anchoring off of these interest points, we devise a recognition algorithm based on spatio-temporally windowed data. We present recognition results on a variety of datasets including both human and rodent behavior.
Article
Presents a theoretically very simple, yet efficient, multiresolution approach to gray-scale and rotation invariant texture classification based on local binary patterns and nonparametric discrimination of sample and prototype distributions. The method is based on recognizing that certain local binary patterns, termed "uniform," are fundamental properties of local image texture and their occurrence histogram is proven to be a very powerful texture feature. We derive a generalized gray-scale and rotation invariant operator presentation that allows for detecting the "uniform" patterns for any quantization of the angular space and for any spatial resolution and presents a method for combining multiple operators for multiresolution analysis. The proposed approach is very robust in terms of gray-scale variations since the operator is, by definition, invariant against any monotonic transformation of the gray scale. Another advantage is computational simplicity as the operator can be realized with a few operations in a small neighborhood and a lookup table. Experimental results demonstrate that good discrimination can be achieved with the occurrence statistics of simple rotation invariant local binary patterns
Article
The image of a moving figure contains image flow that varies both spatially and temporally. By carefully describing this flow, we can derive model-free features that vary with the type of moving figure and the type of motion. We develop a model-free description of instantaneous motion, the shape of motion, and then use that description to recognize individuals by their gait, discriminating them by periodic variation in the shape of their motion. We begin with a short sequence of images of a moving figure, taken by a static camera, and derive dense optical flow data, (u(x, y), v(x, y)), for the sequence. We determine a range of scale-independent scalar features of each flow image that characterize the spatial distribution of the flow. The scalars are based on various moments of the set of moving points. To characterize the shape of the motion, not the shape of the moving points, the points are weighted by |u|, |v|, or |(u, v)|. We then analyze the periodic structure of these sequences ...
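A small sketch of flow-weighted moments like those described above follows: the per-pixel flow magnitude |(u, v)| acts as a weight, and scalar shape-of-motion features are derived from weighted centroids and second moments. The particular features returned here are illustrative choices, not the paper's twelve.

```python
# Scale-independent, flow-weighted moment features for one frame pair.
import numpy as np

def flow_moment_features(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """u, v: (H, W) optical-flow components."""
    w = np.hypot(u, v)                      # weight = |(u, v)|
    total = w.sum() + 1e-9
    ys, xs = np.mgrid[0:u.shape[0], 0:u.shape[1]]
    cx = (w * xs).sum() / total             # weighted centroid
    cy = (w * ys).sum() / total
    sx = np.sqrt((w * (xs - cx) ** 2).sum() / total)  # weighted spread
    sy = np.sqrt((w * (ys - cy) ** 2).sum() / total)
    aspect = sx / (sy + 1e-9)               # scale-independent ratio
    return np.array([cx / u.shape[1], cy / u.shape[0], aspect])

rng = np.random.default_rng(0)
feats = flow_moment_features(rng.normal(size=(48, 64)),
                             rng.normal(size=(48, 64)))
```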
Criminological theories: introduction, evaluation, and application
  • R Akers
  • C Sellers
R. Akers and C. Sellers. Criminological theories: introduction, evaluation, and application. Oxford University Press, 2008.
Identifying surprising events in videos using bayesian topic models
  • A Hendel
  • D Weinshall
  • S Peleg
A. Hendel, D. Weinshall, and S. Peleg. Identifying surprising events in videos using bayesian topic models. ACCV, pages 448-459, 2011.
One shot similarity metric learning for action recognition. Similarity-Based Pattern Recognition
  • O Kliper-Gross
  • T Hassner
  • L Wolf
O. Kliper-Gross, T. Hassner, and L. Wolf. One shot similarity metric learning for action recognition. Similarity-Based Pattern Recognition, pages 31-45, 2011.