Conference Paper

Speed Estimation and Abnormality Detection from Surveillance Cameras


... Hard calibration involves jointly estimating both intrinsic and extrinsic parameters with the camera already installed. It can also be performed either manually [20,38,58] or automatically [12,14,18,19,23,33,78,85,88,101,108]. ...
... Multiple lanes, both directions. Very high meter-to-pixel ratio [88]. (b) Medium focal length, medium relative distances. ...
... One of the most common ways to compute the camera's extrinsic parameters in the operation environment is using vanishing points [12,14,15,19,22,23,27,33,42,85,88,89,91,108]. The set of parallel lines in the 3D real-world coordinate ... The translation matrix T is then obtained using knowledge of the real-world dimensions of some object or region in the image. ...
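As a rough illustration of the vanishing-point construction these excerpts refer to, the sketch below recovers a focal length and camera rotation from two vanishing points of orthogonal direction families (e.g. along-road and across-road lines). It is a minimal numpy sketch under common assumptions (square pixels, zero skew, principal point at the image centre); the function name is ours, and, as the excerpt notes, the translation/scale must still come from a known real-world dimension.

```python
import numpy as np

def calibrate_from_vanishing_points(vp1, vp2, principal_point):
    """Focal length and rotation from two orthogonal vanishing points."""
    u = np.asarray(vp1, float) - principal_point
    v = np.asarray(vp2, float) - principal_point
    # Orthogonality of the two 3D directions gives u.v + f^2 = 0.
    f2 = -(u[0] * v[0] + u[1] * v[1])
    if f2 <= 0:
        raise ValueError("vanishing points inconsistent with orthogonal directions")
    f = np.sqrt(f2)
    # Back-project each vanishing point to get a rotation column.
    r1 = np.array([u[0], u[1], f]); r1 /= np.linalg.norm(r1)
    r2 = np.array([v[0], v[1], f]); r2 /= np.linalg.norm(r2)
    r3 = np.cross(r1, r2)
    R = np.stack([r1, r2, r3], axis=1)
    return f, R
```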
Article
Full-text available
Abstract The need to accurately estimate the speed of road vehicles is becoming increasingly important for at least two main reasons. First, the number of speed cameras installed worldwide has been growing in recent years, as the introduction and enforcement of appropriate speed limits are considered one of the most effective means of increasing road safety. Second, traffic monitoring and forecasting in road networks plays a fundamental role in managing traffic, emissions, and energy consumption in smart cities, with vehicle speed being one of the most relevant parameters of the traffic state. Among the technologies available for the accurate detection of vehicle speed, vision‐based systems bring great challenges to be solved, but also great potential advantages, such as a drastic reduction of costs due to the absence of expensive range sensors, and the possibility of identifying vehicles accurately. This paper provides a review of vision‐based vehicle speed estimation. The terminology and the application domains are described, and a complete taxonomy of a large selection of works, categorizing all stages involved, is proposed. An overview of performance evaluation metrics and available datasets is provided. Finally, current limitations and future directions are discussed.
... To detect vehicles, all of the teams opted for deep-learning-based object detectors, which we describe in section 2.2. Most teams [168,170,171,173,174] used Faster R-CNN [18], followed by multiple teams [166,169,167] using Mask R-CNN [98], and finally three teams [164,172] used YOLO v2 [103]. One team [170] additionally constructed a 3D bounding box around the vehicle using a contour extraction network [175]. ...
... To perform tracking many teams [165,168,171,172,173,174] base their approach on IoU between successive detections with some additional processing via graphs [165], histogram matching [171], optical flow [172], Kalman filter [173] or correlation-based filtering [174]. Kalman filter [42] has been also used by itself [170]. ...
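Several of the cited entries associate detections across frames purely by IoU overlap between successive frames. A minimal greedy sketch of that baseline follows (all names are ours); the teams cited above layer Kalman filtering, histogram matching, optical flow or correlation filtering on top of this step.

```python
def iou(a, b):
    # boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def update_tracks(tracks, detections, thresh=0.4):
    """Greedy IoU association of new detections to existing tracks."""
    unmatched = list(detections)
    for track in tracks:
        if not unmatched:
            break
        best = max(unmatched, key=lambda d: iou(track["boxes"][-1], d))
        if iou(track["boxes"][-1], best) >= thresh:
            track["boxes"].append(best)
            unmatched.remove(best)
    for det in unmatched:  # every unmatched detection starts a new track
        tracks.append({"boxes": [det]})
    return tracks
```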
Thesis
Full-text available
Intelligent Transportation Systems are advanced systems integrating various information technologies with the goal of providing services for efficient, informed and safe use and development of transportation networks. Such systems require efficient large-scale data collection. Recent developments in camera technology have made traffic cameras a viable source for such data. This thesis deals with the automated analysis of traffic camera footage. We present our own pipeline for vehicle counting, which consists of existing state-of-the-art methods. We also present our novel contribution to the task of detecting 3D bounding boxes of vehicles. We show that when this method is used for vehicle speed estimation, the resulting mean error is only 0.75 km/h, which is 32% less than the error of the best competing method. We also present our contribution to semi-automatic traffic camera calibration based on detecting the vanishing points of individual vehicles in traffic camera footage. We show that the results of this method are on par with the best existing approach while suffering from fewer limitations.
... GNNs (graph neural networks) are also deemed a promising solution for anomaly detection [47,75], though we argue that their learned features are not quite compatible with road traffic surveillance. Recent anomaly detection frameworks can be divided into supervised learning methods [4,11,22,27,29,32,33,40,50,56,72,76,77,80,86], which require a certain level of manually processed data, and unsupervised learning-based methods [7,13,14,17,37,42,46,51,55,57,64,65,70,78,79], which require virtually no labeled data. Supervised learning has been deemed obsolete in recent years due to its low robustness and labor-intensive annotation. ...
... Note that RMSE is normalized by 300 to set the maximum acceptable time error in detection. Many works [7,10,14,15,21,22,50,54,65,80,87] are evaluated by the above three metrics. ...
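Assuming the excerpt refers to AI City Challenge-style anomaly metrics, where the RMSE of detected anomaly start times is clipped and normalized by a 300 s maximum acceptable error and combined with detection F1, the following sketch shows how such scores could be computed. The function names and the exact combination are our reading of the excerpt, not a verified specification.

```python
import numpy as np

def nrmse(pred_times, true_times, max_error=300.0):
    """RMSE of detected anomaly start times, clipped and normalized by the
    maximum acceptable error (300 in the excerpt above)."""
    err = np.minimum(np.abs(np.asarray(pred_times, float)
                            - np.asarray(true_times, float)), max_error)
    return np.sqrt(np.mean(err ** 2)) / max_error

def s4_score(f1, nrmse_value):
    # Combines detection quality (F1) with time-localisation accuracy.
    return f1 * (1.0 - nrmse_value)
```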
Article
Full-text available
Occasions such as stalled vehicles or crashes caused by abnormal trajectories should be instantly identified and then dealt with quickly by the city traffic management system for the sake of road safety. However, fast and accurate automatic detection systems based on machine learning generally face great challenges from the shortage of recorded accident data, resulting in low detection accuracy. Many existing studies implement a two-level detection approach: stalled vehicles are detected at the stationary level, while abnormal trajectories are detected at the mobile level. This paper proposes a novel triple-layer framework that distributes these two levels across three parallel layers for maximum efficiency. A straightforward background extraction algorithm is applied at the beginning of this framework to separate moving from stationary content. Layer 1 implements a lightweight optical-flow-based feature extraction algorithm to convert mobile visual features into learnable data. With a clustering algorithm that learns the common trajectories in an unsupervised manner, abnormal trajectories are detected in Layer 2. Simultaneously, in Layer 3, a custom-trained object detection algorithm is applied to detect stalled/crashed vehicles. The computational efficiency is improved and the detection accuracy is boosted. Experiments conducted on the Nvidia AI City Challenge dataset demonstrate the effectiveness of our LRATD (Lightweight Real-Time Abnormal Trajectory Detection) framework in terms of a 104% gain in detection speed compared to the fastest entry, while achieving a 0.935 S4-Score, only 2.1% less than the current state-of-the-art method. Overall, the performance of LRATD opens the possibility of its real-life application.
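Layer 1 of the framework above converts motion into learnable descriptors via lightweight optical flow. The sketch below shows one plausible version of that step, using OpenCV's Farneback dense flow pooled on a coarse grid; it illustrates the general idea and is not the paper's implementation.

```python
import cv2
import numpy as np

def flow_features(prev_gray, gray, step=16):
    """Dense Farneback optical flow, pooled per grid cell into a small
    motion descriptor (mean dx, mean dy, mean magnitude)."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5,
                                        poly_sigma=1.2, flags=0)
    h, w = gray.shape
    feats = []
    for y in range(0, h - step, step):
        for x in range(0, w - step, step):
            cell = flow[y:y + step, x:x + step].reshape(-1, 2)
            mag = np.linalg.norm(cell, axis=1).mean()
            feats.append([cell[:, 0].mean(), cell[:, 1].mean(), mag])
    return np.asarray(feats)
```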
... Authors of [104] used a 3D-tube representation of trajectories as features, exploiting the contextual proximity of neighboring trajectories to learn normal trajectories. In [52], a Fisher vector for each trajectory, obtained from the object's optical flow and position, is used. Histogram of optical flow and motion [207]: sub-trajectories, multi-instance learning; nearest-neighborhood approach with a Hausdorff-distance-based threshold for anomaly detection. ...
... A new feature descriptor, HOFME, that can handle diverse anomaly scenarios compared with conventional features. Giannakeris (2018) [52]: trajectory, Fisher vector, SVM; anomaly score derived from the Fisher vector using OCSVM. ...
Preprint
Computer vision has evolved in the last decade as a key technology for numerous applications replacing human supervision. In this paper, we present a survey of visual surveillance research on anomaly detection in public places, focusing primarily on roads. Firstly, we revisit the surveys done in the last 10 years in this field. Since the underlying building block of typical anomaly detection is learning, we place particular emphasis on learning methods applied to video scenes. We then summarize the important contributions made during the last six years on anomaly detection, primarily focusing on features, underlying techniques, applied scenarios and types of anomalies using a single static camera. Finally, we discuss the challenges in computer-vision-based anomaly detection techniques and some important future possibilities.
... Event recognition is of particular importance for traffic safety analysis since it can be used for detecting abnormal events and traffic violations and their associations with crash rates [357], car behavior analysis [358,359], and pedestrian crossing identification [360,354,361]. Recently, some works [362,363,364] have tried to predict anomalous actions using Generative Adversarial Networks (GANs). ...
... We summarize important DL-based event recognition methods in Table 13, and present some popular datasets for event recognition tasks in Table 18. One observation is that unsupervised, semi-supervised, and self-supervised models are becoming more prevalent in recent works [357,368,364] to mitigate the costly and tedious job of video annotation and to simplify high-volume video processing. These purely data-driven methods often focus on detection problems. ...
Preprint
Full-text available
This paper explores Deep Learning (DL) methods that are used or have the potential to be used for traffic video analysis, emphasizing driving safety for both Autonomous Vehicles (AVs) and human-operated vehicles. We present a typical processing pipeline, which can be used to understand and interpret traffic videos by extracting operational safety metrics and providing general hints and guidelines to improve traffic safety. This processing framework includes several steps: video enhancement, video stabilization, semantic and incident segmentation, object detection and classification, trajectory extraction, speed estimation, event analysis, modeling and anomaly detection. Our main goal is to guide traffic analysts to develop their own custom-built processing frameworks by selecting the best choices for each step and offering new designs for the lacking modules, providing a comparative analysis of the most successful conventional and DL-based algorithms proposed for each step. We also review existing open-source tools and public datasets that can help train DL models. To be more specific, we review exemplary traffic problems and mention the required steps for each problem. Besides, we investigate connections to the closely related research areas of drivers' cognition evaluation, crowd-sourcing-based monitoring systems, edge computing in roadside infrastructures, and ADS-equipped AVs, and highlight the missing gaps. Finally, we review commercial implementations of traffic monitoring systems, their future outlook, and open problems and remaining challenges for widespread use of such systems.
... Event recognition is of particular importance for traffic safety analysis since it can be used for detecting abnormal events and traffic violations and their associations with crash rates [348], car behaviour analysis [354,355], and pedestrian crossing identification [388,391,392]. Recently, some works [337,393,394] have tried to predict anomalous actions using generative adversarial networks (GANs). ...
... We summarise important DL-based event recognition methods in Table 13, and present some popular datasets for event recognition tasks in Table 18. One observation is that unsupervised, semi-supervised, and self-supervised models are becoming more prevalent in recent works [337,338,348] to mitigate the costly and tedious job of video annotation and to simplify high-volume video processing. ...
Article
Full-text available
This paper explores deep learning (DL) methods that are used or have the potential to be used for traffic video analysis, emphasising driving safety for both autonomous vehicles and human‐operated vehicles. A typical processing pipeline is presented, which can be used to understand and interpret traffic videos by extracting operational safety metrics and providing general hints and guidelines to improve traffic safety. This processing framework includes several steps, including video enhancement, video stabilisation, semantic and incident segmentation, object detection and classification, trajectory extraction, speed estimation, event analysis, modelling, and anomaly detection. The main goal is to guide traffic analysts to develop their own custom‐built processing frameworks by selecting the best choices for each step and offering new designs for the lacking modules by providing a comparative analysis of the most successful conventional and DL‐based algorithms proposed for each step. Existing open‐source tools and public datasets that can help train DL models are also reviewed. To be more specific, exemplary traffic problems are reviewed and required steps are mentioned for each problem. Besides, connections to the closely related research areas of drivers' cognition evaluation, crowd‐sourcing‐based monitoring systems, edge computing in roadside infrastructures, automated driving systems‐equipped vehicles are investigated, and the missing gaps are highlighted. Finally, commercial implementations of traffic monitoring systems, their future outlook, and open problems and remaining challenges for widespread use of such systems are reviewed.
... Giannakeris et al. [6] introduced a fully automatic camera calibration algorithm to estimate the speed of vehicles. ...
Article
Full-text available
In this article, an effective solution is presented to assist a driver in making overtaking decisions under adverse, dark night-time conditions on a two-lane single carriageway road. An awkward road situation is considered, where one vehicle is just in front of the test vehicle travelling in the same direction and another vehicle is coming from the opposite direction. As the environment is very dark, only the headlights and taillights of any vehicle are visible. Estimating distance and speed with high accuracy, especially at night when vehicles themselves are not visible, is a challenging task. The proposed assistance system can estimate the actual and relative speed and the distance of the slow vehicle in front of the test vehicle and of the vehicle coming from the opposite direction by observing taillights and headlights, respectively. Subsequently, the required gap, road condition level, speed and acceleration for safe overtaking are estimated. Finally, the overtaking decision is made in such a way that there should not be any collision between vehicles. Several real-time experiments reveal that the estimation achieves high accuracy with safe operation compared to state-of-the-art techniques, using a low-cost 2D camera.
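The abstract does not give the authors' equations, but the underlying geometry is the pinhole relation Z = f·W/w: a known real-world separation W between a vehicle's two taillights, observed as a pixel gap w through a lens of focal length f (in pixels), yields range, and ranges at two instants yield relative speed. A hedged sketch, where the 1.3 m lamp spacing is an assumed typical value, not a figure from the paper:

```python
def distance_from_taillights(pixel_gap, focal_px, lamp_spacing_m=1.3):
    """Pinhole range estimate from the image separation of two taillights:
    Z = f * W / w (all purely illustrative)."""
    return focal_px * lamp_spacing_m / pixel_gap

def relative_speed(gap_t0, gap_t1, dt, focal_px, lamp_spacing_m=1.3):
    """Closing speed (m/s, positive when approaching) from two frames dt apart."""
    z0 = distance_from_taillights(gap_t0, focal_px, lamp_spacing_m)
    z1 = distance_from_taillights(gap_t1, focal_px, lamp_spacing_m)
    return (z0 - z1) / dt
```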
... Ahmadi et al. [151] clustered optical flow features to learn motion patterns in the traffic phase for detecting abnormal vehicle behavior, e.g., abnormal driving and not following traffic laws. In [152], the global amplitudes of optical flow in each frame were utilized to obtain the optical flow descriptors of traffic scenes. The descriptors and Fisher vector representing the spatiotemporal visual volumes were employed to detect traffic violations. ...
... Reference distances measured or assumed from features such as road markings or standard lane widths (Huang, 2018; Tran et al., 2018), combined with 'vanishing points' at which parallel lines meet in the image domain, provide parameters enabling camera calibration through algorithmic optimisation (Tang et al., 2018). Vanishing point methods have also been adapted for fully automatic application with the use of vehicle dimension estimation, vehicle motion analysis and diamond space accumulation algorithms (Dubská et al., 2015; Giannakeris and Briassouli, 2018; Sochor et al., 2017). However, with the use of background modelling and cluster analysis of vehicle trajectories derived from video footage, perspective transformation is not strictly necessary for deriving vehicle speed estimates (Xiong, 2018). ...
Article
Full-text available
A workflow is devised in this paper by which vehicle speeds are estimated semi-automatically from a fixed DSLR camera. The deep learning algorithm YOLOv2 was used for vehicle detection, while the Simple Online and Realtime Tracking (SORT) algorithm enabled vehicle tracking. Perspective projection and scale were dealt with by remotely mapping corresponding image and real-world coordinates through a homography. The ensuing transformation of camera footage to the British National Grid coordinate system allowed for the derivation of real-world distances on the planar road surface, and subsequent simultaneous vehicle speed estimation. As monitoring took place in a heavily urbanised environment, where vehicles frequently change speed, estimates were computed consecutively between frames. Speed estimates were validated against a reference dataset containing precise trajectories from a GNSS- and IMU-equipped vehicle platform. Estimates achieved an average root mean square error and mean absolute percentage error of 0.625 m/s and 20.922%, respectively. The robustness of the method was tested in a real-world context under real environmental conditions.
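The homography step described above can be sketched in a few lines of OpenCV: four (or more) image points with surveyed road-plane coordinates define the mapping, and speeds follow from mapped positions in consecutive frames. The correspondences below are placeholders, not the paper's calibration data.

```python
import cv2
import numpy as np

# Four image points and their surveyed road-plane coordinates (metres);
# the values here are placeholders for illustration only.
img_pts = np.float32([[102, 540], [812, 530], [640, 210], [260, 215]])
world_pts = np.float32([[0, 0], [7.3, 0], [7.3, 60], [0, 60]])

H, _ = cv2.findHomography(img_pts, world_pts)

def ground_position(pt):
    """Map an image point (e.g. bottom-centre of a vehicle box) to
    road-plane coordinates via the homography."""
    p = cv2.perspectiveTransform(np.float32([[pt]]), H)
    return p[0, 0]

def speed_between_frames(pt_a, pt_b, dt):
    """Speed in m/s from two mapped positions dt seconds apart."""
    return float(np.linalg.norm(ground_position(pt_b) - ground_position(pt_a)) / dt)
```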
... Authors of [114] used a 3D-tube representation of trajectories as features, exploiting the contextual proximity of neighboring trajectories to learn normal trajectories. In [125], a Fisher vector for each trajectory, obtained from the object's optical flow and position, is used. Histogram of optical flow and motion entropy (HOFME) has been used in [91]. ...
Article
Computer vision has evolved in the last decade as a key technology for numerous applications replacing human supervision. Timely detection of traffic violations and abnormal behavior of pedestrians at public places through computer vision and visual surveillance can be highly effective for maintaining traffic order in cities. However, despite a handful of computer vision–based techniques proposed in recent times to understand traffic violations or other types of on-road anomalies, no methodological survey is available that provides a detailed insight into the classification techniques, learning methods, datasets, and application contexts. Thus, this study aims to investigate the recent visual surveillance–related research on anomaly detection in public places, particularly on roads. The study analyzes various vision-guided anomaly detection techniques using a generic framework such that the key technical components can be easily understood. Our survey includes definitions of related terminologies and concepts, judicious classifications of the vision-guided anomaly detection approaches, detailed analysis of anomaly detection methods including deep learning–based methods, descriptions of the relevant datasets with environmental conditions, and types of anomalies. The study also reveals vital gaps in the available datasets and anomaly detection capability in various contexts, and thus gives future directions to the computer vision–guided anomaly detection research. As anomaly detection is an important step in automatic road traffic surveillance, this survey can be a useful resource for interested researchers working on solving various issues of Intelligent Transportation Systems (ITS).
Article
Full-text available
Detection of abnormal events in traffic scenes is very challenging and is a significant problem in video surveillance. The authors propose a novel scheme called super orientation optical flow (SOOF)‐based clustering for identifying abnormal activities. The key idea behind the proposed SOOF features is to efficiently capture the motion information of a moving vehicle with respect to a super-orientation motion descriptor across the frame sequence. The authors adopt the mean absolute temporal difference to identify anomalies via motion block (MB) selection and localisation. SOOF features obtained from MBs are used as motion descriptors for both normal and abnormal events. Simple and efficient K‐means clustering is used to learn the normal motion flow during training. Abnormal events are identified using a nearest‐neighbour search technique in the testing phase. The experimental results show that the proposed work detects anomalies effectively and gives better results than state‐of‐the‐art techniques.
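A minimal sketch of the train/test split described above: K-means clustering of motion descriptors from normal traffic, then a nearest-neighbour distance test at run time. The synthetic descriptors below stand in for SOOF features, and the threshold is application-dependent.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for SOOF motion descriptors extracted from normal traffic.
rng = np.random.default_rng(0)
normal_descriptors = rng.normal(size=(500, 16))

# Training: learn the normal motion flow as cluster centres.
kmeans = KMeans(n_clusters=8, n_init=10).fit(normal_descriptors)

def is_anomalous(descriptor, threshold):
    """Nearest-neighbour test: a descriptor far from every learned
    normal cluster centre is flagged as abnormal."""
    d = np.linalg.norm(kmeans.cluster_centers_ - descriptor, axis=1).min()
    return d > threshold
```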
Article
Video processing solutions for motion analysis are key tasks in many computer vision applications, ranging from human activity recognition to object detection. In particular, speed estimation algorithms may be relevant in contexts such as street monitoring and environment surveillance. In most realistic scenarios, the projection of a framed object of interest onto the image plane is likely to be affected by dynamic changes, mainly related to perspective transformations or periodic behaviours. Therefore, advanced speed estimation techniques need to rely on robust algorithms for object detection that are able to deal with potential geometric modifications. The proposed method is composed of a sequence of pre-processing operations that aim to reduce or remove perspective effects affecting the objects of interest, followed by an estimation phase based on the Maximum Likelihood (ML) principle, in which the speed of the foreground objects is estimated. ML estimation is a consolidated statistical tool that can be exploited to obtain reliable results. The performance of the proposed algorithm is evaluated on a set of real video recordings and compared with a block-matching motion estimation algorithm. The obtained results indicate that the proposed method shows good and robust performance.
Article
With the rapid development of connected autonomous vehicles (CAVs), both road infrastructure and transport are experiencing a profound transformation. In recent years, the cooperative perception and control supported infrastructure-vehicle system (IVS) has attracted increasing attention in the field of intelligent transportation systems (ITS). Perception information about surrounding objects can be obtained by various types of sensors or communication networks. Control commands generated by CAVs or infrastructure can be executed promptly and accurately to improve the overall performance of the transportation system in terms of safety, efficiency, comfort and energy saving. This study presents a comprehensive review of the research progress achieved in cooperative perception and control supported IVS over the past decade. By focusing on the essential interactions between infrastructure and CAVs and between CAVs themselves, infrastructure-vehicle cooperative perception and control methods are summarized and analyzed. Furthermore, the mining site, as a closed scenario, is used to show the current application of IVS. Finally, existing issues in implementing cooperative perception and control technology are discussed, and recommendations for future research directions are proposed.
Conference Paper
Full-text available
This paper introduces an auto-calibration mechanism for an Automatic Number Plate Recognition camera dedicated to vehicle speed measurement. The calibration task is formulated as a multi-objective optimization problem and solved with the Non-dominated Sorting Genetic Algorithm. For simplicity, a uniform motion profile for the majority of vehicles is assumed. The proposed speed estimation method is based on tracing licence plate quadrangles recognized in video frames. The results are compared with concurrent measurements performed with piezoelectric sensors.
Article
Full-text available
In this paper, we study the trade-off between accuracy and speed when building an object detection system based on convolutional neural networks. We consider three main families of detectors --- Faster R-CNN, R-FCN and SSD --- which we view as "meta-architectures". Each of these can be combined with different kinds of feature extractors, such as VGG, Inception or ResNet. In addition, we can vary other parameters, such as the image resolution, and the number of box proposals. We develop a unified framework (in Tensorflow) that enables us to perform a fair comparison between all of these variants. We analyze the performance of many different previously published model combinations, as well as some novel ones, and thus identify a set of models which achieve different points on the speed-accuracy tradeoff curve, ranging from fast models, suitable for use on a mobile phone, to a much slower model that achieves a new state of the art on the COCO detection challenge.
Article
Full-text available
This paper presents a hierarchical framework for detecting local and global anomalies via hierarchical feature representation and Gaussian process regression (GPR), which is fully non-parametric, robust to noisy training data, and supports sparse features. While most research on anomaly detection has focused on detecting local anomalies, we are more interested in global anomalies that involve multiple normal events interacting in an unusual manner, such as car accidents. To simultaneously detect local and global anomalies, we cast the extraction of normal interactions from the training videos as a problem of finding the frequent geometric relations of nearby sparse spatio-temporal interest points (STIPs). A codebook of interaction templates is then constructed and modeled using GPR, based on which a novel inference method for computing the likelihood of an observed interaction is also developed. Thereafter, these local likelihood scores are integrated into globally consistent anomaly masks, from which anomalies can be succinctly identified. To the authors' best knowledge, this is the first time GPR has been employed to model the relationship of nearby STIPs for anomaly detection. Simulations based on four widespread datasets show that the new method outperforms the main state-of-the-art methods with a lower computational burden.
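As a loose illustration of scoring an observed interaction under a GPR model (not the paper's codebook construction or inference method), scikit-learn's GPR yields a predictive mean and standard deviation from which a Gaussian likelihood can be computed; everything below, including the toy features, is our own stand-in.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy stand-ins: each row describes an interaction between nearby
# interest points; y is its observed strength/frequency.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 4))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel()).fit(X, y)

def interaction_likelihood(x_new, y_new):
    """Gaussian likelihood of an observation under the GPR predictive
    distribution; low values suggest an anomalous interaction."""
    mu, sigma = gpr.predict(x_new.reshape(1, -1), return_std=True)
    return (np.exp(-0.5 * ((y_new - mu[0]) / sigma[0]) ** 2)
            / (sigma[0] * np.sqrt(2 * np.pi)))
```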
Conference Paper
Full-text available
Vehicle speed estimation using Closed Circuit Television (CCTV) is one of the interesting issues in the field of computer vision. Various approaches are used to automate vehicle speed estimation from CCTV. In this study, the use of a Gaussian Mixture Model (GMM) for vehicle detection is improved with a hole-filling (HF) method. Vehicle speed estimation was performed under various scenarios, with the best estimate deviating by 7.63 km/h. The GMM fused with the hole-filling algorithm and combined with a pinhole model showed the best results among the tested scenarios.
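A hedged sketch of the detection step: OpenCV's MOG2 background subtractor (a per-pixel GMM) followed by morphological closing as a stand-in for the paper's hole-filling method, so each vehicle becomes a solid blob before the pinhole-model speed step.

```python
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=True)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))

def vehicle_mask(frame):
    """GMM foreground mask with a closing-based hole fill so each vehicle
    forms one solid blob (MOG2 marks shadows as 127, dropped here)."""
    mask = subtractor.apply(frame)
    mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)[1]
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask
```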
Article
Full-text available
This paper introduces a novel probabilistic activity modeling approach that mines recurrent sequential patterns, called motifs, from documents given as word × time count matrices (e.g., videos). In this model, documents are represented as a mixture of sequential activity patterns (our motifs), where the mixing weights are defined by the motif starting-time occurrences. The novelties are manifold. First, unlike previous approaches where topics modeled only the co-occurrence of words at a given time instant, our motifs model the co-occurrence and the temporal order in which words occur within a temporal window. Second, unlike traditional Dynamic Bayesian Networks (DBNs), our model accounts for the important case where activities occur concurrently in the video (but not necessarily in synchrony), i.e., the advent of activity motifs can overlap. Learning the motifs in these difficult situations is made possible by introducing latent variables representing the activity starting times, enabling us to implicitly align occurrences of the same pattern during the joint inference of the motifs and their starting times. As a third novelty, we propose a general method that favors the recovery of sparse distributions, a highly desirable property in many topic model applications, by adding simple regularization constraints on the searched distributions to the data likelihood optimization criteria. We substantiate our claims with experiments on synthetic data to demonstrate the algorithm's behavior, and on four video datasets with significant variations in activity content obtained from static cameras. We observe that, using low-level motion features from videos, our algorithm is able to capture sequential patterns that implicitly represent typical trajectories of scene objects.
Article
Full-text available
This paper addresses the problem of fully automated mining of public space video data, a highly desirable capability under contemporary commercial and security considerations. This task is especially challenging due to the complexity of the object behaviors to be profiled, the difficulty of analysis under the visual occlusions and ambiguities common in public space video, and the computational challenge of doing so in real-time. We address these issues by introducing a new dynamic topic model, termed a Markov Clustering Topic Model (MCTM). The MCTM builds on existing dynamic Bayesian network models and Bayesian topic models, and overcomes their drawbacks on sensitivity, robustness and efficiency. Specifically, our model profiles complex dynamic scenes by robustly clustering visual events into activities and these activities into global behaviours with temporal dynamics. A Gibbs sampler is derived for offline learning with unlabeled training data and a new approximation to online Bayesian inference is formulated to enable dynamic scene understanding and behaviour mining in new video data online in real-time. The strength of this model is demonstrated by unsupervised learning of dynamic scene models for four complex and crowded public scenes, and successful mining of behaviors and detection of salient events in each.
Article
Full-text available
In recent years, the task of automatic event analysis in video sequences has gained increasing attention in the research community. The application domains are disparate, ranging from video surveillance to automatic video annotation for sports videos or TV shots. Whatever the application field, most works on event analysis follow one of two main approaches: the former, based on explicit event recognition, focuses on finding high-level, semantic interpretations of video sequences; the latter is based on anomaly detection. This paper deals with the second approach, where the final goal is not the explicit labeling of recognized events, but the detection of anomalous events differing from typical patterns. In particular, the proposed work addresses anomaly detection by means of trajectory analysis, an approach with several application fields, most notably video surveillance and traffic monitoring. The proposed approach is based on single-class support vector machine (SVM) clustering, where the novelty-detection capabilities of the SVM are used for the identification of anomalous trajectories. Particular attention is given to trajectory classification in the absence of a priori information on the distribution of outliers. Experimental results prove the validity of the proposed approach.
Article
Full-text available
We propose a novel method to model and learn the scene activity, observed by a static camera. The proposed model is very general and can be applied for solution of a variety of problems. The motion patterns of objects in the scene are modeled in the form of a multivariate nonparametric probability density function of spatiotemporal variables (object locations and transition times between them). Kernel Density Estimation is used to learn this model in a completely unsupervised fashion. Learning is accomplished by observing the trajectories of objects by a static camera over extended periods of time. It encodes the probabilistic nature of the behavior of moving objects in the scene and is useful for activity analysis applications, such as persistent tracking and anomalous motion detection. In addition, the model also captures salient scene features, such as the areas of occlusion and most likely paths. Once the model is learned, we use a unified Markov Chain Monte Carlo (MCMC)-based framework for generating the most likely paths in the scene, improving foreground detection, persistent labeling of objects during tracking, and deciding whether a given trajectory represents an anomaly to the observed motion patterns. Experiments with real-world videos are reported which validate the proposed approach.
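The density model above can be approximated in a few lines with scikit-learn's KernelDensity over spatiotemporal transition vectors. The training data below is synthetic, and the feature layout (x, y, x', y', dt) is our simplification of the paper's multivariate formulation.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Stand-in training data: (x, y, x', y', dt) transitions harvested from
# object trajectories observed by a static camera.
rng = np.random.default_rng(2)
transitions = rng.normal(size=(1000, 5))

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(transitions)

def transition_is_anomalous(transition, log_density_threshold):
    """Score a new spatiotemporal transition under the learned density;
    low-probability motion is flagged as anomalous."""
    return kde.score_samples(transition.reshape(1, -1))[0] < log_density_threshold
```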
Conference Paper
Full-text available
We present two novel methods to automatically learn spatio-temporal dependencies of moving agents in complex dynamic scenes. They allow the discovery of temporal rules, such as the right of way between different lanes or typical traffic light sequences. To extract them, sequences of activities need to be learned. While the first method extracts rules based on a learned topic model, the second, called DDP-HMM, jointly learns co-occurring activities and their time dependencies. To this end we employ Dependent Dirichlet Processes to learn an arbitrary number of infinite Hidden Markov Models. In contrast to previous work, we build on state-of-the-art topic models that allow all parameters to be inferred automatically, such as the optimal number of HMMs necessary to explain the rules governing a scene. The models are trained offline by Gibbs Sampling using unlabeled training data.
Article
Full-text available
Suppose you are given some data set drawn from an underlying probability distribution P and you want to estimate a “simple” subset S of input space such that the probability that a test point drawn from P lies outside of S equals some a priori specified value between 0 and 1. We propose a method to approach this problem by trying to estimate a function f that is positive on S and negative on the complement. The functional form of f is given by a kernel expansion in terms of a potentially small subset of the training data; it is regularized by controlling the length of the weight vector in an associated feature space. The expansion coefficients are found by solving a quadratic programming problem, which we do by carrying out sequential optimization over pairs of input patterns. We also provide a theoretical analysis of the statistical performance of our algorithm. The algorithm is a natural extension of the support vector algorithm to the case of unlabeled data.
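This is the one-class SVM that several of the trajectory-based anomaly detectors above build on, and scikit-learn's OneClassSVM implements this ν-formulation: nu upper-bounds the fraction of training points treated as outliers, matching the "a priori specified value" in the abstract. A minimal usage sketch with synthetic features:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)
train = rng.normal(size=(300, 8))  # unlabeled feature vectors, mostly normal

# nu bounds the fraction of training data allowed outside the estimated
# support region S; gamma sets the RBF kernel width.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(train)

labels = ocsvm.predict(rng.normal(size=(10, 8)))  # +1 inside S, -1 outside
```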
Conference Paper
Full-text available
We propose a novel unsupervised learning framework for activity perception. To understand activities in complicated scenes from visual data, we propose a hierarchical Bayesian model to connect three elements: low-level visual features, simple "atomic" activities, and multi-agent interactions. Atomic activities are modeled as distributions over low-level visual features, and interactions are modeled as distributions over atomic activities. Our models improve existing language models such as latent Dirichlet allocation (LDA) and hierarchical Dirichlet process (HDP) by modeling interactions without supervision. Our data sets are challenging video sequences from crowded traffic scenes with many kinds of activities co-occurring. Our approach provides a summary of typical atomic activities and interactions in the scene. Unusual activities and interactions are found, with natural probabilistic explanations. Our method supports flexible high-level queries on activities and interactions using atomic activities as components.
Conference Paper
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [18], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
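For readers who want to reproduce the detection step that most of the cited challenge teams relied on, torchvision ships a pretrained Faster R-CNN that implements the RPN described above. A minimal inference sketch (the score threshold and the random frame are placeholders):

```python
import torch
import torchvision

# Pretrained Faster R-CNN with a ResNet-50 FPN backbone; the RPN is
# built into the model.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)        # stand-in for a video frame in [0, 1]
with torch.no_grad():
    out = model([image])[0]             # dict with boxes, labels, scores

keep = out["scores"] > 0.5
vehicle_boxes = out["boxes"][keep]      # (x1, y1, x2, y2) per detection
```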
Article
In this paper, we focus on fully automatic traffic surveillance camera calibration, which we use for speed measurement of passing vehicles. We improve over a recent state-of-the-art camera calibration method for traffic surveillance based on two detected vanishing points. More importantly, we propose a novel automatic scene scale inference based on matching bounding boxes of rendered 3D vehicle models with detected bounding boxes in the image. The proposed method can be used from an arbitrary viewpoint and has no constraints on camera placement. We evaluate our method on the recent comprehensive dataset for speed measurement, BrnoCompSpeed. Experiments show that our automatic camera calibration from two detected vanishing points reduces the error by 50% compared to the previous state-of-the-art method. We also show that our scene scale inference method is much more precise (mean speed measurement error 1.10 km/h), outperforming both the state-of-the-art automatic calibration method (error reduction of 86%; mean error 7.98 km/h) and manual calibration (error reduction of 19%; mean error 1.35 km/h). We also present qualitative results of the automatic camera calibration method on video sequences obtained from real surveillance cameras at various places and under different lighting conditions (night, dawn, day).
Article
We propose a method for fully automatic calibration of traffic surveillance cameras. This method allows for calibration of the camera, including scale, without any user input, only from several minutes of input surveillance video. The targeted applications include speed measurement, measurement of vehicle dimensions, vehicle classification, etc. The first step of our approach is camera calibration by determining three vanishing points defining the stream of vehicles. The second step is construction of 3D bounding boxes of individual vehicles and their measurement up to scale. We propose to first construct the projection of the bounding boxes and then, using the camera calibration obtained earlier, create their 3D representation. In the third step, we use the dimensions of the 3D bounding boxes for calibration of the scene scale. We collected a dataset with ground-truth speed and distance measurements and evaluate our approach on it. The achieved mean accuracy of speed and distance measurement is below 2%. Our efficient C++ implementation runs in real time on a low-end processor (Core i3) with a safe margin, even for full-HD videos.
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
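The residual reformulation amounts to a block that computes F(x) + x. A minimal PyTorch sketch of a basic block follows (the original architecture also uses downsampling shortcuts and bottleneck variants, omitted here):

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """Basic residual block: the conv layers learn F(x) and the identity
    shortcut adds the input back, so the block outputs F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity shortcut
```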
Article
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
Article
This paper deals with automatic calibration of roadside surveillance cameras. We focus on parameters necessary for measurements in traffic-surveillance applications. Contrary to the existing solutions, our approach requires no a priori knowledge, and it works with a very wide variety of road settings (number of lanes, occlusion, quality of ground marking), as well as with practically unlimited viewing angles. The main contribution is that our solution works fully automatically, without any per-camera or per-video manual settings or input whatsoever, and it is computationally inexpensive. Our approach uses tracking of local feature points and analyzes the trajectories in a manner based on cascaded Hough transform and parallel coordinates. An important assumption for the vehicle movement is that at least a part of the vehicle motion is approximately straight; we discuss the impact of this assumption on the applicability of our approach and show experimentally that this assumption does not limit the usability of our approach severely. We efficiently and robustly detect vanishing points, which define the ground plane and vehicle movement, except for the scene scale. Our algorithm also computes parameters for radial distortion compensation. Experiments show that the obtained camera parameters allow for measurements of relative lengths (and potentially speed) with ~2% mean accuracy. The processing is performed easily in real time, and typically, a 2-min-long video is sufficient for stable calibration.
Article
In this paper, we propose a method for modeling trajectory patterns with both regional and velocity observations through a probabilistic topic model. By embedding Gaussian models into the discrete topic model framework, our method uses continuous velocity as well as regional observations, unlike existing approaches. In addition, the proposed framework, combined with a Hidden Markov Model, can cover the temporal transition of the scene state, which is useful for checking violations of the rule that conflicting topics (e.g. two cross-traffic patterns) should not occur at the same time. To achieve online learning even with the complexity of the proposed model, we suggest a novel learning scheme instead of collapsed Gibbs sampling. The proposed two-stage greedy learning scheme is not only efficient at reducing the search space but also accurate, in the sense that online learning is no less accurate than batch learning. To validate the performance of our method, experiments were conducted on various datasets. Experimental results show that our model satisfactorily explains trajectory patterns with respect to scene understanding, anomaly detection, and prediction.
Conference Paper
We present a new dataset with the goal of advancing the state of the art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in understanding an object's precise 2D location. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old, along with per-instance segmentation masks. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
Article
The core component of most modern trackers is a discriminative classifier, tasked with distinguishing between the target and the surrounding environment. To cope with natural image changes, this classifier is typically trained with translated and scaled sample patches. Such sets of samples are riddled with redundancies: any overlapping pixels are constrained to be the same. Based on this simple observation, we propose an analytic model for datasets of thousands of translated patches. By showing that the resulting data matrix is circulant, we can diagonalize it with the Discrete Fourier Transform, reducing both storage and computation by several orders of magnitude. Interestingly, for linear regression our formulation is equivalent to a correlation filter, used by some of the fastest competitive trackers. For kernel regression, however, we derive a new Kernelized Correlation Filter (KCF) that, unlike other kernel algorithms, has the exact same complexity as its linear counterpart. Building on it, we also propose a fast multi-channel extension of linear correlation filters, via a linear kernel, which we call the Dual Correlation Filter (DCF). Both KCF and DCF outperform top-ranking trackers such as Struck or TLD on a 50-video benchmark, despite running at hundreds of frames per second and being implemented in a few lines of code (Algorithm 1). To encourage further developments, our tracking framework was made open source.
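The circulant-matrix observation above yields a closed-form filter in the Fourier domain. A minimal numpy sketch of the linear case follows (equivalent in spirit to a MOSSE-style correlation filter; the multi-channel, kernelized KCF adds more machinery):

```python
import numpy as np

def train_filter(x, y, lam=1e-2):
    """Closed-form linear correlation filter: because the data matrix of
    all cyclic shifts of x is circulant, ridge regression diagonalises
    under the DFT."""
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return np.conj(X) * Y / (np.conj(X) * X + lam)

def respond(H, z):
    """Correlation response over all cyclic shifts of patch z; the peak
    location gives the target translation."""
    return np.real(np.fft.ifft2(H * np.fft.fft2(z)))

# Usage: y is a Gaussian bump centred on the target inside patch x.
h = w = 64
yy, xx = np.mgrid[0:h, 0:w]
y = np.exp(-((yy - h // 2) ** 2 + (xx - w // 2) ** 2) / (2 * 3.0 ** 2))
x = np.random.default_rng(4).normal(size=(h, w))  # stand-in image patch
H = train_filter(x, y)
dy, dx = np.unravel_index(np.argmax(respond(H, x)), (h, w))
```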
Article
Vehicle speed measurement (VSM) based on video images represents the development direction of speed measurement in intelligent transportation systems (ITS). This paper presents a novel vehicle speed measurement method, which combines an improved three-frame difference algorithm with a proposed gray-constraint optical flow algorithm. With the improved three-frame difference algorithm, the contours of moving vehicles can be detected exactly. Through the proposed gray-constraint optical flow algorithm, the optical flow value of the vehicle contour, which is the speed (pixels/s) of the vehicle in the image, can be computed accurately. The velocity (km/h) of the vehicles is then calculated from the optical flow value of the vehicle's contour and the corresponding ratio of image pixels to the width of the road. The method yields a better optical flow field by reducing the influence of changing lighting and shadow. Besides, it reduces computation considerably, since it only calculates the optical flow value of the moving target's contour. Experimental comparisons between this method and other VSM methods show that the proposed approach gives a satisfactory estimate of vehicle speed.
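The classical three-frame difference that the paper improves on can be sketched as follows: AND-ing two successive binarized difference images suppresses the ghosting a plain two-frame difference leaves behind. The paper's improved variant and the gray-constraint optical flow step are not reproduced here.

```python
import cv2

def three_frame_difference(f0, f1, f2, thresh=25):
    """Moving-object mask from three consecutive grayscale (uint8) frames:
    the intersection of the two successive difference images keeps only
    pixels that changed in both intervals."""
    d01 = cv2.threshold(cv2.absdiff(f0, f1), thresh, 255, cv2.THRESH_BINARY)[1]
    d12 = cv2.threshold(cv2.absdiff(f1, f2), thresh, 255, cv2.THRESH_BINARY)[1]
    return cv2.bitwise_and(d01, d12)
```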
Conference Paper
In this paper, we present a new algorithm for estimating individual vehicle speed based on two consecutive images captured from a traffic safety camera system. Its principles are as follows: first, both images are transformed from the image plane to 3D world coordinates based on the calibrated camera parameters. Second, the difference of the two transformed images is calculated, so that the background is eliminated and the vehicles in the two images are mapped onto one image. Finally, a block feature of the vehicle closest to the ground is matched to estimate vehicle travel distance and speed. Experimental results show that the proposed method exhibits good and consistent performance. When compared with speed measurements obtained from speed radar, averaged estimation errors are 3.27% and 8.51% for day-time and night-time test examples, respectively, which are better than other previously published results. The proposed algorithm can easily be extended to work on image sequences.
Article
Compared to other anomalous video event detection approaches that analyze object trajectories only, we propose a context-aware method to detect anomalies. By tracking all moving objects in the video, three different levels of spatiotemporal contexts are considered, i.e., point anomaly of a video object, sequential anomaly of an object trajectory, and co-occurrence anomaly of multiple video objects. A hierarchical data mining approach is proposed. At each level, frequency-based analysis is performed to automatically discover regular rules of normal events. Events deviating from these rules are identified as anomalies. The proposed method is computationally efficient and can infer complex rules. Experiments on real traffic video validate that the detected video anomalies are hazardous or illegal according to traffic regulations.
Estimating the support of a high-dimensional distribution
  • B. Schölkopf
  • J. C. Platt
  • J. Shawe-Taylor
  • A. J. Smola
  • R. C. Williamson