Figure 4 - uploaded by Eran Swears
Source publication
We introduce a new large-scale video dataset designed to assess the performance of diverse visual event recognition algorithms with a focus on continuous visual event recognition (CVER) in outdoor areas with wide coverage. Previous datasets for action recognition are unrealistic for real-world surveillance because they consist of short clips showin...
Context in source publication
Context 1
... it is important to understand how existing approaches will behave differently based on video characteristics. Accordingly, we provide several different versions of the dataset, downsampled both spatially and temporally. For temporal downsampling, we provide datasets sampled at three frame rates: 10, 5, and 2 Hz. In fact, a large number of existing surveillance cameras operate at 2 Hz or lower, and studies on downsampled data at lower frame rates will provide important insights into this largely open area of surveillance research. Spatial downsampling needs more attention, because vanilla spatial downsampling by fixed ratios does not yield the most useful data. Datasets in which movers exhibit similar amounts of pixel appearance information, i.e., similar pixel heights, are more useful for performance evaluation. Accordingly, we measured the average pixel height of people in each scene and created downsampled versions such that the average downsampled person appears at three consistent pixel heights: 50, 20, and 10 pixels. A set of downsampled examples of a person in the dataset is shown in Fig. 4. After both temporal (3 cases) and spatial (3 cases) downsampling, 9 additional downsampled versions of the videos are provided. The second part of our benchmark dataset includes aerial video datasets. Again, the goal of this aerial dataset is ...
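As a rough illustration of the downsampling scheme described in this excerpt, the Python sketch below derives a spatial scale factor from a measured average person height and a frame stride from a target frame rate. The average height (120 px) and the source frame rate (30 Hz) are illustrative assumptions, not values taken from the VIRAT release.

```python
# Hedged sketch of the downsampling parameters described above.
# The example measurements are assumptions, not VIRAT values.

TARGET_HEIGHTS_PX = (50, 20, 10)   # desired average person heights after resizing
TARGET_RATES_HZ = (10, 5, 2)       # desired frame rates after temporal sampling

def spatial_scale(avg_person_height_px: float, target_height_px: int) -> float:
    """Scale factor that maps the measured average person height to the target height."""
    return target_height_px / avg_person_height_px

def temporal_stride(source_rate_hz: float, target_rate_hz: float) -> int:
    """Keep every k-th frame so the effective frame rate is roughly the target rate."""
    return max(1, round(source_rate_hz / target_rate_hz))

if __name__ == "__main__":
    avg_height = 120.0   # assumed measured average person height in one scene
    src_fps = 30.0       # assumed source frame rate
    for h in TARGET_HEIGHTS_PX:
        for r in TARGET_RATES_HZ:
            print(f"target {h:>3}px @ {r:>2}Hz -> "
                  f"scale={spatial_scale(avg_height, h):.3f}, "
                  f"frame stride={temporal_stride(src_fps, r)}")
```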
Similar publications
This report surveys work done by academia in developing Face Recognition solutions for video-surveillance applications. We present an architecture of a generic system for face recognition in video and review academic systems reported in the academic literature suitable for video-surveillance applications. Recommendations on the selection of systems...
This paper presents a novel method for people counting in crowded scenes that combines the information gathered by multiple cameras to mitigate the problem of occlusion that commonly affects the performance of counting methods using single cameras. The proposed method detects the corner points associated to the people present in the scene and compu...
Recognizing human activity is one of the important areas of computer vision research today. It plays a vital role in constructing intelligent surveillance systems. Despite the efforts in the past decades, recognizing human activities from videos is still a challenging task. Human activity may have different forms ranging from simple actions to comp...
Citations
... This task is critical for various real-world applications, such as sports analytics, where action forecasting [20; 58], strategic and tactical analysis [10; 41], and player performance evaluation [11; 48] depend on a detailed understanding of event sequences. Other examples include industrial inspection [40], crucial for detecting subtle irregularities in high-speed production lines to ensure quality and safety; computer vision in autonomous driving [24], essential for accurate and instantaneous vehicle control and obstacle detection; and surveillance [46], important for the precise identification of abnormal or sudden events to enhance security. However, existing methods and datasets foundational to their development only partially address the F3 scenario. ...
Analyzing Fast, Frequent, and Fine-grained (F3) events presents a significant challenge in video analytics and multi-modal LLMs. Current methods struggle to identify events that satisfy all the F3 criteria with high accuracy due to challenges such as motion blur and subtle visual discrepancies. To advance research in video understanding, we introduce F3Set, a benchmark that consists of video datasets for precise F3 event detection. Datasets in F3Set are characterized by their extensive scale and comprehensive detail, usually encompassing over 1,000 event types with precise timestamps and supporting multi-level granularity. Currently F3Set contains several sports datasets, and this framework may be extended to other applications as well. We evaluated popular temporal action understanding methods on F3Set, revealing substantial challenges for existing techniques. Additionally, we propose a new method, F3ED, for F3 event detection, achieving superior performance. The dataset, model, and benchmark code are available at https://github.com/F3Set/F3Set.
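The F3Set abstract above emphasizes precise timestamps for event detection. As a generic, hedged illustration (not the benchmark's official evaluation code), the sketch below matches predicted event frames to ground-truth frames within a small temporal tolerance and reports precision and recall, which is the usual shape of precise event-spotting metrics; the tolerance value and example frames are assumptions.

```python
# Hedged sketch: tolerance-window evaluation for frame-precise event detection.
# Illustrative only; not the official F3Set evaluation.

def match_events(pred_frames, gt_frames, tolerance=2):
    """Greedily match each ground-truth event to the nearest unused prediction
    within `tolerance` frames; return (true positives, false positives, false negatives)."""
    preds = sorted(pred_frames)
    used = [False] * len(preds)
    tp = 0
    for g in sorted(gt_frames):
        best, best_dist = None, tolerance + 1
        for i, p in enumerate(preds):
            if not used[i] and abs(p - g) <= tolerance and abs(p - g) < best_dist:
                best, best_dist = i, abs(p - g)
        if best is not None:
            used[best] = True
            tp += 1
    fp = len(preds) - tp
    fn = len(gt_frames) - tp
    return tp, fp, fn

tp, fp, fn = match_events(pred_frames=[12, 30, 55], gt_frames=[11, 31, 80], tolerance=2)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(f"TP={tp} FP={fp} FN={fn} P={precision:.2f} R={recall:.2f}")
```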
... Video background subtraction. In this section, we implement LRMC for video background subtraction, employing the VIRAT video dataset [65] as our benchmark dataset. VIRAT comprises a diverse collection of videos characterized by colorful scenes with static backgrounds. ...
Robust matrix completion (RMC) is a widely used machine learning tool that simultaneously tackles two critical issues in low-rank data analysis: missing data entries and extreme outliers. This paper proposes a novel scalable and learnable non-convex approach, coined Learned Robust Matrix Completion (LRMC), for large-scale RMC problems. LRMC enjoys low computational complexity with linear convergence. Motivated by the proposed theorem, the free parameters of LRMC can be effectively learned via deep unfolding to achieve optimum performance. Furthermore, this paper proposes a flexible feedforward-recurrent-mixed neural network framework that extends deep unfolding from fixed-number iterations to infinite iterations. The superior empirical performance of LRMC is verified with extensive experiments against the state of the art on synthetic datasets and real applications, including video background subtraction, ultrasound imaging, face modeling, and cloud removal from satellite imagery.
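The citation above applies LRMC to video background subtraction on VIRAT. As a minimal baseline sketch of the same task, the code below models the background with a plain truncated-SVD low-rank approximation rather than the paper's learned solver; the frame data is synthetic and all names are illustrative.

```python
# Hedged sketch: low-rank background model for background subtraction.
# Uses a plain truncated SVD, not the paper's LRMC method; data is synthetic.
import numpy as np

def lowrank_background(frames, rank=1):
    """Stack vectorized frames as columns, keep the top `rank` singular components
    as the background, and return (background, foreground_residual)."""
    M = np.stack([f.ravel() for f in frames], axis=1).astype(np.float64)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    B = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
    return B, M - B

rng = np.random.default_rng(0)
static = rng.uniform(0, 255, size=(32, 32))              # shared static background
frames = [static + (rng.random((32, 32)) < 0.01) * 200   # sparse synthetic "movers"
          for _ in range(20)]
background, residual = lowrank_background(frames, rank=1)
mask = np.abs(residual) > 50                              # simple foreground mask
print("foreground pixels per frame:", mask.sum(axis=0)[:5])
```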
... When multiple camera views are available, 3D object detection has been studied heavily for autonomous driving [46,146], as well as for infrastructure-based 3D object detection [117,152,181,11]. Many approaches to 3D object detection use object queries [160], bird's-eye view transformations [173], or a combination of the two [90]. ...
We present a survey paper on methods and applications of digital twins (DT) for urban traffic management. While the majority of studies on the DT focus on its "eyes," which are the emerging sensing and perception capabilities such as object detection and tracking, what really distinguishes the DT from a traditional simulator lies in its "brain," the prediction and decision-making capabilities of extracting patterns and making informed decisions from what has been seen and perceived. In order to add value to urban transportation management, DTs need to be powered by artificial intelligence and complemented with low-latency, high-bandwidth sensing and networking technologies. We will first review the DT pipeline leveraging cyber-physical systems and propose our DT architecture deployed on a real-world testbed in New York City. This survey paper can be a pointer to help researchers and practitioners identify challenges and opportunities for the development of DTs; a bridge to initiate conversations across disciplines; and a road map to exploiting the potential of DTs for diverse urban transportation applications.
... The dataset's focus on human actions makes it particularly valuable for human-centered action localization research. Furthermore, MEVA [17] and VIRAT [51] focus on unmanned aerial vehicles and surveillance activity detection. ...
Most existing traffic video datasets, including Waymo, are structured, focusing predominantly on Western traffic, which hinders global applicability. Specifically, most Asian scenarios are far more complex, involving numerous objects with distinct motions and behaviors. Addressing this gap, we present a new dataset, DAVE, designed for evaluating perception methods with high representation of Vulnerable Road Users (VRUs: e.g. pedestrians, animals, motorbikes, and bicycles) in complex and unpredictable environments. DAVE is a manually annotated dataset encompassing 16 diverse actor categories (spanning animals, humans, vehicles, etc.) and 16 action types (complex and rare cases like cut-ins, zigzag movement, U-turns, etc.) that require high reasoning ability. DAVE densely annotates over 13 million bounding boxes (bboxes) of actors with identification, and more than 1.6 million boxes are annotated with both actor identification and action/behavior details. The videos within DAVE are collected based on a broad spectrum of factors, such as weather conditions, the time of day, road scenarios, and traffic density. DAVE can benchmark video tasks like Tracking, Detection, Spatiotemporal Action Localization, Language-Visual Moment Retrieval, and Multi-label Video Action Recognition. Given the critical importance of accurately identifying VRUs to prevent accidents and ensure road safety, vulnerable road users constitute 41.13% of instances in DAVE, compared to 23.71% in Waymo. DAVE provides an invaluable resource for the development of more sensitive and accurate visual perception algorithms in the complex real world. Our experiments show that existing methods suffer degradation in performance when evaluated on DAVE, highlighting its benefit for future video recognition research.
... SOMETHING-SOMETHING V2 [5] is a large video dataset covering basic human activities, comprising 220,847 videos classified into 174 labels. Surveillance-camera datasets like the VIRAT Video Dataset [6] primarily annotate two object classes: people and vehicles. CCTV-Fights [7] contains 1000 CCTV videos collected by the ROSE Lab at Nanyang Technological University, Singapore, for classifying videos of fights captured on CCTV. ...
Optimizing camera information storage is a critical issue due to the increasing data volume and the large number of daily surveillance videos. In this study, we propose a deep learning-based system for efficient data storage. Videos captured by cameras are classified into four categories: no action, normal action, human action, and dangerous action. Videos without action or with normal action are stored temporarily and then deleted to save storage space. Videos with human action are stored for easy retrieval, while videos with dangerous action promptly trigger alerts to users. In the paper, we propose two approaches using deep learning models to address the video classification problem. The first is a two-stage approach in which pretrained CNN models extract features from video frames; these features are then passed through RNN or Transformer models to extract the relationships between them. The goal of this approach is to focus on extracting the features of objects in the video. The proposed models include VGG16 and InceptionV3 combined with LSTM, BiLSTM, Attention, and Vision Transformer. The second approach combines CNN and LSTM layers simultaneously through models like ConvLSTM and LRCN. This approach aims to help the model extract object features and their relationships at the same time, with the goal of reducing model size, accelerating the training process, and increasing object recognition speed when deployed in the system. In Approach 1, we construct and refine network architectures such as VGG16+LSTM, VGG16+Attention+LSTM, VGG16+BiLSTM, VGG16+ViT, InceptionV3+LSTM, InceptionV3+Attention+LSTM, and InceptionV3+BiLSTM. In Approach 2, we build a new network architecture based on the ConvLSTM and LRCN models. The training dataset, collected from real surveillance cameras, comprises 3315 videos labeled into four classes: no action (1018 videos), actions involving people (832 videos), dangerous actions (751 videos), and normal actions (714 videos). Experimental results show that models from Approach 1 exhibit excellent feature extraction capabilities, effectively classifying activities with high accuracy. Notably, the InceptionV3+LSTM and VGG16+ViT models achieve accuracy rates exceeding 93%. Conversely, models from Approach 2 show fast training speeds and lightweight model sizes but struggle with activity classification, resulting in lower accuracy. To meet the system's requirements for high accuracy and real-time performance, we select the InceptionV3+LSTM model for deployment. This model is further fine-tuned to achieve a better model size and training speed compared to the other models in its approach.
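To make the "Approach 1" pattern above concrete, here is a minimal PyTorch sketch of a per-frame CNN feature extractor followed by an LSTM and a four-class head. ResNet-18 stands in for the VGG16/InceptionV3 backbones used in the paper, and all dimensions, class names, and hyperparameters are assumptions.

```python
# Hedged sketch of the CNN-features-then-LSTM pattern ("Approach 1").
# ResNet-18 is a stand-in backbone; dimensions and class names are assumed.
import torch
import torch.nn as nn
from torchvision.models import resnet18

CLASSES = ["no_action", "normal_action", "human_action", "dangerous_action"]

class CnnLstmClassifier(nn.Module):
    def __init__(self, hidden=256, num_classes=len(CLASSES)):
        super().__init__()
        backbone = resnet18(weights=None)          # weights=None avoids a download
        backbone.fc = nn.Identity()                # expose 512-d per-frame features
        self.backbone = backbone
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips):                      # clips: (batch, frames, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1)) # (batch*frames, 512)
        feats = feats.reshape(b, t, -1)
        _, (h_n, _) = self.lstm(feats)             # last hidden state summarizes the clip
        return self.head(h_n[-1])                  # (batch, num_classes)

logits = CnnLstmClassifier()(torch.randn(2, 8, 3, 112, 112))
print(logits.shape)  # torch.Size([2, 4])
```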
... We use the object-name annotations of OpenImages and follow Park et al. [47] to create change captions using predefined templates (see App. C). Spot-the-Diff (STD) [27] has ∼13K image pairs captured from CCTV videos in VIRAT Video dataset [45]. Each image pair contains more than one change and is captioned by humans. ...
Multi-head self-attention (MHSA) is a key component of Transformers, a widely popular architecture in both language and vision. Multiple heads intuitively enable different parallel processes over the same input. Yet, they also obscure the attribution of each input patch to the output of a model. We propose a novel 1-head Transformer Attention Bottleneck (TAB) layer, inserted after the traditional MHSA architecture, to serve as an attention bottleneck for interpretability and intervention. Unlike standard self-attention, TAB constrains the total attention over all patches to [0, 1]. That is, when the total attention is 0, no visual information is propagated further into the network and the vision-language model (VLM) would default to a generic, image-independent response. To demonstrate the advantages of TAB, we train VLMs with TAB to perform image difference captioning. Over three datasets, our models perform similarly to baseline VLMs in captioning, but the bottleneck is superior in localizing changes and in identifying when no changes occur. TAB is the first architecture to enable users to intervene by editing attention, which often produces expected outputs by VLMs.
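One generic way to realize such a total-attention budget of at most 1, sketched below, is to append a constant "null" logit before the softmax so that probability mass can flow to a no-attention slot. This is an illustrative mechanism under that assumption, not necessarily how the TAB layer itself enforces the constraint.

```python
# Hedged sketch: attention over patches whose weights sum to at most 1,
# via an extra null slot. Illustrative only; not the TAB implementation.
import torch
import torch.nn.functional as F

def bounded_attention(query, keys, values):
    """Single-head attention whose patch weights sum to a value in (0, 1)."""
    d = query.shape[-1]
    logits = (keys @ query) / d ** 0.5                  # (num_patches,)
    logits = torch.cat([logits, logits.new_zeros(1)])   # append null slot with logit 0
    weights = F.softmax(logits, dim=-1)[:-1]            # drop the null slot's weight
    return weights @ values, weights.sum()              # pooled value, total attention < 1

q = torch.randn(64)
k, v = torch.randn(49, 64), torch.randn(49, 64)
pooled, total = bounded_attention(q, k, v)
print(float(total))   # strictly less than 1
```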
... Since the NGSIM dataset contains only vehicle trajectories and the ETH and UCY datasets contain only human trajectories, we excluded extra calculations on unseen participants in each comparison. For VIRAT (Oh et al. 2011) and ActEV (Awad et al. 2020), although they include three major event types (including vehicles), we focused only on the annotated pedestrian trajectories to control for variables when comparing under Pishgu. ...
Figure 3: The comparison of all metrics between the datasets collected via AutoSceneGen (blue), ApolloScapes (orange), and the combination of the two datasets (green) across different epochs.
Motion planning is a crucial component in autonomous driving. State-of-the-art motion planners are trained on meticulously curated datasets, which are not only expensive to annotate but also insufficient in capturing rarely seen critical scenarios. Failing to account for such scenarios poses a significant risk to motion planners and may lead to incidents during testing. An intuitive solution is to manually compose such scenarios by programming and executing a simulator (e.g., CARLA). However, this approach incurs substantial human costs. Motivated by this, we propose an inexpensive method for generating diverse critical traffic scenarios to train more robust motion planners. First, we represent traffic scenarios as scripts, which are then used by the simulator to generate traffic scenarios. Next, we develop a method that accepts user-specified text descriptions, which a Large Language Model (LLM) translates into scripts using in-context learning. The output scripts are sent to the simulator that produces the corresponding traffic scenarios. As our method can generate abundant safety-critical traffic scenarios, we use them as synthetic training data for motion planners. To demonstrate the value of generated scenarios, we train existing motion planners on our synthetic data, real-world datasets, and a combination of both. Our experiments show that motion planners trained with our data significantly outperform those trained solely on real-world data, showing the usefulness of our synthetic data and the effectiveness of our data generation method. Our source code is available at https://ezharjan.github.io/AutoSceneGen.
... Drawing upon the UCF101 Action Recognition Dataset [21], we collected video clips portraying individuals performing everyday actions such as walking, sitting, shaking hands, eating, and children playing. To enhance the diversity of non-violent activities, we further integrated clips from the VIRAT Video Dataset [22] and NTU RGB+D Dataset [23,24]. This meticulous curation resulted in a compilation of 1206 video clips, ensuring a comprehensive coverage of routine actions typically encountered in surveillance settings. ...
Public transportation systems play a vital role in modern cities, but they face growing security challenges, particularly related to incidents of violence. Detecting and responding to violence in real time is crucial for ensuring passenger safety and the smooth operation of these transport networks. To address this issue, we propose an advanced artificial intelligence (AI) solution for identifying unsafe behaviours in public transport. The proposed approach employs deep learning action recognition models and utilises technologies like the NVIDIA DeepStream SDK, Amazon Web Services (AWS) DirectConnect, a local edge computing server, ONNXRuntime and MQTT to accelerate the end-to-end pipeline. The solution captures video streams from remote train stations' closed-circuit television (CCTV) networks, processes the data in the cloud, applies the action recognition model, and transmits the results to a live web application. A temporal pyramid network (TPN) action recognition model was trained on a newly curated video dataset mixing open-source resources and live simulated trials to identify the unsafe behaviours. The base model achieved a validation accuracy of 93% when trained on open-source dataset samples, which improved to 97% when the live simulated dataset was included during training. The developed AI system was deployed at Wollongong Train Station (NSW, Australia) and showcased impressive accuracy in detecting violence incidents during an 8-week test period, achieving a false-positive (FP) rate of 23%. While the AI correctly identified 30 true-positive incidents, there were 6 false negatives (FNs) in which violence incidents were missed during rainy weather, suggesting that the training dataset needs more data related to bad weather. The AI model's continuous retraining capability ensures its adaptability to various real-world scenarios, making it a valuable tool for enhancing safety and the overall passenger experience in public transport settings.
... Subsequently, features are extracted for the purpose of validation. The VIRAT Video Dataset Release 2.0_VIRAT Ground [19] is an extensive collection of videos specifically gathered for the purpose of doing research on object detection, object tracking, and event recognition. The VIRAT Ground Dataset is a smaller portion of the larger VIRAT Video Dataset. ...
... The dataset contains annotations for events, which are valuable for the development and evaluation of computer vision algorithms [19,20]. The VIRAT Video Dataset Release 2.0_VIRAT Ground Dataset contains annotated video sequences with precise annotations for object detection, tracking, and event recognition. ...
Surveillance video processing requires high efficiency; given its large datasets, it demands significant resources for timely and effective analysis. This study aims to enhance surveillance systems by developing an automated method for extracting key events from outdoor surveillance videos. The proposed model comprises four phases: preprocessing, feature extraction, training and testing, and validation. The videos are pre-processed before a convolutional neural network approach is used to extract features from them. Event classification uses gated recurrent units. In validation, motion and objects are extracted, followed by feature extraction. Results show satisfactory performance, achieving 79% accuracy in event classification and highlighting the effectiveness of the methodology in identifying significant outdoor events.
... In the domain of video-based SOD, however, identifying a widely accepted baseline dataset is challenging. Unfortunately, many popular datasets such as UAV123 [42], VIRATground [43], Visdrone [44], and small90 and small112 [45] fail to provide sufficient examples of objects with an area of under 100 pixels. Even though these very small objects may be present in the video material, they are often not annotated. ...
... For this experiment, we used two large-scale public datasets: the VIRAT-Ground Dataset [43], a dataset specifically designed for event recognition in surveillance videos, and the VisDrone-VID dataset [44], a dataset focused on drone-captured imagery. From each dataset, 2500 samples were extracted to be used during training. ...
Deep learning has become the preferred method for automated object detection, but the accurate detection of small objects remains a challenge due to the lack of distinctive appearance features. Most deep learning-based detectors do not exploit the temporal information that is available in video, even though this context is often essential when the signal-to-noise ratio is low. In addition, model development choices, such as the loss function, are typically designed around medium-sized objects. Moreover, most datasets that are acquired for the development of small object detectors are task-specific and lack diversity, and the smallest objects are often not well annotated. In this study, we address the aforementioned challenges and create a deep learning-based pipeline for versatile small object detection. With an in-house dataset consisting of civilian and military objects, we achieve a substantial improvement in YOLOv8 (baseline mAP = 0.465) by leveraging the temporal context in video and data augmentations specifically tailored to small objects (mAP = 0.839). We also show the benefit of having a carefully curated dataset in comparison with public datasets and find that a model trained on a diverse dataset outperforms environment-specific models. Our findings indicate that small objects can be detected accurately in a wide range of environments while leveraging the speed of the YOLO architecture.