Figure - available from: Sustainability
Overview of Vision Transformer-assisted safety assessment scenario extraction.


Source publication
Article
Full-text available
Automated Vehicles (AVs) are attracting attention as a safer mobility option thanks to the recent advancement of various sensing technologies that realize a much quicker Perception–Reaction Time than Human-Driven Vehicles (HVs). However, AVs are not entirely free from the risk of accidents, and we currently lack a systematic and reliable method to...

Citations

... In Vision-TAA systems, CNNs are used to extract spatial features from video frames and analyze sequential frames to detect accident indicators. For instance, 3D-CNN captures spatiotemporal information from video streams to predict potential traffic accidents [29,30]. ...
Preprint
Full-text available
Traffic accident prediction and detection are critical for enhancing road safety, and vision-based traffic accident anticipation (Vision-TAA) has emerged as a promising approach in the era of deep learning. This paper reviews 147 recent studies, focusing on the application of supervised, unsupervised, and hybrid deep learning models for accident prediction, alongside the use of real-world and synthetic datasets. Current methodologies are categorized into four key approaches: image and video feature-based prediction, spatiotemporal feature-based prediction, scene understanding, and multimodal data fusion. While these methods demonstrate significant potential, challenges such as data scarcity, limited generalization to complex scenarios, and real-time performance constraints remain prevalent. This review highlights opportunities for future research, including the integration of multimodal data fusion, self-supervised learning, and Transformer-based architectures to enhance prediction accuracy and scalability. By synthesizing existing advancements and identifying critical gaps, this paper provides a foundational reference for developing robust and adaptive Vision-TAA systems, contributing to road safety and traffic management.
... A sequence-to-sequence model called Transformer has significantly improved video captioning, video prediction, and natural language processing (NLP). Transformers process the complete input data sequence in a single operation, as opposed to CNNs, which process the input sequences sequentially [14,15]. In order to improve feature learning, Vision Transformer-based anomaly detection is suggested in this study in terms of ZigZag path learning. ...
... The LSTMDT model has shown better performance compared to traditional machine learning classifiers for traffic accidents. In their study, Kang et al. [48] introduced the Vision Transformer Traffic Accident (ViT-TA) classifier, which analyzes traffic accidents using first-person video footage to enhance the safety of autonomous vehicles. However, several issues have been found. ...
Article
Full-text available
Detection of anomalies in video surveillance plays a key role in ensuring the safety and security of public spaces. The number of surveillance cameras is growing, making it harder to monitor them manually. So, automated systems are needed. This change increases the demand for automated systems that detect abnormal events or anomalies, such as road accidents, fighting, snatching, car fires, and explosions in real-time. These systems improve detection accuracy, minimize human error, and make security operations more efficient. In this study, we proposed the Composite Recurrent Bi-Attention (CRBA) model for detecting anomalies in surveillance videos. The CRBA model combines DenseNet201 for robust spatial feature extraction with BiLSTM networks that capture temporal dependencies across video frames. A multi-attention mechanism was also incorporated to direct the model’s focus to critical spatiotemporal regions. This improves the system’s ability to distinguish between normal and abnormal behaviors. By integrating these methodologies, the CRBA model improves the detection and classification of anomalies in surveillance videos, effectively addressing both spatial and temporal challenges. Experimental assessments demonstrate that the CRBA model achieves high accuracy on both the University of Central Florida (UCF) and the newly developed Road Anomaly Dataset (RAD). This model enhances detection accuracy while also improving resource efficiency and minimizing response times in critical situations. These advantages make it an invaluable tool for public safety and security operations, where rapid and accurate responses are needed for maintaining safety.
... Some applications of ViTs include detecting rain and road surface conditions [90], predicting pedestrian crossing intentions [91], identifying critical traffic moments [92], and detecting unusual traffic scenarios [93]. ...
... ViT-TA [92] is a custom ViT that achieves 94% accuracy in detecting critical moments at Time-To-Collision (TTC) ≤ 1s on the Dashcam Accident Dataset (DAD). It classifies critical traffic situations and uses attention maps to highlight probable causes, systematically enhancing automated vehicle safety by generating reliable safety scenarios. ...
... used a learning rate of 0.001 decayed by 0.1 per epoch, momentum of 0.9, weight decay of 0.0005, 8000 max_batches, and batch size of 32. [123] used a learning rate of 0.01, final rate of 0.2, momentum of 0.937, weight decay of 0.0005, 110 epochs, and batch size of 12. [124] utilized a learning rate of 0.0002, Adam optimizer with β1=0.5, β2=0.999, and batch size of 32. Among ViT models, [90] used 16x16 patches, 768 embedding dimension, 12 layers, 12 attention heads, Adam optimizer with a 0.001 learning rate, and 100 epochs. [92] used 32x32 patches, 12 attention heads, 768 embedding dimension, and Rectified Adam, training for up to 1000 epochs with early stopping. [93] ...
Article
Full-text available
This review paper presents an in-depth analysis of deep learning (DL) models applied to traffic scene understanding, a key aspect of modern intelligent transportation systems. It examines fundamental techniques such as classification, object detection, and segmentation, and extends to more advanced applications like action recognition, object tracking, path prediction, scene generation and retrieval, anomaly detection, Image-to-Image Translation (I2IT), and person re-identification (Person Re-ID). The paper synthesizes insights from a broad range of studies, tracing the evolution from traditional image processing methods to sophisticated DL techniques, such as Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs). The review also explores three primary categories of domain adaptation (DA) methods: clustering-based, discrepancy-based, and adversarial-based, highlighting their significance in traffic scene understanding. The significance of Hyperparameter Optimization (HPO) is also discussed, emphasizing its critical role in enhancing model performance and efficiency, particularly in adapting DL models for practical, real-world use. Special focus is given to the integration of these models in real-world applications, including autonomous driving, traffic management, and pedestrian safety. The review also addresses key challenges in traffic scene understanding, such as occlusions, the dynamic nature of urban traffic, and environmental complexities like varying weather and lighting conditions. By critically analyzing current technologies, the paper identifies limitations in existing research and proposes areas for future exploration. It underscores the need for improved interpretability, real-time processing, and the integration of multi-modal data. This review serves as a valuable resource for researchers and practitioners aiming to apply or advance DL techniques in traffic scene understanding.
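The excerpts for this citing review quote a patch size, embedding dimension, and head count for the ViT models, plus a criticality threshold of TTC ≤ 1 s. A minimal sketch of the arithmetic behind those numbers follows. The 224-pixel square input, the 20 fps frame rate, the accident frame index, and all function names are assumptions made for illustration and are not taken from the cited papers. The sketch also assumes that TTC means the time remaining until an annotated accident frame, which is one common reading in dashcam datasets and may differ from the labeling actually used in [92].

```python
from typing import List


def vit_token_count(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches a ViT cuts from a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


def head_dim(embed_dim: int, num_heads: int) -> int:
    """Per-head width when the embedding is split evenly across attention heads."""
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    return embed_dim // num_heads


def critical_frame_mask(n_frames: int, accident_frame: int, fps: float,
                        ttc_threshold_s: float = 1.0) -> List[bool]:
    """Mark frames whose time to the annotated accident frame is within the threshold.

    Assumption: TTC for frame i is (accident_frame - i) / fps; frames after the
    accident frame are not marked.
    """
    mask = []
    for i in range(n_frames):
        ttc = (accident_frame - i) / fps
        mask.append(0.0 <= ttc <= ttc_threshold_s)
    return mask


# Hypothetical 224x224 input with the patch sizes quoted for [90] and [92].
tokens_16 = vit_token_count(224, 16)
tokens_32 = vit_token_count(224, 32)
per_head = head_dim(768, 12)

# Hypothetical 100-frame clip at 20 fps with the accident annotated at frame 90.
mask = critical_frame_mask(n_frames=100, accident_frame=90, fps=20.0)
print(tokens_16, tokens_32, per_head, sum(mask))
```

Under these assumptions, larger patches give a much shorter token sequence for the same image, which is the practical trade-off between the 16x16 and 32x32 settings quoted in the excerpt.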
... Kang et al. [183] proposed a visual transformer to predict collisions supplemented by attention maps. Subsequently, a time series of attention maps is further analysed to identify spatiotemporal characteristics and based on the situation interpretation, accident scenarios for safety assessment are extracted. ...
Article
Full-text available
Artificial Intelligence (AI) shows promising applications for the perception and planning tasks in autonomous driving (AD) due to its superior performance compared to conventional methods. However, highly complex AI systems exacerbate the existing challenge of safety assurance of AD. One way to mitigate this challenge is to utilize explainable AI (XAI) techniques. To this end, we present the first comprehensive systematic literature review of explainable methods for safe and trustworthy AD. We begin by analyzing the requirements for AI in the context of AD, focusing on three key aspects: data, model, and agency. We find that XAI is fundamental to meeting these requirements. Based on this, we explain the sources of explanations in AI and describe a taxonomy of XAI. We then identify five key contributions of XAI for safe and trustworthy AI in AD, which are interpretable design, interpretable surrogate models, interpretable monitoring, auxiliary explanations, and interpretable validation. Finally, we propose a conceptual modular framework called SafeX to integrate the reviewed methods, enabling explanation delivery to users while simultaneously ensuring the safety of AI models.
... In the transportation sector, [28] presents a novel approach using vision transformers and semantic reasoning to detect critical safety situations in traffic accidents. The proposed technique classifies and highlights objects posing threats to automated vehicles, generating functional scenarios for security testing. ...
Article
Full-text available
Industrial safety aims to prevent and mitigate accidents, injuries, and damage in the workplace. One common approach to identifying and analyzing potentially risky situations involves using static cameras to record images or videos of facilities and production processes. However, current state-of-the-art deep learning-based solutions require extensive labeled datasets and substantial computational power to detect these dangerous situations. To address these limitations, this paper presents DINOFSAFE, a methodology that combines dense optical flow and the DINOv2 model, a vision transformer that learns universal visual features without supervision. Optical flow estimates the apparent motion of objects in the input video streams, while the DINOv2 model generates high-dimensional universal representations capturing their visual properties. Our methodology is doubly effective, as it minimizes the manual labeling work needed for training the models while also being computationally efficient. Using these representations, we train simple linear classifiers to identify moving objects and categorize them. This information aids in identifying and preventing hazardous scenarios in industrial settings, such as pedestrians crossing paths with forklifts, forklifts approaching dangerous areas, loads falling from forklifts, and similar situations. We tested our solution on real videos from industrial scenarios, achieving promising results. Additionally, we compiled and made publicly available a dataset of approximately 6500 images, referred to as IndustrialDetectionStaticCameras.
... This study examines the functioning of traffic prediction models using explainable artificial intelligence (XAI) technologies, where they have investigated the impact of features proposed in the literature on various types of traffic, sampling rates and ML algorithms. Studies of applied XAI techniques are now starting to surface in traffic management [26] and accident analysis [27,28]. In particular, nearby objects causing the ego vehicles to fall into near-crash situations were identified through attention-score maps returned by Vision Transformer [29]. ...
Article
Full-text available
Traffic speed prediction is an important topic in Intelligent Transport Systems (ITS). Although traffic speed prediction for real-life applications is burgeoning, the study of explaining and interpreting AI-based speed prediction is still in its initial stage. In this paper, we applied multiple advanced regression techniques, such as XGBoost and CatBoost optimized gradient boosting, Random Forest, and LASSO to predict traffic speed more accurately in the subsequent time windows. The experiment with prediction methods was conducted using the traffic speed data of the Seoul metropolitan road network. Each road segment represented as a node in the network is associated with neighboring roads within a configurable range. We picked heavily congested nodes as prediction targets. Then, we evaluated nearby road influences to determine critical contributions to the situation of the target nodes. We interpreted the model’s output and extracted the topmost influential neighboring nodes by using an ensemble of explainable artificial intelligence (XAI) techniques such as feature importance assessment using the GINI entropy function, Recursive Feature Elimination, Shapley Additive Explanations, and a method of measuring the impact of masked nodes. We validated the XAI interpretations through traffic flow simulation by tuning the topmost influential nearby roads’ speed and observing the effect on the roads’ traffic congestion relief correspondingly. We also proved our solution through local explanation techniques such as Local Interpretable Model-Agnostic Explanations. Our methods are applicable to any transport network and open the door to new strategies for controlling the specific nearby roads for effective congestion relief.
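The abstract above mentions measuring the impact of masked nodes without saying how. One generic way to do this is sketched below: each input feature (standing in for a neighboring road's speed) is replaced by its column mean, and the mean absolute change in the prediction is recorded. This is an illustrative assumption and not the authors' procedure. The toy linear model, its weights, the data rows, and every name in the code are hypothetical.

```python
from statistics import mean
from typing import Callable, List, Sequence


def masking_impact(predict: Callable[[Sequence[float]], float],
                   rows: List[List[float]]) -> List[float]:
    """Mean absolute prediction change when each feature is masked by its column mean."""
    n_features = len(rows[0])
    col_means = [mean(r[j] for r in rows) for j in range(n_features)]
    baseline = [predict(r) for r in rows]
    impacts = []
    for j in range(n_features):
        # Replace only feature j with its mean; keep all other features unchanged.
        masked_preds = [
            predict([col_means[j] if k == j else v for k, v in enumerate(r)])
            for r in rows
        ]
        impacts.append(mean(abs(b - m) for b, m in zip(baseline, masked_preds)))
    return impacts


def toy_speed_model(features: Sequence[float]) -> float:
    """Hypothetical stand-in for a trained regressor; the weights are made up."""
    weights = [2.0, 0.0, -1.0]
    return sum(w * x for w, x in zip(weights, features))


# Made-up speeds of three neighboring road segments over three time windows.
rows = [[1.0, 5.0, 2.0], [3.0, 1.0, 4.0], [5.0, 3.0, 6.0]]
impacts = masking_impact(toy_speed_model, rows)
# Neighbor indices ordered from most to least influential under this measure.
ranking = sorted(range(len(impacts)), key=lambda j: impacts[j], reverse=True)
print(impacts, ranking)
```

In this toy setup the second neighbor has zero weight, so masking it leaves every prediction unchanged and it ranks last.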
... As a result, the importance of explainable artificial intelligence (XAI) is growing, and various methodologies are being studied. Studies of applied XAI techniques are now starting to surface in traffic management [20] and accident analysis [21,22]. In particular, nearby objects causing the ego vehicles to fall into near-crash situations were identified through attention-score maps returned by Vision Transformer [?]. ...
Preprint
Full-text available
Traffic speed prediction is an important topic in Intelligent Transport Systems (ITS). Although traffic speed prediction for real-life applications is burgeoning, the study of explaining and interpreting AI-based speed prediction is still in its initial stage. In this paper, we applied multiple advanced regression techniques such as XGBoost and CatBoost optimized gradient boosting, Random Forest, and LASSO to predict traffic speed more accurately in the subsequent time windows. The experiment with prediction methods was conducted using the traffic speed data of the Seoul metropolitan road network. Each road segment represented as a node in the network is associated with neighboring roads within a configurable range. We picked heavily congested nodes as prediction targets. Then, we evaluated nearby road influences to determine critical contributions to the situation of the target nodes. We interpreted the model's output and extracted the topmost influential neighboring nodes by using an ensemble of explainable artificial intelligence (XAI) techniques such as feature importance assessment using the GINI entropy function, Recursive Feature Elimination, Shapley Additive Explanations, and a method of measuring the impact of masked nodes. We validated the XAI interpretations through traffic flow simulation by tuning the topmost influential nearby roads' speed and observing the effect on the roads' traffic congestion relief correspondingly. We also proved our solution through local explanation techniques such as Local Interpretable Model-Agnostic Explanations. Our methods are applicable to any transport network and open the door to new strategies for controlling the specific nearby roads for effective congestion relief.
... K-means clustering and autoencoders have been used to extract hidden information from traffic accident data and to perform accident hotspot identification [40][41][42]. Transfer learning and transformer techniques have shown potential in traffic accident risk prediction and detection [43][44][45][46]. These research methodologies not only demonstrate the diversity and intricacy of data analytics within the realm of traffic safety, but also highlight potential limitations and chart out future research trajectories for the application of these techniques in real-world traffic scenarios. ...
Article
Full-text available
This study aims to improve the accuracy of predicting the severity of traffic accidents by developing an innovative traffic accident risk prediction model—StackTrafficRiskPrediction. The model combines multidimensional data analysis including environmental factors, human factors, roadway characteristics, and accident-related meta-features. In the model comparison, the StackTrafficRiskPrediction model achieves an accuracy of 0.9613, 0.9069, and 0.7508 in predicting fatal, serious, and minor accidents, respectively, which significantly outperforms the traditional logistic regression model. In the experimental part, we analyzed the severity of traffic accidents under different age groups of drivers, driving experience, road conditions, light and weather conditions. The results showed that drivers between 31 and 50 years of age with 2 to 5 years of driving experience were more likely to be involved in serious crashes. In addition, it was found that drivers tend to adopt a more cautious driving style in poor road and weather conditions, which increases the margin of safety. In terms of model evaluation, the StackTrafficRiskPrediction model performs best in terms of accuracy, recall, and ROC–AUC values, but performs poorly in predicting small-sample categories. Our study also revealed limitations of the current methodology, such as the sample imbalance problem and the limitations of environmental and human factors in the study. Future research can overcome these limitations by collecting more diverse data, exploring a wider range of influencing factors, and applying more advanced data analysis techniques.
... As pSTL provides a way to construct formulas that describe the relationships between spatial and temporal properties of a signal, the formally-specifiable outcome can be obtained by configuring the parameters, allowing proactively generating various desired behaviour of an agent for testing AVs. Kang et al. [102] proposed a visual transformer to predict collisions supplemented by attention maps. Subsequently, a time series of attention maps is further analysed to identify spatiotemporal characteristics and based on the situation interpretation, accident scenarios for safety assessment are extracted. ...
... Fang et al. [53]; Hacker et al. [54]; Keser et al. [55]; Kronenberger et al. [56]
Planning & Prediction: Bao et al. [57]; Chen et al. [58]; Franco and Bezzo [59]; Gall and Bezzo [60]; Gilpin et al. [61]; Gorospe et al. [62]; Karim et al. [63]; Nahata et al. [64]; Schmidt et al. [65]
Auxiliary Explanations
Perception: Abukmeil et al. [66]; Gou et al. [67]; Haedecke et al. [68]; Kolekar et al. [69]; Mankodiya et al. [70]; Nowak et al. [71]; Saravanarajan et al. [72]; Schinagl et al. [73]; Schorr et al. [74]; Wang et al. [75]
Planning & Prediction: Jiang et al. [76]; Kochakarn et al. [77]; Liu et al. [78]; Mishra et al. [79]; Shao et al. [80]; Teng et al. [81]; Wang et al. [82]; Yu et al. [83]
End-to-End: Aksoy and Yazici [84]; Chitta et al. [85]; Cultrera et al. [86]; Chen et al. [87]; Dong et al. [88]; Feng et al. [89]; Kim et al. [90]; Kühn et al. [91]; Mori et al. [92]; Tashiro et al. [93]; Xu et al. [94]; Yang et al. [95]; Wang et al. [96]; Wei et al. [97]; Zhang et al. [98]; Zhang et al. [99]
Interpretable Safety Validation: Corso and Kochenderfer [100]; DeCastro et al. [101]; Kang et al. [102]; Li et al. [103] ...
Preprint
Full-text available
Artificial Intelligence (AI) shows promising applications for the perception and planning tasks in autonomous driving (AD) due to its superior performance compared to conventional methods. However, inscrutable AI systems exacerbate the existing challenge of safety assurance of AD. One way to mitigate this challenge is to utilize explainable AI (XAI) techniques. To this end, we present the first comprehensive systematic literature review of explainable methods for safe and trustworthy AD. We begin by analyzing the requirements for AI in the context of AD, focusing on three key aspects: data, model, and agency. We find that XAI is fundamental to meeting these requirements. Based on this, we explain the sources of explanations in AI and describe a taxonomy of XAI. We then identify five key contributions of XAI for safe and trustworthy AI in AD, which are interpretable design, interpretable surrogate models, interpretable monitoring, auxiliary explanations, and interpretable validation. Finally, we propose a modular framework called SafeX to integrate these contributions, enabling explanation delivery to users while simultaneously ensuring the safety of AI models.