Chapter

Enabling Pedestrian Safety Using Computer Vision Techniques: A Case Study of the 2018 Uber Inc. Self-driving Car Crash


Abstract

Human lives are important. The decision to allow self-driving vehicles to operate on our roads carries great weight. This has been a hot topic of debate among policy-makers, technologists, and public safety institutions. The recent Uber Inc. self-driving car crash, resulting in the death of a pedestrian, has strengthened the argument that autonomous vehicle technology is still not ready for deployment on public roads. In this work, we analyze the Uber car crash and shed light on the question, “Could the Uber Car Crash have been avoided?”. We apply state-of-the-art Computer Vision models to this highly practical scenario. More generally, our experimental results are an evaluation of various image enhancement and object recognition techniques for enabling pedestrian safety in low-lighting conditions, using the Uber crash as a case study.


... On March 18, 2018, Uber's autonomous car hit and killed a 49-year-old woman as she was walking her bike across the street at night in Tempe, Arizona [48,49]. With its 360-degree cameras and sensors, the car should have been able to detect someone crossing in front of it, even at night. ...
... With all the sensors, including cameras, on today's autonomous cars, one could be fooled into thinking these cars can truly operate autonomously. What is really missing is the ability to handle unfamiliar scenes or combinations of objects that these autonomous cars have not been trained on or cannot interpret correctly [48], as well as real-time processing capability. There is no doubt that the ways in which computer vision techniques have been used in these systems are novel, but we need to note that the systems are not yet perfect [38]. ...
... Compared with the human recognition system, it would be odd to find a sound and healthy individual unable to distinguish between two different items such as an apple and a banana. The reported accidents [48][49][50] related to self-driving vehicles go a long way toward confirming that current deep convolutional networks cannot handle complex situations, and toward the intriguing fact that they can easily be fooled into misinterpreting unrecognisable objects with high confidence [38]. ...
Article
Full-text available
Advances in machine learning, coupled with the abundance of training data, have facilitated the deep learning era, which has demonstrated its ability and effectiveness in solving complex detection and recognition problems. In general, application areas with elements of machine learning have seen exponential growth, with promising new and sophisticated solutions to complex learning problems. In computer vision, the challenge of detecting known objects in a scene is a thing of the past. Despite detection accuracies that in some cases approach human performance, several areas of computer vision and machine learning still lag behind, and improvements may call for new architectural designs. In this paper, we propose a physiologically inspired model for scene understanding that encodes three key components: object location, size, and category. Our aim is to develop an energy-efficient artificial intelligence model for naturalistic scene understanding capable of being deployed on low-power neuromorphic hardware. We review recent advances in deep learning architectures that have taken inspiration from human or primate learning systems and provide directions for future advances in deep learning informed by physiological experiments. Following a review of areas that have benefited from deep learning, we provide recommendations for enhancing those areas that may have stalled or ground to a halt with little or no significant improvement.
... The utilization of a safety belt is highly crucial in the mitigation of severe injuries and the substantial reduction in the probability of fatalities [1][2][3][4]. According to the research conducted by Kohli and Chadha, it has been estimated that a total of 3287 individuals perish on a daily basis as a result of accidents [5]. Leland et al. [6] also submit that the estimated annual number of fatalities resulting from automobile accidents is approximately 1.35 million. ...
... Scholars have addressed the issue of safety belt compliance through two distinct approaches. The first approach involves the installation of cameras along highways to monitor the usage of safety belts [3][4][5][6][7]. However, the installed cameras must take into account that computer vision models come in two modes (single-stage and two-stage), and single-stage models are the ones recommended for real-time detection [9,10]. ...
... The primary objective of this study was to assess the efficacy of the YOLOv7 algorithm in detecting the presence of a safety belt. According to Reference [5], the failure to utilize safety belts leads to a significant number of fatalities, with a daily death toll of 3285 and an annual toll of 1.35 million. The investigation of safety belt buckling is a significant area of research due to the potential for grave injuries and fatalities resulting from the act of unbuckling [1][2][3][4]. ...
Article
Full-text available
Using a safety belt is crucial for preventing severe injuries and fatalities during vehicle accidents. In this paper, we propose a real-time vehicle occupant safety belt detection system based on the YOLOv7 (You Only Look Once version seven) object detection algorithm. The proposed approach aims to automatically detect whether the occupants of a vehicle have buckled their safety belts or not as soon as they are detected within the vehicle. A dataset for this purpose was collected and annotated for validation and testing. By leveraging the efficiency and accuracy of YOLOv7, we achieve near-instantaneous analysis of video streams, making our system suitable for deployment in various surveillance and automotive safety applications. This paper outlines a comprehensive methodology for training the YOLOv7 model using the labelImg tool to annotate the dataset with images showing vehicle occupants. It also discusses the challenges of detecting seat belts and evaluates the system’s performance on a real-world dataset. The evaluation focuses on distinguishing the status of a safety belt between two classes: “buckled” and “unbuckled”. The results demonstrate a high level of accuracy, with a mean average precision (mAP) of 99.6% and an F1 score of 98%, indicating the system’s effectiveness in identifying the safety belt status.
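The abstract above describes a single-stage, two-class ("buckled"/"unbuckled") detection pipeline. The following is a minimal, hypothetical sketch of that kind of workflow using the Ultralytics YOLO Python API as a readily available stand-in for the YOLOv7 implementation used by the authors; the dataset file `seatbelt.yaml`, the pretrained weights, and all parameter values are assumptions made purely for illustration.

```python
# Hypothetical sketch: two-class safety-belt detection with an off-the-shelf
# YOLO API (Ultralytics), standing in for the YOLOv7 model used in the paper.
from ultralytics import YOLO

# Assumed dataset config listing the two classes: 0 = buckled, 1 = unbuckled.
DATA_CFG = "seatbelt.yaml"

def train_detector():
    model = YOLO("yolov8n.pt")          # small pretrained backbone (assumption)
    model.train(data=DATA_CFG, epochs=100, imgsz=640)
    return model

def detect_on_stream(model, source="cabin_camera.mp4"):
    # stream=True yields per-frame results, suitable for near-real-time use.
    for result in model.predict(source=source, stream=True, conf=0.5):
        for box in result.boxes:
            cls_name = result.names[int(box.cls)]
            print(cls_name, float(box.conf), box.xyxy.tolist())

if __name__ == "__main__":
    detector = train_detector()
    detect_on_stream(detector)
```

A single-stage detector of this kind is attractive here precisely because one forward pass per frame keeps latency low enough for in-cabin, near-real-time monitoring.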
... For instance, in autonomous driving, it is impractical and almost impossible to ensure that the input data strictly satisfies the i.i.d. assumption, as unknown situations may arise [1]. In such scenarios, it is essential that models generalize well to unseen data by being independent of biased features. ...
... In this work, we introduce Invariance Pair-Guided learning (IPG) that incorporates the invariances during training. In order to guide the neural network, we extend the standard gradient descent-based approach with an additional corrective step, the corrective gradient inspired by van Baelen [15]. The corrective gradient is specified by pairs of input data, the invariance pairs, which define the desired invariance properties of the model. ...
Preprint
Full-text available
Out-of-distribution generalization of machine learning models remains challenging since the models are inherently bound to the training data distribution. This especially manifests, when the learned models rely on spurious correlations. Most of the existing approaches apply data manipulation, representation learning, or learning strategies to achieve generalizable models. Unfortunately, these approaches usually require multiple training domains, group labels, specialized augmentation, or pre-processing to reach generalizable models. We propose a novel approach that addresses these limitations by providing a technique to guide the neural network through the training phase. We first establish input pairs, representing the spurious attribute and describing the invariance, a characteristic that should not affect the outcome of the model. Based on these pairs, we form a corrective gradient complementing the traditional gradient descent approach. We further make this correction mechanism adaptive based on a predefined invariance condition. Experiments on ColoredMNIST, Waterbird-100, and CelebA datasets demonstrate the effectiveness of our approach and the robustness to group shifts.
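Because the excerpt and abstract above describe the corrective gradient only in words, a brief PyTorch-style sketch may help: under the assumption that an invariance pair (x_a, x_b) should produce the same model output, the ordinary task gradient is complemented by the gradient of a pairwise consistency penalty, applied only when the pair violates the invariance beyond a tolerance. The weighting and the gating condition are illustrative choices, not the authors' exact IPG formulation.

```python
# Minimal sketch of invariance-pair-guided training (illustrative, not the
# authors' exact IPG formulation): a corrective gradient computed from an
# invariance pair is added to the ordinary task gradient.
import torch
import torch.nn.functional as F

def ipg_step(model, optimizer, x, y, x_a, x_b, corr_weight=1.0, tol=1e-3):
    optimizer.zero_grad()

    # 1) Standard task loss and its gradient.
    task_loss = F.cross_entropy(model(x), y)
    task_loss.backward()

    # 2) Corrective gradient: penalize output differences on the invariance
    #    pair (x_a, x_b), which should not change the model's prediction.
    inv_gap = F.mse_loss(model(x_a), model(x_b))
    if inv_gap.item() > tol:                 # adaptive gating (assumed condition)
        (corr_weight * inv_gap).backward()   # gradients accumulate with step 1

    optimizer.step()
    return task_loss.item(), inv_gap.item()
```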
... One example is the Equifax data breach, in which private data records of millions of users were compromised (Zou and Schaub, 2018). Additionally, when thinking of self-driving cars, unreliable AI-based systems could even cause physical harm, as demonstrated by the unfortunate Uber car crash in 2018, in which a malfunctioning algorithm failed to detect a pedestrian on the road and, as a consequence, killed her (Kohli and Chadha, 2020). As a consequence, there has been an increase in public and academic discussions about the essential requirements AI systems must fulfill to be considered trustworthy. ...
... Finally, principle 4 (prevention of harm) should ensure that the AI system does not cause any harm to humans, society, or the environment. For example, it should be prevented that AI-based systems harm or, even worse, kill humans, which unfortunately has happened, e.g., in the aforementioned Uber car crash in 2018, in which a pedestrian was killed by a malfunctioning self-driving car (Kohli and Chadha, 2020). This principle includes a range of requirements, including technical and legal concerns. ...
Article
Full-text available
Artificial intelligence (AI) technologies (re-)shape modern life, driving innovation in a wide range of sectors. However, some AI systems have yielded unexpected or undesirable outcomes or have been used in questionable manners. As a result, there has been a surge in public and academic discussions about aspects that AI systems must fulfill to be considered trustworthy. In this paper, we synthesize existing conceptualizations of trustworthy AI along six requirements: (1) human agency and oversight, (2) fairness and non-discrimination, (3) transparency and explainability, (4) robustness and accuracy, (5) privacy and security, and (6) accountability. For each one, we provide a definition, describe how it can be established and evaluated, and discuss requirement-specific research challenges. Finally, we conclude this analysis by identifying overarching research challenges across the requirements with respect to (1) interdisciplinary research, (2) conceptual clarity, (3) context-dependency, (4) dynamics in evolving systems, and (5) investigations in real-world contexts. Thus, this paper synthesizes and consolidates a wide-ranging and active discussion currently taking place in various academic sub-communities and public forums. It aims to serve as a reference for a broad audience and as a basis for future research directions.
... Research on the use of autonomous cars and their advantages and disadvantages is available (Mandal et al., 2020). For example, in 2018, there were grave worries regarding the safety of autonomous vehicles after an Uber self-driving car was involved in a tragic collision in Arizona (Kohli & Chadha, 2018). The successful 2020 debut of Waymo's fully autonomous ride-hailing service in Phoenix, Arizona, illustrated how autonomous vehicles could revolutionize urban transportation (Waymo, 2020). ...
Thesis
Full-text available
The rapid improvement of autonomous vehicle technology has marked a transformative shift in the automotive industry, promising efficiency, mobility, and safety. However, safety concerns remain in comparison with traditional human-driven vehicles. This research presents a comparative analysis of the risks associated with self-driving vehicles and human-driven vehicles, addressing accident frequency, the limitations of the technology, and public acceptance. Using a quantitative methodology and publicly available data, the study evaluates the extent to which self-driving cars enhance or compromise road safety. The research uses statistical techniques to compare accident rates, analyze the impact of the level of automation, and assess the influence of factors such as regulations, infrastructure, and ethical considerations. The results contribute to the ongoing discussion on the feasibility of widespread self-driving car adoption and inform manufacturers, policymakers, and the general public about the potential benefits and risks linked with self-driving cars. This study provides verifiable evidence on autonomous vehicle safety, guiding the development of self-driving technologies in the future. Keywords: future impact, TAM, road safety, technology acceptance model, self-driving cars, driverless cars, autonomous vehicles, human-driven cars, SDV
... Autonomous vehicles fall under high-risk AI systems whose failures can have tragic consequences, e.g., a self-driving Uber crash due to the malfunctioning of sensors that were supposed to detect pedestrians [41]. Upon reviewing state-of-the-art AI systems in autonomous vehicles, Fernández-Llorca and Gómez [42] find that they have a "High" level of maturity in addressing "Technical Robustness & Safety", while having "Low" and "Medium" levels of maturity towards other trust requirements. ...
Preprint
Full-text available
The rapid integration of AI in products, services, and innovation processes has enabled transformative applications, raising global concerns about the trustworthiness of AI features and the corresponding development processes. In this paper, we provide a perspective on how design and innovation processes can be adapted to ensure the trustworthiness of AI-centric artefacts. We review generic recommendations for trustworthy AI provided by various organisations and scholars. By leveraging the “double-hump” model of data-driven innovation, we explain and illustrate how trustworthy AI could be integrated into the design and innovation processes. We then propose research directions, data, and methods that could help gather an empirical understanding of trustworthiness and thus lead to an assessment of existing AI artefacts for trustworthiness. Since there is a disparity among domains and organisations in terms of AI-related risk and maturity, we expect that the proposed process model and the assessment methods could contribute towards a reliable road map for the development and assessment of trustworthy AI.
... Within the span of a single week, two self-driving cars (albeit built on different technological foundations) suffered fatal accidents, to which society reacted with great outrage. This kind of reaction was very important for self-driving cars, as it showed that society is aware of the importance and impact of the technology (Kohli, 2020). Subsequently, even some of those who had shown the greatest enthusiasm for self-driving cars hesitated over whether they would actually use them in any form. ...
Article
Full-text available
This study summarizes the promising opportunities and challenges of self-driving vehicles in agriculture. Through further research and development, these technologies could enable more sustainable, efficient, and competitive agricultural practices in the future. Focusing on understanding the factors influencing the acceptance and adoption of this transformative technology, we delve into its impact on farmers, industry stakeholders, and the future of sustainable agriculture. Using the keywords Industry 4.0, self-driving tractors, autonomous vehicles, agricultural tractors, and automated agriculture, a literature review was prepared with the aim of obtaining a comprehensive picture of the use of self-driving tractors at the domestic and international level, as well as gaining insight into the impact of Industry 4.0 and artificial intelligence on agriculture. We examine self-driving tractors and the potential risks of Industry 4.0 in agriculture. At the farm level, the adoption of self-driving tractors represents a paradigm shift in how agricultural tasks are carried out. It is essential to understand the effects on individual farmers, including changes in job roles, the skills required, and the potential socio-economic impacts. The introduction of autonomous technology may change traditional farming practices, so training programs and support mechanisms will be needed to help farmers adapt smoothly to the new technological environment. It is vital to explore how this transition may affect the livelihoods, well-being, and autonomy of individual farmers, ensuring that the benefits of self-driving tractors are equitable and inclusive.
... The challenge of assuring these systems has increased as novel technologies and capabilities (e.g., artificial intelligence, autonomy, self-adaptation) are relied upon to realize critical functions. For instance, autonomous vehicles depend on artificial intelligence (mainly machine learning) methods to perceive their environment and detect nearby agents, and perception system failures have led to severe consequences [45,46]. Significant effort is expended to develop these systems and avoid such loss events. ...
Preprint
Full-text available
CONTEXT: Assurance Cases (ACs) are prepared to argue that the system's desired quality attributes (e.g., safety or security) are satisfied. While there is strong adoption of ACs, practitioners are often left asking an important question: are we confident that the claims made by the case are true? While many confidence assessment methods (CAMs) exist, little is known about the use of these methods in practice. OBJECTIVE: Develop an understanding of the current state of practice for AC confidence assessment: what methods are used in practice and what barriers exist for their use? METHOD: Structured interviews were performed with practitioners with experience contributing to real-world ACs. Open coding was performed on transcripts. A description of the current state of AC practice and future considerations for researchers was synthesized from the results. RESULTS: A total of n = 19 practitioners were interviewed. The most common CAMs were (peer) review of ACs, dialectic reasoning ("defeaters"), and comparison against checklists. Participants preferred qualitative methods and expressed concerns about quantitative CAMs. Barriers to using CAMs included additional work, inadequate guidance, subjectivity and interpretation of results, and trustworthiness of methods. CONCLUSION: While many CAMs are described in the literature, there is a gap between the proposed methods and the needs of practitioners. Researchers working in this area should consider the need to: connect CAMs to established practices, use CAMs to communicate with interest holders, crystallize the details of CAM application, curate accessible guidance, and confirm that methods are trustworthy.
... [Figure: Performance of YOLOv7 on the Cityscapes dataset [9] at different intensities of lightness and rain.] Safety-critical systems, such as AVs, can face fatal accidents if the control actions are derived from DNNs and the test inputs fall outside the distribution of the training datasets [19,32]. Out-of-distribution (OOD) detection attempts to identify the test inputs that do not belong to the distribution of the training datasets. ...
Article
Deep neural networks (DNNs) are effective tools for learning-enabled cyber-physical systems (CPSs) that handle high-dimensional image data. However, DNNs may make incorrect decisions when presented with inputs outside the distribution of their training data. These inputs can compromise the safety of CPSs. So, it becomes crucial to detect inputs as out-of-distribution (OOD) and interpret the reasons for their classification as OOD. In this study, we propose an interpretable learning method to detect OOD caused by meteorological features like darkness, lightness, and rain. To achieve this, we employ a variational autoencoder (VAE) to map high-dimensional image data to a lower-dimensional latent space. We then focus on a specific latent dimension and encourage it to classify different intensities of a particular meteorological feature in a monotonically increasing manner. This is accomplished by incorporating two additional terms into the VAE’s loss function: a classification loss and a positional loss. During training, we optimize the utilization of label information for classification. Remarkably, our results demonstrate that using only 25% of the training data labels is sufficient to train a single pre-selected latent dimension to classify different intensities of a specific meteorological feature. We evaluate the proposed method on two distinct datasets, CARLA and Duckietown, employing two different rain-generation methods. We show that our approach outperforms existing approaches by at least 15 in the F1 score and precision when trained and tested on CARLA dataset.
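The loss described above (a standard VAE objective plus a classification term and a positional term acting on one pre-selected latent dimension) can be illustrated with a hedged PyTorch-style sketch; the exact functional form of the two extra terms, the use of -1 to mark unlabelled samples, and the weights are assumptions made for the example, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code): a VAE loss augmented with a
# classification term and a positional term on a single latent dimension,
# encouraging that dimension to order intensities of a weather feature.
import torch
import torch.nn.functional as F

def augmented_vae_loss(recon, x, mu, logvar, intensity, dim=0,
                       w_cls=1.0, w_pos=1.0, margin=0.5):
    # Standard VAE terms: reconstruction error and KL divergence.
    recon_loss = F.mse_loss(recon, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

    z = mu[:, dim]                      # the pre-selected latent dimension
    labelled = intensity >= 0           # -1 marks unlabelled samples (assumed)
    zl, yl = z[labelled], intensity[labelled].float()

    if labelled.sum() < 2:
        return recon_loss + kl

    # Classification term: the latent value should predict the intensity bin.
    cls_loss = F.mse_loss(zl, yl)

    # Positional term: higher intensity should map to a larger latent value
    # (pairwise monotonicity hinge, an assumed form of the paper's idea).
    dz = zl.unsqueeze(0) - zl.unsqueeze(1)      # dz[a, b] = zl[b] - zl[a]
    dy = yl.unsqueeze(0) - yl.unsqueeze(1)
    pos_pairs = dy > 0
    if pos_pairs.any():
        pos_loss = torch.relu(margin - dz)[pos_pairs].mean()
    else:
        pos_loss = zl.new_zeros(())

    return recon_loss + kl + w_cls * cls_loss + w_pos * pos_loss
```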
... Image-based human re-identification (re-ID) is a critical and rapidly evolving area of research, with significant applications in surveillance, criminal investigations, and public safety scenarios [1,2]. The goal of re-ID is to match a probe image of an individual with images of the same identity from a gallery collected by multiple, non-overlapping cameras [3,4]. ...
Article
Full-text available
Human re-identification (re-ID) is nowadays among the most popular topics in computer vision, due to the increasing importance given to safety/security in modern societies. Being expected to run in totally uncontrolled data acquisition settings (e.g., visual surveillance), automated re-ID not only depends on various factors that may occur in non-controlled data acquisition settings but, most importantly, its performance varies with respect to different subject features (e.g., gender, height, ethnicity, clothing, and action being performed), which may result in highly biased and undesirable automata. While much effort has been put into increasing the robustness of identification in uncontrolled settings, a systematic assessment of the actual variations in performance with respect to each subject feature remains to be done. Accordingly, the contributions of this paper are threefold: 1) we report the correlation between the performance of three state-of-the-art re-ID models and different subject features; 2) we discuss the most concerning features and report valuable insights about the roles of the various features in re-ID performance, which can be used to develop more effective and unbiased re-ID systems; and 3) we leverage the concept of the biometric menagerie in order to identify the groups of individuals that typically fall into the most common menagerie families (e.g., goats, lambs, and wolves). Our findings not only contribute to a better understanding of the factors affecting re-ID performance but may also offer practical guidance for researchers and practitioners working on human re-identification.
... The fundamental objective of the presented system is to use cutting-edge image processing and deep learning algorithms to reliably identify pedestrians within the crossing area and alert drivers as quickly as possible, thereby reducing the likelihood of potential accidents and saving pedestrians' lives. It is anticipated that this intelligent system would integrate seamlessly with existing urban infrastructure, thereby promoting a transportation ecosystem that is both safer and more efficient [3]. ...
... One of the most common integrations of AI and CV technology is that of self-driving vehicles. However, while such vehicles have the potential to revolutionize the driving experience, they raise a number of important safety concerns [1][2][3][4]. For example, as described in a recent study [5], self-driving cars operate well in predictable environments with well-planned travel routes; however, their performance is significantly impaired in complex or low-visibility environments. ...
Article
Full-text available
In heavy rain situations, the clarity of both human vision and computer vision is significantly reduced. Rain removal GAN-based networks have been proposed as a means of resolving this problem. However, such methods have only a limited effectiveness in improving the object detection accuracy. Accordingly, this study commences by analyzing the object detection performance before and after rain removal, respectively. We propose an integrated framework for improving the object detection performance in heavy rain images based on the analysis results. The experimental results show that the proposed framework yields an improved IoU and reduces the error rate compared with existing methods.
... The resulting histograms are then redistributed so that they do not exceed the desired clipping limit, and cumulative distribution functions (CDFs) are applied to the contrast-limited histograms for grayscale mapping [25]. To increase image contrast, this technique is used in medical applications such as breast mammography, cell segmentation, and bone fracture detection, as well as in agricultural systems, fruit segmentation, and pedestrian or vehicle detection in driving systems [26][27][28][29]. Thresholding, one of the image processing methods, produces the mask of the desired object in an image by converting pixel values below a fixed pixel threshold to black and those above it to white. ...
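Since the excerpt above summarises CLAHE (per-tile histogram clipping and redistribution followed by CDF-based grayscale mapping) and fixed-value thresholding, a short OpenCV sketch of both operations is given below; the clip limit, tile size, and threshold value are illustrative defaults rather than values from the cited study.

```python
# Brief sketch: contrast-limited adaptive histogram equalization (CLAHE)
# followed by fixed-value thresholding, as described in the excerpt above.
# Parameter values are illustrative, not taken from the cited study.
import cv2

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)

# CLAHE: per-tile histograms are clipped at clipLimit, the excess is
# redistributed, and the CDF of each clipped histogram drives the mapping.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(img)

# Fixed thresholding: pixels below the threshold become black (0),
# pixels above it become white (255), producing the object mask.
_, mask = cv2.threshold(enhanced, 127, 255, cv2.THRESH_BINARY)

cv2.imwrite("enhanced.png", enhanced)
cv2.imwrite("mask.png", mask)
```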
Article
The low accessibility of traditional dermoscopic devices due to their high cost, together with the insufficient professional experience of specialist physicians, are the most prohibitive factors in the early diagnosis of skin cancer. However, previous studies in this field have focused predominantly on high-quality dermoscopic images rather than their digital counterparts, which are more economical and practical since they require less expertise during acquisition. Their use in diagnosis is nevertheless difficult due to the high presence of noise, which leads to a laborious filtering process. One of the greatest challenges in filtering is hair removal, owing to large variations in color, shape, and thickness. Hair removal requires a comprehensive approach that strictly preserves the data crucial for diagnosis, as it can otherwise sabotage the diagnosis itself. This article outlines two new filters designed for this purpose and examines their performance against two filters commonly used in this field.
... Wearing a safety belt is absolutely essential for preventing fatalities and minimizing the severity of injuries [6]. It has been calculated that 3287 people die every day as a result of accidents, according to the research done by Kohli and Chadha [7]. A further claim made by Leland et al. [8] is that 1.35 million fatalities caused by car accidents are thought to occur annually. ...
... However, AI systems are still limited to what they are trained for and fail even when a small, unseen change is encountered in real time. On March 18, 2018, an Uber self-driving car caused a pedestrian fatality, and the reason reported was "the inability of the AI to classify an object as a pedestrian unless that object was near a crosswalk" [11]. Further, Amazon's AI-powered recruiting tool was found to be gender biased (selecting men), as it was trained on the resumes submitted to the company over a period of 10 years, the majority of which came from men [12]. ...
Article
Full-text available
span lang="EN-US">Nowadays, artificial intelligence (AI) in general and machine learning techniques in particular has been widely employed in automated systems. Increasing complexity of these machine learning based systems have consequently given rise to blackbox models that are typically not understandable or explainable by humans. There is a need to understand the logic and reason behind these automated decision-making black box models as they are involved in our day-to-day activities such as driving, facial recognition identity systems, online recruitment. Explainable artificial intelligence (XAI) is an evolving field that makes it possible for humans to evaluate machine learning models for their correctness, fairness, and reliability. We extend our previous research work and perform a detailed analysis of the model created for text classification and sentiment analysis using a popular Explainable AI tool named local interpretable model agnostic explanations (LIME). The results verify that it is essential to evaluate machine learning models using explainable AI tools as accuracy and other related metrics does not ensure the correctness, fairness, and reliability of the model. We also present the comparison of explainability and interpretability of various machine learning algorithms using LIME. </span
... An AV needs to understand and anticipate the current and future states of its environment, including the presence of other vehicles, pedestrians, road signs, road conditions, weather situations, and potential obstacles [14]. Being able to understand and predict other road users' behaviour using a camera would be beneficial to any AV [15,16]. However, current computer vision systems cannot yet achieve error rates acceptable for autonomous navigation [17]. ...
Article
Full-text available
With the advent of autonomous vehicles, sensors and algorithm testing have become crucial parts of the autonomous vehicle development cycle. Having access to real-world sensors and vehicles is a dream for researchers and small-scale original equipment manufacturers (OEMs) due to the software and hardware development life-cycle duration and high costs. Therefore, simulator-based virtual testing has gained traction over the years as the preferred testing method due to its low cost, efficiency, and effectiveness in executing a wide range of testing scenarios. Companies like ANSYS and NVIDIA have come up with robust simulators, and open-source simulators such as CARLA have also populated the market. However, there is a lack of lightweight and simple simulators catering to specific test cases. In this paper, we introduce the SLAV-Sim, a lightweight simulator that specifically trains the behaviour of a self-learning autonomous vehicle. This simulator has been created using the Unity engine and provides an end-to-end virtual testing framework for different reinforcement learning (RL) algorithms in a variety of scenarios using camera sensors and raycasts.
... Issues prominently arise with respect to both algorithmic fairness [11,29], where machine learning systems reinforce existing social biases, and safety concerns [3,44], where machine learning systems are not yet reliable enough to be trusted with safety-critical tasks, a well-known example being car crashes involving self-driving cars [25]. For machine learning systems to be widely employed in everyday applications, this status quo needs to change. ...
Article
Full-text available
In this paper, we present the envisioned style and scope of the new topic “Explanation Paradigms Leveraging Analytic Intuition” (ExPLAIn) with the International Journal on Software Tools for Technology Transfer (STTT). Intention behind this new topic is to (1) explicitly address all aspects and issues that arise when trying to, if possible, reveal and then confirm hidden properties of black-box systems, or (2) to enforce vital properties by embedding them into appropriate system contexts. Machine-learned systems, such as Deep Neural Networks, are particularly challenging black-box systems, and there is a wealth of formal methods for analysis and verification waiting to be adapted and applied. The selection of papers of this first Special Section of ExPLAIn, most of which were co-authored by editorial board members, is an illustrative example of the style and scope envisioned: In addition to methodological papers on verification, explanation, and their scalability, case studies, tool papers, literature reviews, and position papers are also welcome.
... For example, a model designed to run on a drone to monitor crop health runs on resource-constrained hardware, where a suboptimal model will consume more compute, draining the battery faster and reducing flight time. In comparison, a model operating in an autonomous vehicle has no such hardware limitations but should have low latency to detect hazards in a timely manner, where a suboptimal model could be the difference between life and death [3]. ...
Preprint
Deep learning models have become essential in software engineering, enabling intelligent features like image captioning and document generation. However, their popularity raises concerns about environmental impact and inefficient model selection. This paper introduces GreenRunnerGPT, a novel tool for efficiently selecting deep learning models based on specific use cases. It employs a large language model to suggest weights for quality indicators, optimizing resource utilization. The tool utilizes a multi-armed bandit framework to evaluate models against target datasets, considering tradeoffs. We demonstrate that GreenRunnerGPT is able to identify a model suited to a target use case without wasteful computations that would occur under a brute-force approach to model selection.
... When poorly designed or adapted to target users, AI usage could mislead users into unfair and even incorrect decision-making (Lakkaraju & Bastani, 2020). Consequently, the real-world consequences of a failed AI-enabled system can be catastrophic, leading to, for example, discrimination (Buolamwini, 2017;Buolamwini & Gebru, 2018;Dastin, 2022;Hoffman & Podgurski, 2022;Kayser-Bril, 2020;Olteanu et al., 2019;Ruiz, 2019), and even death (Kohli & Chadha, 2020;Pietsch, 2021). Here, AI-enabled systems are defined as AI systems with capabilities to improve existing systems' performance, i.e., AIenhanced systems (Boland & Lyytinen, 2017), for example, recommender systems, and/or AI systems with capabilities to develop new applications, i.e., AI-based systems (Wuenderlich & Paluch, 2017), for example, virtual agents and robotic surgery (Rzepka & Berger, 2018). ...
Preprint
Full-text available
User trust in Artificial Intelligence (AI) enabled systems has been increasingly recognized and proven as a key element to fostering adoption. It has been suggested that AI-enabled systems must go beyond technical-centric approaches and towards embracing a more human centric approach, a core principle of the human-computer interaction (HCI) field. This review aims to provide an overview of the user trust definitions, influencing factors, and measurement methods from 23 empirical studies to gather insight for future technical and design strategies, research, and initiatives to calibrate the user AI relationship. The findings confirm that there is more than one way to define trust. Selecting the most appropriate trust definition to depict user trust in a specific context should be the focus instead of comparing definitions. User trust in AI-enabled systems is found to be influenced by three main themes, namely socio-ethical considerations, technical and design features, and user characteristics. User characteristics dominate the findings, reinforcing the importance of user involvement from development through to monitoring of AI enabled systems. In conclusion, user trust needs to be addressed directly in every context where AI-enabled systems are being used or discussed. In addition, calibrating the user-AI relationship requires finding the optimal balance that works for not only the user but also the system.
... This severely limits their robustness, as illustrated by, for example, the failures to recognize traffic signs, which can be easily hacked, or by models that are trained to detect traffic objects but confuse pedestrians with road infrastructure because they cannot deal with unforeseen situations. In the case of Uber in 2018, this problem caused a fatal accident with a self-driving car [10]. What do you think could be done to overcome this blind spot? ...
Article
Krzysztof Janowicz is a professor of Geoinformatics at the University of Vienna and the University of California, Santa Barbara. His research focuses on how humans conceptualize the space around them based on their behavior, focusing particularly on regional and cultural differences, with the goal of assisting machines to better understand the information needs of an increasingly diverse user base. http://geographicknowledge.de/pdf/geoAI_Interview_Janowicz.pdf
... Users can easily build and train deep learning models based on deep learning libraries and apply deep learning technology to the desired scenarios without deeply understanding all the implementations of the functions in the libraries. Currently, an increasing number of deep learning applications are being built based on deep learning libraries, and even a small bug in a deep learning library can affect thousands of deep learning systems [3], [4]. Therefore, ensuring the reliability of deep learning libraries is important for the practical application of deep learning systems. ...
... When poorly designed or adapted to target users, AI usage could mislead users into unfair and even incorrect decision-making (Lakkaraju & Bastani, 2020). Consequently, the real-world consequences of a failed AI-enabled system can be catastrophic, leading to, for example, discrimination (Buolamwini, 2017;Buolamwini & Gebru, 2018;Dastin, 2022;Hoffman & Podgurski, 2022;Kayser-Bril, 2020;Olteanu et al., 2019;Ruiz, 2019), and even death (Kohli & Chadha, 2020;Pietsch, 2021). Here, AI-enabled systems are defined as AI systems with capabilities to improve existing systems' performance, i.e., AIenhanced systems (Boland & Lyytinen, 2017), for example, recommender systems, and/or AI systems with capabilities to develop new applications, i.e., AI-based systems (Wuenderlich & Paluch, 2017), for example, virtual agents and robotic surgery (Rzepka & Berger, 2018). ...
Article
Full-text available
User trust in Artificial Intelligence (AI) enabled systems has been increasingly recognized and proven as a key element to fostering adoption. It has been suggested that AI-enabled systems must go beyond technical-centric approaches and towards embracing a more human-centric approach, a core principle of the human-computer interaction (HCI) field. This review aims to provide an overview of the user trust definitions, influencing factors, and measurement methods from 23 empirical studies to gather insight for future technical and design strategies, research, and initiatives to calibrate the user-AI relationship. The findings confirm that there is more than one way to define trust. Selecting the most appropriate trust definition to depict user trust in a specific context should be the focus instead of comparing definitions. User trust in AI-enabled systems is found to be influenced by three main themes, namely socio-ethical considerations, technical and design features, and user characteristics. User characteristics dominate the findings, reinforcing the importance of user involvement from development through to monitoring of AI-enabled systems. Different contexts and various characteristics of both the users and the systems are also found to influence user trust, highlighting the importance of selecting and tailoring features of the system according to the targeted user group’s characteristics. Importantly, socio-ethical considerations can pave the way in making sure that the environment where user-AI interactions happen is sufficiently conducive to establish and maintain a trusted relationship. In measuring user trust, surveys are found to be the most common method followed by interviews and focus groups. In conclusion, user trust needs to be addressed directly in every context where AI-enabled systems are being used or discussed. In addition, calibrating the user-AI relationship requires finding the optimal balance that works for not only the user but also the system.
... Advancements in Machine Learning (ML) have led to considerable progress in the levels of autonomy achieved by Autonomous Vehicles (AVs) [1]. However, recent accidents such as Tesla's autopilot crashes [2] and the fatal Uber selfdriving car accident [3] demonstrate that current AV systems still can fail. To address this, testing and validating such systems before deploying them into real-world operations has received increasing attention. ...
Article
Full-text available
As the performance and accuracy of machine learning and AI algorithms improve, the demand for adopting computer vision techniques to solve various problems, such as autonomous driving and AI robots, increases. To meet such demand, IoT and edge devices, which are small enough to be adopted in various environments while having sufficient computing capabilities, are being widely adopted. However, as devices are utilized in IoT and edge environments, which have harsh restrictions compared to traditional server environments, they are often limited by low computational and memory resources, in addition to the limited electrical power supply. This necessitates a unique approach for small IoT devices that are required to run complex tasks. In this paper, we propose a concurrent multi-frame processing scheme for real-time object detection algorithms. To do this, we first divide the video into individual frames and group the frames according to the number of cores in the device. Then, we allocate a group of frames per core to perform the object detection, resulting in parallel detection of multiple frames. We implement our scheme in YOLO (You Only Look Once), one of the most popular real-time object detection algorithms, on a state-of-the-art, resource-constrained IoT edge device, Nvidia Jetson Orin Nano, using real-world video and image datasets, including MS-COCO, ImageNet, PascalVOC, DOTA, animal videos, and car-traffic videos. Our evaluation results show that our proposed scheme can improve the diverse aspect of edge performance and improve the runtime, memory consumption, and power usage by up to 445%, 69%, and 73%, respectively. Additionally, it demonstrates improvements of 2.10× over state-of-the-art model optimization.
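The frame-grouping idea described in the abstract can be illustrated with a short, hedged Python sketch: frames are extracted with OpenCV, split into one contiguous group per worker, and each worker runs an off-the-shelf detector on its group in a separate process. The detector (YOLOv5 loaded via torch.hub) and the grouping policy are stand-ins chosen for the example, not the paper's Jetson implementation.

```python
# Illustrative sketch of the multi-frame grouping idea: split a video's
# frames into one group per CPU core and detect objects in the groups in
# parallel. Detector and parameters are assumptions, not the paper's code.
import multiprocessing as mp
import cv2
import torch

def detect_group(frames):
    # Each worker loads its own detector instance (YOLOv5 via torch.hub,
    # used here purely as a readily available stand-in).
    model = torch.hub.load("ultralytics/yolov5", "yolov5s")
    # Convert BGR (OpenCV) to RGB before inference.
    return [model(frame[:, :, ::-1]).pandas().xyxy[0] for frame in frames]

def parallel_detect(video_path, num_workers=None):
    num_workers = num_workers or mp.cpu_count()

    # 1) Read the video into individual frames.
    cap, frames = cv2.VideoCapture(video_path), []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()

    # 2) Group frames: one contiguous chunk per worker/core.
    size = max(1, -(-len(frames) // num_workers))   # ceiling division
    groups = [frames[i:i + size] for i in range(0, len(frames), size)]

    # 3) Detect in parallel, one group per process.
    with mp.Pool(num_workers) as pool:
        results = pool.map(detect_group, groups)
    return [det for group in results for det in group]

if __name__ == "__main__":
    detections = parallel_detect("traffic.mp4")
```

Processing whole groups offline like this trades per-frame latency for throughput, which is one plausible reading of how grouping frames by core improves runtime on a multi-core edge device.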
Article
Full-text available
Objectives: In conditional automation for automated vehicles (AVs), drivers are tasked with remaining vigilant and ready to assume control should the system encounter a malfunction. However, little to no information is provided to the driver either about the AV's intended maneuvers or the AV's awareness of potential threats in the surrounding environment. To address this research gap, the present study proposes 2 human-machine interaction (HMI) concepts: firstly, the shared intended pathway (SIP), which presents a forecast of the AV's intended maneuvers, and secondly, object recognition bounding boxes (ORBBs), which place transparent blue squares around other road users likely to contribute to a crash. Methods: An online experiment with both within- and between-group factors tested 2 HMI designs: one with both SIP and ORBB and one with SIP only. Six videos were presented sequentially, each depicting 10 to 15 s of automated driving, to 394 participants (mean age = 28.24 years, 64% male) in Australia. Participants monitored the AV in each of the videos, which were presented in 1 of 3 randomized conditions: no HMI display, SIP only, or SIP + ORBB. Questions after each video measured trust, situational awareness, stress, and user acceptance, as measured by the technology acceptance model (TAM). Results: Results suggest that SIP is beneficial in conditional automation monitoring, in terms of reduced stress, increased perceived usefulness, and future intentions to use AVs. Further, ORBB presence increased participant ratings of perceived usefulness and intentions to use AVs compared to no ORBB. The findings also provided support for the TAM, because perceived ease of use and perceived usefulness were found to be significant positive predictors of future intentions to use AVs. Conclusions: The current study provides some evidence to suggest that SIP may be beneficial in automated driving. However, further on-road testing of HMI displays for AVs that include SIP and ORBBs is needed to enhance the user experience and acceptability of AVs and to provide more ecological validity.
Article
Full-text available
The future of aviation is unmanned and ultimately autonomous. As part of this effort the Office of Naval Research, in partnership with the Naval Air Systems Command, has initiated the advanced autonomous air-to-air refueling system (A4RS) future naval capability (FNC). The A4RS FNC intends to set the interface requirements for any uncrewed aerial system to receive fuel from a United States Navy (USN) aircraft. Additionally, the A4RS FNC will be the first time that a system will be authorized to complete autonomous behavior without a human in or on the loop. However, a method to certify this behavior safe for flight does not currently exist. This paper details the method that has been proposed as part of the FNC to the naval flight certification authorities for approving a deep neural network to complete the aerial refueling task.
Article
Full-text available
Recognizing and categorizing items in weather-adverse environments poses significant challenges for autonomous vehicles. To improve the robustness of object-detection systems, this paper introduces an innovative approach for detecting objects at different levels by leveraging sensors and deep learning-based solutions within a traffic circle. The suggested approach improves the effectiveness of single-stage object detectors, aiming to advance the performance in perceiving autonomous racing environments and minimizing instances of false detection and low recognition rates. The improved framework is based on the one-stage object-detection model, incorporating multiple lightweight backbones. Additionally, attention mechanisms are integrated to refine the object-detection process further. Our proposed model demonstrates superior performance compared to the state-of-the-art method on the DAWN dataset, achieving a mean average precision (mAP) of 99.1%, surpassing the previous result of 84.7%.
Article
Full-text available
Deep learning (DL) has revolutionized the field of artificial intelligence by providing sophisticated models across a diverse range of applications, from image and speech recognition to natural language processing and autonomous driving. However, deep learning models are typically black-box models where the reason for predictions is unknown. Consequently, the reliability of the model becomes questionable in many circumstances. Explainable AI (XAI) plays an important role in improving the transparency and interpretability of the model thereby making it more reliable for real-time deployment. To investigate the reliability and truthfulness of DL models, this research develops image classification models using transfer learning mechanism and validates the results using XAI technique. Thus, the contribution of this research is twofold, we employ three pre-trained models VGG16, MobileNetV2 and ResNet50 using multiple transfer learning techniques for a fruit classification task consisting of 131 classes. Next, we inspect the reliability of models, based on these pre-trained networks, by utilizing Local Interpretable Model-Agnostic Explanations, the LIME, a popular XAI technique that generates explanations for the predictions. Experimental results reveal that transfer learning provides optimized results of around 98% accuracy. The classification of the models is validated on different instances using LIME and it was observed that each model predictions are interpretable and understandable as they are based on pertinent image features that are relevant to particular classes. We believe that this research gives an insight for determining how an interpretation can be drawn from a complex AI model such that its accountability and trustworthiness can be increased.
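As a compact illustration of the transfer-learning setup described above, here is a hedged Keras sketch that uses MobileNetV2 as a frozen feature extractor with a new classification head, followed by a LIME image explanation of one prediction; the directory layout, image size, and training settings are assumptions, not the paper's configuration.

```python
# Hedged sketch of transfer learning for fruit classification with a frozen
# MobileNetV2 backbone, plus a LIME image explanation of one prediction.
# Paths, image size, and hyperparameters are assumptions for illustration.
import numpy as np
import tensorflow as tf
from lime import lime_image

NUM_CLASSES, IMG_SIZE = 131, (224, 224)

train_ds = tf.keras.utils.image_dataset_from_directory(
    "fruits/train", image_size=IMG_SIZE, batch_size=32)

base = tf.keras.applications.MobileNetV2(
    input_shape=IMG_SIZE + (3,), include_top=False, weights="imagenet")
base.trainable = False                       # freeze the pretrained backbone

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),   # MobileNetV2 scaling
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)

# LIME: perturb superpixels of one image and fit a local surrogate to see
# which regions supported the predicted class.
image = next(iter(train_ds))[0][0].numpy().astype("uint8")
explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    image, lambda batch: model.predict(np.asarray(batch)),
    top_labels=3, num_samples=500)
```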
Article
Real-time video dehazing plays a key role in helping autonomous driving detect pedestrians or obstacles in severe foggy weather to prevent potential hazards. Existing video dehazing methods achieve good restoration performance but still suffer from oversaturation and low dehazing speed, especially for high-definition (HD, high-resolution) videos. In this article, we propose a mixed atmosphere prior information video dehazing accelerator (MAPD) and implement it on a field programmable gate array (FPGA) to achieve real-time haze removal for HD video. MAPD provides a mixed atmospheric light model by applying heterogeneous atmospheric light in the foreground area to balance brightness deviation, while maintaining the global atmospheric light in the background region. Considering the parallel characteristics of FPGA, MAPD leverages the redundant information between adjacent frames to accelerate the dehazing process and designs an indirect transmission estimation to decrease resource consumption. For comparison, we also implement six dehazing solutions (DCP, color ellipsoid prior (CEP), RDCP, FFVD, MHVD, and REFD) on FPGA, and deploy a graphics processing unit (GPU)-based method (D^4) on a platform with an Nvidia 3080 GPU. Experiments using two widely used benchmarks show that MAPD increases performance by up to 36.5%, 53.5%, 36.3%, 33.3%, 11.9%, and 23.3%, decreases resource consumption by up to 79.7%, 75.0%, 74.8%, 25.6%, 22.6%, and 73.9%, and enhances FPS for HD videos by up to 241.6%, 145.9%, 151.7%, 68.6%, 50.6%, and 62.4%, compared with DCP, CEP, RDCP, FFVD, MHVD, and REFD. Compared to D^4, MAPD also improves dehazing performance by up to 21.8% and increases FPS by up to 487.0%.
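To give a sense of the prior-based dehazing family that MAPD is benchmarked against, below is a short, hedged sketch of the classical single-image dark channel prior (DCP) baseline named in the abstract; the patch size, omega, and minimum transmission are typical textbook values, and the sketch is unrelated to the FPGA design of MAPD itself.

```python
# Minimal single-image dark channel prior (DCP) dehazing sketch, shown only
# as the classical baseline named above; parameters are typical defaults,
# and this is unrelated to the FPGA-based MAPD design itself.
import cv2
import numpy as np

def dark_channel(img, patch=15):
    # Per-pixel minimum over colour channels, then a min filter (erosion).
    min_rgb = np.min(img, axis=2)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (patch, patch))
    return cv2.erode(min_rgb, kernel)

def dehaze_dcp(img_bgr, omega=0.95, t0=0.1, patch=15):
    img = img_bgr.astype(np.float64) / 255.0
    dark = dark_channel(img, patch)

    # Atmospheric light: mean colour of the brightest 0.1% dark-channel pixels.
    n_top = max(1, int(dark.size * 0.001))
    idx = np.unravel_index(np.argsort(dark, axis=None)[-n_top:], dark.shape)
    A = img[idx].mean(axis=0)

    # Transmission estimate and scene radiance recovery.
    t = 1.0 - omega * dark_channel(img / A, patch)
    t = np.clip(t, t0, 1.0)[..., None]
    J = (img - A) / t + A
    return np.clip(J * 255, 0, 255).astype(np.uint8)

result = dehaze_dcp(cv2.imread("hazy.png"))
cv2.imwrite("dehazed.png", result)
```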
Article
Despite the potential of autonomous vehicles (AVs) to improve traffic efficiency and safety, many studies have shown that traffic accidents in a hybrid traffic environment where both AVs and human-driven vehicles (HVs) are present are inevitable because of the unpredictability of HVs. Given that eliminating accidents is impossible, an achievable goal is to design AVs in such a way that they will not be blamed for any accident in which they are involved. In this paper, we propose BlaFT Rules, or Blame-Free hybrid Traffic motion planning Rules. An AV following BlaFT Rules is designed to be cooperative with HVs as well as other AVs, and will not be blamed for accidents in a structured road environment. We provide proofs that no accidents will happen if all AVs use a BlaFT Rules conforming motion planner, and that an AV using BlaFT Rules will be blame-free even if it is involved in a collision in hybrid traffic. We implemented a motion planning algorithm that conforms to BlaFT Rules, called BlaFT. We instantiated scores of BlaFT-controlled AVs and HVs in an urban roadscape loop in the SUMO simulator and show that, over time, as the percentage of BlaFT vehicles increases, the traffic becomes safer even with HVs involved. Adding BlaFT vehicles increases the efficiency of traffic as a whole by up to 34% over HVs alone.
Article
Video data offer important insights into social processes because they enable direct observation of real-life social interaction. Though such data have become abundant and increasingly accessible, they pose challenges to scalability and measurement. Computer vision (CV), i.e., software-based automated analysis of visual material, can help address these challenges, but existing CV tools are not sufficiently tailored to analyze social interactions. We describe our novel approach, “3D social research” (3DSR), which uses CV and 3D camera footage to study kinesics and proxemics, two core elements of social interaction. Using eight videos of a scripted interaction and five real-life street scene videos, we demonstrate how 3DSR expands sociologists’ analytical toolkit by facilitating a range of scalable and precise measurements. We specifically emphasize 3DSR's potential for analyzing physical distance, movement in space, and movement rate – important aspects of kinesics and proxemics in interactions. We also assess data reliability when using 3DSR.
Chapter
This article provides a comprehensive understanding of the image-based ship detector using computer vision technologies with deep learning. Several pre-trained object detection models, such as MobileNet, VGGNet, Inception, and ResNet, have been investigated by illustrating the network architectures. A group of pre-trained models has been proposed and examined by recognizing ships on the sea and in the bay area. The model testing and comparison procedure have also been performed by evaluating the performance matrix and comparing predictive results per model. The optimal model is then chosen with the additional tests in terms of capabilities of the ship detection using the satellite image streaming in the real world. Such a proposed ship detector can contribute to the development of smart ship operations and may further carve out the possibility for the automated shipping system with smart port management.
Article
Artificial intelligence (AI) and algorithmic decision making are having a profound impact on our daily lives. These systems are vastly used in different high-stakes applications like healthcare, business, government, education, and justice, moving us toward a more algorithmic society. However, despite so many advantages of these systems, they sometimes directly or indirectly cause harm to the users and society. Therefore, it has become essential to make these systems safe, reliable, and trustworthy. Several requirements, such as fairness, explainability, accountability, reliability, and acceptance, have been proposed in this direction to make these systems trustworthy. This survey analyzes all of these different requirements through the lens of the literature. It provides an overview of different approaches that can help mitigate AI risks and increase trust and acceptance of the systems by utilizing the users and society. It also discusses existing strategies for validating and verifying these systems and the current standardization efforts for trustworthy AI. Finally, we present a holistic view of the recent advancements in trustworthy AI to help the interested researchers grasp the crucial facets of the topic efficiently and offer possible future research directions.
Article
Full-text available
The performance of deep learning-based detection methods has made them an attractive option for robotic perception. However, their training typically requires large volumes of data containing all the various situations the robots may potentially encounter during their routine operation. Thus, the workforce required for data collection and annotation is a significant bottleneck when deploying robots in the real world. This applies especially to outdoor deployments, where robots have to face various adverse weather conditions. We present a method that allows an independent car transporter to train its neural networks for vehicle detection without human supervision or annotation. We provide the robot with a hand-coded algorithm for detecting cars in LiDAR scans in favourable weather conditions and complement this algorithm with a tracking method and a weather simulator. As the robot traverses its environment, it can collect data samples, which can be subsequently processed into training samples for the neural networks. As the tracking method is applied offline, it can exploit the detections made both before the currently processed scan and any subsequent future detections of the current scene, meaning the quality of annotations is in excess of those of the raw detections. Along with the acquisition of the labels, the weather simulator is able to alter the raw sensory data, which are then fed into the neural network together with the labels. We show how this pipeline, being run in an offline fashion, can exploit off-the-shelf weather simulation for the auto-labelling training scheme in a simulator-in-the-loop manner. We show how such a framework produces an effective detector and how the weather simulator-in-the-loop is beneficial for the robustness of the detector. Thus, our automatic data annotation pipeline significantly reduces not only the data annotation but also the data collection effort. This allows the integration of deep learning algorithms into existing robotic systems without the need for tedious data annotation and collection in all possible situations. Moreover, the method provides annotated datasets that can be used to develop other methods. To promote the reproducibility of our research, we provide our datasets, codes and models online.
Chapter
The field of machine learning focuses on computationally efficient, yet approximate algorithms. On the contrary, the field of formal methods focuses on mathematical rigor and provable correctness. Despite their superficial differences, both fields offer mutual benefit. Formal methods offer methods to verify and explain machine learning systems, aiding their adoption in safety critical domains. Machine learning offers approximate, computationally efficient approaches that let formal methods scale to larger problems. This paper gives an introduction to the track "Formal Methods Meets Machine Learning" (F3ML) and briefly presents its scientific contributions, structured into two thematic subthemes: one concerning formal-methods-based approaches for the explanation and verification of machine learning systems, and one concerning the use of machine learning approaches to scale formal methods.
Article
Deep Neural Networks are actively being used in the design of autonomous Cyber-Physical Systems (CPSs). The advantage of these models is their ability to handle high-dimensional state spaces and learn compact surrogate representations of the operational state spaces. However, the problem is that the sampled observations used for training the model may never cover the entire state space of the physical environment, and as a result, the system will likely operate in conditions that do not belong to the training distribution. These conditions are referred to as Out-of-Distribution (OOD). Detecting OOD conditions at runtime is critical for the safety of CPS. In addition, it is also desirable to identify the context or the feature(s) that are the source of the OOD in order to select an appropriate control action to mitigate the consequences that may arise because of the OOD condition. In this article, we study this problem as a multi-labeled time series OOD detection problem over images, where the OOD is defined both sequentially across short time windows (change points) as well as across the training data distribution. A common approach to solving this problem is the use of multi-chained one-class classifiers. However, this approach is expensive for CPSs that have limited computational resources and require short inference times. Our contribution is an approach to design and train a single β-Variational Autoencoder detector with a partially disentangled latent space sensitive to variations in image features. We use the feature-sensitive latent variables in the latent space to detect OOD images and identify the most likely feature(s) responsible for the OOD. We demonstrate our approach using an Autonomous Vehicle in the CARLA simulator and a real-world automotive dataset called nuImages.
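The core of such a detector is the β-weighted VAE objective. Below is a minimal, hypothetical sketch (not the authors' implementation) of that loss in PyTorch, keeping the KL term per latent dimension so that individual latent variables can later be inspected for OOD scoring.

# Hypothetical sketch of a beta-VAE objective as described in the abstract:
# reconstruction loss plus a beta-weighted KL term encourages a partially
# disentangled latent space; per-latent KL values can then flag OOD inputs.
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=1.5):
    # Pixel-wise reconstruction error (binary cross-entropy for images in [0, 1]).
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # KL divergence between q(z|x) = N(mu, sigma^2) and the unit Gaussian prior,
    # kept per latent dimension so individual dimensions can be inspected.
    kl_per_dim = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())
    total = recon + beta * kl_per_dim.sum()
    return total, kl_per_dim.sum(dim=0)   # total loss, per-latent KL over the batch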
Article
Full-text available
Images captured in low-light conditions usually suffer from very low contrast, which increases the difficulty of subsequent computer vision tasks to a great extent. In this paper, a low-light image enhancement model based on convolutional neural network and Retinex theory is proposed. Firstly, we show that multi-scale Retinex is equivalent to a feedforward convolutional neural network with different Gaussian convolution kernels. Motivated by this fact, we consider a Convolutional Neural Network (MSR-net) that directly learns an end-to-end mapping between dark and bright images. Fundamentally different from existing approaches, low-light image enhancement in this paper is regarded as a machine learning problem. In this model, most of the parameters are optimized by back-propagation, while the parameters of traditional models depend on manual settings. Experiments on a number of challenging images reveal the advantages of our method in comparison with other state-of-the-art methods from both the qualitative and quantitative perspective.
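For reference, the classical multi-scale Retinex operation that the abstract relates to a feedforward CNN can be written in a few lines; the sketch below uses OpenCV Gaussian blurs with illustrative scale parameters rather than anything learned by MSR-net.

# Minimal multi-scale Retinex sketch: log of the image minus the log of its
# Gaussian-smoothed illumination, averaged over several scales. Parameters
# are illustrative, not those of MSR-net.
import cv2
import numpy as np

def multi_scale_retinex(img, sigmas=(15, 80, 250)):
    img = img.astype(np.float64) + 1.0            # avoid log(0)
    msr = np.zeros_like(img)
    for sigma in sigmas:
        blurred = cv2.GaussianBlur(img, (0, 0), sigma)
        msr += np.log(img) - np.log(blurred)      # single-scale Retinex
    msr /= len(sigmas)                            # equal weights across scales
    msr = (msr - msr.min()) / (msr.max() - msr.min() + 1e-8)
    return (msr * 255).astype(np.uint8)           # stretch back to 8-bit range

# Usage: enhanced = multi_scale_retinex(cv2.imread("dark_frame.png"))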
Conference Paper
Full-text available
Low-light images are not conducive to human observation and computer vision algorithms due to their low visibility. To solve this problem, many image enhancement techniques have been proposed. However, existing techniques inevitably introduce color and lightness distortion when increasing visibility. To lower the distortion, we propose a novel enhancement method using the response characteristics of cameras. First, we investigate the relationship between two images with different exposures to obtain an accurate camera response model. Then we borrow illumination estimation techniques to estimate the exposure ratio map. Finally, we use our camera response model to adjust each pixel to its desired exposure according to the estimated exposure ratio map. Experiments show that our method can obtain enhancement results with less color and lightness distortion compared to several state-of-the-art methods.
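A rough sketch of the per-pixel exposure adjustment step is given below; the beta-gamma camera response form and the numeric parameters are assumptions in the spirit of this line of work, not the model fitted in the paper.

# Push each pixel toward its desired exposure using an assumed beta-gamma
# camera response model. `ratio` may be a scalar or a per-pixel ratio map.
import numpy as np

def adjust_exposure(img, ratio, a=-0.3293, b=1.1258):
    """img: float array in [0, 1]; a, b: assumed response parameters."""
    ratio = np.asarray(ratio, dtype=np.float64)
    if ratio.ndim == img.ndim - 1:        # per-pixel map: add a channel axis
        ratio = ratio[..., None]
    gamma = ratio ** a                    # exposure-dependent gamma
    beta = np.exp(b * (1.0 - gamma))      # exposure-dependent gain
    return np.clip(beta * img ** gamma, 0.0, 1.0)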
Conference Paper
Full-text available
Low-light images are not conducive to human observation and computer vision algorithms due to their low visibility. Although many image enhancement techniques have been proposed to solve this problem, existing methods inevitably introduce contrast under- and over-enhancement. In this paper, we propose an image contrast enhancement algorithm to provide an accurate contrast enhancement. Specifically, we first design the weight matrix for image fusion using illumination estimation techniques. Then we introduce our camera response model to synthesize multi-exposure images. Next, we find the best exposure ratio so that the synthetic image is well-exposed in the regions where the original image is under-exposed. Finally, the input image and the synthetic image are fused according to the weight matrix to obtain the enhancement result. Experiments show that our method can obtain results with less contrast and lightness distortion compared to those of several state-of-the-art methods.
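The sketch below illustrates the overall fusion idea under simplifying assumptions: a smoothed illumination estimate serves as the weight map, a toy camera response model synthesizes a brighter exposure, and the two images are blended. None of the parameter choices are the paper's.

# Toy exposure-fusion enhancement: well-lit pixels keep the original image,
# dark pixels take a synthetically brightened exposure. All values assumed.
import cv2
import numpy as np

def fuse_exposures(img_bgr, ratio=5.0, mu=0.5):
    img = img_bgr.astype(np.float64) / 255.0
    illumination = cv2.GaussianBlur(img.max(axis=2), (0, 0), 5)   # rough illumination estimate
    weight = np.clip(illumination, 0.0, 1.0)[..., None] ** mu     # fusion weight map
    gamma = ratio ** -0.3293                                      # toy beta-gamma response (assumed values)
    beta = np.exp(1.1258 * (1.0 - gamma))
    synthetic = np.clip(beta * img ** gamma, 0.0, 1.0)            # simulated longer exposure
    fused = weight * img + (1.0 - weight) * synthetic
    return (fused * 255).astype(np.uint8)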
Article
Full-text available
While most approaches to semantic reasoning have focused on improving performance, in this paper we argue that computational times are very important in order to enable real-time applications such as autonomous driving. Towards this goal, we present an approach to joint classification, detection and semantic segmentation via a unified architecture where the encoder is shared amongst the three tasks. Our approach is very simple, can be trained end-to-end and performs extremely well on the challenging KITTI dataset, outperforming the state-of-the-art in the road segmentation task. Our approach is also very efficient, taking less than 100 ms to perform all tasks.
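As an illustration of the shared-encoder idea (not the authors' architecture), a toy PyTorch network with one encoder feeding classification, detection, and segmentation heads could look like this:

# Illustrative shared-encoder multi-task network: one feature extractor,
# three task-specific heads. Layer sizes are arbitrary.
import torch
import torch.nn as nn

class SharedEncoderNet(nn.Module):
    def __init__(self, num_classes=2, num_boxes=16):
        super().__init__()
        self.encoder = nn.Sequential(                       # shared feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(64, num_classes))
        self.detector = nn.Conv2d(64, num_boxes * 5, 1)     # per-cell box offsets + score
        self.segmenter = nn.Conv2d(64, num_classes, 1)      # per-pixel class logits

    def forward(self, x):
        f = self.encoder(x)
        return self.classifier(f), self.detector(f), self.segmenter(f)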
Article
Full-text available
Object detection is a crucial task for autonomous driving. In addition to requiring high accuracy to ensure safety, object detection for autonomous driving also requires real-time inference speed to guarantee prompt vehicle control, as well as small model size and energy efficiency to enable embedded system deployment. In this work, we propose SqueezeDet, a fully convolutional neural network for object detection that aims to simultaneously satisfy all of the above constraints. In our network we use convolutional layers not only to extract feature maps, but also as the output layer to compute bounding boxes and class probabilities. The detection pipeline of our model only contains a single forward pass of a neural network, thus it is extremely fast. Our model is fully-convolutional, which leads to small model size and better energy efficiency. Finally, our experiments show that our model is very accurate, achieving state-of-the-art accuracy on the KITTI benchmark.
Conference Paper
Full-text available
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For 300×300 input, SSD achieves 74.3% mAP on VOC2007 test at 59 FPS on a Nvidia Titan X, and for 512×512 input, SSD achieves 76.9% mAP, outperforming a comparable state-of-the-art Faster R-CNN model. Compared to other single stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at https://github.com/weiliu89/caffe/tree/ssd.
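The default-box mechanism can be sketched compactly; the following generates anchor boxes for a single feature map, with illustrative scales and aspect ratios rather than the configuration used in the paper.

# SSD-style default (anchor) box generation for one feature map.
import itertools
import numpy as np

def default_boxes(fmap_size=19, scale=0.2, aspect_ratios=(1.0, 2.0, 0.5)):
    boxes = []
    for i, j in itertools.product(range(fmap_size), repeat=2):
        cx, cy = (j + 0.5) / fmap_size, (i + 0.5) / fmap_size   # box centre in [0, 1]
        for ar in aspect_ratios:
            w, h = scale * np.sqrt(ar), scale / np.sqrt(ar)     # width/height for this ratio
            boxes.append([cx, cy, w, h])
    return np.array(boxes)   # shape: (fmap_size^2 * len(aspect_ratios), 4), normalised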
Conference Paper
Full-text available
Recent advances in pedestrian detection are attained by transferring the learned features of Convolutional Neural Network (ConvNet) to pedestrians. This ConvNet is typically pre-trained with massive general object categories (e.g. ImageNet). Although these features are able to handle variations such as poses, viewpoints, and lighting, they may fail when pedestrian images with complex occlusions are present. Occlusion handling is one of the most important problems in pedestrian detection. Unlike previous deep models that directly learned a single detector for pedestrian detection, we propose DeepParts, which consists of extensive part detectors. DeepParts has several appealing properties. First, DeepParts can be trained on weakly labeled data, i.e. only pedestrian bounding boxes without part annotations are provided. Second, DeepParts is able to handle low IoU positive proposals that shift away from ground truth. Third, each part detector in DeepParts is a strong detector that can detect a pedestrian by observing only a part of a proposal. Extensive experiments on the Caltech dataset demonstrate the effectiveness of DeepParts, which yields a new state-of-the-art miss rate of 11.89%, outperforming the second best method by 10%.
Article
Full-text available
In this paper, we propose a fast labeling algorithm based on block-based concepts. Because the number of memory access points directly affects the time consumption of the labeling algorithms, the aim of the proposed algorithm is to minimize neighborhood operations. Our algorithm utilizes a block-based view and correlates a raster scan to select the necessary pixels generated by a block-based scan mask. We analyze the advantages of a sequential raster scan for the block-based scan mask, and integrate the block-connected relationships using two different procedures with binary decision trees to reduce unnecessary memory access. This greatly simplifies the pixel locations of the block-based scan mask. Furthermore, our algorithm significantly reduces the number of leaf nodes and depth levels required in the binary decision tree. We analyze the labeling performance of the proposed algorithm alongside that of other labeling algorithms using high-resolution images and foreground images. The experimental results from synthetic and real image datasets demonstrate that the proposed algorithm is faster than other methods.
Article
Full-text available
Autonomous vehicles (AVs) represent a potentially disruptive yet beneficial change to our transportation system. This new technology has the potential to impact vehicle safety, congestion, and travel behavior. All told, major social AV impacts in the form of crash savings, travel time reduction, fuel efficiency and parking benefits are estimated to approach $2000 per year per AV, and may eventually approach nearly $4000 when comprehensive crash costs are accounted for. Yet barriers to implementation and mass-market penetration remain. Initial costs will likely be unaffordable. Licensing and testing standards in the U.S. are being developed at the state level, rather than nationally, which may lead to inconsistencies across states. Liability details remain undefined, security concerns linger, and without new privacy standards, a default lack of privacy for personal travel may become the norm. The impacts and interactions with other components of the transportation system, as well as implementation details, remain uncertain. To address these concerns, the federal government should expand research in these areas and create a nationally recognized licensing framework for AVs, determining appropriate standards for liability, security, and data privacy.
Conference Paper
Full-text available
Detecting partially occluded pedestrians is challenging. A common practice to maximize detection quality is to train a set of occlusion-specific classifiers, each for a certain amount and type of occlusion. Since training classifiers is expensive, only a handful are typically trained. We show that by using many occlusion-specific classifiers, we outperform previous approaches on three pedestrian datasets, INRIA, ETH, and Caltech USA. We present a new approach to train such classifiers. By reusing computations among different training stages, 16 occlusion-specific classifiers can be trained at only one tenth the cost of one full training. We also show that the test-time cost grows sub-linearly.
Article
Full-text available
Deep learning methods have achieved great success in pedestrian detection, owing to their ability to learn features from raw pixels. However, they mainly capture middle-level representations, such as the pose of a pedestrian, but confuse positives with hard negative samples, which have large ambiguity, e.g. the shape and appearance of a "tree trunk" or "wire pole" are similar to a pedestrian from certain viewpoints. This ambiguity can be distinguished by high-level representation. To this end, this work jointly optimizes pedestrian detection with semantic tasks, including pedestrian attributes (e.g. "carrying backpack") and scene attributes (e.g. "road", "tree", and "horizontal"). Rather than expensively annotating scene attributes, we transfer attribute information from existing scene segmentation datasets to the pedestrian dataset, by proposing a novel deep model to learn high-level features from multiple tasks and multiple data sources. Since distinct tasks have distinct convergence rates and data from different datasets have different distributions, a multi-task objective function is carefully designed to coordinate tasks and reduce discrepancies among datasets. The importance coefficients of tasks and network parameters in this objective function can be iteratively estimated. Extensive evaluations show that the proposed approach outperforms the state-of-the-art on the challenging Caltech and ETH datasets, where it reduces the miss rates of previous deep models by 17 and 5.5 percent, respectively.
Conference Paper
Full-text available
Paper-by-paper results make it easy to miss the forest for the trees. We analyse the remarkable progress of the last decade by discussing the main ideas explored in the 40+ detectors currently present in the Caltech pedestrian detection benchmark. We observe that there exist three families of approaches, all currently reaching similar detection quality. Based on our analysis, we study the complementarity of the most promising ideas by combining multiple published strategies. This new decision forest detector achieves the current best known performance on the challenging Caltech-USA dataset.
Technical Report
Full-text available
In 2010, there were 32,999 people killed, 3.9 million were injured, and 24 million vehicles were damaged in motor vehicle crashes in the United States. The economic costs of these crashes totaled $277 billion. Included in these losses are lost productivity, medical costs, legal and court costs, emergency service costs (EMS), insurance administration costs, congestion costs, property damage, and workplace losses. The $277 billion cost of motor vehicle crashes represents the equivalent of nearly $897 for each of the 308.7 million people living in the United States, and 1.9 percent of the $14.96 trillion real U.S. Gross Domestic Product for 2010. These figures include both police-reported and unreported crashes. When quality of life valuations are considered, the total value of societal harm from motor vehicle crashes in 2010 was $871 billion. Lost market and household productivity accounted for $93 billion of the total $277 billion economic costs, while property damage accounted for $76 billion. Medical expenses totaled $35 billion. Congestion caused by crashes, including travel delay, excess fuel consumption, greenhouse gases and criteria pollutants, accounted for $28 billion. Each fatality resulted in an average discounted lifetime cost of $1.4 million. Public revenues paid for roughly 9 percent of all motor vehicle crash costs, costing taxpayers $24 billion in 2010, the equivalent of over $200 in added taxes for every household in the United States. Alcohol-involved crashes accounted for $59 billion or 21 percent of all economic costs, and 84 percent of these costs occurred in crashes where a driver or non-occupant had a blood alcohol concentration (BAC) of .08 grams per deciliter or greater. Alcohol was the cause of the crash in roughly 82 percent of these cases, causing $49 billion in costs. Crashes in which alcohol levels are BAC of .08 or higher are responsible for over 90 percent of the economic costs and societal harm that occurs in crashes attributable to alcohol use. Crashes in which police indicate that at least one driver was exceeding the legal speed limit or driving too fast for conditions cost $59 billion in 2010.
Seat belt use prevented 12,500 fatalities, 308,000 serious injuries, and $69 billion in injury-related costs in 2010, but the failure of a substantial portion of the driving population to buckle up caused 3,350 unnecessary fatalities, 54,300 serious injuries, and cost society $14 billion in easily preventable injury-related costs. Crashes in which at least one driver was identified as being distracted cost $46 billion in 2010. The report also includes data on the costs associated with motorcycle crashes, failure to wear motorcycle helmets, pedestrian crashes, bicyclist crashes, and numerous different roadway designation crashes.
Conference Paper
Full-text available
We propose a simple yet effective approach to the problem of pedestrian detection which outperforms the current state-of-the-art. Our new features are built on the basis of low-level visual features and spatial pooling. Incorporating spatial pooling improves the translational invariance and thus the robustness of the detection process. We then directly optimise the partial area under the ROC curve (pAUC) measure, which concentrates detection performance in the range of most practical importance. The combination of these factors leads to a pedestrian detector which outperforms all competitors on all of the standard benchmark datasets. We advance state-of-the-art results by lowering the average miss rate from 13% to 11% on the INRIA benchmark, 41% to 37% on the ETH benchmark, 51% to 42% on the TUD-Brussels benchmark and 36% to 29% on the Caltech-USA benchmark.
Article
Full-text available
The current paper proposes recognition of partially invisible objects in images using image enhancement techniques. The problem mainly arises in night vision images, which have poor contrast. Even during daytime, an object captured in direct sunlight may be the only one rendered clearly, while the rest of the scene is not captured properly by the camera. Image enhancement techniques to improve visual quality have been popularized with the proliferation of digital imagery and computers. Histogram Equalization (HE) is a versatile image improvement technique that can be incorporated to make partially visible or invisible objects properly visible. Further, to enrich the information in the HE-enhanced image, Contrast Limited Adaptive Histogram Equalization (CLAHE) is incorporated, and finally, for smoothing, the resulting image is passed through a Gaussian filter. Results on various sets of images show that the two techniques, HE and CLAHE, together with a Gaussian filter, significantly improve image quality and thus help reveal partially visible or invisible objects.
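This pipeline maps directly onto standard OpenCV calls; the snippet below is a plausible rendering with default parameter values, not the paper's exact settings.

# HE -> CLAHE -> Gaussian smoothing, as described in the abstract above.
import cv2

def enhance_low_light(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    he = cv2.equalizeHist(gray)                                    # global histogram equalization
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))    # local, contrast-limited equalization
    enhanced = clahe.apply(he)
    return cv2.GaussianBlur(enhanced, (5, 5), 0)                   # final smoothing

# Usage: result = enhance_low_light("night_frame.png")   # illustrative file name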
Conference Paper
Full-text available
Feature extraction, deformation handling, occlusion handling, and classification are four important components in pedestrian detection. Existing methods learn or design these components either individually or sequentially. The interaction among these components is not yet well explored. This paper proposes that they should be jointly learned in order to maximize their strengths through cooperation. We formulate these four components into a joint deep learning framework and propose a new deep network architecture. By establishing automatic, mutual interaction among components, the deep model achieves a 9% reduction in the average miss rate compared with the current best-performing pedestrian detection approaches on the largest Caltech benchmark dataset.
Conference Paper
Full-text available
We present a new pedestrian detector that improves both in speed and quality over state-of-the-art. By efficiently handling different scales and transferring computation from test time to training time, detection speed is improved. When processing monocular images, our system provides high quality detections at 50 fps. We also propose a new method for exploiting geometric context extracted from stereo images. On a single CPU+GPU desktop machine, we reach 135 fps, when processing street scenes, from rectified input to detections output.
Article
The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. Code is at: https://github.com/facebookresearch/Detectron.
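The loss reshaping itself is compact; a minimal binary-classification sketch using the commonly quoted defaults (gamma = 2, alpha = 0.25) is shown below.

# Focal loss: down-weight the cross-entropy of well-classified examples so
# hard examples dominate training.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)          # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()    # modulated cross-entropy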
Article
We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that's pretty swell. It's a little bigger than last time but more accurate. It's still fast though, don't worry. At 320x320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 mAP@50 in 51 ms on a Titan X, compared to 57.5 mAP@50 in 198 ms by RetinaNet, similar performance but 3.8x faster. As always, all the code is online at https://pjreddie.com/yolo/
Conference Paper
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
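OpenCV ships a HOG descriptor with a pre-trained linear SVM for people detection, which makes the HOG-plus-linear-SVM pipeline described above easy to try on a single frame; the file name below is illustrative.

# HOG descriptor + default people detector (linear SVM) from OpenCV.
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

image = cv2.imread("street_scene.jpg")                   # illustrative file name
rects, weights = hog.detectMultiScale(image, winStride=(8, 8), scale=1.05)
for (x, y, w, h) in rects:
    # Draw each detected pedestrian bounding box.
    cv2.rectangle(image, (int(x), int(y)), (int(x + w), int(y + h)), (0, 255, 0), 2)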
Conference Paper
Semantic part localization can facilitate fine-grained categorization by explicitly isolating subtle appearance differences associated with specific object parts. Methods for pose-normalized representations have been proposed, but generally presume bounding box annotations at test time due to the difficulty of object detection. We propose a model for fine-grained categorization that overcomes these limitations by leveraging deep convolutional features computed on bottom-up region proposals. Our method learns whole-object and part detectors, enforces learned geometric constraints between them, and predicts a fine-grained category from a pose-normalized representation. Experiments on the Caltech-UCSD bird dataset confirm that our method outperforms state-of-the-art fine-grained categorization methods in an end-to-end evaluation without requiring a bounding box at test time.
Conference Paper
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [18], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
Article
Performance is a critical challenge in mobile image processing. Given a reference imaging pipeline, or even human-adjusted pairs of images, we seek to reproduce the enhancements and enable real-time evaluation. For this, we introduce a new neural network architecture inspired by bilateral grid processing and local affine color transforms. Using pairs of input/output images, we train a convolutional neural network to predict the coefficients of a locally-affine model in bilateral space. Our architecture learns to make local, global, and content-dependent decisions to approximate the desired image transformation. At runtime, the neural network consumes a low-resolution version of the input image, produces a set of affine transformations in bilateral space, upsamples those transformations in an edge-preserving fashion using a new slicing node, and then applies those upsampled transformations to the full-resolution image. Our algorithm processes high-resolution images on a smartphone in milliseconds, provides a real-time viewfinder at 1080p resolution, and matches the quality of state-of-the-art approximation techniques on a large class of image operators. Unlike previous work, our model is trained off-line from data and therefore does not require access to the original operator at runtime. This allows our model to learn complex, scene-dependent transformations for which no reference implementation is available, such as the photographic edits of a human retoucher.
Conference Paper
Pedestrian detection is a key problem in computer vision, with several applications including robotics, surveillance and automotive safety. Much of the progress of the past few years has been driven by the availability of challenging public datasets. To continue the rapid rate of innovation, we introduce the Caltech Pedestrian Dataset, which is two orders of magnitude larger than existing datasets. The dataset contains richly annotated video, recorded from a moving vehicle, with challenging images of low resolution and frequently occluded people. We propose improved evaluation metrics, demonstrating that commonly used per-window measures are flawed and can fail to predict performance on full images. We also benchmark several promising detection systems, providing an overview of state-of-the-art performance and a direct, unbiased comparison of existing methods. Finally, by analyzing common failure cases, we help identify future research directions for the field.
Article
How safe are autonomous vehicles? The answer is critical for determining how autonomous vehicles may shape motor vehicle safety and public health, and for developing sound policies to govern their deployment. One proposed way to assess safety is to test drive autonomous vehicles in real traffic, observe their performance, and make statistical comparisons to human driver performance. This approach is logical, but is it practical? In this paper, we calculate the number of miles of driving that would be needed to provide clear statistical evidence of autonomous vehicle safety. Given that current traffic fatalities and injuries are rare events compared to vehicle miles traveled, we show that fully autonomous vehicles would have to be driven hundreds of millions of miles, and sometimes hundreds of billions of miles, to demonstrate their reliability in terms of fatalities and injuries. Under even aggressive testing assumptions, existing fleets would take tens and sometimes hundreds of years to drive these miles, an impossible proposition if the aim is to demonstrate their performance prior to releasing them on the roads for consumer use. These findings demonstrate that developers of this technology and third-party testers cannot simply drive their way to safety. Instead, they will need to develop innovative methods of demonstrating safety and reliability. And yet, the possibility remains that it will not be possible to establish with certainty the safety of autonomous vehicles. Uncertainty will remain. Therefore, it is imperative that autonomous vehicle regulations are adaptive, designed from the outset to evolve with the technology so that society can better harness the benefits and manage the risks of these rapidly evolving and potentially transformative technologies.
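The flavour of this calculation can be reproduced with a back-of-the-envelope "rule of three" estimate (a hedged illustration, not the paper's exact analysis): observing zero fatalities over n miles bounds the fatality rate at roughly 3/n with 95% confidence.

# Rough estimate of fatality-free miles needed to match the human driving
# fatality rate, and how long an assumed test fleet would take to drive them.
human_fatality_rate = 1.09 / 1e8                 # approx. US fatalities per vehicle-mile (assumed figure)
miles_needed = 3 / human_fatality_rate           # rule of three: ~275 million fatality-free miles
print(f"{miles_needed / 1e6:.0f} million fatality-free miles needed")

fleet_size, hours_per_day, avg_speed_mph = 100, 24, 25   # assumed round-the-clock test fleet
miles_per_year = fleet_size * hours_per_day * 365 * avg_speed_mph
print(f"about {miles_needed / miles_per_year:.1f} years of continuous test driving")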
Conference Paper
Detecting pedestrians has arguably been addressed as a special topic beyond general object detection. Although recent deep learning object detectors such as Fast/Faster R-CNN have shown excellent performance for general object detection, they have had limited success in detecting pedestrians, and previous leading pedestrian detectors were in general hybrid methods combining hand-crafted and deep convolutional features. In this paper, we investigate issues involving Faster R-CNN for pedestrian detection. We discover that the Region Proposal Network (RPN) in Faster R-CNN indeed performs well as a stand-alone pedestrian detector, but surprisingly, the downstream classifier degrades the results. We argue that two reasons account for the unsatisfactory accuracy: (i) insufficient resolution of feature maps for handling small instances, and (ii) lack of any bootstrapping strategy for mining hard negative examples. Driven by these observations, we propose a very simple but effective baseline for pedestrian detection, using an RPN followed by boosted forests on shared, high-resolution convolutional feature maps. We comprehensively evaluate this method on several benchmarks (Caltech, INRIA, ETH, and KITTI), presenting competitive accuracy and good speed. Code will be made publicly available.
Article
Single image haze removal is a challenging ill-posed problem. Existing methods use various constraints/priors to get plausible dehazing solutions. The key to achieve haze removal is to estimate a medium transmission map for an input hazy image. In this paper, we propose a trainable end-to-end system called DehazeNet, for medium transmission estimation. DehazeNet takes a hazy image as input, and outputs its medium transmission map that is subsequently used to recover a haze-free image via atmospheric scattering model. DehazeNet adopts Convolutional Neural Networks (CNN) based deep architecture, whose layers are specially designed to embody the established assumptions/priors in image dehazing. Specifically, layers of Maxout units are used for feature extraction, which can generate almost all haze-relevant features. We also propose a novel nonlinear activation function in DehazeNet, called Bilateral Rectified Linear Unit (BReLU), which is able to improve the quality of recovered haze-free image. We establish connections between components of the proposed DehazeNet and those used in existing methods. Experiments on benchmark images show that DehazeNet achieves superior performance over existing methods, yet keeps efficient and easy to use.
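Given an estimated transmission map and the atmospheric light (DehazeNet supplies the former), recovering the haze-free image is a direct inversion of the scattering model I = J*t + A*(1 - t); a sketch with assumed inputs:

# Invert the atmospheric scattering model to recover the scene radiance J,
# given the hazy image I, a transmission map t, and the atmospheric light A.
import numpy as np

def recover_scene(hazy, transmission, airlight, t_min=0.1):
    """hazy: float image in [0, 1]; transmission: per-pixel map; airlight: scalar or per-channel."""
    t = np.clip(transmission, t_min, 1.0)[..., None]     # floor t to avoid amplifying noise
    return np.clip((hazy - airlight) / t + airlight, 0.0, 1.0)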
Book
For the past hundred years, innovation within the automotive sector has created safer, cleaner, and more affordable vehicles, but progress has been incremental. The industry now appears close to substantial change, engendered by autonomous, or "self-driving," vehicle technologies. This technology offers the possibility of significant benefits to social welfare — saving lives; reducing crashes, congestion, fuel consumption, and pollution; increasing mobility for the disabled; and ultimately improving land use. This report is intended as a guide for state and federal policymakers on the many issues that this technology raises. After surveying the advantages and disadvantages of the technology, RAND researchers determined that the benefits of the technology likely outweigh the disadvantages. However, many of the benefits will accrue to parties other than the technology's purchasers. These positive externalities may justify some form of subsidy. The report also explores policy issues, communications, regulation and standards, and liability issues raised by the technology; and concludes with some tentative guidance for policymakers, guided largely by the principle that the technology should be allowed and perhaps encouraged when it is superior to an average human driver.
Article
Boosted decision trees are among the most popular learning techniques in use today. While exhibiting fast speeds at test time, relatively slow training renders them impractical for applications with real-time learning requirements. We propose a principled approach to overcome this drawback. We prove a bound on the error of a decision stump given its preliminary error on a subset of the training data; the bound may be used to prune unpromising features early in the training process. We propose a fast training algorithm that exploits this bound, yielding speedups of an order of magnitude at no cost in the final performance of the classifier. Our method is not a new variant of Boosting; rather, it is used in conjunction with existing Boosting algorithms and other sampling methods to achieve even greater speedups.
Article
In surveillance, monitoring and tactical reconnaissance, gathering the right visual information from a dynamic environment and accurately processing such data are essential ingredients to making informed decisions which determine the success of an operation. Camera sensors are often cost-limited in their ability to clearly capture objects without defects in images or videos taken in a poorly lit environment. The goal in many applications is to enhance the brightness and contrast and reduce the noise content of such images in an on-board, real-time manner. We propose a deep autoencoder-based approach to identify signal features from low-light images and adaptively brighten them without over-amplifying the lighter parts of the image (i.e., without saturation of image pixels) in high dynamic range. We show that a variant of the recently proposed stacked-sparse denoising autoencoder can learn to adaptively enhance and denoise from synthetically darkened and noisy training examples. The network can then be successfully applied to images from naturally low-light environments and/or hardware-degraded images. Results show significant credibility of deep learning based approaches both visually and by quantitative comparison with various popular enhancing, state-of-the-art denoising and hybrid enhancing-denoising techniques.
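A toy version of the training idea, synthetically darkening clean patches and regressing back to them with a small autoencoder, might look like the following; the architecture and corruption parameters are illustrative only.

# Train an autoencoder to map synthetically darkened, noisy patches back to
# their clean originals (one illustrative optimisation step).
import torch
import torch.nn as nn

class LowLightAutoencoder(nn.Module):
    def __init__(self, patch=17 * 17):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(patch, 512), nn.ReLU(),
                                     nn.Linear(512, 256), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(256, 512), nn.ReLU(),
                                     nn.Linear(512, patch), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

def darken(clean, gamma=3.0, noise_std=0.05):
    # Gamma darkening plus Gaussian noise simulates a low-light capture.
    return (clean ** gamma + noise_std * torch.randn_like(clean)).clamp(0, 1)

model, loss_fn = LowLightAutoencoder(), nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
clean = torch.rand(64, 17 * 17)                 # stand-in for real image patches
loss = loss_fn(model(darken(clean)), clean)     # reconstruct the clean patch
loss.backward()
opt.step()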
Article
While convolutional neural network (CNN) architectures have achieved great success in various vision tasks, the critical scale problem is still much under-explored, especially for pedestrian detection. Current approaches mainly focus on using large numbers of training images with different scales to improve the network capability or result fusions by multi-scale crops of images during testing. Designing a CNN architecture that can intrinsically capture the characteristics of large-scale and small-scale objects and also retain the scale invariance property is still a very challenging problem. In this paper, we propose a novel scale-aware Fast R-CNN to handle the detection of small object instances which are very common in pedestrian detection. Our architecture incorporates a large-scale sub-network and a small-scale sub-network into a unified architecture by leveraging the scale-aware weighting during training. The heights of object proposals are utilized to specify different scale-aware weights for the two sub-networks. Extensive evaluations on the challenging Caltech dataset demonstrate the superiority of the proposed architecture over the state-of-the-art methods. In particular, the miss rate on the Caltech dataset is reduced to 9.68% by our method, significantly smaller than 11.75% by CompACT-Deep and 20.86% by TA-CNN.
Article
We present YOLO, a unified pipeline for object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Our unified architecture is also extremely fast; YOLO processes images in real-time at 45 frames per second, hundreds to thousands of times faster than existing detection systems. Our system uses global image context to detect and localize objects, making it less prone to background errors than top detection systems like R-CNN. By itself, YOLO detects objects at unprecedented speeds with moderate accuracy. When combined with state-of-the-art detectors, YOLO boosts performance by 2-3 percentage points of mAP.
Article
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
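For experimentation, torchvision's Faster R-CNN implementation (an RPN and detection head sharing one backbone) can be run on a single frame as follows; this is a usage sketch of the library model, not the paper's original code, and it assumes torchvision >= 0.13.

# Run a pre-trained Faster R-CNN on one frame and keep confident detections.
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

frame = torch.rand(3, 480, 640)            # stand-in for a normalised camera frame
with torch.no_grad():
    det = model([frame])[0]                # dict with "boxes", "labels", "scores"
keep = det["scores"] > 0.5
print(det["boxes"][keep], det["labels"][keep])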
Article
This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
Article
Multi-resolution image features may be approximated via extrapolation from nearby scales, rather than being computed explicitly. This fundamental insight allows us to design object detection algorithms that are as accurate, and considerably faster, than the state-of-the-art. The computational bottleneck of many modern detectors is the computation of features at every scale of a finely-sampled image pyramid. Our key insight is that one may compute finely sampled feature pyramids at a fraction of the cost, without sacrificing performance: for a broad family of features we find that features computed at octave-spaced scale intervals are sufficient to approximate features on a finely-sampled pyramid. Extrapolation is inexpensive as compared to direct feature computation. As a result, our approximation yields considerable speedups with negligible loss in detection accuracy. We modify three diverse visual recognition systems to use fast feature pyramids and show results on both pedestrian detection (measured on the Caltech, INRIA, TUD-Brussels and ETH data sets) and general object detection (measured on the PASCAL VOC). The approach is general and is widely applicable to vision algorithms requiring fine-grained multi-scale analysis. Our approximation is valid for images with broad spectra (most natural images) and fails for images with narrow band-pass spectra (e.g., periodic textures).
Conference Paper
We propose a simple yet effective detector for pedestrian detection. The basic idea is to incorporate common sense and everyday knowledge into the design of simple and computationally efficient features. As pedestrians usually appear upright in image or video data, the problem of pedestrian detection is considerably simpler than general purpose people detection. We therefore employ a statistical model of the upright human body where the head, the upper body, and the lower body are treated as three distinct components. Our main contribution is to systematically design a pool of rectangular templates that are tailored to this shape model. As we incorporate different kinds of low-level measurements, the resulting multi-modal and multi-channel Haar-like features represent characteristic differences between parts of the human body yet are robust against variations in clothing or environmental settings. Our approach avoids exhaustive searches over all possible configurations of rectangle features and neither relies on random sampling. It thus marks a middle ground among recently published techniques and yields efficient, low-dimensional yet highly discriminative features. Experimental results on the INRIA and Caltech pedestrian datasets show that our detector reaches state-of-the-art performance at low computational costs and that our features are robust against occlusions.
Article
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.