Conference Paper

Real-Time Pedestrian Detection with Deep Network Cascades


... Great strides in pedestrian detection research [3] have been made for challenging situations, such as cluttered background, substantial occlusions and tiny target appearance. As for many other computer vision tasks, in the last few years significant performance gains have been achieved thanks to approaches based on deep networks [21,1,17,32]. Additionally, the adoption of novel sensors, e.g. ...
... Due to its relevance in many fields, such as robotics and video surveillance, the problem of pedestrian detection has received considerable interest in the research community. Over the years, a large variety of features and algorithms have been proposed for improving detection systems, both with respect to speed [34,2,1,17] and accuracy [41,22,46,47,10,32]. ...
... Recently, notable performance gains have been achieved with the adoption of powerful deep networks [21,1], thanks to their ability to learn discriminative features directly from raw pixels. In [26], a CNN pre-trained with an unsupervised method based on convolutional sparse coding was presented. ...
Preprint
This paper presents a novel method for detecting pedestrians under adverse illumination conditions. Our approach relies on a novel cross-modality learning framework and it is based on two main phases. First, given a multimodal dataset, a deep convolutional network is employed to learn a non-linear mapping, modeling the relations between RGB and thermal data. Then, the learned feature representations are transferred to a second deep network, which receives as input an RGB image and outputs the detection results. In this way, features which are both discriminative and robust to bad illumination conditions are learned. Importantly, at test time, only the second pipeline is considered and no thermal data are required. Our extensive evaluation demonstrates that the proposed approach outperforms the state-of-the-art on the challenging KAIST multispectral pedestrian dataset and it is competitive with previous methods on the popular Caltech dataset.
... It uses deep convolutional networks to efficiently classify objects with increased speed and accuracy. Angelova et al. [15] designed a real-time pedestrian detection system using deep convolutional networks. A tiny model was used to reject a multitude of easy negatives, and the remaining hard proposals were then classified by a large deep network. ...
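The two-stage idea described in this excerpt can be illustrated with a minimal sketch. It is not the authors' implementation: the window format, the scorer callables `tiny_score` and `deep_score`, and both thresholds are hypothetical names and values chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical proposal format: (x, y, width, height).
Window = Tuple[int, int, int, int]


def cascade_detect(
    windows: Sequence[Window],
    tiny_score: Callable[[Window], float],
    deep_score: Callable[[Window], float],
    reject_below: float = 0.1,  # assumed stage-1 rejection threshold
    accept_at: float = 0.5,     # assumed stage-2 acceptance threshold
) -> List[Tuple[Window, float]]:
    """Score every window cheaply, then run the costly scorer on survivors only."""
    detections = []
    for window in windows:
        # Stage 1: the tiny model discards easy negatives at low cost.
        if tiny_score(window) < reject_below:
            continue
        # Stage 2: the large model classifies the remaining hard proposals.
        score = deep_score(window)
        if score >= accept_at:
            detections.append((window, score))
    return detections
```

The speed benefit in such a design comes from how rarely `deep_score` is called: the more windows the first stage rejects, the fewer evaluations the expensive model performs.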
... A highly accurate, fast, simple and effective pedestrian detection system
12 | Girshick [14] | Spatial pyramid pooling networks (SPPnet), region-based convolutional networks (R-CNN) and DNN | Efficient, high-speed and accurate object classification
13 | Angelova et al. [15] | A cascade of deep convolutional networks | Real-time pedestrian detection; maximized rejection of easy negatives
14 | Zhu et al. [16] | Deterministic hidden neurons, random hidden neurons and regularization with a multi-view perceptron neural model | Easy identification and rebuilding of images under multiple views
15 | Sermanet et al. [17] | Convolutional filter bank, high and low resolution features in model and INRIA dataset. ...
Article
Deep learning (DL) in artificial intelligence (AI) is a hot topic in the data science world. It has become crucial as many public and commercial businesses amass huge amounts of domain-specific data, which can provide useful information on issues like fraud detection, national intelligence, cyber security, medical informatics and marketing. Microsoft, Google, Twitter and Amazon Web Services, for example, are evaluating huge amounts of data for business analysis and decision making, which has an impact on current and future technologies. Through the hierarchical learning process, DL algorithms extract high-level, complex abstractions as data representations. Scene interpretation, video surveillance, robots, and self-driving systems are just a few of the many applications that have prompted extensive study in the field of computer vision in the last decade. Visual recognition systems, which include picture categorization, localization, and detection, are at the heart of all of these applications and have gathered a lot of research attention. These visual identification algorithms have achieved extraordinary performance, thanks to considerable advancements in neural networks, in particular deep learning. Object tracking and detection are among the areas where computer vision has had a lot of success. The aim of this paper is to analyze the applications and challenges of DL algorithms for object detection in the last ten years. The state-of-the-art object detection methods, including video tracking, are discussed. The results of the achievements of various DL algorithms in object and video tracking and detection were analyzed, and the prospects of DL techniques for object detection were also discussed. The chapter concluded with the highlight of future directions in object tracking and detection with the use of DL algorithms.
... In summary, this paper constructs a multimodal pedestrian detection network based on Faster-RCNN and improves it to explore multimodal pedestrian detectors. The main contributions are as follows: (1) The feature extraction network is optimized, and structures such as 1 × 1 convolution and dilated convolution are introduced to enhance the expression of the network feature layer. In addition, the ROIAlign method replaces the ROIPooling method to map the candidate box to the feature layer to eliminate the quantization loss and improve the detection ability of small target pedestrians. ...
... Many different strategies have been put forward in recent years to improve the performance of visible light pedestrian detectors, primarily to address the issues of pedestrian occlusion, congestion, and scale difference. Angelova et al. [1] use a cascade classifier to increase the accuracy of deep neural networks. The disadvantage is that time consumption will increase as the image size increases. ALFNet was proposed by Liu et al. [15], expanding the SSD target identification method with cascade and multistage ideas. ...
Article
To address the problem that pedestrian detection in autonomous driving systems is vulnerable to the external environment, and that detectors based on a single sensor modality perform poorly under large changes in unconstrained illumination, this paper proposes a pedestrian detection method that fuses visible-light and thermal infrared modalities. Firstly, 1 × 1 convolution and dilated convolution are introduced in the residual network, and the ROI Align method is used in place of the ROI Pooling method to map the candidate box to the feature layer, optimizing Faster R-CNN. Secondly, the generalized intersection over union (GIoU) loss is used as the loss function for bounding-box localization regression. Finally, based on the improved Faster R-CNN, four types of multimodal neural network structures are designed to fuse visible and thermal infrared images. According to experimental findings, the proposed technique outperforms current mainstream detection algorithms on the KAIST dataset. Compared with the conventional ACF + T + THOG pedestrian detector, the AP is 8.38 percentage points higher. Compared with the visible-light pedestrian detector, the miss rate is 5.34 percentage points lower.
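For readers unfamiliar with the generalized intersection over union (GIoU) mentioned in this abstract, the following sketch computes the standard GIoU and the commonly used loss `1 - GIoU` for two axis-aligned boxes. The `(x1, y1, x2, y2)` box format and the function names are assumptions for illustration; the abstract does not specify how the paper implements the loss.

```python
from typing import Tuple

# Assumed box format: (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
Box = Tuple[float, float, float, float]


def giou(box_a: Box, box_b: Box) -> float:
    """Generalized IoU of two non-degenerate axis-aligned boxes, in (-1, 1]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Overlap rectangle (zero area when the boxes do not intersect).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter

    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    iou = inter / union
    # Penalize the part of the enclosing box not covered by the union.
    return iou - (enclose - union) / enclose


def giou_loss(box_a: Box, box_b: Box) -> float:
    """Loss form: 0 for identical boxes, approaching 2 for distant boxes."""
    return 1.0 - giou(box_a, box_b)
```

Unlike plain IoU, this quantity still varies when the boxes do not overlap, which is the usual motivation for using it as a regression loss.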
... In this class of method, the work by [Lukas et al., 2006] also achieved interesting results at the cost of highly demanding computation, though [Cattaneo et al., 2017] proposed a feasible distributed and scalable implementation of it. Deep learning, in particular CNNs, has shown quite good performance in several computer vision tasks such as facial recognition (see [Parkhi et al., 2015]), pedestrian detection (see [Angelova et al., 2015]) or handwriting recognition (see [Elleuch and Kherallah, 2016]). Unlike the previously discussed works based on feature extraction and classical machine learning algorithms, deep learning relies on the ability of the input data to drive its own feature extraction process. ...
Preprint
In the present paper, we propose a source camera identification method for mobile devices based on deep learning. Recently, convolutional neural networks (CNNs) have shown remarkable performance on several tasks such as image recognition, video analysis or natural language processing. A CNN consists of a set of layers, where each layer is composed of a set of high-pass filters that are applied all over the input image. This convolution process provides the unique ability to extract features automatically from data and to learn from those features. Our proposal describes a CNN architecture which is able to infer the noise pattern of mobile camera sensors (also known as camera fingerprint) with the aim of detecting and identifying not only the mobile device used to capture an image (with 98% accuracy), but also from which embedded camera the image was captured. More specifically, we provide an extensive analysis of the proposed architecture considering different configurations. The experiment has been carried out using images captured from different mobile device cameras (the MICHE-I dataset was used), and the obtained results have proved the robustness of the proposed method.
... Shrivastava et al. [26] make the traditional boosting algorithm available on deep networks, which achieves higher accuracy and maintains the same detection speed. Similar to ours, Angelova et al. [1] is based on a sliding window and processes different image regions independently. However, recent deep detectors use fully-convolutional networks and take the whole image as input. ...
Preprint
Object detection aims at high speed and accuracy simultaneously. However, fast models are usually less accurate, while accurate models cannot satisfy our need for speed. A fast model can be 10 times faster but 50% less accurate than an accurate model. In this paper, we propose Adaptive Feeding (AF) to combine a fast (but less accurate) detector and an accurate (but slow) detector, by adaptively determining whether an image is easy or hard and choosing an appropriate detector for it. In practice, we build a cascade of detectors, including the AF classifier, which makes the easy vs. hard decision, and the two detectors. The AF classifier can be tuned to obtain different tradeoffs between speed and accuracy, has negligible training time and requires no additional training data. Experimental results on the PASCAL VOC, MS COCO and Caltech Pedestrian datasets confirm that AF has the ability to achieve comparable speed as the fast detector and comparable accuracy as the accurate one at the same time. As an example, by combining the fast SSD300 with the accurate SSD500 detector, AF leads to 50% speedup over SSD500 with the same precision on the VOC2007 test set.
... Chai and Hodgins obtain sufficient quality to drive virtual avatars in real-time, but require visual markers [Chai and Hodgins 2005]. The use of CNNs in real time has been explored for variants of the object detection problem, for instance bounding box detection and pedestrian detection methods have leveraged application specific architectures [Angelova et al. 2015;Liu et al. 2016;Redmon et al. 2015] and preprocessing steps [Ren et al. 2015]. ...
Preprint
We present the first real-time method to capture the full global 3D skeletal pose of a human in a stable, temporally consistent manner using a single RGB camera. Our method combines a new convolutional neural network (CNN) based pose regressor with kinematic skeleton fitting. Our novel fully-convolutional pose formulation regresses 2D and 3D joint positions jointly in real time and does not require tightly cropped input frames. A real-time kinematic skeleton fitting method uses the CNN output to yield temporally stable 3D global pose reconstructions on the basis of a coherent kinematic skeleton. This makes our approach the first monocular RGB method usable in real-time applications such as 3D character control---thus far, the only monocular methods for such applications employed specialized RGB-D cameras. Our method's accuracy is quantitatively on par with the best offline 3D monocular RGB pose estimation methods. Our results are qualitatively comparable to, and sometimes better than, results from monocular RGB-D approaches, such as the Kinect. However, we show that our approach is more broadly applicable than RGB-D solutions, i.e. it works for outdoor scenes, community videos, and low quality commodity RGB cameras.
... ConvNet [28] used convolutional sparse auto-encoders to initialize the layer parameters, then fine-tuned on a small pedestrian dataset. VeryFast [29] built a convnet cascade, in which a tiny convnet is adopted to filter candidates before passing them through to a deep convnet. F-DNN [11] used a soft cascade mechanism and fused multiple convnets in the second cascade stage. ...
Preprint
Multispectral images of color-thermal pairs have proven more effective than a single color channel for pedestrian detection, especially under challenging illumination conditions. However, there is still a lack of studies on how to fuse the two modalities effectively. In this paper, we deeply compare six different convolutional network fusion architectures and analyse their adaptations, enabling a vanilla architecture to obtain detection performances comparable to the state-of-the-art results. Further, we discover that pedestrian detection confidences from color or thermal images are correlated with illumination conditions. With this in mind, we propose an Illumination-aware Faster R-CNN (IAF R-CNN). Specifically, an Illumination-aware Network is introduced to give an illumination measure of the input image. Then we adaptively merge color and thermal sub-networks via a gate function defined over the illumination value. The experimental results on KAIST Multispectral Pedestrian Benchmark validate the effectiveness of the proposed IAF R-CNN.
... For example, a smaller, almost shallow network was trained to greatly reduce the initially large number of candidate regions produced by the sliding window. Then in a second step, only high confidence regions were passed through a deep network obtaining in this way a trade-off between speed and accuracy [56]. The idea of cascading any kind of features of different complexity, including deeply learnt features was addressed by seeking an algorithm for optimal cascade learning under a criterion that penalizes both detection errors and complexity. ...
Preprint
Automatic emotion recognition has become a trending research topic in the past decade. While works based on facial expressions or speech abound, recognizing affect from body gestures remains a less explored topic. We present a new comprehensive survey hoping to boost research in the field. We first introduce emotional body gestures as a component of what is commonly known as "body language" and comment on general aspects such as gender differences and culture dependence. We then define a complete framework for automatic emotional body gesture recognition. We introduce person detection and comment on static and dynamic body pose estimation methods both in RGB and 3D. We then comment on the recent literature related to representation learning and emotion recognition from images of emotionally expressive gestures. We also discuss multi-modal approaches that combine speech or face with body gestures for improved emotion recognition. While pre-processing methodologies (e.g. human detection and pose estimation) are nowadays mature technologies fully developed for robust large scale analysis, we show that for emotion recognition the quantity of labelled data is scarce, there is no agreement on clearly defined output spaces and the representations are shallow and largely based on naive geometrical representations.
... Han et al. (15) proposed a fusion system combining LIDAR and a color camera for detection, and improved the detection accuracy by improving the YOLO algorithm. Kuang et al. (16) effectively improved the performance of pedestrian detection by extending the original YOLOv3 structure and newly defining the loss function. Sermanet et al. (17) proposed a convolutional sparse coding unsupervised model for pedestrian detection. Angelova et al. (18) merged the concept of fast cascades and deep networks to achieve pedestrian detection. Li et al. (19) detect pedestrians using multiple built-in subnet adaptive scales. Cai et al. (20) combined features of highly diverse complexity in order to design a complexity-aware cascade detector that could be used for detecting pedestrians. Wang et al. (21) suggested a definition for the Repulsion loss function, which could be utilized for detecting pedestrians who are occluded. ...
Article
The ability to accurately detect pedestrians in the area of interest in real time is crucial in the field of autonomous driving. An improved YOLOv3 model is proposed for pedestrian detection. Firstly, a lightweight model that incorporates a residual network module approach and a CBAM attention mechanism is added to the structure to enhance the feature representation capability of the network. Experimental results show that the improved YOLOv3 target detection model raises the detection accuracy by 4% compared to the original algorithm, and the accuracy precision is improved to a large extent, which verifies the feasibility and effectiveness of the improved YOLOv3 model for pedestrian detection.
... In recent years, numerous researchers have applied dynamic structures in computer vision. Object detection is its biggest application scenario, including face detection [224], [225], pedestrian detection [227], action detection [229], and general models [118], [165]. In addition, it has been applied to image segmentation [232], [233], super-resolution [240], [241], image denoising [236], style transfer [235], and so on. ...
Article
The dynamic neural network (DNN), in contrast to the static counterpart, offers numerous advantages, such as improved accuracy, efficiency, and interpretability. These benefits stem from the network’s flexible structures and parameters, making it highly attractive and applicable across various domains. As the broad learning system (BLS) continues to evolve, DNNs have expanded beyond deep learning (DL), orienting a more comprehensive range of domains. Therefore, this comprehensive review article focuses on two prominent areas where DNN structures have rapidly developed: 1) DL and 2) broad learning. This article provides an in-depth exploration of the techniques related to dynamic construction and inference. Furthermore, it discusses the applications of DNNs in diverse domains while also addressing open issues and highlighting promising research directions. By offering a comprehensive understanding of DNNs, this article serves as a valuable resource for researchers, guiding them toward future investigations.
... Benenson et al. 28,29 used existing detectors with mainly decision forests over hand-crafted feature outputs and re-scored them with plus-bounding box regression. [30][31][32][33] However, these methods heavily rely on manually extracted features and often suffer from limited generalization to different domains due to the lack of adaptability. ...
Article
Datasets collected under different sensors, viewpoints, or weather conditions cause different domains. Models trained on domain A applied to tasks of domain B result in low performance. To overcome the domain shift, we propose an unsupervised pedestrian detection method that utilizes CycleGAN to establish an intermediate domain and transform a large gap domain-shift problem into two feature alignment subtasks with small gaps. The intermediate domain trained with labels from domain A, after two rounds of feature alignment using adversarial learning, can facilitate effective detection in domain B. To further enhance the training quality of intermediate domain models, Image Quality Assessment (IQA) is incorporated. The experimental results evaluated on Citypersons, KITTI, and BDD100K show that MR of 24.58%, 33.66%, 28.27%, and 28.25% were achieved in four cross-domain scenarios. Compared with typical pedestrian detection models, our proposed method can better overcome the domain-shift problem and achieve competitive results.
... However, the other research thread of employing decision cascades is directly relevant to our work. Decision Cascades, originally introduced in (Cai, Saberian, and Vasconcelos 2015;Angelova et al. 2015) were recently re-popularized by the work of IDK Cascades (Wang et al. 2017). The IDK cascade framework imposes a sequential model architecture, where each model is queried in order of increasing complexity until prediction confidence exceeds a threshold. ...
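The sequential querying described in this excerpt can be sketched as follows. This is an illustrative reading of the text, not code from the IDK Cascades work: the model interface (a callable returning a label and a confidence), the function name, and the default threshold are all assumptions.

```python
from typing import Any, Callable, Sequence, Tuple

# Assumed model interface: input -> (predicted label, confidence in [0, 1]).
Model = Callable[[Any], Tuple[Any, float]]


def confidence_cascade(
    models: Sequence[Model],
    x: Any,
    threshold: float = 0.9,  # assumed confidence threshold
) -> Tuple[Any, float, int]:
    """Query models in the given order (cheapest first) until one is confident.

    Returns (label, confidence, index of the model that answered). If no model
    exceeds the threshold, the last (most complex) model's answer is returned.
    """
    if not models:
        raise ValueError("at least one model is required")
    for index, model in enumerate(models):
        label, confidence = model(x)
        if confidence > threshold:
            break
    return label, confidence, index
```

Each model that fails to reach the threshold still costs a full forward pass, which is the "wasteful intermediate computation" that non-cascading alternatives aim to avoid.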
Article
Deep learning architectures have achieved state-of-the-art (SOTA) performance on computer vision tasks such as object detection and image segmentation. This may be attributed to the use of over-parameterized, monolithic deep learning architectures executed on large datasets. Although such large architectures lead to increased accuracy, this is usually accompanied by a larger increase in computation and memory requirements during inference. While this is a non-issue in traditional machine learning (ML) pipelines, the recent confluence of machine learning and fields like the Internet of Things (IoT) has rendered such large architectures infeasible for execution in low-resource settings. For some datasets, large monolithic pipelines may be overkill for simpler inputs. To address this problem, previous efforts have proposed decision cascades where inputs are passed through models of increasing complexity until desired performance is achieved. However, we argue that cascaded prediction leads to sub-optimal throughput and increased computational cost due to wasteful intermediate computations. To address this, we propose PaSeR (Parsimonious Segmentation with Reinforcement Learning) a non-cascading, cost-aware learning pipeline as an efficient alternative to cascaded decision architectures. Through experimental evaluation on both real-world and standard datasets, we demonstrate that PaSeR achieves better accuracy while minimizing computational cost relative to cascaded models. Further, we introduce a new metric IoU/GigaFlop to evaluate the balance between cost and performance. On the real-world task of battery material phase segmentation, PaSeR yields a minimum performance improvement of 174% on the IoU/GigaFlop metric with respect to baselines. We also demonstrate PaSeR's adaptability to complementary models trained on a noisy MNIST dataset, where it achieved a minimum performance improvement on IoU/GigaFlop of 13.4% over SOTA models. 
Code and data are available at github.com/scailab/paser.
... Many of the generic object detection techniques were used as a base for modern pedestrian detection methods. One of the first CNN-based methods was proposed by Angelova et al. [43]. Cascade classifiers and deep neural network features were used, resulting in a fast and accurate method, that runs in real-time on the Caltech Pedestrian detection benchmark. ...
Article
Pedestrian detection based on deep learning methods has achieved great success in the past few years, with several possible real-world applications including autonomous driving, robotic navigation, and video surveillance. In this work, a new neural network two-stage pedestrian detector with a new custom classification head, adding the triplet loss function to the standard bounding box regression and classification losses, is presented. This aims to improve the domain generalization capabilities of existing pedestrian detectors, by explicitly maximizing inter-class distance and minimizing intra-class distance. Triplet loss is applied to the features generated by the region proposal network, aimed at clustering together pedestrian samples in the feature space. We used Faster R-CNN and Cascade R-CNN with the HRNet backbone pre-trained on ImageNet, changing the standard classification head for Faster R-CNN, and changing one of the three heads for Cascade R-CNN. The best results were obtained using a progressive training pipeline, starting from a dataset that is further away from the target domain, and progressively fine-tuning on datasets closer to the target domain. We obtained state-of-the-art results, MR−2 of 9.9, 11.0, and 36.2 for the reasonable, small, and heavy subsets on the CityPersons benchmark with outstanding performance on the heavy subset, the most difficult one.
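As background for the triplet loss named in this abstract, the sketch below shows the standard margin-based formulation on plain feature vectors. The Euclidean distance, the margin value, and the function name are assumptions; the abstract does not state which distance or margin the authors use.

```python
import math
from typing import Sequence


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # assumed margin
) -> float:
    """Standard triplet loss: max(0, d(anchor, positive) - d(anchor, negative) + margin).

    The loss is zero once the negative is at least `margin` farther from the
    anchor than the positive, which pulls same-class features together and
    pushes different-class features apart.
    """
    d_pos = math.dist(anchor, positive)  # Euclidean distance (assumed metric)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

For example, with an anchor at `(0, 0)`, a positive at `(0, 1)` and a negative at `(3, 4)`, the negative is already far enough away and the loss is zero.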
... Pedestrian detection reached notably high performance levels over the past decade, mainly due to the advances in object detection domain, e.g. [2,19,29,37,38,40,49]. Nowadays, the detection performance is sufficiently high such that most tracking approaches rely on the tracking-by-detection paradigm, which poses tracking as an association problem, e.g. ...
Chapter
A pedestrian detection system in a traffic light controller is studied. The system is based on Deep Neural Networks (DNNs). We explore several network architectures and hardware platforms to identify the most suitable solution under the given constraints of latency, cost, and precision. Specifically, we study altogether 13 networks from the MobileNet, Yolo, ResNet, and EfficientDet families and 6 platforms based on Nvidia and Intel platforms, conducting 383 experiments. We find that several network-platform combinations meet the given requirements of maximum 100 ms inference latency and 0.9 mean average precision. The most promising are Yolo v5 networks on Nvidia Jetson TX2 and IntelNUC GPU hardware.
... The technique performs better than most of the existing techniques while also outperforming others in terms of speed. Angelova et al. (2015) [32] designed an intelligence-based system for real-time pedestrian detection using a cascade of DCNNs. A tiny model was used to reject a large number of easy negatives, and a large DNN was used to classify the remaining hard proposals. ...
Chapter
Object detection is the process of using a camera to track an object or a group of objects over time. It is sometimes referred to as object tracking. It can be used for a variety of things, including human-computer interactions (HCI), security and surveillance, video communication, subfield of works, lane detection (LD), pedestrian detection (PD), traffic light detection (TLD), traffic sign detection (TSD), vehicle detection (VD), and object detection from compressed video in public places such as airports. In recent times, object tracking has become a popular topic in computer science, particularly in the data science community, thanks to the usage of deep learning (DL) in artificial intelligence (AI). It has become essential as numerous government and private organizations gather enormous amounts of domain-specific data, which can offer insightful data on topics like marketing, national intelligence, cybersecurity, and fraud detection. In the last decades, these applications, including core functions of image categorization, localization, and detection, have attracted a lot of study interest. Because of significant developments in neural networks, particularly DL, these visual identification algorithms have attained amazing performance. DL, with the convolutional neural network as one of its techniques, usually uses two-stage detection methods in TLD. Despite all the successes recorded in TLD through the use of two-stage detection methods, no study has analyzed these methods experimentally, examining their strengths and weaknesses to inform researchers. Based on this need, this chapter analyzes the applications and challenges of DL techniques in TLD. In addition, object detection for TLD using five distinct two-stage detection methods with the LARA traffic light dataset, using a Jupyter Notebook and the sklearn libraries, is implemented.
The achievements of two-stage detection methods in TLD are highlighted using standard performance metrics, and it was observed that FASTER-CNN was the best in detection accuracy, F1-score, precision, recall, and running time with 0.89, 0.93, 0.83, 0.90, and 32 s, respectively.
... Angelova et al. (2015) [24] designed an intelligence-based system for real-time pedestrian detection using a cascade of DCNNs. A tiny model was used to reject a large number of easy negatives, and a large DNN was used to classify the remaining hard proposals. ...
Article
Using a camera to monitor an object or a group of objects over time is the process of object detection. It can be used for a variety of things, including security and surveillance, video communication, traffic light detection (TLD), and object detection from compressed video in public places. In recent times, object tracking has become a popular topic in computer science, particularly in the data science community, thanks to the usage of deep learning (DL) in artificial intelligence (AI). DL, with the convolutional neural network (CNN) as one of its techniques, usually uses two-stage detection methods in TLD. Despite all the successes recorded in TLD through the use of two-stage detection methods, no study has analyzed these methods experimentally, examining their strengths and weaknesses. Based on this need, this study analyses the applications of DL techniques in TLD. We implemented object detection for TLD using 5 two-stage detection methods with the traffic light dataset, using a Jupyter notebook and the sklearn libraries. We present the achievements of two-stage detection methods in TLD; going by the standard performance metrics used, FASTER-CNN was the best in detection accuracy, F1-score, precision and recall with 0.89, 0.93, 0.83 and 0.90, respectively. Keywords: Artificial intelligence; Convolutional neural network (CNN); Deep learning; Faster-CNN; Object detection; Two-stage detection. This is an open access article under the CC BY-SA license.
... A state-of-the-art classification and detection model based on deep neural networks was proposed by Szegedy [24] which utilizes the computing resources to the core of an inside network, producing better results. In [25,26], the authors introduced the concept of pedestrian detection using deep convolution neural networks, while Teju [6] proposed object detection using OFSA and even object tracking using an optimal Kalman filter [7]. ...
Article
Thermal imaging is a cutting-edge technology which has the capability to detect objects under any environmental conditions, such as smoke, fog, smog, etc. This technology is especially important during nighttime since it does not require light to detect objects. Applications of this technology span various sectors, most importantly border security, to detect any incoming hazards. Object detection and classification are generally difficult with thermal imaging. In this paper, a one-stage deep convolutional network for object detection and classification called RetinaNet is introduced. Existing surveys are based on object detection using infrared information obtained from the objects. This research is focused on detecting and identifying objects from thermal images and surveillance data.
... An advanced driver-assistance system (ADAS) leverages various automation technologies to enable different degrees of autonomous driving. These techniques include the detection of various objects such as pedestrians [1,2,3], traffic signs [4,5,6], and lanes [7,8,9] through the use of sensors rarely occluded, under well-lit conditions, such as the ones in the TuSimple dataset [17] that consists of highway road images taken during daytime. However, such testing examples do not generalize well to real-world scenarios in streets which involve other visually-complicated objects such as buildings, pedestrians, and sidewalks. ...
Preprint
The task of lane detection has garnered considerable attention in the field of autonomous driving due to its complexity. Lanes can present difficulties for detection, as they can be narrow, fragmented, and often obscured by heavy traffic. However, it has been observed that the lanes have a geometrical structure that resembles a straight line, leading to improved lane detection results when utilizing this characteristic. To address this challenge, we propose a hierarchical Deep Hough Transform (DHT) approach that combines all lane features in an image into the Hough parameter space. Additionally, we refine the point selection method and incorporate a Dynamic Convolution Module to effectively differentiate between lanes in the original image. Our network architecture comprises a backbone network, either a ResNet or Pyramid Vision Transformer, a Feature Pyramid Network as the neck to extract multi-scale features, and a hierarchical DHT-based feature aggregation head to accurately segment each lane. By utilizing the lane features in the Hough parameter space, the network learns dynamic convolution kernel parameters corresponding to each lane, allowing the Dynamic Convolution Module to effectively differentiate between lane features. Subsequently, the lane features are fed into the feature decoder, which predicts the final position of the lane. Our proposed network structure demonstrates improved performance in detecting heavily occluded or worn lane images, as evidenced by our extensive experimental results, which show that our method outperforms or is on par with state-of-the-art techniques.
... The main idea was to use the pedestrian detection application for person tracking in a robot to make it able to cope with human-aware constraints. Angelova et al. (2015) propose to cascade deep convolutional neural networks with fast features and apply them to the pedestrian detection task. The Caltech dataset was used for evaluation. ...
Chapter
Most current intelligent vehicles (IV) are equipped with a variety of sensors and cameras. Vision-based applications for IV mainly require visual information. In this paper, the authors introduce a pedestrian detection application used for pedestrian safety. The authors propose a deep fully convolutional neural network (DFCNN) for pedestrian detection. The proposed model is suitable for mobile implementation. To achieve this, the authors build lightweight blocks using convolution layers and replace pooling layers and fully connected layers with convolution layers. Training and testing of the proposed DFCNN model for pedestrian detection were performed using the Caltech dataset. The proposed DFCNN achieved an average precision of 85% and an inference speed of 30 FPS. The reported results demonstrate the robustness of the proposed DFCNN for pedestrian detection. The proposed model thus combines low computational complexity with high performance.
... Cascading CNNs for improved accuracy of computer vision algorithms is a method explored by several authors in different areas. For example, Angelova et al. [17] applied deep network cascades for real-time pedestrian detection, Diba et al. [18] designed a cascade of CNNs for object detection, and [19] proposed a cascade CNN for traffic sign recognition. In these two works, using a CNN cascade to replace a single, more complex CNN achieved better accuracy. ...
Article
Binary convolutional neural networks (BCNN) have shown good accuracy for small to medium neural network models. Their extreme quantization of weights and activations reduces off-chip data transfer and greatly reduces the computational complexity of convolutions. Further reduction in the complexity of a BCNN model for fast execution can be achieved with model size reduction at the cost of network accuracy. In this paper, a multi-model inference technique is proposed to reduce the execution time of the binarized inference process without accuracy reduction. The technique considers a cascade of neural network models with different computation/accuracy ratios. A parameterizable binarized neural network with different trade-offs between complexity and accuracy is used to obtain multiple network models. We also propose a hardware accelerator to run multi-model inference throughput in embedded systems. The multi-model inference accelerator is demonstrated on low-density Zynq-7010 and Zynq-7020 FPGA devices, classifying images from the CIFAR-10 dataset. The proposed accelerator improves the frame rate per number of LUTs by 7.2× those of previous solutions on a ZYNQ7020 FPGA with similar accuracy. This shows the effectiveness of the multi-model inference technique and the efficiency of the proposed hardware accelerator.
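The cascade idea in this abstract (try a cheap model first and fall back to a costlier one only when needed) can be sketched as follows. This is a minimal illustration, not the accelerator or the networks from the paper: the confidence threshold, the `cascade_predict` helper, and the two toy stages are all hypothetical stand-ins.

```python
from typing import Callable, List, Sequence, Tuple

# A stage maps an input to (predicted_label, confidence in [0, 1]).
Stage = Callable[[Sequence[float]], Tuple[int, float]]


def cascade_predict(x: Sequence[float], stages: List[Stage],
                    threshold: float = 0.9) -> Tuple[int, int]:
    """Run stages from cheapest to most expensive.

    Returns (label, index of the stage that produced it). The first stage
    whose confidence reaches `threshold` answers; the last stage always
    answers so that every input gets a prediction.
    """
    if not stages:
        raise ValueError("cascade needs at least one stage")
    for index, stage in enumerate(stages):
        label, confidence = stage(x)
        if confidence >= threshold or index == len(stages) - 1:
            return label, index
    raise AssertionError("unreachable")


# Hypothetical toy stages standing in for a small and a large network.
def small_model(x: Sequence[float]) -> Tuple[int, float]:
    mean = sum(x) / len(x)
    # Confident only when the mean is far from the decision boundary 0.5.
    return int(mean > 0.5), abs(mean - 0.5) * 2


def large_model(x: Sequence[float]) -> Tuple[int, float]:
    # Pretend the large model is always confident.
    return int(x[-1] > 0.5), 1.0


easy = cascade_predict([1.0, 1.0], [small_model, large_model])
hard = cascade_predict([0.4, 0.6], [small_model, large_model])
print(easy, hard)
```

In this sketch the easy input is answered by the first stage and the ambiguous one is passed on, so the expensive stage runs only on the inputs that need it.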
... In further work by the same authors [29], it took 2 seconds to process a single frame, whereas real-time execution requires at least 15 fps (frames per second) [31]. Another limitation of the study is that the authors have not described how the tactile images and corresponding annotations are created. ...
Article
For people with visual impairments, information encoded in a visual format creates certain barriers. To alleviate this, a large volume of research has been conducted in the field of assistive technology. In our work, we developed a special system that makes educational materials more accessible. The system consists of three components: the pre-labelled tactile graphics, an interactive labelling web tool and the phone application. Tactile graphics are used at schools for the blind and allow the students to understand non-textual information by touch. The digital version of the graphics first needs to be labelled by teachers using the developed web tool. Then, the phone app, which is based on the Android platform, will accompany those graphics with the audio descriptions. The fundamental purpose of the developed app is to allow the user to gain information without sighted assistance. We also conducted a study to evaluate the system. First, the structured interview was carried out to gather data about the participant’s experience with the tactile graphics and mobile devices. Next, quantitative measurements were obtained through a series of experiments. Subsequently, a post-experimental session was carried out to record the participants’ thoughts and opinions about the system. The results of the experiments demonstrated that the proposed mobile application allows the users to explore the graphics more efficiently.
... In fact, thermal and color perception media provide additional facts. Numerous previous studies have solely focused on the detection of pedestrians in color or thermal perception [33][34][35]. Some current papers use both color and thermal images [36][37][38]. ...
Article
In recent years, autonomous vehicles have become increasingly popular due to their broad influence on society: they increase passenger safety and convenience, lower fuel consumption, reduce traffic congestion and accidents, save costs, and enhance reliability. However, autonomous vehicles suffer from some functionality errors that need to be minimized before they are fully deployed on main roads. Pedestrian detection is one of the most important tasks (and sources of functionality errors) in autonomous vehicles for preventing accidents. However, accurate pedestrian detection is very challenging due to the following issues: (i) occlusion and deformation and (ii) low-quality and multi-spectral images. Recently, deep learning (DL) technologies have exhibited great potential for addressing these pedestrian detection issues in autonomous vehicles. This survey paper provides an overview of pedestrian detection issues and the recent advances made in addressing them with the help of DL techniques. Informative discussions and future research directions are also presented, with the aim of offering insights to readers and motivating new research directions.
... The work most relevant to our proposed method is the one-step IDK cascade, which incorporates prior work on "I don't know" (IDK) classes (Trappenberg and Back, 2000; Khani et al., 2016) into cascade construction and introduces a latency-aware objective into the construction, compared with previous cascaded prediction frameworks (Rowley et al., 1998; Viola and Jones, 2004; Angelova et al., 2015). Another group of works focuses on the problem of feature selection, assuming each feature can be acquired for a cost. ...
Preprint
Machine Learning (ML) research has focused on maximizing the accuracy of predictive tasks. ML models, however, are increasingly complex, resource intensive, and costly to deploy in resource-constrained environments. These issues are exacerbated for prediction tasks with sequential classification on progressively transitioned stages with a "happens-before" relation between them. We argue that it is possible to "unfold" a monolithic single multi-class classifier, typically trained for all stages using all data, into a series of single-stage classifiers. Each single-stage classifier can be cascaded gradually from cheaper to more expensive binary classifiers that are trained using only the necessary data modalities or features required for that stage. UnfoldML is a cost-aware and uncertainty-based dynamic 2D prediction pipeline for multi-stage classification that enables (1) navigation of the accuracy/cost tradeoff space, (2) reducing the spatio-temporal cost of inference by orders of magnitude, and (3) early prediction on proceeding stages. UnfoldML achieves orders of magnitude better cost in clinical settings, while detecting multi-stage disease development in real time. It achieves within 0.1% accuracy of the highest-performing multi-class baseline, while saving close to 20X on spatio-temporal cost of inference and predicting disease onset earlier (3.5hrs). We also show that UnfoldML generalizes to image classification, where it can predict different levels of labels (from coarse to fine) given different levels of abstraction of an image, saving close to 5X cost with as little as 0.4% accuracy reduction.
... Various Computer Vision techniques are frequently utilized in numerous applications, including clothing detection [7,8], clothes collocation [9][10][11], clothing attribution and category recognition [12], and fashion image retrieval [13][14][15][16][17][18]. Object detection techniques are used in practically every aspect of life; the most notable are surveillance [19][20][21][22], autonomous driving [23][24][25], pedestrian detection [26][27][28], and the fashion industry [29][30][31][32]. Object detection aims to identify distinct item categories and precisely locate category-specific objects using bounding boxes. ...
Article
Over the past few decades, research on object detection has developed rapidly, and one area where this can be seen is the fashion industry. Fast and accurate detection of an E-commerce fashion product is crucial to choosing the appropriate category. Nowadays, E-commerce sites offer both new and second-hand clothing for purchase. Therefore, fashion clothing must be categorized precisely, regardless of cluttered backgrounds. We present recently acquired datasets of tiny product images with various resolutions, sizes, and positions from the Shopee E-commerce (Thailand) website. This paper also proposes the Fashion Category-You Only Look Once version 4 model, called FC-YOLOv4, for detecting multiclass fashion products. We used a semi-supervised learning approach to reduce image labeling time, and the number of resulting images was then increased through image augmentation. This approach yields reasonable Average Precision (AP), Mean Average Precision (mAP), True or False Positive (TP/FP), Recall, and Intersection over Union (IoU) values, and reliable object detection. According to experimental findings, our model increases the mAP by 0.07 percent and 40.2 percent compared to the original YOLOv4 and YOLOv3, respectively. Experimental findings from our FC-YOLOv4 model demonstrate that it provides accurate fashion category detection for both properly captured and cluttered images compared to the YOLOv4 and YOLOv3 models.
... Convolutional neural networks (CNNs), which exhibit significant powers of discrimination in the color image domain [4], have been successfully applied to people detection. Angelova et al. [5] propose a cascade framework consisting of several CNNs for pedestrian detection where proposals are obtained by a dense sliding window. However, each CNN in each cascade stage is applied repeatedly to each proposal, without sharing computations on convolutions. ...
Preprint
We address the problem of people detection in RGB-D data where we leverage depth information to develop a region-of-interest (ROI) selection method that provides proposals to two color and depth CNNs. To combine the detections produced by the two CNNs, we propose a novel fusion approach based on the characteristics of depth images. We also present a new depth-encoding scheme, which not only encodes depth images into three channels but also enhances the information for classification. We conduct experiments on a publicly available RGB-D people dataset and show that our approach outperforms the baseline models that only use RGB data.
... Convolutional neural networks are characterized by simple training, local connection, weight sharing, and down-sampling. Local connection means that each neuron is connected to only a small number of other neurons [7]. It is used only to learn local features, which greatly reduces the number of parameters. Weight sharing exploits the similar features of the same target in the image: a group of connection weights is shared by multiple groups, so that for the same target, multiple convolution kernels can extract the same features through weight sharing. Down-sampling reduces the samples in equal proportion according to the characteristics of different locations, discarding samples that are not particularly important and thereby further reducing the parameters. ...
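The parameter savings from local connection and weight sharing can be made concrete with a small count. The sketch below is illustrative only: the 32×32×3 input, the 16 output channels, and the 3×3 kernel are assumed values, not figures from the cited work.

```python
def dense_params(in_h: int, in_w: int, in_c: int, out_units: int) -> int:
    """Fully connected layer: every output unit sees every input value."""
    return in_h * in_w * in_c * out_units + out_units  # weights + biases


def conv_params(kernel: int, in_c: int, out_c: int) -> int:
    """Convolution: each output channel reuses one kernel at all positions."""
    return kernel * kernel * in_c * out_c + out_c  # weights + biases


# Assumed example: a 32x32 RGB input mapped to a 32x32x16 output.
dense = dense_params(32, 32, 3, 32 * 32 * 16)
conv = conv_params(3, 3, 16)
print(dense, conv)
```

Both layers produce an output of the same shape here, but the convolution needs several orders of magnitude fewer parameters because its kernels are small (local connection) and reused at every position (weight sharing).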
... The regional convolutional neural network (R-CNN) model proposed by Girshick et al. [15] achieved the highest accuracy. Angelova et al. [16] proposed a cascaded CNN-based pedestrian detection algorithm based on the idea of cascaded classifiers in the Adaboost algorithm; it could quickly eliminate most of the background areas in an image. Ouyang et al. [17] proposed the joint deep algorithm that combines HOG features and color self-similarity (CSS) features. ...
Article
Most contemporary pedestrian detection algorithms are based on visible light images. However, in environments with dim light, small targets, and easily occluded and cluttered backgrounds, single-mode visible light images relying on color, texture, and other features cannot adequately represent the feature information of targets; as a result, a large number of targets are missed and the algorithm performs poorly. To address this problem, we propose a dual-modal multi-scale feature fusion network (DMFFNet). First, we use the MobileNet v3 backbone network to extract the features of dual-modal images as input for the multi-scale fusion attention (MFA) module, combining the ideas of multi-scale feature fusion and the attention mechanism. Second, we deeply fuse the multi-scale features output by the MFA with the double deep feature fusion (DDFF) module to enhance the semantic and geometric information of the target. Finally, we optimize the loss function to reflect the distance between the predicted box and the real box more realistically and to enhance the network's ability to predict difficult samples. We performed multi-directional evaluations on the KAIST dual-light pedestrian dataset and the visible-thermal infrared pedestrian dataset (VTI) in our laboratory through comparative and ablation experiments. The overall MR-2 on the KAIST dual-light pedestrian dataset is 9.26%, and the MR-2 in dim light, partial occlusion, and severe occlusion are 5.17%, 23.35%, and 47.31%, respectively. The overall MR-2 on the VTI dual-light pedestrian dataset is 9.26%, and the MR-2 in dim light, partial occlusion, and severe occlusion are 5.17%, 23.35%, and 47.31%, respectively. The results show that the algorithm performs well on pedestrian detection, especially in dim light and when the target is occluded.
... The research in [21] analyzes the average pedestrian delay time and the total delay to pedestrians at specific examples of pedestrian crossings of all three types (pelican, puffin, and toucan crossings) and at signal-controlled junctions during peak periods using simple mathematics. The approach in [22] is a pedestrian detector that combines the speed of cascade classifiers with the precision of deep neural networks by cascading deep nets and fast features, which is both very quick and reasonably accurate. ...
Article
Traffic lights are generally used to regulate the flow of traffic at an intersection from all directions, including pelican crossing systems with traffic signals for pedestrians. There are two kinds of pedestrian crossing facilities, namely pedestrian bridges and zebra crossings. In general, the traffic signals of a pelican crossing system run on a fixed time, whereas some pedestrians need a "green man" signal whose duration is adjusted to them. Our research proposes a prototype intelligent pelican crossing system for people who cross the road at zebra crossings, where the risk of falling while crossing is not anticipated, especially for the elderly and for pedestrians who are pregnant or carrying children. In addition, average step or stride length (distance in centimeters) and cadence or walking rate (in steps per minute) vary between pedestrians, and the possibility of accidents is very high when pedestrians must complete the crossing while the “green man” light is on. The new idea of our research is to set the duration of the “green man” light adaptively in an intelligent pelican crossing system, based on the age of the pedestrians, using artificial intelligence that combines two methods, FaceNet and AgeNet. The resulting system predicts the age of pedestrians with a prediction rate of 66.67% on the training dataset and, when the prototype was tested in real time with participants on the pelican crossing system, 73% (single face) and 76% (multiple faces).
Conference Paper
One of the advantages of autonomous driving systems is that they prevent possible accidents and thus reduce the number of casualties and injuries. With the development of artificial intelligence, there are many studies in this field. Achieving a successful generalization performance in such applications is possible by training the model with various backgrounds and as much unique data as possible. However, the creation of such diverse and large datasets by a single community requires a huge labor and financial cost, and even if such datasets are created, the direct sharing of these datasets brings personal data privacy problems. Since federated learning eliminates the need for data sharing, it eliminates privacy issues and allows many datasets to be used together for training without the need to share different datasets with a server. Many federated learning methods such as FedAvg, FedProx, SCAFFOLD have been proposed in the literature. However, in applications where datasets of different difficulty and comprehensiveness are used, the global model weights generated are not sufficient to capture a good representation of the datasets used in the application. For this reason, we propose Federated Learning with Loss-Dependent Coefficients (FedLDC), which uses the Faster R-CNN ResNet50 FPN network as local model and offers a new federated learning approach that assigns higher coefficients to datasets with high loss values but also with more diverse background and unique data by utilizing the loss value during training. In our study, we compared FedLDC with both FedAvg, the first example of federated learning, and the standard training methodology, which is not federated. 
As a result of the tests we performed on three different datasets, FedLDC achieved the most successful result with a Miss Rate (MR) of 0.13 on Caltech Pedestrian and 0.07 on CityPersons, while it achieved a MR of 0.10 on the ECP Day dataset, competitive with the standard training methodology, which achieved a MR of 0.08.
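The idea of weighting each client's model by its training loss can be sketched as below. The abstract does not give the FedLDC formula, so this is only a guess at its general shape: normalizing the losses so the coefficients sum to one is an assumption, and the function names `loss_dependent_coefficients` and `aggregate` are hypothetical.

```python
from typing import Dict, List


def loss_dependent_coefficients(losses: List[float]) -> List[float]:
    """Assumed rule: coefficient proportional to each client's loss."""
    total = sum(losses)
    if total <= 0:
        raise ValueError("losses must sum to a positive value")
    return [loss / total for loss in losses]


def aggregate(client_weights: List[Dict[str, List[float]]],
              coefficients: List[float]) -> Dict[str, List[float]]:
    """Coefficient-weighted average of per-client parameter vectors."""
    if len(client_weights) != len(coefficients):
        raise ValueError("one coefficient per client is required")
    merged: Dict[str, List[float]] = {}
    for name, first in client_weights[0].items():
        merged[name] = [
            sum(c * w[name][i] for c, w in zip(coefficients, client_weights))
            for i in range(len(first))
        ]
    return merged


# Two hypothetical clients; the second reports the higher training loss.
coeffs = loss_dependent_coefficients([1.0, 3.0])
global_model = aggregate([{"w": [0.0, 4.0]}, {"w": [4.0, 0.0]}], coeffs)
print(coeffs, global_model)
```

By contrast, FedAvg as usually described weights clients by their number of training samples; the sketch only swaps that weighting for a loss-based one.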
Chapter
In recent years, with growing public concern about environmental and personal safety, various image-based detection techniques and research have gained wide attention. Currently, object-detection algorithms can be generally divided into two categories: traditional ones, which extract features manually, and deep learning-based approaches, which automatically extract features from images. Since the former requires a lot of manpower, material resources, and financial costs and consumes a lot of time to screen abnormal images, it no longer meets the urgent needs of our societies. In other words, more intelligent identification systems are required. For public security reasons, if different types of weapons, such as sticks, knives, and guns, can be detected in surveillance images, this can effectively reduce the chance of gangsters carrying weapons and acting violently or seeking revenge. To identify weapons, we need to distinguish them from other surveillance objects and images in real time. But most cameras have limited computing power, and images captured in the real world have their own problems, such as noise, blur, and rotation jitter, which need to be solved if we want to detect weapons correctly. Therefore, in this study, we develop a weapon detection system for surveillance images by employing a deep learning model. The intelligent tool used for image detection is YOLO (You Only Look Once)-v5, a lightweight architecture of the YOLO series, and the Sohas (Small Objects Handled Similarly to a weapon) dataset is adopted for image detection comparison. According to our simulation results, we successfully reduced the number of parameters in the YOLOv5s model by substituting the backbone with Shufflenetv2, replacing the PANet upsample module in the neck with the CARAFE (Content-Aware ReAssembly of Features) upsample module, and replacing the SPPF (Spatial Pyramid Pooling-Fast) module with three lightweight options of simp PPF. 
These changes resulted in a 16.35% reduction in the parameter size of the YOLOv5s model, a 30.38% increase in FLOPS computational efficiency, and a decrease of 0.024 in mAP@0.5.
Chapter
Pedestrian detection is a classical problem in computer vision and has long been difficult to solve. Compared with face detection, accurately detecting pedestrians in various scenarios is very difficult because of the complex postures of the human body, larger deformations, and more serious problems such as attachment and occlusion. This paper focuses on a typical pedestrian detection model, the YOLO model. Through experiments, the principle of the pedestrian detection algorithm and its effectiveness are studied to address the difficulties in pedestrian detection. In YOLO, logistic regression is used to predict the objectness score of each bounding box, and multi-scale fusion is used to make predictions. By observing the mAP index, it is concluded that the YOLO algorithm performs well on single-label pedestrian detection and has high computational efficiency. Keywords: Pedestrian Detection, YOLO, Target Detection, Darknet
Article
Research on pedestrian target detection in complex scenes remains of great significance. To address the high missed detection rate and poor timeliness of pedestrian target detection in complex scenes, this paper proposes an improved classification method. First, Haar features were extracted from the images to be detected, and the candidate areas of pedestrians were determined by an Adaboost classifier. Then, the traditional SVM classifier was improved by using a combined kernel function instead of a single kernel function, and the optimal proportion of each function in the combined kernel function was found using an adaptive particle swarm optimization algorithm. Finally, the improved SVM classifier was combined with the fusion feature to further examine the candidate areas and accurately locate the pedestrian's position. Experimental results show that, compared with the traditional detection framework, the proposed method effectively improves both detection speed and detection accuracy. This method has practical significance for pedestrian target detection in complex scenes.
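A combined kernel of the kind described can be sketched as a weighted sum of two standard kernels. The abstract does not say which kernels are combined, so the RBF and polynomial kernels below, their default parameters, and the single mixing `weight` are assumptions; in the paper's setting, the proportion would be the quantity tuned by the particle swarm search, which is not implemented here.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float],
               gamma: float = 0.5) -> float:
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def poly_kernel(x: Sequence[float], y: Sequence[float],
                degree: int = 2, coef0: float = 1.0) -> float:
    """Polynomial kernel: (<x, y> + coef0) ** degree."""
    return (sum(a * b for a, b in zip(x, y)) + coef0) ** degree


def combined_kernel(x: Sequence[float], y: Sequence[float],
                    weight: float) -> float:
    """Convex combination of the two kernels; `weight` is the RBF share."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * rbf_kernel(x, y) + (1.0 - weight) * poly_kernel(x, y)


value = combined_kernel([1.0, 0.0], [1.0, 0.0], weight=0.25)
print(value)
```

A convex combination of two valid kernels is itself a valid kernel, which is why the mixing proportion can be searched freely within [0, 1].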
Chapter
Autonomous vehicles receive far more attention today than in previous years. The most significant feature to consider is the capability of an autonomous vehicle to effectively detect and recognize road and traffic signs from a certain distance, and to detect pedestrians and other vehicles on the road in order to prevent accidents. The paper mainly focuses on the techniques that can be used for vehicle detection, pedestrian detection, and traffic sign detection and recognition (TSDR) systems. There are several distinctive kinds of road and traffic signs, such as refreshment, loose gravel, right hairpin bend, left hairpin bend, staggered intersection, roundabout, hump or rough road, unguarded level crossing, and so on. A traffic sign detection and recognition system identifies the traffic sign on the basis of attributes such as color, shape, and texture. Detecting pedestrians effectively is challenging due to different body shapes, sizes, clothing, poses, and outdoor lighting conditions. Automatic vehicle detection and recognition using video as input is also challenging to accomplish. Object detection is one of the most crucial and complex problems in the computer vision field. The aim of this paper is to review and analyze traffic sign detection techniques, vehicle detection techniques, and pedestrian detection for autonomous vehicles using convolutional neural networks that classify the road signs present in an image into different types.
Preprint
Deploying machine learning (ML) on milliwatt-scale edge devices (tinyML) is gaining popularity due to recent breakthroughs in ML and IoT. However, the capabilities of tinyML are restricted by strict power and compute constraints. The majority of contemporary research in tinyML focuses on model compression techniques, such as model pruning and quantization, to fit ML models on low-end devices. Nevertheless, the improvements in energy consumption and inference time obtained by existing techniques are limited because aggressive compression quickly shrinks model capacity and accuracy. Another approach to improving inference time and/or reducing power while preserving model capacity is through early-exit networks. These networks place intermediate classifiers along a baseline neural network that facilitate early exit from neural network computation if an intermediate classifier exhibits sufficient confidence in its prediction. Previous work on early-exit networks has focused on large networks, beyond what would typically be used for tinyML applications. In this paper, we discuss the challenges of adding early exits to state-of-the-art tiny-CNNs and devise an early-exit architecture, T-RECX, that addresses these challenges. In addition, we develop a method to alleviate the effect of network overthinking at the final exit by leveraging the high-level representations learned by the early exit. We evaluate T-RECX on three CNNs from the MLPerf tiny benchmark suite for image classification, keyword spotting, and visual wake word detection tasks. Our results demonstrate that T-RECX improves the accuracy of the baseline network and significantly reduces the average inference time of tiny-CNNs. T-RECX achieves a 32.58% average reduction in FLOPS in exchange for 1% accuracy across all evaluated models. Also, our techniques increase the accuracy of the baseline network in two out of the three models we evaluate.
Article
With the continued development of Autonomous Vehicle System (AVS), self-driving related technologies have attracted much attention over the past decade. In this light, we survey existing literature regarding self-driving related data, technologies, and systems. We present details of representative studies regarding collision avoidance, automatic lane-changing maneuver, object detection (including pedestrian detection and obstacle detection), and vehicle trajectory prediction, respectively. This survey summarizes the findings of existing self-driving studies, thus uncovering new insights that may guide researchers and software engineers in fields of self-driving data management systems and autonomous vehicle systems.