Article

Traffic Scene Segmentation Based on RGB-D Image and Deep Learning

Abstract

Semantic segmentation of traffic scenes has potential applications in intelligent transportation systems. Deep learning techniques can improve segmentation accuracy, especially when the information from depth maps is introduced. However, little research has been done on the application of depth maps to the segmentation of traffic scene. In this paper, we propose a method for semantic segmentation of traffic scenes based on RGB-D images and deep learning. The semi-global stereo matching algorithm and the fast global image smoothing method are employed to obtain a smooth disparity map. We present a new deep fully convolutional neural network architecture for semantic pixel-wise segmentation. We test the performance of the proposed network architecture using RGB-D images as input and compare the results with the method that only takes RGB images as input. The experimental results show that the introduction of the disparity map can help to improve the semantic segmentation accuracy and that our proposed network architecture achieves good real-time performance and competitive segmentation accuracy.
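As a rough illustration of the pipeline the abstract describes (semi-global matching followed by fast global image smoothing, with the smoothed disparity stacked as a fourth input channel), the following sketch uses OpenCV's StereoSGBM together with the fast global smoother from the opencv-contrib ximgproc module. The file names and all matcher/filter parameters are placeholders, not values taken from the paper:

```python
import cv2
import numpy as np

# Rectified left/right frames of a stereo traffic-scene pair (placeholder paths).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Semi-global (block) matching; parameter values are illustrative only.
sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5,
                             P1=8 * 5 * 5, P2=32 * 5 * 5)
disparity = sgbm.compute(left, right).astype(np.float32) / 16.0  # fixed-point -> pixels

# Fast global image smoothing (requires opencv-contrib), guided by the left image.
fgs = cv2.ximgproc.createFastGlobalSmootherFilter(left, lambda_=100.0, sigma_color=1.5)
disparity_smooth = fgs.filter(disparity)

# Stack the smoothed disparity as a fourth channel next to RGB.
rgb = cv2.cvtColor(cv2.imread("left.png"), cv2.COLOR_BGR2RGB).astype(np.float32)
rgbd = np.dstack([rgb, disparity_smooth])  # H x W x 4 network input
```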

... In DCNNs, a combination of max pooling and downsampling is usually used to achieve invariance, but this has a certain cost in segmentation and localization accuracy. Literature [23] proposed a semantic segmentation method for traffic scenes based on RGB-D images and deep learning, in which a semi-global stereo matching algorithm and a fast global image smoothing method are used to obtain a smooth disparity map. ...
... The PASCAL VOC 2012 data set contains 21 categories, from which ten of the more common categories are selected for the experimental demonstration. The proposed method is compared with the methods in literature [18,23,25] in terms of per-category recognition accuracy. The MIOU values are shown in Table 2. ...
... To further demonstrate the segmentation performance of the proposed method, it is compared with the methods in literature [18,23,25]. The MPA and MIOU results are shown in Table 3. ...
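For reference, the MPA and MIOU figures these comparisons rely on can be computed from a per-class confusion matrix. A minimal NumPy sketch of the standard definitions (not tied to any of the cited implementations):

```python
import numpy as np

def confusion(gt, pred, k):
    """Accumulate a k x k confusion matrix over flattened label maps."""
    mask = (gt >= 0) & (gt < k)
    return np.bincount(k * gt[mask] + pred[mask], minlength=k * k).reshape(k, k)

def mpa_miou(conf):
    """Mean pixel accuracy and mean IoU (rows = ground truth, cols = prediction)."""
    tp = np.diag(conf).astype(np.float64)
    per_class_acc = tp / np.maximum(conf.sum(axis=1), 1)
    iou = tp / np.maximum(conf.sum(axis=1) + conf.sum(axis=0) - tp, 1)
    return per_class_acc.mean(), iou.mean()

gt = np.random.randint(0, 3, (64, 64)).ravel()      # toy labels, 3 classes
pred = np.random.randint(0, 3, (64, 64)).ravel()
print(mpa_miou(confusion(gt, pred, 3)))
```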
Article
Full-text available
In order to solve the problems in existing image semantic segmentation methods, such as poor segmentation accuracy for small target objects and the difficulty of segmenting small target areas, an image semantic segmentation method based on an improved ERFNet model is proposed. Firstly, combining the asymmetric residual module and the weak bottleneck module, the ERFNet network model is improved to increase running speed and reduce the loss of precision. Then, global pooling is used to fuse the feature channels after pyramid pooling to preserve more important feature information. Finally, the network model is implemented on the PyTorch deep learning framework, and the proposed method is demonstrated experimentally, with model retraining adopted for learning and training. The experimental results show that the proposed method improves the segmentation of small-scale objects and reduces the possibility of misclassification. The mean pixel accuracy (MPA) and mean intersection over union (MIOU) of the proposed method are higher than those of the other compared methods.
... Li et al. [20] perform traffic-scene segmentation on RGB-D images to make use of the depth information in the images. ...
... In [20], the authors proposed to use depth images (RGB-D) for the task of semantic segmentation. The technique involved building a coarse disparity map and then applying a modified AlexNet to RGB-D images of the Cityscapes dataset. ...
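A common way to realize the 4-channel RGB-D input described in [20] is to widen the first convolution of a pretrained backbone. The sketch below does this for torchvision's AlexNet; initializing the extra depth channel with the mean of the RGB filters is a conventional choice, not the authors' documented procedure:

```python
import torch
import torch.nn as nn
from torchvision import models

net = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
old = net.features[0]                  # Conv2d(3, 64, kernel_size=11, stride=4, padding=2)
new = nn.Conv2d(4, old.out_channels, kernel_size=old.kernel_size,
                stride=old.stride, padding=old.padding)
with torch.no_grad():
    new.weight[:, :3] = old.weight                            # reuse the RGB filters
    new.weight[:, 3:] = old.weight.mean(dim=1, keepdim=True)  # depth channel from RGB mean
    new.bias.copy_(old.bias)
net.features[0] = new

features = net.features(torch.randn(1, 4, 224, 224))  # 4-channel RGB-D input accepted
```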
... where TP is True Positive, FN is False Negative, TN is True Negative, and FP is False Positive. Table 5 shows the accuracy results given by the AUC score for each object by ϑinspect and compares them with the state-of-the-art method described in [20]. It can be seen that 8 classes achieve better results than [20], with a considerable improvement in detection for buildings, sidewalks and trees. ...
Article
In recent times, Autonomous Moving Platforms (AMP) have become a vital component of various industrial sectors across the globe, as they include a diverse set of aerial, marine, and land-based vehicles. The emergence and rise of AMP necessitate a precise object-level understanding of the environment, which directly impacts functions such as decision making, speed control, and the direction of autonomous driving vehicles. Obstacle detection and object classification are the key issues in AMP. An autonomous vehicle designed to move on city roads should be bolstered with high-quality object detection/segmentation mechanisms, since inaccurate movements and speed limits can prove fatal. Motivated by the aforementioned discussion, in this paper we present ϑinspect (velocity-inspect), an AI-based 5G-enabled object segmentation and speed limit identification scheme for self-driving cars on city roads. In ϑinspect, Convolutional Neural Network (CNN) based semantic image segmentation is carried out to segment the objects as interpreted from the Cityscapes dataset. Then, object clustering is done using the K-Means approach based on the number of unique objects. The semantic segmentation is done over 12 classes, and the model outperforms state-of-the-art approaches on various parameters such as latency, with a high accuracy of 82.2%. Further, a K-Means clustering based Speed Range Analyser (SRA) is proposed to determine the acceptable and safe speed range for the vehicle, computed from the object density of every object in the environment. The results show that the proposed scheme outperforms traditional schemes in terms of latency and accuracy.
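In outline, a K-Means-based speed-range analyser could cluster per-frame object densities and map each cluster to a speed band. The feature layout, cluster count, and cluster-to-band mapping below are illustrative assumptions, not the actual design of the SRA:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy per-frame class densities derived from a segmentation output,
# e.g. fractions of pixels labelled [pedestrian, vehicle, building, road].
densities = np.array([
    [0.12, 0.30, 0.20, 0.38],   # crowded scene
    [0.01, 0.05, 0.10, 0.84],   # nearly empty road
    [0.06, 0.18, 0.25, 0.51],
    [0.00, 0.02, 0.05, 0.93],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(densities)

# Hard-coded for the demo; a real mapping would come from density statistics.
speed_bands = {0: "reduced speed", 1: "normal speed"}
for scene, label in zip(densities, km.labels_):
    print(scene, "->", speed_bands[label])
```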
... However, because of complex application scenarios such as haze, shadow, and low-luminance lighting conditions, it is hard to get satisfactory road detection results using traditional Digital Image Processing (DIP) methods. In contrast, with the development of Deep Neural Networks (DNNs), a multitude of researchers have applied DNNs to the road detection problem [1][2][3][4][5]. ...
... Recently, a number of studies have been carried out to address the problem. Li [3] added depth information from images to the input channels of their network to obtain favorable detection performance. Pohlen [4] proposed FRRN to incorporate boundary information with semantic information to improve accuracy. ...
... So we cropped a third of the height of the data images to ignore sky information, then resized the images to 128×32 pixels and converted them to greyscale. In TABLE III, we compare the performance of our algorithm with other work [3,4,5,11]. Our FPGA-based accelerator can achieve over 85.0% classification accuracy on the KITTI and Cityscapes datasets, and our work can also achieve 77.6% accuracy in road and lane detection on the KITTI dataset. ...
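The preprocessing this excerpt describes (drop the top third to discard the sky, resize to 128×32, convert to greyscale) is straightforward to reproduce; a small OpenCV sketch under exactly those stated settings:

```python
import cv2

def preprocess(path):
    """Crop the sky (top third), downscale to 128x32, convert to greyscale."""
    img = cv2.imread(path)
    h = img.shape[0]
    img = img[h // 3:, :]             # drop the top third (sky region)
    img = cv2.resize(img, (128, 32))  # cv2.resize takes (width, height)
    return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
```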
Article
Road detection is widely used in driving assistance and automotive driving. However, many state-of-the-art road detection methods are time- and memory-consuming. In this paper, we propose an FPGA-based deep learning accelerator using the binary SegNet (BSegNet) with a computing-near-memory (CNM) architecture for road detection at the edge. The accelerator has an optimized CNM architecture with massive bit-level parallel processing elements (PEs) and a pipeline for low latency on the critical path. The training model size of BSegNet with binary parameters is only 2.1 MB, and BSegNet can achieve training accuracy over 85% on the KITTI and Cityscapes datasets. The RTL-level FPGA accelerator can process road detection with an energy efficiency of 351.7 GOPs/W and only 18.70 W on-chip power.
... For example, Linhui et al. obtained larger receptive fields by increasing the depth of the model. This can bring some improvement, but it increases the memory consumption and computation of the model, so it has not been widely adopted [1]. ...
... As shown in Eq. (1), the calculation process of error backpropagation proceeds from back to front, which is exactly the opposite direction of the process of model inference. Assuming that the data set D = {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(N), y^(N))} is used, the numerical value g_i of the gradient of parameter i can be obtained as: ...
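The excerpt cuts off before Eq. (1) itself. For orientation only, the standard dataset-averaged gradient of a loss L with respect to a parameter θ_i, which is presumably what the truncated equation expressed, reads:

```latex
g_i = \frac{1}{N} \sum_{n=1}^{N}
      \frac{\partial L\bigl(f(x^{(n)}; \theta),\, y^{(n)}\bigr)}{\partial \theta_i}
```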
Article
Full-text available
Semantic image segmentation in computer networks is designed to determine the category to which each pixel in an image belongs. It is a basic computer vision task and has a very wide range of practical applications. In recent years, semantic image segmentation algorithms based on deep learning have attracted widespread attention due to their fast speed and high accuracy. However, due to the large number of downsampling layers in a deep learning model, the segmentation results are usually poor at the edges of objects, and there is currently no universal quantitative evaluation index to measure segmentation performance at object edges. Solving these two problems is of great significance for semantic image segmentation algorithms. Based on traditional evaluation indicators, this paper proposes a region-based evaluation index to quantitatively measure segmentation performance at object edges, together with an improved loss function to improve model performance. The existing semantic image segmentation methods are summarized. Taking advantage of the particularity of semantic image segmentation tasks, this paper presents an efficient and accurate method for extracting the edges of objects. By defining the distance from pixels to the edges of objects, this paper proposes a fast algorithm for calculating the edge area. Based on this, three methods are proposed as well as an area-based evaluation indicator. The experimental results show that the loss function proposed in this paper improves accuracy by 1% on the DeepLab model compared with the current mainstream cross-entropy loss function. For the area-based evaluation indicator, a 4% accuracy improvement can be achieved, and there is also a significant improvement on other segmentation models.
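One simple way to realize the "distance from pixels to the edges of objects" idea is a distance transform over the semantic boundary map. The following is a generic construction for illustration, not the paper's fast algorithm:

```python
import numpy as np
from scipy import ndimage

def edge_region(label_map, radius=5):
    """Mask of pixels lying within `radius` of a semantic boundary."""
    edges = np.zeros_like(label_map, dtype=bool)
    edges[:-1, :] |= label_map[:-1, :] != label_map[1:, :]   # horizontal boundaries
    edges[:, :-1] |= label_map[:, :-1] != label_map[:, 1:]   # vertical boundaries
    dist = ndimage.distance_transform_edt(~edges)            # distance to nearest edge
    return dist <= radius

labels = np.zeros((8, 8), dtype=int)
labels[2:6, 2:6] = 1
print(edge_region(labels, radius=1).astype(int))  # a band around the square's border
```

An edge-restricted accuracy, or an extra-weighted loss, can then be obtained by evaluating only (or up-weighting) the pixels inside this mask.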
... This information would then be passed to the corresponding decoder layer by means of the so-called Pooling Indices in order to guide the up-sampling stages. [73] and [74] try a similar approach. ...
... Depth as input treats depth as an additional input to the network, which can either be fed as an additional channel to the network, or processed as a separate input by a dedicated branch. A common approach in pioneering works was to treat depth information as an additional channel to RGB images, and then perform the semantic segmentation of this concatenated representation through Encoder-Decoder networks [73] [74]. ...
Preprint
Semantic image and video segmentation stand among the most important tasks in computer vision nowadays, since they provide a complete and meaningful representation of the environment by means of a dense classification of the pixels in a given scene. Recently, Deep Learning, and more precisely Convolutional Neural Networks, have boosted semantic segmentation to a new level in terms of performance and generalization capabilities. However, designing Deep Semantic Segmentation models is a complex task, as it may involve application-dependent aspects. Particularly, when considering autonomous driving applications, the robustness-efficiency trade-off, as well as intrinsic limitations - computational/memory bounds and data-scarcity - and constraints - real-time inference - should be taken into consideration. In this respect, the use of additional data modalities, such as depth perception for reasoning on the geometry of a scene, and temporal cues from videos to explore redundancy and consistency, are promising directions yet not explored to their full potential in the literature. In this paper, we conduct a survey on the most relevant and recent advances in Deep Semantic Segmentation in the context of vision for autonomous vehicles, from three different perspectives: efficiency-oriented model development for real-time operation, RGB-Depth data integration (RGB-D semantic segmentation), and the use of temporal information from videos in temporally-aware models. Our main objective is to provide a comprehensive discussion on the main methods, advantages, limitations, results and challenges faced from each perspective, so that the reader can not only get started, but also be up to date in respect to recent advances in this exciting and challenging research field.
... ii) projection of periodic 2D patterns and study of their deviation when they reach objects; and iii) projection of pseudo-random 2D patterns. For a semantic segmentation task involving urban/rural scenes, the work of (Li et al., 2017a) proposes a method based on RGB-D images of traffic scenes and DL. They use a new deep fully convolutional neural network architecture, based on modifying the AlexNet (Krizhevsky et al., 2012) network, for semantic pixel-wise segmentation. ...
... Table 1 summarises ten of the applications reviewed for each kind of data, comparing the input, the task, and the AI method chosen:
RGB-D 1  | DL | Semantic segmentation | Urban/rural scenes (Li et al., 2017a)
RGB-D 2  | DL | Semantic segmentation | PV module (Espinosa et al., 2020)
RGB-D 3  | ML | Object detection      | Shadow detection (Movia et al., 2016)
RGB-D 4  | ML | Classification        | Rice plants (Zheng et al., 2020)
RGB-D 5  | DL | Semantic segmentation | Building scenes (Czerniawski and Leite, 2020)
RGB-D 6  | DL | Object detection      | Urban/rural scenes (Gong et al., 2018)
RGB-D 7  | DL | Object detection      | Urban/rural scenes (Duan et al., 2019)
RGB-D 8  | DL | Clustering            | Urban/rural scenes (Li et al., 2019b)
RGB-D 9  | DL | Semantic segmentation | Rice plants
RGB-D 10 | DL | Semantic segmentation | Urban/rural scenes
IRT 1    | ML | Object detection      | Electrical equipment (Ullah et al., 2017)
IRT 2    | DL | Object detection      | PV module (Akram et al., 2020) ...
Preprint
Full-text available
Researchers have explored the benefits and applications of modern artificial intelligence (AI) algorithms in different scenarios. For the processing of geomatics data, AI offers overwhelming opportunities. Fundamental questions include how AI can be specifically applied to or must be specifically created for geomatics data. This change is also having a significant impact on geospatial data. The integration of AI approaches in geomatics has developed into the concept of Geospatial Artificial Intelligence (GeoAI), which is a new paradigm for geographic knowledge discovery and beyond. However, little systematic work currently exists on how researchers have applied AI for geospatial domains. Hence, this contribution outlines AI-based techniques for analysing and interpreting complex geomatics data. Our analysis has covered several gaps, for instance defining relationships between AI-based approaches and geomatics data. First, technologies and tools used for data acquisition are outlined, with a particular focus on RGB images, thermal images, 3D point clouds, trajectories, and hyperspectral/multispectral images. Then, how AI approaches have been exploited for the interpretation of geomatics data is explained. Finally, a broad set of examples of applications is given, together with the specific method applied. Limitations point towards unexplored areas for future investigations, serving as useful guidelines for future research directions.
... In addition to the aforementioned multimedia content, there are other modalities of data that are also useful for scene understanding, e.g., depth image, Lidar point cloud, thermal infrared image. By using them with RGB images as input, cross-modal perceiving has attracted increasing attention in real-world applications, e.g., scene parsing for autonomous driving [181], [85], object detection and tracking in low-light scenarios [182], [183], and action recognition [184]. There are three ways of fusing multi-modal data, i.e., at the input level [181], at the feature level [185], [186], [85], [182], [183], and at the output level [184], respectively. ...
... By using them with RGB images as input, cross-modal perceiving has attracted increasing attention in real-world applications, e.g., scene parsing for autonomous driving [181], [85], object detection and tracking in low-light scenarios [182], [183], and action recognition [184]. There are three ways of fusing multi-modal data, i.e., at the input level [181], at the feature level [185], [186], [85], [182], [183], and at the output level [184], respectively. Among them, fusing multi-modal data at the feature level is most prevalent, which can be further categorized into three groups, i.e., early fusion [186], late fusion [185], and fusion at multiple levels [85], [182]. ...
Preprint
In the Internet of Things (IoT) era, billions of sensors and devices collect and process data from the environment, transmit them to cloud centers, and receive feedback via the internet for connectivity and perception. However, transmitting massive amounts of heterogeneous data, perceiving complex environments from these data, and then making smart decisions in a timely manner are difficult. Artificial intelligence (AI), especially deep learning, is now a proven success in various areas including computer vision, speech recognition, and natural language processing. AI introduced into the IoT heralds the era of artificial intelligence of things (AIoT). This paper presents a comprehensive survey on AIoT to show how AI can empower the IoT to make it faster, smarter, greener, and safer. Specifically, we briefly present the AIoT architecture in the context of cloud computing, fog computing, and edge computing. Then, we present progress in AI research for IoT from four perspectives: perceiving, learning, reasoning, and behaving. Next, we summarize some promising applications of AIoT that are likely to profoundly reshape our world. Finally, we highlight the challenges facing AIoT and some potential research opportunities.
... Semantic segmentation is a technique in computer vision and deep learning for identifying and understanding objects in an image at the pixel level. It is widely used in a variety of applications [28][29][30][31][32]. In medical imaging and analysis, semantic segmentation can be used to delineate anatomical structures of human organs. ...
... additional channel [103], [104] to RGB images, or processed as a separate input by a dedicated branch [105], [106]; some works even feed the stereo pair to the network [107]-[109]. Input fusion, although faster, yields limited performance, while multi-branch approaches can get computationally expensive due to the use of modality-specific encoders. ...
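A minimal sketch of the dedicated-branch (feature-level) alternative contrasted with input fusion in this excerpt: two small encoders whose feature maps are summed before a shared head. The layer sizes and the summation fusion are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TwoBranchFusion(nn.Module):
    """Separate RGB and depth encoders fused at the feature level."""
    def __init__(self, feat=32, n_classes=19):
        super().__init__()
        def encoder(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, feat, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.rgb_enc, self.depth_enc = encoder(3), encoder(1)
        self.head = nn.Conv2d(feat, n_classes, 1)

    def forward(self, rgb, depth):
        fused = self.rgb_enc(rgb) + self.depth_enc(depth)   # feature-level fusion
        logits = self.head(fused)
        return nn.functional.interpolate(logits, size=rgb.shape[-2:],
                                         mode="bilinear", align_corners=False)

out = TwoBranchFusion()(torch.randn(1, 3, 64, 128), torch.randn(1, 1, 64, 128))
```

Input fusion would instead concatenate depth as a fourth channel and use a single encoder, trading the second encoder's cost for less modality-specific processing.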
Article
Full-text available
Autonomous mobile robots use computational techniques of great complexity to allow navigation in various types of dynamic environments, avoiding collisions with obstacles and always seeking to optimize the best route, ultimately enabling them to operate in a safe and precise manner. For navigation at this level to be possible, a variety of intelligent sensing and computer vision techniques are used. The potential of an intelligent computer vision system to detect and predict the actions of dynamic agents on the streets is applied to increase traffic safety with intelligent robotic vehicles. In this paper we present a systematic review of computer vision models for the detection and tracking of obstacles in traffic environments. Specifically, we cover works involving 2D and 3D data fusion for both internal and external perception, as well as current trends regarding efficient model design and temporally-aware architectures. Alongside our review, we provide a thorough discussion of the main positive and negative points of the state-of-the-art for detecting and tracking obstacles in Visual Robotic Attention works, and share our experience in applying visual perception for external obstacle detection and tracking as well as internal (driver) monitoring. The results presented should serve as a compilation of the history of visual perception for autonomous mobile robots, and of our contributions to the field, thus enabling advances in research on Advanced Driver Assistance Systems (ADAS) and Autonomous Vehicles.
... Image segmentation, one of the computer vision research hotspots, is a fundamental stage in image processing, visual analysis, and pattern recognition. At present, image segmentation has been applied in a variety of fields, including medical image processing and analysis [4], traffic images [5], remote sensing images [6], and satellite images [7]. Image segmentation makes an image easier to analyze by simplifying or changing the representation of it. ...
Article
Full-text available
Multilevel thresholding image segmentation is one of the most widely used methods in the field of image segmentation. This paper proposes a multilevel thresholding image segmentation technique based on an improved whale optimization algorithm (WOA). The WOA has been applied to many complex optimization problems because of its excellent performance; however, it easily falls into local optima. Therefore, firstly, a mixed-strategy-based improved whale optimization algorithm (MSWOA) is proposed, using a k-point initialization algorithm, a nonlinear convergence factor, and an adaptive weight coefficient to improve the optimization ability of the algorithm. Then, the MSWOA is combined with the Otsu method and Kapur entropy, respectively, to search for the optimal thresholds for multilevel gray-image segmentation. The results of algorithm performance evaluation experiments on benchmark functions demonstrate that the MSWOA has higher search accuracy and faster convergence speed than other comparative algorithms and that it can effectively jump out of local optima. In addition, the image segmentation results on benchmark images show that the MSWOA–Kapur image segmentation technique can effectively and accurately search multilevel thresholds.
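The Kapur-entropy fitness such a WOA variant maximizes is compact to state. A NumPy sketch of the standard objective over a 256-bin grey histogram (the optimizer itself is omitted):

```python
import numpy as np

def kapur_entropy(hist, thresholds):
    """Total entropy of the classes induced by `thresholds` (higher is better)."""
    p = hist / hist.sum()
    bounds = [0] + sorted(thresholds) + [256]
    total = 0.0
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        w = p[lo:hi].sum()
        if w <= 0:
            continue
        q = p[lo:hi] / w            # normalized within-class distribution
        q = q[q > 0]
        total += -(q * np.log(q)).sum()
    return total

hist = np.random.randint(1, 100, 256).astype(float)  # stand-in image histogram
print(kapur_entropy(hist, [85, 170]))                # fitness of a 2-threshold split
```

An MSWOA-style search would evaluate this function on candidate threshold vectors and keep the best-scoring one.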
... Moreover, these algorithms have long calculation times, poor real-time performance, and large limitations in real scenes. Since 2012, deep-learning algorithms have been rapidly applied to target recognition [16][17][18][19], target detection [20][21][22] and other tasks, and have achieved remarkable results. Deep learning is widely used in image semantic segmentation algorithms and intelligent vehicle assisted driving, providing reliable guidance and decision-making for intelligent vehicle assisted or active driving. ...
Article
Full-text available
With the rapid development of intelligent traffic information monitoring technology, accurate identification of vehicles, pedestrians and other objects on the road has become particularly important. Therefore, in order to improve the recognition and classification accuracy of image objects in complex traffic scenes, this paper proposes a segmentation method that semantically redefines image boundary regions. First, we use the SegNet semantic segmentation model to obtain coarse classification features of the vehicle-road objects, then use the simple linear iterative clustering (SLIC) algorithm to obtain an over-segmented map of the image, which determines the classification of each pixel within each superpixel area, and then optimize the target segmentation of boundaries and small areas in the vehicle-road image. Finally, the edge recovery ability of the conditional random field (CRF) is used to refine the image boundary. The experimental results show that, compared with FCN-8s and SegNet, the pixel accuracy of the proposed algorithm improves by 2.33% and 0.57%, respectively. And compared with Unet, the algorithm in this paper performs better when dealing with multi-target segmentation.
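The superpixel-voting step of such a pipeline can be sketched with scikit-image's SLIC: every superpixel is re-labelled with the majority class of the coarse SegNet prediction inside it. This is a generic reconstruction, not the authors' code, and it omits the CRF refinement stage:

```python
import numpy as np
from skimage.segmentation import slic

def refine_with_superpixels(image, coarse_labels, n_segments=600):
    """Snap a coarse CNN label map to SLIC superpixel boundaries by majority vote."""
    sp = slic(image, n_segments=n_segments, compactness=10, start_label=0)
    refined = coarse_labels.copy()
    for s in np.unique(sp):
        region = sp == s
        refined[region] = np.bincount(coarse_labels[region]).argmax()
    return refined
```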
... It is a crucial step for image recognition and the basis for image analysis and understanding, and its quality often affects subsequent image processing. However, due to the diversity and complexity of image types, formats, and representations, there are still many research difficulties in image segmentation, and it has become a research hotspot in image understanding, appearing in numerous fields such as the automatic detection of defects in industrial and agricultural products (Koirala, Walsh, Wang, & McCarthy, 2019; Liu et al., 2023; Melnyk, Havrylko, & Repa, 2021); automatic recognition of illegal vehicle license plates in traffic (Li, Qian, Lian, Zheng, & Zhou, 2017); disease analysis in medical assistance (Kotte, Pullakura, & Injeti, 2018; Manikandan, Ramar, Iruthayarajan, & Srinivasagan, 2014); target recognition (Girshick, Donahue, Darrell, & Malik, 2015); feature detection (Li, Huang, & Srivastava, 2021; Sun, Wang, Jiang, Fang, & Tao, 2014); and image annotation (Guillaumin, Küttel, & Ferrari, 2014). ...
... For example, [7] presented a vision system fusing deep learning and geometric modelling to detect unexpected obstacles on the road. In [8], the authors proposed a new fully convolutional deep neural network architecture for pixel-level semantic segmentation of traffic scenes, which takes RGB-D images as input. [9] introduced a learning-based approach for long-range vision that was capable of classifying complex terrain at distances up to the horizon, thus allowing high-level strategic planning. ...
Preprint
Full-text available
As autonomous driving systems prevail, it is becoming increasingly critical that the systems learn from databases containing fine-grained driving scenarios. Most databases currently available are human-annotated; they are expensive, time-consuming, and subject to behavioral biases. In this paper, we provide initial evidence supporting a novel technique utilizing drivers' electroencephalography (EEG) signals to implicitly label hazardous driving scenarios while passively viewing recordings of real-road driving, thus sparing the need for manual annotation and avoiding human annotators' behavioral biases during explicit report. We conducted an EEG experiment using real-life and animated recordings of driving scenarios and asked participants to report danger explicitly whenever necessary. Behavioral results showed the participants tended to report danger only when overt hazards (e.g., a vehicle or a pedestrian appearing unexpectedly from behind an occlusion) were in view. By contrast, their EEG signals were enhanced at the sight of both an overt hazard and a covert hazard (e.g., an occlusion signalling possible appearance of a vehicle or a pedestrian from behind). Thus, EEG signals were more sensitive to driving hazards than explicit reports. Further, the Time-Series AI (TSAI) successfully classified EEG signals corresponding to overt and covert hazards. We discuss future steps necessary to materialize the technique in real life.
... That is, a pixel-wise mask is created for the objects present in the picture. Image segmentation techniques are currently utilized in several domains such as traffic control systems [18], self-driving cars [19], medical segmentation [20], and object identification tasks such as locating and identifying objects in satellite images [21]. Every pixel in the image is assigned to a particular class, either the background or the objects to be recognized. ...
... The goal of semantic segmentation is to classify each pixel in an image according to its real object label. This technique is widely used in computer vision and facilitates some valuable applications, such as autonomous driving [1][2][3], intelligent medicine [4,5], and video surveillance [6][7][8]. Autonomous driving systems use video sensors to collect image information around a given vehicle, such as roads, buildings, pedestrians, and cars. Such a system divides an input image into a series of disjoint regions to analyze the surroundings of the vehicle. ...
Article
Full-text available
Semantic segmentation is widely used in automatic driving systems. To quickly and accurately classify objects in emergency situations, a large number of images need to be processed per second. To make a semantic segmentation model run on hardware with low memory and limited computing capacity, this paper proposes a real-time semantic segmentation network called MRFDCNet. This architecture is based on our proposed multireceptive field dense connection (MRFDC) module. The module uses one depthwise separable convolution branch and two depthwise dilated separable convolution branches with a proposed symmetric sequence of dilation rates to obtain local and contextual information under multiple receptive fields. In addition, we utilize a dense connection to allow local and contextual information to complement each other. We design a guided attention (GA) module to effectively utilize deep and shallow features. The GA module uses high-level semantic context to guide low-level spatial details and fuse both types of feature representations. MRFDCNet has only 1.07 M parameters, and it can achieve 72.8% mIoU on the Cityscapes test set with 74 FPS on one NVIDIA GeForce GTX 1080 Ti GPU. Experiments on the Cityscapes and CamVid test sets show that MRFDCNet achieves a balance between accuracy and inference speed. Code is available at https://github.com/Wsky1836/MRFDCNet.
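A depthwise dilated separable convolution branch of the kind the MRFDC module builds on takes only a few lines of PyTorch. The dilation rates and channel count below are illustrative, not the paper's proposed symmetric sequence:

```python
import torch
import torch.nn as nn

class DDSConv(nn.Module):
    """Depthwise 3x3 convolution with a dilation rate, then a pointwise 1x1."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=dilation,
                                   dilation=dilation, groups=channels, bias=False)
        self.pointwise = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        return torch.relu(self.bn(self.pointwise(self.depthwise(x))))

x = torch.randn(1, 64, 32, 64)
branches = [DDSConv(64, d) for d in (1, 2, 4)]      # multiple receptive fields
fused = torch.cat([b(x) for b in branches], dim=1)  # dense-style concatenation
```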
... [Figure: Accuracy (mIoU) and inference speed (FPS) obtained by several state-of-the-art semantic segmentation methods, including SwiftNet [2], PSPNet [7], ENet [13], ERFNet [14], BiSeNet [15], ICNet [16], LEDNet [17], RTHP [18], DFANet [19], ESPNet [20], FCN-8s [24], DeepLab [25], CRF-RNN [26], SegNet [27], SQNet [28], FRRN [29], TwoColumn [30], and the proposed DMA-Net, on the Cityscapes test set.] Over the past few decades, semantic segmentation in street scenes has attracted increasing attention, mainly due to its important role in autonomous driving systems [1]-[4]. Generally, these systems demand fast inference speed for interaction and response. ...
Preprint
Real-time semantic segmentation, which aims to achieve high segmentation accuracy at real-time inference speed, has received substantial attention over the past few years. However, many state-of-the-art real-time semantic segmentation methods tend to sacrifice some spatial details or contextual information for fast inference, thus leading to degradation in segmentation quality. In this paper, we propose a novel Deep Multi-branch Aggregation Network (called DMA-Net) based on the encoder-decoder structure to perform real-time semantic segmentation in street scenes. Specifically, we first adopt ResNet-18 as the encoder to efficiently generate various levels of feature maps from different stages of convolutions. Then, we develop a Multi-branch Aggregation Network (MAN) as the decoder to effectively aggregate different levels of feature maps and capture the multi-scale information. In MAN, a lattice enhanced residual block is designed to enhance feature representations of the network by taking advantage of the lattice structure. Meanwhile, a feature transformation block is introduced to explicitly transform the feature map from the neighboring branch before feature aggregation. Moreover, a global context block is used to exploit the global contextual information. These key components are tightly combined and jointly optimized in a unified network. Extensive experimental results on the challenging Cityscapes and CamVid datasets demonstrate that our proposed DMA-Net respectively obtains 77.0% and 73.6% mean Intersection over Union (mIoU) at the inference speed of 46.7 FPS and 119.8 FPS by only using a single NVIDIA GTX 1080Ti GPU. This shows that DMA-Net provides a good tradeoff between segmentation quality and speed for semantic segmentation in street scenes.
... At present, the point cloud segmentation methods that have attracted the most attention are mainly segmentation algorithms based on deep learning. There are relatively many studies on point cloud semantic segmentation and other enhanced semantic segmentation algorithms [31,32], but deep learning methods need a large number of data sets for training, so they are difficult to implement. In this study, the pipe point cloud and the damage point cloud model are obtained. ...
Article
Full-text available
As pipelines age and the volume of urban sewage increases, drainage pipelines develop defects of varying degrees, which can cause safety problems such as road collapse and urban flooding. The service life of drainage pipes is closely related to daily maintenance and inspection, so it is very important to inspect defects and monitor the operation of drainage pipes regularly. However, existing research lacks quantitative detection and intelligent management of pipeline defect information. Therefore, in this paper a depth camera is used as the sensor to quantitatively measure the volume and area of pits in concrete pipes, and a defect information management platform is constructed. Firstly, combining a BIM model with the 3D point cloud, this paper proposes a 3D defect information management platform for drainage pipelines. Then, the depth camera is used to collect and preprocess the damage data, and a method for calculating the damage volume and surface area of drainage pipelines, based on 3D mesh reconstruction of the defect point cloud, is proposed. The verification experiments show that the error between the quantified volume and the real volume is mostly within 10%, with a maximum error of 17.54%, indicating high accuracy. The drainage pipeline information model is created. Finally, the data is uploaded to the information management platform to realize the visualization and informatization of pipeline defects and the later operation and maintenance requirements of the pipeline.
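Once a watertight mesh has been reconstructed over the defect point cloud, volume and surface area fall out of standard mesh routines. A toy sketch with trimesh, using a convex hull as a crude stand-in for the paper's 3D mesh reconstruction:

```python
import numpy as np
import trimesh

# Stand-in for a pit point cloud from the depth camera: a noisy lower half-sphere.
pts = np.random.randn(2000, 3)
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
pts[:, 2] = -np.abs(pts[:, 2])

# The convex hull gives a watertight mesh; a real defect would need a surface
# reconstruction that follows concavities.
mesh = trimesh.PointCloud(pts).convex_hull
print("volume:", mesh.volume, "surface area:", mesh.area)
```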
... Image segmentation is a hot topic in deep learning-based computer vision and image processing [1]. Some excellent achievements have been obtained in various industrial applications, such as healthcare [2], autonomous driving [3], transportation [4], and robotics [5], to name a few. However, most deep learning models depend on complicated manual tuning and a large number of labeled data, which are very time-consuming and expensive to obtain. ...
Article
A well-performing deep learning model for image segmentation relies on a large number of labeled data. However, it is hard to obtain sufficient high-quality raw data in industrial applications. Meta-learning, one of the most promising research areas, is recognized as a powerful tool for approaching image segmentation. To this end, this paper reviews the state-of-the-art image segmentation methods based on meta-learning. We first introduce the background of image segmentation, including the methods and metrics of image segmentation. Second, we review the timeline of meta-learning and give a more comprehensive definition of meta-learning. The differences between meta-learning and other similar methods are compared comprehensively. Then, we categorize the existing meta-learning methods into model-based, optimization-based, and metric-based. For each category, the popular meta-learning models used in image segmentation are discussed. Next, we conduct comprehensive computational experiments to compare these models on two public datasets: ISIC-2018 and Covid-19. Finally, the future trends of meta-learning in image segmentation are highlighted.
... They combined the attention mechanism with the spatial pyramid to extract precise dense features for pixel labelling, instead of complex dilated convolutions and artificially designed decoder networks. Reference [25] proposed a semantic segmentation method for traffic scenes based on RGB-D images and deep learning. A semi-global stereo matching algorithm and a fast global image smoothing method are used to obtain a smooth disparity map. ...
Article
Full-text available
Abstract In order to improve the accuracy of image semantic segmentation, an image semantic segmentation method based on a generative adversarial network (GAN) and a fully convolutional network (FCN) model is proposed. First of all, the network structure of the generator is improved: introducing a residual module into the convolutional layer for residual (difference) learning makes the network structure sensitive to changes in the output, so as to better adjust the weights of the generator. Second, in order to reduce the number of parameters and calculations, a small convolution kernel is used to halve the number of channels of the input feature map before the large convolution kernel is applied. Finally, the output of the convolutional layer and the output of the deconvolutional layer are connected using the idea of a U-shaped network, to avoid losing low-level information. The proposed method was demonstrated experimentally on the PASCAL VOC 2012 and CamVid datasets. Experimental results show that the proposed method effectively improves the accuracy of image segmentation and avoids inaccurate detection caused by insufficient image pixel information and noise interference. Its mean pixel accuracy (MPA) and mean intersection over union (MIOU) are higher than those of the other compared methods.
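The channel-halving trick described here (a small kernel reduces channels before the expensive large kernel is applied) is the classic bottleneck pattern; a PyTorch sketch with illustrative channel sizes:

```python
import torch.nn as nn

bottleneck = nn.Sequential(
    nn.Conv2d(256, 128, kernel_size=1),             # cheap 1x1 halves the channels
    nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, kernel_size=5, padding=2),  # large kernel on fewer channels
    nn.ReLU(inplace=True),
    nn.Conv2d(128, 256, kernel_size=1),             # 1x1 restores the channel count
)
```

Relative to a direct 5x5 on 256 channels, the 5x5 here costs roughly a quarter of the multiply-accumulates, which is the parameter and computation saving the abstract refers to.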
... The application field of semantic segmentation is very wide, covering areas such as medical imaging, autonomous driving and geographic remote sensing [6][7][8]. It provides strong support for intelligent upgrades such as medical assisted diagnosis, assisted driving and remote sensing image interpretation. ...
Article
Full-text available
Abstract Currently, image semantic segmentation suffers from problems such as low accuracy and long running time. This paper proposes an image semantic segmentation method based on a generative adversarial network and the ENet model combined with a deep neural network. The method first improves the network model of the generative adversarial network, ensuring the high resolution of the generated image and achieving high similarity with the real image. While ensuring the high accuracy of image semantic segmentation, it effectively improves the real-time performance of network processing. The proposed method is verified on public data sets. The experimental results show that the segmentation accuracy of this method can reach more than 93%, and the simulation running time is less than 0.171 s, demonstrating high accuracy and strong real-time performance. A feasible strategy is proposed for the further productization of semantic segmentation.
... In recent years, with the implementation of the IMO's (International Maritime Organization) mandatory requirement for ships to be equipped with an onboard automatic identification system (AIS), AIS base stations along the world's coasts have been built rapidly, and AIS has been widely used [1][2]. As far as the current development situation is concerned, a large amount of AIS data can be collected and stored through the network. Facing the growing volume of AIS ship service information, new innovations can be found in information processing. ...
Article
Full-text available
Maritime transportation has always been the most important mode of transportation in international trade. With the deepening of economic globalization, the scale of international trade, which is the core of economic globalization, is expanding, and the shipping industry is developing greatly. In this paper, aiming to improve the operational efficiency of container terminals, AIS data is used as the research basis to predict the arrival time of ships and reduce its uncertainty, so as to support the construction of smart ports. The transitive closure method based on equivalence relations is used to fuzzy-cluster the routes to be matched and the historical routes used for matching, and optimal route matching is realized by cut-set selection. At the same time, navigation trajectory features based on AIS data are constructed, and an RNN-LSTM (Recurrent Neural Network-Long Short-Term Memory) model is proposed, exploiting the characteristics of deep learning and time series. The results show that the RNN-LSTM ship trajectory prediction model based on deep learning can achieve excellent prediction results and provide technical support for intelligent transportation at sea.
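An LSTM regressor over AIS track features gives the flavour of the RNN-LSTM predictor described above; the feature set, layer sizes, and output unit below are assumptions for illustration, not the paper's configuration:

```python
import torch
import torch.nn as nn

class ETAPredictor(nn.Module):
    """LSTM over per-report AIS features (e.g. lat, lon, speed, course)
    regressing the remaining time to arrival."""
    def __init__(self, n_features=4, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, track):                 # track: (batch, time, n_features)
        _, (h, _) = self.lstm(track)
        return self.out(h[-1]).squeeze(-1)    # predicted hours to arrival

eta = ETAPredictor()(torch.randn(8, 120, 4))  # 8 tracks, 120 AIS reports each
```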
... Semantic segmentation is a fundamental task in computer vision, classifying each pixel in an image. This technique plays many crucial roles in modern intelligent transportation systems such as autonomous driving and video surveillance [1]-[5]. Therefore, the study of semantic segmentation is of great relevance to the above applications. ...
Article
Full-text available
In recent years, semantic segmentation based on deep convolutional neural networks has developed rapidly. However, it is still a challenge to balance the computing cost and segmentation performance for the current semantic segmentation methods. This paper proposes a lightweight real-time semantic segmentation model Balanced Sample Distribution Network (BSDNet), to solve this problem. In BSDNet, we introduce the Balanced Sample Distribution Module (BSDModule) to balance the sampling distribution of convolution and obtain features with a larger receptive field. To optimize the segmentation effect, we introduce a Shuffle Channel Attention Module (SCAModule) to enhance the interaction of channel features at the cost of a small number of parameters. BSDModule and SCAModule are lightweight and flexible and can adapt to different types of network structures. Extensive experiments on CityScapes and CamVid show that the proposed method can balance the computing cost and segmentation performance. Specifically, on the CityScapes with 512×1024 resolution, BSDNet-Xception39 achieves 68.3% MIoU and 84.6 FPS with only 1.2M parameters.
... As mentioned previously, the depth image is rich in contour and position information, which benefits the semantic segmentation of RGB images [20,21]. To effectively merge RGB and depth information, this paper designs an image fusion module called the AFC (Figure 2), such that the network can learn more complementary information from the RGB and depth branches. ...
Article
During fruit production, robots must move stably across the orchard and detect obstacles on their path in real time. With the rapid progress of deep convolutional neural networks (CNNs), enabling orchard robots to detect obstacles through image semantic segmentation has become a hot topic. However, most such obstacle detection schemes underperform in the complex environment of orchards. To solve this problem, this paper proposes an image semantic fusion network for real-time detection of small obstacles. Two branches were set up to extract features from the red-green-blue (RGB) image and the depth image, respectively. The information extracted by the different modules was merged to complement the image features. The proposed network operates rapidly and supports real-time obstacle detection by orchard robots. Experiments on orchard scenarios show that our network is superior to the latest image semantic segmentation methods, highly accurate in the recognition of high-definition images, and extremely fast in inference.
... Up to now, many popular image segmentation algorithms have been applied to road area detection and extraction. These methods mainly focus on improving classic algorithms or combining them with other algorithms [1][2][3][4][5], such as graph-based methods, clustering methods, deep learning, and multi-theory combination methods. The core idea of the graph-based method is to transform the global segmentation of an image into a graph partition problem through a top-down traversal process and optimize the objective function. ...
Article
Full-text available
As a popular research direction in the field of intelligent transportation, road detection has received extensive attention from many researchers. However, there are still some key issues in specific applications that need to be further improved, such as the feature processing of road images, the optimal choice of information extraction and detection methods, and the inevitable limitations of detection schemes. In existing research work, most image segmentation algorithms applied to road detection are sensitive to noisy data and prone to generating redundant information or over-segmentation, which makes the computation of the segmentation process more complicated. In addition, such algorithms need to overcome objective factors such as varying road conditions and natural environments to ensure a certain execution efficiency and segmentation accuracy. To improve on these issues, this paper integrates the idea of a shallow machine-learning model that clusters first and then classifies, and a hierarchical multifeature road image segmentation integration framework is proposed. The proposed model has been tested and evaluated on two road datasets based on real scenes and compared with common detection methods, and its effectiveness and accuracy have been verified. Moreover, the method opens up a new way to enhance the learning and detection capabilities of the model. Most importantly, it has potential for application in practical fields such as intelligent transportation and assisted driving.
... For example, in medicine, image segmentation can detect abnormal tissues of the body [1][2][3][4][5][6][7][8][9][10][11] and extract the abnormal parts. In transportation, vehicle images are segmented so that vehicle information can be accurately identified [12]. In public safety, it supports face detection [13], fingerprint recognition [14], etc. ...
Article
Full-text available
The diagnosis of brain diseases based on magnetic resonance imaging (MRI) is a mainstream practice. In the course of practical treatment, medical personnel observe and analyze the changes in the size, position, and shape of various brain tissues in the brain MRI image, thereby judging whether the brain tissue is diseased and formulating the corresponding medical plan. The conclusions drawn after observing the image are influenced by the subjective experience of the experts and are not objective. Therefore, it has become necessary to prevent subjective factors from interfering with the diagnosis. This paper proposes an intelligent diagnosis model based on an improved deep convolutional neural network (IDCNN), which introduces an integrated support vector machine (SVM) into IDCNN. During image segmentation, problems such as irrational layer settings and too many parameters would make IDCNN's segmentation accuracy low, so this study made slight adjustments to the structure of IDCNN. First, the number of convolution layers and down-sampling layers in the DCNN network structure is adjusted, the network's activation function is tuned, and the parameters are optimized to improve IDCNN's non-linear expression ability. Then, the integrated SVM classifier replaces the original Softmax classifier in IDCNN to improve its classification ability. The simulation results show that, compared with the model before improvement and other classic classifiers, IDCNN improves segmentation results and promotes the intelligent diagnosis of brain tissue.
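Replacing the Softmax layer with an SVM amounts to training the SVM on penultimate-layer features. A toy sketch in which a single scikit-learn SVM stands in for the paper's integrated (ensemble) SVM, with synthetic data and labels:

```python
import torch
import torch.nn as nn
from sklearn.svm import SVC

# Toy convolutional feature extractor standing in for the adjusted IDCNN body.
body = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten())

with torch.no_grad():
    x = torch.randn(100, 1, 64, 64)       # stand-in MRI patches
    feats = body(x).numpy()               # penultimate-layer features
labels = (feats[:, 0] > 0).astype(int)    # synthetic labels, demo only

svm = SVC(kernel="rbf").fit(feats, labels)   # SVM replaces the Softmax classifier
print(svm.predict(feats[:5]))
```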
... Although there have been many solutions using various computer vision and machine learning algorithms, deep learning has offered compelling improvements in relevant fields. Consequently, a raft of computer vision problems, including scene segmentation, have been approached with deep learning technology [7,8,9,10,11]. ...
... Simply put, semantic segmentation can be interpreted as the task of classifying object categories and locating them at the pixel level of the captured image. Its applications are very broad, covering recent developments in fields such as satellite imagery [1], medical imaging [2][3][4], robotics [5][6][7], and autonomous vehicles [8][9][10][11][12]. Various solutions and models have been competing with each other to become the best, such as FCN [13], SegNet [14,15], ICNet [16], DeepLab(s) [17][18][19], PSPNet [20], and many more. ...
Article
Semantic segmentation has become one of the trending topics in the world of computer vision and deep learning. Recently, due to an increasing demand to solve a semantic segmentation task simultaneously with attribute recognition of objects, a new task named attribute-aware semantic segmentation has been introduced. Since the task requires handling pixel-wise object class estimation together with attributes such as a pedestrian's body orientation, previous works had difficulty with object-level ambiguous attributes such as body orientations, especially when segmenting pedestrians with their attributes correctly. This paper proposes ColAtt-Net, an attribute-aware semantic segmentation model augmented by a column-wise mask branch that predicts pedestrians' orientations across the horizontal extent of the input image. We assume that the pedestrians captured by a car-mounted camera are distributed horizontally, so that for each column of the input image the pedestrian pixels can be labeled with one orientation uniformly. In the proposed method, we split the output of the base semantic segmentation model into two branches: one branch segments the object categories, while the other, the novel column-wise attribute branch, maps the recognition of pedestrians' orientations distributed horizontally. This method successfully enhances the performance of attribute-aware semantic segmentation by reducing the ambiguity in segmenting pedestrian orientation. The experimental results confidently show improvements in pedestrian orientation segmentation, both quantitatively and qualitatively. This paper also discusses how the improved performance becomes an advantage in autonomous driving systems.
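The column-wise attribute branch can be sketched as pooling over the height axis followed by a per-column classifier; the channel and orientation-class counts here are assumptions, not ColAtt-Net's actual configuration:

```python
import torch
import torch.nn as nn

class ColumnAttributeBranch(nn.Module):
    """One orientation prediction per image column via height pooling."""
    def __init__(self, in_ch=64, n_orientations=8):
        super().__init__()
        self.classifier = nn.Conv1d(in_ch, n_orientations, kernel_size=1)

    def forward(self, feats):            # feats: (B, C, H, W) decoder features
        cols = feats.mean(dim=2)         # pool over height -> (B, C, W)
        return self.classifier(cols)     # (B, n_orientations, W)

logits = ColumnAttributeBranch()(torch.randn(2, 64, 32, 128))
```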
... Disparity maps are also widely used in semantic segmentation [9], where they are used for segmentation in traffic scenes. In [10] they are also used to reduce noise at the outputs of a segmentation network, and in [11] they are generated from monocular vision and, together with semantic segmentation, used to estimate the position of the ego vehicle with respect to the road. ...
Preprint
Full-text available
Road detection and segmentation is a crucial task in computer vision for safe autonomous driving. With this in mind, a new network architecture (3D-DEEP) and its end-to-end training methodology for CNN-based semantic segmentation are described in this paper. The method relies on disparity-filtered and LiDAR-projected images for three-dimensional information, and on image feature extraction through fully convolutional network architectures. The developed models were trained and validated on the Cityscapes dataset, using only fine-annotation examples with 19 different training classes, and on the KITTI road dataset. A 72.32% mean intersection over union (mIoU) has been obtained for the 19 Cityscapes training classes using the validation images. On the KITTI dataset, the model has achieved an F1 value of 97.85% in validation and 96.02% on the test images.
... Applications such as autonomous vehicle navigation, mobile robotics, medicine, optical metrology, and ubiquitous computing, among many others, have been effectively addressed by computer vision. [1][2][3][4][5][6] In a computer vision system, the light reflected by the objects in a scene is captured by an imaging system. Afterward, the captured scene images are processed using advanced computational algorithms to extract useful information about the scene and make proper decisions. ...
Article
Full-text available
In this work we show an unsupervised algorithm to segment images, based on the classification of pixels by means of an arbitrary neural network whose coefficients can be calculated via the singular value decomposition of a matrix. We provide a demo published on huggingface, and the respective theory.
Article
Lane detection is an essential task in autonomous driving. A good lane detection model should achieve many objectives, such as high accuracy, rapid detection, and low memory use. In this article, a grid-based network (G-NET) is designed to realize the aforementioned goals. In G-NET, traditional pixel-level semantic segmentation is replaced with area-level grid segmentation to reduce the detection burden. A position vector is then introduced to indicate where the lane key point lies in the grid. Meanwhile, a novel rolling convolution layer, followed by down-sampling and up-sampling convolution layers, has been designed for good feature extraction, ensuring each feature grid perceives all other grid features in the feature map. An adaptive hyperparameter branch is introduced to calculate the binary threshold effectively. Finally, the detected lane points are assigned to different lanes by introducing a distance-based quaternion. G-NET is extensively evaluated on three of the most widely used datasets: TuSimple, CULane, and CurveLanes. The results show that G-NET has state-of-the-art performance. Meanwhile, field tests are conducted.
Preprint
Full-text available
The RGB-Thermal (RGB-T) information for semantic segmentation has been extensively explored in recent years. However, most existing RGB-T semantic segmentation methods compromise spatial resolution to achieve real-time inference speed, which leads to poor performance. To better extract detailed spatial information, we propose a two-stage Feature-Enhanced Attention Network (FEANet) for the RGB-T semantic segmentation task. Specifically, we introduce a Feature-Enhanced Attention Module (FEAM) to excavate and enhance multi-level features from both the channel and spatial views. Benefiting from the proposed FEAM module, our FEANet can preserve spatial information and shift more attention to high-resolution features from the fused RGB-T images. Extensive experiments on the urban scene dataset demonstrate that our FEANet outperforms other state-of-the-art (SOTA) RGB-T methods in terms of objective metrics and subjective visual comparison (+2.6% in global mAcc and +0.8% in global mIoU). For the 480 × 640 RGB-T test images, our FEANet can run at real-time speed on an NVIDIA GeForce RTX 2080 Ti card.
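The abstract describes FEAM only as enhancing features from the channel and spatial views; the PyTorch sketch below shows a generic channel-plus-spatial attention module as a hedged stand-in for it. All layer sizes and the CBAM-style design are assumptions, not the paper's exact module.

import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Generic channel + spatial attention; a stand-in for FEAM (details assumed)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Channel view: squeeze spatial dims, then re-weight channels.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        # Spatial view: collapse channels, then re-weight locations.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel_mlp(x)             # channel attention
        avg = x.mean(dim=1, keepdim=True)       # per-pixel mean over channels
        mx, _ = x.max(dim=1, keepdim=True)      # per-pixel max over channels
        return x * self.spatial_conv(torch.cat([avg, mx], dim=1))

fused = torch.randn(1, 64, 120, 160)            # e.g. a fused RGB-T feature map
out = ChannelSpatialAttention(64)(fused)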
Article
Full-text available
Speech recognition has become a necessary feature of high-quality service industry products, so the accuracy and efficiency of speech recognition are key to product applications. This article also designs various modular functions for speech recognition. To address the poor recognition performance of existing convolutional neural networks on continuous speech data, we provide an improved convolutional neural network algorithm and backpropagation algorithm that reduce the weight range. In a real speech recognition system, the large amount of training data makes training the convolutional neural network's parameters very inefficient. Vocational education reform is an important way to realize modern education, and it is also an effective way to improve students' comprehensive quality and promote personal development. Analysis of the current teaching situation in higher vocational colleges shows that the effect of vocational education reform has not reached the expected standard. This has led to a decrease in the teaching efficiency of higher vocational colleges; coupled with the increasingly fierce competition in modern society, it has made the reform of higher vocational education a top priority for the education department and schools. To make research on vocational education reform more scientific, it is necessary to strengthen research on current and future development trends, and the research scope of vocational education reform needs to be coordinated, integrated, and expanded. To strengthen research on the integration of industry and academia, a team of experts must be established. The cross-border characteristics of vocational education are reflected to a large extent in the integration of production and education. This article explores how to realize the reform of vocational education in higher vocational colleges based on the improved convolutional neural network and speech recognition.
Article
The purpose is to explore the spatial-temporal dynamic prediction performance of urban road network traffic flow based on deep convolutional neural networks (CNNs). A dynamic prediction model of road network traffic flow based on STGCN-BiLSTM (spatial-temporal graph convolution network with Bi-directional Long Short-Term Memory) is designed in view of the complex and highly nonlinear traffic data in real environments. Finally, a simulation experiment is conducted on the constructed model to verify its performance, which is compared with that of the LSTM (Long Short-Term Memory) model, CNN model, RNN (Recurrent Neural Network) model, AlexNet model and STGCN model. The results show that the root mean square error, mean absolute error and mean absolute percentage error of the proposed model are 4.60%, 5.46% and 7.73%, respectively. The training time is stable at about 45 s, the test time is stable at about 33 s, and the delay is close to 1.86×10⁻⁴ s. Additionally, the traffic flow of the test section in the next 15 min, 30 min, 1 h and 2 h is further predicted. Compared with other algorithms, the predicted value of the proposed algorithm is the closest to the actual value, with the best prediction effect. Therefore, the constructed dynamic prediction model of road network traffic flow achieves high accuracy and good robustness under the premise of low error, and can provide an experimental basis for the spatial-temporal dynamic digital development of transportation in smart cities.
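As a rough illustration of combining graph convolution over a road network with a bidirectional LSTM over time, here is a minimal PyTorch sketch. The single-layer design, all sizes, and the adjacency placeholder are assumptions; this is not the paper's STGCN-BiLSTM architecture.

import torch
import torch.nn as nn

class GCNBiLSTM(nn.Module):
    """Minimal spatial-temporal sketch: one graph convolution per time step,
    then a BiLSTM over time. Loosely follows the STGCN-BiLSTM idea."""
    def __init__(self, n_nodes, in_feats, hidden):
        super().__init__()
        self.W = nn.Linear(in_feats, hidden)
        self.lstm = nn.LSTM(n_nodes * hidden, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_nodes)  # next-step flow per node

    def forward(self, x, A_hat):
        # x: (batch, time, nodes, feats); A_hat: normalized adjacency (nodes, nodes)
        b, t, n, f = x.shape
        h = torch.relu(A_hat @ self.W(x))   # graph convolution: A_hat X W
        h = h.reshape(b, t, -1)             # flatten node features for the LSTM
        out, _ = self.lstm(h)
        return self.head(out[:, -1])        # predict flow at the next time step

A_hat = torch.eye(20)                       # placeholder normalized adjacency
model = GCNBiLSTM(n_nodes=20, in_feats=1, hidden=32)
pred = model(torch.randn(4, 12, 20, 1), A_hat)  # 12 past steps -> next step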
Article
Real-time semantic segmentation is in intense demand for the application of autonomous driving. Most semantic segmentation models tend to use large feature maps and complex structures to enhance representation power for high accuracy. However, these inefficient designs increase the computational cost, which hinders the models from being applied to autonomous driving. In this paper, we propose a lightweight real-time segmentation model, named Parallel Complement Network (PCNet), to address this challenging task with fewer parameters. A Parallel Complement layer is introduced to generate complementary features with a large receptive field. It provides the ability to overcome the problem of similar feature encoding among different classes, and further produces discriminative representations. With the inverted residual structure, we design a Parallel Complement block to construct the proposed PCNet. Extensive experiments are carried out on challenging road scene datasets, i.e., Cityscapes and CamVid, to make comparisons against several state-of-the-art real-time segmentation models. The results show that our model has promising performance. Specifically, PCNet* achieves 72.9% mean IoU on Cityscapes using only 1.5M parameters and reaches 79.1 FPS on 1024×2048 resolution images on a GTX 2080Ti. Moreover, our proposed system achieves the best accuracy when trained from scratch.
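The block design above builds on the inverted residual structure. The sketch below shows a standard MobileNetV2-style inverted residual in PyTorch; the parallel-complement branch itself is omitted, since its details are not given in the abstract, and the expansion factor is an assumption.

import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2-style inverted residual: expand, depthwise conv, project."""
    def __init__(self, c, expand=6):
        super().__init__()
        mid = c * expand
        self.block = nn.Sequential(
            nn.Conv2d(c, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False),  # depthwise
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, c, 1, bias=False), nn.BatchNorm2d(c))

    def forward(self, x):
        return x + self.block(x)  # residual connection (stride 1, equal channels)

y = InvertedResidual(32)(torch.randn(1, 32, 64, 128))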
Article
In the Internet of Things (IoT) era, billions of sensors and devices collect and process data from the environment, transmit them to cloud centers, and receive feedback via the internet for connectivity and perception. However, transmitting massive amounts of heterogeneous data, perceiving complex environments from these data, and then making smart decisions in a timely manner are difficult. Artificial intelligence (AI), especially deep learning, is now a proven success in various areas including computer vision, speech recognition, and natural language processing. AI introduced into the IoT heralds the era of artificial intelligence of things (AIoT). This paper presents a comprehensive survey on AIoT to show how AI can empower the IoT to make it faster, smarter, greener, and safer. Specifically, we briefly present the AIoT architecture in the context of cloud computing, fog computing, and edge computing. Then, we present progress in AI research for IoT from four perspectives: perceiving, learning, reasoning, and behaving. Next, we summarize some promising applications of AIoT that are likely to profoundly reshape our world. Finally, we highlight the challenges facing AIoT and some potential research opportunities.
Article
Adverse weather conditions seriously threaten traffic safety, especially on rainy days with ponding water on the road surface, which can result in vehicle crashes, injuries and fatalities. Automatic splashed water detection based on surveillance videos is an attractive way to help prevent such traffic accidents. However, surveillance videos exhibit great variations in lighting, illumination conditions and complex backgrounds, which pose great difficulties for automatic recognition. In this paper, a novel deep learning based approach is proposed to detect splashed water. To the best of our knowledge, this is the first work on this topic based on deep learning. An effective semantic segmentation network, called SWNet, is proposed to extract the potential splashed water regions. An encoder-decoder structure is designed to capture the visual characteristics of splashed water. SWNet achieves high efficiency by reusing pooling indices and adopting a light-weight decoder. With its multi-scale feature fusion structure, SWNet integrates coarse semantic information and detailed appearance information, which significantly boosts accuracy and refines edge segmentation. A weighted cross entropy loss for splashed water is adopted to cope with the unbalanced distribution between splashed water and background. Moreover, a splashed water attention module is designed to focus on the salient regions of moving vehicles and splashed water, applying an attention mechanism to integrate global contextual information into the semantic segmentation. Experiments conducted on a newly collected splashed water dataset demonstrate the effectiveness and efficiency of the proposed approach, which outperforms state-of-the-art methods.
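The weighted cross-entropy loss mentioned above is a standard mechanism for class imbalance; here is a minimal PyTorch sketch, with the two-class setup and the weight values as assumptions.

import torch
import torch.nn.functional as F

# Class-weighted cross-entropy: up-weight the rare foreground class so the
# loss is not dominated by the abundant background pixels.
logits = torch.randn(2, 2, 240, 320)            # (batch, classes, H, W)
target = torch.randint(0, 2, (2, 240, 320))     # 0 = background, 1 = splashed water
class_weights = torch.tensor([0.1, 0.9])        # assumed weights for the 2 classes
loss = F.cross_entropy(logits, target, weight=class_weights)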
Article
Semantic segmentation-based scene parsing plays an important role in automatic driving and autonomous navigation. However, most previous models only consider static images and fail to parse sequential images, because they do not take the spatial-temporal continuity between consecutive frames of a video into account. In this paper, we propose a depth embedded recurrent predictive parsing network (RPPNet), which analyzes preceding consecutive stereo pairs to produce parsing results. In this way, RPPNet effectively learns the dynamic information from historical stereo pairs, so as to correctly predict the representations of the next frame. The other contribution of this paper is to systematically study the video scene parsing (VSP) task, in which we use RPPNet to enhance conventional image parsing features by adding spatial-temporal information. The experimental results show that our proposed RPPNet can achieve fine predictive parsing results on Cityscapes and that the predictive features of RPPNet can significantly improve conventional image parsing networks in the VSP task.
Article
Video surveillance techniques like scene segmentation are playing an increasingly important role in multimedia Internet-of-Things (IoT) systems. However, existing deep learning based methods face challenges in both accuracy and memory when deployed on edge computing devices with limited computing resources. To address these challenges, a tensor-train video scene segmentation scheme that compares the local background information in regional scene boundary boxes of adjacent frames is proposed. Compared to existing methods, the proposed scheme achieves competitive performance in both segmentation accuracy and parameter compression rate. In detail, first, an improved Faster Region Convolutional Neural Network (Faster R-CNN) model is proposed to recognize and generate a large number of region boxes with foreground and background, from which boundary boxes are obtained. The foreground boxes with sparse objects are then removed, and the rest are treated as candidate background boxes used to measure the similarity between two adjacent frames. Second, to accelerate training and reduce memory size, a general and efficient training method that uses tensor-train decomposition to factor the input-to-hidden weight matrix is proposed. Finally, experiments are conducted to evaluate the performance of the proposed scheme in terms of accuracy and model compression. Our results demonstrate that the proposed model can improve training efficiency and save memory space for the deep computation model while maintaining good accuracy. This work opens up the potential for the use of artificial intelligence methods in edge computing devices for multimedia IoT systems.
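To make the tensor-train idea concrete, the following NumPy sketch factors an input-to-hidden weight matrix into two TT cores and applies it to a vector without materializing the dense matrix. The shapes and the TT rank are illustrative assumptions.

import numpy as np

# Two-core tensor-train (TT) factorization of a weight matrix W of shape
# (m1*m2) x (n1*n2), stored as cores G1 and G2 instead of the dense matrix.
m1, m2, n1, n2, r = 8, 16, 4, 8, 3        # W would be 128 x 32; r is the TT rank
G1 = np.random.randn(m1, n1, r)           # first TT core
G2 = np.random.randn(r, m2, n2)           # second TT core

def tt_matvec(G1, G2, x):
    """Compute y = W x without forming W, where
    W[(i1,i2),(j1,j2)] = sum_r G1[i1,j1,r] * G2[r,i2,j2]."""
    X = x.reshape(G1.shape[1], G2.shape[2])            # (n1, n2)
    return np.einsum('ajr,rbk,jk->ab', G1, G2, X).reshape(-1)

x = np.random.randn(n1 * n2)
y = tt_matvec(G1, G2, x)                  # length m1*m2 = 128
# Parameter count: m1*n1*r + r*m2*n2 = 96 + 384 = 480, versus 4096 for dense W.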
Article
Full-text available
In recent years, Intelligent Transportation Systems (ITS) have seen efficient and faster development by implementing deep learning techniques in problem domains that were previously addressed using analytical or statistical solutions, as well as in some areas that were untouched. These improvements have facilitated traffic management and traffic planning, increased safety and security on transit roads, decreased maintenance costs, optimized the performance of public transportation and ride-sharing companies, and advanced driverless vehicle development to a new stage. This paper's primary objective is to provide a review of and comprehensive insight into the applications of deep learning models in intelligent transportation systems, accompanied by a presentation of the progress of ITS research due to deep learning. First, different deep learning techniques and their state-of-the-art are discussed, followed by an in-depth analysis and explanation of the current applications of these techniques in transportation systems. This enumeration of deep learning on ITS highlights its significance in the domain. The applications are furthermore categorized based on the gap they are trying to address. Finally, different embedded systems for the deployment of these techniques are investigated, and their advantages and weaknesses relative to each other are discussed. Based on this systematic review, credible benefits of deep learning models for ITS are demonstrated and directions for future research are discussed.
Article
Full-text available
We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network and a corresponding decoder network, followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network [1]. The role of the decoder network is to map the low resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN [2] and also with the well known DeepLab-LargeFOV [3] and DeconvNet [4] architectures. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance. SegNet was primarily motivated by scene understanding applications. Hence, it is designed to be efficient both in terms of memory and computational time during inference. It is also significantly smaller in the number of trainable parameters than other competing architectures and can be trained end-to-end using stochastic gradient descent. We also performed a controlled benchmark of SegNet and other architectures on both road scenes and SUN RGB-D indoor scene segmentation tasks. These quantitative assessments show that SegNet provides good performance with competitive inference time and the most efficient inference memory-wise as compared to other architectures. We also provide a Caffe implementation of SegNet and a web demo at http://mi.eng.cam.ac.uk/projects/segnet/.
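SegNet's index-based upsampling maps directly onto standard PyTorch operations; a minimal sketch of the mechanism:

import torch
import torch.nn.functional as F

# The encoder records max-pooling indices; the decoder reuses them for
# non-linear unpooling instead of learning to upsample.
x = torch.randn(1, 8, 32, 32)                         # encoder feature map
pooled, idx = F.max_pool2d(x, kernel_size=2, stride=2, return_indices=True)
# ... later, at the matching decoder stage ...
unpooled = F.max_unpool2d(pooled, idx, kernel_size=2, stride=2)
# 'unpooled' is sparse (non-max positions are zero); SegNet then densifies
# it with trainable convolutions.
assert unpooled.shape == x.shape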
Conference Paper
Full-text available
Visual understanding of complex urban street scenes is an enabling factor for a wide range of applications. Object detection has benefited enormously from large-scale datasets, especially in the context of deep learning. For semantic urban scene understanding, however, no current dataset adequately captures the complexity of real-world urban scenes. To address this, we introduce Cityscapes, a benchmark suite and large-scale dataset to train and test approaches for pixel-level and instance-level semantic labeling. Cityscapes is comprised of a large, diverse set of stereo video sequences recorded in streets from 50 different cities. 5000 of these images have high quality pixel-level annotations; 20000 additional images have coarse annotations to enable methods that leverage large volumes of weakly-labeled data. Crucially, our effort exceeds previous attempts in terms of dataset size, annotation richness, scene variability, and complexity. Our accompanying empirical study provides an in-depth analysis of the dataset characteristics, as well as a performance evaluation of several state-of-the-art approaches based on our benchmark.
Article
Full-text available
Advance information about the road surface a vehicle is going to encounter can improve the performance of an Antilock Braking System (ABS). For example, the initial slip cycles caused by the ABS could be avoided if it is already known that the vehicle is on a surface with a low coefficient of friction (μ). In this paper, an algorithm is developed that detects different road surfaces using streaming video acquired from a camera mounted on the hood of the vehicle. The road surfaces detected here are asphalt road, cement road, sandy road, rough asphalt road (asphalt road which is deteriorating), grassy road and rough road. The value of the coefficient of friction (μ) is also given with each detected surface to provide additional information about it. Split μ (a road having different μ conditions on the left and right side of the vehicle) and μ jump (different μ conditions on the front and rear of the vehicle) are also pre-detected. No single method was sufficient to achieve the goals of this algorithm; several simple techniques, such as the Canny edge algorithm, intensity histograms, contours, the Hough transform and image segmentation, were employed and compared with a Support Vector Machine (SVM). To prevent misdetections, road surface detection during high motion blur is prohibited.
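As a hedged illustration of one plausible version of such a pipeline, the sketch below combines edge density (Canny) with an intensity histogram as features and classifies them with an SVM. The thresholds, feature choices, and synthetic training data are assumptions, not the paper's exact method.

import cv2
import numpy as np
from sklearn.svm import SVC

def road_features(bgr):
    # Edge density is a rough texture cue (rough asphalt or grass produces
    # more edges than smooth cement); the histogram captures overall intensity.
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    edge_density = edges.mean() / 255.0
    hist = cv2.calcHist([gray], [0], None, [32], [0, 256]).ravel()
    hist /= hist.sum() + 1e-8
    return np.concatenate([[edge_density], hist])

# Synthetic stand-ins for labeled road patches (labels are illustrative).
patches = [np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8) for _ in range(2)]
X = np.stack([road_features(p) for p in patches])
y = np.array([0, 1])                      # e.g. 0 = asphalt, 1 = sand
clf = SVC(kernel="rbf").fit(X, y)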
Article
Full-text available
Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
Article
Full-text available
We propose a novel deep architecture, SegNet, for semantic pixel wise image labelling. SegNet has several attractive properties; (i) it only requires forward evaluation of a fully learnt function to obtain smooth label predictions, (ii) with increasing depth, a larger context is considered for pixel labelling which improves accuracy, and (iii) it is easy to visualise the effect of feature activation(s) in the pixel label space at any depth. SegNet is composed of a stack of encoders followed by a corresponding decoder stack which feeds into a soft-max classification layer. The decoders help map low resolution feature maps at the output of the encoder stack to full input image size feature maps. This addresses an important drawback of recent deep learning approaches which have adopted networks designed for object categorization for pixel wise labelling. These methods lack a mechanism to map deep layer feature maps to input dimensions. They resort to ad hoc methods to upsample features, e.g. by replication. This results in noisy predictions and also restricts the number of pooling layers in order to avoid too much upsampling and thus reduces spatial context. SegNet overcomes these problems by learning to map encoder outputs to image pixel labels. We test the performance of SegNet on outdoor RGB scenes from CamVid, KITTI and indoor scenes from the NYU dataset. Our results show that SegNet achieves state-of-the-art performance even without use of additional cues such as depth, video frames or post-processing with CRF models.
Article
Full-text available
We introduce a purely feed-forward architecture for semantic segmentation. We map small image elements (superpixels) to rich feature representations extracted from a sequence of nested regions of increasing extent. These regions are obtained by "zooming out" from the superpixel all the way to scene-level resolution. This approach exploits statistical structure in the image and in the label space without setting up explicit structured prediction mechanisms, and thus avoids complex and expensive inference. Instead, superpixels are classified by a feed-forward multilayer network. Our architecture achieves new state-of-the-art performance in semantic segmentation, obtaining 64.4% average accuracy on the PASCAL VOC 2012 test set.
Article
Full-text available
This paper presents an efficient technique for performing spatially inhomogeneous edge-preserving image smoothing, called the fast global smoother. Focusing on sparse Laplacian matrices consisting of a data term and a prior term (typically defined using four or eight neighbors for a 2D image), our approach efficiently solves such global objective functions. Specifically, we approximate the solution of the memory- and computation-intensive large linear system, defined over a d-dimensional spatial domain, by solving a sequence of 1D sub-systems. Our separable implementation enables applying a linear-time tridiagonal matrix algorithm to solve d three-point Laplacian matrices iteratively. Our approach combines the best of two paradigms, i.e., efficient edge-preserving filters and optimization-based smoothing. Our method has a runtime comparable to the fast edge-preserving filters, but its global optimization formulation overcomes many limitations of local filtering approaches. Our method also achieves results of as high quality as the state-of-the-art optimization-based techniques, but runs about 10 to 30 times faster. Besides, considering the flexibility in defining an objective function, we further propose generalized fast algorithms that perform Lγ-norm smoothing (0 < γ < 2) and support an aggregated (robust) data term for handling imprecise data constraints. We demonstrate the effectiveness and efficiency of our techniques in a range of image processing and computer graphics applications.
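The 1D building block of this approach can be illustrated with a short NumPy sketch: an edge-aware weighted least-squares smoother along one scanline, solved in linear time with the Thomas (tridiagonal) algorithm. The weight function and the lambda/sigma parameters are assumptions.

import numpy as np

def smooth_1d(f, guide, lam=30.0, sigma=0.1):
    """Solve (I + lam*A) u = f for one scanline, where A is the three-point
    Laplacian with edge-aware neighbor weights taken from the guide signal."""
    n = len(f)
    w = np.exp(-np.abs(np.diff(guide)) / sigma)     # edge-aware neighbor weights
    lower = np.zeros(n); upper = np.zeros(n); diag = np.ones(n)
    lower[1:] = -lam * w                            # sub-diagonal
    upper[:-1] = -lam * w                           # super-diagonal
    diag[:-1] += lam * w
    diag[1:] += lam * w
    # Thomas algorithm: forward elimination, then back substitution.
    c, d = upper.copy(), f.astype(float)
    c[0] /= diag[0]; d[0] /= diag[0]
    for i in range(1, n):
        m = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / m
        d[i] = (d[i] - lower[i] * d[i - 1]) / m
    u = d
    for i in range(n - 2, -1, -1):
        u[i] -= c[i] * u[i + 1]
    return u

noisy = np.sin(np.linspace(0, 3, 200)) + 0.1 * np.random.randn(200)
smoothed = smooth_1d(noisy, guide=noisy)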
Article
Full-text available
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (approx 2 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
Conference Paper
Full-text available
In this paper we study the problem of object detection for RGB-D images using semantically rich image and depth features. We propose a new geocentric embedding for depth images that encodes height above ground and angle with gravity for each pixel in addition to the horizontal disparity. We demonstrate that this geocentric embedding works better than using raw depth images for learning feature representations with convolutional neural networks. Our final object detection system achieves an average precision of 37.3%, which is a 56% relative improvement over existing methods. We then focus on the task of instance segmentation where we label pixels belonging to object instances found by our detector. For this task, we propose a decision forest approach that classifies pixels in the detection window as foreground or background using a family of unary and binary tests that query shape and geocentric pose features. Finally, we use the output from our object detectors in an existing superpixel classification framework for semantic scene segmentation and achieve a 24% relative improvement over current state-of-the-art for the object categories that we study. We believe advances such as those represented in this paper will facilitate the use of perception in fields like robotics.
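A simplified NumPy sketch of a geocentric, HHA-like encoding in the spirit of this paper: per-pixel disparity, an approximate height channel, and the angle between estimated surface normals and an assumed gravity direction. The intrinsics, baseline, ground proxy, and gravity direction are all illustrative simplifications, not the paper's exact procedure.

import numpy as np

fx = fy = 525.0; cx, cy = 160.0, 120.0; baseline = 0.075   # assumed camera
depth = np.random.uniform(1.0, 10.0, (240, 320))           # meters (stand-in)

disparity = fx * baseline / depth                          # channel 1

v, u = np.mgrid[0:240, 0:320]                              # back-project to 3D
Y = (v - cy) * depth / fy                                  # image y axis points down
height = Y.max() - Y                                       # channel 2: crude height
                                                           # above the lowest point

dzdu = np.gradient(depth, axis=1); dzdv = np.gradient(depth, axis=0)
normals = np.dstack([-dzdu, -dzdv, np.ones_like(depth)])   # rough surface normals
normals /= np.linalg.norm(normals, axis=2, keepdims=True)
gravity = np.array([0.0, 1.0, 0.0])                        # assumed gravity direction
angle = np.degrees(np.arccos(np.clip(normals @ gravity, -1.0, 1.0)))  # channel 3

hha_like = np.dstack([disparity, height, angle])           # 3-channel embedding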
Conference Paper
Full-text available
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called "ImageNet", a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
Article
Full-text available
Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank cheque is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal cheques. It is deployed commercially and reads several million cheques per day.
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Conference Paper
Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on our PReLU networks (PReLU-nets), we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66%). To our knowledge, our result is the first to surpass human-level performance (5.1%, Russakovsky et al.) on this visual recognition challenge.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Conference Paper
We present an approach to interpret the major surfaces, objects, and support relations of an indoor scene from an RGBD image. Most existing work ignores physical interactions or is applied only to tidy rooms and hallways. Our goal is to parse typical, often messy, indoor scenes into floor, walls, supporting surfaces, and object regions, and to recover support relationships. One of our main interests is to better understand how 3D cues can best inform a structured 3D interpretation. We also contribute a novel integer programming formulation to infer physical support relations. We offer a new dataset of 1449 RGBD images, capturing 464 diverse indoor scenes, with detailed annotations. Our experiments demonstrate our ability to infer support relations in complex scenes and verify that our 3D scene cues and inferred support lead to better object segmentation.
Article
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves improved segmentation of PASCAL VOC (30% relative improvement to 67.2% mean IU on 2012), NYUDv2, SIFT Flow, and PASCAL-Context, while inference takes one tenth of a second for a typical image.
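The skip architecture described above can be illustrated in a few lines of PyTorch. Bilinear interpolation stands in for the learned deconvolution used in the paper, and the layer names and sizes are illustrative.

import torch
import torch.nn.functional as F

# Fuse a coarse, semantically strong prediction with a prediction from a
# finer, shallower layer, then upsample to full image resolution.
coarse = torch.randn(1, 21, 16, 16)    # class scores from a deep layer (e.g. pool5)
fine = torch.randn(1, 21, 32, 32)      # class scores from a shallower layer (e.g. pool4)

up = F.interpolate(coarse, scale_factor=2, mode='bilinear', align_corners=False)
fused = up + fine                      # skip connection: semantics + appearance
full = F.interpolate(fused, scale_factor=16, mode='bilinear', align_corners=False)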
Article
We propose a novel semantic segmentation algorithm by learning a deconvolution network. We learn the network on top of the convolutional layers adopted from VGG 16-layer net. The deconvolution network is composed of deconvolution and unpooling layers, which identify pixel-wise class labels and predict segmentation masks. We apply the trained network to each proposal in an input image, and construct the final semantic segmentation map by combining the results from all proposals in a simple manner. The proposed algorithm mitigates the limitations of the existing methods based on fully convolutional networks by integrating deep deconvolution network and proposal-wise prediction; our segmentation method typically identifies detailed structures and handles objects in multiple scales naturally. Our network demonstrates outstanding performance in PASCAL VOC 2012 dataset, and we achieve the best accuracy (72.5%) among the methods trained with no external data through ensemble with the fully convolutional network.
Article
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
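What batch normalization computes during training can be written out directly; a minimal PyTorch sketch (inference-time running statistics omitted):

import torch

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # x: (batch, channels, H, W); statistics are taken per channel over the
    # mini-batch, then a learned scale (gamma) and shift (beta) are applied.
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)     # zero mean, unit variance
    return gamma * x_hat + beta                    # restore representational power

x = torch.randn(8, 16, 10, 10)
gamma = torch.ones(1, 16, 1, 1); beta = torch.zeros(1, 16, 1, 1)
out = batch_norm_train(x, gamma, beta)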
Article
In this paper we address three different computer vision tasks using a single basic architecture: depth prediction, surface normal estimation, and semantic labeling. We use a multiscale convolutional network that is able to adapt easily to each task using only small modifications, regressing from the input image to the output map directly. Our method progressively refines predictions using a sequence of scales, and captures many image details without any superpixels or low-level segmentation. We achieve state-of-the-art performance on benchmarks for all three tasks.
Article
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Conference Paper
Today, visual recognition systems are still rarely employed in robotics applications. Perhaps one of the main reasons for this is the lack of demanding benchmarks that mimic such scenarios. In this paper, we take advantage of our autonomous driving platform to develop novel challenging benchmarks for the tasks of stereo, optical flow, visual odometry/SLAM and 3D object detection. Our recording platform is equipped with four high resolution video cameras, a Velodyne laser scanner and a state-of-the-art localization system. Our benchmarks comprise 389 stereo and optical flow image pairs, stereo visual odometry sequences of 39.2 km length, and more than 200k 3D object annotations captured in cluttered scenarios (up to 15 cars and 30 pedestrians are visible per image). Results from state-of-the-art algorithms reveal that methods ranking high on established datasets such as Middlebury perform below average when being moved outside the laboratory to the real world. Our goal is to reduce this bias by providing challenging benchmarks with novel difficulties to the computer vision community. Our benchmarks are available online at: www.cvlibs.net/datasets/kitti.
Article
In this paper, we propose a cascade classifier combining AdaBoost and a support vector machine, and apply it to pedestrian detection. Pedestrian detection involves using a fixed-size window to extract candidate regions from left to right and top to bottom of the image, and performing feature extraction on each candidate region. Finally, our proposed cascade classifier completes the classification of the candidate regions. The cascade-AdaBoost classifier has been successfully used in pedestrian detection. We have improved the initial setting of the training sample weights in the AdaBoost classifier, so that the selected weak classifiers focus on a high detection rate rather than accuracy. The proposed cascade classifier can automatically select AdaBoost or SVM stages according to the training samples, so as to effectively improve classification performance and reduce training time. To verify the proposed method, we used our own database of extracted pedestrian training samples, as well as the PETS, INRIA and MIT databases, and compared the resulting pedestrian detection performance with that of the cascade-AdaBoost classifier and the support vector machine. The experiments showed that in simple environments, involving campus experimental images and the PETS database, both our cascade classifier and the other classifiers attain good results, while in complicated environments, involving the INRIA and MIT databases, our cascade classifier outperforms the other classifiers.
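The fixed-size window scan described above is straightforward to sketch; the window size and stride below are assumptions.

import numpy as np

def sliding_windows(image, win=(128, 64), stride=16):
    """Yield candidate regions left-to-right, top-to-bottom, as the abstract
    describes; each patch would then be scored by the AdaBoost/SVM cascade."""
    H, W = image.shape[:2]
    for y in range(0, H - win[0] + 1, stride):         # top to bottom
        for x in range(0, W - win[1] + 1, stride):     # left to right
            yield (x, y), image[y:y + win[0], x:x + win[1]]

image = np.zeros((480, 640), dtype=np.uint8)
candidates = [(pos, patch) for pos, patch in sliding_windows(image)]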
Article
In recent years, active learning has emerged as a powerful tool in building robust systems for object detection using computer vision. Indeed, active learning approaches to on-road vehicle detection have achieved impressive results. While active learning approaches for object detection have been explored and presented in the literature, few studies have been performed to comparatively assess costs and merits. In this study, we provide a cost-sensitive analysis of three popular active learning methods for on-road vehicle detection. The generality of active learning findings is demonstrated via learning experiments performed with detectors based on histogram of oriented gradient features and SVM classification (HOG–SVM), and Haar-like features and Adaboost classification (Haar–Adaboost). Experimental evaluation has been performed on static images and real-world on-road vehicle datasets. Learning approaches are assessed in terms of the time spent annotating, data required, recall, and precision.
Conference Paper
We present a hierarchical model that learns image decompositions via alternating layers of convolutional sparse coding and max pooling. When trained on natural images, the layers of our model capture image information in a variety of forms: low-level edges, mid-level edge junctions, high-level object parts and complete objects. To build our model we rely on a novel inference scheme that ensures each layer reconstructs the input, rather than just the output of the layer directly beneath, as is common with existing hierarchical approaches. This makes it possible to learn multiple layers of representation and we show models with 4 layers, trained on images from the Caltech-101 and 256 datasets. When combined with a standard classifier, features extracted from these models outperform SIFT, as well as representations from other feature learning methods.
Article
Visual object analysis researchers are increasingly experimenting with video, because it is expected that motion cues should help with detection, recognition, and other analysis tasks. This paper presents the Cambridge-driving Labeled Video Database (CamVid) as the first collection of videos with object class semantic labels, complete with metadata. The database provides ground truth labels that associate each pixel with one of 32 semantic classes. The database addresses the need for experimental data to quantitatively evaluate emerging algorithms. While most videos are filmed with fixed-position CCTV-style cameras, our data was captured from the perspective of a driving automobile. The driving scenario increases the number and heterogeneity of the observed object classes. Over 10 min of high quality 30 Hz footage is being provided, with corresponding semantically labeled images at 1 Hz and in part, 15 Hz. The CamVid Database offers four contributions that are relevant to object analysis researchers. First, the per-pixel semantic segmentation of over 700 images was specified manually, and was then inspected and confirmed by a second person for accuracy. Second, the high-quality and large resolution color video images in the database represent valuable extended duration digitized footage to those interested in driving scenarios or ego-motion. Third, we filmed calibration sequences for the camera color response and intrinsics, and computed a 3D camera pose for each frame in the sequences. Finally, in support of expanding this or other databases, we present custom-made labeling software for assisting users who wish to paint precise class-labels for other images and videos. We evaluate the relevance of the database by measuring the performance of an algorithm from each of three distinct domains: multi-class object recognition, pedestrian detection, and label propagation.
Article
Research in object detection and recognition in cluttered scenes requires large image collections with ground truth labels. The labels should provide information about the object classes present in each image, as well as their shape and locations, and possibly other attributes such as pose. Such data is useful for testing, as well as for supervised learning. This project provides a web-based annotation tool that makes it easy to annotate images, and to instantly share such annotations with the community. This tool, plus an initial set of 10,000 images (3000 of which have been labeled), can be found at http://www.csail.mit.edu/~brussell/research/LabelMe/intro.html.
Algorithm of vehicle detection and pattern recognition using SVM
  • guangying
Simultaneous detection and segmentation
  • B Hariharan
  • P Arbeláez
  • R Girshick
  • J Malik