Article

Traffic Scene Segmentation Based on RGB-D Image and Deep Learning


Abstract

Semantic segmentation of traffic scenes has potential applications in intelligent transportation systems. Deep learning techniques can improve segmentation accuracy, especially when information from depth maps is introduced. However, little research has been done on applying depth maps to the segmentation of traffic scenes. In this paper, we propose a method for semantic segmentation of traffic scenes based on RGB-D images and deep learning. The semi-global stereo matching algorithm and the fast global image smoothing method are employed to obtain a smooth disparity map. We present a new deep fully convolutional neural network architecture for semantic pixel-wise segmentation. We test the performance of the proposed network architecture using RGB-D images as input and compare the results with the method that takes only RGB images as input. The experimental results show that introducing the disparity map helps to improve semantic segmentation accuracy and that our proposed network architecture achieves good real-time performance and competitive segmentation accuracy.
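As a concrete illustration of the pipeline summarized in the abstract, the sketch below builds a smoothed disparity channel with OpenCV's semi-global block matcher and the fast global smoother from the opencv-contrib ximgproc module, then stacks it with the RGB channels. The parameter values and the helper name make_rgbd_input are illustrative assumptions, not the paper's settings.

```python
# Sketch: build a smoothed disparity map and stack it with RGB as a 4-channel
# network input. Assumes OpenCV with opencv-contrib (cv2.ximgproc); the
# parameter values are illustrative, not those used in the paper.
import cv2
import numpy as np

def make_rgbd_input(left_bgr, right_bgr):
    left_gray = cv2.cvtColor(left_bgr, cv2.COLOR_BGR2GRAY)
    right_gray = cv2.cvtColor(right_bgr, cv2.COLOR_BGR2GRAY)

    # Semi-global matching (SGBM) for a raw disparity map.
    sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128,
                                 blockSize=5, P1=8 * 5 * 5, P2=32 * 5 * 5)
    disparity = sgbm.compute(left_gray, right_gray).astype(np.float32) / 16.0

    # Fast global image smoothing, guided by the left image, to fill holes
    # and suppress noise in the disparity map.
    fgs = cv2.ximgproc.createFastGlobalSmootherFilter(left_bgr, 8000.0, 1.5)
    disparity_smooth = fgs.filter(disparity)

    # Normalise each channel and stack RGB + disparity into a 4-channel input.
    rgb = cv2.cvtColor(left_bgr, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    d = cv2.normalize(disparity_smooth, None, 0.0, 1.0, cv2.NORM_MINMAX)
    return np.dstack([rgb, d])  # H x W x 4, fed to the fully convolutional net
```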


... For example, Linhui et al. obtained larger receptive fields by increasing the depth of the model. This brings some improvement, but it also increases the memory consumption and computation of the model, so it has not been widely adopted [1]. ...
... As shown in Eq. (1), the calculation of error backpropagation proceeds from back to front, which is exactly the opposite of the direction of model inference. Assuming that the data set D = {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(N), y^(N))} is used, the numerical value g_i of the gradient with respect to parameter i can be obtained as: ...
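The snippet above truncates the gradient expression at the ellipsis; for reference, a standard empirical-risk form of such a dataset-averaged gradient (an assumed notation, not necessarily the cited paper's exact Eq. (1)) is:

```latex
% Standard dataset-averaged gradient for parameter \theta_i, with loss L and
% model f parameterised by \theta (assumed notation, not the cited equation):
g_i = \frac{1}{N} \sum_{n=1}^{N}
      \frac{\partial L\!\left(f(x^{(n)}; \theta),\, y^{(n)}\right)}{\partial \theta_i}
```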
Article
Full-text available
Semantic image segmentation in computer networks is designed to determine the category to which each pixel in an image belongs. It is a basic computer vision task and has a very wide range of applications in practice. In recent years, semantic image segmentation algorithms in computer networks based on deep learning have attracted widespread attention due to their fast speed and high accuracy. However, due to the large number of downsampling layers in a deep learning model, the segmentation results are usually poor at the edge of an object, and there is currently no universal quantitative evaluation index to measure the performance of segmentation at the edge of an object. Solving these two problems is of great significance to semantic image segmentation algorithms in China. Based on traditional evaluation indicators, this paper proposes a region-based evaluation index to quantitatively measure the performance of segmentation at the edge of an object and proposes an improved loss function to improve model performance. The existing semantic image segmentation methods are summarized. This paper proposes regional-based evaluation indicators. Taking advantage of the particularity of semantic image segmentation tasks, this paper presents an efficient and accurate method for extracting the edges of objects. By defining the distance from pixels to the edges of objects, this paper proposes a fast algorithm for calculating the edge area. Based on this, three methods are proposed as well as an area-based evaluation indicator. The experimental results show that the accuracy of the loss function proposed in this paper, compared with that of the current mainstream cross-entropy loss function, is improved by 1% on the DeepLab model. For area-based evaluation indicators, a 4% accuracy improvement can be achieved, and on other segmentation models, there is also a significant improvement.
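A rough sketch of how such an edge-region metric can be computed is given below: object edges are taken from the ground-truth label map, a distance transform yields the distance of every pixel to the nearest edge, and pixel accuracy is evaluated only inside the resulting edge band. The band width and function names are illustrative assumptions rather than the cited paper's exact definitions.

```python
# Sketch: region-based evaluation restricted to an "edge band" around object
# boundaries, built with a distance transform. Band width and names are
# illustrative assumptions, not the cited paper's exact definitions.
import cv2
import numpy as np

def edge_band_accuracy(pred, gt, band_px=5):
    """Pixel accuracy computed only inside a band of width band_px (pixels)
    around the ground-truth object edges. Labels assumed to fit in uint8."""
    gt8 = gt.astype(np.uint8)
    # Object edges: pixels whose label differs from a neighbour
    # (morphological gradient of the label map).
    kernel = np.ones((3, 3), np.uint8)
    edges = cv2.morphologyEx(gt8, cv2.MORPH_GRADIENT, kernel) > 0

    # Distance of every pixel to the nearest edge pixel.
    dist = cv2.distanceTransform((~edges).astype(np.uint8), cv2.DIST_L2, 3)
    band = dist <= band_px                    # the edge region used for evaluation

    correct = (pred == gt) & band
    return correct.sum() / max(band.sum(), 1)
```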
... In DCNN, a combination of maximum merging and downsampling is usually used to achieve invariance, but it will have a certain impact on image segmentation and positioning accuracy. Literature [23] proposed a semantic segmentation method for traffic scenes based on RGB-D images and deep learning. A semi-global stereo matching algorithm and a fast global image smoothing method are used to obtain a smooth disparity map. ...
... The PASCAL VOC 2012 data set contains 21 categories, among which ten more common categories are selected for experimental demonstration. The proposed method is compared with the methods in literature [18,23,25] for the recognition accuracy of each category. The MIOU value is shown in Table 2. ...
... In order to further demonstrate the segmentation performance of the proposed method, compare it with the methods in literature [18,23,25]. The results of MPA and MIOU are shown in Table 3. ...
Article
Full-text available
In order to solve the problems in the existing image semantic segmentation methods, such as the poor segmentation accuracy of small target object and the difficulty in segmentation of small target area, an image semantic segmentation method based on improved ERFNet model is proposed. Firstly, combining the asymmetric residual module and the weak bottleneck module, the ERFNet network model is improved to improve the running speed and reduce the loss of precision. Then, global pooling is used to fuse the feature channels after pyramid pooling to preserve more important feature information. Finally, the network model is implemented based on PyTorch deep learning framework, and the proposed method is demonstrated by experiments, in which the model retraining method is adopted to learn and train it. The experimental results show that the proposed method improves the segmentation ability of small‐scale objects and reduces the possibility of misclassification. The average pixel accuracy (MPA) and average intersection merge ratio (MIOU) of the proposed method are higher than those of other contrast methods.
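For illustration, the block below sketches an asymmetric (factorised) residual module in the spirit of ERFNet's non-bottleneck-1D design, where a 3×3 convolution is split into 3×1 and 1×3 convolutions; the channel count and dilation are placeholders and the exact improved modules of the cited paper may differ.

```python
# Sketch of an asymmetric (factorised) residual block in the spirit of
# ERFNet's non-bottleneck-1D design: each 3x3 convolution is split into 3x1
# and 1x3 convolutions to cut parameters while keeping the receptive field.
# Channel counts and dilation are illustrative assumptions.
import torch
import torch.nn as nn

class AsymmetricResidualBlock(nn.Module):
    def __init__(self, channels, dilation=1):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0)),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1)),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, (3, 1),
                      padding=(dilation, 0), dilation=(dilation, 1)),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, (1, 3),
                      padding=(0, dilation), dilation=(1, dilation)),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.conv(x))  # residual connection
```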
... Li et al. [20] perform traffic-scene segmentation on RGB-D images to make use of the depth information from the images. ...
... In [20], the authors proposed to use depth images (RGB-D) for the task of semantic segmentation. The technique involved building a coarse disparity map and then using modified AlexNet on RGB-D images of the Cityscapes dataset. ...
... where TP is True Positive, FN is the False Negative, TN is the True Negative, and FP is the False Positive. Table 5 shows the accuracy results given by the AUC score for each object by ϑinspect and compares it with the state-of-the-art method described in [20]. It can be seen that 8 classes achieve better results than [20], with a considerable improvement in detection for buildings, sidewalks and trees. ...
Article
In recent times, Autonomous Moving Platforms (AMP) have been a vital component for various industrial sectors across the globe as they include a diverse set of aerial, marine, and land-based vehicles. The emergence and the rise of AMP necessitate a precise object-level understanding of the environment, which directly impacts functions like decision making, speed control, and direction of the autonomous driving vehicles. Obstacle detection and object classification are the key issues in the AMP. The autonomous vehicle is designed to move on city roads and it should be bolstered with high-quality object detection/segmentation mechanisms since inaccurate movements and speed limits can prove to be fatal. Motivated by the aforementioned discussion, in this paper, we present ϑinspect (velocity-inspect), an AI-based 5G enabled object segmentation and speed limit identification scheme for self-driving cars on city roads. In ϑinspect, the Convolutional Neural Network (CNN) based semantic image segmentation is carried out to segment the objects as interpreted from the Cityscapes dataset. Then, object clustering is done using the K-Means approach based on the number of unique objects. The semantic segmentation is done over 12 classes and the model outperforms state-of-the-art approaches on various parameters such as latency and accuracy (82.2%). Further, K-Means clustering based Speed Range Analyser (SRA) is proposed to determine the acceptable and safe speed range for the vehicle, which is computed based on the object density of every object in the environment. The results show that the proposed scheme outperforms traditional schemes in terms of latency and accuracy.
... However, because of complex application scenarios such as haze, shadow, low luminance of lighting conditions, it is hard to get satisfactory road detection results using traditional Digital Image Processing (DIP) methods. Whereas, with the development of Deep Neural Networks (DNNs), a multitude of researchers apply DNNs to address the road detection problem [1][2][3][4][5]. ...
... Recently, a number of studies have been carried out to address the problem. Li [3] added depth information of images to the input channel of their networks to get favorable detection performance. Pohlen [4] proposed FRRN to incorporate boundary information with semantic information to improve accuracy. ...
... So we cropped a third of the height of the data images to ignore sky information, and then we resized the images to 128×32 pixels and turned them into greyscale images. In TABLE III, we compare the performance of our algorithm with other work [3,4,5,11]. Our FPGA-based accelerator can achieve over 85.0% classification accuracy on the KITTI and Cityscapes datasets, and our work can also achieve 77.6% accuracy in road and lane detection on the KITTI dataset. ...
Article
Road detection is widely used in driving assistance and automotive driving. However, many state-of-the-art road detection methods are time-consuming and memory-consuming. In this paper, we propose an FPGA-based deep learning accelerator using the binary SegNet (BSegNet) with computing-near-memory (CNM) architecture for road detection at edges. The accelerator has optimized CNM architecture with massive bit-level parallel processing elements (PEs) and pipeline for low latency of the critical path. The training model size of BSegNet with binary parameters is only 2.1MB, and the BSegNet can achieve training accuracy over 85% on KITTI and Cityscapes datasets. The RTL-level realized FPGA-accelerator can process the road detection with an energy-efficiency of 351.7 GOPs/W and only 18.70 W on-chip power.
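Binary parameters of the kind that make a 2.1 MB model possible are commonly obtained by binarising weights with a straight-through estimator; the sketch below shows that generic technique in PyTorch and is not claimed to be BSegNet's exact training scheme.

```python
# Sketch: binarising convolution weights with a straight-through estimator
# (STE), one common way to shrink a SegNet-style model to ~1 bit per weight.
# This illustrates the general technique, not BSegNet's exact scheme.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeWeight(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)               # forward pass uses {-1, +1} weights

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        # STE: pass gradients through, zeroed where |w| > 1.
        return grad_out * (w.abs() <= 1).float()

class BinaryConv2d(nn.Conv2d):
    def forward(self, x):
        w_bin = BinarizeWeight.apply(self.weight)
        return F.conv2d(x, w_bin, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```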
... In addition to the aforementioned multimedia content, there are other modalities of data that are also useful for scene understanding, e.g., depth image, Lidar point cloud, thermal infrared image. By using them with RGB images as input, cross-modal perceiving has attracted increasing attention in real-world applications, e.g., scene parsing for autonomous driving [181], [85], object detection and tracking in low-light scenarios [182], [183], and action recognition [184]. There are three ways of fusing multi-modal data, i.e., at the input level [181], at the feature level [185], [186], [85], [182], [183], and at the output level [184], respectively. ...
... By using them with RGB images as input, cross-modal perceiving has attracted increasing attention in real-world applications, e.g., scene parsing for autonomous driving [181], [85], object detection and tracking in low-light scenarios [182], [183], and action recognition [184]. There are three ways of fusing multi-modal data, i.e., at the input level [181], at the feature level [185], [186], [85], [182], [183], and at the output level [184], respectively. Among them, fusing multi-modal data at the feature level is most prevalent, which can be further categorized into three groups, i.e., early fusion [186], late fusion [185], and fusion at multiple levels [85], [182]. ...
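The sketch below contrasts two of the fusion strategies mentioned above, input-level and feature-level fusion of RGB and depth, using small placeholder encoders; module names and channel sizes are illustrative assumptions.

```python
# Sketch contrasting input-level and feature-level fusion of RGB and depth.
# The encoders are placeholders; names and channel sizes are illustrative.
import torch
import torch.nn as nn

def small_encoder(in_ch):
    return nn.Sequential(nn.Conv2d(in_ch, 32, 3, stride=2, padding=1),
                         nn.ReLU(inplace=True),
                         nn.Conv2d(32, 64, 3, stride=2, padding=1),
                         nn.ReLU(inplace=True))

class InputLevelFusion(nn.Module):
    """Input-level fusion: stack RGB and depth into a 4-channel input."""
    def __init__(self):
        super().__init__()
        self.encoder = small_encoder(in_ch=4)

    def forward(self, rgb, depth):
        return self.encoder(torch.cat([rgb, depth], dim=1))

class FeatureLevelFusion(nn.Module):
    """Feature-level (late) fusion: separate encoders, features concatenated."""
    def __init__(self):
        super().__init__()
        self.rgb_enc = small_encoder(in_ch=3)
        self.depth_enc = small_encoder(in_ch=1)
        self.fuse = nn.Conv2d(128, 64, kernel_size=1)

    def forward(self, rgb, depth):
        f = torch.cat([self.rgb_enc(rgb), self.depth_enc(depth)], dim=1)
        return self.fuse(f)
```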
Preprint
In the Internet of Things (IoT) era, billions of sensors and devices collect and process data from the environment, transmit them to cloud centers, and receive feedback via the internet for connectivity and perception. However, transmitting massive amounts of heterogeneous data, perceiving complex environments from these data, and then making smart decisions in a timely manner are difficult. Artificial intelligence (AI), especially deep learning, is now a proven success in various areas including computer vision, speech recognition, and natural language processing. AI introduced into the IoT heralds the era of artificial intelligence of things (AIoT). This paper presents a comprehensive survey on AIoT to show how AI can empower the IoT to make it faster, smarter, greener, and safer. Specifically, we briefly present the AIoT architecture in the context of cloud computing, fog computing, and edge computing. Then, we present progress in AI research for IoT from four perspectives: perceiving, learning, reasoning, and behaving. Next, we summarize some promising applications of AIoT that are likely to profoundly reshape our world. Finally, we highlight the challenges facing AIoT and some potential research opportunities.
... The state estimation of vehicle dynamic systems has been widely discussed in the literature [15][16][17][18][19][20][21][22][23][24][25][26][27][28][29][30][31][32]. Typical estimation strategies include Kalman filter-based (KF) estimation [15][16][17][18], the recursive least-squares (RLS) method [19][20][21], the Luenberger observer [22], sliding-mode observers [23,24], and other non-linear observers [25][26][27][28][29][30][31][32]. ...
... The state estimation of vehicle dynamic systems has been widely discussed in the literature [15][16][17][18][19][20][21][22][23][24][25][26][27][28][29][30][31][32]. Typical estimation strategies include Kalman filter-based (KF) estimation [15][16][17][18], the recursive least-squares (RLS) method [19][20][21], the Luenberger observer [22], sliding-mode observers [23,24], and other non-linear observers [25][26][27][28][29][30][31][32]. For instance, the research in [21] developed an adaptive forgetting-factor RLS to estimate the linear cornering stiffness of the front and rear tyres. ...
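For reference, the block below sketches a generic recursive least-squares update with a forgetting factor, the estimator family referred to for cornering-stiffness identification; the regressor and measurement definitions are application specific and the class name is illustrative.

```python
# Sketch: generic recursive least-squares (RLS) update with a forgetting
# factor, as used for on-line parameter identification such as cornering
# stiffness. The regressor phi and measurement y are application specific.
import numpy as np

class ForgettingRLS:
    def __init__(self, n_params, forgetting=0.98, p0=1e3):
        self.theta = np.zeros((n_params, 1))   # parameter estimate
        self.P = np.eye(n_params) * p0         # covariance matrix
        self.lam = forgetting                  # forgetting factor, 0 < lam <= 1

    def update(self, phi, y):
        phi = phi.reshape(-1, 1)
        # Gain, estimate and covariance updates (standard RLS recursion).
        k = self.P @ phi / (self.lam + phi.T @ self.P @ phi)
        err = y - float(phi.T @ self.theta)
        self.theta += k * err
        self.P = (self.P - k @ phi.T @ self.P) / self.lam
        return self.theta.ravel()
```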
Article
Full-text available
Accurate knowledge of vehicle inertial parameters (e.g. vehicle mass and yaw moment of inertia) is essential to manage vehicle potential trajectories and improve vehicle active safety. For lightweight electric vehicles (LEVs), whose control performance of dynamics system can be substantially affected due to the drastic reduction of vehicle weights and body size, such knowledge is even more critical. This study proposes a dual unscented Kalman filter (DUKF) approach, where two UKFs run in parallel to simultaneously estimate vehicle states and parameters such as vehicle velocity, vehicle sideslip angle, and inertial parameters. The proposed method only utilises real‐time measurements from torque information of in‐wheel motor and sensors in a standard car. The four‐wheel non‐linear vehicle dynamics model considering payload variations is developed, local observability of the DUKF observer is analysed and derived via differential geometry theory. To address the non‐linearities in vehicle dynamics, the DUKF and dual extended Kalman filter (DEKF) are also presented and compared. Simulations with various manoeuvres are carried out using the platform of MATLAB/Simulink‐Carsim®. Simulation results of MATLAB/Simulink‐Carsim® show that the proposed DUKF method can effectively estimate inertial parameters of LEV under different payloads. Moreover, the investigation reveals that the proposed DUKF approach has better performance of estimating vehicle inertial parameters compared with the DEKF method.
... ii) projection of periodic 2D patterns and study of their deviation when they reach objects; and iii) projection of pseudo-random 2D patterns. For a semantic segmentation task involving urban/rural scenes, the work of (Li et al., 2017a) proposes a method based on RGB-D images of traffic scenes and DL. They use a new deep fully convolutional neural network architecture based on modifying the AlexNet (Krizhevsky et al., 2012) network for semantic pixel-wise segmentation. ...
... Table 1 summarises ten of the applications reviewed for each kind of data, comparing the input, the task, and the AI method chosen.
RGB-D 1: DL, Semantic segmentation, Urban/rural scenes (Li et al., 2017a)
RGB-D 2: DL, Semantic segmentation, PV module (Espinosa et al., 2020)
RGB-D 3: ML, Object detection, Shadow detection (Movia et al., 2016)
RGB-D 4: ML, Classification, Rice plants (Zheng et al., 2020)
RGB-D 5: DL, Semantic segmentation, Building scenes (Czerniawski and Leite, 2020)
RGB-D 6: DL, Object detection, Urban/rural scenes (Gong et al., 2018)
RGB-D 7: DL, Object detection, Urban/rural scenes (Duan et al., 2019)
RGB-D 8: DL, Clustering, Urban/rural scenes (Li et al., 2019b)
RGB-D 9: DL, Semantic segmentation, Rice plants
RGB-D 10: DL, Semantic segmentation, Urban/rural scenes
IRT 1: ML, Object detection, Electrical equipments (Ullah et al., 2017)
IRT 2: DL, Object detection, PV module (Akram et al., 2020) ...
Preprint
Full-text available
Researchers have explored the benefits and applications of modern artificial intelligence (AI) algorithms in different scenario. For the processing of geomatics data, AI offers overwhelming opportunities. Fundamental questions include how AI can be specifically applied to or must be specifically created for geomatics data. This change is also having a significant impact on geospatial data. The integration of AI approaches in geomatics has developed into the concept of Geospatial Artificial Intelligence (GeoAI), which is a new paradigm for geographic knowledge discovery and beyond. However, little systematic work currently exists on how researchers have applied AI for geospatial domains. Hence, this contribution outlines AI-based techniques for analysing and interpreting complex geomatics data. Our analysis has covered several gaps, for instance defining relationships between AI-based approaches and geomatics data. First, technologies and tools used for data acquisition are outlined, with a particular focus on RGB images, thermal images, 3D point clouds, trajectories, and hyperspectral/multispectral images. Then, how AI approaches have been exploited for the interpretation of geomatic data is explained. Finally, a broad set of examples of applications are given, together with the specific method applied. Limitations point towards unexplored areas for future investigations, serving as useful guidelines for future research directions.
... Applications such as autonomous vehicle navigation, mobile robotics, medicine, optical metrology, ubiquitous computing, among many others, have been effectively solved by computer vision. [1][2][3][4][5][6] In a computer vision system, the light reflected by the objects placed on a scene is captured by an imaging system. Afterward, the captured scene images are processed using advanced computational algorithms to extract useful information of the scene and make proper decisions. ...
... We can observe a constant evolution of network accuracy, due on the one hand to the improvement of networks and on the other hand to the increase in dataset sizes. Also, some tasks are recurrent in the community: segmentation of urban scenes [5,[12][13][14][15][16][17], indoor scene understanding [18][19][20][21] and medical image analysis [23][24][25]. The task addressed in this paper shares a common aspect with medical imaging. ...
Preprint
Full-text available
Robotics applications in urban environments are subject to obstacles that exhibit specular reflections hampering autonomous navigation. On the other hand, these reflections are highly polarized and this extra information can successfully be used to segment the specular areas. In nature, polarized light is obtained by reflection or scattering. Deep Convolutional Neural Networks (DCNNs) have shown excellent segmentation results, but require a significant amount of data to achieve the best performance. The lack of data is usually overcome by using augmentation methods. However, unlike RGB images, polarization images are not only scalar (intensity) images and standard augmentation techniques cannot be applied straightforwardly. We propose to enhance deep learning models through a regularized augmentation procedure applied to polarimetric data in order to characterize scenes more effectively under challenging conditions. We subsequently observe an average of 18.1% improvement in IoU between non-augmented and regularized training procedures on real-world data.
... For example, in medicine, image segmentation can detect abnormal tissues of the body [1][2][3][4][5][6][7][8][9][10][11] and extract abnormal parts. In terms of transportation, the vehicle image is segmented so that the vehicle information can be accurately identified [12]. In terms of public safety, there are face detection [13], fingerprint recognition [14], etc. ...
Article
Full-text available
The diagnosis of brain diseases based on magnetic resonance imaging (MRI) is a mainstream practice. In the course of practical treatment, medical personnel observe and analyze the changes in the size, position, and shape of various brain tissues in the brain MRI image, thereby judging whether the brain tissue has been diseased, and formulating the corresponding medical plan. The conclusion drawn after observing the image will be influenced by the subjective experience of the experts and is not objective. Therefore, it has become necessary to try to avoid subjective factors interfering with the diagnosis. This paper proposes an intelligent diagnosis model based on improved deep convolutional neural network (IDCNN). This model introduces integrated support vector machine (SVM) into IDCNN. During image segmentation, if IDCNN has problems such as irrational layer settings, too many parameters, etc., it will make its segmentation accuracy low. This study made a slight adjustment to the structure of IDCNN. First, adjust the number of convolution layers and down-sampling layers in the DCNN network structure, adjust the network's activation function, and optimize the parameters to improve IDCNN's non-linear expression ability. Then, use the integrated SVM classifier to replace the original Softmax classifier in IDCNN to improve its classification ability. The simulation experiment results show that, compared with the model before improvement and other classic classifiers, IDCNN improves segmentation results and promotes the intelligent diagnosis of brain tissue.
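The general idea of replacing a Softmax head with an SVM trained on deep features can be sketched as below, using scikit-learn's SVC on the penultimate-layer activations; the feature extractor and loaders are placeholders and this is not the paper's exact IDCNN.

```python
# Sketch: replacing a CNN's softmax head with an SVM trained on the CNN's
# penultimate-layer features, illustrating the general IDCNN+SVM idea.
# The feature extractor and data loader are placeholders.
import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import SVC

@torch.no_grad()
def extract_features(backbone: nn.Module, loader):
    feats, labels = [], []
    backbone.eval()
    for x, y in loader:
        f = backbone(x)                       # (N, C) penultimate features
        feats.append(f.flatten(1).cpu().numpy())
        labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

def fit_svm_head(backbone, train_loader):
    X, y = extract_features(backbone, train_loader)
    svm = SVC(kernel="rbf", C=1.0)            # SVM replaces the softmax layer
    svm.fit(X, y)
    return svm
```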
... It aims to provide detailed pixel-level image classification, which amounts to assigning semantic labels to each pixel. It is a critical step to achieve deep understanding of different kinds of objects (such as road, human and car) in urban street scenes, and has been widely used in a variety of intelligent transportation systems, such as automotive driving, vehicle safety and video surveillance [1], [2], [3], [4]. These systems usually exhibit a strong demand for real-time inference speed and efficient interaction. ...
Preprint
Deep Convolutional Neural Networks (DCNNs) have recently shown outstanding performance in semantic image segmentation. However, state-of-the-art DCNN-based semantic segmentation methods usually suffer from high computational complexity due to the use of complex network architectures. This greatly limits their applications in the real-world scenarios that require real-time processing. In this paper, we propose a real-time high-performance DCNN-based method for robust semantic segmentation of urban street scenes, which achieves a good trade-off between accuracy and speed. Specifically, a Lightweight Baseline Network with Atrous convolution and Attention (LBN-AA) is firstly used as our baseline network to efficiently obtain dense feature maps. Then, the Distinctive Atrous Spatial Pyramid Pooling (DASPP), which exploits the different sizes of pooling operations to encode the rich and distinctive semantic information, is developed to detect objects at multiple scales. Meanwhile, a Spatial detail-Preserving Network (SPN) with shallow convolutional layers is designed to generate high-resolution feature maps preserving the detailed spatial information. Finally, a simple but practical Feature Fusion Network (FFN) is used to effectively combine both shallow and deep features from the semantic branch (DASPP) and the spatial branch (SPN), respectively. Extensive experimental results show that the proposed method respectively achieves the accuracy of 73.6% and 68.0% mean Intersection over Union (mIoU) with the inference speed of 51.0 fps and 39.3 fps on the challenging Cityscapes and CamVid test datasets (by only using a single NVIDIA TITAN X card). This demonstrates that the proposed method offers excellent performance at the real-time speed for semantic segmentation of urban street scenes.
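A minimal sketch of an atrous spatial pyramid pooling style block, in the spirit of the DASPP idea of parallel dilated branches plus a pooled branch, is shown below; the dilation rates and channel sizes are illustrative assumptions.

```python
# Sketch of an atrous spatial pyramid pooling (ASPP)-style block in the
# spirit of DASPP: parallel dilated convolutions plus a pooled branch.
# Dilation rates and channel sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPPLike(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        ])
        self.pool_branch = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                         nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.pool_branch(x), size=(h, w),
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))
```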
... Compared to indoor scene depth maps from Kinect or RealSense, outdoor traffic scene depth maps are much more sparse. Li et al. [21] simply stacked smoothed depth maps with RGB images as a 4-channel input. Based on VGG [22], Kreso et al. [23] introduced a scale selection layer and used the depth maps as a guidance to produce a scale-invariant representation to free appearance from the scale. ...
Preprint
Full-text available
Semantic segmentation has made striking progress due to the success of deep convolutional neural networks. Considering the demand of autonomous driving, real-time semantic segmentation has become a research hotspot these years. However, few real-time RGB-D fusion semantic segmentation studies are carried out despite readily accessible depth information nowadays. In this paper, we propose a real-time fusion semantic segmentation network termed RFNet that efficiently exploits complementary features from depth information to enhance the performance in an attention-augmented way, while running swiftly, which is a necessity for autonomous vehicle applications. Multi-dataset training is leveraged to incorporate unexpected small obstacle detection, enriching the recognizable classes required to face unforeseen hazards in the real world. A comprehensive set of experiments demonstrates the effectiveness of our framework. On Cityscapes, our method outperforms previous state-of-the-art semantic segmenters, with excellent accuracy and 22Hz inference speed at the full 2048×1024 resolution, outperforming most existing RGB-D networks.
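One simple way to realise attention-augmented RGB-D fusion of the kind described is to re-weight depth features channel-wise before adding them to the RGB branch, as sketched below with an SE-style attention block; RFNet's actual fusion module may differ.

```python
# Sketch: attention-augmented fusion of depth features into RGB features,
# in the spirit of RFNet's fusion idea (an SE-style channel attention here;
# the paper's actual module may differ).
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat, depth_feat):
        # Depth features are re-weighted channel-wise before being added
        # to the RGB branch.
        weights = self.attn(depth_feat)
        return rgb_feat + depth_feat * weights
```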
... Figure caption: Accuracy (mIoU) and inference speed (FPS) obtained by several state-of-the-art semantic segmentation methods, including SwiftNet [2], PSPNet [7], ENet [13], ERFNet [14], BiSeNet [15], ICNet [16], LEDNet [17], RTHP [18], DFANet [19], ESPNet [20], FCN-8s [24], DeepLab [25], CRF-RNN [26], SegNet [27], SQNet [28], FRRN [29], TwoColumn [30], and the proposed DMA-Net on the Cityscapes test set. In the past few decades, semantic segmentation in street scenes has attracted increasing attention, mainly due to its important role in autonomous driving systems [1]-[4]. Generally, these systems demand fast inference speed for interaction and response. ...
Preprint
Real-time semantic segmentation, which aims to achieve high segmentation accuracy at real-time inference speed, has received substantial attention over the past few years. However, many state-of-the-art real-time semantic segmentation methods tend to sacrifice some spatial details or contextual information for fast inference, thus leading to degradation in segmentation quality. In this paper, we propose a novel Deep Multi-branch Aggregation Network (called DMA-Net) based on the encoder-decoder structure to perform real-time semantic segmentation in street scenes. Specifically, we first adopt ResNet-18 as the encoder to efficiently generate various levels of feature maps from different stages of convolutions. Then, we develop a Multi-branch Aggregation Network (MAN) as the decoder to effectively aggregate different levels of feature maps and capture the multi-scale information. In MAN, a lattice enhanced residual block is designed to enhance feature representations of the network by taking advantage of the lattice structure. Meanwhile, a feature transformation block is introduced to explicitly transform the feature map from the neighboring branch before feature aggregation. Moreover, a global context block is used to exploit the global contextual information. These key components are tightly combined and jointly optimized in a unified network. Extensive experimental results on the challenging Cityscapes and CamVid datasets demonstrate that our proposed DMA-Net respectively obtains 77.0% and 73.6% mean Intersection over Union (mIoU) at the inference speed of 46.7 FPS and 119.8 FPS by only using a single NVIDIA GTX 1080Ti GPU. This shows that DMA-Net provides a good tradeoff between segmentation quality and speed for semantic segmentation in street scenes.
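The encoder side of such a design can be sketched as below: torchvision's ResNet-18 is split so that the feature maps of its four stages (at strides 4, 8, 16 and 32) are exposed for a decoder to aggregate; the decoder itself is omitted and the class name is illustrative.

```python
# Sketch: using torchvision's ResNet-18 as an encoder that exposes feature
# maps from its four stages, as a decoder like MAN would aggregate them.
# Only the encoder side is shown; the class name is illustrative.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ResNet18Encoder(nn.Module):
    def __init__(self, pretrained=False):
        super().__init__()
        net = resnet18(pretrained=pretrained)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.stages = nn.ModuleList([net.layer1, net.layer2,
                                     net.layer3, net.layer4])

    def forward(self, x):
        feats = []
        x = self.stem(x)
        for stage in self.stages:        # strides 4, 8, 16, 32 w.r.t. the input
            x = stage(x)
            feats.append(x)
        return feats                     # multi-level maps for the decoder

# Example: four feature maps for a 512x1024 input.
# maps = ResNet18Encoder()(torch.randn(1, 3, 512, 1024))
```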
... Simply put, semantic segmentation can be interpreted as a task to classify object categories and locate them at the pixel level of the captured image. Its applications are very broad, covering recent developments in fields such as satellite imagery [1], medical imaging [2][3][4], robotics [5][6][7], and autonomous vehicles [8][9][10][11][12]. Various solutions and models have been competing with each other to become the best, such as FCN [13], SegNet [14,15], ICNet [16], DeepLab(s) [17][18][19], PSPNet [20], and many more. ...
Article
Semantic segmentation has become one of the trending topics in the world of computer vision and deep learning. Recently, due to an increasing demand to solve a semantic segmentation task simultaneously with attribute recognition of objects, a new task named attribute‐aware semantic segmentation has been introduced. Since the task requires to handle pixel‐wise object class estimation with its attributes such as a pedestrian's body orientation, previous works had difficulties to handle ambiguous attributes such as body orientations in object‐level, especially when segmenting the pedestrians with their attributes correctly. This paper proposes the ColAtt‐Net that is an attribute‐aware semantic segmentation model augmented by a column‐wise mask branch to predict the pedestrians' orientations in the horizontal perspective of the input image. We firmly assume that the pedestrians captured by a car‐mounted camera are distributed horizontally so that for each column of the input image, the pedestrian pixels can be labeled with one orientation uniformly. In the proposed method, we split the output of the base semantic segmentation model into two branches; one branch for segmenting the object categories, while the other one, as the novel column‐wise attribute branch, is to map the recognition of pedestrian's orientations that are distributed horizontally. This method successfully enhances the performance of attribute‐aware semantic segmentation by reducing the ambiguity on segmenting the pedestrian's orientation. Improvements on the pedestrian orientation segmentation are confidently shown by the proposed method in the experimental results, both in quantitative and qualitative views. This paper also discusses how the improved performance becomes an advantage in the autonomous driving system. © 2020 Institute of Electrical Engineers of Japan. Published by Wiley Periodicals LLC.
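The column-wise attribute branch can be illustrated with the short sketch below, which pools features over the height dimension so that one orientation label is predicted per image column; channel and class counts are placeholders, not the paper's configuration.

```python
# Sketch of a column-wise attribute branch: features are pooled over the
# height dimension so one orientation label is predicted per image column,
# matching the assumption that each column holds at most one pedestrian
# orientation. Channel counts and class numbers are illustrative.
import torch
import torch.nn as nn

class ColumnOrientationHead(nn.Module):
    def __init__(self, in_channels, num_orientations=8):
        super().__init__()
        self.classifier = nn.Conv1d(in_channels, num_orientations, kernel_size=1)

    def forward(self, feat):              # feat: (N, C, H, W)
        col = feat.mean(dim=2)             # pool over height -> (N, C, W)
        return self.classifier(col)        # (N, num_orientations, W): one label per column
```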
... Although there have been many solutions based on various computer vision and machine learning algorithms, deep learning has offered compelling improvements in relevant fields. Consequently, a raft of computer vision problems, including scene segmentation, have been addressed with deep learning technology [7,8,9,10,11]. ...
... Compared to indoor scene depth maps from Kinect or RealSense, outdoor traffic scene depth maps are much more sparse. Li et al. [21] simply stacked smoothed depth maps with RGB images as a 4-channel input. Based on VGG [22], Kreso et al. [23] introduced a scale selection layer and used the depth maps as a guidance to produce a scale-invariant representation to free appearance from the scale. ...
Article
Full-text available
Semantic segmentation has made striking progress due to the success of deep convolutional neural networks. Considering the demands of autonomous driving, real-time semantic segmentation has become a research hotspot these years. However, few real-time RGB-D fusion semantic segmentation studies are carried out despite readily accessible depth information nowadays. In this paper, we propose a real-time fusion semantic segmentation network termed RFNet that effectively exploits complementary cross-modal information. Building on an efficient network architecture, RFNet is capable of running swiftly, which satisfies autonomous vehicles applications. Multi-dataset training is leveraged to incorporate unexpected small obstacle detection, enriching the recognizable classes required to face unforeseen hazards in the real world. A comprehensive set of experiments demonstrates the effectiveness of our framework. On Cityscapes, Our method outperforms previous state-of-the-art semantic segmenters, with excellent accuracy and 22Hz inference speed at the full 2048x1024 resolution, outperforming most existing RGB-D networks.
... Disparity maps are also widely used in semantic segmentation [9], where they are used for segmentation in traffic scenes. In [10] they are also used to reduce noise at the outputs of a segmentation network, and in [11] they are generated from monocular vision and, together with semantic segmentation, are used to estimate the position of the ego vehicle with respect to the road. ...
Preprint
Full-text available
Road detection and segmentation is a crucial task in computer vision for safe autonomous driving. With this in mind, a new network architecture (3D-DEEP) and its end-to-end training methodology for CNN-based semantic segmentation are described in this paper. The method relies on disparity-filtered and LiDAR-projected images for three-dimensional information and on image feature extraction through fully convolutional network architectures. The developed models were trained and validated on the Cityscapes dataset using just fine annotation examples with 19 different training classes, and on the KITTI road dataset. A mean intersection over union (mIoU) of 72.32% has been obtained for the 19 Cityscapes training classes using the validation images. On the other hand, on the KITTI dataset the model has achieved an F1 value of 97.85% in validation and 96.02% on the test images.
... Up to now, many popular image segmentation algorithms have been applied to road area detection and extraction. These methods mainly focus on improving classic algorithms or combining with other algorithms [1][2][3][4][5], such as the graph-based method, clustering method, deep learning, and multitheory combination method. The core idea of the graph-based method is to transform the global segmentation of an image into a graph partition problem through a top-down traversal process and optimize the objective function. ...
Article
Full-text available
As a popular research direction in the field of intelligent transportation, road detection has been extensively concerned by many researchers. However, there are still some key issues in specific applications that need to be further improved, such as the feature processing of road images, the optimal choice of information extraction and detection methods, and the inevitable limitations of detection schemes. In the existing research work, most of the image segmentation algorithms applied to road detection are sensitive to noise data and are prone to generate redundant information or over-segmentation, which makes the computation of segmentation process more complicated. In addition, the algorithm needs to overcome objective factors such as different road conditions and natural environments to ensure certain execution efficiency and segmentation accuracy. In order to improve these issues, we integrate the idea of shallow machine-learning model that clusters first and then classifies in this paper, and a hierarchical multifeature road image segmentation integration framework is proposed. The proposed model has been tested and evaluated on two sets of road datasets based on real scenes and compared with common detection methods, and its effectiveness and accuracy have been verified. Moreover, it demonstrates that the method opens up a new way to enhance the learning and detection capabilities of the model. Most importantly, it has certain potential for application in various practical fields such as intelligent transportation or assisted driving.
... Deep learning technology has developed rapidly in many fields [22] including computer vision. An increasing number of studies have applied deep learning technology to point cloud datasets for 3D object detection [23]- [25]. Currently, the detection methods for point cloud data mainly include the projection method, voxel cutting method and RPN-based method. ...
Article
Full-text available
Existing outdoor three-dimensional (3D) object detection algorithms mainly use a single type of sensor, for example, only using a monocular camera or radar point cloud. However, camera sensors are affected by light and lose depth information. When scanning a distant object or an occluded object, the data collected by the short-range radar point cloud sensor are very sparse, which affects the detection algorithm. To address the above challenges, we design a deep learning network that can combine the texture information of two-dimensional (2D) data and the geometric information of 3D data for object detection. To solve the problem of a single sensor, we use a reverse mapping layer and an aggregation layer to combine the texture information of RGB data with the geometric information of point cloud data and design a maximum pooling layer to deal with the input of multi-view cameras. In addition, to solve the defects of the 3D object detection algorithm based on the region proposal network (RPN) method, we use the Hough voting algorithm implemented by a deep neural network to suggest objects. Experimental results show that our algorithm has a 1.06% decrease in average precision (AP) compared to PointRCNN in easy car object detection, but our algorithm requires 37.7% less time to calculate than PointRCNN under the same hardware environment. Moreover, our algorithm improves the AP by 1.14% compared to PointRCNN in hard car object detection.
... Because driver driving behavior and driver intention can be influenced by a large number of factors, such as the responses of the driver and the external stressing conditions, accurately recognizing driver driving behavior and driver intention on the basis of vehicle operating states is a challenging problem. ANN-based machine learning may be a good option, as it possesses the knowledge-processing and learning capability of the human brain and can adapt to identify complex driver driving behavior and driver intention [167,168]. ...
Article
Full-text available
In order to improve handling stability performance and active safety of a ground vehicle, a large number of advanced vehicle dynamics control systems—such as the direct yaw control system and active front steering system, and in particular the advanced driver assistance systems—towards connected and automated driving vehicles have recently been developed and applied. However, the practical effects and potential performance of vehicle active safety dynamics control systems heavily depend on real-time knowledge of fundamental vehicle state information, which is difficult to measure directly in a standard car because of both technical and economic reasons. This paper presents a comprehensive technical survey of the development and recent research advances in vehicle system dynamic state estimation. Different aspects of estimation strategies and methodologies in recent literature are classified into two main categories—the model-based estimation approach and the data-driven-based estimation approach. Each category is further divided into several sub-categories from the perspectives of estimation-oriented vehicle models, estimations, sensor configurations, and involved estimation techniques. The principal features of the most popular methodologies are summarized, and the pros and cons of these methodologies are also highlighted and discussed. Finally, future research directions in this field are provided.
... As mentioned previously, the depth image is rich in contour and position information, which benefits the semantic segmentation of RGB image [20,21]. To effectively merge RGB and depth information, this paper designs an image fusion module called the AFC (Figure 2), such that the network can learn more complementary information from the RGB and depth branches. ...
Article
During fruit production, robots must move stably across the orchard and detect obstacles on their path in real time. With the rapid progress of deep convolutional neural networks (CNNs), enabling orchard robots to detect obstacles through image semantic segmentation has become a hot topic. However, most such obstacle detection schemes underperform in the complex environment of orchards. To solve this problem, this paper proposes an image semantic fusion network for real-time detection of small obstacles. Two branches were set up to extract features from the red-green-blue (RGB) image and the depth image, respectively. The information extracted by different modules was merged to complement the image features. The proposed network can operate rapidly and supports real-time detection of obstacles by orchard robots. Experiments on orchard scenarios show that our network is superior to the latest image semantic segmentation methods, highly accurate in the recognition of high-definition images, and extremely fast in reasoning.
... Semantic segmentation is a fundamental task in computer vision, classifying each pixel in an image. This technique plays many crucial roles in modern intelligent transportation systems such as autonomous driving and video surveillance [1]-[5]. Therefore, the study of semantic segmentation is of great relevance in the above applications. ...
Article
Full-text available
In recent years, semantic segmentation based on deep convolutional neural networks has developed rapidly. However, it is still a challenge to balance the computing cost and segmentation performance for the current semantic segmentation methods. This paper proposes a lightweight real-time semantic segmentation model Balanced Sample Distribution Network (BSDNet), to solve this problem. In BSDNet, we introduce the Balanced Sample Distribution Module (BSDModule) to balance the sampling distribution of convolution and obtain features with a larger receptive field. To optimize the segmentation effect, we introduce a Shuffle Channel Attention Module (SCAModule) to enhance the interaction of channel features at the cost of a small number of parameters. BSDModule and SCAModule are lightweight and flexible and can adapt to different types of network structures. Extensive experiments on CityScapes and CamVid show that the proposed method can balance the computing cost and segmentation performance. Specifically, on the CityScapes with 512×1024 resolution, BSDNet-Xception39 achieves 68.3% MIoU and 84.6 FPS with only 1.2M parameters.
... In recent years, with the implementation of the IMO (International Maritime Organization)'s mandatory requirement for ships to be equipped with an onboard automatic identification system (AIS), AIS base stations along the world's coasts have been built rapidly, and AIS has been widely used [1][2]. As far as the current development situation is concerned, a large amount of AIS data can be collected and stored through the network. Facing the growing volume of AIS ship service information, new innovations can be found in information processing. ...
... The application field of semantic segmentation is very wide, and it can be applied to many fields such as medical imaging, autonomous driving and geographic remote sensing [6][7][8]. It provides a strong guarantee for intelligent upgrades such as medical assisted diagnosis, assisted driving and remote sensing image interpretation. ...
Article
Full-text available
Currently, image semantic segmentation has problems such as low accuracy and long running time. This paper proposes an image semantic segmentation method based on a generative adversarial network and the ENet model combined with a deep neural network. This method first improves the network model of the generative adversarial network to ensure the high resolution of the generated image and achieve high similarity with the real image. While ensuring the high accuracy of image semantic segmentation, it effectively improves the real-time performance of network processing. The proposed method is verified on public data sets. The experimental results show that the segmentation accuracy of this method can reach more than 93%, and the simulation running time is less than 0.171 s, which shows good accuracy and real-time performance. A feasible strategy is proposed for the further productization of semantic segmentation.
... They combined the attention mechanism with the spatial pyramid to extract precise dense features for pixel labelling, instead of complex dilated convolutions and artificially designed decoder networks. Reference [25] proposed a semantic segmentation method for traffic scenes based on RGB-D images and deep learning. A semi-global stereo matching algorithm and a fast global image smoothing method are used to obtain a smooth disparity map. ...
Article
Full-text available
In order to improve the accuracy of image semantic segmentation, an image semantic segmentation method based on a generative adversarial network (GAN) and a fully convolutional network (FCN) model is proposed. First of all, the network structure of the generator is improved. Introducing the residual module in the convolutional layer for difference learning makes the network structure sensitive to changes in the output, so as to better adjust the weights of the generator. Second, in order to reduce the number of parameters and calculations, a small convolution kernel is used to halve the number of channels of the input feature map before using the large convolution kernel. Finally, the output of the convolutional layer and the output of the deconvolutional layer are connected using the idea of a U-shaped network to avoid losing low-level information. The proposed method was experimentally demonstrated on the PASCAL VOC 2012 and CamVid datasets. Experimental results show that the proposed method effectively improves the accuracy of image segmentation, and avoids inaccurate detection caused by insufficient image pixel information and noise interference. Its mean pixel accuracy (MPA) and mean intersection over union (MIOU) are higher than those of other comparison methods.
... Depth prediction from images plays a significant role in autonomous driving and advanced driver assistance systems, which helps understanding a geometric layout in a scene, and can be leveraged to solve other tasks, including vehicle/pedestrian detection [1], [2], traffic scene segmentation [3], and 3D reconstruction [4]. Stereo matching is a typical approach to recovering depth that finds dense correspondences between a pair of stereo images [5], [6], [7]. ...
Preprint
Predicting depth from a monocular video sequence is an important task for autonomous driving. Although it has advanced considerably in the past few years, recent methods based on convolutional neural networks (CNNs) discard temporal coherence in the video sequence and estimate depth independently for each frame, which often leads to undesired inconsistent results over time. To address this problem, we propose to memorize temporal consistency in the video sequence, and leverage it for the task of depth prediction. To this end, we introduce a two-stream CNN with a flow-guided memory module, where each stream encodes visual and temporal features, respectively. The memory module, implemented using convolutional gated recurrent units (ConvGRUs), inputs visual and temporal features sequentially together with optical flow tailored to our task. It memorizes trajectories of individual features selectively and propagates spatial information over time, enforcing a long-term temporal consistency to prediction results. We evaluate our method on the KITTI benchmark dataset in terms of depth prediction accuracy, temporal consistency and runtime, and achieve a new state of the art. We also provide an extensive experimental analysis, clearly demonstrating the effectiveness of our approach to memorizing temporal consistency for depth prediction.
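For reference, the recurrent unit that such a memory module is built from, a convolutional GRU cell, can be sketched as below; the gating follows the standard GRU equations, while the paper's flow-guided variant is more involved.

```python
# Sketch of a convolutional GRU (ConvGRU) cell, the recurrent unit that the
# flow-guided memory module is built from. This follows the standard GRU
# gating; the paper's flow-guided variant is more involved.
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, in_ch, hidden_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.gates = nn.Conv2d(in_ch + hidden_ch, 2 * hidden_ch,
                               kernel_size, padding=pad)   # update & reset gates
        self.cand = nn.Conv2d(in_ch + hidden_ch, hidden_ch,
                              kernel_size, padding=pad)    # candidate state

    def forward(self, x, h):
        zr = torch.sigmoid(self.gates(torch.cat([x, h], dim=1)))
        z, r = zr.chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde    # new hidden state
```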
... Due to the fast evolution of intelligent vehicles, real-time depth perception has become a major area of interest. Stereo reconstruction still remains one of the most feasible methods in depth perception due to its low-cost and high-resolution output, being extremely useful for environment understanding [22], [26]. ...
Article
Full-text available
In this paper, we propose a novel semantic segmentation-based stereo reconstruction method that can keep up with the accuracy of the state-of-the art approaches while running in real time. The solution follows the classic stereo pipeline, each step in the stereo workflow being enhanced by additional information from semantic segmentation. Therefore, we introduce several improvements to computation, aggregation, and optimization by adapting existing techniques to integrate additional surface information given by each semantic class. For the cost computation and optimization steps, we propose new genetic algorithms that can incrementally adjust the parameters for better solutions. Furthermore, we propose a new post-processing edge-aware filtering technique relying on an improved convolutional neural network (CNN) architecture for disparity refinement. We obtain the competitive results at 30 frames/s, including segmentation.
... At present, the point cloud segmentation methods that have attracted more attention are mainly segmentation algorithms based on deep learning. There are relatively many studies on point cloud semantic segmentation and other enhanced semantic segmentation algorithms [31,32], but deep learning methods need a large number of data sets for training, so they are difficult to implement. In this study, the pipe point cloud and damage point cloud model are obtained. ...
Article
Full-text available
With the aging of pipelines and the increase in the volume of urban sewage, pipelines develop defects of varying degrees, which can cause safety problems such as road collapse and urban flooding. The service life of drainage pipes is closely related to daily maintenance and inspection, so it is very important to inspect defects and monitor the operation of drainage pipes regularly. However, the existing research lacks quantitative detection and intelligent management of pipeline defect information. Therefore, a depth camera is used as the sensor to quantitatively detect the volume and area of pits on the concrete pipe, and a defect information management platform is constructed in this paper. Firstly, combining a BIM model with the 3D point cloud, this paper proposes a 3D defect information management platform for drainage pipelines. Then, the depth camera is used to collect the damage data and preprocess the data, and a method for calculating the damage volume and surface area of drainage pipelines based on 3D mesh reconstruction of the defect point cloud is proposed. The verification experiment results show that the error between the quantified volume and the real volume is mostly within 10%, and the maximum error is 17.54%, indicating high accuracy. The drainage pipeline information model is created. Finally, the data are uploaded to the information management platform to realize the visualization and informatization of pipeline defects and the later operation and maintenance requirements of the pipeline.
Article
Predicting depth from a monocular video sequence is an important task for autonomous driving. Although it has advanced considerably in the past few years, recent methods based on convolutional neural networks (CNNs) discard temporal coherence in the video sequence and estimate depth independently for each frame, which often leads to undesired inconsistent results over time. To address this problem, we propose to memorize temporal consistency in the video sequence, and leverage it for the task of depth prediction. To this end, we introduce a two-stream CNN with a flow-guided memory module, where each stream encodes visual and temporal features, respectively. The memory module, implemented using convolutional gated recurrent units (ConvGRUs), inputs visual and temporal features sequentially together with optical flow tailored to our task. It memorizes trajectories of individual features selectively and propagates spatial information over time, enforcing a long-term temporal consistency to prediction results. We evaluate our method on the KITTI benchmark dataset in terms of depth prediction accuracy, temporal consistency and runtime, and achieve a new state of the art. We also provide an extensive experimental analysis, clearly demonstrating the effectiveness of our approach to memorizing temporal consistency for depth prediction.
Article
The deep learning object detection algorithms have become one of the powerful tools for road vehicle detection in autonomous driving. However, the limitation of the number of high-quality labeled training samples makes the single-object detection algorithms unable to achieve satisfactory accuracy in road vehicle detection. In this paper, by comparing the pros and cons of various object detection algorithms, two different algorithms with a different emphasis are selected for a weighted ensemble. Besides, a new ensemble method named the Soft-Weighted-Average method is proposed. The proposed method is attenuated by the confidence, and it “punishes” the detection result of the corresponding relationship by the confidence attenuation, instead of by deleting the output of a certain model. The proposed method can further reduce the vehicle misdetection of the target detection algorithm, obtaining a better detection result. Lastly, the ensemble method can achieve an average accuracy of 94.75% for simple targets, which makes it the third-ranked method in the KITTI evaluation system.
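The confidence-attenuation idea can be illustrated with the sketch below, where matched detections from two models are fused with score-weighted averaging and unmatched detections are kept with an attenuated confidence instead of being deleted; the IoU matching threshold and attenuation factor are illustrative assumptions, not the paper's exact Soft-Weighted-Average formula.

```python
# Sketch of confidence-attenuated fusion of detections from two models:
# matched boxes are averaged with score-based weights, unmatched boxes are
# kept but their confidence is attenuated rather than deleted.
# The IoU threshold and attenuation factor are illustrative assumptions.
import numpy as np

def iou(a, b):
    x1, y1 = np.maximum(a[:2], b[:2])
    x2, y2 = np.minimum(a[2:4], b[2:4])
    inter = max(x2 - x1, 0) * max(y2 - y1, 0)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def soft_weighted_fuse(dets_a, dets_b, iou_thr=0.5, atten=0.9):
    """dets_*: arrays of [x1, y1, x2, y2, score]; returns fused detections."""
    fused, used_b = [], set()
    for da in dets_a:
        match = None
        for j, db in enumerate(dets_b):
            if j not in used_b and iou(da, db) >= iou_thr:
                match = j
                break
        if match is None:
            fused.append(da * np.array([1, 1, 1, 1, atten]))  # attenuate, keep
        else:
            db = dets_b[match]
            used_b.add(match)
            w = np.array([da[4], db[4]])
            box = (da[:4] * w[0] + db[:4] * w[1]) / w.sum()   # score-weighted box
            fused.append(np.append(box, max(da[4], db[4])))
    fused += [dets_b[j] * np.array([1, 1, 1, 1, atten])
              for j in range(len(dets_b)) if j not in used_b]
    return np.array(fused)
```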
Article
Researchers have explored the benefits and applications of modern artificial intelligence (AI) algorithms in different scenarios. For the processing of geomatics data, AI offers overwhelming opportunities. Fundamental questions include how AI can be specifically applied to or must be specifically created for geomatics data. This change is also having a significant impact on geospatial data. The integration of AI approaches in geomatics has developed into the concept of geospatial artificial intelligence (GeoAI), which is a new paradigm for geographic knowledge discovery and beyond. However, little systematic work currently exists on how researchers have applied AI for geospatial domains. Hence, this contribution outlines AI-based techniques for analysing and interpreting complex geomatics data. Our analysis has covered several gaps, for instance defining relationships between AI-based approaches and geomatics data. First, technologies and tools used for data acquisition are outlined, with a particular focus on red–green–blue (RGB) images, thermal images, 3D point clouds, trajectories, and hyperspectral–multispectral images. Then, how AI approaches have been exploited for the interpretation of geomatic data is explained. Finally, a broad set of examples of applications is given, together with the specific method applied. Limitations point towards unexplored areas for future investigations, serving as useful guidelines for future research directions.
Article
Full-text available
In recent years, Intelligent Transportation Systems (ITS) have seen efficient and faster development by implementing deep learning techniques in problem domains which were previously addressed using analytical or statistical solutions, and also in some areas that were untouched. These improvements have facilitated traffic management and traffic planning, increased safety and security in transit roads, decreased maintenance costs, optimized public transportation and ride-sharing companies' performance, and advanced driverless vehicle development to a new stage. This paper's primary objective was to provide a review of and comprehensive insight into the applications of deep learning models to intelligent transportation systems, accompanied by a presentation of the progress of ITS research due to deep learning. First, different techniques of deep learning and their state-of-the-art are discussed, followed by an in-depth analysis and explanation of the current applications of these techniques in transportation systems. This enumeration of deep learning on ITS highlights its significance in the domain. The applications are furthermore categorized based on the gap they are trying to address. Finally, different embedded systems for deployment of these techniques are investigated, and their advantages and weaknesses over each other are discussed. Based on this systematic review, credible benefits of deep learning models on ITS are demonstrated and directions for future research are discussed.
Article
Video surveillance techniques like scene segmentation are playing an increasingly important role in multimedia Internet-of-Things (IoT) systems. However, existing deep learning based methods face challenges in both accuracy and memory when deployed on edge computing devices with limited computing resources. To address these challenges, a tensor-train video scene segmentation scheme that compares the local background information in regional scene boundary boxes in adjacent frames is proposed. Compared to existing methods, the proposed scheme can achieve competitive performance in both segmentation accuracy and parameter compression rate. In detail, first, an improved faster Region Convolutional Neural Network (faster-RCNN) model is proposed to recognize and generate a large number of region boxes with foreground and background, from which boundary boxes are obtained. Then the foreground boxes with sparse objects are removed and the rest are considered as optional background boxes used to measure the similarity between two adjacent frames. Second, to accelerate training efficiency and reduce memory size, a general and efficient training approach that uses tensor-train decomposition to factor the input-to-hidden weight matrix is proposed. Finally, experiments are conducted to evaluate the performance of the proposed scheme in terms of accuracy and model compression. Our results demonstrate that the proposed model can improve the training efficiency and save memory space for the deep computation model with good accuracy. This work opens the potential for the use of artificial intelligence methods in edge computing devices for multimedia IoT systems.
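To illustrate the parameter savings of tensor-train factorization of an input-to-hidden weight matrix, here is a minimal two-core sketch in NumPy; the shapes and TT rank are assumptions chosen only for illustration, not the paper's configuration.

```python
import numpy as np

I1, I2, J1, J2, r = 16, 16, 16, 16, 4          # W is (16*16) x (16*16) = 256 x 256
G1 = 0.1 * np.random.randn(I1, J1, r)          # first TT core
G2 = 0.1 * np.random.randn(r, I2, J2)          # second TT core

# Full matrix: W[(i1,i2),(j1,j2)] = sum_a G1[i1,j1,a] * G2[a,i2,j2]
W = np.einsum('ija,akl->ikjl', G1, G2).reshape(I1 * I2, J1 * J2)

def tt_matvec(x):
    """Apply the TT-format matrix to a vector without forming W explicitly."""
    X = x.reshape(J1, J2)
    A = np.einsum('amn,kn->amk', G2, X)        # contract over the second input mode
    return np.einsum('ija,amj->im', G1, A).reshape(I1 * I2)

x = np.random.randn(J1 * J2)
print(np.allclose(W @ x, tt_matvec(x)))        # True: same linear map
print(W.size, G1.size + G2.size)               # 65536 dense parameters vs. 2048 in TT form
```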
Article
Semantic segmentation-based scene parsing plays an important role in automatic driving and autonomous navigation. However, most previous models only consider static images and fail to parse sequential images, because they do not take into account the spatial-temporal continuity between consecutive frames in a video. In this paper, we propose a depth embedded recurrent predictive parsing network (RPPNet), which analyzes preceding consecutive stereo pairs to produce parsing results. In this way, RPPNet effectively learns the dynamic information from historical stereo pairs, so as to correctly predict the representations of the next frame. The other contribution of this paper is to systematically study the video scene parsing (VSP) task, in which we use the RPPNet to enhance conventional image parsing features by adding spatial-temporal information. The experimental results show that our proposed RPPNet can achieve fine predictive parsing results on Cityscapes, and that the predictive features of RPPNet can significantly improve conventional image parsing networks in the VSP task.
Article
In the Internet of Things (IoT) era, billions of sensors and devices collect and process data from the environment, transmit them to cloud centers, and receive feedback via the internet for connectivity and perception. However, transmitting massive amounts of heterogeneous data, perceiving complex environments from these data, and then making smart decisions in a timely manner are difficult. Artificial intelligence (AI), especially deep learning, is now a proven success in various areas including computer vision, speech recognition, and natural language processing. AI introduced into the IoT heralds the era of artificial intelligence of things (AIoT). This paper presents a comprehensive survey on AIoT to show how AI can empower the IoT to make it faster, smarter, greener, and safer. Specifically, we briefly present the AIoT architecture in the context of cloud computing, fog computing, and edge computing. Then, we present progress in AI research for IoT from four perspectives: perceiving, learning, reasoning, and behaving. Next, we summarize some promising applications of AIoT that are likely to profoundly reshape our world. Finally, we highlight the challenges facing AIoT and some potential research opportunities.
Article
Full-text available
A well-performed deep learning model in image segmentation relies on a large amount of labeled data. However, it is hard to obtain sufficient high-quality raw data in industrial applications. Meta-learning, one of the most promising research areas, is recognized as a powerful tool for approaching image segmentation. To this end, this paper reviews the state-of-the-art image segmentation methods based on meta-learning. We first introduce the background of image segmentation, including the methods and metrics of image segmentation. Second, we review the timeline of meta-learning and give a more comprehensive definition of meta-learning. The differences between meta-learning and other similar methods are compared comprehensively. Then, we categorize the existing meta-learning methods into model-based, optimization-based, and metric-based. For each category, the popularly used meta-learning models in image segmentation are discussed. Next, we conduct comprehensive computational experiments to compare these models on two public datasets: ISIC-2018 and Covid-19. Finally, the future trends of meta-learning in image segmentation are highlighted.
Article
Full-text available
Speech recognition has become a necessary feature of high-quality service industry products. Therefore, the accuracy and efficiency of speech recognition are the key to product applications. At the same time, this article also designs various modular functions for speech recognition. In order to solve the problem of poor recognition performance when existing convolutional neural networks recognize continuous speech data, we provide an improved convolutional neural network algorithm and backpropagation algorithm to reduce the weight range. In a real speech recognition system, the training efficiency of the convolutional neural network is very low due to the large amount of training data and model parameters. Vocational education reform is an important way to realize modern education, and it is also an effective way to improve students' comprehensive quality and promote personal development. According to an analysis of the current teaching situation in higher vocational colleges, the effect of vocational education reform has not reached the expected standard. This has led to a decrease in the teaching efficiency of higher vocational colleges; coupled with the increasingly fierce competition in modern society, the reform of higher vocational education has become the top priority of the education department and the school. In order to improve the scientific nature of vocational education reform research, it is necessary to strengthen research on current and future development trends. The research scope of vocational education reform needs to be coordinated, integrated, and expanded. To strengthen research on the integration of industry and academia, it is necessary to establish a team of experts. The cross-border characteristics of vocational education are reflected to a large extent in the integration of production and education. This article explores how to realize the reform of vocational education in higher vocational colleges based on the improved convolutional neural network and speech recognition.
Preprint
Full-text available
The RGB-Thermal (RGB-T) information for semantic segmentation has been extensively explored in recent years. However, most existing RGB-T semantic segmentation methods compromise spatial resolution to achieve real-time inference speed, which leads to poor performance. To better extract detailed spatial information, we propose a two-stage Feature-Enhanced Attention Network (FEANet) for the RGB-T semantic segmentation task. Specifically, we introduce a Feature-Enhanced Attention Module (FEAM) to excavate and enhance multi-level features from both the channel and spatial views. Benefiting from the proposed FEAM module, our FEANet can preserve the spatial information and shift more attention to high-resolution features from the fused RGB-T images. Extensive experiments on the urban scene dataset demonstrate that our FEANet outperforms other state-of-the-art (SOTA) RGB-T methods in terms of objective metrics and subjective visual comparison (+2.6% in global mAcc and +0.8% in global mIoU). For 480 x 640 RGB-T test images, our FEANet runs at real-time speed on an NVIDIA GeForce RTX 2080 Ti card.
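A minimal sketch of a channel-plus-spatial attention block in the spirit of such a feature-enhanced attention module, written in PyTorch; the reduction ratio, kernel size, and channel count are assumptions for illustration, not FEANet's settings.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, ch, reduction=16):
        super().__init__()
        # channel attention: squeeze spatially, re-weight each channel
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1),
            nn.Sigmoid(),
        )
        # spatial attention: pool over channels, re-weight each location
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel_mlp(x)
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        return x * self.spatial_conv(torch.cat([avg, mx], dim=1))

# usage on fused RGB-T features
feat = torch.randn(1, 256, 60, 80)
out = ChannelSpatialAttention(256)(feat)
```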
Article
Full-text available
We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network and a corresponding decoder network followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network [1]. The role of the decoder network is to map the low resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN [2] and also with the well known DeepLab-LargeFOV [3] and DeconvNet [4] architectures. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance. SegNet was primarily motivated by scene understanding applications. Hence, it is designed to be efficient both in terms of memory and computational time during inference. It is also significantly smaller in the number of trainable parameters than other competing architectures and can be trained end-to-end using stochastic gradient descent. We also performed a controlled benchmark of SegNet and other architectures on both road scenes and SUN RGB-D indoor scene segmentation tasks. These quantitative assessments show that SegNet provides good performance with competitive inference time and the most efficient inference memory-wise as compared to other architectures. We also provide a Caffe implementation of SegNet and a web demo at http://mi.eng.cam.ac.uk/projects/segnet/.
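The decoder's reuse of encoder pooling indices maps directly onto standard PyTorch operations; here is a minimal sketch (the channel count and feature-map size are placeholders, not SegNet's actual dimensions).

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)   # encoder pooling keeps argmax locations
unpool = nn.MaxUnpool2d(2, stride=2)                     # decoder unpooling reuses those indices
conv = nn.Conv2d(64, 64, 3, padding=1)                   # trainable filters densify the sparse map

x = torch.randn(1, 64, 180, 240)          # encoder feature map
pooled, idx = pool(x)                     # downsample, remember where the maxima were
up = unpool(pooled, idx)                  # sparse, non-learned upsampling
dense = conv(up)                          # dense feature map for the next decoder stage
print(pooled.shape, dense.shape)          # (1, 64, 90, 120) -> (1, 64, 180, 240)
```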
Conference Paper
Full-text available
Visual understanding of complex urban street scenes is an enabling factor for a wide range of applications. Object detection has benefited enormously from large-scale datasets, especially in the context of deep learning. For semantic urban scene understanding, however, no current dataset adequately captures the complexity of real-world urban scenes. To address this, we introduce Cityscapes, a benchmark suite and large-scale dataset to train and test approaches for pixel-level and instance-level semantic labeling. Cityscapes is comprised of a large, diverse set of stereo video sequences recorded in streets from 50 different cities. 5000 of these images have high quality pixel-level annotations; 20000 additional images have coarse annotations to enable methods that leverage large volumes of weakly-labeled data. Crucially, our effort exceeds previous attempts in terms of dataset size, annotation richness, scene variability, and complexity. Our accompanying empirical study provides an in-depth analysis of the dataset characteristics, as well as a performance evaluation of several state-of-the-art approaches based on our benchmark.
Article
Full-text available
Advance information about the road surface a vehicle is going to encounter can improve the performance of an Antilock Braking System (ABS). For example, the initial slip cycles caused by the ABS could be avoided if it is already known that the vehicle is on a surface having a low coefficient of friction (μ). In this paper, an algorithm is developed that detects different road surfaces using streaming video acquired from a camera mounted on the hood of the vehicle. The road surfaces detected here are asphalt road, cement road, sandy road, rough asphalt road (asphalt road which is deteriorating), grassy road and rough road. The value of the coefficient of friction (μ) is also output along with the detected surfaces to provide additional information about the road surfaces. Split μ (a road having different μ conditions on the left and right side of the vehicle) and μ jump (different μ conditions on the front and rear of the vehicle) are also pre-detected. One method was not sufficient to achieve the goals of this algorithm. Here, several simple techniques like the Canny edge algorithm, intensity histogram, contours, Hough transform and image segmentation were employed and compared with the Support Vector Machine (SVM). To prevent misdetections, road surface detection during high motion blur is prohibited.
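A hedged sketch of how such simple cues (edge density from a Canny map plus an intensity histogram) could be combined with an SVM classifier; the patches, labels, and thresholds below are random placeholders, not the paper's data or exact feature set.

```python
import cv2
import numpy as np
from sklearn.svm import SVC

def surface_features(gray_patch):
    edges = cv2.Canny(gray_patch, 50, 150)
    edge_density = edges.mean() / 255.0                        # rough texture cue
    hist = cv2.calcHist([gray_patch], [0], None, [32], [0, 256]).ravel()
    hist /= hist.sum() + 1e-9                                  # normalized intensity histogram
    return np.concatenate([[edge_density], hist])

# placeholder grayscale road patches and surface labels (asphalt, cement, sand, grass, ...)
patches = [np.random.randint(0, 256, (64, 64), dtype=np.uint8) for _ in range(20)]
labels = np.random.randint(0, 4, 20)

X = np.stack([surface_features(p) for p in patches])
clf = SVC(kernel='rbf').fit(X, labels)
print(clf.predict(X[:3]))
```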
Article
Full-text available
Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
Article
Full-text available
We propose a novel deep architecture, SegNet, for semantic pixel wise image labelling. SegNet has several attractive properties; (i) it only requires forward evaluation of a fully learnt function to obtain smooth label predictions, (ii) with increasing depth, a larger context is considered for pixel labelling which improves accuracy, and (iii) it is easy to visualise the effect of feature activation(s) in the pixel label space at any depth. SegNet is composed of a stack of encoders followed by a corresponding decoder stack which feeds into a soft-max classification layer. The decoders help map low resolution feature maps at the output of the encoder stack to full input image size feature maps. This addresses an important drawback of recent deep learning approaches which have adopted networks designed for object categorization for pixel wise labelling. These methods lack a mechanism to map deep layer feature maps to input dimensions. They resort to ad hoc methods to upsample features, e.g. by replication. This results in noisy predictions and also restricts the number of pooling layers in order to avoid too much upsampling and thus reduces spatial context. SegNet overcomes these problems by learning to map encoder outputs to image pixel labels. We test the performance of SegNet on outdoor RGB scenes from CamVid, KITTI and indoor scenes from the NYU dataset. Our results show that SegNet achieves state-of-the-art performance even without use of additional cues such as depth, video frames or post-processing with CRF models.
Article
Full-text available
We introduce a purely feed-forward architecture for semantic segmentation. We map small image elements (superpixels) to rich feature representations extracted from a sequence of nested regions of increasing extent. These regions are obtained by "zooming out" from the superpixel all the way to scene-level resolution. This approach exploits statistical structure in the image and in the label space without setting up explicit structured prediction mechanisms, and thus avoids complex and expensive inference. Instead superpixels are classified by a feedforward multilayer network. Our architecture achieves new state of the art performance in semantic segmentation, obtaining 64.4% average accuracy on the PASCAL VOC 2012 test set.
Article
Full-text available
This paper presents an efficient technique for performing spatially inhomogeneous edge-preserving image smoothing, called the fast global smoother. Focusing on sparse Laplacian matrices consisting of a data term and a prior term (typically defined using four or eight neighbors for a 2D image), our approach efficiently solves such global objective functions. Specifically, we approximate the solution of the memory- and computation-intensive large linear system, defined over a d-dimensional spatial domain, by solving a sequence of 1D sub-systems. Our separable implementation enables applying a linear-time tridiagonal matrix algorithm to solve d three-point Laplacian matrices iteratively. Our approach combines the best of two paradigms, i.e., efficient edge-preserving filters and optimization-based smoothing. Our method has a runtime comparable to the fast edge-preserving filters, but its global optimization formulation overcomes many limitations of the local filtering approaches. Our method also achieves high-quality results like the state-of-the-art optimization-based techniques, but runs about 10 to 30 times faster. Besides, considering the flexibility in defining an objective function, we further propose generalized fast algorithms that perform Lγ norm smoothing (0 < γ < 2) and support an aggregated (robust) data term for handling imprecise data constraints. We demonstrate the effectiveness and efficiency of our techniques in a range of image processing and computer graphics applications.
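The core building block, a 1D edge-aware weighted-least-squares smoothing solved by a banded (tridiagonal) linear solver, can be sketched as follows; the full method alternates such 1D solves along image rows and columns. The edge-weight definition and parameters below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from scipy.linalg import solve_banded

def wls_smooth_1d(g, guide, lam=20.0, sigma=0.1):
    """Solve (I + lam * A_w) u = g, where A_w is a 1D Laplacian weighted by the guide."""
    n = len(g)
    w = np.exp(-np.abs(np.diff(guide)) / sigma)    # edge-stopping weights, length n-1
    main = 1.0 + lam * (np.r_[w, 0.0] + np.r_[0.0, w])
    off = -lam * w
    ab = np.zeros((3, n))
    ab[0, 1:] = off                                # superdiagonal
    ab[1, :] = main                                # main diagonal
    ab[2, :-1] = off                               # subdiagonal
    return solve_banded((1, 1), ab, g)             # linear-time tridiagonal solve

# noisy step signal: smoothing flattens the noise but preserves the edge
x = np.linspace(0, 1, 200)
signal = (x > 0.5).astype(float)
noisy = signal + 0.05 * np.random.randn(200)
smoothed = wls_smooth_1d(noisy, guide=noisy)
```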
Article
Full-text available
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (approx 2 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
Conference Paper
Full-text available
In this paper we study the problem of object detection for RGB-D images using semantically rich image and depth features. We propose a new geocentric embedding for depth images that encodes height above ground and angle with gravity for each pixel in addition to the horizontal disparity. We demonstrate that this geocentric embedding works better than using raw depth images for learning feature representations with convolutional neural networks. Our final object detection system achieves an average precision of 37.3%, which is a 56% relative improvement over existing methods. We then focus on the task of instance segmentation where we label pixels belonging to object instances found by our detector. For this task, we propose a decision forest approach that classifies pixels in the detection window as foreground or background using a family of unary and binary tests that query shape and geocentric pose features. Finally, we use the output from our object detectors in an existing superpixel classification framework for semantic scene segmentation and achieve a 24% relative improvement over current state-of-the-art for the object categories that we study. We believe advances such as those represented in this paper will facilitate the use of perception in fields like robotics.
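As a simplified illustration of forming disparity-derived geometric channels (only depth and height above an assumed flat ground; the full geocentric encoding additionally estimates the gravity direction and surface angles), assuming known stereo calibration and a level camera. All numbers below are placeholders, not calibration from any specific dataset.

```python
import numpy as np

def depth_and_height(disparity, fx, fy, cy, baseline, cam_height):
    """disparity: HxW array in pixels; returns depth (m) and height above ground (m)."""
    d = np.maximum(disparity, 1e-3)               # avoid division by zero
    Z = fx * baseline / d                         # depth from horizontal disparity
    v = np.arange(disparity.shape[0])[:, None]    # image row index
    Y = (v - cy) * Z / fy                         # vertical camera coordinate (down is positive)
    height = cam_height - Y                       # height above an assumed flat ground plane
    return Z, height

disp = np.full((375, 1242), 40.0)                 # placeholder disparity map
Z, H = depth_and_height(disp, fx=721.5, fy=721.5, cy=172.8,
                        baseline=0.54, cam_height=1.65)
```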
Conference Paper
Full-text available
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called "ImageNet", a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
Article
Full-text available
Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank cheque is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal cheques. It is deployed commercially and reads several million cheques per day
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Conference Paper
Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on our PReLU networks (PReLU-nets), we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66%). To our knowledge, our result is the first to surpass human-level performance (5.1%, Russakovsky et al.) on this visual recognition challenge.
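Both ingredients, a learnable rectifier slope and a rectifier-aware weight initialization, are available as standard PyTorch utilities; a minimal sketch (the channel counts and slope value are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

layer = nn.Conv2d(64, 128, 3, padding=1)
# rectifier-aware ("He") initialization, here accounting for a nonzero negative slope
nn.init.kaiming_normal_(layer.weight, a=0.25, nonlinearity='leaky_relu')
nn.init.zeros_(layer.bias)
act = nn.PReLU(num_parameters=128, init=0.25)   # one learnable slope per output channel

x = torch.randn(1, 64, 32, 32)
y = act(layer(x))
```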
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Conference Paper
We present an approach to interpret the major surfaces, objects, and support relations of an indoor scene from an RGBD image. Most existing work ignores physical interactions or is applied only to tidy rooms and hallways. Our goal is to parse typical, often messy, indoor scenes into floor, walls, supporting surfaces, and object regions, and to recover support relationships. One of our main interests is to better understand how 3D cues can best inform a structured 3D interpretation. We also contribute a novel integer programming formulation to infer physical support relations. We offer a new dataset of 1449 RGBD images, capturing 464 diverse indoor scenes, with detailed annotations. Our experiments demonstrate our ability to infer support relations in complex scenes and verify that our 3D scene cues and inferred support lead to better object segmentation.
Article
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves improved segmentation of PASCAL VOC (30% relative improvement to 67.2% mean IU on 2012), NYUDv2, SIFT Flow, and PASCAL-Context, while inference takes one tenth of a second for a typical image.
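A minimal sketch of the fully convolutional idea: a 1x1 convolution replaces the fully connected classifier, coarse scores are bilinearly upsampled, and a skip connection fuses them with scores from a shallower, finer layer. Channel sizes, the number of classes, and scale factors are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 12
score_deep = nn.Conv2d(512, num_classes, 1)     # classifier expressed as a 1x1 convolution
score_skip = nn.Conv2d(256, num_classes, 1)     # scores from a shallower layer

deep = torch.randn(1, 512, 12, 15)              # coarse, semantically strong features
shallow = torch.randn(1, 256, 24, 30)           # finer features from an earlier layer

up = F.interpolate(score_deep(deep), scale_factor=2,
                   mode='bilinear', align_corners=False)
fused = up + score_skip(shallow)                # skip architecture combines the two
full = F.interpolate(fused, scale_factor=8, mode='bilinear', align_corners=False)
print(full.shape)                               # (1, 12, 192, 240): per-pixel class scores
```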
Article
We propose a novel semantic segmentation algorithm by learning a deconvolution network. We learn the network on top of the convolutional layers adopted from VGG 16-layer net. The deconvolution network is composed of deconvolution and unpooling layers, which identify pixel-wise class labels and predict segmentation masks. We apply the trained network to each proposal in an input image, and construct the final semantic segmentation map by combining the results from all proposals in a simple manner. The proposed algorithm mitigates the limitations of the existing methods based on fully convolutional networks by integrating deep deconvolution network and proposal-wise prediction; our segmentation method typically identifies detailed structures and handles objects in multiple scales naturally. Our network demonstrates outstanding performance in PASCAL VOC 2012 dataset, and we achieve the best accuracy (72.5%) among the methods trained with no external data through ensemble with the fully convolutional network.
Article
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
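The per-channel transform can be written out directly; a minimal sketch of the training-time computation (inference additionally uses running statistics, which are omitted here):

```python
import torch

def batch_norm_2d(x, gamma, beta, eps=1e-5):
    mean = x.mean(dim=(0, 2, 3), keepdim=True)            # per-channel mini-batch mean
    var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)            # normalized activations
    return gamma * x_hat + beta                           # learned scale and shift

x = torch.randn(8, 16, 10, 10)
gamma = torch.ones(1, 16, 1, 1)
beta = torch.zeros(1, 16, 1, 1)
y = batch_norm_2d(x, gamma, beta)
print(y.mean().item(), y.std().item())                    # approximately 0 and 1
```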
Article
In this paper we address three different computer vision tasks using a single basic architecture: depth prediction, surface normal estimation, and semantic labeling. We use a multiscale convolutional network that is able to adapt easily to each task using only small modifications, regressing from the input image to the output map directly. Our method progressively refines predictions using a sequence of scales, and captures many image details without any superpixels or low-level segmentation. We achieve state-of-the-art performance on benchmarks for all three tasks.
Article
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Conference Paper
Today, visual recognition systems are still rarely employed in robotics applications. Perhaps one of the main reasons for this is the lack of demanding benchmarks that mimic such scenarios. In this paper, we take advantage of our autonomous driving platform to develop novel challenging benchmarks for the tasks of stereo, optical flow, visual odometry/SLAM and 3D object detection. Our recording platform is equipped with four high resolution video cameras, a Velodyne laser scanner and a state-of-the-art localization system. Our benchmarks comprise 389 stereo and optical flow image pairs, stereo visual odometry sequences of 39.2 km length, and more than 200k 3D object annotations captured in cluttered scenarios (up to 15 cars and 30 pedestrians are visible per image). Results from state-of-the-art algorithms reveal that methods ranking high on established datasets such as Middlebury perform below average when being moved outside the laboratory to the real world. Our goal is to reduce this bias by providing challenging benchmarks with novel difficulties to the computer vision community. Our benchmarks are available online at: www.cvlibs.net/datasets/kitti.
Article
In this paper, we propose a cascade classifier combining AdaBoost and a support vector machine, and apply it to pedestrian detection. The pedestrian detection involved using a window of fixed size to extract the candidate region from left to right and top to bottom of the image, and performing feature extraction on the candidate region. Finally, our proposed cascade classifier completed the classification of the candidate region. The cascade-AdaBoost classifier has been successfully used in pedestrian detection. We have improved the initial setting method for the weights of the training samples in the AdaBoost classifier, so that the selected weak classifier focuses on a higher detection rate rather than accuracy. The proposed cascade classifier can automatically select the AdaBoost classifier or SVM to construct a cascade classifier according to the training samples, so as to effectively improve classification performance and reduce training time. In order to verify our proposed method, we used our extracted database of pedestrian training samples, the PETS database, the INRIA database and the MIT database. This completed the pedestrian detection experiment whose results were compared to those of the cascade-AdaBoost classifier and the support vector machine. The results of the experiment showed that in a simple environment involving campus experimental images and the PETS database, both our cascade classifier and the other classifiers can attain good results, while in a complicated environment involving the INRIA and MIT database experiments, our cascade classifier had better results than those of the other classifiers.
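A hedged sketch of a two-stage cascade built from off-the-shelf scikit-learn classifiers, in the spirit of the description above: a fast AdaBoost stage rejects most negatives and an SVM refines the survivors. The features, labels, and threshold are placeholders, and the paper's automatic stage-selection logic is not reproduced.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

X = np.random.randn(500, 36)                 # candidate-window feature vectors (placeholder)
y = np.random.randint(0, 2, 500)             # 1 = pedestrian, 0 = background (placeholder)

stage1 = AdaBoostClassifier(n_estimators=50).fit(X, y)
keep = stage1.predict_proba(X)[:, 1] > 0.3   # low threshold: favor detection rate over accuracy
stage2 = SVC(kernel='rbf').fit(X[keep], y[keep])

final = np.zeros(len(X), dtype=int)
final[keep] = stage2.predict(X[keep])        # only stage-1 survivors reach the SVM
```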
Article
In recent years, active learning has emerged as a powerful tool in building robust systems for object detection using computer vision. Indeed, active learning approaches to on-road vehicle detection have achieved impressive results. While active learning approaches for object detection have been explored and presented in the literature, few studies have been performed to comparatively assess costs and merits. In this study, we provide a cost-sensitive analysis of three popular active learning methods for on-road vehicle detection. The generality of active learning findings is demonstrated via learning experiments performed with detectors based on histogram of oriented gradient features and SVM classification (HOG–SVM), and Haar-like features and Adaboost classification (Haar–Adaboost). Experimental evaluation has been performed on static images and real-world on-road vehicle datasets. Learning approaches are assessed in terms of the time spent annotating, data required, recall, and precision.
Conference Paper
We present a hierarchical model that learns image decompositions via alternating layers of convolutional sparse coding and max pooling. When trained on natural images, the layers of our model capture image information in a variety of forms: low-level edges, mid-level edge junctions, high-level object parts and complete objects. To build our model we rely on a novel inference scheme that ensures each layer reconstructs the input, rather than just the output of the layer directly beneath, as is common with existing hierarchical approaches. This makes it possible to learn multiple layers of representation and we show models with 4 layers, trained on images from the Caltech-101 and 256 datasets. When combined with a standard classifier, features extracted from these models outperform SIFT, as well as representations from other feature learning methods.
Article
Visual object analysis researchers are increasingly experimenting with video, because it is expected that motion cues should help with detection, recognition, and other analysis tasks. This paper presents the Cambridge-driving Labeled Video Database (CamVid) as the first collection of videos with object class semantic labels, complete with metadata. The database provides ground truth labels that associate each pixel with one of 32 semantic classes. The database addresses the need for experimental data to quantitatively evaluate emerging algorithms. While most videos are filmed with fixed-position CCTV-style cameras, our data was captured from the perspective of a driving automobile. The driving scenario increases the number and heterogeneity of the observed object classes. Over 10 min of high quality 30 Hz footage is being provided, with corresponding semantically labeled images at 1 Hz and, in part, 15 Hz. The CamVid Database offers four contributions that are relevant to object analysis researchers. First, the per-pixel semantic segmentation of over 700 images was specified manually, and was then inspected and confirmed by a second person for accuracy. Second, the high-quality and large resolution color video images in the database represent valuable extended duration digitized footage to those interested in driving scenarios or ego-motion. Third, we filmed calibration sequences for the camera color response and intrinsics, and computed a 3D camera pose for each frame in the sequences. Finally, in support of expanding this or other databases, we present custom-made labeling software for assisting users who wish to paint precise class-labels for other images and videos. We evaluate the relevance of the database by measuring the performance of an algorithm from each of three distinct domains: multi-class object recognition, pedestrian detection, and label propagation.
Article
Research in object detection and recognition in cluttered scenes requires large image collections with ground truth labels. The labels should provide information about the object classes present in each image, as well as their shape and locations, and possibly other attributes such as pose. Such data is useful for testing, as well as for supervised learning. This project provides a web-based annotation tool that makes it easy to annotate images, and to instantly share such annotations with the community. This tool, plus an initial set of 10,000 images (3000 of which have been labeled), can be found at http://www.csail.mit.edu/~brussell/research/LabelMe/intro.html
Algorithm of vehicle detection and pattern recognition using SVM
  • G E Guangying
Simultaneous detection and segmentation
  • B Hariharan
  • P Arbeláez
  • R Girshick
  • J Malik