Chapter

Staircase Detection Using a Lightweight Look-Behind Fully Convolutional Neural Network


Abstract

Staircase detection in natural images has several applications in the context of robotics and visually impaired navigation. Previous works are mainly based on handcrafted feature extraction and supervised learning using fully annotated images. In this work we address the problem of staircase detection in weakly labeled natural images, using a novel Fully Convolutional Neural Network (FCN), named LB-FCN light. The proposed network is an enhanced version of our recent Look-Behind FCN (LB-FCN), suitable for deployment on mobile and embedded devices. Its architecture features multi-scale feature extraction, depthwise separable convolutions and residual learning. To evaluate its computational and classification performance, we have created a weakly labeled benchmark dataset from publicly available images. The results from the experimental evaluation of LB-FCN light indicate its advantageous performance over the relevant state-of-the-art architectures.
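To make the architectural ingredients listed above concrete (multi-scale feature extraction, depthwise separable convolutions, residual learning), the following is a minimal Keras sketch of such a block; the filter counts, kernel sizes, input resolution and overall layout are assumptions for illustration, not the exact LB-FCN light configuration.

```python
# Illustrative sketch only: a multi-scale block built from depthwise separable
# convolutions with a residual connection, in the spirit of LB-FCN light.
# Filter counts, kernel sizes and input shape are assumed, not the authors' exact design.
import tensorflow as tf
from tensorflow.keras import layers, Model

def multi_scale_ds_block(x, filters):
    # Parallel depthwise separable convolutions with different kernel sizes
    branches = [
        layers.SeparableConv2D(filters, k, padding="same", activation="relu")(x)
        for k in (3, 5, 7)
    ]
    merged = layers.Concatenate()(branches)
    merged = layers.Conv2D(filters, 1, padding="same")(merged)   # fuse branches
    # Residual (shortcut) connection; project the input if channel counts differ
    shortcut = layers.Conv2D(filters, 1, padding="same")(x)
    return layers.ReLU()(layers.Add()([merged, shortcut]))

inputs = layers.Input(shape=(224, 224, 3))            # assumed input size
x = multi_scale_ds_block(inputs, 32)
x = layers.MaxPooling2D()(x)
x = multi_scale_ds_block(x, 64)
x = layers.GlobalAveragePooling2D()(x)                # fully convolutional head
outputs = layers.Dense(1, activation="sigmoid")(x)    # staircase / no staircase
model = Model(inputs, outputs)
model.summary()
```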

... In the second step, a comparison among the LB-FCN light and state-of-the-art CNN models, such as ResNet50, VGG16, Inception V3, Xception, and MobileNet, was performed. The design architecture of the adopted LB-FCN light is based on the initial LB-FCN [25], while a lightweight version of the LB-FCN was adopted [33] to decrease the architecture complexity, the number of free parameters and the required floating-point operations (FLOPs). LB-FCN light was compared to conventional and pre-trained CNN architectures [34,35] used to solve the classification problem of bone metastasis from P-Ca patients [14]. ...
... Then, data were normalized to achieve a scalable dataset on which the proposed LB-FCN light was trained. The design architecture of the adopted LB-FCN light is based on the initial LB-FCN [25], while a lightweight version of the LB-FCN was adopted [33] to decrease the architecture complexity, the number of free parameters and the required floating-point operations (FLOPs). LB-FCN light was compared to conventional and pre-trained CNN architectures [34,35] used to solve the classification problem of bone metastasis from P-Ca patients [14]. ...
... It should be noted that LB-FCN light is more than 10 times lighter compared to MobileNet [39], which is a well-known lightweight and efficient network especially designed for mobile applications. (Accompanying table excerpt: network of [14]: 13.1 and 6.5; LB-FCN light [33]: 0.6 and 0.3; the column headings are not recoverable.) ...
Article
Full-text available
Bone metastasis is among the most frequent conditions in patients suffering from metastatic cancer, such as breast or prostate cancer. A popular diagnostic method is bone scintigraphy, where the whole body of the patient is scanned. However, hot spots present in the scanned image can be misleading, making the accurate and reliable diagnosis of bone metastasis a challenge. Artificial intelligence can play a crucial role as a decision support tool to alleviate the burden of generating manual annotations on images and therefore prevent oversights by medical experts. So far, several state-of-the-art convolutional neural networks (CNN) have been employed to address bone metastasis diagnosis as a binary or multiclass classification problem, achieving adequate accuracy (higher than 90%). However, due to their increased complexity (number of layers and free parameters), these networks are severely dependent on the number of available training images, which are typically limited within the medical domain. Our study was dedicated to the use of a new deep learning architecture that overcomes the computational burden by using a convolutional neural network with a significantly lower number of floating-point operations (FLOPs) and free parameters. The proposed lightweight look-behind fully convolutional neural network was implemented and compared with several well-known powerful CNNs, such as ResNet50, VGG16, Inception V3, Xception, and MobileNet, on an imaging dataset of moderate size (778 images from male subjects with prostate cancer). The results prove the superiority of the proposed methodology over the current state-of-the-art on identifying bone metastasis. The proposed methodology demonstrates a unique potential to revolutionize image-based diagnostics, enabling new possibilities for enhanced cancer metastasis monitoring and treatment.
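For context, the free-parameter counts of the off-the-shelf baselines named above can be inspected directly from Keras; the snippet below is only an illustration of such a comparison and does not reproduce the study's trained models or its FLOP measurements.

```python
# Rough illustration: comparing free-parameter counts of the baseline CNNs
# mentioned above. This does not reproduce the study's models or FLOP counts.
import tensorflow as tf

baselines = {
    "ResNet50": tf.keras.applications.ResNet50,
    "VGG16": tf.keras.applications.VGG16,
    "InceptionV3": tf.keras.applications.InceptionV3,
    "Xception": tf.keras.applications.Xception,
    "MobileNet": tf.keras.applications.MobileNet,
}

for name, ctor in baselines.items():
    model = ctor(weights=None, include_top=True)   # random init, no download
    print(f"{name}: {model.count_params() / 1e6:.1f}M parameters")
```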
... In this paper we propose a novel approach to cope with the problem of data availability in the medical domain. More specifically, our methodology employs a combination of a state-of-the-art Look-Behind CNN [7], as it has recently evolved into a lightweight, more efficient classifier (LB-FCN Light) [12], and a Generative Adversarial Network (GAN) [13] for training data generation. We trained the GAN, which performs non-stationary texture synthesis [14], to generate small bowel WCE images with and without inflammatory lesions. ...
... While GANs have been used to increase the available samples in training datasets as a way to improve the generalization performance of CNNs, to the best of our knowledge, no work has investigated this generalization performance when the training dataset consists only of generated images. In this work we investigate the generalization capabilities of a state-of-the-art CNN architecture [12] trained solely with generated WCE GI tract images on the problem of detecting inflammatory conditions in real images; the results are promising. ...
... The proposed methodology is based on two components: the classifier and the data generator. The first component is a state-of-the-art lightweight CNN architecture, named LB-FCN light [12]. The key characteristics of this architecture are the relatively low number of free parameters, multi-scale feature extraction and residual learning [40]. ...
... Regarding the support of individuals with disabilities and especially the visually impaired individuals (VIIs), smart wearable assistive systems have been proposed [6], including object detection [7] and text recognition systems [8,9]. However, even though various wearable assistive systems have been developed for safe navigation of VIIs [6], most of them focus on obstacle detection and avoidance [10][11][12]. The majority of them have been applied mainly in indoor environments, whereas only a few of them address RP tasks [13,14]. ...
... The RP module interacts with another module of the assistive system, dedicated to obstacle detection (OD). The OD module can be based on one of the current methodologies proposed for this purpose, such as [10][11][12]. The RP module generates an optimal route in the area under examination, which can be dynamically updated based on information about the location of possible unmapped obstacles appearing in the user's way. Figure 1 shows a pair of smart glasses equipped with cameras, a commonly adopted wearable assistive system for VIIs [6], to illustrate the use of the RP module. ...
Article
Full-text available
Route planning (RP) enables individuals to navigate in unfamiliar environments. Current RP methodologies generate routes that optimize criteria relevant to the traveling distance or time, whereas most of them do not consider personal preferences or needs. Also, most of the current smart wearable assistive navigation systems offer limited support to individuals with disabilities by providing obstacle avoidance instructions, while often neglecting their special requirements with respect to route quality. Motivated by the mobility needs of such individuals, this study proposes a novel RP framework for assistive navigation that copes with these open issues. The framework is based on a novel mixed 0–1 integer nonlinear programming model for solving the RP problem with constraints originating from the needs of individuals with disabilities; unlike previous models, it minimizes: (1) the collision risk with obstacles within a path by prioritizing the safer paths; (2) the walking time; (3) the number of turns by constructing smooth paths, and (4) the loss of cultural interest by penalizing multiple crossovers of the same paths, while satisfying user preferences, such as points of interest to visit and a desired tour duration. The proposed framework is applied to the development of a system module for safe navigation of visually impaired individuals (VIIs) in outdoor cultural spaces. The module is evaluated in a variety of navigation scenarios with different parameters. The results demonstrate the comparative advantage of our RP model over relevant state-of-the-art models, by generating safer and more convenient routes for the VIIs.
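The exact mixed 0-1 model is not reproduced in this summary; purely as a sketch, the four criteria could be combined in a weighted objective over binary edge-selection variables, as below, where all symbols and weights are illustrative assumptions rather than the authors' formulation.

```latex
% Illustrative sketch only: a weighted objective over binary edge-selection
% variables x_{ij} (1 if edge (i,j) is traversed), in the spirit of the RP
% model described above; weights w_k, risk r_{ij} and time t_{ij} are assumed.
\[
\min_{x \in \{0,1\}^{|E|}} \;
  w_1 \sum_{(i,j) \in E} r_{ij}\, x_{ij}     % collision risk along the route
+ w_2 \sum_{(i,j) \in E} t_{ij}\, x_{ij}     % walking time
+ w_3 \, N_{\mathrm{turns}}(x)               % number of turns (smoothness)
+ w_4 \, N_{\mathrm{repeat}}(x)              % repeated crossings of the same paths
\]
% subject to route-connectivity constraints, required points of interest,
% and an upper bound on the desired tour duration.
```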
... Considering that the extensive experimental work required for this study is computationally demanding, we selected the LB-FCN light classifier [12], a computationally more efficient version of LB-FCN, which is a state-of-the-art classifier proposed for accurate classification of endoscopic images [11]. ...
Preprint
Full-text available
Medical image synthesis has emerged as a promising solution to address the limited availability of annotated medical data needed for training machine learning algorithms in the context of image-based Clinical Decision Support (CDS) systems. To this end, Generative Adversarial Networks (GANs) have been mainly applied to support the algorithm training process by generating synthetic images for data augmentation. However, in the field of Wireless Capsule Endoscopy (WCE), the limited content diversity and size of existing publicly available annotated datasets adversely affect both the training stability and synthesis performance of GANs. Aiming at a viable solution for WCE image synthesis, a novel Variational Autoencoder architecture is proposed, namely "This Intestine Does not Exist" (TIDE). The proposed architecture comprises multiscale feature extraction convolutional blocks and residual connections, which enable the generation of high-quality and diverse datasets even with a limited number of training images. Contrary to the current approaches, which are oriented towards the augmentation of the available datasets, this study demonstrates that using TIDE, real WCE datasets can be fully substituted by artificially generated ones, without compromising classification performance. Furthermore, qualitative and user evaluation studies by experienced WCE specialists validate from a medical viewpoint that both the normal and abnormal WCE images synthesized by TIDE are sufficiently realistic.
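As a rough illustration of the kind of generative model discussed (an encoder, a reparameterized latent space and a decoder trained with reconstruction and KL terms), here is a minimal convolutional VAE sketch in Keras; it omits TIDE's multi-scale blocks and residual connections, and all sizes, as well as the train_images placeholder, are assumptions.

```python
# Minimal convolutional VAE sketch (encoder, reparameterisation, decoder).
# It omits TIDE's multi-scale blocks and residual connections; sizes are assumed.
import tensorflow as tf
from tensorflow.keras import layers, Model

latent_dim = 64  # assumed latent size

def make_encoder():
    inp = layers.Input((64, 64, 3))
    h = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inp)
    h = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(h)
    h = layers.Flatten()(h)
    return Model(inp, [layers.Dense(latent_dim)(h), layers.Dense(latent_dim)(h)])

def make_decoder():
    z = layers.Input((latent_dim,))
    h = layers.Dense(16 * 16 * 64, activation="relu")(z)
    h = layers.Reshape((16, 16, 64))(h)
    h = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(h)
    h = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(h)
    return Model(z, layers.Conv2D(3, 3, padding="same", activation="sigmoid")(h))

class VAE(Model):
    def __init__(self):
        super().__init__()
        self.enc, self.dec = make_encoder(), make_decoder()

    def call(self, x, training=False):
        z_mean, z_log_var = self.enc(x)
        eps = tf.random.normal(tf.shape(z_mean))
        z = z_mean + tf.exp(0.5 * z_log_var) * eps        # reparameterisation trick
        recon = self.dec(z)
        # Reconstruction and KL divergence terms added as model losses
        self.add_loss(tf.reduce_mean(tf.square(x - recon)))
        self.add_loss(-0.5 * tf.reduce_mean(
            1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var)))
        return recon

# Usage (train_images is a placeholder array of images scaled to [0, 1]):
# vae = VAE(); vae.compile(optimizer="adam"); vae.fit(train_images, epochs=10)
```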
... To validate the performance of the proposed VAE-based framework in endoscopic image generation, we investigate the performance of a state-of-the-art endoscopic classifier trained solely on the images artificially generated by the proposed methodology. In particular, the LB-FCN light [21] model was employed to conduct two separate experiments. First, it was trained on the real normal and abnormal subsets of KID dataset and then its performance was compared following the same training procedure with images exclusively synthesized by the VAE architecture presented. ...
Conference Paper
Full-text available
The generalization performance of deep learning models is closely associated with the number and diversity of data available upon training. While in many applications a large amount of data is publicly available, in domains such as medical image analysis data availability is limited. This can be largely attributed to data privacy legislation, including the General Data Protection Regulation (GDPR), and the cost of data annotation by experts. Aiming to address this issue, data augmentation approaches employing deep generative models have emerged. Existing augmentation techniques are primarily based on Generative Adversarial Networks (GANs). However, ill-posed training issues of GANs, such as non-convergence, mode collapse and instability, in conjunction with their demand for large-scale training datasets, complicate their use in medical imaging modalities. Motivated by these issues, this paper investigates the performance of alternative generative models, i.e., Variational Autoencoders (VAEs), in endoscopic image synthesis tasks. Contrary to the conventional GAN-based approaches that aim at augmenting the existing endoscopic datasets, the proposed methodology makes feasible the complete substitution of medical imaging datasets from real individuals with artificially generated ones. The experimental results obtained validate the effectiveness of the proposed methodology over the state-of-the-art.
... In addition, to minimize falsely detected obstacles, depth information is exploited to detect and remove the ground plane, along with objects that are above the user's height and hence do not pose any risk even if they are estimated to be within a range that can be considered "risky." The detected obstacles are consecutively recognized using a lightweight, fully convolutional neural network (LB-FCN light) [Diamantis et al. 2019], labeled and localized in the 3D space. Then, appropriate avoidance commands can be constructed and communicated to the user as described above. ...
... Dimas et al. [36] proposed an uncertainty-aware obstacle detection approach, in which saliency maps extracted using a GAN from RGB images and fuzzy logic applied in the depth maps were combined to detect high-risk obstacles in unknown environments. The bounding boxes were then used as an input for a lightweight convolutional neural network (CNN), named look-behind fully CNN light (LB-FCN light) [11], which performed obstacle recognition. A smartphone-based outdoor obstacle avoidance method using a Single Shot Detector (SSD) [28] was proposed by Chen et al. [8]. ...
Chapter
The assistive navigation of visually impaired individuals requires the development of different algorithms for obstacle detection, recognition, avoidance, and path planning. The assessment and optimization of such algorithms in the real world is a painstaking process that requires repetitive measurements under stable conditions, which is usually difficult to achieve and costly. To this end, digital twin environments can be used to replicate relevant real-life situations, enabling the evaluation and optimization of algorithms through adjustable and cost-effective simulations. This chapter presents a digital twin framework for the simulation and evaluation of assistive navigation systems, and its application in the context of a camera-based wearable system for visually impaired individuals in an outdoor cultural space. The system incorporates an obstacle avoidance algorithm based on fuzzy logic. The utility and the effectiveness of this framework are demonstrated with an indicative simulation study.
... This result concords with a very similar study undertaken by the same authors, except that in this instance only a dual-class classification problem (metastasis present vs. metastasis absent) was performed, by excluding any patients with degenerative lesions; their CNN model achieved a higher overall accuracy of 97.38% (129). Ntakolia et al. (127) performed the same three-class classification problem mentioned above, also on 778 PCa patients who underwent bone scintigraphy, except this time deploying a lightweight version of the look-behind FCN (LB-FCN) (132,133), and achieved a better overall accuracy of 97.41%. Their results demonstrated that state-of-the-art classification results can be achieved using a CNN with fewer learnable parameters, thus requiring fewer resources for training. ...
Article
Full-text available
Metastatic Prostate Cancer (mPCa) is associated with a poor patient prognosis. mPCa spreads throughout the body, often to bones, with spatial and temporal variations that make the clinical management of the disease difficult. The evolution of the disease leads to spatial heterogeneity that is extremely difficult to characterise with solid biopsies. Imaging provides the opportunity to quantify disease spread. Advanced image analytics methods, including radiomics, offer the opportunity to characterise heterogeneity beyond what can be achieved with simple assessment. Radiomics analysis has the potential to yield useful quantitative imaging biomarkers that can improve the early detection of mPCa, predict disease progression, assess response, and potentially inform the choice of treatment procedures. Traditional radiomics analysis involves modelling with hand-crafted features designed using significant domain knowledge. On the other hand, artificial intelligence techniques such as deep learning can facilitate end-to-end automated feature extraction and model generation with minimal human intervention. Radiomics models have the potential to become vital pieces in the oncology workflow, however, the current limitations of the field, such as limited reproducibility, are impeding their translation into clinical practice. This review provides an overview of the radiomics methodology, detailing critical aspects affecting the reproducibility of features, and providing examples of how artificial intelligence techniques can be incorporated into the workflow. The current landscape of publications utilising radiomics methods in the assessment and treatment of mPCa are surveyed and reviewed. Associated studies have incorporated information from multiple imaging modalities, including bone scintigraphy, CT, PET with varying tracers, multiparametric MRI together with clinical covariates, spanning the prediction of progression through to overall survival in varying cohorts. The methodological quality of each study is quantified using the radiomics quality score. Multiple deficits were identified, with the lack of prospective design and external validation highlighted as major impediments to clinical translation. These results inform some recommendations for future directions of the field.
... The second component uses the depth channel of the RGB-D image, to compute three risk maps, representing high, medium and low risk obstacles, based on fuzzy logic. Following the fuzzy aggregation of the outputs of these components, the resulting sub-images corresponding to obstacle regions are provided to a CNN, called Look Behind Fully Convolutional Neural Network (LB-FCN) light [43], to perform the obstacle recognition step. The processing steps required by the methodology [8], are illustrated in Fig. 3. ...
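The cited method's exact membership functions and aggregation rules are not given here; the NumPy sketch below only illustrates the general idea of deriving high/medium/low risk maps from a depth image with fuzzy memberships and combining them, with all breakpoints and weights assumed.

```python
# Generic illustration of fuzzy risk maps computed from a depth image (metres).
# Membership breakpoints and the aggregation rule are assumptions, not the
# cited method's exact design.
import numpy as np

def trapezoid(x, a, b, c, d):
    """Trapezoidal membership function over the depth axis."""
    rising = np.clip((x - a) / max(b - a, 1e-6), 0.0, 1.0)
    falling = np.clip((d - x) / max(d - c, 1e-6), 0.0, 1.0)
    return np.minimum(rising, falling)

def risk_maps(depth):
    high = trapezoid(depth, 0.0, 0.0, 0.8, 1.5)     # very close -> high risk
    medium = trapezoid(depth, 0.8, 1.5, 2.5, 3.5)   # mid range -> medium risk
    low = trapezoid(depth, 2.5, 3.5, 10.0, 10.0)    # far away -> low risk
    return high, medium, low

depth = np.random.uniform(0.3, 8.0, size=(480, 640))    # stand-in depth frame
high, medium, low = risk_maps(depth)
# Simple fuzzy aggregation: weighted maximum, emphasising higher-risk grades
risk = np.maximum.reduce([1.0 * high, 0.6 * medium, 0.2 * low])
obstacle_mask = risk > 0.5    # candidate regions passed on for recognition
```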
Article
Full-text available
Machine Learning (ML) applications are growing at an unprecedented scale. The development of easy-to-use machine-learning application frameworks has enabled the creation of advanced artificial intelligence (AI) applications with only a few lines of self-explanatory code. As a result, ML-based AI is becoming approachable by mainstream developers and small businesses. However, the deployment of ML algorithms for remote, high-throughput ML task execution involving complex data-processing pipelines can still be challenging, especially with respect to production ML use cases. To cope with this issue, in this paper we propose a novel system architecture that enables Algorithm-agnostic, Scalable ML (ASML) task execution for high-throughput applications. It aims to answer the research question of how to design and implement an abstraction framework suitable for the deployment of end-to-end ML pipelines in a generic and standard way. The proposed ASML architecture manages horizontal scaling, task scheduling, reporting, monitoring and execution of multi-client ML tasks using modular, extensible components that abstract the execution details of the underlying algorithms. Experiments in the context of obstacle detection and recognition, as well as in the context of abnormality detection in medical image streams, demonstrate its capacity for parallel, mission-critical task execution.
... The requirements under examination are achievable. The performance of the overall system was promising, as proved both by the evaluation and by the published studies [89,90]. However, a lot of work is still needed on the design of the frame of the eyeglasses and the attached camera in order to meet the ergonomic requirements. ...
Article
Full-text available
The marginalization of people with disabilities, such as visually impaired individuals (VIIs), has driven scientists to take advantage of the fast growth of smart technologies and develop smart assistive systems (SASs) to bring VIIs back to social life, education and even to culture. Our research focuses on developing a human–computer interactive system that will guide VIIs in outdoor cultural environments by offering universal access to cultural information, social networking and safe navigation among other services. The VI users interact with computer-based SAS to control the system during its operation, while having access to remote connection with non-VIIs for external guidance and company. The development of such a system needs a user-centered design (UCD) that incorporates the elicitation of the necessary requirements for a satisfying operation for the VI users. In this paper, we present a novel SAS system for VIIs and its design considerations, which follow a UCD approach to determine a set of operational, functional, ergonomic, environmental and optional requirements of the system. Both VIIs and non-VIIs took part in a series of interviews and questionnaires, from which data were analyzed to form the requirements of the system for both the on-site and remote use. The final requirements are tested by trials and their evaluation and results are presented. The experimental investigations gave significant feedback for the development of the system, throughout the design process. The most important contribution of this study is the derivation of requirements applicable not only to the specific system under investigation, but also to other relevant SASs for VIIs.
Article
Convolutional Neural Networks (CNNs) are artificial learning systems typically based on two operations: convolution, which implements feature extraction through filtering, and pooling, which implements dimensionality reduction. The impact of pooling in the classification performance of the CNNs has been highlighted in several previous works, and a variety of alternative pooling operators have been proposed. However, only a few of them tackle the uncertainty that is naturally propagated from the input layer to the feature maps of the hidden layers through convolutions. In this paper we present a novel pooling operation based on fuzzy sets to cope with the local imprecision of the feature maps, and we investigate its performance in the context of image classification. Fuzzy pooling is performed by fuzzification, aggregation and defuzzification of feature map neighborhoods. It is used for the construction of a fuzzy pooling layer that can be applied as a drop-in replacement of the current, crisp, pooling layers of CNN architectures. Several experiments using publicly available datasets show that the proposed approach can enhance the classification performance of a CNN. A comparative evaluation shows that it outperforms state-of-the-art pooling approaches.
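As an illustration of the fuzzify-aggregate-defuzzify idea on feature-map neighbourhoods, the following NumPy sketch pools 2x2 patches; the membership functions, fuzzy sets and defuzzification rule are illustrative choices, not the exact ones proposed in the article.

```python
# Sketch of the fuzzify-aggregate-defuzzify idea behind fuzzy pooling on
# 2x2 neighbourhoods. Membership functions and the defuzzification rule are
# illustrative choices, not the article's exact formulation.
import numpy as np

def triangular(x, a, b, c):
    return np.maximum(np.minimum((x - a) / (b - a + 1e-6),
                                 (c - x) / (c - b + 1e-6)), 0.0)

def fuzzy_pool_2x2(fmap):
    h, w = fmap.shape
    out = np.zeros((h // 2, w // 2))
    centres = np.array([0.0, 0.5, 1.0])          # centroids of the fuzzy sets
    for i in range(0, h - 1, 2):
        for j in range(0, w - 1, 2):
            patch = fmap[i:i + 2, j:j + 2].ravel()
            p = (patch - patch.min()) / (np.ptp(patch) + 1e-6)   # local rescale
            # Fuzzification: membership of each value to "low"/"medium"/"high"
            mu = np.stack([triangular(p, -0.5, 0.0, 0.5),
                           triangular(p, 0.0, 0.5, 1.0),
                           triangular(p, 0.5, 1.0, 1.5)])
            agg = mu.sum(axis=1)                  # aggregation over the patch
            # Defuzzification: centre of gravity, mapped back to the patch range
            cog = (agg * centres).sum() / (agg.sum() + 1e-6)
            out[i // 2, j // 2] = patch.min() + cog * np.ptp(patch)
    return out

pooled = fuzzy_pool_2x2(np.random.rand(8, 8))
```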
Article
Full-text available
TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous "parameter server" designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with particularly strong support for training and inference on deep neural networks. Several Google services use TensorFlow in production, we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model in contrast to existing systems, and demonstrate the compelling performance that TensorFlow achieves for several real-world applications.
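A minimal example of the dataflow style described above: tf.function traces a Python computation into a graph over shared variable state that TensorFlow can place on the available devices.

```python
# Small example of expressing a computation as a TensorFlow dataflow graph:
# tf.function traces the Python code into a graph that TensorFlow can place
# across available devices (CPU cores, GPUs, TPUs).
import tensorflow as tf

W = tf.Variable(tf.random.normal([4, 2]))   # shared, mutable state
b = tf.Variable(tf.zeros([2]))

@tf.function
def forward(x):
    return tf.nn.softmax(tf.matmul(x, W) + b)

print(forward(tf.random.normal([3, 4])))    # executes the traced graph
```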
Article
Full-text available
Recent research on deep neural networks has focused primarily on improving accuracy. For a given accuracy level, it is typically possible to identify multiple DNN architectures that achieve that accuracy level. With equivalent accuracy, smaller DNN architectures offer at least three advantages: (1) Smaller DNNs require less communication across servers during distributed training. (2) Smaller DNNs require less bandwidth to export a new model from the cloud to an autonomous car. (3) Smaller DNNs are more feasible to deploy on FPGAs and other hardware with limited memory. To provide all of these advantages, we propose a small DNN architecture called SqueezeNet. SqueezeNet achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters. Additionally, with model compression techniques we are able to compress SqueezeNet to less than 0.5MB (510x smaller than AlexNet). The SqueezeNet architecture is available for download here: https://github.com/DeepScale/SqueezeNet
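The building block behind SqueezeNet's parameter savings is the fire module (a 1x1 "squeeze" convolution feeding parallel 1x1 and 3x3 "expand" convolutions); the Keras sketch below shows one such module with example filter counts, not the full SqueezeNet configuration.

```python
# Sketch of a SqueezeNet "fire" module: a 1x1 squeeze convolution followed by
# parallel 1x1 and 3x3 expand convolutions whose outputs are concatenated.
# Filter counts are examples, not the complete SqueezeNet configuration.
import tensorflow as tf
from tensorflow.keras import layers

def fire_module(x, squeeze_filters=16, expand_filters=64):
    s = layers.Conv2D(squeeze_filters, 1, activation="relu")(x)
    e1 = layers.Conv2D(expand_filters, 1, activation="relu")(s)
    e3 = layers.Conv2D(expand_filters, 3, padding="same", activation="relu")(s)
    return layers.Concatenate()([e1, e3])

inp = layers.Input(shape=(224, 224, 3))
x = layers.Conv2D(96, 7, strides=2, activation="relu")(inp)
x = layers.MaxPooling2D(3, strides=2)(x)
x = fire_module(x, 16, 64)
model = tf.keras.Model(inp, x)
```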
Conference Paper
Full-text available
In this paper we deal with the perception task of a wearable navigation assistant. Specifically, we have focused on the detection of staircases because of the important role they play in indoor navigation, due to the multi-floor reaching possibilities they bring and the lack of security they cause, especially for those who suffer from visual deficiencies. We use the depth sensing capabilities of modern RGB-D cameras to segment and classify the different elements that integrate the scene, and then carry out the stair detection and modelling algorithm to retrieve all the information that might interest the user, i.e. the location and orientation of the staircase, the number of steps and the step dimensions. Experiments prove that the system is able to perform in real-time and works even under partial occlusions of the stairway.
Conference Paper
Full-text available
The process of staircase negotiation is complex for blind individuals. Therefore, an intelligent system is required to help them. In this paper, we investigate using only one ultrasonic sensor to detect and recognize floor and staircase cases in an electronic white cane. The performance of an object recognition system depends on both object representation and classification algorithms. In our system, we have used more than one representation of the ultrasonic signal in the frequency domain. First, the spectrogram representation shows how the spectral density of the ultrasonic signal varies with time. Second, the spectrum representation shows the amplitudes as a function of frequency. Finally, the periodogram representation estimates the spectral density of the signal. Then, several features are extracted from each representation. Our system was evaluated on a set of ultrasonic signals in which floor and staircase cases occur with different shapes. Using a multiclass SVM approach, an accuracy rate of 72.41% was achieved.
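A rough sketch of this kind of pipeline: spectrogram, spectrum and periodogram summaries of an ultrasonic echo feeding a multiclass SVM. The sampling rate, window sizes, summary statistics, assumed class labels and the random stand-in data are all assumptions.

```python
# Illustration of the kind of pipeline described above: spectrogram, spectrum
# and periodogram features of an ultrasonic signal feeding a multiclass SVM.
# Sampling rate, window sizes and the dummy data are assumptions.
import numpy as np
from scipy import signal
from sklearn.svm import SVC

fs = 40_000  # assumed ultrasonic sampling rate (Hz)

def features(x):
    _, _, sxx = signal.spectrogram(x, fs=fs, nperseg=256)
    spectrum = np.abs(np.fft.rfft(x))
    _, pxx = signal.periodogram(x, fs=fs)
    # Compact summary statistics from each representation
    return np.concatenate([
        [sxx.mean(), sxx.std()],
        [spectrum.mean(), spectrum.std()],
        [pxx.mean(), pxx.std()],
    ])

# Dummy echoes standing in for recorded returns
X = np.stack([features(np.random.randn(4096)) for _ in range(40)])
y = np.random.randint(0, 3, size=40)   # e.g. floor / ascending / descending (assumed)

clf = SVC(kernel="rbf", decision_function_shape="ovr").fit(X, y)
print(clf.predict(X[:5]))
```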
Article
Full-text available
Receiver operating characteristics (ROC) graphs are useful for organizing classifiers and visualizing their performance. ROC graphs are commonly used in medical decision making, and in recent years have been used increasingly in machine learning and data mining research. Although ROC graphs are apparently simple, there are some common misconceptions and pitfalls when using them in practice. The purpose of this article is to serve as an introduction to ROC graphs and as a guide for using them in research.
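A minimal example of building an ROC curve and its area under the curve (AUC) for binary classifier scores with scikit-learn, on dummy labels and scores:

```python
# Minimal example of an ROC curve and its AUC for a binary classifier,
# computed with scikit-learn on dummy labels and scores.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3])  # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", roc_auc_score(y_true, y_score))
# Plotting fpr against tpr gives the ROC graph; the diagonal is random guessing.
```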
Conference Paper
Full-text available
This paper presents a simple and effective nonparametric approach to the problem of image parsing, or labeling image regions (in our case, superpixels produced by bottom-up segmentation) with their categories. This approach requires no training, and it can easily scale to datasets with tens of thousands of images and hundreds of labels. It works by scene-level matching with global image descriptors, followed by superpixel-level matching with local features and efficient Markov random field (MRF) optimization for incorporating neighborhood context. Our MRF setup can also compute a simultaneous labeling of image regions into semantic classes (e.g., tree, building, car) and geometric classes (sky, vertical, ground). Our system outperforms the state-of-the-art nonparametric method based on SIFT Flow on a dataset of 2,688 images and 33 labels. In addition, we report per-pixel rates on a larger dataset of 15,150 images and 170 labels. To our knowledge, this is the first complete evaluation of image parsing on a dataset of this size, and it establishes a new benchmark for the problem.
Article
In this paper, we propose a novel Fully Convolutional Neural Network (FCN) architecture aiming to aid the detection of abnormalities, such as polyps, ulcers and blood, in gastrointestinal (GI) endoscopy images. The proposed architecture, named Look-Behind FCN (LB-FCN), is capable of extracting multi-scale image features by using blocks of parallel convolutional layers with different filter sizes. These blocks are connected by Look-Behind (LB) connections, so that the features they produce are combined with features extracted from the layers behind, thus preserving the respective information. Furthermore, it has a smaller number of free parameters than conventional Convolutional Neural Network (CNN) architectures, which makes it suitable for training with smaller datasets. This is particularly useful in medical image analysis, since data availability is usually limited due to ethicolegal constraints. The performance of LB-FCN is evaluated on both flexible and wireless capsule endoscopy datasets, reaching 99.72% and 93.50% in terms of Area Under the receiver operating Characteristic curve (AUC), respectively.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Article
Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
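A small NumPy sketch of the core idea in its commonly used "inverted dropout" form: units are dropped with probability p during training and activations rescaled so the single unthinned network can be used unchanged at test time.

```python
# Sketch of (inverted) dropout on an activation vector: units are dropped with
# probability p during training and the survivors rescaled, approximating the
# averaged predictions of the "thinned" networks at test time.
import numpy as np

def dropout(a, p=0.5, training=True):
    if not training:
        return a                        # single unthinned network at test time
    mask = (np.random.rand(*a.shape) >= p).astype(a.dtype)
    return a * mask / (1.0 - p)         # rescale to preserve expected activation

print(dropout(np.ones(8), p=0.5))
```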
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
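A minimal Keras sketch of a basic residual block, where the stacked layers learn a residual function F(x) and the shortcut adds the input back (y = F(x) + x); filter counts are illustrative.

```python
# Sketch of a basic residual block: the stacked convolutions learn a residual
# function F(x) and the shortcut adds the input back, giving y = F(x) + x.
# Filter counts are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    if x.shape[-1] != filters:                        # project the shortcut if needed
        x = layers.Conv2D(filters, 1, padding="same")(x)
    return layers.ReLU()(layers.Add()([y, x]))

inp = layers.Input(shape=(32, 32, 3))
out = residual_block(residual_block(inp, 64), 64)
model = tf.keras.Model(inp, out)
```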
Article
Most modern convolutional neural networks (CNNs) used for object recognition are built using the same principles: Alternating convolution and max-pooling layers followed by a small number of fully connected layers. We re-evaluate the state of the art for object recognition from small images with convolutional networks, questioning the necessity of different components in the pipeline. We find that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks. Following this finding -- and building on other recent work for finding simple network structures -- we propose a new architecture that consists solely of convolutional layers and yields competitive or state of the art performance on several object recognition datasets (CIFAR-10, CIFAR-100, ImageNet). To analyze the network we introduce a new variant of the "deconvolution approach" for visualizing features learned by CNNs, which can be applied to a broader range of network structures than existing approaches.
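A small Keras illustration of the substitution discussed above: a convolution plus 2x2 max-pooling stage versus a stride-2 convolution, both halving the spatial resolution; filter counts are illustrative.

```python
# Sketch of the substitution discussed above: a max-pooling stage replaced by
# a stride-2 convolution, which downsamples while keeping the network
# all-convolutional. Filter counts are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

inp = layers.Input(shape=(32, 32, 3))

# Conventional: convolution followed by 2x2 max-pooling
a = layers.Conv2D(64, 3, padding="same", activation="relu")(inp)
a = layers.MaxPooling2D(pool_size=2)(a)

# All-convolutional: the pooling layer becomes a strided convolution
b = layers.Conv2D(64, 3, padding="same", activation="relu")(inp)
b = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(b)

print(a.shape, b.shape)   # both downsample 32x32 -> 16x16
```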
Article
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions. The method is straightforward to implement and is based on adaptive estimates of lower-order moments of the gradients. The method is computationally efficient, has little memory requirements and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The method exhibits invariance to diagonal rescaling of the gradients by adapting to the geometry of the objective function. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. We demonstrate that Adam works well in practice when experimentally compared to other stochastic optimization methods.
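For reference, the moment estimates and bias-corrected parameter update summarised above can be written in their standard form as:

```latex
% Adam update for parameters \theta with gradients g_t: \beta_1, \beta_2 are
% the decay rates, \alpha the step size and \epsilon a small constant.
\[
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2,
\]
\[
\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad
\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}.
\]
```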
Conference Paper
We address the problem of staircase detection, in the context of a navigation aid for the visually impaired. The requirements for such a system are robustness to viewpoint, distance and scale, real-time operation, a high detection rate and a low false alarm rate. Our approach uses classifiers trained using Haar features and AdaBoost learning. This first stage does detect staircases, but produces many false alarms. The false alarm rate is drastically reduced by using spatial context in the form of the estimated ground plane, and by enforcing temporal consistency. We have validated our approach on many real sequences under various weather conditions, and we present some of the quantitative results here.
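A hedged illustration of applying a boosted Haar-feature cascade with OpenCV; the staircase cascade file, the input image and the detection parameters are hypothetical, and the context and temporal-consistency filtering is only indicated in comments.

```python
# Illustration of applying a boosted Haar-feature cascade with OpenCV.
# "staircase_cascade.xml" is a hypothetical cascade trained for staircases;
# OpenCV itself only ships cascades for faces and similar objects.
import cv2

cascade = cv2.CascadeClassifier("staircase_cascade.xml")   # hypothetical model
frame = cv2.imread("scene.jpg")                            # hypothetical input image
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Multi-scale sliding-window detection, as in the first stage described above
candidates = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3)
for (x, y, w, h) in candidates:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
# A second stage (ground-plane context, temporal consistency) would then
# filter these candidate boxes to suppress false alarms.
```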
Article
A robust vision-based staircase identification method is proposed, which comprises 2D staircase detection and 3D staircase localization. The 2D detector pre-screens the input image, and the 3D localization algorithm continues the task of retrieving the geometry of the staircase from the reported region in the image. A novel set of principal component analysis-based Haar-like features is introduced, which extends the classical Haar-like features from the local to the global domain and is extremely efficient at rejecting non-object regions in the early stages of the cascade. The Viola-Jones rapid object detection framework is adapted to the context of staircase detection, with modifications made to the scanning scheme, the multiple-detection integration scheme and the final detection evaluation metrics. The V-disparity concept is applied to detect the planar regions on the staircase surface and to locate 3D planes quickly from disparity maps; the 3D position of the staircase is then localized robustly. Finally, experiments show the performance of the proposed method.
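A small NumPy sketch of the V-disparity idea used for plane detection: a per-row histogram of disparity values, in which planar regions project to near-linear segments; the disparity range and the random stand-in map are assumptions.

```python
# Sketch of a V-disparity map: for each image row, a histogram of disparity
# values. Planar regions such as stair treads and the ground project to
# near-linear segments in this map. Disparity range and data are assumed.
import numpy as np

def v_disparity(disparity, max_disp=64):
    h, _ = disparity.shape
    vmap = np.zeros((h, max_disp), dtype=np.int64)
    for row in range(h):
        vals = disparity[row]
        vals = vals[(vals >= 0) & (vals < max_disp)].astype(int)
        vmap[row] += np.bincount(vals, minlength=max_disp)
    return vmap

disparity = np.random.randint(0, 64, size=(480, 640))   # stand-in disparity map
vmap = v_disparity(disparity)
# Line fitting (e.g. a Hough transform) on vmap then recovers the planar regions.
```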
Article
"This book is required reading for anyone working with accelerator-based computing systems." (From the Foreword by Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory.) CUDA is a computing architecture designed to facilitate the development of parallel programs. In conjunction with a comprehensive software platform, the CUDA Architecture enables programmers to draw on the immense power of graphics processing units (GPUs) when building high-performance applications. GPUs, of course, have long been available for demanding graphics and game applications. CUDA now brings this valuable resource to programmers working on applications in other domains, including science, engineering, and finance. No knowledge of graphics programming is required, just the ability to program in a modestly extended version of C. CUDA by Example, written by two senior members of the CUDA software platform team, shows programmers how to employ this new technology. The authors introduce each area of CUDA development through working examples. After a concise introduction to the CUDA platform and architecture, as well as a quick-start guide to CUDA C, the book details the techniques and trade-offs associated with each key CUDA feature. You'll discover when to use each CUDA C extension and how to write CUDA software that delivers truly outstanding performance. Major topics covered include parallel programming, thread cooperation, constant memory and events, texture memory, graphics interoperability, atomics, streams, CUDA C on multiple GPUs, advanced atomics, and additional CUDA resources. All the CUDA software tools you'll need are freely available for download from NVIDIA: http://developer.nvidia.com/object/cuda-by-example.html
Conference Paper
This paper presents a strategy for descending-stair detection, approach, and traversal using inertial sensing and a monocular camera mounted on an autonomous tracked vehicle. At the core of our algorithm are vision modules that exploit texture energy, optical flow, and scene geometry (lines) in order to robustly detect descending stairwells during both far- and near-approaches. As the robot navigates down the stairs, it estimates its three-degrees-of-freedom (d.o.f.) attitude by fusing rotational velocity measurements from an on-board tri-axial gyroscope with line observations of the stair edges detected by its camera. We employ a centering controller, derived based on a linearized dynamical model of our system, in order to steer the robot along safe trajectories. A real-time implementation of the described algorithm was developed for an iRobot Packbot, and results from real-world experiments are presented.
Scene categorization is a fundamental problem in computer vision. However, scene understanding research has been constrained by the limited scope of currently-used databases which do not capture the full variety of scene categories. Whereas standard databases for object categorization contain hundreds of different classes of objects, the largest available dataset of scene categories contains only 15 classes. In this paper we propose the extensive Scene UNderstanding (SUN) database that contains 899 categories and 130,519 images. We use 397 well-sampled categories to evaluate numerous state-of-the-art algorithms for scene recognition and establish new bounds of performance. We measure human scene classification performance on the SUN database and compare this with computational methods. Additionally, we study a finer-grained scene representation to detect scenes embedded inside of larger scenes.
Article
Research in object detection and recognition in cluttered scenes requires large image collections with ground truth labels. The labels should provide information about the object classes present in each image, as well as their shape and locations, and possibly other attributes such as pose. Such data is useful for testing, as well as for supervised learning. This project provides a web-based annotation tool that makes it easy to annotate images, and to instantly share such annotations with the community. This tool, plus an initial set of 10,000 images (3,000 of which have been labeled), can be found at http://www.csail.mit.edu/~brussell/research/LabelMe/intro.html
Find your inspiration. | Flickr
  • Flickr Inc
LeNet-5, convolutional neural networks
  • Y. LeCun