Conference Paper

Urban Street Contexts Classification Using Convolutional Neural Networks and Streets Imagery

... For example, Victorian buildings in England have a distinctive façade design and internal layout. In addition, building uses can be partially indicated by their façade, with computer vision methods utilised to classify architectural styles [62,71,108] and typologies [2,15,50]. A classification of the whole image is typical for a high-level analysis of architectural features such as overall urban characteristics [41] or style and use classification [60]. ...
... Davies [20] trains an Inception V3 network to recognise Georgian architecture from GSV images of London. Likewise, Alhasoun and González [2] train a CNN to match GSV images to their corresponding US towns by classifying street frontage based on their urban contexts [56]. Deep-learning models are also used by Yoshimura et al. [104] to measure visual similarities between the styles of different architects. ...
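For readers unfamiliar with this transfer-learning setup, the following is a minimal, hypothetical sketch (not the cited authors' code) of fine-tuning a pretrained Inception V3 to classify street-view images into urban-context classes. The class count and auxiliary-loss weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 4  # assumed context classes, e.g. downtown / residential / industrial / rural

# Load an ImageNet-pretrained Inception V3 and replace both classifier heads.
model = models.inception_v3(weights=models.Inception_V3_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, NUM_CLASSES)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    # Inception V3 expects 299x299 inputs and, in training mode, returns
    # both the main logits and an auxiliary classifier output.
    model.train()
    optimizer.zero_grad()
    logits, aux_logits = model(images)
    loss = criterion(logits, labels) + 0.4 * criterion(aux_logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```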
Article
Analysing 88 sources published from 2011 to 2021, this paper presents a first systematic review of the computer vision-based analysis of buildings and the built environment. Its aim is to assess the potential of this research for architectural studies and the implications of a shift to a cross-disciplinary approach between architecture and computer science for research problems, aims, processes, and applications. To this end, the types of algorithms and data sources used in the reviewed studies are discussed in respect to architectural applications such as building classification, detail classification, qualitative environmental analysis, building condition survey, and building value estimation. Based on this, current research gaps and trends are identified, with two main research aims emerging. First, studies that use or optimise computer vision methods to automate time-consuming, labour-intensive, or complex tasks when analysing architectural image data. Second, work that explores the methodological benefits of machine learning approaches to overcome limitations of conventional analysis in order to investigate new questions about the built environment by finding patterns and relationships between visual, statistical, and qualitative data. The growing body of research offers new methods to architectural and design studies, with the paper identifying future challenges and directions of research.
... For example, Victorian buildings in England have a distinctive façade design and internal layout. In addition, building uses can be partially indicated by their façade, with computer vision methods utilised for this task by classifying architectural styles [61,70,108] and typologies [2,15,49]. A classification of the whole image is typical for a high-level analysis of architectural features such as overall urban characteristics [40] or style or use classification [59]. ...
... In another example, Davies [19] trains an Inception V3 network to recognise Georgian architecture from GSV images of London. Likewise, Alhasoun and González [2] use a CNN trained to match GSV images to their corresponding US towns based on a visual classification of urban street contexts or to classify street frontages [55]. Deep-learning models are also used to measure visual similarities between the styles of different architects by Yoshimura et al. [104]. ...
Preprint
Full-text available
Analysing 88 sources published from 2011 to 2021, this paper presents a first systematic review of the computer vision-based analysis of buildings and the built environment to assess its value to architectural and urban design studies. Following a multi-stage selection process, the types of algorithms and data sources used are discussed in respect to architectural applications such as building classification, detail classification, qualitative environmental analysis, building condition survey, and building value estimation. This reveals current research gaps and trends, and highlights two main categories of research aims. First, to use or optimise computer vision methods for architectural image data, which can then help automate time-consuming, labour-intensive, or complex tasks of visual analysis. Second, to explore the methodological benefits of machine learning approaches to investigate new questions about the built environment by finding patterns and relationships between visual, statistical, and qualitative data, which can overcome limitations of conventional manual analysis. The growing body of research offers new methods to architectural and design studies, with the paper identifying future challenges and directions of research.
... The understanding of urban streets in natural scenes usually involves two processes: the detection of artistic indicators in natural scenes and their subsequent recognition, and the accuracy of the detection stage plays a decisive role in successful recognition. In practical engineering, the detection of artistic indicators in natural scenes is often unsatisfactory, so the recogniser cannot correctly identify the artistic indicators in the street space, and as a result the semantics of urban streets cannot be accurately expressed [9][10][11][12]. Stoian et al. [13] proposed a natural-scene artistic-indicator detection and recognition algorithm based on a sliding-window mechanism using random forest classifiers and HOG features. ...
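As a rough illustration of the sliding-window approach attributed to Stoian et al. above, here is a minimal sketch using HOG features and a random forest. The window size, stride, and labelling scheme are assumptions for illustration, not details from the cited work.

```python
import numpy as np
from skimage.feature import hog
from sklearn.ensemble import RandomForestClassifier

WIN, STRIDE = 64, 16  # assumed window size and stride, in pixels

def window_features(patch):
    # HOG descriptor of one grayscale window.
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

def detect(image, clf):
    """Return (row, col) of windows the classifier flags as positive."""
    hits = []
    for r in range(0, image.shape[0] - WIN + 1, STRIDE):
        for c in range(0, image.shape[1] - WIN + 1, STRIDE):
            feat = window_features(image[r:r + WIN, c:c + WIN])
            if clf.predict([feat])[0] == 1:
                hits.append((r, c))
    return hits

# Training (assumed data layout): X holds HOG descriptors of labelled
# windows, y is 1 for indicator/sign windows and 0 for background.
# clf = RandomForestClassifier(n_estimators=100).fit(X, y)
```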
Article
Full-text available
Object detection in public spaces on urban streets has long been an important research topic in computer vision. Because the scenes involved in predicting public-space art design indicators are complex and changeable, target detection algorithms still face problems in practical applications. Based on a DCNN, this paper studies an accurate detection algorithm for urban streets in complex scenes and its implementation. It exploits DCNN coding to collect and compress data simultaneously, studies the prediction module of an urban street saliency detection algorithm, and combines the saliency map to determine the saliency of urban street art design indicators in the measurement domain. The method greatly shortens the indicator-prediction scan time and addresses the high window-calibration redundancy and long positioning times of indicator prediction. The experimental results show that the proposed method, which combines an urban street mask with public-space feature information, reduces interference from other information; the average accuracy of target detection increases by 0.398 and the error falls to 3.12%, significantly improving recognition accuracy for urban streets.
... A common application involves using semantic segmentation on street view images to measure the composition of the urban environment [8,9]. Related approaches utilise image classifiers to automate the classification of urban streets from street imagery data [12] and object detection models to detect traffic signs [13]. ...
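For illustration, a minimal sketch of the composition measurement mentioned above: run a pretrained segmentation model over a street view image and report per-class pixel fractions. Torchvision's COCO/VOC-trained DeepLabV3 serves here only as a stand-in; street-scene studies typically use models trained on street imagery such as Cityscapes.

```python
import torch
from torchvision import models
from torchvision.models.segmentation import DeepLabV3_ResNet50_Weights

weights = DeepLabV3_ResNet50_Weights.DEFAULT
model = models.segmentation.deeplabv3_resnet50(weights=weights).eval()
preprocess = weights.transforms()

def composition(image):
    """Fraction of pixels assigned to each class in one RGB image."""
    with torch.no_grad():
        out = model(preprocess(image).unsqueeze(0))["out"]
    labels = out.argmax(dim=1).flatten()
    counts = torch.bincount(labels, minlength=out.shape[1]).float()
    return {weights.meta["categories"][i]: (counts[i] / counts.sum()).item()
            for i in range(len(counts)) if counts[i] > 0}
```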
Conference Paper
Full-text available
Advances in Computer Vision and Deep Learning have enabled the use of new techniques in the study of urban spaces. While there have been significant research efforts in the use of Computer Vision to study urban forms with semantic segmentation, there is a conspicuous lack of development in using Computer Vision to observe activities in urban settings. Conventional methodologies to survey and observe human activities primarily involve labour-intensive manual field audits and questionnaires, which makes them time-consuming and expensive to scale. Given the interdependence of urban forms and human activities, there is a need to improve how activity observations are conducted to match the data quantity and quality that Computer Vision-based techniques can generate in the mapping of urban forms. In this paper, we present an approach that utilizes an action detection model to address the laborious nature of activity observations and demonstrate how such an implementation can allow the survey of large areas with minimal manpower.
Article
Full-text available
We present the DeepGlobe 2018 Satellite Image Understanding Challenge, which includes three public competitions for segmentation, detection, and classification tasks on satellite images. Similar to other challenges in the computer vision domain such as DAVIS [21] and COCO [33], DeepGlobe proposes three datasets and corresponding evaluation methodologies, coherently bundled in three competitions with a dedicated workshop co-located with CVPR 2018. We observed that satellite imagery is a rich and structured source of information, yet it is less investigated than everyday images by computer vision researchers. However, bridging modern computer vision with remote sensing data analysis could have a critical impact on the way we understand our environment and lead to major breakthroughs in global urban planning or climate change research. Keeping this bridging objective in mind, DeepGlobe aims to bring together researchers from different domains to raise awareness of remote sensing in the computer vision community and vice versa. We aim to improve and evaluate state-of-the-art satellite image understanding approaches, which can hopefully serve as reference benchmarks for future research on the same topic. In this paper, we analyze the characteristics of each dataset, define the evaluation criteria of the competitions, and provide baselines for each task.
Article
Full-text available
Road traffic injuries are a leading cause of death worldwide. Proper estimation of car accident risk is critical for the appropriate allocation of resources in healthcare, insurance, civil engineering and other industries. We show how images of houses are predictive of car accidents. We analyse 20,000 addresses of insurance company clients, collect a corresponding house image using Google Street View and annotate house features such as age, type and condition. We find that this information substantially improves car accident risk prediction compared to the state-of-the-art risk model of the insurance company and could be used for price discrimination. From this perspective, the public availability of house images raises legal and social concerns, as they can be a proxy of ethnicity, religion and other sensitive data.
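A minimal, synthetic-data sketch of the paper's core idea (not the authors' pipeline): append house features annotated from Street View images to a conventional insurance feature vector and compare a standard classifier with and without them. All feature names and data here are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
X_insurance = rng.normal(size=(n, 5))  # placeholder: driver age, bonus-malus, ...
X_house = rng.normal(size=(n, 3))      # placeholder: house age, type, condition
y = rng.integers(0, 2, size=n)         # placeholder: had an accident claim?

baseline = cross_val_score(LogisticRegression(max_iter=1000),
                           X_insurance, y, scoring="roc_auc").mean()
augmented = cross_val_score(LogisticRegression(max_iter=1000),
                            np.hstack([X_insurance, X_house]), y,
                            scoring="roc_auc").mean()
print(f"AUC baseline={baseline:.3f}, with house features={augmented:.3f}")
```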
Article
Full-text available
Urban land use is key to rational urban planning and management. Traditional land use classification methods rely heavily on domain experts, which is both expensive and inefficient. In this paper, deep neural network-based approaches are presented to label urban land use at pixel level using high-resolution aerial images and ground-level street view images. We use a deep neural network to extract semantic features from sparsely distributed street view images and interpolate them in the spatial domain to match the spatial resolution of the aerial images, which are then fused together through a deep neural network for classifying land use categories. Our methods are tested on a large publicly available aerial and street view image dataset of New York City, and the results show that using aerial images alone can achieve relatively high classification accuracy, that ground-level street view images contain useful information for urban land use classification, and that fusing street image features with aerial images can improve classification accuracy. Moreover, we present experimental studies showing that street view images add more value when the resolutions of the aerial images are lower, and we present case studies illustrating how street view images provide useful auxiliary information to aerial images to boost performance.
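A minimal sketch of one plausible fusion architecture along the lines described above (assumed, not the authors' exact model): an aerial-image CNN branch concatenated with a pre-extracted, spatially interpolated street-view feature vector, feeding a small classification head.

```python
import torch
import torch.nn as nn
from torchvision import models

class FusionClassifier(nn.Module):
    def __init__(self, num_classes, sv_dim=512):
        super().__init__()
        # Aerial branch: ResNet-18 trunk without its ImageNet head.
        trunk = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.aerial = nn.Sequential(*list(trunk.children())[:-1])
        # Street-view features are assumed pre-extracted per image and
        # interpolated to the region, so the input here is a fixed vector.
        self.fuse = nn.Sequential(
            nn.Linear(512 + sv_dim, 256), nn.ReLU(),
            nn.Linear(256, num_classes))

    def forward(self, aerial_img, street_feat):
        a = self.aerial(aerial_img).flatten(1)  # (N, 512) pooled features
        return self.fuse(torch.cat([a, street_feat], dim=1))
```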
Article
Full-text available
In this study we propose a new method to simulate hyper-realistic urban patterns using Generative Adversarial Networks trained with a global urban land-use inventory. We generated a synthetic urban "universe" that qualitatively reproduces the complex spatial organization observed in global urban patterns, while being able to quantitatively recover certain key high-level urban spatial metrics.
Article
Full-text available
Significance We show that socioeconomic attributes such as income, race, education, and voting patterns can be inferred from cars detected in Google Street View images using deep learning. Our model works by discovering associations between cars and people. For example, if the number of sedans in a city is higher than the number of pickup trucks, that city is likely to vote for a Democrat in the next presidential election (88% chance); if not, then the city is likely to vote for a Republican (82% chance).
Conference Paper
Full-text available
Urban planning applications (energy audits, investment, etc.) require an understanding of built infrastructure and its environment, i.e., both low-level, physical features (amount of vegetation, building area and geometry etc.), as well as higher-level concepts such as land use classes (which encode expert understanding of socio-economic end uses). This kind of data is expensive and labor-intensive to obtain, which limits its availability (particularly in developing countries). We analyze patterns in land use in urban neighborhoods using large-scale satellite imagery data (which is available worldwide from third-party providers) and state-of-the-art computer vision techniques based on deep convolutional neural networks. For supervision, given the limited availability of standard benchmarks for remote-sensing data, we obtain ground truth land use class labels carefully sampled from open-source surveys, in particular the Urban Atlas land classification dataset of 20 land use classes across ~300 European cities. We use this data to train and compare deep architectures which have recently shown good performance on standard computer vision tasks (image classification and segmentation), including on geospatial data. Furthermore, we show that the deep representations extracted from satellite imagery of urban environments can be used to compare neighborhoods across several cities. We make our dataset available for other machine learning researchers to use for remote-sensing applications.
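A minimal sketch of the neighborhood-comparison use mentioned at the end of the abstract: embed satellite tiles with a CNN trunk and compare neighborhoods by the cosine similarity of their mean embeddings. The ImageNet-pretrained trunk here is a stand-in for the land-use-trained networks of the paper.

```python
import torch
import torch.nn.functional as F
from torchvision import models

trunk = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
trunk.fc = torch.nn.Identity()  # keep the 2048-d pooled features
trunk.eval()

@torch.no_grad()
def neighborhood_embedding(tiles):
    """Average embedding of a batch of satellite tiles, shape (N, 3, H, W)."""
    return trunk(tiles).mean(dim=0)

def similarity(tiles_a, tiles_b):
    a = neighborhood_embedding(tiles_a)
    b = neighborhood_embedding(tiles_b)
    return F.cosine_similarity(a, b, dim=0).item()
```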
Article
Full-text available
Significance We develop a computer vision method to measure changes in the physical appearances of neighborhoods from street-level imagery. We correlate the measured changes with neighborhood characteristics to determine which characteristics predict neighborhood improvement. We find that both education and population density predict improvements in neighborhood infrastructure, in support of theories of human capital agglomeration. Neighborhoods with better initial appearances experience more substantial upgrading, as predicted by the tipping theory of urban change. Finally, we observe more improvement in neighborhoods closer to both city centers and other physically attractive neighborhoods, in agreement with the invasion theory of urban sociology. Our results show how computer vision techniques, in combination with traditional methods, can be used to explore the dynamics of urban change.
Article
Full-text available
Social science literature has shown a strong connection between the visual appearance of a city's neighborhoods and the behavior and health of its citizens. Yet, this research is limited by the lack of methods that can be used to quantify the appearance of streetscapes across cities or at high enough spatial resolutions. In this paper, we describe 'Streetscore', a scene understanding algorithm that predicts the perceived safety of a streetscape, using training data from an online survey with contributions from more than 7000 participants. We first study the predictive power of commonly used image features using support vector regression, finding that Geometric Texton and Color Histograms along with GIST are the best performers when it comes to predicting the perceived safety of a streetscape. Using Streetscore, we create high resolution maps of perceived safety for 21 cities in the Northeast and Midwest of the United States at a resolution of 200 images/square mile, scoring ~1 million images from Google Street View. These datasets should be useful for urban planners, economists and social scientists looking to explain the social and economic consequences of urban perception.
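A minimal sketch of the Streetscore setup: support vector regression from image descriptors to crowd-sourced perceived-safety scores. Synthetic arrays stand in for the GIST, geometric texton, and color histogram features named in the abstract.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 128))  # stand-in for GIST/texton/color descriptors
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=500)  # stand-in safety scores

# Standardize features, then fit an RBF-kernel support vector regressor.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
model.fit(X, y)
predicted_safety = model.predict(X[:5])  # scores for new streetscapes
```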
Article
Full-text available
A field is emerging that leverages the capacity to collect and analyze data at a scale that may reveal patterns of individual and group behaviors.
Article
Deep residual networks have emerged as a family of extremely deep architectures showing compelling accuracy and nice convergence behaviors. In this paper, we analyze the propagation formulations behind the residual building blocks, which suggest that the forward and backward signals can be directly propagated from one block to any other block, when using identity mappings as the skip connections and after-addition activation. A series of ablation experiments support the importance of these identity mappings. This motivates us to propose a new residual unit, which further makes training easy and improves generalization. We report improved results using a 1001-layer ResNet on CIFAR-10/100, and a 200-layer ResNet on ImageNet.
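The proposed unit is easy to state in code. A minimal sketch of the pre-activation residual unit: batch normalization and ReLU come before each convolution, the skip path is a pure identity, and there is no post-addition activation, so signals propagate unchanged between blocks.

```python
import torch.nn as nn

class PreActBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv1(self.relu(self.bn1(x)))
        out = self.conv2(self.relu(self.bn2(out)))
        return x + out  # identity shortcut, no post-addition ReLU
```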
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
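The dropout regularization highlighted above applies to the fully-connected layers. A minimal sketch of an AlexNet-style classifier head with dropout; layer sizes follow the paper, and the 256x6x6 input dimension assumes the standard pooled feature map.

```python
import torch.nn as nn

# Fully-connected head: dropout before each of the two hidden FC layers.
classifier = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),  # 1000-way output; softmax applied in the loss
)
```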
Conference Paper
Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks have started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we explore ways to scale up networks that aim at utilizing the added computation as efficiently as possible through suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set, demonstrating substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single-frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and fewer than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error and 17.3% top-1 error.
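A minimal sketch of the factorization idea behind these efficiency gains: a 5x5 convolution is replaced by two stacked 3x3 convolutions, and an nxn convolution by a 1xn followed by an nx1, preserving the receptive field with fewer parameters. Channel counts are illustrative.

```python
import torch.nn as nn

def conv5x5_factorized(cin, cout):
    # Two stacked 3x3s cover a 5x5 receptive field with 18*c^2 weights
    # instead of 25*c^2, i.e. ~28% fewer parameters.
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))

def conv7x7_factorized(cin, cout):
    # Asymmetric factorization: a 1x7 followed by a 7x1 convolution.
    return nn.Sequential(
        nn.Conv2d(cin, cout, (1, 7), padding=(0, 3)), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, (7, 1), padding=(3, 0)), nn.ReLU(inplace=True))
```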
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
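A minimal sketch of the original (post-activation) residual block described here: the block computes F(x) + x followed by a ReLU, with a 1x1 projection on the shortcut when the block changes resolution or width. Compare the pre-activation variant sketched after the identity-mappings abstract above.

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, cin, cout, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(cin, cout, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(cout)
        self.conv2 = nn.Conv2d(cout, cout, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(cout)
        self.relu = nn.ReLU(inplace=True)
        # Identity shortcut, or a 1x1 projection when shapes differ.
        self.shortcut = (nn.Identity() if stride == 1 and cin == cout else
                         nn.Sequential(
                             nn.Conv2d(cin, cout, 1, stride, bias=False),
                             nn.BatchNorm2d(cout)))

    def forward(self, x):
        out = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(out + self.shortcut(x))
```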
Article
Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks have started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we explore ways to scale up networks that aim at utilizing the added computation as efficiently as possible through suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set, demonstrating substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single-frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and fewer than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error and 17.3% top-1 error.
City scale next place prediction from sparse data through similar strangers
  • Fahad Alhasoun
  • May Alhazzani
  • Faisal Aleissa
  • Riyadh Alnasser
  • Marta González
Google street view image of a house predicts car accident risk of its resident
  • Kita
Urban computing: concepts, methodologies, and applications
  • Yu Zheng
  • Licia Capra
  • Ouri Wolfson
  • Hai Yang
Social-enabled Urban Data Analytics
  • Danqing Zhang
Philipp Rode and Ricky Burdett
  • Philipp Rode
  • Ricky Burdett