QUANTIFYING CURB APPEAL

Zachary Bessinger, Nathan Jacobs
Department of Computer Science, University of Kentucky
{zach, jacobs}@cs.uky.edu
ABSTRACT

The curb appeal of a home, which refers to how attractive it is when viewed from the street, is an important decision-making factor for many home buyers. Existing models for automatically estimating the price of a home ignore this factor, focusing exclusively on objective attributes such as the number of bedrooms, the square footage, and the age. We propose to use street-level imagery of a home, in addition to the objective attributes, to estimate the price of the home, thereby quantifying curb appeal. Our method uses deep convolutional neural networks to extract informative image features. We introduce a large dataset to support an extensive evaluation of several approaches. We find that using images and objective attributes together results in more accurate home price estimates than using either in isolation. We also find that representations learned for scene classification tasks are more discriminative for home price estimation than those learned for other tasks.

Index Terms— image understanding, computational aesthetics, neural networks, image feature representation
1. INTRODUCTION
There is significant interdisciplinary research on the aesthetics of urban spaces and their impact on human behavior. Perhaps the most well-known theory is Wilson and Kelling's broken windows theory [1], which suggests that if a home or building has broken windows, then the neighborhood likely suffers from high crime, vandalism, and disorder. Conversely, improving lighting and adding sidewalks can reduce crime, improve health, and increase happiness [2, 3]. We approach the issue of aesthetics from an economic perspective, focusing on its impact on the monetary value of a property.
The attractiveness of a home's exterior is colloquially referred to as its curb appeal, a significant factor in a home purchasing decision and, consequently, in the value of a home. A large industry, including contractors, television shows, magazines, and blogs, is devoted to ways of improving curb appeal. Examples of changes that can improve curb appeal include lawn care, repainting, adding a pergola, or replacing the
mailbox. Figure 1 shows the appearance of two homes before and after exterior renovations. These renovations have arguably increased the curb appeal of each home, which would likely increase its sale price. This work explores learning-based methods for quantifying the curb appeal of a home by estimating its price from street-level imagery.

Fig. 1: Appearance provides a strong cue for estimating the value of a home. The same houses before (top) and after (bottom) renovations to increase their "curb appeal." Updated architecture, paint, and lawn care are essential for high curb appeal. (Images courtesy of Houzz.)
Our approach differs significantly from previous work on automatic home price estimation. Existing approaches are based on a combination of objective attributes, such as the square footage, the number of bedrooms, and the price of homes with similar attributes in the area [4, 5]. This means that they are unable to account for recent changes in the curb appeal of a home, which a human appraiser would take into consideration. We propose to use the appearance of a home, captured by street-level imagery, as a factor in predicting its fair-cash value. In experiments on a diverse urban area, we find that this results in an average reduction in error of $2 401, or 6.14%.
The main contributions of this work are: 1) a large-scale evaluation of applying learned image features to the domain of understanding home aesthetics, 2) a joint model that incorporates images and metadata to improve home price estimation, and 3) an extensive evaluation demonstrating our joint model outperforms independent metadata and image models.
2. RELATED WORK
Our work introduces a novel method for using imagery to understand homes in an urban area and is a special case of image attribute estimation. The remainder of this section provides an overview of work in these two areas.
2.1. Estimating Image Attributes
Early work on estimating image attributes of scenes, which are visible, semantically meaningful properties, includes the works of Patterson et al. [6], Su et al. [7], and Parikh et al. [8]. These early works relied on bag-of-words (BoW) models and semantic segmentation to learn semantic attributes. More recently, Zhou et al. [9, 10] have leveraged advances in convolutional neural networks to create benchmark datasets useful for improving scene classification. Once a scene can be classified into a number of categories, that information can be used as prior knowledge for higher-level attribute estimation. Laffont et al. [11] estimate a set of contextual attributes related to weather and seasons from outdoor images and apply appearance transfer to synthesize weather effects on other outdoor images. Glasner et al. [12] learn a temperature attribute to turn a camera into a crude temperature sensor. Our work extends previous work on learning attributes for outdoor scenes by estimating the attribute of curb appeal.
2.2. Using Imagery to Understand Urban Areas
Recently there has been increased interest in applying computer vision techniques to learn attributes for understanding urban areas. The most heavily studied attribute in this area is the automatically assessed safety of a region [13, 14, 15]. Porzi et al. [16] leverage the power and speed of convolutional neural networks (CNNs) to rank safety in an end-to-end manner from Google Street View images. Other recent works have learned how certain aspects of cities, such as their architecture [17] and architectural evolution [18], can be used as cues for city recognition. Salesses et al. [19] and Quercia et al. [20] have successfully estimated the wealth and beauty of a city based on images obtained from Google Street View and social media. Our work takes a similar approach, but with a more specific focus: we quantify curb appeal and use it to improve our ability to estimate the price of a home.
3. DATASET
We constructed a dataset of publicly available residential housing data from the Fayette County Property Valuation Administrator (PVA) [21]. Each record in the dataset has 15 metadata attributes, including fair-cash value, number of bedrooms/bathrooms, and year built. We use the fair-cash value, which is determined by a human appraiser, as the price of the home. Each home also has at least one front-facing 640 × 480 image captured by an appraiser. Figure 2 shows example images from our dataset. These images are captured at varying angles and in varying weather conditions.

Fig. 2: Images from our dataset, showing the diversity of architectural styles, landscaping, and lighting conditions.

Fig. 3: The distribution of homes in our dataset with respect to four different attributes, (a)-(d).
A small number of houses are outliers in terms of price. We alleviate this by thresholding on price, keeping all homes valued at $400 000 or less. After filtering, our final dataset contains 83 140 records. Histograms of several objective attributes are visualized in Figure 3. The home prices have mean µ = $154 829 and standard deviation σ = $76 082.
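The outlier filter and summary statistics above can be sketched as follows. This is a minimal illustration over toy records; the field names are assumptions for illustration, not the actual PVA schema:

```python
from statistics import mean, pstdev

# Toy records standing in for PVA entries; field names are illustrative.
records = [
    {"price": 120_000, "bedrooms": 3},
    {"price": 250_000, "bedrooms": 4},
    {"price": 95_000, "bedrooms": 2},
    {"price": 1_500_000, "bedrooms": 6},  # price outlier, dropped below
]

# Keep only homes valued at $400,000 or less, as in Section 3.
PRICE_CAP = 400_000
filtered = [r for r in records if r["price"] <= PRICE_CAP]

prices = [r["price"] for r in filtered]
print(len(filtered))          # records surviving the filter
print(round(mean(prices)))    # mean price of the filtered set
print(round(pstdev(prices)))  # population standard deviation
```

On the full dataset, the same thresholding yields the 83 140 records and the price statistics reported above.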
4. METHODS
We propose a novel method that accounts for curb appeal by jointly using metadata and image features to predict the value of a home. We compare against two baseline methods that directly predict the price, one using only metadata features and one using only image features. We generate price prediction models using three regression techniques: linear regression, ridge regression, and random forest regression. The notation P(x) refers to a price prediction model given some input data x. The following subsections describe the features used to construct each price prediction model. Evaluation results are presented in Section 5.
4.1. Home Prices using Metadata
In our metadata-only model, P(M), we ignore the available image data and directly predict the price of a home from its objective attributes. We search all possible subsets of attributes and fit linear regression models to predict price. We select the optimal model, with dimensionality eight, that maximizes the R² score, because including additional attributes does not significantly reduce the error. This model uses the following attributes: the number of acres, bedrooms, bathrooms, year built, residential square footage, total fixtures, basement size, and garage size. These eight objective attributes are used as our metadata feature representation in all experiments. Each attribute is scaled to have mean µ = 0 and variance σ² = 1, and a metadata-only model is learned for each of the three regressors.
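The attribute-selection idea can be sketched as follows. As a simplification, this sketch scores single-attribute linear fits by R² rather than searching all multivariate subsets as described above; the attribute names and values are toy assumptions:

```python
# Score each candidate attribute by the R^2 of a one-variable linear fit
# to price, then keep the best-scoring attribute.

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b in one dimension."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def r_squared(xs, ys):
    """Coefficient of determination of the best linear fit."""
    a, b = fit_line(xs, ys)
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Toy data: square footage tracks price closely; year built does not.
prices = [100, 150, 200, 250]
attributes = {
    "sqft": [1000, 1500, 2000, 2500],
    "year": [1990, 1960, 2005, 1975],
}
scores = {name: r_squared(vals, prices) for name, vals in attributes.items()}
best = max(scores, key=scores.get)
print(best)  # attribute with the highest R^2
```

The full search additionally fits every subset of attributes jointly, which is what produces the eight-attribute model used in the experiments.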
4.2. Home Prices using Images
Our image-only model, P(I), ignores the metadata attributes and predicts the price of the home using only an exterior photograph. We investigate a variety of feature representations extracted from CNNs, which have recently been shown [15, 22] to perform better than hand-crafted features. Since our images are captured outdoors, lighting factors such as the sun, shadows, and seasonal changes strongly affect image appearance. It is therefore imperative that our visual features be invariant to lighting and weather conditions. Motivated by this, we perform a large-scale evaluation of existing CNN architectures to find a feature representation optimal for predicting price. Using each of the three regressors, we learn models using a variety of image features and select the features that minimize the root mean squared error. The evaluation for selecting the best image features is described in Section 5.2.
4.3. Home Prices using Metadata and Images
Our proposed model, P(M) + C(I), combines metadata attributes and image features to predict the price of the home. This model estimates the home price using two components: the predicted home price and the curb appeal modifier. The metadata-only model, P(M), defined in Section 4.1, is used to directly predict the home price. The image features used in the image-only model of Section 4.2 to directly predict the price now instead predict the curb appeal modifier, C(I), which adjusts the predicted home price, positively or negatively, based on its curb appeal.

Table 1: RMSE (in U.S. Dollars) for each price prediction model. The best result for each column is bolded. Lower is better.

              Linear    Ridge     Random Forest
P(M)          $37 058   $37 127   $29 365
P(I)          $53 575   $53 565   $53 727
P(M) + C(I)   $34 538   $34 606   $28 281
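One plausible reading of this two-component model is residual fitting: train P(M) on the metadata, then train C(I) on the image features to predict the error that remains, so the final estimate is P(M) + C(I). The following is a minimal sketch under that assumption, with toy one-dimensional stand-ins for the metadata and image features (the actual models are the regressors described above):

```python
def fit_line(xs, ys):
    """One-dimensional ordinary least squares, y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def rmse(pred, true):
    """Root mean squared error between predictions and targets."""
    return (sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)) ** 0.5

# Toy training data: one metadata feature (think square footage) and one
# image-derived score; prices combine both signals.
meta = [10.0, 20.0, 30.0, 40.0]
img = [1.0, -1.0, 1.0, -1.0]
price = [105.0, 195.0, 305.0, 395.0]  # roughly 10*meta, nudged by appearance

# Step 1: metadata-only price model P(M).
a, b = fit_line(meta, price)
p_meta = [a * m + b for m in meta]

# Step 2: curb appeal modifier C(I), fit to the residual error of P(M).
resid = [y - p for y, p in zip(price, p_meta)]
c, d = fit_line(img, resid)

# Step 3: joint estimate P(M) + C(I).
joint = [pm + (c * i + d) for pm, i in zip(p_meta, img)]

print(rmse(p_meta, price), rmse(joint, price))  # joint error is lower
```

In this toy setting the image signal explains exactly the part of the price that the metadata cannot, mirroring how the curb appeal modifier adjusts the metadata-based estimate up or down.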
5. EVALUATION
We present a quantitative and qualitative evaluation of the proposed methods using the dataset described in Section 3. We select 80% of the homes uniformly at random as a training set, reserving the rest for testing. Qualitative results are presented for the top-performing model, showing homes whose image had a significant impact, both positive and negative, on the predicted price.
5.1. Implementation Details
Our proposed method is implemented in Python using scikit-learn [23] for the regressors and Caffe [24] for image feature extraction. The optimal hyperparameters for each regressor are learned using hyperopt [25]. Optimization is done by applying 5-fold cross-validation on the training set and minimizing the average root mean squared error. Our source code will be made available pending publication.
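The cross-validation loop can be sketched as follows. This is a pure-Python stand-in: the actual implementation uses hyperopt [25] over scikit-learn regressors, while the sketch grid-searches a hypothetical ridge-style shrinkage parameter on toy data:

```python
def ridge_fit(xs, ys, alpha):
    """1-D fit with the slope shrunk by alpha (illustrative, not sklearn)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / (sxx + alpha)
    return a, my - a * mx

def rmse(pred, true):
    return (sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)) ** 0.5

def cv_rmse(xs, ys, alpha, k=5):
    """Average RMSE over k contiguous folds (no shuffling, for determinism)."""
    n = len(xs)
    scores = []
    for i in range(k):
        lo, hi = i * n // k, (i + 1) * n // k
        tr_x, tr_y = xs[:lo] + xs[hi:], ys[:lo] + ys[hi:]
        a, b = ridge_fit(tr_x, tr_y, alpha)
        scores.append(rmse([a * x + b for x in xs[lo:hi]], ys[lo:hi]))
    return sum(scores) / k

xs = list(range(10))
ys = [2 * x + 1 for x in xs]          # noiseless linear toy data
grid = [0.0, 1.0, 10.0]
best_alpha = min(grid, key=lambda a: cv_rmse(xs, ys, a))
print(best_alpha)
```

Hyperopt replaces the grid with a guided search over the same kind of cross-validated objective, which matters once the hyperparameter space (e.g., random forest estimators and split sizes) becomes large.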
5.2. Image Feature Selection
To find optimal image features for predicting price, we learn several regression models trained on features extracted from ten convolutional networks. The networks are trained on three sets of training data, ImageNet [26], Places [9], and Places2 [10], using four different architectures: AlexNet [27], GoogLeNet [28], and VGG-16/VGG-19 [29]. We find that features extracted from similar network architectures trained on Places2 and ImageNet differ significantly. Comparing two random forests with similar hyperparameters that predict price from VGG-16 FC6 features, one trained on Places2 features and the other on ImageNet features, the Places2-trained features reduce the root mean squared error (RMSE) by $6 140 relative to the ImageNet features. Similarly, using Places2 FC6 and FC8 features instead of Places FC6 and FC8 features reduces the RMSE by $1 848 and $1 636, respectively. Since FC8 has more than a factor of 10 fewer dimensions than FC6 and is the layer that corresponds to semantic labels, we prefer the FC8 layer. To this end, we select the FC8 layer of the top-performing network architecture, Places2 VGG-16, from which to extract image features. These features are used as our image feature representation for all further evaluations.

Fig. 4: Homes where the curb appeal modifier in our joint model has a strong negative (a) and strong positive (b) impact on estimated price.
5.3. Quantitative Results
We report the RMSE for each price estimation model on the testing set in Table 1. We find that the random forest models achieve the highest performance among the tested regressors. Our joint model, using random forests to predict the price component from metadata and the curb appeal modifier from image features, improves over the baseline metadata-only and image-only models, achieving an RMSE of $28 281. This is a relative improvement of 4.0% over our metadata-only model and of 90.0% over our image-only model using the same regressor. For all regression methods, we observe that incorporating curb appeal into the price prediction yields more accurate results.
Finding the optimal hyperparameters for ridge regression involves optimizing over α. In the metadata-only model αM = 44.259 and in the image-only model αI = 16.791. The metadata component of the joint model uses αM, and the curb appeal modifier uses αC = 83.918. For random forest regression we optimize over the number of estimators and the minimum number of features to create a split. P(M) uses 794 estimators and considers 4 features for each split. P(I) uses 1188 estimators and considers 167 features for each split. In the joint model, C(I) uses 522 estimators and 157 features for each split.
5.4. Influence of Curb Appeal on Price
To better understand how the curb appeal modifier in our joint model influences estimated price, we explore the relationship between the semantic image labels corresponding to the Places2 VGG-16 FC8 layer and predicted home price. Figure 4 shows several homes whose prices are strongly affected by their curb appeal. The homes in Figure 4a were significantly devalued by incorporating curb appeal. The opposite effect occurs in Figure 4b, where incorporating curb appeal significantly increased the estimated price. Elements of high curb appeal, such as green grass and a well-kept appearance, are clearly visible in Figure 4b and strongly influence the home's value.

Fig. 5: Mean predicted curb appeal modifier for homes grouped by semantic label from the Places2 FC8 layer.
To further understand these semantics, we aggregate the semantic labels from the test set and keep only labels predicted more than 100 times. For each label, we calculate the mean predicted curb appeal modifier and show it in Figure 5. Scenes predicted as "yard" and "lawn" add $12 105 and $5 823 to the price of the home, on average. Similarly, scenes predicted as "industrial park" and "manufactured home" subtract $7 152 and $3 162 from the price of the home, on average. This further supports the hypothesis that a well-kept home with natural beauty is more valuable.
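The aggregation behind Figure 5 can be sketched as a simple group-by over (label, modifier) pairs. The labels and dollar values below are toy stand-ins, and the occurrence threshold is lowered from the paper's 100 for illustration:

```python
from collections import defaultdict

# (predicted scene label, curb appeal modifier in dollars) per test home.
predictions = [
    ("yard", 12000), ("yard", 13000), ("yard", 11000),
    ("lawn", 6000), ("lawn", 5500), ("lawn", 6500),
    ("industrial park", -7000), ("industrial park", -7300),
    ("driveway", 500),  # too rare, filtered out below
]

MIN_COUNT = 2  # stand-in for the paper's threshold of 100 predictions

# Group modifiers by predicted label.
groups = defaultdict(list)
for label, modifier in predictions:
    groups[label].append(modifier)

# Mean modifier per label, keeping only sufficiently frequent labels.
mean_modifier = {
    label: sum(vals) / len(vals)
    for label, vals in groups.items()
    if len(vals) >= MIN_COUNT
}
print(sorted(mean_modifier.items(), key=lambda kv: -kv[1]))
```

Sorting the resulting means, as done here, is what produces the ranking of labels from most price-increasing to most price-decreasing.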
6. CONCLUSIONS
Overall, we have curated a dataset of home prices along with street-level imagery and provided an extensive evaluation of how curb appeal affects property prices. We find that exterior appearance correlates with the price of a home and can be used to improve existing metadata-based models. Practically, these image features can be incorporated at little cost, since CNN inference takes less than a second.

The results we have presented are very promising with regard to urban understanding. Our work can be applied to automated exterior appraisal, architectural anomaly detection, and demographic prediction. It opens many avenues of future research toward not only improving price estimation, but also holistically understanding urban regions.
7. REFERENCES
[1] James Q Wilson and George L Kelling, "Broken windows," Atlantic Monthly, vol. 249, no. 3, 1982.
[2] Jelle Van Cauwenberg, Ilse De Bourdeaudhuij, Peter Clarys,
Jack Nasar, Jo Salmon, Liesbet Goubert, and Benedicte De-
forche, “Street characteristics preferred for transportation
walking among older adults: a choice-based conjoint analysis
with manipulated photographs,” International Journal of Be-
havioral Nutrition and Physical Activity, vol. 13, no. 1, 2016.
[3] Alva O. Ferdinand, Bisakha Sen, Saurabh Rahurkar, Sally En-
gler, and Nir Menachemi, “The relationship between built en-
vironments and physical activity: a systematic review, Amer-
ican journal of public health, vol. 102, no. 10, 2012.
[4] Stephen Malpezzi et al., “Hedonic pricing models: a selective
and applied review,Section in Housing Economics and Public
Policy: Essays in Honor of Duncan Maclennan, 2003.
[5] Gabriel Ahlfeldt, “If alonso was right: modeling accessibil-
ity and explaining the residential land gradient, Journal of
Regional Science, vol. 51, no. 2, 2011.
[6] Genevieve Patterson and James Hays, “Sun attribute database:
Discovering, annotating, and recognizing scene attributes, in
CVPR. IEEE, 2012.
[7] Yu Su and Frédéric Jurie, "Improving image classification using semantic attributes," IJCV, vol. 100, no. 1, 2012.
[8] Devi Parikh and Kristen Grauman, “Relative attributes, in
ICCV. IEEE, 2011.
[9] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Tor-
ralba, and Aude Oliva, “Learning deep features for scene
recognition using places database,” in NIPS, 2014.
[10] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Antonio Tor-
ralba, and Aude Oliva, “Places2: A large-scale database for
scene understanding,” Arxiv preprint: [pending], 2015.
[11] Pierre-Yves Laffont, Zhile Ren, Xiaofeng Tao, Chao Qian, and
James Hays, “Transient attributes for high-level understanding
and editing of outdoor scenes,” SIGGRAPH, vol. 33, no. 4,
2014.
[12] Daniel Glasner, Pascal Fua, Todd Zickler, and Lihi Zelnik-
Manor, “Hot or not: Exploring correlations between appear-
ance and temperature,” in ICCV. IEEE, 2015.
[13] Sean M Arietta, Alexei Efros, Ravi Ramamoorthi, Maneesh
Agrawala, et al., “City forensics: Using visual elements to
predict non-visual city attributes, Visualization and Computer
Graphics, IEEE Transactions on, vol. 20, no. 12, 2014.
[14] Nikhil Naik, Jade Philipoom, Ramesh Raskar, and Cesar Hi-
dalgo, “Streetscore–predicting the perceived safety of one mil-
lion streetscapes,” in IEEE Conference on Computer Vision
and Pattern Recognition Workshops. IEEE, 2014.
[15] Vicente Ordonez and Tamara L Berg, “Learning high-level
judgments of urban perception,” in ECCV. Springer, 2014.
[16] Lorenzo Porzi, Samuel Rota Bulò, Bruno Lepri, and Elisa Ricci, "Predicting and understanding urban perception with convolutional neural networks," in ACM MM. ACM, 2015.
[17] Carl Doersch, Saurabh Singh, Abhinav Gupta, Josef Sivic, and
Alexei A. Efros, “What makes paris look like paris?, SIG-
GRAPH, vol. 31, no. 4, 2012.
[18] Stefan Lee, Nicolas Maisonneuve, David J. Crandall, Alexei A.
Efros, and Josef Sivic, “Linking past to present: Discovering
style in two centuries of architecture,” in IEEE International
Conference on Computational Photography, ICCP, 2015.
[19] Philip Salesses, Katja Schechtner, and César A Hidalgo, "The collaborative image of the city: mapping the inequality of urban perception," PloS one, vol. 8, no. 7, 2013.
[20] Daniele Quercia, Neil Keith O’Hare, and Henriette Cramer,
“Aesthetic capital: what makes london look beautiful, quiet,
and happy?,” in ACM Conference on Computer Supported Co-
operative Work & Social Computing. ACM, 2014.
[21] Fayette County Property Valuation Administrator, http://www.fayette-pva.com/.
[22] Ali S Razavian, Hossein Azizpour, Josephine Sullivan, and
Stefan Carlsson, “Cnn features off-the-shelf: an astounding
baseline for recognition,” in IEEE Conference on Computer
Vision and Pattern Recognition Workshops. IEEE, 2014.
[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,
B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss,
V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,
M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Ma-
chine learning in Python,” Journal of Machine Learning Re-
search, vol. 12, 2011.
[24] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev,
Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor
Darrell, “Caffe: Convolutional architecture for fast feature em-
bedding,” arXiv preprint arXiv:1408.5093, 2014.
[25] James Bergstra, Brent Komer, Chris Eliasmith, Dan Yamins,
and David D Cox, “Hyperopt: a python library for model se-
lection and hyperparameter optimization,” Computational Sci-
ence and Discovery, vol. 8, no. 1, 2015.
[26] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, San-
jeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy,
Aditya Khosla, Michael Bernstein, Alexander C. Berg, and
Li Fei-Fei, “ImageNet Large Scale Visual Recognition Chal-
lenge,” IJCV, vol. 115, no. 3, 2015.
[27] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Im-
agenet classification with deep convolutional neural networks,
in NIPS, 2012.
[28] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet,
Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Van-
houcke, and Andrew Rabinovich, “Going deeper with convo-
lutions,” in CVPR, 2015.
[29] Karen Simonyan and Andrew Zisserman, “Very deep convo-
lutional networks for large-scale image recognition, arXiv
preprint arXiv:1409.1556, 2014.
... Some methods only check the exterior image of the house. These methods use the extracted features of AlexNet, GoogLeNet or VGG16/19 ( Bessinger & Jacobs, 2016 ). These attributes may have a positive or negative impact on the house prices. ...
... For example, an internal courtyard is a positive features and an industrial scene is a negative environment. In Bessinger & Jacobs (2016) comparison with text data-only metods showed 4% improvement. Some works could identify certain features in the picture, such as window bars, hedges, gable roofs and tropical plants, to predict the characteristics of the city and their relation with the price of the house ( Arietta & Sean, 2014 ). ...
... Zestimate and Redfin Estimate have a national median error percentage of 7.7% ( Redfin, 2022 ) and 5.92% ( Zestimate, 2022 ) respectively. On the other hand, there are also vision-based research works like ( Bessinger & Jacobs, 2016 ) that studies only exterior images of the apartments which extracts features using AlexNet, GoogLeNet, and VGG16/19. Another paper uses four pictures from each house passed through a neural network to estimate the price of the home ( Poursaeed et al., 2017 ). ...
Article
Full-text available
Real estate price estimation has been an interesting subject in the literature from the appearance of online real estate services like Zillow and Redfin. These websites and many other works in the literature have proposed their methods for evaluation and pricing of the real estate. However, these methods fail to consider important information about the appearance and the neighborhood of the house which leads to occasional incorrect estimations. The novel proposed method in this paper tries to estimate housing price by considering attributes of the home as well as interior, exterior, and satellite visual features of the house. Deep convolutional neural networks on a large dataset of images of interior, exterior and satellite images of houses are trained to extract visual features of the houses. These features along with house attributes are fed to another system to automatically estimate the value of the house. Finally, the performance of the system is compared to Zestimate and some vision-based methods in the literature on a new dataset.
... Over the last years, several researchers have extended the classical real estate appraisal models by using Convolutional Neural Networks (CNNs), instead of linear regressions, to incorporate image data into the learning process (Law, Paige, and Russell 2019). Examples include interior and exterior perspectives of houses, as well as street-side and satellite imagery (Law, Paige, and Russell 2019;Poursaeed, Matera, and Belongie 2018;Bency et al. 2017;Bessinger and Jacobs 2016;Liu et al. 2018;Kucklick and Müller 2020). Published empirical results strongly suggest that adding such image data to real estate appraisal models improves their predictive performance. ...
... Several types of real estate images can be used in such an approach, ranging from interior images to satellite images. Interior images, for example, potentially contain information about a home's luxury level and aesthetics (Poursaeed, Matera, and Belongie 2018;Naumzik and Feuerriegel 2020), whereas exterior images can capture the style and look of the property (Bessinger and Jacobs 2016). In contrast, street-side and satellite images may capture information about the neighborhood and spatial relations, setting the focus apart from the property to a local and global sphere (Law, Paige, and Russell 2019;Bency et al. 2017;Kucklick and Müller 2020). ...
... In most cases where the authors have used multi-view concatenation, a non-linear machine learning algorithm (e.g., Random Forest, Support Vector Machine) was used to combine structured features and image features (Bency et al. 2017;Poursaeed, Matera, and Belongie 2018;Bessinger and Jacobs 2016). However, to extract the image features, authors have followed different approaches. ...
Preprint
Full-text available
In the house credit process, banks and lenders rely on a fast and accurate estimation of a real estate price to determine the maximum loan value. Real estate appraisal is often based on relational data, capturing the hard facts of the property. Yet, models benefit strongly from including image data, capturing additional soft factors. The combination of the different data types requires a multi-view learning method. Therefore, the question arises which strengths and weaknesses different multi-view learning strategies have. In our study, we test multi-kernel learning, multi-view concatenation and multi-view neural networks on real estate data and satellite images from Asheville, NC. Our results suggest that multi-view learning increases the predictive performance up to 13% in MAE. Multi-view neural networks perform best, however result in intransparent black-box models. For users seeking interpretability, hybrid multi-view neural networks or a boosting strategy are a suitable alternative.
... Recent studies [3,4,30,55] utilize the image features, and house attributes to estimate the house price based on CNN. Bessigner et al. [4] integrated the front-view images of houses and house characteristics from the appraiser to evaluate the house price by using VGG16 [37] and random forest. ...
... Recent studies [3,4,30,55] utilize the image features, and house attributes to estimate the house price based on CNN. Bessigner et al. [4] integrated the front-view images of houses and house characteristics from the appraiser to evaluate the house price by using VGG16 [37] and random forest. You et al. [55] proposes a hybrid model with GoogleNet [44] and long short-term memory (LSTM) to predict house prices with consideration of peer-dependence. ...
... The images of You's work are also front-view images captured by an appraiser. Moreover, Bency et al. [4] estimates the house price with satellite images and house attributes by using InceptionV3 [45]. To better fuse the images and house attributes, Liu et al. [30] proposes a multi-instance model to fuse the house photos and house attributes via feed-forward neural network (FNN) trained by ranking loss, then estimate the house prices based on the fused features. ...
Article
Full-text available
The geographical presentation of a house, which refers to the sightseeing and topography near the house, is a critical factor to a house buyer. The street map is a type of common data in our daily life, which contains natural geographical presentation. This paper sources real estate data and corresponding street maps of houses in the city of Los Angeles. In the case study, we proposed an innovative method, attention-based multi-modal fusion, to incorporate the geographical presentation from street maps into the real estate appraisal model with a deep neural network. We firstly combine the house attribute features and street map imagery features by applying the attention-based neural network. After that, we apply boosted regression trees to estimate the house price from the fused features. This work explored the potential of attention mechanism and data fusion in the applications of real estate appraisal. The experimental results indicate the competitiveness of proposed method among state-of-the-art methods.
... The proposed framework confirms the effectiveness of the introduction of multi-source data. Researchers have explored multimodal learning in conjunction with ensemble learning methods such as random forest and deep learning methods such as VGG16 [23] and InceptionV3 [24]. Papouskova et al. [25] reduced the overfitting problem in a credit risk problem by using a heterogeneous ensemble learning model. ...
... In [5], a multi-instance deep ranking and regression (MiDRR) neural network was introduced, which uses a one-layer feed-forward neural network (FNN) to incorporate the image and house information features. In [23], the authors used VGG16 to obtain image features from house images and achieved an RMSE of 28.281. In [22], the authors used convolutional neural networks to obtain image features from satellite images and achieved an 2 of 93.47. ...
Article
Full-text available
With the development of online real estate trading platforms, multi-modal housing trading data, including structural information, location, and interior image data, are being accumulated. The accurate appraisal of real estate makes sense for government officials, urban policymakers, real estate sellers, and personal purchasers. In this study, we propose an interpretable multi-modal stacking-based ensemble learning (IMSEL) method that deals with various modalities for real estate appraisals. We crawl the structural and image data of real estate in Chengdu city, China from the nations largest real estate transaction platform with the location information, including public services, within 2 km of the real estate using Baidu map. We then compare the predictive results from IMSEL with those from previous state-of-art methods in the literature in terms of the root mean square error, mean absolute percentage error, mean absolute error, and coefficient of determination (R2). The comparison results show that IMSEL outperformed the other methods. We verified the improvement of introducing a data transformation strategy and deep visual features through a 10-fold cross-validation. We also discuss the managerial implications of our research findings.
... The top eight tactile properties are selected based on the metrics shown in Table 3. The R 2 metric has been used to access the performance of vision-based regression tasks in several works [19,53,4,37,25,5,13]. The R 2 metric compares the estimation from the modelt against using the mean tactile value t from the training set as an estimation, and is given by: ...
Preprint
The connection between visual input and tactile sensing is critical for object manipulation tasks such as grasping and pushing. In this work, we introduce the challenging task of estimating a set of tactile physical properties from visual information. We aim to build a model that learns the complex mapping between visual information and tactile physical properties. We construct a first of its kind image-tactile dataset with over 400 multiview image sequences and the corresponding tactile properties. A total of fifteen tactile physical properties across categories including friction, compliance, adhesion, texture, and thermal conductance are measured and then estimated by our models. We develop a cross-modal framework comprised of an adversarial objective and a novel visuo-tactile joint classification loss. Additionally, we develop a neural architecture search framework capable of selecting optimal combinations of viewing angles for estimating a given physical property.
Article
Deep learning models fuel many modern decision support systems, because they typically provide high predictive performance. Among other domains, deep learning is used in real-estate appraisal, where it allows the analysis to be extended from hard facts only (e.g., size, age) to also consider more implicit information about the location or appearance of houses in the form of image data. However, one downside of deep learning models is their opaque decision-making mechanism, which leads to a trade-off between accuracy and interpretability. This limits their applicability for tasks where a justification of the decision is necessary. Therefore, in this paper, we first combine different perspectives on interpretability into a multi-dimensional framework for a socio-technical perspective on explainable artificial intelligence. Second, we measure the performance gains of using multi-view deep learning which leverages additional image data (satellite images) for real estate appraisal. Third, we propose and test a novel post-hoc explainability method called Grad-Ram. This modified version of Grad-Cam mitigates the intransparency of convolutional neural networks (CNNs) for predicting continuous outcome variables. With this, we try to reduce the accuracy-interpretability trade-off of multi-view deep learning models. Our proposed network architecture outperforms traditional hedonic regression models by 34% in terms of MAE. Furthermore, we find that the used satellite images are the second most important predictor after square feet in our model and that the network learns interpretable patterns about the neighborhood structure and density.
Chapter
The connection between visual input and tactile sensing is critical for object manipulation tasks such as grasping and pushing. In this work, we introduce the challenging task of estimating a set of tactile physical properties from visual information. We aim to build a model that learns the complex mapping between visual information and tactile physical properties. We construct a first of its kind image-tactile dataset with over 400 multiview image sequences and the corresponding tactile properties. A total of fifteen tactile physical properties across categories including friction, compliance, adhesion, texture, and thermal conductance are measured and then estimated by our models. We develop a cross-modal framework comprised of an adversarial objective and a novel visuo-tactile joint classification loss. Additionally, we introduce a neural architecture search framework capable of selecting optimal combinations of viewing angles for estimating a given physical property.
Article
Property value assessment in the real estate market remains a challenge due to incomplete and insufficient information, as well as the lack of efficient algorithms. House attributes, such as size and number of bedrooms, are currently being employed to perform the estimation by professional appraisers and researchers. Numerous algorithms have been proposed; however, a better assessment performance is still expected by the market. Nowadays, there are more available relevant data from various sources in urban areas, which have a potential impact on the house value. In this paper, we propose to fuse urban data, i.e., metadata and imagery data, with house attributes to unveil the market value of the property in Philadelphia. Specifically, two deep neural networks, i.e., a metadata fusion network and an image appraiser, are proposed to extract the representations, i.e., expected levels, from metadata and street-view images, respectively. A boosted regression tree (BRT) is adapted to estimate the market values of houses with the fused metadata and expected levels. The experimental results with the data collected from the city of Philadelphia demonstrate the effectiveness of the proposed model. The research presented in this paper also provides the real estate industry with a new reference for property value assessment using the data fusion methodology.
Conference Paper
Full-text available
Cities' visual appearance plays a central role in shaping human perception and response to the surrounding urban environment. For example, the visual qualities of urban spaces affect the psychological states of their inhabitants and can induce negative social outcomes. Hence, it becomes critically important to understand people's perceptions and evaluations of urban spaces. Previous works have demonstrated that algorithms can be used to predict high-level attributes of urban scenes (e.g. safety, attractiveness, uniqueness), accurately emulating human perception. In this paper we propose a novel approach for predicting the perceived safety of a scene from Google Street View images. In contrast to previous works, we formulate the problem of learning to predict high-level judgments as a ranking task and we employ a Convolutional Neural Network (CNN), significantly improving the accuracy of predictions over previous methods. Interestingly, the proposed CNN architecture relies on a novel pooling layer, which permits automatic discovery of the most important areas of the images for predicting the concept of perceived safety. An extensive experimental evaluation, conducted on the publicly available Place Pulse dataset, demonstrates the advantages of the proposed approach over state-of-the-art methods.
Article
Full-text available
Knowledge about the relationships between micro-scale environmental factors and older adults’ walking for transport is limited and inconsistent. This is probably due to methodological limitations, such as absence of an accurate neighborhood definition, lack of environmental heterogeneity, environmental co-variation, and recall bias. Furthermore, most previous studies are observational in nature. We aimed to address these limitations by investigating the effects of manipulating photographs on micro-scale environmental factors on the appeal of a street for older adults’ transportation walking. Secondly, we used latent class analysis to examine whether subgroups could be identified that have different environmental preferences for transportation walking. Thirdly, we investigated whether these subgroups differed in socio-demographic, functional and psychosocial characteristics, current level of walking and environmental perceptions of their own street. Data were collected among 1131 Flemish older adults through an online (n = 940) or an interview version of the questionnaire (n = 191). This questionnaire included a choice-based conjoint exercise with manipulated photographs of a street. These manipulated photographs originated from one panoramic photograph of an existing street that was manipulated on nine environmental attributes. Participants chose which of two presented streets they would prefer to walk for transport. In the total sample, sidewalk evenness had by far the greatest appeal for transportation walking. The other environmental attributes were less important. Four subgroups that differed in their environmental preferences for transportation walking were identified. In the two largest subgroups (representing 86 % of the sample) sidewalk evenness was the most important environmental attribute. In the two smaller subgroups (each comprising 7 % of the sample), traffic volume and speed limit were the most important environmental attributes for one, and the presence of vegetation and a bench were the most important environmental attributes for the other. This latter subgroup included a higher percentage of service flat residents than the other subgroups. Our results suggest that the provision of even sidewalks should be considered a priority when developing environmental interventions aiming to stimulate older adults’ transportation walking. Natural experiments are needed to confirm whether our findings can be translated to real environments and actual transportation walking behavior.
Technical Report
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Article
The Bag-of-Words (BoW) model—commonly used for image classification—has two strong limitations: on one hand, visual words lack explicit meanings; on the other hand, they are usually polysemous. This paper proposes to address these two limitations by introducing an intermediate representation based on the use of semantic attributes. Specifically, two different approaches are proposed. Both approaches consist in predicting a set of semantic attributes for the entire images as well as for local image regions, and in using these predictions to build the intermediate-level features. Experiments on four challenging image databases (PASCAL VOC 2007, Scene-15, MSRCv2 and SUN-397) show that both approaches improve performance of the BoW model significantly. Moreover, their combination achieves state-of-the-art results on several of these image databases.
Conference Paper
Human observers make a variety of perceptual inferences about pictures of places based on prior knowledge and experience. In this paper we apply computational vision techniques to the task of predicting the perceptual characteristics of places by leveraging recent work on visual features along with a geo-tagged dataset of images associated with crowd-sourced urban perception judgments for wealth, uniqueness, and safety. We perform extensive evaluations of our models, training and testing on images of the same city as well as training and testing on images of different cities to demonstrate generalizability. In addition, we collect a new densely sampled dataset of street-view images for 4 cities and explore joint models to collectively predict perceptual judgments at city scale. Finally, we show that our predictions correlate well with ground truth statistics of wealth and crime.
Chapter
Introduction; What is a Hedonic Price Index?; Repeat Sales Models; The Roots of Hedonic Price Models; Conceptual Issues in Hedonic Modelling; Specification Issues; Hedonic Modelling: the Current Position; Examples of Applications; Concluding Thoughts; Notes