
Google Street View image of a house predicts car accident risk of its resident

Preprints and early-stage research may not have been peer reviewed yet.

Abstract

Road traffic injuries are a leading cause of death worldwide. Proper estimation of car accident risk is critical for appropriate allocation of resources in healthcare, insurance, civil engineering, and other industries. We show how images of houses are predictive of car accidents. We analyze 20,000 addresses of insurance company clients, collect a corresponding house image using Google Street View, and annotate house features such as age, type, and condition. We find that this information substantially improves car accident risk prediction compared to the state-of-the-art risk model of the insurance company and could be used for price discrimination. From this perspective, public availability of house images raises legal and social concerns, as they can be a proxy of ethnicity, religion and other sensitive data.
Kinga Kita-Wojciechowska, Faculty of Economic Sciences, University of Warsaw, Warsaw, Poland
Łukasz Kidziński, Department of Bioengineering, Stanford University, Stanford, CA, USA
Modern machine learning techniques for computer vision, such as deep learning, have provided
unprecedented opportunities for academic research and industrial applications. Examples include using
satellite images for deforestation monitoring in South America (1) or poverty estimation in Africa (2),
prediction of skin cancer from skin lesion images (3), or automatic detection of pulmonary tuberculosis
from a chest radiograph (4).
One of the resources recently leveraged for research is Google Street View, a platform from
Google where images of buildings are taken using cars equipped with a set of cameras (5). This data source
has recently been explored by researchers to answer questions in social science, e.g. to estimate the demographic makeup
of neighborhoods across the US (6), city-level travel patterns in Great Britain (7), or crime rates
in Brazil (8).
Our work explores whether Google Street View images of houses are predictive of the car accident
risk of their residents. Proper risk estimation is key for the insurance and healthcare industries. Insurance
is a promise to compensate the potential damage or loss in the future for a relatively low price paid now.
Unlike commodity products, the ultimate cost of an insurance policy is not known at the time of the sale.
It is therefore a challenge to set a proper price, and insurers try to leverage statistical methods to
predict the future risk of each client upfront. For this purpose, insurers collect historical data about
underwritten policies and incurred claims, and they build statistical models to identify systematic and
time-invariant client characteristics that correlate with the number of claims. For example, the classical motor
insurance risk factors identified worldwide are the age of the driver, the characteristics of the car, the
occurrence of car accidents in the past, and geography (9). This is why insurers tend to ask for these
and other details before providing a motor insurance offer.
Although insurers often collect address information from the client, they typically use only the
zip-code for risk modeling and pricing purposes. Claims data aggregated to zip-codes are still too volatile and
require spatial smoothing (10) and further aggregation to larger geographical zones (11). This
commonly used methodology is based on the assumption that neighbors drive in a similar manner.
In this paper, we challenge this assumption and show that volatility can be explained at the granularity of
individual addresses. Moreover, we show that this information can be extracted from publicly available
Google Street View images.
Figure 1. Examples of extremely different houses located in the same zip-code, whose residents have the same expected
claim frequency according to the insurer’s current model.
Study of this insurance problem enabled the following sociological and methodological discoveries:
1) features of the house correlate with the car accident risk of its resident, 2) compared to other uses of
Google Street View for research, our variables are sourced from the address rather than aggregated by
zip-code or district and they allow for new sociological discoveries at a very granular level, 3) variables
extracted from the address (the image of a house) can be used in insurance and other industries, notably
for price discrimination, 4) modern data collection and computational techniques, which allow for
unprecedented exploitation of personal data, can outpace development of legislation and raise privacy
threats.
Results
We examine a motor insurance dataset of 20,000 records—a random sample of an insurer’s
portfolio collected in Poland over the period January 2012 to December 2015. Each record represents
characteristics of an insurance policy covering motor third party liability (MTPL) including the address of
the policyholder, risk exposure defined as a fraction of the year in which the policy was active over the
period 2013-2015 and the count of incurred property damage claims over the period 2013-2015. The
insurer also provided us with the expected frequency of property damage claims for those policies,
estimated by their current best-in-class risk model, which includes zoning based on the client’s zip-code.
Figure 2. Features annotated from Google Satellite View and Google Street View image of a particular address.
We collect Google Satellite View and Google Street View images for the addresses provided in
the database. Six experts annotated the following features of the houses visible in the images: their type,
age, condition, and the estimated wealth of their residents, as well as the type and density of other buildings in the
neighborhood (Figure 2). Four out of six annotators gave moderately consistent answers for the common
subsample of 500 addresses: Fleiss’ kappa statistics indicate mostly moderate agreement among them
(Table 1). These four annotators continued annotating the remaining 19,371 addresses (we removed 129
addresses from the scope of this study as they were either foreign or could not be found by Google
Maps), but this time each annotator was given a separate, randomly selected set of addresses. We
compared the distributions of the collected annotations and finally applied small corrections to match the
mean and standard deviation among all four annotators.
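The correction step is described only briefly. One possible way to match means and standard deviations across annotators, assuming the annotations are coded numerically (for example, as ordinal scores), is to standardize each annotator's scores and rescale them to the pooled mean and standard deviation. The function name `align_annotators` and the toy scores below are hypothetical and are not taken from the study.

```python
from statistics import mean, pstdev


def align_annotators(scores_by_annotator):
    """Rescale each annotator's scores to the pooled mean and standard deviation.

    scores_by_annotator maps an annotator name to a list of numeric scores.
    This is an illustrative correction, not necessarily the one used in the study.
    """
    pooled = [s for scores in scores_by_annotator.values() for s in scores]
    target_mean, target_sd = mean(pooled), pstdev(pooled)
    aligned = {}
    for name, scores in scores_by_annotator.items():
        m, sd = mean(scores), pstdev(scores)
        if sd == 0:
            # An annotator who always gives the same score carries no spread to rescale.
            aligned[name] = [target_mean for _ in scores]
        else:
            aligned[name] = [target_mean + (s - m) / sd * target_sd for s in scores]
    return aligned


# Toy example: two annotators scoring house condition on a 1-5 scale.
toy_scores = {"annotator_1": [1, 2, 3, 4], "annotator_2": [2, 3, 4, 5, 5]}
aligned_scores = align_annotators(toy_scores)
```

After this transformation every annotator has the same mean and standard deviation, so systematic differences in strictness between annotators no longer enter the model as spurious signal.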
Table 1. Statistics for 7 newly created variables: original granularity, inter-rater reliability of 4 selected annotators on the
common set of 500 observations and significance in our risk model after applying necessary simplifications.
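For readers unfamiliar with the inter-rater statistic, the sketch below computes Fleiss' kappa from its standard definition for a set of items that were each labeled by the same number of raters. The function name and the toy labels are illustrative assumptions; the actual annotations are not available in the text.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater).

    Every item must be rated by the same number of raters (at least two).
    The statistic is undefined when all ratings fall into a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")
    counts = [Counter(item) for item in ratings]
    categories = sorted({label for item in ratings for label in item})
    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_observed = sum(p_items) / n_items
    # Agreement expected by chance from the overall category proportions.
    p_categories = [
        sum(cnt[cat] for cnt in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p * p for p in p_categories)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy example: 4 houses, each labeled "old" or "new" by 4 raters.
toy_labels = [
    ["old", "old", "old", "old"],
    ["old", "old", "new", "new"],
    ["new", "new", "new", "new"],
    ["new", "old", "new", "new"],
]
kappa = fleiss_kappa(toy_labels)  # 11/27, about 0.41
```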
Next, we estimated a Generalized Linear Model (GLM) to investigate the importance of newly
created variables for risk prediction (9, 12, 13). We assume the following probabilistic model of claim
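frequency $f$, defined as the number of claims divided by risk exposure:
$$f = \frac{N}{\text{exposure}}, \qquad \mathbb{E}[f] = \exp(X\beta),$$
where $N$ is the number of property damage claims within MTPL insurance following a Poisson distribution,
$X$ is a vector of independent variables and $\beta$ is the vector of coefficients.
For relative evaluation of the value added by our approach, we introduce three models:
- Model A (null model), where vector $X$ is $(1)$
- Model B (best-in-class insurer’s model), where vector $X$ is $(1, x_1, \dots, x_j)$
- Model C (our model), where vector $X$ is $(1, x_1, \dots, x_j, x_{j+1}, \dots, x_{j+k})$, with $x_{j+1}, \dots, x_{j+k}$ denoting the newly created variables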
The insurer provided us with the realization of the model B for each record from the dataset. That model
was estimated on a larger undisclosed dataset and contains j predictive variables (driver characteristics,
vehicle characteristics, claim history, geographical zone, etc.). Using properties of GLMs we can
decompose Model C into two parts: one corresponding to the Model B and one incorporating the new
variables. We refer to the realization of the Model B multiplied by exposure as an offset (14) and do not
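estimate it. Therefore, Model C takes the form
$$\mathbb{E}[N] = \text{offset} \cdot \exp\big(\beta_{j+1} x_{j+1} + \dots + \beta_{j+k} x_{j+k}\big), \qquad \text{offset} = \text{exposure} \cdot \hat{f}_B,$$
where $N$ is the number of claims, $\hat{f}_B$ is the claim frequency predicted by Model B and $x_{j+1}, \dots, x_{j+k}$ are the newly created variables.
Intuitively, in this representation, the estimated coefficients $\beta_{j+1}, \dots, \beta_{j+k}$ explain the signal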
that is not explained by the best-in-class risk model of the insurer (model B) and will also adjust for the
earned exposure of the policy shorter than 1 year. We investigate if the values of these coefficients are
non-zero, indicating that the variables we constructed provide additional predictive power to the model.
We find that five out of seven newly created variables within this research are significant for predicting
property damage MTPL claim frequency, on top of the many other rating variables used in the
best-in-class insurer’s model (Table 1).
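To illustrate how an offset works in a Poisson model with a log link, the sketch below fits a single coefficient by Newton-Raphson while keeping the insurer's prediction fixed. It assumes one new variable only, and all names (`fit_offset_poisson`, `house_flag`) and numbers are made up for illustration; it is not the estimation code used in the study.

```python
import math


def fit_offset_poisson(claims, offsets, x, max_iter=100, tol=1e-12):
    """Newton-Raphson estimate of beta in E[N_i] = offsets[i] * exp(beta * x[i]).

    The offsets are treated as known and are not estimated.
    At least one x[i] must be non-zero, otherwise beta is not identified.
    """
    beta = 0.0
    for _ in range(max_iter):
        mu = [o * math.exp(beta * xi) for o, xi in zip(offsets, x)]
        # First derivative of the Poisson log-likelihood with respect to beta.
        score = sum(xi * (n - m) for xi, n, m in zip(x, claims, mu))
        # Negative second derivative (Fisher information).
        information = sum(xi * xi * m for xi, m in zip(x, mu))
        step = score / information
        beta += step
        if abs(step) < tol:
            break
    return beta


# Toy portfolio with made-up numbers: claim counts, exposure in years,
# frequency predicted by the existing model, and one binary house feature.
claims = [0, 1, 0, 0, 1, 0]
exposure = [1.0, 1.0, 0.5, 1.0, 1.0, 0.5]
freq_b = [0.05, 0.05, 0.04, 0.06, 0.05, 0.04]
house_flag = [1, 1, 1, 0, 0, 0]

offsets = [e * f for e, f in zip(exposure, freq_b)]
beta = fit_offset_poisson(claims, offsets, house_flag)
```

For a binary feature the estimate has a closed form: `exp(beta)` equals the observed number of claims among flagged policies divided by the number of claims the existing model expects for them. A coefficient close to zero therefore means the new variable adds nothing beyond the existing model.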
We observe significant variability of the Gini coefficient for all three models (A, B and C). In particular, for
model A (null model with intercept only and no other variables selected) it varies from 20% to 38% within
20 resampling trials. We interpret this as evidence that the dataset provided (20,000
records) is extremely small for modeling such rare events as property damage claims within MTPL insurance (average
frequency of 5%).
Figure 3. Gini coefficients obtained on 20% test sample in 20 bootstrapping trials from the null model (A), the best-in-class
insurer’s model (B) and our model with newly created variables (C).
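The exact definition of the Gini coefficient behind Figure 3 is not spelled out in this text. As a rough illustration only, the sketch below computes one common ranking-based variant: policies are sorted by predicted frequency, a Lorenz curve of cumulative claims against cumulative exposure is built, and the Gini coefficient is one minus twice the area under that curve. The function name and data are hypothetical, and values from this sketch are not comparable to those reported here.

```python
def gini_from_scores(scores, claims, exposure):
    """Ranking-based Gini coefficient of a risk score.

    Policies are ordered from lowest to highest score. Ties keep their input
    order, so constant scores give an order-dependent result.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    total_claims = sum(claims)
    total_exposure = sum(exposure)
    cum_exposure = cum_claims = 0.0
    area = 0.0
    for i in order:
        next_exposure = cum_exposure + exposure[i] / total_exposure
        next_claims = cum_claims + claims[i] / total_claims
        # Trapezoid between two consecutive points of the Lorenz curve.
        area += (next_exposure - cum_exposure) * (cum_claims + next_claims) / 2.0
        cum_exposure, cum_claims = next_exposure, next_claims
    return 1.0 - 2.0 * area


# Toy example: the only claim belongs to the policy with the highest score.
good_ranking = gini_from_scores([0.01, 0.02, 0.03, 0.04], [0, 0, 0, 1], [1, 1, 1, 1])
# Same data, but the claim belongs to the policy with the lowest score.
bad_ranking = gini_from_scores([0.04, 0.03, 0.02, 0.01], [0, 0, 0, 1], [1, 1, 1, 1])
```

Under this variant a score that ranks claimants last gives a positive value, a score that ranks them first gives a negative value, and an uninformative ordering gives a value near zero.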
Despite the high volatility of data, adding our five simple variables to the insurer’s model
improves its performance in 18 out of 20 resampling trials and the average improvement of the Gini
coefficient is nearly 2 percentage points (from 38.2% to 40.1%). To put this value into perspective, the
best-in-class insurer’s model, fitted on a much bigger dataset and including a broad selection of variables
(e.g. driver characteristics, car characteristics, claim history and geographical zones based on the client’s
zip-code), improves the Gini coefficient versus the null model by 8 percentage points, from ~30% to ~38%
(Figure 3).
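As a back-of-the-envelope comparison using only the figures quoted above (the values for models A and B are approximate, so the ratio is indicative only):

```latex
\Delta_{C-B} = 40.1\% - 38.2\% = 1.9 \text{ pp}, \qquad
\Delta_{B-A} \approx 38\% - 30\% = 8 \text{ pp}, \qquad
\frac{\Delta_{C-B}}{\Delta_{B-A}} \approx \frac{1.9}{8} \approx 0.24
```

In other words, the new variables add roughly a quarter of the gain that the insurer's full set of rating variables achieves over the null model.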
Discussion
We found that features visible in a picture of a house can be predictive of car accident risk,
independently of classically used variables such as age or zip-code. This finding is not only a step
towards more granular risk prediction models, but also illustrates a novel approach to social science,
where real-world granular data is collected and analyzed at scale.
From the practical perspective of insurance companies, the results we present are remarkably
strong when compared to the best-in-class insurance model. Our 5 variables, which already contain some
bias from the imperfect annotation, improve the Gini coefficient by nearly 2 percentage points. This is
large compared with the improvement of 8 percentage points brought by the numerous variables that the
insurer already uses in its best-in-class risk model. The insurance industry could be quickly
followed by banks, as there is a proven correlation between insurance risk models and credit risk
scoring (15). The approach of extracting valuable information from Google Street View opens a variety
of opportunities beyond the financial sector. Any company that collects clients’ addresses could
adopt our methodology, and deep learning technology makes it possible to do so in an automated way on a
massive scale (16).
Such practice, however, raises concerns about the privacy of data stored in publicly available
Google Street View, Microsoft Bing Streetside, Mapillary, or equivalent privately-held datasets like
CycloMedia. The consent given by clients to a company to store their addresses does not
necessarily imply consent to store information about the appearance of their houses. In particular,
features of the house may be a proxy of ethnicity, religion or other characteristics associated with the
social status of a person (17, 18), which the law forbids using for any discrimination, e.g.
price discrimination, in certain jurisdictions (19). The fast development of modern data collection and
computational techniques allows for unprecedented exploitation of clients’ data without their being
aware of it (20), and the development of corresponding legislation seems to be outpaced.
The methods we present could be substantially improved by employing more annotators for the
same set of images. The average or ensemble of their answers would potentially match reality
better than the annotation of a single person (21, 22). Our model performance could be different on the
data of another insurer or in another country; however, to the best of our knowledge, most insurance
companies do aggregate risk at the zip-code or regional level, so our variables are still likely to provide an
additional signal. Another limitation is the small size of the dataset provided by the insurance company,
but we mitigated this problem by using bootstrapping and elementary modelling techniques such as
generalized linear models.
References
1. M. Finer et al., Combating deforestation: From satellite to intervention. Science. 360, 1303-1305
(2018).
2. N. Jean et al., Combining satellite imagery and machine learning to predict poverty. Science. 353,
790-794 (2016).
3. A. Esteva et al., Dermatologist-level classification of skin cancer with deep neural networks. Nature.
542, 115-118 (2017).
4. P. Lakhani, B. Sundaram, Deep Learning at Chest Radiography: Automated Classification of
Pulmonary Tuberculosis by Using Convolutional Neural Networks. Radiology. 284, 574-582 (2017).
5. D. Anguelov et al., Google Street View: Capturing the World at Street Level. Computer. 43, 32-38
(2010).
6. T. Gebru et al., Using deep learning and Google Street View to estimate the demographic makeup of
neighborhoods across the United States. Proc. Natl. Acad. Sci. U. S. A. 114, 13108-13113 (2017).
7. R. Goel et al., Estimating city-level travel patterns using street imagery: A case study of using Google
Street View in Britain. PLoS One. 13, e0196521 (2018).
8. V. O. Andersson, M. A. F. Birck, R. M. Araujo, in Computational Neuroscience (Springer International
Publishing, 2017), pp. 81-93.
9. G. Werner, C. Modlin, in Casualty Actuarial Society (2010).
10. G. Taylor, Geographic Premium Rating by Whittaker Spatial Smoothing. ASTIN Bulletin: The Journal
of the IAA. 31, 147-160 (2001).
11. J. Yao, Clustering in Ratemaking: Applications in Territories Clustering. Casualty Actuarial Society
Discussion Paper Program Casualty Actuarial Society-Arlington, Virginia, 170-192 (2008).
12. G. A. Spedicato, C. Dutang, L. Petrini, Machine Learning Methods to Perform Pricing Optimization. A
Comparison with Standard GLMs. Variance: Advancing the Science of Risk. 111 (2018).
13. I. Kolyshkina, S. Wong, S. Lim, in Casualty Actuarial Society (2004), pp. 279-290.
14. J. Yan, J. Guszcza, M. Flynn, C.-S. P. Wu, in Casualty Actuarial Society E-Forum, Winter 2009 (2009),
p. 366.
15. L. L. Golden, P. L. Brockett, J. Ai, B. Kellison, Empirical Evidence on the Use of Credit Scoring for
Predicting Insurance Losses with Psycho-social and Biochemical Explanations. N. Am. Actuar. J. 20,
233-251 (2016).
16. B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, A. Oliva, in Advances in Neural Information Processing
Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, K. Q. Weinberger, Eds. (Curran
Associates, Inc., 2014), pp. 487-495.
17. A. R. Gillis, Population Density and Social Pathology: The Case of Building Type, Social Allowance and
Juvenile Delinquency. Soc. Forces. 53, 306-314 (1974).
18. E. R. Braver, Race, Hispanic origin, and socioeconomic status in relation to motor vehicle occupant
death rates and risk factors among adults. Accid. Anal. Prev. 35, 295-309 (2003).
19. J. Gaulding, Race Sex and Genetic Discrimination in Insurance: What’s Fair. Cornell Law Rev. 80,
1646 (1994).
20. M. J. Blitz, The right to map (and avoid being mapped): reconceiving First Amendment protection for
information-gathering in the age of Google Earth. Columbia Sci. Technol. Law Rev. 14, 115 (2012).
21. L. Tran-Thanh, S. Stein, A. Rogers, N. R. Jennings, Efficient crowdsourcing of unknown experts using
bounded multi-armed bandits. Artif. Intell. 214, 89-111 (2014).
22. R. M. Levenson, E. A. Krupinski, V. M. Navarro, E. A. Wasserman, Pigeons (Columba livia) as
Trainable Observers of Pathology and Radiology Breast Cancer Images. PLoS One. 10, e0141357
(2015).
23. M. Goldburd, A. Khare, C. D. Tevet, in Casualty Actuarial Society (2016).
24. P. Cizek, W. K. Härdle, R. Weron, Statistical Tools for Finance and Insurance (Springer Science &
Business Media, 2005).
25. E. W. Frees, G. Meyers, A. D. Cummings, Summarizing Insurance Scores Using a Gini Index. J. Am.
Stat. Assoc. 106, 1085-1098 (2011).
26. M. O. Lorenz, Methods of Measuring the Concentration of Wealth. Publications of the American
Statistical Association. 9, 209-219 (1905).
27. C. Gini, Measurement of Inequality of Incomes. Econ. J. 31, 124-126 (1921).
Funding: Łukasz Kidziński is partly sponsored by the Mobilize Center at Stanford, a National Institutes of
Health Big Data to Knowledge (BD2K) Center of Excellence supported through Grant U54EB020405
Author contributions: Authors contributed equally.
Competing interest: Authors declare no competing interests.
Data and materials availability: All data required to understand and assess the conclusions of this
research are available in the main text and supplementary materials. Insurance claims data used for the
study is privately owned and contains sensitive client observations precluding publication.