Article

Predicting poverty and wealth from mobile phone metadata

Authors: Joshua Blumenstock, Gabriel Cadamuro, Robert On

Abstract

Predicting unmeasurable wealth: In developing countries, collecting data on basic economic quantities, such as wealth and income, is costly, time-consuming, and unreliable. Taking advantage of the ubiquity of mobile phones in Rwanda, Blumenstock et al. mapped mobile phone metadata to individual subscriber wealth. They applied the model to predict wealth throughout Rwanda and showed that the predictions matched well with those from detailed boots-on-the-ground surveys of the population. Science, this issue p. 1073.
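The editor's summary describes the general pipeline: features derived from call detail records (CDRs) are mapped to survey-measured wealth for a subset of subscribers, and the fitted model is then applied to all subscribers and aggregated geographically. The sketch below illustrates that pipeline under stated assumptions; the feature names and file names are hypothetical, and a cross-validated elastic net stands in for whatever supervised estimator the authors actually used.

```python
# Minimal sketch (not the authors' pipeline): learn a mapping from CDR-derived
# features to a survey-based wealth index, then extrapolate to all subscribers
# and aggregate geographically. Column and file names are hypothetical.
import pandas as pd
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("surveyed_subscribers.csv")   # hypothetical: CDR features + survey wealth index
features = ["n_calls", "n_unique_contacts", "n_sms", "total_topup",
            "n_towers_visited", "radius_of_gyration_km", "pct_night_calls"]
X, y = train[features], train["wealth_index"]

model = make_pipeline(StandardScaler(), ElasticNetCV(cv=5))
print("CV R^2:", cross_val_score(model, X, y, cv=5, scoring="r2").mean())
model.fit(X, y)

# Score every subscriber, then average by home district to build a wealth map.
everyone = pd.read_csv("all_subscribers.csv")     # hypothetical
everyone["predicted_wealth"] = model.predict(everyone[features])
district_map = everyone.groupby("home_district")["predicted_wealth"].mean()
```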


... Further, mobile devices such as smartphones and wearable technology (Gandy et al., 2017) have become valuable sources of data for automated contact tracing and data collection. These devices consistently capture metrics related to browsing behavior, geolocation (Nikolic and Bierlaire, 2017; Wu et al., 2013), patterns of communication (Green et al., 2021; Blumenstock et al., 2015), and other features that can be used to study real-time aspects of human behavior. ...
... Governments and their national statistical offices will be required to make important decisions regarding where and when to use conventional and digital data sources for policy. Survey data, though expensive, can be used to benchmark data from other sources that can be collected more cheaply, more frequently, or at finer granularity (Blumenstock et al., 2015; Keusch et al., 2020a, 2020b). New sources of data, including social data, sensors, and digital platforms, are becoming serious complements, and in some cases alternatives, to conventional government surveys. ...
... Researchers and analysts are also drawing on ML to leverage new datasets that capture hard-to-measure variables for policy analysis. Examples range from using Twitter to identify illegal sales of opioids (Mackey et al., 2017), developing economic uncertainty indices from scientific publications (Azqueta-Gavaldon, 2017), predicting income levels from phone metadata (Blumenstock et al., 2015), optimizing Covid-19 vaccine deployment strategies in Africa (Mellado et al., 2021), and predicting suicide risk using Reddit data (Yao et al., 2021; Allen et al., 2019). In many of these cases, researchers trained artificial intelligence (AI) to identify patterns (e.g., timing, wording, or events associated with behaviors of interest) from a small dataset and then applied these algorithms to classify instances across a larger number of observations. ...
Article
Full-text available
Data for Policy ( dataforpolicy.org ), a trans-disciplinary community of research and practice, has emerged around the application and evaluation of data technologies and analytics for policy and governance. Research in this area has involved cross-sector collaborations, but the areas of emphasis have previously been unclear. Within the Data for Policy framework of six focus areas, this report offers a landscape review of Focus Area 2: Technologies and Analytics. Taking stock of recent advancements and challenges can help shape research priorities for this community. We highlight four commonly used technologies for prediction and inference that leverage datasets from the digital environment: machine learning (ML) and artificial intelligence systems, the internet-of-things, digital twins, and distributed ledger systems. We review innovations in research evaluation and discuss future directions for policy decision-making.
... Our method consumes raw metadata from mobile phone usage, which are already being collected at close to zero cost. These records can yield rich information about individuals, including mobility, consumption, and social networks (Blumenstock, Cadamuro, & On, 2015;Gonzalez, Hidalgo, & Barabasi, 2008;Lu, Wetter, Bharti, Tatem, & Bengtsson, 2013;Onnela et al., 2007;Palla, Barabási, & Vicsek, 2007;Soto, Frias-Martinez, Virseda, & Frias-Martinez, 2011). This paper shows how indicators derived from this data can predict the repayment of credit. ...
... Determining how to extract relevant behaviors from unstructured transaction records is the critical step in standard machine learning ('applied machine learning is basically feature engineering' (Ng, 2011)). A brute force data mining approach such as Blumenstock et al. (2015) would algorithmically extract indicators while being agnostic towards the outcome variable. However, such approaches can pick up spurious correlations that make them unreliable in practice (Lazer, Kennedy, King, & Vespignani, 2014). ...
... From the phone data we derive various features that may be associated with repayment. In a similar exercise, Blumenstock et al. (2015) generate features from mobile phone data using a data mining approach that is agnostic about the outcome variable. Our approach is instead tailored to one outcome, repayment. ...
Preprint
Many households in developing countries lack formal financial histories, making it difficult for firms to extend credit, and for potential borrowers to receive it. However, many of these households have mobile phones, which generate rich data about behavior. This article shows that behavioral signatures in mobile phone data predict default, using call records matched to repayment outcomes for credit extended by a South American telecom. On a sample of individuals with (thin) financial histories, our method actually outperforms models using credit bureau information, both within time and when tested on a different time period. But our method also attains similar performance on those without financial histories, who cannot be scored using traditional methods. Individuals in the highest quintile of risk by our measure are 2.8 times more likely to default than those in the lowest quintile. The method forms the basis for new forms of credit that reach the unbanked.
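The 2.8x figure in this abstract is a simple summary statistic of out-of-sample risk scores. Below is a hedged sketch of how such a quintile comparison can be computed; the data here are synthetic and purely illustrative, and the actual model and records are not reproduced.

```python
# Sketch: ratio of default rates in the top vs. bottom quintile of predicted
# risk, computed from out-of-sample predicted default probabilities.
import numpy as np
import pandas as pd

def quintile_risk_ratio(y_true, p_hat):
    """Default rate in the highest-risk quintile divided by the lowest."""
    df = pd.DataFrame({"default": y_true, "risk": p_hat})
    df["quintile"] = pd.qcut(df["risk"], 5, labels=False)   # 0 = lowest predicted risk
    rates = df.groupby("quintile")["default"].mean()
    return rates.iloc[-1] / rates.iloc[0]

# Synthetic illustration only (not the paper's data).
rng = np.random.default_rng(0)
p = rng.uniform(0.02, 0.4, 5000)        # pretend out-of-sample risk scores
y = rng.binomial(1, p)                  # simulated repayment outcomes
print(quintile_risk_ratio(y, p))
```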
... Over the past decade, governments in low-and middle-income countries (LMICs) have increasingly used passively collected "big" data to inform policy decisions. Machine learning algorithms now leverage satellite imagery [52], mobile phone metadata [18], and social media data [41] in settings ranging from the targeting of humanitarian aid [7,96] to determining lending decisions [17] and informing pandemic response [78,83]. This use of big data to inform critical development policy decisions has reinvigorated longstanding debates about data privacy: big data may improve decision making in certain settings, but they can also reveal sensitive information about people and communities [20,31,104]. ...
... We developed four scenarios (see Appendix C.2 for the scenarios). While none of the scenarios was real, all were possible [7,9,18,19,23] and inspired by the literature [93,106]. To provide a balanced account, two scenarios were intended to demonstrate "helpful" uses and two were intended to demonstrate "harmful" uses. ...
... Mobile phone metadata are held by mobile network operators, and have on occasion been shared with governments, nonprofit organizations, for-profit organizations, and researchers to inform development policy decisions [74,104]. For example, mobile phone metadata have been used to map poverty in Rwanda [18], Afghanistan [19], Guatemala [48], and Bangladesh [100]; inform the targeting of humanitarian aid in Togo [7] and the Democratic Republic of the Congo [75]; measure mobility in response to natural disasters in Haiti [16] and Nepal [113]; and predict the spread of disease in Kenya [112], Sierra Leone [83], and Senegal [73]. In Togo, mobile phone metadata were used as part of the Novissi program that targeted cash assistance to the poorest people in the country to help them survive economic impacts caused by pandemic shutdowns [7]. ...
Preprint
Full-text available
Passively collected "big" data sources are increasingly used to inform critical development policy decisions in low- and middle-income countries. While prior work highlights how such approaches may reveal sensitive information, enable surveillance, and centralize power, less is known about the corresponding privacy concerns, hopes, and fears of the people directly impacted by these policies -- people sometimes referred to as experiential experts. To understand the perspectives of experiential experts, we conducted semi-structured interviews with people living in rural villages in Togo shortly after an entirely digital cash transfer program was launched that used machine learning and mobile phone metadata to determine program eligibility. This paper documents participants' privacy concerns surrounding the introduction of big data approaches in development policy. We find that the privacy concerns of our experiential experts differ from those raised by privacy and development domain experts. To facilitate a more robust and constructive account of privacy, we discuss implications for policies and designs that take seriously the privacy concerns raised by both experiential experts and domain experts.
... In particular, recent studies have increasingly employed mobile phone call detail records (CDRs) to analyze human trajectories, presenting valuable insights into movement patterns, social interactions, and urban mobility at large scales [15]. Blumenstock et al. (2015) demonstrated that mobile phone data, beyond basic mobility analysis, can be leveraged to infer an individual's socioeconomic status and social behavior [16]. Human groups can further be identified by clustering algorithms, contributing to informed urban planning [17]. ...
Article
Full-text available
Spatial big data about human mobility have been employed intensively in understanding human spatial activity patterns, which is a central topic in many applications. Available research on spatial clustering patterns of human activities has been investigated mainly based on similarities of locations and temporal attributes of spatial trajectories. These methods are not effective in revealing human groups who move among spaces at different locations but with the same functions. Function, as one semantic attribute of spaces, is a major driver of most human movements. This work investigates human clustering based on space functions of trajectory stay points in human mobility data using graph embedding. Firstly, typical functions of spaces are categorized into 35 types in our research area, which is a university campus. Human trajectories based on Wi-Fi networks were collected as test data. Then, human networks are built among human individuals. Each individual is taken as a node in the network, and an edge is built between two nodes if the corresponding individuals stay in spaces of the same type of function longer than a specific time duration. The graph embedding algorithm is used to calculate feature vector representations of nodes in the network, which can capture complex relationships among nodes through biased random walks. K-means clustering is applied to classify the feature vectors, which reveals potential behavioral pattern similarities of individuals concerning the functions of their staying spaces. The elbow method and silhouette score of clusters are used to determine an appropriate number of clusters. Three scenarios were designed based on three specific time durations, and random walk-biased parameters were fine-tuned to improve the clustering performance. Results reveal typical clusters and correlation between clusters and typical space functions.
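The abstract describes an embed-then-cluster workflow on a person-to-person network built from shared space functions. The sketch below follows that workflow under simplifying assumptions: a spectral embedding stands in for the paper's biased random-walk graph embedding, the shared-stay input is a hypothetical precomputed dictionary, and silhouette scores select the number of clusters (complementing the elbow method mentioned above).

```python
# Simplified sketch of the embed-then-cluster workflow (not the paper's exact method).
import networkx as nx
from sklearn.manifold import SpectralEmbedding
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_individuals(shared_stay_hours, min_hours=2.0, dims=8, k_range=range(2, 10)):
    """shared_stay_hours: dict mapping (person_a, person_b) -> hours both spent
    in spaces of the same function type (hypothetical precomputed input)."""
    G = nx.Graph()
    for (a, b), hours in shared_stay_hours.items():
        if hours >= min_hours:                       # edge only above the duration threshold
            G.add_edge(a, b, weight=hours)
    nodes = list(G.nodes)
    A = nx.to_numpy_array(G, nodelist=nodes, weight="weight")

    # Spectral embedding of the weighted adjacency matrix; a stand-in for the
    # paper's biased random-walk (node2vec-style) embedding.
    emb = SpectralEmbedding(n_components=dims, affinity="precomputed").fit_transform(A)

    # Choose the number of clusters by silhouette score.
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(emb)
        scores[k] = silhouette_score(emb, labels)
    best_k = max(scores, key=scores.get)
    labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(emb)
    return dict(zip(nodes, labels)), best_k, scores
```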
... It is estimated that a given African household would appear in a household survey once in 1,000 years, making it difficult to measure changes in well-being over time at a local level [11]. For this reason, researchers have turned to alternative and non-traditional sources of data such as nightlights (NTL), daytime satellite imagery or mobile phone call detail records [12][13][14][15]. ...
... These surveys assess living conditions and household possessions and are available for multiple years and countries. It is worth noting that most previous studies evaluating the accuracy of satellite data compare them against the wealth index derived from DHS data [11,12,27,28]. ...
Article
Full-text available
Nightlights (NTL) have been widely used as a proxy for economic activity, despite known limitations in accuracy and comparability, particularly with outdated Defense Meteorological Satellite Program (DMSP) data. The emergence of newer and more precise Visible Infrared Imaging Radiometer Suite (VIIRS) data offers potential, yet challenges persist due to temporal and spatial disparities between the two datasets. Addressing this, we employ a novel harmonized NTL dataset (VIIRS + DMSP), which provides the longest and most consistent database available to date. We evaluate the association between newly available harmonized NTL data and various indicators of economic activity at the subnational level across 34 countries in sub-Saharan Africa from 2004 to 2019. Specifically, we analyze the accuracy of the new NTL data in predicting socio-economic outcomes obtained from two sources: 1) nationally representative surveys, i.e., the household Wealth Index published by Demographic and Health Surveys, and 2) indicators derived from administrative records such as the gridded Human Development Index and Gross Domestic Product per capita. Our findings suggest that even after controlling for population density, the harmonized NTL remain a strong predictor of the wealth index. However, while urban areas show a notable association between harmonized NTL and the wealth index, this relationship is less pronounced in rural areas. Furthermore, we observe that NTL can also significantly explain variations in both GDP per capita and HDI at subnational levels.
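A minimal sketch of the kind of association analysis described above: regressing the DHS wealth index on harmonized nighttime lights while controlling for population density, separately for urban and rural clusters. The variable and file names are assumptions, not the study's actual data layout.

```python
# Hedged sketch: wealth index vs. harmonized NTL, controlling for population
# density, estimated separately for urban and rural subsets.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("dhs_clusters_with_ntl.csv")   # hypothetical file
for subset in ["urban", "rural"]:
    sub = df[df["residence"] == subset]
    fit = smf.ols("wealth_index ~ np.log1p(ntl_harmonized) + np.log1p(pop_density)",
                  data=sub).fit()
    print(subset, "R^2:", round(fit.rsquared, 3))
    print(fit.params)
```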
... Traditional poverty assessments, such as surveys and censuses, are often costly, time-consuming, and infrequent, particularly in developing countries where resource limitations and/or political constraints hinder comprehensive data collection [4]. As a response, researchers have explored innovative data streams such as satellite imagery, mobile phone usage patterns [2], nightlight intensity [11], and social media activity [1], which offer the potential for faster and more cost-effective poverty assessment at fine spatial resolutions. ...
... The Random Forest result for (1 vs. 2-5) is not significantly better than random guessing. ChatGPT is significantly better than Random Forest for (1 vs. 2-5), and Random Forest is significantly better than ChatGPT for (1-4 vs. 5). For the other cases, the ChatGPT and Random Forest results are not significantly different. ...
Preprint
This paper investigates the novel application of Large Language Models (LLMs) with vision capabilities to analyze satellite imagery for village-level poverty prediction. Although LLMs were originally designed for natural language understanding, their adaptability to multimodal tasks, including geospatial analysis, has opened new frontiers in data-driven research. By leveraging advancements in vision-enabled LLMs, we assess their ability to provide interpretable, scalable, and reliable insights into human poverty from satellite images. Using a pairwise comparison approach, we demonstrate that ChatGPT can rank satellite images based on poverty levels with accuracy comparable to domain experts. These findings highlight both the promise and the limitations of LLMs in socioeconomic research, providing a foundation for their integration into poverty assessment workflows. This study contributes to the ongoing exploration of unconventional data sources for welfare analysis and opens pathways for cost-effective, large-scale poverty monitoring.
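The "pairwise comparison approach" can be made concrete with a small sketch: pairwise "which of these two villages looks poorer?" judgments are aggregated into a ranking that can then be compared with expert scores. The `judge` callable wrapping the vision-LLM prompt is hypothetical, and win counting is only one simple aggregation rule (a Bradley-Terry fit would be another).

```python
# Sketch: turn pairwise poverty judgments into a per-village ranking.
from collections import defaultdict
from itertools import combinations

def rank_from_pairwise(villages, judge):
    """Return villages ordered from most to least often judged poorer.
    judge(a, b) is a hypothetical callable that shows the two satellite images
    to the vision LLM and returns whichever id it deems poorer."""
    wins = defaultdict(int)
    for a, b in combinations(villages, 2):
        wins[judge(a, b)] += 1
    return sorted(villages, key=lambda v: wins[v], reverse=True)

# Usage (hypothetical): llm_rank = rank_from_pairwise(village_ids, ask_vision_llm)
# The resulting order can then be compared to an expert ranking, e.g. with
# scipy.stats.spearmanr on the two sets of rank positions.
```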
... Facing the above-mentioned challenges, researchers previously resorted to new data sources such as search queries [7-14], social media [15-18], satellite images [19-23], online commodity prices [24,25], financial transactions [26-29], check-in data [30], and mobile phone data [31-34], etc., to build socio-economic indicators or study economic behaviors from different perspectives. ...
... China UnionPay also develops economic indicators from bank card expenditures in various market segments [29]. For mobile phone data, Toole et al. [33] track employment shocks using mobile phone Call Detail Records (CDRs) in Europe, and Blumenstock et al. [32] infer poverty and wealth at the individual level by combining Rwanda's mobile phone metadata and survey data with machine learning algorithms. ...
Preprint
Emerging trends in smartphones, online maps, social media, and the resulting geo-located data provide opportunities to collect traces of people's socio-economic activities in a much more granular and direct fashion, triggering a revolution in empirical research. These vast mobile data offer new perspectives and approaches for measuring economic dynamics and are broadening the research fields of social science and economics. In this paper, we explore the potential of using mobile big data for measuring economic activities in China. First, we build indices for gauging employment and consumer trends based on billions of geo-positioning records. Second, we advance the estimation of offline store foot traffic via location search data derived from Baidu Maps, which is then applied to predict Apple's revenues in China and to detect box-office fraud accurately. Third, we construct consumption indicators to track the trends of various industries in the service sector, which are verified against several existing indicators. To the best of our knowledge, we are the first to measure the second largest economy by mining such unprecedentedly large-scale and fine-grained spatio-temporal data. Our research provides new approaches and insights for measuring economic activities.
... Artificial intelligence (AI) is a potent weapon in the battle against cybercrime because of its capacity to analyse enormous amounts of data, spot trends, and learn from experience. Pattern identification is one of the main uses of AI in cybersecurity [16,17]. Because machine learning algorithms are trained on large volumes of both benign and malicious behaviour, AI systems are able to identify suspicious actions that diverge from the norm in a timely manner. ...
... This makes it possible for security teams to identify possible attacks early on, react quickly, and reduce risks. Furthermore, AI-driven threat intelligence systems can collect and evaluate information from a variety of sources [17], including threat feeds and dark web forums, in order to provide useful insights into new dangers and weaknesses. ...
Article
Full-text available
Purpose: Financial inclusion is essential for reducing poverty and promoting prosperity, according to the United Nations. Financial Service Providers (FSPs) that offer solutions inclusive of all income levels must know how to effectively reach the underprivileged. By applying Artificial Intelligence (AI) to historical data, FSPs can anticipate prospective clients' reactions as they approach them. This study predicts the financial characteristics of schools and institutions using big data technology. Method: The paper uses big data to model individuals and applies an AI-driven, edge-cloud-computing-assisted optimisation algorithm to form clusters based on each individual's personal interests, daily consumption, and other indicators, allowing the prediction to move from individual components to a neural-network-based cluster with the help of edge computing. Results: To test the forecasting model, the study uses employment statistics from higher-education institutions in Hunan Province from June 2020 to May 2021 as its sample and compares the results against CNN and LSTM models. Prediction precision reaches 83.25%, since the edge-fog computing model in this research incorporates more analytical indexes as tuples than the CNN-based model. Conclusion: The research additionally proposes using AI-Thinking as a cognitive scaffold to draw out actionable findings and promote economic inclusion. Compared with the LSTM-based classification prediction model, the proposed model's use of edge computing significantly enhances the data quality of the model and its parameters and can increase calculation efficiency by 45%-65%.
... Past estimates of digital adoption have typically been based on probabilistic household surveys (Cohen and Adams, 2011;World Bank Group, 2016), which either lack gender disaggregation or are underpowered for subnational analyses. In other contexts, such as poverty (Blumenstock, Cadamuro and On, 2015;Pokhriyal and Jacques, 2017), wealth (Chi et al., 2022), and population mapping (Boo et al., 2022), "big data" derived from satellite, social media and mobile phone records have been used to overcome data gaps in survey-based approaches. The potential of nontraditional sources for mapping gender inequality indicators at subnational geographical resolution has yet to be explored. ...
... Flexible machine learning algorithms are appealing in this setting because of their ability to detect interactions, model higher-order effects, and better handle multiple, highly correlated predictors (Lundberg, Brand and Jeon, 2022). Machine learning approaches have been applied in similar prediction settings for LMICs, such as small-area estimation of wealth and poverty (Blumenstock, Cadamuro and On, 2015; Chi et al., 2022). ...
Preprint
The digital revolution has ushered in many societal and economic benefits. Yet access to digital technologies such as mobile phones and internet remains highly unequal, especially by gender in the context of low- and middle-income countries. While national-level estimates are increasingly available for many countries, reliable, quantitative estimates of digital gender inequalities at the subnational level are lacking. These estimates, however, are essential for monitoring gaps within countries and implementing targeted interventions within the global sustainable development goals, which emphasize the need to close inequalities both between and within countries. We develop estimates of internet and mobile adoption by gender and digital gender gaps at the subnational level for 2,158 regions in 118 low- and middle-income countries (LMICs), a context where digital penetration is low and national-level gender gaps disfavoring women are large. We construct these estimates by applying machine-learning algorithms to Facebook user counts, geospatial data, development indicators, and population composition data. We calibrate and assess the performance of these algorithms using ground-truth data from subnationally-representative household survey data from 31 LMICs. Our results reveal striking disparities in access to mobile and internet technologies between and within LMICs, with implications for policy formulation and infrastructure investment. These disparities contribute to a global context where women are 21% less likely to use the internet and 17% less likely to own mobile phones than men, corresponding to over 385 million more men than women owning a mobile phone and over 360 million more men than women using the internet.
... A deep understanding of the conditions of urban residents and spatial configuration is necessary, 127,128 so as to increase the participation of local communities in the decision-making process and avoid subjective bias as much as possible. 129 As a data-driven method, MetaCity can identify vulnerable populations and vulnerable communities by detecting their socio-economic status through remote sensing and mobile phone data, 130,131 and can further analyze and understand their specific needs and challenges. 132-134 Importantly, in the design of MetaCity, it is crucial to incorporate public participation by integrating interfaces at the input stage, aligning with the SDGs' emphasis on inclusive development. ...
Article
Full-text available
Cities are complex systems that develop under complicated interactions among their human and environmental components. Urbanization generates substantial outcomes and opportunities while raising challenges including congestion, air pollution, inequality, etc., calling for efficient and reasonable solutions to sustainable developments. Fortunately, booming technologies generate large-scale data of complex cities, providing a chance to propose data-driven solutions for sustainable urban developments. This paper provides a comprehensive overview of data-driven urban sustainability practice. In this review article, we conceptualize MetaCity, a general framework for optimizing resource usage and allocation problems in complex cities with data-driven approaches. Under this framework, we decompose specific urban sustainable goals, e.g., efficiency and resilience, review practical urban problems under these goals, and explore the probability of using data-driven technologies as potential solutions to the challenge of complexity. On the basis of extensive urban data, we integrate urban problem discovery, operation of urban systems simulation, and complex decision-making problem solving into an entire cohesive framework to achieve sustainable development goals by optimizing resource allocation problems in complex cities.
... Such a nuanced approach demands the assimilation of advanced data sources and techniques (Blumenstock, 2016). Innovations in this domain encompass high-resolution satellite imagery (Head et al., 2017;Jean et al., 2016), mobile phone metadata (Aiken et al., 2022;Blumenstock et al., 2015), and digital footprints, such as online search trends and social media behaviors (Choi & Varian, 2012;Fatehkia et al., 2020;Llorente et al., 2015). The emergence and integration of these data sources can be attributed to technological progress, specifically the proliferation of big data and the enhancement of machine learning algorithms (Pokhriyal & Jacques, 2017;Steele et al., 2017). ...
Article
Full-text available
This study aimed to predict household expenditure using a combination of survey and geospatial data. A web-based application operating on the Google Earth Engine platform has been specifically developed for this research, providing a set of satellite-based indicators. These data were spatially averaged at the district level and integrated with household nonfood expenditures, a proxy of socioeconomic conditions, derived from the World Bank’s 2019 Living Standards Measurement Study (LSMS). Four machine learning algorithms were applied. By using root mean square error as the goodness-of-fit criterion, a random forest algorithm yielded the highest forecasting precision, followed by support vector machine, neural network, and generalized least squares. In addition, variable importance and minimal depth analyses were conducted, indicating that the geospatial indicators have moderate contributive powers in predicting socioeconomic conditions. Conversely, the predictive powers of variables derived from the LSMS were mixed. Some asset ownership yielded a high explanatory power, whereas some were minimal. The attained results suggest future development aimed at enhancing accuracy. Additionally, the findings revealed an association between economic activity density and household expenditure, recommending regional development promotion through urbanization and transition from agriculture to other economic sectors.
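A hedged sketch of the four-model comparison described above, using RMSE from cross-validation as the selection criterion. Plain linear regression stands in for generalized least squares, and the feature and target names are hypothetical placeholders for the district-level geospatial indicators and LSMS non-food expenditure.

```python
# Sketch: compare four regressors by cross-validated RMSE on district features.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LinearRegression      # stand-in for GLS
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

df = pd.read_csv("district_geospatial_features.csv")   # hypothetical file
X = df.drop(columns=["district", "nonfood_expenditure"])
y = df["nonfood_expenditure"]

models = {
    "random_forest": RandomForestRegressor(n_estimators=500, random_state=0),
    "svm": make_pipeline(StandardScaler(), SVR(C=10.0)),
    "neural_net": make_pipeline(StandardScaler(),
                                MLPRegressor(hidden_layer_sizes=(64, 32),
                                             max_iter=2000, random_state=0)),
    "linear": LinearRegression(),
}
for name, model in models.items():
    rmse = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    print(f"{name}: RMSE = {rmse:.3f}")
```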
... The research seeks to substantiate GSM data's validity as a creditworthiness predictor. Studies show the potential of GSM data in credit scoring [3], [5], [21], and African markets can tap into the potential that alternative data bring to enable financial inclusion. GSM data can be used to infer social networks, communication patterns, and mobility, all valuable in assessing credit risk [3]. ...
Conference Paper
Full-text available
This paper presents an innovative approach to credit scoring for the unbanked population in Africa by leveraging alternative data from Mobile Network Operators (MNOs). Traditional credit scoring methods often fail due to a lack of formal credit history. This research proposes a model that utilises mobile phone usage patterns, such as call frequency, data usage, and mobile money transactions, to assess creditworthiness. A predictive model was developed using logistic regression, a machine-learning technique for binary classification. Initial results show a 90% prediction accuracy compared to less than 80% when using traditional methods, opening up possibilities for financial inclusion. Key challenges such as data privacy and regulatory compliance are discussed, highlighting future directions in this rapidly evolving field. Despite these challenges, extending credit services to the unbanked is urgent. The potential societal and economic benefits are significant, including new customer bases for financial institutions and improved credit access for unbanked individuals, leading to a more inclusive and prosperous society.
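A minimal sketch of the modeling step described in this abstract: a logistic regression on MNO-derived usage features for binary repayment classification. The feature names, file name, and hyperparameters are assumptions for illustration, not the paper's specification.

```python
# Sketch: logistic-regression credit scoring on mobile-usage features.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

df = pd.read_csv("mno_credit_training.csv")      # hypothetical file
X = df[["calls_per_week", "data_mb_per_week", "momo_txn_count",
        "momo_txn_value", "airtime_topup_value", "nights_active"]]
y = df["repaid"]                                  # 1 = repaid, 0 = defaulted

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```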
... Several studies have explored the implementation of machine learning algorithms to address public policy problems. These applications have occurred in fields as varied as health (Kleinberg et al., 2015), education (Rockoff et al., 2011), public finance (Zumaya et al., 2021), violence prevention (Chandler et al., 2011), justice (Kleinberg et al., 2018), poverty reduction (Blumenstock et al., 2015), food security (Hossain et al., 2019), official communications (Jungblut and Jungblut, 2022), among many others. Most of these applications fall into the category that Kleinberg et al. (2015) define as "prediction policy problems," that is public policy situations where causal inference is not so relevant, but the prediction of the circumstances under which an intervention will be more effective is. ...
Article
Full-text available
Public procurement is a fundamental aspect of public administration. Its vast size makes its oversight and control very challenging, especially in countries where resources for these activities are limited. To support decisions and operations at public procurement oversight agencies, we developed and delivered VigIA, a data-based tool with two main components: (i) machine learning models to detect inefficiencies measured as cost overruns and delivery delays, and (ii) risk indices to detect irregularities in the procurement process. These two components cover complementary aspects of the procurement process, considering both active and passive waste, and help the oversight agencies to prioritize investigations and allocate resources. We show how the models developed shed light on specific features of the contracts to be considered and how their values signal red flags. We also highlight how these values change when the analysis focuses on specific contract types or on information available for early detection. Moreover, the models and indices developed only make use of open data and target variables generated by the procurement processes themselves, making them ideal to support continuous decisions at overseeing agencies.
... Scientists have long suspected that human behavior is closely linked with socioeconomic status, as many of our daily routines are driven by activities related to maintaining or improving that status, or afforded by it [6,14,13]. Recent studies have provided empirical support and investigated these theories in vast and rich datasets (e.g., social media [20], phone records [29,10]) with varying scales and granularities [8,24]. ...
Preprint
The mapping of populations' socio-economic well-being is highly constrained by the logistics of censuses and surveys. Consequently, spatially detailed changes across scales of days, weeks, or months, or even year to year, are difficult to assess; thus the speed at which policies can be designed and evaluated is limited. However, recent studies have shown the value of mobile phone data as an enabling methodology for demographic modeling and measurement. In this work, we investigate whether indicators extracted from mobile phone usage can reveal information about the socio-economic status of microregions such as districts (i.e., average spatial resolution < 2.7 km). For this we examine anonymized mobile phone metadata combined with beneficiary records from an unemployment benefit program. We find that aggregated activity, social, and mobility patterns strongly correlate with unemployment. Furthermore, we construct a simple model to produce accurate reconstructions of district-level unemployment from mobile communication patterns alone. Our results suggest that reliable and cost-effective economic indicators could be built based on passively collected and anonymized mobile phone data. With similar data being collected every day by telecommunication services across the world, survey-based methods of measuring community socioeconomic status could potentially be augmented or replaced by such passive sensing methods in the future.
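A small sketch of the district-level analysis described above, under assumed column names: per-subscriber CDR features are aggregated to districts and correlated with the unemployment rate derived from benefit records.

```python
# Sketch: aggregate subscriber features to districts, correlate with unemployment.
import pandas as pd
from scipy.stats import pearsonr

subs = pd.read_csv("subscriber_features.csv")      # hypothetical per-subscriber features
district = subs.groupby("home_district")[["calls_per_day", "n_contacts",
                                           "radius_of_gyration_km"]].mean()
unemp = pd.read_csv("district_unemployment.csv", index_col="home_district")
merged = district.join(unemp["unemployment_rate"]).dropna()

for col in ["calls_per_day", "n_contacts", "radius_of_gyration_km"]:
    r, p = pearsonr(merged[col], merged["unemployment_rate"])
    print(f"{col}: r = {r:.2f} (p = {p:.3f})")
```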
... can be estimated from communication networks [33,50-53] or from external aggregate data [34]. However, they usually do not come together with information on individual purchase behaviour, which can best be estimated from anonymised purchase records [7,26]. ...
Preprint
We analyse a coupled dataset collecting the mobile phone communications and bank transaction history of a large number of individuals living in a Latin American country. After mapping the social structure and introducing indicators of socioeconomic status, demographic features, and purchasing habits of individuals, we show that typical consumption patterns are strongly correlated with identified socioeconomic classes, leading to patterns of stratification in the social structure. In addition we measure correlations between merchant categories and introduce a correlation network, which emerges with a meaningful community structure. We detect multivariate relations between merchant categories and show correlations in purchasing habits of individuals. Finally, by analysing individual consumption histories, we detect dynamical patterns in purchase behaviour and their correlations with the socioeconomic status, demographic characteristics, and egocentric social network of individuals. Our work provides novel and detailed insight into the relations between social and consumption behaviour, with potential applications in resource allocation, marketing, and recommendation system design.
... For instance, Michel et al. analyzed over one million books and presented results on the evolution of the English language as well as various cultural phenomena (Michel et al. 2011), and Blumenstock et al. used mobile phone data to predict poverty rates in Rwanda (Blumenstock, Cadamuro, and On 2015). ...
Preprint
Targeted socioeconomic policies require an accurate understanding of a country's demographic makeup. To that end, the United States spends more than 1 billion dollars a year gathering census data such as race, gender, education, occupation and unemployment rates. Compared to the traditional method of collecting surveys across many years which is costly and labor intensive, data-driven, machine learning driven approaches are cheaper and faster--with the potential ability to detect trends in close to real time. In this work, we leverage the ubiquity of Google Street View images and develop a computer vision pipeline to predict income, per capita carbon emission, crime rates and other city attributes from a single source of publicly available visual data. We first detect cars in 50 million images across 200 of the largest US cities and train a model to predict demographic attributes using the detected cars. To facilitate our work, we have collected the largest and most challenging fine-grained dataset reported to date consisting of over 2600 classes of cars comprised of images from Google Street View and other web sources, classified by car experts to account for even the most subtle of visual differences. We use this data to construct the largest scale fine-grained detection system reported to date. Our prediction results correlate well with ground truth income data (r=0.82), Massachusetts department of vehicle registration, and sources investigating crime rates, income segregation, per capita carbon emission, and other market research. Finally, we learn interesting relationships between cars and neighborhoods allowing us to perform the first large scale sociological analysis of cities using computer vision techniques.
... Datasets simultaneously disclosing the social structure and the socioeconomic indicators of a large number of individuals are still very rare. However, several promising directions have been proposed lately to estimate socioeconomic status from communication behaviour at the regional level [38,39,40] or even for individuals [42], just to mention a few. In future works these methods could be used to generalise our results to other countries using mobile communication datasets. ...
Preprint
The uneven distribution of wealth and individual economic capacities are among the main forces which shape modern societies and arguably bias the emerging social structures. However, the study of correlations between the social network and economic status of individuals is difficult due to the lack of large-scale multimodal data disclosing both the social ties and economic indicators of the same population. Here, we close this gap through the analysis of coupled datasets recording the mobile phone communications and bank transaction history of one million anonymised individuals living in a Latin American country. We show that wealth and debt are unevenly distributed among people in agreement with the Pareto principle; the observed social structure is strongly stratified, with people being better connected to others of their own socioeconomic class rather than to others of different classes; the social network appears with assortative socioeconomic correlations and tightly connected "rich clubs"; and that egos from the same class live closer to each other but commute further if they are wealthier. These results are based on a representative, society-large population, and empirically demonstrate some long-lasting hypotheses on socioeconomic correlations which potentially lay behind social segregation, and induce differences in human mobility.
... In resource-constrained environments where censuses and household surveys are infrequent, this approach creates an option to collect localized and timely information at a fraction of the cost of traditional methods [4]. b) Satellite Imagery (Nightlight Data) ...
Article
Full-text available
National Economic Survey (Susenas) data are used by Statistics Indonesia to calculate the poverty rate in Bandung. However, the traditional data collection method of interviewing households one by one is time-consuming, expensive, and may not capture a representative sample. This research therefore explores the use of time series models, specifically the AutoArima, Croston, and Exponential Smoothing algorithms, to predict the poverty rate in Bandung City. The poverty dataset used is sourced from the Bandung City Data Portal and covers 2010 to 2018. Three error metrics are used to evaluate the predicted poverty rate: MAE, MSE, and MASE. In the tests conducted, the AutoArima model performed best, with MAE = 0.183, MSE = 0.053, and MASE = 0.797; the Croston model produced MAE = 0.456, MSE = 0.374, and MASE = 1.985; and the Exponential Smoothing model produced MAE = 0.410, MSE = 0.215, and MASE = 1.786. From the three tests, it was concluded that the AutoArima model successfully predicted the poverty rate in Bandung City with good results.
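A sketch of the evaluation loop described above, with an illustrative series that is not the actual Bandung data: one candidate model is fitted to the historical poverty-rate series and scored with MAE, MSE, and MASE. Exponential smoothing from statsmodels stands in here; AutoARIMA (e.g., via the pmdarima package) and Croston would be scored the same way.

```python
# Sketch: fit a forecasting model to an annual poverty-rate series and score it.
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def mase(y_true, y_pred, y_train):
    """Mean absolute scaled error: forecast MAE divided by the in-sample MAE
    of the one-step naive forecast."""
    naive_mae = np.mean(np.abs(np.diff(y_train)))
    return np.mean(np.abs(y_true - y_pred)) / naive_mae

# Illustrative series only (not the Bandung data).
poverty = np.array([4.95, 4.78, 4.55, 4.61, 4.65, 4.32, 4.17, 4.46, 3.57])
train, test = poverty[:-2], poverty[-2:]

fit = ExponentialSmoothing(train, trend="add").fit()
pred = fit.forecast(len(test))
print("MAE :", np.mean(np.abs(test - pred)))
print("MSE :", np.mean((test - pred) ** 2))
print("MASE:", mase(test, pred, train))
```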
... AI models can analyze data at much finer spatial and temporal resolutions than traditional methods, offering detailed insights into poverty dynamics at the village or neighborhood level, which allows for more targeted and effective interventions (Blumenstock et al., 2020). ...
Thesis
Full-text available
Eradicating extreme poverty remains one of the most pressing global challenges. This study investigates the potential of integrating Artificial Intelligence (AI) and spatial datasets to predict poverty levels in Malawi. By using advanced machine learning models and high-resolution data, this research aims to provide actionable insights into poverty dynamics. The study focuses on using infrastructure-related datasets such as nightlights, roads, healthcare facilities, and built-up areas. The methodology starts with data collection from diverse sources, including satellite data and socio-economic surveys. Key datasets include nightlights data reflecting economic activity, road density data indicating accessibility, healthcare facilities data providing health service locations, and built-up areas data distinguishing between urban and rural regions. The Integrated Household Panel Survey (IHPS) offers detailed socio-economic data for model validation. Data preprocessing involves cleaning, standardizing, merging, and reprojecting raster datasets to a common coordinate system. Interpolation techniques align the gap poor percentage data with the spatial datasets. In the model development phase, a Random Forest Regressor machine learning model is trained to identify patterns influencing poverty. The model's performance is evaluated using metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R² Score. Validation is achieved through cross-validation techniques and comparing model predictions against interpolated poverty data. Results demonstrate an adequate predictive capability for poverty levels in Malawi, with high R² values indicating robust model performance (0.825). The correlation analysis reveals significant relationships between various infrastructure datasets and poverty levels, underscoring the relation between infrastructure and socio-economic conditions. The study highlights the scalability of AI-driven models, offering potential for broader application in other regions facing similar challenges and contributes to the broader field of AI for social good, demonstrating practical applications of AI in poverty eradication.
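A hedged sketch of the modeling step described in this thesis abstract: a Random Forest Regressor on infrastructure-derived features, evaluated with cross-validated MSE, RMSE, and R². The column and file names are hypothetical stand-ins for the actual gridded datasets.

```python
# Sketch: random forest on infrastructure features with cross-validated metrics.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

df = pd.read_csv("malawi_gridded_features.csv")   # hypothetical file
X = df[["nightlight_mean", "road_density_km",
        "dist_to_health_facility_km", "builtup_fraction"]]
y = df["poverty_gap_pct"]

rf = RandomForestRegressor(n_estimators=500, random_state=0)
scores = cross_validate(rf, X, y, cv=5,
                        scoring=("neg_mean_squared_error", "r2"))
mse = -scores["test_neg_mean_squared_error"].mean()
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
print("R^2 :", scores["test_r2"].mean())
```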
... Financial inclusion has been defined as "a state in which all people who can use them have access to a full suite of quality financial services, provided at affordable prices, in a convenient manner, and with dignity for the clients" (The Center for Financial Inclusion 2018). Policy makers across the world are increasingly interested in financial inclusion due to its potential to enable poverty reduction, at a time when a growing number of people are being excluded from formal financial and banking institutions in developing countries (Blumenstock, Cadamuro, and On 2015;Wang and Guan 2017;Serbeh, Adjei, and Forkuor 2022). ...
Article
Full-text available
Digital financial inclusion initiatives in developing countries have gained salience because of their potential to improve the socioeconomic condition of marginalized groups such as low-income women. However, persistent challenges remain in overcoming the digital divide in developing countries and enhancing women's access to and participation in digital financial services. Despite the growing scholarly attention, little is known about how digital technologies might be designed to enable the financial inclusion of women in developing countries. Using the technology affordances approach, we extend previous theorizing on inclusive information systems and introduce a relational approach to designing for inclusion. Specifically, we conduct a case study of a digital finance initiative in Ghana involving the design of an interactive voice response (IVR) system for low-income women, where systemic barriers to technology adoption and use are pervasive. We show the significance of user feedback, environmental factors, and affordances for more inclusive information system design. We contribute a theoretically grounded framework that takes holistic account of the sociotechnical context of IS design for inclusion.
Article
Measurement is not only a way of describing complex realities; it can also transform those realities by influencing policies. We live in an era of measurement innovation: new methods to deploy and new ways of adapting familiar, proven strategies to new contexts. This paper explores how new measurements provide fresh insights into the circumstances of small‐farm households worldwide and describes challenges that these techniques have yet to overcome. Because the small farm sector plays a crucial role in global food security, global value chains, and rural livelihoods, understanding its conditions is a persistent focus of policymakers and researchers. I discuss how measures including satellite‐based assessments of crop yields, tree cover, temperature, and rainfall, laboratory measures of soil and agricultural input quality, GPS‐based plot area calculations, labor activity trackers, and high‐frequency household surveys conducted via cellular phones are providing an improved understanding of fundamental dimensions of small farms and agrarian households. I identify important gaps in what is currently measured, discuss challenges related to implementing and interpreting new measures, and argue that new measurement strategies should be combined with continued investment for traditional “analog measures”—the household and farm surveys that remain fundamental for data collection in low‐ and middle‐income countries (LMICs).
Article
Understanding poverty dynamics is crucial to target and tailor economic policies in developing countries like Nigeria—a country at the risk of hosting about a quarter of all people living in poverty worldwide. To facilitate the targeting of poverty‐reducing interventions, we build a nationally representative panel dataset spanning 2011–2019 with more than a hundred covariates and apply econometric and machine learning tools to predict and examine factors associated with the static, transient, and persistent poverty status of Nigerian households. Results show that demographic factors, asset holdings, access to infrastructure, and housing indicators can accurately predict poverty in 80% of cases.
Article
China's poverty alleviation practice stands out globally, yet illuminatingly decoding its complex process remains challenging due to data and method fragmentation. This study proposed an integrated analytical framework combining multi-source data fusion, ensemble learning, and interpretable machine learning. By integrating multi-dimensional socioeconomic survey (SES) data with spatiotemporally continuous nighttime light (NTL) observations, the framework enables robust poverty prediction and mechanistic insights in data-rich and data-scarce contexts. The method was applied to examine poverty reduction dynamics in Yunnan-Guizhou-Guangxi area (YGGA) of Southwest China. Across 328 county-level units, the average poverty incidence markedly decreased from 33.38% to 12.42% (2000–2019), characterized by three phases: widespread high poverty (2000–2005), uneven regional improvement (2006–2012), and transformative poverty reduction (2013–2019). Spatio-temporal analysis uncovered the transformation from highly clustered poverty to a more dispersed distribution. Through interpretable machine learning, the study analyzed 24 driving factors in three categories. While economic & demographic indicators became increasingly dominant, the persistent influence of geographical & environmental, and social & infrastructure indicators underscored the necessity for an integrated approach to poverty governance. This study provided insights for China's post-poverty alleviation era while contributing towards inclusive growth within global sustainable development framework.
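The interpretable-ML step described above can be illustrated with a simple stand-in: a boosted-tree model on fused socioeconomic and nighttime-light features, interpreted with permutation importance (the study's own interpretability method may differ, e.g., SHAP-style attributions). Feature and file names are hypothetical.

```python
# Sketch: boosted trees on fused SES + NTL features, interpreted with
# permutation importance as a simple stand-in for interpretable ML.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

df = pd.read_csv("ygga_counties.csv")             # hypothetical file
X = df[["ntl_mean", "slope_deg", "road_density", "gdp_per_capita",
        "pop_density", "share_rural", "elevation_m"]]
y = df["poverty_incidence_pct"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

imp = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)
for name, score in sorted(zip(X.columns, imp.importances_mean), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```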
Book
Full-text available
Embark on a journey into the heart of a new industrial revolution―one that promises to redefine human mobility for generations to come. In this groundbreaking exploration, we confront the promises and perils of new mobility, navigating the intricate landscape where technology intersects with urban society. As cities evolve and technology shapes our daily lives, the ethical dimensions of this transformation remain largely uncharted territory. Amidst the rapid advancement of new mobility systems, this book sheds light on the moral dilemmas and philosophical underpinnings that often go unnoticed. From the ethical implications of technology to the systemic flaws in planning and design, we delve into the core of this paradigm shift. By understanding the foundational principles of mobility and the hidden codes that govern human movement, we pave the way for a more equitable and inclusive future. At the heart of this transformative vision lies a comprehensive framework for building a new mobility ecosystem―one that prioritizes human well-being and equity above all else. Through innovative planning processes and redesign concepts, we aim to bridge the gap between technology and society, ensuring that every individual has access to safe, efficient, and sustainable modes of transportation. From low-emission vehicles to multimodal transit hubs, this book presents a blueprint for reimagining urban spaces and redefining the way we move. By embracing shared values and collective responsibility, we strive to create a world where mobility is not just a privilege, but a fundamental human right. As we embark on this journey towards a more sustainable future, let us remember that the true measure of progress lies not in technological innovation alone, but in our ability to build communities that thrive together. Join us in shaping the future of mobility―one where humanity and equity reign supreme.
Article
Social protection programs have become increasingly widespread in low- and middle-income countries, with their own distinct characteristics to match the environments in which they are operating. This paper reviews the growing literature on the design and impact of these programs. We review how to identify potential beneficiaries given the large informal sector, the design and implementation of redistribution and income support programs, and the challenges and potential of social insurance. We use our frameworks as a guide for consolidating and organizing the existing literature and also to highlight areas and questions for future research. (JEL E26, H23, I13, I32, I38, O12, O16)
Article
Full-text available
This article explores the phenomenon of big data and its potential for use in the public sector. It argues that big data is not just "more of the same", but that it faces challenges which must be met to open the door to the enormous advantages of algorithms and massive data while guaranteeing to citizens that its less desirable effects will not appear. A scientific perspective is adopted to emphasize the enormous potential of big data, its main limitations, and, most importantly, the intermediate arena formed by challenges and uncertainties typical of a dynamic and still unstable phenomenon.
Article
Full-text available
This study sought to establish the prevalence, morphological classifications, and factors associated with severe anemia among children attending Itojo Hospital. A hospital-based cross-sectional study design was used, in which children aged less than 5 years who attended the pediatric ward at Itojo Hospital were involved. Patients were consecutively recruited until a sample size of 296 was obtained. Data were collected from patients' caregivers with a structured questionnaire. Data analysis was done using SPSS Version 20.0. Descriptive statistics and bivariate and multivariate logistic regression were used during data analysis. Multiple logistic regression models were used to show the strength of the relationship and the likelihood that each of the factors would lead to severe anemia among children under 5 years. Of the 296 patients enrolled, the prevalence of severe anemia was 13.9%. The majority of patients (50.7%) had microcytic anemia, followed by 32.8% with normocytic anemia. Factors significantly associated with severe anemia were the age of the child (P=0.029), HIV/AIDS (P=0.000), leukemia (P=0.000), and sickle cell disease (P=0.000). The prevalence of severe anemia among children less than five years of age was found to be relatively high, and it is increasingly becoming a public health problem. There is a need for age-specific interventions that comprehensively address improved nutrition, and the prevention and management of HIV infection as well as chronic and genetic disorders.
Article
Full-text available
The increased use of nighttime lights (NTL) to assess infrastructure implementation and socioeconomic development highlights the potential of this open data source, often used as a proxy indicator of economic dynamics. Many studies focus on supra-national levels and the quantification of light emissions, generating assumptions regarding development. However, fewer studies address the characterization of socio-spatial dynamics at the local level. This research analyses the Nacala corridor in Mozambique, aiming to challenge the assumption that increasing NTL levels equals local development. We qualify and contextualize the types of activities identified by nighttime light anomalies. Using data cubes with 10-year seasonal NTL emissions, we identified anomalies in the time series of 17 out of 74 settlements and subsequently analyzed them with very high-resolution images. Among these settlements, we identified soil extraction, quarrying, or industries in 13 cases. Finally, we compared the results with household surveys indicating that during the period, the population had no significant increase in access to energy. We conclude that the NTL time series can effectively portray infrastructure-driven activities, such as surface mining and industry, in the context of the Corridor. However, the assumption that local development is linked with an increase in NTL in non-urbanized areas can be misleading without qualitative analysis. The activities that are the source of radiance can be illicit, not socially adopted, economically concentrated, and/or environmentally harmful.
Article
Poverty prediction models are used to address missing data issues in a variety of contexts such as poverty profiling, targeting with proxy-means tests, cross-survey imputations such as poverty mapping, top and bottom income studies, or vulnerability analyses. Based on the models used by this literature, this paper conducts a study by artificially corrupting data clear of missing incomes with different patterns and shares of missing incomes. It then compares the capacity of classic econometric and machine-learning models to predict poverty under different scenarios with full information on observed and unobserved incomes, and the true counterfactual poverty rate. Random forest provides more consistent and accurate predictions under most but not all scenarios.
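An illustrative sketch of the experimental design described in this abstract, under assumed data: incomes in a complete survey are artificially masked, predicted back with a model, and the implied poverty rate is compared with the true rate computed on the full data. Only one missingness scenario (30% missing completely at random) is shown; the paper varies both the pattern and the share of missing incomes.

```python
# Sketch: mask incomes, predict them back, compare implied vs. true poverty rate.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def poverty_rate(income, line):
    return float((income < line).mean())

df = pd.read_csv("household_survey.csv")          # hypothetical, numeric covariates, no missing incomes
X, y = df.drop(columns=["income"]), df["income"]
line = df["income"].quantile(0.4)                 # illustrative poverty line

rng = np.random.default_rng(0)
missing = rng.random(len(df)) < 0.3               # 30% missing completely at random

model = RandomForestRegressor(n_estimators=300, random_state=0)
model.fit(X[~missing], y[~missing])
y_filled = y.copy()
y_filled[missing] = model.predict(X[missing])

print("true poverty rate     :", poverty_rate(y, line))
print("predicted poverty rate:", poverty_rate(y_filled, line))
```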
Chapter
Full-text available
The yield potential of rice and wheat has doubled with Green Revolution technologies, particularly in Asia. This high-input production system needs the best quality pesticides, fertilizers, machines, and irrigation facilities. However, neglecting the ecological integrity of land, water, and forest resources, and endangering natural resources, cannot be carried on for long. Primitive and natural practices of agriculture might be able to play the leading role in designing a sustainable and eco-friendly system of agriculture that would increase the likelihood that rural people would accept it, develop it, and maintain its interventions and innovations. From this perspective, eco-friendly systems are considered to be environmentally friendly, biodegradable, safe, economical, and renewable substitutes for use in the organic method of farming, also called eco-friendly farming. The answer to all the problems faced by farmers in agriculture is eco-friendly or organic farming, and this new system would make agriculture more sustainable. A sample of 109 experts from the agriculture and environment fields was surveyed to identify the benefits, issues, and impact of eco-friendly farming in India. The survey found that organic farming has a significant impact on the environment.
Article
We are in the early stages of a new era of demographic research that offers exciting opportunities to quantify demographic phenomena at a scale and resolution once unimaginable. These scientific possibilities are opened up by new sources of data, such as the digital traces that arise from ubiquitous social computing, massive longitudinal datasets produced by the digitization of historical records, and information about previously inaccessible populations reached through innovations in classic modes of data collection. In this commentary, we describe five promising new sources of demographic data and their potential appeal. We identify cross‐cutting challenges shared by these new data sources and argue that realizing their full potential will demand both innovative methodological developments and continued investment in high‐quality, traditional surveys and censuses. Despite these considerable challenges, the future is bright: these new sources of data will lead demographers to develop new theories and revisit and sharpen old ones.
Chapter
Population distribution and migration patterns are crucial aspects of human geography, influencing a variety of factors such as social, economic, political, and environmental dynamics. Understanding these patterns is essential for urban planning, policy-making, and resource allocation. In this section, we will discuss the definitions and concepts related to population distribution and migration patterns.
Chapter
Human geography and urban planning have long relied on a variety of data sources to understand spatial patterns, analyze demographic trends, and inform policy decisions.
Chapter
This chapter introduces and prepares the background needed to understand recent technological advancements and their impact on society. It highlights, in time and space, the significant events and innovations that have led to digital transformation and to emerging technologies such as artificial intelligence (AI). The chapter traces the development of information and communication technology (ICT), along with the products, services, and trends that affect daily life and are transforming socio-economic systems from the home to business and the economy, education, governance, politics, entertainment, and sports. Digital transformation and AI have, however, also posed challenges for society, including job losses, misinformation, fake news, deep fakes, shifts in cultural value systems, embedded discrimination, amplification of biases, propaganda, and erosion of democratic values. The chapter thus provides a comprehensive background to guide readers through the intricate interplay between digital transformation, artificial intelligence, and society, and to explore its opportunities and challenges.
Article
Full-text available
The ongoing Ebola outbreak is taking place in one of the most highly connected and densely populated regions of Africa (Figure 1A). Accurate information on population movements is valuable for monitoring the progression of the outbreak and predicting its future spread, facilitating the prioritization of interventions and designing surveillance and containment strategies. Vital questions include how the affected regions are connected by population flows, which areas are major mobility hubs, what types of movement typologies exist in the region, and how all of these factors are changing as people react to the outbreak and movement restrictions are put in place. Just a decade ago, obtaining detailed and comprehensive data to answer such questions over this huge region would have been impossible. Today, such valuable data exist and are collected in real-time, but largely remain unused for public health purposes, stored on the servers of mobile phone operators. In this commentary, we outline the utility of CDRs for understanding human mobility in the context of the Ebola outbreak, and highlight the need to develop protocols for rapid sharing of operator data in response to public health emergencies.
Article
Full-text available
Large-scale data sets of human behavior have the potential to fundamentally transform the way we fight diseases, design cities, or perform research. Metadata, however, contain sensitive information. Understanding the privacy of these data sets is key to their broad use and, ultimately, their impact. We study 3 months of credit card records for 1.1 million people and show that four spatiotemporal points are enough to uniquely reidentify 90% of individuals. We show that knowing the price of a transaction increases the risk of reidentification by 22%, on average. Finally, we show that even data sets that provide coarse information at any or all of the dimensions provide little anonymity and that women are more reidentifiable than men in credit card metadata. Copyright © 2015, American Association for the Advancement of Science.
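The unicity measure underlying this result can be approximated as follows: repeatedly draw a few spatiotemporal points from one user's trace and count how many users' traces contain all of them. The sketch below runs on synthetic traces; the data structure and parameters are assumptions for illustration, not the paper's dataset.

```python
import random

def unicity(traces, k=4, n_trials=1000, seed=0):
    """Estimate the share of users uniquely re-identified by k random
    spatiotemporal points drawn from their own trace.

    traces : dict user_id -> set of (place, time_bin) tuples
    """
    rng = random.Random(seed)
    users = list(traces)
    unique = 0
    for _ in range(n_trials):
        u = rng.choice(users)
        points = set(rng.sample(sorted(traces[u]), k))
        matches = sum(1 for v in users if points <= traces[v])
        unique += (matches == 1)          # only u's trace contains all k points
    return unique / n_trials

# Toy traces: 500 users, 20 (place, day) tuples each
rng = random.Random(1)
traces = {u: {(rng.randrange(200), rng.randrange(90)) for _ in range(20)}
          for u in range(500)}
print("estimated unicity at k=4:", unicity(traces))
```

Coarsening the place or time bins shrinks the space of possible points and lowers the estimated unicity, which is the "coarse information" effect the paper quantifies.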
Article
Full-text available
Significance: Knowing where people are is critical for accurate impact assessments and intervention planning, particularly those focused on population health, food security, climate change, conflicts, and natural disasters. This study demonstrates how data collected by mobile phone network operators can cost-effectively provide accurate and detailed maps of population distribution over national scales and any time period while guaranteeing phone users’ privacy. The methods outlined may be applied to estimate human population densities in low-income countries where data on population distributions may be scarce, outdated, and unreliable, or to estimate temporal variations in population density. The work highlights how facilitating access to anonymized mobile phone data might enable fast and cheap production of population maps in emergency and data-scarce situations.
Article
Full-text available
In this study we analyze the travel patterns of 500,000 individuals in Cote d'Ivoire using mobile phone call data records. By measuring the uncertainties of movements using entropy, considering both the frequencies and temporal correlations of individual trajectories, we find that the theoretical maximum predictability is as high as 88%. To verify whether such a theoretical limit can be approached, we implement a series of Markov chain (MC) based models to predict the actual locations visited by each user. Results show that MC models can produce a prediction accuracy of 87% for stationary trajectories and 95% for non-stationary trajectories. Our findings indicate that human mobility is highly dependent on historical behaviors, and that the maximum predictability is not only a fundamental theoretical limit for potential predictive power, but also an approachable target for actual prediction accuracy.
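A first-order Markov predictor of the kind evaluated here is easy to sketch: count transitions between successive locations on a training portion of a trajectory and predict the most frequent next location, falling back to the globally most visited place for unseen states. The toy trajectory below is synthetic and only demonstrates the mechanics.

```python
import random
from collections import Counter, defaultdict

def markov_accuracy(trajectory, train_frac=0.8):
    """Predict each next location with a first-order Markov model and
    report accuracy on the held-out tail of a single user's trajectory."""
    split = int(len(trajectory) * train_frac)
    train, test = trajectory[:split], trajectory[split - 1:]
    transitions = defaultdict(Counter)
    for a, b in zip(train, train[1:]):
        transitions[a][b] += 1
    fallback = Counter(train).most_common(1)[0][0]      # globally top location
    hits = 0
    for a, b in zip(test, test[1:]):
        pred = transitions[a].most_common(1)[0][0] if transitions[a] else fallback
        hits += (pred == b)
    return hits / (len(test) - 1)

# Toy trajectory: a commuter with occasional market trips instead of work
rng = random.Random(0)
traj = []
for _ in range(300):
    traj += ["home", "work"] if rng.random() > 0.1 else ["home", "market"]
print("next-location accuracy:", round(markov_accuracy(traj), 3))
```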
Article
Full-text available
Understanding the causes and effects of internal migration is critical to the effective design and implementation of policies that promote human development. However, a major impediment to deepening this understanding is the lack of reliable data on the movement of individuals within a country. Government censuses and household surveys, from which most migration statistics are derived, are difficult to coordinate and costly to implement, and typically do not capture the patterns of temporary and circular migration that are prevalent in developing economies. In this paper, we describe how new information and communications technologies (ICTs), and mobile phones in particular, can provide a new source of data on internal migration. As these technologies quickly proliferate throughout the developing world, billions of individuals are now carrying devices from which it is possible to reconstruct detailed trajectories through time and space. Using Rwanda as a case study, we demonstrate how such data can be used in practice. We develop and formalize the concept of inferred mobility, and compute this and other metrics on a large data set containing the phone records of 1.5 million Rwandans over four years. Our empirical results corroborate the findings of a recent government survey that notes relatively low levels of permanent migration in Rwanda. However, our analysis reveals more subtle patterns that were not detected in the government survey. Namely, we observe high levels of temporary and circular migration, and note significant heterogeneity in mobility within the Rwandan population. Our goals in this research are thus twofold. First, we intend to provide a new quantitative perspective on certain patterns of internal migration in Rwanda that are unobservable using standard survey techniques. Second, we seek to contribute to the broader literature by illustrating how new forms of ICT can be used to better understand the behavior of individuals in developing countries.
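A crude version of the inferred-mobility idea can be sketched by assigning each subscriber a modal tower per month and flagging months spent away from the usual home tower; it is exactly this kind of signal that reveals temporary and circular moves invisible to one-off surveys. Column names and the toy records below are illustrative, not the paper's schema.

```python
import pandas as pd

def monthly_home(cdr):
    """Assign each subscriber a modal tower per month and flag months in
    which the inferred location differs from the usual home tower.

    cdr : DataFrame with columns user, month, tower (illustrative names).
    """
    modal = (cdr.groupby(["user", "month"])["tower"]
                .agg(lambda s: s.mode().iloc[0])
                .rename("month_home"))
    usual = (modal.groupby(level="user")
                  .agg(lambda s: s.mode().iloc[0])
                  .rename("usual_home")
                  .reset_index())
    out = modal.reset_index().merge(usual, on="user")
    out["away"] = out["month_home"] != out["usual_home"]
    return out

cdr = pd.DataFrame({
    "user":  [1] * 6 + [2] * 6,
    "month": list(range(1, 7)) * 2,
    "tower": ["A", "A", "A", "B", "B", "A",    # user 1: temporary move to tower B
              "C", "C", "C", "C", "C", "C"],   # user 2: never moves
})
print(monthly_home(cdr))
```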
Article
Full-text available
The article introduces a model for the location of meaningful places for mobile telephone users, such as home and work anchor points, using passive mobile positioning data. Passive mobile positioning data is secondary data concerning the location of call activities or handovers in network cells that is automatically stored in the memory of service providers. This data source offers good potential for the monitoring of the geography and mobility of the population, since mobile phones are widespread, and similar standardized data can be used around the globe. We developed the model and tested it with 12 months' data collected by EMT, Estonia's largest mobile service provider, covering more than 0.5 million anonymous respondents. Modeling results were compared with population register data; this revealed that the developed model described the geography of the population relatively well, and can hence be used in geographical and urban studies. This approach also has potential for the development of location-based services such as targeting services or geographical infrastructure.
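The anchor-point logic can be approximated by splitting each user's positioning events into night-time and working-hour windows and taking the modal cell in each window as the home and work anchor, respectively. The hour cut-offs and column names below are assumptions for illustration, not the model's actual parameters.

```python
import pandas as pd

def anchor_points(events):
    """Infer home and work anchor cells from passive positioning events by
    splitting activity into night-time and working-hour windows.

    events : DataFrame with columns user, cell, hour (illustrative names).
    """
    night = events[(events["hour"] >= 20) | (events["hour"] < 7)]
    work = events[(events["hour"] >= 9) & (events["hour"] < 17)]
    home_cell = night.groupby("user")["cell"].agg(lambda s: s.mode().iloc[0])
    work_cell = work.groupby("user")["cell"].agg(lambda s: s.mode().iloc[0])
    return pd.DataFrame({"home": home_cell, "work": work_cell})

events = pd.DataFrame({
    "user": [1] * 8,
    "cell": ["H", "H", "W", "W", "W", "H", "W", "H"],
    "hour": [22, 6, 10, 14, 16, 23, 11, 21],
})
print(anchor_points(events))
```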
Article
Full-text available
Finite automata are considered in this paper as instruments for classifying finite tapes. Each one-tape automaton defines a set of tapes, a two-tape automaton defines a set of pairs of tapes, et cetera. The structure of the defined sets is studied. Various generalizations of the notion of an automaton are introduced and their relation to the classical automata is determined. Some decision problems concerning automata are shown to be solvable by effective algorithms; others turn out to be unsolvable by algorithms.
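As a concrete instance of "a one-tape automaton defining a set of tapes", the sketch below runs a deterministic finite automaton that accepts binary tapes containing an even number of 1s; the machine is purely illustrative and not taken from the paper.

```python
def accepts(tape, transitions, start, accepting):
    """Run a deterministic finite automaton over a tape (a string of symbols)
    and report whether it halts in an accepting state."""
    state = start
    for symbol in tape:
        state = transitions[(state, symbol)]
    return state in accepting

# DFA over {0, 1} accepting tapes with an even number of 1s
delta = {("even", "0"): "even", ("even", "1"): "odd",
         ("odd", "0"): "odd",   ("odd", "1"): "even"}
for tape in ["", "1", "1010", "111"]:
    print(repr(tape), accepts(tape, delta, "even", {"even"}))
```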
Article
Full-text available
Model selection strategies for machine learning algorithms typically involve the numerical optimisation of an appropriate model selection criterion, often based on an estimator of generalisation performance, such as k-fold cross-validation. The error of such an estimator can be broken down into bias and variance components. While unbiasedness is often cited as a beneficial quality of a model selection criterion, we demonstrate that a low variance is at least as important, as a non-negligible variance introduces the potential for over-fitting in model selection as well as in training the model. While this observation is in hindsight perhaps rather obvious, the degradation in performance due to over-fitting the model selection criterion can be surprisingly large, an observation that appears to have received little attention in the machine learning literature to date. In this paper, we show that the effects of this form of over-fitting are often of comparable magnitude to differences in performance between learning algorithms, and thus cannot be ignored in empirical evaluation. Furthermore, we show that some common performance evaluation practices are susceptible to a form of selection bias as a result of this form of over-fitting and hence are unreliable. We discuss methods to avoid over-fitting in model selection and subsequent selection bias in performance evaluation, which we hope will be incorporated into best practice. While this study concentrates on cross-validation based model selection, the findings are quite general and apply to any model selection practice involving the optimisation of a model selection criterion evaluated over a finite sample of data, including maximisation of the Bayesian evidence and optimisation of performance bounds.
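The practical remedy usually recommended for this problem is nested cross-validation: tuning happens inside each outer training fold, so the outer estimate is not contaminated by the selection step. A minimal sketch with scikit-learn; the model, grid, and data are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}

inner = KFold(n_splits=5, shuffle=True, random_state=1)
outer = KFold(n_splits=5, shuffle=True, random_state=2)

# Optimistically biased: the score of the configuration that won model selection
search = GridSearchCV(SVC(), param_grid, cv=inner).fit(X, y)
print("best inner-CV score (biased by selection):", round(search.best_score_, 3))

# Nested CV: model selection is repeated inside every outer training fold,
# so the outer score is not inflated by over-fitting the selection criterion
nested = cross_val_score(GridSearchCV(SVC(), param_grid, cv=inner), X, y, cv=outer)
print("nested CV estimate:", round(nested.mean(), 3))
```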
Article
Full-text available
Despite recent advances in uncovering the quantitative features of stationary human activity patterns, many applications, from pandemic prediction to emergency response, require an understanding of how these patterns change when the population encounters unfamiliar conditions. To explore societal response to external perturbations we identified real-time changes in communication and mobility patterns in the vicinity of eight emergencies, such as bomb attacks and earthquakes, comparing these with eight non-emergencies, like concerts and sporting events. We find that communication spikes accompanying emergencies are both spatially and temporally localized, but information about emergencies spreads globally, resulting in communication avalanches that engage in a significant manner the social network of eyewitnesses. These results offer a quantitative view of behavioral changes in human activity under extreme conditions, with potential long-term impact on emergency detection and response.
Article
Full-text available
Small area estimation is becoming important in survey sampling due to a growing demand for reliable small area statistics from both public and private sectors. It is now widely recognized that direct survey estimates for small areas are likely to yield unacceptably large standard errors due to the smallness of sample sizes in the areas. This makes it necessary to "borrow strength" from related areas to find more accurate estimates for a given area or, simultaneously, for several areas. This has led to the development of alternative methods such as synthetic, sample size dependent, empirical best linear unbiased prediction, empirical Bayes and hierarchical Bayes estimation. The present article is largely an appraisal of some of these methods. The performance of these methods is also evaluated using some synthetic data resembling a business population. Empirical best linear unbiased prediction as well as empirical and hierarchical Bayes, for most purposes, seem to have a distinct advantage over other methods.
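The "borrowing strength" idea can be sketched as a composite estimator that shrinks each area's direct survey estimate toward a regression-synthetic prediction, with more shrinkage where the direct estimate is noisier. For brevity the sketch treats the regression coefficients and the between-area variance as known, whereas in EBLUP or empirical Bayes estimation they would themselves be estimated from the data.

```python
import numpy as np

def composite_estimates(direct, var_direct, x, beta, sigma_v2):
    """Shrink direct small-area estimates toward a regression-synthetic
    prediction, in the spirit of EBLUP / empirical Bayes estimation.

    direct     : direct survey estimate per area
    var_direct : its sampling variance (large when the area sample is small)
    x, beta    : auxiliary covariates and regression coefficients
    sigma_v2   : between-area model variance (treated as known here)
    """
    synthetic = x @ beta
    gamma = sigma_v2 / (sigma_v2 + var_direct)     # weight on the direct estimate
    return gamma * direct + (1.0 - gamma) * synthetic

rng = np.random.default_rng(0)
x = np.column_stack([np.ones(10), rng.normal(size=10)])
beta = np.array([50.0, 5.0])
truth = x @ beta + rng.normal(scale=2.0, size=10)
var_direct = rng.uniform(1.0, 25.0, size=10)       # small samples -> big variance
direct = truth + rng.normal(scale=np.sqrt(var_direct))
est = composite_estimates(direct, var_direct, x, beta, sigma_v2=4.0)
print("mean abs error, direct:   ", round(np.abs(direct - truth).mean(), 2))
print("mean abs error, composite:", round(np.abs(est - truth).mean(), 2))
```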
Article
Full-text available
A field is emerging that leverages the capacity to collect and analyze data at a scale that may reveal patterns of individual and group behaviors.
Article
Full-text available
The rich set of interactions between individuals in society results in complex community structure, capturing highly connected circles of friends, families or professional cliques in a social network. Thanks to frequent changes in the activity and communication patterns of individuals, the associated social and communication network is subject to constant evolution. Our knowledge of the mechanisms governing the underlying community dynamics is limited, but is essential for a deeper understanding of the development and self-optimization of society as a whole. We have developed an algorithm based on clique percolation that allows us to investigate the time dependence of overlapping communities on a large scale, and thus uncover basic relationships characterizing community evolution. Our focus is on networks capturing the collaboration between scientists and the calls between mobile phone users. We find that large groups persist for longer if they are capable of dynamically altering their membership, suggesting that an ability to change the group composition results in better adaptability. The behaviour of small groups displays the opposite tendency: the condition for stability is that their composition remains unchanged. We also show that knowledge of the time commitment of members to a given community can be used for estimating the community's lifetime. These findings offer insight into the fundamental differences between the dynamics of small groups and large institutions.
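Clique percolation itself is straightforward to experiment with; networkx exposes it as k_clique_communities, which returns the overlapping communities formed by adjacent k-cliques. The toy graph below yields two communities sharing a single member; tracking their evolution, as in the paper, would repeat this on successive network snapshots and match communities between snapshots by overlap.

```python
import networkx as nx
from networkx.algorithms.community import k_clique_communities

# Two clique-dense call circles sharing one member (node 3)
G = nx.Graph()
G.add_edges_from([(0, 1), (0, 2), (1, 2), (1, 3), (2, 3),      # group A
                  (3, 4), (3, 5), (4, 5), (4, 6), (5, 6)])     # group B

# Communities as unions of adjacent k-cliques (clique percolation, k = 3)
for i, community in enumerate(k_clique_communities(G, 3)):
    print("community", i, "->", sorted(community))
```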
Article
Full-text available
Electronic databases, from phone to e-mails logs, currently provide detailed records of human communication patterns, offering novel avenues to map and explore the structure of social and communication networks. Here we examine the communication patterns of millions of mobile phone users, allowing us to simultaneously study the local and the global structure of a society-wide communication network. We observe a coupling between interaction strengths and the network's local structure, with the counterintuitive consequence that social networks are robust to the removal of the strong ties but fall apart after a phase transition if the weak ties are removed. We show that this coupling significantly slows the diffusion process, resulting in dynamic trapping of information in communities and find that, when it comes to information diffusion, weak and strong ties are both simultaneously ineffective. Keywords: complex systems, complex networks, diffusion and spreading, phase transition, social systems.
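The weak-tie result can be illustrated by ranking edges by weight and deleting them from either end of the ranking while tracking the relative size of the giant component. The sketch below uses a planted-partition graph with heavy within-group and light between-group ties as a stand-in for a real call graph, so the numbers are only qualitative.

```python
import random
import networkx as nx

rng = random.Random(0)

# Toy weighted call graph: dense communities joined by a few weak bridges
G = nx.planted_partition_graph(8, 25, 0.4, 0.01, seed=0)
for u, v in G.edges:
    # strong ties inside blocks, weak ties between blocks (a crude proxy)
    G[u][v]["weight"] = rng.randint(20, 60) if u // 25 == v // 25 else rng.randint(1, 3)

def giant_share(graph):
    return max(len(c) for c in nx.connected_components(graph)) / graph.number_of_nodes()

def remove_fraction(graph, frac, weakest_first):
    H = graph.copy()
    edges = sorted(H.edges(data="weight"), key=lambda e: e[2], reverse=not weakest_first)
    H.remove_edges_from([(u, v) for u, v, _ in edges[: int(frac * len(edges))]])
    return giant_share(H)

for frac in (0.5, 0.8):
    print(f"remove {frac:.0%} of ties: weak-first -> {remove_fraction(G, frac, True):.2f}, "
          f"strong-first -> {remove_fraction(G, frac, False):.2f}")
```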
Article
Full-text available
Novel aspects of human dynamics and social interactions are investigated by means of mobile phone data. Using extensive phone records resolved in both time and space, we study the mean collective behavior at large scales and focus on the occurrence of anomalous events. We discuss how these spatiotemporal anomalies can be described using standard percolation theory tools. We also investigate patterns of calling activity at the individual level and show that the interevent time of consecutive calls is heavy-tailed. This finding, which has implications for dynamics of spreading phenomena in social networks, agrees with results previously reported on other human activities.
Article
We propose a new method for estimation in linear models. The ‘lasso’ minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree‐based models are briefly described.
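The lasso solves least squares with an L1 penalty (equivalently, a bound on the sum of absolute coefficients), which is what drives some coefficients exactly to zero. A minimal sketch on synthetic data with only a few truly relevant predictors; the penalty strength alpha is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
true_coef = np.zeros(p)
true_coef[:3] = [4.0, -2.0, 1.5]                 # only 3 of 20 predictors matter
y = X @ true_coef + rng.normal(scale=1.0, size=n)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.2).fit(X, y)               # alpha tunes the L1 budget

print("nonzero OLS coefficients:  ", int(np.sum(np.abs(ols.coef_) > 1e-8)))
print("nonzero lasso coefficients:", int(np.sum(np.abs(lasso.coef_) > 1e-8)))
```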
Book
The methodology used to construct tree structured rules is the focus of this monograph. Unlike many other statistical procedures, which moved from pencil and paper to calculators, this text's use of trees was unthinkable before computers. Both the practical and theoretical sides have been developed in the authors' study of tree methods. Classification and Regression Trees reflects these two sides, covering the use of trees as a data analysis method, and in a more mathematical framework, proving some of their fundamental properties.
Article
The prevalence of mobile communication in the developing world is ever increasing, with now 89 active subscriptions per 100 inhabitants. With this access comes the potential for unprecedented insights into individuals and societies, such as migration patterns, economic transactions, and even importation routes of infectious diseases like Ebola. However, the absence of a common framework for sharing mobile phone data in privacy-conscientious ways, together with an uncertain regulatory landscape, has made it difficult for scientists to use these powerful data.
Article
Election forecasts have traditionally been based on representative polls, in which randomly sampled individuals are asked who they intend to vote for. While representative polling has historically proven to be quite effective, it comes at considerable costs of time and money. Moreover, as response rates have declined over the past several decades, the statistical benefits of representative sampling have diminished. In this paper, we show that, with proper statistical adjustment, non-representative polls can be used to generate accurate election forecasts, and that this can often be achieved faster and at a lesser expense than traditional survey methods. We demonstrate this approach by creating forecasts from a novel and highly non-representative survey dataset: a series of daily voter intention polls for the 2012 presidential election conducted on the Xbox gaming platform. After adjusting the Xbox responses via multilevel regression and poststratification, we obtain estimates which are in line with the forecasts from leading poll analysts, which were based on aggregating hundreds of traditional polls conducted during the election cycle. We conclude by arguing that non-representative polling shows promise not only for election forecasting, but also for measuring public opinion on a broad range of social, economic and cultural issues.
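The adjustment step can be sketched in its simplest form as poststratification: estimate the outcome within demographic cells of the non-representative sample and reweight the cell estimates by known population shares (the full method fits a multilevel regression to stabilize the cell estimates, which this sketch omits). The poll and census shares below are invented for illustration.

```python
import pandas as pd

# Toy non-representative poll: young respondents are heavily over-sampled
poll = pd.DataFrame({
    "age_group": ["18-29"] * 60 + ["30-64"] * 30 + ["65+"] * 10,
    "vote_a":    [1] * 40 + [0] * 20 + [1] * 12 + [0] * 18 + [1] * 3 + [0] * 7,
})

# Known population shares for the same cells (e.g., from a census)
census_share = {"18-29": 0.20, "30-64": 0.55, "65+": 0.25}

raw = poll["vote_a"].mean()
cell_means = poll.groupby("age_group")["vote_a"].mean()
poststratified = sum(cell_means[g] * w for g, w in census_share.items())

print(f"raw poll estimate:       {raw:.3f}")
print(f"poststratified estimate: {poststratified:.3f}")
```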
Article
Large errors in flu prediction were largely avoidable, which offers lessons for the use of big data.
Article
The Suomi National Polar-Orbiting Partnership (NPP) satellite was launched on 28 October 2011, heralding the next generation of operational U.S. polar-orbiting satellites. It carries the Visible Infrared Imaging Radiometer Suite (VIIRS), a 22-band visible/infrared sensor that combines many of the best aspects of the NOAA Advanced Very High Resolution Radiometer (AVHRR), the Defense Meteorological Satellite Program (DMSP) Operational Linescan System (OLS), and the National Aeronautics and Space Administration (NASA) Moderate Resolution Imaging Spectroradiometer (MODIS) sensors. VIIRS has nearly all the capabilities of MODIS, but offers a wider swath width (3,000 versus 2,330 km) and much higher spatial resolution at swath edge. VIIRS also has a day/night band (DNB) that is sensitive to very low levels of visible light at night such as those produced by moonlight reflecting off low clouds, fog, dust, ash plumes, and snow cover. In addition, VIIRS detects light emissions from cities, ships, oil flares, and lightning flashes. NPP crosses the equator at about 0130 and 1330 local time, with VIIRS covering the entire Earth twice daily. Future members of the Joint Polar Satellite System (JPSS) constellation will also carry VIIRS. This paper presents dramatic early examples of multispectral VIIRS imagery capabilities and demonstrates basic applications of that imagery for a wide range of operational users, such as for fire detection, monitoring ice break up in rivers, and visualizing dust plumes over bright surfaces. VIIRS imagery, both single and multiband, as well as the day/night band, is shown to exceed both requirements and expectations.
Article
The ubiquitous presence of cell phones in emerging economies has brought about a wide range of cell phone-based services for low-income groups. Often, the success of such technologies depends heavily on their adaptation to the needs and habits of each social group. In an attempt to understand how cell phones are being used by citizens in an emerging economy, we present a large-scale study to analyze the relationship between specific socio-economic factors and the way people use cell phones in an emerging economy in Latin America. We propose a novel analytical approach that combines large-scale datasets of cell phone records with countrywide census data to reveal findings at a national level. Our main results show correlations between socio-economic levels and social network or mobility patterns among others. We also provide analytical models to accurately approximate census variables from cell phone records with R² ≈ 0.82.
Article
A training set of data has been used to construct a rule for predicting future responses. What is the error rate of this rule? This is an important question both for comparing models and for assessing a final selected model. The traditional answer to this question is given by cross-validation. The cross-validation estimate of prediction error is nearly unbiased but can be highly variable. Here we discuss bootstrap estimates of prediction errors, which can be thought of as smoothed versions of cross-validation. We show that a particular bootstrap method, the .632+ rule, substantially outperforms cross-validation in a catalog of 24 simulation experiments. Besides providing point estimates, we also consider estimating the variability of an error rate estimate. All of the results here are nonparametric and apply to any possible prediction rule; however, we study only classification problems with 0-1 loss in detail. Our simulations include “smooth” prediction rules like Fisher’s linear discriminant function and unsmooth ones like nearest neighbors.
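The plain .632 rule behind the estimator discussed here mixes the optimistic training (resubstitution) error with the pessimistic leave-one-out bootstrap error using weights 0.368 and 0.632; the .632+ variant adds a further adjustment based on the no-information rate, which the sketch below omits. Model and data are arbitrary illustrations.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def bootstrap_632(model, X, y, n_boot=100, seed=0):
    """Plain .632 bootstrap estimate of classification error
    (the .632+ rule adds a no-information-rate adjustment on top)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    train_err = 1.0 - model.fit(X, y).score(X, y)     # resubstitution error
    oob_errs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                   # bootstrap sample
        oob = np.setdiff1d(np.arange(n), idx)         # left-out observations
        if len(oob) == 0:
            continue
        model.fit(X[idx], y[idx])
        oob_errs.append(1.0 - model.score(X[oob], y[oob]))
    return 0.368 * train_err + 0.632 * np.mean(oob_errs)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
# 1-NN has zero training error, the canonical case the .632 rule corrects for
print("estimated error:", round(bootstrap_632(KNeighborsClassifier(1), X, y), 3))
```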
Article
  We propose the elastic net, a new regularization and variable selection method. Real world data and a simulation study show that the elastic net often outperforms the lasso, while enjoying a similar sparsity of representation. In addition, the elastic net encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together. The elastic net is particularly useful when the number of predictors (p) is much bigger than the number of observations (n). By contrast, the lasso is not a very satisfactory variable selection method in the p≫n case. An algorithm called LARS-EN is proposed for computing elastic net regularization paths efficiently, much like algorithm LARS does for the lasso.
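The grouping effect is easy to see on synthetic data with a block of nearly identical predictors: the lasso tends to concentrate the weight on one of them, while the elastic net tends to spread it across the group. Penalty settings below are arbitrary, and the exact coefficient values will vary.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n = 50
z = rng.normal(size=n)
# Three nearly identical (strongly correlated) predictors plus noise columns
X = np.column_stack([z + 0.01 * rng.normal(size=n) for _ in range(3)] +
                    [rng.normal(size=n) for _ in range(7)])
y = 3 * z + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("lasso coefficients on the correlated trio:      ", np.round(lasso.coef_[:3], 2))
print("elastic net coefficients on the correlated trio:", np.round(enet.coef_[:3], 2))
```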
Article
Recent theoretical advances have brought income and wealth distributions back into a prominent position in growth and development theories, and as determinants of specific socio-economic outcomes, such as health or levels of violence. Empirical investigation of the importance of these relationships, however, has been held back by the lack of sufficiently detailed high quality data on distributions. Household surveys that include reasonable measures of income or consumption can be used to calculate distributional measures but at low levels of aggregation these samples are rarely representative or of sufficient size to yield statistically reliable estimates. At the same time, census (or other large sample) data of sufficient size to allow disaggregation either have no information about income or consumption, or measure these variables poorly. This note outlines a statistical procedure to combine these types of data to take advantage of the detail in household sample surveys and the comprehensive coverage of a census.
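The combination procedure can be caricatured as "fit a consumption model on the survey, impute into the census, and aggregate". The sketch below does exactly that on synthetic data, re-drawing the residual noise so that imputed consumption has realistic dispersion; it omits the cluster effects and simulation machinery of the full method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Census: many households with covariates but no recorded consumption
n_census = 20000
census_X = rng.normal(size=(n_census, 5))
true_log_cons = (2.0 + census_X @ np.array([0.5, 0.3, -0.2, 0.1, 0.4])
                 + rng.normal(scale=0.6, size=n_census))

# Survey: a small subsample that does record consumption
survey = rng.choice(n_census, 1500, replace=False)
model = LinearRegression().fit(census_X[survey], true_log_cons[survey])
sigma = np.std(true_log_cons[survey] - model.predict(census_X[survey]))

# Impute consumption into the full census (re-adding residual noise) and aggregate
imputed = model.predict(census_X) + rng.normal(scale=sigma, size=n_census)
poverty_line = 2.0
print("true poverty rate:   ", round(np.mean(true_log_cons < poverty_line), 3))
print("imputed poverty rate:", round(np.mean(imputed < poverty_line), 3))
```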
Article
We develop a statistical framework to use satellite data on night lights to augment official income growth measures. For countries with poor national income accounts, the optimal estimate of growth is a composite with roughly equal weights on conventionally measured growth and growth predicted from lights. Our estimates differ from official data by up to three percentage points annually. Using lights, empirical analyses of growth need no longer use countries as the unit of analysis; we can measure growth for sub- and supranational regions. We show, for example, that coastal areas in sub-Saharan Africa are growing slower than the hinterland. (JEL E01, E23, O11, O47, O57)
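The "roughly equal weights" result is the familiar minimum-variance combination of two independent noisy estimates: when the error variances of measured growth and lights-predicted growth are similar, the optimal weights are close to one half. A worked toy example with invented numbers:

```python
def combine(est_a, var_a, est_b, var_b):
    """Minimum-variance combination of two independent noisy estimates."""
    w = var_b / (var_a + var_b)          # weight on estimate A
    return w * est_a + (1 - w) * est_b, w

# Hypothetical numbers: officially measured growth vs growth predicted from lights,
# with similar error variances, so the optimal weight is close to one half
combined, w = combine(est_a=0.042, var_a=0.0004, est_b=0.030, var_b=0.0005)
print(f"weight on official growth: {w:.2f}, combined estimate: {combined:.3f}")
```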
Conference Paper
Most correlation clustering algorithms rely on principal component analysis (PCA) as a correlation analysis tool. The correlation of each cluster is learned by applying PCA to a set of sample points. Since PCA is rather sensitive to outliers, if a small fraction of these points does not correspond to the correct correlation of the cluster, the algorithms are usually misled or even fail to detect the correct results. In this paper, we evaluate the influence of outliers on PCA and propose a general framework for increasing the robustness of PCA in order to determine the correct correlation of each cluster. We further show how our framework can be applied to PCA-based correlation clustering algorithms. A thorough experimental evaluation shows the benefit of our framework on several synthetic and real-world data sets.
Book
During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It is a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting, the first comprehensive treatment of this topic in any book. This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression and path algorithms for the lasso, non-negative matrix factorization, and spectral clustering. There is also a chapter on methods for “wide” data (p bigger than n), including multiple testing and false discovery rates. Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie co-developed much of the statistical modeling software and environment in R/S-PLUS and invented principal curves and surfaces. Tibshirani proposed the lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, projection pursuit and gradient boosting.
Article
A pervasive issue in social and environmental research has been how to improve the quality of socioeconomic data in developing countries. Given the shortcomings of standard sources, the present study examines luminosity (measures of nighttime lights visible from space) as a proxy for standard measures of output (gross domestic product). We compare output and luminosity at the country level and at the 1° latitude × 1° longitude grid-cell level for the period 1992-2008. We find that luminosity has informational value for countries with low-quality statistical systems, particularly for those countries with no recent population or economic censuses.
Article
Massive increases in the availability of informative social science data are making dramatic progress possible in analyzing, understanding, and addressing many major societal problems. Yet the same forces pose severe challenges to the scientific infrastructure supporting data sharing, data management, informatics, statistical methodology, and research ethics and policy, and these are collectively holding back progress. I address these changes and challenges and suggest what can be done.
Article
In developing countries, identifying the poor for redistribution or social insurance is challenging because the government lacks information about people’s incomes. This paper reports the results of a field experiment conducted in 640 Indonesian villages that investigated two main approaches to solving this problem: proxy-means tests, where a census of hard-to-hide assets is used to predict consumption, and community-based targeting, where villagers rank everyone on a scale from richest to poorest. When poverty is defined using per-capita expenditure and the common PPP$2 per day threshold, we find that community-based targeting performs worse in identifying the poor than proxy-means tests, particularly near the threshold. This worse performance does not appear to be due to elite capture. Instead, communities appear to be using a different concept of poverty: the results of community-based methods are more correlated with how individual community members rank each other and with villagers’ self-assessments of their own status than per-capita expenditure. Consistent with this, the community-based methods result in higher satisfaction with beneficiary lists and the targeting process.
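A proxy-means test of the kind evaluated in the experiment can be sketched as a regression of (log) expenditure on easily verified assets, with benefits targeted to households whose predicted expenditure falls below a cutoff; the inclusion and exclusion errors computed below are the usual performance measures. All data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 4000
assets = rng.integers(0, 2, size=(n, 6)).astype(float)   # hard-to-hide assets (0/1)
log_exp = (1.0 + assets @ np.array([0.4, 0.3, 0.3, 0.2, 0.2, 0.1])
           + rng.normal(scale=0.5, size=n))
poverty_line = np.quantile(log_exp, 0.3)

# Proxy-means test: predict expenditure from assets, then target below a cutoff
pmt = LinearRegression().fit(assets, log_exp)
predicted = pmt.predict(assets)
targeted = predicted < np.quantile(predicted, 0.3)        # same budget: bottom 30%
truly_poor = log_exp < poverty_line

inclusion_error = np.mean(targeted & ~truly_poor) / np.mean(targeted)
exclusion_error = np.mean(~targeted & truly_poor) / np.mean(truly_poor)
print(f"inclusion error: {inclusion_error:.2f}, exclusion error: {exclusion_error:.2f}")
```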
Article
Social networks form the backbone of social and economic life. Until recently, however, data have not been available to study the social impact of a national network structure. To that end, we combined the most complete record of a national communication network with national census data on the socioeconomic well-being of communities. These data make possible a population-level investigation of the relation between the structure of social networks and access to socioeconomic opportunity. We find that the diversity of individuals’ relationships is strongly correlated with the economic development of communities.
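The core quantity here is a diversity measure of each area's communication ties, which can be computed as the Shannon entropy of its call-volume shares and then correlated with an index of economic well-being. The sketch below uses entirely synthetic volumes and a synthetic economic rank, so it demonstrates only the computation, not the paper's finding.

```python
import numpy as np
from scipy.stats import entropy, pearsonr

rng = np.random.default_rng(0)
n_regions = 80

# Hypothetical call-volume matrix: share of each region's calls to 30 partner regions
volumes = rng.dirichlet(alpha=np.linspace(0.2, 3.0, 30), size=n_regions)
diversity = entropy(volumes, axis=1)                      # Shannon diversity of ties

# Synthetic "economic rank" loosely tied to diversity, for illustration only
econ_rank = diversity + rng.normal(scale=0.3, size=n_regions)

r, p = pearsonr(diversity, econ_rank)
print(f"correlation between tie diversity and economic rank: r={r:.2f} (p={p:.1e})")
```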
Article
This paper has an empirical and overtly methodological goal. The authors propose and defend a method for estimating the effect of household economic status on educational outcomes without direct survey information on income or expenditures. They construct an index based on indicators of household assets, solving the vexing problem of choosing the appropriate weights by allowing them to be determined by the statistical procedure of principal components. While the data for India cannot be used to compare alternative approaches, they use data from Indonesia, Nepal, and Pakistan, which have both expenditure and asset variables for the same households. With these data the authors show not only that there is a correspondence between a classification of households based on the asset index and consumption expenditures, but also that the evidence is consistent with the asset index being a better proxy for predicting enrollments (apparently less subject to measurement error for this purpose) than consumption expenditures. The relationship between household wealth and educational enrollment of children can be estimated without expenditure data. A method for doing so, which uses an index based on household asset ownership indicators, is proposed and defended in this paper. In India, children from the wealthiest households are over 30 percentage points more likely to be in school than those from the poorest households.
Article
Using data from India, we estimate the relationship between household wealth and children's school enrollment. We proxy wealth by constructing a linear index from asset ownership indicators, using principal-components analysis to derive weights. In Indian data this index is robust to the assets included, and produces internally coherent results. State-level results correspond well to independent data on per capita output and poverty. To validate the method and to show that the asset index predicts enrollments as accurately as expenditures, or more so, we use data sets from Indonesia, Pakistan, and Nepal that contain information on both expenditures and assets. The results show large, variable wealth gaps in children's enrollment across Indian states. On average a "rich" child is 31 percentage points more likely to be enrolled than a "poor" child, but this gap varies from only 4.6 percentage points in Kerala to 38.2 in Uttar Pradesh and 42.6 in Bihar.
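The asset-index construction described in this and the preceding study reduces a set of ownership indicators to their first principal component and ranks households by the resulting score. A sketch on synthetic assets generated from a latent wealth variable (the sign fix is needed only because the direction of a principal component is arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 1000
wealth = rng.normal(size=n)                       # latent wealth (never observed)

# Binary asset indicators whose ownership probability rises with wealth
logits = wealth[:, None] + rng.normal(size=(n, 8))
assets = (rng.random((n, 8)) < 1 / (1 + np.exp(-logits))).astype(float)

# Asset index = first principal component of the standardized indicators
standardized = (assets - assets.mean(0)) / assets.std(0)
index = PCA(n_components=1).fit_transform(standardized).ravel()
index *= np.sign(np.corrcoef(index, assets.sum(1))[0, 1])   # more assets -> higher index

quintile = np.digitize(index, np.quantile(index, [0.2, 0.4, 0.6, 0.8]))
print("correlation of asset index with latent wealth:",
      round(np.corrcoef(index, wealth)[0, 1], 2))
print("households per quintile:", np.bincount(quintile))
```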
Article
Despite their importance for urban planning, traffic forecasting and the spread of biological and mobile viruses, our understanding of the basic laws governing human motion remains limited owing to the lack of tools to monitor the time-resolved location of individuals. Here we study the trajectory of 100,000 anonymized mobile phone users whose position is tracked for a six-month period. We find that, in contrast with the random trajectories predicted by the prevailing Lévy flight and random walk models, human trajectories show a high degree of temporal and spatial regularity, each individual being characterized by a time-independent characteristic travel distance and a significant probability to return to a few highly frequented locations. After correcting for differences in travel distances and the inherent anisotropy of each trajectory, the individual travel patterns collapse into a single spatial probability distribution, indicating that, despite the diversity of their travel history, humans follow simple reproducible patterns. This inherent similarity in travel patterns could impact all phenomena driven by human mobility, from epidemic prevention to emergency response, urban planning and agent-based modelling.
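The characteristic travel distance referred to here is commonly summarized by the radius of gyration: the root-mean-square distance of a user's visited positions from their center of mass. A small sketch on two invented trajectories:

```python
import numpy as np

def radius_of_gyration(points):
    """Characteristic travel distance of one user: root-mean-square
    distance of visited positions from their center of mass."""
    points = np.asarray(points, dtype=float)
    center = points.mean(axis=0)
    return np.sqrt(((points - center) ** 2).sum(axis=1).mean())

# Two toy users with planar coordinates in kilometres
commuter = [(0, 0)] * 20 + [(5, 0)] * 20                   # home <-> work
traveller = [(0, 0), (40, 10), (80, -5), (120, 30)] * 10   # long, varied trips

for name, traj in [("commuter", commuter), ("traveller", traveller)]:
    print(f"{name}: r_g = {radius_of_gyration(traj):.1f} km")
```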
Article
This paper presents new data on poverty, inequality, and growth in those developing countries of the world for which the requisite statistics are available. Economic growth is found generally but not always to reduce poverty. Growth, however, is found to have very little to do with income inequality. Thus the "economic laws" linking the rate of growth and the distribution of benefits receive only very tenuous empirical support here. © 1989 The International Bank for Reconstruction and Development/The World Bank.
Article
An analyst using household survey data to construct a welfare metric is often confronted with a number of theoretical and practical problems. What components should be included in the overall welfare measure? Should differences in tastes be taken into account when making comparisons across people and households? How best should differences in cost-of-living and household composition be taken into consideration? Starting with a brief review of the theoretical framework underpinning typical welfare analysis undertaken based on household survey data, this paper provides some practical guidelines and advice on how best to tackle such problems. It outlines a three-part procedure for constructing a consumption-based measure of individual welfare: (i) aggregation of different components of household consumption to construct a nominal consumption aggregate, (ii) construction of price indices to adjust for differences in prices faced by households, and (iii) adjustment of the real consumption aggregate for differences in household composition. Examples based on survey data from eight countries (Ghana, Vietnam, Nepal, the Kyrgyz Republic, Ecuador, South Africa, Panama, and Brazil) are used to illustrate the various steps involved in constructing the welfare measure, and the STATA programs used for this purpose are provided in the appendix. The paper also includes examples of some analytic techniques that can be used to examine the robustness of the estimated welfare measure to underlying assumptions.
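The three-part procedure translates directly into a few lines of arithmetic: sum the consumption components, deflate by a spatial price index, and divide by an equivalence scale. The household records, price indices, and the 0.5 weight per child below are purely illustrative choices, not the paper's recommended values.

```python
import pandas as pd

# Toy household records (names and numbers are purely illustrative)
hh = pd.DataFrame({
    "food":        [120.0, 300.0, 80.0],
    "nonfood":     [60.0, 250.0, 30.0],
    "price_index": [1.00, 1.15, 0.90],   # regional cost-of-living deflator
    "adults":      [2, 2, 1],
    "children":    [2, 1, 3],
})

# (i) nominal consumption aggregate
hh["nominal"] = hh["food"] + hh["nonfood"]
# (ii) deflate by a spatial price index
hh["real"] = hh["nominal"] / hh["price_index"]
# (iii) adjust for household composition with a simple equivalence scale
hh["adult_equivalents"] = hh["adults"] + 0.5 * hh["children"]
hh["welfare_per_ae"] = hh["real"] / hh["adult_equivalents"]
print(hh[["nominal", "real", "welfare_per_ae"]].round(1))
```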
G. S. Fields, World Bank Res. Obs. 4, 167-185 (1989).
D. Lazer et al., Science 323, 721-723 (2009).
G. King, Science 331, 719-721 (2011).
N. Eagle, M. Macy, R. Claxton, Science 328, 1029-1031 (2010).
H. Choi, H. Varian, Predicting the present with Google Trends. Econ. Rec. 88, 2-9 (2012). doi:10.1111/j.1475-4932.2012.00809.x
W. Wang, D. Rothschild, S. Goel, A. Gelman, Int. J. Forecast. 31, 980-991 (2015).
J. Candia et al., J. Phys. A 41, 224015 (2008).
J.-P. Onnela et al., Proc. Natl. Acad. Sci. U.S.A. 104, 7332-7336 (2007).
X. Lu, E. Wetter, N. Bharti, A. J. Tatem, L. Bengtsson, Sci. Rep. 3, 2923 (2013).