Article

Predicting poverty and wealth from mobile phone metadata

Abstract

Accurate and timely estimates of population characteristics are a critical input to social and economic research and policy. In industrialized economies, novel sources of data are enabling new approaches to demographic profiling, but in developing countries, fewer sources of big data exist. We show that an individual's past history of mobile phone use can be used to infer his or her socioeconomic status. Furthermore, we demonstrate that the predicted attributes of millions of individuals can, in turn, be used to accurately reconstruct the distribution of wealth of an entire nation or to infer the asset distribution of microregions composed of just a few households. In resource-constrained environments where censuses and household surveys are rare, this approach creates an option for gathering localized and timely information at a fraction of the cost of traditional methods.

... However, if this data also has the ability to predict demographic attributes, the credit risk models may lead to indirect discrimination based on age or gender. On the other hand, if mobile usage data proves to be predictive, it can provide a cost-effective way of monitoring and understanding sociodemographic changes at a high spatial resolution and track progress towards goals such as poverty reduction [12] and local economic development, human mobility, and activities [13]. This could be especially useful in developing countries where census data may be lacking or unreliable [14]. ...
... Mobile usage patterns have also been shown to be correlated with income and wealth [12,18,19,27]. For example, the study in [19] identified notable differences between how low-income and high-income population segments use their mobile devices. ...
... Moreover, the results presented in [12] suggest that mobile usage patterns can be used to predict poverty at both the individual and the aggregate level. The latter is especially useful for computational social science and development studies in developing countries, where reliable and timely census data are lacking but mobile service penetration is high [30]. ...
Article
When users interact with their mobile devices, they leave behind unique digital footprints that can be viewed as predictive proxies that reveal an array of users' characteristics, including their demographics. Predicting users' demographics based on mobile usage can provide significant benefits for service providers and users, including improving customer targeting, service personalization, and market research efforts. This study uses machine learning algorithms and mobile usage data from 235 demographically diverse users to examine the accuracy of predicting their sociodemographic attributes (age, gender, income, and education) from mobile usage metadata, filling the gap in the current literature by quantifying the predictive power of each attribute and discussing the practical applications and privacy implications. According to the results, gender can be most accurately predicted (balanced accuracy = 0.862) from mobile usage footprints, whereas predicting users' education level is more challenging (balanced accuracy = 0.719). Moreover, the classification models were able to classify users based on whether their age or income was above or below a certain threshold with acceptable accuracy. The study also presents the practical applications of inferring demographic attributes from mobile usage data and discusses the implications of the findings, such as privacy and discrimination risks, from the perspectives of different stakeholders.
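The abstract above reports balanced accuracy without defining it. As a rough illustration, the sketch below uses the common definition (the mean of per-class recall), which is an assumption about the metric the study used. The labels and the function name `balanced_accuracy` are hypothetical and chosen only to show why the metric differs from plain accuracy on imbalanced classes.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (assumed definition, not taken from the study)."""
    total = defaultdict(int)    # number of true examples per class
    correct = defaultdict(int)  # correctly predicted examples per class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical labels: 8 users in class "a" and 2 users in class "b".
y_true = ["a"] * 8 + ["b"] * 2
# A classifier that always predicts the majority class.
y_pred = ["a"] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(plain_accuracy, balanced)
```

In this toy case the majority-class classifier looks reasonable on plain accuracy but scores no better than chance on balanced accuracy, which is why the latter is often preferred when demographic groups are unevenly represented.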
... Nevertheless, satellite earth observation data are crucial for understanding socioeconomic measures from space, initially through the correlation between income and night-light images (Blumenstock et al. 2015). However, after pointing the way with satellite images, new studies started going beyond the use of night-light images to predict socioeconomic conditions and wealth index indicators as a poverty proxy (Ayush et al. 2020; Engstrom et al. 2017; Jean et al. 2016; Lee and Braithwaite 2020; Steele et al. 2017). ...
... These are the main arguments for using new artificial intelligence technologies to estimate social indicators as a proxy of poverty, an area of research that has been growing in recent years (Hall et al. 2022; Usmanova et al. 2022). Following these concerns, one seminal work was that of Blumenstock et al. (2015), who used Call Detail Records (CDR) and VIIRS-DNB satellite image data to predict the wealth index in Rwanda from the Demographic and Health Survey (DHS); the authors implemented a deterministic finite automaton as the feature extraction approach and linear methods with regularization as the prediction algorithm. This perspective opened a research area where satellite images, CDR, aerial images, and geolocated data are used to understand social indicators such as poverty, consumption, income, and wealth indices (Ayush et al. 2020; Engstrom et al. 2017; Jean et al. 2016; Lee and Braithwaite 2020; Niu et al. 2020; Pokhriyal and Jacques 2017; Pokhriyal et al. 2020; Steele et al. 2017; Watmough et al. 2019). ...
... The methodological differences are presented in four aspects: the data source, the feature extraction methods, the prediction methods, and the target variable chosen. The models used geolocated census or survey data, which contain the ground truth variable; the primary source in several studies is the DHS survey, which meets the two requirements: the household's geolocation and the wealth index as a proxy of social condition (Blumenstock et al. 2015; Ledesma et al. 2020; Lee and Braithwaite 2020; Sheehan et al. 2019; Steele et al. 2017; Weidmann and Schutte 2017). Some works focus on census data as the main source (Engstrom et al. 2017; Pandey et al. 2018; Pokhriyal and Jacques 2017; Pokhriyal et al. 2020), while others focus on local or national surveys (Ayush et al. 2020; Gebru et al. 2017; Steele et al. 2017; Watmough et al. 2019). ...
Article
Full-text available
This paper presents a methodology to estimate the multidimensional poverty index using spatial data at the street block level. The data used in this study were obtained from Open Street Maps and ESA’s land use cover, which are freely available sources of spatial information. The study employs five machine-learning algorithms, including Catboost, Lightboost, and Random Forest, to estimate the multidimensional poverty index with spatial granularity. The results indicate that these models achieve promising performance in predicting poverty levels in Medellín, Colombia. The Random Forest algorithm achieved the highest performance, with an MAE of 0.07504. Furthermore, the spatial distribution of the multidimensional poverty estimate was highly correlated with the true values of the distribution. This work contributes to predicting multidimensional poverty by demonstrating the potential of machine learning algorithms to utilize accessible spatial data. By providing evidence of the feasibility of estimating poverty levels at a granular spatial level, this methodology offers a powerful tool for policymakers to design anti-poverty interventions using low-cost evidence. Furthermore, this study has important implications for poverty eradication efforts in developing countries, where access to reliable data remains challenging.
... There are research opportunities engaging socio-technical transition theory [57], complexity theory with agent-based modelling [68], design science research [27], and other management and organizational theories to create frameworks for sustainability transitions from poverty to empowerment [69]. Several emerging data analytics and machine learning techniques may also be useful [70][71][72] for progress monitoring and forecasting. The following discussion draws upon research articles referred to in Appendix A Tables 6, 7, 8, 9, additional articles from the Query 2, Query 3, and Query 4 searches (Fig. 2), and other research articles. ...
... Big data analytics can process data in near real time with predictive and prescriptive outputs, and a variety of visualizations. Blumenstock et al. [70] combined mobile phone metadata with survey data from Rwanda to predict poverty, wealth, social connections, travel patterns, and other expenditures. Njuguna and McSharry [71] successfully combined mobile phone data, night light data (illumination levels at night), and population data to predict regions of poverty. ...
... Xie et al. [72] trained a convolutional neural network (CNN) using night light data and reported poverty prediction results approaching that of survey data. Accurate census and survey data is not readily available in S-SA [70][71][72] but night light data is available from the National Oceanic and Atmospheric Administration [77]. ...
Article
Full-text available
Poverty elimination by 2030 is the major initiative of the United Nations Sustainable Development Goals. However, poverty in Sub-Saharan Africa is rising. There is an absence of structural reform for transformational change across the region. E-commerce is an enabler of small and large businesses in developed economies. Community-led initiatives for poverty alleviation may benefit from the transactional capabilities of e-commerce for direct trade with suppliers and consumers. Well-structured small and medium-size enterprises (SMEs) can foster local innovation and entrepreneurship, and collaboration between SMEs can enhance product development and marketing strategies. This review aims to discover formal research into the application of e-commerce in sustainable development models for poverty alleviation in Sub-Saharan Africa, and the extent of innovation, entrepreneurship, and collaboration among SMEs. The review found an absence of formal research into theories and practical strategies for sustainability innovations across the low-income spectrum. Organizational structures have not been developed to stimulate outreach, to foster innovation and entrepreneurship, or to embrace technology. Further, there is limited discussion on the importance of collaboration for the sharing of knowledge and joint business activities, but there is acknowledgement that SMEs can provide spatially diversified sustainable development. This article proposes a framework for the implementation and management of networks of SMEs focused on the sustainable development of low-income communities.
... While initial efforts in this domain relied on NTL data, given the strong correlation between nightlight luminosity and traditional measures of economic growth, the approach has seen limited application at finer resolutions and in the poorest of regions (Blumenstock, 2016; Jean et al., 2016). Other major approaches include the use of high-resolution daytime satellite data (Head et al., 2017; Jean et al., 2016), mobile phone metadata (Aiken, Bedoya, et al., 2021; Blumenstock et al., 2015), internet search history and social media activities (Choi & Varian, 2012; Fatehkia et al., 2020; Llorente et al., 2015) and various combinations of these data sources (Pokhriyal & Jacques, 2017; Steele et al., 2017). These applications have been possible largely because of the proliferation of big data as well as newer methods in ML to process them. ...
... The application of this approach is not limited to developed countries though, given the increasingly high mobile penetration rates even in developing countries. Blumenstock et al. (2015), for instance, constructed a composite wealth index using principal components of various wealth indicators gleaned from the 2007 and 2010 DHS for Rwanda as well as a phone survey and CDR such as calls and text messages. Through this, the authors demonstrate that a mobile phone subscriber's wealth status can be inferred from their historical phone use pattern, with cross-validated correlation coefficients of 0.68. ...
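The snippet above mentions a composite wealth index built from principal components of wealth indicators. The following is a minimal sketch of that general idea, not the authors' code: the household asset data are invented, the function name `first_principal_component` is hypothetical, and power iteration on the correlation matrix is just one simple way to obtain the first component without external libraries.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (loadings, scores) for the first principal component of `rows`.

    Columns are standardized first, so this works on the correlation matrix.
    Assumes every column has non-zero variance.
    """
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(k)]
    z = [[(r[j] - means[j]) / stds[j] for j in range(k)] for r in rows]
    corr = [[sum(zr[a] * zr[b] for zr in z) / n for b in range(k)]
            for a in range(k)]
    # Power iteration: repeatedly multiply by the matrix and renormalize.
    v = [1.0] * k
    for _ in range(iterations):
        w = [sum(corr[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Sign convention: make the loadings sum to a positive number.
    if sum(v) < 0:
        v = [-x for x in v]
    scores = [sum(zr[j] * v[j] for j in range(k)) for zr in z]
    return v, scores


# Hypothetical households: [owns_radio, owns_bicycle, number_of_rooms].
households = [
    [0, 0, 1],
    [1, 0, 2],
    [1, 1, 3],
    [1, 1, 4],
    [0, 0, 1],
]
loadings, wealth_index = first_principal_component(households)
for row, score in zip(households, wealth_index):
    print(row, round(score, 3))
```

Under these assumptions, households owning more assets receive higher index values, and the index is centered at zero because the inputs are standardized.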
... For example, NTLs, which tend to have lower spatial resolution of about 1 km, can be fused with higher resolution daytime SI to produce a higher resolution composite dataset. Still, ML algorithms have been used to infer individual subscribers' socioeconomic status directly from their individual phone use habits (highest resolution) and then aggregate such predictions to town, district and regional levels (lower resolutions) (Blumenstock et al., 2015). We hypothesize that certain types of assets and features can more effectively be measured in higher resolution imagery than others. ...
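The aggregation step described above (individual predictions rolled up to coarser geographic units) can be sketched as a simple group-wise mean. The region names, predicted values, and the function name `aggregate_by_region` are hypothetical; the cited study may weight or adjust predictions in ways not shown here.

```python
from collections import defaultdict


def aggregate_by_region(predictions):
    """Average individual-level predictions within each region.

    `predictions` is an iterable of (region, predicted_wealth) pairs.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for region, value in predictions:
        sums[region] += value
        counts[region] += 1
    return {region: sums[region] / counts[region] for region in sums}


# Hypothetical subscriber-level predictions tagged with a home district.
individual_predictions = [
    ("district_a", 1.0),
    ("district_a", 3.0),
    ("district_b", -2.0),
]
district_means = aggregate_by_region(individual_predictions)
print(district_means)
```

An unweighted mean is the simplest choice; in practice one might reweight subscribers to correct for differences between phone owners and the general population.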
Article
Full-text available
The field of artificial intelligence is seeing the increased application of satellite imagery to analyse poverty in its various manifestations. This nascent but rapidly growing intersection of scholarship holds the potential to help us better understand poverty by leveraging big data and recent advances in machine vision. In this study, we statistically analyse the literature in the expanding field of welfare and poverty predictions from the combination of machine learning and satellite imagery. Here, we apply an integrative review method to extract key data on factors related to the predictive power of welfare. We found that the most important factors correlated to the predictive power of welfare are the number of pre‐processing steps employed, the number of datasets used, the type of welfare indicator targeted and the choice of AI model. Studies that used stock measure indicators (assets) as targets achieved better performance in predicting welfare, 17 percentage points higher, than those that targeted flow measures (income and consumption). Additionally, we found that the combination of machine learning and deep learning significantly increases predictive power, by as much as 15 percentage points, compared to using either alone. Surprisingly, we found that the spatial resolution of the satellite imagery used is important but not critical to the performance as the relationship is positive but not statistically significant. These findings have important implications for future research in this domain and for anyone aspiring to use the methodology.
... Against this backdrop, recent advances in machine learning and the growing availability of non-traditional data sources have led to the proliferation of new options for small area estimation. For example, Blumenstock, Cadamuro, and On (2015) use mobile phone records to infer the socioeconomic status of phone owners in Rwanda and Aiken et al. (2022) use mobile phone call data records to predict targeting performance of programs in Togo. However, one drawback to mobile phone data is that, like banking records, the population of mobile-phone owners may be systematically different from the population of those without phones. ...
... However, image processing is computationally intensive and unwieldy. In comparison, other types of data are easier to use, like mobile phone call data records (Aiken et al. 2022;Blumenstock, Cadamuro, and On 2015), but these can be more difficult to access due to privacy concerns, and also raise issues related to representativeness. The satellite indicators we use can be obtained from publicly available sources relatively easily and are much smaller in size. ...
Preprint
Full-text available
Reliable estimates of economic welfare for small areas are valuable inputs into the design and evaluation of development policies. This paper compares the accuracy of point estimates and confidence intervals for small area estimates of wealth and poverty derived from four different prediction methods: linear mixed models, Cubist regression, extreme gradient boosting, and boosted regression forests. The evaluation draws samples from unit-level household census data from four developing countries, combines them with publicly and globally available geospatial indicators to generate small area estimates, and evaluates these estimates against aggregates calculated using the full census. Predictions of wealth are evaluated in four countries and poverty in one. All three machine learning methods outperform the traditional linear mixed model, with extreme gradient boosting and boosted regression forests generally outperforming the other alternatives. The proposed residual bootstrap procedure reliably estimates confidence intervals for the machine learning estimators, with estimated coverage rates across simulations falling between 94 and 97 percent. These results demonstrate that predictions obtained using tree-based gradient boosting with a random effect block bootstrap generate more accurate point and uncertainty estimates than prevailing methods for generating small area welfare estimates.
... These, and similar efforts, have resulted in a plethora of scientific studies (see Fig. 1C), which focus on combining novel digital data sources (including mobile phone and social media data) with powerful tools from computer science, mathematics, and physics to estimate developmental indicators including socio-economic status [16][17][18][19], illiteracy [20], unemployment [21,22], gender inequality [23,24], segregation [25,26], and population statistics [27]. Similarly, these datasets and mathematical techniques can be used to achieve the Sustainable Development Goals [28][29][30]. ...
... Using digital data to produce population demographics is a relatively new endeavour. While the field has produced some exciting results, for instance, fine-grained wealth distribution maps have been generated for a large number of countries [16,17,33,45,46], little is known about the shortcomings of these new approaches. By contrast, household surveys have been used for decades and their limitations are well understood and documented. ...
Preprint
Full-text available
Novel digital data sources and tools like machine learning (ML) and artificial intelligence (AI) have the potential to revolutionize data about development and can contribute to monitoring and mitigating humanitarian problems. The potential of applying novel technologies to solving some of humanity's most pressing issues has garnered interest outside the traditional disciplines studying and working on international development. Today, scientific communities in fields like Computational Social Science, Network Science, Complex Systems, Human Computer Interaction, Machine Learning, and the broader AI field are increasingly starting to pay attention to these pressing issues. However, are sophisticated data driven tools ready to be used for solving real-world problems with imperfect data and of staggering complexity? We outline the current state-of-the-art and identify barriers, which need to be surmounted in order for data-driven technologies to become useful in humanitarian and development contexts. We argue that, without organized and purposeful efforts, these new technologies risk at best falling short of promised goals, at worst they can increase inequality, amplify discrimination, and infringe upon human rights.
... They then applied the model to the populated surface of all 135 low- and middle-income countries to develop granular poverty maps for each country. Similar approaches have been used for individual countries, including Rwanda (Blumenstock et al., 2015), Senegal (Pokhriyal and Jacques, 2017), Bangladesh (Steele et al., 2017), and Belize (Hersh et al., 2021), among others. While often subsumed under the term "machine learning," this modern approach to poverty mapping actually consists of several practices that differ from more traditional approaches. ...
... For examples of the above procedure, see Jean et al. (2016), Pokhriyal and Jacques (2017), Steele et al. (2017), Lee and Braithwaite (2020), Yeh et al. (2020), or Chi et al. (2022). Note that a few studies use a similar procedure, but fit the machine-learning model to household-level data and then aggregate the model's predictions to the geographic unit of interest (Blumenstock et al., 2015; Hersh et al., 2021; Aiken et al., 2022). Others rely on reporting correlation coefficients between survey-based estimates and model predictions, which is similar (see Smythe and Blumenstock 2022). ...
... Mobile phone metadata was used to predict poverty and map wealth distributions in Rwanda (Blumenstock et al., 2015). Phone data was combined with targeted survey data to predict wealth, social connections, travel patterns, and other expenditures. ...
... They claimed poverty prediction results approaching that of survey data. However, while night illumination data is available from the National Oceanic and Atmospheric Administration, accurate census and survey data is not readily available in S-SA (Blumenstock et al., 2015; Njuguna & McSharry, 2017; Xie, 2016) ...
Article
Full-text available
Sub-Saharan Africa is currently experiencing growth in the number of people living in poverty, and the situation is worsening due to climate change and the COVID-19 pandemic. Cities are increasingly under stress because of urbanization and the demand for low-cost housing. Slum dwellers face daunting social, environmental, and economic challenges. Geospatial analysis of remote sensing, demographic, economic, social, and environmental data is being used to delineate slums. The application of circular economy guidelines for an intelligent transformation of slums combines technical and social innovation that reaches beyond the slums to the whole urban ecosystem. Examples of contributions to the circular economy are provided. Finally, some ideas are introduced on how the internet of things can improve access to goods and services and strengthen interconnectedness through the ability to participate more readily in the social dialogue of the city. The city of Accra in Ghana, West Africa, is discussed as a potential slum city to functional intelligent city transformation.
... No Poverty: predicting poverty regions [31,32], optimizing social security payments [33], improving microfinance services [34,35] ...
Article
Full-text available
Artificial intelligence (AI) and deep learning (DL) have shown tremendous potential in driving sustainability across various sectors. This paper reviews recent advancements in AI and DL and explores their applications in achieving sustainable development goals (SDGs), renewable energy, environmental health, and smart building energy management. AI has the potential to contribute to 134 of the 169 targets across all SDGs, but the rapid development of these technologies necessitates comprehensive regulatory oversight to ensure transparency, safety, and ethical standards. In the renewable energy sector, AI and DL have been effectively utilized in optimizing energy management, fault detection, and power grid stability. They have also demonstrated promise in enhancing waste management and predictive analysis in photovoltaic power plants. In the field of environmental health, the integration of AI and DL has facilitated the analysis of complex spatial data, improving exposure modeling and disease prediction. However, challenges such as the explainability and transparency of AI and DL models, the scalability and high dimensionality of data, the integration with next-generation wireless networks, and ethics and privacy concerns need to be addressed. Future research should focus on enhancing the explainability and transparency of AI and DL models, developing scalable algorithms for processing large datasets, exploring the integration of AI with next-generation wireless networks, and addressing ethical and privacy considerations. Additionally, improving the energy efficiency of AI and DL models is crucial to ensure the sustainable use of these technologies. By addressing these challenges and fostering responsible and innovative use, AI and DL can significantly contribute to a more sustainable future.
... Mobile phone data can provide insights into food access and distribution patterns by analyzing mobility patterns, such as areas with limited access to food markets and distribution centers [24]. Financial transaction data can predict food security outcomes by analyzing household income and expenditure patterns, indicating shifts toward food insecurity [25]. Finally, weather data can predict changes in agricultural production and food availability by analyzing rainfall patterns and temperature. ...
Preprint
Full-text available
The article explores the role and prospects of artificial intelligence (AI) in addressing global food insecurity. It provides an overview of machine learning (ML) techniques (the core learning component of AI) used to predict food security outcomes, and discusses real-world examples as well as recent applications of ML. It further examines the challenges and limitations of ML, including concerns related to data quality and ethical considerations, followed by policy recommendations in crucial areas such as funding, cross-sector collaboration, education, and data standards. Finally, it underscores the importance of recognizing AI as a complementary tool, rather than a standalone solution, in the pursuit of the ultimate goal of achieving a world without hunger.
... Early studies of information and communication technologies in Africa measured the effects of mobile phones (which may not have been connected to the internet) on different aspects of economic development, such as financial inclusion, access to market prices, poverty measurement, and access to other types of life-improving information (e.g., Blumenstock, Cadamuro, and On 2015; Jack and Suri 2014). More recent research in Africa has examined the impact of expanded internet access on outcomes including employment and political mobilization (Hjort & Poulsen, 2019; Manacorda & Tesei, 2020), but researchers have been constrained by limited data availability on internet use. ...
Preprint
Full-text available
We present the first objective evidence on how COVID-19 lockdowns affected internet browser usage in Africa, using detailed digital trace data on PC-based and mobile-based browsing patterns of 316 Kenyans who had access to a PC, covering the period before and during Kenya's first national COVID-19 curfew that was declared on March 25, 2020. We find that total daily browser usage increased by 41 minutes, or 15 percent of average browsing time, after the curfew started. We find no significant differences in total browsing time during the curfew by gender or by residence in high-speed vs. low-speed broadband access areas. However, we do find gender differences in the content of browsing. Women's time on YouTube and Netflix exceeded men's from the start of our sample period, and the gender gap in Netflix browsing increased by 36 minutes daily, corresponding to almost twice the average daily Netflix time in the sample. Men's browsing became less concentrated during the curfew, across both domains and topics, but women's did not. The degree of overlap in browsing between men and women also increased, likely due to men visiting sites that were previously exclusively visited by women. Across the entire sample, browsing of Kenyan domains dropped significantly relative to that of non-Kenyan domains, indicating greater reliance on international content during this period of economic and social upheaval.
... To date, digital demography has mostly focused on analysing international migration flows, including the location tracking of IP addresses when individuals log into their email accounts, Facebook advertisement target populations, geo-located Twitter data, and Google+ data (Zagheni and Weber 2012). Blumenstock et al. (2015, 2018) have called attention to both the possibilities and the caveats that come with using cell phone data to influence and to study global socio-economic development, particularly in poorer regions of the world. ...
... New data collection techniques, such as sensors mounted on cookstoves, can fill the missing link between output and outcome by monitoring adoption patterns of interventions (Wilson et al., 2016). Combining satellite imagery (Jean et al., 2018; Jean et al., 2016) or mobile phone data (Blumenstock et al., 2015; Blumenstock 2016) with deep learning can reasonably assess poverty and needs in the cross-section but has made little progress in estimating changes in welfare over time. Rigorous experimental studies can tease out socio-economic impacts of interventions but are less likely to recover quantities that are useful for policy (Deaton, 2010). ...
Thesis
Full-text available
INTRODUCTION
Given that concentrated poverty is deepening around the world, the international development community now has even more reason to address this issue. Development aid has ostensibly served as an important policy instrument for promoting the welfare of marginalized communities in the Global South. The effectiveness of such efforts can be evaluated from varying angles but the first test to pass is its relevance to the lives of the most marginalized. This dissertation evaluates the extent to which aid activity is suited to the needs and priorities of recipients using three lenses: 1) needs assessment at the global level, 2) the design of interventions at the country level, and 3) evaluations at the sub-national level. The first chapter identifies the salient dimensions of poverty from the monetary and capability perspectives, using a cross-country analysis for 188 developing countries. The second chapter introduces a framework for analyzing two community development models in Myanmar as a country case study. The third chapter explores whether community development projects reach the poorest villages. It combines satellite imagery with spatial analysis to evaluate sub-national aid distribution. This study suggests strategies to deploy aid resources in a way to maximize their impact on people living in absolute poverty and data-sparse contexts.
CHAPTER 1. THE DISCREPANCY BETWEEN TWO APPROACHES TO GLOBAL POVERTY: WHAT DOES IT REVEAL?
For decades, development communities have attempted to develop poverty measures that can be used to inform need assessment and aid allocation. Building on these efforts, this paper examines the discrepancies between global poverty measures and brings that analysis to bear on identifying the salient dimensions of poverty in developing countries.
It first compares the monetary and capability approaches to poverty and identifies comparable indices from each approach: the poverty headcount ratio (P0) and the multidimensional poverty headcount ratio (H). This paper then describes the degree of overlap and discrepancy between P0 and H for 118 developing countries from 2000 to 2014, synthesizing the Multidimensional Poverty Index (MPI), World Development Indicators, and OECD aid activity data. On average, there is a high correlation between the two poverty measures, but considerable discrepancies surface for some countries. I analyze the position of these countries with respect to the fitted line of the two measures, classifying them into income-poor and capability-poor countries. Countries such as Pakistan and Ethiopia, for example, are experiencing “capability poverty” while Uzbekistan and Zimbabwe are experiencing “income poverty.” I examine whether aid composition corresponds to the country’s relative income and poverty status, finding that capability-poor countries receive marginally higher social sectoral aid compared with economic sector aid. This study suggests that discrepancies between measures of international poverty can be used to target, monitor, and evaluate global aid distribution.
CHAPTER 2: ALTERNATIVE MODELS OF COMMUNITY-LED DEVELOPMENT: IMPLICATIONS FOR POLICY AND PRACTICE
Reconciling the dual imperatives of legitimate state building and efficient service delivery, Community-led Development (CLD) has been praised as “a new form of engagement” in providing aid to fragile states. However, whether or how the CLD represents a new model of practice remains poorly understood. The absence of an analytical framework for distinguishing alternative CLD approaches to development aid hinders both the design of context-specific interventions and the evaluation of their impacts.
This dissertation aims to compare two alternative models of CLD against a backdrop provided by the framework of community-led development. Using document reviews and stakeholder interviews, this paper analyzes two aid projects in Myanmar: the Korean government-supported Saemaul Undong (SMU, New Village Movement), which reflects the perspective of the developmental state, and the World Bank-supported National Community-Driven Development Project (NCDDP), which reflects the perspective of revised neoliberalism. Next, this study proposes the Agency-Power-Dimension (APD) framework for use in describing donors’ general CLD aid policies in conjunction with specific CLD projects in Myanmar. This study finds that the intervention strategies of SMU and NCDDP differ regarding the main agency of change, the handling of power, and the objectives of projects. SMU engages with government extension workers as the main change agent, and its accountability comes from the performance of projects that focus on agricultural production. In contrast, NCDDP works with private facilitators, emphasizing the processes of inclusion in the context of public infrastructure development. Previously, impact evaluations of CLDs set hypotheses based on the logical progression of the projects whose indicators are diffused over broad socio-economic domains. The APD framework identifies the main facets of treatment arms in future experimental studies. Policymakers seeking development opportunities in other fragile states can compare East Asian/Southern and Western/Northern approaches and apply them to varying local conditions. CHAPTER 3: MAPPING COMMUNITY DEVELOPMENT AID: SPATIAL ANALYSIS IN MYANMAR Aid policy has the potential to alleviate global poverty by targeting areas of concentrated need.
However, few aid-determinant studies have analyzed the characteristics of poverty at the sub-national level, and even those studies were conducted with their units of analysis at a high administrative level such as the state. This study intends to fill this knowledge gap by portraying poverty at the granular level and promoting the evaluation of aid towards the most marginalized communities. The goal of this study is to explore the extent to which community-led development (CLD) projects take place in poor villages, using the case of Myanmar. It also analyzes how two CLD models, the National Community-Driven Development Project (NCDDP) and Saemaul Undong (SMU), target needs differently. To collect outcome variables, I develop web scraping algorithms to compile comprehensive and up-to-date locations of CLD participating villages (n=12,282). As for explanatory variables, radiance values from nighttime satellite imagery are extracted to estimate wealth at the community level. In addition, I spatially interpolate the DHS wealth index to make inferences on poverty in aid sites. By geospatially matching aid and wealth-related data, I test factors that explain variation in the distribution of CLD and different approaches to community development. The results show mixed evidence of poverty-oriented targeting. First, with each increment in the share of the vulnerable population, the likelihood of aid presence in that community declines by 4%. Next, the density of community development projects is higher in brighter areas. A one-unit increase in nightlight intensity increases the number of projects by 86 within a two-degree radius of a DHS village cluster. Among villages with similar levels of nightlights and population, however, aid goes to areas with lower assets. Last, NCDDP, which emphasizes inclusion and collaboration, supports poorer villages farther away from conflict events.
In contrast, SMU, which considers competition conducive to performance, supports more established areas including villages near conflict zones. Unlike previous studies finding that state-level aid allocations favor the richest, this more fine-grained analysis suggests that a need-based allocation is also in place. The nuances captured in nightlight luminosity are also shown to improve predictions of aid distribution. Synthesizing new sources of data makes it possible to assess area-based interventions in contexts of poverty and conflict where traditional surveys are too costly. CONCLUSION This study draws attention to alternative forms of evidence-based targeting, design, and evaluation of aid from poverty-oriented perspectives. The first chapter reveals that there are 1.5 times more capability-poor countries than income-poor ones, and the capability-poor countries receive marginally higher social sectoral aid relative to economic sector aid. The second chapter finds that the intervention strategies of the revised neoliberal (NCDDP) and the developmental state (SMU) models differ in terms of the main agency of change, the handling of power, and the primary dimension of projects. The third chapter highlights that community development aid in Myanmar flows to villages with low assets but also with higher nightlight luminosity and a lower proportion of vulnerable populations. These three chapters also speak to the evolution of an aid landscape with a distinctive way of delivering aid and generating empirical evidence. This study concludes with a call for both research and practice to return to the basics, and to begin by considering client and user needs. Grounding development policy in more contextualized knowledge, the development community can better serve the “bottom billion.”
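The classification step described in Chapter 1 of the abstract above (placing countries relative to the fitted line of the two poverty measures) could be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say which measure is on which axis or how the line was fitted, so the sketch assumes an ordinary least squares fit of H on P0 and labels a country "capability-poor" when its H lies above the line. The country names and values are hypothetical, not taken from the dissertation.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def classify(countries):
    """Label each country by the sign of its residual from the fitted line.

    countries maps a name to (P0, H), both in percent.
    Assumption: H above the fitted line means "capability-poor",
    otherwise "income-poor" (a point exactly on the line falls in the latter).
    """
    names = list(countries)
    p0 = [countries[n][0] for n in names]
    h = [countries[n][1] for n in names]
    a, b = fit_line(p0, h)
    labels = {}
    for name, x, y in zip(names, p0, h):
        residual = y - (a + b * x)
        labels[name] = "capability-poor" if residual > 0 else "income-poor"
    return labels

# Hypothetical (P0, H) pairs, for illustration only.
toy = {"A": (10.0, 30.0), "B": (20.0, 25.0), "C": (40.0, 60.0), "D": (50.0, 45.0)}
labels = classify(toy)
print(labels)
```

With real data, the size of the residual (not just its sign) would presumably matter, since countries close to the line show little discrepancy between the two measures.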
... In recent years, the application of machine learning has gained traction across various disciplines in the social sciences. For instance, in the field of economics, researchers such as Varian [3], Blumenstock et al. [4], Athey and Imbens [5], and Mullainathan and Spiess [6] have incorporated machine learning methods into their studies. Similarly, in political science, Bonikowski and DiMaggio [7] have explored the use of machine learning techniques. ...
Article
Full-text available
Network analysis aids management in reducing overall expenditures and maintenance workload. Social media platforms frequently use neural networks to suggest material that corresponds with user preferences. Machine learning is one of many methods for social network analysis. Machine learning algorithms operate on a collection of observable features that are taken from user data. Machine learning and neural network-based systems represent a topic of study that spans several fields. Computers can now recognize the emotions behind particular content uploaded by users to social media networks thanks to machine learning. This study examines research on machine learning and neural networks, with an emphasis on social analysis in the context of the current literature.
... Countries can also use ICT to aid poverty data collection. For instance, mobile phones could be used to draw poverty maps, anticipate poverty, and collect data on the poor's socioeconomic status (Blumenstock et al., 2015; Steele et al., 2017). ...
... One potential solution is to infer missing information using machine learning methods. For example, various socio-demographic characteristics were predicted from profile images [3], mobile phone metadata [4], Facebook likes [5], and images of street scenes [6]. ...
Article
Full-text available
In this paper, we develop a machine learning classifier that predicts perceived ethnicity from data on personal names for major ethnic groups populating Russia. We collect data from VK, the largest Russian social media website. Ethnicity was coded from languages spoken by users and their geographical location, with the data manually cleaned by crowd workers. The classifier achieves an accuracy of 0.82 for a scheme with 24 ethnic groups and 0.92 for 15 aggregated ethnic groups. It can be used for research on ethnicity and ethnic relations in Russia with data sets that contain personal names but not ethnicity.
... By combining cell phone signaling data with geospatial data, a log-linear regression model was adopted to estimate the relationship between cell phone signaling data and population, finally yielding the population distribution at a spatial resolution of 100 m × 100 m [31]. An elastic net regression model was able to estimate the distribution of wealth across countries [32]. Based on remote sensing data, a random forest model was applied to predict population density at a spatial resolution of 100 m × 100 m [33]. ...
Article
Full-text available
There is a growing application of machine learning methods to predict socioeconomic and environmental attributes in computational social science, where big data are usually presented in tabular format. However, it is still a challenge to develop novel deep learning models that deal with tabular data, fill missing values, improve prediction accuracy, and enhance interpretability. In this study, we apply a tabular deep learning methodology (TabNet) for the first time to predict socioeconomic and environmental attributes (number of population and companies, volume of consumption, poker players’ behaviors, forest cover, etc.). Furthermore, we develop a new network architecture, referred to as improved TabNet (iTabNet), that can simultaneously learn local and global features in the tabular data to improve prediction accuracy. We also introduce a difference loss to constrain the feature selection process in iTabNet so that the model can use different features at different steps to enhance interpretability. To deal with missing values, we introduce a fusion strategy based on the data mean and an Auto-Encoder network to fill in more reasonable values efficiently. Experimental results demonstrate that the proposed iTabNet achieves competitive performance in predicting socioeconomic and environmental attributes from tabular data, and that iTabNet with the proposed fusion strategy significantly outperforms other machine learning models when tabular data have missing values.
... In recent years, ML has started to be used in academic research and related fields in a number of humanities disciplines, including economics (Varian 2014; Blumenstock et al. 2015; Athey and Imbens 2017; Mullainathan and Spiess 2017), political science (Baldassarri and Goldberg 2014; Bonikowski and DiMaggio 2016), sociology (Barocas and Selbst 2016; Evans and Aceves 2016; Baldassarri and Abascal 2017), communication Services offered by either governmental or commercial organizations (Athey 2017; Berk et al. 2018). Generally speaking, the term "ML" refers to a wide range of techniques and instruments (Kleinberg et al., 2015). ...
Article
Neuroscientists are using artificial neural networks to solve problems in a variety of fields of science and technology, such as predicting chemical reactions and predicting agricultural yields. Machine learning, however, sometimes requires processing huge data sets. In the past, algorithms used by "symbolic artificial intelligence" computers worked by encoding a specific output (also known as a goal) for every feasible input. The main concern of this article is the study of an artificial neural network approach to predicting the likelihood of liver failure with machine learning. In contrast to standard screening methods, a neural network model can detect the likelihood of liver failure in ICU patients much earlier. The performance of the model was externally validated using data on intensive care patients from several sources. Its specificity was 77.5% and its sensitivity was 83.3%. The model identified 83.5% (N=525) of the individuals with liver failure.
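For readers unfamiliar with the two metrics quoted in the abstract above, the following minimal sketch shows how sensitivity and specificity are computed from a confusion matrix. The counts are hypothetical and were chosen only so that the arithmetic lands on the same percentages as the abstract; they are not the study's data, and the function names are illustrative.

```python
def sensitivity(true_pos, false_neg):
    """Share of actual positive cases that the model flags (true positive rate)."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Share of actual negative cases that the model clears (true negative rate)."""
    return true_neg / (true_neg + false_pos)

# Hypothetical counts, for illustration only (not from the study).
sens = sensitivity(true_pos=50, false_neg=10)    # 50 / 60
spec = specificity(true_neg=155, false_pos=45)   # 155 / 200

print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")
```

The two metrics trade off against each other as the decision threshold moves, which is why both are usually reported together.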
... In recent years in various human sciences: economics (Varian 2014, Blumenstock et al. 2015, Athey and Imbens 2017, Mullainathan and Spiess 2017), political science (Goldberg 2014, Bonikowski and DiMaggio 2016), sociology (Barocas and Selbst 2016, Evans and Aceves 2016, Baldassarri and Abascal 2017), communication science (Hopkins and King 2010, Grimmer and Stewart 2013, Bail 2014), etc., ML has started to be applied both in academic research and in areas related to the management of services provided by the public administration (Athey 2017, Berk et al. 2021) or by private companies. ...
Article
Full-text available
Machine learning (ML), and particularly algorithms based on artificial neural networks (ANNs), constitute a field of research lying at the intersection of different disciplines such as mathematics, statistics, computer science and neuroscience. This approach is characterized by the use of algorithms to extract knowledge from large and heterogeneous data sets. In this paper we will focus our attention on its possible applications in the social sciences and, in particular, on its potential in the data analysis procedures. In this regard, we will provide an example of application on sociological data to assess the impact of ML in the study of relationships between variables. Finally, we will compare the potential of ML with traditional data analysis models. Keywords: machine learning, artificial neural networks, supervised learning, linear models, nonlinear models
... Against this backdrop, the past decade has seen an explosion in the availability of vast repositories of digital data, from satellite imagery to call detail records, which are increasingly being analyzed to address socioeconomic challenges (10,11). Encouraged by these approaches, we take advantage of recent advances in deep learning and natural language processing to extract anticipatory signals of food insecurity episodes from the text of a large corpus of news articles. ...
Article
Full-text available
Anticipating food crisis outbreaks is crucial to efficiently allocate emergency relief and reduce human suffering. However, existing predictive models rely on risk measures that are often delayed, outdated, or incomplete. Using the text of 11.2 million news articles focused on food-insecure countries and published between 1980 and 2020, we leverage recent advances in deep learning to extract high-frequency precursors to food crises that are both interpretable and validated by traditional risk indicators. We demonstrate that over the period from July 2009 to July 2020 and across 21 food-insecure countries, news indicators substantially improve the district-level predictions of food insecurity up to 12 months ahead relative to baseline models that do not include text information. These results could have profound implications on how humanitarian aid gets allocated and open previously unexplored avenues for machine learning to improve decision-making in data-scarce environments.
... The evidence of Mobile Money impacts on societal, as opposed to individual, welfare is more varied. Blumenstock et al. (2015), Munyegera and Matsumoto (2014), and Natile (2020) demonstrated the existence of social welfare gains from Mobile Money access via improved access to and price of health services, energy, and sanitation. However, much of this evidence is less econometrically robust than the individual-level studies in terms of dealing with endogeneity biases (Aron, 2018). ...
Chapter
Recent events related to the war in Ukraine have further sparked interest in cryptocurrencies and raised questions about their regulatory oversight. In November 2021, the cryptocurrency market capitalization reached a staggering US$3 trillion, which included more than 8,000 cryptocurrencies, and it is one of the most dynamic markets in the world (Crypto Market Sizing Report, 2022). There is significant exposure of the cryptocurrency ecosystem via a multitude of direct and indirect interlinkages into the banking sector. These include activities such as direct issuance and ownership of cryptocurrencies, intermediation services for customers, clearing of contracts that reference cryptocurrencies, or services for cryptocurrency issuers such as underwriting initial coin offerings or stablecoins. However, since 10 January 2020, all existing businesses carrying on cryptocurrency operations in the United Kingdom (UK) must be compliant with the Money Laundering, Terrorist Financing, and Transfer of Funds Regulations 2017 and be registered with the Financial Conduct Authority (FCA) in order to remain in business. This change of the rules had a significant impact on the cryptocurrency ecosystem and its effect on the banking industry. The FCA also reported that around £60 million was lost due to social media investment scams in 2020. This leads to the main discussion of this chapter: what is the appropriate level of regulation in the UK market in comparison to recent EU regulatory developments? Also, how does the existing regulation affect real-world cases from the UK banking industry, and is there any possible solution for crypto-risk and a way to avoid it?
... Computational methods are used for this purpose (Conte et al. 2012; Edelmann et al. 2020; Salganik 2019). The popularity of CSS rests on its potential to collect and analyze data in innovative ways and on a previously impossible scale (Fu et al. 2022; Salganik 2019). ... (Blumenstock et al. 2015; Fu et al. 2022). ...
Chapter
Full-text available
The increasing availability of digital data and the growing computing power available to analyze it open up opportunities for analyzing the social dimensions of environmental issues. In this chapter, we give an overview of established and emerging computational methods and data types that can support environmental sociology research. We discuss case study examples, present research questions that could be investigated, and discuss open challenges. For example, we discuss how web resources, sensor data, and conventional data types can be analyzed to gain new insights. To this end, we present applications such as computational text analysis, network analysis, and complex systems modeling, among others. We also show how new computational methods can help identify patterns in data that may not be detectable with conventional analysis methods. Overall, computational social science methods offer exciting new opportunities for research in environmental sociology. By providing new data analysis and modeling tools, these methods can support research on a broad range of environmental topics, from climate change and natural resource management to environmental justice and sustainability research.
... Our method may provide an inexpensive solution to risk management for expensive product sales in many developing countries. By combining newly developed approaches, such as proxies based on mobile phone data, imprecise credit scores may provide high accuracy in estimating the probability of payment defaults [7,12,19]. ...
Article
In developing countries, loan sales for expensive products are very common but highly risky because of frequent payment defaults. Because personal credit histories in these countries are imprecise and insufficient compared with those in developed countries, the use of credit scores in loan sales may be very unreliable. Here, to evaluate the reliability of credit scores, we build a simple cost-benefit model of loan management, dividing credit scores into 10 credit score classes (CSCs). As CSC increases, the rough net benefits increase because the total losses caused by payment defaults decrease. The use of credit scores in loan sales is thus found to be highly reliable in developing countries. Combining credit scores with newly developed sources (e.g., mobile phones) may provide highly accurate estimates for loan management.
... Therefore, doctors and nurses would use a ... Big data analysis can provide the ability to monitor the progress toward achieving SDGs by 2030. Big data analysis can be more cost-effective and faster in tracking SDGs than, for example, monitoring poverty by traditional methods such as questionnaires or interviews, which can be ineffective and time-consuming and require significant effort [13,36]. ...
Article
Full-text available
Context: Collecting and analyzing data has become crucial for many sectors, including the health care sector, where a hefty amount of data is generated daily. Over time, the amount and complexity of this data increase substantially. Consequently, it is considered big data that cannot be stored or analyzed conveniently unless advanced technologies are incorporated. The latest advances in technology have opened new opportunities to use big data analysis to track a patient’s record and health. Still, they have also posed new challenges in maintaining data privacy and security in the healthcare sector. Purpose: This systematic review aims to give new researchers insights into big data use in health care systems and its issues, and to advise academics interested in investigating the prospects and tackling the challenges of big data implementation in rising nations like the UAE. This study uses a systematic methodology to examine big data's role and efficacy in UAE health care. Methods: The research follows the methodology of PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) for reporting the reviews and evaluating the randomized trials. Furthermore, the Critical Appraisal Checklist for PRISMA 2009 was applied for the research. Findings: The study concludes that the healthcare systems in the United Arab Emirates can be improved through big data; however, the country authorities must acknowledge that developing efficient frameworks for performance and quality assessment of the new health care system is important. This goal can be achieved by integrating big data and health informatics with the help of IT specialists, health care managers, and stakeholders. Data privacy, data storage, data structure, data ownership, and governance were the most often expressed concerns.
Contribution to knowledge: By discussing numerous issues and presenting solutions linked with big data, the current study contributes substantially to the knowledge of big data and its integration into health care systems in the UAE.
... Using such information, business owners can measure and forecast economic activities through, for instance, an aggregation of customers' visitations to retail stores, the duration of their visits, and the kinds of products that customers are interested in. These datasets could also become important resources for demographic profiling, as patterns of use of mobile phones could be highly correlated with one's socioeconomic status (Blumenstock et al. 2015). ...
... So far, different conclusions have emerged when describing the spatial context of social interactions and, while important strides have been made, explaining how urban demographics and socio-economic indicators relate to mobility still remains a challenge. Early work explicitly shows that existing correlations between mobile phone usage and wealth may be a starting point towards using Information and Communication Technology (ICT) data for planning where sensitive data is available for research [18,19]. When spatial context is explicitly considered, mobility research, produced from different disciplines, seems to indicate that diversity of human trajectories across the city is a conserved trait among social groups sharing similar ... What is now accepted, despite early predictions of a decline in the importance of space with the emergence of information and communications technologies in the sixties [28,29], is that 'real' social interactions connecting and exchanging wisdom, goods and affection are highly relevant to explain the hierarchical patterns of mobility [9]. ...
Article
Full-text available
The relationship between urban mobility, social networks, and socioeconomic status is complex and difficult to apprehend, notably due to the lack of data. Here we use mobile phone data to analyze the socioeconomic structure of spatial and social interaction in the Chilean urban system. Based on the concept of spatial and social events, we develop a methodology to assess the level of spatial and social interactions between locations according to their socioeconomic status. We demonstrate that people with the same socioeconomic status preferentially interact with locations and people with a similar socioeconomic status. We also show that this proximity varies similarly for both spatial and social interactions during the course of the week. Finally, we highlight that these preferential interactions appear to hold when considering city-city interactions.
... Subsequently, suppose that another group of scholars is aiming to speed up the surveying of poverty by combining analog and digital sources such as satellite images (Blumenstock et al., 2015; Jean et al., 2016; Yeh et al., 2020). These images reveal the living conditions of people, as they appear from the sky. ...
... These approaches require a large amount of funding and come with a delay in result delivery. More recent techniques have immediate results and very low maintenance costs; these include estimating from online social networks [30][31][32][33], mobile network data [34][35][36][37] and human mobility indicators [26,38,39]. ...
Preprint
Full-text available
Mobile phones have become an integral part of our lives in the last two decades, leaving a digital trace of our activities and communication. This study aims to develop a data processing framework to evaluate human mobility and socioeconomic status based on call detail records. The methodology proposed first calculates radius of gyration and entropy for each user, then estimates the socioeconomic status by the price and age of the subscribers' phones. Finally, an unsupervised machine learning algorithm was used to group the cells into clusters based on their mobility and socioeconomic metrics. The research showed differences between Buda and Pest during a large scale social event using mobile phone ages and prices. Additionally, the clustering results revealed homogenous groups of cells around Budapest, with similar mobility and socioeconomic metrics. The main conclusion is that mobile network data combined with mobile phone properties offer a useful tool for characterising urban mobility and socioeconomic status.
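The two per-user mobility indicators named in the abstract above, radius of gyration and entropy, can be illustrated with a short sketch. The abstract does not give the exact definitions used, so this sketch assumes the common ones: radius of gyration as the root mean square distance of a user's recorded positions from their centre of mass, and entropy as the Shannon entropy (in bits) of the distribution of visited cells. Positions are assumed to be planar coordinates in kilometres, which sidesteps geodesic distance; the function names and sample records are hypothetical.

```python
import math
from collections import Counter

def radius_of_gyration(points):
    """Root mean square distance of (x, y) positions from their centre of mass."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)

def visit_entropy(cells):
    """Shannon entropy (bits) of how a user's records spread over cells."""
    counts = Counter(cells)
    total = len(cells)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical records for one user: positions in km and the serving cell IDs.
positions = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
cell_ids = ["cell_a", "cell_a", "cell_b", "cell_b"]

rg = radius_of_gyration(positions)
ent = visit_entropy(cell_ids)
print(rg, ent)
```

A user who always appears in the same cell gets zero for both indicators, while a user who splits time evenly across many distant cells scores high on both.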
... Commercially available but, to our knowledge, unvalidated data on purchasing power have become available, but only at a high level of aggregation, at the level of municipalities [31]. Alternative data sources such as detailed data on car ownership [32], mobile phones [33] or social media data [34] or data on specific environmental exposures from high-resolution satellite images [35] might offer opportunities in the future. Finally, the availability of yearly micro censuses allows updating indices more frequently with new data while also improving resolution [36]. ...
Article
Full-text available
Background: The widely used Swiss neighbourhood index of socioeconomic position (Swiss-SEP 1) was based on data from the 2000 national census on rent, household head education and occupation, and crowding. It may now be out of date. Methods: We created a new index (Swiss-SEP 2) based on the 2012-2015 yearly micro censuses that have replaced the decennial house-to-house census in Switzerland since 2010. We used principal component analysis on neighbourhood-aggregated variables and standardised the index. We also created a hybrid version (Swiss-SEP 3), with updated values for neighbourhoods centred on buildings constructed after the year 2000 and original values for the remaining neighbourhoods. Results: A total of 1.54 million neighbourhoods were included. With all three indices, the mean yearly equivalised household income increased from around 52,000 to 90,000 CHF from the lowest to the highest index decile. Analyses of mortality were based on 33.6 million person-years of follow-up. The age- and sex-adjusted hazard ratios of all-cause mortality comparing areas in the lowest Swiss-SEP decile with areas of the highest decile were 1.39 (95% confidence interval [CI] 1.36-1.41), 1.31 (1.29-1.33) and 1.34 (1.32-1.37) using the old, new and hybrid indices, respectively. Discussion: The Swiss-SEP indices capture area-based SEP at a high resolution and allow the study of SEP when individual-level SEP data are missing or area-level effects are of interest. The hybrid version (Swiss-SEP 3) maintains high spatial resolution while adding information on new neighbourhoods. The index will continue to be useful for Switzerland's epidemiological and public health research.
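The index construction mentioned in the Methods above (principal component analysis on neighbourhood-aggregated variables, followed by standardisation) can be sketched in a few lines. This is a simplified illustration, not the Swiss-SEP procedure: it assumes the index is the standardised score on the first principal component of z-scored variables, finds that component by power iteration, and uses made-up neighbourhood values. Power iteration started from a vector of ones is an assumption that works when the variables are positively correlated; variables with opposite orientation (for example crowding) would need to be sign-flipped first.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Standardise each column to mean 0 and population standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

def first_component(z, iters=200):
    """Leading eigenvector of the correlation matrix via power iteration."""
    n, k = len(z), len(z[0])
    corr = [[sum(r[i] * r[j] for r in z) / n for j in range(k)] for i in range(k)]
    v = [1.0] * k
    for _ in range(iters):
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Fix the arbitrary sign so that the first variable loads positively.
    return v if v[0] >= 0 else [-x for x in v]

def sep_index(rows):
    """Standardised first-principal-component score for each neighbourhood."""
    z = zscore_columns(rows)
    v = first_component(z)
    scores = [sum(a * b for a, b in zip(row, v)) for row in z]
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]

# Hypothetical neighbourhood aggregates: (rent, education share, occupation share).
toy = [(800, 0.2, 0.1), (1000, 0.3, 0.2), (1200, 0.4, 0.3), (1400, 0.5, 0.4)]
index = sep_index(toy)
print(index)
```

In practice a numerical library would replace the hand-written power iteration, and the resulting scores would typically be cut into deciles for analyses like the mortality comparison reported in the Results.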
... Second, we can question the fate of demographic categories in social worlds other than online advertising. The worlds of insurance (McFall et al., 2020), recruitment or social treatment policies (Eubanks, 2018), or public statistics in developing countries (Blumenstock et al., 2015) would probably be very fruitful fields of investigation. ...
Article
Recent innovations in online advertising facilitate the use of a wide variety of data sources to build micro-segments of consumers, and delegate the manufacture of audience segments to machine learning algorithms. Both techniques promise to replace demographic targeting, as part of a post-demographic turn driven by big data technologies. This article empirically investigates this transformation in online advertising. We show that targeting categories are assessed along three criteria: efficiency, communicability, and explainability. The relative importance of these objectives helps explain the lasting role of demographic categories, the development of audience segments specific to each advertiser, and the difficulty in generalizing interest categories associated with big data. These results underline the importance of studying the impact of advanced big data and AI technologies in their organizational and professional contexts of appropriation, and of paying attention to the permanence of the categorizations that make the social world intelligible.
... Ibrahim et al. used urban streetscapes to identify informal settlements and slums in cities [27]. Blumenstock et al. used cell phone signaling data to infer the economic status of society [28]. Ta et al. used POI data to study the differences in floating population isolation in urban activity space [29]. ...
Article
Full-text available
Urban poverty is a major obstacle to the healthy development of urbanization. Identifying and mapping urban poverty is of great significance to sustainable urban development. Traditional data and methods cannot measure urban poverty at a fine scale. Besides, existing studies often ignore the impact of the built environment and fail to consider the equal importance of poverty indicators. The emerging multi-source big data provide new opportunities for accurately measuring and monitoring urban poverty. This study aims to map urban poverty space at a fine scale by using multi-source big data, including social sensing and remote sensing data. The urban core of Zhengzhou is selected as the study area. The characteristics of the community’s living environment are quantified by accessibility, block vitality, per unit rent, public service infrastructure, and socio-economic factors. The urban poverty spatial index (SI) model is constructed by using the multiplier index of the factors. The SOM clustering method is employed to identify urban poverty space based on the developed SI. The performance of the proposed SI model is evaluated at the neighborhood scale. The results show that the urban poverty spatial measurement method based on multi-source big data can capture spatial patterns of typical urban poverty with relatively high accuracy. Compared with urban poverty space measured from remote sensing data alone, it considers the built environment and socio-economic factors in identifying inner-city poverty space, and avoids being affected by the texture information of the physical surface of the residential area and the external structure of the buildings. Overall, this study can provide a comprehensive, cost-effective, and efficient method for the refined management of urban poverty space and the improvement of built environment quality.
... These approaches require a large amount of funding and come with a delay in result delivery. More recent techniques have immediate results and very low maintenance costs; these include estimating from online social networks [24][25][26][27], mobile network data [28][29][30][31] and human mobility indicators [19,32,33]. ...
Preprint
Mobile phones have become an integral part of our lives in the last two decades, leaving a digital trace of our activities and communication. This study aims to develop a data processing framework to evaluate human mobility and socioeconomic status based on call detail records. The methodology proposed first calculates radius of gyration and entropy for each user, then estimates the socioeconomic status by the price and age of the subscribers' phones. Finally, an unsupervised machine learning algorithm was used to group the cells into clusters based on their mobility and socioeconomic metrics. The research showed differences between Buda and Pest during a large scale social event using mobile phone ages and prices. Additionally, the clustering results revealed homogenous groups of cells around Budapest, with similar mobility and socioeconomic metrics. The main conclusion is that mobile network data combined with mobile phone properties offer a useful tool for characterising urban mobility and socioeconomic status.
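The two per-user mobility metrics named in this abstract, radius of gyration and entropy, can be illustrated with a minimal sketch. The abstract does not give the exact definitions the authors used, so the code below assumes the common ones: the root-mean-square distance of a user's records from their centroid, and the Shannon entropy of the visited-cell distribution. The function names and the planar (x, y) coordinate input are hypothetical.

```python
import math
from collections import Counter


def radius_of_gyration(points):
    """RMS distance of a user's records from their centroid.

    `points` is a list of (x, y) pairs in planar coordinates, one per
    record (an assumption; real call detail records would first need
    cell locations projected to a planar system).
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    squared = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points)
    return math.sqrt(squared / n)


def location_entropy(cells):
    """Shannon entropy (in bits) of the distribution of visited cells."""
    counts = Counter(cells)
    total = len(cells)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A user who only ever appears in one cell has entropy zero, while a user who splits records evenly between two cells has entropy of one bit.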
... Many researchers and organizations have theorized the potential that Big Data, in particular unstructured data sources such as images, text, and video, holds for meeting these data gaps [2]. Several of these approaches use proprietary sources of data, such as Twitter, Google Maps, proprietary satellite imagery, or mobile phone metadata [3,4,5,6,7]. Researchers have pointed to the risk of using closed, proprietary systems for development purposes [8], leading to expanded interest in the use of open and freely available sources of Big Data [9]. ...
Preprint
Full-text available
Data deprivation, or the lack of easily available and actionable information on the well-being of individuals, is a significant challenge for the developing world and an impediment to the design and operationalization of policies intended to alleviate poverty. In this paper we explore the suitability of data derived from OpenStreetMap to proxy for the location of two crucial public services: schools and health clinics. Thanks to the efforts of thousands of digital humanitarians, online mapping repositories such as OpenStreetMap contain millions of records on buildings and other structures, delineating both their location and often their use. Unfortunately much of this data is locked in complex, unstructured text rendering it seemingly unsuitable for classifying schools or clinics. We apply a scalable, unsupervised learning method to unlabeled OpenStreetMap building data to extract the location of schools and health clinics in ten countries in Africa. We find the topic modeling approach greatly improves performance versus reliance on structured keys alone. We validate our results by comparing schools and clinics identified by our OSM method versus those identified by the WHO, and describe OSM coverage gaps more broadly.
... In addition to the most common tools developed by researchers, machine learning has the ability to discover complex structure not specified in advance, manages to fit complex and very flexible functional forms to the data without simply overfitting, and finds functions that work well out-of-sample (Mullainathan and Spiess, 2017). Machine learning can: manage unconventional data such as satellite images, online posts, reviews or comments provided by people (Henderson et al. (2012); Blumenstock et al. (2015); Glaeser et al. (2018); Kang et al. (2013); Antweiler and Frank (2004)); solve estimation problems (Belloni et al. (2012); Carrasco (2012); Belloni et al. (2016); Chernozhukov et al. (2018); Athey and Imbens (2016)); improve the impact of a policy, predicting who is more likely to belong to groups with a higher payoff and groups where the effect may be zero (Kleinberg et al. (2015); Andini et al. (2018)). ...
Preprint
Full-text available
This paper investigates informality driven by the intense use of temporary and part-time contracts by firms. In the first part, using administrative data released by the Italian Social Security Institute combined with data of firms’ financial statements, I construct an indicator of irregular job implementing LASSO technique. Irregular temporary and part-time contracts are highly correlated with known measures of undeclared work provided, with a stronger incidence in Southern Regions, in Services and Commerce and of female part-time workers. In the second part, I investigate if irregular contracts are effectively used to hide work that otherwise would be completely undeclared. First, I show with an event study analysis that lower EPL induce firms to fire irregular workers more easily, part-time contracts in sectors of Services and Commerce. Second, I investigate how less financial development positively impact the number of irregular part-time workers, especially if they are female workers again in Services and Commerce. JEL Classification Numbers: H26, H29.
... This finding builds on a literature that has examined the relationship between income and religiosity, both across countries (McCleary and Barro, 2006) and within countries (Chen, 2010). 4 Our findings also fit into a broader 3 Our approach also engages with a broader literature that explores how new sources of 'big' data can be used to measure social and economic behavior -including wealth and poverty (Blumenstock, 2018b;Blumenstock, Cadamuro, and On, 2015;Jean et al., 2016), GDP (Henderson, Storeygard, and Weil, 2012), credit-worthiness (Björkegren and Grissen, 2018), and unemployment (Toole et al., 2015). 4 McCleary and Barro (2006) find that countries with higher income per capita have lower levels of religiosity, based on measures such as church and mosque attendance from the World Values Survey. ...
Chapter
Providing citizens access to information and communications technologies, as suggested in the introduction, is a complex, multi-layered process. The transition from policy to implementation often requires an extended period. Policies must be written, laws must be passed, and regulations must be enforced. Once the governing apparatus of a nation decides that it wants to distribute ICTs broadly, it must then determine how to implement that decision, which requires interaction between the private sector, the public sector, and the nonprofit sector.1 Developing ICT infrastructure requires legal, physical, institutional, and human resources. This chapter provides an overview of the development and interaction of these three interwoven elements of ICT in East Africa, and then proceeds to provide a more detailed discussion of how these elements look in each of the four countries under examination.
Article
Recent years have seen increased interest in the use of alternative data sources in the definition and production of official statistics and indicators for the UN Sustainable Development Goals. In this paper, we consider the application of data science to the production of official statistics, illustrating our perspective through the use of poverty targeting as an application. We show that machine learning can play a central role in the generation of official statistics, combining a variety of types of data (survey, administrative and alternative). We focus on the problem of poverty targeting using the Proxy Means Test in Indonesia, comparing a number of existing statistical and machine learning methods, then introducing new approaches in the spirit of small area estimation that utilize area-level features and data augmentation at the subdistrict level to develop more refined models at the district level, evaluating the methods on three districts in Indonesia on the problem of estimating 2020 per capita household expenditure using data from 2016–2019. The best performing method, XGBoost, is able to reduce inclusion/exclusion errors on the problem of identifying the poorest 40% of the population in comparison to the commonly used Ridge Regression method by between 4.5% and 13.9% in the districts studied.
Chapter
Flying into space, developing virtual currencies, developing efficient electric vehicles, and many other new technologies are improving humankind’s life on Earth. But, in today’s world, how to define disruptive technologies? Deep tech or hard tech refers to the type of organization, typically a startup, that develops these disruptive technologies.
Article
Purpose This paper aims to examine the nature and extent of disclosure on the use of big data by online platform companies and how these disclosures address and discharge stakeholder accountability. Design/methodology/approach Content analysis of annual reports and data policy documents of 100 online platform companies were used for this study. More specifically, the study develops a comprehensive big data disclosure framework to assess the nature and extent of disclosures provided in corporate reports. This framework also assists in evaluating the effect of the size of the company, industry and country in which they operate on disclosures. Findings The analysis reveals that most companies made limited disclosure on how they manage big data. Only two of the 100 online platform companies have provided moderate disclosures on big data related issues. The focus of disclosure by the online platform companies is more on data regulation compliance and privacy protection, but significantly less on the accountability and ethical issues of big data use. More specifically, critical issues, such as stakeholder engagement, breaches of customer information and data reporting and controlling mechanisms are largely overlooked in current disclosures. The analysis confirms that current attention has been predominantly given to powerful stakeholders such as regulators as a result of compliance pressure while the accountability pressure has yet to keep up the pace. Research limitations/implications The study findings may be limited by the use of a new accountability disclosure index and the specific focus on online platform companies. Practical implications Although big data permeates, the number of users and uses grow and big data use has become more ingrained into society, this study provides evidence that ethical and accountability issues persist, even among the largest online companies. 
The findings of this study improve the understanding of the current state of online companies’ reporting practices on big data use, particularly the issues and gaps in the reporting process, which will help policymakers and standard setters develop future data disclosure policies. Social implications From these findings, the study improves the understanding of the current state of online companies’ reporting practices on big data use, particularly the issues and gaps in the reporting process – which are helpful for policymakers and standard setters to develop data disclosure policies. Originality/value This study provides an analysis of ethical and social issues surrounding big data accountability, an emerging but increasingly important area that needs urgent attention and more research. It also adds a new disclosure dimension to the existing accountability literature and provides practical suggestions to balance the interaction between online platform companies and their stakeholders to promote the responsible use of big data.
Article
This paper uses the universe of cellphone records from a Chinese telecommunication provider for a northern Chinese city to examine the role of information exchange in urban labor markets. We provide the first direct evidence of increased communication among referral pairs around job changes. Information provided by social contacts mitigates information asymmetry and improves labor market performance. (JEL D82, J62, O18, P23, P25, R23, Z13)
Chapter
Financial inclusion is identified as a driver of sustainable development and a necessary condition for social and economic development. Yet, most people in developing countries face numerous constraints and barriers that exclude them from the financial system. The recent fintech developments and their ability to reduce these constraints promise to be a potentially useful strategy to enhance financial inclusion. In this chapter, we discuss financial inclusion and provide insights into the current state of fintech in the developing world particularly, Mobile Money—on financial inclusion and development outcomes. Being global leaders in mobile money, Sub-Saharan Africa have set the standards for other developing countries to replicate for enhanced financial inclusion. We also show that the implications and impacts of different forms of fintech on financial inclusion, and through financial inclusion on social and economic outcomes, represent one of the most exciting and important research frontiers in the field of development finance. The rapid evolvement of fintech products however poses regulatory challenges and calls for careful assessment of regulatory approaches for instance innovation offices, regulatory sandboxes, and RegTechs in regulating the financial ecosystem.
Chapter
Full-text available
The goal of this chapter is to survey the recent applications of big data in economics and finance. An important advantage of these large alternative datasets is that they provide very detailed information about economic behaviour and decisions which has spurred research aiming at answering long-standing economic questions. Another relevant characteristic of these datasets is that they might be available in real time, a property that can be used to construct economic indicators at high frequencies. Overall, big alternative datasets have the potential to make an impact on economic research and policy and to complement the information used by governmental agencies to produce the official statistics.
Article
Full-text available
The ongoing Ebola outbreak is taking place in one of the most highly connected and densely populated regions of Africa (Figure 1A). Accurate information on population movements is valuable for monitoring the progression of the outbreak and predicting its future spread, facilitating the prioritization of interventions and designing surveillance and containment strategies. Vital questions include how the affected regions are connected by population flows, which areas are major mobility hubs, what types of movement typologies exist in the region, and how all of these factors are changing as people react to the outbreak and movement restrictions are put in place. Just a decade ago, obtaining detailed and comprehensive data to answer such questions over this huge region would have been impossible. Today, such valuable data exist and are collected in real-time, but largely remain unused for public health purposes - stored on the servers of mobile phone operators. In this commentary, we outline the utility of CDRs for understanding human mobility in the context of the Ebola, and highlight the need to develop protocols for rapid sharing of operator data in response to public health emergencies.
Article
Full-text available
Large-scale data sets of human behavior have the potential to fundamentally transform the way we fight diseases, design cities, or perform research. Metadata, however, contain sensitive information. Understanding the privacy of these data sets is key to their broad use and, ultimately, their impact. We study 3 months of credit card records for 1.1 million people and show that four spatiotemporal points are enough to uniquely reidentify 90% of individuals. We show that knowing the price of a transaction increases the risk of reidentification by 22%, on average. Finally, we show that even data sets that provide coarse information at any or all of the dimensions provide little anonymity and that women are more reidentifiable than men in credit card metadata. Copyright © 2015, American Association for the Advancement of Science.
Article
Full-text available
Significance Knowing where people are is critical for accurate impact assessments and intervention planning, particularly those focused on population health, food security, climate change, conflicts, and natural disasters. This study demonstrates how data collected by mobile phone network operators can cost-effectively provide accurate and detailed maps of population distribution over national scales and any time period while guaranteeing phone users’ privacy. The methods outlined may be applied to estimate human population densities in low-income countries where data on population distributions may be scarce, outdated, and unreliable, or to estimate temporal variations in population density. The work highlights how facilitating access to anonymized mobile phone data might enable fast and cheap production of population maps in emergency and data-scarce situations.
Article
Full-text available
In this study we analyze the travel patterns of 500,000 individuals in Cote d'Ivoire using mobile phone call data records. By measuring the uncertainties of movements using entropy, considering both the frequencies and temporal correlations of individual trajectories, we find that the theoretical maximum predictability is as high as 88%. To verify whether such a theoretical limit can be approached, we implement a series of Markov chain (MC) based models to predict the actual locations visited by each user. Results show that MC models can produce a prediction accuracy of 87% for stationary trajectories and 95% for non-stationary trajectories. Our findings indicate that human mobility is highly dependent on historical behaviors, and that the maximum predictability is not only a fundamental theoretical limit for potential predictive power, but also an approachable target for actual prediction accuracy.
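As an illustration of the Markov chain idea in this abstract, the sketch below fits a first-order chain to one user's sequence of visited locations and predicts the most frequent successor of the current location. The abstract mentions a series of MC-based models without specifying them, so the first-order choice and the function names here are assumptions, not the authors' implementation.

```python
from collections import Counter, defaultdict


def fit_transitions(trajectory):
    """Count observed transitions between consecutive locations."""
    transitions = defaultdict(Counter)
    for current, nxt in zip(trajectory, trajectory[1:]):
        transitions[current][nxt] += 1
    return transitions


def predict_next(transitions, current):
    """Return the most frequent successor of `current`, or None if unseen."""
    if current not in transitions:
        return None
    return transitions[current].most_common(1)[0][0]


# Hypothetical trajectory of cell labels for one user.
trajectory = ["home", "work", "home", "work", "home", "shop"]
model = fit_transitions(trajectory)
```

Prediction accuracy could then be estimated by fitting on an early part of each trajectory and scoring the predictions on the remainder.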
Article
Full-text available
Understanding the causes and effects of internal migration is critical to the effective design and implementation of policies that promote human development. However, a major impediment to deepening this understanding is the lack of reliable data on the movement of individuals within a country. Government censuses and household surveys, from which most migration statistics are derived, are difficult to coordinate and costly to implement, and typically do not capture the patterns of temporary and circular migration that are prevalent in developing economies. In this paper, we describe how new information and communications technologies (ICTs), and mobile phones in particular, can provide a new source of data on internal migration. As these technologies quickly proliferate throughout the developing world, billions of individuals are now carrying devices from which it is possible to reconstruct detailed trajectories through time and space. Using Rwanda as a case study, we demonstrate how such data can be used in practice. We develop and formalize the concept of inferred mobility, and compute this and other metrics on a large data set containing the phone records of 1.5 million Rwandans over four years. Our empirical results corroborate the findings of a recent government survey that notes relatively low levels of permanent migration in Rwanda. However, our analysis reveals more subtle patterns that were not detected in the government survey. Namely, we observe high levels of temporary and circular migration, and note significant heterogeneity in mobility within the Rwandan population. Our goals in this research are thus twofold. First, we intend to provide a new quantitative perspective on certain patterns of internal migration in Rwanda that are unobservable using standard survey techniques. Second, we seek to contribute to the broader literature by illustrating how new forms of ICT can be used to better understand the behavior of individuals in developing countries.
Article
Full-text available
The article introduces a model for the location of meaningful places for mobile telephone users, such as home and work anchor points, using passive mobile positioning data. Passive mobile positioning data is secondary data concerning the location of call activities or handovers in network cells that is automatically stored in the memory of service providers. This data source offers good potential for the monitoring of the geography and mobility of the population, since mobile phones are widespread, and similar standardized data can be used around the globe. We developed the model and tested it with 12 months' data collected by EMT, Estonia's largest mobile service provider, covering more than 0.5 million anonymous respondents. Modeling results were compared with population register data; this revealed that the developed model described the geography of the population relatively well, and can hence be used in geographical and urban studies. This approach also has potential for the development of location-based services such as targeting services or geographical infrastructure.
Article
Full-text available
Finite automata are considered in this paper as instruments for classifying finite tapes. Each one-tape automaton defines a set of tapes, a two-tape automaton defines a set of pairs of tapes, et cetera. The structure of the defined sets is studied. Various generalizations of the notion of an automaton are introduced and their relation to the classical automata is determined. Some decision problems concerning automata are shown to be solvable by effective algorithms; others turn out to be unsolvable by algorithms.
Article
Full-text available
We develop a statistical framework to use satellite data on night lights to augment official income growth measures. For countries with poor national income accounts, the optimal estimate of growth is a composite with roughly equal weights on conventionally measured growth and growth predicted from lights. Our estimates differ from official data by up to three percentage points annually. Using lights, empirical analyses of growth need no longer use countries as the unit of analysis; we can measure growth for sub- and supranational regions. We show, for example, that coastal areas in sub-Saharan Africa are growing slower than the hinterland. (JEL E01, E23, O11, O47, O57)
Article
Full-text available
Model selection strategies for machine learning algorithms typically involve the numerical optimisation of an appropriate model selection criterion, often based on an estimator of generalisation performance, such as k-fold cross-validation. The error of such an estimator can be broken down into bias and variance components. While unbiasedness is often cited as a beneficial quality of a model selection criterion, we demonstrate that a low variance is at least as important, as a non-negligible variance introduces the potential for over-fitting in model selection as well as in training the model. While this observation is in hindsight perhaps rather obvious, the degradation in performance due to over-fitting the model selection criterion can be surprisingly large, an observation that appears to have received little attention in the machine learning literature to date. In this paper, we show that the effects of this form of over-fitting are often of comparable magnitude to differences in performance between learning algorithms, and thus cannot be ignored in empirical evaluation. Furthermore, we show that some common performance evaluation practices are susceptible to a form of selection bias as a result of this form of over-fitting and hence are unreliable. We discuss methods to avoid over-fitting in model selection and subsequent selection bias in performance evaluation, which we hope will be incorporated into best practice. While this study concentrates on cross-validation based model selection, the findings are quite general and apply to any model selection practice involving the optimisation of a model selection criterion evaluated over a finite sample of data, including maximisation of the Bayesian evidence and optimisation of performance bounds.
Article
Full-text available
Despite recent advances in uncovering the quantitative features of stationary human activity patterns, many applications, from pandemic prediction to emergency response, require an understanding of how these patterns change when the population encounters unfamiliar conditions. To explore societal response to external perturbations we identified real-time changes in communication and mobility patterns in the vicinity of eight emergencies, such as bomb attacks and earthquakes, comparing these with eight non-emergencies, like concerts and sporting events. We find that communication spikes accompanying emergencies are both spatially and temporally localized, but information about emergencies spreads globally, resulting in communication avalanches that engage in a significant manner the social network of eyewitnesses. These results offer a quantitative view of behavioral changes in human activity under extreme conditions, with potential long-term impact on emergency detection and response.
Article
Full-text available
Small area estimation is becoming important in survey sampling due to a growing demand for reliable small area statistics from both public and private sectors. It is now widely recognized that direct survey estimates for small areas are likely to yield unacceptably large standard errors due to the smallness of sample sizes in the areas. This makes it necessary to "borrow strength" from related areas to find more accurate estimates for a given area or, simultaneously, for several areas. This has led to the development of alternative methods such as synthetic, sample size dependent, empirical best linear unbiased prediction, empirical Bayes and hierarchical Bayes estimation. The present article is largely an appraisal of some of these methods. The performance of these methods is also evaluated using some synthetic data resembling a business population. Empirical best linear unbiased prediction as well as empirical and hierarchical Bayes, for most purposes, seem to have a distinct advantage over other methods.
Article
Full-text available
A field is emerging that leverages the capacity to collect and analyze data at a scale that may reveal patterns of individual and group behaviors.
Article
Full-text available
The rich set of interactions between individuals in society results in complex community structure, capturing highly connected circles of friends, families or professional cliques in a social network. Thanks to frequent changes in the activity and communication patterns of individuals, the associated social and communication network is subject to constant evolution. Our knowledge of the mechanisms governing the underlying community dynamics is limited, but is essential for a deeper understanding of the development and self-optimization of society as a whole. We have developed an algorithm based on clique percolation that allows us to investigate the time dependence of overlapping communities on a large scale, and thus uncover basic relationships characterizing community evolution. Our focus is on networks capturing the collaboration between scientists and the calls between mobile phone users. We find that large groups persist for longer if they are capable of dynamically altering their membership, suggesting that an ability to change the group composition results in better adaptability. The behaviour of small groups displays the opposite tendency: the condition for stability is that their composition remains unchanged. We also show that knowledge of the time commitment of members to a given community can be used for estimating the community's lifetime. These findings offer insight into the fundamental differences between the dynamics of small groups and large institutions.
Article
Full-text available
Electronic databases, from phone to e-mails logs, currently provide detailed records of human communication patterns, offering novel avenues to map and explore the structure of social and communication networks. Here we examine the communication patterns of millions of mobile phone users, allowing us to simultaneously study the local and the global structure of a society-wide communication network. We observe a coupling between interaction strengths and the network's local structure, with the counterintuitive consequence that social networks are robust to the removal of the strong ties but fall apart after a phase transition if the weak ties are removed. We show that this coupling significantly slows the diffusion process, resulting in dynamic trapping of information in communities and find that, when it comes to information diffusion, weak and strong ties are both simultaneously ineffective. • complex systems • complex networks • diffusion and spreading • phase transition • social systems
Article
Full-text available
Despite their importance for urban planning, traffic forecasting and the spread of biological and mobile viruses, our understanding of the basic laws governing human motion remains limited owing to the lack of tools to monitor the time-resolved location of individuals. Here we study the trajectory of 100,000 anonymized mobile phone users whose position is tracked for a six-month period. We find that, in contrast with the random trajectories predicted by the prevailing Lévy flight and random walk models, human trajectories show a high degree of temporal and spatial regularity, each individual being characterized by a time-independent characteristic travel distance and a significant probability to return to a few highly frequented locations. After correcting for differences in travel distances and the inherent anisotropy of each trajectory, the individual travel patterns collapse into a single spatial probability distribution, indicating that, despite the diversity of their travel history, humans follow simple reproducible patterns. This inherent similarity in travel patterns could impact all phenomena driven by human mobility, from epidemic prevention to emergency response, urban planning and agent-based modelling.
Article
Full-text available
Novel aspects of human dynamics and social interactions are investigated by means of mobile phone data. Using extensive phone records resolved in both time and space, we study the mean collective behavior at large scales and focus on the occurrence of anomalous events. We discuss how these spatiotemporal anomalies can be described using standard percolation theory tools. We also investigate patterns of calling activity at the individual level and show that the interevent time of consecutive calls is heavy-tailed. This finding, which has implications for dynamics of spreading phenomena in social networks, agrees with results previously reported on other human activities. Comment: 16 pages, 7 figures; minor changes. To appear in J. Phys. A
Book
The methodology used to construct tree structured rules is the focus of this monograph. Unlike many other statistical procedures, which moved from pencil and paper to calculators, this text's use of trees was unthinkable before computers. Both the practical and theoretical sides have been developed in the authors' study of tree methods. Classification and Regression Trees reflects these two sides, covering the use of trees as a data analysis method, and in a more mathematical framework, proving some of their fundamental properties.
Article
The prevalence of mobile communication in the developing world is ever increasing, with 89 active subscriptions now per 100 inhabitants. With this access comes the potential for unprecedented insights into individuals and societies, such as migration patterns, economic transactions, and even importation routes of infectious diseases like Ebola. However, the absence of a common framework for sharing mobile phone data in privacy-conscientious ways and an uncertain regulatory landscape have made it difficult for scientists to use these powerful data.
Article
Election forecasts have traditionally been based on representative polls, in which randomly sampled individuals are asked who they intend to vote for. While representative polling has historically proven to be quite effective, it comes at considerable costs of time and money. Moreover, as response rates have declined over the past several decades, the statistical benefits of representative sampling have diminished. In this paper, we show that, with proper statistical adjustment, non-representative polls can be used to generate accurate election forecasts, and that this can often be achieved faster and at a lesser expense than traditional survey methods. We demonstrate this approach by creating forecasts from a novel and highly non-representative survey dataset: a series of daily voter intention polls for the 2012 presidential election conducted on the Xbox gaming platform. After adjusting the Xbox responses via multilevel regression and poststratification, we obtain estimates which are in line with the forecasts from leading poll analysts, which were based on aggregating hundreds of traditional polls conducted during the election cycle. We conclude by arguing that non-representative polling shows promise not only for election forecasting, but also for measuring public opinion on a broad range of social, economic and cultural issues.
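The poststratification step mentioned in this abstract can be illustrated with a minimal sketch. It assumes the per-cell estimates are already available (in the paper they come from a multilevel regression, which is not reproduced here); the cell names, sample sizes, and population counts below are hypothetical.

```python
def poststratify(cell_estimates, population_counts):
    """Weight each cell's estimate by that cell's share of the target population."""
    total = sum(population_counts[cell] for cell in cell_estimates)
    if total <= 0:
        raise ValueError("population counts must sum to a positive number")
    weighted = sum(cell_estimates[cell] * population_counts[cell] for cell in cell_estimates)
    return weighted / total


# Hypothetical two-cell example: the sample over-represents the "under_30" cell.
cell_estimates = {"under_30": 0.70, "30_and_over": 0.40}   # estimated support per cell
sample_sizes = {"under_30": 90, "30_and_over": 10}         # respondents per cell
population_counts = {"under_30": 200, "30_and_over": 800}  # known population per cell

# Unadjusted estimate: weights cells by how often they appear in the sample.
raw = sum(cell_estimates[c] * sample_sizes[c] for c in cell_estimates) / sum(sample_sizes.values())

# Adjusted estimate: weights cells by how often they appear in the population.
adjusted = poststratify(cell_estimates, population_counts)

print(round(raw, 2), round(adjusted, 2))
```

With these made-up numbers the unadjusted sample mean is 0.67, while reweighting to the population gives 0.46, which shows how strongly a skewed sample can distort an unadjusted estimate.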
Article
Large errors in flu prediction were largely avoidable, which offers lessons for the use of big data.
Article
The Suomi National Polar-Orbiting Partnership (NPP) satellite was launched on 28 October 2011, heralding the next generation of operational U.S. polar-orbiting satellites. It carries the Visible–Infrared Imaging Radiometer Suite (VIIRS), a 22-band visible/infrared sensor that combines many of the best aspects of the NOAA Advanced Very High Resolution Radiometer (AVHRR), the Defense Meteorological Satellite Program (DMSP) Operational Linescan System (OLS), and the National Aeronautics and Space Administration (NASA) Moderate Resolution Imaging Spectroradiometer (MODIS) sensors. VIIRS has nearly all the capabilities of MODIS, but offers a wider swath width (3,000 versus 2,330 km) and much higher spatial resolution at swath edge. VIIRS also has a day/night band (DNB) that is sensitive to very low levels of visible light at night such as those produced by moonlight reflecting off low clouds, fog, dust, ash plumes, and snow cover. In addition, VIIRS detects light emissions from cities, ships, oil flares, and lightning flashes. NPP crosses the equator at about 0130 and 1330 local time, with VIIRS covering the entire Earth twice daily. Future members of the Joint Polar Satellite System (JPSS) constellation will also carry VIIRS. This paper presents dramatic early examples of multispectral VIIRS imagery capabilities and demonstrates basic applications of that imagery for a wide range of operational users, such as for fire detection, monitoring ice break up in rivers, and visualizing dust plumes over bright surfaces. VIIRS imagery, both single and multiband, as well as the day/night band, is shown to exceed both requirements and expectations.
Article
The ubiquitous presence of cell phones in emerging economies has brought about a wide range of cell phone-based services for low-income groups. Oftentimes, the success of such technologies depends heavily on their adaptation to the needs and habits of each social group. To understand how cell phones are being used by citizens in an emerging economy, we present a large-scale study of the relationship between specific socio-economic factors and the way people use cell phones in an emerging economy in Latin America. We propose a novel analytical approach that combines large-scale datasets of cell phone records with countrywide census data to reveal findings at a national level. Our main results show correlations between socio-economic levels and social network or mobility patterns, among others. We also provide analytical models to accurately approximate census variables from cell phone records with R² ≈ 0.82.
Article
A training set of data has been used to construct a rule for predicting future responses. What is the error rate of this rule? This is an important question both for comparing models and for assessing a final selected model. The traditional answer to this question is given by cross-validation. The cross-validation estimate of prediction error is nearly unbiased but can be highly variable. Here we discuss bootstrap estimates of prediction errors, which can be thought of as smoothed versions of cross-validation. We show that a particular bootstrap method, the .632+ rule, substantially outperforms cross-validation in a catalog of 24 simulation experiments. Besides providing point estimates, we also consider estimating the variability of an error rate estimate. All of the results here are nonparametric and apply to any possible prediction rule; however, we study only classification problems with 0-1 loss in detail. Our simulations include “smooth” prediction rules like Fisher’s linear discriminant function and unsmooth ones like nearest neighbors.
Article
  We propose the elastic net, a new regularization and variable selection method. Real world data and a simulation study show that the elastic net often outperforms the lasso, while enjoying a similar sparsity of representation. In addition, the elastic net encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together. The elastic net is particularly useful when the number of predictors (p) is much bigger than the number of observations (n). By contrast, the lasso is not a very satisfactory variable selection method in the p≫n case. An algorithm called LARS-EN is proposed for computing elastic net regularization paths efficiently, much like algorithm LARS does for the lasso.
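For orientation, the penalized least-squares criterion behind the elastic net can be written as below. This is the standard "naive" form, combining the lasso (L1) and ridge (squared L2) penalties; the symbols y, X, β, λ₁, and λ₂ are notation introduced here, not taken from the abstract.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
  \lVert y - X\beta \rVert_2^2
  \;+\; \lambda_2 \lVert \beta \rVert_2^2
  \;+\; \lambda_1 \lVert \beta \rVert_1 ,
\qquad \lambda_1,\ \lambda_2 \ge 0 .
```

Setting λ₂ = 0 recovers the lasso and setting λ₁ = 0 recovers ridge regression; the ridge term is what produces the grouping effect among strongly correlated predictors described in the abstract.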
Article
Recent theoretical advances have brought income and wealth distributions back into a prominent position in growth and development theories, and as determinants of specific socio-economic outcomes, such as health or levels of violence. Empirical investigation of the importance of these relationships, however, has been held back by the lack of sufficiently detailed high quality data on distributions. Household surveys that include reasonable measures of income or consumption can be used to calculate distributional measures but at low levels of aggregation these samples are rarely representative or of sufficient size to yield statistically reliable estimates. At the same time, census (or other large sample) data of sufficient size to allow disaggregation either have no information about income or consumption, or measure these variables poorly. This note outlines a statistical procedure to combine these types of data to take advantage of the detail in household sample surveys and the co...
Conference Paper
Most correlation clustering algorithms rely on principal component analysis (PCA) as a correlation analysis tool. The correlation of each cluster is learned by applying PCA to a set of sample points. Since PCA is rather sensitive to outliers, if a small fraction of these points does not correspond to the correct correlation of the cluster, the algorithms are usually misled or even fail to detect the correct results. In this paper, we evaluate the influence of outliers on PCA and propose a general framework for increasing the robustness of PCA in order to determine the correct correlation of each cluster. We further show how our framework can be applied to PCA-based correlation clustering algorithms. A thorough experimental evaluation shows the benefit of our framework on several synthetic and real-world data sets.
Book
During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It is a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting, the first comprehensive treatment of this topic in any book. This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression and path algorithms for the lasso, non-negative matrix factorization, and spectral clustering. There is also a chapter on methods for "wide" data (p bigger than n), including multiple testing and false discovery rates. Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie co-developed much of the statistical modeling software and environment in R/S-PLUS and invented principal curves and surfaces. Tibshirani proposed the lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, projection pursuit and gradient boosting.
Article
A pervasive issue in social and environmental research has been how to improve the quality of socioeconomic data in developing countries. Given the shortcomings of standard sources, the present study examines luminosity (measures of nighttime lights visible from space) as a proxy for standard measures of output (gross domestic product). We compare output and luminosity at the country level and at the 1° latitude × 1° longitude grid-cell level for the period 1992-2008. We find that luminosity has informational value for countries with low-quality statistical systems, particularly for those countries with no recent population or economic censuses.
Article
Massive increases in the availability of informative social science data are making dramatic progress possible in analyzing, understanding, and addressing many major societal problems. Yet the same forces pose severe challenges to the scientific infrastructure supporting data sharing, data management, informatics, statistical methodology, and research ethics and policy, and these are collectively holding back progress. I address these changes and challenges and suggest what can be done.
Article
In developing countries, identifying the poor for redistribution or social insurance is challenging because the government lacks information about people’s incomes. This paper reports the results of a field experiment conducted in 640 Indonesian villages that investigated two main approaches to solving this problem: proxy-means tests, where a census of hard-to-hide assets is used to predict consumption, and community-based targeting, where villagers rank everyone on a scale from richest to poorest. When poverty is defined using per-capita expenditure and the common PPP$2 per day threshold, we find that community-based targeting performs worse in identifying the poor than proxy-means tests, particularly near the threshold. This worse performance does not appear to be due to elite capture. Instead, communities appear to be using a different concept of poverty: the results of community-based methods are more correlated with how individual community members rank each other and with villagers’ self-assessments of their own status than per-capita expenditure. Consistent with this, the community-based methods result in higher satisfaction with beneficiary lists and the targeting process.
Article
Social networks form the backbone of social and economic life. Until recently, however, data have not been available to study the social impact of a national network structure. To that end, we combined the most complete record of a national communication network with national census data on the socioeconomic well-being of communities. These data make possible a population-level investigation of the relation between the structure of social networks and access to socioeconomic opportunity. We find that the diversity of individuals’ relationships is strongly correlated with the economic development of communities.
Article
This paper has an empirical and overtly methodological goal. The authors propose and defend a method for estimating the effect of household economic status on educational outcomes without direct survey information on income or expenditures. They construct an index based on indicators of household assets, solving the vexing problem of choosing the appropriate weights by allowing them to be determined by the statistical procedure of principal components. While the data for India cannot be used to compare alternative approaches, they use data from Indonesia, Nepal, and Pakistan, which have both expenditures and asset variables for the same households. With these data the authors show that not only is there a correspondence between a classification of households based on the asset index and consumption expenditures but also that the evidence is consistent with the asset index being a better proxy for predicting enrollments (apparently less subject to measurement error for this purpose) than consumption expenditures. The relationship between household wealth and educational enrollment of children can be estimated without expenditure data. A method for doing so, which uses an index based on household asset ownership indicators, is proposed and defended in this paper. In India, children from the wealthiest households are over 30 percentage points more likely to be in school than those from the poorest households.
Article
Using data from India, we estimate the relationship between household wealth and children's school enrollment. We proxy wealth by constructing a linear index from asset ownership indicators, using principal-components analysis to derive weights. In Indian data this index is robust to the assets included, and produces internally coherent results. State-level results correspond well to independent data on per capita output and poverty. To validate the method and to show that the asset index predicts enrollments as accurately as expenditures, or more so, we use data sets from Indonesia, Pakistan, and Nepal that contain information on both expenditures and assets. The results show large, variable wealth gaps in children's enrollment across Indian states. On average a "rich" child is 31 percentage points more likely to be enrolled than a "poor" child, but this gap varies from only 4.6 percentage points in Kerala to 38.2 in Uttar Pradesh and 42.6 in Bihar.
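The asset-index construction described in this abstract can be sketched as follows. This is a minimal illustration assuming binary ownership indicators and weights taken from the first principal component of the standardized indicators; the function name `asset_index` and the toy household data are hypothetical, and the authors' actual procedure may differ in its details.

```python
import numpy as np


def asset_index(assets):
    """Return (weights, scores) for a first-principal-component asset index."""
    X = np.asarray(assets, dtype=float)
    std = X.std(axis=0)
    if np.any(std == 0):
        raise ValueError("every asset indicator must vary across households")
    # Standardize each indicator so that its weight does not depend on its scale.
    Z = (X - X.mean(axis=0)) / std
    # Eigen-decomposition of the correlation matrix; eigh sorts eigenvalues ascending,
    # so the last eigenvector is the first principal component.
    corr = np.corrcoef(Z, rowvar=False)
    _, eigvecs = np.linalg.eigh(corr)
    w = eigvecs[:, -1]
    # The sign of an eigenvector is arbitrary; orient it so more assets means a higher index.
    if w.sum() < 0:
        w = -w
    return w, Z @ w


# Hypothetical data: 6 households x 3 indicators (radio, bicycle, television).
toy = np.array([
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
])
weights, scores = asset_index(toy)
print(np.round(weights, 3))
print(np.round(scores, 3))
```

Households can then be ranked or split into quantiles by `scores`; in this toy example a household owning all three assets ranks above one owning none.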
Article
This paper presents new data on poverty, inequality, and growth in those developing countries of the world for which the requisite statistics are available. Economic growth is found generally but not always to reduce poverty. Growth, however, is found to have very little to do with income inequality. Thus the "economic laws" linking the rate of growth and the distribution of benefits receive only very tenuous empirical support here. © 1989 The International Bank for Reconstruction and Development/The World Bank.
Article
An analyst using household survey data to construct a welfare metric is often confronted with a number of theoretical and practical problems. What components should be included in the overall welfare measure? Should differences in tastes be taken into account when making comparisons across people and households? How best should differences in cost-of-living and household composition be taken into consideration? Starting with a brief review of the theoretical framework underpinning typical welfare analysis undertaken based on household survey data, this paper provides some practical guidelines and advice on how best to tackle such problems. It outlines a three-part procedure for constructing a consumption-based measure of individual welfare: (i) aggregation of different components of household consumption to construct a nominal consumption aggregate, (ii) construction of price indices to adjust for differences in prices faced by households, and (iii) adjustment of the real consumption aggregate for differences in household composition. Examples based on survey data from eight countries (Ghana, Vietnam, Nepal, the Kyrgyz Republic, Ecuador, South Africa, Panama, and Brazil) are used to illustrate the various steps involved in constructing the welfare measure, and the STATA programs used for this purpose are provided in the appendix. The paper also includes examples of some analytic techniques that can be used to examine the robustness of the estimated welfare measure to underlying assumptions.
G. S. Fields, World Bank Res. Obs. 4, 167-185 (1989).
D. Lazer et al., Science 323, 721-723 (2009).
G. King, Science 331, 719-721 (2011).
N. Eagle, M. Macy, R. Claxton, Science 328, 1029-1031 (2010).
H. Choi, H. Varian, Predicting the present with Google Trends. Econ. Rec. 88, 2-9 (2012). doi:10.1111/j.1475-4932.2012.00809.x
W. Wang, D. Rothschild, S. Goel, A. Gelman, Int. J. Forecast. 31, 980-991 (2015).
J. Candia et al., J. Phys. A 41, 224015 (2008).
J.-P. Onnela et al., Proc. Natl. Acad. Sci. U.S.A. 104, 7332-7336 (2007).
X. Lu, E. Wetter, N. Bharti, A. J. Tatem, L. Bengtsson, Sci. Rep. 3, 2923 (2013).
P. Deville et al., Proc. Natl. Acad. Sci. U.S.A. 111, 15888-15893 (2014).