Big Data and the Well-Being of Women and Girls
Applications on the Social Scientific Frontier
April 2017
This report was written by Bapu Vaitla (overall); Claudio Bosco, Victor Alegana, Tom Bird,
Carla Pezzulo, Graeme Hornby, Alessandro Sorichetta, Jessica Steele, Cori Ruktanonchai,
Nick Ruktanonchai, Erik Wetter, Linus Bengtsson, and Andrew J Tatem (Section II);
Riccardo Di Clemente, Miguel Luengo-Oroz, and Marta C. González (Section III);
United Nations Global Pulse and University of Leiden, lead authors René Nielsen, Thomas
Baar, and Felicia Vacarelu (Section IV, part 1); Munmun de Choudhury, Sanket Sharma,
Tomaz Logar, Wouter Eekhout, and René Nielsen (Section IV, part 2). We thank Anoush
Tatevossian and Robert Kirkpatrick for their guidance and access to key datasets. Editing
and design by Julia Van Horn. We are grateful to Rebecca Furst-Nichols, Mayra Buvinic,
Emily Courey Pryor, and Alba Bautista for helpful comments and guidance.
This work was initiated by Data2X, a collaborative technical and advocacy platform
dedicated to improving the quality, availability, and use of gender data in order to make
a practical difference in the lives of women and girls worldwide. Data2X works with UN
agencies, governments, civil society, academics, and the private sector to close gender
data gaps, promote expanded and unbiased gender data collection, and use gender data
to improve policies, strategies, and decision-making. Hosted at the United Nations
Foundation, Data2X receives funding from the William and Flora Hewlett Foundation
and the Bill & Melinda Gates Foundation.
Cover photo © Elizabeth Whelan
Executive Summary
I Introduction
II Geospatial Data
High-resolution Mapping of Sex-Disaggregated Indicators
III Digital Exhaust
Analyzing Economic Activity with Credit Card and Cell Phone Information
IV Internet Activity
Sex-Disaggregation of Social Media Posts
Social Media Expression as Signaling Mental Health States
V Conclusion: Reimagining the Revolution
Photo Credits
Executive Summary
Conventional forms of data (household surveys, national economic accounts, institutional
records, and so on) struggle to capture detailed information on the lives of women
and girls. The many forms of big data, from geospatial information to digital transaction
logs to records of internet activity, can help close the global gender data gap. This report
profiles several big data projects that quantify the economic, social, and health status of
women and girls.
e rst project, described in Section II
(“Geospatial Data”), uses satellite imagery
to greatly improve the spatial resolution of
existing data on girls’ stunting, womens
literacy, and access to modern contracep-
tion in Bangladesh, Haiti, Kenya, Nigeria,
and Tanzania. is project develops
modeling techniques that use publicly
available high-resolution geospatial data to
infer similarly high-resolution patterns of
social and health phenomena across entire
countries. e approach takes advantage
of the fact that many types of social and
health data are correlated with geospa-
tial phenomena. ese relationships can
predict social and health outcomes in areas
where surveys have not been performed but
correlated geospatial data is available. is
project generated a series of highly detailed
maps that clearly illustrate landscapes of
gender inequality (see Figure A).
Figure A. Differences in stunting between girls and boys, Nigeria, 2013. Red areas
are where girls’ stunting is higher than boys’ stunting, green where girls’ stunting
is lower than boys’ stunting.

The second project, profiled in Section III (“Digital Exhaust”), utilizes anonymized credit
card and cell phone data to describe patterns of women’s expenditure and mobility in a major
Latin American metropolis. The credit card data includes 10 weeks of transactions from
150,000 users, with associated age, sex, and location information; for a subset of these credit
card users, cell phone data is also available. The two types of information together create
portraits of economic lifestyles: patterns of behavior that illustrate the needs and priorities
of women (see Figure B). Over a longer timeframe, such data could also reveal signals
about how women are coping with a wide range of environmental and economic shocks
and stressors.
The third and fourth projects, profiled in Section IV (“Internet Activity”), concentrate
on the expression of ideas and emotions on the social media platform Twitter. The
third project develops and prototypes a tool for automatically identifying the sex of
Twitter users, and then uses this method to quantify the concerns of women on a wide
range of global development issues. The algorithm created in this project automates the
process of looking up users’ names and pictures from Twitter profiles. Using open source
software, the tool checks users’ names against a built-in database that contains sex
information. If the name alone is insufficient to infer sex, the tool analyzes profile
photos using face recognition software. The tool was tested on more than 50 million
Twitter accounts across the world to understand the differing priorities of women and men
on topics related to sustainable development (see Figure C).
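The name-first, photo-second logic described above can be sketched as follows. This is only an illustration under stated assumptions: the tiny `NAME_SEX` table, the `infer_sex` function, and the `photo_classifier` callback are hypothetical names introduced here, and the actual tool's name database and face recognition software are not reproduced.

```python
from typing import Callable, Optional

# Hypothetical toy lookup table; a real tool would rely on a much larger
# name database that records the sex usually associated with each name.
NAME_SEX = {"maria": "female", "sita": "female", "john": "male", "ram": "male"}


def infer_sex(display_name: str,
              photo_classifier: Optional[Callable[[], Optional[str]]] = None
              ) -> Optional[str]:
    """Return 'female', 'male', or None when sex cannot be inferred."""
    tokens = display_name.lower().split()
    first_name = tokens[0] if tokens else ""

    # Step 1: try the name database.
    if first_name in NAME_SEX:
        return NAME_SEX[first_name]

    # Step 2: fall back to the profile photo, if a classifier is supplied.
    # The classifier is a stand-in for face recognition software.
    if photo_classifier is not None:
        return photo_classifier()

    # Step 3: otherwise leave the account unclassified.
    return None
```

Accounts for which neither step yields an answer stay unclassified rather than being guessed, which keeps the sex-disaggregated counts conservative.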
e nal project locates signals of depression in a large database of publicly available tweets
from women and girls in India, South Africa, the United Kingdom, and the United States.
e project uses machine learning techniques to identify genuine self-disclosures of mental
illness from nearly 1.5 million social media posts and half a million Twitter users. e
method accurately identies mental illness in 96% of cases. e project also compares
modes of linguistic expression and topical content across female and male users. Overall, the
Figure B. Frequency of women’s transacons in dierent expenditure categories, as assessed by credit
card data.
Grocery stores, supermarkets
Eating places and restaurants
Bridge and road fees, tolls
Computer network/information services
Miscellaneous food stores
Service stations
Insurance sales, underwriting and premiums
Department stores
Telecommunication services
Manual cash disbursements
Taxicabs and limousines
Cable, satellite, and radio services
Fast food restaurants
Drug stores and pharmacies
Direct marketing
Computer software stores
Motion picture theaters
Women’s ready to wear stores
Wholesale clubs
Miscellaneous general merchandise stores
0.00 0.02 0.04 0.06 0.08 0.10 0.12
Big Data and the Well-Being of Women and Girls
ndings reveal signicant dierences in how dierent
sexes express mental health concerns on Twitter. e
work suggests two major applications for monitor-
ing and treatment. At the individual level, signals of
mental illness could provoke response, either from the
user’s community or through automated means from
the social media platform itself (for example, oering
counseling resources). At the population level, mental
health trends can be monitored in near real-time, which
may be especially useful following recessions, natural
disasters, and other shocks.
This report illustrates the potential of big data in filling the global gender data gap.
The rise of big data, however, does not mean that traditional sources of data will become
less important. On the contrary, the successful implementation of big data approaches
requires investment in proven methods of social scientific research, especially for
validation and bias correction of big datasets. More broadly, the invisibility of women
and girls in national and international data systems is a political, not solely a technical,
problem. In the best case, the current “data revolution” will be reimagined as a step
towards better “data governance”: a process through which novel types of information
catalyze the creation of new partnerships to advocate for scientific, policy, and political
reforms that include women and girls in all spheres of social and economic life.
Figure C. Trending topics among Twitter users in Nepal, May 2012-July 2015, disaggregated
by sex. The figure compares how likely men and women are to tweet about each of the 16 topics:
A good education
Access to clean water and sanitation
Action taken on climate change
Affordable and nutritious food
An honest and responsive government
Better healthcare
Better job opportunities
Better transport and roads
Equality between men and women
Freedom from discrimination and persecution
Phone and internet access
Political freedoms
Protecting forests, rivers, and oceans
Protection against crime and violence
Reliable energy at home
Support for people who can’t work
I. Introducon 4
The term “big data” encompasses diverse types of information, from satellite imagery to
cell phone records to internet activity. These forms of data differ in many ways, but all have
digital origins, record observations at high frequency, and are massive in size. Such
characteristics are invaluable in studying human well-being as it changes over time.
Traditional data systems struggle to quantify trajectories of physical and mental health
among a population, especially during and following economic recessions, natural disasters,
and other unpredictable shocks. The problem is exacerbated (and present even during
periods of relative economic stability) with respect to women and girls, who often work in
the informal sector or at home, suffer social constraints on their mobility, and are
marginalized in both private and public decision-making. Household surveys, national economic
accounts, institutional records, and so on often do not successfully capture the lives of
women and girls, especially at the kind of frequency needed to assess changes in economic
and health status.
is report proles groundbreaking approaches to using various kinds of big data to ll
the global gender data gap. For each of three major big data categories—geospatial data,
digital exhaust, and records of internet activity—we present exemplary research initiatives
conducted over the past two years:a,b
• InSection II (“GeospatialData”),researchersat the FlowminderFoundation
and WorldPop project use satellite imagery to improve the spatial resolution
of existing data on women and girls obtained from Demographic and Health
Surveys (DHS) in Bangladesh, Haiti, Kenya, Nigeria, and Tanzania;
• InSectionIII,(“DigitalExhaust”),researchersattheMassachusettsInstituteof
Technology (MIT), working with a colleague at United Nations Global Pulse
(UNGP), utilize credit card and cell phone data to discern patterns of women’s
expenditure and mobility in a major Latin American metropolis;
• InSectionIV, (“Internet Activity”),welookattwoprojectsconcentratingon
the expression of ideas and emotions on the social media platform Twitter. In
the rst, researchers at UNGP and the University of Leiden create an algorithm
for automatically identifying the sex of Twitter users, and then use this method
to quantify the concerns of women across a wide range of global development
issues. In the second project, researchers at Georgia Tech University, supported
by colleagues at the University of Leiden and UNGP, locate signals of depression
a More detailed reports on
each of these projects are
available at http://data2x.
b Note that the rst-person
plural “we” is used through-
out this report to refer in
dierent sections to dierent
groups of researchers. e
relevant researchers for each
section are listed on the
inside front cover.
Big Data and the Well-Being of Women and Girls
in a large database of publicly available tweets from women and girls in India,
South Africa, the United Kingdom, and the United States.
In all cases, the projects yielded important new insights into the lives of women and girls.
The sections that follow describe each in detail.
II. Geospaal Data 6
The big data conversation usually centers on novel forms of data, ignoring a valuable source
of information that has been available in the public domain for decades: geospatial data. In
recent years, the amount of freely accessible geospatial data, especially satellite imagery, has
greatly expanded, spurred by increased investment from government agencies and private
businesses. This data is increasingly fine-grained in both time and space: satellite imagery,
for example, is now able to record rapid changes in both biophysical phenomena (for
example, vegetation, soil cover, and water flows) and human infrastructure (for example,
settlements, roads, and light intensity).
Equally high-resolution data on social and health indicators is critically needed, but still
lacking. Human well-being varies considerably within countries, and development indicators
assessed at national scales conceal these inequalities. Importantly, the status of women
and girls in economically marginalized or geographically isolated communities is often
unknown. Although four out of every five countries in the world regularly produce
sex-disaggregated statistics at national or provincial scale, this data is not spatially
refined enough to support local policymaking or program targeting.
To address this problem, we developed modeling techniques that use high-resolution
geospatial data to infer similarly high-resolution patterns of social and health phenomena.
This approach takes advantage of the fact that many types of social and health data (for
example, child stunting, literacy, and access to modern contraception, the indicators
we focus on in this case study) are correlated with geospatial phenomena that can be
mapped in great detail across entire countries using satellite imagery. These relationships
are then used to predict social and health outcomes in areas where surveys have not been
performed.[1] The result is a set of maps that provide entire landscapes of information on
indicators of interest. The workflow is illustrated in Figure 1, and the methods are more
fully explicated in the following pages. We focus especially on outcomes related to girls
and women, that is, girls’ stunting, women’s literacy, and contraceptive access; results for
boys’ stunting and men’s literacy are presented in the accompanying technical report by
Bosco et al. (2016).
The DHS program has been a leader in collecting and disseminating survey data on key
development indicators in low- and middle-income countries. Large-sample household
data collection of this type, however, is costly, and so surveys are normally designed to be
representative at the national or the largest subnational administrative level (typically called
states or provinces). These areas often contain millions of people, and statistical assessments
at such scales obscure substantial lower-level heterogeneity in social, economic, and health
status. However, recent DHS surveys (and, increasingly, other household surveys)
provide GPS coordinates for observations or clusters of observations, which enables us
to utilize our geospatial modeling approach to improve the spatial resolution of DHS data.
In this study, we focus on three countries in Sub-Saharan Africa (Kenya, Nigeria, and
Tanzania), one country in South Asia (Bangladesh), and one country from the Western
Hemisphere (Haiti); all have a low or medium human development index.[3] We use DHS
data from the last several years on child stunting, literacy, and the use of modern
contraception (hereafter collectively referred to as “well-being outcomes”), the first two of
which are disaggregated by sex; only girls’ and women’s results are presented in this report.[c][4]
We chose geospatial variables, summarized in Table 1, by combing existing publicly
available libraries for those variables that had previously shown correlation with the
outcomes.[d][5] We then analyzed the relationship of these variables with stunting, literacy, and
contraceptive access at each recorded survey location. The final step used these observed
relationships to infer, using high-resolution landscape maps of each geospatial variable,
outcomes in all non-survey locations. A continuous landscape of girls’ stunting, women’s
literacy, and access to contraception was thus generated for each country.
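The fit-then-predict logic of this workflow can be illustrated with a deliberately simple sketch. It assumes a single geospatial covariate and an ordinary least squares line, which is a simplification chosen for illustration and not the modeling approach of the study (see Bosco et al. 2016 for the actual methods). All numbers and names below are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def explained_variance(x, y, a, b):
    """Proportion of the variance in y explained by the fitted line."""
    my = mean(y)
    ss_tot = sum((yi - my) ** 2 for yi in y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return 1 - ss_res / ss_tot


# Hypothetical survey clusters: travel time to the nearest town (hours)
# and the share of literate women observed at each cluster.
travel_time = [0.5, 1.0, 2.0, 3.0, 4.0]
literacy = [0.91, 0.84, 0.76, 0.64, 0.55]

# Step 1: learn the relationship at the surveyed locations.
a, b = fit_line(travel_time, literacy)
r2 = explained_variance(travel_time, literacy, a, b)

# Step 2: apply it to grid cells where only the covariate is known.
grid_travel_time = [0.75, 2.5, 3.5]
predicted = [a + b * t for t in grid_travel_time]
```

In practice many covariates would be combined and the model checked on held-out survey locations, since a model that fits the surveyed clusters well can still predict poorly elsewhere.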
Figure 1. Workow, geospaal modeling of well-being outcomes.
c Children under age ve
whose height is considerably
(two standard deviations)
below the median of the
World Health Organization’s
reference population are
considered stunted. People
ages 15-49 who attended
at least secondary school or
could read part of a sentence
during the DHS interview
are dened as literate. e
current use of any modern
method of contraception
is asked of all women ages
15-49, but in Bangladesh
only of ever- married women.
d Selecting the optimal
subset of geospatial variables
is critical for maximizing
the ultimate predictive
accuracy of a model: too few
informative variables and
the model will not explain
much; too many and the
resulting model may explain
the observed data extremely
well but perform badly when
applied to new datasets.
DHS surveys provide
information on well-
being outcomes
(stunting, literacy,
modern contraceptive
use) in distinct locations
Whole landscape
maps of a broad set
of geospatial variables
(e.g. accessibility) are
The correlation of
well-being outcomes
with geospatial
variables is assessed
The geospatial models
best able to predict
well-being in the
survey locations are
Using the geospatial
models, well-being is
predicted across the
whole landscape
DHS Data
PreDictive MoDeling HigH-reS Well-being Data
geoSPatial Data
II. Geospaal Data 8
Re s ults
We rst present results of the geospatial variable selection exercise, summarizing overall
model performance and then listing the most strongly correlated set of geospatial variables
for each indicator in each country. For selected indicators, we show maps comparing DHS
survey results with the landscape of values generated by geospatial variables.
First, we note that model performance varied greatly across indicators and countries (Figure
2). Models for girls’ stunting, for example, were inadequate for all countries except Nigeria.
Geospatial variables were generally informative in building models for women’s literacy.
For modern contraceptive use, models performed strongly in Tanzania and Nigeria. The
results suggest that geospatial modeling requires careful investigation of a broad set of
variables (even broader than the set explored here), and that some outcomes in some countries
may not be correlated well with any set of geospatial variables. In the present work,
literacy appears to have strong geospatial correlates almost universally, while the
performance of girls’ stunting and contraceptive use models depends on context.

Table 1. Geospatial variables used in this study. See Bosco et al. (2016) for extended descriptions and sources.

Geospatial variable          Description
---------------------------  -----------------------------------------------------------------
Accessibility                Likely travel times between two points, a function of distance and infrastructure
Aridity/evapotranspiration   Weather station-based interpolation of moisture availability
Births                       WorldPop-derived number of live births
Crop suitability             Rainfed crop suitability given crop/technology mix
Distance to conflicts        Nigeria only, between years 2010-13
Distance to health facility  Calculated from Open Street Map datasets
Distance to roads            Calculated from Open Street Map datasets
Distance to schools          Calculated from Open Street Map datasets
Economic productivity        Gross domestic product, calculated with economic data and geospatial correlates of economic activity
Elevation                    Elevation above sea level
Ethnicity                    Estimated distribution of ethnic groups
Land surface                 Land biophysical properties estimated by reflectance
Livestock density            Modeled spatial distribution of livestock
Nightlights                  Light intensity, denoting population density and electrification
Population density           Density inferred from settlement and land use patterns
Pregnancies                  WorldPop-derived number of pregnancies
Protected areas              Geospatial conservation databases
Temperature/rainfall         Global climate layers
Urban/rural settlements      Estimated distance to settlements, country-specific datasets
Vegetation/land cover        Plant cover estimated by surface reflectance
We also see that the optimum subset of geospatial variables differs by country
and indicator, as shown in Figure 3 for the six best-performing models. Accessibility, a
general indicator of transport infrastructure quality, is the only geospatial variable that
is statistically significant in all six models; elevation and land surface are usually
important, and aridity/evapotranspiration, distance to roads, temperature/rainfall,
and the distance to urban and rural settlements are significant in most cases. The
key message, however, is that the set of optimum geospatial variables depends on
context; even the same indicator will have different geospatial correlates in different
countries.

Figure 2. Explanatory power of geospatial models (proportion of explained variance), by
country and indicator. Indicators shown: girls’ stunting, women’s literacy, and modern
contraceptive use; countries shown: Haiti, Bangladesh, Tanzania, Nigeria, and Kenya.
Country/indicator models with no information shown (stunting in Tanzania and
Haiti, literacy in Haiti, contraception in Kenya and Bangladesh) were not modeled,
due to lack of sufficient survey indicator data. Boys’ stunting and male literacy are
not shown.

Figure 3. Geospatial correlates of girls’ and women’s well-being outcomes in the
six best-performing models (girls’ stunting, female literacy, and modern contraceptive
use models for Nigeria, Kenya, and Tanzania). Shaded box indicates that the variable
was included in the final model.
We now turn to the core results, presented in a series of maps. Figure 4 makes clear the
overall value of the approach. The top panel shows stunting rates for girls in the original
DHS survey locations from the Nigeria 2013 dataset. The data appears as a scatter
of points distributed unevenly throughout the country; between the survey locations
are large areas of space in which stunting prevalence is not known. The bottom panel
then shows the girls’ stunting landscape in 2013 as generated by the best-performing
geospatial model (which includes those variables shaded in the “Girls’ stunting/
Nigeria” column of Figure 3). We see a continuous gradient over the entire expanse
of the country; not only broad geographic patterns but also differences within
sub-regions of Nigeria become more evident.

Figure 4. DHS survey data for girls’ stunting in Nigeria in 2013 (top panel) and
geospatially predicted landscape of girls’ stunting in the same year (bottom panel).
Geospatial modeling also unveils inequalities between girls and boys across the landscape.
Figure 5 shows differences in 2013 stunting rates across sexes in Nigeria; positive values
(colored in orange/red) indicate areas where girls have higher rates of stunting, and negative
values (colored in green) where boys have higher rates (separate results for absolute levels of
boys’ stunting are available in Bosco et al. 2016). Notably, the areas with higher absolute
levels of girls’ stunting, as shown in the bottom panel of Figure 4, are not necessarily the
areas of greatest inequality; the northeast, central, and southern urban regions of Nigeria
appear to exhibit the largest disadvantage for girls. Overall, the map provides a
fine-grained picture of inequality across the entire landscape.
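The sign convention of the difference map can be made concrete with a small sketch. The cell values and function names below are hypothetical and serve only to show how a girls-minus-boys difference separates the two cases.

```python
def stunting_gap(girls, boys):
    """Per-cell difference in stunting prevalence: girls minus boys."""
    return [g - b for g, b in zip(girls, boys)]


def classify(gap, tolerance=0.0):
    """Label a cell by the sign of the gap (positive: girls worse off)."""
    if gap > tolerance:
        return "girls higher"   # orange/red on the map
    if gap < -tolerance:
        return "boys higher"    # green on the map
    return "no difference"


# Hypothetical predicted prevalence for three grid cells.
girls = [0.40, 0.30, 0.25]
boys = [0.35, 0.36, 0.25]

labels = [classify(g) for g in stunting_gap(girls, boys)]
```

A cell can have a high absolute level of girls' stunting and still show a small gap, which is why the difference map and the absolute map highlight different areas.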
Figure 6 and Figure 7 show similar results for women’s literacy in Kenya and modern
contraceptive use in Tanzania. Once again, we see that geospatial modeling can transform
a limited number of survey data points distributed unevenly across the country into a
continuous landscape of information.
This approach does face challenges. Some of the geospatial models we attempted
were unable to accurately predict well-being outcomes, as shown earlier by Figure 2.
It is possible that a wider set of geospatial covariates would have improved modeling
performance. For many variables and locations an exploratory approach is necessary, as
little theoretical guidance linking geospatial phenomena with well-being outcomes is
available. In addition, the exact nature of the relationships between geospatial phenomena
and well-being outcomes (linearity vs. non-linearity, for example) is also unclear.
Overall, however, geospatial modeling of women’s and girls’ social and health status shows
great promise. Some of the maps produced in the present study, especially the maps of all
three indicators in Nigeria and the map of women’s literacy in Kenya, have sufficiently
low uncertainty to be utilized by policymakers seeking to target interventions at local
administrative levels. We have in this section presented results on three indicators from five
countries, but the modeling architecture can be extended to other indicators and countries
for which DHS has information, as well as to other household surveys containing
georeferenced data. The recent expansion of publicly available high-resolution satellite imagery
offers a rich bounty of data for exploring geospatial correlations in the many countries
where traditional data systems are insufficient to capture the status of women and girls.

Figure 5. Differences in stunting rate between girls and boys, Nigeria, 2013.
Figure 6. Women’s literacy in Kenya.
Figure 7. Modern contraceptive use in Tanzania.
III. Digital Exhaust
Analyzing Economic Activity with Credit Card and Cell Phone Information
Digital technologies are ubiquitous, and their use leaves traces: records of the goods and
services we consume, the places we go, and the people with whom we interact.[6] If
information on the sex of the technology user is available, these types of “data exhaust” can
offer insight in near-real time about the lives of women and girls.
The following pages describe a project that uses credit card and cell phone data to analyze
patterns of economic activity among tens of thousands of women living in one of the most
populated cities of Latin America.[e] We use credit card records (CCRs) to examine the
expenditure priorities and patterns of mobility of different sexes, income levels, and ages.[7]
Call detail records (CDRs), meanwhile, store information about the time, duration, and
location of mobile phone calls, as well as the anonymized IDs of the people receiving calls.
Past research has used CDRs to analyze social interactions, the laws of human mobility, and
the economic welfare of users.[8]
For this project, we obtained over 10 weeks of anonymized individual credit card
transactions from 150,000 users, with associated age, sex, and location information. The CCRs
include data on the broad types of goods and services purchased, expenditure amounts,
and the chronological sequence of transactions. For 10% of these credit card users, call
detail record (CDR) data is also available. We used CCRs and CDRs together to describe
economic lifestyles: patterns of behavior that illustrate the needs and priorities of
individuals. A detailed analysis is carried out specifically for women in the sample.
Detecting dierences in economic lifestyles is complicated by the fact that only a few
categories of purchases dominate spending: most people, regardless of sex, wealth level, or
age, spend most of their income on food, transport, and communication, as the “Results”
section below discusses. is project thus delves deeper, looking not only at transaction
type but also the order in which individuals made purchases, as well as their patterns of
geographical mobility. Certain sequences of transactions may be repeated in an individual’s
purchase history; for example, in panel A of Figure 8, sequence W1 captures grocery store
expenditures (the shopping cart icon) followed by department store purchases (the gift box
icon), while sequence W2 represents restaurant expenditures (the plate and silverware icon)
followed by fuel purchases (the gas pump icon). Both W1 and W2 are repeated twice in this
Analyzing Economic Acvity with
Credit Card and Cell Phone Informaon
e For contractual and
privacy reasons, we cannot
make details of the dataset
publicly available. Upon
request, the authors can
provide anonymized data
used for a subset of the
analyses described below; the
code to replicate methods is
available upon request.
III. Digital Exhaust 14
short transaction history.f Such patterns—
more than ten thousand of which were
detected in the CCR dataset—are the basis
for inferring economic lifestyles. As each
user’s sequences are analyzed, and data on
mobility from geocoded transactions added,
similarities begin to emerge; people cluster
together into distinct economic lifestyles.
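As a rough illustration of what a repeated transaction sequence is, the sketch below counts contiguous runs of transactions that occur at least twice in one purchase history. The text does not specify the detection algorithm, so this naive counting of contiguous subsequences is an assumption made for illustration, and the category names and function name are hypothetical.

```python
from collections import Counter


def repeated_sequences(history, min_len=2, max_len=3):
    """Return contiguous sequences that occur at least twice in a history."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        # Slide a window of length n over the chronological history.
        for i in range(len(history) - n + 1):
            counts[tuple(history[i:i + n])] += 1
    return {seq: c for seq, c in counts.items() if c >= 2}


# Hypothetical history in which two sequences each appear twice,
# loosely mirroring W1 and W2 in the example above.
history = ["grocery", "department", "taxi", "restaurant", "fuel",
           "grocery", "department", "pharmacy", "restaurant", "fuel"]

patterns = repeated_sequences(history)
```

Here `patterns` contains grocery followed by department store, and restaurant followed by fuel, each with a count of two; longer sequences appear less often, so fewer of them pass the threshold.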
Mobile phone data further enriches our understanding of patterns of economic and social
behavior, helping to delineate economic lifestyles even more distinctly. In this project,
we focus on three types of information about individuals obtained from CDRs: mobility
diversity, social diversity, and the radius of gyration (Figure 9).[9] Mobility diversity (how
evenly an individual splits travel time across the various locations he/she visits) can be
constructed using location information from CDRs, gathered by the towers through which
cell phone signals pass. Social diversity quantifies how evenly an individual splits airtime
across all people in his/her calling network. Finally, the radius of gyration defines the
physical area where the user is most likely to be found.
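These three metrics can be sketched as follows. The sketch assumes that "evenness" is measured with normalized Shannon entropy and that the radius of gyration is the root mean square distance of visited locations from their center of mass, computed on planar coordinates. These are common choices but are assumptions here, since the text does not give the formulas; the function names are hypothetical.

```python
import math


def evenness(amounts):
    """Normalized Shannon entropy in [0, 1]; 1 means a perfectly even split.

    `amounts` holds time per location (mobility diversity) or airtime per
    contact (social diversity).
    """
    total = sum(amounts)
    if total == 0:
        return 0.0
    probs = [a / total for a in amounts if a > 0]
    if len(probs) <= 1:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))


def radius_of_gyration(points, weights=None):
    """Root mean square distance of (x, y) points from their weighted center."""
    if weights is None:
        weights = [1.0] * len(points)
    total = sum(weights)
    cx = sum(w * x for (x, _), w in zip(points, weights)) / total
    cy = sum(w * y for (_, y), w in zip(points, weights)) / total
    msd = sum(w * ((x - cx) ** 2 + (y - cy) ** 2)
              for (x, y), w in zip(points, weights)) / total
    return math.sqrt(msd)


# Hypothetical user: time split evenly across four places, with two
# tower locations 2 km apart (planar coordinates in km).
mobility_diversity = evenness([10, 10, 10, 10])
radius_km = radius_of_gyration([(0.0, 0.0), (2.0, 0.0)])
```

A user who spends almost all of her time in one place scores close to 0 on mobility diversity, while a small radius of gyration indicates that her activity is concentrated in a compact area.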
Figure 8. Examples of repeated sequences within a transaction history.
Figure 9. Features of individual behavior obtained from a combination of call detail
records and credit card records.

[f] Sequences can be as short as two transactions or much longer. Longer sequences
will occur less frequently in the transaction history.

Results
We find that expenditures on food are the most important transactions for women, with
over a quarter of transactions in grocery stores/supermarkets, eating places/restaurants,
and miscellaneous food stores (Figure 10). A closer look shows expenditure patterns
across sexes, ages, and income levels (Figure 11); notice that expenditure across sexes
shows strong differences in some categories, while differences across income levels
are minor. Women have more transactions than men with respect to grocery stores/
supermarkets, insurance-related expenses, and department stores, while the opposite
is true for restaurants and transport-related expenses. In general, women report less
total expenditure per capita than men, indicating that they have less access to
economic resources in general, use credit cards less frequently, or both. Such patterns
are likely to vary based on the nature of the economy and the prevailing economic
circumstances; this analysis will be especially relevant in the wake of economic and
environmental shocks, when little real-time, sex-disaggregated data is available.
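The transaction frequencies discussed above are shares of transactions per category, computed separately for each group. A minimal sketch with hypothetical records (the tuples and names below are invented for illustration and are not drawn from the CCR dataset):

```python
from collections import Counter


def category_shares(categories):
    """Share of transactions falling into each expenditure category."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}


# Hypothetical (sex, category) records, one per transaction.
records = [
    ("F", "grocery"), ("F", "grocery"), ("F", "restaurant"), ("F", "insurance"),
    ("M", "restaurant"), ("M", "restaurant"), ("M", "tolls"), ("M", "grocery"),
]

# Group the transactions by sex, then compute shares within each group.
by_sex = {}
for sex, category in records:
    by_sex.setdefault(sex, []).append(category)

shares = {sex: category_shares(cats) for sex, cats in by_sex.items()}
```

Comparing `shares["F"]` with `shares["M"]` category by category gives the kind of sex-disaggregated contrast summarized in the paragraph above.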
Using the combination of credit card
and cell phone data, we identied seven
economic lifestyle clusters among women
in the dataset (Figure 12). One of the
Figure 11. Frequency of transacons by income, age, and gender in selected categories.
Income Age Gender
High income Age
Medium income
Low income 35-49 Female
Figure 10. Frequency of women’s transactions in each expenditure category. For example, 12% of all transactions in the CCR dataset were from grocery stores and supermarkets.
[Categories shown: grocery stores, supermarkets; eating places and restaurants; bridge and road fees, tolls; computer network/information services; miscellaneous food stores; service stations; insurance sales, underwriting and premiums; department stores; telecommunication services; manual cash disbursements; taxicabs and limousines; cable, satellite, and radio services; fast food restaurants; drug stores and pharmacies; direct marketing; computer software stores; motion picture theaters; women’s ready to wear stores; wholesale clubs; miscellaneous general merchandise stores. Horizontal axis: share of transactions.]
clusters, however, does not exhibit a strong pattern of sequences, and is best left unlabeled
(cluster 5 in the figure). The transaction sequences within each of the other clusters are
dominated by a single type of expenditure, and we use this type to help label the clusters
as commuters, homemakers, youth, tech-users, and diners (of which there are two types, as
discussed further below).
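The idea of describing a person by repeated transaction sequences and a core transaction type can be illustrated with a small sketch. It is not the method used in the study: the function name, the toy transaction history, and the choice to count fixed-length runs of consecutive categories are all assumptions made for illustration.

```python
from collections import Counter

def repeated_sequences(history, length=2, min_count=2):
    """Count runs of `length` consecutive transaction categories that
    occur at least `min_count` times in one person's history."""
    runs = Counter(
        tuple(history[i:i + length])
        for i in range(len(history) - length + 1)
    )
    return {seq: n for seq, n in runs.items() if n >= min_count}

# Hypothetical history for one card holder (merchant categories only).
history = ["toll", "restaurant", "toll", "restaurant", "grocery",
           "toll", "restaurant", "toll"]

pairs = repeated_sequences(history, length=2)    # two-step sequences
triples = repeated_sequences(history, length=3)  # longer, hence rarer
core = Counter(history).most_common(1)[0][0]     # most frequent category
```

In this toy history the pair (toll, restaurant) repeats three times while the only repeated triple occurs twice, and the core transaction is the toll, which is the kind of signal that would point towards a commuter-like label.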
Commuters’ primary transaction is toll fees, and their mobility metrics suggest that they
travel long distances frequently. The core transaction of homemakers is grocery stores; they
are less mobile, have less social network diversity, and spend less using credit cards. Women
are overrepresented in this group, suggesting that women in this urban area perform traditional
domestic roles. Youth have taxis as their primary expenditure and live close to the
city center. Tech-users are similar in age to youth, but computer and information services
are their most important transaction and they have greater diversity in their social contacts
Figure 12. Economic lifestyles of women in the dataset. Arrows indicate frequent transaction
sequences; color bar indicates how common that sequence is among the people in the category.
For example, commuters are likely to pay tolls followed by expenditure on more tolls, restaurants,
groceries, fuel, computer and information services, and telecommunication services.
and mobility networks, as well as higher spending overall. The first diners group has high
mobility diversity, high expenditures, and restaurants as the primary transaction; the second
diners group has lower mobility diversity and lower expenditure, and miscellaneous food
stores are the core expenditure.
e identication of economic lifestyle clusters is a vital input for policy formulation.
Subgroups within a population have distinct social and economic needs. For example,
commuters may be hit hardest by fuel price increases, and the creation of inexpensive and
ecient public transport systems may be an important investment in urban areas, especially
those where low-income residential areas and job opportunities are not in proximity. e
analysis above shows also that groups like homemakers and youth have distinct patterns
of expenditure from other segments of the population. Information of this kind facilitates
Figure 13. Characteristics of women users in each group (shaded polygons) vs. the entire sample of men (red line).
[Panels: Commuters, Homemakers, Youth, Tech-users, Unlabeled, Diners 1, Diners 2.]
an analysis of the relative costs and benefits of policies to improve (say) access to affordable
food, information services, or transport. The segregation of the diners illustrates that
subgroups do not access food in the same way, and food policies (nutrition subsidies for
at-risk groups, programs to incentivize low-cost grocery stores in “food desert” areas, and
so on) must be tailored to the needs specific to each economic lifestyle.
We also found that women’s economic lifestyle clusters differed in important ways from
men’s. The diagrams in Figure 13 show median scores for social diversity, mobility diversity,
age, distance from the city center, and total expenditure. The scores for women in each
economic lifestyle group are represented by the shaded polygons; the average scores for all
men in the sample are represented by the red line.
ere are clear dierences between women in each group and men in the overall sample.
Men have much greater mobility diversity than women commuters, and tend to live much
closer to the city center, indicating that men may have better access to economic op-
portunities. Women homemakers tend to be much less social and have reduced mobility
compared to men, again with implications on market and non-market activity. Female
youth, tech-users (especially), and diners also have much less social diversity than men; in
this urban area, men appear to have greater numbers of social connections, while women
have smaller networks. Young women also have a much smaller radius of movement,
again pointing to a constrained economic and social world. Female tech-users are the only
lifestyle group for whom expenditure is signicantly greater than that of men in general,
indicating that tech may be a sector in which women are nding more remunerative job
opportunities. Two distinct types of female diners are present: one group with relatively
high mobility diversity and expenditure but low radius of gyration (e.g., many dierent
restaurants in a localized area; Diners 1), and another with low mobility diversity and
expenditure but higher radius of gyration (e.g., people needing to travel relatively long
distances to nd cheaper food sources; Diners 2).
The current project does have some limitations, most notably a bias towards inclusion
of relatively better-off individuals with access to credit cards and cell phones, although
penetration of the latter into even isolated rural areas of the developing world is increasing.
We did examine whether CCR users were representative of the general population,
especially given that less than a quarter of the population in the research area use credit
cards. The monthly expenditure of the CCR users was high relative to wages in most
of the neighborhoods included in the project, confirming that (within neighborhoods)
the sample is biased towards wealthier individuals, although, evaluating the sample as a
whole, users from all income levels are well represented.
Despite the potential for bias, the project does demonstrate that a combination of credit
card and cell phone data can provide detailed insights into women’s economic behavior.
This research could also serve as a basis for further work applying the proposed methodologies
to other developing countries where mobile money, in addition to credit cards, is
commonly used. The results above are drawn from a ten-week period in which economic
conditions were relatively stable. Over a longer timeframe, our approach could also reveal
signals about how women are coping with a wide range of shocks and stressors: environmental
disasters, recessions, macroeconomic policy shifts, and so on. For example, reduced
mobility among a low-expenditure group could signal that poorer economic classes are
unable to afford the transport costs necessary for commuting and accessing markets and
government services. Such early warning information would be valuable in designing and
managing effective social protection systems.
Elizabeth Whelan
IV. Internet Acvity 20
Social media can help monitor public perceptions and measure global development priorities
and impact. It can also provide insights into the differences and inequalities between
people of different income, sex, age, race, ethnicity, migratory status, disability status, geographic
location, and other characteristics.
Sex disaggregation, in particular, can play an important role in providing information
about the disparities between women and men. However, data from open social media
channels such as Twitter may not indicate a person’s sex. In this project, United Nations
Global Pulse (UNGP) collaborated with the University of Leiden to develop and test a
tool to infer the sex of Twitter users. The tool automates the process of looking up public
information from Twitter profiles, especially the user’s name and profile picture. Using
open source software, the tool checks users’ names against a built-in database of predefined
names, built from sources such as official statistics that contain sex information. For cases
in which name alone may not be enough to discern sex, the tool analyzes profile photos
using face recognition software. The tool was applied to more than 50 million Twitter
accounts from around the world to understand the different concerns and priorities of
women and men on topics related to sustainable development.
A key objective in the development of this tool was to ensure that the approach could be
applicable at a global scale and across different languages. The tool disaggregates social
media posts based on several automated classification methods in a “waterfall” approach:
starting with the classifier that has the highest overall success rate and, when results are
unknown or indecisive, moving on to another classifier.
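The waterfall logic can be sketched in a few lines. This is an illustration only, not the project's script: the function names, the tiny name dictionary, and the `image_label` field (a stand-in for the output of a face-analysis service, so that the example runs offline) are hypothetical.

```python
def classify_by_name(user, name_dictionary):
    """Look up the first token of the display name; None if unknown."""
    tokens = user.get("name", "").lower().split()
    return name_dictionary.get(tokens[0]) if tokens else None

def classify_by_image(user):
    """Stand-in for a face-analysis call; reads a pre-computed field."""
    return user.get("image_label")

def waterfall(user, classifiers):
    """Try classifiers in order and return the first decisive answer."""
    for classify in classifiers:
        label = classify(user)
        if label in ("female", "male"):
            return label
    return "unknown"

NAME_DICTIONARY = {"maria": "female", "john": "male"}  # toy data
steps = [lambda u: classify_by_name(u, NAME_DICTIONARY), classify_by_image]

users = [
    {"name": "Maria Lopez"},                        # decided by name
    {"name": "Sunny Days", "image_label": "male"},  # falls through to image
    {"name": "xx_1234"},                            # no classifier decides
]
labels = [waterfall(u, steps) for u in users]
```

Ordering the steps from most to least reliable means a weaker classifier is consulted only when a stronger one has nothing to say.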
e name classier performed best in this study: a user’s name was compared to a pre-exist-
ing “name dictionary” showing whether the name was more likely to indicate a woman or a
man. ese results could be further improved by using country-specic name dictionaries,
although this would require a more complex process of rst determining the home country
of a specic user. Since exact location of the user is often omitted in tweets, the approach
would adopt a separate script for classifying location, after which a country-specic dictio-
nary could be used.g If a country-specic dictionary were absent or results were indecisive,
g See the technical report
pertaining to the subsequent
section (“Social Media
Expression as Signaling
Mental Health States”) for
another approach to country
Sex-Disaggregaon of Social Media Posts
disaggregation would take place by relating a user’s name either to language-specific dictionaries
or an aggregated set of several dictionaries.h
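The fallback order described here (country-specific dictionary first, then a language-specific one, then an aggregated set) might look like the following sketch. The dictionary contents, the country and language codes, and the function name are invented for illustration and do not come from the project's code.

```python
DICTIONARIES = {
    "country": {"NP": {"sita": "female"}},
    "language": {"en": {"john": "male"}},
    "aggregated": {"andrea": "female", "john": "male"},
}  # toy data

def lookup_name(first_name, country=None, language=None):
    """Return the label from the most specific dictionary that knows the
    name, or None if no dictionary does."""
    name = first_name.lower()
    candidates = [
        DICTIONARIES["country"].get(country, {}),
        DICTIONARIES["language"].get(language, {}),
        DICTIONARIES["aggregated"],
    ]
    for dictionary in candidates:
        if name in dictionary:
            return dictionary[name]
    return None
```

A name that the same dictionary lists for both sexes would need an explicit indecisive value; that case is left out of this sketch.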
In addition to name classication, image recognition of a user’s prole picture demon-
strated good results for sex disaggregation. For this project, the script classied a user’s
prole picture with the free-to-use tool Face++. However, if multiple persons are identied
in the same photo, the results can be inconclusive. For the purposes of this prototype tool,
the algorithm chose the face for which sex is most reliably identiable.
The script for sex disaggregation of social media accounts is open-source and readily available.i
For illustrative purposes, a demo version of the tool itself has been made available
online for inferring the sex of a person based on their Twitter user name, first name, or an
image URL.
Results
To test the accuracy of the waterfall method (first deploying the name classifier, followed
by image recognition), a public website was created. The website allows users to manually
determine whether a certain Twitter account is male or female. The accuracy of the hybrid
classification approach (the waterfall combination of name and image recognition) was
h e name classication
process and script used
in this tool built upon
the code of “Gender
Computer” developed by
TU Eindhoven. e code of
Gender Computer can be
accessed here: https://github.
puter. For the overall script,
we updated several of the
dictionaries with new names
and included additional
country specic dictionaries.
i GitHub repository: https://
Figure 14. Global Pulse post-2015 tweets dashboard, conversations about equality between men and women.
[Dashboard topics: a good education; access to clean water and sanitation; action taken on climate change; affordable and nutritious food; an honest and responsive government; better healthcare; better job opportunities; better transport and roads; equality between men and women; freedom from discrimination and persecution; phone and internet access; political freedoms; protecting forests, rivers, and oceans; protection against crime and violence; reliable energy at home; support for people who can’t work.]
IV. Internet Acvity 22
Compare how likely men and women are to tweet about each of the
16 topics.
A good education
Access to clean water and sanitation
Action taken on climate change
Affordable and nutritious food
An honest and responsive government
Better healthcare
Better job opportunities
Better transport and roads
Equality between men and women
Freedom from discrimination and persecution
Phone and internet access
Political freedoms
Protecting forests, rivers, and oceans
Protection against crime and violence
Reliable energy at home
Support for people who can’t work
compared with crowdsourced verication mechanism, the results
of which were assumed to be correct. e automated classica-
tion approach accurately determined sex in 74% of cases, a rea-
sonable result for a tool in its initial stages of development.j As
described above, future work using more context-specic dictio-
naries should improve accuracy.
The tool could be used for any study of tweets and other types
of social media expression wherein the name and/or profile
picture of users are available. For example, Global Pulse used the
sex-disaggregation tool to improve an existing real-time online
dashboard showing the volume of tweets around priority topics
related to sustainable development, including gender equality
(Figure 14). By filtering through 500 million daily tweets from
over 50 million accounts for 25,000 keywords relevant to global
development topics, this interactive dashboard showed which
countries tweeted most about which topics between May 2012
and July 2015.k
To further rene the dashboard, the gender classication script was run over the entire
dataset. Once disaggregated by sex, the dashboard revealed new insights, highlighting the
dierent concerns and priorities of women and men. For example, in Nepal, the sex-dis-
aggregated data showed that women tweeted most on “equality between men and women”
(Figure 15). In comparison, men discussed most about “protecting forests, rivers and
oceans.” In the second quarter of 2015, prompted by the earthquake that hit Nepal on
April 25th, discussions were dominated by “support for people who cannot work”—a topic
rarely mentioned previously—and “an honest and responsive government.” e above
topics were widely mentioned by both men and women.
This project faced several obstacles. Because of the anonymity standards of Twitter, identifying
authentic user data is not always possible. Context-specific name databases and other
tools (for example, profile and linguistic style choices) can help improve prediction of
sex; improvement on the current accuracy rate of 74% is almost certainly possible. Overall,
however, the approach developed here advances sex-disaggregated analysis of social media,
and by doing so provides a window into large databases of ideas and opinions.
Figure 15. Trending topics of Twier users in Nepal, May
2012-July 2015, disaggregated by sex.
j With respect to privacy,
the methodology uses
publicly available data from
Twitter profiles. Moreover,
only publicly revealed gender
markers such as the name
and profile pictures of users
were applied to building the
tool. Users for whom the
name and profile picture
were insufficient to allow the
classifier to detect sex were
categorized as unknown.
k e project was initially
developed by Global Pulse in
collaboration with the UN
Millennium Campaign and
DataSift: http://post2015.
Several unique social and psychological characteristics are implicated in the mental health
challenges experienced by women and girls.10 In addition, poverty, inequality, and cultural
expectations may heighten the risk of mental illness among women and girls.11 Most of the
publicly available data on mental health burden, however, comes from massive and infre-
quent exercises that rarely include sex-disaggregated information, especially in the develop-
ing world.12 Methods of mental health assessment are also inconsistent across countries.13
Overlooking sex-based dierences can have drastic consequences, including misdiagnosis,
inappropriate treatment, and constrained help-seeking.14 More high-frequency data on
mental illness and better understanding of the ways in which women and girls express their
mental health concerns is thus needed.15
Research in recent years has proposed that social media data can help understand patterns
of mental health in complement to more traditional assessments.16 Here we present a gender-based,
cross-cultural quantitative examination of mental health content shared on the
social media platform Twitter. Using a dataset of half a million Twitter users and nearly
1.5 million posts from four countries (India, South Africa, the United Kingdom, and the
United States), we employed machine learning techniques to identify genuine self-disclosures
of mental illness from public social media posts. Comparison of these posts with
content shared on online mental health support communities, as well as consultation with
mental health professionals, suggests that the method accurately identifies mental illness
in nearly all cases. We also compare modes of linguistic expression and topical content
across female and male users. Overall, the findings reveal significant differences in how
different sexes and cultures express mental health concerns on Twitter, and suggest that
unobtrusively gathered social media data can serve as an important source of mental health
The various steps of the methodology are described in the paragraphs that follow. First,
we filtered English-language Twitter posts from March 2015 to create a sample of mental
illness disclosures (“MID users”) containing any of the key phrases listed in Table 2. These
phrases, denoting current experience with mental illness, were collated through reference
to prior work as well as consultation with a practicing psychiatrist.17 A control data sample
of posts over the same period (“CTL users”) was also created; none of these posts contained
Social Media Expression as
Signaling Mental Health States
IV. Internet Acvity 24
the key phrases in Table 2. Sex and country information were then inferred for each post
using an automated method.l
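A key-phrase filter of this kind can be sketched with regular expressions. The phrases below are a subset of Table 2; the helper names are hypothetical, and reading the `[*]` wildcard as "zero to three intervening words" is an assumption, since the report does not define it.

```python
import re

KEY_PHRASES = [
    "I want to die",
    "I am depressed",
    "I [have/had] depression",
    "I [*] diagnosed [*] depression",
    "Killing myself",
]  # subset of Table 2

def phrase_to_regex(phrase):
    """Translate the table's bracket notation into a compiled pattern."""
    pattern = ""
    for token in phrase.split():
        if token == "[*]":
            # Assumption: the wildcard stands for zero to three words.
            pattern += r"(?:\S+\s+){0,3}"
        elif token.startswith("[") and token.endswith("]"):
            options = "|".join(re.escape(o) for o in token[1:-1].split("/"))
            pattern += "(?:" + options + r")\s+"
        else:
            pattern += re.escape(token) + r"\s+"
    if pattern.endswith(r"\s+"):
        pattern = pattern[:-3]  # drop the trailing whitespace requirement
    return re.compile(r"\b" + pattern + r"\b", re.IGNORECASE)

PATTERNS = [phrase_to_regex(p) for p in KEY_PHRASES]

def mentions_key_phrase(text):
    """True if the post contains any of the key phrases."""
    return any(p.search(text) for p in PATTERNS)
```

A filter like this is deliberately broad: it also accepts figurative uses such as "I feel like killing myself" after an early alarm, so a second validation step is still needed to separate genuine disclosures from casual phrasing.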
The key phrases in Table 2, however, may not indicate genuine disclosure of mental illness;
for example, “when I have to wake up at 6am, I feel like killing myself” does not indicate
suicidal intent. To eliminate such misleading posts, a machine learning method was used to
compare the language of each Twitter user with the language of posts made by people who
self-identify as suffering from mental illness on the Reddit sub-communities r/depression,
r/mentalhealth, and r/SuicideWatch.18 A similar process was used to validate the control
dataset, but evaluated dissimilarity to the Reddit sample instead. A final qualitative validation
exercise was also carried out: a licensed psychiatrist and two researchers experienced
in mental health/social media research evaluated a subsample of 100 mental health disclosures,
each from a different user. Overall, the machine learning approach in this project was
96% accurate in identifying genuine disclosures of mental health concerns.
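The report does not specify the learning method, so the following sketch substitutes a deliberately simple technique, bag-of-words cosine similarity against a reference text, purely to illustrate the idea of comparing a user's language with that of a support community. The reference text, the example posts, the threshold, and the function names are all invented.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lower-cased word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Invented stand-in for posts from a mental health support community.
reference = bag_of_words(
    "i feel hopeless and empty i cannot sleep i have been depressed "
    "for months and nothing helps"
)

def similarity_to_reference(user_posts):
    return cosine(bag_of_words(" ".join(user_posts)), reference)

score_a = similarity_to_reference(
    ["i have been depressed and hopeless for months", "i cannot sleep"])
score_b = similarity_to_reference(
    ["when i have to wake up at 6am i feel like killing myself"])
THRESHOLD = 0.6  # arbitrary cut-off for this toy example
```

Here the first user scores well above the cut-off and the second falls below it. A real system would need a far larger reference corpus, weighting that discounts common words such as "i", and validation against expert judgment of the kind described above.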
Deeper analysis of the social media content followed. We developed linguistic measures
(how users express themselves) and a topic model (what users are talking about) to quantify
the differences between how female and male users disclosed their mental health concerns.
Linguistic measures were divided into three categories (affective attributes, cognitive attributes,
and linguistic style) and subtypes within these categories, drawn from previous
psycholinguistic work (Table 3).19
With respect to aective attributes, the project considered positive aect (PA), negative
aect (NA), and four other more specic measures of emotional expression: anger, anxiety,
sadness, and swearing. Cognitive measures were divided into cognition and perception,
which together evaluate cognitive complexity and emotional stability.20 Finally, four
measures of linguistic style were considered: lexical density, temporal measures, social/
personal concerns, and interpersonal awareness/focus. ese measures of linguistic style
l Because Twitter does
not allow individuals to
self-report their sex and
location information is often
inaccurate, sex and country
inference is necessary. For
sex inference, we matched
the self-reported name string
in Twitter prole names with
name databases from gov-
ernment and other sources.
For country inference, we
corrected location names
using standard techniques
and matched locations to
various large geographic
Table 2. Key phrases to lter for signals of mental health concerns.
I want to die I want to end my life I want to suicide
I thought of suicide I am depressed I [*] diagnosed [*] depression
I attempted suicide I [have/had] depression Killing myself
I [*] thinking of suicide I [*] diagnosed [*] mental illness I tried to suicide
I [have/had] mental illness Ending my life
indicate one’s underlying psychological processes (lexical density), personality (temporal
references), social support and connectivity (social/personal concerns), and awareness of
one’s surroundings and environment (interpersonal focus). Prior work suggests that these
cues are valuable in understanding mental health, including in social media expression.21
Results
As noted above, the machine learning approach accurately identified genuine mental
health disclosures in nearly all social media posts we examined. We also observed considerable
differences in the linguistic content and topical focus of Twitter posts of female and
male users, as well as across cultures.m First, when affective and cognitive attributes are
aggregated into single categories, we see that females generally show higher scores in all
linguistic measures (Figure 16). This suggests a generally higher level of psycholinguistic
expressiveness on social media by women and girls, a promising result for the objective of
using such platforms to identify trends in mental health. Second, we see that the differences
are even more pronounced in the MID (mental illness disclosure) user sample than in the
CTL (control) sample.
m We focus on sex dif-
ferences here in this
summary. Please refer to De
Choudhury et al. (2016)
for a discussion of cultural
Table 3. Types of linguisc measures used.
Affective Positive affect Expressions denoting positive moods (e.g. joy, energy, alertness)
Negative affect Expressions denoting negative moods (e.g. sadness, fear, lethargy)
Anger Expressions of anger
Anxiety Expressions of anxiety
Sadness Expressions of sadness
Swearing Use of swear words, denoting frustration, intensity of reaction
Cognitive Cognition Expression that reects thought, possibly independent of external stimuli
Perception Expression that reects sensory input (e.g. information received by seeing, hearing, feeling)
Linguistic style Lexical density Nouns, adjectives, adverbs, and verbs as a proportion of all words
Temporal measures Use of past, present, and future tenses
Social/personal concerns Words pertaining to social engagement or self-engagement (e.g. words about family, friends, social
work, health, etc.)
Interpersonal awareness/focus Use of 1st person singular, 1st person plural, 2nd person, and 3rd person pronouns
Attribute Category Subtype Description/example
IV. Internet Acvity 26
We can further break down the attribute categories. Female users in the MID sample
show 15.4% higher sadness and 10.7% higher anxiety; prior literature indicates
that expression of these emotions is associated with depression, mental instability,
and feelings of helplessness, loneliness, and restlessness. However, female users also tend
to use 7.1% more positive affect in their content, perhaps to demonstrate a positive
outlook publicly despite the mental health challenges they are facing. Male users, on
the other hand, express 2.6% more negative affect overall, including 5.3% higher anger
and 9.5% more expressions with swearing. Females express fewer cognitive attributes on
social media than do males. Lower usage of words that denote certainty, for example,
may demonstrate heightened emotional instability. These differences in cognitive expression
are not pronounced in the control sample, however, suggesting that experience
of mental illness, not intrinsic differences between the sexes, is responsible for the
observed gap.
We turn now to social/personal concerns and interpersonal focus, both subtypes
within the linguistic style category. Male MID users display an 8.1% lower sense
of achievement than women and girls,
a known signal of reduced self-esteem.22 Female MID users, meanwhile, express 6.0%
greater concern about their health and 2.7% greater concern about their body, which
may indicate a greater self-awareness about their health or, alternatively, more fixation
with social perceptions about their appearance. Another interesting
finding is that male MID users exhibit lower use of words having to
do with social concerns, friends, or family. Their female peers may be
using such language more frequently in their Twitter posts to explicitly
seek help from their social networks. The interpersonal focus metrics
Figure 16. Dierences in linguisc measures between female and male users,
disaggregated by mental illness disclosure (MID) and control sample (CTL).
Posive values indicate higher scores for female users.
#depression has invaded my peace and
#anxiety has exhausted my thoughts. Pain
isn’t always physical
– female user
why am I even here... No one needs or wants
me... I’m useless
– female user
[Figure 16 axis label: absolute difference between female and male users (%). Legend: MID users, CTL users.]
Over the past 2 years I have been hit with
physical and mental pain. The pain is real. It
is sll there.
– female user
also reect these patterns. Male MID users use rst person pronouns
10.2% more than female MID users, but 3.0% less second person and
3.4% less third person pronouns, indicating that males tend to be less
interactive. Once again, these dierences are much less pronounced in
the CTL sample. Mental illness appears to amplify dierences between
female and male expression on social media.
Our analysis of topics (what users were talking about) confirmed
the patterns observed in the linguistic measures. We found that
two topics were more likely to appear in male MID posts than in
female MID posts. The first topic related to
negative thoughts and hopelessness, and the
second to detachment from the social realm
and a hesitation to seek help. Female MID
users expressed a positive outlook in coping
with mental health challenges, as well as a
desire for disclosure and help-seeking, to a
much greater degree than their male peers.
Women were also much more likely to share
personal experiences around mental illness
and engage in self-assessments.
is work provides some of the rst detailed
insights on patterns of mental health among
girls and women using public social media posts. We found that
female users expressed higher sadness and anxiety, but lower anger and
negative aect than male counterparts. ese observations align with
prior work in social psychology.23 Female mental health disclosers in
our dataset also expressed greater social and familial concerns than
did males. e literature indicates that women tend to rely more on
the social network of family and community, whereas men exhibit
a relative orientation towards public stoicism.24 e topic analysis
conrmed this pattern. Although much work remains to link these
dierences to specic mental health conditions and severity of illness,
this data suggests that such research would indeed be fruitful.
Hard to really feel sick with this support
group. #family
– female user
I miss having someone, a friend to talk to
all night
– female user
Somemes I wonder if anyone sll looks
out for me. I am a mess that nobody wants
to clean up. I’m a wreck
– male user
If I were going to kill myself, I wouldn’t tell
anyone. If I’m already invisible, why see me
to favor your own self-righteousness?
– male user
you’re afraid to tell people how you feel
because you fear rejecon, so you bury it
deep inside yourself where it only destroys
you more
– female user
I used to hurt myself because it was the
only pain I could control.
– female user
IV. Internet Acvity 28
The present analysis has important limitations. The phrases used to filter for mental
health concerns are not an exhaustive list of possible signals of depression, anxiety,
or other states. In addition, our sample
is not representative of the general population; it captures Twitter users, who are likely
to be more affluent, more technologically skilled, and more willing to express themselves
publicly about mental health issues than the population at large. Inferring overall mental
health disorder prevalence rates from social media will clearly require validation surveys
that precisely quantify bias.
Overall, however, we conclude that machine learning methods can filter through immense
amounts of data available to identify signals of illness with a high level of accuracy. This
suggests two major applications for monitoring and treatment. At the individual level,
signals of mental illness could provoke response, either from the user’s community or
through automated means from the social media platform itself (for example, offering
counseling resources). At the population level, given adjustments for biases in Internet use
and other factors, mental health trends can be monitored in near real-time, which may be
especially useful following acute events of social stress such as recessions, political crises,
and natural disasters. Social media monitoring will not replace more formal approaches to
mental health surveillance, but it can complement these other tools.
Adam Cohn
Big data is a valuable resource in the fine-grained measurement of women’s and girls’
well-being. Flowminder’s geospatial modeling work, based on satellite imagery, provides
a high-resolution look at social and health outcomes as they vary over space; the same
method could be used to create data systems that capture variation in welfare over time.
Expenditure patterns inferred from credit card and cell phone records, as in the
MIT-led work, provide a detailed look at economic activity across different social groups
and over time. The social media-based projects achieve the same objective of high-spatial
resolution, high-frequency measurement, but with a focus on emotions, thoughts, and
Overall, this report illustrates the potential of big data in filling the global gender data
gap. In closing, however, we note that the rise of big data does not signify that traditional
sources of data will become less important. On the contrary, the successful implementation
of big data requires investment in proven methods of social scientific research, not least
for the validation and bias correction of big datasets. For example, Flowminder’s work
requires DHS or other types of survey data as a starting point, as well as field biophysical
data to calibrate the interpretation of satellite imagery. Inferring women’s economic
behavior from cell phone and credit card records demands ground-truthing work to assess
how strongly, within a given culture and economy, these records reflect overall social and
consumer behavior. Twitter users are a biased sample of society at large, and determining
the magnitude and direction of that bias through surveys is critical if this information is to
be useful in assessing population-level patterns.
More broadly, big data is not a panacea for all the challenges of development planning and
research. The invisibility of women and girls in international and national data discourse is
a political, not solely a technical, problem. New methods can indeed illuminate previously
ignored aspects of the lives of women and girls, but they can also create a sense that technical
advancements alone will compel investments in gender-sensitive data systems by national
statistical agencies, civil society organizations, and international donors. They will not. In
the worst-case scenario, they will have the opposite effect: the data deluge may shift policy
focus towards the groups and regions for which the most information is available, not the
people and places in greatest need. Even big data illuminates only small parts of the entire
field of human experience.
In the best case, however, the current “data revolution” will be reimagined as a step towards
building “data governance”: a process through which novel types of data do not bring about
instant, perfect knowledge about global development processes, but rather catalyze the
creation of new partnerships for the sharing and interpretation of information. In the best
case, projects that use big datasets will be informed by thoughtful hypotheses advanced by
women and girls themselves, take a pragmatic but tireless approach to data policy reform in
a decision-making world still dominated by men, and work in concert with advocates for
the inclusion of women and girls in all spheres of social and economic life. These are the
kinds of projects we have profiled in this report, and they hold great promise for the future
of social science and policymaking.
1 See Alegana et al. 2015 and Sedda et al. 2015 for similar past work
2 ICF International 2012
3 HDRO (UNDP) 2015
4 KNBS 2010; NBS 2011; NPC 2014; NIPORT 2013; Cayemittes et al. 2013
5 Alegana et al. 2015; Gething et al. 2015
6 Lazer et al. 2009
7 Yoshimura et al. 2009; Krumme et al. 2013; Giles 2012
8 Toole et al. 2015; Gonzalez, Hidalgo, and Barabasi 2008; Jiang et al. 2016; Song et al. 2010; Blumenstock,
Cadamuro, and On 2015; Lenormand et al. 2015; Louail et al. 2014; Çolak et al. 2016
9 Eagle, Macy, and Claxton 2010; Gonzalez, Hidalgo, and Barabasi 2008; Pappalardo et al. 2015.
10 Cauce et al. 2002; Taylor and Brown 1988
11 Wang et al. 2000
12 WHO 2001
13 Spector 2002
14 Taylor and Brown 1988; Spector 2002
15 Ormel et al. 1994; Patel et al. 1999
16 Coppersmith, Dredze, and Harman 2014; Coppersmith, Harman, and Dredze 2014; Culotta 2014; De
Choudhury, Counts, and Horvitz 2013; De Choudhury et al. 2014; De Choudhury et al. 2013; Eichstaedt et al.
2015; Homan et al. 2014
17 Coppersmith, Dredze, and Harman 2014; Coppersmith, Harman, and Dredze 2014
18 Zhu, Ghahramani, and Lafferty 2003; De Choudhury et al. 2016
19 Pennebaker, Francis, and Booth 2001; Chung and Pennebaker 2007; De Choudhury and De 2014
20 Gross and Muñoz 1995
21 Ramirez-Esparza 2008
22 Chancellor 2016
23 Lieberman and Goldstein 2006
24 Guillemin, Bombardier, and Beaton 1993
Photo Credits
Cover, pages 5, 19, and 30: © Elizabeth Whelan. All rights reserved.
Page 28: Adam Cohn, “Indian Woman with Smartphone.” Some rights reserved, CC BY-NC-ND 2.0 (Creative
Commons Attribution-NonCommercial-NoDerivs 2.0 Generic) license. Cropped.
References
Alegana, Victor A., Peter M. Atkinson, Carla Pezzulo, Alessandro Sorichetta, D. Weiss, T. Bird, E. Erbach-
Schoenberg, and Andrew J. Tatem. 2015. “Fine resolution mapping of population age-structures for health
and development applications.” Journal of the Royal Society Interface 12, no. 105: 20150073.
Blumenstock, Joshua, Gabriel Cadamuro, and Robert On. 2015. “Predicting poverty and wealth from mobile
phone metadata.” Science 350, no. 6264: 1073-1076.
Cauce, Ana Mari, Melanie Domenech-Rodriguez, Matthew Paradise, Bryan N Cochran, Jennifer Munyi Shea,
Debra Srebnik, and Nazli Baydar. 2002. “Cultural and contextual influences in mental health help seeking:
a focus on ethnic minority youth.” Journal of Consulting and Clinical Psychology 70, no. 1: 44.
Cayemittes, Michel, Michelle Fatuma Busangu, Jean de Dieu Bizimana, Bernard Barrère, Blaise Sévère, Viviane
Cayemittes et Emmanuel Charles. 2013. Enquête Mortalité, Morbidité et Utilisation des Services, Haïti,
2012. Calverton, Maryland, USA: MSPP, IHE et ICF International.
Chancellor, Stevie, Zhiyuan (Jerry) Lin, Erica Goodman, Stephanie Zerwas, and Munmun De Choudhury.
2016. “Quantifying and predicting mental illness severity in online pro-eating disorder communities.”
In Proceedings of the 19th ACM Conference on Computer Supported Cooperative Work & Social
Computing, pp. 1171-1184. ACM.
Chung, Cindy, and James W Pennebaker. 2007. “The psychological functions of function words.” Social
Communication: 343-359.
Çolak, Serdar, Antonio Lima, and Marta C. González. 2016. “Understanding congested travel in urban areas.”
Nature Communications 7: 10793. doi:10.1038/ncomms10793.
Coppersmith, Glen, Craig Harman, and Mark Dredze. 2014. “Measuring post traumatic stress disorder in Twitter.”
In Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media (ICWSM),
pp. 579-582.
Coppersmith, Glen, Mark Dredze, and Craig Harman. 2014. “Quantifying mental health signals in Twitter.” In
Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: from Linguistic
Signal to Clinical Reality, pp. 51-60. Baltimore, Maryland: Association for Computational Linguistics.
Culotta, Aron. 2014. “Estimating county health statistics with Twitter.” In Proceedings of the SIGCHI Conference
on Human Factors in Computing Systems, pp. 1335-1344. ACM.
De Choudhury, Munmun, and Sushovan De. 2014. “Mental health discourse on Reddit: Self-disclosure, social
support, and anonymity.” In Proceedings of the Eighth International AAAI Conference on Weblogs and
Social Media (ICWSM).
De Choudhury, Munmun, Emre Kiciman, Mark Dredze, Glen Coppersmith, and Mrinal Kumar. 2016.
“Discovering shifts to suicidal ideation from mental health content in social media.” In Proceedings of the
SIGCHI conference on human factors in computing systems, pp. 2098-2110. ACM.
De Choudhury, Munmun, Michael Gamon, Scott Counts, and Eric Horvitz. 2013. “Predicting depression via
social media.” In Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media
De Choudhury, Munmun, Scott Counts, and Eric Horvitz. 2013. “Social media as a measurement tool of depres-
sion in populations.” In Proceedings of the 5th Annual ACM Web Science Conference, pp. 47-56. ACM.
De Choudhury, Munmun, Scott Counts, Eric Horvitz, and Aaron Hoff. 2014. “Characterizing and predicting
postpartum depression from Facebook data.” In Proceedings of the ACM Conference on Computer
Supported Cooperative Work and Social Computing. ACM.
Eagle, Nathan, Michael Macy, and Rob Claxton. 2010. “Network diversity and economic development.” Science
328, no. 5981: 1029-1031.
Eichstaedt, Johannes C., Hansen Andrew Schwartz, Margaret L Kern, Gregory Park, Darwin R Labarthe, Raina M
Merchant, Sneha Jha, Megha Agrawal, Lukasz A Dziurzynski, Maarten Sap, Christopher Weeg, Emily E.
Larson, Lyle H. Ungar, Martin E.P. Seligman. 2015. “Psychological language on Twitter predicts county-level
heart disease mortality.” Psychological Science 26, no. 2: 159-169. doi:10.1177/0956797614557867.
Gething, Peter, Andy Tatem, Tom Bird, and Clara R. Burgert‐Brucker. 2015. “Creating Spatial Interpolation
Surfaces with DHS Data.” DHS Spatial Analysis Reports No. 11. Rockville, Maryland, USA: ICF
Giles, Jim. 2012. “Making the links.” Nature 488, no. 7412: 448-450.
Gonzalez, Marta C., Cesar A Hidalgo, and Albert-Laszlo Barabasi. 2008. “Understanding individual human
mobility patterns.” Nature 453, no. 7196: 779-782.
Gross, James J., and Ricardo F Muñoz. 1995. “Emotion regulation and mental health.” Clinical Psychology:
Science and Practice 2, no. 2: 151-164.
Guillemin, Francis, Claire Bombardier, and Dorcas Beaton. 1993. “Cross-cultural adaptation of health related
quality of life measures: literature review and proposed guidelines.” Journal of Clinical Epidemiology 46,
no. 12: 1417-1432.
Homan, Christopher M., Naiji Lu, Xin Tu, Megan C Lytle, and Vincent Silenzio. 2014. “Social structure and
depression in TrevorSpace.” In Proceedings of the 17th ACM Conference on Computer Supported
Cooperative Work & Social Computing, pp. 615-625. ACM.
Human Development Report Oce (HDRO), United Nations Development Program (UNDP). 2015. Human
Development Report 2015: Work for Human Development. New York: United Nations Development
ICF International. 2012. Demographic and Health Survey Sampling and Household Listing Manual. MEASURE
DHS, Calverton, Maryland, U.S.A.: ICF International.
Jiang, Shan, Yingxiang Yang, Siddharth Gupta, Daniele Veneziano, Shounak Athavale, and Marta C. González.
2016. “The TimeGeo modeling framework for urban motility without travel surveys.” Proceedings of the
National Academy of Sciences: 201524261. doi:10.1073/pnas.1524261113.
Kenya National Bureau of Statistics (KNBS) and ICF Macro. 2010. Kenya Demographic and Health Survey
2008‐09. Calverton, Maryland: KNBS and ICF Macro.
Krumme, Coco, Alejandro Llorente, Manuel Cebrian, and Esteban Moro. 2013. “The predictability of consumer
visitation patterns.” Scientific Reports 3: 1645.
Lazer, David, Alex Sandy Pentland, Lada Adamic, Sinan Aral, Albert Laszlo Barabasi, Devon Brewer, Nicholas
Christakis, Noshir Contractor, James Fowler, Myron Guttman, Tony Jebara, Gary King, Michael Macy,
Deb Roy, and Marshall Van Alstyne. 2009. “Life in the network: the coming age of computational social
science.” Science 323, no. 5915: 721-723.
Lenormand, Maxime, Thomas Louail, Oliva G. Cantú-Ros, Miguel Picornell, Ricardo Herranz, Juan Murillo
Arias, Marc Barthelemy, Maxi San Miguel, and José J. Ramasco. 2015. “Influence of sociodemographic
characteristics on human mobility.” Scientific Reports 5: 10075. doi:10.1038/srep10075.
Lieberman, Morton A., and Benjamin A Goldstein. 2006. “Not all negative emotions are equal: The role of
emotional expression in online support groups for women with breast cancer.” Psycho-Oncology 15, no.
2: 160-168.
Louail, omas, Maxime Lenormand, Oliva G. Cantu Ros, Miguel Picornell, Ricardo Herranz, Enrique Frias-
Martinez, José J. Ramasco, and Marc Barthelemy. 2014. “From mobile phone data to the spatial structure
of cities.” Scientic Reports 4: 5276. doi:10.1038/srep05276.
National Bureau of Statistics (NBS) [Tanzania] and ICF Macro. 2011. Tanzania Demographic and Health Survey
2010. Dar es Salaam, Tanzania: NBS and ICF Macro.
National Institute of Population Research and Training (NIPORT), Mitra and Associates, and ICF International.
2013. Bangladesh Demographic and Health Survey 2011. Dhaka, Bangladesh and Calverton, Maryland,
USA: NIPORT, Mitra and Associates, and ICF International.
National Population Commission (NPC) [Nigeria] and ICF International. 2014. Nigeria Demographic and
Health Survey 2013. Abuja, Nigeria, and Rockville, Maryland, USA: NPC and ICF International.
Ormel, Johan, Michael VonKorff, T Bedirhan Ustun, Stefano Pini, Ailsa Korten, and Tineke Oldehinkel. 1994.
“Common mental disorders and disability across cultures: results from the WHO collaborative study on
psychological problems in general health care.” Journal of the American Medical Association 272, no. 22:
Pappalardo, Luca, Dino Pedreschi, Zbigniew Smoreda, and Fosca Giannotti. 2015. “Using big data to study
the link between human mobility and socio-economic development.” In Proceedings of the 2015 IEEE
International Conference on Big Data (Big Data), pp. 871-878. IEEE Computer Society. doi:10.1109/
Pennebaker, James W., Martha E Francis, and Roger J Booth. 2001. “Linguistic inquiry and word count: LIWC
2001.” Mahwah: Lawrence Erlbaum Associates, 71:2001.
Spector, Rachel E. 2002. “Cultural diversity in health and illness.” Journal of Transcultural Nursing 13, no. 3:
Ramirez-Esparza, Nairan, Cindy K Chung, Ewa Kacewicz, and James W Pennebaker. 2008. “The psychology
of word use in depression forums in English and in Spanish: Testing two text analytic approaches.” In
Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media (ICWSM).
Sedda, Luigi, Andrew J. Tatem, David W. Morley, Peter M. Atkinson, Nicola A. Wardrop, Carla Pezzulo, Alessandro
Sorichetta, Joanna Kuleszo, and David J. Rogers. 2015. “Poverty, health and satellite-derived vegetation
indices: their inter-spatial relationship in West Africa.” International Health 7, no. 2: 99-106.
Song, Chaoming, Zehui Qu, Nicholas Blumm, and Albert-László Barabási. 2010. “Limits of predictability in
human mobility.” Science 327, no. 5968: 1018-1021.
Taylor, Shelley E., and Jonathon D Brown. 1988. “Illusion and well-being: a social psychological perspective on
mental health.” Psychological Bulletin 103, no. 2: 193.
Toole, Jameson L., Carlos Herrera-Yagüe, Christian M. Schneider, and Marta C. González. 2015. “Coupling
human mobility and social ties.” Journal of the Royal Society Interface 12, no. 105: 20141128. doi:10.1098/
Patel, Vikram, Ricardo Araya, Mauricio de Lima, Ana Ludermir, and Charles Todd. 1999. “Women, poverty
and common mental disorders in four restructuring societies.” Social Science & Medicine 49, no. 11:
Wang, Xiangdong, Lan Gao, Naotaka Shinfuku, Huabiao Zhang, Chengzhi Zhao, and Yucun Shen. 2000.
“Longitudinal study of earthquake-related PTSD in a randomly selected community sample in North
China.” American Journal of Psychiatry 157, no. 8: 1260-1266. doi:10.1176/appi.ajp.157.8.1260.
World Health Organization. 2001. The World Health Report 2001: Mental health: new understanding, new hope.
World Health Organization.
Yoshimura, Yuji, Stanislav Sobolevsky, Juan N. Bautista Hobin, Carlo Ratti, and Josep Blat. 2016. “Urban associ-
ation rules: uncovering linked trips for shopping behavior.” Environment and Planning B: Planning and
Design: 0265813516676487.
Zhu, Xiaojin, Zoubin Ghahramani, and John Lafferty. 2003. “Semi-supervised learning using Gaussian fields and
harmonic functions.” In Proceedings of the Twentieth International Conference on Machine Learning
(ICML-03), pp. 912-919.