Demography in the Digital Era: New Data
Sources for Population Research
Demografia nell’era digitale: nuovi fonti di dati per gli
studi di popolazione
Diego Alburez-Gutierrez, Samin Aref, Sofia Gil-Clavel, André Grow, Daniela V.
Negraia, Emilio Zagheni
Abstract The spread of digital technologies and the increased access to the internet
have contributed to the production and accumulation of unprecedented quantities of
data about human behavior. Demographers, who have a long-standing interest in
issues related to data and data quality, are in an ideal position to make sense of
this new information. This paper discusses three ways in which the Data Revolution
has created novel sources of data for demographic research. It discusses the unique
technical and ethical challenges posed by these data sources and the opportunities
they provide for understanding historical and contemporary demographic dynamics
around the world.
Abstract La diffusione di tecnologie digitali e la crescita nell’accesso ad inter-
net hanno contribuito ad una produzione ed accumulo senza precedenti di dati sul
comportamento umano. I demografi, che hanno avuto un interesse di lunga data
su questioni relative a dati e qualità dei dati, sono in una posizione ideale per
estrarre significato da queste nuove informazioni. Questo articolo discute tre modi in
cui la ‘Data Revolution’ ha creato nuove fonti di dati per la ricerca demografica.
L’articolo discute le sfide tecniche ed etiche create da questi dati e le opportunità
che offrono per comprendere le dinamiche demografiche storiche e contemporanee.
Key words: demography, digital data, social media, data revolution
Diego Alburez-Gutierrez, Max Planck Institute for Demographic Research (MPIDR), e-mail:
alburezgutierrez@demogr.mpg.de
Samin Aref, MPIDR, e-mail: aref@demogr.mpg.de
Sofia Gil-Clavel, MPIDR, e-mail: gil@demogr.mpg.de
André Grow, MPIDR, e-mail: grow@demogr.mpg.de
Daniela V. Negraia, MPIDR, e-mail: negraia@demogr.mpg.de
Emilio Zagheni, MPIDR, e-mail: zagheni@demogr.mpg.de
D. Alburez-Gutierrez, S. Aref, S. Gil-Clavel, A. Grow, D.V. Negraia, E.
Zagheni. (2019). 'Demography in the Digital Era: New Data Sources for Popu-
lation Research' in Arbia G., Peluso S., Pini A., Rivellini G. (eds.) Book
of Short Papers SIS2019. Pearson.
1 The Data Revolution: A New Data Paradigm in Demography?
Demography, the systematic study of population dynamics and the causes and
consequences of compositional changes in populations, has always been a data-driven
discipline. Administrators have used censuses to count (and tax) populations since
ancient times. In modern societies, an interest in data characterized the development
of the discipline of demography. For example, John Graunt identified London’s 16th
century ‘Bills of Mortality’ as a potential source of data for demographic analysis,
ultimately resulting in the creation of life tables. We argue that demography is at
the gates of a new data paradigm defined by the increased availability of population
data produced or made available by digital technologies and the internet [6, 8]. The
shift is part of the Data Revolution, the process through which the transition from
analogue to digital electronic technologies has resulted in the accumulation of vast
amounts of individual-level data (see Figure 1 for an illustration). The spread of
the internet, the World Wide Web, and the Internet of Things has accelerated this
process, producing unprecedented data on society and human behavior [29].
This paper presents three innovative sources of data that have been made pos-
sible by the Data Revolution and explores their potential for conducting ground-
breaking demographic research. First, digitization has helped improve access to
existing data, such as censuses and population registers [33, 20], and bibliometric
databases [12, 27]. Similarly, the advent of online peer-to-peer collaboration has cre-
ated new resources, such as massive online genealogical databases [21], that can be
used for studying intergenerational demographic processes. Second, demographers
can now analyze digital traces left by internet users in platforms like Twitter [35]
and Facebook (FB) [15] to study population dynamics. Finally, the Data Revolution
has created new opportunities for collecting primary data using devices connected
to the internet. Examples discussed in this paper include online surveys [7, 1], apps
for registering time-use data [28], and internet advertising platforms [37, 9].
[Figure 1: flow diagram. Node labels: The Data Revolution; Spread of new technologies and platforms; Internet & WWW; Internet of Things; Online adverts; Social media sites; Accumulation of individual-level data; Demography; Economics; Environment; Society. Edge labels: entails; such as; which affected; resulting in; useful for understanding.]
Fig. 1 The Data Revolution and new sources of data for demographic analysis.
1.1 Digitized and Crowd-sourced Data
Demographers and statistical agencies were quick to recognize the importance of
digitizing paper-based demographic data. The digitization of censuses and pop-
ulation registers was pioneered by the Integrated Public Use Microdata Series
(IPUMS), which now hosts the world’s largest collection of demographic
microdata.¹ In time, digitization enabled the creation of crucial data repositories for
demographic research (e.g. the Human Mortality and the Human Fertility Databases²
or digital national population registers). Nordic registers, for example, have been
used to study intergenerational processes in fertility [22], health [5], mortality [4],
and migration [33]. Most of the existing research focuses on Europe, but researchers
increasingly acknowledge the potential of other population registers for conducting
demographic research (e.g. East Asia [13] or North America [18]).
Bibliometric databases, such as Scopus [2], Web of Science [27], and Dimen-
sions [32], are other examples of digitized sources with potential for demographic
research. These databases contain data on millions of scientific publications pro-
duced each year, including author affiliation and addresses. Affiliation data can be
used for analyzing scientific collaboration and mobility of researchers across coun-
tries [24, 3, 12]. Yet, using these data sources for migration research has limitations
which require a careful interpretation of the results [2, 27]. Changes in author
affiliation, for example, are not a perfect proxy for mobility since conducting and
publishing research can be a lengthy process. Migration of researchers is likely to
be underestimated because some movements are not represented in publications in-
dexed in bibliometric databases. This calls for future research integrating bibliomet-
ric data with complementary data sources to resolve some of the methodological
issues. Despite these limitations, bibliometric data sources offer substantial benefits
[12, 27] compared to traditional data sources like surveys. These resources make re-
search on migration of research-active scientists more cross-disciplinary, scalable,
longitudinal, contemporary, and comprehensive.
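The affiliation-based approach described above can be illustrated with a minimal sketch. It assumes a hypothetical, already-cleaned list of (author, publication year, affiliation country) records; real bibliometric exports from Scopus, Web of Science, or Dimensions have different schemas and require author disambiguation first. The function name `infer_moves` and the toy records are illustrative only. The year attached to each inferred move is the publication year in which the new affiliation first appears, which is at best an upper bound on when the move took place, consistent with the limitation noted above.

```python
from collections import defaultdict


def infer_moves(publications):
    """Infer country-to-country moves from changes in affiliation country.

    `publications` is an iterable of (author_id, year, country) tuples.
    Returns a list of (author_id, origin, destination, year) tuples, where
    `year` is the publication year in which the new country first appears.
    """
    by_author = defaultdict(list)
    for author, year, country in publications:
        by_author[author].append((year, country))

    moves = []
    for author, records in by_author.items():
        # Chronological order; ties within a year are ordered by country name,
        # which is an arbitrary simplification for multi-affiliation years.
        records.sort()
        for (_, prev_country), (year, country) in zip(records, records[1:]):
            if country != prev_country:
                moves.append((author, prev_country, country, year))
    return moves


# Hypothetical toy records: (author_id, publication_year, affiliation_country)
records = [
    ("a1", 2010, "NZ"),
    ("a1", 2012, "NZ"),
    ("a1", 2015, "DE"),
    ("a2", 2011, "IT"),
    ("a2", 2013, "IT"),
]

moves = infer_moves(records)
print(moves)
```

Authors who move but do not publish from the new country never appear in `moves`, which is one way the underestimation of researcher migration discussed above arises.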
Demographic data can also be crowd-sourced. Platforms like Geni.com and Wik-
iTree have allowed thousands of amateur genealogists to collaborate in building
large-scale online genealogical databases such as the Familinx database, which in-
cludes 86 million individual records from around the globe, with data that go back as
far as the 17th century [21]. This particular database was scraped from Geni.com, a
collaborative social network that allows users to find and verify family relations. On-
line genealogies are a promising resource because they cover long historical periods
and are not restricted by national boundaries. On the downside, they are not
representative samples and underrepresent Low- and Middle-Income countries (LMIC).
Despite their potential, sound demographic research using these data is still
lacking, as are methodologies for addressing systematic biases and generalizing the
findings to larger populations [16].
¹ www.ipums.org; www.international.ipums.org, accessed 28.02.2019.
² www.mortality.org; www.humanfertility.org, accessed 28.02.2019.
1.2 Digital Traces from Social Media
About half of the world’s population are active internet users and many use social
media platforms like FB and Twitter.³ Demographic information on the users of
these platforms can be used to perform demographic research in a timely manner.
Social media data can also be used to study populations that would otherwise re-
main entirely out of reach [26]. Researchers can access FB and Twitter data using the
platforms’ Application Programming Interfaces (API), some of which have been de-
signed for advertising purposes. The FB Marketing API gives access to aggregated
population data (e.g., the number of FB users by sex and age in a given country who
share certain interests), but not individual-level user data. Unfortunately, FB does
not provide much detail about how these aggregate figures are estimated (e.g., how
users are classified according to their interests, behavior, and demographic char-
acteristics). Twitter, in addition to a Marketing API for aggregate-level summary
information, allows researchers to query individual-level data from ‘public tweets’
(i.e. tweets not protected by the user).⁴ In this way, researchers can access information that
users have agreed to share, including text and images from tweets, user names, and
tweet locations. Having access to individual-level Twitter posts gives researchers
the freedom to design and test different models and algorithms using primary data.
Previous studies have collected data using APIs to study contemporary social and
demographic processes. FB data have been used to study access to digital technolo-
gies [15, 17], immigrant cultural assimilation [14], and to estimate migrant stocks
[37]. Twitter data have been used to study migration flows [35] and to monitor
population health [10] and natural disasters [19]. The use of FB and Twitter data has
clear advantages, but also important drawbacks. A notable limitation is that social
media data are generally not representative of the entire population. Recent studies
have attempted to overcome this limitation by combining social media data, sta-
tistical models, and representative surveys [36, 37]. Another limitation is the lack
of individual-level demographic data for Twitter users. Studies have addressed this
limitation by using pattern recognition techniques to infer the demographic charac-
teristics of users [34]. Nevertheless, there are clear benefits in using this new source
of data. For instance, demographers and sociologists have been able to reach and
study new populations, while statisticians and computer scientists have had the op-
portunity to test new models and algorithms. These examples show how the internet
has created research opportunities that were unimaginable when social networking
platforms were initially conceived, over 20 years ago.
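The idea of combining non-representative social media data with representative reference data can be illustrated with simple post-stratification. This is a minimal sketch and not necessarily the models used in [36, 37]: it assumes that an outcome has been measured within age groups of a platform sample and that the true population size of each group is known from a census or survey. All numbers and names below are hypothetical.

```python
def poststratify(sample_means, sample_counts, population_counts):
    """Return (naive_estimate, poststratified_estimate) of a population mean.

    sample_means:      outcome mean observed in each stratum of the sample
    sample_counts:     number of sampled users in each stratum
    population_counts: reference population size of each stratum
    """
    strata = list(sample_means)

    # Naive estimate: strata are weighted by their share of the sample.
    n_sample = sum(sample_counts[s] for s in strata)
    naive = sum(sample_means[s] * sample_counts[s] for s in strata) / n_sample

    # Post-stratified estimate: strata are weighted by their population share.
    n_pop = sum(population_counts[s] for s in strata)
    adjusted = sum(sample_means[s] * population_counts[s] for s in strata) / n_pop
    return naive, adjusted


# Hypothetical example: young users are over-represented on the platform.
sample_means = {"18-34": 0.50, "35-64": 0.30, "65+": 0.10}
sample_counts = {"18-34": 600, "35-64": 300, "65+": 100}
population_counts = {"18-34": 3000, "35-64": 5000, "65+": 2000}

naive, adjusted = poststratify(sample_means, sample_counts, population_counts)
print(f"naive: {naive:.2f}, post-stratified: {adjusted:.2f}")
```

In this toy example the naive estimate is pulled towards the over-represented young stratum. Post-stratification corrects only for differences in composition across the chosen strata; it does not remove selection into platform use within a stratum.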
³ www.itu.int/en/ITU-D/Statistics/Pages/stat/default.aspx, accessed 28.02.2019.
⁴ help.twitter.com/en/safety-and-security/public-and-protected-tweets, accessed 28.02.2019.
1.3 New opportunities for collecting primary data
The Data Revolution has also created new opportunities for collecting primary data
via the internet. Several studies have recruited participants for online surveys us-
ing social networking sites (e.g., FB and Twitter) and online labor markets (e.g.,
Amazon Mechanical Turk and Craigslist) [7]. Such platforms tend to have wide reach
and often allow the targeting of individuals based on specific demographic char-
acteristics, interests, and behaviors. This makes them attractive for both drawing
convenience samples and recruiting members of hard-to-reach populations, usually
at a lower cost than would be possible with traditional probability samples [1]. Of
the existing platforms, FB has been the most popular, arguably because it (currently)
has the largest number of users and the widest international coverage, and because
it provides detailed information about user characteristics that can be used for tar-
geting participants [7, 9]. In the existing studies using FB, recruitment usually takes
place via ads that can be shown to users at various places of a webpage. Such ads
consist of one or more pictures accompanied by a short study description and a link
to an external site hosting the survey (see [26] for an example).
In addition to new platforms for survey research, internet-enabled devices (e.g.,
mobile phones and activity trackers) can revolutionize current research practice.
One example comes from the area of time-use research. The ways in which people
use their time (i.e., the quantity and quality of time; whether they spend it alone or
interacting with other people or with machines) have implications for their health and
wellbeing. Information about people’s time use and wellbeing has mainly been col-
lected using (1) recall or (2) real-time techniques. In recall techniques, interviews
are typically conducted over the phone or via paper-and-pencil self-administered
diaries, where respondents report back about what they did during that day or the
previous day, the duration of each activity, where they were, and who they inter-
acted with. This method affords coverage and detail of what participants did during
the previous day and the sequence in which activities took place, but it is not very
precise in estimating the hour and the minute in which an activity took place [25].
Furthermore, such recall diaries can be lengthy and burdensome for the respondent,
which is why most nationally representative recall-diary surveys are cross-sectional
and only cover one or two diary-days for each respondent. ‘Real-time’ techniques
that rely on applications or instant messages received on personal mobile phones have
the potential to capture what people are doing ‘right now’, and are likely to provide
a more precise picture of the exact activity [11, 28]. Such techniques also have the
advantage of being faster and less fatiguing for the respondent, allowing longitudi-
nal or repeated measures. Additionally, data collection via cellphone applications
could provide much needed insight into what people from various socioeconomic
contexts and political regimes do during day-to-day life and how those activities
are then linked to various measures of wellbeing. Nevertheless, assessing time-use
and subjective wellbeing in real time has its own reliability and validity challenges,
stemming particularly from the fact that asking respondents to evaluate their current
behavior and/or emotional wellbeing ‘right now’ may change the very behavior
and/or emotions we are trying to measure [23, 31].
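A common design choice in real-time (experience-sampling) data collection is to prompt respondents at random moments within their waking hours. The sketch below shows one way a data collection app might draw such a schedule. The function name, the default waking window (08:00 to 22:00), the number of prompts, and the minimum gap between prompts are all assumptions made for illustration and are not taken from the studies cited above.

```python
import random
from datetime import datetime, timedelta


def sample_prompt_times(day, n_prompts=5, start_hour=8, end_hour=22,
                        min_gap_minutes=30, seed=None):
    """Draw random prompt times within a daily window.

    Prompts are at least `min_gap_minutes` apart and fall inside
    [start_hour, end_hour) on the given day.
    """
    rng = random.Random(seed)
    window_start = day.replace(hour=start_hour, minute=0, second=0, microsecond=0)
    window_minutes = (end_hour - start_hour) * 60

    # Reserve room for the mandatory gaps, draw sorted offsets in what is
    # left, then add the gaps back so that spacing is guaranteed.
    slack = window_minutes - (n_prompts - 1) * min_gap_minutes
    if slack <= 0:
        raise ValueError("window too short for the requested prompts and gap")
    offsets = sorted(rng.randrange(slack) for _ in range(n_prompts))

    return [window_start + timedelta(minutes=offset + i * min_gap_minutes)
            for i, offset in enumerate(offsets)]


# Hypothetical usage: five prompts on a single study day.
prompts = sample_prompt_times(datetime(2019, 2, 28), n_prompts=5, seed=42)
for t in prompts:
    print(t.strftime("%H:%M"))
```

Randomizing the prompt times means respondents cannot anticipate them, while the minimum gap limits respondent burden. Neither addresses the reactivity problem noted above, where the act of asking may itself change the behavior or emotion being measured.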
2 What’s next for Demography?
This paper highlighted new opportunities for demographic research created by the
Data Revolution. The review of new data sources, however, is not exhaustive and
researchers will continue to find new ways of making sense of our social world with
the help of the internet and electronic devices. This concluding section considers the
unique technical and ethical challenges of digital data and discusses how addressing
them can contribute to the advancement of the demographic discipline.
Demographers using digital data face particular issues related to access,
representativeness, and ethics. Researchers often ‘depend on the kindness of strangers’
for accessing data since internet companies, unlike governments, are not obliged to
share data from their platforms. This creates uncertainty as the conditions of access
may change in the future. There are important attempts to address this issue. The
Opal Project, for example, has proposed protocols for private companies to willingly
share anonymized data on a regular basis to inform public policy and academic
research.⁵ Furthermore, digital sources are rarely representative of larger populations
in the way that randomized surveys are (even if, as this paper has shown, digital
technology can enhance the collection of primary survey data). Coverage can also
be an issue, as access to the internet is more restricted in LMIC. Nevertheless, digital
trace data can be used to show some of these global inequalities in access to digital
technologies [15, 17].⁶ The lack of representativeness has motivated research on generalizing from
non-representative samples to larger populations [36, 37]. This is a promising area of
methodological development with wide applications, especially as survey response
rates continue to decline around the world. The availability of online data has also
led researchers to think long and hard about data security, privacy and informed con-
sent in the digital era [30]. Ethical considerations must be a primary concern when
designing demographic studies using digital or internet data. Social scientists need
to adhere to ethical and transparent research practices, particularly as the privacy of
users is constantly threatened in the online world [38].
Finally, it is important to note that while innovative sources of data provide ex-
citing opportunities for new research, they are unlikely to make ‘traditional’ demo-
graphic sources obsolete in the near future (e.g., surveys, censuses). Rather, the Data
Revolution has the potential to complement and augment these existing data sources.
Traditional population data, for example, are crucial for identifying systematic bias
in online sources and calibrating estimates made from these data [37]. Social media
data can be used to estimate important demographic measures in contexts where tra-
ditional survey data are not available. The Data Revolution has already changed the
way we do demography, as evidenced by the digitization of historical censuses and
population registers, and the creation of large-scale and open-access repositories
of demographic data. The pace of these changes is likely to increase in the future as
more researchers engage in ground-breaking research using digital data sources.
⁵ www.opalproject.org/, accessed 28.02.2019.
⁶ This work has resulted in efforts to ‘nowcast’ the digital gender gap in internet and mobile access
using real-time big data: https://www.digitalgendergaps.org/, accessed 28.02.2019.
References
1. Antoun, C., Zhang, C., Conrad, F.G., Schober, M.F.: Comparisons of online recruitment strate-
gies for convenience samples: Craigslist, Google AdWords, Facebook, and Amazon Mechan-
ical Turk. Field Methods 28(3), 231–246 (2016)
2. Appelt, S., van Beuzekom, B., Galindo-Rueda, F., de Pinho, R.: Which factors influence the
international mobility of research scientists? In: A. Geuna (ed.) Global Mobility of Research
Scientists, pp. 177–213. Academic Press, San Diego (2015)
3. Aref, S., Friggens, D., Hendy, S.: Analysing scientific collaborations of New Zealand institu-
tions using Scopus bibliometric data. In: Proceedings of the Australasian Computer Science
Week Multiconference, p. 49. ACM (2018)
4. Baranowska-Rataj, A., Barclay, K., Kolk, M.: The effect of number of siblings on adult mor-
tality: Evidence from Swedish registers for cohorts born between 1938 and 1972. Population
Studies 71(1), 43–63 (2017)
5. Barclay, K.J., Kolk, M.: The long-term cognitive and socioeconomic consequences of birth
intervals: a within-family sibling comparison using Swedish register data. Demography 54(2),
459–484 (2017)
6. Billari, F.C., Zagheni, E.: Big data and population processes: A revolution? In: A. Petrucci,
R. Verde (eds.) Proceedings of the Conference of the Italian Statistical Society, pp. 167–178.
Firenze University Press (2017)
7. Boas, T.C., Christenson, D.P., Glick, D.M.: Recruiting large online samples in the United
States and India: Facebook, Mechanical Turk, and Qualtrics. Political Science Research and
Methods pp. 1–19 (2018)
8. Cesare, N., Lee, H., McCormick, T., Spiro, E., Zagheni, E.: Promises and pitfalls of using
digital traces for demographic research. Demography 55(5), 1979–1999 (2018)
9. Chu, J.L., Snider, C.E.: Use of a social networking web site for recruiting Canadian youth for
medical research. Journal of Adolescent Health 52(6), 792–794 (2013)
10. Cocos, A., Fiks, A.G., Masino, A.J.: Deep learning for pharmacovigilance: Recurrent neural
network architectures for labeling adverse drug reactions in Twitter posts. Journal of the
American Medical Informatics Association 24(4), 813–821 (2017)
11. Csikszentmihalyi, M., Larson, R.: Validity and reliability of the experience-sampling method.
In: Flow and the Foundations of Positive Psychology: The Collected Works of Mihaly Csik-
szentmihalyi, pp. 35–54. Springer, Dordrecht (2014)
12. Czaika, M., Orazbayev, S.: The globalisation of scientific mobility, 1970–2014. Applied Ge-
ography 96, 1–10 (2018)
13. Dong, H., Campbell, C., Kurosu, S., Yang, W., Lee, J.Z.: New sources for comparative social
science: Historical population panel data from East Asia. Demography 52(3), 1061–1088
(2015)
14. Dubois, A., Zagheni, E., Garimella, K., Weber, I.: Studying migrant assimilation through Face-
book interests. In: S. Staab, O. Koltsova, D.I. Ignatov (eds.) Social Informatics, Lecture Notes
in Computer Science, pp. 51–60. Springer International Publishing (2018)
15. Fatehkia, M., Kashyap, R., Weber, I.: Using Facebook ad data to track the global digital gender
gap. World Development 107, 189–209 (2018)
16. Fire, M., Elovici, Y.: Data mining of online genealogy datasets for revealing lifespan patterns
in human population. ACM Trans. Intell. Syst. Technol. 6(2), 28:1–28:22 (2015)
17. Garcia, D., Kassa, Y.M., Cuevas, A., Cebrian, M., Moro, E., Rahwan, I., Cuevas, R.: Analyzing
gender inequality through large-scale Facebook advertising data. Proceedings of the National
Academy of Sciences 115(27), 6958–6963 (2018)
18. Gauvin, H., Moreau, C., Lefebvre, J.F., Laprise, C., Vézina, H., Labuda, D., Roy-Gagnon,
M.H.: Genome-wide patterns of identity-by-descent sharing in the French Canadian founder
population. European Journal of Human Genetics 22(6), 814–821 (2014)
19. Ghahremanlou, L., Sherchan, W., Thom, J.A.: Geotagging Twitter messages in crisis manage-
ment. The Computer Journal 58(9), 1937–1954 (2015)
20. Hall, P.K., McCaa, R., Thorvaldsen, G., International Microdata Access Group (eds.): Handbook of international
historical microdata for population research: A project of IMAG, The International Microdata
Access Group. Minnesota Population Center, Minneapolis, Minn (2000)
21. Kaplanis, J., Gordon, A., Shor, T., Weissbrod, O., Geiger, D., Wahl, M., Gershovits, M.,
Markus, B., Sheikh, M., Gymrek, M., Bhatia, G., MacArthur, D.G., Price, A.L., Erlich, Y.:
Quantitative analysis of population-scale family trees with millions of relatives. Science
360(6385), 171–175 (2018)
22. Kolk, M.: Multigenerational transmission of family size in contemporary Sweden. Population
Studies 68(1), 111–129 (2014)
23. Ludwigs, K., Lucas, R., Burger, M., Veenhoven, R., Arends, L.: How does more attention to
subjective well-being affect subjective well-being? Applied Research in Quality of Life 13(4),
1055–1080 (2018)
24. Moed, H.F., Halevi, G.: A bibliometric approach to tracking international scientific migration.
Scientometrics 101(3), 1987–2001 (2014)
25. Phipps, P.A., Vernon, M.K.: Twenty-four hours: an overview of the recall diary method and
data quality in the American time use survey. In: R.F. Belli, F.P. Stafford, D.F. Alwin (eds.)
Calendar and Time Diary: Methods in Life Course Research, pp. 109–128. Sage Publications,
Thousand Oaks (2009)
26. Pötzschke, S., Braun, M.: Migrant sampling using Facebook advertisements: A case study of
Polish migrants in four European countries. Social Science Computer Review 35(5), 633–653
(2017)
27. Robinson-Garcia, N., Sugimoto, C.R., Murray, D., Yegros-Yegros, A., Larivière, V., Costas,
R.: The many faces of mobility: Using bibliometric data to measure the movement of scien-
tists. Journal of Informetrics 13(1), 50–63 (2019)
28. Roessger, K.M., Greenleaf, A., Hoggan, C.: Using data collection apps and single-case designs
to research transformative learning in adults. Journal of Adult and Continuing Education
23(2), 206–225 (2017)
29. Salathé, M., Bengtsson, L., Bodnar, T.J., Brewer, D.D., Brownstein, J.S., Buckee, C.,
Campbell, E.M., Cattuto, C., Khandelwal, S., Mabry, P.L., Vespignani, A.: Digital Epidemiology.
PLoS Computational Biology 8(7), e1002616 (2012)
30. Salganik, M.J.: Bit by bit: Social research in the digital age. Princeton University Press,
Princeton (2018)
31. Sasaki, W., Nakazawa, J., Okoshi, T.: Comparing ESM timings for emotional estimation
model with fine temporal granularity. In: Proceedings of the 2018 ACM International Joint
Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and
Wearable Computers, pp. 722–725. ACM (2018)
32. Thelwall, M.: Dimensions: A competitor to Scopus and the Web of Science? Journal of Infor-
metrics 12(2), 430–435 (2018)
33. Thorvaldsen, G., Østrem, N.O.: Migration and the historical population register of Norway.
Journal of Migration History 4(2), 237–248 (2018)
34. Yildiz, D., Munson, J., Vitali, A., Tinati, R., Holland, J.A.: Using Twitter data for demographic
research. Demographic Research 37(46), 1477–1514 (2017)
35. Zagheni, E., Garimella, V.R.K., Weber, I., State, B.: Inferring international and internal mi-
gration patterns from Twitter data. In: Proceedings of the 23rd International Conference on
World Wide Web - WWW ’14 Companion, pp. 439–444. ACM Press, Seoul, Korea (2014)
36. Zagheni, E., Weber, I.: Demographic research with non-representative internet data. Interna-
tional Journal of Manpower 36(1), 13–25 (2015)
37. Zagheni, E., Weber, I., Gummadi, K.: Leveraging Facebook’s advertising platform to monitor
stocks of migrants. Population and Development Review 43(4), 721–734 (2017)
38. Zuboff, S.: Big other: Surveillance capitalism and the prospects of an information civilization.
Journal of Information Technology 30(1), 75–89 (2015)