Big data and official statistics: Opportunities, challenges and risks
Rob Kitchin, NIRSA, National University of Ireland Maynooth, County Kildare, Ireland
The Programmable City Working Paper 9
http://www.nuim.ie/progcity/
16th April 2015
Abstract
The development of big data is set to be a significant disruptive innovation in the production
of official statistics offering a range of opportunities, challenges and risks to the work of
national statistical institutions (NSIs). This paper provides a synoptic overview of these
issues in detail, mapping out the various pros and cons of big data for producing official
statistics, examining the work to date by NSIs in formulating a strategic and operational
response to big data, and plotting some suggestions with respect to on-going change
management needed to address the use of big data for official statistics.
Key words: big data, official statistics, national statistical institutions (NSIs), opportunities,
challenges, risks
Introduction
National Statistical Institutions (NSIs) are charged with producing and publishing official
statistics across a range of domains and scales relating to a nation. Official statistics are used
to report on the present state of play and unfolding trends with respect to society and
economy to a domestic and international audience, with many statistics being collated into
supra-national statistical systems. Over the last couple of hundred years, NSIs, both on their
own initiative and in collaboration with each other, have developed rigorous and standardized
procedures for sampling, generating, handling, processing, storing, analyzing, sharing and
publishing official statistical data. More recently, over the past half century, NSIs have
increasingly turned to exploiting administrative data sets produced by other state agencies for
official statistics. In both cases, NSIs are the principal administrators of an official statistical
system, in the first case controlling the whole data life cycle and in the second supported by
legislative tools to ensure compliance with data provision.
The development of big data is set to be a significant disruptive innovation in the
production of official statistics offering a range of opportunities, challenges and risks to the
work of NSIs. As with many innovations driven by new technological developments,
big data has become a buzz phrase that is variously understood, with many definitions
making reference to a fundamental shift in the nature of some data with respect to the 3Vs of
volume, velocity and variety. Based on an extensive review of the literature and a conceptual
comparison between small and big data (see Table 1), Kitchin (2013, 2014) contends that big
data has the following characteristics:
- huge in volume, consisting of terabytes or petabytes of data;
- high in velocity, being created in or near real-time;
- diverse in variety, being structured, semi-structured and unstructured in nature;
- exhaustive in scope, striving to capture entire populations or systems (n=all);
- fine-grained in resolution and uniquely indexical in identification;
- relational in nature, containing common fields that enable the conjoining of different data sets;
- flexible, holding the traits of extensionality (can add new fields easily) and scalability (can expand in size rapidly).
Table 1: Comparing small and big data

Characteristic | Small data | Big data
Volume | Limited to large | Very large
Exhaustivity | Samples | Entire populations
Resolution and identification | Coarse & weak to tight & strong | Tight & strong
Relationality | Weak to strong | Strong
Velocity | Slow, freeze-framed/bundled | Fast, continuous
Variety | Limited to wide | Wide
Flexible and scalable | Low to middling | High
With some notable exceptions, such as financial and weather datasets, the occurrence
of big data is largely a post-millennium phenomenon enabled by: advances in computational
power; pervasive, ubiquitous and mobile computing; networked storage; new forms of
database design; new modes of software-mediated communication, interactions and
transactions; and data analytics that utilise machine learning and are able to cope with a data
deluge. To date, official statistical data have been small data, holding some of the
characteristics of big data but not all. For example, a census has volume, exhaustivity,
resolution, and relationality, but has no velocity (generated once every five or ten years), no
variety (usually c.30 structured questions), and no flexibility (once set a census cannot be
altered mid data generation). Most other official statistical data lack exhaustivity using
sampling frameworks to selectively represent populations. In comparison, mobile phone
companies are logging millions of calls and associated metadata every hour, large
supermarket chains are handling hundreds of thousands of customer transactions an hour,
traffic sensors are tracking hundreds of thousands of vehicles a day as they navigate cities,
and social media companies are processing billions of interactions a day. In each case the
data relate to entire populations of that system, are often resolute relating to specific
customers and transactions, and in the case of social media can be highly varied including
text, photos, videos, sound files and weblinks.
Unsurprisingly, given its scope, timeliness, and resolution, and the potential
efficiencies it offers in the resourcing and compiling of data and statistics, big data has
captured the interest of NSIs and related agencies such as Eurostat, the European Statistical
System (ESS, who have formulated a big data roadmap), United Nations Economic
Commission for Europe (UNECE, who have established a High Level Group for the
Modernization of Statistical Production and Services focused on big data, with four ‘task
teams’: privacy, partnerships, sandbox and quality), and the United Nations Statistical
Division (UNSD, who have organized a Global Working Group on Big Data and Official
Statistics). In 2013 the Heads of the National Statistical Institutes of the EU signed the
Scheveningen Memorandum to examine the use of big data in official statistics. However, a
survey jointly conducted by UNSD and UNECE revealed that of the 32 NSIs that responded
only a ‘few countries have developed a long-term vision for the use of Big Data’, or
‘established internal labs, task teams or working groups to carry out pilot projects to
determine if and how Big Data could be used as a source of Official Statistics’ (UNESC
2015: 16). Some are ‘currently on the brink of formulating a Big Data strategy’ ... ‘but most
countries have not yet defined business processes for integrating Big Data sources and results
into their work and do not have a defined structure for managing Big Data projects’ (UNESC
2015: 16). As these organisations are discovering, whilst big data offer a number of
opportunities for NSIs, they also present a series of challenges and risks that are not easy to
handle and surmount. Indeed, the use of big data needs careful consideration to ensure that
they do not compromise the integrity of NSIs and their products. The rest of the paper
discusses these opportunities, challenges and risks, which are summarized in Table 2.
Table 2: Opportunities, challenges and risks of big data for official statistics

Opportunities:
- complement, replace, improve, and add to existing datasets
- produce more timely outputs
- compensate for survey fatigue of citizens and companies
- complement and extend micro-level and small area analysis
- improve quality and ground truthing
- refine existing statistical composition
- easier cross-jurisdictional comparisons
- better linking to other datasets
- new data analytics producing new and better insights
- reduced costs
- optimization of working practices and efficiency gains in production
- redeployment of staff to higher value tasks
- greater collaboration with computational social science, data science, and data industries
- greater visibility and use of official statistics

Challenges:
- forming strategic alliances with big data producers
- gaining access to data
- gaining access to associated methodology and metadata
- establishing provenance and lineage of datasets
- legal and regulatory issues
- establishing suitability for purpose
- establishing dataset quality with respect to veracity (accuracy, fidelity), uncertainty, error, bias, reliability, and calibration
- technological feasibility
- methodological feasibility
- experimenting and trialing big data analytics
- institutional change management
- ensuring inter-jurisdictional collaboration and common standards

Risks:
- mission drift
- damage to reputation and losing public trust
- privacy breaches and data security
- inconsistent access and continuity
- resistance of big data providers and populace
- fragmentation of approaches across jurisdictions
- resource constraints and cut-backs
- privatisation and competition
Opportunities
Clearly the key opportunity of big data is the availability of new sources of dynamic, resolute
data that can potentially complement, replace, improve, and add to existing datasets and
refine existing statistical composition, and produce more timely outputs. Indeed, Florescu et
al. (2014: 3-4) detail that big data sources could be used in current statistical systems in five
ways:
- to entirely replace existing statistical sources such as surveys (existing statistical outputs);
- to partially replace existing statistical sources such as surveys (existing statistical outputs);
- to provide complementary statistical information in the same statistical domain but from other perspectives (additional statistical outputs);
- to improve estimates from statistical sources (including surveys) (improved statistical outputs);
- to provide completely new statistical information in a particular statistical domain (new alternative statistical outputs).
To these, Tam and Clarke (2014: 8-9) add:
- sample frame or register creation – identifying survey population units and/or providing auxiliary information such as stratification variables;
- imputation of missing data items – substituting for same or similar units (a minimal illustrative sketch follows this list);
- editing – assisting the detection and treatment of anomalies in survey data;
- linking to other data – creating richer datasets and/or longitudinal perspectives;
- data confrontation – ensuring the validity and consistency of survey data;
- improving the operational efficiency and effectiveness of NSIs through use of paradata created and captured from its statistical operations.
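To make the imputation role above more concrete, the following is a minimal, hypothetical sketch (in Python, with invented data) of hot-deck style imputation: missing survey values are filled from the most similar ‘donor’ record, where similarity is computed on auxiliary variables that a big data or administrative source could supply. It illustrates the general idea only and is not any NSI’s actual procedure.

```python
# A minimal sketch (not from the paper) of hot-deck style imputation:
# missing survey values are filled from the most similar "donor" record,
# where similarity is computed on auxiliary variables that a big data
# source (e.g. scanner or administrative data) could supply.
import numpy as np

def hot_deck_impute(values, aux, missing_mask):
    """Fill missing entries in `values` from the nearest donor by `aux`.

    values       : 1-D array of the survey variable (np.nan where missing)
    aux          : 2-D array of auxiliary variables, one row per unit
    missing_mask : boolean array, True where `values` is missing
    """
    donors = ~missing_mask
    imputed = values.copy()
    for i in np.where(missing_mask)[0]:
        # Euclidean distance on auxiliary variables to every donor record
        dists = np.linalg.norm(aux[donors] - aux[i], axis=1)
        donor_idx = np.where(donors)[0][np.argmin(dists)]
        imputed[i] = values[donor_idx]
    return imputed

# Illustrative data: household expenditure with two missing responses, and
# auxiliary variables (e.g. dwelling size, card spend) from another source.
values = np.array([120.0, np.nan, 95.0, np.nan, 150.0])
aux = np.array([[70, 2.1], [68, 2.0], [55, 1.4], [90, 3.0], [92, 3.1]], float)
mask = np.isnan(values)
print(hot_deck_impute(values, aux, mask))  # missing items take nearest donors' values
```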
Significantly, big data offer the opportunity to produce more timely official statistics,
drastically reducing processing and calculation times, and to do so on a rolling basis
(Eurostat 2014). For example, rather than it taking several weeks to produce quarterly
statistics (such as GDP), it might take a few minutes or hours, with the results being released
on the same timescale on a rolling basis. In this sense, big data offers the possibility for
‘nowcasting’, the prediction of the present (Choi and Varian 2011: 1). For Global Pulse
(2012: 39) the timeliness of big data enables:
1. “early warning: early detection of anomalies in how populations use digital
devices and services can enable preventive interventions;
2. real-time awareness: a fine-grained and current representation of reality which
can inform the design and targeting of programs and policies;
3. real-time feedback: real time monitoring makes it possible to understand where
policies and programs are failing and make the necessary adjustments in a more
timely manner.”
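As an illustration of nowcasting, the following is a toy sketch (in Python, with invented numbers) in the spirit of Choi and Varian's ‘predicting the present’: an official quarterly series is regressed on its own lag and a timely auxiliary indicator such as a search-volume index, and the fitted model is then used to estimate the current quarter before the official figure is compiled. The series, coefficients and variable names are purely illustrative.

```python
# A toy nowcasting sketch in the spirit of Choi and Varian (2011):
# regress an official quarterly series on its own lag plus a timely
# auxiliary indicator (e.g. a search-volume index), then "predict the
# present" quarter for which only the indicator is yet available.
# All numbers below are illustrative, not real statistics.
import numpy as np

# Official series y_t (e.g. retail sales growth, %) and a search index x_t
y = np.array([1.2, 0.8, 1.5, 1.1, 0.9, 1.4, 1.6, 1.0])
x = np.array([50., 47., 56., 52., 48., 55., 58., 51.])

# Design matrix: intercept, lagged y, contemporaneous indicator
X = np.column_stack([np.ones(len(y) - 1), y[:-1], x[1:]])
beta, *_ = np.linalg.lstsq(X, y[1:], rcond=None)

# Nowcast the current quarter: the official y is not yet available,
# but last quarter's y and this quarter's search index are.
x_now = 57.0
nowcast = beta @ np.array([1.0, y[-1], x_now])
print(f"nowcast of current quarter: {nowcast:.2f}%")
```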
In the developing world, where the resourcing of NSIs has sometimes been limited and
traditional surveys are sometimes viewed as cumbersome, expensive and of limited
effectiveness, or they are affected by other external influences (political pressure, war, etc),
big data is seen as a means of filling basic gaps in official statistics and by-passing political
bottlenecks to statistical reform (Global Pulse 2012; Albert 2013, Krätke and Byiers 2014;
Letouzé and Jütting 2014). Such an aspiration is also relevant to the developed world in
cases where official statistics are difficult to produce, or are methodologically weak, or lack
adequate granularity and disaggregation (spatially, temporally). Indeed, big data offers a rich
source of granular data, often at the level of unique individuals, households or companies, to
complement and extend micro-level and small area analysis (Reimsbach-Kounatze 2015).
Further, big data are direct measurements of phenomena and provide a reflection of
actual transactions, interactions and behaviour of people, societies, economies and systems,
rather than surveys which reflect what people say they do or think. Thus, while big datasets
can be noisy, and contain gamed and faked data, they potentially possess more ground truth
with respect to social reality than current instruments used for official statistics (Hand 2015).
And since the big data being produced are an inherent part of the systems that generate them,
they can compensate for significant survey fatigue amongst citizens and companies (Struijs
et al. 2014). Moreover, since big data are generated from systems that often span or are
deployed in many jurisdictions – unlike much data derived from surveys or administrative
systems – they potentially ensure comparability of phenomena across countries.
An additional advantage is that big data offers the possibility to add significant value
to official statistics at marginal cost, given the data are already being produced by third
parties (Dunne 2013; Landefeld 2014; Struijs et al. 2014; AAPOR 2015). Indeed, it could
lead to greater optimization of working practices, efficiency gains in production, and a
redeployment of staff away from data generation and curation to higher value tasks such as
analysis or quality assurance, communication or developing new products. It also has the
potential to lead to greater collaboration with computational social science, data science, and
data industries, leading to new insights and innovations, and a greater visibility and use of
official statistics as they become more refined, timely and resolute. Further, new data
analytics, utilising machine learning to perform data mining and pattern recognition,
statistical analysis, prediction, simulation, and optimization, data visualization and visual
analytics, mean that greater insights might be extracted from existing statistical data and new
sources of big data, and new derived data and statistical products can be developed
(Scannapieco et al. 2013). In a scoping exercise, the European Statistical System Committee
(2014: 8) has thus identified several official statistical domains that could be profitably
augmented by the use of different kinds of big data (see Table 3).
Table 3: Potential use of big data in official statistics

Data source | Data type
Mobile communication | Mobile phone data
WWW | Web searches; e-commerce websites; businesses’ websites; business registers; job advertisements; real-estate websites; social media
Sensors | Traffic loops; smart meters; satellite images (environment statistics); automatic vessel identification
Transactions or process-generated data | Flight movements; supermarket scanner and sales data
Crowdsourcing | Volunteered geographic information (VGI) websites (OpenStreetMap, Wikimapia, Geowiki); community picture collections (flickr, Instagram, Panoramio)

Source: European Statistical System Committee (2014: 18)
Challenges
Whilst big data offers a number of opportunities, its use is not without significant
challenges. A first issue is gaining access to the required big data for
assessment, experimenting, trialing and adoption (Global Pulse 2012; Eurostat 2014; Tam
assessment, experimenting, trialing and adoption (Global Pulse 2012; Eurostat 2014; Tam
and Clarke 2014). Although some big data are produced by public agencies, such as weather
data, some website and administrative systems, and some transport data, much big data are
presently generated by private companies such as mobile phone, social media, utility,
financial and retail companies (Kitchin 2014). These big data are valuable commodities to
these companies, either providing a resource that generates competitive advantage or
constituting a key product, and are generally not publicly available for official or public
analysis in raw or derived forms. For NSIs to gain access to such data requires forming
binding strategic partnerships with these companies (so-called ‘data compacts’; Krätke and
Byiers 2014) or creating/altering legal instruments (such as Statistics Acts) to compel
companies to provide such data. Such negotiations and legislative reforms are time consuming
and politically charged, especially when NSIs generally do not pay or compensate companies
for providing data for official statistics.
Once data has been sourced, it needs to be assessed for its suitability for
complementing, replacing or adding to official statistics. This assessment concerns
suitability for purpose, technological and methodological feasibility, and the change
management required for implementation. From the perspective of both NSIs and the public,
official statistics are generated: (a) with the purpose to serve the whole spectrum of the
society; (b) based on quality criteria and best practices; (c) by statisticians with assured
professional independence and objectivity (Eurostat 2014). However, unlike the surveys
administered by NSIs, in most cases the big data listed in Table 3 are generated by
commercial entities for their specific needs and were never intended to be used for the
production of official statistics. The extent to which repurposed big data provide adequate,
rigorous and reliable surrogates for more targeted, sampled data therefore needs to be
established (Struijs et al. 2014). A key consideration in this respect is representativeness,
both of phenomena and populations (Global Pulse 2012; Daas et al., 2013; Tam and Clarke
2014). NSIs carefully set their sampling frameworks and parameters, whereas big data,
although exhaustive, are generally not representative of an entire population as they only
relate to whoever uses a service. For example, credit card data only relates to those that
possess a credit card and social media data only relates to those using that service, which in
both cases are stratified by social class and age (and in the latter case also includes many
anonymous and bot accounts). In cases such as the Consumer Price Index, the same bundle of
goods and services with statistically determined weights needs to be tracked over time, rather
than simply web-scraping an unknown bundle (Horrigan 2013). There is a challenge, then,
in using big data in the context of existing methodologies.
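One way of using big data within existing methodologies, as suggested above, is to slot web-scraped prices into the established fixed-bundle approach rather than averaging over whatever happens to be scraped. The sketch below (in Python) computes a simple Laspeyres-type index over a fixed, weighted bundle; the items, weights and prices are invented for illustration and do not reflect any actual CPI basket.

```python
# A minimal sketch of using web-scraped prices within an existing CPI-style
# methodology: a fixed bundle of items with predetermined expenditure weights
# is tracked over time (Laspeyres-type index), rather than averaging over
# whatever happens to be scraped. Items, weights and prices are invented.

# Statistically determined expenditure weights for the fixed bundle (sum to 1)
weights = {"bread": 0.20, "milk": 0.15, "petrol": 0.40, "broadband": 0.25}

# Base-period and current-period prices, e.g. scraped from retailer websites
base_prices    = {"bread": 1.50, "milk": 0.90, "petrol": 1.40, "broadband": 30.0}
current_prices = {"bread": 1.58, "milk": 0.92, "petrol": 1.33, "broadband": 30.0}

def fixed_bundle_index(weights, base, current):
    """Weighted mean of price relatives for a fixed bundle (base period = 100)."""
    return 100.0 * sum(w * current[item] / base[item] for item, w in weights.items())

print(f"price index: {fixed_bundle_index(weights, base_prices, current_prices):.1f}")
```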
Further, NSIs spend a great deal of effort in establishing the quality and parameters of
their datasets with respect to veracity (accuracy, fidelity), uncertainty, error, bias, reliability,
and calibration, and documenting the provenance and lineage of a dataset. The OECD (2011)
measures data quality across seven dimensions: relevance, accuracy, credibility, timeliness,
accessibility, interpretability, coherence. These qualities are largely unknown with respect to
various forms of big data (UNECE 2014b; Reimsbach-Kounatze 2015), though it is generally
acknowledged that the datasets can be full of dirty, gamed and faked data as well as data
being absent (Daas et al., 2013; Kitchin 2014). Further, their generators are reluctant to be
transparent about how the data were produced and processed. In addition, the
frames within which big data are generated can be mutable, changing over time. For
example, Twitter and Facebook are always tweaking their designs and modes of interaction,
and often present different users with alternate designs as they perform A/B testing on the
relative merits of different interface designs and services. The data created by such systems
are therefore inconsistent across users and/or time. These issues, created through the
differences in characteristics of big data from the survey and administrative data usually used
in official statistics (see Table 4), raise significant questions concerning the suitability of big
data for official statistics and how they might be assessed and compensated for (Tam and
Clarke 2014). For some, the initial foray should only be to explore the potential of using big
data to improve the quality of estimates within current methodological frameworks and to
assess the levels and causes of sampling and non-sampling errors across data sources that
threaten valid inference (Horrigan 2013).
Table 4: Characteristics of survey, administrative and big data

Characteristic | Survey data | Administrative data | Big data
Specification | Statistical products specified ex-ante | Statistical products specified ex-post | Statistical products specified ex-post
Purpose | Designed for statistical purposes | Designed for other purposes | Organic (not designed) or designed for other purposes
Byproducts | Lower potential for by-products | Higher potential for by-products | Higher potential for by-products
Methods | Classical statistical methods available | Classical statistical methods available, usually depending on the specific data | Classical statistical methods not always available
Structure | Structured | A certain level of data structure, depending on the objective of data collection | A certain level of data structure, depending on the source of information
Comparability | Weaker comparability between countries | Weaker comparability between countries | Potentially greater comparability between countries
Representativeness | Representativeness and coverage known by design | Representativeness and coverage often known | Representativeness and coverage difficult to assess
Bias | Not biased | Possibly biased | Unknown and possibly biased
Error | Typical types of errors (sampling and non-sampling errors) | Typical types of errors (non-sampling errors, e.g. missing data, reporting errors and outliers) | Both typical errors (e.g. missing data, reporting errors and outliers), although possibly less frequently occurring, and new types of errors
Persistence | Persistent | Possibly less persistent | Less persistent
Volume | Manageable volume | Manageable volume | Huge volume
Timeliness | Slower | Potentially faster | Potentially much faster
Cost | Expensive | Inexpensive | Potentially inexpensive
Burden | High burden | No incremental burden | No incremental burden

Adapted from Florescu et al. (2014: 2-3)
Once the suitability of the data is established, an assessment needs to be made as to
the technological feasibility regarding transferring, storing, cleaning, checking, and linking
big data, and conjoining the data with existing official statistical datasets
(Scannapieco et al. 2013; Struijs et al. 2014; Tam and Clarke 2014). As Cervera et al.
(2014) note, at present there is a lack of user-friendly tools for big data, making it difficult
to engage with such data, to integrate big data into present workflows, and to integrate big data
infrastructure with existing infrastructure. In particular, there is a real challenge in
developing techniques for dealing with streaming data, such as processing such data on the
fly (spotting anomalies, sampling/filtering for storage) (Scannapieco et al. 2013). Moreover,
there are questions concerning the methodological feasibility of augmenting and producing
official statistics using big data and performing analytics on a constant basis as data is
dynamically generated, in order to produce real-time statistics or visualisations.
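By way of illustration of processing streaming data on the fly, the sketch below (in Python, not drawn from any NSI system) combines two standard techniques: flagging anomalies against a running mean and variance (Welford's online algorithm) and retaining a fixed-size reservoir sample of the stream for later storage. The threshold and the toy stream are invented.

```python
# A minimal sketch (assumptions, not an NSI implementation) of two on-the-fly
# operations for streaming data: flagging anomalies with a running mean and
# variance (Welford's algorithm) and keeping a fixed-size reservoir sample
# of the stream for later storage and analysis.
import random

def process_stream(stream, reservoir_size=5, z_threshold=3.0):
    n, mean, m2 = 0, 0.0, 0.0      # running count, mean, sum of squared deviations
    reservoir, anomalies = [], []
    for value in stream:
        # Anomaly check against the statistics of the stream seen so far
        if n > 1:
            std = (m2 / (n - 1)) ** 0.5
            if std > 0 and abs(value - mean) / std > z_threshold:
                anomalies.append(value)
        # Welford's online update of mean and variance
        n += 1
        delta = value - mean
        mean += delta / n
        m2 += delta * (value - mean)
        # Reservoir sampling: each item kept with probability reservoir_size/n
        if len(reservoir) < reservoir_size:
            reservoir.append(value)
        else:
            j = random.randrange(n)
            if j < reservoir_size:
                reservoir[j] = value
    return reservoir, anomalies

sample, flagged = process_stream([10, 11, 9, 10, 12, 10, 95, 11, 10, 9])
print("reservoir sample:", sample)
print("flagged anomalies:", flagged)
```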
A key challenge in managing these developments is the implementation of a change
management process to fully prepare the organisation for taking on new roles and
responsibilities. New data life cycle systems need to be established and implemented,
accompanied by the building and maintenance of new IT infrastructure capable of handling,
processing and storing big data (Dunne 2013). These new systems need to ensure data
security and compliance with data protection. They also need to be adequately resourced,
creating demands for additional finance and skilled staff.
Beyond the work of an individual NSI, an additional potential challenge is
ensuring that the approaches taken across jurisdictions are aligned so that the new official
statistics produced by NSIs are comparable across space and time and can be conjoined to
produce larger supra-national datasets. The challenges here are institutional and political in
nature and require significant levels of dialogue and coordination across NSIs to establish
new standardized approaches to leveraging big data for official statistics.
Risks
Given the various challenges set out above, along with general public and institutional
perceptions and reactions to the use of big data, there are a number of risks associated with
using big data in producing official statistics. The key risks relate to mission drift, reputation
and trust, privacy and data security, access and continuity, fragmentation across jurisdictions,
resource constraints and cut-backs, and privatisation and competition.
The key mission for NSIs is to produce useful and meaningful official statistics.
Traditionally, the driver of what statistics have been produced has been a key concern or
question; data has been generated in order to answer a specific set of queries. In the era of
big data there is the potential for this to be reversed, with the abundance and cost benefit of
big data setting the agenda for what is measured. In other words, official statistics may drift
towards following the data, rather than the data being produced for the compilation of official
statistics. As well as having implications for the institutional work of NSIs, there is a clear
threat to the integrity and quality of official statistics in such a move. It is absolutely critical
therefore that NSIs remain focused on the issues and questions data are used to address,
assessing the suitability of big data to their core business, rather than letting big data drive
their mission.
A critical risk for NSIs in implementing a new set of means and methods for
producing official statistics is their reputation and public trust being undermined. A
reputation as a fair, impartial, objective, neutral provider of high quality official statistics is
seen by NSIs as a mission-critical quality, and is usually their number one priority in their
institutional risk register. Partnering with a commercial third party and using their data to
compile official statistics exposes the reputation of a NSI to that of the partner. A scandal
with respect to data security and privacy breaches, for example, may well reflect onto the NSI
(Dunne 2013). Further, failing to adequately address data quality issues will undermine
confidence in the validity and reliability of official statistics, which will be difficult to re-
establish. Similarly, given big data is being repurposed, often without the explicit consent of
those the data represent, there is the potential for a public backlash and resistance to such re-
use. It also has to be recognized, however, that it is a lack of trust in government, both in the
developed (Casselman 2015) and particularly the developing world (Letouzé and Jütting
2014), with respect to competence and motive, that is driving some calls for the work of NSIs
to be complemented or replaced by opening government data to enable replication and new
analysis and by the use of big data.
Related to reputation, but a significant risk in its own right, is the infringement of
privacy and the breaching of data security. NSIs take privacy and security very seriously, acting
as trusted repositories that employ sophisticated systems for managing data, using strategies
such as anonymisation and aggregation, access rules and techniques, and IT security
measures, to ensure confidentiality and security. These systems are designed to work with
carefully curated ‘small’ datasets. Big data increases the challenge of securing data by
providing new forms of voluminous, relational data, new types of systems and databases, and
new flows of data between institutions. There is therefore a need to establish fresh
approaches that ensure the security and integrity of the big data held by NSIs (Cervera et al.
2014; Landefeld 2014; Struijs et al. 2014; AAPOR 2015). To this end, UNECE (2014b: 3)
suggest that in addition to the dimensions used to assess administrative data, five new
dimensions should be added: ‘privacy and confidentiality (a thorough assessment of whether
the data meets privacy requirements of the NSO), complexity (the degree to which the data is
hierarchical, nested, and comprises multiple standards), completeness (of metadata) and
linkability (the ease with which the data can be linked with other data)’. The first two are
important to prevent privacy being breached or data being stolen and used for nefarious ends.
As the WikiLeaks and Snowden scandals and other data breaches have demonstrated, public
trust in state agencies and their handling and use of personal data has already been
undermined. Likewise a series of high profile breaches of private company data holdings,
such as the stealing of credit card or personal information, has reduced public confidence in
data security more widely. A similar scandal with respect to a NSI could be highly
damaging, and potentially contagious to other NSIs.
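To illustrate the kind of aggregation-based disclosure control mentioned above, the following minimal sketch (in Python, with invented records and an invented threshold) aggregates unit-level records into area counts and suppresses any cell below a minimum count so that small groups cannot be singled out. Real statistical disclosure control is considerably more sophisticated; this is only a schematic example.

```python
# A minimal sketch of one disclosure-control strategy of the kind mentioned
# above: aggregating unit-level records into area counts and suppressing
# cells below a minimum threshold so that small groups cannot be identified.
# The threshold and the records are invented for illustration.
from collections import Counter

MIN_CELL_COUNT = 5  # cells with fewer units than this are suppressed

def aggregate_with_suppression(records, key):
    """Aggregate unit records into counts per `key`, suppressing small cells."""
    counts = Counter(r[key] for r in records)
    return {k: (c if c >= MIN_CELL_COUNT else None)  # None = suppressed cell
            for k, c in counts.items()}

records = (
    [{"area": "A01"}] * 12 + [{"area": "A02"}] * 7 + [{"area": "A03"}] * 2
)
print(aggregate_with_suppression(records, "area"))
# {'A01': 12, 'A02': 7, 'A03': None} -- the two-person cell is withheld
```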
At present, NSIs gain their data through dedicated surveys within their control and
administrative databases which they access through legislative mandate. They have little
control or mandate with respect to big data held by private entities, however. In partnering
with third parties NSIs lose overall control of generation, sampling, and data processing and
have limited ability to shape the data produced (Landefeld 2014), especially in cases where
the data are the exhaust of a system and are being significantly repurposed. This raises a
question concerning assurance and managing quality. An associated key risk is that access
to the desired data on a voluntary or licensed basis is denied by companies who do not want
to lose competitive advantage, share a valuable asset without financial compensation, or have
the responsibility or burden of supplying such data, or that initially negotiated access is then
discontinued (Landefeld 2014). The latter poses a significant risk to data continuity and
time-series datasets if existing systems have been replaced by the new big data solution. It
may be possible to mandate companies to provide access to appropriate data using legal
instruments, but it is likely that such a mandate will be strongly resisted and legally
challenged by some companies across jurisdictions. In cases where companies are compelled
unwillingly to share data there has to be a process by which to validate and assure the quality
of the data prepared for sharing.
A key issue for the compilation of supra-national statistics and benchmarking is
finding comparable datasets. NSIs have traditionally been responsible for developing their
statistical systems. While there has long been a swapping of knowledge and best practice,
each NSI has produced official statistics defined by its statisticians, framed by public
administrative needs and context. The result has been a patchwork quilt of different
definitions, methods, protocols and standards for producing official statistics, so that while
the data generated is similar, they are not the same. For example, how unemployment is
defined and measured often varies across jurisdictions. There is a distinct risk of perpetuating
this situation with respect to official statistics derived from big data, creating fragmented
and non-comparable datasets.
Big data offers the potential to create efficiencies in the production of official
statistics. There is a risk, however, of governments viewing the use of big data as a means of
reducing staffing levels and cutting costs. This is particularly the case in a time of austerity
and a strong neoliberal ethos dominating the political landscape of many jurisdictions. While
there are some very real possibilities of rationalisation, especially with respect to casual and
part-time staffing of censuses and surveys, the core statistical and technical staffing of NSIs
needs to be maintained, and may need to be expanded in the short term given the potential to
create new suites of statistics that need testing, validation, and continuous quality control
checks. Indeed, there will be a need to develop new technical and methodological skills
within NSIs, including creating expertise in new data analytics, either through retraining or
recruitment (Cervera et al. 2014; AAPOR 2015). Without such investment, NSIs will
struggle to fully exploit the potential benefits of utilising big data for official statistics. Any
reductions in staffing and resources, especially before big data has been fully integrated into
the workflow of NSIs, are likely to place serious strain on the organisation and threaten the
integrity of the products produced.
A final risk is competition and privatisation. If NSIs choose to ignore or dismiss big
data for compiling useful statistical data then it is highly likely that private data companies
will fill the gap, generating the data either for free distribution (e.g. Google Trends) or for
sale. They will do so in a timeframe far quicker (near real-time) than NSIs are presently
working, perhaps sacrificing some degree of veracity for timeliness, creating the potential for
lower quality but more timely data to displace high quality, slower data (Eurostat 2014). The
result may be a proliferation of alternative official statistics produced by a variety of vendors,
each challenging the veracity and trustworthiness of those generated by NSIs (Letouzé and
Jütting 2014). Data brokers are already taking official statistical data and using them to create
new derived data, combining them with private data, and providing value-added services
such as data analysis. They are also producing alternative datasets, registers and services,
combining multiple commercial and public datasets to produce their own private databanks
from which they can produce a multitude of statistics and new statistical products. For
example, Acxiom is reputed to have constructed a databank concerning 500 million active
consumers worldwide (about 190 million individuals and 126 million households in the
United States), with about 1,500 data points per person, its servers processing over 50 trillion
data transactions a year (Singer 2012). It claims to be able to provide a ‘360 degree view’ of
consumers by meshing offline, online and mobile data, using these data to create detailed
profiles and predictive models (Singer 2012). Such organisations are also actively
campaigning to open up the administrative datasets used by NSIs to produce official
statistics, arguing that they and others could do much more with them, and in a much more
efficient and effective way (Casselman 2015).
For NSIs that partially operate as trading funds, that is they generate additional
income to support their activities from the sale of specialist derived data and services,
opening data and the operations of data brokers will increasingly threaten revenue streams.
As with other aspects of the public sector it may also be the case that governments will look
to privatise certain competencies or datasets of NSIs. This has happened in some
jurisdictions with respect to other public data agencies such as mapping institutions, notably
the UK where Ordnance Survey is increasingly reliant on the sale and licensing of geospatial
data and the postcode dataset has been recently privatised with the sale of Royal Mail. Such
a neoliberal move has the potential to undermine trust in official statistics and threatens the
making and maintaining of open datasets.
The way forward
The increasingly widespread generation of big data across domains has created a set of
disruptive innovations from which NSIs are not exempt given their role as key data providers
and authorities for official statistics. Indeed, Letouzé and Jütting (2014) argue that “engaging
with Big Data is not a technical consideration but a political obligation. It is an imperative to
retain, or regain, their primary role as the legitimate custodian of knowledge and creator of a
deliberative public space.” At the same time, as Cervera et al. (2014: 37) argue, “Big Data
should reduce, not increase statistical burden ... Big Data should increase, not reduce
statistical quality.” As elaborated above, big data presents a number of opportunities,
challenges and potential risks to NSIs and it is clear that they need to formulate a strategic
and operational response to their production. This work has already begun with the UNECE,
Eurostat and UNSD taking leading roles, with the latter constituting in 2014 a Global
Working Group on Big Data for Official Statistics (comprising representatives from 28
developed and developing countries) (UNESC 2015). The Scheveningen Memorandum
(2013) commits European NSIs to setting out a roadmap that will be integrated into the
annual statistical work programmes of Eurostat. Importantly, the approach adopted has been
one of collaboration between NSIs, trying to develop a common strategic and operational
position. A very welcome development in this regard has been the creation of a big data
‘sandbox’ environment, hosted in Ireland by the Central Statistics Office (CSO) and the Irish
Centre for High-End Computing (ICHEC), that provides a technical platform to:
“(a) test the feasibility of remote access and processing – Statistical organisations
around the world will be able to access and analyse Big Data sets held on a central
server. Could this approach be used in practice? What are the issues to be resolved?;
(b) test whether existing statistical standards / models / methods etc. can be applied to
Big Data;
(c) determine which Big Data software tools are most useful for statistical
organisations;
(d) learn more about the potential uses, advantages and disadvantages of Big Data sets
“learning by doing”;
(e) build an international collaboration community to share ideas and experiences on
the technical aspects of using Big Data.” (UNECE 2014a)
In 2014, approximately 40 statisticians/data scientists from 25 different organisations were
working with the sandbox (Dunne 2014). Over time, the sandbox could potentially develop
into a Centre of Excellence and not-for-profit pan-NSI big data service provider, delivering
comparable statistical information across jurisdictions (Dunne 2014).
However, it is evident that NSIs and associated agencies are only at the start of the
process of engaging with, testing and assessing, and thinking through the implications of big
data for the production of statistics and the organisation and work of NSIs. Consequently,
while there has been some notable progress since 2013, as set out above, there are still a
number of open issues that require much thinking, debate, negotiation, and resolution. And,
as made clear in the sessions and discussion at the New Techniques and Technologies for
Statistics conference in Brussels in March 2015, there is a wide divergence of opinions across
official statisticians as to the relative merits of big data, its potential opportunities and risks,
and how best for NSIs to proceed.
It is clear that the initial approach adopted needs to continue apace, with the
international community of NSIs working through the challenges and risks presented in this
paper to find common positions on:
- a conceptual and operational (management, technology, methodology) approach and dealing with risks;
- other roles NSIs might adopt in the big data landscape, such as becoming the arbiters or certifiers of big data quality within any emerging regulatory environment, especially for those used in official statistics, or becoming clearing houses for statistics from non-traditional sources that meet their quality standards (Cervera et al. 2014; Landefeld 2014; Struijs et al. 2014; Reimsbach-Kounatze 2015);
- resolving issues of access, licensing, and standards;
- undertaking experimentation and trialing;
- establishing best practices for change management from the short to the long term, which will ensure stable institutional transitions, the maintenance of high standards of quality, and continuity of statistics over time and across jurisdictions; and
- political lobbying with respect to resourcing.
To this end, alliances could be profitably forged with other international bodies that are
wrestling with the same kinds of issues such as the Research Data Alliance (RDA) and the
World Data System (WDS) to share knowledge and approaches.
At present, NSIs are in reactive mode and are trying to catch up with the
opportunities, challenges and risks of big data. It is important that they not only catch up but
get ahead of the curve, proactively setting the agenda and shaping the new landscape for
producing official statistics. There is, however, much work to be done before such a situation
is achieved.
Acknowledgements
The author is grateful to Padraig Dalton, Richie McMahon and John Dunne from the Central
Statistics Office, Ireland, for useful discussion concerning big data and NSIs, and also
audience feedback from the New Techniques and Technologies for Statistics conference held
in Brussels, March 10-13, 2015, where this paper was presented as a keynote talk. The
research for this paper was funded by a European Research Council Advanced Investigator
Award (ERC-2012-AdG-323636).
References
AAPOR (2015) AAPOR Report on Big Data. American Association for Public Opinion
Research, Deerfield, IL. http://www.aapor.org/AAPORKentico/AAPOR_Main/media/Task-
Force-Reports/BigDataTaskForceReport_FINAL_2_12_15.pdf (last accessed 1 April 2015)
Albert, J.R.G. (2013) Big Data: Big Threat or Big Opportunity for Official Statistics?
18th October. Philippines Statistical Authority, National Statistical Coordination Board.
http://www.nscb.gov.ph/statfocus/2013/SF_102013_OSG_bigData.asp (last accessed 16th
March 2015)
Casselman, B. (2015) Big government is getting in the way of big data. Five Thirty Eight
Economics. March 9th. http://fivethirtyeight.com/features/big-government-is-getting-in-the-
way-of-big-data/ (last accessed 16th March 2015)
Cervera, J.L., Votta, P., Fazio, D., Scannapieco, M., Brennenraedts, R. and Van Der Vorst, T.
(2014) Big data in official statistics. ESS Big Data Event in Rome 2014, Technical Workshop
Report. http://www.cros-portal.eu/sites/default/files/Big%20Data%20Event%202014%20-
%20Technical%20Final%20Report%20-finalV01_0.pdf (last accessed 20 March 2015).
Choi, H. and Varian, H. (2011) Predicting the present with Google Trends. Google Research.
http://people.ischool.berkeley.edu/~hal/Papers/2011/ptp.pdf (last accessed 1 April 2015)
Daas, P.J.H., Puts M.J., Buelens, B. and van den Hurk, P.A.M. (2013) Big Data and Official
Statistics. http://www.cros-portal.eu/sites/default/files/NTTS2013fullPaper_76.pdf (last
accessed 1 April 2015)
Dunne, J. (2013) Big data coming soon ...... to an NSI near you. Paper presented at the
World Statistics Congress, 25th-30th, Hong Kong. http://www.statistics.gov.hk/wsc/STS018-
P3-A.pdf (last accessed 1st April 2015)
Dunne, J. (2014) Big data …. now playing at “the sandbox”. Paper presented at The
International Association for Official Statistics 2014 conference, 8-10 October, Da Nang,
Vietnam.
European Statistical System Committee (2014) ESS Big Data Action Plan and Roadmap 1.0.
22nd Meeting of the European Statistical System Committee, Riga, Latvia, 26 September
2014.
Eurostat (2014) Big data – an opportunity or a threat to official statistics? Paper presented
at the Conference of European Statisticians, 62nd plenary session, Paris, 9-11 April 2014.
http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/2014/32-Eurostat-
Big_Data.pdf (last accessed 20 March 2015)
Florescu, D., Karlberg, M., Reis, F., Del Castillo, P.R., Skaliotis, M. and Wirthmann, A.
(2014) Will ‘big data’ transform official statistics?
http://www.q2014.at/fileadmin/user_upload/ESTAT-Q2014-BigDataOS-v1a.pdf (last
accessed 1 April 2015)
Global Pulse (2012) Big data for development: Challenges and opportunities. May 2012.
Global Pulse, New York.
http://www.unglobalpulse.org/sites/default/files/BigDataforDevelopment-
UNGlobalPulseJune2012.pdf (last accessed 1st April 2015)
Hand, D.J. (2015) Official Statistics in the New Data Ecosystem. Paper presented at the New
Techniques and Technologies in Statistics conference, Brussels, March 10-12.
http://www.cros-portal.eu/sites/default/files//Presentation%20S20AP2%20%20Hand%20-
%20Slides%20NTTS%202015.pdf (last accessed 7 April 2015)
Horrigan, M.W. (2013) Big Data: A Perspective from the BLS. 1st January. Amstat News.
http://magazine.amstat.org/blog/2013/01/01/sci-policy-jan2013/ (last accessed 16th March
2015)
Kitchin, R. (2014) The Data Revolution: Big Data, Open Data, Data Infrastructures and Their
Consequences. Sage, London.

Kitchin, R. (2014) The real-time city? Big data and smart urbanism. GeoJournal 79(1): 1-14.
Krätke, F. and Byiers, B. (2014) The Political Economy of Official Statistics: Implications for
the Data Revolution in Sub-Saharan Africa. PARIS21, Partnership in Statistics for
Development in the 21st Century Discussion Paper No. 5, December.
http://ecdpm.org/publications/political-economy-official-statistics-implications-data-
revolution-sub-saharan-africa/ (last accessed 1 April 2015)
Landefeld, S. (2014) Uses of Big Data for Official Statistics: Privacy, Incentives, Statistical
Challenges, and Other Issues. Discussion Paper at International Conference on Big Data for
Official Statistics, Beijing, China, 28-30 Oct 2014 (last accessed 1 April 2015)
Letouzé, E. and Jütting, J. (2014) Official Statistics, Big Data and Human Development:
Towards a New Conceptual and Operational Approach. Data-Pop Alliance White Paper
Series. PARIS21. 14th December.
http://www.datapopalliance.org/s/WhitePaperBigDataOffStatsNov17Draft.pdf (last accessed
16th March 2015)
OECD (2011) Quality Framework and Guidelines for OECD Statistical Activities, updated
17th January 2012.
http://www.oecd.org/officialdocuments/publicdisplaydocumentpdf/?cote=std/qfs%282011%2
91&doclanguage=en.
Reimsbach-Kounatze, C. (2015) “The Proliferation of “Big Data” and Implications for
Official Statistics and Statistical Agencies: A Preliminary Analysis”, OECD Digital Economy
Papers, No. 245, OECD Publishing. http://dx.doi.org/10.1787/5js7t9wqzvg8-en (last accessed
16th March 2015)
Scannapieco, M., Virgillito, A. and Zardetto, D. (2013) Placing Big Data in Official
Statistics: A Big Challenge? Paper presented at New Techniques and Technologies in
Statistics. http://www.cros-portal.eu/sites/default/files//NTTS2013fullPaper_214.pdf (last
accessed 1 April 2015)
Scheveningen Memorandum (2013) Big Data and Official Statistics.
http://ec.europa.eu/eurostat/documents/42577/43315/Scheveningen-memorandum-27-09-13
(last accessed 1 April 2015)
Singer, N. (2012) You for Sale: Mapping, and Sharing, the Consumer Genome. New York
Times, 17th June, http://www.nytimes.com/2012/06/17/technology/acxiom-the-quiet-giant-of-
consumer-database-marketing.html (last accessed 11th October 2013)
Struijs, P., Braaksma, B. and Daas, P.J.H. (2014) Official statistics and Big Data. Big Data &
Society 1(1): 1–6 DOI: 10.1177/2053951714538417 (last accessed 20 March 2015)
Tam, S-M. and Clarke, F. (2014) Big Data, Official Statistics and Some Initiatives by the
Australian Bureau of Statistics. Paper presented at International Conference on Big Data for
Official Statistics, Beijing, China, 28-30 Oct 2014
http://unstats.un.org/unsd/trade/events/2014/Beijing/documents/other/Australia%20Bureau%
20of%20Statistics%20-%20Some%20initiatives%20on%20Big%20Data%20-
%2023%20July%202014.pdf (last accessed 1 April 2015)
UNECE (2014a) Sandbox. 24th April.
http://www1.unece.org/stat/platform/display/bigdata/Sandbox (last accessed 16th March
2015)
UNECE (2014b) A Suggested Framework for National Statistical Offices for assessing the
Quality of Big Data.
http://www1.unece.org/stat/platform/download/attachments/108102944/Big Data Quality
Framework Abstract for NTTS.docx (last accessed 16th March 2015)
UNESC (2015) Report of the Global Working Group on Big data for official statistics.
http://unstats.un.org/unsd/statcom/doc15/2015-4-BigData.pdf (last accessed 16th March
2015)
... Big data is a reality and has the potential to be used as a source of data in the implementation of statistics in Indonesia. Kitchin, (2015)identified that Big Data can be used to: replace all existing data sources, replace some existing data sources, generate complementary data with different perspectives to complement existing data, increase estimates from other data sources, generate new data (Kitchin, 2015). BPS asa provider of basic statistics can obtain data in other ways in accordance with the development of science and technology. ...
... Big data is a reality and has the potential to be used as a source of data in the implementation of statistics in Indonesia. Kitchin, (2015)identified that Big Data can be used to: replace all existing data sources, replace some existing data sources, generate complementary data with different perspectives to complement existing data, increase estimates from other data sources, generate new data (Kitchin, 2015). BPS asa provider of basic statistics can obtain data in other ways in accordance with the development of science and technology. ...
... The rapid development of information technology makes big data a provider of dynamic new data sources that can complement, replace, expand and complement and improve the composition of existing statistics, as well as produce more timely output data (Kitchin, 2015). Florescu et al. are cited by Kitcin in their journal, which clarifies that big data sources can be used in five ways in current statistical systems: a. ...
Article
Full-text available
Big Data is the new oil. The real issue of the development of information technology is the existence of Big Data. Big Data is very important, talking about Big Data not only touches the issue of how big data you have, but what can be done with the data. The high cost of Big Data access is inversely proportional to the collection of data through surveys or censuses for free. analysis of the application of efficiency theory in an economic approach in imposing access costs to statistical activities, knowing the impact of imposing Big Data access fee tariffs, knowing the potential of Big Data as a supporter of official statistics in the future, what are the obstacles and solutions in implementing it. The preparation of this paper is juridical-empirical research with the nature of descriptive research with an economic law approach. Big Data is managed for different purposes using different systems and methods and not necessarily using statistical rules. The implementation of Big Data as a new data source is carried out through a combination of data sources. The imposition of Big Data access fees as a source of data supporting official statistics causes low utilization of Big Data as a new data source. The information technology revolution makes Big Data has the potential to complement, replace, improve, add, and improve the composition of existing statistical data sources, and produce more timely outputs. Difficult access and high costs for data collection are major obstacles. Therefore, it needs to be supported by legal instruments that facilitate its implementation in Indonesia. One is the reformulation of existing regulations that make it easier for basic statistical organizations to obtain such data sources for free and use only in the national interest
... National Statistical Institutions (NSIs) are trusted to produce high quality statistics to generate independent insight for all. Traditionally NSIs have relied on surveys to produce official statistics but more recently the trend is to leverage administrative data in Government databases to meet the ever increasing demand for statistical information on a broader range of subjects, with increased timeliness and at greater granularity [1]. Today, in a world where global data creation is projected to surpass 175 zettabytes by 2025 [2], novel data sources are providing new opportunities for NSIs to meet this rising demand with constrained budgets [1,3,4]. ...
... Traditionally NSIs have relied on surveys to produce official statistics but more recently the trend is to leverage administrative data in Government databases to meet the ever increasing demand for statistical information on a broader range of subjects, with increased timeliness and at greater granularity [1]. Today, in a world where global data creation is projected to surpass 175 zettabytes by 2025 [2], novel data sources are providing new opportunities for NSIs to meet this rising demand with constrained budgets [1,3,4]. ...
Article
Full-text available
Today, there is a greater demand to produce more timely official statistics at a more granular level. National Statistical Institutes (NSIs) are more and more looking to novel data sources to meet this demand. This paper focuses on the use of one such source to compile more timely and detailed official statistics on port visits. The data source used is sourced from the Automatic Identification System (AIS) used by ships to transmit their position at sea. The primary purpose of AIS is maritime safety. While some experimental statistics have been compiled using this data, this paper evaluates the potential of AIS as a data source to compile official statistics with respect to port visits. The paper presents a novel method called “Stationary Marine Broadcast Method” (SMBM) to estimate the number of port visits using AIS data. The paper also describes how the H3 Index, a spatial index originally developed by Uber, is added to each transmission in the data source. While the paper concludes that the AIS based estimates won’t immediately replace the official statistics, it does recommend a pathway to using AIS-based estimates as the basis for official port statistics in the future.
... In early papers such as Florescu et al. [1], Landefeld [2], Tam & Clarke [3], Kitchin [4], the focus is generally on the data sources themselves, and concerns such as access, availability, commercial sensitivity, and fitness for purpose, whilst downplaying the difficulty of adapting existing statistical methodology, despite the cautions expressed by Florescu et al. [1] in their table of features of big data, that statistical methods 'may not always be applicable' to big data, and by Struijs et al. [5] that 'it is difficult to apply traditional statistical methods, based on sampling theory'. Instead of addressing this issue, Kitchin [4] focuses on the more technical (though still important) questions of transferring, storing, cleaning, checking and linking big data, conjoining big data with official statistical datasets, and presents this work as if belonging to a separate data gathering and integration phase undertaken prior to any statistical analysis. ...
... In early papers such as Florescu et al. [1], Landefeld [2], Tam & Clarke [3], Kitchin [4], the focus is generally on the data sources themselves, and concerns such as access, availability, commercial sensitivity, and fitness for purpose, whilst downplaying the difficulty of adapting existing statistical methodology, despite the cautions expressed by Florescu et al. [1] in their table of features of big data, that statistical methods 'may not always be applicable' to big data, and by Struijs et al. [5] that 'it is difficult to apply traditional statistical methods, based on sampling theory'. Instead of addressing this issue, Kitchin [4] focuses on the more technical (though still important) questions of transferring, storing, cleaning, checking and linking big data, conjoining big data with official statistical datasets, and presents this work as if belonging to a separate data gathering and integration phase undertaken prior to any statistical analysis. ...
Article
Recent years have seen increased interest in the use of alternative data sources in the definition and production of official statistics and indicators for the UN Sustainable Development Goals. In this paper, we consider the application of data science to the production of official statistics, illustrating our perspective through the use of poverty targeting as an application. We show that machine learning can play a central role in the generation of official statistics, combining a variety of types of data (survey, administrative and alternative). We focus on the problem of poverty targeting using the Proxy Means Test in Indonesia, comparing a number of existing statistical and machine learning methods, then introducing new approaches in the spirit of small area estimation that utilize area-level features and data augmentation at the subdistrict level to develop more refined models at the district level, evaluating the methods on three districts in Indonesia on the problem of estimating 2020 per capita household expenditure using data from 2016–2019. The best performing method, XGBoost, is able to reduce inclusion/exclusion errors on the problem of identifying the poorest 40% of the population in comparison to the commonly used Ridge Regression method by between 4.5% and 13.9% in the districts studied.
... For example, in summer 2023, Mayor Rebecca Alty of Yellowknife in the Northwest Territories shared that her city possesses an abundance of data but would need to "turn it into a story" for policy-making. While municipalities have access to large sustainability datasets, city offices typically lack the specialized skill sets and advanced technology necessary to effectively utilize these data [1,11]. ...
Arctic city mayors influence municipal sustainability outcomes, navigating decisions on waste management, social service funding, and economic development. How do mayors make these decisions and to what extent do they integrate sustainability indicator data? Interviews with the mayors of Fairbanks, Alaska, Yellowknife, Canada, and Luleå, Sweden, revealed that indicators are used on a case-by-case basis to track trends but lack systematic integration into decision-making. Constituent concerns drive agendas rather than indicator trends. Based on International Organization for Standardization (ISO) guidelines, 128 indicators grouped into 19 sustainability themes were compiled from 2000 to 2019 for the study cities. Partial Least Squares Structural Equation Modeling (PLS-SEM) was applied to examine the utility of ISO indicators as a guiding factor for sustainability trend tracking, identifying key themes for each city. Results show that indicator trends are too inconsistent and interconnected to be useful as an independent form of guidance for mayors. For Arctic municipalities, sustainability indicator datasets are useful in specific circumstances, but they do not provide the same kind of decision-making heuristic that mayors receive from direct constituent interaction. Findings emphasize the importance of more robust data collection and the development of management frameworks that support sustainability decision-making in Arctic cities.
... Overall, the scholarly literature on the use of big data for official statistics is European-oriented and tends to focus on technical and instrumental aspects, such as applications and constraints of new sources and methodological issues (Allin, 2021; Kitchin, 2015; MacFeely, 2019; Struijs et al., 2014). The political economy of big data for official statistics remains little explored by research. ...
With the rise of the data-driven economy, the state-owned sector of official statistics has been pressurised to ‘modernise’ and engage with big data. However, most of these data sources are controlled by tech corporations. We address the conflicts this brings about from a Latin American perspective through three case studies involving national statistical offices, international organisations, and the private sector. We investigate strategies for enabling data markets for official statistics and analyse how the statistical field has acted in this context. In doing so, we contribute to understanding the political economy of big data in Latin America and debates on how digitalisation encompasses the reshaping of state-business relations. Supported by secondary data and semi-structured interviews, we appraise Bourdieu's theory of fields, Marxian readings on the enclosure of commons, and Polanyi's double movement to analyse how data commodification challenges data as public goods – a fundamental principle for official statistics. The findings demonstrate that data enclosures have prevented the state's access to compiling official statistics and show that the initiatives for introducing big data in Latin American national statistical offices have involved testing data markets through public–private partnerships supported by international organisations and big tech. As a result, the statistical field has reacted to the data market with a ‘double movement’: mobilising symbolic capital such as ‘trust’ for partnering with businesses to access data and technologies and, conversely, defending the public value of data in counter-movements protective of the relative autonomy of the national statistical offices and the state's control over informational capital.
... The United Nations, through the Statistics Division (UNSD) and the Committee of Experts on Global Geospatial Information Management (UN-GGIM), emphasizes the importance of this vision in documents such as the United Nations Integrated Geospatial Information Framework (UN-IGIF) (UN-GGIM, 2022). Countries such as Australia are at the forefront of this integration (Kitchin, 2015). The general trend in the use of Big Data, drawing on diverse sources such as social networks, sensors and device monitoring, is towards improving official databases (Tam and Van Halderen, 2020). ...
The potential of intrinsic parameters to estimate geospatial data quality on Voluntary Geographic Information (VGI) platforms is a recurrent theme in Cartography. The spatial-temporal distribution of contributions on these platforms is very heterogeneous, depending on several factors such as input availability and the number and motivation of volunteers, especially in developing countries. The most recent approaches aim to characterise temporal patterns as an additional measure of quality in VGI. This research proposes a methodology to identify and analyze the behavior of contribution parameters over time (2007-2022) on the OSM platform and to differentiate the influences that affect its growth. The study area was part of the metropolitan region of Curitiba, subdivided into 1 x 1 km cells. The cumulative growth of contributions was calculated and then fitted using a logistic regression. The fitted parameters made it possible to identify abruptly growing cells caused by external data imports, mass contributions, or collective mapping activities. In addition, heterogeneity in the growth of the data available in OSM over time was evident. Furthermore, the proposed methodology enabled the investigation of a new indicator of intrinsic quality based on modelling the spatiotemporal evolution of OSM feature insertions.
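The growth-curve fitting step can be sketched as below, here interpreted as fitting a logistic (S-shaped) growth function to the cumulative number of contributions in a single grid cell. The data are synthetic, and the parameterisation (saturation level, growth rate, inflection point) is the standard logistic form assumed for illustration rather than taken from the study.

```python
# Minimal sketch: fit a logistic growth curve to cumulative OSM contributions
# in one 1 x 1 km cell (synthetic data), using SciPy's curve_fit.
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, L, k, t0):
    """Cumulative contributions at time t: L is the saturation level,
    k the growth rate, t0 the inflection point (time of fastest growth)."""
    return L / (1.0 + np.exp(-k * (t - t0)))

years = np.arange(2007, 2023)
t = (years - years[0]).astype(float)
rng = np.random.default_rng(1)
observed = logistic(t, 500.0, 0.6, 8.0) + rng.normal(scale=10.0, size=t.size)

# Fit the curve; abrupt jumps (e.g. bulk data imports) show up as poor fits
# or extreme parameter values when compared across cells.
(L_hat, k_hat, t0_hat), _ = curve_fit(
    logistic, t, observed, p0=[observed.max(), 0.5, t.mean()])
print(f"saturation={L_hat:.0f} contributions, growth rate={k_hat:.2f}, "
      f"inflection around {years[0] + t0_hat:.1f}")
```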
... Since the 1850s, credit bureaus have enabled actors in the banking and financial sector to reduce risk by sharing information on creditors (Lauer, 2017). Official statistics also partly rely on the provision of information by private sector actors (Kitchin, 2015). However, with the digitalization of business and the unprecedented increase in the volume of data produced, there has been a renewed interest in data sharing. ...
Data sharing is a requisite for developing data-driven innovation and collaboration at the local scale. This paper aims to identify key lessons and recommendations for building trustworthy data governance at the local scale, encompassing both the public and private sectors. Our research is based on the experience gained in Rennes Metropole since 2010 and focuses on two thematic use cases: culture and energy. For each one, we analyzed how the power relations between actors and the local public authority shape the modalities of data sharing and exploitation. The paper elaborates on challenges and opportunities at the local level, set in the perspective of national and European frameworks.
... The collected data should also be evaluated in terms of their relevance to SDG performance. NSOs thus emphasize the quality of the data in terms of data sources, data storage, data documentation, data standardization, data reliability, data bias and error rates, uncertainty, accuracy, and fidelity (Kitchin, 2015). The use of big data in SDG evaluation should therefore be critically examined given the different conceptual natures of big data. ...
... Given that explanatory power is the limiting factor in this study, it is likely that no other machine learning model will be capable of solving this problem if there is no relationship to map. This confirms the well-known risk of extrapolating a model to domains that are not well represented in the data (Kitchin, 2015). In addition, the non-probability nature of the data might also bias the quality metrics, because the observations are not independent and identically distributed. ...
... for analyzing human behavior. Sensor data is typically non-probability-based because not every element in the target population has a positive and known probability of being recorded. Accordingly, using such data as a primary data source for population inference is currently an active field of research. In this paper, an algorithmic population inference framework using network analysis and non-probability data is developed. This approach is demonstrated using road sensor data as the primary source to infer the Dutch freight traffic across the state road network. We interpret the Dutch state road network as a graph, with traffic junctions as vertices and state roads as edges. Road sensors are installed on a non-probability sample of edges, detecting passing transport vehicles. Photographs of the license plates allow for the rare opportunity of linking sensor data with population registers. Extreme gradient boosting is applied to learn the probability of vehicle detection by a sensor from features about time, edge, vehicle and vehicle owner. Population inference is made using the learned relationship to predict the probability of detection on each day of the year, along each edge in the network, for each vehicle in the population. Different data scenarios were designed to simulate the effects of the non-probability nature of the data and the extreme class imbalance. Furthermore, several performance metrics were applied. With about 27 million records and over 100 features, trained and tested on an imbalanced non-probability sample, substantial variation in model performance across test sets was found. Promising results were achieved using a balanced probability sample as a control: the model performed about halfway between random guessing and perfect prediction. These results are of high practical importance because combining non-probability data with administrative data is currently considered one of the most promising approaches across several disciplines.
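As an illustration of how learned detection probabilities might feed a population estimate, the sketch below applies inverse-probability weighting in the spirit of Horvitz-Thompson estimation. This is an assumed, simplified estimator, not necessarily the one used in the framework described above, and the records and probabilities are invented.

```python
# Sketch: turn predicted detection probabilities into estimated totals by
# weighting each detected passage by 1 / P(detection). Data are invented.
import pandas as pd

detections = pd.DataFrame({
    "edge": ["A12", "A12", "N44", "N44", "N44"],   # road segments
    "vehicle": [1, 2, 3, 4, 5],                    # detected transport vehicles
    "p_detect": [0.80, 0.50, 0.25, 0.40, 0.90],    # e.g. from a boosted model
})

# Each detection stands in for 1 / p_detect passages, so summing the weights
# per edge estimates total passages, including those the sensors missed.
detections["weight"] = 1.0 / detections["p_detect"]
estimated_passages = detections.groupby("edge")["weight"].sum()
print(estimated_passages)  # A12: 3.25, N44: ~7.61 estimated passages
```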
The rapid digitization and publication of local government records presents researchers with an unprecedented chance to study governance processes. In tandem, advances in computer science and statistics, alongside significant increases in computational power, have led to the development of "text-as-data" methods and their application to social science and policy research. This paper evaluates the potential utility of digitized public meeting minutes and video recordings for studying decision-making about technology adoption by local public agencies, using survey data on the same topic as a benchmark. Focusing on transit agencies in California, we evaluate surveys and digitized meeting records with respect to overall data availability, bias in data availability, and the types of information about technology adoption contained in each. We find that meeting minutes and video recordings are available for more than twice as many agencies as for a state transit agency-sponsored survey, and that the availability of digitized records is not skewed toward larger agencies, as is the case for survey data. Meanwhile, we find important complementarities with respect to the type of information available about technology adoption in these three data sources.
More and more data are being produced by an increasing number of electronic devices physically surrounding us and on the internet. The large amount of data and the high frequency at which they are produced have resulted in the introduction of the term ‘Big Data’. Because these data reflect many different aspects of our daily lives and because of their abundance and availability, Big Data sources are very interesting from an official statistics point of view. This article discusses the exploration of both opportunities and challenges for official statistics associated with the application of Big Data. Experiences gained with analyses of large amounts of Dutch traffic loop detection records and Dutch social media messages are described to illustrate the topics characteristic of the statistical analysis and use of Big Data.
Official statisticians have been dealing with a diversity of data sources for decades. However, new sources of data in the Big Data domain provide an opportunity to deliver a more efficient and effective statistical service. This paper outlines a number of considerations for the official statistician when deciding whether to embrace a particular new data source in the regular production of official statistics. The principal considerations are relevance, business benefit, and the validity of using the source for official statistics in finite population inferences or analytic inferences. The paper also describes the Big Data Flagship Project of the Australian Bureau of Statistics (ABS), which has been established to provide the opportunity for the ABS to gain practical experience in assessing the business, statistical, technical, computational and other issues in using Big Data. In addition, ABS participation in national and international activities in this area will help it share experience and knowledge, while collaboration with academics will enable ABS to better acquire the capability to address business problems using the new sources of data as part of the solution.
The rise of Big Data changes the context in which organisations producing official statistics operate. Big Data provides opportunities, but in order to make optimal use of Big Data, a number of challenges have to be addressed. This stimulates increased collaboration between National Statistical Institutes, Big Data holders, businesses and universities. In time, this may lead to a shift in the role of statistical institutes in the provision of high-quality and impartial statistical information to society. In this paper, the changes in context, the opportunities, the challenges and the way to collaborate are addressed. The collaboration between the various stakeholders will involve each partner building on and contributing different strengths. For national statistical offices, traditional strengths include on the one hand the position to collect data and combine data sources to statistical products, and on the other hand their focus on quality, transparency and sound methodology. In the Big Data era of competing and multiplying data sources, they continue to have a unique knowledge of official statistical production methods. And their impartiality and respect for privacy as enshrined in law uniquely position them as a trusted third party. Based on this, they may advise on the quality and validity of information of various sources. By thus positioning themselves, they will be able to play their role as key information providers in a changing society.
‘Smart cities’ is a term that has gained traction in academia, business and government to describe cities that, on the one hand, are increasingly composed of and monitored by pervasive and ubiquitous computing and, on the other, whose economy and governance are being driven by innovation, creativity and entrepreneurship, enacted by smart people. This paper focuses on the former and, drawing on a number of examples, details how cities are being instrumented with digital devices and infrastructure that produce ‘big data’. Such data, smart city advocates argue, enable real-time analysis of city life, new modes of urban governance, and provide the raw material for envisioning and enacting more efficient, sustainable, competitive, productive, open and transparent cities. The final section of the paper provides a critical reflection on the implications of big data and smart urbanism, examining five emerging concerns: the politics of big urban data, technocratic governance and city development, corporatisation of city governance and technological lock-ins, buggy, brittle and hackable cities, and the panoptic city.
This working paper describes the potential of the proliferation of new sources of large volumes of data, sometimes also referred to as “big data”, for informing policy making in several areas. It also outlines the challenges that the proliferation of data raises for the production of official statistics and for statistical policies.
Big data is a component of the Fourth Industrial Revolution. The deep penetration of digital technology has turned data into an essential component of the production process. Data are automatically generated by machines during the course of operation and during interactions with humans. This paper describes the concept and composition of big data. Most big data are unstructured and include text, audio-video files, images, emails, log files, etc. Statisticians are more interested in structured data presented in a pre-defined database model. Big data offer new sources and opportunities that cannot be discounted. However, the use of big data requires proper assessment in terms of quality dimensions such as accuracy, comparability and methodological soundness. Against the backdrop of arguments regarding big data, some users view big data as a replacement for official statistics. Such a conclusion is premature for at least two reasons: first, only a small part of big data can be used for decision-making; second, theory and practice prove that a small sample based on scientific methods can yield much more reliable and accurate estimates than the results obtained from the processing of large amounts of unstructured data. The paper assesses the possibility of using big data for Sustainable Development Goals (SDG) monitoring, a nationally owned process in which NSOs are accountable for the SDG data they report. If the data are derived from a big data source, irrespective of the level of technical sophistication used in data transformation, the reliability of such data might be questioned by the national institutions. The paper concludes that the reliability of data obtained from big data sources hinges on the quality of the tools and methods applied to data transformation. Statisticians can play an important role in alerting society, decision-making bodies of government and businesses about the reliability of information derived from different sources.