Going Beyond GDP to Nowcast Well-Being
Using Retail Market Data
Riccardo Guidotti1,2, Michele Coscia3, Dino Pedreschi2
and Diego Pennacchioli1
1KDDLab ISTI CNR, Via G. Moruzzi, 1, Pisa, IT {name.surname}@isti.cnr.it
2KDDLab CS Dept. Univ. of Pisa, L. B. Pontecorvo, 3, Pisa, IT
{name.surname}@di.unipi.it
3CID - HKS, 79 JFK St. Cambridge MA, US michele_coscia@hks.harvard.edu
Abstract. One of the most used measures of the economic health of a
nation is the Gross Domestic Product (GDP): the market value of all
officially recognized final goods and services produced within a country
in a given period of time. GDP, prosperity and well-being of the citizens
of a country have been shown to be highly correlated. However, GDP
is an imperfect measure in many respects. GDP usually takes a lot of
time to be estimated and arguably the well-being of the people is not
quantifiable simply by the market value of the products available to
them. In this paper we use a quantification of the average sophistication
of satisfied needs of a population as an alternative to GDP. We show that
this quantification can be calculated more easily than GDP and it is a
very promising predictor of the GDP value, anticipating its estimation
by six months. The measure is arguably a more multifaceted evaluation
of the well-being of the population, as it tells us more about how people
are satisfying their needs. Our study is based on a large dataset of retail
micro transactions happening across the Italian territory.
1 Introduction
Objectively estimating a country’s prosperity is a fundamental task for modern
society. We need to have a test to understand which socio-economic and political
solutions are working well for society and which ones are not. One such test is
the estimation of the Gross Domestic Product, or GDP. GDP is defined as the
market value of all officially recognized final goods and services produced within
a country in a given period of time. The idea of GDP is to capture the average
prosperity that is accessible to people living in a specific region.
No prosperity test is perfect, so it comes as no surprise that GDP is not
perfect either. GDP has been harshly criticised for several reasons [1]. We
focus on two of these reasons. First: GDP is not an easy measure to estimate.
It takes time to evaluate the values of produced goods and services, as to
evaluate them they first have to be produced and consumed. Second: GDP does not
accurately capture the well-being of the people. For instance, income inequality
skews the wealth distribution, making the per capita GDP uninformative,
because it does not describe the majority of the population any more. Moreover,
arguably it is not possible to quantify well-being just with the number of dollars
in someone’s pocket: she might have dreams, aspirations and sophisticated needs
that bear little to no correlation with the status of her wallet.
In this paper we propose a solution to both shortcomings of GDP. We intro-
duce a new measure to test the well-being of a country. The proposed measure
is the average sophistication of the satisfiable needs of a population. We are
able to estimate such measure by connecting products sold in the country to the
customers buying them in significant quantities, generating a customer-product
bipartite network. The sophistication measure is created by recursively correct-
ing the degree of each customer in the network. Customers are sophisticated if
they purchase sophisticated products, and products are sophisticated if they are
bought by sophisticated customers. Once this recursive correction converges, the
aggregated sophistication level of the network is our well-being estimation.
The average sophistication of the satisfiable needs of a population is a good
test of a country’s prosperity as it addresses the two issues of GDP we discussed.
First, it shows a high correlation with the GDP of the country, when shifting
the GDP by two quarters. The average sophistication of the bipartite network is
an effective nowcasting of the GDP, making it a promising predictor of the GDP
value the statistical office will release after six months. Second, our measure
is by design an estimation of the sophistication of the needs satisfied by the
population. It is more in line with a real well-being measure, because it detaches
itself from the mere quantity of money circulating in the country and focuses
closely on the real dynamics of the population’s everyday life.
The analysis we present is based on a dataset coming from a large retail
company in Italy. The company operates 120 shops on the West Coast of Italy.
It serves millions of customers every year, the large majority of whom are
identifiable through fidelity cards. We analyze all items sold from January 2007
to June 2014.
We connect each customer to all items she purchased during the observation
period, reconstructing 30 quarterly bipartite customer-product networks. For
each network, we quantify the average sophistication of the customers and we
test its correlation with GDP, for different temporal shift values.
2 Related Work
Nowcasting is a promising field of research to resolve the delay issues of GDP.
Nowcasting has been successfully combined with the analysis of large datasets
of human activities. Two famous examples are Google Flu trends [2] and the
prediction of automobile sales [3]. Social media data has been used to nowcast
employment status and shocks [4] [5]. Such studies are not exempt from criti-
cisms: [6] proved that nowcasting with Google queries alone is not enough and
the data must be integrated with other models. Nowcasting has been already
applied to GDP too [7], however the developed model uses a statistical approach
that is intractable for a high number of variables, thus affecting the quality of re-
sults. Other examples can be found focusing on the Eurozone [8], or on different
targets such as poverty risk [9] and income distribution [10].
Our proposal of doing GDP nowcasting using retail data is based on the
recent branch of research that considers markets as self-organizing complex sys-
tems. In [11], authors model the global export market as a bipartite network,
connecting the countries with the products they export. Such structure is able
to predict long-term GDP growth of a country. This usage of complex networks
has been replicated both at the macro economy level [12] and at the micro level
of retail [13]. At this level, in previous work we showed that the complex system
perspective still yields an interesting description of the retail dynamics [14]. We
defined a measure of product and customer sophistication and we showed its
power to explain the distance travelled by customers to buy the products they
need [15], and even their profitability for the shop [16]. In this work, we borrow
these indicators and we use them to tackle the problem of nowcasting GDP. An
alternative methodology uses electronic payment data [17]. However in this case
the only issue addressed is the timing issue, but no attempt is made into making
the measure more representative of the satisfaction of people’s needs.
The critiques of GDP we mentioned have resulted in the proliferation of
alternative well-being indicators. We mention the Index of Sustainable Economic
Welfare (ISEW), the Genuine Progress Indicator (GPI) [18] and the Human
Development Index (HDI)¹. A more in-depth review of well-being alternatives
is provided in [19]. These indicators are designed to correct some shortcomings
of GDP, namely by incorporating sustainability and social cost. However, they are
still affected by long delays between measurements and evaluation. They are
also affected by other criticisms: for instance, GPI includes a list of adjustment
items that is considered inconsistent and somewhat arbitrary. Corrections have
been developed [20], but so far there is no conclusive reason to prefer them to
GDP. We therefore adhere to the standard and consider only the GDP measure,
remarking that no alternative has addressed the two mentioned issues of GDP in
a universally recognized satisfactory way.
3 Data
Our analysis is based on real world data about customer behaviour. The dataset
we used is the retail market data of one of the largest Italian retail distribution
companies. The dataset has been already presented in previous works ([16] [15])
and we refer to those publications for an in-depth description of our cleaning
strategy. We report here when we perform different operations.
The dataset contains retail market data in a time window spanning from
Jan 1st, 2007 to June, 30th 2014. The active and recognizable customers are
1M. The stores of the company cover the West Coast of Italy. We aggregated
the items sold using the Segment classification in the supermarket’s marketing
hierarchy. We end up with 4,500 segments, to which we refer as products.
At this point we need to define the time granularity of our observation period.
We choose to use a quarterly aggregation mainly because we want to compare
our results with GDP, and GDP is most relevant at a quarterly aggregation.
For each quarter, we have 500k active customers.
¹ http://hdr.undp.org/en/statistics/hdi
Since our objective is to establish a correlation between the supermarket
data and the Gross Domestic Product of Italy, we need a reliable data source
for GDP. We rely on the Italian National Institute of Statistics (ISTAT). ISTAT
publishes quarterly reports about the state of the country under several
aspects, including the official GDP estimation. ISTAT is a public organization
and its estimates are the official data used by the Italian central government.
We downloaded the GDP data from the ISTAT website².
Fig. 1: The geographical distribution of observed customers (yellow dots) and
shops (blue dots) in the territory of Italy.
Figure 1 shows that the observed customers cover the entire territory of Italy.
However, the shop distribution is not homogeneous. Shops are located in a few
Italian regions. Therefore, the coverage of these regions is much more significant,
while customers from other regions usually shop only during vacation periods in
these regions. Our analysis is performed on national GDP data, because regional
GDP data is disclosed only with a yearly aggregation. However, the correlation
between national GDP and the aggregated GDP of our observed regions (Tus-
cany, Lazio and Campania) during our observation period is 0.95 (p < 0.001).
This is because Italy has a high variation on the North-South axis, which we
cover, while the West-East variation, which we cannot cover, is very low.
² http://dati.istat.it/Index.aspx?lang=en&themetreeid=91, date of last access:
September 23rd, 2015
4 Methodology
In this section we present the methodology implemented for the paper. First, we
present the algorithm we use to estimate the measure of sophistication (Section
4.1). Second, we discuss the seasonality issues (Section 4.2).
4.1 Sophistication
The sophistication index is used to objectively quantify the sophistication level
of the needs of the customers buying products. We introduced the sophistication
index in [15], which is an adaptation from [11], necessary to scale up to large
datasets. We briefly report here how to compute the customer sophistication
index, and we refer to the cited papers for a more in-depth explanation.
The starting point is a matrix with customers on the rows and products on the
columns. This matrix is generated for each quarter of each year of observation
(e.g. Q1 of 2007, Q2 of 2007 and so on). Each cell contains the number of items
of the product purchased by the customer in the given quarter. We then have
30 such matrices. The matrices are already very sparse, with an average fill
of 1.4% (ranging from 33 to 37 million non-zero values). Our aim is to increase
the robustness of these structures by constructing a bipartite network
connecting customers exclusively to the subset of products they purchase in
significant quantities. Figure 2 provides a simple depiction of the output
bipartite network.
Fig. 2: The resulting bipartite network connecting customers to the products
they buy in significant quantities.
To filter the edges, we calculate the Revealed Comparative Advantage (RCA,
known as Lift in data mining [21]) of each product-customer cell [22], following
[11]. Given a product $p_i$ and a customer $c_j$, the RCA of the couple is
defined as follows:
\[
RCA(p_i, c_j) = \frac{X(p_i, c_j)}{X(p_*, c_j)} \left( \frac{X(p_i, c_*)}{X(p_*, c_*)} \right)^{-1},
\]
where $X(p_i, c_j)$ is the number of $p_i$ bought by $c_j$, $X(p_*, c_j)$ is the
number of products bought by $c_j$, $X(p_i, c_*)$ is the total number of times
$p_i$ has been sold and $X(p_*, c_*)$ is the total number of products sold. RCA
takes values from 0 (when $X(p_i, c_j) = 0$, i.e. customer $c_j$ never bought a
single instance of product $p_i$) to $+\infty$. When $RCA(p_i, c_j) = 1$, it
means that $X(p_i, c_j)$ is exactly the expected value under the assumption of
statistical independence, i.e. the connection between customer $c_j$ and product
$p_i$ has the expected weight. If $RCA(p_i, c_j) < 1$ it means that the customer
$c_j$ purchased the product $p_i$ less than expected, and vice-versa. Therefore,
we keep an edge in the bipartite network iff its corresponding RCA is larger
than 1. Note that most edges were already robust. When filtering out the edges,
we keep 93% of the original connections.
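The filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rca_edges`, the dictionary layout of `purchases` and the toy customers are all hypothetical, and the quantities are made up.

```python
def rca_edges(purchases):
    """Return the (customer, product) pairs whose RCA is larger than 1.

    `purchases` maps (customer, product) -> number of items bought in one
    quarter. The name and the layout are assumptions made for this sketch.
    """
    customer_total = {}   # X(p_*, c_j): items bought by each customer
    product_total = {}    # X(p_i, c_*): times each product was sold
    grand_total = 0       # X(p_*, c_*): all items sold
    for (customer, product), qty in purchases.items():
        customer_total[customer] = customer_total.get(customer, 0) + qty
        product_total[product] = product_total.get(product, 0) + qty
        grand_total += qty

    edges = set()
    for (customer, product), qty in purchases.items():
        if qty == 0:
            continue  # RCA is 0, so the edge is dropped
        share_in_basket = qty / customer_total[customer]
        share_in_market = product_total[product] / grand_total
        if share_in_basket / share_in_market > 1:
            edges.add((customer, product))
    return edges


# Made-up quantities for three hypothetical customers and two products.
toy = {
    ("anna", "pasta"): 8, ("anna", "toys"): 2,
    ("bruno", "pasta"): 9, ("bruno", "toys"): 1,
    ("carla", "pasta"): 3, ("carla", "toys"): 7,
}
kept = rca_edges(toy)
```

In this toy example pasta makes up two thirds of all items sold, so only customers whose basket is more than two thirds pasta keep their pasta edge.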
The customer sophistication is directly proportional to the customer’s degree
in the bipartite network, i.e. to the number of different products she buys.
Differently from previous works [15] that used the traditional economic
complexity algorithm [11], in this work we use the Cristelli formulation of
economic complexity [23]. Note that the two measures are highly correlated.
Therefore, in the context of this paper, there is no reason to prefer one
measure over the other, and we make the choice of using only one for clarity
and readability.
Consider our bipartite network $G = (C, P, E)$ described by the adjacency
matrix $M_{|C| \times |P|}$, where $C$ are customers and $P$ are products. Let
$c$ and $p$ be two ranking vectors to indicate how much a C-node is linked to
the most linked P-nodes and, similarly, P-nodes to C-nodes. It is expected that
the most linked C-nodes connected to nodes with high $p_j$ score have a high
value of $c_i$, while the most linked P-nodes connected to nodes with high $c_i$
score have a high value of $p_j$. This corresponds to a flow among nodes of the
bipartite graph where the rank of a C-node enhances the rank of the P-node to
which it is connected and vice-versa. Starting from $i \in C$, the unbiased
probability of transition from $i$ to any of its linked P-nodes is the inverse
of its degree, $c_i^{(0)} = \frac{1}{k_i}$, where $k_i$ is the degree of node
$i$. P-nodes have a corresponding probability of $p_j^{(0)} = \frac{1}{k_j}$.
Let $n$ be the iteration index. The sophistication is defined as:
\[
c_i^{(n)} = \sum_{j=1}^{|P|} \frac{1}{k_j} M_{ij} \, p_j^{(n-1)} \;\; \forall i
\qquad
p_j^{(n)} = \sum_{i=1}^{|C|} \frac{1}{k_i} M_{ij} \, c_i^{(n-1)} \;\; \forall j
\]
These rules can be rewritten as a matrix-vector multiplication
\[
c = \bar{M} p \qquad p = \bar{M}^T c \qquad (1)
\]
where $\bar{M}$ is the weighted adjacency matrix. So, like previously we have
\[
c^{(n)} = \bar{M} \bar{M}^T c^{(n-1)} \qquad p^{(n)} = \bar{M}^T \bar{M} p^{(n-1)}
\]
\[
c^{(n)} = C c^{(n-1)} \qquad p^{(n)} = P p^{(n-1)}
\]
where $C_{(|C| \times |C|)} = \bar{M} \bar{M}^T$ and
$P_{(|P| \times |P|)} = \bar{M}^T \bar{M}$ are related to
$x^{(n)} = A x^{(n-1)}$. This makes sophistication solvable using the power
iteration method (which also proves convergence). Note that this procedure is
equivalent to the HITS ranking algorithm, as proved in [24].
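As a rough illustration of the element-wise update rules, the following Python sketch iterates them on a toy edge list. It is not the authors' implementation: the function name `sophistication`, the fixed number of iterations, the rescaling of the scores to sum to one after each step and the final division by the maximum are all assumptions made here to keep the example small and numerically stable.

```python
def sophistication(edges, iterations=200):
    """Iterate the coupled updates for customer (c) and product (p) scores.

    `edges` is an iterable of (customer, product) pairs that survived the
    RCA filter. Rescaling after every step and dividing by the maximum at
    the end are assumptions of this sketch, not steps taken from the paper.
    """
    edges = set(edges)
    k_c, k_p = {}, {}                            # degrees k_i and k_j
    for cust, prod in edges:
        k_c[cust] = k_c.get(cust, 0) + 1
        k_p[prod] = k_p.get(prod, 0) + 1

    c = {i: 1.0 / k for i, k in k_c.items()}     # c_i^(0) = 1 / k_i
    p = {j: 1.0 / k for j, k in k_p.items()}     # p_j^(0) = 1 / k_j

    for _ in range(iterations):
        new_c = dict.fromkeys(c, 0.0)
        new_p = dict.fromkeys(p, 0.0)
        for cust, prod in edges:                 # M_ij = 1 on these pairs
            new_c[cust] += p[prod] / k_p[prod]   # (1 / k_j) M_ij p_j^(n-1)
            new_p[prod] += c[cust] / k_c[cust]   # (1 / k_i) M_ij c_i^(n-1)
        total_c, total_p = sum(new_c.values()), sum(new_p.values())
        c = {i: v / total_c for i, v in new_c.items()}
        p = {j: v / total_p for j, v in new_p.items()}

    top_c, top_p = max(c.values()), max(p.values())
    return ({i: v / top_c for i, v in c.items()},
            {j: v / top_p for j, v in p.items()})


# Hypothetical filtered network: three customers, three products.
toy_edges = [
    ("anna", "pasta"), ("anna", "toys"), ("anna", "cosmetics"),
    ("bruno", "pasta"), ("bruno", "toys"),
    ("carla", "pasta"),
]
c_scores, p_scores = sophistication(toy_edges)
```

On this toy network anna, who is linked to the most products, ends with the highest customer score and carla with the lowest.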
[Figure 3 plot: four panels, C-SOP Q1 to C-SOP Q4; x axis C-SOP and y axis C-RATIO, both from 0.0 to 1.0; one curve per year, 2007 to 2014 for Q1 and Q2, 2007 to 2013 for Q3 and Q4.]
Fig. 3: The customer sophistication distributions per quarter and per year. Each
plot reports the probability (y axis) of a customer to have a given sophistication
value (x axis), from quarter 1 to quarter 4 (left to right) for each year.
At the end of our procedure, we have a value of customer and product so-
phistication for each customer for each quarter. For the rest of the section we
focus on customer sophistication for space reasons. Each customer is associated
with a timeline of 30 different sophistications. The overall sophistication is nor-
malized to take values between 0 and 1. Figure 3 shows the distribution of the
customer sophistication per quarter and per year. We chose to aggregate the
visualization by quarter because the same quarters are similar across years but
different within years, due to seasonal effects.
[Figure 4 plot: three panels, one each for alpha, beta and gamma; x axis: quarter, 2007 to 2014.]
Fig. 4: The different values taken by the fit parameters across the observation
period for the sophistication distribution.
Figure 3 shows that the sophistication distribution is highly skewed. We
expect it to be an exponential function: by definition the vast majority of the
population is unsophisticated and highly sophisticated individuals are an elite.
The fit function cannot be a power-law because the different levels of
sophistication, from least to most sophisticated, do not span a sufficiently
high number of orders of magnitude. We fitted a function of the form
$f(x) = \gamma + \beta \times \alpha^{x}$ for each quarterly snapshot of our
bipartite networks. Figure 4 reports the evolution of the fit parameters
$\alpha$, $\beta$ and $\gamma$. The figure shows that the fit function is
mostly stable over time. The fits have been performed using ordinary least
squares regression.
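One possible way to fit a function of this form with least squares is sketched below. The paper does not describe the fitting procedure in detail, so this is an assumption: because the model is linear in gamma and beta once alpha is fixed, the sketch tries each alpha from a candidate grid, solves the two linear parameters in closed form and keeps the alpha with the smallest squared error. The function name, the grid and the synthetic data are hypothetical.

```python
def fit_exponential(xs, ys, alphas):
    """Fit f(x) = gamma + beta * alpha**x by least squares.

    For each candidate alpha the model is linear in (gamma, beta), so those
    two come from the closed-form simple regression of y on alpha**x. The
    candidate grid `alphas` is an assumption of this sketch.
    """
    n = len(xs)
    mean_y = sum(ys) / n
    best = None
    for alpha in alphas:
        zs = [alpha ** x for x in xs]
        mean_z = sum(zs) / n
        var_z = sum((z - mean_z) ** 2 for z in zs)
        if var_z == 0:          # alpha == 1 makes the regressor constant
            continue
        cov_zy = sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys))
        beta = cov_zy / var_z
        gamma = mean_y - beta * mean_z
        sse = sum((gamma + beta * z - y) ** 2 for z, y in zip(zs, ys))
        if best is None or sse < best[0]:
            best = (sse, alpha, beta, gamma)
    if best is None:
        raise ValueError("no usable alpha in the candidate grid")
    _, alpha, beta, gamma = best
    return alpha, beta, gamma


# Synthetic data generated from known parameters, for illustration only.
xs = [i / 10 for i in range(11)]
ys = [0.02 + 1.1 * 0.5 ** x for x in xs]
alpha, beta, gamma = fit_exponential(xs, ys, [k / 10 for k in range(1, 100)])
```

A finer grid, or a nonlinear least-squares routine, would be needed when the true alpha does not lie on the grid.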
Table 1: The most and least sophisticated products in our dataset.
SOP Rank Product
1 Cosmetics
2 Underwear for men
3 Furniture
4 Multimedia services
5 Toys
... ...
-5 Fresh Cheese
-4 Red Meat
-3 Spaghetti
-2 Bananas
-1 Short Pasta
To illustrate the quality of our sophistication measure in capturing need
sophistication, we report in Table 1 a list of the most and least sophisticated
products, calculated aggregating data from all customers. The most sophisticated
products are not needed daily and are usually non-food. The least sophisticated
products are food items. The data being Italian, pasta is the most basic product.
4.2 Seasonality
Both GDP and the behavior of customers in the retail market are affected by
seasonality. Different periods of the year are associated with different economic
activities. This is particularly true for Italy in some instances: during the month
of August, Italian productive activities come to an almost complete halt, and the
country hosts its peak tourist population. The number and variety of products
available in the supermarket fluctuates too, with more fruit and vegetables avail-
able in different months, or with Christmas season and subsequent sale shocks.
A number of techniques have been developed to deal with seasonal changes
in GDP. One of the most popular seasonal adjustments is done through the X-
13-Arima method, developed by the U.S. Census Bureau [25]. However, we are
unable to use this methodology for two reasons. First, it requires an observation
period longer than the one we are able to provide in this paper. Second, the
methodologies present in literature are all fine-tuned to specific phenomena that
are not comparable to the shopping patterns we are observing. Thus we cannot
apply them to our sophistication timelines. Given that we are not able to make
a seasonal adjustment for the sophistication, we chose to not seasonally adjust
GDP too. We acknowledge this as a limitation of our study and we leave the
development of a seasonal adjustment for sophistication as a future work.
5 Experiments
In this section we test the relation between the statistical properties of the
bipartite networks generated with our methodology and the GDP values of the
country. We first show the evolution of aggregated measures of expenditure,
number of items, degree and sophistication along our observation period. We
then test the correlation with GDP, with various temporal shifts to highlight
the potential predictive power of some of these measures.
Before showing the timelines, we describe our approach for the aggregation of
the properties of customers. The behavior of customers is highly differentiated.
We have already shown that the sophistication distribution is highly skewed and
best represented as an exponential function. The expenditure and the number of
items purchased present a skewed distribution among customers: few customers
spend high quantities of money and buy many items, many customers spend little
money and buy few items. For this reason, we cannot aggregate these measures
using the average over the entire distribution, as it is not well-behaved for
skewed values. To select the data we use the inter-quartile range, the measure
of spread from the first to the third quartile. In practice, we trim the
outliers out of the aggregation and then we compute the average, the
Inter-Quartile Mean, or “IQM”. The IQM is calculated as follows:
\[
x_{IQM} = \frac{2}{n} \sum_{i=\frac{n}{4}+1}^{\frac{3n}{4}} x_i
\]
assuming $n$ sorted values.
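The IQM formula translates directly into a short Python function. This sketch assumes that n is a multiple of 4, so that the quartile boundaries fall on whole indices; the function name and the sample values are hypothetical.

```python
def interquartile_mean(values):
    """Mean of the middle half of the data: (2 / n) times the sum of x_i for
    i = n/4 + 1 .. 3n/4 over the sorted values (1-based indices).

    Assumes n is a multiple of 4. Other lengths would need fractional
    weights at the edges, which this sketch does not handle.
    """
    xs = sorted(values)
    n = len(xs)
    if n == 0 or n % 4 != 0:
        raise ValueError("this sketch needs n > 0 and n divisible by 4")
    middle = xs[n // 4: 3 * n // 4]   # 0-based slice covering the same items
    return 2 * sum(middle) / n


# Made-up expenditures with one extreme customer.
spend = [5, 7, 8, 9, 10, 12, 15, 400]
iqm = interquartile_mean(spend)       # uses only 8, 9, 10 and 12
plain_mean = sum(spend) / len(spend)  # dominated by the 400 outlier
```

The single extreme value dominates the plain mean but does not enter the IQM at all, which is the reason for trimming before averaging.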
Also note that all the timelines we present have been normalized. All variables
take values between zero and one, where zero represents the minimum value
observed and one the maximum. As for the notation used, in the text and in the
captions of the figures we use the abbreviations reported in Table 2.
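The normalization of the timelines amounts to a min-max rescaling, which can be sketched as follows. The function name and the sample series are hypothetical.

```python
def min_max(series):
    """Rescale a timeline so its minimum maps to 0 and its maximum to 1."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("a constant series cannot be rescaled")
    return [(v - lo) / (hi - lo) for v in series]


scaled = min_max([2, 4, 6])   # made-up values
```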
Table 2: The abbreviations for the measures used in the experiment section.
Abbreviation  Description
IQM           Inter-Quartile Mean.
GDP           Gross Domestic Product.
EXP           IQM of the total expenditure per customer.
PUR           IQM of the total number of items purchased per customer.
C-DEG         IQM of the number of products purchased in significant quantities (i.e. the bipartite network degree) per customer.
P-DEG         IQM of the number of customers purchasing the product in significant quantities (i.e. the bipartite network degree).
C-SOP         IQM of the sophistication per customer.
P-SOP         IQM of the sophistication per product.
The first relation we discuss is between GDP and the most basic customer
variables. Figure 5 depicts the relation between GDP and the IQM expenditure
(left), and GDP and IQM of number of items purchased (right). Besides the
obvious seasonal fluctuation, we can see that the two measures are failing to
capture the overall GDP dynamics. GDP has an obvious downward trend, due
to the fact that our observation window spans across the global financial crisis,
which hit Italy particularly hard starting from the first quarter of 2009. However,
the average expenditure in the observed supermarket has not been affected at all.
Also the number of items has not been affected. If we calculate the corresponding
correlations, we notice a negative relationship which, however, fails to pass a
stringent null hypothesis test (p > 0.01).
[Figure 5 plot: two panels, GDP with EXP (left) and GDP with PUR (right); x axis: quarter, 2007 to 2014; y axis: normalized value, 0.0 to 1.0.]
Fig. 5: The relation between GDP and IQM customer expenditure (left) and
IQM number of items purchased (right).
[Figure 6 plot: two panels, GDP with C-SOP (left) and GDP with P-SOP (right); x axis: quarter, 2007 to 2014; y axis: normalized value, 0.0 to 1.0.]
Fig. 6: The relation between GDP and IQM customer (left) and product (right)
sophistication.
Turning to our sophistication measure, Figure 6 depicts the relation between
GDP and our complex measures of sophistication. On the left we have the mea-
sure of customer sophistication we discussed so far. We can see that the alignment
is indeed not perfect. However, averaging out the seasonal fluctuation, customer
sophistication captures the overall downward trend of GDP. The financial crisis
effect was not only a macroeconomic problem, it also affected the sophistication
of the satisfiable needs of the population. Note that, again, we have a negative
correlation. This means that, as GDP shrinks, customers become more sophisti-
cated. This is because the needs that once were classified as basic are not basic
any more, hence the rise in sophistication of the population. Differently from
before, the correlation is actually statistically significant (p < 0.01).
We also report on the right the companion sophistication measure: since we
can define the customer sophistication as the average sophistication of the
products they purchase, we can also define a product sophistication as the
average sophistication of the customers purchasing them. Figure 6 (right) shows
the reason why we do not focus on product sophistication: the overall trend for
product sophistication tends to be the opposite of the customer sophistication.
This anti-correlation seems to imply that, as the customers struggle in
satisfying their needs, the once top-sophisticated products are not purchased
any more, lowering the overall product sophistication index. However, this is
only one of many possible interpretations and we need further investigation in
future works.
Table 3: The correlations of all the used measures with GDP at different shift
values. We highlight the statistically significant correlations.
Measure   Shift -3   Shift -2      Shift -1     Shift 0    Shift 1    Shift 2
EXP       -0.29302   -0.49830      -0.53078      0.23976   -0.27619   -0.37073
PUR       -0.27091   -0.49836      -0.53046**    0.18638   -0.30909   -0.32432
C-DEG      0.24624    0.39808      -0.55479      0.13727    0.08191    0.36001
P-DEG     -0.12409   -0.26289      -0.57657**    0.30255   -0.22198   -0.28325
C-SOP     -0.32728   -0.67007***    0.23261      0.09251   -0.15844   -0.58773**
P-SOP     -0.02675   -0.12916       0.60974**   -0.18587    0.15342   -0.03843
* p < 0.1, ** p < 0.05, *** p < 0.01
We sum up the correlation tests performed in Table 3. In the table, we report
the correlation values for all variables. We test different shift values, where the
GDP timeline is shifted by a given number of quarters with respect to the tested
measure. When shift = -1, it means that we align the GDP with the previous
quarter of the measure (e.g. GDP Q4-08 aligned with measure’s Q3-08).
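The alignment rule can be made concrete with a small Python sketch that pairs the GDP of quarter t with the measure of quarter t + shift and then computes the Pearson correlation of the pairs. The function names and the toy timelines are hypothetical, and both timelines are assumed to start at the same quarter.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def shifted_correlation(measure, gdp, shift):
    """Correlate GDP at quarter t with the measure at quarter t + shift.

    With shift = -1 the GDP of a quarter is paired with the measure of the
    previous quarter. Quarters without a partner are dropped.
    """
    pairs = [(measure[t + shift], gdp[t])
             for t in range(len(gdp))
             if 0 <= t + shift < len(measure)]
    return pearson([m for m, _ in pairs], [g for _, g in pairs])


# Toy timelines: GDP mirrors the measure with a delay of two quarters.
measure = [1, 3, 2, 5, 4, 6, 8, 7]
gdp = [0, 0] + [-m for m in measure[:-2]]
r = shifted_correlation(measure, gdp, -2)
```

In this constructed example the correlation at shift -2 is a perfect -1, because every GDP value is the negated measure from two quarters earlier.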
We also report the significance levels of all correlations. Note that all
p-values are corrected for multiple hypothesis testing. When considering
several hypotheses, as we are doing here, the problem of multiplicity arises:
the more hypotheses we check, the higher the probability of a false positive.
To correct for this issue, we apply a Holm-Bonferroni correction. The
Holm-Bonferroni method is an approach that controls the family-wise error rate
(the probability of witnessing one or more false positives) by adjusting the
rejection criteria of each of the individual hypotheses [26]. Once we adjust the
p-values, we obtain the significance levels reported in the table. Only one
correlation passes the Holm-Bonferroni test for significance at p < 0.01 and it
is exactly the one involving the customer sophistication with shift equal to -2.
This correlation is highlighted in bold in Table 3, and it represents the main
result of the paper.
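For readers unfamiliar with the correction, a minimal Python sketch of Holm-Bonferroni adjusted p-values follows. It shows the standard step-down procedure and is not taken from the paper: the function name and the example p-values are hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order.

    The smallest raw p-value is multiplied by m, the next by m - 1, and so
    on. A running maximum keeps the adjusted values monotone and each value
    is capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]   # made-up raw p-values
adj = holm_adjust(raw)
significant = [p < 0.05 for p in adj]
```

With these made-up inputs, two of the four hypotheses stay below 0.05 after adjustment, although all four raw p-values were below it.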
Note that in the table we also report the correlation values using the IQM
for the customer and product degree measures, of which we have not shown the
timelines, due to space constraints. We include them because, as we discussed
previously, our sophistication measures are corrected degree measures. If the
degree measures were able to capture the same correlation with GDP there
would be no need for our more complex measures. Since the degree measures
do not pass the Holm-Bonferroni test we can conclude that the sophistication
measures are necessary to achieve our results.
[Figure 7 plot: x axis SHIFT, from -3 to 2; y axis PEARSON, from -1.0 to 1.0; series C-SOP and P-SOP.]
Fig. 7: The correlation between average customer sophistication and GDP with
different shifting values.
We finally provide a visual representation of the customer and product so-
phistication correlations with GDP at different shift levels in Figure 7. The figure
highlights the different time frames in which the two measures show their pre-
dictive power over GDP. The customer sophistication has its peak at shift equal
to -2. The cyclic nature of the data implies also a strong, albeit not significant,
correlation when the shift is equal to 2. Instead, the product sophistication ob-
tains its highest correlation with GDP with shift equal to -1. This might still
be useful in some cases, as the GDP for a quarter is usually released by the
statistical office with some weeks of delay.
6 Conclusion
In this paper we tackled the problem of having a fast and reliable test for esti-
mating the well-being of a population. Traditionally, this is achieved with many
measures, and one of the most used is the Gross Domestic Product, or GDP,
which roughly indicates the average prosperity of the citizens of a country. GDP
is affected by several issues, and here we tackle two of them: it is a hard measure
to quantify rapidly and it does not take into account all the non-tangible aspects
of well-being, e.g. the satisfied needs of a population. By using retail informa-
tion, we are able to estimate the overall sophistication of the needs satisfied by a
population. This is achieved by constructing and analyzing a customer-product
bipartite network. In the paper we show that our customer sophistication mea-
sure is a promising predictor of the future GDP value, anticipating it by six
months. It is also a measure less linked with the amount of richness around a
person, and it focuses more on the needs this person is able to satisfy.
This paper opens the way for several future research tracks. Firstly, in the
paper we were unable to define a proper seasonal adjustment for our sophisti-
cation measure. The seasonality of the measure is evident, but it is not trivial
how to deal with it. A longer observation period and a new seasonal adjustment
measure are needed, and our results show that this is a worthwhile research
track. Secondly, we showed that there is an interesting anti-correlation between
the aggregated sophistication measures calculated for customers and products.
This seems to imply that, in harsh economic times, needs that once were basic
become sophisticated (increasing the overall customer sophistication) and needs
that were sophisticated are likely to be dropped (decreasing the overall prod-
uct sophistication). More research is needed to fully understand this dynamic.
Finally, in this paper we made use of a quarterly aggregation to build our bi-
partite networks. We made this choice because the quarterly aggregation is the
most fine-grained one we can obtain for GDP estimations. However, now that
we have shown the correlation, we can investigate whether the quarterly aggregation
is the most appropriate for our analysis. If we can obtain comparable results with
a finer level of aggregation (say monthly or weekly), our well-being estimation
can come closer to being calculated almost in real time.
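The choice of aggregation level can be pictured with a small sketch that groups purchase records into one customer-product edge set per period, with the period granularity as a parameter. The record layout, the function names, and the sample transactions are hypothetical. The sketch records only whether a customer bought a product in a period, which is a simplifying assumption and not necessarily how the networks in this paper are filtered or weighted.

```python
from collections import defaultdict
from datetime import date


def period_key(day, granularity):
    """Map a date to its (year, period) key for the chosen granularity."""
    if granularity == "quarter":
        return (day.year, (day.month - 1) // 3 + 1)
    if granularity == "month":
        return (day.year, day.month)
    if granularity == "week":
        iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
        return (iso[0], iso[1])
    raise ValueError(f"unknown granularity: {granularity}")


def bipartite_networks(transactions, granularity):
    """Return {period: set of (customer, product) edges} for the transactions."""
    networks = defaultdict(set)
    for day, customer, product in transactions:
        networks[period_key(day, granularity)].add((customer, product))
    return dict(networks)


# Hypothetical purchase records: (date, customer id, product id).
transactions = [
    (date(2014, 1, 15), "c1", "bread"),
    (date(2014, 2, 3), "c1", "wine"),
    (date(2014, 3, 20), "c2", "bread"),
    (date(2014, 4, 2), "c2", "saffron"),
]

quarterly = bipartite_networks(transactions, "quarter")
monthly = bipartite_networks(transactions, "month")
```

Here the four records yield two quarterly networks and four monthly ones. Finer periods give more frequent but sparser networks, which is the trade-off a monthly or weekly analysis would have to address.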
Acknowledgements
We gratefully thank Luigi Vetturini for the preliminary analysis that made this
paper possible. We thank the supermarket company Coop and Walter Fabbri
for sharing the data with us and allowing us to analyse and to publish the
results. This work is partially supported by the European Community’s H2020
Program under the funding scheme FETPROACT-1-2014: 641191 CIMPLEX,
and INFRAIA-1-2014-2015: 654024 SoBigData.
References
1. Costanza, R., Kubiszewski, I., Giovannini, E., Lovins, H., McGlade, J., Pickett,
K.E., Ragnarsdóttir, K.V., Roberts, D., De Vogli, R., Wilkinson, R.: Time to
leave gdp behind. Nature Comment (2014)
2. Wilson, N., Mason, K., Tobias, M., Peacey, M., Huang, Q., Baker, M.: Interpreting
google flu trends data for pandemic h1n1 influenza: the new zealand experience.
Euro surveillance: bulletin européen sur les maladies transmissibles = European
communicable disease bulletin 14(44) (2008) 429–433
3. Choi, H., Varian, H.: Predicting the present with google trends. Economic Record
88(s1) (2012) 2–9
4. Toole, J.L., Lin, Y.R., Muehlegger, E., Shoag, D., Gonzalez, M.C., Lazer, D.: Track-
ing employment shocks using mobile phone data. arXiv preprint arXiv:1505.06791
(2015)
5. Llorente, A., Cebrian, M., Moro, E., et al.: Social media fingerprints of unemploy-
ment. arXiv preprint arXiv:1411.3140 (2014)
6. Lazer, D., Kennedy, R., King, G., Vespignani, A.: The parable of google flu: traps
in big data analysis. Science 343(14 March) (2014)
7. Giannone, D., Reichlin, L., Small, D.: Nowcasting: The real-time informational
content of macroeconomic data. Journal of Monetary Economics 55(4) (2008)
665–676
8. Foroni, C., Marcellino, M.: A comparison of mixed frequency approaches for now-
casting euro area macroeconomic aggregates. International Journal of Forecasting
30(3) (2014) 554–568
9. Navicke, J., Rastrigina, O., Sutherland, H.: Nowcasting indicators of poverty risk
in the european union: a microsimulation approach. Social Indicators Research
119(1) (2014) 101–119
10. Leventi, C., Navicke, J., Rastrigina, O., Sutherland, H.: Nowcasting the income
distribution in europe. (2014)
11. Hausmann, R., Hidalgo, C., Bustos, S., Coscia, M., Chung, S., Jimenez, J., Simoes,
A., Yildirim, M.: The atlas of economic complexity. Puritan Press, Boston (2011)
12. Caldarelli, G., Cristelli, M., Gabrielli, A., Pietronero, L., Scala, A., Tacchella, A.:
A network analysis of countries export flows: Firm grounds for the building blocks
of the economy. PLoS ONE 7(10) (2012) e47278
13. Chawla, S.: Feature selection, association rules network and theory building. Jour-
nal of Machine Learning Research - Proceedings Track 10 (2010) 14–21
14. Pennacchioli, D., Coscia, M., Rinzivillo, S., Giannotti, F., Pedreschi, D.: The retail
market as a complex system. EPJ Data Science 3(1) (2014) 1–27
15. Pennacchioli, D., Coscia, M., Rinzivillo, S., Pedreschi, D., Giannotti, F.: Explaining
the product range effect in purchase data. In: Big Data, 2013 IEEE International
Conference on, IEEE (2013) 648–656
16. Guidotti, R., Coscia, M., Pedreschi, D., Pennacchioli, D.: Behavioral entropy and
profitability in retail. DSAA (2015)
17. Galbraith, J.W., Tkacz, G.: Nowcasting gdp with electronic payments data. (2015)
18. Lawn, P.A.: A theoretical foundation to support the index of sustainable eco-
nomic welfare (isew), genuine progress indicator (gpi), and other related indexes.
Ecological Economics 44(1) (2003) 105–118
19. Helbing, D., Balietti, S.: How to create an innovation accelerator. The European
Physical Journal Special Topics 195(1) (2011) 101–136
20. Lawn, P.A.: An assessment of the valuation methods used to calculate the index
of sustainable economic welfare (isew), genuine progress indicator (gpi), and sus-
tainable net benefit index (snbi). Environment, Development and Sustainability
7(2) (2005) 185–208
21. Geng, L., Hamilton, H.J.: Interestingness measures for data mining: A survey.
ACM Comput. Surv. 38(3) (2006) 9+
22. Balassa, B.: Trade liberalization and ’revealed’ comparative advantage. Manchester
School 33 (1965) 99–123
23. Cristelli, M., Gabrielli, A., Tacchella, A., Caldarelli, G., Pietronero, L.: Measuring
the intangibles: A metrics for the economic complexity of countries and products.
PloS one 8(8) (2013) e70726
24. Guidotti, R.: Mobility ranking-human mobility analysis using ranking measures.
University of Pisa, Pisa (2013)
25. Monsell, B.C.: Update on the development of x-13arima-seats. In: Proceedings of
the Joint Statistical Meetings: American Statistical Association. (2009)
26. Holm, S.: A simple sequentially rejective multiple test procedure. Scandinavian
journal of statistics (1979) 65–70