Determining the Usual Environment of Cardholders
as a Key Factor to Measure the Evolution of Domestic
Tourism
Juan de Dios Romero Palop, Heriberto Valero Lapaz, Diego Bodas Sagi, Juan Murillo Arias
Abstract
Domestic tourism is harder to analyse than international tourism because it leaves a smaller data
footprint: private means of transport are mostly used, no border is crossed, and no lodging is
registered. Digital data sources can be a useful, but still underused, complement to official survey-based
statistics to fill this lack of reliable information. Although entities like Eurostat have encouraged national
institutes of statistics to use big data sources, results on this topic are still scarce. These digital sources can
not only extend the scope of the scientific literature, but will also increase the level of detail available
across time and space (e.g. weekly tourism evolution indexes at city level can be generated as a result
of our methodology), which will enable improved management of the tourism sector. In particular, the
present paper covers a research gap in the use of card transaction data (on-site payments and cash
withdrawals) to provide an innovative methodology that enhances insight into domestic tourism dynamics. The
chosen approach is based on the United Nations World Tourism Organization definition of ‘usual
environment’: “the geographical area (though not necessarily a contiguous one) within which an individual
conducts his/her regular life routines”. Upon this premise, a methodology has been developed to
use the transactional footprints of cardholders to delineate their usual environment, and subsequently to classify
transactions as ‘touristic’ or ‘non-touristic’. To validate the methodology, a series of tests is shown
in the latter part of the study. Furthermore, to ensure scalability, the resulting procedure does not rely on
any particular territory, and can therefore be adapted to different geographies by varying one single parameter.
All authors are, or have been, members of BBVA Data & Analytics, the data science centre of excellence of
BBVA, the bank that has been the data source for this research and its practical applications. Some
practical applications are described in section 5 through two use cases carried out in Spain and Mexico by
BBVA.
This is an extended version of a conference paper entitled “Using Transactional Data to Determine the
Usual Environment of Cardholders”, previously published in the proceedings of the Information and
Communication Technologies in Tourism 2018 Conference (ENTER 2018), held in Jönköping, Sweden,
January 24-26, 2018.
Keywords: usual environment; big data; domestic tourism; digital footprint; payments; applied data
science.
1 Introduction
Since Spain opened up its economy to the international community in the early 1960s, the Spanish tourism
industry has grown steadily year by year, reaching a 14.9% contribution to GDP and a 15.1%
contribution to employment by 2017 (World Travel & Tourism Council (WTTC), 2018). Given its
contribution to the economy, it is of utmost importance to gather and analyse relevant and accurate data in
order to understand the sector’s evolution, and hence make informed decisions.
There are currently numerous studies focused on the monitoring of international tourism. For example,
(Kozak & Rimmington, 2000) uses surveys conducted at points of entry of a given country, and data
provided by tourist agents such as hotels, travel agencies, or airlines to determine the destination attributes
that are critical to the overall satisfaction levels of tourists visiting Mallorca (Spain) during the winter
season. Nevertheless, methods based on surveys present the added difficulty that it is unrealistic to gather
complete and precise survey information from every individual agent in the sector. For this reason and to
take advantage of the vast amount of digital data available from different sources, other approaches have
been evaluated.
As stated in (Eurostat, 2014) and (Baggio, 2019) the main challenge faced when using these new digital
sources is to differentiate between tourists and non-tourists. For example, in (Koerbitz, Önder, & Hubmann-
Haidvogel, 2013), photos (and their metadata) uploaded on websites such as Flickr (2007-2011) are used to
determine whether these digital footprints provide a useful indicator for tourism demand in Austria. Users
who declared their hometown and current location in the country were classified as residents, while users
who indicated two different locations (excluding Austria) on their profiles were classified as tourists. A
multivariate logistic regression model was created that determined the criteria for classification as a tourist
(residents versus non-residents).
In (Heerschap, Ortega, Priem, & Offermans, 2014), the authors analyse data from tourism accommodation
web pages using crawlers, but they encountered many problems when processing the required information. In
contrast, they found data from mobile devices and networks (in collaboration with Vodafone) useful to
measure the number of foreign tourists in a given zone and time period. Mobile phone data can also be
very useful to study how people (tourists and residents) move during big events. In this case, prior information
provided for the users (country) is used to separate tourists from residents. Mobile phone location data have
already been used to confirm that humans follow identifiable patterns in their lives (González, Hidalgo, &
Barabási, 2008).
In this paper, bank card footprints are used to gain better insight into tourism. This approach has already been
implemented to track the activity of foreign visitors (Sobolevsky, et al., 2015). If it were also used to analyse
cardholder activity nationwide, it would be possible to measure flows of domestic tourism. Furthermore,
the data generated by card payments can help define areas and activities of interest for national visitors, and
also inform us of their place of origin. In order to address the distinction between tourists and residents one
option would be to use the address given by the cardholder when opening their bank account, and to consider
as ‘touristic’ only those transactions which are made in locations other than their place of residence.
However, this alternative presents certain drawbacks. First, information previously given by the
cardholder may no longer be valid because customers are not obliged to notify the bank of a change of
address. Second, and more importantly, the cardholder’s place of residence alone cannot determine which
of their transactions are ‘touristic’. Indeed, it is currently not uncommon for people to live in one location
and have to travel some distance to their place of work/study or to where they spend most of their time, and
where their card transactions will not be tourism-related.
Taking this into account, this paper focuses on defining and implementing a general methodology to identify
the principal areas where cardholders do most of their spending on a daily basis. The methodology is based
on the definition of usual environment proposed by the United Nations World Tourism Organization
(UNWTO): "The usual environment of an individual, a key concept in tourism, is defined as the
geographical area (though not necessarily a contiguous one) within which an individual conducts his/her
regular life routines." (UNWTO, 2010). Specifically, this methodology has been applied to Banco Bilbao
Vizcaya Argentaria (BBVA) cardholders as a case study.
Some authors have already shown that card payments can be used to forecast the evolution of economic
indexes (Tkacz & Galbraith, 2013). In fact, BBVA data has already been used to replicate the Spanish Retail
Trade Index (RTI) (Bodas, et al., 2018). The RTI has traditionally been measured through surveys
conducted via a limited sample of retailers. However, in this study, information is obtained through retail
transactions made by credit and debit cardholders. The resulting indexes were found to be robust when
compared with the RTI indexes published by the National Statistics Institute (INE). The high granularity of
the data makes it possible to reproduce the evolution of daily retail sales, with timely answers also
concerning the impact of any given retail sales event. In addition, this type of data also provides a large
amount of geographic detail (by city or even by postcode), together with information regarding additional
dimensions such as the sector relating to the activity. One of the use cases presented here shows how official
tourist statistics can be replicated by applying the usual environment methodology.
The rest of this paper is structured as follows. Section 2 describes the data sources used in this work. Section
3 introduces the methodology proposed for determining the usual environment of the cardholders. Section
4 explains the results obtained. Section 5 describes some business cases using this methodology. Finally,
Section 6 concludes and summarizes.
2 Data source: representativeness, size and privacy concerns
For the purpose of this research and its practical applications only anonymous data have been accessed and
analysed. Before being uploaded to the analytic infrastructure accessible to the authors, transactional data
have been processed with dissociation algorithms: this means that no specific individual can be directly
identified by name or card ID. Fields describing socio-economic attributes of the cardholder such as gender,
age and financial capacity (although not used in the usual environment methodology) remain part of the
dataset in order to be used aggregately in touristic performance analysis that, in turn, is dependent on the
usual environment attribute.
Regarding the representativeness of the source, BBVA is one of the largest banks in Spain and has over 4.8
million active cardholders. In a country with 46 million inhabitants and 79 million cards issued, this sample
size is a couple of orders of magnitude bigger than the survey-based data currently used to measure the
nation’s domestic tourism patterns. Besides, to make BBVA data statistically representative, these data have
been sampled to fit BBVA’s customer distribution to the demographic distribution of the Spanish population.
BBVA customers alone make more than 2 billion on-site payments or cash withdrawals per year in Spain.
This huge amount of information has been used to develop and validate the methodology proposed here.
Nevertheless, the transaction chains per customer are of heterogeneous length, as the individual rate of
activity is altogether variable.
The methodology assigns each active cardholder a list of localities where they have made a payment during
a given time interval. From an infrastructural and data processing point of view, this is a real challenge as
its implementation involves managing the entire set of BBVA cardholders including a large variety of
profiles, from frequent card users to occasional users.
However, focusing only on the cardholders who use their cards regularly would mean losing part of the
extra value provided by transactional data. Therefore, one of the aims is to develop a methodology that
works for each profile in the dataset. An additional objective is to make the methodology as independent
as possible of the characteristics of the geography where it is applied. In terms of the case study,
Spain covers an area of approximately 506,000 km² and is divided into 8,191 localities. The average locality
size is 62 km². Nevertheless, there is a significant variance in the distribution and size of the localities
within the country. For instance, while the locality of Caceres is the largest with more than 1,700 km², there
are 80 localities of less than 2 km².
Each time a payment is made by card, a record of the transaction is sent to the bank. This data is a powerful
tool to describe routines because every single transaction is characterized, among other factors, by place
(lat-long level), timestamp, and the economic category of the merchant where the operation takes place
(e.g. fashion, bars & restaurants, leisure, etc.), depending on the information provided by the vendor during
the POS sign-up process. These features make it possible to infer the usual environment of cardholders and
will eventually allow for the labelling of transactions as touristic or non-touristic.
The implementation of the process uses all transactions carried out during a 12-month window in order to
include seasonal behaviour. Rather than using the entire set of transactions in every execution, a twelve
(12) month period can better identify current cardholder routines so as to obtain up-to-date information to
label transactions as touristic or non-touristic. Although transactions are available from 2012 up to the
present, the examples and figures shown in the following sections correspond to the years 2015 and 2016,
since the development of the methodology has been carried out using data from those years.
The methodology has been implemented in the R programming language. Furthermore, due to the large amount
of data involved in each execution (4.8 million cardholders, 2,000 million transactions), the usage of
parallelization and optimization libraries is necessary (R Development Core Team, 2008).
3 A methodology to determine the usual environment of cardholders
The methodology is divided into four stages: 1) measuring locality importance, 2) geospatial clustering, 3)
cluster selection and 4) defining the usual environment area. Figure 1 shows the flow chart of the
methodology.
Fig. 1. Methodology flow chart.
3.1 Measuring locality importance
As mentioned earlier, routine is the key concept of the UNWTO definition of usual environment. Hence,
the first step of the algorithm is to identify which localities are the recipients of the majority of the
cardholder’s regular expenses.
The question arises regarding which transactions should be used to identify routines. The answer to this is
that two aspects need to be considered: a) the type of merchant and b) the timestamp of the transactions.
Firstly, transactions belonging to touristic or non-regular expense categories (e.g. hotel reservations or real
estate) have been excluded, as they are not part of an individual’s routine. On the other hand, cash
withdrawals are included as an accurate indicator of daily habits. Next, the remaining transactions are
divided into two groups according to date type. One group (labelled as Holiday transactions) includes
transactions made at weekends, on national holidays and during the month of August (summer holiday month
in Spain). The other group (labelled as Work transactions) includes payments made on the remaining
weekdays. This distinction seeks to give greater importance to the transactions made on working days in
order to determine the localities that are part of the cardholder’s routine. For each cardholder-locality
pairing, a numerical index is assigned using the following formula:
     
(1)
where c identifies the cardholder, l the locality and α determines the importance given to each group. In the
BBVA case study, Work transactions are considered twice as valuable as Holiday transactions in order to
delineate the usual environment; therefore α is set to 0.5. A more comprehensive analysis of the sensitivity
of this parameter remains to be explored.
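As an illustration only, the index can be sketched in Python (the authors' implementation is in R and is not reproduced here). The sketch assumes one reading that is consistent with the text: each Work transaction counts 1 and each Holiday transaction counts α. The function names, the tuple layout of the input and the sample transactions are hypothetical, and the set of national holidays must be supplied by the caller.

```python
from collections import defaultdict
from datetime import date

ALPHA = 0.5  # weight of a Holiday transaction relative to a Work transaction (case-study value)


def is_holiday(day, national_holidays=frozenset()):
    """Holiday group: weekends, national holidays and the whole month of August."""
    return day.weekday() >= 5 or day.month == 8 or day in national_holidays


def locality_index(transactions, alpha=ALPHA, national_holidays=frozenset()):
    """Return {(cardholder, locality): index} from (cardholder, locality, date) tuples.

    Assumed reading of the index: number of Work transactions plus alpha times
    the number of Holiday transactions.
    """
    index = defaultdict(float)
    for cardholder, locality, day in transactions:
        weight = alpha if is_holiday(day, national_holidays) else 1.0
        index[(cardholder, locality)] += weight
    return dict(index)


# Hypothetical transactions for one cardholder.
transactions = [
    ("c1", "Madrid", date(2016, 3, 1)),     # Tuesday, Work
    ("c1", "Madrid", date(2016, 3, 2)),     # Wednesday, Work
    ("c1", "Madrid", date(2016, 3, 5)),     # Saturday, Holiday
    ("c1", "Benidorm", date(2016, 8, 10)),  # August, Holiday
]
scores = locality_index(transactions)
```

In this toy example Madrid accumulates two Work transactions and one Holiday transaction, while Benidorm only receives a single Holiday transaction, so Madrid ranks well above it.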
3.2 Geospatial clustering
The above index could be used to rank the localities where cardholders incur their expenses and select those
with higher values. However, the UNWTO definition of usual environment does not necessarily refer to a
contiguous area. For this reason, the next step for the methodology is to apply a geographical criterion to
group the cardholder localities where at least one transaction has been made.
The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm has been selected
to carry out this process (Ester, Kriegel, Sander, & Xu, 1996). The R implementation of this algorithm has
been used (Hahsler & Piekenbrock, 2017). The main idea behind the algorithm is to find the core localities,
those surrounded by several localities that form an area of density, and to use a loop to build the clusters
around them. At each step, the localities reachable from the current members of the cluster are added.
Aside from its general approach, two characteristics of the DBSCAN algorithm were key for its selection
over other alternatives (Xu & Wunsch, 2005): 1) generation of irregular clusters, and 2) no need to specify
number of clusters. Due to the heterogeneous distribution of localities throughout the country and to the
irregular shape of the coast areas, it is important to be able to obtain irregular-shaped clusters as well.
Adding to the cluster any locality reachable from a current member of the same cluster ensures its expansion
in any given direction. The definition of reachable is therefore one of the parameters of the algorithm: ε is
the maximum distance that the algorithm uses when looking for new cluster members.
Values of ε from 20 to 60 kilometres, in steps of 5 kilometres, were tested before selecting 40 kilometres as
the optimum value for the parameter in the case study. The final decision is based on the shape of the clusters
obtained, and the distribution of the number of localities included in the usual environment of the customer.
For instance, with a 60-kilometre parameter, the clusters became so large that hardly any transactions were
considered touristic for the purpose of the domestic tourism analysis. On the other hand, when the parameter
was set to less than 40 kilometres, the resulting clusters were so small that localities on the
periphery of large cities were not considered part of the same cluster. This had to be
avoided so as not to label transactions in these localities as touristic. It should be emphasized that the value
assigned to the ε parameter depends on the characteristics of the geography being analysed. This parameter
may need a greater value when applying the algorithm to larger and/or more sparsely populated
countries.
As mentioned above, aside from the irregular shape of the areas, developing an algorithm that works for
most cardholders regardless of their behaviour poses a challenge. For this reason it is crucial to find a
clustering algorithm that does not require, as a parameter, the number of clusters to be generated. DBSCAN
meets this requirement; it only needs to know how many points an area has to contain to be considered
dense in order to discard points outside these areas (noise). In this case, this value is set to 1 for two reasons.
Firstly, there may be cardholders with activity in only one locality, meaning that fixing this parameter to a
higher value could leave them without a usual environment. Secondly, even those cardholders with activity
in more than one locality might have carried out transactions in only one of the localities of their usual
environment. As DBSCAN does not work with extra features beyond latitude and longitude, the solution
to avoid discarding isolated localities is to fix the minimum number of localities per cluster to one (1) and
then to use the locality importance index calculated above for the next step of the methodology.
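The clustering step can be sketched in plain Python, as an illustration rather than the authors' R code. When the minimum number of points is 1, every locality is a core point, so DBSCAN reduces to finding the connected components of the graph that links localities lying within ε of each other; the sketch relies on that equivalence. The use of great-circle (haversine) distance, the function names and the approximate coordinates are assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def cluster_localities(coords, eps_km=40.0):
    """Label localities with cluster ids; coords maps locality name -> (lat, lon).

    With a minimum of one point per dense area there is no noise, and each
    cluster is grown by repeatedly adding localities within eps_km of a member.
    """
    labels = {}
    next_label = 0
    for start in coords:
        if start in labels:
            continue
        labels[start] = next_label
        frontier = [start]
        while frontier:
            current = frontier.pop()
            for other in coords:
                if other not in labels and haversine_km(coords[current], coords[other]) <= eps_km:
                    labels[other] = next_label
                    frontier.append(other)
        next_label += 1
    return labels


# Approximate coordinates, for illustration only.
coords = {
    "Madrid": (40.4168, -3.7038),
    "Getafe": (40.3083, -3.7327),
    "Alcala de Henares": (40.4820, -3.3635),
    "Barcelona": (41.3874, 2.1686),
}
labels = cluster_localities(coords)
```

Here the three localities around Madrid fall into one cluster, while Barcelona forms a one-locality cluster of its own instead of being discarded as noise.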
3.3 Cluster selection
The result of applying DBSCAN is a list of clusters and localities belonging to each cardholder. The
decision not to discard isolated localities leads to the creation of one-locality clusters, with the average
number of clusters per cardholder rising to 3.6. This decision also impacts the size of clusters and creates a
high level of diversity: from high transactional cardholders whose largest cluster has more than thirty
localities, to cardholders with one single cluster made up of one locality. The average size of the clusters is
3.8 localities.
The next step is to decide which clusters make up the usual environment of each
cardholder. This choice is based on the value of the indexes calculated above, which are a function of the
number of payments accumulated by each cluster. As a measure of their importance in the cardholder’s
usual behaviour, every cluster is assigned an index calculated as the sum of the indexes of the cluster’s
localities. Therefore, the cluster with the highest index represents the area where cardholders, in the
main, follow their routines.
That said, the UNWTO definition specifies that separated areas could be part of an individual’s usual
environment. Thus the methodology must not only select the cluster with the highest index value, but also
accept those clusters with a high relative importance in the cardholder’s routine. An analysis of the
distribution of the percentages of transactions among the clusters shows that there are three principal
scenarios: 1) cardholders with only one cluster which collects their entire transactional footprint, 2)
cardholders with their expenditure split between two clusters with similar percentages (e.g. 60%-40%), and 3)
the remaining cardholders, who have multiple clusters but only one cluster that collects more than 75% of the
transactions. The following formula was created to determine where to set the limit between the clusters
belonging to the usual environment, and the clusters that are excluded from the environment due to a sudden
decrease in registered activity:
$$\mathrm{cluster\_pct}_i \geq \tfrac{2}{3}\,\mathrm{max\_pct} \qquad (2)$$
where max_pct is the percentage represented by the cluster with the highest index value and cluster_pct_i is
the percentage represented by cluster i.
The formula is based on the percentages represented by each cluster and uses the highest percentage as a
reference. According to this mathematical expression, every cluster i representing at least two thirds of the
percentage of the highest ranked cluster has been selected to be part of the usual environment.
For example, if the highest ranked cluster represents 60% of the cardholder’s usual behaviour, a second
cluster is included only if it accounts for the whole of the remaining 40%, since two thirds of 60% is 40%.
On the other hand, when the highest cluster represents 40%, every other cluster representing at least about
26.7% of the behaviour (two thirds of 40%) is included in the cardholder’s usual environment.
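The two-thirds rule described above can be sketched as follows. This is a minimal Python illustration, not the authors' code; the function name and the sample values are hypothetical. Because every cluster's percentage shares the same denominator, comparing raw cluster indexes gives the same result as comparing percentages, and the comparison is written as 3 × value ≥ 2 × maximum to avoid rounding problems with the fraction 2/3.

```python
def select_clusters(cluster_index):
    """Return the ids of the clusters kept in the usual environment.

    cluster_index maps cluster id -> summed locality index (or percentage).
    A cluster is kept when it represents at least two thirds of the value of
    the highest ranked cluster.
    """
    if not cluster_index:
        return set()
    top = max(cluster_index.values())
    return {cid for cid, value in cluster_index.items() if 3 * value >= 2 * top}


# Hypothetical cardholders matching the three scenarios in the text.
two_way = select_clusters({"A": 60, "B": 40})                   # 60%-40% split
dominant = select_clusters({"A": 80, "B": 15, "C": 5})          # one dominant cluster
spread = select_clusters({"A": 40, "B": 30, "C": 20, "D": 10})  # top cluster at 40%
```

In the 60%-40% case both clusters are retained, in the dominant case only the top cluster survives, and in the last case only the cluster at 30% joins the top one because 20% is below two thirds of 40%.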
3.4 Usual environment area
At this point, a list of localities is associated with each cardholder. However, as already stated, the main
objective here is to determine which transactions are touristic (made outside the usual environment).
Therefore, the algorithm needs to consider that cardholders may not have made a transaction in each locality
of their usual environment. Nevertheless, this does not mean that the cardholder will not at some stage
travel to different points on the map (covering the same distance) to incur a regular expense. As such, this
type of transaction should not be considered touristic. A further step has been included in the algorithm to
ensure that transactions carried out in the area of influence of the localities in the current list are not
considered touristic.
For this purpose, the centroid is calculated for each cluster selected to be part of the usual environment.
These points are set as the centres of different circular areas that, together, make up the usual environment
of each cardholder. The maximum distance between each centroid and the localities included in its cluster
is called d. As 40 kilometres was found to be the optimum value for ε, the DBSCAN distance
parameter, if d is less than 40 kilometres then the radius r used to create the circular area is set to 40
kilometres. If d is greater than 40 kilometres then r is set to d + 5 kilometres. Figure 2 shows an example of
both cases. Thus, a heuristic error margin is added to avoid considering routine transactions as touristic.
Every locality whose centroid is located within the circular area is considered part of the cardholder’s usual
environment, even if no transactions were registered there. Usual environments may be non-contiguous when
made up of clusters set at larger distances from one another.
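The construction of the circular areas can be sketched in Python under a few stated assumptions: the cluster centroid is taken as the arithmetic mean of the latitudes and longitudes of its localities, distances are great-circle distances, and the case d = 40 km, which the text does not cover, is treated like d < 40 km. The function names and coordinates are hypothetical and only illustrate the rule for r.

```python
import math

EARTH_RADIUS_KM = 6371.0
EPS_KM = 40.0    # DBSCAN distance parameter chosen in the case study
MARGIN_KM = 5.0  # heuristic margin added when the cluster extends beyond eps


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def cluster_circle(points, eps_km=EPS_KM, margin_km=MARGIN_KM):
    """Return (centroid, radius_km) for one selected cluster of (lat, lon) points."""
    centroid = (sum(p[0] for p in points) / len(points),
                sum(p[1] for p in points) / len(points))
    d = max(haversine_km(centroid, p) for p in points)
    # Assumption: d equal to eps is handled like d below eps.
    radius = eps_km if d <= eps_km else d + margin_km
    return centroid, radius


def in_usual_environment(locality, circles):
    """True when the locality's point lies inside at least one circular area."""
    return any(haversine_km(center, locality) <= radius for center, radius in circles)


# Compact cluster around Madrid (approximate coordinates): radius stays at 40 km.
compact = cluster_circle([(40.4168, -3.7038), (40.3083, -3.7327), (40.4820, -3.3635)])
# Elongated cluster, roughly 100 km from end to end: radius becomes d + 5 km.
elongated = cluster_circle([(40.0, -3.0), (40.3, -3.0), (40.6, -3.0), (40.9, -3.0)])

mostoles_inside = in_usual_environment((40.3223, -3.8650), [compact])
toledo_inside = in_usual_environment((39.8628, -4.0273), [compact])
```

A locality close to the compact cluster is covered even without any recorded transaction there, whereas a locality about 60 km from the centroid is left outside and its transactions would be labelled touristic.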
4 Analytical Results
The following procedure has been applied as an approach to validate the methodology results. Firstly, a
comparison is made with the cardholder’s given address. Although many of these addresses are not current,
the percentage of usual environments that include the given address of the cardholder provides an estimation
of the performance of the algorithm. Thus, 92% of more than 4.8 million cardholders whose usual
environment was calculated for 2016 have their stated locality included in the resulting list of localities.
Furthermore, focusing on the cardholders who have changed address is also an effective way to see how
the algorithm is working. In fact, cardholders who move to a different province are the most reliable set
because the new locality is unlikely to be included in their former usual environment. In
2016 almost 40,0000 cardholders informed the bank of a change of address, of which 94% had the new
locality included in their usual environment.
Fig. 2. Examples of usual environment areas: d < 40 km. (left); d > 40 km. (right).
A further analysis to help validate the methodology is to calculate the percentage of the transactions used as
input that fall within the resulting usual environments. This value shows the percentage of the transactions
considered part of the cardholder’s routine. In this case study, the percentage has been obtained for each
cardholder, with a mean value of 93% and a median of 97%. These values show that the methodology
identifies most transactions as regular ones.
4.1 Distribution of the number of localities in the usual environment
In addition to these values, a study was made of the distribution of the area included in the cardholder’s
usual environment (Figure 3). Due to the heterogeneous area of localities across Spain, together with the
country’s population distribution, a non-uniform transaction frequency distribution was expected. Even so,
some peaks will still most likely be observed due to the high volume of people living in large cities. In
addition, a comparison between the data from 2015 and 2016 can help confirm the consistency of the
algorithm.
The figure shows that a significant majority of cardholders have a usual environment of between 2,500 and
5,000 km². Both graphs are similar, with a peak at around 3,000 km², a valley at around 4,500 km²,
and a further peak at 5,000 km². The second peak is particularly noteworthy and merits looking into.
The 5,000 km² peak corresponds to those cardholders living in Madrid. Cardholders who have
made transactions exclusively in Madrid city will share the same usual environment. This includes all the
localities covered by the circular area of 40 kilometres radius (area = π × 40² ≈ 5,027 km²) from the most
central point of the city centre.
The most important difference between the two graphs is the peak level at value 3,000. It corresponds to
cardholders who have made transactions only in the city of Barcelona during the 12-month window. The
reason for such a difference in the number of cardholders is the acquisition of Catalunya Caixa by BBVA
in September 2016 (European Commision, 2014). From this moment on, all Catalunya Caixa cardholders
were included in BBVA datasets. Thus, the number of cardholders located in Catalonia processed by the
usual environment procedure in 2016 was much greater than in previous years.
Fig. 3. Distribution of the area of the cardholders' usual environment. Years 2015 and 2016.
4.2 Number of cardholders belonging to each locality
To add to these results, the number of cardholders whose usual environment includes each locality has been
obtained. Although BBVA’s market share differs in every region, the most populated cities of the country
should be included in the usual environment of a number of cardholders proportionate to the general
population. To illustrate this point, Figure 4 assigns a darker hue to the localities with a higher number of
cardholders. This map shows the results from year 2016 using the locality divisions.
As expected, the darkest areas are those around Madrid and Barcelona. Moreover, the rest of the country's
large cities can also be identified on the map: Valencia and Alicante on the Mediterranean Coast; Malaga,
Seville and Cordoba in Andalusia; and Bilbao and Zaragoza in the north. Also of note is the large area in
white between Madrid and Barcelona, and the union of the coastal provinces of Galicia (Pontevedra and La
Coruña).
A comparison was made between the areas of influence of the two largest cities of the country. Figures 5
and 6 show the number of cardholders whose usual environment includes each locality in addition to Madrid
or Barcelona.
Fig. 4. Number of cardholders whose usual environment includes the locality. Year 2016.
Fig. 5. Number of cardholders whose usual environment includes Madrid and the locality. Year 2016.
Fig. 6. Number of cardholders whose usual environment includes Barcelona and the locality. Year
2016.
In both cases, the maps show the significant influence these cities have on their immediate surroundings.
Barcelona’s area of influence spreads mainly through Catalonia, and even includes part of Aragon, while
the city of Madrid enjoys a big influence not only in the entire province of Madrid, but also in the adjacent
provinces (Toledo, Guadalajara, Avila, Segovia and Cuenca).
The major difference is observed when analysing cardholders whose usual environments include one of the
two major cities in addition to other large cities. Figure 5 shows how Madrid shares cardholders with
almost every sizeable city in the country, while Figure 6 shows that Barcelona only shares cardholders with
Madrid, Valencia and Bilbao. Aside from the central geographical position of Madrid, the reason behind
these differences is the excellent quality of transport links (both by land and air) between these points and
Madrid, which makes commuting much easier. It is also noteworthy that most headquarters of international
companies are located in Madrid, which increases the capital’s influence throughout the country.
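The counts behind Figures 4 to 6 can be illustrated with a minimal sketch. It assumes that each cardholder's usual environment is already available as a set of locality names. The dictionary `usual_env`, the cardholder identifiers and the function names are hypothetical and are not taken from the BBVA pipeline.

```python
from collections import Counter

# Hypothetical input: cardholder id -> localities in the usual environment.
usual_env = {
    "c1": {"Madrid", "Toledo"},
    "c2": {"Madrid", "Valencia"},
    "c3": {"Barcelona", "Girona"},
    "c4": {"Barcelona", "Madrid"},
    "c5": {"Sevilla"},
}

def cardholders_per_locality(envs):
    """Number of cardholders whose usual environment includes each locality."""
    counts = Counter()
    for localities in envs.values():
        counts.update(localities)
    return counts

def shared_with(envs, anchor):
    """Number of cardholders whose usual environment includes `anchor`
    together with each other locality."""
    counts = Counter()
    for localities in envs.values():
        if anchor in localities:
            counts.update(localities - {anchor})
    return counts

per_locality = cardholders_per_locality(usual_env)
with_madrid = shared_with(usual_env, "Madrid")
with_barcelona = shared_with(usual_env, "Barcelona")
```

Mapping `per_locality` to a colour scale would give a map in the style of Figure 4, while `with_madrid` and `with_barcelona` correspond to the quantities plotted in Figures 5 and 6.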
5 Use cases
Since its development, this methodology has already been applied by BBVA for different projects.
Internally, its results have been used in projects such as second home detection or client segmentation.
Furthermore, this methodology has been the cornerstone for some joint projects with the administration. In
this section, two such projects are presented as examples: 1) the replication of an official Spanish domestic
tourism index and 2) the collaboration with the Mexican Secretariat of Tourism (SECTUR) to measure the
impact of domestic tourism in eleven touristic corridors.
5.1 Use case: replicating Spanish domestic tourism index
The Spanish Statistical Office (INE) publishes the Residents Travel Survey (ETR, formerly known as
FAMILITUR). This publication uses surveys to estimate, among other things, the total monthly expenditure
incurred by domestic tourists in Spain.
The survey is published on a quarterly basis, with every issue containing data relating to the previous
3-month period. For instance, readers need to wait until the end of June to obtain tourism data for January,
February and March. This time gap could, however, be filled by using digital sources such as transactional
data. Indeed, the immediacy provided by transaction records can add to the robustness of classical
methodologies.
Consequently, replicating this index with the new digital source was chosen as one of the use cases of the
usual environment methodology. The main objective was to test the validity of transactional data and the
methodology in replicating the evolution of the domestic tourism expenditure index. In particular, the focus
was on yearly and monthly percentage variations rather than the total amount registered. The latter would
have required the development of an extrapolation methodology, which is a complex task in itself.
Furthermore, to avoid waiting for new issues of the index to be published, indexes from previous years
(January 2014 - September 2016) were employed for this particular use case.
Accordingly, the first test was to ascertain whether a correlation existed between the official statistics and
the amount of money spent by the entire set of BBVA cardholders outside their usual environment. For this
purpose, the usual environment of every BBVA cardholder was calculated for the years 2014, 2015 and
2016 (to September). However, the unequal distribution of BBVA cardholders throughout the country (with
overrepresentation in large cities), together with the increase in the number of cardholders during this
period, led to conclusions that differed from those of the official statistics. For this reason, and in order to
carry out a comparison, the main challenge was to select and maintain a sample of cardholders based on
the principles of the official surveys (Instituto Nacional de Estadística, 2015).
Subsequently, the geographical distribution issue was approached at province level (in Spain there are 50
provinces and two self-governing cities). Since the number of cardholders was considerably larger than the
official surveys sample size (approx. 127,000 surveys per year), a linear optimization problem was defined
to obtain the largest cardholder sample in which every Spanish province has the same proportion of
cardholders as it has of the population. The proportion represented by each province and year was obtained
from official sources (Instituto Nacional de Estadística). On the BBVA side, cardholders were considered
members of each of the provinces to which the centroids of their usual environment clusters belonged. For
example, a cardholder with a disjoint usual environment distributed between Madrid and Barcelona
(resulting from a selection of two clusters in the third step of the methodology) could appear twice in the
final sample, once per province. The following optimization problem was considered:


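$$\max \sum_{p} \mathit{sample\_size}_p$$

subject to, for each province $i$,

$$\frac{\mathit{sample\_size}_i}{\sum_{p} \mathit{sample\_size}_p} = \frac{\mathit{population}_i}{\sum_{p} \mathit{population}_p} \qquad (3)$$

where $\mathit{sample\_size}_p$ is the optimum size of the sample of cardholders belonging to province $p$ and
$\mathit{population}_i$ is the real population of province $i$.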
The solution of the problem resulted in an optimum sample size of more than 1.3 million cardholders. This
sample size remained bigger than the official surveys sample size and ensured the proportional
representation of the 50 provinces.
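As an illustration of the sizing problem described above, the following sketch computes the largest sample whose provincial shares match the population shares. It assumes, in addition to what is stated in the text, that each province's sample cannot exceed the number of cardholders available there. Under that assumption the problem has a closed-form solution, so no solver is needed. The function name and the figures in the example are hypothetical.

```python
import math
from fractions import Fraction

def proportional_sample_sizes(population, available):
    """Largest per-province sample sizes proportional to population.

    population: province -> inhabitants
    available:  province -> cardholders available (assumed upper bound)
    """
    total_pop = sum(population.values())
    # The total sample is limited by the province whose available
    # cardholders run out first relative to its population share.
    total = min(
        Fraction(available[p] * total_pop, population[p]) for p in population
    )
    # Rounding down keeps every province within its availability;
    # proportions are then exact up to one cardholder per province.
    return {
        p: math.floor(total * population[p] / total_pop) for p in population
    }

# Hypothetical figures for three provinces.
population = {"A": 6_000_000, "B": 3_000_000, "C": 1_000_000}
available = {"A": 900_000, "B": 150_000, "C": 80_000}
sizes = proportional_sample_sizes(population, available)
```

In this example province B is the binding one: all of its available cardholders are used, and the other provinces are sized relative to it.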
At this point, for some provinces the size required in the sample was considerably smaller than the whole
set of available cardholders. This raised the question of how the sampling methodology should work. There
were a variety of options in this regard, ranging from a random selection of the province representatives
to applying additional information to build a more solid sample.
The first attempt was to use the age and gender of the cardholders in order to refine the sample and match
the distribution of the population. Thus, another optimization problem was posed. In this case, the
constraints of the problem concerned not only the origin of the cardholders, but also their age and gender.
Hence, the resulting sample would have been a better representation of the population in Spain.
However, the fact that some sociodemographic profiles have a limited digital footprint reduced
the optimum sample size to fewer than 500 cardholders. This sample size was considered too small to be
representative. For this reason, the decision was to keep the geographically fixed sample size and to select
the cardholders from each province at random.
Initial testing with this approach highlighted a problem: a random selection of cardholders from the most
populated provinces (Madrid, Barcelona, etc.) meant that there were higher probabilities of choosing a
cardholder with no touristic activity at all. While the number of cardholders in these provinces is higher, so
is the percentage of cardholders with only one transaction. To solve this, the decision was made to fill each
province sample with its most transactional cardholders (both inside and outside their usual environments).
As a result, all the samples followed similar standard transaction distribution patterns with minor
differences in their mean values (around 90 transactions per year).
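The selection rule can be sketched as follows, assuming yearly transaction counts per cardholder and a single province per cardholder. This is a simplification, since in the paper a cardholder with a disjoint usual environment may belong to several provinces. All names and figures are hypothetical.

```python
from collections import defaultdict

def most_transactional_sample(tx_counts, province_of, sample_sizes):
    """Fill each province's sample with its most transactional cardholders.

    tx_counts:    cardholder -> number of transactions in the year
    province_of:  cardholder -> province (simplifying assumption: one each)
    sample_sizes: province -> number of cardholders to select
    """
    by_province = defaultdict(list)
    for cardholder, n_tx in tx_counts.items():
        by_province[province_of[cardholder]].append((n_tx, cardholder))

    sample = {}
    for province, size in sample_sizes.items():
        # Highest transaction count first; ties broken by identifier
        # so that the selection is reproducible.
        ranked = sorted(by_province[province], key=lambda t: (-t[0], t[1]))
        sample[province] = [cardholder for _, cardholder in ranked[:size]]
    return sample

tx_counts = {"a": 120, "b": 1, "c": 85, "d": 40, "e": 95}
province_of = {"a": "Madrid", "b": "Madrid", "c": "Madrid",
               "d": "Sevilla", "e": "Sevilla"}
sample = most_transactional_sample(
    tx_counts, province_of, {"Madrid": 2, "Sevilla": 1}
)
```

Running the same function on each year's transaction counts would give a yearly resampling under the same rule.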
Once the original sample was ready, the next question to arise was the resampling methodology. Every year
(in fact, daily) there are new BBVA cardholders. At the same time there are BBVA cardholders who stop
using their debit/credit cards. The decision was made easier because, as already mentioned, the aim of the
use case was to analyse a full 12-month period. Instead of filling a new sample month by month (the best
option for a monthly updated index), a new sample is selected for the year in question using the same rule:
each year the province sample consists of those cardholders with the highest number of transactions carried
out during that given year.
Two time series were built using the transactions made by the cardholders belonging to these samples
outside their usual environment. One of these shows the month-to-month variation (comparing one month
to the previous month), while the other represents the interannual variation (comparing a given month to
the same month of the previous year). Figures 7 and 8 show the comparison between the time series
representing the variation in the number of trips of Spanish nationals within the country, but outside their
usual environment. The information was obtained from the official publications and from the series
generated using BBVA data. In both figures, the BBVA series are similar to the official series.
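The two variation series can be computed as in the sketch below. It assumes a chronologically sorted monthly series without gaps. The values are invented for illustration and are not BBVA or ETR figures.

```python
def month_to_month(series):
    """Percentage change of each month with respect to the previous month.

    series: list of (month_label, value), sorted, with no missing months.
    """
    return {
        series[i][0]: 100.0 * (series[i][1] - series[i - 1][1]) / series[i - 1][1]
        for i in range(1, len(series))
    }

def interannual(series):
    """Percentage change of each month with respect to the same month
    of the previous year (twelve positions earlier)."""
    return {
        series[i][0]: 100.0 * (series[i][1] - series[i - 12][1]) / series[i - 12][1]
        for i in range(12, len(series))
    }

# Hypothetical monthly totals spent outside the usual environment.
months = [f"{y}-{m:02d}" for y in (2015, 2016) for m in range(1, 13)][:14]
values = [100, 90, 120, 130, 125, 150, 200, 220, 140, 110, 95, 115, 110, 99]
series = list(zip(months, values))

mom = month_to_month(series)
yoy = interannual(series)
```

Since both series are ratios, they can be compared with the official index without extrapolating the sample to the whole population.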
Fig. 7. Comparison between ETR and BBVA series: interannual variation.
Fig. 8. Comparison between ETR and BBVA series: month-to-month variation.
5.2 Use case: measuring domestic tourism in Mexico
BBVA also has a presence in North and South America. In particular, BBVA Bancomer is one of the most
important banks in Mexico with more than 11.5 million cardholders. This level of market share provides a
broad vision of the economic evolution of the country and lends itself to the same type of analysis presented
here.
Together with the Mexican Secretariat of Tourism (SECTUR), the impact of tourism in eleven corridors
and in the entire set of municipalities labeled as “Pueblos Mágicos” was measured using transactional data.
This analysis contained information regarding both international tourists and the domestic market. In order
to determine whether the transactions were carried out by visitors or residents, the methodology explained
in Section 3 was applied. In this case, instead of using localities, cardholders were associated with the set
of AGEBs (the basic geostatistical area used in Mexico) belonging to their usual environment. This decision
led to an analysis of the optimum value of the parameter ε. Since Mexican AGEBs are smaller than Spanish
localities, the DBSCAN ε was reduced to 20 kilometres, and this value was also used as the
standard radius in the final step of the methodology. Subsequently, the same tests illustrated earlier were
applied to validate this value, with a positive result. As mentioned in the abstract, this was the only change
required in order to apply the methodology to a different country.
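The effect of ε can be illustrated with a minimal, unweighted DBSCAN over great-circle distances. This is only a sketch: it ignores the locality importance used in the methodology, the `min_pts` value is an assumption, and the coordinates are approximate city locations chosen for illustration.

```python
import math
from collections import deque

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def dbscan(points, eps_km, min_pts=2):
    """Plain DBSCAN; returns one label per point, -1 meaning noise."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j in range(len(points))
                if haversine_km(points[i], points[j]) <= eps_km]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # provisional noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)
    return labels

# Approximate coordinates: Madrid, Alcalá de Henares, Barcelona, Badalona.
points = [(40.4168, -3.7038), (40.4818, -3.3643),
          (41.3874, 2.1686), (41.4500, 2.2474)]
labels_40 = dbscan(points, eps_km=40)
labels_20 = dbscan(points, eps_km=20)
```

With ε = 40 km the two points around Madrid fall in the same cluster, whereas with ε = 20 km they are treated as noise. This shows why the value has to be tuned to the size of the territorial units of each country.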
In this use case, no official statistics were available for comparison. For this reason, the entire set of
cardholders was used to calculate the statistics shown in the reports. The use of the usual environment
enabled not only the differentiation between residents and domestic tourists, but also the sociodemographic
characterization of the latter. In addition, it enabled the comparison between municipalities and of the
impact of each type of tourism. For example, considerable differences emerged in the entire set of Pueblos
Mágicos when analysing the type of merchant where each profile spent more money: while international
tourists spent more than one third of their money on entertainment activities, this category only represented
3% of the expenditure of domestic tourists.
A report for each corridor was published on the SECTUR webpage (Secretaría de Turismo de México,
2016). Furthermore, an interactive dashboard containing the information of the Pueblos Mágicos analysis
is available (BBVA Data & Analytics, 2016). Figure 9 shows an example of the type of data included in
this dashboard. In the upper-left corner there is a map showing the amount of money spent by both domestic
and international tourists during 2015 in each Pueblo Mágico. In the upper-right corner there is a pie chart
showing the distribution of expenditure between domestic tourism, international tourism and locals (those
whose usual environment includes the locality). Furthermore, the distribution of the age and gender of the
domestic tourists is shown. The tree map charts in the lower part of the figure show the category distribution
of the transactions carried out by international tourists (left) and domestic tourists (right). Thus, the users
of the dashboard can identify what kind of visitors each destination attracts and what they are looking
for.
Fig. 9. Example of interactive dashboard showing touristic data from Pueblos Mágicos.
6 Conclusions and further work
So far, most of the official publications focused on tourism evolution use surveys as their primary data
source. This has a negative impact on both the publication frequency and the delay between the period
studied and the time of publication. In addition, the level of granularity (geographic divisions, time
periods...) of the statistics could be improved, since most of them work at country level. On the other hand,
the methodologies used in these publications are considered robust and their results are already used as
standards.
Big Data from digital sources has successfully been used in previous works to measure and monitor tourism
activity. In particular, some effective examples can be found using data from photo sharing platforms and
social networks such as Flickr, or from mobile networks. These examples focus on the number of tourists
visiting a destination. However, using these sources, it is difficult to estimate how much money tourists
spend. In this context, the objective of this paper is to provide a methodology to enable the usage of card
transactions data to analyse domestic tourism.
The main question that arises is how to distinguish between tourists and residents. The proposed
methodology uses the definition of usual environment proposed by UNWTO to answer this question. Thus,
the evolution of domestic tourism can be studied using the transactions made outside the cardholders’ usual
environments.
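As a simplified illustration of this idea, the sketch below labels a transaction as touristic when its locality is not part of the cardholder's usual environment, and then aggregates the touristic expenditure by month. The record layout, the field names and the example values are assumptions made for this sketch only.

```python
from collections import defaultdict

def touristic_spend_by_month(transactions, usual_env):
    """Sum the amounts of transactions made outside the usual environment.

    transactions: iterable of dicts with keys
                  'cardholder', 'locality', 'month' and 'amount'
    usual_env:    cardholder -> set of localities in the usual environment
    """
    totals = defaultdict(float)
    for tx in transactions:
        env = usual_env.get(tx["cardholder"])
        if env is None:
            # Without a usual environment the transaction cannot be classified.
            continue
        if tx["locality"] not in env:
            totals[tx["month"]] += tx["amount"]
    return dict(totals)

usual_env = {"c1": {"Madrid", "Toledo"}, "c2": {"Barcelona"}}
transactions = [
    {"cardholder": "c1", "locality": "Madrid", "month": "2016-07", "amount": 30.0},
    {"cardholder": "c1", "locality": "Cádiz", "month": "2016-07", "amount": 50.0},
    {"cardholder": "c2", "locality": "Valencia", "month": "2016-07", "amount": 20.0},
    {"cardholder": "c2", "locality": "Valencia", "month": "2016-08", "amount": 40.0},
    {"cardholder": "c3", "locality": "Bilbao", "month": "2016-08", "amount": 99.0},
]
spend = touristic_spend_by_month(transactions, usual_env)
```

Here the purchase in Madrid by cardholder `c1` is treated as routine, and the transaction by `c3`, who has no usual environment, is left out of the totals.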
The methodology is divided into four steps: locality importance; geospatial clustering; cluster selection;
and establishment of a usual environment. The value of the DBSCAN ε parameter in the second step is the
most important as it is completely dependent on how each country is divided. After tests were carried out
focusing on the unique characteristics of a given area, a distance of 40 kilometres was selected as the
optimum value for the Spanish use case.
This new approach provides new opportunities and complementary ways to analyse tourism trends and
patterns. The strength of this kind of data source lies in its near real-time availability and the great level of
detail of the merchant location (latitude and longitude). Thus, the frequency and quality of the information
provided by official publications could be improved not only with the creation of early indexes but also
with the analysis of the evolution at city, or even neighbourhood, level. The integration of different digital
data sources and methodologies is left for future research.
The use case presented in Section 5.2 showed an example of how current statistics based on surveys can be
enhanced with data deriving from this new source. BBVA has already used the usual environment
methodology in the framework of collaboration with the Mexican Secretariat of Tourism (SECTUR). The
project’s main objective was to measure the impact of tourism in eleven Mexican tourist corridors and in
the entire set of municipalities belonging to the “Pueblos Mágicos” programme. Using this methodology
made it possible to analyse the different consumption patterns between tourists (both domestic and
international) and residents. It should be noted that individual privacy rights (as per European Union laws)
are guaranteed, as the data is anonymized. Furthermore, only aggregated data is shown on the dashboards
and analyses.
However, there are some issues that must be taken into account. The sampling methodology used in the
surveys ensures that the results can be extrapolated to the whole population. When using bank data, only
a percentage of the whole set of transactions (depending on the market share) is available unless the
datasets of multiple banks are aggregated. For this reason, it is necessary to show the legitimacy of the
sample and, if a final product is to be built, to develop an extrapolation technique to obtain useful results.
The following results were obtained in the Spanish case study presented: 92% of more than 4.8 million
cardholders with a usual environment for 2016 have their given locality included in the resulting list of
localities. The mean percentage of input transactions considered part of the cardholder’s routine is 93%
(the median value is 97%). Also, further analyses were performed considering the localities included in the
cardholder’s usual environment. The distribution of the number of cities per cardholder and the cardholder
distribution per city show significant similarities with the distribution of the country’s population. In
addition, in Section 5.1, an analysis of the correlation between the evolution of Spanish official indexes and
BBVA series is provided as proof of the validity of the results. The creation of an extrapolation technique is
left as further work.
7 Acknowledgements
The authors would like to thank the two main institutions whose partnership made this paper possible:
Banco Bilbao Vizcaya Argentaria (BBVA, a global financial corporation) for providing the dataset for this
research. Special thanks to Elena Alfaro Martinez, Jon Ander Beracoechea and Fabien Girardin for their
support to this applied research project.
Exceltur (Alliance for Excellency in Tourism, a non-profit group formed by the Chairmen of the 23 leading
Spanish tourist groups), and especially Eva Hurtado and Óscar Perelli, for providing useful insights and
stimulating discussions around current gaps in tourism intelligence, and inspiring suggestions about the
methodology followed in this research.
References
Baggio, R. (2019). Measuring Tourism: Methods, Indicators, and Needs. In The Future of Tourism (pp.
255-269). Springer.
BBVA Data & Analytics. (2016). Tableau Public. https://public.tableau.com/profile/bbva.data.analytics
Bodas, D. J., García, J., Murillo Arias, J., Pacce, M., Rodrigo, T., Ruiz de Aguirre, P., et al. (2018).
Measuring Retail Trade Using Card Transactional Data. Working paper .
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters
in Large Spatial Databases with Noise. In Proceedings of the Second International Conference on
Knowledge Discovery and Data Mining (KDD-96) (pp. 226-231). AAAI Press.
European Commission. (2014). Restructuring of Catalunya Banc S.A. through its acquisition by BBVA.
Brussels.
Eurostat. (2014). Feasibility Study on the Use of Mobile Positioning Data for Tourism Statistics.
González, M. C., Hidalgo, C. A., & Barabási, A.-L. (June 2008). Understanding individual human
mobility patterns. Nature, 453, 779-782.
Hahsler, M., & Piekenbrock, M. (March 2017). Density Based Clustering of Applications with Noise
(DBSCAN) and Related Algorithms.
Heerschap, N., Ortega, S., Priem, A., & Offermans, M. (2014). Innovation of tourism statistics through
the use of new big data sources. (N. Statistics, Ed.)
Instituto Nacional de Estadística. (July 2015). Metodología Encuesta de Turismo de Residentes
(ETR/FAMILITUR).
Instituto Nacional de Estadística. (n.d.). Página oficial Instituto Nacional de Estadística. From Official
Population Figures referring to revision of Municipal Register 1 January:
http://www.ine.es/jaxiT3/Tabla.htm?t=2852&L=1
Koerbitz, W., Önder, I., & Hubmann-Haidvogel, A. C. (2013). Identifying Tourist Dispersion in Austria
by Digital Footprints. In Information and Communication Technologies in Tourism (pp. 495-506).
Berlin: Springer.
Kozak, M., & Rimmington, M. (2000). Tourist Satisfaction with Mallorca, Spain, as an Off-Season
Holiday Destination. Journal of Travel Research , 38.
R Development Core Team. (2008). R: A language and environment for statistical computing. Vienna,
Austria: R Foundation for Statistical Computing.
Secretaría de Turismo de México. (2016). Colaboración sobre Big Data y Turismo. Obtained from
http://www.datatur.sectur.gob.mx:81/Reportes/bigdata/bigdata.htm
Sobolevsky, S., Bojic, I., Belyi, A., Sitko, I., Hawelka, B., Murillo Arias, J. et al. (2015). Scaling of City
Attractiveness for Foreign Visitors through Big Data of Human Economical and Social Media
Activity. Proceedings - 2015 IEEE International Congress on Big Data.
Tkacz, G., & Galbraith, J. W. (2013). Nowcasting GDP: Electronic Payments, Data Vintages and the
Timing of Data Releases. CIRANO Working Papers .
UNWTO. (2010). International Recommendations for Tourism Statistics 2008. United Nations,
Department of Economic and Social Affairs, Statistics Division. New York: United Nations
Publications.
World Travel & Tourism Council (WTTC). (2018). Travel & Tourism Economic Impact 2018 Spain.
Obtained from https://www.wttc.org/-/media/files/reports/economic-impact-research/countries-2018/spain2018.pdf
Xu, R., & Wunsch, D. C. (2005). Survey of clustering algorithms. IEEE Transactions on Neural
Networks , 16, 645-678.