Assessing Privacy Risk in Retail Data
Roberto Pellungrini1, Francesca Pratesi1,2, and Luca Pappalardo1,2
1Department of Computer Science, University of Pisa, Italy
2ISTI-CNR, Pisa, Italy
Abstract. Retail data are one of the most requested commodities by
commercial companies. Unfortunately, from this data it is possible to
retrieve highly sensitive information about individuals. Thus, there exists
the need for accurate individual privacy risk evaluation. In this paper, we
propose a methodology for assessing privacy risk in retail data. We define
the data formats for representing retail data, the privacy framework for
calculating privacy risk and some possible privacy attacks for this kind
of data. We perform experiments in a real-world retail dataset, and show
the distribution of privacy risk for the various attacks.
1 Introduction
Retail data are a fundamental tool for commercial companies, as they can rely
on data analysis to maximize their profit [7] and take care of their customers by
designing proper recommendation systems [11]. Unfortunately, retail data are
also very sensitive since a malicious third party might use them to violate an
individual’s privacy and infer personal information. An adversary can re-identify
an individual from a portion of data and discover her complete purchase history,
potentially revealing sensitive information about the subject. For example, if an
individual buys only fatty meat and precooked meals, an adversary may infer a
risk of cardiovascular disease [4]. In order to prevent these issues,
researchers have developed privacy preserving methodologies, in particular to
extract association rules from retail data [3,10,5]. At the same time, frameworks
for the management and the evaluation of privacy risk have been developed for
various types of data [1,13,2,9,8].
We propose a privacy risk assessment framework for retail data which is based
on our previous work on human mobility data [9]. We first introduce a set of data
structures to represent retail data and then present two re-identification attacks
based on these data structures. Finally, we simulate these attacks on a real-world
retail dataset. The simulation of re-identification attacks allows the data owner
to identify the individuals with the highest privacy risk and select a suitable
privacy-preserving technique to mitigate the risk, such as k-anonymity [12].
The rest of the paper is organized as follows. In Section 2, we present the data
structures which describe retail data. In Section 3, we define the privacy risk and
the re-identification attacks. Section 4 shows the results of our experiments and,
finally, Section 5 concludes the paper, proposing some possible future works.
2 Data Definitions
Retail data are generally collected by retail companies in an automatic way:
customers enlist in membership programs and, by means of a loyalty card, share
information about their purchases while at the same time receiving special offers
and bonus gifts. Products purchased by customers are grouped into baskets. A
basket contains all the goods purchased by a customer in a single shopping session.

Definition 1 (Shopping Basket). A shopping basket S^u_j of an individual u
is a list of products S^u_j = {i_1, i_2, ..., i_n}, where i_h (h = 1, ..., n) is an item
purchased by u during her j-th purchase.
The sequence of an individual's baskets forms her shopping history related
to a certain period of observation:

Definition 2 (History of Shopping Baskets). The history of shopping baskets
HS_u of an individual u is a time-ordered sequence of shopping baskets
HS_u = ⟨S^u_1, ..., S^u_m⟩.
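The two definitions above map naturally onto nested containers. A minimal sketch in Python (the type aliases and variable names are ours, for illustration only):

```python
from typing import List, Set

# A shopping basket S^u_j is modeled as a set of items (Definition 1);
# a history HS_u is a time-ordered list of baskets (Definition 2).
ShoppingBasket = Set[str]
History = List[ShoppingBasket]

# Example: the history of one individual u over two shopping sessions,
# with items already generalized to category level.
hs_u: History = [
    {"eggs", "milk", "flour"},  # S^u_1
    {"milk", "yogurt"},         # S^u_2
]
```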
3 Privacy Risk Assessment Model
In this paper we start from the framework proposed in [9] and extended in [8],
which allows for the assessment of the privacy risk in human mobility data.
The framework requires the identification of the minimum data structure, the
definition of a set of possible attacks that a malicious adversary might conduct
on an individual, and the simulation of these attacks. An individual’s privacy
risk is related to her probability of re-identification in a dataset w.r.t. a set of
re-identification attacks. The attacks assume that an adversary gets access to a
retail dataset, then, using some previously obtained background knowledge, i.e.,
the knowledge of a portion of an individual’s retail data, the adversary tries to
re-identify all the records in the dataset regarding that individual. We use the
definition of privacy risk (or re-identification risk) introduced in [12].
The background knowledge represents how the adversary tries to re-identify
the individual in the dataset. It can be expressed as a hierarchy of categories,
configurations and instances: there can be many background knowledge cate-
gories, each category may have several background knowledge configurations,
each configuration may have many instances. A background knowledge category
is a type of information known by the adversary about a specific set of dimensions
of an individual's retail data. Typical dimensions in retail data are the items,
their frequency of purchase, the time of purchase, etc. Examples of background
knowledge categories are a subset of the items purchased by an individual, or
a subset of items purchased with additional spatio-temporal information about
the shopping session. The number k of the elements of a category known by the
adversary gives the background knowledge configuration. This represents the
fact that the quantity of information an adversary has may vary in size. An
example is the knowledge of k = 3 items purchased by an individual. Finally,
an instance of background knowledge is the specific information known, e.g., for
k = 3 an instance could be eggs, milk and flour bought together. We formalize
these concepts as follows.
Definition 3 (Background knowledge configuration). Given a background
knowledge category 𝓑, we denote by B_k ∈ 𝓑 = {B_1, B_2, ..., B_n} a specific
background knowledge configuration, where k represents the number of elements in
B_k known by the adversary. We define an element b ∈ B_k as an instance of
background knowledge configuration.
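As an illustration, for a category consisting of items bought together, the instances of a configuration B_k can be enumerated as the k-combinations of the products in each basket. A sketch of such an enumeration (the function and variable names are ours):

```python
from itertools import combinations

def instances(history, k):
    """Enumerate the instances of configuration B_k for one individual:
    all distinct k-combinations of the products within each basket."""
    seen = set()
    for basket in history:
        for combo in combinations(sorted(basket), k):
            seen.add(frozenset(combo))
    return seen

hs_u = [{"eggs", "milk", "flour"}, {"milk", "yogurt"}]
# k = 2 yields {eggs,milk}, {eggs,flour}, {milk,flour} and {milk,yogurt}.
print(len(instances(hs_u, 2)))  # prints 4
```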
Let 𝒟 be a database, D a retail dataset extracted from 𝒟 (e.g., a data structure
as defined in Section 2), and D_u the set of records representing individual
u in D. We define the probability of re-identification as follows:

Definition 4 (Probability of re-identification). The probability of re-identification
PR_D(d = u | b) of an individual u in a retail dataset D is the probability
of associating a record d ∈ D with the individual u, given an instance of background
knowledge configuration b ∈ B_k.
If we denote by M(D, b) the records in the dataset D compatible with the
instance b, then, since each individual is represented by a single History of
Shopping Baskets, we can write the probability of re-identification of u in D as
PR_D(d = u | b) = 1 / |M(D, b)|. Each attack has a matching function that indicates
whether or not a record is compatible with a specific instance of background
knowledge. Note that PR_D(d = u | b) = 0 if the individual u is not represented in D.
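Read operationally, the formula above can be sketched as follows; the matching predicate is a parameter, and the dataset layout (one history per individual) is our illustrative choice:

```python
def prob_reid(dataset, b, matching):
    """PR_D(d = u | b) = 1 / |M(D, b)|, where M(D, b) is the set of
    records compatible with instance b; 0 when no record matches."""
    m = [d for d in dataset.values() if matching(d, b)]
    return 1.0 / len(m) if m else 0.0

# Toy dataset: one history (a list of baskets) per individual.
D = {"u1": [{"eggs", "milk"}], "u2": [{"eggs", "milk"}, {"tea"}]}
contains = lambda d, b: any(b <= s for s in d)  # b occurs within some basket
print(prob_reid(D, frozenset({"eggs", "milk"}), contains))  # prints 0.5
print(prob_reid(D, frozenset({"tea"}), contains))           # prints 1.0
```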
Since each instance b ∈ B_k has its own probability of re-identification, we define
the risk of re-identification of an individual as the maximum probability of re-
identification over the set of instances of a background knowledge configuration:

Definition 5 (Risk of re-identification or Privacy risk). The risk of re-
identification (or privacy risk) of an individual u given a background knowledge
configuration B_k is her maximum probability of re-identification,
Risk(u, D) = max_{b ∈ B_k} PR_D(d = u | b). The risk of re-identification has the
lower bound 1/|D| (a random choice in D), and Risk(u, D) = 0 if u ∉ D.
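Definition 5 then amounts to a maximization over the instances of a configuration. A self-contained sketch, assuming the intra-basket style of background knowledge introduced in Section 3.1 (all names are ours):

```python
from itertools import combinations

def instances(history, k):
    """All distinct k-combinations of products within each basket."""
    return {frozenset(c) for basket in history
            for c in combinations(sorted(basket), k)}

def matching(d, b):
    """A record d (a history) matches instance b if some basket contains b."""
    return any(b <= basket for basket in d)

def risk(u, dataset, k):
    """Risk(u, D) = max over b in B_k of PR_D(d = u | b); 0 if u not in D."""
    if u not in dataset:
        return 0.0
    best = 0.0
    for b in instances(dataset[u], k):
        m = sum(1 for d in dataset.values() if matching(d, b))
        best = max(best, 1.0 / m)  # m >= 1: u's own record matches b
    return best

D = {
    "u1": [{"eggs", "milk"}, {"bread", "milk"}],
    "u2": [{"eggs", "milk"}],
}
# {bread, milk} is unique to u1, so u1 is fully re-identifiable, while
# every 2-combination of u2's basket also occurs in u1's history.
print(risk("u1", D, 2), risk("u2", D, 2))  # prints 1.0 0.5
```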
3.1 Privacy attacks on retail data
The attacks we consider in this paper consist of accessing the released data in the
format of Definition 2 and identifying all users compatible with the background
knowledge of the adversary.
Intra-Basket Background Knowledge. We assume that the adversary has as back-
ground knowledge a subset of products bought by her target in a certain shopping
session. For example, the adversary once saw the subject at the workplace with
some highly perishable foods, which are likely to have been bought together.
Definition 6 (Intra-Basket Attack). Let k be the number of products of an
individual w known by the adversary. An Intra-Basket background knowledge
instance is b = S′_i ∈ B_k, and it is composed of a subset of purchases S′_i
of length k. The Intra-Basket background knowledge configuration based on k
products is defined as B_k = S_w[k]. Here S_w[k] denotes the set of all the possible
k-combinations of the products in each shopping basket of the history.

Since each instance b = S′_i ∈ B_k is composed of a subset of purchases S′_i of
length k, given a record d = HS_u ∈ D and the corresponding individual u, we
define the matching function as:

    matching(d, b) = true,  if ∃ S^u_j ∈ d such that b ⊆ S^u_j
                     false, otherwise                             (1)
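Over Python sets, the matching function of Equation (1) reduces to a containment test against every basket of the record (an illustrative sketch):

```python
def intra_basket_matching(d, b):
    """Eq. (1): true iff some basket S_j in the history d contains
    every product of the instance b."""
    return any(set(b) <= set(basket) for basket in d)

d = [{"eggs", "milk", "flour"}, {"milk", "yogurt"}]
print(intra_basket_matching(d, {"eggs", "flour"}))   # prints True
print(intra_basket_matching(d, {"eggs", "yogurt"}))  # prints False: never in one basket
```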
Full Basket Background Knowledge. We suppose that the adversary knows the
contents of a shopping basket of her target. For example, the adversary once
gained access to a shopping receipt of her target. Note that in this case it is
not necessary to establish k, i.e., the background knowledge configuration has a
fixed length, given by the number of items of a specific shopping basket.
Definition 7 (Full Basket Attack). A Full Basket background knowledge instance
is b = S^w_j ∈ B, and it is composed of one shopping basket of the target w in
all her history. The Full Basket background knowledge configuration is defined
as B = S_w, the set of all shopping baskets in the history of w.

Since each instance b = S^w_j ∈ B is composed of a shopping basket S^w_j, given a
record d = HS_u ∈ D and the corresponding individual u, we define the matching
function as:

    matching(d, b) = true,  if ∃ S^u_j ∈ d such that S^u_j = b
                     false, otherwise                             (2)
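Equation (2) differs from Equation (1) only in demanding an exact basket match rather than containment (sketch):

```python
def full_basket_matching(d, b):
    """Eq. (2): true iff the instance b coincides exactly with one
    of the baskets in the history d."""
    return any(set(basket) == set(b) for basket in d)

d = [{"eggs", "milk"}, {"milk", "yogurt"}]
print(full_basket_matching(d, {"milk", "yogurt"}))  # prints True
print(full_basket_matching(d, {"milk"}))            # prints False: a strict subset
```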
4 Experiments
For the Intra-Basket attack we consider two background knowledge configurations
B_k, with k = 2, 3, while for the Full Basket attack we have just one
possible background knowledge configuration, where the adversary knows an
entire basket of an individual. We use a retail dataset provided by Unicoop
storing the purchases of 1000 individuals in the city of Leghorn during 2013,
corresponding to 659,761 items and 61,325 baskets. We consider each item at
the category level, representing a more general description of a specific item,
e.g., "Coop-brand Vanilla Yogurt" belongs to the category "Yogurt".

We performed a simulation of the attacks for all B_k. We show in Fig. 1
the cumulative distributions of privacy risk. For the Intra-Basket attack with
k = 2, almost 75% of customers have a privacy risk equal to 1.
Fig. 1. Cumulative distributions for privacy attacks.
Switching to k = 3 causes a sharp increase in the overall risk: more than 98% of
individuals have maximum privacy risk (i.e., 1). The difference between the two
configurations is remarkable, showing how effective an attack can be with just
3 items. Since most customers are already re-identified, further increasing the
quantity of knowledge (e.g., exploiting a higher k or the Full Basket attack) does
not offer additional gain. Similar results were obtained for a movie-rating dataset
in [6] and for mobility data in [9], suggesting the existence of a possible general
pattern in the behavior of privacy risk.
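The simulation itself is straightforward to reproduce in miniature: compute the risk of every individual under the Intra-Basket attack and look at the share of maximum-risk customers. A toy sketch (the Unicoop data are not public, so the dataset below is invented):

```python
from itertools import combinations

def risk(u, dataset, k):
    """Maximum probability of re-identification of u under k-item
    intra-basket background knowledge."""
    insts = {frozenset(c) for basket in dataset[u]
             for c in combinations(sorted(basket), k)}
    best = 0.0
    for b in insts:
        m = sum(1 for d in dataset.values() if any(b <= s for s in d))
        best = max(best, 1.0 / m)
    return best

D = {
    "u1": [{"eggs", "milk", "flour"}],
    "u2": [{"eggs", "milk"}],
    "u3": [{"eggs", "milk"}],
}
risks = {u: risk(u, D, 2) for u in D}
# u1 owns a unique pair ({eggs, flour}); u2 and u3 are indistinguishable.
share_max = sum(1 for r in risks.values() if r == 1.0) / len(risks)
print(share_max)  # prints 0.3333333333333333
```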
5 Conclusion
In this paper we proposed a framework to assess privacy risk in retail data.
We explored a set of re-identification attacks conducted on retail data struc-
tures, analyzing empirical privacy risk of a real-world dataset. We found, on
average, a high privacy risk across the considered attacks. Our approach can be
extended in several directions. First, we can expand the repertoire of attacks
by extending the data structures, i.e., distinguishing among shopping sessions
and obtaining a proper transaction dataset, or considering different dimensions
for retail data, e.g., taking into account spatio-temporal information. Second, it
would be interesting to compare the distributions of privacy risk of different at-
tacks through some similarity measures, such as the Kolmogorov-Smirnov test.
Another possible development is to compute a set of measures commonly used
in retail data analysis and investigate how they relate to privacy risk. Finally,
it would be interesting to generalize the privacy risk computation framework to
data of different kinds, from retail to mobility and social media data, studying
sparse relation spaces across different domains.
Funded by the European project SoBigData (Grant Agreement 654024).
References
1. C. Alberts, S. Behrens, R. Pethia, W. Wilson. Operationally Critical Threat,
Asset, and Vulnerability Evaluation (OCTAVE) Framework, Version 1.0. CMU/SEI-
99-TR-017. Software Engineering Institute, Carnegie Mellon University, 1999.
2. M. Deng, K. Wuyts, R. Scandariato, B. Preneel, W. Joosen. A privacy threat analysis
framework: supporting the elicitation and fulfillment of privacy requirements. Requir.
Eng. 16, 1. 2011.
3. F. Giannotti, L. V. Lakshmanan, A. Monreale, D. Pedreschi, H. Wang. Privacy-
preserving mining of association rules from outsourced transaction databases. IEEE
Systems Journal, Volume 7, Issue 3. 2013.
4. A. K. Kant. Dietary patterns and health outcomes. Journal of the American Dietetic
Association, Volume 104, Issue 4, 2004.
5. H. Q. Le, S. Arch-int, H. X. Nguyen, N. Arch-int. Association rule hiding in risk
management for retail supply chain collaboration. Computers in Industry, Volume
64, Issue 7, 2013.
6. A. Narayanan, V. Shmatikov. Robust de-anonymization of large sparse datasets.
IEEE Symposium on Security and Privacy, 2008.
7. G. Pauler, A. Dick. Maximizing profit of a food retailing chain by targeting and pro-
moting valuable customers using Loyalty Card and Scanner Data. European Journal
of Operational Research, Volume 174, Issue 2, 2006.
8. R. Pellungrini, L. Pappalardo, F. Pratesi, A. Monreale. A data mining approach to
assess privacy risk in human mobility data. Accepted for publication in ACM TIST
Special Issue on Urban Computing.
9. F. Pratesi, A. Monreale, R. Trasarti, F. Giannotti, D. Pedreschi, T. Yanagihara.
PRISQUIT: a System for Assessing Privacy Risk versus Quality in Data Sharing.
Technical Report 2016-TR-043. ISTI - CNR, Pisa, Italy.
10. S. J. Rizvi, J. R. Haritsa. Maintaining data privacy in association rule mining.
VLDB 2002.
11. C. Rygielski, J.-C. Wang, D. C. Yen. Data mining techniques for customer rela-
tionship management. Technology in Society, Volume 24, Issue 4, 2002.
12. P. Samarati, L. Sweeney. Generalizing Data to Provide Anonymity when Disclosing
Information (Abstract). PODS 1998.
13. G. Stoneburner, A. Goguen, A. Feringa. Risk Management Guide for Information
Technology Systems: Recommendations of the National Institute of Standards and
Technology. NIST special publication, Vol. 800. 2002.
14. V. Torra. Data Privacy: Foundations, New Developments and the Big Data Chal-
lenge. Springer, 2017.
15. Y. Xu, B. C. M. Fung, K. Wang, A. W. C. Fu and J. Pei. Publishing Sensitive
Transactions for Itemset Utility, ICDM 2008.