Personal Data Detection in Multidimensional Databases
Amine MRABET
Research & Innovation
Umanis
7 Rue Paul Vaillant Couturier
92300 Levallois-Perret, France
amrabet@umanis.com
Ali HASSAN
Research & Innovation
Umanis
7 Rue Paul Vaillant Couturier
92300 Levallois-Perret, France
ahassan@umanis.com
Patrice DARMON
Research & Innovation
Umanis
7 Rue Paul Vaillant Couturier
92300 Levallois-Perret, France
pdarmon@umanis.com
ABSTRACT
Mapping personal data is always a prerequisite of a de-identification process. In the era of big data, the data discovery step of this process must be automated: automation saves time, increases the accuracy of detection and helps preserve confidentiality. For these reasons, we propose a new approach for the automatic detection of personal data, adapted to multidimensional databases. The techniques used in this approach operate at two levels: we propose two detection solutions at the data level and one solution at the metadata level. After detecting personal data in a database using identification scores, we use sensitivity scores to assess the overall sensitivity of the multidimensional database.
CCS Concepts
• Information systems → Data warehouses; Data analytics; • Security and privacy → Privacy protections;
Keywords
GDPR; data protection; personal data discovery
1. INTRODUCTION
Privacy protection is a fundamental human right. It is recognized by Article 8 of the Convention for the Protection of Human Rights and Fundamental Freedoms, which guarantees everyone’s right to respect for “his private and family life, his home and his correspondence”. Similarly, the Charter of Fundamental Rights of the European Union affirms “the respect for private and family life” (Article 7) and adds a specific article on the “protection of personal data” (Article 8).
The term “personal data” was defined in French law no. 78-17 of 6 January 1978 on data processing, data files and individual liberties as: “any information relating to a natural person who is or can be identified, directly or indirectly, by reference to an identification number or to one or more factors specific to him.”
The French National Commission for Informatics and Liberties (CNIL) uses the same definition and gives the following examples of personal data: “a name, a photo, a fingerprint, a postal address, an email address, a phone number, a social security number, an internal personnel number, an IP address, a computer login, a voice recording”.
The European Union’s General Data Protection Regulation (GDPR) specifies the categories and sensitivity levels of the personal data that companies handle, how these data may be used, the recipients to whom they may be communicated, as well as the rights of the companies and of the data subjects. The GDPR requires the adoption of measures that respect the principles of privacy by design and privacy by default.
Privacy protection has three specific objectives [10]: unlinkability, transparency and intervenability.
Unlinkability ensures that privacy-relevant data cannot be linked across domains. Unlinkability is related to the principle of data minimization.
Transparency ensures that all processing of privacy-relevant data, including the legal, technical and organizational treatments, can be understood and reconstructed at any time. The information must be available before (planned treatment), during (ongoing treatment) and after treatment (to find out exactly what happened). The amount of information to be provided and the way it is communicated should be tailored to the capabilities of the target audience.
Intervenability makes it possible to intervene in any ongoing or planned processing of personal data, in particular by the persons whose data are processed.
The “self-service BI” capabilities offered by solutions such as “Tableau”1, “Power BI”2 and “Qlik Sense”3 give a non-IT population the power to create their own reporting, run their own queries independently and publish reports without going through the IT department. This could increase the risk of sharing reports that contain personal data. Determining the sensitivity level of the information in reports and databases is left to the users themselves; this task is performed manually and depends on the skills of the user.
1 https://www.tableau.com
2 https://powerbi.microsoft.com
3 https://www.qlik.com
Therefore, our goal in this paper is to propose an approach that automatically detects personally identifiable information (PII) and computes the sensitivity level of these data in multidimensional databases (MDBs). We propose a new approach based on three methods: two methods detect personal data by analyzing data values, based respectively on regular expressions and on reference bases, and a third method detects PII from the metadata (for example, the attribute name and the name of the dimension to which it belongs). This last method relies on an ontology architecture.
The article is structured as follows. Section 2 reviews related work on protection and PII detection. Section 3 describes the multidimensional model and presents the case study. Section 4 is devoted to our proposed PII detection method. We conclude in Section 5.
2. RELATED WORK
In this section, we first give an overview of the strategies and methods for protecting personal data, and then detail the methods used to detect personal data in databases.
2.1 Data Privacy Protection
In order to achieve the above objectives of privacy protection, the development of appropriate methodologies for privacy by design has been discussed in several works [22, 9, 13, 16, 3]. A detailed state of the art can be found in [8, 17, 20]. [13] summarizes the different strategies to achieve privacy by design and classifies them into two categories: data-oriented strategies and process-oriented strategies.
2.1.1 Data-Oriented Strategies
These strategies support unlinkability. The following strategies are data-oriented:
1. Minimise: the amount of personal data processed should be limited to the minimum possible, which means that no unnecessary data is collected.
2. Hide: personal data, and their interrelations, must be hidden. Several techniques can be used to achieve this strategy, such as data encryption and anonymization.
3. Separate: personal data must be stored in different sources (databases) and processed in a distributed manner.
4. Aggregate: PII should be processed at the highest level of aggregation (with the least detail) at which they are still useful.
2.1.2 Process-Oriented Strategies
These strategies support transparency and intervenability. The following strategies are process-oriented:
Inform: data subjects must be informed when their personal data are processed. They should be informed about which information has been processed, for what purpose and by which treatments.
Control: the persons concerned should be able to view, update and even demand the deletion of the personal data collected.
Enforce: this strategy guarantees the implementation of a privacy protection policy, which should be consistent with legal requirements. A state of the art [20] reviews the methods used to implement this strategy for databases in the cloud.
Demonstrate: this strategy requires a data controller to demonstrate compliance with the privacy policy and legal requirements.
2.2 PII Detection
The problem of privacy protection in OLAP is widely treated in the state of the art by works that follow data-oriented strategies [18, 1, 14, 21, 4, 7, 5]. In contrast, the detection of personal data in databases has received very little attention.
[15] proposes a method to detect the sensitivity level of the data requested in a query. If the query requests sensitive data, a swapping is applied to the data before sending the response; otherwise the data are sent without modification. However, this proposal assumes that the sensitivity of the data has already been identified: a weight (sensitivity level) is already associated with each attribute, and its value is determined according to the role the attribute plays in revealing the identity of an individual.
The proposal of [6] does not have this limitation. The authors propose to use generic rules that can be specialized for each domain, table, attribute and attribute value. According to these rules, a sensitivity score is calculated for each attribute. This proposal uses the semantic links between attributes and natural language processing techniques to find these links. In other words, it assumes that attribute names are words with semantics; therefore, the use of abbreviations could disrupt this method.
The commercial tool “DgSECURE DETECT”4 offers a solution to detect personal data. It analyzes the data itself (the values): it performs deep content inspection of the data using techniques that combine dictionary-based and weighted keyword matches with machine learning. However, the method used is too generic and does not take into account the data specificities of each domain.
4 https://www.dataguise.com/detect/
In addition, all these methods have been proposed to detect PII in traditional relational databases. To our knowledge, no existing work takes into account the particularities of multidimensional databases, for example the organization of data in multiple levels of granularity.
3. PRELIMINARIES: CONCEPTUAL MULTIDIMENSIONAL MODEL
In this section, we present the definition of the multidimensional model [11, 12].
Let $N = \{n_1, n_2, \ldots\}$ be a finite set of names.

Definition 1. A fact $F_i$ is defined by $(n^{F_i}, M_i)$ where:
$n^{F_i} \in N$ is the name of the fact,
$M_i = \{m_1, \ldots, m_{p_i}\}$ is a set of measures.

Definition 2. A dimension $D_i$ is defined by $(n^{D_i}, A_i, H_i)$ where:
$n^{D_i} \in N$ is the name of the dimension,
$A_i = \{a^i_1, \ldots, a^i_{r_i}\} \cup \{Id_i, All_i\}$ is the set of dimension attributes,
$H_i = \{H^i_1, \ldots, H^i_{s_i}\}$ is the set of hierarchies.

Dimension attributes are organized in hierarchies, from the most detailed granularity ($Id_i$) to the most general ($All_i$).

Definition 3. A hierarchy $H_j$ is defined by $(n^{H_j}, P_j, \prec_{H_j}, Weak_{H_j})$ where:
$n^{H_j} \in N$ is the name of the hierarchy,
$P_j = \{p^j_1, \ldots, p^j_{q_j}\}$ is a set of dimension attributes called parameters ($P_j \subseteq A_i$),
$\prec_{H_j} = \{(p^j_x, p^j_y) \mid p^j_x \in P_j \wedge p^j_y \in P_j\}$ is an antisymmetric and transitive binary relation defining a navigation path on the dimension,
$Weak_{H_j} : P_j \to 2^{A_i \setminus P_j}$ is a function that associates with each parameter a set of dimension attributes, called weak attributes.

Definition 4. A multidimensional schema $S$ is defined by $(F, D, Star)$ where:
$F = \{F_1, \ldots, F_n\}$ is the set of facts,
$D = \{D_1, \ldots, D_m\}$ is the set of dimensions,
$Star : F \to 2^D$ is a function that associates each fact with its analysis axes (dimensions).

We use the following notations: $M = \bigcup_{i=1}^{n} M_i$, $A = \bigcup_{i=1}^{m} A_i$, $P_i = \bigcup_{j=1}^{s_i} P_j$, $P = \bigcup_{i=1}^{m} P_i$, $W_i = \bigcup_{j=1}^{s_i} \bigcup_{k=1}^{q_j} Weak_{H_j}(p^j_k)$, $W = \bigcup_{i=1}^{m} W_i$.

Figure 1: Example of a conceptual multidimensional star schema
Example (case study).
Figure 1 shows an example of a multidimensional star schema. This schema is used to analyze the quantity and amount (measures) of sales (fact) according to four dimensions: “Stores”, “Customers”, “Dates” and “Products”. The “Dates” dimension organizes temporal granularities into two hierarchies: one for weeks and the other for months. The “Products” and “Stores” dimensions each include a single hierarchy. The “Customers” dimension organizes the sales order information into three hierarchies: (1) according to the geographical distribution of the customers, (2) according to their year of birth, and (3) according to their gender.
In the “Customers” dimension, several weak attributes (“Name”, “First Name”, “Address” and “Mail”) are associated with the “ID Customer” parameter. They can be considered as personal data that must be protected.
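To illustrate Definitions 2 and 3 on this case study, the “Customers” dimension and its geographical hierarchy could be written as follows (a sketch only: the geographical parameters, here City and Country, are assumptions for illustration, since Figure 1 is not reproduced in this text):
$D_{Cust} = (\text{Customers}, A_{Cust}, H_{Cust})$ with
$A_{Cust} \supseteq \{Id_{Cust}, City, Country, All_{Cust}\} \cup \{Name, First\,Name, Address, Mail\}$,
$H_{Geo} = (\text{Geography}, P_{Geo}, \prec_{H_{Geo}}, Weak_{H_{Geo}})$ with $P_{Geo} = \{Id_{Cust}, City, Country, All_{Cust}\}$,
$Id_{Cust} \prec_{H_{Geo}} City \prec_{H_{Geo}} Country \prec_{H_{Geo}} All_{Cust}$, and
$Weak_{H_{Geo}}(Id_{Cust}) = \{Name, First\,Name, Address, Mail\}$,
where $Id_{Cust}$ denotes the “ID Customer” parameter.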
In order to better meet the needs of anonymization and protection of personal data, we propose in the rest of this article a method that automates the detection of personal data in multidimensional databases while taking into account the particular structure of these databases.
4. PII DETECTION METHOD IN MULTIDIMENSIONAL DATABASES
In this section, we present our solution in detail. It is based on two steps. The first is to build and enrich an ontology. The second is the detection of personal data using the ontology built in the first step. In this second step, we also calculate the sensitivity level of the multidimensional database using scores.
4.1 Ontology Update Process
We begin this section by describing the structure of the ontology we propose; we then present the methods used to populate it. We chose to use an ontology (a semantic knowledge base) to optimize the detection of personal data. Detection via the ontology is applied at the metadata level. Working at this level increases both confidentiality and performance: confidentiality, because it reduces the amount of detection performed at the data level; performance, because detection at the metadata level is faster than detection at the data level.
4.1.1 Ontology Description
In the state of the art, an ontology is often associated with the multidimensional schema to better understand it [2]. In contrast, we use an ontology as a semantic knowledge base to describe the links between the attributes of multidimensional schemas and personal data.
Our ontology is composed of two parts: a static part (Figure 2) that includes the list of PII entities predefined in our method, and a dynamic part (Figure 3) that is updated by PII detection via two methods, one using reference bases and the other using a regular expression standard.
In the static part, for each domain, we classify the PII entities into several categories proposed by the CNIL and assign a sensitivity level (a score) to each of these entities (see Figure 2). These scores are predefined and configurable according to the context of the database.
Figure 2: Static Ontology
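As a rough illustration of this static part (the actual entity list and scores of Figure 2 are not reproduced here; the entity names, categories and values below are hypothetical and merely show the intended structure):

# Hypothetical sketch of the static part of the ontology: each PII entity
# belongs to a CNIL-inspired category and carries a configurable
# sensitivity level. Entity names and scores are illustrative only.
STATIC_ONTOLOGY = {
    "Name of person":         {"category": "Civil identity",  "sensitivity": 3},
    "Email address":          {"category": "Contact data",    "sensitivity": 4},
    "Social security number": {"category": "Official number", "sensitivity": 5},
    "IP address":             {"category": "Connection data", "sensitivity": 3},
}

def find_score(pii_entity, ontology=STATIC_ONTOLOGY):
    """Look up the sensitivity level predefined for a PII entity."""
    return ontology[pii_entity]["sensitivity"]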
In order to adapt our personal data detection method to multidimensional databases, we propose a dynamic ontology (see Figure 3) with four levels:
1. “Domain” level: this level represents the data sector or data domain. In Figure 3, the name of the domain corresponds to the name of the fact in the schema of Figure 1.
2. “Dimension” level: each tag of this level groups all the attributes (parameters and weak attributes) belonging to one dimension of the multidimensional schema.
3. “Parameter” level: each tag of this level groups all the weak attributes associated with the corresponding parameter.
4. “Weak attribute” level.
Taking the example of Figure 1, this four-level structure allows our ontology to differentiate the attribute “Name” of a customer in the “Customers” dimension from the attribute “Name” of a product in the “Products” dimension. In the dynamic part, we determine which PII entity each attribute of the MDB corresponds to. For example, the “Name” attribute in the “Customers” dimension corresponds to the “Name of person” entity. The name of the PII entity is associated with the weak attribute or parameter tag (see line 12 in Figure 3).
The names of the PII entities corresponding to an attribute of the MDB are added in a “Fields” tag (see lines 11 to 13 of Figure 3). After each detection, we update this dynamic part of the ontology.
Figure 3: Dynamic Ontology
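The four-level structure of the dynamic part can be pictured as the following nested mapping (a hypothetical sketch for the schema of Figure 1, not the actual serialization shown in Figure 3):

# Domain > Dimension > Parameter > Weak attribute; the "fields" entries
# record the PII entities matched for an attribute after detection.
DYNAMIC_ONTOLOGY = {
    "Sales": {                                   # Domain level (fact name)
        "Customers": {                           # Dimension level
            "ID Customer": {                     # Parameter level
                "fields": ["Internal identifier"],
                "weak_attributes": {             # Weak attribute level
                    "Name":       {"fields": ["Name of person"]},
                    "First Name": {"fields": ["Name of person"]},
                    "Mail":       {"fields": ["Email address"]},
                },
            },
        },
    },
}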
4.1.2 Extraction of Multidimensional Schema (Metadata)
In order to update our ontology, we follow a training process based on training bases. Figure 4 presents the step of extracting the metadata of a multidimensional database. After the extraction of the structure, we apply cleaning, processing and storage steps in order to obtain metadata that are adequate and ready for training. These operations concern the names of the attributes of the MDB, in order to avoid the problems resulting from differences in uppercase/lowercase, singular/plural and special characters.
Figure 4: Metadata extract
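A minimal sketch of this cleaning step, assuming it is implemented as a simple name-normalization function (the exact rules used in our implementation are not detailed here), could be:

import re
import unicodedata

def clean_attribute_name(name):
    """Normalize an attribute name before matching: lower-case it, strip
    accents and special characters, and reduce a naive plural to a
    singular form (simplified heuristic)."""
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    name = re.sub(r"[^a-zA-Z0-9 ]+", " ", name).lower().strip()
    if name.endswith("s") and len(name) > 3:   # crude singularization
        name = name[:-1]
    return name

# e.g. clean_attribute_name("Prénoms") returns "prenom"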
Following the extraction of metadata, we propose two methods to update our ontology. Both methods compute identification scores for each attribute. The first method is based on a predefined regular expression standard. The second method is based on reference bases (open data).
4.1.3 Updating the Ontology
As detailed above, the proposed ontology architecture requires two methods to train it; to execute them, we need to work at the data level.
Figure 5 presents the first method used to update our ontology. In this method, we use a regular expression standard: we inject regular expressions into our scoring queries, and each query returns a matching score between the attribute in question and a PII entity. The calculated score is used to update the dynamic part of the ontology. For more details, see [19].
Figure 5: Update via RegEx
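The following sketch illustrates the identification score computed by this method. The patterns are hypothetical examples; the actual regular expression standard and scoring queries are those described in [19], and the score is computed here in memory rather than by a query pushed to the DBMS:

import re

# Hypothetical patterns; the real standard covers many more PII entities.
REGEX_STANDARD = {
    "Email address":     re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$"),
    "Phone number (FR)": re.compile(r"^(\+33|0)[1-9](\d{2}){4}$"),
}

def regex_identification_scores(values):
    """For each PII entity, return the fraction of non-null values of the
    attribute that match its regular expression (the identification score)."""
    values = [str(v) for v in values if v is not None]
    scores = {}
    for entity, pattern in REGEX_STANDARD.items():
        matched = sum(1 for v in values if pattern.match(v))
        scores[entity] = matched / len(values) if values else 0.0
    return scores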
Figure 6 presents the second method used to update our ontology. In this method, we have prepared several reference bases to perform detection comparisons. To ensure data privacy, we work with hashed data from our reference bases: the same hash function is used to hash the references and to hash the data to be scanned. This method also returns a matching score between the attribute in question and the targeted personal data and, in the same way, we use these scores to update the dynamic part of the ontology.
Figure 6: Update via references
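A minimal sketch of this second method, assuming SHA-256 as the shared hash function (the paper does not prescribe a particular one) and a hypothetical reference base of first names:

import hashlib

def h(value):
    """Hash applied both to the reference base and to the scanned data,
    so that values are never compared in clear."""
    return hashlib.sha256(str(value).strip().lower().encode("utf-8")).hexdigest()

def reference_identification_score(values, hashed_reference):
    """Fraction of the attribute's non-null values whose hash appears in
    the pre-hashed reference base (the identification score)."""
    values = [v for v in values if v is not None]
    if not values:
        return 0.0
    return sum(1 for v in values if h(v) in hashed_reference) / len(values)

# Usage sketch with a hypothetical reference base of first names:
first_names_ref = {h(n) for n in ["alice", "amine", "ali", "patrice"]}
score = reference_identification_score(["Amine", "Bob", "Ali"], first_names_ref)  # 2/3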
4.2 Personal Data Detection
In this section, we detail our methods for detecting personal data. Algorithm 1 details our detection procedure. This algorithm takes six inputs: the multidimensional schema (S) representing the metadata level, the data warehouse (Base_DW), which is the database itself (the tables with the data), the static and dynamic parts of our ontology (Ontology_S, Ontology_D), and our reference bases and regular expressions (Base_ref, Base_RegEx). It returns the list of PII entities (PIIs) detected in the base.
The different steps of this algorithm are:
1. Extraction of the multidimensional schema structure. To perform this step, we use the “Extract_Meta” method (line 1).
2. Detection of personal data (for all the attributes and measures of the multidimensional schema) via a matching with our trained ontology (lines 2 to 5). In this step, we use the “Matching_Ontology” method to search for sensitive attributes in the dynamic part “Ontology_D” of our ontology; then we retrieve the information from the static ontology “Ontology_S” via the “Find_PII” function.
3. Use of our regular expressions to run further checks for the attributes not detected in the previous step (lines 6 and 7). For this step, we use the “Matching_S_RegEx” method.
4. Use of our reference bases to run checks for each attribute not detected in the previous step (lines 8 and 9). For this step, we use the “Matching_S_Ref” method.
Steps 3 and 4 of our algorithm can be swapped or run in parallel to optimize run-time performance (performance tests are being implemented).
Algorithm 1: Detection of personal data
Input: S, Base_DW, Ontology_S, Ontology_D, Base_ref, Base_RegEx
Output: PIIs
1  Schema ← Extract_Meta(S);
2  foreach x ∈ Schema.A ∪ Schema.M do
3      Result ← Matching_Ontology(x, Ontology_D);
4      if (Result ≠ Null) then
5          PII ← Find_PII(Result, Ontology_S);
6      else
7          PII ← Matching_S_RegEx(x, Base_RegEx);
8          if PII = Null then
9              PII ← Matching_S_Ref(x, Base_ref);
10     PIIs.add(PII);
11 return PIIs;
Figure 7 summarizes the methods for detecting personal data in a multidimensional schema detailed in Algorithm 1. To detect the data, we start with the ontology method, then apply the regular expression method, and finish with the reference base method.
Figure 7: The procedure for detecting personal data
in the OLAP
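For readers who prefer code to pseudocode, the overall procedure of Algorithm 1 can be sketched as follows. The matching functions are passed as callables because their implementations (ontology matching, regular expression scoring, reference base scoring) are the ones described above and are not repeated here; `schema` is assumed to be the output of the metadata extraction step:

def detect_personal_data(schema, ontology_s, ontology_d,
                         matching_ontology, find_pii,
                         matching_s_regex, matching_s_ref):
    """Sketch of Algorithm 1: metadata-level ontology matching first,
    then the two data-level checks (regular expressions, then reference
    bases) for the attributes and measures not yet identified."""
    piis = []
    for x in schema.attributes + schema.measures:   # Schema.A ∪ Schema.M
        result = matching_ontology(x, ontology_d)   # metadata level
        if result is not None:
            pii = find_pii(result, ontology_s)
        else:
            pii = matching_s_regex(x)               # data level, RegEx standard
            if pii is None:
                pii = matching_s_ref(x)             # data level, reference bases
        if pii is not None:
            piis.append(pii)
    return piis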
4.3 Sensitivity score
To confirm the protection of personal data, it is necessary to validate, through a sensitivity assessment score, that the required security level is ensured. For this reason, we propose a calculation of the sensitivity score. This score is computed from the sensitivity levels predefined for the personal data in the static part of our ontology; to define these levels, we rely on the recommendations of the CNIL. The total sensitivity score of a multidimensional database is the sum of the sensitivity levels of the measures and attributes (parameters and weak attributes) of the schema. Some attributes can be calculated from other attributes of the same hierarchy; in other words, one attribute can be included in another. For example, the day ‘08/03/1981’ includes the month ‘03/1981’ and the year ‘1981’.
In this case we have ‘Day’ ⊑ ‘Month’ ⊑ ‘Year’, where the inclusion relation is defined by: $x \sqsubseteq y \Leftrightarrow x \prec_{H_j} y \wedge (x.value \Rightarrow y.value)$.
In this type of case, the attributes included in another will
not be counted in the calculation of the sensitivity score.
Algorithm 2 details the calculation of the sensitivity score. It takes three inputs: the list of detected PII (PIIs), the static ontology (Ontology_S) and the sensitivity threshold (threshold) to be respected. This threshold is configurable according to the use case. The algorithm computes the sum of the sensitivity levels while respecting the inclusion constraint on the attributes, and alerts the user when the total exceeds the threshold, indicating that his system breaks the GDPR rules.
Algorithm 2: DW sensitivity calculation
Input: PIIs, Ontology_S, threshold
Output: sensitivity_DW
1  sensitivity_DW ← Null;
2  foreach x ∈ PIIs such that ∄ y ∈ PIIs with y ⊑ x do
3      sensitivity_DW ← sensitivity_DW + Ontology_S.findScore(x);
4  if (sensitivity_DW > threshold) then
5      Alert();
6  return sensitivity_DW;
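A compact sketch of Algorithm 2, where `find_score` reads the sensitivity level from the static ontology and `finer_than(y, x)` is assumed to implement the inclusion relation defined above (y ⊑ x, i.e. x can be computed from y):

def dw_sensitivity(piis, find_score, finer_than, threshold, alert):
    """Sum the sensitivity levels of the detected PII, skipping every
    attribute that can be derived from a finer attribute of the same
    hierarchy, and alert when the configured threshold is exceeded."""
    total = 0
    for x in piis:
        if any(finer_than(y, x) for y in piis if y is not x):
            continue  # x is derivable (e.g. Month when Day is present): not counted
        total += find_score(x)
    if total > threshold:
        alert()
    return total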
5. CONCLUSION
In this paper, we have developed an approach for the automatic detection of PII (personally identifiable information) in multidimensional databases. Our solution is based on a static ontology and a trained dynamic ontology. In order to update this ontology and detect PII, we use two detection methods based on reference bases and regular expressions. The ontology is updated with each use, which gives us an increasingly rich ontology; such an ontology guarantees a high detection accuracy. We also propose a scoring that takes into account the particularities of multidimensional schemas (organization of attributes in several levels of granularity).
As future work, we have started to develop another method based on artificial intelligence algorithms, using Named Entity Recognition (NER). We also plan to implement a semantic validation module before updating the ontology.
6. REFERENCES
[1] R. Agrawal, R. Srikant, and D. Thomas. Privacy
preserving OLAP. page 251, 2005.
[2] G. Amaral and G. Guizzardi. On the Application of
Ontological Patterns for Conceptual Modeling in
Multidimensional Models. In T. Welzer, J. Eder,
V. Podgorelec, and A. Kamišalić Latifić, editors,
Advances in Databases and Information Systems,
pages 215–231, Cham, 2019. Springer International
Publishing.
[3] T. Antignac and D. Le Métayer. Privacy by Design:
From Technologies to Architectures (Position Paper).
In B. Preneel and D. Ikonomou, editors, Lecture Notes
in Computer Science (including subseries Lecture
Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics), volume 8450 LNCS, pages 1–17.
Springer International Publishing, Cham, 2014.
[4] A. Cuzzocrea and D. Saccà. Balancing accuracy and
privacy of OLAP aggregations on data cubes. page 93,
2010.
[5] A. Cuzzocrea, A. Schuster, and G. Vercelli.
PP-OMDS: An Effective and Efficient Framework for
Supporting Privacy-Preserving OLAP-based
Monitoring of Data Streams. 1(Iceis):282–292, 2018.
[6] C. du Mouza, E. Métais, N. Lammari, J. Akoka,
T. Aubonnet, I. Comyn-Wattiau, H. Fadili, and S. S.
Cherfi. Towards an Automatic Detection of Sensitive
Information in a Database. In 2010 Second
International Conference on Advances in Databases,
Knowledge, and Data Applications, pages 247–252,
2010.
[7] F. Fessant, T. Benkhelif, and F. Clérot. Anonymiser
des données multidimensionnelles à l'aide du
coclustering. Revue des Nouvelles Technologies de
l’Information, Extraction:153–164, 2017.
[8] D. George, D.-F. Josep, H. Marit, J.-H. Hoepman,
D. L. Métayer, T. Rodica, and S. Stefan. Privacy and
Data Protection by Design from policy to engineering.
CoRR, 2014.
[9] S. Gürses, C. Troncoso, and C. Diaz. Engineering
privacy by design. 2011.
[10] M. Hansen. Top 10 mistakes in system design from a
privacy perspective and privacy protection goals. In
J. Camenisch, B. Crispo, S. Fischer-Hübner,
R. Leenes, and G. Russello, editors, IFIP Advances in
Information and Communication Technology, volume
375 AICT, pages 14–31. Springer Berlin Heidelberg,
Berlin, Heidelberg, 2012.
[11] A. Hassan, F. Ravat, O. Teste, R. Tournier, and
G. Zurfluh. OLAP in Multifunction Multidimensional
Databases. In Advances in Databases and Information
Systems, pages 190–203, 2013.
[12] A. Hassan, F. Ravat, O. Teste, R. Tournier, and
G. Zurfluh. Differentiated Multiple Aggregations in
Multidimensional Databases. TLDKS XXI,
9260:20–47, 2015.
[13] J.-H. Hoepman. Privacy Design Strategies.
ICT-System Security and Privacy Protection–29th
IFIP TC, 11:446–459, 2014.
[14] M. Hua, S. Zhang, W. Wang, H. Zhou, and B. Shi.
FMC: An Approach for Privacy Preserving OLAP.
pages 408–417, 2005.
[15] P. Kamakshi and A. V. Babu. Automatic detection of
sensitive attribute in PPDM. In 2012 IEEE
International Conference on Computational
Intelligence and Computing Research, pages 1–5, 2012.
[16] F. Kerschbaum. Privacy-Preserving Computation. In
B. Preneel and D. Ikonomou, editors, Privacy
Technologies and Policy: First Annual Privacy Forum,
APF 2012, Limassol, Cyprus, October 10-11, 2012,
Revised Selected Papers, pages 41–54. Springer Berlin
Heidelberg, Berlin, Heidelberg, 2014.
[17] C. Lazaro and D. Le M´etayer. The Control over
personal data: True remedy or fairytale? SCRIPTed,
12(1), 2015.
[18] Lingyu Wang, S. Jajodia, and D. Wijesekera. Securing
OLAP data cubes against privacy breaches. pages
161–175, 2004.
[19] A. Mrabet, M. Bentounsi, and P. Darmon. SecP2I : A
Secure Multi-party Discovery of Personally Identifiable
Information (PII) in Structured and Semi-structured
Datasets. In IEEE BigData (to appear), 2019.
[20] S. Sobati Moghadam, J. Darmont, and G. Gavin.
Enforcing Privacy in Cloud Databases. In DaWaK
2017, volume 10440, pages 53–73. Springer, 2017.
[21] S. Y. Sung, Y. Liu, H. Xiong, and P. A. Ng. Privacy
preservation for data cubes. Knowledge and
Information Systems, 9(1):38–61, 2006.
[22] M. C. Tschantz and J. M. Wing. Formal methods for
privacy. In Lecture Notes in Computer Science
(including subseries Lecture Notes in Artificial
Intelligence and Lecture Notes in Bioinformatics),
volume 5850 LNCS of FM ’09, pages 1–15, Berlin,
Heidelberg, 2009. Springer-Verlag.