Conference Paper

SecP2I: A Secure Multi-party Discovery of Personally Identifiable Information (PII) in Structured and Semi-structured Datasets

Abstract

Personal data governance has become a key issue within organisations. This is mainly due to (i) the strategic value of personal data, which provide insights that improve commercial and operational efficiency; and (ii) data security risks and privacy regulation restrictions (e.g., GDPR and CCPA). Creating data catalogs is an important step in setting up personal data governance. However, it remains a time-consuming task, especially because of the absence of naming conventions in database modeling coupled with the heterogeneity of database management systems (DBMS) across Information Systems (IS). The paper presents SecP2I, an efficient data analytics-based approach that enables personal data discovery in structured and semi-structured datasets while guaranteeing end-to-end data confidentiality. The effectiveness of the platform is demonstrated using a real-world HR dataset.

... The calculated score is used to update the dynamic part of the ontology. For more details [19]. ...
Chapter
Mapping personal data is always a prerequisite in the de-identification process. With today's big data, the data discovery step of this process has to be automated. Automation saves time, increases detection accuracy, and helps ensure confidentiality. For these reasons, we propose a new approach to automate the detection of personal data. Our approach is adapted to multidimensional databases and operates at two levels: we propose two detection solutions at the data level and one solution at the metadata level. After detecting personal data in a database using identification scores, we use sensitivity scores to assess the total sensitivity of the multidimensional database.
Chapter
Discovering Personal Identifying Information (PII) in textual data is an important pre-processing step for enabling privacy-preserving data analytics. One approach to PII discovery in textual data is to characterise the PII as abnormal or unusual observations that can potentially result in privacy violations. However, discovering PII in textual data is challenging because the data is unstructured and comprises sparse representations of similar text elements. This limits the availability of labelled data for training and the accuracy of PII discovery. In this paper, we present an approach to discovering PII in textual data by characterising the PII as outliers. The PII discovery is done without labelled data, and the PII are identified using named entities. Based on the recognised named entities, we then employ five unsupervised outlier detection models (LOF, DBSCAN, iForest, OCSVM, and SUOD). Our performance comparison results indicate that iForest offers the best prediction accuracy, with an ROC AUC value of 0.89. We employ a masking mechanism to replace discovered PII with semantically similar values. Our results indicate a median semantic similarity score of 0.461 between original and transformed texts, which results in low information loss.
Keywords: Outlier Detection, Named Entity Recognition, Data Masking, Personal Identifying Information (PII)
Conference Paper
To improve personal data protection in data lakes, we propose a metamodel that covers several aspects of privacy. Our metamodel describes the constraints required to implement personal data protection procedures (pseudonymisation/anonymisation). It also strengthens GDPR compliance in data lakes by integrating, for example, a log of processing operations on personal data and the identification of the purpose of each data collection. The metamodel is presented as a conceptual schema and implemented in a graph-oriented database. Our proposal is validated through the modelling and discussion of several possible personal data protection scenarios in data lakes.
Preprint
Today, with Big Data and data lakes, we face a mass of data that is very difficult to manage manually. Protecting personal data in this context requires automatic analysis for data discovery. Storing the names of attributes that have already been analyzed in a knowledge base could optimize this automatic discovery. To obtain a better knowledge base, we should not store any attribute whose name is meaningless. In this article, to check whether an attribute name has a meaning, we propose a solution that calculates the distances between this name and the words in a dictionary. Our studies of distance functions such as N-Gram, Jaro-Winkler and Levenshtein show their limits when setting an acceptance threshold for an attribute in the knowledge base. To overcome these limitations, our solution strengthens the score calculation by using an exponential function based on the longest sequence. In addition, a double scan of the dictionary is proposed in order to process attributes that have a compound name.
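The dictionary check described in this abstract can be sketched as follows. This is a minimal illustration of the baseline idea only (a normalised Levenshtein similarity compared against a threshold), not the authors' method: the exponential, longest-sequence score and the double dictionary scan are not specified in the abstract and are not reproduced here. The function names, the sample dictionary, and the 0.8 threshold are all assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) by dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def best_similarity(name, dictionary):
    """Return (closest word, similarity in [0, 1]) for an attribute name."""
    name = name.lower()
    best_word, best_score = "", 0.0
    for word in dictionary:
        longest = max(len(name), len(word))
        score = 1.0 - levenshtein(name, word) / longest if longest else 1.0
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score


def is_meaningful(name, dictionary, threshold=0.8):
    """Hypothetical acceptance rule: keep the attribute if it is close to a known word."""
    return best_similarity(name, dictionary)[1] >= threshold


# Illustrative dictionary; a real one would be far larger.
WORDS = ["address", "name", "salary"]
print(is_meaningful("adress", WORDS))  # misspelling close to "address"
print(is_meaningful("xq7z", WORDS))    # no dictionary word is close
```

A fixed threshold of this kind is exactly where the abstract reports difficulties: short names and compound names move the normalised score sharply, which is the motivation given for the strengthened score.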
Research Proposal
With today's Big Data and data lakes, we face a mass of data that is difficult to manage manually. Protecting personal data in this context therefore requires automatic analysis for data discovery. Storing the names of already analyzed attributes in a knowledge base could optimize this automatic discovery. To obtain a better knowledge base, attributes whose names are meaningless should not be stored. In this paper, to check whether an attribute name has a meaning, we propose calculating the distances between this name and all the words of a dictionary. Our studies of the N-Gram, Jaro-Winkler and Levenshtein distance functions show their limits when setting an acceptance threshold for an attribute in the knowledge base. To overcome these limits, our solution strengthens the score calculation by using an exponential function based on the longest sequence. In addition, a double pass over the dictionary is proposed in order to process attributes that have a compound name.
Article
Named entity recognition (NER) is the task of identifying text spans that mention named entities and classifying them into predefined categories such as person, location, organization, etc. NER serves as the basis for a variety of natural language applications such as question answering, text summarization, and machine translation. Although early NER systems are successful in producing decent recognition accuracy, they often require much human effort in carefully designing rules or features. In recent years, deep learning, empowered by continuous real-valued vector representations and semantic composition through nonlinear processing, has been employed in NER systems, yielding state-of-the-art performance. In this paper, we provide a comprehensive review of existing deep learning techniques for NER. We first introduce NER resources, including tagged NER corpora and off-the-shelf NER tools. Then, we systematically categorize existing works based on a taxonomy along three axes: distributed representations for input, context encoder, and tag decoder. Next, we survey the most representative methods for recent applied techniques of deep learning in new NER problem settings and applications. Finally, we present readers with the challenges faced by NER systems and outline future directions in this area.
Article
This paper deals with the derivation of complex properties relating objects belonging to a set of database schemes and the exploitation of derived properties in query optimization and view maintenance problems. The derivation process has an iterative nature and works by case analysis over database schemes, in a way somewhat similar to the way knowledge discovery processes behave on database extensions: for this reason, we call it Intentional Knowledge Discovery. Complex properties have a fuzzy nature and are represented and manipulated using a fuzzy variant of Description Logics. The paper presents detailed descriptions of the procedures used to derive complex properties. As far as applications are concerned, the paper illustrates several query optimization cases and a view maintenance algorithm, both based on the availability of extracted complex properties.
Conference Paper
We consider the problem of computing the intersection of private datasets of two parties, where the datasets contain lists of elements taken from a large domain. This problem has many applications for online collaboration. We present protocols, based on the use of homomorphic encryption and balanced hashing, for both semi-honest and malicious environments. For lists of length k, we obtain O(k) communication overhead and O(k ln ln k) computation. The protocol for the semi-honest environment is secure in the standard model, while the protocol for the malicious environment is secure in the random oracle model. We also consider the problem of approximating the size of the intersection, show a linear lower bound for the communication overhead of solving this problem, and provide a suitable secure protocol. Lastly, we investigate other variants of the matching problem, including extending the protocol to the multi-party setting as well as considering the problem of approximate matching.
Conference Paper
The Jaro-Winkler similarity is a widely used measure for the similarity of strings. We propose an efficient algorithm for the bounded search of similar strings in a large set of strings. We compared our approach to the naive approach and the approach by Dreßler et al. Our results prove a significant improvement of the efficiency in computation of the bounded Jaro-Winkler similarity for querying of similar strings.
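For readers unfamiliar with the measure, the sketch below shows the standard Jaro-Winkler similarity together with a naive bounded search, which corresponds to the kind of baseline the abstract compares against. It is not the efficient algorithm the paper proposes, whose details are not given here. The function names, the prefix scaling factor of 0.1, and the prefix cap of four characters follow the common textbook definition and are assumptions of this example.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and not too far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for char1, char2 in zip(s1[:4], s2[:4]):
        if char1 != char2:
            break
        prefix += 1
    return base + prefix * scaling * (1.0 - base)


def naive_bounded_search(query, strings, bound):
    """Naive baseline: compare the query with every string, keep those >= bound."""
    return [s for s in strings if jaro_winkler(query, s) >= bound]


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
print(naive_bounded_search("MARTHA", ["MARHTA", "MARTA", "JONES"], 0.9))
```

The naive search costs one full similarity computation per candidate string, which is the cost a bounded-search algorithm aims to avoid on large string sets.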
Article
Ensuring compliance with the General Data Protection Regulation (GDPR) requires building privacy protection into the design of organisations' business processes (privacy by design). It is therefore necessary to take the constraints on the use of personal data into account in the urbanisation plan of information systems (IS). The demonstration presents ARIANE, an integrated platform for personal data governance. ARIANE makes it possible to industrialise privacy protection by building a single reference repository of natural persons.
Conference Paper
For companies, the need to efficiently deal with vast amounts of integrated multi-source data is becoming crucial. Core concerns are 1) proper and flexible human resources management approaches, for 2) more effective resource allocation, as well as 3) team staffing. Here we propose to address the talent search problem. Our approach is based on professional skill characterization and normalization, and on matching between unstructured documents (such as resumes and job descriptions). To this end, we first provide a complete information technology skills taxonomy, together with a taxonomy of companies and their sectors of activity, in order to enhance named entity recognition and normalization. We next design a flexible, scalable and secure architecture integrating multi-source big data, which provides efficient unstructured document analysis and matching. Finally, we evaluate the performance of our platform using real data.
Article
The long-awaited General Data Protection Regulation (GDPR) of the EU was provisionally agreed in December 2015.¹ The final details are still being ironed out, but publication of the final version of the regulation is expected around July 2016.² There will then be a two-year waiting period until every organisation that does business in, or with, the EU must comply with the regulation. Since it is a regulation, not a directive, compliance is mandatory, without the need for each member state to ratify it into its own legislation. The GDPR expands the scope of data protection so that it applies to anyone or any organisation that collects and processes information related to EU citizens, no matter where they are based or where the data is stored. Colin Tankard of Digital Pathways examines what effect the new regulation is likely to have on organisations.
Article
The role of boards of directors requires greater attention to legislative requirements, greater due diligence in transactions and business alliances, and coherent information management strategies for data security. A board should require that management set up a reasonable information and reporting system that covers the strengths and weaknesses of the enterprise's information security system, so as to avert any failure in its oversight duty. A case study of Delaware-based InterCepht, an aerospace and defense contractor for the US and NATO governments, regarding its merger with one of its trading partners in Germany, PuntCode, is also discussed. The InterCepht board asked the company's outside counsel to identify any legal requirements for data security that InterCepht should pay close attention to in the merger's early stage. Counsel proposed that InterCepht develop a comprehensive data security strategy, which would alert it to risks that it would likely encounter during the merger.
Article
The growth of the Internet has triggered tremendous opportunities for cooperative computation, where people jointly conduct computation tasks based on the private inputs they each supply. These computations could occur between mutually untrusted parties, or even between competitors. For example, customers might send to a remote database queries that contain private information; two competing financial organizations might jointly invest in a project that must satisfy both organizations' private and valuable constraints, and so on. Today, to conduct such computations, one entity must usually know the inputs from all the participants; however, if nobody can be trusted enough to know all the inputs, privacy becomes a primary concern. This problem is referred to as the Secure Multi-party Computation (SMC) problem in the literature. Research in the SMC area has focused on only a limited set of specific SMC problems, while privacy-concerned cooperative computations call for SMC studies in a variety of computation domains. Before we can study the problems, we need to identify and define the specific SMC problems for those computation domains. We have developed a framework to facilitate this problem-discovery task. Based on our framework, we have identified and defined a number of new SMC problems for a spectrum of computation domains. Those problems include privacy-preserving database query, privacy-preserving scientific computations, privacy-preserving intrusion detection, privacy-preserving statistical analysis, privacy-preserving geometric computations, and privacy-preserving data mining. The goal of this paper is not only to present our results, but also to serve as a guideline so that other people can identify useful SMC problems in their own computation domains.
Conference Paper
The use of cryptographic hash functions like MD5 or SHA-1 for message authentication has become a standard approach in many applications, particularly Internet security protocols. Though very easy to implement, these mechanisms are usually based on ad hoc techniques that lack a sound security analysis. We present new, simple, and practical constructions of message authentication schemes based on a cryptographic hash function. Our schemes, NMAC and HMAC, are proven to be secure as long as the underlying hash function has some reasonable cryptographic strengths. Moreover we show, in a quantitative way, that the schemes retain almost all the security of the underlying hash function. The performance of our schemes is essentially that of the underlying hash function. Moreover they use the hash function (or its compression function) as a black box, so that widely available library code or hardware can be used to implement them in a simple way, and replaceability of the underlying hash function is easily supported.
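The black-box use of a hash function that this abstract describes can be illustrated with a short sketch of the HMAC construction. The code below builds HMAC by hand on top of SHA-256 and checks it against Python's standard `hmac` module. SHA-256 is chosen here for illustration (the abstract itself mentions MD5 and SHA-1), and the function name `hmac_sha256`, the key, and the message are assumptions of the example. Production code should call the standard library directly.

```python
import hashlib
import hmac


def hmac_sha256(key: bytes, message: bytes) -> bytes:
    """HMAC built by hand, using SHA-256 purely as a black box."""
    block_size = 64  # SHA-256 processes input in 64-byte blocks
    if len(key) > block_size:
        key = hashlib.sha256(key).digest()  # long keys are hashed first
    key = key.ljust(block_size, b"\x00")    # short keys are zero-padded
    inner_pad = bytes(b ^ 0x36 for b in key)
    outer_pad = bytes(b ^ 0x5C for b in key)
    inner_digest = hashlib.sha256(inner_pad + message).digest()
    return hashlib.sha256(outer_pad + inner_digest).digest()


# Hypothetical key and message, for illustration only.
key = b"shared-secret-key"
message = b"column=employee_email;table=hr_staff"

manual_tag = hmac_sha256(key, message)
library_tag = hmac.new(key, message, hashlib.sha256).digest()

# compare_digest avoids leaking timing information when checking tags.
print(hmac.compare_digest(manual_tag, library_tag))
```

Because the hash function appears only inside two calls, swapping in another hash means changing the function and its block size, which reflects the replaceability property the abstract highlights.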
ARIANE: la Gouvernance des Données comme Accélérateur de Conformité au Règlement Général sur la Protection des Données
  • bentounsi