
Abstract

Talk at ALENEX20, January 5-6, Salt Lake City, Utah, U.S.
Reverse-Safe Data Structures for Text Indexing
Giulia Bernardini (1,4), Huiping Chen (2), Gabriele Fici (3), Grigorios Loukides (2), Solon P. Pissis (4)
(1) University of Milano - Bicocca, Italy
(2) King’s College London, UK
(3) University of Palermo, Italy
(4) CWI Amsterdam, The Netherlands
Giulia Bernardini Reverse-Safe Data Structures ALENEX 2020
How do we view data structures?
Dataset → Preprocess → Data structure
Queries to the data structure and their answers:
q1(x) → Yes
q2(x,y) → {3, 8}
q3(y) → 14
...
[Diagram, final build: the Dataset reappears next to the query answers.]
Reverse-safe data structures
Data structure
q1(x) → Yes
q2(x,y) → {3, 8}
q3(y) → 14
...
[Diagram: Dataset 1, Dataset 2, Dataset 3, ... shown next to the same query answers.]
z-reverse-safe data structures
z-reverse-safe data structure
q1(x) → Yes
q2(x,y) → {3, 8}
q3(y) → 14
...
[Diagram: Dataset 1, Dataset 2, Dataset 3, ..., Dataset z, ... shown next to the same query answers.]
z-RSDS for text indexing
Main result: given z and a text of length n, build a z-RSDS of size 𝓞(n) in 𝓞(n^𝛚 log d) time answering pattern matching queries of length m ≤ d in 𝓞(m) time, where d is maximal.
Text of length n → z-RSDS of size 𝓞(n), in 𝓞(n^𝛚 log d) time
Text indexing: suffix tree
Weiner’73: Given a query of length m we can answer decision,
counting, and reporting queries in 𝓞(m), 𝓞(m), and 𝓞(m+occ) time, resp.
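A minimal sketch of the three query types, assuming a plain suffix array in place of the suffix tree. This is a swapped-in, simpler technique: construction here is naive and each query costs 𝓞(m log n) comparisons' worth of work rather than the 𝓞(m) stated on the slide. All function names are hypothetical.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions sorted by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def _match_range(text, sa, pattern):
    """Half-open range of suffix-array slots whose suffix starts with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first slot whose m-letter prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first slot whose m-letter prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def decide(text, sa, pattern):
    """Decision query: does pattern occur in text?"""
    start, end = _match_range(text, sa, pattern)
    return end > start


def count(text, sa, pattern):
    """Counting query: number of occurrences of pattern."""
    start, end = _match_range(text, sa, pattern)
    return end - start


def report(text, sa, pattern):
    """Reporting query: sorted starting positions of all occurrences."""
    start, end = _match_range(text, sa, pattern)
    return sorted(sa[start:end])


text = "abaabbabba"
sa = build_suffix_array(text)
print(decide(text, sa, "bbb"), count(text, sa, "ba"), report(text, sa, "abb"))
```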
Text indexing: truncated suffix tree
Na et al’03: The truncated suffix tree T_d(S) is constructible in 𝓞(n) time.
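To illustrate what a d-truncated index can and cannot answer, here is a minimal sketch that assumes a dictionary of substring counts instead of an actual truncated suffix tree. It uses 𝓞(nd) space and time, so it is not the linear-time construction cited above; the class name `TruncatedIndex` and its methods are hypothetical.

```python
from collections import Counter


class TruncatedIndex:
    """Stand-in for a d-truncated index: counts of all substrings of length 1..d."""

    def __init__(self, text, d):
        self.d = d
        self.counts = Counter(
            text[i:i + k]
            for k in range(1, d + 1)
            for i in range(len(text) - k + 1)
        )

    def count(self, pattern):
        """Counting query; only patterns of length m <= d are supported."""
        if not 1 <= len(pattern) <= self.d:
            raise ValueError("pattern length must be between 1 and d")
        return self.counts[pattern]

    def contains(self, pattern):
        """Decision query for patterns of length m <= d."""
        return self.count(pattern) > 0


index = TruncatedIndex("abaabbabba", d=3)
print(index.count("ab"), index.contains("bbb"))
```

Queries longer than d are rejected rather than answered, which mirrors the restriction m ≤ d in the main result.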
Two strings S and S’ are d-equivalent if and only if T_d(S) = T_d(S’).
Example: abaabbabba and abbaabbaba are 3-equivalent.
Lemma: The number 𝛼_d of strings which are d-equivalent to S is monotonically non-increasing for increasing d.
d = 0: 𝛼_d = |𝛴|^n
d = n: 𝛼_d = 1 (suffix tree of S)
Binary search for optimal d
De Bruijn graph
Hutchinson’75: the number of distinct Eulerian paths in G_{S,d} is equal to the number of strings that are d-equivalent to S.
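The slide does not spell out how G_{S,d} is built. The sketch below assumes the usual order-d construction: nodes are the substrings of length d-1 and every occurrence of a length-d substring contributes one edge from its prefix to its suffix, so parallel edges carry multiplicities. The function name is hypothetical, and the two test strings are one reading of the 3-equivalence example, split into two 10-letter halves.

```python
from collections import Counter


def de_bruijn_multigraph(s, d):
    """Edge multiset of the assumed order-d de Bruijn multigraph of s.

    Keys are (prefix, suffix) pairs of each length-d substring;
    values are edge multiplicities.
    """
    if d < 2 or d > len(s):
        raise ValueError("d must satisfy 2 <= d <= len(s)")
    return Counter((s[i:i + d - 1], s[i + 1:i + d]) for i in range(len(s) - d + 1))


g1 = de_bruijn_multigraph("abaabbabba", 3)
g2 = de_bruijn_multigraph("abbaabbaba", 3)
print(g1 == g2)  # both strings spell an Eulerian path in the same multigraph
```

Under this assumed construction, each string traces an Eulerian path in its own graph, so two strings with the same edge multiset are two Eulerian paths in one graph.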
De Bruijn graph
Kirchhoff's theorem: the Eulerian paths in a directed multigraph G with n nodes can be counted in 𝓞(n^𝛚) time.
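The determinant step behind this kind of counting can be sketched as follows. This is only one ingredient, under stated assumptions: it counts spanning arborescences directed toward a root as the determinant of a Laplacian minor (the directed matrix-tree theorem), using exact cubic-time Gaussian elimination rather than fast matrix multiplication. Turning that number into a count of Eulerian paths needs further factors (for example via the BEST theorem) that are not shown here. The function name is hypothetical.

```python
from fractions import Fraction


def count_in_arborescences(n, arcs, root):
    """Number of spanning arborescences directed toward root.

    n    : number of nodes, labelled 0..n-1
    arcs : list of (u, v) pairs, repeated pairs are parallel arcs
    Uses the out-degree Laplacian L = D_out - A with row and column root deleted.
    """
    lap = [[Fraction(0)] * n for _ in range(n)]
    for u, v in arcs:
        if u == v:
            continue  # self-loops cancel out in D_out - A
        lap[u][u] += 1
        lap[u][v] -= 1
    keep = [i for i in range(n) if i != root]
    m = [[lap[i][j] for j in keep] for i in keep]

    det = Fraction(1)
    size = len(m)
    for col in range(size):
        pivot = next((r for r in range(col, size) if m[r][col] != 0), None)
        if pivot is None:
            return 0  # singular minor: no spanning arborescence
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size):
                m[r][c] -= factor * m[col][c]
    return int(det)


cycle = [(0, 1), (1, 2), (2, 0)]
print(count_in_arborescences(3, cycle, root=0))
```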
The algorithm
For each d implied by binary search:
    Build the de Bruijn graph G_{S,d} in 𝓞(n) time
    Count Eulerian paths in 𝓞(n^𝛚) time to check 𝛼_d ≥ z
For d optimal, pick any Eulerian path on G_{S,d} to find S’ ~_d S
Build and return T_d(S’) in 𝓞(n) time
Main result: T_d(S’) is a z-RSDS of size 𝓞(n) built in 𝓞(n^𝛚 log n) time.
It answers pattern matching queries of length m ≤ d in 𝓞(m) time.
The algorithm
For each d implied by exponential search:
    Build the de Bruijn graph G_{S,d} in 𝓞(n) time
    Count Eulerian paths in 𝓞(n^𝛚) time to check 𝛼_d ≥ z
For d optimal, pick any Eulerian path on G_{S,d} to find S’ ~_d S
Build and return T_d(S’) in 𝓞(n) time
Main result: T_d(S’) is a z-RSDS of size 𝓞(n) built in 𝓞(n^𝛚 log d) time.
It answers pattern matching queries of length m ≤ d in 𝓞(m) time.
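The search over d can be sketched generically. The code below assumes a monotone predicate `is_good(d)` (true for small d, false beyond some threshold), standing in for the test 𝛼_d ≥ z; the predicate and the function name `largest_good` are hypothetical. Doubling followed by binary search calls the predicate 𝓞(log d) times for the returned d, which is where a log d factor instead of log n would come from.

```python
def largest_good(is_good, upper):
    """Largest d in [0, upper] with is_good(d), for a monotone predicate.

    Assumes is_good(0) holds and that is_good never becomes true again
    once it has become false.
    """
    if not is_good(0):
        raise ValueError("is_good(0) must hold")
    lo, hi = 0, 1
    # Doubling phase: probe d = 1, 2, 4, ... until the predicate fails.
    while hi <= upper and is_good(hi):
        lo = hi
        hi *= 2
    hi = min(hi, upper + 1)  # hi is now either a failing value or out of range
    # Binary search phase: is_good(lo) holds, hi is failing or out of range.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(mid):
            lo = mid
        else:
            hi = mid
    return lo


print(largest_good(lambda d: d <= 5, upper=1_000_000))
```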
Engineering the algorithm: reducing the BS interval
Fici et al’06: Let r(S) be the length of a longest repeated substring of S. S is uniquely determined by its substrings of length r(S) + 2.
d = 0: 𝛼_d = |𝛴|^n
d = r(S) + 2: 𝛼_d = 1
Suffix tree of S: 𝓞(n) time
Engineering the algorithm: two more improvements
If d is good for a prefix of S then it is also good for S → operate on prefixes of S using prefix doubling.
If G_{S,d} is big then the Laplacian matrix of G_{S,d} is sparse → employ sparse LU decomposition algorithms (Gilbert et al’88).
Experiments: utility
MSN dataset: n = 4698764, |𝛴| = 17
Experiments: runtime
SYN datasets: n from 1M to 50M, |𝛴| = 10
Thank you for your attention!
And thanks to the ACM Special Interest Group on
Algorithms and Computation Theory (SIGACT)
for supporting me with a travel award