
Reverse-Safe Data Structures for Text Indexing

Abstract

Talk at ALENEX20, January 5-6, Salt Lake City, Utah, U.S.
Giulia Bernardini (1,4), Huiping Chen (2), Gabriele Fici (3), Grigorios Loukides (2), Solon P. Pissis (4)
(1) University of Milano - Bicocca, Italy
(2) King’s College London, UK
(3) University of Palermo, Italy
(4) CWI Amsterdam, The Netherlands
Giulia Bernardini Reverse-Safe Data Structures ALENEX 2020
How do we view data structures?
[Diagram: Dataset → Preprocess → Data structure. The data structure answers queries:]
q1(x) → Yes
q2(x,y) → {3, 8}
q3(y) → 14
...
[Final build: the label "Dataset" reappears alongside the query answers.]
Reverse-safe data structures
[Diagram: one data structure and its query answers, shown alongside Dataset 1, Dataset 2, Dataset 3, ...]
q1(x) → Yes
q2(x,y) → {3, 8}
q3(y) → 14
...
z-reverse-safe data structures
[Diagram: one z-reverse-safe data structure and its query answers, shown alongside Dataset 1, Dataset 2, Dataset 3, ..., Dataset z, ...]
q1(x) → Yes
q2(x,y) → {3, 8}
q3(y) → 14
...
z-RSDS for text indexing
Main result: given z and a text of length n, build a z-RSDS of size 𝓞(n) in 𝓞(n^𝛚 log d) time that answers pattern matching queries of length m ≤ d in 𝓞(m) time, where d is maximal.
[Diagram: Text of length n → z-RSDS of size 𝓞(n), built in 𝓞(n^𝛚 log d) time]
Text indexing: suffix tree
Weiner’73: Given a query of length m we can answer decision,
counting, and reporting queries in 𝓞(m), 𝓞(m), and 𝓞(m+occ) time, resp.
Text indexing: truncated suffix tree
Na et al’03: The truncated suffix tree T_d(S) is constructible in 𝓞(n) time.
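To make the idea of a depth-limited index concrete, here is a minimal sketch. It is not the construction of Na et al. and it is not a compacted suffix tree: it is a plain, uncompacted trie over all substrings of length at most d, which can take 𝓞(nd) space. The names build_truncated_trie and count_occurrences are hypothetical and chosen only for this illustration.

```python
def build_truncated_trie(text, d):
    """Uncompacted trie of every substring of `text` with length <= d.

    Each node stores how many times the substring spelled by the path
    from the root occurs in `text`. Illustration only: O(n*d) space.
    """
    root = {"d": d, "count": 0, "children": {}}
    for i in range(len(text)):
        node = root
        # Insert the length-d prefix of the suffix starting at position i.
        for ch in text[i:i + d]:
            node = node["children"].setdefault(ch, {"count": 0, "children": {}})
            node["count"] += 1
    return root


def count_occurrences(trie, pattern):
    """Counting query in O(m) dictionary steps for a pattern of length 1 <= m <= d."""
    if not 1 <= len(pattern) <= trie["d"]:
        raise ValueError("pattern length must be between 1 and d")
    node = trie
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return 0  # the pattern does not occur
    return node["count"]


trie = build_truncated_trie("abaabbabba", 3)
print(count_occurrences(trie, "abb"))
```

A decision query is then simply count_occurrences(trie, pattern) > 0. Patterns longer than d are rejected because the index holds no information about them.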
Text indexing: truncated suffix tree
Two strings S and S’ are d-equivalent if and only if T_d(S) = T_d(S’).
Example: abaabbabba and abbaabbaba are 3-equivalent.
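The example can be checked with a short sketch. It rests on an assumption: that T_d(S) = T_d(S’) holds exactly when the two strings have the same length and the same occurrence count for every substring of length at most d, which is what counting queries of length m ≤ d would reveal. The function names substring_profile and d_equivalent are hypothetical.

```python
from collections import Counter


def substring_profile(s, d):
    """Occurrence counts of all substrings of s with length 1..d."""
    return Counter(
        s[i:i + k]
        for k in range(1, d + 1)
        for i in range(len(s) - k + 1)
    )


def d_equivalent(s, t, d):
    """Assumed test for d-equivalence: same length, same profile up to length d."""
    return len(s) == len(t) and substring_profile(s, d) == substring_profile(t, d)


print(d_equivalent("abaabbabba", "abbaabbaba", 3))
print(d_equivalent("abaabbabba", "abbaabbaba", 4))
```

Under this assumption the two strings agree on all substrings of length up to 3, but differ at length 4 (for instance, abaa occurs only in the first string).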
Text indexing: truncated suffix tree
Lemma: The number 𝛼_d of strings that are d-equivalent to S is monotonically non-increasing for increasing d.
d = 0: 𝛼_d = |𝛴|^n
d = n: 𝛼_d = 1
Suffix tree of S
→ Binary search for optimal d
De Bruijn graph
Hutchinson’75: the number of distinct Eulerian paths in G_{S,d} is equal to the number of strings that are d-equivalent to S.
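The slide does not spell out how G_{S,d} is built. The sketch below assumes the usual order-d de Bruijn multigraph: nodes are the substrings of length d-1, and every occurrence of a length-d substring contributes one edge from its prefix to its suffix. The function name de_bruijn_multigraph is hypothetical.

```python
from collections import Counter


def de_bruijn_multigraph(s, d):
    """Map each edge (u, v) to its multiplicity.

    Assumed definition: u and v are the length-(d-1) prefix and suffix
    of a length-d substring of s, one edge per occurrence.
    """
    if not 2 <= d <= len(s):
        raise ValueError("d must satisfy 2 <= d <= len(s)")
    edges = Counter()
    for i in range(len(s) - d + 1):
        window = s[i:i + d]
        edges[(window[:-1], window[1:])] += 1
    return edges


graph = de_bruijn_multigraph("abaabbabba", 3)
for (u, v), multiplicity in sorted(graph.items()):
    print(u, "->", v, "x", multiplicity)
```

Reading S left to right traces a path that uses every edge exactly once, so S itself corresponds to one Eulerian path in this multigraph.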
De Bruijn graph
Kirchhoff's theorem: the Eulerian paths in a directed multigraph G with n nodes can be counted in 𝓞(n^𝛚) time.
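As a rough illustration of determinant-based counting, the sketch below counts Eulerian circuits of a connected, balanced directed multigraph using the BEST theorem together with the matrix-tree (Kirchhoff) theorem; that this is the combination the slide has in mind is an assumption. It uses exact cubic-time Gaussian elimination rather than fast matrix multiplication, treats parallel edges as distinguishable, and counts circuits up to rotation. It does not perform the extra accounting needed to turn this number into a count of d-equivalent strings. The function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def determinant(matrix):
    """Exact determinant via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in matrix]
    n = len(m)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def count_eulerian_circuits(nodes, edges):
    """BEST theorem: (#arborescences) * product over nodes of (out-degree - 1)!.

    `edges` maps (u, v) to a multiplicity. Assumes the graph is connected
    and every node has equal in-degree and out-degree (at least 1).
    """
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    out_deg, in_deg = [0] * n, [0] * n
    laplacian = [[0] * n for _ in range(n)]
    for (u, v), mult in edges.items():
        out_deg[index[u]] += mult
        in_deg[index[v]] += mult
        laplacian[index[u]][index[v]] -= mult
    if in_deg != out_deg:
        raise ValueError("graph is not balanced")
    for i in range(n):
        laplacian[i][i] += out_deg[i]
    # Matrix-tree theorem: delete the row and column of one node.
    minor = [row[1:] for row in laplacian[1:]]
    result = determinant(minor)
    for deg in out_deg:
        result *= factorial(deg - 1)
    return int(result)


print(count_eulerian_circuits(["a", "b"], {("a", "b"): 2, ("b", "a"): 2}))
```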
The algorithm
For each d implied by binary search:
  Build the de Bruijn graph G_{S,d} in 𝓞(n) time
  Count Eulerian paths in 𝓞(n^𝛚) time to check whether 𝛼_d ≥ z
For d optimal, pick any Eulerian path on G_{S,d} to find S’ ~_d S
Build and return T_d(S’) in 𝓞(n) time
Main result: T_d(S’) is a z-RSDS of size 𝓞(n) built in 𝓞(n^𝛚 log n) time.
It answers pattern matching queries of length m ≤ d in 𝓞(m) time.
The algorithm
For each d implied by exponential search:
  Build the de Bruijn graph G_{S,d} in 𝓞(n) time
  Count Eulerian paths in 𝓞(n^𝛚) time to check whether 𝛼_d ≥ z
For d optimal, pick any Eulerian path on G_{S,d} to find S’ ~_d S
Build and return T_d(S’) in 𝓞(n) time
Main result: T_d(S’) is a z-RSDS of size 𝓞(n) built in 𝓞(n^𝛚 log d) time.
It answers pattern matching queries of length m ≤ d in 𝓞(m) time.
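The search for the largest d with 𝛼_d ≥ z can be sketched as follows. This is a toy version under stated assumptions: 𝛼_d is computed by brute-force enumeration of all strings over the letters of S (exponential time, short inputs only) instead of counting Eulerian paths, and d-equivalence is assumed to mean equal occurrence counts for all substrings of length at most d. The names profile, alpha and largest_safe_d are hypothetical.

```python
from collections import Counter
from itertools import product


def profile(s, d):
    """Occurrence counts of all substrings of s with length 1..d."""
    return Counter(
        s[i:i + k] for k in range(1, d + 1) for i in range(len(s) - k + 1)
    )


def alpha(s, d):
    """Brute-force stand-in for counting the strings d-equivalent to s."""
    target = profile(s, d)
    letters = sorted(set(s))
    return sum(
        1
        for t in product(letters, repeat=len(s))
        if profile("".join(t), d) == target
    )


def largest_safe_d(s, z):
    """Largest d with alpha(s, d) >= z, or None if even d = 0 fails."""
    n = len(s)
    if alpha(s, 0) < z:
        return None
    # Exponential phase: double d while the threshold still holds.
    lo, hi = 0, 1
    while hi <= n and alpha(s, hi) >= z:
        lo, hi = hi, hi * 2
    hi = min(hi, n + 1)  # n + 1 acts as an untested sentinel
    # Binary phase: alpha(lo) >= z, and hi fails or is the sentinel.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if alpha(s, mid) >= z:
            lo = mid
        else:
            hi = mid
    return lo


print(largest_safe_d("abaabbabba", 2))
```

The doubling phase stops within a factor of two of the answer, so the number of threshold checks is logarithmic in d rather than in n, which matches the change from log n to log d between the two versions of the main result.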
Engineering the algorithm: reducing the binary search interval
Fici et al’06: Let r(S) be the length of a longest repeated substring of S. S is uniquely determined by its substrings of length r(S) + 2.
d = 0: 𝛼_d = |𝛴|^n
d = r(S) + 2: 𝛼_d = 1
𝓞(n) time
Suffix tree of S
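For small inputs, the upper end r(S) + 2 of the search interval can be computed with a naive sketch like the one below. It is a quadratic-or-worse stand-in for the linear-time computation indicated on the slide, and it assumes that the two occurrences of a repeated substring are allowed to overlap. The function name longest_repeat_length is hypothetical.

```python
def longest_repeat_length(s):
    """Length of a longest substring occurring at least twice in s.

    Occurrences may overlap (an assumption). Relies on the fact that if a
    substring of length L repeats, so does its prefix of length L - 1.
    """
    n = len(s)
    best = 0
    for length in range(1, n):
        seen = set()
        repeated = False
        for i in range(n - length + 1):
            sub = s[i:i + length]
            if sub in seen:
                repeated = True
                break
            seen.add(sub)
        if not repeated:
            break  # no repeat of this length, so none of any greater length
        best = length
    return best


s = "abaabbabba"
upper_bound = longest_repeat_length(s) + 2
print(upper_bound)
```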
Engineering the algorithm: two more improvements
If d is good for a prefix of S then it is also good for S
  → operate on prefixes of S using prefix doubling.
If G_{S,d} is big then the Laplacian matrix of G_{S,d} is sparse
  → employ sparse LU decomposition algorithms (Gilbert et al’88).
Experiments: utility
MSN dataset: n = 4698764, |𝛴| = 17
Experiments: runtime
SYN datasets: n from 1M to 50M, |𝛴| = 10
Thank you for your attention!
And thanks to the ACM Special Interest Group on
Algorithms and Computation Theory (SIGACT)
for supporting me with a travel award