All content in this area was uploaded by Giulia Bernardini on Jan 07, 2020
Reverse-Safe Data Structures for Text Indexing
Giulia Bernardini (1,4), Huiping Chen (2), Gabriele Fici (3), Grigorios Loukides (2), Solon P. Pissis (4)
(1) University of Milano - Bicocca, Italy
(2) King’s College London, UK
(3) University of Palermo, Italy
(4) CWI Amsterdam, The Netherlands
Giulia Bernardini Reverse-Safe Data Structures ALENEX 2020
How do we view data structures?
[Figure: Dataset → Preprocess → Data structure]
How do we view data structures?
[Figure: Data structure; queries and answers q1(x) → Yes, q2(x,y) → {3, 8}, q3(y) → 14, ...; Dataset]
Reverse-safe data structures
[Figure: Data structure; queries and answers q1(x) → Yes, q2(x,y) → {3, 8}, q3(y) → 14, ...; Dataset 1, Dataset 2, Dataset 3, ...]
z-reverse-safe data structures
[Figure: z-reverse-safe data structure; queries and answers q1(x) → Yes, q2(x,y) → {3, 8}, q3(y) → 14, ...; Dataset 1, Dataset 2, Dataset 3, ..., Dataset z, ...]
z-RSDS for text indexing
Main result: given z and a text of length n, build a z-RSDS of size 𝓞(n) in 𝓞(n^𝛚 log d) time answering pattern matching queries of length m ≤ d in 𝓞(m) time, where d is maximal.
[Figure: Text of length n → z-RSDS of size 𝓞(n), built in 𝓞(n^𝛚 log d) time]
Text indexing: suffix tree
Weiner’73: given a query pattern of length m, we can answer decision, counting, and reporting queries in 𝓞(m), 𝓞(m), and 𝓞(m+occ) time, respectively, where occ is the number of occurrences of the pattern.
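The three query types can be illustrated with a small Python sketch. It does not build a suffix tree: as a swapped-in stand-in it sorts all suffixes naively and uses binary search, so a query costs roughly 𝓞(m log n) instead of 𝓞(m). The names build_index, decide, count and report are hypothetical, and the text is assumed not to contain the character U+10FFFF.

```python
from bisect import bisect_left, bisect_right

def build_index(text):
    """Naively sort all suffixes of text (a stand-in for a suffix tree)."""
    starts = sorted(range(len(text)), key=lambda i: text[i:])
    suffixes = [text[i:] for i in starts]
    return suffixes, starts

def report(index, pattern):
    """Reporting query: sorted starting positions of pattern in the text."""
    suffixes, starts = index
    # Suffixes that begin with pattern form one contiguous block.
    lo = bisect_left(suffixes, pattern)
    # Assumption: U+10FFFF never occurs in the text, so it acts as a sentinel.
    hi = bisect_right(suffixes, pattern + "\U0010FFFF")
    return sorted(starts[lo:hi])

def count(index, pattern):
    """Counting query: number of occurrences of pattern."""
    return len(report(index, pattern))

def decide(index, pattern):
    """Decision query: does pattern occur at all?"""
    return count(index, pattern) > 0

index = build_index("abaabbabba")
print(decide(index, "bab"), count(index, "ab"), report(index, "abb"))
```

A real suffix tree answers the same three queries without the logarithmic factor; the sketch only fixes what each query returns.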
Text indexing: truncated suffix tree
Na et al.’03: the truncated suffix tree T_d(S) is constructible in 𝓞(n) time.
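To make the idea of truncation concrete, here is a minimal sketch of a depth-limited index. It is a plain trie built in 𝓞(nd) time, not the linear-time compacted construction of Na et al.; the names Node, build_truncated_trie and count are hypothetical. Each node stores how many times its substring occurs, so a counting query of length m ≤ d follows m edges.

```python
class Node:
    """Trie node: children by character, plus an occurrence count."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0

def build_truncated_trie(text, d):
    """Insert the first d characters of every suffix of text."""
    root = Node()
    for i in range(len(text)):
        node = root
        for ch in text[i:i + d]:
            if ch not in node.children:
                node.children[ch] = Node()
            node = node.children[ch]
            node.count += 1  # one more occurrence of this substring
    return root

def count(root, pattern):
    """Occurrences of a non-empty pattern of length at most d."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.count

trie = build_truncated_trie("abaabbabba", 3)
print(count(trie, "abb"), count(trie, "ab"), count(trie, "bbb"))
```

Patterns longer than d cannot be answered from this structure, which is the property the rest of the talk exploits.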
Text indexing: truncated suffix tree
Two strings S and S’ are d-equivalent if and only if T_d(S) = T_d(S’).
Example: abaabbabba and abbaabbaba are distinct strings, yet they are 3-equivalent.
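The example can be checked mechanically. The sketch below assumes that T_d(S) = T_d(S’) may be read as "every pattern of length at most d occurs equally often in both strings"; this reading, and the names substring_counts and d_equivalent, are assumptions made for illustration.

```python
from collections import Counter

def substring_counts(s, d):
    """Occurrence count of every substring of s of length 1..d."""
    return Counter(
        s[i:i + k]
        for k in range(1, d + 1)
        for i in range(len(s) - k + 1)
    )

def d_equivalent(s, t, d):
    """True if s and t agree on all counting queries of length <= d."""
    return len(s) == len(t) and substring_counts(s, d) == substring_counts(t, d)

print(d_equivalent("abaabbabba", "abbaabbaba", 3))
print(d_equivalent("abaabbabba", "abbaabbaba", 4))
```

Under this reading the two strings agree up to length 3 and differ at length 4 (for instance, abaa occurs only in the first one).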
Text indexing: truncated suffix tree
Lemma: The number 𝛼_d of strings which are d-equivalent to S is monotonically non-increasing for increasing d.
d = n ⇒ 𝛼_d = 1
d = 0 ⇒ 𝛼_d = |𝛴|^n
Binary search for optimal d
[Figure: suffix tree of S]
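For a very short string the lemma can be observed by brute force. The sketch below enumerates all |𝛴|^n strings and counts those that agree with S on every substring of length at most d; that reading of d-equivalence and the names substring_counts and alpha are assumptions, and the enumeration is exponential, so it only serves as an illustration.

```python
from collections import Counter
from itertools import product

def substring_counts(s, d):
    """Occurrence count of every substring of s of length 1..d."""
    return Counter(
        s[i:i + k]
        for k in range(1, d + 1)
        for i in range(len(s) - k + 1)
    )

def alpha(s, d, alphabet):
    """Brute-force number of strings of length len(s) that are d-equivalent to s."""
    target = substring_counts(s, d)
    return sum(
        1
        for t in product(alphabet, repeat=len(s))
        if substring_counts("".join(t), d) == target
    )

s = "abaabbabba"
values = [alpha(s, d, "ab") for d in range(len(s) + 1)]
print(values)  # non-increasing, from |alphabet|**n down to 1
```

Because the sequence is monotone, the largest d with 𝛼_d ≥ z can be located by searching over d instead of testing every value.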
De Bruijn graph
Hutchinson’75: the number of distinct Eulerian paths in the de Bruijn graph G_{S,d} is equal to the number of strings that are d-equivalent to S.
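A small sketch of the correspondence between Eulerian paths and strings follows. It assumes G_{S,d} has the substrings of length d-1 as nodes and one edge per occurrence of a length-d substring, which the slide does not spell out; the function name eulerian_string is hypothetical. The path is found with Hierholzer's algorithm, starting at the node of the length-(d-1) prefix of S.

```python
from collections import Counter, defaultdict

def eulerian_string(s, d):
    """Spell the string of one Eulerian path of the de Bruijn multigraph of s."""
    out = defaultdict(list)  # node -> outgoing edges, each edge is a d-mer
    for i in range(len(s) - d + 1):
        kmer = s[i:i + d]
        out[kmer[:-1]].append(kmer)
    for edges in out.values():
        edges.sort()  # the edge order decides which path is produced

    start = s[:d - 1]
    stack = [(start, None)]  # (node, edge used to reach it)
    path = []
    while stack:
        node, edge = stack[-1]
        if out[node]:
            kmer = out[node].pop()
            stack.append((kmer[1:], kmer))
        else:
            stack.pop()
            if edge is not None:
                path.append(edge)
    path.reverse()
    # Each edge contributes its last character to the spelled string.
    return start + "".join(kmer[-1] for kmer in path)

s = "abaabbabba"
t = eulerian_string(s, 3)
print(t)
```

Which string comes out depends on the edge order; any of them uses every length-d substring of S exactly as often as S does.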
De Bruijn graph
Kirchhoff's theorem: the Eulerian paths in a directed multigraph G with n nodes can be counted in 𝓞(n^𝛚) time, where 𝛚 is the matrix multiplication exponent.
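A possible way to carry out the count is sketched below. It combines the matrix-tree (Kirchhoff) determinant with the BEST theorem: a closing edge from the last node back to the first turns paths into circuits, and the result is divided by the factorials of the edge multiplicities so that parallel edges are not distinguished. The closing edge, this normalization, and the names int_det and count_d_equivalent are assumptions of the sketch; it also uses exact cubic-time elimination rather than fast matrix multiplication, and expects 1 ≤ d ≤ n.

```python
from collections import Counter
from math import factorial

def int_det(m):
    """Exact determinant of an integer matrix (Bareiss elimination)."""
    m = [row[:] for row in m]
    n = len(m)
    if n == 0:
        return 1
    sign, prev = 1, 1
    for k in range(n - 1):
        if m[k][k] == 0:
            swap = next((i for i in range(k + 1, n) if m[i][k] != 0), None)
            if swap is None:
                return 0
            m[k], m[swap] = m[swap], m[k]
            sign = -sign
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                m[i][j] = (m[i][j] * m[k][k] - m[i][k] * m[k][j]) // prev
        prev = m[k][k]
    return sign * m[n - 1][n - 1]

def count_d_equivalent(s, d):
    """Number of distinct strings spelled by Eulerian paths of the graph of s."""
    edges = Counter(s[i:i + d] for i in range(len(s) - d + 1))
    nodes = sorted({e[:-1] for e in edges} | {e[1:] for e in edges})
    idx = {v: i for i, v in enumerate(nodes)}
    k = len(nodes)
    out = [0] * k
    adj = [[0] * k for _ in range(k)]
    for e, mult in edges.items():
        u, v = idx[e[:-1]], idx[e[1:]]
        out[u] += mult
        adj[u][v] += mult
    # Closing edge: from the node of the suffix back to the node of the prefix.
    first, last = idx[s[:d - 1]], idx[s[len(s) - d + 1:]]
    out[last] += 1
    adj[last][first] += 1
    # Laplacian with one row and column removed counts spanning arborescences.
    lap = [[(out[i] if i == j else 0) - adj[i][j] for j in range(k)] for i in range(k)]
    total = int_det([row[1:] for row in lap[1:]])
    for deg in out:
        total *= factorial(deg - 1)  # BEST theorem factor
    for mult in edges.values():
        total //= factorial(mult)    # parallel edges are interchangeable
    return total

print(count_d_equivalent("abaabbabba", 3))
```

For the running example and d = 3 this counts the paths that start at ab and end at ba.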
The algorithm
For each d implied by binary search:
● Build the de Bruijn graph G_{S,d} in 𝓞(n) time
● Count Eulerian paths in 𝓞(n^𝛚) time to check 𝛼_d ≥ z
For d optimal, pick any Eulerian path on G_{S,d} to find S’ ~_d S
Build and return T_d(S’) in 𝓞(n) time
Main result: T_d(S’) is a z-RSDS of size 𝓞(n) built in 𝓞(n^𝛚 log n) time.
It answers pattern matching queries of length m ≤ d in 𝓞(m) time.
The algorithm
For each d implied by exponential search:
● Build the de Bruijn graph G_{S,d} in 𝓞(n) time
● Count Eulerian paths in 𝓞(n^𝛚) time to check 𝛼_d ≥ z
For d optimal, pick any Eulerian path on G_{S,d} to find S’ ~_d S
Build and return T_d(S’) in 𝓞(n) time
Main result: T_d(S’) is a z-RSDS of size 𝓞(n) built in 𝓞(n^𝛚 log d) time.
It answers pattern matching queries of length m ≤ d in 𝓞(m) time.
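The search over d can be sketched independently of how 𝛼_d is computed. In the code below, is_good(d) stands for the test 𝛼_d ≥ z and is assumed to be monotone (true up to the optimal d, false afterwards) with is_good(0) true; largest_good_d and is_good are hypothetical names. The doubling phase is what brings the number of tests down to 𝓞(log d) instead of 𝓞(log n).

```python
def largest_good_d(n, is_good):
    """Largest d in [0, n] with is_good(d), for a monotone predicate."""
    # Exponential phase: double until the predicate fails or n is passed.
    hi = 1
    while hi <= n and is_good(hi):
        hi *= 2
    lo = hi // 2          # last value known to be good (or 0)
    hi = min(hi, n + 1)   # first value known to be bad (or the sentinel n + 1)
    # Binary phase inside the open interval (lo, hi).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(mid):
            lo = mid
        else:
            hi = mid
    return lo

calls = []
def is_good(d):
    calls.append(d)
    return d <= 5  # toy predicate: pretend alpha_d >= z holds exactly for d <= 5

print(largest_good_d(1000, is_good), len(calls))
```

In the toy run the predicate is evaluated only at a handful of values near the answer, even though n is 1000.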
Engineering the algorithm: reducing the binary search interval
Fici et al.’06: Let r(S) be the length of a longest repeated substring of S. S is uniquely determined by its substrings of length r(S) + 2.
d = r(S) + 2 ⇒ 𝛼_d = 1
d = 0 ⇒ 𝛼_d = |𝛴|^n
[Figure: suffix tree of S; 𝓞(n) time]
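The bound r(S) + 2 is easy to compute on a small example. The sketch below finds r(S) naively, as the longest common prefix of two adjacent sorted suffixes, rather than with a suffix tree in linear time; longest_repeat_length is a hypothetical name, and occurrences of the repeated substring are allowed to overlap (an assumption).

```python
def longest_repeat_length(s):
    """Length of a longest substring occurring at least twice in s."""
    suffixes = sorted(s[i:] for i in range(len(s)))
    best = 0
    for a, b in zip(suffixes, suffixes[1:]):
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        best = max(best, k)  # adjacent suffixes share the longest prefixes
    return best

s = "abaabbabba"
r = longest_repeat_length(s)
print(r, r + 2)  # r(S) and the resulting upper end of the search interval
```

For this string the longest repeat is abba, so the search for d never needs to look beyond 6.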
Engineering the algorithm: two more improvements
● If d is good for a prefix of S then it is also good for S
⇒ operate on prefixes of S using prefix doubling.
● If G_{S,d} is big then the Laplacian matrix of G_{S,d} is sparse
⇒ employ sparse LU decomposition algorithms (Gilbert et al.’88).
Experiments: utility
MSN dataset: n = 4698764, |𝛴| = 17
Experiments: runtime
SYN datasets: n from 1M to 50M, |𝛴| = 10
Thank you for your attention!
And thanks to the ACM Special Interest Group on Algorithms and Computation Theory (SIGACT) for supporting me with a travel award.