All content in this area was uploaded by Giulia Bernardini on Jan 07, 2020

Reverse-Safe Data Structures for Text Indexing

Giulia Bernardini (1,4), Huiping Chen (2), Gabriele Fici (3), Grigorios Loukides (2), Solon P. Pissis (4)

(1) University of Milano-Bicocca, Italy
(2) King’s College London, UK
(3) University of Palermo, Italy
(4) CWI Amsterdam, The Netherlands

Giulia Bernardini Reverse-Safe Data Structures ALENEX 2020

How do we view data structures?

[Figure: Dataset → Preprocess → Data structure; queries q1(x) → Yes, q2(x,y) → {3, 8}, q3(y) → 14, ...; query answers → Dataset]

Reverse-safe data structures

[Figure: Data structure; queries q1(x) → Yes, q2(x,y) → {3, 8}, q3(y) → 14, ...; query answers → Dataset 1, Dataset 2, Dataset 3, ...]

z-reverse-safe data structures

[Figure: z-reverse-safe data structure; queries q1(x) → Yes, q2(x,y) → {3, 8}, q3(y) → 14, ...; query answers → Dataset 1, Dataset 2, Dataset 3, ..., Dataset z, ...]

z-RSDS for text indexing

Main result: given z and a text of length n, build a z-RSDS of size 𝓞(n) in 𝓞(n^𝛚 log d) time answering pattern matching queries of length m ≤ d in 𝓞(m) time, where d is maximal.

[Figure: Text of length n → z-RSDS of size 𝓞(n), built in 𝓞(n^𝛚 log d) time]

Text indexing: suffix tree

Weiner’73: Given a query of length m we can answer decision, counting, and reporting queries in 𝓞(m), 𝓞(m), and 𝓞(m+occ) time, resp., where occ is the number of occurrences.

Text indexing: truncated suffix tree

Na et al’03: The truncated suffix tree T_d(S) is constructible in 𝓞(n) time.
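To make the three query types concrete, here is a rough sketch of an index restricted to patterns of length at most d. It is only a stand-in: it stores every short substring in a dictionary, so it uses 𝓞(nd) space and is not a truncated suffix tree. The class name `TruncatedIndex` and its methods are hypothetical.

```python
from collections import defaultdict


class TruncatedIndex:
    """Dictionary-based stand-in for an index over substrings of length <= d."""

    def __init__(self, text: str, d: int):
        self.d = d
        self.occurrences = defaultdict(list)
        # Record the starting position of every substring of length 1..d.
        for i in range(len(text)):
            for length in range(1, min(d, len(text) - i) + 1):
                self.occurrences[text[i:i + length]].append(i)

    def _check(self, pattern: str) -> None:
        if not 1 <= len(pattern) <= self.d:
            raise ValueError("pattern length must be between 1 and d")

    def contains(self, pattern: str) -> bool:
        """Decision query: does the pattern occur in the text?"""
        self._check(pattern)
        return pattern in self.occurrences

    def count(self, pattern: str) -> int:
        """Counting query: how many times does the pattern occur?"""
        self._check(pattern)
        return len(self.occurrences.get(pattern, []))

    def report(self, pattern: str) -> list:
        """Reporting query: starting positions of all occurrences."""
        self._check(pattern)
        return list(self.occurrences.get(pattern, []))
```

Patterns longer than d are rejected, mirroring the restriction m ≤ d on the queries a truncated structure supports.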

Text indexing: truncated suffix tree

Two strings S and S’ are d-equivalent if and only if T_d(S) = T_d(S’).

Example: abaabbabba ≠ abbaabbaba, but the two strings are 3-equivalent.
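The example can be checked with a small sketch. It assumes, as a proxy for T_d(S) = T_d(S’), that two strings of equal length are d-equivalent exactly when they have the same multiset of substrings of length at most d. That proxy and the function names are assumptions of this sketch, not taken from the slides.

```python
from collections import Counter


def substring_counts(s: str, d: int) -> Counter:
    """Multiset of all substrings of s with length between 1 and d."""
    counts = Counter()
    for length in range(1, min(d, len(s)) + 1):
        for i in range(len(s) - length + 1):
            counts[s[i:i + length]] += 1
    return counts


def d_equivalent(s: str, t: str, d: int) -> bool:
    """Assumed proxy for T_d(s) == T_d(t): equal short-substring multisets."""
    return len(s) == len(t) and substring_counts(s, d) == substring_counts(t, d)


print(d_equivalent("abaabbabba", "abbaabbaba", 3))  # True
print(d_equivalent("abaabbabba", "abbaabbaba", 4))  # False: only the first contains "abaa"
```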

Text indexing: truncated suffix tree

Lemma: The number 𝛼_d of strings which are d-equivalent to S is monotonically non-increasing for increasing d.

d = n ⇒ 𝛼_d = 1 (T_n(S) is the suffix tree of S)
d = 0 ⇒ 𝛼_d = |𝛴|^n

⇒ Binary search for optimal d
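The lemma is what makes binary search applicable. The sketch below illustrates this on tiny inputs: `alpha` counts d-equivalent strings by brute force over all |𝛴|^n candidates (feasible only for very short strings), using the same assumed proxy for d-equivalence (equal multisets of substrings of length at most d). All names here are hypothetical.

```python
from collections import Counter
from itertools import product


def substring_counts(s: str, d: int) -> Counter:
    """Multiset of all substrings of s with length between 1 and d."""
    counts = Counter()
    for length in range(1, min(d, len(s)) + 1):
        for i in range(len(s) - length + 1):
            counts[s[i:i + length]] += 1
    return counts


def alpha(s: str, d: int, alphabet: str) -> int:
    """Brute-force count of strings d-equivalent to s (exponential in len(s))."""
    target = substring_counts(s, d)
    return sum(
        1
        for letters in product(alphabet, repeat=len(s))
        if substring_counts("".join(letters), d) == target
    )


def largest_safe_d(s: str, z: int, alphabet: str):
    """Largest d in [0, n] with alpha(s, d) >= z, or None if no d qualifies."""
    if alpha(s, 0, alphabet) < z:
        return None
    lo, hi = 0, len(s)  # invariant: alpha(s, lo) >= z
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if alpha(s, mid, alphabet) >= z:
            lo = mid
        else:
            hi = mid - 1
    return lo


values = [alpha("abaab", d, "ab") for d in range(6)]
print(values)                          # non-increasing, from |alphabet|^n down to 1
print(largest_safe_d("abaab", 2, "ab"))
```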

De Bruijn graph

Hutchinson’75: the number of distinct Eulerian paths in G_{S,d} is equal to the number of strings that are d-equivalent to S.
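A minimal sketch of how such a graph might be built, assuming the usual convention that the nodes of G_{S,d} are the substrings of length d-1 and that every occurrence of a length-d substring contributes one edge (so the graph is a multigraph). The function name is hypothetical.

```python
from collections import Counter


def de_bruijn_multigraph(s: str, d: int) -> Counter:
    """Map each edge (prefix node, suffix node) to its multiplicity."""
    if not 2 <= d <= len(s):
        raise ValueError("d must satisfy 2 <= d <= len(s)")
    edges = Counter()
    for i in range(len(s) - d + 1):
        kmer = s[i:i + d]
        edges[(kmer[:-1], kmer[1:])] += 1  # one edge per occurrence
    return edges


g1 = de_bruijn_multigraph("abaabbabba", 3)
g2 = de_bruijn_multigraph("abbaabbaba", 3)
print(g1 == g2)  # True: the two 3-equivalent strings induce the same multigraph
```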

De Bruijn graph

Kirchhoff's theorem: the Eulerian paths in a directed multigraph G with n nodes can be counted in 𝓞(n^𝛚) time, where 𝛚 is the matrix multiplication exponent.
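One way to realise this count is sketched below. It is not necessarily the authors' procedure. It combines the matrix-tree theorem (a Laplacian minor counts spanning arborescences) with the BEST theorem, then divides by the factorials of the edge multiplicities so that parallel edges, which spell the same substring, are not distinguished. It assumes the path must start at the first and end at the last length-(d-1) substring of S, and it uses exact cubic-time elimination rather than fast matrix multiplication. All names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from math import factorial, prod


def determinant(matrix) -> int:
    """Exact determinant of an integer matrix by elimination over the rationals."""
    a = [[Fraction(x) for x in row] for row in matrix]
    det = Fraction(1)
    for col in range(len(a)):
        pivot = next((r for r in range(col, len(a)) if a[r][col] != 0), None)
        if pivot is None:
            return 0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, len(a)):
            factor = a[r][col] / a[col][col]
            for c in range(col, len(a)):
                a[r][c] -= factor * a[col][c]
    return int(det)


def count_eulerian_strings(s: str, d: int) -> int:
    """Number of distinct strings spelled by Eulerian paths of the order-d graph of s."""
    if not 2 <= d <= len(s):
        raise ValueError("d must satisfy 2 <= d <= len(s)")
    edges = Counter((s[i:i + d - 1], s[i + 1:i + d]) for i in range(len(s) - d + 1))
    start, end = s[:d - 1], s[len(s) - d + 1:]
    closed = edges.copy()
    closed[(end, start)] += 1  # extra edge turns every path into a circuit
    nodes = sorted({v for edge in closed for v in edge})
    index = {v: i for i, v in enumerate(nodes)}
    out_deg = Counter()
    laplacian = [[0] * len(nodes) for _ in nodes]
    for (u, v), m in closed.items():
        out_deg[u] += m
        laplacian[index[u]][index[u]] += m
        laplacian[index[u]][index[v]] -= m
    root = index[start]
    minor = [[x for j, x in enumerate(row) if j != root]
             for i, row in enumerate(laplacian) if i != root]
    arborescences = determinant(minor)  # matrix-tree theorem
    labelled = arborescences * prod(factorial(out_deg[v] - 1) for v in nodes)  # BEST theorem
    return labelled // prod(factorial(m) for m in edges.values())


print(count_eulerian_strings("abaab", 2))       # 2: abaab and aabab
print(count_eulerian_strings("abaabbabba", 3))  # 6
```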

The algorithm

For each d implied by binary search:
● Build the de Bruijn graph G_{S,d} in 𝓞(n) time
● Count Eulerian paths in 𝓞(n^𝛚) time to check 𝛼_d ≥ z

For d optimal, pick any Eulerian path on G_{S,d} to find S’ that is d-equivalent to S
Build and return T_d(S’) in 𝓞(n) time

Main result: T_d(S’) is a z-RSDS of size 𝓞(n) built in 𝓞(n^𝛚 log n) time.
It answers pattern matching queries of length m ≤ d in 𝓞(m) time.
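The step "pick any Eulerian path" could look roughly as follows. This sketch uses Hierholzer's algorithm, which is one standard way to find an Eulerian path and is not confirmed by the slides as the authors' choice. It assumes nodes are length-(d-1) substrings and starts from the first one in S. The function name is hypothetical.

```python
from collections import defaultdict


def eulerian_string(s: str, d: int) -> str:
    """Spell a string along some Eulerian path of the order-d de Bruijn multigraph of s."""
    if not 2 <= d <= len(s):
        raise ValueError("d must satisfy 2 <= d <= len(s)")
    successors = defaultdict(list)
    for i in range(len(s) - d + 1):
        successors[s[i:i + d - 1]].append(s[i + 1:i + d])
    # Hierholzer's algorithm with an explicit stack.
    stack, path = [s[:d - 1]], []
    while stack:
        node = stack[-1]
        if successors[node]:
            stack.append(successors[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())              # dead end: node is final in the path
    path.reverse()
    # The first node gives d-1 letters, each further node adds its last letter.
    return path[0] + "".join(node[-1] for node in path[1:])


print(eulerian_string("abaabbabba", 3))
```

The returned string has the same length-d substrings, with the same multiplicities, as the input; depending on the order in which edges are taken it may or may not coincide with the input itself.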

The algorithm

For each d implied by exponential search:
● Build the de Bruijn graph G_{S,d} in 𝓞(n) time
● Count Eulerian paths in 𝓞(n^𝛚) time to check 𝛼_d ≥ z

For d optimal, pick any Eulerian path on G_{S,d} to find S’ that is d-equivalent to S
Build and return T_d(S’) in 𝓞(n) time

Main result: T_d(S’) is a z-RSDS of size 𝓞(n) built in 𝓞(n^𝛚 log d) time.
It answers pattern matching queries of length m ≤ d in 𝓞(m) time.
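The log d factor comes from replacing binary search over [0, n] with exponential search. A generic sketch, assuming a monotone predicate `is_good(d)` (standing in for the check 𝛼_d ≥ z) that holds for d = 0; both names are hypothetical:

```python
def largest_good_d(is_good, n: int) -> int:
    """Largest d in [0, n] with is_good(d), using O(log d) predicate calls.

    Assumes is_good is monotone (true up to some threshold, false after)
    and that is_good(0) holds.
    """
    # Doubling phase: find the first power of two that fails or exceeds n.
    hi = 1
    while hi <= n and is_good(hi):
        hi *= 2
    lo = hi // 2         # known to be good (or 0)
    hi = min(hi, n + 1)  # known to be bad, or just past the valid range
    # Binary search phase on the open interval (lo, hi).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(mid):
            lo = mid
        else:
            hi = mid
    return lo


print(largest_good_d(lambda d: d <= 5, 1000))  # 5, found with about 2*log2(5) + 2 checks
```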

Engineering the algorithm: reducing the binary search interval

Fici et al’06: Let r(S) be the length of a longest repeated substring of S. S is uniquely determined by its substrings of length r(S) + 2.

d = r(S) + 2 ⇒ 𝛼_d = 1
d = 0 ⇒ 𝛼_d = |𝛴|^n

r(S) can be computed in 𝓞(n) time from the suffix tree of S.
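For small inputs, the upper end of the search interval can be illustrated with a naive computation of r(S). This sketch takes superlinear time, unlike the 𝓞(n) suffix-tree approach named on the slide, and counts overlapping occurrences as repeats. The function name is hypothetical.

```python
def longest_repeat_length(s: str) -> int:
    """Length of a longest substring occurring at least twice in s (naive)."""
    n = len(s)
    for length in range(n - 1, 0, -1):  # try the longest candidates first
        seen = set()
        for i in range(n - length + 1):
            sub = s[i:i + length]
            if sub in seen:
                return length
            seen.add(sub)
    return 0


r = longest_repeat_length("abaabbabba")
print(r, r + 2)  # 4 6: "abba" repeats, so d need not exceed 6
```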

Engineering the algorithm: two more improvements

● If d is good for a prefix of S then it is also good for S
  ⇒ operate on prefixes of S using prefix doubling.
● If G_{S,d} is big then the Laplacian matrix of G_{S,d} is sparse
  ⇒ employ sparse LU decomposition algorithms (Gilbert et al’88).

Experiments: utility

MSN dataset: n = 4698764, |𝛴| = 17

Experiments: runtime

SYN datasets: n from 1M to 50M, |𝛴| = 10

Thank you for your attention!

And thanks to the ACM Special Interest Group on Algorithms and Computation Theory (SIGACT) for supporting me with a travel award.