
Top-k Ranked Document Search in General Text Databases

J. Shane Culpepper¹, Gonzalo Navarro², Simon J. Puglisi¹, and Andrew Turpin¹

¹ School of Computer Science and Information Technology, RMIT Univ., Australia

{shane.culpepper,simon.puglisi,andrew.turpin}@rmit.edu.au

² Department of Computer Science, Univ. of Chile. gnavarro@dcc.uchile.cl

Abstract. Text search engines return a set of k documents ranked by

similarity to a query. Typically, documents and queries are drawn from

natural language text, which can readily be partitioned into words, allow-

ing optimizations of data structures and algorithms for ranking. However,

in many new search domains (DNA, multimedia, OCR texts, Far East

languages) there is often no obvious definition of words and traditional

indexing approaches are not so easily adapted, or break down entirely.

We present two new algorithms for ranking documents against a query

without making any assumptions on the structure of the underlying text.

We build on existing theoretical techniques, which we have implemented

and compared empirically with new approaches introduced in this pa-

per. Our best approach is significantly faster than existing methods in

RAM, and is even three times faster than a state-of-the-art inverted file

implementation for English text when word queries are issued.

1 Introduction

Text search is a vital enabling technology in the information age. Web search

engines such as Google allow users to find relevant information quickly and easily

in a large corpus of text, T . Typically, a user provides a query as a list of words,

and the information retrieval (IR) system returns a list of relevant documents

from T , ranked by similarity.

Most IR systems rely on the inverted index data structure to support efficient

relevance ranking [24]. Inverted indexes require the definition of terms in T prior

to their construction. In the case of many natural languages, the choice of terms

is simply the vocabulary of the language: words. In turn, for the inverted index

to operate efficiently, queries must be composed only of terms that are in the

index. For many natural languages this is intuitive for users; they can express

their information needs as bags of words or phrases.

However, in many new search domains the requirement to choose terms prior

to indexing is either not easily accommodated, or leads to unacceptable restrictions

on queries. For example, several Far East languages are not easily parsed into

words, and a user may adopt a different parsing from that used to create the index.

Likewise, natural language text derived from OCR or speech-to-text systems

may contain “words” that will not form terms in the mind of a user because they


contain errors. Other types of text simply do not have a standard definition of

a term, such as biological sequences (DNA, protein) and multimedia signals.

With this in mind, in this paper we take the view of a text database (or

collection) T as a string of n symbols drawn from an alphabet Σ. T is partitioned

into N documents {d1,d2,...,dN}. Queries are also strings (or sets of strings)

composed of symbols drawn from Σ. Here, the symbols in Σ may be bytes,

letters, nucleotides, or even words if we so desire; and the documents may be

articles, chromosomes or any other texts in which we need to search. In this

setting we consider the following two problems.

Problem 1. A document listing search takes a query q ∈ Σ∗ and a text T ∈ Σ∗

that is partitioned into N documents, {d1,d2,...,dN}, and returns a list of the

documents in which q appears at least once.

Problem 2. A ranked document search takes a query q ∈ Σ∗, an integer 0 < k ≤

N, and a text T ∈ Σ∗ that is partitioned into N documents {d1,d2,...,dN},

and returns the top-k documents ordered by a similarity measure Ŝ(q,di).

By generalizing the problems away from words, we aim to develop indexes

that support efficient search in new types of text collections, such as those out-

lined above, while simultaneously enabling users in traditional domains (like

web search) to formulate richer queries, for example containing partial words or

markup. In the ranked document search problem, we focus on the specific case

where Ŝ(q,di) is the tf×idf measure. tf×idf is the basic building block for a

large class of similarity measures used in the most successful IR systems.

This paper contains two contributions towards efficient ranked document

search in general texts. (1) We implement, empirically validate and compare

existing theoretical proposals for document listing search on general texts, and

include a new variant of our own. (2) We propose two novel algorithms for

ranked document search using general query patterns. These are evaluated and

compared empirically to demonstrate that they perform more efficiently than

document listing approaches. In fact, the new ranked document search algo-

rithms are three times faster than a highly tuned inverted file implementation

that assumes terms to be English words.

Our approach is to build data structures that allow us to efficiently calculate

the frequency of a query pattern in a document (tf) on the fly, unlike traditional

inverted indexes that store precomputed tf values for specific query patterns

(usually words). Importantly, we are able to derive this tf information in an

order which allows rapid identification of the top-k ranked documents. We see

this work as an important first step toward practical ranked retrieval for large

general-text collections, and an extension of current indexing methods beyond

traditional algorithms that assume a lexicon of terms a priori.

2 Basic Concepts

Relevance Ranking. We will focus on the tf×idf measure, where tf_{t,d} is the

number of times term t appears in document d, and idf_t is related to the number

of documents where t appears. Appendix A covers the related basic concepts.
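As a rough illustration of the measure, the sketch below scores documents for a single pattern by brute force. It assumes idf_t = log(N/docc), the form given later in Section 3.1, and counts overlapping occurrences, as a suffix array would; the function names `count_occurrences` and `rank_documents` are hypothetical and are not taken from the paper.

```python
import math


def count_occurrences(doc, q):
    """Number of (possibly overlapping) occurrences of pattern q in doc."""
    return sum(doc.startswith(q, i) for i in range(len(doc) - len(q) + 1))


def rank_documents(docs, q, k):
    """Return up to k (document number, score) pairs, best first.

    Documents are numbered from 1. The score is tf * log(N / docc), where
    docc is the number of documents that contain q at least once.
    """
    tf = {}
    for number, doc in enumerate(docs, start=1):
        count = count_occurrences(doc, q)
        if count:
            tf[number] = count
    if not tf:
        return []
    idf = math.log(len(docs) / len(tf))
    scored = [(number, count * idf) for number, count in tf.items()]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))  # ties: lower number first
    return scored[:k]
```

For a single pattern the idf factor is the same for every document, so the ordering is determined by tf alone.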


Suffix Arrays and Self-Indexes. The suffix array A[1..n] of a text collection T

of length n is a permutation of (1...n), so that the suffixes of T , starting

at the consecutive positions indicated in A, are lexicographically sorted [10]:

T [A[i]..n] < T [A[i+1]..n]. Because of the lexicographic ordering, all the suffixes

starting with a given substring t of T form a range A[sp..ep], which can be deter-

mined by binary search in O(|t|logn) time. Variants of this basic suffix array are

efficient data structures for returning all positions in T where a query pattern

q occurs; once sp and ep are located for t = q, it is simple to enumerate the

occ = ep−sp+1 occurrences of q. However, if T is partitioned into documents,

then listing the documents that contain q, rather than all occurrences, in less

than O(occ) time is not so straightforward; see Section 3.1.
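The range search described above can be sketched as follows. This is a minimal illustration that assumes a naively built suffix array (sorting all suffixes) and 0-based positions; the names `build_suffix_array` and `suffix_range` are hypothetical. A practical index would use a linear-time construction or a compressed self-index instead.

```python
def build_suffix_array(text):
    """Naive construction: sort the suffix start positions (0-based)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def suffix_range(text, sa, pattern):
    """Return (sp, ep), 0-based and inclusive, or None if pattern is absent.

    Two binary searches over the length-|pattern| prefixes of the sorted
    suffixes, each taking O(|pattern| log n) character comparisons.
    """
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                       # first suffix with prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    sp = lo
    hi = len(sa)
    while lo < hi:                       # first suffix with prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    ep = lo - 1
    return (sp, ep) if sp <= ep else None
```

The number of occurrences is then ep − sp + 1, and the occurrence positions are the suffix array entries in that range.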

Self-indexes [13] offer the same functionality as a suffix array but are heavily

compressed. More formally, they can (1) extract any text substring T [i..j], (2)

compute sp and ep for a pattern t, and (3) return A[i] for any i.

For example, the Alphabet-Friendly FM-index (AF-FMI) [5] occupies nH_h(T) +

o(n log σ) bits, where σ is the size of the text alphabet, H_h is the h-th order

empirical entropy [11] (a lower bound on the space required by any order-h sta-

tistical compressor), and h ≤ α log_σ n for any constant 0 < α < 1. It carries

out (1) in time O(log^{1+ε} n + (j − i) log σ) for any constant ε > 0, (2) in time

O(|t| log σ) and (3) in time O(log^{1+ε} n).

Wavelet Trees. The wavelet tree [8] is a data structure for representing a sequence

D[1..n] over an alphabet Σ of size σ. It requires nH_0(D) + o(n log σ) + O(σ log n)

bits of space, which is asymptotically never larger than the n⌈log σ⌉ bits needed

to represent D in plain form (assuming σ = o(n)), and can be significantly

smaller if D is compressible. A wavelet tree computes D[i] in time O(log σ),

as well as rank_c(D,i), the number of occurrences of symbol c in D[1..i], and

select_c(D,j), the position in D of the j-th occurrence of symbol c.

An example of a wavelet tree is shown in Fig. 1; its structure is as

follows. At the root, we divide the alphabet Σ into symbols < c and ≥ c, where

c is the median of Σ. We then store a bitvector B_root[1..n] in the root node, where

B_root[i] = 0 if D[i] < c and 1 otherwise. The left child of the root will

handle the sequence D_left, formed by concatenating all the symbols < c in

D[1..n] (respecting the order); and the right child will handle D_right, which has

the symbols ≥ c. At the leaves, where all the symbols of the corresponding D_leaf

are equal, nothing is stored. It is easy to see that there are ⌈log σ⌉ levels and

that n bits are spent per level, for a total of at most n⌈log σ⌉ bits. If, instead, the

bitvectors at each level are represented in compressed form [17], the total space

of each bitvector B_v becomes nH_0(B_v) + o(n), which adds up to the promised

nH_0(D) + o(n log σ) + O(σ log n) bits for the whole wavelet tree.

The compressed bitvectors also allow us to obtain B[i], and to compute rank

and select, in constant time over the bitvectors, which enables the O(logσ)-

time corresponding operations on sequence D; in particular D[i], rankc(D,i)

and selectc(D,j) all take O(logσ)-time via simple tree traversals (see [13]).
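The structure and the access and rank traversals can be sketched as below. This is only an illustration under simplifying assumptions: symbols are integers, the alphabet is split at the midpoint of its range (which matches the median split for a contiguous alphabet such as 1..8), and each bitvector is replaced by a plain list of prefix counts rather than a compressed, rank-enabled bitvector. The class name `WaveletTree` and its layout are hypothetical, positions are 0-based, and select is omitted.

```python
class WaveletTree:
    """Pointer-based wavelet tree over integer symbols in [lo, hi].

    Each internal node keeps prefix counts of its 0-bits in a plain list,
    standing in for a bitvector preprocessed for rank.
    """

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi = lo, hi
        if lo == hi:
            return                      # leaf: nothing is stored
        mid = (lo + hi) // 2            # symbols <= mid get bit 0, others bit 1
        self.zeros = [0]                # zeros[i] = number of 0-bits in B[0:i]
        for x in seq:
            self.zeros.append(self.zeros[-1] + (x <= mid))
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

    def access(self, i):
        """Return D[i] (0-based) by following the bits down from the root."""
        node = self
        while node.lo != node.hi:
            z = node.zeros
            if z[i + 1] > z[i]:         # bit i is 0: continue in the left child
                i, node = z[i], node.left
            else:                       # bit i is 1: continue in the right child
                i, node = i - z[i], node.right
        return node.lo

    def rank(self, c, i):
        """Number of occurrences of symbol c among the first i symbols of D."""
        if not self.lo <= c <= self.hi:
            return 0
        node = self
        while node.lo != node.hi:
            z = node.zeros
            if c <= (node.lo + node.hi) // 2:
                i, node = z[i], node.left
            else:
                i, node = i - z[i], node.right
        return i
```

With the sequence of Fig. 1, `wt.rank(1, 13) - wt.rank(1, 2)` counts the occurrences of document 1 in D[3..13] (1-based), which is the rank difference used for term frequencies in Section 3.1.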


[Figure 1: wavelet tree diagram for the sequence D, showing numbered nodes, the bitvector of each node, the n0 and n1 branch labels, and the range sp..ep at the root.]

Fig. 1. D = {6,2,5,6,2,3,1,8,5,1,5,5,1,4,3,7} as a wavelet tree. The top row of

each node shows D, the second row the bitvector B_v, and numbers in circles are node

numbers for reference in the text. n0 and n1 are the number of 0 and 1 bits respectively

in the shaded region of the parent node of the labelled branch. Shaded regions show

the parts of the nodes that are accessed when listing documents in the region D[sp =

3..ep = 13]. Only the bitvectors (preprocessed for rank and select) are present in the

actual structure; the numbers above each bitvector are included only to aid explanation.

3 Previous Work

3.1 Document Listing

The first solution to the document listing problem on general text collections [12]

requires optimal O(|q| + docc) time, where docc is the number of documents

returned; but O(nlogn) bits of space, substantially more than the nlogσ bits

used by the text. It stores an array D[1..n], aligned to the suffix array A[1..n], so

that D[i] gives the document to which text position A[i] belongs. Another array, C[1..n],

stores in C[i] the position of the last occurrence of D[i] in D[1..i−1]. Finally, a data structure is

built on C to answer the range minimum query RMQ_C(i,j) = argmin_{i≤r≤j} C[r]

in constant time [4]. The algorithm first finds A[sp..ep] in time O(|q|) using the

suffix tree of T [2]. To retrieve all of the unique values in D[sp..ep], it starts with

the interval [s..e] = [sp..ep] and computes i = RMQ_C(s,e). If C[i] ≥ sp it stops;

otherwise it reports D[i] and continues recursively with the intervals [s..i − 1] and

[i + 1..e] (condition C[i] ≥ sp always refers to the original sp value). It

can be shown that a different D[i] value is reported at each step.
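The recursion can be sketched as follows. This is an illustration only: the constant-time RMQ structure is replaced by a linear scan, the C array is built on the fly, positions are 0-based, and −1 marks "no earlier occurrence". The function name `list_documents` is hypothetical.

```python
def list_documents(D, sp, ep):
    """Distinct document numbers in D[sp..ep] (0-based, inclusive).

    C[i] holds the position of the previous occurrence of D[i], or -1.
    A position i in the range is reported exactly when C[i] < sp, that is,
    when it is the first occurrence of its document inside the range.
    """
    last_seen = {}
    C = []
    for i, d in enumerate(D):
        C.append(last_seen.get(d, -1))
        last_seen[d] = i

    reported = []
    stack = [(sp, ep)]
    while stack:
        s, e = stack.pop()
        if s > e:
            continue
        i = min(range(s, e + 1), key=C.__getitem__)  # naive stand-in for RMQ
        if C[i] >= sp:          # every document in [s..e] was already reported
            continue
        reported.append(D[i])
        stack.append((s, i - 1))
        stack.append((i + 1, e))
    return reported
```

Each reported value costs one range minimum query plus at most two unsuccessful ones, which is where the O(docc) bound comes from once the RMQ itself takes constant time.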

By representing D with a wavelet tree, values C[i] can be calculated on demand,

rather than stored explicitly [22]. This reduces the space to |CSA| +

n log N + 2n + o(n log N) bits, where |CSA| is the size of any compressed

suffix array (Section 2). The CSA is used to find D[sp..ep], and then C[i] =

select_{D[i]}(D, rank_{D[i]}(D,i) − 1) is determined from the wavelet tree of D in


O(log N) time. They use a compact data structure of 2n + o(n) bits [6] for the

RMQ queries on C. If, for example, the AF-FMI is used as the compressed suffix

array then the overall time to report all documents for query q is O(|q| log σ +

docc log N). With this representation, tf_{t,d} = rank_d(D,ep) − rank_d(D,sp − 1).

Gagie et al. [7] use the wavelet tree in a way that avoids RMQs on C altogether. By

traversing down the wavelet tree of D, while setting sp′ = rank_b(B_v,sp − 1) + 1

and ep′ = rank_b(B_v,ep) as we descend to the left (b = 0) or right (b = 1) child

of node v, we reach each possible distinct leaf (document value) present in D[sp..ep]

once. To discover each successive unique d value, we first descend to the left child

each time the resulting interval [sp′,ep′] is not empty, otherwise we descend to

the right child. By also trying the right child each time we have gone to the left,

all the distinct successive d values in the interval are discovered. We also get

tf_{t,d} = ep′ − sp′ + 1 upon arriving at the leaf of each d. They show that it is

possible to get the i-th document in the interval directly in O(logN) time. This

is the approach we build upon to get our new algorithms described in Section 4.
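The traversal just described can be sketched as a pruned depth-first walk. This is a minimal illustration, not the authors' implementation: it uses a pointer-based tree with prefix-count lists in place of rank-enabled bitvectors, integer document numbers split at the midpoint of their range, and 0-based half-open intervals [s, e), so the leaf interval length e − s plays the role of ep′ − sp′ + 1. The names `Node` and `list_with_tf` are hypothetical.

```python
class Node:
    """Wavelet tree node over integer symbols in [lo, hi]."""

    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        if lo < hi:
            mid = (lo + hi) // 2
            self.zeros = [0]            # zeros[i] = number of 0-bits in B[0:i]
            for x in seq:
                self.zeros.append(self.zeros[-1] + (x <= mid))
            self.left = Node([x for x in seq if x <= mid], lo, mid)
            self.right = Node([x for x in seq if x > mid], mid + 1, hi)


def list_with_tf(node, s, e, out):
    """Append (d, tf) to out for every distinct d in the node's range [s, e)."""
    if s >= e:
        return                          # empty interval: prune this subtree
    if node.lo == node.hi:
        out.append((node.lo, e - s))    # interval length at the leaf is tf
        return
    z = node.zeros
    list_with_tf(node.left, z[s], z[e], out)            # map range to 0-bits
    list_with_tf(node.right, s - z[s], e - z[e], out)   # map range to 1-bits
```

For the sequence of Fig. 1 and the range D[3..13] (1-based), which is `[2, 13)` here, the walk reports the pairs in increasing document number order.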

Sadakane [20] offers a different space-time tradeoff. He builds a compressed

suffix array A, and a parentheses representation of C in order to run RMQ queries

on it without accessing C. Furthermore, he stores a bitvector B indicating the

points of T where documents start. This emulates D[i] = rank_1(B,A[i]) for

document listing. The overall space is |CSA| + 4n + o(n) + N log(n/N) bits. A further

|CSA| bits are required in order to compute the tf_{t,d} values. If the AF-FMI is

used as the implementation of A, the time required is O(|q| log σ + docc log^{1+ε} n).

Any document listing algorithm obtains docc trivially, and hence idf_t =

log(N/docc). If, however, a search algorithm is used that does not list all docu-

ments, idf must be explicitly computed. Sadakane [20] proposes a 2n + o(n) bit

data structure built over the suffix array to compute idf_t for a given t.

3.2 Top-k Retrieval

In IR it is typical that only the top k ranked documents are required, for some k,

as for example in Web search. There has been little theoretical work on solving

this “top-k” variant of the document listing problem. Muthukrishnan [12] solves

a variant where only the docc′ documents that contain at least f occurrences

of q (tf_{q,d} ≥ f) are reported, in time O(|q| + docc′). This requires a general

data structure of O(n log n) bits, plus a specific one of O((n/f) log n) bits. This

approach does not exactly solve the ranked document search problem. Recently,

Hon et al. [9] extended the solution to return the top-k ranked documents in

time O(|q| + k log k), while keeping O(n log n) bits of space. They also gave a

compressed variant with 2|CSA| + o(n) + N log(n/N) bits and O(|q| + k log^{4+ε} n)

query time, but its practicality is not clear.

4 New Algorithms

We introduce two new algorithms for top-k document search extending Gagie

et al.’s proposal for document listing [7]. Gagie et al. introduce their method as


a repeated application of the quantile(D[sp..ep],p) function, which returns the

p-th number in D[sp..ep] if that subarray were sorted. To get the first unique

document number in D[sp..ep], we issue d_1 = quantile(D[sp..ep],1). To find the

next value, we issue d_2 = quantile(D[sp..ep],1 + tf_{q,d_1}). The j-th unique

document will be d_j = quantile(D[sp..ep],1 + Σ_{i=1}^{j−1} tf_{q,d_i}), with the frequencies

computed along the way as tf_{q,d} = rank_d(D,ep) − rank_d(D,sp − 1). This lists

the documents and their tf values in increasing document number order.

Our first contribution to improving document listing search algorithms is the

observation, not made by Gagie et al., that the tf_{q,d} value can be collected on the

way to extracting document number d from the wavelet tree built on D. In the

parent node of the leaf corresponding to d, tf_{q,d} is equal to the number of 0-bits

(resp. 1-bits) in B_v[sp′..ep′] if d’s leaf is a left child (resp. right child). Thus, two

wavelet tree rank operations are avoided; an important practical improvement.

We now recast Gagie et al.’s algorithm. When listing all distinct documents

in D[sp..ep], the algorithm of Gagie et al. can be thought of as a depth-first

traversal of the wavelet tree that does not follow paths leading only to

document numbers that do not occur in D[sp..ep].

Consider the example tree of Fig. 1, where we list the distinct numbers in

D[3..13]. A depth-first traversal begins by following the leftmost path to leaf 8.

As we step left to a child, we take note of the number of 0-bits in the range used

in its parent node, labelled n0 on each branch. Both n0 and n1 are calculated to

determine if there is a document number of interest in the left and right child. As

we enter leaf 8, we know that there are n0 = 3 copies of document 1 in D[3..13],

and report this as tf_{q,1} = 3. Next in the depth-first traversal is leaf 9, thus

we report tf_{q,2} = 1, the n1 value of its parent node 5. The traversal continues,

reporting tf_{q,3} = 1, and then moves to the right branch of the root to fetch the

remainder of the documents to report.

Again, this approach produces the document numbers in increasing docu-

ment number order. These can obviously be post-processed to extract the k

documents with the highest tf_{q,d} values by sorting the docc values. A more effi-

cient approach, and our focus next, fetches the document numbers in tf order,

and then only the first k are processed.

4.1 Top-k via Greedy Traversal

The approach used in this method is to prioritize the traversal of the wavelet tree

nodes by the size of the range [sp′..ep′] in the node’s bitvector. By traversing to

nodes with larger ranges in a greedy fashion, we will reach the document leaves

in tf order, and reach the first k leaves potentially having explored much less

of the tree than we would have using a depth-first-style traversal.

We maintain a priority queue of (node, range) pairs, initialized with the single

pair (root,[sp..ep]). The priority of a pair favors larger ranges, and ties are broken

in favor of deeper nodes. At each iteration, we remove the pair (v,[sp′..ep′]) with

largest ep′ − sp′. If v is a leaf, then we report the corresponding document and

its tf value, ep′ − sp′ + 1. Otherwise, the node is internal; if B_v[sp′..ep′] contains


one or more 0-bits (resp. 1-bits) then at least one document to report lies on

the left subtree (resp. right subtree) and so we insert the child node with an

appropriate range, which will have size n0 (resp. n1), into the queue. Note that we

can insert zero to two new elements in the queue.

Fig. 5(a) gives pseudocode. In the worst case, this approach will explore

almost as much of the tree as would be explored during the constrained depth-

first traversal of Gagie et al., and so requires O(docclogN) time. This worst case

is reached when every node that is a parent of a leaf is in the queue, but only

one leaf is required, e.g. when all of the documents in D[sp..ep] have tfq,d= 1.
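The greedy traversal can be sketched with a binary heap as the priority queue. This is an illustrative reading of the description above, not the pseudocode of Fig. 5(a): the tree uses prefix-count lists in place of rank-enabled bitvectors, intervals are 0-based and half-open, and the names `Node` and `greedy_top_k` are hypothetical.

```python
import heapq


class Node:
    """Wavelet tree node over integer symbols in [lo, hi]."""

    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        if lo < hi:
            mid = (lo + hi) // 2
            self.zeros = [0]            # zeros[i] = number of 0-bits in B[0:i]
            for x in seq:
                self.zeros.append(self.zeros[-1] + (x <= mid))
            self.left = Node([x for x in seq if x <= mid], lo, mid)
            self.right = Node([x for x in seq if x > mid], mid + 1, hi)


def greedy_top_k(root, s, e, k):
    """Return up to k (d, tf) pairs for D[s:e], in non-increasing tf order."""
    out = []
    tick = 0                            # unique counter so nodes are never compared
    # Heap keys: larger range first, then deeper node first.
    heap = [(-(e - s), 0, tick, root, s, e)] if e > s else []
    while heap and len(out) < k:
        _, neg_depth, _, node, s, e = heapq.heappop(heap)
        if node.lo == node.hi:
            out.append((node.lo, e - s))        # leaf: range size is tf
            continue
        z = node.zeros
        children = ((node.left, z[s], z[e]), (node.right, s - z[s], e - z[e]))
        for child, cs, ce in children:
            if ce > cs:                         # skip children with empty ranges
                tick += 1
                heapq.heappush(heap, (cs - ce, neg_depth - 1, tick, child, cs, ce))
    return out
```

A child's range is never larger than its parent's, so when a leaf is removed from the queue no document still to be reported can have a larger tf; this is why the leaves emerge in tf order.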

4.2 Top-k via Quantile Probing

We now exploit the fact that in a sorted array X[1..m] of document numbers,

if a document d occurs more than m/2 times, then X[m/2] = d. The same

argument applies for numbers with frequency > m/4: if they exist, they must

occur at positions m/4, 2m/4 or 3m/4 in X. In general we have the following:

Observation 1. On a sorted array X[1..m], if there exists a d ∈ X with fre-

quency larger than m/2^i, then there exists at least one j such that X[j·m/2^i] = d.

Of course we cannot afford to fully sort D[sp..ep]. However, we can access the

elements of D[sp..ep] as if they were sorted using the aforementioned quantile

queries [7] over the wavelet tree of D. That is, we can determine the document

d with a given rank r in D[sp..ep] using quantile(D[sp..ep],r) in O(logN) time.
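A quantile query over the wavelet tree can be sketched as a single root-to-leaf descent, which also yields the frequency of the returned document as the interval length at the leaf. The sketch assumes a pointer-based tree with prefix-count lists in place of rank-enabled bitvectors, 0-based half-open intervals and 1-based ranks; the names `Node`, `quantile` and `list_by_quantiles` are hypothetical.

```python
class Node:
    """Wavelet tree node over integer symbols in [lo, hi]."""

    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        if lo < hi:
            mid = (lo + hi) // 2
            self.zeros = [0]            # zeros[i] = number of 0-bits in B[0:i]
            for x in seq:
                self.zeros.append(self.zeros[-1] + (x <= mid))
            self.left = Node([x for x in seq if x <= mid], lo, mid)
            self.right = Node([x for x in seq if x > mid], mid + 1, hi)


def quantile(root, s, e, r):
    """Return (d, tf): the r-th smallest value of D[s:e] and its frequency.

    Assumes 1 <= r <= e - s.
    """
    node = root
    while node.lo != node.hi:
        z = node.zeros
        n0 = z[e] - z[s]                # 0-bits in the current range
        if r <= n0:                     # the answer lies in the left child
            s, e, node = z[s], z[e], node.left
        else:                           # skip the n0 smaller symbols, go right
            r -= n0
            s, e, node = s - z[s], e - z[e], node.right
    return node.lo, e - s


def list_by_quantiles(root, s, e):
    """List (d, tf) in increasing d by jumping over each run of equal values."""
    out, r = [], 1
    while r <= e - s:
        d, tf = quantile(root, s, e, r)
        out.append((d, tf))
        r += tf
    return out
```

Each call visits one node per level, which gives the O(log N) cost per probe quoted above.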

In the remainder of this section we refer to D[sp..ep] as X[1..m] with m a

power of 2, and assume we can probe X as if it were sorted (with each probe

requiring O(logN) time). Fig. 5(b) gives pseudocode for the final method.

To derive a top-k listing algorithm, we apply Obs. 1 in rounds. As the al-

gorithm proceeds, we will accumulate candidates for the top-k documents in a

min-heap of at most k pairs of the form (d, tf_{q,d}), keyed on tf_{q,d}. In round 1, we

determine the document d with rank m/2 and its frequency tf_{q,d}. If d does not

already have an entry in the heap,³ then we add the pair (d,tf_{q,d}) to the heap,

with priority tf_{q,d}. This ends the first round. Note that the item we inserted

may in fact have tf_{q,d} ≤ m/2, but at the end of the round if a document d has

tf_{q,d} > m/2, then it is in the heap. We continue, in round 2, to probe the ele-

ments X[m/4] and X[3m/4], and their frequencies f_{X[m/4]} and f_{X[3m/4]}. If the

heap contains fewer than k items, and does not contain an entry for X[m/4], we

insert (X[m/4],f_{X[m/4]}). Else we check the frequency of the minimum item.

If it is less than f_{X[m/4]}, we extract the minimum and insert (X[m/4],f_{X[m/4]}).

We then perform the same check and update with (X[3m/4],f_{X[3m/4]}).

In round 2 we need not ask about the element with rank 2m/4 = m/2, as we

already probed it in round 1. To avoid reinspecting ranks, during the ith round,

we determine the elements with ranks m/2^i, 3m/2^i, 5m/2^i, .... The total

number of elements probed (and hence quantile queries) to list all documents is

at most 4m/f_min, where f_min is the k-th highest frequency in the result.

³ We can determine this easily in O(1) time by maintaining a bitvector of size N.


[Figure 2: two panels, PROTEIN and WSJ, plotting Time (msec), log scale from 1 to 10000, against Query Length (3 to 20) for six methods: 1 Sada, 2 ℓ-gram, 3 VM, 4 WT, 5 Quantile, 6 Greedy.]

Fig. 2. Mean time to find documents for all 200 queries of each length for methods

Sada, ℓ-gram, VM and WT, and mean time to report the top k = 10 documents by

tf_{q,d} for methods Quantile and Greedy. (Lines interpolated for clarity.)

Due to Obs. 1, and because we maintain items in a min-heap, at the end of

the ith round, the k most frequent documents having tf > m/2^i are guaranteed

to be in our heap. Thus, if the heap contains k items at the start of round i+1,

and the smallest element in it has tf ≥ m/2^{i+1}, then no element in the heap

can be displaced; we have found the top-k items and can stop.
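The rounds can be sketched as below. This is a hedged illustration rather than the pseudocode of Fig. 5(b): probes are simulated on an explicitly sorted list (standing in for quantile queries on the wavelet tree), m need not be a power of 2, already-probed ranks are tracked in a set, and the stopping rule is this sketch's own conservative one (stop once the heap holds k items whose smallest tf is at least the current probe spacing). The function name `top_k_by_probing` is hypothetical.

```python
import heapq
from bisect import bisect_left, bisect_right


def top_k_by_probing(X, k):
    """Top-k (tf, d) pairs of the sorted list X, plus the number of probes.

    In a round with spacing `step`, ranks step, 2*step, ... are probed. Any
    document with tf >= step covers at least one of these ranks, so after
    the round every document not yet seen has tf < step.
    """
    m = len(X)
    if k < 1 or m == 0:
        return [], 0

    def probe(r):                       # r is a 1-based rank in sorted order
        d = X[r - 1]
        return d, bisect_right(X, d) - bisect_left(X, d)

    heap, seen, probed = [], set(), set()   # heap is a min-heap keyed on tf
    step = max(1, m // 2)
    while True:
        for r in range(step, m + 1, step):
            if r in probed:
                continue                # rank inspected in an earlier round
            probed.add(r)
            d, tf = probe(r)
            if d in seen:
                continue
            seen.add(d)
            if len(heap) < k:
                heapq.heappush(heap, (tf, d))
            elif heap[0][0] < tf:
                heapq.heapreplace(heap, (tf, d))
        if step == 1 or (len(heap) == k and heap[0][0] >= step):
            break                       # no unseen document can displace the heap
        step //= 2
    return sorted(heap, reverse=True), len(probed)
```

On the sorted contents of D[3..13] from Fig. 1 with k = 2, the sketch stops after two rounds, having probed 6 of the 11 ranks.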

5 Experiments

We evaluated our new algorithms (Greedy from Section 4.1 and Quantile from

Section 4.2) with English text and protein collections. We also implemented

our improved version of Gagie et al.’s Wavelet Tree document listing method,

labelled WT. We include three baseline methods derived from previous work on

the document listing problem. The first two are implementations of Välimäki

and Mäkinen [22] and Sadakane [20] as described in Section 3, labelled VM and

Sada respectively. The third, ℓ-gram, is a close variant of Puglisi et al.’s inverted

index of ℓ-grams [16], used with parameters ℓ = 3 and block size = 4096. It is

described in detail in Appendix C.

Experimental Data. We use two data sets. wsj is a 100MB collection of 36,603

news documents in text format drawn from disk three of the trec data collection

(http://trec.nist.gov). protein is a concatenation of 143,244 Human and

Mouse protein sequences totalling 60MB (http://www.ebi.ac.uk/swissprot).

For each collection, 200 queries were randomly generated for each character length from 3 to

20, each appearing at least 5 times in the collection, for

a total of 3,600 sample queries. Each query was run 10 times. Statistics of the

queries used are presented in Appendix D.

Timing Results. Fig. 2 shows the total time for 200 queries of each query

length for all methods. The document listing method of Gagie et al. with our

optimizations (number 4 on the graphs) is clearly the fastest method for finding

all documents and tfq,d values that contain the query q in document number

order. The two algorithms which implicitly return documents in decreasing tfq,d


[Figure 3: two box-plot panels of Time (msec per Document), 0.0 to 1.2, for Sada, ℓ-gram, VM, WT, Quantile and Greedy.]

Fig. 3. Time per document listed as in Fig. 2, with 25th and 75th percentiles (boxes),

median (solid line), and outliers (whiskers).

          Sada   ℓ-gram   VM    WT    Quantile   Greedy
wsj       572    122      391   341   341        341
protein   870    77       247   217   217        217

Table 1. Peak memory use during search (MB) for the algorithms on wsj and protein.

order, Quantile and Greedy, are faster than all other methods. Note these final two

methods are only timed to return k = 10 documents, but if the other methods

were set the same task, their times would only increase as they must enumerate

all documents prior to choosing the top-k. Moreover, we found that choosing any

value of k from 1 to 100 had little effect on the runtime of Greedy and Quantile.

Note the anomalous drop in query time for |q| = 5 on protein for all methods

except ℓ-gram. This is a result of the low occ and docc for that set of queries,

thus requiring less work from the self-index methods. Time taken to identify the

range D[sp..ep] is very fast, and about the same for all the query lengths tested.

A low occ value means this range is smaller, and so less work for the document

listing methods. Method ℓ-gram however does not benefit from the small number

of occurrences of q as it has to intersect all inverted lists for the 3-grams that

make up q, which may be long even if the resulting list is short.

Fig. 3 summarizes the time per document listed, and clearly shows that the

top-k methods (Quantile and Greedy) do more work per document listed. However,

Fig. 2 demonstrates that this is more than recouped whenever k is small, relative

to the total number of documents containing q. Table 2 shows that the average

docc is well above 10 for all pattern lengths in the current experimental setup.

Memory Use. Table 1 shows the memory use of the methods on the two

data sets. The inverted file approach, ℓ-gram, uses much less memory than the

other approaches, but must have the original text available in order to filter

out false matches and perform the final tf calculations. It is possible for the

wavelet trees in all of the other methods to be compressed, but it is also possible

to compress the text that is used (and counted) in the space requirements for

method ℓ-gram. The Sada method exhibits a higher than expected memory usage

because the protein collection has a high proportion of short documents. The


Sada method requires a csa to be constructed for each document, and in this

case is undesirable, as the csa algorithm has a high startup overhead that is

only recouped as the size of the text indexed increases.

[Figure 4: two box-plot panels, "2 word queries" and "4 word queries", of Time per query (msec), 0 to 15, for Zet, Zet-p, Zet-io, WT, Quantile and Greedy.]

Fig. 4. Time to find word-based queries using Zettair and the best of the new methods

for 2 and 4 word queries on wsj.

Term-based Search. The results up to now demonstrate that the new com-

pressed self-index based methods are capable of performing document listing

search on general patterns in memory faster than previous ℓ-gram based inverted

file approaches. However, these approaches are not directly comparable to com-

mon word-based queries at which traditional inverted indexes excel. Therefore,

we performed additional experiments to test if these approaches are capable

of outperforming a term-based inverted file. To this end, we generated 44,693

additional queries aligned on English word boundaries from the wsj collection.

Statistics of these phrase queries of word length 2 to 15 are given in Appendix D.

Short of implementing an in-memory search engine, it is difficult to choose

a baseline inverted file implementation that will efficiently solve the top-k doc-

ument listing problem. Zettair is a publicly available, open source search engine

engineered for efficiency (www.seg.rmit.edu.au/zettair). In addition to the

usual bag-of-terms query processing using various ranking formulas, it readily

supports phrase queries where the terms in q must occur in order, and also imple-

ments the impact ordering scheme of Anh and Moffat [1]. As such, we employed

Zettair in three modes. Firstly, zet used the Okapi BM-25 ranking formula to

return the top 20 ranked documents for the bag-of-terms q. Secondly, zet-p used

the “phrase query” mode of Zettair to return the top 20 ranked documents which

contained the exact phrase q. Finally, we used the zet-io mode to perform a bag-

of-terms search for q using impact ordered inverted lists and the associated early

termination heuristics. Zettair was modified to ensure that all efficiency mea-

surements were done in ram, just as the self-indexing methods require. Time to

load posting lists into memory is not counted in the measurements.

Fig. 4 shows the time for searching for two- and four-word patterns. We do

not show the times for Sada, VM, and ℓ-gram, as they were significantly slower


than the new methods, as expected from Fig. 2. The Greedy and Quantile methods

used k = 20. Zet-p has better performance, on average, than Zet, and Zet-io

is the most efficient of all word-based inverted indexing methods tested. A direct

comparison between the three Zettair modes and the new algorithms is tenuous,

as Zettair implements a complete ranking, whereas the document listing methods

simply use only the tf and idf as their “ranking” method. However, Zet-io pre-

orders the inverted lists to maximize efficiency, removing many of the standard

calculations performed in Zet and Zet-p. This makes Zet-io comparable in

computational cost with our new methods. The WT approach is surprisingly

competitive with the best inverted indexing method, Zet-io. Given the variable

efficiency of two-word queries with WT (due to the diverse number of possible

document matches for each query), it is difficult to draw definitive conclusions

on the relative algorithm performance. However, the Greedy algorithm is clearly

more efficient than Zet-io (means 0.91ms and 0.69ms, Wilcoxon test, p < 10^−15).

When the phrase length is increased, the two standard Zettair methods get

slower per query, as expected, because they now have to intersect more inverted

lists to produce the final ranked result. Interestingly, all other methods get faster,

as there are fewer total documents to list on average, and fewer intersections for

the impact ordered inverted file. For four word queries, all of the self-indexing

methods are clearly more efficient than the inverted file methods. Adding an idf

computation to Greedy and Quantile will not make them less efficient than WT.

6 Discussion

We have implemented document listing algorithms that, to date, had only been

theoretical proposals. We have also improved one of the approaches, and intro-

duced two new algorithms for the case where only the top-k documents sorted

by tf values are required. For general patterns, approach WT as improved in this

paper was the fastest for document listing, whereas our novel Greedy approach

was much faster for fetching the top k documents (for k < 100, at least). In the

case where the terms comprising documents and queries are fixed as words in the

English language, Greedy is capable of processing 4600 queries per second, com-

pared to the best inverted indexing method, Zet-io, which processes only 1400

on average. These results are extremely encouraging: perhaps self-index based

structures can compete efficiently with inverted files. In turn, this will remove

the restriction that IR system users must express their information needs as

terms in the language chosen by the system, rather than in a more intuitive way.

Our methods return the top-k documents in tf order, which departs from

the tf×idf framework of most information retrieval systems. However, in the

context of general pattern search, q only ever contains one term: the pattern to

be found. For one-term queries, the value of idf is simply a constant multiplier of

the final ranking score, and so is not useful for discriminating documents. If these

data structures are to be used for bag-of-strings search, then the idf factor may

become important, and can be easily extracted using method WT, which is still

faster than Zet-io in our experiments.
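To see why idf cannot change the ranking for a one-term query, one can substitute q = {t} into the scoring function of Appendix A. The short derivation below is only a restatement, under the assumptions that the document weight has the plain form $w_{d,t} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$ and that the query weight $w_{q,t}$ does not depend on d; the symbol $c_t$ is introduced here purely for illustration.

```latex
\hat{S}(q, d) = w_{d,t} \times w_{q,t}
             = \mathrm{tf}_{t,d} \times
               \underbrace{(\mathrm{idf}_t \times w_{q,t})}_{c_t,\ \text{independent of } d}
\quad\Longrightarrow\quad
\mathrm{tf}_{t,d_1} > \mathrm{tf}_{t,d_2}
\iff \hat{S}(q, d_1) > \hat{S}(q, d_2)
\quad \text{whenever } c_t > 0 .
```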

References

1. V. Anh and A. Moffat. Pruned query evaluation using pre-computed impacts. In Proc. 29th ACM SIGIR, pp. 372–379, 2006.

2. A. Apostolico. The myriad virtues of subword trees. In Combinatorial Algorithms on Words, NATO ISI Series, pp. 85–96. Springer-Verlag, 1985.

3. R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.

4. M. Bender and M. Farach-Colton. The LCA problem revisited. In Proc. 4th LATIN, LNCS 1776, pp. 88–94, 2000.

5. P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro. Compressed representations of sequences and full-text indexes. ACM TALG, 3(2):article 20, 2007.

6. J. Fischer and V. Heun. A new succinct representation of RMQ-information and improvements in the enhanced suffix array. In Proc. ESCAPE, pp. 459–470, 2007.

7. T. Gagie, S. Puglisi, and A. Turpin. Range quantile queries: Another virtue of wavelet trees. In Proc. 16th SPIRE, LNCS 5721, pp. 1–6, 2009.

8. R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. In Proc. 14th SODA, pp. 841–850, 2003.

9. W.-K. Hon, R. Shah, and J. S. Vitter. Space-efficient framework for top-k string retrieval problems. In Proc. FOCS, pp. 713–722, 2009.

10. U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM J. Computing, 22(5):935–948, 1993.

11. G. Manzini. An analysis of the Burrows-Wheeler transform. J. ACM, 48(3):407–430, 2001.

12. S. Muthukrishnan. Efficient algorithms for document retrieval problems. In Proc. 13th SODA, pp. 657–666, 2002.

13. G. Navarro and V. Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1):article 2, 2007.

14. M. Persin, J. Zobel, and R. Sacks-Davis. Filtered document retrieval with frequency-sorted indexes. JASIS, 47(10):749–764, 1996.

15. J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In Proc. 21st ACM SIGIR, pp. 275–281, 1998.

16. S. Puglisi, W. Smyth, and A. Turpin. Inverted files versus suffix arrays for locating patterns in primary memory. In Proc. 13th SPIRE, pp. 122–133, 2006.

17. R. Raman, V. Raman, and S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proc. SODA, pp. 233–242, 2002.

18. S. E. Robertson and K. S. Jones. Relevance weighting of search terms. JASIST, 27:129–146, 1976.

19. S. E. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In D. K. Harman, editor, Proc. 3rd TREC, 1994.

20. K. Sadakane. Succinct data structures for flexible text retrieval systems. J. Discrete Algorithms, 5(1):12–22, 2007.

21. G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Comm. ACM, 18(11):613–620, 1975.

22. N. Välimäki and V. Mäkinen. Space-efficient algorithms for document retrieval. In B. Ma and K. Zhang, editors, Proc. 18th CPM, LNCS 4580, pp. 205–215, 2007.

23. I. Witten, A. Moffat, and T. Bell. Managing Gigabytes. Morgan Kaufmann, 2nd edition, 1999.

24. J. Zobel and A. Moffat. Inverted files for text search engines. ACM Computing Surveys, 38(2):1–56, 2006.

A Basic IR Concepts

Relevance Ranking. Most modern search engines rank documents to help users

find relevant information. A similarity metric, $\hat{S}(q, d_i)$, is used to evaluate each

document $d_i$ relative to a user query q (often a list of terms or words). If documents

are treated as “bags of terms”, similarity can be measured using simple

statistical properties of the vocabulary [23]. Successful similarity metrics include

vector space models [21], probabilistic models [18], and language models [15].

While these approaches have different theoretical bases, their similarity scoring

functions all employ some variant of tf×idf information. The tf×idf class of

scoring functions are based on a weighting $w_{d,t}$ derived using a variation of the

formula $w_{d,t} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$, where $\mathrm{tf}_{t,d}$ is the term frequency (number of times t

occurs in d) and $\mathrm{idf}_t$ is the inverse document frequency (logarithm of the inverse

of the fraction of the documents where t appears). Typically, the top-k documents

are returned, ordered by $\hat{S}(q, d_i) = \sum_{t \in q} w_{d_i,t} \times w_{q,t}$, where $w_{d_i,t}$ and $w_{q,t}$

are the weights of the query term t in document $d_i$ and query q, respectively. Using

a tf×idf metric ensures that a high term frequency in an individual document

increases the similarity contribution, while a high frequency across all documents

reduces its contribution. Hence, terms that are common in all documents do not

dominate the more discriminating terms in a query. Popular scoring functions

that make use of tf×idf information include the Vector Space Model [21] and

the Okapi BM25 metric [19]. Many others are possible [24].
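As an illustration of the scoring scheme just described, the following is a minimal Python sketch of top-k ranking with a plain tf×idf weight. It is not the scoring function of any particular system: the function name `top_k_tfidf`, the representation of documents as lists of terms, and the choice of a query weight equal to 1 for every query term are assumptions made only for this example.

```python
import heapq
import math
from collections import Counter

def top_k_tfidf(docs, query, k):
    """Return the k best (doc_id, score) pairs, score = sum of tf * idf over query terms."""
    n_docs = len(docs)
    tfs = [Counter(doc) for doc in docs]            # tf_{t,d} for every document
    df = Counter(t for tf in tfs for t in tf)       # number of documents containing t
    scored = []
    for d, tf in enumerate(tfs):
        score = 0.0
        for t in set(query):
            if tf[t] > 0:
                idf = math.log(n_docs / df[t])      # log of the inverse document fraction
                score += tf[t] * idf * 1.0          # w_{q,t} = 1 is an assumption of this sketch
        scored.append((d, score))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

docs = [["a", "b", "a"], ["b", "c"], ["a", "a", "a", "c"]]
print(top_k_tfidf(docs, ["a"], 2))
```

A term that occurs in every document receives idf = 0 here, so it contributes nothing to any score, which matches the observation that common terms should not dominate.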

Inverted Indexes. For each term that occurs in T (the set of which forms the

vocabulary), an inverted index stores a list of documents that contain that term.

Inverted indexes have been the dominant data structure used in information

retrieval systems for over 30 years [3,23,24]. Inverted indexes require the tex-

tual units (words, q-grams, characters) of the vocabulary to be defined before

indexing commences in order to limit the size of the index and vocabulary, and

to allow tf×idf information to be precomputed and stored. The ordering of

documents within inverted lists can be altered to improve the speed of returning

the top-k documents for a query. Persin et al. [14] give different heuristics to

support top-k ranked retrieval (under the tf×idf model) when inverted lists

are sorted by decreasing tf. Anh and Moffat [1] study various generalizations of

this idea under the name “impact ordering”.
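The effect of reordering inverted lists can be sketched in a few lines of Python. The example below builds a toy in-memory index whose lists are sorted by decreasing tf, so that a one-term top-k query is answered by reading a list prefix. It is a simplified illustration only, not the heuristics of [14] or the impact ordering of [1]; the names `build_tf_sorted_index` and `top_k_single_term` are hypothetical.

```python
from collections import Counter, defaultdict

def build_tf_sorted_index(docs):
    """Map each term to a list of (tf, doc_id) pairs sorted by decreasing tf."""
    index = defaultdict(list)
    for doc_id, terms in enumerate(docs):
        for term, tf in Counter(terms).items():
            index[term].append((tf, doc_id))
    for postings in index.values():
        postings.sort(key=lambda p: (-p[0], p[1]))   # highest tf first, ties by doc_id
    return dict(index)

def top_k_single_term(index, term, k):
    """For a one-term query the first k postings are already the top-k by tf."""
    return [(doc_id, tf) for tf, doc_id in index.get(term, [])[:k]]

docs = [["a", "b", "a"], ["b", "c"], ["a", "a", "a", "c"]]
index = build_tf_sorted_index(docs)
print(top_k_single_term(index, "a", 1))
```

For multi-term queries the lists must still be merged, which is where the pruning heuristics of the cited work come in.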

B Pseudocodes

Fig. 5 gives pseudocode for our new methods.
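As a complement to the pseudocode, the following Python sketch shows the Greedy traversal of Fig. 5(a) on a small example. It makes several simplifying assumptions that are not part of the paper: the wavelet tree is pointer-based and uncompressed, rank queries are answered from a prefix-count array rather than a rank-capable bitvector, ranges are 0-based and half-open (so tf equals `ep - sp` instead of `ep - sp + 1`), and the names `Node` and `top_k_greedy` are hypothetical.

```python
import heapq

class Node:
    """Wavelet tree node over document identifiers in [lo, hi)."""
    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if hi - lo > 1:
            mid = (lo + hi) // 2
            # zeros[i] = how many of the first i symbols are routed to the left child
            self.zeros = [0]
            for x in seq:
                self.zeros.append(self.zeros[-1] + (x < mid))
            self.left = Node([x for x in seq if x < mid], lo, mid)
            self.right = Node([x for x in seq if x >= mid], mid, hi)

def top_k_greedy(root, sp, ep, k):
    """Return up to k (doc, tf) pairs for D[sp:ep], in decreasing tf order."""
    if ep <= sp:
        return []
    out = []
    tie = 0  # insertion counter so the heap never has to compare Node objects
    heap = [(-(ep - sp), tie, root, sp, ep)]   # negated length gives a max-heap
    while heap and len(out) < k:
        neg_len, _, v, s, e = heapq.heappop(heap)
        if v.left is None:                     # leaf: the range length is the tf
            out.append((v.lo, -neg_len))
            continue
        s0, e0 = v.zeros[s], v.zeros[e]        # range mapped into the left child
        s1, e1 = s - s0, e - e0                # range mapped into the right child
        for child, cs, ce in ((v.left, s0, e0), (v.right, s1, e1)):
            if ce > cs:                        # skip empty ranges
                tie += 1
                heapq.heappush(heap, (-(ce - cs), tie, child, cs, ce))
    return out

D = [0, 2, 1, 0, 2, 2, 0, 2]                   # toy document array
root = Node(D, 0, 3)
print(top_k_greedy(root, 0, 8, 2))
```

The first leaf popped is correct because a child range is never longer than its parent's range, so no unexplored node can hide a document with a larger tf than the range currently at the top of the heap.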

C Details of ℓ-gram inverted file

It is possible to use a character-based inverted file to solve the ranked document

search problem for general strings by indexing small, overlapping units of text

of length ℓ. Then a string query q is resolved by decomposing it into its ℓ-gram

procedure top-k-greedy(sp, ep, root, k)
    Let h be an empty max-heap
    h.insert(root, [sp, ep])
    numFound ← 0
    while h not empty and numFound < k do
        (v, [sp′, ep′]) ← h.pop()
        if v is a leaf then
            output (d, tf_{q,d}) ← (v.label, ep′ − sp′ + 1)
            numFound ← numFound + 1
        else
            [s0, e0] ← [rank0(Bv, sp′ − 1) + 1, rank0(Bv, ep′)]
            [s1, e1] ← [rank1(Bv, sp′ − 1) + 1, rank1(Bv, ep′)]
            if n0 = (e0 − s0 + 1) ≠ 0 then
                h.insert(v.left, [s0, e0])
            if n1 = (e1 − s1 + 1) ≠ 0 then
                h.insert(v.right, [s1, e1])

(a)

procedure top-k-quantile(sp, ep, k)
    Let h be an empty min-heap
    m ← ep − sp + 1
    i ← m
    s ← m/2
    while h.size < k or i > h.top.tf do
        p ← s
        while p < m do
            (d, tf_{q,d}) ← quantile(D[sp..ep], p)
            if h.size < k then
                h.insert(d, tf_{q,d})
            elsif h.top.tf < tf_{q,d} then
                h.extract-min()
                h.insert(d, tf_{q,d})
            p ← p + i
        s ← s/2, i ← i/2

(b)

Fig. 5. Algorithms for computing the top-k documents by tf. The algorithm in (a)

maintains a priority queue h of (node,range) pairs, each pair having priority equal to

the length of the range component. The algorithm outputs the top-k documents as it

proceeds; the algorithm in (b) maintains a min-heap of (doc,tf) pairs keyed on tf.

At the end of (b) the heap contains the top-k document numbers and tf values. For

simplicity (b) assumes m = ep − sp + 1 is a power of 2.

components, and treating each component as a term. With suitable modifica-

tion, a classical term-based inverted file can perform the task provided |q| ≥ ℓ.

Accordingly, we developed a block-addressing inverted file and had it index every

distinct ℓ-gram in the collection.

The text collection is concatenated and logically partitioned into ⌈n/b⌉ blocks,

each of size b. The index comprises two pieces. The set of distinct ℓ-grams

in the collection is held in a lexicon, which we implement with a hashtable. With

each ℓ-gram entry in the lexicon is a pointer to a postings list: a list of the blocks

in the collection that contain one or more occurrences of that ℓ-gram. The number

of items in the lexicon is bounded by σ^ℓ, the number of possible substrings of

length ℓ over an alphabet of size σ. However, the set of indexed ℓ-grams tends to be

sparser in practice. For small ℓ the size of the postings lists dominates space usage. To save

space, the lists are compressed by storing the block numbers in increasing order

and encoding only the difference (gap) between adjacent items with a suitable

integer code.
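The gap encoding of a postings list can be sketched as follows. The text only says "a suitable integer code"; the variable-byte code used here is an assumption chosen for simplicity, and the function names are hypothetical.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, low-order bits first."""
    out = bytearray()
    for g in numbers:
        while g >= 128:
            out.append(g & 0x7F)
            g >>= 7
        out.append(g | 0x80)          # high bit marks the final byte of a number
    return bytes(out)

def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    nums, cur, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            nums.append(cur | ((byte & 0x7F) << shift))
            cur, shift = 0, 0
        else:
            cur |= byte << shift
            shift += 7
    return nums

def compress_postings(blocks):
    """blocks: strictly increasing block numbers; store the first value, then gaps."""
    if not blocks:
        return b""
    gaps = [blocks[0]] + [b - a for a, b in zip(blocks, blocks[1:])]
    return vbyte_encode(gaps)

def decompress_postings(data):
    """Rebuild block numbers by taking prefix sums of the decoded gaps."""
    blocks, total = [], 0
    for g in vbyte_decode(data):
        total += g
        blocks.append(total)
    return blocks

postings = [3, 7, 8, 300, 1000]
packed = compress_postings(postings)
print(len(packed), decompress_postings(packed))
```

Because the gaps of a dense list are small, most of them fit in a single byte, which is where the saving over fixed-width block numbers comes from.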

Locating the positions of occurrence of a pattern q, |q| ≥ ℓ, is a two-phase

process. First, using the lexicon, we gather the lists of the at most |q| − ℓ + 1

distinct ℓ-grams contained in q and sort them into increasing order of length. We

then intersect the two shortest lists, to obtain a list of candidate blocks – this is

a superset of the blocks that contain the pattern. We continue to intersect the

candidate list with the remaining gathered lists until either no lists remain, or

the size of the candidate list becomes less than a threshold. At this point, for

each item (block number) in the candidate list, we scan the corresponding

text block looking for occurrences of q. The intersection phase of the algorithm

can be thought of as a filtering step: blocks that cannot contain the pattern are

eliminated from the later scanning phase. The block size, b, allows a space-time

tradeoff: increasing b makes lists smaller but makes our filter less specific and

requires us to scan larger portions of the collection.
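The filter-then-scan process described above can be sketched in Python as follows. This is an illustrative toy, not the implementation used in the experiments. In particular, the text does not say how occurrences that straddle a block boundary are handled; the sketch assumes each block is indexed together with the next `max_q - 1` characters, so that every occurrence of a pattern of length at most `max_q` lies wholly inside the window of the block where it starts. The names `build_index`, `search`, `max_q` and `threshold` are hypothetical.

```python
from collections import defaultdict

def build_index(text, ell, b, max_q):
    """Map each ell-gram to the sorted list of blocks whose window contains it."""
    lists = defaultdict(set)
    overlap = max_q - 1                         # assumption: handles boundary-spanning matches
    n_blocks = (len(text) + b - 1) // b
    for j in range(n_blocks):
        window = text[j * b:(j + 1) * b + overlap]
        for i in range(len(window) - ell + 1):
            lists[window[i:i + ell]].add(j)
    return {gram: sorted(blocks) for gram, blocks in lists.items()}

def search(text, index, q, ell, b, threshold=1):
    """Return all start positions of q in text; requires ell <= len(q) <= max_q."""
    grams = {q[i:i + ell] for i in range(len(q) - ell + 1)}
    if any(g not in index for g in grams):
        return []                               # some ell-gram never occurs
    lists = sorted((index[g] for g in grams), key=len)   # shortest lists first
    candidates = set(lists[0])
    for lst in lists[1:]:
        if len(candidates) < threshold:         # few enough candidates: stop filtering
            break
        candidates &= set(lst)
    hits = []
    for j in sorted(candidates):                # scanning phase, left to right
        limit = (j + 1) * b + len(q) - 1        # matches must start inside block j
        pos = text.find(q, j * b, limit)
        while pos != -1:
            hits.append(pos)
            pos = text.find(q, pos + 1, limit)
    return hits

text = "abracadabra_abracadabra"
index = build_index(text, ell=3, b=4, max_q=8)
print(search(text, index, "abra", ell=3, b=4))
```

Stopping the intersection early never loses matches, because the candidate list is always a superset of the blocks that contain the pattern; it only shifts work from list intersection to scanning.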

During the in-block scanning phase, as we locate positions of pattern oc-

currence, these positions must be mapped to document numbers and tf values

accumulated. Clearly once all blocks have been scanned we will also know idf. To

facilitate mapping between text positions and document numbers we store the N

document boundaries in a sorted array. To find the document a given position i

is contained in we simply (binary) search for the smallest value greater than i in

this array. Because we scan text blocks left-to-right, positions of occurrence are

mapped to documents in increasing order, allowing the amount of the mapping

array searched to be reduced as block scanning proceeds: a small optimization.

The space cost for the mapping array is N log n bits, which is negligible relative to other

index components. In all experiments we use ℓ = 3 and block size b = 4096.
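The position-to-document mapping with its left-to-right optimization can be sketched as below. The sketch assumes 0-based text positions, a boundary array holding the position just past the end of each document, and occurrence positions supplied in increasing order; the names `doc_boundaries` and `accumulate_tf` are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

def doc_boundaries(doc_lengths):
    """ends[d] = position just past the last character of document d."""
    ends, total = [], 0
    for length in doc_lengths:
        total += length
        ends.append(total)
    return ends

def accumulate_tf(positions, ends):
    """Map increasing occurrence positions to documents and count tf per document."""
    tf = Counter()
    lo = 0
    for i in positions:
        # smallest boundary greater than i; earlier boundaries are never revisited
        lo = bisect_right(ends, i, lo)
        tf[lo] += 1
    return tf

ends = doc_boundaries([5, 3, 4])            # documents cover [0,5), [5,8), [8,12)
tf = accumulate_tf([0, 4, 5, 9, 11], ends)
print(dict(tf), "documents containing the pattern:", len(tf))
```

The number of distinct keys in the counter is the number of documents containing the pattern, from which idf follows once all candidate blocks have been scanned.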

D Query Data

Table 2 summarizes attributes of the queries used. Note the unusually low aver-

age occ and docc values for the queries of length 5 for the protein data set.