Top-k Ranked Document Search in General Text Databases
J. Shane Culpepper¹, Gonzalo Navarro², Simon J. Puglisi¹, and Andrew Turpin¹
¹ School of Computer Science and Information Technology, RMIT Univ., Australia
{shane.culpepper,simon.puglisi,andrew.turpin}@rmit.edu.au
² Department of Computer Science, Univ. of Chile. gnavarro@dcc.uchile.cl
Abstract. Text search engines return a set of k documents ranked by
similarity to a query. Typically, documents and queries are drawn from
natural language text, which can readily be partitioned into words, allow-
ing optimizations of data structures and algorithms for ranking. However,
in many new search domains (DNA, multimedia, OCR texts, Far East
languages) there is often no obvious definition of words and traditional
indexing approaches are not so easily adapted, or break down entirely.
We present two new algorithms for ranking documents against a query
without making any assumptions on the structure of the underlying text.
We build on existing theoretical techniques, which we have implemented
and compared empirically with new approaches introduced in this pa-
per. Our best approach is significantly faster than existing methods in
RAM, and is even three times faster than a state-of-the-art inverted file
implementation for English text when word queries are issued.
1 Introduction
Text search is a vital enabling technology in the information age. Web search
engines such as Google allow users to find relevant information quickly and easily
in a large corpus of text, T . Typically, a user provides a query as a list of words,
and the information retrieval (IR) system returns a list of relevant documents
from T , ranked by similarity.
Most IR systems rely on the inverted index data structure to support efficient
relevance ranking [24]. Inverted indexes require the definition of terms in T prior
to their construction. In the case of many natural languages, the choice of terms
is simply the vocabulary of the language: words. In turn, for the inverted index
to operate efficiently, queries must be composed only of terms that are in the
index. For many natural languages this is intuitive for users; they can express
their information needs as bags of words or phrases.
However, in many new search domains the requirement to choose terms prior
to indexing is either not easily accommodated, or leads to unacceptable restrictions
on queries. For example, several Far East languages are not easily parsed into
words, and a user may adopt a different parsing from that used to create the index.
Likewise, natural language text derived from OCR or speech-to-text systems
may contain “words” that will not form terms in the mind of a user because they
contain errors. Other types of text simply do not have a standard definition of
a term, such as biological sequences (DNA, protein) and multimedia signals.
With this in mind, in this paper we take the view of a text database (or
collection) T as a string of n symbols drawn from an alphabet Σ. T is partitioned
into N documents {d1,d2,...,dN}. Queries are also strings (or sets of strings)
composed of symbols drawn from Σ. Here, the symbols in Σ may be bytes,
letters, nucleotides, or even words if we so desire; and the documents may be
articles, chromosomes or any other texts in which we need to search. In this
setting we consider the following two problems.
Problem 1. A document listing search takes a query q ∈ Σ∗ and a text T ∈ Σ∗
that is partitioned into N documents, {d1,d2,...,dN}, and returns a list of the
documents in which q appears at least once.
Problem 2. A ranked document search takes a query q ∈ Σ∗, an integer 0 < k ≤
N, and a text T ∈ Σ∗ that is partitioned into N documents {d1,d2,...,dN},
and returns the top-k documents ordered by a similarity measure Ŝ(q,di).
By generalizing the problems away from words, we aim to develop indexes
that support efficient search in new types of text collections, such as those out-
lined above, while simultaneously enabling users in traditional domains (like
web search) to formulate richer queries, for example containing partial words or
markup. In the ranked document search problem, we focus on the specific case
where Ŝ(q,di) is the tf×idf measure. tf×idf is the basic building block for a
large class of similarity measures used in the most successful IR systems.
This paper contains two contributions towards efficient ranked document
search in general texts. (1) We implement, empirically validate and compare
existing theoretical proposals for document listing search on general texts, and
include a new variant of our own. (2) We propose two novel algorithms for
ranked document search using general query patterns. These are evaluated and
compared empirically to demonstrate that they perform more efficiently than
document listing approaches. In fact, the new ranked document search algo-
rithms are three times faster than a highly tuned inverted file implementation
that assumes terms to be English words.
Our approach is to build data structures that allow us to efficiently calculate
the frequency of a query pattern in a document (tf) on the fly, unlike traditional
inverted indexes that store precomputed tf values for specific query patterns
(usually words). Importantly, we are able to derive this tf information in an
order which allows rapid identification of the top-k ranked documents. We see
this work as an important first step toward practical ranked retrieval for large
general-text collections, and an extension of current indexing methods beyond
traditional algorithms that assume a lexicon of terms a priori.
2 Basic Concepts
Relevance Ranking. We will focus on the tf×idf measure, where tf_{t,d} is the
number of times term t appears in document d, and idf_t is related to the number
of documents where t appears. Appendix A covers the related basic concepts.
Suffix Arrays and Self-Indexes. The suffix array A[1..n] of a text collection T
of length n is a permutation of (1...n), so that the suffixes of T , starting
at the consecutive positions indicated in A, are lexicographically sorted [10]:
T [A[i]..n] < T [A[i+1]..n]. Because of the lexicographic ordering, all the suffixes
starting with a given substring t of T form a range A[sp..ep], which can be deter-
mined by binary search in O(|t|logn) time. Variants of this basic suffix array are
efficient data structures for returning all positions in T where a query pattern
q occurs; once sp and ep are located for t = q, it is simple to enumerate the
occ = ep−sp+1 occurrences of q. However, if T is partitioned into documents,
then listing the documents that contain q, rather than all occurrences, in less
than O(occ) time is not so straightforward; see Section 3.1.
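As a minimal illustration of the range search just described, the following Python sketch builds a suffix array naively and locates the range of suffixes prefixed by a pattern with two binary searches. The names suffix_array and sa_range are ours, the construction is a simple stand-in rather than an efficient algorithm, and indices are 0-based with a half-open range [sp, ep), unlike the 1-based inclusive A[sp..ep] used in the text.

```python
def suffix_array(text):
    # Naive construction by sorting suffixes; adequate for a small illustration only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_range(text, sa, pattern):
    """Return the half-open range [sp, ep) of suffixes that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    sp = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose length-m prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sp, lo


text = "abracadabra"
sa = suffix_array(text)
sp, ep = sa_range(text, sa, "abra")
positions = sorted(sa[sp:ep])  # text positions of the occ = ep - sp occurrences
```

Each binary search performs O(log n) string comparisons of length at most |t|, which matches the O(|t| log n) bound quoted above.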
Self-indexes [13] offer the same functionality as a suffix array but are heavily
compressed. More formally, they can (1) extract any text substring T [i..j], (2)
compute sp and ep for a pattern t, and (3) return A[i] for any i.
For example, the Alphabet-Friendly FM-index (AF-FMI) [5] occupies nH_h(T) +
o(n log σ) bits, where σ is the size of the text alphabet, H_h is the h-th order
empirical entropy [11] (a lower bound on the space required by any order-h
statistical compressor), and h ≤ α log_σ n for any constant 0 < α < 1. It carries
out (1) in time O(log^{1+ε} n + (j − i) log σ) for any constant ε > 0, (2) in time
O(|t| log σ) and (3) in time O(log^{1+ε} n).
Wavelet Trees. The wavelet tree [8] is a data structure for representing a sequence
D[1..n] over an alphabet Σ of size σ. It requires nH_0(D) + o(n log σ) + O(σ log n)
bits of space, which is asymptotically never larger than the n⌈log σ⌉ bits needed
to represent D in plain form (assuming σ = o(n)), and can be significantly
smaller if D is compressible. A wavelet tree computes D[i] in time O(log σ),
as well as rank_c(D,i), the number of occurrences of symbol c in D[1..i], and
select_c(D,j), the position in D of the j-th occurrence of symbol c.
An example of a wavelet tree is shown in Fig. 1, and has a structure as
follows. At the root, we divide the alphabet Σ into symbols < c and ≥ c, where
c is the median of Σ. We then store bitvector B_root[1..n] in the root node, where
B_root[i] = 0 if D[i] < c and 1 otherwise. Now the left child of the root will
handle sequence D_left, formed by concatenating together all the symbols < c in
D[1..n] (respecting the order); and the right child will handle D_right, which has
the symbols ≥ c. At the leaves, where all the symbols of the corresponding D_leaf
are equal, nothing is stored. It is easy to see that there are ⌈log σ⌉ levels and
that n bits are spent per level, for a total of at most n⌈log σ⌉ bits. If, instead, the
bitvectors at each level are represented in compressed form [17], the total space
of each bitvector B_v becomes nH_0(B_v) + o(n), which adds up to the promised
nH_0(D) + o(n log σ) + O(σ log n) bits for the whole wavelet tree.
The compressed bitvectors also allow us to obtain B[i], and to compute rank
and select, in constant time over the bitvectors, which enables the corresponding
O(log σ)-time operations on sequence D; in particular D[i], rank_c(D,i)
and select_c(D,j) all take O(log σ) time via simple tree traversals (see [13]).
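The traversals behind D[i] and rank can be sketched in a few lines of Python. This is an illustrative, uncompressed, pointer-based variant and not the structure used in the experiments: each node keeps a prefix-count array (called zeros here) in place of a rank-enabled bitvector, the integer alphabet range is split at its midpoint with symbols ≤ mid going left (a slight variation on the < c / ≥ c split above), indices are 0-based, and select is omitted. The names WaveletNode, access and rank are ours.

```python
class WaveletNode:
    """Wavelet tree node over the integer symbol range [lo, hi]."""

    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        if lo == hi:
            return  # leaf: nothing is stored
        mid = (lo + hi) // 2
        # zeros[i] = number of 0-bits (symbols <= mid) among the first i symbols
        self.zeros = [0]
        for s in seq:
            self.zeros.append(self.zeros[-1] + (s <= mid))
        self.left = WaveletNode([s for s in seq if s <= mid], lo, mid)
        self.right = WaveletNode([s for s in seq if s > mid], mid + 1, hi)


def access(node, i):
    """Return D[i] (0-based) by following the bit at position i down the tree."""
    while node.lo != node.hi:
        z0, z1 = node.zeros[i], node.zeros[i + 1]
        if z1 > z0:  # bit i is 0, so the symbol lives in the left child
            i, node = z0, node.left
        else:        # bit i is 1, so the symbol lives in the right child
            i, node = i - z0, node.right
    return node.lo


def rank(node, c, i):
    """Return the number of occurrences of symbol c in D[0..i-1]."""
    while node.lo != node.hi:
        mid = (node.lo + node.hi) // 2
        if c <= mid:
            i, node = node.zeros[i], node.left
        else:
            i, node = i - node.zeros[i], node.right
    return i


D = [6, 2, 5, 6, 2, 3, 1, 8, 5, 1, 5, 5, 1, 4, 3, 7]  # the sequence of Fig. 1
root = WaveletNode(D, 1, 8)
tf_doc1 = rank(root, 1, 13) - rank(root, 1, 2)  # occurrences of 1 in D[3..13], 1-based
```

Both functions visit one node per level, which is where the O(log σ) bound comes from once the per-node counts are available in constant time.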
Fig. 1. D = {6,2,5,6,2,3,1,8,5,1,5,5,1,4,3,7} as a wavelet tree. The top row of
each node shows D, the second row the bitvector B_v, and numbers in circles are node
numbers for reference in the text. n_0 and n_1 are the number of 0 and 1 bits respectively
in the shaded region of the parent node of the labelled branch. Shaded regions show
the parts of the nodes that are accessed when listing documents in the region D[sp =
3..ep = 13]. Only the bitvectors (preprocessed for rank and select) are present in the
actual structure; the numbers above each bitvector are included only to aid explanation.
3 Previous Work
3.1 Document Listing
The first solution to the document listing problem on general text collections [12]
requires optimal O(|q| + docc) time, where docc is the number of documents
returned; but O(n log n) bits of space, substantially more than the n log σ bits
used by the text. It stores an array D[1..n], aligned to the suffix array A[1..n], so
that D[i] gives the document to which text position A[i] belongs. Another array, C[1..n],
stores in C[i] the last occurrence of D[i] in D[1..i−1]. Finally, a data structure is
built on C to answer the range minimum query RMQ_C(i,j) = argmin_{i≤r≤j} C[r]
in constant time [4]. The algorithm first finds A[sp..ep] in time O(|q|) using the
suffix tree of T [2]. To retrieve all of the unique values in D[sp..ep], it starts with
the interval [s..e] = [sp..ep] and computes i = RMQ_C(s,e). If C[i] ≥ sp it stops;
otherwise it reports D[i] and continues recursively with the subintervals [s..i − 1]
and [i + 1..e] (condition C[i] ≥ sp always refers to the original sp value). It
can be shown that a different D[i] value is reported at each step.
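The recursion can be made concrete with a short Python sketch. It is only an illustration under simplifying assumptions: the range minimum query is answered by a linear scan rather than by the constant-time structure of [4], indices are 0-based and inclusive, a missing previous occurrence is encoded as −1, and the names build_prev and list_documents are ours.

```python
def build_prev(D):
    """C[i] = position of the previous occurrence of D[i], or -1 if there is none."""
    last, C = {}, []
    for i, d in enumerate(D):
        C.append(last.get(d, -1))
        last[d] = i
    return C


def list_documents(D, C, sp, ep):
    """Return the distinct values of D[sp..ep] (0-based, inclusive), each once."""
    out = []

    def visit(s, e):
        if s > e:
            return
        # Naive range minimum query over C[s..e]; a stand-in for the O(1) structure.
        i = min(range(s, e + 1), key=lambda r: C[r])
        if C[i] >= sp:  # always compared against the original sp
            return      # every document in D[s..e] was already seen further left
        out.append(D[i])
        visit(s, i - 1)
        visit(i + 1, e)

    visit(sp, ep)
    return out


D = [6, 2, 5, 6, 2, 3, 1, 8, 5, 1, 5, 5, 1, 4, 3, 7]  # the sequence of Fig. 1
C = build_prev(D)
docs = list_documents(D, C, 2, 12)  # D[3..13] in the 1-based notation of the text
```

A position i is reported exactly when C[i] < sp, that is, when it holds the first occurrence of its document inside the query range, which is why no document is reported twice.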
By representing D with a wavelet tree, values C[i] can be calculated on
demand, rather than stored explicitly [22]. This reduces the space to |CSA| +
n log N + 2n + o(n log N) bits, where |CSA| is the size of any compressed
suffix array (Section 2). The CSA is used to find D[sp..ep], and then C[i] =
select_{D[i]}(D, rank_{D[i]}(D,i) − 1) is determined from the wavelet tree of D in
O(log N) time. They use a compact data structure of 2n + o(n) bits [6] for the
RMQ queries on C. If, for example, the AF-FMI is used as the compressed suffix
array then the overall time to report all documents for query q is O(|q| log σ +
docc log N). With this representation, tf_{t,d} = rank_d(D,ep) − rank_d(D,sp − 1).
Gagie et al. [7] use the wavelet tree in a way that avoids RMQs on C altogether. By
traversing down the wavelet tree of D, while setting sp′ = rank_b(B_v,sp−1)+1
and ep′ = rank_b(B_v,ep) as we descend to the left (b = 0) or right (b = 1) child
of B_v, we reach each possible distinct leaf (document value) present in D[sp..ep]
once. To discover each successive unique d value, we first descend to the left child
each time the resulting interval [sp′,ep′] is not empty, otherwise we descend to
the right child. By also trying the right child each time we have gone to the left,
all the distinct successive d values in the interval are discovered. We also get
tf_{t,d} = ep′ − sp′ + 1 upon arriving at the leaf of each d. They show that it is
possible to get the i-th document in the interval directly in O(log N) time. This
is the approach we build upon to get our new algorithms described in Section 4.
Sadakane [20] offers a different space-time tradeoff. He builds a compressed
suffix array A, and a parentheses representation of C in order to run RMQ queries
on it without accessing C. Furthermore, he stores a bitvector B indicating the
points of T where documents start. This emulates D[i] = rank_1(B,A[i]) for
document listing. The overall space is |CSA| + 4n + o(n) + N log(n/N) bits. A further
|CSA| bits are required in order to compute the tf_{t,d} values. If the AF-FMI is
used as the implementation of A, the time required is O(|q| log σ + docc log^{1+ε} n).
Any document listing algorithm obtains docc trivially, and hence idf_t =
log(N/docc). If, however, a search algorithm is used that does not list all
documents, idf must be explicitly computed. Sadakane [20] proposes a 2n+o(n) bit
data structure built over the suffix array to compute idf_t for a given t.
3.2 Top-k Retrieval
In IR it is typical that only the top k ranked documents are required, for some k,
as for example in Web search. There has been little theoretical work on solving
this “top-k” variant of the document listing problem. Muthukrishnan [12] solves
a variant where only the docc′ documents that contain at least f occurrences
of q (tf_{q,d} ≥ f) are reported, in time O(|q| + docc′). This requires a general
data structure of O(n log n) bits, plus a specific one of O((n/f) log n) bits. This
approach does not exactly solve the ranked document search problem. Recently,
Hon et al. [9] extended the solution to return the top-k ranked documents in
time O(|q| + k log k), while keeping O(n log n) bits of space. They also gave a
compressed variant with 2|CSA| + o(n) + N log(n/N) bits and O(|q| + k log^{4+ε} n)
query time, but its practicality is not clear.
4 New Algorithms
We introduce two new algorithms for top-k document search extending Gagie
et al.’s proposal for document listing [7]. Gagie et al. introduce their method as
a repeated application of the quantile(D[sp..ep],p) function, which returns the
p-th number in D[sp..ep] if that subarray were sorted. To get the first unique
document number in D[sp..ep], we issue d_1 = quantile(D[sp..ep],1). To find the
next value, we issue d_2 = quantile(D[sp..ep],1 + tf_{q,d_1}). The j-th unique
document will be d_j = quantile(D[sp..ep], 1 + Σ_{i=1}^{j−1} tf_{q,d_i}), with the
frequencies computed along the way as tf_{t,d} = rank_d(D,ep) − rank_d(D,sp − 1).
This lists the documents and their tf values in increasing document number order.
Our first contribution to improving document listing search algorithms is the
observation, not made by Gagie et al., that the tf_{q,d} value can be collected on the
way to extracting document number d from the wavelet tree built on D. In the
parent node of the leaf corresponding to d, tf_{q,d} is equal to the number of 0-bits
(resp. 1-bits) in B_v[sp′..ep′] if d’s leaf is a left child (resp. right child). Thus, two
wavelet tree rank operations are avoided; an important practical improvement.
We now recast Gagie et al.’s algorithm. When listing all distinct documents
in D[sp..ep], the algorithm of Gagie et al. can be thought of as a depth-first
traversal of the wavelet tree that does not follow paths leading only to
document numbers not occurring in D[sp..ep].
Consider the example tree of Fig. 1, where we list the distinct numbers in
D[3..13]. A depth-first traversal begins by following the leftmost path to leaf 8.
As we step left to a child, we take note of the number of 0-bits in the range used
in its parent node, labelled n_0 on each branch. Both n_0 and n_1 are calculated to
determine if there is a document number of interest in the left and right child. As
we enter leaf 8, we know that there are n_0 = 3 copies of document 1 in D[3..13],
and report this as tf_{q,1} = 3. Next in the depth-first traversal is leaf 9, thus
we report tf_{q,2} = 1, the n_1 value of its parent node 5. The traversal continues,
reporting tf_{q,3} = 1, and then moves to the right branch of the root to fetch the
remainder of the documents to report.
Again, this approach produces the document numbers in increasing document
number order. These can obviously be post-processed to extract the k
documents with the highest tf_{q,d} values by sorting the docc values. A more
efficient approach, and our focus next, fetches the document numbers in tf order,
and then only the first k are processed.
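A constrained depth-first traversal of this kind can be sketched in Python as follows. It is a simplified illustration rather than the implementation evaluated in Section 5: the tree is uncompressed and pointer-based, each node stores prefix counts of 0-bits (zeros) in place of a rank-enabled bitvector, symbols ≤ mid go left, and ranges are 0-based and half-open, so D[3..13] becomes [2, 13). The names WaveletNode and list_with_tf are ours.

```python
class WaveletNode:
    """Wavelet tree node over the integer symbol range [lo, hi]."""

    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        if lo == hi:
            return  # leaf: nothing is stored
        mid = (lo + hi) // 2
        # zeros[i] = number of 0-bits (symbols <= mid) among the first i symbols
        self.zeros = [0]
        for s in seq:
            self.zeros.append(self.zeros[-1] + (s <= mid))
        self.left = WaveletNode([s for s in seq if s <= mid], lo, mid)
        self.right = WaveletNode([s for s in seq if s > mid], mid + 1, hi)


def list_with_tf(node, s, e, out):
    """Append (document, tf) for each distinct value in the half-open range [s, e)."""
    if s >= e:
        return  # empty range: no document of interest below this node
    if node.lo == node.hi:
        out.append((node.lo, e - s))  # tf is the size of the range reaching the leaf
        return
    z = node.zeros
    list_with_tf(node.left, z[s], z[e], out)            # n0 = z[e] - z[s] symbols
    list_with_tf(node.right, s - z[s], e - z[e], out)   # n1 = (e - s) - n0 symbols


D = [6, 2, 5, 6, 2, 3, 1, 8, 5, 1, 5, 5, 1, 4, 3, 7]  # the sequence of Fig. 1
root = WaveletNode(D, 1, 8)
result = []
list_with_tf(root, 2, 13, result)
```

Because the left child is always visited first, the pairs come out in increasing document number order, and no separate rank queries are needed to obtain the frequencies.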
4.1 Top-k via Greedy Traversal
The approach used in this method is to prioritize the traversal of the wavelet tree
nodes by the size of the range [sp′..ep′] in the node’s bitvector. By traversing to
nodes with larger ranges in a greedy fashion, we will reach the document leaves
in tf order, and reach the first k leaves potentially having explored much less
of the tree than we would have using a depth-first-style traversal.
We maintain a priority queue of (node, range) pairs, initialized with the single
pair (root,[sp..ep]). The priority of a pair favors larger ranges, and ties are broken
in favor of deeper nodes. At each iteration, we remove the pair (v,[sp′..ep′]) with
largest ep′ − sp′. If v is a leaf, then we report the corresponding document and
its tf value, ep′ − sp′ + 1. Otherwise, the node is internal; if B_v[sp′..ep′] contains
one or more 0-bits (resp. 1-bits) then at least one document to report lies in
the left subtree (resp. right subtree) and so we insert the child node with an
appropriate range, which will have size n_0 (resp. n_1), into the queue. Note that
each iteration inserts between zero and two new elements into the queue.
Fig. 5(a) gives pseudocode. In the worst case, this approach will explore
almost as much of the tree as would be explored during the constrained
depth-first traversal of Gagie et al., and so requires O(docc log N) time. This worst case
is reached when every node that is a parent of a leaf is in the queue, but only
one leaf is required, e.g. when all of the documents in D[sp..ep] have tf_{q,d} = 1.
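The greedy traversal can be sketched in Python with the standard heapq module. This is an illustration under the same simplifying assumptions as before and is not the pseudocode of Fig. 5(a): the wavelet tree is uncompressed and pointer-based with prefix counts of 0-bits per node, symbols ≤ mid go left, ranges are 0-based and half-open, and the names WaveletNode and top_k_greedy are ours.

```python
import heapq
from itertools import count


class WaveletNode:
    """Wavelet tree node over the integer symbol range [lo, hi]."""

    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        if lo == hi:
            return  # leaf: nothing is stored
        mid = (lo + hi) // 2
        # zeros[i] = number of 0-bits (symbols <= mid) among the first i symbols
        self.zeros = [0]
        for s in seq:
            self.zeros.append(self.zeros[-1] + (s <= mid))
        self.left = WaveletNode([s for s in seq if s <= mid], lo, mid)
        self.right = WaveletNode([s for s in seq if s > mid], mid + 1, hi)


def top_k_greedy(root, s, e, k):
    """Return up to k (document, tf) pairs for D[s:e] in non-increasing tf order."""
    order = count()  # unique counter so that nodes themselves are never compared
    # heapq is a min-heap, so sizes and depths are negated: larger ranges first,
    # and among equal ranges the deeper node first.
    heap = [(-(e - s), 0, next(order), root, s, e)] if e > s else []
    out = []
    while heap and len(out) < k:
        _, neg_depth, _, node, s, e = heapq.heappop(heap)
        if node.lo == node.hi:
            out.append((node.lo, e - s))
            continue
        z = node.zeros
        children = ((node.left, z[s], z[e]), (node.right, s - z[s], e - z[e]))
        for child, cs, ce in children:
            if ce > cs:  # only non-empty ranges can lead to a document
                heapq.heappush(heap, (-(ce - cs), neg_depth - 1, next(order), child, cs, ce))
    return out


D = [6, 2, 5, 6, 2, 3, 1, 8, 5, 1, 5, 5, 1, 4, 3, 7]  # the sequence of Fig. 1
root = WaveletNode(D, 1, 8)
top2 = top_k_greedy(root, 2, 13, 2)  # D[3..13] in the 1-based notation of the text
```

When a leaf is removed from the queue, every remaining entry has a range no larger than the leaf's, and a range size bounds the tf of every document below that node, so the leaves appear in non-increasing tf order.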
4.2 Top-k via Quantile Probing
We now exploit the fact that in a sorted array X[1..m] of document numbers,
if a document d occurs more than m/2 times, then X[m/2] = d. The same
argument applies for numbers with frequency > m/4: if they exist, they must
occur at positions m/4, 2m/4 or 3m/4 in X. In general we have the following:
Observation 1 On a sorted array X[1..m], if there exists a d ∈ X with frequency
larger than m/2^i then there exists at least one j such that X[j·m/2^i] = d.
Of course we cannot afford to fully sort D[sp..ep]. However, we can access the
elements of D[sp..ep] as if they were sorted using the aforementioned quantile
queries [7] over the wavelet tree of D. That is, we can determine the document
d with a given rank r in D[sp..ep] using quantile(D[sp..ep],r) in O(logN) time.
In the remainder of this section we refer to D[sp..ep] as X[1..m] with m a
power of 2, and assume we can probe X as if it were sorted (with each probe
requiring O(logN) time). Fig. 5(b) gives pseudocode for the final method.
To derive a top-k listing algorithm, we apply Obs. 1 in rounds. As the
algorithm proceeds, we will accumulate candidates for the top-k documents in a
min-heap of at most k pairs of the form (d, tf_{q,d}), keyed on tf_{q,d}. In round 1, we
determine the document d with rank m/2 and its frequency tf_{q,d}. If d does not
already have an entry in the heap,³ then we add the pair (d,tf_{q,d}) to the heap,
with priority tf_{q,d}. This ends the first round. Note that the item we inserted
may in fact have tf_{q,d} ≤ m/2, but at the end of the round if a document d has
tf_{q,d} > m/2, then it is in the heap. We continue, in round 2, to probe the
elements X[m/4] and X[3m/4], and their frequencies f_{X[m/4]} and f_{X[3m/4]}. If the
heap contains fewer than k items, and does not contain an entry for X[m/4], we
insert (X[m/4],f_{X[m/4]}). Else we check the frequency of the minimum item.
If it is less than f_{X[m/4]}, we extract the minimum and insert (X[m/4],f_{X[m/4]}).
We then perform the same check and update with (X[3m/4],f_{X[3m/4]}).
In round 2 we need not ask about the element with rank 2m/4 = m/2, as we
already probed it in round 1. To avoid reinspecting ranks, during the i-th round
we determine only the elements whose ranks are odd multiples of m/2^i, that is,
m/2^i, 3m/2^i, 5m/2^i, .... The total
number of elements probed (and hence quantile queries) to list all documents is
at most 4m/f_min, where f_min is the k-th highest frequency in the result.
³ We can determine this easily in O(1) time by maintaining a bitvector of size N.
[Figure 2: two panels, PROTEIN and WSJ; x-axis: Query Length (3 to 20); y-axis: Time (msec), logarithmic from 1 to 10000; curves are numbered 1 Sada, 2 ℓ-gram, 3 VM, 4 WT, 5 Quantile, 6 Greedy.]
Fig. 2. Mean time to find documents for all 200 queries of each length for methods
Sada, ℓ-gram, VM and WT, and mean time to report the top k = 10 documents by
tf_{q,d} for methods Quantile and Greedy. (Lines interpolated for clarity.)
Due to Obs. 1, and because we maintain items in a min-heap, at the end of
the i-th round, the k most frequent documents having tf > m/2^i are guaranteed
to be in our heap. Thus, if the heap contains k items at the start of round i+1,
and the smallest element in it has tf ≥ m/2^i + 1 (that is, tf > m/2^i), then no
element in the heap can be displaced; we have found the top-k items and can stop.
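The probing scheme can be sketched in Python as follows. This is an illustration under stated assumptions, not the pseudocode of Fig. 5(b): the wavelet tree is uncompressed and pointer-based, ranges are 0-based and half-open, a Python set replaces the bitvector of the footnote, m is conceptually rounded up to a power of two M with ranks beyond m skipped, and the loop uses a conservative stopping rule (stop once the k-th best frequency is at least the current probing step). The names WaveletNode, quantile and top_k_quantile are ours.

```python
import heapq


class WaveletNode:
    """Wavelet tree node over the integer symbol range [lo, hi]."""

    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        if lo == hi:
            return  # leaf: nothing is stored
        mid = (lo + hi) // 2
        # zeros[i] = number of 0-bits (symbols <= mid) among the first i symbols
        self.zeros = [0]
        for s in seq:
            self.zeros.append(self.zeros[-1] + (s <= mid))
        self.left = WaveletNode([s for s in seq if s <= mid], lo, mid)
        self.right = WaveletNode([s for s in seq if s > mid], mid + 1, hi)


def quantile(node, s, e, r):
    """Return (value, frequency) of the r-th smallest (1-based) element of D[s:e]."""
    while node.lo != node.hi:
        z = node.zeros
        n0 = z[e] - z[s]  # number of range elements that go to the left child
        if r <= n0:
            s, e, node = z[s], z[e], node.left
        else:
            r -= n0
            s, e, node = s - z[s], e - z[e], node.right
    return node.lo, e - s


def top_k_quantile(root, s, e, k):
    """Return up to k (tf, document) pairs for D[s:e], highest tf first."""
    m = e - s
    if m <= 0 or k <= 0:
        return []
    step = 1
    while step < m:  # M = smallest power of two that is >= m
        step *= 2
    heap, seen = [], set()  # min-heap of (tf, document); documents already probed
    while True:
        for r in range(step, m + 1, 2 * step):  # odd multiples of step not probed before
            d, tf = quantile(root, s, e, r)
            if d in seen:
                continue
            seen.add(d)
            if len(heap) < k:
                heapq.heappush(heap, (tf, d))
            elif heap[0][0] < tf:
                heapq.heapreplace(heap, (tf, d))
        # Every document with tf >= step has been probed by now.
        if step == 1 or (len(heap) == k and heap[0][0] >= step):
            break
        step //= 2
    return sorted(heap, reverse=True)


D = [6, 2, 5, 6, 2, 3, 1, 8, 5, 1, 5, 5, 1, 4, 3, 7]  # the sequence of Fig. 1
root = WaveletNode(D, 1, 8)
top2 = top_k_quantile(root, 2, 13, 2)  # D[3..13] in the 1-based notation of the text
```

A document with tf ≥ step occupies at least step consecutive ranks of the sorted range, and any such run contains a multiple of step, which is why it cannot escape the probes made so far.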
5 Experiments
We evaluated our new algorithms (Greedy from Section 4.1 and Quantile from
Section 4.2) with English text and protein collections. We also implemented
our improved version of Gagie et al.’s Wavelet Tree document listing method,
labelled WT. We include three baseline methods derived from previous work on
the document listing problem. The first two are implementations of Välimäki
and Mäkinen [22] and Sadakane [20] as described in Section 3, labelled VM and
Sada respectively. The third, ℓ-gram, is a close variant of Puglisi et al.’s inverted
index of ℓ-grams [16], used with parameters ℓ = 3 and block size = 4096. It is
described in detail in Appendix C.
Experimental Data. We use two data sets. wsj is a 100MB collection of 36,603
news documents in text format drawn from disk three of the trec data collection
(http://trec.nist.gov). protein is a concatenation of 143,244 Human and
Mouse protein sequences totalling 60MB (http://www.ebi.ac.uk/swissprot).
For each collection, 200 queries were randomly generated for each character
length from 3 to 20, each appearing at least 5 times in the collection, for
a total of 3,600 sample queries. Each query was run 10 times. Statistics of the
queries used are presented in Appendix D.
Timing Results. Fig. 2 shows the total time for 200 queries of each query
length for all methods. The document listing method of Gagie et al. with our
optimizations (number 4 on the graphs) is clearly the fastest method for finding
all documents that contain the query q, together with their tf_{q,d} values, in
document number order. The two algorithms which implicitly return documents in
decreasing tf_{q,d} order, Quantile and Greedy, are faster than all other methods.
Note these final two methods are only timed to return k = 10 documents, but if
the other methods were set the same task, their times would only increase as they
must enumerate all documents prior to choosing the top-k. Moreover, we found
that choosing any value of k from 1 to 100 had little effect on the runtime of
Greedy and Quantile.
[Figure 3: two panels; x-axis: methods Sada, ℓ-gram, VM, WT, Quantile, Greedy; y-axis: Time (msec per Document), 0.0 to 1.2.]
Fig. 3. Time per document listed as in Fig. 2, with 25th and 75th percentiles (boxes),
median (solid line), and outliers (whiskers).
          Sada  ℓ-gram  VM   WT   Quantile  Greedy
wsj       572   122     391  341  341       341
protein   870   77      247  217  217       217
Table 1. Peak memory use during search (MB) for the algorithms on wsj and protein.
Note the anomalous drop in query time for |q| = 5 on protein for all methods
except ℓ-gram. This is a result of the low occ and docc for that set of queries,
thus requiring less work from the self-index methods. Time taken to identify the
range D[sp..ep] is very fast, and about the same for all the query lengths tested.
A low occ value means this range is smaller, and so less work for the document
listing methods. Method ℓ-gram however does not benefit from the small number
of occurrences of q as it has to intersect all inverted lists for the 3-grams that
make up q, which may be long even if the resulting list is short.
Fig. 3 summarizes the time per document listed, and clearly shows that the
top-k methods (Quantile and Greedy) do more work per document listed. However,
Fig. 2 demonstrates that this is more than recouped whenever k is small, relative
to the total number of documents containing q. Table 2 shows that the average
docc is well above 10 for all pattern lengths in the current experimental setup.
Memory Use. Table 1 shows the memory use of the methods on the two
data sets. The inverted file approach, ℓ-gram, uses much less memory than the
other approaches, but must have the original text available in order to filter
out false matches and perform the final tf calculations. It is possible for the
wavelet trees in all of the other methods to be compressed, but it is also possible
to compress the text that is used (and counted) in the space requirements for
method ℓ-gram. The Sada method exhibits a higher than expected memory usage
because the protein collection has a high proportion of short documents. The
Sada method requires a csa to be constructed for each document, and in this
case is undesirable, as the csa algorithm has a high startup overhead that is
only recouped as the size of the text indexed increases.
[Figure 4: two panels, 2 word queries and 4 word queries; x-axis: methods Zet, Zet-p, Zet-io, WT, Quantile, Greedy; y-axis: Time per query (msec), 0 to 15.]
Fig. 4. Time to find word based queries using Zettair and the best of the new methods
for 2 and 4 word queries on wsj.
Term-based Search. The results up to now demonstrate that the new com-
pressed self-index based methods are capable of performing document listing
search on general patterns in memory faster than previous ℓ-gram based inverted
file approaches. However, these approaches are not directly comparable to com-
mon word-based queries at which traditional inverted indexes excel. Therefore,
we performed additional experiments to test if these approaches are capable
of outperforming a term-based inverted file. To this end we generated 44,693
additional queries aligned on English word boundaries from the wsj collection.
Statistics of these phrase queries of word length 2 to 15 are given in Appendix D.
Short of implementing an in-memory search engine, it is difficult to choose
a baseline inverted file implementation that will efficiently solve the top-k doc-
ument listing problem. Zettair is a publicly available, open source search engine
engineered for efficiency (www.seg.rmit.edu.au/zettair). In addition to the
usual bag-of-terms query processing using various ranking formulas, it readily
supports phrase queries where the terms in q must occur in order, and also imple-
ments the impact ordering scheme of Anh and Moffat [1]. As such, we employed
Zettair in three modes. Firstly, zet used the Okapi BM-25 ranking formula to
return the top 20 ranked documents for the bag-of-terms q. Secondly, zet-p used
the “phrase query” mode of Zettair to return the top 20 ranked documents which
contained the exact phrase q. Finally, we used the zet-io mode to perform a bag-
of-terms search for q using impact ordered inverted lists and the associated early
termination heuristics. Zettair was modified to ensure that all efficiency mea-
surements were done in ram, just as the self-indexing methods require. Time to
load posting lists into memory is not counted in the measurements.
Fig. 4 shows the time for searching for two- and four-word patterns. We do
not show the times for Sada, VM, and ℓ-gram, as they were significantly slower
than the new methods, as expected from Fig. 2. The Greedy and Quantile methods
used k = 20. Zet-p has better performance, on average, than Zet, and Zet-io
is the most efficient of all word-based inverted indexing methods tested. A direct
comparison between the three Zettair modes and the new algorithms is tenuous,
as Zettair implements a complete ranking, whereas the document listing methods
simply use only the tf and idf as their “ranking” method. However, Zet-io
preorders the inverted lists to maximize efficiency, removing many of the standard
calculations performed in Zet and Zet-p. This makes the computational cost of
Zet-io comparable with that of our new methods. The WT approach is surprisingly
competitive with the best inverted indexing method, Zet-io. Given the variable
efficiency of two-word queries with WT (due to the diverse number of possible
document matches for each query), it is difficult to draw definitive conclusions
on the relative algorithm performance. However, the Greedy algorithm is clearly
more efficient than Zet-io (means 0.91ms and 0.69ms, Wilcoxon test, p < 10^{−15}).
When the phrase length is increased, the two standard Zettair methods get
slower per query, as expected, because they now have to intersect more inverted
lists to produce the final ranked result. Interestingly, all other methods get faster,
as there are fewer total documents to list on average, and fewer intersections for
the impact ordered inverted file. For four word queries, all of the self-indexing
methods are clearly more efficient than the inverted file methods. Adding an idf
computation to Greedy and Quantile will not make them less efficient than WT.
6 Discussion
We have implemented document listing algorithms that, to date, had only been
theoretical proposals. We have also improved one of the approaches, and intro-
duced two new algorithms for the case where only the top-k documents sorted
by tf values are required. For general patterns, approach WT as improved in this
paper was the fastest for document listing, whereas our novel Greedy approach
was much faster for fetching the top k documents (for k < 100, at least). In the
case where the terms comprising documents and queries are fixed as words in the
English language, Greedy is capable of processing 4600 queries per second, com-
pared to the best inverted indexing method, Zet-io, which processes only 1400
on average. These results are extremely encouraging: perhaps self-index based
structures can compete efficiently with inverted files. In turn, this will remove
the restriction that IR system users must express their information needs as
terms in the language chosen by the system, rather than in a more intuitive way.
Our methods return the top-k documents in tf order, which departs from
the tf×idf framework of most information retrieval systems. However, in the
context of general pattern search, q only ever contains one term: the pattern to
be found. For one term queries, the value of idf is simply a constant multiplier to
the final ranking score, and so not useful for discriminating documents. If these
data structures are to be used for bag-of-strings search, then the idf factor may
become important, and can be easily extracted using method WT, which is still
faster than Zet-io in our experiments.
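The claim that idf cannot change the ranking for a one-term query can be checked against the scoring function of Appendix A. The short derivation below is a sketch that assumes the query contains only the pattern t and that the factor idf_t × w_{q,t} is positive:

```latex
\hat{S}(q, d_i) \;=\; \sum_{t' \in q} w_{d_i,t'} \times w_{q,t'}
              \;=\; w_{d_i,t} \times w_{q,t}
              \;=\; \mathit{tf}_{t,d_i} \times \bigl(\mathit{idf}_t \times w_{q,t}\bigr)
```

The bracketed factor does not depend on d_i, so sorting documents by the score gives the same order as sorting them by tf alone.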
References
1. V. Anh and A. Moffat. Pruned query evaluation using pre-computed impacts. In
Proc. 29th ACM SIGIR, pp. 372–379, 2006.
2. A. Apostolico. The myriad virtues of subword trees. In Combinatorial Algorithms
on Words, NATO ISI Series, pp. 85–96. Springer-Verlag, 1985.
3. R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison
Wesley, 1999.
4. M. Bender and M. Farach-Colton. The LCA problem revisited. In Proc. 4th
LATIN, LNCS 1776, pp. 88–94, 2000.
5. P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro. Compressed representations
of sequences and full-text indexes. ACM TALG, 3(2):article 20, 2007.
6. J. Fischer and V. Heun. A new succinct representation of RMQ-information and
improvements in the enhanced suffix array. In Proc. ESCAPE, pp. 459–470, 2007.
7. T. Gagie, S. Puglisi, and A. Turpin. Range quantile queries: Another virtue of
wavelet trees. In Proc. 16th SPIRE, LNCS 5721, pp. 1–6, 2009.
8. R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes.
In Proc. 14th SODA, pp. 841–850, 2003.
9. W.-K. Hon, R. Shah, and J. S. Vitter. Space-efficient framework for top-k string
retrieval problems. In Proc. FOCS, pp. 713–722, 2009.
10. U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches.
SIAM J. Computing, 22(5):935–948, 1993.
11. G. Manzini. An analysis of the Burrows-Wheeler transform. J. ACM, 48(3):407–430,
2001.
12. S. Muthukrishnan. Efficient algorithms for document retrieval problems. In Proc.
13th SODA, pp. 657–666, 2002.
13. G. Navarro and V. Mäkinen. Compressed full-text indexes. ACM Computing
Surveys, 39(1):article 2, 2007.
14. M. Persin, J. Zobel, and R. Sacks-Davis. Filtered document retrieval with
frequency-sorted indexes. JASIS, 47(10):749–764, 1996.
15. J. M. Ponte and W. B. Croft. A language modeling approach to information
retrieval. In Proc. 21st ACM SIGIR, pp. 275–281, 1998.
16. S. Puglisi, W. Smyth, and A. Turpin. Inverted files versus suffix arrays for locating
patterns in primary memory. In Proc. 13th SPIRE, pp. 122–133, 2006.
17. R. Raman, V. Raman, and S. Rao. Succinct indexable dictionaries with applications
to encoding k-ary trees and multisets. In Proc. SODA, pp. 233–242, 2002.
18. S. E. Robertson and K. S. Jones. Relevance weighting of search terms. JASIST,
27:129–146, 1976.
19. S. E. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. Okapi
at TREC-3. In D. K. Harman, editor, Proc. 3rd TREC, 1994.
20. K. Sadakane. Succinct data structures for flexible text retrieval systems. J. Discrete
Algorithms, 5(1):12–22, 2007.
21. G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing.
Comm. ACM, 18(11):613–620, 1975.
22. N. Välimäki and V. Mäkinen. Space-efficient algorithms for document retrieval.
In B. Ma and K. Zhang, editors, Proc. 18th CPM, LNCS 4580, pp. 205–215, 2007.
23. I. Witten, A. Moffat, and T. Bell. Managing Gigabytes. Morgan Kaufmann, 2nd
edition, 1999.
24. J. Zobel and A. Moffat. Inverted files for text search engines. ACM Computing
Surveys, 38(2):1–56, 2006.
A Basic IR Concepts
Relevance Ranking. Most modern search engines rank documents to help users
find relevant information. A similarity metric, $\hat{S}(q, d_i)$, is used to evaluate each
document $d_i$ relative to a user query q (often a list of terms or words). If documents
are treated as “bags of terms”, similarity can be measured using simple
statistical properties of the vocabulary [23]. Successful similarity metrics include
vector space models [21], probabilistic models [18], and language models [15].
While these approaches have different theoretical bases, their similarity scoring
functions all employ some variant of tf×idf information. The tf×idf class of
scoring functions is based on a weighting $w_{d,t}$ derived using a variation of the
formula $w_{d,t} = tf_{t,d} \times idf_t$, where $tf_{t,d}$ is the term frequency (number of times t
occurs in d) and $idf_t$ is the inverse document frequency (logarithm of the inverse
of the fraction of the documents where t appears). Typically, the top-k documents
are returned, ordered by $\hat{S}(q, d_i) = \sum_{t \in q} w_{d,t} \times w_{q,t}$, where $w_{d,t}$ and $w_{q,t}$
are the weights of the query term t in document d and query q, respectively. Using
a tf×idf metric ensures that a high term frequency in an individual document
increases the similarity contribution, while a high frequency across all documents
reduces its contribution. Hence, terms that are common in all documents do not
dominate the more discriminating terms in a query. Popular scoring functions
that make use of tf×idf information include the Vector Space Model [21] and
the Okapi BM25 metric [19]. Many others are possible [24].
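To make the weighting concrete, the following is a minimal Python sketch of tf×idf ranking over a toy collection. It assumes documents and queries are already split into terms, uses the plain formula tf × log(N / df) with no normalisation, and sets the query weight to 1 for every query term; the function name `tf_idf_rank` is illustrative and does not come from any system discussed here.

```python
import math
from collections import Counter

def tf_idf_rank(docs, query, k):
    """Return the top-k (doc_id, score) pairs under a plain tf x idf weighting."""
    n_docs = len(docs)
    tfs = [Counter(d) for d in docs]           # tf_{t,d} for every document d
    scores = [0.0] * n_docs
    for t in set(query):
        df = sum(1 for tf in tfs if t in tf)   # number of documents containing t
        if df == 0:
            continue                           # term absent from the collection
        idf = math.log(n_docs / df)            # log of the inverse document fraction
        for i, tf in enumerate(tfs):
            # w_{d,t} * w_{q,t}, with w_{q,t} = 1 (an assumption of this sketch)
            scores[i] += tf[t] * idf
    ranked = sorted(range(n_docs), key=lambda i: (-scores[i], i))
    return [(i, scores[i]) for i in ranked[:k]]
```

A term that occurs in every document receives idf = 0 here, which is the behaviour described above: terms common to all documents contribute nothing to the ranking.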
Inverted Indexes. For each term that occurs in T (the set of which forms the
vocabulary), an inverted index stores a list of documents that contain that term.
Inverted indexes have been the dominant data structure used in information
retrieval systems for over 30 years [3,23,24]. Inverted indexes require the tex-
tual units (words, q-grams, characters) of the vocabulary to be defined before
indexing commences in order to limit the size of the index and vocabulary, and
to allow tf×idf information to be precomputed and stored. The ordering of
documents within inverted lists can be altered to improve the speed of returning
the top-k documents for a query. Persin et al. [14] give different heuristics to
support top-k ranked retrieval (under the tf×idf model) when inverted lists
are sorted by decreasing tf. Anh and Moffat [1] study various generalizations of
this idea under the name “impact ordering”.
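As a rough illustration of why list ordering matters, the sketch below builds an in-memory inverted index whose postings are sorted by decreasing tf. It is a simplification under stated assumptions: terms are whatever tokens the caller supplies, nothing is compressed, and the helper names are hypothetical rather than taken from Persin et al. [14] or Anh and Moffat [1].

```python
from collections import Counter, defaultdict

def build_tf_sorted_index(docs):
    """Map each term to a postings list of (doc_id, tf), sorted by decreasing tf."""
    index = defaultdict(list)
    for doc_id, terms in enumerate(docs):
        for term, tf in Counter(terms).items():
            index[term].append((doc_id, tf))
    for postings in index.values():
        postings.sort(key=lambda p: (-p[1], p[0]))  # highest tf first, ties by doc id
    return dict(index)

def top_k_single_term(index, term, k):
    """For a one-term query, the top-k documents by tf are a prefix of the list."""
    return index.get(term, [])[:k]
```

With this ordering a one-term top-k query touches only the first k postings, which is the effect that frequency-sorted and impact-ordered lists exploit; multi-term queries need the additional heuristics studied in the cited work.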
B Pseudocodes
Fig. 5 gives pseudocode for our new methods.
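The greedy traversal can also be sketched in runnable form. The Python below is an illustration under several assumptions that are not part of the paper: the wavelet tree is pointer-based, each node stores a plain prefix-count array (`zeros`) instead of a compressed bitvector with rank support, and ranges are half-open, D[sp:ep], so that a range's length is simply ep − sp. The names `Node` and `top_k_greedy` are hypothetical.

```python
import heapq

class Node:
    """Wavelet tree node covering the document ids lo..hi (inclusive)."""
    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not seq:
            return                      # leaf, or empty subtree that is never visited
        mid = (lo + hi) // 2
        # zeros[i] = how many of seq[:i] are routed to the left child (rank0)
        self.zeros = [0]
        for d in seq:
            self.zeros.append(self.zeros[-1] + (d <= mid))
        self.left = Node([d for d in seq if d <= mid], lo, mid)
        self.right = Node([d for d in seq if d > mid], mid + 1, hi)

def top_k_greedy(root, sp, ep, k):
    """Top-k (doc, tf) pairs for D[sp:ep], reported in decreasing tf order."""
    out = []
    if sp >= ep:
        return out
    heap = [(-(ep - sp), 0, root, sp, ep)]   # max-heap via negated range length
    tie = 1                                  # unique counter: nodes are never compared
    while heap and len(out) < k:
        neg_len, _, v, s, e = heapq.heappop(heap)
        if v.lo == v.hi:                     # leaf: the range length is tf
            out.append((v.lo, -neg_len))
            continue
        s0, e0 = v.zeros[s], v.zeros[e]      # rank0 maps the range into the left child
        s1, e1 = s - s0, e - e0              # rank1 = position - rank0
        for child, cs, ce in ((v.left, s0, e0), (v.right, s1, e1)):
            if ce > cs:                      # skip empty ranges
                heapq.heappush(heap, (-(ce - cs), tie, child, cs, ce))
                tie += 1
    return out
```

For example, with D = [0, 2, 1, 2, 2, 0, 3, 2, 1, 2] and `root = Node(D, 0, 3)`, the call `top_k_greedy(root, 0, 10, 1)` returns `[(2, 5)]`. The first leaf popped is always the most frequent document, because a node's range length is an upper bound on the tf of every document below it.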
procedure top-k-greedy(sp, ep, root, k)
    Let h be an empty max-heap
    h.insert(root, [sp, ep])
    numFound ← 0
    while h not empty and numFound < k do
        (v, [sp′, ep′]) ← h.pop()
        if v is a leaf then
            output (d, tf_{q,d}) ← (v.label, ep′ − sp′ + 1)
            numFound ← numFound + 1
        else
            [s0, e0] ← [rank0(Bv, sp′), rank0(Bv, ep′)]
            [s1, e1] ← [rank1(Bv, sp′), rank1(Bv, ep′)]
            if n0 = (e0 − s0) ≠ 0 then
                h.insert(v.left, [s0, e0])
            if n1 = (e1 − s1) ≠ 0 then
                h.insert(v.right, [s1, e1])
(a)
procedure top-k-quantile(sp, ep, k)
    Let h be an empty min-heap
    m ← ep − sp + 1
    i ← m
    s ← m/2
    while h.size < k or i > h.top.tf do
        p ← s
        while p < m do
            (d, tf_{q,d}) ← quantile(D[sp..ep], p)
            if h.size < k then
                h.insert(d, tf_{q,d})
            elsif h.top.tf < tf_{q,d} then
                h.extract-min()
                h.insert(d, tf_{q,d})
            p ← p + i
        s ← s/2, i ← i/2
(b)
Fig. 5. Algorithms for computing the top-k documents by tf. The algorithm in (a)
maintains a priority queue h of (node, range) pairs, each pair having priority equal to
the length of the range component. The algorithm outputs the top-k documents as it
proceeds; the algorithm in (b) maintains a min-heap of (doc, tf) pairs keyed on tf.
At the end of (b) the heap contains the top-k document numbers and tf values. For
simplicity (b) assumes m = ep − sp + 1 is a power of 2.
C Details of ℓ-gram inverted file
It is possible to use a character-based inverted file to solve the ranked document
search problem for general strings by indexing small, overlapping units of text
of length ℓ. Then a string query q is resolved by decomposing it into its ℓ-gram
components, and treating each component as a term. With suitable modification,
a classical term-based inverted file can perform the task provided |q| ≥ ℓ.
Accordingly, we developed a block-addressing inverted file and had it index every
distinct ℓ-gram in the collection.
The text collection is concatenated and logically partitioned into ⌈n/b⌉ blocks,
each of size b. The index is comprised of two pieces. The set of distinct ℓ-grams
in the collection is held in a lexicon, which we implement with a hashtable. With
each ℓ-gram entry in the lexicon is a pointer to a postings list: a list of the blocks
in the collection that contain one or more occurrences of that ℓ-gram. The number
of items in the lexicon is bounded by σ^ℓ, the number of possible substrings of
length ℓ. However, the set of indexed ℓ-grams tends to be much sparser in
practice. For small ℓ the size of the postings lists dominates space usage. To save
space, the lists are compressed by storing the block numbers in increasing order
and encoding only the difference (gap) between adjacent items with a suitable
integer code.
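The gap-encoding step can be sketched as follows. The text does not name the integer code used, so this example assumes a simple variable-byte code (seven payload bits per byte, with the high bit marking the final byte); the function names are illustrative only.

```python
def vbyte_encode(n):
    """Encode a non-negative integer, 7 bits per byte; high bit marks the last byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)

def compress_postings(blocks):
    """Gap-encode a strictly increasing list of block numbers."""
    out, prev = bytearray(), -1
    for b in blocks:
        out += vbyte_encode(b - prev - 1)   # gaps are >= 1, so store gap - 1
        prev = b
    return bytes(out)

def decompress_postings(data):
    """Invert compress_postings, returning the original block numbers."""
    blocks, prev, n, shift = [], -1, 0, 0
    for byte in data:
        if byte & 0x80:                     # final byte of the current gap
            n |= (byte & 0x7F) << shift
            prev += n + 1
            blocks.append(prev)
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return blocks
```

Because adjacent block numbers in a long list tend to be close together, most gaps fit in a single byte, which is where the space saving comes from.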
Locating the positions of occurrence of a pattern q, |q| ≥ ℓ, is a two-phase
process. First, using the lexicon, we gather the lists of the at most |q| − ℓ + 1
distinct ℓ-grams contained in q and sort them into increasing order of length. We
then intersect the two shortest lists, to obtain a list of candidate blocks – this is
a superset of the blocks that contain the pattern. We continue to intersect the
candidate list with the remaining gathered lists until either no lists remain, or
the size of the candidate list becomes less than a threshold. At this point, for
each item (block number) in the candidate list, we scan the corresponding
text block looking for occurrences of q. The intersection phase of the algorithm
can be thought of as a filtering step: blocks that cannot contain the pattern are
eliminated from the later scanning phase. The block size, b, allows a space-time
tradeoff: increasing b makes lists smaller but makes our filter less specific and
requires us to scan larger portions of the collection.
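A simplified Python sketch of the filter-then-scan process follows. It makes assumptions the text does not state: each ℓ-gram is posted under the block in which it starts, queries satisfy ℓ ≤ |q| ≤ b, and every list is widened by one block to the left before intersecting, so that occurrences straddling a block boundary are not filtered out. The early stop at a candidate-list threshold is omitted, and the function names are hypothetical.

```python
from collections import defaultdict

def build_block_index(text, ell, b):
    """Map each ell-gram to the sorted list of blocks in which it starts."""
    lists = defaultdict(set)
    for i in range(len(text) - ell + 1):
        lists[text[i:i + ell]].add(i // b)
    return {g: sorted(s) for g, s in lists.items()}

def locate(text, index, q, ell, b):
    """Return all start positions of q in text, assuming ell <= len(q) <= b."""
    grams = {q[i:i + ell] for i in range(len(q) - ell + 1)}
    if any(g not in index for g in grams):
        return []                       # some ell-gram of q never occurs
    # An occurrence starting in block j can have later ell-grams that start in
    # block j + 1, so widen each list one block to the left (sketch assumption).
    widened = sorted(
        ({j for blk in index[g] for j in (blk, blk - 1) if j >= 0} for g in grams),
        key=len,                        # shortest lists first
    )
    candidates = set.intersection(*widened)
    hits = []
    for j in sorted(candidates):        # scanning phase, left to right
        start = j * b
        end = min(len(text), (j + 1) * b + len(q) - 1)
        pos = text.find(q, start, end)  # only matches that start inside block j
        while pos != -1:
            hits.append(pos)
            pos = text.find(q, pos + 1, end)
    return hits
```

The candidate set is a superset of the blocks that contain the pattern, so the scan is what guarantees correctness; the intersection only reduces how much text has to be read.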
During the in-block scanning phase, as we locate positions of pattern oc-
currence, these positions must be mapped to document numbers and tf values
accumulated. Clearly once all blocks have been scanned we will also know idf. To
facilitate mapping between text positions and document numbers we store the N
document boundaries in a sorted array. To find the document a given position i
is contained in we simply (binary) search for the smallest value greater than i in
this array. Because we scan text blocks left-to-right, positions of occurrence are
mapped to documents in increasing order, allowing the amount of the mapping
array searched to be reduced as block scanning proceeds: a small optimization.
The space cost for the mapping array is N log n bits, which is negligible relative to
other index components. In all experiments we use ℓ = 3 and block size b = 4096.
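The position-to-document mapping can be sketched with Python's `bisect` module. The example assumes that `boundaries[j]` holds the text position one past the end of document j, and that positions are supplied in increasing order, as they are when blocks are scanned left to right; the function names are illustrative.

```python
from bisect import bisect_right
from collections import Counter

def doc_of(boundaries, i):
    """Index of the smallest boundary greater than i, i.e. the document holding i."""
    return bisect_right(boundaries, i)

def tf_per_document(boundaries, positions):
    """Accumulate tf per document from increasing occurrence positions."""
    tf, lo = Counter(), 0
    for p in positions:
        # Never search left of the previous answer: the small optimization above.
        lo = bisect_right(boundaries, p, lo)
        tf[lo] += 1
    return tf
```

With `boundaries = [5, 9, 20]`, position 5 maps to document 1 and position 19 to document 2. The number of keys in the returned counter is the number of documents containing the pattern, which is the quantity needed for idf.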
D Query Data
Table 2 summarizes attributes of the queries used. Note the unusually low aver-
age occ and docc values for the queries of length 5 for the protein data set.
query     protein queries               trec queries
length    avg total occ  avg doc occ    avg total occ  avg doc occ
3              12,510       10,458         114,379       17,204
4                 972          919          48,123       10,446
5                 304          230          23,660        6,071
6                 723          664           9,355        4,086
7                 979          879           4,320        2,312
8                 764          761           6,365        2,000
9                 682          649           2,035        1,523
10+               535          523           3,390        2,177
Table 2. Statistical properties for the 3,600 randomly generated queries for the
protein and wsj collections.
Table 3 shows the statistical properties of the new phrase queries of varying
word length, from 2 to 15.
Num. of words  Num. of queries  avg docc  avg occ
2                   8,693         3,167     8,067
3                   5,534           373       441
4                   4,609           234       236
5                   4,036            45        45
6                   3,403            22        22
7                   3,058             6         6
8                   2,863             3         3
9                   2,556             3         3
10+                 9,941             1         1
Table 3. Statistical properties for the 44,693 randomly generated English word queries
from the wsj collection.