Searching with Autocompletion:
An Indexing Scheme with Provably Fast Response Time
Holger Bast1, Christian W. Mortensen2, and Ingmar Weber1
1Max-Planck-Institut f¨ur Informatik, Saarbr¨ucken, Germany
2IT University of Copenhagen, Denmark
Abstract. We study the following autocompletion problem, which is at the core of a
new full-text search technology that we have developed over the last year. The problem
is, for a given document collection, to precompute a data structure using as little space as
possible such that queries of the following kind can be processed as quickly as possible:
given a range of words and an arbitrary set of documents, compute the set of those words
from the given range that occur in at least one of the given documents, as well as
the subset of the given documents that contain at least one of these words. With a
standard inverted index, one inverted list has to be processed for each word from the
given range. We propose a new indexing scheme that without using more space than an
inverted index has a guaranteed query processing time that is independent of the size of
the given word range. Experiments on real-world data confirm our theoretical analysis
and show the practicability of our new scheme.
1 Introduction
Autocompletion, in its most basic form, is the following mechanism: the user types the first few letters of
some word, and either by pressing a dedicated key (traditionally the tabulator key) or automatically after each
key stroke a procedure is invoked that displays all words from some precompiled list that are continuations
of the typed sequence.
The precompiled word list depends on the application. In a Unix shell, it is by default the list of all files
in all directories listed in the PATH environment variable. In an editor like Vim, it is the list of all words that
appear somewhere in the edited document(s). In a Windows Help file, it is the list of all words that are used
somewhere in the help text. In the recently launched Google Suggest service [11], it is an extract of frequent
queries from Google’s query log.
Algorithmically, this basic form of autocompletion is easy: it requires two simple searches (one for each
of the two endpoints of the range of words starting with the typed-in sequence) in a sorted list of strings, and
an ordinary binary search will be more than fast enough even for millions of strings [4].
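As a concrete illustration (our own sketch, not code from the paper), the two binary searches for the endpoints of the word range can be written with Python's standard bisect module; the word list and the "\xff" sentinel are illustrative choices:

```python
import bisect

def completions(sorted_words, prefix):
    """Return all words in a sorted list that start with the given prefix.

    Two binary searches find the endpoints of the matching range:
    the first word >= prefix, and the first word >= an upper bound
    for all continuations of prefix.
    """
    lo = bisect.bisect_left(sorted_words, prefix)
    # "\xff" sorts after every character we expect in these words, so
    # prefix + "\xff" is an upper bound for all continuations of prefix
    # (an assumption that holds for plain ASCII word lists).
    hi = bisect.bisect_left(sorted_words, prefix + "\xff")
    return sorted_words[lo:hi]

words = sorted(["algebra", "algebraic", "algorithm", "algorithms", "alpha"])
print(completions(words, "alg"))  # all continuations of "alg"
```

Each search is O(log m) for m words, which is why this basic form of autocompletion remains fast even for millions of strings.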
We here consider a context-sensitive version of this mechanism used in full-text search. This is a new
technology that we have developed over the last year [3]. Think of a Google-like search interface where we
have already typed one or more query words. When beginning to type the next word, the new autocompletion
mechanism is such that only those completions of the partially typed last word are displayed which,
together with the previous part of the query, would actually lead to a hit. At the same time, a selection of
these hits is displayed.
For example, if we have already typed sympos, and then start to type alg, we would get only those
completions of alg which actually occur (and let us say close in this case) together with words starting with
sympos. That is, we would get a relatively short list of words like algorithms or algebraic but none
of the large number of other words starting with alg, which occur somewhere in the collection but nowhere
close to a word starting with sympos.
We have in fact engineered a complete web-service around this feature, with several instances up and
running [3]. The example from the previous paragraph can be checked out at http://search.mpi-inf.mpg.de, where the complete English Wikipedia is indexed. Type sympos..alg slowly, letter by letter, watching the changing lists of completions and hits. The two dots indicate the desired proximity
of the words. Typing ? at any point in the query will provide some quick help.
The subject and contribution of this paper is the formulation of the main algorithmic problem underlying
this web service, and the design and analysis of an efficient indexing and query processing scheme. More
specifically, our goal is to provide the sketched autocompletion functionality with a guaranteed short re-
sponse time for each and every query. As we will see, this is a challenging problem, and it cannot be solved
efficiently with existing indexing techniques.
1.1 Formal problem definition and main result
The autocompletion problem we investigate in this paper is, given a collection of documents, to build a data
structure using as little space as possible such that the following autocompletion queries can be processed
as quickly as possible:
Definition: An autocompletion query is given by a range of words W (all possible completions of the last
word which the user has started typing), and a set of documents D (the hits for the preceding part of the
query). To process the query means to compute the subset W′ ⊆ W of words that occur in at least one
document from D as well as the subset D′ ⊆ D of documents that contain at least one of these words. A
threshold value T may be specified, in which case it suffices to compute, instead of W′, a subset W″ ⊆ W′
of size min{T, |W′|}. In other words, if W′ contains more than T words, it suffices to compute any T
of these. Thresholded autocompletion queries contain the unthresholded ones as a special case by setting
T = ∞.
Two points should be very clear about this definition.
First, note that the process of typing a query (letter by letter) corresponds to a chain of autocompletion
queries according to the definition above. Namely, the set W is always readily obtained from a sorted word
list; in fact, this is just an instance of the basic form of autocompletion, which we described at the beginning
of the introduction and which can be dealt with by a straightforward binary search. The set D, on the other
hand, is simply the set of all documents as long as the user is typing the first query word, and for any further
word it is just the output set D′ of the instance solved when the last letter of the previous word was typed.
Second, note that, from the point of view of the user, it is good enough to process autocompletion queries
with a constant threshold. If there are few (say up to 30) completions that would lead to a hit, a user would
certainly like to see all of them, in order to check which of them make sense with regard to what he or she is
looking for. For many completions (say more than 30), however, in practice no more than a small selection
can and will be visually scanned anyway.
Our main result is as follows. The number N of word-in-document pairs is just the number of distinct
words in each document, summed over all documents. The restrictions on N and W will always be met
in practice and help us to keep the formulas simple and focused on the main performance aspects at this
point. Details on the space-time tradeoff and the exact dependencies on W and the threshold are provided in
Section 5, Lemmas 7 and 8. In our conclusions, we will also briefly comment on the I/O-complexity of our
new scheme.
Theorem 1. Given a collection with n documents, m distinct words, and N ≥ 16m word-in-document
pairs, there is a data structure and query processing scheme TREE+++ with the following properties:
(a) The data structure can be constructed in O(N) time.
(b) It uses at most N log n bits of space (which is the space used by an ordinary inverted index).
(c) Autocompletion queries with word range W of size O(mn/N), document set D, and a constant threshold, can be processed in time O(|D| log(mn/N)).
1.2 The BASIC scheme and outline for the rest of the paper
To clarify the achievement of Theorem 1 above, it will be instructive to first take a closer look at the
straightforward solution to our autocompletion problem, which we will refer to as BASIC. It is based on
the standard indexing data structure from information retrieval, the so-called inverted index [18], for which
we simply precompute for each word from the collection the list of documents containing that word. For a
query-efficient processing, these lists are typically sorted. Here we assume a sorting by document number
in ascending order. With such an inverted index, an (unthresholded) autocompletion query given by a word
range W and a set of documents D can be processed as follows.
1. For each word w ∈ W, fetch the list D_w of documents that contain w and compute the intersection
D ∩ D_w. For the set W′ of actual completions, report all words for which this intersection is non-empty.
2. Compute the subset D′ of documents from D that contain at least one word from W as the union of the
non-empty intersections D ∩ D_w computed in step 1.
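The two steps above can be sketched as follows; this is our own minimal Python rendering of scheme BASIC with hypothetical toy documents, not the Perl implementation used in the experiments:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: list of word lists; returns word -> sorted list of doc ids."""
    index = defaultdict(list)
    for doc_id, words in enumerate(docs):
        for w in sorted(set(words)):
            index[w].append(doc_id)  # doc ids appended in ascending order
    return index

def basic_query(index, word_range, D):
    """Scheme BASIC: word_range is the set of candidate words W,
    D the given set of doc ids.  Returns (W', D')."""
    W_prime, D_prime = [], set()
    for w in word_range:
        hits = D.intersection(index.get(w, []))  # D ∩ D_w
        if hits:                                 # w is an actual completion
            W_prime.append(w)
            D_prime |= hits
    return W_prime, D_prime

docs = [["algorithm", "symposium"], ["algebra"], ["symposium", "algebraic"]]
index = build_inverted_index(docs)
print(basic_query(index, ["algebra", "algebraic", "algorithm"], {0, 2}))
```

Note that the loop touches one inverted list per word of W, which is exactly the cost that Lemma 1 below makes precise.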
Lemma 1. Scheme BASIC uses time at least Ω(Σ_{w∈W} min{|D|, |D_w|}) to process a query. The inverted
lists can be stored using a total of at most N · ⌈log₂ n⌉ bits, where n is the total number of documents, and
N is the total number of word-in-document pairs.
Proof. In step 1, one intersection is computed for each w ∈ W, and any algorithm for intersecting D and D_w
has to differentiate between 2^min{|D|,|D_w|} possible outputs. For the space usage, it suffices to observe that the
elements of the inverted lists are just a rearrangement of the sets of distinct words from all documents, and
that each document number can be encoded with ⌈log₂ n⌉ bits (for compression issues, see our conclusions). □
Lemma 1 points out the inherent problem of the BASIC scheme: its query processing time depends
on W, and can be on the order of |W| · |D| in the worst case. In particular, no scheme along the lines
of BASIC (and this includes standard string search indexing approaches; see the following Section 1.3)
can process thresholded autocompletion queries any faster than unthresholded ones, because in general the
inverted list D_w of each and every word from W has to be inspected to produce the correct W″ and D′. For
our web service [3], this would mean tangible delays (seconds) for certain queries, which for an interactive
application is very undesirable.
In the following sections, we develop a new indexing scheme that, without using more space than an
inverted index, enables a query processing time independent of W, as stated in Theorem 1.³ Four main ideas
will lead us to this new scheme: a tree over the words (Section 2), relative bit vectors (Section 3), pushing
up the words (Section 4), and dividing into blocks (Section 5).
In all these sections, we will consider unthresholded autocompletion queries. At the end of Section 5 we
will comment on the minor modifications required to make our scheme work for thresholded autocompletion
queries and, in particular, on how to thus obtain the result stated in Theorem 1. We have opted to use much of
the available space for intuitive explanations and examples of the various data structures. Their space and
time bounds are concisely stated in formal lemmas, the proofs of all of which can be found in the Appendix.
In Section 6, we will complement our theoretical findings with experiments on real-world data.
1.3 Related work
To the best of our knowledge, the autocompletion problem, as we have defined it above based on the re-
quirements of our web service, has not been explicitly studied in the literature.
We have already seen, in the previous subsection, that for bounds independent of W, we cannot just
first compute the set of all matches in W, and then check which of these matches actually occur in documents from D. In particular, this excludes the use of the many indexing schemes for one-dimensional string
matching [10,14,12].
On the other hand, our problem is not as hard as the various multi-dimensional indexing problems, where
a given query-tuple has to be matched against a collection of tuples. The difference is that for an autocompletion query with multiple words (and their number may be arbitrary), we already have the information
about the set of documents matching the part of the query before the last word, because we computed it
when this part had been typed. Indeed, none of the state-of-the-art multi-dimensional indexing schemes that
we are aware of can achieve a provably fast query processing time with a space consumption on the order
of N [9,8,1]. As we will point out, however, our data structure shows some interesting analogies to the
geometric range-search data structures from [5] and [15].
The large body of work on string searching concerned with data structures such as PAT/suffix trees/arrays
can be seen as orthogonal to the problem we are discussing here. Namely, in the context of our autocompletion problem these data structures would serve to get from mere prefix search to full substring search. For
example, our Theorem 1 could be enhanced to full substring search (find all words from a given subset of
documents containing a given substring and all documents from that subset containing such words) by first
building a suffix data structure like that of [7], and then building our data structure on top of the sorted list
of all suffixes (instead of on top of the list of all words).
³ We remark that there is little hope that one can also remove the dependency on D, since D is an arbitrary subset of documents,
while the set of possible values for W, which are ranges, is much more constrained.
2 Building a tree over the words (TREE)
This section is about TREE, our first scheme on the way to Theorem 1, and the idea behind it is to increase
the amount of preprocessing by precomputing inverted lists not only for words but also for their prefixes.
More precisely, we construct a complete binary tree with m leaves, where m is the number of distinct words
in the collection. We assume here and throughout the paper that m is a power of two. For each node v of the
tree, we then precompute the list D_v of documents which contain at least one word from the subtree of that
node, and as for the inverted index, we sort this list by ascending document number. The lists of the leaves
are then exactly the lists of an ordinary inverted index, and the list of an inner node is exactly the union of
the lists of its two children. The list of the root node is exactly the set of all non-empty documents. A simple
example is given in Figure 1.
Fig. 1. Toy example for the data structure of scheme TREE with 10 documents and 4 different words.
Given this tree data structure, an (unthresholded) autocompletion query given by a word range W and a set
of documents D is then processed as follows.
1. Compute the unique minimal sequence v_1, ..., v_l of nodes with the property that their subtrees cover
exactly the range of words W. Process these l nodes from left to right, and for each node v invoke the
following procedure.
2. Fetch the list D_v of v and compute the intersection D ∩ D_v. If the intersection is empty, do nothing. If
the intersection is non-empty, then if v is a leaf, report the corresponding word, otherwise invoke this
procedure (step 2) recursively for each of the two children of v.
3. Compute D′ as the union of the D ∩ D_v, for the nodes v computed in step 1.
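A compact way to sketch TREE is a segment-tree-style recursion that prunes a subtree as soon as its intersection with D is empty. The following Python rendering is our own illustration (it fuses steps 1 and 2 into one recursion over word-id ranges rather than first materializing the node sequence v_1, ..., v_l, and stores each D_v as a set rather than a sorted list):

```python
class WordTree:
    """Complete binary tree over a sorted word list (scheme TREE).

    Node v stores D_v, the set of documents containing at least one
    word from v's subtree.  m is assumed to be a power of two.
    """
    def __init__(self, words, docs):
        self.words = words
        m = len(words)
        word_id = {w: i for i, w in enumerate(words)}
        # Heap layout: node 1 is the root, leaves are m .. 2m-1.
        self.D = [set() for _ in range(2 * m)]
        for doc_id, doc_words in enumerate(docs):
            for w in set(doc_words):
                v = m + word_id[w]
                while v >= 1:            # propagate doc id to all ancestors
                    self.D[v].add(doc_id)
                    v //= 2
        self.m = m

    def query(self, lo, hi, D):
        """Report words with ids in [lo, hi) occurring in a document of D."""
        W_prime = []
        def visit(v, left, right):
            if right <= lo or hi <= left or not (D & self.D[v]):
                return  # subtree outside the word range, or empty intersection
            if right - left == 1:
                W_prime.append(self.words[left])
                return
            mid = (left + right) // 2
            visit(2 * v, left, mid)
            visit(2 * v + 1, mid, right)
        visit(1, 0, self.m)
        return W_prime

words = ["algebra", "algebraic", "algorithm", "symposium"]
docs = [["algorithm", "symposium"], ["algebra"], ["symposium", "algebraic"]]
tree = WordTree(words, docs)
print(tree.query(0, 3, {0, 2}))  # completions of "alg" among docs 0 and 2
```

The pruning in the first line of visit() is exactly the time saving discussed next: one empty intersection rules out an entire subtree of potential completions.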
Scheme TREE can save us time for the following reason. If the intersection computed at an inner node v
in step 2 is empty, we know that none of the words in the whole subtree of v is a completion leading to a
hit, that is, with a single intersection we are able to rule out a large number of potential completions. This
does not work in the other direction, however: if the intersection at v is non-empty, we know nothing more
than that there is at least one word in the subtree which will lead to a hit, and we will have to examine both
children recursively. The following lemma shows the potential of TREE to make the query processing time
depend on W′ instead of on W as for BASIC. Since TREE is just a step on the way to our final scheme,
we do not bother to give the exact query processing time here but just the number of nodes visited, because
we need exactly this information in the next section.
Lemma 2. When processing a query with TREE, at most 2(|W′| + 1) log₂ |W| nodes are visited.
The price TREE pays in terms of space is large. In an extreme worst case, each level of the tree would
use just as much space as the inverted index stored at the leaf level, which would give a blow-up factor of
log₂ m. For the collections we consider in Section 6, the actual blowup factor would be about 6.
3 Relative Bitvectors (TREE+BITVEC)
In this section, we describe and analyze TREE+BITVEC, which reduces the space usage of algorithm TREE
from the last section, while maintaining as much as possible its potential for a query processing time depending on W′ instead of on W. The basic trick will be to store the inverted lists via bit vectors, more
specifically, via relative bit vectors. The resulting data structure turns out to have similarities with the static
2-dimensional orthogonal range counting structure of Chazelle [5].
In the root node, the list of all non-empty documents is stored as a bit vector in the obvious way: when n
is the number of documents, there are n consecutive bits, the ith bit corresponds to document number
i, and the bit is set to 1 if and only if that document contains at least one word from the subtree of the node.
In the case of the root node this means that the ith bit is 1 if and only if document number i contains any
word at all.
Now consider any one child v of the root node, and with it store a vector of n′ bits, where n′ is the
number of 1-bits in the parent's bit vector. To make it interesting already at this point in the tree, assume that
indeed some documents are empty, so that not all bits of the parent's bit vector are set to one, and n′ < n.
Now the jth bit of v corresponds to the jth 1-bit of its parent, which in turn corresponds to a document
number i_j. We then set the jth bit of v to 1 if and only if document number i_j contains a word in the subtree
of v.
The same principle is now used for every node v that is not the root: the jth bit of the bit vector of v
corresponds to that document to which the jth 1-bit of the parent of v corresponds, and that bit is set to 1 if
and only if that document contains a word from the subtree of v. Constructing these bit vectors is relatively
straightforward; it will be part of the construction given in Appendix B.
Fig. 2. The data structure of TREE+BITVEC for the toy collection from Figure 1.
Lemma 3. Let s_TREE denote the total length of the inverted lists of algorithm TREE. The total number of
bits used in the bit vectors of algorithm TREE+BITVEC is then at most 2·s_TREE plus the number of empty
documents (which cost a 0-bit in the root each).
The procedure for processing a query with TREE+BITVEC is, in principle, the same as the one we gave for TREE
in the previous section (before Lemma 2). The only difference comes from the fact that the bit vectors, except
that of the root, cannot be interpreted in isolation but only relative to their respective parents.
We deal with this as follows. We ensure that whenever we visit a node v, we have the set I_v of those
positions of the bit vector stored at v that correspond to documents from the given set D, as well as the |I_v|
numbers of those documents. For the root node, this is trivial to compute. For any other node v, I_v can be
computed from its parent u as follows: for each i ∈ I_u check if the ith bit of u is set to 1; if so, compute
the number of 1-bits at positions less than i, add this number to the set I_v, and store by it the number of
the document from D that was stored by i. With this enhancement, we can follow the same steps as in the
procedure for TREE, except that we have to ensure now that whenever we visit a node that is not the root,
we have visited its parent before. The lemma below shows that we have to visit an additional number of up
to 2 log₂ m nodes because of this.
We also observe here that we can compute the output set D′ of documents containing at least one word
from W′ on the fly as follows. We associate with each element from the given D a single bit, initialized to
zero. When we visit a node v and use I_v for random accesses to v's bit vector as described above, then for
each i ∈ I_v for which the ith bit of v is set to 1, we set the bit of the document in D to which i points to 1. It
is not hard to see that the subset of elements of D for which eventually a 1-bit is set is exactly D′. Since W is
fully covered by the v_1, ..., v_l (see step 1 of the query processing procedure in Section 2), it suffices to do
this for the nodes v_1, ..., v_l only.
Lemma 4. When processing a query with TREE+BITVEC, at most 2(|W′| + 1) log₂ |W| + 2 log₂ m nodes
are visited.
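The translation of positions from a parent's bit vector into a child's relative bit vector is essentially a rank computation: position i in the parent maps to the number of 1-bits strictly before i. The following Python sketch is our own illustration, using a plain prefix-rank array rather than a succinct rank structure:

```python
def child_positions(parent_bits, I_parent):
    """Translate positions in a parent's bit vector into positions in a
    child's relative bit vector (as in TREE+BITVEC).

    The child has one bit per 1-bit of the parent, so a parent position i
    that carries a 1-bit maps to rank1(parent, i), the number of 1-bits
    strictly before position i.  Positions carrying a 0-bit have no
    counterpart in the child and are dropped.
    """
    # Prefix ranks: rank[i] = number of 1-bits in parent_bits[:i].
    rank, r = [], 0
    for b in parent_bits:
        rank.append(r)
        r += b
    return [rank[i] for i in I_parent if parent_bits[i] == 1]

# Parent vector over 6 documents; documents 0, 2, 3, 5 are "live".
parent = [1, 0, 1, 1, 0, 1]
# Suppose the query's documents sit at parent positions 0, 2, and 4.
print(child_positions(parent, [0, 2, 4]))  # position 4 carries a 0-bit: dropped
```

In the actual scheme the rank information would of course not be recomputed per query; the sketch only shows how I_v is derived from I_u.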
4 Pushing Up the Words (TREE+BITVEC+PUSHUP)
The scheme TREE+BITVEC+PUSHUP presented in this section gets rid of the log₂ |W| factor in query
processing time from Lemma 4. The idea is to modify the TREE+BITVEC data structure such that whenever
the intersection at a node is non-empty, we can produce some part of W′. For that we store by each single
1-bit, which is an indicator of the fact that a particular document contains a word from a particular range,
one word from that document and that range. We do this in such a way that each word is stored in only one
place for each document in which it occurs. When there is only one document, this leads to a data structure
that is similar to the priority search tree of McCreight, which was designed to solve the so-called 3-sided
dynamic orthogonal range-reporting problem in two dimensions [15].
Let us start with the root node. Each 1-bit of the bit vector of the root node corresponds to a non-empty document, and we store by that 1-bit the lexicographically smallest word occurring in that document.
Actually, we will not store the word but rather its number, where we assume that we have numbered the
words from 0..m−1. More than that, for all nodes at depth i (i.e., i edges away from the root), we will
omit the leading i bits of its word number, because for a fixed node these are all identical and can be easily
computed from the position of the node in the tree.
Now consider any one child v of the root node, which has exactly one half H of all words in its subtree.
The bit vector of v will still have one bit for each 1-bit of its parent node, but the definition of a 1-bit of
v is slightly different now from that for TREE+BITVEC. Consider the jth bit of the bit vector of v, which
corresponds to the jth set bit of the root node, which corresponds to some document number i_j. Then this
document contains at least one word (otherwise the jth bit in the root node would not have been set),
and the number of the lexicographically smallest word contained is stored by that jth bit. Now, only if document
i_j contains other words, and at least one of these other words is contained in H, is the jth bit of the
bit vector of v set to 1, and we store by that 1-bit the lexicographically smallest word contained in that
document that has not already been stored in one of its ancestors (in this case only the root node).
Note the difference to the TREE+BITVEC scheme: if v was the left child of the root, and document i_j
had only one word from H, then in algorithm TREE+BITVEC+PUSHUP this word is stored by the root
node, and the corresponding bit of node v would be 0. In algorithm TREE+BITVEC, however, that bit would
be 1. In particular, in algorithm TREE+BITVEC+PUSHUP there would be no more bits corresponding
to document i_j below node v, while in algorithm TREE+BITVEC there would be a bit corresponding to
document i_j along the whole path from the root to the leaf corresponding to the single word from H contained
in document i_j.
Fig. 3. The data structure of TREE+BITVEC+PUSHUP for the example collection from Figure 1. The large bit vector in each node
encodes the inverted list. The words stored by the 1-bits of that vector are shown in grey on top of the vector. The word list actually
stored is shown below the vector, where A=00, B=01, C=10, D=11, and for each node the common prefix is removed, e.g., for the
node marked C-D, C is encoded by 0 and D is encoded by 1. A total of 49 bits is used, not counting the redundant 000 vectors and
bookkeeping information like list lengths etc.
Figure 3 explains this data structure by a simple example. The construction of the data structure is relatively
straightforward and can be done in time O(N). Details are given in Appendix B.
The query processing is very similar to the procedure we gave for TREE+BITVEC. We visit nodes in
such an order, starting from the root, that whenever we visit a node v, we have the set I_v of exactly those
positions in the bit vector of v that correspond to elements from D (and for each i ∈ I_v we know its
corresponding element in D). For each such position with a 1-bit, we now check whether the word stored by
that 1-bit is in W, and if so output it. It is not hard to see that this processing of a node v can be implemented
by random lookups into the bit vector in time O(|I_v|). See Appendix C for details.
It is also not hard to see that each word thus reported is indeed an element of W′ and that no word
from W′ will be missed. However, the same word may now be reported several times, up to once for each
document in which it occurs. This can be dealt with in several ways, for example, by initializing a bit vector
of size |W| to all zeroes and setting a bit to one whenever the corresponding word is to be reported. We finally
remark that the set D′ can be computed on the fly as described for TREE+BITVEC in Section 3.
Lemma 5. With TREE+BITVEC+PUSHUP, an unthresholded autocompletion query can be processed in
time O(|D| log₂ m + Σ_{w∈W′} |D ∩ D_w|). In the special case where W is the range of all words, the processing time is bounded by O(|D| + Σ_{w∈W′} |D ∩ D_w|). For thresholded queries, the same bounds hold with
W′ replaced by W″.
Lemma 6. The bit vectors of TREE+BITVEC+PUSHUP require a total of at most 2N + 2n bits. The
(truncated) numbers of the words stored by the 1-bits require a total of at most N(2 + log₂(nm/N)) bits.
5 Divide into Blocks (TREE+BITVEC+PUSHUP+BLOCKS)
This section is our last station on the way to our main result, Theorem 1, and its goal is to bring down the
log₂ m factor in the time bound of Lemma 5. According to our experiments, this factor really hurts for large
collections.
A very simple idea does it. We divide the set of all words into blocks of equal size B, where 1 ≤ B ≤ m,
and construct the data structure according to TREE+BITVEC+PUSHUP for each block separately. An
(unthresholded) autocompletion query given by a word range W and a set of documents D is then processed
in the following three steps.
1. Determine the set of l (consecutive) blocks which contain at least one word from W, and for i =
1, ..., l, compute the subrange W_i of W that falls into block i. Note that W = W_1 ∪̇ ··· ∪̇ W_l.
2. For i = 1, ..., l, process the query given by W_i and D according to TREE+BITVEC+PUSHUP, resulting in a set of hits D′_i ⊆ D and a set of completions W′_i ⊆ W_i.
3. Compute the union of the sets of completions W′_1 ∪̇ ··· ∪̇ W′_l (a simple concatenation). Compute the
union of the hit sets D′_1 ∪ ··· ∪ D′_l on the fly during step 2, as described for TREE+BITVEC and
TREE+BITVEC+PUSHUP before.
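Step 1, splitting the word range into its per-block subranges W_i, can be sketched as follows; the half-open word-id range and the block size in the example are hypothetical:

```python
def split_into_blocks(w_lo, w_hi, B):
    """Split the word-id range [w_lo, w_hi) into subranges W_1, ..., W_l,
    one per block of B consecutive word ids (scheme ...+BLOCKS).
    """
    subranges = []
    lo = w_lo
    while lo < w_hi:
        block_end = (lo // B + 1) * B      # first word id of the next block
        hi = min(block_end, w_hi)
        subranges.append((lo, hi))
        lo = hi
    return subranges

# A word range spanning parts of three blocks of size 8.
print(split_into_blocks(5, 21, 8))
```

Only the first and last subrange can be proper parts of a block; all middle blocks are covered completely, which is what makes the |W|/B term in Lemma 7 below an upper bound on the number of blocks touched.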
Lemma 7. With TREE+BITVEC+PUSHUP+BLOCKS and block size B, an unthresholded autocompletion
query can be processed in time O(|D|(log₂ B + |W|/B) + Σ_{w∈W′} |D ∩ D_w|). For a thresholded query, the
same bound holds with W′ replaced by W″.
Lemma 8. TREE+BITVEC+PUSHUP+BLOCKS with block size B requires at most 2N + n·⌈m/B⌉ bits
for its bit vectors and at most N·⌈log₂ B⌉ bits for the word numbers stored by the 1-bits. For B ≥ mn/N,
this adds up to at most N(3 + ⌈log₂ B⌉) bits.
Parts (b) and (c) of Theorem 1 now follow directly from Lemmas 7 and 8, by choosing B to be nm/N. This
choice of B minimizes the space bound of Lemma 8. For |W| = O(nm/N), the |W|/B term of Lemma 7
is then a constant. Note that N/n is the average number of distinct words per document, so that the condition
|W| ≤ C·nm/N rules out only very large word ranges. In our web service, this condition is maintained
by enabling the autocompletion functionality only for prefixes of a certain minimal length (say 2). Part (a)
of Theorem 1 is established by the construction given in Appendix B. This finishes the proof of our main
result.
6 Experiments
We tested the scheme TREE+BITVEC+PUSHUP+BLOCKS with the (rounded) space-optimal block size
B = 2^⌈log₂(nm/N)⌉ on two document collections. In this section, we will refer to this scheme as TREE+++.
Collection    n        m          N/n     B       Q      BASIC  TREE+++
HOMEOPATHY    33,250   216,596    272.6   1,024   7,168  16.0   11.6
WIKIPEDIA     441,465  1,456,349  140.6   16,384  6,497  19.0   15.2

Table 1. The characteristics of our two test collections: n = number of documents, m = number of distinct words, N/n = average
number of distinct words in a document, B = space-optimal choice for the block size, and Q = number of queries. The last two
columns give the space usage of BASIC and TREE+++ in bits per word-in-document pair.
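As a small sanity check (our own, not from the paper), the block sizes B in Table 1 can be reproduced from the other columns via the formula B = 2^⌈log₂(nm/N)⌉, with N recovered as n · (N/n):

```python
import math

def block_size(n, m, avg_distinct_words):
    """Rounded space-optimal block size B = 2^ceil(log2(nm/N)),
    where N = n * avg_distinct_words, so nm/N = m / avg_distinct_words."""
    N = n * avg_distinct_words
    return 2 ** math.ceil(math.log2(n * m / N))

# Figures taken from Table 1.
print(block_size(33_250, 216_596, 272.6))     # HOMEOPATHY
print(block_size(441_465, 1_456_349, 140.6))  # WIKIPEDIA
```

Both computed values match the B column of the table (1,024 and 16,384).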
The HOMEOPATHY collection consists of HTML documents on homeopathic medicine: 15 encyclopedic
works (in English language) plus over 25,000 mails from a practitioner’s mailing list (in German language).
The distinguishing feature of this collection is that we have actual queries for it, extracted from the query
log of an installation of our web service [3]. Figure 4 gives an impression of the extracted queries.
WIKIPEDIA is an order of magnitude larger and consists of an HTML dump of all pages from http://en.wikipedia.org (English language) and http://de.wikipedia.org (German language)
from November 2004. For this collection, queries were generated synthetically by randomly picking three
words from a random document. Prefixes of this 3-word string were then used, giving rise to several one-,
two-, and three-word queries. Their characteristics are similar to those of the real queries for HOMEOPATHY.
The principal advantage of TREE+++ over BASIC is that its worst-case running time does not depend
on W, while the processing time of BASIC grows with the size of the given range. Thus, when |D| is not too
large, that is, when the user has already chosen some discriminative, focused query terms, TREE+++ is expected
to outperform BASIC.
Fig. 4. Histograms of three characteristic values of the 7168 queries to our HOMEOPATHY collection. (a): the number |W′| of
completions leading to a hit (shown for the window [1,100]; the histogram for the window [100,1000] is very similar). (b): the size
|W| of the given word range (shown, again representatively, for the window [1,100]). (c): the length of the last prefix.
For our experiments, BASIC and TREE+++ were implemented in Perl (our web service is implemented in
C++, however not yet with the full-fledged TREE+++). Since Perl incurs large unpredictable overheads in
the actual running times, we opted to compare the following, more objective operation counts: for each inter-
section of lists X and Y of BASIC, we charged min{|X|,|Y|} · (1 + log(max{|X|,|Y|}/min{|X|,|Y|})).
This corresponds to the worst-case running time of the best known algorithms for intersecting two sorted lists
[2, 6, 13]. In the case where one list consists of all documents, we charged only the length of the shorter
list. For TREE+++, we charged for each visited node v the size of the set I_v of indices into the bit vector of
v. For both schemes, we charged only for the last prefix of each query, because the first part of a multi-word
query appears as a separate query, and the resulting document hits and completions are cached by the system
and do not need to be recomputed.
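For concreteness, the charged cost and one intersection routine with this worst-case flavor can be sketched as follows. This is our own Python, not the Perl used in the experiments; the adaptive algorithms of [2, 6, 13] are more refined than this simple binary-search variant.

```python
import bisect
import math

def charge(x_len, y_len):
    """Operation count charged for intersecting sorted lists of the
    given lengths: min * (1 + log2(max / min))."""
    small, large = sorted((x_len, y_len))
    if small == 0:
        return 0.0
    return small * (1 + math.log2(large / small))

def intersect(xs, ys):
    """Intersect two sorted lists by binary-searching each element of
    the shorter list in the longer one: O(min * log(max)) comparisons."""
    if len(xs) > len(ys):
        xs, ys = ys, xs
    out = []
    for x in xs:
        i = bisect.bisect_left(ys, x)
        if i < len(ys) and ys[i] == x:
            out.append(x)
    return out
```

Note that when the lists have equal length the charge degenerates to min{|X|,|Y|}, i.e. a plain merge.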
The results from Figure 5 conform nicely to our theoretical analysis. Three main observations can be
made: (1) For a significant fraction of the queries, BASIC takes a very long time (Figure 5(b)). This is due
to the dependency of its running time on |W|, which is typically very large when only few letters of the last
query word have been typed. (2) The maximal query processing time for TREE+++ is a factor of 4 below that
for BASIC (Figure 5(b)). (3) For a considerable fraction of the queries, BASIC is very fast (Figure 5(a)).
This is the case when |W| is very small or when we are dealing with a single prefix as a query.
Fig. 5. Histogram of the operation counts of TREE+++ and BASIC for all HOMEOPATHY queries: (a) distribution of the counts
≤ 2·10^5; (b) distribution of the counts ≥ 2·10^5.
It should be clear that TREE+++ cannot outperform BASIC for small |W|: in the extreme case, when |W| =
1, we are asking for the inverted list of a single word, which BASIC simply has there precomputed, while
TREE+++ first needs to assemble it from a logarithmic number of lists. Similarly, for queries consisting
of only one prefix, BASIC only has to compute the union of the corresponding inverted lists. On the other
hand, for TREE+++ this corresponds to the worst case, as |D| = n. In fact, the second peak at about 50,000
for the TREE+++ operation count, which appears in Figure 5(a), is solely due to these queries. Table 2 further
demonstrates this dependence on |D| for the two test collections. The potential gain of TREE+++, when
the user chooses discriminative terms, can be clearly seen in the last column of this table: having already
'zoomed in' to 1% of the collection, the operation count for TREE+++ is, for both collections, more than one
order of magnitude smaller than that for BASIC.
Collection    Method    1 ≤ |D|      1 ≤ |D| < n   |D| = n      1 ≤ |D| < n/100
HOMEOPATHY    BASIC     3.43·10^4    1.06·10^5     2.96·10^3    5.38·10^4
              TREE+++   4.19·10^4    1.16·10^4     5.51·10^4    1.17·10^3
WIKIPEDIA     BASIC     7.71·10^4    1.09·10^5     2.70·10^4    3.05·10^4
              TREE+++   3.35·10^5    7.64·10^4     7.36·10^5    2.33·10^3

Table 2. Average operation counts of TREE+++ and BASIC for various restrictions on |D|.
TREE+++ can be seen as a careful reorganisation of a standard inverted index that sacrifices the immediate
availability of inverted lists for single words in exchange for the ability to quickly process arbitrary
autocompletion queries.
7 Conclusions
We have introduced an interesting new range-searching problem that is at the heart of our new full-text
search technology [3]. We have designed a new indexing scheme and provided complementary theoretical
and experimental evidence that we indeed achieve fast query processing times for arbitrary queries with a
reasonable space usage.
While our main result is stated for the RAM model of computation, our analysis also yields a non-trivial
O(n·log(mn/N)/B) bound on the number of disk I/Os for arbitrary autocompletion queries with a constant
threshold, where B now is the number of bytes fetched in a single I/O operation [17]. Compared to our
bound from Theorem 1, an n has taken the place of |D| here. We pose it as an open problem to improve the
above bound to O(n) or even below. Note that the I/O complexity of BASIC is Θ(∑_{w∈W} (|D| + |D_w|)/B),
which is Θ(N/B) in the worst case.
The bits and numbers in our index data structures are uncompressed, and we compare their space usage
with that of an ordinary uncompressed inverted index. It is known that compression not only significantly
reduces space usage but also query processing times, namely when the data structures are residing on disk
and the processing times are dominated by the time required for fetching lists into main memory [16, 18].
We deem it an interesting and challenging research problem to devise a compression scheme for our
approach, and then compare its performance with that of a state-of-the-art compressed ordinary inverted
index.
References
1. S. Alstrup, G. S. Brodal, and T. Rauhe. New data structures for orthogonal range searching. In IEEE Symposium on Foundations
of Computer Science, pages 198–207, 2000.
2. R. Baeza-Yates. A fast set intersection algorithm for sorted sequences. Lecture Notes in Computer Science, 3109:400–408,
2004.
3. H. Bast, T. Warken, and I. Weber. Searching with autocompletion: a new interactive web search technology. Demo instances
are accessible under http://search.mpi-inf.mpg.de (Wikipedia), http://www.homeonet.org (Homeopathy), http://www.mpi-
inf.mpg.de (MPII webpages).
4. J. L. Bentley and R. Sedgewick. Fast algorithms for sorting and searching strings. In ACM-SIAM Symposium on Discrete
Algorithms, pages 360–369, 1997.
5. B. Chazelle. A functional approach to data structures and its use in multidimensional searching. SIAM Journal on Computing,
17(3):427–462, 1988.
6. E. D. Demaine, A. Lopez-Ortiz, and J. I. Munro. Adaptive set intersections, unions, and differences. In ACM-SIAM Symposium
on Discrete Algorithms, pages 743–752, 2000.
7. P. Ferragina and R. Grossi. The string B-tree: a new data structure for string search in external memory and its applications.
Journal of the ACM, 46(2):236–280, 1999.
8. P. Ferragina, N. Koudas, S. Muthukrishnan, and D. Srivastava. Two-dimensional substring indexing. Journal of Computer and
System Sciences, 66(4):763–774, 2003.
9. V. Gaede and O. Günther. Multidimensional access methods. ACM Computing Surveys, 30(2):170–231, 1998.
10. G. H. Gonnet, R. A. Baeza-Yates, and T. Snider. New indices for text: PAT Trees and PAT arrays. Prentice-Hall, Inc., 1992.
11. Google suggest beta service, November 2004. http://www.google.com/webhp?complete=1&hl=en.
12. R. Grossi and J. S. Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching
(extended abstract). In ACM Symposium on the Theory of Computing, pages 397–406, 2000.
13. F. K. Hwang and S. Lin. A simple algorithm for merging two disjoint linearly ordered sets. SIAM Journal on Computing,
1(1):31–39, 1972.
14. U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948,
1993.
15. E. M. McCreight. Priority search trees. SIAM Journal on Computing, 14(2):257–276, 1985.
16. F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. In ACM SIGIR
Conference on Research and Development in Information Retrieval, pages 222–229, 2002.
17. J. S. Vitter and E. A. M. Shriver. Optimal disk I/O with parallel block transfer. In ACM Symposium on Theory of Computing,
pages 159–169, 1990.
18. I. H. Witten, T. C. Bell, and A. Moffat. Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edition.
Morgan Kaufmann, 1999.
A Proofs of Lemmas 2, 3, 4, 5, 6, 7, and 8
For convenience, we restate each lemma, as given in the corresponding section of the paper, before its proof.
Lemma 2. When processing a query with TREE, at most 2(|W′| + 1)·log₂|W| nodes are visited.
Proof. A node at height h has at most 2^h leaves below it. So each of the nodes v_1, …, v_l has height at most
⌊log₂|W|⌋. Further, no three nodes from v_1, …, v_l have identical height, which implies that l ≤ 2⌊log₂|W|⌋.
Similarly, for each word in W′ we need to visit at most two additional nodes at each height below ⌊log₂|W|⌋.
□
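The decomposition of a word range W into the maximal covering subtrees v_1, …, v_l is the standard segment-tree range decomposition; it can be sketched as follows (our own Python, with nodes represented by their leaf intervals):

```python
def cover(lo, hi, node_lo, node_hi, out):
    """Collect the canonical nodes (maximal subtrees) of a complete
    binary tree over leaves [node_lo, node_hi] whose leaves exactly
    cover the query range [lo, hi]; this is the set v_1, ..., v_l."""
    if hi < node_lo or node_hi < lo:
        return                      # disjoint from the query range
    if lo <= node_lo and node_hi <= hi:
        out.append((node_lo, node_hi))  # fully contained: canonical node
        return
    mid = (node_lo + node_hi) // 2
    cover(lo, hi, node_lo, mid, out)
    cover(lo, hi, mid + 1, node_hi, out)
```

For example, over 16 leaves the range [3, 12] decomposes into (3,3), (4,7), (8,11), (12,12): at most two canonical nodes per height, as the proof argues.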
Lemma 3. Let s_tree denote the total lengths of the inverted lists of algorithm TREE. The total number of
bits used in the bit vectors of algorithm TREE+BITVEC is then at most 2·s_tree plus the number of empty
documents (which each cost a 0-bit in the root).
Proof. The lemma is a consequence of two simple observations. The first observation is that wherever there
was a document number in an inverted list of algorithm TREE, there is now a 1-bit in the bit vector of the
same node, and this correspondence is 1–1. The total number of 1-bits is therefore s_tree.
The second observation is that if a node v that is not the root has a bit corresponding to some document
number i, then the parent node also has a bit corresponding to that same document, and that bit of the parent
is set to 1, since otherwise node v would not have a bit corresponding to that document.
It follows that the nodes which have a bit corresponding to a particular fixed document form a subtree
that is not necessarily complete but where each inner node has degree 2, and where 0-bits can only occur at
a leaf. The total number of 0-bits pertaining to a fixed document is hence at most the total number of 1-bits
for that same document plus one. Since for each document we have as many 1-bits at the leaves as there are
words in the document, the same statement holds without the plus one (assuming that the average number
N/n of distinct words per document is at least 1, which it always will be in practice).
□
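The counting argument can be checked on small instances with the following sketch (our own Python, not part of the paper): a node has one bit per document that has a bit at its parent (every document has a bit at the root), and the bit is 1 iff the document contains a word in the node's subtree.

```python
def count_bits(docs, lo, hi):
    """Return (ones, zeros) summed over the subtree covering words
    [lo, hi]. `docs` are the documents that have a bit at this node,
    each given as a set of word ids."""
    present = [d for d in docs if any(lo <= w <= hi for w in d)]
    ones, zeros = len(present), len(docs) - len(present)
    if lo < hi:  # only documents with a 1-bit here get bits in the children
        mid = (lo + hi) // 2
        for child_lo, child_hi in ((lo, mid), (mid + 1, hi)):
            o, z = count_bits(present, child_lo, child_hi)
            ones, zeros = ones + o, zeros + z
    return ones, zeros
```

Here the total number of 1-bits equals s_tree, and one can verify on examples that the total number of bits is at most 2·s_tree plus the number of empty documents.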
Lemma 4. When processing a query with TREE+BITVEC, at most 2(|W′| + 1)·log₂|W| + 2·log₂ m nodes
are visited.
Proof. By Lemma 2, at most 2(|W′| + 1)·log₂|W| nodes are visited in the subtrees of the nodes v_1, …, v_l
that cover W. It therefore remains to bound the total number of nodes contained in the paths from the root
to these nodes v_1, …, v_l.
First consider the special case where W starts with the leftmost leaf and extends to somewhere in the
middle of the tree. Then each of v_1, …, v_l is a left child of one node on the path from the root to v_l. The
total number of nodes contained in the l paths from the root to each of v_1, …, v_l is then at most d − 1, where
d is the depth of the tree. The same argument goes through for the symmetric case when the range ends with
the rightmost leaf.
In the general case, where W begins at some intermediate leaf and ends at some other intermediate leaf,
there is a node u such that the leftmost leaf of the range is contained in the left subtree of u and the rightmost
leaf of the range is contained in the right subtree of u. By the argument from the previous paragraph, the
paths from u to those nodes from v_1, …, v_l lying in the left subtree of u then contain at most d_u − 1 different
nodes, where d_u is the depth of the subtree rooted at u. The same bound holds for the paths from u to the
other nodes from v_1, …, v_l, lying in the right subtree of u. Adding the length of the path from the root to u,
this gives a total number of at most 2d − 3. □
Lemma 5. With TREE+BITVEC+PUSHUP, an unthresholded autocompletion query can be processed in
time O(|D|·log₂ m + ∑_{w∈W′} |D ∩ D_w|). In the special case where W is the range of all words, the pro-
cessing time is bounded by O(|D| + ∑_{w∈W′} |D ∩ D_w|). For thresholded queries, the same bounds hold with
W′ replaced by W″.
Proof. As we noticed above, the query processing time spent in any particular node v can be made linear
in the number of bits inspected via the index set I_v. Recall that each i ∈ I_v corresponds to some document
from D. Then, for reasons identical to those that led to the space bound of Lemma 3, for any fixed document
d ∈ D, the set of all visited nodes v which have an index in their I_v corresponding to d form a binary tree,
and only for the leaves of that tree can it happen that the index points to a 0-bit, so that the number of these
0-bits is at most the number of 1-bits plus one.
Let again v_1, …, v_l denote the at most 2·log₂ m nodes covering the given word range W (see Section
2). Observe that, by the time we reach the first node from v_1, …, v_l, the index set I_v will only contain
indices from D′, as all the 1-bits for these nodes correspond to a word in W′. Strictly speaking, this is only
guaranteed after the intersection with this node, which accounts for an additional |D| in the total cost. Thus,
each distinct word w we find in at least one of the nodes can correspond to at most |D ∩ D_w| 1-bits met in
intersections with the bit vectors of other nodes in the set, and each 1-bit leads to at most two 0-bits met in
intersections. Summing over all w ∈ W′ gives the second term in the equation of the lemma. Note that the
same argument holds if we only want a subset of words W″ ⊆ W′.
The remaining nodes that we visit are all ancestors of one of v_1, …, v_l, and we have already shown
in the proof of Lemma 4 that their number is at most 2·log₂ m. Since the processing time for a node is always
bounded by O(|D|), the fraction of the query processing time spent in ancestors of v_1, …, v_l is bounded by
O(|D|·log₂ m). For the remark, observe that if W is the range of all words, then the root node alone covers
that whole range, and all the query processing time is spent in the part already analyzed above, except that
we have to add the number of 0-bits in the root node, which is at most |D|.
□
Lemma 6. The bit vectors of TREE+BITVEC+PUSHUP require a total of at most 2N + 2n bits. The
(truncated) numbers of the words stored by the 1-bits require a total of at most N·(2 + log₂(nm/N)) bits.
Proof. Just as for TREE+BITVEC, each 1-bit can be associated with the occurrence of a particular word
in a particular document, and that correspondence is 1–1. This proves that the total number of 1-bits is
exactly N, and since word numbers are stored only by 1-bits and there is indeed one word number stored
by each 1-bit, the total number of word numbers stored is also N. By the same argument as in Lemma 3,
the number of 0-bits is at most the number of 1-bits, plus 1 for each document, plus the number of 0-bits in
the root node.
For a fixed document, the number of bits used to store the truncated numbers of its l distinct word
numbers is maximal when these numbers are stored as high up in the tree as possible, that is, in nodes of
depth at most ⌊log₂ l⌋. Using the nice formula ∑_{i=1}^{l} i·2^i = (l−1)·2^{l+1} + 2, it can be shown that in this
worst case the number of bits is at most l·(2 + log₂(m/l)) (details omitted). A simple Lagrangian argument
shows that the sum of these bounds over all documents is maximal when each document contains exactly
N/n words, which leads to the bound stated in the lemma.
□
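The closed form used in the proof can be verified mechanically, e.g.:

```python
def lhs(l):
    # sum_{i=1}^{l} i * 2^i
    return sum(i * 2**i for i in range(1, l + 1))

def rhs(l):
    # (l - 1) * 2^(l+1) + 2
    return (l - 1) * 2**(l + 1) + 2

# the identity holds for all l >= 1; check a range of values
assert all(lhs(l) == rhs(l) for l in range(1, 30))
```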
Lemma 7. With TREE+BITVEC+PUSHUP+BLOCKS and block size B, an unthresholded autocompletion
query can be processed in time O(|D|·(log₂ B + |W|/B) + ∑_{w∈W′} |D ∩ D_w|). For a thresholded query, the
same bound holds with W′ replaced by W″.
Proof. Since each block contains at most B words, according to Lemma 5 we need time at most
O(|D|·log₂ B + ∑_{w∈W′_i} |D ∩ D_w|) for block i. (For thresholded autocompletion queries we can limit the
sum to w ∈ W″_i.)
However, for all but at most two blocks (the first and the last) it holds that all words of the block are in
W, so that according to the remark following Lemma 5, the query processing time for each of the at most
|W|/B inner blocks is actually O(|D| + ∑_{w∈W′_i} |D ∩ D_w|) (where, again, for the thresholded case the range
of the sum reduces to w ∈ W″_i). Summing these up gives us the bound claimed in the lemma. If the given
word range W is entirely contained within a single block, the lemma follows directly from Lemma 5. □
Lemma 8. TREE+BITVEC+PUSHUP+BLOCKS with block size B requires at most 2N + n·⌈m/B⌉ bits
for its bit vectors and at most N·⌈log₂ B⌉ bits for the word numbers stored by the 1-bits. For B ≥ mn/N,
this adds up to at most N·(3 + ⌈log₂ B⌉) bits.
Proof. To count the number of bits in the inverted lists, we can use the same argument as for algorithm
TREE+BITVEC+PUSHUP: there is exactly one 1-bit for each word-in-document occurrence. The total
number of 0-bits is exactly one more than the total number of 1-bits, plus the number of 0-bits in the bit
vectors of the roots of the trees. The latter can be bounded by n·⌈m/B⌉, since there are ⌈m/B⌉ blocks and
n documents.
To encode a particular word within a block, ⌈log₂ B⌉ bits are obviously sufficient. This adds up to a
total of at most N·⌈log₂ B⌉ bits. Further space will be saved by sparing the i prefix bits of a word number
(within a block) on level i of the tree. However, for B ≥ mn/N, this saving does not give an asymptotic
improvement over N·log₂ B. □
B The index construction for TREE+BITVEC+PUSHUP
The construction of the tree for algorithm TREE+BITVEC+PUSHUP is relatively straightforward: we can
process the documents one by one, in order of ascending document number, and for each document process
its words one by one, in order of ascending word number, and place each word at its proper place in the tree.
According to the remarks in the following description, the processing of each word can be implemented in
constant amortized time. In particular, this construction is therefore significantly faster than the construction
of an ordinary inverted index, for which, either implicitly or explicitly, a sorting problem of the order of N
needs to be solved [18].
1. Process the documents in order of ascending document numbers, and for each document d do the fol-
lowing.
2. Process the distinct words in document d in order of ascending word number, and for each word w do
the following. Maintain a current node, which we initialize as an artificial parent of the root node.
3. If the current node does not contain w in its subtree, then set the current node to its parent, until it does
contain w in its subtree. For each node left behind in this process, append a 0-bit to the bit vector of
those of its children which have not been visited.
Note: for a particular word, this operation may take non-constant time, but once we go from a node to
its parent in this step, the old node will never be visited again. Since we only visit nodes by which a
word will be stored, and such nodes are visited at most three times, this gives constant amortized time
for this step.
4. Set the current node to that one child which contains w in its subtree. Store the word w by this node.
Add a 1-bit to the bit vector of that node.
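Steps 1-4 can be sketched in Python as follows. This is our own illustrative code, not the authors' implementation: bit vectors are kept as plain lists of (document, bit, stored word) triples rather than packed bits, the remaining path is flushed explicitly at the end of each document, and all names are assumptions.

```python
def build_pushup(docs, m):
    """Sketch of the TREE+BITVEC+PUSHUP construction (steps 1-4) over
    a complete binary tree for m words, m a power of two. Each node
    (lo, hi) gets a list of (doc, bit, word-or-None) entries."""
    root = (0, m - 1)
    bits = {}

    def children(node):
        lo, hi = node
        mid = (lo + hi) // 2
        return ((lo, mid), (mid + 1, hi)) if lo < hi else ()

    def contains(node, w):
        return node is None or node[0] <= w <= node[1]

    def leave(node, visited, d):
        # step 3: a node we move away from appends a 0-bit to each of
        # its not-yet-visited children
        for c in children(node):
            if c not in visited:
                bits.setdefault(c, []).append((d, 0, None))

    for d, words in enumerate(docs):          # step 1
        path = [None]                         # artificial parent of the root
        visited = set()
        for w in sorted(set(words)):          # step 2
            while not contains(path[-1], w):  # step 3: ascend
                leave(path.pop(), visited, d)
            node = path[-1]                   # step 4: descend one level
            child = root if node is None else next(
                c for c in children(node) if contains(c, w))
            visited.add(child)
            bits.setdefault(child, []).append((d, 1, w))
            path.append(child)
        while len(path) > 1:                  # flush at end of document
            leave(path.pop(), visited, d)
    return bits
```

On a small instance one can check the defining property: each word occurrence produces exactly one 1-bit, stored at the highest not-yet-used node on the path to its leaf.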
C Query processing in TREE+BITVEC+PUSHUP
First, it is easy to intersect D with the documents in the root node, because we can simply look up the
document numbers in the bit vector at the root. Consider then a child v of the root. What we want to do is to
compute a new set S_v of document indices, which gives the numbering of the document indices of D in terms
of the numbering used in v. This is possible if we store in v an additional array whose entry i holds
the number of 1-bits in the bit vector of v up to index i·α, where α ≥ 1 is a parameter. For one such
entry we need log₂ n bits. S_v can then be computed in constant time per entry of D if for each entry we
look up this array and scan through at most α bits of the original vector. If α is on the order of the number
of bits in a machine word, it is reasonable to assume that this can be done in constant time. We can then
continue down the tree in a similar way. The total space usage for the document lists is thereby increased by a
factor of 1 + (log₂ n)/α.
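The sampled array described above is the standard rank-directory trick; a minimal sketch (our own naming) follows, with one precomputed count every α positions:

```python
class RankedBitVector:
    """Bit vector with a rank directory sampled every `alpha` positions:
    samples[j] = number of 1-bits in bits[0 : j*alpha]. A rank query
    then scans at most `alpha` bits of the vector."""
    def __init__(self, bits, alpha=64):
        self.bits = list(bits)
        self.alpha = alpha
        self.samples = [0]
        ones = 0
        for i, b in enumerate(self.bits):
            ones += b
            if (i + 1) % alpha == 0:
                self.samples.append(ones)

    def rank1(self, i):
        """Number of 1-bits in bits[0:i]."""
        j = i // self.alpha
        return self.samples[j] + sum(self.bits[j * self.alpha:i])
```

With α on the order of the machine-word width, the trailing scan is a constant number of word operations, matching the constant-time claim above, and the directory adds (log₂ n)/α bits per bit of the vector.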