In-Place versus Re-Build versus Re-Merge:
Index Maintenance Strategies for Text Retrieval Systems
Nicholas Lester Justin Zobel Hugh E. Williams
School of Computer Science and Information Technology
RMIT University, GPO Box 2476V, Victoria 3001, Australia
Email: {nml,jz,hugh}@cs.rmit.edu.au
Abstract
Indexes are the key technology underpinning efficient text
search. A range of algorithms have been developed for fast
query evaluation and for index creation, but update algo-
rithms for high-performance indexes have not been evalu-
ated or even fully described. In this paper, we explore the
three main alternative strategies for index update: in-place
update, index merging, and complete re-build. Our experi-
ments with large volumes of web data show that re-merge
is the fastest approach for large numbers of updates, but
in-place update is suitable when the rate of update is low
or buffer size is limited.
1 Introduction
High-performance text indexes are key to the use of mod-
ern computers. They are used in applications ranging from
the large web-based search engines to the “find” facilities
included in popular operating systems, and from digital li-
braries to online help utilities. The past couple of decades
have seen dramatic improvements in the efficiency of
query evaluation using such indexes (Witten, Moffat &
Bell 1999, Zobel, Moffat & Ramamohanarao 1998, Sc-
holer, Williams, Yiannis & Zobel 2002). These advances
have been complemented by new methods for building in-
dexes (Heinz & Zobel 2003, Witten et al. 1999) that on
typical 2001 hardware allow creation of text databases at
a rate of around 8 gigabytes per hour, that is, a gigabyte
every 8 minutes.
In contrast, the problem of efficient maintenance of in-
verted indexes has had relatively little investigation. Yet
the problem is an important one. In some applications,
documents arrive at a high rate, and even within the con-
text of a single desktop machine a naive update strategy
may be unacceptable — a search facility in which update
costs were the system’s major consumer of CPU and disk
resources would not be of value.
In this paper we explore the three main strategies for
maintaining text indexes, focusing on addition of new doc-
uments.
To our knowledge, there has been no previous evalua-
tion of alternative update strategies.
Copyright (c) 2004, Australian Computer Society, Inc. This paper appeared
at the 27th Australasian Computer Science Conference, The University of
Otago, Dunedin, New Zealand. Conferences in Research and Practice in
Information Technology, Vol. 26. V. Estivill-Castro, Ed. Reproduction for
academic, not-for-profit purposes permitted provided this text is included.
The first strategy for update is to simply amend the in-
dex, list by list, to include information about a new docu-
ment. However, as a typical document contains hundreds
of distinct terms, such an update involves hundreds of disk
accesses, a cost that is only likely to be tolerable if the
rate of update is very low indeed. This cost can be ame-
liorated by buffering new documents; as they will share
many terms, the per-document cost of update will be re-
duced. The second strategy is to re-build the index from
scratch when a new document arrives. On a per-document
basis this approach is extremely expensive, but if new doc-
uments are buffered then overall costs may well be accept-
able. Indeed many intranet search services operate with
exactly this model, re-crawling (say) every week and re-
indexing. The third strategy is to make use of the strategies
employed in algorithms for efficient index construction.
In these algorithms, a collection is indexed by dividing
it into blocks, constructing an index for each block, and
then merging. To implement update, it is straightforward
to construct an index for a block of new documents, then
merge it with the existing index.
In all of these approaches, performance depends on
buffer size. Using large collections of web documents, we
explore how these methods compare for different index
sizes and buffer sizes. These experiments, using a version
of our open-source LUCY search engine, show that the in-
place method becomes increasingly attractive as collection
size grows. We had expected to observe that re-build was
competitive for large buffer sizes; this expectation was not
confirmed, with re-merge being substantially more effi-
cient. Our results also show that incremental update of
any kind is remarkably slow even with a large buffer; for
a large collection our best speed is about 0.1 seconds per
document, compared to roughly 0.003 seconds per docu-
ment for batch index construction using the same imple-
mentation.
Overall, given sufficient buffer space for new docu-
ments and sufficient temporary space for a copy of the
index, it is clear that re-merge is the strategy of choice.
For a typical desktop application, however, where keeping
of spare indexes may be impractical, in-place update is not
unduly expensive and provides a reasonable pragmatic al-
ternative.
2 Indexes for text retrieval
Inverted files are the only effective structure for support-
ing text search (Witten et al. 1999, Zobel et al. 1998). An
inverted index is a structure that maps from a query term,
typically a word, to a postings list that identifies the docu-
ments that contain that term. For efficiency at search time,
each postings list is stored contiguously; typical query
terms occur in 0.1%–1% of the indexed documents, and
thus list retrieval, if fragmented, would be an unaccept-
able overhead.
The set of terms that occur in the collection is known
as the vocabulary. Each postings list in a typical imple-
mentation contains the number and locations of term oc-
currences in the document, for each document in which
the term occurs. More compact alternatives are to omit
the locations, or even to omit the number of occurrences,
recording only the document identifiers. However, term
locations can be used for accurate ranking heuristics and
for resolution of advanced query types such as phrase
queries (Bahle, Williams & Zobel 2002).
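To make the structure concrete, the following sketch builds a positional in-memory inverted index of the kind described above. It is a toy illustration (the function and variable names are ours, and no compression is applied), not the LUCY on-disk format.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to a postings list of
    (doc_id, frequency, [positions]) entries."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        occurrences = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            occurrences[term].append(pos)
        for term, positions in occurrences.items():
            index[term].append((doc_id, len(positions), positions))
    return index

docs = ["the cat sat on the mat", "the dog sat"]
index = build_index(docs)
# index["sat"] == [(0, 1, [2]), (1, 1, [2])]
```

The positions stored here are what allow phrase queries and position-based ranking heuristics to be resolved.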
Entries in postings lists are typically ordered by doc-
ument number, where numbers are ordinally assigned to
documents based on the order in which they are indexed
by the construction algorithm. This is known as document
ordering and is commonly used in text retrieval systems
because it is straightforward to maintain, and additionally
yields compression benefits as discussed below. However,
ordering postings list entries by metrics other than docu-
ment number can achieve significant efficiency gains dur-
ing query evaluation (Anh & Moffat 2002, Persin, Zobel
& Sacks-Davis 1996).
Another key to efficient evaluation of text queries is in-
dex compression. Well-known integer compression tech-
niques (Golomb 1966, Elias 1975, Scholer et al. 2002) can
be applied to postings lists to significantly reduce their
size. Integer compression has been shown to reduce query
evaluation cost by orders of magnitude for indexes stored
both on disk and in memory (Scholer et al. 2002).
To realise maximal benefits from integer compression,
a variety of techniques are used to reduce the magnitude
of the numbers stored in postings lists. For example, doc-
ument ordering allows differences to be taken between
consecutive numbers, and then the differences can be en-
coded rather than the document numbers. This technique,
known as taking d-gaps, can also be applied to within-
document term occurrence information in the postings list,
for further compression gains. Golomb-coding of d-gaps,
assuming terms are distributed randomly amongst docu-
ments, yields optimal bitwise codes (Witten et al. 1999);
alternatively, byte-oriented codes allow much faster de-
compression (Anh & Moffat 2002, Scholer et al. 2002).
Compression can reduce the total index size by a fac-
tor of three to six, and decompression costs are more than
offset by the reduced disk transfer times. However, both
integer compression and taking d-gaps constrain decod-
ing of the postings lists to be performed sequentially in
the absence of additional information. This can impose
significant overheads in situations where large portions of
the postings list are not needed in query evaluation.
Techniques for decreasing the decoding costs imposed
by index compression have been proposed. Skipping
(Moffat & Zobel 1996) involves encoding information
into the postings lists that allows portions of the post-
ings list to be passed over without cost during decod-
ing. This can greatly increase the speed at which con-
junctive queries, such as Boolean AND queries, can be
processed. Non-conjunctive queries can also benefit from
this approach, by processing postings lists conjunctively
after selecting a set of candidate results disjunctively (Anh
& Moffat 1998).
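The following sketch simulates skipping on an uncompressed list; in a real index the skip entries record offsets into the compressed list, but the principle of passing over whole blocks without decoding them is the same.

```python
def build_skips(postings, k=4):
    """One skip entry (first doc number, offset) per block of k postings."""
    return [(postings[i], i) for i in range(0, len(postings), k)]

def contains(postings, skips, target, k=4):
    """Membership test that touches only the one block the skips select,
    rather than scanning the list from its start."""
    start = 0
    for first_doc, offset in skips:
        if first_doc > target:
            break
        start = offset
    return target in postings[start:start + k]

plist = [2, 5, 9, 14, 20, 31, 40, 55]
skips = build_skips(plist)          # [(2, 0), (20, 4)]
assert contains(plist, skips, 31)   # only [20, 31, 40, 55] is examined
```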
Inverted indexes are key to fast query evaluation, but
construction of the inverted index is a resource intensive
task. On 2001 hardware and using techniques described
twelve years ago (Harman & Candela 1990), the inversion
process would require around one day per gigabyte. The
latest techniques in index construction have dramatically
reduced this time, to around 8 minutes per gigabyte on the
same hardware.
The most efficient method for index construction is a
refinement of sort-based inversion (Heinz & Zobel 2003).
Sort-based inversion operates by recording a posting (a
term, ordinal document number, and occurrence informa-
tion) in temporary disk space for each
term occurrence in the collection. Once the postings for
the entire collection have been accumulated in temporary
disk space, they are sorted — typically using an external
merge-sort algorithm — to group postings for the same
term into postings lists (Harman & Candela 1990). The
postings lists then constitute an inverted index of the col-
lection. Sort-based inversion has the advantages that it
only requires one pass over the collection and can oper-
ate in a limited amount of memory, as full vocabulary ac-
cumulation is not required. Simple implementations are
impractically slow, but the strategy of creating temporary
indexes in memory, writing them as blocks, then merging
the blocks to yield the final index is highly efficient.
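A minimal sketch of this single-pass, block-merging strategy is given below. It stores uncompressed (term, document number) postings and flushes a sorted run whenever a fixed budget is reached; a production implementation would compress the runs and merge compressed blocks.

```python
import heapq, os, pickle, tempfile

def write_run(postings):
    """Write one sorted run of (term, doc_id) postings to temporary disk."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        pickle.dump(postings, f)
        return f.name

def build_runs(docs, budget=1000):
    """Index documents, flushing a sorted run whenever the in-memory
    budget is exhausted; returns the run file names."""
    runs, postings = [], []
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            postings.append((term, doc_id))
        if len(postings) >= budget:
            runs.append(write_run(sorted(postings)))
            postings = []
    if postings:
        runs.append(write_run(sorted(postings)))
    return runs

def merge_runs(run_files):
    """Merge the sorted runs into the final index: term -> [doc_ids]."""
    streams = []
    for name in run_files:
        with open(name, "rb") as f:
            streams.append(pickle.load(f))
    index = {}
    for term, doc_id in heapq.merge(*streams):
        index.setdefault(term, []).append(doc_id)
    for name in run_files:
        os.remove(name)
    return index

index = merge_runs(build_runs(["a b c", "b c d", "c d e"], budget=4))
```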
An alternative to sort-based inversion is in-memory in-
version (Witten et al. 1999), which proceeds by building
a matrix of terms in the collection in a first pass, and then
filling in document and term occurrences in a second pass.
If statistics about term occurrences are gathered during
the first pass, the exact amount of memory required to in-
vert the collection can be allocated, from disk if necessary.
Term occurrence information is written into the allocated
space in a second pass over the collection. Allocation of
space to hold postings from disk allows in-memory inver-
sion to scale to very large collection sizes. However, in-
memory inversion does have the disadvantages that it re-
quires two passes over the collection and vocabulary must
be accumulated over the entire collection.
Another alternative to construction is a hybrid sorting
approach (Moffat & Bell 1995) in which the vocabulary is
kept in memory while blocks of sorted postings are writ-
ten to disk. However, compared to the pure sort-based
approach, more memory and indexing time is required
(Heinz & Zobel 2003).
To evaluate a ranked query with an inverted index,
most text retrieval systems read the postings lists associ-
ated with the terms in the query. The lists are then pro-
cessed from least- to most-common term (Kaszkiel, Zobel
& Sacks-Davis 1999). For each document that occurs in
each postings list, a score for that document is increased
by the result of a similarity computation such as the cosine
(Witten et al. 1999) or Okapi BM-25 (Robertson, Walker,
Hancock-Beaulieu, Gull & Lau 1992) measures. The sim-
ilarity function considers factors including the length of
the document, the number of documents containing the
term, and the number of times the term occurred in the
document. Other types of query — such as Boolean or
phrase queries — can also be resolved using an inverted
index.
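The sketch below shows this accumulator pattern in a term-at-a-time form, with postings processed from least- to most-common term. A simple tf-idf weight with crude length normalisation stands in for the cosine or Okapi BM25 measures; the real similarity functions are considerably more refined.

```python
import math

def rank(query_terms, index, n_docs, doc_len, k=10):
    """Ranked query evaluation over an index mapping
    term -> [(doc_id, term_frequency)]."""
    accumulators = {}
    # least-common (shortest list) terms first
    terms = sorted(query_terms, key=lambda t: len(index.get(t, [])))
    for term in terms:
        postings = index.get(term, [])
        if not postings:
            continue
        idf = math.log(1.0 + n_docs / len(postings))
        for doc_id, tf in postings:
            contribution = tf * idf / doc_len[doc_id]
            accumulators[doc_id] = accumulators.get(doc_id, 0.0) + contribution
    return sorted(accumulators.items(), key=lambda item: -item[1])[:k]
```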
The techniques described here have been implemented
in the LUCY text search engine, written by the Search En-
gine Group at RMIT [1]. This search engine was used for
all experiments described in this paper.
[1] Available at http://www.seg.rmit.edu.au/lucy
3 Index update strategies
For text retrieval systems, the principles of index mainte-
nance — that is, of update — are straightforward. When
a document is added to the collection, the index terms
are extracted; a typical document contains several hun-
dred distinct terms that must be indexed. (It is well es-
tablished that all terms, with the exception of a small
number of common terms such as “the” and “of”, must
be indexed to provide effective retrieval (Baeza-Yates &
Ribeiro-Neto 1999, Witten et al. 1999). For phrase match-
ing to be accurate, all terms must be indexed.) For each of
these terms it is necessary to retrieve its postings list from
disk, add to the list information about the new document
and thus increase its length by a few bytes, then store the
modified list back on disk.
This simple approach to update, naively implemented,
carries unacceptable costs. On 100 gigabytes of text, the
postings list for the commonest of the indexed terms
is likely to be tens of megabytes long, and the median
list tens to hundreds of kilobytes. To complete the up-
date the system must fetch and modify a vast quantity of
data, find contiguous free space on disk for modified post-
ings lists, and garbage-collect as the index becomes frag-
mented. Therefore, the only practical solution is to amor-
tise the costs over a series of updates.
The problem of index maintenance for text data has not
been broadly investigated: there is only a little published
work on how to efficiently modify an index as new docu-
ments are accumulated or existing documents are deleted
or changed (Clarke, Cormack & Burkowski 1994, Cut-
ting & Pedersen 1990, Tomasic, Garcia-Molina & Shoens
1994). This work pre-dates the major innovations in text
representation and index construction that were described
in the previous section.
There are several possible approaches to cost amorti-
sation for index maintenance. One approach is to adapt
the techniques used for index construction. In efficient in-
dex construction techniques, a temporary index is built in
memory until space is exhausted. This temporary index is
then written to disk as a run. When all documents have
been processed, the runs are merged to give a final index.
The re-merge strategy could be used for index mainte-
nance: as new documents are processed, they are indexed
in memory, and when memory is exhausted this run of new
information could be merged with the existing index in a
single linear pass. While the index would be unavailable
for some time during the merge (tens of minutes on an
index for 100 gigabytes of text), the overall cost is much
lower than the naive approach. To avoid the system itself
being unavailable at this time, a copy of the index can be
kept in a separate file, and the new index is switched in
once the merge is complete.
Update deferral using a temporary in-memory index
can be used to improve the naive update strategy. Once
main memory is exhausted, the postings lists on disk are
individually merged with entries from the temporary in-
dex. Updating postings lists in-place still requires con-
sideration of the problems associated with space manage-
ment of postings lists, but the cost of update can be sig-
nificantly reduced by reducing the number of times that
individual postings lists have to be written to disk.
A more primitive, but still commonly used, approach
to cost amortisation is to re-build the entire index from
the stored collection. This approach has a number of dis-
advantages, including the need to store the entire collec-
tion and that the index is not available for querying during
the re-building process. Re-building is intuitively worse
than the re-merge strategy, but that does not mean that it is
unacceptable. Consider for example a typical 1-gigabyte
university web site. A re-build might take 10 minutes —
a small cost given the time needed to crawl the site and
the fact that there is no particular urgency to make updates
immediately available.
Update techniques from other areas cannot be readily
adapted to text retrieval systems. For example, there is
a wide body of literature on maintenance of data struc-
tures such as B-trees, and, in the database field, specific
research on space management for large objects such as
image data (Biliris 1992a, Biliris 1992b, Carey, DeWitt,
Richardson & Shekita 1986, Carey, DeWitt, Richardson &
Shekita 1989, Lehman & Lindsay 1989). However, these
results are difficult to apply to text indexes: they present
very different technical problems to indexes for conven-
tional databases. On the one hand, the number of terms per
document and the great length of postings lists make the
task of updating a text retrieval system much more costly
than is typically the case for conventional database sys-
tems. On the other hand, as query-to-document matching
is an approximate process — and updates do not neces-
sarily have to be instantaneous as there is no equivalent
in a text system to the concept of integrity constraint —
there are opportunities for novel solutions that would not
be considered for a conventional database system.
4 Update algorithms
Three algorithms are compared in the experiments pre-
sented in this paper. All three algorithms accumulate post-
ings in main memory as documents are added to the col-
lection. These postings can be used to resolve queries,
making new documents retrievable immediately. Once
main memory is filled with accumulated postings, the in-
dex is updated according to one of the three strategies.
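The skeleton common to all three strategies can be sketched as follows; the flush_strategy parameter is a stand-in for one of the three update procedures described below, and the memory accounting is deliberately simplistic.

```python
def maintain(index, incoming_docs, memory_budget, flush_strategy):
    """Buffer postings in memory (where they remain searchable),
    then flush with the chosen strategy once the budget is spent."""
    buffered, used = {}, 0
    for doc_id, text in incoming_docs:
        for pos, term in enumerate(text.lower().split()):
            buffered.setdefault(term, []).append((doc_id, pos))
            used += 1               # stand-in for real memory accounting
        if used >= memory_budget:
            index = flush_strategy(index, buffered)  # in-place/re-merge/re-build
            buffered, used = {}, 0
    if buffered:
        index = flush_strategy(index, buffered)
    return index
```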
In-place. The in-place algorithm updates postings lists
for each term that occurred in the new documents. The
list updates are not performed in a specific order other than
that imposed by the data structures used to accumulate the
postings. This is almost certainly not the optimal disk ac-
cess pattern, and is thus a topic for further research. Free
space for the postings lists is managed using a list of free
locations on the disk. These are ordered by disk location
so that a binary search can be used to determine whether
an existing postings list can be extended using additional
free space occurring immediately after it. A first-fit algo-
rithm is used to search for free space if a postings list has
to be moved to a new location or a new postings list must
be created. The entire algorithm is described below.
1. Postings are accumulated in main memory as docu-
ments are added to the collection.
2. Once main memory is exhausted, for each in-
memory postings list:
(a) Determine how much free space follows the
corresponding on-disk postings list.
(b) If there is sufficient free space, append the in-
memory postings list, discard it and advance to
the next in-memory postings list.
(c) Otherwise, determine a new disk location with
sufficient space to hold the on-disk and in-
memory postings lists, using a first-fit algo-
rithm.
(d) Read the on-disk postings list from its previous
location and write it to the new location.
(e) Append the in-memory postings list to the new
location.
(f) Discard the in-memory postings list and ad-
vance to the next.
Note that this algorithm requires that it is possible to ap-
pend to a postings list without first decoding it. Doing so
involves separately storing state information that describes
the end of the existing list: the last number encoded, the
number of bits consumed in the last byte, and so on. For
addition of new documents in document ordered lists, such
appending is straightforward; under other organisations of
postings lists, the entire existing list must be decoded. In
our experiments, we test both append and (in one data set)
full-decode implementations.
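The space-management core of the algorithm can be sketched as below. The index file is modelled only as a map from terms to (offset, length) extents plus a free list sorted by offset; compression, the saved append state, the actual disk reads and writes of steps (d) and (e), and coalescing of adjacent free extents are all elided, and the class and method names are ours.

```python
import bisect

class InPlaceIndex:
    """Toy model of in-place update with first-fit space management."""

    def __init__(self, size):
        self.where = {}           # term -> (offset, length) of its list
        self.free = [(0, size)]   # free (offset, size) extents, by offset

    def _take(self, extent, need):
        """Consume `need` bytes from the front of a free extent."""
        offset, size = extent
        self.free.remove(extent)
        if size > need:
            bisect.insort(self.free, (offset + need, size - need))

    def update(self, term, extra):
        """Grow term's list by `extra` bytes, relocating if necessary."""
        if term in self.where:
            offset, length = self.where[term]
            end = offset + length
            i = bisect.bisect_left(self.free, (end, 0))      # step (a)
            if i < len(self.free) and self.free[i][0] == end \
                    and self.free[i][1] >= extra:            # step (b)
                self._take(self.free[i], extra)
                self.where[term] = (offset, length + extra)
                return
            need = length + extra                # steps (c)-(e): relocate
            bisect.insort(self.free, (offset, length))   # old space freed
        else:
            need = extra                         # brand-new list
        for extent in self.free:                 # first fit
            if extent[1] >= need:
                self._take(extent, need)
                self.where[term] = (extent[0], need)
                return
        raise MemoryError("no free extent large enough")
```

The fragmentation measured in Section 6 corresponds to the space trapped in the free list of such a structure.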
Re-merge. The re-merge algorithm updates the on-disk
index by performing a merge between the on-disk postings
and the postings in main memory, writing the result to a
new disk location. This requires one complete scan of the
existing index. The on-disk postings and the in-memory
postings are both processed in ascending order, using the
hash values of the terms as the sorting key. This allows the
use of a simple merge algorithm to combine them. After
the merge is finished, the new index is substituted for the
old. In detail, this algorithm is as follows.
1. Postings are accumulated in main memory as docu-
ments are added to the collection.
2. Once main memory is exhausted, for each in-
memory postings list and on-disk postings list:
(a) If the term for the in-memory postings list has
a hash value less than the term for the on-disk
postings list, write the in-memory postings list
to the new index and advance to the next in-
memory postings list.
(b) Otherwise, if the in-memory posting term has a
hash value equal to the on-disk postings term,
write the on-disk postings list followed by the
in-memory postings list to the new index. Ad-
vance to next in-memory and on-disk postings
lists.
(c) Otherwise, write the on-disk postings list to the
new index and advance to the next on-disk post-
ings list.
3. The old index and in-memory postings are discarded,
replaced by the new index.
The re-merge algorithm processes the entire index, merg-
ing in new postings that have been accumulated in mem-
ory. This algorithm allows the index to be read efficiently,
by processing it sequentially, but forces the entire index to
be processed for each update.
If queries must be processed while maintenance is un-
der way, two copies of the index must be kept, as queries
cannot be resolved using the new index until it is complete.
The drawback is that the new index is written to a new
location, so two copies of the index exist during the merge;
however, the index can be split and processed in
chunks in order to reduce this redundancy. The benefit is
that unlike the in-place algorithm, this ensures that lists
are stored contiguously, that is, there is no fragmentation.
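Assuming both the on-disk index and the in-memory postings are available as lists sorted by term key, the merge itself is the textbook two-pointer procedure, sketched here; writing the result to its new disk location and the final index swap are elided.

```python
def re_merge(on_disk, in_memory):
    """Merge two term-key-sorted lists of (key, postings) pairs into a
    new index, concatenating postings for keys present in both."""
    new_index, i, j = [], 0, 0
    while i < len(on_disk) and j < len(in_memory):
        d_key, d_list = on_disk[i]
        m_key, m_list = in_memory[j]
        if m_key < d_key:                      # term only in memory
            new_index.append((m_key, m_list))
            j += 1
        elif m_key == d_key:                   # term in both: append new
            new_index.append((d_key, d_list + m_list))
            i += 1
            j += 1
        else:                                  # term only on disk
            new_index.append((d_key, d_list))
            i += 1
    new_index.extend(on_disk[i:])              # drain whichever remains
    new_index.extend(in_memory[j:])
    return new_index
```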
Re-build. The re-build algorithm discards the current
index after constructing an entirely new index. The new
index is built on the stored collection and the new doc-
uments added since the last update. In order to service
queries during the re-building process, a copy of the index
and the accumulated in-memory postings must be kept.
After the re-building process is finished, the in-memory
postings and old index are discarded and the new index
substituted in their place. This process is as follows.
1. Postings are accumulated in main memory as docu-
ments are added to the collection.
2. Once main memory is exhausted, a new index is built
from the current entire collection.
3. The old index and in-memory postings are discarded,
replaced by the new index.
The re-building algorithm constructs a new index from
stored sources each time that maintenance is required.
This necessitates that the entire collection be stored and
re-processed in the indexing process. Moreover, existing
postings are ignored.
Similarly to the re-merge algorithm, a separate copy
of the index must be maintained to resolve queries during
the maintenance process. In addition, as in the other ap-
proaches, postings must still be accumulated in memory
to defer index maintenance, and these must be kept until
the re-build is complete. This requirement has an impact
on the index construction process, since less main memory
is available to construct runs.
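In code, the strategy amounts to little more than re-running construction, which is what makes it simple but expensive; a minimal sketch, reusing a construction routine such as the run-merging one sketched in Section 2:

```python
def re_build(old_index, buffered, stored_docs, new_docs, build_index):
    """Ignore existing postings; index the whole stored collection plus
    the new documents from scratch. old_index and buffered must be kept
    to serve queries until construction completes."""
    new_index = build_index(stored_docs + new_docs)
    # only now may old_index and the buffered postings be discarded
    return new_index
```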
5 Experiments
Experiments were performed on a dual Pentium III
866 MHz machine, with 256 MB of main memory on a
133 MHz front-side bus, and a quad Xeon 2 GHz machine
with 2 GB of main memory on a 400 MHz front-side bus.
Two sets of experiments were run on the dual Pen-
tium III using 1 GB and 2.75 GB collections taken from
the TREC WT10g collection. TREC is a large-scale in-
ternational collaboration intended primarily for compar-
ison of text retrieval methods (Harman 1995), and pro-
vides large volumes of data to participants, allowing di-
rect comparison of research results. The WT10g collec-
tion contains around 1.7 million documents from a 1997
web crawl (Hawking, Craswell & Thistlewaite 1999); it
was used primarily as an experimental collection at TREC
in 2000 and 2001.
The first experiment with the Pentium III was to update
the index of the 1 GB collection, where an initial index
on 500 MB of data (75,366 documents) was updated with
500 MB of new data (75,368 documents). In these exper-
iments, we varied the size of the buffer used to hold new
documents, to measure the relative efficiency of the differ-
ent methods as the number of documents to be inserted in
a batch was varied.
The second experiment used the 2.75 GB collection,
where an initial index on 2.5 GB (373,763 documents)
was updated with 250 MB (39,269 documents). A smaller
number of updates was used due to time constraints; at
around a second per document in the slower cases, and
with 15 separate runs, this experiment took a week to complete.
A third experiment was run on the quad Xeon machine
using 21 GB of data, where an initial index of 20 GB
of data (3,989,496 documents) was updated with 1 GB
(192,264 documents). The data for the third experiment
was taken from the TREC WT100g collection, a superset
of the WT10g 1997 web crawl.
These experiments were chosen to explore the charac-
teristics of the three strategies when operating under dif-
ferent conditions. In particular, we explored the behaviour
of the approaches when the index fits into memory, and
contrasted this with the behaviour when the index is many
multiples of main-memory size. The different collection
sizes were chosen to explore the maintenance cost of each
of the three strategies with different amounts of data.
6 Results
The results of the timing experiments are shown in Fig-
ures 1, 2, and 3. In all experiments the machines were
under light load, that is, no other significant tasks were
accessing the disk or memory.
Figure 1 shows the results of the first experiment,
where 500 MB of data was added to an initial index on
500 MB. The in-place and re-merge strategies were run
with buffered numbers of documents ranging between 10
and 10,000. The re-build strategy was limited to buffering
numbers of documents ranging from 100 to 10,000 due to
the excessive running times required for lower numbers.[2]
The results support our intuitive assessment that the
re-building strategy is less efficient than re-merging for
all non-trivial scenarios. Both strategies outperform the
in-place update for large document buffers, which can be
attributed to the advanced index construction algorithms
that underlie their operation. However, their performance
degrades at a faster rate than in-place update with smaller
buffer sizes, to eventually become slower than in-place up-
date. This highlights that both algorithms must process
the entire index or collection on every update, even for
small updates.
The in-place variant that decodes the postings lists be-
fore updating them is also shown in Figure 1. As expected,
it is less efficient than the more optimised in-place strat-
egy in all cases. List decoding is not a significant over-
head for large document buffer sizes but, as buffer size
decreases, the per-document overhead increases and de-
coding becomes impractical.
The results of the second experiment, where 250 MB of
data was added to an initial index of 2.5 GB, are shown in
Figure 2. The re-building strategy was again always worse
than re-merging. All three strategies show comparable be-
haviour to the first experiment, but all schemes are slower
because of the processing costs associated with a larger
index. As discussed previously, the re-build and re-merge
strategies’ performance degraded faster than the in-place,
making them both slower than in-place update at a buffer
size of 100 documents. This is a larger buffer size than the
corresponding point in Figure 1, showing that the in-place
scheme has degraded less under the increased index size.
In contrast to the other two strategies, the in-place algo-
rithm only needs to process the sections of the index that
it updates. However, this does not make its performance
independent of the index size, as the sizes of the postings
lists that it manipulates grow in proportion to the index.
Figure 3 shows the results of the third experiment,
which was performed on the quad Xeon. Unfor-
tunately, a lack of time prevented us from running this
experiment with the re-build strategy. The results shown
using the re-merge and in-place strategies are consis-
tent with earlier results, with the re-merge scheme out-
performing the in-place strategy for large document buffer
sizes. These results are not directly comparable to the pre-
vious experiments, since the experiment was performed
on a different machine. However, the relative performance
of the strategies is comparable, and the point at which in-
place becomes more efficient than re-merge is higher than
in the two previous experiments. The index size in this ex-
periment is approximately ten times the size of the index
used in the second experiment, which corresponds to the
relative improvement in the efficiency of the in-place algo-
rithm. The results support the expectation that the in-place
algorithm continues to work well as index size increases.
[2] In these experiments and those discussed below, buffer sizes ranged
up to approximately 200 MB.
[Figure 1: Running times per input document for the three update strategies
(in-place with decode, in-place, re-build, re-merge) for the 1 GB collection
on the dual Pentium III, for buffer sizes of 10 to 10,000 documents; times
per document update range over roughly 0.01 to 1 seconds.]
[Figure 2: Running times per input document for the three update strategies
(in-place, re-build, re-merge) for the 2.75 GB collection on the dual Pentium
III, for buffer sizes of 100 to 10,000 documents; times per document update
range over roughly 0.1 to 10 seconds.]
[Figure 3: Running times per input document for the in-place and re-merge
strategies for the 21 GB collection on the quad Xeon, for buffer sizes of
1,000 to 100,000 documents; times per document update range over roughly
0.1 to 1 seconds.]
The fragmentation of the final index produced for the
in-place strategy in each of the experiments is shown in
Figure 4. (Note that the results from different experiments
are not directly comparable to each other, due to the differ-
ing sizes of the collections.) The figures shown are frag-
mentation as a percentage of the total space used to hold
the postings lists. The high fragmentation suggests that
space management is a significant problem in implement-
ing the in-place strategy.
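In terms of the free-list model sketched in Section 4, this percentage would be computed along the following lines, assuming the free extents are those lying inside the space occupied by the lists.

```python
def fragmentation(free_extents, total_list_space):
    """Internal free space as a percentage of total postings-list space."""
    trapped = sum(size for _offset, size in free_extents)
    return 100.0 * trapped / total_list_space
```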
To explore fragmentation of the index during mainte-
nance, the fragmentation was sampled after every update
of the on-disk postings for the 1 GB collection. Document
buffer sizes of 200, 1000 and 5000 were plotted, with the re-
sults shown in Figure 5. These results indicate that, after an
initial period in which fragmentation rises rapidly as the
initially packed postings lists are first updated, the
fragmentation remains relatively stable. Interest-
ingly, the degree of fragmentation in the index appears to
be related to the size of the updates applied to it, not to the
number of updates. The results in Figure 4 indicate that
this relationship also depends on the size of the existing
index, suggesting that the size of the updates applied to the
index relative to the size of the index may be a key factor
in the level of fragmentation.
The oscillation that can be observed in the fragmenta-
tion results is due to the movement of large postings lists.
Postings lists for frequently occurring words such as “the”
are large, and have to be updated for almost every docu-
ment added to the index. This frequent growth causes the
postings list to be moved toward the back of the index,
where previously unused space can be allocated to hold
them. Fragmentation then jumps because a large space in
the index is left in the previous position of the list. Once
at the back of the index, the large postings list can grow
without having to be moved and smaller postings lists can
be placed in its previous position. This causes fragmenta-
tion to fall, and can continue until another large postings
list needs to be relocated to the end of the index, starting
the process again.
7 Conclusions
Inverted indexes are data structures that support query-
ing in applications as diverse as web search, digital li-
braries, application help systems, and email searching.
Their structure is well-understood, and their construction
and use in querying has been an active research area for
almost fifteen years. However, despite this, there is al-
most no publicly available information on how to main-
tain inverted indexes when documents are added, changed,
or removed from a collection.
In this paper, we have investigated three strategies for
inverted index update: first, an in-place strategy, where
the existing structure is added to and fragmentation oc-
curs; second, a re-merge strategy in which new structures
are merged with the old to create a new index; and, last,
a re-build strategy that entirely re-constructs the index
at each update. We have experimented with these three
approaches using different collection sizes, and by vary-
ing the number of documents that are buffered in main-
memory before the update process.
Our results show that when reasonable numbers of
documents are buffered, the re-merge strategy is fastest.
This result is largely because the index fragments under
the in-place strategy, necessitating frequent reorganisation
of large parts of the index structure and rendering it less
efficient. However, an in-place approach is desirable if it
can be made more efficient, since it is the only strategy
in which two copies of the index are not needed during
update.
We believe that optimisation of the in-place strategy is
a promising area for future work. Unlike the re-build and
re-merge strategies — which are the product of more than
ten years of index construction research — the in-place
strategy is new and largely unoptimised. We plan to in-
vestigate how space can be pre-allocated during construc-
tion to reduce later fragmentation, what strategies work
best for choosing and managing free space, and whether
special techniques for frequently-used or large entries can
[Figure 4: Index fragmentation of the in-place strategies for all
experiments (1 GB, 1 GB with decode, 2.75 GB, and 21 GB), as a percentage
of 0-40%, against numbers of documents buffered from 10 to 1,000,000.
Note that curves for different collections are not directly comparable.]
[Figure 5: Index fragmentation of the in-place strategy on the 1 GB
collection during the maintenance process (80,000 to 160,000 documents in
the index; fragmentation 0-30%), for document buffers of 200, 1000, and
5000; black marks indicate relocation of the largest inverted list.]
reduce overall costs.
Acknowledgements
This work was supported by the Australian Research
Council. We are grateful for the comments of two anony-
mous reviewers.
References
Anh, V. N. & Moffat, A. (1998), Compressed inverted files with reduced
decoding overheads, in R. Wilkinson, B. Croft, K. van Rijsber-
gen, A. Moffat & J. Zobel, eds, “Proc. ACM-SIGIR Int. Conf. on
Research and Development in Information Retrieval”, Melbourne,
Australia, pp. 291–298.
Anh, V. N. & Moffat, A. (2002), Impact transformation: effective and
efficient web retrieval, in M. Beaulieu, R. Baeza-Yates, S. Myaeng
& K. Järvelin, eds, “Proc. ACM-SIGIR Int. Conf. on Research and
Development in Information Retrieval”, Tampere, Finland, pp. 3–
10.
Baeza-Yates, R. & Ribeiro-Neto, B. (1999), Modern Information Re-
trieval, Addison-Wesley Longman.
Bahle, D., Williams, H. E. & Zobel, J. (2002), Efficient phrase querying
with an auxiliary index, in K. Järvelin, M. Beaulieu, R. Baeza-
Yates & S. H. Myaeng, eds, “Proc. ACM-SIGIR Int. Conf. on Re-
search and Development in Information Retrieval”, Tampere, Fin-
land, pp. 215–221.
Biliris, A. (1992a), An efficient database storage structure for large dy-
namic objects, in F. Golshani, ed., “Proc. IEEE Int. Conf. on Data
Engineering”, IEEE Computer Society, Tempe, Arizona, pp. 301–
308.
Biliris, A. (1992b), The performance of three database storage structures
for managing large objects, in M. Stonebraker, ed., “Proc. ACM-
SIGMOD Int. Conf. on the Management of Data”, San Diego, Cal-
ifornia, pp. 276–285.
Carey, M. J., DeWitt, D. J., Richardson, J. E. & Shekita, E. J. (1986),
Object and file management in the EXODUS extensible database
system, in W. W. Chu, G. Gardarin, S. Ohsuga & Y. Kambayashi,
eds, “Proc. Int. Conf. on Very Large Databases”, Morgan Kauf-
mann, Kyoto, Japan, pp. 91–100.
Carey, M. J., DeWitt, D. J., Richardson, J. E. & Shekita, E. J. (1989),
Storage management for objects in EXODUS, in W. Kim & F. H.
Lochovsky, eds, “Object-Oriented Concepts, Databases, and Ap-
plications”, Addison-Wesley Longman, New York, pp. 341–369.
Clarke, C. L. A., Cormack, G. V. & Burkowski, F. J. (1994), Fast in-
verted indexes with on-line update, Technical Report CS-94-40,
Department of Computer Science, University of Waterloo, Water-
loo, Canada.
Cutting, D. R. & Pedersen, J. O. (1990), Optimizations for dynamic in-
verted index maintenance, in J.-L. Vidick, ed., “Proc. ACM-SIGIR
Int. Conf. on Research and Development in Information Retrieval”,
ACM, Brussels, Belgium, pp. 405–411.
Elias, P. (1975), “Universal codeword sets and representations of the in-
tegers”, IEEE Transactions on Information Theory IT-21(2), 194–
203.
Golomb, S. W. (1966), “Run-length encodings”, IEEE Transactions on
Information Theory IT-12(3), 399–401.
Harman, D. (1995), “Overview of the second text retrieval conference
(TREC-2)”, Information Processing & Management 31(3), 271–
289.
Harman, D. & Candela, G. (1990), “Retrieving records from a gigabyte
of text on a minicomputer using statistical ranking”, Jour. of the
American Society for Information Science 41(8), 581–589.
Hawking, D., Craswell, N. & Thistlewaite, P. (1999), Overview of
TREC-7 very large collection track, in E. M. Voorhees & D. K.
Harman, eds, “The Eighth Text REtrieval Conference (TREC-8)”,
National Institute of Standards and Technology Special Publication
500-246, Gaithersburg, MD, pp. 91–104.
Heinz, S. & Zobel, J. (2003), “Efficient single-pass index construction
for text databases”, Jour. of the American Society for Information
Science and Technology 54(8), 713–729.
Kaszkiel, M., Zobel, J. & Sacks-Davis, R. (1999), “Efficient passage
ranking for document databases”, ACM Transactions on Informa-
tion Systems 17(4), 406–439.
Lehman, T. J. & Lindsay, B. G. (1989), The Starburst long field manager,
in P. M. G. Apers & G. Wiederhold, eds, “Proc. Int. Conf. on Very
Large Databases”, Amsterdam, The Netherlands, pp. 375–383.
Moffat, A. & Bell, T. A. H. (1995), “In situ generation of compressed
inverted files”, Journal of the American Society of Information Sci-
ence 46(7), 537–550.
Moffat, A. & Zobel, J. (1996), “Self-indexing inverted files for fast text
retrieval”, ACM Transactions on Information Systems 14(4), 349–
379.
Persin, M., Zobel, J. & Sacks-Davis, R. (1996), “Filtered document re-
trieval with frequency-sorted indexes”, Jour. of the American Soci-
ety for Information Science 47(10), 749–764.
Robertson, S. E., Walker, S., Hancock-Beaulieu, M., Gull, A. & Lau, M.
(1992), Okapi at TREC, in “Proc. Text Retrieval Conf. (TREC)”,
pp. 21–30.
Scholer, F., Williams, H. E., Yiannis, J. & Zobel, J. (2002), Compres-
sion of inverted indexes for fast query evaluation, in K. Järvelin,
M. Beaulieu, R. Baeza-Yates & S. H. Myaeng, eds, “Proc. ACM-
SIGIR Int. Conf. on Research and Development in Information Re-
trieval”, Tampere, Finland, pp. 222–229.
Tomasic, A., Garcia-Molina, H. & Shoens, K. (1994), Incremental up-
dates of inverted lists for text document retrieval, in “Proc. ACM-
SIGMOD Int. Conf. on the Management of Data”, ACM, Min-
neapolis, Minnesota, pp. 289–300.
Witten, I. H., Moffat, A. & Bell, T. C. (1999), Managing Gigabytes:
Compressing and Indexing Documents and Images, second edn,
Morgan Kaufmann, San Francisco, California.
Zobel, J., Moffat, A. & Ramamohanarao, K. (1998), “Inverted files
versus signature files for text indexing”, ACM Transactions on
Database Systems 23(4), 453–490.