Compression of Inverted Indexes For Fast Query
Evaluation
Falk Scholer Hugh E. Williams John Yiannis Justin Zobel
School of Computer Science and Information Technology
RMIT University, GPO Box 2476V
Melbourne, Australia, 3001.
{fscholer,hugh,jyiannis,jz}@cs.rmit.edu.au
ABSTRACT
Compression reduces both the size of indexes and the time
needed to evaluate queries. In this paper, we revisit the
compression of inverted lists of document postings that store
the position and frequency of indexed terms, considering two
approaches to improving retrieval efficiency: better imple-
mentation and better choice of integer compression schemes.
First, we propose several simple optimisations to well-known
integer compression schemes, and show experimentally that
these lead to significant reductions in time. Second, we ex-
plore the impact of choice of compression scheme on retrieval
efficiency.
In experiments on large collections of data, we show two
surprising results: use of simple byte-aligned codes halves
the query evaluation time compared to the most compact
Golomb-Rice bitwise compression schemes; and, even when
an index fits entirely in memory, byte-aligned codes result
in faster query evaluation than does an uncompressed index,
emphasising that the cost of transferring data from memory
to the CPU cache is less for an appropriately compressed in-
dex than for an uncompressed index. Moreover, byte-aligned
schemes have only a modest space overhead: the most com-
pact schemes result in indexes that are around 10% of the
size of the collection, while a byte-aligned scheme is around
13%. We conclude that fast byte-aligned codes should be
used to store integers in inverted lists.
Categories and Subject Descriptors
H.3.4 [Information Storage and Retrieval]: Systems
and Software—Performance evaluation (efficiency and effec-
tiveness); E.4 [Data]: Coding and Information Theory—
Data compaction and compression
General Terms
Performance, Algorithms, Design
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGIR’02, August 11-15, 2002, Tampere, Finland.
Copyright 2002 ACM 1-58113-561-0/02/0008 ...$5.00.
Keywords
Inverted indexes, retrieval efficiency, index compression, in-
teger coding
1. INTRODUCTION
Search engines have demanding performance requirements.
Users expect fast answers to queries, many queries must be
processed per second, and the quantity of data that must
be searched in response to each query is staggering. The
demands continue to grow: the Google search engine, for
example, indexed around one billion documents a year ago
and now manages more than double that figure.1 Moreover,
the increasing availability and affordability of large storage
devices suggests that the amount of data stored online will
continue to grow.
Inverted indexes are used to evaluate queries in all prac-
tical search engines [14]. Compression of these indexes has
three major benefits for performance. First, a compressed
index requires less storage space. Second, compressed data
makes better use of the available communication bandwidth;
more information can be transferred per second than when
the data is uncompressed. For fast decompression schemes,
the total time cost of transferring compressed data and sub-
sequently decompressing is potentially much less than the
cost of transferring uncompressed data. Third, compression
increases the likelihood that the part of the index required
to evaluate a query is already cached in memory, thus en-
tirely avoiding a disk access. Thus index compression can
reduce costs in retrieval systems.
We have found that an uncompressed inverted index that
stores the location of the indexed words in web documents
typically consumes more than 30% of the space required
to store the uncompressed collection of documents. (Web
documents often include a great deal of information that is
not indexed, such as HTML tags; in the TREC web data,
which we use in our experiments, on average around half
of each document is indexable text.) When the index is
compressed, the index size is reduced to between 10% and 15%
of that required to store the uncompressed collection; this
size includes document numbers, in-document frequencies,
and word positions within documents. If the index is too
large to fit entirely within main memory, then querying the
uncompressed index is slower: as we show later, it is up to
twice as slow as the fastest compressed scheme.
In this paper, we revisit compression schemes for the in-
1 See http://www.google.com/
verted list component of inverted indexes. We also propose
a new method for decoding lists. There have been a great
many reports of experiments on compression of indexes with
bitwise compression schemes [6, 8, 12, 14, 15], which use an
integral number of bits to represent each integer, usually
with no restriction on the alignment of the integers to byte
or machine-word boundaries. We consider several aspects
of these schemes: how to decode bitwise representations of
integers efficiently; how to minimise the operations required
for the most compact scheme, Golomb coding; and the rela-
tive performance of Elias gamma coding, Elias delta coding,
Golomb coding, and Rice coding for storing indexes.
We question whether bitwise compression schemes are the
best choice for storing lists of integers. As an alternative,
we consider bytewise integer compression schemes, which
require that each integer is stored in an integral number of
blocks, where each block is eight bits. The length of each
stored integer can therefore be measured in an exact number
of bytes. An additional restriction is to require that these
eight-bit blocks must align to machine-word or byte bound-
aries. We propose and experimentally investigate several
variations of bytewise schemes.
We investigate the performance of different index com-
pression schemes through experiments on large query sets
and collections of Web documents. We report two surpris-
ing results.
- For a 20 gigabyte collection, where the index is several
times larger than main memory, optimised bytewise
schemes more than halve the average decoding time
compared to the fastest bitwise approach.
- For a much smaller collection, where the index fits in
main memory, a bytewise compressed index can still
be processed faster than an uncompressed index.
These results show that effective use of communication band-
widths is important for not only disk-to-memory transfers
but also memory-to-cache transfers. The only disadvantage
of bytewise compressed indexes is that they are up to 30%
larger than bitwise compressed indexes; the smallest bitwise
index is around 10% of the uncompressed collection size,
while the bytewise index is around 13%.
2. INVERTED INDEXES
An inverted index consists of two major components: the
vocabulary of terms—for example the words—from the col-
lection, and inverted lists, which are vectors that contain
information about the occurrence of the terms [14].
In a basic implementation, for each term t there is an in-
verted list that contains postings <f_d,t, d>, where f_d,t is
the frequency of term t in the ordinal document d. One
posting is stored in the list for each document that contains
the term t. Inverted lists of this form—along with additional
statistics such as the document length l_d, and f_t, the num-
ber of documents that contain the term t—are sufficient to
support ranked and Boolean query modes.
To support phrase querying or proximity querying, addi-
tional information must be kept in the inverted lists. Thus
inverted list postings should be of the form
<f_d,t, d, [o_1,d,t ... o_f_d,t,d,t]>
The additional information is the list of offsets o; one off-
set is stored for each of the f_d,t occurrences of term t in
document d. Postings in inverted lists are usually ordered
by increasing d, and the offsets likewise ordered within the
postings by increasing o. This has the benefit that differ-
ences between values—rather than the raw values—can be
stored, improving the compressibility of the lists.
Other arrangements of the postings in lists are useful when
lists are not necessarily completely processed in response to
a query. For example, in frequency-sorted indexes [9, 10]
postings are ordered by fd,t , and in impact-ordered indexes
the postings are ordered by quantised weights [1]. These ap-
proaches also rely on compression to help achieve efficiency
gains, and the improvements to compression performance
we describe in this paper are as applicable to these methods
as they are to the simple index representations we use as a
testbed for our compression methods.
Consider an example inverted list with offsets for the term
“Matthew”:
<3, 7, [6, 51, 117]> <1, 44, [12]> <2, 117, [14, 1077]>
In this index, the terms are words, the offsets are word po-
sitions within the documents, and the lists are ordered by d.
This inverted list states that the term “Matthew” occurs 3
times in document 7, at offsets 6, 51, and 117. It also occurs
once in document 44 at offset 12, and twice in document 117,
at offsets 14 and 1077.
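To make the posting structure concrete, the following minimal
C sketch shows one way the decoded list for "Matthew" could
be held in memory and printed in the notation above. The
type and field names are ours, for illustration only.

#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t f_dt;       /* in-document frequency f_d,t */
    uint32_t d;          /* ordinal document number */
    const uint32_t *o;   /* f_dt word-position offsets, increasing */
} posting_t;

int main(void) {
    const uint32_t o7[] = {6, 51, 117}, o44[] = {12}, o117[] = {14, 1077};
    const posting_t matthew[] = { {3, 7, o7}, {1, 44, o44}, {2, 117, o117} };
    for (size_t i = 0; i < 3; i++) {
        printf("<%u, %u, [", matthew[i].f_dt, matthew[i].d);
        for (uint32_t j = 0; j < matthew[i].f_dt; j++)
            printf(j ? ", %u" : "%u", matthew[i].o[j]);
        printf("]>\n");
    }
    return 0;
}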
Ranked queries can be answered using the inverted index
as follows. First, the terms in the user’s query are located
in the inverted index vocabulary. Second, the correspond-
ing inverted lists for each term are retrieved from disk, and
then processed by decreasing f_t. Third, for each posting in
each inverted list, an accumulator weight A_d is increased;
the magnitude of the increase is dependent on the similarity
measure used, and can consider the weight w_q,t of term t
in the query q, the weight w_d,t of the term t in the docu-
ment d, and other factors. Fourth, after processing part [1,
6] or all of the lists, the accumulator scores are partially
sorted to identify the most similar documents. Last, for a
typical search engine, document summaries of the top ten
documents are generated or retrieved and shown to the user.
The offsets stored in each inverted list posting are not used
in ranked query processing.
Phrase queries require offsets and that a given sequence of
words be contiguous in a matching document. For example,
consider a combined ranked and phrase query:
“Matthew Richardson” Richmond
To evaluate such a query, the same first two steps as for
ranked querying are applied. Then, instead of accumulating
weights, it is necessary to construct a temporary inverted list
for the phrase, by fetching the inverted list of each of the
individual terms and combining them. If the inverted list for
“Matthew” is as above and the inverted list for “Richard-
son” is
<1, 7, [52]> <2, 12, [1, 4]> <1, 44, [83]>
then both words occur in document 7 and as an ordered
pair. Only the word “Richardson” is in document 12, both
words occur in document 44 but not as a pair, and only
“Matthew” occurs in document 117. The list for “Matthew
Richardson” is therefore
<1, 7, [51]>
After this, the ranking process is continued from the third
step, where the list for the term “Richmond” and the newly
created list are used to adjust accumulator weights. Phrase
queries can involve more than two words.
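To illustrate the merging step, the C sketch below builds the
temporary list for the phrase "Matthew Richardson" from the
two example lists: it intersects on document number and keeps
an occurrence only where a "Richardson" offset immediately
follows a "Matthew" offset. It is a simplified reconstruction,
not the authors' code; a real implementation would aggregate
matches into postings with counts rather than print them.

#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t f;          /* in-document frequency f_d,t */
    uint32_t d;          /* document number */
    const uint32_t *o;   /* word-position offsets, increasing */
} posting_t;

/* Merge two lists sorted by d; report each place where an occurrence
   of term 2 immediately follows an occurrence of term 1. */
static void phrase_merge(const posting_t *a, size_t na,
                         const posting_t *b, size_t nb) {
    size_t i = 0, j = 0;
    while (i < na && j < nb) {
        if (a[i].d < b[j].d) i++;          /* document only in list 1 */
        else if (a[i].d > b[j].d) j++;     /* document only in list 2 */
        else {
            for (uint32_t x = 0; x < a[i].f; x++)
                for (uint32_t y = 0; y < b[j].f; y++)
                    if (b[j].o[y] == a[i].o[x] + 1)
                        printf("<1, %u, [%u]>\n", a[i].d, a[i].o[x]);
            i++; j++;
        }
    }
}

int main(void) {
    const uint32_t m7[] = {6, 51, 117}, m44[] = {12}, m117[] = {14, 1077};
    const uint32_t r7[] = {52}, r12[] = {1, 4}, r44[] = {83};
    const posting_t matthew[] = { {3, 7, m7}, {1, 44, m44}, {2, 117, m117} };
    const posting_t richardson[] = { {1, 7, r7}, {2, 12, r12}, {1, 44, r44} };
    phrase_merge(matthew, 3, richardson, 3);   /* prints <1, 7, [51]> */
    return 0;
}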
3. COMPRESSING INVERTED INDEXES
Special-purpose integer compression schemes offer both
fast decoding and compact storage of inverted lists [13, 14].
In this section, we consider how inverted lists are compressed
and stored on disk. We limit our discussions here to the
special-purpose integer compression techniques that have
previously been shown to be suitable for index compression,
and focus on their use in increasing the speed of retrieval
systems.
Without compression, the time cost of retrieving inverted
lists is the sum of the time taken to seek for and then retrieve
the inverted lists from disk into memory, and the time taken
to transfer the lists from memory into the CPU cache be-
fore they are processed. The speed of access to compressed
inverted lists is determined by two factors: first, the com-
putational requirements for decoding the compressed data
and, second, the time required to seek for and retrieve the
compressed data from disk and to transfer it to the CPU
cache before it is decoded. For a compression scheme to
allow faster access to inverted lists, the total retrieval time
and CPU processing costs should be less than the retrieval
time of the uncompressed representation. However, a third
factor makes compression attractive even if CPU process-
ing costs exceed the saving in disk transfer time: compress-
ing inverted lists increases the number of lists that can be
cached in memory between queries, so that in the context of
a stream of queries use of compression reduces the number
of disk accesses. It is therefore important that a compres-
sion scheme be efficient in both decompression CPU costs
and space requirements.
There are two general classes of compression scheme that
are appropriate for storing inverted lists. Variable-bit or
bitwise schemes store integers in an integral number of bits.
Well-known bitwise schemes include Elias gamma and delta
coding [3] and Golomb-Rice coding [4]. Bytewise schemes
store an integer in an integral number of blocks, where a
block is eight bits in size; we distinguish between blocks and
bytes here, since there is no implied restriction that a block
must align to a physical byte-boundary. A simple bytewise
scheme is variable-byte coding [2, 13]; uncompressed inte-
gers are also stored in an integral number of blocks, but we
do not define them as bytewise schemes since, on most ar-
chitectures, an integer has a fixed-size representation of four
bytes. In detail, these schemes are as follows.
Elias coding [3] is a non-parameterised bitwise method of
coding integers. (Non-parameterised methods use static or
fixed codes to store integers.) The Elias gamma code repre-
sents a positive integer k by 1 + floor(log2 k) stored as a unary
code, followed by the binary representation of k without its
most significant bit. Using Elias gamma coding, small inte-
gers are compactly represented; in particular, the integer 1
is represented as a single 1-bit. Gamma coding is relatively
inefficient for storing integers larger than 15 [13].
Elias delta codes are suited to coding larger integers, but
are inefficient for small values. For an integer k, a delta
code stores the gamma code representation of 1 + floor(log2 k),
and then the binary representation of k without its most
significant bit.
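The following is a minimal C sketch of gamma and delta coding
over an in-memory bit buffer, under the convention above (the
integer 1 is a single 1-bit). The buffer layout and function
names are ours, not the authors' implementation.

#include <stdio.h>
#include <stdint.h>

static uint8_t buf[64];           /* bit buffer, msb-first within bytes */
static int wpos = 0, rpos = 0;    /* write/read cursors, in bits */

static void put_bit(int b) {
    if (b) buf[wpos >> 3] |= (uint8_t)(0x80 >> (wpos & 7));
    wpos++;
}
static int get_bit(void) {
    int b = (buf[rpos >> 3] >> (7 - (rpos & 7))) & 1;
    rpos++;
    return b;
}
static int floor_log2(uint32_t k) {   /* position of the msb of k */
    int e = -1;
    while (k) { k >>= 1; e++; }
    return e;
}

/* Gamma: unary code of 1 + floor(log2 k), then k without its msb. */
static void gamma_encode(uint32_t k) {
    int e = floor_log2(k);
    for (int i = 0; i < e; i++) put_bit(0);
    put_bit(1);
    for (int i = e - 1; i >= 0; i--) put_bit((k >> i) & 1);
}
static uint32_t gamma_decode(void) {
    int e = 0;
    while (get_bit() == 0) e++;
    uint32_t k = 1;                    /* restore the implicit msb */
    for (int i = 0; i < e; i++) k = (k << 1) | (uint32_t)get_bit();
    return k;
}

/* Delta: gamma code of 1 + floor(log2 k), then k without its msb. */
static void delta_encode(uint32_t k) {
    int e = floor_log2(k);
    gamma_encode((uint32_t)(1 + e));
    for (int i = e - 1; i >= 0; i--) put_bit((k >> i) & 1);
}
static uint32_t delta_decode(void) {
    int e = (int)gamma_decode() - 1;
    uint32_t k = 1;
    for (int i = 0; i < e; i++) k = (k << 1) | (uint32_t)get_bit();
    return k;
}

int main(void) {
    for (uint32_t k = 1; k <= 10; k++) gamma_encode(k);
    for (uint32_t k = 1; k <= 10; k++) delta_encode(k);
    for (uint32_t k = 1; k <= 10; k++)
        if (gamma_decode() != k) return 1;
    for (uint32_t k = 1; k <= 10; k++)
        if (delta_decode() != k) return 1;
    printf("gamma and delta round-trip ok\n");
    return 0;
}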
Golomb-Rice bitwise coding [4] has been shown to offer
more compact storage of integers and faster retrieval than
the Elias codes [13]; indeed, it is bitwise optimal under the
assumption that the set of documents with a given term is
random. The codes are adapted to per-term likelihoods via
a parameter that is used to determine the code emitted for
an integer. In many cases, this parameter must be stored
separately using, for example, an Elias code. For coding of
inverted lists, a single parameter is used for all document
numbers in a postings list, but each posting requires a pa-
rameter for its offsets. The parameters can be calculated as
the lists are decoded using statistics stored in memory and
in the lists, as we discuss later.
Coding of an integer k using Golomb codes with respect
to a parameter b is as follows. The code that is emitted is in
two parts: first, the unary code of a quotient q is emitted,
where q = floor((k - 1)/b) + 1; second, a binary code is emitted
for the remainder r, where r = k - (q - 1) × b - 1. The number
of bits required to store the remainder r is either floor(log2 b)
or ceil(log2 b). To retrieve the remainder, the value of the "toggle
point" t = (1 << (floor(log2 b) + 1)) - b is required, where <<
indicates a left-shift operation. After retrieving floor(log2 b)
bits of the remainder r, the remainder is compared to t. If r > t,
then one additional bit of the remainder must be retrieved. It
is generally thought that caching calculated values of log2 b
is necessary for fast decoding, with a main-memory penalty
of having to store the values. However, as we show later,
when the standard log library function is replaced with a
fast bit-shifting version, caching is unnecessary.
Rice coding is a variant of Golomb coding where the value
of b is restricted to be a power of 2. The advantage of this
restriction is that there is no "toggle point" calculation re-
quired; that is, the remainder is always stored in exactly
log2 b bits. The disadvantage of this scheme is that the
choice of value for b is restricted and, therefore, the com-
pression is slightly less effective than that of Golomb coding.
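Below is a C sketch of Golomb coding along these lines,
reconstructed from the description above rather than taken
from the authors' code. The remainder uses the toggle point
t = (1 << (floor(log2 b) + 1)) - b; our decoder tests r >= t,
the usual truncated-binary condition. When b is a power of
two, t equals b, the extra-bit branch can never fire, and the
routine behaves exactly as a Rice coder. The parameter b = 27
in the example is arbitrary; the bit-buffer helpers repeat
those of the previous sketch.

#include <stdio.h>
#include <stdint.h>

static uint8_t buf[128];
static int wpos = 0, rpos = 0;
static void put_bit(int b) { if (b) buf[wpos >> 3] |= (uint8_t)(0x80 >> (wpos & 7)); wpos++; }
static int  get_bit(void)  { int b = (buf[rpos >> 3] >> (7 - (rpos & 7))) & 1; rpos++; return b; }
static int  floor_log2(uint32_t x) { int e = -1; while (x) { x >>= 1; e++; } return e; }

/* Golomb code for k >= 1: unary quotient, then truncated-binary
   remainder; short remainders (r < t) take floor(log2 b) bits, long
   ones take one bit more. */
static void golomb_encode(uint32_t k, uint32_t b) {
    uint32_t q = (k - 1) / b, r = (k - 1) % b;
    int lo = floor_log2(b);
    uint32_t t = (1u << (lo + 1)) - b;        /* the "toggle point" */
    for (uint32_t i = 0; i < q; i++) put_bit(0);
    put_bit(1);                               /* unary part */
    if (r < t) {
        for (int i = lo - 1; i >= 0; i--) put_bit((r >> i) & 1);
    } else {
        r += t;                               /* long codeword, lo+1 bits */
        for (int i = lo; i >= 0; i--) put_bit((r >> i) & 1);
    }
}

static uint32_t golomb_decode(uint32_t b) {
    uint32_t q = 0;
    while (get_bit() == 0) q++;               /* unary quotient */
    int lo = floor_log2(b);
    uint32_t t = (1u << (lo + 1)) - b;
    uint32_t r = 0;
    for (int i = 0; i < lo; i++) r = (r << 1) | (uint32_t)get_bit();
    if (r >= t)                               /* one additional bit needed */
        r = ((r << 1) | (uint32_t)get_bit()) - t;
    return q * b + r + 1;
}

int main(void) {
    /* d-gaps 7, 37, 73 correspond to documents 7, 44, 117 above */
    const uint32_t gaps[] = {7, 37, 73};
    for (int i = 0; i < 3; i++) golomb_encode(gaps[i], 27);
    for (int i = 0; i < 3; i++)
        printf("%u ", golomb_decode(27));     /* prints: 7 37 73 */
    printf("\n");
    return 0;
}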
For compression of inverted lists, a value of b is required.
Witten et al. [14] report that for cases where the probability
of any particular integer value occurring is small—which is
the usual case for document numbers d and offsets o—then
b can be calculated as:
b = 0.69 × mean(k)
For each inverted list, the mean value of document numbers
d can be approximated as k = N/f_t, where N is the number
of documents in the collection and f_t is the number of post-
ings in the inverted list for term t [14]. This approach can
also be extended to offsets: the mean value of offsets o for
an inverted list posting can be approximated as k = l_d/f_d,t,
where l_d is the length of document d and f_d,t is the number
of offsets of term t within that document. As the statistics
N, f_t, and l_d are often available in memory, or in a simple
auxiliary structure on disk, storage of b values is not required
for decoding; approximate values of l_d can be stored in mem-
ory for compactness [7], but use of approximate values has
little effect on compression effectiveness as it leads to only
small relative errors in computation of b.
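As a worked example of these approximations, the sketch
below computes b for document numbers and for offsets. The
helper name and the guard against b = 0 are ours, and the
per-term statistics are hypothetical; the collection size is the
one used in Section 4.

#include <stdio.h>
#include <stdint.h>

/* b = 0.69 x mean(k), following Witten et al. [14]. */
static uint32_t golomb_b(double mean_k) {
    uint32_t b = (uint32_t)(0.69 * mean_k);
    return b ? b : 1;                 /* guard: the parameter must be >= 1 */
}

int main(void) {
    uint32_t N = 4014894;             /* documents in the 20 Gb collection */
    uint32_t f_t = 10000;             /* postings for a hypothetical term */
    uint32_t l_d = 400, f_dt = 2;     /* a hypothetical document/posting */
    printf("document b = %u\n", golomb_b((double)N / f_t));    /* 277 */
    printf("offset   b = %u\n", golomb_b((double)l_d / f_dt)); /* 138 */
    return 0;
}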
In bytewise coding an integer is stored in an integral num-
ber of eight-bit blocks. For variable-byte codes, seven bits
in each block are used to store a binary representation of
the integer k. The remaining bit is used to indicate whether
the current block is the final block for k, or whether an ad-
ditional block follows. Consider an example of an integer k
in the range of 2^7 = 128 to 2^14 = 16,384. Two blocks are
required to represent this integer: the first block contains
the seven least-significant bits of the integer and the eighth
bit is used to flag that another block follows; the second
block contains the remaining most-significant bits and the
eighth bit flags that no further blocks follow. We use the
convention that the flag bit is set to 1 in the final block and
0 otherwise.
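A minimal C sketch of variable-byte coding under the convention
just described: seven data bits per block, least-significant bits
stored first, and a flag bit of 1 marking the final block.
Function names are ours.

#include <stdio.h>
#include <stdint.h>

/* Encode k into out; returns the number of bytes written. */
static int vbyte_encode(uint32_t k, uint8_t *out) {
    int n = 0;
    while (k >= 128) {              /* more blocks follow: flag bit 0 */
        out[n++] = (uint8_t)(k & 0x7F);
        k >>= 7;
    }
    out[n++] = (uint8_t)(k | 0x80); /* final block: flag bit 1 */
    return n;
}

/* Decode one integer from in; returns the number of bytes consumed. */
static int vbyte_decode(const uint8_t *in, uint32_t *k) {
    int n = 0, shift = 0;
    *k = 0;
    for (;;) {
        uint8_t blk = in[n++];
        *k |= (uint32_t)(blk & 0x7F) << shift;
        if (blk & 0x80) break;      /* flag bit 1: last block */
        shift += 7;
    }
    return n;
}

int main(void) {
    uint8_t out[5];
    uint32_t k;
    int n = vbyte_encode(1077, out);    /* an offset from the example list */
    vbyte_decode(out, &k);
    printf("1077 -> %d bytes -> %u\n", n, k);   /* 2 bytes */
    return 0;
}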
Compressing an inverted index, then, involves choosing
compression schemes for the three kinds of data that are
stored in a posting: a document number d, an in-document
frequency f_d,t, and a sequence of offsets o. A standard choice
is to use Golomb codes for document numbers, gamma codes
for frequencies, and delta codes for offsets [14]. (We explore
the properties of this choice later.) In this paper, we describe
such a choice as a GolD-GamF-DelO index.
3.1 Fast Decoding
We experiment with compression of inverted lists of post-
ings that contain frequencies f_d,t, document numbers d, and
offsets o. For fast decompression of these postings, there are
two important considerations: first, the choice of compres-
sion scheme for each component of the posting; and, second,
modifications to each compression scheme so that it is both
fast and compatible with the schemes used for the other
components. In this section, we outline the optimisations
we use for fast decompression. Our code is publicly avail-
able and distributed under the GNU public licence.2
Bitwise Compression
We have experimented with a range of variations of bitwise
decompression schemes. Williams and Zobel [13] reported
results for several efficient schemes, where vectors that con-
tain compressed integers are retrieved from disk and subse-
quently decoded.3 In their approach, vector decoding uses
bitwise shift operations, bit masks, multiplication, subtrac-
tion, and function calls to retrieve sequences of bits that
span byte boundaries. In our experiments on Intel Pentium-
based servers running the Linux operating system, we have
found that bitwise shift operations are usually faster than
bit masks, and that the function calls are slow. By opti-
mising our code to use bitwise shifts and to remove nested
function calls, we have found that the overall time to decode
vectors—regardless of the compression scheme used—is on
average around 60% of that using the code of Williams and
Zobel.
Other optimisations that are specific to Golomb-Rice cod-
ing are also of value. Golomb-Rice decoding requires that
log2 b is calculated to determine the number of remainder
bits to be retrieved. It is practicable to explicitly cache
values of log2 b in a hash table as they are calculated, or
to pre-calculate all likely-to-be-used values as the retrieval
query engine is initialised. This saves recalculation of loga-
rithms when a value of b is reused in later processing, with
the penalty of additional memory requirements for storing
the lookup table.
2 The search engine used in these experiments and
our integer compression code is available from
http://www.seg.rmit.edu.au/
3 The code used by Williams and Zobel in their
experiments is available from http://www.cs.rmit.edu.au/
~hugh/software/
We measured the performance of Golomb coding with and
without caching. Timings are average elapsed query evalua-
tion cost to process index information for 25,000 queries on a
9.75 gigabyte (Gb) collection of Web data [5] using our pro-
totype retrieval engine on a GolD-GamF-GolO index (that
is, Golomb document numbers, gamma frequencies, Golomb
offsets); we discuss collection statistics and experimental de-
sign further in Section 4. The cache lookup table size is
unrestricted.
We found that, without caching of log2 b values, the av-
erage query evaluation time is 0.961 seconds. Caching of
log2 b values as they are calculated during query processing
roughly halves the average query evaluation time, to 0.494
seconds. Pre-calculating and storing the values offers almost
no benefit over caching during query processing, reducing
the time to 0.491 seconds; this reflects that only a limited
range of b values is required during query evaluation. Caching of
toggle points yields 0.492 seconds. As toggle points are cal-
culated using bitwise shifts, addition, and subtraction, this
is further evidence that bitwise shifts are inexpensive on our
hardware.
An alternative approach to managing log computations
is to replace the standard library log function with a loop
that determines log2 b using bitwise shifts and equality
tests; the logarithm value can be determined by locating
the position of the most-significant 1-bit in b. We found
that this led to slight additional improvements in the speed
of decoding Golomb codes, outperforming explicit caching.
All Golomb-Rice coding results reported in this paper are
computed in this way.
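A sketch of such a loop, using only shifts and comparisons;
this is the same bit-shifting idea assumed in the earlier
sketches, not the authors' exact routine.

#include <stdio.h>
#include <stdint.h>

/* floor(log2(b)) for b >= 1, via bitwise shifts: no floating point,
   no library log call, no lookup table. */
static int floor_log2(uint32_t b) {
    int e = 0;
    while ((b >>= 1) != 0)   /* shift until the most-significant 1-bit is gone */
        e++;
    return e;
}

int main(void) {
    printf("%d %d %d\n", floor_log2(1), floor_log2(255), floor_log2(256));
    /* prints: 0 7 8 */
    return 0;
}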
Bytewise Compression
We have experimented with improvements to variable-byte
coding. Unlike in bitwise coding, we have found that mask-
ing and shifting are equally fast because of the large num-
ber of shifts required. We use shifts in our experiments.
Perhaps the most obvious way to increase the speed of
variable-byte decoding is to align the eight-bit blocks to byte
boundaries. Alignment with byte boundaries limits the de-
coding to only one option: the flag bit indicating if this is the
last byte in the integer is always the most significant bit, and
the remaining seven bits contain the value. Without byte
alignment, additional conditional tests and operations are
required to extract the flag bit, and the seven-bit value can
span byte boundaries. We would expect that byte alignment
would improve the speed of decoding variable-byte integers.
Figure 1 shows the effect of byte alignment of variable-
byte integers. In this experiment, variable-byte coding is
used to store the offsets o in each inverted list posting. The
optimised Golomb coding scheme described in the previ-
ous section is used to code document numbers d and Elias
gamma coding is used to store the frequencies f_d,t. We refer
to this as a GolD-GamF-VbyO index.
The graph at the left of Figure 1 shows total index size
as a percentage of the uncompressed collection being in-
dexed. The first bar shows that, without byte alignment,
the GolD-GamF-VbyO index requires almost 13% of the
space required by the collection. The second bar shows that
padding to byte alignment after storing the gamma-coded
f_d,t values increases the space requirement to just over 13.5%
of the collection size. We discuss the other schemes in this
figure later in this section.
[Figure 1: two bar charts, showing index size (% of collection)
and average query time (seconds) for the variable-byte offset
schemes described in the caption.]
Figure 1: Variable-byte schemes for compressing off-
sets in inverted lists in a GolD-GamF-VbyO index.
Four different compression schemes are shown and,
for each, both original and scanning decoding are
shown. Scanning decoding can be used when offsets
are not needed for query resolution.
The graph at the right of Figure 1 shows elapsed query
evaluation times using different index designs. Timings are
the average elapsed query evaluation cost to process the
inverted lists for 25,000 queries on a 20 Gb collection of
Web data [5] using our prototype retrieval engine. Queries
are processed as conjunctive Boolean queries. The first bar
shows that the average time is around 0.7 seconds for the
GolD-GamF-VbyO index without byte alignment. The sec-
ond bar shows that the effect of byte alignment is a 25% re-
duction in average query time. Therefore, despite the small
additional space requirement, byte-alignment is beneficial
when storing variable-byte integers.
A second optimisation to variable-byte coding is to con-
sider the query mode when processing the index. For query-
ing that does not use offsets—such as ranked and Boolean
querying—decoding of the offsets in each posting is unneces-
sary. Rather, all that is required are the document numbers
dand document frequencies fd,t . An optimisation is there-
fore to only examine the flag bit of each block and to ignore
the remaining seven bits that contain the value. The value
of f_d,t indicates the number of offsets o stored in the post-
ing. By examining flag bits until f_d,t 1-bits are processed,
it is possible to bypass the offsets with minimal processing.
We call this approach scanning.
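The C sketch below illustrates scanning over the byte-aligned
variable-byte layout used earlier: only the flag bits are
examined, and the seven-bit values are never assembled. The
function name and layout assumptions are ours.

#include <stdint.h>
#include <stdio.h>

/* Returns the number of bytes occupied by f_dt variable-byte
   integers, examining only the flag bit of each block. */
static size_t scan_offsets(const uint8_t *list, uint32_t f_dt) {
    size_t pos = 0;
    uint32_t seen = 0;
    while (seen < f_dt) {
        if (list[pos] & 0x80)  /* flag bit set: one whole integer passed */
            seen++;
        pos++;
    }
    return pos;
}

int main(void) {
    /* three offsets 6, 51, 117, each a single final block (flag bit 1) */
    const uint8_t posting[] = { 0x86, 0xB3, 0xF5 };
    printf("skipped %zu bytes\n", scan_offsets(posting, 3));  /* 3 */
    return 0;
}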
Scanning can also be used in query modes that do require
offset decoding. As we discussed earlier, phrase querying
requires that all terms are present in a matching document.
After processing the inverted list for the first term that is
evaluated in a phrase query, a temporary inverted list of
postings is created. This temporary list has a set D of doc-
uments that contain the first term. When processing the
second term in the query, a second set of document num-
bers D' is processed. Offsets for the posting associated
with document d in D' can be scanned, that is, passed over
without decoding, if d is not a member of D. (At the same
time, document numbers in D that are not in D' are dis-
carded.)
We show the performance of scanning in Figure 1. The
fifth and sixth bars show how scanning affects query evalu-
ation time for variable-bytes that are either unaligned or
aligned to byte boundaries in the GolD-GamF-VbyO index.
Scanning removes the processing of seven-bit values. This
reduces the cost of retrieving unaligned variable-bytes to less
than that of the aligned variable-byte schemes; the small
speed advantage is due to the retrieval of smaller lists in the
unaligned version. Scanning has little effect on byte-aligned
variable bytes, reflecting that the processing of seven-bit val-
ues using shift operations has a low cost. Overall, however,
byte-alignment is preferred since the decoding cost of offsets
is expensive in an unaligned scheme.
A third optimisation is an approach we call signature
blocks, which are a variant of skipping. Skipping is the ap-
proach of storing additional integers in inverted lists that
indicate how much data can be skipped without any pro-
cessing [14]. Skipping has the disadvantage of an additional
storage space requirement, but has been shown to offer sub-
stantial speed improvements [14]. A signature block is an
eight-bit block that stores the flag bits of up to eight blocks
that follow. For example, a signature block with the bit-
string 11100101 represents that five integers are stored in
the eight following eight-bit blocks: the string 111 repre-
sents that the first three blocks store one integer each; the
string 001 represents that the fourth integer is stored over
three blocks; and, the string 01 represents that the final in-
teger is stored over two blocks. As all flag bits are stored
in the signature block, the following blocks use all eight bits
to store values, rather than the seven-bit scheme in the standard
variable-byte integer representation.
The primary use of signature blocks is skipping. To skip
offsets, f_d,t offset values must be retrieved but not processed.
By counting the number of 1-bits in a signature block, the
number of integers stored in the next eight blocks can be
determined. If the value of f_d,t exceeds this, then a second
or subsequent signature block is processed until f_d,t offsets
have been skipped. The last signature block is, on average,
half full. We have found that bitwise shifts are faster than
a lookup table for processing of signature blocks.
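A C sketch of skipping with signature blocks, under our
reading of the layout: each signature byte carries, most-
significant bit first, the flag bits of the data blocks that
follow it, with a 1-bit marking a block that ends an integer.
The example uses the signature 11100101 from the text.

#include <stdint.h>
#include <stdio.h>

static int is_flag_set(uint8_t sig, int i) {   /* flags stored msb-first */
    return (sig >> (7 - i)) & 1;
}

/* Skip f_dt integers stored as signature blocks followed by data
   blocks; returns the number of bytes passed over without decoding. */
static size_t skip_offsets(const uint8_t *list, uint32_t f_dt) {
    size_t pos = 0;
    while (f_dt > 0) {
        uint8_t sig = list[pos++];             /* one signature block */
        int blocks = 0;
        for (int i = 0; i < 8 && f_dt > 0; i++) {
            blocks++;                          /* one data block covered */
            if (is_flag_set(sig, i))
                f_dt--;                        /* an integer ends here */
        }
        pos += blocks;                         /* step over the data blocks */
    }
    return pos;
}

int main(void) {
    /* signature 11100101 (0xE5), then eight data blocks (values arbitrary) */
    const uint8_t posting[] = { 0xE5, 1, 2, 3, 4, 5, 6, 7, 8 };
    printf("skipped %zu bytes\n", skip_offsets(posting, 5));  /* 9 */
    return 0;
}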
The speed and space requirements are also shown in Fig-
ure 1. Not surprisingly, the signature block scheme requires
more space than the previous variable-byte schemes. This
space requirement is further increased if byte alignment of
blocks is enforced. In terms of speed, the third and fourth
bars in the right-hand histogram show that signature blocks
are slower than the original variable-byte schemes when off-
sets are processed in the GolD-GamF-VbyO index. These
results are not surprising: signature blocks are slow to pro-
cess when they are unaligned, and the byte-aligned version
is slow because processing costs are no less than the original
variable-byte schemes and longer disk reads are required.
As shown by the seventh bar, when offsets are skipped the
unaligned signature block scheme is slower than the original
variable-byte scheme. The savings of skipping with signa-
ture blocks are negated by more complex processing when
blocks are not byte-aligned. In contrast, the right-most bar
shows that the byte-aligned signature block scheme with
skipping is slightly faster on average than all other schemes.
However, we conclude—given the compactness of the in-
dex and good overall performance—that the best all-round
scheme is the original variable-byte scheme with byte align-
ment. Therefore, all variable-byte results reported in
Section 4 use the original byte-aligned variable-byte scheme
with scanning.
Customised Compression
Combinations of bitwise and bytewise compression schemes
are also possible. The aim of such approaches is to com-
bine the fast decoding of bytewise schemes with the com-
pact storage of bitwise schemes. For example, a simple and
efficient custom scheme is to store a single bit that indicates
which of two compression schemes is used, and then to store
the integer using the designated compression scheme. We
have experimented with several approaches for storing off-
sets. The simplest and most efficient approach we tested
is as follows: when f_d,t = 1, we store a single bit indicat-
ing whether the following offset is stored as a bitwise Elias
delta code or as a bytewise eight-bit binary representation.
When storing values, we use Elias delta coding if the value
is greater than 256 and the binary scheme otherwise. This
scheme has the potential to reduce space because in the me-
dian posting f_d,t is 1 and the average offset is around 200.
Selective use of a fixed-width representation can save stor-
age of the 6-bit prefix used to indicate magnitude in the
corresponding delta code.
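The C sketch below reconstructs the selector logic for a
single offset when f_d,t = 1. We assume, as one plausible
reading, that the eight-bit branch stores o - 1 so that values
1 to 256 fit exactly; the bit-buffer and Elias helpers repeat
those of the earlier sketches.

#include <stdio.h>
#include <stdint.h>

static uint8_t buf[32];
static int wpos = 0, rpos = 0;
static void put_bit(int b) { if (b) buf[wpos >> 3] |= (uint8_t)(0x80 >> (wpos & 7)); wpos++; }
static int  get_bit(void)  { int b = (buf[rpos >> 3] >> (7 - (rpos & 7))) & 1; rpos++; return b; }
static int  floor_log2(uint32_t x) { int e = -1; while (x) { x >>= 1; e++; } return e; }

static void gamma_encode(uint32_t k) {
    int e = floor_log2(k);
    for (int i = 0; i < e; i++) put_bit(0);
    put_bit(1);
    for (int i = e - 1; i >= 0; i--) put_bit((k >> i) & 1);
}
static uint32_t gamma_decode(void) {
    int e = 0;
    while (get_bit() == 0) e++;
    uint32_t k = 1;
    for (int i = 0; i < e; i++) k = (k << 1) | (uint32_t)get_bit();
    return k;
}
static void delta_encode(uint32_t k) {
    int e = floor_log2(k);
    gamma_encode((uint32_t)(1 + e));
    for (int i = e - 1; i >= 0; i--) put_bit((k >> i) & 1);
}
static uint32_t delta_decode(void) {
    int e = (int)gamma_decode() - 1;
    uint32_t k = 1;
    for (int i = 0; i < e; i++) k = (k << 1) | (uint32_t)get_bit();
    return k;
}

/* Custom code for one offset: selector bit 0 -> o - 1 in eight bits
   (o <= 256), selector bit 1 -> Elias delta code (o > 256). */
static void custom_encode(uint32_t o) {
    if (o <= 256) {
        put_bit(0);
        for (int i = 7; i >= 0; i--) put_bit(((o - 1) >> i) & 1);
    } else {
        put_bit(1);
        delta_encode(o);
    }
}
static uint32_t custom_decode(void) {
    if (get_bit() == 0) {
        uint32_t o = 0;
        for (int i = 0; i < 8; i++) o = (o << 1) | (uint32_t)get_bit();
        return o + 1;
    }
    return delta_decode();
}

int main(void) {
    custom_encode(200);                  /* a typical median-posting offset */
    custom_encode(1077);                 /* a large offset: delta branch */
    printf("%u %u\n", custom_decode(), custom_decode());  /* 200 1077 */
    return 0;
}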
We report the results with this scheme, which we call cus-
tom, in the next section. This was the fastest custom scheme
we tested. Other approaches we tried included switching be-
tween variable-byte and bitwise schemes, using the custom
scheme when f_d,t is either 1 or 2, and other simple varia-
tions. We omit results for these less successful approaches.
4. RESULTS
All experiments described in this paper are carried out on
an Intel Pentium III based machine with 512 Mb of main-
memory running the Linux operating system. Other pro-
cesses and disk activity were minimised during timing exper-
iments; that is, the machine was under light load.
A theme throughout these experiments, one that greatly
affects the results, is the importance of caching. On a
modern machine, caching takes place at two levels. One level
is the caching of recently-accessed disk blocks in memory, a
process that is managed by the operating system. When the
size of the index significantly exceeds memory capacity, to
make space to fetch a new inverted list, the blocks contain-
ing material that has not been accessed for a while must be
discarded. One of the main benefits of compression is that a
much greater volume of index information can be cached in
memory. For this reason, we test our compression schemes
with streams of 10,000 or 25,000 queries extracted from a
query log [11], where the frequency distribution of query
terms leads to beneficial use of caching. Again, queries are
processed as conjunctive Boolean queries.
[Figure 2: two bar charts, showing index size (% of collection)
and average query time (seconds x 10^-2) for the schemes
listed in the caption.]
Figure 2: Performance of integer compression
schemes for offsets in inverted lists, in an index with
Golomb document numbers and gamma frequencies.
In this experiment, the index fits in main memory.
A 500 Mb collection is used, and results are aver-
aged over 10,000 queries.
The other level at which caching takes place is the reten-
tion in the CPU cache of small blocks of data, typically of
128 bytes, recently accessed from memory. CPU caching
is managed in hardware. In current desktop computers, as
many as 150 instruction cycles are required to fetch a single
machine-word into the CPU. At a coarser level, compres-
sion of postings lists means that the number of fetches from
memory to cache during decompression is halved.
Small collection
Figure 2 shows the relative performance of the integer com-
pression schemes we have described for storing offsets, on
a 500 Mb collection of 94,802 Web documents drawn from
the TREC Web track data [5]; timing results are averaged
over 10,000 queries drawn from an Excite search engine
query log [11]. The index contains 703,518 terms.
These results show the effect of varying the coding scheme
used for document numbers d, frequencies f_d,t, and offsets o.
In all cases where both bitwise and variable-byte codes are
used, the bitwise codes are padded to a byte boundary before
a variable-byte code is emitted; thus, for example, in a GolD-
GamF-VbyO index, there is padding between the gamma
frequency and the sequence of variable-byte offsets. Not all
code combinations are shown; for example, given that the
speed advantage of using variable-byte document numbers
is small, we have not reported results for index types such as
VbyD-GamF-RicO and, due to the use of padding, a choice
such as VbyD-GamF-VbyO. Given the highly skewed distribu-
tion of f_d,t values, Golomb or Rice are not suitable coding
methods, so these have not been tried.
In the “no compression” case, fixed-width fields are used
to store postings. Document numbers are stored in 32 bits,
frequencies in 16 bits, and offsets in 24 bits; these were the
smallest multiples of bytes that would not overflow for rea-
sonable assumptions about data properties.
The relative performance of Elias delta and gamma, Rice,
and Golomb coding is as expected. The non-parameterised
Elias coding schemes result in larger indexes than the param-
eterised Golomb-Rice schemes and, in turn, in slower
query evaluation.
greater than 15, making Elias delta coding more appropri-
ate overall than gamma coding; the latter is both slower and
less space-efficient.
On the lower graph in Figure 2, comparing the fourth and
fifth columns and comparing the fifth and eighth columns,
it can be seen that choice of Golomb or Rice codes for either
offsets or document numbers has virtually no impact on in-
dex size. Comparing the fifth and eighth columns on the up-
per graph, the schemes yield similar decoding times for docu-
ment numbers. However, Rice codes are markedly faster for
decoding offsets, because no toggle point calculation is re-
quired. Among the bitwise schemes, we conclude that Rice
coding should be used in preference to other schemes for
coding document numbers and offsets.
The most surprising result is the effect of using the op-
timised byte-boundary variable-byte scheme for coding off-
sets. Despite the variable-byte index being 26% larger than
the corresponding Rice-coded index, the overall query evalu-
ation time is 62% less. Further speed gains are given by cod-
ing all values in variable-byte codes. Indeed, variable-byte
decoding is faster even than processing uncompressed lists.
This result is remarkable: the cost of transferring variable-
byte coded lists from memory to the CPU cache and then
decoding the lists is less than the cost of transferring un-
compressed lists. To our knowledge, this is the first practi-
cal illustration that compression improves the efficiency of
an in-memory retrieval system. We conclude from this that
variable-byte coding should be used to store offsets to reduce
both disk retrieval and memory retrieval costs.
In experiments with integers, Williams and Zobel found
that variable-byte coding is faster than the bitwise schemes
for storing large integers of the magnitude stored in inverted
lists [13]. Our result confirms this observation for retrieval
systems, while also showing that the effect extends to fast
retrieval from memory and that improvements to variable-
byte coding can considerably increase decoding speed.
The custom scheme uses both Elias delta and a binary
bytewise scheme, reducing query evaluation to around 58%
of the time for the Elias delta scheme. However, the custom
scheme is almost twice as slow as the variable-byte scheme
and, therefore, has little benefit in practice.
Large collection
Figure 3 shows the results of a larger experiment with an
index that does not fit within the main memory of our machine.
[Figure 3: two bar charts, showing index size (% of collection)
and average query time (seconds) for the schemes listed in
the caption.]
Figure 3: The performance of integer compression
schemes for compressing offsets in inverted lists,
with Golomb-coded document numbers and gamma-
coded frequencies. In this experiment, the index is
several times larger than main memory. A 20 Gb
collection is used, and results are averaged over
25,000 queries.
Exactly the same index types are tried as for the
experiment above. A 20 Gb collection of 4,014,894 Web doc-
uments drawn from the TREC Web track data [5] is used
and timing results are averaged over 25,000 Boolean queries
drawn from an Excite search engine query log [11]. The
index contains 9,574,703 terms. We include only selected
schemes in our results.
We again note that we have not used heuristics to reduce
query evaluation costs such as frequency-ordering or early
termination. Indeed, we have not even used stopping; with
stopwords removed, query times are greatly improved. Our
aim in this research is to measure the impact on index de-
coding time of different choices of compression method, not
to establish new benchmarks for query evaluation time. Our
improvements to compression techniques could, however, be
used in conjunction with the other heuristics, in all likeli-
hood further reducing query evaluation time compared to
the best times reported previously.
The relative speeds of the bitwise Golomb, Elias delta, and
variable-byte coded offset schemes are similar to those in our
experiments with the 500 Mb collection. Again, variable-
byte coding results in the fastest query evaluation. Perhaps
unsurprisingly given the results described above, an uncom-
pressed index that does not fit in main-memory is relatively
much slower than the variable-byte scheme; the disk trans-
fer costs are a larger fraction of the overall query cost when
the index does not fit in memory, and less use can be made
of the memory cache. Indexes with variable-byte offsets are
twice as fast as indexes with Golomb, delta, or gamma off-
sets, and one-and-a-half times as fast as indexes with Rice
offsets. VbyD-VbyF-VbyO indexes are twice as fast as any
index type with non-variable-byte offsets.
In separate experiments we have observed that the gains
demonstrated by compression continue to increase with col-
lection size, as the proportion of the index that can be held
in memory declines. Despite the loss in compression with
variable-byte coding, indexes are still less than one-seventh
of the size of the indexed data, and the efficiency gains are
huge.
5. CONCLUSIONS
Compression of inverted lists can significantly improve the
performance of retrieval systems. We have shown that an ef-
ficiently implemented variable-byte bytewise scheme results
in query evaluation that is twice as fast as more compact
bitwise schemes. Moreover, we have demonstrated that the
cost of transferring data from memory to the CPU cache
can also be reduced by compression: when an index fits in
main memory, the transfer of compressed data from mem-
ory to the cache and subsequent decoding costs less than the
transfer of uncompressed data. Using byte-aligned cod-
ing, we have shown that queries can be run more than twice
as fast as with bitwise codes, at a small loss of compression
efficiency. These are dramatic gains.
Modern computer architectures create opportunities for
compression to yield performance advantages. Once, the
main benefits of compression were to save scarce disk space
and computer-to-computer transmission costs. An equally
important benefit now is to make use of the fact that the
CPU is largely idle. Fetching a single byte from memory in-
volves a delay of 12 to 150 CPU cycles; a fetch from disk in-
volves a delay of 10,000,000 cycles. Compression can greatly
reduce the number of such accesses, while CPU time that
would otherwise be unused can be spent on decoding. With
fast decoding, overall costs are much reduced, greatly in-
creasing query evaluation speed. In current computers such
architecture considerations are increasingly important to de-
velopment of new algorithms for query processing. Poor
caching has been a crucial shortcoming of existing algo-
rithms investigated in this research.
There are several possible extensions to this work. We
plan to investigate nibble-coding, a variant of variable-byte
coding where two flag bits are used in each variable-byte
block. It is likely that this approach will improve the per-
formance of signature blocks. We will also experiment with
phrase querying in practice, and explore the average query
evaluation speed when partial scanning is possible.
6. REFERENCES
[1] V. Anh, O. de Kretser, and A. Moffat. Vector-Space
ranking with effective early termination. In W. Croft,
D. Harper, D. Kraft, and J. Zobel, editors, Proc.
ACM-SIGIR International Conference on Research
and Development in Information Retrieval, pages
35–42, New York, Sept. 2001.
[2] E. de Moura, G. Navarro, N. Ziviani, and
R. Baeza-Yates. Fast and flexible word searching on
compressed text. ACM Transactions on Information
Systems, 18(2):113–139, 2000.
[3] P. Elias. Universal codeword sets and representations
of the integers. IEEE Transactions on Information
Theory, IT-21(2):194–203, Mar. 1975.
[4] S. Golomb. Run-length encodings. IEEE Transactions
on Information Theory, IT–12(3):399–401, July 1966.
[5] D. Hawking, N. Craswell, and P. Thistlewaite.
Overview of TREC-7 very large collection track. In
E. Voorhees and D. Harman, editors, Proc. Text
Retrieval Conference (TREC), pages 91–104,
Washington, 1999. National Institute of Standards
and Technology Special Publication 500-242.
[6] A. Moffat and J. Zobel. Self-indexing inverted files for
fast text retrieval. ACM Transactions on Information
Systems, 14(4):349–379, Oct. 1996.
[7] A. Moffat, J. Zobel, and R. Sacks-Davis.
Memory-efficient ranking. Information Processing &
Management, 30(6):733–744, 1994.
[8] G. Navarro, E. de Moura, M. Neubert, N. Ziviani, and
R. Baeza-Yates. Adding compression to block
addressing inverted indexes. Information Retrieval,
3(1):49–77, 2000.
[9] M. Persin. Document filtering for fast ranking. In
W. Croft and C. van Rijsbergen, editors, Proc.
ACM-SIGIR International Conference on Research
and Development in Information Retrieval, pages
339–348, Dublin, Ireland, 1994.
[10] M. Persin, J. Zobel, and R. Sacks-Davis. Filtered
document retrieval with frequency-sorted indexes.
Journal of the American Society for Information
Science, 47(10):749–764, 1996.
[11] A. Spink, D. Wolfram, B. J. Jansen, and T. Saracevic.
Searching the web: The public and their queries.
Journal of the American Society for Information
Science, 52(3):226–234, 2001.
[12] A. Vo and A. Moffat. Compressed inverted files with
reduced decoding overheads. In R. Wilkinson,
B. Croft, K. van Rijsbergen, A. Moffat, and J. Zobel,
editors, Proc. ACM-SIGIR International Conference
on Research and Development in Information
Retrieval, pages 290–297, Melbourne, Australia, July
1998.
[13] H. Williams and J. Zobel. Compressing integers for
fast file access. Computer Journal, 42(3):193–201,
1999.
[14] I. Witten, A. Moffat, and T. Bell. Managing
Gigabytes: Compressing and Indexing Documents and
Images. Morgan Kaufmann Publishers, Los Altos, CA
94022, USA, second edition, 1999.
[15] N. Ziviani, E. de Moura, G. Navarro, and
R. Baeza-Yates. Compression: A key for
next-generation text retrieval systems. IEEE
Computer, 33(11):37–44, Nov. 2000.
... The scheme used to compress the adjlist array in CIC-PIM is a variant of the standard variable-length byte-aligned encoding (standard vByte, shown in Fig. 3) [57]. To compress non-negative integers, the standard vByte tries to remove leading zeros (if any) to save space. ...
... Then the first delta is encoded with signed vByte and other deltas with standard vByte. Due to the byte-aligned feature, both the encoding and decoding routines of vByte are fast [57]. The decoding routine is called on the fly in graph algorithms and incurs limited runtime overhead. ...
... The fixedlength feature makes real random access possible, as detailed in Section 3.1.4. Fixed-length encoding is rarely adopted because of its usually low space saving [57]. However, due to the compression amenability after chunking, CIC-PIM achieves reasonable space saving (Section 4.5). ...
Article
Shared-memory graph processing is usually more efficient than in a cluster in terms of cost effectiveness, ease of programming and runtime. However, the limited memory capacity of a single machine and the huge sizes of graphs restrains its applicability. Hence, it is imperative to reduce memory footprint. We observe that index compression holds promise and propose CIC-PIM, a lightweight encoding with chunked index compression, to reduce the memory footprint and the runtime of graph algorithms. CIC-PIM aims for significant space saving, real random-access support and high cache efficiency by exploiting the ubiquitous power-law and sparseness features of large scale graphs. The basic idea is to divide index structures into chunks of appropriate size and compress the chunks with our lightweight fixed-length byte-aligned encoding. After CIC-PIM compression, two-fold larger graphs are processed with all data fit in memory, resulting in speedups or fast in-memory processing unattainable previously.
... For indexing text and Web pages [33] inverted indexing is used and hash table is closely related to inverted indexing. Compression techniques helps index file to have shorter length [34][35] Domain-specific compression can be categorized by choosing optimal k-mer. ...
Article
Full-text available
Now a days huge amount of biological data is produced due to advancements in high-throughput sequencing technology. Those enormous volumes of sequence require effective storage, fast transmission and provision of quick access for analysis to any record. Standard general purpose lossless compression techniques failed to compress these sequences rather they may increase the size enough. Researcher always try to develop new algorithms for this purpose. Present algorithms indicate that there is enough room to make new algorithms to compress groups of genomes and that will be more time and space effective. In this review paper, we will be analyzing and presenting genomic compression algorithms for both single genomes i.e. non-referential algorithm by exploiting the intra-sequence similarity and sets of related or non-related genomes i.e. referential compression algorithms by exploiting inter-sequence similarity. Also we will discuss the different data format on which those algorithms are applied. The main focus of this review paper is the different data structure for huge sequence representation like compressed suffix tries, suffix tree, suffix array, etc. algorithm such as dynamic programming approach and different indexing technique for searching similar subsequence using pattern recognition methods. We will also discuss about Map-Reduce using HDFS, Yarn and Spark for first searching and streaming concept for reference sequence selection.
... The survey by Zobel and Moffat [113] covers more than 40 years of academic research in Information Retrieval and gives an introduction to the field, with Section 8 dealing with efficient index representations. Moffat and Turpin [68], Moffat [58], Pibiri and Venturini [82] describe several of the techniques illustrated in this article; Williams and Zobel [108], Scholer et al. [92] and Trotman [102] experimentally evaluate many of them. ...
Article
Full-text available
The data structure at the core of large-scale search engines is the inverted index , which is essentially a collection of sorted integer sequences called inverted lists . Because of the many documents indexed by such engines and stringent performance requirements imposed by the heavy load of queries, the inverted index stores billions of integers that must be searched efficiently. In this scenario, index compression is essential because it leads to a better exploitation of the computer memory hierarchy for faster query processing and, at the same time, allows reducing the number of storage machines. The aim of this article is twofold: first, surveying the encoding algorithms suitable for inverted index compression and, second, characterizing the performance of the inverted index through experimentation.
... Therefore, we cannot expect a great deal of compression effectiveness as, for example, the one for Web pages [30]. Nonetheless, we experimented with several compressors, such as: Binary Interpolative Coding (BIC) [20], dictionary-based encoding (DINT) [26], Elias-Fano (EF) [8,9], partitioned Elias-Fano (PEF) [23], Variable-Byte [32] paired with SIMD instructions [31], optimally-partitioned Variable-Byte (OptVB) [29], and Simple16 [35]. A description of all such compression methods can be found in the recent survey on the topic [30]. ...
Preprint
Full-text available
Query Auto-Completion (QAC) is an ubiquitous feature of modern textual search systems, suggesting possible ways of completing the query being typed by the user. Efficiency is crucial to make the system have a real-time responsiveness when operating in the million-scale search space. Prior work has extensively advocated the use of a trie data structure for fast prefix-search operations in compact space. However, searching by prefix has little discovery power in that only completions that are prefixed by the query are returned. This may impact negatively the effectiveness of the QAC system, with a consequent monetary loss for real applications like Web Search Engines and eCommerce. In this work we describe the implementation that empowers a new QAC system at eBay, and discuss its efficiency/effectiveness in relation to other approaches at the state-of-the-art. The solution is based on the combination of an inverted index with succinct data structures, a much less explored direction in the literature. This system is replacing the previous implementation based on Apache SOLR that was not always able to meet the required service-level-agreement.
... Literature on the representation of integers and integer sequences is vast. Many solutions are known, each of them exposing a different space/time trade-off, including: Elias' gamma and delta [9], Golomb [11], Elias-Fano [8,10,40], partitioned Elias-Fano [23], clustered Elias-Fano [26], Interpolative [21,22], PForDelta [12,43,46], Simple [1,2,44], Variable-Byte [7,14,29,30,33,35,36], QMX [37], ANS-based [19,20], DINT [25]. We point the reader to the surveys by Zobel and Moffat [45], by Moffat [18] and by Pibiri and Venturini [28] for a review of many techniques. ...
Preprint
Full-text available
Representing sorted integer sequences in small space is a central problem for large-scale retrieval systems such as Web search engines. Efficient query resolution, e.g., intersection or random access, is achieved by carefully partitioning the sequences. In this work we describe and compare two different partitioning paradigms: partitioning by cardinality and partitioning by universe. Although the ideas behind such paradigms have been known in the coding and algorithmic community since many years, inverted index compression has extensively adopted the former paradigm, whereas the latter has received only little attention. As a result, an experimental comparison between these two is missing for the setting of inverted index compression. We also propose and implement a solution that recursively slices the universe of representation of a sequence to achieve compact storage and attain to fast query execution. Albeit larger than some state-of-the-art representations, this slicing approach substantially improves the performance of list intersections and unions while operating in compressed space, thus offering an excellent space/time trade-off for the problem.
Chapter
Information retrieval is the process of satisfying user information needs that are expressed as textual queries. Search engines represent a Web-specific example of the information retrieval paradigm. The problem of Web search has many additional challenges, such as the collection of Web resources, the organization of these resources, and the use of hyperlinks to aid the search. Whereas traditional information retrieval only uses the content of documents to retrieve results of queries, the Web requires stronger mechanisms for quality control because of its open nature. Furthermore, Web documents contain significant meta-information and zoned text, such as title, author, or anchor text, which can be leveraged to improve retrieval accuracy.
Article
We examine index representation techniques for document-based inverted files, and present a mechanism for compressing them using word-aligned binary codes. The new approach allows extremely fast decoding of inverted lists during query processing, while providing compression rates better than other high-throughput representations. Results are given for several large text collections in support of these claims, both for compression effectiveness and query efficiency.
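The exact word-aligned codes of this paper are not reproduced here, but the underlying idea can be illustrated with a simplified Simple-9-style packer: each 32-bit word spends 4 bits on a selector that says how many values are packed into the remaining 28 bits, so decoding a word needs only a table lookup, a shift, and a mask per value.

    # (count, bits-per-value) pairs; each uses at most 28 data bits per word.
    SELECTORS = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5), (4, 7),
                 (3, 9), (2, 14), (1, 28)]

    def pack(numbers):
        words, i = [], 0
        while i < len(numbers):
            for sel, (count, bits) in enumerate(SELECTORS):
                chunk = numbers[i:i + count]
                if len(chunk) == count and all(x < (1 << bits) for x in chunk):
                    word = sel                        # selector in the low 4 bits
                    for j, x in enumerate(chunk):
                        word |= x << (4 + j * bits)   # values above the selector
                    words.append(word)
                    i += count
                    break
            else:
                raise ValueError("value needs more than 28 bits")
        return words

    def unpack(words, n):
        out = []
        for word in words:
            count, bits = SELECTORS[word & 0xF]
            mask = (1 << bits) - 1
            for j in range(count):
                if len(out) == n:
                    break                             # ignore padding in the last word
                out.append((word >> (4 + j * bits)) & mask)
        return out

Because every code word is machine-word aligned, the decoder never straddles a word boundary, which is what yields the extremely fast decoding the abstract claims relative to bitwise codes.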
Article
Fast and effective ranking of a collection of documents with respect to a query requires several structures, including a vocabulary, inverted file entries, arrays of term weights and document lengths, an array of partial similarity accumulators, and address tables for inverted file entries and documents. Of all of these structures, the array of document lengths and the array of accumulators are the components accessed most frequently in a ranked query, and it is crucial to acceptable performance that they be held in main memory. Here we describe an approximate ranking process that makes use of a compact array of in-memory low-precision approximations for the lengths. Combined with another simple rule for reducing the memory required by the partial similarity accumulators, the approximation heuristic allows the ranking of large document collections using less than one byte of memory per document, an eight-fold reduction compared with the space required by conventional techniques. Moreover, in our experiments retrieval effectiveness was unaffected by the use of these heuristics.
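The paper's exact quantisation rule is not reproduced here, but a geometric-bucket sketch conveys how document lengths can be squeezed into a single byte each; the bucket spacing below is our assumption for illustration.

    import math

    def quantise_lengths(lengths, levels=256):
        # Map each length onto one of `levels` geometrically spaced buckets.
        lo, hi = min(lengths), max(lengths)
        if hi == lo:
            return [0] * len(lengths), lo, 1.0
        ratio = (hi / lo) ** (1.0 / (levels - 1))
        codes = [min(levels - 1, round(math.log(x / lo, ratio))) for x in lengths]
        return codes, lo, ratio

    def approx_length(code, lo, ratio):
        # Recover an approximate length from its one-byte code.
        return lo * (ratio ** code)

Since ranking only divides similarity scores by the length, a small relative error in each length perturbs document scores only slightly, which is consistent with the unchanged retrieval effectiveness reported above.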
Article
In studying actual Web searching by the public at large, we analyzed over one million Web queries by users of the Excite search engine. We found that most people use few search terms, few modified queries, view few Web pages, and rarely use advanced search features. A small number of search terms are used with high frequency, and a great many terms are unique; the language of Web queries is distinctive. Queries about recreation and entertainment rank highest. Findings are compared to data from two other large studies of Web queries. This study provides an insight into the public practices and choices in Web searching.
Conference Paper
Compressed inverted files are the most compact way of indexing large text databases, typically occupying around 10% of the space of the collection they index. The drawback of compression is the need to decompress the index lists during query processing. Here we describe an improved implementation of compressed inverted lists that eliminates almost all redundant decoding and allows extremely fast processing of conjunctive Boolean queries and ranked queries. We also describe a pruning method to reduce the number of candidate documents considered during the evaluation of ranked queries. Experimental results with a database of 510 Mb show that the new mechanism can reduce the CPU and elapsed time for Boolean queries of 4-10 terms to one tenth and one fifth respectively of the standard technique. For ranked queries, the new mechanism reduces both CPU and elapsed time to one third and memory usage to less than one tenth of the standard algorithm, with no degradation in retrieval effectiveness.
Conference Paper
Considerable research effort has been invested in improving the effectiveness of information retrieval systems. Techniques such as relevance feedback, thesaural expansion, and pivoting all provide better quality responses to queries when tested in standard evaluation frameworks. But such enhancements can add to the cost of evaluating queries. In this paper we consider the pragmatic issue of how to improve the cost-effectiveness of searching. We describe a new inverted file structure using quantized weights that provides superior retrieval effectiveness compared to conventional inverted file structures when early termination heuristics are employed. That is, we are able to reach similar effectiveness levels with less computational cost, and so provide a better cost/performance compromise than previous inverted file organisations.
Conference Paper
Ranking techniques are effective for finding answers in document collections, but the cost of evaluation of ranked queries can be unacceptably high. We propose an evaluation technique that reduces both main memory usage and query evaluation time, based on early recognition of which documents are likely to be highly ranked. Our experiments show that, for our test data, the proposed technique evaluates queries in 20% of the time and 2% of the memory taken by the standard inverted file implementation, without degradation in retrieval effectiveness. Processing the inverted indexes for common terms, that is, processing the lists of document identifiers for documents containing these terms, takes much time. Moreover, these common terms give rise to a large number of candidate documents and hence a large number of similarity values to be kept in main memory. We propose a technique for filtering documents that allows a significant reduction in both the volume of main memory required and the time of query evaluation. Our approach is based on processing only the most informative parts of inverted lists. Our experiments in applying this technique to the cosine measure show that it allows evaluation of queries on a large document collection in approximately 20% of the time and 2% of the memory of previous techniques, without deterioration in retrieval effectiveness. Such savings make information retrieval possible for small machines such as PCs, and for large multi-user document systems such as library systems, which can have thousands of simultaneous users.
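The accumulator-limiting idea at the heart of this technique can be sketched in a few lines. This simplified version, an illustration rather than the paper's algorithm, processes each term's postings in decreasing within-document frequency and refuses to create new accumulators once a memory budget is reached, while still updating existing ones.

    def ranked_query(term_lists, max_accumulators=10000):
        # term_lists: one list of (docid, f_dt) pairs per query term.
        acc = {}
        for postings in term_lists:
            # Most informative postings first: highest in-document frequency.
            for docid, fdt in sorted(postings, key=lambda p: -p[1]):
                if docid in acc:
                    acc[docid] += fdt       # always update existing candidates
                elif len(acc) < max_accumulators:
                    acc[docid] = fdt        # create new ones only within budget
        return sorted(acc.items(), key=lambda kv: -kv[1])

A real implementation would use proper cosine weights and a frequency-sorted index layout rather than sorting at query time, but the hard bound on the accumulator set is what caps memory at a small fraction of the standard approach.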
Article
Query-processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Retrieval time for inverted lists can be greatly reduced by the use of compression, but this adds to the CPU time required. Here we show that the CPU component of query response time for conjunctive Boolean queries and for informal ranked queries can be similarly reduced, at little cost in terms of storage, by the inclusion of an internal index in each compressed inverted list. This method has been applied in a retrieval system for a collection of nearly two million short documents. Our experimental results show that the self-indexing strategy adds less than 20% to the size of the compressed inverted file, which itself occupies less than 10% of the indexed text, yet can reduce processing time for Boolean queries of 5-10 terms to under one fifth of the previous cost. Similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness.
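The internal index described here amounts to a set of skip entries over the compressed list. A minimal sketch over an uncompressed list follows; in the actual scheme the skip entries would also record offsets into the compressed bitstream so that decoding can resume mid-list.

    import bisect

    def build_skips(postings, block=128):
        # One (first_docid, start_position) entry per block of the list.
        return [(postings[i], i) for i in range(0, len(postings), block)]

    def next_geq(postings, skips, target, block=128):
        # First posting >= target, decoding at most one block of the list.
        firsts = [s[0] for s in skips]
        k = max(bisect.bisect_right(firsts, target) - 1, 0)
        start = skips[k][1]
        for d in postings[start:start + block]:
            if d >= target:
                return d
        if k + 1 < len(skips):
            return skips[k + 1][0]   # answer is the first posting of the next block
        return None

In a conjunctive query, next_geq lets the evaluator leap through the long list of a common term, touching only the blocks that can contain candidates from the rarer terms; this is how redundant decoding is eliminated.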
Article
Inverted index compression, block addressing, and sequential search on compressed text are three techniques that have been separately developed for efficient, low-overhead text retrieval. Modern text compression techniques can reduce the text to less than 30% of its size and allow searching it directly, and faster than the uncompressed text. Inverted index compression obtains a significant reduction of the index's original size at the same processing speed. Block addressing makes the inverted lists point to text blocks instead of exact positions, paying for the reduction in space with some sequential text scanning. In this work we combine the three ideas in a single scheme. We present a compressed inverted file that indexes compressed text and uses block addressing. We consider different techniques to compress the index and study their performance with respect to the block size. We compare the index against three separate techniques for varying block sizes, showing that our index is superior to each isolated approach. For instance, with just 4% of extra space overhead the index has to scan less than 12% of the text for exact searches, and about 20% when allowing one error in the matches.
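A toy version of block addressing makes the space/scan trade-off visible. In this sketch, which glosses over terms straddling block boundaries and accepts substring matches, the index records only block numbers per term; exact occurrence positions are recovered by scanning the candidate blocks.

    def build_block_index(text, block_chars=4096):
        # Map each term to the set of blocks it occurs in, not to positions.
        blocks = [text[i:i + block_chars] for i in range(0, len(text), block_chars)]
        index = {}
        for b, chunk in enumerate(blocks):
            for term in set(chunk.split()):
                index.setdefault(term, set()).add(b)
        return index, blocks

    def search(index, blocks, term):
        # Candidate blocks come from the index; positions from a local scan.
        hits = []
        for b in sorted(index.get(term, ())):
            pos = blocks[b].find(term)
            while pos != -1:
                hits.append((b, pos))        # block-relative offsets
                pos = blocks[b].find(term, pos + 1)
        return hits

Larger blocks shrink the index, because each term maps to fewer distinct block numbers, but enlarge the fraction of text scanned per query; the figures quoted above (4% space overhead, under 12% of the text scanned) sit at one point on that curve.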
Article
Countable prefix codeword sets are constructed with the universal property that assigning messages in order of decreasing probability to codewords in order of increasing length gives an average codeword length, for any message set with positive entropy, less than a constant times the optimal average codeword length for that source. Some of the sets also have the asymptotically optimal property that the ratio of average codeword length to entropy approaches one uniformly as entropy increases. An application is the construction of a uniformly universal sequence of codes for countable memoryless sources, in which the nth code has a ratio of average codeword length to source rate bounded by a function of n for all sources with positive rate; the bound is less than two for n = 0 and approaches one as n increases.
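Elias's gamma code is the simplest member of these universal codeword sets: a value n >= 1 is written as the length of its binary form in unary, followed by the binary form without its leading 1 bit. A bit-string sketch (Python strings rather than packed bits, for clarity):

    def gamma_encode(n):
        # n >= 1: (len-1) zeros announce the length, then n itself in binary.
        b = bin(n)[2:]
        return "0" * (len(b) - 1) + b

    def gamma_decode(bits):
        # Decode a concatenation of gamma codes.
        out, i = [], 0
        while i < len(bits):
            z = 0
            while bits[i] == "0":        # unary prefix gives the length
                z += 1
                i += 1
            out.append(int(bits[i:i + z + 1], 2))
            i += z + 1
        return out

So gamma_encode(1) is "1" and gamma_encode(9) is "0001001", seven bits; the code length grows as roughly 2*log2(n) + 1 bits, which is why gamma is universal but is typically outperformed on inverted lists by parameterised codes such as Golomb-Rice.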