Perfect Hash Functions for Large Dictionaries
Amjad M. Daoud, Ph.D.
Department of Computer Science
Tabuk University
Tabuk, Saudi Arabia
email: daoudamjad@gmail.com
ABSTRACT
We describe a new practical algorithm for finding perfect hash functions with no specification space at all, suitable for key sets ranging in size from small to very large. The method is able to find perfect hash functions for various sizes of key sets in linear time. The perfect hash functions produced are optimal in terms of time (perfect) and require at most the computation of h1(k) and h2(k), two simple auxiliary pseudorandom functions.
Categories and Subject Descriptors
H.2.m [Information Systems]: Database Management-Miscellaneous
General Terms
Algorithms, Management, Design.
Keywords
MOS, acyclic, indexing, perfect hashing, random graphs.
1. INTRODUCTION
There is a growing demand for fast access to large text collections such as repositories of Web pages. For a repository to be searched fast, it requires an index. To construct an index, the set of unique words and URLs needs to be identified. The dictionary constructed may not fit in main memory and may need to be stored on slow external media. Hashing has long been used when the fastest possible direct access to random locations is desired. More recently, a range of linear algorithms for producing quality order preserving and minimal perfect hash functions for static sets were introduced [11] [8] [13] [14]. The specifications of the minimal perfect hash functions require O(n) words, where n is the number of keys. For example, the WebGraph research group [1] used the Czech et al. algorithms [8] to access 18.5 million URLs. The perfect hash function specification,
comprising a g table, required 88 MB to hash the UK web graph. For a World Wide Web snapshot of 118 million URLs, the g table size is 563 MB, consisting mostly of random numbers that cannot be compressed any further. Clearly, the PHF specification for large sets cannot be stored in the L2 cache, and computing the final hash address would require three expensive random lookups into the g table. Random accesses even to main memory often take as much time as executing hundreds of instructions on modern CPUs.
In this paper, we present a new, faster perfect hashing approach that maps the key set into a bipartite graph using two functions (h1(k), h2(k)), arranges the vertices of the graph in ascending order of incident-edge cardinality, and assigns the keys to their final locations simply according to h1(k) or h2(k). Ideally, the graph would be compact and acyclic. The algorithm produces perfect hash functions that require no specification space at all; however, the functions produced are only perfect, not minimal and not order preserving. Our algorithm improves upon earlier schemes in that it uses the bipartite graph approach to avoid the problem of degenerate edges, so that only two functions are needed. Compact acyclic bipartite graphs are harder to construct but produce perfect hashing schemes that require at most two accesses. Nevertheless, our approach relaxes the acyclicity requirement on the random graphs presented in [8] and can tolerate the presence of the most common cyclic components, and thus the graphs are easier to construct. The algorithm produces perfect hash functions with much higher success rates than the acyclic hypergraph approach [8], mostly on the first trial. Our algorithms have been tested in the MG4J system [1] to accumulate collected URLs and are currently used to store large web maps efficiently [11].
1.1 Motivation
This work was in part motivated by our investigations
that deal with tightly integrating information retrieval with
relational databases [12] and construction of efficient web
maps for search engines [11]. We investigated two popular methods for producing order preserving and minimal perfect hashing algorithms, as described in [13] and [8]. We found that they have large main memory requirements or a high probability of failure due to degenerate edges (one out of 10 trials succeeds, as evident from the MG system and MG4J implementations). Both algorithms require substantial storage for their hash function parameters and are inefficient when memory is scarce.
1.2 Applications
There are numerous applications for our algorithm in digital libraries and information retrieval systems: some of them are novel, such as web maps; others are well known, such as dictionary membership [13] [11]. Efficient perfect hash functions have been successfully used to manage very large web graphs. Storing snapshots of the Web helps in tuning ranking algorithms such as Google's PageRank [3]. The WebGraph research group [1] used order preserving perfect hashing [8] to access URL sets as large as 18.5 million URLs, and the perfect hash function specifications required 88 MB.
2. PREVIOUS METHODS TO FIND PHFS
Hashing has been a topic of study for many years, both
in regard to practical methods and analytical investigations
[17]. A less extensive literature has grown up, mostly during
the last decade, dealing with perfect hash functions; it is that
subarea that we consider in this section.
Given a key k from a static key set S of cardinality n, selected from a universe of keys U with cardinality N, our objective is to find a function hf that maps each k to a distinct entry in the hash table T containing m slots. If the ratio ratioT = m/n = 1, h is said to be minimal.
For a function h to be perfect, it must map each key k in S to a unique integer bounded by m using O(1) operations. Early algorithms were given by Sprugnoli [22], Jaeschke [16], and Fredman, Komlós, and Szemerédi [15]. Chang [5] used four tables based on the first and second letters of the key. Cichelli [6] used the length of the key and tables based on the first and last letters of the key. Note, however, that the length of a key, its first letter, and its last letter are sometimes insufficient to avoid collisions; consider the case of the words ‘woman’ and ‘women’ in Cichelli’s method.
Cercone et al. [4] enhanced the discriminating power of transformations from strings to integers by generating a number of letter-to-number tables, one for each letter position. Clearly, if the original keys are distinct, numbers formed by concatenating fixed-length integers obtained from these conversion tables will be unique. In practice, it often suffices to simply form the sum or product of the sequence of integers, as sketched below.
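
As a rough illustration of this idea (a sketch, not code from the paper; the table sizes, the byte-level alphabet, and the names are our assumptions), one random table per letter position can be built and summed in Python as follows:

import random

def make_letter_tables(max_len=32, alphabet=256, bound=1 << 20, seed=1):
    # One table of random numbers per character position; positions past
    # max_len wrap around (an assumption for keys of arbitrary length).
    rng = random.Random(seed)
    return [[rng.randrange(bound) for _ in range(alphabet)]
            for _ in range(max_len)]

def key_to_int(tables, key):
    # Sum the per-position random numbers of the key's bytes.
    return sum(tables[i % len(tables)][b] for i, b in enumerate(key.encode()))

Concatenating the per-position numbers would guarantee distinct integers for distinct keys, as noted above; the sum trades that guarantee for a much smaller result.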
While in some schemes (e.g., [6]) the resulting integer is actually the desired hash address, in most algorithms the h function must further map the integer value produced into the hash table.
Brain and Tharp [2] presented a new approach to MPHF hashing. Their scheme first maps the keys onto a relatively sparse two-dimensional array and then compacts that array into a one-dimensional array. If the compaction can be made collision free, the indices of the resulting one-dimensional array can be used as hashing addresses. Their algorithm can generate PHFs for up to 5,000 keys.
Schmidt and Siegel [21] presented tight bounds on the spatial complexity of perfect hash functions. They described a variation of an explicit O(1) time, single probe perfect hash function that can be specified in O(n + log log m) bits.
In [14], Fox et al. presented two algorithms to find MPHFs, both based on the notion of Mapping-Ordering-Searching (MOS). In general, the MOS approach calls for mapping the keys of a particular key set into a space of key representatives, ordering the subsets of representatives (of keys), and finally searching the MPHF specification space for each representative subset so that the corresponding keys fit into the hash table. The space used to specify the MPHF is related to all three steps. Ideally, the representative space should help the ordering step produce a proper ordering so that the searching step can easily fit the keys. Since identifying an optimal ordering is NP-complete [20], heuristics are usually sought to obtain suboptimal orderings in practice.

Author     Time          Space          Minimal  Calc
           Efficiency    Efficiency              Efficiency
Sprugnoli  O(2^n)        O(n log n)     no       O(1)
Jaeschke   O(2^n)        O(n log n)     no       O(1)
Cichelli   O(2^n)        O(n log n)     no       O(1)
Brain      O(2^n)        O(n log n)     no       O(1)
Chang      O(n^2 log n)  O(n log n)     no       O(n log n)
Sager      O(n^4)        O(n log n)     yes      O(1)
Fredman    O(n)          O(n log n)     no       O(1)
Cormack    O(n)          O(n log n)     yes      O(1)
Daoud90    O(n^(1+η))    O(n/(η ln 2))  yes      O(1)
Schmidt    O(n)          O(n)           yes      O(1)
Czech      O(n)          O(n log n)     yes      O(1)
current    O(n)          O(1)           no       O(1)

Table 1: Comparison of Different Perfect Hashing Algorithms
In [13], [11], and [10] we presented algorithms that make use of the dependency graph, where the vertices are the ranges of the h1 and h2 functions (generated in the mapping step) and the edges are keys. The ordering step heuristically arranges the vertices in the dependency graph to get a vertex sequence. The subsets of keys induced by the sequence are handled in ascending order of their size, so that subsets of size one (the majority) are assigned in the hash table first during the searching step. These algorithms are capable of producing order preserving minimal PHFs for very large sets while keeping the specification space close to theoretical bounds. The ordering heuristics may fit large subsets into the table later than the smaller ones. This increases the difficulty of fitting keys into the table when few subsets remain. Another drawback of the algorithm is that each vertex contributes log n bits to the MPHF specification. For vertices associated with small key subsets appearing earlier in the sequence, this is a large overhead.
In Table 1, we compare the different perfect hashing algorithms.
In the next section, we extend these algorithms to build dependency graphs that are sparser and easier to work with (i.e., acyclic). We propose a new ordering algorithm that allows as many keys as possible to be assigned in parallel during the search step.
3. THE NEW ALGORITHM
In this section, we introduce a new algorithm to generate perfect hash functions that require no specification space at all, and that has a high success rate.
The new algorithm builds dependency graphs that are sparser and easier to work with (i.e., acyclic). Motivated by the fact that an acyclic dependency graph is a perfect mapping of the key set to an array of size m, the two random functions used to build the dependency graphs, h1(k) and h2(k), are used to generate PHFs that require no specification space at all, with high success rates, and in linear time. The final hash function is simply h1(k) or h2(k). So to find k, we would check the two T table entries T[h1(k)] and T[h2(k)], as in the sketch below.
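
For illustration only (the function names and calling convention are ours, not the paper's), a membership lookup in Python:

def lookup(T, h1, h2, k):
    # A key k can reside only at T[h1(k)] or T[h2(k)],
    # so a lookup needs at most two probes.
    for slot in (h1(k), h2(k)):
        if T[slot] == k:
            return slot
    return None  # k is not in the static key set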
The dependency graph is traversed so that we partition the set of keys into a sequence of levels called a tower. If the vertex ordering is l1, ..., lt, then the level of keys K(li), 1 ≤ i ≤ t, corresponding to a set of vertices vi, 1 ≤ i ≤ 2r−1, is the set of edges incident both to vi and to a vertex earlier in the ordering. The first level K(l1) is the set of edges incident to each vertex vi, 1 ≤ i ≤ 2r−1, in the dependency graph chosen such that the vertex has only one incident edge or has an edge that would break a cycle involving vi. If a component is cyclic and has more than one cycle, the ordering step fails and we try a different mapping.
Clearly, our hashing scheme allows us to have cycles of size two in the dependency graph. In section 5, we show that cyclic components can have cycles of size two only, since the probability of having vertices with more than two incident edges drops to zero as m = 2r → ∞ [8]. Consequently, a vertex cannot be on two cycles of size two. Clearly, our algorithm has a strong mathematical background. In fact, relaxing the acyclicity requirement on random bipartite graphs helps lower m and increases the success rate of our algorithm.
The algorithm consists of three steps: Mapping, Ordering, and Searching. Each step, along with implementation details, is described in a separate subsection below.
3.1 The Mapping Step
The Mapping step takes a set of n keys and produces the two auxiliary hash functions h1 and h2. The h1 and h2 values are used to build a bipartite graph called the dependency graph. Half of the vertices of the dependency graph correspond to the h1 values and are labeled 0, ..., r−1. The other half of the vertices correspond to the h2 values and are labeled r, ..., 2r−1. There is one edge in the dependency graph for each key in the original set of keys. A key k corresponds to an edge labeled k between the vertex labeled h1(k) and the vertex labeled h2(k). Notice that there may be other edges between h1(k) and h2(k), but those edges are labeled with keys other than k. There are two data structures that constitute the dependency graph, one for the edges (keys) and one for the vertices (determined by the h1 and h2 values). Both are implemented as arrays. The vertex array is
vertex: array [0..2r-1] of record
    firstedge: integer;
    degree: integer;
end
firstedge is the header for a singly-linked list of the edges incident to the vertex. degree is the number of edges incident to the vertex. The edge array is
edge: array [1..n] of record
    h1, h2: integer;
    nextedge1: integer;
    nextedge2: integer;
end
h1 and h2 contain the h1 and h2 values for the edge (key). Also, nextedgei, for side i (= 1, 2) of the graph (corresponding to h1 and h2, respectively), points to the next edge in the linked list whose head is given by firstedge in the vertex array.

(1) build random tables for h1 and h2
(2) for each v ∈ [0 .. 2r-1] do
        vertex[v].firstedge = 0
        vertex[v].degree = 0
(3) for each i ∈ [1 .. n] do
        edge[i].h1 = h1(ki)
        edge[i].h2 = h2(ki)
        edge[i].nextedge1 = 0
        add edge[i] to linked list with header vertex[h1(ki)].firstedge
        increment vertex[h1(ki)].degree
        edge[i].nextedge2 = 0
        add edge[i] to linked list with header vertex[h2(ki)].firstedge
        increment vertex[h2(ki)].degree

Figure 1: The Mapping Step
Figure 1 details the Mapping step. Let k1,k2,...,knbe
the set of keys. The h1, and h2functions are selected (1) as
the result of building tables of random numbers. The con-
struction of the dependency graph in (2) and (3) is straight-
forward. Therefore, the expected time for the Mapping step
is O(n).
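
For illustration, here is a compact Python sketch of the Mapping step. It assumes h1 maps keys into 0..r-1 and h2 into r..2r-1, matching the labeling above, and it substitutes Python lists for the linked lists of Figure 1; the names are ours, not the paper's.

def build_dependency_graph(keys, h1, h2, r):
    # adj[v] lists the indices of the edges (keys) incident to vertex v.
    # Vertices 0..r-1 form the h1 side; vertices r..2r-1 form the h2 side.
    adj = [[] for _ in range(2 * r)]
    edges = []
    for i, k in enumerate(keys):
        u, v = h1(k), h2(k)
        edges.append((u, v))
        adj[u].append(i)
        adj[v].append(i)
    return edges, adj

Both data structures are filled in a single pass over the keys, so the step runs in the O(n) expected time stated above.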
3.2 The Ordering Step
The Ordering step explores the dependency graph so as to
partition the set of keys into a sequence of levels. The step
actually produces an ordering of the vertices in levels having
a degree of one when all preceding levels are processed.
Since the vertex degree distribution is decidedly skewed
and the graph is relatively sparse, the first level would con-
tain more than 70% of the keys. All these keys can be as-
signed in parallel in the search step. Next we visit vertices
that are of minimum degree as they are likely to have only
one unprocessed edge left. The algorithm uses a sufficient
number of stacks to identify the next minimum degree pro-
cessed and the unprocessed vertices. These stacks accelerate
choosing the next vertex with the required degree. Figure
2 details the Ordering step. In step (1), STACKS and ver-
tex ordering VS are initialized. In step (2), we choose all
vertices vof degree = 1 and add them to K(l1). In step
(3), all vertices adjacent to vertices in K(l1) are pushed on
STACKS according to their degree. In (4), the rest of the
vertices in the current component are processed and added
to the vertex ordering VS. STACKS are used to identify
those vertices that have not been selected and to return an
unselected vertex of minimum degree.
Clearly the Ordering step can be finished in O(n) time
because we traverse the vertices in the dependency graph
only once.
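
For illustration, the following simplified Python sketch captures the peeling idea for the acyclic case: vertices with exactly one unprocessed edge are collected level by level. It omits the per-degree STACKS and the breaking of single cycles described above, so it is a sketch of the idea rather than the full algorithm; the names are ours.

def order_levels(edges, adj):
    deg = [len(a) for a in adj]
    assigned = [False] * len(edges)
    levels = []
    frontier = [v for v in range(len(adj)) if deg[v] == 1]
    while frontier:
        level, nxt = [], []
        for v in frontier:
            pending = [e for e in adj[v] if not assigned[e]]
            if len(pending) != 1:
                continue  # the edge was already taken by its other endpoint
            e = pending[0]
            assigned[e] = True
            level.append((e, v))  # key e will occupy the slot of vertex v
            u, w = edges[e]
            other = w if v == u else u
            deg[other] -= 1
            if deg[other] == 1:
                nxt.append(other)  # other now has one unprocessed edge
        if level:
            levels.append(level)
        frontier = nxt
    return levels, assigned  # edges left unassigned signal cyclic leftovers

On a component containing a cycle, some edges remain unassigned; the full algorithm breaks one short cycle per component, which this sketch deliberately leaves out.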
3.3 The Searching Step
The Searching step takes the levels produced in the Ordering step and tries to assign hash values to edges according to the ordering.
(1) initialize(STACKS)
    initialize ordering sequence VS
(2) select all vertices of degree = 1,
    mark them as SELECTED, and add them to K(l1)
(3) for each w adjacent to vertices in K(l1) do
        push(w, STACKS[deg(w)])
    i = 2
(4) while some v is not SELECTED do
        while STACKS are not empty do
            v = popmin(STACKS)
            if v has one edge left to process then
                mark v SELECTED
                add v to level i in VS list
                for w adjacent to v do
                    if w is not SELECTED and
                       w is not in STACKS[deg(w)] then
                        push(w, STACKS[deg(w)])
            i = i + 1
        endwhile
    endwhile

Figure 2: The Ordering Step
(1) for i = 1 to t (number of levels in VS) do
        if vj.degree in K(li) = 1 then
            for each vj ∈ K(li) do
                T[j] = kvj
                remove kvj
        else
            fail

Figure 3: The Searching Step
Assigning a hash value to a vertex of degree one, i.e., with one remaining key k in K(li), amounts to assigning the key k to T[h1(k)] if the vertex is on the h1 side and to T[h2(k)] if the vertex is on the h2 side. Recall that the hash table T has the same size as the dependency graph and that m = 2r.
Figure 3 gives the algorithm for the Searching step. Clearly, the Searching step requires only O(n) time to finish.
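
For illustration, a matching Python sketch of the Searching step (names ours): each key is stored at the table slot of the vertex through which it was peeled, so a slot below r corresponds to h1 and a slot of r or above corresponds to h2.

def assign_slots(levels, keys, m):
    # m = 2r; the hash table has one slot per vertex of the dependency graph.
    T = [None] * m
    for level in levels:
        for e, v in level:
            if T[v] is not None:
                # cannot happen on an acyclic graph; otherwise retry mapping
                raise RuntimeError("collision: pick new h1 and h2")
            T[v] = keys[e]
    return T

Because each peeled vertex receives exactly one key, every key ends up at T[h1(k)] or T[h2(k)] and can be retrieved with at most two probes.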
4. EXAMPLE
We show an example of finding a PHF for the 20-key set listed in Table 2. The ratio used is 1.2, and thus r = 12 and the size of the hash table is m = 24. The h1 and h2 values (see Table 2) are used to build a bipartite graph. Half of the vertices of the dependency graph correspond to the h1 values and are labeled 0, ..., r−1 = 11. The other half of the vertices correspond to the h2 values and are labeled r, ..., 2r−1 = 23. There is one edge in the dependency graph for each key in the original set of keys.
We notice that we have four acyclic components (trees), and Table 3 verifies that we indeed have a PHF for the example key set.
The tower levels dictate the order of assignments, as follows.
Key               h1  h2
x-rays             7  22
Euclidean          1  14
ethyl ether        6  19
Clouet             4  23
Bulwer             4  17
dentifrice         7  20
Lagomorpha        11  17
Chungking          8  15
quibbles           7  18
Han Cities         5  16
treacherous        1  23
calc-              6  23
deposited          3  23
rotundus           9  17
antennae           1  12
sodium lamp        7  13
oculomotor nerve   2  21
tussle             0  20
imprecise          9  20
meridiem          10  21

Table 2: The Key Set Used in the Example, n = 20, m = 24, ratio = 1.2
The vertices with degree = 1 are {v0, v2, v3, v5, v8, v10, v11} on the h1 side and {v12, v13, v14, v15, v16, v18, v19, v22} on the h2 side. So K(l1) = {v0, v2, v3, v5, v8, v10, v11, v12, v13, v14, v15, v16, v18, v19, v22} has the following set of keys: “tussle”, “oculomotor nerve”, “deposited”, “Han Cities”, “Chungking”, “meridiem”, “Lagomorpha”, “antennae”, “sodium lamp”, “Euclidean”, “quibbles”, “ethyl ether”, and “x-rays”, which can be assigned to their final addresses according to h1 or h2 without collision. Here, we favor placing keys according to their h1(k) value, since we would compute h1(k) first when searching for k. In our example, l1 can be visualized as the edges incident on the nodes of the first level. Notice that we have an acyclic bipartite graph, which is equivalent to four random trees. In fact, the algorithm assigns the edges incident on the leaf nodes first (each leaf has only one incident edge), then moves to the next level, until all edges (keys) are assigned to unique vertices.
When all assigned keys are removed, K(l2) = {v1, v6, v7} has the set of keys “treacherous”, “calc-”, and “dentifrice”; K(l3) = {v20, v23} has the two keys “imprecise” and “Clouet”; and finally K(l4) = {v4} has the two keys “Bulwer” and “rotundus”.
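
Tying the sketches of Section 3 together on this example (the h1 and h2 values are taken from Table 2; r = 12, m = 24), and assuming the sketch functions defined earlier are in scope:

H = {"x-rays": (7, 22), "Euclidean": (1, 14), "ethyl ether": (6, 19),
     "Clouet": (4, 23), "Bulwer": (4, 17), "dentifrice": (7, 20),
     "Lagomorpha": (11, 17), "Chungking": (8, 15), "quibbles": (7, 18),
     "Han Cities": (5, 16), "treacherous": (1, 23), "calc-": (6, 23),
     "deposited": (3, 23), "rotundus": (9, 17), "antennae": (1, 12),
     "sodium lamp": (7, 13), "oculomotor nerve": (2, 21),
     "tussle": (0, 20), "imprecise": (9, 20), "meridiem": (10, 21)}
keys = list(H)
edges, adj = build_dependency_graph(
    keys, lambda k: H[k][0], lambda k: H[k][1], r=12)
levels, assigned = order_levels(edges, adj)
T = assign_slots(levels, keys, m=24)
assert all(assigned)  # all four components are trees, so every key peels

The exact slot chosen for a key may differ from Table 3, since the sketch does not implement the preference for h1-side placement described above.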
5. ANALYSIS
We bound the expected total length of cycles in the graph. Let C2k denote the number of cycles of length 2k. To build a cycle of length 2k, we select 2k vertices and connect them with 2k edges in any order. There are $\binom{r}{k}^2$ ways to choose the 2k vertices (k from each side) out of the 2r vertices of the graph, $(k!)^2/2k$ ways of connecting them into a cycle, and $(2k)!$ possible orderings of the edges. The cycle can be embedded into the structure of the graph in $\binom{n}{2k} r^{2(n-2k)}$ ways. Hence, the number of graphs containing a cycle of length 2k is

    $\binom{r}{k}^2 \frac{(k!)^2}{2k} (2k)! \binom{n}{2k} r^{2(n-2k)}$.
Key               h1  h2  h_final  T[h_final]
x-rays             7  22  h2       T[22]
Euclidean          1  14  h2       T[14]
ethyl ether        6  19  h2       T[19]
Clouet             4  23  h2       T[23]
Bulwer             4  17  h1       T[04]
dentifrice         7  20  h1       T[07]
Lagomorpha        11  17  h1       T[11]
Chungking          8  15  h1       T[08]
quibbles           7  18  h2       T[18]
Han Cities         5  16  h1       T[05]
treacherous        1  23  h1       T[01]
calc-              6  23  h1       T[06]
deposited          3  23  h1       T[03]
rotundus           9  17  h1       T[09]
antennae           1  12  h2       T[12]
sodium lamp        7  13  h2       T[13]
oculomotor nerve   2  21  h1       T[02]
tussle             0  20  h1       T[00]
imprecise          9  20  h2       T[20]
meridiem          10  21  h1       T[10]

Table 3: Keys and Their Final Hash Addresses, n = 20, m = 24, ratio = 1.2

Also, the expected total length of cycles in the graph is

    $\sum_{k=1}^{n/2} 2k\,E(C_{2k}) = \sum_{k=1}^{n/2} 2k \binom{r}{k}^2 \frac{(k!)^2}{2k} (2k)! \binom{n}{2k} \frac{r^{2(n-2k)}}{r^{2n}}$,

and the expected number of cycles c in the graph is

    $c = \sum_{k=1}^{n/2} E(C_{2k}) = \sum_{k=1}^{n/2} \binom{r}{k}^2 \frac{(k!)^2}{2k} (2k)! \binom{n}{2k} \frac{r^{2(n-2k)}}{r^{2n}}$.

According to [8], the number of cycles is bounded by

    $\sum_{k=1}^{n/2} E(C_{2k}) < \ln 2k$.
Next, we count the number of tree components in G, excluding zero-degree vertex components. The number of different trees with i vertices on one side and j vertices on the other of a bipartite graph G' is

    $R_{ij} = j^{i-1} \cdot i^{j-1}$.

The expected number t of trees with 1 to n distinct edges in a bipartite graph G with r vertices on each side satisfies

    $E(T_R) \le \sum_i \sum_j \binom{r}{i} \binom{r}{j} \cdot R_{ij} \cdot \binom{n}{i+j-1} (i+j-1)! \, \frac{r^{2(n-i-j+1)}}{r^{2n}}$,

where i and j should satisfy the constraints n−i ≥ 1 and n−j ≥ 1 when i+j−1 < n, or n−i ≥ 0 and n−j ≥ 0 when i+j−1 = n.
Moreover, the maximum number of components that have one cycle is bounded by 2r − t, and the probability of having a cyclic component with more than one cycle approaches 0 as m = 2r → ∞. This is a direct result of Lemma 4.3 in [8]. Consequently, almost surely a vertex cannot participate in two different cycles of size two or higher.
ratioT = 1.2
Keys    Map    Order  Search  Total
1M      0.55   0.05   0.04    0.64
2M      1.05   0.09   0.06    1.20
4M      2.12   0.20   0.11    2.43
8M      5.85   0.37   0.19    6.21
19.5M   10.40  0.72   0.40    11.52

ratioT = 1.4
Keys    Map    Order  Search  Total
1M      0.56   0.04   0.03    0.63
2M      1.20   0.08   0.05    1.33
4M      2.32   0.17   0.11    2.60
8M      5.05   0.33   0.23    5.61
19.5M   11.90  0.60   0.45    11.95

Note: Machine = Sony Vaio P4-3200MHz; Time (CPU) is in seconds.

Table 4: Running Time Summary of the New Algorithm
ratioT  Mapping  Ordering  Searching  Total
1.2     10.40    0.72      0.40       11.52
1.3     10.95    0.66      0.47       12.08
1.4     11.90    0.60      0.45       11.95
1.5     12.31    0.55      0.44       13.25

Note: Machine = Sony Vaio P4-3200MHz; Time (CPU) is in seconds.

Table 5: Running Time Summary of the New Algorithm on a UK Web Graph: 19.5M Key Set of URLs
6. EXPERIMENTAL RESULTS
In this section, we present the running results of the new algorithm on a Sony VAIO workstation equipped with a 3.2 GHz Pentium 4, a 3Ware 8-port 8506-8 SATA RAID controller, and four Western Digital 250 GB SATA drives (8 MB cache) in a RAID0 configuration. The system runs Linux with the 2.6 kernel. The xfs filesystem is used due to its prefetching of sequential files. The system has 4 GB of dual DDRAM, with 1 GB dedicated to the filesystem cache. Table 4 describes the performance of the algorithm on key sets of different sizes for the ratios ratioT = 1.2 and ratioT = 1.4.
Table 5 shows the times for a large 19.5M key set, consisting of 18.5M URLs taken from [1] plus 1M keywords, as ratioT is varied from 1.2 to 1.5. It can be seen that the mapping time is the dominant factor. The mapping step can actually skip all the keys that map to isolated vertices, as those are assigned to the first level during the ordering step. The first-level nodes need not be ordered, so we can skip them during the mapping step; this improves the performance of the mapping step significantly over other schemes. A similar optimization can be applied to algorithms that process the acyclic hypergraphs described in [8].
7. CONCLUSION
This paper describes a new practical algorithm for finding perfect hash functions with no specification space at all, suitable for key sets ranging in size from small to very large. The method is able to find PHFs for various sizes of key sets in linear time. The hash functions are optimal in terms of time (perfect) and require at most the computation of h1(k) and h2(k). Moreover, to help access-methods researchers understand the algorithm, we have dedicated a WWW page to our algorithms and provided Java applets to visualize how they work [23].
8. REFERENCES
[1] Boldi P., and Vigna S. The WebGraph framework I:
Compression techniques, Proc. of the Thirteenth
International World Wide Web Conference,
Manhattan, USA (2004).
[2] Brain, M.D., and Tharp, A.L. Perfect hashing using
sparse matrix packing. Information Systems 15
(1990), 281-290.
[3] Brin, S., and Page, L.
http://www-db.stanford.edu/~backrub/google.html,
(2002).
[4] Cercone, N., Krause, M., and Boates, J. Minimal and
almost minimal perfect hash function search with
application to natural language lexicon design,
Computers and Mathematics with Applications 9
(1983), 215-231.
[5] Chang, C.C. The study of an ordered minimal perfect
hashing scheme, Communications of the ACM 27
(1984), 384-387.
[6] Cichelli, R.J. Minimal perfect hash functions made
simple, Communications of the ACM 23 (1980),
17-19.
[7] Cormack, G.V., Horspool, R.N.S., and Kaiserswerth,
M. Practical perfect hashing, The Computer Journal
28 (1985), 54-58.
[8] Czech, Z.J., Havas G. and Majewski, B.S. Perfect
hashing, Theoretical Computer Science, Vol. 182
(1997) 1-143.
[9] Daoud, A.M. Efficient Data Structures for Information
Retrieval, PH.D. dissertation, Department of
Computer Science, Virginia Polytechnic Institute &
State University (1993).
[10] Daoud, A.M. Efficient Data Structures for Search
Engines, Technical Report TR-2005-3, DiTech (2005).
[11] Daoud, A.M. Augmented Order Preserving Perfect
Hashing for Information Retrieval, to appear (2004).
[12] DeFazio, S., Daoud, A.M., Smith, L., Srinivasan, J.,
Croft, W.B., and Callan, J. Integrating IR and
RDBMS using cooperative indexing, SIGIR (1995)
84-92.
[13] Fox, E.A., Chen, Q., Daoud, A.M. and Heath, L.
Order preserving minimal perfect hash functions and
information retrieval, SIGIR (1990) 279-311.
[14] Fox, E.A., Heath, L., Chen, Q., and Daoud, A.M.
Practical minimal perfect hash functions for large
databases, Communications of the ACM (1992).
[15] Fredman, M.L., Komlós, J., and Szemerédi, E. Storing
a sparse table with O(1) worst case access time,
Journal of the ACM 31 (1984), 538-544.
[16] Jaeschke, G. Reciprocal hashing—a method for
generating minimal perfect hash functions.
Communications of the ACM 24 (1981), 829-833.
[17] Knuth, D.E. The Art of Computer Programming,
Volume 3, Sorting and Searching, Addison-Wesley
Publishing Company, Reading, MA, 1973.
[18] Raghavan S., and Garcia-Molina, H. Representing
Web Graphs, Stanford University (2002).
[19] Zobel, J., Heinz, S., Williams, H.E. In-memory Hash
Tables for Accumulating Text Vocabularies,
Information Processing Letters, 80(6), (2001) 271-277.
[20] Sager, T.J. A polynomial time generator for minimal
perfect hash functions, Communications of the ACM
28 (1985), 523-532.
[21] Schmidt, J.P., and Siegel, A. On aspects of
universality and performance for closed hashing.
Proceedings of the 21st ACM Symposium on Theory of
Computing, 1989, 355-366.
[22] Sprugnoli, R. Perfect hashing functions: a single
probe retrieving method for static sets,
Communications of the ACM 20 (1978), 841-850.
[23] http://www.omlet.org/phf/phf2005.html