Perfect Hash Functions for Large Dictionaries
Amjad M. Daoud, Ph.D.
Department of Computer Science
Tabuk University
Tabuk, Saudi Arabia
email: daoudamjad@gmail.com
ABSTRACT
We describe a new practical algorithm for finding perfect hash functions with no specification space at all, suitable for key sets ranging in size from small to very large. The method is able to find perfect hash functions for various sizes of key sets in linear time. The perfect hash functions produced are optimal in terms of time (perfect) and require at most the computation of h1(k) and h2(k), two simple auxiliary pseudorandom functions.
Categories and Subject Descriptors
H.2.m [Information Systems]: Database Management-Miscellaneous
General Terms
Algorithms, Management, Design.
Keywords
MOS, acyclic, indexing, perfect hashing, random graphs.
1. INTRODUCTION
There is a growing demand for fast access to large text collections such as repositories of Web pages. For a repository to be searched fast, it requires an index. To construct an index, the set of unique words and URLs needs to be identified. The dictionary constructed may not fit in main memory and may need to be stored on slow external media. Hashing has long been used when the fastest possible direct access to random locations is desired. More recently, a range of linear algorithms for producing quality order preserving and minimal perfect hash functions for static sets were introduced [11] [8] [13] [14]. The specifications of the minimal perfect hash functions require O(n) words, where n is the number of keys. For example, the WebGraph research group [1] used the Czech et al. algorithms [8] to access 18.5 million URLs. The perfect hash function specification,
comprising a g table, required 88 MB to hash the UK web graph. For a World Wide Web snapshot of 118 million URLs, the g table size is 563 MB, consisting mostly of random numbers that cannot be compressed any further. Clearly, the PHF specification for large sets cannot be stored in the L2 cache, and computing the final hash address would require three expensive random lookups into the g table. Random accesses even to main memory often take as much time as executing hundreds of instructions on modern CPUs.
In this paper, we present a new, faster perfect hashing approach that maps the key set into a bipartite graph using two functions (h1(k), h2(k)), arranges the vertices of the graph in ascending order of incident-edge cardinality, and assigns the keys to their final locations simply according to h1(k) or h2(k). Ideally, the graph would be compact and acyclic. The algorithm produces perfect hash functions that require no specification space at all; however, the functions produced are only perfect, not minimal and not order preserving. Our algorithm improves upon earlier schemes in that it uses the bipartite graph approach to avoid the problem of degenerate edges, so that only two functions are needed. Compact acyclic bipartite graphs are harder to construct but produce perfect hashing schemes that require at most two accesses. Nevertheless, our approach relaxes the acyclicity requirement on the random graphs presented in [8] and can tolerate the presence of the most common cyclic components, and thus the graphs are easier to construct. The algorithm produces perfect hash functions with much higher success rates than the acyclic hypergraph approach [8], mostly on the first trial. Our algorithms have been tested in the MG4J system [1] to accumulate collected URLs and are currently used to store large web maps efficiently [11].
1.1 Motivation
This work was in part motivated by our investigations
that deal with tightly integrating information retrieval with
relational databases [12] and construction of efficient web
maps for search engines [11]. We investigated two popular methods for producing order preserving and minimal perfect hashing algorithms, as described in [13] and [8]. We found that they have large main memory requirements or a high probability of failure due to degenerate edges (one out of 10 trials succeeds, as evident from the MG system and MG4J implementations). Both algorithms require substantial storage for their hash function parameters and are inefficient when memory is scarce.
1.2 Applications
There are numerous applications for our algorithm in digital libraries and information retrieval systems: some of them are novel, such as web maps; others are well known, such as dictionary membership [13] [11]. Efficient perfect hash functions have been successfully used to manage very large web graphs. Storing snapshots of the Web helps in tuning ranking algorithms such as Google's PageRank [3]. The WebGraph research group [1] used order preserving perfect hashing [8] to access URL sets as large as 18.5 million URLs, and the perfect hash function specifications required 88 MB.
2. PREVIOUS METHODS TO FIND PHFS
Hashing has been a topic of study for many years, both
in regard to practical methods and analytical investigations
[17]. A less extensive literature has grown up, mostly during
the last decade, dealing with perfect hash functions; it is that
subarea that we consider in this section.
Given a key k from a static key set S of cardinality n, selected from a universe of keys U with cardinality N, our objective is to find a function hf that maps each k to a distinct entry in the hash table T containing m slots. If the ratio ratioT = m/n = 1, h is said to be minimal.
For a function h to be perfect, it must map each key k in S to a unique integer bounded by m using O(1) operations. Early algorithms were given by Sprugnoli [22], Jaeschke [16], and Fredman, Komlós, and Szemerédi [15]. Chang [5] used four tables based on the first and second letters of the key. Cichelli [6] used the length of the key and tables based on the first and last letters of the key. Note, however, that the length of a key, its first letter, and its last letter are sometimes insufficient to avoid collisions; consider the case of the words ‘woman’ and ‘women’ in Cichelli’s method.
Cercone et al. [4] enhanced the discriminating power of transformations from strings to integers by generating a number of letter-to-number tables, one for each letter position. Clearly, if the original keys are distinct, numbers formed by concatenating fixed-length integers obtained from these conversion tables will be unique. In practice, it often suffices to simply form the sum or product of the sequence of integers, as sketched below.
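
As a rough illustration of this idea (a sketch, not code from the paper; the table sizes, the byte-level alphabet, and the names are our assumptions), one random table per letter position can be built and summed in Python as follows:

import random

def make_letter_tables(max_len=32, alphabet=256, bound=1 << 20, seed=1):
    # One table of random numbers per character position; positions past
    # max_len wrap around (an assumption for keys of arbitrary length).
    rng = random.Random(seed)
    return [[rng.randrange(bound) for _ in range(alphabet)]
            for _ in range(max_len)]

def key_to_int(tables, key):
    # Sum the per-position random numbers of the key's bytes.
    return sum(tables[i % len(tables)][b] for i, b in enumerate(key.encode()))

Concatenating the per-position numbers would guarantee distinct integers for distinct keys, as noted above; the sum trades that guarantee for a much smaller result.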
While in some schemes (e.g., [6]) the resulting integer is actually the desired hash address, in most algorithms the h function must further map the integer value produced into the hash table.
Brain and Tharp [2] presented a new approach to MPHF hashing. Their scheme first maps the keys onto a relatively sparse two-dimensional array and then compacts that array into a one-dimensional array. If the compaction can be made collision free, the indices of the resulting one-dimensional array can be used as hashing addresses. Their algorithm can generate PHFs for up to 5,000 keys.
Schmidt and Siegel [21] presented tight bounds on the spatial complexity of perfect hash functions. They described a variation of an explicit O(1) time, single probe perfect hash function that can be specified in O(n + log log m) bits.
In [14], Fox et al. presented two algorithms to find MPHFs, both based on the notion of Mapping-Ordering-Searching (MOS). In general, the MOS approach calls for mapping the keys of a particular key set into a space of key representatives, ordering the subsets of representatives (of keys), and finally searching the MPHF specification space for each representative subset so that the corresponding keys fit into the hash table. The space used to specify the MPHF is related to all three steps. Ideally, the representative space should help the ordering step produce a proper ordering so that the searching step can easily fit the keys. Since identifying an optimal ordering is NP-complete [20], heuristics are usually sought to obtain suboptimal orderings in practice.

Author     Time          Space          Minimal  Calc
           Efficiency    Efficiency              Efficiency
Sprugnoli  O(2^n)        O(n log n)     no       O(1)
Jaeschke   O(2^n)        O(n log n)     no       O(1)
Cichelli   O(2^n)        O(n log n)     no       O(1)
Brain      O(2^n)        O(n log n)     no       O(1)
Chang      O(n^2 log n)  O(n log n)     no       O(n log n)
Sager      O(n^4)        O(n log n)     yes      O(1)
Fredman    O(n)          O(n log n)     no       O(1)
Cormack    O(n)          O(n log n)     yes      O(1)
Daoud90    O(n^(1+η))    O(n/(η ln 2))  yes      O(1)
Schmidt    O(n)          O(n)           yes      O(1)
Czech      O(n)          O(n log n)     yes      O(1)
current    O(n)          O(1)           no       O(1)

Table 1: Comparison of Different Perfect Hashing Algorithms
In [13], [11], and [10] we presented algorithms that make use of the dependency graph, where the vertices are the ranges of the h1 and h2 functions (generated in the mapping step) and the edges are keys. The ordering step heuristically arranges the vertices in the dependency graph to get a vertex sequence. The subsets of keys induced by the sequence are handled in ascending order of their size, so that subsets of size one (the majority) are assigned in the hash table first during the searching step. These algorithms are capable of producing order preserving minimal PHFs for very large sets while keeping the specification space close to theoretical bounds. The ordering heuristics may fit large subsets into the table later than the smaller ones. This increases the difficulty of fitting keys into the table when few subsets remain. Another drawback of the algorithm is that each vertex contributes log n bits to the MPHF specification. For vertices associated with small key subsets appearing earlier in the sequence, this is a large overhead.
In Table 1, we compare the different perfect hashing algorithms.
In the next section, we extend these algorithms to build dependency graphs that are sparser and easier to work with (i.e., acyclic). We propose a new ordering algorithm that allows as many keys as possible to be assigned in parallel during the search step.
3. THE NEW ALGORITHM
In this section, we introduce a new algorithm to generate perfect hash functions that require no specification space at all, and that has a high success rate.
The new algorithm builds dependency graphs that are sparser and easier to work with (i.e., acyclic). Motivated by the fact that an acyclic dependency graph is a perfect mapping of the key set to an array of size m, the two random functions used to build the dependency graphs, h1(k) and h2(k), are used to generate PHFs that require no specification space at all, with high success rates, and in linear time. The final hash function is simply h1(k) or h2(k). So to find k, we would check the two T table entries T[h1(k)] and T[h2(k)], as in the sketch below.
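
For illustration only (the function names and calling convention are ours, not the paper's), a membership lookup in Python:

def lookup(T, h1, h2, k):
    # A key k can reside only at T[h1(k)] or T[h2(k)],
    # so a lookup needs at most two probes.
    for slot in (h1(k), h2(k)):
        if T[slot] == k:
            return slot
    return None  # k is not in the static key set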
The dependency graph is traversed so that we partition the set of keys into a sequence of levels called a tower. If the vertex ordering is l1, ..., lt, then the level of keys K(li), 1 ≤ i ≤ t, corresponding to a set of vertices vi, 1 ≤ i ≤ 2r−1, is the set of edges incident both to vi and to a vertex earlier in the ordering. The first level K(l1) is the set of edges incident to each vertex vi, 1 ≤ i ≤ 2r−1, in the dependency graph chosen such that the vertex has only one incident edge or has an edge that would break a cycle involving vi. If a component is cyclic and has more than one cycle, the ordering step fails and we try a different mapping.
Clearly, our hashing scheme allows us to have cycles of size two in the dependency graph. In section 5, we show that cyclic components can have cycles of size two only, since the probability of having vertices with more than two incident edges drops to zero as m = 2r → ∞ [8]. Consequently, a vertex cannot be on two cycles of size two. Clearly, our algorithm has a strong mathematical background. In fact, relaxing the acyclicity requirement on random bipartite graphs helps lower m and increases the success rate of our algorithm.
The algorithm consists of three steps: Mapping, Ordering, and Searching. Each step, along with implementation details, is described in a separate subsection below.
3.1 The Mapping Step
The Mapping step takes a set of n keys and produces the two auxiliary hash functions h1 and h2. The h1 and h2 values are used to build a bipartite graph called the dependency graph. Half of the vertices of the dependency graph correspond to the h1 values and are labeled 0, ..., r−1. The other half of the vertices correspond to the h2 values and are labeled r, ..., 2r−1. There is one edge in the dependency graph for each key in the original set of keys. A key k corresponds to an edge labeled k between the vertex labeled h1(k) and the vertex labeled h2(k). Notice that there may be other edges between h1(k) and h2(k), but those edges are labeled with keys other than k. There are two data structures that constitute the dependency graph, one for the edges (keys) and one for the vertices (determined by the h1 and h2 values). Both are implemented as arrays. The vertex array is
vertex: array [0..2r-1] of record
    firstedge: integer;
    degree: integer;
end
firstedge is the header for a singly-linked list of the edges incident to the vertex. degree is the number of edges incident to the vertex. The edge array is
edge: array [1..n] of record
    h1, h2: integer;
    nextedge1: integer;
    nextedge2: integer;
end
h1 and h2 contain the h1 and h2 values for the edge (key). Also, nextedgei, for side i (= 1, 2) of the graph (corresponding to h1 and h2, respectively), points to the next edge in the linked list whose head is given by firstedge in the vertex array.

(1) build random tables for h1 and h2
(2) for each v ∈ [0 .. 2r-1] do
        vertex[v].firstedge = 0
        vertex[v].degree = 0
(3) for each i ∈ [1 .. n] do
        edge[i].h1 = h1(ki)
        edge[i].h2 = h2(ki)
        edge[i].nextedge1 = 0
        add edge[i] to linked list with header vertex[h1(ki)].firstedge
        increment vertex[h1(ki)].degree
        edge[i].nextedge2 = 0
        add edge[i] to linked list with header vertex[h2(ki)].firstedge
        increment vertex[h2(ki)].degree

Figure 1: The Mapping Step
Figure 1 details the Mapping step. Let k1,k2,...,knbe
the set of keys. The h1, and h2functions are selected (1) as
the result of building tables of random numbers. The con-
struction of the dependency graph in (2) and (3) is straight-
forward. Therefore, the expected time for the Mapping step
is O(n).
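
For illustration, here is a compact Python sketch of the Mapping step. It assumes h1 maps keys into 0..r-1 and h2 into r..2r-1, matching the labeling above, and it substitutes Python lists for the linked lists of Figure 1; the names are ours, not the paper's.

def build_dependency_graph(keys, h1, h2, r):
    # adj[v] lists the indices of the edges (keys) incident to vertex v.
    # Vertices 0..r-1 form the h1 side; vertices r..2r-1 form the h2 side.
    adj = [[] for _ in range(2 * r)]
    edges = []
    for i, k in enumerate(keys):
        u, v = h1(k), h2(k)
        edges.append((u, v))
        adj[u].append(i)
        adj[v].append(i)
    return edges, adj

Both data structures are filled in a single pass over the keys, so the step runs in the O(n) expected time stated above.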
3.2 The Ordering Step
The Ordering step explores the dependency graph so as to
partition the set of keys into a sequence of levels. The step
actually produces an ordering of the vertices in levels having
a degree of one when all preceding levels are processed.
Since the vertex degree distribution is decidedly skewed
and the graph is relatively sparse, the first level would con-
tain more than 70% of the keys. All these keys can be as-
signed in parallel in the search step. Next we visit vertices
that are of minimum degree as they are likely to have only
one unprocessed edge left. The algorithm uses a sufficient
number of stacks to identify the next minimum degree pro-
cessed and the unprocessed vertices. These stacks accelerate
choosing the next vertex with the required degree. Figure
2 details the Ordering step. In step (1), STACKS and ver-
tex ordering VS are initialized. In step (2), we choose all
vertices vof degree = 1 and add them to K(l1). In step
(3), all vertices adjacent to vertices in K(l1) are pushed on
STACKS according to their degree. In (4), the rest of the
vertices in the current component are processed and added
to the vertex ordering VS. STACKS are used to identify
those vertices that have not been selected and to return an
unselected vertex of minimum degree.
Clearly the Ordering step can be finished in O(n) time
because we traverse the vertices in the dependency graph
only once.
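
For illustration, the following simplified Python sketch captures the peeling idea for the acyclic case: vertices with exactly one unprocessed edge are collected level by level. It omits the per-degree STACKS and the breaking of single cycles described above, so it is a sketch of the idea rather than the full algorithm; the names are ours.

def order_levels(edges, adj):
    deg = [len(a) for a in adj]
    assigned = [False] * len(edges)
    levels = []
    frontier = [v for v in range(len(adj)) if deg[v] == 1]
    while frontier:
        level, nxt = [], []
        for v in frontier:
            pending = [e for e in adj[v] if not assigned[e]]
            if len(pending) != 1:
                continue  # the edge was already taken by its other endpoint
            e = pending[0]
            assigned[e] = True
            level.append((e, v))  # key e will occupy the slot of vertex v
            u, w = edges[e]
            other = w if v == u else u
            deg[other] -= 1
            if deg[other] == 1:
                nxt.append(other)  # other now has one unprocessed edge
        if level:
            levels.append(level)
        frontier = nxt
    return levels, assigned  # edges left unassigned signal cyclic leftovers

On a component containing a cycle, some edges remain unassigned; the full algorithm breaks one short cycle per component, which this sketch deliberately leaves out.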
3.3 The Searching Step
The Searching step takes the levels produced in the Ordering step and tries to assign hash values to edges according to the ordering.
(1) initialize(STACKS)
    initialize ordering sequence VS
(2) select all vertices of degree = 1,
    mark them as SELECTED, and add them to K(l1)
(3) for each w adjacent to vertices in K(l1) do
        push(w, STACKS[deg(w)])
    i = 2
(4) while some v is not SELECTED do
        while STACKS are not empty do
            v = popmin(STACKS)
            if v has one edge left to process then
                mark v SELECTED
                add v to level i in VS list
                for w adjacent to v do
                    if w is not SELECTED and
                       w is not in STACKS[deg(w)] then
                        push(w, STACKS[deg(w)])
            i = i + 1
        endwhile
    endwhile

Figure 2: The Ordering Step
(1) for i = 1 to t (number of levels in VS) do
        if vj.degree in K(li) = 1 then
            for each vj ∈ K(li) do
                T[j] = kvj
                remove kvj
        else
            fail

Figure 3: The Searching Step
Assigning a hash value to a vertex of degree one, i.e., with one remaining key k in K(li), amounts to assigning the key k to T[h1(k)] if the vertex is on the h1 side and to T[h2(k)] if the vertex is on the h2 side. Recall that the hash table T has the same size as the dependency graph and that m = 2r.
Figure 3 gives the algorithm for the Searching step. Clearly, the Searching step requires only O(n) time to finish.
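
For illustration, a matching Python sketch of the Searching step (names ours): each key is stored at the table slot of the vertex through which it was peeled, so a slot below r corresponds to h1 and a slot of r or above corresponds to h2.

def assign_slots(levels, keys, m):
    # m = 2r; the hash table has one slot per vertex of the dependency graph.
    T = [None] * m
    for level in levels:
        for e, v in level:
            if T[v] is not None:
                # cannot happen on an acyclic graph; otherwise retry mapping
                raise RuntimeError("collision: pick new h1 and h2")
            T[v] = keys[e]
    return T

Because each peeled vertex receives exactly one key, every key ends up at T[h1(k)] or T[h2(k)] and can be retrieved with at most two probes.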
4. EXAMPLE
We show an example of finding a PHF for the 20-key set listed in Table 2. The ratio used is 1.2, and thus r = 12 and the size of the hash table is m = 24. The h1 and h2 values (see Table 2) are used to build a bipartite graph. Half of the vertices of the dependency graph correspond to the h1 values and are labeled 0, ..., r−1 = 11. The other half of the vertices correspond to the h2 values and are labeled r, ..., 2r−1 = 23. There is one edge in the dependency graph for each key in the original set of keys.
We notice that we have four acyclic components (trees), and Table 3 verifies that we indeed have a PHF for the example key set.
The tower levels dictate the order of assignments, as follows.
Key               h1  h2
x-rays             7  22
Euclidean          1  14
ethyl ether        6  19
Clouet             4  23
Bulwer             4  17
dentifrice         7  20
Lagomorpha        11  17
Chungking          8  15
quibbles           7  18
Han Cities         5  16
treacherous        1  23
calc-              6  23
deposited          3  23
rotundus           9  17
antennae           1  12
sodium lamp        7  13
oculomotor nerve   2  21
tussle             0  20
imprecise          9  20
meridiem          10  21

Table 2: The Key Set Used in the Example, n = 20, m = 24, ratio = 1.2
The vertices with degree = 1 are {v0, v2, v3, v5, v8, v10, v11} on the h1 side and {v12, v13, v14, v15, v16, v18, v19, v22} on the h2 side. So K(l1) = {v0, v2, v3, v5, v8, v10, v11, v12, v13, v14, v15, v16, v18, v19, v22} has the following set of keys: “tussle”, “oculomotor nerve”, “deposited”, “Han Cities”, “Chungking”, “meridiem”, “Lagomorpha”, “antennae”, “sodium lamp”, “Euclidean”, “quibbles”, “ethyl ether”, and “x-rays”, which can be assigned to their final addresses according to h1 or h2 without collision. Here, we favor placing keys according to their h1(k) value, since we would compute h1(k) first when searching for k. In our example, l1 can be visualized as the edges incident on the nodes of the first level. Notice that we have an acyclic bipartite graph, which is equivalent to four random trees. In fact, the algorithm assigns the edges incident on the leaf nodes first (each leaf has only one incident edge), then moves to the next level, until all edges (keys) are assigned to unique vertices.
When all assigned keys are removed, K(l2) = {v1, v6, v7} has the set of keys “treacherous”, “calc-”, and “dentifrice”; K(l3) = {v20, v23} has the two keys “imprecise” and “Clouet”; and finally K(l4) = {v4} has the two keys “Bulwer” and “rotundus”.
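
Tying the sketches of Section 3 together on this example (the h1 and h2 values are taken from Table 2; r = 12, m = 24), and assuming the sketch functions defined earlier are in scope:

H = {"x-rays": (7, 22), "Euclidean": (1, 14), "ethyl ether": (6, 19),
     "Clouet": (4, 23), "Bulwer": (4, 17), "dentifrice": (7, 20),
     "Lagomorpha": (11, 17), "Chungking": (8, 15), "quibbles": (7, 18),
     "Han Cities": (5, 16), "treacherous": (1, 23), "calc-": (6, 23),
     "deposited": (3, 23), "rotundus": (9, 17), "antennae": (1, 12),
     "sodium lamp": (7, 13), "oculomotor nerve": (2, 21),
     "tussle": (0, 20), "imprecise": (9, 20), "meridiem": (10, 21)}
keys = list(H)
edges, adj = build_dependency_graph(
    keys, lambda k: H[k][0], lambda k: H[k][1], r=12)
levels, assigned = order_levels(edges, adj)
T = assign_slots(levels, keys, m=24)
assert all(assigned)  # all four components are trees, so every key peels

The exact slot chosen for a key may differ from Table 3, since the sketch does not implement the preference for h1-side placement described above.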
5. ANALYSIS
We bound the expected total length of cycles in the graph. Let C2k denote the number of cycles of length 2k. To build a cycle of length 2k, we select 2k vertices and connect them with 2k edges in any order. There are $\binom{r}{k}^2$ ways to choose the 2k vertices (k from each side) out of the 2r vertices of the graph, $(k!)^2/2k$ ways of connecting them into a cycle, and $(2k)!$ possible orderings of the edges. The cycle can be embedded into the structure of the graph in $\binom{n}{2k} r^{2(n-2k)}$ ways. Hence, the number of graphs containing a cycle of length 2k is

    $\binom{r}{k}^2 \frac{(k!)^2}{2k} (2k)! \binom{n}{2k} r^{2(n-2k)}$.
Key               h1  h2  h_final  T[h_final]
x-rays             7  22  h2       T[22]
Euclidean          1  14  h2       T[14]
ethyl ether        6  19  h2       T[19]
Clouet             4  23  h2       T[23]
Bulwer             4  17  h1       T[04]
dentifrice         7  20  h1       T[07]
Lagomorpha        11  17  h1       T[11]
Chungking          8  15  h1       T[08]
quibbles           7  18  h2       T[18]
Han Cities         5  16  h1       T[05]
treacherous        1  23  h1       T[01]
calc-              6  23  h1       T[06]
deposited          3  23  h1       T[03]
rotundus           9  17  h1       T[09]
antennae           1  12  h2       T[12]
sodium lamp        7  13  h2       T[13]
oculomotor nerve   2  21  h1       T[02]
tussle             0  20  h1       T[00]
imprecise          9  20  h2       T[20]
meridiem          10  21  h1       T[10]

Table 3: Keys and Their Final Hash Addresses, n = 20, m = 24, ratio = 1.2

Also, the expected total length of cycles in the graph is

    $\sum_{k=1}^{n/2} 2k\,E(C_{2k}) = \sum_{k=1}^{n/2} 2k \binom{r}{k}^2 \frac{(k!)^2}{2k} (2k)! \binom{n}{2k} \frac{r^{2(n-2k)}}{r^{2n}}$,

and the expected number of cycles c in the graph is

    $c = \sum_{k=1}^{n/2} E(C_{2k}) = \sum_{k=1}^{n/2} \binom{r}{k}^2 \frac{(k!)^2}{2k} (2k)! \binom{n}{2k} \frac{r^{2(n-2k)}}{r^{2n}}$.

According to [8], the number of cycles is bounded by

    $\sum_{k=1}^{n/2} E(C_{2k}) < \ln 2k$.
Next, we count the number of tree components in G, excluding zero-degree vertex components. The number of different trees with i vertices on one side and j vertices on the other of a bipartite graph G' is

    $R_{ij} = j^{i-1} \cdot i^{j-1}$.

The expected number t of trees with 1 to n distinct edges in a bipartite graph G with r vertices on each side satisfies

    $E(T_R) \le \sum_i \sum_j \binom{r}{i} \binom{r}{j} \cdot R_{ij} \cdot \binom{n}{i+j-1} (i+j-1)! \, \frac{r^{2(n-i-j+1)}}{r^{2n}}$,

where i and j should satisfy the constraints n−i ≥ 1 and n−j ≥ 1 when i+j−1 < n, or n−i ≥ 0 and n−j ≥ 0 when i+j−1 = n.
Moreover, the maximum number of components that have one cycle is bounded by 2r − t, and the probability of having a cyclic component with more than one cycle approaches 0 as m = 2r → ∞. This is a direct result of Lemma 4.3 in [8]. Consequently, almost surely a vertex cannot participate in two different cycles of size two or higher.
ratioT = 1.2
Keys    Map    Order  Search  Total
1M      0.55   0.05   0.04    0.64
2M      1.05   0.09   0.06    1.20
4M      2.12   0.20   0.11    2.43
8M      5.85   0.37   0.19    6.21
19.5M   10.40  0.72   0.40    11.52

ratioT = 1.4
Keys    Map    Order  Search  Total
1M      0.56   0.04   0.03    0.63
2M      1.20   0.08   0.05    1.33
4M      2.32   0.17   0.11    2.60
8M      5.05   0.33   0.23    5.61
19.5M   11.90  0.60   0.45    11.95

Note: Machine = Sony Vaio P4-3200MHz; Time (CPU) is in seconds.

Table 4: Running Time Summary of the New Algorithm
ratioT  Mapping  Ordering  Searching  Total
1.2     10.40    0.72      0.40       11.52
1.3     10.95    0.66      0.47       12.08
1.4     11.90    0.60      0.45       11.95
1.5     12.31    0.55      0.44       13.25

Note: Machine = Sony Vaio P4-3200MHz; Time (CPU) is in seconds.

Table 5: Running Time Summary of the New Algorithm on a UK Web Graph: 19.5M Key Set of URLs
6. EXPERIMENTAL RESULTS
In this section, we present the running results of the new algorithm on a Sony VAIO workstation equipped with a 3.2 GHz Pentium 4, a 3Ware 8-port 8506-8 SATA RAID controller, and four Western Digital 250 GB SATA drives (8 MB cache) in a RAID0 configuration. The system runs Linux with the 2.6 kernel. The xfs filesystem is used due to its prefetching of sequential files. The system has 4 GB of dual DDRAM, with 1 GB dedicated to the filesystem cache. Table 4 describes the performance of the algorithm on key sets of different sizes for the ratios ratioT = 1.2 and ratioT = 1.4.
Table 5 shows the times for a large 19.5M key set, consisting of 18.5M URLs taken from [1] plus 1M keywords, as ratioT is varied from 1.2 to 1.5. It can be seen that the mapping time is the dominant factor. The mapping step can actually skip all the keys that map to isolated vertices, as those are assigned to the first level during the ordering step. The first-level nodes need not be ordered, so we can skip them during the mapping step; this improves the performance of the mapping step significantly over other schemes. A similar optimization can be applied to algorithms that process the acyclic hypergraphs described in [8].
7. CONCLUSION
This paper describes a new practical algorithm for finding perfect hash functions with no specification space at all, suitable for key sets ranging in size from small to very large. The method is able to find PHFs for various sizes of key sets in linear time. The hash functions are optimal in terms of time (perfect) and require at most the computation of h1(k) and h2(k). Moreover, to help access-methods researchers understand the algorithm, we have dedicated a WWW page to our algorithms and provided Java applets to visualize how they work [23].
8. REFERENCES
[1] Boldi P., and Vigna S. The WebGraph framework I:
Compression techniques, Proc. of the Thirteenth
International World Wide Web Conference,
Manhattan, USA (2004).
[2] Brain, M.D., and Tharp, A.L. Perfect hashing using
sparse matrix packing. Information Systems 15
(1990), 281-290.
[3] Brin, S., and Page, L.
http://www-db.stanford.edu/~backrub/google.html,
(2002).
[4] Cercone, N., Krause, M., and Boates, J. Minimal and
almost minimal perfect hash function search with
application to natural language lexicon design,
Computers and Mathematics with Applications 9
(1983), 215-231.
[5] Chang, C.C. The study of an ordered minimal perfect
hashing scheme, Communications of the ACM 27
(1984), 384-387.
[6] Cichelli, R.J. Minimal perfect hash functions made
simple, Communications of the ACM 23 (1980),
17-19.
[7] Cormack, G.V., Horspool, R.N.S., and Kaiserswerth,
M. Practical perfect hashing, The Computer Journal
28 (1985), 54-58.
[8] Czech, Z.J., Havas G. and Majewski, B.S. Perfect
hashing, Theoretical Computer Science, Vol. 182
(1997) 1-143.
[9] Daoud, A.M. Efficient Data Structures for Information
Retrieval, PH.D. dissertation, Department of
Computer Science, Virginia Polytechnic Institute &
State University (1993).
[10] Daoud, A.M. Efficient Data Structures for Search
Engines, Technical Report TR-2005-3, DiTech (2005).
[11] Daoud, A.M. Augmented Order Preserving Perfect
Hashing for Information Retrieval, to appear (2004).
[12] DeFazio, S., Daoud, A.M., Smith, L., Srinivasan, J.,
Croft, W.B., and Callan, J. Integrating IR and
RDBMS using cooperative indexing, SIGIR (1995)
84-92.
[13] Fox, E.A., Chen, Q., Daoud, A.M. and Heath, L.
Order preserving minimal perfect hash functions and
information retrieval, SIGIR (1990) 279-311.
[14] Fox, E.A., Heath, L., Chen, Q., and Daoud, A.M.
Practical minimal perfect hash functions for large
databases, Communications of the ACM (1992).
[15] Fredman, M.L., Komlós, J., and Szemerédi, E. Storing
a sparse table with O(1) worst case access time,
Journal of the ACM 31 (1984), 538-544.
[16] Jaeschke, G. Reciprocal hashing—a method for
generating minimal perfect hash functions.
Communications of the ACM 24 (1981), 829-833.
[17] Knuth, D.E. The Art of Computer Programming,
Volume 3, Sorting and Searching, Addison-Wesley
Publishing Company, Reading, MA, 1973.
[18] Raghavan S., and Garcia-Molina, H. Representing
Web Graphs, Stanford University (2002).
[19] Zobel, J., Heinz, S., Williams, H.E. In-memory Hash
Tables for Accumulating Text Vocabularies,
Information Processing Letters, 80(6), (2001) 271-277.
[20] Sager, T.J. A polynomial time generator for minimal
perfect hash functions, Communications of the ACM
28 (1985), 523-532.
[21] Schmidt, J.P., and Siegel, A. On aspects of
universality and performance for closed hashing.
Proceedings of the 21st ACM Symposium on Theory of
Computing, 1989, 355-366.
[22] Sprugnoli, R. Perfect hashing functions: a single
probe retrieving method for static sets,
Communications of the ACM 20 (1978), 841-850.
[23] http://www.omlet.org/phf/phf2005.html