
Rate-distorted cryptographic perfect hash functions
A theoretical analysis on space efficiency, time complexity, and entropy

Alexander Towell
lex@metafunctor.com
Abstract

We analyze a theoretical perfect hash function that has three desirable properties: (1) it is a cryptographic hash function; (2) its in-place encoding obtains the theoretical lower-bound for the expected space complexity; and (3) its in-place encoding is a random bit string with maximum entropy.

Keywords: perfect hash function, random oracle, space complexity, maximum entropy, cryptographic hash function

1 Prior Art

Perfect hash functions have been extensively studied in the literature. Early foundational work by Dietzfelbinger et al. [4] established theoretical bounds for space-efficient hash tables with worst-case constant access time. Czech, Havas, and Majewski [3] introduced a family of perfect hashing methods that became widely influential.

More recent practical algorithms have focused on minimal perfect hash functions (MPHFs), which achieve a load factor of exactly 1. Botelho, Pagh, and Ziviani [2] presented simple and space-efficient constructions, while Belazzougui, Botelho, and Dietzfelbinger [1] developed the Compress, Hash, and Displace (CHD) algorithm, which achieves excellent space-time tradeoffs for large datasets.

Our work differs by focusing on the theoretical properties of cryptographic perfect hash functions with maximum entropy encodings, rather than minimal space or construction speed.

2 Perfect hash functions

A set is an unordered collection of distinct elements. If we know the elements in a set, we may denote the set by these elements, e.g., {a,c,b} denotes a set whose members are exactly a, b, and c.

A finite set has a finite number of elements. For example, {1,3,5} is a finite set with three elements. When sets 𝔸 and 𝔹 are isomorphic, denoted by 𝔸 ≅ 𝔹, they can be put into a one-to-one correspondence (bijection), e.g., {b,a,c} ≅ {1,2,3}. Since there exists at least one bijection between isomorphic sets, we can losslessly convert one to the other, and thus isomorphic sets are in some sense equivalent.

The cardinality of a finite set 𝔸 is the number of elements in the set, denoted by |𝔸|, e.g., |{1,3,5}| = 3. A countably infinite set is isomorphic to the set of natural numbers ℕ = {1,2,3,4,5,…}.

Given two elements a and b, an ordered pair of a then b is denoted by (a,b), where (a,b) = (c,d) if and only if a = c and b = d. Ordered pairs are non-commutative and non-associative, i.e., (a,b) ≠ (b,a) whenever a ≠ b, and (a,(b,c)) ≠ ((a,b),c).

Related to the ordered pair is the Cartesian product.

Definition 2.1.

The set 𝕏 × 𝕐 = {(x,y) : x ∈ 𝕏 ∧ y ∈ 𝕐} is the Cartesian product of sets 𝕏 and 𝕐.

By the non-commutative and non-associative property of ordered pairs, the Cartesian product is non-commutative and non-associative. However, the products are isomorphic, e.g., 𝕏 × 𝕐 ≅ 𝕐 × 𝕏.

A tuple is a generalization of ordered pairs which can consist of an arbitrary number of elements, e.g., (x₁, x₂, …, xₙ).

Definition 2.2 (n-fold Cartesian product).

The n-ary Cartesian product of sets 𝕏[1], …, 𝕏[n] is given by 𝕏[1] × ⋯ × 𝕏[n] = {(x₁, …, xₙ) : x₁ ∈ 𝕏[1] ∧ ⋯ ∧ xₙ ∈ 𝕏[n]}.

Note that 𝕏[1] × 𝕏[2] × 𝕏[3] ≅ 𝕏[1] × (𝕏[2] × 𝕏[3]) ≅ (𝕏[1] × 𝕏[2]) × 𝕏[3], thus we may implicitly convert between them without ambiguity.

If each set in the n-ary Cartesian product is the same, the power notation may be used, e.g., 𝕏³ ≅ 𝕏 × 𝕏 × 𝕏. As special cases, the 0-ary (nullary) Cartesian product is defined to be the singleton set containing the empty tuple, and the 1-ary (unary) Cartesian product is the identity, e.g., 𝕏¹ = 𝕏.

The hash function is given by the following definition.

Definition 2.3.

Hash functions of type 𝕏 → 𝕐 are simply total functions, normally with a finite codomain, used as a device to assign to each element of 𝕏 a value from 𝕐 without requiring any particular rule.

For a given bit string x and hash function hash, y = hash(x) is called the hash of x.

We are particularly interested in perfect hash functions as given by the following definition.

Definition 2.4 (Perfect hash function).

A perfect hash function over the set 𝔸 ⊆ 𝕏, denoted by

hash_𝔸 : 𝕏 → 𝕐, (1)

is a function that is injective when restricted to 𝔸, i.e., there are no collisions among elements of 𝔸: hash_𝔸(x) ≠ hash_𝔸(y) for all x, y ∈ 𝔸 with x ≠ y.

Assumption 1.

Perfect hash functions are surjective.

The load factor is given by the following definition.

Definition 2.5.

The proportion of hashes mapped to by hash_𝔸 : 𝕏 → 𝕐 over the subset 𝔸 is called the load factor. Specifically, the load factor of hash_𝔸 is a rational number given by

|hash_𝔸(𝔸)| / |image(hash_𝔸)|. (2)
Notation.

A perfect hash function of type 𝕏 → 𝕐 over 𝔸 with a load factor r may be denoted by hash_𝔸^r. If m = |𝔸| and we are interested in drawing attention to the cardinality of the perfect hash function, we may also denote it by hash_m^r.

Example 1   Consider the set 𝕏 = {x₁, x₂, x₃}. A perfect hash function with domain 𝕌 over 𝕏 ⊂ 𝕌 with a load factor r = 3/5 is denoted by hash_𝕏^0.6 or hash_3^0.6. Given the load factor r and |𝕏|, we may deduce the codomain of hash_𝕏^r to be precisely {0,1,2,3,4}. See Figure 1 for an illustration.

Definition 2.6.

A minimal perfect hash function has a load factor of 1.

Figure 1: A perfect hash function hash_𝕏^0.6 : 𝕌 → {0,1,2,3,4}.

Every hash function in 𝕏 → 𝕐 is a perfect hash function over some subset of 𝕏, e.g., every hash function is trivially a perfect hash function of the empty set ∅ and of singleton sets.

Assuming 𝕏 and 𝕐 are finite, the set of hash functions of type 𝕏 → 𝕐, which may also be denoted by 𝕐^𝕏, has a cardinality

|𝕐|^|𝕏|. (3)

The set of perfect hash functions over 𝔸 ⊆ 𝕏 is the subset of 𝕏 → 𝕐 satisfying the predicate that no collisions occur on any pair of distinct elements of 𝔸. The set of perfect hash functions over 𝔸 has a cardinality

permutations(|𝕐|, |𝔸|) ⋅ |𝕐|^(|𝕏| − |𝔸|). (4)
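As an illustrative count (the numbers here are chosen for concreteness and are not taken from the text): if |𝕏| = 4, |𝕐| = 5, and 𝔸 ⊂ 𝕏 has |𝔸| = 3, then there are |𝕐|^|𝕏| = 5⁴ = 625 hash functions in total, of which permutations(5,3) ⋅ 5^(4−3) = 60 ⋅ 5 = 300 are perfect over 𝔸, a fraction of 0.48.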

3 A theoretical cryptographic perfect hash function

The bit set {0,1} contains the two bits 0 and 1. The set of all bit strings of length n is therefore {0,1}^n, which has cardinality 2^n. As a special case, {0,1}^0 = {ϵ}, where ϵ denotes the empty bit string. The set of all bit strings of length n or less is denoted by {0,1}^{≤n} and has cardinality 2^(n+1) − 1, e.g., {0,1}^{≤2} = {ϵ} ∪ {0,1} ∪ {0,1}². The countably infinite set of all bit strings, lim_{n→∞} {0,1}^{≤n}, is denoted by {0,1}^*. A tuple (x₁, x₂, …) ∈ {0,1}^* is called a bit string, and we typically drop the brackets and commas when writing strings, e.g., (x₁, x₂) ≡ x₁x₂.

An important operation that is closed over the free semigroup of {0,1} is concatenation, # : {0,1}^* × {0,1}^* → {0,1}^*, which is defined by a₁⋯aₙ # b₁⋯bₘ = a₁⋯aₙb₁⋯bₘ with the special cases x # ϵ = ϵ # x = x.

Most hash functions are of the type {0,1}^* → {0,1}^n where n is some finite natural number.

Another useful function is the binary padding function pad : {0,1}^* × ℕ₀ → {0,1}^*, defined as pad(x, k) = x # 0^(k − BL(x)) with the special case pad(x, 0) = ϵ.

The binary truncation function trunc : {0,1}^* × ℕ₀ → {0,1}^* is defined as trunc(a₁⋯aₖ⋯aₙ, k) = a₁⋯aₖ, with the special case trunc(x, 0) = ϵ.

The bit length function BL : {0,1}^* → ℕ₀ is defined as BL(a₁⋯aₙ) = n, e.g., if b ∈ {0,1}^n then BL(b) = n.

Since {0,1}^* ≅ ℕ, they may be put into a one-to-one correspondence. A convenient one-to-one correspondence between them is given by the following definition.

Definition 3.1.

The sets {0,1}^* and ℕ have a bijection given by

b₁⋯bₘ ↦ 2^m + Σ_{j=1}^{m} 2^(m−j) bⱼ. (5)

We denote the mapping described by Definition 3.1 and its inverse by the postfix prime operator, writing x′ ∈ ℕ for the natural number corresponding to a bit string x, and n′ ∈ {0,1}^* for the bit string corresponding to a natural number n; these are inverse functions, i.e., x′′ = x. An important observation of this mapping is that a natural number n maps to a bit string n′ of length BL(n′) = ⌊log₂ n⌋.
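As an aside, a minimal Python sketch of this bijection (the function names are illustrative, not from the paper): a natural number n ≥ 1 corresponds to the binary expansion of n with its leading 1 removed, so ϵ ↦ 1, 0 ↦ 2, 1 ↦ 3, 00 ↦ 4, and so on.

def nat_to_bits(n: int) -> str:
    # Inverse mapping of Definition 3.1 (n must be >= 1): drop the leading 1 bit.
    assert n >= 1
    return bin(n)[3:]  # bin(1) == '0b1' -> '' (the empty string epsilon)

def bits_to_nat(b: str) -> int:
    # Mapping of Definition 3.1: b1...bm -> 2^m + sum_j 2^(m-j) b_j.
    return int('1' + b, 2)  # prepend the implicit leading 1

assert all(nat_to_bits(bits_to_nat(b)) == b for b in ['', '0', '1', '00', '01'])
assert bits_to_nat('') == 1 and bits_to_nat('01') == 5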

More generally, if we have some set 𝕌, in a physical computer there must be a way to map the elements of 𝕌 to bit strings that represent the values. We provide this mapping with encoder and decoder functions, denoted respectively by e[𝕌] : 𝕌 → {0,1}^* and d[𝕌] : {0,1}^* → 𝕌, such that d[𝕌] ∘ e[𝕌] = id_𝕌 and e[𝕌] ∘ d[𝕌] = id_{{0,1}^*}.

A special case of the encoder and decoder is given by letting any bit string represent either the sequence of bits a₁⋯aₙ or the non-negative integer it encodes in base 2, e.g., 0100 ↦ 4. Thus, any operation on non-negative integers may be applied to bit strings without ambiguity, e.g., 010 + 01 = 11.

Given a function g : 𝕏 → 𝕐, the domain of g may be denoted by dom(g) = 𝕏 and the codomain may be denoted by codom(g) = 𝕐.

A random oracle is given by the following definition.

Definition 3.2.

A random oracle in the family {0,1}^* → {0,1}^n is a theoretical hash function whose output is uniformly distributed over its codomain.

The theoretical analysis of the perfect hash function makes the following assumption.

Assumption 2.

The hash function hash : {0,1}^* → {0,1}^n is a random oracle.

Definition 3.3.

The data type for the cryptographic perfect hash function under consideration is defined as PH = {0,1}^* × ℕ with a value constructor ph : 𝒫(𝕌) × (0,1] → PH defined as

ph(𝕏, r) = (n′, N) (6)

where

m = |𝕏|, N = ⌈m/r⌉, k = ⌈log₂ N⌉, β(x, j) = trunc(hash(x # j′), k) mod N, 𝕐ⱼ = {β(x, j) ∈ {0,…,N−1} | x ∈ 𝕏}, and n = min{ j ∈ ℕ : |𝕐ⱼ| = m }. (7)

Since PH is a data type that purports to model perfect hash functions, its computational basis is given by the following set of functions.

By the assumption of surjectivity (and when the perfect hash function is not rate-distorted), the load factor is given by |𝕏|/N, where N − 1 = max_{x ∈ 𝕏} hash_𝔸(x), i.e., N is one more than the maximum hash (under the natural ordering of the integers) of hash_𝔸.

Definition 3.4.

The minimum and maximum hash of a value of type PH (that models a perfect hash function) are given respectively by min_hash : PH → ℕ₀ and max_hash : PH → ℕ₀, where

min_hash(b, N) = 0 and max_hash(b, N) = N − 1. (8)

The most important function, the perfect hash mapping, perfect_hash : PH × 𝕌 → ℕ₀, is defined as

perfect_hash((b, N), x) = q mod N (9)

where

k = ⌈log₂ N⌉, h = hash(x # b), q = trunc(h, k). (10)
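For concreteness, a minimal Python sketch of this evaluation map (illustrative only: SHA-256 stands in for the random oracle hash, byte-level concatenation stands in for #, and the names are mine rather than the paper's):

import hashlib
from math import ceil, log2

def oracle_bits(x: bytes) -> str:
    # Approximate random oracle: SHA-256 output rendered as a 256-character bit string.
    return format(int.from_bytes(hashlib.sha256(x).digest(), 'big'), '0256b')

def perfect_hash(ph: tuple, x: bytes) -> int:
    # Evaluate a PH value (b, N) at key x, following Equations (9)-(10).
    b, N = ph                        # b is a bit string, N is the codomain size
    k = max(1, ceil(log2(N)))        # number of leading bits to keep
    h = oracle_bits(x + b.encode())  # hash(x # b)
    q = int(h[:k], 2)                # trunc(h, k), read as an integer
    return q % N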
Theorem 3.1.

A value of type PH constructed with ph(𝔸, r), taking 𝕌 = {0,1}^*, models hash_𝔸^r : {0,1}^* → {0, 1, …, k−1}, where 𝔸 ⊂ {0,1}^* and k = ⌈|𝔸|/r⌉.

Proof.

By construction (Definition 3.3), the bit string n′ returned by ph(𝔸, r) is chosen so that |𝕐ₙ| = m = |𝔸|, i.e., the map x ↦ trunc(hash(x # n′), ⌈log₂ k⌉) mod k is injective on 𝔸 with codomain {0,…,k−1}, which is precisely the map computed by perfect_hash((n′, k), ·). Hence the constructed value models hash_𝔸^r. ∎

Suppose we have a function f : 𝕌 → {0,1}^* such that f|_𝔸 is injective, e.g., an encoder for values of 𝔸 ⊆ 𝕌. The composition hash_𝔸 = hash_𝔹 ∘ f, where 𝔹 = {f(a) ∈ {0,1}^* | a ∈ 𝔸}, is a perfect hash function on 𝕌 over 𝔸 ⊆ 𝕌. Thus, the rest of the material in this paper does not usually assume any particular domain of the perfect hash function, since such injections may always be constructed for any set.

3.1 Analysis

Note that since n′ denotes the code for n (Definition 3.1), and we choose the smallest n that succeeds, where each candidate n is an independent trial that succeeds with probability p (so the successful index is geometrically distributed), the expected space complexity obtains the information-theoretic lower bound of 1.44 bits per element in the minimal case (r = 1).

We search over all hash functions derived from hash, an approximate random oracle, and choose the perfect hash function on a specified set that has the smallest bit-length encoding. We describe the algorithm for performing this exhaustive search in Algorithm 1.

We consider a family of perfect hash functions for a set 𝕏 which are given by the output of a hash function hash, approximating a random oracle, applied to the input x ∈ 𝕏 concatenated with a bit string b. We describe the generative algorithm for the perfect hash function in Algorithm 1. Note that in Algorithm 1, the concatenation of two bit strings x and y is denoted by x # y.

The perfect hash function generated by Algorithm 1 has the statistical property that the output is uniformly distributed as given by the following theorem.

Theorem 3.2.

The perfect hash function hash_𝔸^r : 𝕏 → {0, 1, …, k−1}, where k = ⌈|𝔸|/r⌉, is a random oracle over 𝕏 ∖ 𝔸, and the restriction hash_𝔸^r|_𝔸 : 𝔸 → {0, 1, …, k−1} is a random k-permutation oracle of 𝔸.

Proof.

First, in Algorithm 1 on Line 5, n′ is a bit string such that each x ∈ 𝕏 concatenated with n′ hashes to a unique integer in {0,…,N−1} under the hash function hash; thus the hash function found perfectly hashes the elements of 𝕏.

Second, by Definition 3.2, hash approximates a random oracle whose output is uniformly distributed over the elements of {0,1}^n. Viewing the elements of {0,1}^n as integers, where |{0,1}^n| = 2^n, hash is uniformly distributed over {0, 1, …, 2^n − 1}.

If 2^n = cN for some integer c, then the remainder of the output of hash upon division by N (the modulo operation), plus 1, is uniformly distributed over {1,…,N}, since each j ∈ {1,…,N} has an equal number of hashes assigned to it by hash. If 2^n ≫ N and 2^n ≠ cN for every integer c, then it is approximately uniformly distributed and converges to the uniform distribution as n → ∞. ∎
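The size of this residual bias is easy to check numerically. Below is a small illustrative Python sketch (not from the paper) that computes, exactly, how far the distribution of h mod N deviates from uniform when h is uniform over {0,…,2^n − 1}:

from collections import Counter

def mod_bias(n_bits: int, N: int) -> float:
    # Exact probability spread of (h mod N) when h is uniform over {0, ..., 2**n_bits - 1}.
    total = 2 ** n_bits
    counts = Counter(h % N for h in range(total))
    probs = [counts[j] / total for j in range(N)]
    return max(probs) - min(probs)

# The spread shrinks like 2**(-n_bits) when N does not divide 2**n_bits:
# mod_bias(8, 5) == 1/256 and mod_bias(16, 5) == 1/65536.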

The probability that no collisions occur for a particular bit string in Algorithm 1 is given by the following theorem.

Theorem 3.3.

The probability that a bit string b ∈ {0,1}^* results in a perfect hash function is given by

p(m, r) = P_m^N / N^m (11)

where m is the cardinality of the set being perfectly hashed, r is the load factor, N = m/r, and P_m^N = N!/(N − m)! denotes the number of m-permutations of N elements.

Proof.

Suppose we have a set 𝔸 ⊂ {0,1}^* of cardinality m. The set of perfect hash functions hash_𝔸^r : {0,1}^* → {0, 1, …, N−1} restricted to 𝔸, where N = m/r, has a cardinality given by P_m^N, since any choice of m out of the N elements in the codomain, together with any assignment of the m elements of 𝔸 onto the chosen m elements of {0,…,N−1}, satisfies the definition of perfect hashing.

The set of hash functions {0,1}^* → {0, 1, …, N−1} restricted to 𝔸 has a cardinality of N^m. Therefore, the ratio of perfect hash functions restricted to 𝔸 to the total hash functions restricted to 𝔸 is just

p = P_m^N / N^m. (12)

By the property of the hash function hash being a random oracle, the algorithm randomly samples one of these hash functions, which is a perfect hash function with probability p. ∎

Equation (11) may be reparameterized with respect to the load factor. By (2), r = m/N. Solving this equation for N yields the solution

N = m/r, (13)

and thus plugging this value of N into (11) yields the result

p(m, r) = (m/r)! / ((m/r)^m (m/r − m)!). (14)
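As an illustrative helper (names not from the paper), Equation (11)/(14) can be evaluated directly in Python:

from math import perm

def p_success(m: int, r: float) -> float:
    # Probability that a single bit string yields a perfect hash, Eq. (11): P(N, m) / N**m.
    N = round(m / r)  # number of hash slots; assumes m/r is an integer
    return perm(N, m) / N ** m

# For example, p_success(3, 0.6) uses N = 5 and gives (5*4*3) / 5**3 = 0.48,
# matching the worked count after Equation (4).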

The expected in-place coding size is given by the following theorem.

Theorem 3.4.

The expected coding size is given approximately by

log₂ e − (1/r − 1) log₂(1/(1 − r)) bits/element. (15)
Proof.

Let Q denote the number of bit strings tried until one succeeds. The derivation below shows that

𝔼[log₂ Q] ≈ m log₂ e + (N − m) log₂(1 − m/N), where N = m/r, (16)

which, per element, is

log₂ e + (1/r − 1) log₂(1 − r) = log₂ e − (1/r − 1) log₂(1/(1 − r)). (17)

The space required for the perfect hash function found by Algorithm 1 is of the order of the bit length of the returned bit string n′. Therefore, for space efficiency, the algorithm exhaustively searches bit strings in order of increasing length.

We are interested in the first trial in which no collision occurs, which follows a geometric distribution with probability of success p(m, r), as given by the discrete random variable

Q ∼ Geometric(p(m, r)), (18)

where p(m, r) is given by (14). The expected number of trials for the geometric distribution is given by

𝔼[Q] = 1/p(m, r) = (m/r)^m (m/r − m)! / (m/r)!. (19)

By Definition 3.1, the n-th trial uniquely maps to a bit string of length ⌊log₂ n⌋. Thus, the expected bit length is given approximately by

𝔼[log₂ Q] ≈ log₂((m/r)^m (m/r − m)! / (m/r)!) (20)
 = m log₂(m/r) + log₂((m/r − m)!) − log₂((m/r)!) bits. (21)

By Stirling's approximation,

log₂ n! ≈ n(log₂ n − log₂ e), (22)

and so the expectation may be rewritten approximately as

𝔼[log₂ Q] ≈ m log₂(m/r) − (m/r)(log₂(m/r) − log₂ e) + (m/r − m)(log₂(m/r − m) − log₂ e) bits. (23)

Since we are interested in the expected bits per element, we divide the expectation by m and after further simplification arrive at

log₂(m/r) − (1/r) log₂(m/r) + (1/r − 1) log₂(m/r − m) + log₂ e bits/element. (24)

After further simplification, we arrive at

log₂ e − (1/r − 1) log₂(1/(1 − r)) bits/element. (25)

∎

The entropy of Q, with p = p(m, r), is given by

H(Q) = −((1 − p)/p) log₂(1 − p) − log₂ p (26)
 = (1/p − 1) log₂(1/(1 − p)) + log₂(1/p). (27)

Note that Algorithm 1 has an exponential time complexity with respect to the cardinality of the input set 𝕏. As a result, Algorithm 1 is not a practical algorithm for any reasonably large m. However, it is intended to illustrate theoretical properties useful to data structures that implement oblivious sets and maps [4, 3] based on the perfect hash function. Simple and efficient construction algorithms exist [1, 2].

The theoretical lower-bound of a minimal perfect hash function is given by the following postulate.

Postulate 3.1.

The theoretical lower bound for minimal perfect hash functions has an expected coding size given approximately by

log₂ e ≈ 1.44 bits/element. (28)
Theorem 3.5.

The cryptographic perfect hash generator given by Algorithm 1 results in a minimal perfect hash function that obtains the theoretical lower bound of 1.44 bits/element by invoking make_perfect_hash(𝕏, r = 1).

Proof.

By (15), letting r → 1⁻ for the minimal perfect hash results in an expected coding size given by

log₂ e − lim_{a→0⁺} a log₂(1/a) = log₂ e ≈ 1.44 bits/element, (29)

which is the expected lower-bound given by Postulate 3.1. ∎

This is as expected, since Algorithm 1 finds the smallest perfect hash function that is a function of a random oracle for any load factor 0 < r ≤ 1, under the assumption that the sets are uniformly distributed. The following corollary follows as a result.

Corollary 3.5.1.

The lower bound for perfect hash functions with a load factor 0 < r ≤ 1 is given by (15).

As r goes to 0, the expected bits per element goes to 0. Note that the probability mass function of the random bit string found for b is known exactly. Thus, if the desire is to serialize the perfect hash function for transmission or storage, shorter codewords may be assigned to more probable bit strings. This representation is not usable in-place, i.e., the serialization must be decoded before use, but it can reduce transmission or storage cost.

Input: 𝕏 ⊂ {0,1}^* is the set to be perfectly hashed, and r ∈ (0,1] is the load factor.
Output: A perfect hash function of 𝕏 with load factor r, represented as (n′, N) where N = ⌈|𝕏|/r⌉.
1 function make_perfect_hash(𝕏, r)
2       m ← |𝕏|;
3       N ← ⌈m/r⌉;
4       k ← ⌈log₂ N⌉;
5       for n ← 1 to ∞ do
6             𝕐 ← ∅;
7             α ← true;
8             for x ∈ 𝕏 do
9                   h ← trunc(hash(x # n′), k) mod N;
10                   if h ∈ 𝕐 then
11                         α ← false;
12                         break;
13
14                  𝕐 ← 𝕐 ∪ {h};
15
16            if α then
17                   return (n′, N);
18
Algorithm 1 Cryptographic perfect hash function constructor (single-level)
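For concreteness, a minimal Python transcription of Algorithm 1, written as a sketch under the assumption that SHA-256 is an acceptable approximation of the random oracle (all function names here are illustrative, not from the paper):

import hashlib
from itertools import count
from math import ceil, log2

def oracle_bits(x: bytes) -> str:
    # Approximate random oracle: SHA-256 output rendered as a 256-character bit string.
    return format(int.from_bytes(hashlib.sha256(x).digest(), 'big'), '0256b')

def nat_to_bits(n: int) -> str:
    # Bijection of Definition 3.1: n >= 1 maps to its binary expansion with the leading 1 removed.
    return bin(n)[3:]

def make_perfect_hash(X: set, r: float) -> tuple:
    # Search bit strings n' in order of increasing n until hashing is collision-free on X.
    m = len(X)
    N = ceil(m / r)
    k = max(1, ceil(log2(N)))
    for n in count(1):
        b = nat_to_bits(n)
        hashes = {int(oracle_bits(x + b.encode())[:k], 2) % N for x in X}
        if len(hashes) == m:   # injective on X, hence a perfect hash function
            return b, N

# Evaluation then proceeds exactly as in the perfect_hash sketch after Equation (10):
# for keys = {b'a', b'b', b'c'} and b, N = make_perfect_hash(keys, 0.6), the values
# trunc(hash(x # b), k) mod N are three distinct integers in {0, ..., 4}.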

4 A practical two-level perfect hash function

Input: 𝕏 ⊆ 𝕌 is the objective set, r ∈ {q ∈ ℚ | q·2^j ∈ ℕ for some j ∈ ℕ} is the load factor, and k is the number of entries in the intermediate hash level.
Output: A two-level perfect hash function of 𝕏 ⊆ 𝕌 with a load factor r and an intermediate level of k indices, denoted by hash_𝕏^r : 𝕌 → {0,…,N−1} where N = ⌈|𝕏|/r⌉.
1 function make_perfect_hash(𝕏, r, k)
2       N ← ⌈|𝕏|/r⌉;
3       t ← ⌈log₂ k⌉;
       // 𝕏[1],…,𝕏[k] is a total partition of 𝕏 and 𝕏[j₁],…,𝕏[j_k] is in decreasing order of cardinality.
4       𝕏[ℓ] ← {x ∈ 𝕏 ∣ trunc(hash(x # 0), t) mod k = ℓ};
5       𝕐 ← ∅;
6       𝔸 ← ∅;
7       for ℓ ← 1 to k do
8             for n ← 1 to ∞ do
9                   β(x) ← trunc(hash(x # n′), ⌈log₂ N⌉) mod N;
10                   𝕐_ℓ ← {β(x) | x ∈ 𝕏[ℓ]};
11                   if |𝕐_ℓ| = |𝕏[ℓ]| ∧ 𝕐_ℓ ∩ 𝕐 = ∅ then
12                         𝕐 ← 𝕐 ∪ 𝕐_ℓ;
13                         break;
14
15            𝔸 ← 𝔸 ∪ {(ℓ, n′)};
16
17      return (𝔸, k, N)
Algorithm 2 Two-level perfect hash function constructor
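A compact Python sketch of this two-level construction (illustrative only: SHA-256 again stands in for the random oracle, the fixed salt used for the first level is an assumption, and all names are mine rather than the paper's):

import hashlib
from itertools import count
from math import ceil, log2

def oracle_bits(x: bytes) -> str:
    return format(int.from_bytes(hashlib.sha256(x).digest(), 'big'), '0256b')

def nat_to_bits(n: int) -> str:
    return bin(n)[3:]

def make_two_level_perfect_hash(X: set, r: float, k: int):
    # Algorithm 2 sketch: bucket X into k parts, then fix each bucket with its own salt n'.
    N = ceil(len(X) / r)
    t = max(1, ceil(log2(k)))
    kN = max(1, ceil(log2(N)))
    buckets = [[] for _ in range(k)]
    for x in X:                       # first level: bucket index from a fixed salt
        buckets[int(oracle_bits(x + b'0')[:t], 2) % k].append(x)
    used = set()                      # second-level hash values already taken
    A = []                            # displacement table of (bucket index, trial index) pairs
    order = sorted(range(k), key=lambda i: len(buckets[i]), reverse=True)
    for i in order:                   # largest buckets first, per the comment in Algorithm 2
        for n in count(1):
            b = nat_to_bits(n)
            Y_i = {int(oracle_bits(x + b.encode())[:kN], 2) % N for x in buckets[i]}
            if len(Y_i) == len(buckets[i]) and not (Y_i & used):
                used |= Y_i
                A.append((i, n))
                break
    return A, k, N

A lookup would recompute the bucket index of a key, retrieve the stored trial index n for that bucket from the table, and apply the second-level hash with salt n′.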

4.1 Analysis

Suppose we have a set 𝕏 of m elements with some total order x₍₁₎, …, x₍ₘ₎ for which we seek a perfect hash function hash_𝕏 : 𝕌 → ℤ.

We denote the serialization of some value x by x′. If x is already understood to be a serialization, then x′ denotes deserialization instead.

By the assumption that the hash function hash : 𝕌 → {0,1}^k is a random oracle restricted to the codomain {0,1}^k, the n-th attempt (trial) hash(x₍ⱼ₎ # n′) to find a non-colliding hash for x₍ⱼ₎ has a probability of success pⱼ = (m − j + 1)/m, since we have already found hashes for the previous j − 1 elements and therefore there are only m − (j − 1) hashes remaining.

This is an example of the coupon collector's problem. Let Tⱼ denote the uncertain number of trials needed to find a non-colliding hash for x₍ⱼ₎. The total number of trials needed is a random variable given by

T = Σ_{j=1}^{m} Tⱼ. (30)

By the linearity of the expectation operator,

E(T) = Σ_{j=1}^{m} E(Tⱼ) = Σ_{j=1}^{m} 1/pⱼ = m H_m, (31)

where H_m is the m-th harmonic number, which asymptotically gives E(T) ≈ m(γ + ln m), or on average γ + ln m trials per element of 𝕏.
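For instance (an illustrative calculation, with numbers not taken from the text), with m = 100 the expected total is E(T) = 100·H₁₀₀ ≈ 518.7 trials, i.e., roughly 5.2 trials per element, in line with γ + ln 100 ≈ 0.577 + 4.605 ≈ 5.18.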

If we use the minimum number of bits

The variance of T is given by

V(T) = Σ_{j=2}^{m} V(Tⱼ), (32)

where V(Tⱼ) = (1 − pⱼ)/pⱼ² is the variance of a geometric random variable (the j = 1 term vanishes since p₁ = 1).

5 Algebra of function composition

Perfect hash functions can be composed with other functions to create new perfect hash functions with different properties. This algebraic structure provides flexibility in designing hash function families.

5.1 Composition preserves injectivity

While a perfect hash function h_𝔸 : 𝕏 → 𝕐 need not be globally injective, its restriction to 𝔸 is injective:

h_𝔸|_𝔸 : 𝔸 → 𝕐. (33)

This restriction property enables composition with injective functions while preserving the perfect hash property.

Theorem 5.1 (Post-composition with injection).

Let g : 𝕐 → ℤ be an injective function, h_𝔸^r : 𝕏 → 𝕐 be a perfect hash function, and |ℤ| = (1 + α)|𝕐| for α ≥ 0. The composition g ∘ h_𝔸^r : 𝕏 → ℤ is a perfect hash function h_𝔸^{r′} where r′ = r/(1 + α).

Proof.

From the load factor definition, r = |𝔸|/|𝕐|, so |𝕐| = |𝔸|/r. Since g is injective, g ∘ h_𝔸^r is injective on 𝔸, making it a perfect hash function with load factor

r′ = |𝔸|/|ℤ| = |𝔸|/((1 + α)|𝕐|) = |𝔸| r/((1 + α)|𝔸|) = r/(1 + α). (34)

∎
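For example (an illustrative instance), post-composing the hash_𝕏^0.6 of Example 1 (with |𝕐| = 5) with an injection into a set ℤ of |ℤ| = 10 elements gives α = 1, and hence a load factor of r′ = 0.6/(1 + 1) = 0.3 = 3/10.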

5.2 Permutation equivalence classes

Given a perfect hash function h_𝔸 : 𝕏 → 𝕐, composing with any permutation π : 𝕐 → 𝕐 yields another perfect hash function π ∘ h_𝔸 over 𝔸 with the same load factor. Since there are |𝕐|! permutations of 𝕐, a single perfect hash function generates a family of |𝕐|! related perfect hash functions.

These functions form an equivalence class whose members have identical collision structure outside of 𝔸. This equivalence class may be denoted [π ∘ h_𝔸], where π ranges over all permutations of 𝕐.

Corollary 5.1.1.

For a perfect hash function with domain 𝕏 and codomain 𝕐, the ratio of permutation-equivalent perfect hash functions to all possible functions is

|𝕐|! / |𝕐|^|𝕏|. (35)

Appendix A Probability mass of random bit length

Input: The number m is the cardinality of the perfect hash set and the number r is the load factor.
Output: The minimum bit length conditioned on a random set of cardinality m and a load factor r.
1 function sample_bit_length(m, r)
2       𝕊 ← ∅;
3       for i ← 1 to m do
4             x ← randomly draw a bit string from {0,1}^* without replacement;
5             𝕊 ← 𝕊 ∪ {x};
6
7      (b, N) ← make_perfect_hash(𝕊, r);
       // The integer N must also be coded, which we assume has a constant bit length.
8       return BL(b) + 𝒪(1);
9
Algorithm 3 Bit length sampler of the cryptographic perfect hash function

The bit length of the perfect hash function found by Algorithm 1 has an uncertain value with respect to (random) sets and therefore we may model it as a random variable as given by the following definition.

Definition A.1.

The perfect hash function generated by Algorithm 1 has a random bit length given by

N=sample_bit_length(m,r) (36)

where m is the cardinality of the random set and r is the load factor.

Note that if the distribution of sets is not uniform, as assumed by sample_bit_length, then better algorithms than Algorithm 1 are possible; i.e., the lower bounds are with respect to uniformly distributed random sets, and smaller lower bounds are possible if the random sets are non-uniformly distributed.

The probability mass of the random bit length N is given by the following theorem.

Theorem A.1.

The random bit length N has a probability mass function given by

p_N(n | m, r) = q^(2^n − 1) (1 − q^(2^n)), (37)

where m is the cardinality of the random set, r is the load factor, and q = 1 − p(m, r).

Proof.

Each iteration of the loop in Algorithm 1 has a collision test which is Bernoulli distributed with a probability of success p(m,r), where success denotes no collision occurred. We are interested in the random length N of the bit string when this outcome occurs.

For the random variable N to realize a value n, every bit string of length less than n must fail and some bit string of length n must succeed. There are 2^n − 1 bit strings of length less than n, and each one fails with probability q, so by the product rule the probability that they all fail is given by

q^(2^n − 1). (38)

Given that every bit string of length less than n fails, what is the probability that every bit string of length n fails? There are 2^n bit strings of length n, each of which fails with probability q as before, and thus by the product rule the probability that they all fail is q^(2^n); its complement, the probability that not all bit strings of length n fail, is given by

1 − q^(2^n). (39)

By the product rule, the probability that every bit string of length less than n fails and some bit string of length n succeeds is given by the product of (38) and (39),

q^(2^n − 1) (1 − q^(2^n)). (40)

For (40) to be a probability mass function, two conditions must be met. First, its range must be a subset of [0,1]. Second, the summation over its domain must be 1.

The first condition is trivially shown by the observation that q is a positive number between 0 and 1, and therefore any non-negative power of q is a positive number between 0 and 1.

The second condition is shown by calculating the infinite series

S = Σ_{n=0}^{∞} q^(2^n − 1) (1 − q^(2^n)) (41)
 = Σ_{n=0}^{∞} (q^(2^n − 1) − q^(2^(n+1) − 1)). (42)

Explicitly evaluating this series for the first 4 terms reveals a telescoping sum given by

S = (1 − q) + (q − q³) + (q³ − q⁷) + (q⁷ − q¹⁵) + ⋯, (43)

where everything cancels except 1. ∎

In Figure 2, we plot the probability mass function of N conditioned on m=25 and several different load factors. We see that the probability mass is peaked and unimodal and tends to nearly zero everywhere except over a concentrated interval around its expected value.

Figure 2: Probability mass function of the random bit length N.

The expected size was already computed, but the probability mass function contains everything there is to know about the distribution of N, not just the expected value. However, for illustration, we show how the expected value may be computed.

Theorem A.2.

The expected bit length of N conditioned on random sets of cardinality m and a load factor r is given by

𝔼[N] = Σ_{j=1}^{∞} q^(2^j − 1), (44)

where q = 1 − p(m, r).

Proof.

The expectation of N is given by

𝔼[N] = Σ_{j=0}^{∞} j q^(2^j − 1) (1 − q^(2^j)) (45)
 = Σ_{j=0}^{∞} j (q^(2^j − 1) − q^(2^(j+1) − 1)). (46)

Explicitly evaluating this series for the first 4 terms reveals a converging sum given by

0 + (q − q³) + 2(q³ − q⁷) + 3(q⁷ − q¹⁵) + ⋯ (47)
 = q + q³ + q⁷ + q¹⁵ + ⋯. (48)

∎

When we numerically evaluate (44) for various m and r, the results are in agreement with (15).
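A short illustrative Python check of this agreement (the function names are mine; it simply evaluates (14), (44), and (15) and prints the two per-element quantities for comparison):

from math import perm, log2, e

def p_success(m: int, r: float) -> float:
    # Eq. (14): probability that a single bit string yields a perfect hash.
    N = round(m / r)
    return perm(N, m) / N ** m

def expected_bit_length(m: int, r: float, terms: int = 64) -> float:
    # Eq. (44): E[N] = sum_{j >= 1} q**(2**j - 1), with q = 1 - p(m, r).
    q = 1.0 - p_success(m, r)
    return sum(q ** (2 ** j - 1) for j in range(1, terms))

def expected_bits_per_element(r: float) -> float:
    # Eq. (15): log2(e) - (1/r - 1) * log2(1 / (1 - r)).
    return log2(e) - (1 / r - 1) * log2(1 / (1 - r))

m, r = 25, 0.5
print(expected_bit_length(m, r) / m, expected_bits_per_element(r))
# Both quantities are approximations of the same per-element cost; the text reports
# agreement across the tested values of m and r.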

References

[1] D. Belazzougui, F. C. Botelho, and M. Dietzfelbinger (2009). Compress, hash, and displace algorithm. ACM Journal of Experimental Algorithmics (JEA) 14, pp. 1–26.
[2] F. C. Botelho, R. Pagh, and N. Ziviani (2007). Simple and space-efficient minimal perfect hash functions. In International Workshop on Algorithms and Data Structures, pp. 139–150.
[3] Z. J. Czech, G. Havas, and B. S. Majewski (1992). A family of perfect hashing methods. The Computer Journal 35(6), pp. 547–554.
[4] M. Dietzfelbinger, A. Karlin, K. Mehlhorn, F. Meyer auf der Heide, H. Rohnert, and R. E. Tarjan (1990). Space-efficient hash tables with worst-case constant access time. In STACS 90: 7th Annual Symposium on Theoretical Aspects of Computer Science, Rouen, France, pp. 271–282.