Rate-distorted cryptographic perfect hash functions
A theoretical analysis on space efficiency, time complexity, and entropy
Abstract
We analyze a theoretical perfect hash function that has three desirable properties: (1) it is a cryptographic hash function; (2) its in-place encoding obtains the theoretical lower-bound for the expected space complexity; and (3) its in-place encoding is a random bit string with maximum entropy.
Keywords: perfect hash function, random oracle, space complexity, maximum entropy, cryptographic hash function
1 Prior Art
Perfect hash functions have been extensively studied in the literature. Early foundational work by Dietzfelbinger et al. [4] established theoretical bounds for space-efficient hash tables with worst-case constant access time. Czech, Havas, and Majewski [3] introduced a family of perfect hashing methods that became widely influential.
More recent practical algorithms have focused on minimal perfect hash functions (MPHFs), which achieve a load factor of exactly 1. Botelho, Pagh, and Ziviani [2] presented simple and space-efficient constructions, while Belazzougui, Botelho, and Dietzfelbinger [1] developed the Compress, Hash, and Displace (CHD) algorithm, which achieves excellent space-time tradeoffs for large datasets.
Our work differs by focusing on the theoretical properties of cryptographic perfect hash functions with maximum entropy encodings, rather than minimal space or construction speed.
2 Perfect hash functions
A set is an unordered collection of distinct elements. If we know the elements in a set, we may denote the set by listing these elements, e.g., $\{a, b, c\}$ denotes a set whose members are exactly $a$, $b$, and $c$.
A finite set has a finite number of elements. For example, $\{1, 2, 3\}$ is a finite set with three elements. When sets $A$ and $B$ are isomorphic, denoted by $A \cong B$, they can be put into a one-to-one correspondence (bijection). Since there exists at least one bijection between isomorphic sets, we can losslessly convert one to the other and thus isomorphic sets are in some sense equivalent.
The cardinality of a finite set $A$ is the number of elements in the set, denoted by $|A|$, e.g., $|\{1, 2, 3\}| = 3$. A countably infinite set is isomorphic to the set of natural numbers $\mathbb{N}$.
Given two elements $a$ and $b$, an ordered pair of them is denoted by $(a, b)$, where $(a, b) = (c, d)$ if and only if $a = c$ and $b = d$. Ordered pairs are non-commutative and non-associative, i.e., $(a, b) \neq (b, a)$ if $a \neq b$ and $((a, b), c) \neq (a, (b, c))$.
Related to the ordered pair is the Cartesian product.
Definition 2.1.
The set $A \times B = \{(a, b) : a \in A,\ b \in B\}$ is the Cartesian product of sets $A$ and $B$.
By the non-commutative and non-associative property of ordered pairs, the Cartesian product is non-commutative and non-associative. However, the results are isomorphic, i.e., $A \times B \cong B \times A$ and $(A \times B) \times C \cong A \times (B \times C)$.
A tuple is a generalization of ordered pairs which can consist of an arbitrary number of elements, e.g., $(a, b, c)$.
Definition 2.2 ($n$-fold Cartesian product).
The $n$-ary Cartesian product of sets $A_1, \ldots, A_n$ is given by $A_1 \times \cdots \times A_n = \{(a_1, \ldots, a_n) : a_i \in A_i\}$.
Note that $(A \times B) \times C \cong A \times B \times C$, thus we may implicitly convert between them without ambiguity.
If each set in the $n$-ary Cartesian product is the same, the power notation may be used, e.g., $A \times A \times A = A^3$. As special cases, the $0$-ary (nullary) Cartesian product is defined to be the set containing only the empty tuple, and the $1$-ary (unary) Cartesian product is the identity, e.g., $A^1 = A$.
The hash function is given by the following definition.
Definition 2.3.
Hash functions of type $X \to Y$ are just total functions, normally with a finite codomain, with the weak assumption that they will be used as a device to assign elements from $X$ a value from $Y$ without requiring any particular rule.
For a given bit string $x$ and hash function $h$, $h(x)$ is denoted the hash of $x$.
We are particularly interested in perfect hash functions as given by the following definition.
Definition 2.4 (Perfect hash function).
A perfect hash function over the set $S \subseteq X$, denoted by
| (1) |
is an injective function when restricted to $S$.¹ ¹There are no collisions among elements of $S$, i.e., for all $x, y \in S$ with $x \neq y$, $h(x) \neq h(y)$.
Assumption 1.
Perfect hash functions are surjective.
The load factor is given by the following definition.
Definition 2.5.
The proportion of hashes in $Y$ mapped to by $h$ over the subset $S$ is denoted the load factor. Specifically, the load factor of $h$ is a rational number given by
| $r = \frac{|S|}{|Y|}$ | (2) |
Notation.
A perfect hash function of type over with a load factor may be denoted by . If and we are interested in drawing attention to the cardinality of the perfect hash function, we may also denote it by .
Example 1 Consider the set . A perfect hash function of type over with a load factor is denoted by or . Given the load factor and , we may deduce the codomain of to be precisely . See Figure 1 for an illustration.
Definition 2.6.
A minimal perfect hash function has a load factor $r = 1$.
Every hash function in $X \to Y$ is a perfect hash function over some subset of $X$, e.g., every hash function is trivially a perfect hash function of the empty set and singleton sets.
Assuming $X$ and $Y$ are finite, the set of hash functions of type $X \to Y$, which may also be denoted by $Y^X$, has a cardinality
| $|Y^X| = |Y|^{|X|}$ | (3) |
The set of perfect hash functions over $S \subseteq X$ is a subset of $X \to Y$ with the predicate that no collisions may occur on any pair of elements in $S$. The set of perfect hash functions over $S$ has a cardinality
| $\frac{|Y|!}{(|Y| - |S|)!} \, |Y|^{|X| - |S|}$ | (4) |
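These two counts can be checked by brute-force enumeration on a tiny example. The following Python sketch is illustrative only; the sets and helper names are ours, not notation from the paper.

```python
from itertools import product
from math import perm

def count_functions(X, Y):
    """Count all total functions from X to Y by enumeration: |Y|**|X|."""
    return sum(1 for _ in product(Y, repeat=len(X)))

def count_perfect_over_S(X, Y, S):
    """Count functions X -> Y that are injective on the subset S."""
    X = list(X)
    idx = [X.index(s) for s in S]
    total = 0
    for f in product(Y, repeat=len(X)):          # f[i] is the image of X[i]
        if len({f[i] for i in idx}) == len(S):   # no collisions inside S
            total += 1
    return total

X, Y, S = ['a', 'b', 'c', 'd'], [0, 1, 2], ['a', 'b']
print(count_functions(X, Y), len(Y) ** len(X))                 # 81 81
print(count_perfect_over_S(X, Y, S),
      perm(len(Y), len(S)) * len(Y) ** (len(X) - len(S)))      # 54 54
```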
3 A theoretical cryptographic perfect hash function
The bit set is denoted by $\mathbb{B} = \{0, 1\}$. The set of all bit strings of length $n$ is therefore $\mathbb{B}^n$. The cardinality of $\mathbb{B}^n$ is $2^n$. In the case of $n = 0$, $\mathbb{B}^0$ contains only the empty string $\epsilon$. The set of all bit strings of length $n$ or less is denoted by $\mathbb{B}^{\leq n}$ with a cardinality $2^{n+1} - 1$, e.g., $|\mathbb{B}^{\leq 2}| = 7$. The countably infinite set of all bit strings, $\bigcup_{n \geq 0} \mathbb{B}^n$, is denoted by $\mathbb{B}^*$. A tuple $\langle b_1, \ldots, b_n \rangle \in \mathbb{B}^n$ is denoted a bit string, and we typically drop the angle brackets when specifying strings, e.g., $011$ for $\langle 0, 1, 1 \rangle$.
An important operation that is closed over the free semigroup of $\mathbb{B}^*$ is concatenation, $x \,\|\, y$, which appends the bits of $y$ to the bits of $x$, with the special cases $\epsilon \,\|\, x = x \,\|\, \epsilon = x$.
Most hash functions are of the type $\mathbb{B}^* \to \mathbb{B}^n$ where $n$ is some finite natural number.
Another useful function is the binary padding function, which extends a bit string with zeros up to a specified length, with the special case that a string already of that length is unchanged.
The binary truncation function keeps only a prefix of a specified length, with the special case that a string of at most that length is unchanged.
The bit length function gives the number of bits in a string, e.g., if $x = 1011$ then its bit length is $4$.
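The bit-string operations above can be sketched in Python, representing a bit string as a string of '0'/'1' characters. The right-padding and prefix-truncation conventions and the function names are assumptions of this sketch, not definitions taken from the paper.

```python
def concat(x: str, y: str) -> str:
    """Concatenation of two bit strings; the empty string is the identity."""
    return x + y

def pad(x: str, n: int) -> str:
    """Extend x with zeros up to length n (assumed right padding)."""
    return x + '0' * max(0, n - len(x))

def truncate(x: str, n: int) -> str:
    """Keep only the first n bits of x."""
    return x[:n]

def bit_length(x: str) -> int:
    """Number of bits in the string."""
    return len(x)

assert concat('10', '1') == '101'
assert pad('1', 3) == '100'
assert truncate('1011', 2) == '10'
assert bit_length('1011') == 4
```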
Since $\mathbb{N}$ and $\mathbb{B}^*$ are both countably infinite, they may be put into one-to-one correspondence. A convenient one-to-one correspondence between them is given by the following definition.
Definition 3.1.
The sets $\mathbb{N}$ and $\mathbb{B}^*$ have a bijection given by
| $n \mapsto$ the binary expansion of $n + 1$ with the leading $1$ removed | (5) |
We denote the mapping described by Definition 3.1 with postfix encoding and decoding functions, which are inverses of each other. An important observation of this mapping is that a natural number $n$ maps to a bit string of length $\lfloor \log_2(n + 1) \rfloor$.
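A Python sketch of this bijection, under the assumption that the intended mapping is the standard one that writes $n + 1$ in binary and drops the leading $1$ bit (which yields exactly the stated length $\lfloor \log_2(n+1) \rfloor$):

```python
import math

def nat_to_bits(n: int) -> str:
    """Map a natural number to a bit string: write n+1 in binary and drop the
    leading 1.  The result has length floor(log2(n+1))."""
    return bin(n + 1)[3:]          # bin() gives '0b1...'; strip '0b' and the leading 1

def bits_to_nat(x: str) -> int:
    """Inverse mapping: restore the implicit leading 1 and subtract 1."""
    return int('1' + x, 2) - 1

for n in range(32):
    s = nat_to_bits(n)
    assert bits_to_nat(s) == n
    assert len(s) == math.floor(math.log2(n + 1))
```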
More generally, if we have some set $X$, in a physical computer there must be a way to map the elements of $X$ to bit strings that represent the values. We provide this mapping with encoder and decoder functions such that decoding an encoding recovers the original value.
A special case of the encoder and decoder is given by letting any bit string represent either the sequence of bits or the binary coded decimal (BCD) it denotes. Thus, any operation on non-negative integers may be applied to bit strings without ambiguity.
Given a function $f : X \to Y$, the domain of $f$ may be denoted by $\operatorname{dom}(f)$ and the codomain may be denoted by $\operatorname{codom}(f)$.
A random oracle is given by the following definition.
Definition 3.2.
A random oracle in the family is a theoretical hash function whose output is uniformly distributed over its codomain.
The theoretical analysis of the perfect hash function makes the following assumption.
Assumption 2.
The hash function is a random oracle.
Definition 3.3.
The data type for the cryptographic perfect hash function under consideration is defined as with a value constructor defined as
| (6) |
where
| (7) |
Since is a data type that purports to model perfect hash functions, its computational basis is given by the following set of functions.
By the assumption of surjectivity (and that the perfect hash function is not rate-distorted), the load factor is given by where , i.e., the maximum hash (natural ordering of integers) of .
Definition 3.4.
The minimum and maximum hash of a value of type (that models a perfect hash function) are given respectively by and where
| (8) |
The most important function, the perfect hash mapping, , is defined as
| (9) |
where
| (10) |
Theorem 3.1.
A value of type constructed with models where and .
Proof.
Proof here. ∎
Suppose we have a function such that is injective, e.g., an encoder for values of . The composition where is a perfect hash function of type over . Thus, the rest of the material in this paper does not usually assume any particular domain of the perfect hash function since injections may always be constructed for any set.
3.1 Analysis
Note that since denotes the geometric code for and we choose the smallest that succeeds, where each choice of is a geometrically distributed trial with probability , the expected space complexity obtains the information-theoretic lower-bound of bits per element.
We search all possible hash functions which are a function of , an approximate random oracle, and choose the perfect hash function on a specified set that has the smallest bit length. We describe the algorithm for performing this exhaustive search in Algorithm 1.
We consider a family of perfect hash functions for a set which are given by the output of a hash function that approximates a random oracle applied to the input concatenated with a bit string . We describe the generative algorithm for the perfect hash function in Algorithm 1. Note that in Algorithm 1, the concatenation of two bit strings and is denoted by .
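A minimal Python sketch of this search, using SHA-256 (via hashlib) as a stand-in for the random oracle. The enumeration order over bit strings, the reduction of the digest modulo the codomain size, and the function names are assumptions consistent with the analysis that follows, not a transcription of Algorithm 1.

```python
import hashlib
from itertools import count, product

def oracle(x: bytes) -> int:
    """Stand-in for the random oracle: SHA-256 digest read as an integer."""
    return int.from_bytes(hashlib.sha256(x).digest(), 'big')

def make_perfect_hash(S, r=1.0):
    """Search bit strings d in order of increasing length and return the first
    one for which x -> oracle(x || d) mod m is injective on S, where
    m = round(|S| / r) is the codomain size fixed by the load factor r."""
    m = round(len(S) / r)
    for length in count(0):                         # lengths 0, 1, 2, ...
        for bits in product('01', repeat=length):   # every string of that length
            d = ''.join(bits).encode()
            if len({oracle(x.encode() + d) % m for x in S}) == len(S):
                return d, m                         # no collisions: perfect on S

d, m = make_perfect_hash({'alpha', 'beta', 'gamma'})
print(d, m)     # the salt found and the codomain size
```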
The perfect hash function generated by Algorithm 1 has the statistical property that the output is uniformly distributed as given by the following theorem.
Theorem 3.2.
The perfect hash function , where , is a random oracle over and the restriction is a random -permutation oracle of .
Proof.
First, in Algorithm 1 on Line 5, is a bit string such that each concatenated with hashes to a unique integer in by the hash function , thus the hash function found perfectly hashes the elements of .
Second, by Definition 3.2, approximates a random oracle whose output is uniformly distributed over the elements of . Viewing the elements of as integers, where , is uniformly distributed over .
If for some integer , the remainder of the output of when dividing by , given by the modulo operator, and adding is uniformly distributed over since each has an equal number of hashes assigned to it by . If and for some integer , then it is approximately uniformly distributed and converges to the uniform distribution as . ∎
The probability that no collisions occur for a particular bit string in Algorithm 1 is given by the following theorem.
Theorem 3.3.
The probability that a bit string results in a perfect hash function is given by
| $p = \frac{m!}{(m - n)! \, m^n}$ | (11) |
where $n$ is the cardinality of the set being perfectly hashed, $r$ is the load factor, and $m = n / r$ is the cardinality of the codomain.
Proof.
Suppose we have a set $S$ of cardinality $n$ and a codomain of cardinality $m$. The set of perfect hash functions restricted to $S$ has a cardinality given by $\frac{m!}{(m - n)!}$ since any choice of $n$ out of the $m$ elements in the codomain and any assignment of the elements in $S$ to the chosen elements satisfy the definition of perfect hashing.
The set of hash functions restricted to $S$ has a cardinality of $m^n$. Therefore, the ratio of perfect hash functions restricted to $S$ to the total hash functions restricted to $S$ is just
| $\frac{m!}{(m - n)! \, m^n}$ | (12) |
By the property of the hash function being a random oracle, the algorithm randomly samples one of the hash functions, which is a perfect hash function with probability $p$.
∎
Equation (11) may be reparameterized with respect to the load factor. By (2), $r = n / m$. Solving this equation with respect to $m$ yields the solution
| $m = \frac{n}{r}$ | (13) |
and thus plugging in this value of $m$ yields the result
| $p = \frac{(n/r)!}{(n/r - n)! \, (n/r)^n}$ | (14) |
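A small numerical check of (11) and its load-factor form (14); math.perm(m, n) computes $m!/(m-n)!$, and the helper names are ours. The load-factor version assumes $n/r$ is (close to) an integer.

```python
from math import perm

def p_perfect(n: int, m: int) -> float:
    """Probability that a uniformly random function sends n distinct keys to
    n distinct slots out of m: m! / ((m - n)! * m**n)."""
    return perm(m, n) / m ** n

def p_perfect_load(n: int, r: float) -> float:
    """The same probability reparameterized by the load factor r = n / m."""
    return p_perfect(n, round(n / r))

print(p_perfect(3, 3))            # 6/27 ~ 0.222: a minimal perfect hash of 3 keys
print(p_perfect_load(10, 0.5))    # 10 keys into 20 slots
```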
The expected in-place coding size is given by the following theorem.
Theorem 3.4.
The expected coding size is given approximately by
| $n \left( \log_2 e + \frac{1 - r}{r} \log_2(1 - r) \right)$ bits | (15) |
Proof.
| (16) |
| (17) |
The space required for the perfect hash function found by Algorithm 1 is of the order of the length of the bit string in the returned tuple. Therefore, for space efficiency, the algorithm exhaustively searches for a bit string in the order of increasing size .
We are interested in the first case when no collision occurs, which is a geometric distribution with probability of success $p$ as given by the discrete random variable
| $\Pr(T = k) = (1 - p)^{k - 1} \, p$ | (18) |
where $p$ is given by (14). The expected number of trials for the geometric distribution is given by
| $E[T] = \frac{1}{p}$ | (19) |
By Definition 3.1, the $k$-th trial uniquely maps to a bit string of length $\lfloor \log_2(k + 1) \rfloor$. Thus, the expected bit length is given approximately by
| (20) | ||||
| (21) |
By Stirling’s approximation,
| (22) |
and so the expectation may be rewritten approximately as
| (23) |
Since we are interested in the expected bits per element, we divide the expectation by $n$ and after further simplification arrive at
| (24) |
After further simplification, we arrive at
| $\log_2 e + \frac{1 - r}{r} \log_2(1 - r)$ bits per element | (25) |
∎
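The expected bits per element can be checked numerically against the Stirling simplification. The closed form below is our reconstruction of the simplified expression; the exact value is computed in log-space to avoid underflow, and the helper names are illustrative.

```python
from math import e, lgamma, log, log2

def exact_bits_per_element(n: int, m: int) -> float:
    """log2(1/p) / n computed in log-space, with p = m! / ((m - n)! m**n)."""
    ln_inv_p = n * log(m) - (lgamma(m + 1) - lgamma(m - n + 1))
    return ln_inv_p / (n * log(2))

def stirling_bits_per_element(r: float) -> float:
    """Stirling simplification; tends to log2(e) ~ 1.44 as r -> 1."""
    if r == 1.0:
        return log2(e)
    return log2(e) + (1 - r) / r * log2(1 - r)

for n in (16, 256, 4096):                      # minimal case, r = 1
    print(n, exact_bits_per_element(n, n), stirling_bits_per_element(1.0))
print(exact_bits_per_element(100, 200), stirling_bits_per_element(0.5))
```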
The entropy of is given by
| (26) |
| (27) |
Note that Algorithm 1 has an exponential time complexity with respect to the cardinality of the input set. As a result, Algorithm 1 is not a practical algorithm for any reasonably large set. However, it is intended to illustrate theoretical properties useful to data structures that implement oblivious sets and maps [4, 3] based on the perfect hash function. Simple and efficient algorithms exist [1, 2].
The theoretical lower-bound of a minimal perfect hash function is given by the following postulate.
Postulate 3.1.
The theoretical lower-bound for minimal perfect hash functions has an expected coding size given approximately by
| $n \log_2 e \approx 1.44 \, n$ bits | (28) |
Theorem 3.5.
The cryptographic perfect hash generator given by Algorithm 1 results in a minimal perfect hash function that obtains the theoretical lower-bound of approximately $1.44 \, n$ bits by invoking make_perfect_hash($S$, $1$).
Proof.
This is as expected, since Algorithm 1 finds the smallest perfect hash function that is a function of a random oracle for any load factor when the sets are uniformly distributed. ∎ The following corollary follows as a result.
Corollary 3.5.1.
The lower-bound for perfect hash functions with a load factor $r$ is given by (15).
As $r$ goes to $1$, the expected bits per element goes to approximately $1.44$. Note that the probability mass function of the random bit string found by Algorithm 1 is known exactly. Thus, if the desire is to serialize the perfect hash function for transmission or storage, shorter bit strings may be assigned to more probable bit strings. This representation is not usable in-place, i.e., the serialization must be decoded, but it can reduce transmission or storage cost.
4 A practical two-level perfect hash function
4.1 Analysis
Suppose we have a set of elements with some total order for which we seek a perfect hash function .
We denote the serialization of some value by . If is already understood to be a serialization, then denotes deserialization instead.
By the assumption that the hash function is a random oracle restricted to the codomain, the attempt to find a non-colliding hash for the $j$-th element has a probability of success $\frac{n - j + 1}{n}$ since we have already found hashes for the previous $j - 1$ elements and therefore there are only $n - j + 1$ hashes remaining.
This is an example of the Coupon collector's problem. Let $T_j$ denote the uncertain number of trials needed to find a non-colliding hash for the $j$-th element. The total number of trials needed is a random variable given by
| $T = \sum_{j=1}^{n} T_j$ | (30) |
By the linearity of the expectation operator,
| $E[T] = \sum_{j=1}^{n} \frac{n}{n - j + 1} = n H_n$ | (31) |
which asymptotically converges to $n \ln n$, or on average $\ln n$ trials per element of the set.
If we use the minimum number of bits
The variance of is given by
| (32) |
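A short sketch of the expected number of trials for this two-level construction, summing the per-element expectations and comparing against the $n \ln n$ asymptote ($\gamma$ is the Euler–Mascheroni constant); the function name is ours.

```python
from math import log

EULER_GAMMA = 0.5772156649015329   # Euler-Mascheroni constant

def expected_total_trials(n: int) -> float:
    """Expected number of oracle evaluations to place all n elements into n
    slots without collision: sum_j n / (n - j + 1) = n * H_n."""
    return sum(n / (n - j + 1) for j in range(1, n + 1))

for n in (10, 100, 1000):
    print(n, expected_total_trials(n), n * (log(n) + EULER_GAMMA))
```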
5 Algebra of function composition
Perfect hash functions can be composed with other functions to create new perfect hash functions with different properties. This algebraic structure provides flexibility in designing hash function families.
5.1 Composition preserves injectivity
While a perfect hash function $h$ is not globally injective, its restriction to $S$ is injective:
| $\forall x, y \in S, \; x \neq y \implies h(x) \neq h(y)$ | (33) |
This restriction property enables composition with injective functions while preserving the perfect hash property.
Theorem 5.1 (Post-composition with injection).
Let $g : Y \to Z$ be an injective function, $h : X \to Y$ be a perfect hash function over $S$, and $|Z| \geq |Y|$ (as required for $g$ to be injective). The composition $g \circ h$ is a perfect hash function over $S$ whose load factor is given below.
Proof.
From the load factor definition, $r = |S| / |Y|$, so $|Y| = |S| / r$. Since $g$ is injective, $g \circ h$ is injective on $S$, making it a perfect hash function with load factor
| $r' = \frac{|S|}{|Z|}$ | (34) |
∎
5.2 Permutation equivalence classes
Given a perfect hash function $h$ over $S$, composing with any permutation of the codomain yields another perfect hash function over $S$ with the same load factor. Since there are $m!$ permutations of a codomain of cardinality $m$, a single perfect hash function generates a family of related perfect hash functions.
These functions form an equivalence class where all members have identical collision structure outside of $S$. This equivalence class can be denoted $\{\sigma \circ h\}$ where $\sigma$ ranges over all permutations of the codomain.
Corollary 5.1.1.
For a perfect hash function with codomain , the ratio of permutation-equivalent perfect hash functions to all possible functions is
| (35) |
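A tiny illustration of the permutation equivalence class: composing an assumed example perfect hash function with every permutation of its codomain preserves injectivity on $S$, giving $m!$ related perfect hash functions. The example set, mapping, and codomain are ours.

```python
from itertools import permutations

S = ['a', 'b', 'c']
phf = {'a': 0, 'b': 2, 'c': 1}      # assumed example PHF into codomain {0,...,3}
codomain = range(4)

family = []
for image in permutations(codomain):
    sigma = dict(zip(codomain, image))            # a permutation of the codomain
    composed = {x: sigma[phf[x]] for x in S}
    assert len(set(composed.values())) == len(S)  # still injective on S
    family.append(tuple(sorted(composed.items())))

print(len(family))       # 4! = 24 permutation-equivalent perfect hash functions
print(len(set(family)))  # their restrictions to S are pairwise distinct here
```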
Appendix A Probability mass of random bit length
The bit length of the perfect hash function found by Algorithm 1 has an uncertain value with respect to (random) sets and therefore we may model it as a random variable as given by the following definition.
Definition A.1.
The perfect hash function generated by Algorithm 1 has a random bit length given by
| (36) |
where is the cardinality of the random set and is the load factor.
Note that if the distribution of sets is not uniform as given by sample_bit_length, then better algorithms than Algorithm 1 are possible, i.e., the lower-bounds are with respect to uniformly distributed random sets, and smaller lower-bounds are possible if the random sets are non-uniformly distributed.
The probability mass of the random bit length is given by the following theorem.
Theorem A.1.
The random bit length $L$ has a probability mass function given by
| $\Pr(L = l) = (1 - p)^{2^l - 1} \left( 1 - (1 - p)^{2^l} \right)$ | (37) |
where $n$ is the cardinality of the random set, $r$ is the load factor, and $p$ is given by (14).
Proof.
Each iteration of the loop in Algorithm 1 has a collision test which is Bernoulli distributed with a probability of success $p$, where success denotes no collision occurred. We are interested in the random length of the bit string when this outcome occurs.
For the random variable $L$ to realize a value $l$, every bit string smaller than length $l$ must fail and a bit string of length $l$ must succeed. There are $2^l - 1$ bit strings smaller than length $l$ and each one fails with probability $1 - p$, and so by the product rule the probability that they all fail is given by
| $(1 - p)^{2^l - 1}$ | (38) |
Given that every bit string smaller than length $l$ fails, what is the probability that every bit string of length $l$ fails? There are $2^l$ bit strings of length $l$, each of which fails with probability $1 - p$ as before and thus by the product rule the probability that they all fail is $(1 - p)^{2^l}$, whose complement, the probability that not all bit strings of length $l$ fail, is given by
| $1 - (1 - p)^{2^l}$ | (39) |
By the product rule, the probability that every bit string smaller than length $l$ fails and a bit string of length $l$ succeeds is given by the product of (38) and (39),
| $\Pr(L = l) = (1 - p)^{2^l - 1} \left( 1 - (1 - p)^{2^l} \right)$ | (40) |
For (40) to be a probability mass function, two conditions must be met. First, its range must be a subset of $[0, 1]$. Second, the summation over its domain must be $1$.
The first case is trivially shown by the observation that $1 - p$ is a positive number between $0$ and $1$ and therefore any non-negative power of $1 - p$ is a positive number between $0$ and $1$.
The second case is shown by calculating the infinite series
| $\sum_{l=0}^{\infty} \Pr(L = l) = \sum_{l=0}^{\infty} (1 - p)^{2^l - 1} \left( 1 - (1 - p)^{2^l} \right)$ | (41) |
| $= \sum_{l=0}^{\infty} \left( (1 - p)^{2^l - 1} - (1 - p)^{2^{l+1} - 1} \right)$ | (42) |
Explicitly evaluating this series for the first few terms reveals a telescoping sum given by
| $\left( (1-p)^{0} - (1-p)^{1} \right) + \left( (1-p)^{1} - (1-p)^{3} \right) + \left( (1-p)^{3} - (1-p)^{7} \right) + \cdots$ | (43) |
where everything cancels except the leading term $(1 - p)^0 = 1$. ∎
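The probability mass function (37)/(40) can be checked numerically. The sketch below uses the success probability $p$ from Theorem 3.3 and verifies that the mass sums to one; the parameters are arbitrary small values and the function names are ours.

```python
from math import perm

def p_success(n: int, m: int) -> float:
    """Probability that a single candidate bit string yields a perfect hash
    of n elements into m slots (Theorem 3.3)."""
    return perm(m, n) / m ** n

def pmf_bit_length(l: int, n: int, m: int) -> float:
    """P(L = l): all 2**l - 1 shorter strings fail and not every string of
    length l fails, as in (38)-(40)."""
    q = 1 - p_success(n, m)
    return q ** (2 ** l - 1) * (1 - q ** (2 ** l))

n, m = 8, 8                                     # arbitrary small example
total = sum(pmf_bit_length(l, n, m) for l in range(64))
print(total)                                    # ~1.0: the mass sums to one
```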
In Figure 2, we plot the probability mass function of the random bit length $L$ conditioned on a fixed set cardinality and several different load factors. We see that the probability mass is peaked and unimodal and tends to nearly zero everywhere except over a concentrated interval around its expected value.
The expected size was already computed, but the probability mass function contains everything there is to know about the distribution of , not just the expected value. However, for illustration, we show how the expected value may be computed.
Theorem A.2.
The expected bit length of $L$ conditioned on random sets of cardinality $n$ and a load factor $r$ is given by
| $E[L] = \sum_{l=0}^{\infty} l \, \Pr(L = l) = \sum_{l=1}^{\infty} (1 - p)^{2^l - 1}$ | (44) |
where $p$ is given by (14).
Proof.
The expectation of is given by
| (45) | |||
| (46) |
Explicitly evaluating this series for the first few terms reveals a converging sum given by
| (47) | |||
| (48) |
∎
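The expectation can likewise be evaluated numerically by truncating the series where the tail is negligible. A self-contained sketch (repeating the probability mass function from the previous sketch) computes both the defining sum and the telescoped form of (44) and shows they agree:

```python
from math import perm

def pmf_bit_length(l: int, n: int, m: int) -> float:
    """P(L = l) from (40): all shorter strings fail, not all of length l fail."""
    q = 1 - perm(m, n) / m ** n
    return q ** (2 ** l - 1) * (1 - q ** (2 ** l))

def expected_bit_length(n: int, m: int, max_l: int = 64):
    """E[L] two ways: the defining sum and the telescoped series."""
    q = 1 - perm(m, n) / m ** n
    direct = sum(l * pmf_bit_length(l, n, m) for l in range(max_l))
    closed = sum(q ** (2 ** l - 1) for l in range(1, max_l))
    return direct, closed

print(expected_bit_length(8, 8))   # the two values agree
```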
References
- [1] Belazzougui, Botelho, and Dietzfelbinger (2009) Compress, hash, and displace algorithm. ACM Journal of Experimental Algorithmics (JEA) 14, pp. 1–26. Cited by: §1, §3.1.
- [2] Botelho, Pagh, and Ziviani (2007) Simple and space-efficient minimal perfect hash functions. In International Workshop on Algorithms and Data Structures, pp. 139–150. Cited by: §1, §3.1.
- [3] Czech, Havas, and Majewski (1992) A family of perfect hashing methods. The Computer Journal 35 (6), pp. 547–554. Cited by: §1, §3.1.
- [4] Dietzfelbinger et al. (1990) Space-efficient hash tables with worst-case constant access time. In STACS 90: 7th Annual Symposium on Theoretical Aspects of Computer Science, Rouen, France, February 22–24, 1990, Proceedings, pp. 271–282. Cited by: §1, §3.1.