Bernoulli Sets: A Comprehensive Statistical Framework for Probabilistic Set Membership
Abstract
We present Bernoulli sets, a comprehensive statistical framework for probabilistic set data structures that provides rigorous foundations for approximating set membership queries under uncertainty. Our framework unifies the treatment of fundamental data structures like Bloom filters, Count-Min sketches, and their variants through a principled latent/observed duality that distinguishes between true mathematical sets and their noisy computational approximations. We derive exact distributions and confidence intervals for all standard binary classification measures including positive predictive value (PPV), negative predictive value (NPV), accuracy, and F1-score, providing both closed-form expressions and asymptotic approximations. We introduce interval arithmetic for propagating uncertainty bounds through set operations when error rates are imprecisely known, yielding conservative guarantees for composed operations. For set-theoretic operations, we establish tight bounds on error propagation for union, intersection, complement, and symmetric difference under both independent and correlated error models. Our analysis reveals fundamental tradeoffs between space efficiency and error rates, showing how classical implementations like Bloom filters achieve near-optimal performance within our framework. The theory finds immediate applications in databases, networking, bioinformatics, and distributed systems where space-efficient approximate membership testing is crucial.
1 Introduction
Modern computing systems routinely face the challenge of testing set membership for massive collections where exact representations would be prohibitively expensive. Consider a distributed database checking whether a key exists across billions of records, a network router filtering malicious URLs from a blacklist of millions, or a genomics pipeline searching for k-mers in terabyte-scale sequence databases. In each case, the fundamental question “is element $x$ in set $S$?” must be answered quickly using limited memory.
Probabilistic set data structures address this challenge by trading perfect accuracy for dramatic space savings. A Bloom filter [1], for instance, can represent a set using just 10 bits per element while maintaining a 1% false positive rate, compared to hundreds of bits required for exact storage of typical keys. This order-of-magnitude compression enables applications that would otherwise be infeasible, from web crawling [2] to bioinformatics applications.
Despite their widespread adoption, probabilistic set data structures have traditionally been analyzed in isolation, with each variant—Bloom filters, counting filters, cuckoo filters, quotient filters—receiving separate treatment. This paper introduces Bernoulli sets, a unified statistical framework that captures the essential behavior of all these structures through a fundamental latent/observed duality.
1.1 The Latent/Observed Duality
At the heart of our framework lies a crucial distinction between two types of sets:
- Latent sets ($S$): The true mathematical sets that exist conceptually but are not directly accessible computationally.
- Observed sets ($\tilde{S}$): The approximate representations we actually compute with, subject to false positives and false negatives.
This duality reflects a fundamental reality of approximate computing: we can never directly access the “true” set $S$; we can only interact with its noisy observation $\tilde{S}$. Every membership query $x \in S$ becomes a probabilistic test $x \in \tilde{S}$.
The complete probabilistic relationship between latent and observed sets is captured by a confusion matrix, where each entry
$$C_{S,\tilde{S}} = \Pr(\tilde{S} \mid S)$$
gives the probability of observing set $\tilde{S}$ when the true set is $S$. This exponentially-large ($2^{|U|} \times 2^{|U|}$) matrix is generally intractable.
In practice, we work with simplified models. The first-order model corresponds to a binary symmetric channel with a single error parameter $\varepsilon$, where membership bits are flipped with uniform probability $\varepsilon$.
The second-order model allows different rates for false positives ($\alpha$) and false negatives ($\beta$), yielding the familiar confusion matrix for individual membership queries:
$$\begin{pmatrix} \Pr(x \in \tilde{S} \mid x \in S) & \Pr(x \notin \tilde{S} \mid x \in S) \\ \Pr(x \in \tilde{S} \mid x \notin S) & \Pr(x \notin \tilde{S} \mid x \notin S) \end{pmatrix} = \begin{pmatrix} 1-\beta & \beta \\ \alpha & 1-\alpha \end{pmatrix}$$
This second-order model with asymmetric errors is most relevant algorithmically, as real data structures often exhibit different false positive and false negative rates. For example, Bloom filters have $\beta = 0$ (no false negatives) but $\alpha > 0$ (false positives possible). Other second-order variants allow element-specific error rates, capturing scenarios where different elements have different error probabilities due to hash collisions or load imbalance.
1.2 Contributions
This paper makes the following contributions:
1. **Unified Framework**: We develop Bernoulli sets as a comprehensive model for probabilistic set membership, showing how classical data structures emerge as special cases with particular parameters.
2. **Distribution Theory**: We derive exact distributions for all binary classification measures (PPV, NPV, accuracy, F1-score) as functions of the confusion matrix parameters, providing both finite-sample results and asymptotic approximations.
3. **Interval Arithmetic**: We introduce systematic methods for propagating uncertainty when error rates are known only imprecisely, yielding conservative bounds for all derived measures.
4. **Set Operations**: We establish tight bounds on error propagation through union, intersection, complement, and symmetric difference operations, for both independent and correlated error models.
5. **Implementation Analysis**: We analyze space-time tradeoffs for Bloom filters, Count-Min sketches, and other implementations, showing how they achieve near-optimal performance within our framework.
1.3 Organization
Section 2 introduces the formal Bernoulli set model. Section 3 derives distributions for binary classification measures. Section 4 develops interval arithmetic for uncertain error rates. Section 5 analyzes set operations and error propagation. Section 6 examines implementations and applications. Section 7 concludes.
2 The Bernoulli Set Model
We begin by formalizing the notion of approximate set membership through a statistical model that captures the essential uncertainty in probabilistic data structures.
2.1 Basic Definitions
Let $U$ denote a universe of possible elements. A set $S \subseteq U$ is a collection of distinct elements from this universe. In classical set theory, membership is deterministic: for any $x \in U$, either $x \in S$ or $x \notin S$.
Definition 2.1 (Bernoulli Set).
A Bernoulli set $\tilde{S}$ is a probabilistic representation of a latent set $S$ characterized by two error parameters:
- False positive rate $\alpha$: $\Pr(x \in \tilde{S} \mid x \notin S) = \alpha$
- False negative rate $\beta$: $\Pr(x \notin \tilde{S} \mid x \in S) = \beta$
The membership test for a Bernoulli set is thus a random function:
$$\mathbb{1}[x \in \tilde{S}] \sim \begin{cases} \mathrm{Bernoulli}(1-\beta) & \text{if } x \in S, \\ \mathrm{Bernoulli}(\alpha) & \text{if } x \notin S, \end{cases}$$
where the output follows a Bernoulli distribution conditional on true membership.
2.2 Confusion Matrix Formalism
The complete behavior of a Bernoulli set over universe is specified by an exponentially-large confusion matrix:
Definition 2.2 (Full Confusion Matrix).
For a Bernoulli set approximating latent set $S$, the full confusion matrix has dimension $2^{|U|} \times 2^{|U|}$ with entries:
$$C_{S,\tilde{S}} = \Pr(\text{observe } \tilde{S} \mid \text{latent } S).$$
This exponential matrix captures all possible transitions from latent sets to observed sets. However, working with such matrices is intractable. We therefore consider two important special cases:
2.2.1 First-Order Bernoulli Sets
Definition 2.3 (First-Order Model - Binary Symmetric Channel).
A first-order Bernoulli set has a single error parameter $\varepsilon$, corresponding to a binary symmetric channel where each bit (membership indicator) is flipped with probability $\varepsilon$:
$$\Pr(x \in \tilde{S} \mid x \notin S) = \Pr(x \notin \tilde{S} \mid x \in S) = \varepsilon.$$
This models uniform noise where false positives and false negatives occur at the same rate.
Equivalently, if we represent the set as a bit string where bit $i$ indicates membership of element $x_i$, then the observed set is the result of transmitting this bit string through a binary symmetric channel with crossover probability $\varepsilon$.
2.2.2 Second-Order Bernoulli Sets
Second-order models allow different error rates for false positives and false negatives. The most common variant distinguishes these two error types:
Definition 2.4 (Second-Order Model - Asymmetric Errors).
A second-order Bernoulli set has separate parameters for false positive rate $\alpha$ and false negative rate $\beta$:
$$\Pr(x \in \tilde{S} \mid x \notin S) = \alpha, \qquad \Pr(x \notin \tilde{S} \mid x \in S) = \beta,$$
where $\alpha \neq \beta$ in general.
This is the most algorithmically relevant model, as many data structures naturally exhibit asymmetric errors (e.g., Bloom filters with $\beta = 0$, $\alpha > 0$).
Remark 2.5 (Other Second-Order Variants).
Other second-order models include:
- Element-specific rates: Different elements have different error probabilities $\alpha_x$, $\beta_x$, as occurs when hash functions create non-uniform distributions
- Size-dependent rates: Error rates depend on $|S|$, as in structures where collision probability increases with load
- Time-varying rates: Error rates change over time, as in aging or adaptive structures

For the standard second-order model with uniform asymmetric errors, each membership query has the familiar confusion matrix:
$$\begin{pmatrix} 1-\beta & \beta \\ \alpha & 1-\alpha \end{pmatrix}$$
2.3 Special Cases
Several important special cases arise from particular choices of $(\alpha, \beta)$:
Example 2.6 (One-sided Error).
- Bloom filter: $\beta = 0$ (no false negatives), $\alpha > 0$ (false positives possible)
- Lossy counter: $\alpha = 0$ (no false positives), $\beta > 0$ (undercounting possible)
Example 2.7 (Degenerate Cases).
- Exact set: $\alpha = \beta = 0$ (no errors)
- Universal set: $\alpha = 1$, $\beta = 0$ (always returns true)
- Empty set: $\alpha = 0$, $\beta = 1$ (always returns false)
- Random set: $\alpha = \beta = 1/2$ (maximally uncertain)
2.4 Independence Assumptions
A key assumption in our base model is that membership tests are independent across elements:
Assumption 2.8 (Independence).
For distinct elements $x \neq y$, the events $\{x \in \tilde{S}\}$ and $\{y \in \tilde{S}\}$ are conditionally independent given the latent set $S$.
This assumption, while not always perfectly satisfied in practice (e.g., hash collisions in Bloom filters create dependencies), provides a tractable starting point for analysis. We relax this assumption when analyzing specific implementations.
3 Binary Classification Measures
Given a Bernoulli set with parameters $(\alpha, \beta)$, we now derive the distributions of standard binary classification measures. These measures quantify different aspects of approximation quality and are crucial for understanding the practical implications of using probabilistic data structures.
3.1 Fundamental Counts
Consider a latent set $S \subseteq U$ with $n_+ = |S|$ positive elements and $n_- = |U| - |S|$ negative elements. When we test these elements against a Bernoulli set $\tilde{S}$, we obtain four random counts:
$$\mathrm{TP} \sim \mathrm{Binomial}(n_+,\, 1-\beta) \tag{1}$$
$$\mathrm{FN} = n_+ - \mathrm{TP} \tag{2}$$
$$\mathrm{FP} \sim \mathrm{Binomial}(n_-,\, \alpha) \tag{3}$$
$$\mathrm{TN} = n_- - \mathrm{FP} \tag{4}$$
These counts form the basis for all derived measures.
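To make these distributions concrete, here is a minimal Python sketch that simulates the four counts under the independence assumption of Section 2.4 (the helper name `simulate_counts` is ours, purely illustrative):

```python
import numpy as np

def simulate_counts(n_pos, n_neg, alpha, beta, rng=None):
    """Draw (TP, FN, FP, TN) for one Bernoulli set over a labeled universe."""
    rng = rng or np.random.default_rng()
    tp = rng.binomial(n_pos, 1 - beta)  # each positive tests positive w.p. 1 - beta
    fp = rng.binomial(n_neg, alpha)     # each negative tests positive w.p. alpha
    return tp, n_pos - tp, fp, n_neg - fp

tp, fn, fp, tn = simulate_counts(1_000, 99_000, alpha=0.01, beta=0.0)
print(tp, fn, fp, tn)  # e.g. (1000, 0, 1013, 97987) for a Bloom-like set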
3.2 Positive Predictive Value (PPV)
The positive predictive value measures the probability that an element is truly in $S$ given that it tests positive in $\tilde{S}$.
Theorem 3.1 (PPV Distribution).
The positive predictive value is the random variable:
$$\mathrm{PPV} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}},$$
with expectation approximately:
$$\mathbb{E}[\mathrm{PPV}] \approx \frac{\mu_{\mathrm{TP}}}{\mu_{\mathrm{TP}} + \mu_{\mathrm{FP}}} + \frac{\mu_{\mathrm{TP}}\,\sigma^2_{\mathrm{FP}} - \mu_{\mathrm{FP}}\,\sigma^2_{\mathrm{TP}}}{(\mu_{\mathrm{TP}} + \mu_{\mathrm{FP}})^3},$$
where $\mu_{\mathrm{TP}} = n_+(1-\beta)$, $\sigma^2_{\mathrm{TP}} = n_+\beta(1-\beta)$, $\mu_{\mathrm{FP}} = n_-\alpha$, $\sigma^2_{\mathrm{FP}} = n_-\alpha(1-\alpha)$.
Proof.
Using the delta method for the ratio of random variables, we expand $\mathrm{PPV} = g(\mathrm{TP}, \mathrm{FP})$ with $g(t, f) = t/(t+f)$ around the mean values $(\mu_{\mathrm{TP}}, \mu_{\mathrm{FP}})$.
The second-order Taylor expansion gives:
$$\mathbb{E}[g(\mathrm{TP}, \mathrm{FP})] \approx g(\mu_{\mathrm{TP}}, \mu_{\mathrm{FP}}) + \tfrac{1}{2}\operatorname{tr}(H\Sigma),$$
where $H$ is the Hessian matrix of $g$ and $\Sigma$ is the covariance matrix of $(\mathrm{TP}, \mathrm{FP})$. Since TP and FP are independent, $\Sigma$ is diagonal, yielding the stated result. ∎
Corollary 3.2 (PPV Asymptotics).
As $n_- \to \infty$ with $n_+$, $\alpha > 0$, and $\beta$ fixed:
$$\mathbb{E}[\mathrm{PPV}] \approx \frac{n_+(1-\beta)}{n_+(1-\beta) + n_-\alpha} \to 0.$$
This reveals a fundamental limitation: without driving the false positive rate $\alpha$ toward zero, PPV degrades as the negative population grows.
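A short computation illustrates this decay (the helper `expected_ppv` is ours and uses only the first-order term of Theorem 3.1):

```python
def expected_ppv(n_pos, n_neg, alpha, beta):
    """First-order delta-method approximation of E[PPV]."""
    mu_tp = n_pos * (1 - beta)
    mu_fp = n_neg * alpha
    return mu_tp / (mu_tp + mu_fp)

# With alpha fixed, PPV decays toward 0 as the negative population grows:
for n_neg in (10**4, 10**6, 10**8):
    print(n_neg, round(expected_ppv(10**3, n_neg, alpha=0.01, beta=0.0), 4))
# 10000 -> 0.9091, 1000000 -> 0.0909, 100000000 -> 0.001
```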
3.3 Negative Predictive Value (NPV)
The negative predictive value measures the probability that an element is truly not in $S$ given that it tests negative in $\tilde{S}$.
Theorem 3.3 (NPV Distribution).
The negative predictive value is:
$$\mathrm{NPV} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FN}},$$
with expectation approximately:
$$\mathbb{E}[\mathrm{NPV}] \approx \frac{\mu_{\mathrm{TN}}}{\mu_{\mathrm{TN}} + \mu_{\mathrm{FN}}} + \frac{\mu_{\mathrm{TN}}\,\sigma^2_{\mathrm{FN}} - \mu_{\mathrm{FN}}\,\sigma^2_{\mathrm{TN}}}{(\mu_{\mathrm{TN}} + \mu_{\mathrm{FN}})^3},$$
where $\mu_{\mathrm{TN}} = n_-(1-\alpha)$, $\sigma^2_{\mathrm{TN}} = n_-\alpha(1-\alpha)$, $\mu_{\mathrm{FN}} = n_+\beta$, $\sigma^2_{\mathrm{FN}} = n_+\beta(1-\beta)$.
3.4 Accuracy
Accuracy measures the overall proportion of correct classifications.
Theorem 3.4 (Accuracy Distribution).
The accuracy is:
$$\mathrm{ACC} = \frac{\mathrm{TP} + \mathrm{TN}}{n_+ + n_-},$$
with expectation and variance:
$$\mathbb{E}[\mathrm{ACC}] = \pi(1-\beta) + (1-\pi)(1-\alpha), \tag{5}$$
$$\mathrm{Var}[\mathrm{ACC}] = \frac{n_+\beta(1-\beta) + n_-\alpha(1-\alpha)}{(n_+ + n_-)^2}, \tag{6}$$
where $\pi = n_+/(n_+ + n_-)$ is the prevalence of positive elements.
3.5 F1-Score
The F1-score is the harmonic mean of precision (PPV) and recall (TPR).
Theorem 3.5 (F1-Score).
The F1-score is:
$$F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}},$$
with expectation approximately:
$$\mathbb{E}[F_1] \approx \frac{2 n_+(1-\beta)}{2 n_+(1-\beta) + n_-\alpha + n_+\beta}.$$
3.6 Youden’s J Statistic
Youden’s J statistic measures the overall discriminative ability.
Theorem 3.6 (Youden’s J).
Youden’s J statistic is:
$$J = \mathrm{TPR} + \mathrm{TNR} - 1 = \frac{\mathrm{TP}}{n_+} + \frac{\mathrm{TN}}{n_-} - 1,$$
with expectation:
$$\mathbb{E}[J] = (1-\beta) + (1-\alpha) - 1 = 1 - \alpha - \beta.$$
3.7 Confidence Intervals
For large $n_+$ and $n_-$, the central limit theorem provides asymptotic confidence intervals.
Theorem 3.7 (Asymptotic Confidence Intervals).
For confidence level $1-\delta$, with $\hat{\alpha} = \mathrm{FP}/n_-$ and $\widehat{\mathrm{TPR}} = \mathrm{TP}/n_+$, asymptotic confidence intervals are:
$$\mathrm{FPR} \in \hat{\alpha} \pm z_{1-\delta/2}\sqrt{\frac{\hat{\alpha}(1-\hat{\alpha})}{n_-}}, \tag{7}$$
$$\mathrm{TPR} \in \widehat{\mathrm{TPR}} \pm z_{1-\delta/2}\sqrt{\frac{\widehat{\mathrm{TPR}}(1-\widehat{\mathrm{TPR}})}{n_+}}, \tag{8}$$
where $z_{1-\delta/2}$ is the $1-\delta/2$ quantile of the standard normal distribution.
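Interval (7) is straightforward to compute; a minimal sketch, assuming the FPR is estimated from a labeled sample of $n_-$ negatives:

```python
from math import sqrt
from statistics import NormalDist

def fpr_confidence_interval(fp: int, n_neg: int, delta: float = 0.05):
    """Normal-approximation interval (7) for the false positive rate."""
    alpha_hat = fp / n_neg
    z = NormalDist().inv_cdf(1 - delta / 2)  # z_{1 - delta/2}
    half = z * sqrt(alpha_hat * (1 - alpha_hat) / n_neg)
    return max(0.0, alpha_hat - half), min(1.0, alpha_hat + half)

print(fpr_confidence_interval(fp=103, n_neg=10_000))  # ~(0.0083, 0.0123)
```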
4 Interval Arithmetic for Error Bounds
In practice, error rates $\alpha$ and $\beta$ are often known only imprecisely. We develop interval arithmetic to propagate this uncertainty through all derived measures.
4.1 Interval Representation
Definition 4.1 (Interval Error Rates).
When error rates are uncertain, we represent them as intervals:
$$\alpha \in [\alpha_L, \alpha_U], \tag{9}$$
$$\beta \in [\beta_L, \beta_U], \tag{10}$$
where $0 \le \alpha_L \le \alpha_U \le 1$, and similarly for $\beta$.
4.2 Basic Interval Operations
For intervals $[a_L, a_U]$ and $[b_L, b_U]$:
$$[a_L, a_U] + [b_L, b_U] = [a_L + b_L,\; a_U + b_U], \tag{11}$$
$$[a_L, a_U] - [b_L, b_U] = [a_L - b_U,\; a_U - b_L], \tag{12}$$
$$[a_L, a_U] \cdot [b_L, b_U] = [\min P,\; \max P], \quad P = \{a_L b_L,\, a_L b_U,\, a_U b_L,\, a_U b_U\}. \tag{13}$$
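These operations translate directly into code; a minimal sketch (the `Interval` class is our own illustrative wrapper):

```python
from typing import NamedTuple

class Interval(NamedTuple):
    lo: float
    hi: float

    def __add__(self, other):  # eq. (11)
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):  # eq. (12)
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):  # eq. (13): extremes over the four corners
        corners = (self.lo * other.lo, self.lo * other.hi,
                   self.hi * other.lo, self.hi * other.hi)
        return Interval(min(corners), max(corners))

alpha = Interval(0.005, 0.02)
beta = Interval(0.0, 0.01)
print(alpha + beta, alpha * beta)
```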
4.3 Propagation Through Measures
Theorem 4.2 (PPV Interval).
Given $\alpha \in [\alpha_L, \alpha_U]$ and $\beta \in [\beta_L, \beta_U]$:
$$\mathrm{PPV} \in \left[\frac{n_+(1-\beta_U)}{n_+(1-\beta_U) + n_-\alpha_U},\; \frac{n_+(1-\beta_L)}{n_+(1-\beta_L) + n_-\alpha_L}\right].$$
Proof.
PPV is monotonically increasing in the true positive rate $1-\beta$ and decreasing in $\alpha$. The minimum thus occurs at maximum $\alpha$ and maximum $\beta$; the maximum occurs at minimum $\alpha$ and minimum $\beta$. ∎
Theorem 4.3 (Accuracy Interval).
Given uncertain prevalence $\pi \in [\pi_L, \pi_U]$, the accuracy lies in:
$$\mathrm{ACC} \in \left[\min_{(\alpha,\beta,\pi)} \big(\pi(1-\beta) + (1-\pi)(1-\alpha)\big),\; \max_{(\alpha,\beta,\pi)} \big(\pi(1-\beta) + (1-\pi)(1-\alpha)\big)\right], \tag{14}$$
where the optimization ranges over the parameter intervals and considers the dependency structure of the accuracy formula.
4.4 Conservative Bounds
When dependencies between parameters are unknown, we adopt conservative (worst-case) bounds:
Definition 4.4 (Conservative Interval).
The conservative interval for a measure $m$ is:
$$\left[\min_{\theta \in \Theta} m(\theta),\; \max_{\theta \in \Theta} m(\theta)\right],$$
where $\Theta$ is the Cartesian product of all parameter intervals.
4.5 Interval Width and Uncertainty
The width of an interval quantifies our uncertainty:
Definition 4.5 (Uncertainty Measure).
For interval $[x_L, x_U]$, the uncertainty is:
$$w([x_L, x_U]) = x_U - x_L.$$
Proposition 4.6 (Uncertainty Propagation).
For PPV with fixed $n_+$ and $n_-$, the uncertainty is, by Theorem 4.2, the gap between the optimistic and pessimistic corners of the parameter box:
$$w(\mathrm{PPV}) = \frac{n_+(1-\beta_L)}{n_+(1-\beta_L) + n_-\alpha_L} - \frac{n_+(1-\beta_U)}{n_+(1-\beta_U) + n_-\alpha_U}.$$
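Combining Theorem 4.2 with Definition 4.5 gives a direct computation of the PPV interval and its width (hypothetical helper names; first-order PPV approximation only):

```python
def ppv(n_pos, n_neg, alpha, beta):
    """First-order approximation mu_TP / (mu_TP + mu_FP)."""
    tp = n_pos * (1 - beta)
    return tp / (tp + n_neg * alpha)

def ppv_interval(n_pos, n_neg, a_lo, a_hi, b_lo, b_hi):
    # PPV is decreasing in both alpha and beta, so the interval
    # endpoints come from opposite corners of the parameter box.
    return (ppv(n_pos, n_neg, a_hi, b_hi),  # pessimistic corner
            ppv(n_pos, n_neg, a_lo, b_lo))  # optimistic corner

lo, hi = ppv_interval(10**3, 10**5, 0.005, 0.02, 0.0, 0.01)
print(f"PPV in [{lo:.3f}, {hi:.3f}], uncertainty w = {hi - lo:.3f}")
```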
5 Set Operations and Error Propagation
We now analyze how errors propagate through set-theoretic operations on Bernoulli sets.
5.1 Union Operation
Theorem 5.1 (Union Error Rates).
For independent Bernoulli sets $\tilde{A}$ and $\tilde{B}$ with parameters $(\alpha_A, \beta_A)$ and $(\alpha_B, \beta_B)$:
$$\alpha_{A \cup B} = \alpha_A + \alpha_B - \alpha_A\alpha_B, \tag{15}$$
$$\beta_{A \cup B} = \beta_A\beta_B \quad \text{for } x \in A \cap B. \tag{16}$$
Proof.
For false positives: an element $x \notin A \cup B$ appears in $\tilde{A} \cup \tilde{B}$ if it appears in $\tilde{A}$ or $\tilde{B}$:
$$\alpha_{A \cup B} = 1 - (1-\alpha_A)(1-\alpha_B) = \alpha_A + \alpha_B - \alpha_A\alpha_B.$$
For false negatives: an element is missing from $\tilde{A} \cup \tilde{B}$ only if it’s missing from both:
$$\beta_{A \cup B} = \beta_A\beta_B$$
when $x$ is in both sets. The general case (e.g., $x \in A \setminus B$) requires considering which set(s) contain $x$. ∎
Corollary 5.2 (Union Bounds).
For any union:
$$\max(\alpha_A, \alpha_B) \le \alpha_{A \cup B} \le \alpha_A + \alpha_B, \tag{17}$$
$$\beta_{A \cup B} \le \max(\beta_A, \beta_B). \tag{18}$$
5.2 Intersection Operation
Theorem 5.3 (Intersection Error Rates).
For independent Bernoulli sets:
$$\alpha_{A \cap B} = \alpha_A\alpha_B, \tag{19}$$
$$\beta_{A \cap B} = \beta_A + \beta_B - \beta_A\beta_B \quad \text{for } x \in A \cap B. \tag{20}$$
Proof.
For false positives: $x \notin A \cup B$ appears in $\tilde{A} \cap \tilde{B}$ only if it appears in both observed sets:
$$\alpha_{A \cap B} = \alpha_A\alpha_B.$$
For false negatives: $x \in A \cap B$ is missing if it’s missing from either observed set:
$$\beta_{A \cap B} = 1 - (1-\beta_A)(1-\beta_B) = \beta_A + \beta_B - \beta_A\beta_B.$$
∎
5.3 Complement Operation
Theorem 5.4 (Complement Error Rates).
For a Bernoulli set $\tilde{S}$ with parameters $(\alpha, \beta)$, the complement $\tilde{S}^c$ has parameters:
$$\alpha_{S^c} = \beta, \qquad \beta_{S^c} = \alpha.$$
Proof.
The complement swaps the roles of positives and negatives, thus swapping false positive and false negative rates. ∎
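The rate-propagation rules of Theorems 5.1, 5.3, and 5.4 reduce to a few lines of arithmetic; a sketch, assuming independent errors and (for the false negative rates) an element in both latent sets:

```python
def union_rates(a1, b1, a2, b2):
    """Theorem 5.1: FP if either observed set errs; FN only if both do."""
    return a1 + a2 - a1 * a2, b1 * b2

def intersection_rates(a1, b1, a2, b2):
    """Theorem 5.3: FP only if both err; FN if either observed set errs."""
    return a1 * a2, b1 + b2 - b1 * b2

def complement_rates(a, b):
    """Theorem 5.4: complement swaps the two error types."""
    return b, a

# Two Bloom-filter-like sets (beta = 0): intersection drives alpha down,
# union pushes it up, and complements introduce false negatives.
print(union_rates(0.01, 0.0, 0.02, 0.0))         # (0.0298, 0.0)
print(intersection_rates(0.01, 0.0, 0.02, 0.0))  # (0.0002, 0.0)
print(complement_rates(0.01, 0.0))               # (0.0, 0.01)
```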
5.4 Symmetric Difference
Theorem 5.5 (Symmetric Difference).
For the symmetric difference $A \mathbin{\triangle} B = (A \setminus B) \cup (B \setminus A)$:
$$\alpha_{A \triangle B} = \alpha_A + \alpha_B - 2\alpha_A\alpha_B \quad \text{for } x \notin A \cup B, \tag{21}$$
$$\beta_{A \triangle B} = \beta_A + \alpha_B - 2\beta_A\alpha_B \quad \text{for } x \in A \setminus B, \tag{22}$$
and symmetrically for $x \in B \setminus A$.
5.5 Correlated Errors
When errors are correlated (e.g., due to shared hash functions), the independence assumptions break down.
Definition 5.6 (Correlation Coefficient).
The correlation between membership tests is:
$$\rho = \operatorname{Corr}\big(\mathbb{1}_{\tilde{A}}(x),\, \mathbb{1}_{\tilde{B}}(x)\big),$$
where $\mathbb{1}_{\tilde{A}}$ and $\mathbb{1}_{\tilde{B}}$ are indicator variables for membership in $\tilde{A}$ and $\tilde{B}$.
Theorem 5.7 (Correlated Union).
With correlation $\rho$ between the false positive indicators:
$$\alpha_{A \cup B} = \alpha_A + \alpha_B - \alpha_A\alpha_B - \rho\sqrt{\alpha_A(1-\alpha_A)\,\alpha_B(1-\alpha_B)}.$$
5.6 Composition of Operations
Theorem 5.8 (Error Accumulation).
For a composition of $k$ set operations whose operands have error rates $\alpha_i$ and $\beta_i$, the error rates grow at most linearly:
$$\alpha_{\text{composed}} \le \sum_{i=1}^{k} \alpha_i, \tag{23}$$
$$\beta_{\text{composed}} \le \sum_{i=1}^{k} \beta_i. \tag{24}$$
This linear growth in error rates limits the depth of practical compositions.
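Iterating the union rule shows the slightly sublinear accumulation that the linear bound (23) caps; a quick check, reusing the union rule from the previous sketch:

```python
def union_rates(a1, b1, a2, b2):  # Theorem 5.1, as above
    return a1 + a2 - a1 * a2, b1 * b2

alpha, beta = 0.01, 0.0
for k in range(2, 7):
    alpha, beta = union_rates(alpha, beta, 0.01, 0.0)
    print(f"after {k} unions: alpha = {alpha:.5f}")  # stays just below k * 0.01
```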
6 Implementation and Applications
We now examine how classical probabilistic data structures implement
Bernoulli sets and analyze their space-time tradeoffs.
6.1 Bloom Filters
The Bloom filter is the canonical implementation of a Bernoulli set with $\beta = 0$.
Theorem 6.1 (Bloom Filter Parameters).
A Bloom filter with $m$ bits, $k$ hash functions, and $n$ inserted elements achieves:
$$\alpha = \left(1 - \left(1 - \tfrac{1}{m}\right)^{kn}\right)^k \approx \left(1 - e^{-kn/m}\right)^k, \qquad \beta = 0.$$
The optimal choice $k = (m/n)\ln 2$ yields $\alpha = 2^{-k} \approx 0.6185^{m/n}$.
Proof.
The probability a specific bit is not set by a specific hash of a specific element is $1 - 1/m$. After inserting $n$ elements with $k$ hashes each:
$$\Pr(\text{bit still } 0) = \left(1 - \tfrac{1}{m}\right)^{kn} \approx e^{-kn/m}.$$
A false positive occurs when all $k$ bits for a non-member are set:
$$\alpha = \left(1 - e^{-kn/m}\right)^k.$$
Minimizing with respect to $k$ yields the stated optimum. ∎
Corollary 6.2 (Space Complexity).
To achieve false positive rate $\alpha$ requires:
$$m \ge \frac{n \log_2(1/\alpha)}{\ln 2} \approx 1.44\, n \log_2(1/\alpha) \text{ bits}.$$
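A minimal Bloom filter sketch that sizes itself from Theorem 6.1 and Corollary 6.2. The SHA-256-based double hashing (Kirsch-Mitzenmacher style) is an illustrative choice, not part of the analysis above:

```python
import math
import hashlib

class BloomFilter:
    """Minimal Bloom filter: beta = 0, alpha ~= (1 - e^{-kn/m})^k."""

    def __init__(self, n_expected: int, fp_rate: float):
        # m >= 1.44 * n * log2(1/alpha)  (Corollary 6.2)
        self.m = math.ceil(-n_expected * math.log(fp_rate) / math.log(2) ** 2)
        # optimal k = (m/n) ln 2  (Theorem 6.1)
        self.k = max(1, round(self.m / n_expected * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _indexes(self, item: str):
        # Derive k indexes from one digest via double hashing.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, item: str) -> None:
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is always correct (beta = 0).
        return all(self.bits[idx // 8] >> (idx % 8) & 1
                   for idx in self._indexes(item))

bf = BloomFilter(n_expected=1_000_000, fp_rate=0.01)
bf.add("example.com/page")
assert "example.com/page" in bf
```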
6.2 Counting Bloom Filters
Counting Bloom filters extend standard Bloom filters to support deletions and multiplicity queries.
Theorem 6.3 (Counting Filter Overflow).
With $b$-bit counters and the optimal load $kn/m = \ln 2$:
$$\Pr(\text{any counter overflows}) \le m\left(\frac{e\ln 2}{2^b}\right)^{2^b}.$$
6.3 Count-Min Sketch
The Count-Min sketch approximates multiset membership with one-sided error.
Theorem 6.4 (Count-Min Error).
With width $w = \lceil e/\epsilon \rceil$ and depth $d = \lceil \ln(1/\delta) \rceil$:
$$\hat{c} \ge c \quad \text{and} \quad \Pr\big(\hat{c} > c + \epsilon N\big) \le \delta,$$
where $\hat{c}$ is the estimated count, $c$ is the true count, and $N$ is the total count over all items.
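A corresponding Count-Min sketch sized from Theorem 6.4 (again with illustrative SHA-256 hashing; a sketch, not a production implementation):

```python
import math
import hashlib

class CountMinSketch:
    """Count-Min sketch: estimates never undercount, and overcount
    by more than eps * N with probability at most delta."""

    def __init__(self, eps: float, delta: float):
        self.w = math.ceil(math.e / eps)         # width (Theorem 6.4)
        self.d = math.ceil(math.log(1 / delta))  # depth
        self.table = [[0] * self.w for _ in range(self.d)]

    def _index(self, row: int, item: str) -> int:
        # One hash per row, derived from SHA-256 (illustrative choice).
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.d):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item: str) -> int:
        # Minimum over rows gives the tightest upper bound on the true count.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.d))
```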
6.4 Cuckoo Filters
Cuckoo filters provide an alternative to Bloom filters with better cache locality and deletion support.
Theorem 6.5 (Cuckoo Filter Parameters).
A cuckoo filter with load factor $\rho$ and fingerprint size $f$ bits achieves:
$$\alpha \approx \frac{2b}{2^f}, \qquad \text{space} \approx \frac{f}{\rho} \text{ bits per element},$$
where $b$ is the bucket size (typically 4).
6.5 Applications
6.5.1 Database Systems
Bernoulli sets accelerate database operations:
- Join optimization: Pre-filter join candidates using Bloom filters
- Duplicate detection: Identify potential duplicates in large datasets
- Query routing: Route queries to relevant shards in distributed databases
Example 6.6 (Distributed Join).
Consider joining tables $R$ and $S$ across a network. Send a Bloom filter of $R$’s join keys to $S$’s node. Filter locally: only send the tuples of $S$ whose join keys test positive in the filter. This reduces network traffic to roughly the join selectivity plus an $\alpha$ fraction of the non-matching tuples.
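The traffic arithmetic behind this example is simple; a sketch with a hypothetical `shipped_fraction` helper, where `match_fraction` is the join selectivity:

```python
def shipped_fraction(match_fraction: float, alpha: float) -> float:
    """Fraction of S's tuples that survive the Bloom pre-filter:
    all true matches plus an alpha fraction of the non-matches."""
    return match_fraction + alpha * (1 - match_fraction)

# 1% join selectivity with a 0.1% FPR filter: ship ~1.1% instead of 100%.
print(shipped_fraction(0.01, 0.001))
```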
6.5.2 Network Systems
Network applications leverage space efficiency:
- Packet filtering: Block malicious IPs or URLs
- Flow monitoring: Track unique flows in high-speed networks
- Content routing: Forward requests in content delivery networks
Example 6.7 (DDoS Mitigation).
Maintain a Bloom filter of legitimate source IPs seen recently. During an attack, drop packets from IPs not in the filter. The false positive rate determines how much attack traffic still slips through, while new legitimate users, not yet in the filter, bear the collateral damage.
6.5.3 Bioinformatics
Genomic applications handle massive sequence data:
- K-mer indexing: Test presence of sequence fragments
- Read classification: Assign sequencing reads to reference genomes
- Variant calling: Identify genetic variations efficiently
Example 6.8 (K-mer Membership).
The human genome has $\approx 3 \times 10^9$ base pairs, yielding on the order of $3 \times 10^9$ distinct 31-mers. Exact storage requires 100GB. A Bloom filter with a small false positive rate uses 15GB, enabling in-memory processing on commodity hardware.
6.6 Space-Time Tradeoffs
Theorem 6.9 (Information-Theoretic Lower Bound).
Any data structure answering membership queries over a set of $n$ elements with false positive rate $\alpha$ and no false negatives requires at least:
$$n \log_2(1/\alpha)$$
bits in the worst case.
Proof.
There are $\binom{|U|}{n}$ possible sets of size $n$. Any memory state that answers with false positive rate $\alpha$ accepts at most $n + \alpha|U|$ elements, and hence can correctly serve at most $\binom{n + \alpha|U|}{n}$ of these sets. Distinguishing all sets therefore requires at least $\log_2 \binom{|U|}{n} - \log_2 \binom{n + \alpha|U|}{n}$ bits. For large $|U|$, this approaches $n \log_2(1/\alpha)$. ∎
Corollary 6.10 (Bloom Filter Optimality).
Bloom filters achieve within 44% of the information-theoretic lower bound:
$$\frac{m_{\text{Bloom}}}{m_{\text{opt}}} = \frac{n \log_2(1/\alpha)/\ln 2}{n \log_2(1/\alpha)} = \frac{1}{\ln 2} \approx 1.44.$$
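The 1.44 factor is easy to verify numerically; lower-bound versus Bloom bits per element for a few target rates:

```python
import math

for alpha in (1e-2, 1e-3, 1e-4):
    optimal = math.log2(1 / alpha)    # information-theoretic bound
    bloom = optimal / math.log(2)     # Bloom filter requirement
    print(f"alpha={alpha:g}: {optimal:.2f} vs {bloom:.2f} bits/element "
          f"(ratio {bloom / optimal:.2f})")  # ratio = 1/ln 2 ~ 1.44
```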
7 Case Studies
7.1 Case Study 1: Web Crawling
Web crawlers must track billions of visited URLs to avoid redundant fetches. Storing exact URLs requires 50-100 bytes per URL. For 10 billion URLs, this demands 500GB-1TB of RAM.
Solution: Use a Bloom filter with $\alpha = 10^{-3}$. Space requirement: $1.44 \times 10^{10} \times \log_2 10^3 \approx 1.4 \times 10^{11}$ bits $\approx 18$ GB. The false positive rate means 0.1% of new URLs are incorrectly skipped—acceptable for most crawlers.
Optimization: Use rolling Bloom filters with time-based eviction for crawl freshness.
7.2 Case Study 2: Distributed Caching
Content delivery networks (CDNs) need to quickly determine which edge server likely has cached content.
Solution: Each edge server maintains a Bloom filter of cached objects, shared with routing layer. Router checks filters to identify servers likely to have content.
Analysis: With $10^6$ cached objects per server and $\alpha = 1\%$, each filter uses about 1.2 MB. For 1000 servers, the routing table uses 1.2 GB total. False positives cause occasional unnecessary forwarding, but save significant routing table space.
7.3 Case Study 3: Genomic Analysis
Metagenomics pipelines classify DNA reads by testing against reference databases of known organisms.
Challenge: Human microbiome databases contain billions of unique k-mers across thousands of species.
Solution: Build species-specific Bloom filters. Test each read’s k-mers against filters to identify likely source organisms.
Performance: With $\alpha = 10^{-4}$ per k-mer and 100 k-mers per read, the read-level false positive rate is at most $1 - (1 - 10^{-4})^{100} \approx 1\%$. Confirmatory alignment validates classifications.
8 Conclusion
This paper presented Bernoulli sets as a comprehensive statistical framework for probabilistic set membership. Our key contributions include:
1. **Unified Theory**: We showed how the latent/observed duality provides a principled foundation for understanding all probabilistic set data structures through their confusion matrices.
2. **Rigorous Analysis**: We derived exact distributions and confidence intervals for all standard performance measures, providing both theoretical insights and practical formulas for system designers.
3. **Uncertainty Quantification**: Our interval arithmetic framework enables robust analysis when error rates are imprecisely known, yielding conservative guarantees for risk-averse applications.
4. **Operational Calculus**: We established tight bounds on error propagation through set operations, revealing fundamental limits on composability.
5. **Practical Impact**: We demonstrated how classical implementations achieve near-optimal space-time tradeoffs within our framework, validating decades of engineering practice.
8.1 Future Directions
Several promising directions extend this work:
Adaptive Error Rates: Develop Bernoulli sets that adapt their error rates based on access patterns, allocating lower error rates to frequently queried elements.
Learned Indexes: Integrate machine learning to predict membership, using Bernoulli sets as a theoretical framework for learned data structures.
Distributed Protocols: Extend the framework to distributed settings where set representations must be synchronized across nodes with bandwidth constraints.
Privacy-Preserving Variants: Combine with differential privacy to create data structures that are both space-efficient and privacy-preserving.
Higher-Order Types: Generalize beyond sets to Bernoulli relations, functions, and other higher-order types with appropriate confusion matrices.
The Bernoulli set framework provides a solid theoretical foundation for approximate membership testing, unifying disparate techniques under a common statistical model. As data volumes continue to grow exponentially, such principled approaches to approximation become increasingly vital for building scalable systems.
8.2 Acknowledgments
[Acknowledgments would appear here in final version]
References
Appendix A Proof Details
A.1 Delta Method for PPV
The delta method approximates the distribution of $g(X)$, where $X$ is a random vector and $g$ is a differentiable function.
For $\mathrm{PPV} = g(\mathrm{TP}, \mathrm{FP})$ with $g(t, f) = t/(t+f)$, the partial derivatives are:
$$\frac{\partial g}{\partial t} = \frac{f}{(t+f)^2}, \tag{25}$$
$$\frac{\partial g}{\partial f} = -\frac{t}{(t+f)^2}. \tag{26}$$
The second-order approximation:
$$\mathbb{E}[g(X)] \approx g(\mu) + \tfrac{1}{2}\operatorname{tr}(H\Sigma).$$
With TP and FP independent:
$$\Sigma = \begin{pmatrix} \sigma^2_{\mathrm{TP}} & 0 \\ 0 & \sigma^2_{\mathrm{FP}} \end{pmatrix}.$$
Substituting and simplifying yields the stated formula.
A.2 Information-Theoretic Bound Proof
Consider the problem of representing any subset $S \subseteq U$ with $|S| = n$ such that membership queries have false positive rate at most $\alpha$.
The number of possible sets is $\binom{|U|}{n}$. To represent one with error probability at most $\alpha$, we need to distinguish it from the others.
Each memory state of the structure accepts a set of at most $n + \alpha|U|$ elements, so a single state can correctly serve at most $\binom{n + \alpha|U|}{n}$ of the possible sets. To achieve error rate $\alpha$ for every set, we need:
$$2^m \ge \frac{\binom{|U|}{n}}{\binom{n + \alpha|U|}{n}}.$$
By counting arguments, this requires:
$$m \ge \log_2 \binom{|U|}{n} - \log_2 \binom{n + \alpha|U|}{n} \approx n \log_2(1/\alpha)$$
for large $|U|$.