Bernoulli Types: A Foundation for Approximate and Oblivious Computing

I’ve been working on a series of papers that develop a unified theoretical framework for approximate and oblivious computing. The central concept is what I call Bernoulli types. The papers explore how to build rigorous foundations for systems that intentionally introduce controlled approximation, whether for privacy, space efficiency, or computational tractability.

The core idea: latent vs observed

The fundamental insight is simple: distinguish between latent (true) values and observed (approximate) values. This duality isn’t just about probabilistic data structures like Bloom filters. It’s a deeper principle that applies to:

  • Algorithms that deliberately sample or approximate (streaming algorithms, sketches)
  • Data structures with false positive guarantees (Bloom filters, Count-Min sketch, HyperLogLog)
  • Privacy mechanisms that add controlled “noise” through lossy observations
  • Encrypted search systems where query results are inherently approximate

My MS thesis in CS (2015) dealt with encrypted search, where this latent/observed duality shows up naturally. You store encrypted data, you query it approximately, and the approximation itself provides a form of privacy. That work planted the seed for a more general framework.

Four papers, one framework

Paper 1: The latent/observed duality (paper)

This paper establishes the type-theoretic foundations. The mapping from latent to observed isn’t just noise. It’s a structural property we can reason about formally.

Type: Latent -> Observed
Property: Many-to-one mapping creates indistinguishability
Result: Privacy and approximation from the same mathematical structure

Bloom filters, streaming algorithms, and privacy mechanisms all arise from the same underlying pattern: rank-deficient observation functions, whose codomain is far smaller than their domain. That many-to-one structure is what gives you both space efficiency and privacy.
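A toy illustration of the pattern (my own sketch, not a construction from the paper): hashing into a small codomain is many-to-one, so distinct latent values can share an observed value, and each observed value names a whole indistinguishability class of latent values.

```python
# Sketch: a many-to-one observation function built from a hash.
# The names and parameters here are illustrative assumptions.
import hashlib

def observe(latent: str, codomain_bits: int = 4) -> int:
    """Map a latent string into a 2**codomain_bits-element codomain."""
    digest = hashlib.sha256(latent.encode()).digest()
    return int.from_bytes(digest, "big") % (1 << codomain_bits)

# Group latent values by their observation: each group is an
# indistinguishability class from the observer's point of view.
classes: dict[int, list[str]] = {}
for name in ("alice", "bob", "carol", "dave", "erin", "frank"):
    classes.setdefault(observe(name), []).append(name)
```

An observer who sees only `observe(x)` cannot tell which member of the class produced it; the same collapse is why the observation is cheaper to store than the latent value.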

Paper 2: Asymptotic indistinguishability (paper)

Traditional differential privacy adds calibrated noise. Our approach is different: we use lossy compression and many-to-one mappings to achieve privacy through rank-deficient observations.

The advantage: no noise calibration needed. Privacy emerges naturally from the observation function’s structure. As the compression ratio increases, distinct inputs become computationally indistinguishable.
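A quick numerical check of that claim (my illustration, under the assumption that the observation is a hash into a codomain of tunable size): shrinking the codomain collapses more distinct inputs into the same observation, so individual inputs stop being identifiable.

```python
# Sketch: distinct observations as a function of codomain size.
import hashlib

def observe(x: int, codomain: int) -> int:
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest, "big") % codomain

def distinct(n: int, codomain: int) -> int:
    """How many distinct observations do n distinct inputs produce?"""
    return len({observe(x, codomain) for x in range(n)})

wide = distinct(1000, 1 << 30)  # huge codomain: nearly every input is separable
narrow = distinct(1000, 16)     # tiny codomain: at most 16 classes remain
```

No noise is injected anywhere; the loss of distinguishability comes entirely from the compression ratio of the observation function.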

This provides an alternative privacy framework that’s more natural for applications already using approximate data structures. You don’t bolt on privacy as an afterthought. It falls out of the design.

Paper 3: Hash-based Bernoulli constructions (paper)

This paper unifies seemingly disparate probabilistic data structures under one framework:

  • Bloom filters for set membership
  • Count-Min sketch for frequency estimation
  • HyperLogLog for cardinality estimation
  • MinHash for similarity estimation

All of these arise from the same construction: hash functions + Bernoulli trials. We prove they’re all space-optimal for their respective problems and show how to systematically derive new structures.

The framework provides rigorous error bounds, composition properties, and trade-off analysis between space, time, and accuracy.
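To make the "hash functions + Bernoulli trials" construction concrete, here is a minimal Bloom filter sketch (my own illustration, with assumed parameters, not the paper's formal construction): each insert sets k hashed bit positions, and each queried bit behaves like a Bernoulli trial, which is where the false positive rate comes from.

```python
# Sketch: a Bloom filter as hash functions + Bernoulli trials.
import hashlib

class BloomFilter:
    def __init__(self, m_bits: int, k_hashes: int):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits)  # one byte per bit, for clarity

    def _indices(self, item: str):
        # Derive k independent-looking hash positions from one hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, item: str) -> None:
        for idx in self._indices(item):
            self.bits[idx] = 1

    def __contains__(self, item: str) -> bool:
        # All k bits set => "probably present"; any bit clear => definitely absent.
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter(m_bits=1024, k_hashes=4)
bf.add("alice")
assert "alice" in bf  # no false negatives
# "mallory" in bf may occasionally be True: a false positive.
```

Swapping the per-query decision rule while keeping the hash-plus-trials skeleton yields the other structures in the list, which is the sense in which the framework unifies them.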

Paper 4: Bernoulli sets (paper)

Sets are fundamental. But what if you can’t afford to store exact sets? This paper develops a complete statistical framework for approximate set operations.

The key contributions: formal semantics for set operations with error bounds, composition rules (how errors propagate through unions and intersections), optimal parameter selection, and the connection to hypothesis testing (set membership as statistical inference).

This gives you principled ways to build systems that trade space for accuracy while maintaining statistical guarantees.
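As a small worked instance of the composition rules (my illustration, assuming independent errors across the two sets): a union query reports "member" if either approximate set does, and an intersection query only if both do, so false positive rates compose as below.

```python
# Sketch: false-positive-rate propagation through union and intersection,
# assuming the two sets' errors are independent.

def union_fpr(a: float, b: float) -> float:
    # Positive if either side is (falsely) positive.
    return a + b - a * b

def intersection_fpr(a: float, b: float) -> float:
    # Positive only if both sides are (falsely) positive.
    return a * b

print(union_fpr(0.01, 0.02))         # slightly under 0.03
print(intersection_fpr(0.01, 0.02))  # 0.0002
```

Rules like these are what let you pick per-structure parameters so that a whole query plan, not just one lookup, meets a target error bound.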

Why now

We’re in an era where:

  • privacy is non-negotiable, but exact computation leaks information
  • scale demands approximation, because you can’t store everything exactly
  • streaming data requires constant-space algorithms
  • encrypted search needs oblivious data structures that hide access patterns

Bernoulli types provide a unified mathematical foundation for all of these. Instead of ad-hoc probabilistic structures, you get:

  1. Composable abstractions with clear error semantics
  2. Type safety that prevents mixing latent and observed values
  3. Formal proofs of privacy and accuracy guarantees
  4. Systematic constructions for new data structures
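Item 2 can be sketched with ordinary wrapper types (my own illustration; the papers' type theory is richer): latent and observed values get distinct types, so a static checker rejects code that feeds an observation back in where a true value is required.

```python
# Sketch: distinct Latent/Observed wrapper types. Names and the toy
# observe() function are assumptions for illustration.
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")

@dataclass(frozen=True)
class Latent(Generic[T]):
    value: T  # the true value; never shown to observers

@dataclass(frozen=True)
class Observed(Generic[T]):
    value: T    # the approximate observation
    fpr: float  # the error rate attached to it

def observe(x: Latent[int], codomain: int = 16) -> Observed[int]:
    return Observed(hash(x.value) % codomain, fpr=1.0 / codomain)

secret = Latent(42)
view = observe(secret)
# observe(view) is a type error under mypy: an Observed[int] is not a
# Latent[int], so the two kinds of value can't be silently confused.
```

Carrying the error rate in the observed type is also what makes the composition rules from Paper 4 checkable at the API boundary rather than by convention.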

What’s next

These foundations enable several research directions: oblivious algorithms that leak nothing about their inputs, privacy-preserving machine learning with provable guarantees, space-optimal streaming systems with rigorous error bounds, and encrypted databases with efficient private queries.

I’m actively working on implementations and applications. If you work on approximate computing, privacy, or probabilistic data structures, I think this framework is worth looking at.

This work builds on my earlier research in encrypted search and oblivious computing. See my CS Master’s thesis (2015) for the original motivation.
