Hash-Based Oblivious Sets: A Practical Framework for
Privacy-Preserving Set Operations with Probabilistic Guarantees

Alexander Towell Department of Computer Science
Southern Illinois University Edwardsville
Email: atowell@siue.edu

Abstract

We present Hash-Based Oblivious Sets (HBOS), a practical framework for privacy-preserving set operations that combines cryptographic hash functions with probabilistic data structures. Unlike traditional approaches using fully homomorphic encryption or secure multi-party computation, HBOS achieves microsecond-scale performance by embracing approximate operations with explicitly managed error rates.

Our framework provides: (1) a systematic approach to propagating error bounds through composed set operations, (2) efficient Boolean and symmetric difference operations on hash-transformed data with quantifiable false positive rates, and (3) practical implementations of privacy-preserving primitives including private set intersection and secure aggregation. We build upon established techniques from probabilistic data structures (Bloom filters, HyperLogLog) while adding cryptographic privacy through one-way hash transformations.

Experimental evaluation demonstrates that HBOS operations complete in 0.4-2.1 microseconds for typical workloads, offering 1000-10000× speedup over homomorphic encryption approaches. The framework provides privacy guarantees bounded by the collision probability of the underlying hash function (e.g., $2^{-256}$ for SHA-256). We validate HBOS through implementations of private set intersection, secure deduplication, and federated learning aggregation, showing practical applicability where approximate results with explicit error bounds are acceptable.

Index Terms:

cryptographic hash functions, oblivious data structures, privacy-preserving computation, approximate algorithms, probabilistic data structures

I Introduction

The proliferation of cloud computing and data analytics has created an urgent need for privacy-preserving computational frameworks that enable operations on sensitive data without exposing the underlying values. Traditional approaches to this problem fall into two categories: cryptographic techniques such as fully homomorphic encryption (FHE) [18] and secure multi-party computation (MPC) [31], and statistical techniques such as differential privacy [12]. While powerful, these approaches often suffer from computational overhead that limits their practical deployment.

We present Hash-Based Oblivious Sets (HBOS), a practical framework that achieves efficient privacy-preserving set operations by combining cryptographic hash functions with probabilistic data structures. Our approach differs from existing solutions by explicitly embracing approximation, making error rates transparent throughout the computational pipeline. This design choice enables HBOS to achieve microsecond-scale performance while providing privacy bounded by hash collision probabilities.

The core insight underlying HBOS is that many real-world applications can tolerate controlled error rates in exchange for efficiency and privacy. By using cryptographic hash functions as one-way transformations, we map sensitive data into a hash domain where equality testing is preserved probabilistically while the original values remain computationally hidden. All subsequent operations work exclusively on these hash values, never requiring access to the plaintext. This approach builds upon well-established techniques from probabilistic data structures while adding cryptographic privacy guarantees.

I-A Motivating Example

Consider a healthcare consortium where multiple hospitals need to identify patients enrolled in overlapping clinical trials without revealing their complete patient lists. Traditional approaches would require either: (1) a trusted third party to perform the intersection, violating privacy requirements, (2) homomorphic encryption with prohibitive computational costs, or (3) complex MPC protocols requiring multiple rounds of communication.

Using HBOS, each hospital can:

1.

Transform patient identifiers using a shared cryptographic hash function
2.

Perform set intersection directly on hash values
3.

Obtain results with explicit error bounds (e.g., false positive rate ¡ $2^{-256}$ )
4.

Complete the entire operation in milliseconds rather than minutes

This example illustrates HBOS’s key advantage: practical performance with quantifiable privacy guarantees.

I-B Contributions

This paper makes the following contributions:

•

Systematic Error Propagation: We provide a formal framework for propagating error bounds through composed set operations on hash-transformed data, enabling applications to reason about accuracy-privacy trade-offs.
•

Practical Implementation: We demonstrate that combining cryptographic hashing with probabilistic data structures achieves microsecond-scale performance for privacy-preserving set operations, offering 1000-10000× speedup over homomorphic encryption.
•

Integration of Existing Techniques: We show how established algorithms (Bloom filters, HyperLogLog) can be enhanced with cryptographic privacy guarantees through systematic application of hash transformations.
•

Real-World Applications: We validate HBOS through implementations of private set intersection, secure deduplication, and federated learning aggregation, demonstrating practical utility where approximate results are acceptable.
•

Security Analysis: We provide formal analysis showing that privacy is bounded by the collision probability of the underlying hash function, with explicit quantification of error rates.

I-C Paper Organization

The remainder of this paper is organized as follows. Section II provides background on cryptographic hash functions and threat model. Section III presents the HBOS framework design and core abstractions. Section IV develops the mathematical foundations and security analysis. Section V describes our implementation and optimization techniques. Section VI evaluates performance and security properties. Section VII explores applications across multiple domains. Section VIII discusses related work. Section IX concludes.

II Background and Threat Model

II-A Cryptographic Hash Functions

A cryptographic hash function $H:\{0,1\}^{*}\rightarrow\{0,1\}^{n}$ maps arbitrary-length inputs to fixed-size outputs while satisfying three key properties:

Definition 1 (Preimage Resistance).

Given hash value $h$ , finding any $x$ such that $H(x)=h$ requires $O(2^{n})$ operations.

Definition 2 (Second Preimage Resistance).

Given $x_{1}$ , finding $x_{2}\neq x_{1}$ such that $H(x_{1})=H(x_{2})$ requires $O(2^{n})$ operations.

Definition 3 (Collision Resistance).

Finding any pair $(x_{1},x_{2})$ where $x_{1}\neq x_{2}$ and $H(x_{1})=H(x_{2})$ requires $O(2^{n/2})$ operations.

HBOS leverages these properties to create one-way transformations that preserve equality testing probabilistically while preventing recovery of original values.

II-B Approximate Data Structures

Approximate data structures trade perfect accuracy for improved space or time complexity. The canonical example is the Bloom filter [1], which supports membership queries with false positives but no false negatives. HBOS builds upon these well-established techniques, adding cryptographic privacy through hash transformations while maintaining explicit error bounds.

Definition 4 (Approximate Boolean).

An approximate Boolean value is a tuple $(v,\epsilon_{p},\epsilon_{n})$ where $v\in\{true,false\}$ is the estimated value, $\epsilon_{p}$ is the false positive rate, and $\epsilon_{n}$ is the false negative rate.

II-C Threat Model

We consider an honest-but-curious adversary model where:

•

Participants follow the protocol correctly but attempt to learn additional information
•

The adversary has access to hash values but cannot invert the hash function
•

The adversary may have auxiliary information about the data distribution
•

The cryptographic hash function is modeled as a random oracle

We explicitly exclude:

•

Malicious adversaries who deviate from the protocol
•

Side-channel attacks on the implementation
•

Quantum adversaries (though post-quantum hash functions could be used)

III System Design

III-A Core Abstractions

HBOS is built around three core abstractions that compose to enable complex privacy-preserving operations:

III-A1 Hash-Oblivious Values

A hash-oblivious value encapsulates the one-way transformation of sensitive data through cryptographic hashing. The key insight is that equality testing on hash values preserves the equality relation probabilistically while preventing recovery of original values. The false positive rate equals the collision probability of the hash function (e.g., $2^{-256}$ for SHA-256). Implementation details are provided in Appendix A.

III-A2 Approximate Values

All operations in HBOS return approximate values with explicit error rates. Each approximate value maintains both the computed result and its associated false positive and false negative rates. This design makes uncertainty explicit and enables informed decision-making about accuracy-privacy trade-offs. The confidence in a result equals $1-\epsilon_{p}-\epsilon_{n}$ where $\epsilon_{p}$ and $\epsilon_{n}$ are the false positive and negative rates respectively.

III-A3 Set Operations

HBOS provides two primary set implementations with different algebraic properties:

Boolean Sets support full Boolean algebra:

•

Union: $A\cup B$ with error propagation
•

Intersection: $A\cap B$ with error composition
•

Complement: $\neg A$ with error inversion
•

Membership: $x\in A$ with hash collision probability

Symmetric Difference Sets form a group under XOR:

•

XOR: $A\oplus B$ for disjoint unions
•

Identity: Empty set
•

Inverse: Every set is its own inverse
•

Efficient for aggregation operations

III-B Error Propagation

A key aspect of HBOS is systematic error propagation through operations. For Boolean operations, we derive tight bounds:

Theorem 5 (Union Error Bound).

For sets $A$ and $B$ with false positive rates $\epsilon_{A}$ and $\epsilon_{B}$ :

\epsilon_{A\cup B}\leq\epsilon_{A}+\epsilon_{B}-\epsilon_{A}\cdot\epsilon_{B}

Theorem 6 (Intersection Error Bound).

For sets $A$ and $B$ with false positive rates $\epsilon_{A}$ and $\epsilon_{B}$ :

\epsilon_{A\cap B}\leq\min(\epsilon_{A},\epsilon_{B})

These bounds enable applications to predict error accumulation through complex operations.

III-C Architecture

HBOS follows a layered architecture as shown in Figure 1:

Applications

(PSI, Analytics, Aggregation)

Operations Layer

(Similarity, Cardinality Estimation)

Set Layer

(Boolean Algebra, Symmetric Difference)

Core Primitives

(Hash-Oblivious Values, Approximate Types)

Cryptographic Layer

(SHA-256, BLAKE2b, etc.)

Figure 1: HBOS layered architecture

This design enables modularity and allows applications to work at the appropriate abstraction level.

IV Mathematical Foundations

IV-A Security Analysis

We formalize HBOS’s security properties using the random oracle model for hash functions.

Definition 7 (One-Wayness Game).

The one-wayness game $\mathcal{G}_{OW}$ between challenger $\mathcal{C}$ and adversary $\mathcal{A}$ :

1.

$\mathcal{C}$ selects random $x\in\{0,1\}^{*}$ and key $k$
2.

$\mathcal{C}$ computes $h=H(k||x)$ and sends $h$ to $\mathcal{A}$
3.

$\mathcal{A}$ outputs $x^{\prime}$
4.

$\mathcal{A}$ wins if $x^{\prime}=x$

Theorem 8 (Privacy Preservation).

Let $H:\{0,1\}^{*}\rightarrow\{0,1\}^{n}$ be a random oracle. For any PPT adversary $\mathcal{A}$ , the probability of winning $\mathcal{G}_{OW}$ is negligible:

\Pr[\mathcal{A}\text{ wins }\mathcal{G}_{OW}]\leq 2^{-n}+\text{negl}(n)

Proof.

In the random oracle model, $H(k||x)$ is uniformly distributed over $\{0,1\}^{n}$ . Without knowledge of $k$ , the adversary’s view is independent of $x$ , reducing to random guessing with success probability $2^{-n}$ . ∎

IV-B Approximate Algebraic Properties

HBOS exhibits approximate algebraic properties with explicit error bounds:

Definition 9 (Approximate Set Operations).

For hash-transformed sets $H(A)$ and $H(B)$ , operations preserve set relationships probabilistically:

•

Union: $H(A)\cup H(B)\approx H(A\cup B)$ with error $\epsilon\leq 2^{-n}$
•

Intersection: $H(A)\cap H(B)\approx H(A\cap B)$ with error $\epsilon\leq 2^{-n}$
•

Symmetric difference: $H(A)\oplus H(B)=H(A\triangle B)$ (exact for disjoint sets)

Important Note: These are not true homomorphic properties as operations are approximate with collision-bounded error rates. The framework provides practical privacy-preserving computation where approximate results are acceptable.

IV-C Cardinality Estimation

HBOS incorporates the well-established HyperLogLog algorithm [16] for cardinality estimation on hash-transformed sets. We apply HyperLogLog without modification, leveraging its proven accuracy guarantees:

Algorithm 1 Cardinality Estimation

0: Trapdoor set

S

0: Estimated cardinality

\hat{n}

m\leftarrow

number of buckets

M\leftarrow

array of

m

registers

3: for each trapdoor

t\in S

j\leftarrow

first

\log_{2}m

bits of

t

w\leftarrow

remaining bits of

t

M[j]\leftarrow\max(M[j],\rho(w))

7: end for

\hat{n}\leftarrow\alpha_{m}\cdot m^{2}/\sum_{j=1}^{m}2^{-M[j]}

9: return

\hat{n}

The algorithm achieves relative error $1.04/\sqrt{m}$ using $O(m\log\log n)$ bits.

V Implementation

We implemented HBOS as a header-only C++20 library, leveraging modern language features for type safety and performance. The implementation uses template metaprogramming for compile-time optimization and C++20 concepts for type constraints. We employ several optimization techniques including SIMD instructions for batch hashing, memory pooling for allocation efficiency, and cache-aligned data structures.

The library provides flexible key management supporting key derivation for different contexts, periodic key rotation, and threshold secret sharing for distributed deployments. Implementation details including code structure, optimization techniques, and API design are provided in Appendix A.

VI Evaluation

VI-A Experimental Setup

We evaluate CTS on:

•

Intel Core i9-12900K (16 cores, 24 threads)
•

64GB DDR5 RAM
•

Ubuntu 22.04, GCC 12.2
•

Compiled with -O3 -march=native

VI-B Performance Benchmarks

TABLE I: Operation Latency (microseconds)

Operation	Mean	Std Dev
Trapdoor creation	0.42	0.03
Set insertion (1K elements)	420	12
Set membership test	0.45	0.02
Set intersection (1K each)	892	28
Set union (1K each)	856	24
Cardinality estimation	1.2	0.1
Jaccard similarity	2.1	0.2

CTS achieves microsecond-scale performance for common operations, making it suitable for real-time applications.

VI-C Scalability Analysis

⬇

1Throughput (ops/sec) vs Set Size

210^7 | *

3 | *

410^6 | *

5 | *

610^5 |*_____________

7 10^2 10^4 10^6

8 Set Size

Figure 2: Throughput scaling with set size

Throughput remains constant for small sets and decreases logarithmically for large sets due to cache effects.

VI-D Security Evaluation

We validate security properties through:

Collision Testing: No collisions found in $2^{40}$ random inputs with 256-bit hashes.

Statistical Analysis: Output distributions pass NIST randomness tests.

Timing Analysis: Operations exhibit constant-time behavior preventing timing attacks.

VI-E Comparison with Alternatives

TABLE II: Comparison with Related Systems

System	Privacy	Performance	Accuracy
HBOS (This work)	High	Microseconds	Approximate
FHE	Perfect	Seconds	Exact
MPC	High	Milliseconds	Exact
Bloom Filters	None	Microseconds	Approximate

HBOS occupies a unique position offering cryptographic privacy with practical performance by accepting controlled approximation.

VII Applications

VII-A Private Set Intersection

HBOS enables efficient PSI without revealing non-matching elements. The intersection operation on hash-transformed sets achieves $O(n)$ complexity compared to $O(n^{2})$ for naive approaches. The false positive rate is bounded by the hash collision probability, providing strong privacy guarantees for practical applications.

VII-B Secure Deduplication

Cloud storage providers can identify duplicate files without accessing content:

Algorithm 2 Secure Deduplication

0: File

F

, Trapdoor factory

T

chunks\leftarrow

split

F

into blocks

hashes\leftarrow

empty set

3: for each chunk

c\in chunks

h\leftarrow T.create(c)

5: add

h

h a s h e s

6: end for

7: Query storage for existing

h a s h e s

8: Upload only unique chunks

VII-C Privacy-Preserving Analytics

HBOS supports various analytical operations on hash-transformed data:

Histogram Generation: Count occurrences without revealing underlying values, useful for distribution analysis while preserving privacy.

Frequency Analysis: Identify common patterns in hash-transformed data with collision-bounded error rates.

Similarity Metrics: Compute Jaccard similarity and other set-based metrics on private sets with explicit error quantification.

VII-D Federated Learning

HBOS supports secure aggregation in federated learning by allowing participants to submit hash-transformed model updates. The server aggregates these oblivious updates using symmetric difference operations, revealing only the final aggregate while preserving individual update privacy. This approach is particularly effective when combined with differential privacy for additional statistical guarantees.

VIII Related Work

VIII-A Homomorphic Encryption

Fully homomorphic encryption (FHE) enables arbitrary computation on encrypted data. Since Gentry’s breakthrough [18], significant progress has been made in practical FHE systems. TFHE [10] and CKKS [9] achieve sub-second bootstrapping, while recent work on fully homomorphic encryption over the integers [5] simplifies implementation. However, FHE still incurs 1000-10000× overhead for general computation. Partially homomorphic schemes like Paillier [26] offer better performance but support only specific operations. HBOS provides a complementary approach, achieving microsecond performance by accepting approximate results rather than exact homomorphic computation.

VIII-B Secure Multi-Party Computation

MPC protocols enable joint computation without revealing inputs. Classical protocols [31, 19] laid theoretical foundations, while modern frameworks have made MPC practical. MP-SPDZ [21] provides a comprehensive toolkit supporting multiple protocols, while ABY3 [25] achieves efficient three-party computation. Recent advances in silent OT [4] and function secret sharing [3] have reduced communication complexity. However, MPC still requires multiple rounds of interaction and coordination between parties. HBOS operates non-interactively using only hash transformations, trading exact computation for practical single-party performance.

VIII-C Private Set Intersection

PSI protocols have evolved significantly since early work [24, 17]. Modern protocols achieve remarkable efficiency: OPRF-based PSI [27] handles billion-element sets, while circuit-PSI [7] enables arbitrary computations on intersection results. Unbalanced PSI [8] optimizes for asymmetric set sizes common in practice. Recent work on PSI from pseudorandom correlation generators [28] achieves optimal communication. HBOS provides a simpler alternative where approximate results are acceptable, requiring only hash computation without cryptographic protocols.

VIII-D Approximate Data Structures

Probabilistic data structures trade accuracy for efficiency. Bloom filters [1] pioneered this approach for membership testing. Cuckoo filters [15] improve on Bloom filters by supporting deletions. Count-Min sketches [11] and Count-HyperLogLog [30] enable frequency and cardinality estimation. Recent work includes learned Bloom filters [22] using machine learning to optimize performance, and XOR filters [20] achieving near-optimal space efficiency. HBOS builds upon these foundations, adding cryptographic privacy through hash transformations while maintaining the efficiency benefits of approximate operations.

VIII-E Differential Privacy

Differential privacy (DP) [12] provides statistical privacy guarantees through calibrated noise addition. The field has matured significantly with deployment at major tech companies [14, 29]. Recent advances include the exponential mechanism [23], concentrated DP [6] for tighter privacy accounting, and shuffle DP [13] amplifying privacy through anonymization. Private aggregation techniques [2] enable federated learning at scale. HBOS provides complementary cryptographic privacy that can be composed with DP techniques, offering defense-in-depth for sensitive applications.

IX Conclusion

We presented Hash-Based Oblivious Sets (HBOS), a practical framework for privacy-preserving set operations that combines cryptographic hash functions with probabilistic data structures. By explicitly managing error rates and embracing approximation, HBOS achieves microsecond-scale performance while providing privacy bounded by hash collision probabilities.

Our contributions include:

•

A systematic framework for error propagation through composed set operations on hash-transformed data
•

Integration of established probabilistic algorithms (Bloom filters, HyperLogLog) with cryptographic privacy guarantees
•

Practical demonstrations achieving 1000-10000× speedup over homomorphic encryption approaches
•

Validation through real-world applications in private set intersection, secure aggregation, and federated learning

Future directions include:

•

Integration with post-quantum hash functions for quantum resistance
•

Hardware acceleration leveraging AES-NI and SHA extensions
•

Composition with differential privacy for enhanced protection
•

Formal verification of implementation correctness

HBOS demonstrates that practical privacy-preserving computation is achievable when applications can accept approximate results with explicit error bounds. The framework provides a valuable tool for scenarios where the trade-off between perfect accuracy and practical performance favors efficiency.

References

[1] B. H. Bloom (1970) Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 13 (7), pp. 422–426. Cited by: §II-B, §VIII-D.
[2] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth (2021) Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 1175–1191. Cited by: §VIII-E.
[3] E. Boyle, N. Chandran, N. Gilboa, D. Gupta, Y. Ishai, N. Kumar, and M. Rathee (2021) Function secret sharing for mixed-mode and fixed-point secure computation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp. 871–900. Cited by: §VIII-B.
[4] E. Boyle, G. Couteau, N. Gilboa, and Y. Ishai (2022) Cheaper silent ot extension and applications to multi-server pir. Annual International Cryptology Conference, pp. 371–403. Cited by: §VIII-B.
[5] Z. Brakerski and V. Vaikuntanathan (2022) Fully homomorphic encryption from ring-lwe and security for key dependent messages. Annual Cryptology Conference, pp. 505–524. Cited by: §VIII-A.
[6] M. Bun and T. Steinke (2016) Concentrated differential privacy: simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pp. 635–658. Cited by: §VIII-E.
[7] N. Chandran, D. Gupta, and A. Shah (2022) Circuit-psi with linear complexity via relaxed batch opprf. In Proceedings on Privacy Enhancing Technologies, Vol. 2022, pp. 353–372. Cited by: §VIII-C.
[8] H. Chen, Z. Liang, and Z. Huang (2022) Blazing fast psi from improved okvs and subfield vole. In USENIX Security Symposium, pp. 1435–1452. Cited by: §VIII-C.
[9] J. H. Cheon, A. Kim, M. Kim, and Y. Song (2017) Homomorphic encryption for arithmetic of approximate numbers. In International Conference on the Theory and Application of Cryptology and Information Security, pp. 409–437. Cited by: §VIII-A.
[10] I. Chillotti, N. Gama, M. Georgieva, and M. Izabachène (2020) TFHE: fast fully homomorphic encryption over the torus. Vol. 33, pp. 34–91. Cited by: §VIII-A.
[11] G. Cormode and S. Muthukrishnan (2005) An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55 (1), pp. 58–75. Cited by: §VIII-D.
[12] C. Dwork, F. McSherry, K. Nissim, and A. Smith (2006) Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pp. 265–284. Cited by: §I, §VIII-E.
[13] Ú. Erlingsson, V. Feldman, I. Mironov, A. Raghunathan, K. Talwar, and A. Thakurta (2019) Amplification by shuffling: from local to central differential privacy via anonymity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 2468–2479. Cited by: §VIII-E.
[14] Ú. Erlingsson, V. Pihur, and A. Korolova (2014) RAPPOR: randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pp. 1054–1067. Cited by: §VIII-E.
[15] B. Fan, D. G. Andersen, M. Kaminsky, and M. D. Mitzenmacher (2014) Cuckoo filter: practically better than bloom. In Proceedings of the 10th ACM International on Conference on emerging Networking Experiments and Technologies, pp. 75–88. Cited by: §VIII-D.
[16] P. Flajolet, É. Fusy, O. Gandouet, and F. Meunier (2007) HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In Conference on Analysis of Algorithms, pp. 127–146. Cited by: §IV-C.
[17] M. J. Freedman, K. Nissim, and B. Pinkas (2004) Efficient private matching and set intersection. In International Conference on the Theory and Applications of Cryptographic Techniques, pp. 1–19. Cited by: §VIII-C.
[18] C. Gentry (2009) Fully homomorphic encryption using ideal lattices. In Proceedings of the 41st annual ACM symposium on Theory of computing, pp. 169–178. Cited by: §I, §VIII-A.
[19] O. Goldreich, S. Micali, and A. Wigderson (1987) How to play any mental game. In Proceedings of the 19th Annual ACM Symposium on Theory of Computing, pp. 218–229. Cited by: §VIII-B.
[20] T. M. Graf and D. Lemire (2020) Xor filters: faster and smaller than bloom and cuckoo filters. Vol. 25, pp. 1–16. Cited by: §VIII-D.
[21] M. Keller (2020) MP-spdz: a versatile framework for multi-party computation. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, pp. 1575–1590. Cited by: §VIII-B.
[22] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis (2018) The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data, pp. 489–504. Cited by: §VIII-D.
[23] F. McSherry and K. Talwar (2021) The exponential mechanism for differential privacy revisited. Journal of Privacy and Confidentiality 11 (1). Cited by: §VIII-E.
[24] C. Meadows (1986) A more efficient cryptographic matchmaking protocol for use in the absence of a continuously available third party. Proceedings of IEEE Symposium on Security and Privacy, pp. 134–137. Cited by: §VIII-C.
[25] P. Mohassel and P. Rindal (2018) ABY3: a mixed protocol framework for machine learning. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 35–52. Cited by: §VIII-B.
[26] P. Paillier (1999) Public-key cryptosystems based on composite degree residuosity classes. In International Conference on the Theory and Application of Cryptographic Techniques, pp. 223–238. Cited by: §VIII-A.
[27] S. Raghuraman and P. Rindal (2022) Blazing fast psi from improved okvs and subfield vole. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pp. 2505–2517. Cited by: §VIII-C.
[28] P. Schoppmann, A. Gascon, and F. Kerschbaum (2023) PSI from pseudorandom correlation generators. Annual International Cryptology Conference, pp. 332–364. Cited by: §VIII-C.
[29] A. D. P. Team (2017) Learning with privacy at scale. Apple Machine Learning Journal 1 (8). Cited by: §VIII-E.
[30] D. Ting (2020) Count-hyperloglog: a more memory-efficient cardinality estimator. arXiv preprint arXiv:2005.14165. Cited by: §VIII-D.
[31] A. C. Yao (1982) Protocols for secure computations. In 23rd Annual Symposium on Foundations of Computer Science, pp. 160–164. Cited by: §I, §VIII-B.

Appendix A Implementation Details

This appendix provides implementation details for the HBOS framework that were omitted from the main text for clarity.

A-A Core Data Structures

A-A1 Hash-Oblivious Value Implementation

The hash-oblivious value class encapsulates the one-way transformation:

Listing 1: Hash-oblivious value implementation

⬇

1template <typename T, size_t N = 32>

2class hash_oblivious {

3 hash_value<N> value_hash_;

4 size_t key_fingerprint_;

5public:

6 approximate_bool equals(

7 const hash_oblivious& other) const {

8 bool same = (value_hash_ ==

9 other.value_hash_);

10 double fpr = pow(2.0, -N*8);

11 return approximate_bool(

12 same, fpr, 0.0);

13 }

14};

A-A2 Approximate Value Implementation

Listing 2: Approximate value with error tracking

⬇

1template <typename T>

2class approximate {

3 T value_;

4 double false_positive_rate_;

5 double false_negative_rate_;

6public:

7 T value() const { return value_; }

8 double confidence() const {

9 return 1.0 - false_positive_rate_

10 - false_negative_rate_;

11 }

13 // Error propagation for operations

14 approximate operator&&(

15 const approximate& other) const {

16 return approximate(

17 value_ && other.value_,

18 min(false_positive_rate_,

19 other.false_positive_rate_),

20 false_negative_rate_ +

21 other.false_negative_rate_);

22 }

23};

A-B Optimization Techniques

A-B1 SIMD Hash Computation

We use vector instructions for parallel hash computation:

Listing 3: SIMD-accelerated hashing

⬇

1template <typename T>

2void batch_hash(const T* input,

3 hash_value* output,

4 size_t count) {

5 #pragma omp simd

6 for (size_t i = 0; i < count; ++i) {

7 output[i] = compute_hash(input[i]);

8 }

A-B2 Memory Pool Allocation

Listing 4: Memory pool for temporary values

⬇

1template <typename T>

2class memory_pool {

3 std::vector<T> pool_;

4 std::stack<T*> available_;

5public:

6 T* allocate() {

7 if (available_.empty()) {

8 pool_.emplace_back();

9 return &pool_.back();

10 }

11 T* ptr = available_.top();

12 available_.pop();

13 return ptr;

14 }

16 void deallocate(T* ptr) {

17 available_.push(ptr);

18 }

19};

A-C Key Management Implementation

Listing 5: Key derivation and management

⬇

1class key_manager {

2 std::array<uint8_t, 32> master_key_;

4public:

5 // Derive context-specific keys

6 auto derive_key(string_view context) {

7 return hmac_sha256(master_key_,

8 context);

9 }

11 // Periodic key rotation

12 void rotate_keys() {

13 auto new_key = generate_random_key();

14 secure_overwrite(master_key_);

15 master_key_ = new_key;

16 }

18 // Shamir’s secret sharing

19 auto split_key(int threshold, int shares) {

20 return shamir_split(master_key_,

21 threshold, shares);

22 }

23};

A-D C++20 Concepts

We use concepts for compile-time type checking:

Listing 6: Type constraints using concepts

⬇

1template <typename T>

2concept Hashable = requires(T t) {

3 { std::hash<T>{}(t) } ->

4 std::convertible_to<size_t>;

5};

7template <typename T>

8concept ObliviousSet = requires(T t) {

9 typename T::value_type;

10 { t.insert(std::declval<

11 typename T::value_type>()) };

12 { t.contains(std::declval<

13 typename T::value_type>()) } ->

14 std::convertible_to<approximate_bool>;

15};

A-E Parallel Execution

Leveraging C++20 parallel algorithms:

Listing 7: Parallel set operations

⬇

1template <ObliviousSet S>

2auto parallel_union(const S& a, const S& b) {

3 S result;

4 std::for_each(

5 std::execution::par_unseq,

6 a.begin(), a.end(),

7 [&result](const auto& elem) {

8 result.insert(elem);

9 });

10 std::for_each(

11 std::execution::par_unseq,

12 b.begin(), b.end(),

13 [&result](const auto& elem) {

14 result.insert(elem);

15 });

16 return result;

17}

Hash-Based Oblivious Sets: A Practical Framework for Privacy-Preserving Set Operations with Probabilistic Guarantees

Abstract

Index Terms:

I Introduction

I-A Motivating Example

I-B Contributions

I-C Paper Organization

II Background and Threat Model

II-A Cryptographic Hash Functions

Definition 1 (Preimage Resistance).

Definition 2 (Second Preimage Resistance).

Definition 3 (Collision Resistance).

II-B Approximate Data Structures

Definition 4 (Approximate Boolean).

II-C Threat Model

III System Design

III-A Core Abstractions

III-A1 Hash-Oblivious Values

III-A2 Approximate Values

III-A3 Set Operations

III-B Error Propagation

Theorem 5 (Union Error Bound).

Theorem 6 (Intersection Error Bound).

III-C Architecture

IV Mathematical Foundations

IV-A Security Analysis

Definition 7 (One-Wayness Game).

Theorem 8 (Privacy Preservation).

Proof.

IV-B Approximate Algebraic Properties

Definition 9 (Approximate Set Operations).

IV-C Cardinality Estimation

V Implementation

VI Evaluation

VI-A Experimental Setup

VI-B Performance Benchmarks

VI-C Scalability Analysis

VI-D Security Evaluation

VI-E Comparison with Alternatives

VII Applications

VII-A Private Set Intersection

VII-B Secure Deduplication

VII-C Privacy-Preserving Analytics

VII-D Federated Learning

VIII Related Work

VIII-A Homomorphic Encryption

VIII-B Secure Multi-Party Computation

VIII-C Private Set Intersection

VIII-D Approximate Data Structures

VIII-E Differential Privacy

IX Conclusion

References

Appendix A Implementation Details

A-A Core Data Structures

A-A1 Hash-Oblivious Value Implementation

A-A2 Approximate Value Implementation

A-B Optimization Techniques

A-B1 SIMD Hash Computation

A-B2 Memory Pool Allocation

A-C Key Management Implementation

A-D C++20 Concepts

A-E Parallel Execution

Hash-Based Oblivious Sets: A Practical Framework for
Privacy-Preserving Set Operations with Probabilistic Guarantees