\AtBeginEnvironment

proof \AtEndEnvironmentproof

Maximizing Confidentiality in Encrypted Search
Through Entropy Optimization

Alexander Towell
atowell@siue.edu
Abstract

Encrypted search systems enable information retrieval over encrypted data while preserving confidentiality of queries and documents. However, observable patterns in encrypted queries, access patterns, and result sets can leak information about plaintext content. We present an information-theoretic framework for analyzing and improving the confidentiality of encrypted search systems. We model encrypted search activities as random processes and measure their entropy, comparing observed entropy against the maximum entropy possible under system constraints such as query arrival rates, vocabulary size, and document collection size. We derive closed-form solutions for the maximum entropy distribution and show that the ratio of observed to maximum entropy provides a quantitative measure of confidentiality bounded between 0 and 1. Since entropy can be estimated using lossless compression, our framework enables practical measurement without requiring explicit probabilistic models. We demonstrate that confidentiality can be systematically improved through techniques such as homophonic encryption, artificial query injection, and query aggregation, each trading specific resources for entropy gains. A case study shows that a typical system achieving 59% efficiency can be improved to 85% efficiency with moderate space and bandwidth overhead. Our approach provides principled guidance for balancing confidentiality against performance in encrypted search deployments.

List of Algorithms

\@starttoc

loa

1 Introduction

Information retrieval over encrypted data presents a fundamental tension: enabling search functionality requires revealing some information about queries and documents, yet preserving confidentiality requires hiding this information. An information retrieval process begins when a search agent submits a query to an information system, where the query represents an information need. The system returns relevant objects, typically documents, satisfying that need.

encrypted search (ES) extends this paradigm to untrusted environments where an encrypted search provider obliviously retrieves confidential objects satisfying confidential information needs of authorized search agents. The system employs two cryptographic components: secure indexes provide queryable encrypted representations of confidential documents (constructed using techniques such as Bloom filters, perfect hash functions, or other approximate membership structures), while hidden queries provide one-way encrypted representations of plaintext queries.

Ideally, both secure indexes and hidden queries would reveal no information beyond what is necessary for retrieval. They would appear as uniform uncorrelated noise with lengths uncorrelated to their plaintext counterparts. However, functionality requirements create an inherent tradeoff: for an untrusted system to perform oblivious searches, hidden queries must map to relevant secure indexes, necessarily revealing some information through observable patterns.

1.1 The Information Leakage Problem

Claude E. Shannon, in his seminal 1949 paper Communication Theory of Secrecy Systems[16], established two essential properties for secure encryption:

  1. 1.

    Confusion obscures relationships between plaintext and ciphertext, ensuring ciphertext statistics depend on plaintext in ways too complex to exploit. Simple substitution ciphers fail this requirement: preserving frequency distributions allows adversaries to recover plaintexts through statistical analysis.

  2. 2.

    Diffusion ensures each plaintext symbol affects many ciphertext symbols, ideally causing every plaintext symbol to influence every ciphertext symbol.

Encrypted search systems struggle to achieve Shannon’s ideals. Even with strong cryptographic primitives, observable patterns leak information. Recent attacks demonstrate that access patterns[9, 1], query volumes[8], and timing information enable adversaries to reconstruct significant portions of plaintext queries and document content.

1.2 An Information-Theoretic Approach

Rather than relying solely on computational hardness assumptions, we analyze encrypted search through the lens of information theory. Shannon’s mathematical theory of communication[15] provides precise tools for measuring information and uncertainty through entropy. We extend this framework to encrypted search by:

  1. 1.

    Modeling search activities as random processes: We represent hidden queries, inter-arrival times, search agent identities, and result sets as random variables with observable realizations.

  2. 2.

    Measuring entropy: We quantify the unpredictability of these processes using Shannon entropy, which represents the minimum expected number of yes/no questions needed to determine outcomes.

  3. 3.

    Computing maximum entropy under constraints: We derive the maximum entropy achievable given system constraints such as query arrival rates, vocabulary size, and collection size.

  4. 4.

    Defining confidentiality: We measure confidentiality as the ratio of observed entropy to maximum possible entropy, providing a score between 0 (completely predictable) and 1 (maximally unpredictable).

  5. 5.

    Enabling practical measurement: Since entropy relates to optimal compression, we estimate it using lossless compressors without requiring explicit probability models.

This framework quantifies the fundamental confidentiality limits imposed by system constraints and provides principled guidance for improving confidentiality through entropy maximization.

1.3 Contributions

This paper makes the following contributions:

  1. 1.

    An information-theoretic framework for analyzing encrypted search confidentiality through entropy measurement

  2. 2.

    Derivation of maximum entropy distributions under realistic system constraints with closed-form solutions

  3. 3.

    A practical confidentiality metric based on entropy ratios that enables comparison across systems and configurations

  4. 4.

    Techniques for systematically improving confidentiality including homophonic encryption, artificial queries, and query aggregation

  5. 5.

    A compression-based entropy estimation method enabling measurement without explicit probabilistic models

  6. 6.

    Analysis demonstrating quantitative tradeoffs between confidentiality and resource costs

  7. 7.

    A case study showing confidentiality improvements from 59% to 85% efficiency with moderate overhead

To briefly overview, an encrypted search system consists of three separate stages:

  1. 1.

    The search agents generate plaintext queries representing confidential information needs. These queries are sent across a trusted communications channel to an obfuscator.

  2. 2.

    The obfuscator transforms the incoming sequence of plaintext queries into a sequence of hidden queries. These hidden queries are sent across an untrusted communications channel to the encrypted search provider.

  3. 3.

    The encrypted search provider obliviously maps each hidden query in the sequence to a set of secure indexes, where each secure index represents a confidental object that ostensibly satisfies the hidden query.

In practice, no encrypted search system obtains the upper limits of uniform uncorrelated noise. We measure the confidentiality with respect to some entropy measure, where the maximum entropy possible is derived with respect to some set of constraints.

Informed by this entropy measure, we propose methods to increase the entropy at some quantifiable cost, e.g., a sequence of hidden queries may be made to have as much entropy as desired at the cost of increasing the bit rate needed to transmit the hidden queries across the untrusted channel.111Obtaining the upper-limit, where the parametric constraints are unbounded, results in infinite costs.

2 Related Work

2.1 Encrypted Search Systems

The problem of searching over encrypted data while preserving confidentiality has been studied extensively since the seminal work of Song et al.[17], who introduced practical techniques for searching on encrypted data using cryptographic primitives. Goh[6] introduced the concept of secure indexes using Bloom filters to enable efficient searching over encrypted document collections.

Curtmola et al.[4] provided improved security definitions for searchable symmetric encryption and presented efficient constructions achieving stronger security guarantees. Their work established formal security models that have become standard in the field. Cash et al.[2] extended this work to support Boolean queries at scale, while Kamara et al.[12] addressed the challenge of dynamic encrypted search systems where documents can be added or removed.

2.2 Attacks on Encrypted Search

Despite strong cryptographic protections, encrypted search systems are vulnerable to attacks that exploit observable patterns in encrypted queries and access patterns. Islam et al.[9] demonstrated that access pattern leakage can enable adversaries to recover significant information about plaintext queries and documents. Cash et al.[1] formalized leakage-abuse attacks, showing how adversaries can exploit leaked information to reconstruct queries.

Pouliot and Wright[14] demonstrated inference attacks on practical encrypted search deployments, while Grubbs et al.[8] showed that even volume leakage from range queries can enable database reconstruction. These attacks motivate the need for principled approaches to understanding and quantifying information leakage in encrypted search systems.

2.3 Information-Theoretic Approaches

Shannon’s foundational work on information theory[15, 16] established the mathematical framework for measuring information and uncertainty. His analysis of secrecy systems demonstrated the importance of entropy in cryptographic security. The principle of maximum entropy, developed by Jaynes[10, 11], provides a rational basis for selecting probability distributions under constraints.

Our work applies these information-theoretic principles to analyze encrypted search systems. By measuring the entropy of observable encrypted search activities and comparing it to the maximum entropy possible under system constraints, we provide a quantitative measure of confidentiality that guides the design of countermeasures.

2.4 Oblivious Computation

Oblivious RAM[7, 18] provides techniques for hiding access patterns to memory by obfuscating read and write operations. While ORAM achieves strong security guarantees, it introduces significant computational overhead. Our approach shares the goal of hiding patterns in observable activities but focuses specifically on encrypted search and leverages information-theoretic measures to balance confidentiality against performance costs.

2.5 Anonymity and Mix Networks

Mix networks[3] and onion routing systems like Tor[5] provide anonymity by obscuring the relationship between senders and receivers of messages. These techniques complement our approach by addressing the challenge of hiding search agent identities. Our entropy maximization framework can incorporate mix network properties as one component of a comprehensive confidentiality strategy.

2.6 Contributions

This work makes the following contributions:

  1. 1.

    We provide an information-theoretic framework for analyzing encrypted search systems, measuring confidentiality through entropy rather than relying solely on computational security assumptions.

  2. 2.

    We derive the maximum entropy distribution for encrypted search activities under realistic system constraints, providing a principled upper bound on confidentiality.

  3. 3.

    We propose practical techniques for increasing entropy at quantifiable cost, including homophonic encryption, artificial queries, and query aggregation.

  4. 4.

    We establish the connection between lossless compression and entropy estimation, enabling practical measurement of system confidentiality.

  5. 5.

    We demonstrate how resource constraints such as bandwidth and storage can be traded systematically against confidentiality improvements.

3 Encrypted search model

A set is an unordered collection of distinct elements. A set of particular interest is given by the following definition.

Definition 3.1.

The finite set of all bit strings of length n is denoted by

𝔹n={b:b{0,1}n} (3.1)

with a cardinality given by

|𝔹n|=2n. (3.2)

The set of bit strings of length n may be put in one-to-one correspondence with any set having 2n or more elements.

The bit length of an object x is denoted by

BL(x), (3.3)

e.g., the bit length of a bit string x𝔹n is BL(x)=n. The countably infinite set of all bit strings of any length is denoted by

𝔹={b:b{0,1}}, (3.4)

where denotes any non-negative integer.

Definition 3.2 (Confidential object).

A confidential object, like a document, is an object that only an authorized set of search agents should be able to comprehend.

The objective of an encrypted search system is to satisfy the information needs, as represented by queries, of authorized search agents by retrieving a set of relevant documents from the document store.

Definition 3.3 (Secure index).

A queryable representation of a confidential object is denoted a secure index if it meets the following conditions:

  1. 1.

    If it has a bit length n, then it obtains (at least approximately) the maximum entropy. That is, it is incompressible.

  2. 2.

    An estimator of the cardinality of a bag-of-words secure index has an average entropy c2 per element.

  3. 3.

    A hidden query applied to a secure index only reveals information about the set of trapdoors in the hidden query. If the document model is a bag-of-words, then it only reveals an approximate membership with a false positive rate ε. If the document model is more general, it reveals only the information necessary to determine relevance.

By the definition above, a secure index may be obliviously queried by the encrypted search provider on behalf of authorized search agents.

The requirement that a secure index be incompressible (condition 1) is crucial for preventing information leakage through structural patterns. Various cryptographic constructions achieve this property through different mechanisms. Bloom filters[6] provide space-efficient approximate set membership testing with tunable false positive rates, while maintaining high entropy through their bit array representation. Perfect hash functions and other approximate map constructions[21, 20] extend this concept to support richer query semantics. The entropy of a secure index can be measured empirically through lossless compression algorithms: a well-obfuscated secure index should exhibit compression ratios near 1.0, indicating that it approaches maximum entropy for its bit length.

The choice of secure index construction involves tradeoffs between space efficiency, query expressiveness, and information leakage. More sophisticated constructions supporting phrase queries, proximity search, or ranked retrieval necessarily reveal additional structural information about the underlying document. Our entropy-based framework applies uniformly across these constructions: regardless of the specific cryptographic technique employed, a secure index achieves better confidentiality when its observable representation has higher entropy relative to the maximum possible entropy given the system constraints.

An encrypted search system may support many different kinds of queries. One of the simplest kinds of queries is a bag-of-words; that is, a search agent represents an information need with a set of relevant search keys. The bag-of-words query model may be a crude approximation of an information need but it is common in information retrieval and for simplicity we make the following assumption.

Assumption 3.1.

The query model is a bag-of-words.

A query submitted to a encrypted search system should only be understood by the search agent that generated the query.

Definition 3.4 (Hidden query).

A hidden query represents a confidential information need of an authorized search agent, where the query is suppose to be incomprehensible to everyone except the indicated search agent.

Before we define how a plaintext bag-of-words query is transformed into a hidden query, we need to define the trapdoor function.

Definition 3.5 (Trapdoor).

Given a secret s𝔹 and a word x𝔹, a trapdoor of x is a one-way transformation to a word y𝔹m as given by

yh(x++s), (3.5)

where h:𝔹𝔹m is a one-way function and ++ is the concatenation operator.

By one-way, we mean that given a word y, finding an x such that

y=h(x)

is not tractable. The one-way property of the transformation is the reason we call it a trapdoor function.

Definition 3.6.

A random oracle is an idealized cryptographic primitive that maps inputs to uniformly random outputs in an unpredictable manner.

Assumption 3.2.

The one-way function h:𝔹𝔹m approximates222Generally, h is a cryptographic hash function. a random oracle.

If a collision between plaintext words x and y were to occur, they would be aliased–i.e., indistinguishable–in the encrypted search system. We make the following simplifying assumption about the trapdoor function.

Assumption 3.3.

The codomain of h, the set of bit strings of length m, is sufficiently large such that collisions between words that represent information needs are negligible and may be ignored.

The function that transforms a plaintext bag-of-words into a trapdoor bag-of-words is given by the following definition.

Definition 3.7 (hidden query cryptographic protocol).

The cryptographic protocol that transforms a plaintext query 𝐱 into a hidden query 𝐱ˇ is given by

𝐱ˇhidden_query_generator(𝐱) (3.6)

where

hidden_query_generator:[bag-of-words]𝔹m. (3.7)

uses the trapdoor function given by Definition 3.5.

If the policy is a substitution cipher as given by Algorithm 1, then plaintext words have 1 or more possible trapdoors that are uniformly sampled from. In a simple substitution cipher, where each word maps to a single trapdoor, the substitution policy is a constant given by

substitutions(x)=1 (3.8)

for all words x𝔹.

params : 
  1. 1.

    s𝔹 is the secret.

  2. 2.

    substitutions:𝔹𝔹m is the substitution policy, where m is the bit length of trapdoors.

input :  A bag-of-words plaintext query 𝐱.
output :  A corresponding bag-of-words hidden query 𝐱ˇ.
1 function hidden_query_generator(𝐱)
2       𝐱ˇ;
3       for x𝐱 do
4             p substitutions(x);
5             sample k from DU(1,p);
6             ss;
7             for j1 to p do
                   // Since h approximates a random oracle, prepending the single bit value ‘‘1’’ is sufficient to generate another uncorrelated trapdoor.
8                   s1++s;
9                   xˇ h(x++s);
10                   𝐱ˇ𝐱ˇ{xˇ};
11                  
12             end for
13            
14       end for
15      return 𝐱ˇ;
16      
17
Algorithm 1 Cryptographic substitution cipher policy for mapping plaintext queries to hidden queries
params : 
  1. 1.

    s𝔹 is the secret.

  2. 2.

    nnoise is the number of artificial trapdoors to inject per query.

input :  A bag-of-words hidden query 𝐱ˇ.
output :  A perturbed bag-of-words hidden query 𝐱ˇ with artificial trapdoors.
1 function hidden_query_noise_decorator(𝐱ˇ)
2       𝐱ˇ𝐱ˇ;
3       for j1 to nnoise do
4             sample r uniformly from 𝔹;
5             xˇnoise h(r++s);
6             𝐱ˇ𝐱ˇ{xˇnoise};
7            
8       end for
9      return 𝐱ˇ;
10      
11
Algorithm 2 Cryptographic noise policy decorator for hidden queries
Definition 3.8.

The secure index document model is an approximate map[21, 20] with a false positive rate ε where the keys are searchable words in a corresponding confidential object and the values represent information about the word, such as its multiplicity.

In a bag-of-words document model, the value is whether a word exists in a document. In this case, a special-case of the approximate map, the approximate set, is used.[22, 19]

The function that transforms a plaintext document into a secure index is given by the following definition.

Definition 3.9 (Secure index construction cryptographic protocol).

The cryptographic protocol that transforms a plaintext document into a secure index is given by

𝐝secure_index_maker(𝐝) (3.9)

where

secure_index_maker:[document][secure_index]. (3.10)

If each queryable word x has multiple possible substitutions and the document model is a bag-of-words, then the algorithm given by Algorithm 3 is a candidate for secure index construction. It relies upon a data structure implementing the approximate set abstract data type.[22, 19]

params : 
s𝔹

is the secret.

ε

is the false positive rate.

substitutions:𝔹𝔹m

is the substitution policy, where m is the bit length of trapdoors.

input :  𝔻 is a bag-of-words representing a plaintext document.
output :  An approximate set of the trapdoors of the words in 𝔻.
1 function secure_index_maker(𝔻)
       // 𝔻 temporarily stores the trapdoors of the plaintext words.
2       𝔻;
3       for x𝔻 do
4             ss;
5             for j1 to substitutions(x) do
6                   xˇ(j)h(x++s);
7                   𝔻𝔻{xˇ(j)};
                   // Since h approximates a random oracle, prepending the single bit value ‘‘1’’ is sufficient to generate another uncorrelated trapdoor.
8                   s1++s;
9                  
10             end for
11            
12       end for
13      return an approximate set of 𝔻 with a false positive rate ε;
14      
15
Algorithm 3 Plaintext document to secure index bag-of-words generator

Algorithm 3 may also be used to support phrase searching by also including the bigrams in a document 𝔻. This is known as a biword model for phrase search[13]. A phrase is assumed to be in the document if all the bigrams in the phrase are in the approximate set. Note that false positives on phrases with more than two words occur at a different rate than the false positive rate of the approximate set. Other variations may also be supported, e.g., fuzzy searches or wildcard searches, at the cost of increased space and time complexity.

Definition 3.10 (Adversary).

The adversary is an untrusted agent that tries to extract confidential information about encrypted search activities.

Definition 3.11 (Kerckhoffs’s principle).

A cryptosystem should be secure even if everything about the system, except the secret, is known to the adversary. If the secret is compromised, the cryptosystem is compromised.

In our encrypted search model, the secret is a set of well-defined parameterizations.

Assumption 3.4.

The adversary knows everything about the system except a well-defined set of paramterizations.

In particular, the hidden queries time series is known (observable) to the adversary.

There are many ways an adversary might gain insight into encrypted search activities, e.g., the confidential identity of search agents may be exposed through traffic analysis even if onion routing is used[5].

The secret is given by the following definition.

Assumption 3.5.

The set of parameters considered to be the secret is given by the following:

  1. 1.

    A secret key used to generate trapdoors.

There are two primary components in an encrypted search system, the obfuscator and the encrypted search provider. The obfuscator is given by the following definition.

Definition 3.12 (obfuscator).

The obfuscator receives plaintext queries from authorized search agents, transforms them into hidden queries using some set of cryptographic protocols333One of which is given by Definition 3.7., and transmits the hidden queries to the encrypted search provider.444In practice, the obfuscator may transmit the hidden queries back to the search agents and they may then transmit the hidden queries directly to the encrypted search provider.

The obfuscator may reside on a search agent’s host computer or a physically separate computer that is network accessible. Either way, since the obfuscator receives plaintext queries, it must be trusted.

Assumption 3.6.

The authorized search agents have access to the obfuscator through a secure communications channel.

By Assumption 3.6, confidential plaintext queries may be securely transported to the obfuscator without being compromised by the adversary. The encrypted search provider is given by the following definition.

Definition 3.13 (encrypted search provider).

The encrypted search provider receives hidden queries that are the output of the obfuscator and maps them to a set of confidential objects. The mapping is given by some function

hidden_query_mapper:𝔹mpowerset({1,,N}), (3.11)

where 𝔹m is the set of hidden queries and 𝐝 is a set of references to confidential objects.

The encrypted search provider may reside on a search agent’s host computer or a physically separate computer that is network accessible. Either way, we make the following assumption about the link between the obfuscator and the encrypted search provider.

Assumption 3.7.

The obfuscator communicates with the encrypted search provider through an untrusted communications channel.

Typically, the network connection between the obfuscator and the encrypted search provider is trusted (encrypted), and thus the only untrusted element in this link is the encrypted search provider, e.g., the adversary may have compromised the security of the encrypted search provider itself.

The adversary may modify the result sets or the hidden queries (such that the search agents receive false results)555This capability could theoretically be employed by the adversary to decrease the entropy of encrypted search activities.. Strategies (such as redundancy) exist that may make it possible to detect such modifications, but we make the following assumption.

Assumption 3.8.

The search agents receive truthful results to their information needs.

The information that flows across the untrusted channel is given by the following definition.

Definition 3.14 (hidden query stream).

The hidden queries and result sets flowing across the untrusted communications channel is an ordered sequence of tuples. The kth tuple is given by

tˇk,aˇjk,𝐱ˇ𝐤,𝐝ˇ𝐤, (3.12)

where

  • tˇj

    is a time stamp of the kth hidden query,

  • aˇjk

    is the identity666For example, the IP address of the search agent. of the search agent submitting the kth hidden query,

  • 𝐱ˇk

    is the kth hidden query corresponding to plaintext query, and

  • 𝐝ˇk

    is the result set of confidential objects satisfying the information need of the query.

The time stamps observed in a stream of hidden queries provides a partial ordering such that if time stamp tj of the jth hidden query is earlier in time than time stamp tk of the kth query, then tj comes before tk in the ordered sequence of tuples.

Two search agents interacting with an encrypted search provider is given by the following example.

Example 1  In Figure 1, we depict the following situation. There are two search agents, denoted by SA1 and SA2, generating plaintext queries expressing some information need that is to be met by the encrypted search provider.

search agent a1 submits two plaintext queries, 𝐱𝟏 at time t1 and 𝐱𝟑 at time t3. These queries are transformed by the obfuscator respectively into the hidden queries 𝐱ˇ𝟏 and 𝐱ˇ𝟐 which are then sent to the encrypted search provider over the untrusted communications channel. Similiarly, search agent a2 submits a plaintext queries 𝐱𝟐 at time t2 which is transformed into 𝐱ˇ𝟐 and sent to the encrypted search provider.

The encrypted search provider receives these queries and generates result sets that satisfy the obfuscated information needs represented by the hidden queries, where 𝐝ˇ𝐣 satisfies 𝐱ˇ𝐣 for j=1,2,3.

The adversary observes the hidden query and result set streams flowing across the untrusted communication channels and attempts to ascertain patterns or regularities that may compromise confidentaility.

The adversary may be an authorized search agent if it is attempting to compromise the query privacy of other search agents.

Figure 1: Two search agents submitting a query to the encrypted search system where a simple substitution cipher is being used.
Refer to captionESPSA1SA2Refer to captionhidden queries(untrusted channel)Refer to captionObfuscatorqueries(trusted channel)Refer to captionadversary(t1,a1,𝐲𝟏),(t2,a2,𝐲𝟐),(t3,a1,𝐲𝟑)Refer to caption(t1,a1,𝐱𝟏),(t3,a1,𝐱𝟑)(t2,a2,𝐱𝟐)Refer to captionresult set stream(untrusted channel)(a1,𝐝𝟏),(a2,𝐝𝟐),(a1,𝐝𝟑)Refer to caption

The system has several characteristics given by the following table.

Table 1: The known parameters of the encrypted search system.
param sup description
λ >0 The mean arrival rate of the plaintext queries in the time series.
μ >0 The mean number of search keys per query in the plaintext time series.
u >0 The maximum number of search keys per query in the plaintext time series.
m >0 The number of unique plaintext search keys in the plaintext time series.
N >0 The number of unique confidential documents on the ESP.
θ >0 The mean number of documents in the result sets.777Since there are N documents in total, a result set has a minimum of 0 and a maximum of N.

To transmit the hidden queries across the untrusted channel, some encoding is needed. The hidden query encoder is given by the following definition.

Definition 3.15.

The encoder of the set of trapdoors 𝐱ˇ consisting of trapdoors is given by

encode:[𝕐]𝔹, (3.13)

which produces a uniquely decodable bit string.

The result sets encoder is given by the following definition.

Definition 3.16.

The encoder of the set of document identifiers [𝔻] is given by

encode:[𝔻]𝔹, (3.14)

which produces a uniquely decodable bit string.

Theorem 3.1.

The average bit rate of the sum of the hidden query and result set streams is given by

𝒪(λ(mμ+θ)), (3.15)

where λ is the expected query arrival rate.

Proof.

The expected bit length of the jth hidden query is given by

𝔼[BL(encode(𝕐j))]+𝒪(1). (3.16)

The constant is the fixed number of bits needed to encode data such as the time stamp of a hidden query and the identifier of the search agent (such as an IP address) that submitted it. Each trapdoor is coded by a fixed number of m bits, and there is expected to be μ trapdoors per hidden query. Thus, the expected bit length of the encoding of 𝐘𝐣 is given by

μm+𝒪(1). (3.17)

The expected bit length of the jth result set is given by

𝒪(θ), (3.18)

where θ is the expected number of documents relevant per hidden query. The sum of Equations 3.17 and 3.18 is given by

𝒪(θ)+μm+𝒪(1). (3.19)

The number of hidden query arriving per second is given by λ, and thus the total query rate is given by

λ(𝒪(θ)+μm+𝒪(1)). (3.20)

The primary interest of the adversary is in extracting information from the stream of hidden queries going from obfuscator to the encrypted search provider and from the stream of result sets going from the encrypted search provider to the search agents. For instance, the adversary may use a known-plaintext attack to map the trapdoors to their plaintext counterparts to ascertain the confidential information needs of search agents.

In what follows, we provide a theoretical treatment on the information disclosure of the hidden query and result set streams and explore strategies that increase their entropy by transforming the streams.

4 Probabilistic model

A probabilistic model is specified by equations involving random variables which make assumptions about how observable data about the system is generated. In what follows, we describe our model and observable data.

The probability mass function is given by the following definition.

Definition 4.1.

Let X be some discrete random variable. The probability mass function, denoted by

pX(x), (4.1)

calculates the probability that X realizes some value x.

If a random variable X has a probability mass function pX(), we say that

XpX(). (4.2)

The joint probability mass function of X1,,Xn is given by

pX1,,Xn(x1,,xn), (4.3)

which calculates the joint probability that X1=x1,,Xn=xn.

The arrival rate is given by the reciprocal of the inter-arrival time. In an observed time series t1,,tn, the sample mean arrival rate is given by

λ=nj=1ntj. (4.4)

We model the inter-arrival times between successive queries as random variables as given by the following definition.

Definition 4.2.

The inter-arrival time between the (j1)-th and the jth query is a continuous random variable Tj with a support set >0 and a mean arrival rate λ.

{remark}

An estimator of the distribution of the inter-arrival times is provided by time series estimators like exponential smoothing.

Suppose it is known that there are k search agents that may generate the plaintext queries. In a time series of the queries, the frequency of the search agents may vary. Thus, we may model the distribution as a discrete random variable as given by the following definition.

Definition 4.3.

The search agent responsible for the jth query is a discrete random variable Aj with a support set given by

{1,2,3,,k}. (4.5)

The distribution of plaintext queries is given by the following definition.

Definition 4.4.

The jth random bag-of-words (set) in the plaintext query time series is denoted by 𝐗𝐣 with a support given by

{𝕏powerset(𝕂)| 0<|𝕏|p}. (4.6)

The process that generates queries may be too complex to model. However, as the famous statistician George Box observed, “ all models are wrong, but some are useful.” The adversary does not need an accurate model, only a useful model. A very useful model is one which enables the adversary to map the observed trapdoors to their corresponding plaintext counterparts.

The random tuples in the hidden query and result set streams are given by the following definition.

Definition 4.5.

We denote the random tuple of the jth plaintext query by

𝐐𝐣=(Tj,Aj,𝐗𝐣), (4.7)

where

  1. Tj is the random inter-arrival time between the (j1)-th and jth queries,

  2. Aj is the random search agent of the jth query, and

  3. 𝐗𝐣 is the random set of plaintext bag-of-words in the jth query.

4.1 Hidden query and result set streams

By Assumption 3.7, the plaintext random tuples 𝐗𝟏,,𝐗𝐧 are not observable. However, these random tuples induce an observable hidden query stream.

The inter-arrival times between hidden queries is uncertain and therefore we may model them as random variables as given by the following definition.

Definition 4.6.

The inter-arrival time between the of the (j1)-th and the jth hidden query is a discrete random variable Tˇj with a support set given by

{1,2,3,}. (4.8)

The mean arrival rate of hidden queries is given by

n𝔼[1j=1nTˇj]=λˇ. (4.9)

The mean arrival rate of hidden queries is not necessarily the same as the mean arrival rate of plaintext queries since the obfuscator transforms the incoming plaintext queries to obfuscate encrypted search activities, e.g., it may inject artificial hidden queries at uncertain times and at any desired rate. Thus, in general, λˇλ.

The search agents generate the legitimate queries. However, the obfuscator may generate artificial queries by either artificial or legitimate search agents to perturb the hidden query stream. Regardless, the particular search agent that generated a query in the stream cannot be predetermined and thus we may model this uncertainty as a discrete random variable as given by the following definition.

Definition 4.7.

The search agent responsible for the jth hidden query is a discrete random variable Aˇj with a support set given by

{1,2,3,,kˇ}. (4.10)

Since the obfuscator may generate artificial queries for artificial search agents, kˇ may be larger than k.

By Assumption 3.2, the trapdoors observed are functions of the random plaintext words in the bag-of-words query. Thus, the trapdoors are random sets as given by the following definition.

Definition 4.8.

The trapdoors in the jth hidden query is a random set given by

𝐗ˇ𝐣=hidden_query_generator(𝐗), (4.11)

where 𝐗 is some random plaintext query888Potentially artificial random query. and 𝐗ˇ𝐣 has a support set given by the power set of

{1,2,3,,mˇ} (4.12)

with a mean cardinality given by

𝔼[1nj=1n|𝐗ˇ𝐣|]=μˇ. (4.13)

Note that 𝐗ˇ𝐣 does not (necessarily) correspond to 𝐗𝐣 since the obfuscator transforms the incoming plaintext queries to obfuscate encrypted search activities, e.g., it may inject artificial hidden queries.

There are N unique confidential objects (and NN obfuscated or artificial objects) for a total of N objects. The random set 𝐗ˇ𝐣 induces a distribution of result sets. Since the result sets are uncertain we may model them as random result sets as given by the following definition.

Definition 4.9.

The kth random result set corresponding to the kth random hidden query is denoted by 𝐃ˇ𝐤 with a support set given by the power set of

{1,2,3,,Nˇ} (4.14)

with a mean cardinality given by

𝔼[1nj=1n|𝐃ˇ𝐣|]=θˇ. (4.15)

Given 𝐗ˇ𝐢=𝐱ˇ𝐢, 𝐃ˇ𝐢 is degenerate since the same hidden query must always map to the same result set (assuming the confidential database is immutable).

In summary, the random tuples in the hidden query and result set streams are given by the following definition.

Definition 4.10.

We denote the random tuple of the jth hidden query and the corresponding jth result set by

𝐐ˇ𝐣=Tˇj,Aˇj,𝐗ˇ𝐣,𝐃ˇ𝐣, (4.16)

where

  1. Tˇj is the random inter-arrival time,

  2. Aˇj is the random search agent,

  3. 𝐗ˇ𝐣 is the random set of trapdoors, and

  4. 𝐃ˇ𝐣 is the random result set corresponding to 𝐗ˇ𝐣.

See Section 4.2 for a description of the joint probability mass function of the random sample 𝐐ˇ𝟏,,𝐐ˇ𝐧, which is in general not tractable. The entropy of the distribution is a far more tractable problem and provides a measure of the regularity or predictability that an adversary may extract from the system by observing the hidden query and result set streams.

Refer to caption𝐐𝟏,,𝐐𝐧𝐐ˇ𝟏,,𝐐ˇ𝐧Pr(𝐐𝟏,,𝐐𝐧,𝐐ˇ𝟏,,𝐐ˇ𝐧)Refer to captionobservable
Figure 2: The probability model, where a sequence of plaintext random queries is transformed into a sequence of random hidden queries.

4.2 Generative model

Consider the time series

𝐐𝟏,,𝐐𝐧,𝐐ˇ𝟏,,𝐐ˇ𝐧. (4.17)

The conditional probability that 𝐐𝐧=𝐪𝐧 and 𝐐ˇ𝐧=𝐪ˇ𝐧 given 𝐐<𝐧=𝐪<𝐧 and 𝐐ˇ<𝐧=𝐪ˇ<𝐧 is given by

Pr[𝐐ˇ𝐧=𝐪ˇ𝐧,𝐐ˇ𝐧=𝐪ˇ𝐧|𝐐<𝐧=𝐪<𝐧,𝐐ˇ<𝐧=𝐪ˇ<𝐧], (4.18)

which is equivalent to

Pr[Tn=tn,An=an,𝐗𝐧=𝐱𝐧,Tˇn=tˇn,Aˇn=aˇn,𝐗ˇ𝐧=𝐱ˇ𝐧,𝐃ˇ𝐧=𝐝ˇ𝐧|𝐐ˇ<𝐧=𝐪ˇ<𝐧]. (4.19)

By the chain rule, this can be rewritten as

Pr[Tˇn=tn,Aˇn=an,𝐗ˇ𝐧=𝐱ˇ𝐧|𝐗ˇ<𝐧=𝐱ˇ<𝐧]=Pr[Tˇn=tn|𝐗ˇ<𝐧=𝐱ˇ<𝐧]×Pr[Aˇn=an|Tˇn=tn,𝐗ˇ<𝐧=𝐱ˇ<𝐧]×Pr[𝐗ˇ𝐧=𝐱ˇ𝐧|Tˇn=tn,Aˇn=an,𝐗ˇ<𝐧=𝐱ˇ<𝐧]. (4.20)

The plaintext time series of queries induces the hidden query time series. If a simple substitution cipher is used for the queries and agent identifers and the time stamps are unchanged, then the distribution of hidden queries is the same as the plaintext time series except with different labels.

params : 
input : 
output :  A time series of size n drawn randomly from the induced distribution.
1 function samplern
2       for i1 to n do
3             sample ti from Ti|𝐐<𝐢=𝐪<𝐢;
4             sample ai from Ai|Ti=ti,𝐐<𝐢=𝐪<𝐢;
5             sample 𝐱𝐢 from 𝐗𝐢|Ai=ai,Ti=ti,𝐐<𝐢=𝐪<𝐢;
6             aˇi some anonymizer, like a mixnet;
7             tˇi something that delays sending hidden query up to some limit;
8             𝐱ˇ𝐢hidden_query_generator(𝐱𝐢|parameters);
9             𝐝ˇ𝐢hidden_query_mapper(𝐱ˇ𝐢|parameters);
10            
11       end for
12      
13
Algorithm 4 Generative model of a hidden query time series

5 Entropy and information

Suppose a search agent has some random information need J coming from some countable set 𝕁 with a probability mass given by pJ(). If the adversary could ask “yes” or “no” questions about the search agent[’s] information need, an expected lower-bound on the number of questions required to determine the information need j𝕁 is given by

(J)=j𝕁pJ(j)log2pJ(j). (5.1)

This is known as the entropy of the random variable J. The entropy measures the amount of “uncertainty” about what value J realizes. The greater the entropy, the greater the uncertainty and therefore the greater the number of questions the adversary needs on average to determine the information need.

For instance, to minimize the number of questions, the adversary could ask the search agent whether the information need is j(1)=argmaxj𝕁pJ(j). This is the information need with probability pJ(j(1)). If not, the adversary can ask the search agent whether the information need is j(2)=argmaxj𝕁{j(1)}pJ(j). The information need is

The greater the entropy, the more difficult it is to predict the information need of a search agent. More generally, the greater the entropy, the fewer patterns there are in the activities of an encrypted search system. In what follows, we provide a more rigorous mathematical treatment on the entropy of the encrypted search system.

The entropy is given by the following definition.

Definition 5.1 (Entropy).

The entropy of the jth random tuple 𝐐ˇ𝐣 is given by

(𝐐ˇ𝐣)=𝔼[log2p𝐐ˇ𝐣(𝐐ˇ𝐣)], (5.2)

where p𝐐ˇ𝐣() is the marginal distribution of 𝐐ˇ𝐣.

Definition 5.2 (Conditional entropy).

The conditional entropy of the jth random tuple 𝐐ˇ𝐣 given the previous j1 random tuples 𝐐ˇ<𝐣𝐐ˇ𝟏,,𝐐ˇ𝐣𝟏 is given by

(𝐐ˇ𝐣|𝐐ˇ<𝐣)=𝔼[log2p𝐐ˇ𝐣|𝐐ˇ<𝐣(𝐐ˇ𝐣|𝐐ˇ<𝐣)]. (5.3)
Definition 5.3 (Joint entropy).

The joint entropy of 𝐐ˇ𝟏,,𝐐ˇ𝐧 is given by

(𝐐ˇ𝟏,,𝐐ˇ𝐧)=(𝐐ˇ𝟏)+j=2n(𝐐ˇ𝐣|𝐐ˇ<𝐣). (5.4)

The joint entropy is less than (or equal to) the sum of the marginal entropies as given by

(𝐐ˇ𝟏,,𝐐ˇ𝐧)j=1n(𝐐ˇ𝐣). (5.5)

and only obtains equality if 𝐐ˇ𝟏,,𝐐ˇ𝐧 are statistically independent. If they are independent and identically distributed, then the joint entropy is given by

(𝐐ˇ𝟏,,𝐐ˇ𝐧)=n(𝐐). (5.6)
{postulate}

[Optimal compressor] The entropy (𝐐ˇ𝟏,,𝐐ˇ𝐧) is equivalent to the expected bit length produced by an optimal lossless compressor’s output given the encoding of the random tuples as given by

(𝐐ˇ𝟏,,𝐐ˇ𝐧)=𝔼[BL(compress(encode(𝐐ˇ𝟏)++++encode(𝐐ˇ𝐧))))], (5.7)

where ++ is the concatenation operation, encode is an arbitrary encoding of tuples, compress is an optimal compressor of the sequence, and BL(x) is the bit length of x. The particular codes chosen by the encoder is irrelevant with respect to the entropy of the streams999An optimal coder is informative about the underlying distribution and thus efficient codes are not necessarily even sought in the context of encrypted search.

If 𝐐ˇ𝟏,,𝐐ˇ𝐧 are independent and identically distributed, then the joint entropy is given by

(𝐐ˇ𝟏,,𝐐ˇ𝐧)=n𝔼[BL(compress(encode(𝐐ˇ)))]. (5.8)

The information conveyed by a message is the reduction in uncertainty. The information is given by

(𝐐𝐧|𝐐ˇ𝐧) =(𝐐𝐧)(𝐐𝐧|𝐐ˇ𝐧) (5.9)
=(𝐐𝐧)+(𝐐ˇ𝐧)(𝐐𝐧,𝐐ˇ𝐧). (5.10)

No information is conveyed about the indicated random variables if the hidden queries and plaintext queries are uncorrelated. However, it is possible the hidden queries are correlated with other factors not incorporated into the probabilistic model.

Conversely, if the hidden queries obtain maximum entropy, there are no patterns and thus it is not possible for it to be correlated with any other hypothetical factor. For this reason, optimal confidentiality is obtained by the maximum entropy distribution.

5.1 Principle of maximum entropy

Given the constraints of the system, the hidden queries necessarily convey some information about the plaintext queries. However, the greater the entropy, the fewer regularities and patterns in the encrypted search system. The maximum entropy given the system constraints is given by the following theorem.

Theorem 5.1 (Constrained maximum entropy).

The maximum entropy of a sequence of random tuples 𝐐ˇ𝟏,,𝐐ˇ𝐧 subject to the constraints of the communications model is given by

(λ,k,N,M,n,p)=n((Tˇ|λ)+(Aˇ|k)+(𝐗ˇ|M,p)+(𝐃ˇ|N).) (5.11)
Proof.
(λ,k,N,M,n,p)=i=1n((Tˇ|λ)+(Aˇ|k)+(𝐗ˇ|M,p)+(𝐃ˇ|N).) (5.12)

Since (𝐐ˇ𝐣,𝐐ˇ𝐤)(𝐐ˇ𝐣)+(𝐐ˇ𝐤) for any jk, to maximize entropy, we seek to maximize the independence without violating any constraints.

The random tuples (Tˇj,Aˇj,𝐗ˇ𝐣) for j=1,,n are independently distributed. Also, since we are interested in the maximum entropy distribution, the random tuples are identically distributed.

Continuing on, the components in the random tuple, Tˇ, Aˇ, 𝐗ˇ, and 𝐃ˇ are independently distributed. Thus, the maximum entropy has a form given by

(λ,μ,k,m,n)=n((T|λ)+(A|k)+(N,𝐗|μ,m)). (5.13)

Theorem 5.2.

The maximum entropy (Tˇ|λ) is given by

(Tˇ|λ)=1+ln1λ. (5.14)
Proof.

The continuous random variable Tˇj generates inter-arrival times for queries. The arrival rate is specified as λ, and thus 𝔼[Tˇj]=1λ.

The exponential distribution with a mean 1λ and a support >0 is the maximum entropy distribution in which these constraints hold,

TˇjEXP(λ) (5.15)

for j=1,,n, which has an entropy

1+ln1λ. (5.16)

We assume the inter-arrival times are continuous. To store the inter-arrival times on a computer, we must quantize them.

Theorem 5.3.

The optimal compression for exponentially distributed inter-arrival times with a precision τ has an expected lower-bound given by

(Tˇ|λ,τ)=1λτlog21λτ+(1λτ1)log2(1λτ1) (5.17)

which asymptotically obtains

limλτ0(Tˇ|λ,τ)=log21λ+log21τ+log2e. (5.18)
Proof.

Let N(τ) be geometrically distributed with a parameter p=λτ where λ>0 and τ>0 is an interval of time,

N(τ)GEO(p=λτ), (5.19)

As τ0, N(τ) converges in distribution to the exponential distribution with an arrival rate λ.

Thus, we may choose a suitably small τ (the smaller, the more accurately we code the inter-arrival times) and use the entropy of the geoemtric distribution, e.g., for a given τ, the optimal compressor has a lower bound given by the entropy

(N(τ))=(1p)log2(1p)plog2pp, (5.20)

where p=λτ. After simplification, the result follows. ∎

Theorem 5.4 (Solution for (Aˇ|k)).

The maximum entropy subject to the constraints of the communications model is given by

(Aˇ|λ)=log2k. (5.21)
Proof.

Assigning a unique integer (label) in the set {1,2,,k} to each of the k search agents, the discrete uniform distribution is the maximum entropy distribution,

AˇDU(k) (5.22)

for j=1,,n. The probability mass function is given by

pAˇ(a|k)=1k𝟙k{1,,k} (5.23)

with an entropy given by

(A|k)=log2k. (5.24)

Solution for (𝐗ˇ|M,p).
Solution for (𝔻|N)

The maximum entropy system has a distribution given by the following corollary.

Corollary 5.4.1.

The maximum entropy system has a random tuple distribution given by

(T,A,𝐘)pT,A,𝐘(t,a,𝐲|λ,μ,k,m)=λ(1λ)t1×1k×1μ(11μ)α12α)m, (5.25)

where μ>1, 0<λ<1, k1, and α=dim(𝐲)1.

If we generate n tuples from this distribution and losslessly compress the results with an optimal compressor, it is expected that the bit length of the compressor’s output obtains the lower-bound given by the maximum entropy.

If the parameters of the maximum entropy n distribution is not known, then it may be estimated by the following theorem.

Theorem 5.5.

The maximum likelihood estimator of n is given by

n^=n(m^,k^,λ^,μ^), (5.26)

where m^ is the maximum likelihood estimator of m given by the number of unique trapdoors in the sample, the maximum likelihood estimator101010The UMVU estimator of k is given by k¯=n+1nk^. of k is given by

k^=maxa1,,an, (5.27)

the maximum likelihood estimator of λ is given by

λ^=n[i=1nti]1, (5.28)

and the maximum likelihood estimator of μ is given by

μ^=1ni=1ndim(𝐱𝐢). (5.29)
Proof.

Given the distribution of the maximum entropy, where T is geometrically distributed with arrival rate λ, the UMVU estimator of λ is given by Equation 5.28.

Continue on in the same fashion for the other random variables. ∎

Using the large sample approximation, the maximum likelihood estimator n^ is normally distributed as given by

n^𝒩(n1,1nVar[1^]). (5.30)

We assume a sufficiently large sample of size n is available so that we may assume the variance of the sampling distribution of Sn is small.

If the entropy of the system n is not known, then it may be estimated by the following theorem.

Theorem 5.6.

A positive biased estimator of the entropy n is given by

^n=BL(compress(concat(encode(𝐪𝟏),,encode(𝐪𝐧)))), (5.31)

where qj=(tj,aj,𝐱𝐣,𝐝𝐣) is the ith observed tuple and compress is a near-optimal lossless compressor.

Proof.

We have the following corollary.

Corollary 5.6.1.

with an asymptotic form given by

n(m,k,λ,μ)=n(log2μkλ+μ(m+1)+const) (5.32)

where const=2log2e and m,k,λ,μ are the system parameters.

By Postulate 5.3, an optimal compressor compress is expected to obtain the lower-bound n. Plugging in a sub-optimal compressor will produce an estimate of this lower-bound. And, since the compressor is not optimal, it produces estiamtes larger than the true lower bound, i.e., it is a positive biased estimator. ∎

Performance measure

The performance measure of an encrypted search system is given by the following definition.

Definition 5.4.

The performance of a encrypted search system with entropy n is given by

e(m,k,λ,μ)=n(m,k,λ,μ)n(m,k,λ,μ). (5.33)

A system that obtains e()=1 is said to disclose minimum information. Conversely, a system which obtains e()=0 (the degenerate distribution) is said to disclose maximum information.111111The Shannon information of a system in which e()=0 is 0, but we are measuring this with respect to an adversary being able to predict what will happen, so my disclose maximum information we mean to suggest that it is predictable.

If e(m,k,λ,μ) is not known, then it may be estimated by the following statistic.

Corollary 5.6.2.

By Equations 5.26 and 5.31, an estimator of the performance measure is given by

e^=^n^n. (5.34)
Proof.

If the maximum likelihood estimator of a parameter θ is θ^, by the plugin principle the maximum likelihood estimator of a parameter g(θ) is given by g^=g(θ^). ∎

A central idea in this paper is that compression is equivalent to probabilistic data modeling since a good compressor tends to be a good predictor of the data. In fact, many compression algorithms essentially estimate conditional probability mass functions of the data (so that shorter codes may be assigned to more probably symbols). Thus, we may delegate the data modeling task to good compressors of encrypted search activities. The more noise-like the activities are, the larger the compressed output.

6 Maximum entropy system

Given that it is known there are k search agent, …

State adversary can observe. Note that a global adversary, or an adversary that has side-channel information, or an adversary that can manipulate traffic flows… may do better.

The entropy of the system is given by

Theorem 6.1.

The probability model with maximum entropy is given by the following theorem.

Theorem 6.2.

The distribution of the k search agents identities in the hidden query time series are independently and uniformly distributed,

AjDU(k). (6.1)
Proof.

We assume that the adversary knows there are k unique search agents. ∎

For a maximum entropy system, the unary coder is optimal if the inter-arrival times are geometrically distributed as

TjGEO(λ=12), (6.2)

the unary coder is optimal if the number of trapdoors per hidden query is geometrically distributed as

NjGEO(μ=2), (6.3)

and m bits per trapdoor is the optimal coder if the occurrences of trapdoors are uniformly distributed as

YjDU(0,2m1) (6.4)

for j=1,,n.

Consider a practical system with the following parameters:

  1. 1.

    The query arrival rate λ=12.

  2. 2.

    The mean number of words per query is μ.

  3. 3.

    There are k unique search agents.

  4. 4.

    𝐗𝐣 is the random vector of trapdoors in which each component has m possibilities with replacement,

  5. 5.

    Wj is the random dimension of the random vector 𝐃𝐣 with a mean θ, and

  6. 6.

    𝐃𝐣 is the random vector of results corresponding to 𝐘𝐣 in which each component has N possibilities without replacement.

Table 2 is the optimal coder if the occurrences of search agents are uniformly distributed as

AjDU({a1,a2,,ak}), (6.5)

the unary coder is optimal if arrival rates are geometrically distributed as

TjGEO(λ=12), (6.6)

the unary coder is optimal if the number of trapdoors per hidden query is geometrically distributed as

NjGEO(μ=2), (6.7)

and m bits per trapdoor is the optimal coder if the occurrences of trapdoors are uniformly distributed as

YjDU(0,2m1) (6.8)

for j=1,,n.

Encoding the hidden queries and result sets for transmission e.g., the unary encoder given by Table 3, each trapdoor y𝐲 is encoded by a bit string of fixed-length m, and a is encoded.

Theorem 6.3.

The expected optimally compressed bit length of a hidden query is given by

=1λ+p+μ(1+m). (6.9)
Proof.

The expected bit length is given by the expectation of the bit length of the encoder on a random tuple as given by

=𝔼[BL(encode(Tj,Aj,𝐗𝐣)]. (6.10)

We may look at how each of these random variables are coded separately.

The time stamp is coded by the unary coder given by Table 3, where an integer n>0 has a bit length n. Thus, the expected bit length is given by

𝔼[Tj]=1λ, (6.11)

where the mean inter-arrival time is given by the reciprocal of the arrival rate λ, which is a characteristic of the encrypted search system.

The search agents are coded by fixed-length bit strings of size p. Thus, the expected bit length of the code for a search agent is p.

The number of trapdoors dim(𝐘𝐣) is coded by the unary coder given by Table 3, where an integer n>0 has a bit length n. Thus, the expected bit length is given by

𝔼[Nj]=μ, (6.12)

where μ is a characteristic of the encrypted search system.

The trapdoors are coded by bit strings of fixed-length m and there are expected to be μ trapdoors per hidden query, thus the expected bit length of 𝐲 is given by

μm. (6.13)

Concatenating these codes together produces an encoding with an expected bit length given by

1λ+p+μ(1+m). (6.14)

Table 2: Code for search agents
Search agent Code
SA1 0 0 0 02
SA2 0 0 1 02
SA3 0 1 0 02
SA4 0 1 1 02
SA5 1 0 0 02
SA6 1 0 1 02
Table 3: Unary code for inter-arrival time
τ Code
1 12
2 0 12
3 0 0 12
4 0 0 0 12
5 0 0 0 0 12
6 0 0 0 0 0 12

7 Maximum Entropy Under Constraints

In this section, we derive the maximum entropy distribution for encrypted search activities subject to realistic system constraints. The maximum entropy principle, formalized by Jaynes[10], states that subject to precisely stated prior information, the probability distribution that best represents the current state of knowledge is the one with the largest entropy.

7.1 System Constraints

An encrypted search system operates under the following constraints, which we formalize as expectations or support restrictions on the probability distributions:

  1. 1.

    Query arrival rate: The mean number of queries per unit time is λ.

  2. 2.

    Number of search agents: There are k distinct search agents.

  3. 3.

    Query size: The mean number of trapdoors per hidden query is μ.

  4. 4.

    Trapdoor vocabulary: There are m possible distinct trapdoors.

  5. 5.

    Document collection: There are N distinct documents.

  6. 6.

    Result set size: The mean number of documents returned per query is θ.

These constraints reflect observable or known properties of the system that cannot be hidden without fundamentally changing the system’s functionality or violating resource constraints.

7.2 Maximum Entropy for Inter-Arrival Times

The inter-arrival time between consecutive queries is constrained by the arrival rate λ.

Theorem 7.1 (Maximum entropy for inter-arrival times).

Subject to the constraint that 𝔼[Tˇ]=1/λ, the distribution that maximizes entropy is the exponential distribution:

TˇEXP(λ) (7.1)

with entropy

(Tˇ)=1+ln1λ. (7.2)
Proof.

Among continuous distributions on >0 with a fixed mean 1/λ, the exponential distribution maximizes differential entropy. This follows from the calculus of variations applied to the entropy functional subject to the mean constraint. The exponential distribution has probability density function

fTˇ(t)=λeλt (7.3)

and differential entropy

(Tˇ)=0fTˇ(t)lnfTˇ(t)dt=1+ln1λ. (7.4)

7.3 Maximum Entropy for Search Agent Identities

The search agent identity for each query is constrained by the total number of agents k.

Theorem 7.2 (Maximum entropy for search agent identities).

Subject to the constraint that there are k search agents, the distribution that maximizes entropy is the discrete uniform distribution:

AˇDU(1,k) (7.5)

with entropy

(Aˇ)=log2k. (7.6)
Proof.

Among discrete distributions on a finite support of size k, the uniform distribution maximizes entropy. The uniform distribution has probability mass function

pAˇ(a)=1kfor a{1,2,,k} (7.7)

and entropy

(Aˇ)=a=1k1klog21k=log2k. (7.8)

7.4 Maximum Entropy for Hidden Query Cardinality

The cardinality of a hidden query (number of trapdoors) is constrained by the mean μ and typically by a maximum value u.

Theorem 7.3 (Maximum entropy for query cardinality).

Subject to the constraint that 𝔼[Ntrap]=μ where Ntrap{1,2,,u}, an approximate maximum entropy distribution is the geometric distribution (when u is large):

NtrapGEO(p) (7.9)

where p=1/μ, with entropy

(Ntrap)=(1p)log2(1p)plog2pp. (7.10)
Proof.

The geometric distribution maximizes entropy among discrete distributions on positive integers with a given mean. For p=1/μ, the geometric distribution has mean μ and probability mass function

pN(n)=p(1p)n1for n1. (7.11)

When the maximum value u is large relative to μ, the truncation has negligible effect on the entropy. ∎

7.5 Maximum Entropy for Trapdoor Selection

The trapdoors within a hidden query are drawn from a vocabulary of size m.

Theorem 7.4 (Maximum entropy for trapdoor selection).

Subject to no constraints beyond the vocabulary size m, the distribution that maximizes entropy for each trapdoor is the discrete uniform distribution:

YiDU(1,m) (7.12)

with entropy per trapdoor

(Yi)=log2m. (7.13)
Proof.

Without additional constraints on the relative frequencies of trapdoors, the uniform distribution over the vocabulary maximizes entropy. This gives entropy log2m per trapdoor selection. ∎

7.6 Maximum Entropy for Result Sets

The result sets are constrained by the document collection size N and mean result set size θ.

Theorem 7.5 (Maximum entropy for result set cardinality).

Subject to the constraint that 𝔼[Nresults]=θ where Nresults{0,1,,N}, the approximate maximum entropy distribution (for large N) is geometric or Poisson-like.

7.7 Joint Maximum Entropy

Combining these results, we obtain the maximum entropy for the complete system.

Theorem 7.6 (Joint maximum entropy).

Under the assumption of independence (which maximizes joint entropy), the maximum entropy for n query tuples is:

n=n[(Tˇ)+(Aˇ)+(Ntrap)+𝔼[Ntrap](Y)+(Nresults)+𝔼[Nresults]log2N]. (7.14)
Proof.

Since entropy is additive for independent random variables, and we assume each query tuple is independent and identically distributed:

(𝐐ˇ𝟏,,𝐐ˇ𝐧)=i=1n(𝐐ˇ𝐢)=n(𝐐ˇ). (7.15)

Within each tuple, assuming independence of components:

(𝐐ˇ)=(Tˇ)+(Aˇ)+(𝐗ˇ)+(𝐃ˇ). (7.16)

The entropy of the hidden query bag depends on both the cardinality and the trapdoor selections:

(𝐗ˇ)=(Ntrap)+𝔼[Ntrap](Y). (7.17)

Similarly for result sets. ∎

7.8 Minimum Mutual Information

The minimum mutual information between plaintext queries 𝐐𝟏,,𝐐𝐧 and hidden queries 𝐐ˇ𝟏,,𝐐ˇ𝐧 is achieved when the hidden queries realize the maximum entropy distribution.

Corollary 7.6.1 (Minimum mutual information).

The minimum mutual information is:

min(𝐐𝟏:𝐧;𝐐ˇ𝟏:𝐧)=(𝐐𝟏:𝐧)+nmax(𝐐𝟏:𝐧,𝐐ˇ𝟏:𝐧) (7.18)

where max represents the maximum possible joint entropy. When the system achieves maximum entropy, the mutual information equals the inherent correlation required by system functionality.

This provides a lower bound on information leakage determined by the fundamental requirements of the encrypted search system.

8 Increasing the entropy of the system

Assuming that the confidential collection of documents is immutable, a given hidden query always maps to the same result set. That is to say, given a hidden query, the corresponding random result set is degenerate. Consquently, the entropy of the joint distribution of hidden queries and result sets is just the entropy of either distribution,

(𝐗ˇ𝐢,𝐃ˇ𝐢)=(𝐗ˇ𝐢)=(𝐃ˇ𝐢). (8.1)

Thus, the only way to increase the entropy of the result sets is to increase the entropy of the hidden queries. We can thus focus on finding ways to increase the entropy of the hidden queries.

Note that conditional probability distribution of (𝐃ˇ𝐢|𝐗ˇ𝐢 is necessarily degenerate since a particular hidden query must always map to the same result set. However, 𝐗ˇ𝐢|𝐃ˇ𝐢 is not degenerate since many hidden queries may map to the same result set.

Generally, the hidden query stream does not obtain an efficiency of 1. Thus, we propose to perturb the stream to increase the entropy. In this section, we cover general strategies for increasing the entropy at some quantifiable cost.121212In general, this is a non-linear optimization problem where we wish to maximize some function of the entropy and other performance measures.

The general approach is to interrupt any regularities or patterns in the original stream with unpredictable noise, which has the effect of increasing the entropy but decreasing some other performance measure, like the bandwidth requirements of encrypted search activities or the space complexity of the secure indexes.

8.1 Multiple secure indexes per document

Ideally, every document in the document store will have an equal probability of being referenced in the stream of result sets. However, this is unlikely since some documents are expected to be generally more relevant to the information needs of the search agents.

Similar to homophonic encryption for trapdoors, a document should be given multiple secure index representations (and more representations in the document store) proportional to the reciprocal of its relative frequency in the stream of result sets. However, there are a few problems with this approach:

  1. 1.

    The relative frequency is probably not known a priori.

  2. 2.

    This increases both the space complexity and time complexity of the encrypted search system since more objects must be stored and queried.

For each document, construct multiple secure indexes in which each secure index for the given document will have a different representation because they will use different document identifiers131313An automated method for generating unique document identifiers may be accomplished by encrypting, using an invertible encryption scheme, the plaintext document reference with different salts. and different salts for their trapdoors.

This will improve query privacy in a way similiar to homophic encryption. However, homophonic encryption more efficiently serves this purpose. The primary advantage to having multiple secure indexes per document is that it enables the same plaintext query to return a set of logically equivalent results. This obscures a user’s search patterns, e.g., different users with similar search patterns may sample the salts differently to make their search patterns appear dissimilar.

Example 2  Let a document A read “Hello world!”. Let us represent document A with N=3 secure indexes, denoted by A1, A2, and A3. Then, in the biword model, Ai’s trapdoors may be generated from the set given by

Aimake_secure_index({(hello+salti),(world+salti),(hello world+salti)}|β). (8.2)

This method increases the space and time complexities of encrypted search, e.g., the size of the secure index database grows linearly with respect to N.

8.2 Artificial secure indexes

An extension of multiple secure indexes per document is the automatic inclusion of artificial secure indexes. These may be automatically generated from some language model (e.g., trigram language model) such that they are expected to be relevant to an appropriate percentage of queries.

Artificial secure indexes make it more difficult for an adversary to determine which documents retrieved in response to a hidden query are of actual interest while the user who submitted the search can instantly filter out the fake results. For example, the trapdoor of a artificial secure index document identifier tests positively as a member of the artificial document class.

8.3 Homophonic encryption

We map the accuracy of the adversary with respect to a sample size for various efficiencies in the following analysis. The greater the efficiency (or entropy), the less accurate the mapping is expected to be. At one extreme, we have an efficiency of 0 (minimum entropy) in which 100% of the traffic is successfully mapped after viewing a sample of size 1 and the other extreme we have an efficiency of 1 (maximum entropy) where the accuracy is given by pure random chance and is not correlated with sample size.

Figure 3: Accuracy vs sample size where N=1000
for several different entropies
1101001,00010,000100,00000.20.40.60.81sample size naccuracy re(0,N)=1.00e(13,N)=0.99e(23,N)=0.93e(1,N)=0.80e(2,N)=0.34e(4,N)=0.07e(,N)=0.00

Homophonic encryption may be employed to flatten the distribution of trapdoors by giving each plaintext word one or more trapdoors signatures.

Figure 4: Testing
510152000.10.20.3kpK(k,s=1,N=20)
510152000.10.20.3k

By analyzing queries, a reasonable homophonic code can be devised. However, such trapdoors must be relevant to a corresponding set of secure indexes to facilitate encrypted search. Thus, the more substitutions, the larger the secure indexes. If the space requirements are too demanding to create a completely flat marginal distribution, then we can settle on making the distribution more flat.

Flattening the marginal distribution does not prevent adversaries from learning mappings since random plain-text words Tj and Tk are not necessarily statistically independent and, for instance, may still have a non-uniform conditional distribution

Tj|TkpTj|Tk(). (8.3)
params :  b, maximize the marginal entropy of the distribution of trapdoors up to the b-th ranked word.
input :  A plaintext word x.
output :  A trapdoor xˇ of plaintext word x;
1 function substitutions(x)
       // Retrieve the b-th ranked word and compute its relative frequency.
2       βpX(Rank1(b));
3       m1;
       // If the rank of x is less than b, we must provide more than 1 possible substitution such that the first b words are (approximately) uniformly distributed.
4       if Rank(x)<b then
5             mpX(x)β+12;
6            
7       end if
8      return m;
9      
10
Algorithm 5 Homophonic substitution cipher

8.4 Query aggregation

Instead of submitting a single query consisting of k search keys, we can reduce it to y queries where the jth query has kj search keys for j=1,,y such that k1++ky=k and then apply the set-intersection operation on the y result sets on a trusted machine, such as the search agent’s host computer rather than the untrusted encrypted search provider.

{remark}

A query model that includes the full set-theoretic model[13] may reveal significantly more information about a search agent’s information need. Thus, if a full set-theoretic model is desired, a strong case can be made that it should be implemented using a variation of query aggregation, where set-theoretic operations (with possibly the exception of set-intersection) are applied to the result sets on a trusted machine rather than the untrusted encrypted search provider.

8.5 Digraphs and trigraphs

Suppose there are w unique trapdoors in a hidden query time series. To reduce correlations and flatten the distribution, we may employ digraphs, trigraphs, or more generally k-graphs.

Assumption 8.1.

The maximum size of a query is p atomic search keys.

A k-graph secure index may, with a single trapdoor, satisfy any query consisting of up to k search keys. If the query has p>k search keys, then

pk (8.4)

k-graphs may be constructed to implement the query.

More generally, any query of size p may be constructed in many ways using various graphs from 1 to k. If the primary concern is confidentiality, then the graphs should be chosen in such a way as to make the frequency distribution of the trapdoors as uniform as possible.

j=1k(pj)

Then, the space complexity is given by

j=1k(nj)log2εbits, (8.5)

which is on the order of

𝒪(nk)log2εbits. (8.6)

As an upper-limit, k=n and the space complexity is given by (2n1)log2εbits.

In a digraph, k=2, thus the space complexity is given by

n(n+1)2log2εbits, (8.7)

or on average

n+12log2εbits per key. (8.8)

8.6 Artificial trapdoors

The effectiveness of artificial trapdoors depends critically on the encrypted search provider’s ranking mechanism. In boolean search systems where all query terms must be present for document retrieval, artificial trapdoors will cause no matching documents to be returned, making this technique ineffective. However, for rank-ordered retrieval systems using scoring functions such as BM25 or TF-IDF, artificial trapdoors integrate seamlessly with authentic trapdoors. In such systems, artificial trapdoors affect result sets only when: (1) they trigger false positives in secure indexes due to the approximate nature of structures like Bloom filters, or (2) they coincidentally collide with authentic trapdoors due to the finite trapdoor space.

The probability of collision between artificial and authentic trapdoors is governed by the birthday paradox. With m authentic trapdoors and m′′ artificial trapdoors drawn uniformly from a space of size 2l, the expected number of collisions is approximately mm′′2l. For typical systems with l128 bits, collision probability remains negligible even with thousands of artificial trapdoors.

Suppose that each query has some random number L of artificial trapdoors, where

LpL(|μL) (8.9)

such that

𝔼[L]=μL. (8.10)

Let the random variable corresponding to the artificial trapdoor be given by

YpY(|mY), (8.11)

where mY is the number of unique artificial trapdoors.

The maximum entropy of the joint distribution of L and Y is given by the following theorem.

Theorem 8.1.
(L,Y)=(L|μL,Y) (8.12)

Suppose we have l bits per trapdoor, then at maximum we may generate m′′=2lm artificial trapdoors, where m are the unique number of bit patterns corresponding to authentic trapdoors.

Letting M and Tj for all j be independent, then the mean number of authentic and artificial trapdoors per query is given by

μ′′=μ+μL. (8.13)

If we randomize the order of the trapdoors in the hidden queries and let μL, then the efficiency of the entire encrypted search system will converge to

e(μ′′)=n(N,𝐘|μ′′)n(N,𝐘|μ′′). (8.14)

Assume the adversary has a model of the hidden query stream using known-plaintext attacks. Perturbing the hidden query stream by adding noise to it may counter known-plaintext attacks. Alternatively, assume that the adversary knows the secret and thus may use a dictionary attack to decipher the trapdoors. Then, adding noise may in some cases obfuscate what a search agent is actually interested in.

To mitigate such information leaks, in general we can look to oblivious RAM for inspiration. oblivious RAM may naively be thought of in the following way: to prevent meaningful statistics from being gathered about a user’s activities, whenever an action–a read or write–is performed, include other randomly chosen actions to obscure the user’s actual interests or activities.

8.7 Artificial hidden queries

Unlike artificial trapdoors which may alter result sets through collisions or false positives, artificial queries are designed to be distinguishable by the search agent while appearing indistinguishable to the adversary. The search agent can filter out artificial queries from result sets using cryptographic tags or sequence numbers, ensuring that authentic queries receive unmodified results. This approach trades space complexity for time and bandwidth: instead of expanding the trapdoor space through homophonic encryption (Section 9.3), we expand the temporal query stream.

The choice of query representation (unigrams, bigrams, trigrams, or longer n-grams) significantly impacts the entropy-confidentiality tradeoff. Unigram queries provide a vocabulary of size |V|, yielding at most log2|V| bits of entropy per trapdoor. Bigram queries expand the space to |V|2 possibilities, increasing per-trapdoor entropy to 2log2|V| bits at the cost of larger secure indexes. Position-sensitive representations like skip-grams or phrase queries further expand the trapdoor space while supporting richer search semantics.

To increase the entropy of the hidden queries without the space complexity costs associated with homophonic encryption (see Section 9.3) but rather with a time complexity and transmission rate cost, we may inject artificial queries into the hidden query stream.

To increase the entropy, the artificial queries should be injected into the stream to make the hidden query stream less correlated.

Example 3  Consider the authentic hidden query stream given by

(t1,aj1,𝐲𝟏),(t2,aj1,𝐲𝟐),(t3,aj1,𝐲𝟑). (8.15)

There may be patterns in this sequence, such as autocorrelations between 𝐲𝟏, 𝐲𝟐, and 𝐲𝟑. If we inject the artificial queries 𝐲𝟏 and 𝐲𝟐 into the stream, resulting in the perturbed hidden query stream given by

(t1,aj1,𝐲𝟏),(t1,,𝐲𝟏),(t2,aj2,𝐲𝟐),(t2,,𝐲𝟐),(t3,aj3,𝐲𝟑), (8.16)

where t1t1t2t2t3. The perturbed stream may attenuate141414And therefore increases the entropy. any regularities such as auto-correlations.

The entropy of this perturbed stream is maximally increased when the time stamps are geometrically distributed with a mean query rate λˇ per search agent, the search agent identities are uniformly distributed between 1 and kˇ, the cardinality of the trapdoor sets are binomially distributed between 1 and mˇ with a mean μˇ, and the trapdoors are uniformly distributed between 1 and mˇ.

This technique does not transform authentic hidden queries. However, as λˇ increases due to injecting artificial hidden queries, patterns become attenuated. Asymptotically, as λˇ, the distribution of hidden query converges to the maximum entropy distribution. Of course, asymptotically, infinite bandwidth and computational resources are required.

8.7.1 Alternative solution

A potential problem with previous solution of increasing the entropy is that the obfuscator must inject artificial hidden queries without prompting from the search agent. If this is impractical, then a less effective solution – one that only increases the entropy of the trapdoors – is for search agents to (automatically) generate random artificial hidden queries whenever they generate authentic hidden queries.

To maximize the entropy under these constraints, the artificial hidden queries are generated in the following ways:

  1. 1.

    The time stamps of the queries are randomized to change the order of the query submissions.151515The time stamps are approximately the same. Thus, if r hidden queries are generated, there are r! possible orderings each of which is equally probable.

  2. 2.

    The random cardinality of each trapdoor set in the artificial hidden queries is binomially distributed with a mean θˇ with a maximum value of mˇ.

  3. 3.

    Each element of the trapdoor set is uniformly distributed with a support set {1,,mˇ}.

  4. 4.

    The random number of artificial queries is geometrically distributed with a mean rate given by λˇ with a support set {0,1,2,} and thus each authentic hidden query has on average 1λˇλˇ artificial hidden queries bundled with it.

8.8 Obfuscating search agents

The adversary may observe the time series of hidden queries. By analyzing the network traffic going to and from the ESP, the adversary may be able to uniquely label the search agents generating the hidden queries, especially if no precautionary measures are taken to obfuscate this identifying161616For instance, IP addresses. information.

A mix network[3] is an overlay network that may obscure the identities of search agents.

An onion network is another type of overlay network…

The maximum uncertainty occurs when there is no identifying information. Consequently, given that the adversary knows there are k unique search agents, the probability that a particular search agent is responsible for a particular hidden query is 1/k with an entropy given by log2k.

A mix network helps, but search agents may have identifying search patterns, i.e., search agent j may be more likely to generate queries in particular intervals of time. These correlations may be obfuscated using other methods discussed previously.

8.9 Injecting artificial search agents

To increase the entropy, artificial search agents may be introduced that generate artificial hidden queries. If there are k authentic search agents and w artificial search agents, then there are k=k+w search agents in total.

If k is known, then the maximum entropy distribution is the same as before, the discrete uniform distribution over k search agents with an entropy given by (Aˇ|k)=log2k. Of course, in practice the maximum entropy may not be achieved and introducing artificial search agents may increase the entropy.

However, if k is not known, but k is, then k may be modeled as a random variable K with a support given by {0,1,,k}.

The joint distribution of (Aˇ,K) has an entropy given by

(Aˇ,K)=(Aˇ|K)+(K) (8.17)

with a maximum entropy given by

(Aˇ|K)+(K). (8.18)
(Aˇ|K)+log2k+log2k. (8.19)

If pK(k|k) is degenerate and assigns all the probability to the authentic number of search agents, then the entropy…

8.10 Obfuscating inter-arrival times

If the query arrival rate is λ, then the maximum entropy distribution of inter-arrival times in the hidden query time series is exponentially distributed with a rate λ, denoted by

TˇEXP(λ). (8.20)

We use queuing theory to characterize the query arrival times where we consider the obfuscator to be the server and the search agents to be the customers.

The arrival times are the times that plaintext queries are received by the obfuscator; when a query arrives, it is put into the queue and the obfuscator “serves” queries at the head of the queue.

If over an interval of time Δt the obfuscator receives n queries, then the average inter-arrival time over that interval of time is simply Δt/n and the arrival rate is λ=n/Δt. More specifically, suppose we have n plaintext queries with inter-arrival times t1,,tn. The mean inter-arrival time is

t¯=1nj=1ntj, (8.21)

and therefore the arrival rate is

λ=1t¯. (8.22)

We assume tj follows a probability distribution Tj for j=1,,n. To reshape the arrival times at the ESP, the obfuscator may delay sending hidden queries to the ESP. Under the queuing theory model, the delay may be considered the service time. If the mean service time is μ, then the service rate is 1/μ. To keep up with the query arrival rate, the service rate must be λ, i.e., μ=1/λ, which is equivalent to the mean inter-arrival time.

Theorem 8.2.

Suppose we have k search agents with query rates λ1,,λk. If each uses an obfuscator to transform the inter-arrival times to be exponentially distributed with arrival rates λ1,,λk, then the collective inter-arrival times is exponentially distributed with an arrival rate λ1++λk.

Proof.

Show proof. ∎

Figure 5: Inter-arrival time obfuscation
Refer to captionobfuscatorRefer to captionSARefer to caption***Refer to captiont2t1Refer to caption

*

*

*

Refer to captiontˇ2tˇ1Refer to captionESP
{remark}

Intuitively, the no-memory property of the exponential distribution is indicative of its maximum entropy. However, if we allow the support to be constrained over 0 to 2/λ, then the uniform distribution obtains the same entropy lnλ such that 𝔼[Tˇ]=1/λ.

8.10.1 Estimating search agent arrival rates

The k search agents have query arrival rates λ1,,λk which may be unknown to the adversary. Assuming the obfuscators transform the inter-arrival times to be exponentially distributed, the queries collectively arrive with inter-arrival times distributed exponentially with an arrival rate λ=λ1++λk, which may be observed by the adversary.

Estimating the arrival rates of the search agents may be revealing. Suppose that the adversary, by some process, has a sample of inter-arrival times (and corresponding hidden queries and result sets) along with a set of candidate search agents who were the most likely to have submitted the query (or, alternatively, it is known that one of the candidates submitted the query).

Then, the arrival rates may be estimated from this sample of the masked search agents.

To describe the output process of the queuing system, the probability distribution for the service time distribution which governs a “customer’s” service time. We assume the service time distribution is independent of the number of customers present. This implies, for example, that the server does not work faster when more customers are present.

The obfsucator could impose a service delay that is exactly the inter-arrival time 1/λj rather than creating a traffic flow that is exponentially distributed with a mean inter-arrival time 1/λj. In this case, the adversary can only guess that the average number of queries per day is 24λ queries per day. However, if the queries are not uniformly distributed, it is not be possible (without introducing fake queries) to maintain this constant delay, which reveals information about the distribution of query times.

Queue discipline: FCFS discipline - first come first serve SORS discipline - service in random order

Suppose we have k search agents, where search agent j has a query rate λj. Then, the total query rate is λ=λ1++λk.

If the search agents are unable to effectively anonymize their identities, then a strategy for confidentiality is to put the queries into a local queue and have the queue emit the queries in such a way that the inter-arrival times are exponentially distributed with an arrival rate λj.

Of course, if queries come in bursts as is often the case (that is, the inter-arrival times exhibit large variance), then the queue must delay queries.

may cause significant delays. Additionally, if there are no queries in the queue due to the bu

If the encrypted search system is receiving queries at a rate λ, then on average each of the k search agent is sending queries at a rate λ/k

9 Case Study: Typical Encrypted Search System

In this section, we analyze a typical encrypted search deployment to demonstrate the practical application of our information-theoretic framework. We compare the entropy of actual system behavior against the maximum entropy possible under system constraints, quantifying the confidentiality gap and proposing targeted improvements.

9.1 System Parameters

Consider an encrypted search system with the following characteristics, representative of a small organizational deployment:

Table 4: Parameters for case study system
Parameter Value Description
k 10 Number of search agents
m 10,000 Distinct words in query vocabulary
N 1,000 Documents in collection
λ 0.1 Query arrival rate (queries per second)
μ 3 Mean trapdoors per query
u 10 Maximum trapdoors per query
θ 5 Mean documents per result set

9.2 Baseline: Simple Substitution Cipher

We first analyze a simple substitution cipher where each word maps to a single trapdoor with no additional obfuscation.

9.2.1 Observed Distribution

In a typical query workload following a Zipf distribution, the query word frequencies are highly skewed:

pX(xi)1ifor word ranked i (9.1)

This creates a highly non-uniform trapdoor distribution:

pY(yi)=pX(xi)1i (9.2)

9.2.2 Entropy Calculation

The entropy of the trapdoor distribution under Zipf’s law with parameter s=1 is:

(Y)=i=1m1iHm,1log21iHm,1 (9.3)

where Hm,1=i=1m1/i is the m-th harmonic number.

For m=10,000, we have H10000,19.787, giving:

(Y)7.83 bits (9.4)

Compare this to the maximum entropy:

(Y)=log210,000=13.29 bits (9.5)

9.2.3 Efficiency

The ratio of actual to maximum entropy for trapdoors is:

etrap=7.8313.290.59 (9.6)

For complete queries with mean 3 trapdoors:

(𝐗ˇ)3×7.83=23.49 bits (9.7)

versus maximum:

(𝐗ˇ)3×13.29=39.87 bits (9.8)

The query efficiency is similarly equery0.59.

9.3 Improvement 1: Homophonic Encryption

We apply homophonic encryption to flatten the trapdoor distribution, giving the top b most frequent words multiple trapdoor representations.

9.3.1 Strategy

For the b=100 most frequent words, assign trapdoor multiplicities inversely proportional to their frequency:

ni=pX(x1)pX(xi)for ib (9.9)

This approximately flattens the distribution for the top b words.

9.3.2 Entropy Improvement

After homophonic encryption, the entropy of the top b words becomes approximately:

(Y1:b)log2b=log2100=6.64 bits (9.10)

The overall trapdoor entropy increases to approximately:

(Y)10.2 bits (9.11)

giving efficiency:

etrap10.213.290.77 (9.12)

9.3.3 Cost

The space complexity of secure indexes increases by:

ΔS=i=1b(ni1)518 additional trapdoors per document (9.13)

This represents a space overhead factor of approximately 1.52× for the secure indexes.

9.4 Improvement 2: Artificial Queries

We inject artificial queries at a rate λfake=0.05 queries per second, bringing the total rate to λtotal=0.15.

9.4.1 Entropy Improvement

The artificial queries are generated from the maximum entropy distribution, helping to mask patterns in authentic queries. The combined entropy approaches:

(𝐗ˇcombined)0.6723.49+0.3339.8728.86 bits (9.14)

giving efficiency:

equery28.8639.870.72 (9.15)

9.4.2 Cost

The bandwidth overhead is:

ΔB=λfakeλauthentic=0.050.10=0.50 (9.16)

representing a 50% increase in query traffic.

9.5 Combined Strategy

Applying both homophonic encryption and artificial queries:

Table 5: Comparison of strategies
Configuration Efficiency Space Bandwidth
Baseline 0.59 1.0× 1.0×
Homophonic only 0.77 1.52× 1.0×
Artificial queries only 0.72 1.0× 1.5×
Combined 0.85 1.52× 1.5×
Theoretical maximum 1.00

The combined strategy achieves 85% efficiency at the cost of a 52% space increase and 50% bandwidth increase. This demonstrates the practical tradeoff between confidentiality and resource consumption.

9.6 Attack Resistance

We evaluate resistance to known-plaintext attacks where an adversary observes a sample of plaintext-ciphertext query pairs.

9.6.1 Baseline Vulnerability

With the baseline system, after observing 100 queries, an adversary can map approximately 70% of the trapdoors in subsequent queries due to the highly skewed distribution.

9.6.2 Improved Resistance

With the combined strategy, the same adversary achieves only 35% accuracy after observing 100 queries, and the accuracy grows much more slowly with additional observations. The homophonic encryption provides multiple valid mappings, while artificial queries introduce noise that masks authentic patterns.

9.7 Recommendations

Based on this analysis, we recommend:

  1. 1.

    Deploy homophonic encryption with b100 for word vocabularies above 1,000 words.

  2. 2.

    Inject artificial queries at 30-50% of authentic query rate if bandwidth permits.

  3. 3.

    Monitor empirical entropy using compression-based estimation to detect degradation.

  4. 4.

    Adjust parameters dynamically based on observed attack attempts or pattern analysis.

This case study demonstrates that significant confidentiality improvements are achievable with moderate resource costs when guided by information-theoretic analysis.

10 Conclusion

We have presented an information-theoretic framework for analyzing and improving the confidentiality of encrypted search systems. By measuring entropy of observable encrypted search activities and comparing it to the maximum entropy possible under system constraints, we provide a quantitative confidentiality metric that guides system design and parameter selection.

10.1 Summary of Contributions

Our framework makes several key contributions to encrypted search security:

Theoretical foundations: We derived the maximum entropy distributions for encrypted search components under realistic constraints, providing closed-form solutions for inter-arrival times (exponential), search agent identities (uniform), query cardinality (geometric), and trapdoor selection. These results establish fundamental limits on confidentiality determined by system requirements rather than cryptographic assumptions.

Practical measurement: By connecting entropy to lossless compression through Shannon’s source coding theorem, we enable practical confidentiality measurement without explicit probabilistic modeling. System operators can monitor entropy using standard compression tools, detecting degradation and guiding countermeasure deployment.

Systematic improvement techniques: We analyzed multiple techniques for increasing entropy including homophonic encryption, artificial query injection, query aggregation, and timing obfuscation. Each technique trades specific resources for entropy gains, enabling informed decisions about confidentiality-performance tradeoffs.

Quantitative tradeoff analysis: Our case study demonstrated improving a typical system from 59% to 85% efficiency through combined application of homophonic encryption and artificial queries, with quantified space and bandwidth costs. This illustrates how information-theoretic analysis guides resource allocation for confidentiality improvements.

10.2 Implications for Practice

Our results have several practical implications for encrypted search deployment:

Design guidance: System designers can use maximum entropy distributions as targets, measuring how closely their systems approach theoretical limits. The efficiency ratio provides a single number summarizing confidentiality that can be tracked over time and compared across configurations.

Parameter selection: Rather than ad-hoc parameter tuning, our framework enables principled selection based on desired efficiency levels and available resources. For example, determining how many trapdoor substitutions to use in homophonic encryption or what rate to inject artificial queries.

Cost-benefit analysis: By quantifying both confidentiality gains and resource costs, decision makers can rationally allocate budgets across different countermeasures. High-value systems may justify significant overhead for marginal entropy improvements, while resource-constrained deployments can identify high-impact low-cost techniques.

Attack resistance: Higher entropy directly translates to greater difficulty for adversaries attempting statistical attacks. Our analysis shows that even modest entropy improvements substantially increase the sample size required for successful inference attacks.

10.3 Limitations and Assumptions

Our framework makes several simplifying assumptions that should be acknowledged:

Independence assumptions: We assume independence between queries and between query components when deriving maximum entropy. In practice, temporal correlations and user behavior patterns introduce dependencies that reduce achievable entropy.

Known parameters: Our analysis assumes certain system parameters like arrival rates and vocabulary sizes are known or observable. Uncertainty in these parameters affects both maximum entropy calculations and confidentiality assessments.

Compression-based estimation: Using lossless compressors to estimate entropy introduces positive bias that decreases slowly with sample size. Finite samples and suboptimal compressors yield conservative (higher) entropy estimates.

Adversary model: We focus on passive adversaries observing encrypted search traffic. Active adversaries with side channels, insider knowledge, or ability to manipulate traffic may achieve better inference than entropy alone predicts.

10.4 Future Work

Several directions warrant further investigation:

Dynamic optimization: Develop adaptive systems that automatically adjust parameters based on real-time entropy monitoring, maintaining target confidentiality levels under changing workloads.

Correlated query models: Extend the framework to account for temporal correlations and user behavior patterns, deriving tighter bounds on achievable entropy under realistic dependency structures.

Adversary-aware metrics: Incorporate specific adversary capabilities and attack models into confidentiality measures, moving beyond generic entropy to task-specific security metrics.

Differential privacy connections: Explore relationships between entropy-based confidentiality and differential privacy guarantees, potentially combining information-theoretic and privacy-theoretic perspectives.

Implementation and evaluation: Build prototype systems implementing our proposed techniques and evaluate their confidentiality-performance tradeoffs in realistic deployment scenarios with actual user workloads.

Extension to richer query models: Apply the framework to more sophisticated query types including range queries, Boolean combinations, and ranked retrieval, deriving maximum entropy distributions for these extended models.

10.5 Closing Remarks

Encrypted search faces an inherent tension between functionality and confidentiality. Perfect confidentiality renders search impossible, while unrestricted functionality leaks information. Our information-theoretic framework quantifies this tradeoff, providing tools to navigate it rationally.

By measuring how far systems deviate from maximum entropy and identifying techniques to close this gap, we enable principled encrypted search design. The entropy efficiency metric provides a clear target: systems should strive for distributions as close to maximum entropy as resources and functionality requirements permit.

As encrypted search systems become increasingly important for cloud computing, outsourced storage, and privacy-preserving information retrieval, principled approaches to confidentiality analysis become essential. We hope this work contributes to more secure encrypted search deployments by providing both theoretical understanding and practical tools for measuring and improving confidentiality. \addappheadtotoc

Appendix A Detailed Entropy Derivations

A.1 Geometric Distribution Entropy

The geometric distribution with parameter p has probability mass function:

pN(n)=p(1p)n1for n=1,2,3, (A.1)

The entropy is:

(N)=n=1p(1p)n1log2[p(1p)n1]=n=1p(1p)n1[log2p+(n1)log2(1p)]=log2pn=1p(1p)n1log2(1p)n=1p(n1)(1p)n1=log2plog2(1p)𝔼[N1]=log2plog2(1p)(1p1)=log2p1pplog2(1p)=(1p)log2(1p)plog2pp. (A.2)

For p=1/μ where μ is the mean, we have 𝔼[N]=1/p=μ.

A.2 Exponential Distribution Differential Entropy

The exponential distribution with rate λ has probability density function:

fT(t)=λeλtfor t>0. (A.3)

The differential entropy is:

(T)=0λeλtln[λeλt]𝑑t=0λeλt(lnλλt)𝑑t=lnλ0λeλt𝑑t+λ20teλt𝑑t=lnλ+λ21λ2=1lnλ=1+ln1λ. (A.4)

Note that we use natural logarithm for differential entropy of continuous distributions, while discrete entropy uses logarithm base 2.

A.3 Joint Entropy Decomposition

For random variables X1,,Xn, the joint entropy can be decomposed using the chain rule:

(X1,,Xn)=i=1n(XiX1,,Xi1). (A.5)

When the random variables are independent:

(X1,,Xn)=i=1n(Xi). (A.6)

For independent and identically distributed random variables:

(X1,,Xn)=n(X). (A.7)

Appendix B Compression-Based Entropy Estimation

B.1 Theoretical Foundation

Shannon’s source coding theorem establishes that the expected length of an optimal prefix-free code for a random variable X satisfies:

(X)𝔼[(X)]<(X)+1 (B.1)

where (x) is the code length for outcome x.

For sequences of length n:

(Xn)n𝔼[(Xn)]n<(Xn)n+1n (B.2)

As n, the per-symbol code length converges to the entropy rate.

B.2 Practical Estimators

Given a sample x1,,xn, we estimate entropy using a compression algorithm compress:

^n=BL(compress(x1,,xn)) (B.3)

Common compression algorithms and their characteristics:

  • gzip: Fast, good for general text, achieves reasonable compression

  • bzip2: Slower, better compression for repetitive data using Burrows-Wheeler transform

  • LZMA/xz: Very good compression, slower, uses dictionary-based methods

  • zstd: Fast modern algorithm with tunable compression levels

The estimator ^n is positively biased:

𝔼[^n](Xn) (B.4)

with the bias decreasing as the compressor approaches optimality and sample size increases.

B.3 Bias Correction

For finite samples, we can apply bias correction. If the true entropy rate is h and sample size is n, the bias is approximately:

biasκlognn (B.5)

for some constant κ depending on the source and compressor.

A bootstrap-based bias correction:

  1. 1.

    Compute ^n on the original sample

  2. 2.

    Generate B bootstrap samples of size m<n

  3. 3.

    Compute ^m(b) for each bootstrap sample

  4. 4.

    Estimate bias as ^mmn^n

  5. 5.

    Correct: ^ncorrected=^nbias

Appendix C Maximum Entropy Optimization

C.1 Constrained Optimization Formulation

To find the maximum entropy distribution subject to constraints, we solve:

maximize (X)=xp(x)logp(x) (C.1)
subject to xp(x)=1 (C.2)
𝔼[gi(X)]=αifor i=1,,k (C.3)
p(x)0for all x (C.4)

where gi are constraint functions and αi are specified values.

C.2 Lagrangian Formulation

The Lagrangian is:

=xp(x)logp(x)+λ0(xp(x)1)+i=1kλi(xgi(x)p(x)αi) (C.5)

Taking the derivative with respect to p(x) and setting to zero:

logp(x)1+λ0+i=1kλigi(x)=0 (C.6)

Solving for p(x):

p(x)=exp(λ01+i=1kλigi(x))=Z1exp(i=1kλigi(x)) (C.7)

where Z=xexp(i=1kλigi(x)) is the partition function ensuring normalization.

C.3 Special Cases

No constraints beyond support: The uniform distribution maximizes entropy.

Mean constraint: For a constraint on the mean 𝔼[X]=μ with support on non-negative integers, the geometric distribution emerges.

Mean and variance constraints: For continuous distributions on with mean μ and variance σ2, the normal distribution maximizes entropy.

Mean constraint on >0: The exponential distribution maximizes entropy.

Appendix D Statistical Hypothesis Testing

D.1 Testing Uniformity

To test whether observed trapdoor frequencies follow a uniform distribution:

Null hypothesis: H0:p(yi)=1/m for all i

Alternative: H1:p(yi)1/m for some i

Use the chi-squared goodness-of-fit test:

χ2=i=1m(OiEi)2Ei (D.1)

where Oi is the observed count and Ei=n/m is the expected count under uniformity.

Under H0, χ2χm12 approximately for large n.

D.2 Comparing Entropy Estimates

To test whether two systems have equal entropy:

Given estimates ^1 and ^2 from systems 1 and 2 with sample sizes n1 and n2:

Under asymptotic normality:

Z=^1^2Var[^1]/n1+Var[^2]/n2𝒩(0,1) (D.2)

approximately for large samples.

Variance can be estimated using bootstrap resampling or theoretical formulas specific to the entropy estimator.

Appendix E Notation Reference

E.1 Random Variables and Distributions

  • X: Random variable (capital letters)

  • x: Realization of random variable (lowercase letters)

  • pX(x): Probability mass function

  • fX(x): Probability density function

  • (X): Shannon entropy

  • (X;Y): Mutual information

  • 𝔼[X]: Expected value

E.2 Encrypted Search Components

  • 𝐱: Plaintext query (bag-of-words)

  • 𝐱ˇ: Hidden query (encrypted)

  • 𝐝: Document

  • 𝐝ˇ: Result set

  • λ: Query arrival rate

  • μ: Mean trapdoors per query

  • k: Number of search agents

  • m: Vocabulary size

  • N: Number of documents

E.3 Entropy and Information Measures

  • (X): Entropy of X

  • (X): Maximum possible entropy

  • e: Efficiency ratio /

  • (X;Y): Mutual information

  • BL(x): Bit length of x

References

  • [1] D. Cash, P. Grubbs, J. Perry, and T. Ristenpart (2015) Leakage-abuse attacks against searchable encryption. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 668–679. Cited by: §1.1, §2.2.
  • [2] D. Cash, S. Jarecki, C. Jutla, H. Krawczyk, M. Roşu, and M. Steiner (2013) Highly-scalable searchable symmetric encryption with support for boolean queries. In Advances in Cryptology–CRYPTO 2013, pp. 353–373. Cited by: §2.1.
  • [3] D. L. Chaum (1981) Untraceable electronic mail, return addresses, and digital pseudonyms. In Communications of the ACM, Vol. 24, pp. 84–90. Cited by: §2.5, §8.8.
  • [4] R. Curtmola, J. Garay, S. Kamara, and R. Ostrovsky (2006) Searchable symmetric encryption: improved definitions and efficient constructions. In Proceedings of the 13th ACM Conference on Computer and Communications Security, pp. 79–88. Cited by: §2.1.
  • [5] R. Dingledine, N. Mathewson, and P. Syverson (2004) Tor: the second-generation onion router. In Proceedings of the 13th USENIX Security Symposium, pp. 303–320. Cited by: §2.5, §3.
  • [6] E. Goh et al. (2003) Secure indexes. In Cryptology ePrint Archive, Cited by: §2.1, §3.
  • [7] O. Goldreich and R. Ostrovsky (1996) Software protection and simulation on oblivious rams. In Journal of the ACM, Vol. 43, pp. 431–473. Cited by: §2.4.
  • [8] P. Grubbs, M. Lacharité, B. Lloyd, and K. G. Paterson (2018) Pump up the volume: practical database reconstruction from volume leakage on range queries. Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 315–331. Cited by: §1.1, §2.2.
  • [9] M. S. Islam, M. Kuzu, and M. Kantarcioglu (2012) Access pattern disclosure on searchable encryption: ramification, attack and mitigation. In Network and Distributed System Security Symposium, Cited by: §1.1, §2.2.
  • [10] E. T. Jaynes (1957) Information theory and statistical mechanics. Physical Review 106 (4), pp. 620–630. Cited by: §2.3, §7.
  • [11] E. T. Jaynes (1982) On the rationale of maximum-entropy methods. Proceedings of the IEEE 70 (9), pp. 939–952. Cited by: §2.3.
  • [12] S. Kamara, C. Papamanthou, and T. Roeder (2012) Dynamic searchable symmetric encryption. In Proceedings of the 2012 ACM Conference on Computer and Communications Security, pp. 965–976. Cited by: §2.1.
  • [13] C. D. Manning, P. Raghavan, and H. Schütze (2008) Introduction to information retrieval. Cambridge University Press, New York, NY, USA. External Links: ISBN 0521865719, 9780521865715 Cited by: §3, §8.4.
  • [14] D. Pouliot and C. V. Wright (2016) The shadow nemesis: inference attacks on efficiently deployable, efficiently searchable encryption. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 1341–1352. Cited by: §2.2.
  • [15] C. E. Shannon (1948) A mathematical theory of communication. Bell System Technical Journal 27 (3), pp. 379–423. Cited by: §1.2, §2.3.
  • [16] C. E. Shannon (1949) Communication theory of secrecy systems. Bell System Technical Journal 28 (4), pp. 656–715. Cited by: §1.1, §2.3.
  • [17] D. X. Song, D. Wagner, and A. Perrig (2000) Practical techniques for searches on encrypted data. In Proceedings of the 2000 IEEE Symposium on Security and Privacy, pp. 44–55. Cited by: §2.1.
  • [18] E. Stefanov, M. Van Dijk, E. Shi, C. Fletcher, L. Ren, X. Yu, and S. Devadas (2013) Path oram: an extremely simple oblivious ram protocol. In Proceedings of the 2013 ACM SIGSAC Conference on Computer and Communications Security, pp. 299–310. Cited by: §2.4.
  • [19] A. Towell (2017) The perfect hash filter. Note: Technical Report Cited by: §3, §3.
  • [20] A. Towell (2017) The perfect map filter. Note: Technical Report Cited by: Definition 3.8, §3.
  • [21] A. Towell (2017) The singular hash map. Note: Technical Report Cited by: Definition 3.8, §3.
  • [22] A. Towell (2017) The singular hash set. Note: Technical Report Cited by: §3, §3.