
Rank-Ordered Encrypted Search

Consider a map from a set of keys to a set of values, where the keys are search keys (or relations between search keys) and the values are information about those keys that can be used to compute rank-ordering, things like frequency or proximity data.

Encrypted search lets an untrusted system search documents on behalf of an authorized user without learning the query, the document contents, or the results. The query, documents, and results are all oblivious values, meaning they reveal nothing about the underlying plaintext. An oblivious Boolean value might be true or false (or, if noisy, neither with certainty).

In practice, these guarantees are never perfectly met. The search results might reveal the number of search keys in the query, for instance (though this can be mitigated too). The goal is to minimize what the untrusted system learns.

One of the earlier papers on encrypted search observed that people want to outsource storage to cloud providers but do not trust them with confidentiality (Son00). To let an untrusted system search a confidential document, it needs a searchable representation we call a secure index.

In rank-ordered encrypted search, a confidential document has a degree-of-relevance to a hidden query. This degree can be based on many criteria: multiplicity of search keys, proximity between them, and so on. To support rank-ordered search, we need a data structure that stores arbitrary information about search keys in a document.

In what follows, I consider secure indexes represented by an oblivious entropy map, which permits nearly any kind of search: Boolean, rank-ordered, set-theoretic, proximity, and more. As a special case, I consider secure indexes represented by an oblivious set, which permits only set-theoretic Boolean search.

Oblivious entropy map: theory and implementation

The basic idea behind an entropy map is to map values in the domain to values in the codomain by hashing to a prefix-free code. We store nothing related to the domain since we just hash it, and a prefix of that hash serves as a code for a codomain value.

We allow many different codes for each codomain value. For example, the codes for value $a$ might be 00, 01, 10, and 11. We can efficiently decode a hash as $a$ if its 2-bit prefix, read as an integer, is less than 4.
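
As a toy illustration of decoding (a hypothetical sketch; `decode` and the two-value code assignment are invented for this example):

```python
def decode(prefix: int, ranges: dict) -> str:
    """Return the codomain value whose code range contains the given
    hash prefix, read as an integer."""
    for value, (lo, hi) in ranges.items():
        if lo <= prefix < hi:
            return value
    raise KeyError("prefix not covered by any code range")

# Value 'a' owns three of the four 2-bit codes (00, 01, 10), i.e. the
# integer range [0, 3); 'b' owns the remaining code 11, range [3, 4).
ranges = {"a": (0, 3), "b": (3, 4)}
assert decode(0b01, ranges) == "a"
assert decode(0b11, ranges) == "b"
```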

Suppose $\Pr\lbrace f(X) = y\rbrace = p_y$, $X \sim p_X$. Then the optimally space-efficient code, assuming a uniform hash function $h$, assigns prefix-free codes for $y$ whose probability of being mapped to by $h$ sums to $p_y$. The expected bit length is

$$ \ell = -\sum_{y \in \mathcal{Y}} p_y \log_2 p_y, $$

which, if we sample $x$ from $\mathcal{X}$ with $p_X$, map to $y = f(x)$, and observe the sequence of $y$’s, gives the entropy of that sequence. Hence the name entropy map.

If $\mathcal{X}$ is finite, we can implicitly encode the domain and store the prefix-free code for each element, giving an average bit length of $\ell$ and a total bit length of $|\mathcal{X}|\,\ell$.

We can also allow rate distortion by failing to code some elements properly. A common choice: if one value $y'$ is extremely common (say $p_{y'} > .99$), give it a prefix-free code summing to $p_{y'}$ and do not explicitly code for it in the entropy map. Then for a randomly selected $x \in \mathcal{X}$, the map returns $y'$ with probability $p_{y'}$ (large enough, and tunable if we want to trade space for accuracy). The remaining domain values get correct codes (or also allow errors, but only after trying to find correct codes first).

For instance, suppose we have a set-indicator function $1_{\mathcal{A}} : \mathcal{X} \to \lbrace 0,1\rbrace$, where $\mathcal{A} \subseteq \mathcal{X}$ and $\mathcal{X}$ is very large. We give $0$ a set of prefix codes summing to $p_0 = 1-\varepsilon > .99$, then only find codes for $1$ in $\mathcal{A}$, with probability $p_1 = \varepsilon \leq .01$. For any $x \in \mathcal{X}$, we test if $1_{\mathcal{A}}(x) = 1$ by checking whether $h(x)$ is a code for $0$ or $1$. If it is a code for $0$, then $x$ is definitely not in $\mathcal{A}$. If it is a code for $1$, then $x$ is in $\mathcal{A}$ with a false positive rate of $\varepsilon$ and a true positive rate of $1$, since any element in $\mathcal{A}$ maps to $1$ by explicit coding. In this formulation, the entropy map is a rate-distorted approximation of a set-indicator function, which I call a Bernoulli set-indicator function (or more generally a Bernoulli map over $\mathcal{X} \to \lbrace 0,1 \rbrace$).
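
One concrete way to get this behavior is a fingerprint set: store a short hash of each member of $\mathcal{A}$ and test membership by fingerprint. This is a sketch of the rate-distorted behavior described above (true positive rate $1$, small one-sided false positive rate), not the prefix-free coding construction itself; the names are hypothetical.

```python
import hashlib

def fingerprint(x: str, bits: int) -> int:
    """Uniform hash of x, truncated to `bits` bits."""
    digest = hashlib.sha256(x.encode()).digest()
    return int.from_bytes(digest[:8], "big") >> (64 - bits)

class BernoulliSet:
    """Rate-distorted set-indicator: members always test positive;
    a non-member tests positive with probability roughly
    len(members) * 2**-bits."""
    def __init__(self, members, bits: int = 20):
        self.bits = bits
        self.codes = {fingerprint(x, bits) for x in members}

    def __contains__(self, x: str) -> bool:
        return fingerprint(x, self.bits) in self.codes

s = BernoulliSet({"alpha", "beta"})
assert "alpha" in s and "beta" in s  # true positive rate is 1
```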

The Bloom filter is, implicitly, an entropy map in which a membership test on a non-member returns a false positive with probability $\varepsilon$. It is a special case of the entropy map over set-indicator functions.

The oblivious entropy map is an entropy map where the hash function is applied to trapdoors of $\mathcal{X}$ and the prefix-free codes for $\mathcal{Y}$ have no discernible pattern.

In Boolean encrypted search, a confidential document is relevant to a hidden query if the secure index contains every trapdoor search key in the query. The secure index is modeled as an oblivious set.

Sets enable queries about the presence of a trapdoor. Let

contains : (secure_index, trapdoor) -> bool

be a set-membership test (set-indicator function). contains(S,x) tests if trapdoor x (which may be a trapdoor of a unigram, bigram, stemmed phrase, etc.) is a member of secure index S.

This supports Boolean search and set-theoretic queries (intersection, union, difference). But it is not sufficient for rank-ordered search, which needs more information about the relation between query trapdoors and documents.
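
A minimal sketch of Boolean search over an oblivious set, assuming trapdoors are realized as keyed hashes (HMAC here; the construction and names are illustrative, not a prescribed scheme):

```python
import hmac, hashlib

def trapdoor(secret: bytes, term: str) -> bytes:
    """Keyed hash of a search term; without the secret, the trapdoor
    reveals nothing directly usable about the term."""
    return hmac.new(secret, term.encode(), hashlib.sha256).digest()

def build_secure_index(secret: bytes, terms) -> set:
    """Oblivious set: store only trapdoors, never plaintext terms."""
    return {trapdoor(secret, t) for t in terms}

def contains(secure_index: set, td: bytes) -> bool:
    """Set-membership test on a trapdoor."""
    return td in secure_index

def boolean_search(secure_index: set, query_trapdoors) -> bool:
    """A document is relevant iff every query trapdoor is present."""
    return all(contains(secure_index, td) for td in query_trapdoors)

key = b"shared-secret"
index = build_secure_index(key, ["computer", "science", "theory"])
assert boolean_search(index, [trapdoor(key, "computer"),
                              trapdoor(key, "science")])
assert not boolean_search(index, [trapdoor(key, "biology")])
```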

Oblivious multi-sets

Multi-sets enable queries about the multiplicity of a trapdoor, which in turn enables rank-ordered searches that consider word frequency, like BM25.

We need a data structure storing the multiplicity of search keys in a document. An oblivious multi-set does this, with keys as search keys and values as multiplicities. The expected bit length is

$$ \log_2 \frac{f_{\rm{max}} - f_{\rm{min}} + 1}{\varepsilon} = -\log_2 \varepsilon + \log_2 \left(f_{\rm{max}} - f_{\rm{min}} + 1\right) $$

bits per trapdoor, assuming word frequency is uniformly distributed between $f_{\rm{min}}$ and $f_{\rm{max}}$.
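
A quick arithmetic check of this bound (the parameter values are made up):

```python
import math

def bits_per_trapdoor(eps: float, f_min: int, f_max: int) -> float:
    """log2((f_max - f_min + 1) / eps): bits to answer membership with
    false positive rate eps plus a uniform frequency in [f_min, f_max]."""
    return math.log2((f_max - f_min + 1) / eps)

# frequencies 1..16 and eps = 2**-10: 4 bits of frequency plus
# 10 bits of membership resolution = 14 bits per trapdoor
assert bits_per_trapdoor(2**-10, 1, 16) == 14.0
```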

Let

contains : (secure_index, unigram) -> bool

be a set-membership test. contains(S,x) tests if unigram $x$ is a member of secure index $s$. This is the bare minimum and only supports Boolean search.

The interface also provides a multiplicity function, $\text{multiplicity} : (s, x) \to \mathbb{Z}$, which counts the multiplicity of unigram $x$ in secure index $s$.

Secure postings list

A secure postings list uses a variable-length perfect map filter. Keys are unigrams and map to array elements. The first bits encode the initial position, and the gaps between successive positions are encoded in the remaining bits. If the data cannot fit, the remaining bits serve as a pointer into a larger bit string where the first sequence (using a universal coder) indicates the starting position.

The interface elements for a secure postings list include:

  • contains(s, x) -> bool: Test unigram $x$ for membership in secure index $s$. Bare minimum, Boolean search only.
  • multiplicity(s, x) -> int: Count the multiplicity of unigram $x$ in secure index $s$.
  • locations(s, x): Retrieve the locations of unigram $x$ in secure index $s$.
  • containsSequence(s, x_1 x_2 \cdots x_k): Test ordered $k$-gram $x_1 x_2 \cdots x_k$ for membership in secure index $s$. If a biword model is used, contains can implement containsSequence.
  • locationsSequence(s, x_1 x_2 \cdots x_k): Retrieve the locations of $k$-gram $x_1 x_2 \cdots x_k$ in secure index $s$.

These can implement other interfaces like rank-ordered search (BM25, MinDistX) or set-theoretic queries. If locations is supported, minimum distance and multiplicity can be computed from the location data.
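
If locations is supported, the derived quantities can be computed directly from the position lists (a sketch with hypothetical position data):

```python
def multiplicity_from_locations(locs) -> int:
    """Term frequency is just the number of stored positions."""
    return len(locs)

def min_distance(locs_x, locs_y) -> int:
    """Minimum absolute distance between any occurrence of x and of y."""
    return min(abs(i - j) for i in locs_x for j in locs_y)

# positions of unigrams in a document (0-indexed, hypothetical)
locations = {"A": [0, 4], "B": [1], "C": [6]}
assert multiplicity_from_locations(locations["A"]) == 2
assert min_distance(locations["A"], locations["C"]) == 2
```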

Secure rank

The secure index does not disclose which plaintext words are in a document, nor any information attached to them, until a hidden query is submitted. At that point, while the plaintext word is not necessarily disclosed, the information attached to it is.

That information may be sufficient for an adversary to estimate (with better-than-random probability) which plaintext word corresponds to a trapdoor. Depending on the information, the adversary may also estimate other characteristics of the document. With a secure postings list and minimal error, the adversary could potentially estimate much of the plaintext of an entire document.

One way to combat this: precompute the results of queries (at least plausible ones) for a given document. Then the untrusted system does not need any information about individual words. It only needs to know how a document relates to particular queries.

Suppose the query model is a bag-of-words and the maximum number of words in a query is $p$. Since a document may only contain some of the query words, we must code all $2^p-1$ subsets (excluding the empty set).

Given a bag-of-words query, we canonically order the bag and map it to a precomputed ranking. This has a high up-front and storage cost.
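
A sketch of the canonical-key construction (names invented for illustration): enumerate every subset of the document's unique terms up to size $p$, store its precomputed ranking under the canonically ordered key, and canonicalize each incoming query the same way.

```python
from itertools import combinations

def canonical_keys(doc_terms, p):
    """Every canonically ordered subset of the document's unique terms
    of size 1..p; each becomes a key in the precomputed rank map."""
    terms = sorted(set(doc_terms))
    keys = []
    for j in range(1, p + 1):
        keys.extend(combinations(terms, j))
    return keys

def canonicalize(query):
    """Order the bag-of-words query so it matches a stored key."""
    return tuple(sorted(set(query)))

keys = canonical_keys(["b", "a", "c"], p=2)
assert len(keys) == 6                        # C(3,1) + C(3,2)
assert canonicalize(["c", "a"]) in set(keys)
```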

If the document has $k$ unique words, there are

$$ \sum_{j=1}^{p} \binom{k}{j} $$

canonical keys. If $p$ is unbounded, there are

$$ 2^k - 1 $$

canonical keys, which is impractical for most documents with even a modest number of unique words. If $k=30$, that is roughly a billion keys.

In either case, the expected lower-bound on bit length for the oblivious map is

$$ \sum_{j=1}^{p} \binom{k}{j}\left(\mu - \log_2 \varepsilon\right)\,, $$

where $\mu$ is the average bit length of the ranking coder representing some rank-ordered metric (BM25, MinDist, proximity, or multiple metrics).

Suppose the maximum query is $p=3$. The map has on the order of

$$ \mathcal{O}(k^3) $$

keys and the oblivious map has a lower-bound of

$$ \mathcal{O}(k^3)(\mu - \log_2 \varepsilon) \text{ bits} $$

or

$$ \mathcal{O}(k^2)(\mu - \log_2 \varepsilon) \text{ bits per search key} $$

If $k \approx 100$ (reasonable for a large document after removing common words and stemming), the lower-bound is around

$$ 10^4 (\mu - \log_2 \varepsilon) \text{ bits per search key} $$

Rank-ordered information retrieval maps queries to degree-of-relevancy for each document in a collection.

In information retrieval, measuring relevancy effectively is the central problem. Many algorithms and heuristics exist for rank-ordering documents by estimated relevance to a query. Here I look at term weighting and term proximity weighting heuristics in the context of secure indexes.

Precision and recall

Precision and recall apply to Boolean searches, which do not rank retrieved documents the way BM25 or MinDistX do; a document is either relevant (contains all query terms) or not.

Precision measures the proportion of retrieved documents that are relevant:

$$ \text{Precision} = \frac{\text{Number of relevant documents retrieved}}{\text{Total number of documents retrieved}} = \frac{|R \cap A|}{|A|} $$

where $R$ is the set of relevant documents and $A$ is the set of retrieved documents. Range is $[0, 1]$.

Recall measures the proportion of relevant documents that were retrieved:

$$ \text{Recall} = \frac{\text{Number of relevant documents retrieved}}{\text{Total number of relevant documents}} = \frac{|R \cap A|}{|R|} $$

Also in $[0, 1]$. It is trivial to achieve recall of $1$ by retrieving every document, but that kills precision. There is always a trade-off.
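
Both measures in code (the document IDs are a toy example):

```python
def precision(relevant: set, retrieved: set) -> float:
    """Fraction of retrieved documents that are relevant."""
    return len(relevant & retrieved) / len(retrieved)

def recall(relevant: set, retrieved: set) -> float:
    """Fraction of relevant documents that were retrieved."""
    return len(relevant & retrieved) / len(relevant)

R = {1, 2, 3, 4}   # relevant documents
A = {3, 4, 5}      # retrieved documents
assert precision(R, A) == 2 / 3
assert recall(R, A) == 1 / 2

# retrieving everything gives recall 1 but poor precision
everything = {1, 2, 3, 4, 5, 6, 7, 8}
assert recall(R, everything) == 1.0
assert precision(R, everything) == 0.5
```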

Mean average precision

Mean average precision (MAP) is a standard way to measure IR system performance with respect to relevancy, however defined. I use MAP to measure how closely the rank-ordered output of secure indexes (using BM25 or MinDistX) matches the output of non-secure, canonical indexes with perfect location and frequency information.

The more approximate a secure index, the less information an adversary can infer, but the less useful the search. The trade-off between approximation and acceptable MAP is the key question.

MAP is the mean of average precisions over a set of queries. Precision at $k$:

$$ \text{precision}(q, k) = \frac{|\lbrace k \text{ most relevant docs to query } q\rbrace \cap \lbrace \text{top } k \text{ retrieved docs for query } q\rbrace|}{k} $$

This measures how many of the top $k$ retrieved documents are actually in the top $k$ most relevant.

Average precision for the top $n$ documents:

$$ \text{avgprecision}(q, n) = \frac{1}{n} \sum_{k=1}^{n} \text{precision}(q, k) $$

This averages precision across all cutoff points from 1 to $n$, giving higher weight to systems that return relevant documents earlier.

MAP for the top $n$ documents over query set $Q$:

$$ \text{MAP}(Q, n) = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \text{avgprecision}(q_i, n) $$

To estimate a document’s MAP score, we need a document set $D$ and query set $Q$. We construct secure indexes $SI$ for $D$, rank-order both $SI$ and $D$ according to each $q \in Q$, and calculate MAP using the $D$ rankings as ground truth.

Example

Suppose the true rank-ordered list of relevant documents to a query is $(3, 0, 1, 2, 4)$, and the retrieved list (from a secure index) is $(2, 4, 3, 0, 1)$. Precision at $k=1$:

$$ \frac{|\lbrace 3 \rbrace \cap \lbrace 2 \rbrace|}{1}=\frac{|\emptyset|}{1}=0\,, $$

precision at $k=2$:

$$ \frac{|\lbrace 3,0 \rbrace \cap \lbrace 2,4 \rbrace|}{2}=\frac{|\emptyset|}{2}=0\,. $$

precision at $ k=3 $:

$$ \frac{|\lbrace 3,0,1 \rbrace \cap \lbrace 2,4,3 \rbrace|}{3}=\frac{|\lbrace 3 \rbrace|}{3}=\frac{1}{3}\,, $$

precision at $k=4$:

$$ \frac{|\lbrace 3,0,1,2 \rbrace \cap \lbrace 2,4,3,0 \rbrace|}{4}=\frac{|\lbrace 3,0,2 \rbrace|}{4}=\frac{3}{4}\,, $$

and precision at $k=5$:

$$ \frac{|\lbrace 3,0,1,2,4 \rbrace \cap \lbrace 2,4,3,0,1 \rbrace|}{5}=\frac{|\lbrace 3,0,1,2,4 \rbrace|}{5}=\frac{5}{5}=1\,. $$

So average precision is

$$ \frac{0+0+\frac{1}{3}+\frac{3}{4}+1}{5}=\frac{5}{12}\,. $$

MAP is just the mean of these average precisions over the query set.
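
The worked example above can be checked mechanically with a straightforward transcription of the precision-at-$k$ and average-precision formulas:

```python
def precision_at_k(relevant, retrieved, k):
    """Overlap of the top-k relevant and top-k retrieved documents."""
    return len(set(relevant[:k]) & set(retrieved[:k])) / k

def avg_precision(relevant, retrieved, n):
    """Mean of precision at k over cutoffs k = 1..n."""
    return sum(precision_at_k(relevant, retrieved, k)
               for k in range(1, n + 1)) / n

relevant = [3, 0, 1, 2, 4]   # true ranking
retrieved = [2, 4, 3, 0, 1]  # ranking from the secure index
assert precision_at_k(relevant, retrieved, 3) == 1 / 3
assert abs(avg_precision(relevant, retrieved, 5) - 5 / 12) < 1e-12
```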

Note that precision at the last $k$ is necessarily $1$ if the relevant set and retrieved set contain the same elements by that iteration. But in general this is not the case. If the relevant ranked list is $(A, B)$ and the retrieved list is $(D, C, B, A)$, then the average precision from $k=1$ to $k=2$ is $0$.

In our simulations, we do a variation. Suppose the relevant ranked list is $(A=0.9, B=0.85, C=0, D=0)$ and the retrieved list is $(A=0.9, C=0.85, D=0.5, B=0)$. I calculate average precision for top $k=3$ rather than $k=4$ or $k=2$. Document $B$ is not included in any of the precision-at-$k$ calculations.

In one experiment, we ran a “page one” MAP test using only the top 10 results. The randomized algorithm does much worse here: with over 85% probability, the mean average precision is less than 0.05.

Term importance: BM25

BM25, a well-established tf-idf variant, is one of the relevancy measures in our experiments:

$$ \text{BM25}(d, t, D) = \text{idf}(t, D) \cdot \frac{\text{tf}(t, d) \cdot (k_1 + 1)}{\text{tf}(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)} $$

where $\text{avgdl}$ is the average document size (in words) in $D$.

Parameters $b$ and $k_1$ are free. In our experiments, $b = 0.75$ and $k_1 = 1.2$, which are typical values. Ideally these would be automatically tuned per secure index. $\text{tf}$ is term frequency (count of term $t$ in document $d$). $\text{idf}$ is inverse document frequency:

$$ \text{idf}(t, D) = \ln\left(\frac{|D| - \text{count}(t, D) + 0.5}{\text{count}(t, D) + 0.5} + 1\right) $$

where $|D|$ is the number of documents and $\text{count}(t, D)$ counts documents containing term $t$.

To rank documents, each document $d$ gets a score $\text{BM25Score}(d, Q) = \sum_{t \in Q} \text{BM25}(d, t, D)$, then results are sorted in descending order.
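
A direct transcription of these formulas, with documents as token lists (the toy collection is made up):

```python
import math

def idf(term, docs):
    """Inverse document frequency with the +1 smoothing used above."""
    n = sum(1 for d in docs if term in d)  # documents containing term
    return math.log((len(docs) - n + 0.5) / (n + 0.5) + 1)

def bm25(doc, term, docs, k1=1.2, b=0.75):
    """BM25 weight of one term in one document."""
    tf = doc.count(term)
    avgdl = sum(len(d) for d in docs) / len(docs)
    return idf(term, docs) * (tf * (k1 + 1)) / (
        tf + k1 * (1 - b + b * len(doc) / avgdl))

def bm25_score(doc, query, docs):
    """Sum of per-term BM25 weights over the query."""
    return sum(bm25(doc, t, docs) for t in query)

docs = [["computer", "science"], ["computer", "games"], ["biology"]]
ranked = sorted(docs, key=lambda d: bm25_score(d, ["science"], docs),
                reverse=True)
assert ranked[0] == ["computer", "science"]
```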

Note that most real-world IR systems do not use BM25Score directly. Instead, they build a vector $v(d)$ for each document where each dimension corresponds to a word $t$ weighted by $\text{BM25}(d, t, D)$, then rank by cosine similarity. This vector space approach is problematic for secure indexes because of the limited information available (by design, for confidentiality).

Term proximity: MinDistX

MinDist ranks documents by proximity relevance. It is less established than BM25, but performed well in experiments compared to other proximity heuristics. I added additional tunable parameters and call the variant MinDistX.

MinDistX sums the minimum distance between each existent pair of query terms in a document.

Example: Consider a document

$$\vec{d} = (A, B, D, D, A, D, C)$$

and a query

$$\vec{q} = \lbrace A, B, C \rbrace$$

where $A$, $B$, $C$, and $D$ are searchable terms. The minimum pairwise distances are

$$d(A, B) = 1, \quad d(A, C) = 2, \quad d(B, C) = 5$$

and their sum is $s = 1 + 2 + 5 = 8$.
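
This example can be reproduced directly (a sketch; in the secure setting these positions would come from the approximate locations interface):

```python
from itertools import combinations

def min_pairwise_distance_sum(doc, query):
    """Sum of minimum distances between each pair of query terms that
    both occur in the document."""
    positions = {}
    for i, term in enumerate(doc):
        positions.setdefault(term, []).append(i)
    present = [t for t in query if t in positions]
    return sum(min(abs(i - j)
                   for i in positions[x] for j in positions[y])
               for x, y in combinations(present, 2))

doc = ["A", "B", "D", "D", "A", "D", "C"]
assert min_pairwise_distance_sum(doc, ["A", "B", "C"]) == 8  # 1 + 2 + 5
```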

MinDistX scores based on this sum $s$, so the secure index must represent distances between terms. The most obvious way: store approximate location information.

The more concentrated the query terms, the more relevant the document. But only up to a point.

Example: Consider a query

$$\vec{q} = \lbrace \text{computer}, \text{science} \rbrace$$

and two documents $A$ and $B$. $A$ contains both computer and science on page 7. $B$ contains computer on page 7 and science on page 20. Obviously $A$ is more relevant since the keywords are closer.

But consider a third document $C$ with computer on page 7 and science on page 100. This is not much worse than $B$: neither document is that relevant, and $B$ is only marginally better.

MinDistX has the following form.

Definition: Let $\mathbb{Q}$ be the set of query terms, $\mathbb{Q}^{\ast}$ the subset that occurs in document $\vec{d}$, and $s$ the sum of minimum pairwise distances between terms in $\mathbb{Q}^{\ast}$.

$$ \text{MinDistX}(s, |\mathbb{Q}^{\ast}| \mid \alpha, \gamma, \beta, \theta) = \ln\left(\alpha + \gamma \cdot \exp\left\lbrace -\frac{\beta s}{|\mathbb{Q}^{\ast}|^{\theta}} \right\rbrace\right) $$

where $\alpha, \gamma, \beta, \theta, s > 0$.

To check that MinDistX has the right shape (strictly decreasing, flattening as $s$ grows), consider the limits. As $s \to 0$:

$$ \text{MinDistX}(s, |\mathbb{Q}^{\ast}| \mid \cdot) \to \ln(\alpha + \gamma) $$

As $s \to \infty$:

$$ \text{MinDistX}(s, |\mathbb{Q}^{\ast}| \mid \cdot) \to \ln\alpha $$

The partial derivative:

$$ \frac{\partial}{\partial s} \text{MinDistX}(s, |\mathbb{Q}^{\ast}| \mid \alpha, \gamma, \beta, \theta) = -|\mathbb{Q}^{\ast}|^{-\theta} \frac{\beta \gamma \exp(-p)}{\alpha + \gamma \exp(-p)} $$

where

$$ p = \frac{\beta s}{|\mathbb{Q}^{\ast}|^\theta} $$

This partial is negative for all valid inputs. As $s \to 0$ it approaches $-\frac{1}{|\mathbb{Q}^{\ast}|^{\theta}} \left(\frac{\beta \gamma}{\alpha + \gamma}\right)$, and as $s \to \infty$ it approaches $0$. This matches the intuition: small increases in $s$ hurt a lot when terms are close, but barely matter when they are already far apart.

For large $|\mathbb{Q}^{\ast}|$, MinDistX decreases less rapidly, which is the desired behavior. We do not want to penalize a document too harshly for spreading out more query terms across a larger region compared to a document that matches fewer terms but clusters them tightly.

MinDistX can add proximity sensitivity to existing scoring methods. For instance, a linear combination with BM25:

$$ \text{Score}(d, Q, D) = \lambda \cdot \text{MinDistX}(s, |\mathbb{Q}^{\ast}|) + (1 - \lambda) \cdot \text{BM25Score}(d, Q, D) $$
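
A sketch of MinDistX and the combined score (the default parameter values are placeholders, not tuned values from the experiments):

```python
import math

def min_dist_x(s, q, alpha=1.0, gamma=1.0, beta=0.5, theta=1.0):
    """MinDistX over the distance sum s and the number q of query
    terms present in the document."""
    return math.log(alpha + gamma * math.exp(-beta * s / q ** theta))

def combined_score(mindistx, bm25score, lam=0.5):
    """Linear combination of proximity and term-weighting scores."""
    return lam * mindistx + (1 - lam) * bm25score

# limiting behavior: s -> 0 gives ln(alpha + gamma); s -> inf gives ln(alpha)
assert abs(min_dist_x(0, 2) - math.log(2.0)) < 1e-12
assert abs(min_dist_x(1e9, 2)) < 1e-9
```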

Uncertainty vs mean average precision

In Encrypted Boolean Search (EBS), the only information disclosed to the untrusted system about a document collection is which objects are relevant to obfuscated search keys.

Encrypted Rank-Ordered Search (ERS) requires more information to compute a degree-of-relevance. The key insight: a significant amount of error is acceptable if the rank-ordering remains useful. The trade-off between information leakage and search utility is the fundamental design problem for secure indexes.

False positives from approximate sets and maps are one kind of error. We can also deliberately inject others:

  • Quantization noise: Storing only approximate term frequencies or positions
  • Random perturbation: Adding noise to stored values
  • Truncation: Limiting precision of stored information
  • Aggregation: Combining information in ways that obscure individual values
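
For instance, quantization noise might be injected by storing only coarse buckets of term frequency (a hypothetical sketch):

```python
def quantize(value: int, step: int) -> int:
    """Store only the bucket midpoint of the true value; the untrusted
    system never sees the exact count."""
    return (value // step) * step + step // 2

# exact frequencies 0..9 all collapse to the same stored value
assert len({quantize(f, 10) for f in range(10)}) == 1
assert quantize(23, 10) == 25
```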

The key metric is how these approximations affect MAP. Empirically, secure indexes tolerate substantial approximation while maintaining acceptable MAP scores, especially when the goal is ranking rather than exact retrieval.

Conclusion

Rank-ordered encrypted search extends Boolean encrypted search by enabling relevance scoring without revealing document contents. The core data structures, oblivious entropy maps and their specializations, trade off between storage efficiency, information leakage, and search quality.

The important points:

  1. Entropy maps provide the theoretical foundation for space-optimal approximate function representation
  2. Oblivious variants apply trapdoor functions to prevent content disclosure
  3. BM25 and MinDistX can be computed from secure indexes with acceptable accuracy
  4. MAP scores give a principled way to evaluate the quality-privacy trade-off

The central challenge is balancing how much information a secure index stores against the privacy guarantees it needs to provide. More information means better ranking but also more room for inference attacks. The frameworks here give us a way to reason about this quantitatively.
