A Practical Model for Collaborative Databases: Securely Mixing, Searching and Computing

Rachit Garg¹, Nishant Kumar², Shweta Agrawal¹, Manoj Prabhakaran³

¹IIT Madras ²Microsoft Research India ³IIT Bombay

Abstract. We introduce the notion of a Functionally Encrypted Datastore which collects data from multiple data-owners, stores them encrypted on an untrusted server, and allows untrusted clients to make select-and-compute queries on the collected data. Little coordination and no communication is required among the data-owners or the clients. Our security and performance profile is similar to that of conventional searchable encryption systems, while the functionality we offer is significantly richer.

The client specifies a query as a pair (Q, f) where Q is a filtering predicate which selects some subset of the dataset and f is a function on some computable values associated with the selected data. We provide efficient protocols for various functionalities of practical relevance. We demonstrate the utility, efficiency and scalability of our protocols via extensive experimentation. In particular, we use our protocols to model computations relevant to the Genome Wide Association Studies such as Minor Allele Frequency (MAF), Chi-square analysis and Hamming Distance.

Keywords: Searchable encryption · Computing on encrypted cloud storage · Genome wide association studies

1 Introduction

Given the importance of cloud computing today, enabling controlled computation on large encrypted cloud storage is of much practical value in various privacy sensitive situations. Over the last several years, several tools have emerged that offer a variety of approaches towards this problem, offering different trade-offs among security, efficiency and generality. While theoretical schemes based on modern cryptographic tools like secure multi-party computation (MPC) [72,26], fully homomorphic encryption (FHE) [25] or functional encryption (FE) [62,6] can provide strong security guarantees, their computational and communication requirements are incompatible with most of the realistic applications today. At the other end are efficient tools like CryptDB [58], Monomi [67], Seabed [53] and Arx [56], which add a lightweight encryption layer under the hood of conventional database queries, but, as we discuss later, offer limited security guarantees and do not support collaborative databases. While there also exist tools which seek to strike a different balance by trading off some efficiency for more robust security guarantees and better support for collaboration – like Searchable Encryption (starting with [65,21]), and Controlled Functional Encryption [50] – they offer limited functionality.

In this work, we propose a new solution – called Functionally Encrypted Datastore (FED) – that can be used to implement a secure cloud-based data store that collects data from multiple data-owners, stores it encrypted on an untrusted server, and allows untrusted clients to make select-and-compute queries on the collected data. Little coordination and no communication is required among the data-owners or the clients. Our security and performance profile is similar to that of conventional searchable encryption systems, while the functionality we offer is significantly richer.

Some Motivating Scenarios. There are many scenarios today when data is collected from many users in a centralized database and made available to other users for querying. In such situations, the individuals whose data has been collected should be considered the actual data-owners, who are providing their data with an expectation that it will be used only for certain well-specified purposes.

One such example is that of census data: governments use census data for administrative purposes, and also provide restricted access to select researchers for select purposes [68,74]. Another scenario involves private corporations who can collect large amounts of information about their customers, which could be legitimately useful in improving their customer service. A third example, and the one that we use in our experimental analysis, is offered by genomic studies. Genome Wide Association Studies (GWAS) look into entire genomes across different individuals to discover associations between genetic variants and particular diseases or traits [9,46,48].

However, in all such scenarios, individuals’ privacy is vulnerable and the absence of fine grained control can lead to significant problems. This was illustrated in an incident involving Ipsos Mori, a British marketing research firm, which offered to share data pertaining to 27 million mobile phone users with the police [40]. While they defended themselves by insisting that they would only release safe aggregated data which their users had acquiesced to [66], the deal was shelved under public pressure. Similarly, GWAS involve sharing highly sensitive information about the individuals with several researchers in different parts of the world, and the individuals may have little control over how their data will be used in future. Indeed, as evidenced by the case of the Havasupai tribe against the Arizona State University [32,35], the researchers from the university collected genetic data for studying links to diabetes and later used it for other sensitive subjects like schizophrenia, migration and inbreeding, without the consent of the individuals or the community who contributed the data.

These are a few of the many scenarios that motivate the need to implement efficient, fine-grained control over computation on large scale data collected from multiple data-owners.

Addressing the Challenge. Guided by the above scenarios, we set our goals. We require a datastore that provides strong security guarantees while offering flexible functionality (multiple data owners, and support for expressive computation queries) and practical efficiency. We briefly discuss what this entails.

Firstly, accommodating multiple data owners in a flexible and scalable manner calls for several features:

– No coordination among data-owners. The data-owners should not need to trust, or even be aware of each other.

– Untrusted clients. The clients should not need to hold any state or key material (except as may be needed to authenticate themselves to the servers, in a higher-level application). This allows opening up the datastore to a large set of clients without having to trust them.

– Data-owners oblivious of the clients. The data-owners should not need to be aware of the clients who will query the datastore in the future, and in particular, cannot provide them with any keys. Further, the data-owners need not be online when the queries arrive from the clients.

– Anonymously collaborative database. The association of the encrypted data items with their data-owners should be hidden from all the parties in the system.

A Trust Assumption. This set of requirements creates a tension with the basic premise of untrusted servers in a secure datastore: if the data-owners are oblivious of the clients and are not online when clients access the datastore, then the datastore must be fully operable by the servers; but this allows the servers to freely access the collected data (e.g., by emulating the role of clients to make as many accesses as they wish). Resolving this tension necessitates a trust assumption: we require that there are multiple non-colluding servers. This is often a reasonable assumption in practice, given the growing availability of cloud computing services from competing service providers. We seek a solution that uses multiple servers minimally – there should be only one server with large storage (who stores all the encrypted data), and a single auxiliary server (who may store some key material). Either server by itself can be corrupt, but they will be assumed not to collude with each other.¹ We remark that the model of non-colluding servers has been used quite successfully in the literature in the past, including multi-prover interactive proof systems [4], multi-server Private Information Retrieval [19], Oblivious RAM [47], CFE [50]² and in Searchable Encryption itself [52,57].

Our Solution. To support expressive queries beyond search, we seek efficient implementation of typical (relational) database operations on a single table.

Towards this we seek to support a select-and-compute functionality. We consider queries to be specified as a pair (Q, f) where Q is a filtering predicate which selects some subset of the dataset and f is a function on the selected data. A key feature we seek is that the computation overheads for a select-and-compute query should not scale with the entire database size, but only with the number of selected records. This would offer substantial efficiency gains in most practical applications: for instance, in a census-like database for a country, a selection query for a zip-code would select only a small fraction of all the records.

¹ We observe that in the setting of single data owner, the auxiliary server may be emulated by the data owner itself. However, this requires the data owner to stay online even during the query phase, something we wish to avoid.

² The storage server is implicit in their constructions: the data owner does storage and retrieval, and the data-owner and the key-server are assumed not to collude.

Finally, we seek a level of cryptographic security and efficiency that is typical of existing searchable encryption works (e.g., [51,43]). Here, the security guarantees are formulated in the form of an ideal functionality (following the Universally Composable security framework) and well-defined leakage functions specified as part of this functionality. These solutions are highly scalable and capable of handling large databases [39,51,11,21], with the query-time complexity scaling with the number of selected rows (as discussed above). However, we note that these works are restricted to a single data owner and only search queries (rather than search-and-compute). For FED, we shall set a similar security-efficiency combination as the goal, while requiring a much richer functionality, namely, multiple data owners and computation on selected data.

1.1 Our Results

As discussed above, we introduce the notion of a Functionally Encrypted Datastore (FED), which permits a data-owner to securely outsource its data to a storage server, such that, with the help of an auxiliary server, clients can efficiently carry out select-and-compute queries. We emphasize that our database is anonymously collaborative in the sense that it contains data belonging to multiple data owners but hides the association of the (encrypted) data items with their owners.

Apart from introducing the notion of FED our contributions include:

– A general framework for instantiating FED schemes. The framework is modular, and consists of two components, which may be of independent interest (when only the corresponding functionality is desired) – namely, Searchably Encrypted Datastore (SED) and Computably Encrypted Datastore (CED).

– We provide instantiations of this framework for evaluating arbitrary functions, as well as important special classes of functions, the latter more efficiently than the former. As part of this, we instantiate a multiple data-owner version of searchable encryption, which is of independent interest.

– We demonstrate the utility and practicality of our protocols based on extensive experimentation, involving realistic statistical analysis tasks for genomic studies.

Along the way we obtain several other results that are of independent interest.

– As a starting point for our constructions, we give constructions for “single data-owner” versions of FED, SED and CED (denoted by sFED, sSED and sCED, respectively), which are simpler and more efficient, albeit offering no support for multiple data-owners.


– We introduce a versatile collection of techniques under the umbrella of “onion secret-sharing” by leveraging “onion encryption”, which was originally used in anonymous routing and mix nets [61,18]. In our setting, we use these to distribute secret shares of multiple messages to multiple receivers, maintaining the link across the shares of the same message while delinking the messages from their origin. This yields an anonymously collaborative database as discussed above.

1.2 Overview of Constructions

We present several modular constructions of FED (and the single data-owner version sFED), which can be instantiated in multiple ways, by plugging in different implementations of its components. As discussed above, we identify two simpler primitives, Searchably Encrypted Datastore (SED) and Computably Encrypted Datastore (CED), and show how they can be securely dovetailed into a construction of FED. Following is a roadmap to the constructions in this work.

[Roadmap figure: the components of our constructions and where each is described.]

– FED: Section 5 (see Figure 4a)
– SED and CED: Section 5.4 and Section 5.2, respectively (see Figure 4b)
– map & merge-map (OPRF-based, SFE-based): Section 5.3
– Onion Secret-Sharing: Section 5.1
– sFED: Section 4.3 (see Figure 2)
– sCED (Value Retrieval, Summation, Summation (alt), General): Section 4.2
– sSED: Section 4.1

The starting points of our constructions are the single data-owner versions sSED and sCED, shown at the bottom of the above map. We show that these components can be implemented by leveraging constructions from the literature, namely, the Multi-Client Symmetric Searchable Encryption (MC-SSE) scheme due to Jarecki et al. [36] and Controlled Functional Encryption (CFE) due to Naveed et al. [50]. The search query family supported by our sSED constructions is the same as in [36]. For sCED, we support a few specialized functions, as well as a general function family. The primitives sSED and sCED are of independent interest, and they can also be combined to yield a single data-owner version of FED (called sFED).

To upgrade sSED and sCED constructions into full-fledged (multi data-owner) SED and CED schemes, we require several new ideas. One challenge in this setting is to be able to hide the association of the (encrypted) data items with their data-owners. Our approach is to first securely merge the data from the different data-owners in a way that removes this association, and then use the single data-owner constructions on the merged data set. For this, both SED and CED constructions rely on onion secret-sharing techniques (see Section 5.1 for an overview). In the case of CED, merging essentially consists of a multi-set union. But in the case of sSED, merging entails merging “search indices” on individual data sets into a search index for the combined data set. Since the data-owners do not trust (or are even aware of) each other, this operation must be implemented via the servers in an “oblivious” fashion, and onion secret-sharing techniques alone are not adequate. We propose two approaches to merge the indices – one in the random oracle model using an Oblivious Pseudorandom Function (OPRF) protocol, and another one with comparable efficiency in the standard model, relying on 2-party Secure Function Evaluation (SFE).

1.3 Related Work

Below, we contrast our notion of Functionally Encrypted Datastores with prior work in terms of various features. A detailed summary appears in Table 2 in Appendix B.

– Multiple data owners: We allow encrypted data to be collected from multiple data-owners. Prior works that allow this include multiparty computation [72,26], multi-input functional encryption [27], multi-key fully homomorphic encryption [45] and controlled functional encryption (CFE) [50], all of which suffer costs proportional to the entire dataset. A different line of work based on multi-key/multi-user Searchable Encryption [60,59,30,69] allows data to be shared from multiple data owners to clients for searching of single keywords, but unlike this work, does not hide the association of the data to its owner, nor allows data owners to be oblivious of the clients.

– Rich functionality with strong security: Symmetric searchable encryption (SSE) [21,12,11,54,51,23,7] and its extensions (including Structured Encryption [14,38], Queryable Encryption [16] etc.) support keyword searches and have performance sublinear in the size of the dataset, but do not allow computation on the search results. In contrast, tools such as CryptDB [58], Monomi [67], Seabed [53] and Arx [56] do allow full-fledged search-and-compute (searches are attribute based rather than keyword queries). But they incur higher leakage to the server and argue security in the “snapshot attacker model” which has been criticized for being unrealistically weak [28]. Moreover, their security analysis assumes fully trusted clients who do not collude with the server(s). In contrast, we offer a stronger security model where clients may be malicious, either server can collude with a subset of data owners and/or clients and the attacker is persistent, i.e. has access to the view of the corrupt parties throughout the lifetime of the system. Leakage to servers is limited to the leakage typical in SSE, and there is no leakage to data owners or clients.


– Computationally light clients: Our clients are very efficient and only perform work proportional to the size of their queries and outputs, typically independent of database size. In comparison, the CryptDB family of constructions do not execute all the computational operations fully at the server, but instead require the clients to download certain intermediate results, decrypt them and perform the remaining computation. Similarly, the CFE scheme of [50] requires clients to evaluate a garbled circuit on the entire dataset. While multi client SSE has more lightweight clients than the above, the clients still do work proportional to the size of the filtered data.

– Deployability: We allow clients to join the system dynamically – indeed, clients and data owners can be oblivious of each others’ identity (or even number) in contrast with prior constructions. However, we do rely on the availability of two non-colluding servers, which, as mentioned earlier, is a necessary assumption.

2 Preliminaries

In this section, we define the preliminaries that we will use in our work. A detailed index of notation appears in Table 1 in Appendix A.

Data Records. Throughout this paper, we consider databases and queries of the following form.

– Data: Each data record in the database is of the form (w, x) ∈ W × X, where w forms the searchable attributes and x the computable values. A database is a multi-set of such records.

– Search Queries: Q consists of the set of supported search queries of the form Q : W → {0,1}. For a multi-set Z of data records, Q[Z] denotes the multi-set obtained by using Q to select records from Z and projecting them to the X coordinate. Formally, writing µ_S(y) to denote the multiplicity of an element y in a multi-set S, we have

$$\mu_{Q[Z]}(x) = \sum_{w \in \mathcal{W} :\, Q(w) = 1} \mu_Z(w, x).$$

– Computation Queries: F consists of the set of supported computation queries. Each f ∈ F takes as input a multi-set consisting of elements in X.

– Leakage Function: A leakage function L maps data sets and (optionally) queries to a pair of messages (L_S, L_A), to be sent to a pair of servers.
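As a concrete reading of the definitions above, the following minimal Python sketch models a database as a multi-set of (w, x) records and computes Q[Z] by summing multiplicities. The function name `select` and the toy zip-code records are illustrative assumptions, not part of the constructions in this work.

```python
from collections import Counter

def select(Q, Z):
    """Return Q[Z]: the multi-set of x values of records (w, x) in Z with Q(w) true."""
    result = Counter()
    for (w, x), multiplicity in Z.items():
        if Q(w):
            # mu_{Q[Z]}(x) accumulates mu_Z(w, x) over every matching w
            result[x] += multiplicity
    return result

# Toy database (hypothetical): w is a zip code, x is an age.
Z = Counter({("600036", 31): 2, ("600036", 45): 1, ("400076", 31): 1})

selected_one_zip = select(lambda w: w == "600036", Z)
selected_all = select(lambda w: True, Z)
```

With the predicate that accepts every w, the two records with x = 31 under different zip codes are merged, so the multiplicity of 31 becomes 3, exactly as the sum over w prescribes.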

Composite Queries. We shall also consider an extension of the definition of Q[Z]: We allow Q = (Q₁, ···, Q_d), where each Q_i : W → {0,1} and f = (f₀, f₁, ···, f_d), where for i > 0, f_i are functions on multi-sets of values, and f₀ is a function on a d-tuple; we define

$$f(Q[Z]) := f_0\bigl(f_1(Q_1[Z]), \cdots, f_d(Q_d[Z])\bigr).$$

We note that this could be further generalized to recursively allow Q_i and f_i to have the same general structure.
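To illustrate the composite form, here is a small Python sketch with d = 2, in which f₁ and f₂ compute a mean over each selected multi-set and f₀ takes their difference. The helper names (`select`, `composite`, `mean`) and the case/control records are assumptions made for this example only.

```python
from collections import Counter

def select(Q, Z):
    """Q[Z]: multi-set of x values of the records (w, x) in Z selected by Q."""
    result = Counter()
    for (w, x), multiplicity in Z.items():
        if Q(w):
            result[x] += multiplicity
    return result

def composite(f0, fs, Qs, Z):
    """Evaluate f0(f1(Q1[Z]), ..., fd(Qd[Z]))."""
    return f0(*[f_i(select(Q_i, Z)) for f_i, Q_i in zip(fs, Qs)])

def mean(multiset):
    """Mean of a non-empty multi-set of numbers, respecting multiplicities."""
    return sum(x * m for x, m in multiset.items()) / sum(multiset.values())

# Hypothetical records: w is a study group, x is an allele count in {0, 1, 2}.
Z = Counter({("case", 2): 1, ("case", 1): 1, ("control", 0): 2, ("control", 1): 2})

difference = composite(
    lambda a, b: a - b,
    [mean, mean],
    [lambda w: w == "case", lambda w: w == "control"],
    Z,
)
```

Here the case mean is 1.5 and the control mean is 0.5, so `difference` evaluates to 1.0.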


UC Security. We shall use the framework of Universally Composable security [10] to define our primitives. Towards this we shall present an ideal functionality, a corruption model (called non-colluding semi-honest servers model, below), and communication pattern restrictions on a protocol that securely realizes the functionality. The corruption model allows active corruption of some parties and (only) passive corruption of the others. Jumping ahead, we mention that in our modular constructions, the sub-protocols will be invoked using roles that are consistent with the corruption model, so that a corruption pattern admissible in a protocol will lead to corruption patterns admissible in the sub-protocols too; this lets one use the universal composition theorem [10].

Oblivious PRF. An Oblivious PRF (OPRF) scheme [49] is a two-party protocol to securely evaluate a PRF F_K(τ), in which the first party inputs τ, the second inputs K, the first learns the value of F_K(τ) (and the second learns nothing). We use the UC-secure OPRF protocol introduced in [37] in our constructions, which allows active corruption of the first party (receiver) and passive corruption of the second party (one with the key).
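The message flow of an OPRF can be sketched with a toy blinded-exponentiation protocol in Python. This is an illustration under stated assumptions, not necessarily the protocol of [37]: the group (squares modulo the tiny safe prime 1019) is far too small to be secure, the hash-to-group map is ad hoc, and all function names are hypothetical.

```python
import hashlib
import secrets

P = 1019      # toy safe prime, P = 2 * Q_ORD + 1 (illustration only, insecure size)
Q_ORD = 509   # prime order of the subgroup of squares modulo P

def hash_to_group(tau: bytes) -> int:
    """Map tau to a non-zero square modulo P, i.e. an element of the order-Q_ORD subgroup."""
    h = 1 + int.from_bytes(hashlib.sha256(tau).digest(), "big") % (P - 1)
    return pow(h, 2, P)

def finalize(tau: bytes, y: int) -> bytes:
    """Outer hash applied to the input and the exponentiated group element."""
    return hashlib.sha256(tau + y.to_bytes(2, "big")).digest()

def prf(key: int, tau: bytes) -> bytes:
    """Direct evaluation of F_K(tau), used here only to check correctness."""
    return finalize(tau, pow(hash_to_group(tau), key, P))

def receiver_blind(tau: bytes):
    """First party: hide tau behind a random exponent r."""
    r = 1 + secrets.randbelow(Q_ORD - 1)   # r in [1, Q_ORD - 1], invertible mod Q_ORD
    return r, pow(hash_to_group(tau), r, P)

def sender_evaluate(key: int, blinded: int) -> int:
    """Second party: apply the key to the blinded element without seeing tau."""
    return pow(blinded, key, P)

def receiver_unblind(tau: bytes, r: int, evaluated: int) -> bytes:
    """First party: strip r and hash to obtain F_K(tau)."""
    r_inv = pow(r, -1, Q_ORD)              # modular inverse, Python 3.8+
    return finalize(tau, pow(evaluated, r_inv, P))

key = 1 + secrets.randbelow(Q_ORD - 1)
r, blinded = receiver_blind(b"keyword")
output = receiver_unblind(b"keyword", r, sender_evaluate(key, blinded))
```

The receiver ends up with the same value as a direct evaluation under the sender's key, while the sender only ever sees the blinded group element.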

Keyword Search Queries. The major search query families that have received attention in the searchable encryption literature – and also of interest to this work – are “keyword queries.”³ A keyword query is either a predicate about the presence of a single keyword in a record (document), or a boolean formula over such predicates. In terms of the notation above, the searchable attribute for each record is a set of keywords, w ⊆ K where K is a given keyword space. That is, W = P(K), the power set of K. We define the class of keyword occurrence queries, Q_K = {Q_τ : τ ∈ K}, where Q_τ is defined as Q_τ(w) = 1 iff τ ∈ w. A more complex search query can be specified as a boolean formula over several such keyword occurrence predicates. We denote this query family by Q^∗_K.

We point out that keyword searches can be extended to more general database queries as follows: Each row in a database table is interpreted as a document, consisting of “keywords” of the form (attribute, value), one for each attribute (column) in the table. Then a search query that is specified as a formula over attribute-value equality predicates can be encoded as a formula over keyword occurrence predicates. Range queries can be modeled as a disjunction over several predicates corresponding to a priori fixed ranges of varying sizes.
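The encoding described above can be sketched in Python as follows. One possible choice of a priori fixed ranges, assumed here purely for illustration, is the family of aligned power-of-two intervals; the attribute names, the `_range` keyword suffix and all helper functions are hypothetical.

```python
def dyadic_cover(lo, hi):
    """Split [lo, hi) into aligned power-of-two intervals."""
    cover = []
    while lo < hi:
        size = 1
        # Grow the block while it stays aligned at lo and fits inside [lo, hi).
        while lo % (2 * size) == 0 and lo + 2 * size <= hi:
            size *= 2
        cover.append((lo, lo + size))
        lo += size
    return cover

def range_keywords(attribute, value, bits=7):
    """One keyword per aligned power-of-two interval (sizes 1..2**bits) containing value."""
    keywords = set()
    for k in range(bits + 1):
        size = 1 << k
        start = value - value % size
        keywords.add((attribute + "_range", (start, start + size)))
    return keywords

def row_to_keywords(row):
    """Interpret a table row as a document: one (attribute, value) keyword per column."""
    return frozenset(set(row.items()) | range_keywords("age", row["age"]))

def has(keyword):
    """Keyword occurrence predicate: true iff the keyword is in w."""
    return lambda w: keyword in w

def in_range(attribute, lo, hi):
    """Range query as a disjunction over the fixed intervals covering [lo, hi)."""
    parts = [has((attribute + "_range", iv)) for iv in dyadic_cover(lo, hi)]
    return lambda w: any(p(w) for p in parts)

rows = [
    {"zip": "600036", "age": 31},
    {"zip": "600036", "age": 45},
    {"zip": "400076", "age": 35},
]
query = lambda w: has(("zip", "600036"))(w) and in_range("age", 30, 40)(w)
matches = [query(row_to_keywords(r)) for r in rows]
```

Only the first row satisfies both the equality predicate on `zip` and the range predicate on `age`; the range [30, 40) is covered by the two fixed intervals [30, 32) and [32, 40).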

3 Defining FED

To define an FED scheme, we present an ideal functionality (called FED), a corruption model (called non-colluding semi-honest servers model, below), and communication pattern restrictions on a protocol for it to be an FED scheme. After we present these elements, an FED scheme is defined in Definition 1.

³ We remark that the concept of searchable encryption has been generalized to more expressive forms of search, such as pattern matching, range queries and searching over structured data [63,15,67,17,44,22]. While our constructions do not focus on such search query families, our general framework applies to all these notions as well.

The Functionality. FED is formulated as a two stage functionality, involving an initialization stage and a query stage. Figure 1 describes this functionality and also gives a schematic representation. The parties involved are data-owners D_i, a storage server S, an auxiliary server A, and clients, each one denoted as C.

As shown in Figure 1, FED functionality allows a client to send (Q, f) and learn f(Q[Z]), where Z is the multi-set union of the data from all the data-owners. (The notation Q[Z] is defined in Section 2.)

(Q, F, L)-FED functionality, supporting a search query family Q and a computation function family F, with leakage given by the function L:

– Initialization phase: When a data-owner D_i sends Z_i ⊆ W × X, generate (L_S, L_A) ← L(Z_i) and send them to S and A respectively.

– Transition to the next phase on command from S or A (notifying the other). Define the multiset union Z = ⋃_i Z_i.

– Query phase: Each time a client C sends a query (Q, f) ∈ Q × F, generate (L_S, L_A) ← L(Q, f, Z) and send them to S and A respectively; send f(Q[Z]) to C.
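The two phases of the functionality can be mirrored by a small Python class acting as a trusted party. This is only a sketch for intuition: the class name `IdealFED`, the toy leakage function `size_leakage` (which reveals record counts to S and nothing to A), and the use of additive multi-set union are assumptions made for the example; actual leakage functions are specified with each construction.

```python
from collections import Counter

class IdealFED:
    """Trusted-party sketch of the (Q, F, L)-FED functionality."""

    def __init__(self, leakage):
        self.leakage = leakage        # L, returning a pair (L_S, L_A)
        self.Z = Counter()            # multi-set union of all Z_i
        self.querying = False
        self.to_S, self.to_A = [], [] # leakage delivered to each server

    def _leak(self, *args):
        L_S, L_A = self.leakage(*args)
        self.to_S.append(L_S)
        self.to_A.append(L_A)

    def initialize(self, Z_i):
        if self.querying:
            raise RuntimeError("initialization phase is over")
        self._leak(Z_i)
        self.Z += Z_i                 # additive multi-set union

    def start_query_phase(self):
        self.querying = True          # on command from S or A

    def query(self, Q, f):
        if not self.querying:
            raise RuntimeError("still in initialization phase")
        self._leak(Q, f, self.Z)
        selected = Counter()
        for (w, x), multiplicity in self.Z.items():
            if Q(w):
                selected[x] += multiplicity
        return f(selected)            # f(Q[Z]) goes to the client only

def size_leakage(*args):
    """Toy leakage (assumption): S learns record counts, A learns nothing."""
    if len(args) == 1:
        return sum(args[0].values()), None
    Q, f, Z = args
    return sum(m for (w, _), m in Z.items() if Q(w)), None

fed = IdealFED(size_leakage)
fed.initialize(Counter({("a", 1): 1}))
fed.initialize(Counter({("a", 2): 1, ("b", 5): 1}))
fed.start_query_phase()
total = fed.query(lambda w: w == "a", lambda xs: sum(x * m for x, m in xs.items()))
```

Note that neither the data-owners nor the client receive any leakage in this sketch; the client obtains only f(Q[Z]).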

[Figure 1: schematic of the FED functionality. Data-owners D₁, …, D_m supply Z₁, …, Z_m; a client C sends (Q, f) and receives f(Q[Z]); the servers S and A receive leakage.]

Fig. 1 The FED functionality. The dotted lines indicate leakage from the functionality. Note that we do not allow any leakage to the data owners or the clients.

Non-Colluding Semi-Honest Servers Corruption Model. We define the non-colluding semi-honest servers corruption model in which the adversary can corrupt any subset of players, with the following restrictions:

– Only the clients can be actively corrupt. The other corrupt parties (S, A and the data-owners) should remain honest-but-curious.

– The storage server S and the auxiliary server A cannot simultaneously be corrupt (i.e., they do not collude).

While our protocols can be slightly modified to accommodate actively corrupt data-owners as well, we do not do so. This is because, even given the ideal FED functionality, one actively corrupt data-owner can choose to supply arbitrary data to the database, potentially invalidating the results from the entire database. To adequately handle active corruption of the data-owners, the FED functionality should be augmented with the ability for the servers to enforce policies on the data being collected from the data-owners. We do not consider this extension in this work.

Two-Phase One-Pass Protocol. As mentioned before, to be deployable in a realistic setting, an FED scheme should have a restricted interaction pattern. Halevi et al. defined a server-based one-pass protocol [29] to model “secure computation on the web”, where the clients join and leave a protocol one-by-one, while a server remains online throughout. We adapt this model for the FED framework as follows. A protocol among data-owners, servers and clients is said to be a two-phase one-pass protocol if it satisfies the following:

– The protocol consists of sequentially executing several sessions of two protocols, in two phases; the first protocol (executed in the first phase) involves only the servers and a single data-owner, while the second protocol (executed in the second phase) involves only the servers and a single client.

– The servers can maintain their state across the sessions. Every other party participates only in one of the sessions.

– Each data owner and client is given the servers’ public-keys (but not each others’). The servers obtain the public-key for a data-owner or a client when it joins the protocol.

– Only one of the servers has a “large” storage between the sessions – i.e., storage size depending on the input size across the sessions – while the amount of storage for the other server(s) is upper bounded by a function of the security parameter (independent of the inputs). (In our setting with two servers S and A, we define S as the one with large storage.)

Definition 1. A (Q, F, L)-FED scheme is a two-phase one-pass protocol that UC-securely implements the (Q, F, L)-FED functionality (Figure 1) in the non-colluding semi-honest servers corruption model.

The above definition readily generalizes to other intermediate primitives we define, by replacing the functionality FED by other functionalities that will be specified later (namely, SED, CED, sFED, sSED and sCED).

Note that for a scheme to be fully specified, we need to describe not just Q and F, but also the leakage function L. In each of our constructions in the sequel, we shall do this explicitly.

4 Single Data-Owner Protocols

In this section we introduce a single data-owner version of FED, denoted by sFED, and also construct an sFED scheme. The single data-owner setting is simpler as it avoids having to “mix” data records from different data-owners.

Our sFED scheme relies on two other new functionalities we introduce, namely, (single data-owner versions of) Searchably Encrypted Datastore (sSED) and Computably Encrypted Datastore (sCED). We begin by presenting these new functionalities.

4.1 Searchably Encrypted Datastore

Recall that in an FED or sFED scheme, a query has two components – a search query Q and a computation function f. The Searchably Encrypted Datastore functionality (SED or sSED) has a similar structure, but supports only the search query; all the records that match the search query are revealed to the storage server S. Jumping ahead, the choice of S to be the party receiving the output rather than the client C is dictated by our plan to use this functionality in protocols for FED and sFED.

The functionality sSED is depicted in Figure 2: There is a single data-owner D with input W ⊆ W × I, where each element in W has a unique identifier id ∈ I as its second coordinate; the output that S receives when a client C inputs Q is the set of identifiers Q[W] ⊆ I.

In Section 4.5, we shall see that a multi-client version of Symmetric Searchable Encryption (MC-SSE) from [36] can be used to construct an sSED scheme. The main limitations of MC-SSE compared to sSED are that (1) in the former the data-owner D remains online during the query phase whereas in the latter D can be online only during the initialization phase, and (2) in the former the output is delivered to both S and C whereas in the latter it must be delivered only to S. In our construction in Section 4.5, we shall leverage the auxiliary server A to meet these additional requirements.

4.2 Computably Encrypted Datastore

The second functionality we introduce – CED, or its single data-owner variant sCED – helps us securely carry out a computation on an already filtered data set. The complexity of this computation will be related to the size of the filtered data rather than the entire contents of the data set.

In sCED, as shown in Figure 2, a single data-owner D (who stays online only during the initialization phase) has an input in the form of X ⊆ I × X. Later, during the query phase, clients can compute functions on a subset of data. More precisely, a client C can specify a function f from a pre-determined function family, and the storage server S specifies a set T ⊆ I, and C receives f(δ_T[X]) where we define δ_T(id, x) = x iff id ∈ T, and δ_T[X] is the multiset of x values obtained by applying δ_T(id, x) to all elements of X.
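The input/output behaviour of sCED can be summarized by a short Python sketch of the ideal computation (not of any secure protocol). The function names and the toy identifiers are illustrative assumptions.

```python
from collections import Counter

def delta_T(T, X):
    """delta_T[X]: multi-set of the x values whose identifier lies in T."""
    return Counter(x for (id_, x) in X if id_ in T)

def sced_query(f, T, X):
    """What the client C learns: f applied to the filtered multi-set."""
    return f(delta_T(T, X))

# Hypothetical input of D: (id, x) pairs with unique identifiers.
X = {(101, 2), (102, 0), (103, 2)}
T = {101, 103}   # chosen by the storage server S

filtered = delta_T(T, X)
total = sced_query(lambda xs: sum(x * m for x, m in xs.items()), T, X)
```

Because δ_T[X] is a multi-set, the value 2 appears with multiplicity 2 in `filtered`, even though the two records carry different identifiers.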

In Section 4.4, we present protocols for sCED, for various specialized function families, as well as for a general function family.

4.3 sFED Protocol Template

Protocol sFED-templ^sSED,sCED: This protocol is illustrated in Figure 2. During the initialization phase, D maps its input Z to a pair (W, X), where W ⊆ W × I and X ⊆ I × X such that (w, x) ↦ ((w, id), (id, x)) where id is randomly drawn from (a sufficiently large set) I. Then, in the initialization phase of sFED, the parties D, S and A invoke the initialization phase of sSED, and that of sCED (possibly in parallel). During the query phase of sFED, S, A and C first invoke the query phase of sSED, so that S obtains T = Q[W] as the output; then they invoke the query phase of sCED and C obtains f(δ_T[X]); note that δ_T[X] = Q[Z] if there are no collisions when elements are drawn from I to construct (W, X) from Z.

[Figure 2: schematic of the sFED functionality and of the protocol sFED-templ, in which D supplies W to sSED and X to sCED, C supplies Q and f, S obtains T = Q[W] from sSED, and C obtains f(δ_T[X]) from sCED.]

Fig. 2 sFED functionality (left) and the protocol sFED-templ^sSED,sCED. The dotted lines indicate leakage. In sFED-templ, the parties communicate to each other only by invoking the functionalities sSED and sCED.
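The composition performed by the template can be traced with the following Python sketch, in which sSED and sCED are replaced by their ideal input/output behaviour. The function names, the 128-bit identifier length and the sample records are assumptions for illustration only.

```python
import secrets
from collections import Counter

ID_BITS = 128   # assumed size of the identifier space I

def split_records(Z):
    """D's local step: map each (w, x) to ((w, id), (id, x)) with a fresh random id."""
    W, X = set(), set()
    for (w, x), multiplicity in Z.items():
        for _ in range(multiplicity):
            id_ = secrets.randbits(ID_BITS)
            W.add((w, id_))
            X.add((id_, x))
    return W, X

def ssed_query(Q, W):
    """Ideal sSED: the storage server S learns T = Q[W], a set of identifiers."""
    return {id_ for (w, id_) in W if Q(w)}

def sced_query(f, T, X):
    """Ideal sCED: the client C learns f(delta_T[X])."""
    return f(Counter(x for (id_, x) in X if id_ in T))

def sfed_query(Q, f, W, X):
    """Query phase of the template: sSED first, then sCED on the resulting T."""
    T = ssed_query(Q, W)
    return sced_query(f, T, X)

Z = Counter({("case", 2): 2, ("control", 1): 1})
W, X = split_records(Z)
result = sfed_query(
    lambda w: w == "case",
    lambda xs: sum(x * m for x, m in xs.items()),
    W,
    X,
)
```

Barring identifier collisions, which are extremely unlikely for 128-bit random ids, the result equals f(Q[Z]); here the two case records with x = 2 sum to 4.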

Leakage, L_sFED-templ: The protocol sFED-templ^{sSED,sCED} leaks, for every query (Q, f) from C, the set T = Q[W] to S, and in addition provides S and A with the leakage provided by the sSED and sCED functionalities (which depends on how they are instantiated). Note that D chooses ids at random to define W and X and the leakage functions of sSED and sCED are applied to these sets. Also, as ids are random, leaking T amounts to only leaking its pattern over multiple queries: specifically, T_1, · · ·, T_n contains only the information provided by the intersection sizes of various combinations of these sets. Formally, this leakage is given by

pattern(T_1, · · ·, T_n) := { |⋂_{i∈S} T_i| }_{S ⊆ {1,...,n}}.   (1)
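To make the pattern leakage concrete, the following sketch computes the intersection size for every non-empty index subset S. Restricting to non-empty subsets and returning a dictionary keyed by index tuples are choices made for this example; the function name `pattern` simply mirrors the notation above.

```python
from itertools import combinations

def pattern(Ts):
    """Map each non-empty index subset S of {1..n} to the size of the intersection of T_i, i in S."""
    n = len(Ts)
    out = {}
    for r in range(1, n + 1):
        for S in combinations(range(1, n + 1), r):
            common = set.intersection(*(set(Ts[i - 1]) for i in S))
            out[S] = len(common)
    return out

# Two query results over random-looking ids, and the same results under renamed ids.
p1 = pattern([{"id7", "id9"}, {"id9", "id4"}])
p2 = pattern([{"a", "b"}, {"b", "c"}])
```

Both calls yield the same dictionary, which illustrates why, with random ids, the sets T_i reveal nothing beyond these sizes. The number of subsets grows as 2^n, so the sketch is only meant for small n.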

The following theorem is a consequence of universal composition, applied to the non-colluding semi-honest servers corruption model.

Theorem 1. Protocol sFED-templ^{sSED,sCED}, when instantiated using a (Q, L_sSED)-sSED scheme and an (F, L_sCED)-sCED scheme, is a (Q, F, L_sFED-templ)-sFED scheme, where L_sFED-templ is as defined above.

Hence, to construct sFED, we need only instantiate appropriate sSED and sCED schemes. We proceed to do this next, starting with sCED.

4.4 sCED Protocols

We present protocols for a few different computation function families. We remark that sCED somewhat resembles a primitive called Controlled Functional Encryption [50], and our sCED protocols involve similar ideas as in that work.

In each of the protocols below, D has an input X ⊆ I × X during the initialization phase. It will be convenient to define the set J ⊆ I as J = {id | ∃x s.t. (id, x) ∈ X}. During each query, S has an input T ⊆ J and C has an input f from the computation function family. Below, all our constructions use a PRF F.


Value Retrieval: This is the functionality associated with standard SSE, where the selected values, or documents, are retrieved without any further computation on them. There is a single function in the corresponding computation function family F_ValRet, given by f(δ_T[X]) = δ_T[X]. When the client C and the data owner D are the same party (as is the case in the simplest version of SSE), this can be implemented in a straightforward fashion using a PRF. Below we give a simple scheme which relies on A to extend this to a setting with (multiple) clients who do not communicate directly with D.

Protocol ValRet

– Initialization Phase: D picks a PRF key K, and defines β_id := x_id ⊕ F_K(id) and sends {(id, β_id)}_{id∈X} to S and K to A, who store them.

– Computation Phase: S sends T (randomly permuted) and a fresh PRF key K_1 to A; it also sends {β_id ⊕ F_{K_1}(id) | id ∈ T} (under the same permutation) to C. A sends {F_K(id) ⊕ F_{K_1}(id) | id ∈ T} to C. C outputs {a_i ⊕ b_i}_i, where {a_i}_i and {b_i}_i are the messages it received from S and A (in the same order).

– Leakage, L_ValRet: On initialization, the ID-set J is leaked to S. On each query, T is leaked to A.
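The message flow of Protocol ValRet can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the PRF F is instantiated with HMAC-SHA256, ids are strings, values are byte strings of at most 32 bytes, and all function names (`valret_init`, `valret_S`, `valret_A`, `valret_C`) are hypothetical.

```python
import hashlib
import hmac
import random
import secrets

def prf(key, ident):
    """F_K(id), instantiated here (as an assumption) with HMAC-SHA256."""
    return hmac.new(key, ident.encode(), hashlib.sha256).digest()

def xor(a, b):
    return bytes(p ^ q for p, q in zip(a, b))

def valret_init(X):
    """D: X maps id -> value (at most 32 bytes). Returns S's store and A's key K."""
    K = secrets.token_bytes(32)
    store_S = {ident: xor(x, prf(K, ident)) for ident, x in X.items()}  # beta_id
    return store_S, K

def valret_S(store_S, T):
    """S: permute T, pick a fresh key K1, re-mask the stored shares for C."""
    order = list(T)
    random.SystemRandom().shuffle(order)
    K1 = secrets.token_bytes(32)
    to_C = [xor(store_S[i], prf(K1, i)) for i in order]
    return (order, K1), to_C

def valret_A(K, msg_from_S):
    """A: send F_K(id) xor F_K1(id) for each id, in the order chosen by S."""
    order, K1 = msg_from_S
    return [xor(prf(K, i), prf(K1, i)) for i in order]

def valret_C(from_S, from_A):
    """C: combine the two share lists position by position."""
    return [xor(a, b) for a, b in zip(from_S, from_A)]

X = {"id1": b"A" * 32, "id2": b"B" * 32, "id3": b"C" * 32}
store_S, K = valret_init(X)
to_A, to_C = valret_S(store_S, {"id1", "id3"})
retrieved = valret_C(to_C, valret_A(K, to_A))
```

Correctness follows from (x ⊕ F_K ⊕ F_{K_1}) ⊕ (F_K ⊕ F_{K_1}) = x; the client sees the values in the permuted order only, so `retrieved` should be compared as a multiset.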

Recall that, in the overall sFED protocol template, ids will be chosen randomly. Hence leaking J amounts to leaking only its size |X| (the data can be padded with dummy entries so that instead of |X|, only an upper bound on it is leaked), and leaking T amounts to only leaking its pattern over multiple queries (see Equation 1).

Theorem 2. Protocol ValRet is a (F_ValRet, L_ValRet)-sCED scheme, assuming the security of the PRF scheme used.

Proof sketch: We briefly sketch the elements in the protocol that help it achieve security. In the initialization phase D secret-shares its data between the two non-colluding servers, so that an adversary corrupting either one learns no information about the data. In the computation phase, C receives freshly randomized secret-shares (using the key K_1) of the answer to its query, with the elements randomly permuted. This is because, if C does not collude with one of the servers it should receive no information other than the multi-set of retrieved values, f(δ_T[X]). In particular, it may not learn whether an id selected by one query gets selected again under another query. The permutation and fresh secret-sharing ensure that its view can be completely simulated just based on the multi-set of retrieved values, f(δ_T[X]) = {x_id | id ∈ T}. Note that if C colludes with one of the servers, this rerandomization has no effect, but also, in that case, it is allowed to learn T and there is no need for rerandomization.

Summation: The family F_Sum consists of the single function f such that f(S) = Σ_{x∈S} x (where the summation is in a given abelian group which the domain of values is identified with). The following simple and efficient protocol yields a sCED scheme for summation, with A learning only the size and the “pattern” information about the input of S.

Protocol Sum


– Initialization Phase: D picks a PRF key K, and defines β_id := x_id + F_K(id) and sends {(id, β_id)}_id to S and K to A, who store them.

– Computation Phase: S defines the set R := {id | id ∈ T} and a random value ρ; it sends (ρ, R) to A and γ := ρ + Σ_{id∈T} β_id to C. A sends δ := ρ + Σ_{α∈R} F_K(α) to C. C outputs γ − δ.

– Leakage, L_Sum: On initialization, the ID-set J is leaked to S. On each query, T is leaked to A.
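A minimal sketch of Protocol Sum follows. The abelian group is taken to be the integers modulo 2^64 and the PRF is HMAC-SHA256 truncated to 64 bits; both are assumptions for illustration, as are the function names.

```python
import hashlib
import hmac
import secrets

MOD = 2 ** 64  # assumed group: integers modulo 2^64 under addition

def prf_int(key, ident):
    """F_K(id) as a group element (HMAC-SHA256 truncated to 64 bits, an assumption)."""
    digest = hmac.new(key, ident.encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")

def sum_init(X):
    """D: X maps id -> integer value. Returns S's store {id: beta_id} and A's key K."""
    K = secrets.token_bytes(32)
    store_S = {i: (x + prf_int(K, i)) % MOD for i, x in X.items()}
    return store_S, K

def sum_S(store_S, T):
    """S: send (rho, R) to A and gamma = rho + sum of beta_id over T to C."""
    rho = secrets.randbelow(MOD)
    R = sorted(T)
    gamma = (rho + sum(store_S[i] for i in R)) % MOD
    return (rho, R), gamma

def sum_A(K, msg_from_S):
    """A: send delta = rho + sum of F_K(alpha) over R to C."""
    rho, R = msg_from_S
    return (rho + sum(prf_int(K, a) for a in R)) % MOD

def sum_C(gamma, delta):
    """C: output gamma - delta in the group."""
    return (gamma - delta) % MOD

X = {"a": 5, "b": 7, "c": 11}
store_S, K = sum_init(X)
msg, gamma = sum_S(store_S, {"a", "c"})
answer = sum_C(gamma, sum_A(K, msg))
```

The value ρ masks both messages to C, so γ and δ individually look uniform while their difference is the requested sum.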

This protocol is a natural extension of the ValRet scheme above, with the same initialization phase. The client C receives a fresh additive secret-sharing of the single output value it seeks. The security argument is similar to before; in particular, if C does not collude with either server, its view can be completely simulated from its output without any leakage.

Theorem 3. Protocol Sum is a (F_Sum, L_Sum)-sCED scheme, assuming the security of the PRF scheme used.

In Appendix D.1 we present an alternate protocol for summation, using additive homomorphic encryption, which has lower communication complexity and also avoids the leakage of the filtered set T to A. We also extend the value retrieval and summation protocols to the setting when each value x is a vector (x_1, · · ·, x_m), and the function f acts on a subset of attributes.

General Functions: Next we present a sCED scheme for the family F_ckt consisting of general functions represented using boolean circuits. Our scheme is similar to the CFE scheme of [50] for general functions although there are some important differences between the two models. In particular, the model of CFE conflates the client C and the storage server S; also, it does not allow the data-owner D to directly communicate with the auxiliary server A, resulting in the use of public-key encryption in [50], which we avoid, resulting in greater efficiency.

A client C who wishes to evaluate a function f sends a circuit representation of f to A. The inputs to this circuit are the values {x^id}_{id∈T} which none of the participants in the query phase (C, A nor S) knows. At a high-level, the idea is that A will construct a garbled circuit for f and send it to S. For each input bit for this circuit, there are two labels, which will both be encrypted by A using keys that are derived from a master key that the data-owner D gives it (as described below). All these encrypted labels are sent along with the garbled circuit. To evaluate the garbled circuit, S needs to know how to decrypt one out of the two labels corresponding to each input position. To enable the evaluation, D would have provided S with the decryption key for the labels corresponding to each bit of x^id, for each id (during the initialization phase). A detailed description of this scheme is given in Figure 3.

The leakage function for this construction is as follows:

Leakage, L_ckt: On initialization, the ID-set J is leaked to S. On each query, T and f are leaked to A, and the circuit structure of f (as revealed by the garbled circuit) is leaked to S.


Protocol CktEval: A sCED scheme for general functions:

The values computed over (i.e., elements in X) are represented as bit strings of the same size, say |x| = t. We shall refer to a PRF F and a Symmetric Key Encryption scheme (SKE) with algorithms (SKE.Gen, SKE.Enc, SKE.Dec) (denoting Key Generation, Encryption and Decryption algorithms respectively), which can be based on F.

– Initialization Phase: The data-owner D picks a PRF key K. For each (id, x^id) ∈ X, where x^id = (x^id_1, · · ·, x^id_t), it computes {ω_{id,i}, λ^0_{id,i}, λ^1_{id,i}}_{i=1}^t, where ω_{id,i} = x^id_i ⊕ ν_{id,i}, with ν_{id,i} being a pseudorandom bit, and λ^b_{id,i} (for b = 0, 1) are two encryption keys generated by SKE.Gen, all computed using F_K(id, i) as the source of randomness. D sends {id, {λ^{x^id_i}_{id,i}, ω_{id,i}}_{i=1}^t}_{id∈X} to S, and it sends K to A, for storing.

– Computation Phase: C's input is a function f defined on a multi-set (order of the inputs not being important). We assume that given a number n, a circuit representation of f can be efficiently computed corresponding to the input being a multi-set of size n. The storage server S's input is a set T ⊆ X.

1. S sends T to A (randomly permuted). Then, for each id ∈ T, A computes {ν_{id,i}, λ^0_{id,i}, λ^1_{id,i}}_{i=1}^t using {F_K(id, i)}_{i=1}^t.

2. C sends f to A. Then, A constructs a garbled circuit for f specialized for |T| inputs. The input wires of this circuit are indexed by {(id, i)}_{id∈T, i∈[t]}. Let (τ^0_{id,i}, τ^1_{id,i}) be the pair of wire-labels used for each such input wire. Then, for each (id, i), and b ∈ {0, 1}, it defines the ciphertext

c^b_{id,i} ← SKE.Enc_{λ^{b⊕ν_{id,i}}_{id,i}}(τ^{b⊕ν_{id,i}}_{id,i}).

Note that c^{ω_{id,i}}_{id,i} is the encryption of τ^{x^id_i}_{id,i} using the key λ^{x^id_i}_{id,i}. A sends to S {(c^0_{id,i}, c^1_{id,i})}_{id∈T, i∈[t]}, along with the garbled circuit except the output decoding map. It sends the output decoding map to C.

3. For each (id, i), S decrypts c^{ω_{id,i}}_{id,i} using λ^{x^id_i}_{id,i}, to get τ^{x^id_i}_{id,i}. It uses these labels for the input wires to evaluate the circuit, and obtains the output labels. It sends these output labels to C.

4. C uses the output decoding map from A, and output labels from S to calculate the output.

Fig. 3 A sCED scheme for general functions.
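The input-label transfer at the heart of CktEval (how S ends up with exactly the label τ^{x^id_i}_{id,i} for each input bit) can be sketched in isolation; the garbling and evaluation of the circuit itself are omitted. Everything below is an illustrative assumption: HMAC-SHA256 serves as the PRF, the SKE is a simple nonce-based scheme built from that PRF (without authentication), labels are random 16-byte strings, and all function names are hypothetical.

```python
import hashlib
import hmac
import secrets

def prf(key, *parts):
    """PRF F_K applied to a tuple of inputs (HMAC-SHA256, an assumption)."""
    msg = "|".join(str(p) for p in parts).encode()
    return hmac.new(key, msg, hashlib.sha256).digest()

def derive(K, ident, i):
    """Derive (nu, lambda^0, lambda^1) for bit i of record ident from K."""
    nu = prf(K, ident, i, "nu")[0] & 1
    lam = (prf(K, ident, i, "lam", 0), prf(K, ident, i, "lam", 1))
    return nu, lam

def ske_enc(key, m):
    """Toy PRF-based encryption: random nonce, then m xor PRF(key, nonce)."""
    nonce = secrets.token_bytes(16)
    pad = hmac.new(key, nonce, hashlib.sha256).digest()[:len(m)]
    return nonce + bytes(a ^ b for a, b in zip(m, pad))

def ske_dec(key, c):
    nonce, body = c[:16], c[16:]
    pad = hmac.new(key, nonce, hashlib.sha256).digest()[:len(body)]
    return bytes(a ^ b for a, b in zip(body, pad))

def D_init(K, ident, bits):
    """D -> S: for each bit x_i, the key lambda^{x_i} and the masked bit omega_i = x_i xor nu_i."""
    out = []
    for i, x in enumerate(bits):
        nu, lam = derive(K, ident, i)
        out.append((lam[x], x ^ nu))
    return out

def A_encrypt_labels(K, ident, labels):
    """A -> S: (c^0, c^1) per wire, where c^b = Enc under lambda^{b xor nu} of tau^{b xor nu}."""
    out = []
    for i, tau in enumerate(labels):
        nu, lam = derive(K, ident, i)
        out.append(tuple(ske_enc(lam[b ^ nu], tau[b ^ nu]) for b in (0, 1)))
    return out

def S_recover_labels(stored, ciphertexts):
    """S: open c^{omega_i} with lambda^{x_i} to obtain tau^{x_i} for each wire."""
    return [ske_dec(lam_x, c[omega]) for (lam_x, omega), c in zip(stored, ciphertexts)]

K = secrets.token_bytes(32)
bits = [1, 0, 1, 1]
labels = [(secrets.token_bytes(16), secrets.token_bytes(16)) for _ in bits]
stored = D_init(K, "id7", bits)
recovered = S_recover_labels(stored, A_encrypt_labels(K, "id7", labels))
```

The masked bit ω tells S which of the two ciphertexts to open without telling A (who never sees ω) or S (who never sees ν) the underlying bit on its own.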

The leakage can be further reduced by allowing C to specify any part of the function as a private input to the circuit computing f (rather than being hardwired into the circuit). For each of the wires corresponding to this private input, the two labels will be provided by C to A, and C will send one of these two labels to S, so that neither S nor A by itself learns this input. The theorem below can be proven based on the security of the garbled circuits (similar to the argument in [50]):

Theorem 4. Protocol CktEval (Figure 3) is a (F_ckt, L_ckt)-sCED scheme, assuming the security of the PRF scheme used.

Composite Queries: Recall that a composite query consists of Q = (Q_1, · · ·, Q_d) and f = (f_0, f_1, · · ·, f_d) such that f(Q[Z]) := f_0(f_1(Q_1[Z]), · · ·, f_d(Q_d[Z])).


The sSED protocol for non-composite queries directly generalizes to composite queries, by simply running d instances of the original sSED functionality to let S learn T_i = Q_i[W] for each i. But we need to adapt the sCED protocol to avoid revealing each f_i(Q_i[Z]) to C. We discuss how to do so in Appendix D.2.
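The plaintext meaning of a composite query, f_0(f_1(Q_1[Z]), · · ·, f_d(Q_d[Z])), can be illustrated as follows. This shows only the intended output, not a secure protocol; the function name `composite` and the toy case/control data are assumptions for the example.

```python
from collections import Counter

def composite(Z, Qs, fs, f0):
    """Evaluate f0(f1(Q1[Z]), ..., fd(Qd[Z])) in the clear."""
    parts = []
    for Q, f in zip(Qs, fs):
        selected = Counter(x for (w, x) in Z if Q(w))  # the multiset Q_i[Z]
        parts.append(f(selected))
    return f0(*parts)

# Toy data: (group, allele count) records.
Z = [("case", 1), ("case", 0), ("control", 1), ("control", 1)]
total = lambda ms: sum(x * c for x, c in ms.items())

difference = composite(
    Z,
    Qs=[lambda w: w == "case", lambda w: w == "control"],
    fs=[total, total],
    f0=lambda a, b: a - b,
)
```

In the secure setting the client should learn only the final value (here the difference of the two sums), not the intermediate results f_i(Q_i[Z]).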

4.5 sSED Protocols

In this section, we discuss how to instantiate the building block of searchable encryption sSED by adapting constructions of symmetric searchable encryption (SSE) from the literature (see, e.g., [12] and references therein). We focus on the family of keyword search queries, Q^∗_K, which can in turn be used to capture more general database queries (see Section 2). We shall adapt the SSE construction of Jarecki et al. [36] into an sSED scheme, though the high-level ideas below are applicable to adapting other constructions as well.

In SSE, there are only two parties: a client and a server. To store and search a database with an SSE scheme, the client (who is the data owner) uploads an encrypted database to the server. Later, the client can carry out search queries on this database by interacting with the server, while providing the server with minimal information (patterns in the queries and responses). Jarecki et al. [36] extended SSE to a multi-client setting in which the data owner is separate from (and does not trust) the clients. Here, the data owner must remain online when the clients query the server, as the untrusted server should not be in a position to answer queries by itself.

Recall that in sSED, we further separate the roles of the data owner and the query-phase assistant, by introducing an untrusted auxiliary server. This allows the data owner to be present only at the initial phase when the database is constructed. More importantly, in the multiple data owner setting (in which no one data owner can be trusted with access to all the data), it is crucial to avoid relying on any one data owner to control clients’ access to the entire database.

Another difference between sSED and SSE is that the latter reveals the search outcome (the IDs of the rows that match the search) to the client; in sSED, we seek to reveal this to S, but it cannot be revealed to the client (as the clients are not provided with any leakage). Finally, in sSED, we seek to handle active corruption of clients.

In Appendix C, we sketch how to adapt the MC-OXT protocol of [36] to account for the above requirements of sSED. We also present multiple security-efficiency tradeoffs. As a lightweight modification, we can simply let the auxiliary server A play the role of (1) the data owner during the query phase, and (2) the client. This incurs some leakage to A, namely the search queries and the search outcome. Note that when used in the sFED protocol template (Figure 2), since sSED is instantiated with random IDs, only the pattern of the search outcomes, as in Equation 1, rather than the search outcomes themselves are revealed to A, as well as to S. Also since many of the sCED protocols already leak this information to A, this provides an appropriate level of security for sSED protocols to be used in the sFED protocol template. For more details, please see Appendix C.


5 FED Protocols

Our FED protocol template is identical to that of the sFED protocol, except that the functionalities sSED and sCED are replaced by the analogous multiple data-owner versions, SED and CED (see Figure 4a).

Fig. 4 (a) FED protocol template. Each D_i splits its input data Z_i as W_i and X_i, and inputs them to SED and CED respectively. W denotes the combined data-set ⋃_i W_i. (b) SED protocol template using three functionalities sSED, merge-map and map. Although not shown, all three functionalities used may specify leakage to S and A. In accessing sSED, A plays the role of both the (single data-owner) D and A.

Protocol FED-templ^{SED,CED}: Each data-owner D_i maps its input Z_i to a pair (W_i, X_i), where W_i ⊆ W × I and X_i ⊆ I × X such that (w, x) ↦ ((w, id), (id, x)) where id is randomly drawn from (a sufficiently large set) I.⁴ After that, all the parties proceed exactly as in sFED-templ, but with the parties accessing SED and CED instead of sSED and sCED, and with each data-owner using (W_i, X_i) as its input.

Leakage L_FED-templ: The leakage is similar to L_sFED-templ defined earlier, but with leakage from sSED and sCED schemes replaced by those from SED and CED. Specifically, on a client query (Q, f), the leakage consists of the set (or equivalently, the pattern information of) T = Q[W] to S, where W = ⋃_i W_i; also the leakages from SED and CED are provided to S and A.

As before, we can state the security guarantee of this construction:

Theorem 5. Protocol FED-templ^{SED,CED}, when instantiated using a (Q, L_SED)-SED scheme and an (F, L_CED)-CED scheme, is a (Q, F, L_FED-templ)-FED scheme.

The main challenge then is in implementing SED and CED schemes, for reasonable leakage functions. Below, we present our protocols for realizing these functionalities.

4 I should be large enough so that we may assume that each (honest) data-owner will use a unique id for each record (w, x), disjoint from the set of IDs used by the others, except with negligible probability.


5.1 Onion Secret-Sharing

In going from single data-owner schemes to multi data-owner schemes, we seek to make the collection of data-owners behave like a single entity (without interacting with each other), so that they can communicate their collective data to the two servers in the form in which the underlying single data-owner scheme communicates it. Note that from this collective data, neither server should be able to link the records that are selected by a search query back to the individual data owners from whom they originate. In principle, this problem can be solved generically using secure multi-party computation techniques. However, for efficiency reasons, we develop a suite of techniques under the name of onion secret-sharing, which carefully combine secret-sharing and public-key encryption to achieve this.

Onion secret-sharing is a non-trivial generalization of the traditional mix-nets [18]. In a mix-net, a set of senders D_1, · · ·, D_m want to send their messages M_1, · · ·, M_m to a server S, with the help of an auxiliary server A (who does not collude with S), so that neither S nor A learns the association between the messages and the senders (except for the senders whom they collude with). This is easily achieved as follows: each D_i sends ⟦M_i⟧^{PK_S} to A where PK_S is the public-key of S for a semantically secure public-key encryption scheme, and the notation ⟦M⟧^{PK} denotes encryption of M using a public-key PK. A collects all such ciphertexts, sorts them lexicographically (or randomly permutes them), and forwards them to S; S decrypts them to obtain the multiset {M_1, · · ·, M_m}.

Now consider the following task. Each sender D_i wants to share its message M_i between two servers S and A; that is, it sets M_i = σ_i ⊕ ρ_i, and wants to send ρ_i to S and σ_i to A. While the senders want their messages to get randomly permuted, the association between σ_i and ρ_i needs to be retained. Onion secret-sharing provides a solution to this problem, as follows:

Each D_i sends ⟦(ρ_i, ζ_i)⟧^{PK_S} to A, where ζ_i is of the form ⟦σ_i⟧^{PK_A}. A mixes these ciphertexts and forwards them to S, who decrypts them to recover pairs of the form (ρ_i, ζ_i). Now, S reshuffles (or sorts) these pairs, stores ρ_i and sends ζ_i to A (in the new order); A recovers σ_i from ζ_i (in the same order as ρ_i are maintained by S).
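The layering and the two rounds of shuffling can be sketched as below. Public-key encryption is replaced by a non-cryptographic stand-in (`Sealed`), which only records the intended recipient and offers no secrecy; it is used purely to show who can open which layer. All names (`Sealed`, `enc`, `dec`, `onion_sender`, and so on) are hypothetical.

```python
import random
import secrets
from dataclasses import dataclass

_rng = random.SystemRandom()

@dataclass(frozen=True)
class Sealed:
    """Stand-in for a ciphertext under the recipient's public key. NOT real encryption."""
    recipient: str
    payload: object

def enc(recipient, message):
    return Sealed(recipient, message)

def dec(me, box):
    if box.recipient != me:
        raise ValueError("ciphertext not addressed to this party")
    return box.payload

def xor(a, b):
    return bytes(p ^ q for p, q in zip(a, b))

def onion_sender(M):
    """D_i: split M = sigma xor rho, wrap sigma for A, then wrap (rho, zeta) for S."""
    rho = secrets.token_bytes(len(M))
    sigma = xor(M, rho)
    zeta = enc("A", sigma)
    return enc("S", (rho, zeta))

def onion_A_mix(boxes):
    """A: permute the outer ciphertexts and forward them to S."""
    boxes = list(boxes)
    _rng.shuffle(boxes)
    return boxes

def onion_S_unwrap(boxes):
    """S: open the outer layer, reshuffle, keep the rho shares, forward the zetas to A."""
    pairs = [dec("S", b) for b in boxes]
    _rng.shuffle(pairs)
    return [rho for rho, _ in pairs], [zeta for _, zeta in pairs]

def onion_A_unwrap(zetas):
    """A: open the inner layer to obtain the sigma shares, in S's order."""
    return [dec("A", z) for z in zetas]

messages = [b"alpha", b"bravo", b"delta"]
rhos, zetas = onion_S_unwrap(onion_A_mix(onion_sender(m) for m in messages))
sigmas = onion_A_unwrap(zetas)
reconstructed = [xor(r, s) for r, s in zip(rhos, sigmas)]
```

The shares held by S and A line up position by position, yet the first shuffle hides the senders from S and the second hides them from A. A real instantiation would need a semantically secure public-key scheme in place of `Sealed`.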

This can be taken further to incorporate additional functionality. As an example of relevance to us, suppose A wants to add a short, private tag to the messages being secret-shared so that the tag persists even after random permutation. Among the messages which were assigned the same tag, A should not be able to link the shares it receives after the permutation to the ones it originally received; S should obtain no information about the tags. One solution is for A to add encrypted tags to the data items, and then while permuting the data items, S would rerandomize the ciphertexts holding the tags. We present an alternate approach, which does not require additional functionality from the public-key encryption scheme, but instead augments onion secret-sharing with extra functionality:

– D_i creates a 3-way additive secret sharing of 0 (the all 0’s string), as α_i ⊕ β_i ⊕ γ_i = 0, and sends (α_i, ⟦β_i, ρ_i, ⟦γ_i, σ_i⟧^{PK_A}⟧^{PK_S}) to A.

– A assigns tags τ_i for each of them, and sends (in sorted order) (τ_i ⊕ α_i, ⟦β_i, ρ_i, ⟦γ_i, σ_i⟧^{PK_A}⟧^{PK_S}) to S.

– S sends (τ_i ⊕ α_i ⊕ β_i, ⟦γ_i, σ_i⟧^{PK_A}) to A, in sorted order; it stores ρ_i (in the same sorted order).

– A recovers (τ_i ⊕ α_i ⊕ β_i ⊕ γ_i, σ_i) = (τ_i, σ_i).

This allows S and A to receive all the shares (in the same permuted order); S learns nothing about the tags; A cannot associate which shares originated from which D_i, except for what is revealed by the tag. (Even if S or A collude with some D_i, the unlinkability is retained for the remaining D_i.)
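The tagged variant can be sketched in the same style. As before, `Sealed` is a non-cryptographic placeholder for public-key encryption, random shuffling is used where the text says "sorted order", and the fixed 4-byte tag length and all function names are assumptions of the sketch.

```python
import random
import secrets
from dataclasses import dataclass

_rng = random.SystemRandom()
TAG_LEN = 4  # assumed tag length in bytes

@dataclass(frozen=True)
class Sealed:
    """Stand-in for a public-key ciphertext. NOT real encryption."""
    recipient: str
    payload: object

def enc(recipient, message):
    return Sealed(recipient, message)

def dec(me, box):
    if box.recipient != me:
        raise ValueError("ciphertext not addressed to this party")
    return box.payload

def xor(a, b):
    return bytes(p ^ q for p, q in zip(a, b))

def tagged_sender(M):
    """D_i: share M as sigma xor rho, and share 0 as alpha xor beta xor gamma."""
    rho = secrets.token_bytes(len(M))
    sigma = xor(M, rho)
    alpha = secrets.token_bytes(TAG_LEN)
    beta = secrets.token_bytes(TAG_LEN)
    gamma = xor(alpha, beta)
    return alpha, enc("S", (beta, rho, enc("A", (gamma, sigma))))

def tagged_A_tag(items, tags):
    """A: mask each tag with alpha_i, permute, and forward to S."""
    out = [(xor(tag, alpha), box) for (alpha, box), tag in zip(items, tags)]
    _rng.shuffle(out)
    return out

def tagged_S_step(items):
    """S: open the outer layer, add beta_i to the masked tag, permute, keep rho_i."""
    rows = []
    for masked, box in items:
        beta, rho, inner = dec("S", box)
        rows.append((rho, xor(masked, beta), inner))
    _rng.shuffle(rows)
    return [r[0] for r in rows], [(r[1], r[2]) for r in rows]

def tagged_A_finish(items):
    """A: open the inner layer and remove gamma_i to recover (tau_i, sigma_i)."""
    out = []
    for masked, inner in items:
        gamma, sigma = dec("A", inner)
        out.append((xor(masked, gamma), sigma))
    return out

messages = [b"m-one", b"m-two", b"m-thr"]
tags = [b"tagA", b"tagB", b"tagA"]
rhos, to_A = tagged_S_step(tagged_A_tag([tagged_sender(m) for m in messages], tags))
final = tagged_A_finish(to_A)
recovered = sorted((tag, xor(rho, sigma)) for rho, (tag, sigma) in zip(rhos, final))
```

S only ever sees τ_i ⊕ α_i, which is uniformly distributed, while A recovers τ_i only after S's shuffle, so it cannot match the tagged shares to the items it originally forwarded beyond what equal tags reveal.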

In Section 5.3, we use a variant of the above scheme to let A tag entries in various lists with a pseudonym (multiple lists may get the same pseudonym), before the lists are unpacked and shuffled again, destroying the linkage to the lists, but retaining the tags.⁵

5.2 Protocol Template for SED

We describe a general protocol template to realize the SED functionality, using access to the sSED functionality. The high-level plan is to let A create a merged database so that it can play the role of D for sSED. However, since we require privacy against A, the merged database should appear shorn of all information (except statistics that we are willing to leak). Hence, during the initialization phase, we not only merge the databases, but also replace the keywords with pseudonyms and keep other associated data encrypted. We use pseudonyms for keywords (rather than encryptions) to support queries: During the query phase, the actual keywords will be mapped to these pseudonyms and revealed to A.

These two tasks at the initialization and query phases are formulated as a pair of sub-functionalities, merge-map and map, collectively referred to as the functionality mmap, as described in Figure 5.

The protocol SED-templ^{sSED,mmap} is shown in Figure 4b. In this protocol, sSED is invoked with A playing the role of the data-owner as well. The invocation of merge-map is part of the initialization phase and that of map is part of the query phase. As shown in the figure, S uses DK_S (decoding function) to compute its output Q[W] from Q̂[Ŵ].

Note that A does not store any additional information between the two phases (other than what the implementation of sSED requires).

Leakage: Since this protocol delivers the merge-mapped data Ŵ to A, it leaks certain statistical information about the merged data W (not individual datasets W_i) to A. The exact nature of the leakage depends on the mapping-function M.⁶

5 For simplicity, here we consider the tagging to be arbitrary, whereas in Section 5.3, it is done based on equality checks. Here we allow A to add tags while it knows the link between data-items and D_i; in our application, this link is broken by an extra round of mixing.

6 Such leakage could be avoided by relying on secure two-party computation of a certain function between A and S during initialization, but with high communication costs.