A framework for searching encrypted databases
Pedro G. M. R. Alves* and Diego F. Aranha
This is the extended version of a paper by the same name that appeared in XVI Brazilian Symposium on
Information and Computational Systems Security in November, 2016.
Abstract
Cloud computing is a ubiquitous paradigm responsible for a fundamental change in the way distributed
computing is performed. The possibility to outsource the installation, maintenance and scalability of servers,
added to competitive prices, makes this platform highly attractive to the computing industry. Despite this,
privacy guarantees are still insufficient for data processed in the cloud, since the data owner has no real control
over the processing hardware. This work proposes a framework for database encryption that preserves data
secrecy on an untrusted environment and retains searching and updating capabilities. It employs
order-revealing encryption to perform selection with time complexity in Θ(log n), and homomorphic encryption
to enable computation over ciphertexts. When compared to the current state of the art, our approach provides
higher security and flexibility. A proof-of-concept implementation on top of the MongoDB system is offered
and applied in the implementation of some of the main predicates required by the winning solution to Netflix
Grand Prize.
Keywords: cryptography; functional encryption; homomorphic encryption; order revealing encryption;
searchable encryption; databases
1 Introduction
The massive adoption of cloud computing is respon-
sible for a fundamental change in the way distributed
computing is performed. The possibility to outsource
the installation, maintenance and scalability of servers,
added to competitive prices, makes this service highly
attractive [1, 2]. From mobile to scientific computing,
the industry increasingly embraces cloud services and
takes advantage of their potential to improve avail-
ability and reduce operational costs [3, 4]. However,
the cloud cannot be blindly trusted. Malicious par-
ties may acquire full access to the servers and con-
sequently to data. Among the threats there are ex-
ternal entities exploiting vulnerabilities, intrusive gov-
ernments requesting information, competitors seeking
unfair advantages, and even possibly malicious sys-
tem administrators. The data owner has no real con-
trol over the processing hardware and therefore cannot
guarantee the secrecy of data [5]. The risk of confiden-
tiality breaches caused by inadequate and insecure use
of cloud computing is real and tangible.
The importance of privacy preservation is frequently
underestimated, as well as the damage its failure
represents to society, as the unfolding of a privacy
breach may be completely unpredictable. A report
from Javelin Advisory Services found a distressing cor-
relation between individuals who were victims of data
breaches and later victims of financial fraud. About
75% of total fraud losses in 2016 had this characteris-
tic, corresponding to US$ 12 billion [6]. This could be
avoided with the use of strong encryption at the user
side, never revealing data even to the application or
the cloud.
The problem of using standard encryption in an en-
tire database is that it eliminates the capability of se-
lecting records or evaluating arbitrary functions with-
out the cryptographic keys, reducing the cloud to a
complex and huge storage service. For this reason, al-
ternatives have been proposed to solve this problem,
starting from anonymization and heuristic operational
measures which do not provide formal privacy guaran-
tees. Encryption schemes tailored for databases such
as searchable encryption are a promising solution with
perhaps clearer benefits [7, 8, 9, 10]. Searchable
encryption enables the cloud to manipulate encrypted
data on behalf of a client without learning informa-
tion. Hence, it solves both of the aforementioned problems,
keeping confidentiality with regard to the cloud while re-
taining some of its interesting features.
1.1 The frustration of data anonymization
In 2006, Netflix shared their interest in improving the
recommendation system offered to their users with the
academic community. This synergy took the form of an
open competition, run over three rounds, which offered finan-
cial prizes for the best recommendation algorithms.
An important feature of Netflix's commercial model is
to efficiently and assertively guide subscribers in find-
ing content compatible with their interests. Doing this
correctly may reinforce the importance of the product
for leisure activities, consolidate Netflix's commercial
position, and ensure clients' loyalty [11].
The participants of the contest received a training set
with anonymized movie ratings collected from Netflix
subscribers between 1999 and 2005. There are approxi-
mately half a million customers and about 17 thousand
movies classified in the set, totaling over 100 million
ratings. This dataset is composed of movie titles, the
timestamp when the rating was created, the rating it-
self, and an identication number for relating same-
user records. No other information about customers
was shared, such as name, address or gender. The ob-
jective of the participants was to predict with good ac-
curacy how much someone would enjoy a movie based
on their previously observed behavior in the platform.
In the same year, America Online (AOL) took a sim-
ilar approach and released millions of search queries
made by 658,000 of its users with the goal of contribut-
ing to the scientific community by enabling statistical
work over real data [12]. Like Netflix, AOL applied ef-
forts to anonymize the data before publishing it. All
the obviously sensitive data, such as usernames and IP
addresses, were suppressed and replaced by unique
identification numbers.
The ability to understand users' interests and pre-
dict their behavior based on collected data is desir-
able in several commercial models and consequently
a hot topic in the scientic literature [13, 14, 15].
However, the importance of privacy-preserving prac-
tices is still underestimated, a challenge to overcome.
For instance, despite the anonymization efforts of
Netflix, Narayanan and Shmatikov brilliantly demon-
strated how to break the anonymity of the Netflix dataset
by cross-referencing information with public knowl-
edge bases, such as those provided by the Internet Movie
Database (IMDb) [16]. Using a similar approach, New
York Times reporters were capable of relating a subset
of queries to a particular person by joining apparently
innocent queries to non-anonymous real estate public
databases [17].
1.2 “Unexpected” leaks
These events raised a still unsolved discussion about
how to safely collect and use data without undermin-
ing user privacy. As remarked by Narayanan and Fel-
ten, “data privacy is a hard problem” [18]. Even when
data holders choose the most conservative practice and
never share data, system breaches may happen.
In 2013, a large-scale surveillance program of the
USA government was revealed by Edward Snowden,
a former NSA employee. Named PRISM, it was struc-
tured as a massive data interception effort to collect
information for posterior analysis. Their techniques ar-
guably had support of the US legal system and were
frequently applicable even without knowledge of the
data-owner companies [19, 20].
Two years later, in 2015, stolen personal data of
millions of users of the website Ashley Madison was
leaked by malicious parties exploiting security vulner-
abilities [21]. As a consequence, there were several reports of extor-
tion and even a suicide, illustrating how increasingly
sensitive data breaches are becoming.
In the same year, the Swedish Transport Agency
decided to outsource its IT operations to IBM. To
fulfill the contract, the latter chose sites in Eastern
Europe to place these operations. This resulted in
Swedish condential data being stored in foreign data
centers, in particular Czech Republic, Serbia and Ro-
mania. As expected, this decision led to a massive
data leak, containing information about all vehicles
throughout Sweden, including police and military ve-
hicles. Thus, names, photos and home addresses of mil-
lions of Swedish citizens, military personnel, and people un-
der the witness relocation program were exposed [22].
In 2016, Yahoo confirmed that a massive data
breach, possibly the largest known, affected about 500
million accounts and revealed to the world a dataset
full of names, addresses, and telephone numbers [23].
These occurrences lead us to the disturbing feeling
that, despite all efforts, the risk of data deanonymiza-
tion increases in worrying ways as more of
it is made available [24, 25]. Hence, a seemingly obvi-
ous strategy to avoid such issues is to simply stop any
kind of dataset collection.
1.3 Privacy by renouncing knowledge
History has proven that the task of collecting and
storing data from third parties should be treated as
risky. The chance of compromising user privacy by
accident is too big, possibly with extreme conse-
quences. This way, the concept of security by renounc-
ing knowledge has attracted adherents, such as the search en-
gine DuckDuckGo, which states in a blog post that "the
only truly anonymised data is no data" and, because
of that, claims to forgo the right to store its users'
data [26, 27].
A more financially realistic approach for dealing with
this issue is not to give up knowledge completely,
but to reduce the entities with access to it by keeping data en-
crypted during its entire lifespan: transportation, storage,
and processing, staying secret to the application and
the cloud. Thus, a new security fence is set, tying data
secrecy to formal guarantees.
1.4 Our contributions
This work follows the state of the art and proposes
directives to the modeling of a searchable encrypted
database [28]. We identify the main primitives of a re-
lational algebra necessary to keep the database func-
tional, while adding enhanced privacy-preserving prop-
erties. A set of cryptographic tools is used to con-
struct each of these primitives. It is composed of order-
revealing encryption to enable data selection, homo-
morphic encryption for evaluation of arbitrary func-
tions, and a standard symmetric scheme to protect and
add exibility to the handling of general data. In par-
ticular, our proposed selection primitive achieves time
complexity of Θ(log n) on the dataset size. Moreover,
we provide a security analysis and performance eval-
uation to estimate the impact on execution time and
space consumption, and a conceptual implementation
that validates the framework. It works on top of Mon-
goDB, a popular document-based database, and is im-
plemented as a wrapper over its Python driver. The
source code was made available to the community un-
der a GNU GPLv3 license [29].
When compared to CryptDB [7], our proposal pro-
vides stronger security since it is able to keep confiden-
tiality even in the case of a compromise of the database
and application servers. Since CryptDB delegates to
the application server the capability to derive users’
cryptographic keys, it is not able to provide such se-
curity guarantees. Furthermore, our work is database-
agnostic: it is not limited to SQL and can also be applied
to different key-value databases.
This work is structured as follows. Section 2 de-
scribes the cryptographic building blocks required for
building our proposed solution. Sections 3 and 4 de-
fine searchable encryption, discuss related threats, and
present existing implementations. Section 5 proposes
our framework, while Section 6 discusses its suitability
in a recommendation system for Netflix. Our experi-
mental validation results are presented in Section 7
and Section 8 concludes the paper.
2 Building blocks
The two main classes of cryptosystems are known as
symmetric and asymmetric (or public-key) and defined
by how users exchange cryptographic keys. Symmet-
ric schemes use the same secret key for encryption and
decryption, or equivalently can efficiently compute one
from the other, while asymmetric schemes generate a
pair of keys composed by public and private keys. The
former is distributed openly and is the sole informa-
tion needed to encrypt a message to the key owner,
while the latter should be kept secret and used for de-
cryption.
Besides this, cryptosystems that always produce the
same ciphertext for the same message-key input pair
are known as deterministic. The opposite, when ran-
domness is used during encryption, are known as prob-
abilistic. We next recall basic security notions and spe-
cial properties that make a cryptosystem suitable to a
certain application. Later, we shall make use of these
concepts to analyze the security of our proposal.
2.1 Security notions
Ciphertext indistinguishability is a useful property to
analyze the security of a cryptosystem. Two scenarios
are considered, when an adversary has and does not
have access to an oracle that provides decryption ca-
pabilities. Usually these are evaluated through a game
in which an adversary tries to acquire information from
ciphertexts generated by a challenger [30].
Indistinguishability under chosen plaintext attack –
IND-CPA. In the IND-CPA game the challenger gen-
erates a pair (PK, SK) of cryptographic keys, makes
PK public and keeps SK secret. An adversary has
as objective to recognize a ciphertext created from a
randomly chosen message from a known two-element
message set. A polynomially bounded number of op-
erations is allowed, including encryption (but not de-
cryption), over PK and the ciphertexts. A cryptosys-
tem is indistinguishable under chosen plaintext attack
if no adversary is able to achieve the objective with
non-negligible probability.
Indistinguishability under chosen ciphertext attack and
adaptive chosen ciphertext attack – IND-CCA1 and
IND-CCA2. This type of indistinguishability differs
from IND-CPA due to the adversary having access
to a decryption oracle. In this game the challenge is
again to recognize a ciphertext as described before,
but now the adversary is able to use decryption re-
sults. This new game has two versions, non-adaptive
and adaptive. In the non-adaptive version, IND-CCA1,
the adversary may use the decryption oracle until it re-
ceives the challenge ciphertext. On the other hand, in
the adaptive version he is allowed to use the decryp-
tion oracle even after that event. For obvious reasons,
the adversary cannot send the challenge ciphertext to
the decryption oracle. A cryptosystem is indistinguish-
able under chosen ciphertext attack/adaptive chosen
ciphertext attack if no adversary is able to achieve the
objective with non-negligible probability.
Indistinguishability under chosen keyword attack and
adaptive chosen keyword attack – IND-CKA and
IND-CKA2. This security notion is specific to the
context of keyword-based searchable encryption [31].
It considers a scenario in which a challenger builds an
index with keyword sets from some documents. This
index enables someone to use a value Tw, called trap-
door, to verify if a document contains the word w. This
game imposes that no information should be leaked
from the remotely stored files or index beyond the out-
come and the search pattern of the queries. The adver-
sary has access to an oracle that provides the related
trapdoor for any word. His objective is to use this or-
acle as training to apply the acquired knowledge and
break the secrecy of unknown encrypted keywords. As
well as in the IND-CCA1/IND-CCA2 game, the non-
adaptive version, IND-CKA, of this game forbids the
adversary from using the trapdoor oracle once the challenge
trapdoor is sent by the challenger. On the other hand,
the adaptive version allows the use of the trapdoor
oracle even after this event.
A cryptosystem is indistinguishable under chosen
keyword attack if every adversary has only a negligible
advantage over random guessing.
Indistinguishability under an ordered chosen plaintext
attack – IND-OCPA. Introduced by Boldyreva et al.,
this notion supposes that an adversary is capable of re-
trieving two sequences of ciphertexts resulting from the
encryption of any two sequences of messages [32]. Fur-
thermore, he knows that both sequences have identical
ordering. The objective of this adversary is to distin-
guish between these ciphertexts. A cryptosystem is in-
distinguishable under an ordered chosen plaintext at-
tack if no adversary is able to achieve the objective
with non-negligible probability.
2.2 Functional encryption
Cryptographic schemes deemed “functional” receive
such name because they support one or more oper-
ations over the produced ciphertexts, hence becoming
useful not only for secure storage.
Order-revealing encryption (ORE) Order-revealing
encryption schemes are characterized by having, in
addition to the usual set of cryptographic functions
like keygen and encrypt, a function capable of compar-
ing ciphertexts and returning the order of the original
plaintexts, as shown by Definition 1.
Definition 1 (ORE) Let E be an encryption func-
tion, C be a comparison function, and m_1 and m_2
be plaintexts from the message space. The pair (E, C)
is defined as an encryption scheme with the order-
revealing property if:
C(E(m_1), E(m_2)) =
    lower,   if m_1 < m_2,
    equal,   if m_1 = m_2,
    greater, otherwise.
This is a generalization of order-preserving encryp-
tion (OPE), that fixes C to a simple numerical com-
parison [33].
Security As argued by Lewi and Wu, the “best-
possible” notion of security for ORE is IND-OCPA,
which means that it is possible to achieve indistin-
guishability of ciphertexts with a much stronger
security guarantee than OPE schemes can have [34].
Furthermore, differently from OPE, ORE is not inher-
ently deterministic [35]. For example, Chenette et al.
propose an ORE scheme that applies a pseudo-random
function over an OPE scheme, while Lewi and Wu pro-
pose an ORE scheme completely built upon symmetric
primitives, capable of limiting the use of the compari-
son function and reducing the leakage inherent to this
routine [36, 34]. Their solution works by defining ci-
phertexts composed of pairs (ct_L, ct_R). To compare
ciphertexts ct_A and ct_B, it requires ct_{A,L} and ct_{B,R}.
This way, the data owner is capable of storing only
one side of those pairs in a remote database being cer-
tain that no one will be able to make comparisons be-
tween those elements. Nevertheless, any scheme that
reveals numerical order of plaintexts through cipher-
texts is vulnerable to inference attacks and frequency
analysis, as those described by Naveed et al. over re-
lational databases encrypted using deterministic and
OPE schemes [37]. Although ORE does not completely
discard the possibility of such attacks, it offers stronger
defenses.
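For concreteness, the sketch below shows a toy instance of the ORE interface of Definition 1, in the spirit of the bit-prefix construction of Chenette et al. [36]: each ciphertext unit is a keyed pseudo-random value derived from the preceding bit prefix plus the current bit modulo 3, and comparison inspects the first differing unit. The class and function names are ours and the code is illustrative only; it is not the scheme used by the framework.

```python
# Toy bit-prefix ORE (illustration only; not a vetted or secure implementation).
import hashlib
import hmac

class ToyOre:
    def __init__(self, key: bytes, bits: int = 32):
        self.key, self.bits = key, bits

    def _prf(self, prefix: str) -> int:
        # Keyed pseudo-random value in {0, 1, 2} derived from the bit prefix.
        return hmac.new(self.key, prefix.encode(), hashlib.sha256).digest()[0] % 3

    def enc(self, m: int) -> list:
        # For each bit, most significant first: unit = PRF(prefix) + bit (mod 3).
        out, prefix = [], ""
        for i in reversed(range(self.bits)):
            bit = (m >> i) & 1
            out.append((self._prf(prefix) + bit) % 3)
            prefix += str(bit)
        return out

    @staticmethod
    def cmp(c1: list, c2: list) -> str:
        # The first position where the units differ reveals the plaintext order.
        for u1, u2 in zip(c1, c2):
            if u1 != u2:
                return "lower" if (u2 - u1) % 3 == 1 else "greater"
        return "equal"

ore = ToyOre(b"secret key")
assert ToyOre.cmp(ore.enc(7), ore.enc(42)) == "lower"
assert ToyOre.cmp(ore.enc(42), ore.enc(42)) == "equal"
```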
Homomorphic encryption (HE) Homomorphic en-
cryption schemes have the property of conserving some
plaintext structure during the encryption process, al-
lowing the evaluation of certain functions over cipher-
texts and obtaining, after decryption, a result equiva-
lent to the same computation applied over plaintexts.
Denition 2 presents this property in a more formal
way.
Denition 2 (HE) Let Eand Dbe a pair of encryp-
tion and decryption functions, and m1and m2be plain-
texts. The pair (E, D)forms an encryption scheme
with the homomorphic property for some operator if
and only if the following holds:
E(m1)E(m2)E(m1m2).
The operation in the ciphertext domain is equivalent
to in the plaintext domain.
Homomorphic cryptosystems are classified according
to the supported operations and their limitations. Par-
tially homomorphic encryption schemes (PHE) satisfy
Definition 2 for either addition or multiplication oper-
ations, while fully homomorphic encryption schemes
(FHE) support both addition and multiplication oper-
ations.
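As a concrete instance of Definition 2 for the additive case, the following sketch implements textbook Paillier with tiny hard-coded primes. The parameters are assumptions made purely for illustration; real deployments use large random primes and a vetted library.

```python
# Textbook Paillier with toy parameters (illustration of Definition 2 only).
import math
import random

p, q = 10007, 10009                                  # toy primes
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)    # lcm(p-1, q-1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)          # inverse of L(g^lam mod n^2)

def encrypt(m: int) -> int:
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    return ((pow(c, lam, n2) - 1) // n * mu) % n

# Homomorphic property: multiplying ciphertexts adds the plaintexts (mod n).
assert decrypt((encrypt(41) * encrypt(1)) % n2) == 42
```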
PHE cryptosystems have been known for decades [38,
39]. However, the most common data processing appli-
cations, as those arising from statistics, machine learn-
ing or genomics processing, frequently require support
for both addition and multiplication operations simul-
taneously. This way, such schemes are not suitable for
general computation.
Nowadays, FHE performance is prohibitive, so weaker
variants, such as SHE [1] and LHE [2], take the stage
for solving computational problems of moderate com-
plexity [40, 41].
Security In terms of security, homomorphic encryp-
tion schemes achieve at most IND-CCA1, which means
that the scheme is not secure against an attacker with
arbitrary access to a decryption oracle [30]. This is a
natural consequence of the design requirements, since
these cryptosystems allow any entity to manipulate
ciphertexts. Most current proposals, however, reach
at most IND-CPA and stay secure against attackers
without access to a decryption oracle [42].
[1] SHE stands for "Somewhat homomorphic encryp-
tion".
[2] LHE stands for "Leveled fully homomorphic encryp-
tion".
3 Searchable encryption
We now formally dene the problem of searching over
encrypted data. We present three state-of-the-art im-
plementations of solutions to this problem, namely the
CryptDB, Arx, and Seabed database systems.
3.1 The problem
Suppose a scenario where Alice keeps a set of docu-
ments in untrusted storage maintained by an also un-
trusted entity Bob. She would like to keep this data
encrypted because, as dened, Bob cannot be trusted.
Alice also would like to occasionally retrieve a subset
of documents according to a predicate without re-
vealing any sensitive information to Bob. Thus, shar-
ing the decryption key is not an option. The problem
lies in the fact that communication between Alice and
Bob may (and probably will) be constrained. Hence, a
naive solution consisting of Bob sending all documents
to Alice and letting her decrypt and select whatever
she wants may not be feasible. Alice must then imple-
ment some mechanism to protect her encrypted data
so that Bob will be able to identify the desired docu-
ments without knowing their contents or the selection
criteria [43].
An approach that Alice can take is to create an en-
crypted index as in Definition 3.
Denition 3 (Encrypted indexing) Suppose a
dataset DB = (m1, . . . , mn)and a list W=
(W1, . . . , Wn)of sets of keywords such that Wi
contains keywords for mi. The following routines
are required to build and search on an encrypted
index:
BuildIndexK(DB,W):The list Wis encrypted
using a searchable scheme under a key Kand
results in a searchable encrypted index I. This
process may not be reversible (e.g., if a hash
function is used). The routine outputs I.
TrapdoorK(F):This function receives a predi-
cate Fand outputs a trapdoor T. The latter
is dened as the information needed to search
Iand nd records that satisfy F.
SearchI(T): It iterates through Iapplying the
trapdoor Tand outputs every record that re-
turns True for the input trapdoor.
This way, if the searchable cryptosystem used is
IND-CKA then Alice is able to keep her data with Bob
and remain capable of selecting subsets of it without
revealing information [28].
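A minimal sketch of the three routines of Definition 3 is given below, using HMAC tags as (deterministic) keyword trapdoors. The function names and key handling are ours; an actual deployment would use an IND-CKA construction such as Song's scheme [43].

```python
# Minimal BuildIndex/Trapdoor/Search sketch (deterministic trapdoors; illustration only).
import hashlib
import hmac

def _tag(key: bytes, word: str) -> bytes:
    return hmac.new(key, word.encode(), hashlib.sha256).digest()

def build_index(key: bytes, keywords: dict) -> dict:
    # keywords maps a document identifier to its keyword set W_i.
    return {doc: {_tag(key, w) for w in ws} for doc, ws in keywords.items()}

def trapdoor(key: bytes, word: str) -> bytes:
    return _tag(key, word)

def search(index: dict, t: bytes) -> list:
    # The server only ever sees tags and trapdoors, never the keywords themselves.
    return [doc for doc, tags in index.items() if t in tags]

k = b"client-only secret"
index = build_index(k, {"m1": {"attack", "dawn"}, "m2": {"invoice"}})
assert search(index, trapdoor(k, "attack")) == ["m1"]
```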
3.2 Threat modeling
The development of efficient and secure solutions for
management of datasets depends on the awareness
of the threats we intend to mitigate. For that, this
work follows Grubbs' definitions of adversaries for a
database [44].
Active attacker. The worst case scenario is when the
attacker acquires full control over the server, being ca-
pable of performing arbitrary operations. Thus, he is
not committed to following any protocol.
Snapshot attacker. The adversary obtains a snapshot
of the dataset containing the primary data and indexes
but no information about issued queries and how they
access the encrypted data.
Persistent passive attacker. Another possibility is a
scenario in which the attacker cannot interfere with the
server functionality but can observe all of its opera-
tions. We do not consider only attackers that inspect
issued queries in real-time, but also those that are able
to recover them later. As demonstrated by Grubbs,
the data contained in a real-world database goes far
beyond the primary dataset (names, addresses, …).
It also includes logs, caches, and auxiliary tables (such as
MySQL’s diagnostic tables) used, for instance, to guar-
antee ACID [3] and enable the server to undo incom-
plete queries after a power-break. It is very likely that
an attacker competent enough to subvert the security proto-
cols of the system will also be capable of recovering these
secondary datasets.
The idea of a snapshot attacker is very popular
among solutions and researchers intending to develop
encrypted databases. Nevertheless, it underestimates
the attacker and the many side-attacks a motivated
adversary can execute. As Rogaway remarks, we can-
not make the mistake of reducing the adversary to the
lazy and abstract Bob, but must remember that it
can go far beyond that and take the form of a military-
industrial-surveillance program with a billion-dollar bud-
get and the capability to surpass the obvious [45].
4 Related work
The management of a dataset is done by a database
management system (DBMS). It is composed of sev-
eral layers responsible for coordinating read and write
operations, guaranteeing data consistency and integrity,
and managing user access. The engineering of such a system is
a complex task and requires smart optimizations to
be able to store data, process queries and return the
outcome with minimum latency and good scalability.
This way, searchable encryption solutions are usually
implemented not inside but on top of these systems
as a middleware to translate encrypted queries to the
DBMS without revealing plaintext data and decrypt
the outcome, as shown in Figure 1. This strategy en-
ables the use of decades of optimizations incorporated
in today's DBMSs and portable to encrypted data.
It is important to state that, ideally, security features
should be designed in conjunction with the underlying
database. Long-term solutions are expected to assimi-
late those strategies internally in the DBMS core.
[3] Relative to a set of desirable properties for a
database. Acronym for "Atomicity, Consistency, Isola-
tion, Durability".
Figure 1 Sequence diagram representing the process of
generating and processing an encrypted query. The proxy is
positioned between the user and the DBMS in a trusted
environment. Its responsibility is to receive a plaintext query,
apply an encryption protocol, submit the encrypted query to
the DBMS, and decrypt the outcome.
4.1 CryptDB
CryptDB is a software layer that provides capabilities
to store data in a remote SQL database and query
over it without revealing sensitive information to the
DBMS. It introduces a proxy layer responsible for en-
crypting and adjusting queries to the database and for decrypting
the outcome [7].
The context in which CryptDB stands is a typical
structure of database-backed applications, consisting
of a DBMS server and a separate application server.
To query a database, a predicate is generated by the
application and processed by the proxy before it is
sent to the DBMS server. The user interacts exclu-
sively with the application server and is responsible
for keeping his password secret. This is provided on
login to the proxy (via application) that derives all
the cryptographic keys required to interact with the
database. When the user logs out, it is expected that
the proxy deletes its keys.
Data encryption is done through the concept of
“onions”, which consist of layers of encryption that
are combined to provide different functionalities, as
shown in Figure 2. Such layers are revealed as nec-
essary to process the queries being performed. Mod-
eling a database involves evaluating the meaning of
each attribute and predicting the operations it must
support. In particular, keyword-searching as described
in Denition 3 is implemented as proposed in Song’s
work [43]. The performance overhead over MySQL
measured by the authors is up to 30%.
Figure 2 Representation of the data format used by CryptDB.
The current value to be protected lies in the center, and a new
encryption layer is overlapped to it according to the need of a
particular functionality.
Two types of threats are treated in CryptDB: curi-
ous database administrators who try to snoop and ac-
quire information about clients' data but respect the
established protocols (a persistent passive attacker);
and an adversary that gains complete control of ap-
plication and DBMS servers (an active attacker). The
authors state that the first threat is mitigated through
the encryption of stored data and the ability to query
it without any decryption or knowledge about its con-
tent; while the second applies only to logged-in clients.
In the considered scenario, the cryptographic keys rela-
tive to data in the database are handled by the appli-
cation server. Thus, if the application server is com-
promised, all the keys it possesses at that moment
(that are expected to be only from logged-in users) are
leaked to the attacker. Such arguments were revisited
after works by Naveed and Grubbs et al. demonstrated
how to exploit several weaknesses of the construction,
such as the application of OPE [46, 47].
4.2 Arx
Arx is a database system implemented on top of Mon-
goDB [8]. It targets much stronger security proper-
ties and claims to protect the database with the same
level of security as regular AES-based encryption [4], achieving
IND-CPA security. This is a direct consequence of the
almost exclusive use of AES to construct selection
operators, even on range queries, which not only brings
strong security but also good performance due to the effi-
ciency of symmetric primitives, sometimes even ben-
efiting from hardware implementations. The authors
report a performance overhead of approximately 10%
when used to replace the database of ShareLatex. The
building blocks used for searching follow those de-
scribed in Definition 3. Furthermore, they apply a dif-
ferent AES key for each keyword when generating the
trapdoor, requiring the client to store counters, as ex-
plained in the next paragraph.
At its core, Arx introduces two database indexes,
Arx-Range for range and order-by-limit queries and
Arx-EQ for equality queries, both built on top of
AES and using chained garbled circuits. The former
uses an obfuscation strategy to protect data, while en-
abling searches in logarithmic time. The latter embeds
a counter into each repeating value. This ensures that
the encryptions of repeated values differ, protecting them
against frequency analysis. Using a token provided by
the client, the database is able to expand it into many
search tokens and return all the occurrences desired,
allowing an index to be built over encrypted data.
[4] The Advanced Encryption Standard (AES) is a
well-established symmetric block cipher enabling high-
performance implementations in hardware and soft-
ware [48].
The context in which Arx stands is similar to
CryptDB. However, the authors consider the data
owner to be the application itself. This way, it simplifies
the security measures and considers the responsibil-
ity of keeping the application server secure to be outside of its
scope.
4.3 Seabed
Seabed was developed by Papadimitriou et al. and
aims at Business Intelligence (BI) applications inter-
ested in keeping data secure on the cloud [49]. Like
CryptDB and Arx, Seabed consists
of a client-side query translator (to SQL), a
query planner, and a proxy that connects to an Apache
Spark instance [50]. Its main foundations are two new
cryptographic constructions, additively symmetric ho-
momorphic encryption (ASHE) and Splayed ASHE
(SPLASHE). The former is used to replace Paillier as
the additively homomorphic encryption scheme, stat-
ing that their construction is up to three orders of mag-
nitude faster. The latter is used to protect the database
against inference attacks [37].
SPLASHE works by splitting sensitive data into mul-
tiple attributes, obscuring the low-entropy of deter-
ministic encryption. Formally, let C be a sensitive at-
tribute of a dataset that can be filled with d possible
discrete values. The approach taken by SPLASHE is
to replace this attribute in the encrypted database by
{C_1, C_2, ..., C_d} such that, when C = v, C_v = 1 and
C_t = 0 for t ≠ v. When encrypted with ASHE, the cipher-
texts will look random to the adversary.
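The splaying step can be sketched as follows; he_encrypt stands in for any additively homomorphic encryption (ASHE in Seabed, or Paillier) and is a placeholder of ours rather than Seabed's actual API.

```python
# SPLASHE-style splaying: a categorical value over a domain of d possibilities
# becomes d indicator attributes, each encrypted with an additive HE scheme.
def splashe_row(value, domain, he_encrypt):
    return {f"C_{d}": he_encrypt(1 if d == value else 0) for d in domain}

# Example with a 5-star rating: summing the C_5 column homomorphically across
# all rows yields the encrypted count of 5-star ratings, with no deterministic
# encryption involved. The identity function stands in for he_encrypt here.
row = splashe_row(5, range(1, 6), he_encrypt=lambda m: m)
assert row == {"C_1": 0, "C_2": 0, "C_3": 0, "C_4": 0, "C_5": 1}
```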
Seabed’s authors argue that SPLASHE is strong
enough to mitigate frequency analysis, enabling the
use of deterministic encryption whenever it is re-
quired in the database model. However, Grubbs states
that SPLASHE’s protection may be deected through
the auxiliary data stored by the database [44]. Their
work demonstrates how state-of-the-art databases
store metadata that can be used to reconstruct is-
sued queries and, this way, recognize access patterns
on the attributes. Such patterns leak the information
that SPLASHE intended to hide. Considering this, the
only threat really mitigated by SPLASHE against the
deterministic encryption of Seabed is from a snapshot
attacker.
5 Proposed framework
The goal of the proposed framework is to develop a
database model capable of storing encrypted records
and applying relational algebra primitives on it with-
out the knowledge of any cryptographic keys or the
need for decryption. A trade-o between performance
and security is desirable, however we completely dis-
card deterministic encryption whenever possible for se-
curity reasons. The only exception are contexts with
unique records, which avoid by denition weaknesses
intrinsic to deterministic encryption. The applicabil-
ity of this framework goes beyond SQL databases. Be-
sides the relational algebra hereby used to describe the
framework, it can be extended to key-value, document-
oriented, full text and several other databases classes
that keep the same attribute structure.
The three main operations needed to build a use-
ful database are insertion, selection and update. Once
data is loaded, being able to select only those pieces
that correspond to an arbitrary predicate is the basic
block to construct more complex operations, such as
grouping and equality joins. This functionality is fun-
damental when there is a physical separation between
the database and the data owner, otherwise high de-
mand for bandwidth is incurred to transmit large frac-
tions of the database records. Furthermore, real data
is frequently mutable and thus the database must sup-
port updates to remain useful.
We dene as secure a system model that guarantees
that the data owner is the only entity capable of re-
vealing data, which can be achieved by his exclusive
possession of the cryptographic keys. Thus, a funda-
mental aspect of our proposal is the scenario in which
the database and the application server handle data
with minimum knowledge.
Lastly, the framework does not ensure integrity,
freshness or completeness of results returned to the
application or the user, since an adversary that com-
promises the database in some way can delete or alter
records. We consider this threat to be outside the scope
of this framework.
5.1 Classes of attributes
Records in an encrypted database are composed of
attributes. These consist of a name and a value, which
can be an integer, float, string or even a binary blob.
Values of attributes are classified according to their
purpose:
static An immutable value only used for stor-
age. It is not expected to be evaluated
with any function, so there is no special
requirement for its encryption.
index Used for building a single or multivalued
searchable index. It should enable one to
verify if an arbitrary term is contained in
a set without the need to acquire knowl-
edge of its content.
computable A mutable value. It supports the eval-
uation with arithmetic circuits and en-
sures obtaining, after decryption, a re-
sult equivalent to the same circuit applied
over plaintexts.
The implementation of each attribute must satisfy
the requirements without leaking any vital informa-
tion beyond that related directly to the attribute's
objective (e.g., order for index attributes). Since the
name of an attribute reveals information, it may need
to be protected as well. However, the acknowledge-
ment of an attribute is done using its name, so even
anonymous attributes must be traceable in a query.
An option for anonymizing the attribute name is to
treat it as an index.
The aforementioned cryptosystems are natural sug-
gestions to be applied within these classes. Since static
is a class for storage only, which has no other require-
ments, any scheme with appropriate security level and
performance may be used, such as AES. On the other hand,
index and computable attributes are immediate appli-
cations of ORE and HE schemes. Particularly, the lat-
ter denes the HE scheme according to the required
operations. Attributes that require only one operation
can be implemented with a PHE scheme, which pro-
vides good performance; while those that require arbi-
trary additions and multiplications must use FHE and
deal with the performance issues.
Denition 4 (Secure ORE) Let Eand Cbe, re-
spectively, an encryption and a comparison func-
tion. The pair (E, C)forms an encryption scheme
with the order-revealing property dened as “se-
cure” if and only if it satises Denition 1; the
encryption of a message mcan be written as
E(m) = (cL, cR) = (EL(m), ER(m)), where EL
and ERare complementary encryption functions;
and the comparison between two ciphertexts c1and
c2is done by C(cL1, cR2). This way, Cmay be
applied without the complete knowledge of the ci-
phertexts.
In order to build a secure and efficient index, an
ORE scheme that corresponds to Definition 4 should
be used. We define the search framework as in Defini-
tion 5.
Denition 5 (Encrypted search framework) Let
Sbe a set of words, sk a secret key, and an ORE
scheme ( Enc,Cmp) that satises Denition 4.
The operations required for an encrypted search
over Sare dened as follows:
BuildIndexsk(S):Outputs the set
S={cR|(cL, cR) = Encsk(w),wS}.
Trapdoorsk(w):Outputs the trapdoor
Tw= (cL|(cL, cR) = Encsk(w)) .
SearchS, r(Tw):To select all records in Swith
the relation r∈ {lower,equal,greater}
to word w, one computes the trapdoor Tw
and iterates through Slooking for the records
wSthat satisfy
Cmp (Tw, w) = r.
The set ˆ
Swith all the elements in Sthat
satisfy this equation is returned.
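The sketch below illustrates only the data flow of Definition 5: BuildIndex keeps right components on the server, and Trapdoor releases a left component to drive Search. The placeholder scheme (a keyed order-preserving shift) is ours, provides no security, and is not the Lewi-Wu construction; it merely shows which component goes where.

```python
# Interface-level sketch of Definition 5 (placeholder ORE; not secure).
import hashlib
import hmac

class ToyOre:
    def __init__(self, key: bytes):
        digest = hmac.new(key, b"shift", hashlib.sha256).digest()
        self.shift = int.from_bytes(digest[:4], "big")

    def enc(self, m: int):
        c = m + self.shift          # toy "ciphertext"; left and right halves coincide
        return c, c                 # (c_L, c_R)

    @staticmethod
    def cmp(c_left, c_right) -> str:
        return "lower" if c_left < c_right else "equal" if c_left == c_right else "greater"

def build_index(ore, words):
    return [ore.enc(w)[1] for w in words]   # the server stores only c_R

def trapdoor(ore, w):
    return ore.enc(w)[0]                    # the client releases only c_L

def search(index, t, relation):
    return [i for i, c_r in enumerate(index) if ToyOre.cmp(t, c_r) == relation]

ore = ToyOre(b"sk")
index = build_index(ore, [10, 42, 7])
assert search(index, trapdoor(ore, 10), "greater") == [2]   # records with value below 10
assert search(index, trapdoor(ore, 42), "equal") == [1]
```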
5.2 Database operations
Let us consider a model composed of an encrypted
dataset stored in a remote server and a user that pos-
sesses the secret cryptographic keys. The latter would
like to perform queries on data without revealing sensi-
tive information to the server, as defined in Section 3.1.
In 1970, Codd proposed the use of a relational alge-
bra as a model for SQL [51]. This consists of a small set
of operators that can be combined to execute complex
queries over the data.
Through the functions defined in Definition 5, a re-
lational algebra for encrypted database operations can
be built. The basic operators for such an algebra are de-
fined as follows.
1 Projection (π_A): Returns a subset A of at-
tributes from selected records. This subset may
be defined by attribute names that may or may
not be encrypted.
(a) encrypted: If encrypted, a deterministic scheme
is used or they are treated as index values.
deterministic scheme: The user computes
Ā = {Enc_Det(a) | a ∈ A}. Ā is sent to the
server, which picks the projected attributes
using a standard algorithm.
index attributes: The user computes Ā =
{Trapdoor_sk(a) | a ∈ A}. Ā is sent to the
server, which picks the projected attributes
using Search.
(b) unencrypted: Unencrypted selectors are sent
to and selected by the server using a stan-
dard algorithm.
2 Selection (σ_ϕ): Given a predicate ϕ, returns only
the records satisfying it.
Handles exclusively index attributes, hence ϕ must be
equivalent to a combination of comparative
operators supported by Search.
Let w ⊙ x ∈ ϕ, where ⊙ is a compatible com-
parative operator, w an index attribute, and
x the operand to be compared (e.g., σ_{age>30}
selects records in which the value of the attribute
named "age" is greater than 30). The trapdoor
T_ϕ = Trapdoor_sk(ϕ) is sent to the server,
which executes Search.
3 Cartesian product (×): The Cartesian product
of two datasets is executed using a standard algo-
rithm.
4 Difference (−): The difference between two
datasets A and B encrypted with the same keys
is defined as A − B = {x | x ∈ A and x ∉ B}.
5 Union (∪): The union of two datasets A and B
encrypted with the same keys is defined as A ∪ B =
{x | x ∈ A or x ∈ B}.
Union and difference are defined over datasets with
the same set of attributes. The opposite is expected for
Cartesian product, so that no attribute may be shared
between operands.
Ramakrishnan calls these “basic operators” in the
sense that they are essential and sufficient to execute
relational operations [52]. Additional useful operators
can be built over those. For instance: rename, join-
like, and division. The same observation applies in
the encrypted domain, and complex operators can be
constructed given basic operators dened over the en-
crypted domain.
6 Rename (ρ_{a,b}): Renames attributes. Their names
may or may not be encrypted.
(a) encrypted: Encryption shall be executed us-
ing a deterministic cryptosystem or names
treated as index values.
deterministic scheme: Let a be an at-
tribute name to be replaced by b. The
user computes ā ← Enc_Det(a) and b̄ ←
Enc_Det(b), and sends the output to the
server, which applies a standard replacement
algorithm.
index attributes: Let a be an attribute
name to be replaced by b. The user com-
putes ā ← Trapdoor(a) and b̄ ← c_R |
(c_L, c_R) = Enc_index(b), and sends the output
to the server, which selects attributes re-
lated to ā as equal through the operation
Search and renames the result to b̄.
(b) unencrypted: Unencrypted attribute names
may be renamed by the server using a stan-
dard algorithm.
7 Natural join (⋈): Let A and B be datasets
with a common subset of attributes. The natu-
ral join between A and B is defined as the se-
lection of all elements that lie in A and B and
match all the values in those attributes. More for-
mally, let c_1, c_2, . . . , c_n be attributes common to
A and B; x_1, x_2, . . . , x_n attributes not contained
in A or in B; a_1, a_2, . . . , a_m be attributes unique
to A; b_1, b_2, . . . , b_k be attributes unique to B; and
K = {1, . . . , n}. We have that
A ⋈ B ≡ σ_{c_i = x_i}(ρ_{(c_i, x_i)}(A) × B), ∀ i ∈ K.
8 Equi-join (⋈_θ): Let A and B be datasets. The
equi-join between A and B is defined as the selec-
tion of all elements that lie in A and B and satisfy
a condition θ. More formally, A ⋈_θ B = σ_θ(A × B).
9 Division (/): Let A and B be datasets and C be
the subset of attributes unique to A. The division
operator joins the operands by common attributes
but projects only those unique to the dividend.
Hence, A / B = π_C(A ⋈ B).
Finally, it is important to define data insertion and
update, even though these cannot be properly defined as re-
lational operators.
Insert: Encrypted data is provided and inserted
into the database using a standard algorithm.
Update: An update operation is dened as a se-
lection followed by the evaluation of a computable
attribute by a supported homomorphic operation.
This set of operators enables operating over an en-
crypted database without the knowledge of crypto-
graphic keys or acquiring sensitive information from
user queries.
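As an illustration of how these operators compose, the sketch below expresses the Update primitive: a selection over index attributes followed by a homomorphic operation on a computable attribute, with the server never decrypting anything. The callbacks (ore_search, he_add) and the record layout are placeholders of ours, not the actual API of the MongoDB wrapper.

```python
# Update = selection + homomorphic evaluation (placeholder primitives).
def update(records, ore_search, trapdoor, relation, field, delta_ct, he_add):
    for rec in ore_search(records, trapdoor, relation):
        rec[field] = he_add(rec[field], delta_ct)   # e.g., Paillier: ciphertext product

# Toy usage with plaintext integers standing in for ciphertexts:
records = [{"age": 25, "balance": 10}, {"age": 40, "balance": 10}]
select_older = lambda rs, t, r: [x for x in rs if x["age"] > t]   # stand-in for Search
update(records, select_older, 30, "greater", "balance", 5, he_add=lambda a, b: a + b)
assert records == [{"age": 25, "balance": 10}, {"age": 40, "balance": 15}]
```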
5.3 Security analysis
We assume the scenario in which the data owner has
exclusive possession of cryptographic keys. This way,
insertions to the database must be locally encrypted
before being sent to the server. The database or the
application never deal with plaintext data. Our frame-
work thus has the advantage over CryptDB of pre-
serving privacy even in the outcome of a compromised
database or application server.
Despite being conceptually similar to OPE, ORE is
able to address several of its security limitations. ORE
does not necessarily generate ciphertexts that reveal
their order by design, but allows someone to protect
this information by only revealing it through specific
functions. ORE is able to achieve the IND-OCPA se-
curity notion and adds randomization to ciphertexts.
Those characteristics make it much safer against infer-
ence attacks [37]. The proposal of Lewi and Wu goes
even beyond that and is capable of limiting the use
of the comparison function [34]. Their scheme gener-
ates a ciphertext that can be decomposed into left and
right components such that a comparison between two
ciphertexts requires only a left component of one ci-
phertext and the right component of the other. This
way, the authors argue that robustness against such
attacks is ensured since the database dump may only
contain the right component, that is encrypted using
semantically-secure encryption. Their scheme satises
our notion of a Secure ORE and, therefore, provides
strong defenses against Snapshot attackers.
An eavesdropper (Active or Persistent passive at-
tacker) is not capable of executing comparisons by
himself in a Secure ORE. However, he may learn the
result of these and recognize repeated queries by ob-
serving the outcome of a selection. This weakness may
still be used for inference attacks, which can breach confi-
dentiality of related attributes. This issue can get
worse if the trapdoor is deterministic, in which case there is
no other solution than implementing a key refresh-
ment algorithm. Besides that, the knowledge of the
numerical order between every pair of elements in a
sequence may leak information depending on the ap-
plication. This problem manifests itself in our proposal
on the σ primitive if it uses a weak index structure,
like a naive sequential index. A balanced-tree-based
structure, on the other hand, obscures the numerical
order of elements in different branches. This way, an
attacker is capable of recovering the order of up to
O(log n) database elements and inferring about the oth-
ers, in a database with n elements.
Schemes used with computable attributes are lim-
ited to IND-CCA1, and typically reach only IND-CPA.
Moreover, homomorphic ciphertexts are malleable by
design. Thus, an attacker that acquires knowledge
about a ciphertext can use it to predictably manip-
ulate others.
Finally, BuildIndex is not able to hide the quantity
of records that share the same index. This way, one is
able to make inferences about those by the number
of records. There is also no built-in protection for the
number of entries in the database. A workaround is
to x the size of each static attribute value and round
the quantity of records in the database using padding.
This approach increases secrecy but also the storage
overhead.
5.4 Performance analysis
The application of ORE as the main approach to build
a database index provides an extremely important con-
tribution to selection queries. Search does not require
walking through all the records testing a trapdoor,
but only a logarithmic subset of it when implemented
over an optimal index structure, such as an AVL tree
or B-tree based structure [53]. This characteristic is
highlighted on union, intersection and dierence op-
erations, which work by comparing and selecting ele-
ments in dierent groups. Moreover, current proposals
in the state of the art of ORE enjoy good performance
provided by symmetric primitives and does not require
more expensive approaches such as public-key cryp-
tography [36, 34, 33]. In particular, although fully ho-
momorphic cryptosystems promise to fulfill this task
and progress has been made with new cryptographic
constructions [54], it is still prohibitively expensive for
real-world deployments [55].
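The following sketch shows why Search can stay logarithmic: if the right-component ciphertexts are kept in an order-consistent structure (a sorted list stands in here for the AVL/B-tree mentioned above), a single trapdoor locates a match with a logarithmic number of calls to the comparison function of Definition 4. The plaintext comparator in the usage example is a stand-in for illustration only.

```python
# Binary search over an ORE-ordered index using only the comparison function.
def search_equal(sorted_right_cts, trapdoor, cmp):
    lo, hi = 0, len(sorted_right_cts)
    while lo < hi:
        mid = (lo + hi) // 2
        if cmp(trapdoor, sorted_right_cts[mid]) == "greater":
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sorted_right_cts) and cmp(trapdoor, sorted_right_cts[lo]) == "equal":
        return lo
    return None

# Toy usage with plaintext integers standing in for ORE ciphertexts:
toy_cmp = lambda a, b: "lower" if a < b else "equal" if a == b else "greater"
assert search_equal([3, 7, 10, 42], 10, toy_cmp) == 2
assert search_equal([3, 7, 10, 42], 11, toy_cmp) is None
```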
Space consumption is also affected. Ciphertexts are
computed as a combination of the plaintext with ran-
dom data. This way, a non-trivial expansion rate is
expected. Differently from speed overheads, which are
affected by a single attribute type, all attributes suffer
from the expansion rate of encryption.
5.5 Capabilities and limitations
Our framework is capable of providing an always-
encrypted database that preserves secrecy as long as
the data owner keeps the cryptographic keys secure.
One is able to select records through index attributes and ap-
ply arbitrary operations on attributes defined as com-
putable. Furthermore, it increases the security of data
while maintaining the computational complexity of stan-
dard relational primitives, achieving a fair trade-off be-
tween security and performance.
Although the framework has no constraints about at-
tributes classified as both index and computable, there
is no known encryption scheme in the literature ca-
pable of satisfying all the requirements. This way, the
relational model of the database must be as precise as
possible when assigning attributes to each class, espe-
cially because the costs of a model refactor can be
prohibitive.
Some scenarios appear to be more compatible with
an encrypted database as described than others. An e-
mail service, for example, can be trivially adapted. The
e-mails received by a user are stored in encrypted form
as static, and some heuristic is applied to their content to
generate a set of keywords to be used on BuildIndex.
This heuristic may use all unique words in the e-mail,
for example. The sender address may be an impor-
tant value for querying as well, so it may be stored
as an index. To optimize common queries, a secondary
collection of records may be instantiated with, for ex-
ample, counters: the quantity of e-mails received from
a particular sender, how often a term appears, or how
many messages are received in a time frame. Storing
this metadata information in a secondary data collec-
tion avoids some of the high costs of searching in the
main dataset.
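A sketch of this e-mail scenario follows: the body is a static attribute, the keyword set and the sender are index attributes, and a per-sender counter in a secondary collection is a computable attribute. The encryption callbacks are placeholders for the AES, ORE and HE primitives of Section 5.1; the identity functions in the usage example exist only so that the sketch runs end to end.

```python
# Storing an e-mail under the three attribute classes (placeholder encryption).
def store_email(db, email, enc_static, enc_index, enc_compute, he_add):
    keywords = set(email["body"].lower().split())            # naive keyword heuristic
    db["emails"].append({
        "body": enc_static(email["body"]),                    # static
        "keywords": [enc_index(w) for w in keywords],         # index (multivalued)
        "sender": enc_index(email["sender"]),                 # index
    })
    counters = db["per_sender"]                               # secondary collection
    key = enc_index(email["sender"])
    counters[key] = he_add(counters.get(key, enc_compute(0)), enc_compute(1))

db = {"emails": [], "per_sender": {}}
store_email(db, {"sender": "bob@mail.com", "body": "Attack at dawn"},
            enc_static=str, enc_index=lambda x: x, enc_compute=int,
            he_add=lambda a, b: a + b)
assert db["per_sender"]["bob@mail.com"] == 1
```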
However, our proposal fails when the user wants to
search for something that was not previously expected.
For example, regular expressions. Suppose a query that
searches for all the sentences that start with “Attack”
and end with “dawn”, or all the e-mails on the domain
“mail.com”. If these patterns were not foreseen when
the keyword index was built, then no one will be able to
correctly execute this selection without the decryption
of the entire database. Since the format of the strings
is lost on encryption, this kind of search is impossible
in our proposal.
Lastly, relational integrity is a desired property for a
relational database. It connects two or more sets using
same-value attributes in both sets (e.g.: every value
of a column in a table A exists in a column in table
B), and establishes a primary-foreign key relationship.
This way, the existence of a record in an attribute clas-
sied as foreign key depends on the existence of the
related record on the other set, in which the primary
key is equal to that foreign key. To implement such
feature one must provide to the DBMS capabilities to
reinforce relational integrity rules. In other words, the
server must be able to recognize such a relationship to
guarantee it will be respected by issued queries.
Figure 3 Simple diagram describing the interaction between
users and products composing the information regarding an
order. Notice that the existence of users and products is
independent, but there is a dependence for orders.
An example of the applicability of this concept is
an e-commerce database. Best practices dictate that
user data should be stored separately from products
and orders. Thus, one may model it as in Figure 3.
When a new order arrives, it is clear that a user chose
some product and informed the store about his intent
to buy it. Users and products are concrete elements.
However, a sale is an abstract object and cannot hap-
pen without a buyer and a product. This way, to main-
tain the consistency of the database the DBMS must
assure that no sale record will exist without relating
user and product. This can be achieved by constructing
the sales table such that records contain foreign keys
for the user and product tables (implying that these
contain attributes classied as primary keys). By def-
inition this feature imposes an inherent requirement
that the DBMS has knowledge about this relation-
ship between records on dierent tables. Any approach
to protect the attributes against third parties will af-
fect the DBMS itself and will never really achieve the
needed protection. Thus, any effort on implementing
secure relational integrity is at best security through
obscurity [5].
[5] When the security of a system relies only on the lack
of knowledge by adversaries about its implementation
details and flaws.
6 The winning solution of Netflix's prize
The winner of the Netflix Grand Prize was BellKor's
Pragmatic Chaos team, who built a solution over
the progress achieved in the 2007 and 2008 Progress
Prizes [56]. Several machine learning predictors were
combined in the nal solution with the objective of
anticipating the suitability of Netflix content for some
user considering previous behavior in the platform.
The foundation used for this considered diverse fac-
tors, such as:
• What is the general behavior of users when rat-
ing? What is the average rating?
• How critical is a user and how does this change over
time?
• Does the user demonstrate preference for a spe-
cific movie or genre?
• Does the user demonstrate preference for block-
busters or non-mainstream content?
• What property of a movie affects the rating? Is
there a correlation between the rating of a user
and the presence of a particular actor in a partic-
ular genre?
The strategy used to combine these factors (and
many others) is outside the scope of this work. We
attend only to the necessity of extracting data from the
dataset to feed the learning model.
6.1 Searching the encrypted Netflix database
An interesting application of our framework is enabling
an entity to maintain an encrypted database on third
party hardware with a structure similar to Netflix's
dataset and being able to implement a prediction algo-
rithm with minimum data leakage to the DBMS. The
database should be capable of answering the requested
predicates regarding user behavior.
Two scenarios must be considered: the recommen-
dation system running on Netflix's infrastructure,
and the dataset becoming public. The former offers
an apparently honest execution environment (no one
would share data with an openly malicious party) but
that can be compromised at some point. To mitigate
the damage, the data owner can implement different
strategies to reduce the usefulness of any leakage that
might happen. Thus, data being handled exclusively in
encrypted form on the server is a natural option, since
security breaches would reveal nothing but incompre-
hensible ciphertexts. This is the best case scenario
since the data owner has as much control of the exe-
cution machine as possible, so our framework proposal
can be applied in its full capacity.
As an example of the latter, an important feature re-
quired for running the Netflix prize is the capability
of in-dataset comparisons. This time any security so-
lution should find the balance between protecting data
secrecy and offering conditions for experimentation.
Moreover, we must consider that the execution envi-
ronment cannot be considered honest anymore. This
way, the suitability of our framework depends on the
relaxation of the indexing method: index values must
be published to enable comparisons. For instance, both
sides of Lewi-Wu’s ciphertexts should be published,
or even an OPE scheme may be used on the encryp-
tion of the index. From the perspective of the secrecy
of ciphertexts, if a IND-OCPA scheme is used then
there will be no security reduction beyond what the
corresponding threat model expects, as discussed in
Section 2.1. The adversary learns the ciphertext order
but has restricted ability to make inferences using in-
formation acquired from public databases. The only
strategy that can be applied uses the data distribu-
tion in the dataset (that can be retrieved by enabling
comparisons), which puts an attacker in this scenario
in a very similar position than the persistent passive
attacker.
Given the boundary conditions for privacy preservation, we cannot precisely state the robustness of our framework in the context of the Netflix prize. It clearly increases the hardness of an inference attack, since the adversary is unable to observe the plaintext, but the leaked distribution will give hints about its content, for instance the correlation of age groups and the most watched (or better rated) movies. It is a fact that all these are expressed as ciphertexts but, as previously stated, a motivated adversary may be able to combine such hints and defeat our security barriers.
Our framework performs much better in the more
conservative scenario, where a production server pro-
vides recommendations to users with comparisons con-
trolled by the data owner through the two-sided index
attributes. The impossibility of arbitrary comparisons makes snapshot attacks completely infeasible.
As previously discussed, a motivated adversary with access to the database may also be able to retrieve logs and auxiliary collections. Consequently, previous queries may leak the second side of index ciphertexts and bring back the danger of persistent passive attacks. Hence, an important feature for future work is the development of a key-refreshment algorithm to nullify the usefulness of such information.
6.2 Data structure
The dataset shared by Netflix is composed of more than 100 million real movie ratings from 480,000 users about 17,000 movies, made between 1999 and 2005, and formatted as training and test sets [11, 56]. The test set contains a subset of 4.2 million of those ratings, with up to 9 ratings per user. Each entry consists of:
• CustomerID: a unique identification number per user,
• MovieID: a unique identification number per movie,
• Title: the English title of the movie,
• YearOfRelease: the year the movie was released,
• Rating: the rating itself,
• Date: the timestamp informing when the rating happened.
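For concreteness, a single plaintext entry can be pictured as the following Python dictionary; the field names follow the list above, while the values are fabricated for illustration only.

# Illustrative plaintext rating entry; values are made up and do not come
# from the actual dataset.
rating_record = {
    "CustomerID": 1061110,         # unique user identifier
    "MovieID": 6287,               # unique movie identifier
    "Title": "Some Movie Title",   # English title of the movie
    "YearOfRelease": 2003,         # year the movie was released
    "Rating": 4,                   # the rating itself (1 to 5 stars)
    "Date": "2003-06-15",          # when the rating happened
}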
6.3 Constructing queries of interest over encrypted data
In the following, we rewrite some of the main predicates required for BellKor's solution using the relational algebra of Section 5.2, thus enabling their execution over an encrypted dataset.
Let
• DB be a dataset as described in Section 6.2,
• AID be the CustomerID related to a particular user (that we shall call Alice),
• BID be the CustomerID related to a particular user different from Alice (that we shall call Bob),
• MID be the MovieID related to an arbitrary movie in the dataset (that we shall refer to as M),
• T = (T_start, T_end) be a time interval of interest,
• T_first-alice be the timestamp of the first rating Alice ever made,
• C() be a function that receives a set and returns the quantity of items it contains,
• r_H and r_L be thresholds for extreme ratings characterizing users that hated or loved a movie,
• σ_{Date∈T}(DB) ≡ σ_{Date≥T_start}(σ_{Date<T_end}(DB)) be the selection of records whose Date lies in T,
• f(X) = (Σ_{x∈X} π_{Rating}(x)) / C(X).
Then, some of the required predicates for BellKor's solution are:
Movies rated by Alice: Returns all movies that received some rating from Alice. For
U(X) = σ_{CustomerID=X}(DB),
we have the query
π_{MovieID}(U(AID)).    (1)
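The following Python sketch illustrates, at a conceptual level, how a client could issue Equation 1 through a wrapper such as ours; every name (wrapper, encrypt_index, select, decrypt) is an illustrative assumption and not the exact API of the proof-of-concept.

# Conceptual client-side flow for Equation 1: the selection U(AID) is resolved
# through the encrypted index on CustomerID, and the projection keeps only the
# MovieID attribute. All identifiers here are hypothetical.
def movies_rated_by(wrapper, customer_id):
    token = wrapper.encrypt_index("CustomerID", customer_id)        # comparison token
    docs = wrapper.select("CustomerID", token)                       # U(AID) via the index
    return [wrapper.decrypt("MovieID", d["MovieID"]) for d in docs]  # pi_MovieID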
Users who rated M: Returns all users that sent some rating for MID. For
M(X) = σ_{MovieID=X}(DB),
we have the query
π_{CustomerID}(M(MID)).    (2)
Average of Alice's ratings over time: Computes the average of all ratings sent by Alice during a particular time interval T. For
A_{AID,T} = σ_{Date∈T}(U(AID)),
we have that
avg(AID, T) = f(A_{AID,T}) if C(A_{AID,T}) > 0, and 0 otherwise.    (3)
Average of ratings for a particular movie M in a time interval: Computes the average of all ratings sent by all users during a particular time interval T for a movie M. For
M_{MID,T} = σ_{Date∈T}(M(MID)),
we have that
avg(MID, T) = f(M_{MID,T}) if C(M_{MID,T}) > 0, and 0 otherwise.    (4)
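Equations 3 and 4 map directly onto the additively homomorphic copies of Rating: the encrypted sum is accumulated by multiplying Paillier ciphertexts, and only the key holder performs the division. A minimal sketch, assuming the selected ratings are Paillier-encrypted under a public modulus n:

# Aggregation behind Equations 3 and 4. `ciphertexts` is assumed to hold the
# Paillier encryptions of the ratings selected by sigma_{Date in T}.
def encrypted_rating_sum(ciphertexts, n):
    n_square = n * n
    acc = 1                              # degenerate encryption of zero
    for c in ciphertexts:
        acc = (acc * c) % n_square       # Enc(a) * Enc(b) mod n^2 = Enc(a + b)
    return acc                           # Enc(sum of ratings), still encrypted

# The data owner decrypts the accumulated sum and, if the count C(.) is
# positive, divides by it locally to obtain the average.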
Number of days since Alice's first rating: Computes how many days have passed since Alice submitted her first rating of a movie, relative to a moment I.
dsf(AID, I) = I − π_{Date}(σ_{min(Date)}(U(AID))).    (5)
Quantity of users who hated M: Counts the quantity of very bad ratings M received since its release.
C_H(M) = C(σ_{MovieID=MID}(σ_{Rating≤r_H}(DB))).    (6)
Quantity of users who loved M: Counts the quantity of very good ratings M received since its release.
C_L(M) = C(σ_{MovieID=MID}(σ_{Rating≥r_L}(DB))).    (7)
Users that are similar to Alice: The similarity assessment between users requires the derivation of a specific metric according to the boundary conditions. The winning solution developed a sophisticated strategy, building a graph of neighborhoods considering similar movies and users and computing a weighted mean of the ratings. For simplicity, we condense two factors that can be used for this objective: the set of commonly rated movies, and how close the ratings are. To query the movies rated both by Alice and Bob, let
α_AID = π_{MovieID,RatingA}(ρ_{Rating,RatingA}(U(AID)))
and
β_BID = π_{MovieID,RatingB}(ρ_{Rating,RatingB}(U(BID))).
Then
SimilaritySet(AID, BID) = α_AID ⋈ β_BID    (8)
returns a sequence of tuples of ratings made by Alice and Bob. A simple approach for evaluating proximity is to compute the average of the difference of ratings for each movie returned by Equation 8, as shown in Equation 9:
(Σ_{SimilaritySet(AID,BID)} |RatingA − RatingB|) / C(SimilaritySet(AID, BID)).    (9)
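Once the outcome of Equation 8 is decrypted on the client, Equation 9 reduces to an average of absolute differences. A small sketch (the tuple layout is an assumption):

# Client-side evaluation of Equation 9 over the decrypted outcome of
# Equation 8: a list of (RatingA, RatingB) pairs, one per commonly rated movie.
def rating_distance(similarity_set):
    if not similarity_set:
        return None                                   # no common movies
    diffs = [abs(a - b) for a, b in similarity_set]
    return sum(diffs) / len(diffs)                    # average absolute difference

# Example: rating_distance([(4, 5), (3, 3), (2, 4)]) == 1.0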
7 Implementation
A proof-of-concept implementation of the proposed framework was developed and made available to the community under a GNU GPLv3 license [29]. It runs on top of the popular document-based database MongoDB and was designed as a wrapper over its Python driver [57]. Hence, we are able to evaluate its competence as a search framework as well as its compatibility with a state-of-the-art DBMS. Moreover, running as a wrapper makes it database-agnostic and restricts the server to dealing with encrypted data. We chose to implement our wrapper over a NoSQL database so we could avoid dealing with an SQL interpreter and thus reduce the implementation complexity. However, our solution should be easily portable to any SQL database because of its strong roots in relational algebra. Table 1 lists the scheme used for each attribute class, the parameter size and its security level.
Table 1 Chosen cryptosystems for each attribute presented in
Section 5.
Attribute Cryptosystem Parameters Sec. level
static AES 128 bits 128 bits
index Lewi-Wu 128 bits 128 bits
computable (+) Paillier 3072 bits 128 bits
computable (×) ElGamal 3072 bits 128 bits
Lewi-Wu’s ORE scheme relies on symmetric prim-
itives and achieves IND-OCPA. The authors claim
that this is more secure than all existing OPE and
ORE schemes which are practical [34]. Finally, Pail-
lier and ElGamal are well-known public-key schemes.
Both achieve IND-CPA and are based on the hardness
of solving integer factorization and discrete logarithm
problems, respectively. Paillier supports homomorphic
addition, while ElGamal provides homomorphic multi-
plication. Both are classied as PHE schemes [38, 39].
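To make the additive property concrete, the following is a minimal textbook Paillier sketch with toy parameters (nothing like the 3072-bit modulus of Table 1); it only illustrates why multiplying ciphertexts adds the underlying plaintexts and is not part of our implementation.

# Textbook Paillier with toy primes, for illustration only (not secure).
from math import gcd

p, q = 293, 433                                # toy primes
n = p * q
n2 = n * n
g = n + 1                                      # standard choice g = n + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lambda = lcm(p - 1, q - 1)

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)            # inverse of L(g^lambda) mod n (Python 3.8+)

def enc(m, r):                                 # Enc(m) = g^m * r^n mod n^2
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c):                                    # Dec(c) = L(c^lambda mod n^2) * mu mod n
    return (L(pow(c, lam, n2)) * mu) % n

c1, c2 = enc(4, 17), enc(3, 23)                # encryptions of 4 and 3
assert dec((c1 * c2) % n2) == 7                # ciphertext product decrypts to 4 + 3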
The implementation of AES was provided by the pycrypto toolkit [58]; we wrote a Python binding over the implementation of Lewi-Wu provided by its authors [59]; and we implemented the Paillier and ElGamal schemes ourselves. An AVL tree was used as the index structure. It is important to notice that performance was not the main focus of this proof-of-concept implementation.
The machines used to run our experiments are described in Tables 2 and 3. The former specifies the machine used to host the MongoDB server, and the latter describes the one used to run the client. Both machines were connected by a Gigabit local network connection.
Table 2 Specications of the machine used for running the
MongoDB instance.
CPU 2 x Intel Xeon E5-2670 v1 @ 2.60GH
OS CentOS 7.3
Memory 16 x DDR3 DIMM 8192MB @ 1600MHz
Disk 7200RPM Western Digital HDD (SATA)
Table 3 Specications of the machine used for running the
queries described in this document.
CPU 2 x Intel Xeon E5-2640 v2 @ 2.60GH
OS Ubuntu 16.04.2
Memory 4 x DDR3 DIMM 8192MB @ 1600MHz
Disk 7200RPM Western Digital HDD (SATA)
While it was trivial to index the plaintext dataset natively, it was not so simple with the encrypted version. MongoDB is not friendly to custom index structures or comparators, so we decided to construct the structure with Python code and then insert it into the database using pointers based on MongoDB's native identity codes. Walking through the index tree depends on a database-external operation on the Python side, calling MongoDB's find method to localize the documents related to left/right pointers, starting from the tree root. Such a limitation brings a major performance overhead that especially affects range queries.
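The traversal can be pictured as below; the node layout (an ORE ciphertext plus left/right children stored as MongoDB _ids) and the ore_compare helper are assumptions made for illustration, not the exact document schema we use.

# Client-driven walk over the AVL index stored as MongoDB documents. Each step
# issues one find_one() round trip, which is the source of the overhead
# discussed above. Field names are illustrative assumptions.
def index_lookup(nodes, root_id, target_ct, ore_compare):
    # ore_compare(a, b) returns -1, 0 or 1 using the ORE comparison procedure
    current = nodes.find_one({"_id": root_id})
    while current is not None:
        order = ore_compare(target_ct, current["index_ct"])
        if order == 0:
            return current["record_id"]            # pointer to the encrypted record
        child_id = current["left"] if order < 0 else current["right"]
        current = nodes.find_one({"_id": child_id}) if child_id else None
    return None                                    # value not present in the index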
7.1 Netix’s prize dataset
We used the Netix’s dataset to measure the compu-
tational costs of managing an encrypted database.
We consider the two threat scenarios discussed in
Section 6.1, a recommendation system running in pro-
duction, and the disclosure of a real ratings dataset.
Both require the ability of running all queries pre-
sented in Section 6.3, diering only in the content
that must be inserted in the encrypted dataset (for
instance, how much of the index ciphertexts may be
stored). Hence, to demonstrate the suitability of our
framework as a strategy to fulll the development and
execution of a good predictor in such contexts, and be-
ing capable of mitigating damages to user privacy, we
implemented those queries in an encrypted instance of
the dataset.
As shown in Table 4, all four attributes are classified as static, which uses the fastest encryption and decryption available. Rating is additionally tagged computable for addition and multiplication, thus being compatible with Equations 3 and 4. We use CustomerID, MovieID, and Date for indexing. Encrypting the document structure takes 540 µs per record.
Table 4 Attribute structure of elements in the Netflix prize dataset.
Name        Value type  Class
CustomerID  integer     index, static
MovieID     integer     index, static
Rating      integer     static, computable
Date        integer     index, static
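Putting the attribute classes together, encrypting one record means applying a different scheme per class. In the sketch below, enc maps scheme names to placeholder callables for AES, Lewi-Wu ORE, Paillier and ElGamal; the resulting document layout is an assumption for illustration, not the wrapper's actual one.

# Per-record encryption following the classes of Table 4 (illustrative layout).
def encrypt_record(rec, enc):
    doc = {}
    for attr in ("CustomerID", "MovieID", "Rating", "Date"):
        doc[attr] = enc["aes"](rec[attr])               # static copy of every attribute
    for attr in ("CustomerID", "MovieID", "Date"):
        doc[attr + "_idx"] = enc["ore"](rec[attr])      # index copies feed the AVL trees
    doc["Rating_add"] = enc["paillier"](rec["Rating"])  # computable (+), Equations 3 and 4
    doc["Rating_mul"] = enc["elgamal"](rec["Rating"])   # computable (x)
    return doc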
There is no way to implement integer division over Paillier ciphertexts. Thus, the predictor may be adapted to use the non-divided result of Equations 3 and 4. Otherwise, a division oracle must be provided, to which one could submit homomorphically added values and ask for a ciphertext equivalent to their division by an arbitrary integer. This approach does not reduce security for an IND-CPA homomorphic scheme.
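One possible realization of such an oracle, assumed here to be a service under the data owner's control that holds the Paillier secret key; the decrypt and encrypt callables stand for whatever Paillier implementation is in use.

# Sketch of a division oracle: it receives a homomorphically accumulated
# Enc(sum) and a public divisor, and returns a fresh ciphertext of the
# quotient. Only a party trusted with the secret key (e.g., the data owner)
# can run it.
def division_oracle(enc_sum, divisor, decrypt, encrypt):
    total = decrypt(enc_sum)               # only the oracle ever sees the sum
    return encrypt(total // divisor)       # fresh IND-CPA ciphertext of the quotient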
Handling such a large dataset was not an easy task. The ciphertext expansion factor caused by the AES, Paillier and ElGamal cryptosystems was relatively small, but the Lewi-Wu implementation is very inefficient in this regard, with an expansion of about 400×. This directly affects the index building and motivated us to explore different strategies to encrypt and load the dataset into a MongoDB instance in reasonable time.
Again, MongoDB is not friendly to custom indexing. A contribution by Grim, Wiersma and Turkmen to our code enables us to manage the AVL tree inside the database through JavaScript code stored inside MongoDB's engine (the only way to execute arbitrary code in MongoDB) [60]. Thus, our primary approach to feed the DBMS with the dataset was quite simple: encrypt each record in our wrapper, insert it in the database, and update the index and balance the tree inside the DBMS. The first two operations suffered from extremely high memory consumption and by far surpassed our available RAM capacity. However, an even worse problem we faced was building the AVL tree. For the first thousand records we could do the node insertion and tree balancing at a rate of about 600 documents per second, but it dropped quickly as the tree height increased, reaching less than 1 document per second before the insertion of the 10,000th record.
Table 5 Latency for each step in the construction of an AVL tree following Algorithm 1, for each index attribute specified in Table 4.
Attribute   sort (s)  group (s)  build_index (s)
CustomerID  329       459        129
MovieID     270       161        2
Date        187       197        5
We found that the initial insertions required a different approach. We completely decoupled the index from the static data encryption and chose to first feed the database with the static ciphertexts, constructing the entire AVL tree using the plaintext in client-side memory, and then inserting it into the database. Moreover, to speed up the index construction we followed Algorithms 1 and 2 to build the AVL tree. They take a sorted list of inputs and build the tree with time complexity O(n) in the list size. As a result of this approach we were able to build the encrypted database and the index at a rate of 3,000 documents per second during the entire procedure.
Algorithm 1 Build an AVL tree using an array of documents.
1: procedure build_index(docs)
2:    docs_sort ← sort(docs);
3:    docs_group ← group(docs_sort);   ▷ Combine equal elements
4:    return build_aux(docs_group, 0, length(docs_group) − 1);
5: end procedure
Table 5 shows the latency of each step we observed
during the construction of the AVL tree-based indexes.
The total time to build those 3 indexes was 40 minutes.
The queries derived in Section 6.3 were ported to our encrypted database, and the latency for each one can be seen in Table 6.
Algorithm 2 Recursively builds an AVL tree from a sorted array of documents without repeated elements. Receives the array itself, and the indexes of the leftmost and rightmost elements to be handled in each recursive call.
1: procedure build_aux(docs, L, R)
2:    if L = R then
3:        return new_node(docs[L]);
4:    else if L + 1 = R then
5:        left_node ← new_node(docs[L]);
6:        right_node ← new_node(docs[R]);
7:        left_node.right ← right_node;
8:        left_node.height ← 1;
9:        return left_node;
10:   else
11:       M ← L + ⌊(R − L)/2⌋;
12:       middle_node ← new_node(docs[M]);
13:       middle_node.left ← build_aux(docs, L, M − 1);
14:       middle_node.right ← build_aux(docs, M + 1, R);
15:       lh ← middle_node.left.height;
16:       rh ← middle_node.right.height;
17:       middle_node.height ← 1 + max(lh, rh);
18:       return middle_node;
19:   end if
20: end procedure
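For reference, a direct Python rendering of Algorithms 1 and 2 follows; nodes are plain dictionaries instead of MongoDB documents, and the grouping step is our reading of group() (merging entries that share the same index value), not the exact proof-of-concept code.

# Python rendering of Algorithms 1 and 2.
def new_node(doc):
    return {"doc": doc, "left": None, "right": None, "height": 0}

def build_aux(docs, L, R):
    if L == R:
        return new_node(docs[L])
    if L + 1 == R:                                   # only two elements left
        left_node = new_node(docs[L])
        left_node["right"] = new_node(docs[R])
        left_node["height"] = 1
        return left_node
    M = L + (R - L) // 2                             # middle element
    middle = new_node(docs[M])
    middle["left"] = build_aux(docs, L, M - 1)
    middle["right"] = build_aux(docs, M + 1, R)
    middle["height"] = 1 + max(middle["left"]["height"],
                               middle["right"]["height"])
    return middle

def build_index(docs, key):
    docs_sorted = sorted(docs, key=key)              # sort()
    groups, last = [], object()                      # group(): merge equal index values
    for d in docs_sorted:
        if groups and key(d) == last:
            groups[-1].append(d)
        else:
            groups.append([d])
            last = key(d)
    return build_aux(groups, 0, len(groups) - 1)

# Example: build_index(records, key=lambda r: r["CustomerID"]) builds the
# CustomerID index over a non-empty list of plaintext records.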
The parameters used for each Equation were arbitrarily selected. The CustomerIDs for Alice and Bob (AID and BID) were 1061110 and 2486445, respectively, while MID was fixed as 6287. The time interval used was 01/01/2003 to 01/01/2004. Lastly, we defined a "loved" rating as one greater than 3, and a "hated" rating as one lower than 3. We applied some effort to optimizing the execution; however, these results can still be improved.
As can be seen, complex queries composed of range selections, as well as those with numerous outcomes, suffered from the slow communication between the server and the client. The latter influenced even the plaintext results. The outcome of Equation 1 is quite small, requiring much less time to return than the outcome of Equation 2 (the number of movies rated by a user is much smaller than the number of users that rated a movie).
The time interval selection in Equations 3 and 4 required our implementation to visit many nodes in the index tree for Date. Because each iteration requires a round trip between the server and the client, this dramatically impacted performance. The latencies for Equations 1 and 5 were only 1.4 times higher in the encrypted database; however, the factor reached 710 times for Equation 3. Lastly, Equations 6 and 7 depend on Paillier's homomorphic additions, which implied a factor-12 slowdown.
Table 6 Execution times for implementations of the Equations presented in Section 6.3 on an encrypted MongoDB collection and an equivalent plaintext version. Each row contains the latency for the entire circuit required by the respective Equation, including returning the outcome to the client. Times are computed as the average over 100 independent executions. The machine and parameters used for each cryptosystem follow those defined in Section 7.
Equation  Encrypted  Plaintext
1         16.6 ms    11.9 ms
2         2 s        850 ms
3         2.7 s      3.8 ms
4         2.7 s      1.0 s
5         16.8 ms    11.8 ms
6 and 7   12 ms      1.0 ms
9         603 ms     200 ms
The implementation of queries based on Equations 3 and 4 took the previous suggestion and skipped the final division. We believe this does not undermine any procedure that eventually consumes this outcome.
The optimal implementation of Equations 6 and 7 requires indexing of the MovieID and Rating attributes. However, due to limitations in our implementation, rather than indexing the latter we use a linear search over the outcome of the movie selection on the client side. Our approach for building indexes uses the set data structure of MongoDB documents. Yet, in the most recent release such a structure holds up to 16 MB of data, much less than what is required for indexing the entire dataset by Rating with our strategy.
Lastly, Equation 8 was implemented aiming at joining the data regarding two users, Alice and Bob. We leave the evaluation of such information by a similarity-evaluation function as future work.
8 Conclusion
We presented the problem of searching encrypted data and a proposal of a framework that guides the modeling of a database with support for this functionality. This is achieved by combining different cryptographic concepts and using different cryptosystems to satisfy the requirements of each attribute, such as order-revealing encryption and homomorphic encryption. On top of this approach, a relational algebra was built to support encrypted data, composed of projection, selection, Cartesian product, difference, union, rename, and join-like operators.
An overview of the security provided is discussed, as well as a performance analysis of the impact on a realistic database. As a case study we explored the Netflix prize, which published an anonymized dataset with real-world information about user behavior that was later deanonymized through correlation attacks involving public databases.
We oered a proof-of-concept implementation in
Python over the document-based database MongoDB.
To demonstrate its functionality, we selected and ran
some of the main predicates required by the win-
ning solution of the Netix Grand Prize and mea-
sured the performance impact of the execution in a
encrypted version of the dataset. We conclude that
our proposal oers robustness against a compromised
server and we discuss how it would help to avoid the
deanonymization of the Netix dataset. In comparison
with CryptDB, our proposal provides higher security,
since it delegates exclusively to the data owner the re-
sponsibility of encrypting and decrypting data. This
way, privacy holds even in a scenario of database or
application compromise.
As future research objectives we can mention:
• Extend the scope to associative arrays: Despite being powerful for SQL, Codd's relational algebra is not completely applicable to non-relational databases. For instance, NoSQL and NewSQL databases lack the concept of joins. A more convenient foundation for such a context is the algebra of associative arrays [61]. Hence, the formalization of our primitives in such an algebra would be interesting future work.
• Reduce the leakage of index construction in the database: Our proposal leaks both sides of the index ciphertexts to enable the index construction. At this moment, an eavesdropper monitoring queries would learn all the information required to freely compare the exposed ciphertexts. As discussed in this document, such a capability must be restricted, under the risk of enabling an inference attack.
• Key refreshment algorithm: A persistent passive attacker is capable of learning the information required to perform comparisons through the entire database, just by observing issued queries and their outcomes. Thus, the framework primitives must be improved to support an algorithm capable of nullifying any damage caused by the knowledge of such information.
• Hide repeated queries: Even with encrypted queries and outcomes, the access pattern in a database may indicate repeated queries and the associated records. A technique such as ORAM could be useful to protect such information [62].
• Explore different databases: As stated, MongoDB is a very popular NoSQL database. However, it is not friendly to custom indexing or third-party code running in its engine. Thus, replacing it with a more appropriate database could provide a more productive system.
• Improve the performance of our implementation: Our implementation was intended as a proof of concept to demonstrate how the proposal works. The development of space- and speed-optimized versions is an important next step.
Author’s contributions
The rst author developed the study design, carried out the implementation
eorts and wrote most the paper. The second author contributed with
discussions about the proposal and its validation with the case study. Both
authors read and approved the nal manuscript.
Acknowledgements
A prior version of this paper was presented at the XVI Brazilian Symposium on Information and Computational Systems Security (SBSeg16) [63]. The authors thank Prof. André Santanchè for the initial opportunity to develop this work, the Multidisciplinary High Performance Computing Lab (LMCAD) for providing the required infrastructure, and the anonymous reviewers that helped improve this work.
Funding
This research was partially funded by CNPq and the Google Research Awards Latin America.
Availability of data and materials
Our proof-of-concept implementation is publicly available on GitHub, as well as a generator for a synthetic dataset used for testing [29]. The Netflix dataset is available in academic repositories [64].
Competing interests
The authors declare that they have no competing interests.
References
1. Buyya, R.: Market-Oriented Cloud Computing: Vision, Hype, and
Reality of Delivering Computing As the 5th Utility. In: Proceedings of
the 2009 9th IEEE/ACM International Symposium on Cluster
Computing and the Grid. CCGRID ’09, p. 1. IEEE Computer Society,
Washington, DC, USA (2009)
2. Vecchiola, C., Pandey, S., Buyya, R.: High-Performance Cloud
Computing: A View of Scientific Applications. In: Pervasive Systems,
Algorithms, and Networks (ISPAN), 2009 10th International
Symposium On, pp. 4–16 (2009)
3. Hoa, C., Mehta, G., Freeman, T., Deelman, E., Keahey, K.,
Berriman, B., Good, J.: On the Use of Cloud Computing for Scientific
Workflows. In: eScience 08. IEEE Fourth International Conference On,
pp. 640–645 (2008)
4. Dinh, H.T., Lee, C., Niyato, D., Wang, P.: A survey of mobile cloud
computing: architecture, applications, and approaches. Wireless
Communications and Mobile Computing 13(18), 1587–1611 (2013)
5. Xiao, Z., Xiao, Y.: Security and Privacy in Cloud Computing. IEEE
Communications Surveys Tutorials 15(2), 843–859 (2013)
6. Pascual, A.: 2017 Data Breach Fraud Impact Report: Going
Undercover and Recovering Data. Technical report, Javelin Advisory
Services (2017)
7. Popa, R.A., Redfield, C.M.S., Zeldovich, N., Balakrishnan, H.:
CryptDB: Protecting confidentiality with encrypted query processing.
In: Proceedings of the Twenty-Third ACM Symposium on Operating
Systems Principles. SOSP ’11, pp. 85–100. ACM, New York, NY, USA
(2011)
8. Poddar, R., Boelter, T., Popa, R.A.: Arx: A Strongly Encrypted
Database System. Cryptology ePrint Archive, Report 2016/591 (2016)
9. Ramamurthy, R., Eguro, K., Arasu, A., Kaushik, R., Kossmann, D.,
Venkatesan, R.: A Secure Coprocessor for Database Applications
(2013)
10. Tu, S., Kaashoek, M.F., Madden, S., Zeldovich, N.: Processing
analytical queries over encrypted data. Proc. VLDB Endow. 6(5),
289–300 (2013)
11. Bennett, J., Lanning, S., et al.: The Netflix prize. In: Proceedings of
KDD Cup and Workshop, vol. 2007, p. 35 (2007). New York, NY,
USA
12. Michael Arrington: AOL Proudly Releases Massive Amounts of Private
Data. https://techcrunch.com/2006/08/06/
aol-proudly- releases-massive- amounts-of- user-search- data/.
Accessed 24 July 2017 (2006)
13. Said, A., Bellogín, A.: Comparative recommender system evaluation:
benchmarking recommendation frameworks. In: Proceedings of the 8th
ACM Conference on Recommender Systems, pp. 129–136 (2014).
ACM
14. Wang, Z., Liao, J., Cao, Q., Qi, H., Wang, Z.: Friendbook: a
semantic-based friend recommendation system for social networks.
IEEE Transactions on Mobile Computing (3), 538–551 (2015)
15. Pazzani, M.J., Billsus, D.: Content-based recommendation systems. In:
The Adaptive Web, pp. 325–341. Springer, ??? (2007)
16. Narayanan, A., Shmatikov, V.: Robust de-anonymization of large
sparse datasets. In: Proceedings of the 2008 IEEE Symposium on
Security and Privacy. SP ’08, pp. 111–125. IEEE Computer Society,
Washington, DC, USA (2008)
17. Barbaro, M., Zeller, T.: A Face Is Exposed for AOL Searcher No.
4417749. The New York Times. Accessed 05 April 2017 (2006)
18. Narayanan, A., Felten, E.W.: No silver bullet: De-identification still
doesn’t work. Technical report (2014)
19. Greenwald, G., MacAskill, E.: NSA Prism program taps in to user data
of Apple, Google and others. The Guardian (2013)
20. Weber, H.: How the NSA & FBI made Facebook the perfect mass
surveillance tool. Venture Beat. Published on 05/15/2014. (2014)
21. Thomsen, S.: Extramarital affair website Ashley Madison has been
hacked and attackers are threatening to leak data online. Business
Insider. Accessed 25 May 2016 (2015)
22. Niklas Magnusson and Niclas Rolander: Sweden Tries to Stem Fallout
of Security Breach in IBM Contract. Bloomberg (2017)
23. BBC News: Yahoo ’state’ hackers stole data from 500 million users.
Accessed 23 September 2016 (2016)
24. Sweeney, L.: Simple Demographics Often Identify People Uniquely
(2000). http://dataprivacylab.org/projects/identifiability/
25. Golle, P.: Revisiting the uniqueness of simple demographics in the us
population. In: Proceedings of the 5th ACM Workshop on Privacy in
Electronic Society. WPES ’06, pp. 77–80. ACM, New York, NY, USA
(2006)
26. DuckDuckGo: Privacy Mythbusting #3: Anonymized data is safe,
right? (Er, no.).
https://spreadprivacy.com/dataanonymization-e1e2b3105f3c.
Accessed 24 July 2017
27. Schneier, B.: Data is a toxic asset (2016). https://www.schneier.
com/essays/archives/2016/03/data_is_a_toxic_asse.html
28. Bösch, C., Hartel, P., Jonker, W., Peter, A.: A survey of provably
secure searchable encryption. ACM Comput. Surv. 47(2), 18:1–18:51
(2014)
29. Alves, P.: A proof-of-concept searchable encryption backend for
mongodb. https://github.com/pdroalves/encrypted-mongodb.
Accessed July 2017 (2016)
30. Bellare, M., Desai, A., Pointcheval, D., Rogaway, P.: Relations among
notions of security for public-key encryption schemes. Advances in
Cryptology — CRYPTO ’98: 18th Annual International Cryptology
Conference Santa Barbara, California, USA August 23–27, 1998
Proceedings, pp. 26–45. Springer, Berlin, Heidelberg (1998)
31. Curtmola, R., Garay, J., Kamara, S., Ostrovsky, R.: Searchable
symmetric encryption: Improved definitions and efficient constructions.
Journal of Computer Security 19(5), 895–934 (2011)
32. Boldyreva, A., Chenette, N., Lee, Y., O’Neill, A.: Order-preserving
symmetric encryption. Lecture Notes in Computer Science 5479,
224–241 (2009)
33. Boneh, D., Lewi, K., Raykova, M., Sahai, A., Zhandry, M.,
Zimmerman, J.: Semantically Secure Order-Revealing Encryption:
Multi-input Functional Encryption Without Obfuscation, pp. 563–594.
Springer, Berlin, Heidelberg (2015)
34. Lewi, K., Wu, D.J.: Order-Revealing Encryption: New Constructions,
Applications, and Lower Bounds. Cryptology ePrint Archive, Report
2016/612 (2016)
35. Kolesnikov, V., Shikfa, A.: On the limits of privacy provided by
Order-Preserving Encryption. Bell Labs Technical Journal (2012)
36. Chenette, N., Lewi, K., Weis, S.A., Wu, D.J.: Practical
Order-Revealing Encryption with Limited Leakage. In FSE (2016)
37. Naveed, M., Kamara, S., Wright, C.V.: Inference attacks on
property-preserving encrypted databases. In: Proceedings of the 22Nd
ACM SIGSAC Conference on Computer and Communications Security.
CCS ’15, pp. 644–655. ACM, New York, NY, USA (2015)
38. Paillier, P.: In: Stern, J. (ed.) Public-Key Cryptosystems Based on
Composite Degree Residuosity Classes, pp. 223–238. Springer, Berlin,
Heidelberg (1999)
39. El Gamal, T.: A public key cryptosystem and a signature scheme based
on discrete logarithms. In: Proceedings of CRYPTO 84 on Advances in
Cryptology, pp. 10–18. Springer, New York, NY, USA (1985).
http://dl.acm.org/citation.cfm?id=19478.19480
40. Gentry, C.: Computing Arbitrary Functions of Encrypted Data.
Commun. ACM 53(3), 97–105 (2010)
41. Brakerski, Z., Gentry, C., Vaikuntanathan, V.: (leveled) fully
homomorphic encryption without bootstrapping. In: Proceedings of the
3rd Innovations in Theoretical Computer Science Conference. ITCS
’12, pp. 309–325. ACM, New York, NY, USA (2012)
42. Loftus, J., May, A., Smart, N.P., Vercauteren, F.: On CCA-Secure
Somewhat Homomorphic Encryption. In: Proceedings of the 18th
International Conference on Selected Areas in Cryptography. SAC’11,
pp. 55–72. Springer, Berlin, Heidelberg (2012)
43. Song, D.X., Wagner, D., Perrig, A.: Practical techniques for
searches on encrypted data. Proceeding 2000 IEEE Symposium on
Security and Privacy. S&P 2000, 44–55 (2000)
44. Grubbs, P., Ristenpart, T., Shmatikov, V.: Why Your Encrypted
Database Is Not Secure. Cryptology ePrint Archive, Report 2017/468.
http://eprint.iacr.org/2017/468 (2017)
45. Rogaway, P.: The moral character of cryptographic work. IACR
Cryptology ePrint Archive, 1162 (2015)
46. Naveed, M., Kamara, S., Wright, C.V.: Inference attacks on
property-preserving encrypted databases. In: Proceedings of the 22Nd
ACM SIGSAC Conference on Computer and Communications Security.
CCS ’15, pp. 644–655. ACM, New York, NY, USA (2015)
47. Grubbs, P., McPherson, R., Naveed, M., Ristenpart, T., Shmatikov,
V.: Breaking web applications built on top of encrypted data. In:
Proceedings of the 2016 ACM SIGSAC Conference on Computer and
Communications Security. CCS ’16, pp. 1353–1364. ACM, New York,
NY, USA (2016)
48. Daemen, J., Rijmen, V.: AES Proposal: Rijndael (1999)
49. Papadimitriou, A., Bhagwan, R., Chandran, N., Ramjee, R., Haeberlen,
A., Singh, H., Modi, A., Badrinarayanan, S.: Big data analytics over
encrypted datasets with seabed. In: Proceedings of the 12th USENIX
Conference on Operating Systems Design and Implementation.
OSDI’16, pp. 587–602. USENIX Association, Berkeley, CA, USA
(2016). http://dl.acm.org/citation.cfm?id=3026877.3026922
50. Shoro, A.G., Soomro, T.R.: Big data analysis: Apache spark
perspective. Global Journal of Computer Science and Technology
15(1) (2015)
51. Codd, E.F.: A relational model of data for large shared data banks.
Commun. ACM 26 (6), 64–69 (1983)
52. Ramakrishnan, R., Gehrke, J.: Database Management Systems, 3rd
edn. McGraw-Hill, Inc., New York, NY, USA (2003)
53. Sedgewick, R.: Algorithms, p. 199. Addison-Wesley, ??? (1983). Chap.
15
54. Doröz, Y., Hoffstein, J., Pipher, J., Silverman, J.H., Sunar, B., Whyte,
W., Zhang, Z.: Fully Homomorphic Encryption from the Finite Field
Isomorphism Problem. Cryptology ePrint Archive, Report 2017/548
(2017)
55. Boneh, D., Gentry, C., Halevi, S., Wang, F., Wu, D.J.: Private
database queries using somewhat homomorphic encryption. In:
Proceedings of the 11th International Conference on Applied
Cryptography and Network Security. ACNS’13, pp. 102–118. Springer,
Berlin, Heidelberg (2013)
56. Töscher, A., Jahrer, M., Bell, R.M.: The BigChaos Solution to the
Netix Grand Prize (2009)
57. Chodorow, K., Dirolf, M.: MongoDB: The Definitive Guide, 1st edn.
O’Reilly Media, Inc., USA (2010)
58. Litzenberger, D.: Python Cryptography Toolkit.
http://www.pycrypto.org/. Accessed 03 July 2016 (2016)
59. Wu, D.J., Lewi, K.: FastORE.
https://github.com/kevinlewi/fastore. Accessed July 2017
(2016)
60. Grim, M.W., Wiersma, A.T., Turkmen, F.: Security and Performance
Analysis of Encrypted NoSQL Databases. Technical report, University
of Amsterdam (2017)
61. Kepner, J., Gadepally, V., Hutchison, D., Jananthan, H., Mattson,
T.G., Samsi, S., Reuther, A.: Associative array model of sql, nosql, and
newsql databases. CoRR (2016)
62. Stefanov, E., Van Dijk, M., Shi, E., Fletcher, C., Ren, L., Yu, X.,
Devadas, S.: Path oram: an extremely simple oblivious ram protocol.
In: Proceedings of the 2013 ACM SIGSAC Conference on Computer &
Communications Security, pp. 299–310 (2013). ACM
63. Alves, P.G.M.R., Aranha, D.F.: A framework for searching encrypted
databases. In: Proceedings of the XVI Brazilian Symposium on
Information and Computational Systems Security (2016)
64. Netix Prize Data Set. http://academictorrents.com/details/
9b13183dc4d60676b773c9e2cd6de5e5542cee9a. Accessed July 2017
(2009)