Indexable Bayesian Personalized Ranking for Efficient Top-k Recommendation
Dung D. Le
Singapore Management University
80 Stamford Road
Singapore 178902
ddle.2015@phdis.smu.edu.sg
Hady W. Lauw
Singapore Management University
80 Stamford Road
Singapore 178902
hadywlauw@smu.edu.sg
ABSTRACT
Top-k recommendation seeks to deliver a personalized recommendation list of k items to a user. The dual objectives are (1) accuracy in identifying the items a user is likely to prefer, and (2) efficiency in constructing the recommendation list in real time. One direction towards retrieval efficiency is to formulate retrieval as approximate k nearest neighbor (kNN) search aided by indexing schemes, such as locality-sensitive hashing, spatial trees, and inverted index. These schemes, applied on the output representations of recommendation algorithms, speed up the retrieval process by automatically discarding a large number of potentially irrelevant items when given a user query vector. However, many previous recommendation algorithms produce representations that may not necessarily align well with the structural properties of these indexing schemes, eventually resulting in a significant loss of accuracy post-indexing. In this paper, we introduce Indexable Bayesian Personalized Ranking (Indexable BPR), which learns from ordinal preferences to produce representations that are inherently compatible with the aforesaid indices. Experiments on publicly available datasets show superior performance of the proposed model compared to state-of-the-art methods on the top-k recommendation retrieval task, achieving significant speedup while maintaining high accuracy.
1 INTRODUCTION
Today, we face a multitude of options in various spheres of life, e.g., deciding which product to buy at Amazon, selecting which movie to watch on Netflix, choosing which article to read on social media, etc. The number of possibilities is immense. Driven by necessity, service providers rely on recommendation algorithms to identify a manageable number k of the most preferred options to be presented to each user. Due to the limited screen real estate of devices (increasingly likely to be ever smaller mobile devices), the value of k may be relatively small (e.g., k = 10), yet the selection of items to be recommended is personalized to each individual.
To construct such personalized recommendation lists, we learn from users' historical feedback, which may be explicit (e.g., ratings) [12] or implicit (e.g., click behaviors) [22]. An established
methodology in the literature based on matrix factorization [23, 26] derives a latent vector x_u ∈ R^D for each user u, and a latent vector y_i ∈ R^D for each item i, where D is the dimensionality. The degree of preference of user u for item i is modeled as the inner product x_u^T y_i. To arrive at the recommendation for u, we need to identify the top-k items with the maximum inner product to x_u.
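To make the retrieval step concrete, the following Python sketch performs the exhaustive maximum-inner-product search that indexing later seeks to avoid; the matrix sizes and random vectors are purely illustrative stand-ins for learnt factors.

import numpy as np

# Hypothetical factorization output: 1,000 users and 50,000 items in D = 20 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))    # user latent vectors x_u
Y = rng.normal(size=(50000, 20))   # item latent vectors y_i

def topk_exhaustive(u, k=10):
    # Score every item by the inner product x_u^T y_i and keep the k largest.
    scores = Y @ X[u]                        # O(N * D) work per query
    cand = np.argpartition(-scores, k)[:k]   # unordered top-k candidates
    return cand[np.argsort(-scores[cand])]   # sort the k candidates by score

print(topk_exhaustive(u=0, k=10))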
There are two overriding goals for such top-k recommendations. One is accuracy, to maximize the correctness in placing items that user u prefers most into u's recommendation list. Another is efficiency; in particular we are primarily concerned with retrieval efficiency, to minimize the time taken to deliver the recommendation list upon request. Faster retrieval helps the system to cope with a large number of consumers, and minimizes their waiting time to receive recommendations. In contrast, learning efficiency, or minimizing model learning time, while useful, is arguably less mission-critical, as it can be done offline and involves mainly machine time, rather than human time. Therefore, we seek to keep the learning time manageable, while improving retrieval efficiency.
Many previous recommendation algorithms focus mainly on accuracy. One challenge in practice is the need for exhaustive search over all candidate items to identify the top-k, which is time-consuming when the number of items N is extremely large [11].
Problem. In this paper, we pay equal attention to both goals, i.e., optimizing retrieval efficiency of top-k recommendation without losing sight of accuracy. An effective approach to improve efficiency is to use indexing structures such as locality-sensitive hashing (LSH) [24], spatial trees (e.g., KD-tree [3]), and inverted index [4]. By indexing items' latent vectors, we can quickly retrieve a small candidate set of k "most relevant" items to the user query vector, probably in sub-linear time w.r.t. the number of items N. This avoids an exhaustive search, and saves on the computation for the large number of items that the index considers irrelevant.
Here, we focus on indexing as an alternative to exhaustive search in real time. Indexing is preferred over pre-computation of recommendation lists for all users, which is impractical [4, 11]. User interests change over time. New items appear. By indexing, we avoid the storage requirement of dealing with all possible user-item pairs. Index storage scales only with the number of items N, while the number of queries/users could be larger. Indexing flexibly allows the value of k to be specified at run-time.
However, most of the previous recommendation algorithms based on matrix factorization [12] are not designed with indexing in mind. The objective of recommending to user u those items with maximum inner product x_u^T y_i is not geometrically compatible with the aforementioned index structures. For one thing, it has been established that there cannot exist any LSH family for maximum inner product search [24]. For another, retrieval on a spatial tree index finds the nearest neighbors based on the Euclidean distance, which are not equivalent to those with maximum inner product [2]. In turn, [4] describes an inverted index scheme based on cosine similarity, which again is not equivalent to inner product search.
Approach. The key reason behind the incompatibility between the inner product search that matrix factorization relies on, and the aforesaid index structures, is that a user u's degree of preference for an item i, expressed as the inner product x_u^T y_i, is sensitive to the respective magnitudes of the latent vectors ||x_u||, ||y_i||. Therefore, one insight towards achieving geometric compatibility is to desensitize the effect of vector magnitudes. The challenge is how to do so while still preserving the accuracy of the top-k retrieval.
There are a couple of recent approaches in this direction. One approach [2] is a post-processing transformation that expands the latent vectors learnt from matrix factorization with an extra dimension to equalize the magnitude of all item vectors. Because the transformation is a separate process from learning the vectors, such a workaround would not be as effective as working with natively indexable vectors in the first place. Another approach [7] extends Bayesian Probabilistic Matrix Factorization [23] by making the item latent vectors natively of fixed length. Fitting the inner product to absolute rating values may not be suitable when only implicit feedback (not ratings) is available. Moreover, we note that top-k recommendation is inherently an expression of "relative" rather than "absolute" preferences, i.e., the ranking among items is more important than the exact scores.
We propose to work with ordinal expressions of preferences. Ordinal preferences can be expressed as a triple (u, i, j), indicating that a user u prefers an item i to a different item j. Ordinal representation is prevalent in modeling preferences [22], and also accommodates both explicit (e.g., ratings) and implicit feedback.
Contributions. This paper makes the following contributions:
First, we propose the Indexable Bayesian Personalized Ranking model, or Indexable BPR in short, which produces natively, geometrically indexable latent vectors for accurate and efficient top-k recommendation. BPR [22] is a generic framework for modeling ordinal triples. Each instantiation is based on a specific kernel [8, 13, 16, 20]. The original BPR [22] used a matrix factorization kernel, which is not well fitted to indexing structures. In contrast, our Indexable BPR is formulated with a kernel based on angular distances (see Section 3). In addition to requiring a different learning algorithm, we will show how this engenders native compatibility with various index structures.
Second, we describe how the resulting vectors are used with LSH, spatial tree, and inverted index for top-k recommendation in Section 4. We conduct experiments with available datasets to compare Indexable BPR with baselines. Empirically, we observe that Indexable BPR achieves a balance of accuracy and run-time efficiency, achieving higher accuracy than the baselines at the same speedup level, and higher speedup at the same accuracy level.
Third, to support the observation on the robustness of Indexable BPR, we provide a theoretical analysis in the context of LSH, further bolstered with empirical evidence, on why our reliance on angular distances results in more index-friendly vectors, smaller loss of accuracy post-indexing, and balanced all-round performance.
2 RELATED WORK
We review the literature related to the problem of efficiently retrieving top-k recommendations using indexing schemes.
Matrix Factorization. Matrix factorization is the basis of many recommendation algorithms [12]. For such models, top-k retrieval is essentially reduced to maximum inner product search, with complexity proportional to the number of items in the (huge) collection. This motivates approaches to improve the retrieval efficiency of top-k recommendation. Of interest to us are those that yield user and item latent vectors to be used with geometric index structures. This engenders compatibility with both spatial tree index and inverted index, as well as with hashing schemes, and transforms the problem into k-nearest neighbor (kNN) search.
One approach is the transformation scheme applied to matrix factorization output. [2, 19] propose a post-processing step that extends the output latent vectors by one dimension to equalize the magnitude of item vectors. Theoretical comparisons show that this Euclidean transformation achieves better hashing quality compared to the two previous methods in [24] and [25]. However, the Euclidean transformation results in a high concentration of new item points, affecting the retrieval accuracy of the approximate kNN. As Indexable BPR relies on ordinal triples, one appropriate baseline is to use the transformation scheme above on a comparable algorithm that also relies on triples. We identify BPR [22] with inner product or matrix factorization (MF) kernel, whose implementation is available^1, and refer to the composite as BPR(MF)+.
Another approach is to learn indexable vectors that fit ratings, which would not work with implicit feedback, e.g., ordinal triples. Indexable Probabilistic Matrix Factorization or IPMF [7] is a rating-based model with constraints to place item vectors on a hypersphere. We will see that IPMF does not optimize for a high mean of the normal distribution in Eq. 15 (see Section 5), and the ordinal-based Indexable BPR potentially performs better.
Others may not involve the standard index structures we study. [11] used representative queries identified by clustering. [21] invented another data structure (cone tree). [27-29] learnt binary codes, which are incompatible with the l2 distance used by spatial trees.
Euclidean Embedding. Euclidean embedding takes as input distances (or their ordinal relationships), and outputs low-dimensional latent coordinates for each point that preserve the input as much as possible [14]. Because they operate in the Euclidean space, the coordinates support nearest neighbor search using geometric index structures such as spatial trees.
There exist recent works on using Euclidean embedding to model user preferences over items, which we include as experimental baselines. The first method, Collaborative Filtering via Euclidean Embedding or CFEE [10], fits a rating r̂_ui by user u on item i in terms of the squared Euclidean distance between x_u and y_i. Fitting ratings directly does not preserve the pairwise comparisons. The second method, Collaborative Ordinal Embedding or COE [15], is based on ordinal triples. It expresses a triple t_uij through the Euclidean distance difference ||x_u − y_j|| − ||x_u − y_i||. COE's objective is to maximize this difference for each observation t_uij.

1 http://www.librec.net
3 INDEXABLE BPR
Problem. We consider a set of users U and a set of items I. We consider as input a set of triples T ⊂ U × I × I. A triple t_uij ∈ T relates one user u ∈ U and two different items i, j ∈ I, indicating u's preferring item i to item j. Such ordinal preference is prevalent, encompassing explicit and implicit feedback scenarios. When ratings are available, we can induce an ordinal triple for each instance where user u rates item i higher than she rates item j. Triples can also model implicit feedback [22]. E.g., when searching on the Web, one may click on website i and ignore j. When browsing products, one may choose to click or buy product i and skip j.
The goal is to derive a D-dimensional latent vector x_u ∈ R^D for each user u ∈ U, and a latent vector y_i ∈ R^D for each item i ∈ I, such that the relative preference of a user u over two items i and j can be expressed as a function (to be defined) of their corresponding latent vectors x_u, y_i, and y_j. We denote the collections of all user latent vectors and all item latent vectors as X and Y respectively.
Framework. Given the input triples T, we seek to learn the user and item vectors X, Y with the highest posterior probability:

$$\arg\max_{X,Y} P(X, Y \mid \mathcal{T}) \quad (1)$$

The Bayesian formulation for modeling this posterior probability is to decompose it into the likelihood of the triples P(T | X, Y) and the prior P(X, Y), as shown in Eq. 2.

$$P(X, Y \mid \mathcal{T}) \propto P(\mathcal{T} \mid X, Y)\, P(X, Y) \quad (2)$$

We will define the prior later when we discuss the generative process. For now, we focus on defining the likelihood, which can be decomposed into the probability for individual triples t_uij ∈ T.

$$P(\mathcal{T} \mid X, Y) = \prod_{t_{uij} \in \mathcal{T}} P(t_{uij} \mid x_u, y_i, y_j) \quad (3)$$
Weakness of Inner Product Kernel for Top-k Retrieval. To determine the probability for an individual triple, we need to define a kernel function. The kernel proposed by the matrix factorization-based (not natively indexable) BPR [22] is shown in Eq. 4 (σ is the sigmoid function). This assumes that if x_u^T y_i is higher than x_u^T y_j, then user u is more likely to prefer item i to j.

$$P(t_{uij} \mid x_u, y_i, y_j) = \sigma(x_u^T y_i - x_u^T y_j) \quad (4)$$

Since our intended application is top-k recommendation, once we learn the user and item latent vectors, the top-k recommendation task is reduced to searching for the k nearest neighbors to the query (user vector) among the potential answers (item vectors). A naive solution is to conduct exhaustive search over all the items.
An indexing-based approach could reduce the retrieval time significantly, by prioritizing or narrowing the search to a smaller search space. For the nearest neighbors identified by an index to be as accurate as possible, the notion of similarity (or distance) used by the index should be compatible with the notion of similarity of the underlying model that yields the user and item vectors.
Therein lies the issue with the inner product kernel described in Eq. 4. It is not necessarily compatible with geometric index structures that rely on similarity functions other than inner products.
Figure 1: An illustration of the incompatibility of the inner product kernel with a spatial tree index (Euclidean distance) and an inverted index (cosine similarity).

First, we examine its incompatibility with a spatial tree index. Suppose that all item latent vectors y_i are inserted into the index. To derive the recommendation for u, we use x_u as the query. Nearest neighbor search on a spatial tree index is expected to return items that are closest in terms of Euclidean distance. The relationship between Euclidean distance and inner product is expressed in Eq. 5. It implies that items with the closest Euclidean distances may not have the highest inner products, due to the magnitudes ||x_u|| and ||y_i||. Spatial tree index retrieval may be inconsistent with Eq. 4.

$$||x_u - y_i||^2 = ||x_u||^2 + ||y_i||^2 - 2\, x_u^T y_i \quad (5)$$

Second, we examine its incompatibility with an inverted index that relies on cosine similarity (Eq. 6). Similarly, the pertinence of the magnitudes ||x_u|| and ||y_i|| implies that inverted index retrieval may be inconsistent with maximum inner product search.

$$\cos(x_u, y_i) = \frac{x_u^T y_i}{||x_u|| \cdot ||y_i||} \quad (6)$$
Fig. 1 shows an example to illustrate the above analysis. In Fig. 1, the inner product x_u^T y_i is greater than x_u^T y_j, implying that u prefers i to j. However, the Euclidean distance computation shows that y_j is closer to x_u than y_i is. Also, the cosine similarity between x_u and y_i is smaller than that between x_u and y_j. This means that the inner product kernel of the model is not compatible with the operations of a spatial tree index relying on Euclidean distance, or an inverted index relying on cosine similarity.
Third, in terms of its incompatibility with LSH, we note that it has been established that there cannot exist any LSH family for maximum inner product search [24], while there exist LSH families for Euclidean distances and cosine similarity respectively.
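The disagreement illustrated in Fig. 1 is easy to reproduce numerically. The toy 2-D vectors below are invented solely for illustration; they are not taken from any learnt model.

import numpy as np

xu = np.array([1.0, 0.0])
yi = np.array([3.0, 2.5])   # large magnitude, larger angle to x_u
yj = np.array([0.9, 0.1])   # small magnitude, nearly parallel to x_u

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(xu @ yi > xu @ yj)                                   # True:  i wins on inner product
print(np.linalg.norm(xu - yi) < np.linalg.norm(xu - yj))   # False: j is closer in l2
print(cos(xu, yi) > cos(xu, yj))                           # False: j wins on cosine similarity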
Proposed Angular Distance Kernel. To circumvent the limitation of the inner product kernel, we propose a new kernel that expresses the probability of a triple t_uij in a way that is insensitive to vector magnitudes. A different kernel is a non-trivial, even significant, change as it requires a different learning algorithm.
Our proposed kernel is based on angular distance. Let θ_xy denote the angular distance between vectors x and y, evaluated as the arccos of the inner product between the normalized vectors.

$$\theta_{xy} = \cos^{-1}\left(\frac{x^T y}{||x|| \cdot ||y||}\right) \quad (7)$$

Proposing the angular distance, i.e., the arccos of the cosine similarity, to formulate the user-item association is a novel and appropriate design choice for the following reasons.
• Firstly, since arccos is a monotone function, the closest point according to the angular distance is the same as the point with the highest cosine similarity, resulting in its compatibility with the inverted index structure.
• Secondly, since angular distances are not affected by magnitudes, the kernel preserves all the information learnt by the model. Before indexing, the learnt vectors can be normalized to unit length for compatibility with indexing that relies on either Euclidean distance or cosine similarity.
• Lastly, the angular distance is also compatible with LSH indexing. A theoretical analysis and empirical evidence on this compatibility are provided in Section 5.
While the user x_u and item y_i vectors we learn could be of varying lengths, the magnitudes are uninformative as far as the user preferences encoded by the triples are concerned. This advantageously allows greater flexibility in parameter learning, while still controlling the vectors via the regularization terms, as opposed to constraining vectors to a fixed length during learning (as in [7]).
We formulate the probability of a triple t_uij for Indexable BPR as in Eq. 8. The probability is higher when the difference θ_{x_u y_j} − θ_{x_u y_i} is larger. If u prefers i to j, the angular distance between x_u and y_i is expected to be smaller than that between x_u and y_j.

$$P(t_{uij} \mid x_u, y_i, y_j) = \sigma(\theta_{x_u y_j} - \theta_{x_u y_i}) \quad (8)$$
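A minimal Python sketch of the kernel in Eq. 8 is given below; it computes P(t_uij | x_u, y_i, y_j) directly from raw (unnormalized) vectors, and the example vectors are hypothetical.

import numpy as np

def angular_distance(x, y):
    # theta_xy: arccos of the cosine similarity (Eq. 7); insensitive to magnitudes.
    c = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(c, -1.0, 1.0))   # clip guards against rounding error

def triple_probability(xu, yi, yj):
    # P(t_uij | x_u, y_i, y_j) = sigmoid(theta_{x_u y_j} - theta_{x_u y_i}) (Eq. 8).
    delta = angular_distance(xu, yj) - angular_distance(xu, yi)
    return 1.0 / (1.0 + np.exp(-delta))

xu, yi, yj = np.array([1.0, 0.2]), np.array([0.9, 0.3]), np.array([-0.2, 1.0])
print(triple_probability(xu, yi, yj))   # > 0.5, since y_i is angularly closer to x_u than y_j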
Generative Process. The proposed model Indexable BPR as a whole can be expressed by the following generative process:
(1) For each user u ∈ U: draw x_u ∼ Normal(0, η²I),
(2) For each item i ∈ I: draw y_i ∼ Normal(0, η²I),
(3) For each triple of one user u ∈ U and two items i, j ∈ I:
• Draw a trial from Bernoulli(P(t_uij | x_u, y_i, y_j)),
• If "success", generate a triple instance t_uij,
• Otherwise, generate a triple instance t_uji.
The first two steps place zero-mean multivariate spherical Gaussian priors on the user and item latent vectors. η² denotes the variance of the Normal distributions; for simplicity we use the same variance for users and items. I denotes the identity matrix. This acts as a regularizer for the vectors, and defines the prior P(X, Y).

$$P(X, Y) = (2\pi\eta^2)^{-\frac{D}{2}} \prod_{u \in \mathcal{U}} e^{-\frac{||x_u||^2}{2\eta^2}} \prod_{i \in \mathcal{I}} e^{-\frac{||y_i||^2}{2\eta^2}} \quad (9)$$

Triples in T are generated from the users' and items' latent vectors according to the probability P(t_uij | x_u, y_i, y_j) as defined in Eq. 8.
Parameter Learning. Maximizing the posterior as outlined in Eq. 2 is equivalent to maximizing its logarithm, shown below.

$$\mathcal{L} = \ln P(\mathcal{T} \mid X, Y) + \ln P(X, Y) \;\propto\; \ln P(\mathcal{T} \mid X, Y) - \frac{1}{\eta^2}\sum_{u \in \mathcal{U}} ||x_u||^2 - \frac{1}{\eta^2}\sum_{i \in \mathcal{I}} ||y_i||^2 \quad (10)$$

Let us denote Δ_uij = θ_{x_u y_j} − θ_{x_u y_i}, x̃_u = x_u/||x_u|| for all u ∈ U, and ỹ_i = y_i/||y_i|| for all i ∈ I. The gradient of L w.r.t. each user vector x_u is:
$$\frac{\partial \mathcal{L}}{\partial x_u} = \sum_{\{i,j\,:\,t_{uij} \in \mathcal{T}\}} \frac{1}{||x_u||^2} \cdot \frac{e^{-\Delta_{uij}}}{1 + e^{-\Delta_{uij}}} \left( \frac{-\tilde{y}_j\,||x_u|| + \cos(x_u, y_j)\,x_u}{\sqrt{1 - \cos(x_u, y_j)^2}} - \frac{-\tilde{y}_i\,||x_u|| + \cos(x_u, y_i)\,x_u}{\sqrt{1 - \cos(x_u, y_i)^2}} \right),$$

in which cos(x_u, y_i) = x_u^T y_i / (||x_u|| · ||y_i||) for all u ∈ U and i ∈ I.
Algorithm 1 Gradient Ascent for Indexable BPR
Input: ordinal triples set T = {t_uij, ∀u ∈ U, i, j ∈ I}.
1: Initialize x_u for u ∈ U, y_i for i ∈ I
2: while not converged do
3:   for each u ∈ U do
4:     x_u ← x_u + ϵ · ∂L/∂x_u
5:   for each i ∈ I do
6:     y_i ← y_i + ϵ · ∂L/∂y_i
7: Return {x̃_u = x_u/||x_u||}_{u ∈ U} and {ỹ_i = y_i/||y_i||}_{i ∈ I}
The gradient of L w.r.t. each item vector y_k is:

$$\frac{\partial \mathcal{L}}{\partial y_k} = \sum_{\{u,j\,:\,t_{ukj} \in \mathcal{T}\}} \frac{1}{||y_k||^2} \cdot \frac{e^{-\Delta_{ukj}}}{1 + e^{-\Delta_{ukj}}} \cdot \frac{\tilde{x}_u\,||y_k|| - \cos(x_u, y_k)\,y_k}{\sqrt{1 - \cos(x_u, y_k)^2}} \;+\; \sum_{\{u,i\,:\,t_{uik} \in \mathcal{T}\}} \frac{1}{||y_k||^2} \cdot \frac{e^{-\Delta_{uik}}}{1 + e^{-\Delta_{uik}}} \cdot \frac{-\tilde{x}_u\,||y_k|| + \cos(x_u, y_k)\,y_k}{\sqrt{1 - \cos(x_u, y_k)^2}}.$$
Algorithm 1 describes the learning algorithm with full gradient ascent. It first initializes the users' and items' latent vectors. In each iteration, the model parameters are updated based on the gradients, with a learning rate ϵ that decays over time. The output is the set of normalized user vectors x̃_u and item vectors ỹ_i. On one hand, this normalization does not affect the accuracy of the top-k recommendation produced by Indexable BPR, since the magnitude of the latent vectors does not affect the ranking. On the other hand, normalized vectors can be used for approximate kNN search using various indexing data structures later. The time complexity of the algorithm is linear in the number of triples in T, i.e., O(|U| × |I|²).
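The closed-form gradients above are easy to mistype, so the sketch below optimizes the same objective (Eq. 10) with finite-difference gradients on toy data instead; the data, learning rate, iteration count, and regularization weight are illustrative assumptions, not the settings used in the paper.

import numpy as np

def theta(x, y):
    c = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(c, -1.0, 1.0))

def log_posterior(X, Y, triples, reg=0.001):
    # Objective of Eq. 10: triple log-likelihood plus Gaussian-prior regularizer.
    ll = sum(np.log(1.0 / (1.0 + np.exp(-(theta(X[u], Y[j]) - theta(X[u], Y[i])))))
             for u, i, j in triples)
    return ll - reg * ((X ** 2).sum() + (Y ** 2).sum())

def ascend(X, Y, triples, lr=0.05, iters=200, eps=1e-5):
    # Full gradient ascent as in Algorithm 1, with numerical gradients for brevity.
    for _ in range(iters):
        for P in (X, Y):
            G = np.zeros_like(P)
            for idx in np.ndindex(P.shape):
                old = P[idx]
                P[idx] = old + eps; hi = log_posterior(X, Y, triples)
                P[idx] = old - eps; lo = log_posterior(X, Y, triples)
                P[idx] = old
                G[idx] = (hi - lo) / (2 * eps)
            P += lr * G
    # Return unit-length vectors, as Algorithm 1 does.
    return (X / np.linalg.norm(X, axis=1, keepdims=True),
            Y / np.linalg.norm(Y, axis=1, keepdims=True))

rng = np.random.default_rng(1)
X, Y = rng.normal(size=(2, 3)), rng.normal(size=(4, 3))
triples = [(0, 1, 2), (1, 0, 3)]   # user 0 prefers item 1 to 2; user 1 prefers item 0 to 3
Xn, Yn = ascend(X, Y, triples)
print(theta(Xn[0], Yn[1]), theta(Xn[0], Yn[2]))   # the preferred item should be angularly closer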
4 EXPERIMENTS ON TOP-K RECOMMENDATION WITH INDEXING
The key idea in this paper is achieving speedup in the retrieval time of top-k recommendation via indexing, while still maintaining high accuracy via better representations that minimize any loss of information post-indexing. Hence, in the following evaluation, we are interested in both the accuracy of the top-k recommendation returned by the index, and the speedup in retrieval time due to indexing as compared to exhaustive search.
To showcase the generality of Indexable BPR in accommodating various index structures, we experiment with three indexing schemes: locality-sensitive hashing, spatial tree index, and inverted index. Note that our focus is on the relative merits of recommendation algorithms, rather than on the relative merits of index structures. It is our objective to investigate the effectiveness of Indexable BPR, as compared to other algorithms, for top-k recommendation when using these index structures. Yet, it is not our objective to compare the index structures among themselves.
Comparative Methods. We compare our proposed Indexable BPR with the following recommendation algorithm baselines:
• BPR(MF): the non-index-friendly BPR with inner product (MF) kernel [22]. This would validate whether our angular distance kernel is more index-friendly.
• BPR(MF)+: a composite of BPR(MF) and the Euclidean transformation described in [2] to make the item vectors indexable as post-processing. This allows validation of our learning inherently indexable vectors in the first place.
• IPMF: matrix factorization that learns fixed-length item vectors but fits rating scores [7]. This allows validation of our modeling of ordinal triples.
• CFEE: Euclidean embedding that fits rating scores [10]. This allows validation of our modeling of ordinal triples.
• COE: Euclidean embedding that fits ordinal triples [15]. Comparison to CFEE and COE allows validation of our compatibility with non-spatial indices such as some LSH families as well as inverted index.
We tune the hyper-parameters of all models for the best performance. For IPMF, we adopt the parameters provided by its authors for the Netflix dataset. For the ordinal-based algorithms (BPR, COE, and Indexable BPR), the learning rate and the regularization are 0.05 and 0.001. For CFEE, they are 0.1 and 0.001. All models use D = 20 dimensions in their latent representations. Similar trends are observed across other dimensionalities (see Sec. 5).

Table 1: Datasets

                 #users    #items   #ratings      #training ordinal triples
MovieLens 20M    138,493   27,278   20,000,263    5.46 × 10^8
Netflix          480,189   17,770   100,480,507   2.29 × 10^10
Datasets. We experiment on two publicly available rating-based datasets and derive ordinal triples accordingly. One is MovieLens 20M^2, the largest among the MovieLens collection. The other is Netflix^3. Table 1 shows a summary of these datasets. By default, MovieLens 20M includes only users with at least 20 ratings. For consistency, we apply the same to Netflix. For each dataset, we randomly keep 60% of the ratings for training and hide 40% for testing. We conduct stratified sampling to maintain the same ratio for each user. We report the average results over five training/testing splits. For training, we generate a triple t_uij if user u has a higher rating for item i than for j, and triples are formed within the training set.
As mentioned earlier, our focus in this work is on online retrieval speedup. We find that the model learning time, which is offline, is manageable. Our learning times for MovieLens 20M and Netflix are 5.2 and 9.3 hours respectively on a computer with an Intel Xeon E2650v4 2.20GHz CPU and 256GB RAM. Algorithm 1 scales with the number of triples, which in practice grows more slowly than its theoretical complexity of O(|U| × |I|²). Figure 2 shows how the average number of triples per user grows with the number of items, showing that the actual growth is closer to linear and lower than the quadratic curve provided as reference.
2 http://grouplens.org/datasets/movielens/20m/
3 http://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9a

Figure 2: Number of triples (per user) vs. number of items. (a) MovieLens 20M; (b) Netflix.

Recall. We assume that the goal of top-k recommendation is to recommend new items to a user, among the items not seen in the training set. When retrieval is based on an index, the evaluation of top-k necessarily takes into account the operation of the index. Because we maintain one index for all items to be used with all
users, conceivably items returned by a top-k query may belong to one of three categories: those in the training set (to be excluded for new item recommendation), those in the test set (of interest as these are the known ground truth of which items users prefer), and those not seen/rated in either set (for which no ground truth of user preference is available). It is important to note that the latter may not necessarily be bad recommendations; they are simply unknown. Precision of the top-k may penalize such items.
We reason that among the rated items in the test set, those that have been assigned the maximum rating possible by a user would be expected to appear in the top-k recommendation list for that user. A suitable metric is the recall of items in the test set with maximum rating. For each user u with at least one highest-rating item in the test set (for the two datasets, the highest possible rating value is 5), we compute the percentage of these items that are returned in the top-k by the index. The higher the percentage, the better the performance of the model at identifying the items a user prefers the most. Eq. 11 presents the formula for Recall@k:

$$\text{Recall@}k = \frac{1}{|\mathcal{U}_{max}|} \sum_{u \in \mathcal{U}_{max}} \frac{|\{i \in \psi^u_k : r_{ui} = \text{max rating}\}|}{|\{i \in \mathcal{I} : r_{ui} = \text{max rating}\}|}, \quad (11)$$

in which U_max is the set of users who have given at least one item a rating of 5 and ψ^u_k is the top-k returned by the index. We exclude training items for u from both numerator and denominator. We normalize Recall@k by the ideal Recall@k that a perfect algorithm can achieve, and denote the metric as nRecall@k.
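The following Python sketch computes the (unnormalized) Recall@k of Eq. 11; the dictionaries holding the index output, the maximum-rated test items, and the training items are hypothetical placeholders.

def recall_at_k(topk_by_user, max_rated_by_user, train_items_by_user):
    # Recall@k (Eq. 11): fraction of a user's maximum-rated test items returned in the
    # top-k, averaged over users with at least one such item; training items excluded.
    vals = []
    for u, relevant in max_rated_by_user.items():
        train = train_items_by_user.get(u, set())
        relevant = relevant - train
        if not relevant:
            continue
        returned = set(topk_by_user[u]) - train
        vals.append(len(returned & relevant) / len(relevant))
    return sum(vals) / len(vals)

topk = {0: [5, 7, 9], 1: [2, 3, 4]}
max_rated = {0: {7, 8}, 1: {3}}
train = {0: {1}, 1: set()}
print(recall_at_k(topk, max_rated, train))   # (1/2 + 1/1) / 2 = 0.75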
Speedup. To investigate the efficacy of using the indexing schemes for top-k recommendation, we introduce a second metric, speedup, which is the ratio between the time taken by exhaustive search to return the top-k and the time taken by an index.

$$\text{Speedup} = \frac{\text{Retrieval time taken by exhaustive search}}{\text{Retrieval time taken by the index}}. \quad (12)$$

We will discuss the results in terms of the trade-off between recall and speedup. There are index parameters that control the degree of approximation, i.e., higher speedup at the expense of lower recall. Among the comparative recommendation algorithms, a better trade-off means higher speedup at the same recall, or higher recall at the same speedup. For each comparison below, we control for the indexing scheme, as different schemes vary in ways of achieving approximation, implementations, and deployment scenarios.
4.1 Top-k Recommendation with LSH Index

Figure 3: nRecall@k with hash table lookup strategy (T = 10 hash tables). (a) MovieLens 20M; (b) Netflix.

We first briefly review LSH and how it is used for top-k recommendation. Let h = (h_1, h_2, ..., h_b) be a set of LSH hash functions.
Each function assigns a bit to each vector: h will assign each user u a binary code h(x_u), and each item i a binary hashcode h(y_i), all of length b. Assuming that user u prefers item i to item j, h is expected to produce binary hashcodes with a smaller Hamming distance ||h(x_u) − h(y_i)||_H than ||h(x_u) − h(y_j)||_H.
The most frequent indexing strategy for LSH is hash table lookup. We store item codes in hash tables, with items having the same code in the same bucket. Given a query (user) code, we can determine the corresponding bucket in constant time. We search for the top-k only among items in that bucket, reducing the number of items on which we need to perform exact similarity computations.
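A single-table Python sketch of this lookup flow is shown below, using signed random projections as the hash family. In practice T tables are maintained and their candidate sets combined; the code length, item count, and random vectors here are assumptions for illustration.

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
D, b = 20, 12                                   # latent dimensionality and code length
A = rng.normal(size=(b, D))                     # b random hyperplanes (one sign bit each)

def srp_code(v):
    # b-bit SRP-LSH code of a vector.
    return tuple((A @ v > 0).astype(np.int8))

# Index (hypothetical) unit-length item vectors into one hash table.
Y = rng.normal(size=(5000, D))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)
table = defaultdict(list)
for i, y in enumerate(Y):
    table[srp_code(y)].append(i)

def topk_lookup(xu, k=10):
    # Look up the user's bucket, then score only the items that share its code.
    cand = table.get(srp_code(xu), [])
    if not cand:
        return []
    scores = Y[cand] @ (xu / np.linalg.norm(xu))   # cosine order = angular order
    return [cand[i] for i in np.argsort(-scores)[:k]]

print(topk_lookup(rng.normal(size=D)))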
We use the LSH package developed by [1]. The LSH family used by Indexable BPR for generating hashcodes is SRP-LSH, which is also used for IPMF following [7]. We apply it to BPR(MF) and BPR(MF)+, as [25] and [19] claim it to be the more suitable family for transformed vectors. In turn, the LSH scheme for COE and CFEE is L2-LSH, since both use the l2 distance. In Section 5, we will elaborate with theoretical analysis and empirical evidence on how more compatible representations tend to produce better results.
When using hash tables, one specifies the number of tables T and the code length b. We experiment with various T, and T = 10 returns the best performance (consistent with [7]). We also vary b; a larger b is expected to lead to fewer items in each bucket.
Figure 3(a) shows the nRecall@k using hash table lookup with T = 10 tables and different values of code length b = 8, 12, 16 for MovieLens 20M. Across the b's, the trends are similar. Indexable BPR has the highest nRecall@k values across all k. It outperforms BPR(MF)+, which conducts vector transformation as post-processing, which indicates that learning inherently indexable vectors is helpful. In turn, BPR(MF)+ outperforms BPR(MF), which indicates that the inner product kernel is not conducive to indexing. Interestingly, Indexable BPR also performs better than models that fit ratings (IPMF, CFEE), suggesting that learning from relative comparisons may be more suitable for top-k recommendation.

Figure 4: nRecall@10 vs. speedup with hash table lookup strategy (T = 10 hash tables). (a) MovieLens 20M; (b) Netflix.
Figure 3(b) shows the results for Netflix. Again, Indexable BPR has the highest nRecall@k values across all k. The relative comparisons among the baselines are as before, except that IPMF is now more competitive, though still lower than Indexable BPR.
We also investigate the trade-off between the speedup achieved and the accuracy of the top-k returned by the index. Fig. 4 shows the nRecall@10 and the speedup when varying the value of b. Given the same speedup, Indexable BPR achieves significantly higher performance than the baselines. As b increases, the speedup increases and nRecall@10 decreases. This is expected, as the longer the hashcodes, the smaller the set of items on which the system needs to perform similarity computation. This reflects the trade-off between speedup and approximation quality.
4.2 Top-k Recommendation with KD-Tree Index
Spatial trees refer to a family of methods that recursively partition the data space towards a balanced binary search tree, in which each node encompasses a subset of the data points [17]. For algorithms that model the user-item association by l2 distance, spatial trees can be used to index the item vectors. Top-k recommendation is thus equivalent to finding the kNN to the query. The tree will locate the nodes that the query belongs to, and exact similarity computation is performed only on the points indexed by those nodes.

Figure 5: nRecall@k with KD-Tree indexing. (a) MovieLens 20M; (b) Netflix.
For Indexable BPR, Algorithm 1 returns two sets of normalized vectors x̃_u for all u ∈ U and ỹ_i for all i ∈ I. We observe that:

$$||\tilde{x}_u - \tilde{y}_i|| < ||\tilde{x}_u - \tilde{y}_j|| \;\Leftrightarrow\; \tilde{x}_u^T \tilde{y}_i > \tilde{x}_u^T \tilde{y}_j \;\Leftrightarrow\; \theta_{\tilde{x}_u \tilde{y}_i} < \theta_{\tilde{x}_u \tilde{y}_j}, \quad (13)$$

i.e., the ranking of items according to the l2 distance on normalized vectors is compatible with that according to angular distance, implying that Indexable BPR's output can support kNN search using a spatial tree.
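As a sketch of this use of spatial trees, the snippet below indexes normalized item vectors with SciPy's cKDTree; the paper uses the KD-tree implementation of [18], so SciPy here is only a stand-in, and the random vectors substitute for the learnt ones.

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
Y = rng.normal(size=(10000, 20))
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # normalized user vectors (Algorithm 1 output)
Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)   # normalized item vectors

tree = cKDTree(Yn)                   # index items by l2 distance on the unit sphere
dist, idx = tree.query(Xn[0], k=10)  # l2 kNN on normalized vectors = angular kNN (Eq. 13)
print(idx)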
In this paper, we consider a well-known tree structure, the KD-tree. Approximate kNN retrieval can be achieved by restricting the search time on the tree ([7]). The implementation of the KD-tree in [18] controls this by c, the number of nodes to explore on the tree. Figure 5 shows the nRecall@k for c ∈ {500, 1000, 1500}.
We also experimented with c ∈ {50, 150, 300, 750, 2000} and observed similar trends. Indexable BPR consistently outperforms the baselines at all values of c. Notably, Indexable BPR outperforms BPR(MF)+, which in turn outperforms BPR(MF), validating the point made earlier about native indexability. Figure 6 plots the accuracy in terms of nRecall@10 vs. the retrieval efficiency in terms of speedup. As we increase c, a longer search time on the KD-tree is allowed, resulting in higher quality of the returned top-k. Here too, Indexable BPR achieves higher accuracy at the same speedup, and higher speedup at the same accuracy, as compared to the baselines.
Figure 6: nRecall@10 vs. speedup with KD-tree indexing. (a) MovieLens 20M; (b) Netflix.

4.3 Top-k Recommendation with Inverted Index
For recommendation retrieval, [4] presents an inverted index scheme, where every user or item is represented with a sparse vector derived from their respective dense real-valued latent vectors via a transformation. Given the user sparse vector as query, the inverted index will return as candidates the items with at least one common non-zero element with the query. Exact similarity computation is performed only on those candidates to find the top-k.
Here, we describe the indexing scheme very briefly. For an extended treatment, please refer to [4]. The sparse representations for users and items are obtained from their dense latent vectors (learnt by the recommendation algorithm, e.g., Indexable BPR) through a set of geometry-aware permutation maps Φ defined on a tessellated unit sphere. The tessellating vectors are generated from a base set B_d = {−1, −(d−1)/d, ..., −1/d, 0, 1/d, ..., (d−1)/d, 1}, characterized by a parameter d. The obtained sparse vectors have sparsity patterns that are related to the angular closeness between the original latent vectors. The angular closeness between user vector x_u and item vector y_i is defined as d_ac(x_u, y_i) = 1 − x_u^T y_i / (||x_u|| · ||y_i||).
In the case of ||x_u|| = ||y_i|| = 1 for all u ∈ U, i ∈ I, we have (∀ i, j ∈ I):

$$d_{ac}(x_u, y_i) < d_{ac}(x_u, y_j) \;\Leftrightarrow\; \frac{x_u^T y_i}{||x_u|| \cdot ||y_i||} > \frac{x_u^T y_j}{||x_u|| \cdot ||y_j||} \quad (14)$$

The item ranking according to d_ac is equivalent to that according to the θ-angular distance. We hypothesize that Indexable BPR, based on angular distance, would be compatible with this structure.
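This equivalence is straightforward to check numerically; the random unit vectors below are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
xu = rng.normal(size=20); xu /= np.linalg.norm(xu)
Y = rng.normal(size=(1000, 20)); Y /= np.linalg.norm(Y, axis=1, keepdims=True)

d_ac = 1.0 - Y @ xu                          # angular closeness of [4] on unit vectors
theta = np.arccos(np.clip(Y @ xu, -1, 1))    # angular distance used by Indexable BPR
print(np.array_equal(np.argsort(d_ac), np.argsort(theta)))   # same ranking (Eq. 14); should print True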
The parameter d can be managed to control the trade-off between the efficiency and the quality of approximation of kNN retrieval. Increasing the value of d leads to a higher number of items discarded by the inverted index, which leads to higher speedup of the top-k recommendation retrieval.
Figure 7: nRecall@k with inverted indexing. (a) MovieLens 20M; (b) Netflix.
We run the experiments with different values of the parameter d to explore the trade-off between speed and accuracy. Figure 7 presents the nRecall@k on the two datasets at d ∈ {150, 300, 500}. In all cases, Indexable BPR outperforms the baselines in terms of nRecall@k. This suggests that Indexable BPR produces a representation that has a greater degree of compatibility in terms of the angular closeness d_ac between users and their preferred items. As a result, the corresponding sparse vectors will have highly similar sparsity patterns, which enhances the quality of kNN search using the inverted index. Figure 8 shows the speedup using the inverted index as we vary the value of the parameter d. We observe that the speedup increases as d increases. Indexable BPR shows superior performance compared to other models, given the same speedup.
Overall, Indexable BPR works well with the indexing schemes. Effectively, we develop a model that works with multiple indices, and leave the choice of index structure to the respective application based on need. Our focus is on indexable recommendation algorithms. Here, several consistent observations emerge. Indexable BPR produces representations that are more amenable to indexing than those of the baselines BPR(MF)+ and BPR(MF). This validates the aim of Indexable BPR in learning natively indexable vectors for users and items. It also outperforms models that fit ratings, as opposed to ordinal triples, for top-k recommendation.
Figure 8: nRecall@10 vs. speedup with inverted indexing, for d ∈ {50, 100, 150, 200, 300, 500, 750}. (a) MovieLens 20M; (b) Netflix.

5 ANALYSIS ON LSH-FRIENDLINESS OF INDEXABLE BPR
In an effort to further explain the outperformance of Indexable BPR when used with LSH, we analyze the compatibility between recommendation algorithms and hashing functions. Since LSH is inherently an approximate method, the loss of information caused by random hash functions is inevitable. Informally, a representation is LSH-friendly if the loss after hashing is as small as possible. To achieve such a small loss, a user's ranking of items based on the latent vectors should be preserved by the hashcodes.
Analysis. For x_u, y_i, y_j in R^D, one can estimate the probability of the corresponding hashcodes preserving the correct ordering between them. Let us consider the distribution of the Hamming distance ||h(x_u) − h(y_i)||_H. Since the hash functions h_1, h_2, ..., h_b are independent of one another, ||h(x_u) − h(y_i)||_H follows the binomial distribution with mean b·p_{x_u y_i} and variance b·p_{x_u y_i}(1 − p_{x_u y_i}), where p_{x_u y_i} is the probability of x_u and y_i having different hash values (this probability depends on the specific family of hash functions). Since a binomial distribution can be approximated by a normal distribution with the same mean and variance, and the difference between two normal distributions is another normal distribution, we have:

$$||h(x_u) - h(y_j)||_H - ||h(x_u) - h(y_i)||_H \;\sim\; \text{Normal}\big(b p_{x_u y_j} - b p_{x_u y_i},\; b p_{x_u y_j}(1 - p_{x_u y_j}) + b p_{x_u y_i}(1 - p_{x_u y_i})\big), \quad (15)$$

and we are interested in the probability that this difference is greater than zero.
Due to the shape of the normal distribution, Eq. 15 implies that a higher mean and a smaller variance lead to a higher probability that the hashcode of x_u is more similar to the hashcode of y_i than to that of y_j. Therefore, for a fixed length b, if indeed u prefers i to j, we say that (x_u, y_i, y_j) is a more LSH-friendly representation for u, i, and j if the mean value (p_{x_u y_j} − p_{x_u y_i}) is higher and the variance (p_{x_u y_j}(1 − p_{x_u y_j}) + p_{x_u y_i}(1 − p_{x_u y_i})) is smaller.
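The sketch below evaluates this normal approximation (Eq. 15) for hypothetical per-bit disagreement probabilities, showing how a larger gap p_{x_u y_j} − p_{x_u y_i} and a longer code raise the chance that the ordering survives hashing.

import numpy as np
from scipy.stats import norm

def order_preservation_prob(p_ui, p_uj, b):
    # Normal approximation (Eq. 15) to the probability that b-bit hashcodes place
    # item i closer to the user than item j, given per-bit disagreement probabilities.
    mean = b * (p_uj - p_ui)
    var = b * (p_uj * (1 - p_uj) + p_ui * (1 - p_ui))
    return 1.0 - norm.cdf(0.0, loc=mean, scale=np.sqrt(var))

print(order_preservation_prob(p_ui=0.30, p_uj=0.45, b=16))   # ~0.81
print(order_preservation_prob(p_ui=0.30, p_uj=0.45, b=64))   # ~0.96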
Figure 9: LSH-friendliness measurement at D = 20.

MovieLens 20M    CFEE    COE     IPMF    BPR(MF)  BPR(MF)+  Indexable BPR
MeanNorm@10      0.137   0.188   0.065   0.017    0.023     0.219
VarNorm@10       0.726   0.576   0.484   0.171    0.138     0.428

Netflix          CFEE    COE     IPMF    BPR(MF)  BPR(MF)+  Indexable BPR
MeanNorm@10      0.163   0.080   0.072   0.018    0.025     0.247
VarNorm@10       0.699   0.755   0.480   0.192    0.146     0.424

Hence, the mean and the variance in Eq. 15 could potentially reveal which representation is more LSH-friendly, i.e., preserves information better after hashing. For each user u ∈ U, let τ^u_k be the set of items in the top-k returned by a method before hashing, and let τ̄^u_k be all the other items not returned by the model. We are interested in whether, after hashing, the items in τ^u_k would be closer to the user than the items in τ̄^u_k. To capture this potential, we introduce two measures, MeanNorm@k and VarNorm@k:
$$\text{MeanNorm@}k = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \sum_{i \in \tau^u_k} \sum_{j \in \bar{\tau}^u_k} \frac{p_{x_u y_j} - p_{x_u y_i}}{|\tau^u_k| \cdot |\bar{\tau}^u_k|}$$

$$\text{VarNorm@}k = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \sum_{i \in \tau^u_k} \sum_{j \in \bar{\tau}^u_k} \frac{p_{x_u y_j}(1 - p_{x_u y_j}) + p_{x_u y_i}(1 - p_{x_u y_i})}{|\tau^u_k| \cdot |\bar{\tau}^u_k|}$$
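A Python sketch of these two measures under SRP-LSH (where p_xy = θ_xy/π, Eq. 16) is given below; the latent vectors and pre-hashing top-k lists are hypothetical inputs.

import numpy as np

def srp_disagreement_prob(x, y):
    # p_xy for SRP-LSH: theta_xy / pi (Eq. 16).
    c = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(c, -1.0, 1.0)) / np.pi

def mean_var_norm(X, Y, topk_by_user):
    # MeanNorm@k and VarNorm@k averaged over users; topk_by_user[u] holds user u's
    # pre-hashing top-k item indices.
    means, variances = [], []
    for u, topk in enumerate(topk_by_user):
        rest = np.setdiff1d(np.arange(len(Y)), topk)
        p_top = np.array([srp_disagreement_prob(X[u], Y[i]) for i in topk])
        p_rest = np.array([srp_disagreement_prob(X[u], Y[j]) for j in rest])
        means.append(np.mean(p_rest[None, :] - p_top[:, None]))
        variances.append(np.mean(p_rest * (1 - p_rest)) + np.mean(p_top * (1 - p_top)))
    return np.mean(means), np.mean(variances)

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(3, 8)), rng.normal(size=(50, 8))
topk = [np.argsort(-(Y @ x))[:10] for x in X]   # stand-in "pre-hashing" top-10 lists
print(mean_var_norm(X, Y, topk))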
To achieve an LSH-friendly representation, MeanNorm@k should be high and VarNorm@k should be low. Fig. 9 shows bar charts displaying the values of these metrics. From Fig. 9, Indexable BPR shows higher mean values MeanNorm@10 (i.e., k = 10) at D = 20 (we observe the same results with other values of D and k). Though BPR(MF) and BPR(MF)+ have smaller variance, their mean values are among the lowest. This result gives us a hint that Indexable BPR can preserve information after hashing more effectively.
Compatible Hash Function. There is an explanation for the superior numbers of Indexable BPR in Fig. 9. Specifically, the probability p_{x_u y_i} depends on the LSH family. In particular, signed random projections [5, 9], or SRP-LSH, is meant for angular similarity. The angular similarity between x, y is defined as sim∠(x, y) = 1 − cos⁻¹(x^T y / (||x|| · ||y||)) / π. The parameter a is a random vector with each component drawn i.i.d. from a normal distribution. The hash function is defined as h^srp_a(x) = sign(a^T x), and the probability of x, y having different hash values is:

$$p_{xy} = \Pr\big(h^{srp}_a(x) \neq h^{srp}_a(y)\big) = \cos^{-1}\left(\frac{x^T y}{||x|| \cdot ||y||}\right) / \pi = \frac{\theta_{xy}}{\pi}. \quad (16)$$
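The relation p_xy = θ_xy/π is easy to verify empirically by drawing many independent SRP hash functions; the vectors and sample size below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=20), rng.normal(size=20)
theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

A = rng.normal(size=(100000, 20))             # many independent SRP hash functions
disagree = np.mean((A @ x > 0) != (A @ y > 0))
print(disagree, theta / np.pi)                # empirical disagreement rate matches theta/pi (Eq. 16)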
For Indexable BPR, as shown in Eq. 8, for each observation "u prefers i to j", we would like to maximize the difference θ_{x_u y_j} − θ_{x_u y_i}. From Eq. 16, we observe that the probability p_{x_u y_i} is a linear function of the angular distance θ_{x_u y_i}. Thus, we can infer that Indexable BPR's objective corresponds to maximizing p_{x_u y_j} − p_{x_u y_i}.
Figure 10: nDCG@10 at D ∈ {5, 10, 20, 30, 50, 75, 100}, for MovieLens 20M and Netflix.
According to Eq. 15, this increases the probability that the Hamming distance between u and i is smaller than that between u and j. In other words, the hashcodes are likely to preserve the ranking order. This alignment between the objective of Indexable BPR and the structural property of SRP-LSH implies that Indexable BPR is more LSH-friendly, which helps the model minimize information loss and show better post-indexing performance.
Also, the appropriate LSH family for methods based on the l2 distance, which includes COE, is L2-LSH [6]. However, there is a question as to how compatible the objective of COE is with these hash functions. The hash function of L2-LSH is defined as follows:

$$h^{L2}_{a,b}(x) = \left\lfloor \frac{a^T x + b}{r} \right\rfloor; \quad (17)$$

where r is the window size, a is a random vector with each component drawn i.i.d. from a normal distribution, and b ∼ Uni(0, r) is a scalar. The probability of two points x, y having different hash values under an L2-LSH function is:

$$F^{L2}_r(d_{xy}) = \Pr\big(h^{L2}_{a,b}(x) \neq h^{L2}_{a,b}(y)\big) = 2\phi\left(-\frac{r}{d_{xy}}\right) + \frac{2}{\sqrt{2\pi}\,(r/d_{xy})}\left(1 - \exp\left(-\frac{(r/d_{xy})^2}{2}\right)\right); \quad (18)$$
where ϕ(x) is the cumulative distribution function of the normal distribution and d_xy = ||x − y|| is the l2 distance between x and y. From Eq. 18, we see that F^{L2}_r(d_xy) is a nonlinear, monotonically increasing function of d_xy. COE's objective of maximizing d_{x_u y_j} − d_{x_u y_i} does not directly maximize the corresponding mean value of the normal distribution (see Eq. 15), i.e., F^{L2}_r(d_{x_u y_j}) − F^{L2}_r(d_{x_u y_i}), since F^{L2}_r(d_{x_u y_j}) is not a linear function of the l2 distance d_{x_u y_j}. Our hypothesis is that, though both rely on ordinal triples, COE may not be as compatible with LSH as Indexable BPR.
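To see the nonlinearity concretely, the sketch below evaluates F^{L2}_r(d) as written in Eq. 18 for a few distances; the window size and distances are arbitrary choices.

import numpy as np
from scipy.stats import norm

def l2lsh_disagreement_prob(d, r=1.0):
    # Probability that L2-LSH assigns different buckets to two points at l2 distance d,
    # for window size r (Eq. 18 as reconstructed above).
    f = r / d
    return 2 * norm.cdf(-f) + (2 / (np.sqrt(2 * np.pi) * f)) * (1 - np.exp(-f ** 2 / 2))

# Monotone in d, but clearly nonlinear: equal gaps in d do not give equal gaps in probability.
for d in (0.5, 1.0, 1.5, 2.0):
    print(d, round(l2lsh_disagreement_prob(d), 3))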
Empirical Evidence. For each user u, we rank the items that u has rated in the test set, and measure how close the ranked list is to the ordering by ground-truth ratings. As metric, we turn to the well-established ranking metric nDCG@k, where k is the cut-off point for the ranked list. Its definition can be found in [26].
Fig. 10 shows the nDCG@10 values for MovieLens 20M and Netflix respectively at various dimensionalities D of the latent vectors. We observe that Indexable BPR is among the best, with the most competitive baseline being IPMF (which fits ratings). More important is whether the models still perform well when used with index structures. As similar trends are observed with other values of D, subsequently we show results based on D = 20.
Here, the objective is to investigate the effectiveness of the LSH hashcodes in preserving the ranking among the rated items in the test set. We use Hamming ranking, repeating the same experiment
as in Fig. 10, but using Hamming distances over hashcodes. This is to investigate how well Indexable BPR preserves the ranking compared to the baselines. As hashing relies on random hash functions, we average results over 10 different sets of functions.

Table 2: Absolute nDCG@10 and Relative nDCG@10 of all models as the length of LSH codes (b) varies.

                   MovieLens 20M                               Netflix
                   Absolute nDCG@10     Relative nDCG@10       Absolute nDCG@10     Relative nDCG@10
b                  8      12     16     8      12     16       8      12     16     8      12     16
CFEE               0.582  0.582  0.585  0.805  0.806  0.809    0.559  0.561  0.562  0.834  0.836  0.838
COE                0.605  0.609  0.608  0.886  0.891  0.890    0.570  0.565  0.575  0.906  0.898  0.914
IPMF               0.702  0.728  0.704  0.920  0.955  0.923    0.705  0.737  0.747  0.896  0.936  0.949
BPR(MF)            0.599  0.603  0.605  0.831  0.837  0.840    0.560  0.551  0.553  0.863  0.849  0.853
BPR(MF)+           0.603  0.604  0.606  0.837  0.840  0.841    0.569  0.569  0.566  0.877  0.877  0.873
Indexable BPR      0.743  0.745  0.754  0.977  0.980  0.991    0.732  0.761  0.756  0.924  0.960  0.954
Table 2 shows the performance of all models. The two metrics are: Absolute nDCG@10, which is the nDCG@10 of the LSH hashcodes, and Relative nDCG@10, which is the ratio between the Absolute nDCG@10 and that of the original real-valued latent vectors. Indexable BPR consistently shows better Absolute nDCG@10 values than the baselines when using LSH indexing. This implies that Indexable BPR coupled with SRP-LSH produces more compact and informative hashcodes. Also, the Relative nDCG@10 values of Indexable BPR are close to 1 and higher than those of the baselines. These observations validate our hypotheses that not only is Indexable BPR competitively effective pre-indexing, but it is also more LSH-friendly, resulting in less loss in ranking accuracy post-indexing.
6 CONCLUSION
We propose a probabilistic method for modeling user preferences based on ordinal triples, which is geared towards top-k recommendation via approximate kNN search using indexing. The proposed model Indexable BPR produces an indexing-friendly representation, which results in significant speedups in top-k retrieval, while still maintaining high accuracy due to its compatibility with indexing structures such as LSH, spatial trees, and inverted index. As future work, a potential direction is to go beyond achieving representations more compatible with existing indexing schemes, to designing novel data structures or indexing schemes that would better support efficient and accurate recommendation retrieval.
ACKNOWLEDGMENTS
This research is supported by the National Research Foundation, Prime Minister's Office, Singapore under its NRF Fellowship Programme (Award No. NRF-NRFF2016-07).
REFERENCES
[1] Mohamed Aly, Mario Munich, and Pietro Perona. 2011. Indexing in large scale image collections: Scaling properties and benchmark. In IEEE Workshop on Applications of Computer Vision (WACV). 418–425.
[2] Yoram Bachrach, Yehuda Finkelstein, Ran Gilad-Bachrach, Liran Katzir, Noam Koenigstein, Nir Nice, and Ulrich Paquet. 2014. Speeding up the Xbox recommender system using a Euclidean transformation for inner-product spaces. In RecSys. ACM, 257–264.
[3] Jon Louis Bentley. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9 (1975), 509–517.
[4] Avradeep Bhowmik, Nathan Liu, Erheng Zhong, Badri Narayan Bhaskar, and Suju Rajan. 2016. Geometry Aware Mappings for High Dimensional Sparse Factors. In AISTATS.
[5] Moses S. Charikar. 2002. Similarity estimation techniques from rounding algorithms. In STOC. ACM, 380–388.
[6] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. 2004. Locality-sensitive hashing scheme based on p-stable distributions. In SoCG. ACM, 253–262.
[7] Marco Fraccaro, Ulrich Paquet, and Ole Winther. 2016. Indexable Probabilistic Matrix Factorization for Maximum Inner Product Search. In AAAI.
[8] Ruining He and Julian McAuley. 2016. VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback. In AAAI.
[9] Jianqiu Ji, Jianmin Li, Shuicheng Yan, Bo Zhang, and Qi Tian. 2012. Super-bit locality-sensitive hashing. In NIPS. 108–116.
[10] Mohammad Khoshneshin and W. Nick Street. 2010. Collaborative filtering via Euclidean embedding. In RecSys. ACM, 87–94.
[11] Noam Koenigstein, Parikshit Ram, and Yuval Shavitt. 2012. Efficient retrieval of recommendations in a matrix factorization framework. In CIKM. ACM, 535–544.
[12] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009).
[13] Artus Krohn-Grimberghe, Lucas Drumond, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2012. Multi-relational matrix factorization using Bayesian personalized ranking for social network data. In WSDM. 173–182.
[14] J. B. Kruskal. 1964. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29, 1 (1964).
[15] Dung D. Le and Hady W. Lauw. 2016. Euclidean Co-Embedding of Ordinal Data for Multi-Type Visualization. In SDM. SIAM, 396–404.
[16] Lukas Lerche and Dietmar Jannach. 2014. Using graded implicit feedback for Bayesian personalized ranking. In RecSys. 353–356.
[17] Brian McFee and Gert R. G. Lanckriet. 2011. Large-scale music similarity search with spatial trees. In ISMIR.
[18] Marius Muja and David G. Lowe. 2009. Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration. In International Conference on Computer Vision Theory and Applications (VISAPP'09). INSTICC Press, 331–340.
[19] Behnam Neyshabur and Nathan Srebro. 2015. On Symmetric and Asymmetric LSHs for Inner Product Search. In ICML.
[20] Weike Pan and Li Chen. 2013. GBPR: Group Preference Based Bayesian Personalized Ranking for One-Class Collaborative Filtering. In IJCAI, Vol. 13. 2691–2697.
[21] Parikshit Ram and Alexander G. Gray. 2012. Maximum inner-product search using cone trees. In KDD. ACM, 931–939.
[22] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In UAI. AUAI Press, 452–461.
[23] Ruslan Salakhutdinov and Andriy Mnih. 2008. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In ICML. ACM, 880–887.
[24] Anshumali Shrivastava and Ping Li. 2014. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In NIPS. 2321–2329.
[25] Anshumali Shrivastava and Ping Li. 2015. Improved Asymmetric Locality Sensitive Hashing (ALSH) for Maximum Inner Product Search (MIPS). In UAI.
[26] Markus Weimer, Alexandros Karatzoglou, Quoc V. Le, and Alexander J. Smola. 2007. COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking. In NIPS.
[27] Hanwang Zhang, Fumin Shen, Wei Liu, Xiangnan He, Huanbo Luan, and Tat-Seng Chua. 2016. Discrete collaborative filtering. In SIGIR.
[28] Zhiwei Zhang, Qifan Wang, Lingyun Ruan, and Luo Si. 2014. Preference preserving hashing for efficient recommendation. In SIGIR. ACM, 183–192.
[29] Ke Zhou and Hongyuan Zha. 2012. Learning binary codes for collaborative filtering. In KDD. ACM, 498–506.