Indexable Bayesian Personalized Ranking for Efficient Top-k Recommendation
Dung D. Le
Singapore Management University
80 Stamford Road
Singapore 178902
ddle.2015@phdis.smu.edu.sg
Hady W. Lauw
Singapore Management University
80 Stamford Road
Singapore 178902
hadywlauw@smu.edu.sg
ABSTRACT

Top-k recommendation seeks to deliver a personalized recommendation list of k items to a user. The dual objectives are (1) accuracy in identifying the items a user is likely to prefer, and (2) efficiency in constructing the recommendation list in real time. One direction towards retrieval efficiency is to formulate retrieval as approximate k-nearest neighbor (kNN) search aided by indexing schemes, such as locality-sensitive hashing, spatial trees, and inverted index. These schemes, applied on the output representations of recommendation algorithms, speed up the retrieval process by automatically discarding a large number of potentially irrelevant items given a user query vector. However, many previous recommendation algorithms produce representations that may not necessarily align well with the structural properties of these indexing schemes, eventually resulting in a significant loss of accuracy post-indexing. In this paper, we introduce Indexable Bayesian Personalized Ranking (Indexable BPR), which learns from ordinal preferences to produce representations that are inherently compatible with the aforesaid indices. Experiments on publicly available datasets show superior performance of the proposed model compared to state-of-the-art methods on the top-k recommendation retrieval task, achieving significant speedup while maintaining high accuracy.
1 INTRODUCTION

Today, we face a multitude of options in various spheres of life, e.g., deciding which product to buy at Amazon, selecting which movie to watch on Netflix, choosing which article to read on social media, etc. The number of possibilities is immense. Driven by necessity, service providers rely on recommendation algorithms to identify a manageable number k of the most preferred options to be presented to each user. Due to the limited screen real estate of devices (increasingly likely to be ever smaller mobile devices), the value of k may be relatively small (e.g., k = 10), yet the selection of items to be recommended is personalized to each individual.

To construct such personalized recommendation lists, we learn from users' historical feedback, which may be explicit (e.g., ratings) [12] or implicit (e.g., click behaviors) [22]. An established methodology in the literature based on matrix factorization [23, 26] derives a latent vector $x_u \in \mathbb{R}^D$ for each user $u$, and a latent vector $y_i \in \mathbb{R}^D$ for each item $i$, where $D$ is the dimensionality. The degree of preference of user $u$ for item $i$ is modeled as the inner product $x_u^T y_i$. To arrive at the recommendation for $u$, we need to identify the top-k items with the maximum inner product to $x_u$.
There are two overriding goals for such top-k recommendations. One is accuracy, to maximize the correctness in placing items that user $u$ prefers most into $u$'s recommendation list. Another is efficiency; in particular, we are primarily concerned with retrieval efficiency, to minimize the time taken to deliver the recommendation list upon request. Faster retrieval helps the system cope with a large number of consumers, and minimizes their waiting time to receive recommendations. In contrast, learning efficiency, or minimizing model learning time, while useful, is arguably less mission-critical, as it can be done offline and involves mainly machine time rather than human time. Therefore, we seek to keep the learning time manageable, while improving retrieval efficiency.

Many previous recommendation algorithms focus mainly on accuracy. One challenge in practice is the need for exhaustive search over all candidate items to identify the top-k, which is time-consuming when the number of items $N$ is extremely large [11].
Problem. In this paper, we pay equal attention to both goals, i.e., optimizing the retrieval efficiency of top-k recommendation without losing sight of accuracy. An effective approach to improve efficiency is to use indexing structures such as locality-sensitive hashing (LSH) [24], spatial trees (e.g., KD-tree [3]), and inverted index [4]. By indexing items' latent vectors, we can quickly retrieve a small candidate set for the k "most relevant" items to the user query vector, probably in sub-linear time w.r.t. the number of items $N$. This avoids an exhaustive search, and saves on the computation for the large number of items that the index considers irrelevant.

Here, we focus on indexing as an alternative to exhaustive search in real time. Indexing is preferred over pre-computation of recommendation lists for all users, which is impractical [4, 11]. User interests change over time. New items appear. By indexing, we avoid the storage requirement of dealing with all possible user-item pairs. Index storage scales only with the number of items $N$, while the number of queries/users could be larger. Indexing flexibly allows the value of k to be specified at run-time.
However, most of the previous recommendation algorithms based on matrix factorization [12] are not designed with indexing in mind. The objective of recommending to user $u$ those items with maximum inner product $x_u^T y_i$ is not geometrically compatible with the aforementioned index structures. For one thing, it has been established that there cannot exist any LSH family for maximum inner product search [24]. For another, retrieval on a spatial tree index finds the nearest neighbors based on the Euclidean distance, which are not equivalent to those with maximum inner product [2]. In turn, [4] describes an inverted index scheme based on cosine similarity, which again is not equivalent to inner product search.
Approach. The key reason behind the incompatibility between the inner product search that matrix factorization relies on and the aforesaid index structures is that a user $u$'s degree of preference for an item $i$, expressed as the inner product $x_u^T y_i$, is sensitive to the respective magnitudes of the latent vectors $\|x_u\|, \|y_i\|$. Therefore, one insight towards achieving geometric compatibility is to desensitize the effect of vector magnitudes. The challenge is how to do so while still preserving the accuracy of the top-k retrieval.

There are a couple of recent approaches in this direction. One approach [2] is a post-processing transformation that expands the latent vectors learnt from matrix factorization with an extra dimension to equalize the magnitude of all item vectors. Because the transformation is a separate process from learning the vectors, such a workaround would not be as effective as working with natively indexable vectors in the first place. Another approach [7] extends Bayesian Probabilistic Matrix Factorization [23] by making the item latent vectors natively of fixed length. Fitting the inner product to absolute rating values may not be suitable when only implicit feedback (not ratings) is available. Moreover, we note that top-k recommendation is inherently an expression of "relative" rather than "absolute" preferences, i.e., the ranking among items is more important than the exact scores.
We propose to work with ordinal expressions of preferences. Ordinal preferences can be expressed as a triple $(u, i, j)$, indicating that a user $u$ prefers an item $i$ to a different item $j$. Ordinal representation is prevalent in modeling preferences [22], and also accommodates both explicit (e.g., ratings) and implicit feedback.
Contributions. This paper makes the following contributions. First, we propose the Indexable Bayesian Personalized Ranking model, or Indexable BPR in short, which produces natively geometrically indexable latent vectors for accurate and efficient top-k recommendation. BPR [22] is a generic framework for modeling ordinal triples. Each instantiation is based on a specific kernel [8, 13, 16, 20]. [22] used a matrix factorization kernel, which is not well-fitted to indexing structures. In contrast, our Indexable BPR is formulated with a kernel based on angular distances (see Section 3). In addition to requiring a different learning algorithm, we will show how this engenders native compatibility with various index structures.

Second, we describe how the resulting vectors are used with LSH, spatial tree, and inverted index for top-k recommendation in Section 4. We conduct experiments with available datasets to compare Indexable BPR with baselines. Empirically, we observe that Indexable BPR achieves a balance of accuracy and run-time efficiency, achieving higher accuracy than the baselines at the same speedup level, and higher speedup at the same accuracy level.

Third, to support the observation on the robustness of Indexable BPR, we provide a theoretical analysis in the context of LSH, further bolstered with empirical evidence, on why our reliance on angular distances results in more index-friendly vectors, smaller loss of accuracy post-indexing, and balanced all-round performance.
2 RELATED WORK

We review the literature related to the problem of efficiently retrieving top-k recommendations using indexing schemes.

Matrix Factorization. Matrix factorization is the basis of many recommendation algorithms [12]. For such models, top-k retrieval is essentially reduced to maximum inner product search, with complexity proportional to the number of items in the (huge) collection. This motivates approaches to improve the retrieval efficiency of top-k recommendation. Of interest to us are those that yield user and item latent vectors to be used with geometric index structures. This engenders compatibility with both spatial tree index and inverted index, as well as with hashing schemes, and transforms the problem into k-nearest neighbor (kNN) search.

One approach is the transformation scheme applied to matrix factorization output. [2, 19] propose a post-processing step that extends the output latent vectors by one dimension to equalize the magnitude of item vectors. Theoretical comparisons show that this Euclidean transformation achieves better hashing quality than the two previous methods in [24] and [25]. However, the Euclidean transformation results in a high concentration of new item points, affecting the retrieval accuracy of the approximate kNN. As Indexable BPR relies on ordinal triples, one appropriate baseline is to use the transformation scheme above on a comparable algorithm that also relies on triples. We identify BPR [22] with inner product or matrix factorization (MF) kernel, whose implementation is available¹, and refer to the composite as BPR(MF)+.

Another approach is to learn indexable vectors that fit ratings, which would not work with implicit feedback, e.g., ordinal triples. Indexable Probabilistic Matrix Factorization, or IPMF [7], is a rating-based model with constraints that place item vectors on a hypersphere. We will see that IPMF does not optimize for a high mean of the normal distribution in Eq. 15 (see Section 5), and the ordinal-based Indexable BPR potentially performs better.

Others may not involve the standard index structures we study. [11] used representative queries identified by clustering. [21] invented another data structure (cone tree). [27-29] learnt binary codes, which are incompatible with the $l_2$ distance used by spatial trees.
Euclidean Embedding. Euclidean embedding takes as input distances (or their ordinal relationships), and outputs low-dimensional latent coordinates for each point that preserve the input as much as possible [14]. Because they operate in the Euclidean space, the coordinates support nearest neighbor search using geometric index structures such as spatial trees.

There exist recent works on using Euclidean embedding to model user preferences over items, which we include as experimental baselines. The first method, Collaborative Filtering via Euclidean Embedding or CFEE [10], fits a rating $\hat{r}_{ui}$ by user $u$ on item $i$ in terms of the squared Euclidean distance between $x_u$ and $y_i$. Fitting ratings directly does not preserve the pairwise comparisons. The second method, Collaborative Ordinal Embedding or COE [15], is based on ordinal triples. It expresses a triple $t_{uij}$ through the Euclidean distance difference $\|x_u - y_j\| - \|x_u - y_i\|$. COE's objective is to maximize this difference for each observation $t_{uij}$.

¹http://www.librec.net
3 INDEXABLE BPR

Problem. We consider a set of users $\mathcal{U}$ and a set of items $\mathcal{I}$. We consider as input a set of triples $\mathcal{T} \subset \mathcal{U} \times \mathcal{I} \times \mathcal{I}$. A triple $t_{uij} \in \mathcal{T}$ relates one user $u \in \mathcal{U}$ and two different items $i, j \in \mathcal{I}$, indicating $u$'s preferring item $i$ to item $j$. Such ordinal preference is prevalent, encompassing explicit and implicit feedback scenarios. When ratings are available, we can induce an ordinal triple for each instance when user $u$ rates item $i$ higher than she rates item $j$. Triples can also model implicit feedback [22]. E.g., when searching on the Web, one may click on website $i$ and ignore $j$. When browsing products, one may choose to click or buy product $i$ and skip $j$.
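To make the triple construction concrete, the following is a minimal sketch (in Python; the data layout and names are illustrative assumptions, not the paper's code) that induces ordinal triples from explicit ratings exactly as described above:

```python
import itertools

def triples_from_ratings(ratings):
    """Induce ordinal triples t_uij from explicit ratings.

    `ratings` maps each user u to a dict {item: rating}. A triple
    (u, i, j) is emitted whenever u rated item i higher than item j.
    """
    for u, user_ratings in ratings.items():
        for (i, ri), (j, rj) in itertools.combinations(user_ratings.items(), 2):
            if ri > rj:
                yield (u, i, j)
            elif rj > ri:
                yield (u, j, i)

# Example: user 0 rated item 'a' 5 and item 'b' 3, yielding (0, 'a', 'b').
print(list(triples_from_ratings({0: {'a': 5, 'b': 3}})))
```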
The goal is to derive a $D$-dimensional latent vector $x_u \in \mathbb{R}^D$ for each user $u \in \mathcal{U}$, and a latent vector $y_i \in \mathbb{R}^D$ for each item $i \in \mathcal{I}$, such that the relative preference of a user $u$ over two items $i$ and $j$ can be expressed as a function (to be defined) of their corresponding latent vectors $x_u$, $y_i$, and $y_j$. We denote the collections of all user latent vectors and item latent vectors as $X$ and $Y$ respectively.
Framework. Given the input triples $\mathcal{T}$, we seek to learn the user and item vectors $X, Y$ with the highest posterior probability.

$$\arg\max_{X,Y} P(X, Y \mid \mathcal{T}) \quad (1)$$

The Bayesian formulation for modeling this posterior probability is to decompose it into the likelihood of the triples $P(\mathcal{T} \mid X, Y)$ and the prior $P(X, Y)$, as shown in Eq. 2.

$$P(X, Y \mid \mathcal{T}) \propto P(\mathcal{T} \mid X, Y)\, P(X, Y) \quad (2)$$

We will define the prior later when we discuss the generative process. For now, we focus on defining the likelihood, which can be decomposed into the probabilities of individual triples $t_{uij} \in \mathcal{T}$.

$$P(\mathcal{T} \mid X, Y) = \prod_{t_{uij} \in \mathcal{T}} P(t_{uij} \mid x_u, y_i, y_j) \quad (3)$$
Weakness of Inner Product Kernel for Top-k Retrieval. To determine the probability of an individual triple, we need to define a kernel function. The kernel proposed by the matrix factorization-based (not natively indexable) BPR [22] is shown in Eq. 4 ($\sigma$ is the sigmoid function). This assumes that if $x_u^T y_i$ is higher than $x_u^T y_j$, then user $u$ is more likely to prefer item $i$ to $j$.

$$P(t_{uij} \mid x_u, y_i, y_j) = \sigma(x_u^T y_i - x_u^T y_j) \quad (4)$$

Since our intended application is top-k recommendation, once we learn the user and item latent vectors, the top-k recommendation task is reduced to searching for the k nearest neighbors to the query (user vector) among the potential answers (item vectors). A naive solution is to conduct an exhaustive search over all the items.

An indexing-based approach could reduce the retrieval time significantly, by prioritizing or narrowing the search to a smaller search space. For the nearest neighbors identified by an index to be as accurate as possible, the notion of similarity (or distance) used by the index should be compatible with the notion of similarity of the underlying model that yields the user and item vectors. Therein lies the issue with the inner product kernel described in Eq. 4. It is not necessarily compatible with geometric index structures that rely on similarity functions other than inner products.
First, we examine its incompatibility with the spatial tree index. Suppose that all item latent vectors $y_i$ are inserted into the index. To derive the recommendation for $u$, we use $x_u$ as the query. Nearest neighbor search on a spatial tree index is expected to return items that are closest in terms of Euclidean distance. The relationship between Euclidean distance and inner product is expressed in Eq. 5. It implies that items with the closest Euclidean distances may not have the highest inner products, due to the magnitudes $\|x_u\|$ and $\|y_i\|$. Spatial tree index retrieval may be inconsistent with Eq. 4.

$$\|x_u - y_i\|^2 = \|x_u\|^2 + \|y_i\|^2 - 2\, x_u^T y_i \quad (5)$$

[Figure 1: An illustration of the incompatibility of the inner product kernel with the spatial tree index (Euclidean distance) and the inverted index (cosine similarity).]
Second, we examine its incompatibility with an inverted index that relies on cosine similarity (Eq. 6). Similarly, the pertinence of the magnitudes $\|x_u\|$ and $\|y_i\|$ implies that inverted index retrieval may be inconsistent with maximum inner product search.

$$\cos(x_u, y_i) = \frac{x_u^T y_i}{\|x_u\| \cdot \|y_i\|} \quad (6)$$
Fig. 1 shows an example to illustrate the above analysis. In Fig. 1, the inner product $x_u^T y_i$ is greater than $x_u^T y_j$, implying that $u$ prefers $i$ to $j$. However, the Euclidean distance computation shows that $y_j$ is closer to $x_u$ than $y_i$ is. Also, the cosine similarity between $x_u$ and $y_i$ is smaller than that between $x_u$ and $y_j$. This means that the inner product kernel of the model is not compatible with the operations of a spatial tree index relying on Euclidean distance, or an inverted index relying on cosine similarity.

Third, in terms of its incompatibility with LSH, we note that it has been established that there cannot exist any LSH family for maximum inner product search [24], while there exist LSH families for Euclidean distance and cosine similarity respectively.
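The following small numeric check (with made-up vectors, not taken from the paper's figure) reproduces the situation of Fig. 1: the item winning on inner product loses on both Euclidean distance and cosine similarity once magnitudes differ:

```python
import numpy as np

# Made-up vectors: yi has a large magnitude and wins on inner product,
# yet yj is both l2-nearer to xu and more cosine-similar to it.
xu = np.array([2.0, 0.5])
yi = np.array([4.0, -1.0])
yj = np.array([0.9, 0.3])

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(xu @ yi > xu @ yj)                                  # True: i wins by inner product
print(np.linalg.norm(xu - yi) < np.linalg.norm(xu - yj))  # False: j is l2-nearer
print(cos(xu, yi) > cos(xu, yj))                          # False: j is more cosine-similar
```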
Proposed Angular Distance Kernel. To circumvent the limitation of the inner product kernel, we propose a new kernel that expresses the probability of a triple $t_{uij}$ in a way that is insensitive to vector magnitudes. A different kernel is a non-trivial, even significant, change as it requires a different learning algorithm.

Our proposed kernel is based on angular distance. Let $\theta_{xy}$ denote the angular distance between vectors $x$ and $y$, evaluated as the arccos of the inner product between the normalized vectors.

$$\theta_{xy} = \cos^{-1}\left(\frac{x^T y}{\|x\| \cdot \|y\|}\right) \quad (7)$$

Proposing the angular distance, i.e., the arccos of the cosine similarity, to formulate the user-item association is a novel and appropriate design choice for the following reasons.

• Firstly, since arccos is a monotone function, the closest point according to the angular distance is the same as the point with the highest cosine similarity, resulting in its compatibility with the inverted index structure.
• Secondly, since angular distances are not affected by magnitudes, the kernel preserves all the information learnt by the model. Before indexing, the learnt vectors can be normalized to unit length for compatibility with indexing that relies on either Euclidean distance or cosine similarity.
• Lastly, the angular distance is also compatible with LSH indexing. A theoretical analysis and empirical evidence on this compatibility are provided in Section 5.
While the user vectors $x_u$ and item vectors $y_i$ we learn could be of varying lengths, the magnitudes are uninformative as far as the user preferences encoded by the triples are concerned. This advantageously allows greater flexibility in parameter learning, while still controlling the vectors via the regularization terms, as opposed to constraining vectors to fixed length during learning (as in [7]).

We formulate the probability of a triple $t_{uij}$ for Indexable BPR as in Eq. 8. The probability is higher when the difference $\theta_{x_u y_j} - \theta_{x_u y_i}$ is larger. If $u$ prefers $i$ to $j$, the angular distance between $x_u$ and $y_i$ is expected to be smaller than that between $x_u$ and $y_j$.

$$P(t_{uij} \mid x_u, y_i, y_j) = \sigma(\theta_{x_u y_j} - \theta_{x_u y_i}) \quad (8)$$
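For concreteness, Eqs. 7 and 8 translate directly into code; the sketch below is a plain transcription (the clipping is our addition, to guard against floating-point rounding pushing the cosine outside [-1, 1]):

```python
import numpy as np

def theta(x, y):
    """Angular distance: arccos of the cosine similarity (Eq. 7)."""
    c = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(c, -1.0, 1.0))  # clip guards rounding error

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def triple_prob(xu, yi, yj):
    """P(t_uij | xu, yi, yj) = sigma(theta(xu, yj) - theta(xu, yi)) (Eq. 8)."""
    return sigmoid(theta(xu, yj) - theta(xu, yi))
```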
Generative Process. The proposed model Indexable BPR as a whole can be expressed by the following generative process:

(1) For each user $u \in \mathcal{U}$: draw $x_u \sim \text{Normal}(0, \eta^2 I)$.
(2) For each item $i \in \mathcal{I}$: draw $y_i \sim \text{Normal}(0, \eta^2 I)$.
(3) For each triple of one user $u \in \mathcal{U}$ and two items $i, j \in \mathcal{I}$: draw a trial from $\text{Bernoulli}(P(t_{uij} \mid x_u, y_i, y_j))$. If "success", generate a triple instance $t_{uij}$; otherwise, generate a triple instance $t_{uji}$.

The first two steps place zero-mean multivariate spherical Gaussian priors on the user and item latent vectors. $\eta^2$ denotes the variance of the Normal distributions; for simplicity we use the same variance for users and items. $I$ denotes the identity matrix. This acts as a regularizer for the vectors, and defines the prior $P(X, Y)$.

$$P(X, Y) = (2\pi\eta^2)^{-\frac{D}{2}} \prod_{u \in \mathcal{U}} e^{-\frac{1}{2\eta^2}\|x_u\|^2} \prod_{i \in \mathcal{I}} e^{-\frac{1}{2\eta^2}\|y_i\|^2} \quad (9)$$

Triples in $\mathcal{T}$ are generated from the users' and items' latent vectors according to the probability $P(t_{uij} \mid x_u, y_i, y_j)$ as defined in Eq. 8.
Parameter Learning. Maximizing the posterior as outlined in Eq. 2 is equivalent to maximizing its logarithm, shown below.

$$\mathcal{L} = \ln P(\mathcal{T} \mid X, Y) + \ln P(X, Y) \propto \ln P(\mathcal{T} \mid X, Y) - \frac{1}{2\eta^2} \sum_{u \in \mathcal{U}} \|x_u\|^2 - \frac{1}{2\eta^2} \sum_{i \in \mathcal{I}} \|y_i\|^2 \quad (10)$$

Let us denote $\Delta_{uij} = \theta_{x_u y_j} - \theta_{x_u y_i}$, $\tilde{x}_u = \frac{x_u}{\|x_u\|}$ for $u \in \mathcal{U}$, and $\tilde{y}_i = \frac{y_i}{\|y_i\|}$ for $i \in \mathcal{I}$. The gradient of $\mathcal{L}$ w.r.t. each user vector $x_u$ is:

$$\frac{\partial \mathcal{L}}{\partial x_u} = \sum_{\{i,j:\, t_{uij} \in \mathcal{T}\}} \frac{1}{\|x_u\|^2} \cdot \frac{e^{-\Delta_{uij}}}{1 + e^{-\Delta_{uij}}} \times \left( \frac{\cos(x_u, y_j)\, x_u - \tilde{y}_j\, \|x_u\|}{\sqrt{1 - \cos(x_u, y_j)^2}} - \frac{\cos(x_u, y_i)\, x_u - \tilde{y}_i\, \|x_u\|}{\sqrt{1 - \cos(x_u, y_i)^2}} \right),$$

in which $\cos(x_u, y_i) = \frac{x_u^T y_i}{\|x_u\| \cdot \|y_i\|}$ for all $u \in \mathcal{U}$ and $i \in \mathcal{I}$.
Algorithm 1: Gradient Ascent for Indexable BPR
Input: ordinal triples set $\mathcal{T} = \{t_{uij},\, u \in \mathcal{U},\, i, j \in \mathcal{I}\}$.
1: Initialize $x_u$ for $u \in \mathcal{U}$, $y_i$ for $i \in \mathcal{I}$
2: while not converged do
3:   for each $u \in \mathcal{U}$ do
4:     $x_u \leftarrow x_u + \epsilon \cdot \frac{\partial \mathcal{L}}{\partial x_u}$
5:   for each $i \in \mathcal{I}$ do
6:     $y_i \leftarrow y_i + \epsilon \cdot \frac{\partial \mathcal{L}}{\partial y_i}$
7: Return $\{\tilde{x}_u = \frac{x_u}{\|x_u\|}\}_{u \in \mathcal{U}}$ and $\{\tilde{y}_i = \frac{y_i}{\|y_i\|}\}_{i \in \mathcal{I}}$
The gradient of $\mathcal{L}$ w.r.t. each item vector $y_k$ is:

$$\frac{\partial \mathcal{L}}{\partial y_k} = \sum_{\{u,j:\, t_{ukj} \in \mathcal{T}\}} \frac{1}{\|y_k\|^2} \cdot \frac{e^{-\Delta_{ukj}}}{1 + e^{-\Delta_{ukj}}} \cdot \frac{\tilde{x}_u\, \|y_k\| - \cos(x_u, y_k)\, y_k}{\sqrt{1 - \cos(x_u, y_k)^2}} + \sum_{\{u,i:\, t_{uik} \in \mathcal{T}\}} \frac{1}{\|y_k\|^2} \cdot \frac{e^{-\Delta_{uik}}}{1 + e^{-\Delta_{uik}}} \cdot \frac{\cos(x_u, y_k)\, y_k - \tilde{x}_u\, \|y_k\|}{\sqrt{1 - \cos(x_u, y_k)^2}}.$$
Algorithm 1 describes the learning algorithm with full gradient ascent. It first initializes the users' and items' latent vectors. In each iteration, the model parameters are updated based on the gradients, with a learning rate $\epsilon$ that decays over time. The output is the set of normalized user vectors $\tilde{x}_u$ and item vectors $\tilde{y}_i$. On one hand, this normalization does not affect the accuracy of the top-k recommendation produced by Indexable BPR, since the magnitude of the latent vectors does not affect the ranking. On the other hand, the normalized vectors can later be used for approximate kNN search using various indexing data structures. The time complexity of the algorithm is linear in the number of triples in $\mathcal{T}$, i.e., $O(|\mathcal{U}| \times |\mathcal{I}|^2)$.
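As an illustration of the learning procedure, here is a minimal stochastic-gradient sketch of Algorithm 1. It is a simplification under our own assumptions: per-triple updates with a fixed learning rate and per-update regularization, whereas the paper's Algorithm 1 performs full-batch ascent with a decaying rate; all names are ours.

```python
import numpy as np

def d_theta_dx(x, y):
    """Gradient of theta(x, y) w.r.t. x (cf. the gradients in Section 3)."""
    xn, yn = np.linalg.norm(x), np.linalg.norm(y)
    xh, yh = x / xn, y / yn
    c = np.clip(xh @ yh, -1 + 1e-8, 1 - 1e-8)   # avoid the pole at |c| = 1
    return (c * xh - yh) / (xn * np.sqrt(1 - c * c))

def train_ibpr(triples, n_users, n_items, D=20, eta=1.0, lr=0.05, epochs=50, seed=0):
    """Stochastic-gradient sketch of Indexable BPR training (Algorithm 1)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=0.1, size=(n_users, D))
    Y = rng.normal(scale=0.1, size=(n_items, D))
    for _ in range(epochs):
        for u, i, j in triples:
            xu, yi, yj = X[u].copy(), Y[i].copy(), Y[j].copy()
            t_i = np.arccos(np.clip(xu @ yi / (np.linalg.norm(xu) * np.linalg.norm(yi)), -1, 1))
            t_j = np.arccos(np.clip(xu @ yj / (np.linalg.norm(xu) * np.linalg.norm(yj)), -1, 1))
            w = 1.0 / (1.0 + np.exp(t_j - t_i))   # sigma(-(theta_j - theta_i))
            # Likelihood gradients plus the prior's regularization term (-v / eta^2).
            gx = w * (d_theta_dx(xu, yj) - d_theta_dx(xu, yi)) - xu / eta**2
            gi = w * -d_theta_dx(yi, xu) - yi / eta**2
            gj = w * d_theta_dx(yj, xu) - yj / eta**2
            X[u] += lr * gx
            Y[i] += lr * gi
            Y[j] += lr * gj
    # Return unit-length vectors, as in the final step of Algorithm 1.
    return (X / np.linalg.norm(X, axis=1, keepdims=True),
            Y / np.linalg.norm(Y, axis=1, keepdims=True))
```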
4 EXPERIMENTS ON TOP-K RECOMMENDATION WITH INDEXING

The key idea in this paper is achieving speedup in the retrieval time of top-k recommendation via indexing, while still maintaining high accuracy via better representations that minimize any loss of information post-indexing. Hence, in the following evaluation, we are interested in both the accuracy of the top-k recommendation returned by the index, and the speedup in retrieval time due to indexing as compared to exhaustive search.

To showcase the generality of Indexable BPR in accommodating various index structures, we experiment with three indexing schemes: locality-sensitive hashing, spatial tree index, and inverted index. Note that our focus is on the relative merits of recommendation algorithms, rather than on the relative merits of index structures. It is our objective to investigate the effectiveness of Indexable BPR, as compared to other algorithms, for top-k recommendation when using these index structures. It is not our objective to compare the index structures among themselves.
Comparative Methods. We compare our proposed Indexable BPR with the following recommendation algorithm baselines:

• BPR(MF): the non-index-friendly BPR with inner product (MF) kernel [22]. This would validate whether our angular distance kernel is more index-friendly.
• BPR(MF)+: a composite of BPR(MF) and the Euclidean transformation described in [2], which makes the item vectors indexable as post-processing. This allows validation of our learning inherently indexable vectors in the first place.
• IPMF: matrix factorization that learns fixed-length item vectors but fits rating scores [7]. This allows validation of our modeling of ordinal triples.
• CFEE: Euclidean embedding that fits rating scores [10]. This allows validation of our modeling of ordinal triples.
• COE: Euclidean embedding that fits ordinal triples [15]. Comparison to CFEE and COE allows validation of our compatibility with non-spatial indices such as some LSH families as well as inverted index.

We tune the hyper-parameters of all models for the best performance. For IPMF, we adopt the parameters provided by its authors for the Netflix dataset. For the ordinal-based algorithms (BPR, COE, and Indexable BPR), the learning rate and the regularization are 0.05 and 0.001 respectively. For CFEE, they are 0.1 and 0.001. All models use D = 20 dimensions in their latent representations. Similar trends are observed across other dimensionalities (see Sec. 5).
Datasets. We experiment on two publicly available rating-based datasets and derive ordinal triples accordingly. One is MovieLens 20M², the largest among the MovieLens collection. The other is Netflix³. Table 1 shows a summary of these datasets. By default, MovieLens 20M includes only users with at least 20 ratings. For consistency, we apply the same to Netflix. For each dataset, we randomly keep 60% of the ratings for training and hide 40% for testing. We conduct stratified sampling to maintain the same ratio for each user. We report the average results over five training/testing splits. For training, we generate a triple $t_{uij}$ if user $u$ has a higher rating for item $i$ than for $j$, and triples are formed within the training set.

Table 1: Datasets

                 #users    #items   #ratings      #training ordinal triples
MovieLens 20M    138,493   27,278   20,000,263    5.46 × 10⁸
Netflix          480,189   17,770   100,480,507   2.29 × 10¹⁰

As mentioned earlier, our focus in this work is on online retrieval speedup. We find that the model learning time, which is offline, is manageable. Our learning times for MovieLens 20M and Netflix are 5.2 and 9.3 hours respectively on a computer with an Intel Xeon E2650v4 2.20GHz CPU and 256GB RAM. Algorithm 1 scales with the number of triples, which in practice grows slower than its theoretical complexity of $O(|\mathcal{U}| \times |\mathcal{I}|^2)$. Figure 2 shows how the average number of triples per user grows with the number of items; the actual growth is closer to linear and lower than the quadratic curve provided as reference.
²http://grouplens.org/datasets/movielens/20m/
³http://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9a

Figure 2: Number of triples (per user) vs. number of items. [Panels: (a) MovieLens 20M, (b) Netflix; each plots the actual growth against a quadratic reference curve.]

Recall. We assume that the goal of top-k recommendation is to recommend new items to a user, among the items not seen in the training set. When retrieval is based on an index, the evaluation of the top-k necessarily takes into account the operation of the index. Because we maintain one index for all items to be used with all users, conceivably items returned by a top-k query may belong to one of three categories: those in the training set (to be excluded for new item recommendation), those in the test set (of interest, as these are the known ground truth of which items users prefer), and those not seen/rated in either set (for which no ground truth of user preference is available). It is important to note that the latter may not necessarily be bad recommendations; they are simply unknown. Precision of the top-k may penalize such items.
We reason that among the rated items in the test set, those that have been assigned the maximum possible rating by a user would be expected to appear in the top-k recommendation list for that user. A suitable metric is the recall of items in the test set with maximum rating. For each user $u$ with at least one highest-rating item in the test set (for the two datasets, the highest possible rating value is 5), we compute the percentage of these items that are returned in the top-k by the index. The higher the percentage, the better the performance of the model at identifying the items a user prefers the most. Eq. 11 presents the formula for Recall@k:

$$\text{Recall@}k = \frac{1}{|\mathcal{U}_{max}|} \sum_{u \in \mathcal{U}_{max}} \frac{|\{i \in \psi_k^u : r_{ui} = \text{max rating}\}|}{|\{i \in \mathcal{I} : r_{ui} = \text{max rating}\}|}, \quad (11)$$

in which $\mathcal{U}_{max}$ is the set of users who have given at least one item a rating of 5, and $\psi_k^u$ is the top-k returned by the index. We exclude training items for $u$ from both numerator and denominator. We normalize Recall@k by the ideal Recall@k that a perfect algorithm can achieve, and denote the metric as nRecall@k.
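A minimal sketch of the metric computation follows; the container names (`topk`, `test_max_items`, `train_items`) are our illustrative assumptions about the data layout, and the normalization by the ideal recall (to obtain nRecall@k) is left as a final division.

```python
import numpy as np

def recall_at_k(topk, test_max_items, train_items):
    """Recall@k per Eq. 11: fraction of a user's maximum-rating test items
    returned in the top-k; training items are excluded from both the
    numerator and the denominator. Dividing by the ideal recall of a
    perfect algorithm yields nRecall@k."""
    recalls = []
    for u, relevant in test_max_items.items():   # users with >= 1 max-rated test item
        excluded = train_items.get(u, set())
        relevant = relevant - excluded
        if not relevant:
            continue
        returned = {i for i in topk[u] if i not in excluded}
        recalls.append(len(relevant & returned) / len(relevant))
    return float(np.mean(recalls))
```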
Speedup. To investigate the efficacy of using the indexing schemes for top-k recommendation, we introduce a second metric, speedup, which is the ratio of the time taken by exhaustive search to return the top-k to the time taken by an index.

$$\text{Speedup} = \frac{\text{Retrieval time taken by exhaustive search}}{\text{Retrieval time taken by the index}} \quad (12)$$

We will discuss the results in terms of the trade-off between recall and speedup. There are index parameters that control the degree of approximation, i.e., higher speedup at the expense of lower recall. Among the comparative recommendation algorithms, a better trade-off means higher speedup at the same recall, or higher recall at the same speedup. For each comparison below, we control for the indexing scheme, as different schemes vary in their ways of achieving approximation, implementations, and deployment scenarios.
4.1 Top-k Recommendation with LSH Index

We first briefly review LSH and how it is used for top-k recommendation. Let $h = (h_1, h_2, \ldots, h_b)$ be a set of LSH hash functions.
Figure 3: nRecall@k with the hash table lookup strategy (T = 10 hash tables). [Panels: (a) MovieLens 20M, (b) Netflix; one plot per code length b ∈ {8, 12, 16}, each showing nRecall@k against k.]
Each function assigns a bit to each vector. $h$ will assign each user $u$ a binary hashcode $h(x_u)$, and each item $i$ a binary hashcode $h(y_i)$, all of length $b$. Assuming that user $u$ prefers item $i$ to item $j$, $h$ is expected to produce binary hashcodes with a smaller Hamming distance $\|h(x_u) - h(y_i)\|_H$ than the Hamming distance $\|h(x_u) - h(y_j)\|_H$.

The most frequent indexing strategy for LSH is hash table lookup. We store item codes in hash tables, with items having the same code in the same bucket. Given a query (user) code, we can determine the corresponding bucket in constant time. We search for the top-k only among items in that bucket, reducing the number of items on which we need to perform exact similarity computations.
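A compact from-scratch sketch of the hash table lookup strategy just described (the package of [1] is not shown here; sizes and names are assumptions for illustration):

```python
import numpy as np
from collections import defaultdict

def srp_hash(V, A):
    """Signed-random-projection hashcodes: one bit per hyperplane in A."""
    return (V @ A.T > 0).astype(np.uint8)

def build_tables(item_codes, T):
    """One dict per hash table, mapping a code (as bytes) to an item bucket."""
    tables = [defaultdict(list) for _ in range(T)]
    for t in range(T):
        for i, code in enumerate(item_codes[t]):
            tables[t][code.tobytes()].append(i)
    return tables

rng = np.random.default_rng(0)
D, b, T = 20, 12, 10
items = rng.normal(size=(1000, D))
A = [rng.normal(size=(b, D)) for _ in range(T)]      # T independent hash functions
tables = build_tables([srp_hash(items, a) for a in A], T)

user = rng.normal(size=D)
candidates = set()
for t in range(T):                                   # union of matching buckets
    code = srp_hash(user[None, :], A[t])[0].tobytes()
    candidates |= set(tables[t].get(code, []))
# Exact similarity is then computed only on `candidates`, not on all items.
```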
We use the LSH package developed by [1]. The LSH family used with Indexable BPR for generating hashcodes is SRP-LSH, which is also used for IPMF following [7]. We apply it to BPR(MF) and BPR(MF)+ as well, since [25] and [19] claim it to be the more suitable family for the transformed vectors. In turn, the LSH scheme for COE and CFEE is L2-LSH, since both use the $l_2$ distance. In Section 5, we will elaborate with theoretical analysis and empirical evidence on how more compatible representations tend to produce better results.

When using hash tables, one specifies the number of tables $T$ and the code length $b$. We experiment with various $T$, and $T = 10$ returns the best performance (consistent with [7]). We also vary $b$; a larger $b$ is expected to lead to fewer items in each bucket.
Figure 4: nRecall@10 vs. speedup with the hash table lookup strategy (T = 10 hash tables). [Panels: (a) MovieLens 20M, (b) Netflix; speedup on a log scale.]

Figure 3(a) shows the nRecall@k using hash table lookup with $T = 10$ tables and different values of the code length $b = 8, 12, 16$ for MovieLens 20M. Across the $b$'s, the trends are similar. Indexable BPR has the highest nRecall@k values across all $k$. It outperforms BPR(MF)+, which conducts vector transformation as post-processing, indicating that learning inherently indexable vectors is helpful. In turn, BPR(MF)+ outperforms BPR(MF), which indicates that the inner product kernel is not conducive to indexing. Interestingly, Indexable BPR also performs better than the models that fit ratings (IPMF, CFEE), suggesting that learning from relative comparisons may be more suitable for top-k recommendation.
Figure 3(b) shows the results for Netflix. Again, Indexable BPR has the highest nRecall@k values across all $k$. The relative comparisons among the baselines are as before, except that IPMF is now more competitive, though still lower than Indexable BPR.

We also investigate the trade-off between the speedup achieved and the accuracy of the top-k returned by the index. Fig. 4 shows the nRecall@10 and the speedup when varying the value of $b$. Given the same speedup, Indexable BPR achieves significantly higher performance compared to the baselines. As $b$ increases, the speedup increases and nRecall@10 decreases. This is expected: the longer the hashcodes, the smaller the set of items on which the system needs to perform similarity computation. This reflects the trade-off between speedup and approximation quality.
4.2 Top-k Recommendation with KD-Tree Index

Spatial trees refer to a family of methods that recursively partition the data space towards a balanced binary search tree, in which each node encompasses a subset of the data points [17]. For algorithms that model the user-item association by $l_2$ distance, spatial trees can be used to index the item vectors. Top-k recommendation is thus equivalent to finding the kNN of the query. The tree will locate the nodes that the query belongs to, and exact similarity computation is performed only on the points indexed by those nodes.

Figure 5: nRecall@k with KD-tree indexing. [Panels: (a) MovieLens 20M, (b) Netflix; one plot per node budget c ∈ {500, 1000, 1500}, each showing nRecall@k against k.]

For Indexable BPR, Algorithm 1 returns two sets of normalized vectors $\tilde{x}_u$ for $u \in \mathcal{U}$ and $\tilde{y}_i$ for $i \in \mathcal{I}$. We observe that:

$$\|\tilde{x}_u - \tilde{y}_i\| < \|\tilde{x}_u - \tilde{y}_j\| \;\Leftrightarrow\; \tilde{x}_u^T \tilde{y}_i > \tilde{x}_u^T \tilde{y}_j \;\Leftrightarrow\; \theta_{\tilde{x}_u \tilde{y}_i} < \theta_{\tilde{x}_u \tilde{y}_j}, \quad (13)$$

i.e., the ranking of items according to $l_2$ distance on the normalized vectors is compatible with that according to angular distance, implying that Indexable BPR's output can support kNN search using a spatial tree.
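A short illustration of this equivalence using SciPy's KD-tree on normalized output vectors. Note that this is our own assumption-laden stand-in: scipy's cKDTree performs exact search, whereas the paper uses the approximate FLANN implementation [18], whose node budget c has no direct equivalent here.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
Y = rng.normal(size=(10000, 20))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)   # normalized item vectors from Algorithm 1
xu = rng.normal(size=20)
xu /= np.linalg.norm(xu)

tree = cKDTree(Y)                 # index the items once
dist, idx = tree.query(xu, k=10)  # l2 kNN == angular kNN on the unit sphere (Eq. 13)
```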
In this paper, we consider a well-known tree structure, the KD-tree. Approximate kNN retrieval can be achieved by restricting the search time on the tree [7]. The implementation of the KD-tree in [18] controls this by $c$, the number of nodes to explore on the tree. Figure 5 shows the nRecall@k for various $c \in \{500, 1000, 1500\}$. We also experimented with $c \in \{50, 150, 300, 750, 2000\}$ and observed similar trends. Indexable BPR consistently outperforms the baselines at all values of $c$. Notably, Indexable BPR outperforms BPR(MF)+, which in turn outperforms BPR(MF), validating the point made earlier about native indexability. Figure 6 plots the accuracy in terms of nRecall@10 vs. the retrieval efficiency in terms of speedup. As we increase $c$, a longer search time on the KD-tree is allowed, resulting in higher quality of the returned top-k. Here too, Indexable BPR achieves higher accuracy at the same speedup, and higher speedup at the same accuracy, as compared to the baselines.
4.3 Top-k Recommendation with Inverted Index

Figure 6: nRecall@10 vs. speedup with KD-tree indexing. [Panels: (a) MovieLens 20M, (b) Netflix; speedup on a log scale.]

For recommendation retrieval, [4] presents an inverted index scheme, where every user or item is represented with a sparse vector derived from its dense real-valued latent vector via a transformation. Given the user sparse vector as a query, the inverted index returns as candidates the items with at least one common non-zero element with the query. Exact similarity computation is performed only on those candidates to find the top-k.
Here, we describe the indexing scheme very briefly; for an extended treatment, please refer to [4]. The sparse representations for users and items are obtained from their dense latent vectors (learnt by the recommendation algorithm, e.g., Indexable BPR) through a set of geometry-aware permutation maps $\Phi$ defined on a tessellated unit sphere. The tessellating vectors are generated from a base set $B_d = \{-1, -\frac{d-1}{d}, \ldots, -\frac{1}{d}, 0, \frac{1}{d}, \ldots, \frac{d-1}{d}, 1\}$, characterized by a parameter $d$. The obtained sparse vectors have sparsity patterns that are related to the angular closeness between the original latent vectors. The angular closeness between user vector $x_u$ and item vector $y_i$ is defined as $d_{ac}(x_u, y_i) = 1 - \frac{x_u^T y_i}{\|x_u\| \cdot \|y_i\|}$.

In the case of $\|x_u\| = \|y_i\| = 1$ for all $u \in \mathcal{U}, i \in \mathcal{I}$, we have, for all $i, j \in \mathcal{I}$:

$$d_{ac}(x_u, y_i) < d_{ac}(x_u, y_j) \;\Leftrightarrow\; \frac{x_u^T y_i}{\|x_u\| \cdot \|y_i\|} > \frac{x_u^T y_j}{\|x_u\| \cdot \|y_j\|} \;\Leftrightarrow\; \theta_{x_u y_i} < \theta_{x_u y_j} \quad (14)$$
The item ranking according to $d_{ac}$ is thus equivalent to that according to the angular distance $\theta$. We hypothesize that Indexable BPR, being based on angular distance, would be compatible with this structure.

The parameter $d$ can be tuned to control the trade-off between the efficiency and the quality of approximation of kNN retrieval. Increasing the value of $d$ leads to a higher number of items discarded by the inverted index, which leads to higher speedup of the top-k recommendation retrieval.
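As a rough illustration of the candidate-generation principle only: the sketch below substitutes a much simpler sparsification (keeping each vector's largest-magnitude coordinates as discrete terms) for the actual tessellation-based maps $\Phi$ of [4], which we do not reproduce here.

```python
import numpy as np
from collections import defaultdict

def sparsify(v, m=3):
    """Keep the m largest-magnitude coordinates, signed, as discrete 'terms'.
    A simplified stand-in for the geometry-aware maps of [4]."""
    top = np.argsort(-np.abs(v))[:m]
    return {(int(dim), int(np.sign(v[dim]))) for dim in top}

rng = np.random.default_rng(0)
Y = rng.normal(size=(10000, 20))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)

postings = defaultdict(set)              # inverted index: term -> items containing it
for i, y in enumerate(Y):
    for term in sparsify(y):
        postings[term].add(i)

xu = rng.normal(size=20)
xu /= np.linalg.norm(xu)
candidates = set().union(*(postings[t] for t in sparsify(xu) if t in postings))
# Exact angular distances are computed only for `candidates`.
```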
Figure 7: nRecall@k with inverted indexing. [Panels: (a) MovieLens 20M, (b) Netflix; one plot per d ∈ {150, 300, 500}, each showing nRecall@k against k.]
We run the experiments with different values of the parameter $d$ to explore the trade-off between speed and accuracy. Figure 7 presents the nRecall@k on the two datasets at $d \in \{150, 300, 500\}$. In all cases, Indexable BPR outperforms the baselines in terms of nRecall@k. This suggests that Indexable BPR produces a representation with a greater degree of compatibility in terms of the angular closeness $d_{ac}$ between users and their preferred items. As a result, the corresponding sparse vectors have highly similar sparsity patterns, which enhances the quality of kNN retrieval using the inverted index. Figure 8 shows the speedup using the inverted index as we vary the value of the parameter $d$. We observe that the speedup increases as $d$ increases. Indexable BPR shows superior performance compared to the other models at the same speedup.

Overall, Indexable BPR works well with all the indexing schemes. Effectively, we develop a model that works with multiple indices, leaving the choice of index structure to the respective application based on need. Our focus is on indexable recommendation algorithms. Here, several consistent observations emerge. Indexable BPR produces representations that are more amenable to indexing, as compared to the baselines BPR(MF)+ and BPR(MF). This validates the aim of Indexable BPR in learning natively indexable vectors for users and items. It also outperforms models that fit ratings, as opposed to ordinal triples, for top-k recommendation.
5 ANALYSIS ON LSH-FRIENDLINESS OF INDEXABLE BPR

Figure 8: nRecall@10 vs. speedup with inverted indexing. [Panels: (a) MovieLens 20M, (b) Netflix; points labeled by d ∈ {50, 100, 150, 200, 300, 500, 750}; speedup on a log scale.]

In an effort to further explain the outperformance of Indexable BPR when used with LSH, we analyze the compatibility between recommendation algorithms and hashing functions. Since LSH is inherently an approximate method, the loss of information caused by random hash functions is inevitable. Informally, a representation is LSH-friendly if the loss after hashing is as minimal as possible. To achieve such a small loss, a user's ranking of items based on the latent vectors should be preserved by the hashcodes.
Analysis. For $x_u, y_i, y_j$ in $\mathbb{R}^D$, one can estimate the probability that the corresponding hashcodes preserve the correct ordering between them. Let us consider the distribution of the Hamming distance $\|h(x_u) - h(y_i)\|_H$. Since the hash functions $h_1, h_2, \ldots, h_b$ are independent of one another, $\|h(x_u) - h(y_i)\|_H$ follows a binomial distribution with mean $b\, p_{x_u y_i}$ and variance $b\, p_{x_u y_i}(1 - p_{x_u y_i})$, where $p_{x_u y_i}$ is the probability of $x_u$ and $y_i$ having different hash values (this probability depends on the specific family of hash functions). Since a binomial distribution can be approximated by a normal distribution with the same mean and variance, and the difference of two normal distributions is another normal distribution, we have:

$$\|h(x_u) - h(y_j)\|_H - \|h(x_u) - h(y_i)\|_H \sim \text{Normal}\big(b\, p_{x_u y_j} - b\, p_{x_u y_i},\; b\, p_{x_u y_j}(1 - p_{x_u y_j}) + b\, p_{x_u y_i}(1 - p_{x_u y_i})\big) \quad (15)$$

Due to the shape of the normal distribution, Eq. 15 implies that a higher mean and smaller variance lead to a higher probability that the hashcode of $x_u$ is more similar to the hashcode of $y_i$ than to that of $y_j$. Therefore, for a fixed length $b$, if indeed $u$ prefers $i$ to $j$, we say that $x_u, y_i, y_j$ is a more LSH-friendly representation for $u$, $i$, and $j$ if the mean value $(p_{x_u y_j} - p_{x_u y_i})$ is higher and the variance $(p_{x_u y_j}(1 - p_{x_u y_j}) + p_{x_u y_i}(1 - p_{x_u y_i}))$ is smaller.
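The normal approximation in Eq. 15 is easy to evaluate numerically; the sketch below computes the probability that b-bit hashcodes rank the preferred item ahead (the probability inputs are made-up values for illustration):

```python
import math

def order_preserved_prob(p_uj, p_ui, b):
    """Normal approximation (Eq. 15) to the probability that b-bit hashcodes
    rank the preferred item i ahead of j, i.e. that
    ||h(xu) - h(yj)||_H - ||h(xu) - h(yi)||_H > 0."""
    mean = b * (p_uj - p_ui)
    var = b * (p_uj * (1 - p_uj) + p_ui * (1 - p_ui))
    return 1.0 - 0.5 * (1.0 + math.erf(-mean / math.sqrt(2 * var)))

# A larger gap p_uj - p_ui makes order preservation more likely:
print(order_preserved_prob(0.45, 0.35, b=16))   # ~0.72
print(order_preserved_prob(0.60, 0.20, b=16))   # ~0.99
```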
Hence, the mean and the variance in Eq. 15 could potentially reveal which representation is more LSH-friendly, i.e., preserves information better after hashing. For each user $u \in \mathcal{U}$, let $\tau_k^u$ be the set of items in the top-k by a method before hashing, and $\bar{\tau}_k^u$ be all the other items not returned by the model. We are interested in whether, after hashing, the items in $\tau_k^u$ would be closer to the user than the items in $\bar{\tau}_k^u$. To account for this potential, we introduce two measures: MeanNorm@k and VarNorm@k.

Figure 9: LSH-friendliness measurements at D = 20. [The underlying values are:]

MovieLens 20M   CFEE    COE     IPMF    BPR(MF)  BPR(MF)+  Indexable BPR
MeanNorm@10     0.137   0.188   0.065   0.017    0.023     0.219
VarNorm@10      0.726   0.576   0.484   0.171    0.138     0.428

Netflix         CFEE    COE     IPMF    BPR(MF)  BPR(MF)+  Indexable BPR
MeanNorm@10     0.163   0.080   0.072   0.018    0.025     0.247
VarNorm@10      0.699   0.755   0.480   0.192    0.146     0.424
$$\text{MeanNorm@}k = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \sum_{i \in \tau_k^u} \sum_{j \in \bar{\tau}_k^u} \frac{p_{x_u y_j} - p_{x_u y_i}}{|\tau_k^u| \cdot |\bar{\tau}_k^u|}$$

$$\text{VarNorm@}k = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \sum_{i \in \tau_k^u} \sum_{j \in \bar{\tau}_k^u} \frac{p_{x_u y_j}(1 - p_{x_u y_j}) + p_{x_u y_i}(1 - p_{x_u y_i})}{|\tau_k^u| \cdot |\bar{\tau}_k^u|}$$
To achieve an LSH-friendly representation, MeanNorm@k should be high and VarNorm@k should be low. Fig. 9 shows bar charts displaying the values of these metrics. From Fig. 9, Indexable BPR shows higher mean values MeanNorm@10 (i.e., $k = 10$) at $D = 20$ (we observe the same results with other values of $D$ and $k$). Though BPR(MF) and BPR(MF)+ have smaller variance, their mean values are among the lowest. This result gives us a hint that Indexable BPR can preserve information after hashing more effectively.
Compatible Hash Function. There is an explanation for the superior numbers of Indexable BPR in Fig. 9. Specifically, the probability $p_{x_u y_i}$ depends on the LSH family. In particular, signed random projections [5, 9], or SRP-LSH, is meant for angular similarity. The angular similarity between $x, y$ is defined as $sim(x, y) = 1 - \cos^{-1}\left(\frac{x^T y}{\|x\| \cdot \|y\|}\right)/\pi$. The parameter $a$ is a random vector with each component drawn i.i.d. from a normal distribution. The hash function is defined as $h_a^{srp}(x) = \text{sign}(a^T x)$, and the probability of $x, y$ having different hash values is:

$$p_{xy} = \Pr\big(h_a^{srp}(x) \neq h_a^{srp}(y)\big) = \cos^{-1}\left(\frac{x^T y}{\|x\| \cdot \|y\|}\right)\Big/\pi = \frac{\theta_{xy}}{\pi}. \quad (16)$$
For Indexable BPR, as shown in Eq. 8, for each observation "$u$ prefers $i$ to $j$" we would like to maximize the difference $\theta_{x_u y_j} - \theta_{x_u y_i}$. From Eq. 16, we observe that the probability $p_{x_u y_i}$ is a linear function of the angular distance $\theta_{x_u y_i}$. Thus, we can infer that Indexable BPR's objective corresponds to maximizing $p_{x_u y_j} - p_{x_u y_i}$.
Figure 10: nDCG@10 at D ∈ {5, 10, 20, 30, 50, 75, 100}. [Panels: MovieLens 20M and Netflix; nDCG@10 against the dimensionality D.]
According to Eq. 15, this increases the probability that the Hamming distance between $u$ and $i$ is smaller than that between $u$ and $j$. In other words, the hashcodes are likely to preserve the ranking order. This alignment between the objective of Indexable BPR and the structural property of SRP-LSH implies that Indexable BPR is more LSH-friendly, which helps the model minimize information loss and show better post-indexing performance.
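The linearity in Eq. 16 is simple to verify empirically; the following check estimates Pr[h(x) ≠ h(y)] over many random SRP hyperplanes and compares it against θxy/π (the vectors and sample size are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=20), rng.normal(size=20)
theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

A = rng.normal(size=(200000, 20))            # many independent SRP hyperplanes
diff = np.mean((A @ x > 0) != (A @ y > 0))   # empirical Pr[h(x) != h(y)]

print(diff, theta / np.pi)                   # the two agree closely (Eq. 16)
```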
Also, the appropriate LSH family for methods based on $l_2$ distance, which includes COE, is L2-LSH [6]. However, there is a question as to how compatible the objective of COE is with these hash functions. The hash function of L2-LSH is defined as follows:

$$h_{a,b}^{L2}(x) = \left\lfloor \frac{a^T x + b}{r} \right\rfloor, \quad (17)$$

where $r$ is the window size, $a$ is a random vector with each component drawn i.i.d. from a normal distribution, and $b \sim \text{Uni}(0, r)$ is a scalar. The probability of two points $x, y$ having different hash values under an L2-LSH function is:

$$F_r^{L2}(d_{xy}) = \Pr\big(h_{a,b}^{L2}(x) \neq h_{a,b}^{L2}(y)\big) = 2\phi\left(-\frac{r}{d_{xy}}\right) + \frac{1}{\sqrt{2\pi}\,(r/d_{xy})}\left(1 - \exp\left(-\frac{(r/d_{xy})^2}{2}\right)\right), \quad (18)$$

where $\phi(x)$ is the cumulative distribution function of the normal distribution and $d_{xy} = \|x - y\|$ is the $l_2$ distance between $x$ and $y$. From Eq. 18, we see that $F_r^{L2}(d_{xy})$ is a nonlinear, monotonically increasing function of $d_{xy}$. COE's objective of maximizing $d_{x_u y_j} - d_{x_u y_i}$ does not directly maximize the corresponding mean value of the normal distribution (see Eq. 15), i.e., $F_r^{L2}(d_{x_u y_j}) - F_r^{L2}(d_{x_u y_i})$, since $F_r^{L2}$ is not a linear function of the $l_2$ distance. Our hypothesis is that, though both rely on ordinal triples, COE may not be as compatible with LSH as Indexable BPR.
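A quick numeric illustration of this nonlinearity, using Eq. 18 as reconstructed above: equal gaps in $l_2$ distance map to unequal gaps in the probability of differing hash values, so maximizing a distance gap is not the same as maximizing the probability gap.

```python
import math

def F_L2(d, r=1.0):
    """Pr[h(x) != h(y)] under L2-LSH as a function of d = ||x - y|| (Eq. 18)."""
    z = r / d
    phi = 0.5 * (1.0 + math.erf(-z / math.sqrt(2)))   # Phi(-r/d)
    return 2 * phi + (1 / (math.sqrt(2 * math.pi) * z)) * (1 - math.exp(-z * z / 2))

# Two equal-sized l2 gaps yield very different probability gaps (~0.26 vs ~0.09):
print(F_L2(1.0) - F_L2(0.5), F_L2(2.0) - F_L2(1.5))
```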
Empirical Evidence. For each user $u$, we rank the items that $u$ has rated in the test set, and measure how close the ranked list is to the ordering by ground-truth ratings. As a metric, we turn to the well-established ranking metric nDCG@k, where $k$ is the cut-off point for the ranked list. Its definition can be found in [26].

Fig. 10 shows the nDCG@10 values for MovieLens 20M and Netflix respectively at various dimensionalities $D$ of the latent vectors. We observe that Indexable BPR is among the best, with the most competitive baseline being IPMF (which fits ratings). More important is whether the models still perform well when used with index structures. As similar trends are observed with other values of $D$, subsequently we show results based on $D = 20$.

Here, the objective is to investigate the effectiveness of the LSH hashcodes in preserving the ranking among the rated items in the test set. We use Hamming ranking, repeating the same experiment as in Fig. 10, but using Hamming distances over hashcodes. This is to investigate how well Indexable BPR preserves the ranking compared to the baselines. As hashing relies on random hash functions, we average results over 10 different sets of functions.

Table 2: Absolute nDCG@10 and Relative nDCG@10 of all models as the length of LSH codes (b) varies.

                          MovieLens 20M                           Netflix
                Absolute nDCG@10    Relative nDCG@10    Absolute nDCG@10    Relative nDCG@10
b =             8     12    16      8     12    16      8     12    16      8     12    16
CFEE            0.582 0.582 0.585   0.805 0.806 0.809   0.559 0.561 0.562   0.834 0.836 0.838
COE             0.605 0.609 0.608   0.886 0.891 0.890   0.570 0.565 0.575   0.906 0.898 0.914
IPMF            0.702 0.728 0.704   0.920 0.955 0.923   0.705 0.737 0.747   0.896 0.936 0.949
BPR(MF)         0.599 0.603 0.605   0.831 0.837 0.840   0.560 0.551 0.553   0.863 0.849 0.853
BPR(MF)+        0.603 0.604 0.606   0.837 0.840 0.841   0.569 0.569 0.566   0.877 0.877 0.873
Indexable BPR   0.743 0.745 0.754   0.977 0.980 0.991   0.732 0.761 0.756   0.924 0.960 0.954

Table 2 shows the performance of all models. The two metrics are: Absolute nDCG@10, the nDCG@10 of the LSH hashcodes; and Relative nDCG@10, the ratio between the Absolute nDCG@10 and that of the original real-valued latent vectors. Indexable BPR consistently shows better Absolute nDCG@10 values than the baselines when using LSH indexing. This implies that Indexable BPR coupled with SRP-LSH produces more compact and informative hashcodes. Also, the Relative nDCG@10 values of Indexable BPR are close to 1 and higher than those of the baselines. These observations validate our hypotheses that not only is Indexable BPR competitively effective pre-indexing, but it is also more LSH-friendly, resulting in less loss of ranking accuracy post-indexing.
6 CONCLUSION

We propose a probabilistic method for modeling user preferences based on ordinal triples, which is geared towards top-k recommendation via approximate kNN search using indexing. The proposed model Indexable BPR produces an indexing-friendly representation, which results in significant speedups in top-k retrieval, while still maintaining high accuracy due to its compatibility with indexing structures such as LSH, spatial trees, and inverted index. As future work, a potential direction is to go beyond achieving representations more compatible with existing indexing schemes, to designing novel data structures or indexing schemes that would better support efficient and accurate recommendation retrieval.
ACKNOWLEDGMENTS

This research is supported by the National Research Foundation, Prime Minister's Office, Singapore under its NRF Fellowship Programme (Award No. NRF-NRFF2016-07).
REFERENCES

[1] Mohamed Aly, Mario Munich, and Pietro Perona. 2011. Indexing in large scale image collections: Scaling properties and benchmark. In IEEE Workshop on Applications of Computer Vision (WACV). 418-425.
[2] Yoram Bachrach, Yehuda Finkelstein, Ran Gilad-Bachrach, Liran Katzir, Noam Koenigstein, Nir Nice, and Ulrich Paquet. 2014. Speeding up the Xbox recommender system using a Euclidean transformation for inner-product spaces. In RecSys. ACM, 257-264.
[3] Jon Louis Bentley. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9 (1975), 509-517.
[4] Avradeep Bhowmik, Nathan Liu, Erheng Zhong, Badri Narayan Bhaskar, and Suju Rajan. 2016. Geometry Aware Mappings for High Dimensional Sparse Factors. In AISTATS.
[5] Moses S. Charikar. 2002. Similarity estimation techniques from rounding algorithms. In STOC. ACM, 380-388.
[6] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. 2004. Locality-sensitive hashing scheme based on p-stable distributions. In SoCG. ACM, 253-262.
[7] Marco Fraccaro, Ulrich Paquet, and Ole Winther. 2016. Indexable Probabilistic Matrix Factorization for Maximum Inner Product Search. In AAAI.
[8] Ruining He and Julian McAuley. 2016. VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback. In AAAI.
[9] Jianqiu Ji, Jianmin Li, Shuicheng Yan, Bo Zhang, and Qi Tian. 2012. Super-bit locality-sensitive hashing. In NIPS. 108-116.
[10] Mohammad Khoshneshin and W. Nick Street. 2010. Collaborative filtering via Euclidean embedding. In RecSys. ACM, 87-94.
[11] Noam Koenigstein, Parikshit Ram, and Yuval Shavitt. 2012. Efficient retrieval of recommendations in a matrix factorization framework. In CIKM. ACM, 535-544.
[12] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009).
[13] Artus Krohn-Grimberghe, Lucas Drumond, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2012. Multi-relational matrix factorization using Bayesian personalized ranking for social network data. In WSDM. 173-182.
[14] J. B. Kruskal. 1964. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29, 1 (1964).
[15] Dung D. Le and Hady W. Lauw. 2016. Euclidean Co-Embedding of Ordinal Data for Multi-Type Visualization. In SDM. SIAM, 396-404.
[16] Lukas Lerche and Dietmar Jannach. 2014. Using graded implicit feedback for Bayesian personalized ranking. In RecSys. 353-356.
[17] Brian McFee and Gert R. G. Lanckriet. 2011. Large-scale music similarity search with spatial trees. In ISMIR.
[18] Marius Muja and David G. Lowe. 2009. Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration. In International Conference on Computer Vision Theory and Applications (VISAPP'09). INSTICC Press, 331-340.
[19] Behnam Neyshabur and Nathan Srebro. 2015. On Symmetric and Asymmetric LSHs for Inner Product Search. In ICML.
[20] Weike Pan and Li Chen. 2013. GBPR: Group Preference Based Bayesian Personalized Ranking for One-Class Collaborative Filtering. In IJCAI, Vol. 13. 2691-2697.
[21] Parikshit Ram and Alexander G. Gray. 2012. Maximum inner-product search using cone trees. In KDD. ACM, 931-939.
[22] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In UAI. AUAI Press, 452-461.
[23] Ruslan Salakhutdinov and Andriy Mnih. 2008. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In ICML. ACM, 880-887.
[24] Anshumali Shrivastava and Ping Li. 2014. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In NIPS. 2321-2329.
[25] Anshumali Shrivastava and Ping Li. 2015. Improved Asymmetric Locality Sensitive Hashing (ALSH) for Maximum Inner Product Search (MIPS). In UAI.
[26] Markus Weimer, Alexandros Karatzoglou, Quoc V. Le, and Alexander J. Smola. 2007. COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking. In NIPS.
[27] Hanwang Zhang, Fumin Shen, Wei Liu, Xiangnan He, Huanbo Luan, and Tat-Seng Chua. 2016. Discrete collaborative filtering. In SIGIR.
[28] Zhiwei Zhang, Qifan Wang, Lingyun Ruan, and Luo Si. 2014. Preference preserving hashing for efficient recommendation. In SIGIR. ACM, 183-192.
[29] Ke Zhou and Hongyuan Zha. 2012. Learning binary codes for collaborative filtering. In KDD. ACM, 498-506.
... In the face of these challenges, recent approaches turn to indexing schemes to overcome the prohibitive cost of performing an exhaustive top-k recommendation search for each user. In particular, one of the most popular such schemes is Locality-Sensitive Hashing (LSH) (Shrivastava and Li 2015; Bachrach et al. 2014; Fraccaro, Paquet, and Winther 2016; Le and Lauw 2017; Hsieh et al. 2017; Liu and Wu 2016; Koenigstein and Koren 2013; Qi et al. 2017; Smirnov and Ponomarev 2014). In the prevalent binary variant, LSH approximates relative distances between data points by computing the Hamming distance between the corresponding codes, hashing similar data points to similar codes with high probability. ...
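To make the binary variant concrete, here is a minimal sign-random-projection sketch (assuming NumPy; the dimension and code length are illustrative): each bit records which side of a random hyperplane a vector falls on, and the Hamming distance between two codes estimates the angle between the original vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, b = 32, 64                              # latent dimension, code length (illustrative)
hyperplanes = rng.standard_normal((b, d))  # one random hyperplane per bit

def lsh_code(x):
    """b-bit binary code: which side of each random hyperplane x falls on."""
    return (hyperplanes @ x >= 0).astype(np.uint8)

def hamming(c1, c2):
    """Number of disagreeing bits between two codes."""
    return int(np.count_nonzero(c1 != c2))

u, v = rng.standard_normal(d), rng.standard_normal(d)
# For sign random projections, E[hamming / b] = angle(u, v) / pi,
# so nearby vectors agree on most bits with high probability.
print(hamming(lsh_code(u), lsh_code(v)))
```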
... A triple t_{uij} ∈ T relates one user u ∈ U and two different items i, j ∈ I, indicating u's preference for item i over item j. From ratings, we can induce an ordinal triple for each instance in which user u rates item i higher than she rates item j (Le and Lauw 2017). Triples can also model implicit feedback (Rendle et al. 2009). ...
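As an illustration of how such triples are induced from ratings, here is a hedged sketch on toy data (the dictionary layout is hypothetical): a triple (u, i, j) is emitted whenever u rates i strictly higher than j.

```python
ratings = {                      # toy data: user -> {item: rating}
    "u1": {"i1": 5, "i2": 3, "i3": 3},
    "u2": {"i1": 2, "i3": 4},
}

def ordinal_triples(ratings):
    """Yield (u, i, j) whenever user u rates item i strictly higher than item j."""
    for u, r in ratings.items():
        for i in r:
            for j in r:
                if r[i] > r[j]:
                    yield (u, i, j)

print(list(ordinal_triples(ratings)))
# [('u1', 'i1', 'i2'), ('u1', 'i1', 'i3'), ('u2', 'i3', 'i1')]
```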
... IPMF (Fraccaro, Paquet, and Winther 2016) keeps the classic formulation of matrix factorization but incorporates an additional constraint that all item vectors have the same magnitude. IBPR (Le and Lauw 2017) proposes the use of an angular distance kernel, evaluated as the arccos of the inner product between the normalized vectors, to model pairwise ordinal preferences. These methods, however, do not take into account the LSH stochasticity. ...
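A minimal sketch of such an angular distance kernel (assuming NumPy; the clip merely guards against floating-point rounding). A smaller angle between a user vector and an item vector then corresponds to a stronger preference.

```python
import numpy as np

def angular_distance(x, y):
    """arccos of the inner product between the L2-normalized vectors."""
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(cos, -1.0, 1.0))  # clip guards rounding error
```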
Article
Full-text available
Locality Sensitive Hashing (LSH) has become one of the most commonly used approximate nearest neighbor search techniques to avoid the prohibitive cost of scanning through all data points. For recommender systems, LSH achieves efficient recommendation retrieval by encoding user and item vectors into binary hash codes, reducing the cost of exhaustively examining all the item vectors to identify the top-k items. However, conventional matrix factorization models may suffer from performance degeneration caused by randomly-drawn LSH hash functions, directly affecting the ultimate quality of the recommendations. In this paper, we propose a framework which factors in the stochasticity of LSH hash functions when learning real-valued user and item latent vectors, eventually improving the recommendation accuracy after LSH indexing. Experiments on publicly available datasets show that the proposed framework not only effectively learns users' preferences for prediction, but also achieves high compatibility with LSH stochasticity, producing superior post-LSH indexing performances as compared to state-of-the-art baselines.
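One way to read "factoring in the stochasticity of LSH hash functions": under sign random projections, the Hamming distance between two b-bit codes is a binomial random variable whose mean depends only on the angle between the underlying vectors (Charikar 2002). A hedged sketch of that expectation, which a training objective could score against instead of raw inner products:

```python
import numpy as np

def expected_hamming(x, y, b):
    """E[Hamming distance] of b-bit sign-random-projection codes:
    each bit disagrees independently with probability theta / pi."""
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    return b * theta / np.pi
```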
... One is IPMF (Fraccaro, Paquet, & Winther, 2016), which extends Bayesian Probabilistic Matrix Factorization (Salakhutdinov & Mnih, 2008) by making the item latent vectors natively of fixed length. Indexable Bayesian Personalized Ranking, or IBPR (Le & Lauw, 2017), on the other hand, proposes the use of an angular distance kernel, evaluated as the arccos of the inner product between the normalized vectors, to model pairwise ordinal preferences. Both IPMF and IBPR produce output vectors for which MIPS is equivalent to NNS and MCSS. ...
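The claimed equivalence is easy to verify numerically. A small sketch, under the assumption that all item vectors share the same norm c: since ||x - q||^2 = c^2 + ||q||^2 - 2<x, q>, the three rankings coincide.

```python
import numpy as np

rng = np.random.default_rng(1)
q = rng.standard_normal(16)                            # user query vector
items = rng.standard_normal((100, 16))
items /= np.linalg.norm(items, axis=1, keepdims=True)  # constant-norm item vectors

mips = int(np.argmax(items @ q))                          # max inner product
nns = int(np.argmin(np.linalg.norm(items - q, axis=1)))   # nearest neighbor
mcss = int(np.argmax((items @ q) / np.linalg.norm(q)))    # max cosine similarity
assert mips == nns == mcss
```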
... Retrieval-efficient structures refer to the data structures that support efficient top-k recommendation retrieval, such as sorted index lists (Yu et al., 2017), LSH tables (Le & Lauw, 2017; Bachrach et al., 2014), spatial trees (Keivani et al., 2018), or similarity graphs (Morozov & Babenko, 2018). These methods constitute a significant portion of the surveyed approaches for efficient top-k candidate generation. ...
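As a toy illustration of one such structure, this hedged sketch builds a single LSH table and restricts exact scoring to colliding items; production systems typically use several tables or multi-probe lookups to control recall.

```python
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(2)
d, b = 16, 12                          # illustrative sizes
planes = rng.standard_normal((b, d))
items = rng.standard_normal((1000, d))

def code(x):
    """b-bit bucket key for vector x."""
    return tuple((planes @ x >= 0).astype(int))

table = defaultdict(list)              # hash table: code -> item ids
for idx, v in enumerate(items):
    table[code(v)].append(idx)

def candidates(user_vec):
    """Only the items whose code collides with the user's code; this short
    list is then re-ranked exactly instead of scanning all items."""
    return table[code(user_vec)]
```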
Article
Full-text available
Top-k recommendation seeks to deliver a personalized list of k items to each individual user. An established methodology in the literature based on matrix factorization (MF), which usually represents users and items as vectors in low-dimensional space, is an effective approach to recommender systems, thanks to its superior performance in terms of recommendation quality and scalability. A typical matrix factorization recommender system has two main phases: preference elicitation and recommendation retrieval. The former analyzes user-generated data to learn user preferences and item characteristics in the form of latent feature vectors, whereas the latter ranks the candidate items based on the learnt vectors and returns the top-k items from the ranked list. For preference elicitation, there have been numerous works to build accurate MF-based recommendation algorithms that can learn from large datasets. However, for the recommendation retrieval phase, naively scanning a large number of items to identify the few most relevant ones may inhibit truly real-time applications. In this work, we survey recent advances and state-of-the-art approaches in the literature that enable fast and accurate retrieval for MF-based personalized recommendations. Also, we include analytical discussions of approaches along different dimensions to provide the readers with a more comprehensive understanding of the surveyed works.
... We primarily focus on collaborative filtering algorithms of different types, as they are known as the most effective and successful approaches in the recommender systems literature. As in very related studies [14], ... [43] and Indexable Bayesian Personalized Ranking (IBPR) [44] as probabilistic methods, Spherical K-means (SKM) [45] as a clustering-based strategy, and lastly, Neural Matrix Factorization (NEUMF) [46] and Variational Autoencoder for Collaborative Filtering (VAECF) [47] as neural network-based approaches. In addition to these personalized recommendation algorithms, we also utilize two non-personalized traditional methods, namely Random and ItemAVG. ...
Article
Full-text available
Recommender systems are subject to well‐known popularity bias issues, that is, they expose frequently rated items more in recommendation lists than less‐rated ones. Such a problem could also have varying effects on users with different gender, age, or rating behavior, which significantly diminishes the users' overall satisfaction with recommendations. In this paper, we approach the problem from the view of user personalities for the first time and discover how users are inclined toward popular items based on their personality traits. More importantly, we analyze the potential unfairness concerns for users with different personalities, which the popularity bias of the recommenders might cause. To this end, we split users into groups of high, moderate, and low clusters in terms of each personality trait in the big‐five factor model and investigate how the popularity bias impacts such groups differently by considering several criteria. The experiments conducted with 10 well‐known algorithms of different kinds have concluded that less‐extroverted people and users avoiding new experiences are exposed to more unfair recommendations regarding popularity, despite being the most significant contributors to the system. However, discrepancies in other qualities of the recommendations for these user characteristics, such as accuracy, diversity, and novelty, vary depending on the utilized algorithm.
... To provide a comprehensive analysis in the experiments, we utilize Maximum Margin Matrix Factorization (MMMF) (Weimer et al., 2008), Weighted Matrix Factorization (WMF) (Hu et al., 2008), and Hierarchical Poisson Factorization (HPF) (Gopalan et al., 2015) as matrix factorization-based approaches, Weighted Bayesian Personalized Ranking (WBPR) (Gantner et al., 2012) and Indexable Bayesian Personalized Ranking (IBPR) (Le and Lauw, 2017) as probabilistic methods, Spherical K-means (SKM) (Salah et al., 2016) as a clustering-based method, and finally, Neural Matrix Factorization (NEUMF) (He et al., 2017) and Variational Autoencoder for Collaborative Filtering (VAECF) (Liang et al., 2018) as neural network-based methods. In addition to such personalized CF recommendation approaches, we also consider two non-personalized strategies to demonstrate baseline performances (Boratto et al., 2019). ...
Article
Full-text available
The popularity bias problem is one of the most prominent challenges of recommender systems, i.e., while a few heavily rated items receive much attention in presented recommendation lists, less popular ones are underrepresented even if they would be of close interest to the user. This structural tendency of recommendation algorithms causes several unfairness issues for most of the items in the catalog, as they have trouble finding a place in the top-k lists. In this study, we evaluate the popularity bias problem from users' viewpoint and discuss how to alleviate it by considering users as one of the major stakeholders. We derive five critical discriminative features based on the following five essential attributes related to users' rating behavior: (i) the interaction level of users with the system, (ii) the overall liking degree of users, (iii) the degree of anomalous rating behavior of users, (iv) the consistency of users, and (v) the informative level of the user profiles, and analyze their relationships to the original inclinations of users toward item popularity. More importantly, we investigate their associations with possible unfairness concerns for users, which the popularity bias in recommendations might induce. The analysis using ten well-known recommendation algorithms from different families on four real-world preference collections from different domains reveals that the popularity propensities of individuals are significantly correlated with almost all of the investigated features with varying trends, and algorithms are strongly biased towards popular items. Especially, highly interacting, selective, and hard-to-predict users face highly unfair, relatively inaccurate, and primarily unqualified recommendations in terms of beyond-accuracy aspects, although they are major stakeholders of the system. We also analyze how state-of-the-art popularity debiasing strategies act to remedy these problems. Although they are more effective for mistreated groups in alleviating unfairness and improving beyond-accuracy quality, they mostly fail to preserve ranking accuracy.
... They can be broadly classified into two types as depicted in Figure 1, namely model-independent and model-dependent hyper-factors. The former refers to the hyper-factors that are isolated from the model design and optimization process (e.g., dataset and comparison baseline selection); whilst the latter indicates the ones involved in the model development and parameter optimization procedure (e.g., loss function design and regularization terms). [A table tallying the 141 surveyed papers per venue (CIKM: 19, IJCAI: 20, KDD: 12, RecSys: 14, SIGIR: 22, WSDM: 16, WWW: 16, among others) is omitted here.] According to this categorization, three main aspects may inherently lead to such non-rigorous evaluation. ...
Preprint
Full-text available
Recently, one critical issue looms large in the field of recommender systems -- there are no effective benchmarks for rigorous evaluation -- which consequently leads to unreproducible evaluation and unfair comparison. We, therefore, conduct studies from the perspectives of practical theory and experiments, aiming at benchmarking recommendation for rigorous evaluation. Regarding the theoretical study, a series of hyper-factors affecting recommendation performance throughout the whole evaluation chain are systematically summarized and analyzed via an exhaustive review on 141 papers published at eight top-tier conferences within 2017-2020. We then classify them into model-independent and model-dependent hyper-factors, and different modes of rigorous evaluation are defined and discussed in-depth accordingly. For the experimental study, we release DaisyRec 2.0 library by integrating these hyper-factors to perform rigorous evaluation, whereby a holistic empirical study is conducted to unveil the impacts of different hyper-factors on recommendation performance. Supported by the theoretical and experimental studies, we finally create benchmarks for rigorous evaluation by proposing standardized procedures and providing performance of ten state-of-the-arts across six evaluation metrics on six datasets as a reference for later study. Overall, our work sheds light on the issues in recommendation evaluation, provides potential solutions for rigorous evaluation, and lays foundation for further investigation.
... Note that besides the inner product kernel, alternative formulations of the preference score function include the ℓ2 distance [13], angular distance [18], or non-linear functions [12, 25]. ...
Conference Paper
Full-text available
In many visually-oriented applications, users can select and group images that they find interesting into coherent clusters. For instance, we encounter these in the form of hashtags on Instagram, galleries on Flickr, or boards on Pinterest. The selection and coherence of such user-curated visual clusters arise from a user's preference for a certain type of content as well as her own perception of which images are similar and thus belong to a cluster. We seek to model such curation behaviors towards supporting users in their future activities, such as expanding existing clusters or discovering new clusters altogether. This paper proposes a framework, namely Collaborative Curating, that jointly models the interrelated modalities of preference expression and similarity perception. Extensive experiments on real-world datasets of various categories from a visual curating platform show that the proposed framework significantly outperforms baselines focusing on either clustering behaviors or preferences alone.
Article
Full-text available
Recommender systems have become increasingly important in today's digital age, but they are not without their challenges. One of the most significant challenges is that users are not always willing to share their preferences due to privacy concerns, yet they still require decent recommendations. Privacy-preserving collaborative recommenders remedy such concerns by letting users set their privacy preferences before submitting to the recommendation provider. Another recently discussed challenge is the problem of popularity bias, where the system tends to recommend popular items more often than less popular ones, limiting the diversity of recommendations and preventing users from discovering new and interesting items. In this article, we comprehensively analyze the randomized perturbation-based data disguising procedure of privacy-preserving collaborative recommender algorithms against the popularity bias problem. For this purpose, we construct user personas of varying privacy protection levels and scrutinize the performance of ten recommendation algorithms on these user personas regarding the accuracy and beyond-accuracy perspectives. We also investigate how well-known popularity-debiasing strategies combat the issue in privacy-preserving environments. In experiments, we employ three well-known real-world datasets. The key findings of our analysis reveal that privacy-sensitive users receive unbiased and fairer recommendations that are qualified in diversity, novelty, and catalogue coverage perspectives, in exchange for a tolerable sacrifice in accuracy. Also, prominent popularity-debiasing strategies fall considerably short as the provided privacy level improves.
Article
Evaluating recommender systems adequately and thoroughly is an important task. Significant efforts are dedicated to proposing metrics, methods and protocols for doing so. However, there has been little discussion in the recommender systems' literature on the topic of testing. In this work, we adopt and adapt concepts from the software testing domain, e.g., code coverage, metamorphic testing, or property-based testing, to help researchers to detect and correct faults in recommendation algorithms. We propose a test suite that can be used to validate the correctness of a recommendation algorithm, and thus identify and correct issues that can affect the performance and behavior of these algorithms. Our test suite contains both black box and white box tests at every level of abstraction, i.e., system, integration and unit. To facilitate adoption, we release RecPack Tests, an open-source Python package containing template test implementations. We use it to test four popular Python packages for recommender systems: RecPack, PyLensKit, Surprise and Cornac. Despite the high test coverage of each of these packages, we find that we are still able to uncover undocumented functional requirements and even some bugs. This validates our thesis that testing the correctness of recommendation algorithms can complement traditional methods for evaluating recommendation algorithms.
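For flavor, a property-style unit test in the spirit described here might assert simple contracts of a top-k recommender. The `recommend` function below is a hypothetical stand-in (a popularity baseline), not the actual API of RecPack Tests or of the tested packages.

```python
from collections import Counter

def recommend(train, user, k):
    """Hypothetical stand-in recommender: most popular unseen items."""
    seen = set(train[user])
    pop = Counter(i for items in train.values() for i in items)
    return [i for i, _ in pop.most_common() if i not in seen][:k]

def test_topk_contract():
    train = {"u1": ["i1", "i2"], "u2": ["i2", "i3"], "u3": ["i3", "i4"]}
    recs = recommend(train, user="u1", k=2)
    assert len(recs) <= 2                     # never more than k items
    assert len(set(recs)) == len(recs)        # no duplicates
    assert not set(recs) & set(train["u1"])   # no already-seen items
```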
Article
In recommender systems, top-N recommendation is an important task with implicit feedback data. Although the recent success of deep learning largely pushes forward the research on top-N recommendation, there are increasing concerns on appropriate evaluation of recommendation algorithms. It becomes urgent to study how recommendation algorithms can be reliably evaluated and thoroughly verified. This work presents a large-scale, systematic study on six important factors from three aspects for evaluating recommender systems. We carefully select twelve top-N recommendation algorithms and eight recommendation datasets. Our experiments are carefully designed and extensively conducted with these algorithms and datasets. In particular, all the experiments in our work are implemented based on an open-sourced recommendation library, RecBole [139], which ensures the reproducibility and reliability of our results. Based on the large-scale experiments and detailed analysis, we derive several key findings on the experimental settings for evaluating recommender systems. Our findings show that some settings can lead to substantial or significant differences in the performance ranking of the compared algorithms. In response to recent evaluation concerns, we also provide several suggested settings that are specially important for performance comparison.
Conference Paper
Full-text available
Embedding deals with reducing the high-dimensional representation of data into a low-dimensional representation. Previous work mostly focuses on preserving similarities among objects. Here, not only do we explicitly recognize multiple types of objects, but we also focus on the ordinal relationships across types. Collaborative Ordinal Embedding or COE is based on generative modelling of ordinal triples. Experiments show that COE outperforms the baselines on objective metrics, revealing its capacity for information preservation for ordinal data.
Article
The Maximum Inner Product Search (MIPS) problem, prevalent in matrix factorization-based recommender systems, scales linearly with the number of objects to score. Recent work has shown that clever post-processing steps can turn the MIPS problem into a nearest neighbour one, allowing sublinear retrieval time either through Locality Sensitive Hashing or various tree structures that partition the Euclidean space. This work shows that instead of employing post-processing steps, substantially faster retrieval times can be achieved for the same accuracy when inference is not decoupled from the indexing process. By framing matrix factorization to be natively indexable, so that any solution is immediately sublinearly searchable, we use the machinery of Machine Learning to best learn such a solution. We introduce Indexable Probabilistic Matrix Factorization (IPMF) to shift the traditional post-processing complexity into the training phase of the model. Its inference procedure is based on Geodesic Monte Carlo, and adds minimal additional computational cost to standard Monte Carlo methods for matrix factorization. By coupling inference and indexing in this way, we achieve more than a 50% improvement in retrieval time against two state-of-the-art methods, for a given level of accuracy in the recommendations of two large-scale recommender systems.
Conference Paper
We address the efficiency problem of Collaborative Filtering (CF) by hashing users and items as latent vectors in the form of binary codes, so that user-item affinity can be efficiently calculated in a Hamming space. However, existing hashing methods for CF employ binary code learning procedures that mostly suffer from the challenging discrete constraints. Hence, those methods generally adopt a two-stage learning scheme composed of relaxed optimization via discarding the discrete constraints, followed by binary quantization. We argue that such a scheme will result in a large quantization loss, which especially compromises the performance of large-scale CF that resorts to longer binary codes. In this paper, we propose a principled CF hashing framework called Discrete Collaborative Filtering (DCF), which directly tackles the challenging discrete optimization that should have been treated adequately in hashing. The formulation of DCF has two advantages: 1) the Hamming similarity induced loss that preserves the intrinsic user-item similarity, and 2) the balanced and uncorrelated code constraints that yield compact yet informative binary codes. We devise a computationally efficient algorithm with a rigorous convergence proof of DCF. Through extensive experiments on several real-world benchmarks, we show that DCF consistently outperforms state-of-the-art CF hashing techniques; e.g., though using only 8 bits, DCF is even significantly better than other methods using 128 bits.
Article
While matrix factorisation models are ubiquitous in large scale recommendation and search, real time application of such models requires inner product computations over an intractably large set of item factors. In this manuscript we present a novel framework that uses the inverted index representation to exploit structural properties of sparse vectors to significantly reduce the run time computational cost of factorisation models. We develop techniques that use geometry aware permutation maps on a tessellated unit sphere to obtain high dimensional sparse embeddings for latent factors with sparsity patterns related to angular closeness of the original latent factors. We also design several efficient and deterministic realisations within this framework and demonstrate with experiments that our techniques lead to faster run time operation with minimal loss of accuracy.
Sign-random-projection locality-sensitive hashing (SRP-LSH) is a probabilistic dimension reduction method which provides an unbiased estimate of angular similarity, yet suffers from the large variance of its estimation. In this work, we propose Super-Bit locality-sensitive hashing (SBLSH). It is easy to implement: it orthogonalizes the random projection vectors in batches, and it is theoretically guaranteed that SBLSH also provides an unbiased estimate of angular similarity, yet with a smaller variance when the angle to estimate is within (0, π/2]. Extensive experiments on real data validate that, given the same length of binary code, SBLSH may achieve significant mean squared error reduction in estimating pairwise angular similarity. Moreover, SBLSH shows superiority over SRP-LSH in approximate nearest neighbor (ANN) retrieval experiments.
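A hedged sketch of the batch orthogonalization idea (using QR factorization as the Gram-Schmidt step; sizes are illustrative): codes are then taken as the signs of `planes @ x`, exactly as in SRP, but with lower estimator variance for angles in (0, π/2].

```python
import numpy as np

def super_bit_projections(d, code_len, batch, rng):
    """Random projections orthogonalized in batches of size `batch` (<= d)."""
    assert batch <= d and code_len % batch == 0
    blocks = []
    for _ in range(code_len // batch):
        g = rng.standard_normal((d, batch))
        q, _ = np.linalg.qr(g)        # Gram-Schmidt via QR: one orthonormal batch
        blocks.append(q.T)
    return np.vstack(blocks)          # code_len x d projection matrix

rng = np.random.default_rng(3)
planes = super_bit_projections(d=32, code_len=64, batch=8, rng=rng)
```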
Conference Paper
A prominent approach in collaborative filtering based recommender systems is using dimensionality reduction (matrix factorization) techniques to map users and items into low-dimensional vectors. In such systems, a higher inner product between a user vector and an item vector indicates that the item better suits the user's preference. Traditionally, retrieving the most suitable items is done by scoring and sorting all items. Real-world online recommender systems must adhere to strict response-time constraints, so when the number of items is large, scoring all items is intractable. We propose a novel order-preserving transformation, mapping the maximum inner product search problem to a Euclidean space nearest neighbor search problem. Utilizing this transformation, we study the efficiency of several (approximate) nearest neighbor data structures. Our final solution is based on a novel use of the PCA-Tree data structure in which results are augmented using paths one Hamming distance away from the query (neighborhood boosting). The end result is a system which allows approximate matches (items with relatively high inner product, but not necessarily the highest one). We evaluate our techniques on two large-scale recommendation datasets, Xbox Movies and Yahoo Music, and show that this technique allows trading off a slight degradation in the recommendation quality for a significant improvement in the retrieval time.
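The order-preserving transformation described in this abstract can be sketched in a few lines: append one extra coordinate to each item so that Euclidean nearest neighbor search in the augmented space recovers the maximum inner product. This is a sketch of the augmentation alone, not the paper's full PCA-Tree pipeline. Since ||x̃ - q̃||^2 = M^2 + ||q||^2 - 2<x, q>, minimizing the augmented distance maximizes the inner product.

```python
import numpy as np

def augment_items(X):
    """Append sqrt(M^2 - ||x||^2), where M is the max item norm."""
    norms2 = (X ** 2).sum(axis=1)
    return np.hstack([X, np.sqrt(norms2.max() - norms2)[:, None]])

def augment_query(q):
    """Queries get a zero in the extra coordinate."""
    return np.append(q, 0.0)

rng = np.random.default_rng(4)
X, q = rng.standard_normal((500, 8)), rng.standard_normal(8)
Xa, qa = augment_items(X), augment_query(q)
# Euclidean NN in the augmented space equals MIPS in the original space.
assert np.argmax(X @ q) == np.argmin(np.linalg.norm(Xa - qa, axis=1))
```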
Article
In many application domains of recommender systems, explicit rating information is sparse or non-existent. The preferences of the current user have therefore to be approximated by interpreting his or her behavior, i.e., the implicit user feedback. In the literature, a number of algorithm proposals have been made that rely solely on such implicit feedback, among them Bayesian Personalized Ranking (BPR). In the BPR approach, pairwise comparisons between the items are made in the training phase and an item i is considered to be preferred over item j if the user interacted in some form with i but not with j. In real-world applications, however, implicit feedback is not necessarily limited to such binary decisions as there are, e.g., different types of user actions like item views, cart or purchase actions and there can exist several actions for an item over time. In this paper we show how BPR can be extended to deal with such more fine-granular, graded preference relations. An empirical analysis shows that this extension can help to measurably increase the predictive accuracy of BPR on realistic e-commerce datasets.
Article
Recently we showed that the problem of Maximum Inner Product Search (MIPS) is efficient and it admits provably sub-linear hashing algorithms. In [23], we used asymmetric transformations to convert the problem of approximate MIPS into the problem of approximate near neighbor search which can be efficiently solved using L2-LSH. In this paper, we revisit the problem of MIPS and argue that the quantizations used in L2-LSH is suboptimal for MIPS compared to signed random projections (SRP) which is another popular hashing scheme for cosine similarity (or correlations). Based on this observation, we provide different asymmetric transformations which convert the problem of approximate MIPS into the problem amenable to SRP instead of L2-LSH. An additional advantage of our scheme is that we also obtain LSH type space partitioning which is not possible with the existing scheme. Our theoretical analysis shows that the new scheme is significantly better than the original scheme for MIPS. Experimental evaluations strongly support the theoretical findings. In addition, we also provide the first empirical comparison that shows the superiority of hashing over tree based methods [21] for MIPS.