Not All Memories Created Equal: Dynamic User Representations for Collaborative Filtering

KEREN GAIGER², OREN BARKAN¹, SHIR TSIPORY-SAMUEL², NOAM KOENIGSTEIN²
¹Department of Computer Science, The Open University, Israel
²Department of Industrial Engineering, Tel-Aviv University
Corresponding author: Noam Koenigstein (e-mail: noamk@tauex.tau.ac.il).
This research was supported by the Israel Science Foundation grant 2243/20.
ABSTRACT Collaborative filtering methods for recommender systems tend to represent users as a single
static latent vector. However, user behavior and interests may dynamically change in the context of the
recommended item being presented to the user. For example, in the case of movie recommendations, it
is usually true that movies that the user watched more recently are more informative than movies that
were watched a long time ago. However, it is possible that a particular movie from the past may become
suddenly more relevant for prediction in the presence of a recommendation for its sequel movie. In response
to this issue, we introduce the Attentive Item2Vec++ (AI2V++) model, a neural attentive collaborative
filtering approach in which the user representation adapts dynamically in the presence of the recommended
item. AI2V++ employs a novel context-target attention mechanism in order to learn and capture different
characteristics of the user’s historical behavior with respect to a potential recommended item. Furthermore,
analysis of the neural-attentive scores allows for improved interpretability and explainability of the model.
We evaluate our proposed approach on five publicly available datasets and demonstrate its superior
performance in comparison to state-of-the-art baselines across multiple accuracy metrics.
INDEX TERMS artificial neural networks, collaborative filtering, neural attention, recommender systems.
I. INTRODUCTION
Collaborative Filtering (CF) is one of the most effective and
widely used methods for recommender systems [1]. Its aim is
to recommend items to users based on their historical inter-
actions with other items. As such, the underlying assumption
of any CF algorithm is that users’ past experiences are highly
predictive of their future preferences. For instance, a user
who enjoyed watching a certain type of comedy movie is
likely to seek another movie that is similar to the one she
previously watched. Thus, a user’s memory, such as the items
they have interacted with, plays a crucial role in shaping their
future recommendations.
Although the role of memory in human decision-making
is not fully understood, empirical evidence suggests that
the human memory is dynamic, and different memories
become more accessible in different contexts [2]–[5]. Simi-
larly, users’ interests and preferences are also dynamic and
context-dependent [6], [7]. According to preference con-
struction theory, individuals do not always have a clear idea
of their preferences from the outset; rather, their prefer-
ences are shaped and constructed through complex decision-
making processes based on past experiences [8], [9]. There-
fore, traditional CF methods that represent users as a static
vector regardless of the context fail to account for the dy-
namic nature of users’ preferences and their varying rele-
vancy to different items.
To address this issue, there is a need to develop more
sophisticated CF models that can capture the dynamic nature
of users’ preferences and the varying relevancy of different
past experiences. These models can leverage more complex
representations of users that can capture contextual infor-
mation and the evolving nature of their preferences. Such
models can enhance the accuracy of recommender systems
and improve user satisfaction by providing recommendations
that are more relevant and better personalized.
To illustrate this gap, consider the example in Fig. 1 de-
picting a list of historical movies that a user has watched. The
list of movies consists of both horror movies as well as some
Figure 1: The distribution of attention scores over a user's historical items, which include both horror movies and family-fantasy movies. The scores are presented with respect to two different target items: (1) on the upper image, with respect to the horror movie "Scream (1996)" and (2) on the lower image, with respect to the family-fantasy movie "Cinderella (2021)". We see how the attention scores adapt with respect to the target item.
family-fantasy movies. Accordingly, the user in this example
is generally interested in popular mainstream horror movies,
but from time to time her 5-year-old daughter joins her and
they both watch an age-appropriate movie from the family-
fantasy genre. Most of the time, the family-fantasy movies
are irrelevant to her personal taste and interests which are
mostly determined by her love of horror movies. Occasion-
ally however, when considering a movie to watch with her
daughter such as the Disney movie Cinderella, all the other
family-fantasy movies she has watched in the past should
come to the foreground and become dominant in determining
her affinity to the recommended item.
This paper presents the Attentive Item-to-Vector++
(AI2V++) model, which is a novel CF model designed to
capture the dynamic nature of users’ preferences by adapting
to changes in the relevance of historical items. The AI2V++
model is inspired by the dynamic nature of the human mem-
ory, in which different memories become more accessible
in different contexts. The model uses attention mechanisms
to dynamically adjust the importance of the historical items
based on the item being considered for recommendation. As
such, AI2V++ addresses the limitations of traditional CF
models, which represent users as a static vector and ignore
the varying relevancy of different past experiences. To the
best of our knowledge, AI2V++ is the first CF model to
explicitly account for the dynamic nature of human memory
in the context of recommender systems.
The key novelty of the AI2V++ model arises from its
unique multi-attentive user representation that changes and
adjusts in the presence of the target item. To this end,
AI2V++ employs multiple attention networks in parallel on
the user’s historical items with respect to a potential target
item to score. Each attention network produces an attentive
context-target representation, representing a different pattern
in user-item affinities. This results in multiple contextualized
user representations which are “aware” of the target item to
score. At this point, the model aggregates all the context-
aware user representations to form a final contextualized user
representation that is subsequently used for scoring the target
item. This novel context-target attention mechanism enables
superior accuracy with respect to state-of-the-art alternative
models.
AI2V++ makes another noteworthy contribution with re-
spect to its interpretability properties. The attention mecha-
nism embedded in AI2V++, which mimics the human brain’s
function, facilitates the identification of insights regarding
which items the model considers as more significant for
the recommendations. By highlighting relevant items in the
user’s history, AI2V++’s internal processes can be compre-
hended, and its predictions can be explicated. Consequently,
AI2V++ takes a step forward in the realm of explainability
for recommender systems based on collaborative filtering.
The remainder of this paper is organized as follows: Sec-
tion II covers related work concerning this research. Sec-
tion III describes the proposed AI2V++ model in detail.
In Section IV, we present an extensive evaluation of the
AI2V++ model and compare it to state-of-the-art alternatives
using multiple datasets. Finally, we summarize our findings
in Section V.
II. RELATED WORK
This section provides a review of previous studies that are
relevant to the present research. The literature review begins
with a broad classification of collaborative filtering algo-
rithms. Subsequently, deep learning approaches for recom-
mender systems are discussed, with special emphasis on
the use of neural attention in collaborative filtering. Finally,
the topic of explainable recommender systems is briefly
addressed.
A. COLLABORATIVE FILTERING
Collaborative filtering (CF) algorithms are used to model
users’ personalized preferences using historical user-item in-
teractions [1]. CF models are characterized as either explicit
e.g., item ratings, thumb up/down, or implicit e.g., clicks,
purchases, etc. Initial research, following the Netflix Prize
competition [10] focused mainly on explicit data. However,
due to the lack of explicit data in industrial applications,
research efforts shifted to focus on implicit models [11], [12].
Implicit user feedback can be challenging due to the am-
biguity of interpreting ‘non-observed’ interactions. Hence,
point-wise and pairwise methods were proposed to alleviate
this challenge and learn latent user representations. In pairwise methods, e.g., [13], a positive user-item interaction is contrasted with another item that the user did not interact with. In point-wise methods, e.g., [11], a positive user-item interaction is contrasted against all other items that the user did not interact with, sometimes via sampling.
A different approach to implicit feedback collaborative
filtering is to learn only item representations without user
representations i.e., implicit user learning. These methods
emphasize the importance of learning item-to-item semantics
rather than user-to-item predictions. For example, [14] pro-
posed learning item representations from implicit feedback in
a Euclidean space. The I2V model [15] is a popular method
for learning static item representations based on CF item co-
occurrences. The AI2V++ model in this paper belongs
to this category. It is inspired by I2V and adds to it the ability
to dynamically build implicit user representations based on
their items.
The I2V model is a well-known CF technique that utilizes
item co-occurrences to acquire hidden item representations.
I2V is based on the Skip-Gram with Negative Sampling
(SGNS) approach, which is also used in Mikolov et al.’s in-
fluential Word2Vec model to learn semantic word representa-
tions [16]. In recent years, several studies have demonstrated
the usefulness of incorporating neural attention mechanisms
to static word embeddings, enabling them to capture the dy-
namic qualities of words in relation to their context [17], [18].
Following this trend, the authors of this paper introduced
the AI2V recommendation model, which enhances the I2V
model by presenting a novel cross-attention mechanism that
modifies user representations in response to the item being
rated [19].
The model in this paper, dubbed AI2V++, enhances our
earlier conference presentation of the AI2V model [19]. Two
key modifications were made to the original AI2V algorithm.
First, we introduced ordinal information into the context-
target attention mechanism by hierarchically learning global
and personal ordinal biases. It should be noted that the
hierarchical mechanism integrated into AI2V++ represents a
novel and distinct approach compared to the typical method
of learning positional embeddings that is commonly em-
ployed in most transformers, as seen in previous works such as [17] and [20]. Second, we changed the categorical cross-
entropy loss in AI2V to a binary cross-entropy loss, which is
more suitable for multi-label problems, as there are typically
multiple appropriate items per user in recommender systems.
Our experimental results demonstrate that both modifications
significantly enhance the original AI2V model. In particular,
the incorporation of ordinal information enables AI2V++
to consider the order in which a user’s items were con-
sumed, resulting in a considerable performance improvement
over AI2V. Additionally, we illustrate how attentive score
analysis in AI2V++ can be utilized for explainability and
interpretability. The code for AI2V++ was made available
on GitHub¹ and is expected to be easily accessible for re-
searchers and practitioners alike.
B. DEEP LEARNING FOR COLLABORATIVE FILTERING
In recent years, numerous deep learning-based recommender
system models have been proposed, as documented in [21]–
[24]. A particular strand of research aims to substitute the
conventional inner-product operation found in Matrix Factor-
ization (MF) models with deep neural networks. For instance,
AutoRec [25] uses autoencoders to predict ratings, while
Neural Collaborative Filtering (NCF) [26] estimates user-
item interactions through Multi-Layer Perceptrons (MLP).
The value of this approach is currently being debated within
the research community, with some studies suggesting that
an inner product may suffice for CF tasks and that the added
complexity may be unnecessary [27], [28]. Although the
primary focus of this paper is not on this topic, the AI2V++
model does utilize neural scoring, which has been shown to
enhance predictions in four of the five evaluation datasets
analyzed.
Another line of work seeks to employ Graph Neural Net-
works (GNNs) for CF. An example of this is the Neural
Graph Collaborative Filtering (NGCF) model [29], which
uses the well-established GCN model [30], [31], originally
designed for graph classification, to perform CF. In contrast
to conventional approaches that train distinct embeddings for
¹https://github.com/kerengaiger/ai2v
users and items, NGCF learns a function that creates embed-
dings by collecting and aggregating features from a user’s
local neighborhood. NGCF has become a state-of-the-art
model for CF by achieving superior performance compared
to many notable models, including BPR [13], CMN [32],
HOP-Rec [33], PinSage [34], and GC-MC [35]. Recently,
LightGCN was introduced as a simplified version of NGCF that retains only the most relevant components of NGCF. LightGCN is simpler to implement and train, yet it has been shown to outperform NGCF [36].
A further line of research has focused on the integra-
tion of neural attention mechanisms. The present study falls
within this category. Attention mechanisms have emerged
as a crucial component in numerous deep learning mod-
els [37]. Specifically, self-attention and transformer models
have yielded exceptional outcomes in various Natural Language Processing (NLP) tasks, such as language translation
and understanding [17], [20]. The success of self-attention
models in NLP has triggered widespread adaptation of these
models for computer vision [38]–[40]. Moreover, the ver-
satility and scalability of transformer models have enabled
the processing of multiple modalities (e.g., text and images)
using similar processing blocks [41].
Incorporating attention mechanisms into recommender
system models has been an active area of research. Attention-
based recommendation models mostly utilize self-attention.
One such model is presented in [42], which uses two
transformer encoders to capture mobile user click behavior.
Another example is SASRec [43], which employs a self-
attention mechanism to represent each item in a user’s item
sequence and generates a user representation based on the
final attention block. The user representation is multiplied by
the target item embedding vector to produce an affinity score.
To emphasize the importance of the last item in the sequence,
SASRec also includes a residual connection between the
non-contextualized representation of the last item and the
final user representation. SASRec has been demonstrated to
outperform several popular algorithms, including BPR [13],
FPMC [44], TransRec [45], GRU4Rec [46], GRU4Rec+ [47]
and Caser [48]. Finally, Bert4Rec [49] is another model
for sequential recommendations which is closely related to
SASRec. Bert4Rec also employs self-attention but instead of
one-directional attention, it employs bi-directional attention
via the Cloze task [50].
AI2V++ differs from these models in several aspects:
First, in contrast to the above models, which are focused
on sequential recommendations, AI2V++ is a traditional CF
recommendation model that performs a different prediction
task and is evaluated using different datasets. Importantly,
its approach to user representation also sets AI2V++ apart.
AI2V++ assumes that users’ interests are dynamic and can
change in response to different target items. To capture
this, AI2V++ uses context-target attention to dynamically
adjust the user’s representation based on the presence of the
target item and does not employ self-attention, as do all the
aforementioned models. This approach is inspired by how the
human brain works, where different parts of memory become
relevant in different contexts.
In general, the AI2V++ model distinguishes itself from any transformer-based model in several ways. First, AI2V++
does not utilize self-attention and instead employs cross-
attention on the target item. Second, AI2V++ relies on cosine
similarity for its attention mechanism. Furthermore, AI2V++
employs a compound neural scoring function to compute the
similarity between the user representation and the target item,
rather than a simple inner product. Finally, while transformer-
based models leverage positional encoding to encode item
sequence information, AI2V++ uses a novel set of hierarchi-
cal ordinal bias scalars within the attention layer to learn the
relevant order of items in a user’s history.
He et al. introduced the Neural Attentive Item Similarity (NAIS) model for item-based CF [51]. Similar
to AI2V++, NAIS employs cross-attention in order to learn
the relative importance of the historical items in a user’s
profile with respect to the prediction. In that respect, NAIS is
arguably the most similar model to AI2V++. However, there
are several key differences between NAIS and the model
in this paper: (1) NAIS does not learn user representations.
Instead, it is an item-centric approach in which the predic-
tions are computed according to the relation of the target
item and each of the user’s historical items without ever
computing user representations explicitly. (2) NAIS mostly
uses the item embeddings directly, or in one of its versions
(design 3 in [51]), NAIS learns a single projection matrix
on the item embeddings in order to compute the attention
scores (Equation 7 in [51]). In contrast, AI2V++ employs
4 different types of projections on the item embeddings: 2
projections for the context items and 2 for the target items.
In each case, context or target, the first projection is used
in order to transfer the item into the attention scoring space
while the second projection is used in order to compute
the dynamic user representation and the target item repre-
sentation prior to the final user-item scoring function. By
employing different projections for the attention and for the
final prediction, AI2V++ is able to disconnect these two
distinct functions, which gives it much more flexibility. For
example, consider the case of items in the user history that
are in general very similar to the target item, yet their relative
importance (attention) with respect to the user’s taste is
marginal. (3) Another key difference is the fact that AI2V++
injects ordinal information into the context-target attention
mechanism, which as we show in Section IV-F, makes a dra-
matic contribution to the model’s accuracy. This is achieved
by a novel learnable mechanism of hierarchical global and
personal ordinal biases which help emphasize recent events
over older events. (4) While NAIS employs a single cross-
attention unit, AI2V++ employs multi-head attention which
gives it further descriptive power. (5) Last but not least, a
key advantage of AI2V++ stems from its ability to extract
intuitive explanations for its predictions. Explainable AI al-
gorithms and in particular explaining recommender systems
is an important open research question [52]. While it may be
possible to extract meaningful explanations from NAIS, the authors in [51] do not address the issue at all. In contrast, the current paper explains and demonstrates how AI2V++ provides interpretability over its predictions, which can be further harnessed for generating user-intuitive explanations.

Table 1: AI2V++ Notation Summary

$x$ : A user represented as a time-ordered sequence of items.
$x_{j-1}$ : A sub-user represented as the sequence of the first $j-1$ items.
$u_{l_i}$ : A context item representation for item $l_i$.
$v_{l_i}$ : A target item representation for item $l_i$.
$A_c$ : Context items projection matrix for the attention keys.
$B_c$ : Context items projection matrix for the attention values.
$A_t$ : Target item projection matrix for the attention query.
$B_t$ : Target item projection matrix for the final prediction head.
$a_{j-1}$ : An attentive representation of sub-user $x_{j-1}$.
$\alpha_{jm}$ : An attention score for item $l_m$ with respect to target item $l_j$.
$\lambda_i$ : A global ordinal bias for the positional importance of the $i$-th position (reverse order).
$\lambda^x_i$ : User $x$'s personal ordinal bias at position $i$.
$z_{j-1}$ : A multi-attentive representation for sub-user $x_{j-1}$.
$b_{l_j}$ : A popularity bias for item $l_j$.
C. EXPLAINABLE RECOMMENDER SYSTEMS
In recent years, the need for explainable AI has become a
topic of increasing interest and importance for both the re-
search community and industry, as evidenced by the growing
number of publications and regulations in the field [53]–[55].
For example, the European Union General Data Protection
Regulation determines that users have a basic “right to an
explanation” concerning algorithmic decisions based on their
personal information [55]. Similar regulations either exist or
are publicly proposed in other countries. Specifically, in the
context of recommender systems, the goal of explanations is
to provide justifications for recommendations in a way that
is understandable to human users [52]. Previous research has
shown that transparency and interpretability are crucial for
building trust in recommender systems and ensuring their
effectiveness [56]. In this regard, AI2V++ offers a distinct
advantage over many existing CF algorithms. By analyzing
AI2V++’s attention scores over the user’s historical items,
the model’s inner workings can be revealed, providing trans-
parency and interpretability that is often lacking in other CF
approaches that operate as a “black box” to end users.
III. THE MODEL
The AI2V++ model builds upon the I2V model [15] and
adds to it the ability to learn dynamically adaptive user
representations. Hence, we first formalize and provide a brief
overview of the Item2Vec (I2V) model which serves as the
basis for AI2V++. Then, we continue to describe the AI2V++
model in detail.
A. ITEM2VEC (I2V)
Let $\mathcal{I} = \{i\}_{i=1}^{M}$ be a set of $M$ item identifiers. For each item $i$, I2V learns latent context and target vectors $u_i, v_i \in \mathbb{R}^d$. These latent vectors are estimated via implicit factorization of the items' co-occurrence matrix. Specifically, the training data for I2V consists of a list $x = (l_1, \ldots, l_K)$ for each user of the historical items that were co-consumed by that user.
Without loss of generality, we consider a dataset of a single user $x$, where the extension to multiple users is straightforward and can be found in [15]. The objective of I2V is to learn item co-occurrences. This is achieved by minimizing the following loss function:

$$\mathcal{L}^{I2V}_x = -\sum_{i=1}^{K} \sum_{j<i} \log p(l_j \mid l_i), \quad (1)$$

with

$$p(l_j \mid l_i) = \sigma(s(l_i, l_j)) \prod_{k \in N} \sigma(-s(l_i, k)), \quad (2)$$

where $s(i, j) = u_i^T v_j$, $\sigma(x) = (1 + \exp(-x))^{-1}$, and $N \subset \mathcal{I}$ is a subset of items that are sampled from $\mathcal{I}$ according to the unigram item popularity distribution raised to the power of $0.5$. The items in $N$ are treated as negative context items with respect to the target (positive) item $l_j$.
In order to mitigate the popularity bias in common CF
datasets, I2V further applies a subsampling procedure in
which positive items are randomly discarded from users
according to their popularity. The amount of subsampling
is controlled by a hyperparameter that is adjusted w.r.t. the
dataset statistics as explained in [15].
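To make the subsampling step concrete, the following is a minimal Python sketch, assuming the word2vec-style discard rule with a threshold hyperparameter rho; the exact rule and hyperparameter used in [15] may differ.

import numpy as np

def subsample(user_items, item_freq, rho=1e-4, seed=0):
    """Randomly discard popular items; item_freq maps item id -> empirical frequency."""
    rng = np.random.default_rng(seed)
    kept = []
    for item in user_items:
        # popular items (high frequency) are discarded with higher probability
        p_discard = max(0.0, 1.0 - np.sqrt(rho / item_freq[item]))
        if rng.random() >= p_discard:
            kept.append(item)
    return kept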
During the training phase, I2V learns the sets of context and target vectors $U, V \subset \mathbb{R}^d$ by minimizing $\mathcal{L}^{I2V}_x$ using any stochastic gradient descent method. In the inference phase, the affinity between the context and target items $i$ and $j$ is based on the cosine similarity as follows:

$$\cos(u_i, v_j) = \frac{u_i^T v_j}{\|u_i\| \|v_j\|}. \quad (3)$$
The I2V model is commonly used in the recommender systems community, especially for learning item similarities from collaborative filtering data. In what follows, we keep the notations for the context and target vectors from above, i.e., $U, V \subset \mathbb{R}^d$.
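For illustration, the following is a minimal sketch of I2V scoring in Python: the negative-sampling probability of Eq. 2 during training and the cosine affinity of Eq. 3 at inference. The toy dimensions and random embeddings are placeholders; U and V follow the notation above.

import numpy as np

rng = np.random.default_rng(0)
M, d = 1000, 32                  # catalog size and embedding dimension
U = rng.normal(size=(M, d))      # context vectors u_i
V = rng.normal(size=(M, d))      # target vectors v_j

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_prob(i, j, negatives):
    """p(l_j | l_i) of Eq. 2 with a set of sampled negative items."""
    pos = sigmoid(U[i] @ V[j])
    neg = np.prod([sigmoid(-(U[i] @ V[k])) for k in negatives])
    return pos * neg

def cos_affinity(i, j):
    """Inference-time affinity between items i and j (Eq. 3)."""
    return (U[i] @ V[j]) / (np.linalg.norm(U[i]) * np.linalg.norm(V[j]))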
B. ATTENTIVE ITEM2VEC++ (AI2V++)
The AI2V++ model is designed to estimate the likelihood of a
user consuming a target item based on her past consumption
history. Compared to the I2V model, AI2V++ incorporates
several modifications. First, it utilizes a novel attention mech-
anism that allows personalization by selectively attending
to the user’s historical items in the context of the target
item. Second, AI2V++ introduces a hierarchical ordinal bias
scalar to the attention layer, which enables the model to
learn the relevant order of items within the user history.
Additionally, AI2V++ employs a neural scoring function to
compute the similarity between the user representation and
candidate items, instead of the dot-product function used in
I2V. Finally, the model incorporates target biases to address
popularity biases and avoid the need for subsampling as
used in I2V. The multi-attentive context-target mechanism
of AI2V++ is illustrated in Fig. 2. The details of the model are explained in a step-by-step manner, and the notations used in the model are summarized in Tab. 1 for ease of reference.

Figure 2: A schematic illustration of the AI2V++ model. First, the context and target item embeddings are multiplied by the learnable transformation matrices $A^i_c$ and $A^i_t$, respectively. Then, cosine similarity is calculated between the transformed target item and each of the transformed context items. The attention weights $\alpha^i_{mj}$ are given by the softmax operation applied to the sum of the cosine similarities, the global ordinal biases $\lambda_m$, and the user-personal ordinal biases $\lambda^x_m$. The sub-user representation $a^i_{j-1}$ is the sum of the transformed context items weighted by the attention weights. This calculation is repeated over $N$ attention heads, where $i$ denotes the attention head index. The final sub-user representation $z_{j-1}$ is given by concatenating $\{a^i_{j-1}\}_{i=1}^{N}$ and passing through $S$. In parallel, the target item representation is passed through $B_t$ and scored according to the neural scoring functions $\phi$ (Eq. 7) and $\omega$ (Eq. 8). Finally, the popularity bias $b_{l_j}$ of the target item is added to account for general popularity patterns.
1) A Dynamic User Representation
Consider a user with a list $x = (l_1, \ldots, l_k)$ of historical items which are ordered by the time of consumption. We denote a sub-user by $x_{j-1} = (l_1, \ldots, l_{j-1})$, $(j \leq k)$, i.e., a sub-user is simply a sub-sequence of the historical items consumed by the user. The attentive context-target mechanism produces an attentive sub-user representation for $x_{j-1}$ as follows:
$$a_{j-1} = \sum_{m=1}^{j-1} \alpha_{jm} B_c u_{l_m}, \quad (4)$$

where $u_{l_m} \in \mathbb{R}^d$ is the item representation for item $l_m$, $B_c \in \mathbb{R}^{d \times d}$ is a learnable linear mapping that maps the historical context item vectors to a new space, and $\alpha_{jm}$ are the attention weights. These attention weights are computed dynamically based on the target item $l_j$ according to:

$$\alpha_{jm} = \frac{\exp(\tau \cos(A_c u_{l_m}, A_t v_{l_j}) + \lambda_{j-m} + \lambda^x_{j-m})}{\sum_{n=1}^{j-1} \exp(\tau \cos(A_c u_{l_n}, A_t v_{l_j}) + \lambda_{j-n} + \lambda^x_{j-n})}, \quad (5)$$
where $v_{l_j} \in \mathbb{R}^d$ is the target item's vector, and $A_c, A_t \in \mathbb{R}^{d_\alpha \times d}$ are learnable linear mappings from the original context and target spaces to a $d_\alpha$-dimensional context and target attention space, respectively. The attention scores are based on the cosine similarity within this attention space. The hyperparameter $\tau$ controls the attention sphere's radius.

The AI2V++ model includes global and personal ordinal biases that incorporate temporal information: $\Lambda = \{\lambda_i\}_{i=1}^{C_{max}}$ are global (shared by all users) ordinal biases that learn the importance of each position in the context items sequence ($C_{max}$ is the maximal sequence length of a sub-user), and $\Lambda^x = \{\lambda^x_i\}_{i=1}^{C_{max}}$ are the personal ordinal biases for user $x$ that enable a personalized correction to the global ordinal biases $\Lambda$. $\Lambda^x$ allows for a subsequent layer of personalization that is instrumental for users who do not conform to the global temporal trend captured by the global ordinal biases $\Lambda$.
During optimization, we employ L2 regularization on $\Lambda^x$ such that their effect is pronounced only if the specific user behavior justifies a different pattern than the global trend learned by $\Lambda$ (see Eq. 10). In the evaluation of the model (Sec. IV), we show that the addition of the global and personal ordinal biases, which were not included in the conference presentation of AI2V [19], significantly contributes to the model's accuracy.
Finally, the attentive sub-user representation $a_{j-1}$ (Eq. 4) is a convex combination of her historical item representations. $a_{j-1}$ is dynamic and depends on the item $l_j$: the importance, and hence the weight, of each historical item $l_m$ in $x_{j-1}$ is governed by its affinity $\alpha_{jm}$ to the target item $l_j$. In other words, if $l_j$ changes, $a_{j-1}$ changes as well. This mechanism tries to mimic how the human mind works: when considering different items, different memories come to the foreground and influence our opinion.
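The following is a minimal NumPy sketch of Eqs. 4-5 for a single attention head. The parameter shapes follow the notation above (with $A_c$ and $A_t$ applied as matrix-vector products into the $d_\alpha$-dimensional attention space); lam_g and lam_p are hypothetical names for the global and personal ordinal biases, indexed in reverse consumption order.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_sub_user(U_hist, v_t, A_c, A_t, B_c, lam_g, lam_p, tau=1.0):
    """U_hist: (j-1, d) context vectors of the user's history, oldest first.
    v_t: (d,) target item vector. A_c, A_t: (d_alpha, d). B_c: (d, d).
    lam_g, lam_p: ordinal bias vectors with at least j-1 entries each."""
    keys = U_hist @ A_c.T                      # project history into the attention space
    query = A_t @ v_t                          # project the target item (the query)
    cos = (keys @ query) / (
        np.linalg.norm(keys, axis=1) * np.linalg.norm(query) + 1e-12)
    pos = np.arange(len(U_hist))[::-1]         # reverse-order position of each item
    alpha = softmax(tau * cos + lam_g[pos] + lam_p[pos])   # Eq. 5
    a = alpha @ (U_hist @ B_c.T)               # Eq. 4: attentive sub-user vector
    return a, alpha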
2) A Multi-Attentive User Representation
AI2V++ applies the above attentive context-target mech-
anism multiple times in order to learn several user rep-
resentations in parallel. These representations are then
aggregated to form a multi-attentive user representation.
Specifically, we propose to learn $N$ context-target attention mechanisms in parallel. Each attention mechanism is associated with a different set of learnable parameters $\{A^i_c, A^i_t, B^i_c, \Lambda^i, \{\Lambda^{x,i}\}_{x=1}^{X}\}_{i=1}^{N}$, where $X$ is the number of users. This process produces $N$ attentive context-target representations for each user, $\{a^i_{j-1}\}_{i=1}^{N}$, according to Eq. 4. Then, the final multi-attentive user representation $z_{j-1}$ is given by:

$$z_{j-1} = S w_{j-1}, \quad (6)$$
where $w_{j-1} = [(a^1_{j-1})^T, \ldots, (a^N_{j-1})^T]^T$ is a stacked vector based on the $N$ attentive context-target representations, and $S \in \mathbb{R}^{d \times Nd}$ is a learnable linear mapping that transforms the multiple user representations of the sub-user back to the original dimension. This allows AI2V++ to learn various types of attention functions and aggregate the information extracted by each attention function into the final multi-attentive user representation.
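Continuing the sketch above, the multi-head aggregation of Eq. 6 concatenates the per-head outputs and projects them back to dimension $d$; heads is a hypothetical list of per-head parameter tuples.

import numpy as np

def multi_attentive_user(U_hist, v_t, heads, S):
    """heads: list of N tuples (A_c, A_t, B_c, lam_g, lam_p); S: (d, N*d)."""
    reps = [attentive_sub_user(U_hist, v_t, *h)[0] for h in heads]
    w = np.concatenate(reps)    # stacked vector of the N head outputs
    return S @ w                # Eq. 6: final multi-attentive representation z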
3) The AI2V++ Similarity Function
The multi-attentive user representation $z_{j-1}$ from Eq. 6 encodes the output from the multiple attentive context-target units. AI2V++ computes the affinity between a user and a target item by applying a neural scoring function $\phi : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ as follows:

$$\phi(u, v) = W_1 \mathrm{ReLU}(W_0([u, v, u \odot v, |u - v|])), \quad (7)$$

where $\odot$ denotes the Hadamard product, and $W_0 \in \mathbb{R}^{d \times 4d}$ and $W_1 \in \mathbb{R}^{1 \times d}$ are learnable linear mappings (matrices). It is a neural network with a single ReLU-activated hidden layer and a scalar output. According to our experiments, this scoring function, inspired by [57], outperformed the use of the dot-product as the similarity function. The final score of sub-user $x_{j-1}$ and the target item $l_j$ is given by:

$$\omega(x_{j-1}, l_j) = \phi(z_{j-1}, B_t v_{l_j}) + b_{l_j}, \quad (8)$$

where $\phi$ is the neural scoring function from Eq. 7, $B_t \in \mathbb{R}^{d \times d}$ is a learnable linear mapping, and $b_{l_j}$ is a popularity bias for the target item $l_j$.
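A minimal sketch of the scoring head of Eqs. 7-8: the feature vector concatenates the two representations, their Hadamard product, and their element-wise absolute difference, and a single ReLU-activated hidden layer maps it to a scalar. The weight shapes are as defined above.

import numpy as np

def phi(u, v, W0, W1):
    """Eq. 7. W0: (d, 4d), W1: (1, d); returns a scalar affinity score."""
    feats = np.concatenate([u, v, u * v, np.abs(u - v)])
    hidden = np.maximum(W0 @ feats, 0.0)       # single ReLU hidden layer
    return float(W1 @ hidden)

def omega(z, v_t, B_t, W0, W1, b_t):
    """Eq. 8: score of the sub-user representation z against a target item."""
    return phi(z, B_t @ v_t, W0, W1) + b_t     # b_t is the item popularity bias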
4) The Loss Function
Our goal is to compute the probability of the item $l_j$ given the historical items in $x_{j-1}$, i.e., $p(l_j \mid x_{j-1})$. To this end, AI2V++ models $p(l_j \mid x_{j-1})$ according to:

$$p(l_j \mid x_{j-1}) = \sigma(\omega(x_{j-1}, l_j)) \prod_{k \in N} \sigma(-\omega(x_{j-1}, k)), \quad (9)$$

where $\omega(\cdot,\cdot)$ is the AI2V++ score function from Eq. 8, and $N$ is defined in the same manner as in Eq. 2. Finally, the AI2V++ loss for a user $x$ is given by:

$$\mathcal{L}_x = -\sum_{j=2}^{K} \log p(l_j \mid x_{j-1}) + \gamma \sum_{i=1}^{N} \|\Lambda^{x,i}\|_2^2, \quad (10)$$

where $\gamma$ is a hyperparameter that controls the regularization of the personal ordinal biases (correction) for user $x$.

The optimization proceeds with stochastic gradient descent on the BCE loss with negative sampling. At the inference phase, the similarity between a user $x$ and a target item $k$ is computed by $\omega(x, k)$.
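For reference, the per-user loss of Eqs. 9-10 can be sketched as follows, where score_pos is $\omega(x_{j-1}, l_j)$, score_negs are the scores of the sampled negatives, and personal_biases collects the user's ordinal bias vectors (hypothetical argument names).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def step_loss(score_pos, score_negs):
    """-log p(l_j | x_{j-1}) of Eq. 9, expanded into BCE terms."""
    loss = -np.log(sigmoid(score_pos) + 1e-12)
    loss -= np.sum(np.log(sigmoid(-np.asarray(score_negs)) + 1e-12))
    return loss

def user_loss(step_losses, personal_biases, gamma):
    """Eq. 10: sum of per-step losses plus L2 regularization on the personal biases."""
    reg = gamma * sum(np.sum(lam ** 2) for lam in personal_biases)
    return sum(step_losses) + reg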
IV. EVALUATIONS
In this section, we describe the experimental setup and the
results of our evaluations.
A. EXPERIMENTAL SETUP
Training, validation, and test sets were generated using the leave-one-out approach, i.e., for a user with $K$ items, we allocated the $K$-th item (the last item) for the test set, and the item before it (item $K-1$) for the validation set. The rest of the items (items $1$ through $K-2$) were used for the training set. Since not all sub-user sequences are of the same length, we fixed a window size based on the longest sequence and padded shorter sequences at their beginning accordingly.
Furthermore, when training the model, we set the attention weights of those padded positions to zero so they would not affect the sub-user representation. The prediction of the hidden item for each user was done by applying the scoring function $\omega$ from Eq. 8 to all candidate items and returning the item with the maximal score.
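A minimal sketch of the leave-one-out split and left-padding described above; PAD is a hypothetical sentinel id for padded positions, whose attention weights are masked to zero during training.

PAD = -1  # sentinel id for padded positions (masked out in the attention)

def leave_one_out(items):
    """items: a user's time-ordered item list with at least 3 entries."""
    return items[:-2], items[-2], items[-1]   # train, validation, test

def left_pad(seq, window):
    """Pad a sequence at its beginning up to a fixed window size."""
    return [PAD] * (window - len(seq)) + list(seq)

train, val, test = leave_one_out([3, 8, 1, 9, 4])
print(left_pad(train, 6))   # [-1, -1, -1, 3, 8, 1]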
B. DATASETS
We consider several datasets from different domains. Each of the datasets consists of the following fields: user ID, item ID, rating, and timestamp. In each dataset, we filtered out users with fewer than 4 items or more than 1,000 items. AI2V++ is designed to employ implicit ratings; hence, explicit numerical ratings were first scaled to a 5-star rating scale, and then ratings of 4 stars and above were considered as positive examples (a preprocessing sketch appears after the dataset list below). The following datasets were considered:
- MovieLens 1M: The MovieLens-1M database [58] has been widely used to evaluate collaborative filtering algorithms [59]. It consists of 1 million ratings from 6,040 users to 3,883 movies.
- Moviesdat: The Moviesdat dataset [60] consists of 26 million ratings from 270,000 users to 45,000 movies.
- Netflix: The Netflix dataset [10] consists of more than 100 million ratings by 480,189 users to 17,770 movies.
- Yahoo! Music: From the Yahoo! Music dataset [12] we sampled 19,989 users with 30,000 items and around 3.36 million user ratings.
- Amazon Books: The Amazon Books dataset [61] is based on book reviews crawled from amazon.com. This dataset consists of 22.5 million reviews given by 8.9 million users to 2.37 million books.
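The filtering and implicit-feedback conversion described before the list can be sketched as follows, assuming a pandas frame with hypothetical column names user, item, rating, and timestamp.

import pandas as pd

def to_implicit(df, min_items=4, max_items=1000):
    """Scale explicit ratings to a 5-star scale, threshold at 4 stars, filter users."""
    df = df.copy()
    df["rating"] = df["rating"] * 5.0 / df["rating"].max()
    df = df[df["rating"] >= 4.0]                      # keep positive examples only
    counts = df.groupby("user")["item"].transform("count")
    df = df[(counts >= min_items) & (counts <= max_items)]
    return df.sort_values(["user", "timestamp"])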
Before describing the evaluation process, we briefly in-
vestigate some relevant statistical properties of the datasets
above. We measure the sparsity of the datasets by calcu-
lating the percentage of user-item pairs without ratings out
of the entire user × item ratings matrix. As expected for
collaborative filtering datasets, the sparsity level was high in
all cases. In particular, rating sparsity in the Moviesdat and
Amazon datasets is significantly higher. Table 2 summarizes
these statistical properties for each dataset. In addition, Fig. 3
presents the popularity distribution of the different datasets
used in this research on a logarithmic scale. As can be seen,
all datasets present very skewed distributions. However, the
items in the Yahoo! Music dataset suffer from a much higher
degree of popularity skew.
Table 2: Dataset statistics.

Database     | Users  | Items  | Ratings   | Sparsity %
MovieLens-1M | 5,765  | 1,865  | 220,311   | 97.95
Netflix      | 10,677 | 2,121  | 396,316   | 97.92
Yahoo!       | 19,151 | 17,711 | 85,147    | 98.25
Moviesdat    | 11,142 | 2,949  | 1,670,318 | 99.69
Amazon Books | 31,202 | 2,111  | 305,156   | 99.54
C. EVALUATION METRICS
The quantitative measurements in this paper follow [59], [62]
and cover the following metrics:
- Hit Ratio at K (HR@K): The percentage of the predictions made by the model where the positive test item was found in the top $K$ items suggested by the model. Formally, a test-set tuple $(x_{1:t-1}, l_t)$ is scored '1' if the test item $l_t$ was ranked in the top $K$ recommendations produced by the model w.r.t. the user $x_{1:t-1}$ and '0' otherwise:

$$\mathrm{HR@K} = \begin{cases} 1, & \text{if } pos_t \leq K \\ 0, & \text{otherwise,} \end{cases}$$

where $pos_t$ is the position of the test item in a ranked list of all items. We report the mean HR@K for all the users in the dataset. Note that, unlike the other metrics, the HR@K measure ignores the exact position of the test item as long as it appears in the top $K$.
- Mean Reciprocal Rank at K (MRR@K): This metric reports the average Reciprocal Rank at $K$ (RR@K), where the reciprocal rank is set to zero if the target item does not appear in the top $K$ recommendations:

$$\mathrm{RR@K} = \begin{cases} \frac{1}{pos_t}, & \text{if } pos_t \leq K \\ 0, & \text{otherwise,} \end{cases}$$

where $pos_t$ is the position of the target item within the ranked list of items for the user. We report the mean RR@K for all the users in the dataset.
- Normalized Discounted Cumulative Gain at K (NDCG@K): This metric reports the Discounted Cumulative Gain at $K$ (DCG@K) normalized by the Ideal Discounted Cumulative Gain at $K$ (IDCG@K), which is achieved by the optimal ranking order. In our case, since there is only one test item, the NDCG@K is calculated as follows:

$$\mathrm{NDCG@K} = \begin{cases} \frac{1}{\log_2(pos_t + 1)}, & \text{if } pos_t \leq K \\ 0, & \text{otherwise,} \end{cases}$$

where $pos_t$ is the position of the test item. We report the mean NDCG@K for all the users in the dataset.
- Mean Percentage Rank (MPR): This metric is a recall-oriented metric that is used to measure the average user satisfaction with items in an ordered list. MPR considers the entire list of ranked items (not just the top $K$). The percentile rank is defined as follows:

$$PR = \frac{pos_t}{N}, \quad (11)$$

where $pos_t$ is the position of the target item within the ordered list of all items for the user, and $N$ is the number of items in the catalog. The Mean Percentage Rank is the mean $PR_u$ for all the users in the dataset. Note that,
unlike the previous metrics, lower values of MPR are more desirable, as they indicate that the test item was ranked closer to the top of the recommendation lists.

Figure 3: Item popularity skew for the different datasets, from the most popular item (first percentile on the left) to the least popular item (on the right). We see that all datasets suffer from a strong popularity skew and a long tail of less popular items. In the Yahoo! Music dataset this phenomenon is even more prominent.
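All four metrics are simple functions of the 1-based rank pos_t of the held-out test item, as the following sketch shows; averaging over users yields the reported values.

import numpy as np

def rank_metrics(pos_t, K, n_items):
    """pos_t: 1-based rank of the test item. Returns HR@K, RR@K, NDCG@K, PR."""
    hit = 1.0 if pos_t <= K else 0.0
    rr = 1.0 / pos_t if pos_t <= K else 0.0
    ndcg = 1.0 / np.log2(pos_t + 1) if pos_t <= K else 0.0
    pr = pos_t / n_items            # Eq. 11; lower is better
    return hit, rr, ndcg, pr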
D. BASELINES
The following baselines were considered for evaluation:
- Popularity (POP): This simple baseline ranks the items based on their popularity and recommends the most popular items to all users. While this approach lacks personalization, it was shown to perform well on many collaborative filtering tasks [59].
- Item2Vec (I2V): The Item2Vec (I2V) model is an item-based collaborative filtering model that gained much popularity in recent years [15]. As explained earlier, AI2V++ generalizes I2V by incorporating a neural attention mechanism in order to dynamically generate a user representation. As such, I2V serves as an ablated version of AI2V++ that showcases the contribution of AI2V++'s improvements.
- Neural Collaborative Filtering (NCF): Neural Collaborative Filtering (NCF) [26] employs Generalized Matrix Factorization (GMF) [63] with a multi-layer perceptron (MLP) for the user-item interaction function. The NCF model showed significant improvements over many well-known state-of-the-art methods such as ItemKNN [64], BPR [13], eALS [65], and WMF [11].
- LightGCN: LightGCN [36] is the leading graph-based model for collaborative filtering [30], [31]. The model learns user and item embeddings by linearly propagating them on the user-item interaction graph, and a weighted sum of the embeddings from all layers is used as the final user representation. LightGCN is an improvement over NGCF [29], which was shown to outperform many previous models such as the graph-based GC-MC [35] and PinSage [34], neural network-based models such as NCF [26] and CMN [32], and factorization-based models such as BPR [13] and HOP-Rec [33]. In addition, in [36], LightGCN was shown to outperform Mult-VAE [66] and GRMF [67].
- SASRec: Self-attentive sequential recommendation (SASRec) [43] incorporates a self-attention mechanism to utilize user 'context' activities based on actions they have performed recently. SASRec was shown to outperform strong baselines such as GRU4Rec [46], GRU4Rec+ [47], and Caser [68].
- NAIS: The Neural Attentive Item Similarity (NAIS) model for item-based CF from [51]. Similar to AI2V++, NAIS employs cross-attention based on the target item but differs from AI2V++ in multiple aspects, as described in Section II.
- AI2V++: The model presented in this paper.
- AI2V-Vanilla: This is the basic version of AI2V from our conference presentation [19]. The differences between AI2V and AI2V++ were covered in Section II. AI2V-Vanilla is an ablated version that showcases the relative improvements made in AI2V++.
- AI2V-Pos: An ablated version of AI2V++ which is based on AI2V-Vanilla plus the hierarchical ordinal biases (from Sec. III-B). This version showcases the specific contribution of the ordinal biases to AI2V++.
- AI2V++ dot: An ablated version of AI2V++ in which we replaced the neural scoring function from Eq. 7 with the dot-product. This ablated version showcases the relative contribution of AI2V++'s neural
scoring function.
E. HYPER-PARAMETERS AND CONFIGURATION
Hyper-parameter tuning was performed on all baselines
and ablated versions using the Optuna optimization frame-
work [69] on the validation set. Specifically, AI2V++’s
hyper-parameters used in this paper are as follows: The
dimensions of the attention layer weights, At, Ac, and Bc,
were set to 50. The parameter controlling the radius sphere in
the attention weights calculation, τ, was set to 1. The number
of negative items that were sampled for each positive target
item was set to 7. The model was trained using Adagrad [70] with a mini-batch of 32 samples until the validation score stopped improving (i.e., until the onset of overfitting). Then, the model with the best validation score was chosen for evaluation.
The following hyper-parameters were optimized separately
for each dataset (using the validation set): learning rate,
embedding size, and the number of attention heads.
F. RESULTS
The current study presents a comprehensive evaluation of
the AI2V++ model. The evaluation is divided into three
main sections. First, in Sec. IV-F1, we report the quantitative accuracy results for various models, datasets, and evaluation metrics, and analyze the results based on item popularity. Second, in Sec. IV-F2, we provide a detailed investigation of
the role of ordinal biases in the AI2V++ model. Finally, in
Sec. IV-F3, we illustrate the dynamic construction of user
representations by AI2V++ and demonstrate how attention
scores can be utilized to provide model interpretability and
explainability.
1) Competitive Results
Tables 3-7 summarize the extensive evaluations on the MovieLens-1M, Netflix, Yahoo!, Amazon Books, and Moviesdat datasets, respectively (p ≤ 0.05). The superiority of the AI2V++ model is clearly noticeable across the different datasets and metrics. In all cases, the ordinal biases, which are part of the contributions made in the current paper, significantly improve the results over AI2V-Vanilla. Additionally, in most cases, employing the BCE loss, which is another contribution of the current paper, yields better results than the original CCE loss.
Next, we turn to analyze the results as a function of item
popularity. It has been shown that in collaborative filtering
problems, much of the signal lies in simple popularity bi-
ases [71]. For example, the winning model in the Netflix Prize competition [10] managed to explain 42.6% of the ratings' variance, i.e., $R^2 = 42.6\%$, but the vast majority of the learned signal was attributed to popularity biases, which explained a whopping $R^2 = 32.5\%$ of the variance (without any personalization) [72].
Following this insight, we wish to investigate the model’s
results as the effect of popular items is artificially dimin-
ished. To this end, we gradually remove popular items from
the dataset and evaluate the results on the remaining, less
popular, items. Figures 4a-4e depict the HR@20 metric (y-
axis) after removing the most popular items (x-axis) for each
dataset. We see that the HR metric monotonically decreases
as more popular items are removed. This is expected since
the popular items are easier to predict. Importantly, we notice
a much milder decrease in the AI2V++ variants compared to
the baselines. By being able to dynamically focus on different
items in the user’s history, AI2V++ can turn its focus to any
specific item in the user’s history, even if that item does not
agree with the general or recent user’s taste. This gives the
AI2V++ variants an advantage in recommending long-tail
items, a highly important property for recommender systems
[59], [73]. In contrast, when the user representation is static,
it tends to be more focused on the popular items and performs
poorly in the presence of less popular items. Note that the
moderate decrease in the Moviesdat and Amazon Books
datasets is attributed to the higher inherent sparsity of these
two datasets compared to the other ones (as seen in Table 2),
however, the general trend remains - the AI2V++ variants
maintain superiority in the long-tail.
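The popularity-ablation curve can be sketched as follows. For simplicity, the sketch filters out test cases whose target item is among the n most popular items and recomputes HR@20 over the rest, without re-ranking; test_items, ranks, and pop_order are hypothetical inputs.

import numpy as np

def popularity_ablation(test_items, ranks, pop_order, step=10, max_removed=100):
    """test_items: array of test item ids; ranks: their 1-based predicted ranks;
    pop_order: item ids sorted from most to least popular."""
    ranks = np.asarray(ranks)
    curve = []
    for n in range(step, max_removed + step, step):
        removed = set(pop_order[:n])
        keep = np.array([t not in removed for t in test_items])
        hr20 = float(np.mean(ranks[keep] <= 20)) if keep.any() else float("nan")
        curve.append((n, hr20))
    return curve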
2) A Deeper Analysis into Ordinal Biases
From Tables 3-7 we learn that the additional ordinal biases in
AI2V++ are responsible for a significant improvement in the
model’s accuracy. This result is expected as user preferences
are known to be drifting over time and temporal dynamics
need to be addressed [74]. As explained earlier, the addition
of ordinal biases enables the AI2V++ model to be aware of
the order in which the user consumed the items. The global
ordinal biases $\Lambda = \{\lambda_i\}_{i=1}^{C_{max}}$ attribute different learnable weights according to the consumption order. In addition, the personal ordinal biases $\Lambda^x = \{\lambda^x_i\}_{i=1}^{C_{max}}$ enable a per-user personalized correction to the global ordinal biases $\Lambda$.
Considering the notable contribution of this component,
we wish to better understand the effect of the ordinal biases in
the AI2V++ model. Fig. 5 depicts the learned global ordinal
bias values of the last twenty items in a user’s sequence. In
order to present model parameters from different model in-
stances (i.e., different datasets) on a single scale, the weights
were normalized by the weight of the highest bias value. A
common trend in all datasets is the gradual decrease in items’
importance where the most recent items are emphasized over
less recent ones.
By comparing the trends of different datasets, further inter-
esting insights can be extracted: For example, we learn that
in the Yahoo! Music dataset, the decrease in item importance
is considerably more significant than in the other datasets.
This implies that in this dataset the non-stationary temporal
trends are of higher importance, in accordance with a deeper
analysis performed on this dataset for KDD-Cup’11 [72]. In
contrast, the non-stationary temporal effects on the Netflix
dataset, while evidently significant, are of less importance
than in other datasets.
Table 3: Evaluation results on the MovieLens-1M dataset.

Model        | HR@5  | HR@10 | HR@20 | MRR@5 | MRR@10 | MRR@20 | NDCG@5 | NDCG@10 | NDCG@20 | MPR
AI2V-Vanilla | 0.039 | 0.076 | 0.132 | 0.019 | 0.024  | 0.028  | 0.024  | 0.036   | 0.050   | 0.136
AI2V-Pos     | 0.081 | 0.133 | 0.216 | 0.041 | 0.047  | 0.053  | 0.051  | 0.067   | 0.088   | 0.125
AI2V++       | 0.083 | 0.138 | 0.216 | 0.045 | 0.052  | 0.057  | 0.054  | 0.072   | 0.091   | 0.127
AI2V++ dot   | 0.078 | 0.136 | 0.220 | 0.040 | 0.048  | 0.053  | 0.049  | 0.068   | 0.089   | 0.124
I2V          | 0.028 | 0.052 | 0.097 | 0.013 | 0.016  | 0.019  | 0.017  | 0.025   | 0.036   | 0.162
POP          | 0.017 | 0.046 | 0.083 | 0.009 | 0.012  | 0.015  | 0.011  | 0.020   | 0.029   | 0.239
NCF          | 0.030 | 0.055 | 0.099 | 0.014 | 0.017  | 0.020  | 0.018  | 0.026   | 0.037   | 0.173
SASRec       | 0.056 | 0.109 | 0.185 | 0.025 | 0.030  | 0.035  | 0.031  | 0.048   | 0.067   | 0.129
LightGCN     | 0.025 | 0.053 | 0.096 | 0.013 | 0.016  | 0.019  | 0.016  | 0.024   | 0.036   | 0.240
NAIS         | 0.038 | 0.067 | 0.116 | 0.018 | 0.022  | 0.026  | 0.023  | 0.033   | 0.045   | 0.199
Table 4: Evaluation results on the Netflix dataset.

Model        | HR@5  | HR@10 | HR@20 | MRR@5 | MRR@10 | MRR@20 | NDCG@5 | NDCG@10 | NDCG@20 | MPR
AI2V-Vanilla | 0.118 | 0.182 | 0.266 | 0.063 | 0.071  | 0.077  | 0.077  | 0.097   | 0.118   | 0.115
AI2V-Pos     | 0.158 | 0.227 | 0.319 | 0.094 | 0.103  | 0.109  | 0.110  | 0.132   | 0.155   | 0.093
AI2V++       | 0.168 | 0.236 | 0.328 | 0.104 | 0.114  | 0.120  | 0.120  | 0.142   | 0.165   | 0.095
AI2V++ dot   | 0.141 | 0.205 | 0.292 | 0.084 | 0.093  | 0.099  | 0.098  | 0.119   | 0.141   | 0.105
I2V          | 0.136 | 0.165 | 0.236 | 0.088 | 0.092  | 0.096  | 0.100  | 0.109   | 0.127   | 0.125
POP          | 0.092 | 0.107 | 0.122 | 0.062 | 0.069  | 0.074  | 0.065  | 0.069   | 0.072   | 0.360
NCF          | 0.117 | 0.130 | 0.156 | 0.074 | 0.075  | 0.077  | 0.084  | 0.089   | 0.095   | 0.216
SASRec       | 0.108 | 0.121 | 0.141 | 0.070 | 0.078  | 0.084  | 0.081  | 0.087   | 0.096   | 0.267
LightGCN     | 0.104 | 0.123 | 0.147 | 0.072 | 0.074  | 0.076  | 0.080  | 0.086   | 0.092   | 0.232
NAIS         | 0.097 | 0.132 | 0.201 | 0.071 | 0.075  | 0.080  | 0.077  | 0.088   | 0.106   | 0.162
Table 5: Evaluation results on the Yahoo! Music dataset.

Model        | HR@5  | HR@10 | HR@20 | MRR@5 | MRR@10 | MRR@20 | NDCG@5 | NDCG@10 | NDCG@20 | MPR
AI2V-Vanilla | 0.116 | 0.186 | 0.274 | 0.059 | 0.068  | 0.074  | 0.073  | 0.095   | 0.118   | 0.043
AI2V-Pos     | 0.200 | 0.298 | 0.395 | 0.110 | 0.123  | 0.130  | 0.132  | 0.164   | 0.189   | 0.035
AI2V++       | 0.240 | 0.330 | 0.419 | 0.134 | 0.146  | 0.152  | 0.160  | 0.189   | 0.212   | 0.043
AI2V++ dot   | 0.231 | 0.329 | 0.426 | 0.121 | 0.134  | 0.141  | 0.148  | 0.180   | 0.204   | 0.036
I2V          | 0.072 | 0.122 | 0.199 | 0.037 | 0.044  | 0.049  | 0.045  | 0.062   | 0.081   | 0.053
POP          | 0.016 | 0.037 | 0.058 | 0.006 | 0.009  | 0.010  | 0.008  | 0.015   | 0.020   | 0.138
NCF          | 0.073 | 0.121 | 0.219 | 0.034 | 0.036  | 0.039  | 0.043  | 0.052   | 0.071   | 0.071
SASRec       | 0.080 | 0.141 | 0.226 | 0.036 | 0.044  | 0.05   | 0.047  | 0.066   | 0.088   | 0.043
LightGCN     | 0.082 | 0.149 | 0.241 | 0.036 | 0.047  | 0.053  | 0.051  | 0.069   | 0.090   | 0.075
NAIS         | 0.095 | 0.153 | 0.229 | 0.047 | 0.055  | 0.060  | 0.059  | 0.078   | 0.097   | 0.079
Table 6: Evaluation results on the Amazon Books dataset.

Model        | HR@5  | HR@10 | HR@20 | MRR@5 | MRR@10 | MRR@20 | NDCG@5 | NDCG@10 | NDCG@20 | MPR
AI2V-Vanilla | 0.129 | 0.190 | 0.274 | 0.070 | 0.078  | 0.084  | 0.048  | 0.104   | 0.125   | 0.102
AI2V-Pos     | 0.151 | 0.214 | 0.297 | 0.088 | 0.097  | 0.102  | 0.104  | 0.124   | 0.145   | 0.101
AI2V++       | 0.161 | 0.222 | 0.302 | 0.099 | 0.107  | 0.113  | 0.115  | 0.134   | 0.154   | 0.108
AI2V++ dot   | 0.172 | 0.235 | 0.312 | 0.108 | 0.117  | 0.122  | 0.124  | 0.144   | 0.164   | 0.102
I2V          | 0.078 | 0.132 | 0.207 | 0.038 | 0.045  | 0.050  | 0.048  | 0.065   | 0.084   | 0.114
POP          | 0.005 | 0.019 | 0.030 | 0.003 | 0.005  | 0.005  | 0.003  | 0.008   | 0.011   | 0.416
NCF          | 0.028 | 0.051 | 0.089 | 0.013 | 0.016  | 0.018  | 0.016  | 0.024   | 0.033   | 0.196
SASRec       | 0.081 | 0.142 | 0.084 | 0.040 | 0.048  | 0.053  | 0.051  | 0.068   | 0.087   | 0.121
LightGCN     | 0.069 | 0.108 | 0.159 | 0.038 | 0.046  | 0.053  | 0.046  | 0.058   | 0.071   | 0.138
NAIS         | 0.126 | 0.181 | 0.243 | 0.069 | 0.076  | 0.080  | 0.083  | 0.100   | 0.116   | 0.134
Figure 4: Accuracy on less popular items, per dataset: (a) MovieLens-1M, (b) Netflix, (c) Yahoo! Music, (d) Amazon Books, (e) Moviesdat. This analysis measures the HR@20 when removing the most popular items from the test set. Initially (on the left), the 10 most popular items are removed. Then, at each point along the x-axis, the next 10 most popular items are also removed, until all of the top 100 popular items are removed. This analysis reveals how the different models cope once the most popular items, which skew the dataset, are excluded.
Table 7: Evaluation results on the Moviesdat dataset.
Model HR@5 HR@10 HR@20 MRR@5 MRR@10 MRR@20 NDCG@5 NDCG@10 NDCG@20 MPR
AI2V-Vanilla 0.036 0.059 0.095 0.017 0.020 0.023 0.022 0.029 0.038 0.146
AI2V-Pos 0.051 0.087 0.137 0.025 0.030 0.034 0.032 0.043 0.056 0.123
AI2V++ 0.061 0.102 0.164 0.031 0.036 0.040 0.038 0.051 0.067 0.118
AI2V++ dot 0.051 0.088 0.140 0.025 0.030 0.034 0.032 0.043 0.056 0.127
I2V 0.029 0.041 0.070 0.014 0.017 0.019 0.016 0.020 0.025 0.189
POP 0.004 0.009 0.020 0.002 0.002 0.003 0.002 0.004 0.007 0.332
NCF 0.028 0.039 0.068 0.013 0.015 0.017 0.014 0.018 0.023 0.208
SASREC 0.036 0.056 0.088 0.012 0.018 0.021 0.017 0.024 0.035 0.149
LightGCN 0.032 0.045 0.074 0.015 0.017 0.021 0.019 0.023 0.029 0.182
NAIS 0.005 0.010 0.019 0.003 0.003 0.004 0.003 0.005 0.007 0.109
Figure 5: Ordinal bias weights of the last twenty items in the user’s sequence. The items are ordered according to their
consumption order. Each element in this vector corresponds to a certain position of an item in the user’s sequence. In order to
enable the comparison between different datasets on a single plot, we normalized the bias values of each model by the maximal
bias value in that model.
3) Interpretability with AI2V++
In what follows, we present an attention score analysis that
demonstrates interpretability by exposing the inner workings
of the model. Figure 6 presents visualizations of the attention
weights for users from MovieLens-1M. The attention scores
are calculated when scoring the last movie in the sequence
(the test item) for an AI2V++ model with a single attention
head. The first example in Fig. 6 relates to the movie “Clear
and Present Danger (1994)”. Of the user’s train items, the
highest score was given to the movie “Patriot Games (1992)”.
Both movies are action thriller films, based on books by
Tom Clancy (a novelist), with Harrison Ford (an actor) as
Jack Ryan (the lead character). In fact, “Clear and Present
Danger (1994)” is a sequel to “Patriot Games (1992)”. Other
movies with high scores such as “In the Line of Fire (1993)”,
“The Fugitive (1993)”, and “The Hunt for Red October (1990)”
are all related action thrillers. In fact, “The Hunt for Red
October (1990)” is the first movie in Clancy’s Jack Ryan
movie series.
In the second example, the target movie is “The Green
Mile (1999)” and the model identified the movie “The Sixth Sense (1999)” as the user’s highest-scored historical item.
Both movies share similarities in their genre as drama films
with a supernatural element, and also relate to the themes
of crime and death. Additionally, the movie “Silence of the
Lambs (1991)” was also identified as having a high score
in the user’s historical items. This movie is another crime
drama that explores the complex relationship between a law
enforcer and a criminal.
In the final example, the target item is “The Dark Crystal
(1982)”. The most similar item in the user’s history, as
determined by AI2V++, is the movie “Labyrinth (1986)”.
Both movies are fantasy/fictional movies directed and written
by Jim Henson.
Next, we examine how the attention scores of users change
in response to different items to be scored. The attention
scores of a particular user are depicted in Fig. 7, where the
left-side image represents the scores in the context of the
classic romantic comedy “Singin’ in the Rain (1952)”. In
this case, the AI2V++ model gives more attention to another
Figure 6: Attention scores for different users in the Movielens-1M dataset. Each plot represents a different user with her
historical items (train items), chronologically ordered from left to right. The movie in the title is the target item. The figure
presents the attention scores of the users’ historical items with respect to the target item.
Figure 7: The change of importance (attention) scores for the same user in the presence of different target items: The left image depicts the importance scores in the presence of the classic romantic comedy “Singin’ in the Rain (1952)”, while the right image depicts the importance scores in the presence of the science fiction film “Cube (1997)”.
Figure 8: The change of importance (attention) scores for the same user in the presence of different target items: The left image depicts the importance scores in the presence of the romantic comedy “Four Weddings and a Funeral (1994)”, while the right image depicts the importance scores in the presence of the action film “Die Hard (1988)”.
classic romantic comedy from the same era, “Some Like It Hot (1959)”, followed by the romantic musical “The King and I (1956)”. The user’s other items are science fiction films
from the 90s and receive lower scores accordingly. While
the model does not disregard them entirely, they are de-
emphasized and given a lesser role. In contrast, the right-
side image in Fig. 7 shows the attentive importance scores
of the same user in the presence of the science fiction film
“Cube (1997)”. In this case, the model correctly identifies
“The Matrix (1999)” and “Star Trek: Insurrection (1998)” as
the most relevant, and the user’s representation would adapt
accordingly based on these scores.
Figure 8 presents another example based on a different
user. The left-side image presents the importance scores
of the model with respect to the romantic comedy “Four Weddings and a Funeral (1994)”, starring Hugh Grant (an actor). We see that in the context of this target item, the model gives more attention to “Notting Hill (1999)” and “When Harry Met Sally (1989)”, both of which are romantic comedies; the former also features Hugh Grant in the lead role. The
right-side image in Fig. 8 presents a different target item for
the same user. Now, the target item is the action movie “Die
Hard (1988)”. As can be seen, in the presence of this target
item, the romantic comedies from the previous example are
de-emphasized, and instead, two other movies come to the
front: “Indiana Jones and the Last Crusade (1989)” and “The
Terminator (1984)”. Both are action-adventure movies from
the ’80s.
The above examples demonstrate that the attention weights of the different historical items enable model interpretability in AI2V++. This process can also be employed to “explain” the model’s recommendations based on the user’s historical items, e.g., “We recommend you ‘X’ because you watched ‘Y’”.
V. DISCUSSION
In this paper, we have introduced AI2V++, an improved and
extended version of our previous work on AI2V presented at
a conference [19]. Unlike most CF algorithms that represent
users as static vectors, AI2V++ models users dynamically
based on the item being recommended. This approach mimics how the human brain operates when making decisions about different items, where different memories are activated and
brought to the forefront. To achieve this, AI2V++ utilizes a
neural cross-attention mechanism on the user’s past items,
where the target item is used as a query. As a result, the items
in the user’s history receive dynamic attention scores based
on their relevance to the item being scored.
AI2V++ includes several algorithmic improvements over
the conference presentation of AI2V [19]: (1) the integration
of ordinal information into the context-target attention mech-
anism through a hierarchy of global and personal ordinal
biases, and (2) the replacement of the categorical cross-
entropy loss function used in AI2V with a binary cross-
entropy loss function more suitable for multi-label classi-
fication problems. The effectiveness of these modifications
is demonstrated through extensive quantitative and qual-
itative evaluations on five datasets, which show that the
AI2V++ model outperforms several state-of-the-art recom-
mender systems. Additionally, through attentive score analysis, we demonstrate the interpretability of AI2V++, which can be harnessed for generating end-user explanations. To ensure
reproducibility, the open-source code for AI2V++ is available
on GitHub.
A. SPACE AND TIME COMPLEXITIES
In what follows, we wish to discuss the space and time complexities of the AI2V++ model in the context of real-world settings and in comparison to alternative algorithms. The space complexity of the model is in line with that of classical MF models, e.g., [11], [75]. In classical MF models, each user and each item is represented by a $d$-dimensional vector; hence, the space complexity sums up to $O((X+M)d)$, where $X$ is the number of users, $M$ is the number of items, and $d$ is the dimensionality of the representations.
AI2V++ represents items similarly to classic MF models but, unlike these models, AI2V++ does not maintain explicit user representations. Instead, AI2V++ dynamically composes the user representation from her items using a per-user attention mechanism. The attention parameters of AI2V++ consist of the projection matrices $A^i_c, A^i_t \in \mathbb{R}^{d \times d_\alpha}$ and $B^i_c, B_t \in \mathbb{R}^{d \times d}$, where $d$ is the dimensionality of the item embeddings and $d_\alpha$ is the dimensionality of the attention space. We can safely assume that $d_\alpha \le d$, since there is no benefit in inflating the representations. The projection matrices $A^i_c$, $A^i_t$, and $B^i_c$ are multiplied by the number of attention heads $N$. In addition, there are $C_{max}$ ordinal biases per user and an additional set of $C_{max}$ global biases. Putting it all together, the space complexity of AI2V++ is, therefore, $O(Md + X(Nd^2 + C_{max}))$. While AI2V++'s space complexity is somewhat higher than that of a classic MF model, the difference is arguably modest and remains linear in the number of users and items.
In terms of training time complexity, since AI2V++ is trained with SGD, its time complexity is linear in the number of parameters, $O(Md + X(Nd^2 + C_{max}))$, multiplied by the number of epochs and the number of samples in each epoch. The number of epochs and samples varies from one dataset to another, but in general, the training time of AI2V++ is relatively fast. For example, we trained our model on a single NVIDIA V100 GPU, and it took 4 hours to train a model for the Moviesdat dataset [60], which is the largest dataset we have used. On other datasets, the training process converged even faster. Arguably, the training time of AI2V++ is relatively modest in the realm of deep learning. Furthermore, our research code has not been optimized for production settings and can surely be improved by professional developers in order to reduce training time in industrial settings. Finally, training the algorithm is performed offline. Hence, its training time is of less importance with respect to its inference time.
While training a CF model is performed offline, inference often needs to be performed online. Moreover, in order to pick the top recommendations, per-user ranking needs to be performed, which incurs additional costs. Therefore, in CF, a model's inference time is usually of higher importance than its offline training time.
As mentioned earlier, in basic CF algorithms such as MF [75], users are represented using fixed, pre-computed latent vectors. At inference time, given a user and an item, the predicted affinity of the user to the item is simply given by computing the inner product between the user representation and the item representation. Hence, the time complexity for scoring a user-item pair is simply $O(d)$. In contrast, the dynamic nature of AI2V++'s user representations inherently requires some additional computations. Given a user and an item, the model first needs to compute the attention scores of the user's historical items with respect to the item. Only then is it possible to assemble the user vector that is used to score the item. Hence, if a user has $k$ items in her history, the time complexity of scoring a single interaction is $O(N \times k \times d)$. As can be seen, when compared to classical CF algorithms, two additional factors are added to AI2V++'s time complexity at inference: the first is the number of historical items that the user interacted with, i.e., $k$, and the second is a hyperparameter that determines the granularity of the multi-attentive user representation, i.e., $N$. Let us briefly address both.
Empirically, when optimizing for the hyperparameter $N$, we found that it is best to set either $N=1$ (MovieLens-1M, Yahoo! Music) or $N=2$ (Netflix, Moviesdat). Hence, in practice, it can be argued that this number is very small and can be considered a constant. Note that even when $N=1$, AI2V++ still enjoys significant advantages over classical CF algorithms.
The number of historical items per user $k$ can be bounded by the total number of items in the catalog, i.e., $k \le M$. Hence, this can serve as a theoretical upper limit on the time complexity of each user-item prediction. However, for the lion's share of users, this upper limit is extremely inflated. CF datasets are known to be very sparse, since most users interact with only a very small subset of the items in the catalog. For example, the sparsity level of the Netflix dataset is 98.82% [10], and the sparsity level of the Yahoo! Music dataset is 99.96% [12]. Moreover, such datasets exhibit a power-law distribution in which a long tail of users rate only a small number of items and only very few "heavy" users rate a large number of items. As a result, in the vast majority of cases $k \ll M$, and for those users, $k$ can be considered a constant rather than a factor. For the few "heavy" users, it is possible to employ different heuristics, such as ignoring some of their historical items or pre-computing and caching their recommendations.
B. LIMITATIONS AND FUTURE WORK
The AI2V++ model presents a novel user representation
scheme and demonstrates both state-of-the-art accuracy re-
sults as well as interpretability properties that throw light on
the inner workings of the model. However, these advantages
do not come without a cost. Arguably, the main limitation of
the model is its inference time which stems from its dynamic
user representation and the utilization of a neural cross-
attention mechanism on users’ historical items. In general,
AI2V++’s inference time is several times higher than that of its classical predecessors. However, for the vast majority of users, if we treat both $N$ and $k$ as constants rather than factors, AI2V++’s inference time becomes $O(d)$, similar to that of classical MF algorithms. Moreover, when compared to
state-of-the-art algorithms based on neural attention such as
in computer vision e.g., [39], [76] or in natural language
processing e.g., [17], [20], [77], AI2V++’s inference time
remains moderate.
In future work, we plan to investigate a distillation
approach that treats AI2V++ as a “teacher” model and
trains a nimble “student” model which learns to reconstruct
AI2V++’s cross-attention mechanism using regression. A similar approach has already been used successfully in the field of natural language processing [57], where it was applied in order to mitigate the expensive cross-attention
operation at the inference phase of models such as BERT [20]
and XLNet [78].
The explainability of AI2V++ represents another area
for future research. Interpretability and explainability are
two related terms that are often used interchangeably [79],
[80]. Interpretability refers to the ability to comprehend the
inner workings of the model by someone knowledgeable
in the field, while explainability relates to the ability to
provide clear explanations to end-users that rationalize the
recommendations. In this study, we demonstrated effective
interpretability properties that can be leveraged for intuitive
explanations. However, evaluating explanations in recom-
mender systems is a complex task that varies according
to the explanations’ aim, such as transparency, scrutability,
trust, effectiveness, persuasiveness, efficiency, and satisfac-
tion. Moreover, the usefulness of an explanation depends on
objective and subjective system aspects, user experience, sit-
uational, interaction, and personal characteristics. Although
we cannot evaluate these aspects in the scope of the current
paper, we hope that the utility of the proposed approach is
evident. Future studies can investigate how to improve the
explainability of AI2V++ further.
REFERENCES
[1] F. Ricci, L. Rokach, and B. Shapira, “Recommender systems: Techniques, applications, and challenges,” Recommender Systems Handbook, pp. 1–35, 2021.
[2] G. M. Davies and D. M. Thomson, Memory in Context: Context in Memory. John Wiley & Sons, 1988.
[3] A. K. Anderson, Y. Yamaguchi, W. Grabski, and D. Lacka, “Emotional memories are not all created equal: Evidence for selective memory enhancement,” Learning & Memory, vol. 13, no. 6, pp. 711–718, 2006.
[4] S. M. Smith and E. Vela, “Environmental context-dependent memory: A review and meta-analysis,” Psychonomic Bulletin & Review, vol. 8, no. 2, pp. 203–220, 2001.
[5] S. Sukhbaatar, D. Ju, S. Poff, S. Roller, A. Szlam, J. Weston, and A. Fan, “Not all memories are created equal: Learning to forget by expiring,” in International Conference on Machine Learning, pp. 9902–9912, PMLR, 2021.
[6] L. Chen, M. De Gemmis, A. Felfernig, P. Lops, F. Ricci, and G. Semeraro, “Human decision making and recommender systems,” ACM Transactions on Interactive Intelligent Systems (TiiS), vol. 3, no. 3, pp. 1–7, 2013.
[7] E. Lex and M. Schedl, “Psychology-informed recommender systems tutorial,” in Proceedings of the 16th ACM Conference on Recommender Systems, pp. 714–717, 2022.
[8] J. R. Bettman, M. F. Luce, and J. W. Payne, “Constructive consumer choice processes,” Journal of Consumer Research, vol. 25, no. 3, pp. 187–217, 1998.
[9] S. Lichtenstein and P. Slovic, The Construction of Preference. Cambridge University Press, 2006.
[10] J. Bennett, S. Lanning, et al., “The Netflix Prize,” in Proceedings of KDD Cup and Workshop, vol. 2007, p. 35, New York, NY, USA, 2007.
[11] Y. Hu, Y. Koren, and C. Volinsky, “Collaborative filtering for implicit feedback datasets,” in 2008 Eighth IEEE International Conference on Data Mining, pp. 263–272, IEEE, 2008.
[12] G. Dror, N. Koenigstein, Y. Koren, and M. Weimer, “The Yahoo! Music dataset and KDD-Cup’11,” in Proceedings of KDD Cup 2011, pp. 3–18, PMLR, 2012.
[13] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, “BPR: Bayesian personalized ranking from implicit feedback,” arXiv preprint arXiv:1205.2618, 2012.
[14] N. Koenigstein and Y. Koren, “Towards scalable and accurate item-oriented recommendations,” in Proceedings of the 7th ACM Conference on Recommender Systems, pp. 419–422, 2013.
[15] O. Barkan and N. Koenigstein, “Item2vec: Neural item embedding for collaborative filtering,” in 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), 2016.
[16] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
[18] N. Kitaev, L. Kaiser, and A. Levskaya, “Reformer: The efficient transformer,” arXiv preprint arXiv:2001.04451, 2020.
[19] O. Barkan, A. Caciularu, O. Katz, and N. Koenigstein, “Attentive item2vec: Neural attentive user representations,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3377–3381, 2020.
[20] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[21] H. Steck, L. Baltrunas, E. Elahi, D. Liang, Y. Raimond, and J. Basilico, “Deep learning for recommender systems: A Netflix case study,” AI Magazine, vol. 42, no. 3, pp. 7–18, 2021.
[22] B. Selma, B. Narhimène, and R. Nachida, “Deep learning for recommender systems: Literature review and perspectives,” in 2021 International Conference on Recent Advances in Mathematics and Informatics (ICRAMI), pp. 1–7, IEEE, 2021.
[23] Y. Wei, M. Langer, F. Yu, M. Lee, J. Liu, J. Shi, and Z. Wang, “A GPU-specialized inference parameter server for large-scale deep recommendation models,” in Proceedings of the 16th ACM Conference on Recommender Systems, pp. 408–419, 2022.
[24] H. Chen, Y. Lin, M. Pan, L. Wang, C.-C. M. Yeh, X. Li, Y. Zheng, F. Wang, and H. Yang, “Denoising self-attentive sequential recommendation,” in Proceedings of the 16th ACM Conference on Recommender Systems, pp. 92–101, 2022.
[25] S. Sedhain, A. K. Menon, S. Sanner, and L. Xie, “AutoRec: Autoencoders meet collaborative filtering,” in Proceedings of the 24th International Conference on World Wide Web, pp. 111–112, 2015.
[26] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, “Neural collaborative filtering,” pp. 173–182, International World Wide Web Conferences Steering Committee, 2017.
[27] S. Rendle, W. Krichene, L. Zhang, and J. Anderson, “Neural collaborative filtering vs. matrix factorization revisited,” in Fourteenth ACM Conference on Recommender Systems, pp. 240–248, 2020.
[28] S. Rendle, W. Krichene, L. Zhang, and Y. Koren, “Revisiting the performance of iALS on item recommendation benchmarks,” in Proceedings of the 16th ACM Conference on Recommender Systems, pp. 427–435, 2022.
[29] X. Wang, X. He, M. Wang, F. Feng, and T.-S. Chua, “Neural graph collaborative filtering,” in Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 165–174, 2019.
[30] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 1025–1035, 2017.
[31] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
[32] T. Ebesu, B. Shen, and Y. Fang, “Collaborative memory network for recommendation systems,” in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 515–524, 2018.
[33] J.-H. Yang, C.-M. Chen, C.-J. Wang, and M.-F. Tsai, “HOP-Rec: High-order proximity for implicit recommendation,” in Proceedings of the 12th ACM Conference on Recommender Systems, pp. 140–144, 2018.
[34] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec, “Graph convolutional neural networks for web-scale recommender systems,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 974–983, 2018.
[35] R. v. d. Berg, T. N. Kipf, and M. Welling, “Graph convolutional matrix completion,” arXiv preprint arXiv:1706.02263, 2017.
[36] X. He, K. Deng, X. Wang, Y. Li, Y. Zhang, and M. Wang, “LightGCN: Simplifying and powering graph convolution network for recommendation,” pp. 639–648, Association for Computing Machinery, 2020.
[37] A. d. S. Correia and E. L. Colombini, “Attention, please! A survey of neural attention models in deep learning,” arXiv preprint arXiv:2103.16775, 2021.
[38] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, “Transformers in vision: A survey,” ACM Computing Surveys (CSUR), 2021.
[39] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[40] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European Conference on Computer Vision, pp. 213–229, Springer, 2020.
[41] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in ICML, 2021.
[42] X. Zhou and Y. Li, “Large-scale modeling of mobile user click behaviors using deep learning,” in Fifteenth ACM Conference on Recommender Systems, pp. 473–483, 2021.
[43] W.-C. Kang and J. McAuley, “Self-attentive sequential recommendation,” in 2018 IEEE International Conference on Data Mining (ICDM), pp. 197–206, IEEE, 2018.
[44] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme, “Factorizing personalized Markov chains for next-basket recommendation,” in Proceedings of the 19th International Conference on World Wide Web, pp. 811–820, 2010.
[45] R. He, W.-C. Kang, and J. McAuley, “Translation-based recommendation,” in Proceedings of the Eleventh ACM Conference on Recommender Systems, pp. 161–169, 2017.
[46] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk, “Session-based recommendations with recurrent neural networks,” arXiv preprint arXiv:1511.06939, 2015.
[47] B. Hidasi and A. Karatzoglou, “Recurrent neural networks with top-k gains for session-based recommendations,” in Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 843–852, 2018.
[48] J. Tang and K. Wang, “Personalized top-n sequential recommendation via convolutional sequence embedding,” in Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 565–573, 2018.
[49] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, and P. Jiang, “BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1441–1450, 2019.
[50] W. L. Taylor, “‘Cloze procedure’: A new tool for measuring readability,” Journalism Quarterly, vol. 30, no. 4, pp. 415–433, 1953.
[51] X. He, Z. He, J. Song, Z. Liu, Y.-G. Jiang, and T.-S. Chua, “NAIS: Neural attentive item similarity model for recommendation,” IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 12, pp. 2354–2366, 2018.
[52] N. Tintarev and J. Masthoff, “Beyond explaining single item recommendations,” Recommender Systems Handbook, pp. 711–756, 2022.
[53] A. Holzinger, A. Saranti, C. Molnar, P. Biecek, and W. Samek, “Explainable AI methods - a brief overview,” in International Workshop on Extending Explainable AI Beyond Deep Models and Classifiers, pp. 13–38, Springer, 2022.
[54] F. Xu, H. Uszkoreit, Y. Du, W. Fan, D. Zhao, and J. Zhu, “Explainable AI: A brief survey on history, research areas, approaches and challenges,” in CCF International Conference on Natural Language Processing and Chinese Computing, pp. 563–574, Springer, 2019.
[55] B. Goodman and S. Flaxman, “European Union regulations on algorithmic decision-making and a ‘right to explanation’,” AI Magazine, vol. 38, no. 3, pp. 50–57, 2017.
[56] K. Swearingen and R. Sinha, “Beyond algorithms: An HCI perspective on recommender systems,” in ACM SIGIR 2001 Workshop on Recommender Systems, vol. 13, pp. 1–11, 2001.
[57] O. Barkan, N. Razin, I. Malkiel, O. Katz, A. Caciularu, and N. Koenigstein, “Scalable attentive sentence pair modeling via distilled sentence embedding,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 3235–3242, 2020.
[58] F. M. Harper and J. A. Konstan, “The MovieLens datasets: History and context,” ACM Transactions on Interactive Intelligent Systems (TiiS), vol. 5, no. 4, pp. 1–19, 2015.
[59] F. Ricci, L. Rokach, and B. Shapira, “Introduction to recommender systems handbook,” in Recommender Systems Handbook, pp. 1–35, Springer, 2011.
[60] “The movies dataset,” Kaggle. https://www.kaggle.com/rounakbanik/the-movies-dataset
[61] J. McAuley, C. Targett, Q. Shi, and A. Van Den Hengel, “Image-based recommendations on styles and substitutes,” in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 43–52, 2015.
[62] M. Chen and P. Liu, “Performance evaluation of recommender systems,” International Journal of Performability Engineering, vol. 13, no. 8, p. 1246, 2017.
[63] A. P. Singh and G. J. Gordon, “A unified view of matrix factorization models,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 358–373, Springer, 2008.
[64] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Item-based collaborative filtering recommendation algorithms,” in Proceedings of the 10th International Conference on World Wide Web, pp. 285–295, 2001.
[65] X. He, H. Zhang, M.-Y. Kan, and T.-S. Chua, “Fast matrix factorization for online recommendation with implicit feedback,” in Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 549–558, 2016.
[66] D. Liang, R. G. Krishnan, M. D. Hoffman, and T. Jebara, “Variational autoencoders for collaborative filtering,” in Proceedings of the 2018 World Wide Web Conference, WWW ’18, pp. 689–698, International World Wide Web Conferences Steering Committee, 2018.
[67] N. Rao, H.-F. Yu, P. K. Ravikumar, and I. S. Dhillon, “Collaborative filtering with graph information: Consistency and scalable methods,” in Advances in Neural Information Processing Systems, vol. 28, Curran Associates, Inc., 2015.
[68] J. Tang and K. Wang, “Personalized top-n sequential recommendation via convolutional sequence embedding,” in Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 565–573, 2018.
[69] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna: A next-generation hyperparameter optimization framework,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2623–2631, 2019.
[70] A. Lydia and S. Francis, “Adagrad - an optimizer for stochastic gradient descent,” Int. J. Inf. Comput. Sci., vol. 6, no. 5, 2019.
[71] P. Cremonesi, Y. Koren, and R. Turrin, “Performance of recommender algorithms on top-n recommendation tasks,” in Proceedings of the Fourth ACM Conference on Recommender Systems, pp. 39–46, 2010.
[72] G. Dror, N. Koenigstein, and Y. Koren, “Web-scale media recommendation systems,” Proceedings of the IEEE, vol. 100, no. 9, pp. 2722–2736, 2012.
[73] Ò. Celma, “The long tail in recommender systems,” in Music Recommendation and Discovery, pp. 87–107, Springer, 2010.
[74] Y. Koren, “Collaborative filtering with temporal dynamics,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pp. 447–456, Association for Computing Machinery, 2009.
[75] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, vol. 42, no. 8, pp. 30–37, 2009.
[76] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[77] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, Curran Associates, Inc., 2020.
[78] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, “XLNet: Generalized autoregressive pretraining for language understanding,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[79] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal, “Explaining explanations: An overview of interpretability of machine learning,” in 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pp. 80–89, IEEE, 2018.
[80] H. J. Escalante, S. Escalera, I. Guyon, X. Baró, Y. Güçlütürk, U. Güçlü, M. van Gerven, and R. van Lier, Explainable and Interpretable Models in Computer Vision and Machine Learning. Springer, 2018.
KEREN GAIGER received the B.Sc. degree in Information Systems from the Technion – Israel Institute of Technology, Haifa, Israel, in 2017. During 2017–2020, she worked as a Data Scientist in two Israeli startups, first in an AdTech company and then in a security company. Currently, she works as a researcher at Lightricks, a company that develops video and image editing mobile apps, and is about to receive her M.Sc. degree from Tel-Aviv University, Tel-Aviv, Israel.
OREN BARKAN received the B.Sc. and M.Sc. degrees in computer science (cum laude) from the Hebrew University, and the Ph.D. degree from the School of Computer Science, Tel-Aviv University, Israel. Currently, he is an assistant professor of computer science at the Open University, Israel.
SHIR TSIPORY-SAMUEL received a B.Sc. de-
gree in industrial engineering and management
from Tel Aviv University, Tel Aviv, Israel, in 2018.
She is currently pursuing an M.Sc. in industrial
engineering and management at Tel Aviv Univer-
sity. In 2017, she joined JFrog as a security data
analyst, where she researched vulnerabilities and
maintained JXray’s database. In 2021, she joined
Bionic as a research analyst, researching code as
part of the static analysis team and building better
analysis visualizations.
NOAM KOENIGSTEIN received the B.Sc. degree in computer science (cum laude) from the Technion – Israel Institute of Technology, Haifa, Israel, in 2007 and the M.Sc. degree in electrical engineering from Tel-Aviv University, Tel-Aviv, Israel, in 2009. In 2013, he received the Ph.D. degree from the School of Electrical Engineering, Tel-Aviv University. In 2011, he joined the Xbox Machine Learning research team at Microsoft, where he developed the algorithm for Xbox recommendations serving millions of users worldwide. Later, he managed the recommendations research team for Microsoft’s Store. In 2017, he joined Citibank’s Israeli Innovation Lab as Senior VP Head of Data Science, overseeing all data science activities in the Israeli research center. In 2018, he joined the Industrial Engineering Department at Tel Aviv University as a Senior Lecturer (Associate Professor). Today, he heads the Applied Machine Learning Lab (AML Lab), where students work on applying machine learning algorithms to a diverse set of real-world problems.