Are We Really Making Much Progress? A Worrying Analysis of
Recent Neural Recommendation Approaches
Maurizio Ferrari Dacrema Paolo Cremonesi Dietmar Jannach
Politecnico di Milano, Italy Politecnico di Milano, Italy University of Klagenfurt, Austria
ABSTRACT
Deep learning techniques have become the method of choice for researchers working on algorithmic aspects of recommender systems. With the strongly increased interest in machine learning in general, it has, as a result, become difficult to keep track of what represents the state-of-the-art at the moment, e.g., for top-n recommendation tasks. At the same time, several recent publications point out problems in today's research practice in applied machine learning, e.g., in terms of the reproducibility of the results or the choice of the baselines when proposing new models.

In this work, we report the results of a systematic analysis of algorithmic proposals for top-n recommendation tasks. Specifically, we considered 18 algorithms that were presented at top-level research conferences in the last years. Only 7 of them could be reproduced with reasonable effort. For these methods, it however turned out that 6 of them can often be outperformed with comparably simple heuristic methods, e.g., based on nearest-neighbor or graph-based techniques. The remaining one clearly outperformed the baselines but did not consistently outperform a well-tuned non-neural linear ranking method. Overall, our work sheds light on a number of potential problems in today's machine learning scholarship and calls for improved scientific practices in this area.
CCS CONCEPTS
• Information systems → Collaborative filtering; Recommender systems; • General and reference → Evaluation.

KEYWORDS
Recommender Systems; Deep Learning; Evaluation; Reproducibility
ACM Reference Format:
Maurizio Ferrari Dacrema, Paolo Cremonesi, and Dietmar Jannach. 2019. Are
We Really Making Much Progress? A Worrying Analysis of Recent Neural
Recommendation Approaches. In Thirteenth ACM Conference on Recom-
mender Systems (RecSys ’19), September 16–20, 2019, Copenhagen, Denmark.
ACM, New York, NY, USA, 9 pages.
1 INTRODUCTION
Within only a few years, deep learning techniques have started to dominate the landscape of algorithmic research in recommender
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
RecSys '19, September 16–20, 2019, Copenhagen, Denmark
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6243-6/19/09...$15.00
systems. Novel methods were proposed for a variety of settings and algorithmic tasks, including top-n recommendation based on long-term preference profiles or for session-based recommendation scenarios [ ]. Given the increased interest in machine learning in general, the corresponding number of recent research publications, and the success of deep learning techniques in other fields like vision or language processing, one could expect that substantial progress resulted from these works also in the field of recommender systems. However, indications exist in other application areas of machine learning that the achieved progress, measured in terms of accuracy improvements over existing models, is not always as strong as expected.
Lin [ ], for example, discusses two recent neural approaches in the field of information retrieval that were published at top-level conferences. His analysis reveals that the new methods do not significantly outperform existing baseline methods when these are carefully tuned. In the context of recommender systems, an in-depth analysis presented in [ ] shows that even a very recent neural method for session-based recommendation can, in most cases, be outperformed by very simple methods based, e.g., on nearest-neighbor techniques. Generally, questions regarding the true progress that is achieved in such applied machine learning settings are not new, nor tied to research based on deep learning. Already in 2009, Armstrong et al. [ ] concluded from an analysis in the context of ad-hoc retrieval tasks that, despite many papers being published, the reported improvements "don't add up".
Dierent factors contribute to such phenomena, including (i)
weak baselines; (ii) establishment of weak methods as new base-
lines; and (iii) diculties in comparing or reproducing results across
papers. One rst problem lies in the choice of the baselines that are
used in the comparisons. Sometimes, baselines are chosen that are
too weak in general for the given task and dataset, and sometimes
the baselines are not properly ne-tuned. Other times, baselines are
chosen from the same family as the newly proposed algorithm, e.g.,
when a new deep learning algorithm is compared only against other
deep learning baselines. This behaviour enforces the propagation
of weak baselines. When previous deep learning algorithms were
evaluated against too weak baselines, the new deep learning algo-
rithm will not necessarily improve over strong non-neural baselines.
Furthermore, with the constant ow of papers being published in
recent years, keeping track of what represents a state-of-the-art
baseline becomes increasingly challenging.
Besides issues related to the baselines, an additional challenge is that researchers use various types of datasets, evaluation protocols, performance measures, and data preprocessing steps, which makes it difficult to conclude which method is the best across different application scenarios. This is particularly problematic when source code and data are not shared. While we observe an increasing trend
that researchers publish the source code of their algorithms, this is not the common rule today, even for top-level publication outlets. And even in cases when the code is published, it is sometimes incomplete and, for instance, does not include the code for data preprocessing, parameter tuning, or the exact evaluation procedures, as pointed out also in [15].
Finally, another general problem might lie in today's research practice in applied machine learning in general. Several "troubling trends" are discussed in [ ], including the thinness of reviewer pools or misaligned incentives for authors that might stimulate certain types of research. Earlier work [ ] also discusses the community's focus on abstract accuracy measures or the narrow focus of machine learning research in terms of what is "publishable" at top publication outlets.
With this research work, our goal is to shed light on the question of whether the problems reported above also exist in the domain of deep learning-based recommendation algorithms. Specifically, we address two main research questions:

Reproducibility: To what extent is recent research in the area reproducible (with reasonable effort)?
Progress: To what extent are recent algorithms actually leading to better performance results when compared to relatively simple, but well-tuned, baseline methods?
To answer these questions, we conducted a systematic study in which we analyzed research papers that proposed new algorithmic approaches for top-n recommendation tasks using deep learning methods. To that purpose, we scanned the recent conference proceedings of KDD, SIGIR, TheWebConf (WWW), and RecSys for corresponding research works. We identified 18 relevant papers. In a first step, we tried to reproduce the results reported in the paper for those cases where the source code was made available by the authors and where we had access to the data used in the experiments. In the end, we could reproduce the published results with an acceptable degree of certainty for only 7 papers. A first contribution of our work is therefore an assessment of the reproducibility level of current research in the area.
In the second part of our study, we re-executed the experiments reported in the original papers, but also included additional baseline methods in the comparison. Specifically, we used heuristic methods based on user-based and item-based nearest neighbors as well as two variants of a simple graph-based approach. Our study, to some surprise, revealed that in the large majority of the investigated cases (6 out of 7) the proposed deep learning techniques did not consistently outperform the simple, but fine-tuned, baseline methods. In one case, even a non-personalized method that recommends the most popular items to everyone was the best one in terms of certain accuracy measures. Our second contribution therefore lies in the identification of a potentially more far-reaching problem related to current research practices in machine learning.
The paper is organized as follows. Next, in Section 2, we describe our research method and how we reproduced existing works. The results of re-executing the experiments while including additional baselines are provided in Section 3. We finally discuss the implications of our research in Section 4.
2.1 Collecting Reproducible Papers
To make sure that our work is not only based on individual examples of recently published research, we systematically scanned the proceedings of scientific conferences for relevant long papers in a manual process. Specifically, we included long papers in our analysis that appeared between 2015 and 2018 in the following four conference series: KDD, SIGIR, TheWebConf (WWW), and RecSys. We considered a paper to be relevant if it (a) proposed a deep learning based technique and (b) focused on the top-n recommendation problem. Papers on other recommendation tasks, e.g., group recommendation or session-based recommendation, were not considered in our analysis. Given our interest in top-n recommendation, we considered only papers that used classification or ranking metrics for evaluation, such as Precision, Recall, or MAP. After this screening process, we ended up with a collection of 18 relevant papers.
In a next step, we tried to reproduce the results reported in these papers. Our approach to reproducibility is to rely as much as possible on the artifacts provided by the authors themselves, i.e., their source code and the data used in the experiments. In theory, it should be possible to reproduce published results using only the technical descriptions in the papers. In reality, however, there are many tiny details regarding the implementation of the algorithms and the evaluation procedure, e.g., regarding data splitting, that can have an impact on the experiment outcomes [39].
We therefore tried to obtain the code and the data for all relevant papers from the authors. In case these artifacts were not already publicly provided, we contacted all authors of the papers and waited 30 days for a response. In the end, we considered a paper to be reproducible if the following conditions were met:

• A working version of the source code is available, or the code only has to be modified in minimal ways to work correctly.
• At least one dataset used in the original paper is available. A further requirement here is that either the originally-used train-test splits are publicly available or that they can be reconstructed based on the information in the paper.
Otherwise, we consider a paper to be non-reproducible given our specific reproduction approach. Note that we also considered works to be non-reproducible when the source code was published but contained only a skeleton version of the model with many parts and details missing. Concerning the datasets, research based solely on non-public data owned by companies, or on data that was gathered in some form from the web but not shared publicly, was also not considered reproducible.
The fraction of papers that were reproducible according to our relatively strict criteria is shown per conference series in Table 1. Overall, we could reproduce only about one third of the works, which confirms previous discussions about limited reproducibility, see, e.g., [ ]. The sample size is too small to draw reliable conclusions regarding differences between the conference series. The detailed statistics per year (not shown here for space reasons) however indicate that the reproducibility rate increased over the years.

Table 1: Reproducible works on deep learning algorithms for top-n recommendation per conference series from 2015 to 2018.

         Rep. ratio    Reproducible
KDD      3/4  (75%)    [17], [23], [48]
RecSys   1/7  (14%)    [53]
SIGIR    1/3  (33%)    [10]
WWW      2/4  (50%)    [14], [24]
Total    7/18 (39%)

Non-reproducible: KDD: [ ]; RecSys: [ ], [ ], [ ], [44], [21], [45]; SIGIR: [32], [7]; WWW: [42], [11].

1 All of the conferences are either considered A* in the Australian CORE ranking or specifically dedicated to research in recommender systems.
2 Precisely speaking, we used a mix of replication and reproduction [ ], i.e., we used both artifacts provided by the authors and our own artifacts. For the sake of readability, we will only use the term "reproducibility" in this paper.
3 We did not apply modifications to the core algorithms.
2.2 Evaluation Methodology
Measurement Method. The validation of the progress that is achieved through new methods against a set of baselines can be done in at least two ways. One is to evaluate all considered methods within the same defined environment, using the same datasets and the exact same evaluation procedure for all algorithms, as done in [ ]. While such an approach helps us obtain a picture of how different methods compare across datasets, the implemented evaluation procedure might be slightly different from the one used in the original papers. As such, this approach would not allow us to exactly reproduce what has been originally reported, which is the goal of the present work.
In this work, we therefore reproduce the published results by refactoring the original implementations in a way that allows us to apply the same evaluation procedure that was used in the original papers. Specifically, the refactoring separates the original code for training, hyper-parameter optimization, and prediction from the evaluation code. This evaluation code is then also used for the baselines.

For all reproduced algorithms considered in the individual experiments, we used the optimal hyper-parameters that were reported by the authors in the original papers for each dataset. This is appropriate because we used the same datasets, algorithm implementations, and evaluation procedure as in the original papers.4 We share all the code and data used in our experiments, as well as details of the final algorithm (hyper-)parameters of our baselines, along with the full experiment results online.5
Baselines. We considered the following baseline methods in our experiments, all of which are conceptually simple.

TopPopular: A non-personalized method that recommends the most popular items to everyone. Popularity is measured by the number of explicit or implicit ratings.

4 We will re-run parameter optimization for the reproduced algorithms as part of our future work in order to validate the parameter optimization procedures used by the authors. This step was, however, outside the scope of our current work.
ItemKNN: A traditional Collaborative-Filtering (CF) approach based on k-nearest-neighborhood (KNN) and item-item similarities [ ]. We used the cosine similarity s_ij between items i and j, computed as

    s_ij = (r_i · r_j) / (‖r_i‖ ‖r_j‖ + h)

where the vectors r_i, r_j ∈ R^|U| represent the implicit ratings of the users for items i and j, respectively, and |U| is the number of users. Ratings can optionally be weighted with either TF-IDF or BM25, as described in [ ]. Furthermore, the similarity may or may not be normalized via the product of the vector norms. The parameter h (the shrink term) is used to lower the similarity between items having only few interactions [ ]. The other parameter of the method is the neighborhood size k.
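As an illustration, this shrunk cosine similarity can be sketched in a few lines (a dense NumPy sketch with hypothetical parameter values; an actual implementation would operate on sparse matrices):

```python
import numpy as np

def item_cosine_similarity(R, h=10.0):
    """Item-item cosine similarity with a shrink term h.

    R: |U| x |I| implicit rating matrix (users x items).
    Returns an |I| x |I| similarity matrix with a zeroed diagonal.
    """
    dot = R.T @ R                       # numerators r_i . r_j
    norms = np.linalg.norm(R, axis=0)   # ||r_i|| for every item
    denom = np.outer(norms, norms) + h  # ||r_i|| ||r_j|| + shrink
    S = dot / denom
    np.fill_diagonal(S, 0.0)            # an item is not its own neighbor
    return S

# Tiny toy matrix: three users, three items
R = np.array([[1., 1., 0.],
              [1., 0., 1.],
              [0., 1., 1.]])
S = item_cosine_similarity(R, h=0.0)
```

With h = 0 this reduces to the plain cosine; increasing h pulls the similarity of rarely co-rated item pairs toward zero.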
UserKNN: A neighborhood-based method using collaborative user-user similarities. The hyper-parameters are the same as for ItemKNN [40].
A neighborhood content-based-ltering (CBF)
approach with item similarities computed by using item content
features (attributes)
where vectors
, fj R|F |
describe the features of items
respectively, and
|F |
is the number of features. Features can be op-
tionally weighted either with TF-IDF or BM25. Other parameters
are the same used for ItemKNN [28].
ItemKNN-CFCBF: A hybrid CF+CBF algorithm based on item-item similarities. The similarity is computed by first concatenating, for each item i, the vector of ratings and the vector of features, [r_i, w·f_i], and then computing the cosine similarity between the concatenated vectors. The hyper-parameters are the same as for ItemKNN, plus a parameter w that weights the content features with respect to the ratings.
P3α: A simple graph-based algorithm which implements a random walk between users and items [ ]. Items for user u are ranked based on the probability of a random walk with three steps starting from user u. The probability p_ui to jump from user u to item i is computed from the implicit user-rating matrix as p_ui = (r_ui / N_u)^α, where r_ui is the rating of user u on item i, N_u is the number of ratings of user u, and α is a damping factor. The probability p_iu to jump backward is computed as p_iu = (r_ui / N_i)^α, where N_i is the number of ratings for item i. The method is equivalent to a KNN item-based CF algorithm, with the similarity matrix defined as s_ij = Σ_{v∈U} p_jv · p_vi. The parameters of the method are the number of neighbors k and the value of α. We include this algorithm because it provides good recommendation quality at a low computational cost.
RP3β: A version of P3α proposed in [ ]. Here, the outcomes of P3α are modified by dividing the similarities by each item's popularity raised to the power of a coefficient β. If β is 0, the algorithm is equivalent to P3α. Its parameters are the number of neighbors k and the values for α and β.
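A compact sketch of both graph-based baselines under the formulation above (the matrix form of the three-step walk and the placement of the popularity penalty are our reading of the description, not the authors' code):

```python
import numpy as np

def p3_alpha_similarity(R, alpha=1.0, beta=0.0):
    """Random-walk item-item similarity: P3alpha, or RP3beta if beta > 0.

    R: |U| x |I| binary implicit rating matrix.
    """
    user_counts = np.maximum(R.sum(axis=1, keepdims=True), 1)  # N_u
    item_counts = np.maximum(R.sum(axis=0, keepdims=True), 1)  # N_i
    P_ui = (R / user_counts) ** alpha        # user -> item transition
    P_iu = (R / item_counts).T ** alpha      # item -> user transition
    S = P_iu @ P_ui                          # aggregated walk probabilities
    if beta > 0:                             # RP3beta popularity penalty
        S = S / (item_counts.reshape(1, -1) ** beta)
    np.fill_diagonal(S, 0.0)
    return S

# Toy example: two users, two items
S = p3_alpha_similarity(np.array([[1., 1.],
                                  [1., 0.]]), alpha=1.0)
```

In a full implementation, only the k largest entries per row of S would be kept, matching the KNN interpretation in the text.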
For all baseline algorithms and datasets, we determined the optimal parameters via Bayesian search [ ] using the implementation in Scikit-Optimize. We explored 35 cases for each algorithm, where the first 5 were used as initial random points. We considered neighborhood sizes k from 5 to 800; the shrink term h between 0 and 1000; and α and β took real values between 0 and 2.
This section summarizes the results of comparing the reproducible works with the baseline methods described above. We share the detailed statistics, results, and final parameters online.
3.1 Collaborative Memory Networks (CMN)
The CMN method was presented at SIGIR '18 and combines memory networks and neural attention mechanisms with latent factor and neighborhood models [ ]. To evaluate their approach, the authors compare it with different matrix factorization and neural recommendation approaches as well as with an ItemKNN algorithm (with no shrinkage). Three datasets are used for the evaluation: Epinions, CiteULike-a, and Pinterest. Optimal hyper-parameters for the proposed method are reported, but no information is provided on how the baselines were tuned. Hit rate and NDCG are the performance measures, used in a leave-one-out procedure. The reported results show that CMN outperforms all other baselines on all measures.

We were able to reproduce their experiments for all three datasets. For our additional experiments with the simple baselines, we optimized the parameters of our baselines for the hit rate (HR@5) metric. The results for the three datasets are shown in Table 2.
Our analysis shows that, after optimization of the baselines, CMN is in no single case the best-performing method on any of the datasets. For the CiteULike-a and Pinterest datasets, at least two of the personalized baseline techniques outperformed the CMN method on every measure. Often, even all personalized baselines were better than CMN. For the Epinions dataset, to some surprise, the unpersonalized TopPopular method, which was not included in the original paper, was better than all other algorithms by a large margin. On this dataset, CMN was indeed much better than our baselines. The success of CMN on this comparably small and very sparse dataset with about 660k observations could therefore be tied to the particularities of the dataset or to a popularity bias of CMN. An analysis reveals that the Epinions dataset has indeed a much more uneven popularity distribution than the other datasets (Gini index of 0.69 vs. 0.37 for CiteULike-a). For this dataset, CMN also recommends in its top-n lists items that are, on average, 8% to 25% more popular than the items recommended by our baselines.
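The popularity-concentration figures can be checked with a standard Gini index over per-item interaction counts (a sketch; the counts and helper name are illustrative):

```python
import numpy as np

def gini_index(counts):
    """Gini index of an item-popularity distribution.

    0 means all items are equally popular; values close to 1 mean
    the interactions concentrate on very few items.
    """
    x = np.sort(np.asarray(counts, dtype=float))
    n = x.size
    # Standard formula: sum of (2i - n - 1) * x_i over n * sum(x)
    index = np.arange(1, n + 1)
    return float(np.sum((2 * index - n - 1) * x) / (n * np.sum(x)))

flat = gini_index([10, 10, 10, 10])   # perfectly even distribution -> 0.0
skewed = gini_index([0, 0, 0, 100])   # all interactions on one item -> 0.75
```

Applied to the per-item rating counts of each dataset, this yields the kind of concentration statistic reported above (0.69 for Epinions vs. 0.37 for CiteULike-a).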
Table 2: Experimental results for the CMN method using the metrics and cutoffs reported in the original paper. Numbers are printed in bold when they correspond to the best result or when a baseline outperformed CMN.

CiteULike-a
        HR@5     NDCG@5   HR@10    NDCG@10
CMN     0.8069   0.6666   0.8910   0.6942

Pinterest
        HR@5     NDCG@5   HR@10    NDCG@10
CMN     0.6872   0.4883   0.8549   0.5430

Epinions
        HR@5     NDCG@5   HR@10    NDCG@10
CMN     0.4195   0.3346   0.4953   0.3592

7 We report the results for CMN-3 as the version with the best results.

3.2 Metapath based Context for RECommendation (MCRec)
MCRec [ ], presented at KDD '18, is a meta-path based model that leverages auxiliary information like movie genres for top-n recommendation. From a technical perspective, the authors propose a priority-based sampling technique to select higher-quality path instances and a novel co-attention mechanism to improve the representations of meta-path based context, users, and items. The authors benchmark four variants of their method against a variety of models of different complexity on three small datasets
(MovieLens100k, LastFm, and Yelp). The evaluation is done by creating 80/20 random training-test splits and by executing 10 such evaluation runs. The evaluation procedure could be reproduced; public training-test splits were provided only for the MovieLens dataset. For the MF and NeuMF [ ] baselines used in their paper, the architecture and hyper-parameters were taken from the original papers; no information about hyper-parameter tuning is provided for the other baselines. Precision, Recall, and NDCG are used as performance measures, with a recommendation list of length 10. The NDCG measure is, however, implemented in an uncommon and questionable way, which is not mentioned in the paper. Here, we therefore use a standard version of the NDCG.
In the publicly shared software, the meta-paths are hard-coded for MovieLens, and no code for preprocessing and constructing the meta-paths is provided. We therefore only report detailed results for the MovieLens dataset. We optimized our baselines for Precision, as was apparently done in [ ]. For MCRec, the results for the complete model are reported.
Table 3: Comparing MCRec against our baselines (MovieLens100k).

             PREC@10   REC@10    NDCG@10
TopPopular   0.1907    0.1180    0.1361
UserKNN      0.2913    0.1802    0.2055
ItemKNN      0.3327    0.2199    0.2603
P3α          0.2137    0.1585    0.1838
RP3β         0.2357    0.1684    0.1923
MCRec        0.3077    0.2061    0.2363

Table 3 shows that the traditional ItemKNN method, when configured correctly, outperforms MCRec on all performance measures. Besides the use of an uncommon NDCG measure, we found other potential methodological issues in this paper. Hyper-parameters for the MF and NeuMF baselines were, as mentioned, not optimized
for the given datasets but taken from the original paper [ ]. In addition, looking at the provided source code, it can be seen that the authors report the best results of their method for each metric across different epochs chosen on the test set, which is inappropriate.8
3.3 Collaborative Variational Autoencoder
The CVAE method [ ], presented at KDD '18, is a hybrid technique that considers both content and rating information. The model learns deep latent representations from content data in an unsupervised manner and also learns implicit relationships between items and users from both content and ratings.

The method is evaluated on two comparably small CiteULike datasets (135k and 205k interactions). For both datasets, a sparse and a dense version is tested. The baselines in [ ] include three recent deep learning models as well as Collaborative Topic Regression (CTR). The parameters for each method are tuned on a validation set. Recall at different list lengths (50 to 300) is used as the evaluation measure. Random train-test data splitting is applied and the measurements are repeated five times.
Table 4: Experimental results for CVAE (CiteULike-a).
REC@50 REC@100 REC@300
CVAE 0.0772 0.1548 0.3602
We could reproduce their results using their code and evaluation procedure. The datasets are also shared by the authors. Fine-tuning our baselines led to the results shown in Table 4 for the dense CiteULike-a dataset from [ ]. For the shortest list length of 50, even the majority of the pure CF baselines outperformed the CVAE method on this dataset. At longer list lengths, the hybrid ItemKNN-CFCBF method led to the best results. Similar results were obtained for the sparse CiteULike-t dataset. Generally, at list length 50, ItemKNN-CFCBF consistently outperformed CVAE in all tested configurations. Only at longer list lengths (100 and beyond) was CVAE able to outperform our methods on two datasets.

8 In our evaluations, we did not use this form of measurement.

Overall, CVAE was favorable over the baselines only in certain configurations and at comparably long and rather uncommon recommendation cutoff thresholds. The use of such long list sizes was, however, not justified in the paper.
3.4 Collaborative Deep Learning (CDL)
The discussed CVAE method considers the earlier and often-cited CDL method [ ] from KDD '15 as one of its baselines, and the authors also use the same evaluation procedure and CiteULike datasets. CDL is a probabilistic feed-forward model for joint learning of stacked denoising autoencoders (SDAE) and collaborative filtering. It applies deep learning techniques to jointly learn a deep representation of content information and collaborative information. The evaluation of CDL in [ ] showed that it is favorable in particular compared to the widely referenced CTR method [ ], especially in sparse data situations.
Table 5: Experimental results for CDL on the dense
CiteULike-a dataset.
REC@50 REC@100 REC@300
CDL 0.0543 0.1035 0.2627
We reproduced the research in [ ], leading to the results shown in Table 5 for the dense CiteULike-a dataset. Not surprisingly, the baselines that were better than CVAE in the previous section are also better than CDL, and again, for short list lengths, already the pure CF methods were better than the hybrid CDL approach. Again, however, CDL leads to higher Recall for list lengths beyond 100 in two out of four dataset configurations. Comparing the detailed results for CVAE and CDL, we see that the newer CVAE method is indeed always better than CDL, which indicates that progress was made. Both methods, however, are not better than one of the simple baselines in the majority of the cases.
3.5 Neural Collaborative Filtering (NCF)
Neural network-based Collaborative Filtering [ ], presented at WWW '17, generalizes Matrix Factorization by replacing the inner product with a neural architecture that can learn an arbitrary function from the data. The proposed hybrid method (NeuMF) was evaluated on two datasets (MovieLens1M and Pinterest), containing 1 million and 1.5 million interactions, respectively. A leave-one-out procedure is used in the evaluation, and the original data splits are publicly shared by the authors. Their results show that NeuMF is favorable, e.g., over existing matrix factorization models, when using the hit rate and the NDCG as evaluation measures at different list lengths up to 10.

Parameter optimization is done on a validation set created from the training set. Similar to the implementation of MCRec above,
the provided source code shows that the authors chose the number of epochs based on the results obtained for the test set. Since the number of epochs, however, is a parameter to tune and should not be determined based on the test set, we used a more appropriate implementation that finds this parameter with the validation set. For the ItemKNN method, the authors only varied the neighborhood sizes but did not test other variations.
Table 6: Experimental results for NCF.

Pinterest
         HR@5     NDCG@5   HR@10    NDCG@10
NeuMF    0.7024   0.4983   0.8719   0.5536

MovieLens 1M
         HR@5     NDCG@5   HR@10    NDCG@10
NeuMF    0.5486   0.3840   0.7120   0.4369
SLIM     0.5589   0.3961   0.7161   0.4470
Given the publicly shared information, we could reproduce the results from [ ]. The outcomes of the experiment are shown in Table 6. On the Pinterest dataset, two of the personalized baselines were better than NeuMF on all metrics. For the MovieLens dataset, NeuMF outperformed our simple baselines quite clearly.

Since the MovieLens dataset has been extensively used over the last decades for evaluating new models, we made additional experiments with SLIM, a simple linear method described in [ ]. To implement SLIM, we took the standard Elastic Net implementation provided in the scikit-learn package for Python (ElasticNet). To tune the hyper-parameters on the validation set, we considered neighborhood sizes as in the other baselines; the ratio of l1 and l2 regularization between 10^− and 1.0; and the regularization magnitude coefficient between 10^− and 1.0. Table 6 shows that SLIM is indeed better than our baselines, as expected, but it also outperforms NeuMF on this dataset.
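This ElasticNet-based formulation of SLIM can be sketched as one positive-coefficient regression per item (hypothetical regularization values; a real implementation would use sparse matrices and parallelize over items):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def slim_elastic_net(R, l1_ratio=0.5, alpha=1e-4, topk=100):
    """Learn an item-item weight matrix W such that R @ W approximates R.

    R: |U| x |I| implicit rating matrix (dense here for brevity).
    """
    n_items = R.shape[1]
    W = np.zeros((n_items, n_items))
    for j in range(n_items):
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                           positive=True, fit_intercept=False,
                           max_iter=500)
        X = R.copy()
        X[:, j] = 0.0                 # an item must not predict itself
        model.fit(X, R[:, j])
        w = model.coef_
        if topk < n_items:            # keep only the top-k weights per item
            w[np.argsort(w)[:-topk]] = 0.0
        W[:, j] = w
    return W

# Toy example: items 0 and 1 co-occur, item 2 is independent
R = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [0., 0., 1.]])
W = slim_elastic_net(R, alpha=1e-3, topk=3)
scores = R @ W  # recommendation scores per user and item
```

Co-occurring items receive positive weights for each other, while the zeroed target column guarantees a zero diagonal in W.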
3.6 Spectral Collaborative Filtering
SpectralCF [ ], presented at RecSys '18, was designed to specifically address the cold-start problem and is based on concepts of Spectral Graph Theory. Its recommendations are based on the bipartite user-item relationship graph and a novel convolution operation, which is used to make collaborative recommendations directly in the spectral domain. The method was evaluated on three public datasets (MovieLens1M, HetRec, and Amazon Instant Video) and benchmarked against a variety of methods, including recent neural approaches and established factorization and ranking techniques. The evaluation was based on randomly created 80/20 training-test splits, using Recall and Mean Average Precision (MAP) at different cutoffs.9
For the MovieLens dataset, the training and test datasets used by
the authors were shared along with the code. For the other datasets,
the data splits were not published; we therefore created the splits
ourselves, following the descriptions in the paper.
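A random 80/20 holdout split of the interactions, as we understand the described procedure, can be sketched as follows (a generic implementation, not the authors' code):

```python
import numpy as np

def random_holdout_split(interactions, train_fraction=0.8, seed=42):
    """Randomly assign (user, item) interactions to train and test.

    interactions: sequence of (user, item) pairs.
    Returns (train, test) arrays with roughly the requested proportions.
    """
    rng = np.random.default_rng(seed)
    interactions = np.asarray(interactions)
    idx = rng.permutation(len(interactions))
    cut = int(len(interactions) * train_fraction)
    return interactions[idx[:cut]], interactions[idx[cut:]]
```

A property of such a split, used in our analysis below, is that each item's share of interactions should be approximately the same in the training and test portions.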
Somewhat surprisingly, the authors report only one set of hyper-
parameter values in the paper, which they apparently used for all
datasets. We therefore ran the code both with the provided hyper-
parameters and with hyper-parameter settings that we determined
on our own for all datasets. For the HetRec and Amazon Instant
Video datasets, all our baselines, to our surprise also including
the TopPopular method, outperformed SpectralCF on all measures.
However, when running the code on the provided MovieLens data
splits, we found that SpectralCF was better than all our baselines
by a huge margin. Recall@20 was, for example, 50% higher than
our best baseline.
We therefore analyzed the published train-test split for the Movie-
Lens dataset and observed that the popularity distribution of the
items in the test set is very different from a distribution that would
likely result from a random sampling procedure.
We then ran
experiments with our own train-test splits also for the MovieLens
dataset, using the splitting procedure described in the paper. We
optimized the parameters for our data split to ensure a fair com-
parison. The results of the experiment are shown in Table 7. When
using data splits that were created as described in the original pa-
per, the results for the MovieLens dataset are in line with our own
experiments for the other two datasets, i.e., SpectralCF in all con-
figurations performed worse than our baseline methods and was
outperformed even by the TopPopular method.
Table 7: Experimental results for SpectralCF (MovieLens1M,
using own random splits and five repeated measurements).

            Cutoff 20          Cutoff 60          Cutoff 100
            REC      MAP       REC      MAP       REC      MAP
TopPopular  0.1853   0.0576    0.3335   0.0659    0.4244   0.0696
UserKNN CF  0.2881   0.1106    0.4780   0.1238    0.5790   0.1290
ItemKNN CF  0.2819   0.1059    0.4712   0.1190    0.5737   0.1243
P3α         0.2853   0.1051    0.4808   0.1195    0.5760   0.1248
RP3β        0.2910   0.1088    0.4882   0.1233    0.5884   0.1288
SpectralCF  0.1843   0.0539    0.3274   0.0618    0.4254   0.0656
Figure 1 visualizes the data splitting problem. The blue data
points show the normalized popularity values for each item in the
training set, with the most popular item in the corresponding split
having the value 1, ordered by decreasing popularity values. In
case of random sampling of ratings, the orange points from the
test set would mostly be very close to the corresponding blue ones.
Here, however, we see that the popularity values of many items
in the test set differ largely. We contacted the authors on this
issue, but did not receive an explanation for this phenomenon.
(To assess the cold-start behavior, additional experiments were
performed with fewer data points per user in the training set.)

Figure 1: Popularity distributions of the provided training
and test splits. In case of a random split, the normalized
values should, on average, be close for both splits.

Are We Really Making Much Progress? RecSys ’19, September 16–20, 2019, Copenhagen, Denmark

An analysis of the distributions with
measures like the Gini index or Shannon entropy confirms that the
dataset characteristics of the shared test set diverge largely from a
random split. The Gini index of a true random split lies at around
0.79 for both the training and test split. While the Gini index for
the provided training split is similar to ours, the Gini index of the
provided test split is much higher (0.92), which means that the
distribution has a much higher popularity bias than a random split.
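One common formulation of the Gini index over item popularity counts is sketched below; this is an illustration of the measure, not necessarily the exact implementation used in our analysis:

```python
import numpy as np

def gini_index(popularity):
    """Gini index of an item-popularity distribution.

    popularity: per-item interaction counts. Returns a value in [0, 1),
    where 0 means a uniform distribution and values close to 1 indicate
    a strong concentration on few popular items.
    """
    x = np.sort(np.asarray(popularity, dtype=float))  # ascending order
    n = len(x)
    ranks = np.arange(1, n + 1)
    # standard closed form: Gini = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    return (2 * (ranks * x).sum()) / (n * x.sum()) - (n + 1) / n
```

Applied to the per-item interaction counts of a split, a markedly higher value on the test portion than on the training portion (0.92 vs. 0.79 here) signals the popularity bias discussed above.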
3.7 Variational Autoencoders for Collaborative
Filtering (Mult-VAE)
Mult-VAE [24] is a collaborative filtering method for implicit feed-
back based on variational autoencoders. The work was presented
at WWW ’18. With Mult-VAE, the authors introduce a generative
model with multinomial likelihood, propose a different regulariza-
tion parameter for the learning objective, and use Bayesian infer-
ence for parameter estimation. They evaluate their method on three
binarized datasets that originally contain movie ratings or song
play counts. The baselines in the experiments include a matrix
factorization method from 2008 [18], a linear model from 2011 [33],
and a more recent neural method [51]. According to the reported
experiments, the proposed method leads to accuracy results that
are typically around 3% better than the best baseline in terms of
Recall and the NDCG.
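The training objective sketched here combines the multinomial log-likelihood with a down-weighted KL term; the following numpy illustration of the per-batch loss is a simplified sketch (an arbitrary beta value, encoder and decoder omitted), not the authors' implementation:

```python
import numpy as np

def multvae_loss(x, logits, mu, logvar, beta=0.2):
    """Negative ELBO of a Mult-VAE-style objective for one batch.

    x:          users x items binary interaction matrix (one row per user)
    logits:     decoder outputs f_theta(z), same shape as x
    mu, logvar: parameters of the Gaussian approximate posterior q(z|x)
    beta:       weight of the KL term (the modified regularization
                parameter of the model); 0.2 is only illustrative
    """
    # multinomial log-likelihood: sum_i x_i * log softmax(logits)_i
    m = logits.max(axis=1, keepdims=True)
    log_norm = m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    log_softmax = logits - log_norm
    neg_ll = -(x * log_softmax).sum(axis=1)
    # closed-form KL( N(mu, sigma^2) || N(0, I) )
    kl = -0.5 * (1 + logvar - mu**2 - np.exp(logvar)).sum(axis=1)
    return (neg_ll + beta * kl).mean()
```

With beta < 1, the KL term is deliberately under-weighted relative to the standard ELBO, which is one of the modifications the authors propose.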
Using their code and datasets, we found that the proposed method
indeed consistently outperforms our quite simple baseline tech-
niques. The obtained accuracy results were between 10% and 20%
better than our best baseline. Thus, with Mult-VAE, we found one
example in the examined literature where a more complex method
was better, by a large margin, than any of our baseline techniques
in all congurations.
To validate that Mult-VAE is advantageous over the complex non-
neural models, as reported in [24], we optimized the parameters
for the weighted matrix factorization technique [18] and the linear
model [33] (SLIM using Elastic Net) for the MovieLens and Netflix
datasets by ourselves. We made the following observations. For both
datasets, we could reproduce the results and observe improvements
over SLIM of up to 5% on the different measures reported in the
original papers. Table 8 shows the outcomes for the Netflix dataset
using the measurements and cutoffs from the original experiments
after optimizing for NDCG@100 as in [24].
Table 8: Experimental results for Mult-VAE (Netflix data), us-
ing metrics and cutoffs reported in the original paper.

            REC@20   REC@50   NDCG@100
SLIM        0.2551   0.3995   0.3745
Mult-VAE    0.2626   0.4138   0.3756
The dierences between Mult-VAE and SLIM in terms of the
NDCG, the optimization goal, are quite small. In terms of the Recall,
however, Mult-VAE improvements over SLIM seem solid. Since the
choice of the used cutos (20 and 50 for Recall, and 100 for NDCG)
is not very consistent in [
], we made additional measurements at
dierent cuto lengths. The results are provided in Table 9. They
show that when using the NDCG as an optimization goal and as
a performance measure, the dierences between SLIM and Mult-
VAE disappear on this dataset, and SLIM is actually sometimes
slightly better. A similar phenomenon can be observed for the
MovieLens dataset. In this particular case, therefore, the progress
that is achieved through the neural approach is only partial and
depends on the chosen evaluation measure.
Table 9: Experimental results for Mult-VAE using additional
cutoff lengths for the Netflix dataset.

            NDCG@20  NDCG@50  REC@100  NDCG@100
SLIM        0.2473   0.3196   0.5289   0.3745
Mult-VAE    0.2448   0.3192   0.5476   0.3756
4.1 Reproducibility and Scalability
In some ways, establishing reproducibility in applied machine learn-
ing should be much easier than in other scientific disciplines and
also other subfields of computer science. While many recommen-
dation algorithms are not fully deterministic, e.g., because they use
some form of random initialization of parameters, the variability
of the obtained results when repeating the exact same experiment
configuration several times is probably very low in most cases.
Therefore, when researchers provide their code and the used data,
everyone should be able to reproduce more or less the exact same
results. Given that researchers today often rely on software that is
publicly available or provided by academic institutions, the barriers
regarding technological requirements are mostly low as well. In
particular, virtualization technology should make it easier for other
researchers to repeat an experiment under very similar conditions.
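As a minimal illustration of making such runs repeatable, the relevant random seeds can be fixed before each experiment; this generic sketch covers the Python standard library and numpy only, and frameworks such as TensorFlow or PyTorch additionally require their own seeding calls:

```python
import os
import random

import numpy as np

def set_reproducible_seeds(seed=42):
    """Fix the main sources of randomness so that repeated runs of the
    same experiment configuration yield (nearly) identical results."""
    os.environ["PYTHONHASHSEED"] = str(seed)  # hash randomization
    random.seed(seed)                         # Python stdlib RNG
    np.random.seed(seed)                      # numpy global RNG
```

Sharing such a seeding routine together with the data splits removes one of the few legitimate sources of variation between independent runs.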
Nonetheless, our work shows that the level of reproducibility is
actually not high. The code of the core algorithms seems to be more
often shared by researchers than in the past, probably also due to
the fact that reproducibility has become an evaluation criterion
for conferences. However, in many cases, the code that is used for
hyper-parameter optimization, evaluation, data pre-processing, and
for the baselines is not shared. This makes it dicult for others to
validate the reported ndings.
One orthogonal factor that can make reproducibility challenging
is the computational complexity of many of the proposed methods.
Ten years after the Netix Prize and its 100 million rating dataset,
researchers, in the year 2019, commonly use datasets containing
only a few hundred thousand ratings. Even for such tiny datasets,
which were considered unacceptably small a few years ago, hyper-
parameter optimization can take days or weeks, even when re-
searchers have access to GPU computing. Clearly, nearest-neighbor
methods, as discussed in our paper, can also lead to scalability issues.
However, with appropriate data pre-processing and data sampling
mechanisms, scalability can also be ensured for such methods, both
in academic and industrial environments [19, 26].
4.2 Progress Assessment
Despite their computational complexity, our analysis showed that
several recently proposed neural methods do not even outperform
conceptually or computationally simpler, sometimes long-known,
algorithms. The level of progress that is achieved in the field of
neural methods is, therefore, unclear, at least when considering the
approaches discussed in our paper.
One main reason for this phantom progress, as our work shows,
lies in the choice of the baselines and the lack of a proper optimiza-
tion of the baselines. In the majority of the investigated cases, not
enough information is given about the optimization of the consid-
ered baselines. Sometimes, we also found that mistakes were made
with respect to data splitting and the implementation of certain
evaluation measures and protocols.
Another interesting observation is that a number of recent pa-
pers use the neural collaborative filtering method (NCF) [14] as
one of their state-of-the-art baselines. According to our analysis,
this method is however outperformed by simple baselines on one
dataset and does not lead to much better results on another, where
it is also outperformed by a standard implementation of a linear
regression method. Therefore, progress is often claimed by compar-
ing a complex neural model against another neural model, which
is, however, not necessarily a strong baseline. Similar observations
can be made for the area of session-based recommendation, where
a recent method based on recurrent neural networks [16] is con-
sidered a competitive baseline, even though almost trivial methods
are in most cases better [29, 30].
Another aspect that makes it difficult to assess progress in the
field lies in the variety of datasets, evaluation protocols, metrics,
and baselines that are used by researchers. Regarding datasets, for
example, we found over 20 public datasets that were used, plus
several variants of the MovieLens and Yelp datasets. As a result,
most datasets are only used in one or two papers. All sorts of metrics
are used (e.g., Precision, Recall, Mean Average Precision, NDCG,
MRR etc.) as well as various evaluation procedures (e.g., random
holdout 80/20, leave-last-out, leave-one-out, 100 negative items or
50 negative items for each positive). In most cases, however, these
choices are not well justied beyond the fact that others used them
before. In reality, the choice of the metric should depend on the
application context. In some applications, for example, it might
be important to have at least one relevant item at the top of the
recommendations, which suggests the use of rank-based metrics
like MRR. In other domains, high Recall might be more important
when the goal is to show as many relevant items as possible to
the user. Besides the unclear choice of the measure, often also the
cutoff sizes for the measurement are not explained and range from
top-3 or top-5 lists to several hundred elements.
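For reference, the rank-based metrics mentioned here can be sketched for a single user as follows (binary relevance, standard textbook definitions; individual evaluation frameworks may handle ties and cutoffs slightly differently):

```python
import numpy as np

def recall_at_k(ranked, relevant, k):
    """Fraction of the user's relevant items that appear in the top-k list."""
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG with the usual log2 position discount."""
    rel = set(relevant)
    dcg = sum(1.0 / np.log2(i + 2) for i, item in enumerate(ranked[:k]) if item in rel)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(len(rel), k)))
    return dcg / idcg

def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant item (0 if none is recommended)."""
    rel = set(relevant)
    for i, item in enumerate(ranked):
        if item in rel:
            return 1.0 / (i + 1)
    return 0.0
```

Since the three metrics reward different aspects of a ranking (coverage of the relevant items vs. their position), the application context, and not convention alone, should determine which one is reported.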
These phenomena are, however, not tied to neural recommen-
dation approaches, but can be found in algorithmic research on
recommender systems also in pre-neural times. Considering the
arguments from [27, 46], such developments are fueled by the strong
focus of machine learning researchers on accuracy measures and
the hunt for the “best” model. In our current research practice,
it is often considered sucient to show that a new method can
outperform a set of existing algorithms on at least one or two pub-
lic datasets on one or two established accuracy measures.
choice of the evaluation measure and dataset however often seems
An example of such unclear research practice is the use of Movie-
Lens rating datasets for the evaluation of algorithms for implicit
feedback datasets. Such practices point to the underlying funda-
mental problem that research is often not guided by any hypothesis
or aimed at the solution of a given problem. The hunt for better
accuracy values dominates research activities in this area, even
though it is not even clear if slightly higher accuracy values are relevant in
terms of adding value for recommendation consumers or providers
[20, 22]. In fact, a number of research works exist that indicate
that higher accuracy does not necessarily translate into better-
received recommendations [4, 9, 13, 31, 37].
In this work, we have analyzed a number of recent neural algorithms
for top-n recommendation. Our analysis indicates that reproducing
published research is still challenging. Furthermore, it turned out
that most of the reviewed works can be outperformed at least
on some datasets by conceptually and computationally simpler
algorithms. Our work therefore calls for more rigor and better
research practices with respect to the evaluation of algorithmic
contributions in this area.
Our analyses so far are limited to papers published in certain
conference series. In our ongoing and future work, we plan to ex-
tend our analysis to other publication outlets and other types of
recommendation problems. Furthermore, we plan to consider more
traditional algorithms as baselines, e.g., based on matrix factoriza-
tion.

(From the 18 papers considered relevant for our study, there were
at least two papers which proposed new DL architectures, which
were evaluated on a single private dataset and for which no source
code was provided.)
REFERENCES
S. Antenucci, S. Boglio, E. Chioso, E. Dervishaj, K. Shuwen, T. Scarlatti, and
M. Ferrari Dacrema. 2018. Artist-driven layering and user’s behaviour impact
on recommendations in a playlist continuation scenario. In Proceedings of the
ACM Recommender Systems Challenge 2018 (RecSys 2018).
https://doi.org/10.1145/3267471.3267475
Timothy G. Armstrong, Alistair Moat, William Webber, and Justin Zobel. 2009.
Improvements That Don’t Add Up: Ad-hoc Retrieval Results Since 1998. In Pro-
ceedings CIKM ’09. 601–610.
Joeran Beel, Corinna Breitinger, Stefan Langer, Andreas Lommatzsch, and Bela
Gipp. 2016. Towards reproducibility in recommender-systems research. User
Modeling and User-Adapted Interaction 26, 1 (2016), 69–101.
Jöran Beel and Stefan Langer. 2015. A Comparison of Offline Evaluations, Online
Evaluations, and User Studies in the Context of Research-Paper Recommender
Systems. In Proceedings TPDL ’15. 153–168.
Robert M Bell and Yehuda Koren. 2007. Improved neighborhood-based collabora-
tive filtering. In KDD cup and workshop at the KDD ’07. Citeseer, 7–14.
Homanga Bharadhwaj, Homin Park, and Brian Y. Lim. 2018. RecGAN: Recurrent
Generative Adversarial Networks for Recommendation Systems. In Proceedings
RecSys ’18. 372–376.
Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-
Seng Chua. 2017. Attentive collaborative filtering: Multimedia recommendation
with item-and component-level attention. In Proceedings SIGIR ’17. 335–344.
Colin Cooper, Sang Hyuk Lee, Tomasz Radzik, and Yiannis Siantos. 2014. Ran-
dom walks in recommender systems: exact computation and simulations. In
Proceedings WWW ’14. 811–816.
Paolo Cremonesi, Franca Garzotto, and Roberto Turrin. 2012. Investigating the
Persuasion Potential of Recommender Systems from a Quality Perspective: An
Empirical Study. Transactions on Interactive Intelligent Systems 2, 2 (2012), 1–41.
Travis Ebesu, Bin Shen, and Yi Fang. 2018. Collaborative Memory Network for
Recommendation Systems. In Proceedings SIGIR ’18. 515–524.
Ali Mamdouh Elkahky, Yang Song, and Xiaodong He. 2015. A multi-view deep
learning approach for cross domain user modeling in recommendation systems.
In Proceedings WWW ’15. 278–288.
Association for Computing Machinery. 2016. Artifact Review and Badging.
Available online at: https://www.acm.org/publications/policies/artifact-review-
badging (Accessed March, 2018).
Florent Garcin, Boi Faltings, Olivier Donatsch, Ayar Alazzawi, Christophe Bruttin,
and Amr Huber. 2014. Oine and Online Evaluation of News Recommender
Systems at Swissinfo.Ch. In Proceedings RecSys ’14. 169–176.
Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng
Chua. 2017. Neural collaborative ltering. In Proceedings WWW ’17. 173–182.
Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and
David Meger. 2018. Deep Reinforcement Learning That Matters. In Proceedings
AAAI ’18. 3207–3214.
Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk.
2016. Session-based Recommendations with Recurrent Neural Networks. In
Proceedings ICLR ’16.
Binbin Hu, Chuan Shi, Wayne Xin Zhao, and Philip S Yu. 2018. Leveraging
meta-path based context for top-n recommendation with a neural co-attention
model. In Proceedings KDD ’18. 1531–1540.
Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative Filtering for
Implicit Feedback Datasets. In Proceedings ICDM ’08. 263–272.
Dietmar Jannach and Malte Ludewig. 2017. When Recurrent Neural Networks
Meet the Neighborhood for Session-Based Recommendation. In Proceedings Rec-
Sys ’17. 306–310.
Dietmar Jannach, Paul Resnick, Alexander Tuzhilin, and Markus Zanker. 2016.
Recommender Systems - Beyond Matrix Completion. Commun. ACM 59, 11
(2016), 94–102.
Donghyun Kim, Chanyoung Park, Jinoh Oh, Sungyoung Lee, and Hwanjo Yu.
2016. Convolutional Matrix Factorization for Document Context-Aware Recom-
mendation. In Proceedings RecSys ’16. 233–240.
Joseph A. Konstan and John Riedl. 2012. Recommender systems: from algorithms
to user experience. User Modeling and User-Adapted Interaction 22, 1 (2012).
Xiaopeng Li and James She. 2017. Collaborative variational autoencoder for
recommender systems. In Proceedings KDD ’17. 305–314.
Dawen Liang, Rahul G Krishnan, Matthew D Hoffman, and Tony Jebara. 2018.
Variational Autoencoders for Collaborative Filtering. In Proceedings WWW ’18.
Jimmy Lin. 2019. The Neural Hype and Comparisons Against Weak Baselines.
SIGIR Forum 52, 2 (Jan. 2019), 40–51.
G. Linden, B. Smith, and J. York. 2003. Amazon.com recommendations: item-to-
item collaborative filtering. IEEE Internet Computing 7, 1 (2003), 76–80.
Zachary C. Lipton and Jacob Steinhardt. 2018. Troubling Trends in Machine
Learning Scholarship. arXiv:1807.03341
RecSys ’19, September 16–20, 2019, Copenhagen, Denmark
Pasquale Lops, Marco De Gemmis, and Giovanni Semeraro. 2011. Content-based
recommender systems: State of the art and trends. In Recommender Systems
Handbook. Springer, 73–105.
Malte Ludewig and Dietmar Jannach. 2018. Evaluation of Session-based Rec-
ommendation Algorithms. User-Modeling and User-Adapted Interaction 28, 4–5
(2018), 331–390.
Malte Ludewig, Noemi Mauro, Sara Lati, and Dietmar Jannach. 2019. Perfor-
mance Comparison of Neural and Non-Neural Approaches to Session-based Rec-
ommendation. In Proceedings RecSys ’19.
Andrii Maksai, Florent Garcin, and Boi Faltings. 2015. Predicting Online Perfor-
mance of News Recommender Systems Through Richer Evaluation Metrics. In
Proceedings RecSys ’15. 179–186.
Jarana Manotumruksa, Craig Macdonald, and Iadh Ounis. 2018. A Contextual
Attention Recurrent Architecture for Context-Aware Venue Recommendation.
In Proceedings SIGIR ’18. 555–564.
Xia Ning and George Karypis. 2011. SLIM: Sparse linear methods for top-n
recommender systems. In Proceedings ICDM ’11. 497–506.
Bibek Paudel, Fabian Christoel, Chris Newell, and Abraham Bernstein. 2017.
Updatable, Accurate, Diverse, and Scalable Recommendations for Interactive
Applications. ACM Transactions on Interactive Intelligent Systems 7, 1 (2017), 1.
Hans Ekkehard Plesser. 2017. Reproducibility vs. Replicability: A Brief History
of a Confused Terminology. Frontiers in Neuroinformatics 11, 76 (2017).
Massimo Quadrana, Paolo Cremonesi, and Dietmar Jannach. 2018. Sequence-
Aware Recommender Systems. Comput. Surveys 51, 4 (2018), 1–36.
[37] Marco Rossetti, Fabio Stella, and Markus Zanker. 2016. Contrasting Offline and
Online Results when Evaluating Recommendation Algorithms. In Proceedings
RecSys ’16. 31–34.
Noveen Sachdeva, Kartik Gupta, and Vikram Pudi. 2018. Attentive Neural Archi-
tecture Incorporating Song Features for Music Recommendation. In Proceedings
RecSys ’18. 417–421.
Alan Said and Alejandro Bellogín. 2014. Rival: A Toolkit to Foster Reproducibility
in Recommender System Evaluation. In Proceedings RecSys ’14. 371–372.
Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based
collaborative ltering recommendation algorithms. In Proceedings WWW ’01.
Zhu Sun, Jie Yang, Jie Zhang, Alessandro Bozzon, Long-Kai Huang, and Chi Xu.
2018. Recurrent Knowledge Graph Embedding for Effective Recommendation. In
Proceedings RecSys ’18. 297–305.
Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2018. Latent relational metric
learning via memory-based attention for collaborative ranking. In Proceedings
WWW ’18. 729–739.
Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2018. Multi-Pointer Co-Attention
Networks for Recommendation. In Proceedings SIGKDD ’18. 2309–2318.
Trinh Xuan Tuan and Tu Minh Phuong. 2017. 3D Convolutional Networks for
Session-based Recommendation with Content Features. In Proceedings RecSys
’17. 138–146.
Flavian Vasile, Elena Smirnova, and Alexis Conneau. 2016. Meta-Prod2Vec:
Product Embeddings Using Side-Information for Recommendation. In Proceedings
RecSys ’16. 225–232.
Kiri Wagsta. 2012. Machine Learning that Matters. In Proceedings ICML ’12.
Chong Wang and David M Blei. 2011. Collaborative topic modeling for recom-
mending scientic articles. In Proceedings KDD ’11. 448–456.
Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative deep learning
for recommender systems. In Proceedings KDD ’15. 1235–1244.
Jun Wang, Arjen P De Vries, and Marcel JT Reinders. 2006. Unifying user-
based and item-based collaborative filtering approaches by similarity fusion. In
Proceedings SIGIR ’06. 501–508.
Jun Wang, Stephen Robertson, Arjen P de Vries, and Marcel JT Reinders. 2008.
Probabilistic relevance ranking for collaborative filtering. Information Retrieval
11, 6 (2008), 477–497.
Yao Wu, Christopher DuBois, Alice X Zheng, and Martin Ester. 2016. Collabo-
rative denoising auto-encoders for top-n recommender systems. In Proceedings
WSDM ’16. 153–162.
Bo Xiao and Izak Benbasat. 2007. E-commerce Product Recommendation Agents:
Use, Characteristics, and Impact. MIS Quarterly 31, 1 (March 2007), 137–209.
Lei Zheng, Chun-Ta Lu, Fei Jiang, Jiawei Zhang, and Philip S. Yu. 2018. Spectral
Collaborative Filtering. In Proceedings RecSys ’18. 311–319.
... When taking into account the hyperparameter tuning procedure, 41,000 models were fitted, corresponding to a total computation time of 253 days. 23 ...
... According to our experiments, simple hybrid baselines outperform CDL in three of four datasets. On a dense dataset, CDL is also outperformed by pure collaborative 23 The computation time refers to the total instance time for one AWS instance p3.2xlarge, with 8 vCPU, 30GB RAM, and one Tesla V100-SXM2-16GB GPU. The detailed measurements are available in the online material. ...
... 55 These findings have been confirmed in a recent article by Rendle 54 We report RP 3 β [54] for completeness although the DL algorithm we evaluate here predates its publication. 55 It shall be noted here that after the first publication of our results [23], the authors of NeuMF provided us with an alternative configuration of their method, which included new hyperparameter values taken from alternative hyperparameter ranges, and requiring other slight changes in the training procedure. While this new configuration led to slightly improved results for their method, the results of our analysis were confirmed. ...
Full-text available
The design of algorithms that generate personalized ranked item lists is a central topic of research in the field of recommender systems. In the past few years, in particular, approaches based on deep learning (neural) techniques have become dominant in the literature. For all of them, substantial progress over the state-of-the-art is claimed. However, indications exist of certain problems in today’s research practice, e.g., with respect to the choice and optimization of the baselines used for comparison, raising questions about the published claims. To obtain a better understanding of the actual progress, we have compared recent results in the area of neural recommendation approaches based on collaborative filtering against a consistent set of existing simple baselines. The worrying outcome of the analysis of these recent works—all were published at prestigious scientific conferences between 2015 and 2018—is that 11 of the 12 reproducible neural approaches can be outperformed by conceptually simple methods, e.g., based on the nearest-neighbor heuristic or linear models. None of the computationally complex neural methods was actually consistently better than already existing learning-based techniques, e.g., using matrix factorization or linear models. In our analysis, we discuss common issues in today’s research practice, which, despite the many papers that are published on the topic, have apparently led the field to a certain level of stagnation.
... We opted to extend the VAE architecture for collaborative filtering presented by Liang et al. [47] because in a large-scale study conducted by Dacrema et al. [25], the approach followed by Liang et al. [47] was found the only deep neural network-based approach that outperformed equally well tuned non-deep-learning approaches. In addition, Liang et al. [47] evaluated their VAE architecture on the Million Song Dataset [12], a common benchmark in the music domain. ...
... In this work, we use Euclidean distance as distance metric and set ξ = 0.05.9 More precisely, we performed grid search on t-SNE perplexity in the range[1,2,3,5,10,15,20,25,30,35,40,50] and on the minimum number of data points per cluster enforced by OPTICS in the range[2,3,4,5], optimizing for average neighborhood preservation ratio (nearest neighbor consistency). 10 ...
Full-text available
Music preferences are strongly shaped by the cultural and socio-economic background of the listener, which is reflected, to a considerable extent, in country-specific music listening profiles. Previous work has already identified several country-specific differences in the popularity distribution of music artists listened to. In particular, what constitutes the "music mainstream" strongly varies between countries. To complement and extend these results, the article at hand delivers the following major contributions: First, using state-of-the-art unsupervised learning techniques, we identify and thoroughly investigate (1) country profiles of music preferences on the fine-grained level of music tracks (in contrast to earlier work that relied on music preferences on the artist level) and (2) country archetypes that subsume countries sharing similar patterns of listening preferences. Second, we formulate four user models that leverage the user's country information on music preferences. Among others, we propose a user modeling approach to describe a music listener as a vector of similarities over the identified country clusters or archetypes. Third, we propose a context-aware music recommendation system that leverages implicit user feedback, where context is defined via the four user models. More precisely, it is a multi-layer generative model based on a variational autoencoder, in which contextual features can influence recommendations through a gating mechanism. Fourth, we thoroughly evaluate the proposed recommendation system and user models on a real-world corpus of more than one billion listening records of users around the world (out of which we use 369 million in our experiments) and show its merits vis-a-vis state-of-the-art algorithms that do not exploit this type of context information.
... We have carefully followed methodological principles and practical recommendations outlined in [6,7] in order to avoid common mistakes and issues. We used the source code that accompanies [6] for generating the data splits for training, validation, and test phases. ...
... We compare them with common baselines and recent state-of-the-art analyzed in [6,7]. Tables 1 and 2 consolidate our findings. ...
We introduce a simple autoencoder based on hyperbolic geometry for solving standard collaborative filtering problem. In contrast to many modern deep learning techniques, we build our solution using only a single hidden layer. Remarkably, even with such a minimalistic approach, we not only outperform the Euclidean counterpart but also achieve a competitive performance with respect to the current state-of-the-art. We additionally explore the effects of space curvature on the quality of hyperbolic models and propose an efficient data-driven method for estimating its optimal value.
... Neighborhood methods have been used since the dawn of recommender systems [42,37]. They are still competitive to recent neural model-based methods [7], especially in session-based recommendation [26]. In this work, we extend neighborhood methods for the causal effect of recommendations. ...
Full-text available
The business objectives of recommenders, such as increasing sales, are aligned with the causal effect of recommendations. Previous recommenders targeting for the causal effect employ the inverse propensity scoring (IPS) in causal inference. However, IPS is prone to suffer from high variance. The matching estimator is another representative method in causal inference field. It does not use propensity and hence free from the above variance problem. In this work, we unify traditional neighborhood recommendation methods with the matching estimator, and develop robust ranking methods for the causal effect of recommendations. Our experiments demonstrate that the proposed methods outperform various baselines in ranking metrics for the causal effect. The results suggest that the proposed methods can achieve more sales and user engagement than previous recommenders.
... In order to tune the hyperparameters of the models we performed a Bayesian Optimization, which was proven to be an effective strategy [1,4,6], using the scikit-learn library 4 . ...
Conference Paper
Full-text available
In this paper we provide a description of the methods we used as team BanaNeverAlone for the ACM RecSys Challenge 2020, organized by Twitter. The challenge addresses the problem of user engagement prediction: the goal is to predict the probability of a user engagement (Like, Reply, Retweet or Retweet with comment), based on a series of past interactions on the Twitter platform. Our proposed solution relies on several features that we extracted from the original dataset, as well as on consolidated models, such as gradient boosting for decision trees and neural networks. The ensemble model, built using blending, and a multi-objective optimization allowed our team to rank in position 4.
... These results indicate that the relationship between the system performance and the runtime activities recorded in the logs can be effectively captured by the simple traditional models. Such results also agree with a recent study (Dacrema et al. 2019) that compares deep neural networks and traditional models for the application of automated recommendations. ...
Performance regressions of large-scale software systems often lead to both financial and reputational losses. In order to detect performance regressions, performance tests are typically conducted in an in-house (non-production) environment using test suites with predefined workloads. Then, performance analysis is performed to check whether a software version has a performance regression against an earlier version. However, the real workloads in the field are constantly changing, making it unrealistic to resemble the field workloads in predefined test suites. More importantly, performance testing is usually very expensive as it requires extensive resources and lasts for an extended period. In this work, we leverage black-box machine learning models to automatically detect performance regressions in the field operations of large-scale software systems. Practitioners can leverage our approaches to complement or replace resource-demanding performance tests that may not even be realistic in a fast-paced environment. Our approaches use black-box models to capture the relationship between the performance of a software system (e.g., CPU usage) under varying workloads and the runtime activities that are recorded in the readily-available logs. Then, our approaches compare the black-box models derived from the current software version with an earlier version to detect performance regressions between these two versions. We performed empirical experiments on two open-source systems and applied our approaches on a large-scale industrial system. Our results show that such black-box models can effectively and promptly detect real performance regressions and injected ones under varying workloads that are unseen when training these models. Our approaches have been adopted in practice to detect performance regressions of a large-scale industrial system on a daily basis.
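The core idea above — model the log-activity-to-performance relationship on one version, then compare against observations from the next version — can be sketched with a simple linear model as a stand-in for the paper's black-box models. The function name, the 10% tolerance, and the synthetic workload data are all illustrative assumptions:

```python
import numpy as np

def detect_regression(X_old, y_old, X_new, y_new, tolerance=1.1):
    """Fit a linear log-activity -> CPU model on the old version, then
    flag a regression if the new version's observed CPU exceeds the
    model's prediction for the same workloads by more than `tolerance`."""
    # Least-squares fit with an intercept column.
    A_old = np.column_stack([X_old, np.ones(len(X_old))])
    coef, *_ = np.linalg.lstsq(A_old, y_old, rcond=None)
    A_new = np.column_stack([X_new, np.ones(len(X_new))])
    predicted = A_new @ coef
    ratio = np.mean(y_new) / np.mean(predicted)
    return ratio > tolerance, ratio

rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(50, 2))          # log-activity counts, old version
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + 1.0       # CPU usage, old version
X2 = rng.uniform(1, 10, size=(50, 2))         # different field workloads
y2 = 1.3 * (2.0 * X2[:, 0] + 0.5 * X2[:, 1] + 1.0)  # new version: 30% slower
flag, ratio = detect_regression(X, y, X2, y2)
```

Because the new workloads differ from the training workloads, the comparison goes through the model's predictions rather than raw version-to-version numbers.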
... It is known to perform strongly in scenarios with sparse and transient data, even for new users with little interaction history [18]. Our objective in this research is to establish strong, definitive baselines for multi-language recipe recommendation [7]. Consequently, our future work will develop the assessment of SOTA recommendation models in this application area. ...
Multi-language recipe personalisation and recommendation is an under-explored field of information retrieval in academic and production systems. The existing gaps in our current understanding are numerous, even on fundamental questions such as whether consistent and high-quality recipe recommendation can be delivered across languages. Motivated by this need, we consider the multi-language recipe recommendation setting and present grounding results that will help to establish the potential and absolute value of future work in this area. Our work draws on several billion events from millions of recipes, with published recipes and users incorporating several languages, including Arabic, English, Indonesian, Russian, and Spanish. We represent recipes using a combination of normalised ingredients, standardised skills and image embeddings obtained without human intervention. In modelling, we take a classical approach based on optimising an embedded bi-linear user-item metric space towards the interactions that most strongly elicit cooking intent. For users without interaction histories, a bespoke content-based cold-start model that predicts context and recipe affinity is introduced. We show that our approach to personalisation is stable and scales well to new languages. A robust cross-validation campaign is employed and consistently rejects baseline models and representations, strongly favouring those we propose. Our results are presented in a language-oriented (as opposed to model-oriented) fashion to emphasise the language-based goals of this work. We believe that this is the first large-scale work that evaluates the value and potential of multi-language recipe recommendation and personalisation.
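The "embedded bi-linear user-item metric space" mentioned above scores a user-item pair as a bi-linear form over their embeddings. A minimal scoring sketch — the function name, dimensionality, and the identity choice for the interaction matrix `M` are illustrative assumptions, not details from the paper:

```python
import numpy as np

def bilinear_scores(user_vec, item_matrix, M):
    """Score every item for one user with a bi-linear form
    s(u, i) = u^T M v_i over learned embeddings."""
    return item_matrix @ (M.T @ user_vec)

rng = np.random.default_rng(1)
d = 4
M = np.eye(d)                  # identity M reduces the form to a dot product
user = rng.normal(size=d)      # stand-in for a learned user embedding
items = rng.normal(size=(5, d))  # stand-ins for 5 recipe embeddings
scores = bilinear_scores(user, items, M)
ranking = np.argsort(-scores)  # best-scoring recipes first
```

In training, `M` and the embeddings would be optimised jointly on observed interactions; here they are fixed random values purely to show the scoring step.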
... In this sense, there might also exist longitudinal effects of certain recommendation strategies that are not well understood yet (Zhang et al. 2020). Generally, the community also seems to sometimes face methodological issues, e.g., comparisons with non-optimized baselines and limited reproducibility (Ferrari Dacrema et al. 2019), which may hamper the progress of research in session-based and sequential recommendations. ...
The accurate prediction of biological features from genomic data is paramount for precision medicine, sustainable agriculture and climate change research. For decades, neural network models have been widely popular in fields like computer vision, astrophysics and targeted marketing given their prediction accuracy and their robust performance under big data settings. Yet neural network models have not made a successful transition into the medical and biological world due to the ubiquitous characteristics of biological data such as modest sample sizes, sparsity, and extreme heterogeneity. Here, we investigate the robustness, generalization potential and prediction accuracy of widely used convolutional neural network and natural language processing models with a variety of heterogeneous genomic datasets. While the perspective of a robust out-of-the-box neural network model is out of reach, we identify certain model characteristics that translate well across datasets and could serve as a baseline model for translational researchers.
We present a Bayesian approach to conversational recommender systems. After any interaction with the user, a probability mass function over the items is updated by the system. The conversational feature corresponds to a sequential discovery of the user preferences based on questions. Information-theoretic criteria are used to optimally shape the interactions and decide when the conversation ends. The most probable items are consequently recommended. Dedicated elicitation techniques for the prior probabilities of the parameters modelling the interactions are derived from basic structural judgements based on logical compatibility and symmetry assumptions. Such prior knowledge is combined with data for better item discrimination. Our Bayesian approach is validated against matrix factorization techniques for cold-start recommendations based on metadata, using the popular benchmark data set MovieLens. Results show that the proposed approach allows the number of interactions to be considerably reduced while maintaining good ranking performance.
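The core loop described above — maintain a probability mass function over items and update it after each answer — is a standard Bayesian update. A toy sketch under assumed likelihoods (the question, the four-item catalogue, and the 0.9/0.1 answer probabilities are invented for illustration; they are not the paper's elicitation scheme):

```python
import numpy as np

def update_pmf(pmf, likelihood):
    """Bayes update of the item probability mass function, given the
    likelihood of the observed answer under each candidate item."""
    posterior = pmf * likelihood
    return posterior / posterior.sum()

def entropy(pmf):
    """Shannon entropy in bits; information-theoretic criteria like this
    can drive question selection and the decision to stop asking."""
    p = pmf[pmf > 0]
    return -np.sum(p * np.log2(p))

# Toy: 4 items, question "do you like comedies?"; items 0-1 are comedies.
pmf = np.full(4, 0.25)                      # uniform prior over items
like_yes = np.array([0.9, 0.9, 0.1, 0.1])   # P(answer = yes | target item)
pmf = update_pmf(pmf, like_yes)             # the user answered "yes"
```

After the answer, probability mass concentrates on the comedies and the entropy of the pmf drops, which is exactly what makes an informative question worth asking.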
RNNs have been shown to be excellent models for sequential data, and in particular for data that is generated by users in a session-based manner. The use of RNNs provides impressive performance benefits over classical methods in session-based recommendations. In this work we introduce novel ranking loss functions tailored to RNNs in the recommendation setting. The improved performance of these losses over alternatives, along with further tricks and refinements described in this work, allow for an overall improvement of up to 35% in terms of MRR and Recall@20 over previous session-based RNN solutions, and up to 53% over classical collaborative filtering approaches. Unlike data augmentation-based improvements, our method does not increase training times significantly. We further demonstrate the performance gain of the RNN over baselines in an online A/B test.
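For intuition about pairwise ranking losses of this kind, here is plain BPR — a simpler classical relative of the losses the paper introduces, not the paper's own loss functions — computed on hand-picked toy scores:

```python
import numpy as np

def bpr_loss(pos_scores, neg_scores):
    """Bayesian Personalised Ranking loss: -log sigmoid(s_pos - s_neg),
    averaged over sampled (positive, negative) item pairs."""
    diff = pos_scores - neg_scores
    return -np.mean(np.log(1.0 / (1.0 + np.exp(-diff))))

# When the positive (next) item outscores the sampled negatives,
# the loss is small; when it is outscored, the loss is large.
good = bpr_loss(np.array([3.0, 2.5]), np.array([0.5, 0.1]))
bad = bpr_loss(np.array([0.2, 0.1]), np.array([2.0, 2.5]))
```

In an RNN recommender, `pos_scores` and `neg_scores` would come from the network's output layer at each step of the session, and the loss gradient would be backpropagated through the recurrent states.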
In this paper we provide an overview of the approach we used as team Creamy Fireflies for the ACM RecSys Challenge 2018. The competition, organized by Spotify, focuses on the problem of playlist continuation, that is, suggesting which tracks the user may add to an existing playlist. The challenge addresses this issue in many use cases, from playlist cold start to playlists already composed of up to a hundred tracks. Our team proposes a solution based on a few well-known models, both content-based and collaborative, whose predictions are aggregated via an ensembling step. Moreover, by analyzing the underlying structure of the data, we propose a series of boosts to be applied on top of the final predictions to improve the recommendation quality. The proposed approach leverages well-known algorithms and is able to offer high recommendation quality while requiring a limited amount of computational resources.
Recommender systems are an integral part of music sharing platforms. Often the aim of these systems is to increase the time a user spends on the platform, and they hence have high commercial value. Systems that aim to increase the average time a user spends on the platform often need to recommend, at each point in time, songs the user might want to listen to next. This differs from recommender systems that try to predict an item of interest at some point in the user's lifetime, but not necessarily in the very near future. Predicting the next song the user might like requires some modelling of the user's interests at the given point in time. Attentive neural networks have exploited the sequence in which items were selected by the user to model the user's implicit short-term interests for the task of next-item prediction; however, we believe that the features of the songs occurring in the sequence can also convey important information about short-term user interest which the items alone cannot. In this direction, we propose a novel attentive neural architecture which, in addition to the sequence of items selected by the user, uses the features of these items to better learn the user's short-term preferences and recommend the next song.
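The attention mechanism described above can be reduced to its essentials: score each song in the session against a query, softmax the scores, and summarise the session as a weighted sum of song feature vectors. A minimal sketch — the feature dimensionality, the choice of the latest song as query, and all names are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_session_repr(item_feats, query):
    """Weight the items in a listening session by attention scores
    against a query vector, then summarise the session as the
    weighted sum of the item feature vectors."""
    weights = softmax(item_feats @ query)
    return weights @ item_feats, weights

rng = np.random.default_rng(2)
feats = rng.normal(size=(3, 4))  # 3 songs in the session, 4 features each
query = feats[-1]                # attend with respect to the latest song
session_vec, w = attentive_session_repr(feats, query)
```

In a trained model the query and the feature projections would be learned, and `session_vec` would feed the layer that scores candidate next songs.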