Measuring the Business Value of Recommender Systems
Research Commentary
DIETMAR JANNACH, University of Klagenfurt
MICHAEL JUGOVAC, TU Dortmund
Recommender Systems are nowadays successfully used by all major web sites—from e-commerce to social media—to filter content and make suggestions in a personalized way. Academic research largely focuses on the value of recommenders for consumers, e.g., in terms of reduced information overload. To what extent and in which ways recommender systems create business value is, however, much less clear, and the literature on the topic is scattered. In this research commentary, we review existing publications on field tests of recommender systems and report which business-related performance measures were used in such real-world deployments. We summarize common challenges of measuring the business value in practice and critically discuss the value of algorithmic improvements and offline experiments as commonly done in academic environments. Overall, our review indicates that various open questions remain both regarding the realistic quantification of the business effects of recommenders and the performance assessment of recommendation algorithms in academia.
CCS Concepts: • Information systems → Recommender systems;
1 INTRODUCTION
Recommender systems are among the most visible and successful applications of Artificial Intelligence and Machine Learning technology in practice. Nowadays, such systems accompany us through our daily online lives—for example on e-commerce sites, on media streaming platforms, or in social networks. They help us by suggesting things that are assumed to be of interest to us and which we are correspondingly likely to inspect, consume, or purchase.
Recommendations that are provided online are usually designed to serve a certain purpose and to create a certain value, either for the consumer, the provider, some other stakeholder like an item producer, or several of them in parallel [2, 3, 31]. Academic research mostly focuses on the consumer perspective, with the implicit assumption that improved customer value is indirectly also beneficial for the recommendation provider. Indeed, among other considerations, service providers are usually interested in improving the recommendation experience of consumers. Typically, however, they assess the value of a recommendation system more directly in terms of business value. Relevant business measures in that context include sales or revenue, click-through rates (CTR), higher user engagement, or customer retention rates [26, 27, 33, 73].
Given the insights from the literature [5, 20, 27, 46, 47], it is undisputed that recommender systems can have positive business effects in a variety of ways. However, how large these effects actually are—compared to a situation without a recommender system or with a different algorithm—is not always clear. In the literature, the reported numbers vary largely, from marginal effects in terms of revenue [22] to orders of magnitude of improvement in terms of "Gross Merchandise Volume" [16]. Furthermore, in some application domains, it might also not be immediately evident what particular measure one should focus on. Increases in click-through rates are, for example, often used as a measure in reports on real-world deployments. To what extent such increases actually reflect the long-term business value of a recommender can, however, be open to question.
A related challenge—in theory as in practice—is to predict if a planned improvement of the used recommendation algorithm will positively affect a certain business measure. In fact, many companies are constantly trying to improve their recommendation systems, and they usually run field tests (A/B tests) to gauge the effects of certain changes. Since such field tests can be costly and risky, companies like Netflix additionally rely on offline experiments based on historical data to make preliminary assessments of planned algorithm changes [27]. This type of experiment is also predominant in the academic literature, mostly because researchers typically have no access to a real-world system where they can test the effectiveness of their ideas. Unfortunately, while nowadays a number of research datasets are available, they usually do not contain quantitative data from which the business value can be directly inferred. Furthermore, since the choice of a business measure is often specific to a domain, researchers typically abstract from these domain specifics and, most commonly, aim at predicting user preference statements (e.g., ratings) or the user's next action as recorded in a given dataset. To what extent such measurements—and offline experiments in general—are helpful to assess the potential business value of algorithmic improvements is also open to question. According to a report by Netflix researchers [27], offline experiments were not found "to be as highly predictive of A/B test outcomes as we would like."
Overall, given the largely varying results reported on the effect of recommenders on business, two potential pitfalls can be identified: First, the business value of the recommender system is not adequately defined, measured, or analyzed, potentially leading to wrong conclusions about the true impact of the system. Second, the value of deploying complex algorithms that are slightly better than previous ones in terms of an abstract computational measure like the RMSE might be wrongly estimated. After the Netflix Prize competition, for example, the winning strategy was never put into practice. Despite the theoretical accuracy gains, it was not clear if the potentially resulting increases in business value would justify the engineering effort to implement the winning strategy in a scalable manner [5].
In this research commentary, we therefore review the literature on real-world deployments of recommender systems. We consider both personalized recommendation approaches based on long-term user profiles as well as recommendations that are based on interactively acquired preferences or the user's current navigational context.¹ This review shall serve online service providers and retailers as a basis to assess the potential value of investing (more) into recommender systems technology. We will furthermore summarize the outcomes of scientific studies which aim to assess to what extent algorithmic improvements in terms of prediction accuracy lead to a better quality perception or higher adoption by users. Finally, we discuss possible implications of our survey for industry and academia.
¹ Related-item recommendations like Amazon's "Customers who bought . . . also bought" are an example of adaptive suggestions that are mainly based on the navigational context.
2 WHAT WE KNOW ABOUT THE BUSINESS VALUE OF RECOMMENDERS
2.1 General Success Stories
Companies usually do not publicly share the exact details about how they profit from the use of recommendation technology and how frequently recommendations are adopted by their customers. Certain indications are, however, sometimes provided in the form of research papers or blog posts. Netflix, for example, disclosed in a blog post [5] that "75 % of what people watch is from some sort of recommendation", and YouTube reports that 60 % of the clicks on the home screen are on the recommendations [20]. In another, later report on the system designed at Netflix [27], the authors reveal that recommendations led to a measurable increase in user engagement and that the personalization and recommendation service helped to decrease customer churn by several percentage points over the years. As a result, they estimate the business value of recommendation and personalization at more than 1 billion US dollars per year.² Another number that is often reported in online media³ is that, according to a statement by Amazon's CEO in 2006, about 35 % of their sales originate from cross-sales (i.e., recommendations).
These examples illustrate the huge potential of personalization and recommendation. The fact that many successful online sites devote a major part of the available screen space to recommendations (e.g., the top screen area on YouTube [20] and almost the entire screen at Netflix) is another indicator of the substantial underlying business value. In the following, we summarize findings that were reported in the literature to draw a more detailed picture, e.g., in terms of the business value that recommendations achieve and how this is measured. Since not all forms of measurements might be similarly useful for each domain, we will later on discuss why it is important to interpret the observed findings with care.
2.2 What is Being Measured and What are the Reported Effects?
Figure 1 shows an overview of the main measurement approaches found in the literature, which we discuss next.
Fig. 1. Overview of Measurement Approaches: Click-Through Rates, Adoption and Conversion, Sales and Revenue, Effects on Sales Distributions, and User Engagement and Behavior.
2.2.1 Click-through rates. With the click-through rate (CTR), we measure in some form how many clicks are garnered by the recommendations. The underlying assumption is that more clicks on the recommended items indicate that the recommendations were more relevant for the users. The CTR is a very common measure in news recommendation. In an early paper by Das et al. [19] on Google's news personalization engine, the authors found that personalized recommendations led to an average increase in clicks of 38 % compared with a baseline that only recommends popular items. On some days, however, in case of highly attractive celebrity news, the baseline actually performed better. Different personalization algorithms were tested, but no clear winner was identified.
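To make the measurement concrete, the following minimal sketch computes a CTR from a simple impression-and-click log. The log format and field names are hypothetical and not taken from any of the cited systems; deployments differ, for example, in whether they count clicked impressions or clicks per displayed item.

```python
from dataclasses import dataclass

@dataclass
class ImpressionEvent:
    """One display of a recommendation list to a user (hypothetical log format)."""
    user_id: str
    recommended_items: tuple      # items shown in the widget
    clicked_items: tuple = ()     # subset of recommended_items that was clicked

def click_through_rate(events):
    """CTR = impressions with at least one click / total impressions.

    Other deployments count clicks per displayed item instead; both variants
    appear in the literature, so the exact definition has to be reported.
    """
    impressions = len(events)
    clicked = sum(1 for e in events if e.clicked_items)
    return clicked / impressions if impressions else 0.0

log = [
    ImpressionEvent("u1", ("a", "b", "c"), ("b",)),
    ImpressionEvent("u2", ("a", "d", "e")),
    ImpressionEvent("u3", ("c", "f", "g"), ("c", "g")),
]
print(f"CTR: {click_through_rate(log):.2f}")  # 2 of 3 impressions received a click
```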
Kirshenbaum et al. [48] later on reported on a live experiment at Forbes.com. The best-performing method in their live study was a hybrid content-based, collaborative system, leading to a 37 % increase in CTR over a time-decayed popularity-based baseline. Interestingly, this trivial popularity-based baseline was among the best methods in their live trial.
² How these numbers were estimated is, unfortunately, not specified in more detail. The total revenue of Netflix in 2017 was at about 11.7 billion dollars, with a net profit of 186 million dollars.
³ See, e.g., https://www.mckinsey.com/industries/retail/our-insights/how-retailers-can-keep-up-with-consumers
In [59], Liu et al. also experimented with a content-based, collaborative hybrid for Google News recommendations. One particularity of their method was that it considered "local trends" and thus the recent popularity of the items. According to their experiments based on live traffic on the site, considering local trends helped to increase the CTR compared to the existing method [19] by around 30 % for a subgroup of relatively active users. However, the experiments also showed that the improved recommendations "stole" clicks from other parts of the Google News page and the algorithmic improvement thus did not lead to more clicks on the overall site.
Instead of considering only community trends, Garcin et al. [26] specifically consider the recent interests of individual, anonymous users in the recommendation process.⁴ They compared their Context Tree (CT) method both with a random recommender and with one that suggests the most popular items. Interestingly, the random recommender was better in terms of the CTR than the "Most Popular" recommender. The CT recommender turned out to be beneficial mostly for longer user sessions, where it led to a CTR increase of about 35 %.
⁴ This setting corresponds to a session-based recommendation scenario [66].
Besides news, the CTR was also used in a number of other application domains, including research article recommendation [9–11], community recommendation on social networks [74], or video recommendation on YouTube [20]. In the latter case, the authors report an increase of over 200 % in terms of the CTR when they used a comparably simple algorithm based on co-visitation instead of an approach that recommends the most viewed items.
In the special problem setting of "similar item recommendations", i.e., the recommendation of items in the context of a reference item, researchers at eBay have repeatedly reported on CTR improvements that were obtained through better algorithms [13, 16, 46, 47]. In [46], for example, a 38 % CTR increase was observed in comparison to a simple title-based recommendation method; in [47], a 36 % improvement in terms of the CTR was obtained for the "related-item recommendations" at the post-purchase page at eBay via a co-purchase mining approach. In [13], finally, only a minor increase in CTR (of about 3 %) was observed when applying a novel ranking method instead of a manually tuned linear model. Nevertheless, the model led to stronger increases in revenue (6 %) in the test period.
2.2.2 Adoption and Conversion Rates. Differently from online business models based on advertisements, click-through rates are typically not the ultimate success measure to target in recommendation scenarios. While the CTR is able to measure user attention or interest, it cannot convey, e.g., whether users really liked the recommended news article they clicked on or if they purchased an item whose product details they inspected based on a recommendation.
Therefore, alternative adoption measures are often used that are supposed to be better suited to gauge the usefulness of the recommendations and which are often based on domain-specific considerations. YouTube, as reported in [20], uses the concept of "long CTRs", where clicks on recommendations are only counted if the user subsequently watched a certain fraction of a video [20]. Similarly, Netflix uses the "take-rate" as a measure which captures in how many cases a video or movie was actually played after being chosen from a recommendation [27]. According to their experiments, increases of the take-rate due to the deployment of a personalized strategy are substantial when compared to recommendations based on popularity. Unfortunately, no detailed numbers are reported in [20] and [27] in that respect.
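As an illustration of how such adoption measures differ from a plain CTR, the sketch below contrasts a "long-CTR"-style measure, which only counts clicks followed by a minimum watch fraction, with a "take-rate"-style measure that counts plays per recommendation impression. The 30 % watch threshold and the event format are assumptions for illustration; neither YouTube nor Netflix disclose their exact definitions.

```python
def long_ctr(events, min_watch_fraction=0.3):
    """Fraction of impressions whose click led to watching at least
    `min_watch_fraction` of the video (threshold is an assumption)."""
    impressions = len(events)
    qualified = sum(
        1 for e in events
        if e.get("clicked") and e.get("watch_fraction", 0.0) >= min_watch_fraction
    )
    return qualified / impressions if impressions else 0.0

def take_rate(events):
    """Fraction of impressions after which the recommended item was actually played."""
    impressions = len(events)
    played = sum(1 for e in events if e.get("played"))
    return played / impressions if impressions else 0.0

log = [
    {"clicked": True, "watch_fraction": 0.80, "played": True},
    {"clicked": True, "watch_fraction": 0.05, "played": True},   # clickbait-like: clicked, barely watched
    {"clicked": False, "watch_fraction": 0.0, "played": False},
]
print(f"long CTR:  {long_ctr(log):.2f}")   # only one click passes the watch threshold
print(f"take rate: {take_rate(log):.2f}")
```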
In domains where the items cannot be directly consumed (read, viewed, or listened to), other business-related adoption measures are common. Examples include the "purchase-through" or "bid-through" rate on eBay [16], as well as the "link-through" or "cite-through" rate for research paper recommendations [9], or the number of "click-out" events to external partners in online marketplaces [56].
A longitudinal A/B testing phase of a new similar-item algorithm at eBay [16], for example, showed that the new system led to a bid-through rate between about 3.3 % and 9 %, depending on the product category. The purchase-through rate was between about 1.5 % and 3 %, measured at the same post-purchase (checkout) page. Overall, the authors conclude that their new system based on probabilistic clustering—if it would go live after six months of A/B testing and tuning—would lead to a 3-5 fold improvement over their current algorithm, which is a nearest-neighbor collaborative filtering method on the category level. In another field test at eBay [46], the experimenters report an 89 % increase of "add-to-wishlist" actions after introducing a new similar-item recommendation algorithm on the page that users see after losing an auction, compared to the previously used "naive" algorithm. On a different, much smaller marketplace for electronic gadgets, Lerche et al. [56] found that using alternative recommendation strategies can increase the "clickout" rate to an external marketplace by more than 250 %.
In a quite different application domain, people-to-people recommendation on online dating portals, Wobcke et al. [79] observed a significant increase in different domain-specific measures (e.g., "10.9 % lift in positive contacts per user" or "4.8 % lift in open communications per user") when a collaborative filtering method was applied instead of a baseline that matches explicit user profiles. In another people-to-people recommendation scenario, matchmaking on a job portal, the authors found that collaborative filtering helped improve their specific performance measure—the probability of a user contacting an employer after seeing a recommendation—by more than 200 % over a popularity-based baseline [79]. In the context of the LinkedIn platform, Bastian et al. [8] proposed a new method for skill recommendations. A field test showed that recommending a list of skills to add to the profile led to a higher rate of users who added skills (49 % vs. 4 %) compared to a manual input system with type-ahead. Note, however, that two different user interface approaches were compared in this field test. It is thus not fully clear how much of the gains can be attributed to the recommendation method and how much can be explained by the more convenient way of adding skills.
The number of contact requests was also the success measure of the deployed tourism recommender system described by Zanker et al. [85]. In their quasi-experimental design, users who interacted with a conversational recommender were twice as likely to issue an accommodation request through the tourism website than users who did not. In this context, it is, however, important to keep in mind that the users who decided to use the recommender might have had a more specific interest than others when they entered the site. Also for the travel and tourism domain, Kiseleva et al. [49] compared different strategies in a field test at Booking.com. Their experiment showed that a Naive Bayes ranking method led to a 4.4 % increase in conversion compared to the current online system. Note that in both mentioned studies from the tourism domain [49, 85], the recommendations are not based on long-term user profiles but on user preferences that are interactively collected before recommending.
2.2.3 Sales and Revenue. The adoption and conversion measures discussed in the previous section are, in many applications, more informative of the potential business value of a recommender than CTR measures alone. When users more often pick an item from a recommendation list which they later purchase or view, this is a good indicator that a new algorithm was successful in identifying items that are relevant to the user. Nonetheless, it remains difficult to assess how such increases in adoption translate into increased business value. A recommender might, in fact, make many suggestions to users that they would purchase anyway (see [12] for an analysis of this matter). The increase in business value might therefore be lower than what we can expect when looking at increases of adoption rates alone. Moreover, if the relevance of the recommendations was initially very low, i.e., almost nobody clicked on them, increasing the adoption rate even by 100 % might lead to very limited absolute extra value for the business.
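A small numerical example makes this last point explicit. The sketch below contrasts the absolute effect of the same relative lift under two assumed baseline adoption rates; all numbers are hypothetical.

```python
def absolute_extra_conversions(visitors, baseline_rate, relative_lift):
    """Extra adopted recommendations per period for a given relative lift."""
    new_rate = baseline_rate * (1.0 + relative_lift)
    return visitors * (new_rate - baseline_rate)

visitors = 1_000_000  # hypothetical monthly visitors
for baseline in (0.001, 0.05):  # 0.1 % vs. 5 % baseline adoption rate
    extra = absolute_extra_conversions(visitors, baseline, relative_lift=1.0)  # +100 %
    print(f"baseline {baseline:.1%}: +100 % relative lift adds {extra:,.0f} adoptions")
# baseline 0.1%: +100 % relative lift adds 1,000 adoptions
# baseline 5.0%: +100 % relative lift adds 50,000 adoptions
```

Doubling a tiny adoption rate thus remains a tiny absolute gain, which is why impressive relative improvements need to be read together with the baseline they refer to.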
Generally, there are various "business models" for recommenders, i.e., ways in which they help improve a business. Chen et al. [15], for example, list a number of effectiveness measures, including increased sales, fee-based sales through more transactions or subscriptions, and increased income from other types of fees; see also [32]. Unfortunately, only few papers report the effects of recommenders on such measures, partially because the data is confidential and partially because the effects cannot easily be isolated from each other. In the case of Netflix, for example, renewed subscriptions are a desired effect of recommenders, but with very low churn rates in general it is difficult to attribute differences in churn rates to changes in a recommender algorithm [27].
While today’s video streaming sites, like Netix, have atrate subscription models, there are
other sites where additional content can be purchased. Previously, such pay-per-view models
were more common, and Bambini et al. [
7
] investigate the business eects of a recommender for a
video-on-demand service. ey not only measured what they call the “empirical recall”, i.e., the
fraction of recommended movies that were later on watched, but also tried to assess the additional
video-on-demand sales induced by the recommender. However, because the recommender system
was deployed to the whole user base instead of only a small treatment group, the authors had
to gauge its performance by comparing the global number of video views in the weeks before
and aer the introduction of the recommender. ey nally estimate the li in sales obtained by
their content-based approach to be 15.5 %, aer smoothing out other factors such as marketing
campaigns with a moving average.
Also in the media domain, Lee and Hosanagar investigate the impact of recommenders on the sales of DVDs of an online retailer [54]. They tested both purchase-based and view-based collaborative filtering approaches and observed a 35 % lift in sales when the purchase-based version was compared with a "no recommendations" condition. The increases in sales were much less pronounced (and not statistically significant) when a view-based strategy was employed. Finally, they also observed that only recommending recently viewed items actually led to a slight decrease in overall sales. Differently from the findings in [56], reminders were therefore not directly helpful in terms of business value. In real-world e-commerce applications, where such reminders are common [39], they might more often represent convenient navigation shortcuts for users than additional sales stimulants.
Besides the movie and DVD domains, a number of success stories of recommender systems exist for more general e-commerce settings. One of the earliest reports that quantifies the effects of recommenders on business value focused on online grocery orders. In [53], the authors found through a pilot study that their revenue increased by 1.8 % through purchases that were made directly from the recommendation lists.
Dias et al. [22] also evaluated a recommender for an online grocery store. They observed an increase in direct revenue of only 0.3 % after deploying the system. However, they also discovered substantial indirect effects, with increases of up to 26 % for one category. It thus became obvious that the recommender was able to inspire or stimulate additional sales even though consumers did not pick the items from a recommendation list. A similar effect was also reported in [53], where the authors observed that the grocery recommender successfully guided customers to product categories that they had not considered before. More recently, the authors of [43] also detected such an "inspirational" effect of recommender systems in the music domain.
In the context of similar item recommendations at eBay, the authors of [13] report a 6 % improvement in terms of revenue when they field tested a novel method against a baseline linear model. Specifically, they proposed a two-stage approach, consisting of a candidate item retrieval and a subsequent ranking phase. The ranking model is based on logistic regression, where the purchase probabilities are computed based on observations from recommendations that were made in the past. While the reported improvement is significant, it is much lower than the one reported earlier in a similar context at eBay [16], where the authors observed an increase of the "Gross Merchandise Bought" measure of almost 500 % in the context of a specific part of the website. In general, however, it seems that such increases are only possible under certain circumstances, e.g., when the existing methods are not effective. The study reported in [16] also lasted only one week, and it is unclear if the system went into production.
An online book store was the evaluation environment of another early work presented in [71], where the authors compared two algorithmic approaches for next-item recommendations. The results showed that their new method led to 28 % more profit than when using a simple one; when they entirely removed the recommenders for one month, the revenue dropped by 17 %. However, the size of this effect could have also been influenced by additional factors like seasonal effects.
To what extent recommenders impact sales within an online marketplace for mobile phone games was analyzed in [33]. Here, the authors report the outcomes of a field test, where several algorithms were A/B tested for a number of weeks. The best method in their study, a content-based approach, led to an increase of sales of 3.6 % compared to the condition where no recommendations were provided. In the study, it turned out that the choice of the strategy should be made dependent on the user's navigational situation. While users might, for example, like content-based recommendations in general, these "more-of-the-same" suggestions are not helpful right after users have already purchased something. Therefore, even slightly higher increases in sales can be expected when the user's navigation context is considered.
Other field studies in the context of recommenders on mobile phones were also discussed in [76] and [73]. Tam and Ho [76] found that personalized offers led to about 50 % more ringtone downloads compared to randomized offers. Smyth et al. [73] observed a 50 % increase in user requests when the mobile portal was personalized, which in their case directly translated into revenue. The direct revenue boost through personalization was quantified for one provider as $15 million per year.
2.2.4 Effects on Sales Distributions. The discussions so far clearly show that personalized recommendations can strongly influence the behavior of users, e.g., how many items they buy. This influence can, however, not only mean that more items are bought; it might also result in the effect that different items are bought, due to the persuasive potential of recommenders [83]. Sellers might want to persuade customers to buy specific items for a variety of reasons. For example, to stimulate cross sales, recommendations can make customers aware of items from other categories that they might also be interested in or of items that complement their previously purchased items. A clothes retailer might, for example, want to branch out into the shoes business, at which point customers can be recommended the matching pair of shoes for every pair of pants they buy. However, recommendations can also be used to persuade users to choose a premium item that offers a higher revenue margin for the seller instead of a low budget item, in order to maximize per-category profits.
In [84], for example, the introduction of an interactive recommender for premium cigars led to a significant shift in consumers' purchasing behavior. Specifically, the personalized recommendations led to more purchases in the long tail, and the sales spectrum was no longer dominated by a few topsellers. A shift of sales distributions introduced by recommenders was also noticed in an early work by Lawrence et al. [53] in an online supermarket application.
The distribution of what users consume is also a relevant measure at Netflix [27]. The key metric here is called "Effective Catalog Size" and expresses the amount of catalog exploration by users. An analysis shows that in the presence of personalized recommendations, this exploration tendency strongly increases, and a shift away from the most popular items is observed. However, such a shift in the consumption distribution does not necessarily mean that there is more business value (e.g., more downloads, purchases, or clicks). In [59], for example, an improved news recommender stole clicks from other parts of the website, i.e., there was no increase in overall user activity.
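To give a rough idea of what such a catalog-exploration metric can look like, the sketch below uses one popularity-weighted formulation that equals 1 when all consumption concentrates on a single title and approaches the catalog size for a uniform consumption distribution. This formulation is an illustrative assumption and should not be taken as the exact production metric described in [27].

```python
def effective_catalog_size(play_counts):
    """Popularity-weighted measure of how spread out consumption is.

    With consumption shares p_1 >= p_2 >= ... >= p_N sorted descending,
    2 * sum(i * p_i) - 1 equals 1 if a single title receives all plays and
    N if plays are spread uniformly over N titles. Illustrative assumption,
    not necessarily the definition used in [27].
    """
    total = float(sum(play_counts))
    shares = sorted((c / total for c in play_counts), reverse=True)
    return 2.0 * sum(i * p for i, p in enumerate(shares, start=1)) - 1.0

print(effective_catalog_size([100, 0, 0, 0]))    # 1.0 -> one blockbuster dominates
print(effective_catalog_size([25, 25, 25, 25]))  # 4.0 -> uniform exploration of 4 titles
```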
A recent analysis of the effects of recommenders on sales diversity can be found in [54] and [55]. The underlying question is whether recommenders help to promote items from the long tail or if they—in particular when based on collaborative filtering—rather help to boost sales of already popular items. To that purpose, the authors of [55] conducted a randomized field experiment on the website of a North-American online retailer. The study revealed that the presence of a recommender actually led to a decrease in aggregate sales diversity, measured in terms of the Gini coefficient. While at the individual user level often more items in the catalog were explored, it turned out that similar users in the end explored the same kinds of products. Looking at niche items, recommender systems helped to increase item views and sales; but the increase of sales for popular products was even stronger, leading to a loss of market share of niche items.⁵
⁵ A simulation-based analysis of concentration effects can be found in [37]. The analysis indicates that the choice of algorithm determines the strength and direction of the effects.
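For reference, aggregate sales diversity in such studies is typically quantified with the Gini coefficient of the sales distribution over items, where 0 means perfectly even sales across items and values close to 1 mean that sales concentrate on very few items. The sketch below shows one standard way to compute it; the exact operationalization in [55] may differ.

```python
def gini(sales_per_item):
    """Gini coefficient of a non-negative sales distribution.

    Standard formula over the ascending-sorted values x_1 <= ... <= x_n:
    G = (2 * sum(i * x_i)) / (n * sum(x)) - (n + 1) / n
    """
    xs = sorted(sales_per_item)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n

print(round(gini([10, 10, 10, 10]), 2))  # 0.0  -> perfectly even sales
print(round(gini([0, 0, 0, 40]), 2))     # 0.75 -> sales concentrated on one item
```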
2.2.5 User Behavior and Engagement. In various application domains, e.g., media streaming [27], higher user engagement is considered to lead to increased levels of user retention, which, in turn, often directly translates into business value. Increased user activity in the presence of a recommender is reported in a number of real-world studies of recommender systems. Various measures are applied, depending on the application domain.
In the news domain, for example, two studies [26, 48] observed longer sessions when a recommender was in place. In [26], the visit lengths were 2.5 times higher when recommendations were shown on the page. In the context of mobile content personalization, Smyth et al. [73] report a 100 % increase in terms of user activity and more user sessions. For eBay's similar item recommendations, as discussed above, Katukuri et al. [46] found that users were more engaged in terms of "add-to-wishlist" events. More user actions with respect to citations and links to papers were observed for the research paper recommender discussed in [9].
In the domain of music recommendation, Domingues et al. [24] compared different recommendation strategies and found that a recommendation strategy that combines usage and content data (called Mix) not only led to higher acceptance rates but also to a 50 % higher activity level than the individual strategies in terms of playlist additions. The authors furthermore measured loyalty in terms of the fraction of returning users. They again found differences between the recommendation strategies and indications that acceptance rates, activity levels, and user loyalty are related.
In some papers, user activity is considered to be the most important performance measure. Spertus et al. [74], for example, measured how many users of the social network Orkut actually joined one of the recommended communities. In the case of LinkedIn, the authors of [81] report that user engagement was strongly increased when a new recommender for similar profiles was introduced. Their activity measures included both profile views and email messages exchanged between recruiters and candidates.
For the particular case of the community-answering platform Yahoo! Answers, Szpektor et al. [75] found that recommendations that were solely based on maximizing content similarity performed worse than a control group. However, after increasing the diversity of the recommendation lists, they observed a 17 % improvement in terms of the number of given answers and an increase of the daily session length by 10 %. These insights therefore support findings from existing research in the fields of Recommender Systems and Information Retrieval which stipulate that it can be insufficient to consider only the assumed relevance of individual items but not the diversity of the provided recommendation list as a whole [64, 89].
3 DISCUSSION
3.1 Challenges of Measuring the Business Value of Recommender Systems
3.1.1 Direct Measurements. Our review in Section 2 shows that there are various types of effects of recommender systems that can be measured. In some application domains, and in particular in e-commerce, the business value can be measured almost directly by tracking effects on sales or revenue that result from more sales or from shifts in the sales distributions caused by the recommender. In such cases, it is important to ensure that the choice of the measure is aligned with the business strategy. In some domains, increasing the sales volume (revenue) might be relatively easy to achieve by recommending currently discounted items [39] or by promoting low-cost, high-volume items through the recommendations. This might, however, not always be the best business strategy, e.g., for retailers that want to promote premium products with high profit margins.
But even in cases where the business value can be directly measured, A/B tests are usually only conducted for a limited period of time, e.g., for a number of weeks. Such time-limited tests are not able to discover longitudinal effects. While a field test might indicate that promoting already popular items is more beneficial than promoting long-tail items [54, 55], the recommendation of (at least some) long-tail items might have direct or indirect sales effects in the long run. Such effects can, for example, occur when customers discover additional item categories in a shop through the recommendations over time [22] or when customers later on switch to a paid version of a product that was originally recommended to them as a free trial [33].
3.1.2 Indirect Measurements. While click-through rates and certain forms of adoption rates measure in a direct way whether or not users click on the recommended items, they are—unless used in a pay-per-click scenario—in most cases not a true measure of business value. A high CTR for a news recommendation site might, for example, simply be achieved through clickbait, i.e., headlines that make users curious. In the long run, and possibly only some time after the A/B test, users might, however, not trust the recommendations anymore in case the items they clicked on were ultimately not relevant for them. Zheng et al. [87] investigate the problem of using CTRs as a success measure for recommendations on a media streaming site. Their analysis indicates that there can be a trade-off between the optimization of the CTR and the optimization of the ranking of the items according to their expected relevance for the users. As mentioned above, while recommending mostly popular items can lead to a higher CTR in many applications, such improvements are often an overestimation of the true value of the recommender [12].
Comparable effects can also arise when using certain types of adoption rates. When nearly everything on a web page is personalized or some form of recommendation, e.g., in the case of Netflix, users are likely to choose whatever is recommended to them due to a mere presence effect [50]. According to [4], Netflix is working on personalizing the covers (artwork) of their streaming content to persuade or convince users of its relevance for them. Counting only how often users start streaming such an item can therefore also be misleading, as this measure would include users who started playing the movie but did not enjoy it in the end. Consequently, one has to decide carefully when to consider such a recommendation a success.
In the domain of mobile game recommendation, the authors of [33] used click rates, conversion rates, and game downloads as business-related measures besides the sales volume. When comparing these measures, it turned out that neither item views nor even download counts were reliable predictors of business success. Some recommendation algorithms, for example, raised consumer interest but did not lead to downloads. In terms of the download counts, it furthermore became evident that some algorithms had a bias to promote often-downloaded games that could be used in a free trial (demo). How often users later on also purchased the paid version and how this affected sales in the longer term was not clear from the experiment.
In several other domains, it can be even more difficult to assess the business value of a recommender. In the case of flat-rate subscription models for media streaming services, for example, user engagement is typically considered to be correlated with customer retention. According to our discussions in Section 2, there is strong evidence that recommenders have positive effects on the amount of user activity, e.g., in terms of session lengths or site visits. In some cases, when customer retention is already high—like in the case of Netflix—obtaining significant improvements in customer retention can be difficult [27]. Depending on the domain, however, customer engagement can be a viable proxy for the business value of a recommender.
Overall, we can identify a number of challenges when it comes to measuring the business value of recommender systems. Table 1 shows a summary of some of our observations.
Table 1. Measurements to Assess the Value of Recommenders.
Measurement: Remarks
Click-Through Rates: Easy to measure and established, but often not the ultimate goal.
Adoption and Conversion: Easy to measure, but often requires a domain- and application-specific definition. Requires interpretation and does not always translate directly into business value.
Sales and Revenue: Most informative measure, but cannot always be determined directly.
Effects on Sales Distribution: A very direct measurement; requires a thorough understanding of the effects of the shifts in sales distributions.
User Engagement and Behavior: Often, a correspondence between user engagement and customer retention is assumed; still, it remains an approximation.
3.2 Algorithm Choice and the Value of Algorithmic Improvements
The reported improvements after deploying a new or modified recommender system vary largely according to our review in Section 2. One of the reasons for this phenomenon lies in the baseline with which the new system was compared. Sometimes, the improvements were obtained compared to a situation with no recommender [22, 53], sometimes the new system replaces a comparably simple or non-personalized (e.g., popularity-based) method [19, 71], and, in a few cases, more elaborate strategies are compared to each other [33, 71].
In many cases where business effects are directly measured, increases in sales between one and five percent are reported on average. The increases sometimes vary across different categories, e.g., more than 26 % for one category of an online grocery store. In one study [71], the authors also report a 17 % drop in sales when the recommendation component was removed for a week. Overall, these numbers seem impressive, given that a lasting increase in sales of only 1 % or even less can represent a substantial win in absolute numbers for a large business.
In papers that rely on click-through rates, different forms of adoption rates, or domain-specific measures, we can also often observe substantial increases, e.g., a 200 % CTR increase over a trivial baseline at YouTube [20], a 40 % higher email response rate at LinkedIn [68], or an increase of the number of answers by 17 % at Yahoo! Answers [75]. To what extent these strong increases of indirect measurements translate into business value is, however, not always fully clear and often difficult to estimate.
What can generally be observed is that—in many papers—algorithms of different types or families are compared in A/B tests, e.g., a collaborative filtering method against a content-based method, or a personalized method against a non-personalized one. This was, for example, done in [33], and the outcomes of this study show that the choice of the recommendation strategy (collaborative vs. content-based vs. non-personalized) does matter both in terms of sales and in terms of general user behavior. Such studies are, however, different from many offline experiments conducted in academic research, which typically benchmark algorithms of similar type, e.g., different matrix factorization variants or sometimes even only different loss functions for the same learning approach. Whether the often tiny accuracy improvements reported in such offline experiments translate into relevant business value improvements when deployed in real-world environments remains difficult to assess, as published industrial field tests rarely focus on such fine-grained comparisons between similar algorithmic approaches.
Finally, a general limitation of the discussed works is that the typical A/B tests reported here focus almost exclusively on the gains that can be obtained when different algorithms are used. The success of a recommender system can, however, depend on a number of other factors, including the users' trust in the recommender or the website as a whole [65, 80], the perceived transparency of the recommendations, and, most importantly, the user interface. Garcin et al. [26], for example, report a 35 % increase in CTR when they deployed a more sophisticated news recommendation method in a field test. However, at the end of the paper, they mention that they changed the position and size of the recommendation widget in an additional A/B test. This change, which was only at the presentation level, immediately led to an increase in CTR by 100 %. This suggests that, at least in some applications, it is more promising to focus on both the user experience and algorithmic improvements instead of investing only in better algorithms.
3.3 The Pitfalls of Field Tests
A/B tests, i.e., randomized controlled field tests, are usually considered the ultimate method of determining the effects on a user population caused by adding a recommender system to a website or improving an existing system, and large companies constantly test modifications to their service through such field tests [51]. A number of typical challenges of running such tests are discussed in [27] for the case of Netflix. In their case, A/B tests usually last for several months. The main metrics in their analyses are centered around customer retention and user engagement, which is assumed to be correlated with customer retention. To make sure that the observed differences of the retention metric are not just random effects, they apply statistical methods, e.g., to determine appropriate sample sizes.⁶ Despite this statistical approach, the authors of [27] report that interpreting the outcomes of A/B tests is not always trivial. In case of unexpected effects, they sometimes repeated the tests to find that the effect did not occur again.⁷
⁶ See [28] about the statistics behind A/B tests.
⁷ See also [51] for an analysis of surprising results of A/B tests at Microsoft Bing.
Generally, running reliable A/B tests remains difficult, even for large companies like Google, Microsoft, or Amazon, for various reasons [21, 51]. One fundamental challenge lies in the choice of the evaluation criterion, where there can be diametrical objectives when short-term or long-term goals are considered. An extreme case of such a phenomenon is reported in [51] in the context of field testing Microsoft's Bing search engine. One of the long-term evaluation criteria determined at the executive level was the "query share", i.e., the relative number of queries served by Bing compared to the estimated overall market. Due to a bug in the system, the search quality went down significantly during a test. This, however, led to a strong short-term increase in the number of distinct queries per user, as users needed more queries to find what they searched for. Clearly, in the long term, poor search quality will cause customers to switch to another search service provider. Similar effects can appear in the context of CTR optimization, as discussed above.
Another, more implementation-related challenge is that often large sample sizes are needed. Even small changes in revenue or customer retention, e.g., 0.5 %, can have a significant impact on business. Finding such effects with a certain confidence can, however, require a sample of millions of users. In many cases, it is also important to run tests for longer periods of time, e.g., several months, which slows innovation. Furthermore, running A/B tests with existing users can also be risky, and some companies therefore limit tests mostly to new users [27]. Consider, for example, a music recommender system that in the past mostly focused on relatively popular artists. A planned algorithmic change might aim at better discovery support by recommending newer artists more frequently. Users who were acquainted with the existing system might notice this change, and as they are less familiar with the recommendations by the new system, their quality perception might degrade, at least initially [36]. However, this initially poor user response does not necessarily mean that the new system will not be accepted in the long run.
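To illustrate why detecting small effects requires such large samples, the sketch below computes the per-group sample size of a standard two-sided two-proportion z-test. The baseline conversion rate of 2 %, the hypothetical 0.5 % relative lift, and the usual alpha = 0.05 / power = 0.8 settings are illustrative assumptions, not values taken from any of the cited studies.

```python
from scipy.stats import norm

def sample_size_two_proportions(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided two-proportion z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    p_bar = (p_control + p_treatment) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p_control * (1 - p_control)
                             + p_treatment * (1 - p_treatment)) ** 0.5) ** 2
    return numerator / (p_treatment - p_control) ** 2

# Detecting a 0.5 % relative lift on a 2 % conversion rate (hypothetical numbers)
n = sample_size_two_proportions(p_control=0.02, p_treatment=0.02 * 1.005)
print(f"required users per group: {n:,.0f}")  # roughly 31 million per group for these settings
```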
Existing research proposes a number of methods to deal with these challenges [14, 21, 51], e.g., in the context of web search, but it is unclear if smaller or less-experienced companies implement such measures in their field tests. These problems of potentially unreliable or misleading test outcomes also apply to the research works that are reviewed above in Section 2. In many of the discussed papers, the exact details of the conducted A/B tests are not provided. Sometimes, authors report how long the test was run and how many users were involved. While in several cases the evaluation period lasts for several months, there are cases where certain tests were run only for a few days or weeks (after piloting and optimization) [16, 33, 55, 59, 75]. Regarding sample sizes, often only a few hundred or around one thousand users were involved [10, 73, 85]. In almost all surveyed cases, an analysis of the required sample size and detailed statistical analyses of the A/B tests were missing. It can therefore not always be concluded with certainty that the reported outcomes are based on large enough samples or that they are not influenced by short-term effects (see [51]).
3.4 The Challenge of Predicting Business Success from Offline Experiments
Given the complexity and cost of running field tests, the common approach in academic research is to conduct offline experiments on historical data. The most common evaluation method is to hide some of the available information from such a dataset and use the remaining data to learn a model to predict the hidden data, e.g., a user rating or other user actions like clicks, purchases, or streaming events.
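As a minimal sketch of this hide-and-predict protocol, the example below uses a leave-one-out split (each user's most recent interaction is hidden), "trains" a non-personalized popularity model on the rest, and reports the hit rate of top-n recommendations. The toy interaction data, the popularity baseline, and hit rate@2 as the metric are illustrative assumptions standing in for whatever dataset, model, and measure a study actually uses.

```python
from collections import Counter

# Toy interaction logs: one chronologically ordered item list per user (hypothetical data)
logs = {
    "u1": ["a", "b", "c"],
    "u2": ["b", "c", "d"],
    "u3": ["a", "c", "b"],
}

# Leave-one-out split: hide each user's most recent interaction as the test item
train = {u: items[:-1] for u, items in logs.items()}
test = {u: items[-1] for u, items in logs.items()}

# "Train" a non-personalized popularity model on the remaining interactions
popularity = Counter(item for items in train.values() for item in items)

def recommend(user, n=2):
    """Top-n most popular items the user has not interacted with in the training data."""
    seen = set(train[user])
    ranked = [item for item, _ in popularity.most_common() if item not in seen]
    return ranked[:n]

# Hit rate@n: fraction of users whose hidden item appears in their top-n list
hits = sum(1 for u, hidden in test.items() if hidden in recommend(u))
print(f"hit rate@2: {hits / len(test):.2f}")
```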
This research approach has a number of known limitations. Publicly available datasets, for example, often contain no business-related information, e.g., about prices or profits, making the assessment of the business value of a recommender difficult. Additionally, it is, unfortunately, often not clear under which circumstances the data was collected. For datasets that contain user interaction logs (e.g., e-commerce transactions or listening logs on music sites), the data can be biased in different ways [40, 48], e.g., by an already existing recommender on the site, by the order in which items were presented, or by promotions that were launched during the data collection period. Evaluations that are based on such logs might lead to wrong or biased conclusions, such as an over- or underestimation of a recommender's effectiveness.
3.4.1 Limitations of Accuracy as a Proxy for Business Value. Besides problems related to the underlying data, it is often not fully clear to what extent the abstract accuracy measures used in typical offline experiments (like RMSE, precision, or recall) are correlated with the business success of a recommender. Intuitively, an algorithm that is better than another at predicting whether a user will like a certain item should lead to better or more relevant recommendations. Whether this also leads to increased business value is, however, not always clear. Users might, for example, rate items highly when they try to give an objective opinion online. Yet, they might not want to purchase similar items in the future, because, subjectively, the item does not satisfy them. As a result, it might have been better to make a riskier recommendation, which might lead to additional sales.
Gomez-Uribe and Hunt [27] discuss the general challenges of offline experiments at Netflix and, as mentioned above, conclude that they are not always indicative of online success. In fact, a number of research works exist that compare algorithms both in field tests and in offline tests or user studies. Surprisingly, in the majority of these attempts, the most accurate offline models neither led to the best online success nor to a better accuracy perception [9, 17, 25, 26, 34, 61, 63, 69].
Only a few works report that offline experiments were predictive of what was observed in an A/B test or a user study, e.g., [13, 18, 42]. This particular problem of offline experiments is, however, not limited to recommender systems and can also be observed in other application areas of machine learning, e.g., click prediction in advertising. The authors of [82], for example, discuss problems of measures such as the AUC and propose an alternative evaluation approach. Overall, it remains to be shown through more studies that small improvements in offline accuracy measurements—as commonly reported in academic papers—actually have a strong effect on business value in practice. This is in particular important as studies show that even algorithms with similar offline accuracy performance can lead to largely different recommendations in terms of the top-n recommended items [37]. The work in [25] also indicates that methods that lead to good RMSE values can result in recommendations that are perceived to be rather obscure by users (even though they might actually be relevant). This might, in fact, be a reason why Netflix uses "a pretty healthy dose of (unpersonalized) popularity" in their ranking method [27].
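As an illustration of how such a popularity component can be combined with personalized scores, the sketch below linearly blends a hypothetical personalized relevance score with a normalized popularity score. The blending weight, the scores, and the item names are made up; this is not meant to reflect Netflix's actual ranking function.

```python
def blended_ranking(personalized_scores, popularity_counts, alpha=0.9):
    """Rank items by alpha * personalized score + (1 - alpha) * normalized popularity.

    alpha = 1.0 yields a purely personalized ranking; lower values mix in an
    unpersonalized popularity prior. The weight is an illustrative assumption.
    """
    max_pop = max(popularity_counts.values())
    def score(item):
        pop = popularity_counts.get(item, 0) / max_pop
        return alpha * personalized_scores.get(item, 0.0) + (1 - alpha) * pop
    return sorted(personalized_scores, key=score, reverse=True)

personalized = {"niche_doc": 0.9, "blockbuster": 0.7, "new_release": 0.6}  # hypothetical model output
popularity = {"blockbuster": 10_000, "new_release": 3_000, "niche_doc": 50}
print(blended_ranking(personalized, popularity, alpha=0.9))  # personalized score dominates: niche item first
print(blended_ranking(personalized, popularity, alpha=0.5))  # popularity prior pulls the blockbuster to the top
```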
On a more general level, most works on recommender system algorithms can be considered as research in applied machine learning. Therefore, they can suffer from certain limitations of today's research practice in this field [58] and in particular from the strong focus on aiming to "win" over existing methods in terms of individual (accuracy) measures [57, 78]. In this context, it can happen that improvements that are reported in the academic literature over several years "don't add up", as shown already in 2009 in [6] for the Information Retrieval domain. Similar observations were made more recently for improvements that were attributed to deep learning techniques, where indications were found that sometimes long-established and comparably simple methods, when properly tuned, can outperform the latest algorithms based on deep learning techniques [57, 60].
3.4.2 Beyond-Accuracy Measures: Novelty, Diversity, Serendipity, and Coverage. In the area of recommender systems, it has been well established for many years that optimizing for prediction accuracy "is not enough" [64] and that several other quality factors should be considered in parallel. Recommendations should, for example, have some level of novelty to help users discover something new, or should be diversified to avoid monotonous recommendations of items that are too similar to each other. Correspondingly, a number of metrics were proposed to measure these quality factors, e.g., by quantifying diversity based on pair-wise item similarities or by determining novelty based on item popularity [44, 77]. Likewise, various algorithmic proposals were made to balance accuracy with these quality factors at the global [86, 88] or individual [41] level, as there usually exists a trade-off situation.
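To make these metrics concrete, the sketch below computes two commonly used variants: intra-list diversity as the average pairwise distance between recommended items (here based on genre overlap) and novelty as the average self-information of the recommended items' popularity. The item features, play counts, and the specific distance function are illustrative assumptions; the cited works define several variations.

```python
import math
from itertools import combinations

def jaccard_distance(a, b):
    """1 - Jaccard similarity of two feature sets (e.g., genre sets)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def intra_list_diversity(rec_list, item_features):
    """Average pairwise distance between all items in the recommendation list."""
    pairs = list(combinations(rec_list, 2))
    return sum(jaccard_distance(item_features[i], item_features[j]) for i, j in pairs) / len(pairs)

def novelty(rec_list, play_counts, total_plays):
    """Average self-information -log2(popularity share) of the recommended items."""
    return sum(-math.log2(play_counts[i] / total_plays) for i in rec_list) / len(rec_list)

features = {"i1": {"action"}, "i2": {"action", "comedy"}, "i3": {"drama"}}
plays = {"i1": 5000, "i2": 800, "i3": 20}

rec = ["i1", "i2", "i3"]
print(f"diversity: {intra_list_diversity(rec, features):.2f}")
print(f"novelty:   {novelty(rec, plays, total_plays=sum(plays.values())):.2f}")
```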
In the real-world applications described in Section 2, different measurements are directly or indirectly related to such beyond-accuracy metrics. Catalog coverage is, for example, considered a direct quality measure in the video streaming domain. Furthermore, being able to make recommendations that are novel, diverse, and relevant at the same time can help to better leverage the long-tail item spectrum and to point consumers to other parts of the catalog, thereby increasing profit or sales diversity in e-commerce settings. Similarly, serendipitous and diversified recommendations might often lead to higher levels of user engagement and customer retention in other domains.
In some ways, beyond-accuracy metrics therefore have the potential to narrow the gap between offline experimentation and field tests, as they enable a finer-grained and multi-faceted assessment of the recommendations that are generated by an algorithm [70]. More research is, however, still required. For example, for many beyond-accuracy measures used in the literature, e.g., for intra-list diversity [89], it is not always fully clear to what extent they correlate with the actual user perception. Similar challenges exist for novelty and serendipity measures. On the other hand, little is known about how diversity and novelty aspects are considered within algorithms in real-world applications. In the studies reviewed in this survey, we can observe that business-oriented measurements are made that have a strong relation to beyond-accuracy quality factors, but usually no details are provided on how, e.g., diversification is actually ensured algorithmically.
3.4.3 Predicting Effects and Business Value. Ultimately, the holy grail in the context of offline experimentation is to find proxy measures that correlate well with the different forms of business success measures. So far, it seems that achieving this goal remains challenging for different reasons. On the one hand, practical success measures are often very specifically tailored to the application domain or even to the business model. On the other hand, academic researchers usually aim to abstract from domain specifics and to develop generalizable solutions that are applicable to many domains.
Currently, our knowledge is limited to certain general tendencies of algorithm families. Content-based techniques, for example, can, by design, lead to limited discovery effects, as they aim to retrieve the most similar items from the catalog. Collaborative filtering techniques, on the other hand, are often more suited to make serendipitous recommendations, but these recommendations might also be more "risky". Furthermore, within the family of collaborative approaches, there are some techniques like Bayesian Personalized Ranking [67] that have a tendency to recommend already popular items, whereas certain matrix factorization techniques also recommend niche or almost obscure items [25]. More research in terms of understanding "what recommenders recommend" [37] and how a recommender might affect consumer behavior and business value is therefore needed. Characterizing an algorithm only with abstract quality measures—even if including beyond-accuracy measures—seems insufficient as long as the implications for practical applications are not considered. Generally, this calls for a richer methodological repertoire, which should, for example, also consider simulation experiments and alternative ways of assessing business value; see also [32].
4 IMPLICATIONS
4.1 Implications for Businesses
Our survey of real-world deployments of recommender systems in Section 2 shows that there are many cases where such systems substantially contribute to the success of a business. These systems either help to increase revenue or profit directly, or they lead to indirect positive effects such as higher user engagement, loyalty, and customer retention. Overall, there is ample evidence that recommenders can have a strong impact on user behavior and can therefore represent a valuable tool for businesses, e.g., to steer consumer demand. Nonetheless, the expected size of the impact depends strongly on the specific situation and the used measurements. While there are reports that recommenders lead to 35 % of additional revenue through cross-sales in the case of Amazon, direct revenue increases are more often reported to lie between one and five percent, which can also be substantial in absolute numbers.
Generally, our review shows that measuring the value of a recommender system is not trivial. Even when revenue or profit can be captured directly in A/B tests, there might be longitudinal effects that are difficult to assess in advance. In many cases, however, indirect measurements have to be used, e.g., by approximating customer retention through user engagement. In such situations, it is important to ensure that the underlying assumptions are thoroughly validated in order to be certain that we do not optimize for the wrong objective. Overall, the choice of the evaluation criterion is one of the most crucial aspects in practical deployments of recommenders. Click-through rates are often used as the measure of choice, partly because they are easy to acquire, but many reports show that CTR measurements can be misleading and do not capture business value well. To avoid such problems, it is therefore necessary to make sure that the strategic or operational objectives of the business are considered when designing a recommendation algorithm and when evaluating its effect, e.g., by using the purpose-oriented framework from [31].
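The CTR pitfall discussed above can be made tangible with a small evaluation sketch. The code below analyzes hypothetical per-user logs of an A/B test and reports the uplift of both a click-based and a revenue-based measure with a simple Welch t-test; in this made-up scenario the new variant wins on clicks but not on revenue. The data, distributions, and effect sizes are purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-user outcomes of an A/B test (5,000 users per group).
control_clicks = rng.poisson(2.0, 5000)
variant_clicks = rng.poisson(2.2, 5000)
control_revenue = rng.exponential(10.0, 5000)
variant_revenue = rng.exponential(10.1, 5000)

def report(name, control, variant):
    """Relative uplift of the group means plus a two-sample Welch t-test."""
    uplift = variant.mean() / control.mean() - 1.0
    _, p = stats.ttest_ind(variant, control, equal_var=False)
    print(f"{name}: uplift = {uplift:+.1%}, p = {p:.3f}")

report("Clicks per user ", control_clicks, variant_clicks)
report("Revenue per user", control_revenue, variant_revenue)
```

Deciding in advance which of these measures reflects the actual business objective, and powering the test accordingly, is exactly the choice of evaluation criterion discussed above.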
4.2 Implications for Academic and Industrial Research
Our work also has implications for academic and industrial research. The surveyed literature indicates that substantial improvements in terms of business value can be obtained when alternative or improved algorithms are put into production. Often, however, these improvements are achieved by applying an alternative strategy (e.g., personalized vs. non-personalized or content-based vs. collaborative vs. hybrid). Only in a few cases are smaller variations of existing approaches reported to lead to relevant impacts. Such smaller variations, e.g., a change of the loss function of a machine learning approach, are, however, very common in academic research, and it remains particularly unclear if marginal improvements on abstract measures like RMSE translate into more effective recommendations. As stated above, it has been argued that prediction accuracy is only one of several factors that determine a recommender system’s effectiveness [64]. User interface design choices, in contrast, can have a much larger impact on the success of a recommender than even major algorithmic changes [26] and should therefore receive more attention in academic research [38, 52].
Another avenue for future research lies in the consideration of the impact of recommender systems on different stakeholders. Current research focuses mostly on the consumer perspective, but in reality there can be trade-offs between the objectives of consumers, the recommendation platform, manufacturers, retailers, and service providers [2, 31, 32]. Academic papers, for example, rarely address questions such as how retailers can use recommendations to persuade users to buy more expensive items without losing their trust, or how item manufacturers can be harmed by biased recommendation strategies.
Despite their limitations, offline evaluation procedures and abstract, domain-independent computational measures will remain relevant in the future to compare different algorithms. However, a number of research opportunities also exist in this context, e.g., in terms of the development of new offline evaluation procedures that lead to a more realistic assessment of the value of different recommendation strategies. Following the argumentation from [58], researchers should focus more on investigating why a given strategy led to certain effects than on merely reporting how they obtained an improvement. Consider, for example, a new algorithm that leads to higher recall values in an offline experiment, which could be a desirable property for the given system. However, these higher recall values could be the result of an increased tendency of the new algorithm to recommend mostly popular items [35]. Such popularity-biased recommendations can also be undesirable from a business perspective because they limit discovery. In contrast, recommending overly novel or unpopular items might be similarly detrimental to the user’s quality perception of
the recommender [25]. Overall, it is important to consider such underlying, generalizable theories from the literature, as well as domain specifics, when analyzing the outcomes of offline experiments.
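One way to apply such a theory-guided analysis to the recall example above is to break the offline metric down by item popularity. The sketch below, using invented data structures, computes hit rates separately for head (most popular) and tail items, which reveals whether an apparent accuracy gain mainly comes from recommending blockbusters.

```python
def recall_by_popularity(recommended, relevant, item_popularity, head_share=0.2):
    """Hit rate on withheld items, split into 'head' (popular) and 'tail' items.

    recommended: dict user -> list of recommended item ids
    relevant:    dict user -> set of withheld (relevant) item ids
    item_popularity: dict item id -> interaction count in the training data
    """
    ranked = sorted(item_popularity, key=item_popularity.get, reverse=True)
    head = set(ranked[: max(1, int(len(ranked) * head_share))])

    hits = {"head": 0, "tail": 0}
    totals = {"head": 0, "tail": 0}
    for user, rel_items in relevant.items():
        recs = set(recommended.get(user, []))
        for item in rel_items:
            bucket = "head" if item in head else "tail"
            totals[bucket] += 1
            hits[bucket] += int(item in recs)
    return {b: hits[b] / totals[b] if totals[b] else float("nan") for b in hits}

# Hypothetical toy data: the algorithm only retrieves the popular withheld item.
popularity = {"i1": 500, "i2": 300, "i3": 20, "i4": 5, "i5": 2}
recommended = {"u1": ["i1", "i2"], "u2": ["i1", "i3"]}
relevant = {"u1": {"i1", "i4"}, "u2": {"i5"}}
print(recall_by_popularity(recommended, relevant, popularity))
```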
Another, somewhat surprising observation of our review was that we could not identify any research work that aims to assess the quality perception and helpfulness of a deployed recommender system through user satisfaction and user experience surveys. Such surveys are a very common instrument in practice to obtain feedback from real users and to improve the quality of a given service or website. Unlike in the field of Computer Science, which dominates parts of the research landscape, surveys are a relatively common tool in Information Systems research to identify factors that contribute to user satisfaction, see, e.g., [29, 30, 45, 72]. In many cases, such surveys are based on standardized questionnaires, built around factors such as information accuracy, ease of use, or timeliness of the results, that aim to identify the strengths and weaknesses of a proposed system that might affect its user experience [1, 23, 62]. Clearly, while such surveys do not allow us to measure business value directly, they can be valuable indicators of the acceptance of a system and of possible ways to improve the service. The lack of industrial reports on the outcomes of such surveys might have several reasons, e.g., that companies do not want to reveal challenges they faced when iteratively improving their systems. We, however, believe that such surveys represent a promising tool for researchers to understand the usefulness of recommender systems in practice.
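As a small illustration of how responses from such standardized questionnaires could be analyzed, the sketch below aggregates hypothetical 5-point Likert ratings into per-factor mean scores and a basic Cronbach's alpha reliability estimate. The factor names loosely follow the EUCS-style dimensions mentioned above, but the item-to-factor mapping and all numbers are invented.

```python
import numpy as np

# Hypothetical 5-point Likert responses (rows = respondents, columns = items).
responses = np.array([
    [5, 4, 4, 4, 4, 5],
    [4, 4, 3, 3, 3, 3],
    [3, 3, 5, 4, 4, 4],
    [5, 5, 2, 2, 2, 3],
])

# Invented mapping of questionnaire items to satisfaction factors.
factors = {"accuracy": [0, 1], "ease_of_use": [2, 3], "timeliness": [4, 5]}

def cronbach_alpha(items: np.ndarray) -> float:
    """Simple internal-consistency estimate for a block of items (columns)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

for name, cols in factors.items():
    block = responses[:, cols]
    print(f"{name}: mean = {block.mean():.2f}, alpha = {cronbach_alpha(block):.2f}")
```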
5 CONCLUSION
Our literature survey shows that recommender systems are one of the main success stories of artificial intelligence and machine learning in practice, often leading to huge benefits for businesses. Despite this success, there are still many opportunities for future research, which, however, often seem hampered by today’s predominant research approaches in academia.
The ultimate solution to many open issues might be to conduct more large-scale field tests in the context of industry-academia partnerships in the future. While it is difficult to achieve this long-term goal immediately, a number of opportunities identified throughout the paper could help us advance our field incrementally. As an alternative to individual collaborations with industry, public competitions could also serve as field tests, such as the CLEF NewsREEL challenge (http://www.clef-newsreel.org/), where recommendations generated by the participating academic teams are displayed to real users.
Besides field tests, we also see strong potential to advance the field by putting more emphasis on user-centric and impact-oriented experiments and on a richer methodological repertoire than we see today. Furthermore, there are still numerous opportunities to improve our current offline experimentation approaches. These include the increased adoption of multi-dimensional evaluation approaches, the consideration of generalizable theories when assessing experimental outcomes, and the use of alternative evaluation methods, e.g., based on simulation approaches. Given the links between academia and industry that are already established today, we can also expect that more real-world datasets will be published for research in the future, in particular ones that contain business-related information.
REFERENCES
[1] S. F. Abdinnour-Helm, B. S. Chaparro, and S. M. Farmer. Using the end-user computing satisfaction (EUCS) instrument to measure satisfaction with a web site. Decision Sciences, 36(2):341–364, 2005.
[2] H. Abdollahpouri, R. Burke, and B. Mobasher. Recommender systems as multistakeholder environments. In Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization, UMAP ’17, pages 347–348, 2017.
[3] H. Abdollahpouri, G. Adomavicius, R. Burke, I. Guy, D. Jannach, T. Kamishima, J. Krasnodebski, and L. Pizzato. Beyond personalization: Research directions in multistakeholder recommendation. ArXiv e-prints, 2019. URL https://arxiv.org/abs/1905.01986.
[4] F. Amat, A. Chandrashekar, T. Jebara, and J. Basilico. Artwork Personalization at Netflix. In Proceedings of the 12th ACM Conference on Recommender Systems, RecSys ’18, pages 487–488, 2018.
[5] X. Amatriain and J. Basilico. Netflix recommendations: Beyond the 5 stars. https://medium.com/netflix-techblog/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429, 2012.
[6] T. G. Armstrong, A. Moffat, W. Webber, and J. Zobel. Improvements that don’t add up: Ad-hoc retrieval results since 1998. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM ’09, pages 601–610, 2009.
[7] R. Bambini, P. Cremonesi, and R. Turrin. Recommender Systems Handbook, chapter A Recommender System for an IPTV Service Provider: a Real Large-Scale Production Environment, pages 299–331. Springer, 2011. Eds. Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B. Kantor.
[8] M. Bastian, M. Hayes, W. Vaughan, S. Shah, P. Skomoroch, H. Kim, S. Uryasev, and C. Lloyd. LinkedIn Skills: Large-scale Topic Extraction and Inference. In Proceedings of the 8th ACM Conference on Recommender Systems, RecSys ’14, pages 1–8, 2014.
[9] J. Beel and S. Langer. A comparison of offline evaluations, online evaluations, and user studies in the context of research-paper recommender systems. In Proceedings of the 22nd International Conference on Theory and Practice of Digital Libraries, TPDL ’15, pages 153–168, 2015.
[10] J. Beel, M. Genzmehr, S. Langer, A. Nürnberger, and B. Gipp. A comparative analysis of offline and online evaluations and discussion of research paper recommender system evaluation. In Proceedings of the International Workshop on Reproducibility and Replication in Recommender Systems Evaluation (RepSys) at RecSys 2013, pages 7–14, 2013.
[11] J. Beel, S. Langer, and B. Gipp. TF-IDuF: A novel term-weighting scheme for user modeling based on users’ personal document collections. In Proceedings of the iConference 2017, pages 452–459, 2017.
[12] A. V. Bodapati. Recommendation systems with purchase data. Journal of Marketing Research, 45(1):77–93, 2008.
[13] Y. M. Brovman, M. Jacob, N. Srinivasan, S. Neola, D. Galron, R. Snyder, and P. Wang. Optimizing similar item recommendations in a semi-structured marketplace to maximize conversion. In Proceedings of the 10th ACM Conference on Recommender Systems, RecSys ’16, pages 199–202, 2016.
[14] O. Chapelle, T. Joachims, F. Radlinski, and Y. Yue. Large-scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems, 30(1):6:1–6:41, 2012.
[15] P.-Y. Chen, Y.-C. Chou, and R. J. Kauffman. Community-based recommender systems: Analyzing business models from a systems operator’s perspective. In Proceedings of the 42nd Hawaii International Conference on System Sciences, HICSS ’09, pages 1–10, 2009.
[16] Y. Chen and J. F. Canny. Recommending ephemeral items at web scale. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’11, pages 1013–1022, 2011.
[17] P. Cremonesi, F. Garzotto, and R. Turrin. Investigating the persuasion potential of recommender systems from a quality perspective: An empirical study. ACM Transactions on Interactive Intelligent Systems, 2(2):11:1–11:41, 2012.
[18] P. Cremonesi, F. Garzotto, and R. Turrin. User-centric vs. system-centric evaluation of recommender systems. In Proceedings of the 14th International Conference on Human-Computer Interaction, INTERACT ’13, pages 334–351, 2013.
[19] A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: Scalable online collaborative filtering. In Proceedings of the 16th International Conference on World Wide Web, WWW ’07, pages 271–280, 2007.
[20] J. Davidson, B. Liebald, J. Liu, P. Nandy, T. Van Vleet, U. Gargi, S. Gupta, Y. He, M. Lambert, B. Livingston, and D. Sampath. The YouTube Video Recommendation System. In Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys ’10, pages 293–296, 2010.
[21] A. Deng, Y. Xu, R. Kohavi, and T. Walker. Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM ’13, pages 123–132, 2013.
[22] M. B. Dias, D. Locher, M. Li, W. El-Deredy, and P. J. Lisboa. The value of personalised recommender systems to e-business: A case study. In Proceedings of the 2008 ACM Conference on Recommender Systems, RecSys ’08, pages 291–294, 2008.
[23] W. J. Doll and G. Torkzadeh. The measurement of end-user computing satisfaction. MIS Quarterly, 12(2):259–274, 1988.
[24] M. A. Domingues, F. Gouyon, A. M. Jorge, J. P. Leal, J. Vinagre, L. Lemos, and M. Sordo. Combining usage and content in an online recommendation system for music in the long tail. International Journal of Multimedia Information Retrieval, 2(1):3–13, 2013.
[25] M. D. Ekstrand, F. M. Harper, M. C. Willemsen, and J. A. Konstan. User perception of differences in recommender algorithms. In Proceedings of the 8th ACM Conference on Recommender Systems, RecSys ’14, pages 161–168, 2014.
[26] F. Garcin, B. Faltings, O. Donatsch, A. Alazzawi, C. Bruttin, and A. Huber. Offline and online evaluation of news recommender systems at swissinfo.ch. In Proceedings of the 8th ACM Conference on Recommender Systems, RecSys ’14, pages 169–176, 2014.
[27] C. A. Gomez-Uribe and N. Hunt. The Netflix recommender system: Algorithms, business value, and innovation. Transactions on Management Information Systems, 6(4):13:1–13:19, 2015.
[28] B. Gumm. A/B Testing: The Most Powerful Way to Turn Clicks into Customers, chapter Metrics and the Statistics Behind A/B Testing, pages 180–193. Wiley, 2013. Eds. Dan Siroker and Pete Koomen.
[29] A. Ilias, N. B. M. Suki, M. R. Yasoa, and M. Z. A. Razak. The end-user computing satisfaction (EUCS) on computerized accounting system (CAS): How they perceived? Journal of Internet Banking and Commerce, 13(1):1–18, 2008.
[30] J. Jang, D. Zhao, W. Hong, Y. Park, and M. Y. Yi. Uncovering the underlying factors of smart TV UX over time: A multi-study, mixed-method approach. In Proceedings of the ACM International Conference on Interactive Experiences for TV and Online Video, TVX ’16, pages 3–12, 2016.
[31] D. Jannach and G. Adomavicius. Recommendations with a purpose. In Proceedings of the 10th ACM Conference on Recommender Systems, RecSys ’16, pages 7–10, 2016.
[32] D. Jannach and G. Adomavicius. Price and profit awareness in recommender systems. In Proceedings of the 2017 Workshop on Value-Aware and Multi-Stakeholder Recommendation (VAMS) at RecSys 2017, 2017.
[33] D. Jannach and K. Hegelich. A case study on the effectiveness of recommendations in the mobile internet. In Proceedings of the Third ACM Conference on Recommender Systems, RecSys ’09, pages 205–208, 2009.
[34] D. Jannach and L. Lerche. Offline performance vs. subjective quality experience: A case study in video game recommendation. In Proceedings of the ACM Symposium on Applied Computing, SAC ’17, pages 1649–1654, 2017.
[35] D. Jannach, L. Lerche, and M. Jugovac. Adaptation and evaluation of recommendations for short-term shopping goals. In Proceedings of the 9th ACM Conference on Recommender Systems, RecSys ’15, pages 211–218, 2015.
[36] D. Jannach, L. Lerche, and M. Jugovac. Item familiarity as a possible confounding factor in user-centric recommender systems evaluation. i-com Journal of Interactive Media, 14(1):29–39, 2015.
[37] D. Jannach, L. Lerche, I. Kamehkhosh, and M. Jugovac. What recommenders recommend: an analysis of recommendation biases and possible countermeasures. User Modeling and User-Adapted Interaction, 25(5):427–491, 2015.
[38] D. Jannach, P. Resnick, A. Tuzhilin, and M. Zanker. Recommender systems — Beyond matrix completion. Communications of the ACM, 59(11):94–102, 2016.
[39] D. Jannach, M. Ludewig, and L. Lerche. Session-based item recommendation in e-commerce: On short-term intents, reminders, trends, and discounts. User Modeling and User-Adapted Interaction, 27(3–5):351–392, 2017.
[40] T. Joachims, A. Swaminathan, and T. Schnabel. Unbiased learning-to-rank with biased feedback. In Proceedings of the 17th International Joint Conference on Artificial Intelligence, IJCAI ’17, pages 781–789, 2017.
[41] M. Jugovac, D. Jannach, and L. Lerche. Efficient optimization of multiple recommendation quality factors according to individual user tendencies. Expert Systems With Applications, 81:321–331, 2017.
[42] I. Kamehkhosh and D. Jannach. User Perception of Next-Track Music Recommendations. In Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization, UMAP ’17, pages 113–121, 2017.
[43] I. Kamehkhosh, D. Jannach, and G. Bonnin. How automated recommendations affect the playlist creation behavior of users. In Proceedings of the Workshop on Intelligent Music Interfaces for Listening and Creation (MILC) at IUI 2018, 2018.
[44] M. Kaminskas and D. Bridge. Diversity, serendipity, novelty, and coverage: A survey and empirical analysis of beyond-accuracy objectives in recommender systems. ACM Transactions on Interactive Intelligent Systems, 7(1):2:1–2:42, 2016.
[45] A. Karahoca, E. Bayraktar, E. Tatoglu, and D. Karahoca. Information system design for a hospital emergency department: A usability analysis of software prototypes. Journal of Biomedical Informatics, 43(2):224–232, 2010.
[46] J. Katukuri, T. Könik, R. Mukherjee, and S. Kolay. Recommending similar items in large-scale online marketplaces. In IEEE International Conference on Big Data 2014, pages 868–876, 2014.
[47] J. Katukuri, T. Könik, R. Mukherjee, and S. Kolay. Post-purchase recommendations in large-scale online marketplaces. In Proceedings of the 2015 IEEE International Conference on Big Data, Big Data ’15, pages 1299–1305, 2015.
[48] E. Kirshenbaum, G. Forman, and M. Dugan. A live comparison of methods for personalized article recommendation at forbes.com. In Machine Learning and Knowledge Discovery in Databases, pages 51–66, 2012.
[49] J. Kiseleva, M. J. Müller, L. Bernardi, C. Davis, I. Kovacek, M. S. Einarsen, J. Kamps, A. Tuzhilin, and D. Hiemstra. Where to Go on Your Next Trip? Optimizing Travel Destinations Based on User Preferences. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’15, 2015.
[50] S. Köcher, M. Jugovac, D. Jannach, and H. Holzmüller. New hidden persuaders: An investigation of attribute-level anchoring effects of product recommendations. Journal of Retailing, 95:24–41, 2019.
[51] R. Kohavi, A. Deng, B. Frasca, R. Longbotham, T. Walker, and Y. Xu. Trustworthy online controlled experiments: Five puzzling outcomes explained. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, pages 786–794, 2012.
[52] J. A. Konstan and J. Riedl. Recommender systems: from algorithms to user experience. User Modeling and User-Adapted Interaction, 22(1):101–123, 2012.
[53] R. Lawrence, G. Almasi, V. Kotlyar, M. Viveros, and S. Duri. Personalization of supermarket product recommendations. Data Mining and Knowledge Discovery, 5(1):11–32, 2001.
[54] D. Lee and K. Hosanagar. Impact of recommender systems on sales volume and diversity. In Proceedings of the 2014 International Conference on Information Systems, ICIS ’14, 2014.
[55] D. Lee and K. Hosanagar. How Do Recommender Systems Affect Sales Diversity? A Cross-Category Investigation via Randomized Field Experiment. Information Systems Research, 30(1):239–259, 2019.
[56] L. Lerche, D. Jannach, and M. Ludewig. On the value of reminders within e-commerce recommendations. In Proceedings of the 2016 ACM Conference on User Modeling, Adaptation, and Personalization, UMAP ’16, pages 27–35, 2016.
[57] J. Lin. The neural hype and comparisons against weak baselines. SIGIR Forum, 52(2):40–51, 2019.
[58] Z. C. Lipton and J. Steinhardt. Troubling Trends in Machine Learning Scholarship. ArXiv e-prints, 2018. URL https://arxiv.org/abs/1807.03341.
[59] J. Liu, P. Dolan, and E. R. Pedersen. Personalized news recommendation based on click behavior. In Proceedings of the 15th International Conference on Intelligent User Interfaces, IUI ’10, pages 31–40, 2010.
[60] M. Ludewig and D. Jannach. Evaluation of session-based recommendation algorithms. User Modeling and User-Adapted Interaction, 28(4–5):331–390, 2018.
[61] A. Maksai, F. Garcin, and B. Faltings. Predicting online performance of news recommender systems through richer evaluation metrics. In Proceedings of the 9th ACM Conference on Recommender Systems, RecSys ’15, pages 179–186, 2015.
[62] V. McKinney, Y. Kanghyun, and F. M. Zahedi. The measurement of web-customer satisfaction: An expectation and disconfirmation approach. Information Systems Research, 13(3):296–315, 2002.
[63] S. M. McNee, I. Albert, D. Cosley, P. Gopalkrishnan, S. K. Lam, A. M. Rashid, J. A. Konstan, and J. Riedl. On the recommending of citations for research papers. In Proceedings of the 2002 ACM Conference on Computer Supported Cooperative Work, CSCW ’02, pages 116–125, 2002.
[64] S. M. McNee, J. Riedl, and J. A. Konstan. Being accurate is not enough: How accuracy metrics have hurt recommender systems. In CHI ’06 Extended Abstracts on Human Factors in Computing Systems, CHI EA ’06, pages 1097–1101, 2006.
[65] M. Nilashi, D. Jannach, O. bin Ibrahim, M. D. Esfahani, and H. Ahmadi. Recommendation quality, transparency, and website quality for trust-building in recommendation agents. Electronic Commerce Research and Applications, 19:70–84, 2016.
[66] M. Quadrana, P. Cremonesi, and D. Jannach. Sequence-aware recommender systems. ACM Computing Surveys, 54:1–36, 2018.
[67] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, pages 452–461, 2009.
[68] M. Rodriguez, C. Posse, and E. Zhang. Multiple objective optimization in recommender systems. In Proceedings of the Sixth ACM Conference on Recommender Systems, RecSys ’12, pages 11–18, 2012.
[69] M. Rossetti, F. Stella, and M. Zanker. Contrasting offline and online results when evaluating recommendation algorithms. In Proceedings of the 10th ACM Conference on Recommender Systems, RecSys ’16, pages 31–34, 2016.
[70] A. Said, D. Tikk, Y. Shi, M. Larson, K. Stumpf, and P. Cremonesi. Recommender Systems Evaluation: A 3D Benchmark. In Proceedings of the 2012 Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE) at RecSys ’12, pages 21–23, 2012.
[71] G. Shani, D. Heckerman, and R. I. Brafman. An MDP-Based Recommender System. Journal of Machine Learning Research, 6:1265–1295, 2005.
[72] D.-H. Shin and W.-Y. Kim. Applying the technology acceptance model and flow theory to cyworld user behavior: Implication of the web2.0 user acceptance. CyberPsychology & Behavior, 11(3):378–382, 2008.
[73] B. Smyth, P. Cotter, and S. Oman. Enabling intelligent content discovery on the mobile internet. In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, AAAI ’07, pages 1744–1751, 2007.
[74] E. Spertus, M. Sahami, and O. Buyukkokten. Evaluating similarity measures: A large-scale study in the Orkut social network. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD ’05, pages 678–684, 2005.
[75] I. Szpektor, Y. Maarek, and D. Pelleg. When relevance is not enough: Promoting diversity and freshness in personalized question recommendation. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13, pages 1249–1260, 2013.
[76] K. Y. Tam and S. Y. Ho. Web personalization as a persuasion strategy: An elaboration likelihood model perspective. Information Systems Research, 16(3):271–291, 2005.
[77] S. Vargas and P. Castells. Rank and relevance in novelty and diversity metrics for recommender systems. In Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys ’11, pages 109–116, 2011.
[78] K. Wagstaff. Machine learning that matters. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, ICML ’12, pages 529–536, 2012.
[79] W. Wobcke, A. Krzywicki, Y. Sok, X. Cai, M. Bain, P. Compton, and A. Mahidadia. A deployed people-to-people recommender system in online dating. AI Magazine, 36(3):5–18, 2015.
[80] B. Xiao and I. Benbasat. E-commerce product recommendation agents: Use, characteristics, and impact. MIS Quarterly, 31(1):137–209, 2007.
[81] Y. Xu, Z. Li, A. Gupta, A. Bugdayci, and A. Bhasin. Modeling professional similarity by mining professional career trajectories. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pages 1945–1954, 2014.
[82] J. Yi, Y. Chen, J. Li, S. Sett, and T. W. Yan. Predictive model performance: Offline and online evaluations. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’13, pages 1294–1302, 2013.
[83] K.-H. Yoo, U. Gretzel, and M. Zanker. Persuasive Recommender Systems: Conceptual Background and Implications. Springer, 2012.
[84] M. Zanker, M. Bricman, S. Gordea, D. Jannach, and M. Jessenitschnig. Persuasive online-selling in quality and taste domains. In Proceedings of the 7th International Conference on E-Commerce and Web Technologies, EC-Web ’06, pages 51–60, 2006.
[85] M. Zanker, M. Fuchs, W. Höpken, M. Tuta, and N. Müller. Evaluating recommender systems in tourism — A case study from Austria. In P. O’Connor, W. Höpken, and U. Gretzel, editors, Proceedings of the 2008 International Conference on Information and Communication Technologies in Tourism, ENTER ’08, pages 24–34, 2008.
[86] M. Zhang and N. Hurley. Avoiding monotony: Improving the diversity of recommendation lists. In Proceedings of the 2008 ACM Conference on Recommender Systems, RecSys ’08, pages 123–130, 2008.
[87] H. Zheng, D. Wang, Q. Zhang, H. Li, and T. Yang. Do clicks measure recommendation relevancy?: An empirical user study. In Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys ’10, pages 249–252, 2010.
[88] T. Zhou, Z. Kuscsik, J.-G. Liu, M. Medo, J. R. Wakeling, and Y.-C. Zhang. Solving the apparent diversity-accuracy dilemma of recommender systems. Proceedings of the National Academy of Sciences, 107(10):4511–4515, 2010.
[89] C.-N. Ziegler, S. M. McNee, J. A. Konstan, and G. Lausen. Improving recommendation lists through topic diversification. In Proceedings of the 14th International Conference on World Wide Web, WWW ’05, pages 22–32, 2005.