
Recommending Videos in Cold Start With Automatic Visual Tags


Abstract

This paper addresses the so-called New Item problem in video Recommender Systems, as part of Cold Start. The New Item problem occurs when a new item is added to the system catalog and the recommender system has little or no data describing that item. This could cause the system to fail to meaningfully recommend the new item to the users. We propose a novel technique that can generate cold start recommendations by utilizing automatic visual tags, i.e., tags that are automatically annotated by deeply analyzing the content of the videos and detecting faces, objects, and even celebrities within the videos. The automatic visual tags do not need any human involvement and have been shown to be very effective in representing the video content. In order to evaluate our proposed technique, we have performed a set of experiments using a large dataset of videos. The results have shown that the automatically extracted visual tags can be incorporated into the cold start recommendation process and achieve superior results compared to the recommendation based on human-annotated tags.
Mehdi Elahi
University of Bergen
Farshad Bakhshandegan
University of Bonn
Reza Hosseini
Vaillant Group
Mohammad Hossein Rimaz
University of Passau
Nabil El Ioini
Free University of Bozen - Bolzano
Marko Tkalčič
University of Primorska
Christoph Trattner
University of Bergen
Tammam Tillo
Indraprastha Institute of Information Technology - Delhi
CCS Concepts: Information systems → Recommender systems; Computing methodologies → Visual content-based indexing and retrieval.
Keywords: Recommender Systems, Visual Tags, Visual Features, Cold Start
Conference’17, July 2017, Washington, DC, USA
©2021 Association for Computing Machinery.
ACM Reference Format:
Mehdi Elahi, Farshad Bakhshandegan Moghaddam, Reza Hosseini, Mohammad Hossein Rimaz, Nabil El Ioini, Marko Tkalčič, Christoph Trattner, and Tammam Tillo. 2021. Recommending Videos in Cold Start With Automatic Visual Tags. In Proceedings of ACM Conference (Conference’17). ACM, New York, NY, USA, 7 pages.
A major challenge in Recommender Systems is known as the New Item problem. This problem is part of a bigger challenge called the Cold Start problem, and it occurs when a new item is added to the item catalog and no rating has been provided by the users for that new item []. Content-Based Filtering (CBF) is a recommendation technique that can alleviate the cold start problem by using the item metadata (e.g., tags) to find items that have (content-wise) similarities and to recommend to a target user the items that are similar to those liked by the user in the past []. This technique can be used in a variety of application domains, including the video domain, where the video content is represented by high-level features (e.g., tags added to videos) and low-level features (e.g., colorfulness in videos) []. The former type of content features represents the semantics illustrated by the concepts and events happening within a video (e.g., Titanic 1997 annotated with the #LeonardoDiCaprio tag) []. The latter type of features, on the other hand, represents the stylistic aspects of videos defined by the aesthetic characteristics of the videos (e.g., Alice in Wonderland 2010 having a high value of colorfulness). Content-based video recommendation typically focuses on exploiting high-level features, which can be either manual (annotated by humans) or automatic (annotated by algorithms). While manual features can be informative descriptors of an item, they are typically either unavailable or expensive to collect. As an example, in the world’s biggest online video community (YouTube) the videos are often uploaded with no or very poor metadata [10].
In the cold start scenario, where a new item is added to the system catalog and no user has yet rated the new item or annotated it with any metadata, the recommender system may be unable to generate a relevant recommendation of the new item. An example is a video uploaded to a video-sharing platform that none of the users has yet rated or tagged. In many cases, even the video maker herself forgets to include a meaningful description when uploading the video. In any of the above cold start situations, even sophisticated recommendation algorithms may fail to make relevant recommendations.
This paper proposes a novel recommendation approach based on automatic visual tags. Such features are automatically identified and added to the video items using Deep Learning models []. Examples of visual tags are a set of tags automatically added to a video, representing objects, faces, and celebrities within that video. These visual tags can describe the high-level and semantic content of the video file (e.g., #AngelinaJolie in #Airplane), in contrast to visual features describing low-level and stylistic content (e.g., colorfulness and brightness). Our proposed visual tags are then used to generate content-based video recommendations for users and are compared against (manual) tags, which need human annotation and are not necessarily always available.
We have performed a number of experiments using a large dataset of 7,689 movie trailers in order to evaluate the quality of recommendation based on (automatic) visual tags. We used movie trailers since prior works have shown high visual similarity between the trailers and their corresponding full-length movies []. In these experiments, we have compared the recommendation based on “combined” high-level visual tags with the recommendation based on each “individual” type of them, i.e., Celebrity tags, Facial tags, and Object tags. We compared the quality of recommendations based on these (automatic) tags with recommendations based on (manual) tags as well as recommendations based on low-level features. The results have shown the effectiveness of our proposed visual tags in comparison to the recommendation based on (manual) tags and low-level features. To the best of our knowledge, this is the first attempt at generating (high-level) visual tags and comparing them against alternative (low-level) visual features, considering the (extreme) cold start situations when no other type of content data exists for the videos.
It is worth noting that we primarily focused on recommendation based on tags, as prior studies have shown the superior performance of tags in comparison to other types of content features (e.g., genre) []. Furthermore, using visual tags enables recommender systems to include explanations when presenting recommended videos to their users. Explanations may enhance the transparency of the system and result in higher user satisfaction []. This is hardly feasible with pure low-level visual features.
The main contributions of this paper are listed in the following:
• we have extracted a large dataset of (automatic) visual tags from 7,689 movie trailers, using Deep Learning models capable of annotating movies with a wide range of automatic tags including celebrity tags (e.g., #TomHanks & #BradPitt), object tags (e.g., #sky & #children), and face tags (e.g., #happy & #withGlass); our dataset is public and freely available;
• we have addressed the cold start problem by proposing a novel set of content features, extracted automatically, with no need for any human involvement and used for recommending new items with no rating and no tags;
• we have evaluated the proposed content-based recommendation approach using a large dataset of thousands of movie trailers and compared our results with different baselines, including recommendation based on low-level visual features (e.g., colorfulness, sharpness, and naturalness in movies);
• our results have shown the superiority of recommendation based on our high-level visual tags, used all together or individually (e.g., using only celebrity tags), in comparison to the manual tags and low-level visual features.
In this section, we briefly review two related research areas, i.e., (a) tag-based recommender systems and (b) visually-aware recommender systems. Several prior works have incorporated (human-annotated) tags into the recommendation process []. An example is [], where integrating tag-based similarities within a Collaborative Filtering system has yielded an improvement in the recommendation. Another example is [], where user tags and item descriptions have been incorporated in the recommendation process. A further example is [], where the authors have proposed a modified version of the SVD++ Matrix Factorization model [] by replacing the usage of implicit feedback with tagging information. This has resulted in a substantial improvement of the recommendation performance. In [] and [], the matrix factorization model has been extended by incorporating latent factors associated with the item features.
Recent works have proposed using different forms of visual features for recommendation that can be grouped into two classes, i.e., (i) low-level features (typically based on hand-crafted approaches) and (ii) high-level features (typically based on deep learning approaches) []. The usage of low-level visual features has drawn minor attention in recommender systems (e.g., in []), although it has been extensively investigated in other fields such as computer vision []. [] provided comprehensive surveys on the state-of-the-art techniques related to video content analysis and discussed several low-level features (e.g., visual, textual, or auditory) that can be used for various applications, including classification or recommendation. An example of the works using such features is [], where a framework for movie genre classification based only on visual features has been proposed. [] proposed a deep learning approach to automatically detect the director of a movie based on low-level visual features. It is worth noting that, while hand-crafted features [] may still offer promising performance, deep learning-based approaches have recently achieved superior accuracy in comparison to them []. Convolutional Neural Networks (CNNs) are an example of effective deep learning approaches that can build an informative representation of items [32].
This work differs from the prior works as it proposes high-level (automatic) visual tags instead of pure low-level visual features. One of the main differences is that we have focused on the task of video recommendation, while many prior works focused on video annotation or labeling (e.g., []). Another difference is that our work can be used when generating explanations for the recommendations, due to the high-level nature of the proposed visual tags, which makes them human-understandable compared to the low-level features. Finally, we have used a large-scale dataset with thousands of items for our evaluation, compared to some of the prior works which used small-scale datasets (e.g., [] considering only a few hundred items).
This section explains how our two datasets are generated: one contains the low-level features (e.g., colorfulness, sharpness, and saturation) and the other contains the high-level features (i.e., celebrity tags, object & label tags, and face tags). First of all, we used a large dataset of movie trailers, obtained by querying YouTube based on the movie titles in the MovieLens dataset []. Prior works have shown a high similarity between the visual features extracted from movie trailers and those of their respective full-length movies []. After an initial preprocessing, we have extracted visual features from 7,689 movie trailers, conducting the following steps: Movie Segmentation, Feature Extraction, and Feature Aggregation.
3.1 Movie Segmentation
In order to segment movies into shots, i.e., sequences of consecutive frames recorded without camera interference, we used a method based on Color Histogram Distance []. This choice is motivated by the fact that the transition between two shots of a video is typically very abrupt, and hence the color histogram differences among the movie frames can be indicative of it. Finally, for every shot, the middle frame is selected as the key-frame.
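The segmentation step can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the frame representation (a flat sequence of RGB pixels), the four bins per channel, the L1 distance, and the 0.5 threshold are all assumptions made for readability. The text only states that a color histogram distance is used and that the middle frame of each shot becomes the key-frame.

```python
from typing import List, Sequence, Tuple

# Assumed frame representation: a flat sequence of (R, G, B) pixels, 0..255.
Frame = Sequence[Tuple[int, int, int]]


def color_histogram(frame: Frame, bins: int = 4) -> List[float]:
    """Normalized joint RGB histogram with `bins` levels per channel."""
    hist = [0.0] * (bins ** 3)
    step = 256 / bins
    for r, g, b in frame:
        # Clamp so that the value 255 always falls into the last bin.
        ri = min(int(r / step), bins - 1)
        gi = min(int(g / step), bins - 1)
        bi = min(int(b / step), bins - 1)
        hist[ri * bins * bins + gi * bins + bi] += 1.0
    total = float(len(frame)) or 1.0
    return [h / total for h in hist]


def histogram_distance(h1: List[float], h2: List[float]) -> float:
    """L1 distance between two normalized histograms (between 0 and 2)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def segment_shots(frames: List[Frame], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Split frames into shots; returns (start, end) indices, end inclusive."""
    if not frames:
        return []
    hists = [color_histogram(f) for f in frames]
    shots = []
    start = 0
    for i in range(1, len(frames)):
        # An abrupt change in color distribution marks a shot boundary.
        if histogram_distance(hists[i - 1], hists[i]) > threshold:
            shots.append((start, i - 1))
            start = i
    shots.append((start, len(frames) - 1))
    return shots


def key_frames(shots: List[Tuple[int, int]]) -> List[int]:
    """Index of the middle frame of every shot."""
    return [(start + end) // 2 for start, end in shots]
```

With three uniformly red frames followed by two uniformly blue ones, `segment_shots` yields two shots and `key_frames` picks one frame index from each. In practice the threshold would need tuning per video source.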
3.2 Feature Extraction
Low-Level Features: We have extracted a set of low-level visual features capable of effectively capturing the attractiveness of each key frame within the movies. A prior work [] showed that these features can be well indicative of how attractive Flickr images are. Table 1 (top half) summarizes the full set of extracted low-level features.
High-Level Features: We have extracted another dataset containing a novel set of high-level features in the form of visual tags (labels). The main advantage of these novel features over the low-level features is that high-level features are human-understandable and hence meaningful to the users. This enables them to be exploited for various purposes, e.g., generating explanations of recommendations for users or automatically creating a brief summary of the movies. It is worth noting that, to the best of our knowledge, this is the first time that a large movie dataset with (i) a collection of powerful content descriptors consisting of both high-level visual tags & low-level visual features, being (ii) directly linked to millions of user ratings and tags, is published and accessible for the community. For creating this dataset, we initially considered exploiting Deep Learning approaches and frameworks such as ImageAI and MTCNN. However, we encountered a number of challenges that needed to be tackled. The main challenge concerned the low quality of the movie trailers (and hence their corresponding key frames) we obtained for some of the old movies. As a consequence, the extracted visual tags were of lower quality. Hence, we checked alternative enterprise services and found them to be more robust compared to the above-mentioned open-source approaches. We therefore decided to opt for a paid cloud-based service offered by Amazon Web Services (AWS). The service is called Rekognition, which is a Software as a Service (SaaS) computer vision platform capable of extracting a large number of visual tags, as well as their corresponding confidence scores in the range of 0%-100%. Table 1 (bottom half) shows the extracted high-level visual tags for each movie.
Celebrity Tags: Rekognition can recognize thousands of celebrity individuals who are famous, noteworthy, or prominent in their field.
Object Tags: Rekognition can detect a wide range of labels within the movies such as vehicles, pets, natural objects, office equipment, buildings, etc.
Face Tags (Facial attributes): Rekognition is able to locate faces within images and analyze face attributes, such as whether or not the face is smiling or the eyes are open. It can also detect emotions, namely, ‘happy’, ‘sad’, ‘angry’, ‘confused’, ‘disgusted’, ‘surprised’, ‘calm’, ‘fear’, and ‘unknown’.
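To make the three tag types concrete, the sketch below turns per-key-frame responses into (tag, confidence) pairs. It is illustrative only. The dictionary keys (`CelebrityFaces`, `MatchConfidence`, `Labels`, `FaceDetails`, `Smile`, `Emotions`) follow the response layout of the Rekognition celebrity, label, and face detection operations as exposed through boto3; the helper names, the 50% threshold, and the lowercase tag naming are assumptions, and no AWS call is made here.

```python
from typing import Dict, List, Tuple

Tag = Tuple[str, float]  # (tag text, confidence in percent)


def celebrity_tags(response: Dict, min_conf: float = 50.0) -> List[Tag]:
    """Names of recognized celebrities above the confidence threshold."""
    return [
        (c["Name"], c["MatchConfidence"])
        for c in response.get("CelebrityFaces", [])
        if c["MatchConfidence"] >= min_conf
    ]


def object_tags(response: Dict, min_conf: float = 50.0) -> List[Tag]:
    """Detected labels (objects, scenes) above the confidence threshold."""
    return [
        (label["Name"], label["Confidence"])
        for label in response.get("Labels", [])
        if label["Confidence"] >= min_conf
    ]


def face_tags(response: Dict, min_conf: float = 50.0) -> List[Tag]:
    """Facial-attribute tags: smile flag plus confident emotions."""
    tags: List[Tag] = []
    for face in response.get("FaceDetails", []):
        smile = face.get("Smile")
        if smile and smile["Value"] and smile["Confidence"] >= min_conf:
            tags.append(("smile", smile["Confidence"]))
        for emotion in face.get("Emotions", []):
            if emotion["Confidence"] >= min_conf:
                tags.append((emotion["Type"].lower(), emotion["Confidence"]))
    return tags
```

Running these helpers over every key-frame of a trailer and concatenating the results would give the per-movie tag lists that the aggregation step consumes.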
3.3 Feature Aggregation
To form the feature vector description of a movie, we used a combination of the term frequency–inverse document frequency (tf-idf) method and Word2Vec vectors [] trained on GoogleNews, as follows. First, we collected all the celebrities detected within the set of all frames of each movie. Considering each movie as a document and each label as a word, we calculated the tf-idf score of each word. In addition, we computed the vector representation of each word using the Word2Vec network. This is a real-valued vector of length 300. In order to make a single vector for each movie, we calculated the weighted average of the vector representations of all the celebrity tags that appeared in the movie, with the tf-idf values as their weights.
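The aggregation described here, a tf-idf weighted average of tag embeddings, can be sketched as follows. This is a hedged illustration rather than the authors' code: the smoothed idf variant is an assumption (the text does not say which tf-idf formulation is used), and the tiny two-dimensional `embeddings` dictionary in the usage note stands in for the 300-dimensional GoogleNews Word2Vec vectors.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_scores(movies: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """tf-idf weight of every tag in every movie (movie = document, tag = word)."""
    n_movies = len(movies)
    doc_freq: Counter = Counter()
    for tags in movies.values():
        doc_freq.update(set(tags))
    scores: Dict[str, Dict[str, float]] = {}
    for movie, tags in movies.items():
        counts = Counter(tags)
        total = len(tags)
        scores[movie] = {
            # Smoothed idf (an assumption) keeps every weight strictly positive.
            tag: (count / total) * (math.log((1 + n_movies) / (1 + doc_freq[tag])) + 1.0)
            for tag, count in counts.items()
        }
    return scores


def movie_vector(weights: Dict[str, float],
                 embeddings: Dict[str, List[float]],
                 dim: int) -> List[float]:
    """Weighted average of tag embeddings, using tf-idf values as weights."""
    acc = [0.0] * dim
    weight_sum = 0.0
    for tag, w in weights.items():
        vec = embeddings.get(tag)
        if vec is None:
            continue  # tag has no embedding; skip it
        for k in range(dim):
            acc[k] += w * vec[k]
        weight_sum += w
    if weight_sum == 0.0:
        return acc  # no usable tags: all-zero vector
    return [a / weight_sum for a in acc]
```

For instance, with `movies = {"m1": ["a", "a", "b"], "m2": ["b"]}` and `embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0]}`, the vector for `m1` leans toward tag `a`, which is both more frequent in `m1` and rarer across the catalog.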
3.4 Recommendation algorithm
We adopted a classical “K-Nearest Neighbor” content-based algorithm. Given a set of users $U$ and a catalog of items $I$, a set of preference scores $r_{ui}$ given by user $u$ to item $i$ has been collected. Each item $i$ is associated with its feature vector $f_i$. For each pair of items $i$ and $j$, the similarity score $s_{ij}$ is computed using cosine similarity and utilized for rating prediction:
$$s_{ij} = \frac{f_i^{\top} f_j}{\lVert f_i \rVert \, \lVert f_j \rVert}, \qquad \hat{r}_{ui} = \frac{\sum_{j \in NN_i,\; r_{uj} > 0} r_{uj}\, s_{ij}}{\sum_{j \in NN_i,\; r_{uj} > 0} s_{ij}}$$
where $NN_i$ is the set of nearest neighbors of item $i$.
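A minimal sketch of this item-based nearest-neighbor prediction is given below, assuming that the predicted rating is the similarity-weighted average of the user's ratings over the rated nearest neighbors of the target item. The function names, the dictionary-based data layout, and the default neighborhood size `k` are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_rating(user_ratings: Dict[str, float],
                   target: str,
                   features: Dict[str, List[float]],
                   k: int = 10) -> float:
    """Predict the user's rating for `target` from its k most similar items."""
    sims = {
        item: cosine(features[target], vec)
        for item, vec in features.items()
        if item != target
    }
    # NN_i: the k items most similar to the target item.
    neighbors = sorted(sims, key=sims.get, reverse=True)[:k]
    num = 0.0
    den = 0.0
    for j in neighbors:
        r_uj = user_ratings.get(j, 0.0)
        if r_uj > 0:  # only neighbors the user has actually rated
            num += r_uj * sims[j]
            den += sims[j]
    return num / den if den > 0 else 0.0
```

Because only content features of the target item are needed, the prediction works for an item that has no ratings at all, which is the property the cold start scenario relies on.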
Table 1: Characteristics of two datasets extracted from movie trailers
Dataset | Feature | Description | Details
Low-level (Brightness, etc.) | Sharpness | level of details within a frame |
 | Sharpness Variation | standard deviation of all pixel sharpness values |
 | Contrast | relative difference in brightness/color of features |
 | RGB Contrast | contrast extended to the RGB color space |
 | Saturation | colorfulness relative to brightness |
 | Saturation Variation | standard deviation of all pixel saturation values |
 | Brightness | average brightness of a frame |
 | Colorfulness | individual color distance of pixels in a frame |
 | Entropy | amount of information in a video frame |
 | Naturalness | difference between a frame & human perception |
High-level (Celebrity, Object, Face) | celebrity_name | name of detected celebrity |
 | celebrity_url | URL of IMDb page for celebrity (can be empty) |
 | match_confidence | confidence rate [50%,100%] |
 | label_confidence | confidence rate [0%,100%] | #Labels=2,636
 | face_conf | confidence rate [0%,100%] |
 | age_range | age range of detected face |
 | emotion | level of confidence in determination |
 | gender_info | gender value and confidence level of gender detection |
 | eyeglasses/sunglasses | true, false and confidence level of an eyeglasses/sunglasses detection |
 | eyesopen_info | true, false and confidence level of an eye open detection |
 | smile_info | true, false and confidence level of a smile detection |
 | mouthopen_info | true, false and confidence level of a mouth open detection |
 | mustache/beard | true, false and confidence level of mustache/beard detection |
4.1 Methodology
For evaluation, we followed a methodology similar to the one proposed by []. We used a large rating dataset, i.e., MovieLens with 25M ratings, and kept only the users who have rated at least 10 relevant items (i.e., items with ratings equal to or higher than 4). This ensured that each user has a minimum number of favorite items. For each selected user, we chose 2 items with a rating equal to or higher than 4 (forming a favorite set of items). Then we randomly added 500 items not rated by the user to this set. After that, we predicted the ratings for all the 502 movies using the recommender system and ordered them according to the predicted ratings. For each $N$ ($1 \le N \le 502$), the number of hits is the number of favorite movies that appear in the top $N$ movies (e.g., 0, 1, or 2). Assume $T$ is the total number of favorite items in the test set for all selected users ($T =$ … in our case), then:
$$recall@N = \frac{\#hits}{T}, \qquad precision@N = \frac{recall@N}{N}$$
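The evaluation protocol can be sketched as below. It assumes, as in the top-N methodology this section follows, that recall@N is the number of hits divided by the total number of favorite items T and that precision@N is recall@N divided by N; the function names and the toy data in the usage note are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple


def rank_items(predicted: Dict[str, float]) -> List[str]:
    """Order candidate items by predicted rating, highest first."""
    return sorted(predicted, key=predicted.get, reverse=True)


def hits_at_n(ranked: Sequence[str], favorites: Set[str], n: int) -> int:
    """Number of favorite items that appear among the top-n ranked items."""
    return sum(1 for item in ranked[:n] if item in favorites)


def recall_precision_at_n(test_cases: List[Tuple[Sequence[str], Set[str]]],
                          n: int) -> Tuple[float, float]:
    """Overall recall@n and precision@n over all users' test cases.

    Each test case is (ranked candidate list, set of favorite items);
    in the setup described above a list holds 2 favorites + 500 unrated items.
    """
    total_favorites = sum(len(favorites) for _, favorites in test_cases)  # T
    hits = sum(hits_at_n(ranked, favorites, n) for ranked, favorites in test_cases)
    recall = hits / total_favorites
    precision = recall / n
    return recall, precision
```

As a usage example, two users with ranked lists `["a", "x", "b", "y"]` and `["x", "y", "c", "d"]` and favorites `{"a", "b"}` and `{"c", "d"}` give one hit out of four favorites at N=1 and four out of four at N=4.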
4.2 Visualizing Automatic Tags
For the purpose of visualizing the data, we used a powerful dimensionality reduction method called t-distributed Stochastic Neighbor Embedding (t-SNE) []. The result is plotted in Figure 1. Please note that every point in this figure represents a tag and the distances are indicative of the visual similarities. Hence, tags can be positioned close to or far from each other, depending on their visual similarities. As seen in the figure, although the distances are computed based on visual similarities (which do not necessarily translate to pure tag semantics), the tags that are located close by are semantically related. For example, the following tags located in the bottom right side of the figure are semantically related: dark, detective, horror, life & death, murder, and mystery.
4.3 Recommendation in Cold Start
We exploited different forms of low-level visual features and high-level visual tags (see Table 1), extracted automatically from videos, to build a content-based recommender system. We evaluated the system considering the new item cold start scenario. It is worth noting that, in the severe cold start scenario, a video item may have neither any rating nor any manual tag. In such a case, the system can only rely on our proposed (automatic) visual tags, as they require no human annotation. In the moderate cold start scenario, a limited number of users may have added a few manual tags, and the recommender system can generate personalized recommendations based on them.
We have evaluated the performance of our proposed recommender system using (automatic) visual tags in terms of precision@N and recall@N []. Although we have also computed F1@N scores, we do not report them due to the space limit. Moreover, as the main baseline, we have considered the recommendation based on tags, since prior studies have shown the superior performance of tags in comparison to the other types of content features (e.g., genre) [13].
Figure 1: Analyzing user-annotated tags, based on visual features within the videos, by applying t-SNE technique.
Figure 2 presents the results in terms of precision@N (left sub-figure) and recall@N (right sub-figure). As can be seen, by far the best result has been achieved by the recommendation based on (automatic) celebrity tags, annotated by a Deep Learning model, for the whole range of recommendation sizes (1 ≤ N ≤ 20). The precision values started at 0.013 for precision@1 and reached 0.004 for precision@20. The second best performance has been observed for the recommendation based on the automatic visual tags (i.e., the combination of Celebrity and Object tags). Recommendation based on manual (human-annotated) tags has shown the third best performance among all features up to N=5. However, when N got larger than 5, low-level visual features (e.g., colorfulness, sharpness, naturalness, etc.) outperformed the manual tags.
Similar results have been observed for the recall metric. Again, the recommendation based on (automatic) celebrity tags has shown substantially better performance by achieving the highest recall values for all recommendation sizes (N). The recall values for this method began at 0.013 for recall@1 and reached 0.084 for recall@20. The next best performance is observed for automatic visual tags, which achieved 0.009 for recall@1 and 0.048 for recall@20. Recommendation based on manual tags has not been very different from visual tags, with values of 0.007 for recall@1 and 0.037 for recall@20. For both precision@N and recall@N, the worst performance has been achieved by (automatic) object and (automatic) facial tags. This may be due to our particular aggregation methodology and could potentially be improved by using a novel feature fusion technique. Despite the observed poor performance of these types of automatic features, they can still serve as a potential solution for the cold start scenario where no tag and no rating is available for a new item.
In this paper, we address the so-called cold start challenge in recommender systems and propose a technique to generate recommendations based on visual tags. These are novel features that describe the video content and can be automatically annotated and used when a new item has not received any rating or any user tag. In such a severe case, any form of complicated recommender algorithm may fail to generate relevant recommendations. We have also performed experiments assuming that the users have manually annotated a number of tags. The results revealed a superior quality of recommendations based on visual tags compared to the manual tags. These results are promising as they demonstrate the potential power of visual tags in dealing with severe cases of cold start.
Our future work plan includes implementing a new component that can analyze facial expressions of users and collect user preferences from such a novel form of data []. In addition, we plan to extend our feature set by including audio features collected in a recent work []. This will enable our proposed technique to generate recommendations based on a novel set of audio-visual features.
Figure 2: Comparing recommendation based on different features, in terms of (left) precision@N and (right) recall@N.
This work was supported by industry partners and the Research
Council of Norway with funding to MediaFutures: Research Centre
for Responsible Media Technology and Innovation, through The
Centres for Research-based Innovation scheme, project number
Benjamin Adrian, Leo Sauermann, and Thomas Roth-Berghofer. 2007. Contag: A semantic tag recommendation system. Proceedings of I-Semantics 7 (2007).
Syed M Ali, Gopal K Nayak, Rakesh K Lenka, and Rabindra K Barik. 2018. Movie recommendation system using genome tags and content-based filtering. In Advances in Data and Information Sciences. Springer, 85–94.
Fahad Anwar, Naima Iltaf, Hammad Afzal, and Haider Abbas. 2019. A Deep Learning Framework to Predict Rating for Cold Start Item Using Item Metadata. In 2019 IEEE 28th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE). IEEE, 313–319.
Edoardo Ardizzone, Marco La Cascia, and Davide Molinelli. 1996. Motion and color-based video indexing and retrieval. In Proceedings of 13th International Conference on Pattern Recognition, Vol. 3. IEEE, 135–139.
Aparna Bharati, Richa Singh, Mayank Vatsa, and Kevin W Bowyer. 2016. Detecting facial retouching using supervised deep learning. IEEE Transactions on Information Forensics and Security 11, 9 (2016), 1903–1913.
Toine Bogers. 2018. Tag-based recommendation. In Social Information Access. Springer, 441–479.
D. Brezeale and D. J. Cook. 2008. Automatic Video Classification: A Survey of the Literature. Trans. Sys. Man Cyber Part C 38, 3 (May 2008), 416–430.
Qiang Chen, Junshi Huang, Rogerio Feris, Lisa M Brown, Jian Dong, and Shuicheng Yan. 2015. Deep domain adaptation for describing people based on fine-grained clothing attributes. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5315–5324.
Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems. 39–46.
James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, Taylor Van Vleet, Ullas Gargi, Sujoy Gupta, Yu He, Mike Lambert, Blake Livingston, et al. 2010. The YouTube video recommendation system. In Proceedings of the fourth ACM conference on Recommender systems. 293–296.
Yashar Deldjoo, Mehdi Elahi, Paolo Cremonesi, Franca Garzotto, Pietro Piazzolla, and Massimo Quadrana. 2016. Content-Based Video Recommendation System Based on Stylistic Visual Features. Journal on Data Semantics (2016), 1–15.
Yashar Deldjoo, Mehdi Elahi, Paolo Cremonesi, Farshad Bakhshandegan Moghaddam, and Andrea Luigi Edoardo Caielli. 2016. How to combine visual features with tags to improve movie recommendation accuracy?. In International conference on electronic commerce and web technologies. Springer, 34–45.
Yashar Deldjoo, Mehdi Elahi, Massimo Quadrana, and Paolo Cremonesi. 2018. Using visual features based on MPEG-7 and deep learning for movie recommendation. International journal of multimedia information retrieval 7, 4 (2018), 207–219.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition. IEEE, 248–255.
Mehdi Elahi, Matthias Braunhofer, Tural Gurbanov, and Francesco Ricci. 2018. User Preference Elicitation, Rating Sparsity and Cold Start.
Mehdi Elahi, Yashar Deldjoo, Farshad Bakhshandegan Moghaddam, Leonardo Cella, Stefano Cereda, and Paolo Cremonesi. 2017. Exploring the semantic gap for movie recommendations. In Proceedings of the Eleventh ACM Conference on Recommender Systems. 326–330.
Manuel Enrich, Matthias Braunhofer, and Francesco Ricci. 2013. Cold-Start Management with Cross-Domain Collaborative Filtering and Tags. In Proceedings of the 13th International Conference on E-Commerce and Web Technologies. Springer, 101–112.
Ignacio Fernández-Tobías and Iván Cantador. 2014. Exploiting Social Tags in Matrix Factorization Models for Cross-domain Collaborative Filtering. In Proceedings of the 1st Workshop on New Trends in Content-based Recommender Systems, Foster City, California, USA. 34–41.
Zeno Gantner, Lucas Drumond, Christoph Freudenthaler, Steffen Rendle, and Lars Schmidt-Thieme. 2010. Learning attribute-to-feature mappings for cold-start recommendations. In 2010 IEEE International Conference on Data Mining. IEEE.
Mouzhi Ge, Mehdi Elahi, Ignacio Fernández-Tobías, Francesco Ricci, and David Massimo. 2015. Using tags and latent factors in a food recommender system. In Proceedings of the 5th International Conference on Digital Health 2015. 105–112.
Fatih Gedikli and Dietmar Jannach. 2013. Improving recommendation accuracy based on item-specific tag preferences. ACM Transactions on Intelligent Systems and Technology (TIST) 4, 1 (2013), 1–19.
F Maxwell Harper and Joseph A Konstan. 2015. The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst. 5, 4, Article 19 (Dec. 2015), 19 pages.
Naieme Hazrati and Mehdi Elahi. 2020. Addressing the New Item problem in video recommender systems by incorporation of visual features with restricted Boltzmann machines. Expert Systems (2020), e12645.
Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th international conference on world wide web. 507–517.
Ruining He and Julian McAuley. 2016. VBPR: visual bayesian personalized ranking from implicit feedback. In Thirtieth AAAI Conference on Artificial Intelligence.
Weiming Hu, Nianhua Xie, Li Li, Xianglin Zeng, and Stephen Maybank. 2011. A Survey on Visual Content-Based Video Indexing and Retrieval. Trans. Sys. Man Cyber Part C 41, 6 (Nov. 2011), 797–819.
Shatha Jaradat. 2017. Deep cross-domain fashion recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems. 407–410.
Yehuda Koren and Robert Bell. 2011. Advances in Collaborative Filtering. Springer US, Boston, MA, 145–186.
Huizhi Liang, Yue Xu, Yuefeng Li, and Richi Nayak. 2009. Tag Based Collaborative Filtering for Recommender Systems. In Rough Sets and Knowledge Technology, 4th International Conference, RSKT 2009, Gold Coast, Australia, July 14-16, 2009. Proceedings. 666–673.
Blerina Lika, Kostas Kolomvatsos, and Stathes Hadjiefthymiades. 2014. Facing the cold start problem in recommender systems. Expert Systems with Applications 41, 4 (2014), 2065–2073.
Si Liu, Zheng Song, Guangcan Liu, Changsheng Xu, Hanqing Lu, and Shuicheng Yan. 2012. Street-to-shop: Cross-scenario clothing retrieval via parts alignment and auxiliary set. In 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 3330–3337.
Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. 2016. DeepFashion:
Powering robust clothes recognition and retrieval with rich annotations.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Pasquale Lops, Dietmar Jannach, Cataldo Musto, Toine Bogers, and Marijn Koolen.
2019. Trends in content-based recommendation. User Modeling and User-Adapted
Interaction 29, 2 (2019), 239–249.
Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE.
Journal of Machine Learning Research 9, Nov (2008), 2579–2605.
Marcelo Garcia Manzato. 2013. GSVD++: Supporting Implicit Feedback on Rec-
ommender Systems with Metadata Awareness (SAC ’13). Association for Comput-
ing Machinery, New York, NY, USA, 908–913.
David Massimo, Mehdi Elahi, Mouzhi Ge, and Francesco Ricci. 2017. Item con-
tents good, user tags better: Empirical evaluation of a food recommender system.
In Proceedings of the 25th Conference on User Modeling, Adaptation and Personal-
ization. 373–374.
Pablo Messina, Vicente Dominguez, Denis Parra, Christoph Trattner, and Alvaro
Soto. 2019. Content-based artwork recommendation: integrating painting meta-
data with neural and manually-engineered visual features. User Modeling and
User-Adapted Interaction 29, 2 (2019), 251–290.
Pablo Messina, Vicente Dominguez, Denis Parra, Christoph Trattner, and Alvaro
Soto. 2018. Exploring Content-based Artwork Recommendation with Metadata
and Visual Features. User Modeling and User-Adapted Interaction (UMUAI) 29, 2
(July 2018), 251–290. 018-9206-9
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient
Estimation of Word Representations in Vector Space. In 1st International Con-
ference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May
2-4, 2013, Workshop Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.).
Farshad Bakhshandegan Moghaddam and Mehdi Elahi. 2019. Cold start solutions
for recommendation systems. Big Data Recommender Systems, Recent Trends and
Advances. IET (2019).
H. R. Naphide and Thomas Huang. 2001. A probabilistic framework for semantic
video indexing, ltering, and retrieval. IEEE Transactions on Multimedia 3, 1
(March 2001), 141–151.
Abhishek A Patwardhan, Santanu Das, Sakshi Varshney, Maunendra Sankar
Desarkar, and Debi Prosad Dogra. 2019. ViTag: Automatic video tagging using
segmentation and conceptual inference. In 2019 IEEE Fifth International Conference
on Multimedia Big Data (BigMM). IEEE, 271–276.
Michael J Pazzani and Daniel Billsus. 2007. Content-based recommendation
systems. In The adaptive web. Springer, 325–341.
Zeeshan Rasheed, Yaser Sheikh, and Mubarak Shah. 2005. On the Use of Computable
Features for Film Classification. IEEE Trans. Cir. and Sys. for Video Technol.
15, 1 (Jan. 2005), 52–64.
Mohammad Hossein Rimaz, Mehdi Elahi, Farshad Bakhshandegan Moghaddam,
Christoph Trattner, Reza Hosseini, and Marko Tkalčič. 2019. Exploring the Power
of Visual Features for the Recommendation of Movies (UMAP ’19). Association
for Computing Machinery, New York, NY, USA, 303–308.
Mohammad H Rimaz, Reza Hosseini, Mehdi Elahi, and Farshad Bakhshande-
gan Moghaddam. [n.d.]. AudioLens: Audio-Aware Video Recommendation for
Mitigating New Item Problem. ([n. d.]).
Jose San Pedro and Stefan Siersdorfer. 2009. Ranking and Classifying Attractive-
ness of Photos in Folksonomies (WWW ’09). Association for Computing Machin-
ery, New York, NY, USA, 771–780.
Markus Schedl, Hamed Zamani, Ching-Wei Chen, Yashar Deldjoo, and Mehdi
Elahi. 2018. Current challenges and visions in music recommender systems
research. International Journal of Multimedia Information Retrieval 7, 2 (2018),
Xindi Shang, Donglin Di, Junbin Xiao, Yu Cao, Xun Yang, and Tat-Seng Chua.
2019. Annotating objects and relations in user-generated videos. In Proceedings
of the 2019 on International Conference on Multimedia Retrieval. 279–287.
Jiangbo Shu, Xiaoxuan Shen, Hai Liu, Baolin Yi, and Zhaoli Zhang. 2018. A
content-based recommendation algorithm for learning resources. Multimedia
Systems 24, 2 (2018), 163–173.
Cees G.M. Snoek and Marcel Worring. 2005. Multimodal Video Indexing: A
Review of the State-of-the-art. Multimedia Tools and Applications 25, 1 (01 Jan
2005), 5–35.
Hridya Sobhanam and AK Mariappan. 2013. Addressing cold start problem in
recommender systems using association rules and clustering technique. In 2013
International Conference on Computer Communication and Informatics. IEEE, 1–5.
Michele Svanera, Mattia Savardi, Alberto Signoroni, András Bálint Kovács, and
Sergio Benini. 2018. Who is the director of this movie? Automatic style recog-
nition based on shot features. CoRR abs/1807.09560 (2018). arXiv:1807.09560
Nava Tintarev and Judith Masthoff. 2011. Designing and evaluating explanations
for recommender systems. In Recommender systems handbook. Springer, 479–510.
Marko Tkalčič, Nima Maleki, Matevž Pesek, Mehdi Elahi, Francesco Ricci, and
Matija Marolt. 2017. A Research Tool for User Preferences Elicitation with
Facial Expressions (RecSys ’17). ACM, New York, NY, USA, 353–354. https:
Marko Tkalčič, Nima Maleki, Matevž Pesek, Mehdi Elahi, Francesco Ricci, and
Matija Marolt. 2019. Prediction of Music Pairwise Preferences from Facial Ex-
pressions (IUI ’19). ACM, New York, NY, USA, 150–159.
Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. 2018. Unsupervised
feature learning via non-parametric instance discrimination. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition. 3733–3742.
Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep learning based rec-
ommender system: A survey and new perspectives. ACM Computing Surveys
(CSUR) 52, 1 (2019), 1–38.